Extract-Transform-Load (ETL)¶
This brief tutorial describes how to use GeoTrellis’ Extract-Transform-Load (“ETL”) functionality to create a GeoTrellis catalog. We will accomplish this in four steps:
- we will build the ETL assembly from code in the GeoTrellis source tree,
- we will compose JSON configuration files describing the input and output data,
- we will perform the ingest, creating a GeoTrellis catalog, and
- we will exercise the ingested data using a simple project.
It is assumed throughout this tutorial that Spark 2.0.0 or greater is installed, that the GDAL command line tools are installed, and that the GeoTrellis source tree has been locally cloned.
Local ETL¶
Build the ETL Assembly¶
Navigate into the GeoTrellis source tree, build the assembly, and copy it to the `/tmp` directory:
cd geotrellis
./sbt "project spark-etl" assembly
cp spark-etl/target/scala-2.11/geotrellis-spark-etl-assembly-1.0.0.jar /tmp
Although in this tutorial we have chosen to build this assembly directly from the GeoTrellis source tree, in some applications it may be desirable to create a class in one's own code base that uses or derives from `geotrellis.spark.etl.SinglebandIngest` or `geotrellis.spark.etl.MultibandIngest`, and to use that custom class as the entry-point. Please see the Chatta Demo for an example of how to do that.
Compose JSON Configuration Files¶
The configuration files that we create in this section are intended for use with a single multiband GeoTiff image. Three JSON files are required: one describing the input data, one describing the output data, and one describing the backend(s) in which the catalog should be stored. Please see our more detailed ETL documentation for more information about the configuration files.
We will now create three files in the `/tmp/json` directory: `input.json`, `output.json`, and `backend-profiles.json`. (The respective schemas that those files must obey can be found here, here, and here.)
Here is `input.json`:
[{
    "format": "multiband-geotiff",
    "name": "example",
    "cache": "NONE",
    "backend": {
        "type": "hadoop",
        "path": "file:///tmp/rasters"
    }
}]
The value `multiband-geotiff` is associated with the `format` key. That is required if you want to access the data as an RDD of `SpatialKey`-`MultibandTile` pairs. Making that value `geotiff` instead of `multiband-geotiff` would result in `SpatialKey`-`Tile` pairs. The value `example` associated with the key `name` gives the name of the layer(s) that will be created. The `cache` key gives the Spark caching strategy that will be used during the ETL process. Finally, the value associated with the `backend` key specifies where the data should be read from. In this case, the source data are stored in the directory `/tmp/rasters` on the local filesystem and accessed via Hadoop.
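As a sanity check, the structure described above can be verified with a short script. The field names follow the `input.json` example; the validation rules here are illustrative, not the official ETL schema:

```python
import json

# The input configuration from above, embedded as a string for illustration.
INPUT_JSON = """
[{
    "format": "multiband-geotiff",
    "name": "example",
    "cache": "NONE",
    "backend": {
        "type": "hadoop",
        "path": "file:///tmp/rasters"
    }
}]
"""

def check_input_config(text):
    """Parse the config and check the keys discussed above."""
    configs = json.loads(text)
    for cfg in configs:
        # Every input entry needs a format, a layer name, and a backend.
        for key in ("format", "name", "backend"):
            if key not in cfg:
                raise ValueError("missing key: " + key)
        if cfg["format"] not in ("geotiff", "multiband-geotiff"):
            raise ValueError("unexpected format: " + cfg["format"])
        if "type" not in cfg["backend"] or "path" not in cfg["backend"]:
            raise ValueError("backend needs 'type' and 'path'")
    return configs

configs = check_input_config(INPUT_JSON)
print(configs[0]["name"])  # example
```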
Here is the `output.json` file:
{
    "backend": {
        "type": "hadoop",
        "path": "file:///tmp/catalog/"
    },
    "reprojectMethod": "buffered",
    "pyramid": true,
    "tileSize": 256,
    "keyIndexMethod": {
        "type": "zorder"
    },
    "resampleMethod": "cubic-spline",
    "layoutScheme": "zoomed",
    "crs": "EPSG:3857"
}
That file says that the catalog should be created on the local filesystem in the directory `/tmp/catalog` using Hadoop. The source data are pyramided so that layers of zoom levels 0 through 12 are created in the catalog. The tiles are 256-by-256 pixels in size and are indexed according to Z-order. Bicubic resampling (cubic spline rather than cubic convolution) is used in the reprojection process, and the CRS associated with the layers is EPSG:3857 (a.k.a. Web Mercator).
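To make the `keyIndexMethod` and `layoutScheme` settings concrete, here is a sketch of the two underlying ideas: a Z-order (Morton) index interleaves the bits of a key's column and row so that spatially nearby tiles get nearby index values, and a zoomed layout scheme doubles the tile grid at each zoom level. This is an illustration of the concepts, not GeoTrellis' actual implementation:

```python
def zorder(col, row, bits=16):
    """Interleave the bits of (col, row) into a single Morton code."""
    code = 0
    for i in range(bits):
        code |= ((col >> i) & 1) << (2 * i)
        code |= ((row >> i) & 1) << (2 * i + 1)
    return code

def tiles_per_side(zoom):
    """The zoomed layout scheme doubles the tile grid at each level."""
    return 2 ** zoom

# Neighboring keys get nearby codes, which keeps them close in storage.
print(zorder(0, 0), zorder(1, 0), zorder(0, 1), zorder(1, 1))  # 0 1 2 3

# At zoom 12 the pyramid's top layer covers the world in a 4096 x 4096 grid.
print(tiles_per_side(12))  # 4096
```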
Here is the `backend-profiles.json` file:
{
    "backend-profiles": []
}
In this case, we did not need to specify anything, since we are using Hadoop for both input and output. It happens that Hadoop only needs to know the path from which it should read or to which it should write, and we provided that information in the `input.json` and `output.json` files. Other backends, such as Cassandra and Accumulo, require information to be provided in the `backend-profiles.json` file.
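For reference, a `backend-profiles.json` for a backend that does need credentials might look roughly like the following. The exact field names depend on the backend and on the ETL schema, so treat this Accumulo entry as an illustrative sketch rather than a copy-paste template:

```json
{
    "backend-profiles": [{
        "name": "accumulo-local",
        "type": "accumulo",
        "zookeepers": "localhost",
        "instance": "accumulo",
        "user": "root",
        "password": "secret"
    }]
}
```

The `name` given here is what an input or output configuration would use to refer to this profile.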
Create the Catalog¶
Before performing the ingest, we will first retile the source raster. This is not strictly required if the source image is small enough (probably less than 2GB), but is still good practice even if it is not required.
mkdir -p /tmp/rasters
gdal_retile.py source.tif -of GTiff -co compress=deflate -ps 256 256 -targetDir /tmp/rasters
The result of this command is a collection of smaller GeoTiff tiles in the directory `/tmp/rasters`.
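A quick way to estimate what `gdal_retile.py` will produce: a W-by-H pixel source cut into 256-by-256 tiles yields ceil(W/256) * ceil(H/256) files. A small sketch (the raster dimensions here are made up):

```python
import math

def retile_count(width, height, tile_size=256):
    """Number of tiles produced when cutting a width x height raster
    into tile_size x tile_size pieces (edge tiles may be smaller)."""
    return math.ceil(width / tile_size) * math.ceil(height / tile_size)

# A hypothetical 10000 x 8000 pixel source image:
print(retile_count(10000, 8000))  # 40 * 32 = 1280
```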
Now, with all of the files that we need in place (`/tmp/geotrellis-spark-etl-assembly-1.0.0.jar`, `/tmp/json/input.json`, `/tmp/json/output.json`, `/tmp/json/backend-profiles.json`, and `/tmp/rasters/*.tif`), we are ready to perform the ingest. That can be done by typing:
rm -rf /tmp/catalog
$SPARK_HOME/bin/spark-submit \
  --class geotrellis.spark.etl.MultibandIngest \
  --master 'local[*]' \
  --driver-memory 16G \
  /tmp/geotrellis-spark-etl-assembly-1.0.0.jar \
  --input "file:///tmp/json/input.json" \
  --output "file:///tmp/json/output.json" \
  --backend-profiles "file:///tmp/json/backend-profiles.json"
After the `spark-submit` command completes, there should be a directory called `/tmp/catalog` which contains the catalog.
Optional: Exercise the Catalog¶
Clone or download this example code (a zipped version of which can be downloaded from here). The example code is a very simple project that shows how to read layers from an HDFS catalog, perform various computations on them, then dump them to disk so that they can be inspected.
Once obtained, the code can be built like this:
cd EtlTutorial
./sbt "project tutorial" assembly
cp tutorial/target/scala-2.11/tutorial-assembly-0.jar /tmp
The code can be run by typing:
mkdir -p /tmp/tif
$SPARK_HOME/bin/spark-submit \
  --class com.azavea.geotrellis.tutorial.EtlExercise \
  --master 'local[*]' \
  --driver-memory 16G \
  /tmp/tutorial-assembly-0.jar /tmp/catalog example 12
In the block above, `/tmp/catalog` is an HDFS URI pointing to the location of the catalog, `example` is the layer name, and `12` is the layer zoom level. After running the code, you should find a number of images in `/tmp/tif` which are GeoTiff renderings of the tiles of the raw layer, as well as of the layer with various transformations applied to it.
GeoDocker ETL¶
The foregoing discussion showed how to ingest data to the local filesystem, albeit via Hadoop. In this section, we will give a basic example of how to use the ETL machinery to ingest into HDFS on GeoDocker. Throughout this section we will assume that the files that were previously created in the local `/tmp` directory (namely `/tmp/geotrellis-spark-etl-assembly-1.0.0.jar`, `/tmp/json/input.json`, `/tmp/json/output.json`, `/tmp/json/backend-profiles.json`, and `/tmp/rasters/*.tif`) still exist.
In addition to the dependencies needed to complete the steps given above, this section assumes that the user has a recent version of `docker-compose` installed and working.
Edit output.json¶
Because we are planning to ingest into HDFS and not to the local filesystem, we must modify the `output.json` file that we used previously. Edit `/tmp/json/output.json` so that it looks like this:
{
    "backend": {
        "type": "hadoop",
        "path": "hdfs://hdfs-name/catalog/"
    },
    "reprojectMethod": "buffered",
    "pyramid": true,
    "tileSize": 256,
    "keyIndexMethod": {
        "type": "zorder"
    },
    "resampleMethod": "cubic-spline",
    "layoutScheme": "zoomed",
    "crs": "EPSG:3857"
}
The only change is the value associated with the `path` key; it now points into HDFS instead of at the local filesystem.
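If you prefer not to hand-edit the file, the same change can be scripted. This sketch simply rewrites the backend path of whatever output configuration it is given; the function name is ours, not part of the ETL tooling:

```python
import json

def set_backend_path(config_text, new_path):
    """Return the output config with backend.path replaced."""
    config = json.loads(config_text)
    config["backend"]["path"] = new_path
    return json.dumps(config, indent=4)

# A trimmed-down output config, for illustration:
original = '{"backend": {"type": "hadoop", "path": "file:///tmp/catalog/"}}'
patched = set_backend_path(original, "hdfs://hdfs-name/catalog/")
print(patched)
```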
Download docker-compose.yml File¶
We must now obtain a `docker-compose.yml` file. Download this file and move it to the `/tmp` directory. The directory location is important, because `docker-compose` will use it to name the network and containers that it creates.
Bring Up GeoDocker¶
With the `docker-compose.yml` file in place, we are now ready to start our GeoDocker instance:
cd /tmp
docker-compose up
After a period of time, the various Hadoop containers should be up and working.
Perform the Ingest¶
In a different terminal, we will now start another container:
docker run -it --rm --net=tmp_default -v $SPARK_HOME:/spark:ro -v /tmp:/tmp openjdk:8-jdk bash
Notice that the network name was derived from the name of the directory in which the `docker-compose up` command was run. The `--net=tmp_default` switch connects the just-started container to the bridge network that the GeoDocker cluster is running on. The `-v $SPARK_HOME:/spark:ro` switch mounts our local Spark installation at `/spark` within the container so that we can use it. The `-v /tmp:/tmp` switch mounts our host `/tmp` directory into the container so that we can use the data and jar files that are there.
Within the just-started container, we can now perform the ingest:
/spark/bin/spark-submit \
  --class geotrellis.spark.etl.MultibandIngest \
  --master 'local[*]' \
  --driver-memory 16G \
  /tmp/geotrellis-spark-etl-assembly-1.0.0.jar \
  --input "file:///tmp/json/input.json" \
  --output "file:///tmp/json/output.json" \
  --backend-profiles "file:///tmp/json/backend-profiles.json"
The only change versus what we did earlier is the location of the `spark-submit` binary.
Optional: Exercise the Catalog¶
Now, we can exercise the catalog:
rm -f /tmp/tif/*.tif
/spark/bin/spark-submit \
  --class com.azavea.geotrellis.tutorial.EtlExercise \
  --master 'local[*]' \
  --driver-memory 16G \
  /tmp/tutorial-assembly-0.jar /tmp/catalog example 12
The only differences from what we did earlier are the location of the `spark-submit` binary and the URI specifying the location of the catalog.