2. Development and Configuration
ADAMpro builds on Apache Spark 2 and uses a variety of libraries and packages, e.g., Google Protocol Buffers and gRPC. The repository has the following structure:
- conf: folder for configuration files; note that the conf folder is automatically included in the resources
- grpc: the proto file (included from the proto sub-repository)
- grpcclient: general grpc client code for communicating with the grpc server
- scripts: useful scripts for deploying and running ADAMpro
- src: ADAMpro sources
- web: web UI of ADAMpro
Clone the repository from GitHub. Note that the folder grpc is a sub-module; hence, you have to clone the repository recursively:
git clone --recursive https://github.com/vitrivr/ADAMpro.git
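If you have already cloned the repository without the --recursive flag, the sub-module can still be fetched afterwards:
git submodule update --init --recursive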
If you get an error when initializing the Spark driver, you will need to modify your /etc/hosts to fix this issue.
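As a sketch (myhost is a placeholder for your machine's actual host name), adding a loopback entry of the following form usually resolves the driver's host name lookup:
127.0.0.1   localhost myhost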
ADAMpro comes with a set of unit tests which can be run from the test package. Note that a certain setup is necessary for all tests to pass. For instance, for the PostGIS test to pass, the database has to be set up and configured in the configuration file. You may use the script setupLocalUnitTests.sh to set up all the necessary Docker containers before running the unit tests, as sketched below.
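A possible sequence (assuming the script lies in the scripts folder and sbt is installed) could look as follows:
./scripts/setupLocalUnitTests.sh
sbt test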
We recommend using IntelliJ IDEA for developing ADAMpro. It can be run locally using the run commands in the IDE for debugging purposes. For this, remove the %provided option and the ExclusionRule("io.netty") from the coreLibs in build.sbt (see the sketch below).
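As an illustration only (the dependency names and versions below are placeholders, not the actual content of build.sbt), the change amounts to something like the following:

// hypothetical excerpt from build.sbt
val sparkVersion = "2.1.0"

lazy val coreLibs = Seq(
  // for submitting to a cluster: Spark is "provided" and netty is excluded
  // "org.apache.spark" %% "spark-core" % sparkVersion % "provided" excludeAll ExclusionRule("io.netty"),
  // for running locally from the IDE: drop the "provided" scope and the exclusion
  "org.apache.spark" %% "spark-core" % sparkVersion
)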
Note that the behaviour of ADAMpro when run locally may differ from its behaviour when submitted to Apache Spark (using ./spark-submit), in particular because different package versions may be included: e.g., Apache Spark comes with a certain version of netty, which is used even if build.sbt includes a newer version; we refrain from using the spark.driver.userClassPathFirst option, as it is experimental.
ADAMpro can be debugged even when submitted to Spark. By setting the debugging options in the SPARK_SUBMIT_OPTS environment variable before submitting, a remote debugger can be attached:
export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
Here, we open port 5005 and, given the suspend option, have the application wait until a debugger attaches.
The Docker container we provide has the SPARK_SUBMIT_OPTS options set and uses port 5005 for debugging (note, however, that the suspend option, which makes the application wait until a debugger attaches, is turned off in the Docker container).
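To attach from outside the container, port 5005 has to be published when the container is started, for instance (the image name vitrivr/adampro is an assumption, and further port mappings, e.g. for grpc, are omitted):
docker run -p 5005:5005 vitrivr/adampro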
In your IDE, bind to the application by setting up remote debugging on the port specified. For more information on how to use remote debugging, see e.g. this article: https://community.hortonworks.com/articles/15030/spark-remote-debugging.html
For checking the performance of ADAMpro, also consider the creation of flame graphs. For more information see here.
ADAMpro can be configured using a configuration file. This repository contains a ./conf/
folder with configuration files.
- application.conf: used when running ADAMpro from an IDE
- assembly.conf: the conf file included in the assembly jar (when running sbt assembly)
When starting ADAMpro, you can provide an adampro.conf file in the same path as the jar, which is then used instead of the default configuration. (Note the file adampro.conf.template, which is used as a template for the Docker container.)
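For illustration (the paths are assumptions), one could start from the template and place the resulting file next to the assembly jar:
cp adampro.conf.template /adampro/adampro.conf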
The configuration file specifies the settings for running ADAMpro. The file ADAMConfig.scala reads the configuration file and provides the settings to the application.
The file contains information on
- the path to all the internal files (catalog, etc.), e.g.,
internalsPath = "/adampro/internals"
- the grpc port, e.g.,
grpc {port = "5890"}
- the storage engines to use, e.g.,
engines = ["parquet", "index", "postgres", "postgis", "cassandra", "solr"]
For all the storage engines specified, more details have to be provided in the storage section (note that the name specified in engines must match the name in the storage section):
parquet {
engine = "ParquetEngine"
hadoop = true
basepath = "hdfs://spark:9000/"
datapath = "/adampro/data/"
}
or
parquet {
engine = "ParquetEngine"
hadoop = false
path = "~/adampro-tmp/data/"
}
The parameters specified here are passed directly to the storage engines; it may make sense to consult the code of the individual storage engine to see which parameters are necessary (or to consider the exemplary configuration files in the configuration folder). The name of the engine class is specified in the field engine. A combined configuration is sketched below.
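Putting these pieces together, a minimal adampro.conf could look as follows (a sketch only; the values are illustrative and the exact nesting should be checked against the exemplary files in ./conf/):
internalsPath = "/adampro/internals"
grpc {port = "5890"}
engines = ["parquet"]
storage {
  parquet {
    engine = "ParquetEngine"
    hadoop = false
    path = "~/adampro-tmp/data/"
  }
}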