
2. Development and Configuration


Code base and Repository

ADAMpro builds on Apache Spark 2 and uses a large variety of libraries and packages, e.g. Google Protocol Buffers and grpc. The repository has the following structure:

  • conf folder for configuration files; note that the conf folder is automatically included in the resources
  • grpc the proto file (included from the proto sub-repository)
  • grpcclient general grpc client code for communicating with the grpc server
  • scripts useful scripts for deploying and running ADAMpro
  • src ADAMpro sources
  • web web UI of ADAMpro

Development

Clone Repository

Clone the repository from GitHub. Note that the grpc folder is a Git submodule. Hence, you will have to run

git clone --recursive https://github.com/vitrivr/ADAMpro.git 

to clone the repository including its submodules.

Known Issues

If you get an error when initializing the Spark driver, you may need to modify your /etc/hosts to fix this issue.
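A common cause is that the machine's hostname does not resolve to a reachable address. A minimal sketch of an /etc/hosts entry addressing this (my-machine is a placeholder for the hostname of your system):

  127.0.0.1   localhost
  127.0.1.1   my-machine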

Unit tests

ADAMpro comes with a set of unit tests which can be run from the test package. Note that a certain setup is necessary for all tests to pass. For instance, for the PostGIS tests to pass, the database has to be set up and configured in the configuration file. You may use the script setupLocalUnitTests.sh to set up all the necessary Docker containers before running the unit tests.
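A possible local workflow could look as follows (assuming Docker is available and that setupLocalUnitTests.sh resides in the scripts folder; adapt the paths to your setup):

  # start the Docker containers required by the tests (e.g., PostgreSQL/PostGIS)
  ./scripts/setupLocalUnitTests.sh
  # run the unit tests
  sbt test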

Debugging

We recommend using IntelliJ IDEA for developing ADAMpro. ADAMpro can be run locally using the run commands in the IDE for debugging purposes. For this, remove the "provided" scope and the ExclusionRule("io.netty") from the coreLibs in build.sbt.
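As an illustration only (the dependency list below is a hypothetical sketch, not the actual contents of build.sbt), the change amounts to something along these lines:

  // sketch of a coreLibs entry; for local runs in the IDE,
  // the "provided" scope and the netty exclusion are dropped
  lazy val coreLibs = Seq(
    // "org.apache.spark" %% "spark-core" % sparkVersion % "provided" excludeAll ExclusionRule("io.netty")
    "org.apache.spark" %% "spark-core" % sparkVersion
  )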

Note that the behaviour of ADAMpro when run locally might differ from when it is submitted to Apache Spark (using ./spark-submit), in particular because of the inclusion of different package versions (e.g., Apache Spark comes with a certain version of netty, which is used even if build.sbt includes a newer version; we refrain from using the spark.driver.userClassPathFirst option as it is experimental).

ADAMpro can also be debugged when it is submitted to Spark. By setting the debugging options in SPARK_SUBMIT_OPTS before submitting, a remote debugger can be attached:

export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005

Here, we open port 5005 and, given the suspend=y option, have the application wait until a debugger attaches.
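Putting it together, a debuggable submission could look roughly as follows (the main class and the assembly jar are placeholders; use the ones of your build):

  export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
  ./spark-submit --master local[*] --class <main class> <ADAMpro assembly jar>
  # the driver now waits on port 5005 until a remote debugger attaches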

The Docker containers we provide have SPARK_SUBMIT_OPTS set and use port 5005 for debugging (note, however, that the suspend option, which makes the application wait until a debugger attaches, is turned off in the Docker containers).
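When debugging against the Docker container, make sure the debug port is published to the host, e.g. (the image name and the grpc port mapping are illustrative):

  docker run -p 5890:5890 -p 5005:5005 vitrivr/adampro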

In your IDE, attach to the application by setting up remote debugging on the specified port. For more information on how to use remote debugging, consider e.g. this article: https://community.hortonworks.com/articles/15030/spark-remote-debugging.html

Flame graphs

To check the performance of ADAMpro, also consider creating flame graphs. For more information see here.
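One possible approach (an illustration using the third-party async-profiler tool, which is not part of ADAMpro) is to attach a sampling profiler to the running driver or executor JVM and have it emit a flame graph:

  # attach async-profiler for 60 seconds to the JVM with the given PID
  # and write a flame graph; PID and output path are placeholders
  ./profiler.sh -d 60 -f /tmp/adampro-flamegraph.html <pid>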

Configuration

Configuration files

ADAMpro can be configured using a configuration file. This repository contains a ./conf/ folder with configuration files.

  • application.conf is used when running ADAMpro from an IDE
  • assembly.conf is the conf file included in the assembly jar (when running sbt assembly)

When starting ADAMpro, you can provide an adampro.conf file in the same directory as the jar, which is then used instead of the default configuration. (Note the file adampro.conf.template which is used as a template for the Docker container.)

Configuration parameters

The configuration file can be used to specify configurations for running ADAMpro. The file ADAMConfig.scala reads the configuration file and provides the configurations to the application.

The file contains information on

  • the path to all the internal files (catalog, etc.), e.g., internalsPath = "/adampro/internals"
  • the grpc port, e.g., grpc {port = "5890"}
  • the storage engines to use, e.g., engines = ["parquet", "index", "postgres", "postgis", "cassandra", "solr"]

For each storage engine specified, more details have to be provided in the storage section (note that the name specified in engines must match the name in the storage section):

  parquet {
    engine = "ParquetEngine"
    hadoop = true
    basepath = "hdfs://spark:9000/"
    datapath = "/adampro/data/"
  }

or

  parquet {
    engine = "ParquetEngine"
    hadoop = false
    path = "~/adampro-tmp/data/"
  }

The parameters specified here are passed directly to the storage engines; it may make sense to consult the code of each storage engine to see which parameters need to be specified (or to consider the example configuration files in the configuration folder). The name of the engine class is specified in the field engine.
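Putting the pieces together, a minimal configuration file might look roughly as follows (a sketch based on the snippets above; the top-level structure, paths and port are illustrative and should be adapted to your setup):

  adampro {
    internalsPath = "/adampro/internals"
    grpc {
      port = "5890"
    }
    engines = ["parquet"]
    storage {
      parquet {
        engine = "ParquetEngine"
        hadoop = false
        path = "~/adampro-tmp/data/"
      }
    }
  }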