The installation steps below are packaged into two containers called Tube and Spark, which are orchestrated by a docker-compose file called etl-compose. You can skip the directions below and use those tools directly.
- Install Java and remember to configure JAVA_HOME in your ~/.bash_profile or ~/.bashrc.
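For example, on macOS you can resolve JAVA_HOME with /usr/libexec/java_home (a minimal sketch; on Linux, point JAVA_HOME at your JDK install directory instead):
# ~/.bash_profile or ~/.bashrc
export JAVA_HOME=$(/usr/libexec/java_home)
export PATH=$JAVA_HOME/bin:$PATH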
Download the latest version of Hadoop to install it locally. Then configure your local Hadoop as a pseudo-distributed cluster by following these steps:
export HADOOP_HOME=YOUR_LOCAL_HADOOP_FOLDER/<VERSION_NUMBER>/libexec/
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
I installed Hadoop via Homebrew, so my HADOOP_HOME=/usr/local/Cellar/hadoop/3.1.0/libexec/
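After adding the exports above, reload your profile and sanity-check that the Hadoop binaries are on your PATH:
source ~/.bash_profile   # or: source ~/.bashrc
echo $HADOOP_HOME
hadoop version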
You can find all the Hadoop configuration files in $HADOOP_HOME/etc/hadoop. You need to make suitable changes in those configuration files according to your Hadoop infrastructure.
core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/Cellar/hadoop/3.1.0/hdfs/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
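It does no harm to create the hadoop.tmp.dir location up front and confirm it is writable; for the Homebrew path used above that would be:
mkdir -p /usr/local/Cellar/hadoop/3.1.0/hdfs/tmp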
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>127.0.0.1:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>127.0.0.1:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>127.0.0.1:8031</value>
</property>
</configuration>
hadoop-env.sh
In this bash file, change the following lines:
export JAVA_HOME=TO_YOUR_JAVA_HOME
export HADOOP_HOME=TO_YOUR_HADOOP_HOME
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true -Dsun.security.krb5.debug=true -Dsun.security.spnego.debug"
export HADOOP_OS_TYPE=${HADOOP_OS_TYPE:-$(uname -s)}
case ${HADOOP_OS_TYPE} in
Darwin*)
export HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.realm= "
export HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.kdc= "
export HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.conf= "
;;
esac
export HDFS_NAMENODE_OPTS="-Dhadoop.security.logger=INFO,RFAS"
export HDFS_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=INFO,RFAS"
export HDFS_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS"
export HDFS_NAMENODE_USER=YOUR_USER_NODE_NAME
export HDFS_DATANODE_USER=YOUR_USER_NODE_NAME
export HDFS_SECONDARYNAMENODE_USER=YOUR_USER_NODE_NAME
Configure your environment to allow SSH to localhost by appending your public key to the authorized keys list:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Check that you can SSH to localhost without a passphrase by running ssh localhost.
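If the cat command above fails because you have no key pair yet, generate a passphrase-less one first (the rsa key type mirrors the standard Hadoop single-node guide; use whatever your system prefers):
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
chmod 0600 ~/.ssh/authorized_keys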
Follow Hadoop's own directions to finish the setup and start the cluster.
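That last step boils down to roughly the following (a minimal sketch based on the standard single-node guide; run the format command only once, since it wipes existing HDFS metadata):
hdfs namenode -format
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
jps   # should list NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager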
Follow this link to install Spark.
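If you are on macOS with Homebrew like the Hadoop install above, one possible route (an assumption, not the only way to install Spark) is:
brew install apache-spark
export SPARK_HOME=/usr/local/Cellar/apache-spark/<VERSION_NUMBER>/libexec
export PATH=$SPARK_HOME/bin:$PATH
spark-submit --version   # quick check that Spark is on the PATH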
Download the latest Sqoop release here and extract it:
tar -xvf sqoop-1.4.7.tar.gz
export SQOOP_HOME=YOUR_SQOOP_HOME_FOLDER
cd $SQOOP_HOME/conf
mv sqoop-env-template.sh sqoop-env.sh
Edit the following two lines in $SQOOP_HOME/conf/sqoop-env.sh:
export HADOOP_COMMON_HOME=YOUR_HADOOP_HOME
export HADOOP_MAPRED_HOME=YOUR_HADOOP_HOME
Add the Sqoop path to your bash profile:
export SQOOP_HOME=YOUR_SQOOP_HOME_FOLDER
export PATH=$SQOOP_HOME/bin:$PATH
Download the PostgreSQL JDBC driver JAR file to $SQOOP_HOME/lib/.
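At this point you can sanity-check the Sqoop install and the JDBC driver; the host, database, and user below are placeholders for your own setup:
sqoop version
sqoop list-tables --connect jdbc:postgresql://localhost:5432/YOUR_DATABASE --username YOUR_DB_USER -P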
mv local_settings.example.py local_settings.py
Change the configuration keys to suitable values.
Start Hadoop if it is stopped: $HADOOP_HOME/sbin/start-dfs.sh
- Run python run_import.py to import data from a SQL database into HDFS (see the check below to confirm the import worked).
- Run python run_spark.py to load the imported HDFS files into a Spark RDD.
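After run_import.py finishes, a quick way to confirm the data landed in HDFS before running the Spark step (the target directory below is a placeholder; use whatever path your local_settings.py points at):
hdfs dfs -ls /YOUR_TARGET_DIR
hdfs dfs -cat /YOUR_TARGET_DIR/part-m-00000 | head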