Data Processing Examples

A quickstart for getting up and running with a few Stream and Batch processing frameworks across languages.

Batch Processing with Hadoop

Hadoop is a set of utilities based around batch processing of information. Setting it up on an individual developer machine is easy as long as a suitable JDK is already installed. For more information, look at Apache Hadoop's Single Node Setup instructions.

To get started, grab a copy of the stable Hadoop release from a mirror and decompress it into a local directory, which will become your $HADOOP_HOME.

wget http://apache.tradebit.com/pub/hadoop/common/hadoop-1.1.2/hadoop-1.1.2-bin.tar.gz
tar -zxvf hadoop-1.1.2-bin.tar.gz
mkdir -p data/input
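
The streaming examples below pick up whatever files are in data/input, so seed it with some sample text first (the file name here is just an illustration; any plain-text file will do):

echo "the quick brown fox jumps over the lazy dog" > data/input/sample.txt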

To reduce typing, you can add Hadoop's bin directory to your $PATH.

export PATH=$PATH:./hadoop-1.1.2/bin

Streaming Example

Hadoop Streaming is not for real-time stream processing; it is a way to use any language that can read stdin and write stdout to create Map/Reduce jobs. For more information, check the Streaming Tutorial.

# Example from tutorial
hadoop jar ./hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar -input data/input -output data/output -mapper /bin/cat -reducer /usr/bin/wc
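
If the job finishes successfully, the results end up as part files in the output directory; something like the following will print them (the exact part file names depend on the number of reducers):

cat data/output/part-*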

After every run you will have to remove the output directory; otherwise Hadoop complains that it already exists and stops you from running again.

# rm -R data/output
hadoop jar ./hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar -input data/input -output data/output -mapper 'ruby map.rb' -reducer 'ruby reduce.rb' -file ./scripts/example_01/map.rb -file ./scripts/example_01/reduce.rb
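
The map.rb and reduce.rb referenced above live in scripts/example_01 and are not reproduced here; as a rough sketch of the shape streaming scripts take, a minimal Ruby word-count mapper and reducer could look like this (the repository's actual scripts may differ):

# map.rb - emit "word<TAB>1" for every whitespace-separated token read from stdin
STDIN.each_line do |line|
  line.split.each { |word| puts "#{word}\t1" }
end

# reduce.rb - sum the counts per word; streaming sorts by key before the reducer runs
counts = Hash.new(0)
STDIN.each_line do |line|
  word, count = line.chomp.split("\t")
  counts[word] += count.to_i
end
counts.each { |word, total| puts "#{word}\t#{total}" }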

Pig

That was cool, but for fun experiments Pig can do much of the same work with less code and more quickly. The jobs it creates are quite fast as well. The best example of using Pig can be found on its Wikipedia page.

Just download it, uncompress it, and it runs.

wget http://apache.petsads.us/pig/stable/pig-0.11.1.tar.gz
tar -zxvf pig-0.11.1.tar.gz

To get a shell going, start it in local mode.

./pig-0.11.1/bin/pig -x local

Alternate Hadoop Installation

For an introduction, the self-install works well. When you need to use HDFS and want a better development environment, a good upgrade is the Cloudera Installation.

Storm

Hadoop is fun and great for batch processing, but it falls short with real-time data. You can make jobs smaller and run them in increments, but that isn't what Hadoop and Map/Reduce were designed for. Instead, try out Storm; it is designed for stream processing and is very fast at it.

Spark

The in-between option is a relatively new project from UC Berkeley called Spark. It does a good job at batch processing as well as a great job at stream processing.
