A few sample MapReduce jobs that can be run on the Microsoft HDInsight Hadoop distribution.
Languages and frameworks used:
- Java
- C#/.NET
- JavaScript
- Hive
- Pig
Input file formats:
- Plain text
- JSON
- CSV
- Folder: iis
- Dataset: iis-log-small.txt
###HadoopIISStatusCodeCount Job
A MapReduce application that counts the number of log entries for each HTTP status code.
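
The counting logic can be illustrated outside Hadoop with a small sketch. This is not the project's Java or .NET code. It assumes the log uses the W3C extended format, where a `#Fields:` directive names the columns and the status code column is called `sc-status`. The function names and sample lines are made up for illustration.

```javascript
// Sketch only: assumes W3C extended log lines with a "#Fields:" directive naming sc-status.

// Map step: turn one log line into a [statusCode, 1] pair, or null for lines to skip.
function mapLine(line, statusIndex) {
  if (line.startsWith("#") || line.trim() === "") return null;
  const status = line.trim().split(/\s+/)[statusIndex];
  return status === undefined ? null : [status, 1];
}

// Reduce step: sum the 1s emitted for each status code.
function reducePairs(pairs) {
  const counts = {};
  for (const [status, one] of pairs) {
    counts[status] = (counts[status] || 0) + one;
  }
  return counts;
}

// Driver: locate the sc-status column, then run map and reduce over all lines.
function countStatusCodes(lines) {
  const header = lines.find((l) => l.startsWith("#Fields:"));
  if (!header) return {};
  const fields = header.slice("#Fields:".length).trim().split(/\s+/);
  const statusIndex = fields.indexOf("sc-status");
  if (statusIndex < 0) return {};
  const pairs = lines
    .map((l) => mapLine(l, statusIndex))
    .filter((p) => p !== null);
  return reducePairs(pairs);
}

// Made-up sample lines, not taken from iis-log-small.txt.
const sampleLog = [
  "#Fields: date time cs-method cs-uri-stem sc-status",
  "2012-05-01 10:00:00 GET /index.html 200",
  "2012-05-01 10:00:01 GET /missing.html 404",
  "2012-05-01 10:00:02 GET /index.html 200",
];
console.log(countStatusCodes(sampleLog));
```

On the cluster, the map and reduce steps run as separate distributed tasks; the sketch only shows what each step computes.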
####How to run (Java version):
- Build the project and export it to a JAR file.
- Load the dataset into HDFS by running the following command in the Hadoop command line:
`hadoop fs -put <path-to-dataset-file> /iis/input/iis-log-small.txt`
- To execute the MapReduce job, run:
`hadoop jar <path-to-JAR-file> HadoopIISStatusCodeCount -m 3 -r 3 /iis/input /iis/output-java`
- To browse the results, run:
`hadoop fs -cat /iis/output-java/part-00000`
####How to run (.NET version):
- Build the project and check that `HadoopIISStatusCodeCount.dll` and the `MRLib` folder exist in the build output folder.
- Make sure that the build output path does not contain spaces.
- Upload the dataset to HDFS (see Java version above).
- Navigate to the `MRLib` folder and execute the MapReduce job via `MRRunner`:
`MRRunner.exe -dll ..\HadoopIISStatusCodeCount.dll -- /iis/input /iis/output-dotnet`
- To browse the results, run:
`hadoop fs -cat /iis/output-dotnet/part-00000`
- Folder: ufo
- Dataset: ufo-sightings.json
###SightingCountByState Job
A MapReduce script that counts the number of reported UFO sightings for each U.S. state.
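
A script of this kind might look roughly like the sketch below. It is not the repository's `SightingCountByState.js`. It assumes the console expects top-level `map(key, value, context)` and `reduce(key, values, context)` functions, that the dataset holds one JSON object per line, and that each record has a `state` field; all three are assumptions. The `runLocally` helper is hypothetical and only stands in for the cluster so the functions can be tried under Node.js.

```javascript
// Sketch only: the "state" field and one-JSON-object-per-line layout are assumptions.

// Map: emit (state, 1) for every record that carries a state value.
var map = function (key, value, context) {
  var record;
  try {
    record = JSON.parse(value);
  } catch (e) {
    return; // skip lines that are not valid JSON
  }
  if (record && record.state) {
    // Upper-case so that "wa" and "WA" are counted together.
    context.write(String(record.state).toUpperCase(), 1);
  }
};

// Reduce: sum the counts emitted for one state.
var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum);
};

// Hypothetical local stand-in for the cluster: groups map output by key,
// then feeds each group to reduce through a hasNext()/next() iterator.
function runLocally(lines) {
  var grouped = {};
  lines.forEach(function (line, i) {
    map(i, line, {
      write: function (k, v) {
        (grouped[k] = grouped[k] || []).push(v);
      },
    });
  });
  var results = {};
  Object.keys(grouped).forEach(function (k) {
    var queue = grouped[k].slice();
    var iterator = {
      hasNext: function () { return queue.length > 0; },
      next: function () { return queue.shift(); },
    };
    reduce(k, iterator, {
      write: function (outKey, v) { results[outKey] = v; },
    });
  });
  return results;
}

// Made-up records for illustration.
console.log(runLocally(['{"state": "wa"}', '{"state": "WA"}', '{"state": "tx"}', "not json"]));
```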
####How to run (JavaScript version)
- Upload the dataset to HDFS:
`hadoop fs -put <path-to-dataset-file> /ufo/input/json-sightings.json`
- Upload the script to HDFS:
`hadoop fs -put <path-to-JS-file> /ufo/jobs/SightingCountByState.js`
- Run the following command in the HDInsight Interactive JavaScript Console:
`runJs("/ufo/jobs/SightingCountByState.js", "/ufo/input", "/ufo/output-js")`
- To browse the results, use the following JavaScript Console command:
`#cat /ufo/output-js/part-r-00000`
###Top10StatesBySightingCount Job
A Pig job defined in HDInsight JavaScript syntax that determines the top 10 U.S. states by the number of UFO reports.
####How to run
- Upload the dataset and JavaScript file from the SightingCountByState job (see above) if they don't exist yet.
- Navigate to the HDInsight Interactive JavaScript Console and run the following command:
`pig.from("/ufo/input").mapReduce("/ufo/jobs/SightingCountByState.js", "state, count: int").orderBy("count DESC").take(10).to("/ufo/output-pig")`
- Browse the results via the JavaScript Console:
`#cat /ufo/output-pig/part-r-00000`
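
The `orderBy("count DESC").take(10)` part of the pipeline above amounts to a descending sort followed by a cut-off. A minimal in-memory sketch of that step, using a hypothetical `topStates` helper and made-up counts:

```javascript
// Sketch only: mirrors orderBy("count DESC").take(n) on an in-memory array.
function topStates(rows, limit) {
  return rows
    .slice() // copy so the caller's array is not reordered
    .sort(function (a, b) { return b.count - a.count; })
    .slice(0, limit);
}

// Made-up counts for illustration, not real results.
var counts = [
  { state: "TX", count: 3 },
  { state: "CA", count: 9 },
  { state: "WA", count: 5 },
];
console.log(topStates(counts, 2));
```

On the cluster the same steps are compiled to Pig and run as distributed jobs; this sketch only shows the ordering and truncation semantics.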
- Folder: dou
- Dataset: dou-2012-may-final.csv
###AverageAgeAndSalaryByPosition Job
A Hive script that uses a SQL-like language to query the average age and monthly salary for various software engineering professions.
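
Conceptually, such a query groups rows by position and averages two columns per group. The sketch below shows that aggregation on in-memory rows. It is not the content of `02-AverageAgeAndSalaryByPosition.hql`, and the field names `position`, `age`, and `salary` are hypothetical, since the table schema is not reproduced here.

```javascript
// Sketch only: field names (position, age, salary) are hypothetical.
function averageByPosition(rows) {
  var totals = {};
  rows.forEach(function (row) {
    // Accumulate running sums and a row count per position.
    var t = totals[row.position] || (totals[row.position] = { age: 0, salary: 0, n: 0 });
    t.age += row.age;
    t.salary += row.salary;
    t.n += 1;
  });
  // Turn the sums into averages, one result row per position.
  return Object.keys(totals).map(function (position) {
    var t = totals[position];
    return { position: position, avgAge: t.age / t.n, avgSalary: t.salary / t.n };
  });
}

// Made-up rows for illustration, not taken from the dataset.
var rows = [
  { position: "Developer", age: 20, salary: 1000 },
  { position: "Developer", age: 30, salary: 2000 },
  { position: "QA", age: 25, salary: 1500 },
];
console.log(averageByPosition(rows));
```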
####How to run:
- Upload the dataset to HDFS:
`hadoop fs -put <path-to-CSV-file> /dou/input/may2012/dou-2012-may-final.csv`
- Navigate to the HDInsight Interactive Hive Console.
- Paste and execute the command from the `01-CreateTable.hql` file to create the Hive table schema.
- Paste and execute the query from `02-AverageAgeAndSalaryByPosition.hql`.
Note: you can also redirect Hive output to a local file or use the Hive Excel Add-in.