This is a fork of the original CommonCrawl examples, adapted to be used as a starting point for your entry to the Norvig Web Data Science Award.
We recommend using the virtual machine image as your development environment, as described on the contest website.
See the code for all examples on GitHub.
All examples support the same arguments:
org.commoncrawl.examples.Example*
-in <inputpath>
-out <outputpath>
[ -overwrite ]
[ -numreducers <number_of_reducers> ]
[ -conf <conffile> ]
[ -maxfiles <maxfiles> ]
Where:
-in
Path to your input files. You can use globbing if your Hadoop distribution supports it.
-out
Path where the output files are stored.
-overwrite
If the output path exists, this switch allows the example to overwrite the existing directory.
-numreducers
The maximum number of reducers to run. Defaults to a single reducer.
-conf
Path to an additional configuration file.
-maxfiles
The maximum number of files to process.
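As a rough illustration of how the optional switches combine, the sketch below assembles a single invocation and prints it instead of running it, so it works without a Hadoop installation. The input and output paths and the values 4 and 10 are placeholders chosen for illustration; the jar path assumes you run the command from the commoncrawl-examples directory after building.

```shell
#!/bin/sh
# Jar location relative to the commoncrawl-examples directory (assumed working directory).
JAR=dist/lib/commoncrawl-examples-1.0.1.jar
CLASS=org.commoncrawl.examples.ExampleTextWordCount

# Placeholder paths: replace with your own input and output locations.
INPUT=/path/to/input
OUTPUT=/path/to/output

# Overwrite an existing output directory, allow up to 4 reducers,
# and process at most 10 input files (illustrative values).
CMD="hadoop jar $JAR $CLASS -in $INPUT -out $OUTPUT -overwrite -numreducers 4 -maxfiles 10"

# Dry run: print the assembled command for inspection.
echo "$CMD"
```

Once the printed line looks right, you can paste it into a shell on a machine where Hadoop is available.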
These examples are included:
- org.commoncrawl.examples.ExampleArcMicroformat
  An example showing how to analyze the CommonCrawl ARC web content files.
- org.commoncrawl.examples.ExampleMetadataDomainPageCount
  An example showing how to use the CommonCrawl 'metadata' files to quickly gather high level information about the corpus' content.
- org.commoncrawl.examples.ExampleMetadataStats
  An example showing how to use the CommonCrawl 'metadata' files to quickly gather high level information about the corpus' content.
- org.commoncrawl.examples.ExampleTextWordCount
  An example showing how to use the CommonCrawl 'textData' files to efficiently work with CommonCrawl corpus text content.
In the terminal, you can build and package the examples by moving to the commoncrawl-examples directory (~/git/commoncrawl-examples) and running:
$ ant
Inside Eclipse you can build the project by selecting "Project → Build Project" from the menu bar.
Both methods will create a jar bundle in ~/git/commoncrawl-examples/dist/lib.
To run an example on at most 5 input files, open a shell and run:
$ hadoop jar dist/lib/commoncrawl-examples-1.0.1.jar [EXAMPLECLASS] -in [INPUT] -out [OUTPUT] -maxfiles 5
For org.commoncrawl.examples.ExampleMetadataStats, that would be:
$ hadoop jar dist/lib/commoncrawl-examples-1.0.1.jar org.commoncrawl.examples.ExampleMetadataStats -in [INPUT] -out [OUTPUT] -maxfiles 5
You can use this same command for each included example.
The Eclipse project includes a run configuration for the ExampleTextWordCount example. You can select it from the "Run" menu entry. You can use this run configuration as a template for other configurations.
An example that counts the occurrences of HTTP status codes. You can run the Pig script from the terminal by moving to the examples directory and running:
$ pig example.pig
These examples come with an InputFormat for MapReduce and a Loader for Pig:
The above examples should show you how to load the CommonCrawl ARC files using these classes.