<h1>Norvig Web Data Science Award Examples</h1>
<p>This is a fork of the original <a href="https://github.com/commoncrawl/commoncrawl-examples">CommonCrawl examples</a>, adapted to be used
as a starting point for your entry to the <a href="http://norvigaward.github.com">Norvig Web Data Science Award</a>.</p>
<h2>Overview of the examples</h2>
<h3>Example MapReduce code</h3>
<p>See the code for all examples <a href="https://github.com/norvigaward/commoncrawl-examples/tree/master/src/java/org/commoncrawl/examples">on Github</a>.</p>
<p>All examples support the same arguments:</p>
<pre><code>org.commoncrawl.examples.Example*
-in &lt;inputpath&gt;
-out &lt;outputpath&gt;
[ -overwrite ]
[ -numreducers &lt;number_of_reducers&gt; ]
[ -conf &lt;conffile&gt; ]
[ -maxfiles &lt;maxfiles&gt; ]
</code></pre>
<p>Where:</p>
<ul>
<li><code>-in</code> <br />
Point to the path of your input files. You can use globbing if your Hadoop
distribution supports it.</li>
<li><code>-out</code> <br />
Point to the path to store the output files.</li>
<li><code>-overwrite</code> <br />
If the output path exists, this switch allows the example to overwrite the
existing directory.</li>
<li><code>-numreducers</code> <br />
Set the maximum number of reducers to run. Defaults to a single reducer.</li>
<li><code>-conf</code> <br />
Path to additional configuration.</li>
<li><code>-maxfiles</code> <br />
Maximum number of files to process.</li>
</ul>
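<p>As a rough illustration of how these flags map to settings, here is a minimal,
hypothetical argument parser. It is a sketch only; the real examples rely on
Hadoop's own option handling, and the class and field names below are made up:</p>

```java
// Hypothetical sketch of parsing the shared example flags.
// Not the real code: the actual examples use Hadoop's option parsing.
public class ExampleArgs {
    int numReducers = 1;      // defaults to a single reducer
    int maxFiles = -1;        // -1 means "no limit"
    boolean overwrite = false;
    String in, out, conf;

    static ExampleArgs parse(String[] args) {
        ExampleArgs a = new ExampleArgs();
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-in":          a.in = args[++i]; break;
                case "-out":         a.out = args[++i]; break;
                case "-conf":        a.conf = args[++i]; break;
                case "-overwrite":   a.overwrite = true; break;
                case "-numreducers": a.numReducers = Integer.parseInt(args[++i]); break;
                case "-maxfiles":    a.maxFiles = Integer.parseInt(args[++i]); break;
                default: throw new IllegalArgumentException("unknown flag: " + args[i]);
            }
        }
        return a;
    }

    public static void main(String[] argv) {
        ExampleArgs a = parse(new String[]{"-in", "/data/in", "-out", "/data/out",
                                           "-overwrite", "-maxfiles", "5"});
        System.out.println(a.in + " " + a.out + " " + a.overwrite
                           + " " + a.maxFiles + " " + a.numReducers);
    }
}
```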
<p>These examples are included:</p>
<ul>
<li><p><a href="https://github.com/norvigaward/commoncrawl-examples/blob/master/src/java/org/commoncrawl/examples/ExampleArcMicroformat.java">org.commoncrawl.examples.ExampleArcMicroformat</a> <br />
An example showing how to analyze the CommonCrawl ARC web content files.</p></li>
<li><p><a href="https://github.com/norvigaward/commoncrawl-examples/blob/master/src/java/org/commoncrawl/examples/ExampleMetadataDomainPageCount.java">org.commoncrawl.examples.ExampleMetadataDomainPageCount</a> <br />
An example showing how to use the CommonCrawl 'metadata' files to count the
number of pages in the corpus per domain.</p></li>
<li><p><a href="https://github.com/norvigaward/commoncrawl-examples/blob/master/src/java/org/commoncrawl/examples/ExampleMetadataStats.java">org.commoncrawl.examples.ExampleMetadataStats</a> <br />
An example showing how to use the CommonCrawl 'metadata' files to quickly
gather high level information about the corpus' content.</p></li>
<li><p><a href="https://github.com/norvigaward/commoncrawl-examples/blob/master/src/java/org/commoncrawl/examples/ExampleTextWordCount.java">org.commoncrawl.examples.ExampleTextWordCount</a> <br />
An example showing how to use the CommonCrawl 'textData' files to efficiently
work with CommonCrawl corpus text content.</p></li>
</ul>
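<p>The word-count pattern used by the last example can be sketched outside
Hadoop. This hypothetical helper only illustrates the tokenize-and-count logic,
not the actual ExampleTextWordCount code:</p>

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for word-count logic: lowercase the text,
// split on non-letters, and tally each word into a sorted map.
public class WordCountSketch {
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("Common Crawl, common data."));
        // {common=2, crawl=1, data=1}
    }
}
```

In the real MapReduce example the same tally is split across phases: the mapper
emits each word with a count of one, and the reducers sum the counts per word.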
<h3>Build and package the examples</h3>
<p>In a terminal you can build and package the examples by changing to the
commoncrawl-examples directory (~/git/commoncrawl-examples) and running:</p>
<pre><code>$ ant
</code></pre>
<p>Inside Eclipse you can build the project by selecting "Project → Build Project"
from the menu bar. </p>
<p>Both methods will create a jar bundle in ~/git/commoncrawl-examples/dist/lib.</p>
<h3>Running the MapReduce examples</h3>
<p>To run an example on at most 5 input files, open a shell and run:</p>
<pre><code>$ hadoop jar dist/lib/commoncrawl-examples-1.0.1.jar [EXAMPLECLASS] -in [INPUT] -out [OUTPUT] -maxfiles 5
</code></pre>
<p>For <code>org.commoncrawl.examples.ExampleMetadataStats</code> that would be:</p>
<pre><code>$ hadoop jar dist/lib/commoncrawl-examples-1.0.1.jar org.commoncrawl.examples.ExampleMetadataStats -in [INPUT] -out [OUTPUT] -maxfiles 5
</code></pre>
<p>You can use this same command for each included example.</p>
<p>The Eclipse project includes a run configuration for the ExampleTextWordCount
example. You can select it from the "Run" menu entry. You can use this run
configuration as a template for other configurations.</p>
<h3>Example Pig script</h3>
<ul>
<li><a href="https://github.com/norvigaward/commoncrawl-examples/blob/master/example.pig">example.pig</a> </li>
</ul>
<p>An example counting the occurrences of HTTP status codes. You can run the Pig
script from the terminal by changing to the examples directory and running:</p>
<pre><code>$ pig example.pig
</code></pre>
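<p>In spirit, the Pig script groups records by HTTP status code and counts each
group. Assuming the status codes have already been extracted as plain integers,
the same aggregation can be sketched in Java (a hypothetical illustration, not
the script itself):</p>

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical aggregation mirroring Pig's GROUP ... / COUNT(...) pattern:
// tally how often each HTTP status code occurs.
public class StatusCodeCount {
    static Map<Integer, Long> countStatusCodes(int[] codes) {
        Map<Integer, Long> counts = new TreeMap<>();
        for (int code : codes) {
            counts.merge(code, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] codes = {200, 404, 200, 301, 200};
        System.out.println(countStatusCodes(codes)); // {200=3, 301=1, 404=1}
    }
}
```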
<h2>Using the CommonCrawl ARC files in MapReduce and Pig</h2>
<p>These examples come with an InputFormat for MapReduce and a Loader for Pig:</p>
<ul>
<li><a href="https://github.com/norvigaward/commoncrawl-examples/tree/master/src/java/org/commoncrawl/hadoop/mapred">ArcInputFormat, ArcRecordReader, and ArcRecord</a></li>
<li><a href="https://github.com/norvigaward/commoncrawl-examples/blob/master/src/java/org/commoncrawl/pig/ArcLoader.java">ArcLoader</a></li>
</ul>
<p>The examples above show how to load the CommonCrawl ARC files using
these classes.</p>
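<p>For orientation: each ARC record begins with a single space-separated header
line of the form <code>URL IP-address Archive-date Content-type
Archive-length</code>. ArcRecordReader handles all of this (plus the gzip
framing) for you, but a simplified, hypothetical sketch of parsing such a
header line looks like:</p>

```java
// Simplified sketch of parsing an ARC v1 record header line:
// "URL IP-address Archive-date Content-type Archive-length".
// The real ArcRecordReader does this (and gzip decompression) for you.
public class ArcHeaderSketch {
    final String url, ip, date, contentType;
    final long length;

    ArcHeaderSketch(String headerLine) {
        String[] fields = headerLine.split(" ");
        if (fields.length != 5) {
            throw new IllegalArgumentException("malformed ARC header: " + headerLine);
        }
        url = fields[0];
        ip = fields[1];
        date = fields[2];
        contentType = fields[3];
        length = Long.parseLong(fields[4]);
    }

    public static void main(String[] args) {
        ArcHeaderSketch h = new ArcHeaderSketch(
            "http://example.com/ 93.184.216.34 20120918211026 text/html 4096");
        System.out.println(h.url + " " + h.contentType + " " + h.length);
    }
}
```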