<h1>Norvig Web Data Science Award Examples</h1>
<p>This is a fork of the original <a href="https://github.com/commoncrawl/commoncrawl-examples">CommonCrawl examples</a>, adapted to be used
as a starting point for your entry to the <a href="http://norvigaward.github.com">Norvig Web Data Science Award</a>.</p>
<h2>Overview of the examples</h2>
<h3>Example MapReduce code</h3>
<p>See the code for all examples <a href="https://github.com/norvigaward/commoncrawl-examples/tree/master/src/java/org/commoncrawl/examples">on Github</a>.</p>
<p>All examples support the same arguments:</p>
<pre><code>org.commoncrawl.examples.Example*
-in &lt;inputpath&gt;
-out &lt;outputpath&gt;
[ -overwrite ]
[ -numreducers &lt;number_of_reducers&gt; ]
[ -conf &lt;conffile&gt; ]
[ -maxfiles &lt;maxfiles&gt; ]
</code></pre>
<p>Where:</p>
<ul>
<li><code>-in</code> <br />
Point to the path of your input files. You can use globbing if your Hadoop
distribution supports it.</li>
<li><code>-out</code> <br />
Point to the path to store the output files.</li>
<li><code>-overwrite</code> <br />
If the output path exists, this switch allows the example to overwrite the
existing directory.</li>
<li><code>-numreducers</code> <br />
Set the maximum number of reducers to run. Defaults to a single reducer.</li>
<li><code>-conf</code> <br />
Path to additional configuration.</li>
<li><code>-maxfiles</code> <br />
Maximum number of files to process.</li>
</ul>
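<p>As a rough illustration of how these flags map to settings, here is a minimal,
hypothetical argument parser. It is a sketch only; the real examples rely on
Hadoop's own option handling, and the class and field names below are made up:</p>

```java
// Hypothetical sketch of parsing the shared example flags.
// Not the real code: the actual examples use Hadoop's option parsing.
public class ExampleArgs {
    int numReducers = 1;      // defaults to a single reducer
    int maxFiles = -1;        // -1 means "no limit"
    boolean overwrite = false;
    String in, out, conf;

    static ExampleArgs parse(String[] args) {
        ExampleArgs a = new ExampleArgs();
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-in":          a.in = args[++i]; break;
                case "-out":         a.out = args[++i]; break;
                case "-conf":        a.conf = args[++i]; break;
                case "-overwrite":   a.overwrite = true; break;
                case "-numreducers": a.numReducers = Integer.parseInt(args[++i]); break;
                case "-maxfiles":    a.maxFiles = Integer.parseInt(args[++i]); break;
                default: throw new IllegalArgumentException("unknown flag: " + args[i]);
            }
        }
        return a;
    }

    public static void main(String[] argv) {
        ExampleArgs a = parse(new String[]{"-in", "/data/in", "-out", "/data/out",
                                           "-overwrite", "-maxfiles", "5"});
        System.out.println(a.in + " " + a.out + " " + a.overwrite
                           + " " + a.maxFiles + " " + a.numReducers);
    }
}
```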
<p>These examples are included:</p>
<ul>
<li><p><a href="https://github.com/norvigaward/commoncrawl-examples/blob/master/src/java/org/commoncrawl/examples/ExampleArcMicroformat.java">org.commoncrawl.examples.ExampleArcMicroformat</a> <br />
An example showing how to analyze the CommonCrawl ARC web content files.</p></li>
<li><p><a href="https://github.com/norvigaward/commoncrawl-examples/blob/master/src/java/org/commoncrawl/examples/ExampleMetadataDomainPageCount.java">org.commoncrawl.examples.ExampleMetadataDomainPageCount</a> <br />
An example showing how to use the CommonCrawl 'metadata' files to count the
number of pages in the corpus per domain.</p></li>
<li><p><a href="https://github.com/norvigaward/commoncrawl-examples/blob/master/src/java/org/commoncrawl/examples/ExampleMetadataStats.java">org.commoncrawl.examples.ExampleMetadataStats</a> <br />
An example showing how to use the CommonCrawl 'metadata' files to quickly
gather high level information about the corpus' content.</p></li>
<li><p><a href="https://github.com/norvigaward/commoncrawl-examples/blob/master/src/java/org/commoncrawl/examples/ExampleTextWordCount.java">org.commoncrawl.examples.ExampleTextWordCount</a> <br />
An example showing how to use the CommonCrawl 'textData' files to efficiently
work with CommonCrawl corpus text content.</p></li>
</ul>
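<p>The word-count pattern used by the last example can be sketched outside
Hadoop. This hypothetical helper only illustrates the tokenize-and-count logic,
not the actual ExampleTextWordCount code:</p>

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for word-count logic: lowercase the text,
// split on non-letters, and tally each word into a sorted map.
public class WordCountSketch {
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("Common Crawl, common data."));
        // {common=2, crawl=1, data=1}
    }
}
```

In the real MapReduce example the same tally is split across phases: the mapper
emits each word with a count of one, and the reducers sum the counts per word.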
<h3>Build and package the examples</h3>
<p>In a terminal you can build and package the examples by changing to the
commoncrawl-examples directory (~/git/commoncrawl-examples) and running:</p>
<pre><code>$ ant
</code></pre>
<p>Inside Eclipse you can build the project by selecting "Project → Build Project"
from the menu bar. </p>
<p>Both methods will create a jar bundle in ~/git/commoncrawl-examples/dist/lib.</p>
<h3>Running the MapReduce examples</h3>
<p>To run an example on at most 5 input files, open a shell and run:</p>
<pre><code>$ hadoop jar dist/lib/commoncrawl-examples-1.0.1.jar [EXAMPLECLASS] -in [INPUT] -out [OUTPUT] -maxfiles 5
</code></pre>
<p>For <code>org.commoncrawl.examples.ExampleMetadataStats</code> that would be:</p>
<pre><code>$ hadoop jar dist/lib/commoncrawl-examples-1.0.1.jar org.commoncrawl.examples.ExampleMetadataStats -in [INPUT] -out [OUTPUT] -maxfiles 5
</code></pre>
<p>You can use this same command for each included example.</p>
<p>The Eclipse project includes a run configuration for the ExampleTextWordCount
example. You can select it from the "Run" menu entry. You can use this run
configuration as a template for other configurations.</p>
<h3>Example Pig script</h3>
<ul>
<li><a href="https://github.com/norvigaward/commoncrawl-examples/blob/master/example.pig">example.pig</a> </li>
</ul>
<p>An example counting the occurrences of HTTP status codes. You can run the Pig
script from the terminal by changing to the examples directory and running:</p>
<pre><code>$ pig example.pig
</code></pre>
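<p>In spirit, the Pig script groups records by HTTP status code and counts each
group. Assuming the status codes have already been extracted as plain integers,
the same aggregation can be sketched in Java (a hypothetical illustration, not
the script itself):</p>

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical aggregation mirroring Pig's GROUP ... / COUNT(...) pattern:
// tally how often each HTTP status code occurs.
public class StatusCodeCount {
    static Map<Integer, Long> countStatusCodes(int[] codes) {
        Map<Integer, Long> counts = new TreeMap<>();
        for (int code : codes) {
            counts.merge(code, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] codes = {200, 404, 200, 301, 200};
        System.out.println(countStatusCodes(codes)); // {200=3, 301=1, 404=1}
    }
}
```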
<h2>Using the CommonCrawl ARC files in MapReduce and Pig</h2>
<p>These examples come with an InputFormat for MapReduce and a Loader for Pig:</p>
<ul>
<li><a href="https://github.com/norvigaward/commoncrawl-examples/tree/master/src/java/org/commoncrawl/hadoop/mapred">ArcInputFormat, ArcRecordReader, and ArcRecord</a></li>
<li><a href="https://github.com/norvigaward/commoncrawl-examples/blob/master/src/java/org/commoncrawl/pig/ArcLoader.java">ArcLoader</a></li>
</ul>
<p>The examples above show how to load the CommonCrawl ARC files using
these classes.</p>
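<p>For orientation: each ARC record begins with a single space-separated header
line of the form <code>URL IP-address Archive-date Content-type
Archive-length</code>. ArcRecordReader handles all of this (plus the gzip
framing) for you, but a simplified, hypothetical sketch of parsing such a
header line looks like:</p>

```java
// Simplified sketch of parsing an ARC v1 record header line:
// "URL IP-address Archive-date Content-type Archive-length".
// The real ArcRecordReader does this (and gzip decompression) for you.
public class ArcHeaderSketch {
    final String url, ip, date, contentType;
    final long length;

    ArcHeaderSketch(String headerLine) {
        String[] fields = headerLine.split(" ");
        if (fields.length != 5) {
            throw new IllegalArgumentException("malformed ARC header: " + headerLine);
        }
        url = fields[0];
        ip = fields[1];
        date = fields[2];
        contentType = fields[3];
        length = Long.parseLong(fields[4]);
    }

    public static void main(String[] args) {
        ArcHeaderSketch h = new ArcHeaderSketch(
            "http://example.com/ 93.184.216.34 20120918211026 text/html 4096");
        System.out.println(h.url + " " + h.contentType + " " + h.length);
    }
}
```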