integrity integration #10

Draft
wants to merge 4 commits into
base: master

@jt55401 jt55401 commented Jul 2, 2024

The intention of this PR is to enable integrity file output.
This will involve:

  1. modifying WEATGenerator.java to output cdx files.

I've started by laying out TODOs in the places where I think we will need to make changes.

This will need to coordinate with a PR in crawl-tools: https://github.com/commoncrawl/crawl-tools/pull/37

@jt55401 changed the title from "laying down TODO's for integrity integration" to "integrity integration" on Jul 3, 2024
@sebastian-nagel

A few remarks:

  • the actual work is done by code in commoncrawl/ia-web-commons. Likely, the bulk needs to be implemented there, using existing class implementations to build upon.
  • this project has quite a long list of dependencies, most of them in very old versions. Setting up a dev environment to test the job might be painful. It could be easier to move the job definition to a new project with a short dependency list (ia-web-commons, hadoop-client, utilities), eventually also upgrading the job to MapReduce v2.

coordinate with a PR in crawl-tools: https://github.com/commoncrawl/crawl-tools/pull/37

While ia-hadoop-tools is a public repository, crawl-tools isn't. Please, keep in mind that it may be annoying for anybody reading about this issue if they cannot read the information given in the linked issue. So, all information related to this issue should be shared here or in other public repositories. And, of course, it's possible to link a public repository from a private one, for example to discuss the integration of the new job into internal tools and workflows.

@sebastian-nagel sebastian-nagel left a comment

Hi @jt55401, thanks! Just a few comments, didn't try to run it.


if(path.endsWith(".gz")) {
watOutputBasename = inputBasename.substring(0,inputBasename.length()-3) + ".wat.gz";
wetOutputBasename = inputBasename.substring(0,inputBasename.length()-3) + ".wet.gz";
cdxWatOutputBasename = inputBasename.substring(0,inputBasename.length()-3) + ".cdxwat.gz";

This results in:

name.warc.gz      name.cdx.gz
name.warc.wat.gz  name.warc.wat.cdxwat.gz
name.warc.wet.gz  name.warc.wet.cdxwet.gz
  • "wat" is given twice
  • ".warc" is removed for CDX files derived from WARC files
    • should be the same for WAT/WET files
    • a CDX file does not follow the WARC format
    • (a WAT or WET file does)

Maybe the following looks better?

name.warc.gz      name.cdx.gz
name.warc.wat.gz  name.wat.cdx.gz
name.warc.wet.gz  name.wet.cdx.gz
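
The proposed naming could be derived roughly as follows. This is only a sketch of the scheme in the table above, not the code in this PR; the class `CdxNaming` and its methods are hypothetical names chosen for illustration.

```java
// Hypothetical helper illustrating the proposed naming scheme; not part of the PR.
final class CdxNaming {

    /** Returns s without the given suffix, or s unchanged if it does not end with it. */
    static String stripSuffix(String s, String suffix) {
        return s.endsWith(suffix) ? s.substring(0, s.length() - suffix.length()) : s;
    }

    /** Derives the CDX file name for a WARC, WAT, or WET file name. */
    static String cdxName(String fileName) {
        String base = stripSuffix(fileName, ".gz");
        String kind = "";
        // Keep ".wat" or ".wet" so the CDX name states which file it indexes.
        if (base.endsWith(".wat") || base.endsWith(".wet")) {
            kind = base.substring(base.length() - 4);
            base = base.substring(0, base.length() - 4);
        }
        // Drop ".warc": a CDX file does not follow the WARC format.
        base = stripSuffix(base, ".warc");
        return base + kind + ".cdx.gz";
    }

    public static void main(String[] args) {
        String[] names = {"name.warc.gz", "name.warc.wat.gz", "name.warc.wet.gz"};
        for (String name : names) {
            System.out.println(name + " -> " + cdxName(name));
        }
    }
}
```

Handling the suffixes one at a time avoids repeating "wat" or "wet" and treats WARC, WAT, and WET inputs the same way.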

} else {
watOutputBasename = inputBasename + ".wat.gz";
wetOutputBasename = inputBasename + ".wet.gz";
cdxWatOutputBasename = inputBasename + ".cdxwat.gz";

See above.

}

String watOutputFileString = basePath.toString() + "/wat/" + watOutputBasename;
String wetOutputFileString = basePath.toString() + "/wet/" + wetOutputBasename;
String cdxWetOutputFileString = basePath.toString() + "/cdxwet/" + cdxWetOutputBasename;

This is a fixed output path. Do we want to have the CDX files for WAT/WET there?

The configuration for the CDX indexing uses different output paths, cf. https://github.com/commoncrawl/webarchive-indexing/blob/main/run_index_hadoop.sh

@jt55401
Author

jt55401 commented Aug 4, 2024

@sebastian-nagel - I've made a few commits and this code should be more to your liking.

  1. filenames should be more consistent now
  2. new config param for cdx base path
  3. fixed a few bugs preventing compile

this project has a quite long list of dependencies, most of them in very old versions. Setting up a dev environment to test the job might be painful. It could be easier to move the job definition to a new project with a short dependency list ( ia-web-commons, hadoop-client, utilities), eventually also upgrading the job to MapReduce v2.

I compiled and tested this with a pretty vanilla Java 11 environment, and everything seemed to work fine. The only issue I ran into is that the internetarchive Maven repo has numerous http (as opposed to https) dependencies, which Maven blocks by default. I overrode this behavior in my local Maven settings, and everything worked fine.

in ~/.m2/settings.xml:

<settings xmlns="http://maven.apache.org/SETTINGS/1.2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.2.0 http://maven.apache.org/xsd/settings-1.2.0.xsd">
  <mirrors>
    <mirror>
      <id>maven-default-http-blocker</id>
      <mirrorOf>dummy</mirrorOf>
      <name>Dummy mirror to override default blocking mirror that blocks http</name>
      <url>http://0.0.0.0/</url>
    </mirror>
  </mirrors>
</settings>

@sebastian-nagel

Hi @jt55401, I've run a test on a Hadoop single-node cluster.

The job run by

yarn jar target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator -outputCDX -cdxBasePath cdx t03 warc/CC-MAIN-20240412194614-20240412224614-00370.warc.gz

finished with status success. However, the generated CDX files index the WARC file rather than the WAT or WET file, respectively:

 CDX N b a m s k r M S V g
warcinfo:/CC-MAIN-20240412194614-20240412224614-00370.warc.gz/ia-web-commons.1.1.10-SNAPSHOT-20240903092205 20240412194614 warcinfo:/CC-MAIN-20240412194614-20240412224614-00370.warc.gz/ia-web-commons.1.1.10-SNAPSHOT-20240903092205 warc-info - - - - 471 0 CC-MAIN-20240412194614-20240412224614-00370.warc.gz
com,kristinroksphotography,0f)/list-959.html 20240412221412 http://0f.kristinroksphotography.com/list-959.html warc/request - - - - 441 471 CC-MAIN-20240412194614-20240412224614-00370.warc.gz
com,kristinroksphotography,0f)/list-959.html 20240412221412 http://0f.kristinroksphotography.com/list-959.html text/html 200 GWZOIQR42OCBZEQAOHVP423CK3NTWBZB - - 25173 912 CC-MAIN-20240412194614-20240412224614-00370.warc.gz
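
One way to spot this in the output is to look at the last field of each CDX line. The following is a minimal sketch, assuming the space-separated 11-field layout announced by the header line above, where `S` is the compressed record length, `V` the offset, and `g` the file name. The class `CdxLineCheck` is a hypothetical name and not part of ia-web-commons.

```java
// Hypothetical checker for CDX lines in the "N b a m s k r M S V g" layout.
final class CdxLineCheck {
    // Zero-based positions of the fields used here.
    static final int LENGTH = 8;    // S: compressed record size
    static final int OFFSET = 9;    // V: offset within the compressed file
    static final int FILENAME = 10; // g: file name

    /** Splits one CDX line into its 11 fields. */
    static String[] parse(String line) {
        String[] fields = line.trim().split(" ");
        if (fields.length != 11) {
            throw new IllegalArgumentException("expected 11 fields, got " + fields.length);
        }
        return fields;
    }

    /** True if the record points into a WAT file rather than the source WARC file. */
    static boolean pointsToWat(String[] fields) {
        return fields[FILENAME].endsWith(".warc.wat.gz");
    }

    public static void main(String[] args) {
        // Second record of the output shown above.
        String line = "com,kristinroksphotography,0f)/list-959.html 20240412221412 "
                + "http://0f.kristinroksphotography.com/list-959.html warc/request - - - - 441 471 "
                + "CC-MAIN-20240412194614-20240412224614-00370.warc.gz";
        String[] fields = parse(line);
        System.out.println("file=" + fields[FILENAME]
                + " offset=" + fields[OFFSET]
                + " length=" + fields[LENGTH]
                + " pointsToWat=" + pointsToWat(fields));
    }
}
```

For a correct WAT index, the file name field would name the WAT file, and offset and length would describe records in that file.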

This needs to be fixed. I'd start by trying to implement this in ia-web-commons, extending the class org.archive.extract.WATExtractorOutput (or WETExtractorOutput, respectively) into, say, WatCdxExtractorOutput. I'm not 100% sure whether this approach works; it needs a try.

I've observed three more points which can be ignored for now:

  • the CDX is not compressed
  • not in CDXJ format
  • both CDX files are the same, except that the *.wet.cdx.gz is missing the last CDX record - I'm unable to explain why

@jt55401
Author

jt55401 commented Sep 3, 2024

OK, I will take a look @sebastian-nagel , thank you.

Did you just run single-node Hadoop with the config in our Nutch project (and feed it a small seed list of one site or so)? Or did you do something more to test this? (I will try to set aside some time to set this up for myself as well.)

@sebastian-nagel

sebastian-nagel commented Sep 5, 2024

A plain single-node setup with minimal configuration, see nutch-test-single-node-cluster but without Nutch installed. For testing I took one WARC file from April 2024 and copied it from local disk to HDFS via:

hadoop fs -mkdir -p /user/$USER/warc
hadoop fs -copyFromLocal CC-MAIN-20240412194614-20240412224614-00370.warc.gz warc/

See above for the command to launch the job. Output is then in hdfs:/user/$USER/{cdx,wat,wet}/

@jt55401
Author

jt55401 commented Sep 26, 2024

  1. move into IA web commons
  2. this really will need to be its own step in the crawl, since the files need to exist so we get offsets, lengths, etc.
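
To illustrate why offsets and lengths are only known once the records have been written: the sketch below assumes each record is stored as its own gzip member, as is common for compressed WARC-family files. It uses only `java.util.zip`; the class `RecordOffsets` is a hypothetical name, and an in-memory buffer stands in for the output file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration; an in-memory buffer stands in for the WAT/WET file.
final class RecordOffsets {

    /** Appends one record as its own gzip member and returns {offset, length}. */
    static long[] append(ByteArrayOutputStream file, String record) throws IOException {
        long offset = file.size(); // known only after all earlier records are written
        GZIPOutputStream gz = new GZIPOutputStream(file);
        gz.write(record.getBytes(StandardCharsets.UTF_8));
        gz.finish(); // completes this gzip member without closing the underlying buffer
        return new long[] {offset, file.size() - offset};
    }

    /** Reads back a single record using the offset and length a CDX line would carry. */
    static String readAt(byte[] file, long offset, long length) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(file, (int) offset, (int) length))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        long[] first = append(file, "first record");
        long[] second = append(file, "second record");
        System.out.println("second record: offset=" + second[0] + " length=" + second[1]);
        System.out.println(readAt(file.toByteArray(), second[0], second[1]));
    }
}
```

The compressed length of a record depends on its content, so the offset of the next record cannot be computed before the previous one is fully written.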
