Indexing OAI-PMH non-compliant repositories #8

Open
wetneb opened this issue Nov 30, 2015 · 21 comments

Comments

@wetneb
Collaborator

wetneb commented Nov 30, 2015

So, BASE (http://www.base-search.net) indexes many repositories that have an OAI-PMH interface, which is already very useful and covers most of the respectable repositories.

But unfortunately, many repositories do not provide such an interface. And even when they provide one, the protocol is far from perfect: for instance, it is hard to detect accurately whether a full text is freely available behind a metadata record. Reaching out to the repository admins to encourage them to expose their metadata correctly is, from my experience, not effective at all.
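To make the full-text problem concrete, here is a minimal sketch (not taken from BASE or any real harvester) that parses a single oai_dc record with Python's standard library. The record and its URLs are made up for illustration, and `candidate_links` and `looks_like_pdf` are hypothetical helper names. Nothing in the record says which dc:identifier, if any, leads to a freely available full text, so a harvester can only guess from the shape of the URL.

```python
import xml.etree.ElementTree as ET

# Made-up oai_dc record for illustration only; the URLs are not real.
SAMPLE_RECORD = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example preprint</dc:title>
  <dc:identifier>http://repository.example.org/handle/1234/5678</dc:identifier>
  <dc:identifier>http://repository.example.org/bitstream/1234/5678/paper.pdf</dc:identifier>
</oai_dc:dc>
"""

DC = "{http://purl.org/dc/elements/1.1/}"


def candidate_links(record_xml):
    """Return every dc:identifier value that looks like a URL."""
    root = ET.fromstring(record_xml)
    links = []
    for element in root.iter(DC + "identifier"):
        text = (element.text or "").strip()
        if text.startswith("http"):
            links.append(text)
    return links


def looks_like_pdf(url):
    """Guess from the URL alone; this is a heuristic, not a guarantee."""
    return url.lower().split("?")[0].endswith(".pdf")


for link in candidate_links(SAMPLE_RECORD):
    print(link, looks_like_pdf(link))
```

With the sample record, only the second identifier looks like a PDF, and even then the guess says nothing about whether the file is actually accessible without a login.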

If we want to go beyond this, I think we need to crawl!

  • One (conceptually) simple option would be to crawl for PDF files, extract metadata from them and dump this in a proper database with indexes. I think this is what CiteSeerX does, but only for papers within a particular field. In my experience the metadata that comes out of this is quite noisy.
  • The other option we have discussed would be to leverage existing scrapers (Zotero) to extract cleaner metadata from HTML pages. Zotero does the scraping pretty well, but I have no clue what crawling software we should use. Any ideas? The scrapy framework looks nice, but I'm a complete newcomer in this field, so I have probably missed better options.

How many resources (servers) do we need for this? Where could we get them?

@mekarpeles
Collaborator

@wetneb thanks for this excellent summary. We can apply some pressure and communicate the importance of this initiative to the Internet Archive; they may be able to provide us with resources (disk/storage space). I can talk to Brewster this Friday about that possibility (a researcher VM with ~5 TB of space, for starters). At the very least, hopefully it will help us develop crawlers, write tutorials for the community to follow, and store documents temporarily (we can also use the openjournal user account to upload files directly to the Internet Archive, and they have an S3-style API for bulk upload). The Internet Archive also does OCR on papers/PDFs that are uploaded, which can be a big mutual win. Re: noisy data, I am not sure how accurate their OCR is for academic (especially math-heavy) works. Perhaps worth exploring as an experiment.

Regarding crawlers, perhaps we can create a document, an awesome-list of all crawling initiatives for academic papers, journals, databases, and metadata. I have ideas on how we can leverage libraries like scrapy, and as you allude to, I'm sure there are existing tools. (I imagine that creating n source-specific crawlers, reverse-engineering each source's indexing scheme, and maintaining a public repository of these crawlers will be more successful and more comprehensive than a deep crawl.)

@pietsch, this is a fairly comprehensive list of publicly accessible sources that use OAI-PMH, right?

@cleegiles
Collaborator

Please see comments below.

It is important not to reinvent wheels. Some of what we discuss falls into that domain.

On 11/30/15 5:45 PM, Antonin wrote:

  • One (conceptually) simple option would be to crawl for PDF files, extract metadata from them and dump this in a proper database with indexes. I think this is what CiteSeerX does, but only for papers within a particular field. In my experience the metadata that comes out of this is quite noisy.

No longer: CiteSeerX now crawls for all scholarly documents. Please note that some repositories use their robots.txt to block all crawlers except Googlebot.

Yes, it is noisy, but not that bad, and it is always getting better. Others are doing this and contributing to the extraction algorithms. Please see our previous list of extraction methods. We may have a tutorial at a major conference soon.
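On the robots.txt point, here is a minimal sketch of how a crawler could check such a policy with Python's standard `urllib.robotparser`. The robots.txt content, the repository URL, and the `my-crawler` user agent are made up for illustration; this is not CiteSeerX's crawler code.

```python
from urllib.robotparser import RobotFileParser

# Made-up policy of the kind described above: only Googlebot may crawl.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "http://repository.example.org/papers/1234.pdf"
print(allowed("Googlebot", url))
print(allowed("my-crawler", url))
```

A polite crawler would run this check per host before fetching anything, which is exactly why such repositories stay invisible to everyone but Google.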

  • The other option we have discussed would be to leverage existing scrapers (Zotero) to extract cleaner metadata from HTML pages. Zotero does the scraping pretty well, but I have no clue what crawling software we should use. Any idea? The scrapy framework looks nice but I'm a complete newcomer in this field so I have probably missed better options.

Our crawling code is available on the CiteSeerX GitHub. We are always crawling, recently together with Semantic Scholar at AI2. We use Heritrix, an excellent tool.

How much resources (servers) do we need for this? Where could we get them?

This is storage and bandwidth intensive. The problems are:

  • Crawling is very time consuming and needs a great many coordinated parallel threads.
  • Crawling brings back unwanted PDFs, which must be filtered or classified. We have a few papers on this if anyone is interested.
  • A scholarly paper PDF is about one megabyte, so a million papers take about a terabyte. However, the other PDFs usually have to be stored as well.

We've found a factor of 3 for all documents we expect to crawl. That is, we crawl 3 times as many PDFs as the number of scholarly papers we end up with.
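As a back-of-the-envelope check of these numbers, here is a small sketch. The function name is hypothetical, and it assumes the unwanted PDFs have the same average size as the scholarly ones, which the figures above do not state.

```python
MB = 10**6
TB = 10**12


def crawl_storage_bytes(num_papers, avg_pdf_bytes=1 * MB, overcrawl_factor=3):
    """Return (bytes for the wanted papers, bytes for everything fetched).

    Assumes every fetched PDF, wanted or not, has the same average size.
    """
    wanted = num_papers * avg_pdf_bytes
    fetched = wanted * overcrawl_factor
    return wanted, fetched


wanted, fetched = crawl_storage_bytes(1_000_000)
print(wanted / TB, fetched / TB)  # 1.0 3.0
```

So a million scholarly papers need about 1 TB, and about 3 TB if everything the crawl brings back is kept.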

We share our crawl seeds if anyone wants to use them.

Who here crawls?



@cleegiles
Collaborator

These papers may be of help. One can publish improvements. We would very much like to improve ours. Suggestions are most welcome.

Best

Lee


@mekarpeles
Collaborator

@cleegiles Thanks for the excellent contribution. Also, for the record, I view OpenJournal as a working group for reducing (exactly as you say) re-implemented wheels. So I'm on board! I don't anticipate building anything directly through OpenJournal -- if I contribute to something, it will be an existing project (unless there's a very compelling reason why something new needs to be built, and if that's the case, I'd still prefer someone else leads it, i.e. limited bandwidth).

There are several folks affiliated w/ the Internet Archive I know who are working on crawling efforts. Some aren't comfortable announcing. @nthmost can likely share her ideas for pubmed crawling. I have experience crawling, am happy to participate, but do not currently work on a crawler.

@cleegiles I'll try to generate some traffic to this thread by pinging other institutions and seeing if they can weigh in on the status of their crawlers -- thanks for nudging us in that direction.

@wetneb
Collaborator Author

wetneb commented Dec 1, 2015

@cleegiles Thanks for the clarification, it's great that CiteSeerX is not limited to comp. sci. anymore!
Do not get me wrong, what CiteSeerX does is massively useful. But at http://dissem.in we need to match publications with preprints, which is quite hard as soon as the title or the authors differ by a few words. So, metadata quality is critical for us.
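To illustrate why small differences hurt, here is a minimal sketch of fuzzy title matching with Python's standard `difflib`. This is not dissemin's actual matching code: the function names and the 0.9 threshold are assumptions chosen for the example, and the normalization only keeps ASCII letters and digits.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase, drop punctuation, collapse whitespace (ASCII only)."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def title_similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


def same_paper(a, b, threshold=0.9):
    """Treat two records as the same paper if their titles are close enough."""
    return title_similarity(a, b) >= threshold


print(same_paper("Deep Learning: A Survey", "deep learning - a survey"))
print(same_paper("Deep Learning: A Survey", "Protein folding in yeast"))
```

Differences in case and punctuation are absorbed by the normalization, but a preprint whose title was reworded before publication can easily fall below any fixed threshold, and noisy extracted metadata makes that worse.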

I can imagine that CiteSeerX requires a lot of resources indeed. We cannot afford this for dissemin, and this is the reason why I thought using scrapers would be cheaper (in terms of bandwidth, storage, and computing) and could potentially yield cleaner metadata. But this option only works for repositories, not home pages.

I have been experimenting with indexing researchgate.net recently (with crawling and scraping), and I have been in touch with Mike Taylor, who has started something along these lines for SSRN.

@mekarpeles Having a researcher VM would be amazing! And I'm really looking forward to hearing more crawling stories, especially from the Internet Archive!

@davidar
Collaborator

davidar commented Dec 1, 2015

And even when they provide one, the protocol is far from perfect: for instance, it is hard to detect accurately whether a full text is freely available behind a metadata record.

Yes, this is an issue we're trying to deal with in ipfs-inactive/archives#3 (enriching OAI-PMH metadata with fulltext links).

@cleegiles Is this something that CiteSeerX could help with?

@wetneb
Collaborator Author

wetneb commented Dec 1, 2015

@davidar that's awesome! I hope you will succeed.

@davidar
Collaborator

davidar commented Dec 1, 2015

@wetneb well, my plan was basically the same as what you outlined in the OP, so I'm afraid I might be reinventing the wheel? Perhaps this is something we could collaborate on?

@wetneb
Collaborator Author

wetneb commented Dec 1, 2015

@davidar I would love to! Joining the discussion there then.

@pietsch
Collaborator

pietsch commented Dec 1, 2015

Hi @mekarpeles,

@pietsch, this is a fairly comprehensive list of publicly accessible sources that use OAI-PMH, right?

Yes, these are the 3881 OAI-PMH sources BASE is currently harvesting (in intervals). The other lists I am aware of are the Directory of Open Access Journals (DOAJ) and the official (if incomplete and out-of-date) list of OAI-PMH repositories.

@cleegiles
Collaborator

We do this with our crawler. If the link goes directly to a non-open-source publisher, there is no reason to crawl. We have a blacklist and a whitelist of where we go now, which we can share. It's fairly complete.

Another way is to parse the link to the document. Links that do not point directly to PDFs are usually not downloadable. It would be useful to take a sample to see how often this is true, but we've found it to nearly always be the case. A counterexample is where one has to sign in but anyone can have an account.
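A minimal sketch of how the two heuristics could be combined. The host lists and the `should_fetch` name are hypothetical, and the order of the checks is an assumption for the example rather than a description of our crawler.

```python
from urllib.parse import urlparse

# Hypothetical host lists; real ones would be much longer.
BLACKLIST = {"paywalled-publisher.example.com"}
WHITELIST = {"repository.example.org"}


def should_fetch(url):
    """Decide whether a link is worth fetching as a likely open full text."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if host in BLACKLIST:
        return False  # known closed publisher: no reason to crawl
    if host in WHITELIST:
        return True  # known open source of papers
    # Unknown host: only follow links that point directly to a PDF.
    return parsed.path.lower().endswith(".pdf")


print(should_fetch("http://paywalled-publisher.example.com/article/1.pdf"))
print(should_fetch("http://repository.example.org/handle/1234"))
print(should_fetch("http://homepage.example.net/~author/paper.pdf"))
print(should_fetch("http://homepage.example.net/~author/paper.html"))
```

The sign-in counterexample is not covered by this: a direct PDF link behind a free login would still be fetched and fail.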


@cleegiles
Collaborator

The best metadata seems to be the Web of Science (WoS), but it has to be purchased and is not cheap.

This seems to be a good source of metadata, but we have not compared it to ours:

http://research.microsoft.com/en-us/projects/mag/

We currently have a project to clean up our own and others' metadata.


@wetneb wetneb reopened this Dec 1, 2015
@nthmost
Collaborator

nthmost commented Dec 1, 2015

Hi everybody,

As @mekarpeles mentioned, I've been working on PubMed collection efforts for over a year now. That code is represented in the metapub project and installable via PyPI (pip install metapub).

The primary purpose of the FindIt tool within metapub is to be able to pull fulltext article matter (just PDFs right now) at high identity confidence. I.e. if a researcher thinks they are getting pubmed ID #123456, the result of using FindIt should be exactly that article about 99% of the time.

Here's the overview of how it works. Starting from a PubMed ID, FindIt does the following steps on each article:

  • uses the pubmed ID to pull down the PubMed XML for the article
  • uses the PubMedCentral ID, if any, to produce a url to a pdf
  • if not in PubMedCentral, looks up the journal name within the FindIt machinery to see if we can apply a known “dance” to get a PDF link on the publisher’s website.
  • if journal name not currently filed in FindIt, reports as “NOFORMAT”
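The steps above can be sketched roughly as follows. This is an illustration of the flow only, not metapub's real code or API: `find_pdf_url`, `JOURNAL_DANCES`, the example journal, and its URL template are all hypothetical, the PubMed Central URL pattern is an assumption, and the article dict stands in for fields already parsed out of the PubMed XML.

```python
def example_journal_dance(article):
    """Hypothetical 'dance': build a publisher PDF URL from citation fields."""
    return "https://journal.example.org/content/{volume}/{first_page}.full.pdf".format(**article)


# Hypothetical registry: journal name -> function that knows the dance.
JOURNAL_DANCES = {"Example Journal of Genetics": example_journal_dance}


def find_pdf_url(article):
    """Return (url, reason); exactly one of the two is None.

    `article` is a dict of fields assumed to come from the PubMed XML.
    """
    pmcid = article.get("pmcid")
    if pmcid:
        # Assumed PubMed Central URL pattern, for illustration.
        return "https://www.ncbi.nlm.nih.gov/pmc/articles/{}/pdf/".format(pmcid), None
    dance = JOURNAL_DANCES.get(article.get("journal"))
    if dance is None:
        return None, "NOFORMAT"
    return dance(article), None


print(find_pdf_url({"pmid": "123456", "journal": "Unknown Journal"}))
```

The point of the design is that there is no guessing: either a known rule produces a link, or the lookup says plainly that it has no rule for that journal.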

I can explain all of this in detail -- for now I think it suffices to say that my approach is very different from crawling or screen scraping, and this is very much by design.

I built and deployed this engine in production at a genetic testing/diagnostic company to save scientists time in tracking down the article texts they needed to research and prove the calls they were making on people's genetic test reports.

As a result, FindIt's coverage of the NCBI journal list is heavily skewed towards medical genetics, and its testing has focused on pubmed citations found in HGMD and Clinvar. This gave FindIt a nicely controlled constraint for its success; now it's time to branch out and try to complete its coverage across all PubMed domains.

I recently completed a long-running coverage test in which I iterated over every named NCBI journal from the Entrez list, found 3 to 5 article IDs per journal from different years (if possible), and then ran FindIt over those IDs. (Total pmids = ~117k.) I have yet to analyze these results, but will probably do so on the plane back from Hawai'i (where I've been hiding during the evolution of all this discussion).

After completing metapub's coverage of PubMed, I'm interested in starting a project with the same design constraints (i.e. high confidence, next-to-no actual scraping) that covers all journals that have DOIs. There is a lot of machinery in metapub that could apply to a broader swath of disciplines via the use of the CrossRef API and the dx.doi.org redirect.
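As a rough sketch of the two entry points mentioned here, the following builds the lookup URLs for a DOI without making any network calls. The function names and the sample DOI are made up for illustration; the Crossref path follows the public `api.crossref.org/works/` pattern, and none of this is metapub code.

```python
from urllib.parse import quote


def doi_resolver_url(doi):
    """URL that redirects to the publisher's landing page for this DOI."""
    return "https://dx.doi.org/" + quote(doi, safe="/")


def crossref_works_url(doi):
    """URL of the Crossref REST API metadata record for this DOI."""
    return "https://api.crossref.org/works/" + quote(doi, safe="/")


doi = "10.1234/example.5678"  # made-up DOI
print(doi_resolver_url(doi))
print(crossref_works_url(doi))
```

Following the first URL's redirect reveals which publisher hosts the article, and the second returns structured metadata such as the journal name, which is what a per-journal rule would key on.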

I'm eager to get more involved with all of you!

@cleegiles
Collaborator

Good thing to do.

If I understand what you are doing, you are interested in full documents, PDFs?

If so, how do you extract the text from the PDF?

Best

Lee


@mekarpeles
Collaborator

@cleegiles as an aside, the Internet Archive does OCR on any PDFs uploaded to them (I think this was @nthmost's plan, however I'm interested in hearing your opinions). I'm sure many institutions would benefit from more contributions towards a more proficient library for extracting text from PDFs.

Ideally, in the future, the .tex version of the paper will be available via something like github... But as frustrating as that is, I'll keep it contained as a separate issue :)

@davidar
Collaborator

davidar commented Dec 2, 2015

Ideally, in the future, the .tex version of the paper will be available via something like github...

That would be amazing. Unfortunately, arXiv seems to be the only one distributing TeX sources currently.

@jbenet
Collaborator

jbenet commented Dec 2, 2015

Love what's going on in this thread

@davidar can't wait for a full offline-friendly experience of arxiv with ipfs+TeX.js :)

@cleegiles
Collaborator

PDF-to-text extraction for "quality" information is still an open question. I would guess that IA is using PDFBox, a reasonable selection.

It's very important to know which tool is being used for PDF-to-text extraction. There are many available. There are companies that make a living from their modifications of existing software or from creating their own, e.g. gonitro.com. I would put their proprietary software as state of the art in comparison with Google. Many scientists are very concerned about how data can be extracted from PDFs, since this is the only place where some data exists and can be digitized. Odd, isn't it?

Most open source extractors are OK but not of high quality; this is one reason for CiteSeerX's extraction flaws. We use either PDFBox or PDFlib TET (not open source). Google's is extremely good! Not released yet.

I've been told AI2 is about to release a very good converter. We haven't seen it, but many of the tools they've released so far have been excellent, and we use them.

Interestingly, surprisingly few scholarly papers are published in .tex; many are in .doc or .docx, especially in medicine, engineering, or sciences outside of computer science and physics.


@nthmost
Collaborator

nthmost commented Dec 6, 2015

@cleegiles my work's not focused on doing OCR, as the PDF format vis-a-vis people's usage of it in academic papers is pretty far removed from standardized; I'd rather do as @mekarpeles has suggested and upload PDFs to the Archive to be OCRed there.

That said, at my last job, we were able to use pdfminer (a Python library) to good effect to turn medical genetics papers (in English) into machine-indexable text. We built indexes over these texts and mapped mentions of important genetics concepts back to their pubmed IDs, so that medical concepts (referenced in the NIH medgen database) could be mapped to pubmed citation evidence.
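A minimal sketch of that kind of index, assuming the text has already been extracted from each PDF (for example with pdfminer). The function name, the sample texts, the PubMed IDs, and the concept list are all made up for illustration; this is not the code we ran in production.

```python
import re
from collections import defaultdict


def build_concept_index(texts_by_pmid, concepts):
    """Map each concept to the set of PubMed IDs whose text mentions it."""
    index = defaultdict(set)
    for pmid, text in texts_by_pmid.items():
        lowered = text.lower()
        for concept in concepts:
            # Whole-word, case-insensitive match on the concept name.
            pattern = r"\b" + re.escape(concept.lower()) + r"\b"
            if re.search(pattern, lowered):
                index[concept].add(pmid)
    return index


# Made-up extracted texts keyed by made-up PubMed IDs.
texts = {
    "111": "Mutations in BRCA1 increase risk.",
    "222": "Cystic fibrosis is caused by CFTR variants.",
    "333": "BRCA1 and CFTR were both sequenced.",
}
index = build_concept_index(texts, ["BRCA1", "CFTR"])
print(sorted(index["BRCA1"]), sorted(index["CFTR"]))
```

A real version would also need synonyms and identifiers from a vocabulary such as medgen, since plain name matching misses most of the ways a concept gets written.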

@nemobis

nemobis commented Mar 10, 2022

Sorry for being 6+ years late to the party. ;)

Reaching out to the repository admins to encourage them to expose their metadata correctly

COAR focuses on this part of the puzzle quite a bit. They have various ongoing initiatives like https://www.coar-repositories.org/news-updates/ccsd-and-coar-announce-plans-to-launch-preprint-directory/ .

extract cleaner metadata from HTML pages

https://scholar.archive.org/ is arguably doing just this: all the PDFs and other full text sources the Internet Archive finds with its scanning and crawling efforts get mined for academic works to index. Everyone go contribute! https://github.com/internetarchive/fatcat

@mekarpeles
Collaborator

@nemobis +1! And to everyone else who continues to tirelessly further the space. I know @bnewbold et al. have leveraged the amazing work of others in the community to build fatcat into another great resource in the space. Proud to watch these efforts mature and grateful for everyone's work.


8 participants