Indexing OAI-PMH non-compliant repositories #8
@wetneb thanks for this excellent summary. We can apply some pressure and communicate the importance of this initiative to the Internet Archive; they may be able to provide us with resources (disk/storage space). I can talk to Brewster this Friday about that possibility (a researcher VM with ~5 TB of space, for starters). At the very least, hopefully it will help us develop crawlers, write tutorials for the community to follow, and store documents temporarily (we can also use the
Regarding crawlers, perhaps we can create a document, an awesome-list of all crawling initiatives for academic papers, journals, databases, and metadata. I have ideas on how we can leverage libraries like scrapy -- as you allude to, I'm sure there are existing tools (and I imagine creating
@pietsch, this is a fairly comprehensive list of publicly accessible sources that use OAI-PMH, right? |
Please see comments below. It is important not to reinvent wheels. Some of what we discuss On 11/30/15 5:45 PM, Antonin wrote:
Yes, it is noisy but not that bad and is always getting better. Others
We've found a factor of 3 for all documents we expect to crawl. That is We share our crawl seeds if anyone wants to use them. Who here crawls?
|
These papers may be of help. One can publish improvements. We would very Best Lee On 11/30/15 7:04 PM, Michael E. Karpeles wrote:
|
@cleegiles Thanks for the excellent contribution. Also, for the record, I view OpenJournal as a working group for reducing (exactly as you say) re-implemented wheels. So I'm on board! I don't anticipate building anything directly through OpenJournal -- if I contribute to something, it will be an existing project (unless there's a very compelling reason why something new needs to be built, and if that's the case, I'd still prefer someone else leads it, i.e. limited bandwidth). There are several folks affiliated w/ the Internet Archive I know who are working on crawling efforts. Some aren't comfortable announcing. @nthmost can likely share her ideas for pubmed crawling. I have experience crawling, am happy to participate, but do not currently work on a crawler. @cleegiles I'll try to generate some traffic to this thread by pinging other institutions and seeing if they can weigh in on the status of their crawlers -- thanks for nudging us in that direction. |
@cleegiles Thanks for the clarification, it's great that CiteSeerX is not limited to comp. sci. anymore! I can imagine that CiteSeerX requires a lot of resources indeed. We cannot afford this for dissemin, and this is the reason why I thought using scrapers would be cheaper (in terms of bandwidth, storage, and computing) and could potentially yield cleaner metadata. But this option only works for repositories, not home pages. I have been playing around with indexing researchgate.net recently (with crawling and scraping), and I have been in touch with Mike Taylor who has started something along these lines for SSRN. @mekarpeles Having a researcher VM would be amazing! And I'm really looking forward to hearing more crawling stories, especially from the Internet Archive! |
Yes, this is an issue we're trying to deal with in ipfs-inactive/archives#3 (enriching OAI-PMH metadata with fulltext links). @cleegiles Is this something that CiteSeerX could help with? |
@davidar that's awesome! I hope you will succeed. |
@wetneb well, my plan was basically the same as what you outlined in the OP, so I'm afraid I might be reinventing the wheel? Perhaps this is something we could collaborate on? |
@davidar I would love to! Joining the discussion there then. |
Hi @mekarpeles,
Yes, these are the 3881 OAI-PMH sources BASE is currently harvesting (in intervals). The other lists I am aware of are the Directory of Open Access Journals (DOAJ) and the official (if incomplete and out-of-date) list of OAI-PMH repositories. |
We do this with our crawler. If the link goes directly to a non-open Another way is to parse the link to the document. Links that are not On 12/1/15 3:03 AM, David A Roberts wrote:
|
The best metadata seems to be the Web of Science (WoS) but it has to be This seems to be a good source of metadata, but we have not compared it
On 12/1/15 2:27 AM, Antonin wrote:
|
Hi everybody, As @mekarpeles mentioned, I've been working on PubMed collection efforts for over a year now. That code is represented in the metapub project and install-able via pypi ( The primary purpose of the FindIt tool within metapub is to be able to pull fulltext article matter (just PDFs right now) at high identity confidence. I.e. if a researcher thinks they are getting pubmed ID #123456, the result of using FindIt should be exactly that article about 99% of the time. Here's the overview of how it works. Starting from a PubMed ID, FindIt does the following steps on each article:
I can explain all of this in detail -- for now I think it suffices to say that my approach is very different from crawling or screen scraping, and this is very much by design.
I built and deployed this engine in production at a genetic testing/diagnostic company to save scientists time in tracking down the article texts they needed to research and prove the calls they were making on people's genetic test reports. As a result, FindIt's coverage of the NCBI journal list is heavily skewed towards medical genetics, and its testing has focused on PubMed citations found in HGMD and ClinVar. This gave FindIt a nicely controlled constraint for its success; now it's time to branch out and try to complete its coverage across all PubMed domains.
I recently completed a long-running coverage test in which I iterated over every named NCBI journal from the Entrez list, found 3 to 5 article IDs per journal from different years (if possible), and then ran FindIt over those IDs. (Total PMIDs = ~117k.) I have yet to analyze these results, but will probably do so on the plane back from Hawai'i (where I've been hiding during the evolution of all this discussion).
After completing metapub's coverage of PubMed, I'm interested in starting a project with the same design constraints (i.e. high confidence, next-to-no actual scraping) that covers all journals that have DOIs. There is a lot of machinery in metapub that could apply to a broader swath of disciplines via the use of the CrossRef API and the dx.doi.org redirect.
I'm eager to get more involved with all of you! |
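To make the CrossRef API / dx.doi.org idea a bit more concrete, here is a minimal sketch of how a DOI could be turned into the two URLs such a tool would start from. This is not metapub's code: the helper names (`normalize_doi`, `doi_resolver_url`, `crossref_works_url`) are hypothetical, the DOI `10.1000/xyz123` is only a placeholder, and no network request is made.

```python
from urllib.parse import quote

# Prefixes people commonly paste in front of a bare DOI (assumed list).
_PREFIXES = (
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "https://doi.org/",
    "http://doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Strip whitespace and a leading resolver URL or 'doi:' prefix."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            return doi[len(prefix):].strip()
    return doi


def doi_resolver_url(raw):
    """URL that redirects to the publisher's landing page for this DOI."""
    return "https://dx.doi.org/" + quote(normalize_doi(raw), safe="/")


def crossref_works_url(raw):
    """URL of the CrossRef REST API metadata record for this DOI."""
    return "https://api.crossref.org/works/" + quote(normalize_doi(raw), safe="/")


if __name__ == "__main__":
    print(doi_resolver_url("doi:10.1000/xyz123"))
    print(crossref_works_url("doi:10.1000/xyz123"))
```

Following the redirect and reading the CrossRef record would then be ordinary HTTP requests; how the landing page is mapped to a fulltext link is the hard, publisher-specific part and is not shown here.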
Good thing to do. If I understand what you are doing, you are interested in full If so, how do you extract the text from the PDF? Best Lee On 12/1/15 4:16 PM, nthmost wrote:
|
@cleegiles as an aside, the Internet Archive does OCR on any PDFs uploaded to them (I think this was @nthmost's plan, however I'm interested in hearing your opinions). I'm sure many institutions would benefit from more contributions towards a more proficient library for extracting text from PDF. Ideally, in the future, the .tex version of the paper will be available via something like github... But as frustrating as that is, I'll keep it contained as a separate issue :) |
That would be amazing, unfortunately arXiv seems to be the only ones distributing TeX sources currently |
Love what's going on in this thread @davidar can't wait for a full offline-friendly experience of arxiv with ipfs+TeX.js :) |
PDF to text extraction for "quality" information is still an open
It's very important to know which tool is being used for PDF to text
Most open source extractors are ok but not of high quality, this is one
AI2 I've been told is about to release a very good converter. We haven't
Interesting, surprisingly few scholarly papers are published in .tex,
On 12/1/15 7:24 PM, Michael E. Karpeles wrote:
|
@cleegiles my work's not focused on doing OCR, as the PDF format vis-a-vis people's usage of it in academic papers is pretty far removed from standardized; I'd rather do as @mekarpeles has suggested and upload PDFs to the Archive to be OCRed there. That said, at my last job, we were able to use pdfminer (a Python library) to good effect to turn medical genetics papers (in English) into machine-indexable text. We built indexes over these texts and mapped mentions of important genetics concepts back to their pubmed IDs, so that medical concepts (referenced in the NIH medgen database) could be mapped to pubmed citation evidence. |
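As a rough illustration of the indexing step described above (not the actual production code), here is a minimal sketch that maps concept mentions back to PubMed IDs. It assumes the article text has already been extracted, for example with something like `extract_text` from `pdfminer.high_level` in pdfminer.six; the function name `build_concept_index`, the PMIDs, and the sample texts are all hypothetical.

```python
import re
from collections import defaultdict


def build_concept_index(texts_by_pmid, concepts):
    """Map each concept to the sorted list of PMIDs whose text mentions it.

    Matching is a case-insensitive whole-word search, which is only a crude
    stand-in for real concept recognition.
    """
    index = defaultdict(set)
    for pmid, text in texts_by_pmid.items():
        lowered = text.lower()
        for concept in concepts:
            pattern = r"\b" + re.escape(concept.lower()) + r"\b"
            if re.search(pattern, lowered):
                index[concept].add(pmid)
    # Concepts that are never mentioned are simply absent from the result.
    return {concept: sorted(pmids) for concept, pmids in index.items()}


if __name__ == "__main__":
    # Placeholder texts standing in for extracted article bodies.
    texts = {
        "111": "Mutations in BRCA1 are associated with hereditary cancer.",
        "222": "We study cystic fibrosis and compare with BRCA1 carriers.",
    }
    print(build_concept_index(texts, ["BRCA1", "cystic fibrosis", "Marfan syndrome"]))
```

A real pipeline would need synonym handling and a proper concept vocabulary; the point here is only the shape of the concept-to-PMID mapping.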
Sorry for being 6+ years late to the party. ;)
COAR focuses on this part of the puzzle quite a bit. They have various ongoing initiatives like https://www.coar-repositories.org/news-updates/ccsd-and-coar-announce-plans-to-launch-preprint-directory/ .
https://scholar.archive.org/ is arguably doing just this: all the PDFs and other full text sources the Internet Archive finds with its scanning and crawling efforts get mined for academic works to index. Everyone go contribute! https://github.com/internetarchive/fatcat |
So, BASE (http://www.base-search.net) indexes many repositories that have an OAI-PMH interface, which is already very useful and covers most of the respectable repositories.
But unfortunately, many repositories do not provide such an interface. And even when they provide one, the protocol is far from perfect: for instance, it is hard to detect accurately whether a full text is freely available behind a metadata record. Reaching out to the repository admins to encourage them to expose their metadata correctly is, in my experience, not effective at all.
If we want to go beyond this, I think we need to crawl!
How many resources (servers) do we need for this? Where could we get them?
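To illustrate why full-text detection from OAI-PMH records is only a guess, here is a minimal sketch that parses an `oai_dc` ListRecords response and keeps the `dc:identifier` URLs that look like PDFs. The sample XML, the `repo.example.org` URLs, and the function name `candidate_fulltext_links` are made up for illustration; the ".pdf suffix" test is an assumed heuristic, and it cannot tell whether the file is actually open.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH 2.0 responses and the oai_dc metadata format.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical ListRecords response, trimmed to two records.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Record with a direct PDF link</dc:title>
          <dc:identifier>http://repo.example.org/record/1</dc:identifier>
          <dc:identifier>http://repo.example.org/files/1/paper.pdf</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:repo.example.org:2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Record with only a landing page</dc:title>
          <dc:identifier>http://repo.example.org/record/2</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def candidate_fulltext_links(xml_text):
    """Return {OAI identifier: [URLs that look like PDF links]} per record."""
    root = ET.fromstring(xml_text)
    results = {}
    for record in root.iterfind(".//oai:record", NS):
        oai_id = record.findtext("oai:header/oai:identifier", namespaces=NS)
        links = []
        for element in record.iterfind(".//dc:identifier", NS):
            url = (element.text or "").strip()
            if not url.lower().startswith(("http://", "https://")):
                continue  # skip DOIs, handles, and other non-URL identifiers
            # Heuristic only: ignore any query string, then check the suffix.
            if url.lower().split("?")[0].endswith(".pdf"):
                links.append(url)
        results[oai_id] = links
    return results


if __name__ == "__main__":
    print(candidate_fulltext_links(SAMPLE_RESPONSE))
```

The second record shows the problem: the landing page may or may not lead to a free full text, and nothing in the metadata says which, so going further means fetching and inspecting the page itself.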