
Missing Publication and Dataset Resources #6

Open
HaritzPuerto opened this issue Oct 7, 2019 · 13 comments

@HaritzPuerto
Collaborator

Hi,

I executed python corpus.py corpus.ttl and then python download_corpus_resources.py to download the corpus, but I got the output below. Is this the expected output? It looks like some publications cannot be downloaded.

Number of records in the corpus: 586
Number of research publications: 480
Successfully downloaded 474 pdf files.
Missing publication resources: {'012df4a72af52b038483', 'dca54974ff51a5f7f8ab', '5f48a343cb75195cd646', 'c8f9b19b39e34d98a557', '988428e18884e28e037c', '42c2755ec0f983870e62'}
Number of datasets: 106
Successfully downloaded 101 resource files.
Missing dataset resources: {'875ffb2b04b1392cd1f2', 'fe338b5b2f3f6b0d11a4', '53ca68ba0ded95220662', '33b1ce039c67a6658644', '379ff5f518e664ba2353'}

I checked the publication with id: "012df4a72af52b038483", and it looks like the link is not broken. Here is the link I got from corpus.jsonld
https://aasldpubs.onlinelibrary.wiley.com/doi/pdf/10.1002/hep.23220

@ceteri @philipskokoh Do you know why this happens?

Thanks

@tong-zeng

Similar error on my side:

Number of records in the corpus: 586
Number of research publications: 480
Successfully downloaded 474 pdf files.
Missing publication resources: {'c8f9b19b39e34d98a557', '988428e18884e28e037c', 'dca54974ff51a5f7f8ab', '42c2755ec0f983870e62', '5f48a343cb75195cd646', '012df4a72af52b038483'}
Number of datasets: 106
Successfully downloaded 104 resource files.
Missing dataset resources: {'fe338b5b2f3f6b0d11a4', '33b1ce039c67a6658644'}

@tong-zeng

tong-zeng commented Oct 7, 2019

I think the reason is that Wiley inserts the src link dynamically via client-side JavaScript.

But the code requests.get(uri) only fetches the static HTML, without the JavaScript being executed, which is why soup.find('embed') returns None.
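To illustrate, here is a minimal sketch of that failure mode (the actual parsing logic in download_corpus_resources.py may differ):

```python
# Minimal sketch of the scraping approach described in this thread;
# variable names are assumptions, not the actual script's code.
import requests
from bs4 import BeautifulSoup

uri = "https://aasldpubs.onlinelibrary.wiley.com/doi/pdf/10.1002/hep.23220"
html = requests.get(uri).text              # static HTML only; no JS executes
soup = BeautifulSoup(html, "html.parser")

embed = soup.find("embed")                 # Wiley injects this tag via JS,
print(embed)                               # so this prints None here
```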

@philipskokoh
Contributor

@HaritzPuerto and @tong-zeng: it seems that onlinelibrary.wiley.com changed the format of their HTML pages. Let me look into the new format.
For the dataset resources, it's true that the script can't download some of them.

@philipskokoh
Contributor

Yes, @tong-zeng is correct. They now inject the src link using JavaScript instead of placing it inside the <embed> tag.

@philipskokoh
Contributor

@HaritzPuerto, @tong-zeng: I found that onlinelibrary.wiley.com serves the PDF file via doi/pdfdirect/.... Hopefully that link is static across different clients. Could you try my patched code in the forked repo:
https://github.com/philipskokoh/rclc
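For reference, the fix presumably amounts to a URL rewrite along these lines (a sketch only; the User-Agent header is my assumption, not necessarily part of the actual patch):

```python
# Hedged sketch: point Wiley "doi/pdf/..." links at the "doi/pdfdirect/..."
# endpoint mentioned above, which serves the PDF bytes directly.
import requests

def to_pdfdirect(url: str) -> str:
    return url.replace("/doi/pdf/", "/doi/pdfdirect/")

uri = "https://aasldpubs.onlinelibrary.wiley.com/doi/pdf/10.1002/hep.23220"
resp = requests.get(to_pdfdirect(uri), headers={"User-Agent": "Mozilla/5.0"})
if resp.ok and resp.headers.get("Content-Type", "").startswith("application/pdf"):
    with open("012df4a72af52b038483.pdf", "wb") as f:
        f.write(resp.content)
```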

For 'dca54974ff51a5f7f8ab', the open-access copy comes from www.sciencedirect.com, and it seems that my way of downloading it gets rejected by ScienceDirect. I am afraid we need to download this particular resource manually.
If you have a better idea for collecting this open-access copy, feel free to suggest it.

@tong-zeng

@philipskokoh Thank you. I agree with you: for those resources that are difficult to download, we can just download them manually if there are not too many; otherwise, would you consider removing them from the resources list?

@philipskokoh
Contributor

> @philipskokoh Thank you. I agree with you: for those resources that are difficult to download, we can just download them manually if there are not too many; otherwise, would you consider removing them from the resources list?

The corpus is growing, new publications will be added, and I do not know what error responses downloading them might produce. I'll update the script accordingly as publications are added.
I prefer to try downloading all resources and to report all failed downloads; that makes it easy for users to follow up on the failures.

@HaritzPuerto
Collaborator Author

I guess for now we can just manually download these particular publications. It is not a big issue. But as Philip said, new publications will be added, and I guess (and hope XD) most of them won't have this problem.

@ceteri
Contributor

ceteri commented Oct 9, 2019

Thank you all for tracking this problem with publication PDFs!

Looking at those publication URLs, the problems seem to be with both Wiley and Elsevier, which, for example, use JavaScript (for session tokens?) on their PDF downloads. That will prevent the use of libraries such as requests, although we could eventually use selenium, or, longer-term, perhaps a service such as diffbot.
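A headless-browser fallback might look something like this (a sketch only, assuming Selenium with a headless Chrome driver):

```python
# Hedged sketch of the selenium fallback mentioned above: drive a headless
# browser so the publisher's JavaScript executes before we read the page.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

driver.get("https://aasldpubs.onlinelibrary.wiley.com/doi/pdf/10.1002/hep.23220")
embed = driver.find_element(By.TAG_NAME, "embed")  # present once JS has run
print(embed.get_attribute("src"))
driver.quit()
```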

For now, how about this -- as new publications get added to the corpus, we can:

  • avoid using those sources (Wiley, Elsevier, SSRN) for open access PDFs
  • run the download script prior to each corpus version release, and include the output in the release notes

NYU is still working to get a public S3 bucket for us to use with the competition. I may just create one for now, then transfer ownership to the NYU account when they have permissions worked out. In any case, if we had the PDFs in a shareable storage bucket, this would not be an issue.

@ceteri
Contributor

ceteri commented Oct 9, 2019

The dataset resources will be more difficult to resolve. We're still trying to identify consistent URLs for each dataset.

How about, if a dataset is missing a public URL, that could be considered a warning instead of an error?

@philipskokoh
Contributor

> The dataset resources will be more difficult to resolve. We're still trying to identify consistent URLs for each dataset.
>
> How about, if a dataset is missing a public URL, that could be considered a warning instead of an error?

What counts as missing a public URL? Are we considering Wiley, Elsevier, and SSRN to be non-public? I can skip these domains (and print a warning message) in the download script, as in the sketch below.
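A hedged sketch of that skip-and-warn behavior; the domain list simply mirrors the publishers named in this thread:

```python
# Treat known JS-gated publishers as warnings rather than errors.
from urllib.parse import urlparse
import warnings

SKIP_DOMAINS = ("wiley.com", "sciencedirect.com", "ssrn.com")

def should_skip(url: str) -> bool:
    host = urlparse(url).netloc
    return any(host == d or host.endswith("." + d) for d in SKIP_DOMAINS)

url = "https://aasldpubs.onlinelibrary.wiley.com/doi/pdf/10.1002/hep.23220"
if should_skip(url):
    warnings.warn(f"skipping publisher that blocks scripted downloads: {url}")
```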

Selenium and diffbot are viable solutions if we have a large number of sources from Wiley, Elsevier, and SSRN.

ceteri added a commit to Coleridge-Initiative/RCHuman that referenced this issue Nov 29, 2019
ceteri added a commit that referenced this issue Dec 24, 2019
@ceteri
Contributor

ceteri commented Jan 3, 2020

We're getting closer. This still needs work to download from specific sites more effectively. See the error log in https://github.com/Coleridge-Initiative/rclc/blob/master/errors.txt

Some of those errors will be handled by manual overrides in RCHuman.

@ceteri
Contributor

ceteri commented Jan 13, 2020

Will assign among our NYU-CI team:

  • Troubleshoot the PDF download process, based on the observed errors
