Missing Publication and Dataset Resources #6
Comments
similar error at my side:
|
I think the reason is that Wiley inserts the src link dynamically via client-side JavaScript. But the code |
@HaritzPuerto and @tong-zeng: It seems that onlinelibrary.wiley.com has changed the format of its HTML files. Let me look at the new update. |
Yes, @tong-zeng is correct. They now inject the src link using JavaScript instead of inside |
@HaritzPuerto , @tong-zeng : I find that For |
@philipskokoh Thank you. I agree with you: for resources that are difficult to download, we can just download them manually if there are not too many. Otherwise, would you consider removing them from the resources list? |
The corpus is growing and new publications will be added, and I do not know what error responses may occur while downloading them. I'll update the script accordingly as publications are added. |
I guess for now we can just download these particular publications manually. It is not a big issue. But as Philips said, new publications will be added. I guess (and hope XD) that most of them won't be a problem, as is the case right now. |
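One failure mode discussed in this thread is a publisher returning an HTML page where a PDF was expected. A minimal sketch of a check that could run before saving a download is shown here. The helper names `looks_like_pdf` and `classify_download` are hypothetical and are not part of download_corpus_resources.py; the check relies only on the fact that PDF files begin with the `%PDF-` marker.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF header marker."""
    # Tolerate leading ASCII whitespace before the header.
    return data.lstrip().startswith(b"%PDF-")


def classify_download(data: bytes) -> str:
    """Roughly classify a downloaded payload as 'pdf', 'html', or 'unknown'."""
    if looks_like_pdf(data):
        return "pdf"
    # Only inspect the first bytes; an HTML landing page announces itself early.
    head = data[:1024].lower()
    if b"<html" in head or b"<!doctype" in head:
        return "html"
    return "unknown"
```

A payload classified as `html` would then be a candidate for the manual download list rather than being written to disk as a `.pdf` file.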
Thank you all for tracking this problem with publication PDFs! Looking at those publication URLs, the problems seem to be with both Wiley and Elsevier, for example using JavaScript (for session tokens?) on their PDF downloads. That will prevent use of libraries such as For now, how about this -- as new publications get added to the corpus, we can:
NYU is still working to get a public S3 bucket for us to use with the competition. I may just create one for now, then transfer ownership to the NYU account when they have permissions worked out. In any case, if we had the PDFs in a shareable storage bucket this would be no issue. |
The dataset resources will be more difficult to resolve. We're still trying to identify consistent URLs for each dataset. How about, if a dataset is missing a public URL, that could be considered a warning instead of an error? |
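The warning-versus-error idea could look roughly like the sketch below. It assumes each dataset is a dictionary with `id` and `url` keys; those field names and the `check_dataset` helper are illustrative assumptions, not the corpus schema or an existing function.

```python
import warnings


def check_dataset(dataset: dict) -> list:
    """Return hard errors for a dataset entry; a missing URL only warns."""
    errors = []
    if not dataset.get("id"):
        # A missing identifier is still treated as an error.
        errors.append("dataset has no id")
    if not dataset.get("url"):
        # A missing public URL is reported but does not fail the run.
        warnings.warn(f"dataset {dataset.get('id')!r} has no public URL")
    return errors
```

With this split, a run fails only when the returned list is non-empty, while datasets without a public URL still show up in the log.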
What counts as missing a public URL? Are we considering Wiley, Elsevier, and SSRN to be non-public URLs? I can skip these domains (and print a warning message) in the download script. Selenium and Diffbot are viable solutions if we have a large number of sources from Wiley, Elsevier, and SSRN. |
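Skipping those domains with a warning might be sketched as follows. The domain list in `SKIP_DOMAINS` and the helper names are assumptions for illustration; the actual hosts that need skipping would have to come from the observed download errors.

```python
import warnings
from urllib.parse import urlparse

# Assumed list of publisher domains to skip; adjust to the observed failures.
SKIP_DOMAINS = ("wiley.com", "elsevier.com", "ssrn.com")


def should_skip(url: str) -> bool:
    """Return True if the URL's host is a skipped domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in SKIP_DOMAINS)


def filter_downloadable(urls):
    """Return the URLs to download, warning about each one that is skipped."""
    keep = []
    for url in urls:
        if should_skip(url):
            warnings.warn(f"skipping {url}: download manually")
        else:
            keep.append(url)
    return keep
```

Matching on the host suffix rather than a substring keeps subdomains such as aasldpubs.onlinelibrary.wiley.com covered without also matching unrelated hosts that merely contain the publisher name.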
We're getting closer. This still needs work to download from specific sites more effectively. See the error log in https://github.com/Coleridge-Initiative/rclc/blob/master/errors.txt Some of those errors will be handled by manual override in |
Will assign among our NYU-CI team: Troubleshoot the PDF download process, based on the observed errors |
Hi,
I executed
python corpus.py corpus.ttl
and then
python download_corpus_resources.py
to download the corpus, but I got this output. Is this the expected output? It looks like some publications cannot be downloaded. I checked the publication with id "012df4a72af52b038483", and the link does not appear to be broken. Here is the link I got from corpus.jsonld:
https://aasldpubs.onlinelibrary.wiley.com/doi/pdf/10.1002/hep.23220
@ceteri @philipskokoh Do you know why this happens?
Thanks