Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ingested CL 2024 Volume 4 #4296

Merged
merged 1 commit into from
Jan 6, 2025
Merged

Ingested CL 2024 Volume 4 #4296

merged 1 commit into from
Jan 6, 2025

Conversation

davidstap
Copy link
Collaborator

No description provided.

@davidstap davidstap added this to the 2025Q1 milestone Jan 2, 2025
@davidstap davidstap requested a review from mjpost January 2, 2025 08:52
@davidstap davidstap self-assigned this Jan 2, 2025
Copy link

github-actions bot commented Jan 2, 2025

Build successful. Some useful links:

This preview will be removed when the branch is merged.

@mjpost
Copy link
Member

mjpost commented Jan 2, 2025

Thanks, @davidstap. I have a request, if you have time: can you migrate the ingestion script to use the acl_anthology python module instead? I think it shouldn't be too difficult:

@davidstap
Copy link
Collaborator Author

I looked at the documentation and example, but I'm having some difficulties migrating the script.

  • I have changed the Anthology class from anthology to acl_anthology.
  • Other functions that need to be changed: Paper and Volume from anthology need to be changed to their acl_anthology counterparts defined in collections.volume and collections.paper.
  • Some helper functions need to be changed/removed: normalize from normalize_anth and make_simple_element, indent, compute_hash_from_file from anthology.utils.
  • From the documentation and example script it is clear how to query Anthology data, but I am not sure how to change/create new data.
  • @mbollmann if you have time I'd appreciate some pointers. Some insights on how to modify the main loop where data is stored in ingest_mitpress.py (lines 404-470) would especially be much appreciated.

@mbollmann
Copy link
Member

@davidstap Thanks for starting to look into this! My hope with the new library is that manual construction of XML will be entirely unnecessary, but I haven't prioritized the necessary functionality for this yet. It’s mostly there, but not really finalized or tested. I can start with this particular ingestion script though to see what changes would be needed.

@davidstap
Copy link
Collaborator Author

Thanks @mbollmann that'd be great. Let me know if I can help.

@mbollmann
Copy link
Member

Had a quick look now, and I would move the adaption of the ingestion script into a new issue/PR, as I don’t think it is that trivial. Currently, it’s fundamentally built around constructing XML nodes manually, which is not really something I’d like to support with the new library, as it defeats the whole point of encapsulating serialization logic within a library in the first place. The new library is designed from the ground up to have its classes closely mirror what’s in our data files, to make converting from/to XML easy, so I believe we should make proper use of that. I need to implement a couple more things for that to work smoothly, though.

@davidstap Can you send me the source file from MIT press that you used for this ingestion, so that I can test-run the script locally?

@davidstap
Copy link
Collaborator Author

Thanks @mbollmann, that makes sense and explains why it's not straightforward to make the adaption of the ingestion script. I've just shared the MIT press source file through e-mail.

@mjpost can you review the CL ingestion?

@mjpost mjpost merged commit c3d6d89 into master Jan 6, 2025
4 checks passed
@mjpost mjpost deleted the ingest_cl_2024_issue4 branch January 6, 2025 23:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants