Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rename files when sending to s3 and match correctly when updating cocina #1449

Merged
merged 4 commits into from
Jan 15, 2025

Conversation

peetucket
Copy link
Member

@peetucket peetucket commented Jan 7, 2025

Why was this change made? 🤔

Fixes #1443 - deal with scenario of identical filenames with only differing extensions but renaming media files when sending to whisper, and handling the matching of output files when adding to cocina.

Note that this only occurs for captioning, and not OCR...but perhaps could be extended to do this if needed.

HOLD for integration testing

Go to sul-dlss/infrastructure-integration-test#812 after merging

How was this change tested? 🤨

@peetucket peetucket force-pushed the 1443-filename-matching-change branch from 3df5ff2 to a607a21 Compare January 7, 2025 21:11
def stem(path)
File.basename(path, '.*').to_s
File.basename(path, File.extname(path))
end
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same output, just a bit more canonical using all ruby methods

@peetucket peetucket changed the title WIP rename files when sending to s3 and match correctly when updating cocina [WIP] rename files when sending to s3 and match correctly when updating cocina Jan 7, 2025
@peetucket peetucket changed the title [WIP] rename files when sending to s3 and match correctly when updating cocina rename files when sending to s3 and match correctly when updating cocina Jan 8, 2025
@peetucket
Copy link
Member Author

@peetucket peetucket changed the title rename files when sending to s3 and match correctly when updating cocina [HOLD] rename files when sending to s3 and match correctly when updating cocina Jan 8, 2025
@peetucket peetucket changed the title [HOLD] rename files when sending to s3 and match correctly when updating cocina rename files when sending to s3 and match correctly when updating cocina Jan 8, 2025
Copy link
Contributor

@edsu edsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a minor suggestion for one of the comments, otherwise looks good!

lib/dor/text_extraction/cocina_updater.rb Show resolved Hide resolved
Copy link
Member

@jmartin-sul jmartin-sul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slight worries about the corner case behavior and readability of extracted_filename_with_extension. but given the unit test coverage here, the integration test coverage in general, and the ability to recover from preservation if we notice issues, i wouldn't hold this up over those concerns.

def extracted_filename_with_extension(path)
basename_with_extension_stem = stem(path)
extension_from_filename = basename_with_extension_stem.include?('_') ? basename_with_extension_stem.split('_').last : ''
basename_without_extension_stem = basename_with_extension_stem.gsub("_#{extension_from_filename}", '')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm having trouble coming up with a concrete example right now, so maybe the concern is unfounded, but it feels like this might run into issues on transcription output files that have underscores unrelated to our disambiguation, maybe leading to erroneous fall through to this part of the matching code?

i also think that if extension_from_filename is empty string, this will sub all underscores with empty string, which i worry could also lead to other unexpected collision behavior, though i'd expect that to be very rare, if it is actually possible.

Copy link
Member Author

@peetucket peetucket Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your concerns are noted, and I added a new spec file to verify some behaviors, and made a small tweak as a result which better handles the behavior of a file with no underscores in it. I also made some additional clarifications into the code comments.

I think there is little risk here because (1) this code is only used to add files that come back from OCR or speech to text, so will be limited to those outputs (2) we still prefer an exact filename match (e.g. "some_file_with_underscores.txt" prefers to match with "some_file_with_underscores.tiff" anyway, so a secondary check only happens if it doesn't) and (3) even if the secondary check were run (which would yield "some_file_with.underscores"), that wouldn't throw an error, and would be very unlikely to match something in the existing object, thus becoming a no-op.

basename_with_extension_stem = stem(path)
extension_from_filename = basename_with_extension_stem.include?('_') ? basename_with_extension_stem.split('_').last : ''
basename_without_extension_stem = basename_with_extension_stem.gsub("_#{extension_from_filename}", '')
"#{basename_without_extension_stem}.#{extension_from_filename}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and then also, if extension_from_filename is empty string, i think this will actually return a filename with an erroneous trailing .. e.g., wouldn't this method turn file.mp3 into file. and foo_bar into foo.bar? but maybe that doesn't matter because it's only ever getting fed .vtt files?

seems unlikely that this matters in practice for the comparison that's using this method now. but it does feel like it makes the method comment slightly inaccurate in a way that makes it harder to reason about possible changes/refactorings in the future.

i wonder if it'd be worth a comment or guard clause indicating that input is expected to be something like .vtt?

@peetucket peetucket force-pushed the 1443-filename-matching-change branch from db69224 to 819d143 Compare January 15, 2025 00:43
@peetucket peetucket requested a review from edsu January 15, 2025 00:45
@peetucket peetucket merged commit 6a12109 into main Jan 15, 2025
4 of 5 checks passed
@peetucket peetucket deleted the 1443-filename-matching-change branch January 15, 2025 19:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Transcription of audio files unexpectedly skipped for test item
3 participants