-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rename files when sending to s3 and match correctly when updating cocina #1449
Conversation
3df5ff2
to
a607a21
Compare
def stem(path) | ||
File.basename(path, '.*').to_s | ||
File.basename(path, File.extname(path)) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same output, just a bit more canonical using all ruby methods
@andrewjbtw to retest with https://argo-qa.stanford.edu/view/druid:xx631hs6770 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a minor suggestion for one of the comments, otherwise looks good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
slight worries about the corner case behavior and readability of extracted_filename_with_extension
. but given the unit test coverage here, the integration test coverage in general, and the ability to recover from preservation if we notice issues, i wouldn't hold this up over those concerns.
def extracted_filename_with_extension(path) | ||
basename_with_extension_stem = stem(path) | ||
extension_from_filename = basename_with_extension_stem.include?('_') ? basename_with_extension_stem.split('_').last : '' | ||
basename_without_extension_stem = basename_with_extension_stem.gsub("_#{extension_from_filename}", '') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm having trouble coming up with a concrete example right now, so maybe the concern is unfounded, but it feels like this might run into issues on transcription output files that have underscores unrelated to our disambiguation, maybe leading to erroneous fall through to this part of the matching code?
i also think that if extension_from_filename
is empty string, this will sub all underscores with empty string, which i worry could also lead to other unexpected collision behavior, though i'd expect that to be very rare, if it is actually possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your concerns are noted, and I added a new spec file to verify some behaviors, and made a small tweak as a result which better handles the behavior of a file with no underscores in it. I also made some additional clarifications into the code comments.
I think there is little risk here because (1) this code is only used to add files that come back from OCR or speech to text, so will be limited to those outputs (2) we still prefer an exact filename match (e.g. "some_file_with_underscores.txt" prefers to match with "some_file_with_underscores.tiff" anyway, so a secondary check only happens if it doesn't) and (3) even if the secondary check were run (which would yield "some_file_with.underscores"), that wouldn't throw an error, and would be very unlikely to match something in the existing object, thus becoming a no-op.
basename_with_extension_stem = stem(path) | ||
extension_from_filename = basename_with_extension_stem.include?('_') ? basename_with_extension_stem.split('_').last : '' | ||
basename_without_extension_stem = basename_with_extension_stem.gsub("_#{extension_from_filename}", '') | ||
"#{basename_without_extension_stem}.#{extension_from_filename}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and then also, if extension_from_filename
is empty string, i think this will actually return a filename with an erroneous trailing .
. e.g., wouldn't this method turn file.mp3
into file.
and foo_bar
into foo.bar
? but maybe that doesn't matter because it's only ever getting fed .vtt
files?
seems unlikely that this matters in practice for the comparison that's using this method now. but it does feel like it makes the method comment slightly inaccurate in a way that makes it harder to reason about possible changes/refactorings in the future.
i wonder if it'd be worth a comment or guard clause indicating that input is expected to be something like .vtt
?
db69224
to
819d143
Compare
Why was this change made? 🤔
Fixes #1443 - deal with scenario of identical filenames with only differing extensions but renaming media files when sending to whisper, and handling the matching of output files when adding to cocina.
Note that this only occurs for captioning, and not OCR...but perhaps could be extended to do this if needed.
HOLD for integration testingGo to sul-dlss/infrastructure-integration-test#812 after merging
How was this change tested? 🤨