-
-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
"warcio check" does not warn of illegal characters in field names or values, including LF #158
Comments
Hi @acidus99, just my 2 cents regarding this issue and the couple other you made recently about warcio check (#157 & #156), I would strongly suggest to not rely on "warcio check", warcio itself is not spec compliant and therefore using it to verify the validity of WARCs is a bad idea. I advised on #74 that the command should be removed until the issues mentioned belowed are fixed, but I am starting to loose hope that it will ever be fixed. That being said, of course you're free to do the experimentations and work you want, it's just my personal opinion based on observations by various members of the archiving community. If you want to learn more about why warcio is not spec compliant, there are multiple issues opened on the subject: Have a nice week-end! |
Hi @acidus99 thanks for reporting this - we will try to get to this when we can! We are working on a lot of different tools and will try to address this when we can. @CorentinB As you know, this is an open source project, and we do our best to fix things when we can, and we are not being paid for this work and have to balance priorities as best we can. Not everyone has the resources to help maintain open source tools, but since you describe yourself as 'digital archivist and open source enthusiast. Software Engineer @ the Internet Archive' - would you like to contribute a PR to fix some of these issues, in the spirit of open source? Or, perhaps you would like to provide financial support for fixing particular issues, we do have OpenCollective and GitHub Sponsorships. Happy to discuss more, if you'd like to be involved in a more constructive way. The 'warcio check' feature was contributed by @wumpus, who is also an unpaid contributor, and perhaps this was outside the original scope of the tool, which was creating and writing WARC files. We are all trying our best here. |
'warcio test' is intended to check the whole standard. It's an unfinished PR. Please give it a look. I would appreciate it if people would stop claiming that their opinion is the only one regarding the standard. #74 suffers from this problem. Both @ikreymer and I have expressed our opinions about a way forward for this problem. @acidus99 if you'd like to refile this bug, there are 2 possibilities: One bug would be if warcio is capable of creating such an invalid warc. The second would be for warcio to notice and complain about this problem. And if you think the documentation for |
(Closing this because warcio check only checks digests, not anything else -- @acidus99 please open another bug for these other scenarios.) |
Hi @ikreymer & @wumpus, open source is not only code, it's also discussion, sharing views, filling issues, reporting bugs, that kind of things. Open source is also about making people aware of the various issues they might encounter when using software X or Y. I have no interest in contributing code to warcio (for now), because I have been able to see in previous discussions that you have apparently no interest in following the WARC standard. Although, if your decision on that change in the future, then of course I would be more than happy to spend time contributing if and when needed. Regarding the various comments you made: "I would appreciate it if people would stop claiming that their opinion is the only one regarding the standard." "warcio check was only intended to check digests." "Closing this because warcio check only checks digests, not anything else" no it doesn't, and I would appreciate if you would all stop claiming that it does and remove that feature. That being said, I understand your frustration from me coming back to those issues every few months, so I won't talk about it anymore, and I'll cross my fingers that you will seriously consider fixing warcio one day or another. |
@ikreymer @wumpus My apologies for not understanding intended the scope of
warcio doesn't seem to create non-compliant WARCs like those in this or #157. No need for a bug here.
This seems reasonable, but both this and #157 sound like they would be more appropriate as part of @wumpus's More broadly @ikreymer I can certainly sympathize with running an open source project where everyone has opinions but orders of magnitude fewer are able or willing to contribute. Your work has been immensely helpful. while I'm not a python developer, I have contributed a one-off donation of $100. Thank you for your hard work. |
@acidus99 I'm happy to try to move my 2019 I spent a lot of time writing it, and it's a shame that it got derailed by an astronomy contract job before it was done. |
That would be amazing. I am not a python developer, or I would offer to help you write the code. That said, I'm completely willing to help you document it, test it, and give feedback on any linting behavior you want to implement. |
"warcio check" does not warn if an illegal character is present in either a field name or value.
See the attached WARC. (To keep things simple, this is just a gzipped file, not a per-record gzipped WARC)
illegal-chars.warc.gz
Running
warcio check illegal-chars.warc
returns no warnings or errors. However if you look at the WARC in a hex editor, you will see a custom field which contains the BEL characters (0x07) in both the field name and field value:This problem actually becomes worse if the illegal character is a LF
Per the WARC spec, a LF is not allowed in the field of a record. However, as discussed in #157, warcio treats a LF as a CRLF. This means if a field name or value contains a LFLF, warcio treats this as the separator between the header and block, and immediately starts reading N bytes as the block bytes, where N is the value of the
Content-Length
field. Since this was not actually the separator between the header and the block, some of the block is left over, and warcio returns the follow error:You can see this behavior with this WARC (again, just a plain gzipped file)
lf-in-field.warc.gz
This error message is confusing because there is nothing wrong with the Content-Length. The problem is instead with illegal characters in the field! If warcio reported illegal characters in the field name, it would be much more clear.
I found this issue while troubleshooting a WARC creation library that was using custom fields on WARC records. It was outputting error messages into the custom field, and included a LFLF from a stack trace. warcio read this LFLF as the header/block separator. Once I figured out what was happening, I checked warcio with other illegal characters like 0x07, and found no illegal characters were flagged.
Expected behavior: warcio check should tell you if the field names or values contain illegal characters
The text was updated successfully, but these errors were encountered: