Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encode valid characters #1408

Merged
merged 1 commit into from
Dec 30, 2024
Merged

Conversation

70ray
Copy link
Collaborator

@70ray 70ray commented Dec 27, 2024

so escaped characters don't get changed back to plain characters. This should fix the issue with the test project projectID462699ad8f1c6 which has custom characters which are regex special characters found by @srjfoo.

Sandbox at: https://www.pgdp.org/~rp31/c.branch/regex_escape

@chrismiceli
Copy link
Collaborator

chrismiceli commented Dec 28, 2024

It seems the whole point of the json_encode is to change the unicode characters to their escaped versions. This seems like the RegExp.escape method is perfect for:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/escape
Unfortunately this isn't present in many browsers yet. This polyfil is small but has some dependencies:
https://github.com/zloirock/core-js/blob/master/packages/core-js/modules/esnext.regexp.escape.js
In our own code we already have:

function escapeRegExp(string) {

but I don't think it handles the unicode issue you're addressing here so we should fix both at the same time if possible.

@70ray
Copy link
Collaborator Author

70ray commented Dec 28, 2024

It seems the whole point of the json_encode is to change the unicode characters to their escaped versions. This seems like the RegExp.escape method is perfect for: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/escape Unfortunately this isn't present in many browsers yet. This polyfil is small but has some dependencies: https://github.com/zloirock/core-js/blob/master/packages/core-js/modules/esnext.regexp.escape.js In our own code we already have:

function escapeRegExp(string) {

but I don't think it handles the unicode issue you're addressing here so we should fix both at the same time if possible.

Thanks Chris, I found it difficult to understand what's going on here. The php function build_character_regex_filter() in unicode.inc produces a string with escaped characters like \u{003f} instead of ?. But they were getting interpreted back to plain characters when the regex was created by the javascript new Regex( ... ) function.
...
While trying to explain why this fix works I realised there was a much better way of fixing it.

@70ray 70ray changed the title Json encode valid characters Encode valid characters Dec 28, 2024
@70ray 70ray requested review from srjfoo and cpeel December 28, 2024 09:52
@jmdyck
Copy link
Contributor

jmdyck commented Dec 28, 2024

[...] a string with escaped characters like \u{003f} instead of ?. But they were getting interpreted back to plain characters when the regex was created by the javascript new Regex( ... ) function.

Actually, they get converted to 'plain characters' as soon as the enclosing string literal is evaluated (probably when the browser parses the JS code). I.e., the string value that's passed to new Regex already contains a plain ?.

Rather than protect the unicode escapes from JS string-literal evaluation, it might be simpler to avoid the latter altogether. One way would be to (in effect) replace:

    var validCharacterPattern = '$valid_character_pattern';
    var validCharRegex = new RegExp(validCharacterPattern, "u");

with

    var validCharRegex = /$valid_character_pattern/;

i.e. the unicode escapes are interpolated directly into a regexp literal, where they are treated as intended. (As far as I can tell, there's no benefit to going through validCharacterPattern.)

@70ray
Copy link
Collaborator Author

70ray commented Dec 28, 2024

[...] a string with escaped characters like \u{003f} instead of ?. But they were getting interpreted back to plain characters when the regex was created by the javascript new Regex( ... ) function.

Actually, they get converted to 'plain characters' as soon as the enclosing string literal is evaluated (probably when the browser parses the JS code). I.e., the string value that's passed to new Regex already contains a plain ?.

Rather than protect the unicode escapes from JS string-literal evaluation, it might be simpler to avoid the latter altogether. One way would be to (in effect) replace:

    var validCharacterPattern = '$valid_character_pattern';
    var validCharRegex = new RegExp(validCharacterPattern, "u");

with

    var validCharRegex = /$valid_character_pattern/;

i.e. the unicode escapes are interpolated directly into a regexp literal, where they are treated as intended. (As far as I can tell, there's no benefit to going through validCharacterPattern.)

Thanks Michael, that seems the best way to do it.

@70ray 70ray marked this pull request as draft December 28, 2024 19:01
@70ray 70ray marked this pull request as ready for review December 28, 2024 19:46
@70ray
Copy link
Collaborator Author

70ray commented Dec 28, 2024

Sandbox updated.

@70ray 70ray requested a review from jmdyck December 29, 2024 07:54
@cpeel cpeel merged commit d4d3fff into DistributedProofreaders:master Dec 30, 2024
12 checks passed
@70ray 70ray deleted the regex_escape branch December 30, 2024 21:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants