CLDR-18187 Add complex segmentation to scriptMetadata.txt #4262

sffc · 2025-01-06T19:35:24Z

This PR completes the ticket.

See ticket for details. The issue was discussed in multiple CLDR Design WG meetings, but this specific solution was not.

ALLOW_MANY_COMMITS=true

macchiati

Structure looks good. The file is actually created using a tool as described on https://cldr.unicode.org/development/updating-codes/updating-script-metadata.
I'll add you to the writers on the sheet, and the PR will need to also modify GenerateScriptMetadata.java*

*When we do that, we need to add to the header of the .txt file that it is generated according to https://cldr.unicode.org/development/updating-codes/updating-script-metadata

srl295 · 2025-01-07T20:25:06Z

100% on the value of the data. Is it time to move this to an XML document though perhaps in supplemental? Could still output it as a .txt for release.

sffc · 2025-01-08T23:59:41Z

OK, I added it to the Java file and the Google Sheets.

While doing this, I realized, is this data meaningfully different from the column "LB letters"?

CC @markusicu

sffc · 2025-01-09T00:05:21Z

We could potentially add a third value to the enumeration in the LB Letters column to distinguish scripts like Thai, which need a dictionary for word and line segmentation, from Han, which needs a dictionary for only word segmentation.

markusicu · 2025-01-09T00:10:23Z

Idea: Consider changing LBLetters(Hani) to "No" but adding WBLetters and making that "Yes" for Hani.

macchiati · 2025-01-09T00:12:38Z

Good idea!

macchiati · 2025-01-09T00:16:37Z

I think Shane's idea is a bit simpler. The question is whether we know of any APIs that reflect the value as a boolean; when they read the data they would need to make a code change.

sffc · 2025-01-09T00:39:45Z

I kind-of like a new column because (1) it doesn't break users of the old column and (2) it would potentially allow for scripts that need special rules for line break but not for word break (say, line break allowed on syllable boundaries).
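To compare the two designs discussed above, here is a minimal sketch. All names (`SingleColumn`, `WORD_ONLY`, `TwoColumns`, `to_single_column`) are hypothetical and are not part of scriptMetadata.txt or of any CLDR API. The sketch assumes that a script like Thai would carry "Yes" in both columns under the two-column design.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SingleColumn(Enum):
    """Hypothetical three-valued LB Letters column (the enumeration idea)."""
    NO = "NO"
    YES = "YES"              # e.g. Thai: dictionary for word and line segmentation
    WORD_ONLY = "WORD_ONLY"  # e.g. Han: dictionary for word segmentation only


@dataclass(frozen=True)
class TwoColumns:
    """Hypothetical pair of boolean columns (LBLetters plus a new WBLetters)."""
    lb_letters: bool
    wb_letters: bool


def to_single_column(value: TwoColumns) -> Optional[SingleColumn]:
    """Map the two-column form onto the three-valued form, if possible."""
    if value.lb_letters and value.wb_letters:
        return SingleColumn.YES
    if value.wb_letters:
        return SingleColumn.WORD_ONLY
    if not value.lb_letters:
        return SingleColumn.NO
    # Special line-break rules without special word-break rules:
    # the three-valued column has no slot for this combination.
    return None


hani = TwoColumns(lb_letters=False, wb_letters=True)   # as proposed for Hani
line_only = TwoColumns(lb_letters=True, wb_letters=False)
print(to_single_column(hani))       # SingleColumn.WORD_ONLY
print(to_single_column(line_only))  # None
```

The `None` case corresponds to point (2): two boolean columns can express a script that needs special rules for line break but not for word break, while a three-valued single column cannot without adding a fourth value.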

eggrobin · 2025-01-09T00:40:27Z

# 7 - LB letters:
#		YES if the major languages using the script allow linebreaks between letters (excluding hyphenation). 
#		Derived from LB property.

How is that derivation actually done? Depending on how you interpret "between letters", the values in this file look wrong (or at the very least inconsistent) for all but one of the scripts that use the Brahmic style of line breaking (see https://www.unicode.org/reports/tr14/#BreakOpportunities).

Bali; 33; 1B05; ID; 1; LIMITED_USE; NO; NO; YES; NO; NO; NO
Batk; 33; 1BC0; ID; 1; LIMITED_USE; NO; NO; YES; NO; NO; NO
Brah; 33; 11005; IN; 1; EXCLUSION; NO; NO; YES; NO; NO; NO
Cham; 33; AA00; VN; 1; LIMITED_USE; NO; NO; YES; NO; NO; NO
Diak; 33; 1190C; MV; 1; EXCLUSION; NO; NO; YES; YES; NO; NO
Gran; 33; 11315; IN; 1; EXCLUSION; NO; NO; NO; NO; NO; NO
Gukh; 33; 1611C; NP; 1; EXCLUSION; NO; NO; YES; NO; NO; NO
Java; 33; A984; ID; 1; LIMITED_USE; NO; NO; YES; NO; NO; NO
Kawi; 33; 11F1B; ID; 1; EXCLUSION; NO; YES; YES; NO; NO; NO # That one looks correct.
Maka; 33; 11EE5; ID; 1; EXCLUSION; NO; NO; MIN; NO; NO; NO
Tutg; 33; 11392; IN; 1; EXCLUSION; NO; NO; YES; NO; NO; NO
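The comparison above can be reproduced with a short sketch that reads the quoted rows and pulls out the LB letters field. It assumes the field index in the header comment ("# 7 - LB letters") is zero-based with the script code as field 0. The function name `lb_letters_by_script` is made up for this illustration.

```python
# Rows quoted above from scriptMetadata.txt; fields are separated by ';'.
ROWS = """\
Bali; 33; 1B05; ID; 1; LIMITED_USE; NO; NO; YES; NO; NO; NO
Batk; 33; 1BC0; ID; 1; LIMITED_USE; NO; NO; YES; NO; NO; NO
Brah; 33; 11005; IN; 1; EXCLUSION; NO; NO; YES; NO; NO; NO
Cham; 33; AA00; VN; 1; LIMITED_USE; NO; NO; YES; NO; NO; NO
Diak; 33; 1190C; MV; 1; EXCLUSION; NO; NO; YES; YES; NO; NO
Gran; 33; 11315; IN; 1; EXCLUSION; NO; NO; NO; NO; NO; NO
Gukh; 33; 1611C; NP; 1; EXCLUSION; NO; NO; YES; NO; NO; NO
Java; 33; A984; ID; 1; LIMITED_USE; NO; NO; YES; NO; NO; NO
Kawi; 33; 11F1B; ID; 1; EXCLUSION; NO; YES; YES; NO; NO; NO # That one looks correct.
Maka; 33; 11EE5; ID; 1; EXCLUSION; NO; NO; MIN; NO; NO; NO
Tutg; 33; 11392; IN; 1; EXCLUSION; NO; NO; YES; NO; NO; NO
"""

LB_LETTERS = 7  # assumed zero-based, per the "# 7 - LB letters" header comment


def lb_letters_by_script(text):
    """Return {script code: LB letters value} for each data row in text."""
    result = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        result[fields[0]] = fields[LB_LETTERS]
    return result


values = lb_letters_by_script(ROWS)
print(sorted(code for code, value in values.items() if value == "YES"))
# prints ['Kawi']
```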

sffc · 2025-01-09T00:46:01Z

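Bali, Java, Hatr, and Elym have comments in the spreadsheet (https://docs.google.com/spreadsheets/d/1Y90M0Ie3MUJ6UVCRDOypOtijlMDLNNyyLk36T6iMu0o/edit?gid=0#gid=0) saying that they might be wrong.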

But, if we go by that description of the column, I would expect Thai to be "NO" because Thai should have line-breaks at word boundaries. I've seen bugs before where the break engine found breaks in the middle of words and it was wrong.

macchiati · 2025-01-09T01:00:01Z

Shane: The description of LB letters doesn't reference *word breaks* at all. It is just a question of whether you can get line breaks between two characters XY, where X and Y are letters of that script.

Robin: The spreadsheet data for that column isn't derived, and probably predates https://www.unicode.org/reports/tr14/#LB28a. Ideally the data would be maintained in the UCD, but the UTC didn't want to have script metadata when the subject was raised (ages ago). If it were, we could have invariant tests for that.

CLDR-18187 Add complex segmentation to scriptMetadata.txt

f0a56de

sffc requested a review from macchiati January 6, 2025 19:35

github-actions bot assigned sffc Jan 6, 2025

macchiati reviewed Jan 6, 2025

sffc added 2 commits January 8, 2025 15:09

Add to GenerateScriptMetadata.txt

1f44532

Add ComplexBrk to [Generate]ScriptMetadata.java

5091e78
