Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix accuracy of sourcepos for inline elements #452

Closed
digitalmoksha opened this issue Aug 10, 2024 · 10 comments
Closed

Fix accuracy of sourcepos for inline elements #452

digitalmoksha opened this issue Aug 10, 2024 · 10 comments

Comments

@digitalmoksha
Copy link
Collaborator

This issue is for discussing how we might fix the accuracy of sourcepos for inline elements.

Simple recreation

A
 B

This should generate an AST that looks like this

<document sourcepos="1:1-2:2" xmlns="http://commonmark.org/xml/1.0">
  <paragraph sourcepos="1:1-2:2">
    <text sourcepos="1:1-1:1" xml:space="preserve">A</text>
    <softbreak sourcepos="1:2-1:2" />
    <text sourcepos="2:2-2:2" xml:space="preserve">B</text>
  </paragraph>
</document>

Instead, the last text has sourcepos="2:1-2:1"

Related issue Unexpected sourcepos and PR Inline sourcepos fixes which moved outputting of inline sourcepos into an option, experimental_inline_sourcepos

Two other examples captured in tests are

comrak/src/tests/core.rs

Lines 580 to 603 in 9f4d391

// Ignored per https://github.com/kivikakk/comrak/pull/439#issuecomment-2225129960.
#[ignore]
#[test]
fn link_sourcepos_truffle_bergamot() {
assert_ast_match!(
[],
"- A\n [![B](/B.png)](/B)\n",
(document (1:1-2:21) [
(list (1:1-2:21) [
(item (1:1-2:21) [
(paragraph (1:3-2:21) [
(text (1:3-1:3) "A")
(softbreak (1:4-1:4))
(link (2:4-2:21) [
(image (2:5-2:16) [
(text (2:7-2:7) "B")
])
])
])
])
])
])
);
}
and

comrak/src/tests/core.rs

Lines 532 to 555 in 9f4d391

// Ignored per https://github.com/kivikakk/comrak/pull/439#issuecomment-2225129960.
#[ignore]
#[test]
fn link_sourcepos_truffle() {
assert_ast_match!(
[],
"- A\n[![B](/B.png)](/B)\n",
(document (1:1-2:18) [
(list (1:1-2:18) [
(item (1:1-2:18) [
(paragraph (1:3-2:18) [
(text (1:3-1:3) "A")
(softbreak (1:4-1:4))
(link (2:1-2:18) [
(image (2:2-2:13) [
(text (2:4-2:4) "B")
])
])
])
])
])
])
);
}

@digitalmoksha
Copy link
Collaborator Author

Including the comment from @kivikakk in the PR, as a starting point of discussion:


How awkward. In all these cases:

- A
  B
- A
B
- A
   B

the inline parser is initialised with the subject "A\nB" and a block_offset of 2, and doesn't have any distinguishing information. This is a consequence of cmark/Comrak's particular way of collecting block content before handing it off to the inline parser — it normalises things such that the inline parser doesn't need to worry about the minutae of what falls into a block and what doesn't, but we also lose information which we cannot recover.

This will require something of a rearchitecting — sourcepos info needs to be associated and carried through from much earlier in the block processing stage, with the possibility of multiple spans per content block[1] — and I have no real inclination to do such work at present.

In light of all this, for now I'm going to adjust to match upstream (#301 (comment)) and not output inline-level sourcepos in HTML in the same cases as cmark. Block-level sourcepos will be unaffected, as it's not known to have issues.


[1] In the examples above, we still need to be passing through the same inline Subject::input of "A\nB", but with sourcepos marker collections [(1:3-1:4),(2:3-2:3)], [(1:3-1:4),(2:1-2:1)] and [(1:3-1:4),(2:4-2:4)] respectively (where (1:4) aligns with the \n which is turned into a soft break). Right now we only pass a single point (1:3) and infer the rest incrementally.

This changes a lot of the bookkeeping on the block parser side, and a complete rework of sourcepos handling on the inline parser side, so I can't even say for sure that I won't hit an even bigger roadblock. Also worth noting in conjunction is this comment about autolink sourcepos unsoundness.

@digitalmoksha
Copy link
Collaborator Author

I've started researching this a little, trying to get familiar again with how this works, and play around with a few things.

When I take simple text, such as

  A

I get

    <text sourcepos="1:3-1:3" xml:space="preserve">A</text>

So it doesn't count any leading spaces. So it seems like we would want the same behavior for a following line in the same paragraph. Meaning

  A
   B

should yield

    <text sourcepos="1:3-1:3" xml:space="preserve">A</text>
    <softbreak sourcepos="1:4-1:4" />
    <text sourcepos="2:4-2:4" xml:space="preserve">A</text>

But right now it seems like we're always adding the block_offset for the second line, rather than calculating a new start position like we do for the first line. I'm wondering if we can fix that without larger changes 🤔

@digitalmoksha
Copy link
Collaborator Author

As an additional researching angle, I've been looking at the output from https://github.com/jgm/commonmark-hs.

For example

√ commonmark-hs/commonmark-cli (master= [])
> cabal run exes -- --sourcepos
Up to date
  A
   B

<p data-sourcepos="stdin@1:3-2:1;2:4-3:1"><span data-sourcepos="stdin@1:3-1:4">A</span>
<span data-sourcepos="stdin@2:4-2:5">B</span></p>

@digitalmoksha
Copy link
Collaborator Author

I know you said this

the inline parser is initialised with the subject "A\nB" and a block_offset of 2, and doesn't have any distinguishing information. This is a consequence of cmark/Comrak's particular way of collecting block content before handing it off to the inline parser — it normalises things such that the inline parser doesn't need to worry about the minutae of what falls into a block and what doesn't, but we also lose information which we cannot recover.

but it wasn't until I stepped through the code that it clicked into place. Any leading spaces have been stripped off, so we have no idea how many spaces on each line were stripped off.

@kivikakk
Copy link
Owner

Yes, indeed; the necessary bookkeeping must start at the block parser, since it's the one doing the normalising. Both parsers would need to be adjusted to communicate that information.

What's likely to be simplest would be having the block parser give the inline parser a list of tuples like (nchars, startrow, startcol) alongside the subject string, where the sum of nchars is the length of the subject string. Thus spans of text are associated with a start location, and the existing bookkeeping would handle the intermediate locations.

This would probably remove some work from the inline parser, such as the bookkeeping done in Subject::handle_newline, itself of course being the site of the current woes!

But the changes in the block parser itself to enable this would be significant, in a way that's hard to estimate without first looking carefully (and probably trying a speculative approach or two). It's not set up for this kind of thing at all — presumably why it wasn't added to cmark upstream in the first place.

@digitalmoksha
Copy link
Collaborator Author

digitalmoksha commented Aug 12, 2024

My initial thought was maybe having a vector of offsets, say line_offsets, in the AST for each line that gets added in add_line in the block parser. It looks like spaces are only trimmed from the beginning of the line. Then as we process each line in the inline parser, rather than just relying on the block_offset and the column_offset, we can add in the correct offset for the line.

@kivikakk
Copy link
Owner

If it works, it works! 😸

@digitalmoksha
Copy link
Collaborator Author

I spun up #453 to play with some ideas

@digitalmoksha
Copy link
Collaborator Author

I think we can close this with #453 merged. Will open another issue if any other problems are found.

There may come a day when we can move inline sourcepos out of experimental status, but it certainly needs more battle testing before that can happen.

@kivikakk would you have time to release a 0.27.0 version? I can certainly do it if needed, but I defer to you.

@kivikakk
Copy link
Owner

Will do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants