
Segmenter does not permit skipping long tokens #11

Open
atombender opened this issue Mar 19, 2021 · 2 comments

Comments

@atombender

If the stream encounters a token that exceeds maxTokenSize, it appears the caller has no way of truncating or skipping the token: The error is final. There's no way to alter maxTokenSize, either. As far as I can see, the only solution is to not use Segmenter at all, but to write new streaming logic from scratch.
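
For context, a minimal sketch of how the limit surfaces today (assuming the package's `NewWordSegmenter`/`Segment`/`Err` API; the exact error value isn't important here):

```go
package main

import (
	"fmt"
	"strings"

	"github.com/blevesearch/segment"
)

func main() {
	// One "word" far larger than the segmenter's internal max token size.
	huge := strings.Repeat("x", 1<<21)

	seg := segment.NewWordSegmenter(strings.NewReader(huge))
	for seg.Segment() {
		fmt.Println(seg.Text())
	}
	// Once the oversized token is hit, Segment() returns false and Err()
	// reports the failure; there is no way to raise the limit, truncate
	// the token, or skip past it and continue.
	if err := seg.Err(); err != nil {
		fmt.Println("segmentation aborted:", err)
	}
}
```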

@mschoch
Contributor

mschoch commented Mar 19, 2021

I see the following options (though I'm not super familiar with the code at this point):

  1. Move this method from export_test.go into a non-test file, allowing anyone to change the max token size themselves, just as the tests do today (it should probably be renamed to SetMaxTokenSize; see the sketch after this list)

    func (s *Segmenter) MaxTokenSize(n int) {

  2. Add a new option (defaulting to false for backwards compatibility) to truncate long tokens. This seems like the simplest approach: the places in the code that currently return the error would instead emit a token containing whatever had been read up to that point, processing would continue as normal, and the remaining text of the long token would become the start of a new token.

  3. Same as 2, but add code to eat the rest of the token after the point of truncation (harder, because now we're changing the actual logic).

  4. Similar to 2 and 3, but drop the contents of long tokens completely (also a change to the actual logic). This would probably be exposed as a different option, or, if we support multiple options, as an enumeration of possible ways to handle long tokens.
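
As a rough sketch of what option 1 could look like once promoted out of export_test.go (the field names `maxTokenSize` and `buf` are assumptions based on the bufio.Scanner-style internals, not the actual code):

```go
// SetMaxTokenSize lets callers raise (or lower) the limit before the
// first call to Segment(). Field names here are assumptions; the real
// internals may differ.
func (s *Segmenter) SetMaxTokenSize(n int) {
	if n <= 0 {
		panic("segment: max token size must be positive")
	}
	if n < len(s.buf) {
		s.buf = make([]byte, n)
	}
	s.maxTokenSize = n
}
```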

Do you have a sense of which of these best supports your requirements?

Also, pinging @abhinavdangeti and @sreekanth-cb

@atombender
Author

I didn't see a good solution to this, so I ended up implementing my own wrapper around SegmentWords() that has an unlimited buffer size; this was easier than I thought it would be, and I now see that Segmenter is really just a helper to work with readers.

The reason this is safe in my program is that we already have the data in memory, so an unlimited token size doesn't add much to memory usage, though I do still truncate tokens to a maximum length.
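
Something like the following, assuming the `SegmentWords(data, atEOF)` split-function signature; the names and the truncation policy are just illustrative:

```go
// segmentAll tokenizes an in-memory buffer by calling SegmentWords
// directly, so no internal buffer limit applies. Tokens longer than
// maxLen are truncated (note: a naive byte slice may cut a multi-byte
// rune; good enough for a sketch).
func segmentAll(data []byte, maxLen int) ([]string, error) {
	var tokens []string
	for len(data) > 0 {
		advance, token, _, err := segment.SegmentWords(data, true)
		if err != nil {
			return tokens, err
		}
		if advance == 0 {
			break
		}
		if len(token) > maxLen {
			token = token[:maxLen]
		}
		tokens = append(tokens, string(token))
		data = data[advance:]
	}
	return tokens, nil
}
```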

I still think it would be a good idea to make Segmenter able to skip or truncate long tokens, though. So either option 2 or 3.
