
Segmenter does not permit skipping long tokens #11

Open
atombender opened this issue Mar 19, 2021 · 2 comments

Comments

@atombender

If the stream encounters a token that exceeds maxTokenSize, it appears the caller has no way of truncating or skipping the token: The error is final. There's no way to alter maxTokenSize, either. As far as I can see, the only solution is to not use Segmenter at all, but to write new streaming logic from scratch.
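
For context, a minimal sketch of how the limit surfaces today (assuming the package's `NewWordSegmenter`/`Segment`/`Err` API; the exact error value isn't important here):

```go
package main

import (
	"fmt"
	"strings"

	"github.com/blevesearch/segment"
)

func main() {
	// One "word" far larger than the segmenter's internal max token size.
	huge := strings.Repeat("x", 1<<21)

	seg := segment.NewWordSegmenter(strings.NewReader(huge))
	for seg.Segment() {
		fmt.Println(seg.Text())
	}
	// Once the oversized token is hit, Segment() returns false and Err()
	// reports the failure; there is no way to raise the limit, truncate
	// the token, or skip past it and continue.
	if err := seg.Err(); err != nil {
		fmt.Println("segmentation aborted:", err)
	}
}
```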

@mschoch
Contributor

mschoch commented Mar 19, 2021

I see the following options (though I'm not super familiar with the code at this point):

  1. Move this method from export_test.go into a non-test file, allowing anyone to change the max token size themselves, just as the tests do today (it should probably be renamed to SetMaxTokenSize; see the sketch after this list)

    func (s *Segmenter) MaxTokenSize(n int) {

  2. Add a new option (defaulting to false for backwards compatibility) to truncate long tokens. This seems like the simplest approach: the places in the code that currently return the error would instead emit a token containing whatever had been read up to that point, processing would continue as normal, and the remaining text of the long token would become the start of a new token.

  3. Same as 2, but add code to eat the rest of the token after the point of truncation (harder, because now we're changing the actual logic).

  4. Similar to 2 and 3, but drop the contents of long tokens completely (also a change to the actual logic). This would probably be exposed as a different option, or, if we support multiple options, as an enumeration of possible ways to handle long tokens.
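
As a rough sketch of what option 1 could look like once promoted out of export_test.go (the field names `maxTokenSize` and `buf` are assumptions based on the bufio.Scanner-style internals, not the actual code):

```go
// SetMaxTokenSize lets callers raise (or lower) the limit before the
// first call to Segment(). Field names here are assumptions; the real
// internals may differ.
func (s *Segmenter) SetMaxTokenSize(n int) {
	if n <= 0 {
		panic("segment: max token size must be positive")
	}
	if n < len(s.buf) {
		s.buf = make([]byte, n)
	}
	s.maxTokenSize = n
}
```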

Do you have a sense of which of these best supports your requirements?

Also, pinging @abhinavdangeti and @sreekanth-cb

@atombender
Author

I didn't see a good solution to this, so I ended up implementing my own wrapper around SegmentWords() that has an unlimited buffer size; this was easier than I thought it would be, and I now see that Segmenter is really just a helper to work with readers.

The reason this is safe in my program is that we already have the data in memory, so an unlimited token size doesn't add much to memory usage, though I do still truncate tokens to a maximum length.
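
Something like the following, assuming the `SegmentWords(data, atEOF)` split-function signature; the names and the truncation policy are just illustrative:

```go
// segmentAll tokenizes an in-memory buffer by calling SegmentWords
// directly, so no internal buffer limit applies. Tokens longer than
// maxLen are truncated (note: a naive byte slice may cut a multi-byte
// rune; good enough for a sketch).
func segmentAll(data []byte, maxLen int) ([]string, error) {
	var tokens []string
	for len(data) > 0 {
		advance, token, _, err := segment.SegmentWords(data, true)
		if err != nil {
			return tokens, err
		}
		if advance == 0 {
			break
		}
		if len(token) > maxLen {
			token = token[:maxLen]
		}
		tokens = append(tokens, string(token))
		data = data[advance:]
	}
	return tokens, nil
}
```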

I still think it would be a good idea to make Segmenter able to skip or truncate long tokens, though. So either option 2 or 3.
