
kgo: return decompression errors while consuming #883

Open: wants to merge 1 commit into master

Conversation

@twmb (Owner) commented Jan 8, 2025

Kafka can return partial batches, so decompression errors are common. If I ask for at most 100 bytes and the broker has two 60-byte batches, I will receive one valid 60-byte batch followed by a partial 40-byte batch. That second, partial batch will fail to decompress. This is why I previously never returned decompression errors.
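
As a rough illustration (the client setup and options below are from the kgo API, but the broker address, topic, and byte values are hypothetical), this is the kind of configuration under which the broker hands back a truncated tail batch:

```go
package main

import "github.com/twmb/franz-go/pkg/kgo"

func main() {
	// With a 100-byte partition fetch cap and two 60-byte batches on the
	// broker, a fetch response contains one complete batch plus a truncated
	// tail batch; the truncated batch cannot be decompressed.
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"), // hypothetical broker address
		kgo.ConsumeTopics("events"),       // hypothetical topic
		kgo.FetchMaxPartitionBytes(100),
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()
}
```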

However, if a client truly does produce somewhat-valid compressed data that some decompressors can process but others (Go's) cannot, then the first batch received could fail to decompress. The client would fail processing, return an empty batch, and retry consuming at the same spot. The client would spin-loop trying to consume, and the end user would never be aware of it.

Now, if the first error received is a decompression error, we bubble it up to the end user.
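
Continuing the sketch above (reusing cl; the context and log imports are assumed), a minimal poll loop showing where the new error surfaces:

```go
ctx := context.Background()
for {
	fetches := cl.PollFetches(ctx)
	fetches.EachError(func(topic string, partition int32, err error) {
		// Before this patch, a decompression failure on the very first batch
		// was swallowed and the client re-fetched the same offset forever;
		// now it surfaces here so the application can react.
		log.Printf("fetch error on %s/%d: %v", topic, partition, err)
	})
	fetches.EachRecord(func(r *kgo.Record) {
		_ = r // process records as usual
	})
}
```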

This is hard to test internally, so it was manually tested with hacks.

Scenario one:

  • I changed the code to ignore CRC errors, since they just got in the way
  • I ran a local kfake where the first five bytes of a RecordBatch.Records were overwritten with "aaaaa" (roughly as sketched after this list)
  • I consumed before this patch -- the client spin-looped, never progressing and never printing anything.
  • I consumed after this patch -- the client immediately received the error.
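
The corruption itself was a manual hack; conceptually it amounts to something like the following (hypothetical snippet against a decoded batch, not the actual kfake change):

```go
// Clobber the start of the compressed Records payload so that Go's
// decompressor rejects the batch outright.
copy(batch.Records[:5], "aaaaa")
```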

Scenario two:

  • Same CRC-ignoring change as above
  • I ran a local kfake where, when consuming, all batches AFTER the first had their RecordBatch.Records overwritten with "aaaaa".
  • I consumed before and after this patch -- in both cases, the client progressed to the end of the partition and no errors were printed.
  • To double-check that the decompression error was being encountered, I added a println in kgo where the decompression error is generated -- the println was always hit, confirming that decompression errors encountered after progress has been made are still ignored as before.

Closes #854.

Linked issue: Bubble up batch processing failures when fetching (#854)