
kgo: avoid rare panic #603

Merged
merged 1 commit into master from rp-13791 on Oct 22, 2023

Conversation

twmb (Owner) commented Oct 21, 2023

Scenario is:

  • Metadata update is actively running and has stopped an active session, returning all topicPartitions that were actively in list/epoch. These list/epoch loads are stored in reloadOffsets. Metadata grabs the session change mutex.
  • Client.Close is now called and stores true into the client.consumer.kill atomic. Close blocks briefly because it calls assignPartitions, which tries to grab the lock to stop the session. Close is now paused -- importantly, though, the consumer.kill atomic is already set to true.
  • Metadata tries to start a new session. startNewSession returns noConsumerSession because consumer.kill is now true.
  • Metadata calls reloadOffsets.loadWithSession, which panics once the session tries to access the client variable c. A minimal sketch of this sequence follows the list.
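
The sketch below reproduces the panic in miniature. The names (consumer, consumerSession, noConsumerSession, startNewSession, listOrEpoch) come from the description above, but the struct shapes and the nil-client detail are illustrative assumptions, not franz-go's actual code:

```go
package main

import "sync/atomic"

type client struct{ name string }

type consumerSession struct {
	c *client // nil in the sentinel below, which is what panics
}

// Sentinel "no session" value, returned once the consumer is killed.
var noConsumerSession = &consumerSession{}

type consumer struct {
	kill atomic.Bool
	cl   *client
}

func (c *consumer) startNewSession() *consumerSession {
	if c.kill.Load() { // Close has already flagged shutdown
		return noConsumerSession
	}
	return &consumerSession{c: c.cl}
}

// Stand-in for the reloadOffsets.loadWithSession -> listOrEpoch path:
// it uses the session's client without checking for the sentinel.
func (s *consumerSession) listOrEpoch() {
	_ = s.c.name // nil pointer dereference when s == noConsumerSession
}

func main() {
	c := &consumer{cl: &client{name: "kgo"}}
	c.kill.Store(true)       // Client.Close sets the kill atomic
	s := c.startNewSession() // metadata gets noConsumerSession back
	s.listOrEpoch()          // panic: nil pointer dereference
}
```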

This panic can only happen if all of the following are true:

  • Client.Close is being called
  • Metadata is updating
  • Metadata response is moving a partition from one broker to another
  • The timing is perfect

The fix is to check in listOrEpoch whether the consumerSession is noConsumerSession and, if so, return early.
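
Continuing the sketch above, the guard is just an early return before the session's client is touched (again an illustrative shape only, not the literal patch):

```go
// Fixed stand-in for listOrEpoch: bail out when handed the sentinel,
// so the nil client is never dereferenced during shutdown.
func (s *consumerSession) listOrEpoch() {
	if s == noConsumerSession {
		return // consumer is being killed; nothing to load
	}
	_ = s.c.name // safe: only reached with a real session
}
```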

Note that doOnMetadataUpdate, incWorker, and decWorker already check for noConsumerSession. The other methods do not need the check:

  • mapLoadsToBrokers is called in listOrEpoch on a valid session
  • handleListOrEpochResults is likewise only called on a valid session
  • desireFetch is only called in source after noConsumerSession is checked, and manageFetchConcurrency is called only from desireFetch

Closes redpanda-data/redpanda#13791.

@twmb twmb merged commit 067ec8e into master Oct 22, 2023
6 checks passed
@twmb twmb deleted the rp-13791 branch October 22, 2023 03:02