Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Doc: Understanding the impact of CO-events #304

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 15 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -599,9 +599,22 @@ The following example was "contrived". The `drupal_loadtest` example was run for
Aggregated | 432.98 | 294.11 | 3,390 | 14
```

From these two tables, it is clear that there was a statistically significant event affecting the load testing metrics. In particular, note that the standard deviation between the "raw" average and the "adjusted" average is considerably larger than the "raw" average, calling into questing whether or not your load test was "valid". (The answer to that question depends very much on your specific goals and load test.)
Note: It is beyond the scope of Goose to test for statistically significant changes in the right-tail, or other locations, of the distribution of response times. Goose produces the raw data you need to conduct these tests.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Goose is able to do this to some degree. Goose tracks how long it takes each GooseUser to run through all their GooseTasks, and after each pass updates the average time. If a given pass takes more than 2x the average, it issues a warning that there may have been a Coordinated Omission event.

While it did originally track a cadence for every single request, this was far too much overhead at scale, hence the simplified version that landed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Goose is able to do this to some degree.

Hmm its a nest of thorns and especially when you move away from well known/trodden paths such as testing for equality between distributions.... I don't believe I've ever seen any one address what the distribution of the maxima is for their (sampled) distribution, and then establish the observed maximum is within the range expected, or not.

Gil Tene seemed to have great trouble just with persuading people the one metric you care about is the maximum.

Best to steer clear of making these sorts of claims - in our opinion.

While it did originally track a cadence for every single request, this was far too much overhead at scale, hence the simplified version that landed.

What is in place is, in our opinion, a defensible workaround until the proper implementation is arrived at.

Copy link
Contributor Author

@bbros-dev bbros-dev Jul 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a given pass takes more than 2x the average

If you leave that in place I suggest offering a reason those two random numbers, i.e. 2 and the average value, were chosen.

After rejecting our conjecture the minimum is the correct value to use for the default 'cadence', we took the time to give an explanation for using the minimum as the default cadence in the absence of a user providing a particular value. That seems to have been ignored, which is unfortunate, but causes us no harm.

However we must note it is not obvious to us that the CI around the order statistics are symmetric - but we are guessing that is what you're trying to do. So we're struggling to work out what statistical theory you are trying to apply and why.
Unless a statistician emerges as a maintainer, we'd suggest moving Goose away from these topics is the prudent course of action.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values Goose is using were chosen through empirical experimentation and observations.

While in a fully controlled environment it would be ideal to strive for the minimum response time, it has never been my experience that this is the expectation for websites on the internet. It is my expectation that most people running load tests are doing so to observe and measure performance (frequently with the aim of further optimization).

Currently each GooseUser thread simply watches the performance for a few cycles and tracks the average cadence and uses this as its baseline. Yes, there are outliers that return quicker, and there are outliers that return quicker, but on average we tend to observe reasonable consistency. So my intent is that through empirical observation Goose determines the "normal" cadence for each GooseUser, and then alerts when some event causes things to take longer. Goose is unable to determine what caused the event, but aids the tester by making them aware of it. This is a hint to the tester to review all available tools to better understand what happened: was there a bottleneck, a cache stampede, resource starvation, etc... And does the event impact the goals of the load test (fully something for the tester to decide).

What is your use-case that you assume the minimum response time is the expected cadence?

As for the magic 2, this was borrowed from the implementation in HdrHistogram:
https://github.com/HdrHistogram/HdrHistogram_rust/blob/9c09314ac91848fd696b699892414cb337d9abce/src/lib.rs#L916

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While in a fully controlled environment it would be ideal to strive for the minimum response time, it has never been my experience that this is the expectation for websites on the internet.

This has nothing to do with response times or expectations about the behavior of the system under test. I cannot stress that enough - you're focused on the wrong end of the problem.

Perhaps I misunderstand:
Ack we are dealing with a workaround and some of this becomes moot when the CO guard lands.
Nonetheless, CO-mitigation reflects nothing about your expectations of response times. It is simply to ensure that when a CO event occurs you see that reflected in the empirical distribution. Plotting the empirical cumulative dist function, ideally, you should see the CDF start to climb with the CO mitigation in place. With the minimum it will start to climb very soon after the CO event occurs. Using a larger time interval will hide the evidence of the CO event. Experiments will show this. You'll also notice that as you run the different experiments your "expectation for websites on the internet" are unchanged. All you're observing is a signal or a signal being hidden.

So my intent is that through empirical observation Goose determines the "normal" cadence for each GooseUser, and then alerts when some event causes things to take longer

I give up. But please don't impose your preferences for this. I you insist it must be the default can it easily be turned off?

What is your use-case that you assume the minimum response time is the expected cadence?

In the workaround at hand we are talking about a device to alert us to CO events that we know with 100% certainty would, without this workaround, render the upper order statistics near meaningless. It has nothing to do with my expectations about anything, its a workaround.

When the CO fix lands I will be able to configure Goose to make requests with a time interval between them that I expect to reflect the behavior of my users. When those "users" are backend microservices I can make then behave in very specific manner. But again the interval I pass to the CO-fixed Goose will reflect nothing about my expectations about response time - it should reflect my expectations about request arrival times. Request arrival times have nothing to do with response times (in a CO-fixed system) - without the fix CO response times do impact request times - that's the defect being workaround and then fixed

No?

As an aside:

on average we tend to observe reasonable consistency

Try not to care about these statistical issues. No offence intended but I'm now reasonably confident they are not your strength, just as Rust is not my strength.
Allow for more the possibility of infinitely more diversity in the world than what you have experienced, and likewise with respect to views and expectations you expect other people to hold.

The role of Goose is to make requests and record responses in a way that isn't misleading. Nothing more.
The CO issue unfortunately impacts the distribution you observe, but it really has nothing to do with statistics.

To recognize this notice that Gil Tene's neat definition of the CO is this:

what each tester thread is actually doing as a result of waiting for a long response before sending the next one is the equivalent of asking the system "is it ok for me to test your latency right now? Because if it's not, I'm fine waiting and testing it only when you are ready..."

The *fix for that has nothing to do with statistics. Likewise making a request rather than waiting for the system being tested to say "OK I'm ready now, you can test me again" reveals nothing about my expectations about anything.

There are a set of use cases where people try to mimic human behavior. While important and fascinating, that's a small fraction of network traffic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has nothing to do with response times or expectations about the behavior of the system under test. I cannot stress that enough - you're focused on the wrong end of the problem.

Ok, so you're saying the goal is more that we enforce the request cadence?

Ack we are dealing with a workaround and some of this becomes moot when the CO guard lands.

Most articles and papers I've found seem to point to HdrHistogram as the best workaround, and that is what Goose has copied.

Nonetheless, CO-mitigation reflects nothing about your expectations of response times. It is simply to ensure that when a CO event occurs you see that reflected in the empirical distribution. Plotting the empirical cumulative dist function, ideally, you should see the CDF start to climb with the CO mitigation in place. With the minimum it will start to climb very soon after the CO event occurs. Using a larger time interval will hide the evidence of the CO event. Experiments will show this. You'll also notice that as you run the different experiments your "expectation for websites on the internet" are unchanged. All you're observing is a signal or a signal being hidden.

My concern with this is the mitigation (taken from HdrHistogram) works around this by back-filling metrics for requests that never actually happened. It seems undesirable to me for a significant portion of our metrics to be "estimated". (Fortunately we store both: the raw and the back-filled.)

I give up. But please don't impose your preferences for this. I you insist it must be the default can it easily be turned off?

That's not very constructive to solving this, but okay.

Please see the Coordinated Mitigation section in the README, specifically the --co-mitigation flag: this is how it's enabled/configured.

When the CO fix lands I will be able to configure Goose to make requests with a time interval between them that I expect to reflect the behavior of my users.

On the roadmap (but apparently not documented anywhere, oddly) is enhancing set_wait_time() (likely with a different named function) so it can instead be configured to impose a maximum time for each task. This would make it so you can enforce that each task runs after a certain duration. This will require setting a timeout on each request (or a way to cancel requests when the timeout is reached). In this way you'd be able to configure and enforce a set cadence for each GooseUser. Is this more what you're looking for? (The timeout is necessary as otherwise we end up with cascading requests...)

Try not to care about these statistical issues. No offense intended but I'm now reasonably confident they are not your strength, just as Rust is not my strength.

My goal is to be sure I understand what I'm committing: I'm not willing to commit documentation or code changes that don't make sense to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so you're saying the goal is more that we enforce the request cadence?

Correct. When it is on it is one. When it is off it is off.

That's not very constructive to solving this, but okay.

Nothing to do with being constructive or not. Time is finite. There are multiple things to with each minute.

impose a maximum time for each task

In our opinion, that works against preventing CO from distorting the response times.

I'm not willing to commit documentation or code changes that don't make sense to me.

Sensible.


Goose also shows multiple percentile graphs, again showing first the "raw" metrics followed by the "adjusted" metrics. The "raw" graph would suggest that less than 1% of the requests for the `GET (Anon) node page` were slow, and less than 0.1% of the requests for the `GET (Auth) node page` were slow. However, through Coordinated Omission Mitigation we can see that statistically this would have actually affected all requests, and for authenticated users the impact is visible on >25% of the requests.
Nonetheless, for users interested in establishing if there was an event(s) affecting the shape of the distribution of load test metrics (by a statistically significant amount): The following program is a reasonable starting point.

1. Run a test in circumstances where you believe the test can serve as a baseline sample from the 'healthy' state, record the raw response data (record the CO-adjusted data if using the minimum 'cadence' to adjust for CO).
2. Run a test in circumstances where you want to compare the distribution of response times against your baseline sample, record the CO-adjusted response data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Goose will monitor and tell you if it believes there was a CO-event, and will back-fill the metrics accordingly.

While the person running the load test remains the responsible party that oversees and reviews everything that's going on: Goose should simplify this considerably.

Copy link
Contributor Author

@bbros-dev bbros-dev Jul 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Goose will monitor and tell you if it believes there was a CO-event

Rather avoid this type of pseudo-intelligence. It clutters the code and gives a false sense of security.

If our understanding is correct: When you set the cadence as 'minimum' you'll get CO-mitigation right after the warm-up period.
Correct?

We're still of the view the minimum is the correct setting, so that is the only behavior that concerns us right now ;)

When the 'proper' solution arrives users will define their sample interval or rate/sec. In that case it would be convenient to have a flag that uses the minimum interval (maximum req/sec) obtained during some warmup period - and that might even come to be accepted as the default in the community. But we're not there yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather avoid this type of pseudo-intelligence. It clutters the code and gives a false sense of security.

The intent is quite the opposite: bring the tester's attention to abnormally slow requests together with timestamps that aid in tracking down the cause.

If --co-mitigation minimum is enabled, the user_cadence will indeed by the fastest time each GooseUser thread has completed all GooseTasks:

// Update `minimum_cadence` if this was the fastest seen.

If our understanding is correct: When you set the cadence as 'minimum' you'll get CO-mitigation right after the warm-up period.

In the current implementation, the "warm up period" is per-GooseUser thread, and is 3 complete iterations of all their GooseTasks:

if request_cadence.counter > 3 {

We're still of the view the minimum is the correct setting, so that is the only behavior that concerns us right now ;)

What is your use-case where this makes sense? Are you load testing in a fully controlled environment?

Copy link
Contributor Author

@bbros-dev bbros-dev Jul 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent is quite the opposite: bring the tester's attention to abnormally slow requests together with timestamps that aid in tracking down the cause.

Understood. But try an experiment where you freeze the server for some time period smaller than the average and do that often - you'll discover that frequently you are hiding the CO event.

No?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense, but this would be true for any event that is less than the period you've selected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...which is why you'd use the minimum. Of course.

The remaining concern for me is that this could result in the metrics that Goose returns containing as much or more "back-filled metrics" (that never actually happened) as "real metrics" (that actually happened).


Use a [Kolmogorov-Smirnov](https://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm), [Anderson-Darling](https://www.itl.nist.gov/div898/handbook/eda/section3/eda35e.htm) or some such test to establish if the two sample distributions are different. Take care to adjust the test statistic distribution for any differences in sample sizes (non-trivial). Alternatively, take care to ensure the two runs produce samples of the same size (generally feasible, but do take into account the CO-adjustment process backfills data).

The KS and AD tests assume the two data samples are independent of one another. However, Goose produces the CO-adjusted data from the raw data. Hence, obviously, the CO-adjusted data is not independent of the raw data produced in the same test/run.

There are situations where absolute values of a percentile are of interest, e.g.service level agreements, irrespective of the circumstances. Consequently, Goose produces percentile tables, showing the "raw" metrics followed by the "adjusted" metrics.

Returning to the example data. The "raw" graph indicates less than 1% of the responses to requests for the `GET (Anon) node page` were as slow as 3 seconds or worse, and less than 0.1% of the responses to requests for the `GET (Auth) node page` were as slow as 3 seconds or worse.

However, the data generated by Coordinated Omission mitigation indicates **2% of responses** to requests across all pages **were delayed by 2 seconds or worse**. For authenticated users **>25% of responses** to requests were **more than ten times slower than the raw data indicated** (comment form posting being slightly less affected).

```
------------------------------------------------------------------------------
Expand Down