Doc: Understanding the impact of CO-events #304
@@ -599,9 +599,22 @@ The following example was "contrived". The `drupal_loadtest` example was run for
Aggregated | 432.98 | 294.11 | 3,390 | 14
```
From these two tables, it is clear that there was a statistically significant event affecting the load testing metrics. In particular, note that the standard deviation between the "raw" average and the "adjusted" average is considerably larger than the "raw" average, calling into question whether or not your load test was "valid". (The answer to that question depends very much on your specific goals and load test.)

Note: It is beyond the scope of Goose to test for statistically significant changes in the right tail, or other regions, of the distribution of response times. Goose produces the raw data you need to conduct these tests.
Goose also shows multiple percentile graphs, again showing first the "raw" metrics followed by the "adjusted" metrics. The "raw" graph would suggest that less than 1% of the requests for the `GET (Anon) node page` were slow, and less than 0.1% of the requests for the `GET (Auth) node page` were slow. However, through Coordinated Omission Mitigation we can see that statistically this would have actually affected all requests, and for authenticated users the impact is visible on more than 25% of the requests.

Nonetheless, for users interested in establishing whether one or more events affected the shape of the distribution of load test metrics by a statistically significant amount, the following procedure is a reasonable starting point:
1. Run a test in circumstances where you believe the test can serve as a baseline sample of the 'healthy' state, and record the raw response data (record the CO-adjusted data if using the minimum 'cadence' to adjust for CO).
2. Run a test in circumstances where you want to compare the distribution of response times against your baseline sample, and record the CO-adjusted response data.
Goose will monitor and tell you if it believes there was a CO-event, and will back-fill the metrics accordingly. While the person running the load test remains the responsible party who oversees and reviews everything that's going on, Goose should simplify this considerably.

Rather avoid this type of pseudo-intelligence. It clutters the code and gives a false sense of security. If our understanding is correct: when you set the cadence to 'minimum' you'll get CO-mitigation right after the warm-up period. We're still of the view the minimum is the correct setting, so that is the only behavior that concerns us right now ;) When the 'proper' solution arrives users will define their sample interval or rate/sec. In that case it would be convenient to have a flag that uses the minimum interval (maximum req/sec) obtained during some warm-up period - and that might even come to be accepted as the default in the community. But we're not there yet.

The intent is quite the opposite: to bring the tester's attention to abnormally slow requests together with timestamps that aid in tracking down the cause (see line 1514 in 025a7bb). In the current implementation, the "warm up period" is per-`GooseUser` thread, and is 3 complete iterations of all their `GooseTask`s (line 1529 in 025a7bb). What is your use-case where this makes sense? Are you load testing in a fully controlled environment?

Understood. But try an experiment where you freeze the server for some time period smaller than the average and do that often - you'll discover that frequently you are hiding the CO event. No?

It makes sense, but this would be true for any event that is shorter than the period you've selected.

...which is why you'd use the minimum. Of course. The remaining concern for me is that this could result in the metrics that Goose returns containing as much or more "back-filled metrics" (that never actually happened) as "real metrics" (that actually happened).
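The reviewer's freeze experiment is easy to reproduce numerically. The following is an illustrative sketch (not Goose code; the cadence values are invented) showing how a 2x detection threshold keyed off the average cadence can hide a freeze that a minimum-based threshold catches:

```rust
/// Illustrative only (not Goose internals): is a stalled pass flagged when
/// the 2x detection threshold is keyed off the average cadence vs. the minimum?
fn flags_stall(baseline_ms: u64, observed_pass_ms: u64) -> bool {
    observed_pass_ms > 2 * baseline_ms
}

fn main() {
    let min_cadence_ms = 200; // fastest observed pass through all tasks
    let avg_cadence_ms = 500; // average observed pass
    let stall_ms = 400; // server freeze shorter than the average cadence

    // The stalled pass takes roughly a normal pass plus the freeze: 900ms.
    let observed = avg_cadence_ms + stall_ms;

    // 900 > 2 * 500 is false: the freeze is hidden by the average baseline.
    println!("avg-based detector flags it: {}", flags_stall(avg_cadence_ms, observed));
    // 900 > 2 * 200 is true: the minimum baseline catches the freeze.
    println!("min-based detector flags it: {}", flags_stall(min_cadence_ms, observed));
}
```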
3. Use a [Kolmogorov-Smirnov](https://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm), [Anderson-Darling](https://www.itl.nist.gov/div898/handbook/eda/section3/eda35e.htm), or similar test to establish whether the two sample distributions are different. Take care to adjust the test statistic distribution for any differences in sample sizes (non-trivial). Alternatively, take care to ensure the two runs produce samples of the same size (generally feasible, but do take into account that the CO-adjustment process backfills data).
The KS and AD tests assume the two data samples are independent of one another. However, Goose produces the CO-adjusted data from the raw data. Hence the CO-adjusted data is obviously not independent of the raw data produced in the same test/run.
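To make step 3 concrete, here is a minimal two-sample Kolmogorov-Smirnov test sketched from scratch in Rust. The sample data is invented, and the asymptotic critical value c(alpha) * sqrt((n+m)/(n*m)) is a rough approximation for samples this small; treat this as a starting point, not a definitive implementation:

```rust
/// Two-sample Kolmogorov-Smirnov statistic: the maximum distance between
/// the empirical CDFs of the two samples.
fn ks_statistic(a: &mut [f64], b: &mut [f64]) -> f64 {
    a.sort_by(|x, y| x.partial_cmp(y).unwrap());
    b.sort_by(|x, y| x.partial_cmp(y).unwrap());
    let (n, m) = (a.len(), b.len());
    let (mut i, mut j, mut d) = (0usize, 0usize, 0f64);
    while i < n && j < m {
        // Step past the next observed value in either sample (ties included),
        // then compare the two empirical CDFs at that point.
        let t = a[i].min(b[j]);
        while i < n && a[i] <= t { i += 1; }
        while j < m && b[j] <= t { j += 1; }
        let f1 = i as f64 / n as f64;
        let f2 = j as f64 / m as f64;
        d = d.max((f1 - f2).abs());
    }
    d
}

fn main() {
    // Hypothetical response times in ms: a baseline run vs. a comparison run.
    let mut baseline = vec![120.0, 135.0, 140.0, 150.0, 160.0, 175.0, 180.0];
    let mut adjusted = vec![400.0, 900.0, 1400.0, 1900.0, 2400.0, 2900.0, 3400.0];

    let d = ks_statistic(&mut baseline, &mut adjusted);
    let (n, m) = (baseline.len() as f64, adjusted.len() as f64);
    // Asymptotic critical value at alpha = 0.05 (c(alpha) = 1.358). Real load
    // tests produce thousands of data points, where this approximation is better.
    let critical = 1.358 * ((n + m) / (n * m)).sqrt();
    println!("D = {d:.3}, critical = {critical:.3}, distributions differ: {}", d > critical);
}
```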
There are situations where the absolute values of a percentile are of interest, e.g. service level agreements, irrespective of the circumstances. Consequently, Goose produces percentile tables, showing the "raw" metrics followed by the "adjusted" metrics.
Returning to the example data: the "raw" graph indicates that less than 1% of the responses to requests for the `GET (Anon) node page` were as slow as 3 seconds or worse, and less than 0.1% of the responses to requests for the `GET (Auth) node page` were as slow as 3 seconds or worse.
However, the data generated by Coordinated Omission Mitigation indicates **2% of responses** to requests across all pages **were delayed by 2 seconds or worse**. For authenticated users **more than 25% of responses** to requests were **more than ten times slower than the raw data indicated** (comment form posting being slightly less affected).
```
------------------------------------------------------------------------------
```
Goose is able to do this to some degree. Goose tracks how long it takes each `GooseUser` to run through all their `GooseTask`s, and after each pass updates the average time. If a given pass takes more than 2x the average, it issues a warning that there may have been a Coordinated Omission event. While it did originally track a cadence for every single request, this was far too much overhead at scale, hence the simplified version that landed.
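For readers of this thread, the heuristic described above reduces to something like the following. This is an illustrative sketch, not the actual Goose source (which is linked earlier in this conversation):

```rust
use std::time::Duration;

/// Illustrative sketch of the per-GooseUser cadence heuristic: keep a running
/// average of how long each pass through the task set takes, and flag any
/// pass that takes more than 2x that average as a possible CO event.
struct Cadence {
    passes: u64,
    average: Duration,
}

impl Cadence {
    fn new() -> Self {
        Cadence { passes: 0, average: Duration::ZERO }
    }

    /// Returns true if this pass looks like a possible CO event.
    fn record_pass(&mut self, elapsed: Duration) -> bool {
        // Warm up on the first few passes before alerting (Goose uses 3).
        let suspicious = self.passes >= 3 && elapsed > 2 * self.average;
        // Update the running average with the new observation.
        self.average = (self.average * self.passes as u32 + elapsed)
            / (self.passes as u32 + 1);
        self.passes += 1;
        suspicious
    }
}

fn main() {
    let mut cadence = Cadence::new();
    let passes = [500, 520, 480, 510, 1400]; // ms; the last pass stalls
    for ms in passes {
        let flagged = cadence.record_pass(Duration::from_millis(ms));
        println!("pass of {ms}ms -> possible CO event: {flagged}");
    }
}
```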
Hmm, it's a nest of thorns, especially when you move away from well-known/trodden paths such as testing for equality between distributions... I don't believe I've ever seen anyone address what the distribution of the maxima is for their (sampled) distribution, and then establish whether the observed maximum is within the expected range or not. Gil Tene seemed to have great trouble just persuading people that the one metric you care about is the maximum. Best to steer clear of making these sorts of claims - in our opinion.

What is in place is, in our opinion, a defensible workaround until the proper implementation is arrived at.
If you leave that in place I suggest offering a reason those two random numbers, i.e. the 2 and the average value, were chosen.

After rejecting our conjecture that the minimum is the correct value to use for the default 'cadence', we took the time to give an explanation for using the minimum as the default cadence in the absence of a user providing a particular value. That seems to have been ignored, which is unfortunate, but causes us no harm. However, we must note it is not obvious to us that the CIs around the order statistics are symmetric - but we are guessing that is what you're trying to do. So we're struggling to work out what statistical theory you are trying to apply and why.

Unless a statistician emerges as a maintainer, we'd suggest moving Goose away from these topics is the prudent course of action.
The values Goose is using were chosen through empirical experimentation and observation.

While in a fully controlled environment it would be ideal to strive for the minimum response time, it has never been my experience that this is the expectation for websites on the internet. It is my expectation that most people running load tests are doing so to observe and measure performance (frequently with the aim of further optimization).

Currently each `GooseUser` thread simply watches the performance for a few cycles, tracks the average cadence, and uses this as its baseline. Yes, there are outliers that return more quickly, and there are outliers that return more slowly, but on average we tend to observe reasonable consistency. So my intent is that through empirical observation Goose determines the "normal" cadence for each `GooseUser`, and then alerts when some event causes things to take longer. Goose is unable to determine what caused the event, but aids the tester by making them aware of it. This is a hint to the tester to review all available tools to better understand what happened: was there a bottleneck, a cache stampede, resource starvation, etc.? And does the event impact the goals of the load test (fully something for the tester to decide)?

What is your use-case that you assume the minimum response time is the expected cadence?

As for the magic 2, this was borrowed from the implementation in HdrHistogram:
https://github.com/HdrHistogram/HdrHistogram_rust/blob/9c09314ac91848fd696b699892414cb337d9abce/src/lib.rs#L916
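For readers following along, the linked HdrHistogram correction amounts to the following back-fill loop. This is a simplified sketch of the algorithm (recording into a plain `Vec` rather than a real histogram), not the actual crate code:

```rust
/// Simplified sketch of HdrHistogram-style Coordinated Omission correction:
/// record `value`, then back-fill synthetic values for the requests that a
/// steady cadence would have issued while the slow request was in flight.
fn record_corrected(histogram: &mut Vec<u64>, value: u64, expected_interval: u64) {
    histogram.push(value); // the real measurement
    if expected_interval == 0 || value <= expected_interval {
        return;
    }
    // Back-fill one synthetic sample per missed sending opportunity.
    let mut missing = value - expected_interval;
    while missing >= expected_interval {
        histogram.push(missing); // never actually happened: estimated
        missing -= expected_interval;
    }
}

fn main() {
    let mut histogram = Vec::new();
    // Expected cadence of 100ms; one request stalled for 350ms.
    record_corrected(&mut histogram, 350, 100);
    // Yields [350, 250, 150]: one real sample plus two back-filled ones.
    println!("{histogram:?}");
}
```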
This has nothing to do with response times or expectations about the behavior of the system under test. I cannot stress that enough - you're focused on the wrong end of the problem.

Perhaps I misunderstand. Ack, we are dealing with a workaround, and some of this becomes moot when the CO guard lands.

Nonetheless, CO-mitigation reflects nothing about your expectations of response times. It is simply to ensure that when a CO event occurs you see that reflected in the empirical distribution. Plotting the empirical cumulative distribution function, ideally, you should see the CDF start to climb with the CO mitigation in place. With the minimum it will start to climb very soon after the CO event occurs. Using a larger time interval will hide the evidence of the CO event. Experiments will show this. You'll also notice that as you run the different experiments your "expectation for websites on the internet" is unchanged. All you're observing is a signal, or a signal being hidden.

I give up. But please don't impose your preferences for this. If you insist it must be the default, can it easily be turned off?

In the workaround at hand we are talking about a device to alert us to CO events that we know with 100% certainty would, without this workaround, render the upper order statistics near meaningless. It has nothing to do with my expectations about anything; it's a workaround.

When the CO fix lands I will be able to configure Goose to make requests with a time interval between them that I expect to reflect the behavior of my users. When those "users" are backend microservices I can make them behave in a very specific manner. But again, the interval I pass to the CO-fixed Goose will reflect nothing about my expectations about response times - it should reflect my expectations about request arrival times. Request arrival times have nothing to do with response times (in a CO-fixed system); without the CO fix, response times do impact request times - that's the defect being worked around and then fixed.

No?

As an aside: try not to care about these statistical issues. No offence intended, but I'm now reasonably confident they are not your strength, just as Rust is not my strength. Allow for the possibility of infinitely more diversity in the world than what you have experienced, and likewise with respect to the views and expectations you expect other people to hold.

The role of Goose is to make requests and record responses in a way that isn't misleading. Nothing more. The CO issue unfortunately impacts the distribution you observe, but it really has nothing to do with statistics. To recognize this, consider Gil Tene's neat definition of CO. The fix for that has nothing to do with statistics. Likewise, making a request rather than waiting for the system being tested to say "OK, I'm ready now, you can test me again" reveals nothing about my expectations about anything.

There are a set of use cases where people try to mimic human behavior. While important and fascinating, that's a small fraction of network traffic.
Ok, so you're saying the goal is more that we enforce the request cadence?

Most articles and papers I've found seem to point to HdrHistogram as the best workaround, and that is what Goose has copied. My concern with this is that the mitigation (taken from HdrHistogram) works around the problem by back-filling metrics for requests that never actually happened. It seems undesirable to me for a significant portion of our metrics to be "estimated". (Fortunately we store both: the raw and the back-filled.)

That's not very constructive to solving this, but okay.

Please see the Coordinated Omission Mitigation section in the README, specifically the `--co-mitigation` flag: this is how it's enabled/configured.

On the roadmap (but apparently not documented anywhere, oddly) is enhancing `set_wait_time()` (likely with a differently named function) so it can instead be configured to impose a maximum time for each task. This would make it so you can enforce that each task runs after a certain duration. This will require setting a timeout on each request (or a way to cancel requests when the timeout is reached). In this way you'd be able to configure and enforce a set cadence for each `GooseUser`. Is this more what you're looking for? (The timeout is necessary as otherwise we end up with cascading requests...)

My goal is to be sure I understand what I'm committing: I'm not willing to commit documentation or code changes that don't make sense to me.
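For what it's worth, an enforced cadence along those lines might look roughly like this. This is a hypothetical sketch using tokio directly, not the planned Goose API; `make_request` is a stand-in for a task issuing one request:

```rust
use std::time::Duration;
use tokio::time::{interval, timeout, MissedTickBehavior};

// Hypothetical stand-in for a GooseUser task issuing one request.
async fn make_request() -> Result<(), ()> {
    // ... issue the HTTP request and record the response time here ...
    Ok(())
}

#[tokio::main]
async fn main() {
    // Enforce a 100ms cadence: one request per tick, regardless of how
    // long earlier responses take.
    let cadence = Duration::from_millis(100);
    let mut ticker = interval(cadence);
    // Fire delayed ticks immediately rather than silently skipping them,
    // so a stall cannot suppress sends (the essence of avoiding CO).
    ticker.set_missed_tick_behavior(MissedTickBehavior::Burst);

    for _ in 0..100 {
        ticker.tick().await;
        // Spawn so a slow response never delays the next send; cap each
        // request at the cadence to avoid cascading in-flight requests.
        tokio::spawn(async move {
            if timeout(cadence, make_request()).await.is_err() {
                // Record a timeout: the request overran the enforced cadence.
            }
        });
    }
}
```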
Correct. When it is on it is on. When it is off it is off.

Nothing to do with being constructive or not. Time is finite. There are multiple things to do with each minute.

In our opinion, that works against preventing CO from distorting the response times.

Sensible.