-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix issue 57 - errors on VCL reload because of duplicated counters #70
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Did I understand right that the bug is that over time the stats contain these reload_
prefixed entries. They do go away after that cooldown. but while they exist it breaks the whole metrics scrape and export?
But if the names are unique by the timestamp there, why is prometheus giving and error? Does our code somehow strip the timestamp out or the metric identifier? And then we end up with multiple metrics with the same name/labels which prometheus then rejects?
Good code in general. Left some suggestions and I think one bug (though I assume its possible to have only one unique reload_ entries you might know better).
varnish.go
Outdated
@@ -15,6 +15,11 @@ import ( | |||
"github.com/prometheus/client_golang/prometheus" | |||
) | |||
|
|||
const ( | |||
vbeReload = "VBE.reload_" | |||
vbeReloadLength = len(vbeReload) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do this inline in findMostRecentVbeReloadPrefix
, no need to make this globals constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, it was maybe and excessive attempt of optimization on my side. :D
Done: feb6172
varnish.go
Outdated
var mostRecentVbeReloadPrefix string | ||
for vName, _ := range countersJSON { | ||
// Checking only the required ".happy" stat | ||
if strings.HasPrefix(vName, vbeReload) && strings.HasSuffix(vName, ".happy") { | ||
dotAfterPrefixIndex := vbeReloadLength + strings.Index(vName[vbeReloadLength:], ".") | ||
vbeReloadPrefix := vName[:dotAfterPrefixIndex] | ||
if strings.Compare(vbeReloadPrefix, mostRecentVbeReloadPrefix) > 0 { | ||
mostRecentVbeReloadPrefix = vbeReloadPrefix | ||
} | ||
} | ||
} | ||
return mostRecentVbeReloadPrefix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is only one set on reload_
entries, this function will return the only ones timestamp and that will then be (unintentionally?) ignored in ScrapeVarnishFrom
. Did you intend this, I assume letting the one go through is ok, if its ok to let one of many through as well.
var mostRecentVbeReloadPrefix string | |
for vName, _ := range countersJSON { | |
// Checking only the required ".happy" stat | |
if strings.HasPrefix(vName, vbeReload) && strings.HasSuffix(vName, ".happy") { | |
dotAfterPrefixIndex := vbeReloadLength + strings.Index(vName[vbeReloadLength:], ".") | |
vbeReloadPrefix := vName[:dotAfterPrefixIndex] | |
if strings.Compare(vbeReloadPrefix, mostRecentVbeReloadPrefix) > 0 { | |
mostRecentVbeReloadPrefix = vbeReloadPrefix | |
} | |
} | |
} | |
return mostRecentVbeReloadPrefix | |
mostRecentVbeReloadPrefix := "" | |
numReloadedPrefix := 0 | |
for vName, _ := range countersJSON { | |
// Checking only the required ".happy" stat | |
if strings.HasPrefix(vName, vbeReload) && strings.HasSuffix(vName, ".happy") { | |
numReloadedPrefix++ | |
dotAfterPrefixIndex := vbeReloadLength + strings.Index(vName[vbeReloadLength:], ".") | |
vbeReloadPrefix := vName[:dotAfterPrefixIndex] | |
if strings.Compare(vbeReloadPrefix, mostRecentVbeReloadPrefix) > 0 { | |
mostRecentVbeReloadPrefix = vbeReloadPrefix | |
} | |
} | |
} | |
if numReloadedPrefix <= 1 { | |
return "" | |
} | |
return mostRecentVbeReloadPrefix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is only one set on reload_ entries, this function will return the only ones timestamp
Correct
and that will then be (unintentionally?) ignored in ScrapeVarnishFrom.
I don't think this is true. ScrapeVarnishFrom
will use isOutdatedVbe
to determine that the VBE counters are not, in fact, outdated, so they won't be skipped (this should be covered by the current VCL version
test case of Test_IsOutdatedVbe
).
Let me know if I missed something!
Correct. Varnish starts with just the For example, if you reload Varnish many times in a row, you could end up with many concurrent sets of The only way to expose all the counters at the same time would probably be to add the timestamp as additional label on the Prometheus metrics. This way the conflict would be avoided, but a new time series would be created in Prometheus on each Varnish reload. IMHO it would not be good memory-wise for Prometheus and also it would be harder to extract useful metrics (like, how to ignore the outdated ones in queries?). I think it's better to just expose the current active values.
Also correct. The timestamp is not part of the exported metrics (name or tags), so there are naming conflicts.
Thanks! |
Is it possible to create a release containing this fix? |
@zipkid latest release has this https://github.com/jonnenauha/prometheus_varnish_exporter/releases/tag/1.6.1 |
Fixes #57
Issue
The exporter fails when VCL is reloaded.
This happens because Varnish backend counters (VBE) are repeated for each reload with its timestamp:
Example:
VBE.boot.default.happy
andVBE.reload_20210114_155148_19902.default.happy
... thus generating duplicated metrics.
The
-p vcl_cooldown=1
workaround is not good enough, because old counters are still present in varnishstat for some time (granularity is 30s, see https://varnish-cache.org/docs/6.5/reference/varnishd.html#vcl-cooldown).The result is that on each reload the exporter keeps failing for some time (1m in my tests with
-p vcl_cooldown=1
, 10m without it).Solution
I propose to consider only the counters with the most recent timestamp.
I quickly calculate the most recent timestamp by checking the ".happy" counters, then I filter out the outdated counters.
Should work for any Varnish version.
Tests included:
6.5.1.json
file with 2 different timestamps to simulate 2 consecutive VCL reloads. Existing tests would fail without the fix.