Regression after #56 when running varnishreload ("... was collected before with the same name and label values") #57
Comments
Had some more time to look into this now. The root cause of this is that the inactive VCLs are kept in state [...] A workaround is to set [...] Note: if you use [...] As for prometheus_varnish_exporter, not sure if there is some other sane workaround? Perhaps add the VBE name as a label? Note: as long as there are multiple warm VCLs with conflicting backend names, the /metrics endpoint outputs an error message instead of metrics!
Yeah, this has always been a bit of an issue. I think what the metrics observer wants to know is how many configs are loaded and active, not the cold ones or whatever. So I think cleaning the timestamp and other prefixes/postfixes out of the name is the right thing to do. If I understand correctly, the problem is that this yields the same identifier. Should we just track the names during a scrape and ignore duplicates? If the varnishstat JSON output does not tell us which ones are cold, I am not sure what else to do (except remove the name processing and export the duplicates as they are).
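The "track the names during a scrape and ignore duplicates" idea could look roughly like the sketch below. It is Python for brevity, not the exporter's actual code, and it assumes varnishstat keys of the form `VBE.<vcl>.<backend>.<counter>`; the helper names and the sample VCL name `reload_1` are made up for illustration.

```python
def split_vbe_key(key):
    """Split 'VBE.<vcl>.<backend>.<counter>' (assumed format) into its parts."""
    prefix, vcl, rest = key.split(".", 2)
    if prefix != "VBE":
        raise ValueError(f"not a VBE key: {key}")
    backend, counter = rest.rsplit(".", 1)
    return vcl, backend, counter


def dedupe_scrape(samples):
    """Keep the first sample seen per (backend, counter) within one scrape.

    Returns the kept values and the list of raw keys that were skipped.
    """
    seen = {}
    skipped = []
    for key, value in samples:
        _vcl, backend, counter = split_vbe_key(key)
        ident = (backend, counter)
        if ident in seen:
            # Same identity already exported in this scrape: ignore it.
            skipped.append(key)
            continue
        seen[ident] = value
    return seen, skipped
```

Note that "first seen wins" is arbitrary here: nothing in the key says which VCL is the active one, so the kept value depends purely on input order.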
Not sure how Prometheus would act if the endpoint served two identical metric lines with totally different values; it would most likely cause confusion. And if we filter locally, what would we use to decide between the new and the old one? We cannot really assume anything from the name. I wonder if it is an acceptable solution to add the VCL name to the measurement outputs as another tag, something like this?
That way we would never cause duplicate values, but on the other hand we just push the potential problem to the next level of the stack (i.e. whoever is running the Prometheus query needs to figure out what is right or wrong). For my use case, I just ran with vcl_cooldown=1, since we rarely re-activate the previous VCL anyway. That way there is only a minimal chance that we scrape bad data, and if we do, it will be "fixed" by the next scrape. But I understand that this might not be suitable for all installations.
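As a rough illustration of the extra-label idea (not the output originally posted in this comment, and not the exporter's real code), the sketch below turns a varnishstat key into a metric line that carries the VCL name as a label. The key format `VBE.<vcl>.<backend>.<counter>`, the metric name prefix `varnish_backend_`, and the label names `backend` and `vcl` are all assumptions.

```python
def to_labeled_sample(key, value):
    """Map 'VBE.<vcl>.<backend>.<counter>' (assumed format) to name, labels, value."""
    prefix, vcl, rest = key.split(".", 2)
    if prefix != "VBE":
        raise ValueError(f"not a VBE key: {key}")
    backend, counter = rest.rsplit(".", 1)
    name = f"varnish_backend_{counter}"  # hypothetical metric name
    labels = {"backend": backend, "vcl": vcl}
    return name, labels, value


def render(name, labels, value):
    """Render one sample in the Prometheus text exposition style."""
    label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_text}}} {value}"
```

With the VCL name in the label set, samples from the old and the new VCL no longer share an identity, so they cannot collide; the query side then has to pick or aggregate by `vcl`.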
Could you send me a scrape .json that has both the boot and the reloaded values? I would like to get a clear picture of which metrics are duplicated while the "cooldown" is going on. If it is just the [...] What would be correct imo is to sum up the values for all of them. I mean, if there is a VCL in cooldown with value 2 and the currently active one with value 2, the correct Prometheus export would be the sum of those, not two exports with different [...] But yes, if we would build something to identify the [...]
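A minimal sketch of the summing approach, again in Python rather than the exporter's own code and assuming keys of the form `VBE.<vcl>.<backend>.<counter>` (the function name is made up):

```python
from collections import defaultdict


def sum_across_vcls(samples):
    """Sum values that share (backend, counter), regardless of VCL name."""
    totals = defaultdict(int)
    for key, value in samples:
        parts = key.split(".", 2)
        if len(parts) != 3 or parts[0] != "VBE":
            continue  # only VBE keys carry a VCL name in this sketch
        backend, counter = parts[2].rsplit(".", 1)
        totals[(backend, counter)] += value
    return dict(totals)
```

For the example in the comment, a cooling-down VCL with value 2 plus the active one with value 2 would be exported as a single sample with value 4. This only makes sense for monotonically increasing counters; a gauge-like value would need different handling.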
All VBE counters are repeated:
Listed only VBE above; those are the only ones that mention [...] Summing could perhaps work, yes, if all are counters (which most seem to be, except [...])
Hi, I created this PR to fix this issue.
After my patch in #56 was introduced, there is now a problem with duplicate metric names. For some reason I missed this... Apologies.
Since varnishreload does not remove the old VCL automatically, there will be VBE stats for both VCLs. This causes double metrics to be registered, triggering errors in prometheus client lib:
from:
Note that these are seen when actually calling the scrape endpoint!
Not sure how to work around this, varnishstat does not seem to identify the currently active VCL, or am I missing something?
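To see which metrics would collide once the VCL name is stripped, a small check like the one below can be run over the keys of a varnishstat JSON dump. This is only a diagnostic sketch in Python: it assumes keys of the form `VBE.<vcl>.<backend>.<counter>`, and the function name is invented for illustration.

```python
from collections import defaultdict


def find_collisions(keys):
    """Group VBE keys by '<backend>.<counter>' and return groups with more than one key."""
    by_ident = defaultdict(list)
    for key in keys:
        parts = key.split(".", 2)
        if len(parts) != 3 or parts[0] != "VBE":
            continue  # not a VBE key, nothing to strip
        by_ident[parts[2]].append(key)
    return {ident: ks for ident, ks in by_ident.items() if len(ks) > 1}
```

Any non-empty result means two loaded VCLs define a backend with the same name, which is the situation that makes the scrape fail.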