Regression after #56 when running varnishreload ("... was collected before with the same name and label values") #57

Closed
stromnet opened this issue Nov 14, 2019 · 6 comments · Fixed by #70

Comments

@stromnet
Contributor

  • Varnish 6.3.1
  • prometheus_varnish_exporter 1.5.1

After my patch in #56 was introduced, there is now a problem with duplicate metric names. For some reason I missed this... Apologies.

Since varnishreload does not remove the old VCL automatically, there will be VBE stats for both VCLs. This causes duplicate metrics to be registered, triggering errors in the Prometheus client library:

collected metric "varnish_backend_happy" { label:<name:"backend" value:"backendname" > label:<name:"server" value:"unknown" > gauge:<value:1.3510798882111488e+16 > } was collected before with the same name and label values
collected metric "varnish_backend_happy" { label:<name:"backend" value:"backendname" > label:<name:"server" value:"unknown" > gauge:<value:6.597069766656e+12 > } was collected before with the same name and label values

from:

VBE.boot.backendname.happy Happy health probes
VBE.reload_20191114_072450_277.backendname.happy Happy health probes

Note that these are seen when actually calling the scrape endpoint!

Not sure how to work around this; varnishstat does not seem to identify the currently active VCL, or am I missing something?

@stromnet
Contributor Author

Had some more time to look into this now. The root cause is that inactive VCLs are kept in the warm state for (by default) 600 seconds after a switch, before they are marked as cold.
As long as they are warm, they emit VBE stats.

A workaround is to set -p vcl_cooldown=1 or something similarly low (with whatever other impact this might have on a particular setup).

Note: if you use varnishadm param.set vcl_cooldown 1, it can take up to 270s before it actually takes effect, since the mgt_vcl_poker timer (mgt_vcl.c) only updates its timer when it runs. This caused some confusion for me at first... :)

As for prometheus_varnish_exporter, not sure if there is some other sane workaround? Perhaps add the VBE name as a label?

Note: as long as there are multiple warm VCLs with conflicting backend names, the /metrics endpoint outputs an error message instead of metrics!

@jonnenauha
Owner

Yeah, these have always been a bit of an issue. I think what the metrics observer wants to know is how many configs are loaded, not cold ones or whatever... just how many are active.

So I think cleaning the timestamp and other pre/postfixes out is the right thing to do. If I understand right, the problem is that the identifiers end up the same. Should we just track the names during a scrape and ignore duplicates?

If the varnishstat JSON output does not tell us which ones are cold, I am not sure what else to do (except remove the name processing and export the duplicates as they are).
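
As a rough illustration of the "track names during a scrape and ignore duplicates" idea, here is a minimal Go sketch. It is hypothetical, not the exporter's actual code: the `metric` type and `dedupKey` helper are assumptions, and it simply keeps the first metric seen for each name+labels combination.

```go
package main

import (
	"fmt"
	"sort"
)

// metric is a stand-in for whatever the exporter builds per varnishstat entry.
type metric struct {
	name   string
	labels map[string]string
	value  float64
}

// dedupKey builds an identity for a metric: its name plus its sorted label pairs.
// Two VBE entries from different VCLs collapse to the same key once the
// VCL/timestamp prefix has been cleaned out of the backend label.
func dedupKey(m metric) string {
	keys := make([]string, 0, len(m.labels))
	for k := range m.labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	id := m.name
	for _, k := range keys {
		id += "|" + k + "=" + m.labels[k]
	}
	return id
}

// dedup keeps only the first metric seen for each name+labels combination,
// skipping later entries coming from warm-but-inactive VCLs.
func dedup(metrics []metric) []metric {
	seen := make(map[string]bool)
	out := make([]metric, 0, len(metrics))
	for _, m := range metrics {
		k := dedupKey(m)
		if seen[k] {
			continue
		}
		seen[k] = true
		out = append(out, m)
	}
	return out
}

func main() {
	ms := []metric{
		{"varnish_backend_happy", map[string]string{"backend": "default"}, 1.35e16},
		{"varnish_backend_happy", map[string]string{"backend": "default"}, 6.59e12},
	}
	fmt.Println(len(dedup(ms))) // prints 1
}
```

Note that "keep the first one seen" sidesteps, rather than answers, the question of which VCL's value is the right one.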

@stromnet
Contributor Author

Not sure how Prometheus would act if the endpoint serves two identical metric lines with totally different values; it will most likely cause confusion. And if we filter locally, what would we use to decide which one is new and which is old? We cannot really assume anything from the name.

I wonder if it would be an acceptable solution to add the VCL name to the measurement outputs as another label, something like this:

varnish_backend_happy {vcl="boot", backend="backendname", server="unknown"} = 1.3510798882111488e+16
varnish_backend_happy {vcl="reload_20191114_072450_277", backend="backendname", server="unknown"} = 6.597069766656e+12

That way we would never produce duplicate series, but on the other hand we just push the potential problem to the next level of the stack (i.e. whatever is running the Prometheus query needs to figure out which one is right).
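
If the vcl-label route were taken, the split could be done on the raw identifier. A minimal, hypothetical Go sketch (not the exporter's actual parsing code), assuming the usual VBE.&lt;vcl&gt;.&lt;backend&gt;.&lt;counter&gt; layout:

```go
package main

import (
	"fmt"
	"strings"
)

// splitVBE parses a varnishstat VBE identifier of the form
// VBE.<vcl>.<backend>.<counter> into its parts. The first segment after
// "VBE." is taken as the VCL and the last segment as the counter; anything
// in between is kept together as the backend name. This layout is an
// assumption for the sketch, not a guarantee from varnishstat.
func splitVBE(ident string) (vcl, backend, counter string, ok bool) {
	parts := strings.Split(ident, ".")
	if len(parts) < 4 || parts[0] != "VBE" {
		return "", "", "", false
	}
	vcl = parts[1]
	counter = parts[len(parts)-1]
	backend = strings.Join(parts[2:len(parts)-1], ".")
	return vcl, backend, counter, true
}

func main() {
	for _, id := range []string{
		"VBE.boot.default.happy",
		"VBE.reload_20191118_075527_574.default.happy",
	} {
		vcl, backend, counter, _ := splitVBE(id)
		fmt.Printf("vcl=%q backend=%q counter=%q\n", vcl, backend, counter)
	}
}
```

The extracted vcl value would then simply be attached as an extra label next to backend and server.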

For my use case, I just ran with vcl_cooldown=1, since we rarely re-activate the previous VCL anyway. That way there is only a very small chance that we happen to scrape bad data, and if we do, it will be "fixed" by the next scrape. But I understand that this might not be suitable for all installations.

@jonnenauha
Owner

Could you send me a scrape .json that has these boot and reloaded values? I would like to get a clear picture of which metrics are duplicated while the "cooldown" is going on: whether it is just the *_happy metric, or everything that has the backend name in the Varnish metric name, e.g. VBE.boot.us1.bereq_hdrbytes.

What would be correct imo is to sum up the values for all of them. I mean, if the cooldown VCL has value 2 and the currently active one has value 2, the correct Prometheus export would be the sum of those, not two series with different vcl labels. When the cooldown goes away, the counter metric will drop again, causing a reset from the Prometheus server's perspective.

But yes, if we build something to identify the vcl label that's fine; then you would just do the sum on your Prometheus/Grafana side. This is basically what happened before this code was touched: it was not a separate vcl label, but it was included in the backend label.
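
For the summing approach, a minimal Go sketch (hypothetical, reusing the splitVBE idea above; not the exporter's actual code) that folds all VCLs' values into one series per backend and counter might look like this:

```go
package main

import "fmt"

// vbeValue is one parsed VBE entry: which VCL it came from,
// which backend and counter it belongs to, and its value.
type vbeValue struct {
	vcl, backend, counter string
	value                 float64
}

// sumAcrossVCLs collapses entries that differ only in VCL into a single
// value per backend+counter, so the exporter emits one series regardless
// of how many warm VCLs are still around after a reload.
func sumAcrossVCLs(entries []vbeValue) map[string]float64 {
	out := make(map[string]float64)
	for _, e := range entries {
		out[e.backend+"."+e.counter] += e.value
	}
	return out
}

func main() {
	entries := []vbeValue{
		{"boot", "default", "bereq_hdrbytes", 1200},
		{"reload_20191118_075527_574", "default", "bereq_hdrbytes", 300},
	}
	fmt.Println(sumAcrossVCLs(entries)) // map[default.bereq_hdrbytes:1500]
}
```

Whether summing is sensible for gauge-like metrics such as happy or conn is the open question raised in the next comment.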

@stromnet
Contributor Author

All VBE counters are repeated:

$ docker run -d --rm -it --name varnish-export-test varnish:latest
$ docker exec -it varnish-export-test bash
root@aa9707a800a8:/etc/varnish# varnishstat -j|grep VBE
  "VBE.boot.default.happy": {
  "VBE.boot.default.bereq_hdrbytes": {
  "VBE.boot.default.bereq_bodybytes": {
  "VBE.boot.default.beresp_hdrbytes": {
  "VBE.boot.default.beresp_bodybytes": {
  "VBE.boot.default.pipe_hdrbytes": {
  "VBE.boot.default.pipe_out": {
  "VBE.boot.default.pipe_in": {
  "VBE.boot.default.conn": {
  "VBE.boot.default.req": {
  "VBE.boot.default.unhealthy": {
  "VBE.boot.default.busy": {
  "VBE.boot.default.fail": {
  "VBE.boot.default.fail_eacces": {
  "VBE.boot.default.fail_eaddrnotavail": {
  "VBE.boot.default.fail_econnrefused": {
  "VBE.boot.default.fail_enetunreach": {
  "VBE.boot.default.fail_etimedout": {
  "VBE.boot.default.fail_other": {
  "VBE.boot.default.helddown": {
root@aa9707a800a8:/etc/varnish# varnishreload
VCL 'reload_20191118_075527_574' compiled
VCL 'reload_20191118_075527_574' now active
root@aa9707a800a8:/etc/varnish# varnishstat -j|grep VBE
  "VBE.boot.default.happy": {
  "VBE.boot.default.bereq_hdrbytes": {
  "VBE.boot.default.bereq_bodybytes": {
  "VBE.boot.default.beresp_hdrbytes": {
  "VBE.boot.default.beresp_bodybytes": {
  "VBE.boot.default.pipe_hdrbytes": {
  "VBE.boot.default.pipe_out": {
  "VBE.boot.default.pipe_in": {
  "VBE.boot.default.conn": {
  "VBE.boot.default.req": {
  "VBE.boot.default.unhealthy": {
  "VBE.boot.default.busy": {
  "VBE.boot.default.fail": {
  "VBE.boot.default.fail_eacces": {
  "VBE.boot.default.fail_eaddrnotavail": {
  "VBE.boot.default.fail_econnrefused": {
  "VBE.boot.default.fail_enetunreach": {
  "VBE.boot.default.fail_etimedout": {
  "VBE.boot.default.fail_other": {
  "VBE.boot.default.helddown": {
  "VBE.reload_20191118_075527_574.default.happy": {
  "VBE.reload_20191118_075527_574.default.bereq_hdrbytes": {
  "VBE.reload_20191118_075527_574.default.bereq_bodybytes": {
  "VBE.reload_20191118_075527_574.default.beresp_hdrbytes": {
  "VBE.reload_20191118_075527_574.default.beresp_bodybytes": {
  "VBE.reload_20191118_075527_574.default.pipe_hdrbytes": {
  "VBE.reload_20191118_075527_574.default.pipe_out": {
  "VBE.reload_20191118_075527_574.default.pipe_in": {
  "VBE.reload_20191118_075527_574.default.conn": {
  "VBE.reload_20191118_075527_574.default.req": {
  "VBE.reload_20191118_075527_574.default.unhealthy": {
  "VBE.reload_20191118_075527_574.default.busy": {
  "VBE.reload_20191118_075527_574.default.fail": {
  "VBE.reload_20191118_075527_574.default.fail_eacces": {
  "VBE.reload_20191118_075527_574.default.fail_eaddrnotavail": {
  "VBE.reload_20191118_075527_574.default.fail_econnrefused": {
  "VBE.reload_20191118_075527_574.default.fail_enetunreach": {
  "VBE.reload_20191118_075527_574.default.fail_etimedout": {
  "VBE.reload_20191118_075527_574.default.fail_other": {
  "VBE.reload_20191118_075527_574.default.helddown": {
root@aa9707a800a8:/etc/varnish#

I listed only VBE above; those are the only ones which mention default or reload_20191118_075527_574.

Summing could perhaps work, yes, if all of them are counters (which most seem to be, except happy and conn). But imo it would be cleaner to do a varnishadm vcl.list and use that, as in the sketch below. And perhaps raise a ticket with the Varnish project to add a flag to only return the active VCL in varnishstat, if possible?
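
A hypothetical sketch of the vcl.list idea in Go (the exact output format of varnishadm vcl.list varies between Varnish versions, so the parsing here is an assumption):

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

// activeVCL shells out to `varnishadm vcl.list` and returns the name of the
// VCL marked "active". The column layout differs between Varnish versions
// (the VCL name is usually the last field of the "active" line), so this
// parsing is a best-effort assumption for the sketch.
func activeVCL() (string, error) {
	out, err := exec.Command("varnishadm", "vcl.list").Output()
	if err != nil {
		return "", err
	}
	scanner := bufio.NewScanner(bytes.NewReader(out))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) > 1 && fields[0] == "active" {
			return fields[len(fields)-1], nil
		}
	}
	return "", fmt.Errorf("no active VCL found in vcl.list output")
}

func main() {
	name, err := activeVCL()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// VBE.<name>.* identifiers would then be kept, all other VCLs' stats dropped.
	fmt.Println("active VCL:", name)
}
```

This would need varnishadm to be reachable from wherever the exporter runs, which is not always the case.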

@LorenzoPeri
Contributor

Hi, I created this PR to fix this issue.
Can you please review it?
