Regression after #56 when running varnishreload ("... was collected before with the same name and label values") #57

Closed
stromnet opened this issue Nov 14, 2019 · 6 comments · Fixed by #70

Comments

@stromnet
Contributor

  • Varnish 6.3.1
  • prometheus_varnish_exporter 1.5.1

After my patch in #56 was introduced, there is now a problem with duplicate metric names. For some reason I missed this... Apologies.

Since varnishreload does not remove the old VCL automatically, there will be VBE stats for both VCLs. This causes duplicate metrics to be registered, triggering errors in the Prometheus client library:

collected metric "varnish_backend_happy" { label:<name:"backend" value:"backendname" > label:<name:"server" value:"unknown" > gauge:<value:1.3510798882111488e+16 > } was collected before with the same name and label values
collected metric "varnish_backend_happy" { label:<name:"backend" value:"backendname" > label:<name:"server" value:"unknown" > gauge:<value:6.597069766656e+12 > } was collected before with the same name and label values

from:

VBE.boot.backendname.happy Happy health probes
VBE.reload_20191114_072450_277.backendname.happy Happy health probes

Note that these are seen when actually calling the scrape endpoint!

Not sure how to work around this; varnishstat does not seem to identify the currently active VCL, or am I missing something?

@stromnet
Contributor Author

Had some more time to look into this now. The root cause is that inactive VCLs are kept in the warm state for (by default) 600 seconds after a switch, before they are marked as cold.
As long as they are warm, they emit VBE stats.

A workaround is to set -p vcl_cooldown=1 or something similarly low (with whatever other impact this might have on a particular setup).

Note: if you use varnishadm param.set vcl_cooldown 1, it can take up to 270s before it actually takes effect, since the mgt_vcl_poker timer (mgt_vcl.c) only updates its timer when it runs. This caused some confusion for me at first... :)

As for prometheus_varnish_exporter, not sure if there is some other sane workaround? Perhaps add the VBE name as a label?

Note: as long as there are multiple warm VCLs with conflicting backend names, the /metrics endpoint outputs an error message instead of metrics!

@jonnenauha
Owner

Yeah, these have always been a bit of an issue. I think what the metrics observer wants to know is how many configs are loaded, not cold ones or whatever... just how many are active.

So I think cleaning the timestamp and other pre/postfixes out is the right thing to do. If I understand right, the problem is that the identifiers end up the same. Should we just track the names during a scrape and ignore duplicates?

If the varnishstat JSON output does not tell us which ones are cold, I am not sure what else to do (except remove the name processing and export the duplicates as they are).
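
As a rough illustration of the "track names during a scrape and ignore duplicates" idea, here is a minimal Go sketch. It is hypothetical, not the exporter's actual code: the `metric` type and `dedupKey` helper are assumptions, and it simply keeps the first metric seen for each name+labels combination.

```go
package main

import (
	"fmt"
	"sort"
)

// metric is a stand-in for whatever the exporter builds per varnishstat entry.
type metric struct {
	name   string
	labels map[string]string
	value  float64
}

// dedupKey builds an identity for a metric: its name plus its sorted label pairs.
// Two VBE entries from different VCLs collapse to the same key once the
// VCL/timestamp prefix has been cleaned out of the backend label.
func dedupKey(m metric) string {
	keys := make([]string, 0, len(m.labels))
	for k := range m.labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	id := m.name
	for _, k := range keys {
		id += "|" + k + "=" + m.labels[k]
	}
	return id
}

// dedup keeps only the first metric seen for each name+labels combination,
// skipping later entries coming from warm-but-inactive VCLs.
func dedup(metrics []metric) []metric {
	seen := make(map[string]bool)
	out := make([]metric, 0, len(metrics))
	for _, m := range metrics {
		k := dedupKey(m)
		if seen[k] {
			continue
		}
		seen[k] = true
		out = append(out, m)
	}
	return out
}

func main() {
	ms := []metric{
		{"varnish_backend_happy", map[string]string{"backend": "default"}, 1.35e16},
		{"varnish_backend_happy", map[string]string{"backend": "default"}, 6.59e12},
	}
	fmt.Println(len(dedup(ms))) // prints 1
}
```

Note that "keep the first one seen" sidesteps, rather than answers, the question of which VCL's value is the right one.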

@stromnet
Contributor Author

Not sure how Prometheus would act if the endpoint serves two identical metric lines with totally different values; it will most likely cause confusion. And if we filter locally, what would we use to decide which one is new and which is old? We cannot really assume anything from the name.

I wonder if it would be an acceptable solution to add the VCL name to the measurement outputs as another label, something like this:

varnish_backend_happy {vcl="boot", backend="backendname", server="unknown"} = 1.3510798882111488e+16
varnish_backend_happy {vcl="reload_20191114_072450_277", backend="backendname", server="unknown"} = 6.597069766656e+12

That way we would never produce duplicate series, but on the other hand we just push the potential problem to the next level of the stack (i.e. whatever is running the Prometheus query needs to figure out which one is right).
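
If the vcl-label route were taken, the split could be done on the raw identifier. A minimal, hypothetical Go sketch (not the exporter's actual parsing code), assuming the usual VBE.&lt;vcl&gt;.&lt;backend&gt;.&lt;counter&gt; layout:

```go
package main

import (
	"fmt"
	"strings"
)

// splitVBE parses a varnishstat VBE identifier of the form
// VBE.<vcl>.<backend>.<counter> into its parts. The first segment after
// "VBE." is taken as the VCL and the last segment as the counter; anything
// in between is kept together as the backend name. This layout is an
// assumption for the sketch, not a guarantee from varnishstat.
func splitVBE(ident string) (vcl, backend, counter string, ok bool) {
	parts := strings.Split(ident, ".")
	if len(parts) < 4 || parts[0] != "VBE" {
		return "", "", "", false
	}
	vcl = parts[1]
	counter = parts[len(parts)-1]
	backend = strings.Join(parts[2:len(parts)-1], ".")
	return vcl, backend, counter, true
}

func main() {
	for _, id := range []string{
		"VBE.boot.default.happy",
		"VBE.reload_20191118_075527_574.default.happy",
	} {
		vcl, backend, counter, _ := splitVBE(id)
		fmt.Printf("vcl=%q backend=%q counter=%q\n", vcl, backend, counter)
	}
}
```

The extracted vcl value would then simply be attached as an extra label next to backend and server.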

For my use case, I just ran with vcl_cooldown=1, since we rarely re-activate the previous VCL anyway. That way there is only a very small chance that we happen to scrape bad data, and if we do, it will be "fixed" by the next scrape. But I understand that this might not be suitable for all installations.

@jonnenauha
Owner

Could you send me a scrape .json that has these boot and reloaded values? I would like to get a clear picture of which metrics are duplicated while the "cooldown" is going on: whether it is just the *_happy metric, or everything that has the backend name in the Varnish metric name, e.g. VBE.boot.us1.bereq_hdrbytes.

What would be correct imo is to sum up the values for all of them. I mean, if the cooldown VCL has value 2 and the currently active one has value 2, the correct Prometheus export would be the sum of those, not two series with different vcl labels. When the cooldown goes away, the counter metric will drop again, causing a reset from the Prometheus server's perspective.

But yes, if we build something to identify the vcl label that's fine; then you would just do the sum on your Prometheus/Grafana side. This is basically what happened before this code was touched: it was not a separate vcl label, but it was included in the backend label.
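
For the summing approach, a minimal Go sketch (hypothetical, reusing the splitVBE idea above; not the exporter's actual code) that folds all VCLs' values into one series per backend and counter might look like this:

```go
package main

import "fmt"

// vbeValue is one parsed VBE entry: which VCL it came from,
// which backend and counter it belongs to, and its value.
type vbeValue struct {
	vcl, backend, counter string
	value                 float64
}

// sumAcrossVCLs collapses entries that differ only in VCL into a single
// value per backend+counter, so the exporter emits one series regardless
// of how many warm VCLs are still around after a reload.
func sumAcrossVCLs(entries []vbeValue) map[string]float64 {
	out := make(map[string]float64)
	for _, e := range entries {
		out[e.backend+"."+e.counter] += e.value
	}
	return out
}

func main() {
	entries := []vbeValue{
		{"boot", "default", "bereq_hdrbytes", 1200},
		{"reload_20191118_075527_574", "default", "bereq_hdrbytes", 300},
	}
	fmt.Println(sumAcrossVCLs(entries)) // map[default.bereq_hdrbytes:1500]
}
```

Whether summing is sensible for gauge-like metrics such as happy or conn is the open question raised in the next comment.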

@stromnet
Contributor Author

All VBE counters are repeated:

$ docker run -d --rm -it --name varnish-export-test varnish:latest
$ docker exec -it varnish-export-test bash
root@aa9707a800a8:/etc/varnish# varnishstat -j|grep VBE
  "VBE.boot.default.happy": {
  "VBE.boot.default.bereq_hdrbytes": {
  "VBE.boot.default.bereq_bodybytes": {
  "VBE.boot.default.beresp_hdrbytes": {
  "VBE.boot.default.beresp_bodybytes": {
  "VBE.boot.default.pipe_hdrbytes": {
  "VBE.boot.default.pipe_out": {
  "VBE.boot.default.pipe_in": {
  "VBE.boot.default.conn": {
  "VBE.boot.default.req": {
  "VBE.boot.default.unhealthy": {
  "VBE.boot.default.busy": {
  "VBE.boot.default.fail": {
  "VBE.boot.default.fail_eacces": {
  "VBE.boot.default.fail_eaddrnotavail": {
  "VBE.boot.default.fail_econnrefused": {
  "VBE.boot.default.fail_enetunreach": {
  "VBE.boot.default.fail_etimedout": {
  "VBE.boot.default.fail_other": {
  "VBE.boot.default.helddown": {
root@aa9707a800a8:/etc/varnish# varnishreload
VCL 'reload_20191118_075527_574' compiled
VCL 'reload_20191118_075527_574' now active
root@aa9707a800a8:/etc/varnish# varnishstat -j|grep VBE
  "VBE.boot.default.happy": {
  "VBE.boot.default.bereq_hdrbytes": {
  "VBE.boot.default.bereq_bodybytes": {
  "VBE.boot.default.beresp_hdrbytes": {
  "VBE.boot.default.beresp_bodybytes": {
  "VBE.boot.default.pipe_hdrbytes": {
  "VBE.boot.default.pipe_out": {
  "VBE.boot.default.pipe_in": {
  "VBE.boot.default.conn": {
  "VBE.boot.default.req": {
  "VBE.boot.default.unhealthy": {
  "VBE.boot.default.busy": {
  "VBE.boot.default.fail": {
  "VBE.boot.default.fail_eacces": {
  "VBE.boot.default.fail_eaddrnotavail": {
  "VBE.boot.default.fail_econnrefused": {
  "VBE.boot.default.fail_enetunreach": {
  "VBE.boot.default.fail_etimedout": {
  "VBE.boot.default.fail_other": {
  "VBE.boot.default.helddown": {
  "VBE.reload_20191118_075527_574.default.happy": {
  "VBE.reload_20191118_075527_574.default.bereq_hdrbytes": {
  "VBE.reload_20191118_075527_574.default.bereq_bodybytes": {
  "VBE.reload_20191118_075527_574.default.beresp_hdrbytes": {
  "VBE.reload_20191118_075527_574.default.beresp_bodybytes": {
  "VBE.reload_20191118_075527_574.default.pipe_hdrbytes": {
  "VBE.reload_20191118_075527_574.default.pipe_out": {
  "VBE.reload_20191118_075527_574.default.pipe_in": {
  "VBE.reload_20191118_075527_574.default.conn": {
  "VBE.reload_20191118_075527_574.default.req": {
  "VBE.reload_20191118_075527_574.default.unhealthy": {
  "VBE.reload_20191118_075527_574.default.busy": {
  "VBE.reload_20191118_075527_574.default.fail": {
  "VBE.reload_20191118_075527_574.default.fail_eacces": {
  "VBE.reload_20191118_075527_574.default.fail_eaddrnotavail": {
  "VBE.reload_20191118_075527_574.default.fail_econnrefused": {
  "VBE.reload_20191118_075527_574.default.fail_enetunreach": {
  "VBE.reload_20191118_075527_574.default.fail_etimedout": {
  "VBE.reload_20191118_075527_574.default.fail_other": {
  "VBE.reload_20191118_075527_574.default.helddown": {
root@aa9707a800a8:/etc/varnish#

I listed only VBE above; those are the only ones which mention default or reload_20191118_075527_574.

Summing could perhaps work, yes, if all of them are counters (which most seem to be, except happy and conn). But imo it would be cleaner to do a varnishadm vcl.list and use that, as in the sketch below. And perhaps raise a ticket with the Varnish project to add a flag to only return the active VCL in varnishstat, if possible?
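
A hypothetical sketch of the vcl.list idea in Go (the exact output format of varnishadm vcl.list varies between Varnish versions, so the parsing here is an assumption):

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

// activeVCL shells out to `varnishadm vcl.list` and returns the name of the
// VCL marked "active". The column layout differs between Varnish versions
// (the VCL name is usually the last field of the "active" line), so this
// parsing is a best-effort assumption for the sketch.
func activeVCL() (string, error) {
	out, err := exec.Command("varnishadm", "vcl.list").Output()
	if err != nil {
		return "", err
	}
	scanner := bufio.NewScanner(bytes.NewReader(out))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) > 1 && fields[0] == "active" {
			return fields[len(fields)-1], nil
		}
	}
	return "", fmt.Errorf("no active VCL found in vcl.list output")
}

func main() {
	name, err := activeVCL()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// VBE.<name>.* identifiers would then be kept, all other VCLs' stats dropped.
	fmt.Println("active VCL:", name)
}
```

This would need varnishadm to be reachable from wherever the exporter runs, which is not always the case.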

@LorenzoPeri
Contributor

Hi, I created this PR to fix this issue.
Can you please review it?
