
varnishstat reports "cold" VCLs after panic #3358

Closed
stromnet opened this issue Jul 1, 2020 · 6 comments


stromnet commented Jul 1, 2020

For some reason (not yet determined, and not really related to this issue), we have had panics in some Varnish child processes. After this, prometheus_varnish_exporter started to choke on duplicate metrics, as it does not handle multiple VCL instances being present in varnishstat output.

After some digging it turned out that varnishstat, which normally only reports metrics for warm VCLs, was reporting cold VCLs as well. It was then noticed that the child process had panicked and been restarted.
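
For reference, as far as I know the exporter scrapes the JSON interface, so this is roughly how I checked which VCL instances show up there (a minimal sketch; it assumes the jq tool is installed and handles both the flat and the "counters"-wrapped JSON layouts that varnishstat -j produces depending on version):

# varnishstat -j | jq -r '(.counters // .) | keys[]' | grep '^VBE\.'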

In the Docker reproduction below, -QUIT is not registered as a panic but only as "Child process died (signal)". On my real RHEL7 systems, -QUIT triggers "Child process core dumped" and, I believe, "Child process panic".

Reproduced in 6.3.1, 6.3.2 and 6.4.0 from CentOS7 RPMs, and, as shown below, in the latest varnish Docker image.

Expected Behavior

Cold VCLs should not be reported in varnishstat after a child panic, i.e. the same behaviour as before the panic.

Current Behavior

Cold VCLs appear in varnishstat after a child panic has occurred, but not before.

Possible Solution

No idea.

Steps to Reproduce (for bugs)

Using varnish docker image:

$ docker run --rm -it --name varnish-test varnish:latest varnishd -F -f /etc/varnish/default.vcl -p vcl_cooldown=1
$ docker exec -it varnish-test bash
# apt update && apt install -y procps
# varnishstat -1|grep happy
VBE.boot.default.happy                        0          .   Happy health probes
# varnishadm vcl.list
active   auto    warm         0    boot
# varnishreload
VCL 'reload_20200701_113808_693' compiled
VCL 'reload_20200701_113808_693' now active
# varnishadm vcl.list
available   auto    cold         0    boot
active      auto    warm         0    reload_20200701_113808_693
# varnishstat -1|grep happy
VBE.reload_20200701_113808_693.default.happy            0          .   Happy health probes

As seen above, we have one cold and one warm VCL, and only the warm one is exposed in varnishstat.

Now, find the child pid and force a panic:

# ps axuww|grep varnishd
varnish      1  0.0  0.0  18608  5688 pts/0    SLs+ 11:33   0:00 varnishd -F -f /etc/varnish/default.vcl -p vcl_cooldown=1
vcache      18  0.1  1.4 284996 88896 pts/0    SLl+ 11:33   0:00 varnishd -F -f /etc/varnish/default.vcl -p vcl_cooldown=1
root       948  0.0  0.0   3084   892 pts/1    S+   11:38   0:00 grep varnishd
# kill -QUIT 18

Now varnishstat reports the cold VCL too:

# varnishadm vcl.list
available   auto    cold         0    boot
active      auto    warm         0    reload_20200701_113808_693
# varnishstat -1|grep happy
VBE.boot.default.happy                        0          .   Happy health probes
VBE.reload_20200701_113808_693.default.happy            0          .   Happy health probes

Context

After at least one reload plus one panic has occurred, prometheus_varnish_exporter fails. That is itself imperfect behaviour, tracked in jonnenauha/prometheus_varnish_exporter#57, but this still seems like a bug in Varnish.
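
A rough way to see the collision from the command line (a sketch, assuming the exporter keys its backend series on the backend part of VBE.<vcl>.<backend>.*, so the same backend name under two VCL prefixes collides); before the panic this lists only the active VCL's backends, afterwards the cold boot VCL's backends show up as well:

# varnishstat -1 | awk -F'.' '/^VBE\./ { print $2 "/" $3 }' | sort -u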

On the plus side, this made me catch the fact that the Varnish child had panicked. The root cause for that is TBD; core dumps are enabled now.

Your Environment

  • Version used: 6.3.1, but reproduced in latest
  • Operating System and version: RHEL7
  • Source of binary packages used (if any): RPM & docker debs (in example above), no additional modules.

dridi commented Jul 1, 2020

Do you get something from varnishadm panic.show?


stromnet commented Jul 1, 2020

In the docker variant it says:

# varnishadm panic.show
Child has not panicked or panic has been cleared
Command failed with error code 300

And the same when using -QUIT on the RHEL machine.
Here are the stats from -QUIT on the RHEL machine, by the way (the one I was unsure about above):

# varnishstat -1|head
MGT.uptime               23940      1496.25 Management process uptime
MGT.child_start              2         0.12 Child process started
MGT.child_exit               0         0.00 Child process normal exit
MGT.child_stop               0         0.00 Child process unexpected exit
MGT.child_died               1         0.06 Child process died (signal)
MGT.child_dump               1         0.06 Child process core dumped
MGT.child_panic              0         0.00 Child process panic

So -QUIT does not panic it, it just core-dumps/signal-kills it.
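
If it would help, I believe there is a hidden debug CLI command that forces a real panic rather than a signal kill; I have not checked whether it leads to the same varnishstat behaviour, but it should bump MGT.child_panic instead of MGT.child_died:

# varnishadm debug.panic.worker
# varnishstat -1 | grep MGT.child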

But this bug report is not really about the panicking; rather, it is about the fact that when a child process is restarted, varnishstat shows "too much" info.


dridi commented Jul 1, 2020

Collecting the panic output could have helped us figure out how you ended up in that situation, and maybe reproduce the bug. We'll probably have a look anyway.


stromnet commented Jul 1, 2020

Good to know, was not aware of that.

On the real machines I've enabled core dumps (ulimit) and am now waiting for the next crash (it has happened on 6 of 6 machines so far, all getting an equal share of traffic, so I'd be surprised if it does not occur again). I will use that command as well whenever it happens!
But yeah, regardless of what originally caused these panics, the same state can be induced by just killing the child process.
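
For completeness, this is roughly how I enabled the dumps (a sketch; it assumes the stock RHEL7 RPM unit name varnish.service and a systemd-style limit rather than a plain ulimit in a wrapper script):

# mkdir -p /etc/systemd/system/varnish.service.d
# cat > /etc/systemd/system/varnish.service.d/coredump.conf <<'EOF'
[Service]
LimitCORE=infinity
EOF
# systemctl daemon-reload
# systemctl restart varnish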


bsdphk commented Jul 13, 2020

I think this might be related to #3362

nigoroll commented

Pretty sure this has a different root cause (a fix for #3362 is coming); will look.

nigoroll self-assigned this Dec 11, 2020