
varnishstat reports "cold" VCLs after panic #3358

Closed
stromnet opened this issue Jul 1, 2020 · 6 comments


stromnet commented Jul 1, 2020

For some reason (not yet determined, and not really related to this issue), we have had panics in some Varnish child processes. After this, prometheus_varnish_exporter started to choke on duplicate metrics, as it does not handle multiple VCL instances being present in varnishstat output.

After some digging it turned out that varnishstat, which normally only reports metrics for warm VCLs, was reporting cold VCLs as well. It was then noticed that the child process had panicked and been restarted.
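
For reference, as far as I know the exporter scrapes the JSON interface, so this is roughly how I checked which VCL instances show up there (a minimal sketch; it assumes the jq tool is installed and handles both the flat and the "counters"-wrapped JSON layouts that varnishstat -j produces depending on version):

# varnishstat -j | jq -r '(.counters // .) | keys[]' | grep '^VBE\.'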

In the Docker reproduction below, -QUIT is not registered as a panic but only as "Child process died (signal)". On my real RHEL7 systems, -QUIT triggers "Child process core dumped" and, I believe, "Child process panic".

Reproduced in 6.3.1, 6.3.2 and 6.4.0 from CentOS7 RPMs, and, as shown below, in the latest varnish Docker image.

Expected Behavior

Cold VCLs should not be reported in varnishstat after a child panic, i.e. the same behaviour as before the panic.

Current Behavior

Cold VCLs appear in varnishstat after a child panic has occurred, but not before.

Possible Solution

No idea.

Steps to Reproduce (for bugs)

Using varnish docker image:

$ docker run --rm -it --name varnish-test varnish:latest varnishd -F -f /etc/varnish/default.vcl -p vcl_cooldown=1
$ docker exec -it varnish-test bash
# apt update && apt install -y procps
# varnishstat -1|grep happy
VBE.boot.default.happy                        0          .   Happy health probes
# varnishadm vcl.list
active   auto    warm         0    boot
# varnishreload
VCL 'reload_20200701_113808_693' compiled
VCL 'reload_20200701_113808_693' now active
# varnishadm vcl.list
available   auto    cold         0    boot
active      auto    warm         0    reload_20200701_113808_693
# varnishstat -1|grep happy
VBE.reload_20200701_113808_693.default.happy            0          .   Happy health probes

As seen above, we have one cold and one warm VCL, and only the warm one is exposed in varnishstat.

Now, find the child pid and force a panic:

# ps axuww|grep varnishd
varnish      1  0.0  0.0  18608  5688 pts/0    SLs+ 11:33   0:00 varnishd -F -f /etc/varnish/default.vcl -p vcl_cooldown=1
vcache      18  0.1  1.4 284996 88896 pts/0    SLl+ 11:33   0:00 varnishd -F -f /etc/varnish/default.vcl -p vcl_cooldown=1
root       948  0.0  0.0   3084   892 pts/1    S+   11:38   0:00 grep varnishd
# kill -QUIT 18

Now varnishstat reports the cold VCL too:

# varnishadm vcl.list
available   auto    cold         0    boot
active      auto    warm         0    reload_20200701_113808_693
# varnishstat -1|grep happy
VBE.boot.default.happy                        0          .   Happy health probes
VBE.reload_20200701_113808_693.default.happy            0          .   Happy health probes

Context

After at least one reload plus one panic has occurred, prometheus_varnish_exporter fails. That is itself imperfect behaviour, tracked in jonnenauha/prometheus_varnish_exporter#57, but this still seems like a bug in Varnish.
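
A rough way to see the collision from the command line (a sketch, assuming the exporter keys its backend series on the backend part of VBE.<vcl>.<backend>.*, so the same backend name under two VCL prefixes collides); before the panic this lists only the active VCL's backends, afterwards the cold boot VCL's backends show up as well:

# varnishstat -1 | awk -F'.' '/^VBE\./ { print $2 "/" $3 }' | sort -u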

On the plus side, this made me catch the fact that the Varnish child had panicked. The root cause for that is TBD; core dumps are enabled now.

Your Environment

  • Version used: 6.3.1, but reproduced in latest
  • Operating System and version: RHEL7
  • Source of binary packages used (if any): RPM & docker debs (in example above), no additional modules.

dridi commented Jul 1, 2020

Do you get something from varnishadm panic.show?


stromnet commented Jul 1, 2020

In the docker variant it says:

# varnishadm panic.show
Child has not panicked or panic has been cleared
Command failed with error code 300

And the same when using -QUIT on the RHEL machine.
Here are the stats from -QUIT on the RHEL machine, by the way (the one I was unsure about above):

# varnishstat -1|head
MGT.uptime               23940      1496.25 Management process uptime
MGT.child_start              2         0.12 Child process started
MGT.child_exit               0         0.00 Child process normal exit
MGT.child_stop               0         0.00 Child process unexpected exit
MGT.child_died               1         0.06 Child process died (signal)
MGT.child_dump               1         0.06 Child process core dumped
MGT.child_panic              0         0.00 Child process panic

So -QUIT does not panic it, it just core-dumps/signal-kills it.
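
If it would help, I believe there is a hidden debug CLI command that forces a real panic rather than a signal kill; I have not checked whether it leads to the same varnishstat behaviour, but it should bump MGT.child_panic instead of MGT.child_died:

# varnishadm debug.panic.worker
# varnishstat -1 | grep MGT.child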

But this bug report is not really about the panicking; rather, it is about the fact that when a child process is restarted, varnishstat shows "too much" info.


dridi commented Jul 1, 2020

Collecting the panic output could have helped us figure out how you ended up in that situation, and maybe reproduce the bug. We'll probably have a look anyway.


stromnet commented Jul 1, 2020

Good to know, was not aware of that.

On the real machines I've enabled core dumps (ulimit) and am now waiting for the next crash (it has happened on 6 of 6 machines so far, all getting an equal share of traffic, so I'd be surprised if it does not occur again). I will use that command as well whenever it happens!
But yeah, regardless of what originally caused these panics, the same state can be induced by just killing the child process.
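
For completeness, this is roughly how I enabled the dumps (a sketch; it assumes the stock RHEL7 RPM unit name varnish.service and a systemd-style limit rather than a plain ulimit in a wrapper script):

# mkdir -p /etc/systemd/system/varnish.service.d
# cat > /etc/systemd/system/varnish.service.d/coredump.conf <<'EOF'
[Service]
LimitCORE=infinity
EOF
# systemctl daemon-reload
# systemctl restart varnish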


bsdphk commented Jul 13, 2020

I think this might be related to #3362

nigoroll commented

Pretty sure this has a different root cause (a fix for #3362 is coming); will look.

nigoroll self-assigned this Dec 11, 2020