Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error "varnish_main_bans label:<name:\"action\" value:\"gone\" > gauge:<value:1 > should be a Counter" when scraping metrics for Varnish 3.0 #50

Closed
glennslaven opened this issue Aug 21, 2019 · 4 comments · Fixed by #51

Comments

@glennslaven
Copy link
Contributor

  • Varnish 3.0.5
  • prometheus_varnish_exporter 1.5.1

So I really should have checked this when I checked 4.0, but along with the a flag, Varnish 3 also supported the i flag which should be treated as a gauge. The default case in the flag switch does treat it as a gauge already though.

Yes, we still have Varnish 3 that needs supporting, sorry ;)

The issue here seems to be in the grouping of the n_ban* metrics with an action label. In Varnish 3 the metrics that start with n_ban have a mix of flags. n_ban and n_ban_gone are i flags & should be gauges, but the rest are a types and so should be a counter. (see https://github.com/varnishcache/varnish-cache/blob/3.0/include/vsc_fields.h#L142-L148).

In Varnish 4+ each of these is not grouped together and they get stored as separate metric names like this:

        # HELP varnish_main_bans Count of bans
        # TYPE varnish_main_bans gauge
        varnish_main_bans 1
        # HELP varnish_main_bans_added Bans added
        # TYPE varnish_main_bans_added counter
        varnish_main_bans_added 1

But in Varnish 3 it tries to group them:

        # HELP varnish_main_bans Number of bans
        # TYPE varnish_main_bans counter
        varnish_main_bans{action="add"} 1
        varnish_main_bans{action="dups"} 0
        varnish_main_bans{action="obj_test"} 0
        varnish_main_bans{action="re_test"} 0
        varnish_main_bans{action="retire"} 0
        # HELP varnish_main_bans_total Number of bans
        # TYPE varnish_main_bans_total gauge
        varnish_main_bans_total 1

And the n_ban_gone metric isn't there due to the

collected metric varnish_main_bans label:<name:\"action\" value:\"gone\" > gauge:<value:1 >  should be a Counter

error.

I think this is because in Varnish 4 the metric name was changed from n_ban to n_bans and so the grouping is no longer matching.

@glennslaven glennslaven changed the title Error "varnish_main_bans label:<name:\"action\" value:\"gone\" > gauge:<value:1 > should be a Counter" when scraping metrics for varnish 3.0 Error "varnish_main_bans label:<name:\"action\" value:\"gone\" > gauge:<value:1 > should be a Counter" when scraping metrics for Varnish 3.0 Aug 21, 2019
@glennslaven glennslaven mentioned this issue Aug 21, 2019
@jonnenauha
Copy link
Owner

jonnenauha commented Oct 13, 2019

I see your point, we should fix the bug. But i still see MAIN.bans_xxx in 4.0.5 and above test files. So the grouping seems to be used.

I'm afraid this will break the groupings for existing users, change the metric names and their dasboards stop working for bans after this update.

As the grouping does not seem to be broken other than with 3.x, could we just remove the "ban" grouping during runtime when we are dealing with a major version of 3 as its queried at the start of the program.

It's bit hard to nowadays upkeep this as I don't use Varnish anymore for day to day work. Testing these different versions locally is also pain in the ass. My main concern is that we don't break/change the metric names for users with > 3.x varnish, which is the majority here.

@glennslaven
Copy link
Contributor Author

Sorry, I may be missing something, but as I understood it the grouping creates the metrics with the labels like varnish_main_bans{action="add"}.

So Varnish 4+ has always had the metrics like varnish_main_bans_added and continue to after this PR.

@glennslaven
Copy link
Contributor Author

We've been running with #51 for a while now and all our varnish 4+ instances still have all the varnish_main_bans_... metrics, it's just that the varnish 3 ones have varnish_main_n_ban_add & varnish_main_n_ban_dups, etc rather than varnish_main_bans{action="add"} & varnish_main_bans{action="dups"}.

On the testing different versions, in the PR I did introduce the Test_PrometheusExport func which actually gets Prometheus to render the metrics based on the test scrape files, so at least we don't need to run varnish to test the generated output. My purpose there was to reproduce the error which didn't happen until the actual rendering of the metrics. However, that test could be expanded to also validate that certain metrics are generated as expected.

I understand about maintaining something you don't use anymore, it's difficult. I really appreciate the work you've done on this module, it's made my work much easier!

@jonnenauha
Copy link
Owner

Sorry it took me a while to get to this... don't have very much time nowadays to respond quickly here.

You are absolutely right. I just checked my 5.x production varnish and it does not go the groupings as intended. If I understand this correctly now, 3.x used to do the groupings and that was fine. But now that we started applying the new counter type there is a mix of gauge and counter typed ban metrics from varnish. So this breaks now 3.x as it tries to squeeze gauges and counters into a single output metrics (with type labels).

I was a bit confused by the whole thing, thanks for clarifying.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants