
no metrics during sync progress #15

Open
alevchuk opened this issue Aug 24, 2020 · 8 comments
Labels
needs more info More information is required

Comments

@alevchuk
Contributor

My node was down for a few days, after resuming it, there were no metrics while the sync was happening.

2020-08-22T23:58:25Z ERROR Retry after exception socket.timeout: timed out
2020-08-22T23:58:55Z ERROR Retry after exception socket.timeout: timed out
2020-08-22T23:59:26Z ERROR Retry after exception socket.timeout: timed out
2020-08-23T00:00:06Z ERROR Retry after exception socket.timeout: timed out
2020-08-23T00:00:37Z ERROR Retry after exception socket.timeout: timed out
2020-08-23T00:01:08Z ERROR Retry after exception socket.timeout: timed out
2020-08-23T00:01:39Z ERROR Retry after exception socket.timeout: timed out
2020-08-23T00:02:11Z ERROR Retry after exception socket.timeout: timed out
2020-08-23T00:02:43Z ERROR Retry after exception socket.timeout: timed out
2020-08-23T00:03:17Z ERROR Retry after exception socket.timeout: timed out
2020-08-23T00:03:53Z ERROR Retry after exception socket.timeout: timed out
2020-08-23T00:04:31Z ERROR Retry after exception socket.timeout: timed out
2020-08-23T00:04:44Z ERROR Refresh failed during retry. Cause: max timeout exceeded while retrying task: 300s
2020-08-23T00:04:44Z INFO Refresh took 0:06:57.398758 seconds, sleeping for 600.0 seconds

Misc:

  • I increased all the timeouts / retry counts / sleep intervals that I could find, yet not a single data point got through.
  • I have a magnetic disk.
  • After the sync was done, metrics resumed as normal.

Monitoring is most needed during unusual conditions like this one. I don't think it makes sense to export any other metrics during sync, but the sync-progress metric is important while syncing. For example, it enables sync performance debugging.
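The idea above can be sketched with a stdlib-only fragment: only `getblockchaininfo` is queried (it keeps answering during initial block download), and the result is rendered in the Prometheus text exposition format. In the real exporter this would be a `prometheus_client` Gauge; `rpc_call` here is a hypothetical stand-in for the exporter's RPC helper.

```python
# Minimal sketch: export only the sync-progress metric during IBD.
# rpc_call is an assumed helper, not the exporter's actual API.
def sync_progress_sample(rpc_call):
    # getblockchaininfo responds even while the node is catching up,
    # so this metric can be updated when other RPCs time out.
    info = rpc_call("getblockchaininfo")
    progress = float(info["verificationprogress"])
    # Prometheus text exposition format: "<name> <value>"
    return f"bitcoin_verification_progress {progress}"
```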

@jvstein
Owner

jvstein commented Aug 26, 2020

Based on the socket timeout error message, I suspect the RPC server was totally down during that time.

Were you running version 0.20.0? Do you have the logs from the bitcoin node?

@alevchuk
Contributor Author

alevchuk commented Aug 26, 2020

Based on the socket timeout error message, I suspect the RPC server was totally down during that time.

No, it was responding to src/bitcoin-cli getblockchaininfo. It took a couple of seconds, but it worked. What is the current timeout for socket.timeout?

Were you running version 0.20.0?

I upgraded to v0.20.1 while this was happening. Didn't help.

Do you have the logs from the bitcoin node?

Yes, but they just had the regular logs. No errors.

e.g.

2020-08-22T22:08:35Z UpdateTip: new best=0000000000000000000a3c2821514f18f40fd85359a0fc330729f73f945085ec height=644579 version=0x37ffe000 log2_work=92.217442 tx=560402682 date='2020-08-20T16:32:15Z' progress=0.998863 cache=7.9MiB(58173txo)

After sync finished, the monitor started working again.

@jvstein
Owner

jvstein commented Aug 26, 2020

No, it was responding to src/bitcoin-cli getblockchaininfo.

Got it. Then the uptime call was likely successful. I'm guessing you saw the bitcoin_exporter_errors metric climb significantly during that period.

What is the current timeout for socket.timeout?

The timeout sent into the rpc client should be the same TIMEOUT value. It's 30s by default, which based on the timestamps might be what you had initially.
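The behaviour visible in the original logs (per-attempt socket timeouts, capped by a 300s overall retry budget) can be sketched roughly like this; the function name, signature, and backoff handling are illustrative, not the exporter's actual code:

```python
# Illustrative retry loop: each attempt may raise socket.timeout, and the
# whole task gives up once a total retry budget (300s in the logs above)
# is exhausted. All names here are assumptions for the sketch.
import socket
import time

def retry_with_budget(task, budget_seconds=300.0, backoff=1.0):
    deadline = time.monotonic() + budget_seconds
    while True:
        try:
            return task()
        except socket.timeout:
            if time.monotonic() + backoff > deadline:
                raise TimeoutError(
                    f"max timeout exceeded while retrying task: {budget_seconds:.0f}s"
                )
            time.sleep(backoff)
```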

I've honestly never run into a similar issue, even during syncs. Are you running the node and the exporter together on an under-powered machine?

Right now there's a global try/except around the whole metric update. An exhaustion of retries will fail all subsequent metrics. It could be updated to do a best effort update on all metrics.
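A best-effort variant of that refresh could look something like the sketch below, where each metric gets its own try/except so a single failing RPC no longer aborts the whole pass. The `collectors` mapping and `on_error` callback are assumptions for illustration, not the exporter's real symbols.

```python
# Sketch of a best-effort metric refresh: one failure loses only its own
# metric. collectors maps metric name -> zero-arg callable; on_error is a
# hypothetical hook (e.g. incrementing bitcoin_exporter_errors).
import logging

def refresh_all(collectors, on_error):
    results = {}
    for name, collect in collectors.items():
        try:
            results[name] = collect()
        except Exception as exc:  # deliberate catch-all, scoped per metric
            logging.warning("metric %s failed: %s", name, exc)
            on_error(name)
    return results
```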

@alevchuk
Contributor Author

alevchuk commented Aug 26, 2020

No, it was responding to src/bitcoin-cli getblockchaininfo.

Got it. Then the uptime call was likely successful. I'm guessing you saw the bitcoin_exporter_errors metric climb significantly during that period.

It did: [error chart attached]

I've honestly never run into a similar issue, even during syncs. Are you running the node and the exporter together on an under-powered machine?

I'm running both bitcoind and the exporter on a t3a.small with magnetic storage; it meets my needs and the sync was reasonably fast.

Right now there's a global try/except around the whole metric update. An exhaustion of retries will fail all subsequent metrics. It could be updated to do a best effort update on all metrics.

Yes, it looks like the root cause is there. It's currently all or nothing; allowing some metrics through would be much better behavior. Also, is there a way to prioritize blockchaininfo even if other RPCs throw exceptions?

@jvstein
Owner

jvstein commented Aug 26, 2020

I'm running both bitcoind and the exporter on a t3a.small with magnetic storage; it meets my needs and the sync was reasonably fast.

Just curious because my node starts around 1GB memory usage and climbs pretty quickly from there. My instance is pretty unconstrained, memory-wise, and I only see bitcoin.rpc.InWarmupError in my historical metrics.

[screenshot attached: screenshot-20200826-230726]

Hopefully I can reproduce by artificially limiting the available memory.

Also, is there a way to prioritize blockchaininfo even if other RPC's throw exceptions?

I'll try to just make each metric independent, instead of prioritizing them.

@jvstein
Owner

jvstein commented Sep 7, 2020

@alevchuk - I just pushed a new branch with a rewrite of the metric refresh to run the RPC calls in parallel and also be more lenient with failures.
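A rough, stdlib-only sketch of what "parallel and more lenient" could mean: fire all the RPC calls at once and let a slow or failing call lose only its own metric. This is an assumption about the branch's approach, not its actual code; `call_rpc` is a hypothetical stand-in for the real RPC client.

```python
# Sketch: run RPC calls concurrently and tolerate individual failures.
from concurrent.futures import ThreadPoolExecutor

def parallel_refresh(call_rpc, methods, timeout=30.0):
    # One worker per method; a failing call only affects its own entry.
    with ThreadPoolExecutor(max_workers=len(methods)) as pool:
        futures = {m: pool.submit(call_rpc, m) for m in methods}
        out = {}
        for method, fut in futures.items():
            try:
                out[method] = fut.result(timeout=timeout)
            except Exception:
                out[method] = None  # best effort: record the miss, move on
        return out
```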

https://github.com/jvstein/bitcoin-prometheus-exporter/tree/issue_15/async_refresh

Are you able to give it a test against your node?

@alevchuk
Contributor Author

alevchuk commented Sep 8, 2020

I repro'd again on latest master before applying the patch (stopped bitcoind for 1 hour to test the high-IO sync), then switched to the branch and ran bitcoind again, still with some lag and a high-IO sync.

Got this crash:
https://gist.github.com/alevchuk/693bfa8e88d6841c22973908e107a21f

Not sure if this was before or after the "Verifying last 6 blocks at level 3" phase, which takes a few minutes before opening network ports.

@alevchuk
Contributor Author

alevchuk commented Sep 8, 2020

Getting socket.timeout: timed out:
https://gist.github.com/alevchuk/52f8f14a6ddff548dcb920b47a5056ea

That's when starting the monitoring after bitcoind opens all network ports.

@jvstein jvstein added the needs more info More information is required label Feb 14, 2022