The default IdleTimeout for long connections of the go-zero http service is set to 2.4s #4545

Open
lystack opened this issue Jan 6, 2025 · 3 comments

lystack commented Jan 6, 2025

The default IdleTimeout for long connections of the http service is set to 2.4s.
When starting an http service with go-zero, go-zero sets ReadTimeout to 0.8 × the route timeout, which defaults to 3s, i.e. 2.4s. The relevant code is:

func (ng *engine) withTimeout() internal.StartOption {
 return func(svr *http.Server) {
 	timeout := ng.timeout
 	if timeout > 0 {
 		// factor 0.8, to avoid clients send longer content-length than the actual content,
 		// without this timeout setting, the server will time out and respond 503 Service Unavailable,
 		// which triggers the circuit breaker.
 		svr.ReadTimeout = 4 * timeout / 5
 		// factor 1.1, to avoid servers don't have enough time to write responses.
 		// setting the factor less than 1.0 may lead clients not receiving the responses.
 		svr.WriteTimeout = 11 * timeout / 10
 	}
 }
}

In the Go http.Server API, if IdleTimeout is not set but ReadTimeout is, IdleTimeout falls back to the value of ReadTimeout. That means once a keep-alive connection has been idle for 2.4 seconds, the server actively closes it.

type Server struct {
 // ReadTimeout is the maximum duration for reading the entire
 // request, including the body. A zero or negative value means
 // there will be no timeout.
 //
 // Because ReadTimeout does not let Handlers make per-request
 // decisions on each request body's acceptable deadline or
 // upload rate, most users will prefer to use
 // ReadHeaderTimeout. It is valid to use them both.
 ReadTimeout time.Duration

 // IdleTimeout is the maximum amount of time to wait for the
 // next request when keep-alives are enabled. If IdleTimeout
 // is zero, the value of ReadTimeout is used. If both are
 // zero, there is no timeout.
 IdleTimeout time.Duration

}
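
To make the numbers concrete, here is a minimal standalone sketch (not go-zero code) that applies the same factors to the default 3s route timeout; the fallback of IdleTimeout to ReadTimeout is exactly what net/http does when IdleTimeout is left at zero:

package main

import (
	"fmt"
	"time"
)

func main() {
	// The default go-zero route timeout is 3s; the factors mirror withTimeout above.
	timeout := 3 * time.Second

	readTimeout := 4 * timeout / 5    // 0.8 factor -> 2.4s
	writeTimeout := 11 * timeout / 10 // 1.1 factor -> 3.3s

	// go-zero never sets IdleTimeout, so net/http falls back to ReadTimeout.
	idleTimeout := readTimeout

	fmt.Println(readTimeout, writeTimeout, idleTimeout) // prints "2.4s 3.3s 2.4s"
}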

In my scenario, the gateway connects to the go-zero service through a keep-alive connection pool whose idle timeout is 60s. Under stress testing, roughly 0.001% of requests fail with a 502 gateway error.

To Reproduce

  1. Start an http service using go-zero
  2. Connect to it with a client that uses a keep-alive connection pool (see the client sketch below)
  3. Capture the traffic and inspect it in Wireshark; you can see that go-zero actively closes the idle keep-alive connection
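
A minimal client sketch for step 2, assuming the go-zero service is reachable at 127.0.0.1:8888 with a /ping route (both are assumptions for illustration); net/http/httptrace reports whether each request reused a pooled connection, so after an idle gap longer than 2.4s it prints "connection reused: false":

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptrace"
	"time"
)

func main() {
	// Hypothetical local go-zero endpoint; adjust to your own service.
	const url = "http://127.0.0.1:8888/ping"

	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) {
			fmt.Println("connection reused:", info.Reused)
		},
	}

	client := &http.Client{} // the default transport pools idle connections for 90s
	for i := 0; i < 3; i++ {
		req, err := http.NewRequest(http.MethodGet, url, nil)
		if err != nil {
			panic(err)
		}
		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
		resp, err := client.Do(req)
		if err != nil {
			panic(err)
		}
		io.Copy(io.Discard, resp.Body) // drain so the connection can return to the pool
		resp.Body.Close()
		// Idling longer than the server's 2.4s IdleTimeout lets it close the connection.
		time.Sleep(3 * time.Second)
	}
}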

Expected behavior
The default IdleTimeout should be set to a more reasonable value rather than falling back to ReadTimeout, and it should be configurable by users.

Screenshots
No screenshots

Environments (please complete the following information):

  • OS: Linux
  • go-zero version v1.6.5

More description
I can work around this by starting the server with rest.StartWithOpts and a custom IdleTimeout, but I still think this is a bug.

func withIdleTimeout(keepaliveTime int64) rest.StartOption {
   return func(svr *http.Server) {
   	svr.IdleTimeout = time.Duration(keepaliveTime) * time.Millisecond
   }
}
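
For completeness, a sketch of how the option above can be wired in at startup, assuming the rest.StartWithOpts API mentioned earlier; the inline RestConf literal and the /ping route are only for illustration (a real service would load its config from yaml), and 60000ms matches the gateway pool's 60s idle time:

package main

import (
	"net/http"
	"time"

	"github.com/zeromicro/go-zero/rest"
)

func withIdleTimeout(keepaliveTime int64) rest.StartOption {
	return func(svr *http.Server) {
		svr.IdleTimeout = time.Duration(keepaliveTime) * time.Millisecond
	}
}

func main() {
	// Illustrative inline config; a real service loads RestConf from a yaml file.
	server := rest.MustNewServer(rest.RestConf{
		Host:    "0.0.0.0",
		Port:    8888,
		Timeout: 3000, // route timeout in ms, same as the default
	})
	defer server.Stop()

	server.AddRoute(rest.Route{
		Method:  http.MethodGet,
		Path:    "/ping",
		Handler: func(w http.ResponseWriter, r *http.Request) { w.Write([]byte("ok")) },
	})

	// 60000ms matches the 60s idle time of the gateway's connection pool.
	server.StartWithOpts(withIdleTimeout(60000))
}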

adamqddnh commented

I think the current code is not a bug; rather, it's deliberate behavior that fine-tunes ReadTimeout and WriteTimeout for the server, accounting for edge cases that could otherwise lead to timeouts or failures. The factors used for scaling are intended to handle the following concerns:

  • ReadTimeout factor of 0.8: Ensures the server doesn’t timeout prematurely due to clients sending longer-than-expected content.
  • WriteTimeout factor of 1.1: Gives the server a little more time to send responses, ensuring that clients receive the entire response without a timeout.

What Would Happen if You Removed the Factors?
If you remove the scaling factors and replace the logic with a simpler version, like the one you propose, there are a few things to consider:

  1. ReadTimeout and WriteTimeout would be set exactly to the configured timeout:
  • Pros: Simplifies the code and could be more predictable. If you feel that the original timeout values are sufficient for most use cases, removing the factors might work fine.
  • Cons: It might not account for edge cases where the read or write operations take slightly longer due to network conditions or large content. The original design used factors to handle these scenarios, reducing the risk of timeouts that could trigger failures (like 503 Service Unavailable, which is mentioned in the code comments).
  2. Changing withIdleTimeout() to set IdleTimeout:
  • Purpose: The IdleTimeout you propose seems to serve a different purpose. It's not related to the specific handling of read and write timeouts. IdleTimeout controls how long an idle connection can remain open before the server closes it.
  • Impact: Changing the code to just set IdleTimeout doesn’t directly address the timeouts for reading and writing data. This would only affect the server's behavior with idle connections, which is a separate concern.

lystack commented Jan 7, 2025

I agree that it is necessary to set ReadTimeout and WriteTimeout, but after setting ReadTimeout, go-zero does not set IdleTimeout. IdleTimeout therefore falls back to the value of ReadTimeout, so the go-zero service actively closes keep-alive connections once they have been idle for 2.4s. I think this is unreasonable.
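
A standalone net/http sketch (plain stdlib, not go-zero code) of the behavior described above: with the values go-zero derives from the 3s route timeout, leaving IdleTimeout unset reproduces the 2.4s disconnect, while setting it explicitly decouples the idle lifetime from ReadTimeout:

package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr:         ":8080",
		ReadTimeout:  2400 * time.Millisecond, // what go-zero sets for a 3s route timeout
		WriteTimeout: 3300 * time.Millisecond,
		// Comment out the next line and IdleTimeout falls back to ReadTimeout,
		// so idle keep-alive connections are closed after 2.4s.
		IdleTimeout: 65 * time.Second,
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			w.Write([]byte("ok"))
		}),
	}
	log.Fatal(srv.ListenAndServe())
}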

lystack commented Jan 7, 2025

In addition, the 502 error occurs when go-zero actively closes the persistent connection while the upstream reuses it: the upstream may not have received the TCP FIN packet yet, so this has nothing to do with ReadTimeout or WriteTimeout. It is a low-probability problem that only shows up in high-concurrency scenarios; in my case, the probability of a 502 is 0.001% at 1000 QPS.
