In one of the previous posts I briefly described what the Hystrix Circuit Breaker is and why it's worth considering. Now I will go into more detail and explain what the key parameters are and how they affect its behavior.
When you apply the circuit breaker concept with Hystrix in your source code, you should set appropriate values for these three key parameters:
- request volume threshold
- sleep window in milliseconds
- error threshold percentage.
What do they mean and how do they actually work?
The request volume threshold is the minimum number of requests that must go through the circuit breaker before the breaking strategy is applied. When the number of requests is lower than this value, the circuit will never open.
With this number you can make sure the circuit breaker only starts its work at a certain throughput of requests.
What is important is that this number is counted separately for each sleep window. For example, when you set it to 20 (the default value), requests are short-circuited only after at least 20 requests have been seen in the sleep window.
When there are fewer of them, they are all passed to the downstream service, even if they all fail.
The error threshold percentage controls when requests should be short-circuited and no longer passed downstream. The default value is 50%.
Once you have reached the request volume threshold in a given sleep window, this number determines whether subsequent requests should be cut off. If at least 50% of the previous requests failed, no further requests are passed on and the fallback response is served instead.
If you want to be more sensitive and stop passing requests to a downstream resource sooner, set this number to a lower value. For example, with a value of 10, requests will be short-circuited once 10% of them have failed.
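The interplay between the two thresholds can be sketched in a few lines of plain Java. This is only a simplified model of the decision described above, not Hystrix's actual implementation; the class and method names (`TripDecision`, `shouldTrip`) are made up for illustration.

```java
class TripDecision {

    /**
     * Decides whether the circuit should open, given the request counts
     * observed in the current window.
     */
    public static boolean shouldTrip(int totalRequests, int failedRequests,
                                     int requestVolumeThreshold, int errorThresholdPercentage) {
        // Below the volume threshold the circuit never opens, even if every request failed.
        if (totalRequests == 0 || totalRequests < requestVolumeThreshold) {
            return false;
        }
        // Integer percentage of failed requests in the window.
        int errorPercentage = failedRequests * 100 / totalRequests;
        return errorPercentage >= errorThresholdPercentage;
    }

    public static void main(String[] args) {
        System.out.println(shouldTrip(19, 19, 20, 50)); // false: only 19 requests, volume threshold not met
        System.out.println(shouldTrip(20, 10, 20, 50)); // true: 50% failed, threshold is 50%
        System.out.println(shouldTrip(20, 9, 20, 50));  // false: 45% failed, below 50%
        System.out.println(shouldTrip(20, 2, 20, 10));  // true: 10% failed, threshold lowered to 10%
    }
}
```

The first call shows why a low-traffic command keeps hitting a failing downstream service: with fewer than 20 requests the error percentage is never even consulted.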
The sleep window is the last very important property of the Hystrix Circuit Breaker. It controls the size of the window of received requests. The default value is 5 seconds. For the circuit breaker to work, everything must happen within that window. When you receive at least 20 requests within 5 seconds (volume threshold) and a certain percentage of them fail (error threshold), the circuit will be open for the following 5 seconds. After that time, the first request will be passed to the downstream resource and, if it succeeds, the circuit will be closed again. If you set the sleep window too high, then in case of failures your circuit will stay open for a long period of time. So if you want to measure execution in a 30-second window, keep in mind that it may result in a 30-second open-circuit period.
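The open, wait, retry cycle described above can be modeled as a small state machine. The sketch below is a simplified illustration in plain Java, not Hystrix code; the class `SleepWindowBreaker`, its methods and the injected clock are all hypothetical names chosen so the example can run without actually sleeping.

```java
import java.util.function.LongSupplier;

class SleepWindowBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final long sleepWindowMs;
    private final LongSupplier clock; // supplies the current time in milliseconds
    private State state = State.CLOSED;
    private long openedAtMs;

    SleepWindowBreaker(long sleepWindowMs, LongSupplier clock) {
        this.sleepWindowMs = sleepWindowMs;
        this.clock = clock;
    }

    /** Opens the circuit; called once the volume and error thresholds are both exceeded. */
    void trip() {
        state = State.OPEN;
        openedAtMs = clock.getAsLong();
    }

    /** Returns true if the request may go downstream, false if the fallback should be served. */
    boolean allowRequest() {
        if (state == State.CLOSED) {
            return true;
        }
        if (state == State.OPEN && clock.getAsLong() - openedAtMs >= sleepWindowMs) {
            // Sleep window elapsed: let exactly one trial request through.
            state = State.HALF_OPEN;
            return true;
        }
        // Still inside the sleep window, or a trial request is already in flight.
        return false;
    }

    /** A successful trial request closes the circuit again. */
    void onSuccess() {
        if (state == State.HALF_OPEN) {
            state = State.CLOSED;
        }
    }

    /** A failed trial request re-opens the circuit for another sleep window. */
    void onFailure() {
        if (state == State.HALF_OPEN) {
            trip();
        }
    }

    State state() {
        return state;
    }

    public static void main(String[] args) {
        long[] now = {0};
        SleepWindowBreaker breaker = new SleepWindowBreaker(5000, () -> now[0]);
        breaker.trip();
        now[0] = 3000;
        System.out.println(breaker.allowRequest()); // false: still inside the 5-second sleep window
        now[0] = 5000;
        System.out.println(breaker.allowRequest()); // true: single trial request
        breaker.onSuccess();
        System.out.println(breaker.state());        // CLOSED
    }
}
```

Injecting the clock is purely a design choice for this sketch; it makes the effect of a long sleep window easy to try out with different values.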
Edit - February 11th, 2019
There are actually two important properties when it comes to controlling the sleep window.
The first is circuitBreaker.sleepWindowInMilliseconds, which says how long the circuit stays tripped, not allowing any request to pass (and returning a fallback response instead).
The other one is metrics.rollingStats.timeInMilliseconds. This property controls the size of the window over which requests are monitored. When the request volume threshold is met within that window with a certain error percentage, the circuit will be open for circuitBreaker.sleepWindowInMilliseconds.
So you can monitor your requests in e.g. a 30-second window (metrics.rollingStats.timeInMilliseconds) and, in case of a certain volume and error percentage, open the circuit for e.g. 5 seconds (circuitBreaker.sleepWindowInMilliseconds).
Credits to Devin Whalen and Anders for this edit.
There are several ways of specifying those numbers. You can set up defaults with a properties file, like this:
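A minimal sketch of such a file, assuming the properties are picked up through Hystrix's standard configuration mechanism (for example a `config.properties` file on the classpath; the file name is an assumption). The values shown are the Hystrix defaults.

```properties
# Minimum number of requests in the monitored window before the circuit may open
hystrix.command.default.circuitBreaker.requestVolumeThreshold=20
# Percentage of failed requests at or above which the circuit opens
hystrix.command.default.circuitBreaker.errorThresholdPercentage=50
# How long the circuit stays open before a trial request is let through
hystrix.command.default.circuitBreaker.sleepWindowInMilliseconds=5000
# Size of the window over which requests are counted
hystrix.command.default.metrics.rollingStats.timeInMilliseconds=10000
```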
You can set them per command:
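One way to do this is to replace `default` in the property key with the command key. The sketch below assumes a command whose key is `RemoteServiceCommand`; that name and the values are hypothetical.

```properties
# Overrides that apply only to the command with key "RemoteServiceCommand" (hypothetical name)
hystrix.command.RemoteServiceCommand.circuitBreaker.requestVolumeThreshold=10
hystrix.command.RemoteServiceCommand.circuitBreaker.errorThresholdPercentage=25
hystrix.command.RemoteServiceCommand.circuitBreaker.sleepWindowInMilliseconds=10000
```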
Or you can set them programmatically as well:
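A sketch of the programmatic variant, assuming `hystrix-core` is on the classpath. The command class, the group key and the body of `run()` are placeholders made up for illustration; only the `Setter` calls are the point here.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;

// Hypothetical command; the group key and the run() body are placeholders.
public class RemoteServiceCommand extends HystrixCommand<String> {

    public RemoteServiceCommand() {
        super(Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("RemoteServiceGroup"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        // Minimum number of requests before the circuit may open
                        .withCircuitBreakerRequestVolumeThreshold(20)
                        // Error percentage at or above which the circuit opens
                        .withCircuitBreakerErrorThresholdPercentage(50)
                        // How long the circuit stays open before a trial request
                        .withCircuitBreakerSleepWindowInMilliseconds(5000)
                        // Size of the window over which requests are counted
                        .withMetricsRollingStatisticalWindowInMilliseconds(10000)));
    }

    @Override
    protected String run() {
        // Placeholder for the real call to the downstream service.
        return "response from downstream";
    }

    @Override
    protected String getFallback() {
        // Served when run() fails or the circuit is open.
        return "fallback response";
    }
}
```

With a command like this, `new RemoteServiceCommand().execute()` would return the fallback response whenever the circuit is open.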
For more information, go directly to the official docs.