Little’s law states that the long-term average number L of customers in a stationary system is equal to the long-term average effective arrival rate λ multiplied by the average time W that a customer spends in the system.
Where
L = Average number of customers in a stationary system
λ = Average arrival rate in the system
W = Average time a customer spends in the system
In the context of an API this means:
L = Average number of concurrent requests in the system
λ = Average arrival rate of requests in the system
W = Average latency of each request
If an API endpoint takes 100 ms on average and receives 100 requests per second, then the average number of concurrent requests in the system is 10.
L = 100 * (100/1000) = 10
If the arrival rate remains constant, then latency (W) is directly proportional to the number of concurrent requests (L).
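As a quick sanity check of the arithmetic, here is a minimal Java sketch; the class and method names are mine, purely for illustration.
public class LittlesLaw {

    // L = lambda * W
    static double averageConcurrency(double arrivalRatePerSecond, double averageLatencySeconds) {
        return arrivalRatePerSecond * averageLatencySeconds;
    }

    public static void main(String[] args) {
        // 100 requests per second at 100 ms average latency -> 10 concurrent requests
        System.out.println(averageConcurrency(100, 0.100)); // prints 10.0
    }
}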
Let’s see it in action.
We will create a simple Spring Boot (Java 17) application by going to start.spring.io.
The whole application is a single file, shown below.
package com.example.myapi;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

import java.util.Map;

@SpringBootApplication
public class RateLimitingDemoApplication {

    public static void main(String[] args) {
        SpringApplication.run(RateLimitingDemoApplication.class, args);
    }
}

@RestController
class HomeController {

    // Handles GET / and simulates 100 ms of work per request
    @GetMapping
    public Map<String, String> home() throws Exception {
        Thread.sleep(100);
        return Map.of("page", "home");
    }
}
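To start the application, assuming you generated a Maven project from start.spring.io, the standard Spring Boot plugin command works (for a Gradle project the equivalent is ./gradlew bootRun):
./mvnw spring-boot:run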
If we make a request to http://localhost:8080 we get the following response. The request takes at least 100 ms because of the Thread.sleep(100) in the handler.
{
"page": "home"
}
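One quick way to confirm the latency (curl is my choice here, not part of the original setup) is to ask curl to print the total request time:
curl -s -w "\ntime_total: %{time_total}s\n" http://localhost:8080
The reported time should come out a little over 0.1 seconds.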
In the calculation above I wrote that our API serves on average 10 concurrent requests when it is hit with 100 requests per second and the average latency is 100 ms.
To show this we will set server.tomcat.threads.max to 10 and server.tomcat.accept-count to 100 in application.properties.
server.tomcat.threads.max: the maximum number of worker threads.
server.tomcat.accept-count: the maximum queue length for incoming connection requests when all possible request processing threads are in use.
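Putting those two settings together, application.properties for this test looks like this:
server.tomcat.threads.max=10
server.tomcat.accept-count=100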
With 10 worker threads and each request holding a thread for about 100 ms, the server can complete at most 10 / 0.1 s = 100 requests per second. So with this configuration our API will never serve more than 100 requests per second, but its latency will start to go up as we put more load on it.
Restart the server.
We will use a CLI tool called hey to load test our service. If you are on a Mac you can install hey with brew; for other platforms, refer to the project homepage.
brew install hey
We will run hey with different combinations and show that we will always remain under 100 requests per second.
In the first test case we make 1000 requests in total (-n), using 10 workers (-c), with each worker rate limited to 10 requests per second (-q).
hey -n 1000 -c 10 -q 10 http://localhost:8080/
Summary:
Total: 10.5063 secs
Slowest: 0.1072 secs
Fastest: 0.1005 secs
Average: 0.1040 secs
Requests/sec: 95.1808
Latency distribution:
10% in 0.1014 secs
25% in 0.1021 secs
50% in 0.1035 secs
75% in 0.1050 secs
90% in 0.1058 secs
95% in 0.1063 secs
99% in 0.1066 secs
As you can see above, we get 95.1808 requests per second.
L = 95.1808 * 0.1040
L = 9.89
The above shows that with this hey configuration our system was serving about 10 requests at a time. The average client-perceived latency of 104 ms was close to the 100 ms latency of the API endpoint itself.
Let’s run our second test case. We are again making 1000 requests in total; this time with 20 concurrent workers, each still rate limited to 10 requests per second.
hey -n 1000 -c 20 -q 10 http://localhost:8080/
Summary:
Total: 10.3901 secs
Slowest: 0.3104 secs
Fastest: 0.1019 secs
Average: 0.2043 secs
Requests/sec: 96.2455
Latency distribution:
10% in 0.2020 secs
25% in 0.2040 secs
50% in 0.2055 secs
75% in 0.2071 secs
90% in 0.2085 secs
95% in 0.2093 secs
99% in 0.2160 secs
As you can see above, we get 96.2455 requests per second.
L = 96.2455 * 0.2043
L = 19.66
The above shows that with this hey configuration our system was serving close to 20 requests at a time. The average client-perceived latency of about 205 ms was almost double the 100 ms latency of the API endpoint: with 20 concurrent requests and only 10 worker threads, a typical request waits roughly one service time in the queue before it is processed.
We will run our third test case, setting the number of concurrent workers to 100 and the per-worker rate limit to 100 requests per second.
hey -n 1000 -c 100 -q 100 http://localhost:8080/
Summary:
Total: 10.3540 secs
Slowest: 1.1458 secs
Fastest: 0.1056 secs
Average: 0.9862 secs
Requests/sec: 96.5809
Latency distribution:
10% in 0.9295 secs
25% in 1.0287 secs
50% in 1.0334 secs
75% in 1.0385 secs
90% in 1.0419 secs
95% in 1.0494 secs
99% in 1.1396 secs
As you can see, the server still tops out at about 96 requests per second, below the 100 rps ceiling we calculated.
L = 96.5809 * 0.9862
L = 95.24
The above shows that with this hey configuration our system was serving close to 95 requests at a time. The average client-perceived latency of about 986 ms was almost ten times the 100 ms latency of the API endpoint: with 100 concurrent requests and only 10 worker threads, a typical request waits behind roughly nine rounds of work before it is processed.
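This blow-up is pure queueing. The sketch below is not part of the original service; it just mimics the situation with a plain fixed thread pool of 10 standing in for Tomcat's worker threads, and shows that 100 tasks of 100 ms each take roughly a second to drain.
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class QueueingDemo {
    public static void main(String[] args) throws InterruptedException {
        int workerThreads = 10;        // stands in for server.tomcat.threads.max=10
        int concurrentRequests = 100;  // stands in for hey's 100 concurrent workers
        ExecutorService pool = Executors.newFixedThreadPool(workerThreads);
        CountDownLatch done = new CountDownLatch(concurrentRequests);

        long start = System.nanoTime();
        for (int i = 0; i < concurrentRequests; i++) {
            pool.submit(() -> {
                try {
                    Thread.sleep(100); // the 100 ms "endpoint"
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            });
        }
        done.await();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        pool.shutdown();

        // With 10 threads draining 100 tasks of 100 ms each, this prints roughly 1000 ms:
        // the last tasks wait behind ~9 rounds of work, matching the ~1 s latency above.
        System.out.println("All " + concurrentRequests + " requests finished in ~" + elapsedMs + " ms");
    }
}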
Let’s do the final test by setting server.tomcat.threads.max=100 in application.properties and restarting the server.
hey -n 1000 -c 100 -q 100 http://localhost:8080/
Summary:
Total: 1.0929 secs
Slowest: 0.1248 secs
Fastest: 0.1011 secs
Average: 0.1077 secs
Requests/sec: 915.0301
Latency distribution:
10% in 0.1032 secs
25% in 0.1049 secs
50% in 0.1068 secs
75% in 0.1081 secs
90% in 0.1128 secs
95% in 0.1205 secs
99% in 0.1235 secs
L = 915.03 * 0.1077
L = 98.5
In this test we are again serving close to 100 requests at a time, but because the number of worker threads now matches the number of concurrent requests, nothing has to wait in the queue: latency stays close to 100 ms and throughput jumps to roughly 915 requests per second.
Using Little’s law you can set concurrency limits in your application or service, and you can reason about how latency will be affected by load.
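As a sketch of what such a limit could look like (the ConcurrencyLimiter class and its names are my own, not from any framework), you can size a Semaphore with L = λ × W and reject or queue anything beyond it:
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

public class ConcurrencyLimiter {

    private final Semaphore permits;

    // targetRps and expectedLatencySeconds are values you measure for your own service
    public ConcurrencyLimiter(double targetRps, double expectedLatencySeconds) {
        int limit = (int) Math.ceil(targetRps * expectedLatencySeconds); // L = lambda * W
        this.permits = new Semaphore(limit);
    }

    public <T> T execute(Callable<T> request) throws Exception {
        if (!permits.tryAcquire()) {
            // Over the concurrency budget: shed load (or queue, depending on your needs)
            throw new IllegalStateException("Concurrency limit reached");
        }
        try {
            return request.call();
        } finally {
            permits.release();
        }
    }
}
For the endpoint above, new ConcurrencyLimiter(100, 0.100) would allow at most 10 requests in flight.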
In this sample service we were not consuming much CPU, since we were just doing a Thread.sleep. In the real world your service will be much more involved, and you might have to apply Little’s law to individual sub-components to determine the limits of the overall system.