This post was first published in aivarsk.com
Cal Paterson wrote a great article comparing and describing synchronous and asynchronous Python frameworks and explaining why asynchronous frameworks go a bit wobbly under load. This is a story of how we experienced wobbliness in a recent project.
We are using FastAPI, Pydantic, and Kubernetes to build microservices. One of them is a query service that returns a paginated result containing a list of entities implemented as Pydantic models. During tests, we tried to retrieve thousands of entities from the API endpoint. It took several seconds to produce results as we expected but some requests failed. As we started to investigate, it turned out that the liveness and readiness probes of the Kubernetes container failed and containers were restarted by Kubernetes leading to failing requests. Why didn’t the FastAPI service respond to probes? It was alive and working and FastAPI should be able to handle concurrent requests.
Let’s start with a simplified service code for testing this behavior in isolation. The response model still contains a lot of fields because it is the key to triggering the issue we faced. The real models have even more fields.
from datetime import date, datetime from typing import List from fastapi import FastAPI from pydantic import BaseModel class Address(BaseModel): id: int str1: str = None str2: str = None str3: str = None str4: str = None str5: str = None str6: str = None str7: str = None str8: str = None class Account(BaseModel): id: int address: Address = None str1: str = None str2: str = None str3: str = None str4: str = None str5: str = None str6: str = None str7: str = None str8: str = None str9: str = None str10: str = None str12: str = None str13: str = None str14: str = None str15: str = None str16: str = None class Client(BaseModel): id: int address: Address = None bank_accounts: List[Account] str1: str = None str2: str = None str3: str = None str4: str = None str5: str = None str6: str = None str7: str = None str8: str = None class ClientsResponse(BaseModel): items: List[Client] app = FastAPI() @app.get("/.well-known/live") def live(): return "OK" @app.get("/clients", response_model=ClientsResponse) def clients(): return ClientsResponse( items=[ Client(id=i, address=Address(id=i), bank_accounts=[Account(id=i)]) for i in range(40000) ] )
This service provides two endpoints: /.well-known/live
for liveness checks and /clients
for returning a list of clients.
The second piece of code will test the concurrency of the service by calling the liveness probe endpoint and counting how many requests per second it can process:
import time import requests count = 0 second = int(time.time()) while True: try: r = requests.get("http://localhost:8000/.well-known/live", timeout=1) count += 1 except requests.exceptions.ReadTimeout as ex: pass now = int(time.time()) if now != second: print(second, count) second = now count = 0
Once both scripts are running I see that the current setup can process 600 liveness probe requests per second. As soon as I request the real endpoint curl localhost:8000/clients
these numbers drop and stay at 0 for several seconds:
1642154590 673
1642154591 649
1642154592 384
1642154593 0
1642154594 0
1642154595 0
1642154596 0
1642154597 0
1642154598 0
1642154599 0
1642154600 0
1642154601 0
1642154602 1
1642154603 608
1642154604 664
What is happening? FastAPI is an asynchronous framework. Unlike traditional multi-threading where the kernel tries to enforce fairness by brutal force, FastAPI relies on cooperative multi-threading where threads voluntarily yield their execution time to others. Services can be implemented both as coroutines (async def
) or regular functions. Synchronous functions which are not yielding their execution time are called through a thread pool to ensure they do not block the main execution thread.
Despite doing their best to run concurrently, FastAPI still has synchronous code that is executed from the main thread. Some of those functions do a lot of work and may clog the main thread when processing many large response objects. These functions are:
_prepare_response_content
converts Pydantic models to Python dictionaries.jsonable_encoder
ensures that the whole object tree can be converted to JSON. It does the most work for our test case.
So what is the solution to improve the concurrency of FastAPI services? One of the solutions is to run several Uvicorn workers and hope that all of them are not clogged at the same time. That introduces some new challenges with monitoring (Prometheus multiprocess mode) and even functionality but is doable.
The other solution is to off-load the encoding of the response to another thread and unblock the main thread. FastAPI even has a special response type Response
that skips the _prepare_resonse_content
and jsonable_encoder
functions and returns response data as-is. Since our service function is already executed through a thread pool, we can convert the response to JSON there. And it requires minimal changes to the code:
from fastapi.responses import Response
return Response(
content=ClientsResponse(
items=[
Client(id=i, address=Address(id=i), bank_accounts=[Account(id=i)])
for i in range(40000)
]
).json(),
media_type="application/json",
)
With those changes applied, the FastAPI service behaves much better:
1642158924 551
1642158925 666
1642158926 578
1642158927 13
1642158928 9
1642158929 2
1642158930 423
1642158931 690
1642158932 661
1642158933 692