Engineering

SQL, spotlight on Django problems

October 9, 2020

October 9, 2020 by Héctor Alvarez

As backend developers, after years working high level we tend to forget about database efficiency and complexity because of either the ORM and the database. On one side the ORMs are nice tools, that help on building big systems easily hiding a lot of the underlying complexity, on the other side modern databases are very fast and reliant so performance doesn’t matter too much (until when it starts to matter), they also tend to scale up the server that provides the database instead of fine tuning the queries.

In ERS team, we go low level and do manual queries directly in the database, translating what is done in the django application but bare SQL, going that low puts the spotlight on what is wrong in django and also shows the path to solve it.

Django ORM and DDD

Django’s ORM like any other is again a very convenient way of abstracting the access to the database and retrieving complex models with all the whistles and bells, and if used correctly is a perfect tool for a fast path company like Ebury, but the ORM includes a lot of black magic that just works, hidding queries being run that the developer is usually not aware of.

Lets put a simple example:

Lets say its needed to retrieve the last BankAccountMasks created after January the first 2020, with the idea of rendering all that information at some endpoint:

BankAccountMask.objects.filter(created__gte=datetime(2020, 1, 1))

Two weeks later a new ticket requires to include the client’s ‘name‘ and ‘client_crm_id‘ for each of the masks.

Most developers will just do:

'client_name': mask.client.name,
'client_crm_id': mask.client.client_crm_id

That just works, the ticket goes out fast, team Lead is happy, product is happy, everybody happy? NO the database is not, because every time you iterate over EACH of the items it has to do a new request to retrieve the client’s name and client_crm_id, so instead of doing just one query the system is doing hundreds of them, that issue is called N+1 queries.
What has to be done in that case is go back to the original query and modify it so it pre-fetches the new client fields like so:

BankAccountMask.objects.filter(created__gte=datetime(2020, 1, 1)).select_related('client__name', 'client__client_crm_id')

Any backend developer know that there is something happening that is not actually controlled by them, but using a framework like Django that provides you all that power just few keystrokes away makes them forget and/or forgive on making the queries more efficient.


The reason why this happens all the time is because in Django the database models (aka persistence model) and the domain model are mixed, developers tend to forget that what should be done is work with the domain model and not with the database model, the database model is just to retrieve and persist the data required for the domain model. The domain model should consist in just python primitives or specific classes and not direct Django models, so all the magic of the ORM doesn’t pass silently.

SEPARATE DOMAIN MODEL FROM PERSISTENCE MODEL


When creating a new workflow it should be required to create a domain repository (or how we would like to call it) where domain models are constructed from the database models, including all the required information.

For any specific workflow developers should then only use the domain model. If at some point something new is required in that workflow, they should go back to the domain repository and include that in the domain model or even create a new domain model. With that simple solution the magic is totally removed, the developer is aware of everything that happens, so its under control by them and not the ORM.

Finally the workflow should record the last state of the model in the database so the information is persisted.

Model Properties

Model properties are used for doing small calculations for classes, there is actually no inconvenient on using them when all the information is already available in the model, like the __unicode__ or __str__ properties, usually they are just concatenate some strings together just to identify the object.
The problem comes with properties that require other models to construct themselves in that case the system gets perverted, because on one side 99% of the time the developer is not aware of all that complexity and the needed queries to build the property pass silently (Django just works) but the performance is destroyed.
The funniest part is that most of such properties are never changed in the lifetime of the object, but they are calculated hundred of times.


Good property example:

@property
def is_active(self): 
    return self.status == "active" and self.expiration_date > datetime.datetime.now()

Everything goes fast and easy, because all the information is already in memory and there is no extra query to perform. Yes every time the property is accessed it has to be calculated but we don’t have to persist the data because its already available. Moreover the status can change over time so the model is active or not depending on the current timestamp.

bad property example:

@property
def reference(self): 
    return u"{}BB{num:06d}".format(self.client.brand.code, num=self.pk)

In that property, this is a N+2 queries, because the model itself doesn’t have the client already, and then it access to the client’s brand, so a second query has to be requested to the database.
Most probably this reference should be calculated just once and should not change in the lifetime of the model.

DON’T CREATE PROPERTIES THAT REQUIRE ANY OTHER MODEL TO BE ABLE TO CALCULATE ITSELF

To know SQL better and dirt your hands querying directly on the database makes you much more aware of all these problems, please don’t get this wrong ORMs are actually very nice tools because that level of abstraction from the database is very necessary, but loosing the control about the queries being sent is not, this makes ORMs both best friend for the backend developer and also their worst enemy, to really understand what they do behind the curtain makes us better developers and also make us care about the database.

disclaimer: There are better and worse examples, there is no intention on pointing out to anyone.


Your email address will not be published. Required fields are marked *