SQL, spotlight on Django problems

Engineering

by Héctor Alvarez on October 9, 2020

As backend developers, after years working high level we tend to forget about database efficiency and complexity because of either the ORM and the database. On one side the ORMs are nice tools, that help on building big systems easily hiding a lot of the underlying complexity, on the other side modern databases are very fast and reliant so performance doesn’t matter too much (until when it starts to matter), they also tend to scale up the server that provides the database instead of fine tuning the queries.

In ERS team, we go low level and do manual queries directly in the database, translating what is done in the django application but bare SQL, going that low puts the spotlight on what is wrong in django and also shows the path to solve it.

Django ORM and DDD

Django’s ORM like any other is again a very convenient way of abstracting the access to the database and retrieving complex models with all the whistles and bells, and if used correctly is a perfect tool for a fast path company like Ebury, but the ORM includes a lot of black magic that just works, hidding queries being run that the developer is usually not aware of.

Lets put a simple example:

Lets say its needed to retrieve the last BankAccountMasks created after January the first 2020, with the idea of rendering all that information at some endpoint:

BankAccountMask.objects.filter(created__gte=datetime(2020, 1, 1))

Two weeks later a new ticket requires to include the client’s ‘name‘ and ‘client_crm_id‘ for each of the masks.

Most developers will just do:

'client_name': mask.client.name,
'client_crm_id': mask.client.client_crm_id

That just works, the ticket goes out fast, team Lead is happy, product is happy, everybody happy? NO the database is not, because every time you iterate over EACH of the items it has to do a new request to retrieve the client’s name and client_crm_id, so instead of doing just one query the system is doing hundreds of them, that issue is called N+1 queries.
What has to be done in that case is go back to the original query and modify it so it pre-fetches the new client fields like so:

BankAccountMask.objects.filter(created__gte=datetime(2020, 1, 1)).select_related('client__name', 'client__client_crm_id')

Any backend developer know that there is something happening that is not actually controlled by them, but using a framework like Django that provides you all that power just few keystrokes away makes them forget and/or forgive on making the queries more efficient.

The reason why this happens all the time is because in Django the database models (aka persistence model) and the domain model are mixed, developers tend to forget that what should be done is work with the domain model and not with the database model, the database model is just to retrieve and persist the data required for the domain model. The domain model should consist in just python primitives or specific classes and not direct Django models, so all the magic of the ORM doesn’t pass silently.

SEPARATE DOMAIN MODEL FROM PERSISTENCE MODEL

When creating a new workflow it should be required to create a domain repository (or how we would like to call it) where domain models are constructed from the database models, including all the required information.

For any specific workflow developers should then only use the domain model. If at some point something new is required in that workflow, they should go back to the domain repository and include that in the domain model or even create a new domain model. With that simple solution the magic is totally removed, the developer is aware of everything that happens, so its under control by them and not the ORM.

Finally the workflow should record the last state of the model in the database so the information is persisted.

Model Properties

Model properties are used for doing small calculations for classes, there is actually no inconvenient on using them when all the information is already available in the model, like the __unicode__ or __str__ properties, usually they are just concatenate some strings together just to identify the object.
The problem comes with properties that require other models to construct themselves in that case the system gets perverted, because on one side 99% of the time the developer is not aware of all that complexity and the needed queries to build the property pass silently (Django just works) but the performance is destroyed.
The funniest part is that most of such properties are never changed in the lifetime of the object, but they are calculated hundred of times.

Good property example:

@property
def is_active(self): 
    return self.status == "active" and self.expiration_date > datetime.datetime.now()

Everything goes fast and easy, because all the information is already in memory and there is no extra query to perform. Yes every time the property is accessed it has to be calculated but we don’t have to persist the data because its already available. Moreover the status can change over time so the model is active or not depending on the current timestamp.

bad property example:

@property
def reference(self): 
    return u"{}BB{num:06d}".format(self.client.brand.code, num=self.pk)

In that property, this is a N+2 queries, because the model itself doesn’t have the client already, and then it access to the client’s brand, so a second query has to be requested to the database.
Most probably this reference should be calculated just once and should not change in the lifetime of the model.

DON’T CREATE PROPERTIES THAT REQUIRE ANY OTHER MODEL TO BE ABLE TO CALCULATE ITSELF

To know SQL better and dirt your hands querying directly on the database makes you much more aware of all these problems, please don’t get this wrong ORMs are actually very nice tools because that level of abstraction from the database is very necessary, but loosing the control about the queries being sent is not, this makes ORMs both best friend for the backend developer and also their worst enemy, to really understand what they do behind the curtain makes us better developers and also make us care about the database.

disclaimer: There are better and worse examples, there is no intention on pointing out to anyone.

Cookie	Duration	Description
cookielawinfo-checkbox-advertisement	1 year	Set by the GDPR Cookie Consent plugin, this cookie is used to record the user consent for the cookies in the "Advertisement" category .
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Cookie	Duration	Description
_ga	2 years	The _ga cookie, installed by Google Analytics, calculates visitor, session and campaign data and also keeps track of site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognize unique visitors.
_ga_5ZETTGME4T	2 years	This cookie is installed by Google Analytics.
_gat_gtag_UA_51187572_43	1 minute	This cookie is set by Google and is used to distinguish users.
_gid	1 day	Installed by Google Analytics, _gid cookie stores information on how visitors use a website, while also creating an analytics report of the website's performance. Some of the data that are collected include the number of visitors, their source, and the pages they visit anonymously.
CONSENT	16 years 4 months	These cookies are set via embedded youtube-videos. They register anonymous statistical data on for example how many times the video is displayed and what settings are used for playback.No sensitive data is collected unless you log in to your google account, in that case your choices are linked with your account, for example if you click “like” on a video.

Cookie	Duration	Description
IDE	1 year 24 days	Google DoubleClick IDE cookies are used to store information about how the user uses the website to present them with relevant ads and according to the user profile.
test_cookie	15 minutes	The test_cookie is set by doubleclick.net and is used to determine if the user's browser supports cookies.
VISITOR_INFO1_LIVE	5 months 27 days	A cookie set by YouTube to measure bandwidth that determines whether the user gets the new or old player interface.
YSC	session	YSC cookie is set by Youtube and is used to track the views of embedded videos on Youtube pages.
yt-remote-connected-devices	never	These cookies are set via embedded youtube-videos.
yt-remote-device-id	never	These cookies are set via embedded youtube-videos.
yt.innertube::nextId	never	These cookies are set via embedded youtube-videos.
yt.innertube::requests	never	These cookies are set via embedded youtube-videos.