We’re used to working with Python, so the examples I’ll use will be either pure Python or Python-based pseudocode. Python’s syntax is simple and elegant enough that the examples should be readable even for people who don’t know the language.
I’m going to explore a problem common to the backend of any application with heavy processing requirements: for example, reprocessing all links of every object of a class after an update, or recalculating an attribute that depends on expensive functions. I’ll walk through asynchronous solutions to it.
In this first entry I’ll model and explain the problem in detail and present a first solution. We’ll iterate this solution in future posts to make improvements.
Using this code example:
```python
def foo_action(foo_obj):
    # ...
    # Do process
    # ...
    foo_obj.modify_state(to=state.COMPLETED)
    # ...
    # Do save
    # ...
    foo_obj.save()
    # ...
    # Do post-save incredibly heavy process
    # ...
    for bar_obj in bar.get_all_objects():
        bar_obj.recalculate_value_using_foo(foo_obj)
```
With this function we have two separate steps: first, process and save the object; second, a heavy post-save process. In most cases, signals and handlers are an effective way to manage this situation, but if we’re working with legacy code and don’t have the opportunity to use that mechanism, we’re forced to fall back on polling. Polling is a mechanism that periodically asks whether there is work to do and, if there is, does it.
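As a minimal illustration of polling (with a hypothetical in-memory list of pending objects standing in for a real datastore), the mechanism boils down to this:

```python
import time

def poll_once(pending):
    """One polling pass: ask whether there is work to do and, if so, do it."""
    processed = []
    while pending:
        foo_obj = pending.pop(0)
        processed.append(foo_obj)  # stand-in for the heavy post-save work
    return processed

def poll_forever(pending, interval_seconds=60):
    """Repeat the check periodically; this loop is the polling itself."""
    while True:
        poll_once(pending)
        time.sleep(interval_seconds)
```

The key trade-off is the interval: poll too often and you waste cycles asking when there is nothing to do; poll too rarely and the work sits unprocessed.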
In this case we have a light processing of the object (foo) that changes the state to completed and, after that, saves it. The problem is that we need to recalculate the value of the other type of object (bar) and we know that these calculations are really intensive. As it isn’t necessary to calculate these bar values immediately once the foo object is modified, we can move this process to an asynchronous task to avoid blocking foo objects.
Solution: first approach
We’ll start with a really naive approach where the periodic process is implemented in a script and scheduled using a tool like cron.
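A sketch of that naive cron-driven script follows; the model classes here are hypothetical stand-ins for whatever ORM the real application uses:

```python
# recalculate_bars.py -- scheduled with a crontab entry such as:
#     * * * * * /usr/bin/python /path/to/recalculate_bars.py

class Foo:
    """Hypothetical stand-in for the real foo model."""
    def __init__(self, completed=False):
        self.completed = completed
        self.post_processed = False

class Bar:
    """Hypothetical stand-in for the real bar model."""
    def __init__(self):
        self.value = 0

    def recalculate_value_using_foo(self, foo_obj):
        self.value += 1  # stand-in for the really intensive calculation

def run(foo_objects, bar_objects):
    """One cron-triggered pass over every completed, not-yet-processed foo."""
    for foo_obj in foo_objects:
        if foo_obj.completed and not foo_obj.post_processed:
            for bar_obj in bar_objects:
                bar_obj.recalculate_value_using_foo(foo_obj)
            foo_obj.post_processed = True  # don't redo the work on the next run

if __name__ == "__main__":
    run([], [])  # the real script would load the objects from the database
```

Note the `post_processed` flag: without some marker like it, every cron run would redo work that was already done.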
To improve on this we’ll use Celery as our async-task magic machine. Celery is a Python library that provides a way to encapsulate functions as tasks, queue them, and process them once a processing slot is available. We need to know six basic concepts:
- Task: Basic processing unit of celery. The function that will be executed.
- Queue: Where you will store your celery tasks waiting to be processed.
- Worker: A process that executes your tasks. It’s like another instance of your application dedicated to executing celery tasks.
- Broker: The manager that receives tasks, handles queues and sends tasks to workers.
- Periodic Task: Like normal tasks, but executed periodically using time intervals or crontabs.
- Celery beat: A special process of celery that manages Periodic Tasks, queuing them when the period is met.
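To make these concepts concrete, here is a minimal Celery configuration sketch; the broker URL, module name and task body are assumptions for illustration, not this project’s actual setup:

```python
from celery import Celery
from celery.schedules import crontab

# Broker: here Redis, assumed to be running locally
# (RabbitMQ is the other common choice).
app = Celery('tasks', broker='redis://localhost:6379/0')

# Task: the basic processing unit.
@app.task
def heavy_recalculation():
    pass  # the expensive work would go here

# Periodic task: celery beat queues it whenever the schedule fires.
app.conf.beat_schedule = {
    'recalculate-every-minute': {
        'task': 'tasks.heavy_recalculation',
        'schedule': crontab(),  # crontab() with no arguments means every minute
    },
}

# Worker and beat are separate processes, started from the shell:
#     celery -A tasks worker
#     celery -A tasks beat
```

The queue itself lives inside the broker; the worker and beat processes only talk to it, which is what lets you scale workers independently of the application.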
Using Celery we can split the problem code as follows:
```python
def foo_action(foo_obj):
    # ...
    # Do process
    # ...
    foo_obj.modify_state(to=state.COMPLETED)
    # ...
    # Do save
    # ...
    foo_obj.save()


@periodic_task(run_every=crontab('*', '*', '*', '*', '*'))
def foo_action_postsave(foo_obj):
    # ...
    # Do post-save incredibly heavy process
    # ...
    for bar_obj in bar.get_all_objects():
        bar_obj.recalculate_value_using_foo(foo_obj)
```
I simply extracted the post-save process into a periodic task, which will be executed following the crontab “* * * * *” (every minute). This means that every minute our celery beat will queue a foo_action_postsave task that recalculates the value of each bar object.
But, as we’re working with a producer-consumer pattern, it’s natural to ask ourselves: What if our producer creates tasks faster than our consumer can execute them? The answer in this case is that eventually our celery queue will collapse.
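A toy simulation using Python’s standard queue module shows why: if each tick produces more tasks than it consumes, the backlog grows linearly and never drains.

```python
import queue

def simulate(produced_per_tick, consumed_per_tick, ticks):
    """Track queue depth over time when production outpaces consumption."""
    q = queue.Queue()
    depths = []
    for _ in range(ticks):
        for _ in range(produced_per_tick):
            q.put("task")  # producer: e.g. celery beat queuing periodic tasks
        for _ in range(min(consumed_per_tick, q.qsize())):
            q.get()  # consumer: a worker finishing a task
        depths.append(q.qsize())
    return depths

# Producing 3 tasks per tick while consuming 1 leaves 2 extra tasks each tick:
# simulate(3, 1, 5) -> [2, 4, 6, 8, 10]
```

Here a “tick” stands in for one beat interval; in our example, every minute the crontab fires regardless of whether the previous task has finished.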
I’ll explore how to navigate this issue in my next labs entry on the topic.
All code used as examples can be found in our own repository: