Recursive Amazon Lambda Functions

by Alvaro Luque on November 22, 2016

Unless you’ve been living in a cave for the last year, you will have noticed that “Serverless” is the new buzzword in the dev community, and it feels like we’ve barely had time to get used to the “Microservices” buzz. But don’t go crazy just yet: it’s not time to throw away the “still hot” blueprint of your architecture just because a new trending topic has arrived. We decided to keep it cool and just experiment a little bit.

This blog describes how we used AWS Lambda and overcame some of its current limitations to solve a typical data processing pipeline problem.

Not long ago, we had the chance to revamp one of our ETL processes, which loads information from our main back office application into Birst, our key Business Intelligence tool. We had two daily processes: one in the back office system that generated the information to be uploaded, and a separate Birst process that ingested this data. These processes were completely disconnected, so timing issues would cause problems and could result in failure. This is where we decided to experiment with two things: Birst’s API and Amazon Lambda. Lambda in particular is an excellent starting point if you want to play a little bit with this new “Serverless” technology.

Lambda integrates with many other Amazon services, which lets us launch functions in response to service events, and an S3 event is precisely where we wanted to start. On the whiteboard, everything was pretty clear: once we generate our report from the back office and upload it to S3, a lambda function is triggered that uses the Birst API to upload the file and run the required ETL jobs, sending some email notifications once everything is done. We then stumbled upon a little issue: at the time of writing, Lambda allows only 5 minutes of total execution time for each invocation, and the Birst ETL process could take up to 40 minutes.
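To make the S3 trigger a bit more concrete, here is a minimal sketch (Python 3) of how a handler might pull the bucket and object key out of the S3 notification it receives. The helper name s3_objects_from_event and the bucket and key in the sample event are made up for illustration; only the Records / s3 / bucket / object layout comes from the S3 event format, where object keys arrive URL-encoded.

```python
from urllib.parse import unquote_plus


def s3_objects_from_event(event):
    """Return a (bucket, key) pair for every record in an S3 notification."""
    pairs = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Object keys are URL-encoded in the event (spaces arrive as "+")
        key = unquote_plus(s3_info["object"]["key"])
        pairs.append((bucket, key))
    return pairs


# Hypothetical, trimmed-down event for a single uploaded report
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "reports-bucket"},
                "object": {"key": "daily/report+2016-11-22.csv"},
            }
        }
    ]
}

print(s3_objects_from_event(sample_event))
# -> [('reports-bucket', 'daily/report 2016-11-22.csv')]
```

The handler would then pass each pair to whatever code uploads the file to Birst.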

Our initial thought was, “well, we could create one lambda function hooked up to the S3 event to upload the file, and have another one scheduled to run every 5 minutes to check the status of the process”, but this had other problems as well. We use a token provided by the Birst API to check on job statuses, so for this to work we would need to persist the token in the first lambda function and retrieve it in the second one. But what happens if two reports are generated at the same time? Should we be checking a list of tokens every 5 minutes? At this point it starts to get complicated: we have to store the token of the Birst process using… DynamoDB maybe? We are trying to use Lambda because it’s cheap and comfy, but it’s going to be a headache to deploy and maintain with a database as well. Maybe it’s easier to go back to a good old crontab.

But, thinking a little more creatively about this, we decided to try to hijack Lambda by making the function recursive: “What if we make a lambda function that calls itself when it is running out of time?” As it happens, boto3, the AWS SDK for Python, is included in the Lambda Python environment and can invoke Lambda functions, which is very handy!

Let’s see an example of this so you can replicate it yourself. First of all, we need to create a new lambda function using the wizard. Skip the templates. In the next step, the event that triggers the function is irrelevant for this example, so I will select S3 Put.

In the second step, name your lambda function, select the runtime (for this example we will use Python 2.7.3), choose the appropriate level of permissions that you want to assign to the function and the amount of resources that it will need (the default values are fine). For this example we will change the timeout to 10 seconds. Finish the rest of the steps in the wizard and then we are ready to start coding.

In the lambda editor, you can find a tab named code, which takes you to an inline editor to modify your function. Remove everything and paste this code:

import json
import boto3
import logging

AWS_ACCESS_KEY_ID = "YOURAWSACCESSKEYIDHERE"
AWS_SECRET_ACCESS_KEY = "YOURSECRETACCESSKEYHERE"
MAX_RETRIES = 4
TIMEOUT_REMAINING = 5000

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    """
    This is the main function that Amazon Lambda will call in order to
    start the execution
    """

    # Loop while the remaining time is higher than the timeout threshold
    while context.get_remaining_time_in_millis() > TIMEOUT_REMAINING:
        # Do whatever work you want to retry here
        # (make sure each pass takes less than TIMEOUT_REMAINING milliseconds)
        continue
    else:
        # The loop ended without a break: relaunch the lambda function
        # if there are retries left
        retries_left = update_num_retries(event)
        if retries_left > 0:
            relaunch_lambda(event, context)
        else:
            return False
            
            
def update_num_retries(event):
    """
    Updates the number of retries left for the lambda function and
    returns the new value.
    """
    if "NUM_RETRIES" not in event:
        # First invocation: start the counter
        event["NUM_RETRIES"] = MAX_RETRIES
    elif event["NUM_RETRIES"] > 0:
        event["NUM_RETRIES"] -= 1

    logger.info("Number of retries left: %d", event["NUM_RETRIES"])

    return event["NUM_RETRIES"]


def relaunch_lambda(event, context):
    """
    Creates a new AWS client and executes the same lambda function
    asynchronously
    """
    session = boto3.session.Session(
        aws_access_key_id=AWS_ACCESS_KEY_ID,
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    )
    lambda_cli = session.client("lambda")
    # Note: invoke_async is deprecated in boto3 in favor of invoke()
    lambda_cli.invoke_async(FunctionName=context.function_name, InvokeArgs=json.dumps(event))

If you look at the code, we define a main function, lambda_handler, which is the one Amazon invokes to run the process. In it, we have declared a loop that could, for example, poll an external API for a status until you receive what you need. Once the remaining execution time of the lambda drops below a threshold, it exits the loop, updates the number of retries and launches the function again if it has not run out of retries. Notice that you need to change the AWS credentials in order to make it run in your environment.
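As a hedged variation on the function above, the sketch below (Python 3) fills the loop with a polling step and relaunches through invoke with InvocationType="Event", the asynchronous call that boto3 recommends now that invoke_async is deprecated. It assumes the function's execution role is allowed to invoke the function itself (lambda:InvokeFunction), so no access keys appear in the code. The names poll_until_done, relaunch, check_status, SAFETY_MARGIN_MS and POLL_INTERVAL_S are hypothetical and not part of the original function.

```python
import json
import time

SAFETY_MARGIN_MS = 5000  # plays the role of TIMEOUT_REMAINING above
POLL_INTERVAL_S = 1      # hypothetical pause between status checks


def poll_until_done(check_status, context, sleep=time.sleep):
    """Call check_status() until it returns True or time runs short.

    Returns True if the job finished, False if we ran out of time.
    """
    while context.get_remaining_time_in_millis() > SAFETY_MARGIN_MS:
        if check_status():
            return True
        sleep(POLL_INTERVAL_S)
    return False


def relaunch(event, context, lambda_client=None):
    """Asynchronously re-invoke the current function with the same event."""
    if lambda_client is None:
        # boto3 ships with the Lambda Python runtime; with no explicit keys
        # it picks up the credentials of the execution role.
        import boto3
        lambda_client = boto3.client("lambda")
    lambda_client.invoke(
        FunctionName=context.function_name,
        InvocationType="Event",  # asynchronous: do not wait for the result
        Payload=json.dumps(event).encode("utf-8"),
    )
```

A handler built on these two helpers would call poll_until_done first and call relaunch only when it returns False and there are retries left. Anything the next invocation needs, such as a job token, could travel inside the event instead of a database.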

Save and run. After 5 seconds, the lambda will finish the test, but only the first iteration: the other 4 retries will run in the background. Thanks to the integration with Amazon CloudWatch, we can see log entries for the whole execution just by clicking on the logs link.

[Screenshot: CloudWatch log entries for the lambda executions]
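To see why one test run turns into five executions, here is a small local simulation (Python 3) of the retry counter. It mirrors the logic of update_num_retries, but replaces the real asynchronous invocation with a JSON round trip of the event. The function simulate_chain is a hypothetical helper for illustration only and never touches AWS.

```python
import json

MAX_RETRIES = 4


def update_num_retries(event):
    """Initialise or decrement the retry counter carried in the event."""
    if "NUM_RETRIES" not in event:
        event["NUM_RETRIES"] = MAX_RETRIES
    elif event["NUM_RETRIES"] > 0:
        event["NUM_RETRIES"] -= 1
    return event["NUM_RETRIES"]


def simulate_chain(event):
    """Return the counter value logged by each invocation in the chain."""
    seen = []
    while True:
        retries_left = update_num_retries(event)
        seen.append(retries_left)
        if retries_left <= 0:
            return seen
        # In Lambda, the function would re-invoke itself here; the event
        # reaches the next invocation as serialised JSON.
        event = json.loads(json.dumps(event))


print(simulate_chain({}))
# -> [4, 3, 2, 1, 0]
```

The first value belongs to the invocation started by the test, and the remaining four are the background relaunches.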

Of course, Amazon Lambda is not really designed to work this way, but sometimes it is good to push the limits and find workarounds for the service. Lambda is still at an early stage and many features will surely come in the next months. Some of the features we would like to see are:

Integration with repositories like git
Extension (or just the removal) of the execution time
Testing environment integration, maybe with external IDEs (the inline editor is not the best debugger, to be honest)
The possibility to launch delayed functions
Python 3 compatibility
Environment variables

I hope you like it! If you have any suggestions please leave them below in the comments section 😀

PS. Special thanks go to my colleague Abraham Gonzalez, who suffered my stubbornness during this process.
