As we pass the end of the fourth quarter of 2021/2022, we want to celebrate the achievement of having no downtime for our core systems.
Our systems are at the heart of how Ebury does business. They allow Front Office and Operations teams to operate and are vital for the correct performance of Ebury Online, our public platform and our APIs. That’s why they have to work 24/7: any downtime affects our colleagues and, ultimately, our clients.
When an incident hits a platform feature that is crucial for Ebury’s operations, it affects our overall availability, so reducing or avoiding downtime is critical to offering a high-quality service. “Our target is 99.9% platform availability on a 24/5 basis. In a quarter, that translates to a maximum of an hour and a half of outage time”, explains Jennifer Hurtado, Technical Support Team Lead.
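As a quick sanity check on that figure, the outage budget can be worked out in a few lines. This is only a minimal sketch: it assumes a 13-week quarter with 24 hours of service on 5 days per week, and the function name and parameters are illustrative, not part of any Ebury tooling.

```python
def downtime_budget_minutes(
    availability: float,
    weeks: int = 13,          # assumption: a quarter is roughly 13 weeks
    days_per_week: int = 5,   # "24/5" service window
    hours_per_day: int = 24,
) -> float:
    """Return the maximum outage time, in minutes, allowed by an availability target."""
    total_minutes = weeks * days_per_week * hours_per_day * 60
    return total_minutes * (1 - availability)


budget = downtime_budget_minutes(0.999)
print(f"Allowed outage per quarter: {budget:.1f} minutes")
```

Under those assumptions the budget comes to roughly 94 minutes, which matches the "hour and a half" in the quote.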
Even though the team has been meeting this quality standard consistently for the last three quarters, this is the first quarter in which we have achieved the best number of all: zero outage minutes. This achievement is even more remarkable if we take into account that the last few weeks have been very productive, with no fewer than 371 releases.
Among the reasons that explain this excellent performance, Jennifer highlights the importance of our robust incident management process, which includes a retrospective meeting (or post-mortem) after each incident. These meetings are aimed at reviewing our response to an incident in order to understand what went wrong and create action plans to avoid similar situations in the future. “That way, every time we have a problem we learn from it and we execute specific actions aimed at improving the process or whatever we think is necessary”, explains Jennifer.
What happens when the problem doesn’t come from you, but from an external provider? In that case, Jennifer says, “when you depend on a third party in very critical processes, you must have a contingency plan to protect yourself, and even consider having a different third-party provider ready to use if your main one has a problem.” Even in this situation, the post-mortem process remains useful, because small actions can still be taken to improve, for example by adding early error detection.
Beyond this continuous improvement through learning from past mistakes, other factors help shorten outage times, such as the progressive improvement of the architecture, which allows the server to be divided into smaller pieces that can be isolated more easily when problems occur.
We have very exciting times coming for Ebury in the following months, and from the Tech team, we’ll do our best to maintain the great performance we’ve achieved.