A machine learning approach to SWIFT transaction classification

Engineering

by Ebury Labs on January 14, 2022

To move value around the world reliably and securely for our clients, Ebury is connected as a financial institution to the SWIFT network, the world’s leading provider of financial messaging services.

More than 10,000 financial institutions are currently connected to the network, enabling international funds transfers across more than 200 countries.

In a nutshell, we send and receive messages to and from the network – mainly regarding payments that we are sending to our clients’ beneficiaries, incoming funds that we are receiving from clients or third parties and credits or debits to maintain our balances with the liquidity providers.

The reconciliation process

Ebury must ensure that our internal business activity and records match all externally received information. There are many reasons that a financial institution would be required to maintain a rigorous reconciliation process: client balances, accounting purposes, regulatory and compliance requirements, fraud prevention and bounced checks.

For SWIFT-related transactions, we are interested in one specific incoming message type: the MT942, which is a detailed list of entries debited, credited, or booked to the accounts.

One of the first steps of the complex reconciliation process is to classify the entries that we receive in the MT942 message into different groups. Examples of groups include entries related to liquidity provider activity, client funds, client returned funds, and company account movements.

Correct classification into those groups lets the reconciliation process run automatically, without manual intervention. That avoids human error and makes it possible to handle a workload of hundreds of entries daily.

Criteria-based classification

The first attempt to classify the transaction entries into the groups was to analyse the messages and identify their patterns manually. The technical solution was a list of if-statements, applied after parsing the MT942 message entries, that assigned each entry to a group.
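As an illustration of what such a criteria-based classifier can look like, here is a minimal sketch. The keywords, the group names and the `classify_by_rules` function are all hypothetical; the actual rules used in production are not described in this post.

```python
from typing import Callable, List, Tuple

# A rule pairs a predicate over the entry text with a group label.
# Keywords and labels below are invented for illustration only.
Rule = Tuple[Callable[[str], bool], str]

RULES: List[Rule] = [
    (lambda text: "RETURN" in text, "client_returned_fund"),
    (lambda text: "LIQUIDITY" in text, "liquidity_provider"),
    (lambda text: "CLIENT" in text, "client_fund"),
]


def classify_by_rules(entry_text: str, default: str = "unclassified") -> str:
    """Return the group of the first matching rule, or the default group."""
    text = entry_text.upper()
    for predicate, group in RULES:
        if predicate(text):
            return group
    return default
```

Rule order matters in a design like this: an entry that mentions both a return and a client falls into whichever group is checked first, which is one reason a growing rule list becomes hard to reason about.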

Initially, it was efficient and straightforward to quickly add a new rule every time an entry was wrongly classified. However, the classification accuracy decreased to less than 75% when we grew to several thousand customers, introduced more liquidity providers, and opened more accounts. Additionally, it became impossible to analyse thousands of entries and understand the independent patterns that could group them correctly.

We concluded that we needed to research and test a solution that would classify entries more accurately and enable automatic reconciliation for most transactions.

Machine learning-based classification

The engineering and product teams started to evaluate a new way to approach the problem – wondering if a machine learning-based approach would find the patterns and result in better accuracy.

Before running the experiments, the team first defined the input as the content of the MT942 message and generated the matrix of TF-IDF features using scikit-learn's TfidfVectorizer. Then, a dataset was extracted with all the previously classified entries to be used as input for the experiments: 80% of the data to train the model and 20% to validate the model results.
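A rough sketch of this preparation step with scikit-learn might look like the following. The entry texts and labels are invented toy data, and fitting the vectorizer on the training split only is an assumption on our part for this example, not a detail confirmed above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Invented toy entries; real MT942 entry content looks different.
texts = [
    "client payment invoice 1001",
    "client payment invoice 1002",
    "returned funds client 1001",
    "returned funds beneficiary closed",
    "liquidity provider settlement usd",
    "liquidity provider settlement eur",
    "company account transfer internal",
    "company account fee charge",
    "client deposit incoming funds",
    "liquidity provider margin call",
]
labels = [
    "client_fund", "client_fund",
    "client_returned_fund", "client_returned_fund",
    "liquidity_provider", "liquidity_provider",
    "company_account", "company_account",
    "client_fund", "liquidity_provider",
]

# 80% of the entries for training, 20% held out for validation.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=0
)

# Learn the vocabulary and IDF weights from the training split,
# then reuse them to transform the validation split.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)
```

Both matrices share the same columns (one per vocabulary term), so a model trained on `X_train` can score `X_test` directly.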

The first experiment used a generative classifier called MultinomialNB, a popular choice for text classification. The results were not promising: naive Bayes assumes that the features are conditionally independent given the class, and it works best when that assumption roughly holds.

The following experiment used a discriminative classifier called LogisticRegression, which does not assume conditional independence of the features. The resulting model reached an accuracy of 99%.

Nowadays, all entries inside the MT942 messages are classified into the corresponding groups in real time by the team’s model.
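The two experiments can be reproduced in outline with a small helper that trains each classifier on the same TF-IDF features and reports accuracy on held-out entries. This is only a sketch: the `evaluate` helper, the toy data and the `max_iter` setting are assumptions for the example, and the scores on such a tiny dataset say nothing about the results reported here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def evaluate(classifier, train_texts, train_labels, test_texts, test_labels):
    """Fit TF-IDF + classifier on the training data, return held-out accuracy."""
    model = make_pipeline(TfidfVectorizer(), classifier)
    model.fit(train_texts, train_labels)
    return accuracy_score(test_labels, model.predict(test_texts))


# Invented toy data; group names are hypothetical.
train_texts = [
    "client payment invoice",
    "client deposit incoming funds",
    "liquidity provider settlement usd",
    "liquidity provider margin call",
    "returned funds account closed",
    "returned funds wrong beneficiary",
]
train_labels = [
    "client_fund", "client_fund",
    "liquidity_provider", "liquidity_provider",
    "client_returned_fund", "client_returned_fund",
]
test_texts = ["client invoice payment", "liquidity provider settlement eur"]
test_labels = ["client_fund", "liquidity_provider"]

scores = {
    "MultinomialNB": evaluate(
        MultinomialNB(), train_texts, train_labels, test_texts, test_labels
    ),
    "LogisticRegression": evaluate(
        LogisticRegression(max_iter=1000),
        train_texts, train_labels, test_texts, test_labels,
    ),
}

for name, score in scores.items():
    print(f"{name}: accuracy={score:.2f}")
```

Wrapping the vectorizer and the classifier in one pipeline keeps the comparison fair, since both models see exactly the same features.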

Coming next

The model is trained on historical data, so we have to retrain it when the accuracy decreases. Ideally, we would automatically retrain a new model, compare its accuracy with that of the previous one, and promote it to replace the old one if it performs better.
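The comparison-and-promotion step could be sketched as below. This is a hypothetical outline, not an existing pipeline: a model is represented as any callable that maps an entry text to a group label, and `promote_if_better` and `min_gain` are names invented for this example.

```python
from typing import Callable, Sequence, Tuple

# Any callable mapping an entry text to a group label counts as a model here.
Model = Callable[[str], str]


def accuracy(model: Model, texts: Sequence[str], labels: Sequence[str]) -> float:
    """Share of held-out entries for which the model predicts the right group."""
    correct = sum(1 for text, label in zip(texts, labels) if model(text) == label)
    return correct / len(texts)


def promote_if_better(
    current: Model,
    candidate: Model,
    holdout_texts: Sequence[str],
    holdout_labels: Sequence[str],
    min_gain: float = 0.0,
) -> Tuple[Model, bool]:
    """Return the model to serve and whether the candidate was promoted.

    The candidate replaces the current model only if its held-out accuracy
    exceeds the current model's accuracy by more than min_gain.
    """
    current_acc = accuracy(current, holdout_texts, holdout_labels)
    candidate_acc = accuracy(candidate, holdout_texts, holdout_labels)
    if candidate_acc > current_acc + min_gain:
        return candidate, True
    return current, False
```

Evaluating both models on the same held-out entries is what makes the two accuracy figures comparable; a non-zero `min_gain` would avoid swapping models over negligible differences.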
