In a previous article, we explained how we took care of data modelling and exchange between Ebury services using Apache Avro, whilst addressing the most important things to take into consideration when building robust and maintainable data models.
As the business grows, our internal services ecosystem evolves with it. Keeping track of the schema versions used by each producer and consumer group makes the system harder to maintain, and any schema compatibility issue that was not foreseen could disrupt the business at runtime.
Moreover, the size increase caused by attaching the schema to each message (no matter how small it is) can also add considerable overhead in certain real-time environments.
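To get a feel for that overhead, the sketch below measures how many bytes a small schema definition occupies when serialised as compact JSON. The `Payment` schema and its fields are hypothetical, invented purely for illustration; real figures depend on your own models.

```python
import json

# Hypothetical Avro schema, invented for illustration only.
PAYMENT_SCHEMA = {
    "type": "record",
    "name": "Payment",
    "namespace": "com.example",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string"},
    ],
}


def schema_size_in_bytes(schema: dict) -> int:
    """Return the size of the schema when serialised as compact UTF-8 JSON."""
    compact = json.dumps(schema, separators=(",", ":"))
    return len(compact.encode("utf-8"))


if __name__ == "__main__":
    size = schema_size_in_bytes(PAYMENT_SCHEMA)
    print(f"Schema definition: {size} bytes attached to every message")
```

Even for a three-field record, the definition is likely to be larger than the encoded values it describes, and that cost is paid again on every message.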
How can we improve the management of the different schemas and their versions in our infrastructure?
Schema Registry
Schema Registry is an open-source application developed and maintained by Confluent, which manages the schema lifecycle across a Kafka infrastructure. It lives outside the Kafka cluster, but works closely with all producers and consumers within the ecosystem.
The main purpose of the Schema Registry is to serve as a single source of truth for schema versioning, enforcing data governance throughout the system.
Every time a producer is about to send a new message to a Kafka topic, it first checks whether the schema used to define the message is available in the Schema Registry (where schemas are registered under named subjects). If not, the producer registers it and receives a unique ID identifying that particular version. Afterwards, every producer sending a message with the same schema just prepends the unique ID to the message instead of the whole definition.
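The "check, then register if missing" step can be sketched with a toy in-memory registry. This is only an illustration of the idea, not how Schema Registry is implemented: the class name `InMemoryRegistry`, its method and the sequential IDs are all assumptions made for the example.

```python
import json


class InMemoryRegistry:
    """Toy stand-in for a schema registry (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._ids_by_schema = {}  # canonical schema JSON -> assigned ID
        self._next_id = 1

    def get_or_register(self, schema: dict) -> int:
        """Return the ID of a known schema, registering it first if needed."""
        # Canonicalise so that JSON key order does not create distinct entries.
        canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
        if canonical not in self._ids_by_schema:
            self._ids_by_schema[canonical] = self._next_id
            self._next_id += 1
        return self._ids_by_schema[canonical]


registry = InMemoryRegistry()
payment_v1 = {
    "type": "record",
    "name": "Payment",
    "fields": [{"name": "id", "type": "string"}],
}
first_id = registry.get_or_register(payment_v1)   # registers the schema
second_id = registry.get_or_register(payment_v1)  # reuses the existing ID
```

Registering the same definition twice yields the same ID, which is what lets every producer of that schema send a short identifier instead of the full definition.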
When a consumer reads a message, it extracts the bytes that carry the schema ID and, if that schema is not cached locally, retrieves the copy stored in the Schema Registry so that the message can be deserialised correctly. This keeps producers and consumers in sync, as they use the same structure to serialise and deserialise the data.
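Confluent's serialisers frame each message as a single magic byte (`0`), followed by the schema ID as a 4-byte big-endian integer, followed by the Avro-encoded payload. The sketch below shows that framing together with a consumer-side cache. The helper names (`frame`, `unframe`, `CachingSchemaLookup`) and the `fetch` callback are hypothetical; in a real service the callback would query the Schema Registry.

```python
import struct
from typing import Callable, Dict, Tuple

MAGIC_BYTE = 0
HEADER = struct.Struct(">bI")  # 1-byte magic marker + 4-byte big-endian schema ID


def frame(schema_id: int, payload: bytes) -> bytes:
    """Prepend the 5-byte header to an already-encoded payload."""
    return HEADER.pack(MAGIC_BYTE, schema_id) + payload


def unframe(message: bytes) -> Tuple[int, bytes]:
    """Split a framed message into its schema ID and encoded payload."""
    if len(message) < HEADER.size:
        raise ValueError("message too short to contain a schema ID header")
    magic, schema_id = HEADER.unpack_from(message)
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[HEADER.size:]


class CachingSchemaLookup:
    """Resolve schema IDs, calling `fetch` only on a cache miss."""

    def __init__(self, fetch: Callable[[int], str]) -> None:
        self._fetch = fetch  # assumed to ask the registry for the schema
        self._cache: Dict[int, str] = {}

    def schema_for(self, schema_id: int) -> str:
        if schema_id not in self._cache:
            self._cache[schema_id] = self._fetch(schema_id)
        return self._cache[schema_id]
```

With this framing the per-message cost of identifying the schema is a fixed five bytes, and the registry is contacted only the first time a consumer meets a given ID.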
As only IDs are kept in the messages, Schema Registry becomes a critical component of the infrastructure. Should it be unavailable for any reason, services are put at risk: a consumer cannot deserialise a message whose schema it has not already cached.
Versioning
Once a new schema is registered, its compatibility type can be configured at any time, following the rules described previously. Compatibility can be set either at a global level (for the whole Schema Registry deployment) or on an individual basis (per subject).
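Both levels are exposed by the REST API: `PUT /config` sets the global default, and `PUT /config/{subject}` overrides it for a single subject. The sketch below only builds the requests, so it can be inspected without a running server. The base URL and the `payments-value` subject name are assumptions for the example.

```python
import json
import urllib.request
from typing import Optional
from urllib.parse import quote

BASE_URL = "http://localhost:8081"  # assumption: a locally running Schema Registry
CONTENT_TYPE = "application/vnd.schemaregistry.v1+json"


def compatibility_request(
    level: str, subject: Optional[str] = None
) -> urllib.request.Request:
    """Build a PUT request that sets the compatibility level.

    With no subject the global default is targeted; otherwise only the
    given subject is affected.
    """
    path = "/config" if subject is None else f"/config/{quote(subject, safe='')}"
    body = json.dumps({"compatibility": level}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": CONTENT_TYPE},
        method="PUT",
    )


global_default = compatibility_request("BACKWARD")
per_subject = compatibility_request("FULL", subject="payments-value")
```

Passing either request to `urllib.request.urlopen` would apply the setting against a live deployment.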
Every time a new version of a schema is submitted, Schema Registry's validator checks whether the changes comply with the established compatibility rules. If they do not, the registration is rejected with an error and the new version is not stored, preventing the introduction of breaking changes which might affect data workflows.
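To illustrate the kind of rule being enforced, here is a deliberately simplified backward-compatibility check: can a reader using the new schema still process data written with the old one? It is not Schema Registry's validator. It compares only top-level record fields and treats every type change as breaking, whereas Avro's real resolution rules also cover type promotions, unions and aliases. The function name and sample schemas are hypothetical.

```python
from typing import List


def backward_incompatibilities(old_schema: dict, new_schema: dict) -> List[str]:
    """List reasons why new_schema could not read data written with old_schema.

    Simplified on purpose: only top-level record fields are compared.
    """
    old_fields = {field["name"]: field for field in old_schema["fields"]}
    problems = []
    for field in new_schema["fields"]:
        name = field["name"]
        if name not in old_fields:
            # Old data has no value for this field, so a default is required.
            if "default" not in field:
                problems.append(f"new field '{name}' has no default value")
        elif old_fields[name]["type"] != field["type"]:
            problems.append(f"field '{name}' changed type")
    # Fields removed in new_schema are fine: the reader simply ignores them.
    return problems


v1 = {"type": "record", "name": "Payment",
      "fields": [{"name": "id", "type": "string"}]}
v2_ok = {"type": "record", "name": "Payment",
         "fields": [{"name": "id", "type": "string"},
                    {"name": "currency", "type": "string", "default": "EUR"}]}
v2_bad = {"type": "record", "name": "Payment",
          "fields": [{"name": "id", "type": "string"},
                     {"name": "amount", "type": "double"}]}
```

Here `v2_ok` passes because the added field carries a default, while `v2_bad` would be rejected because old messages contain no `amount` to read.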
Thanks to these checks, consumers can be fairly confident that they will not run into data processing issues caused by model mismatches.
Schema Registry keeps a versioned history of all schemas and their accepted changes, so that anybody (both humans and machines) can check how the model has evolved over time. Any of these versions is retrievable through its unique ID, making it available to any producer or consumer connected to the Schema Registry.
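The history can be explored through a handful of read-only REST endpoints. The helpers below simply build the corresponding URLs. The base URL and the `payments-value` subject are assumptions for the example, and the helper names are hypothetical.

```python
from typing import Union
from urllib.parse import quote

BASE_URL = "http://localhost:8081"  # assumption: a locally running Schema Registry


def versions_url(subject: str) -> str:
    """URL listing every version number registered under a subject."""
    return f"{BASE_URL}/subjects/{quote(subject, safe='')}/versions"


def version_url(subject: str, version: Union[int, str] = "latest") -> str:
    """URL of one version of a subject; 'latest' selects the newest one."""
    return f"{versions_url(subject)}/{version}"


def schema_by_id_url(schema_id: int) -> str:
    """URL of the schema identified by the ID carried in each message."""
    return f"{BASE_URL}/schemas/ids/{schema_id}"


history = versions_url("payments-value")
first_version = version_url("payments-value", 1)
by_id = schema_by_id_url(42)
```

A plain `GET` on any of these URLs, for instance with `urllib.request.urlopen`, returns JSON describing the requested versions or schema.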
Final words
Schema Registry provides a RESTful API to perform all operations programmatically. Nevertheless, there is also a basic interface (Schema Registry UI) which is well suited to development and playground environments.
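As a taste of the API, registering a schema is a `POST` to `/subjects/{subject}/versions` whose JSON body carries the schema as an escaped string, and the response contains the assigned ID. The following is a minimal sketch using only the standard library. The function names, base URL and subject are assumptions, and `register` needs a reachable Schema Registry to succeed.

```python
import json
import urllib.request
from urllib.parse import quote

CONTENT_TYPE = "application/vnd.schemaregistry.v1+json"


def register_request(base_url: str, subject: str, schema: dict) -> urllib.request.Request:
    """Build the POST request that registers a schema under a subject."""
    # The API expects the schema itself as a JSON-encoded string.
    body = json.dumps({"schema": json.dumps(schema)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/subjects/{quote(subject, safe='')}/versions",
        data=body,
        headers={"Content-Type": CONTENT_TYPE},
        method="POST",
    )


def register(base_url: str, subject: str, schema: dict) -> int:
    """Send the request and return the schema ID from the response."""
    request = register_request(base_url, subject, schema)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["id"]


# Building the request does not touch the network, so it is safe to inspect.
example = register_request(
    "http://localhost:8081", "payments-value", {"type": "string"}
)
```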
If you found these articles interesting, do not forget to share them within your network! More useful content will be published regularly.