Chris Tacey-Green shares battle-tested principles for implementing event-driven architectures in highly regulated environments, covering Inbox/Outbox patterns, event versioning, and fault tolerance strategies.
Chris Tacey-Green discusses the shift from synchronous commands to asynchronous events within highly regulated environments. He explains the critical role of Inbox and Outbox patterns in preventing data loss, the nuances of event versioning, and how to maintain decoupling between domains. He shares "battle-tested" principles for implementing fault tolerance and managing eventual consistency.
Foundations
If we take our title, event-driven, cloud native, banking, we'll break that down and we'll define each part. An event essentially is a change in state somewhere in the system. That could be caused by a user's action, an asynchronous background task, an external entity, an external system to the platform that you're building. It may carry data, and we might call that a fat event. Or it might simply be a notification, which would be a thin event.
I haven't made the dietary definition of an event up. This is something that's discussed out in the world. There's actually quite a famous paper that talks about putting your events on a diet. I would tend to aim to keep your events lean. Essentially, all of the data that pertains to the event, put it in there. Anything else, leave it out.
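To make the dietary spectrum concrete, here's a sketch of a thin event versus a lean, fatter one. The event shapes and field names here are purely illustrative, not a contract from the talk:

```python
# Hypothetical payloads illustrating thin vs fat ("lean") events.

thin_event = {
    "type": "PaymentProcessed",
    "paymentId": "pay-123",  # just a notification: consumers must call back for detail
}

lean_event = {
    "type": "PaymentProcessed",
    "paymentId": "pay-123",
    # carries the data that pertains to the event itself...
    "amount": 12500,
    "currency": "GBP",
    "gateway": "faster-payments",
    # ...but nothing unrelated (no full client record, no account history)
}
```

A thin event forces every consumer back to the producer's API for detail; a lean event carries what the event is about and nothing more.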
Yes, you do have these levels of an event. Before we get on to anything else, I am quickly going to discuss commands versus events. This is because this is a conversation I get into time after time. If you build an event-driven system and then you start pumping commands around it, you're not getting all the benefits that you would like to get from an event-driven system. Actually, you're going to screw yourself over in the future.
Very simple differentiation. A command is me saying, I want something to happen. I'm explicitly asking you to do that thing. I'm going to wait because I'm expecting some result. Even if it's asynchronous, I'm expecting a result. An event is me shouting into the world saying that something happened. I'm not expecting anything to happen off the back of that. In fact, I'm not necessarily expecting anyone to be listening to me. I could be shouting into the ether. No one's subscribed to that event, and that is fine.
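That differentiation can be sketched in code. This is a minimal illustration with made-up names, not a real messaging framework:

```python
from dataclasses import dataclass

# A command is addressed to a specific handler, and the sender expects a result.
@dataclass(frozen=True)
class ProcessPayment:          # imperative name: "do this"
    payment_id: str
    amount_pence: int

# An event is a statement of fact, broadcast to whoever (if anyone) is listening.
@dataclass(frozen=True)
class PaymentProcessed:        # past tense: "this happened"
    payment_id: str
    amount_pence: int

def handle(command: ProcessPayment) -> PaymentProcessed:
    # Handling a command returns a result to the caller...
    return PaymentProcessed(command.payment_id, command.amount_pence)

def publish(event: PaymentProcessed, subscribers: list) -> None:
    # ...while publishing an event expects nothing back.
    # Zero subscribers is perfectly fine: we're shouting into the ether.
    for subscriber in subscribers:
        subscriber(event)
```

The naming convention alone (imperative vs past tense) is a useful smell test for whether your "event" is secretly a command.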
This differentiation comes up a lot. Get it burned into your brains if you've not worked with these architectures before. Understand what each thing is and when to use which one.
We know what an event is. It's a change in state. Therefore, an event-driven architecture, quite simple. It's where we combine multiple systems that are reacting to events. It tends to consist of producers, so systems that publish events. Consumers, systems that receive an event. Nice and simple.
A quick shout on event sourcing. This tends to end up in the same conversations as event-driven architectures. When you talk to people and you talk about event-driven architectures, a lot of them will think of event sourcing. These are not the same thing. Please spread this to your teams. You do not need to do event sourcing in order to do an event-driven architecture.
Event sourcing is actually how the state of your application is represented. It's represented as an immutable sequence of events. If we considered a shopping cart online, if we weren't doing event sourcing, we might represent the state of that shopping cart as, I have four hats in my shopping cart. If I went to look at my state at my database, I would see a record there that says, hats times four.
Event sourcing, the state of my shopping cart is represented, probably in this case, as four events. You would have four records, and each one represents me adding a hat to the shopping cart. In order for me to know the state of an application when it's event sourced, I need to play back those events in order to know, at this point in time, there are four hats in my shopping cart.
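That playback can be sketched as a simple fold over the event stream. The event shapes here are made up for illustration:

```python
from collections import Counter

# An immutable sequence of events: four "hat added" records,
# rather than a single row saying "hats x 4".
events = [
    {"type": "ItemAdded", "item": "hat"},
    {"type": "ItemAdded", "item": "hat"},
    {"type": "ItemAdded", "item": "hat"},
    {"type": "ItemAdded", "item": "hat"},
]

def replay(events):
    """Derive current cart state by playing the events back in order."""
    state = Counter()
    for event in events:
        if event["type"] == "ItemAdded":
            state[event["item"]] += 1
        elif event["type"] == "ItemRemoved":
            state[event["item"]] -= 1
    return state
```

Only after the replay do we know there are four hats in the cart; the state is never stored directly.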
It's a complicated pattern to apply. I've seen people really struggle with understanding it. It takes people time to learn it. Understand that you do not need to do event sourcing to do event-driven architectures. The reason they come hand-in-hand is that if you have done event sourcing, adding that little extra bit of subscribing to an event is much easier. That's why they tend to come hand-in-hand, but you do not need to do this, and understand there are dragons here.
Next up, cloud native. Essentially, designing, constructing, and operating workloads in the cloud. Technically, the cloud supports any operating model. We could spin up virtual machines. I could SSH into a virtual machine, copy a zip file over, and run a manual service on that VM. Actually, when we're talking cloud native, we tend to be talking about modern engineering practices. Highly scalable. I have put microservice-based in here, although I realize there are other approaches to these problems: modular monoliths exist, and they are good patterns to use. These are systems that would be deployed using modern DevOps principles and CI/CD practices. Hopefully, cloud native is something that you are already fairly comfortable with.
Our title was Event-Driven, Cloud Native, Banking. We've got one more thing left to define, and that is banking. These are large, slow, highly regulated organizations that promise to keep your cash safe under their mattress so that you don't have to put it under yours. They also tend to be terrified of any of the modern principles that we've just been discussing. Lots of them will use fax machines as integration mechanisms. Fortunately, Investec, the bank that I work for, is not one of them. We are a pretty modern, agile organization, and so we've been doing a lot of more modern engineering practices like the ones that we're going to be talking about today.
There's our foundation set.
Why Eventing?
Let's get into the details. Why do we want to do event-driven things? There are actually many different reasons. I've picked out a few because they apply to real situations, real use cases that we have had to solve for at the bank. You can read online about many different benefits and drawbacks of event-driven architectures. I'm just going to pick out some that make sense.
Decoupling is an obvious one once you get into using events. We have a very real use case here of transaction monitoring at a bank. Transaction monitoring, essentially everything that happens on a client's account, we need to be paying attention to, monitoring, looking for anything that's strange. If you think about times where you've traveled to a new country, we want to be able to see that and work out if that's something abnormal or if that's something that we would expect of you as a client.
To solve for transaction monitoring, that system needs lots of data from our payment system. We've got two options here. We could couple the two things. Payments at a bank is a very important thing. Highly regulated. PSD2, if you want to go read about that regulation, it's a lot of fun. Payments is crucial and we have to build it with reliability at core. Transaction monitoring is not something that has to come in a payment flow. It's something that happens behind the scenes. You do fraud checks on a payment, but you don't necessarily have to monitor transactions actively in order for a payment to go out the door.
If we couple these two things, and we've got two ways of coupling, we could either determine that payments has to hit an API on transaction monitoring. It's going to push that data to that service. Or maybe we determine that transaction monitoring, it's its responsibility, so it's going to pull from an API on payments. Either way, we are now coupling these two systems. We're coupling two systems that actually should be independent of one another, have very different reliability expectations, very different expectations from the organization. Not ideal.
By moving to an event-driven architecture, we can split these two things. In the decoupled version, you can see that payments has no idea that transaction monitoring exists. Payments gets to focus on its flow and it pumps out, publishes events. In this case, I've called out two, that payment was initiated, which has some data about the location of the user, the channel that they were coming in on, the creditor, the debtor. We also pump out an event to say that the payment was processed. Which gateway did it go down?
Transaction monitoring now gets to be independent. It gets to look at the event stream of payments and say, for my use case, to monitor transactions, I'm going to pull these two events. In the future, it could pull new events. It could go and find new data that it wants to use. It completely decouples these two things. Now transaction monitoring can go down without taking payments with it. Decoupling, very important benefit.
The second benefit is an immutable activity log. Before we moved to an event-driven architecture for payments, we obviously had payments running through the organization, but it was hard to know where a payment was in all of the many flow points in a bank. You have lots of different things that happen. Fraud checks, sanctions. You choose gateways. Actually, payment gateways themselves tell you lots of different things as responses when you've sent a payment out. We were struggling to see that.
When we moved to an event-driven model, we now had this immutable activity log of the events that were powering a payment. That's the crucial bit. It wasn't some audit log off the side. It wasn't us explicitly messaging logs to a log aggregation that we then needed to correlate. The events we saw, we trusted, because that's how the system was running.
Now actually here, I've picked out a couple of events. There's way more in the flow. We can now see, as a business, with very nicely business-oriented event names, which is something that's important to do within your domain design, we can see that a payment was initiated. We can see that a fraud check was completed. Or maybe we can see that a fraud check, actually, we're still waiting on that. It's fallen into a manual operational process where someone's needing to do additional fraud checks on something. Huge benefit that you get from using events to power your system. Again, very real thing that we have running in production.
The third one, fan-out. I haven't called it out, but there's also fan-in as the alternative. If we take another payment-related one, where off the back of a payment, we need to do two things. We need to update our payment limits. We need to say, for example, you might only be able to spend £10,000 a day. We'll have that limit assigned to you as a client, and we need to update your payment limits when you've made a payment so that we know where are you against that limit. We also want to send comms. Maybe a push notification, an SMS, an email, a pigeon to say that your payment has completed.
Without an event-driven architecture, we of course can solve for this problem. We can do these things. We end up wrapping them together. We end up saying, ok, so we need to update the payment limits, and we need to send comms. If payment limits fails, we need to handle that failure somehow. Do we wait for that until we send out our comms? Do we still send out our comms, and then we go and fix the payment limits issue? We can avoid all of that by having a simple event fan-out. One single event that says, this payment was processed. Then, two independent processes. Actually, in reality, there'll be way more than this. Two independent processes that go off and do the things that they need to do. Client comms does not care about the payment limits service. It shouldn't need to, and it can just work independently. We'll get onto fault tolerance, but it also means that each of these can handle their faults, their retries, their fallouts independently of one another. Fault tolerance, a huge benefit of an event-driven architecture. When we're talking about highly regulated industries, we have to be tolerant to all faults. We are not talking about some IoT processing, or big data analytics, or anything like that. We are talking about vital things that must happen.
In this case, and again, very real situation, I'm not going to mention specifics about the fraud engine. We have a fraud engine. It's from an external vendor, which has some reliability issues. We can't fix those reliability issues. We didn't build that software. We do need to be able to handle them. With an event-driven architecture, we have three places where we can handle faults. You can customize these however you like, based on the domain and the use case that you're solving for. The first level is that transient box. Actually, this is no different to the in-process retries that you'll probably all have written in your code. If you think about something like Polly in .NET, where you just define that we're happy to retry five times: we'll add a bit of jitter, we'll wait a couple of seconds, and hopefully that transient issue, that network issue, is solved and our request goes through. No different to normal. The only additional benefit you get is that because it's event-driven, this is asynchronous, this is eventually consistent, so you might be able to extend those transient retries out a little longer than you otherwise would.
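That first, transient level can be sketched as a plain in-process retry loop with exponential backoff and a bit of jitter. This is an illustrative sketch of the idea, not Polly itself:

```python
import random
import time

def call_with_retries(operation, attempts=5, base_delay=0.5):
    """Retry a flaky operation a fixed number of times, with
    exponential backoff plus jitter between attempts."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # transient retries exhausted; escalate to the next level
            # e.g. 0.5s, 1s, 2s, 4s... with a little randomness to avoid
            # every consumer retrying in lockstep
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

Because the flow is asynchronous and eventually consistent, `attempts` and `base_delay` can be far more generous here than they could be in a blocking request path.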
The second level we now get: the fraud engine's still down, and our transient retries are still failing. We can now back off to our eventing tech. It really doesn't matter which cloud-native eventing tech you're using. It could be Kinesis. It could be Azure Event Hubs. It could be some managed Kafka instance. It doesn't matter. You can configure this on all of them. This is where you would say, we know things are problematic, but we're still going to retry, and we're going to back off a bit more. We can back off for however long the organization is happy for us to back off. Until we eventually say, things have gone really bad, we need to dead letter this thing. Dead lettering is pretty important, mainly for the problem of poisonous messages, poisonous events. If some naughty person pumps out an event into your system, and it breaks your eventing contract, or it has bad data that just cannot be processed, you need a way for that to escape from your architecture eventually. Otherwise, it will continue retrying forever, and you'll have some fun screwing around in databases to fix that. We have to dead letter. That gives us our third level of fault tolerance, because now we will alert some human. We will wake someone up at 2 a.m., and they will have to go and look and replay that event, if they determine that they want to replay it and it's not a poisonous message. Fault tolerance, huge benefit of event-driven architectures. We have really benefited from that within our highly regulated use cases.
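The dead-lettering escape hatch can be sketched like this. The function and store names are illustrative, standing in for whatever your broker's redelivery and your alerting tooling actually provide:

```python
def alert_on_call(event):
    # Stand-in for paging/alerting: wake a human at 2 a.m. to inspect and
    # decide whether to replay the event.
    print(f"ALERT: event {event.get('id')} dead-lettered, needs manual review")

def consume(event, handler, dead_letters, max_deliveries=3):
    """Sketch of the final fault-tolerance level: after redelivery attempts
    are exhausted, park the event on a dead-letter store instead of
    retrying a poisonous message forever."""
    for delivery in range(1, max_deliveries + 1):
        try:
            handler(event)
            return "processed"
        except Exception:
            if delivery == max_deliveries:
                dead_letters.append(event)  # escape hatch for poison messages
                alert_on_call(event)
                return "dead-lettered"
```

In a real system the redelivery loop lives in the broker, not your code; the point is that a bounded number of attempts ends in a dead-letter store plus an alert, never an infinite retry.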
The fifth one that I'm going to call out is plug and play. This example we're talking about here, is the build-out of a new capability, rewards. We want to offer rewards to you, and we need to actually build out that capability. We don't have it. With some mature platforms, like payments, accounts, client, where now they're publishing events, they're publishing well-defined, ideally domain-designed events out into the world, we actually have a really nice benefit here that rewards might be able to be built without bugging any of them. If the events are good, we can slot this capability in. It just needs permissions to those events, permissions to the event streams, and it now knows when a client's onboarded. It now knows when an account is created. It knows when payments are processed. Once you reach this level of maturity with your event-driven architecture, you can plug in new capabilities really nicely.
What Hurts? What Helps?
We're going to get into some of the pain now. What hurts? Yes, I will talk about what helps. The first one is not a tech problem. Event-driven architectures are hard for people, mainly people who have not yet worked on those architectures before. This is hard, and we see it. We see it in our architects and engineers who needed to learn these new concepts, needed to learn these new patterns. We've seen it with new joiners who, in one of our spaces where we had event sourcing as well as event-driven architecture, it took about six months for a new joiner to get to the point where they were delivering at the same pace as the engineers in that team already. That is a very real consideration. I think we very easily look at the technical tradeoffs. This is a real organizational impacting thing that people are going to be slower to deliver. It's a different paradigm when you're designing those solutions. When teams just step into this world, they may almost forget that they have a different paradigm, and they'll start solving for problems they don't need to solve for. They might forget about the problems that they really do need to solve for now, that is, eventual consistency, the fault tolerance that we've talked about. Some of the other things that I'll get onto. Your people will find it hard, and that should not be ignored.
What helps? There are things that help. Hopefully, you have a developer platform. If you don't, just quickly create one of those. Hopefully, you have some concept like paved roads in your organization. Getting event-driven artifacts into your developer platform, which look like service templates. As an engineer, I can now step in and go, here's a good-shaped template of an event-driven microservice. That will help. That will help people get started quicker. Application modules that take away a lot of the problems that actually we'll talk about coming up, take away those problems so that not every single engineer is having to solve the same problem over and again. Do that and do it early. We did, and we did find it much easier for multiple teams to start building out these architectures.
Your developer platform can have all these lovely artifacts, but you do still need to train your people. In fact, it's a dangerous world if you've given them the keys to developing and smashing an event-driven system into production, but you've not focused on training them. At 2 a.m., when that thing falls over, they're not going to have any idea of what the lovely magic that you've written into that developer platform does. We need to train them. We had an enablement team, and we took that enablement team, and we took a delivery team, they essentially came to us and said, we keep seeing people building event-driven architectures. We'd like some of that, but we've never done it before. We managed to book out an entire week with that team and our enablement team. This doesn't really scale, but it really did work. We ran through some training materials, probably some stuff that's similar to what we're going to be going through today. We also designed and built an event-driven system, it was very small, in their space, that actually ended up in production. By the end of the five days, it wasn't quite in production, but they had a working system. I'm calling that out because I think it's a shift from how some of us think about training materials. This wasn't just a, please go and read a bunch of documentation, or go and watch a video. We sat with them. We taught them some stuff. We then got in, designed their system with them, built the system with them, found the problems, solved them. That team are now off confidently building event-driven systems.
Third callout, aligning on standards and principles across the estate. The earlier you do this, the better. It doesn't need to be that you define everything, but define things like your event contracts, the permissions model that you want on your event streams, and ideally the technology that drives those event streams. All of those things, write them down, agree on them, so that when you go off and you want to consume someone else's events, it's not completely different to the other system that you've already consumed events from. If you end up in that world, you're never going to find pace.
Second pain, two ends of a spectrum here: duplicating events and losing events. Both are things that, in highly regulated industries, we do not want. If you go off to pay your rent and we just happen to lose that event, your landlord never gets your rent payment. You're not going to do very well off that. Alternatively, you go off, you buy your new property, you put down your deposit, and we pay it twice: you're going to be pretty angry. We cannot handle this. It's a callout that's important because in some event-driven architectures, again, with big data, analytics, IoT devices, this may not be a problem. You can afford to lose an event every 100,000 events. We can't. We are a bank. We cannot have that happen. This requires design and build upfront. You cannot leave this until later. You will hurt yourself.
What helps? Two things. Inbox patterns and outbox patterns, and building both of those into that developer platform, into the frameworks that we've just talked about. Build them in immediately so that people get this stuff for free and they don't stumble into paying two deposits.
What's an inbox pattern? What's an outbox pattern? Let's look at the outbox first. An outbox pattern protects you from losing events when you publish them. In our example here, we're looking at onboarding. We're onboarding Aurelia. When we're making that modification to the client's table, we're saving that state. We draw a little transaction around it along with an outbox. In that outbox, we put the event that a client was onboarded with our unique ID. We now know that we have updated our state and published an event at the same time within the same transactional boundary. Without that, you can very easily end up in a situation where you've updated the state, we know Aurelia exists, but then something falls over when we try to publish that event. Not going to be good. We're going to lose any of the benefits that we wanted to get from our event-driven architecture. We save that record to our outbox, and we then just need a little dispatcher pattern. The dispatcher goes off, maybe it's just polling that outbox table, that's fine, and it's going to take that event and actually publish it onto whichever technology you've chosen: Kafka, Kinesis, Event Hubs, whatever it is that you've decided to use. Fine, using an outbox, we've now protected ourselves from losing events. Crucially, we haven't actually protected ourselves from duplicating events. That dispatcher could still publish something twice. Not ideal. Also, the eventing technology that we publish to may just do some kind of at-least-once delivery and we're going to end up getting duplicate events. We still need to handle those. That's ok, because we're going to use an inbox.
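Here's a minimal sketch of the outbox plus dispatcher, using SQLite to stand in for the service's database. Table, column, and event names are illustrative:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE clients (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE outbox  (event_id TEXT PRIMARY KEY, payload TEXT,
                          dispatched INTEGER DEFAULT 0);
""")

def onboard_client(client_id, name):
    # The state change and the event are written in ONE transaction, so we
    # can never update state without also recording the event, or vice versa.
    with db:
        db.execute("INSERT INTO clients VALUES (?, ?)", (client_id, name))
        db.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (f"evt-{client_id}",
             json.dumps({"type": "ClientOnboarded", "clientId": client_id})),
        )

def dispatch(publish):
    # A simple polling dispatcher: publish pending events, then mark them.
    rows = db.execute(
        "SELECT event_id, payload FROM outbox WHERE dispatched = 0").fetchall()
    for event_id, payload in rows:
        publish(json.loads(payload))  # e.g. onto Kafka / Kinesis / Event Hubs
        db.execute("UPDATE outbox SET dispatched = 1 WHERE event_id = ?",
                   (event_id,))
    db.commit()
```

Note the remaining gap the talk calls out: if the dispatcher crashes between publishing and marking the row, the event goes out twice, which is exactly why the consumer-side inbox is still needed.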
The inbox is on the consumer side, where we're now receiving that client-onboarded event. Rather than going off and immediately doing our business logic, dealing with that event however it is that we deal with it, where we could fail for real business validation reasons or we could run into some transient issue like we've talked about before, no. We immediately pump that event into an inbox. The inbox just states: here's the ID of the event, here's the data, we received it. Off the back of that, you then go and do your business logic. Fine, perfect. What this avoids now is that if our at-least-once delivery eventing tech pumps out the same event, that's absolutely fine. We're protected by our inbox. We're going to check the ID and say, I've seen that event before, I'm not doing it again. We're nicely protected.
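A minimal sketch of that inbox check, with an in-memory set standing in for what would be a table in the consumer's database. Names are illustrative:

```python
inbox = set()     # in production: a table of received event IDs, same DB
onboarded = []    # stand-in for the consumer's real business state

def handle_client_onboarded(event):
    """Idempotent consumer: safe under at-least-once delivery."""
    if event["id"] in inbox:
        return "duplicate-ignored"       # seen it before; do not process again
    inbox.add(event["id"])               # record receipt first...
    onboarded.append(event["clientId"])  # ...then do the business logic
    return "processed"
```

In a real implementation the inbox insert and the business-logic writes share one transaction, mirroring the outbox on the producer side.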
The third painful thing to deal with is breaking event contracts. We talked about coupling, and how, by using events, we can decouple systems. However, you are coupled by your events. These are a contract that you have promised to the world. Crucially, you can't take them back. One of the things about event-driven architectures is that you'll publish them onto an event stream, and it's an immutable event stream, and it goes back all the way to the beginning of time. Someone has the right to go back to the beginning of time and replay all of those events. Once you have published something, some data point on your events, you don't get to take it back. Consumers failing because of a change in that event data is a really painful remediation process. We talked about events being immutable. They're not immutable if you're off editing events in your datastore or in your event stream. It's a real thing that I've seen companies doing. Please don't do it. Please don't put yourself in that pain. We need to really care about our event contracts here.
What helps? I find the thought of your event being like an API contract helps people. Probably just because we're more comfortable with what API contracts are and how important they are to not bring in breaking changes. Consider your events like you would consider your APIs. Design them carefully. Be aware that any property that you put on that contract is out there in the world. If you want to remove it, that's a breaking change. Ideally, avoid those breaking changes if you can. If you can't avoid them, version them like you would an API. On a REST API, you would bring in a v2 if you can't remove that breaking change. You can do that same thing with events. Something that you'll see on some event standards is the concept of a data version property. Just some metadata that you put on your event that says, this is actually version two of this event. Now, what this allows your consumers to do, and you can just imagine it as a really simple if-else statement in their code, they now can check that. They can see, ok, for v1, we're going to do this event handling. For v2, this property has been removed, or the data type's changed, or the whole structure of this event has changed. Ideally not, that would almost be a different event. We can go and branch off and handle that event differently. Now we can replay from the beginning of time because we've got v1, v1, v1, v1, v2. Fine, safe.
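That if-else on a data version property can be sketched like this. The event shapes and field names are assumptions for illustration, not a real contract:

```python
def handle_payment_processed(event):
    """Branch on the event's dataVersion so v1 and v2 shapes can be
    replayed side by side from the beginning of time."""
    version = event.get("dataVersion", 1)  # events pre-dating versioning are v1
    if version == 1:
        # v1 carried a single free-text "gateway" field
        return {"gateway": event["gateway"]}
    elif version == 2:
        # v2 split it into scheme + provider (a breaking change, hence v2)
        return {"gateway": f'{event["scheme"]}/{event["provider"]}'}
    raise ValueError(f"unknown dataVersion: {version}")
```

Unknown versions fail loudly rather than being guessed at, so a v3 producer can't silently corrupt a v2-era consumer.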
The other thing that will help is separating your domain and integration events. I'll go into what I mean by domain and integration events. I have a little picture. You will thank yourself later. Essentially, if we draw our bounded context, we draw our domain, you may well have an event-driven architecture within your domain. If we take payments, you may well have some internal events in that domain. That's cool. The key thing is to model your integration events, the events that tie multiple domains together, model those differently. Because this allows you to protect yourself from bleeding domain concepts out that you will then be tied to. You've now contractually said, I'm accidentally pumping this domain concept out, and you're now consuming it, and I want to change my domain, and I can't. We'll get into that in a little bit more detail.
The fourth thing that needs considering, and it's not necessarily an immediate pain, is event ordering. Unless you explicitly configure it, cloud-native eventing tech does not care about the order of your events. They're usually built for scale. They'll make massive statements like, yes, we can handle a million events a second. That's because they don't care about the order of your events. Your retries don't either. We talked about fault tolerance. You could be retrying. You could be backing off. You're playing these events through independently of other events. There is no ordering within your technology, within your architecture. You can introduce it, but just know that it carries more risk. Allowing a client to make two $1 million payments because we hadn't updated our balance yet is slightly career-limiting. We should not do that. There is more risk as soon as we require ordering on our events. It's not a no-go, though. We have two approaches that you can follow for event ordering. Firstly, we can bring in an order. We can stamp our events with a version property, which, for the first event of an aggregate, would say this is version one. For our second one, version two, version three, version four. You can call it whatever you like. I like to call it version. That can enforce ordering. Within your inbox pattern, then, you can add code to that lovely framework that everyone's getting and everyone's using. You can add code that checks for ordering. For our aggregate, let's say an aggregate is our shopping cart. For our shopping cart, we now have ordering, and we know that it's event one, event two, event three, event four for adding those hats. Maybe we care about the order that those hats were added in, in which case, within our inbox, within our ordering check, we say, for this shopping cart, I haven't seen the version one event yet. Therefore, I'm not going to process event two. I'm going to back off. I'm going to go back to the event stream.
Hopefully, eventually, we're living in an eventually consistent world here, event one comes in, we process it. We then get back to retrying event two, and we say, I've seen event one, I'm going to go off and do it. The key thing here is that it is less scalable if you bring in that kind of ordering, because you've just seen what's happened. We've now essentially built a queue into our event-driven architecture without using queuing technologies. It will scale less. It will still work. We have very real implementations of this within the bank where we need that kind of ordering, and it works. You just have to be aware of the impact on scale. The second option here is that you introduce implicit ordering, where your domain handles the types of events that it can process without necessarily saying, I'm going to process them one after another. It may well be that, in this example, we can't pay a beneficiary until we've seen that the beneficiary was created. That makes sense. We don't know the beneficiary details yet. We've implicitly added ordering to our system by having that domain validation. We haven't had to stamp the events. This is a very real approach to the problem. We have a platform within the bank that does this, and it works absolutely fine. They've never needed to introduce ordering version stamps on their events. Two very valid options.
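The version-stamp ordering check from that first option can be sketched like this, as part of the inbox. Names are illustrative:

```python
last_seen = {}  # aggregate_id -> highest version processed (a table in practice)

def try_process(event, apply):
    """Process an aggregate's events strictly in version order, backing off
    (returning the event to the stream for retry) when a gap is detected."""
    expected = last_seen.get(event["aggregateId"], 0) + 1
    if event["version"] > expected:
        return "back-off"     # e.g. v2 arrived before v1: retry later
    if event["version"] < expected:
        return "duplicate"    # already processed; inbox-style skip
    apply(event)              # the actual business logic
    last_seen[event["aggregateId"]] = event["version"]
    return "processed"
```

This is the queue-without-a-queue the talk describes: correctness goes up, throughput per aggregate goes down, so only pay that cost where the domain genuinely needs ordering.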
Summary
Good timing. Let's bring it all together with a very scary number of boxes. I'm not expecting you to immediately understand this, but I want to try to bring all of the concepts that we've talked about, and I appreciate there was a lot, into a very real, again, banking use case where we've got payments and we've got communications. We talked about them before. We now have our domain and integration events. We have our two domains: we have payments and we have communications. Let's follow the flow through. We have our API. Someone has gone to create a payment. It flows through into our outbox. We have our outbox to avoid losing the event. We save the payment to our payments database, and we also save our domain event to our outbox. Maybe we have some internal domain event handling. I'm not going to go into any more detail on that, but maybe we do. Maybe we need to do some stuff. That's fine. We have an inbox on that event handler that avoids processing that event multiple times. Perfect. Our event is called something really funky because we own our domain, and so we've been really verbose with our event naming. SwiftFPSPaymentProcessed, sweet. We'll name it whatever we like because we now have our integration event publisher. This is where I was talking about the difference between domain events and integration events. You build this into your service template so that people will just have this available to them, and they won't accidentally bleed this very specific event out into the world. We have our publisher. Our publisher does three things. It filters, it aggregates, and it transforms. Not all domain events will become integration events. Fine. Sometimes you might have a fan-in where multiple domain events become just one integration event out into the world. Crucially, transformation, where we say our domain event has all of these properties, we only want to publish these. Perfect. We've got the nice protection here.
Some people might see some similarities here with ACLs. You are protecting your boundary. We pump out our payment processed event. Look, we removed all of our silly domain language. We now just know that a payment was processed. Into our other domain now for communications, we have our integration event handler. It handles integration events, well-named. We handle our payment processed. We have another inbox. Lovely. We have some, maybe, filtering, aggregation, transformation here too, to move into our domain events. For brevity, I didn't want to fill this with the same thing, but you have the same thing here as you had in the other domain. We then go off and do our work after the inbox has protected us from sending multiple SMSes to you. Again, we can follow that same flow. SMS delivered, gets transformed to communication sent out into the integration events. I'm not expecting you to immediately be able to go off and build this, but these are all there deliberately. This is how you can do event-driven architectures within a highly regulated industry, with all of these protections in place. What we found, as I stated, we found by building this stuff in to our developer platform. Teams haven't needed to solve these problems. They've not needed to run into these issues at 2 a.m. We've managed to build quite a few things, quite a few platforms in the cloud, in Azure.
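The filter-and-transform half of that publisher can be sketched like this (aggregation omitted for brevity). The event names come from the talk's example; the field names and allow-list are illustrative assumptions:

```python
# The public contract: the only properties allowed to leave the domain.
PUBLISHED_FIELDS = {"paymentId", "amount", "currency"}

def to_integration_event(domain_event):
    """Boundary publisher: filter which domain events escape the domain,
    and transform them so no domain language or internal data bleeds out."""
    # Filter: not every domain event becomes an integration event
    if domain_event["type"] != "SwiftFPSPaymentProcessed":
        return None
    # Transform: neutral name, allow-listed properties only
    return {
        "type": "PaymentProcessed",
        **{k: v for k, v in domain_event.items() if k in PUBLISHED_FIELDS},
    }
```

An explicit allow-list (rather than a deny-list) means a newly added internal property stays internal by default, which is the whole point of protecting the boundary.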
