Batteries
We live in an Amish house. When we moved in, there was no electricity supplied from the power company, but the previous owners did leave behind a little solar setup. The batteries were getting old and in need of replacing, but the solar panels and controller/battery manager were still fine. These powered a few 12v outlets and LED lights scattered throughout the house. We put in commercial power to the house, but we wanted to keep the small solar setup for a kind of backup, and because I like tinkering, and this is something to tinker with.
I knew how solar systems, especially isolated ones like this, worked, from a theoretical perspective at least, before we moved in. The sun shines on the solar panels, which generate electricity, which in turn goes into the controller that manages how fast and to what level the batteries charge. The batteries then store and supply electricity to the lights and other loads on the system. I also figured that it would be really easy to just replace the batteries and have a nice small backup system. The easy thing to do would be to just replace the batteries with new ones of the same size and type.
Complexities
Of course, reality had different plans. We could find no suppliers for the same batteries that were already here. So we took the opportunity to do a bit of investigation (and tinkering) to right-size the system for our needs.
So, we embarked on an epic journey to buy some batteries for our solar setup. Yes, what really should have been a task ended up turning into a project, or something even larger. For us programmers, this isn’t exactly a new concept, right? It’s not as simple as going into a store and saying “give me some batteries for a solar setup”. There are many different battery chemistries available. And multiple supplied voltages within each type of chemistry. And multiple available capacities within each voltage.
What type of battery chemistry did we want? Lead-acid batteries are a common choice, because they are relatively cheap for their storage capacity, and generally long-lasting if maintained well. But lead-acid batteries don’t like to be drained to a low capacity often. You really shouldn’t ever drain them down below 50%. So that impacts the battery sizing you’ll need. Or you could go with a lithium-based battery chemistry. These are more expensive, but you can drain them down further, so you can use more of the available capacity. They are also easier to maintain. But maybe they don’t last as long.
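To make the sizing impact concrete, here is a rough back-of-the-envelope sketch. The 50% limit for lead-acid comes from the paragraph above; the 600 Wh daily load, the 12 V system voltage, and the 80% figure for lithium are assumptions picked purely for illustration, and the function name is hypothetical.

```python
def required_rated_capacity_ah(daily_load_wh, system_voltage, max_depth_of_discharge):
    """Rated amp-hours needed so one day's load stays within the allowed discharge."""
    if not 0 < max_depth_of_discharge <= 1:
        raise ValueError("max_depth_of_discharge must be in (0, 1]")
    usable_ah = daily_load_wh / system_voltage  # amp-hours actually drawn per day
    return usable_ah / max_depth_of_discharge   # rated capacity must be larger


# Illustrative numbers only: a 600 Wh daily load on a 12 V system.
lead_acid_ah = required_rated_capacity_ah(600, 12, 0.5)  # 50% limit from the text
lithium_ah = required_rated_capacity_ah(600, 12, 0.8)    # 80% is an assumed figure
print(f"lead-acid: {lead_acid_ah:.1f} Ah, lithium: {lithium_ah:.1f} Ah")
```

With these made-up inputs, the lead-acid bank needs about 100 Ah of rated capacity while the lithium bank needs about 62.5 Ah for the same load. This ignores things like charging losses, temperature, and days without sun.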
The more I dug into the question of what batteries we needed, the more complexities I found. The more questions arose. The more inputs became necessary.
Software Complexities
We often encounter the same thing when designing software systems. Add this new feature. Oh, and you need to publish data out of here. And react to these events from this system. Oh, what happens if you get an activation event before a registration event? The more systems are connected, the more complexities there are.
It is a common approach now to tend towards event-driven systems. This way, individual pieces of the whole system do not need to know about any of the other pieces. They just concern themselves with the bits of information (events) they actually care about. This is a good model in my opinion, but there are a lot of unspoken complexities here that are often glossed over in the blogs and books and presentations on this architectural approach.
The perfectly ideal event-driven system is elegant and robust. But it is literally impossible to create that perfect system. This is still a broken world we live in. So you need to think about a great many scenarios; not just the happy path. What happens when events are processed out of order? When they are sent out of order? What about when there’s a bug in the producer code that sets a field incorrectly? A consumer bug that processes this event incorrectly?
It becomes all too easy to ignore these complexities, especially when deciding how to design a new system. We want to work with the newest and the “current” technologies, to the point of ignoring the fact that this solution doesn’t actually fit the problems we are facing. Or perhaps it does fit the problem, but we have been lied to about the real amount of work necessary to build a system like this.
There are a great many kinds of problems that do work in an event-driven system. However, this is not a silver-bullet architecture, so let’s take a look at some of the things you need to keep in mind when developing such a system.
Message Retries
Retries are a must. As already stated, this is a broken world, and things will go bump in the night. Think about what errors might occur, and which of those might be transient errors. Which types of errors should not be retried? Can this operation be retried at all?
Just as important as the retries themselves is how you handle messages that exhaust their retries and ultimately fail. Who do you notify? How do you notify them?
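One possible shape for this, as a minimal sketch rather than a recommendation: split errors into transient and permanent, retry only the transient ones with a backoff, and park anything that ultimately fails in a dead-letter list that someone can be notified about. The exception classes, the `process_with_retries` function, and the in-memory `dead_letters` list are all hypothetical stand-ins for whatever your broker or framework provides.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, brief outages)."""


class PermanentError(Exception):
    """Hypothetical marker for errors that will never succeed on retry."""


def process_with_retries(message, handler, dead_letters,
                         max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run handler(message); retry transient errors, dead-letter final failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True
        except PermanentError as exc:
            dead_letters.append((message, repr(exc)))  # retrying would not help
            return False
        except TransientError as exc:
            if attempt == max_attempts:
                dead_letters.append((message, repr(exc)))  # retries exhausted
                return False
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    return False


# Usage: a handler that times out twice, then succeeds on the third attempt.
attempts = []


def flaky_handler(message):
    attempts.append(message)
    if len(attempts) < 3:
        raise TransientError("simulated timeout")


def malformed_handler(message):
    raise PermanentError("payload can never be parsed")


dead_letters = []
no_sleep = lambda seconds: None  # skip real waiting in this demo
ok = process_with_retries("order-42", flaky_handler, dead_letters, sleep=no_sleep)
rejected = process_with_retries("order-43", malformed_handler, dead_letters, sleep=no_sleep)
```

The dead-letter list is where the "who do you notify?" question gets answered: in a real system it would probably be a separate queue or table with an alert attached.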
Events Out of Order
Even when using systems that say they deliver messages in order, can you guarantee that order throughout the entire system? For instance, Kafka may hand out the messages in a partition in order, but did it get those messages in order? Do you have multiple nodes all competing to pull those messages, and processing them in parallel?
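One common defensive measure, sketched here under the assumption that every event carries an `entity_id` and a per-entity `sequence` number that only increases (both field names are made up for this example), is to have the consumer ignore anything older than what it has already applied.

```python
def apply_event(latest_by_entity, event):
    """Apply an event only if it is newer than what we already hold for that entity."""
    entity_id = event["entity_id"]
    current = latest_by_entity.get(entity_id)
    if current is not None and event["sequence"] <= current["sequence"]:
        return False  # stale or duplicate: ignore rather than overwrite newer state
    latest_by_entity[entity_id] = event
    return True


# Usage: sequence 2 arrives before sequence 1.
state = {}
results = [
    apply_event(state, {"entity_id": "user-1", "sequence": 2, "status": "active"}),
    apply_event(state, {"entity_id": "user-1", "sequence": 1, "status": "registered"}),
]
```

This only works when the latest event fully describes the state you care about. If each event triggers its own side effect, dropping the late one loses work, and you would need buffering or some other strategy instead.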
Unknown & Invalid Data
Send, and expect, the bare minimum. Don’t validate the pieces of a message that you don’t actually care about. Pay attention to the message versions, and pay attention to how to maintain backwards compatibility.
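As a small illustration of expecting the bare minimum, the reader below requires only the one field it cannot work without and quietly ignores everything else. The message shape, the field names, and the choice to treat a missing `version` as version 1 are assumptions for the sake of the example.

```python
REQUIRED_FIELDS = ("user_id",)  # the only field this consumer truly needs


def read_user_registered(message):
    """Extract only the fields this consumer uses; ignore everything else."""
    missing = [name for name in REQUIRED_FIELDS if name not in message]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return {
        "user_id": message["user_id"],
        "version": message.get("version", 1),  # assume unversioned means version 1
        "email": message.get("email"),         # optional: absent is fine
    }


# Usage: a newer producer sends a field this consumer has never heard of.
parsed = read_user_registered({
    "version": 3,
    "user_id": "u-1",
    "email": "a@example.com",
    "marketing_opt_in": True,  # unknown field, silently ignored
})
```

Because unknown fields are never validated, a producer can add to the message without breaking this consumer.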
Idempotency & Duplicates
Strive to make actions idempotent, and to not fail on duplicates. Sometimes this is harder than you might think. If you have a message to create a user, and the user’s email already exists, is that a failure because of a duplicate email address, or is that a success because it is a duplicate message?
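One way to tell those two cases apart, assuming each message carries a unique `message_id` (an assumption, as are the class and result names below), is to remember the outcome per message ID. A redelivered message then gets the same answer as the first time, while a different message with the same email is a genuine conflict.

```python
class UserService:
    """In-memory sketch; a real service would persist both dictionaries."""

    def __init__(self):
        self.results_by_message_id = {}
        self.users_by_email = {}

    def handle_create_user(self, message):
        message_id = message["message_id"]
        if message_id in self.results_by_message_id:
            # Same message delivered again: repeat the outcome, no side effects.
            return self.results_by_message_id[message_id]
        email = message["email"]
        if email in self.users_by_email:
            result = "rejected_email_in_use"  # a different message, same email
        else:
            self.users_by_email[email] = message["name"]
            result = "created"
        self.results_by_message_id[message_id] = result
        return result


service = UserService()
first = service.handle_create_user(
    {"message_id": "m-1", "email": "a@example.com", "name": "Ann"})
replay = service.handle_create_user(
    {"message_id": "m-1", "email": "a@example.com", "name": "Ann"})
conflict = service.handle_create_user(
    {"message_id": "m-2", "email": "a@example.com", "name": "Bob"})
```

Here `first` and `replay` both come back as `"created"`, and only `conflict` is rejected. Whether a rejected message counts as a failure or a handled outcome is still a decision you have to make.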
Message Broker Availability
Is your broker up and running? Are all your consumers healthy? Is there any lag or latency in message processing? Make sure your code is robust enough to appropriately handle times when the broker is unavailable.
Transactions
Transactions raise some special questions in an event-driven system. Do you enlist the messaging system in your transactions? Maybe an outbox (or inbox) pattern is more appropriate. Something that might feed into this question is to consider if the message broker is an essential dependency of your service. In other words, if the broker is down, or unreachable, does your service fail health checks?
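For a feel of what an outbox looks like, here is a minimal sketch using an in-memory SQLite database. The table names, the `register_user` and `relay_outbox` functions, and the event shape are all invented for this example. The idea is that the business row and the outgoing event are written in one local transaction, and a separate relay step publishes the event later, so the broker does not have to be reachable at write time.

```python
import json
import sqlite3


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, "
        "published INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def register_user(conn, user_id, email):
    """Write the user row and its event together, or neither."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO users (id, email) VALUES (?, ?)", (user_id, email))
        event = {"type": "user_registered", "user_id": user_id}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay_outbox(conn, publish):
    """Publish pending events in order; mark each one only after publish returns."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # if this raises, the row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


# Usage: `published.append` stands in for a real broker client.
conn = open_db()
register_user(conn, 1, "a@example.com")
published = []
relayed = relay_outbox(conn, published.append)
```

Note that the relay can publish an event and then crash before marking it, so consumers may see the same event twice. That is one more reason to care about idempotency.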
Data Synchronization
How will you get the data back into sync when it becomes out of sync? Not if it becomes out of sync, but when. Is there an option to replay events? Maybe a full data export/import? How do you detect when data is out of sync?
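On the detection question, one simple approach is to compare a digest of each record on both sides. The sketch below assumes you can get both copies as dictionaries keyed by record ID, which is a big assumption for large datasets; the function names and labels are hypothetical.

```python
import hashlib
import json


def record_digest(record):
    """Stable hash of a record, independent of key order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(source, replica):
    """Return {record_id: problem} for every record that differs between copies."""
    drift = {}
    for key in sorted(source.keys() | replica.keys()):
        if key not in replica:
            drift[key] = "missing_in_replica"
        elif key not in source:
            drift[key] = "extra_in_replica"
        elif record_digest(source[key]) != record_digest(replica[key]):
            drift[key] = "mismatch"
    return drift


# Usage: the replica missed an email update and still holds a deleted user.
source = {
    "u-1": {"email": "a@example.com"},
    "u-2": {"email": "b@example.com"},
}
replica = {
    "u-1": {"email": "a@example.com"},
    "u-2": {"email": "old@example.com"},
    "u-3": {"email": "c@example.com"},
}
drift = find_drift(source, replica)
```

A report like this tells you which records to repair, whether by replaying events or by re-exporting just those records.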
Coordinate Deploys
Or rather, don’t coordinate deploys. But do consider backwards compatibility, and cleaning up afterwards.
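As one example of what backwards compatibility can mean in practice, suppose a producer renames a field. If the consumer accepts both names for a while, neither side has to deploy in lockstep, and the old branch can be deleted once nothing sends it anymore. The field names `email_address` (old) and `email` (new) are hypothetical.

```python
def read_email(message):
    """Accept both field names while old and new producers coexist."""
    if "email" in message:          # new name (hypothetical)
        return message["email"]
    if "email_address" in message:  # old name; remove once no producer sends it
        return message["email_address"]
    raise KeyError("message has neither 'email' nor 'email_address'")


# Usage: messages from an old and a new producer are read the same way.
from_old_producer = read_email({"email_address": "a@example.com"})
from_new_producer = read_email({"email": "a@example.com"})
```

The "cleaning up afterwards" part is deleting that second branch, which is easy to forget once everything works.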
Summary
I don’t mean to scare anybody away from an event-driven system. But I do see a lot of code added without much consideration for any of the topics here, and then complaints about how fragile and inconsistent the system ends up being.
Oh, and the batteries for the solar setup? We are still using the old batteries that were here when we bought the place.