Refactoring Microservices – Error Codes

Do you have legacy microservices?  It’s not surprising if you do, especially if some of those microservices (or macroservices, or miniliths) were really just a lift-and-shift of some already legacy software into a microservice structure and deployment.  Microservices present a unique challenge for refactoring, because the clients are so much harder to find than in a traditional monolithic application.  The compiler can no longer help find the clients of a piece of code.  Practices around service discovery and logging can help a lot, but may also not be foolproof or guaranteed to catch every client.  At the same time, you can refactor the guts of the service all you want without caring about the clients, as long as you do not change the external API for the service.  You do have tests around that external API, right?

Although the clients may be harder to find, microservices typically expose a clearer API than a monolithic counterpart might.  Sure, a class’s public method declarations can provide a pretty clear API, but we all know how muddled an in-memory structure like that can become when multiple developers get their fingers in there, and time has its way with what could have once been a pristine structure.  In a microservice, you are typically forced to intentionally think about how a piece of functionality should be exposed to the wider service ecosystem.  However, even if the API is clearer, it may not be any cleaner than in any other legacy system.

One example of this is the error responses.  Recently, I was working in one of our “legacy” microservices, and the error responses were just horrible.  This microservice handles REST requests and, as you might expect, responds with HTTP status codes to indicate success or failure.  However, the actual status codes were almost always either 200 or 500.  Every once in a while you could get a 404.  Any issue with the client’s request ended up as a 500.  While that is still an error code, sending a 5xx response for a problem with the client’s request is a horrible practice.  Dumb clients may handle both scenarios in the same way, which is fine.  But this bad error “coding” practice can cause real trouble for smart clients, which might know that they can retry a 5xx response after some delay, but that they should not retry a 4xx response without modifying the request.
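To make the "smart client" behavior concrete, here is a minimal sketch of a retry policy keyed on the status code class. The names `should_retry` and `send_with_retry` are hypothetical, and `send` stands in for whatever actually performs the HTTP call. Treating every 5xx as retryable and every 4xx as final is a simplification for illustration, not a rule taken from the service described here.

```python
import time
from typing import Callable, Tuple


def should_retry(status: int) -> bool:
    """Return True only for 5xx responses, where the server is at fault
    and the same request might succeed later."""
    return 500 <= status <= 599


def send_with_retry(
    send: Callable[[], int],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Tuple[int, int]:
    """Call `send` until it returns a non-retryable status or the attempts
    run out. Returns (final_status, attempts_made)."""
    attempts = 0
    while True:
        attempts += 1
        status = send()
        if not should_retry(status) or attempts >= max_attempts:
            return status, attempts
        # Wait before retrying the unchanged request.
        time.sleep(delay_seconds)
```

A client like this wastes its retries on a request that can never succeed whenever the service reports a bad request as a 500, which is exactly the trouble described above.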

In this case, where 500s were being sent when 400-somethings were meant, I felt it was safe to refactor this error API into something more logical.  If any clients were actually retrying failed requests, they would now know they should not simply retry.  This also puts the error in the right ballpark: in the legacy way, it looked like the receiving service was the one at fault, when in reality it was the calling service.  Fixing the error codes makes it apparent that the error is actually on the client side.
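As a rough illustration of that kind of refactoring, the sketch below maps failures to status codes at a single boundary, so that a problem with the request surfaces as a 400 rather than falling through to a 500. Everything here is hypothetical (the `InvalidRequest` and `NotFound` exceptions, the `respond` wrapper, and the `create_order` handler); it is not the actual service code, only one way the idea could look.

```python
from typing import Callable, Dict, Tuple


class InvalidRequest(Exception):
    """Hypothetical exception: the client's request is at fault."""


class NotFound(Exception):
    """Hypothetical exception: the requested resource does not exist."""


def respond(handler: Callable[[Dict], Dict], request: Dict) -> Tuple[int, Dict]:
    """Run a handler and translate its outcome into (status, body)."""
    try:
        return 200, handler(request)
    except InvalidRequest as exc:
        # Client-side problem: retrying the same request will not help.
        return 400, {"error": str(exc)}
    except NotFound as exc:
        return 404, {"error": str(exc)}
    except Exception:
        # Anything unexpected really is the server's fault.
        return 500, {"error": "internal server error"}


def create_order(request: Dict) -> Dict:
    """Hypothetical handler that validates its input before doing work."""
    if "item_id" not in request:
        raise InvalidRequest("item_id is required")
    return {"order_for": request["item_id"]}
```

With this in place, `respond(create_order, {})` yields a 400 with an explanatory body, while a genuine bug inside a handler still yields a 500.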

Not every potential change to error codes is quite so clear-cut.  Some may need to be flagged as API-breaking (such as changes that make a previously successful call now fail).  In certain situations, even removing what used to be a required field, and thus making previously failing calls now succeed, could count as a breaking change that clients need to be notified of and updated for.

Originally Posted on my Blogger site October of 2018
