Not everything has to be asynchronous, and sometimes making something asynchronous leads to problems. The following rules and guidelines should help you minimize those problems and make proper use of asynchronous calling.
Some Important Rules for Async Coding
I see a lot of asynchronous coding these days. By that I mean specifically calls to some other service which will return some form of “ok I’ll do that” message and then queue, schedule, or perform the work in the background. Meanwhile the caller continues on with its duties. I’ve seen this touted and used mostly as a way to “improve the user experience” by not making their interface wait around for the task to complete. But to do it right there are certain things we must do to ensure we don’t make things worse for the user, and even worse for those who have to maintain that which our async ways have wrought. I call them the Basic Async Coding Rules.
Rule 1: Trust But Verify (The Reagan Rule)
Unfortunately I’ve seen too many code paths and web interfaces where you request something to be done, click the button, and it says “OK, I did that for you” before the task is done. Essentially, all it really did was enqueue it - but the caller thinks it was done. This is bad. The caller should never consider the task done unless it has verified the work was done, so your calling code must somehow verify the task was completed. Perhaps that is done via a polling mechanism where you check a status field or table, or perhaps you pull the expected data and validate it is as you expected.
You may think this defeats the point of doing things asynchronously. In some cases it would indeed, and for those cases I refer you to the guidelines later in this article. This verification is also tightly related to the next rule, which will illustrate further why this rule matters. But first let us consider an actual example.
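A minimal Python sketch of the polling idea follows. The name `wait_until_verified` and the `check` callable are made up for illustration; `check` stands in for whatever reads your status field or pulls the expected data and validates it.

```python
import time
from typing import Callable


def wait_until_verified(check: Callable[[], bool],
                        timeout: float = 10.0,
                        interval: float = 0.5) -> bool:
    """Poll `check` until it returns True or `timeout` seconds elapse.

    `check` is a hypothetical caller-supplied function, e.g. one that reads
    a status field or fetches the expected data and validates it.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True    # verified: the work is observably done
        if time.monotonic() >= deadline:
            return False   # not verified: do NOT report "done"
        time.sleep(interval)
```

A `False` result here does not mean the task failed, only that the caller cannot yet claim it succeeded.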
In Redis’ Sentinel you can remove a Redis pod from Sentinel’s management, and this process is done asynchronously. Specifically, Sentinel will check that the pod exists and, if not, return an error. Otherwise it will tell you it did it, but it may not yet have fully removed it. There is an edge case not worth going into here where this can trigger an old configuration to be re-applied due to network buffering of messages related to a removed pod. If your Sentinel management system calls this remove command and does not properly validate the pod was fully removed, you’ll have problems.
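As a rough sketch of what such validation might look like in Python: the function below assumes a client object exposing `sentinel_remove(name)` and `sentinel_masters()` (a mapping keyed by monitored name), in the style of redis-py’s Sentinel commands. The function name, timeout, and interval are my own choices for illustration. It only confirms that this one Sentinel no longer lists the pod; it does not cover the edge case mentioned above.

```python
import time


def remove_and_verify(sentinel_client, name: str,
                      timeout: float = 5.0, interval: float = 0.25) -> bool:
    """Ask one Sentinel to stop managing `name`, then confirm it is gone.

    Assumption: `sentinel_client` offers `sentinel_remove(name)` and
    `sentinel_masters()` returning a mapping keyed by monitored name.
    """
    sentinel_client.sentinel_remove(name)  # an acknowledgement, not proof
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if name not in sentinel_client.sentinel_masters():
            return True
        time.sleep(interval)
    # One last look after the deadline before giving up.
    return name not in sentinel_client.sentinel_masters()
```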
Rule 2: Respect the User’s Trust
Basically this rule means never telling the caller/user a task was “done” or “completed” just because it was successfully submitted via an async call. Whether that call goes to a queueing system or to some other service which does the task asynchronously, the concern here is telling the user or caller the task was done. There are a few specific problems here. First, you are breaking an element of trust between the user and your system. You tell them the task is complete, but in reality that task could actually have failed by the time it was run.
Secondly, doing this makes it more difficult to get to the root of a problem. This is tightly related to the first problem in that your support people start from a position of “the task was done” and have to discover it was not actually done. This delays time to fix/recover and wastes human cycles catching up. This is then compounded when dealing with customers/users as it takes even more time to bring them up to speed - and in the meantime can make you look like you don’t know what you, or your systems, are doing.
A key way to follow this rule is to not tell them it is done, but tell them it is queued, or in process, or in a “pending” state. Then your calling code must somehow verify the task was completed. Perhaps that is done via a polling mechanism where you check a status field or table, or perhaps you pull the expected data and validate it is as you expected. Once you have verified the task was done, and done correctly, you can then tell the user it is done.
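One way to keep this honest in code is to make “done” impossible to report without verification. The sketch below is illustrative only; the state names and message wording are assumptions, not a prescribed vocabulary.

```python
from enum import Enum


class TaskState(Enum):
    QUEUED = "queued"
    IN_PROCESS = "in process"
    DONE = "done"
    FAILED = "failed"


def user_message(state: TaskState, verified: bool) -> str:
    """Build the text shown to the user.

    "done" is reported only when the backend says DONE *and* the caller has
    independently verified the result; otherwise the task is shown as
    queued or pending.
    """
    if state is TaskState.FAILED:
        return "Your request failed."
    if state is TaskState.DONE and verified:
        return "Your request is done."
    if state is TaskState.QUEUED:
        return "Your request is queued."
    return "Your request is pending."
```

Note that an unverified `DONE` still reads as pending to the user, which is the point of the rule.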
Rule 3: Don’t Chain Async Calls.
I am saddened by how often this rule is broken, as it leads to so many problems when things don’t work as expected. There is a strong relationship between this rule and rule #2. Every call to an async call which calls an async call which calls an async call is yet another “Breaking Point of Confusion” - a point in your system which can break in unexpected and hard to trace ways.
Doing this leads to complicated troubleshooting and a higher bar to comprehensive understanding - the ability for the support person or programmer to understand the system as a whole. With all of the hype around micro-services this is a looming pit of darkness. If your front end asynchronously calls a micro-service which makes another async call to a second micro-service which then does that to a third micro-service, it becomes really complicated and difficult to validate the original task was successful or to keep track of where something broke. At some point that task has to be done synchronously anyway - the only question is where in your chain of events it will happen. The more hops you have to this point, the more overly-complicated your system is and the more opportunities you have for problems to arise.
Rule 4: If you break Rule 3, Then For The Love of All That is Holy, Don’t Break Rules 1 and 2.
Of course chances are pretty good that you’re going to violate Rule 3 at some point. There may even be valid times for it. But anytime you break a rule you need to be even more strict about not breaking related rules - it can become a very slippery slope. This means you treat each link in the chain as if it were the user. You apply rules one and two to each service which accepts a call and then performs an async call to another. That means if you have “FE -> Svc1 -> Svc2 -> BG task A” that Svc1 trusts Svc2 to have accepted the task but then it must verify it via whatever means needed so it can communicate that up the chain and be able to respond with a current state if that is how that step works. Further this must be applied at Svc2 when it spawns the background task to do the work - which is ultimately a synchronous task under it.
Regarding Rule #2, each link in the chain must not tell its caller that the task was “done”. It should return with something indicating the state of “queued”, “accepted”, or even “relayed”. This makes it easier to follow the chain of events at three in the morning and find where things broke much faster. All too often in cases where this rule is not applied, the person troubleshooting in the middle of the night will see the first async call returned a message or status indicating the task was “done” or “successful” and stop investigating that chain. That is quite understandable given what they were told. I’ve seen this problem far more often than I used to expect. When reading or expressing it, it seems so obvious. But in reality, where the rush to get things pushed further down the stack takes precedence, we apparently often skip this and just “know” that the (next) backend will do the right thing.
So there you have my first four rules of asynchronous coding. These rules apply regardless of how your task is asynchronous. It could be a call to an external (to the caller) service, or a background event which happens in a fork, thread, coroutine, or goroutine. But ultimately these rules will make your maintenance and user experience much better. Now it is time for some guidelines which will help you keep to these rules - often before you get to the code.
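To make the chain idea concrete, here is a small in-process Python sketch. The `Link` class and its method names are hypothetical, and real services would talk over a network rather than call each other directly. Each link answers “queued” or “relayed” on submission and only reports “done” after checking with the link below it.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Link:
    """One service in a chain such as FE -> Svc1 -> Svc2 -> background task."""
    name: str
    downstream: Optional["Link"] = None
    states: Dict[str, str] = field(default_factory=dict)

    def submit(self, task_id: str) -> str:
        """Accept a task and hand it on; never answer "done" here."""
        if self.downstream is None:
            self.states[task_id] = "queued"    # last link: work is queued
        else:
            self.downstream.submit(task_id)
            self.states[task_id] = "relayed"   # passed on, not finished
        return self.states[task_id]

    def complete(self, task_id: str) -> None:
        """Called on the last link when the background work really finishes."""
        self.states[task_id] = "done"

    def status(self, task_id: str) -> str:
        """Report "done" only after verifying it with the next link down."""
        if self.downstream is not None and self.downstream.status(task_id) == "done":
            self.states[task_id] = "done"
        return self.states.get(task_id, "unknown")
```

With `svc1 = Link("Svc1", Link("Svc2"))`, a call to `svc1.submit("t1")` returns “relayed”, and `svc1.status("t1")` keeps saying so until the last link has actually completed the task.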
Guideline 1: Consider Benefits vs. Costs
At some point the task you are performing will be done synchronously. The closer to the caller this happens, the less complicated your system is and the easier it is to follow these basic rules. Expressed another way, you could think of it as the Async version of “YAGNI” - You Ain’t Gonna Need It. Every async call adds complexity to your system. Add enough complexity and the system becomes complicated. Complexity is fine, complication is bad.
Are you only making this call async so the user of your interface can then do other things? Then ask yourself how long the call will really take and what other benefits making it asynchronous provides, if any. If the call takes under half a second, does it really provide enough benefit to make it async? Is this task part of a sequence of steps a user is taking? If so, can the user continue without this task having been confirmed? If not, it is probable that you are not getting enough benefit for the cost. Let us look at two scenarios.
Scenario 1: Registration Confirmation
In this scenario the user is signing up for your service and you wish to validate their email before the account is “complete”. In the meantime your new user could be looking around the rest of your site with limited access. In this case the validation involves sending them an email and waiting for them to click a link confirming they received it. This is a fine example of when it is appropriate to use an async task and let the user continue.
Scenario 2: Username Checking
For this scenario consider the user registration process where you let the user pick a username. Since at some point enough users will result in an attempt to use a username that already exists you need to implement a means to let the user know if the name they want is taken. This is not a good candidate for an async task. In this case you need to submit the name to a process which checks the data store for the existence of the name already, get some form of “lock” on it if not taken, and let the user know it is available or unavailable immediately. Doing this asynchronously can produce a situation where the user has moved on to the next step in the account creation process only to find later that the name is taken. Even worse is the user being told it was accepted but then the background task which thought it had a usable name hits an error later on when it tries to commit.
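A synchronous check-and-reserve can be quite small. The sketch below uses Python’s built-in `sqlite3` with an in-memory database purely for illustration; the table name and helper names are assumptions, and the “lock” here is simply a primary-key constraint that makes checking and claiming the name a single step.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create a throwaway in-memory store with one row per reserved name."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE usernames (name TEXT PRIMARY KEY)")
    return conn


def reserve_username(conn: sqlite3.Connection, name: str) -> bool:
    """Check and reserve in one step; True means the name is now held."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO usernames (name) VALUES (?)", (name,))
        return True
    except sqlite3.IntegrityError:
        return False  # someone already holds this name
```

The caller gets its answer immediately, so the user is never told a name was accepted only to have a later commit fail.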
These two example scenarios should get across how to judge whether a given task - regardless of time-to-complete - is a good candidate for doing in the background or not. Anything which prevents the user from completing the next step of a process is probably a bad candidate for asynchronous calling.
Guideline 2: Minimize Distance Between Request and Synchronous Completion
As all tasks will at some point be performed synchronously this guideline helps by calling out the effective distance between the initiation of the task (the request) and the code which actually performs it. You want this distance to be the minimum possible, and factor in the time to complete the task. Let us consider the above example of username availability.
If your front-end servers can make this call in milliseconds, then it should be done right there in the application code immediately rather than farmed out to a job queue. The complexity involved with the additional queueing and validation is more costly than simply calling it right there and returning the boolean of availability.
Now let us instead consider a username validation system which has to 1) verify format, 2) verify local availability, and 3) consult a third party service which can take up to five seconds to respond. In this case, and to encapsulate and isolate third party interaction, making this validation component into a micro-service which handles each step asynchronously and is called via the front end asynchronously, with status being shown to the user, could be quite useful. A bit complex, sure, but not to the level of complication. In this case I’d expect the front end the user sees to have each step and its status listed and updated upon its result.
This guideline becomes very important when breaking Rule #3. Keep the chain of async calls to an absolute minimum. This may seem obvious, but frankly when we find ourselves implementing such a system we find it all too easy to simply push the responsibility on to the next step and fail to consider how many steps deep we currently are. One way to help make this obvious is to draw diagrams showing each step and the possible error conditions each step can result in. Sometimes seeing the resulting spaghetti makes you stop and say “wait a minute, that looks complicated”.
Of course these rules and guidelines don’t cover every situation and are not intended to be code-specific. They apply at the “bigger picture” level rather than the code level, but are informative to the code and design of your overall system. Every async call is a potential disaster point. Keeping these to a minimum will provide a more robust system. Ensuring each point has a complete and correct understanding of the task’s current state will make your troubleshooting life much easier - and lead to fewer upset customers.
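A sketch of that three-step validation with per-step status, using Python’s `asyncio`, might look like the following. Everything named here is an assumption for illustration: the format rule, the `taken` set standing in for the local data store, and the `third_party_check` coroutine standing in for the external service. For simplicity the steps run in order and only the slow third-party call is awaited with a timeout.

```python
import asyncio
import re
from typing import Awaitable, Callable, Dict, Set


async def validate_username(
    name: str,
    taken: Set[str],
    third_party_check: Callable[[str], Awaitable[bool]],
    statuses: Dict[str, str],
    third_party_timeout: float = 5.0,
) -> bool:
    """Run the three checks in order, recording each step's state."""
    for step in ("format", "local", "third_party"):
        statuses[step] = "pending"

    # Step 1: format (the pattern is an illustrative assumption).
    valid = re.fullmatch(r"[a-z0-9_]{3,20}", name) is not None
    statuses["format"] = "ok" if valid else "failed"
    if not valid:
        return False

    # Step 2: local availability.
    statuses["local"] = "failed" if name in taken else "ok"
    if statuses["local"] == "failed":
        return False

    # Step 3: slow third-party lookup, bounded by a timeout.
    statuses["third_party"] = "in process"
    try:
        ok = await asyncio.wait_for(third_party_check(name), third_party_timeout)
    except asyncio.TimeoutError:
        statuses["third_party"] = "timed out"
        return False
    statuses["third_party"] = "ok" if ok else "failed"
    return ok
```

The `statuses` dictionary is what a front end could read to list each step and its current state while the validation is still running.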