Retry, Retry & Retry…

Vignesh Thirunavukkarasu
3 min read · Dec 11, 2021

Understanding retry mechanisms in Software Engineering

A quick glance at when and where we need retry mechanisms.

A standard retry pattern

Retry is usually needed when one service or client is talking to another service or system, and the second system is down or unable to respond within the expected time-frame.

Most people talk about retries in the context of microservices (alongside the circuit breaker pattern). However, retry is a crucial element in many other places, including DB persistence, fetching passwords from a secrets store (when the container / VM / EC2 instance is starting up), and message bus communication.

Having said that, the error codes vary based on the protocol used. A DB or message bus error comes with its own specific error code, while HTTP calls return the standard 4xx/5xx status codes, which the initiating system can catch and act on accordingly.

Now, the gold-standard pattern is to attempt 3 retries with a delay of a few seconds between them. If the error still repeats after the 3 retries, we log it and move ahead.
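Here is a minimal sketch of that pattern in Python (the operation, attempt count and delay are placeholders you would adapt to your own system):

```python
import logging
import time

def call_with_retry(operation, max_attempts=3, delay_seconds=2):
    """Run `operation` up to `max_attempts` times, sleeping between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            logging.warning("Attempt %d/%d failed: %s", attempt, max_attempts, exc)
            if attempt == max_attempts:
                # Even after the final retry the error repeats: log it and move ahead.
                logging.error("Giving up after %d attempts", max_attempts)
                return None
            time.sleep(delay_seconds)

# usage: `save_order` is a hypothetical operation that may fail transiently
# call_with_retry(lambda: save_order(order))
```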

During HTTP calls, many developers forget to set a timeout on the initiating system. I find this very critical in system design, because a call without a timeout simply ties up time and resources on the second system, blocking other API calls.
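As a quick sketch, assuming the Python `requests` library and a hypothetical endpoint, the timeout is set per call so the caller never blocks indefinitely:

```python
import requests

def fetch_status(url="https://example.com/api/status"):  # hypothetical endpoint
    # (connect, read) timeouts in seconds; without them this call could hang
    # forever, tying up resources on both systems and blocking other calls.
    response = requests.get(url, timeout=(3, 5))
    response.raise_for_status()  # surfaces 4xx/5xx so the retry logic can act on it
    return response.json()
```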

Now, coming to the delay part: it's a fundamental need that a system gets a few seconds to sort itself out after an error. The delay can grow by one second each attempt (+1) or by a constant step (+x).
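For illustration, those two simple schedules could look like this (the step sizes are arbitrary):

```python
def linear_delays(start=1, step=1, attempts=3):
    """Delays that grow by a fixed step each retry."""
    return [start + i * step for i in range(attempts)]

print(linear_delays())        # +1 -> [1, 2, 3]
print(linear_delays(step=5))  # +x -> [1, 6, 11]
```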

I recently came across a system that uses Fibonacci-based increments. I found it a really interesting use of the famous series — we don't need to maintain the retry count, just a maximum retry time would suffice.
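A sketch of how that could work, assuming the cap is on total time spent waiting rather than on the attempt count:

```python
import time

def retry_with_fibonacci(operation, max_total_seconds=60):
    """Retry with Fibonacci delays (1, 1, 2, 3, 5, ...) until the operation
    succeeds or the total waiting time would exceed `max_total_seconds`."""
    prev_delay, delay, waited = 0, 1, 0
    while True:
        try:
            return operation()
        except Exception:
            if waited + delay > max_total_seconds:
                raise  # time budget exhausted, surface the error
            time.sleep(delay)
            waited += delay
            prev_delay, delay = delay, prev_delay + delay  # next Fibonacci delay
```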

Ok. Let's go a little deeper into where we can apply retry mechanisms.

I usually don't implement retry on authentication systems — SSO / LDAP. Also, be very cautious about retry combined with threading — you need thread-safe state, and each thread should have its own instance that tracks its retries accordingly.
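One way to keep that tracking isolated in Python is `threading.local`, so each worker thread carries its own attempt counter and never steps on another thread's state (a minimal sketch):

```python
import threading

_retry_state = threading.local()  # every thread sees its own attributes

def record_attempt():
    """Increment and return this thread's own retry count."""
    _retry_state.attempts = getattr(_retry_state, "attempts", 0) + 1
    return _retry_state.attempts
```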

As pointed out earlier, a Fibonacci-based pattern is a good fit for use cases that are not critically bound by an SLA — for example, backend jobs (Talend / Spark) that have comparatively more time relaxation.

Final thoughts — I used to work a lot on SMS-based systems, and while interacting with vendors they mentioned that their system retries 3 times within a particular window, and beyond that the messages keep being retried, up to a maximum of 3 days. I was quite surprised at how complex these mechanisms can be.

Once we received 2 SMSes instead of 1; when we enquired, they mentioned that the system could have done a retry — since these were overseas MSISDN numbers, the 2nd retry would have been initiated before the response to the first one was received. This made me think more about conditional retries.

Also — during retries I prefer to have an additional mechanism (on top of traditional logging), such as keeping the failure data in an in-memory cache and exposing it through an API, so the data can be inspected to see why the error occurred in the first place. As an additional step, the production support team can then rectify the error and resubmit (if such a thing is at all possible).

Retry with Error logging on Cache
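A bare-bones sketch of that idea, using an in-process dict as a stand-in for the cache (a real setup might use Redis or another in-memory store, with an HTTP endpoint in front of the lookup):

```python
import datetime
import uuid

failed_requests = {}  # stand-in for an in-memory cache such as Redis

def record_failure(payload, error):
    """Keep the failed payload and error so support can inspect and resubmit it."""
    failure_id = str(uuid.uuid4())
    failed_requests[failure_id] = {
        "payload": payload,
        "error": str(error),
        "failed_at": datetime.datetime.utcnow().isoformat(),
    }
    return failure_id

def list_failures():
    """What an API endpoint could expose to the production support team."""
    return failed_requests
```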

On a personal note, I haven't used any external retry libraries, because they tend to bloat the program slightly.

-cheers
