“You can’t avoid failure, you have to design for it”: A conversation with Sumit Saha on building reliable distributed systems


In today’s tech-driven landscape, the demand for resilient, scalable systems has never been greater. At TechGrid.Media, we’re excited to share an exclusive conversation with Sumit Saha, a seasoned software engineer with a remarkable career spanning global tech leaders like Microsoft, Google, and bp. Currently based in Prague and working at Microsoft, Sumit brings deep expertise in distributed systems, site reliability engineering, and cloud infrastructure.

In this interview, Sumit offers valuable insights on what it truly means to build “reliable distributed systems” — from handling data integrity challenges to designing with failure in mind. With real-world lessons from his experience at some of the world’s most advanced engineering teams, this discussion is a must-read for anyone looking to improve system reliability, observability, and scalability.

1. What does “reliable distributed systems” mean today? What trade-offs are there?

Today, reliability means that a system keeps performing well even when it hits obstacles, bottlenecks, or delays. Achieving that is rarely straightforward: you often have to trade off speed, accuracy, and the amount of data you store, and the right balance between these depends on the business goals.

Sometimes, for example, accepting eventual consistency and letting data arrive late is perfectly fine, as long as you have a plan to reconcile discrepancies later. Tracking and accountability are also important, but they can slow things down, so they have to be weighed against performance needs. There are many other trade-offs, but the key point is that a system should be designed to handle failures rather than to avoid them.
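To make the eventual-consistency point concrete, here is a minimal sketch in Python (an editorial illustration, not code from Sumit's teams) of the kind of reconciliation job he describes: reads tolerate a lagging replica, and a periodic process repairs any drift against the primary store.

```python
# Hypothetical illustration: accept eventual consistency on the read path,
# but run a periodic reconciliation job that repairs drift later.

from typing import Dict

def reconcile(primary: Dict[str, int], replica: Dict[str, int]) -> int:
    """Compare the replica against the primary and repair any drift.

    Returns the number of entries that had to be corrected.
    """
    repaired = 0
    for key, value in primary.items():
        if replica.get(key) != value:
            replica[key] = value  # fix stale or missing entries
            repaired += 1
    for key in list(replica.keys()):
        if key not in primary:
            del replica[key]      # drop entries the primary no longer has
            repaired += 1
    return repaired

if __name__ == "__main__":
    primary = {"order-1": 100, "order-2": 250}
    replica = {"order-1": 100, "order-2": 200, "order-3": 50}  # drifted copy
    print(reconcile(primary, replica))  # -> 2 corrections
    print(replica)
```

The design choice is that the fast path never waits for the replica to catch up; correctness is restored by the scheduled reconciliation pass.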

2. What hard lessons have you learned about keeping systems in sync?

Keeping systems in sync is harder than it looks, and they drift apart unless you work hard to prevent it. The first hard lesson for me was to build idempotent operations, meaning operations that can be safely repeated without changing the result. Second, simply copying data is not enough; you also need to verify that the copy is still correct. Third, prefer small, independent steps over large multi-step transactions; they perform better and are easier to keep reliable. Finally, it is easy to underestimate how network issues and retries can introduce subtle bugs and undermine data integrity.
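As an illustration of the idempotency lesson, here is a small Python sketch (a hypothetical payment handler, not an API from the interview) that keys each request on a client-supplied idempotency key, so a retry after a network timeout never applies the same charge twice.

```python
# Hypothetical sketch of an idempotent operation: the handler remembers
# results by idempotency key, so retries are safe.

import uuid

class PaymentService:
    def __init__(self):
        self._processed = {}  # idempotency_key -> stored result

    def charge(self, idempotency_key: str, account: str, amount: int) -> dict:
        # If this key was already handled, return the stored result
        # instead of charging again.
        if idempotency_key in self._processed:
            return self._processed[idempotency_key]

        result = {"id": str(uuid.uuid4()), "account": account, "amount": amount}
        self._processed[idempotency_key] = result
        return result

if __name__ == "__main__":
    service = PaymentService()
    key = str(uuid.uuid4())  # the client generates one key per logical request
    first = service.charge(key, "acct-42", 100)
    retry = service.charge(key, "acct-42", 100)  # retried after a timeout
    assert first == retry  # the charge was applied exactly once
```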

3. How do you balance speed and traceability?

Balancing speed and traceability is not easy, but it is possible. These are my tips for getting there. First, keep fast operations separate from tracking and logging work. Second, use trace IDs to follow a request across all the services it touches. Keep logs structured and easy to search for every team member, and keep logging simple but consistent, using the same pattern across teams. Last but not least, don't just build observability tools; integrate them into your everyday workflow.
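The sketch below (an illustrative example using made-up service names, not the logging stack Sumit uses) shows what structured, trace-ID-tagged logging can look like: every log line is a JSON object carrying the same trace_id, so one request can be followed across services.

```python
# Minimal structured logging with trace-ID propagation (illustrative).

import json
import sys
import time
import uuid

def log(trace_id: str, service: str, message: str, **fields) -> None:
    """Emit one structured log line; the pattern stays identical across teams."""
    record = {
        "ts": time.time(),
        "trace_id": trace_id,
        "service": service,
        "message": message,
        **fields,
    }
    sys.stdout.write(json.dumps(record) + "\n")

def handle_checkout(order_id: str) -> None:
    # Generate the trace ID at the edge and pass it to every downstream call.
    trace_id = str(uuid.uuid4())
    log(trace_id, "api-gateway", "checkout received", order_id=order_id)
    log(trace_id, "payment-service", "charge accepted", order_id=order_id)
    log(trace_id, "shipping-service", "label created", order_id=order_id)

if __name__ == "__main__":
    handle_checkout("order-123")
```

Because every line shares the trace_id, a single search term reconstructs the whole request path, while the log call itself stays cheap enough to keep the fast path fast.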

4. Can you share an example of an observability win or failure?

Observability defines how you respond to issues. For example, a small metric once helped engineers catch and fix a significant performance bottleneck before it escalated. On the other hand, a missing trace ID once turned a billing bug into a weeks-long challenge.

Dashboards and alerts are essential for preventing and fixing failures, but they still need curation. A flood of alerts, many of them minor, distracts engineers from the critical ones. And when failures do happen, treat them as opportunities for growth, for example by improving your monitoring approach.
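One way to curate alerts, sketched below as a hypothetical example in Python (not a description of any particular alerting product), is to page only when an error rate stays above a threshold for several consecutive checks, so transient blips do not drown out the alerts that matter.

```python
# Illustrative alert curation: fire only on a sustained threshold breach.

from collections import deque

class SustainedAlert:
    def __init__(self, threshold: float, window: int):
        self.threshold = threshold         # e.g. 0.05 = 5% error rate
        self.window = window               # consecutive breaches required to page
        self.recent = deque(maxlen=window)

    def observe(self, error_rate: float) -> bool:
        """Record one measurement; return True only on a sustained breach."""
        self.recent.append(error_rate > self.threshold)
        return len(self.recent) == self.window and all(self.recent)

if __name__ == "__main__":
    alert = SustainedAlert(threshold=0.05, window=3)
    for rate in [0.02, 0.09, 0.03, 0.08, 0.07, 0.11]:
        if alert.observe(rate):
            print(f"PAGE: error rate {rate:.0%} sustained above threshold")
```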

5. What principle has stuck with you?

First, build systems that fail clearly and are easy to understand even in worst-case scenarios, because simple designs are easier to scale and maintain. Clarity also matters when tracing the root of an issue: it is better to surface a clear error with explicit instructions than to hide the problem. Furthermore, define clear roles and responsibilities between services and invest heavily in monitoring. Good monitoring tools help you detect and fix problems faster and more efficiently.
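To illustrate the "fail clearly" principle, here is a small Python sketch (my own example, with a made-up exchange-rate lookup) that raises an explicit, actionable error instead of silently falling back and hiding the real problem.

```python
# Illustration of failing clearly: an explicit error beats a silent fallback.

class DependencyUnavailableError(Exception):
    """Raised with explicit instructions instead of masking the failure."""

def fetch_exchange_rate(rates: dict, currency: str) -> float:
    if currency not in rates:
        # Hiding the issue (e.g. returning a default rate of 1.0) would
        # silently corrupt downstream calculations. Fail clearly instead.
        raise DependencyUnavailableError(
            f"No exchange rate available for '{currency}'. "
            "Check the rates feed and retry once it is healthy."
        )
    return rates[currency]

if __name__ == "__main__":
    rates = {"EUR": 1.0, "USD": 1.08}
    print(fetch_exchange_rate(rates, "USD"))
    try:
        fetch_exchange_rate(rates, "CZK")
    except DependencyUnavailableError as exc:
        print(f"Clear, actionable failure: {exc}")
```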
