During my career as a software engineer, I have come to see that few challenges are more meaningful than figuring out how to manage and store data. This became evident when I was involved in the large-scale migration of Docker container images, the design and implementation of a robust vulnerability scanning system for those images, a monolith-to-microservices transition, and the creation of new API access tokens. In each project, decisions about data models, storage solutions, and retrieval strategies were at the core of the work. These high-load experiences taught me that storage structures and the architectures around them are the backbone of almost every modern application.
As I reflect on the work I have done at T-Bank and VK, one thing stands out: the careful selection of data technologies. What unites distributed systems as different as container registries, rate limiters, vulnerability scanners, and social network authentication is that each demands the same foundational, broad, and deep skill of choosing the right data technologies. These projects each have specialized requirements, yet they all depend on an understanding of how to collect, partition, process, and store data.
This article examines three vital resources that have deeply shaped my mindset and methods when architecting data-centric systems: Martin Kleppmann’s “Designing Data-Intensive Applications,” “Apache Kafka: The Definitive Guide” by Neha Narkhede, Gwen Shapira, and Todd Palino, and Eben Hewitt’s “Apache Cassandra: The Definitive Guide.” Though these works cover different technologies, together they offer a powerful intellectual arsenal for building systems that have data at their core, systems that must scale and perform reliably as business needs change.
Understanding the Data Landscape
Before turning to the specific recommendations, it is crucial to set the larger context in which these works sit. Today’s software systems are almost never monolithic, and they are hardly ever isolated. They are composed of a huge number of interconnected services that all need to do the same thing: process data in a manner that is consistent and fault-tolerant. Even a modern continuous integration pipeline would not be possible without a sound underpinning of data distribution and synchronization.
For the developers of these systems, a small set of concepts carries an outsized amount of weight. They must be familiar with data partitioning, replication, and (especially in modern systems) streaming. The following books tackle these topics from several complementary angles, and I believe they will be immensely useful.
Designing Data-Intensive Applications (Martin Kleppmann)
“Designing Data-Intensive Applications” by Martin Kleppmann is a book that many see as a modern classic in the space of data-centric software design. I picked it up while optimizing a subsystem that works with vulnerability scans of millions of Docker images, and that work forced me to confront how necessary it is to truly understand the fundamental principles of large-scale data management, and how quickly a design that ignores those principles falls apart.
This book’s value lies in its in-depth treatment of the varied data models and query languages with which a modern developer must be conversant. Most programmers start their careers working with relational databases, and these remain an important part of many systems. But once you start building microservices or working with genuinely unstructured data, the elegant simplicity of a relational database often proves inadequate. Kleppmann’s explorations of SQL, NoSQL, and graph databases make it clear that no single approach handles every scenario, and he nudges you toward understanding the trade-offs of each one.
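To make that trade-off concrete, here is a small sketch of my own, not code from the book, showing the same profile data modeled relationally, where a join reassembles it at read time, and as a self-contained document; the table and field names are purely illustrative.

```python
import json
import sqlite3

# Relational modeling: normalize the one-to-many relationship into two tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE positions (
        position_id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(user_id),
        title TEXT,
        company TEXT
    );
""")
conn.execute("INSERT INTO users VALUES (1, 'Alice')")
conn.execute("INSERT INTO positions VALUES (1, 1, 'Engineer', 'ExampleCo')")

# Reassembling the profile requires a join at read time.
rows = conn.execute("""
    SELECT u.name, p.title, p.company
    FROM users u JOIN positions p ON p.user_id = u.user_id
""").fetchall()
print(rows)

# Document modeling: keep the whole profile together as one self-contained record.
profile_doc = json.dumps({
    "user_id": 1,
    "name": "Alice",
    "positions": [{"title": "Engineer", "company": "ExampleCo"}],
})
print(profile_doc)
```

The relational form makes ad hoc queries and integrity constraints easy; the document form keeps the whole profile one read away. That is exactly the kind of trade-off Kleppmann asks you to weigh.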
Anyone who needs to build or maintain high-availability, high-performance systems must also understand storage engines. They are the layer on which everything else is built, and the book covers them in real depth. At a high level, Kleppmann contrasts log-structured engines, which append writes to a log and organize data in structures like LSM-trees, with page-oriented engines built around B-trees, and he also looks at databases that keep their working set entirely in memory. Knowing which model your database uses explains a great deal about its read, write, and compaction behavior.
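As a toy illustration of the log-structured idea, a sketch of my own rather than anything from the book, an append-only log plus an in-memory hash index is enough to capture the core mechanics; real engines add compaction, crash recovery, and much more.

```python
class TinyLogStore:
    """Toy append-only key-value store: every write goes to the end of a log
    file, and an in-memory hash index maps each key to the byte offset of its
    latest value. Compaction, crash recovery, and index rebuilding are omitted."""

    def __init__(self, path):
        self.path = path
        self.index = {}               # key -> byte offset of the newest entry
        open(path, "ab").close()      # make sure the log file exists

    def set(self, key, value):
        line = f"{key}\t{value}\n".encode("utf-8")
        with open(self.path, "ab") as log:
            offset = log.tell()       # append mode: position is the end of the file
            log.write(line)
        self.index[key] = offset      # only the newest offset is remembered

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as log:
            log.seek(offset)
            _, value = log.readline().rstrip(b"\n").split(b"\t", 1)
            return value.decode("utf-8")

store = TinyLogStore("/tmp/tiny_log_store.db")
store.set("image:alpine", "scanned")
store.set("image:alpine", "vulnerable")   # an update is just another append
print(store.get("image:alpine"))          # -> vulnerable
```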
This book also does a great job of examining how to work with distributed data. “Designing Data-Intensive Applications” does the heavy lifting of analyzing the strengths and weaknesses of the main approaches to replication. Whether you choose single-leader (master-slave) replication, multi-leader (master-master) replication, or a leaderless, Dynamo-style approach, the book gives you what you need to understand each option and to think more deeply about its impact on your application’s architecture and availability. And if you reach for eventual consistency or distributed consensus, for example using ZooKeeper for leader election, the book will get you thinking about those topics in a serious way.
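For the leaderless, Dynamo-style approach in particular, the key reasoning step is the quorum condition: with n replicas, a write acknowledged by w nodes and a read that consults r nodes are guaranteed to overlap whenever w + r > n, so the read sees at least one copy of the latest acknowledged write. Here is a minimal sketch of my own that checks this exhaustively for small clusters.

```python
# Quorum overlap in leaderless (Dynamo-style) replication: with n replicas,
# a write acknowledged by w nodes and a read that consults r nodes must share
# at least one node whenever w + r > n, so the read observes the latest write.
from itertools import combinations

def quorums_always_overlap(n, w, r):
    nodes = range(n)
    return all(set(write_set) & set(read_set)
               for write_set in combinations(nodes, w)
               for read_set in combinations(nodes, r))

for n, w, r in [(3, 2, 2), (3, 1, 1), (5, 3, 3)]:
    print(f"n={n} w={w} r={r}: overlap guaranteed={quorums_always_overlap(n, w, r)}, "
          f"w + r > n is {w + r > n}")
```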
I would urge first-time readers to treat this volume as both a linear narrative and a reference work to return to whenever an architectural or operational problem comes up. It might seem like a lot to get through, but the sheer weight of experience contained in these pages is something the aspiring architect cannot do without. I have highlighted only a few of its key ideas in this review, and each of them is a good reason to read the book and apply its lessons to whatever you are currently building.
Apache Kafka: The Definitive Guide (Neha Narkhede, Gwen Shapira, and Todd Palino)
Today, the real-time transfer of data through software systems is often just as important as the data itself. At T-Bank, for instance, we depend on a continuous stream of events to track the arrival of new container images (and the scanning of those images for vulnerabilities), to trigger alerts for our SRE (Site Reliability Engineering) team when something goes wrong, and to carry out a number of other mission-critical jobs. At VK, I was responsible for creating new API access tokens and for wiring microservices together through event-driven channels that had to communicate with low latency and high reliability.
“Apache Kafka: The Definitive Guide” tackles exactly this side of contemporary architecture. Kafka has quickly become the standard for high-throughput, fault-tolerant, real-time data streams and is a natural fit for modern, cloud-native architectures. The book offers not only a thorough introduction to Kafka’s features and what makes it tick, but also an in-depth look at the internals that let the system grow from a simple pipeline into something far more capable. Narkhede, Shapira, and Palino start with the basics and then move quickly into advanced topics, such as how Kafka’s log retention lets it hold on to data far longer than a traditional message queue would.
The authors investigate the architectural structures needed to run Kafka at scale. The most crucial lesson I took from the book involves partition management. Distributing data properly across partitions is key to achieving high throughput, and careful planning is needed to avoid pitfalls that lead to data skew, consumer lag, or both. Monitoring partition health and understanding how offsets are tracked are both essential if a Kafka deployment is to be, and remain, a well-oiled data pipeline.
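Here is a minimal sketch of that idea, assuming the kafka-python client, a local broker, and a hypothetical “image-scans” topic: keying each event by image name keeps all events for one image on the same partition and preserves their order, while a skewed or low-cardinality key would concentrate load on a few partitions.

```python
# Sketch: steering events across partitions with a message key, assuming the
# kafka-python client and a hypothetical "image-scans" topic and broker address.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",            # assumed broker address
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",                                    # wait for all in-sync replicas
)

# Events with the same key hash to the same partition, which preserves
# per-image ordering; a low-cardinality or skewed key would pile work onto
# a few partitions and leave the rest idle.
event = {"image": "registry.example.com/app:1.4", "status": "scan_started"}
producer.send("image-scans", key=event["image"], value=event)
producer.flush()
```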
Ensuring operational stability is one of the main thrusts of “Apache Kafka: The Definitive Guide.” The book covers strategies for keeping operations as close to failure-free as possible, built around the key concepts of replication, durability, and maintainability. It has a lot of good advice on each of these, and it highlights the metrics worth monitoring, consumer lag chief among them, that give an early warning when the system is under stress and drifting toward a failure state.
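As a small illustration of that kind of early warning, here is a sketch that assumes the kafka-python client, the same hypothetical topic, and a hypothetical consumer group; it approximates per-partition consumer lag as the gap between the broker’s end offset and the consumer’s current position.

```python
# Sketch: approximating consumer lag per partition, assuming the kafka-python
# client plus a hypothetical topic and consumer group.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "image-scans",                       # assumed topic name
    bootstrap_servers="localhost:9092",  # assumed broker address
    group_id="vuln-scanner",             # assumed consumer group
    enable_auto_commit=False,
)

consumer.poll(timeout_ms=1000)           # join the group and receive an assignment
assignment = consumer.assignment()
end_offsets = consumer.end_offsets(assignment) if assignment else {}
for tp, end_offset in end_offsets.items():
    lag = end_offset - consumer.position(tp)   # messages still waiting to be read
    print(f"{tp.topic}[{tp.partition}] lag={lag}")
# A lag that keeps growing is an early sign that consumers are falling behind.
```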
If you want to construct or enhance a real-time data processing setup, this book is for you. It covers integrating Kafka with existing microservices, adopting event-driven designs, and scaling your cluster as your needs grow. The content is informed by real-world usage patterns, including those on a large scale from well-known internet companies. That gives it a practical dimension that goes beyond mere theory.
Apache Cassandra: The Definitive Guide (Eben Hewitt)
Cassandra occupies a singular position in the data storage ecosystem. It has no master node to direct it, which makes it highly available: you can write to it from almost anywhere in the world, and you can read from almost anywhere as well, provided your queries match the access patterns your tables were designed around. Cassandra handles heavy write loads better than most other systems, even across geographically distributed clusters, which makes it a go-to choice for scenarios demanding massive write throughput and geo-redundancy.
One point that “Apache Cassandra: The Definitive Guide” makes clear is that Cassandra should not be treated as a drop-in replacement for a relational database. The peer-to-peer design enhances reliability and fault tolerance, since no single node is a bottleneck or a single point of failure. However, it also requires a deeper understanding of data replication, consistency levels, and how the gossip protocol keeps cluster state in sync. The book’s coverage of Cassandra’s architecture and design is what equips you to tackle the really tough questions that arise in practice and to avoid some costly mistakes.
Another focus of this book, data modeling, is crucial for anyone who wants to use Cassandra effectively. In the relational world, developers often normalize data to eliminate redundancy and maintain integrity. In Cassandra, the common practice is to design tables based on the queries you expect to run, often denormalizing data to achieve optimal performance and partition distribution. This process may seem counterintuitive at first, but it is the natural outcome of embracing the underlying architectural assumptions. The book offers detailed guidance on effective table design and reminds readers to choose partition keys carefully to avoid hotspots.
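Here is a sketch of what query-first design can look like, assuming the DataStax cassandra-driver, a hypothetical “registry” keyspace, and a table shaped around the question “give me the latest scans for an image”; all names are illustrative.

```python
# Sketch: a table shaped around the query "latest scans for an image",
# assuming the DataStax cassandra-driver and a hypothetical "registry" keyspace.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])           # assumed contact point
session = cluster.connect("registry")      # assumed keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS scans_by_image (
        image_name text,        -- partition key: spreads images across nodes
        scanned_at timestamp,   -- clustering column: orders rows within a partition
        scanner_version text,
        vulnerabilities int,
        PRIMARY KEY ((image_name), scanned_at)
    ) WITH CLUSTERING ORDER BY (scanned_at DESC)
""")

# The query mirrors the table layout: one partition, newest rows first.
rows = session.execute(
    "SELECT scanned_at, vulnerabilities FROM scans_by_image "
    "WHERE image_name = %s LIMIT 10",
    ("registry.example.com/app:1.4",),
)
for row in rows:
    print(row.scanned_at, row.vulnerabilities)
```

Choosing image_name as the partition key keeps each image’s history together and spreads different images across the cluster; an extremely hot image could still produce an oversized partition, which is exactly the kind of hotspot the book warns about.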
“Apache Cassandra: The Definitive Guide” also goes into great depth on operational topics. High throughput and global scale create subtle complexities that extend far beyond development, and the book walks through many of them. To give a flavor of what operators should be familiar with: compaction strategies, tuning the read and write paths for different workloads, availability strategies and their implications for performance, and fault tolerance and performance optimization in a multi-data-center environment.
I found the book’s exploration of consistency levels to be particularly relevant. Cassandra lets developers specify the consistency guarantee they want for each query, ranging from weak (such as ONE) through QUORUM to strong (such as ALL). Hewitt walks through the different ways these levels can be combined to achieve different ends, and in doing so lays out the trade-offs worth weighing when designing for a particular level of availability or latency.
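A short sketch of per-query consistency, again assuming the DataStax cassandra-driver and the hypothetical table from the previous example:

```python
# Sketch: choosing a consistency level per query, assuming the DataStax
# cassandra-driver and the hypothetical scans_by_image table from above.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("registry")   # assumed contact point and keyspace

# QUORUM trades a little latency for stronger guarantees: a majority of
# replicas must answer, so a QUORUM read observes any prior QUORUM write.
read_latest = SimpleStatement(
    "SELECT vulnerabilities FROM scans_by_image WHERE image_name = %s LIMIT 1",
    consistency_level=ConsistencyLevel.QUORUM,
)
print(session.execute(read_latest, ("registry.example.com/app:1.4",)).one())

# ONE is faster and stays available through more failures, but may return
# stale data: acceptable for a dashboard, not for an authorization check.
read_fast = SimpleStatement(
    "SELECT vulnerabilities FROM scans_by_image WHERE image_name = %s LIMIT 1",
    consistency_level=ConsistencyLevel.ONE,
)
print(session.execute(read_fast, ("registry.example.com/app:1.4",)).one())
```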
Integrating Knowledge Across the Three Books
Reading any one of these books in isolation will deepen your knowledge, but the true benefit emerges when their perspectives are combined. In a typical microservices environment, you might rely on Kafka to handle real-time streaming of events between services, use Cassandra to store large volumes of write-heavy data across geographically dispersed nodes, and follow the architectural principles from “Designing Data-Intensive Applications” to stitch the two together coherently. Together, the three resources give you a roadmap not only for building a system that stays resilient under high load, but also for understanding, at a fundamental level, why what you have built works.
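As a rough sketch of how those pieces meet, assuming the kafka-python and cassandra-driver clients plus the hypothetical topic, table, and event fields from the earlier examples, a small consumer can bridge the streaming and storage layers:

```python
# Sketch: consume scan events from Kafka and persist them to Cassandra,
# assuming the kafka-python and cassandra-driver clients plus the hypothetical
# topic, table, and event fields used in the earlier examples.
import json
from datetime import datetime, timezone

from cassandra.cluster import Cluster
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "image-scans",
    bootstrap_servers="localhost:9092",
    group_id="scan-writer",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    enable_auto_commit=False,
)
session = Cluster(["127.0.0.1"]).connect("registry")
insert = session.prepare(
    "INSERT INTO scans_by_image (image_name, scanned_at, scanner_version, vulnerabilities) "
    "VALUES (?, ?, ?, ?)"
)

for message in consumer:
    event = message.value                          # hypothetical event shape
    session.execute(insert, (
        event["image"],
        datetime.now(timezone.utc),
        event.get("scanner_version", "unknown"),
        int(event.get("vulnerabilities", 0)),
    ))
    consumer.commit()                              # commit offsets only after the write succeeds
```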
Lessons Learned from Real-World Usage
One of the most profound lessons learned from these projects is that gradually introducing new technologies often yields better results than trying to move everything over at once. When we began the transition of VK’s large monolithic systems into microservices, we started small, by selecting certain services for careful migration, and we only moved services to Kafka or Cassandra after running successful proof-of-concept trials. This incremental migration approach allowed us to test our assumptions, tweak our configurations, and keep the existing workflows from falling into chaos.
Another crucial realization is that constant monitoring and alerting are critical to success. Whether you are watching Cassandra nodes, Kafka brokers, or a vital real-time pipeline, you must plan for some component failing at the worst possible time. When that happens, robust metrics, logs, and alerts are the first line of defense for diagnosing slowdowns, consumer lag, or node outages. They also feed the iterative improvement cycle: learning how the system behaves under load, applying the necessary optimizations, and verifying the results in production.
When data integrity is of the utmost importance, as with user authentication or financial transactions, it is hard to deliver on all three essential promises at once: consistency, availability, and low latency. At T-Bank, we recognize that we are dealing with sensitive and, in many instances, mission-critical data, so we usually opt for a more consistent and reliable service, even if it means accepting a slower response time. In other contexts, where the data is not so sensitive, we are more willing to accept eventual consistency and let the system be more performant or more resilient to partial failures.
Conclusion
Data storage might seem like a narrower topic than the wider domains of software architecture or DevOps, yet it is at the heart of virtually every contemporary application. The data storage and retrieval pipeline is the basis on which features are built and user experiences are delivered, so it must be well-structured, resilient, and efficient, with a clear path from intake to storage to processing to delivery. Poorly designed or inefficient data storage and access will cripple an application, no matter how clever its features or how engaging its user interface.
Martin Kleppmann’s “Designing Data-Intensive Applications” gives you a comprehensive grounding in architectural principles. “Apache Kafka: The Definitive Guide” takes you deep into the details of high-throughput, fault-tolerant messaging pipelines that support real-time analytics and communication. And if you need to store enormous amounts of data across multiple data centers without sacrificing availability, “Apache Cassandra: The Definitive Guide” is your go-to.
For anyone on a similar journey, I urge you to treat these texts as essential, accessible resources and to dip into them whenever you face a new design problem. Use them, along with the kind of knowledge that only comes from running real systems over months or years, to work through the trade-offs, between consistency and availability, for instance, that these authors describe. And use them to adopt habits of thought and practice that keep your systems both reliable and transparent. The aftermath of our last big outage taught us (once again) that a great deal of work goes on behind the scenes to keep the machine well oiled; work that does not directly add user-visible value but is nonetheless crucial.