Recently I stumbled upon a very interesting book that had been a missing piece in the big data literature landscape for some years. The topic of Big Data is, well, big. The idea is actually pretty simple: software architectures capable of collecting, ingesting and storing an obscene amount of information. It happens that this is not a problem with a straightforward solution. Worse, there isn't a single solution, because Big Data is not really one problem but a family of problems. The reason is that the information we want to collect, ingest and store can be of many different types. This implies a large set of problems to solve, and so a large set of solutions arises.
With this comes the inevitable analysis of trade-offs. Which problem do I need to solve? Which solution is the best for my problem? For several years there was no clear answer, and the literature was scarce in this respect. The first and, in my opinion, the best work out there targeting these questions is Designing Data-Intensive Applications (DDIA), written some years ago by Martin Kleppmann. This book is an exhaustive analysis, the definitive reference, on the topic of distributed systems with a special focus on data architectures. I really recommend this book as mandatory reading for anyone doing serious work on distributed systems.
This work, however, leans more toward the theoretical/academic point of view. Don't get me wrong, Martin does frame the theory with very concrete examples and state-of-the-art technology. However, there are some topics inherent to the software industry that are just out of scope, and rightly so.
One of the problems that is usually out of scope in strictly technical books is the organizational, management and product context in which these distributed systems are born. A distributed system, or a big data solution, is just a piece of the bigger picture, and by bigger picture we basically mean the entire company. It is at this precise point that Foundations for Architecting Data Solutions comes to help. Yeah, I know it doesn't have a very impressive score (2.87/5). But in my opinion this is mainly due to wrong expectations. Most people probably assumed a more thorough and systematic technical approach to the topic of architecting data solutions. However, the authors had a very different objective in mind, and understandably so. Trying to write another version of Martin Kleppmann's work would be a waste of time and most probably would not achieve the same excellence. In my opinion the authors' idea was to extend this valuable work with further analysis of the organizational context.
Ted Malaska and Jonathan Seidman actually did a great job framing the technical solutions, well dissected in Martin's work, in the real-world software industry. They tackle the problem of evaluating and selecting data solutions, and they lay out the main categories of big data solutions on the market and the main problems each one is supposed to target. Another very important topic, completely ignored most of the time, is the management of risk. Here they apply ideas from risk management to the context of data projects, in a chapter called (go figure) Managing Risk in Data Projects.
One of the lessons most of us learn the hard way, I mean by making mistakes, is related to interface design. Interfaces are the bread and butter of designing software architectures, and the same principles are also fundamental when designing data solutions. Ted and Jonathan do a great job emphasizing the importance of clear boundaries, clear interfaces/protocols and the idea of decoupling as a fundamental building block for a robust implementation of data solutions.
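To make the idea of a clear boundary a bit more concrete, here is a minimal sketch in Python. It is not taken from the book; `RecordSink`, `InMemorySink` and `ingest` are hypothetical names I made up for illustration. The point is only that the pipeline code depends on a small interface and never on a concrete storage system.

```python
from typing import Iterable, Protocol


class RecordSink(Protocol):
    """Hypothetical boundary: anything that can accept a batch of records."""

    def write(self, records: Iterable[dict]) -> int:
        ...


class InMemorySink:
    """Stand-in storage for illustration; a real sink might wrap a database."""

    def __init__(self) -> None:
        self.rows: list[dict] = []

    def write(self, records: Iterable[dict]) -> int:
        batch = list(records)
        self.rows.extend(batch)
        return len(batch)


def ingest(events: Iterable[dict], sink: RecordSink) -> int:
    """Depends only on the RecordSink interface, not on a concrete store."""
    # Assumption for this sketch: a record is valid if it carries an "id".
    valid = (event for event in events if "id" in event)
    return sink.write(valid)


sink = InMemorySink()
written = ingest([{"id": 1}, {"name": "no id"}, {"id": 2}], sink)
```

Swapping `InMemorySink` for another implementation with the same `write` method would not require touching `ingest`, which is roughly what decoupling buys you.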
Of all the chapters of the book, the least interesting was the one on distributed storage systems. The reason is not that they do a poor job, but that Kleppmann's book, as I said before, is the definitive work in this area, and if you have previously read DDIA you'll feel that much more could be said in this chapter. Again, this is no fault of the authors; it is merely a problem of expectations management.
The rest of the book comprises three more chapters:
- The Meta of Enterprise Data
- Ensuring Data Integrity
- Data Processing
The Meta of Enterprise Data gives an overview of a topic that is typically missed by the literature. Most books out there target a type of data storage paradigm or implementation, and they always focus on the importance of the data itself. The concept of metadata is a complete ugly duckling. The authors did an awesome job framing the importance of metadata, what kinds of metadata there are, and how to process and store it. These techniques are becoming more and more significant due to the latest trends in data regulation, like GDPR.
Ensuring Data Integrity is a chapter that may seem, mainly to the less experienced, not so relevant. However, for those with experience in the design and implementation of data solutions, the ideas tackled here are clearly of the utmost importance.
When working with open source enterprise data management systems, it’s common to use multiple storage and processing layers in our data architecture, which often means storing data in multiple formats in order to optimize access. This can even mean duplicating data, which in the past might have been viewed as an antipattern because of expense and complexity, but with newer systems and cheap storage, this becomes much more practical.
What doesn’t change is the need to ensure the integrity of the data as it moves through the system from the data sources to the final storage of the data. When we talk about data integrity, we mean being able to ensure that the data is accurate and consistent throughout our data pipelines. To ensure data integrity, it’s critical that we have a known lineage for all data as it moves through the system.
This is the gist of the chapter. The end of it revolves around another, apparently not so important, topic: the concept of data fidelity. This is a fancy name for how much of the original information survives. Data pipelines basically collect, ingest and store information; however, during the process, information that is present at the collection phase is often no longer there at the storage phase. This means information loss. This is a typical and very costly error that is committed again and again. The fact that the authors have spent a fair amount of effort dealing with concepts like full fidelity and derived data reveals the vast experience they have, and most certainly even they fell into this trap of information loss.
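To illustrate the difference between full-fidelity and derived data, and how lineage fits in, here is a small Python sketch. It is my own illustration under simple assumptions, not code from the book: the function names (`collect`, `derive_summary`, `verify`), the record layout and the sample event are all hypothetical. The raw payload is stored untouched together with a checksum and its source, while the derived record keeps only some fields and records which step produced it.

```python
import hashlib
import json


def checksum(payload: str) -> str:
    """SHA-256 of the raw payload, used to detect accidental changes."""
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def collect(payload: str, source: str) -> dict:
    """Full fidelity: keep the raw payload as-is, plus lineage and checksum."""
    return {"raw": payload, "lineage": [source], "sha256": checksum(payload)}


def derive_summary(record: dict) -> dict:
    """Derived data: keeps only some fields, so it is lossy by design."""
    event = json.loads(record["raw"])
    return {
        "user": event["user"],
        "amount": event["amount"],
        # Extend the lineage so we know where this record came from.
        "lineage": record["lineage"] + ["derive_summary"],
    }


def verify(record: dict) -> bool:
    """True if the stored raw payload still matches its checksum."""
    return checksum(record["raw"]) == record["sha256"]


# Hypothetical event, made up for this example.
raw = '{"user": "ana", "amount": 12.5, "device": "mobile"}'
stored = collect(raw, source="checkout-service")
summary = derive_summary(stored)
```

The `device` field is gone from `summary`, but because the full-fidelity record was kept, it can still be recovered later if a new use case needs it. If only the derived record had been stored, that information would be lost for good.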
Finally, the last chapter, Data Processing, can be summed up with this image (extracted from the book, btw),
which I think is a pretty nice overview of the current state of the art in data pipelines.
Overall, and despite the mediocre rating, I would definitely recommend this book. It is not a work of art like Kleppmann's. But on the other hand it isn't a 500+ page behemoth. This is a very compact introduction of more or less 150 pages which will give you a very good guide for the analysis, design and implementation of big data solutions.