2022 ObservabilityEngineering
- (Majors et al., 2022) ⇒ C. Majors, L. Fong-Jones, and G. Miranda. (2022). “Observability Engineering.” O'Reilly Media. ISBN:9781492076414
Subject Headings: Distributed System Observability, Observability-Driven Development.
Notes
Cited By
2022
- Authors' video blog post: https://youtu.be/FZRpQOaePFU
Quotes
Book Overview
https://www.oreilly.com/library/view/observability-engineering/9781492076438/
Observability is critical for building, changing, and understanding the software that powers complex modern systems. Teams that adopt observability are much better equipped to ship code swiftly and confidently, identify outliers and aberrant behaviors, and understand the experience of each and every user. This practical book explains the value of observable systems and shows you how to practice observability-driven development.
Authors Charity Majors, Liz Fong-Jones, and George Miranda from Honeycomb explain what constitutes good observability, show you how to improve upon what you're doing today, and provide practical dos and don'ts for migrating from legacy tooling, such as metrics, monitoring, and log management. You'll also learn the impact observability has on organizational culture (and vice versa).
You'll explore:
- How the concept of observability applies to managing software at scale
- The value of practicing observability when delivering complex cloud native applications and systems
- The impact observability has across the entire software development lifecycle
- How and why different functional teams use observability with service-level objectives
- How to instrument your code to help future engineers understand the code you wrote today
- How to produce quality code for context-aware system debugging and maintenance
- How data-rich analytics can help you debug elusive issues
Table of Contents
Preface
Who This Book Is For
Why We Wrote This Book
What You Will Learn
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
I. The Path to Observability
1. What Is Observability?
The Mathematical Definition of Observability
Applying Observability to Software Systems
Mischaracterizations About Observability for Software
Why Observability Matters Now
Is This Really the Best Way?
Why Are Metrics and Monitoring Not Enough?
Debugging with Metrics Versus Observability
The Role of Cardinality
The Role of Dimensionality
Debugging with Observability
Observability Is for Modern Systems
Conclusion
2. How Debugging Practices Differ Between Observability and Monitoring
How Monitoring Data Is Used for Debugging
Troubleshooting Behaviors When Using Dashboards
The Limitations of Troubleshooting by Intuition
Traditional Monitoring Is Fundamentally Reactive
How Observability Enables Better Debugging
Conclusion
3. Lessons from Scaling Without Observability
An Introduction to Parse
Scaling at Parse
The Evolution Toward Modern Systems
The Evolution Toward Modern Practices
Shifting Practices at Parse
Conclusion
4. How Observability Relates to DevOps, SRE, and Cloud Native
Cloud Native, DevOps, and SRE in a Nutshell
Observability: Debugging Then Versus Now
Observability Empowers DevOps and SRE Practices
Conclusion
II. Fundamentals of Observability
5. Structured Events Are the Building Blocks of Observability
Debugging with Structured Events
The Limitations of Metrics as a Building Block
The Limitations of Traditional Logs as a Building Block
Unstructured Logs
Structured Logs
Properties of Events That Are Useful in Debugging
Conclusion
6. Stitching Events into Traces
Distributed Tracing and Why It Matters Now
The Components of Tracing
Instrumenting a Trace the Hard Way
Adding Custom Fields into Trace Spans
Stitching Events into Traces
Conclusion
7. Instrumentation with OpenTelemetry
A Brief Introduction to Instrumentation
Open Instrumentation Standards
Instrumentation Using Code-Based Examples
Start with Automatic Instrumentation
Add Custom Instrumentation
Send Instrumentation Data to a Backend System
Conclusion
8. Analyzing Events to Achieve Observability
Debugging from Known Conditions
Debugging from First Principles
Using the Core Analysis Loop
Automating the Brute-Force Portion of the Core Analysis Loop
The Misleading Promise of AIOps
Conclusion
9. How Observability and Monitoring Come Together
Where Monitoring Fits
Where Observability Fits
System Versus Software Considerations
Assessing Your Organizational Needs
Exceptions: Infrastructure Monitoring That Can’t Be Ignored
Real-World Examples
Conclusion
III. Observability for Teams
10. Applying Observability Practices in Your Team
Join a Community Group
Start with the Biggest Pain Points
Buy Instead of Build
Flesh Out Your Instrumentation Iteratively
Look for Opportunities to Leverage Existing Efforts
Prepare for the Hardest Last Push
Conclusion
11. Observability-Driven Development
Test-Driven Development
Observability in the Development Cycle
Determining Where to Debug
Debugging in the Time of Microservices
How Instrumentation Drives Observability
Shifting Observability Left
Using Observability to Speed Up Software Delivery
Conclusion
12. Using Service-Level Objectives for Reliability
Traditional Monitoring Approaches Create Dangerous Alert Fatigue
Threshold Alerting Is for Known-Unknowns Only
User Experience Is a North Star
What Is a Service-Level Objective?
Reliable Alerting with SLOs
Changing Culture Toward SLO-Based Alerts: A Case Study
Conclusion
13. Acting on and Debugging SLO-Based Alerts
Alerting Before Your Error Budget Is Empty
Framing Time as a Sliding Window
Forecasting to Create a Predictive Burn Alert
The Lookahead Window
The Baseline Window
Acting on SLO Burn Alerts
Using Observability Data for SLOs Versus Time-Series Data
Conclusion
14. Observability and the Software Supply Chain
Why Slack Needed Observability
Instrumentation: Shared Client Libraries and Dimensions
Case Studies: Operationalizing the Supply Chain
Understanding Context Through Tooling
Embedding Actionable Alerting
Understanding What Changed
Conclusion
IV. Observability at Scale
15. Build Versus Buy and Return on Investment
How to Analyze the ROI of Observability
The Real Costs of Building Your Own
The Hidden Costs of Using “Free” Software
The Benefits of Building Your Own
The Risks of Building Your Own
The Real Costs of Buying Software
The Hidden Financial Costs of Commercial Software
The Hidden Nonfinancial Costs of Commercial Software
The Benefits of Buying Commercial Software
The Risks of Buying Commercial Software
Buy Versus Build Is Not a Binary Choice
Conclusion
16. Efficient Data Storage
The Functional Requirements for Observability
Time-Series Databases Are Inadequate for Observability
Other Possible Data Stores
Data Storage Strategies
Case Study: The Implementation of Honeycomb’s Retriever
Partitioning Data by Time
Storing Data by Column Within Segments
Performing Query Workloads
Querying for Traces
Querying Data in Real Time
Making It Affordable with Tiering
Making It Fast with Parallelism
Dealing with High Cardinality
Scaling and Durability Strategies
Notes on Building Your Own Efficient Data Store
Conclusion
17. Cheap and Accurate Enough: Sampling
Sampling to Refine Your Data Collection
Using Different Approaches to Sampling
Constant-Probability Sampling
Sampling on Recent Traffic Volume
Sampling Based on Event Content (Keys)
Combining per Key and Historical Methods
Choosing Dynamic Sampling Options
When to Make a Sampling Decision for Traces
Translating Sampling Strategies into Code
The Base Case
Fixed-Rate Sampling
Recording the Sample Rate
Consistent Sampling
Target Rate Sampling
Having More Than One Static Sample Rate
Sampling by Key and Target Rate
Sampling with Dynamic Rates on Arbitrarily Many Keys
Putting It All Together: Head and Tail per Key Target Rate Sampling
Conclusion
18. Telemetry Management with Pipelines
Attributes of Telemetry Pipelines
Routing
Security and Compliance
Workload Isolation
Data Buffering
Capacity Management
Data Filtering and Augmentation
Data Transformation
Ensuring Data Quality and Consistency
Managing a Telemetry Pipeline: Anatomy
Challenges When Managing a Telemetry Pipeline
Performance
Correctness
Availability
Reliability
Isolation
Data Freshness
Use Case: Telemetry Management at Slack
Metrics Aggregation
Logs and Trace Events
Open Source Alternatives
Managing a Telemetry Pipeline: Build Versus Buy
Conclusion
V. Spreading Observability Culture
19. The Business Case for Observability
The Reactive Approach to Introducing Change
The Return on Investment of Observability
The Proactive Approach to Introducing Change
Introducing Observability as a Practice
Using the Appropriate Tools
Instrumentation
Data Storage and Analytics
Rolling Out Tools to Your Teams
Knowing When You Have Enough Observability
Conclusion
20. Observability’s Stakeholders and Allies
Recognizing Nonengineering Observability Needs
Creating Observability Allies in Practice
Customer Support Teams
Customer Success and Product Teams
Sales and Executive Teams
Using Observability Versus Business Intelligence Tools
Query Execution Time
Accuracy
Recency
Structure
Time Windows
Ephemerality
Using Observability and BI Tools Together in Practice
Conclusion
21. An Observability Maturity Model
A Note About Maturity Models
Why Observability Needs a Maturity Model
About the Observability Maturity Model
Capabilities Referenced in the OMM
Respond to System Failure with Resilience
Deliver High-Quality Code
Manage Complexity and Technical Debt
Release on a Predictable Cadence
Understand User Behavior
Using the OMM for Your Organization
Conclusion
22. Where to Go from Here
Observability, Then Versus Now
Additional Resources
Predictions for Where Observability Is Going
Foreword
Over the past couple of years, the term “observability” has moved from the niche fringes of the systems engineering community to the vernacular of the software engineering community. As this term gained prominence, it also suffered the (alas, inevitable) fate of being used interchangeably with another term with which it shares a certain adjacency: “monitoring.”
What then followed was every bit as inevitable as it was unfortunate: monitoring tools and vendors started co-opting and using the same language and vocabulary used by those trying to differentiate the philosophical, technical, and sociotechnical underpinnings of observability from that of monitoring. This muddying of the waters wasn’t particularly helpful, to say the least. It risked conflating “observability” and “monitoring” into a homogenous construct, thereby making it all the more difficult to have meaningful, nuanced conversations about the differences.
To treat the difference between monitoring and observability as a purely semantic one is a folly. Observability isn’t purely a technical concept that can be achieved by buying an “observability tool” (no matter what any vendor might say) or adopting the open standard du jour. To the contrary, observability is more a sociotechnical concept. Successfully implementing observability depends just as much, if not more, on having the appropriate cultural scaffolding to support the way software is developed, deployed, debugged, and maintained, as it does on having the right tool at one’s disposal.
In most (perhaps even all) scenarios, teams need to leverage both monitoring and observability to successfully build and operate their services. But any such successful implementation requires that practitioners first understand the philosophical differences between the two. What separates monitoring from observability is the state space of system behavior, and moreover, how one might wish to explore the state space and at precisely what level of detail. By “state space,” I’m referring to all the possible emergent behaviors a system might exhibit during various phases: starting from when the system is being designed, to when the system is being developed, to when the system is being tested, to when the system is being deployed, to when the system is being exposed to users, to when the system is being debugged over the course of its lifetime. The more complex the system, the more expansive and protean the state space.
Observability allows for this state space to be painstakingly mapped out and explored in granular detail with a fine-tooth comb. Such meticulous exploration is often required to better understand unpredictable, long-tail, or multimodal distributions in system behavior. Monitoring, in contrast, provides an approximation of overall system health in broad brushstrokes.
It thus follows that everything from the data that’s being collected to this end, to how this data is being stored, to how this data can be explored to better understand system behavior varies vis-à-vis the purposes of monitoring and observability.
Over the past couple of decades, the ethos of monitoring has influenced the development of myriad tools, systems, processes, and practices, many of which have become the de facto industry standard. Because these tools, systems, processes, and practices were designed for the explicit purpose of monitoring, they do a stellar job to this end. However, they cannot—and should not—be rebranded or marketed to unsuspecting customers as “observability” tools or processes. Doing so would provide little to no discernible benefit, in addition to running the risk of being an enormous time, effort, and money sink for the customer.
Furthermore, tools are only one part of the problem. Building or adopting observability tooling and practices that have proven to be successful at other companies won’t necessarily solve all the problems faced by one’s organization, inasmuch as a finished product doesn’t tell the story behind how the tooling and concomitant processes evolved, what overarching problems it aimed to solve, what implicit assumptions were baked into the product, and more.
Building or buying the right observability tool won’t be a panacea without first instituting a conducive cultural framework within the company that sets teams up for success. A mindset and culture rooted in the shibboleths of monitoring—dashboards, alerts, static thresholds—isn’t helpful to unlock the full potential of observability. An observability tool might have access to a very large volume of very granular data, but successfully making sense of the mountain of data—which is the ultimate arbiter of the overall viability and utility of the tool, and arguably that of observability itself!—requires a hypothesis-driven, iterative debugging mindset.
Simply having access to state-of-the-art tools doesn’t automatically cultivate this mindset in practitioners. Nor does waxing eloquent about nebulous philosophical distinctions between monitoring and observability without distilling these ideas into cross-cutting practical solutions. For instance, there are chapters in this book that take a dim view of holding up logs, metrics, and traces as the “three pillars of observability.” While the criticisms aren’t without merit, the truth is that logs, metrics, and traces have long been the only concrete examples of telemetry people running real systems in the real world have had at their disposal to debug their systems, and it was thus inevitable that the narrative of the “three pillars” cropped up around them.
What resonates best with practitioners building systems in the real world isn’t abstract, airy-fairy ideas but an actionable blueprint that addresses and proposes solutions to pressing technical and cultural problems they are facing. This book manages to bridge the chasm that yawns between the philosophical tenets of observability and their praxis, by providing a concrete (if opinionated) blueprint of what putting these ideas into practice might look like.
Instead of focusing on protocols or standards or even low-level representations of various telemetry signals, the book envisages the three pillars of observability as the triad of structured events (or traces without a context field, as I like to call them), iterative verification of hypotheses (or hypothesis-driven debugging, as I like to call it), and the “core analysis loop.” This holistic reframing of the building blocks of observability from first principles helps underscore that telemetry signals alone (or tools built to harness these signals) don’t make system behavior maximally observable. The book does not shirk from shedding light on the challenges one might face when bootstrapping a culture of observability in an organization, and provides valuable guidance on how to go about it in a sustainable manner that should stand observability practitioners in good stead for long-term success.
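For concreteness, here is a minimal sketch of what such a structured, wide event might look like: one record per unit of work, carrying many queryable dimensions. The field names are illustrative assumptions, not taken from the book.

```python
import json
import time

# A minimal sketch of a structured "wide event": one record per unit of
# work, with many dimensions captured together (field names illustrative).
event = {
    "timestamp": time.time(),
    "service.name": "checkout",
    "trace.trace_id": "2a7f1c9e",     # links the event into a trace
    "request.path": "/cart/confirm",
    "user.id": "user-98765",          # high-cardinality dimension
    "build.id": "7f3a9c2",
    "duration_ms": 212.4,
    "error": None,
}

# Emitting the event as one structured record keeps all dimensions
# queryable together, unlike pre-aggregated metrics.
print(json.dumps(event))
```

Because every dimension lives on the same record, the core analysis loop can slice by any field (user, build, endpoint) without pre-declaring which questions will matter.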
...
I. The Path to Observability
1. What Is Observability?
2. How Debugging Practices Differ Between Observability and Monitoring
3. Lessons from Scaling Without Observability
4. How Observability Relates to DevOps, SRE, and Cloud Native
II. Fundamentals of Observability
5. Structured Events Are the Building Blocks of Observability
6. Stitching Events into Traces
7. Instrumentation with OpenTelemetry
In the previous two chapters, we described the principles of structured events and tracing. Events and traces are the building blocks of observability that you can use to understand the behavior of your software applications. You can generate those fundamental building blocks by adding instrumentation code into your application to emit telemetry data alongside each invocation. You can then route the emitted telemetry data to a backend data store, so that you can later analyze it to understand application health and help debug issues.
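As a rough illustration of that flow, here is a minimal sketch using the OpenTelemetry Python SDK (assuming the opentelemetry-sdk package is installed; the span and attribute names are illustrative, and a console exporter stands in for a real backend):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider and an exporter; ConsoleSpanExporter stands
# in for a real backend (an OTLP exporter would ship data to a collector).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # instrumentation scope name

# Each handled request emits a span enriched with custom fields.
with tracer.start_as_current_span("handle_checkout") as span:
    span.set_attribute("user.id", "user-98765")
    span.set_attribute("cart.item_count", 3)
    # application logic runs here; the span records its own duration
```

Swapping the console exporter for an OTLP exporter would route the same telemetry to a backend data store for later analysis.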
...
...
References
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2022 ObservabilityEngineering | C. Majors; L. Fong-Jones; G. Miranda | | 2022 | Observability Engineering | book | | | | | 2022 |