2022 ObservabilityEngineering

From GM-RKB

Subject Headings: Distributed System Observability, Observability-Driven Development.

Notes

Cited By

2022

Quotes

Book Overview

https://www.oreilly.com/library/view/observability-engineering/9781492076438/

Observability is critical for building, changing, and understanding the software that powers complex modern systems. Teams that adopt observability are much better equipped to ship code swiftly and confidently, identify outliers and aberrant behaviors, and understand the experience of each and every user. This practical book explains the value of observable systems and shows you how to practice observability-driven development.

Authors Charity Majors, Liz Fong-Jones, and George Miranda from Honeycomb explain what constitutes good observability, show you how to improve upon what you're doing today, and provide practical dos and don'ts for migrating from legacy tooling, such as metrics, monitoring, and log management. You'll also learn the impact observability has on organizational culture (and vice versa).

You'll explore:

  • How the concept of observability applies to managing software at scale
  • The value of practicing observability when delivering complex cloud native applications and systems
  • The impact observability has across the entire software development lifecycle
  • How and why different functional teams use observability with service-level objectives
  • How to instrument your code to help future engineers understand the code you wrote today
  • How to produce quality code for context-aware system debugging and maintenance
  • How data-rich analytics can help you debug elusive issues

Table of Contents

Preface
   Who This Book Is For
   Why We Wrote This Book
   What You Will Learn
   Conventions Used in This Book
   Using Code Examples
   O’Reilly Online Learning
   How to Contact Us
   Acknowledgments
I. The Path to Observability
1. What Is Observability?
   The Mathematical Definition of Observability
   Applying Observability to Software Systems
   Mischaracterizations About Observability for Software
   Why Observability Matters Now
       Is This Really the Best Way?
       Why Are Metrics and Monitoring Not Enough?
   Debugging with Metrics Versus Observability
       The Role of Cardinality
       The Role of Dimensionality
   Debugging with Observability
   Observability Is for Modern Systems
   Conclusion
2. How Debugging Practices Differ Between Observability and Monitoring
   How Monitoring Data Is Used for Debugging
       Troubleshooting Behaviors When Using Dashboards
       The Limitations of Troubleshooting by Intuition
       Traditional Monitoring Is Fundamentally Reactive
   How Observability Enables Better Debugging
   Conclusion
3. Lessons from Scaling Without Observability
   An Introduction to Parse
   Scaling at Parse
   The Evolution Toward Modern Systems
   The Evolution Toward Modern Practices
   Shifting Practices at Parse
   Conclusion
4. How Observability Relates to DevOps, SRE, and Cloud Native
   Cloud Native, DevOps, and SRE in a Nutshell
   Observability: Debugging Then Versus Now
   Observability Empowers DevOps and SRE Practices
   Conclusion
II. Fundamentals of Observability
5. Structured Events Are the Building Blocks of Observability
   Debugging with Structured Events
   The Limitations of Metrics as a Building Block
   The Limitations of Traditional Logs as a Building Block
       Unstructured Logs
       Structured Logs
   Properties of Events That Are Useful in Debugging
   Conclusion
6. Stitching Events into Traces
   Distributed Tracing and Why It Matters Now
   The Components of Tracing
   Instrumenting a Trace the Hard Way
   Adding Custom Fields into Trace Spans
   Stitching Events into Traces
   Conclusion
7. Instrumentation with OpenTelemetry
   A Brief Introduction to Instrumentation
   Open Instrumentation Standards
   Instrumentation Using Code-Based Examples
       Start with Automatic Instrumentation
       Add Custom Instrumentation
       Send Instrumentation Data to a Backend System
   Conclusion
8. Analyzing Events to Achieve Observability
   Debugging from Known Conditions
   Debugging from First Principles
       Using the Core Analysis Loop
       Automating the Brute-Force Portion of the Core Analysis Loop
   The Misleading Promise of AIOps
   Conclusion
9. How Observability and Monitoring Come Together
   Where Monitoring Fits
   Where Observability Fits
   System Versus Software Considerations
   Assessing Your Organizational Needs
       Exceptions: Infrastructure Monitoring That Can’t Be Ignored
       Real-World Examples
   Conclusion
III. Observability for Teams
10. Applying Observability Practices in Your Team
   Join a Community Group
   Start with the Biggest Pain Points
   Buy Instead of Build
   Flesh Out Your Instrumentation Iteratively
   Look for Opportunities to Leverage Existing Efforts
   Prepare for the Hardest Last Push
   Conclusion
11. Observability-Driven Development
   Test-Driven Development
   Observability in the Development Cycle
   Determining Where to Debug
   Debugging in the Time of Microservices
   How Instrumentation Drives Observability
   Shifting Observability Left
   Using Observability to Speed Up Software Delivery
   Conclusion
12. Using Service-Level Objectives for Reliability
   Traditional Monitoring Approaches Create Dangerous Alert Fatigue
   Threshold Alerting Is for Known-Unknowns Only
   User Experience Is a North Star
   What Is a Service-Level Objective?
       Reliable Alerting with SLOs
       Changing Culture Toward SLO-Based Alerts: A Case Study
   Conclusion
13. Acting on and Debugging SLO-Based Alerts
   Alerting Before Your Error Budget Is Empty
   Framing Time as a Sliding Window
   Forecasting to Create a Predictive Burn Alert
       The Lookahead Window
       The Baseline Window
       Acting on SLO Burn Alerts
   Using Observability Data for SLOs Versus Time-Series Data
   Conclusion
14. Observability and the Software Supply Chain
   Why Slack Needed Observability
   Instrumentation: Shared Client Libraries and Dimensions
   Case Studies: Operationalizing the Supply Chain
       Understanding Context Through Tooling
       Embedding Actionable Alerting
       Understanding What Changed
   Conclusion
IV. Observability at Scale
15. Build Versus Buy and Return on Investment
   How to Analyze the ROI of Observability
   The Real Costs of Building Your Own
       The Hidden Costs of Using “Free” Software
       The Benefits of Building Your Own
       The Risks of Building Your Own
   The Real Costs of Buying Software
       The Hidden Financial Costs of Commercial Software
       The Hidden Nonfinancial Costs of Commercial Software
       The Benefits of Buying Commercial Software
       The Risks of Buying Commercial Software
   Buy Versus Build Is Not a Binary Choice
   Conclusion
16. Efficient Data Storage
   The Functional Requirements for Observability
       Time-Series Databases Are Inadequate for Observability
       Other Possible Data Stores
       Data Storage Strategies
   Case Study: The Implementation of Honeycomb’s Retriever
       Partitioning Data by Time
       Storing Data by Column Within Segments
       Performing Query Workloads
       Querying for Traces
       Querying Data in Real Time
       Making It Affordable with Tiering
       Making It Fast with Parallelism
       Dealing with High Cardinality
       Scaling and Durability Strategies
       Notes on Building Your Own Efficient Data Store
   Conclusion
17. Cheap and Accurate Enough: Sampling
   Sampling to Refine Your Data Collection
   Using Different Approaches to Sampling
       Constant-Probability Sampling
       Sampling on Recent Traffic Volume
       Sampling Based on Event Content (Keys)
       Combining per Key and Historical Methods
       Choosing Dynamic Sampling Options
       When to Make a Sampling Decision for Traces
   Translating Sampling Strategies into Code
       The Base Case
       Fixed-Rate Sampling
       Recording the Sample Rate
       Consistent Sampling
       Target Rate Sampling
       Having More Than One Static Sample Rate
       Sampling by Key and Target Rate
       Sampling with Dynamic Rates on Arbitrarily Many Keys
       Putting It All Together: Head and Tail per Key Target Rate Sampling
   Conclusion
18. Telemetry Management with Pipelines
   Attributes of Telemetry Pipelines
       Routing
       Security and Compliance
       Workload Isolation
       Data Buffering
       Capacity Management
       Data Filtering and Augmentation
       Data Transformation
       Ensuring Data Quality and Consistency
   Managing a Telemetry Pipeline: Anatomy
   Challenges When Managing a Telemetry Pipeline
       Performance
       Correctness
       Availability
       Reliability
       Isolation
       Data Freshness
   Use Case: Telemetry Management at Slack
       Metrics Aggregation
       Logs and Trace Events
   Open Source Alternatives
   Managing a Telemetry Pipeline: Build Versus Buy
   Conclusion
V. Spreading Observability Culture
19. The Business Case for Observability
   The Reactive Approach to Introducing Change
   The Return on Investment of Observability
   The Proactive Approach to Introducing Change
   Introducing Observability as a Practice
   Using the Appropriate Tools
       Instrumentation
       Data Storage and Analytics
       Rolling Out Tools to Your Teams
   Knowing When You Have Enough Observability
   Conclusion
20. Observability’s Stakeholders and Allies
   Recognizing Nonengineering Observability Needs
   Creating Observability Allies in Practice
       Customer Support Teams
       Customer Success and Product Teams
       Sales and Executive Teams
   Using Observability Versus Business Intelligence Tools
       Query Execution Time
       Accuracy
       Recency
       Structure
       Time Windows
       Ephemerality
   Using Observability and BI Tools Together in Practice
   Conclusion
21. An Observability Maturity Model
   A Note About Maturity Models
   Why Observability Needs a Maturity Model
   About the Observability Maturity Model
   Capabilities Referenced in the OMM
       Respond to System Failure with Resilience
       Deliver High-Quality Code
       Manage Complexity and Technical Debt
       Release on a Predictable Cadence
       Understand User Behavior
   Using the OMM for Your Organization
   Conclusion
22. Where to Go from Here
   Observability, Then Versus Now
   Additional Resources
   Predictions for Where Observability Is Going

Foreword

Over the past couple of years, the term “observability” has moved from the niche fringes of the systems engineering community to the vernacular of the software engineering community. As this term gained prominence, it also suffered the (alas, inevitable) fate of being used interchangeably with another term with which it shares a certain adjacency: “monitoring.”

What then followed was every bit as inevitable as it was unfortunate: monitoring tools and vendors started co-opting and using the same language and vocabulary used by those trying to differentiate the philosophical, technical, and sociotechnical underpinnings of observability from that of monitoring. This muddying of the waters wasn’t particularly helpful, to say the least. It risked conflating “observability” and “monitoring” into a homogeneous construct, thereby making it all the more difficult to have meaningful, nuanced conversations about the differences.

To treat the difference between monitoring and observability as a purely semantic one is a folly. Observability isn’t purely a technical concept that can be achieved by buying an “observability tool” (no matter what any vendor might say) or adopting the open standard du jour. To the contrary, observability is more a sociotechnical concept. Successfully implementing observability depends just as much, if not more, on having the appropriate cultural scaffolding to support the way software is developed, deployed, debugged, and maintained, as it does on having the right tool at one’s disposal.

In most (perhaps even all) scenarios, teams need to leverage both monitoring and observability to successfully build and operate their services. But any such successful implementation requires that practitioners first understand the philosophical differences between the two. What separates monitoring from observability is the state space of system behavior, and moreover, how one might wish to explore the state space and at precisely what level of detail. By “state space,” I’m referring to all the possible emergent behaviors a system might exhibit during various phases: starting from when the system is being designed, to when the system is being developed, to when the system is being tested, to when the system is being deployed, to when the system is being exposed to users, to when the system is being debugged over the course of its lifetime. The more complex the system, the more expansive and protean its state space.

Observability allows for this state space to be painstakingly mapped out and explored in granular detail with a fine-tooth comb. Such meticulous exploration is often required to better understand unpredictable, long-tail, or multimodal distributions in system behavior. Monitoring, in contrast, provides an approximation of overall system health in broad brushstrokes.

It thus follows that everything from the data that’s being collected to this end, to how this data is being stored, to how this data can be explored to better understand system behavior varies vis-à-vis the purposes of monitoring and observability.

Over the past couple of decades, the ethos of monitoring has influenced the development of myriad tools, systems, processes, and practices, many of which have become the de facto industry standard. Because these tools, systems, processes, and practices were designed for the explicit purpose of monitoring, they do a stellar job to this end. However, they cannot—and should not—be rebranded or marketed to unsuspecting customers as “observability” tools or processes. Doing so would provide little to no discernible benefit, in addition to running the risk of being an enormous time, effort, and money sink for the customer.

Furthermore, tools are only one part of the problem. Building or adopting observability tooling and practices that have proven to be successful at other companies won’t necessarily solve all the problems faced by one’s organization, inasmuch as a finished product doesn’t tell the story behind how the tooling and concomitant processes evolved, what overarching problems it aimed to solve, what implicit assumptions were baked into the product, and more.

Building or buying the right observability tool won’t be a panacea without first instituting a conducive cultural framework within the company that sets teams up for success. A mindset and culture rooted in the shibboleths of monitoring—dashboards, alerts, static thresholds—isn’t helpful to unlock the full potential of observability. An observability tool might have access to a very large volume of very granular data, but successfully making sense of the mountain of data—which is the ultimate arbiter of the overall viability and utility of the tool, and arguably that of observability itself!—requires a hypothesis-driven, iterative debugging mindset.

Simply having access to state-of-the-art tools doesn’t automatically cultivate this mindset in practitioners. Nor does waxing eloquent about nebulous philosophical distinctions between monitoring and observability without distilling these ideas into cross-cutting practical solutions. For instance, there are chapters in this book that take a dim view of holding up logs, metrics, and traces as the “three pillars of observability.” While the criticisms aren’t without merit, the truth is that logs, metrics, and traces have long been the only concrete examples of telemetry people running real systems in the real world have had at their disposal to debug their systems, and it was thus inevitable that the narrative of the “three pillars” cropped up around them.

What resonates best with practitioners building systems in the real world isn’t abstract, airy-fairy ideas but an actionable blueprint that addresses and proposes solutions to pressing technical and cultural problems they are facing. This book manages to bridge the chasm that yawns between the philosophical tenets of observability and its praxis, by providing a concrete (if opinionated) blueprint of what putting these ideas into practice might look like.

Instead of focusing on protocols or standards or even low-level representations of various telemetry signals, the book envisages the three pillars of observability as the triad of structured events (or traces without a context field, as I like to call them), iterative verification of hypotheses (or hypothesis-driven debugging, as I like to call it), and the “core analysis loop.” This holistic reframing of the building blocks of observability from first principles helps underscore that telemetry signals alone (or tools built to harness these signals) don’t make system behavior maximally observable. The book does not shy away from shedding light on the challenges one might face when bootstrapping a culture of observability in an organization, and provides valuable guidance on how to go about it in a sustainable manner that should stand observability practitioners in good stead for long-term success.

...

I. The Path to Observability

1. What Is Observability?

2. How Debugging Practices Differ Between Observability and Monitoring

3. Lessons from Scaling Without Observability

4. How Observability Relates to DevOps, SRE, and Cloud Native

II. Fundamentals of Observability

5. Structured Events Are the Building Blocks of Observability

6. Stitching Events into Traces

7. Instrumentation with OpenTelemetry

In the previous two chapters, we described the principles of structured events and tracing. Events and traces are the building blocks of observability that you can use to understand the behavior of your software applications. You can generate those fundamental building blocks by adding instrumentation code into your application to emit telemetry data alongside each invocation. You can then route the emitted telemetry data to a backend data store, so that you can later analyze it to understand application health and help debug issues.
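
As a concrete illustration of that emit-and-route workflow, here is a minimal sketch in Go using the OpenTelemetry SDK. It assumes an OTLP-capable collector or backend listening on localhost:4317; the span name and the app.user_id attribute are purely illustrative, not taken from the book.

    package main

    import (
        "context"
        "log"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    func main() {
        ctx := context.Background()

        // Exporter that ships spans over OTLP/gRPC; the endpoint is an
        // assumption -- point it at whatever collector or backend you run.
        exporter, err := otlptracegrpc.New(ctx,
            otlptracegrpc.WithEndpoint("localhost:4317"),
            otlptracegrpc.WithInsecure(),
        )
        if err != nil {
            log.Fatalf("creating exporter: %v", err)
        }

        // Tracer provider that batches finished spans to the exporter.
        tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
        defer func() { _ = tp.Shutdown(ctx) }()
        otel.SetTracerProvider(tp)

        // Wrap a unit of work in a span and enrich it with a custom
        // field, so later debugging queries can slice by that dimension.
        tracer := otel.Tracer("example")
        _, span := tracer.Start(ctx, "handle-request")
        span.SetAttributes(attribute.String("app.user_id", "hypothetical-user-123"))
        span.End()
    }

In a real service you would propagate the context returned by Start into downstream calls so that child spans nest under this one; the automatic instrumentation libraries the chapter discusses handle that propagation for common frameworks.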

...



References


Charity Majors, Liz Fong-Jones, and George Miranda. (2022). “Observability Engineering: Achieving Production Excellence.” O’Reilly Media.