Modern systems span numerous architectures and technologies and are becoming exponentially more modular, dynamic, and distributed. These complexities pose new challenges for the developers and SRE teams charged with ensuring the availability, reliability, and performance of their systems and infrastructure. Here, you will find resources about the tools, skills, and practices to implement for a strategic, holistic approach to system-wide observability and application monitoring.
There has been a lot of chatter about eBPF in cloud-native communities over the last two years. eBPF was a mainstay at KubeCon, eBPF days and eBPF summits are rapidly growing in popularity, companies like Google and Netflix have been using eBPF for years, and new use cases are emerging all the time. In observability especially, eBPF is expected to be a game changer. So let's look at eBPF: what is the technology, how is it impacting observability, how does it compare with existing observability practices, and what might the future hold?

What Is eBPF, Really?

eBPF is a programming framework that allows us to safely run sandboxed programs in the Linux kernel without changing kernel code. It was originally developed for Linux, which is still where the technology is most mature today, but Microsoft is rapidly evolving an eBPF implementation for Windows. eBPF programs are, by design, highly efficient and secure — they are verified by the kernel to ensure they don't risk the operating system's stability or security.

So Why Is eBPF a Big Deal?

To understand this, we need to understand user space and kernel space. User space is where all applications run. Kernel space sits between user space and the physical hardware. Applications in user space can't access hardware directly; instead, they make system calls to the kernel, which then accesses the hardware. All memory access, file reads/writes, and network traffic go through the kernel, which also manages concurrent processes. Basically, everything goes through the kernel (see the figure below), and eBPF provides a safe, secure way to extend kernel functionality.

User space and kernel space

Historically, for obvious reasons, changing anything in the kernel source code or operating system layer has been super hard. The Linux kernel has around 30M lines of code, and it takes several years for any change to go from an idea to being widely available. First, the Linux community has to agree to it. Then it has to become part of the official Linux release. Then, after a few months, it is picked up by distributions like Red Hat and Ubuntu, which take it to a wider audience. Technically, one could load kernel modules and make changes to the kernel directly, but this is very high risk and involves complex kernel-level programming, so it is almost universally avoided. eBPF solves this by providing a secure and efficient mechanism to attach and run programs in the kernel. Let's look at how eBPF ensures both security and performance.

Highly Secure

Stringent verification — Before any eBPF program can be loaded into a kernel, it is verified by the eBPF verifier, which ensures the code is safe to run — e.g., no unbounded loops, invalid memory accesses, or unsafe operations.
Sandboxed — eBPF programs run in a memory-isolated sandbox within the kernel, separate from other kernel components. This prevents unauthorized access to kernel memory, data structures, and kernel source code.
Limited operations — eBPF programs typically have to be written in a small subset of the C language — a restricted instruction set. This limits the operations that eBPF programs can perform, reducing the risk of security vulnerabilities.

High-Performance/Lightweight

Run as native machine code — eBPF programs run as native machine instructions on the CPU, which leads to faster execution and better performance.
No context switches — A regular application constantly context-switches between user space and kernel space, which is resource intensive.
eBPF programs, as they run in the kernel layer, can directly access kernel data structures and resources.
Event-driven — eBPF programs typically run only in response to specific kernel events rather than being always-on, which minimizes overhead.
Optimized for hardware — eBPF programs are compiled into machine code by the kernel's JIT (just-in-time) compiler just before execution, so the code is optimized for the specific hardware it runs on.

So eBPF provides a safe and efficient hook into the kernel for programming. And given that everything goes through the kernel, this opens up several new possibilities that weren't possible until now.

Why Is This a Big Deal Only Now?

The technology around eBPF has evolved over a long time and has been ~30 years in the making. In the last 7–8 years, eBPF has been used at scale by several large companies, and we're now entering an era where its use is becoming mainstream. See this video by Alexei Starovoitov, the co-creator and co-maintainer of eBPF, on the evolution of the technology.

eBPF — A Brief History

1993 — A paper from Lawrence Berkeley National Lab explores using a kernel agent for packet filtering. This is where the name BPF ("Berkeley Packet Filter") comes from.
1997 — BPF is officially introduced as part of the Linux kernel (version 2.1.75).
1997–2014 — Several features are added to improve, stabilize, and expand BPF capabilities.
2014 — A significant update is introduced, called "extended Berkeley Packet Filter" (eBPF). This version makes big changes to BPF technology and makes it much more widely usable — hence the word "extended."

This release was big because it made extending kernel functionality easy. A programmer could code more or less like they would a regular application, while the surrounding eBPF infrastructure takes care of the low-level verification, security, and efficiency. An entire supporting ecosystem and scaffolding around eBPF makes this possible (see figure below).

Source: https://ebpf.io/what-is-ebpf/

Even better, eBPF programs can be loaded and unloaded from the kernel without any restarts. All this suddenly allowed for widespread adoption and application.

Widespread Adoption in Production Systems

eBPF's popularity has exploded in the last 7–8 years, with several large companies using it in large-scale production systems.

By 2016, Netflix was using eBPF widely for tracing. Brendan Gregg, who drove that work, became widely known in infrastructure and operations circles as an authority on eBPF.
2017 — Facebook open-sourced Katran, its eBPF-based load balancer. Every single packet to Facebook.com since 2017 has passed through eBPF.
2020 — Google made eBPF part of its Kubernetes offering. eBPF now powers the networking, security, and observability layer of GKE. By now, there is also broad enterprise adoption at companies like Capital One and Adobe.
2021 — Facebook, Google, Netflix, Microsoft, and Isovalent came together to announce the eBPF Foundation to manage the growth of eBPF technology.

Now there are thousands of companies using eBPF and hundreds of eBPF projects emerging each year to explore different use cases. eBPF is now a separate subsystem within the Linux kernel with a wide community to support it, and the technology itself has expanded considerably with several new additions.

So What Can We Do With eBPF?

The most common use cases for eBPF fall into three areas: networking, security, and observability. Security and networking have seen wider adoption and application, fueled by projects like Cilium.
In comparison, eBPF-based observability offerings are earlier in their evolution and just getting started. Let's look at the use cases in security and networking first.

Security

Security is a highly popular use case for eBPF. Using eBPF, programs can observe everything happening at the kernel level, process events at high speed to check for unexpected behavior, and raise alerts much more rapidly than otherwise. For example:

Google uses eBPF for intrusion detection at scale.
Shopify uses eBPF to implement container security.
Several third-party security offerings now use eBPF for data gathering and monitoring.

Networking

Networking is another widely applied use case. Operating at the eBPF layer allows for comprehensive network observability, such as visibility into the full network path, including all hops, along with source and destination IPs. With eBPF programs, one can process high-volume network events and manipulate network packets directly within the kernel with very low overhead. This enables networking use cases like load balancing, DDoS prevention, traffic shaping, and quality of service (QoS). For example:

Cloudflare uses eBPF to detect and prevent DDoS attacks, processing 10M packets per second without impacting network performance.
Meta's eBPF-based Katran does load balancing for all of Facebook.

Observability

By now it should be clear how eBPF can be useful in observability: everything passes through the kernel, and eBPF provides a highly performant and secure way to observe everything from the kernel. Let's dive deeper into observability and look at the implications of this technology.

How Exactly Does eBPF Impact Observability?

To explore this, let's step out of the eBPF universe and into the observability universe and look at what makes up a standard observability solution. Any observability solution has four major components:

Data collection — Getting telemetry data from applications and infrastructure
Data processing — Filtering, indexing, and performing computations on the collected data
Data storage — Short-term and long-term storage of data
User experience layer — Determining how data is consumed by the user

Of these, what eBPF impacts (as of today) is really just the data collection layer — the easy gathering of telemetry data directly from the kernel.

eBPF — Impact on observability

So what we mean when we say "eBPF observability" today is using eBPF as the instrumentation mechanism to gather telemetry data, instead of other methods of instrumenting. The other components of an observability solution remain unaffected.

How eBPF Observability Works

To fully understand the underlying mechanisms behind eBPF observability, we need to understand the concept of hooks. As we saw earlier, eBPF programs are primarily event-driven — i.e., they are triggered any time a specific event occurs. For example, every time a function call is made, an eBPF program can be invoked to capture some data for observability purposes. First, these hooks can be in kernel space or user space, so eBPF can be used to monitor both user-space applications and kernel-level events. Second, these hooks can either be pre-determined/static or inserted dynamically into a running system (without restarts!).
Four distinct eBPF mechanisms allow for each of these (see figure below):

Static and dynamic eBPF hooks into user space and kernel space

Kernel tracepoints — used to hook into events pre-defined by kernel developers (with TRACE_EVENT macros)
USDT — used to hook into predefined tracepoints set by developers in application code
Kprobes (kernel probes) — used to dynamically hook into any part of the kernel code at runtime
Uprobes (user probes) — used to dynamically hook into any part of a user-space application at runtime

There are several pre-defined hooks in the kernel space that one can easily attach an eBPF program to (e.g., system calls, function entry/exit, network events, kernel tracepoints). Similarly, in user space, many language runtimes, database systems, and software stacks expose predefined tracepoints that eBPF programs — and tools like the Linux BCC collection — can hook into. But what's more interesting is kprobes and uprobes. What if something is breaking in production, I don't have sufficient information, and I want to add instrumentation dynamically at runtime? That is where kprobes and uprobes allow for powerful observability.

eBPF kprobes and uprobes

For example, using uprobes, one can hook into a specific function within an application at runtime, without modifying the application's code. Whenever the function is executed, an eBPF program can be triggered to capture the required data. This allows for exciting possibilities like live debugging. Now that we know how observability with eBPF works, let's look at use cases.

eBPF Observability Use Cases

eBPF can be used for almost all common existing observability use cases and, in addition, opens up new possibilities:

System and infrastructure monitoring — eBPF allows for deep monitoring of system-level events such as CPU usage, memory allocation, disk I/O, and network traffic. For example, LinkedIn uses eBPF for all of its infrastructure monitoring.
Container and Kubernetes monitoring — Visibility into Kubernetes-specific metrics, resource usage, and the health of individual containers and pods.
Application performance monitoring (APM) — Fine-grained observability into user-space applications and visibility into application throughput, error rates, latency, and traces.
Custom observability — Visibility into custom metrics specific to applications or infrastructure that may not be easily available without writing custom code.
Advanced observability — eBPF can be used for advanced observability use cases such as live debugging, low-overhead application profiling, and system call tracing.

New applications of eBPF in observability are emerging every day. What does this mean for how observability is done today? Is eBPF likely to replace existing forms of instrumentation? Let's compare it with existing options.

eBPF vs. Existing Instrumentation Methods

Today, apart from eBPF, there are two main ways to instrument applications and infrastructure for observability:

Agent-based instrumentation — Independent software SDKs/libraries integrated into application code or infrastructure nodes to collect telemetry data
Sidecar proxy-based instrumentation — Sidecars are lightweight, independent processes that run alongside an application or service. They are popular in microservices and container-based architectures such as Kubernetes.

For a detailed comparison of how eBPF-based instrumentation compares against agents and sidecars, see here. Below is a summary view:
eBPF vs. agents vs. sidecars: Comparison

As we can see, eBPF outperforms existing instrumentation methods across nearly all parameters. There are several benefits:

Can cover everything in one go (infrastructure, applications)
Less intrusive — eBPF does not sit inline with running workloads the way code agents do, executing every time the workload runs. Data collection is out-of-band and sandboxed, so there is no impact on a running system.
Low performance overhead — eBPF runs as native machine code, and there is no context switching.
More secure — due to built-in security measures like verification
Easy to install — can be dropped in without any code changes or restarts
Easy to maintain and update — again, no code changes or restarts
More scalable — driven by easy implementation and maintenance, and low performance overhead

In terms of cons, the primary gap with eBPF observability today is in distributed tracing (feasible, but the use case is still in its early stages). On balance, given the significant advantages eBPF offers over existing instrumentation methods, we can reasonably expect eBPF to emerge as the default next-generation instrumentation platform.

Implications for Observability

What does this mean for the observability industry? What changes? Imagine an observability solution that:

you can drop into the kernel in five minutes
requires no code changes or restarts
covers everything in one go — infrastructure, applications, everything
has near-zero overhead
is highly secure

That is what eBPF makes possible, and that is the reason there is so much excitement around the technology. We can expect the next generation of observability solutions to all be instrumented with eBPF instead of code agents. Traditional players like Datadog and New Relic are already investing in eBPF-based instrumentation to augment their code-based agent portfolios. Meanwhile, there are several next-generation vendors built on eBPF, solving both niche use cases and broader, complex observability problems. While traditional players had to build individual code agents language by language and for each infrastructure component over several years, the new players can reach the same degree of coverage in a few months with eBPF. This also lets them focus on innovating higher up the value chain, such as data processing, user experience, and even AI. In addition, their data processing and user experience layers are built from the ground up to support the new use cases, volumes, and frequencies. All this should drive a large amount of innovation in this space and make observability more seamless, secure, and easy to implement over the coming years.

Who Should Use eBPF Observability?

First, if you're in a modern cloud-native environment (Kubernetes, microservices), the differences between eBPF-based and agent-based approaches are most visible (performance overhead, security, ease of installation, etc.). Second, if you are operating at a large scale, eBPF-based lightweight instrumentation will drive dramatic improvements over the status quo. This is likely one of the reasons eBPF adoption has been highest in technology companies with massive footprints like LinkedIn, Netflix, and Meta. Third, if you're short on tech capacity and are looking for an observability solution that requires almost no effort to install and maintain, go straight for an eBPF-based solution.

Summary

In summary, by offering a significantly better instrumentation mechanism, eBPF has the potential to fundamentally reshape our approach to observability in the years ahead.
While in this article we primarily explored eBPF’s application in data collection/instrumentation, future applications could see eBPF used in data processing or even data storage layers. The possibilities are broad and as yet unexplored.
In the beginning, IT teams used application performance monitoring (APM) and network performance monitoring (NPM) tools to oversee and diagnose problems at the application and infrastructure levels. With the advent of contemporary development practices, however, the introduction of numerous distributed components made it difficult for APM and NPM solutions to offer complete visibility across the entire system. As a result, observability emerged as the logical successor to APM and NPM, owing to its capacity to provide comprehensive visibility into a distributed IT system. Through observability, businesses can take proactive measures to resolve production-level issues. Observability solutions aim to address three significant challenges: enhancing digital experiences, ensuring high availability and scalability, and maintaining peak performance. In this article, we will discuss what enhancing the digital experience means and how organizations are enhancing digital experiences through observability.

Enhancing Digital Experience

In modern times, outstanding digital experiences have become a necessity. It has been observed that 80% of customer interactions take place on digital platforms. Furthermore, 70% of customers consider the digital experience essential when evaluating a product. Observability refers to the ability to have visibility into the complete IT ecosystem, which can assist in identifying the underlying causes of issues that affect the digital experience. Some key performance indicators (KPIs) used to measure digital experience include average response time, error rate, load time, and usability.

Dubai Customs Reduced the Number of Tests by 90% and Sped Up the Release Cycle by 70%

Dubai Customs relies heavily on Mirsal, a crucial application that manages the inflow and outflow of goods from the Emirates. A failure in this application can cause significant delays and long queues at the border, leading to chaos in trade, so ensuring maximum uptime of the application is critical. In addition, Dubai Customs aims to proactively engage with its customers to help them complete their transactions in case of any issues. However, its legacy monitoring tool could not meet these requirements. To address these challenges, Dubai Customs adopted an end-to-end observability solution that continuously monitors Mirsal, ensuring high uptime and a good user experience. This solution helped Dubai Customs identify the root cause of issues that impacted the user experience and accelerated time-to-market, speeding up the release cycle by 70%. Furthermore, teams used a shift-left approach to eliminate bugs in pre-production, leading to the release of higher-quality software. All tests were streamlined into a unified solution, reducing the number of tests required for each release from 30 to 3. As a result, Dubai Customs can now capture the insights it needs about customer issues and track their journey to understand their pain points, leading to enhanced customer service.

TSB Bank Accelerated Digital Innovation and Enhanced Customer Experiences

TSB, the 7th largest retail bank in the UK, aimed to expand its digital footprint and promote innovation. To achieve this goal, it built a modern banking platform using AWS, IBM Cloud, and BT Cloud. However, the transition to a multi-cloud architecture led to numerous distributed components, which made it challenging to obtain full-stack visibility.
As a result, TSB's teams had to spend a significant amount of time chasing reactive problems, compromising efficiency and customer experience. To address these challenges, TSB adopted an observability solution that could track all the components and dependencies across multiple clouds, providing real-time insights into the customer experience. Observability also enabled TSB's engineers to analyze the root cause of problems and resolve them before changes went live in the production environment. Ultimately, the adoption of observability accelerated TSB's digital transformation, enhanced employee efficiency, and improved customer experience.

Channel 7 Delivered an A-Grade Streaming Experience With 100% Uptime

Channel 7 is Australia's leading free-to-air commercial television network, renowned for its popular news, sports, and entertainment programs. The network held the rights to stream and televise significant sporting events such as the AFL Grand Final, the 2020 Tokyo Olympics, and the 2022 Winter Olympics. To capture the imagination of millions of people and ensure a seamless streaming experience, Channel 7 needed to guarantee high uptime and an A-grade streaming experience, so understanding how its infrastructure and components would scale was of paramount importance. However, its homegrown monitoring and troubleshooting tools were insufficient to cope with the scale of these mega sporting events, leading to the adoption of observability. Now, Channel 7 can collect valuable data at the application level to understand the customer journey and identify issues that affect the user experience. It can also determine the number of users the infrastructure can handle at a particular time. Opting for observability was a significant victory for Channel 7, resulting in 100% uptime and a flawless user experience. As a result, Channel 7 streamed an impressive 4.7 billion minutes during the Tokyo 2020 Olympics, 32.6 million minutes during the AFL Grand Final, and 376 million minutes during the 2022 Winter Olympics.

Swiggy Served More Than 30 Million Users and Increased Productivity by 10%

Swiggy, a popular food ordering and delivery platform in India with over 30 million customers across 500 cities, faced the challenge of delivering a comprehensive digital experience while sustaining user growth. As the business expanded, the tech stack became more complex, making it difficult to optimize operational efficiency and ensure the platform's reliability, scalability, and stability. The existing monitoring and troubleshooting tools were inadequate to provide insights into the end-user experience. To overcome these challenges, Swiggy implemented observability. With this solution, engineers could obtain customer insights within 15 minutes, enabling them to identify and prevent outages. Observability also enabled Swiggy to pinpoint the areas of the user interface where most users land, allowing them to optimize app design and enhance the overall app experience.

Wrapping Up

Observability enhances the digital experience by speeding up the release cycle and accelerating digital innovation. It ensures high availability by decreasing processing time and MTTR and increasing customer conversions. Moreover, it maintains peak performance while increasing deployment frequency and reducing downtime and cognitive load. If you know of any organizations worth mentioning in this list, let me know in the comments below.
Are you looking to get away from proprietary instrumentation? Are you interested in open-source observability but lack the knowledge to just dive right in? This workshop is for you, designed to expand your knowledge and understanding of the open-source observability tooling available to you today. Dive right into a free, online, self-paced, hands-on workshop introducing you to Prometheus. Prometheus is an open-source systems monitoring and alerting toolkit that enables you to hit the ground running with discovering, collecting, and querying your observability data today. Over the course of this workshop, you will learn what Prometheus is and is not, install it, start collecting metrics, and learn all the things you need to know to become effective at running Prometheus in your observability stack. Previously, I shared an introduction to Prometheus, installing Prometheus, and an introduction to the query language as free online workshop labs. In this article, you'll continue your journey by exploring a few basic Prometheus queries using PromQL. Note that this article is only a short summary, so please see the complete lab found online here to work through it in its entirety yourself.

The following is a short overview of what is in this specific lab of the workshop. Each lab starts with a goal. In this case, it is fairly simple: this lab dives into using PromQL to explore basic querying so that you can use it to visualize collected metrics data. You'll start off by getting comfortable selecting metrics while learning the basic definitions of metrics query terminology. Each query is presented for you to cut and paste into your own Prometheus expression browser, followed by an example output so that you can see what the query is doing. These examples were generated on a system that had been running for several hours, giving them more color and diversity when graphing data results than you might see if you are working quickly through this self-paced workshop. This doesn't take away from the exercise; it's meant to provide slightly more extensive example results than you might be seeing yourself. Next, you'll learn how to filter your query results as you narrow the sets of data you're selecting using metric labels, also known as metric dimensions. You'll explore matching operators, filter with regular expressions, dig into instant vectors and range vectors, learn about time-shifting, explore a bunch of ways to visualize your data in graphs, take a side step into the discussion around the clauses without and by, and finally, learn how to use math in your queries.

Selecting Data

The very basic beginning of any query language is being able to select data from your metrics collection.
You'll kick off this lab by learning the basic terminology involved in metrics queries:

Metric name — Querying for all time series data collected for that metric
Label — Using one or more assigned labels to filter metric output
Timestamp — Fixing a query to a single moment in time of your choosing
Range of time — Setting a query to select over a certain period of time
Time-shifted data — Setting a query to select over a period of time while adding an offset from the time of query execution (looking into the past)

Applying all of these, you'll quickly start selecting metrics by name, such as the metric below from the services demo you set up in a previous lab:

demo_api_request_duration_seconds_count

This query, when entered in your Prometheus expression browser, will result in something like the following query results. You'll continue by learning how to filter using labels and how to narrow down your query results and metric dimensions. After several iterations, you'll eventually find the data you're looking for with the following selection query filtering on multiple labels:

demo_api_request_duration_seconds_count{method="POST", path="/api/foo", status="500"}

The final results are much more refined than where you started. This wraps up the selection query section and sends you onward with hands-on practical experience in selection queries.

More Ways to Skin Your Data

While using the equals operator is one way to select and filter data, there are more ways to approach it. The next section spends some time showing how you can use regular expressions in your queries. It wraps up with the realization that, up to now, you've written queries selecting the single latest value for all time series data found, better known as an instant vector. You are then introduced to the concept of a range vector — queries, or functions, that require a range of values. These have a duration specifier in the form [<number><unit>]; for example, the following query selects all user CPU usage over the last minute:

demo_cpu_usage_seconds_total{mode="user"}[1m]

This results in a display of all the values found over the last minute for this metric. Next, you'll learn how to take range vectors and apply a little time-shifting to look at a window of data in the past, something that is quite common when troubleshooting issues after the fact. While this can be fun, one of the most common things you'll want to see in your data is how some metric has changed over a certain period of time. This is done using the rate function, and you'll explore both it and its helper, the irate function. Below is the rate query:

rate(demo_api_request_duration_seconds_count{method="POST"}[1m])

And the corresponding results in a graph. While these functions have so far visualized counters in graph form, there are many other types of metrics you'll want to explore, such as gauge metrics. You'll spend time exploring all these examples and running more queries to visualize gauge metrics. This leads to a bit of a more complex issue: what do you do with queries where you want to aggregate over highly dimensional data? This requires simplifying and creating a less detailed view of the data results with functions like sum, min, max, and avg. These do not aggregate over time, but across multiple series at each point in time.
The following query will give you a table view of all the dimensions you have at a single point in time with this metric:

demo_api_request_duration_seconds_count{job="services"}

The results show that this metric is highly dimensional. In this example, you are going to use the sum function, look across all the previous dimensions for a five-minute period of time, and then look across all captured time series data:

sum( rate(demo_api_request_duration_seconds_count{job="services"}[5m]) )

The resulting graph shows the sum across highly dimensional metrics. You'll work through the entire list of aggregation functions, testing most of them in hands-on query examples using your demo installation to generate visualizations of your time series data collection. You'll finish up this lab by learning how to apply arithmetic, or math, to your queries. You'll sum it all up with a query that calculates per-mode CPU usage divided by the number of cores, to find a per-core usage value between 0 and 1. To make it more interesting, the metrics involved have mismatched dimensions, so you'll be telling the query to group by the one extra label dimension (a reference sketch of this query appears at the end of this article). Does it all sound like a complex problem you have no idea how to solve? Don't worry — this entire lab builds you up from the ground to reach this point. You will be running this query to solve this problem and will be ready for the next, more advanced query lab coming up!

Missed Previous Labs?

This is one lab in a more extensive free online workshop. Feel free to start from the very beginning of the workshop here if you missed anything previously. You can always proceed at your own pace and return any time you like as you work your way through this workshop. Just stop and later restart Prometheus to pick up where you left off.

Coming Up Next

I'll be taking you through the following lab in this workshop, where you'll continue learning about PromQL and dig deeper into advanced queries to gain more complex insights into your collected metrics. Stay tuned for more hands-on material to help you with your cloud-native observability journey.
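As promised above, here is a minimal sketch of the kind of final per-core query this lab builds toward. It assumes the demo services expose a demo_num_cpus gauge alongside demo_cpu_usage_seconds_total (check the metric names in your own demo installation):

sum by(instance, mode) (rate(demo_cpu_usage_seconds_total[1m])) / on(instance) group_left demo_num_cpus

The on(instance) group_left modifier is what handles the mismatched dimensions: it joins the many per-mode series on the left to the single per-instance core count on the right while keeping the extra mode label in the result.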
The Apollo Router is a powerful routing solution designed to replace the GraphQL Gateway. Built in Rust, it offers a high degree of flexibility, loose coupling, and exceptional performance. This self-hosted graph routing solution is highly configurable, making it an ideal choice for developers who require a high-performance routing system. With its ability to handle large amounts of traffic and complex data, the Apollo Router is quickly becoming a popular choice among developers seeking a reliable and efficient routing solution. The binary distribution of the Apollo Router comes equipped with a built-in telemetry plugin that functions as an OpenTelemetry agent. This plugin is responsible for sending traces and logs to the appropriate endpoints, making it an essential component of the performance monitoring process. By leveraging the capabilities of the telemetry plugin, developers can gain valuable insights into the performance of their applications, identify bottlenecks, and optimize their systems accordingly. With this integrated telemetry functionality, the Apollo Router provides a streamlined and efficient performance monitoring solution.

OpenTelemetry Agent and Collector

OpenTelemetry's agent model — here, the router's telemetry plugin — collects telemetry data, including metrics and traces, from a host. The agent runs on the host machine and provides a centralized reporting point, which is particularly useful when running multiple instances of the router. After the router's telemetry agent collects telemetry data from the host, it forwards that data to a collector: a separate program that receives and processes the telemetry data, making it easier to analyze and understand. The collector enables developers to process and send metrics to multiple destinations beyond their APM tool, providing flexibility and versatility in performance monitoring. It also gives developers a more comprehensive view of their application's performance, helps them identify potential issues, and lets them store data in a centralized location where it is easier to access and analyze.

Telemetry plugin configuration for the Apollo Router agent and the OpenTelemetry Collector configuration are sketched below.

Splunk APM (Application Performance Monitoring)

Splunk APM is a highly sophisticated tool designed for application performance monitoring and troubleshooting, especially for cloud-native and microservices-based applications. It is built on open source and OpenTelemetry instrumentation, which enables the collection of data from various programming languages and environments. With its advanced features, Splunk APM provides an efficient and reliable solution for monitoring application performance and identifying and resolving issues quickly. The OpenTelemetry Collector is the tool used to export data to your application performance monitoring (APM) tool. To set up a basic configuration for the OpenTelemetry Collector, you need both an OpenTelemetry Protocol (OTLP) receiver for the router and an exporter to forward the data to your APM tool.
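Since the original configuration listings do not survive here, below are minimal sketches of what these two pieces of configuration can look like. The exact telemetry keys vary across Apollo Router versions, and the endpoint, realm, and token values are placeholders to verify against the current documentation.

A router.yaml telemetry sketch pointing the router's OTLP trace exporter at a local collector:

YAML

# router.yaml — sketch; exact telemetry keys depend on your Apollo Router version
# (newer versions nest this under telemetry.exporters.tracing)
telemetry:
  tracing:
    otlp:
      # assumed address of your collector's OTLP gRPC receiver
      endpoint: http://otel-collector:4317

And a matching OpenTelemetry Collector configuration with an OTLP receiver and a Splunk APM (sapm) trace exporter, assuming the opentelemetry-collector-contrib distribution:

YAML

# otel-collector-config.yaml — sketch; sapm is the Splunk APM trace exporter
# available in the collector's contrib distribution
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  sapm:
    access_token: "${SPLUNK_ACCESS_TOKEN}"  # placeholder for your Splunk org token
    endpoint: https://ingest.us0.signalfx.com/v2/trace  # adjust the realm (us0) to your own
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [sapm]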
By using the OTLP receiver and exporter, you can easily configure the OpenTelemetry Collector to collect and transmit data from various sources to your APM tool, making it an essential component of effective application monitoring and troubleshooting. In conclusion, telemetry plays a crucial role in performance monitoring and optimization. The Apollo Router comes with a built-in telemetry plugin that functions as an OpenTelemetry agent, which allows developers to gain valuable insights into their application's performance. OpenTelemetry also offers an agent-and-collector model for collecting telemetry data from a host and sending it to a centralized reporting system. Additionally, the OpenTelemetry Collector is an essential tool for exporting data to an APM tool such as Splunk APM, which provides a sophisticated solution for monitoring application performance and identifying and resolving issues quickly. By leveraging these tools, developers can optimize their system's performance, enhance user experience, and ultimately achieve their business objectives.
This article explains why deep data observability is different from shallow: deep data observability is truly comprehensive in terms of data sources, data formats, data granularity, validator configuration, cadence, and user focus.

The Need for "Deep" Data Observability

2022 was the year when data observability really took off as a category (as opposed to old-school "data quality tools"), with official Gartner terminology for the space. Similarly, Matt Turck consolidated the data quality and data observability categories in the 2023 MAD Landscape analysis. Nevertheless, the industry is nowhere near fully formed. In his 2023 report titled "Data Observability — the Rise of the Data Guardians," Oyvind Bjerke of MMC Ventures describes the space as having massive amounts of untapped potential for further innovation. With the backdrop of this dynamic space, we define data observability as: the degree to which an organization has visibility into its data pipelines. A high degree of data observability enables data teams to improve data quality. However, not all data observability platforms — i.e., tools specifically designed to help organizations reach data observability — are created equal. The tools differ in the degree of data observability they can help data-driven teams achieve. We thus distinguish between deep data observability and shallow data observability. They differ along the following dimensions: data sources, data formats, data granularity, validator configuration, validator cadence, and user focus. In the rest of this article, we dive into the six dimensions that distinguish "deep" data observability from "shallow" data observability.

The six dimensions used to distinguish shallow data observability from deep data observability

The Six Pillars of Deep Data Observability

1. Data Sources: Truly End-to-End

Shallow data observability solutions tend to focus only on the data warehouse through SQL queries. Deep data observability solutions, on the other hand, provide data teams with equal degrees of observability across data streams, data lakes, and data warehouses. There are two reasons why this is important. First, data does not magically appear in the data warehouse; it often comes through a streaming source and lands in a data lake before it gets pushed to the warehouse. Bad data can appear anywhere along the way, and you want to identify issues as soon as possible and pinpoint their origin. Second, in an increasing number of data use cases, such as machine learning and automated decision-making, data never touches the data warehouse. For a data observability tool to be proactive and future-proof, it needs to be truly end-to-end, in lakes and streams as well.

2. Data Formats: Structured and Semi-Structured

Data streams and lakes segue nicely into the next dimension: data formats. Shallow data observability is focused on the data warehouse, meaning it provides observability only for structured data. However, to reach a high degree of data observability end-to-end in your data stack, the data observability solution must support data formats that are common in data streams and lakes (and, increasingly, warehouses). With deep data observability, data teams can obtain high-quality data by monitoring data quality not only in structured datasets but also for semi-structured data in nested formats, e.g., JSON blobs.
3. Data Granularity: Univariate and Multivariate Validation of Individual Data Points and Aggregate Data

Shallow data observability originally rose to fame based on analyzing one-dimensional (univariate) statistics about aggregate data (e.g., metadata), for example, looking at the average number of null values in one column. However, countless cases of bad data have shown that data teams need to validate not only summary statistics and distributions but also individual data points. In addition, they need to look at dependencies (multivariate) between fields (or columns), not just individual fields — real-world data comes with dependencies, so most data quality problems are multivariate in nature. Deep data observability helps data-driven teams do exactly this: univariate and multivariate validation of individual data points and aggregated data. Let's look at an example of when multivariate validation is needed. The dataset below is segmented on both country and product_type (multiple variables, not just one), which is necessary in order to validate each individual subsegment (set of records). Each subsegment will likely have unique volume, freshness, anomalies, and distribution, which means it must be validated individually. Say this dataset tracks all transaction data from an e-commerce business. Each country is likely to display individual purchasing behaviors, which means countries need to be validated individually. Double-clicking one more time, we might also find that within each country, each product_type is subject to different purchasing behaviors too. Thus, we need to segment on both columns to truly validate the data. Segmentation based on multiple variables is an example of the multivariate validation provided in a deep data observability platform.

4. Validator Configuration: Automatically Suggested as Well as Manually Configured

Depending on your organization, you might be looking for various degrees of scalability in your data systems. For example, suppose you're looking for a "set it and forget it" type of solution that alerts you whenever something unexpected happens; then shallow data observability is what you're after. In that case, you will get a bird's-eye view of, e.g., all tables in your data warehouse and whether they behave as expected. Conversely, your business might have unique business logic or custom validation rules you'll want to set up. The degree to which you can do this custom setup in a scalable way determines the degree to which you have deep data observability. If each custom rule requires a data engineer to write SQL, you're looking at a not-so-scalable setup, and it will be very challenging to reach deep data observability. Instead, if you have a quick-to-implement menu of validators that can be combined in a tailored way to suit your business, then deep data observability is within reach. Setting up custom validators should not be reserved for code-savvy data team members only.

5. Multi-Cadence Validation: As Frequently as Needed, Including Real Time

Again, depending on your business needs, you might have different requirements for data observability on various time horizons. For example, suppose you use a standard type of setup where data is loaded into your warehouse daily. In that case, shallow data observability, which only supports a standard daily cadence, fulfills your needs.
Instead, if your data infrastructure is more complex, with some sources updated in real time, some daily, and others less frequently, you will need support for validating data at all of these cadences. This multi-cadence need is especially true for companies relying on any kind of data for rapid decision-making or real-time product features, e.g., dynamic pricing, IoT applications, or retail businesses that rely heavily on digital marketing. A deep data observability platform has full support for validating data across all these use cases. It ensures that you get insights into your data at the right time for your business context, and it means you can act on bad data right when it occurs, before it hits your downstream use cases.

6. User Focus: Both Technical and Non-Technical

Data quality is an inherently cross-functional problem, which is part of the reason it can be so challenging to solve. For example, the person who knows what "good" data looks like in a CRM dataset might be a salesperson with their boots on the ground in sales calls, while the person who moves (or ingests) data from the CRM system into the data warehouse might have no insight into this at all and might naturally be more concerned with whether the data pipelines ran as scheduled. Shallow data observability solutions primarily cater to a single user group. They either focus on the data engineer, who cares mostly about the nuts and bolts of the pipelines and whether the system scales, or they focus on business users, who might care mostly about dashboards and summary statistics. Deep data observability is obtained when both types of users are kept in mind. In practice, this means providing multiple modes of controlling a data observability platform: through a command-line interface and through a graphical user interface. It might also entail multiple access levels and privileges. In this way, all users can collaborate on configuring data validation and obtain a high degree of visibility into data pipelines. This, in turn, effectively democratizes data quality within the whole business.

What's Next?

We've now covered the six dimensions differentiating shallow and deep data observability. We hope this gives you a framework to rely on when evaluating your business needs for data quality and data observability tooling.
The main purpose of this article and use case is to scrape AWS CloudWatch metrics into Prometheus time series and to visualize the metrics data in Grafana. Prometheus and Grafana are powerful, robust open-source tools for monitoring, collecting, and visualizing the performance metrics of applications deployed in production. Beyond collecting metrics, these tools give greater visibility: we can set up critical alerts, live views, and custom dashboards. CloudWatch Exporter is an open-source tool that captures metrics as defined in a YAML configuration file.

Architecture

The CloudWatch Exporter collects the configured metrics from AWS CloudWatch every 15 seconds (by default) and exposes them as key/value pairs in the '/metrics' API response. The CloudWatch Exporter's /metrics endpoint should then be added to the Prometheus configuration as a scrape job. Prometheus allows us to define the scraping frequency, so we can adjust the frequency of calls to CloudWatch to tune the cost.

Setup Instructions

AWS Access Setup

Set up a new user under IAM Users.
Assign CloudWatchReadOnly permissions to that user.
Generate an access key and secret access key for that user. Guard the secret with your life, as you cannot see it ever again! Save it somewhere safe where you can find it.

Grafana Setup

Install Grafana (Mac): brew install grafana
Start Grafana: brew services start grafana
Access Grafana
Add Prometheus as a data source (a provisioning file sketch appears at the end of this article if you prefer configuration over the UI)

AWS CloudWatch Exporter

How to run a CloudWatch Exporter locally. First, establish an AWS session for the exporter (passing in the key and secret can be done more elegantly):

$ aws configure
AWS Access Key ID [********************]: enter_your_access_key_here
AWS Secret Access Key [********************]: enter_your_secret_key_here
Default region name [eu-west-1]:
Default output format [None]:

Running the exporter (AWS Exporter GitHub):

cd /Users/jayam000/cw/prometheus-2.36.1.linux-amd64/cloudwatchexporters
java -jar cloudwatch_exporter-0.6.0-jar-with-dependencies.jar 1234 cloudwatchmonconfig.yml

Below is a sample YAML file to capture the request counts on AWS. The original file used can be found as an attachment.
YAML

---
region: eu-west-1
metrics:
  - aws_namespace: AWS/ELB
    aws_metric_name: HealthyHostCount
    aws_dimensions: [AvailabilityZone, LoadBalancerName]
    aws_statistics: [Average]
  - aws_namespace: AWS/ELB
    aws_metric_name: UnHealthyHostCount
    aws_dimensions: [AvailabilityZone, LoadBalancerName]
    aws_statistics: [Average]
  - aws_namespace: AWS/ELB
    aws_metric_name: RequestCount
    aws_dimensions: [AvailabilityZone, LoadBalancerName]
    aws_statistics: [Sum]
  - aws_namespace: AWS/ELB
    aws_metric_name: Latency
    aws_dimensions: [AvailabilityZone, LoadBalancerName]
    aws_statistics: [Average]
  - aws_namespace: AWS/ELB
    aws_metric_name: SurgeQueueLength
    aws_dimensions: [AvailabilityZone, LoadBalancerName]
    aws_statistics: [Maximum, Sum]
  - aws_namespace: AWS/ElastiCache
    aws_metric_name: CPUUtilization
    aws_dimensions: [CacheClusterId]
    aws_statistics: [Average]
  - aws_namespace: AWS/ElastiCache
    aws_metric_name: NetworkBytesIn
    aws_dimensions: [CacheClusterId]
    aws_statistics: [Average]
  - aws_namespace: AWS/ElastiCache
    aws_metric_name: NetworkBytesOut
    aws_dimensions: [CacheClusterId]
    aws_statistics: [Average]
  - aws_namespace: AWS/ElastiCache
    aws_metric_name: FreeableMemory
    aws_dimensions: [CacheClusterId]
    aws_statistics: [Average]

Success! You should now be able to access the CloudWatch metrics here.

Prometheus (Using Docker)

Command to run Prometheus via Docker:

docker run -p 9090:9090 -v /Users/jayam000/cw/prometheus-2.36.1.linux-amd64/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus

To add the AWS CloudWatch metrics scrape config into Prometheus, edit prometheus.yml to include the following configuration:

- job_name: "cloudwatch"
  static_configs:
    - targets: ["host.docker.internal:1234"]

Prometheus scrape configuration file:

YAML

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "cloudwatch"
    static_configs:
      - targets: ["host.docker.internal:1234"]
    # scrape_interval: 3m
    # scrape_timeout: 30s

Note: Had to change from localhost:1234 to host.docker.internal:1234 due to a bizarre connectivity issue; after the change, Prometheus started collecting metrics. Default scrape interval: 15 seconds. Access Prometheus here. Click on Graph and select a specific parameter to view the time series corresponding to that attribute. Click on Status/Targets to check the health of the data sources added.

Screenshot of Prometheus showing the configured targets, in this case, the CloudWatch Exporter

Prometheus graph showing the number of CloudWatch requests. This data is similar to the data displayed in the Grafana dashboard.

Grafana Dashboard

Grafana dashboard of CloudWatch metrics from Prometheus.

Conclusion

The given sample extracts availability metrics from AWS.
It does not extract application metrics or logs. This sample did not require any custom code: adding a new data source into Prometheus, and incorporating new metrics from CloudWatch, required only a YAML config update and a service restart. Feel free to reach out should you have any questions about the setup, and we would be happy to assist you.
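As referenced in the Grafana setup above, the Prometheus data source can also be provisioned from a file instead of being added through the UI. A minimal sketch, assuming the Dockerized Prometheus above is reachable at localhost:9090 (place the file in your Grafana install's provisioning/datasources directory):

YAML

# datasources.yaml — Grafana data source provisioning sketch
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090  # assumed Prometheus address; adjust to your setup
    isDefault: true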
If you're tired of managing your infrastructure manually, ArgoCD is the perfect tool to streamline your processes and ensure your services are always in sync with your source code. With ArgoCD, any changes made to your version control system will automatically be synced to your organization's dedicated environments, making centralization a breeze. Say goodbye to the headaches of manual infrastructure management and hello to a more efficient and scalable approach with ArgoCD! This post will teach you how to easily install and manage infrastructure services like Prometheus and Grafana with ArgoCD. Our step-by-step guide makes it simple to automate your deployment processes and keep your infrastructure up to date. We will explore the following:

Installation of ArgoCD via Helm
Install Prometheus via ArgoCD
Install Grafana via ArgoCD
Import Grafana dashboard
Import ArgoCD metrics
Fire up an alert

Prerequisites:

Helm
Kubectl
Docker for Mac with Kubernetes

Installation of ArgoCD via Helm

To install ArgoCD via Helm on a Kubernetes cluster, you need to add the ArgoCD Helm chart repository, update it, install the ArgoCD Helm chart using the Helm CLI, and finally verify that ArgoCD is running by checking the status of its pods:

# Create a namespace
kubectl create namespace argocd
# Add the ArgoCD Helm chart
helm repo add argo https://argoproj.github.io/argo-helm
# Install ArgoCD
helm upgrade -i argocd --namespace argocd --set redis.exporter.enabled=true --set redis.metrics.enabled=true --set server.metrics.enabled=true --set controller.metrics.enabled=true argo/argo-cd
# Check the status of the pods
kubectl get pods -n argocd

When installing ArgoCD, we enabled two flags that expose two sets of ArgoCD metrics:

Application metrics: controller.metrics.enabled=true
API server metrics: server.metrics.enabled=true

To access the installed ArgoCD, you will need to obtain its credentials:

Username: admin
Password:

kubectl -n argocd \
  get secret \
  argocd-initial-admin-secret \
  -o jsonpath="{.data.password}" | base64 -d

Link a Primary Repository to ArgoCD

ArgoCD uses the Application CRD to manage and deploy applications. When you create an Application CRD, you specify the following:

Source — a reference to the desired state in Git
Destination — a reference to the target cluster and namespace

ArgoCD uses this information to continuously monitor the Git repository for changes and deploy them to the target environment. Let's put it into action by applying the changes:

cat <<EOF | kubectl apply -f -
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: workshop
  namespace: argocd
spec:
  destination:
    namespace: argocd
    server: https://kubernetes.default.svc
  project: default
  source:
    path: argoCD/
    repoURL: https://github.com/naturalett/continuous-delivery
    targetRevision: main
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
EOF

Let's access the server UI by using kubectl port forwarding:

kubectl port-forward service/argocd-server -n argocd 8080:443

Connect to ArgoCD.

Install Prometheus via ArgoCD

By installing Prometheus, you will be able to leverage the full stack and take advantage of its features. When you install the full stack, you will get access to Prometheus, the Grafana dashboard, and more.
In our demo, we will deploy the following services from the Kube Prometheus Stack:

Prometheus
Grafana
Alertmanager

The node-exporter we will add separately with its own Helm chart, while deactivating the pre-installed default that comes with the Kube Prometheus Stack. There are two ways to deploy Prometheus:

Option 1: By applying the CRD
Option 2: By using automatic deployment based on the kustomization

In this blog, the installation of Prometheus happens automatically, which means Option 2 is applied.

Option 1 — Apply the CRD

cat <<EOF | kubectl apply -f -
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus
  namespace: argocd
spec:
  destination:
    name: in-cluster
    namespace: argocd
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    targetRevision: 45.6.0
    chart: kube-prometheus-stack
EOF

Option 2 — Define the Installation Declaratively

This option has already been applied through the CRD we deployed earlier, in the step of linking a primary repository to ArgoCD. That CRD is responsible for syncing our application.yaml files with the configuration specified in the kustomization. Once Prometheus is deployed, it exposes its metrics at /metrics. To display these metrics in Grafana, we need to define a Prometheus data source. In addition, we have further metrics that we want to display in Grafana, so we'll need to scrape them in Prometheus (see the ServiceMonitor sketch at the end of this section).

Access the Prometheus Server UI

Let's access Prometheus by using kubectl port forwarding:

kubectl port-forward service/kube-prometheus-stack-prometheus -n argocd 9090:9090

Connect to Prometheus.

Prometheus Node Exporter

For the installation of the node exporter, we utilized the same declarative approach as in Option 2: the installation happens automatically once we link the primary repository to ArgoCD, with the node exporter's application configuration specified declaratively.

Prometheus Operator CRDs

Due to an issue with the Prometheus Operator Custom Resource Definitions, we decided to deploy the CRDs separately. The installation is automatic, similar to Option 2, relying on the primary repository linked to ArgoCD in an earlier step.

Install Grafana via ArgoCD

We used the same declarative approach as Option 2 to define the installation of Grafana. Since the Grafana installation is part of the Prometheus stack, it was installed automatically when the stack was installed. To access the installed Grafana, you will need to obtain its credentials:

Username: admin
Password:

kubectl get secret \
  -n argocd \
  kube-prometheus-stack-grafana \
  -o jsonpath="{.data.admin-password}" | base64 --decode ; echo

Let's access Grafana by using kubectl port forwarding:

kubectl port-forward service/kube-prometheus-stack-grafana -n argocd 9092:80

Connect to Grafana.

Importing the ArgoCD Metrics Dashboard Into Grafana

We generated a ConfigMap for the ArgoCD dashboard and deployed it through the kustomization. During the deployment of Grafana, we linked the ConfigMap to create the dashboard and then leveraged Prometheus to pull in the ArgoCD metrics data, gaining valuable insights into its performance.
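For Prometheus to actually scrape the ArgoCD metrics services enabled above, the Prometheus Operator needs a ServiceMonitor pointing at them. Below is a minimal sketch for the API server metrics; the label selectors and port name are assumptions, so check the labels on the argocd-server-metrics service in your cluster and your Prometheus serviceMonitorSelector:

cat <<EOF | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argocd-server-metrics
  namespace: argocd
  labels:
    release: argocd  # assumed; must match your Prometheus serviceMonitorSelector, if set
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-server-metrics  # assumed; verify on the service
  endpoints:
    - port: metrics  # assumed port name on the metrics service
EOF

An analogous ServiceMonitor can be created for the argocd-application-controller-metrics service; the argo-cd Helm chart can also generate these for you through its metrics.serviceMonitor values, if you prefer.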
The ArgoCD dashboard's metrics were made available by the flags we set in an earlier section of this post:

--set server.metrics.enabled=true
--set controller.metrics.enabled=true

This enabled us to view and monitor the metrics easily through the dashboard. Confirm the ArgoCD metrics:

# Verify that the services exist
kubectl get service -n argocd argocd-application-controller-metrics
kubectl get service -n argocd argocd-server-metrics

# Configure port forwarding to monitor the application metrics
kubectl port-forward service/argocd-application-controller-metrics -n argocd 8082:8082

# Check the application metrics in the browser
http://localhost:8082/metrics

# Configure port forwarding to monitor the API server metrics
kubectl port-forward service/argocd-server-metrics -n argocd 8083:8083

# Check the API server metrics in the browser
http://localhost:8083/metrics

Fire up an Alert

Execute the following script to trigger an alert:

curl -LO https://raw.githubusercontent.com/naturalett/continuous-delivery/main/trigger_alert.sh
chmod +x trigger_alert.sh
./trigger_alert.sh

Let's access the Alert Manager:

kubectl port-forward service/alertmanager-operated -n argocd 9093:9093

Connect to the Alert Manager and confirm that the workshop alert has been triggered.

Clean up the Environment

Deleting the workshop ApplicationSet removes all the dependencies that were installed as per the defined kustomization. Delete the ArgoCD installation and any associated dependencies:

kubectl delete crd alertmanagerconfigs.monitoring.coreos.com
kubectl delete crd alertmanagers.monitoring.coreos.com
kubectl delete crd podmonitors.monitoring.coreos.com
kubectl delete crd probes.monitoring.coreos.com
kubectl delete crd prometheuses.monitoring.coreos.com
kubectl delete crd prometheusrules.monitoring.coreos.com
kubectl delete crd servicemonitors.monitoring.coreos.com
kubectl delete crd thanosrulers.monitoring.coreos.com
kubectl delete crd applications.argoproj.io
kubectl delete crd applicationsets.argoproj.io
kubectl delete crd appprojects.argoproj.io
helm uninstall -n argocd argocd

Summary

Through this walkthrough, we have developed proficiency in automating infrastructure management and keeping our environment in sync with changes to our source code. Specifically, we have learned how to deploy ArgoCD and use its ApplicationSet to deploy a Prometheus stack. We have also demonstrated how to expose service metrics to Prometheus, visualize them in Grafana, and trigger alerts in our monitoring system. For continued learning and access to valuable resources, we encourage you to explore our tutorial examples on GitHub.
Our approach to scalability has gone through a tectonic shift over the past decade. Technologies that were staples in every enterprise back end (e.g., IIOP) have vanished completely with a shift to approaches such as eventual consistency. This shift introduced some complexities with the benefit of greater scalability. The rise of Kubernetes and serverless further cemented this approach: spinning up a new container is cheap, turning scalability into a relatively simple problem. Orchestration changed our approach to scalability and facilitated the growth of microservices and observability, two key tools in modern scaling.

Horizontal vs. Vertical Scaling

The rise of Kubernetes correlates with the microservices trend, as seen in Figure 1. Kubernetes heavily emphasizes horizontal scaling, in which replication of servers provides scaling, as opposed to vertical scaling, in which we derive performance and throughput from a single host (many machines vs. few powerful machines).

Figure 1: Google Trends chart showing the correlation between Kubernetes and microservices (Data source: Google Trends)

In order to maximize horizontal scaling, companies focus on the idempotency and statelessness of their services. This is easier to accomplish with smaller, isolated services, but the complexity shifts in two directions:

- Ops – Managing the complex relations between multiple disconnected services
- Dev – Quality, uniformity, and consistency become an issue

Complexity doesn't go away with a switch to horizontal scaling. It shifts to a distinct form handled by a different team, such as network complexity instead of object graph complexity. The consensus of starting with a monolith isn't just about the ease of programming. Horizontal scaling is deceptively simple thanks to Kubernetes and serverless. However, this masks a level of complexity that is often harder to gauge for smaller projects. Scaling is a process, not a single operation; processes take time and require a team. A good analogy is physical traffic: we often reach a slow junction and wonder why the city didn't build an overpass. The reason could be that an overpass would ease the jam at the current junction but create a much bigger traffic jam down the road. The same is true for scaling a system — all of our planning might make matters worse, as a faster server can overload a node in another system. Scalability is not performance!

Scalability vs. Performance

Scalability and performance can be closely related, in which case improving one can also improve the other. However, in other cases, there may be trade-offs between scalability and performance. For example, a system optimized for performance may be less scalable because it may require more resources to handle additional users or requests. Meanwhile, a system optimized for scalability may sacrifice some performance to ensure that it can handle a growing workload. To strike a balance between scalability and performance, it's essential to understand the requirements of the system and the expected workload. For example, if we expect a system to have few users, performance may be more critical than scalability. However, if we expect a rapidly growing user base, scalability may be more important than performance. We see this expressed perfectly in the trend towards horizontal scaling.
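To ground the discussion, horizontal scaling in Kubernetes is typically expressed through replica counts and autoscalers. The following HorizontalPodAutoscaler manifest is a generic sketch; the deployment name and thresholds are illustrative, not taken from this article:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout             # hypothetical service
  minReplicas: 2
  maxReplicas: 50              # scale out, not up: more small pods
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add replicas past 70% average CPU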
Modern Kubernetes systems usually focus on many small VM images with a limited number of cores, as opposed to powerful machines/VMs. A system focused on performance would deliver better results using a few high-performance machines.

Challenges of Horizontal Scale

Horizontal scaling brought with it a unique class of problems that birthed new fields in our industry: platform engineers and SREs are prime examples. The complexity of maintaining a system with thousands of concurrent server processes is immense. Such a scale makes it much harder to debug and isolate issues, and the asynchronous nature of these systems exacerbates the problem. Eventual consistency creates situations we can't realistically replicate locally, as we see in Figure 2. When a change needs to occur across multiple microservices, an intermediate inconsistent state can arise, which can lead to invalid states.

Figure 2: Inconsistent state may exist between wide-sweeping changes

Typical solutions used for debugging dozens of instances don't apply when we have thousands of instances running concurrently. Failure is inevitable, and at these scales, it usually amounts to restarting an instance. On the surface, orchestration solved the problem, but the overhead and resulting edge cases make fixing such problems even harder.

Strategies for Success

We can answer such challenges with a combination of approaches and tools. There is no "one size fits all," and it is important to practice agility when dealing with scaling issues. We need to measure the impact of every decision and tool, then form decisions based on the results. Observability serves a crucial role in measuring success. In the world of microservices, there's no way to measure the success of scaling without such tooling. Observability tools also serve as a benchmark to pinpoint scalability bottlenecks, as we will cover soon enough.

Vertically Integrated Teams

Over the years, developers have tended to silo themselves based on expertise, and as a result, we formed teams to suit these silos. This is problematic. An engineer making a decision that might affect resource consumption or involve such a tradeoff needs to be educated about the production environment. When building a small system, we can afford to ignore such issues, but as scale grows, we need a heterogeneous team that can advise on such matters. By assembling a full-stack team that is feature-driven and small, the team can handle all the different tasks required. However, this isn't a balanced team. Typically, a DevOps engineer will work with multiple teams simply because there are far more developers than DevOps engineers. This is logistically challenging, but the division of work makes more sense this way. When a particular microservice fails, responsibilities are clear, and the team can respond swiftly.

Fail-Fast

One of the biggest pitfalls to scalability is the fail-safe approach. Code might fail subtly and run in a non-optimal form. A good example is code that tries to read a response from a website. In case of failure, we might return cached data to facilitate a fail-safe strategy. However, the delay still happens: we wait for the response until the timeout before falling back to the cache. It seems like everything is working correctly, but performance sits at the timeout boundary, delaying all processing. With asynchronous code, this is hard to notice and doesn't put an immediate toll on the system. Thus, such issues can go unnoticed.
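To make the difference concrete, here is a minimal JavaScript sketch, purely illustrative: the endpoint and the readFromCache helper are hypothetical, and the article itself doesn't prescribe an implementation. It contrasts a silent fail-safe fallback with a fail-fast timeout:

// Fail-safe: any failure silently serves cached data, but a hung request
// still waits out the platform's default timeout before falling back.
async function getQuoteFailSafe() {
  try {
    const res = await fetch('https://example.com/api/quote'); // may hang for many seconds
    return await res.json();
  } catch (e) {
    return readFromCache('quote'); // hypothetical cache helper
  }
}

// Fail-fast: abort after 500 ms so the slow dependency surfaces immediately;
// the caller sees the abort error instead of a silently degraded response.
async function getQuoteFailFast() {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 500);
  try {
    const res = await fetch('https://example.com/api/quote', { signal: controller.signal });
    return await res.json();
  } finally {
    clearTimeout(timer);
  }
}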
A request might succeed in the testing and staging environments yet always fall back to the fail-safe path in production. Failing fast offers several advantages in these scenarios:

- It makes bugs easier to spot in the testing phase. Failure is relatively easy to test, as opposed to durability.
- A failure triggers fallback behavior faster and prevents a cascading effect.
- Problems are easier to fix, as they are usually in the same isolated area as the failure.

API Gateway and Caching

Internal APIs can leverage an API gateway to provide smart load balancing, caching, and rate limiting. Typically, caching is the most universal performance tip one can give, but when it comes to scale, failing fast might be even more important. In typical cases of heavy load, the division between users is stark: by rate-limiting the heaviest users, we can dramatically shift the load on the system. Distributed caching is one of the hardest problems in programming. Implementing a caching policy across microservices is impractical; we need to cache at the individual service and use the API gateway to alleviate some of the overhead. Level 2 caching is used to store database data in RAM and avoid DB access. This is often a major performance benefit that tips the scales, but sometimes it has no impact at all. Stack Overflow recently discovered that database caching had no impact on their architecture because higher-level caches filled in the gaps and grabbed all the cache hits at the web layer. By the time a call reached the database layer, it was clear the data wasn't in cache. Thus, they always missed the cache, and it brought no benefit. Only overhead. This is where caching in the API gateway layer becomes immensely helpful. It is a system we can manage and control centrally, unlike the cache in an individual service, which might get polluted.

Observability

What we can't see, we can't fix or improve. Without a proper observability stack, we are blind to scaling problems and to the appropriate fixes. When discussing observability, we often make the mistake of focusing on tools. Observability isn't about tools; it's about questions and answers. When developing an observability stack, we need to understand the types of questions we will ask of it and then provide two means to answer each question. It is important to have two means: observability is often unreliable and misleading, so we need a way to verify its results. However, if we have more than two ways, it might mean we over-observe the system, which can have a serious impact on costs. A typical exercise to verify an observability stack is to hypothesize common problems and then find two ways to detect them. For example, for a performance problem in microservice X:

- Inspect the logs of the microservice for errors or latency (this might require adding a specific log for coverage).
- Inspect the Prometheus metrics for the service.

Tracking a scalability issue within a microservices deployment is much easier when working with traces. They provide context and scale. When an edge service runs into an N+1 query bug, traces reveal it almost immediately when they're properly integrated throughout the system.

Segregation

One of the most important scalability approaches is the separation of high-volume data. Modern business tools save tremendous amounts of metadata for every operation. Most of this data isn't applicable to the day-to-day operations of the application; it is metadata meant for business intelligence, monitoring, and accountability.
We can stream this data to remove the immediate need to process it, storing it in a separate time-series database to alleviate the scaling pressure on the primary database.

Conclusion

Scaling in the age of serverless and microservices is a very different process than it was a mere decade ago. Controlling costs has become far harder, especially observability costs, which in the case of logs often exceed 30 percent of the total cloud bill. The good news is that we have many new tools at our disposal, including API gateways, observability stacks, and much more. By leveraging these tools with a fail-fast strategy and tight observability, we can iteratively scale the deployment. This is key, as scaling is a process, not a single action. Tools can only go so far, and we can easily overuse them. In order to grow, we need to review and even eliminate optimizations that are no longer applicable.
In recent years, the term MLOps has become a buzzword in the world of AI, often discussed in the context of tools and technology. However, while much attention is given to the technical aspects of MLOps, what's often overlooked is the importance of the operations themselves. There is little discussion of the operations needed for machine learning (ML) in production, and monitoring specifically. Things like accountability for AI performance, timely alerts for relevant stakeholders, and the establishment of the processes needed to resolve issues are often set aside in favor of discussions about specific tools and tech stacks. ML teams have traditionally been research-oriented, focusing heavily on training models to achieve high testing scores. However, once the model is ready to be deployed in real business processes and applications, the culture around establishing production-oriented operations is lacking. As a consequence, there is a lack of clarity regarding who is responsible for the models' outcomes and performance. Without the right operations in place, even the most advanced tools and technology won't be enough to ensure healthy governance for your AI-driven processes.

1. Cultivate a Culture of Accountability

As previously stated, data science and ML teams have traditionally been research-oriented, measured on model evaluation scores rather than on real-world, business-related outcomes. In such an environment, there is no way monitoring will be done correctly because, frankly, no one cares sufficiently. To fix this, the team responsible for building AI models must take ownership and feel accountable for the models' success or failure in serving the business function they were designed for. The best way to achieve this is by measuring the individual's and the team's performance based on production-oriented KPIs, and by creating an environment that fosters a sense of ownership over the model's overall performance rather than just its performance in controlled testing environments. While some team members may remain focused on research, it's important to recognize that achieving good test scores in experiments is not sufficient to ensure the model's success in production. The ultimate success of the model lies in its effectiveness in real-world business processes and applications.

2. Make a "Monitoring Plan" Part of Your Release Checklist

To ensure the ongoing success of an AI-driven application, planning how it is going to be monitored is a critical factor that should not be overlooked. In healthy engineering organizations, there is always a release checklist that entails setting up a monitoring plan whenever a new component is released. AI teams should follow that pattern. The person or team responsible for building a model must have a clear understanding of how it fits into the overall system and should be able to predict potential issues that could arise, as well as identify who needs to be alerted and what actions should be taken in the event of an issue. While some potential issues may be more research-oriented, such as data or concept drift, there are many other factors to consider, such as a broken feature pipeline or a third-party data provider changing input formats. Therefore, it is important to anticipate as many of these issues as possible and set up a plan to deal with them effectively should they arise.
Although it's very likely that some issues will remain unforeseen, it's still better to do something rather than nothing, and typically, the first 80% of issues can be anticipated with 20% of the work.

3. Establish an On-Call Rotation

Sharing the responsibility among team members may be necessary or helpful, depending on the size of your team and the number of models or systems under your control. By setting up an "on-call" rotation, everyone can have peace of mind knowing that there is at least one knowledgeable person available to handle any issue the moment it arises. It's important to note that taking care of an issue doesn't necessarily mean solving the problem immediately. Sometimes, it might mean triaging and deferring it to a later time, or waking up the person best equipped to solve the problem. Sharing an on-call rotation with pre-existing engineering teams can also be an option in some instances. However, this is use-case dependent and may not be possible for every team. Regardless of the approach, it is imperative to establish a shared knowledge base that the person on call can use, so that your team is well-prepared to take care of emerging issues.

4. Set Up a Shared Knowledge Base

To maintain healthy monitoring operations, it is essential to have accessible resources that detail how your system works and its main components. This is where wikis and playbooks come in. Wikis can provide a central location for documentation on your system, including its architecture, data sources, and model dependencies. Playbooks can be used to document specific procedures for handling common issues or incidents that may arise. Having these resources in place helps facilitate knowledge sharing and ensures that everyone on the team is equipped to troubleshoot and resolve issues quickly. It also allows for smoother onboarding of new team members, who can quickly get up to speed on the system. In addition, well-documented procedures and protocols can help reduce downtime and improve response times when issues occur.

5. Implement Post-Mortems

Monitoring is an iterative process; it is impossible to predict everything that might go wrong in advance. But when an issue does occur and goes undetected or unresolved for too long, it is important to conduct a thorough analysis and identify the root cause. Once the root cause is understood, the monitoring plan can be amended and improved accordingly. Post-mortems also help build a culture of accountability which, as discussed earlier, is the key factor in successful monitoring operations.

6. Get the Right Tools for Effective Monitoring

Once you have established the need for healthy monitoring operations and addressed any cultural considerations, the next critical step is to equip your team members with the appropriate tools, empowering them to be accountable for the model's performance in the business function it serves. This means implementing tools that enable timely alerts for issues (which is difficult because issues typically start small and hidden), along with capabilities for root cause analysis and troubleshooting. Integrations with your existing tools, such as ticketing systems, as well as issue tracking and management capabilities, are also essential for seamless coordination and collaboration among team members.
Investing in the right tools will empower your team to take full ownership and accountability, ultimately leading to better outcomes for the business.

Conclusion

By following these guidelines, you can be confident that your AI team will be set up for successful production-oriented operations. Monitoring is a crucial aspect of MLOps, involving accountability, timely alerts, troubleshooting, and much more. Taking the time to set up healthy monitoring practices leads to continuous improvement.
As with back-end development, observability is becoming increasingly crucial in front-end development, especially when it comes to troubleshooting. For example, imagine a simple e-commerce application that includes a mobile app, web server, and database. If a user reports that the app is freezing while attempting to make a purchase, it can be challenging to determine the root cause of the problem. That's where OpenTelemetry comes in. This article will dive into how front-end developers can leverage OpenTelemetry to improve observability and efficiently troubleshoot issues like this one.

Why Front-End Troubleshooting?

Just like in back-end development, troubleshooting is a crucial aspect of front-end development. Consider a straightforward e-commerce setup that includes a mobile app, a web server, and a database, and suppose a user reports that the app is freezing while they attempt to purchase a dark-themed mechanical keyboard. Without front-end tracing, we wouldn't have enough information about the problem, since it could be caused by many different factors: the front end or the back end, latency issues, etc. We can try collecting logs to get some insight, but it's challenging to correlate client-side and server-side logs. We might attempt to reproduce the issue from the mobile application, but that could be time-consuming, and impossible if the client-side conditions aren't available. And if the issue can't be reproduced, we need more information to identify the specific problem. This is where front-end tracing comes in handy: with its aid, we can stop making assumptions and instead gain clarity on where the issue lies.

Front-End Troubleshooting With Distributed Tracing

Tracing data is organized in spans, which represent individual operations like an HTTP request or a database query. By displaying spans in a tree-like structure, developers can gain a comprehensive, real-time view of their system, including the specific issue they are examining. This allows them to investigate further and identify the cause of the problem, such as bottlenecks or latency issues. Tracing can be a valuable tool for pinpointing the root cause of an issue. The example below displays three simple components: a front end, a back end, and a database. When there is an issue, the trace encompasses spans from both the front-end app and the back-end service. By reviewing the trace, it's possible to identify the data that was transmitted between the components, allowing developers to follow the path from a specific user click in the front end to the DB query. Rather than relying on guesswork to identify the issue, with tracing you have a visual representation of it. For example, you can determine whether the request was sent out from the device, whether the back end responded, whether certain components were missing from the response, and other factors that may have caused the app to become unresponsive. Suppose we need to determine whether a delay caused the problem. Helios has functionality that displays each span's duration. With it, you can simply analyze the trace to pinpoint the bottleneck. In addition, each span in the trace is timestamped, allowing you to see exactly when each action took place and whether there were any delays in processing the request. Helios comes with a span explorer that was created explicitly for this purpose.
The explorer enables sorting spans by duration or timestamp. The trace visualization shows the time taken by each operation, which helps identify areas that require optimization. The default view available in Jaeger, which displays a trace breakdown, is also an effective way to explore bottlenecks.

Adding Front-End Instrumentation to Your Traces in OpenTelemetry: Advanced Use Cases

It's advised to include front-end instrumentation in your traces to enhance your ability to analyze bottlenecks. While many of the SDKs provided by OpenTelemetry are designed for back-end services, OpenTelemetry has also developed an SDK for JavaScript, and more client libraries are planned for the future. Below, we will look at how to integrate these libraries.

Aggregating Traces

Aggregating multiple traces from different requests into one large trace can be useful for analyzing a flow as a whole. For instance, imagine a purchasing process that involves three REST requests: validating the user, billing the user, and updating the database. To see this flow as a single trace covering all three requests, developers can create a custom span that encapsulates all three into one flow. This can be achieved with code like the example below:

const { createCustomSpan } = require('@heliosphere/web-sdk');

const purchaseFunction = () => {
  validateUser(user.id);
  chargeUser(user.cardToken);
  updateDB(user.id);
};

createCustomSpan('purchase', { id: purchase.id }, purchaseFunction);

From now on, the trace will include all the spans generated under validateUser, chargeUser, and updateDB. This lets us see the entire flow as a single trace rather than a separate one for each request.

Adding Span Events

Adding information about particular events can be beneficial when investigating and analyzing front-end bottlenecks. With OpenTelemetry, developers can use a feature called span events, which allows them to report an event and associate it with a specific span. A span event is a message on a span that describes a specific event with no duration, identified by a single timestamp. It can be seen as a basic log and looks like this:

const activeSpan = opentelemetry.trace.getActiveSpan();
activeSpan.addEvent('User clicked Purchase button');

Span events can gather various data, such as clicks, device events, networking events, and so on.

Adding Baggage

Baggage is a useful OpenTelemetry feature that allows adding contextual information to traces. This information is propagated across all spans in a trace and is helpful for transferring user data, such as user identification, preferences, and Stripe tokens, among other things. This feature is especially valuable in front-end development, where user data is a crucial element (a minimal sketch follows at the end of this section).

Deploying Front-End Instrumentation

Deploying the instrumentation added to your traces is straightforward, just like deploying any other OpenTelemetry SDK. Additionally, you can use Helios's SDK to visualize and gain more insights without setting up your own infrastructure. To do this, simply visit the Helios website, register, and follow the steps to install the SDK and add the code snippet to your application.
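Before wrapping up, here is the promised Baggage sketch. It is a minimal example built on the @opentelemetry/api package; the keys, values, and purchaseFunction are illustrative assumptions, not Helios-specific code:

const { context, propagation } = require('@opentelemetry/api');

// Create baggage carrying user context (keys and values are illustrative).
const baggage = propagation.createBaggage({
  'user.id': { value: '12345' },
  'user.plan': { value: 'premium' },
});

// Run the purchase flow with the baggage attached to the active context
// so downstream spans and propagated requests can read it.
context.with(propagation.setBaggage(context.active(), baggage), () => {
  purchaseFunction(); // hypothetical flow from the earlier example
});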
Where to Go From Here: Next Steps for Front-End Developers

Enabling front-end instrumentation is a simple process that unlocks a plethora of new troubleshooting capabilities for full-stack and front-end developers. It allows you to map out a transaction, starting from a UI click and leading up to a specific database query or scheduled job, providing unique insights for bottleneck identification and issue analysis. Both OpenTelemetry and Helios support front-end instrumentation, making it even more accessible for developers. Begin utilizing these tools today to enhance your development workflow.
Joana Carvalho
Performance Engineer,
Postman
Eric D. Schabell
Director Technical Marketing & Evangelism,
Chronosphere
Chris Ward
Zone Leader,
DZone
Ted Young
Director of Open Source Development,
LightStep