Great! All our components are up and running: the OpenTelemetry demo application and the Elastic Agent DaemonSet on the Kubernetes cluster on the Strigo machine, and our Elastic deployment on Elastic Cloud. Let's start exploring Kibana!
First up: the APM app in Kibana. The APM app allows you to monitor your software services and applications in real time: you can visualize detailed performance information on your services, identify and analyze errors, and monitor host-level and APM agent-specific metrics like JVM and Go runtime metrics.
Having access to application-level insights with just a few clicks can drastically decrease the time you spend debugging errors, slow response times, and crashes.
In Kibana, navigate to Observability > APM.
On the Services page, the inventory provides a quick, high-level overview of the health and general performance of all instrumented services. It shows the name, environment, latency, throughput and failed transaction rate of the services.
The time picker next to the search bar defines the time window within which APM data from instrumented services are available. So, if you are unable to find a service that you are expecting, make sure your time range in the time picker is properly set.
Let's take a closer look at the Service Map. Select Service Map next to Inventory.
The Service Map is a real-time visual representation of the instrumented services in your application’s architecture. It shows you how these services are connected. There are two types of shapes:
Circles: these are the instrumented services. The icon is based on the programming language of the service.
Diamonds: these are dependencies like databases, external services, and messaging queues. The icon represents the dependency type; if the entity is unknown, a generic icon is used. More on dependencies later.
The service map is generated automatically. If you select a service, some high-level metrics are shown, like the average transaction duration, requests per minute, and failed transactions per minute.
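To make "generated automatically" a bit more concrete, here is a minimal sketch of how a map like this could be derived from observed calls between services. The record layout and the build_service_map helper are hypothetical and exist only for illustration; this is not how Kibana is implemented.

```python
from collections import defaultdict

# Hypothetical sample of observed calls:
# (calling service, call target, is the target instrumented?)
CALLS = [
    ("frontend", "cartservice", True),
    ("frontend", "adservice", True),
    ("cartservice", "redis", False),
    ("frontend", "cartservice", True),
]


def build_service_map(calls):
    """Return (edges, shapes): caller -> sorted targets, and node -> shape."""
    edges = defaultdict(set)
    shapes = {}
    for caller, target, instrumented in calls:
        edges[caller].add(target)
        shapes[caller] = "circle"  # a caller is always an instrumented service
        if instrumented:
            shapes[target] = "circle"
        else:
            # Never downgrade a node already known to be instrumented.
            shapes.setdefault(target, "diamond")
    return {caller: sorted(targets) for caller, targets in edges.items()}, shapes


edges, shapes = build_service_map(CALLS)
for caller, targets in sorted(edges.items()):
    print(caller, "->", ", ".join(targets))
```

Repeated calls collapse into a single connection, and a target that is never seen as instrumented ends up as a diamond.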
Machine learning jobs can be created to calculate anomaly scores on APM transaction durations within the selected service. When these jobs are active, service maps will display a color-coded anomaly indicator based on the detected anomaly score:
Green: max anomaly score ≤25. Service is healthy.
Yellow: max anomaly score 26-74. Anomalous activity detected. Service may be degraded.
Red: max anomaly score ≥75. Anomalous activity detected. Service is unhealthy.
Because it takes some time for the anomaly detection to start showing results (it needs historical data), we won't enable this during this workshop.
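Even without the jobs enabled, the color thresholds listed above are easy to express in code. The following is a small sketch; the anomaly_color function is a hypothetical helper, and treating fractional scores between 25 and 26 (or 74 and 75) as yellow is an assumption.

```python
def anomaly_color(max_score: float) -> str:
    """Map a max anomaly score (0-100) to the service map indicator color."""
    if not 0 <= max_score <= 100:
        raise ValueError("anomaly scores are expected in the range 0-100")
    if max_score <= 25:
        return "green"   # service is healthy
    if max_score < 75:
        return "yellow"  # anomalous activity, service may be degraded
    return "red"         # anomalous activity, service is unhealthy


for score in (3, 25, 26, 74, 75, 99):
    print(score, anomaly_color(score))
```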
We only have one application running in this workshop environment, but in real life you might have multiple different applications running. This is where service groups come in. You can group services together to build meaningful views that remove noise, simplify investigations across services, and combine related alerts.
Navigate to Service groups. There are no service groups defined yet. Let's create one for our OpenTelemetry Demo application. Select Create group.
Give the new group a name and description, and choose your favorite color.
Next, select services by specifying a query. We don't have a lot of metadata available in this workshop, so we're going to select all the services that have a name: "service.name: *". Go ahead and click Save group.
In real life, you'd use more meaningful queries to group services, for example:
labels.team: "top-team"
service.environment: "production"
service.name: "demo-*"
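To illustrate the intent of these queries, here is a rough sketch that applies a "field: pattern" selection to a handful of services. The service metadata and the matches and select helpers are made up for this example, and the matcher is a deliberate simplification (exact match plus * wildcard), not a KQL parser.

```python
from fnmatch import fnmatchcase

# Hypothetical service metadata, flattened to dotted field names.
SERVICES = [
    {"service.name": "demo-frontend", "service.environment": "production", "labels.team": "top-team"},
    {"service.name": "demo-cart", "service.environment": "staging", "labels.team": "top-team"},
    {"service.name": "billing", "service.environment": "production", "labels.team": "finance"},
]


def matches(service: dict, field: str, pattern: str) -> bool:
    """True if the field exists and its value matches the pattern (* is a wildcard)."""
    value = service.get(field)
    return value is not None and fnmatchcase(value, pattern)


def select(services, field, pattern):
    """Names of all services selected by a single field/pattern pair."""
    return [s["service.name"] for s in services if matches(s, field, pattern)]


print(select(SERVICES, "service.name", "*"))          # every service that has a name
print(select(SERVICES, "service.name", "demo-*"))     # only the demo services
print(select(SERVICES, "labels.team", "top-team"))    # everything owned by one team
```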
Go back to the list of services and select the adservice. This brings you to the service overview. The Service overview contains a wide variety of charts and tables that provide high-level visibility into how a service is performing across your infrastructure.
Next to the service name, you'll see some extra information like:
Service details like service version, runtime version, framework, and APM agent name and version
Container and orchestration information
Cloud provider, machine type, service name, region, and availability zone
Serverless function names and event trigger type
The latency graph visualizes the response times for the service. You can filter the Latency chart to display the average, 95th, or 99th percentile latency times for the service. If you have enough historical data, you can compare the latency with the day before or the week before. If you have anomaly detection enabled, you'll be able to compare the latency with the expected bounds from the corresponding anomaly detection job.
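Why bother switching between the average and the percentiles? The sketch below uses made-up durations to show how an average can hide a slow tail. It uses a simple nearest-rank percentile, which is an assumption for illustration and not necessarily the estimation method used behind the chart.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical transaction durations in milliseconds: 97 fast requests and 3 slow ones.
durations_ms = [20] * 97 + [400, 900, 1500]

print("avg:", mean(durations_ms))
print("p95:", percentile(durations_ms, 95))
print("p99:", percentile(durations_ms, 99))
```

With this sample, the average sits well above what a typical request experiences, the 95th percentile still looks fast, and only the 99th percentile exposes the slow outliers.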
The Throughput chart visualizes the average number of transactions per minute for the selected service. The Transactions table lists the transaction groups for the selected service, with the latency, traffic, error rate, and impact of each group. Transactions that share the same name are grouped, and only one entry is displayed for each group.
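As a rough illustration of what grouping by name means, the following sketch groups a few made-up transactions and computes latency, throughput, and failure rate per group. The sample data (including the health check transaction name), the window length, and the summarize helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical transactions seen in a 5-minute window: (name, duration in ms, failed?)
WINDOW_MINUTES = 5
TRANSACTIONS = [
    ("oteldemo.AdService/GetAds", 40, False),
    ("oteldemo.AdService/GetAds", 60, False),
    ("oteldemo.AdService/GetAds", 200, True),
    ("grpc.health.v1.Health/Check", 2, False),
    ("oteldemo.AdService/GetAds", 100, False),
]


def summarize(transactions, window_minutes):
    """Group transactions by name and compute latency, throughput, and failure rate."""
    groups = defaultdict(list)
    for name, duration_ms, failed in transactions:
        groups[name].append((duration_ms, failed))

    summary = {}
    for name, items in groups.items():
        count = len(items)
        summary[name] = {
            "avg_latency_ms": sum(d for d, _ in items) / count,
            "throughput_tpm": count / window_minutes,
            "failed_rate": sum(1 for _, f in items if f) / count,
        }
    return summary


for name, stats in summarize(TRANSACTIONS, WINDOW_MINUTES).items():
    print(name, stats)
```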
By default, transaction groups are sorted by Impact to show the most used and slowest endpoints in your service. Select the oteldemo.AdService/GetAds transaction group to go to the transaction details page.
This page is similar to the service overview but focuses on the selected transaction group.
Scroll down to the bottom of the page. The Latency distribution is a plot of all transaction durations for the given time period. The transactions on the left are the fast transactions and the transactions on the right are the slow transactions.
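Conceptually, a distribution like this is a histogram of durations. Here is a minimal sketch with made-up durations; the fixed-width buckets and the latency_histogram helper are simplifying assumptions and may differ from how the chart chooses its buckets.

```python
from collections import Counter


def latency_histogram(durations_ms, bucket_width_ms):
    """Count durations per fixed-width bucket, keyed by the bucket's lower bound."""
    counts = Counter((d // bucket_width_ms) * bucket_width_ms for d in durations_ms)
    return dict(sorted(counts.items()))


# Hypothetical transaction durations in milliseconds.
durations_ms = [12, 18, 25, 31, 33, 47, 52, 240]

for lower, count in latency_histogram(durations_ms, 50).items():
    print(f"{lower:>4}-{lower + 50:<4} ms | {'#' * count}")
```

Most requests land in the leftmost bucket, while the single 240 ms request shows up far to the right. That is the kind of outlier worth inspecting.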
The Trace samples are based on the bucket selection in the Latency distribution chart; update the samples by selecting a new bucket. The following information about the selected trace sample is shown:
Trace sample timeline: this waterfall is useful for understanding the parent/child hierarchy of transactions and spans, and ultimately determining why a request was slow
Trace sample metadata: learn more about the trace sample
Trace sample logs: logs related to the sampled trace
Click on a span to view the span details
Close the span detail flyout and select Investigate.
Here you can jump to log and metric data related to the selected span. This provides an immense amount of context and speeds up the root cause analysis. The following context is available for the selected span of our adservice:
Container logs
Container metrics
Host logs
Host metrics
Trace logs
Show in service map
View transaction in Discover
Any custom links you can configure yourself
Distributed tracing allows you to trace requests through your service architecture automatically, and visualize those traces in one single view in the APM app. From initial web requests to your front-end service, to queries made to your back-end services, this makes finding possible bottlenecks throughout your application much easier and faster.
First, select one of the slower transactions. Then, next to the Investigate button, select the View full trace button.
You'll get a waterfall view of all the transactions and spans related to the trace sample that was selected. All the services in a distributed trace are separated by color and listed in the order they occur.
By definition, a distributed trace includes more than one transaction. When viewing distributed traces in the timeline waterfall, you’ll see an icon with 2 arrows ⇄, which indicates the next transaction in the trace.
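If you're curious how a waterfall follows from the raw data, the sketch below rebuilds the parent/child hierarchy from a flat list of trace items. The trace, its span names, and the waterfall helper are hypothetical; real traces carry far more fields.

```python
from collections import defaultdict

# Hypothetical trace: (id, parent id, service, name, start offset in ms, duration in ms)
TRACE = [
    ("a", None, "frontend", "HTTP GET /", 0, 120),
    ("b", "a", "frontend", "grpc oteldemo.AdService/GetAds", 10, 60),
    ("c", "b", "adservice", "oteldemo.AdService/GetAds", 15, 50),
    ("d", "a", "frontend", "grpc oteldemo.CartService/GetCart", 75, 30),
]


def waterfall(trace):
    """Return indented lines, children nested under parents and ordered by start time."""
    children = defaultdict(list)
    for item in trace:
        children[item[1]].append(item)  # index items by their parent id

    lines = []

    def visit(parent_id, depth):
        for item_id, _, service, name, start, duration in sorted(
            children[parent_id], key=lambda item: item[4]
        ):
            lines.append(f"{'  ' * depth}[{service}] {name} (+{start} ms, {duration} ms)")
            visit(item_id, depth + 1)

    visit(None, 0)  # the root has no parent
    return lines


print("\n".join(waterfall(TRACE)))
```

The line where the service in brackets changes from frontend to adservice corresponds to the point where the trace crosses into the next transaction.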
Okay, let's switch gears and move on to the next sub-item of the APM menu. In the right navigation bar select Traces.
Traces displays your application’s entry (root) transactions. Transactions with the same name are grouped together and only shown once in this table. By default, transactions are sorted by Impact. Impact helps show the most used and slowest endpoints in your service — in other words, it’s the collective amount of pain a specific endpoint is causing your users. If there’s a particular endpoint you’re worried about, select it to view its transaction details.
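One way to build intuition for Impact is to think of it as total time spent: how slow an endpoint is, multiplied by how often it is called. The sketch below ranks a few made-up endpoints that way. The endpoint names, the numbers, and the formula itself are assumptions for illustration, not Kibana's exact calculation.

```python
# Hypothetical transaction groups: name -> (average latency in ms, transactions per minute)
GROUPS = {
    "GET /api/cart": (80.0, 30.0),
    "GET /api/products": (20.0, 200.0),
    "POST /api/checkout": (900.0, 1.0),
}


def impact_ranking(groups):
    """Rank groups by total time spent per minute, scaled so the top group is 100."""
    totals = {name: latency * tpm for name, (latency, tpm) in groups.items()}
    top = max(totals.values())
    scaled = {name: round(100 * total / top, 1) for name, total in totals.items()}
    return sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)


for name, impact in impact_ranking(GROUPS):
    print(f"{impact:>5}  {name}")
```

In this sample, the slowest endpoint (checkout) ranks last because it is rarely called, while the fast but heavily used product listing ranks first.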
The final item in the APM menu is Dependencies.
APM agents collect details about external calls made from instrumented services. Sometimes, these external calls resolve into a downstream service that’s instrumented — in these cases, you can utilize distributed tracing to drill down into problematic downstream services. Other times, though, it’s not possible to instrument a downstream dependency — like with a database or third-party service. Dependencies gives you a window into these uninstrumented, downstream dependencies.
Select the Redis dependency to see detailed latency, throughput, and failed transaction rate metrics. You can also see that Redis is used by the cartservice.
In this section, we will use the Discover page in Kibana to explore the APM data in our cluster.
Navigate to Analytics > Discover. Make sure the APM dataset is selected at the top-left of the page.
You can explore the fields in the data in the sidebar on the left. Search for "service.name" and select the field. It will show a list of the top values and how many documents are related to each specific service. As you can see, the frontend generates the most documents.
Let's filter the data so that only the documents related to the accounting service remain.
In the search bar, enter: service.name: "accountingservice" and press Enter.
You'll see all the documents coming from the accounting service.
It's super easy and fast to filter through the data and find a specific piece of information you're looking for.
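The two Discover steps above, looking at the top values of a field and filtering on one value, can be sketched on a few in-memory documents. The documents and the top_values and filter_term helpers are hypothetical; Discover runs these operations against the data in Elasticsearch rather than in Python.

```python
from collections import Counter

# Hypothetical documents; real APM documents carry many more fields.
DOCS = [
    {"service.name": "frontend", "message": "GET /"},
    {"service.name": "frontend", "message": "GET /cart"},
    {"service.name": "accountingservice", "message": "order received"},
    {"service.name": "cartservice", "message": "GetCart"},
    {"service.name": "frontend", "message": "POST /checkout"},
]


def top_values(docs, field, size=5):
    """Most common values of a field, with their document counts."""
    return Counter(d[field] for d in docs if field in d).most_common(size)


def filter_term(docs, field, value):
    """Keep only the documents whose field equals the value exactly."""
    return [d for d in docs if d.get(field) == value]


print(top_values(DOCS, "service.name"))
print(filter_term(DOCS, "service.name", "accountingservice"))
```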
Let's create some visualizations with our data! Navigate to Analytics > Visualize Library and select Create new visualization.
Select Lens as the visualization type.
Make sure the APM dataset is selected on the top-left.
Let's create our first visualization by dragging the Records to the workspace. The result is a graph of the number of records over time. Click Save.
Give the Lens visualization a name and save it to a new dashboard.
We will be redirected to our brand new dashboard with the visualization we just created. Add another visualization to the dashboard by selecting Create visualization.
This time, we'll make a donut chart that shows the different currencies being used in our webshop.
Search for the "app_payment_currency" field and drag it to the workspace. Select the donut chart type at the bottom. Go ahead and select Save and return.
To get a more colourful donut, generate some data using the frontend and buy some items with different currencies.
Let's add one more visualization.
We want to display the total sales amount for the selected time period. Search for "app_payment_amount" and drag it to the workspace. Make sure the metric type is selected at the bottom. On the right, change the function to Sum and give it the name "Total sales". Select the custom value format, prefix the number with a dollar sign, and remove the decimals: $0,0.
Save the visualization.
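For reference, here is a small sketch of what the donut and the "Total sales" metric compute. The payment records and helper names are hypothetical; only the two field names come from the workshop data. Amounts are summed as-is, without any currency conversion, and the f-string format is an approximation of the $0,0 pattern (thousands separator, no decimals).

```python
from collections import Counter

# Hypothetical payments: (app_payment_currency, app_payment_amount)
PAYMENTS = [("USD", 129.99), ("EUR", 54.50), ("USD", 1049.00), ("CAD", 20.25)]


def currency_shares(payments):
    """Percentage of payments per currency, largest slice first (the donut)."""
    counts = Counter(currency for currency, _ in payments)
    total = sum(counts.values())
    return [(code, round(100 * n / total, 1)) for code, n in counts.most_common()]


def total_sales_label(payments):
    """Sum the amounts and format them with a dollar sign and no decimals (the metric)."""
    total = sum(amount for _, amount in payments)
    return f"${total:,.0f}"


print(currency_shares(PAYMENTS))
print(total_sales_label(PAYMENTS))
```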
Feel free to add even more visualizations to the dashboard. :)