Now that we've explored the APM app in Kibana, we'll go a step further and use it to find and analyze issues in our application.
We're going to intentionally break the application by enabling a feature flag.
Navigate to the frontend in your local browser:
Find the public IP address of your machine by clicking the cogwheel > machine info and copying the public IP (or public DNS).
In your local browser, open a new tab and go to http://<public ip>:8080/feature.
Find the productCatalogFailure feature flag and select Edit.
Tick the checkbox Enabled to break the product catalog service and Save.
This feature flag produces errors for one specific product. Our task is to find out which one.
We want to get an alert when something goes wrong, so let's create a new alerting rule. Go to Observability > APM and, in the top right, select Alerts and rules > Create error count rule.
By default, the rule checks every minute whether the error count is above 10 for the last 5 minutes. Save the rule without changing anything. You have to confirm that you don't want to configure any actions. In real life, you'd want to add an action to your rules, for example to send a notification to Slack or Microsoft Teams.
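To make the default rule condition concrete, here is a minimal Python sketch of the check described above: count the errors in the last 5 minutes and compare the count against a threshold of 10. The function name and the sample timestamps are hypothetical, and this only illustrates the condition; it is not how Kibana evaluates the rule internally.

```python
from datetime import datetime, timedelta


def error_count_breached(error_times, now, window=timedelta(minutes=5), threshold=10):
    """Return True when more than `threshold` errors fall in the window ending at `now`."""
    window_start = now - window
    recent = [t for t in error_times if window_start < t <= now]
    return len(recent) > threshold


# Hypothetical sample data: 12 errors spread over the last few minutes.
now = datetime(2024, 1, 1, 12, 0, 0)
errors = [now - timedelta(seconds=20 * i) for i in range(12)]

print(error_count_breached(errors, now))
```

Because the comparison is "above 10", exactly 10 errors in the window would not trigger the alert, which is worth keeping in mind when you tune the threshold.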
Go to the alert overview: Observability > Alerts. After a few minutes, an alert will pop up with the following reason: "Error count is x in the last 5 mins for service: frontend". Let's start investigating by selecting the eye icon (View in app) in the Actions column.
We're redirected to the Errors tab of the frontend service. Notice all the errors (purple bars) that started occurring when we switched on the feature flag. Select the error message to view more details.
Next to Error sample, use the arrows to go through the errors. You'll see 2 separate kinds of errors occur:
One error about the fail feature flag being enabled
13 INTERNAL: Error: ProductCatalogService Fail Feature Flag Enabled
And one error about a certain product failing to be retrieved (keep clicking the arrow until you find one)
13 INTERNAL: failed to prepare order: failed to get product #"OLJCESPC7Z"
Congrats! You've found the product ID of the broken product: OLJCESPC7Z. You can check the OpenTelemetry demo documentation to confirm that this is the broken product.
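If you had many error samples to go through, clicking the arrows would get tedious. As a rough sketch, assuming you have copied the error messages into a list of strings and that they follow the same pattern as the two messages above, a small Python helper (the name `failing_products` is hypothetical) could extract and count the product IDs:

```python
import re
from collections import Counter

# Matches the product ID in messages like:
#   13 INTERNAL: failed to prepare order: failed to get product #"OLJCESPC7Z"
PRODUCT_ID = re.compile(r'failed to get product #"([A-Z0-9]+)"')


def failing_products(messages):
    """Count how often each product ID appears in the given error messages."""
    counts = Counter()
    for message in messages:
        match = PRODUCT_ID.search(message)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample messages taken from the error samples shown above.
messages = [
    "13 INTERNAL: Error: ProductCatalogService Fail Feature Flag Enabled",
    '13 INTERNAL: failed to prepare order: failed to get product #"OLJCESPC7Z"',
    '13 INTERNAL: failed to prepare order: failed to get product #"OLJCESPC7Z"',
]

print(failing_products(messages))
```

Messages that don't mention a product, such as the feature flag error, are simply skipped.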
What if the error doesn't include the exact reason for the failure? There's another very easy way to discover which product is broken. Let's pretend we don't know the product ID yet and start investigating again.
Navigate to Transactions and select the HTTP GET transaction (which has the most impact).
Scroll to the bottom and select Failed transaction correlations. The correlations on the Failed transaction correlations tab help you discover which attributes are most influential in distinguishing between transaction failures and successes. You can immediately find the broken product ID in the results.
You could do the same for the transactions of the product catalog service.
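To build some intuition for what a correlation analysis like this does, here is a deliberately simplified Python sketch. It ranks field/value pairs by how much more often they appear in failed transactions than in successful ones. This is an assumption-laden illustration, not the statistical method Kibana actually uses, and the field names (`app.product.id`, `http.method`) and the `OTHER-1`/`OTHER-2` product IDs are hypothetical placeholders.

```python
from collections import Counter


def rank_correlations(transactions, outcome_key="outcome"):
    """Rank (field, value) pairs by share in failures minus share in successes."""
    failed = [t for t in transactions if t[outcome_key] == "failure"]
    succeeded = [t for t in transactions if t[outcome_key] != "failure"]

    def share(group):
        # Fraction of transactions in the group that carry each (field, value) pair.
        if not group:
            return {}
        counts = Counter(
            (field, value)
            for t in group
            for field, value in t.items()
            if field != outcome_key
        )
        return {pair: n / len(group) for pair, n in counts.items()}

    failed_share = share(failed)
    success_share = share(succeeded)
    scores = {
        pair: failed_share[pair] - success_share.get(pair, 0.0)
        for pair in failed_share
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample transactions.
sample = [
    {"outcome": "failure", "app.product.id": "OLJCESPC7Z", "http.method": "GET"},
    {"outcome": "failure", "app.product.id": "OLJCESPC7Z", "http.method": "GET"},
    {"outcome": "success", "app.product.id": "OTHER-1", "http.method": "GET"},
    {"outcome": "success", "app.product.id": "OTHER-2", "http.method": "GET"},
]

ranked = rank_correlations(sample)
print(ranked[0])
```

In this sample the broken product ID ends up at the top because it appears in every failure and in no success, while `http.method: GET` scores zero because it is equally common in both groups.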
Clean up by disabling the productCatalogFailure feature flag again.
After a few minutes, the alert will recover.