Diagnostic Information in Application Insights notifications




When using Application Insights to monitor the health and performance of your applications, the diagnostic information in its notifications plays a critical role in triaging, scoping, and diagnosing issues. Let's break down how you can use this information effectively to manage and resolve problems in the context of Triage, Scope, and Diagnose.

1. Triage: Initial Alert Handling

The Triage phase is about quickly identifying the problem, assessing its severity, and deciding whether immediate action is needed.

How Application Insights Notifications Help in Triage:

  1. Alerting on Thresholds: Application Insights sends notifications based on custom or built-in thresholds. These could be related to:

  • Request failure rates: A high number of failed HTTP requests (e.g., 5xx errors).

  • Performance degradation: Slow response times exceeding thresholds.

  • Availability tests: Monitoring the application's uptime via synthetic transactions or external tests.

  2. Initial Data: The notification you receive may include high-level details like:

  • Severity: Alerts are often categorized (e.g., "Critical," "Warning," or "Informational").

  • Impact: A metric showing how many users are affected or how much traffic is impacted (e.g., 50% of requests failed).

  • Time: A timestamp or time window during which the issue occurred.

Triage Actions:

  1. Quickly assess impact: Look at the notification for the number of impacted users, regions, or affected services. For instance, a sharp rise in failure rates or response time can indicate an urgent problem.

  2. Evaluate urgency: If the error is affecting a critical feature or a large number of users, you'll prioritize it. If it's a low-severity issue with minimal impact (like a few intermittent errors), you might defer it for further investigation.

Example in Triage:

You receive a notification that the "request failure rate" for your API has spiked by 40% over the last 10 minutes.

  1. Impact: This alert might tell you that a large number of users are unable to access critical services, which indicates a potentially high-impact issue.

  2. Severity: The alert severity could be “Critical” based on the large number of failures.

Your immediate action might be to look at high-level metrics for the Application Insights resource in the Azure portal (failure rate, user impact) to get a sense of what might be wrong.
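
The triage reasoning above can be sketched in a few lines of Python. This is only an illustration: the RequestRecord shape, the triage function, and the thresholds (a 20% failure rate, 100 affected users) are assumptions chosen for the example, not fields or defaults defined by Application Insights.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical, simplified view of one request telemetry item."""
    user_id: str
    success: bool


def failure_rate(records):
    """Fraction of requests that failed (0.0 when there are no requests)."""
    if not records:
        return 0.0
    failed = sum(1 for r in records if not r.success)
    return failed / len(records)


def affected_users(records):
    """Number of distinct users with at least one failed request."""
    return len({r.user_id for r in records if not r.success})


def triage(records, rate_threshold=0.20, user_threshold=100):
    """Map impact to a coarse priority. The thresholds are assumptions to tune."""
    rate = failure_rate(records)
    users = affected_users(records)
    if rate >= rate_threshold or users >= user_threshold:
        return "Critical"
    if rate > 0:
        return "Warning"
    return "Informational"


# Ten requests in the alert window: six succeed, four fail across three users.
window = (
    [RequestRecord(f"u{i}", True) for i in range(6)]
    + [RequestRecord("u1", False), RequestRecord("u1", False),
       RequestRecord("u7", False), RequestRecord("u8", False)]
)
print(failure_rate(window), affected_users(window), triage(window))
```

In practice the alert rule itself carries the severity; a helper like this is mainly useful for reasoning about where your own thresholds should sit.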

2. Scope: Understand the Extent of the Problem

Once you’ve triaged the issue and confirmed that it requires your attention, the next step is to scope the problem: determine how far it extends, which systems are involved, and what resources you’ll need.

How Application Insights Notifications Help in Scoping:

  1. Correlation IDs:

Application Insights attaches a correlation ID to telemetry items (stored as the operation ID, operation_Id in the logs), which lets you trace an individual request across distributed systems. This is useful for understanding whether the issue is localized or spans multiple systems or microservices.

  2. Affected Endpoints/Services:

Application Insights can provide information on which parts of your system are failing.

For example:

  • Which API endpoints are impacted?

  • Which microservices are experiencing errors?

  • Which regions or datacenters are reporting problems?

  3. Performance Metrics:

Metrics such as response time, throughput, and dependency call duration help you assess how widespread the problem is. A service outage typically shows up as degraded performance in one area, while a scaling issue tends to appear as uneven performance across multiple regions or services.

Scoping Actions:

  1. Identify affected systems:

Use correlation IDs to follow individual requests and see which part of the application or service is failing. If the failure is on the backend, you may need to look into server-side issues. If it’s on the frontend, you may need to check the web application code or network requests.

  2. Check dependencies:

Use Application Insights' dependency tracking feature to see if an external dependency (e.g., a database, API, or external service) is causing the issue. If a database query is slow, it could be causing delays in your application.

  3. Filter by region, user, or session:

If your application is geographically distributed, check whether the issue is localized to a particular region or datacenter. If the issue affects only a specific group of users (e.g., those on mobile devices), it may be related to a specific platform or device type.
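
To make the filtering step concrete, here is a small Python sketch that computes the failure rate per region and per endpoint from a handful of request records. The records, their keys (name, region, success), and the failure_rate_by helper are hypothetical stand-ins for exported telemetry, not an Application Insights API.

```python
from collections import Counter

# Hypothetical request records; in practice these would come from exported telemetry.
request_log = [
    {"name": "POST /checkout", "region": "West Europe", "success": False},
    {"name": "POST /checkout", "region": "West Europe", "success": False},
    {"name": "POST /checkout", "region": "East US", "success": True},
    {"name": "GET /products", "region": "West Europe", "success": True},
    {"name": "GET /products", "region": "East US", "success": True},
    {"name": "POST /checkout", "region": "West Europe", "success": True},
]


def failure_rate_by(records, key):
    """Return {group: failed / total} for the given grouping key."""
    totals = Counter()
    failures = Counter()
    for record in records:
        totals[record[key]] += 1
        if not record["success"]:
            failures[record[key]] += 1
    return {group: failures[group] / totals[group] for group in totals}


print(failure_rate_by(request_log, "region"))
print(failure_rate_by(request_log, "name"))
```

If one group stands out while the others stay near zero, the problem is probably localized; a uniformly raised rate points to something shared.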

Example in Scoping:

The alert you received indicated a high failure rate for the "checkout" API.

You can:

  • Use the correlation ID of a failed request (from the alert, if it includes one, or from the failed request's telemetry) to trace the problem across the backend and frontend.

  • Filter to see if this failure is occurring in all regions or is specific to one region, helping you determine if it’s a broader infrastructure issue or something specific to a particular region.

  • Look at dependency tracking to see if the failure is related to a database query or an external API call, such as a payment gateway.
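
The correlation-ID step in this example could look roughly like the following Python sketch. The telemetry list, its field names, and the trace and first_failure helpers are assumptions made for illustration; the operation_id key stands in for the correlation ID.

```python
from datetime import datetime

# Hypothetical telemetry items, deliberately out of order.
telemetry = [
    {"kind": "dependency", "operation_id": "op-1", "name": "HTTP: payment-gateway",
     "timestamp": "2024-01-01T10:00:02", "success": False},
    {"kind": "request", "operation_id": "op-2", "name": "GET /products",
     "timestamp": "2024-01-01T10:00:01", "success": True},
    {"kind": "request", "operation_id": "op-1", "name": "POST /checkout",
     "timestamp": "2024-01-01T10:00:00", "success": False},
    {"kind": "dependency", "operation_id": "op-1", "name": "SQL: orders-db",
     "timestamp": "2024-01-01T10:00:01", "success": True},
]


def trace(items, operation_id):
    """All telemetry sharing one correlation ID, oldest first."""
    matching = [i for i in items if i["operation_id"] == operation_id]
    return sorted(matching, key=lambda i: datetime.fromisoformat(i["timestamp"]))


def first_failure(items, operation_id):
    """Earliest failed dependency call in the trace, or None if there is none."""
    for item in trace(items, operation_id):
        if item["kind"] == "dependency" and not item["success"]:
            return item
    return None


for step in trace(telemetry, "op-1"):
    print(step["timestamp"], step["kind"], step["name"], step["success"])
```

Here the failed checkout request leads to a failed call to the payment gateway while the database call succeeded, which narrows the scope to one external dependency.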

3. Diagnose: Identify the Root Cause

After triaging and scoping, the next phase is to diagnose the root cause of the issue. This involves deeper investigation into logs, telemetry, and other data to understand exactly what went wrong.

How Application Insights Helps in Diagnosis:

  1. Exception Tracking:

Application Insights automatically tracks exceptions and errors across your application. These may be:

  • Unhandled exceptions: e.g., NullReferenceException, TimeoutException.

  • Custom exceptions: Errors raised by your code, which may provide deeper insights.

  2. Performance Logs:

Application Insights also tracks performance logs that can help diagnose issues like:

  • Slow database queries.

  • High CPU usage on the server.

  • Latency or timeouts from external dependencies.

  3. Application Map:

The Application Map in Application Insights is a visual tool that shows how different components of your application interact. If you see a bottleneck or failure in the application map, it can help you pinpoint where the problem is occurring.

  4. Kusto Query Language (KQL):

KQL is a powerful querying language for filtering, aggregating, and analyzing telemetry data in Application Insights. You can write custom queries to look for specific patterns, trends, and anomalies over time.

  5. Failure Correlation:

By correlating application logs, dependencies, and errors with the actual request data, you can trace back to the root cause of failures. For example, if an API call is failing due to a slow database query, Application Insights can help you see this relationship.
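
As a rough illustration of failure correlation, the Python sketch below joins failed requests to the failed dependency calls that share their operation ID and counts which dependency target shows up most often. The sample data, field names, and the failing_dependencies helper are all hypothetical.

```python
from collections import Counter

# Hypothetical request and dependency telemetry sharing operation IDs.
request_items = [
    {"operation_id": "op-1", "name": "POST /checkout", "success": False},
    {"operation_id": "op-2", "name": "POST /checkout", "success": False},
    {"operation_id": "op-3", "name": "POST /checkout", "success": True},
]
dependency_items = [
    {"operation_id": "op-1", "target": "orders-db", "success": False},
    {"operation_id": "op-1", "target": "payment-gateway", "success": True},
    {"operation_id": "op-2", "target": "orders-db", "success": False},
    {"operation_id": "op-3", "target": "orders-db", "success": True},
]


def failing_dependencies(requests, dependencies):
    """Count failed dependency calls that belong to failed requests, most frequent first."""
    failed_ops = {r["operation_id"] for r in requests if not r["success"]}
    counts = Counter(
        d["target"]
        for d in dependencies
        if d["operation_id"] in failed_ops and not d["success"]
    )
    return counts.most_common()


print(failing_dependencies(request_items, dependency_items))
```

A dependency that fails in most of the failed operations is a strong candidate for the root cause, though it is evidence of correlation rather than proof.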

Diagnosis Actions:

  1. Examine exception details:

Look at stack traces, exception messages, and the context in which the exception occurred. Is it a common, recurring error? Does it occur after a specific user action or input?

  2. Investigate performance:

If there is a performance issue (e.g., high response times), use Application Insights' performance tracking to identify which operations are taking the longest. Are database queries slow? Are network calls taking too long? Is there an issue with server resources (e.g., CPU, memory)?

  3. Check dependencies:

Use dependency tracking to see if external services (like a third-party API or a database) are causing delays or failures. Are there timeouts or high error rates in any external service?

  4. Run custom queries:

Use KQL to query Application Insights for patterns or trends that could point to the issue. For example, you can search for all requests with a certain status code over a given time period to identify common failure points.
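
The status-code search described in the last step might be written as the KQL text produced by this Python sketch. It relies on the requests table and its timestamp, resultCode, and name columns from the Application Insights log schema; the failed_requests_query function, its parameters, and the default time values are assumptions. The sketch only builds the query text and does not run it.

```python
def failed_requests_query(status_code, lookback="1h", bucket="5m"):
    """Build KQL that counts requests with one status code per operation and time bucket."""
    if not str(status_code).isdigit():
        raise ValueError("status_code must be numeric, e.g. 500")
    return "\n".join([
        "requests",
        f"| where timestamp > ago({lookback})",
        # resultCode is stored as a string, so the value is quoted.
        f'| where resultCode == "{status_code}"',
        f"| summarize failures = count() by name, bin(timestamp, {bucket})",
        "| order by failures desc",
    ])


print(failed_requests_query(500))
```

The resulting text can be pasted into the Logs view of the Application Insights resource; operations that dominate the top rows are the common failure points.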

Example in Diagnose:

You determine that the spike in request failures is due to timeouts in your database. By drilling into the exception logs and checking the dependency tracking in Application Insights, you find that a slow database query is responsible for the high latency, which causes the timeouts.

  • You can see the exact query and the duration in the logs.

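  • A KQL query over the dependency telemetry shows that this query's duration has climbed steadily, which is consistent with a database table that has grown significantly and is no longer optimized.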

The root cause is identified: a slow, unoptimized database query is the source of the issue.
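
The drill-down in this example amounts to ranking dependency calls by how long they take. A minimal Python sketch, assuming hypothetical dependency records with target, name, and duration_ms fields (the query names and timings are invented for illustration):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical dependency calls captured during the incident window.
dependency_calls = [
    {"target": "orders-db", "name": "GetOrdersByCustomer", "duration_ms": 4200},
    {"target": "orders-db", "name": "GetOrdersByCustomer", "duration_ms": 3800},
    {"target": "orders-db", "name": "GetProductById", "duration_ms": 15},
    {"target": "payment-gateway", "name": "POST /charge", "duration_ms": 300},
    {"target": "payment-gateway", "name": "POST /charge", "duration_ms": 340},
]


def slowest_dependencies(calls, top=3):
    """Average duration per (target, name) pair, slowest first."""
    durations = defaultdict(list)
    for call in calls:
        durations[(call["target"], call["name"])].append(call["duration_ms"])
    ranked = sorted(((mean(v), key) for key, v in durations.items()), reverse=True)
    return [(key, average) for average, key in ranked[:top]]


for (target, name), average in slowest_dependencies(dependency_calls):
    print(f"{target} / {name}: {average} ms on average")
```

An average hides outliers, so in a real investigation a percentile (for example the 95th) is often a better ranking key than the mean.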

Summary: Using Application Insights for Triage, Scope, and Diagnose

  1. Triage:

Alerts and notifications give you the first insight into the severity of the issue and whether it requires immediate attention. They often contain high-level details like failure rates, affected users, and error codes.

  2. Scope:

Application Insights helps you understand the extent of the problem by providing data on affected systems, regions, dependencies, and services. Correlation IDs and dependency tracking are key for scoping the issue.

  3. Diagnose:

Detailed logs, exceptions, performance data, and KQL queries allow you to investigate the root cause of the problem, whether it’s an exception, slow queries, or service failures. The Application Map and dependency tracking are especially helpful in pinpointing bottlenecks or failures in distributed systems.

By leveraging Application Insights effectively, you can triage, scope, and diagnose issues with confidence, leading to faster resolutions and improved system reliability.



Rajnish, MCT
