In this article, we explore what anomaly detection is and what benefits it can provide to an organization.
Anomaly detection is the process of identifying potential issues or incidents within a system by recognizing behavior that deviates from what is expected. It is a critical component for maintaining system reliability and performance.
Anomaly detection is essential because it allows organizations to proactively address problems before they escalate, minimizing downtime and reducing the impact on users. There are various types of detection, including dynamic, which identifies deviations from normal behavior, and static, which triggers alerts when predefined conditions are met.
There are two main approaches to performing anomaly detection: static and dynamic. Each approach has its own strengths and is suited to different types of data and scenarios. Static anomaly detection relies on predefined thresholds that remain constant over time, while dynamic anomaly detection adjusts thresholds based on the evolving characteristics of the data.
Before discussing static and dynamic threshold detection, we need to understand what we mean by "normal" in this context.
What is “normal” here?
"normal" refers to the expected range of values or behavior for a particular metric or system performance parameter under typical operating conditions. Defining what constitutes "normal" is crucial for identifying anomalies or issues that deviate from usual patterns.
Consider an example:
Imagine you are monitoring the response time of a web application. The response time has been between 100ms and 300ms. This range is considered normal. If the response time suddenly drops to 50ms or spikes to 500ms, it may indicate an issue since these values deviate from the established normal range.
Static thresholds are predefined, fixed values set to monitor specific metrics such as CPU utilization, error rates, or response times. Alerts are triggered when these metrics exceed or fall below the set threshold.
Static thresholds set a fixed limit for a metric. This limit is used to compare with real-time data to decide if an alert should be triggered. If the metric goes above or below the threshold, an alert is generated. Since the threshold doesn’t adjust automatically, it needs to be updated manually if normal behavior changes.
The following graph shows a basic demonstration of static thresholds.
To handle temporary spikes or drops, you can set time limits so alerts are only raised if the metric consistently stays outside the threshold for a certain period, as shown in picture (B).
For example:
Imagine you set a static threshold of 70% for CPU utilization. If the CPU usage goes above 70%, an alert is triggered. However, if CPU usage temporarily spikes above 70% but then drops back down quickly, no alert is raised, as long as the usage does not stay above 70% for the duration of the set time limit. If the CPU usage consistently stays above 70% for a specified period, an alert will be generated to notify you of a potential issue.
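To make this concrete, here is a minimal Python sketch of a static threshold with a sustained-duration check. The 70% threshold, the three-consecutive-readings rule, and the sample values are illustrative assumptions, not recommendations.

```python
# Minimal sketch: static threshold with a sustained-duration requirement.
# Assumptions (illustrative only): one CPU reading per minute, a fixed 70%
# threshold, and an alert only after 3 consecutive readings above it.

CPU_THRESHOLD = 70.0        # static threshold (%)
SUSTAINED_SAMPLES = 3       # consecutive breaches required before alerting

def check_static_threshold(readings, threshold=CPU_THRESHOLD,
                           sustained=SUSTAINED_SAMPLES):
    """Return the indices at which an alert would fire."""
    alerts = []
    consecutive = 0
    for i, value in enumerate(readings):
        if value > threshold:
            consecutive += 1
            if consecutive == sustained:   # breach has persisted long enough
                alerts.append(i)
        else:
            consecutive = 0                # a dip back below resets the count
    return alerts

# A brief spike (minute 2) does not alert; a sustained run (minutes 5-8) does.
cpu_usage = [55, 62, 75, 60, 58, 72, 78, 81, 74, 66]
print(check_static_threshold(cpu_usage))   # -> [7] (third consecutive breach)
```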
What if:
What if the CPU usage exceeds a static threshold, but there's no real issue? Alternatively, what if a problem arises within the range defined by the threshold because the normal trend has shifted?
For example, if your CPU usage typically ranges from 20% to 60%, but over time it shifts to 30% to 70%, the static threshold set at 65% might no longer be effective. You would then need to update the threshold to match the new range.
What if the range changes again, say to 27% to 67%? Constantly adjusting these thresholds manually is time-consuming and impractical, potentially leading to missed issues or false alarms.
This is where the dynamic threshold becomes essential. Unlike static thresholds, dynamic threshold detection adapts to changing usage patterns, identifying deviations from current trends. This approach ensures that alerts are triggered based on the system's actual behavior, not outdated thresholds, effectively eliminating blind spots.
A dynamic threshold automatically adjusts based on real-time data, eliminating the need for constant manual updates. It uses statistical algorithms to analyze incoming datapoints and historical data, identify trends, and detect anomalies. This approach reduces false alarms and ensures alerts are triggered only when there's a genuine issue, making it more efficient and less of a manual burden than static thresholds.
The following graph shows a basic demonstration of dynamic thresholds.
For example:
Let's revisit the previous example, this time using dynamic anomaly detection to monitor CPU utilization. Instead of a fixed threshold, the system learns that normal CPU usage typically ranges from 30% to 60% based on the past 30 days of data.
The system calculates that the average usage is 45% with a standard deviation of 7.5%. Using a 2-standard-deviation rule, the dynamic threshold sets a normal range of 30% to 60%:
Lower bound: 45% - (2 * 7.5%) = 30%
Upper bound: 45% + (2 * 7.5%) = 60%.
If CPU usage goes outside this range, the system starts monitoring more closely.
However, a brief spike to 65% that quickly returns to normal wouldn't trigger an alert. If the CPU usage consistently stays above 60% or below 30% for a specified period (let's say an hour), an alert is generated to notify you of a potential issue. This approach allows for normal fluctuations while still catching sustained abnormal behavior.
The key difference is adaptability. If over time the normal CPU usage shifts to 40% to 70%, the system will automatically adjust its "normal" range to match, ensuring that alerts remain relevant without manual reconfiguration.
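Here is a minimal Python sketch of this idea, using a rolling window and a 2-standard-deviation band. The 30-sample window and the generated readings are illustrative assumptions.

```python
# Minimal sketch: a dynamic threshold based on a rolling mean +/- 2 standard
# deviations. The 30-sample look back window and the generated data are
# illustrative assumptions only.
import random
from statistics import mean, stdev

WINDOW = 30          # how many recent samples define "normal"
NUM_SD = 2           # width of the normal band in standard deviations

def dynamic_bounds(history):
    """Compute the current lower/upper bounds from recent history."""
    recent = history[-WINDOW:]
    mu, sd = mean(recent), stdev(recent)
    return mu - NUM_SD * sd, mu + NUM_SD * sd

random.seed(42)
history = [random.gauss(45, 7.5) for _ in range(WINDOW)]   # "normal" CPU data

for reading in [52, 48, 88, 50]:           # 88% falls outside the band
    low, high = dynamic_bounds(history)
    status = "anomalous" if not (low <= reading <= high) else "normal"
    print(f"reading={reading}%  band=({low:.1f}, {high:.1f})  -> {status}")
    history.append(reading)                # the band adapts as new data arrives
```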
A metric is the specific quantitative measurement you're monitoring for anomalies. It's the foundation of your anomaly detection process, providing the raw data that you'll analyze. Metrics can be simple or complex, depending on your needs and the system you're monitoring.
For example, in an e-commerce setting, you might track metrics such as daily sales revenue, number of transactions, or average order value. In a technical environment, metrics could include server response time, CPU usage, or network throughput. The choice of metric is crucial as it directly impacts what kinds of anomalies you can detect.
Datapoints are the individual measurements of your chosen metric, collected at regular intervals. Each datapoint represents a specific value at a particular moment in time, forming the basic units of your dataset.
For instance, if you're monitoring hourly website traffic, each datapoint would represent the number of visitors for a specific hour; a few hypothetical datapoints are sketched below.
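The timestamps and visitor counts here are made up purely for illustration:

```python
# Hypothetical hourly traffic datapoints: each is (timestamp, visitor count).
datapoints = [
    ("2024-06-01 09:00", 1250),
    ("2024-06-01 10:00", 1430),
    ("2024-06-01 11:00", 1615),
    ("2024-06-01 12:00", 1580),
]
for ts, visitors in datapoints:
    print(f"{ts}: {visitors} visitors")
```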
The frequency of datapoint collection depends on the nature of your metric and the granularity of analysis you need.
A time slice, also known as a time window, is the duration over which you group and analyze your data points. It defines the granularity of your analysis and can significantly impact your ability to detect different types of anomalies.
For example, if you're analyzing server performance, you might use different time windows for different purposes: a short window (say, one minute) to catch sudden spikes, and a longer window (say, one hour) to reveal gradual trends.
The choice of time window depends on the typical behavior of your metric and the types of anomalies you're looking to detect.
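As a rough illustration, the sketch below groups per-minute readings into 5-minute time slices and averages each one; the window size and the generated values are assumptions.

```python
# Minimal sketch: grouping per-minute readings into fixed time windows
# ("time slices") and averaging each window. The 5-minute window and the
# sample values are illustrative assumptions.
from statistics import mean

WINDOW_MINUTES = 5

# (minute offset, response time in ms) -- hypothetical per-minute datapoints
readings = [(m, 100 + (m % 7) * 20) for m in range(15)]

windows = {}
for minute, value in readings:
    slot = minute // WINDOW_MINUTES          # which window this datapoint belongs to
    windows.setdefault(slot, []).append(value)

for slot, values in sorted(windows.items()):
    start = slot * WINDOW_MINUTES
    print(f"minutes {start}-{start + WINDOW_MINUTES - 1}: avg={mean(values):.1f} ms")
```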
The evaluation period is the specific timeframe over which you apply your anomaly detection analysis. It's the duration for which you want to identify anomalies, often representing the most recent data you're examining.
For instance, in monitoring website traffic, you might evaluate only the most recent 24 hours of data for anomalies, while the preceding weeks serve purely as historical context.
The evaluation period is distinct from the historical look back (which establishes baselines) and the time slice (which determines analysis granularity). It focuses your analysis on a specific, usually recent, timeframe of interest. Choosing an appropriate evaluation period depends on the nature of your data and the types of anomalies you're trying to detect.
The historical look back period refers to how far back in time you consider data when establishing patterns and detecting anomalies. This historical context is crucial for understanding the normal behavior of your metric over time.
For instance, if you're monitoring retail sales, a look back period of at least a full year lets seasonal peaks such as holiday shopping become part of the expected pattern rather than being flagged as anomalies.
The appropriate look back period depends on factors like the stability of your metric, the presence of seasonal patterns, and how quickly the underlying system changes.
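To illustrate how the look back period and the evaluation period fit together, the sketch below uses a 30-day look back to establish a baseline and then checks a 7-day evaluation period against it. The 30/7 split, the 2-standard-deviation cutoff, and the sales figures are illustrative assumptions.

```python
# Minimal sketch: splitting a daily series into a historical look back period
# (used to establish the baseline) and an evaluation period (the recent data
# actually checked for anomalies). The 30-day / 7-day split is an assumption.
from statistics import mean, stdev

LOOK_BACK_DAYS = 30
EVALUATION_DAYS = 7

daily_sales = [1000 + (d % 7) * 50 for d in range(LOOK_BACK_DAYS + EVALUATION_DAYS)]

history = daily_sales[:LOOK_BACK_DAYS]       # establishes "normal"
recent = daily_sales[-EVALUATION_DAYS:]      # the window we evaluate

baseline, spread = mean(history), stdev(history)
for day, value in enumerate(recent, start=1):
    flagged = abs(value - baseline) > 2 * spread
    print(f"evaluation day {day}: {value} {'<-- anomaly' if flagged else ''}")
```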
Using your historical data, you establish a baseline - the expected "normal" behavior of your metric. This serves as a reference point for detecting deviations. The baseline can be static or dynamic, depending on the nature of your data.
For example, in monitoring energy consumption of a building, a static baseline might be the average consumption over the past year, while a dynamic baseline might track expected consumption by hour of day or by season.
Establishing an accurate baseline is crucial for effective anomaly detection, as it provides the context for determining what constitutes "abnormal" behavior.
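As a rough sketch of the two flavors of baseline, the example below computes a single static baseline and a simple per-hour-of-day dynamic baseline from hypothetical hourly energy readings.

```python
# Minimal sketch: two ways to establish a baseline for hourly energy readings.
# A static baseline is a single long-run average; a dynamic (time-of-day)
# baseline keeps one expected value per hour. All readings are hypothetical.
from collections import defaultdict
from statistics import mean

# (hour of day, kWh) readings over several days -- hypothetical values
readings = [(h, 20 + (10 if 8 <= h <= 18 else 0) + d)
            for d in range(3) for h in range(24)]

static_baseline = mean(kwh for _, kwh in readings)

per_hour = defaultdict(list)
for hour, kwh in readings:
    per_hour[hour].append(kwh)
dynamic_baseline = {hour: mean(values) for hour, values in per_hour.items()}

print(f"static baseline: {static_baseline:.1f} kWh")
print(f"expected at 03:00: {dynamic_baseline[3]:.1f} kWh, "
      f"at 14:00: {dynamic_baseline[14]:.1f} kWh")
```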
Trend analysis involves examining how your metric changes over time. It helps in identifying patterns, seasonality, and gradual shifts that might not be apparent when looking at individual datapoints.
For example, in analyzing stock prices, a steady long-term upward trend means that comparing each day against a fixed level would constantly raise false alarms; comparing it against the trend keeps the judgment meaningful.
Trend analysis can help distinguish between normal variations and true anomalies, especially in metrics with complex patterns.
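For a concrete, if simplified, illustration, the sketch below estimates the local trend with a moving average and flags values that deviate sharply from it. The window size, the deviation cutoff, and the injected spike are all assumptions.

```python
# Minimal sketch: using a simple moving average to expose the underlying trend,
# so deviations are judged against the trend rather than a fixed level.
# The 7-point window, the cutoff, and the generated series are assumptions.
from statistics import mean

WINDOW = 7

# Rising trend plus small noise, with a one-off spike injected at day 20.
series = [100 + 2 * day + (5 if day % 3 == 0 else -5) for day in range(30)]
series[20] += 40

for day in range(WINDOW, len(series)):
    trend = mean(series[day - WINDOW:day])       # local trend estimate
    deviation = series[day] - trend
    if abs(deviation) > 15:                      # threshold relative to the trend
        print(f"day {day}: value {series[day]} deviates "
              f"{deviation:+.1f} from trend {trend:.1f}")
```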
Snapshots are periodic "pictures" of your data at specific points in time. Comparing snapshots can help identify sudden changes or anomalies that might not be apparent in continuous data.
For instance, in monitoring database performance, you might capture a nightly snapshot of key metrics such as query latency and connection counts, then compare snapshots week over week to spot gradual degradation.
Snapshots are particularly useful for detecting slow-moving anomalies or changes that occur over longer time scales.
Outlier detection is the process of identifying datapoints that significantly differ from other observations. Various statistical and machine learning techniques can be used for outlier detection, considering factors like the baseline, threshold, and trends.
For example, in fraud detection, a transaction amount far outside a customer's usual spending pattern would stand out as an outlier worth reviewing.
Effective outlier detection requires careful tuning to balance between catching genuine anomalies and avoiding false positives.
Standard deviation (SD) tells us how much data typically varies from the average. In anomaly detection, we use it to separate “normal" from “unusual” data. The 68-95-99.7 rule is a simple way to remember how much data falls within 1, 2, or 3 standard deviations from the average:
Here's a breakdown: roughly 68% of datapoints fall within 1 standard deviation of the average, about 95% fall within 2 standard deviations, and about 99.7% fall within 3 standard deviations.
In anomaly detection, we often use 2 SD (95% rule) or 3 SD (99.7% rule) as thresholds for identifying outliers or anomalies.
Simplified Example:
Let's say we're monitoring the CPU usage of a web server. After analyzing the historical CPU usage data, we find its average usage and its standard deviation.
Using the standard deviation rule for anomaly detection:
1 SD (68% rule): readings within one standard deviation of the average are common and clearly normal.
2 SD (95% rule): readings outside two standard deviations are uncommon and worth watching; this is a typical alerting threshold.
3 SD (99.7% rule): readings outside three standard deviations are rare and are strong candidates for genuine anomalies.
Anomaly Detection Example:
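The concrete figures behind this example aren't shown here, so the sketch below assumes an average CPU usage of 45% with a standard deviation of 7.5% (the same figures as the earlier dynamic-threshold example) and classifies a few hypothetical readings against the 1, 2, and 3 SD bands.

```python
# Minimal sketch of the 68-95-99.7 bands applied to CPU usage. The assumed
# average (45%) and standard deviation (7.5%) mirror the earlier example;
# the sample readings are hypothetical.
AVG, SD = 45.0, 7.5

bands = {k: (AVG - k * SD, AVG + k * SD) for k in (1, 2, 3)}
for k, (low, high) in bands.items():
    print(f"{k} SD band: {low:.1f}% - {high:.1f}%")

def classify(reading):
    """Label a reading by how far it sits from the average."""
    deviation = abs(reading - AVG)
    if deviation <= 2 * SD:
        return "normal (within 2 SD)"
    if deviation <= 3 * SD:
        return "suspicious (between 2 and 3 SD)"
    return "anomalous (beyond 3 SD)"

for reading in [48, 63, 72]:
    print(f"CPU at {reading}%: {classify(reading)}")
```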
This simple example demonstrates how standard deviation can be used to create thresholds for normal vs. anomalous behavior in a metric, allowing for automated detection of unusual patterns or events.
1. Establish Normal Behavior:
2. Monitor Current Data:
3. Identify and Respond to Anomalies:
4. Update Normal Behavior:
Example: Detecting anomalies in daily website visitors
Step 1: Define the metric
Step 2: Collect historical data
Step 3: Calculate baseline statistics
Step 4: Set the anomaly threshold
Step 5: Monitor current data
Step 6: Compare current data to thresholds
Step 7: Identify anomalies
Step 8: Analyze and respond
Step 9: Update the model (for dynamic thresholds)
Example scenario:
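The scenario details aren't spelled out above, so here is a minimal end-to-end sketch that walks through the nine steps for daily website visitors. The 30-day history, the 2-standard-deviation threshold, and every visitor count are illustrative assumptions.

```python
# Minimal end-to-end sketch of the nine steps for daily website visitors.
# The 30-day history, the +/- 2 SD threshold, and all counts are assumptions.
import random
from statistics import mean, stdev

random.seed(7)

# Steps 1-2: the metric is daily unique visitors; collect historical data.
history = [int(random.gauss(10_000, 800)) for _ in range(30)]

# Steps 3-4: baseline statistics and an anomaly threshold of 2 SD.
baseline, spread = mean(history), stdev(history)
low, high = baseline - 2 * spread, baseline + 2 * spread

# Steps 5-7: monitor today's value, compare it to the thresholds, flag anomalies.
todays_visitors = 14_500
is_anomaly = not (low <= todays_visitors <= high)

# Step 8: analyze and respond (here we simply report it).
print(f"baseline={baseline:.0f}, normal range=({low:.0f}, {high:.0f})")
print(f"today={todays_visitors} -> {'ANOMALY' if is_anomaly else 'normal'}")

# Step 9: for dynamic thresholds, fold the new datapoint back into the history
# so tomorrow's thresholds reflect the latest behavior.
history.append(todays_visitors)
```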
This process helps identify unusual patterns in your website traffic, allowing you to respond quickly to potential issues or opportunities.
Temperstack is here to revolutionize your workflow. We've combined the best of static and dynamic approaches into one seamless platform, eliminating the need for manual calculations and constant monitoring.
With Temperstack, you can say goodbye to manual threshold calculations, constant monitoring, and the false alarms and blind spots that come with outdated thresholds.
Our solution comes pre-loaded with expert-crafted thresholds, meticulously developed through extensive research and industry insights. These intelligent settings provide a robust foundation for your anomaly detection needs right out of the box.
But we don't stop there. Temperstack offers the flexibility to fine-tune these thresholds to your organization's unique requirements. In just a few clicks, you can customize your anomaly detection system to perfection.
Ready to transform your approach to anomaly detection? Start your free trial today and experience the Temperstack difference. Streamline your processes, enhance your insights, and focus on what truly matters – growing your business.
Hari is an accomplished engineering leader and innovator with over 15 years of experience across various industries. Currently serving as the cofounder and CTO of Temperstack, Hari has been instrumental in scaling engineering teams, products, and infrastructure to support hyper-growth. Previously, he held Director of Engineering positions at Practo, Dunzo, Zeta, and Aknamed, where he consistently drove innovation and operational excellence.
Samdisha is a skilled technical writer at Temperstack, leveraging her expertise to create clear and comprehensive documentation. In her role, she has been pivotal in developing user manuals, API documentation, and product specifications, contributing significantly to the company's technical communication strategy.
Hari Prashanth K R | Co-Founder & CTO Temperstack