About the Dynamic Thresholds app
Note
The Dynamic Thresholds app version 1.7.0 requires:
- Web Console version 3.9.0 or later
- ITRS Analytics Platform version 2.18.0 or later
The Dynamic Thresholds app is designed to help you monitor your systems more intelligently and reduce alert fatigue. Traditional static thresholds often generate unnecessary alerts when normal fluctuations occur, overwhelming teams and diverting attention from real issues. In contrast, the Dynamic Thresholds app provides adaptive, data-driven anomaly detection by learning from historical data and automatically adjusting to expected behavior. This ensures that alerts focus on statistically significant deviations, allowing you to respond only to genuine anomalies.
Use the Dynamic Thresholds app to set up and manage dynamic thresholds for various metrics. The app leverages a deviation model that learns the historical behavior of your metrics, automatically establishing and adjusting flexible upper and lower boundaries.
The app also offers:
- Configurable warning and critical severity levels for alert escalation.
- Adjustable training period to fine-tune adaptability to recent patterns.
- Noise reduction features, including metric smoothing and configurable alert delays, to minimize false positives.
From the app’s main screen, you can:
- View a list of all existing dynamic threshold configurations.
- Edit an existing dynamic threshold configuration by clicking on a row.
- See the status of each configuration. With an admin account, you can toggle the Enabled checkbox directly from the table to activate or deactivate a configuration.
- Identify each configuration's name, metric, group, number of entities, creator, and last modified date.
- Search for any string field in the configurations.
Define a new dynamic threshold configuration
Tip
Watch this product demo tour in full screen to learn how to create a dynamic threshold configuration for your metrics.
Dry run results
Once a metric is selected, a chart is automatically generated on the right-hand side of the screen. This chart shows the metrics and thresholds for any matching entity within the last 24 hours and provides a comprehensive view of your metric behavior.
This chart also displays:
- The primary metric value and its statistical spread, providing better insight into data variability and confidence.
- A shaded range for each data sample, representing the statistical distribution of values, including the minimum and maximum bounds and the 25th to 75th percentile range.
- Aggregation of scalar metric values during threshold calculations for more precise results. Click the icon to select your preferred aggregation method:
  - Mean - the average value over the time window.
  - Min - the minimum value in the time window.
  - Max - the maximum value in the time window.
  - Percentile - a configurable percentile (min, p5, p25, p50, p75, p95, max).
Note
Custom aggregations are only available when your configuration meets the following conditions:
- Scalar metrics are used, not histogram metrics or rolling windows.
- Noise reduction is set to None or Alert Delay.
- Use raw values option is disabled.
When these conditions are not met, the aggregation controls are disabled and a tooltip explains what needs to change.
- Raw data when the Use raw values button is enabled. The chart then displays the metric line using raw data instead of downsampled data. Raw values can only be viewed while editing a configuration and do not persist upon saving.
Use this chart to preview the behavior of your dynamic threshold configuration. Changing the selected metric or adjusting the thresholds and training window sliders will automatically update the chart.
Click Analyse Results to simulate the configuration for matching entities. The Dry Run Results table then gives you an overview of how the configuration impacts most entities. You can click on a row to preview the dynamic threshold behavior for a specific entity and adjust your configuration based on these results.
When you adjust your configuration, a banner will appear indicating that the results are no longer valid. Click Refresh Results to show the updated results based on the new configuration.
Import and export configurations
The Dynamic Thresholds app supports importing and exporting configurations in JSON format. This is useful for moving configurations across different environments.
Import
With the Import feature, you can:
- Upload up to 100 JSON files at a time, with a maximum size of 2 MB per file.
- Validate each file to ensure it is valid JSON and a proper threshold configuration.
- Override existing configurations to resolve conflicts.
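For illustration, each import file contains one threshold configuration as a single JSON document. The exact ThresholdConfig schema is defined by the app, so the field names below are hypothetical placeholders only, loosely based on the settings described in this page:

{
  "name": "example-hypothetical-config",
  "metric": "cpu_utilisation",
  "warningDeviation": 2,
  "criticalDeviation": 3,
  "trainingPeriod": "P7D"
}

For the authoritative structure, export an existing configuration from the app and use it as a template.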
When importing files, you may encounter the following status messages:
| Status | Description |
|---|---|
| Valid | The configuration has been successfully validated and is ready for import. |
| Uploaded | The configuration has been successfully uploaded to the server. |
| Conflict | The configuration file is valid, but the filename conflicts with an existing configuration. You can choose to override the existing file. |
| Error | The file could not be imported due to invalid JSON or failed ThresholdConfig validation. |
| Failed | The file upload to the server failed. |
| Oversized | The file exceeds the 2 MB size limit. |
| Skipped | The file was skipped because the maximum of 100 configurations was reached. |
Export
With the Export feature, you can:
- Download individual configuration files or bulk export them into a single JSON file.
- Select multiple JSON files to export by ticking the corresponding checkboxes or clicking Select All.
- Maintain consistent naming, as exported files retain the configuration names used in the app.
Threshold metric staleness
When a dynamic threshold configuration is enabled, the app publishes threshold metrics for every matching entity. These metrics flow through the platform like ordinary source metrics and contribute to entity activity. As a result, threshold metrics can keep entities appearing active in Entity Viewer even when the original source metric has stopped producing data.
Helm settings
Each dynamic threshold configuration defines a refreshInterval that controls how often the data source asks the platform for fresh aggregation data. When that refresh runs, the platform's latest-seen timestamps for the source metrics are applied to the cached threshold metric; this is the point at which the cached metric's timestamp is updated from live source activity.
The Helm value thresholdMetricStaleAfter sets how long the app continues to re-publish a cached threshold metric after the underlying source metric stops sending new data. When that duration is exceeded, the cached threshold metric is evicted and is no longer sent.
thresholdMetricStaleAfter is a signal generator daemon setting defined under the daemon key in the signal generator Helm chart values (for example in values.yaml or a values override file). The chart then renders it into the deployment ConfigMap. These values use ISO 8601 duration format (for example PT5M for 5 minutes, PT1H for 1 hour).
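As a sketch, the defaults in the table below correspond to a daemon block like this in your values file (adjust the surrounding layout to match your chart):

daemon:
  thresholdMetricRefreshInterval: PT30S   # how often cached threshold metrics are re-published and checked for staleness
  thresholdMetricStaleAfter: PT1H         # maximum age before a cached threshold metric is evicted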
| Setting | Default | Role |
|---|---|---|
| daemon.thresholdMetricRefreshInterval | PT30S | How often the daemon re-publishes cached threshold metrics and runs staleness checks. |
| daemon.thresholdMetricStaleAfter | PT1H | Maximum age of a cached threshold metric before it is evicted and no longer re-sent. |
thresholdMetricStaleAfter must be significantly larger than the largest refreshInterval across all threshold configurations. If it is equal to or smaller than a configuration's refreshInterval, a race can occur: the staleness logic may evict the metric and clear its signal just before the next scheduled refresh repopulates the cache, even though the source is still healthy.
Set thresholdMetricStaleAfter to at least three times the largest refreshInterval.
| thresholdMetricStaleAfter | Example refreshInterval | Safe? |
|---|---|---|
| PT1H (default) | 300s (5 minutes) | Yes (ample margin) |
| PT15M | 300s | Yes |
| PT5M | 300s | No (same window; risk of spurious clears) |
| PT10M | 600s (10 minutes) | No (same window) |
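For example, applying the 3× guideline to a largest refreshInterval of 600s (10 minutes) gives PT30M as the minimum safe value:

daemon:
  thresholdMetricStaleAfter: PT30M   # at least 3 x the largest refreshInterval (600s)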
Configuring thresholdMetricStaleAfter
You can configure this setting in three ways:
Note
Replace <namespace> with your ITRS Analytics namespace (often itrs). Replace chart names, repository URLs, and release names with those your organization uses for the signal generator.
Option A: Helm upgrade with --reuse-values
This updates the value for the current release. The override is kept across subsequent helm upgrade --reuse-values runs, but is lost if helm upgrade is run without --reuse-values (for example, during a fresh install or a CI job that supplies its own values file).
helm upgrade --install iax-app-signal-generator itrs-snapshots/iax-app-signal-generator \
--devel --reuse-values -n <namespace> \
--set daemon.thresholdMetricStaleAfter=PT5M
Restart the deployment so the daemon picks up the new ConfigMap:
kubectl rollout restart deployment/iax-app-signal-generator -n <namespace>
kubectl rollout status deployment/iax-app-signal-generator -n <namespace>
Option B: Direct ConfigMap edit
Use this for a fast change without going through the chart, for example when testing or debugging.
Warning
Editing the ConfigMap directly is temporary. The next helm upgrade usually rebuilds the ConfigMap from values.yaml and overwrites your changes.
kubectl edit configmap iax-app-signal-generator -n <namespace>
Find the thresholdMetricStaleAfter entry under grpcServices and set the value, for example:
thresholdMetricStaleAfter: PT5M
Save the file, then restart the deployment:
kubectl rollout restart deployment/iax-app-signal-generator -n <namespace>
kubectl rollout status deployment/iax-app-signal-generator -n <namespace>
Option C: Values override file
This is the recommended method, as it persists all changes across future Helm upgrades.
Add the daemon block to your deployment's values override file or to the chart's values.yaml in source control:
daemon:
thresholdMetricStaleAfter: PT5M
Deploy with your override file:
helm upgrade --install iax-app-signal-generator itrs-snapshots/iax-app-signal-generator \
--devel -n <namespace> -f custom-values.yaml
Restart the deployment if the chart does not roll pods automatically.
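If a manual restart is needed, reuse the commands from Option A:

kubectl rollout restart deployment/iax-app-signal-generator -n <namespace>
kubectl rollout status deployment/iax-app-signal-generator -n <namespace>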
Verifying the setting
Check the value stored on the release by running:
helm get values iax-app-signal-generator -n <namespace> -a | grep "thresholdMetricStaleAfter"
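You can also confirm that the chart rendered the value into the ConfigMap (assuming the ConfigMap name shown in Option B):

kubectl get configmap iax-app-signal-generator -n <namespace> -o yaml \
  | grep thresholdMetricStaleAfter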
Confirm that the daemon started with the expected configuration by running:
kubectl logs deployment/iax-app-signal-generator -n <namespace> \
| grep -i "staleAfter\|Resending thresholds"
Verifying that eviction is working
Tail the logs and watch the Send batch lines. Batch sizes should shrink, and the batches should eventually stop once stale metrics are evicted.
kubectl logs -f deployment/iax-app-signal-generator -n <namespace> \
| grep -E "Send batch|Evicted stale"
With DEBUG logging enabled, you should also see Evicted stale threshold metric for each evicted entity or metric. To raise the log level via Helm:
helm upgrade --install iax-app-signal-generator itrs-snapshots/iax-app-signal-generator \
--devel --reuse-values -n <namespace> \
--set loglevel.app=DEBUG
Signals when a threshold metric is evicted
When a cached threshold metric is evicted because thresholdMetricStaleAfter has been exceeded, the app publishes a clear so the dynamic threshold signal moves to severity NONE, with a message indicating that the signal was cleared due to source metric staleness. You do not need to wait for entity age-out alone to drop a stale severity.
Note
After the app clears a signal to NONE, the visible signal updates immediately. In some cases, when the source metric resumes and values are within thresholds, the Data Pipeline Daemon may not immediately re-publish an OK state, and the signal can remain at NONE until the metric next crosses a threshold (for example, Warning or Critical).
Troubleshooting
- If an entity stays active longer than expected: confirm that threshold metrics are actually being evicted by checking the staleness logs and observing that batch sizes are decreasing. Also check entity attributes such as inactivity timestamps in Entity Viewer, and verify that no other data is still being published for that entity.
- If signals clear spuriously while the source is healthy: review thresholdMetricStaleAfter against the largest refreshInterval, then increase the Helm value or align refresh intervals so the 3× margin holds.
- For deeper analysis: your platform team can enable DEBUG logging on the app to trace staleness eviction lines in the logs.