Introduction
The alert definition feature allows you to set alerts on a metric using a PromQL query.
The alert definition can currently be defined at the client level.
Note: You need the following permissions to create and manage alert definitions:
- Administrator permission
- Manage Alerts
This functionality is enabled through a feature flag. Contact OpsRamp Support for assistance.
Create an Alert Definition
Follow these steps to create an alert definition:
Navigate to Setup > Account. The ACCOUNT DETAILS screen is displayed.
Click the Monitoring tile.
Go to the Alert Definitions tab and select METRIC BASED.
Click +ADD.
Enter the following information in the DEFINITION DETAILS screen:
- Name: Provide a unique name for the alert definition.
- Alert Type: Select Metric or Trace as the alert type.
Note: If Trace is selected, the alert type is Trace even though you provide a metric query.
- Metric Query: Build a valid PromQL query using a metric. Use the filters and operations for the query as needed.
Examples:
system_cpu_utilization
synthetic_response_time{name="1 HTTP"}
avg_over_time(alert_escalation_policy_count2{groups=~".*,device_group,.*"}[5m])
See PromQL for more information.
You can change the time frame using the calendar icon.
The query result (time series) is displayed as a graph.
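The example queries above can also be assembled programmatically. The following is a minimal Python sketch, not part of the product: the helper names `selector` and `avg_over_time` are hypothetical, and the sketch only builds PromQL text in the same shape as the examples.

```python
def _quote(value):
    # Escape backslashes and double quotes so the value is a valid PromQL string.
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def selector(metric, equals=None, regex=None):
    """Build a PromQL selector with exact (=) and regex (=~) label matchers."""
    parts = [f"{key}={_quote(val)}" for key, val in (equals or {}).items()]
    parts += [f"{key}=~{_quote(val)}" for key, val in (regex or {}).items()]
    if not parts:
        return metric
    return metric + "{" + ",".join(parts) + "}"


def avg_over_time(inner, window):
    """Wrap a selector in avg_over_time over a range such as '5m'."""
    return f"avg_over_time({inner}[{window}])"


print(selector("system_cpu_utilization"))
print(selector("synthetic_response_time", equals={"name": "1 HTTP"}))
print(avg_over_time(
    selector("alert_escalation_policy_count2", regex={"groups": ".*,device_group,.*"}),
    "5m",
))
```

The three printed strings match the three example queries listed above.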
- Critical Threshold: Enter a critical threshold value. Enter a number or a range.
Examples: <3, >1, 2-5, 10-15
- Warning Threshold: Enter a warning threshold value. Enter a number or a range.
Examples: <3, >1, 2-5, 10-15
Note: You can set both critical and warning thresholds, or only one of them, based on your requirements.
- Trigger alert if conditions persist for: To avoid alerting on anomalous spikes, you can require the metric value to breach a threshold continuously for a period of time before an alert is triggered.
The default time is 1 minute.
Example: The above screenshot shows the latest data point as 53.2. The alert is triggered only if the metric value stays above the threshold continuously for 1 minute.
- Set the critical threshold as 50 and the warning threshold as 40. If the metric value reached 80 and came back to 45, then a warning alert will be triggered.
- If there is no data: If no data is coming in, choose one of the following options:
- Do not trigger alert - No alert is triggered if no data comes in.
- Trigger critical alert - A critical alert is triggered if no data comes in.
- Trigger warning alert - A warning alert is triggered if no data comes in.
- Dynamic Change Detection: Provide the information in the fields. Example: Trigger alert when an increase of more than 5 standard deviations away from the mean is detected.
- Forecast: Provide the information in the fields. Limit: This feature monitors specific metrics and triggers an alert when the projected value of the metric is forecasted to exceed a predefined limit within a specified forecast period.
- Subject: Enter a subject for the alert.
Note: Enter $ to add tokens.
Example: The alert is on the resource with host name: host
- Description: Enter an alert description.
Note: Enter $ to add tokens.
Example: The alert is on the resource with IP: IP
These tokens are displayed only after you provide a metric or a query in the Metric Query field.
- Entity Type: Select either Resource or Client. An alert can be raised on a specific resource, such as a server, or at the client level.
Note: For Dynamic Change Detection, you can select only Resource as the Entity Type.
- Component: Select a component. The component is used to identify the alert.
- Resource Attributes: Define resource attributes for the alert. These attributes are added to the alert.
Note: Resource attributes can be defined only for the Resource entity type.
- Select the attribute key and the attribute value from the dropdown boxes. These attributes can be seen in the alert details.
Note: You can select a maximum of 4 attributes: host, name, UUID, and IP.
If you select the attribute value as $name, the value of name is read from the metric and displayed in the alert details screen.
- Labels: Assign a value to a label. This is reflected in the alert details screen.
- Enter the name of the label in the Name box.
- Enter the value of the label in the Value box.
Example: If name is id and value is 10, then it will be set as id is 10.
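To illustrate how $ tokens in the Subject, Description, and Resource Attributes fields could resolve against metric labels, here is a minimal Python sketch. It is not the product's implementation: it assumes tokens take the form `$host` or `$name` (as the `$name` attribute example suggests), and the label values and the `render` helper are hypothetical.

```python
from string import Template


def render(text, metric_labels):
    """Replace $tokens with label values; unknown tokens are left unchanged."""
    return Template(text).safe_substitute(metric_labels)


# Hypothetical labels carried by one time series of the metric query.
labels = {"host": "web-01", "ip": "10.0.0.5", "name": "1 HTTP"}

subject = render("The alert is on the resource with host name: $host", labels)
description = render("The alert is on the resource with IP: $ip", labels)
attribute = render("$name", labels)

print(subject)
print(description)
print(attribute)
```

A token with no matching label stays in the text as typed, which makes a misspelled token easy to spot in the rendered alert.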
Alert Conditions
- Static Threshold:
The Static Threshold feature allows you to set thresholds for a metric value. You can also set conditions based on which the alerts are triggered.
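The static threshold behavior, including the persistence condition, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the product's logic: a bare number such as 50 is assumed to mean "greater than 50", a range such as 2-5 is assumed to be inclusive, and the function names and the sample data are hypothetical.

```python
import re


def parse_threshold(expr):
    """Turn a threshold string such as '<3', '>1', '2-5' or '50' into a predicate."""
    text = expr.strip()
    match = re.fullmatch(r"([<>])\s*(\d+(?:\.\d+)?)", text)
    if match:
        bound = float(match.group(2))
        if match.group(1) == "<":
            return lambda value: value < bound
        return lambda value: value > bound
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)", text)
    if match:
        low, high = float(match.group(1)), float(match.group(2))
        return lambda value: low <= value <= high  # assumption: inclusive range
    bound = float(text)  # assumption: a bare number means "greater than"
    return lambda value: value > bound


def evaluate(samples, critical=None, warning=None, persist_seconds=60):
    """Return 'critical', 'warning' or None for (timestamp, value) samples.

    A severity applies only if its threshold has been breached by every
    sample in an unbroken run that ends at the latest sample and spans
    at least persist_seconds.
    """
    if not samples:
        return None
    latest_ts = samples[-1][0]
    for severity, expr in (("critical", critical), ("warning", warning)):
        if expr is None:
            continue
        breached = parse_threshold(expr)
        run_start = None
        for ts, value in reversed(samples):
            if not breached(value):
                break
            run_start = ts
        if run_start is not None and latest_ts - run_start >= persist_seconds:
            return severity
    return None


# Critical 50, warning 40: the value reached 80 and came back to 45.
samples = [(0, 80), (30, 80), (60, 45), (90, 45), (120, 45)]
print(evaluate(samples, critical="50", warning="40"))
```

With the value back at 45, the critical run is broken while the warning threshold has been breached continuously for more than a minute, so the sketch reports a warning.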
Dynamic Change Detection: The Dynamic Change Detection feature allows you to set conditions to trigger alerts. You can evaluate the data over a learning period, which can be specified in either hours or days.
Evaluate the data over a learning period of the last 4 HOURS (the value must be between 1 and 8 hours).
How it works: The system looks at the last 4 hours of the time series (the default, adjustable between 1 and 8 hours) and checks whether the metric value has deviated from the mean. It triggers an alert when an increase of more than 5 standard deviations away from the mean is detected.
The trigger condition has only a minutes option, ranging from 2 minutes to 60 minutes, with a default of 5 minutes.
- Note: Operations are not supported while building a query.
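The deviation check described under Dynamic Change Detection can be sketched in Python as follows. This is a minimal illustration, assuming the mean and the population standard deviation are computed over the learning-period samples; the function name and the sample values are hypothetical, and the product's exact statistics may differ.

```python
from statistics import mean, pstdev


def is_anomalous_increase(history, latest, num_std=5.0):
    """True if latest exceeds the learning-period mean by more than num_std deviations."""
    if len(history) < 2:
        return False  # not enough data to estimate a spread
    mu = mean(history)
    sigma = pstdev(history)  # assumption: population standard deviation
    return latest > mu + num_std * sigma


# Hypothetical values observed during the learning period.
history = [10, 12, 11, 9, 10, 11, 10, 12]

print(is_anomalous_increase(history, 16))
print(is_anomalous_increase(history, 15))
```

Only increases are flagged here, matching the "increase of more than 5 standard deviations" wording; a drop below the mean is ignored by this sketch.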
Forecast: Forecasting refers to predicting or estimating potential issues or events that might trigger an alert. It involves analyzing historical data, patterns, and trends to anticipate situations that could lead to issues or other predefined conditions.
The limit is expressed in the metric's unit, and predefined limits are set for each metric to determine the acceptable range. For example, CPU usage from 1% to 100%, disk space from 1KB to 100GB, network speed from 1Bps to 1Gbps.
Critical Threshold: Enter a critical threshold value. Enter a number.
Example: 3 days
Warning Threshold: Enter a warning threshold value. Enter a number.
Example: 5 days
How it works: The system predicts when the specified limit is about to be reached and triggers an alert based on the time frame specified in the critical or warning threshold.
Forecasting runs once daily, starting from the creation of the alert definition.
- Note: Operations are not supported while building a query.
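The forecast condition can be illustrated with a small Python sketch. The forecasting model used by the product is not described here, so this example assumes a simple least-squares linear trend over one value per day; the function names and the sample data are hypothetical.

```python
def days_until_limit(daily_values, limit):
    """Estimate days from the latest sample until a linear trend reaches limit.

    Returns None when there are too few samples or the trend is not rising.
    """
    n = len(daily_values)
    if n < 2:
        return None
    x_mean = (n - 1) / 2
    y_mean = sum(daily_values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(daily_values))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    crossing_day = (limit - intercept) / slope
    return max(0.0, crossing_day - (n - 1))


def forecast_severity(daily_values, limit, critical_days=3, warning_days=5):
    """Map the projected time to the limit onto the critical and warning thresholds."""
    remaining = days_until_limit(daily_values, limit)
    if remaining is None:
        return None
    if remaining <= critical_days:
        return "critical"
    if remaining <= warning_days:
        return "warning"
    return None


# Hypothetical disk usage in percent, one sample per day.
usage = [50, 60, 70, 80]

print(days_until_limit(usage, 100))
print(forecast_severity(usage, 100))
print(forecast_severity(usage, 120))
```

In this data the usage grows by 10 per day, so a limit of 100 is projected to be reached in 2 days (within the 3-day critical threshold) and a limit of 120 in 4 days (within the 5-day warning threshold).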
Notification Format
The Subject and Description entered here are reflected in the alert details screen.
Alert Identification
The alert identification section defines the scope of the alert.
- Click Save. The alert definition is saved successfully.
You can enable or disable an alert definition from the Alert Definitions listing screen.
Alert limitation rule
An incorrectly configured alert definition at the client level can generate a large number of alerts, which may impact alert processing. The following rules limit alert volume:
When the number of alerts generated for a specific alert definition exceeds 1,700 alerts within the last 1 hour, the system will:
- Trigger a Warning Alert.
- Send a notification to the user, alerting them about the high volume of alerts associated with a single alert definition.
When the number of alerts continues to increase and breaches 2,000 alerts within the last 1 hour, the system will:
- Trigger a Critical Alert to notify the user about the threshold breach.
- Automatically disable the alert definition to prevent further alert generation.
- Generate a Failure log with detailed information on the alert definition and associated metrics.
Note:
- The above rules are applicable for all alert conditions: Static Threshold, Dynamic Change Detection, and Forecast.
- The alerts (warning or critical) generated for the breach must be self-healed or suppressed.
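The alert limitation rule can be modeled as a sliding one-hour counter. The Python sketch below is an illustration only: the class name and return values are hypothetical, and both limits are assumed to be crossed when the count becomes strictly greater than the stated number.

```python
from collections import deque

WARNING_LIMIT = 1700    # alerts in the last hour before a warning alert
CRITICAL_LIMIT = 2000   # alerts in the last hour before the definition is disabled
WINDOW_SECONDS = 3600


class AlertVolumeGuard:
    """Tracks alerts generated by one alert definition over the last hour."""

    def __init__(self):
        self.timestamps = deque()
        self.enabled = True

    def record(self, ts):
        """Record an alert at time ts (seconds) and return the resulting state."""
        if not self.enabled:
            return "disabled"
        self.timestamps.append(ts)
        cutoff = ts - WINDOW_SECONDS
        # Drop alerts that fell out of the one-hour window.
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        count = len(self.timestamps)
        if count > CRITICAL_LIMIT:
            self.enabled = False  # stop further alert generation
            return "critical"
        if count > WARNING_LIMIT:
            return "warning"
        return "ok"


guard = AlertVolumeGuard()
# One alert per second: 2,002 alerts, all inside a single hour.
states = [guard.record(second) for second in range(2002)]
print(states[1699], states[1700], states[2000], states[2001])
```

For simplicity the sketch reports "warning" on every alert between the two limits; a real system would raise the warning alert once and send a single notification.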
Actions on an alert definition
Below are the actions you can perform on an alert definition.
Action | Description |
---|---|
Search | To search for an alert definition: |
Filter | Filter alert definitions based on Entity Type and Status: |
View and Edit | To view an alert definition: To edit an alert definition: |
View Failure Logs | To view failure logs: |
Remove | To remove an alert definition: |