Internal documentation only
This page has been marked as draft.
What does the Risk tab mean?
Each metric on each entity in the model is given a ‘risk’ score. Broadly, this is the probability of a capacity issue occurring should demand increase. Machines that are at capacity, that spend a lot of time at capacity, or that run close to capacity with enough variance in the timeseries that a small change would cause threshold breaches are given higher risk scores. Timeseries that are not trending upwards but show a lot of variance at values close to capacity often have higher scores. The tab is therefore useful for finding VMs that are behaving unusually without necessarily showing a definite growth trend. Risk is a score between 0 and 1, with 1 indicating that the machine is at capacity for significant periods of time.
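To make the intuition concrete, here is a minimal sketch of a score with the same broad properties: it lies between 0 and 1, it reaches 1 for a machine pinned at capacity, and it rises when a timeseries fluctuates close to capacity even without a growth trend. It is not the algorithm the product uses. The function name `illustrative_risk`, the `demand_uplift` parameter and the sample values are all assumptions made for illustration.

```python
from typing import Sequence


def illustrative_risk(
    utilisation: Sequence[float],
    capacity: Sequence[float],
    demand_uplift: float = 0.1,
) -> float:
    """Fraction of samples that would reach capacity if demand rose by demand_uplift.

    Hypothetical heuristic for illustration only; it is NOT the product's risk model.
    Capacity is passed per sample because capacity can change over time.
    """
    if len(utilisation) != len(capacity):
        raise ValueError("utilisation and capacity must have the same length")
    if not utilisation:
        return 0.0
    breaches = sum(
        1
        for used, cap in zip(utilisation, capacity)
        if used * (1.0 + demand_uplift) >= cap
    )
    return breaches / len(utilisation)


cap = [100.0] * 8  # invented capacity values

# Flat series well below capacity: no sample would breach.
flat = illustrative_risk([89.0] * 8, cap)

# No upward trend, but large swings close to capacity: half the samples would breach.
noisy = illustrative_risk([80.0, 95.0, 85.0, 98.0, 82.0, 96.0, 84.0, 97.0], cap)

# Pinned at capacity for the whole period: every sample breaches.
pinned = illustrative_risk([100.0] * 8, cap)

print(flat, noisy, pinned)  # 0.0 0.5 1.0
```

The flat and noisy series have almost the same average utilisation, yet only the noisy one scores above zero, which mirrors the point that variance close to capacity raises risk even without growth.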
Risk scores are combined through the hierarchy of the sunburst. As you hover over a segment, you will see the combined risk score for all entities below that point in the sunburst. When you double-click a segment of the sunburst, the Risk tab changes accordingly. You can expand each node of the tree in the risk panel to drill down to the leaf entity and its corresponding metric. At each level, the ordering is based on risk score.
In the above example, the user wants to determine the machine and metric with the highest risk value in Cluster C47. By opening each level of the hierarchy, it can be seen that CPU on VM0510 has a risk score of 0.867. A star beside a value indicates that this is an aggregated score, determined by combining the risk scores of all entities under that point in the hierarchy. Hovering over the metric presents options to view the timeseries. Here is the CPU utilisation for the metric above.
There are a couple of issues with this metric that contribute to the high risk score. Firstly, the capacity of the machine is changing significantly and often. Secondly, the machine regularly spends quite some time at 100% CPU. As capacity decreases, it is possible that CPU will stay at 100% for longer. These are the kinds of features of a metric timeseries that can contribute to a high risk score.
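The drill-down described above can be sketched as a small tree walk. This page does not say how child scores are combined into an aggregated (starred) score, so the sketch takes the combining rule as a parameter and uses `max` purely as a placeholder assumption. The `Node` class and the function names are hypothetical, and every value other than the 0.867 for CPU on VM0510 is invented.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class Node:
    name: str
    risk: Optional[float] = None  # set only on leaf metrics
    children: List["Node"] = field(default_factory=list)


def combined_risk(
    node: Node, combine: Callable[[Sequence[float]], float] = max
) -> float:
    """Aggregated score for a node; `max` is an assumed stand-in for the real rule."""
    if not node.children:
        if node.risk is None:
            raise ValueError(f"leaf {node.name!r} has no risk score")
        return node.risk
    return combine([combined_risk(child, combine) for child in node.children])


def riskiest_path(
    node: Node, combine: Callable[[Sequence[float]], float] = max
) -> List[str]:
    """Follow the highest-scoring child at each level down to a leaf metric."""
    path = [node.name]
    while node.children:
        node = max(node.children, key=lambda child: combined_risk(child, combine))
        path.append(node.name)
    return path


# Only the 0.867 score comes from the example above; the rest is invented.
cluster = Node("Cluster C47", children=[
    Node("VM0510", children=[Node("CPU", risk=0.867), Node("Memory", risk=0.412)]),
    Node("VM0511", children=[Node("CPU", risk=0.305), Node("Memory", risk=0.120)]),
])

print(combined_risk(cluster))   # 0.867 under the assumed max rule
print(riskiest_path(cluster))   # ['Cluster C47', 'VM0510', 'CPU']
```

Swapping `combine` for a different function changes the aggregated scores without changing the drill-down logic, which keeps the unknown combining rule isolated in one place.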
The other way of viewing risk is to list all entities below the current point in the hierarchy, ordered by risk. In this example we can see that CPU on VM1891 has the highest risk score.
The timeseries for this metric looks like:
Again, the capacity is changing frequently, and the CPU behavioural pattern has changed significantly: it is now flattening out very close to capacity.
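The flat view, in which every entity and metric is ranked by risk regardless of its position in the hierarchy, amounts to flattening the scores and sorting them in descending order. The sketch below assumes the scores are available as a nested mapping of entity to metric to score. The function name `flat_risk_ranking` and all of the numbers are invented for illustration, since this page does not give the actual values.

```python
from typing import Dict, List, Tuple


def flat_risk_ranking(
    scores: Dict[str, Dict[str, float]]
) -> List[Tuple[str, str, float]]:
    """Return (entity, metric, risk) rows ordered from highest to lowest risk."""
    rows = [
        (entity, metric, risk)
        for entity, metrics in scores.items()
        for metric, risk in metrics.items()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Invented scores; only the entity names VM1891 and VM0510 appear on this page.
example_scores = {
    "VM0510": {"CPU": 0.70, "Memory": 0.41},
    "VM1891": {"CPU": 0.90, "Memory": 0.22},
}

ranking = flat_risk_ranking(example_scores)
for entity, metric, risk in ranking:
    print(f"{entity:8} {metric:8} {risk:.3f}")
```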
For further information about the risk score, see here.