Entity monitoring and eviction
Entity monitoring and eviction are important considerations for ITRS Analytics. This is because large numbers of entities can lead to excessive CPU and memory usage, which may cause instability or unresponsiveness of the UI.
Entities are at the core of ITRS Analytics. Many features rely on being able to access entities, get notified of entity changes, or iterate over all entities in a timely manner. This requires an amount of CPU and memory that is proportional to the number of entities in the system. As such, the number of entities is an important parameter to keep in mind when sizing an ITRS Analytics cluster.
Monitor the number of entities in the system
Once the system is in steady state (past its initialisation phase and after it is connected to various data sources), entities should theoretically be long-lived and the number of entities in the system should be stable. You can verify this by either:
- Monitoring the value of the total_entity_count metric over time.
- Periodically checking the logs produced by the final-entity-stream workload. For sample logs, see below:
Entity eviction
Even in a correctly sized system in steady state, the total number of entities will likely grow over time. Over a long period, even modest growth can lead to excessive resource consumption. Among other reasons, this growth may happen because:
- Some Kubernetes pods get restarted, which causes new entities to be monitored while the entities of the previous pods are still present but no longer receive new metrics or logs.
- Some managed objects (such as disks or hosts) get renamed, which also causes new entities to appear while the existing entities remain.
With entity eviction, entities that have not been updated for a long time can be automatically purged. There are two types of rules that govern eviction:
Fixed eviction rules
These rules evict entities after a specific period of inactivity. For example, evict any entity that has been inactive for more than 45 days.
Lifespan eviction rules
These rules calculate the eviction time based on a percentage (called the time-to-live or ttl value) of an entity’s active lifetime (the active lifetime being the time between the entity first being observed and the last time an update was received). A minimum value is used to ensure that entities are not evicted too quickly and a maximum value is used to ensure that inactive entities are not kept for too long.
The formula used to calculate eviction time is (active lifespan × ttl) + minimum. For example, if an entity was active for 20 hours, with a ttl set to 25% and a minimum time of 4 hours, the eviction time would be (20 × 0.25) + 4, which is 9 hours.
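The lifespan calculation can be sketched in a few lines of Python. This is only an illustration of the formula, not the product's implementation: the function and parameter names are hypothetical, and treating the maximum value as a simple upper cap on the result is an assumption, since the text does not spell out how the maximum is applied.

```python
from datetime import timedelta
from typing import Optional


def lifespan_eviction_time(
    active_lifespan: timedelta,
    ttl: float,
    minimum: timedelta,
    maximum: Optional[timedelta] = None,
) -> timedelta:
    """Return how long an entity may stay inactive before it is evicted.

    Implements (active lifespan x ttl) + minimum.
    ASSUMPTION: when a maximum is given, the result is capped at it.
    """
    eviction = active_lifespan * ttl + minimum
    if maximum is not None:
        eviction = min(eviction, maximum)
    return eviction


# Worked example from the text: 20 hours active, ttl of 25%, minimum of 4 hours.
result = lifespan_eviction_time(timedelta(hours=20), 0.25, timedelta(hours=4))
print(result)  # 9:00:00
```

With these inputs, an entity that stops receiving updates would become eligible for eviction after 9 hours of inactivity.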
Rule conditions (optional)
Rules can include conditions that restrict which entities they apply to. Only entities meeting these conditions will be subject to that rule.
Rule evaluation order
Where multiple eviction rules are defined, each rule will be evaluated in order. If an entity matches the conditions of a rule, then that rule will be applied and evaluation will stop. If there are non-conditional rules, then all those rules will be evaluated and the one that produces the shortest eviction time will be applied.
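One possible reading of this evaluation order is sketched below in Python. It is an interpretation, not the documented algorithm: it assumes that conditional rules are checked first in their defined order, and that rules without conditions are only considered when no conditional rule matches. The EvictionRule class, the select_eviction_time function, and the example rules are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Callable, Dict, List, Optional

Entity = Dict[str, str]  # entity attributes, for example {"kind": "Pod"}


@dataclass
class EvictionRule:
    # Maps an entity's active lifespan to its eviction time.
    eviction_time: Callable[[timedelta], timedelta]
    # Optional predicate; None means the rule applies to every entity.
    condition: Optional[Callable[[Entity], bool]] = None


def select_eviction_time(
    entity: Entity, active_lifespan: timedelta, rules: List[EvictionRule]
) -> Optional[timedelta]:
    # ASSUMPTION: the first conditional rule whose condition matches wins.
    for rule in rules:
        if rule.condition is not None and rule.condition(entity):
            return rule.eviction_time(active_lifespan)
    # Otherwise every unconditional rule is evaluated and the shortest
    # eviction time is applied.
    candidates = [
        rule.eviction_time(active_lifespan)
        for rule in rules
        if rule.condition is None
    ]
    return min(candidates) if candidates else None


# Hypothetical rule set: one conditional rule and two unconditional rules.
pod_rule = EvictionRule(
    lambda active: timedelta(days=1), lambda e: e.get("kind") == "Pod"
)
fixed_rule = EvictionRule(lambda active: timedelta(days=45))
lifespan_rule = EvictionRule(lambda active: active * 0.25 + timedelta(hours=4))
rules = [pod_rule, fixed_rule, lifespan_rule]

print(select_eviction_time({"kind": "Pod"}, timedelta(hours=20), rules))
print(select_eviction_time({"kind": "Host"}, timedelta(hours=20), rules))
```

Under these assumptions, the Pod entity is handled by the conditional rule alone, while the Host entity falls through to the unconditional rules and receives the shorter of the fixed and lifespan results.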
Once an entity is evicted from the system, its metrics are no longer available. If an evicted entity is later re-created (for example, because a monitored system was down for an extended period and then came back up), the entity and its new metrics become available again, but the metrics collected before the eviction remain unavailable.
Ephemeral entities
Ephemeral entities are typically produced by Geneos when certain plugins are used, for example, TCP Links, TOP, Trapmon, and X-traffic. Such plugins produce highly volatile dataview rows (the TOP plugin, for instance, creates one row for each monitored process). Since Geneos publishes dataview rows as ITRS Analytics entities, this can quickly result in hundreds of thousands of entities in ITRS Analytics.
In this scenario, the first remedy is to stop publishing dataviews with volatile rows. To delete the entities already published by such dataviews, mark them as ephemeral through the Classification page in the Entity Viewer app.
Looking at the sample logs above, you’ll see that a large number of entities (about 75% of the 100,000 entities in the system) are coming from the TCP Links plugin. Once Geneos is configured to stop publishing data from that plugin, you need to create a new classification rule that will decorate all entities coming from the TCP Links dataview with a new attribute called ephemeral. To learn how, see Create a new classification policy.
When such a classification is in place, entities produced by the TCP Links dataview will be considered ephemeral and will get evicted after 1 hour of inactivity. If such entities are no longer being published by Geneos, they are effectively inactive. This will result in the deletion of the 75,601 TCP Links entities within 1 hour of stopping the publishing.
Review eviction logs
Entities getting evicted are logged by platformd:
...
2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService(Component-3) - Evicting entity {pod=entity-containment-rules--1-kh2bg, kind=Pod, namespace=itrs}
2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService(Component-3) - Evicting entity {pod=entity-containment-rules--1-ff2sv, kind=Pod, node=docker-desktop, container=entity-containment-rules, namespace=itrs}
2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService(Component-3) - Evicting entity {pod=entity-containment-rules--1-glmg6, kind=Pod, node=docker-desktop, namespace=itrs}
2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService(Component-3) - Evicting entity {pod=entity-containment-rules--1-r2ls7, kind=Pod, node=docker-desktop, container=entity-containment-rules, namespace=itrs}
2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService(Component-3) - Evicting entity {pod=entity-containment-rules--1-j5jl4, kind=Pod, namespace=itrs}
2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService(Component-3) - Evicting entity {pod=entity-containment-rules--1-z6jt6, kind=Pod, namespace=itrs}
2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService(Component-3) - Evicting entity {pod=entity-containment-rules--1-fss7p, kind=Pod, namespace=itrs}
...
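When reviewing a large number of these lines, it can help to summarise them. The following Python sketch extracts the attributes of each evicted entity and counts evictions by an attribute such as kind. It relies only on the line format shown in the sample above and assumes that attribute values contain neither ", " nor "}"; the function names are hypothetical and are not part of any ITRS tooling.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

# Matches the attribute list of a "Evicting entity {...}" log message.
EVICTION_RE = re.compile(r"Evicting entity \{(?P<attrs>[^}]*)\}")


def parse_evictions(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return one attribute dictionary per eviction log line."""
    evicted = []
    for line in lines:
        match = EVICTION_RE.search(line)
        if match is None:
            continue  # not an eviction message
        pairs = match.group("attrs").split(", ")
        evicted.append(dict(p.split("=", 1) for p in pairs if "=" in p))
    return evicted


def count_by(evicted: List[Dict[str, str]], key: str) -> Counter:
    """Count evicted entities by the value of one attribute."""
    return Counter(entity.get(key, "<none>") for entity in evicted)


# Two lines taken from the sample log above.
sample = [
    "2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService(Component-3) - Evicting entity {pod=entity-containment-rules--1-kh2bg, kind=Pod, namespace=itrs}",
    "2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService(Component-3) - Evicting entity {pod=entity-containment-rules--1-ff2sv, kind=Pod, node=docker-desktop, container=entity-containment-rules, namespace=itrs}",
]
evicted = parse_evictions(sample)
print(count_by(evicted, "kind"))  # Counter({'Pod': 2})
```

In practice, the lines would be read from wherever the platformd logs are collected in your environment rather than from an inline list.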