Entity monitoring and eviction
Entity monitoring and eviction are important considerations for the Obcerv platform. This is because large numbers of entities can lead to excessive CPU and memory usage, which may cause instability or unresponsiveness of the UI.
Entities are at the core of the Obcerv platform. Many features rely on being able to access entities, get notified of entity changes, or iterate over all entities in a timely manner. This requires an amount of CPU and memory that is proportional to the number of entities in the system. As such, the number of entities is an important parameter to keep in mind when sizing an Obcerv cluster.
Monitor the number of entities in the system
Once the system is in steady state (past its initialisation phase and after it is connected to various data sources), entities should theoretically be long lived and the number of entities in the system should be stable. You can verify this by either:
- Monitoring the value of the total_entity_count metric over time.
- Periodically checking the logs produced by the final-entity-stream workload. For sample logs, see below:
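For the first option, the stability check can be sketched as follows. This is an illustration only and is not output from Obcerv: it assumes you have already exported (timestamp, value) samples of total_entity_count by some means, and the function names and the threshold of 100 entities per day are hypothetical.

```python
from datetime import datetime, timedelta


def daily_growth(samples):
    """Average number of entities gained per day between the first and last sample.

    `samples` is a list of (timestamp, entity_count) pairs sorted by time.
    How the samples are exported from Obcerv is outside the scope of this sketch.
    """
    if len(samples) < 2:
        raise ValueError("at least two samples are required")
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    days = (t1 - t0).total_seconds() / 86400
    if days <= 0:
        raise ValueError("samples must span a positive time range")
    return (c1 - c0) / days


def is_stable(samples, max_daily_growth=100.0):
    """Return True when growth stays at or below a (hypothetical) daily threshold."""
    return daily_growth(samples) <= max_daily_growth


if __name__ == "__main__":
    start = datetime(2022, 7, 1)
    # Made-up sample values, used only to exercise the functions.
    samples = [(start, 100_000), (start + timedelta(days=10), 101_000)]
    print(daily_growth(samples), is_stable(samples))
```

A steadily positive growth rate over several weeks suggests that entities are accumulating and that eviction, described in the next section, is worth reviewing.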
Entity eviction
Even for a correctly sized system in steady state, the total number of entities will likely grow over time. Even modest growth can lead to excessive resource consumption over a long period. Among other reasons, this can happen because:
- Some Kubernetes pods get restarted, which causes new entities to be monitored while the entities of the previous pods are still present but no longer receive new metrics or logs.
- Some managed objects (such as disks or hosts) get renamed, which also causes new entities to appear while the existing entities remain.
With entity eviction, entities that have not been updated for a long time can be automatically purged from the system. By default, for an entity to get evicted, it must not have received any update (metric, log, event, etc.) for the past 60 days. Note that eviction does not take the entity’s creation date into consideration, only its last update timestamp.
Once an entity is evicted from the system, its metrics are no longer available. Should an evicted entity be re-created at some point (for example, because a monitored system was down for an extended period of time and came back up), that entity and its new metrics will be available again, but metrics collected for that entity prior to its eviction will remain unavailable.
The default configuration applies an expiration of 2 hours to entities marked as ephemeral and 60 days for all other entities.
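The rule described above can be sketched as a small function. This is a minimal illustration of the documented defaults (2 hours for ephemeral entities, 60 days otherwise, based only on the last update timestamp), not the platform's actual implementation. The function and constant names are hypothetical, and treating an entity as evictable exactly at the boundary is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Default expirations as stated in this section.
EPHEMERAL_TTL = timedelta(hours=2)
DEFAULT_TTL = timedelta(days=60)


def is_evictable(last_update, ephemeral, now):
    """Return True if an entity has been inactive for at least its expiration.

    Only the last update timestamp matters; the creation date is ignored.
    The inclusive boundary (>=) is an assumption made for this sketch.
    """
    ttl = EPHEMERAL_TTL if ephemeral else DEFAULT_TTL
    return now - last_update >= ttl


if __name__ == "__main__":
    now = datetime(2022, 7, 21, 13, 0, tzinfo=timezone.utc)
    three_hours_ago = now - timedelta(hours=3)
    # The same inactivity period leads to different outcomes
    # depending on whether the entity is marked as ephemeral.
    print(is_evictable(three_hours_ago, ephemeral=True, now=now))
    print(is_evictable(three_hours_ago, ephemeral=False, now=now))
```

Under these defaults, an entity that was last updated three hours ago is evictable only if it is marked as ephemeral.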
Ephemeral entities
The default eviction scheme, which evicts entities that have not been updated for 60 days, is not sufficient to cope with ephemeral entities. These are typically produced by Geneos when certain plugins are used, for example, TCP Links, TOP, Trapmon, X-traffic, etc. Such plugins produce highly volatile dataview rows (the TOP plugin, for instance, creates one row for each monitored process). Since dataview rows get published by Geneos as Obcerv entities, this can quickly result in hundreds of thousands of entities in Obcerv.
In this scenario, the first remedy is to stop publishing dataviews with volatile rows. To delete the entities already published by such dataviews, the entities need to be marked as ephemeral through the Classification page in the Overview app.
Looking at the sample logs above, you’ll see that a large number of entities (about 75% of the 100,000 entities in the system) are coming from the TCP Links plugin. Once Geneos is configured to stop publishing data from that plugin, you need to create a new classification rule that will decorate all entities coming from the TCP Links dataview with a new attribute called ephemeral. To learn how, see Create a new classification policy.
When such a classification is in place, entities produced by the TCP Links dataview will be considered ephemeral and will get evicted after 2 hours of inactivity. If such entities are no longer being published by Geneos, they are effectively inactive. This will result in the deletion of the 75,601 TCP Links entities within 2 hours of stopping the publishing.
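To put the numbers from this example in perspective, the arithmetic can be worked through as below. The figures 75,601 and 100,000 come from the text above (the total is described as approximate); the function name is hypothetical.

```python
def eviction_impact(total_entities, ephemeral_entities):
    """Return (entities remaining after eviction, ephemeral share of the total)."""
    remaining = total_entities - ephemeral_entities
    share = ephemeral_entities / total_entities
    return remaining, share


if __name__ == "__main__":
    # 75,601 TCP Links entities out of roughly 100,000 entities in total.
    remaining, share = eviction_impact(100_000, 75_601)
    print(remaining, f"{share:.1%}")  # prints: 24399 75.6%
```

In other words, once the TCP Links entities are evicted, roughly a quarter of the original entity count remains.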
Review eviction logs
Evicted entities are logged by platformd:
...
2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService(Component-3) - Evicting entity {pod=entity-containment-rules--1-kh2bg, kind=Pod, namespace=itrs}
2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService(Component-3) - Evicting entity {pod=entity-containment-rules--1-ff2sv, kind=Pod, node=docker-desktop, container=entity-containment-rules, namespace=itrs}
2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService(Component-3) - Evicting entity {pod=entity-containment-rules--1-glmg6, kind=Pod, node=docker-desktop, namespace=itrs}
2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService(Component-3) - Evicting entity {pod=entity-containment-rules--1-r2ls7, kind=Pod, node=docker-desktop, container=entity-containment-rules, namespace=itrs}
2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService(Component-3) - Evicting entity {pod=entity-containment-rules--1-j5jl4, kind=Pod, namespace=itrs}
2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService(Component-3) - Evicting entity {pod=entity-containment-rules--1-z6jt6, kind=Pod, namespace=itrs}
2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService(Component-3) - Evicting entity {pod=entity-containment-rules--1-fss7p, kind=Pod, namespace=itrs}
...
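When many entities are evicted at once, it can help to summarise these log lines rather than read them one by one. The following is a minimal sketch that parses lines in the format shown above and counts evictions by a chosen attribute. The function names are hypothetical, and the parser assumes that attribute values never contain ", " or "}", which holds for the sample lines but is not guaranteed in general.

```python
import re
from collections import Counter

# Matches the "Evicting entity {k=v, k=v, ...}" part of a platformd log line.
EVICTION_RE = re.compile(r"Evicting entity \{(?P<attrs>[^}]*)\}")


def parse_evicted_entity(line):
    """Return the evicted entity's attributes as a dict, or None for other lines."""
    match = EVICTION_RE.search(line)
    if match is None:
        return None
    attrs = {}
    for pair in match.group("attrs").split(", "):
        key, sep, value = pair.partition("=")
        if sep:  # skip fragments that are not key=value pairs
            attrs[key.strip()] = value.strip()
    return attrs


def count_evictions_by(lines, attribute):
    """Count eviction log lines grouped by the value of one entity attribute."""
    counts = Counter()
    for line in lines:
        attrs = parse_evicted_entity(line)
        if attrs is not None:
            counts[attrs.get(attribute, "<missing>")] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO "
        "com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService"
        "(Component-3) - Evicting entity "
        "{pod=entity-containment-rules--1-kh2bg, kind=Pod, namespace=itrs}"
    )
    print(parse_evicted_entity(sample))
    print(count_evictions_by([sample, "..."], "namespace"))
```

Grouping by an attribute such as namespace or kind gives a quick view of which parts of the system the evicted entities came from.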