ITRS Analytics deployment planning and resiliency
This guide explains how ITRS Analytics achieves resiliency through high availability, replication, and disaster recovery capabilities. Understanding these concepts helps you design a deployment that meets your organization’s uptime and compliance requirements.
Overview
ITRS Analytics is built on a Kubernetes-native architecture designed for continuous high availability. By running redundant services across the cluster with intelligent load balancing and automated failover, the platform ensures uninterrupted access to your observability data, even during component or node failures. This design not only meets stringent compliance and uptime requirements but also keeps your monitoring and alerting workflows running without manual intervention during unexpected infrastructure issues.
Key resiliency concepts
When planning your ITRS Analytics deployment, three fundamental concepts work together to provide different levels of protection.
High availability (HA)
High availability ensures that your observability platform continues to operate without interruption, even if individual components fail. This is achieved by deploying redundant services, load balancers, and failover mechanisms so that if one instance becomes unavailable, another seamlessly takes over.
Key characteristics:
- Focuses on minimizing downtime within the same site or region.
- Network latency between nodes in a BYOC or Embedded Cluster deployment must not exceed 10 ms.
- As a result, architectures that span data centers or availability zones where inter-node latency exceeds 10 ms are not supported.
- Supports deployments across subnets and availability zones within a region, provided the latency limit is met.
Note
While HA configurations can deploy across multiple availability zones within a region, all nodes must maintain network latency below 10 ms to ensure proper cluster operation.
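The latency requirement above can be checked before committing to a node layout. The following is a minimal sketch, assuming you have already collected round-trip latency between each pair of candidate nodes with your own tooling; the node names and measurements are hypothetical, and treating exactly 10 ms as a violation is a conservative assumption.

```python
MAX_LATENCY_MS = 10.0  # limit stated in this guide


def links_over_limit(latency_ms, limit_ms=MAX_LATENCY_MS):
    """Return the node pairs whose measured latency is at or above the limit.

    A reading of exactly the limit is treated as a violation (assumption).
    """
    return sorted(pair for pair, ms in latency_ms.items() if ms >= limit_ms)


# Hypothetical round-trip measurements between candidate nodes, in milliseconds.
measured = {
    ("node-a", "node-b"): 0.8,
    ("node-a", "node-c"): 2.1,
    ("node-b", "node-c"): 14.5,
}

for first, second in links_over_limit(measured):
    print(f"{first} <-> {second} exceeds the {MAX_LATENCY_MS} ms limit")
```

Any pair reported here indicates that the two nodes should not be part of the same cluster.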
Replication
Replication safeguards critical data and configuration by creating and maintaining multiple synchronized copies in real time or near real time. In an observability platform, replication ensures metrics, logs, traces, and configuration changes are preserved if the entire platform becomes unavailable for a short time, reducing the risk of losing access to the primary data source.
Key advantages include:
- Protects against short-term regional outages.
- Maintains data integrity across multiple sites.
- Enables faster recovery compared to backup restoration.
- Supports both hot-hot and hot-warm deployment scenarios.
Disaster recovery (DR)
Disaster recovery focuses on restoring or maintaining a full platform replica after a large-scale or catastrophic event, such as a data center outage, regional cloud failure, or severe cyber incident. With ITRS Analytics, disaster recovery isn’t an afterthought; it’s built into the way the platform operates.
Disaster recovery strategies typically include offsite backups, hot or warm standby environments, and detailed recovery runbooks designed to bring systems back online within defined Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO).
Tip
While HA and replication handle smaller, localized issues, disaster recovery is your safety net for catastrophic events. This approach satisfies even the most rigorous compliance frameworks and provides peace of mind that your observability remains intact, no matter the scenario.
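To make the RTO and RPO terms concrete, the sketch below compares a recovery strategy against target objectives. It is an illustration only: the `RecoveryPlan` type, the six-hour restore estimate, and the target values are hypothetical and are not taken from ITRS Analytics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPlan:
    name: str
    max_data_loss_hours: float  # worst-case gap since the last good copy (compared to RPO)
    restore_hours: float        # estimated time to bring the platform back (compared to RTO)


def meets_objectives(plan, rpo_hours, rto_hours):
    """Return True when the plan satisfies both the RPO and the RTO target."""
    return plan.max_data_loss_hours <= rpo_hours and plan.restore_hours <= rto_hours


# With daily backups, up to 24 hours of data can be lost; the restore time is an assumed figure.
daily_backup = RecoveryPlan("daily backups", max_data_loss_hours=24.0, restore_hours=6.0)

print(meets_objectives(daily_backup, rpo_hours=24.0, rto_hours=8.0))
print(meets_objectives(daily_backup, rpo_hours=1.0, rto_hours=8.0))
```

A plan that fails the RPO check, as in the second call, usually points toward replication rather than backups alone.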
Kubernetes deployment methods
ITRS Analytics supports two primary methods for deploying into a Kubernetes platform. Your choice impacts the resiliency features available to you.
Bring Your Own Cluster (BYOC)
In this scenario, customers have a dedicated team or expertise to deploy standard Kubernetes services from a hyperscaler or an on-premises system. This is the recommended approach for production deployments.
Native Bring Your Own Cluster (BYOC) environments typically offer broader capabilities and operational advantages compared to the Embedded Cluster deployment model.
Embedded Cluster (EC)
This scenario is for customers who don’t have access to a Kubernetes platform and want ITRS to deploy the Replicated Embedded Cluster (packaged K0s) with ITRS Analytics.
Advantages of native BYOC deployments
Native Bring Your Own Cluster (BYOC) environments provide several operational advantages over Embedded Cluster (EC) deployments. The scenarios below illustrate how these advantages play out in real-world ITRS Analytics operations.
Ensuring resilient access with load balancers
Scenario: Your organization runs multiple ITRS Analytics ingestion services and UIs that must remain accessible even during high traffic spikes.
In a BYOC environment, Kubernetes load balancers automatically distribute traffic across multiple replicas of your services. They also integrate with DNS registries, keeping URLs and endpoints resilient during network changes.
In contrast, Embedded Cluster deployments do not include a built-in load balancer. As a result, additional coordination with a network team is required to ensure resilient access to ITRS Analytics services.
Why it matters for ITRS Analytics:
- Load balancers significantly improve the resilience and stability of ITRS Analytics ingestion endpoints and UI URLs.
- Automated DNS and forwarding updates reduce manual network maintenance.
Scaling storage dynamically with decoupled storage classes
Scenario: Your ClickHouse workload grows steadily from 500 GB to several terabytes of data over time.
With BYOC, storage is decoupled from individual nodes. Kubernetes ensures that persistent volumes follow the workload as pods are rescheduled, and extendable storage classes allow volumes to grow seamlessly as data increases.
Embedded Cluster deployments, however, rely solely on local node storage. If a node becomes unavailable, the associated workloads cannot be rescheduled elsewhere, causing the system to run in a degraded state until the original node is restored.
Why it matters for ITRS Analytics:
- Dynamic, extensible storage classes are ideal for data-heavy workloads such as ClickHouse.
- Users can start with smaller storage allocations (for example, 500 GB) and expand them over time without downtime or complex planning.
Deploying secure workloads on tuned platforms
Scenario: Your IT security team enforces strict policies and container security requirements for all workloads.
In a BYOC cluster, security policies are designed for Kubernetes, allowing containers to start and access resources as intended. Misconfigurations or access issues are easier to diagnose because the environment is Kubernetes-native.
Running an Embedded Cluster on servers that are secured with tools designed for traditional workloads can cause friction. Security agents may block EC installation steps, container operations, or access to required system resources.
Why it matters for ITRS Analytics:
- Troubleshooting Kubernetes-native security policies in BYOC environments is generally easier than diagnosing low-level server tooling that silently blocks EC operations.
- Security alignment reduces deployment friction and improves overall platform stability.
Streamlined support across teams
Scenario: Your organization has separate teams for infrastructure, platform, and application operations.
In a BYOC setup, responsibilities are clearly divided: infrastructure teams manage nodes, platform teams administer Kubernetes, and application teams deploy and manage ITRS Analytics. Issues can be addressed at the appropriate layer without always escalating to ITRS.
The Embedded Cluster model hides Kubernetes from the application team, meaning any issue that surfaces within the cluster must be escalated to ITRS Support. This can slow down triage and restrict internal teams from participating in platform-level support.
Why it matters for ITRS Analytics:
- Better separation of responsibilities improves collaboration and reduces operational bottlenecks.
- Organizations retain the ability to diagnose cluster issues without relying solely on ITRS support.
Maintaining high availability with pod management
Scenario: A critical ClickHouse node fails during a server maintenance window.
In a BYOC environment with decoupled storage, Kubernetes can reschedule stateful pods on available nodes, keeping services running with minimal downtime.
In Embedded Cluster deployments, storage is tied to the physical node. If a node running a StatefulSet pod becomes unavailable, Kubernetes cannot reschedule the workload elsewhere. It must wait for the node to return, and the system runs in a degraded state until then.
Why it matters for ITRS Analytics:
- BYOC clusters maintain higher uptime and faster recovery from node failures.
- Resilience is built into the platform rather than dependent on specific hardware.
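The difference between the two models comes down to whether a stateful pod's volume can follow it to another node. The following toy model sketches that reasoning; it is a simplification for illustration and does not reflect actual Kubernetes scheduler logic. The function names are hypothetical.

```python
def can_reschedule(storage_decoupled, healthy_nodes):
    """A stateful pod can move only if its volume can follow it and a healthy node exists."""
    return storage_decoupled and healthy_nodes > 0


def state_after_node_failure(storage_decoupled, healthy_nodes):
    """Describe what happens to a stateful workload when its node fails."""
    if can_reschedule(storage_decoupled, healthy_nodes):
        return "rescheduled on another node"
    return "degraded until the original node returns"


# BYOC with decoupled storage: the volume reattaches wherever the pod lands.
print("BYOC:", state_after_node_failure(storage_decoupled=True, healthy_nodes=2))

# Embedded Cluster with node-local storage: the data stays on the failed node.
print("Embedded Cluster:", state_after_node_failure(storage_decoupled=False, healthy_nodes=2))
```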
Deployment scenarios
The following sections describe various deployment scenarios, each with specific benefits and trade-offs. Understanding these helps you select the right configuration for your requirements.
Non-HA single or multi-node (BYOC)
This configuration is suitable for proof-of-concept deployments and smaller production environments where high availability is not a strict requirement.
Common use cases:
- SaaS proof-of-concepts
- Small SaaS Geneos or Opsview observability deployments
- Development and testing environments
Characteristics:
- Lower infrastructure costs.
- Backup and restore is available with a 24-hour Recovery Time Objective (RTO).
- Suitable for environments with flexible uptime requirements.
- Managed by ITRS cloud operations teams for SaaS deployments.
Note
Proof-of-concept deployments come with no guarantee of highly available data due to their exploratory nature.
Non-HA single or multi-node (Embedded Cluster)
Similar to the BYOC non-HA configuration, but deployed on-premises using Embedded Cluster. This option has additional limitations around data protection.
Common use cases:
- On-premises proof-of-concept deployments
- Small production use cases with relaxed uptime requirements
Important considerations:
- Lower infrastructure costs.
- Backup and restore functionality is not supported (Velero doesn’t support node filesystem storage classes).
- Risk of complete data loss if a node fails catastrophically.
- Requires complete rebuild if storage is lost.
Warning
Since backup and restore is not supported with Embedded Cluster, organizations should carefully assess their data protection requirements before choosing this deployment method.
HA multi-node small to medium (BYOC)
This is the workhorse configuration for customers interested in a native microservices environment for their observability platform. Organizations provision the Kubernetes cluster through a dedicated team or with automation built in-house.
Common use cases:
- SaaS production deployments
- On-premises production deployments
- Multi-site deployments
- Medium-sized enterprise observability platforms
Key features:
- Full high availability benefits
- Extendable storage classes
- Backup and restore procedures available
- Can be deployed in hot-hot or hot-warm configurations for disaster recovery
Disaster recovery options:
- Daily backups: Take daily backups and use documented procedures to stop the platform, restore, and restart. Recovery typically requires several hours.
- Hot-hot replication: Run multiple identical instances with continuous data replication. Provides immediate failover with no data loss. (Note: Production-level data twinning capabilities are currently being finalized.)
- Hot-warm standby: A secondary instance ingests data but runs with reduced app capacity. During a disaster, documented runbooks reconfigure the system to bring up additional apps for extended operation. (Note: This approach is under development.)
Tip
For critical observability platforms, we recommend hot-hot configurations. When issues arise in your primary infrastructure, you need immediate access to your observability data to help triage problems, not hours of waiting for disaster recovery procedures.
Sizing notes:
- Medium configurations may require dedicated nodes for TimescaleDB.
- ClickHouse requires node selectors only in large implementations.
- See Resource and hardware requirements for detailed specifications.
HA multi-node large (BYOC)
This deployment is designed for high-throughput environments requiring 100,000 messages per second or more. ITRS SaaS teams can deploy this for larger customers or, over time, extend it to support a multi-tenant system.
Common use cases:
- SaaS production for large customers
- Large enterprise single-deployment environments
- Multi-tenant SaaS platforms
Architecture requirements:
- Storage systems (ClickHouse or TimescaleDB) require dedicated nodes
- Kubernetes node selectors ensure only designated workloads are scheduled on storage nodes
- Full high availability across all components
- Enterprise-grade disaster recovery capabilities
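As an illustration of the node selector requirement above, the sketch below builds the scheduling fragment of a pod specification as a Python dictionary. `nodeSelector` and `tolerations` are standard Kubernetes pod spec fields, but the label and taint key `workload-type=storage` is a hypothetical example, and this is not necessarily how ITRS Analytics itself is configured. In general Kubernetes behavior, a node selector steers pods toward labelled nodes, while a matching taint on those nodes is what keeps other workloads away; the sketch assumes both are used together.

```python
import json

# Hypothetical label applied to dedicated storage nodes; real names depend on your cluster.
STORAGE_LABEL = {"workload-type": "storage"}


def storage_pod_scheduling(label=STORAGE_LABEL):
    """Build the scheduling part of a pod spec for a storage workload."""
    key, value = next(iter(label.items()))
    return {
        # Steers the pod onto nodes that carry the storage label.
        "nodeSelector": dict(label),
        # Lets the pod run on nodes tainted with the same key (assumed taint).
        "tolerations": [
            {"key": key, "operator": "Equal", "value": value, "effect": "NoSchedule"}
        ],
    }


print(json.dumps(storage_pod_scheduling(), indent=2))
```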
Deployment scenarios:
- Single large deployment for enterprise-wide observability
- Alternative to multiple smaller regional deployments
- Foundation for multi-tenant SaaS offerings
Note
Large deployments provide the same disaster recovery options as medium deployments (backup/restore, hot-hot, hot-warm) but at a scale suitable for the most demanding observability requirements.
HA multi-node small (Embedded Cluster)
Important
This setup should only be chosen after all BYOC options have been exhausted. Teams must understand and accept the trade-offs compared to BYOC deployments.
This is the primary production configuration for Embedded Cluster deployments. It requires a set of common nodes that provide limited high availability capabilities.
Common use cases:
- On-premises production deployments without access to managed Kubernetes.
- Multi-site deployments in environments without Kubernetes expertise.
Important characteristics:
- Provides limited high availability compared to BYOC.
- Data is stored in multiple places across nodes.
- System runs in a degraded state during node failures or upgrades.
- Cannot reschedule stateful workloads when nodes are unavailable.
- Backup and restore not supported (Velero limitation with node filesystem storage).
Current limitations and future enhancements:
Current limitations:
To achieve full HA capability with one node out, some enhancements are still in development. For example, multiple Keycloak instances are needed so that if the node running Keycloak fails, the system remains accessible.
Future disaster recovery:
- Multi-instance data replication for continuous operations is forthcoming.
- Will eliminate the need for traditional DR scenarios with downtime and runbooks.
- Enables multiple Embedded Cluster instances to replicate data between sites.
HA multi-node medium to large (Embedded Cluster)
Status: Not recommended for production use
While this deployment is fully tested by QA teams, it is not recommended for production environments.
Reasoning:
- Platforms handling 50,000 to 100,000 messages per second require significant data availability guarantees.
- Storing large volumes of data on node filesystems introduces substantial risk.
- Only fully supported Kubernetes persistent storage provides the necessary guarantees with low overhead.
- Single point of failure risks are too high for this scale of data.
Important
For deployments requiring 50,000 messages per second or higher throughput, always use a BYOC configuration with proper persistent storage classes.
Choosing the right deployment
When selecting your deployment configuration, consider the following decision factors:
Production deployments
| Priority | Recommendation |
|---|---|
| Best option | HA multi-node BYOC (small, medium, or large based on throughput) |
| Alternative | HA multi-node small Embedded Cluster (if BYOC is not available) |
| Not recommended | Non-HA configurations, or Embedded Cluster medium/large |
Proof-of-concept and development
| Environment | Recommendation |
|---|---|
| SaaS POC | Non-HA BYOC |
| On-premises POC | Non-HA Embedded Cluster |
| Development/testing | Non-HA BYOC (single node acceptable) |
Key decision criteria
| Consideration | Key question |
|---|---|
| Uptime requirements | Do you require continuous availability even during component failures? |
| Disaster recovery needs | How quickly must systems recover from a regional outage? |
| Data volume | What are your expected message rates and storage growth patterns? |
| Kubernetes expertise | Do you have teams capable of managing Kubernetes infrastructure effectively? |
| Budget constraints | What are the limits for infrastructure costs? |
| Compliance requirements | Do regulatory frameworks impose specific RPO or RTO targets? |
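The recommendations in this section can be condensed into a small decision helper. This is a sketch of the guidance in this guide, not an official sizing tool: the `recommend` function and its `purpose` values are hypothetical, and the only thresholds used are the 50,000 and 100,000 messages-per-second figures stated earlier.

```python
def recommend(purpose, byoc_available, messages_per_second=0):
    """Map deployment purpose and throughput to the configuration named in this guide."""
    if purpose in ("saas-poc", "development"):
        return "Non-HA BYOC"
    if purpose == "onprem-poc":
        return "Non-HA Embedded Cluster"
    if purpose != "production":
        raise ValueError(f"unknown purpose: {purpose!r}")

    if byoc_available:
        if messages_per_second >= 100_000:
            return "HA multi-node large (BYOC)"
        return "HA multi-node small to medium (BYOC)"

    # Embedded Cluster is not recommended at 50,000 messages per second or more.
    if messages_per_second >= 50_000:
        return "BYOC required: Embedded Cluster is not recommended at this throughput"
    return "HA multi-node small (Embedded Cluster)"


print(recommend("production", byoc_available=True, messages_per_second=120_000))
print(recommend("production", byoc_available=False, messages_per_second=20_000))
print(recommend("onprem-poc", byoc_available=False))
```

The helper deliberately ignores the other decision criteria in the table above, such as budget and compliance targets, which still need to be weighed case by case.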