ITRS Analytics deployment planning and resiliency
This guide helps you understand the resiliency characteristics and trade-offs of different ITRS Analytics deployment options. Your choice of deployment model directly affects high availability, continuous operations, and your ability to meet uptime and compliance requirements.
Designing a resilient ITRS Analytics deployment
ITRS Analytics is built on a Kubernetes-native architecture, designed for continuous high availability, scalable deployments, and resilient operations. A key decision that drives most resiliency characteristics is the choice of deployment model: Bring Your Own Cluster (BYOC) or Embedded Cluster (EC).
BYOC is the recommended deployment model, offering customers maximum flexibility, control, and enterprise-grade resiliency. Embedded Cluster can be suitable for small-scale or trial deployments, but it comes with specific trade-offs and limitations. This guide explains the implications of choosing an Embedded Cluster, rather than treating BYOC and EC as equivalent options.
ITRS Analytics achieves resiliency through high availability and continuous operations mechanisms. By running redundant services across the cluster with intelligent load balancing and automated failover, the platform ensures uninterrupted access to observability data even during component or node failures.
With a Kubernetes-native design, monitoring and alerting workflows continue seamlessly, meeting strict compliance and uptime requirements without requiring manual intervention. Understanding these characteristics and how they differ between BYOC and EC is essential for designing a deployment that aligns with your organization’s uptime, compliance, and continuous operations goals.
Key resiliency concepts
When planning your ITRS Analytics deployment, two fundamental concepts work together to define the platform’s operational characteristics.
High availability (HA)
High availability ensures that your observability platform continues to operate without interruption, even if individual components fail. This is achieved by deploying redundant services, load balancers, and failover mechanisms so that if one instance becomes unavailable, another seamlessly takes over.
Key characteristics:
- Focuses on minimizing downtime within the same site or region.
- Requires network latency below 10 ms between all nodes, for both BYOC and Embedded Cluster deployments.
- As a result, architectures that span data centers or availability zones with higher latency between them are not supported.
- Supports deployments across subnets and availability zones within a region, provided the latency requirement is met.
Note
While HA configurations can deploy across multiple availability zones within a region, all nodes must maintain network latency below 10ms to ensure proper cluster operation.
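The latency requirement can be checked before installation. The following is a minimal sketch, assuming you have already measured round-trip times between each pair of candidate nodes with your own tooling. The function name, node names, and measurements are hypothetical and not part of ITRS Analytics. A latency of exactly 10 ms and a missing measurement are both treated as failures here, which is a conservative assumption.

```python
from itertools import combinations

LATENCY_LIMIT_MS = 10.0  # limit stated in this guide


def pairs_over_limit(latency_ms, limit_ms=LATENCY_LIMIT_MS):
    """Return node pairs whose latency is missing or not below the limit.

    latency_ms maps a frozenset of two node names to a round-trip time in
    milliseconds. How the measurements are collected is up to you.
    """
    nodes = sorted({name for pair in latency_ms for name in pair})
    offenders = []
    for a, b in combinations(nodes, 2):
        rtt = latency_ms.get(frozenset((a, b)))
        if rtt is None or rtt >= limit_ms:
            offenders.append((a, b, rtt))
    return offenders


# Hypothetical measurements for three nodes spread over two availability zones.
measurements = {
    frozenset(("node-a", "node-b")): 0.8,
    frozenset(("node-a", "node-c")): 2.4,
    frozenset(("node-b", "node-c")): 12.5,
}
print(pairs_over_limit(measurements))
```

An empty result means every measured pair is within the limit. Any pair returned indicates a node placement that this guide does not support.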
Continuous operations
ITRS Analytics is designed to maintain continuous operations during localized failures within a single cluster. Through high availability configurations, the platform automatically handles pod failures, node outages, and individual service disruptions without manual intervention. Kubernetes orchestration ensures workloads are rescheduled, traffic is rerouted, and services remain accessible even as infrastructure components fail and recover.
This continuous operations model focuses on keeping the platform available within a single site or region during common failure scenarios—precisely the situations where your observability data is most critical for troubleshooting and incident response.
Important
ITRS Analytics does not provide built-in cross-site or cross-region disaster recovery capabilities. The platform is designed for continuous operations during localized failures (pods, nodes, services) within a single deployment, not for automatic failover between geographically separated instances.
Requirements for disaster recovery across regions or data centers
If your organization requires protection against large-scale or catastrophic events, such as complete data center outages, regional cloud failures, or severe cyber incidents, you must implement your own disaster recovery strategy by:
- Running two or more independent ITRS Analytics deployments in separate locations
- Implementing your own data synchronization mechanisms between deployments
- Managing failover procedures and traffic redirection during regional failures
- Maintaining recovery runbooks and testing DR processes regularly
This approach allows you to design disaster recovery that aligns with your specific Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO), while the platform itself focuses on maximizing uptime within each deployment.
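As a rough illustration of how a customer-managed DR design relates to RTO and RPO targets, the sketch below compares a plan against both objectives. The `DrPlan` fields, the function name, and the numbers are hypothetical. The model assumes worst-case data loss equals one full synchronization interval, which ignores the time the synchronization itself takes.

```python
from dataclasses import dataclass


@dataclass
class DrPlan:
    sync_interval_min: float      # how often data is copied to the standby deployment
    failover_duration_min: float  # time to redirect traffic, as measured in a DR test


def meets_objectives(plan, rpo_min, rto_min):
    """Compare a DR plan against RPO and RTO targets, all in minutes."""
    return {
        "rpo_met": plan.sync_interval_min <= rpo_min,
        "rto_met": plan.failover_duration_min <= rto_min,
    }


# Hypothetical plan: sync every 15 minutes, failover rehearsed at 45 minutes.
plan = DrPlan(sync_interval_min=15, failover_duration_min=45)
print(meets_objectives(plan, rpo_min=30, rto_min=30))
```

In this example the synchronization interval satisfies the RPO, but the rehearsed failover time exceeds the RTO, so the failover procedure is the part that would need work.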
Kubernetes deployment methods
ITRS Analytics supports two primary methods for deploying into a Kubernetes platform. Your choice impacts the resiliency features available to you.
Bring Your Own Cluster (BYOC)
In this scenario, customers have a dedicated team or expertise to deploy standard Kubernetes services from a hyperscaler or an on-premises system. This is the recommended approach for production deployments.
Native Bring Your Own Cluster (BYOC) environments typically offer broader capabilities and operational advantages compared to the Embedded Cluster deployment model.
Embedded Cluster (EC)
This scenario is for customers who don’t have access to a Kubernetes platform and want ITRS to deploy the Embedded Cluster (a packaged k0s distribution) together with ITRS Analytics.
Advantages of native BYOC deployments
Native Bring Your Own Cluster (BYOC) environments provide several operational advantages over Embedded Cluster (EC) deployments. The scenarios below illustrate how these advantages play out in real-world ITRS Analytics operations.
Ensuring resilient access with load balancers
Scenario: Your organization runs multiple ITRS Analytics ingestion services and UIs that must remain accessible even during high traffic spikes.
In a BYOC environment, Kubernetes load balancers automatically distribute traffic across multiple replicas of your services. They also integrate with DNS registries, keeping URLs and endpoints resilient during network changes.
In contrast, Embedded Cluster deployments do not include a built-in load balancer. As a result, additional coordination with a network team is required to ensure resilient access to ITRS Analytics services.
Why it matters for ITRS Analytics:
- Load balancers significantly improve the resilience and stability of ITRS Analytics ingestion endpoints and UI URLs.
- Automated DNS and forwarding updates reduce manual network maintenance.
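To make the idea of distributing traffic across replicas concrete, here is a toy round-robin model. It is only a sketch of the general technique, not how any particular Kubernetes load balancer is implemented. The function name, replica names, and request labels are hypothetical.

```python
from itertools import cycle


def distribute(requests, replicas, healthy):
    """Assign each request to the next healthy replica in round-robin order."""
    pool = [r for r in replicas if healthy.get(r, False)]
    if not pool:
        raise RuntimeError("no healthy replicas available")
    rotation = cycle(pool)
    return {req: next(rotation) for req in requests}


# Hypothetical UI replicas, one of which is currently failing health checks.
assignment = distribute(
    requests=["req-1", "req-2", "req-3"],
    replicas=["ui-0", "ui-1", "ui-2"],
    healthy={"ui-0": True, "ui-1": False, "ui-2": True},
)
print(assignment)
```

The unhealthy replica receives no traffic, and clients keep using the same endpoint throughout. Without a load balancer, that routing decision has to be arranged separately with the network team.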
Scaling storage dynamically with decoupled storage classes
Scenario: Your ClickHouse workload grows steadily from 500GB to several terabytes of data over time.
With BYOC, storage is decoupled from individual nodes. Kubernetes ensures that persistent volumes follow the workload as pods are rescheduled, and extendable storage classes allow volumes to grow seamlessly as data increases.
Embedded Cluster deployments, however, rely solely on local node storage. If a node becomes unavailable, the associated workloads cannot be rescheduled elsewhere, causing the system to run in a degraded state until the original node is restored.
Why it matters for ITRS Analytics:
- Dynamic, extensible storage classes are ideal for data-heavy workloads such as ClickHouse.
- Users can start with smaller storage allocations (for example, 500GB) and expand them over time without downtime or complex planning.
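When starting with a small allocation, it helps to estimate how much headroom remains before an expansion is needed. The sketch below does that arithmetic under a simple assumption of linear growth. The function names, the growth rate, and the doubling step are hypothetical choices, not ITRS sizing guidance.

```python
import math


def months_until_full(current_gb, capacity_gb, growth_gb_per_month):
    """Whole months of headroom left before the volume reaches capacity."""
    if growth_gb_per_month <= 0:
        raise ValueError("growth must be positive")
    remaining = capacity_gb - current_gb
    if remaining <= 0:
        return 0
    return math.floor(remaining / growth_gb_per_month)


def next_capacity_gb(capacity_gb, step=2.0):
    """Size to request on the next expansion, here simply a multiple of the current size."""
    return capacity_gb * step


# Hypothetical ClickHouse volume: 350 GB used of 500 GB, growing 40 GB per month.
print(months_until_full(current_gb=350, capacity_gb=500, growth_gb_per_month=40))
print(next_capacity_gb(500))
```

With these numbers there are about three months of headroom, which gives time to schedule the expansion on an extendable storage class.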
Deploying secure workloads on tuned platforms
Scenario: Your IT security team enforces strict policies and container security requirements for all workloads.
In a BYOC cluster, security policies are designed for Kubernetes, allowing containers to start and access resources as intended. Misconfigurations or access issues are easier to diagnose because the environment is Kubernetes-native.
Running an Embedded Cluster on servers that are secured with tools designed for traditional workloads can cause friction. Security agents may block EC installation steps, container operations, or access to required system resources.
Why it matters for ITRS Analytics:
- Troubleshooting Kubernetes-native security policies in BYOC environments is generally easier than diagnosing low-level server tooling that silently blocks EC operations.
- Security alignment reduces deployment friction and improves overall platform stability.
Streamlined support across teams
Scenario: Your organization has separate teams for infrastructure, platform, and application operations.
In a BYOC setup, responsibilities are clearly divided: infrastructure teams manage nodes, platform teams administer Kubernetes, and application teams deploy and manage ITRS Analytics. Issues can be addressed at the appropriate layer without always escalating to ITRS.
The Embedded Cluster model hides Kubernetes from the application team, meaning any issue that surfaces within the cluster must be escalated to ITRS Support. This can slow down triage and restrict internal teams from participating in platform-level support.
Why it matters for ITRS Analytics:
- Better separation of responsibilities improves collaboration and reduces operational bottlenecks.
- Organizations retain the ability to diagnose cluster issues without relying solely on ITRS support.
Maintaining high availability with pod management
Scenario: A critical ClickHouse node fails during a server maintenance window.
In a BYOC environment with decoupled storage, Kubernetes can reschedule stateful pods on available nodes, keeping services running with minimal downtime.
In Embedded Cluster deployments, storage is tied to the physical node. If a node running a StatefulSet pod becomes unavailable, Kubernetes cannot reschedule that workload on another node. The workload must wait for the original node to return, and the system runs in a degraded state until it does.
Why it matters for ITRS Analytics:
- BYOC clusters maintain higher uptime and faster recovery from node failures.
- Resilience is built into the platform rather than dependent on specific hardware.
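The difference between the two storage models can be expressed as a small decision rule. This is a simplified model for illustration only and does not reflect actual Kubernetes scheduler logic. The function name and the `"decoupled"` and `"local"` labels are hypothetical.

```python
def can_reschedule(failed_node, storage, healthy_nodes):
    """Return the node a stateful pod can move to, or None if it must wait.

    storage is "decoupled" (the volume can attach to another node) or
    "local" (the volume lives on the failed node's disk).
    """
    candidates = [n for n in healthy_nodes if n != failed_node]
    if storage == "decoupled" and candidates:
        return candidates[0]
    return None  # the pod waits for the original node to come back


# BYOC with decoupled storage: the pod moves to another node.
print(can_reschedule("node-1", "decoupled", ["node-2", "node-3"]))
# Embedded Cluster with local storage: the pod cannot move.
print(can_reschedule("node-1", "local", ["node-2", "node-3"]))
```

The second call returns `None` even though healthy nodes exist, because the data the pod needs is only on the failed node.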
Deployment scenarios
The following sections describe various deployment scenarios, each with specific benefits and trade-offs. Understanding these helps you select the right configuration for your requirements.
Non-HA single or multi-node (BYOC)
This configuration is suitable for proof-of-concept deployments and smaller production environments where high availability is not a strict requirement.
Common use cases:
- SaaS proof-of-concepts
- Small SaaS Geneos or Opsview observability deployments
- Development and testing environments
Characteristics:
- Lower infrastructure costs.
- Backup and restore is available, with a 24-hour recovery time objective.
- Suitable for environments with flexible uptime requirements.
- Managed by ITRS cloud operations teams for SaaS deployments.
Note
Proof-of-concept deployments come with no guarantee of highly available data due to their exploratory nature.
Non-HA single or multi-node (Embedded Cluster)
Similar to the BYOC non-HA configuration, but deployed on-premises using Embedded Cluster. This option has additional limitations around data protection.
Common use cases:
- On-premises proof-of-concept deployments
- Small production use cases with relaxed uptime requirements
Important considerations:
- Lower infrastructure costs.
- Backup and restore functionality is not supported (Velero doesn’t support node filesystem storage classes).
- Risk of complete data loss if a node fails catastrophically.
- Requires complete rebuild if storage is lost.
Warning
Since backup and restore is not supported with Embedded Cluster, organizations should carefully assess their data protection requirements before choosing this deployment method.
Making your deployment decision
Choosing the right deployment model for ITRS Analytics is fundamental to achieving the resiliency and operational characteristics your organization needs.
Key takeaways:
- BYOC is recommended for production environments: It provides the most comprehensive resiliency features, including automatic load balancing, flexible storage, faster recovery from failures, and backup capabilities.
- Embedded Cluster has important limitations: While suitable for trials or small deployments without Kubernetes expertise, it lacks backup support, ties storage to physical nodes, and cannot reschedule workloads during node failures.
- Network latency matters: All deployments require nodes with less than 10 ms network latency, meaning architectures cannot span multiple data centers or distant availability zones.
Start by evaluating whether you have access to a Kubernetes platform or team. If yes, BYOC is your path forward. If not, understand the Embedded Cluster trade-offs before proceeding, particularly around data protection and operational flexibility.
For detailed resource requirements and sizing guidance, see ITRS Analytics Sizer.