Upgrade to version 2.18.0: ClickHouse migration guide
ITRS Analytics 2.18.0 introduces the platform-wide ClickHouse migration. During the upgrade, ITRS Analytics deploys new ClickHouse-backed workloads, migrates retained data from the legacy stores, and then retires the old data path once the migration has completed successfully. For the package release notes for this required release, see ITRS Analytics 2.18.x changelog.
Important
ITRS Analytics 2.18.0+p2.18.3 is a required release.
When you upgrade an existing deployment with retained data, ITRS Analytics migrates that data from TimescaleDB to ClickHouse. This migration requires changes to platforms and apps, as well as infrastructure changes that must be completed outside the installation process. These temporary infrastructure changes do not apply to a fresh installation, because there is no TimescaleDB-to-ClickHouse data migration to run. Concentrating the migration in a single required release reduces the risk of supporting multiple upgrade paths.
What changes in version 2.18.0
Version 2.18.0 moves additional platform workloads to ClickHouse, including metrics, logs, signals, audit events, entities, and related query paths. As part of the upgrade:
- New ClickHouse workloads are deployed for platform data.
- Historical metrics, signals, and logs are migrated from the legacy backends.
- Configuration service data is moved away from TimescaleDB.
- Query handling for DPD, Entity Service, Latest Metrics, and other platform services is aligned with the ClickHouse data model.
The data migration runs in the background, so the standard upgrade can complete before all retained data has been migrated. Background migration jobs can continue running for some time afterward, and both the old and new storage backends may need to coexist until that migration finishes.
Before you upgrade
Complete the following checks before starting the standard upgrade procedure. The ClickHouse migration introduces behavior changes that can affect custom integrations, automation, and API clients.
Platform and query behavior changes
- Make sure your Kubernetes version is still supported by ITRS Analytics; a quick version check is sketched after this list. Version 2.18.x adds support for Kubernetes 1.34 and removes support for Kubernetes 1.27 and 1.28.
- Requests to metric query endpoints now require a namespace for metric lookups such as `GetMetrics` and `GetStatusMetrics`.
- Metric metadata and query behavior now follow the ClickHouse-backed model. If you have custom consumers of metric APIs, re-test them before production rollout.
- Metric query results now align with the normalized unit model used by ClickHouse-backed storage. Validate any automation that depends on previously returned units.
- Query services now preserve nanosecond timestamp precision.
- `NOT_EQUALS` and `NOT IN` now match only when the key is present.
- Greater-than and less-than entity expression operators are no longer supported.
If you rely on saved filters, generated filters, or application-side query builders, validate them before upgrading.
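For the Kubernetes version check in the first item of the list above, a quick way to confirm what the cluster is actually running is shown below. This is a generic kubectl sketch, not an ITRS-specific tool; confirm the supported versions against the ITRS Analytics compatibility documentation.

```
# Version reported by the API server (the authoritative cluster version)
kubectl version

# Kubelet version on each node, which can lag behind the control plane
kubectl get nodes -o wide
```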
Ingestion and collector changes
- Backticks in dimension keys, namespaces, and attribute names are normalized to underscores during ingestion.
- Some log data is normalized more aggressively during ingestion and migration, including timestamp and severity parsing and event identifier handling.
- For Geneos and FKM log data, entity and log-field handling is more normalized than in earlier releases. In particular, fields such as `sourceAlias`, `matchKey`, `ignoreKey`, and `originalSeverity` may no longer appear in exactly the same place they did before the ClickHouse migration.
- Custom senders that still depend on the removed `SendNetprobeLogBatch` path must be updated before upgrading.
If your data pipeline or downstream tooling depends on exact field names or raw log formatting, verify the output after upgrading.
Confirm temporary migration capacity
This applies when upgrading an existing deployment with retained data. Plan enough time to monitor the background data migration after the standard upgrade completes. The migration duration depends on the amount of retained metrics, signals, and logs in the system. During and after the upgrade, the legacy data stores and the new ClickHouse workloads can run at the same time.
Before upgrading, verify that the cluster has sufficient spare capacity for both the current and post-upgrade workloads. In practice, the storage required for platform data is expected to nearly double after the upgrade, which can significantly reduce available disk space and affect cluster stability if it is not accounted for in advance. Check that the cluster has enough spare:
- CPU
- memory
- storage capacity
- storage performance
Use the latest resource and hardware requirements and the current sample configuration files as your baseline.
Pay particular attention to storage performance. Slow or undersized storage can significantly lengthen migration time and affect ClickHouse query performance after the upgrade.
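A quick way to gauge current headroom before the upgrade is with standard kubectl commands. This is a minimal sketch: it assumes the metrics-server is installed for `kubectl top`, and it does not replace the documented sizing review.

```
# Current CPU and memory usage per node (requires metrics-server)
kubectl top nodes

# Requested resources versus allocatable capacity on each node
kubectl describe nodes | grep -A 8 "Allocated resources"

# Existing persistent volume claims and their provisioned sizes
kubectl get pvc -A
```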
Review updated sizing configuration YAMLs
Migration effort is directly affected by the amount of retained data and by how closely your current deployment still matches the latest sample sizing. Before upgrading, review:
- metrics retention
- signal retention
- log retention
- current TimescaleDB and Loki disk usage
If historical data volume is unusually large, plan for a longer background migration period and higher temporary resource usage.
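A rough way to gauge how much retained data the migration has to move is to inspect the existing TimescaleDB and Loki volumes. In the sketch below, the namespace and the pod name are assumptions; adjust them to your deployment.

```
NAMESPACE=itrs   # assumption; use your ITRS Analytics namespace

# Provisioned sizes of the legacy TimescaleDB and Loki volumes
kubectl get pvc -n $NAMESPACE | grep -E 'timescale|loki'

# Actual disk usage inside the primary TimescaleDB pod (pod name is an assumption)
kubectl exec -n $NAMESPACE timescale-0 -- df -h
```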
The latest sizing sample YAML files were also updated in ways that matter for upgrade planning. Before upgrading, compare your current deployment against the latest sample configuration file for your target size.
Major changes reflected in the updated sizing YAML files include:
- more explicit node-count, CPU, memory, and storage estimates in the file headers for each size
- HA guidance for service-layer workloads, not only database replicas
- higher or more explicit replica counts for ingestion, `sinkd`, `dpd`, downsampled metrics, and entity-stream workloads in HA-oriented samples
- clearer per-workload storage expectations for Kafka brokers and controllers, PostgreSQL, etcd, ClickHouse Keeper, ClickHouse traces, and downsampled metrics; TimescaleDB and Loki remain only as temporary migration workloads for upgrades with existing data, and their storage is not re-sized as part of this change
- dedicated Timescale node placement and higher-performance Timescale storage guidance in larger AWS sample files where applicable
If you previously customized an older sizing file, do not assume that older replica counts, storage sizes, or node totals are still appropriate for version 2.18.0.
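A simple way to spot drift is to diff your customized values against the latest sample for your target size. Both file names below are placeholders; substitute the 2.18.0 sample file for your size and your own values file.

```
# Compare the customized values file against the latest published sample
diff -u sample-medium.yaml my-values.yaml
```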
Review KOTS Admin Console storage settings
If you manage the deployment through the KOTS Admin Console, review the Storage Class Settings and Disk Allocations sections carefully before upgrading. In 2.18.x, these UI fields are updated to reflect ClickHouse-backed workloads.
Storage Class Settings
| 2.17.x | 2.18.x |
|---|---|
| Kafka | Kafka |
| Kafka Controller | Kafka Controller |
| Downsampled Metrics Stream | Etcd Storage Class |
| Loki | ClickHouse Traces |
| Etcd | ClickHouse Metrics |
| Timescale Data | ClickHouse Logs |
| Timescale WAL | ClickHouse Keeper |
| Timescale Timeseries | ClickHouse Platform |
Disk Allocations
| 2.17.x | 2.18.x |
|---|---|
| Timescale Data Disk | ClickHouse Traces Disk |
| Timescale WAL Disk | ClickHouse Metrics Disk |
| Timescale Timeseries Disk | ClickHouse Logs Disk |
| | ClickHouse Keeper Disk |
| | ClickHouse Entities Disk |
Review collector exposure and data-type settings
Version 2.18.0 introduces separate ClickHouse-backed workloads for multiple data types. Before upgrading:
- confirm whether logs are meant to stay enabled
- confirm whether traces are meant to stay enabled
- keep those settings consistent throughout the upgrade
If you previously exposed collector traffic through NodePort, review that configuration carefully. In 2.18.x, collectors are split into lossy and lossless pairs, so existing NodePort exposure for collectors may also need matching lossless configuration using the current `internalLossless` or `otelLossless` settings.
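To see which collector services are currently exposed through NodePort, a generic listing like the one below can help. The namespace is an assumption, and service names in your deployment may differ.

```
NAMESPACE=itrs   # assumption; use your ITRS Analytics namespace

# List NodePort services so lossy and lossless collector exposure can be compared
kubectl get svc -n $NAMESPACE -o wide | grep NodePort
```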
Choose an infrastructure transition strategy
This section applies only when upgrading an existing deployment with retained data where TimescaleDB runs on dedicated nodes with the `dedicated=timescale-nodes:NoSchedule` taint and `instancegroup=timescale-nodes` label; a quick check is shown after the list below. It does not apply to a fresh installation or to deployments where TimescaleDB runs on shared nodes. Choose one of the following strategies and complete it before you start the upgrade. This ensures that the new ClickHouse workloads have enough capacity during migration.
- Add additional nodes for ClickHouse first. This is the recommended approach.
- Reuse existing infrastructure. This avoids provisioning new nodes up front, but it is more complex to manage.
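To confirm whether this section applies to your cluster, check for the TimescaleDB node label and taint directly. This is a generic kubectl sketch.

```
# Nodes carrying the TimescaleDB instance-group label
kubectl get nodes -l instancegroup=timescale-nodes

# Taints on those nodes; look for dedicated=timescale-nodes:NoSchedule
kubectl describe nodes -l instancegroup=timescale-nodes | grep Taints
```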
Add ClickHouse nodes first (Recommended strategy)
This is the simplest approach. It provides the smoothest upgrade experience and the lowest operational risk, but it requires additional infrastructure.
1. Provision additional nodes for ClickHouse.

   Use the following minimum sizing for the new ClickHouse nodes:

   - Medium: 2 nodes, each with 15 CPU cores, 36 GiB RAM, and 2730 GiB storage
   - Large: 2 nodes, each with 26 CPU cores, 46 GiB RAM, and 5472 GiB storage

2. Dedicate those nodes to ClickHouse.

   For each new `$node`, run:

   ```
   kubectl taint node $node dedicated=clickhouse-nodes:NoSchedule
   kubectl label node $node instancegroup=clickhouse-nodes
   ```

3. After the upgrade and migration are complete, decommission the old TimescaleDB nodes if they are no longer needed. Do not remove the existing TimescaleDB node taints and labels; they should remain for the duration of the upgrade.

   The exact process depends on how your nodes are managed, but the generic sequence is:

   a. Drain the nodes.
   b. Confirm that the remaining nodes can admit the evicted pods.
   c. Remove the nodes.
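If your nodes are managed directly with kubectl rather than by a cloud node group or autoscaler, the generic sequence above might look like the following. The node name is a placeholder; repeat for each TimescaleDB node being decommissioned.

```
NODE=timescale-node-1   # placeholder node name

# a. Drain the node (evicts pods; DaemonSet pods are skipped)
kubectl drain $NODE --ignore-daemonsets --delete-emptydir-data

# b. Confirm that the evicted pods were rescheduled and are running on other nodes
kubectl get pods -A -o wide | grep -v Running

# c. Remove the node from the cluster
kubectl delete node $NODE
```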
Reuse existing infrastructure (Conservative strategy)
This is the more economical approach because it does not require new infrastructure up front. However, it is more operationally sensitive and should only be attempted by an experienced Kubernetes administrator.
1. Remove the TimescaleDB node taints and labels.

   For each `$node` dedicated to TimescaleDB, run:

   ```
   kubectl taint node $node dedicated=timescale-nodes:NoSchedule-
   kubectl label node $node instancegroup-
   ```

2. Check whether your storage driver creates persistent volumes with node affinity.

   During the upgrade, TimescaleDB resources are reduced and their node selectors and tolerations are removed so that they can move to general-purpose nodes. Some storage drivers create persistent volumes that lock a workload to a specific node. If that happens, the TimescaleDB pods may not be rescheduled to other nodes.

   ```
   # For BYOC installations
   NAMESPACE=itrs
   # For EC installations
   # NAMESPACE=kotsadm

   kubectl get pv $(kubectl get -n $NAMESPACE pvc timescale-ha-wal-timescale-0 -ojsonpath='{.spec.volumeName}') -ojsonpath='{.spec.nodeAffinity}'
   ```

   If this command returns a non-empty response, the TimescaleDB pods are likely to remain tied to their current nodes and may not be able to migrate.

3. Make sure the target nodes meet the minimum capacity.

   If the target nodes do not already meet these minimums, resize them before upgrading.

   Minimum sizing if TimescaleDB PVs have node affinity (pods stay on their current nodes):

   - Medium: 2 nodes, each with 17 CPU cores and 52 GiB RAM
   - Large: 2 nodes, each with 28 CPU cores and 62 GiB RAM

   Minimum sizing if TimescaleDB PVs have no node affinity (pods can move to other nodes):

   - Medium: 2 nodes, each with 15 CPU cores and 36 GiB RAM
   - Large: 2 nodes, each with 26 CPU cores and 46 GiB RAM

4. Assign the target nodes to ClickHouse.

   For each target `$node`, run:

   ```
   kubectl taint node $node dedicated=clickhouse-nodes:NoSchedule
   kubectl label node $node instancegroup=clickhouse-nodes
   ```
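A quick way to compare the target nodes against the minimums in step 3 is to read their allocatable resources. The node names below are placeholders.

```
# Allocatable CPU and memory for the nodes being assigned to ClickHouse
kubectl get nodes worker-1 worker-2 \
  -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
```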
During and after the upgrade
After you start the standard upgrade procedure:
- Allow the ClickHouse schema and migration jobs to run to completion.
- Do not manually scale down or remove legacy data stores while migration is still in progress.
- Monitor the new ClickHouse workloads and core platform services until the deployment stabilizes.
Expect the operator to clean up legacy TimescaleDB and Loki workloads only after the migration has completed successfully.
After the upgrade finishes, validate both platform health and data continuity.
Check migration and platform health
After a successful migration, some legacy PVCs can remain because the operator does not delete unused PVCs automatically. This can include old Loki, TimescaleDB, and downsampled-metrics PVCs such as:
- `loki-data-loki-0`
- `store-downsampled-metrics-bucketed-stream-*`
- `store-downsampled-metrics-raw-stream-*`
- `timescale-ha-data-timescale-*`
- `timescale-ha-tablespace-data-*`
- `timescale-ha-wal-timescale-*`
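To check whether any of these legacy PVCs remain, a listing like the following can be used. The namespace is an assumption; use your ITRS Analytics namespace.

```
NAMESPACE=itrs   # assumption
kubectl get pvc -n $NAMESPACE | grep -E 'loki|timescale|downsampled'
```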
Delete these PVCs only when all of the following are true:
- the 2.18 upgrade completed successfully
- all migration jobs completed successfully
- expected data is present in the new platform workloads
- valid backups have been taken
Example checks:
```
kubectl get jobs -n <namespace>
kubectl get pods -n <namespace>
```
Pay particular attention to any `config-data-migration` or `chmigration-*` jobs that are still running, failed, or repeatedly restarted.
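If a migration job looks unhealthy, its description and logs usually show why. In the sketch below, the namespace and the job name are placeholders; substitute the actual names reported by the commands above.

```
NAMESPACE=itrs   # assumption; use your ITRS Analytics namespace

# Filter for migration-related jobs
kubectl get jobs -n $NAMESPACE | grep -E 'config-data-migration|chmigration'

# Inspect a specific job and its logs (job name is a placeholder)
kubectl describe job -n $NAMESPACE chmigration-example
kubectl logs -n $NAMESPACE job/chmigration-example
```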
Validate data access and integration behavior
Confirm that users can still access:
- current and historical metrics
- signal timelines and latest signals
- log search and log source discovery
- audit event queries
- dashboards and data views that depend on entity and metric queries
If you use custom integrations, validate them against a representative sample of:
- metric queries
- status metric queries
- entity filters
- DPD subscriptions or task definitions
- any collector ingress or `NodePort` exposure that custom senders still depend on
Validate retention and storage behavior
After the platform is stable, confirm that:
- expected ClickHouse PVCs exist and are sized correctly
- retention settings are being applied as expected
- storage growth is tracking normally after the migration
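As a starting point for these checks, you can list the ClickHouse-backed PVCs and track their growth over time. The namespace is an assumption.

```
NAMESPACE=itrs   # assumption; use your ITRS Analytics namespace

# Provisioned ClickHouse volumes and their requested sizes
kubectl get pvc -n $NAMESPACE | grep -i clickhouse
```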
Note
Version 2.18.0 relaxes upgrade blocking around changed `diskSize` and `storageClass` values, but changing those configuration values does not resize existing volumes automatically. If you need larger volumes, expand the underlying PVCs separately using your storage platform's supported procedure.
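If your storage class supports expansion (`allowVolumeExpansion: true`), an existing PVC can usually be grown by patching its storage request. The PVC name, namespace, and size below are placeholders; follow your storage platform's documented procedure.

```
# Check whether the storage class allows volume expansion
kubectl get storageclass

# Request a larger size on an existing PVC (name, namespace, and size are placeholders)
kubectl patch pvc clickhouse-metrics-example -n itrs \
  -p '{"spec":{"resources":{"requests":{"storage":"500Gi"}}}}'
```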