Upgrade to version 2.18.0: ClickHouse migration guide
ITRS Analytics 2.18.0 introduces the platform-wide ClickHouse migration. During the upgrade, ITRS Analytics deploys new ClickHouse-backed workloads, migrates retained data from the legacy stores, and then retires the old data path once the migration has completed successfully. For the package release notes for this required release, see ITRS Analytics 2.18.x changelog.
Important
ITRS Analytics 2.18.0+p2.18.3 is a required release.
When you upgrade an existing deployment with retained data, ITRS Analytics migrates that data from TimescaleDB to ClickHouse. This migration requires changes to platforms and apps, as well as infrastructure changes that must be completed outside the installation process. These temporary infrastructure changes do not apply to a fresh installation, because there is no TimescaleDB-to-ClickHouse data migration to run. Concentrating the migration in a single required release reduces the risk of supporting multiple upgrade paths.
What changes in version 2.18.0
Version 2.18.0 moves additional platform workloads to ClickHouse, including metrics, logs, signals, audit events, entities, and related query paths. As part of the upgrade:
- New ClickHouse workloads are deployed for platform data.
- Historical metrics, signals, and logs are migrated from the legacy backends.
- Configuration service data is moved away from TimescaleDB.
- Query handling for DPD, Entity Service, Latest Metrics, and other platform services is aligned with the ClickHouse data model.
The data migration runs in the background, so the standard upgrade can complete before all retained data has been migrated. Background migration jobs can continue running for some time afterward, and both the old and new storage backends may need to coexist until that migration finishes.
Before you upgrade
Complete the following checks before starting the standard upgrade procedure. The ClickHouse migration introduces behavior changes that can affect custom integrations, automation, and API clients.
Platform and query behavior changes
- Make sure your Kubernetes version is still supported by ITRS Analytics; a quick version check is sketched after this list. Version 2.18.x adds support for Kubernetes 1.34 and removes support for Kubernetes 1.27 and 1.28.
- Requests to metric query endpoints now require a namespace for metric lookups such as `GetMetrics` and `GetStatusMetrics`.
- Metric metadata and query behavior now follow the ClickHouse-backed model. If you have custom consumers of metric APIs, re-test them before production rollout.
- Metric query results now align with the normalized unit model used by ClickHouse-backed storage. Validate any automation that depends on previously returned units.
- Query services now preserve nanosecond timestamp precision.
- `NOT_EQUALS` and `NOT IN` now match only when the key is present.
- Greater-than and less-than entity expression operators are no longer supported.
If you rely on saved filters, generated filters, or application-side query builders, validate them before upgrading.
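For the Kubernetes version check in the first item of the list above, a quick way to confirm what the cluster is actually running is shown below. This is a generic kubectl sketch, not an ITRS-specific tool; confirm the supported versions against the ITRS Analytics compatibility documentation.

```
# Version reported by the API server (the authoritative cluster version)
kubectl version

# Kubelet version on each node, which can lag behind the control plane
kubectl get nodes -o wide
```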
Ingestion and collector changes
- Backticks in dimension keys, namespaces, and attribute names are normalized to underscores during ingestion.
- Some log data is normalized more aggressively during ingestion and migration, including timestamp and severity parsing and event identifier handling.
- For Geneos and FKM log data, entity and log-field handling is more normalized than in earlier releases. In particular, fields such as `sourceAlias`, `matchKey`, `ignoreKey`, and `originalSeverity` may no longer appear in exactly the same place they did before the ClickHouse migration.
- Custom senders that still depend on the removed `SendNetprobeLogBatch` path must be updated before upgrading.
If your data pipeline or downstream tooling depends on exact field names or raw log formatting, verify the output after upgrading.
Confirm temporary migration capacity
This applies when upgrading an existing deployment with retained data. Plan enough time to monitor the background data migration after the standard upgrade completes. The migration duration depends on the amount of retained metrics, signals, and logs in the system. During and after the upgrade, the legacy data stores and the new ClickHouse workloads can run at the same time.
Before upgrading, verify that the cluster has sufficient spare capacity for both the current and post-upgrade workloads. In practice, the storage required for platform data is expected to nearly double after the upgrade, which can significantly reduce available disk space and affect cluster stability if it is not accounted for in advance. Check that the cluster has enough spare:
- CPU
- memory
- storage capacity
- storage performance
Use the latest resource and hardware requirements and the current sample configuration files as your baseline.
Pay particular attention to storage performance. Slow or undersized storage can significantly lengthen migration time and affect ClickHouse query performance after the upgrade.
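A quick way to gauge current headroom before the upgrade is with standard kubectl commands. This is a minimal sketch: it assumes the metrics-server is installed for `kubectl top`, and it does not replace the documented sizing review.

```
# Current CPU and memory usage per node (requires metrics-server)
kubectl top nodes

# Requested resources versus allocatable capacity on each node
kubectl describe nodes | grep -A 8 "Allocated resources"

# Existing persistent volume claims and their provisioned sizes
kubectl get pvc -A
```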
Review updated sizing configuration YAMLs
Migration effort is directly affected by the amount of retained data and by how closely your current deployment still matches the latest sample sizing. Before upgrading, review:
- metrics retention
- signal retention
- log retention
- current TimescaleDB and Loki disk usage
If historical data volume is unusually large, plan for a longer background migration period and higher temporary resource usage.
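A rough way to gauge how much retained data the migration has to move is to inspect the existing TimescaleDB and Loki volumes. In the sketch below, the namespace and the pod name are assumptions; adjust them to your deployment.

```
NAMESPACE=itrs   # assumption; use your ITRS Analytics namespace

# Provisioned sizes of the legacy TimescaleDB and Loki volumes
kubectl get pvc -n $NAMESPACE | grep -E 'timescale|loki'

# Actual disk usage inside the primary TimescaleDB pod (pod name is an assumption)
kubectl exec -n $NAMESPACE timescale-0 -- df -h
```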
The latest sizing sample YAML files were also updated in ways that matter for upgrade planning. Before upgrading, compare your current deployment against the latest sample configuration file for your target size.
Major changes reflected in the updated sizing YAML files include:
- more explicit node-count, CPU, memory, and storage estimates in the file headers for each size
- HA guidance for service-layer workloads, not only database replicas
- higher or more explicit replica counts for ingestion, `sinkd`, `dpd`, downsampled metrics, and entity-stream workloads in HA-oriented samples
- clearer per-workload storage expectations for Kafka brokers and controllers, PostgreSQL, etcd, ClickHouse Keeper, ClickHouse traces, and downsampled metrics; TimescaleDB and Loki remain only as temporary migration workloads for upgrades with existing data, and their storage is not re-sized as part of this change
- dedicated Timescale node placement and higher-performance Timescale storage guidance in larger AWS sample files where applicable
If you previously customized an older sizing file, do not assume that older replica counts, storage sizes, or node totals are still appropriate for version 2.18.0.
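A simple way to spot drift is to diff your customized values against the latest sample for your target size. Both file names below are placeholders; substitute the 2.18.0 sample file for your size and your own values file.

```
# Compare the customized values file against the latest published sample
diff -u sample-medium.yaml my-values.yaml
```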
Review KOTS Admin Console storage settings
If you manage the deployment through the KOTS Admin Console, review the Storage Class Settings and Disk Allocations sections carefully before upgrading. In 2.18.x, these UI fields are updated to reflect ClickHouse-backed workloads.
Storage Class Settings
| 2.17.x | 2.18.x |
|---|---|
| Kafka | Kafka |
| Kafka Controller | Kafka Controller |
| Downsampled Metrics Stream | Etcd Storage Class |
| Loki | ClickHouse Traces |
| Etcd | ClickHouse Metrics |
| Timescale Data | ClickHouse Logs |
| Timescale WAL | ClickHouse Keeper |
| Timescale Timeseries | ClickHouse Platform |
Disk Allocations
| 2.17.x | 2.18.x |
|---|---|
| Timescale Data Disk | ClickHouse Traces Disk |
| Timescale WAL Disk | ClickHouse Metrics Disk |
| Timescale Timeseries Disk | ClickHouse Logs Disk |
| | ClickHouse Keeper Disk |
| | ClickHouse Entities Disk |
Review collector exposure and data-type settings
Version 2.18.0 introduces separate ClickHouse-backed workloads for multiple data types. Before upgrading:
- confirm whether logs are meant to stay enabled
- confirm whether traces are meant to stay enabled
- keep those settings consistent throughout the upgrade
If you previously exposed collector traffic through NodePort, review that configuration carefully. In 2.18.x, collectors are split into lossy and lossless pairs, so existing NodePort exposure for collectors may also need matching lossless configuration using the current `internalLossless` or `otelLossless` settings.
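To see which collector services are currently exposed through NodePort, a generic listing like the one below can help. The namespace is an assumption, and service names in your deployment may differ.

```
NAMESPACE=itrs   # assumption; use your ITRS Analytics namespace

# List NodePort services so lossy and lossless collector exposure can be compared
kubectl get svc -n $NAMESPACE -o wide | grep NodePort
```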
Choose an infrastructure transition strategy
This section applies only when upgrading an existing deployment with retained data where TimescaleDB runs on dedicated nodes with the `dedicated=timescale-nodes:NoSchedule` taint and `instancegroup=timescale-nodes` label; a quick check is shown after the list below. It does not apply to a fresh installation or to deployments where TimescaleDB runs on shared nodes. Choose one of the following strategies and complete it before you start the upgrade. This ensures that the new ClickHouse workloads have enough capacity during migration.
- Add additional nodes for ClickHouse first. This is the recommended approach.
- Reuse existing infrastructure. This avoids provisioning new nodes up front, but it is more complex to manage.
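To confirm whether this section applies to your cluster, check for the TimescaleDB node label and taint directly. This is a generic kubectl sketch.

```
# Nodes carrying the TimescaleDB instance-group label
kubectl get nodes -l instancegroup=timescale-nodes

# Taints on those nodes; look for dedicated=timescale-nodes:NoSchedule
kubectl describe nodes -l instancegroup=timescale-nodes | grep Taints
```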
Add ClickHouse nodes first (Recommended strategy)
This is the simplest approach. It provides the smoothest upgrade experience and the lowest operational risk, but it requires additional infrastructure.
1. Provision additional nodes for ClickHouse.

   Use the following minimum sizing for the new ClickHouse nodes:

   - Medium: 2 nodes, each with 15 CPU cores, 36 GiB RAM, and 2730 GiB storage
   - Large: 2 nodes, each with 26 CPU cores, 46 GiB RAM, and 5472 GiB storage

2. Dedicate those nodes to ClickHouse.

   For each new `$node`, run:

   ```
   kubectl taint node $node dedicated=clickhouse-nodes:NoSchedule
   kubectl label node $node instancegroup=clickhouse-nodes
   ```

3. After the upgrade and migration are complete, decommission the old TimescaleDB nodes if they are no longer needed. Do not remove the existing TimescaleDB node taints and labels; they should remain for the duration of the upgrade.

   The exact process depends on how your nodes are managed, but the generic sequence is:

   a. Drain the nodes.
   b. Confirm that the remaining nodes can admit the evicted pods.
   c. Remove the nodes.
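If your nodes are managed directly with kubectl rather than by a cloud node group or autoscaler, the generic sequence above might look like the following. The node name is a placeholder; repeat for each TimescaleDB node being decommissioned.

```
NODE=timescale-node-1   # placeholder node name

# a. Drain the node (evicts pods; DaemonSet pods are skipped)
kubectl drain $NODE --ignore-daemonsets --delete-emptydir-data

# b. Confirm that the evicted pods were rescheduled and are running on other nodes
kubectl get pods -A -o wide | grep -v Running

# c. Remove the node from the cluster
kubectl delete node $NODE
```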
Reuse existing infrastructure (Conservative strategy)
This is the more economical approach because it does not require new infrastructure up front. However, it is more operationally sensitive and should only be attempted by an experienced Kubernetes administrator.
1. Remove the TimescaleDB node taints and labels.

   For each `$node` dedicated to TimescaleDB, run:

   ```
   kubectl taint node $node dedicated=timescale-nodes:NoSchedule-
   kubectl label node $node instancegroup-
   ```

2. Check whether your storage driver creates persistent volumes with node affinity.

   During the upgrade, TimescaleDB resources are reduced and their node selectors and tolerations are removed so that they can move to general-purpose nodes. Some storage drivers create persistent volumes that lock a workload to a specific node. If that happens, the TimescaleDB pods may not be rescheduled to other nodes.

   ```
   # For BYOC installations
   NAMESPACE=itrs
   # For EC installations
   # NAMESPACE=kotsadm

   kubectl get pv $(kubectl get -n $NAMESPACE pvc timescale-ha-wal-timescale-0 -ojsonpath='{.spec.volumeName}') -ojsonpath='{.spec.nodeAffinity}'
   ```

   If this command returns a non-empty response, the TimescaleDB pods are likely to remain tied to their current nodes and may not be able to migrate.

3. Make sure the target nodes meet the minimum capacity.

   If the target nodes do not already meet these minimums, resize them before upgrading.

   Minimum sizing if TimescaleDB PVs have node affinity (pods stay on their current nodes):

   - Medium: 2 nodes, each with 17 CPU cores and 52 GiB RAM
   - Large: 2 nodes, each with 28 CPU cores and 62 GiB RAM

   Minimum sizing if TimescaleDB PVs have no node affinity (pods can move to other nodes):

   - Medium: 2 nodes, each with 15 CPU cores and 36 GiB RAM
   - Large: 2 nodes, each with 26 CPU cores and 46 GiB RAM

4. Assign the target nodes to ClickHouse.

   For each target `$node`, run:

   ```
   kubectl taint node $node dedicated=clickhouse-nodes:NoSchedule
   kubectl label node $node instancegroup=clickhouse-nodes
   ```
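A quick way to compare the target nodes against the minimums in step 3 is to read their allocatable resources. The node names below are placeholders.

```
# Allocatable CPU and memory for the nodes being assigned to ClickHouse
kubectl get nodes worker-1 worker-2 \
  -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
```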
During and after the upgrade
After you start the standard upgrade procedure:
- Allow the ClickHouse schema and migration jobs to run to completion.
- Do not manually scale down or remove legacy data stores while migration is still in progress.
- Monitor the new ClickHouse workloads and core platform services until the deployment stabilizes.
Expect the operator to clean up legacy TimescaleDB and Loki workloads only after the migration has completed successfully.
After the upgrade finishes, validate both platform health and data continuity.
Check migration and platform health
After a successful migration, some legacy PVCs can remain because the operator does not delete unused PVCs automatically. This can include old Loki, TimescaleDB, and downsampled-metrics PVCs such as:
- `loki-data-loki-0`
- `store-downsampled-metrics-bucketed-stream-*`
- `store-downsampled-metrics-raw-stream-*`
- `timescale-ha-data-timescale-*`
- `timescale-ha-tablespace-data-*`
- `timescale-ha-wal-timescale-*`
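To check whether any of these legacy PVCs remain, a listing like the following can be used. The namespace is an assumption; use your ITRS Analytics namespace.

```
NAMESPACE=itrs   # assumption
kubectl get pvc -n $NAMESPACE | grep -E 'loki|timescale|downsampled'
```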
Delete these PVCs only when all of the following are true:
- the 2.18 upgrade completed successfully
- all migration jobs completed successfully
- expected data is present in the new platform workloads
- valid backups have been taken
Example checks:
```
kubectl get jobs -n <namespace>
kubectl get pods -n <namespace>
```
Pay particular attention to any `config-data-migration` or `chmigration-*` jobs that are still running, failed, or repeatedly restarted.
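If a migration job looks unhealthy, its description and logs usually show why. In the sketch below, the namespace and the job name are placeholders; substitute the actual names reported by the commands above.

```
NAMESPACE=itrs   # assumption; use your ITRS Analytics namespace

# Filter for migration-related jobs
kubectl get jobs -n $NAMESPACE | grep -E 'config-data-migration|chmigration'

# Inspect a specific job and its logs (job name is a placeholder)
kubectl describe job -n $NAMESPACE chmigration-example
kubectl logs -n $NAMESPACE job/chmigration-example
```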
Validate data access and integration behavior
Confirm that users can still access:
- current and historical metrics
- signal timelines and latest signals
- log search and log source discovery
- audit event queries
- dashboards and data views that depend on entity and metric queries
If you use custom integrations, validate them against a representative sample of:
- metric queries
- status metric queries
- entity filters
- DPD subscriptions or task definitions
- any collector ingress or `NodePort` exposure that custom senders still depend on
Validate retention and storage behavior
After the platform is stable, confirm that:
- expected ClickHouse PVCs exist and are sized correctly
- retention settings are being applied as expected
- storage growth is tracking normally after the migration
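As a starting point for these checks, you can list the ClickHouse-backed PVCs and track their growth over time. The namespace is an assumption.

```
NAMESPACE=itrs   # assumption; use your ITRS Analytics namespace

# Provisioned ClickHouse volumes and their requested sizes
kubectl get pvc -n $NAMESPACE | grep -i clickhouse
```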
Note
Version 2.18.0 relaxes upgrade blocking around changed `diskSize` and `storageClass` values, but changing those configuration values does not resize existing volumes automatically. If you need larger volumes, expand the underlying PVCs separately using your storage platform's supported procedure.
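If your storage class supports expansion (`allowVolumeExpansion: true`), an existing PVC can usually be grown by patching its storage request. The PVC name, namespace, and size below are placeholders; follow your storage platform's documented procedure.

```
# Check whether the storage class allows volume expansion
kubectl get storageclass

# Request a larger size on an existing PVC (name, namespace, and size are placeholders)
kubectl patch pvc clickhouse-metrics-example -n itrs \
  -p '{"spec":{"resources":{"requests":{"storage":"500Gi"}}}}'
```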