Upgrade to version 2.18.0: ClickHouse migration guide

ITRS Analytics 2.18.0 introduces the platform-wide ClickHouse migration. During the upgrade, ITRS Analytics deploys new ClickHouse-backed workloads, migrates retained data from the legacy stores, and then retires the old data path once the migration has completed successfully. For the release notes for this required release, see the ITRS Analytics 2.18.x changelog.

Important

ITRS Analytics 2.18.0+p2.18.3 is a required release.

When you upgrade an existing deployment with retained data, ITRS Analytics migrates that data from TimescaleDB to ClickHouse. This migration requires changes to platforms and apps, as well as infrastructure changes that must be completed outside the installation process. These temporary infrastructure changes do not apply to a fresh installation because there is no TimescaleDB-to-ClickHouse data migration to run. Concentrating the migration in a single required release means there is only one upgrade path to support, which reduces risk.

What changes in version 2.18.0

Version 2.18.0 moves additional platform workloads to ClickHouse, including metrics, logs, signals, audit events, entities, and related query paths. As part of the upgrade, ITRS Analytics deploys the new ClickHouse-backed workloads and migrates retained data from the legacy stores in the background.

This means the standard upgrade can complete before all retained data has been migrated. Background migration jobs can continue running for some time afterward, and both the old and new storage backends may need to coexist until that migration finishes.

Before you upgrade

Complete the following checks before starting the standard upgrade procedure. The ClickHouse migration introduces behavior changes that can affect custom integrations, automation, and API clients.

Platform and query behavior changes

If you rely on saved filters, generated filters, or application-side query builders, validate them before upgrading.

Ingestion and collector changes

If your data pipeline or downstream tooling depends on exact field names or raw log formatting, verify the output after upgrading.

Confirm temporary migration capacity

This applies when upgrading an existing deployment with retained data. Plan enough time to monitor the background data migration after the standard upgrade completes. The migration duration depends on the amount of retained metrics, signals, and logs in the system. During and after the upgrade, the legacy data stores and the new ClickHouse workloads can run at the same time.

Before upgrading, verify that the cluster has sufficient spare capacity for both current and post-upgrade workloads. This is important because the storage required for data is expected to nearly double after the upgrade, which can exhaust available disk space and destabilize the cluster if not accounted for in advance. Check that the cluster has enough spare CPU, memory, and storage capacity.
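As a rough planning aid, the near-doubling expectation can be turned into simple arithmetic. This is a sketch only: the values and the 110% buffer below are illustrative assumptions, not ITRS-published guidance, so replace them with figures from your own cluster.

```shell
# Illustrative capacity check: can spare storage absorb a second copy of the
# retained data while both storage backends coexist?
current_data_gib=2000   # GiB currently used by the legacy data stores (placeholder)
spare_gib=2500          # GiB of unallocated storage across the cluster (placeholder)

# Require spare capacity of at least ~110% of current usage as a safety buffer.
required_gib=$(( current_data_gib * 110 / 100 ))

if [ "$spare_gib" -ge "$required_gib" ]; then
  echo "OK: ${spare_gib} GiB spare covers an estimated ${required_gib} GiB"
else
  echo "WARNING: estimated ${required_gib} GiB needed, only ${spare_gib} GiB spare"
fi
```

Treat the result as a first-pass estimate; the official resource and hardware requirements remain the authoritative baseline.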

Use the latest resource and hardware requirements and the current sample configuration files as your baseline.

Pay particular attention to storage performance. Slow or undersized storage can significantly lengthen migration time and affect ClickHouse query performance after the upgrade.
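One way to sanity-check storage performance before upgrading is a short fio run against the volume backing the new storage class. This is a sketch: `check_disk` is a hypothetical helper name and the fio parameters are generic starting points, not ITRS-published thresholds.

```shell
# Sketch of a sequential-write smoke test for a candidate ClickHouse volume.
# Requires fio to be installed on the node being tested.
check_disk() {
  local mountpoint="$1"
  fio --name=seqwrite --directory="$mountpoint" \
      --rw=write --bs=1M --size=1G \
      --ioengine=libaio --direct=1 --numjobs=1 --group_reporting
}

# Example invocation (commented out; writes 1 GiB to the target volume):
# check_disk /var/lib/clickhouse
```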

Review updated sizing configuration YAMLs

Migration effort is directly affected by the amount of retained data and by how closely your current deployment still matches the latest sample sizing. Before upgrading, review how much historical data your deployment retains and how far your configuration has drifted from the latest samples.

If historical data volume is unusually large, plan for a longer background migration period and higher temporary resource usage.

The latest sizing sample YAML files were also updated in ways that matter for upgrade planning. Before upgrading, compare your current deployment against the latest sample configuration file for your target size.

Major changes reflected in the updated sizing YAML files include revised replica counts, storage sizes, and node totals for the ClickHouse-backed workloads.

If you previously customized an older sizing file, do not assume that older replica counts, storage sizes, or node totals are still appropriate for version 2.18.0.
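A quick way to surface sizing drift is to diff the relevant sections of your current values file against the latest sample for your target size. This sketch assumes yq v4 is available; `compare_sizing` and the file names in the example are hypothetical, so adjust them to your deployment.

```shell
# Sketch: diff a YAML subtree between the deployed values file and the latest
# sample, so customized replica counts and storage sizes stand out.
compare_sizing() {
  local current="$1" sample="$2" path="${3:-.}"
  diff <(yq "$path" "$current") <(yq "$path" "$sample")
}

# Example (hypothetical file names and path):
# compare_sizing my-values.yaml sample-medium.yaml '.clickhouse'
```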

Review KOTS Admin Console storage settings

If you manage the deployment through the KOTS Admin Console, review the Storage Class Settings and Disk Allocations sections carefully before upgrading. In 2.18.x, these UI fields are updated to reflect ClickHouse-backed workloads.

Storage Class Settings

2.17.x                       2.18.x
Kafka                        Kafka
Kafka Controller             Kafka Controller
Downsampled Metrics Stream   Etcd Storage Class
Loki                         ClickHouse Traces
Etcd                         ClickHouse Metrics
Timescale Data               ClickHouse Logs
Timescale WAL                ClickHouse Keeper
Timescale Timeseries         ClickHouse Platform

Disk Allocations

2.17.x                      2.18.x
Timescale Data Disk         ClickHouse Traces Disk
Timescale WAL Disk          ClickHouse Metrics Disk
Timescale Timeseries Disk   ClickHouse Logs Disk
                            ClickHouse Keeper Disk
                            ClickHouse Entities Disk

Review collector exposure and data-type settings

Version 2.18.0 introduces separate ClickHouse-backed workloads for multiple data types. Before upgrading, review how collector traffic is exposed and which data types your collectors send.

If you previously exposed collector traffic through NodePort, review that configuration carefully. In 2.18.x, collectors are split into lossy and lossless pairs, so existing NodePort exposure for collectors may also need matching lossless configuration using the current internalLossless or otelLossless settings.
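As an illustration only, matching lossless exposure might look like the fragment below. The internalLossless and otelLossless setting names come from the current configuration options mentioned above, but the surrounding key nesting and values here are assumptions; check the sample configuration files for your release for the exact schema.

```yaml
# Hypothetical structure; verify key nesting against the sample files.
internal:
  service:
    type: NodePort        # existing lossy collector exposure
internalLossless:
  service:
    type: NodePort        # matching lossless exposure added in 2.18.x
```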

Choose an infrastructure transition strategy

This section applies only when upgrading an existing deployment with retained data where TimescaleDB runs on dedicated nodes with the dedicated=timescale-nodes:NoSchedule taint and instancegroup=timescale-nodes label. It does not apply to a fresh installation or to deployments where TimescaleDB runs on shared nodes. Choose one of the following strategies and complete it before you start the upgrade, so that the new ClickHouse workloads have enough capacity during migration.

Provision new infrastructure (Recommended strategy)

This is the simplest approach. It provides the smoothest upgrade experience and the lowest operational risk, but it requires additional infrastructure.

  1. Provision additional nodes for ClickHouse.

    Use the following minimum sizing for the new ClickHouse nodes:

    • Medium: 2 nodes, each with 15 CPU cores, 36 GiB RAM, and 2730 GiB storage
    • Large: 2 nodes, each with 26 CPU cores, 46 GiB RAM, and 5472 GiB storage
  2. Dedicate those nodes to ClickHouse.

    For each new $node, run:

    kubectl taint node $node dedicated=clickhouse-nodes:NoSchedule
    kubectl label node $node instancegroup=clickhouse-nodes
    
  3. After the upgrade and migration are complete, decommission the old TimescaleDB nodes if they are no longer needed. Do not remove the existing TimescaleDB node taints and labels before then; they should remain in place for the duration of the upgrade.

    The exact process depends on how your nodes are managed, but the generic sequence is:

    a. Drain the nodes.
    b. Confirm that the remaining nodes can admit the evicted pods.
    c. Remove the nodes.
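The generic sequence above can be sketched as a helper, assuming kubectl access and no additional bookkeeping in your node manager; `decommission_node` is a hypothetical name and the final deletion should only run after you have confirmed rescheduling.

```shell
# Sketch of the drain/confirm/remove sequence for a retired TimescaleDB node.
decommission_node() {
  local node="$1"
  # a. Drain the node (evicts pods, respecting PodDisruptionBudgets).
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  # b. Confirm nothing is left on the node and the evicted pods run elsewhere.
  kubectl get pods --all-namespaces --field-selector spec.nodeName="$node"
  kubectl get pods --all-namespaces | grep -Ev 'Running|Completed' || true
  # c. Remove the node object; deprovision the machine via your node manager.
  kubectl delete node "$node"
}
```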

Reuse existing infrastructure (Conservative strategy)

This is the more economical approach because it does not require new infrastructure up front. However, it is more operationally sensitive and should only be attempted by an experienced Kubernetes administrator.

  1. Remove the TimescaleDB node taints and labels.

    For each $node dedicated to TimescaleDB, run:

    kubectl taint node $node dedicated=timescale-nodes:NoSchedule-
    kubectl label node $node instancegroup-
    
  2. Check whether your storage driver creates persistent volumes with node affinity.

    During the upgrade, TimescaleDB resources are reduced and their node selectors and tolerations are removed so that they can move to general-purpose nodes. Some storage drivers create persistent volumes that lock a workload to a specific node. If that happens, the TimescaleDB pods may not be rescheduled to other nodes.

    # For BYOC installations
    NAMESPACE=itrs
    
    # For EC installations
    # NAMESPACE=kotsadm
    
    kubectl get pv $(kubectl get -n $NAMESPACE pvc timescale-ha-wal-timescale-0 -ojsonpath='{.spec.volumeName}') -ojsonpath='{.spec.nodeAffinity}'
    

    If this command returns a non-empty response, the TimescaleDB pods are likely to remain tied to their current nodes and may not be able to migrate.

  3. Make sure the target nodes meet the minimum capacity.

    If the target nodes do not already meet the minimums below, resize them before upgrading.

    Minimum sizing if TimescaleDB PVs have node affinity (pods stay on their current nodes):

    • Medium: 2 nodes, each with 17 CPU cores and 52 GiB RAM
    • Large: 2 nodes, each with 28 CPU cores and 62 GiB RAM

    Minimum sizing if TimescaleDB PVs have no node affinity (pods can move to other nodes):

    • Medium: 2 nodes, each with 15 CPU cores and 36 GiB RAM
    • Large: 2 nodes, each with 26 CPU cores and 46 GiB RAM
  4. Assign the target nodes to ClickHouse.

    For each target $node, run:

    kubectl taint node $node dedicated=clickhouse-nodes:NoSchedule
    kubectl label node $node instancegroup=clickhouse-nodes
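
The node-affinity check in step 2 of this strategy can be interpreted programmatically. In this sketch the kubectl output is simulated with a sample value so the logic is self-contained; in practice, capture the output of the real command instead.

```shell
# Simulated output of the nodeAffinity jsonpath query; an empty result means
# the PV is not pinned to a node. In practice: affinity=$(kubectl get pv ...)
affinity='{"required":{"nodeSelectorTerms":[{"matchFields":[{"key":"metadata.name","operator":"In","values":["node-3"]}]}]}}'

if [ -n "$affinity" ]; then
  echo "PVs pin TimescaleDB pods to their current nodes: use the larger sizing"
else
  echo "PVs are not node-bound: pods can move, use the smaller sizing"
fi
```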
    

During and after the upgrade

After you start the standard upgrade procedure:

  • Allow the ClickHouse schema and migration jobs to run to completion.
  • Do not manually scale down or remove legacy data stores while migration is still in progress.
  • Monitor the new ClickHouse workloads and core platform services until the deployment stabilizes.

Expect the operator to clean up legacy TimescaleDB and Loki workloads only after the migration has completed successfully.

After the upgrade finishes, validate both platform health and data continuity.

Check migration and platform health

After a successful migration, some legacy PVCs can remain because the operator does not delete unused PVCs automatically. This can include old Loki, TimescaleDB, and downsampled-metrics PVCs.

Delete these PVCs only after the migration has completed successfully, the operator has cleaned up the legacy workloads, and you have validated platform health and data continuity.

Example checks:

kubectl get jobs -n <namespace>
kubectl get pods -n <namespace>

Pay particular attention to any config-data-migration or chmigration-* jobs that are still running, failed, or repeatedly restarted.
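To spot incomplete jobs at a glance, the COMPLETIONS column can be parsed. The sample output below is illustrative (the job names follow the patterns mentioned above, the ages are invented); pipe the real `kubectl get jobs` output through the same awk filter.

```shell
# Sample `kubectl get jobs -n <namespace>` output, stored for illustration.
jobs_output='NAME                      COMPLETIONS   DURATION   AGE
config-data-migration     1/1           5m         2d
chmigration-metrics       0/1           6h         6h'

# Print any job whose completions have not reached the target count.
echo "$jobs_output" | awk 'NR > 1 { split($2, c, "/"); if (c[1] != c[2]) print "INCOMPLETE: " $1 }'
```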

Validate data access and integration behavior

Confirm that users can still access retained metrics, logs, signals, audit events, and entities.

If you use custom integrations, validate them against a representative sample of the saved filters, queries, and API calls they depend on.

Validate retention and storage behavior

After the platform is stable, confirm that retention policies are applied as expected and that storage usage levels off as the legacy data stores are retired.

Note

Version 2.18.0 relaxes upgrade blocking around changed diskSize and storageClass values, but changing those configuration values does not resize existing volumes automatically. If you need larger volumes, expand the underlying PVCs separately using your storage platform’s supported procedure.
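Where the storage class supports online expansion (allowVolumeExpansion: true), growing a volume is a patch on the existing PVC rather than a configuration change. A sketch, with `expand_pvc` as a hypothetical helper name:

```shell
# Sketch: request a larger size on an existing PVC. This only succeeds if the
# underlying StorageClass has allowVolumeExpansion: true.
expand_pvc() {
  local namespace="$1" pvc="$2" new_size="$3"   # e.g. itrs my-pvc 500Gi
  kubectl patch pvc "$pvc" -n "$namespace" \
    -p '{"spec":{"resources":{"requests":{"storage":"'"$new_size"'"}}}}'
}
```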
["ITRS Analytics"] ["User Guide", "Technical Reference"]

Was this topic helpful?