Sample configuration for AWS EC2 handling 125k entities and 50k metrics/sec (medium) with NGINX Ingress controller

Download this sample configuration, provided by ITRS, for AWS EC2 handling 125k entities and 50k metrics/sec (medium).

# Example ITRS Analytics configuration for AWS EC2 handling ~500,000 entities, ~2,000,000 time series,
# ~50,000 datapoints/sec (tested with approximately 40,000 metrics/sec, 10,000 logs/sec, 100 signals/sec,
# and 100 audit events/sec), and ~10,000 spans/sec (pre-sampling).
# Actual ingestion composition may vary by deployment.
#
# NODE REQUIREMENTS:
# Total capacity needed: ~60 cores / ~138 GiB requests (~130 cores / ~178 GiB limits)
# These totals include optional Linkerd sidecar resources
# ClickHouse nodes:
# - Total capacity needed: 30 cores / 72 GiB (requests = limits)
# - Minimum per node: 16 cores / 64 GiB
# - Example: (2) m5.4xlarge (16 cores / 64 GiB) or equivalent
# - NOTE: ClickHouse workloads request 15 cores per node (one replica each of metrics, logs, platform, and traces).
#   On a 16-core node, only 1 core remains for node overhead (kubelet, kube-proxy, containerd, OS, DaemonSets
#   such as logging/monitoring agents, Linkerd sidecars, etc.). If your cluster runs additional DaemonSets
#   or admin tooling, consider larger nodes (e.g., m5.8xlarge).
# Non-ClickHouse nodes:
# - Total capacity needed: ~30 cores / ~66 GiB requests (~100 cores / ~106 GiB limits)
# - Minimum per node: 8 cores / 16 GiB
# - Example: (5) c5.4xlarge (16 cores / 32 GiB) or equivalent
#
# HA CONFIGURATION NOTE:
# This configuration provides seamless HA for service layer workloads (2 replicas minimum for stateless services).
#
# ClickHouse workloads (chmetrics, chplatform, chlogs, chtraces) each run 2 replicas in an active-active
# configuration — both replicas serve reads and writes simultaneously. There is no primary/standby distinction.
# Data is replicated asynchronously between replicas via ReplicatedMergeTree using ClickHouse Keeper (chkeeper).
#
# With 2 replicas, losing one replica has no immediate impact — reads and writes continue on the surviving
# replica without failover. The failed replica re-syncs automatically when it restarts (~2-5 minutes depending
# on data volume accumulated during the outage).
#
# ClickHouse Keeper (chkeeper) runs as a 3-node consensus cluster. Losing 1 of 3 keepers maintains quorum
# and has no impact on reads or writes. Losing 2 of 3 keepers breaks quorum, blocking replication and
# distributed DDL but NOT local reads or writes on individual ClickHouse nodes.
#
# There is no benefit to adding a 3rd ClickHouse replica purely for HA —
# 2 replicas already provide full read/write availability during single-node failure.
#
# DISK REQUIREMENTS:
# Estimated disk requirements based on default retention and the ingestion rate above
# (actual size will vary depending on the shape of the data being ingested).
# - Kafka broker: 200 GiB for each replica (x3)
# - Kafka controller: 10 GiB for each replica (x3)
# - Postgres: 3 GiB for each replica (x2)
# - ClickHouse Keeper: 2 GiB for each replica (x3)
# - ClickHouse Platform: 100 GiB for each replica (x2)
# - ClickHouse Metrics: ~2 TiB for each replica (x2), provisioned as 4 x 500 GiB disks
# - ClickHouse Logs: 500 GiB for each replica (x2)
# - ClickHouse Traces: 80 GiB for each replica (x2)
# - etcd: 16 GiB for each replica (x3)
#
# The configuration references a default storage class named `gp3` which uses EBS gp3 volumes. This storage class should
# be configured with the default minimum gp3 settings of 3000 IOPS and 125 MiB/s throughput.
#
# The configuration also references a storage class named `gp3-clickhouse` which uses EBS gp3 volumes, but with
# higher provisioned performance for ClickHouse disks. This storage class should be configured with 3000 IOPS and
# 300 MiB/s throughput.
#
# You can create these classes or change the config to use classes of your own, but they should be similar in performance.
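#
# A minimal sketch of the two storage classes, assuming the AWS EBS CSI driver (provisioner `ebs.csi.aws.com`)
# is installed in the cluster. These manifests are illustrative and are not part of this configuration file;
# the `volumeBindingMode` and `allowVolumeExpansion` values are assumptions to adjust for your cluster.
#
#   apiVersion: storage.k8s.io/v1
#   kind: StorageClass
#   metadata:
#     name: gp3
#   provisioner: ebs.csi.aws.com
#   parameters:
#     type: gp3              # gp3 baseline: 3000 IOPS and 125 MiB/s
#   volumeBindingMode: WaitForFirstConsumer
#   allowVolumeExpansion: true
#   ---
#   apiVersion: storage.k8s.io/v1
#   kind: StorageClass
#   metadata:
#     name: gp3-clickhouse
#   provisioner: ebs.csi.aws.com
#   parameters:
#     type: gp3
#     iops: "3000"
#     throughput: "300"      # MiB/s
#   volumeBindingMode: WaitForFirstConsumer
#   allowVolumeExpansion: true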
#
# This configuration is based upon a certain number of IAX entities, average metrics per entity, and
# average metrics collection interval. The following formula can be used to estimate the load to expect:
#
# metrics/sec = (IAX entities * metrics/entity) / average metrics collection interval
#
# In this example configuration, we have the following:
#
# 50,000 metrics/sec = (500,000 IAX entities * 1 metric/entity) / 10 seconds average metrics collection interval
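#
# As a further illustration with hypothetical figures (not a tested sizing), the same rate is reached with
# fewer entities that each report more metrics:
#
# 50,000 metrics/sec = (125,000 IAX entities * 4 metrics/entity) / 10 seconds average metrics collection interval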
#
# NOTE: Ingestion, storage, and retrieval of OpenTelemetry spans is a beta feature.
#
# Additionally, the configuration is based upon a certain number of OpenTelemetry spans per second that are sampled
# based upon the following rules:
# - Error traces are always sampled
# - Target sampling probability per endpoint (corresponds to the name of the root span) is 0.01
# - Target sampling rate / second / endpoint (corresponds to the name of the root span) is 0.5
# - Root span duration outlier quantile is 0.95. The durations of all root spans are tracked and used to
#   identify abnormally long spans
#
# UPGRADE NOTE: Timescale and Loki are no longer required for fresh installs (v2.18+).
# If upgrading from a pre-2.18 deployment, these workloads must remain and require additional resources:
# - Timescale: additional resources of ~4 cores / ~32 GiB (requests = limits)
# - Loki: additional resource requests of ~500m / ~1 GiB, with limits of ~1 core / ~8 GiB
# Additional disk requirements (sizes will vary based on existing deployment):
# - Timescale:
#   - 4 x timeseries data disks for each replica (x2)
#   - 1 x data disk for each replica (x2)
#   - 1 x WAL disk for each replica (x2)
# - Loki: 1 x data disk
#
# Upgrade node requirements:
# Pre-2.18 deployments include 2 dedicated Timescale nodes (tainted and labeled as `timescale-nodes`).
# There are two supported approaches for provisioning ClickHouse nodes during upgrade.
# For detailed steps, refer to the upgrade documentation.
#
# Option 1 — Add new ClickHouse nodes:
#   Provision 2 additional ClickHouse nodes (tainted and labeled as `clickhouse-nodes`),
#   e.g. (2) m5.4xlarge (16 cores / 64 GiB).
#   Total during upgrade: (5) non-ClickHouse + (2) Timescale + (2) ClickHouse = 9 nodes
#   After data migration to ClickHouse is complete, the 2 Timescale nodes can be removed (7 nodes remain).
#
# Option 2 — Reuse existing Timescale nodes:
#   Re-label and re-taint the 2 existing Timescale nodes as `clickhouse-nodes`, and relocate
#   Timescale workloads to the general (non-ClickHouse) node pool for the duration of the migration.
#   This avoids provisioning additional nodes.
#   Total during upgrade: (5) non-ClickHouse (including Timescale workloads) + (2) ClickHouse = 7 nodes
#   After data migration to ClickHouse is complete, Timescale and Loki are removed automatically,
#   freeing resources on the general nodes (7 nodes remain).
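#
# As an illustration of the label and taint that the `nodeSelector` and `tolerations` in this file expect,
# a ClickHouse node would carry fields like the following in its Kubernetes Node object. How they are applied
# (for example, through the node group definition) depends on how the cluster is provisioned.
#
#   metadata:
#     labels:
#       instancegroup: clickhouse-nodes
#   spec:
#     taints:
#       - key: dedicated
#         value: clickhouse-nodes
#         effect: NoSchedule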
#
# If upgrading from a pre-2.18 deployment, uncomment the timescale and loki sections at the bottom of this file
# and include the additional resources and disks listed under "UPGRADE NOTE" above.
#
defaultStorageClass: "gp3"
apps:
  externalHostname: "iax.mydomain.internal"
  ingress:
    className: "nginx"
ingestion:
  externalHostname: "iax-ingestion.mydomain.internal"
  replicas: 2
  ingress:
    className: "nginx"
    annotations:
      nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
      nginx.ingress.kubernetes.io/use-regex: "true"
    usePathRegex: true
  producerProperties:
    buffer.memory: 67108864
  threadPoolSize: 6
  queueCapacity: 16
  resources:
    requests:
      memory: "750Mi"
      cpu: "500m"
    limits:
      memory: "1500Mi"
      cpu: "1"
  traces:
    jvmOpts: "-XX:MaxDirectMemorySize=200M -XX:MaxRAMPercentage=75"
    resources:
      requests:
        memory: "2Gi"
        cpu: "1"
      limits:
        memory: "4Gi"
        cpu: "3"
    sampler:
      expectedSpanRate: 20000
    producerProperties:
      buffer.memory: 67108864
iam:
  keycloak:
    replicas: 2
  ingress:
    className: "nginx"
kafka:
  replicas: 3
  diskSize: "200Gi"
  resources:
    requests:
      memory: "6Gi"
      cpu: "2"
    limits:
      memory: "6Gi"
      cpu: "4500m"
  controller:
    replicas: 3
sinkd:
  metrics:
    jvmOpts: "-XX:MaxDirectMemorySize=200M"
    replicas: 3
    consumerProperties:
      fetch.max.bytes: 20971520
      fetch.max.wait.ms: 250
      fetch.min.bytes: 5242880
      max.partition.fetch.bytes: 5242880
      max.poll.records: 100000
      receive.buffer.bytes: 131072
    resources:
      requests:
        memory: "1280Mi"
        cpu: "500m"
      limits:
        memory: "2Gi"
        cpu: "1500m"
  entities:
    resources:
      limits:
        memory: "1500Mi"
      requests:
        memory: "512Mi"
  signals:
    consumerProperties:
      max.partition.fetch.bytes: 1048576
  traces:
    consumerProperties:
      max.poll.records: 20000
    resources:
      requests:
        memory: "756Mi"
        cpu: "100m"
      limits:
        memory: "1500Mi"
        cpu: "1"
  logs:
    jvmOpts: "-XX:MaxDirectMemorySize=100M"
    consumerProperties:
      fetch.max.bytes: 20971520
      fetch.max.wait.ms: 1000
      fetch.min.bytes: 5242880
      max.partition.fetch.bytes: 5242880
      max.poll.records: 100000
    resources:
      limits:
        cpu: "1"
        memory: "1200Mi"
      requests:
        cpu: "250m"
        memory: "512Mi"
platformd:
  replicas: 2
  resources:
    requests:
      memory: "1536Mi"
      cpu: "1"
    limits:
      memory: "2Gi"
      cpu: "2250m"
dpd:
  replicas: 2
  jvmOpts: "-XX:MaxRAMPercentage=70"
  secondLevelEntityCacheHeapPercent: 25
  entityCache:
    inMemoryCacheSizeMb: 256
    writeBuffers: 8
    writeBufferSizeMb: 16
  hazelcast:
    jetIdleCooperativeMinMicroSeconds: 500
    jetIdleCooperativeMaxMicroSeconds: 3000
    jetIdleNonCooperativeMinMicroSeconds: 500
    jetIdleNonCooperativeMaxMicroSeconds: 3000
  consumerProperties:
    max.poll.records: 10000
    fetch.min.bytes: 524288
  metricsMultiplexer:
    maxFilterResultCacheSize: 500000
    maxConcurrentOps: 500
  resources:
    requests:
      memory: "5Gi"
      cpu: "2"
    limits:
      memory: "6Gi"
      cpu: "3"
entityStream:
  intermediate:
    jvmOpts: "-XX:InitialRAMPercentage=40 -XX:MaxRAMPercentage=60"
    consumerProperties:
      max.partition.fetch.bytes: 1048576
    producerProperties:
      buffer.memory: 67108864
    storedEntitiesCacheSize: 10000
    replicas: 2
    resources:
      requests:
        memory: "1536Mi"
        cpu: "750m"
      limits:
        memory: "2Gi"
        cpu: "2"
    rocksdb:
      memoryMib: 200
  final:
    jvmOpts: "-XX:InitialRAMPercentage=40 -XX:MaxRAMPercentage=55"
    consumerProperties:
      max.partition.fetch.bytes: 1048576
    producerProperties:
      buffer.memory: 67108864
    replicas: 2
    storedEntitiesCacheSize: 10000
    resources:
      requests:
        memory: "1536Mi"
        cpu: "1"
      limits:
        memory: "2560Mi"
        cpu: "2"
signalsStream:
  consumerProperties:
    max.partition.fetch.bytes: 1048576
  resources:
    requests:
      memory: "830Mi"
      cpu: "150m"
    limits:
      memory: "1536Mi"
      cpu: "1200m"
etcd:
  replicas: 3
  diskSize: "16Gi"
licenced:
  replicas: 2
clickhouse:
  traces:
    replicas: 2
    diskSize: "80Gi"
    storageClass: "gp3-clickhouse"
    resources:
      limits:
        cpu: "2"
        memory: "10Gi"
      requests:
        cpu: "2"
        memory: "10Gi"
    nodeSelector:
      instancegroup: clickhouse-nodes
    tolerations:
      - key: dedicated
        operator: Equal
        value: clickhouse-nodes
        effect: NoSchedule
  metrics:
    replicas: 2
    diskSize: "500Gi"
    diskCount: 4
    storageClass: "gp3-clickhouse"
    resources:
      limits:
        cpu: "8"
        memory: "10Gi"
      requests:
        cpu: "8"
        memory: "10Gi"
    nodeSelector:
      instancegroup: clickhouse-nodes
    tolerations:
      # must match the tainted ClickHouse nodes setting
      - key: dedicated
        operator: Equal
        value: clickhouse-nodes
        effect: NoSchedule
  platform:
    replicas: 2
    diskSize: "100Gi"
    storageClass: "gp3-clickhouse"
    resources:
      limits:
        cpu: "2"
        memory: "8Gi"
      requests:
        cpu: "2"
        memory: "8Gi"
    nodeSelector:
      instancegroup: clickhouse-nodes
    tolerations:
      # must match the tainted ClickHouse nodes setting
      - key: dedicated
        operator: Equal
        value: clickhouse-nodes
        effect: NoSchedule
  logs:
    replicas: 2
    diskSize: "500Gi"
    storageClass: "gp3-clickhouse"
    resources:
      limits:
        cpu: "3"
        memory: "8Gi"
      requests:
        cpu: "3"
        memory: "8Gi"
    nodeSelector:
      instancegroup: clickhouse-nodes
    tolerations:
      # must match the tainted ClickHouse nodes setting
      - key: dedicated
        operator: Equal
        value: clickhouse-nodes
        effect: NoSchedule
  keeper:
    replicas: 3
postgres:
  clusterSize: 2
kvStore:
  replicas: 2
  resources:
    limits:
      memory: "900Mi"
    requests:
      memory: "600Mi"
statusMetricsStream:
  resources:
    limits:
      memory: "1280Mi"
    requests:
      memory: "768Mi"
  consumerProperties:
    fetch.max.bytes: 20971520
    fetch.max.wait.ms: 250
    fetch.min.bytes: 5242880
    max.partition.fetch.bytes: 5242880
    max.poll.records: 100000
#
# The Timescale configs need to be enabled if upgrading from a pre-2.18 deployment.
# The following is an example config. The actual configs should match the existing deployment,
# except that the resource values below have been reduced since Timescale only serves reads during migration.
#
#timescale:
#  sharedBuffersPercentage: 40
#  bgwLruMaxPages: 8000
#  walBuffers: 64000
#  clusterSize: 2
#  dataDiskSize: "100Gi"
#  timeseriesDiskCount: 4
#  timeseriesDiskSize: "1Ti"
#  timeseriesStorageClass: "gp3-timescale"
#  walDiskSize: "250Gi"
#  walStorageClass: "gp3-timescale"
#  maxLocksPerTransaction: 10000
#  resources:
#    requests:
#      memory: "16Gi"
#      cpu: "2"
#    limits:
#      memory: "16Gi"
#      cpu: "2"
#  # The following `nodeSelector` and `tolerations` configs can be removed if re-using Timescale nodes for ClickHouse
#  nodeSelector:
#    instancegroup: timescale-nodes
#  tolerations:
#    - key: dedicated
#      operator: Equal
#      value: timescale-nodes
#      effect: NoSchedule
#
# The Loki configs need to be enabled if upgrading from a pre-2.18 deployment.
# The following is an example config. The actual configs should match the existing deployment,
# except that the resource values below may be increased depending on the volume of the logs.
# Loki memory limits depend on chunk data volume. A reasonable guideline is 2-4x the chunk data size,
# with a minimum of 4 GiB. The 8 GiB limit below assumes chunk data of ~2-4 GiB.
# Adjust if your deployment has significantly larger log volume.
#
#loki:
#  diskSize: "30Gi"
#  retentionTime: "168h"
#  ingestionBurstSize: 9
#  ingestionRateLimit: 6
#  resources:
#    limits:
#      cpu: "1"
#      memory: "8Gi"
#    requests:
#      cpu: "500m"
#      memory: "1Gi"
#