Sample configuration for AWS EC2 with NGINX Ingress controller (large)

Download this sample configuration provided by ITRS.
# Example ITRS Analytics configuration for AWS EC2 handling ~1,000,000 entities, ~4,000,000 time series,
# ~100,000 datapoints/sec (tested with approximately 81,000 metrics/sec, 20,000 logs/sec, 200 signals/sec, 200 audit events/sec);
# and ~20,000 span/sec (pre-sampling).
# Actual ingestion composition may vary by deployment.
#
# NODE REQUIREMENTS:
# Total capacity needed: ~91 cores / ~205 GiB requests (~169 cores / ~237 GiB limits)
# These totals include optional Linkerd sidecar resources
# ClickHouse nodes:
# - Total capacity needed: 52 cores / 92 GiB (requests = limits)
# - Minimum per node: 32 cores / 64 GiB
# - Example: (2) m5.8xlarge (32 cores / 128GiB) or equivalent
# - Extra node memory benefits ClickHouse query performance via OS page cache
# Non-ClickHouse nodes:
# - Total capacity needed: ~39 cores / ~113 GiB requests (~111 cores / ~145 GiB limits)
# - Minimum per node: 8 cores / 16 GiB
# - Example: (6) c5.4xlarge (16 cores / 32GiB)  or equivalent
#
# HA CONFIGURATION NOTE:
# This configuration provides seamless HA for service layer workloads (2 replicas minimum for stateless services).
#
# ClickHouse workloads (chmetrics, chplatform, chlogs, chtraces) each run 2 replicas in an active-active
# configuration — both replicas serve reads and writes simultaneously. There is no primary/standby distinction.
# Data is replicated asynchronously between replicas via ReplicatedMergeTree using ClickHouse Keeper (chkeeper).
#
# With 2 replicas, losing one replica has no immediate impact — reads and writes continue on the surviving
# replica without failover. The failed replica re-syncs automatically when it restarts (~2-5 minutes depending
# on data volume accumulated during the outage).
#
# ClickHouse Keeper (chkeeper) runs as a 3-node consensus cluster. Losing 1 of 3 keepers maintains quorum
# and has no impact on reads or writes. Losing 2 of 3 keepers breaks quorum, blocking replication and
# distributed DDL but NOT local reads or writes on individual ClickHouse nodes.
#
# There is no benefit to adding a 3rd ClickHouse replica purely for HA —
# 2 replicas already provide full read/write availability during single-node failure.
#
# DISK REQUIREMENTS:
# Estimated disk requirements based on default retention and the ingestion rate above
# (actual size will vary depending on the shape of the data being ingested).
# - Kafka broker: 400 GiB for each replica (x3)
# - Kafka controller: 10 GiB for each replica (x3)
# - Postgres: 3 GiB for each replica (x2)
# - ClickHouse Keeper: 2 GiB for each replica (x3)
# - ClickHouse Platform: 200 GiB for each replica (x2)
# - ClickHouse Metrics: 4 TiB for each replica (x2)
# - ClickHouse Logs: 1 TiB for each replica (x2)
# - ClickHouse Traces: 150 GiB for each replica (x2)
# - etcd: 16 GiB for each replica (x3)
#
# The configuration references a default storage class named `gp3` which uses EBS gp3 volumes. This storage class should
# be configured with the default minimum gp3 settings of 3000 IOPS and 125 MiB/s throughput.
#
# The configuration also references a storage class named `gp3-clickhouse` which uses EBS gp3 volumes, but with
# higher provisioned performance for ClickHouse disks. This storage class should be configured with 3000 IOPS and
# 300 MiB/s throughput.
#
# You can create these classes or change the config to use classes of your own, but they should be similar in performance.
#
# This configuration is based upon a certain number of IAX entities, average metrics per entity, and
# average metrics collection interval. The following function can be used to figure out what type of load to expect:
#
# metrics/sec = (IAX entities * metrics/entity) / average metrics collection interval
#
# In this example configuration, we have the following:
#
# 100,000 metrics/sec = (1,000,000 IAX entities * 1 metrics/entity) / 10 seconds average metrics collection interval
#
# NOTE: Ingestion, storage, and retrieval of OpenTelemetry spans is a beta feature.
#
# Additionally, the configuration is based upon a certain number of OpenTelemetry spans per second that are sampled
# based upon the following rules:
# - Error traces are always sampled
# - Target sampling probability per endpoint (corresponds to the name of the root span) is 0.01
# - Target sampling rate / second / endpoint (corresponds to the name of the root span) is 0.5
# - Root span duration outlier quantile is 0.95. The durations of all root spans are tracked and used to make guesses about
#   abnormally long spans
#
# UPGRADE NOTE: Timescale and Loki are no longer required for fresh installs (v2.18+).
# If upgrading from a pre-2.18 deployment, these workloads must remain and require additional resources:
# - Timescale: additional resource ~4 cores / ~32 GiB (requests = limits)
# - Loki: additional resource requests ~500m / ~1 GiB, with limits of ~1 core / ~8 GiB
# Additional disk requirements (sizes will vary based on existing deployment):
# - Timescale:
#   - 4 x timeseries data disks for each replica (x2)
#   - 1 x data disk for each replica (x2)
#   - 1 x WAL disk for each replica (x2)
# - Loki: 1 x data disk
#
# Upgrade node requirements:
# Pre-2.18 deployments include 2 dedicated Timescale nodes (tainted and labeled as `timescale-nodes`).
# There are two supported approaches for provisioning ClickHouse nodes during upgrade.
# For detailed steps, refer to the upgrade documentation.
#
# Option 1 — Add new ClickHouse nodes:
#   Provision 2 additional ClickHouse nodes (tainted and labeled as `clickhouse-nodes`),
#   e.g. (2) m5.8xlarge (32 cores / 128 GiB).
#   Total during upgrade: (6) non-ClickHouse + (2) Timescale + (2) ClickHouse = 10 nodes
#   After data migration to ClickHouse is complete, the 2 Timescale nodes can be removed (8 nodes remain).
#
# Option 2 — Reuse existing Timescale nodes:
#   Re-label and re-taint the 2 existing Timescale nodes as `clickhouse-nodes`, and relocate
#   Timescale workloads to the general (non-ClickHouse) node pool for the duration of the migration.
#   This avoids provisioning additional nodes.
#   Total during upgrade: (6) non-ClickHouse (including Timescale workloads) + (2) ClickHouse = 8 nodes
#   After data migration to ClickHouse is complete, Timescale and Loki are removed automatically,
#   freeing resources on the general nodes (8 nodes remain).
#
# If upgrading from a pre-2.18 deployment, uncomment the timescale and loki section at the bottom of this file
# and include additional resources and disks listed under "UPGRADE NOTE" above.
#
defaultStorageClass: "gp3"
# Enforce even replica distribution of critical workloads across availability zones in multi-zone clusters.
enforceZoneSpread: true
apps:
  externalHostname: "iax.mydomain.internal"
  ingress:
    className: "nginx"
ingestion:
  externalHostname: "iax-ingestion.mydomain.internal"
  jvmOpts: "-XX:MaxRAMPercentage=80"
  replicas: 2
  ingress:
    className: "nginx"
    annotations:
      nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
      nginx.ingress.kubernetes.io/use-regex: "true"
    usePathRegex: true
  producerProperties:
    buffer.memory: 67108864
  threadPoolSize: 10
  queueCapacity: 32
  resources:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "2Gi"
      cpu: "2"
  traces:
    jvmOpts: "-XX:MaxDirectMemorySize=500M -XX:MaxRAMPercentage=75"
    resources:
      requests:
        memory: "3Gi"
        cpu: "1"
      limits:
        memory: "5Gi"
        cpu: "3"
    sampler:
      expectedSpanRate: 30000
    producerProperties:
      buffer.memory: 67108864
iam:
  keycloak:
    replicas: 2
  ingress:
    className: "nginx"
iamd:
  replicas: 2
kafka:
  replicas: 3
  diskSize: "400Gi"
  resources:
    requests:
      memory: "12Gi"
      cpu: "3"
    limits:
      memory: "12Gi"
      cpu: "5"
  controller:
    replicas: 3
sinkd:
  metrics:
    jvmOpts: "-XX:MaxDirectMemorySize=200M"
    replicas: 4
    consumerProperties:
      fetch.max.bytes: 20971520
      fetch.max.wait.ms: 250
      fetch.min.bytes: 5242880
      max.partition.fetch.bytes: 5242880
      max.poll.records: 200000
      receive.buffer.bytes: 131072
    resources:
      requests:
        memory: "1280Mi"
        cpu: "500m"
      limits:
        memory: "2500Mi"
        cpu: "1500m"
  entities:
    resources:
      requests:
        memory: "512Mi"
      limits:
        memory: "1500Mi"
  signals:
    consumerProperties:
      max.partition.fetch.bytes: 1048576
  traces:
    consumerProperties:
      max.poll.records: 20000
    resources:
      requests:
        memory: "756Mi"
        cpu: "250m"
      limits:
        memory: "1800Mi"
        cpu: "1"
  logs:
    jvmOpts: "-XX:MaxDirectMemorySize=150M"
    consumerProperties:
      fetch.max.bytes: 20971520
      fetch.max.wait.ms: 1000
      fetch.min.bytes: 5242880
      max.partition.fetch.bytes: 5242880
      max.poll.records: 100000
    resources:
      limits:
        cpu: "1"
        memory: "1200Mi"
      requests:
        cpu: "250m"
        memory: "750Mi"
platformd:
  replicas: 2
  resources:
    requests:
      memory: "1536Mi"
      cpu: "1500m"
    limits:
      memory: "2Gi"
      cpu: "2500m"
dpd:
  replicas: 2
  jvmOpts: "-XX:MaxRAMPercentage=75"
  secondLevelEntityCacheHeapPercent: 35
  entityCache:
    inMemoryCacheSizeMb: 512
    writeBuffers: 12
    writeBufferSizeMb: 16
  hazelcast:
    jetIdleCooperativeMinMicroSeconds: 500
    jetIdleCooperativeMaxMicroSeconds: 1000
    jetIdleNonCooperativeMinMicroSeconds: 500
    jetIdleNonCooperativeMaxMicroSeconds: 1000
  consumerProperties:
    max.poll.records: 10000
    fetch.min.bytes: 524288
  metricsMultiplexer:
    maxFilterResultCacheSize: 2000000
    maxConcurrentOps: 1000
  resources:
    requests:
      memory: "6Gi"
      cpu: "2"
    limits:
      memory: "8Gi"
      cpu: "4"
entityStream:
  intermediate:
    jvmOpts: "-XX:MaxRAMPercentage=60"
    consumerProperties:
      max.partition.fetch.bytes: 2097152
      max.poll.records: 100000
    producerProperties:
      buffer.memory: 67108864
    streamProperties:
      num.stream.threads: 2
    storedEntitiesCacheSize: 12500
    replicas: 4
    resources:
      requests:
        memory: "3Gi"
        cpu: "1"
      limits:
        memory: "4Gi"
        cpu: "2"
    rocksdb:
      memoryMib: 300
  final:
    jvmOpts: "-XX:InitialRAMPercentage=40 -XX:MaxRAMPercentage=60"
    consumerProperties:
      max.partition.fetch.bytes: 1048576
    producerProperties:
      buffer.memory: 67108864
    replicas: 4
    storedEntitiesCacheSize: 10000
    resources:
      requests:
        memory: "1536Mi"
        cpu: "1"
      limits:
        memory: "2560Mi"
        cpu: "2"
signalsStream:
  consumerProperties:
    max.partition.fetch.bytes: 1048576
  resources:
    requests:
      memory: "830Mi"
      cpu: "150m"
    limits:
      memory: "1800Mi"
      cpu: "1200m"
etcd:
  diskSize: "16Gi"
  replicas: 3
licenced:
  replicas: 2
platformStatusd:
  resources:
    limits:
      memory: "800Mi"
clickhouse:
  traces:
    replicas: 2
    diskSize: "150Gi"
    storageClass: "gp3-clickhouse"
    resources:
      limits:
        cpu: "3"
        memory: "10Gi"
      requests:
        cpu: "3"
        memory: "10Gi"
    nodeSelector:
      instancegroup: clickhouse-nodes
    tolerations:
      - key: dedicated
        operator: Equal
        value: clickhouse-nodes
        effect: NoSchedule
  metrics:
    replicas: 2
    diskSize: "1Ti"
    diskCount: 4
    storageClass: "gp3-clickhouse"
    resources:
      limits:
        cpu: "16"
        memory: "16Gi"
      requests:
        cpu: "16"
        memory: "16Gi"
    nodeSelector:
      instancegroup: clickhouse-nodes
    tolerations:
      # must match the tainted ClickHouse nodes setting
      - key: dedicated
        operator: Equal
        value: clickhouse-nodes
        effect: NoSchedule
  platform:
    replicas: 2
    diskSize: "200Gi"
    storageClass: "gp3-clickhouse"
    resources:
      limits:
        cpu: "3"
        memory: "8Gi"
      requests:
        cpu: "3"
        memory: "8Gi"
    nodeSelector:
      instancegroup: clickhouse-nodes
    tolerations:
      # must match the tainted ClickHouse nodes setting
      - key: dedicated
        operator: Equal
        value: clickhouse-nodes
        effect: NoSchedule
  logs:
    replicas: 2
    diskSize: "1Ti"
    storageClass: "gp3-clickhouse"
    resources:
      limits:
        cpu: "4"
        memory: "12Gi"
      requests:
        cpu: "4"
        memory: "12Gi"
    nodeSelector:
      instancegroup: clickhouse-nodes
    tolerations:
      # must match the tainted ClickHouse nodes setting
      - key: dedicated
        operator: Equal
        value: clickhouse-nodes
        effect: NoSchedule
  keeper:
    replicas: 3
postgres:
  clusterSize: 2
kvStore:
  replicas: 2
  resources:
    limits:
      memory: "900Mi"
    requests:
      memory: "600Mi"
statusMetricsStream:
  resources:
    limits:
      memory: "1280Mi"
    requests:
      memory: "768Mi"
  consumerProperties:
    fetch.max.bytes: 20971520
    fetch.max.wait.ms: 250
    fetch.min.bytes: 5242880
    max.partition.fetch.bytes: 5242880
    max.poll.records: 200000
    receive.buffer.bytes: 131072
#
# The Timescale configs need to be enabled if upgrading from a pre-2.18 deployment.
# The following is an example config. The actual configs should match the existing deployment,
# except that the resource values below have been reduced since Timescale only serves reads during migration.
#
#timescale:
#  sharedBuffersPercentage: 40
#  bgwLruMaxPages: 8000
#  walBuffers: 64000
#  clusterSize: 2
#  dataDiskSize: "200Gi"
#  timeseriesDiskCount: 4
#  timeseriesDiskSize: "2Ti"
#  timeseriesStorageClass: "gp3-timescale"
#  walDiskSize: "300Gi"
#  walStorageClass: "gp3-timescale"
#  maxLocksPerTransaction: 10000
#  resources:
#    requests:
#      memory: "16Gi"
#      cpu: "2"
#    limits:
#      memory: "16Gi"
#      cpu: "2"
#  The following `nodeSelector` and `tolerations` configs can be removed if re-using Timescale nodes for ClickHouse
#  nodeSelector:
#    instancegroup: timescale-nodes
#  tolerations:
#    - key: dedicated
#      operator: Equal
#      value: timescale-nodes
#      effect: NoSchedule
#
# The Loki configs need to be enabled if upgrading from a pre-2.18 deployment.
# The following is an example config. The actual configs should match the existing deployment,
# except that the resource values below may be increased depending on the volume of the logs.
# Loki memory limits depend on chunk data volume. A reasonable guideline is 2-4x the chunk data size,
# with a minimum of 4 GiB. The 8 GiB limit below assumes chunk data of ~2-4 GiB.
# Adjust if your deployment has significantly larger log volume.
#
#loki:
#  diskSize: "30Gi"
#  retentionTime: "168h"
#  ingestionBurstSize: 12
#  ingestionRateLimit: 8
#  resources:
#    limits:
#      cpu: "1"
#      memory: "8Gi"
#    requests:
#      cpu: "500m"
#      memory: "1Gi"
Previous article Next article
Sample configuration for AWS EC2 with NGINX Ingress controller (large)

Was this topic helpful?

Your thoughts...

How can we improve this topic?

Your thoughts...

Thank you for your feedback!