Sample configuration for AWS EC2 handling 125k entities and 50k metrics/sec (medium) with NGINX Ingress controller

Download this sample configuration, provided by ITRS, for AWS EC2 handling 125k entities and 50k metrics/sec (medium).
# Example ITRS Analytics configuration for AWS EC2 handling 125k entities, 2M time series, 50k metrics/sec, and 10k
# OpenTelemetry spans/sec (pre-sampling).
#
# Nodes:
# - (2) m5.4xlarge (16 CPU, 64GiB Memory) for Timescale
# - (5) c5.4xlarge (16 CPU, 32GiB Memory) for all other workloads
#
# The resource requests for Timescale total ~16 cores and ~112GiB memory.
# The resource requests for the other workloads total ~55 cores and ~122GiB memory.
# These totals include Linkerd resources.
#
# Disk requirements:
# - Timescale:
#   - 4 x 1 TiB timeseries data disk for each replica (x2)
#   - 100 GiB data disk for each replica (x2)
#   - 250 GiB WAL disk for each replica (x2)
# - Kafka broker: 200 GiB for each replica (x3)
# - Kafka controller: 1 GiB for each replica (x3)
# - Postgres: 3 GiB for each replica (x2)
# - ClickHouse Keeper: 2 GiB for each replica (x3)
# - ClickHouse Traces: 60 GiB for each replica (x2)
# - Loki: 30 GiB
# - etcd: 1 GiB for each replica (x3)
# - Downsampled Metrics:
#   - Raw: 5 GiB for each replica (x2)
#   - Bucketed: 5 GiB for each replica (x2)
#
# The configuration references a default storage class named `gp3` which uses EBS gp3 volumes. This storage class should
# be configured with the default minimum gp3 settings of 3000 IOPS and 125 MiB/s throughput. You can create this class
# or change the config to use a class of your own, but that class should offer similar performance.
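#
# A minimal sketch of such a storage class, assuming the AWS EBS CSI driver (provisioner `ebs.csi.aws.com`) is
# installed in the cluster. It is a separate Kubernetes manifest, not part of this values file, so it is shown
# commented out. The binding mode and expansion settings are assumptions; adjust them to your cluster's policy.
#
#   apiVersion: storage.k8s.io/v1
#   kind: StorageClass
#   metadata:
#     name: gp3                               # must match defaultStorageClass below
#   provisioner: ebs.csi.aws.com
#   parameters:
#     type: gp3
#     iops: "3000"                            # gp3 baseline IOPS
#     throughput: "125"                       # gp3 baseline throughput in MiB/s
#   volumeBindingMode: WaitForFirstConsumer   # assumption, not stated by this configuration
#   allowVolumeExpansion: true                # assumption, not stated by this configuration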
#
# This configuration is based upon a certain number of Obcerv entities, average metrics per entity, and
# average metrics collection interval. The following formula can be used to estimate the expected load:
#
# metrics/sec = (Obcerv entities * metrics/entity) / average metrics collection interval
#
# In this example configuration, we have the following:
#
# 50,000 metrics/sec = (125,000 Obcerv entities * 4 metrics/entity) / 10 seconds average metrics collection interval
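#
# The same formula can be rearranged to estimate how many entities fit a given metrics rate. The figures below are
# illustrative only and are not an ITRS sizing recommendation, since entity count also affects other workloads:
#
# Obcerv entities = (metrics/sec * average metrics collection interval) / metrics/entity
#
# 62,500 Obcerv entities = (50,000 metrics/sec * 10 seconds average metrics collection interval) / 8 metrics/entity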
#
# NOTE: Ingestion, storage, and retrieval of OpenTelemetry spans is a beta feature.
#
# Additionally, the configuration is based upon a certain number of OpenTelemetry spans per second that are sampled
# based upon the following rules:
# - Error traces are always sampled
# - Target sampling probability per endpoint (corresponds to the name of the root span) is 0.01
# - Target sampling rate / second / endpoint (corresponds to the name of the root span) is 0.5
# - Root span duration outlier quantile is 0.95. The durations of all root spans are tracked and used to estimate which
#   spans are abnormally long
#

# For higher-volume installations, it is recommended to use a storage class with increased IOPS for the Timescale workload.
defaultStorageClass: "gp3"
apps:
  externalHostname: "obcerv.mydomain.internal"
  ingress:
    annotations:
      kubernetes.io/ingress.class: "nginx"
      nginx.org/mergeable-ingress-type: "master"
ingestion:
  externalHostname: "obcerv-ingestion.mydomain.internal"
  replicas: 2
  ingress:
    annotations:
      kubernetes.io/ingress.class: "nginx"
      nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
  resources:
    requests:
      memory: "512Mi"
      cpu: "500m"
    limits:
      memory: "768Mi"
      cpu: "1"
  traces:
    jvmOpts: "-XX:InitialRAMPercentage=65 -XX:MaxRAMPercentage=65 -XX:MaxDirectMemorySize=150M"
    resources:
      requests:
        memory: "3Gi"
        cpu: "2"
      limits:
        memory: "6Gi"
        cpu: "3"
iam:
  ingress:
    annotations:
      kubernetes.io/ingress.class: "nginx"
      nginx.org/mergeable-ingress-type: "minion"
kafka:
  replicas: 3
  diskSize: "200Gi"
  resources:
    requests:
      memory: "6Gi"
      cpu: "2"
    limits:
      memory: "6Gi"
      cpu: "4500m"
  controller:
    replicas: 3
timescale:
  sharedBuffersPercentage: 40
  bgwLruMaxPages: 8000
  walBuffers: 64000
  clusterSize: 2
  dataDiskSize: "100Gi"
  timeseriesDiskCount: 4
  timeseriesDiskSize: "1Ti"
  walDiskSize: "250Gi"
  maxLocksPerTransaction: 10000
  resources:
    requests:
      memory: "56Gi"
      cpu: "8"
    limits:
      memory: "56Gi"
      cpu: "8"
  nodeSelector:
    instancegroup: timescale-nodes
  tolerations:
  - key: dedicated
    operator: Equal
    value: timescale-nodes
    effect: NoSchedule
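  # The nodeSelector and tolerations above only take effect if the Timescale nodes carry a matching label and taint.
  # A minimal sketch of the relevant fields on such a Node object follows, shown commented out because it is not
  # part of this values file. The node name is a hypothetical placeholder; how the label and taint are applied
  # (node group settings, kubectl, or other tooling) depends on your cluster setup.
  #
  #   apiVersion: v1
  #   kind: Node
  #   metadata:
  #     name: timescale-node-1           # hypothetical placeholder
  #     labels:
  #       instancegroup: timescale-nodes # matched by nodeSelector
  #   spec:
  #     taints:
  #     - key: dedicated                 # matched by the toleration key
  #       value: timescale-nodes
  #       effect: NoSchedule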
  retention:
    metrics:
      chunkSize: 8h
      retention: 30d
    metrics_5m:
      chunkSize: 1d
      retention: 90d
    metrics_1h:
      chunkSize: 5d
      retention: 180d
    metrics_1d:
      chunkSize: 20d
      retention: 1y
    statuses:
      chunkSize: 7d
      retention: 1y
    signal_details:
      chunkSize: 1d
      retention: 30d
loki:
  diskSize: "30Gi"
  ingestionBurstSize: 9
  ingestionRateLimit: 6
sinkd:
  timeseriesCacheMaxSize: 2000000
  replicas: 2
  rawReplicas: 3
  jvmOpts: "-Xms1536M -Xmx1536M -XX:MaxDirectMemorySize=100M"
  rawJvmOpts: "-Xms1024M -Xmx1536M"
  resources:
    requests:
      memory: "1536Mi"
      cpu: "250m"
    limits:
      memory: "3Gi"
      cpu: "3"
  rawResources:
    requests:
      memory: "1280Mi"
      cpu: "500m"
    limits:
      memory: "3Gi"
      cpu: "3"
  metrics:
    consumerProperties:
      max.partition.fetch.bytes: 524288
      max.poll.records: 10000
  dsMetrics:
    consumerProperties:
      max.partition.fetch.bytes: 1048576
  loki:
    consumerProperties:
      max.partition.fetch.bytes: 1048576
  entities:
    consumerProperties:
      max.partition.fetch.bytes: 1048576
      max.poll.records: 75000
  signals:
    consumerProperties:
      max.partition.fetch.bytes: 1048576
  traces:
    consumerProperties:
      max.poll.records: 20000
    resources:
      requests:
        memory: "756Mi"
        cpu: "100m"
      limits:
        memory: "1500Mi"
        cpu: "1"
platformd:
  replicas: 2
  resources:
    requests:
      memory: "1536Mi"
      cpu: "1"
    limits:
      memory: "2Gi"
      cpu: "2250m"
dpd:
  replicas: 2
  jvmOpts: "-Xmx4500M"
  maxEntitySerdeCacheEntries: 1000000
  entitiesInMemoryCacheSizeMb: 256
  consumerProperties:
    max.poll.records: 10000
    fetch.min.bytes: 524288
  metricsMultiplexer:
    maxFilterResultCacheSize: 500000
    maxConcurrentOps: 500
    localParallelism: 12
  selfMonitoringThresholds:
    metrics_partition_lag_warn: 500000
    metrics_partition_lag_critical: 2500000
  resources:
    requests:
      memory: "5Gi"
      cpu: "2500m"
    limits:
      memory: "6Gi"
      cpu: "4500m"
downsampledMetricsStream:
  replicas: 2
  bucketedReplicas: 2
  bucketedJvmOpts: "-XX:InitialRAMPercentage=75 -XX:MaxRAMPercentage=75"
  consumerProperties:
    fetch.min.bytes: 524288
    max.partition.fetch.bytes: 1048576
    max.poll.records: 10000
  resources:
    requests:
      memory: "3Gi"
      cpu: "1"
    limits:
      memory: "3Gi"
      cpu: "4"
  bucketedConsumerProperties:
    fetch.min.bytes: 524288
    max.partition.fetch.bytes: 1048576
    max.poll.records: 10000
  bucketedResources:
    requests:
      memory: "3Gi"
      cpu: "1"
    limits:
      memory: "6Gi"
      cpu: "4"
  rocksdb:
    raw:
      indexAndFilterRatio: 0.5
      memoryMib: 500
      writeBufferMib: 16
      writeBufferRatio: 0.25
    bucketed:
      indexAndFilterRatio: 0.5
      memoryMib: 200
      writeBufferMib: 16
      writeBufferRatio: 0.25
entityStream:
  intermediate:
    consumerProperties:
      max.partition.fetch.bytes: 1048576
    storedEntitiesCacheSize: 10000
    replicas: 2
    resources:
      requests:
        memory: "1536Mi"
        cpu: "750m"
      limits:
        memory: "2Gi"
        cpu: "2"
    rocksdb:
      memoryMib: 200
  final:
    jvmOpts: "-XX:InitialRAMPercentage=50 -XX:MaxRAMPercentage=50"
    consumerProperties:
      max.partition.fetch.bytes: 1048576
    replicas: 2
    storedEntitiesCacheSize: 10000
    resources:
      requests:
        memory: "1536Mi"
        cpu: "1"
      limits:
        memory: "2560Mi"
        cpu: "3"
signalsStream:
  consumerProperties:
    max.partition.fetch.bytes: 1048576
  resources:
    requests:
      memory: "830Mi"
      cpu: "150m"
    limits:
      memory: "1536Mi"
      cpu: "1200m"
etcd:
  replicas: 3
collection:
  daemonSet:
    tolerations:
    # must match the tainted Timescale nodes setting
    - key: dedicated
      operator: Equal
      value: timescale-nodes
      effect: NoSchedule
latestMetricsService:
  replicas: 2
  resources:
    limits:
      cpu: "2"
      memory: "4Gi"
    requests:
      cpu: "500m"
      memory: "1Gi"
clickhouse:
  traces:
    replicas: 2
    diskSize: "60Gi"
    resources:
      limits:
        cpu: "4"
        memory: "10Gi"
      requests:
        cpu: "4"
        memory: "10Gi"
  keeper:
    replicas: 3
postgres:
  clusterSize: 2