Sample configuration for AWS EC2 with NGINX Ingress controller (small, no HA)

Download this sample configuration provided by ITRS for AWS EC2 installations with High Availability (HA) disabled. It is sized for approximately 250,000 entities, 500,000 time series, and 10,000 datapoints/sec.
# Example ITRS Analytics configuration for AWS EC2 handling ~250,000 entities, ~500,000 time series,
# ~10,000 datapoints/sec (tested with approximately 8100 metrics/sec, 2000 logs/sec, 50 signals/sec, 50 audit events/sec),
# and ~5000 spans/sec (pre-sampling).
# Actual ingestion composition may vary by deployment.
#
# NODE REQUIREMENTS:
# - Total capacity needed: ~25 cores / ~65 GiB requests (~65 cores / ~85 GiB limits)
# - These totals include optional Linkerd sidecar resources
# - Minimum per node: 16 cores / 32 GiB
# - Example: (4) c5.4xlarge (16 cores / 32 GiB) or equivalent
#
# DISK REQUIREMENTS:
# Estimated disk requirements based on default retention and the ingestion rate above
# (actual size will vary depending on the shape of the data being ingested).
# - Kafka broker: 100 GiB
# - Kafka controller: 10 GiB
# - Postgres: 3 GiB
# - ClickHouse Keeper: 2 GiB
# - ClickHouse Platform: 50 GiB
# - ClickHouse Metrics: 400 GiB
# - ClickHouse Logs: 200 GiB
# - ClickHouse Traces: 50 GiB
# - etcd: 16 GiB
#
# The configuration references a default storage class named `gp3`, which uses EBS gp3 volumes. This storage class should
# be configured with the default minimum gp3 settings of 3000 IOPS and 125 MiB/s throughput. You can create
# this class or change the config to use a class of your own, but it should offer similar performance.
#
# This configuration is based upon a certain number of IAX entities, average metrics per entity, and
# average metrics collection interval. The following formula can be used to estimate the expected load:
#
# metrics/sec = (IAX entities * metrics/entity) / average metrics collection interval
#
# In this example configuration, we have the following:
#
# 10,000 metrics/sec = (250,000 IAX entities * 1 metric/entity) / 25 seconds average metrics collection interval
#
# NOTE: Ingestion, storage, and retrieval of OpenTelemetry spans is a beta feature.
#
# Additionally, the configuration is based upon a certain number of OpenTelemetry spans per second that are sampled
# based upon the following rules:
# - Error traces are always sampled
# - Target sampling probability per endpoint (corresponds to the name of the root span) is 0.01
# - Target sampling rate / second / endpoint (corresponds to the name of the root span) is 0.5
# - Root span duration outlier quantile is 0.95. The durations of all root spans are tracked and used to estimate
#   which spans are abnormally long
#
# UPGRADE NOTE: Timescale and Loki are no longer required for fresh installs (v2.18+).
# If upgrading from a pre-2.18 deployment, these workloads must remain and require additional resources:
# - Timescale: additional resource ~2 cores / ~8 GiB (requests = limits)
# - Loki: additional resource requests ~500m / ~1 GiB, with limits of ~1 core / ~8 GiB
# Additional disk requirements (sizes will vary based on existing deployment):
# - Timescale:
#   - 4 x timeseries data disks
#   - 1 x data disk
#   - 1 x WAL disk
# - Loki: 1 x data disk
#
# If upgrading from a pre-2.18 deployment, uncomment the timescale and loki sections at the bottom of this file
# and include the additional resources and disks listed under "UPGRADE NOTE" above.
#
defaultStorageClass: "gp3"
apps:
  externalHostname: "iax.mydomain.internal"
  ingress:
    className: "nginx"
ingestion:
  externalHostname: "iax-ingestion.mydomain.internal"
  ingress:
    className: "nginx"
    annotations:
      nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
      nginx.ingress.kubernetes.io/use-regex: "true"
    usePathRegex: true
  producerProperties:
    buffer.memory: 67108864
  resources:
    requests:
      memory: "512Mi"
      cpu: "500m"
    limits:
      memory: "1Gi"
      cpu: "750m"
  traces:
    jvmOpts: "-XX:MaxDirectMemorySize=120M -XX:MaxRAMPercentage=75"
    producerProperties:
      buffer.memory: 67108864
    resources:
      requests:
        memory: "1500Mi"
        cpu: "1"
      limits:
        memory: "2500Mi"
        cpu: "2"
iam:
  ingress:
    className: "nginx"
kafka:
  diskSize: "100Gi"
  resources:
    requests:
      memory: "3Gi"
      cpu: "1"
    limits:
      memory: "3Gi"
      cpu: "2"
sinkd:
  metrics:
    jvmOpts: "-XX:MaxDirectMemorySize=200M"
    consumerProperties:
      fetch.max.bytes: 20971520
      fetch.max.wait.ms: 250
      fetch.min.bytes: 5242880
      max.partition.fetch.bytes: 5242880
      max.poll.records: 100000
    resources:
      requests:
        memory: "768Mi"
      limits:
        memory: "1200Mi"
  entities:
    resources:
      limits:
        memory: "1200Mi"
  logs:
    resources:
      requests:
        memory: "512Mi"
  signals:
    consumerProperties:
      max.partition.fetch.bytes: 1048576
  traces:
    consumerProperties:
      max.poll.records: 20000
    resources:
      requests:
        memory: "756Mi"
        cpu: "100m"
      limits:
        memory: "1200Mi"
        cpu: "1"
platformd:
  resources:
    requests:
      memory: "1536Mi"
      cpu: "1"
    limits:
      memory: "2Gi"
      cpu: "2250m"
dpd:
  jvmOpts: "-XX:MaxRAMPercentage=70"
  secondLevelEntityCacheHeapPercent: 10
  hazelcast:
    jetIdleCooperativeMinMicroSeconds: 1000
    jetIdleCooperativeMaxMicroSeconds: 10000
    jetIdleNonCooperativeMinMicroSeconds: 1000
    jetIdleNonCooperativeMaxMicroSeconds: 10000
  consumerProperties:
    fetch.min.bytes: 524288
  metricsMultiplexer:
    maxFilterResultCacheSize: 500000
    maxConcurrentOps: 100
  resources:
    requests:
      memory: "4Gi"
      cpu: "1"
    limits:
      memory: "5Gi"
      cpu: "2"
entityStream:
  intermediate:
    consumerProperties:
      max.partition.fetch.bytes: 1048576
    producerProperties:
      buffer.memory: 67108864
    storedEntitiesCacheSize: 1000
  final:
    consumerProperties:
      max.partition.fetch.bytes: 1048576
    producerProperties:
      buffer.memory: 67108864
    resources:
      requests:
        memory: "1350Mi"
        cpu: "300m"
      limits:
        memory: "2Gi"
        cpu: "3"
signalsStream:
  consumerProperties:
    max.partition.fetch.bytes: 1048576
  resources:
    requests:
      memory: "768Mi"
      cpu: "150m"
    limits:
      memory: "1536Mi"
      cpu: "1200m"
etcd:
  diskSize: "16Gi"
clickhouse:
  traces:
    diskSize: "50Gi"
    resources:
      limits:
        cpu: "2"
        memory: "10Gi"
      requests:
        cpu: "2"
        memory: "10Gi"
  metrics:
    diskSize: "400Gi"
    resources:
      limits:
        cpu: "3"
        memory: "8Gi"
      requests:
        cpu: "3"
        memory: "8Gi"
  platform:
    diskSize: "50Gi"
    resources:
      limits:
        cpu: "3"
        memory: "10Gi"
      requests:
        cpu: "3"
        memory: "10Gi"
  logs:
    diskSize: "200Gi"
    resources:
      limits:
        cpu: "4"
        memory: "8Gi"
      requests:
        cpu: "4"
        memory: "8Gi"
statusMetricsStream:
  resources:
    limits:
      memory: "1280Mi"
    requests:
      memory: "768Mi"
#
# The Timescale configs need to be enabled if upgrading from a pre-2.18 deployment.
# The following is an example config. The actual configs should match the existing deployment,
# except that the resource values below have been reduced since Timescale only serves reads during migration.
#
#timescale:
#  sharedBuffersPercentage: 40
#  dataDiskSize: "50Gi"
#  timeseriesDiskCount: 4
#  timeseriesDiskSize: "512Gi"
#  walDiskSize: "50Gi"
#  resources:
#    requests:
#      memory: "8Gi"
#      cpu: "2"
#    limits:
#      memory: "8Gi"
#      cpu: "2"
#
# The Loki configs need to be enabled if upgrading from a pre-2.18 deployment.
# The following is an example config. The actual configs should match the existing deployment,
# except that the resource values below may be increased depending on the volume of the logs.
# Loki memory limits depend on chunk data volume. A reasonable guideline is 2-4x the chunk data size,
# with a minimum of 4 GiB. The 8 GiB limit below assumes chunk data of ~2-4 GiB.
# Adjust if your deployment has significantly larger log volume.
#
#loki:
#  diskSize: "30Gi"
#  retentionTime: "168h"
#  resources:
#    limits:
#      cpu: "1"
#      memory: "8Gi"
#    requests:
#      cpu: "500m"
#      memory: "1Gi"
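
The two sizing rules stated in the comments of this file (the metrics/sec formula and the Loki memory guideline) can be checked against your own numbers with a short script. The following is a minimal sketch in Python, not an ITRS tool: the function names and the `factor` parameter are hypothetical and exist only for illustration, and the arithmetic simply restates the formula and guideline from the comments above.

```python
def expected_metrics_per_sec(entities: int,
                             metrics_per_entity: float,
                             interval_seconds: float) -> float:
    """metrics/sec = (IAX entities * metrics/entity) / average collection interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return entities * metrics_per_entity / interval_seconds


def loki_memory_limit_gib(chunk_data_gib: float, factor: float = 2.0) -> float:
    """Guideline from the Loki comment: 2-4x the chunk data size, minimum 4 GiB.

    `factor` is an illustrative parameter (assumed range 2 to 4), not a Loki
    or ITRS Analytics setting.
    """
    if not 2.0 <= factor <= 4.0:
        raise ValueError("factor should be between 2 and 4")
    return max(4.0, factor * chunk_data_gib)


if __name__ == "__main__":
    # Values used by this sample configuration.
    print(expected_metrics_per_sec(250_000, 1, 25))
    # ~4 GiB of chunk data at the low end of the 2-4x range.
    print(loki_memory_limit_gib(4.0, factor=2.0))
```

If the estimated load is well above the ~10,000 metrics/sec this sample targets, treat the resource and disk values in this file as a starting point only and consider a larger sample configuration.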