Troubleshooting
Important
This information refers to the previous helm install method for the Obcerv Platform. If you are looking to install using the more streamlined Kubernetes Off-the-Shelf (KOTS) method, see the updated installation overview.
This page collects troubleshooting guidance for a range of scenarios. Not all steps apply to every situation.
Installation status
You may want to monitor installation progress to check for any errors or to debug a failed installation.
To do this:
1. Check the status of the installation:
   kubectl describe obcerv <instance> -n <namespace>
2. Look for a FAILED status and the corresponding reason.
3. Additional details may be found in the Obcerv operator log:
   kubectl logs -n <namespace> <obcerv-operator-pod-name> operator
4. If there are no operator errors, investigate the individual pods and cluster events, for example using the commands shown below.
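Pods and recent events in the instance's namespace can be inspected with standard kubectl commands (the placeholders are illustrative):
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
kubectl get events -n <namespace> --sort-by=.lastTimestamp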
Diagnostics
The obcerv-diagnostics.sh script can be used to gather diagnostic details about an Obcerv instance. It uses kubectl to gather pod logs and information about the cluster nodes, events, and deployed resources.
The following options are available:
Option | Description | Default |
---|---|---|
--namespace | Required. Kubernetes namespace to gather diagnostics from. | |
--help | Optional. Print the help text and exit. | |
--verbose | Optional. Print script debug info. | |
--kubeconfig | Optional. Kubeconfig file to use to connect to the cluster. | Kubeconfig in the user's home directory (${HOME}/.kube/config). |
--output-dir | Optional. Destination folder to store the output. | Current directory. |
--compress | Optional. Tar and compress the output. | |
Sample usage:
bash obcerv-diagnostics.sh -z -n <namespace> -o /tmp/obcerv
Because the diagnostics script uses kubectl, the Kubernetes user running the script needs, at minimum, read-only permissions for the following commands:
kubectl get
kubectl describe
kubectl cluster-info
An example role for the user:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: my-custom-role
  namespace: test-namespace
rules:
- apiGroups:
  - "*"
  resources: ["*"]
  # kubectl get, describe, and cluster-info only require the read-only verbs below
  verbs:
  - get
  - list
  - watch
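The role must also be bound to the Kubernetes user (or service account) that runs the diagnostics script. A minimal RoleBinding sketch, with an illustrative subject name:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-custom-role-binding
  namespace: test-namespace
subjects:
# Replace with the actual user or service account running obcerv-diagnostics.sh
- kind: User
  name: diagnostics-user
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: my-custom-role
  apiGroup: rbac.authorization.k8s.io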
Authentication error on install or upgrade
During installation and upgrades, a special realm-admin credential is used to perform a number of IAM-related configuration tasks. The password for the realm-admin user is stored in a Kubernetes Secret called iam-realm-admin-credential in the same namespace as the Obcerv instance.
Once you've installed an Obcerv instance, it is possible to modify the realm-admin password through the Keycloak admin console. However, to successfully complete subsequent configuration or upgrade tasks, you must also manually update the password stored in the iam-realm-admin-credential Kubernetes Secret. Otherwise, those tasks will fail with an authentication error.
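A sketch of how the Secret could be updated with kubectl, assuming the password is stored under a key named password (inspect the existing Secret first to confirm the key name):
# Confirm which key in the Secret holds the password
kubectl -n <namespace> get secret iam-realm-admin-credential -o yaml

# Update the password (key name assumed; adjust to match the Secret's actual key)
kubectl -n <namespace> patch secret iam-realm-admin-credential \
  --type merge -p '{"stringData":{"password":"<new-password>"}}'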
Install errors with manual TLS certificates
When installing the operator using manually created TLS certificates, you may see the following errors:
Invalid caCertificate value
You may see the following output after running the helm command to deploy the Obcerv operator:
CABundle: decode base64: illegal base64 data at input byte 76, error found in #10 byte of ...|FLS0tLS0K","service"|.
To resolve this, ensure that the caCertificate value is on a single line with no carriage returns.
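For example, on a system with GNU coreutils, a single-line base64 value can be produced from the CA certificate file (the ca.crt file name is illustrative):
# -w 0 disables line wrapping so the encoded certificate stays on one line
base64 -w 0 ca.crt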
Legacy Common Name field error
You may see the following output after running the helm command to deploy the Obcerv operator:
x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0
To resolve this, when generating the webhook signing request and certificate, use a configuration file with the CN and DNS fields set to the fully qualified service name, rather than setting the CN value directly on the command line.
Below is an example of a valid configuration file:
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
prompt = no
[req_distinguished_name]
CN = obcerv-operator-webhook.itrs.svc
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, serverAuth
subjectAltName = @alt_names
[alt_names]
DNS.1 = obcerv-operator-webhook.itrs.svc
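As an illustration, the configuration file above (saved here as webhook-csr.conf, an assumed file name) might be supplied to openssl when generating the key and signing request:
# Generate a private key and CSR whose CN and SANs come from the configuration file
openssl req -new -nodes -newkey rsa:2048 \
  -keyout webhook.key -out webhook.csr -config webhook-csr.conf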
Linkerd
Expired Linkerd certificates
If Linkerd is installed and used with the Obcerv installation, its certificates must remain valid for the Obcerv components to communicate via the Linkerd service mesh. To check the current state of the certificates and determine whether they need to be rotated within the next 60 days, run:
linkerd check --proxy
See the Maintenance section for steps on how to rotate the trust anchor for Linkerd. If rotation is not performed prior to expiry, the same steps must be followed to replace the expired certificates.
etcd-defrag job error
In an OpenShift environment, the etcd-defrag job fails when Linkerd is running.
To avoid the error, you must do one of the following:
- Disable the service mesh.
- Use a Linkerd version that is not affected by the issue: stable-2.14.10 or earlier.
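To confirm which Linkerd version is currently installed before deciding on either option, you can check the client and control plane versions:
linkerd version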
Third-party licenses
Third-party licenses in deployed containers are located under the /licenses directory and can be copied to a local directory via the following command:
kubectl -n <namespace> cp <pod>:/licenses ./licenses
Timescale performance
Large chunks
Starting in Obcerv Platform 1.4.0, individual metric names with very high throughput can cause their corresponding Timescale chunks to become too large. The symptoms of this issue are very slow writes and reads, and large drops and spikes in the Timescale container's memory usage. This can be verified by running the following command:
$ kubectl exec -it -n itrs timescale-0 -c timescale -- psql -U postgres -d obcerv -c "SELECT * FROM self_metrics.top_metric_chunks ORDER BY max_size DESC;"
 schema_name | table_name |          metric           | bucket |  max_size   | max_size_pretty
-------------+------------+---------------------------+--------+-------------+-----------------
 dynamic     | metrics270 | kafka_log_log_size_value  | raw    | 64424509440 | 60 GB
 dynamic     | metrics163 | kube_pod_fs_usage         | raw    |   776355842 | 740 MB
 dynamic     | metrics158 | kube_pod_fs_used          | raw    |   776355842 | 740 MB
 dynamic     | metrics156 | kube_pod_fs_size          | raw    |   776355842 | 740 MB
 dynamic     | metrics167 | kube_pod_fs_inodes_free   | raw    |   776355842 | 740 MB
Generally, chunks should not be significantly larger than 1Gi. Using the above as an example, the following steps can be taken to remedy the situation.
1. Identify the chunk size configuration parameter for the bucket size of the large chunk. For the raw bucket, this is 8 hours by default.
2. To get to the target size of 1Gi or less, divide the chunk size by 1Gi to calculate a dividing factor. In this example, 60 GB / 1Gi gives a factor of 60.
3. Divide the chunk size configuration parameter by this dividing factor to calculate the custom chunk size and round to the nearest 5 minute interval. In this example, 8 hours / 60 = 8 minutes, which rounds to 10 minutes.
4. From the command line, run the following command to create a custom chunk size configuration record using the metric name, bucket size, custom chunk size, and retention:
   $ kubectl exec -it -n itrs timescale-0 -c timescale -- psql -U postgres -d obcerv -c \
       "CALL set_metric_chunk_config(metric_pattern => 'kafka_log_log_size_value', storage_bucket => 'raw', size => '10 minutes');"
5. Repeat for all metric names with chunks larger than 1Gi.
6. For the new chunk sizes to take effect, run the following command:
   kubectl exec -it -n itrs timescale-0 -c timescale -- psql -U postgres -d obcerv -c "CALL configure_chunk_sizes();"
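Once new chunks have been created at the smaller size, the top_metric_chunks query used earlier can be re-run to confirm that the affected metrics no longer produce chunks significantly larger than 1Gi:
kubectl exec -it -n itrs timescale-0 -c timescale -- psql -U postgres -d obcerv -c "SELECT * FROM self_metrics.top_metric_chunks ORDER BY max_size DESC;"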
Warning
This change will only affect future chunks, and there may be degraded performance until the large chunks have closed. It may be necessary to drop the offending large chunks. Dropping chunks should be avoided unless absolutely necessary as it will result in data loss.
To drop Timescale chunks, do the following:
1. Use the hypertable name and run the following query to identify the chunk start and end time. Hypertables can use interval or integer based start and end values:
   $ kubectl exec -it -n itrs timescale-0 -c timescale -- psql -U postgres -d obcerv -c \
       "SELECT hypertable_schema, hypertable_name, range_start, range_end, range_start_integer, range_end_integer, get_relation_size(chunk_schema, chunk_name) AS size FROM timescaledb_information.chunks WHERE hypertable_name = 'metrics270' ORDER BY size DESC LIMIT 1;"

    hypertable_schema | hypertable_name |      range_start       |       range_end        | range_start_integer | range_end_integer |    size
   -------------------+-----------------+------------------------+------------------------+---------------------+-------------------+-------------
    dynamic           | metrics270      | 2023-05-30 08:00:00+00 | 2023-05-30 16:00:00+00 |                     |                   | 64424509440
2. To drop the chunk, use the following command and specify the fully-qualified table name, setting the newer_than and older_than parameters to the range_start/range_start_integer and range_end/range_end_integer values from the previous query:
   $ kubectl exec -it -n itrs timescale-0 -c timescale -- psql -U postgres -d obcerv -c \
       "SELECT drop_chunks('dynamic.metrics270', newer_than => '2023-05-30 08:00:00+00', older_than => '2023-05-30 16:00:00+00');"

                     drop_chunks
   -----------------------------------------------
    _timescaledb_internal._hyper_1671_43617_chunk
3. Repeat steps 1 and 2 for all chunks that need to be dropped.
Kubernetes RBAC permissions
If you are deploying Obcerv in a shared cluster or dividing a dedicated cluster for use with Obcerv, you can review the Kubernetes permissions required by Obcerv and the Obcerv Operator. This can help with troubleshooting or debugging potential RBAC (role-based access control) issues.
Run the following commands to understand the Kubernetes permissions required by Obcerv:
- For a dedicated cluster:
  helm install --dry-run obcerv-operator itrs/obcerv-operator -n <namespace>
- For a shared cluster:
  helm install --dry-run obcerv-operator itrs/obcerv-operator -n <namespace> --set installCRDs=false
These commands perform a dry run of the Obcerv Operator installation and list all resources that would be created, including roles and RBAC configurations. For example:
# Source: obcerv-operator/templates/rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: obcerv-operator
  namespace: itrs
  labels:
    helm.sh/chart: obcerv-operator-<version>
    app.kubernetes.io/name: obcerv-operator
    app.kubernetes.io/instance: obcerv-operator
    app.kubernetes.io/version: "<version>"
    app.kubernetes.io/managed-by: Helm
rules:
- apiGroups:
  - itrsgroup.com
  resources:
  - obcervs
  - obcervs/status
  verbs:
  - get
  - list
  - watch
  - update
  - patch
- apiGroups:
  - ""
  resources:
  - configmaps
  - endpoints
  - endpoints/restricted
  - persistentvolumeclaims
  - pods
  - secrets
  - serviceaccounts
  - services
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - apps
  resources:
  - daemonsets
  - deployments
  - statefulsets
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - batch
  resources:
  - jobs
  - cronjobs
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - rolebindings
  - roles
  verbs:
  - get
  - create
  - update
  - patch
  - delete
- apiGroups:
  - cert-manager.io
  resources:
  - issuers
  - certificates
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - ""
  resources:
  - endpoints
  - pods
  - services
  verbs:
  - get
  - list
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - get
  - list
- apiGroups:
  - ""
  resources:
  - events
  - resourcequotas
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - replicasets
  verbs:
  - get
  - list
  - watch
Analyze the RBAC section of the output and look for resources like service accounts, roles, and cluster roles. This can help you compare permissions with expected access and adjust RBAC settings accordingly.
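If a specific permission appears to be missing, kubectl auth can-i can spot-check individual verbs for the operator's service account. The service account name below is illustrative; use the one created by the chart in your namespace:
kubectl auth can-i create statefulsets -n <namespace> \
  --as=system:serviceaccount:<namespace>:obcerv-operator
kubectl auth can-i patch obcervs.itrsgroup.com -n <namespace> \
  --as=system:serviceaccount:<namespace>:obcerv-operator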