Troubleshooting
Important
This information refers to the previous helm install method for the Obcerv Platform. If you are looking to install using the more streamlined Kubernetes Off-the-Shelf (KOTS) method, see the updated installation overview.
This page collects troubleshooting guidance for a range of scenarios. Not all steps apply to every situation.
Installation status
You may want to monitor installation progress to check for any errors or to debug a failed installation.
To do this:
1. Check the status of the installation:
   kubectl describe obcerv <instance> -n <namespace>
2. Look for a FAILED status and the corresponding reason.
3. Additional details may be found in the Obcerv operator log:
   kubectl logs -n <namespace> <obcerv-operator-pod-name> operator
4. If there are no operator errors, investigate the individual pods and cluster events, for example using the commands shown below.
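Pods and recent events in the instance's namespace can be inspected with standard kubectl commands (the placeholders are illustrative):
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
kubectl get events -n <namespace> --sort-by=.lastTimestamp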
Diagnostics
The obcerv-diagnostics.sh script can be used to gather diagnostic details about an Obcerv instance. It uses kubectl to gather pod logs and information about the cluster nodes, events, and deployed resources.
The following options are available:
Option | Description | Default |
---|---|---|
--namespace | Required. Kubernetes namespace to gather diagnostics from. | |
--help | Optional. Print the help text and exit. | |
--verbose | Optional. Print script debug info. | |
--kubeconfig | Optional. Kubeconfig file to use to connect to the cluster. | Kubeconfig in the user's home directory (${HOME}/.kube/config). |
--output-dir | Optional. Destination folder to store the output. | Current directory. |
--compress | Optional. Tar and compress the output. | |
Sample usage:
bash obcerv-diagnostics.sh -z -n <namespace> -o /tmp/obcerv
Because the diagnostics script uses kubectl, the Kubernetes user running the script needs, at minimum, read-only permissions for the following commands:
kubectl get
kubectl describe
kubectl cluster-info
An example role for the user:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: my-custom-role
  namespace: test-namespace
rules:
- apiGroups:
  - "*"
  resources: ["*"]
  # kubectl get, describe, and cluster-info only require the read-only verbs below
  verbs:
  - get
  - list
  - watch
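The role must also be bound to the Kubernetes user (or service account) that runs the diagnostics script. A minimal RoleBinding sketch, with an illustrative subject name:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-custom-role-binding
  namespace: test-namespace
subjects:
# Replace with the actual user or service account running obcerv-diagnostics.sh
- kind: User
  name: diagnostics-user
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: my-custom-role
  apiGroup: rbac.authorization.k8s.io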
Authentication error on install or upgrade
During installation and upgrades, a special realm-admin credential is used to perform a number of IAM-related configuration tasks. The password for the realm-admin user is stored in a Kubernetes Secret called iam-realm-admin-credential in the same namespace as the Obcerv instance.
Once you've installed an Obcerv instance, it is possible to modify the realm-admin password through the Keycloak admin console. However, to successfully complete subsequent configuration or upgrade tasks, you must also manually update the password stored in the iam-realm-admin-credential Kubernetes Secret. Otherwise, those tasks will fail with an authentication error.
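A sketch of how the Secret could be updated with kubectl, assuming the password is stored under a key named password (inspect the existing Secret first to confirm the key name):
# Confirm which key in the Secret holds the password
kubectl -n <namespace> get secret iam-realm-admin-credential -o yaml

# Update the password (key name assumed; adjust to match the Secret's actual key)
kubectl -n <namespace> patch secret iam-realm-admin-credential \
  --type merge -p '{"stringData":{"password":"<new-password>"}}'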
Install errors with manual TLS certificates
When installing the operator using manually created TLS certificates, you may see the following errors:
Invalid caCertificate value
You may see the following output after running the helm command to deploy the Obcerv operator:
CABundle: decode base64: illegal base64 data at input byte 76, error found in #10 byte of ...|FLS0tLS0K","service"|.
To resolve this, ensure that the caCertificate value is on a single line with no carriage returns.
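For example, on a system with GNU coreutils, a single-line base64 value can be produced from the CA certificate file (the ca.crt file name is illustrative):
# -w 0 disables line wrapping so the encoded certificate stays on one line
base64 -w 0 ca.crt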
Legacy Common Name field error
You may see the following output after running the helm command to deploy the Obcerv operator:
x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0
To resolve this, when generating the webhook signing request and certificate, use a configuration file with the CN and DNS fields set to the fully qualified service name, rather than setting the CN value directly on the command line.
Below is an example of a valid configuration file:
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
prompt = no
[req_distinguished_name]
CN = obcerv-operator-webhook.itrs.svc
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, serverAuth
subjectAltName = @alt_names
[alt_names]
DNS.1 = obcerv-operator-webhook.itrs.svc
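As an illustration, the configuration file above (saved here as webhook-csr.conf, an assumed file name) might be supplied to openssl when generating the key and signing request:
# Generate a private key and CSR whose CN and SANs come from the configuration file
openssl req -new -nodes -newkey rsa:2048 \
  -keyout webhook.key -out webhook.csr -config webhook-csr.conf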
Linkerd
Expired Linkerd certificates
If Linkerd is installed and used with the Obcerv installation, its certificates must remain valid for the Obcerv components to communicate via the Linkerd service mesh. To check the current state of the certificates and determine whether they need to be rotated within the next 60 days, run:
linkerd check --proxy
See the Maintenance section for steps on how to rotate the trust anchor for Linkerd. If rotation is not performed prior to expiry, the same steps must be followed to replace the expired certificates.
etcd-defrag job error
In an OpenShift environment, the etcd-defrag job fails when Linkerd is running.
To avoid the error, you must do one of the following:
- Disable the service mesh.
- Use a Linkerd version that is not affected by the issue: stable-2.14.10 or earlier.
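To confirm which Linkerd version is currently installed before deciding on either option, you can check the client and control plane versions:
linkerd version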
Third-party licenses
Third-party licenses in deployed containers are located under the /licenses directory and can be copied to a local directory via the following command:
kubectl -n <namespace> cp <pod>:/licenses ./licenses
Timescale performance
Large chunks
Starting in Obcerv Platform 1.4.0, individual metric names with very high throughput can cause their corresponding Timescale chunks to become too large. The symptoms of this issue are very slow writes and reads, and large drops and spikes in the Timescale container's memory usage. This can be verified by running the following command:
$ kubectl exec -it -n itrs timescale-0 -c timescale -- psql -U postgres -d obcerv -c "SELECT * FROM self_metrics.top_metric_chunks ORDER BY max_size DESC;"
 schema_name | table_name |          metric           | bucket |  max_size   | max_size_pretty
-------------+------------+---------------------------+--------+-------------+-----------------
 dynamic     | metrics270 | kafka_log_log_size_value  | raw    | 64424509440 | 60 GB
 dynamic     | metrics163 | kube_pod_fs_usage         | raw    |   776355842 | 740 MB
 dynamic     | metrics158 | kube_pod_fs_used          | raw    |   776355842 | 740 MB
 dynamic     | metrics156 | kube_pod_fs_size          | raw    |   776355842 | 740 MB
 dynamic     | metrics167 | kube_pod_fs_inodes_free   | raw    |   776355842 | 740 MB
Generally, chunks should not be significantly larger than 1Gi. Using the above as an example, the following steps can be taken to remedy the situation.
1. Identify the chunk size configuration parameter for the bucket size of the large chunk. For the raw bucket, this is 8 hours by default.
2. To get to the target size of 1Gi or less, divide the chunk size by 1Gi to calculate a dividing factor. In this example, 60 GB / 1Gi gives a factor of 60.
3. Divide the chunk size configuration parameter by this dividing factor to calculate the custom chunk size and round to the nearest 5 minute interval. In this example, 8 hours / 60 = 8 minutes, which rounds to 10 minutes.
4. From the command line, run the following command to create a custom chunk size configuration record using the metric name, bucket size, custom chunk size, and retention:
   $ kubectl exec -it -n itrs timescale-0 -c timescale -- psql -U postgres -d obcerv -c \
       "CALL set_metric_chunk_config(metric_pattern => 'kafka_log_log_size_value', storage_bucket => 'raw', size => '10 minutes');"
5. Repeat for all metric names with chunks larger than 1Gi.
6. For the new chunk sizes to take effect, run the following command:
   kubectl exec -it -n itrs timescale-0 -c timescale -- psql -U postgres -d obcerv -c "CALL configure_chunk_sizes();"
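Once new chunks have been created at the smaller size, the top_metric_chunks query used earlier can be re-run to confirm that the affected metrics no longer produce chunks significantly larger than 1Gi:
kubectl exec -it -n itrs timescale-0 -c timescale -- psql -U postgres -d obcerv -c "SELECT * FROM self_metrics.top_metric_chunks ORDER BY max_size DESC;"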
Warning
This change will only affect future chunks, and there may be degraded performance until the large chunks have closed. It may be necessary to drop the offending large chunks. Dropping chunks should be avoided unless absolutely necessary as it will result in data loss.
To drop Timescale chunks, do the following:
1. Use the hypertable name and run the following query to identify the chunk start and end time. Hypertables can use interval or integer based start and end values:
   $ kubectl exec -it -n itrs timescale-0 -c timescale -- psql -U postgres -d obcerv -c \
       "SELECT hypertable_schema, hypertable_name, range_start, range_end, range_start_integer, range_end_integer, get_relation_size(chunk_schema, chunk_name) AS size FROM timescaledb_information.chunks WHERE hypertable_name = 'metrics270' ORDER BY size DESC LIMIT 1;"

    hypertable_schema | hypertable_name |      range_start       |       range_end        | range_start_integer | range_end_integer |    size
   -------------------+-----------------+------------------------+------------------------+---------------------+-------------------+-------------
    dynamic           | metrics270      | 2023-05-30 08:00:00+00 | 2023-05-30 16:00:00+00 |                     |                   | 64424509440
2. To drop the chunk, use the following command and specify the fully-qualified table name, setting the newer_than and older_than parameters to the range_start/range_start_integer and range_end/range_end_integer values from the previous query:
   $ kubectl exec -it -n itrs timescale-0 -c timescale -- psql -U postgres -d obcerv -c \
       "SELECT drop_chunks('dynamic.metrics270', newer_than => '2023-05-30 08:00:00+00', older_than => '2023-05-30 16:00:00+00');"

                     drop_chunks
   -----------------------------------------------
    _timescaledb_internal._hyper_1671_43617_chunk
3. Repeat steps 1 and 2 for all chunks that need to be dropped.
Kubernetes RBAC permissions
If you are deploying Obcerv in a shared cluster or dividing a dedicated cluster for use with Obcerv, you can review the Kubernetes permissions required by Obcerv and the Obcerv Operator. This can help with troubleshooting or debugging potential RBAC (role-based access control) issues.
Run the following commands to understand the Kubernetes permissions required by Obcerv:
- For a dedicated cluster:
  helm install --dry-run obcerv-operator itrs/obcerv-operator -n <namespace>
- For a shared cluster:
  helm install --dry-run obcerv-operator itrs/obcerv-operator -n <namespace> --set installCRDs=false
These commands perform a dry run of the Obcerv Operator installation and list all resources that would be created, including roles and RBAC configurations. For example:
# Source: obcerv-operator/templates/rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: obcerv-operator
  namespace: itrs
  labels:
    helm.sh/chart: obcerv-operator-<version>
    app.kubernetes.io/name: obcerv-operator
    app.kubernetes.io/instance: obcerv-operator
    app.kubernetes.io/version: "<version>"
    app.kubernetes.io/managed-by: Helm
rules:
- apiGroups:
  - itrsgroup.com
  resources:
  - obcervs
  - obcervs/status
  verbs:
  - get
  - list
  - watch
  - update
  - patch
- apiGroups:
  - ""
  resources:
  - configmaps
  - endpoints
  - endpoints/restricted
  - persistentvolumeclaims
  - pods
  - secrets
  - serviceaccounts
  - services
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - apps
  resources:
  - daemonsets
  - deployments
  - statefulsets
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - batch
  resources:
  - jobs
  - cronjobs
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - rolebindings
  - roles
  verbs:
  - get
  - create
  - update
  - patch
  - delete
- apiGroups:
  - cert-manager.io
  resources:
  - issuers
  - certificates
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - ""
  resources:
  - endpoints
  - pods
  - services
  verbs:
  - get
  - list
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - get
  - list
- apiGroups:
  - ""
  resources:
  - events
  - resourcequotas
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - replicasets
  verbs:
  - get
  - list
  - watch
Analyze the RBAC section of the output and look for resources like service accounts, roles, and cluster roles. This can help you compare permissions with expected access and adjust RBAC settings accordingly.
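If a specific permission appears to be missing, kubectl auth can-i can spot-check individual verbs for the operator's service account. The service account name below is illustrative; use the one created by the chart in your namespace:
kubectl auth can-i create statefulsets -n <namespace> \
  --as=system:serviceaccount:<namespace>:obcerv-operator
kubectl auth can-i patch obcervs.itrsgroup.com -n <namespace> \
  --as=system:serviceaccount:<namespace>:obcerv-operator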