Manual recovery: Change user IDs with strict security policies
Important
Please contact ITRS Support before performing this procedure and follow their guidance.
When this applies:
- You need to change runAsUser on an existing installation
- Your cluster has Pod Security Admission (PSA) or Gatekeeper/Kyverno policies enforcing restricted security contexts
- The ownership migration requires privileged operations that violate those security policies
Limitation: The ownership migration process requires running privileged jobs (as root with elevated capabilities) to change file ownership of persistent volumes. These jobs cannot run while strict security policies are enforced.
What gets migrated:
The migration jobs handle all TimescaleDB/PostgreSQL persistent volumes:
- PGDATA: Main database data directory (mandatory)
- WALDIR (TimescaleDB only): Separate write-ahead log volume (mandatory)
- Tablespaces (TimescaleDB only): Optional additional storage volumes for timeseries data (if timescale.timeseriesDiskCount > 0)
All volumes must have their ownership changed when runAsUser changes, otherwise the database will fail to start with permission errors.
Prerequisites
Before starting, ensure you have:
- Contact with an ITRS Support engineer
- Cluster administrator access
- A maintenance window (database pods will restart)
- A note of your current and target UID/GID values (the commands below show how to check the current values)
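A quick way to check the current values is to inspect the running database pods and the deployed Helm values. The pod, container, and release names below follow the examples used elsewhere on this page; adjust them to your deployment:
# Current UID/GID as seen inside the running database pods
kubectl exec -n itrs pgplatform-0 -c postgres -- id
kubectl exec -n itrs timescale-0 -c timescale -- id

# securityContext values currently set in the Helm release
helm get values iax -n itrs --all | grep -A 7 securityContext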
Recovery procedure
Step 1: Update security context configuration
Update your Helm values file with new UIDs. Critical: Change both runAsUser and fsGroup:
# values.yaml
securityContext:
  pod:
    runAsUser: 5000              # New UID
    runAsGroup: 5000             # New GID
    supplementalGroups: [5000]
    fsGroup: 5000                # MUST change with runAsUser
    fsGroupChangePolicy: OnRootMismatch
Apply the configuration using Helm:
helm upgrade iax itrs/iax-platform \
  --namespace itrs \
  --values values.yaml \
  --wait
Note
Pods will fail to start at this point due to ownership mismatch. This is expected.
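To confirm you are in the expected state, you can check the database pods and their logs. The pod, container, and label names are the ones used in the examples on this page; adjust them to your deployment:
# Database pods are expected to be failing (e.g., CrashLoopBackOff or Error)
kubectl get pods -n itrs -l 'app in (pgplatform, timescale)'

# Logs typically show permission or ownership errors on the data directory
kubectl logs -n itrs pgplatform-0 -c postgres --tail=20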
Step 2: Temporarily relax security policies
For Pod Security Admission:
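Before changing the enforcement level, you may want to record the namespace's existing Pod Security labels (including any audit or warn labels, if present) so you can restore them exactly in step 5:
kubectl get namespace itrs --show-labels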
# Change namespace from restricted to privileged temporarily
kubectl label namespace itrs pod-security.kubernetes.io/enforce=privileged --overwrite
# Verify
kubectl get namespace itrs -o jsonpath='{.metadata.labels.pod-security\.kubernetes\.io/enforce}'
# Should show: privileged
For Gatekeeper:
# Get your constraint names
kubectl get constraints -A
# Add itrs namespace to excludedNamespaces for each constraint
# Example for K8sPSPCapabilities constraint named "drop-all":
kubectl patch k8spspcapabilities drop-all --type=json -p='[
  {
    "op": "add",
    "path": "/spec/match/excludedNamespaces/-",
    "value": "itrs"
  }
]'

# Repeat for other constraints (e.g., K8sPSPAllowedUsers named "jpmc-allowed-user-ranges")
kubectl patch k8spspallowedusers jpmc-allowed-user-ranges --type=json -p='[
  {
    "op": "add",
    "path": "/spec/match/excludedNamespaces/-",
    "value": "itrs"
  }
]'
# Verify
kubectl get constraints -o yaml | grep -A 5 excludedNamespaces
# Should show 'itrs' in the list
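For Kyverno:
If your restricted policies are enforced by Kyverno rather than Gatekeeper, one option is a temporary PolicyException scoped to the itrs namespace. This is a minimal sketch assuming Kyverno 1.12 or later with policy exceptions enabled; the policy and rule names are placeholders that you must replace with the names reported by kubectl get clusterpolicies:
# kyverno-exception.yaml - temporary exception for the migration jobs (policy/rule names are illustrative)
apiVersion: kyverno.io/v2
kind: PolicyException
metadata:
  name: itrs-uid-migration
  namespace: itrs
spec:
  exceptions:
    - policyName: require-run-as-nonroot        # replace with your actual policy names
      ruleNames:
        - run-as-non-root
        - autogen-run-as-non-root
    - policyName: disallow-capabilities-strict
      ruleNames:
        - require-drop-all
        - autogen-require-drop-all
  match:
    any:
      - resources:
          kinds:
            - Pod
            - Job
          namespaces:
            - itrs
Apply it with kubectl apply -f kyverno-exception.yaml and remove it again in step 5.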
Step 3: Run ownership migration jobs
Create migration job manifests for PostgreSQL and TimescaleDB:
postgres-migration-job-0.yaml (for single replica or replica 0 in HA):
apiVersion: batch/v1
kind: Job
metadata:
  name: postgres-ownership-migration-0
  namespace: itrs
  annotations:
    iax.itrsgroup.com/delete-after-install: "true"
spec:
  backoffLimit: 3
  activeDeadlineSeconds: 21600
  template:
    metadata:
      labels:
        iax.itrsgroup.com/pre-upgrade-job: "true"
    spec:
      restartPolicy: Never
      securityContext:
        runAsUser: 0
        runAsGroup: 0
      containers:
        - name: fix-ownership
          image: docker.itrsgroup.com/iax/postgres:<VERSION>  # Replace <VERSION> with your platform version tag
          imagePullPolicy: IfNotPresent
          securityContext:
            runAsNonRoot: false
            allowPrivilegeEscalation: false
            capabilities:
              add:
                - CHOWN
                - FOWNER
                - DAC_OVERRIDE
                - DAC_READ_SEARCH
              drop:
                - ALL
            seccompProfile:
              type: RuntimeDefault
          env:
            - name: TARGET_UID
              value: "5000"  # Your new UID
            - name: TARGET_GID
              value: "5000"  # Your new GID
            - name: REPLICA_INDEX
              value: "0"
          command:
            - /bin/bash
            - -c
            - |
              set -e
              echo "Starting ownership migration for PostgreSQL replica ${REPLICA_INDEX}"
              echo "Target ownership: ${TARGET_UID}:${TARGET_GID}"

              PGDATA="/var/lib/postgresql/data"

              # Check if data directory is empty (fresh install, skip migration)
              if [ ! -d "${PGDATA}" ] || [ -z "$(ls -A ${PGDATA} 2>/dev/null)" ]; then
                echo "Data directory is empty - skipping migration (fresh install)"
                exit 0
              fi

              # Get current ownership of PGDATA
              CURRENT_UID=$(stat -c '%u' "${PGDATA}")
              CURRENT_GID=$(stat -c '%g' "${PGDATA}")
              echo "Current PGDATA ownership: ${CURRENT_UID}:${CURRENT_GID}"

              # Check if migration is needed
              if [ "${CURRENT_UID}" = "${TARGET_UID}" ] && [ "${CURRENT_GID}" = "${TARGET_GID}" ]; then
                echo "Ownership already correct - skipping migration"
                exit 0
              else
                echo "Migration required from ${CURRENT_UID}:${CURRENT_GID} to ${TARGET_UID}:${TARGET_GID}"
                echo "Changing ownership (may take several minutes for large databases)..."
                chown -R "${TARGET_UID}:${TARGET_GID}" "${PGDATA}"
                echo "Migration completed"
              fi

              echo ""
              echo "Migration completed successfully for replica ${REPLICA_INDEX}"
              echo "Final ownership: $(stat -c '%u:%g' ${PGDATA})"
              echo "Note: Permissions will be fixed by the pod startup script"
          volumeMounts:
            - name: pgdata
              mountPath: /var/lib/postgresql
      volumes:
        - name: pgdata
          persistentVolumeClaim:
            claimName: data-pgplatform-0
timescale-migration-job-0.yaml (for single replica or replica 0 in HA):
apiVersion: batch/v1
kind: Job
metadata:
  name: timescale-ownership-migration-0
  namespace: itrs
  annotations:
    iax.itrsgroup.com/delete-after-install: "true"
spec:
  backoffLimit: 3
  activeDeadlineSeconds: 21600
  template:
    metadata:
      labels:
        iax.itrsgroup.com/pre-upgrade-job: "true"
    spec:
      restartPolicy: Never
      securityContext:
        runAsUser: 0
        runAsGroup: 0
      containers:
        - name: fix-ownership
          image: docker.itrsgroup.com/iax/timescale:<VERSION>  # Replace <VERSION> with your platform version tag
          imagePullPolicy: IfNotPresent
          securityContext:
            runAsNonRoot: false
            allowPrivilegeEscalation: false
            capabilities:
              add:
                - CHOWN
                - FOWNER
                - DAC_OVERRIDE
                - DAC_READ_SEARCH
              drop:
                - ALL
            seccompProfile:
              type: RuntimeDefault
          env:
            - name: TARGET_UID
              value: "5000"  # Your new UID
            - name: TARGET_GID
              value: "5000"  # Your new GID
            - name: REPLICA_INDEX
              value: "0"
          command:
            - /bin/bash
            - -c
            - |
              set -e
              echo "Starting ownership migration for TimescaleDB replica ${REPLICA_INDEX}"
              echo "Target ownership: ${TARGET_UID}:${TARGET_GID}"
              echo ""

              MIGRATION_NEEDED=false

              # Process data volume
              PGDATA="/var/lib/postgresql/data"
              echo "Processing data volume: ${PGDATA}"
              if [ ! -d "${PGDATA}" ] || [ -z "$(ls -A ${PGDATA} 2>/dev/null)" ]; then
                echo "Data directory is empty - skipping (fresh install)"
              else
                # Get current ownership
                CURRENT_UID=$(stat -c '%u' "${PGDATA}")
                CURRENT_GID=$(stat -c '%g' "${PGDATA}")
                echo "Current ownership: ${CURRENT_UID}:${CURRENT_GID}"
                if [ "${CURRENT_UID}" = "${TARGET_UID}" ] && [ "${CURRENT_GID}" = "${TARGET_GID}" ]; then
                  echo "Ownership already correct"
                else
                  echo "Migration required from ${CURRENT_UID}:${CURRENT_GID} to ${TARGET_UID}:${TARGET_GID}"
                  echo "Changing ownership (may take several minutes for large databases)..."
                  MIGRATION_NEEDED=true
                  chown -R "${TARGET_UID}:${TARGET_GID}" "${PGDATA}"
                  echo "Data volume migration completed"
                fi
              fi
              echo ""

              # Process WAL volume
              PGWAL="/var/lib/postgresql/wal/pg_wal"
              echo "Processing WAL volume: ${PGWAL}"
              if [ ! -d "${PGWAL}" ] || [ -z "$(ls -A ${PGWAL} 2>/dev/null)" ]; then
                echo "WAL directory is empty - skipping (fresh install)"
              else
                # Get current ownership
                CURRENT_UID=$(stat -c '%u' "${PGWAL}")
                CURRENT_GID=$(stat -c '%g' "${PGWAL}")
                echo "Current ownership: ${CURRENT_UID}:${CURRENT_GID}"
                if [ "${CURRENT_UID}" = "${TARGET_UID}" ] && [ "${CURRENT_GID}" = "${TARGET_GID}" ]; then
                  echo "Ownership already correct"
                else
                  echo "Migration required from ${CURRENT_UID}:${CURRENT_GID} to ${TARGET_UID}:${TARGET_GID}"
                  MIGRATION_NEEDED=true
                  echo "Changing WAL ownership (this is fast)..."
                  chown -R "${TARGET_UID}:${TARGET_GID}" "${PGWAL}"
                  echo "WAL volume migration completed"
                fi
              fi
              echo ""

              # Process tablespace volumes (if configured)
              # Uncomment and adjust based on your timescale.timeseriesDiskCount setting
              # Example below assumes timeseriesDiskCount=2:
              #
              # TABLESPACE="/var/lib/postgresql/tablespaces/timeseries_tablespace_1"
              # echo "Processing tablespace volume 1: ${TABLESPACE}"
              # if [ ! -d "${TABLESPACE}/data" ] || [ -z "$(ls -A ${TABLESPACE}/data 2>/dev/null)" ]; then
              #   echo "Tablespace 1 is empty - skipping"
              # else
              #   CURRENT_UID=$(stat -c '%u' "${TABLESPACE}/data")
              #   CURRENT_GID=$(stat -c '%g' "${TABLESPACE}/data")
              #   echo "Current ownership: ${CURRENT_UID}:${CURRENT_GID}"
              #   if [ "${CURRENT_UID}" != "${TARGET_UID}" ] || [ "${CURRENT_GID}" != "${TARGET_GID}" ]; then
              #     echo "Migration required from ${CURRENT_UID}:${CURRENT_GID} to ${TARGET_UID}:${TARGET_GID}"
              #     MIGRATION_NEEDED=true
              #     echo "Changing tablespace 1 ownership..."
              #     chown -R "${TARGET_UID}:${TARGET_GID}" "${TABLESPACE}"
              #     echo "Tablespace 1 migration completed"
              #   else
              #     echo "Ownership already correct"
              #   fi
              # fi
              # echo ""
              # (Repeat for tablespace 2, 3, etc. based on your timeseriesDiskCount)

              if [ "${MIGRATION_NEEDED}" = "false" ]; then
                echo "No migration needed for replica ${REPLICA_INDEX}"
              else
                echo "Replica ${REPLICA_INDEX} migrated successfully"
              fi
              echo "Note: Permissions will be fixed by the pod startup script"
          volumeMounts:
            - name: tsdata
              mountPath: /var/lib/postgresql
            - name: tswal
              mountPath: /var/lib/postgresql/wal
            # Uncomment and add tablespace volumes if configured:
            # - name: tablespace-1
            #   mountPath: /var/lib/postgresql/tablespaces/timeseries_tablespace_1
            # - name: tablespace-2
            #   mountPath: /var/lib/postgresql/tablespaces/timeseries_tablespace_2
      volumes:
        - name: tsdata
          persistentVolumeClaim:
            claimName: timescale-ha-data-timescale-0
        - name: tswal
          persistentVolumeClaim:
            claimName: timescale-ha-wal-timescale-0
        # Uncomment and add tablespace PVCs if configured:
        # - name: tablespace-1
        #   persistentVolumeClaim:
        #     claimName: timescale-ha-tablespace-data-1-timescale-0
        # - name: tablespace-2
        #   persistentVolumeClaim:
        #     claimName: timescale-ha-tablespace-data-2-timescale-0
Before applying, update the job manifests:
- Replace <VERSION> in the image field with your actual platform version tag (e.g., 2.17.0)
- Replace the TARGET_UID and TARGET_GID environment variable values with your target values (5000 in the example above)
- For TimescaleDB:
  - If you have tablespace volumes configured (timescale.timeseriesDiskCount > 0), uncomment the tablespace sections in the script and add the corresponding volumeMounts and volumes
  - Add one TABLESPACE section per tablespace (1, 2, 3, etc. based on your timeseriesDiskCount)
- For HA configurations (multiple replicas):
  - Create separate job manifests for each replica (replica 0, 1, 2, etc.)
  - Update the REPLICA_INDEX environment variable: "0", "1", "2", etc.
  - Update the job names: postgres-ownership-migration-0, postgres-ownership-migration-1, etc.
  - Update the claimName suffixes in volumes (you can confirm the exact names with the command below):
    - PostgreSQL: data-pgplatform-0, data-pgplatform-1, etc.
    - TimescaleDB data: timescale-ha-data-timescale-0, timescale-ha-data-timescale-1, etc.
    - TimescaleDB WAL: timescale-ha-wal-timescale-0, timescale-ha-wal-timescale-1, etc.
    - TimescaleDB tablespaces: timescale-ha-tablespace-data-1-timescale-0, timescale-ha-tablespace-data-1-timescale-1, etc.
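To confirm the exact PVC names to use in claimName (they can vary between installations), list the claims in the namespace; the grep pattern is only illustrative:
kubectl get pvc -n itrs | grep -E 'pgplatform|timescale'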
Apply the jobs:
# Single replica setup:
kubectl apply -f postgres-migration-job-0.yaml
kubectl apply -f timescale-migration-job-0.yaml
# For HA setups, apply jobs for all replicas:
kubectl apply -f postgres-migration-job-0.yaml
kubectl apply -f postgres-migration-job-1.yaml
# ... repeat for additional replicas
kubectl apply -f timescale-migration-job-0.yaml
kubectl apply -f timescale-migration-job-1.yaml
# ... repeat for additional replicas
# Wait for all jobs to complete (increase --timeout for very large databases; the jobs themselves allow up to 6 hours via activeDeadlineSeconds)
kubectl wait --for=condition=complete --timeout=600s job/postgres-ownership-migration-0 -n itrs
kubectl wait --for=condition=complete --timeout=600s job/timescale-ownership-migration-0 -n itrs
# ... wait for all replica jobs
# Check job logs to verify success
kubectl logs -n itrs job/postgres-ownership-migration-0
kubectl logs -n itrs job/timescale-ownership-migration-0
# ... check logs for all replica jobs
Scale down and up StatefulSets to pick up new ownership:
# Scale down (adjust replicas based on your HA configuration)
kubectl scale statefulset pgplatform -n itrs --replicas=0
kubectl scale statefulset timescale -n itrs --replicas=0
# Wait for pods to terminate
kubectl wait --for=delete pod -l app=pgplatform -n itrs --timeout=120s
kubectl wait --for=delete pod -l app=timescale -n itrs --timeout=120s
# Scale back up (use your original replica count)
kubectl scale statefulset pgplatform -n itrs --replicas=1 # or 2, 3, etc. for HA
kubectl scale statefulset timescale -n itrs --replicas=1 # or 2, 3, etc. for HA
# Wait for platform to be ready
sleep 60
kubectl wait --timeout=600s iaxplatform/iax-iax-platform --for jsonpath="{.status.status}"=DEPLOYED -n itrs
Step 4: Verify system health
# Verify all pods are running
kubectl get pods -n itrs
# All pods should be in Running state
# Verify ownership changed on volumes
kubectl exec -n itrs pgplatform-0 -c postgres -- stat -c '%u:%g' /var/lib/postgresql/data
# Should show: 5000:5000 (or your target UID:GID)
kubectl exec -n itrs timescale-0 -c timescale -- stat -c '%u:%g' /var/lib/postgresql/data
kubectl exec -n itrs timescale-0 -c timescale -- stat -c '%u:%g' /var/lib/postgresql/wal/pg_wal
# Should show: 5000:5000
# If using tablespaces, verify tablespace ownership:
# kubectl exec -n itrs timescale-0 -c timescale -- stat -c '%u:%g' /var/lib/postgresql/tablespaces/timeseries_tablespace_1/data
# Verify correct UIDs inside containers
kubectl exec -n itrs pgplatform-0 -c postgres -- id
kubectl exec -n itrs timescale-0 -c timescale -- id
# Should show: uid=5000 gid=5000 (or your target UIDs)
# Test database connectivity
kubectl exec -n itrs pgplatform-0 -c postgres -- psql -U postgres -c "SELECT 1"
kubectl exec -n itrs timescale-0 -c timescale -- psql -U postgres -c "SELECT 1"
# Should return: 1
# Verify no errors in logs
kubectl logs -n itrs pgplatform-0 -c postgres --tail=50
kubectl logs -n itrs timescale-0 -c timescale --tail=50
Step 5: Re-enable security policies
For Pod Security Admission:
kubectl label namespace itrs pod-security.kubernetes.io/enforce=restricted --overwrite
# Verify
kubectl get namespace itrs -o jsonpath='{.metadata.labels.pod-security\.kubernetes\.io/enforce}'
# Should show: restricted
For Gatekeeper:
Remove itrs from excludedNamespaces. Find the position of "itrs" in the array, then remove that entry:
# Find the position of "itrs" in the capabilities constraint
# (grep -n is 1-based; JSON Patch array indexes are 0-based, so subtract 1)
ITRS_INDEX_CAP=$(kubectl get k8spspcapabilities drop-all -o jsonpath='{.spec.match.excludedNamespaces}' | tr ',' '\n' | grep -n itrs | cut -d: -f1 | head -1)
kubectl patch k8spspcapabilities drop-all --type=json -p="[{\"op\": \"remove\", \"path\": \"/spec/match/excludedNamespaces/$((ITRS_INDEX_CAP-1))\"}]"
# Repeat for other constraints (e.g., allowed users)
ITRS_INDEX_USR=$(kubectl get k8spspallowedusers jpmc-allowed-user-ranges -o jsonpath='{.spec.match.excludedNamespaces}' | tr ',' '\n' | grep -n itrs | cut -d: -f1 | head -1)
kubectl patch k8spspallowedusers jpmc-allowed-user-ranges --type=json -p="[{\"op\": \"remove\", \"path\": \"/spec/match/excludedNamespaces/$((ITRS_INDEX_USR-1))\"}]"
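For Kyverno:
If you created the temporary PolicyException from step 2, delete it now (the file name matches the sketch shown there):
kubectl delete -f kyverno-exception.yaml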
Final verification:
# Check for constraint violations
kubectl get constraints -A
# Should show 0 violations in itrs namespace
# Verify pods still running after policy re-enablement
kubectl get pods -n itrs
Important notes
- Plan for downtime: Database pods will restart during the migration
- Large databases: Migration time increases with data size (can take 10+ minutes for large datasets)
- Security window: Minimize the time with relaxed policies - complete the procedure as quickly as possible
- Schedule appropriately: Perform during maintenance window
- Alternative approach: Deploy a fresh installation with correct UIDs if feasible
Why this complexity?
PostgreSQL, TimescaleDB, etcd, and Kafka require their data directories to be owned by the specific UID running the process. Changing UIDs requires recursively changing file ownership on potentially large persistent volumes, which:
- Requires running as root
- Requires elevated Linux capabilities (CHOWN, FOWNER, DAC_OVERRIDE, DAC_READ_SEARCH)
- Takes time proportional to data size
- Cannot be automated while strict security policies are active
Best practice: Configure the correct runAsUser and fsGroup before initial deployment to avoid this complexity.
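For reference, a minimal sketch of setting these values at initial install time (the release name, chart name, and values file follow the helm upgrade example in step 1):
# values.yaml contains the securityContext block shown in step 1
helm install iax itrs/iax-platform \
  --namespace itrs \
  --create-namespace \
  --values values.yaml \
  --wait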