# PostgresInArchiveRecovery

## Meaning
Patroni is unable to recover from the archive. This can happen when Postgres had storage issues and attempted a disaster recovery.
Check the logs before mitigating.
## Impact
One or more members of the PostgreSQL cluster are not operational. They cannot serve as readable or writable members and might cause other issues in the database cluster.
## Diagnosis
Find current spilo Pods:

```shell
kubectl get pods --selector application=spilo
```
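The commands below use a `postgres_pod` shell variable. A minimal sketch for setting it from the selector above (assuming a single Spilo cluster in the current namespace; the index `0` is just an example, pick the member you actually want to inspect):

```shell
# Pick one of the matching Pods and store its name for the commands below.
# Adjust the index (or set the name manually) to target the right member.
postgres_pod=$(kubectl get pods --selector application=spilo \
  -o jsonpath='{.items[0].metadata.name}')
echo "${postgres_pod}"
```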
Check the member list (replace `${postgres_pod:?}` with a Pod from above):

```shell
kubectl exec -it ${postgres_pod:?} -- patronictl list
```
Look for members in the `in archive recovery` state:

```
+ Cluster: example-postgres (736034538667634444) --------------------+----+-----------+
| Member              | Host         | Role    | State               | TL | Lag in MB |
+---------------------+--------------+---------+---------------------+----+-----------+
| example-postgres-0  | 10.1.16.243  | Leader  | running             | 51 |           |
| example-postgres-1  | 10.1.108.186 | Replica | in archive recovery | 51 |       528 |
+---------------------+--------------+---------+---------------------+----+-----------+
```
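To spot stuck members without scanning the whole table, the same output can be filtered (a convenience sketch, not a required step):

```shell
# Show only members reported as "in archive recovery"
kubectl exec -it ${postgres_pod:?} -- patronictl list | grep 'in archive recovery'
```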
Check the Pod logs (replace `${postgres_pod:?}` with a relevant Pod from above):

```shell
kubectl logs ${postgres_pod:?}
```
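If the container has restarted, the current logs may not contain the original failure. Two common variations, using standard kubectl flags:

```shell
# Logs from the previous container instance, if it crashed and restarted
kubectl logs --previous ${postgres_pod:?}

# Follow only the most recent lines
kubectl logs --tail=100 --follow ${postgres_pod:?}
```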
## Mitigation
If there is still a healthy leader, the easiest way to mitigate this problem is to delete the stuck PostgreSQL instance's PVC and Pod (in that order).
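Before deleting anything, double-check that the stuck member is not the current leader. Besides re-running `patronictl list`, the Pod labels can be inspected; note that the `spilo-role` label value (`master` on older Spilo images, `primary` on newer ones) is an assumption about Spilo defaults and may differ in your setup:

```shell
# Show the role label next to each Pod; the leader must NOT be the Pod you delete
kubectl get pods --selector application=spilo -L spilo-role
```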
```shell
# Make sure postgres_pod is set to the stuck Pod
kubectl delete pvc --wait=false pgdata-${postgres_pod:?}
kubectl delete pod ${postgres_pod:?}
sleep 5
# Pod should come back
kubectl wait --for=condition=Ready --timeout=60s pod ${postgres_pod:?}
# Check the status
kubectl exec -it ${postgres_pod:?} -- patronictl list
sleep 60
kubectl exec -it ${postgres_pod:?} -- patronictl list
```
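The recreated replica may take a while to re-initialize and catch up. A small convenience sketch for watching progress until the member reports a healthy state (`running`, or `streaming` on newer Patroni versions) and the lag shrinks, assuming `watch` is available locally:

```shell
# Re-check the cluster state every 10 seconds; stop with Ctrl-C once the
# recreated member is healthy and the lag has dropped.
watch -n 10 kubectl exec ${postgres_pod:?} -- patronictl list
```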