# PostgresInArchiveRecovery

## Meaning
Patroni is unable to recover from the archive. This can happen when Postgres had storage issues and attempted a disaster recovery.
Check the logs before mitigating.
## Impact
One or more members of the PostgreSQL cluster are not operational. They cannot serve as readable or writable members and might cause other issues in the database cluster.
## Diagnosis
Find current spilo Pods:

```shell
kubectl get pods --selector application=spilo
```
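The commands below use a `postgres_pod` shell variable. A minimal sketch for setting it from the selector above (assuming a single Spilo cluster in the current namespace; the index `0` is just an example, pick the member you actually want to inspect):

```shell
# Pick one of the matching Pods and store its name for the commands below.
# Adjust the index (or set the name manually) to target the right member.
postgres_pod=$(kubectl get pods --selector application=spilo \
  -o jsonpath='{.items[0].metadata.name}')
echo "${postgres_pod}"
```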
Check the member list (replace `${postgres_pod:?}` with a Pod from above):

```shell
kubectl exec -it ${postgres_pod:?} -- patronictl list
```
Look for members in the `in archive recovery` state:

```
+ Cluster: example-postgres (736034538667634444) --------------------+----+-----------+
| Member              | Host         | Role    | State               | TL | Lag in MB |
+---------------------+--------------+---------+---------------------+----+-----------+
| example-postgres-0  | 10.1.16.243  | Leader  | running             | 51 |           |
| example-postgres-1  | 10.1.108.186 | Replica | in archive recovery | 51 |       528 |
+---------------------+--------------+---------+---------------------+----+-----------+
```
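To spot stuck members without scanning the whole table, the same output can be filtered (a convenience sketch, not a required step):

```shell
# Show only members reported as "in archive recovery"
kubectl exec -it ${postgres_pod:?} -- patronictl list | grep 'in archive recovery'
```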
Check the Pod logs (replace `${postgres_pod:?}` with a relevant Pod from above):

```shell
kubectl logs ${postgres_pod:?}
```
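If the container has restarted, the current logs may not contain the original failure. Two common variations, using standard kubectl flags:

```shell
# Logs from the previous container instance, if it crashed and restarted
kubectl logs --previous ${postgres_pod:?}

# Follow only the most recent lines
kubectl logs --tail=100 --follow ${postgres_pod:?}
```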
## Mitigation
If there is still a healthy leader, the easiest way to mitigate this problem is to delete the stuck PostgreSQL instance's PVC and Pod (in that order).
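Before deleting anything, double-check that the stuck member is not the current leader. Besides re-running `patronictl list`, the Pod labels can be inspected; note that the `spilo-role` label value (`master` on older Spilo images, `primary` on newer ones) is an assumption about Spilo defaults and may differ in your setup:

```shell
# Show the role label next to each Pod; the leader must NOT be the Pod you delete
kubectl get pods --selector application=spilo -L spilo-role
```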
```shell
# Make sure postgres_pod is set to the stuck Pod
kubectl delete pvc --wait=false pgdata-${postgres_pod:?}
kubectl delete pod ${postgres_pod:?}
sleep 5
# Pod should come back
kubectl wait --for=condition=Ready --timeout=60s pod ${postgres_pod:?}
# Check the status
kubectl exec -it ${postgres_pod:?} -- patronictl list
sleep 60
kubectl exec -it ${postgres_pod:?} -- patronictl list
```
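The recreated replica may take a while to re-initialize and catch up. A small convenience sketch for watching progress until the member reports a healthy state (`running`, or `streaming` on newer Patroni versions) and the lag shrinks, assuming `watch` is available locally:

```shell
# Re-check the cluster state every 10 seconds; stop with Ctrl-C once the
# recreated member is healthy and the lag has dropped.
watch -n 10 kubectl exec ${postgres_pod:?} -- patronictl list
```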