VeleroBackupPartialFailures
Meaning
The ratio of partially failed backups to total backup attempts for a Velero schedule has exceeded 25% over the last 15 minutes. A partial failure means the backup ran to completion, but some items (for example, individual volumes or resources) failed to back up. Velero reports such a backup with the phase PartiallyFailed.
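As a rough illustration of the threshold, the sketch below checks whether a given count of partial failures out of a given number of attempts would exceed 25%. The function name `exceeds_partial_failure_threshold` is hypothetical and not part of Velero or the alert rule; the real alert is evaluated by the monitoring system over a 15-minute window.

```shell
# Hypothetical helper (not part of Velero): prints "yes" if the ratio of
# partial failures to attempts is strictly greater than 25%, else "no".
exceeds_partial_failure_threshold() {
  partial="$1"
  attempts="$2"
  # With no attempts there is no ratio; treat it as not breaching.
  if [ "$attempts" -eq 0 ]; then
    echo "no"
    return 0
  fi
  # Integer form of: partial / attempts > 0.25  <=>  partial * 4 > attempts
  if [ $((partial * 4)) -gt "$attempts" ]; then
    echo "yes"
  else
    echo "no"
  fi
}
```

For example, 2 partial failures out of 4 attempts (50%) would breach the threshold, while 1 out of 4 (exactly 25%) would not.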
Impact
Some data within the affected backup schedule may not be restorable. Persistent volume snapshots or resource backups may be missing, leading to incomplete disaster recovery coverage.
Diagnosis
List the recent backups for the affected schedule:
velero --namespace velero-system backup get --selector velero.io/schedule-name=<schedule>
Describe a failing backup to see the errors:
velero --namespace velero-system backup describe <backup-name> --details
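To find which backups to describe, one option is to list the Backup resources with their phase and keep only the partially failed ones. This is a minimal sketch: the helper name `partially_failed_backups` is hypothetical, and it assumes the input is whitespace-separated "NAME PHASE" lines such as those produced by the `kubectl` command shown in the comment.

```shell
# Hypothetical helper: reads "NAME PHASE" lines on stdin and prints the names
# of backups whose phase is PartiallyFailed.
partially_failed_backups() {
  awk '$2 == "PartiallyFailed" { print $1 }'
}

# Possible usage against a cluster (assumes the velero-system namespace used
# elsewhere in this runbook; <schedule> is a placeholder):
#
#   kubectl -n velero-system get backups.velero.io \
#     -l velero.io/schedule-name=<schedule> --no-headers \
#     -o custom-columns=NAME:.metadata.name,PHASE:.status.phase \
#     | partially_failed_backups
```

Each name printed can then be passed to the describe command above.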
Check the Velero server and node-agent logs:
kubectl logs -n velero-system deployment/velero
kubectl logs -n velero-system daemonset/node-agent
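The logs can be noisy, so it may help to keep only warning and error entries. The sketch below assumes Velero is using its default text log format, where each line carries a `level=...` field; if the server runs with JSON logging, the pattern would need adjusting. The helper name `velero_log_problems` is hypothetical.

```shell
# Hypothetical helper: keeps only log lines at warning or error level,
# assuming the text log format with a "level=<name>" field on each line.
velero_log_problems() {
  grep -E 'level=(warning|error)'
}

# Possible usage (the --since value is an arbitrary example):
#
#   kubectl logs -n velero-system deployment/velero --since=1h | velero_log_problems
#   kubectl logs -n velero-system daemonset/node-agent --since=1h | velero_log_problems
```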
Mitigation
If the partial failures are caused by a transient issue (for example a node being down), re-run the backup manually:
velero --namespace velero-system backup create --from-schedule <schedule> --ttl 72h
If failures persist, verify the S3 backend is reachable, Longhorn CSI snapshots are functioning, and the node-agent has sufficient resources.
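One way to check the storage backend is to look at the phase Velero records on each BackupStorageLocation, which is either Available or Unavailable. The sketch below is illustrative only: the helper name `unavailable_storage_locations` is hypothetical, and it assumes "NAME PHASE" lines as produced by the `kubectl` command in the comment. It does not cover the Longhorn snapshot or node-agent resource checks.

```shell
# Hypothetical helper: reads "NAME PHASE" lines on stdin and prints the names
# of backup storage locations that are not in the Available phase.
unavailable_storage_locations() {
  awk '$2 != "Available" { print $1 }'
}

# Possible usage against a cluster (assumes the velero-system namespace used
# elsewhere in this runbook):
#
#   kubectl -n velero-system get backupstoragelocations.velero.io --no-headers \
#     -o custom-columns=NAME:.metadata.name,PHASE:.status.phase \
#     | unavailable_storage_locations
```

An empty result suggests Velero can currently reach every configured location; any name printed is a location worth investigating first.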