Velero Backup Partial Failures

VeleroBackupPartialFailures

Meaning

The ratio of partially failed backups to total backup attempts for a Velero schedule has exceeded 25% over the last 15 minutes. A partial failure means the backup ran to completion and ended in the PartiallyFailed phase, but some items (for example, individual volumes or resources) were not backed up.

Impact

Some data within the affected backup schedule may not be restorable. Persistent volume snapshots or resource backups may be missing, leading to incomplete disaster recovery coverage.

Diagnosis

List the recent backups for the affected schedule:

velero --namespace velero-system backup get --selector velero.io/schedule-name=<schedule>
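
To get a rough sense of how widespread the problem is, the phases of the backups for one schedule can be tallied. The following is a minimal sketch, not the alert's own calculation: partial_failure_ratio is a hypothetical helper name, it relies on the velero.io/schedule-name label that Velero sets on backups created from a schedule, and it counts every backup still retained rather than only the last 15 minutes.

```shell
# Sketch: print the share of PartiallyFailed backups for one schedule.
# Assumption: counts all retained backups, not the alert's 15-minute window.
partial_failure_ratio() {
  schedule="$1"
  kubectl get backups.velero.io -n velero-system \
    -l "velero.io/schedule-name=${schedule}" \
    -o jsonpath='{range .items[*]}{.status.phase}{"\n"}{end}' |
    awk '
      NF { total++ }                          # skip backups with no phase yet
      $1 == "PartiallyFailed" { partial++ }
      END {
        if (total == 0) { print "no backups found"; exit }
        printf "%d/%d partially failed (%.0f%%)\n", partial, total, 100 * partial / total
      }'
}

# Usage: partial_failure_ratio <schedule>
```

A result well above 25% across many backups suggests a persistent cause rather than a one-off failure.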

Describe a failing backup to see the errors:

velero --namespace velero-system backup describe <backup-name> --details
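
When the describe output is long, the per-backup log can be narrowed to its error and warning lines. This is a sketch under assumptions: backup_errors is a hypothetical helper name, and it expects the log to use Velero's usual level=... fields. The velero backup logs command reads the log from object storage, so it may fail if the S3 backend is unreachable.

```shell
# Sketch: show only error and warning lines from one backup's log.
# grep exits non-zero when no such lines exist.
backup_errors() {
  backup="$1"
  velero --namespace velero-system backup logs "$backup" |
    grep -E 'level=(error|warning)'
}

# Usage: backup_errors <backup-name>
```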

Check the Velero server and node-agent logs:

kubectl logs -n velero-system deployment/velero
kubectl logs -n velero-system daemonset/node-agent
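
Note that kubectl logs against a daemonset reads from a single pod only, so node-agent errors on other nodes can be missed. The sketch below loops over every node-agent pod and keeps only the lines that mention one backup. The helper name logs_for_backup, the name=node-agent pod label, and the 24h window are assumptions; adjust the selector to match how node-agent was installed.

```shell
# Sketch: print log lines mentioning one backup from the Velero server and
# from every node-agent pod. The name=node-agent label is an assumption.
logs_for_backup() {
  backup="$1"
  echo "== deployment/velero"
  kubectl logs -n velero-system deployment/velero --since=24h |
    grep -F "$backup" || true   # no matching lines is not an error
  for pod in $(kubectl get pods -n velero-system -l name=node-agent -o name); do
    echo "== ${pod}"
    kubectl logs -n velero-system "$pod" --since=24h |
      grep -F "$backup" || true
  done
}

# Usage: logs_for_backup <backup-name>
```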

Mitigation

If the partial failures are caused by a transient issue (for example, a node being down), re-run the backup manually:

velero --namespace velero-system backup create --from-schedule <schedule> --ttl 72h

If failures persist, verify that the S3 backend is reachable, that Longhorn CSI snapshots are functioning, and that the node-agent has sufficient resources.
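
The three checks above can be started from the command line as follows. This is a sketch, not an exhaustive procedure: check_backup_dependencies is a hypothetical helper name, the VolumeSnapshot listing assumes the CSI snapshot CRDs are installed, and kubectl top assumes metrics-server is running in the cluster.

```shell
# Sketch: first-pass checks for the storage backend, CSI snapshots and
# node-agent resources. Output is meant to be read by the operator.
check_backup_dependencies() {
  echo "== Backup storage locations (PHASE should be Available)"
  velero --namespace velero-system backup-location get

  echo "== CSI snapshots (check the READYTOUSE column)"
  kubectl get volumesnapshot --all-namespaces
  kubectl get volumesnapshotcontent

  echo "== Velero pods (look for restarts or pods that are not Running)"
  kubectl get pods -n velero-system -o wide

  echo "== Pod resource usage (assumes metrics-server is installed)"
  kubectl top pods -n velero-system
}

# Usage: check_backup_dependencies
```

If a backup storage location is reported as Unavailable, resolve that first, since backup logs and metadata are also stored there.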