# Backups

Backup and recovery procedures for DataVault.

**Important:** You typically don't need comprehensive DataVault backups, since your source documents live in external systems (SharePoint, Google Drive, S3, etc.). However, losing processed data means re-ingestion time and API costs.
## What to Consider

### Source Data is Safe
Your documents are stored in:
- SharePoint/OneDrive
- Google Drive
- S3 buckets
- Local file systems
- Other external sources
If DataVault is completely lost, you can just re-ingest everything from these sources.
### What You Lose Without Backups
- Processed embeddings → Re-ingestion required (time + API costs)
- Configuration → Need to recreate settings
- Processing history → Lost, but not critical
## Simple Backup Strategy

### Essential Backup (Recommended)
Back up only the critical parts:
```bash
#!/bin/bash
# Back up only what matters: configuration and the vector database.
set -euo pipefail

BACKUP_DIR="/backup/datavault"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

mkdir -p "$BACKUP_DIR"
cd datavault

echo "=== DataVault Essential Backup - $TIMESTAMP ==="

# Stop services for a consistent backup; the trap restarts them
# on exit even if one of the tar steps fails.
docker compose stop
trap 'docker compose start' EXIT

# 1. Configuration (critical - recreating it by hand is tedious)
tar -czf "$BACKUP_DIR/config-$TIMESTAMP.tar.gz" config/ docker-compose.yaml

# 2. Vector database (saves re-ingestion time and API costs)
tar -czf "$BACKUP_DIR/weaviate-$TIMESTAMP.tar.gz" data/

echo "Backup completed: $BACKUP_DIR"
echo "Files: config-$TIMESTAMP.tar.gz, weaviate-$TIMESTAMP.tar.gz"
```
### Weekly Automated Backup
```bash
# Add a weekly backup job (Sundays at 02:00). Note: `crontab -` replaces
# the whole crontab, so append to the existing entries rather than overwriting.
(crontab -l 2>/dev/null; echo "0 2 * * 0 /path/to/essential-backup.sh") | crontab -
```
## Recovery

### Full Recovery
```bash
#!/bin/bash
# Restore configuration and the vector database from a timestamped backup.
set -euo pipefail

BACKUP_DATE="${1:-}"  # e.g., 20241201_120000

if [ -z "$BACKUP_DATE" ]; then
  echo "Usage: $0 BACKUP_DATE"
  echo "Available backups:"
  ls /backup/datavault/config-* | sed 's/.*config-\(.*\)\.tar\.gz/\1/'
  exit 1
fi

cd datavault

# Stop and remove the running containers
docker compose down

# Restore configuration (extracts config/ and docker-compose.yaml)
tar -xzf "/backup/datavault/config-$BACKUP_DATE.tar.gz"

# Restore the vector database, replacing the data directory entirely
rm -rf data/
tar -xzf "/backup/datavault/weaviate-$BACKUP_DATE.tar.gz"

# Start services
docker compose up -d

echo "Recovery completed!"
```
### Alternative: Re-ingestion
If you don't have backups or they're corrupted:
```bash
# Start fresh: wipe the processed data and bring services back up
cd datavault
docker compose down
rm -rf data/
docker compose up -d

# Wait for services to start
sleep 30

# Trigger re-ingestion of all data pools
curl -X POST http://localhost:8080/ingestion/all
```
**Note:** Re-ingestion will take time and consume embedding API tokens (and money, if you use a paid API).
## Backup Best Practices

### Frequency
- Configuration: back up after changes
- Vector database: weekly, or after major ingestions
- Before updates: always back up before upgrading (see the sketch below)
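A minimal pre-upgrade sketch, assuming the essential backup script from above and a standard Compose-based upgrade via `docker compose pull`:

```bash
# Back up first; only pull new images and restart if the backup succeeded
/path/to/essential-backup.sh \
  && docker compose pull \
  && docker compose up -d
```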
### Storage
```bash
# Retention: keep roughly the last 4 weekly backups
# (delete archives older than 28 days)
find /backup/datavault -name "*.tar.gz" -mtime +28 -delete
```
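Backups that live only on the DataVault host are lost with it. One option is an off-host copy, sketched here with the AWS CLI (the bucket name is a placeholder):

```bash
# Mirror the backup directory to S3 (requires the AWS CLI and credentials)
aws s3 sync /backup/datavault s3://your-backup-bucket/datavault/
```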
### Test Recovery
```bash
# Periodically verify that your backup archives are readable.
# Chain with && so the message only prints if both checks pass.
tar -tzf /backup/datavault/config-20241201_120000.tar.gz >/dev/null \
  && tar -tzf /backup/datavault/weaviate-20241201_120000.tar.gz >/dev/null \
  && echo "Backup files are valid"
```
## When You Need Backups

**Essential if:**
- Large document collections (re-ingestion takes hours/days)
- Using paid embedding APIs (re-ingestion costs money)
- Custom configuration is complex
**Optional if:**
- Small document collections (< 1000 files)
- Using local embedding models (no API costs)
- Simple configuration
## Summary

DataVault backups are mainly about convenience, not data-loss prevention. Your source documents are safe in their original locations; back up the processed data to avoid re-ingestion time and costs.