
Backups

Backup and recovery procedures for DataVault

Important: You typically don't need comprehensive DataVault backups since your source documents live in external systems (SharePoint, Google Drive, S3, etc.). However, losing processed data means re-ingestion time and API costs.

What to Consider

Source Data is Safe

Your documents are stored in:

  • SharePoint/OneDrive
  • Google Drive
  • S3 buckets
  • Local file systems
  • Other external sources

If DataVault is completely lost, you can just re-ingest everything from these sources.

What You Lose Without Backups

  • Processed embeddings → Re-ingestion required (time + API costs; see the rough estimate below)
  • Configuration → Need to recreate settings
  • Processing history → Lost, but not critical
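
To gauge how much a lost instance would actually cost, a back-of-envelope estimate helps. A minimal sketch in bash; the document count, average tokens per document, and embedding price below are illustrative placeholders, not DataVault defaults:

# Rough re-ingestion cost estimate -- every number here is an assumption
DOC_COUNT=50000            # documents to re-embed (placeholder)
AVG_TOKENS_PER_DOC=2000    # average tokens per document (placeholder)
PRICE_PER_MILLION=0.10     # USD per 1M embedding tokens (placeholder)

TOTAL_TOKENS=$((DOC_COUNT * AVG_TOKENS_PER_DOC))
# bash arithmetic is integer-only, so use awk for the dollar figure
awk -v t="$TOTAL_TOKENS" -v p="$PRICE_PER_MILLION" \
    'BEGIN { printf "~%d tokens, ~$%.2f to re-embed\n", t, t / 1e6 * p }'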

Simple Backup Strategy

Back up only the critical parts:

essential-backup.sh
#!/bin/bash
# Backup only what matters
set -euo pipefail

BACKUP_DIR="/backup/datavault"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

mkdir -p "$BACKUP_DIR"
cd datavault

echo "=== DataVault Essential Backup - $TIMESTAMP ==="

# Stop services for a consistent backup; the trap restarts them
# on exit, even if one of the tar steps fails
docker compose stop
trap 'docker compose start' EXIT

# 1. Configuration (critical - recreating it is tedious)
tar -czf "$BACKUP_DIR/config-$TIMESTAMP.tar.gz" config/ docker-compose.yaml

# 2. Vector database (saves re-ingestion time/costs)
tar -czf "$BACKUP_DIR/weaviate-$TIMESTAMP.tar.gz" data/

echo "Backup completed: $BACKUP_DIR"
echo "Files: config-$TIMESTAMP.tar.gz, weaviate-$TIMESTAMP.tar.gz"

Weekly Automated Backup

# Add to crontab for weekly backups; listing the existing crontab first
# preserves any entries you already have (a bare `echo ... | crontab -`
# would replace the whole crontab)
( crontab -l 2>/dev/null; echo "0 2 * * 0 /path/to/essential-backup.sh" ) | crontab -

Recovery

Full Recovery

restore.sh
#!/bin/bash
# Restore from backup
set -euo pipefail

BACKUP_DATE="${1:-}"  # e.g., 20241201_120000

if [ -z "$BACKUP_DATE" ]; then
    echo "Usage: $0 BACKUP_DATE"
    echo "Available backups:"
    ls /backup/datavault/config-*.tar.gz 2>/dev/null | sed 's/.*config-\(.*\)\.tar\.gz/\1/'
    exit 1
fi

CONFIG_TAR="/backup/datavault/config-$BACKUP_DATE.tar.gz"
DATA_TAR="/backup/datavault/weaviate-$BACKUP_DATE.tar.gz"

# Verify both archives exist before touching anything
if [ ! -f "$CONFIG_TAR" ] || [ ! -f "$DATA_TAR" ]; then
    echo "Error: backup files for $BACKUP_DATE not found" >&2
    exit 1
fi

cd datavault

# Stop services
docker compose down

# Restore configuration
tar -xzf "$CONFIG_TAR"

# Restore vector database
rm -rf data/
tar -xzf "$DATA_TAR"

# Start services
docker compose up -d

echo "Recovery completed!"

Alternative: Re-ingestion

If you don't have backups or they're corrupted:

# Start fresh DataVault
cd datavault
docker compose down
rm -rf data/
docker compose up -d

# Wait for services to start
sleep 30

# Trigger re-ingestion of all data pools
curl -X POST http://localhost:8080/ingestion/all

Note: Re-ingestion takes time and consumes embedding API tokens, which costs money with paid providers.

Backup Best Practices

Frequency

  • Configuration: Back up after changes
  • Vector database: Weekly or after major ingestions
  • Before updates: Always back up before upgrading (see the sketch below)
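
A minimal pre-upgrade routine, assuming the essential-backup.sh script from above and the docker compose layout used throughout this page:

# Back up first, then upgrade; run from the directory containing essential-backup.sh
./essential-backup.sh
cd datavault
docker compose pull    # fetch new images
docker compose up -d   # recreate containers on the new versions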

Storage

# Keep last 4 weekly backups
find /backup/datavault -name "*.tar.gz" -mtime +28 -delete

Test Recovery

# Periodically test that your backup archives are readable;
# the && chaining ensures the success message only prints if both checks pass
tar -tzf /backup/datavault/config-20241201_120000.tar.gz >/dev/null &&
tar -tzf /backup/datavault/weaviate-20241201_120000.tar.gz >/dev/null &&
echo "Backup files are valid"

When You Need Backups

Essential if:

  • Large document collections (re-ingestion takes hours/days)
  • Using paid embedding APIs (re-ingestion costs money)
  • Custom configuration is complex

Optional if:

  • Small document collections (< 1000 files)
  • Using local embedding models (no API costs)
  • Simple configuration

Summary

DataVault backups are mainly about convenience, not data loss prevention. Your source documents are safe in their original locations. Back up the processed data to avoid re-ingestion time and costs.
