Data Sources

DataVault can sync documents from your existing storage systems. Configure one or more data sources to start processing your documents.

These data source configurations apply only when you operate your own DataVault runtime (on-premises or customer-managed). In the default managed setup, most users configure sources in meinGPT and do not edit app_config.yaml directly.

UI-First Setup (Recommended)

For most users, configure sources in meinGPT first:

1. Open Data Pools / Data Sources in meinGPT.
2. Add your source in the UI.
3. Start the sync and verify the indexed content.

Use the per-source app_config.yaml examples below only when you run your own on-prem DataVault runtime.

Supported Data Sources

Local Files

Mount local folders directly into DataVault

SharePoint/OneDrive

Connect to SharePoint document libraries and OneDrive

Amazon S3

Connect to AWS S3 buckets or S3-compatible storage

Google Drive

Sync documents from Google Drive

Confluence

Sync pages and attachments from Confluence

Web Crawler

Crawl public or internal websites into a data pool

IMAP Mailbox

Sync emails from IMAP mailboxes

SMB/CIFS

Connect to Windows/Samba network shares

WebDAV

Connect to WebDAV servers (Nextcloud, ownCloud, etc.)

Basic Configuration

config/app_config.yaml

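data_pools:
  # Local folder mounted into the DataVault runtime
  - id: local
    type: local
    base_path: ./data

  # S3 bucket; credentials are read from environment variables
  - id: my-s3
    type: s3
    access_key_id: $AWS_ACCESS_KEY_ID
    secret_access_key: $AWS_SECRET_ACCESS_KEY
    endpoint: https://s3.amazonaws.com
    bucket_name: my-bucket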

Security: Always supply credentials through environment variables. Never hardcode them in configuration files.
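Before starting the runtime, it can help to confirm that every environment variable referenced in the configuration is actually set. The following is a minimal sketch of such a pre-flight check, not part of DataVault itself. It assumes placeholders use the `$NAME` form shown in the example above (it also accepts `${NAME}`), and the helper name `missing_env_vars` is hypothetical.

```python
import os
import re

# Matches $NAME or ${NAME}. The placeholder syntax is an assumption
# based on the app_config.yaml example, not a documented DataVault rule.
PLACEHOLDER = re.compile(r"\$\{?([A-Za-z_][A-Za-z0-9_]*)\}?")


def missing_env_vars(config_text, env=None):
    """Return the sorted names of referenced variables that are unset or empty."""
    env = os.environ if env is None else env
    names = set(PLACEHOLDER.findall(config_text))
    return sorted(name for name in names if not env.get(name))


if __name__ == "__main__":
    sample = """
data_pools:
  - id: my-s3
    type: s3
    access_key_id: $AWS_ACCESS_KEY_ID
    secret_access_key: $AWS_SECRET_ACCESS_KEY
"""
    # Pass an explicit mapping here so the demo does not depend on the real environment.
    print(missing_env_vars(sample, env={"AWS_ACCESS_KEY_ID": "example"}))
```

In practice you would read the text of config/app_config.yaml and call `missing_env_vars(text)` without the `env` argument, so that the check runs against the process environment.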

Synchronization (How It Works)

For all source types, synchronization follows the same high-level pattern:

1. The data pool configuration is resolved (from meinGPT and/or the local data_pools in app_config.yaml).
2. The Vault connector fetches files and content into local sync storage.
3. The content is parsed, chunked, embedded, and indexed.
4. Later sync runs update changed content incrementally.
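The incremental step can be pictured as comparing the current state of the sync storage with a record of the previous run. This page does not describe how DataVault detects changes, so the sketch below is only an illustration of the general idea using content hashes. The function names `fingerprint` and `plan_sync` and the manifest format are hypothetical.

```python
import hashlib
from pathlib import Path


def fingerprint(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def plan_sync(root, previous):
    """Compare the files under root with the manifest from the previous run.

    previous maps relative paths to content hashes. Returns a tuple
    (current_manifest, changed, removed): changed lists new or modified
    paths, removed lists paths that no longer exist.
    """
    root = Path(root)
    current = {
        p.relative_to(root).as_posix(): fingerprint(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    changed = sorted(k for k, v in current.items() if previous.get(k) != v)
    removed = sorted(k for k in previous if k not in current)
    return current, changed, removed
```

On the first run, pass an empty dict as `previous`, so every file counts as changed. On later runs, pass the manifest returned by the previous call. Only the paths in `changed` would then need to be parsed, chunked, embedded, and indexed again, and the paths in `removed` would be dropped from the index.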