Data Sources
Connect your existing document storage to DataVault
DataVault can sync documents from your existing storage systems. Configure one or more data sources to start processing your documents.
These data source configurations are relevant when you operate your own DataVault runtime (on-prem/customer-managed).
In the default managed setup, most users configure sources in meinGPT and do not edit app_config.yaml directly.
UI-First Setup (Recommended)
For most users, configure sources in meinGPT first:
- Open Data Pools / Data Sources in meinGPT
- Add your source in the UI
- Start sync and verify indexed content
Use the per-source app_config.yaml examples below only when you run your own on-prem DataVault runtime.
Supported Data Sources
- Local Files: Mount local folders directly into DataVault
- SharePoint/OneDrive: Connect to SharePoint document libraries and OneDrive
- Amazon S3: Connect to AWS S3 buckets or S3-compatible storage
- Google Drive: Sync documents from Google Drive
- Confluence: Sync pages and attachments from Confluence
- Web Crawler: Crawl public or internal websites into a data pool
- IMAP Mailbox: Sync emails from IMAP mailboxes
- SMB/CIFS: Connect to Windows/Samba network shares
- WebDAV: Connect to WebDAV servers (Nextcloud, ownCloud, etc.)
Basic Configuration
data_pools:
  - id: local
    type: local
    base_path: ./data
  - id: my-s3
    type: s3
    access_key_id: $AWS_ACCESS_KEY_ID
    secret_access_key: $AWS_SECRET_ACCESS_KEY
    endpoint: https://s3.amazonaws.com
    bucket_name: my-bucket

Security: Always use environment variables for credentials; never hardcode them in configuration files.
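Because credentials are referenced as environment variables, it can help to check that every referenced variable is actually set before starting the runtime. The following is a minimal sketch, not part of DataVault: the function name `find_unset_variables` is hypothetical, and it assumes the configuration references variables as `$NAME` or `${NAME}`, which may differ from how the runtime resolves them.

```python
import os
import re

# Matches $NAME and ${NAME}. This syntax is an assumption based on the
# example configuration, not a documented DataVault rule.
_VAR_PATTERN = re.compile(r"\$(?:\{(\w+)\}|([A-Za-z_]\w*))")


def find_unset_variables(config_text, environ=None):
    """Return the sorted names of referenced variables that are unset or empty."""
    env = os.environ if environ is None else environ
    # Each match is a (braced, bare) tuple in which exactly one entry is non-empty.
    referenced = {braced or bare for braced, bare in _VAR_PATTERN.findall(config_text)}
    return sorted(name for name in referenced if not env.get(name))


def check_config_file(path="app_config.yaml"):
    """Read a configuration file and raise if any referenced variable is missing."""
    with open(path, encoding="utf-8") as handle:
        missing = find_unset_variables(handle.read())
    if missing:
        raise RuntimeError("Unset environment variables: " + ", ".join(missing))
```

Calling `check_config_file()` from a startup script would surface a missing `AWS_SECRET_ACCESS_KEY` as a clear error instead of a failed sync later on.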
Synchronization (How It Works)
For all source types, synchronization follows the same high-level pattern:
- Data pool configuration is resolved (from meinGPT and/or local data_pools in app_config.yaml)
- Vault connector fetches files/content into local sync storage
- Content is parsed, chunked, embedded, and indexed
- Later sync runs update changed content incrementally
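The pattern above can be illustrated with a small, self-contained sketch. It is not DataVault's implementation: the names `content_hash`, `chunk_text`, and `sync_once` are hypothetical, change detection by SHA-256 hash is an assumed technique, the embedding step is omitted, and removing deleted files from the index is an assumption the text does not confirm.

```python
import hashlib


def content_hash(data):
    """Fingerprint file content so unchanged files can be skipped."""
    return hashlib.sha256(data).hexdigest()


def chunk_text(text, size=200, overlap=20):
    """Split text into fixed-size character chunks that overlap slightly."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the current chunk already reaches the end of the text
        start += size - overlap
    return chunks


def sync_once(source_files, state, index):
    """Run one sync pass.

    source_files: path -> bytes, standing in for content the connector fetched
    state:        path -> hash recorded by the previous run (updated in place)
    index:        path -> list of chunks, standing in for the search index
    """
    changed = []
    for path, data in source_files.items():
        digest = content_hash(data)
        if state.get(path) == digest:
            continue  # unchanged since the last run, nothing to re-index
        # A real pipeline would parse the file and embed each chunk here.
        index[path] = chunk_text(data.decode("utf-8", errors="replace"))
        state[path] = digest
        changed.append(path)

    # Assumption: files that vanished from the source are dropped from the index.
    removed = [path for path in state if path not in source_files]
    for path in removed:
        del state[path]
        index.pop(path, None)

    return {"indexed": sorted(changed), "removed": sorted(removed)}
```

Running `sync_once` twice with the same `source_files` indexes everything on the first pass and nothing on the second, which is the incremental behavior the last step describes.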