Complete Configuration Reference
Complete reference for all DataVault enterprise configuration options
This reference covers every option available in the main configuration file for meinGPT DataVault enterprise deployments.
Configuration Structure
Vault Settings
Core vault credentials, processing, and system configuration
Weaviate Database
Vector database connection and configuration
Embedding Models
OpenAI, Azure, Nebius, and HuggingFace model configurations
Data Sources
S3, OneDrive, Google Drive, Confluence, SMB, WebDAV, and local file configurations
Configuration File Structure
# Version of the config file format
version: 1.0
# Base URL for meinGPT service
meingpt_url: $MEINGPT_URL
# ================================
# VAULT CORE SETTINGS
# ================================
vault:
  # Required: Your vault credentials from meinGPT dashboard
  id: your-vault-id
  secret: $VAULT_SECRET
  
  # Standalone mode - if true, vault won't connect to meinGPT server
  standalone_mode: false
  
  # Data storage directory (acts as sync target for rclone)
  data_dir: ./tmp
  
  # Ingestion settings
  ingestion_interval: 900             # Interval in seconds between ingestion runs (0 to disable)
  tasks_batch_size: 10               # Tasks added to the event loop at once per data pool
  chunk_size: 256                    # Size of each text chunk in tokens
  chunk_overlap: 26                  # Overlapping tokens between consecutive chunks
# ================================
# VECTOR DATABASE CONFIGURATION
# ================================
weaviate:
  # Connection settings
  connection_type: local              # "local" or "custom"
  host: localhost                     # Docker service name or IP
  port: 8001                         # Weaviate port
  grpc_host: localhost               # gRPC host for Weaviate
  grpc_port: 50051                   # gRPC port for Weaviate
  
  # Authentication (empty string for local)
  api_key: ""
# ================================
# EMBEDDING MODEL CONFIGURATION
# ================================
embedding_model:
  # Required base settings for all providers
  rpm: 3000                          # Requests per minute
  tpm: 1000000                       # Tokens per minute
  
  # Optional prompts for specialized embedding
  query_prompt: null                 # Prompt prepended to queries
  document_prompt: null              # Prompt prepended to documents
  
  # === AZURE OPENAI ===
  provider: "azure"
  api_key: $AZURE_API_KEY
  api_version: "2023-05-15"
  model: text-embedding-3-small
  endpoint: https://your-endpoint.openai.azure.com/
  embedding_dimensions: 512
  
  # === OR OPENAI ===
  # provider: "openai"
  # model: "text-embedding-ada-002"   # Default model
  # base_url: null                    # Optional custom URL
  # api_key: $OPENAI_API_KEY
  
  # === OR NEBIUS ===
  # provider: "nebius"
  # tokenizer: "BAAI/bge-multilingual-gemma2"
  # model: "bge-multilingual-gemma2"
  # base_url: "https://api.studio.nebius.ai/v1/"
  # api_key: $NEBIUS_API_KEY
  
  # === OR HUGGINGFACE LOCAL ===
  # provider: "huggingface_local"
  # model: "sentence-transformers/all-mpnet-base-v2"
  # model_kwargs: {}                  # Additional model parameters
  # encode_kwargs: {}                 # Additional encoding parameters
# ================================
# LOGGING CONFIGURATION  
# ================================
logging:
  log_level: "INFO"                  # DEBUG, INFO, WARNING, ERROR, CRITICAL
  log_to_file: true
  log_file_path: "logs/app.log"
  uvicorn_log_file_path: "logs/uvicorn.log"
  
  # Sentry error tracking
  sentry_dsn: ""                     # Sentry DSN for error tracking
  sentry_event_level: "WARNING"
  sentry_tags: {}                    # Additional tags for Sentry events
  
  # Heartbeat monitoring
  heartbeat_url: null                # URL for uptime monitoring
  heartbeat_interval_minutes: 1
  
  # System monitoring intervals (0 to disable)
  system_monitoring_interval: 0      # System usage monitoring
  storage_monitoring_interval: 0     # Storage monitoring
  database_monitoring_interval: 0    # Database monitoring
# ================================
# API RATE LIMITING
# ================================
search_requests_per_minute: 30       # Search requests per minute limit
search_results_limit: 20             # Maximum search results returned
# ================================
# DATA SOURCES CONFIGURATION
# ================================
data_pools:
  # === LOCAL FILESYSTEM ===
  - id: local
    type: local
    base_path: ./data                # Directory used as synchronization source
    
  # === AMAZON S3 ===  
  - id: s3-documents
    type: s3
    access_key_id: $AWS_ACCESS_KEY_ID
    secret_access_key: $AWS_SECRET_ACCESS_KEY
    endpoint: $S3_ENDPOINT
    bucket_name: your-bucket-name
    provider: "Other"                # AWS, MinIO, DigitalOcean, Other
    base_path: "documents/"          # Optional folder prefix
    
  # === GOOGLE DRIVE ===
  - id: google-drive
    type: drive
    refresh_token: $GOOGLE_REFRESH_TOKEN
    scope: "drive.readonly"          # drive, drive.readonly, drive.file, drive.appfolder, drive.metadata.readonly
    root_folder_id: null             # Optional specific folder
    team_drive: null                 # For shared drives
    client_id: null                  # Optional custom client
    client_secret: null
    base_path: "/"
    
  # === MICROSOFT ONEDRIVE ===
  - id: onedrive
    type: onedrive
    client_id: "4306c62e-d96d-41a0-9f59-f577e3707aba"  # Default client ID
    client_secret: null              # Optional custom client secret
    refresh_token: $ONEDRIVE_REFRESH_TOKEN
    drive_id: $ONEDRIVE_DRIVE_ID
    drive_type: "personal"           # personal, business, documentLibrary
    tenant_id: null                  # Optional custom tenant
    base_path: "/"
    
  # === CONFLUENCE ===
  - id: confluence
    type: confluence
    url: "https://company.atlassian.net"
    username: $CONFLUENCE_USERNAME
    token: $CONFLUENCE_TOKEN
    space_id: $CONFLUENCE_SPACE_ID
    base_path: null                  # Optional
    
  # === SMB/CIFS NETWORK SHARE ===
  - id: smb-share
    type: smb
    host: "server.company.com"
    user: $SMB_USERNAME
    password: $SMB_PASSWORD
    port: null                       # Optional port (default 445)
    domain: null                     # Optional domain
    spn: null                        # Optional SPN
    base_path: "/shared"
    
  # === WEBDAV ===
  - id: webdav
    type: webdav
    url: "https://webdav.company.com"
    vendor: "nextcloud"              # fastmail, nextcloud, owncloud, sharepoint, sharepoint-ntlm, rclone, other
    user: $WEBDAV_USERNAME
    password: $WEBDAV_PASSWORD
    bearer_token: null               # Alternative to username/password
    base_path: "/"
Configuration Options Reference
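The example configuration refers to secrets through placeholders such as `$VAULT_SECRET` and `$AWS_ACCESS_KEY_ID`. This page does not state how DataVault resolves them; the sketch below assumes they are read from environment variables and shows one possible pre-flight check that lists placeholders without a value. The function name `missing_placeholders` and the sample text are illustrative and not part of DataVault.

```python
import os
import re

# Matches placeholders such as $VAULT_SECRET or $AWS_ACCESS_KEY_ID.
PLACEHOLDER = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def missing_placeholders(config_text: str, env=None) -> list[str]:
    """Return the sorted placeholder names that have no value in `env`.

    Assumption: placeholders map one-to-one to environment variables.
    """
    env = os.environ if env is None else env
    names = set()
    for line in config_text.splitlines():
        # Drop YAML comments so commented-out providers are not reported.
        # Simplification: assumes '#' never appears inside a quoted value.
        active = line.split("#", 1)[0]
        names.update(PLACEHOLDER.findall(active))
    return sorted(name for name in names if not env.get(name))


example = """
meingpt_url: $MEINGPT_URL
vault:
  id: your-vault-id
  secret: $VAULT_SECRET
  # api_key: $OPENAI_API_KEY
"""

# Only MEINGPT_URL is defined here, so VAULT_SECRET is reported as missing.
print(missing_placeholders(example, env={"MEINGPT_URL": "https://example.invalid"}))
```

Running a check like this before starting the vault surfaces unset secrets early instead of at the first failed connection.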
Vault Settings
| Field | Type | Default | Required | Description | 
|---|---|---|---|---|
| id | string | - | ✅ | Unique identifier for the Vault instance | 
| secret | string | - | ✅ | Secret key for authentication | 
| standalone_mode | boolean | false | ❌ | If true, vault won't connect to meinGPT server | 
| data_dir | string | "./tmp" | ❌ | Directory for temporary data and rclone sync | 
| ingestion_interval | integer | 900 | ❌ | Seconds between ingestion runs (0 to disable) | 
| tasks_batch_size | integer | 10 | ❌ | Tasks added to the event loop at once per data pool | 
| chunk_size | integer | 256 | ❌ | Text chunk size in tokens | 
| chunk_overlap | integer | 26 | ❌ | Overlapping tokens between chunks | 
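To illustrate how `chunk_size` and `chunk_overlap` interact, the sketch below computes token spans for a plain sliding window: each chunk starts `chunk_size - chunk_overlap` tokens after the previous one. This is an assumption made for illustration; the splitter DataVault actually uses is not described here and may, for example, respect sentence boundaries. The function name `chunk_spans` is hypothetical.

```python
def chunk_spans(n_tokens: int, chunk_size: int = 256, chunk_overlap: int = 26):
    """Return (start, end) token offsets for a simple sliding-window split."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # 230 tokens with the defaults
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # the last chunk reached the end of the document
        start += step
    return spans


# A 600-token document with the default settings yields three chunks.
print(chunk_spans(600))  # [(0, 256), (230, 486), (460, 600)]
```

Under this model a larger overlap preserves more context across chunk boundaries but increases the number of chunks, and therefore the number of embedding requests.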
Weaviate Settings
| Field | Type | Default | Required | Description | 
|---|---|---|---|---|
| connection_type | string | "local" | ❌ | Connection type: "local" or "custom" | 
| host | string | "weaviate" | ❌ | HTTP host for Weaviate instance | 
| port | integer | 8001 | ❌ | HTTP port for Weaviate instance | 
| grpc_host | string | "weaviate" | ❌ | gRPC host for Weaviate instance | 
| grpc_port | integer | 50051 | ❌ | gRPC port for Weaviate instance | 
| api_key | string | "" | ❌ | API key for authentication (empty for local) | 
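Because the vault needs both the HTTP and the gRPC endpoint, it can help to confirm that each port accepts connections before starting it. The following sketch uses only the Python standard library; `port_open` and the two-second timeout are illustrative choices, and a successful TCP connection does not prove that Weaviate itself is healthy.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Values taken from the example configuration; adjust to your deployment.
endpoints = {"http": ("localhost", 8001), "grpc": ("localhost", 50051)}
for name, (host, port) in endpoints.items():
    status = "reachable" if port_open(host, port) else "NOT reachable"
    print(f"weaviate {name} endpoint {host}:{port} is {status}")
```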
Embedding Model Settings
Common Settings (All Providers)
| Field | Type | Default | Required | Description | 
|---|---|---|---|---|
| provider | string | - | ✅ | Provider: "azure", "openai", "nebius", "huggingface_local" | 
| rpm | integer | 3000 | ❌ | Requests per minute | 
| tpm | integer | 1000000 | ❌ | Tokens per minute | 
| query_prompt | string | null | ❌ | Prompt prepended to queries | 
| document_prompt | string | null | ❌ | Prompt prepended to documents | 
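The `rpm` and `tpm` limits apply together, so the effective request rate depends on how many tokens each embedding request carries. The sketch below estimates that upper bound; how DataVault batches chunks into requests is not documented here, so `tokens_per_request` is an assumption you would need to adapt.

```python
def max_requests_per_minute(rpm: int, tpm: int, tokens_per_request: int) -> float:
    """Upper bound on requests per minute when both limits are enforced."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    # Whichever limit is reached first determines the effective rate.
    return min(rpm, tpm / tokens_per_request)


# One 256-token chunk per request: the request limit is the bottleneck.
print(max_requests_per_minute(3000, 1_000_000, 256))      # 3000

# Sixteen 256-token chunks per request: the token limit is the bottleneck.
print(max_requests_per_minute(3000, 1_000_000, 16 * 256))  # 244.140625
```

If ingestion is slower than expected, comparing both bounds this way shows which of the two settings is worth raising with your provider.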
Azure OpenAI (provider: "azure")
| Field | Type | Default | Required | Description | 
|---|---|---|---|---|
| api_key | string | - | ✅ | Azure API key | 
| api_version | string | "2023-05-15" | ❌ | API version | 
| model | string | "text-embedding-3-small" | ❌ | Model name | 
| endpoint | string | "https://meingpt-canada.openai.azure.com/" | ❌ | Azure endpoint URL | 
| embedding_dimensions | integer | 512 | ❌ | Number of dimensions | 
OpenAI (provider: "openai")
| Field | Type | Default | Required | Description | 
|---|---|---|---|---|
| model | string | "text-embedding-ada-002" | ❌ | Model name | 
| base_url | string | null | ❌ | Optional custom URL | 
| api_key | string | - | ✅ | OpenAI API key | 
Nebius (provider: "nebius")
| Field | Type | Default | Required | Description | 
|---|---|---|---|---|
| tokenizer | string | "BAAI/bge-multilingual-gemma2" | ❌ | Tokenizer name | 
| model | string | "bge-multilingual-gemma2" | ❌ | Model name | 
| base_url | string | "https://api.studio.nebius.ai/v1/" | ❌ | Nebius API URL | 
| api_key | string | - | ✅ | Nebius API key | 
HuggingFace Local (provider: "huggingface_local")
| Field | Type | Default | Required | Description | 
|---|---|---|---|---|
| model | string | "sentence-transformers/all-mpnet-base-v2" | ❌ | Model path or name | 
| model_kwargs | object | {} | ❌ | Model initialization parameters | 
| encode_kwargs | object | {} | ❌ | Encoding parameters | 
Logging Settings
| Field | Type | Default | Required | Description | 
|---|---|---|---|---|
| log_level | string | "INFO" | ❌ | DEBUG, INFO, WARNING, ERROR, CRITICAL | 
| log_to_file | boolean | true | ❌ | Write logs to file | 
| log_file_path | string | "logs/app.log" | ❌ | Main application log file | 
| uvicorn_log_file_path | string | "logs/uvicorn.log" | ❌ | Uvicorn server logs | 
| sentry_dsn | string | "" | ❌ | Sentry DSN for error tracking | 
| sentry_event_level | string | "WARNING" | ❌ | Level for Sentry events | 
| sentry_tags | object | {} | ❌ | Additional tags for Sentry events | 
| heartbeat_url | string | null | ❌ | URL for uptime monitoring | 
| heartbeat_interval_minutes | integer | 1 | ❌ | Heartbeat interval | 
| system_monitoring_interval | integer | 0 | ❌ | System usage monitoring (0 to disable) | 
| storage_monitoring_interval | integer | 0 | ❌ | Storage monitoring (0 to disable) | 
| database_monitoring_interval | integer | 0 | ❌ | Database monitoring (0 to disable) | 
Data Pool Types
Common Settings (All Data Pools)
| Field | Type | Required | Description | 
|---|---|---|---|
| id | string | ✅ | Unique identifier for the data pool | 
| type | string | ✅ | Data pool type | 
| base_path | string | ❌ | Optional path within the data source | 
Local (type: "local")
No additional fields required.
S3 (type: "s3")
| Field | Type | Required | Description | 
|---|---|---|---|
| access_key_id | string | ✅ | AWS access key | 
| secret_access_key | string | ✅ | AWS secret key | 
| endpoint | string | ✅ | S3 endpoint URL | 
| bucket_name | string | ✅ | S3 bucket name | 
| provider | string | ❌ | Provider type ("AWS", "MinIO", "DigitalOcean", "Other") | 
Google Drive (type: "drive")
| Field | Type | Required | Description | 
|---|---|---|---|
| refresh_token | string | ✅ | OAuth refresh token | 
| scope | string | ❌ | Access scope (default: "drive.readonly") | 
| root_folder_id | string | ❌ | Optional specific folder ID | 
| team_drive | string | ❌ | Shared drive ID | 
| client_id | string | ❌ | Optional custom client ID | 
| client_secret | string | ❌ | Optional custom client secret | 
OneDrive (type: "onedrive")
| Field | Type | Required | Description | 
|---|---|---|---|
| refresh_token | string | ✅ | OAuth refresh token | 
| drive_id | string | ✅ | ID of the OneDrive drive to sync | 
| drive_type | string | ✅ | Drive type ("personal", "business", "documentLibrary") | 
| client_id | string | ❌ | Application client ID (has default) | 
| client_secret | string | ❌ | Application client secret | 
| tenant_id | string | ❌ | Optional custom tenant | 
Confluence (type: "confluence")
| Field | Type | Required | Description | 
|---|---|---|---|
| url | string | ✅ | Confluence base URL | 
| username | string | ✅ | Username for authentication | 
| token | string | ✅ | API token | 
| space_id | string | ✅ | Confluence space ID | 
SMB (type: "smb")
| Field | Type | Required | Description | 
|---|---|---|---|
| host | string | ✅ | SMB server hostname or IP | 
| user | string | ✅ | Username for authentication | 
| password | string | ✅ | Password for authentication | 
| port | integer | ❌ | Optional port (default 445) | 
| domain | string | ❌ | Optional domain | 
| spn | string | ❌ | Optional SPN | 
WebDAV (type: "webdav")
| Field | Type | Required | Description | 
|---|---|---|---|
| url | string | ✅ | WebDAV server URL | 
| vendor | string | ✅ | Vendor ("fastmail", "nextcloud", "owncloud", "sharepoint", "sharepoint-ntlm", "rclone", "other") | 
| user | string | ❌ | Username for authentication | 
| password | string | ❌ | Password for authentication | 
| bearer_token | string | ❌ | Bearer token (alternative to user/password) |
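The required fields listed in the data pool tables can be checked before deployment. The sketch below is one possible validator, assuming the `data_pools` section has already been parsed into a list of dictionaries (for example with a YAML library); `REQUIRED_FIELDS` mirrors the tables in this section, and `validate_data_pools` is a hypothetical helper, not part of DataVault.

```python
# Required fields per data pool type, taken from the tables above.
REQUIRED_FIELDS = {
    "local": set(),
    "s3": {"access_key_id", "secret_access_key", "endpoint", "bucket_name"},
    "drive": {"refresh_token"},
    "onedrive": {"refresh_token", "drive_id", "drive_type"},
    "confluence": {"url", "username", "token", "space_id"},
    "smb": {"host", "user", "password"},
    "webdav": {"url", "vendor"},
}


def validate_data_pools(pools: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means no issue was found."""
    errors = []
    seen_ids = set()
    for index, pool in enumerate(pools):
        pool_id = pool.get("id")
        label = pool_id or f"#{index}"
        if not pool_id:
            errors.append(f"pool {label}: missing id")
        elif pool_id in seen_ids:
            errors.append(f"pool {label}: duplicate id")
        seen_ids.add(pool_id)

        pool_type = pool.get("type")
        if pool_type not in REQUIRED_FIELDS:
            errors.append(f"pool {label}: unknown type {pool_type!r}")
            continue
        # Treat null and empty values the same as an absent field.
        missing = sorted(
            field for field in REQUIRED_FIELDS[pool_type]
            if pool.get(field) in (None, "")
        )
        if missing:
            errors.append(f"pool {label}: missing required field(s): {', '.join(missing)}")
    return errors


pools = [
    {"id": "local", "type": "local", "base_path": "./data"},
    {"id": "smb-share", "type": "smb", "host": "server.company.com", "user": "svc"},
    {"id": "local", "type": "webdav", "url": "https://webdav.company.com", "vendor": "nextcloud"},
]
for problem in validate_data_pools(pools):
    print(problem)
```

In this example the SMB pool is reported for its missing `password`, and the WebDAV pool for reusing the id `local`.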