Complete Configuration Reference

Complete reference for all DataVault enterprise configuration options

This reference documents every option in the main configuration file, config/app_config.yaml, for meinGPT DataVault enterprise deployments.

Configuration Structure

Vault Settings

Core vault credentials, processing, and system configuration

Weaviate Database

Vector database connection and configuration

Embedding Models

OpenAI, Azure, Nebius, and HuggingFace model configurations

Data Sources

S3, OneDrive, Google Drive, Confluence, SMB, WebDAV, and local file configurations

Configuration File Structure

config/app_config.yaml
# Version of the config file format
version: 1.0

# Base URL for meinGPT service
meingpt_url: $MEINGPT_URL

# ================================
# VAULT CORE SETTINGS
# ================================
vault:
  # Required: Your vault credentials from meinGPT dashboard
  id: your-vault-id
  secret: $VAULT_SECRET
  
  # Standalone mode - if true, vault won't connect to meinGPT server
  standalone_mode: false
  
  # Data storage directory (acts as sync target for rclone)
  data_dir: ./tmp
  
  # Ingestion settings
  ingestion_interval: 900             # Interval in seconds between ingestion runs (0 for disabled)
  tasks_batch_size: 10               # Tasks added to event loop at once from every datapool
  chunk_size: 256                    # Size of each text chunk in tokens
  chunk_overlap: 26                  # Overlapping tokens between consecutive chunks

# ================================
# VECTOR DATABASE CONFIGURATION
# ================================
weaviate:
  # Connection settings
  connection_type: local              # "local" or "custom"
  host: localhost                     # Docker service name or IP
  port: 8001                         # Weaviate port
  grpc_host: localhost               # gRPC host for Weaviate
  grpc_port: 50051                   # gRPC port for Weaviate
  
  # Authentication (empty string for local)
  api_key: ""

# ================================
# EMBEDDING MODEL CONFIGURATION
# ================================
embedding_model:
  # Required base settings for all providers
  rpm: 3000                          # Requests per minute
  tpm: 1000000                       # Tokens per minute
  
  # Optional prompts for specialized embedding
  query_prompt: null                 # Prompt prepended to queries
  document_prompt: null              # Prompt prepended to documents
  
  # === AZURE OPENAI ===
  provider: "azure"
  api_key: $AZURE_API_KEY
  api_version: "2023-05-15"
  model: text-embedding-3-small
  endpoint: https://your-endpoint.openai.azure.com/
  embedding_dimensions: 512
  
  # === OR OPENAI ===
  # provider: "openai"
  # model: "text-embedding-ada-002"   # Default model
  # base_url: null                    # Optional custom URL
  # api_key: $OPENAI_API_KEY
  
  # === OR NEBIUS ===
  # provider: "nebius"
  # tokenizer: "BAAI/bge-multilingual-gemma2"
  # model: "bge-multilingual-gemma2"
  # base_url: "https://api.studio.nebius.ai/v1/"
  # api_key: $NEBIUS_API_KEY
  
  # === OR HUGGINGFACE LOCAL ===
  # provider: "huggingface_local"
  # model: "sentence-transformers/all-mpnet-base-v2"
  # model_kwargs: {}                  # Additional model parameters
  # encode_kwargs: {}                 # Additional encoding parameters

# ================================
# LOGGING CONFIGURATION  
# ================================
logging:
  log_level: "INFO"                  # DEBUG, INFO, WARNING, ERROR, CRITICAL
  log_to_file: true
  log_file_path: "logs/app.log"
  uvicorn_log_file_path: "logs/uvicorn.log"
  
  # Sentry error tracking
  sentry_dsn: ""                     # Sentry DSN for error tracking
  sentry_event_level: "WARNING"
  sentry_tags: {}                    # Additional tags for Sentry events
  
  # Heartbeat monitoring
  heartbeat_url: null                # URL for uptime monitoring
  heartbeat_interval_minutes: 1
  
  # System monitoring intervals (0 to disable)
  system_monitoring_interval: 0      # System usage monitoring
  storage_monitoring_interval: 0     # Storage monitoring
  database_monitoring_interval: 0    # Database monitoring

# ================================
# API RATE LIMITING
# ================================
search_requests_per_minute: 30       # Search requests per minute limit
search_results_limit: 20             # Maximum search results returned

# ================================
# DATA SOURCES CONFIGURATION
# ================================
data_pools:
  # === LOCAL FILESYSTEM ===
  - id: local
    type: local
    base_path: ./data                # Directory used as synchronization source
    
  # === AMAZON S3 ===  
  - id: s3-documents
    type: s3
    access_key_id: $AWS_ACCESS_KEY_ID
    secret_access_key: $AWS_SECRET_ACCESS_KEY
    endpoint: $S3_ENDPOINT
    bucket_name: your-bucket-name
    provider: "Other"                # AWS, MinIO, DigitalOcean, Other
    base_path: "documents/"          # Optional folder prefix
    
  # === GOOGLE DRIVE ===
  - id: google-drive
    type: drive
    refresh_token: $GOOGLE_REFRESH_TOKEN
    scope: "drive.readonly"          # drive, drive.readonly, drive.file, drive.appfolder, drive.metadata.readonly
    root_folder_id: null             # Optional specific folder
    team_drive: null                 # For shared drives
    client_id: null                  # Optional custom client
    client_secret: null
    base_path: "/"
    
  # === MICROSOFT ONEDRIVE ===
  - id: onedrive
    type: onedrive
    client_id: "4306c62e-d96d-41a0-9f59-f577e3707aba"  # Default client ID
    client_secret: null              # Optional custom client secret
    refresh_token: $ONEDRIVE_REFRESH_TOKEN
    drive_id: $ONEDRIVE_DRIVE_ID
    drive_type: "personal"           # personal, business, documentLibrary
    tenant_id: null                  # Optional custom tenant
    base_path: "/"
    
  # === CONFLUENCE ===
  - id: confluence
    type: confluence
    url: "https://company.atlassian.net"
    username: $CONFLUENCE_USERNAME
    token: $CONFLUENCE_TOKEN
    space_id: $CONFLUENCE_SPACE_ID
    base_path: null                  # Optional
    
  # === SMB/CIFS NETWORK SHARE ===
  - id: smb-share
    type: smb
    host: "server.company.com"
    user: $SMB_USERNAME
    password: $SMB_PASSWORD
    port: null                       # Optional port (default 445)
    domain: null                     # Optional domain
    spn: null                        # Optional SPN
    base_path: "/shared"
    
  # === WEBDAV ===
  - id: webdav
    type: webdav
    url: "https://webdav.company.com"
    vendor: "nextcloud"              # fastmail, nextcloud, owncloud, sharepoint, sharepoint-ntlm, rclone, other
    user: $WEBDAV_USERNAME
    password: $WEBDAV_PASSWORD
    bearer_token: null               # Alternative to username/password
    base_path: "/"

Configuration Options Reference

Vault Settings

| Field | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| id | string | - | | Unique identifier for the Vault instance |
| secret | string | - | | Secret key for authentication |
| standalone_mode | boolean | false | | If true, vault won't connect to meinGPT server |
| data_dir | string | "./tmp" | | Directory for temporary data and rclone sync |
| ingestion_interval | integer | 900 | | Seconds between ingestion runs (0 to disable) |
| tasks_batch_size | integer | 10 | | Tasks added to event loop at once per datapool |
| chunk_size | integer | 256 | | Text chunk size in tokens |
| chunk_overlap | integer | 26 | | Overlapping tokens between chunks |
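
The defaults pair a 256-token chunk with a 26-token overlap, which is roughly 10% of the chunk. As a sketch of how these fields might be adjusted together, the fragment below doubles the chunk size, keeps the overlap near 10%, and slows ingestion to one run per hour. The specific values are illustrative assumptions, not recommendations taken from DataVault itself.

```yaml
vault:
  id: your-vault-id
  secret: $VAULT_SECRET
  ingestion_interval: 3600   # assumed value: one run per hour; 0 disables ingestion
  tasks_batch_size: 10       # unchanged default
  chunk_size: 512            # assumed value: twice the default of 256
  chunk_overlap: 51          # assumed value: about 10% of chunk_size, like 26 of 256
```

Larger chunks keep more context in each embedded unit but make each search hit less specific, so changes like this are worth checking against real queries.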

Weaviate Settings

| Field | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| connection_type | string | "local" | | Connection type: "local" or "custom" |
| host | string | "weaviate" | | HTTP host for Weaviate instance |
| port | integer | 8001 | | HTTP port for Weaviate instance |
| grpc_host | string | "weaviate" | | gRPC host for Weaviate instance |
| grpc_port | integer | 50051 | | gRPC port for Weaviate instance |
| api_key | string | "" | | API key for authentication (empty for local) |
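
For a Weaviate instance that runs outside the local Docker setup, a "custom" connection might look like the sketch below. The hostname and the WEAVIATE_API_KEY variable name are hypothetical, and the fragment assumes that "custom" mode reads the same host, port, and api_key fields listed in the table.

```yaml
weaviate:
  connection_type: custom
  host: weaviate.internal.example.com        # hypothetical hostname
  port: 8001                                 # HTTP port, default from the table
  grpc_host: weaviate.internal.example.com   # hypothetical hostname
  grpc_port: 50051                           # gRPC port, default from the table
  api_key: $WEAVIATE_API_KEY                 # hypothetical environment variable name
```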

Embedding Model Settings

Common Settings (All Providers)

| Field | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| provider | string | - | | Provider: "azure", "openai", "nebius", "huggingface_local" |
| rpm | integer | 3000 | | Requests per minute |
| tpm | integer | 1000000 | | Tokens per minute |
| query_prompt | string | null | | Prompt prepended to queries |
| document_prompt | string | null | | Prompt prepended to documents |
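
Some embedding models are trained with distinct prefixes for queries and documents, which is the case the two prompt fields appear designed for. The sketch below uses the E5 model family, which expects "query: " and "passage: " prefixes. The model choice is an assumption for illustration, and the fragment assumes DataVault prepends the prompt strings verbatim.

```yaml
embedding_model:
  provider: "huggingface_local"
  model: "intfloat/multilingual-e5-base"   # assumed model choice, not a DataVault default
  rpm: 3000
  tpm: 1000000
  query_prompt: "query: "                  # prefix E5 models expect for search queries
  document_prompt: "passage: "             # prefix E5 models expect for indexed text
```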

Azure OpenAI (provider: "azure")

| Field | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| api_key | string | - | | Azure API key |
| api_version | string | "2023-05-15" | | API version |
| model | string | "text-embedding-3-small" | | Model name |
| endpoint | string | "https://meingpt-canada.openai.azure.com/" | | Azure endpoint URL |
| embedding_dimensions | integer | 512 | | Number of dimensions |

OpenAI (provider: "openai")

| Field | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| model | string | "text-embedding-ada-002" | | Model name |
| base_url | string | null | | Optional custom URL |
| api_key | string | - | | OpenAI API key |

Nebius (provider: "nebius")

| Field | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| tokenizer | string | "BAAI/bge-multilingual-gemma2" | | Tokenizer name |
| model | string | "bge-multilingual-gemma2" | | Model name |
| base_url | string | "https://api.studio.nebius.ai/v1/" | | Nebius API URL |
| api_key | string | - | | Nebius API key |

HuggingFace Local (provider: "huggingface_local")

| Field | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| model | string | "sentence-transformers/all-mpnet-base-v2" | | Model path or name |
| model_kwargs | object | {} | | Model initialization parameters |
| encode_kwargs | object | {} | | Encoding parameters |
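
The two kwargs objects are empty by default. The sketch below fills them with options that the sentence-transformers library accepts: device at model load, and normalize_embeddings and batch_size at encode time. It assumes DataVault forwards these objects to the library unchanged, which the table does not confirm.

```yaml
embedding_model:
  provider: "huggingface_local"
  model: "sentence-transformers/all-mpnet-base-v2"
  rpm: 3000
  tpm: 1000000
  model_kwargs:
    device: "cpu"                # assumed pass-through to model loading
  encode_kwargs:
    normalize_embeddings: true   # assumed pass-through to the encode call
    batch_size: 32               # assumed pass-through to the encode call
```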

Logging Settings

| Field | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| log_level | string | "INFO" | | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| log_to_file | boolean | true | | Write logs to file |
| log_file_path | string | "logs/app.log" | | Main application log file |
| uvicorn_log_file_path | string | "logs/uvicorn.log" | | Uvicorn server logs |
| sentry_dsn | string | "" | | Sentry DSN for error tracking |
| sentry_event_level | string | "WARNING" | | Level for Sentry events |
| sentry_tags | object | {} | | Additional tags for Sentry events |
| heartbeat_url | string | null | | URL for uptime monitoring |
| heartbeat_interval_minutes | integer | 1 | | Heartbeat interval |
| system_monitoring_interval | integer | 0 | | System usage monitoring (0 to disable) |
| storage_monitoring_interval | integer | 0 | | Storage monitoring (0 to disable) |
| database_monitoring_interval | integer | 0 | | Database monitoring (0 to disable) |
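
As a sketch of a logging block with error tracking and uptime monitoring enabled, the fragment below sets a Sentry DSN, one tag, and a heartbeat URL. The SENTRY_DSN variable name, the tag, and the heartbeat URL are hypothetical placeholders. The monitoring intervals stay at 0 because the table does not state their unit.

```yaml
logging:
  log_level: "INFO"
  log_to_file: true
  log_file_path: "logs/app.log"
  uvicorn_log_file_path: "logs/uvicorn.log"
  sentry_dsn: $SENTRY_DSN                    # hypothetical environment variable name
  sentry_event_level: "ERROR"                # assumed choice: report errors only
  sentry_tags:
    environment: "production"                # hypothetical tag
  heartbeat_url: "https://uptime.example.com/ping/your-check-id"   # hypothetical URL
  heartbeat_interval_minutes: 5              # assumed value
  system_monitoring_interval: 0              # disabled
  storage_monitoring_interval: 0             # disabled
  database_monitoring_interval: 0            # disabled
```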

Data Pool Types

Common Settings (All Data Pools)

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | | Unique identifier for the data pool |
| type | string | | Data pool type |
| base_path | string | | Optional path within the data source |

Local (type: "local")

No additional fields required.
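
Because a local pool needs only the common fields, several directories can be registered as separate pools, each with its own id. The pool ids and paths in this sketch are hypothetical.

```yaml
data_pools:
  - id: local-contracts            # hypothetical pool id
    type: local
    base_path: ./data/contracts    # hypothetical directory
  - id: local-handbooks            # hypothetical pool id
    type: local
    base_path: ./data/handbooks    # hypothetical directory
```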

S3 (type: "s3")

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| access_key_id | string | | AWS access key |
| secret_access_key | string | | AWS secret key |
| endpoint | string | | S3 endpoint URL |
| bucket_name | string | | S3 bucket name |
| provider | string | | Provider type ("AWS", "MinIO", "DigitalOcean", "Other") |

Google Drive (type: "drive")

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| refresh_token | string | | OAuth refresh token |
| scope | string | | Access scope (default: "drive.readonly") |
| root_folder_id | string | | Optional specific folder ID |
| team_drive | string | | Shared drive ID |
| client_id | string | | Optional custom client ID |
| client_secret | string | | Optional custom client secret |

OneDrive (type: "onedrive")

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| refresh_token | string | | OAuth refresh token |
| drive_id | string | | OneDrive ID |
| drive_type | string | | Drive type ("personal", "business", "documentLibrary") |
| client_id | string | | Application client ID (has default) |
| client_secret | string | | Application client secret |
| tenant_id | string | | Optional custom tenant |

Confluence (type: "confluence")

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | | Confluence base URL |
| username | string | | Username for authentication |
| token | string | | API token |
| space_id | string | | Confluence space ID |

SMB (type: "smb")

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| host | string | | SMB server hostname or IP |
| user | string | | Username for authentication |
| password | string | | Password for authentication |
| port | integer | | Optional port (default 445) |
| domain | string | | Optional domain |
| spn | string | | Optional SPN |

WebDAV (type: "webdav")

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | | WebDAV server URL |
| vendor | string | | Vendor ("fastmail", "nextcloud", "owncloud", "sharepoint", "sharepoint-ntlm", "rclone", "other") |
| user | string | | Username for authentication |
| password | string | | Password for authentication |
| bearer_token | string | | Bearer token (alternative to user/password) |
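
Since bearer_token is described as an alternative to user and password, a token-based pool might be configured as in the sketch below. It assumes user and password can be omitted when a token is set. The pool id and the WEBDAV_BEARER_TOKEN variable name are hypothetical.

```yaml
data_pools:
  - id: webdav-token                       # hypothetical pool id
    type: webdav
    url: "https://webdav.company.com"
    vendor: "other"
    bearer_token: $WEBDAV_BEARER_TOKEN     # hypothetical environment variable name
    base_path: "/"
```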