Complete Configuration Reference

This reference documents every option available in the main configuration file, `config/app_config.yaml`, for meinGPT DataVault enterprise deployments.

Configuration Structure

Vault Settings: core vault credentials, processing, and system configuration

Weaviate Database: vector database connection and configuration

Embedding Models: OpenAI, Azure, Nebius, and HuggingFace model configurations

Data Sources: S3, OneDrive, Google Drive, Confluence, SMB, WebDAV, and local file configurations

Configuration File Structure

config/app_config.yaml

# Version of the config file format
version: 1.0

# Base URL for meinGPT service
meingpt_url: $MEINGPT_URL

# ================================
# VAULT CORE SETTINGS
# ================================
vault:
  # Required: Your vault credentials from meinGPT dashboard
  id: your-vault-id
  secret: $VAULT_SECRET
  
  # Standalone mode - if true, vault won't connect to meinGPT server
  standalone_mode: false
  
  # Data storage directory (acts as sync target for rclone)
  data_dir: ./tmp
  
  # Ingestion settings
  ingestion_interval: 900             # Interval in seconds between ingestion runs (0 for disabled)
  tasks_batch_size: 10               # Number of tasks added to the event loop at once per data pool
  chunk_size: 256                    # Size of each text chunk in tokens
  chunk_overlap: 26                  # Overlapping tokens between consecutive chunks

# ================================
# VECTOR DATABASE CONFIGURATION
# ================================
weaviate:
  # Connection settings
  connection_type: local              # "local" or "custom"
  host: localhost                     # Docker service name or IP (default: weaviate)
  port: 8001                         # Weaviate port
  grpc_host: localhost               # gRPC host for Weaviate (default: weaviate)
  grpc_port: 50051                   # gRPC port for Weaviate
  
  # Authentication (empty string for local)
  api_key: ""

# ================================
# EMBEDDING MODEL CONFIGURATION
# ================================
embedding_model:
  # Base settings shared by all providers (optional; defaults shown)
  rpm: 3000                          # Requests per minute
  tpm: 1000000                       # Tokens per minute
  
  # Optional prompts for specialized embedding
  query_prompt: null                 # Prompt prepended to queries
  document_prompt: null              # Prompt prepended to documents
  
  # === AZURE OPENAI ===
  provider: "azure"
  api_key: $AZURE_API_KEY
  api_version: "2023-05-15"
  model: text-embedding-3-small
  endpoint: https://your-endpoint.openai.azure.com/
  embedding_dimensions: 512
  
  # === OR OPENAI ===
  # provider: "openai"
  # model: "text-embedding-ada-002"   # Default model
  # base_url: null                    # Optional custom URL
  # api_key: $OPENAI_API_KEY
  
  # === OR NEBIUS ===
  # provider: "nebius"
  # tokenizer: "BAAI/bge-multilingual-gemma2"
  # model: "bge-multilingual-gemma2"
  # base_url: "https://api.studio.nebius.ai/v1/"
  # api_key: $NEBIUS_API_KEY
  
  # === OR HUGGINGFACE LOCAL ===
  # provider: "huggingface_local"
  # model: "sentence-transformers/all-mpnet-base-v2"
  # model_kwargs: {}                  # Additional model parameters
  # encode_kwargs: {}                 # Additional encoding parameters

# ================================
# LOGGING CONFIGURATION  
# ================================
logging:
  log_level: "INFO"                  # DEBUG, INFO, WARNING, ERROR, CRITICAL
  log_to_file: true
  log_file_path: "logs/app.log"
  uvicorn_log_file_path: "logs/uvicorn.log"
  
  # Sentry error tracking
  sentry_dsn: ""                     # Sentry DSN for error tracking
  sentry_event_level: "WARNING"
  sentry_tags: {}                    # Additional tags for Sentry events
  
  # Heartbeat monitoring
  heartbeat_url: null                # URL for uptime monitoring
  heartbeat_interval_minutes: 1
  
  # System monitoring intervals (0 to disable)
  system_monitoring_interval: 0      # System usage monitoring
  storage_monitoring_interval: 0     # Storage monitoring
  database_monitoring_interval: 0    # Database monitoring

# ================================
# API RATE LIMITING
# ================================
search_requests_per_minute: 30       # Search requests per minute limit
search_results_limit: 20             # Maximum search results returned

# ================================
# DATA SOURCES CONFIGURATION
# ================================
data_pools:
  # === LOCAL FILESYSTEM ===
  - id: local
    type: local
    base_path: ./data                # Directory used as synchronization source
    
  # === AMAZON S3 ===  
  - id: s3-documents
    type: s3
    access_key_id: $AWS_ACCESS_KEY_ID
    secret_access_key: $AWS_SECRET_ACCESS_KEY
    endpoint: $S3_ENDPOINT
    bucket_name: your-bucket-name
    provider: "Other"                # AWS, MinIO, DigitalOcean, Other
    base_path: "documents/"          # Optional folder prefix
    
  # === GOOGLE DRIVE ===
  - id: google-drive
    type: drive
    refresh_token: $GOOGLE_REFRESH_TOKEN
    scope: "drive.readonly"          # drive, drive.readonly, drive.file, drive.appfolder, drive.metadata.readonly
    root_folder_id: null             # Optional specific folder
    team_drive: null                 # For shared drives
    client_id: null                  # Optional custom client
    client_secret: null
    base_path: "/"
    
  # === MICROSOFT ONEDRIVE ===
  - id: onedrive
    type: onedrive
    client_id: "4306c62e-d96d-41a0-9f59-f577e3707aba"  # Default client ID
    client_secret: null              # Optional custom client secret
    refresh_token: $ONEDRIVE_REFRESH_TOKEN
    drive_id: $ONEDRIVE_DRIVE_ID
    drive_type: "personal"           # personal, business, documentLibrary
    tenant_id: null                  # Optional custom tenant
    base_path: "/"
    
  # === CONFLUENCE ===
  - id: confluence
    type: confluence
    url: "https://company.atlassian.net"
    username: $CONFLUENCE_USERNAME
    token: $CONFLUENCE_TOKEN
    space_id: $CONFLUENCE_SPACE_ID
    base_path: null                  # Optional
    
  # === SMB/CIFS NETWORK SHARE ===
  - id: smb-share
    type: smb
    host: "server.company.com"
    user: $SMB_USERNAME
    password: $SMB_PASSWORD
    port: null                       # Optional port (default 445)
    domain: null                     # Optional domain
    spn: null                        # Optional SPN
    base_path: "/shared"
    
  # === WEBDAV ===
  - id: webdav
    type: webdav
    url: "https://webdav.company.com"
    vendor: "nextcloud"              # fastmail, nextcloud, owncloud, sharepoint, sharepoint-ntlm, rclone, other
    user: $WEBDAV_USERNAME
    password: $WEBDAV_PASSWORD
    bearer_token: null               # Alternative to username/password
    base_path: "/"

Configuration Options Reference

Vault Settings

Field	Type	Default	Required	Description
`id`	string	-	✅	Unique identifier for the Vault instance
`secret`	string	-	✅	Secret key for authentication
`standalone_mode`	boolean	false	❌	If true, vault won't connect to meinGPT server
`data_dir`	string	"./tmp"	❌	Directory for temporary data and rclone sync
`ingestion_interval`	integer	900	❌	Seconds between ingestion runs (0 to disable)
`tasks_batch_size`	integer	10	❌	Number of tasks added to the event loop at once per data pool
`chunk_size`	integer	256	❌	Text chunk size in tokens
`chunk_overlap`	integer	26	❌	Overlapping tokens between chunks
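
To make the relationship between `chunk_size` and `chunk_overlap` concrete, the following sketch computes chunk boundaries for a document of a given token length. It assumes a plain sliding window in which each chunk starts `chunk_size - chunk_overlap` tokens after the previous one. The chunker DataVault actually uses is not specified here and may split differently (for example on sentence boundaries); `chunk_spans` is a hypothetical helper.

```python
def chunk_spans(n_tokens, chunk_size=256, chunk_overlap=26):
    """Return (start, end) token offsets for a sliding-window split.

    Assumption: consecutive chunks share exactly `chunk_overlap` tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    stride = chunk_size - chunk_overlap  # 230 tokens with the defaults
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        spans.append((start, end))
        if end == n_tokens:  # the last chunk reached the end of the text
            break
        start += stride
    return spans


if __name__ == "__main__":
    for span in chunk_spans(1000):
        print(span)
```

With the default values, a 1000-token document yields five chunks under this model, the last one shorter than `chunk_size`.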

Weaviate Settings

Field	Type	Default	Required	Description
`connection_type`	string	"local"	❌	Connection type: "local" or "custom"
`host`	string	"weaviate"	❌	HTTP host for Weaviate instance
`port`	integer	8001	❌	HTTP port for Weaviate instance
`grpc_host`	string	"weaviate"	❌	gRPC host for Weaviate instance
`grpc_port`	integer	50051	❌	gRPC port for Weaviate instance
`api_key`	string	""	❌	API key for authentication (empty for local)

Embedding Model Settings

Common Settings (All Providers)

Field	Type	Default	Required	Description
`provider`	string	-	✅	Provider: "azure", "openai", "nebius", "huggingface_local"
`rpm`	integer	3000	❌	Requests per minute
`tpm`	integer	1000000	❌	Tokens per minute
`query_prompt`	string	null	❌	Prompt prepended to queries
`document_prompt`	string	null	❌	Prompt prepended to documents

Azure OpenAI (`provider: "azure"`)

Field	Type	Default	Required	Description
`api_key`	string	-	✅	Azure API key
`api_version`	string	"2023-05-15"	❌	API version
`model`	string	"text-embedding-3-small"	❌	Model name
`endpoint`	string	"https://meingpt-canada.openai.azure.com/"	❌	Azure endpoint URL
`embedding_dimensions`	integer	512	❌	Number of dimensions

OpenAI (`provider: "openai"`)

Field	Type	Default	Required	Description
`model`	string	"text-embedding-ada-002"	❌	Model name
`base_url`	string	null	❌	Optional custom URL
`api_key`	string	-	✅	OpenAI API key

Nebius (`provider: "nebius"`)

Field	Type	Default	Required	Description
`tokenizer`	string	"BAAI/bge-multilingual-gemma2"	❌	Tokenizer name
`model`	string	"bge-multilingual-gemma2"	❌	Model name
`base_url`	string	"https://api.studio.nebius.ai/v1/"	❌	Nebius API URL
`api_key`	string	-	✅	Nebius API key

HuggingFace Local (`provider: "huggingface_local"`)

Field	Type	Default	Required	Description
`model`	string	"sentence-transformers/all-mpnet-base-v2"	❌	Model path or name
`model_kwargs`	object	{}	❌	Model initialization parameters
`encode_kwargs`	object	{}	❌	Encoding parameters
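
The provider tables above can be read as a set of defaults plus a small number of required fields. The sketch below shows how an `embedding_model` section might be resolved into its effective settings. The default values are copied from the tables and the sample file (`{}` for `model_kwargs` and `encode_kwargs`); the function `resolve_embedding_config` and its error messages are hypothetical and do not represent DataVault's own validation.

```python
import copy

# Defaults shared by all providers, as listed under "Common Settings".
COMMON_DEFAULTS = {
    "rpm": 3000,
    "tpm": 1000000,
    "query_prompt": None,
    "document_prompt": None,
}

# Provider-specific defaults, copied from the tables above.
PROVIDER_DEFAULTS = {
    "azure": {
        "api_version": "2023-05-15",
        "model": "text-embedding-3-small",
        "endpoint": "https://meingpt-canada.openai.azure.com/",
        "embedding_dimensions": 512,
    },
    "openai": {"model": "text-embedding-ada-002", "base_url": None},
    "nebius": {
        "tokenizer": "BAAI/bge-multilingual-gemma2",
        "model": "bge-multilingual-gemma2",
        "base_url": "https://api.studio.nebius.ai/v1/",
    },
    "huggingface_local": {
        "model": "sentence-transformers/all-mpnet-base-v2",
        "model_kwargs": {},
        "encode_kwargs": {},
    },
}

# Providers whose table marks `api_key` as required.
NEEDS_API_KEY = {"azure", "openai", "nebius"}


def resolve_embedding_config(cfg):
    """Merge user settings over the documented defaults and check required fields."""
    provider = cfg.get("provider")
    if provider not in PROVIDER_DEFAULTS:
        raise ValueError(f"unknown provider: {provider!r}")
    if provider in NEEDS_API_KEY and not cfg.get("api_key"):
        raise ValueError(f"provider {provider!r} requires api_key")
    resolved = copy.deepcopy(COMMON_DEFAULTS)
    resolved.update(copy.deepcopy(PROVIDER_DEFAULTS[provider]))
    resolved.update(cfg)  # explicit settings win over defaults
    return resolved


if __name__ == "__main__":
    print(resolve_embedding_config({"provider": "openai", "api_key": "example-key"}))
```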

Logging Settings

Field	Type	Default	Required	Description
`log_level`	string	"INFO"	❌	DEBUG, INFO, WARNING, ERROR, CRITICAL
`log_to_file`	boolean	true	❌	Write logs to file
`log_file_path`	string	"logs/app.log"	❌	Main application log file
`uvicorn_log_file_path`	string	"logs/uvicorn.log"	❌	Uvicorn server logs
`sentry_dsn`	string	""	❌	Sentry DSN for error tracking
`sentry_event_level`	string	"WARNING"	❌	Level for Sentry events
`sentry_tags`	object	{}	❌	Additional tags for Sentry events
`heartbeat_url`	string	null	❌	URL for uptime monitoring
`heartbeat_interval_minutes`	integer	1	❌	Heartbeat interval in minutes
`system_monitoring_interval`	integer	0	❌	System usage monitoring (0 to disable)
`storage_monitoring_interval`	integer	0	❌	Storage monitoring (0 to disable)
`database_monitoring_interval`	integer	0	❌	Database monitoring (0 to disable)
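
As an illustration of how `log_level`, `log_to_file`, and `log_file_path` interact, the sketch below builds a logger from a dictionary shaped like the `logging` section, using Python's standard `logging` module. It is an assumption-laden example, not DataVault's actual logging setup: the helper `build_logger`, the logger name, and the message format are invented for illustration, and the Sentry, heartbeat, and monitoring options are not covered.

```python
import logging
import os

VALID_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


def build_logger(cfg, name="datavault-example"):
    """Create a logger from a dict shaped like the `logging` section."""
    level_name = str(cfg.get("log_level", "INFO")).upper()
    if level_name not in VALID_LEVELS:
        raise ValueError(f"unsupported log_level: {level_name}")

    logger = logging.getLogger(name)
    logger.setLevel(getattr(logging, level_name))
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls

    if cfg.get("log_to_file", True):
        path = cfg.get("log_file_path", "logs/app.log")
        # Create the log directory if needed before opening the file.
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        handler = logging.FileHandler(path)
    else:
        handler = logging.StreamHandler()  # falls back to stderr

    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger({"log_level": "INFO", "log_to_file": False})
    log.info("logging configured")
```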

Data Pool Types

Common Settings (All Data Pools)

Field	Type	Required	Description
`id`	string	✅	Unique identifier for the data pool
`type`	string	✅	Data pool type
`base_path`	string	❌	Optional path within the data source

Local (`type: "local"`)

No additional fields are required. `base_path` is the directory used as the synchronization source.

S3 (`type: "s3"`)

Field	Type	Required	Description
`access_key_id`	string	✅	S3 access key ID
`secret_access_key`	string	✅	S3 secret access key
`endpoint`	string	✅	S3 endpoint URL
`bucket_name`	string	✅	S3 bucket name
`provider`	string	❌	Provider type ("AWS", "MinIO", "DigitalOcean", "Other")
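
Because `data_dir` is described as the sync target for rclone, it can help to see how an S3 data pool might correspond to an rclone remote. The sketch below renders an rclone-style configuration section from a data pool dictionary. The option names `type`, `provider`, `access_key_id`, `secret_access_key`, and `endpoint` are rclone S3 backend options, but whether DataVault maps the fields exactly this way is an assumption, as are the helper names `rclone_s3_section` and `remote_path`.

```python
import configparser
import io


def rclone_s3_section(pool):
    """Render an rclone-style config section for an S3 data pool (illustrative)."""
    # interpolation=None keeps '%' characters in secrets from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    section = {
        "type": "s3",
        "access_key_id": pool["access_key_id"],
        "secret_access_key": pool["secret_access_key"],
        "endpoint": pool["endpoint"],
    }
    if pool.get("provider"):
        section["provider"] = pool["provider"]
    parser[pool["id"]] = section

    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


def remote_path(pool):
    """Build the rclone source path: remote name, bucket, optional prefix."""
    prefix = (pool.get("base_path") or "").lstrip("/")
    return f"{pool['id']}:{pool['bucket_name']}/{prefix}"


if __name__ == "__main__":
    example_pool = {
        "id": "s3-documents",
        "type": "s3",
        "access_key_id": "EXAMPLE_KEY_ID",
        "secret_access_key": "EXAMPLE_SECRET",
        "endpoint": "https://s3.example.com",
        "bucket_name": "your-bucket-name",
        "provider": "Other",
        "base_path": "documents/",
    }
    print(rclone_s3_section(example_pool))
    print(remote_path(example_pool))
```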

Google Drive (`type: "drive"`)

Field	Type	Required	Description
`refresh_token`	string	✅	OAuth refresh token
`scope`	string	❌	Access scope (default: "drive.readonly")
`root_folder_id`	string	❌	Optional specific folder ID
`team_drive`	string	❌	Shared drive ID
`client_id`	string	❌	Optional custom client ID
`client_secret`	string	❌	Optional custom client secret

OneDrive (`type: "onedrive"`)

Field	Type	Required	Description
`refresh_token`	string	✅	OAuth refresh token
`drive_id`	string	✅	OneDrive drive ID
`drive_type`	string	✅	Drive type ("personal", "business", "documentLibrary")
`client_id`	string	❌	Application client ID (has default)
`client_secret`	string	❌	Application client secret
`tenant_id`	string	❌	Optional custom tenant

Confluence (`type: "confluence"`)

Field	Type	Required	Description
`url`	string	✅	Confluence base URL
`username`	string	✅	Username for authentication
`token`	string	✅	API token
`space_id`	string	✅	Confluence space ID

SMB (`type: "smb"`)

Field	Type	Required	Description
`host`	string	✅	SMB server hostname or IP
`user`	string	✅	Username for authentication
`password`	string	✅	Password for authentication
`port`	integer	❌	Optional port (default 445)
`domain`	string	❌	Optional domain
`spn`	string	❌	Optional SPN

WebDAV (`type: "webdav"`)

Field	Type	Required	Description
`url`	string	✅	WebDAV server URL
`vendor`	string	✅	Vendor ("fastmail", "nextcloud", "owncloud", "sharepoint", "sharepoint-ntlm", "rclone", "other")
`user`	string	❌	Username for authentication
`password`	string	❌	Password for authentication
`bearer_token`	string	❌	Bearer token (alternative to user/password)
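
The required-field columns in the data pool tables lend themselves to a pre-flight check before the vault is started. The sketch below validates a list of data pool dictionaries against those tables and also checks that each `id` is unique. It is a hypothetical helper written for this reference, not part of DataVault, and it checks only for the presence of fields, not their values.

```python
# Required fields per data pool type, copied from the tables above.
REQUIRED_FIELDS = {
    "local": [],
    "s3": ["access_key_id", "secret_access_key", "endpoint", "bucket_name"],
    "drive": ["refresh_token"],
    "onedrive": ["refresh_token", "drive_id", "drive_type"],
    "confluence": ["url", "username", "token", "space_id"],
    "smb": ["host", "user", "password"],
    "webdav": ["url", "vendor"],
}


def validate_data_pools(pools):
    """Return a list of human-readable problems; an empty list means no issues found."""
    errors = []
    seen_ids = set()
    for index, pool in enumerate(pools):
        pool_id = pool.get("id")
        label = pool_id or f"#{index}"
        if not pool_id:
            errors.append(f"{label}: missing id")
        elif pool_id in seen_ids:
            errors.append(f"{label}: duplicate id")
        seen_ids.add(pool_id)

        pool_type = pool.get("type")
        if pool_type not in REQUIRED_FIELDS:
            errors.append(f"{label}: unknown type {pool_type!r}")
            continue
        for field in REQUIRED_FIELDS[pool_type]:
            if pool.get(field) in (None, ""):
                errors.append(f"{label}: missing {field}")
    return errors


if __name__ == "__main__":
    pools = [
        {"id": "local", "type": "local", "base_path": "./data"},
        {"id": "smb-share", "type": "smb", "host": "server.company.com", "user": "svc"},
    ]
    for problem in validate_data_pools(pools):
        print(problem)
```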