Web Crawler
Configure website crawling as a DataVault source
In meinGPT (UI)
For most teams, setup happens directly in meinGPT, without local config files.
- Open the admin settings in meinGPT
- Go to Data Pools / Data Sources
- Click Add Source and select this source type
- Enter credentials and scope in the UI
- Save and start the first sync
If you do not run your own DataVault, this is usually sufficient.
On-Prem Runtime Configuration (Advanced)
```yaml
data_pools:
  - id: website-docs
    type: webcrawler
    url: "https://example.com"
    scraping_method: basic
    max_depth: 2
    max_pages: 500
    output_format: markdown
    only_main_content: true
```

Configuration Options
| Field | Type | Default | Required | Description |
|---|---|---|---|---|
| id | string | - | ✅ | Unique identifier for the data pool |
| type | string | - | ✅ | Must be `webcrawler` |
| url | string | - | ✅ | Crawl start URL |
| scraping_method | string | basic | ❌ | `basic` or `browser` (JS rendering) |
| max_depth | integer | 3 | ❌ | Maximum crawl depth |
| max_pages | integer | 500 | ❌ | Maximum number of pages |
| include_paths | array | null | ❌ | URL path patterns to include |
| exclude_paths | array | null | ❌ | URL path patterns to exclude |
| wait_for_selector | string | null | ❌ | CSS selector wait condition for browser mode |
| page_timeout | integer | 30 | ❌ | Page timeout in seconds |
| delay_between_requests | number | 1.0 | ❌ | Delay between requests |
| concurrent_requests | integer | 5 | ❌ | Parallel crawl requests |
| retry_attempts | integer | 2 | ❌ | Retry attempts |
| proxy_server | string | null | ❌ | Proxy endpoint |
| proxy_username | string | null | ❌ | Proxy username |
| proxy_password | string | null | ❌ | Proxy password |
| user_agent | string | null | ❌ | Custom User-Agent |
| headers | object | null | ❌ | Custom headers |
| output_format | string | markdown | ❌ | `markdown`, `html`, or `text` |
| only_main_content | boolean | true | ❌ | Keep only main content blocks |
| max_age_hours | integer | 24 | ❌ | Re-crawl freshness window |
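As a sketch of the politeness and proxy options in the table above, a pool might combine them like this. All values here are illustrative assumptions (the proxy endpoint and User-Agent string are placeholders), not recommendations:

```yaml
data_pools:
  - id: website-docs-proxied
    type: webcrawler
    url: "https://example.com"
    delay_between_requests: 2.0    # slow down to be polite to the target site
    concurrent_requests: 2         # fewer parallel requests than the default 5
    retry_attempts: 3
    proxy_server: "http://proxy.internal:8080"   # hypothetical proxy endpoint
    user_agent: "DataVaultCrawler/1.0"           # placeholder User-Agent
    headers:
      Accept-Language: "en"
```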
Synchronization
- Vault crawls from `url` and stores extracted content per discovered page.
- Subsequent runs are incremental and respect freshness settings (`max_age_hours`).
- Use include/exclude path filters to keep crawl scope deterministic.
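The scoping and freshness points above can be expressed in a pool definition like the following. The path patterns are assumptions for a typical docs site, not fixed conventions:

```yaml
data_pools:
  - id: website-docs-scoped
    type: webcrawler
    url: "https://example.com"
    include_paths:
      - "/docs/*"            # only follow documentation pages (hypothetical pattern)
    exclude_paths:
      - "/docs/archive/*"    # skip stale sections
      - "/blog/*"
    max_age_hours: 12        # re-crawl pages older than 12 hours on the next sync
```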
Setup
- Start with a small scope (`max_depth`, `max_pages`)
- Add include/exclude rules for relevant sections
- Use `browser` mode only when pages require JavaScript rendering
- Re-run sync and review indexed output
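If pages only render their content via JavaScript, the `browser` mode mentioned above can be configured as sketched here. The CSS selector is a placeholder for whatever element holds your site's main content:

```yaml
data_pools:
  - id: website-docs-js
    type: webcrawler
    url: "https://example.com"
    scraping_method: browser           # render JavaScript before extraction
    wait_for_selector: "main.content"  # hypothetical selector; wait until it appears
    page_timeout: 60                   # allow extra time for script-heavy pages
```

Browser mode is slower and heavier than `basic`, so keep it limited to sources that actually need rendering.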
On-prem only: this source page is relevant when you operate your own DataVault runtime and configure data_pools yourself.