Web Crawler
Configure website crawling as a DataVault source
In meinGPT (UI)
For most teams, setup happens directly in meinGPT, without local config files.
- Open the admin settings in meinGPT
- Go to Data Pools / Data Sources
- Click Add Source and select this source type
- Enter credentials and scope in the UI
- Save and start the first sync
If you do not run your own DataVault, this is usually sufficient.
On-Prem Runtime Configuration (Advanced)
```yaml
data_pools:
  - id: website-docs
    type: webcrawler
    url: "https://example.com"
    scraping_method: basic
    max_depth: 2
    max_pages: 500
    output_format: markdown
    only_main_content: true
```

Configuration Options
| Field | Type | Default | Required | Description |
|---|---|---|---|---|
| id | string | - | ✅ | Unique identifier for the data pool |
| type | string | - | ✅ | Must be `webcrawler` |
| url | string | - | ✅ | Crawl start URL |
| scraping_method | string | basic | ❌ | `basic` or `browser` (JS rendering) |
| max_depth | integer | 3 | ❌ | Maximum crawl depth |
| max_pages | integer | 500 | ❌ | Maximum number of pages |
| include_paths | array | null | ❌ | URL path patterns to include |
| exclude_paths | array | null | ❌ | URL path patterns to exclude |
| wait_for_selector | string | null | ❌ | CSS selector wait condition for browser mode |
| page_timeout | integer | 30 | ❌ | Page timeout in seconds |
| delay_between_requests | number | 1.0 | ❌ | Delay between requests |
| concurrent_requests | integer | 5 | ❌ | Parallel crawl requests |
| retry_attempts | integer | 2 | ❌ | Retry attempts |
| proxy_server | string | null | ❌ | Proxy endpoint |
| proxy_username | string | null | ❌ | Proxy username |
| proxy_password | string | null | ❌ | Proxy password |
| user_agent | string | null | ❌ | Custom User-Agent |
| headers | object | null | ❌ | Custom headers |
| output_format | string | markdown | ❌ | `markdown`, `html`, or `text` |
| only_main_content | boolean | true | ❌ | Keep only main content blocks |
| max_age_hours | integer | 24 | ❌ | Re-crawl freshness window |
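As a sketch of the politeness and proxy options in the table above, a pool might combine them like this. All values here are illustrative assumptions (the proxy endpoint and User-Agent string are placeholders), not recommendations:

```yaml
data_pools:
  - id: website-docs-proxied
    type: webcrawler
    url: "https://example.com"
    delay_between_requests: 2.0    # slow down to be polite to the target site
    concurrent_requests: 2         # fewer parallel requests than the default 5
    retry_attempts: 3
    proxy_server: "http://proxy.internal:8080"   # hypothetical proxy endpoint
    user_agent: "DataVaultCrawler/1.0"           # placeholder User-Agent
    headers:
      Accept-Language: "en"
```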
Synchronization
- Vault crawls from `url` and stores extracted content per discovered page.
- Subsequent runs are incremental and respect freshness settings (`max_age_hours`).
- Use include/exclude path filters to keep crawl scope deterministic.
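The scoping and freshness points above can be expressed in a pool definition like the following. The path patterns are assumptions for a typical docs site, not fixed conventions:

```yaml
data_pools:
  - id: website-docs-scoped
    type: webcrawler
    url: "https://example.com"
    include_paths:
      - "/docs/*"            # only follow documentation pages (hypothetical pattern)
    exclude_paths:
      - "/docs/archive/*"    # skip stale sections
      - "/blog/*"
    max_age_hours: 12        # re-crawl pages older than 12 hours on the next sync
```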
Setup
- Start with a small scope (`max_depth`, `max_pages`)
- Add include/exclude rules for relevant sections
- Use `browser` mode only when pages require JavaScript rendering
- Re-run sync and review indexed output
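If pages only render their content via JavaScript, the `browser` mode mentioned above can be configured as sketched here. The CSS selector is a placeholder for whatever element holds your site's main content:

```yaml
data_pools:
  - id: website-docs-js
    type: webcrawler
    url: "https://example.com"
    scraping_method: browser           # render JavaScript before extraction
    wait_for_selector: "main.content"  # hypothetical selector; wait until it appears
    page_timeout: 60                   # allow extra time for script-heavy pages
```

Browser mode is slower and heavier than `basic`, so keep it limited to sources that actually need rendering.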
On-prem only: this source page is relevant when you operate your own DataVault runtime and configure data_pools yourself.