# Web Crawler

Configure website crawling as a DataVault source.
## In meinGPT (UI)
For most teams, setup is done directly in meinGPT without editing local config files.
- Open admin settings in meinGPT
- Go to Data Pools / Data Sources
- Click Add Source and choose this source type
- Configure credentials and scope in the UI
- Save and trigger the first sync
If you do not run your own DataVault runtime, this is usually all you need.
## On-Prem Runtime Configuration (Advanced)

```yaml
data_pools:
  - id: website-docs
    type: webcrawler
    url: "https://example.com"
    scraping_method: basic
    max_depth: 2
    max_pages: 500
    output_format: markdown
    only_main_content: true
```

## Configuration Options
| Field | Type | Default | Required | Description |
|---|---|---|---|---|
| `id` | string | - | ✓ | Unique identifier for the data pool |
| `type` | string | - | ✓ | Must be `webcrawler` |
| `url` | string | - | ✓ | Crawl start URL |
| `scraping_method` | string | `basic` | ✗ | `basic` or `browser` (JS rendering) |
| `max_depth` | integer | 3 | ✗ | Maximum crawl depth |
| `max_pages` | integer | 500 | ✗ | Maximum number of pages |
| `include_paths` | array | null | ✗ | URL path patterns to include |
| `exclude_paths` | array | null | ✗ | URL path patterns to exclude |
| `wait_for_selector` | string | null | ✗ | CSS selector to wait for in browser mode |
| `page_timeout` | integer | 30 | ✗ | Page timeout in seconds |
| `delay_between_requests` | number | 1.0 | ✗ | Delay between requests |
| `concurrent_requests` | integer | 5 | ✗ | Parallel crawl requests |
| `retry_attempts` | integer | 2 | ✗ | Retry attempts per page |
| `proxy_server` | string | null | ✗ | Proxy endpoint |
| `proxy_username` | string | null | ✗ | Proxy username |
| `proxy_password` | string | null | ✗ | Proxy password |
| `user_agent` | string | null | ✗ | Custom User-Agent |
| `headers` | object | null | ✗ | Custom HTTP headers |
| `output_format` | string | `markdown` | ✗ | `markdown`, `html`, or `text` |
| `only_main_content` | boolean | true | ✗ | Keep only main content blocks |
| `max_age_hours` | integer | 24 | ✗ | Re-crawl freshness window in hours |
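As an illustration, a configuration that combines several of the optional fields above might look like the following sketch; the domain, IDs, and path patterns are placeholders, not values from a real deployment:

```yaml
data_pools:
  - id: website-app
    type: webcrawler
    url: "https://example.com/docs"
    scraping_method: browser      # pages require JavaScript rendering
    wait_for_selector: "main"     # wait until the main element appears
    max_depth: 3
    max_pages: 200
    include_paths:
      - "/docs/*"
    exclude_paths:
      - "/docs/archive/*"
    delay_between_requests: 2.0   # be polite to the target site
    concurrent_requests: 2
    output_format: markdown
```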
## Synchronization

- DataVault crawls from `url` and stores extracted content per discovered page.
- Subsequent runs are incremental and respect freshness settings (`max_age_hours`).
- Use include/exclude path filters to keep the crawl scope deterministic.
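The effect of include/exclude path filters can be sketched as follows. This is an illustrative model only, not DataVault's actual implementation; in particular, the glob-style pattern matching is an assumption:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def in_scope(url, include_paths=None, exclude_paths=None):
    """Illustrative scope check: exclude patterns win first, then an
    include list (if given) must match; glob semantics are an assumption."""
    path = urlparse(url).path or "/"
    # Any matching exclude pattern removes the URL from scope.
    if exclude_paths and any(fnmatch(path, p) for p in exclude_paths):
        return False
    # With an include list, at least one pattern must match.
    if include_paths:
        return any(fnmatch(path, p) for p in include_paths)
    # No filters configured: everything under the start URL is in scope.
    return True
```

For example, with `include_paths: ["/docs/*"]` and `exclude_paths: ["/docs/archive/*"]`, `/docs/api` is crawled while `/docs/archive/old` and `/blog/post` are skipped.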
## Setup

- Start with a small scope (`max_depth`, `max_pages`)
- Add include/exclude rules for the relevant sections
- Use `browser` mode only when pages require JavaScript rendering
- Re-run sync and review the indexed output
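Conceptually, `max_depth` and `max_pages` bound a breadth-first traversal. The sketch below is a simplified model of that behavior; `fetch_links` is a hypothetical stand-in for page fetching and link extraction, and the real crawler additionally filters, throttles, and retries:

```python
from collections import deque

def crawl(start_url, fetch_links, max_depth=2, max_pages=500):
    """Simplified model of how max_depth / max_pages cap a crawl.
    `fetch_links(url)` is a hypothetical callback returning the links
    discovered on a page."""
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth from start)
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth < max_depth:  # only expand links above the depth cap
            for link in fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return visited
```

Starting with small values for both limits, as recommended above, keeps the first sync fast and makes it easy to verify the scope before widening it.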
On-prem only: this source page is relevant when you operate your own DataVault runtime and configure `data_pools` yourself.