Web Crawler

Configure website crawling as a DataVault source

In meinGPT (UI)

For most teams, setup is done directly in meinGPT without editing local config files.

  1. Open admin settings in meinGPT
  2. Go to Data Pools / Data Sources
  3. Click Add Source and choose this source type
  4. Configure credentials and scope in the UI
  5. Save and trigger the first sync

If you do not run your own DataVault runtime, this is usually all you need.

On-Prem Runtime Configuration (Advanced)

```yaml
data_pools:
  - id: website-docs
    type: webcrawler
    url: "https://example.com"
    scraping_method: basic
    max_depth: 2
    max_pages: 500
    output_format: markdown
    only_main_content: true
```

Configuration Options

| Field | Type | Default | Required | Description |
|---|---|---|---|---|
| id | string | - | βœ… | Unique identifier for the data pool |
| type | string | - | βœ… | Must be webcrawler |
| url | string | - | βœ… | Crawl start URL |
| scraping_method | string | basic | ❌ | basic or browser (JS rendering) |
| max_depth | integer | 3 | ❌ | Maximum crawl depth |
| max_pages | integer | 500 | ❌ | Maximum number of pages |
| include_paths | array | null | ❌ | URL path patterns to include |
| exclude_paths | array | null | ❌ | URL path patterns to exclude |
| wait_for_selector | string | null | ❌ | CSS selector wait condition for browser mode |
| page_timeout | integer | 30 | ❌ | Page timeout in seconds |
| delay_between_requests | number | 1.0 | ❌ | Delay between requests in seconds |
| concurrent_requests | integer | 5 | ❌ | Parallel crawl requests |
| retry_attempts | integer | 2 | ❌ | Retry attempts per page |
| proxy_server | string | null | ❌ | Proxy endpoint |
| proxy_username | string | null | ❌ | Proxy username |
| proxy_password | string | null | ❌ | Proxy password |
| user_agent | string | null | ❌ | Custom User-Agent |
| headers | object | null | ❌ | Custom headers |
| output_format | string | markdown | ❌ | markdown, html, or text |
| only_main_content | boolean | true | ❌ | Keep only main content blocks |
| max_age_hours | integer | 24 | ❌ | Re-crawl freshness window in hours |
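
As an illustration of how these options combine (all values here are placeholders, not recommendations), a browser-mode crawl scoped to one documentation section might look like this:

```yaml
data_pools:
  - id: website-docs-js
    type: webcrawler
    url: "https://example.com/docs"
    scraping_method: browser        # pages need JS rendering
    wait_for_selector: "main"       # wait until the main element is present
    include_paths:
      - "/docs/*"
    exclude_paths:
      - "/docs/archive/*"
    max_depth: 3
    max_pages: 200
    delay_between_requests: 1.5
    output_format: markdown
```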

Synchronization

  • Vault crawls from url and stores extracted content per discovered page.
  • Subsequent runs are incremental and respect freshness settings (max_age_hours).
  • Use include/exclude path filters to keep crawl scope deterministic.

Setup

  1. Start with a small scope (max_depth, max_pages)
  2. Add include/exclude rules for relevant sections
  3. Use browser mode only when pages require JavaScript rendering
  4. Re-run sync and review indexed output

On-prem only: this source page is relevant when you operate your own DataVault runtime and configure data_pools yourself.
