
Web Crawler

Configure website crawling as a DataVault source

In meinGPT (UI)

For most teams, setup happens directly in meinGPT, without any local config files.

  1. Open the admin settings in meinGPT
  2. Go to Data Pools / Data Sources
  3. Click Add Source and select this source type
  4. Enter credentials and scope in the UI
  5. Save and start the first sync

If you do not operate your own DataVault, this is usually sufficient.

On-Prem Runtime Configuration (Advanced)

data_pools:
  - id: website-docs
    type: webcrawler
    url: "https://example.com"
    scraping_method: basic
    max_depth: 2
    max_pages: 500
    output_format: markdown
    only_main_content: true

Configuration Options

| Field | Type | Default | Required | Description |
|---|---|---|---|---|
| id | string | - | Yes | Unique identifier for the data pool |
| type | string | - | Yes | Must be `webcrawler` |
| url | string | - | Yes | Crawl start URL |
| scraping_method | string | basic | No | `basic` or `browser` (JS rendering) |
| max_depth | integer | 3 | No | Maximum crawl depth |
| max_pages | integer | 500 | No | Maximum number of pages |
| include_paths | array | null | No | URL path patterns to include |
| exclude_paths | array | null | No | URL path patterns to exclude |
| wait_for_selector | string | null | No | CSS selector wait condition for browser mode |
| page_timeout | integer | 30 | No | Page timeout in seconds |
| delay_between_requests | number | 1.0 | No | Delay between requests |
| concurrent_requests | integer | 5 | No | Parallel crawl requests |
| retry_attempts | integer | 2 | No | Retry attempts |
| proxy_server | string | null | No | Proxy endpoint |
| proxy_username | string | null | No | Proxy username |
| proxy_password | string | null | No | Proxy password |
| user_agent | string | null | No | Custom User-Agent |
| headers | object | null | No | Custom headers |
| output_format | string | markdown | No | `markdown`, `html`, or `text` |
| only_main_content | boolean | true | No | Keep only main content blocks |
| max_age_hours | integer | 24 | No | Re-crawl freshness window in hours |
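Taken together, a fuller `data_pools` entry might look like the sketch below. Only the field names come from the table above; the `user_agent` and `headers` values are purely illustrative:

```yaml
data_pools:
  - id: website-docs
    type: webcrawler
    url: "https://example.com"
    scraping_method: basic
    max_depth: 3
    max_pages: 500
    delay_between_requests: 1.0
    concurrent_requests: 5
    retry_attempts: 2
    user_agent: "my-crawler/1.0"    # illustrative value, not a required string
    headers:
      Accept-Language: "en"         # illustrative custom header
    output_format: markdown
    only_main_content: true
```

Unlisted fields keep their defaults, so in practice you only set the options you want to change.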

Synchronization

  • The DataVault runtime crawls starting from url and stores the extracted content of each discovered page.
  • Subsequent runs are incremental and respect freshness settings (max_age_hours).
  • Use include/exclude path filters to keep crawl scope deterministic.
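A deterministic scope combining path filters with an explicit freshness window could be sketched like this; the paths are placeholders, and the exact pattern syntax depends on the runtime:

```yaml
data_pools:
  - id: website-docs
    type: webcrawler
    url: "https://example.com"
    include_paths:
      - "/docs/*"          # placeholder: only crawl the docs section
    exclude_paths:
      - "/docs/archive/*"  # placeholder: skip archived pages
    max_age_hours: 24      # pages fresher than 24h are not re-crawled
```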

Setup

  1. Start with a small scope (max_depth, max_pages)
  2. Add include/exclude rules for relevant sections
  3. Use browser mode only when pages require JavaScript rendering
  4. Re-run sync and review indexed output
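Applied to the steps above, a minimal starting scope that is later switched to browser mode (only once pages turn out to require JavaScript rendering) might look like this sketch; the CSS selector is a placeholder:

```yaml
data_pools:
  - id: website-docs
    type: webcrawler
    url: "https://example.com"
    max_depth: 1               # step 1: start with a small scope
    max_pages: 50
    scraping_method: browser   # step 3: only when JS rendering is needed
    wait_for_selector: "main"  # placeholder: wait for this element to appear
    page_timeout: 30
```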

On-prem only: this source page is relevant when you operate your own DataVault runtime and configure data_pools yourself.
