
Overview

Overview of Data Sources, RAG, and source connectivity

Data Sources (RAG)

Data sources are the foundation for retrieval-augmented generation (RAG) in meinGPT. Content from connected sources is indexed and made available to your assistants as knowledge. You can also attach a data source directly to a chat when you only want to look something up once.

How search works

Customer knowledge bases are often large: hundreds or even thousands of gigabytes of Word, PDF, and other files are not unusual. Scanning every file for every search query would be far too slow, so a search index is built up front. It works much like a web search engine such as Google, but for your internal documents.

The initial indexing can take hours to days, depending on the data volume. Word, PDF, and similar formats are binary, so the text has to be extracted first. This one-time effort pays off in fast search results afterwards.

Search characteristics:

  • Semantic: Documents are converted into mathematical representations (called embeddings) that capture meaning, not just individual words.
  • Sorted by relevance: Hits are ranked by content fit, not by frequency of a search term. A document that thematically matches the question can rank higher than one with the exact keyword.
  • Number of results: By default, the ten most relevant sources are returned. The number is configurable in the settings.
  • Filename search: Besides content, you can also search specifically by filename, for example "Show me file XY".
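To make the idea of relevance ranking concrete, here is a minimal sketch of how embedding-based search can order documents. It is an illustration under assumptions, not meinGPT's implementation: the three-dimensional vectors, the filenames, and the helper names `cosine_similarity` and `top_k` are all made up for this example, and cosine similarity is only one common way to compare embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=10):
    """Return the k filenames whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Hypothetical toy embeddings. Real embeddings come from an embedding model
# and typically have hundreds of dimensions.
index = {
    "vacation_policy.pdf": [0.9, 0.1, 0.0],
    "travel_expenses.docx": [0.6, 0.5, 0.1],
    "server_runbook.md": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]  # stands in for a question such as "How many days off do I get?"

results = top_k(query, index, k=2)
print(results)
```

The query shares no keyword with any filename, yet the vacation policy ranks first because its vector points in nearly the same direction. The default of `k=10` mirrors the default number of results mentioned above.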

Cloud data source at a glance

For most teams, the cloud-based data source is the right choice.

Property | Value
---------|------
Hosting | meinGPT Cloud (Hetzner, Germany)
Sync interval | Every 15 minutes
Search results per query | Default 10, configurable
Search method | Semantic (embeddings, sorted by relevance)
Availability | Included in the standard package

File formats

All formats that consist primarily of text are well supported:

  • Office documents: DOCX, PPTX, XLSX (XLSX with caveats, see below)
  • PDF
  • TXT, Markdown, HTML
  • Code files

Excel tables are a special case. When splitting documents into searchable chunks (a process called chunking), the table context gets lost. A single data row without its column headers often no longer makes sense. For calculations, analyses, and visualizations from Excel files, use Excel mode instead. It processes the original file directly.
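The following sketch shows why a lone table row is hard to interpret once the header is gone. It is a simplified illustration, not a description of how meinGPT chunks files: the sample table and both helper functions are hypothetical. The second function exists only for contrast. For real calculations, Excel mode remains the recommended route.

```python
header = ["Region", "Quarter", "Revenue_EUR"]
rows = [
    ["North", "Q1", "120000"],
    ["South", "Q1", "95000"],
]

def chunk_rows_naive(rows):
    """One chunk per row with the header dropped: values lose their meaning."""
    return [" ".join(row) for row in rows]

def chunk_rows_with_header(header, rows):
    """One chunk per row, each value paired with its column name (for contrast only)."""
    return [
        ", ".join(f"{col}: {val}" for col, val in zip(header, row))
        for row in rows
    ]

naive = chunk_rows_naive(rows)
labeled = chunk_rows_with_header(header, rows)

print(naive[0])    # North Q1 120000
print(labeled[0])  # Region: North, Quarter: Q1, Revenue_EUR: 120000
```

In the naive chunk, nothing says whether 120000 is revenue, headcount, or a cost figure, which is exactly the lost table context described above.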

OneNote workaround. OneNote files are currently not indexed directly because the format is proprietary. Workaround: export OneNote content regularly via Make or n8n, as PDF or text. The exported files can then be connected like any other source.

Access control

Data sources can be restricted to specific teams. You create teams in the admin interface and assign them to specific data sources. This lets you control which user groups see which data.

Details on creating and managing teams: Team management.
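Conceptually, team-based restriction is a membership check between a user's teams and the teams assigned to a data source. The sketch below illustrates that idea only. The `DataSource` class, the `visible_sources` function, and the rule that a source without assigned teams is visible to everyone are assumptions made for this example, not documented meinGPT behavior.

```python
from dataclasses import dataclass, field

@dataclass
class DataSource:
    name: str
    # Assumption for this sketch: an empty set means "no team restriction".
    allowed_teams: set = field(default_factory=set)

def visible_sources(user_teams, sources):
    """Return the names of all data sources the user may see."""
    user_teams = set(user_teams)
    return [
        s.name
        for s in sources
        if not s.allowed_teams or s.allowed_teams & user_teams
    ]

sources = [
    DataSource("HR policies", {"HR"}),
    DataSource("Product specifications", {"Engineering", "Sales"}),
    DataSource("Company handbook"),  # unrestricted in this sketch
]

sales_view = visible_sources({"Sales"}, sources)
print(sales_view)
```

A user in the Sales team sees the product specifications and the unrestricted handbook, but not the HR policies.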

How meinGPT works with files (3-stage model)

Not every request needs the same processing depth. meinGPT decides per request how deep it has to go. There are three stages:

Stage | What happens | Sufficient for
------|--------------|---------------
1. Search | The platform searches all configured sources and returns snippets and filenames | Simple questions like "Is there a document about topic X?"
2. Full-text retrieval | The model loads the complete content of individual files that look relevant after stage 1 | Content questions about individual, not overly large documents
3. Code Sandbox | The original file is opened in an isolated environment (the sandbox) and processed with Python | Calculations, analyses, charts from large or structured files (e.g. Excel with many rows)

You do not have to configure anything manually. More about the sandbox: Code Sandbox.
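The escalation idea behind the three stages can be sketched as "pick the shallowest stage that can answer the request". The actual decision logic is made by the model and is not documented here, so the two boolean flags and the `choose_stage` function below are purely hypothetical stand-ins.

```python
from enum import Enum

class Stage(Enum):
    SEARCH = 1
    FULL_TEXT = 2
    CODE_SANDBOX = 3

def choose_stage(needs_file_content, needs_computation):
    """Pick the shallowest stage that can answer the request (illustrative only)."""
    if needs_computation:
        # Calculations, analyses, or charts require the original file.
        return Stage.CODE_SANDBOX
    if needs_file_content:
        # Content questions require the full text of individual files.
        return Stage.FULL_TEXT
    # Snippets and filenames are enough for "does a document exist?" questions.
    return Stage.SEARCH

print(choose_stage(needs_file_content=False, needs_computation=False))  # Stage.SEARCH
print(choose_stage(needs_file_content=True, needs_computation=False))   # Stage.FULL_TEXT
print(choose_stage(needs_file_content=True, needs_computation=True))    # Stage.CODE_SANDBOX
```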

Note for on-premise setups: Stage 3 (sandbox) temporarily uploads original files into the meinGPT Cloud, because the sandbox environments run there. The files are deleted immediately after processing. For privacy-sensitive setups, communicate this transparently to your stakeholders.

SharePoint Connector vs. data source: which one when?

If you want to use SharePoint data in meinGPT, you have two options: the native Microsoft 365 Connector, or a data source with SharePoint as a source. Both have their strengths.

Criterion | Microsoft 365 Connector | Data source with SharePoint source
----------|-------------------------|-----------------------------------
Search method | Direct access in real time | Pre-built index, sync every 15 minutes
Permissions | Respects SharePoint permissions automatically (at user level via OAuth) | Admin configures manually. SharePoint permissions do not apply automatically
Authentication | Each user authenticates individually | Centrally configured
Scaling | Good for targeted research in single sites or folders | Scales to large data volumes, multiple sources combinable
Combining sources | Only SharePoint and OneDrive | Multiple sources in one data source (SharePoint, local files, Drive, …)

Rule of thumb:

  • Use the Microsoft 365 Connector for most SharePoint use cases, especially when each user should only see what they are allowed to see in SharePoint itself.
  • Use a data source when you want to index large document collections centrally, mix multiple sources, or build a central knowledge base without individual permissions.

Configuration & recommendations

Narrow the data scope

There is no hard data limit. But the more data a data source covers, the more irrelevant hits compete with the relevant ones. Recommendation:

  • Connect a targeted set of 500 to 1,000 relevant files per data source rather than the entire SharePoint
  • Prefer multiple specialized data sources over one huge one, for example "HR policies", "Product specifications", "Sales material"
  • The more focused a data source, the better the results

Write the short description carefully

The short description of a data source is not just documentation. The model uses it to decide whether a data source is relevant for a given request. A poor description leads to data sources not being searched even when they should be.

Good: "Contains all internal HR policies, process descriptions, and onboarding documents."

Poor: "HR documents."
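A rough way to see why the longer description helps is to count how many words of a request it shares with each description. This is a deliberately crude stand-in: the model judges relevance by meaning, not by word overlap, and the `words` and `overlap` helpers as well as the sample request are invented for this illustration.

```python
import re

def words(text):
    """Lowercased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap(request, description):
    """Number of distinct words the request and the description share."""
    return len(words(request) & words(description))

good = "Contains all internal HR policies, process descriptions, and onboarding documents."
poor = "HR documents."
request = "Where is the onboarding process for new HR staff described?"

print(overlap(request, good))  # 3 ("hr", "process", "onboarding")
print(overlap(request, poor))  # 1 ("hr")
```

The richer description simply offers more points of contact with what users actually ask.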

Reference the tool in the system prompt

In assistant instructions, it pays to reference data sources explicitly. For example: "Start every conversation by retrieving relevant information from the connected data source." This makes data source usage more reliable.

Advanced: Customer-Managed Data Vault (On-Premise)

If you want to run your own on-premise knowledge infrastructure, for example in regulated industries or to meet special security requirements, you can deploy your own Data Vault. In this setup, data does not leave your network. The one exception is temporary sandbox processing (stage 3 of the model described above).

Sources

All supported sources are listed here:

Typical sources:

  • SharePoint and OneDrive
  • Google Drive
  • Confluence
  • Amazon S3
  • SMB and WebDAV
  • Local filesystems

Custom Data Preparation Pipelines

For the dedicated pattern with S3 handover for third-party systems, see:
