Data Sources (RAG)

Data sources are the foundation for retrieval-augmented generation (RAG) in meinGPT. Content from connected sources is indexed and made available to your assistants as knowledge. You can also attach a data source directly to a chat when you only want to look something up once.

How search works

Customer knowledge bases are often large: hundreds or thousands of gigabytes of Word, PDF, and other files are not unusual. Running every search query against every file would be far too slow, so a search index is built up front. It works similarly to Google, just for your internal documents.

The initial indexing can take hours to days, depending on the data volume. Word, PDF, and similar formats are binary, so the text has to be extracted first. This one-time effort pays off in fast search results afterwards.

Search characteristics:

Semantic: Documents are converted into mathematical representations (called embeddings) that capture meaning, not just individual words.
Sorted by relevance: Hits are ranked by content fit, not by frequency of a search term. A document that thematically matches the question can rank higher than one with the exact keyword.
Number of results: By default, the ten most relevant sources are returned. The number is configurable in the settings.
Filename search: Besides content, you can also search specifically by filename, for example "Show me file XY".
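To make the idea of semantic, relevance-sorted search more concrete, here is a minimal sketch of ranking by embedding similarity. It is not meinGPT's implementation: the three-dimensional vectors, the filenames, and the functions cosine_similarity and top_k are made up for illustration, and cosine similarity is assumed as the relevance measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=10):
    """Return the k filenames whose embeddings are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Hypothetical toy embeddings; real embeddings have hundreds of dimensions.
index = {
    "vacation_policy.pdf": [0.9, 0.1, 0.0],
    "travel_expenses.docx": [0.6, 0.6, 0.1],
    "server_setup.md": [0.0, 0.1, 0.9],
}

# Stands in for the embedding of a question such as "How many days off do I get?"
query = [1.0, 0.2, 0.0]

print(top_k(query, index, k=2))
# ['vacation_policy.pdf', 'travel_expenses.docx']
```

The query never has to contain the word "vacation": the ranking depends only on how close the vectors are, which is why a thematically matching document can outrank one that contains the exact keyword.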

Cloud data source at a glance

For most teams, the cloud-based data source is the right choice.

Property	Value
Hosting	meinGPT Cloud (Hetzner, Germany)
Sync interval	Every 15 minutes
Search results per query	Default 10, configurable
Search method	Semantic (embeddings, sorted by relevance)
Availability	Included in the standard package

File formats

All formats that consist primarily of text are well supported:

Office documents: DOCX, PPTX, XLSX (with caveats, see below)
PDF
TXT, Markdown, HTML
Code files

Excel tables are a special case. When splitting documents into searchable chunks (a process called chunking), the table context gets lost. A single data row without its column headers often no longer makes sense. For calculations, analyses, and visualizations from Excel files, use Excel mode instead. It processes the original file directly.
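The following sketch illustrates why chunking is a problem for tables. It assumes a naive fixed-size chunking strategy. The actual strategy meinGPT uses is not described here, and the table data and the function chunk_rows are made up for illustration.

```python
# Hypothetical table as it might be read from an Excel sheet.
rows = [
    ["Region", "Quarter", "Revenue (EUR)"],  # header row
    ["North", "Q1", "120000"],
    ["South", "Q1", "95000"],
    ["North", "Q2", "134000"],
]

def chunk_rows(rows, rows_per_chunk):
    """Naive chunking: split the table into fixed-size groups of rows."""
    return [rows[i:i + rows_per_chunk] for i in range(0, len(rows), rows_per_chunk)]

chunks = chunk_rows(rows, rows_per_chunk=2)
for chunk in chunks:
    print([" | ".join(row) for row in chunk])
```

Only the first chunk contains the header row. The second chunk holds values such as "South | Q1 | 95000" with no indication that the last column is revenue in euros, which is the loss of context described above.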

OneNote workaround. OneNote files are currently not indexed directly because the format is proprietary. Workaround: export OneNote content regularly via Make or n8n, as PDF or text. The exported files can then be connected like any other source.

Access control

Data sources can be restricted to specific teams. You create teams in the admin interface and assign them to specific data sources. This lets you control which user groups see which data.
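Conceptually, team-based access works like a filter over the data sources. The sketch below is an illustration only: the team names, the mapping data_source_teams, and the function visible_sources are hypothetical, and the real assignment is made in the admin interface, not in code.

```python
# Hypothetical assignment of teams to data sources, as an admin might configure it.
data_source_teams = {
    "HR policies": {"hr"},
    "Product specifications": {"engineering", "sales"},
    "Sales material": {"sales"},
}

def visible_sources(user_teams, assignments):
    """Return the data sources assigned to at least one of the user's teams."""
    user_teams = set(user_teams)
    return sorted(name for name, teams in assignments.items() if teams & user_teams)

print(visible_sources(["sales"], data_source_teams))
# ['Product specifications', 'Sales material']
```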

Details on creating and managing teams: Team management.

How meinGPT works with files (3-stage model)

Not every request needs the same processing depth. meinGPT decides per request how deep it has to go. There are three stages:

Stage	What happens	Sufficient for
1. Search	The platform searches all configured sources and returns snippets and filenames	Simple questions like "Is there a document about topic X?"
2. Full-text retrieval	The model loads the complete content of individual files that look relevant after stage 1	Content questions about individual, not overly large documents
3. Code Sandbox	The original file is opened in an isolated environment (the sandbox) and processed with Python	Calculations, analyses, charts from large or structured files (e.g. Excel with many rows)

You do not have to configure anything manually. More about the sandbox: Code Sandbox.
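As a reading aid, the "Sufficient for" column of the table can be paraphrased as a small decision function. This is a rough sketch, not the platform's logic: meinGPT makes this decision per request on its own, and the names Stage and choose_stage as well as the two boolean inputs are assumptions made for illustration. File size also plays a role according to the table, but no thresholds are stated, so size is left out here.

```python
from enum import Enum

class Stage(Enum):
    SEARCH = 1
    FULL_TEXT = 2
    CODE_SANDBOX = 3

def choose_stage(needs_file_content, needs_computation):
    """Pick the shallowest stage that is sufficient for the request (illustrative only)."""
    if needs_computation:
        # Calculations, analyses, or charts need the original file in the sandbox.
        return Stage.CODE_SANDBOX
    if needs_file_content:
        # Content questions need the complete text of the relevant files.
        return Stage.FULL_TEXT
    # Snippets and filenames from the search are enough.
    return Stage.SEARCH

print(choose_stage(needs_file_content=False, needs_computation=False))  # Stage.SEARCH
print(choose_stage(needs_file_content=True, needs_computation=False))   # Stage.FULL_TEXT
print(choose_stage(needs_file_content=True, needs_computation=True))    # Stage.CODE_SANDBOX
```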

Note for on-premise setups: Stage 3 (sandbox) temporarily uploads original files into the meinGPT Cloud, because the sandbox environments run there. The files are deleted immediately after processing. For privacy-sensitive setups, communicate this transparently to your stakeholders.

SharePoint Connector vs. data source: which one when?

If you want to use SharePoint data in meinGPT, you have two options: the native Microsoft 365 Connector, or a data source with SharePoint as a source. Both have their strengths.

Criterion	Microsoft 365 Connector	Data source with SharePoint source
Search method	Direct access in real time	Pre-built index, sync every 15 minutes
Permissions	Respects SharePoint permissions automatically (at user level via OAuth)	Admin configures manually. SharePoint permissions do not apply automatically
Authentication	Each user authenticates individually	Centrally configured
Scaling	Good for targeted research in single sites or folders	Scales to large data volumes; multiple sources can be combined
Combining sources	Only SharePoint and OneDrive	Multiple sources in one data source (SharePoint, local files, Drive, …)

Rule of thumb:

Use the Microsoft 365 Connector for most SharePoint use cases, especially when each user should only see what they are allowed to see in SharePoint itself.
Use a data source when you want to index large document collections centrally, mix multiple sources, or build a central knowledge base without individual permissions.

Configuration & recommendations

Narrow the data scope

There is no hard data limit, but the more data a data source covers, the more irrelevant hits compete with the relevant ones. Recommendation:

Connect a targeted set of 500 to 1,000 relevant files per data source rather than the entire SharePoint
Prefer multiple specialized data sources over one huge one, for example "HR policies", "Product specifications", "Sales material"
The more focused a data source, the better the results

Write the short description carefully

The short description of a data source is not just documentation. The model uses it to decide whether a data source is relevant for a given request. A poor description leads to data sources not being searched even when they should be.

Good: "Contains all internal HR policies, process descriptions, and onboarding documents."

Weaker: "HR documents."

Reference the tool in the system prompt

In assistant instructions, it pays to reference data sources explicitly. For example: "Start every conversation by retrieving relevant information from the connected data source." This makes data source usage more reliable.

Advanced: Customer-Managed Data Vault (On-Premise)

If you want to run your own on-premise knowledge infrastructure, for example for regulated industries or special security requirements, you can deploy your own Data Vault. In this setup, data does not leave your network, except for the temporary sandbox processing described in the note for on-premise setups above.

Choosing a network model: On-Premise Connections
Vault operations and configuration: /integrations/vault

Sources

All supported sources are listed here:

Data Sources

Typical sources:

SharePoint and OneDrive
Google Drive
Confluence
Amazon S3
SMB and WebDAV
Local filesystems

Custom Data Preparation Pipelines

For the dedicated pattern with S3 handover for third-party systems, see:

Custom Data Preparation Pipelines

Related pages

Code Sandbox: stage 3 of file processing
Excel mode: structured table analysis
Microsoft 365 Connector: direct SharePoint, Outlook, and Teams access
Team management: access control per data source
Data Sources: full list of source types
