Overview
Overview of Data Sources, RAG, and source connectivity
Data Sources (RAG)
Data sources are the foundation for retrieval-augmented generation (RAG) in meinGPT. Content from connected sources is indexed and made available to your assistants as knowledge. You can also attach a data source directly to a chat when you only want to look something up once.
How search works
Customer knowledge bases are often large. Hundreds or thousands of gigabytes of Word, PDF, and other files are not unusual. Scanning every file for every search query would be far too slow. That is why a search index is built up front. It works similarly to Google, just for your internal documents.
The initial indexing can take hours to days, depending on the data volume. Word, PDF, and similar formats are binary, so the text has to be extracted first. This one-time effort pays off in fast search results afterwards.
Search characteristics:
- Semantic: Documents are converted into mathematical representations (called embeddings) that capture meaning, not just individual words.
- Sorted by relevance: Hits are ranked by content fit, not by frequency of a search term. A document that thematically matches the question can rank higher than one with the exact keyword.
- Number of results: By default, the ten most relevant sources are returned. The number is configurable in the settings.
- Filename search: Besides content, you can also search specifically by filename, for example "Show me file XY".
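The relevance ranking described above can be illustrated with a minimal sketch. It is not meinGPT's implementation: the three-dimensional vectors, the filenames, and the helper names `cosine_similarity` and `top_k` are made up for illustration, and cosine similarity is assumed as the similarity measure because it is a common choice for embeddings.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two vectors by angle; values near 1.0 mean similar meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=10):
    """Return the k document names whose embeddings are closest to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Toy embeddings (assumption): real embeddings have hundreds of dimensions
# and are produced by an embedding model, not written by hand.
index = {
    "vacation_policy.docx": [0.9, 0.1, 0.0],
    "travel_expenses.pdf": [0.6, 0.4, 0.1],
    "server_setup.md": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]  # stands in for the embedding of "How many days off do I get?"

results = top_k(query, index, k=2)
print(results)  # ['vacation_policy.docx', 'travel_expenses.pdf']
```

The query shares no keyword with the filenames, yet the vacation policy ranks first because its vector points in nearly the same direction as the query vector. The `k` parameter plays the role of the configurable number of results.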
Cloud data source at a glance
For most teams, the cloud-based data source is the right choice.
| Property | Value |
|---|---|
| Hosting | meinGPT Cloud (Hetzner, Germany) |
| Sync interval | Every 15 minutes |
| Search results per query | Default 10, configurable |
| Search method | Semantic (embeddings, sorted by relevance) |
| Availability | Included in the standard package |
File formats
All formats that consist primarily of text are well supported:
- Office documents: DOCX, PPTX, XLSX (with caveats, see below)
- TXT, Markdown, HTML
- Code files
Excel tables are a special case. When documents are split into searchable chunks (a process called chunking), the table context is lost. A single data row without its column headers often no longer makes sense. For calculations, analyses, and visualizations from Excel files, use Excel mode instead. It processes the original file directly.
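A small sketch shows why chunking hurts tables. The fixed-size line chunker, the column names, and the sample rows are assumptions for illustration only; the actual chunking strategy used by the platform is not described here.

```python
# Hypothetical table, flattened to text lines the way a text extractor might do it
header = ["Product", "Region", "Revenue_EUR"]
rows = [
    ["Widget A", "North", "12000"],
    ["Widget B", "South", "8500"],
    ["Widget C", "North", "4300"],
    ["Widget D", "West", "9900"],
]

def chunk_lines(lines, size):
    """Naive chunker: split a list of lines into groups of `size` lines."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]

lines = [", ".join(header)] + [", ".join(row) for row in rows]
chunks = chunk_lines(lines, size=2)

for number, chunk in enumerate(chunks, start=1):
    print(f"Chunk {number}: {chunk}")
```

Only the first chunk contains the header line. A later chunk such as `['Widget B, South, 8500', 'Widget C, North, 4300']` no longer says what the number 8500 means, which is the context loss described above.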
OneNote workaround. OneNote files are currently not indexed directly because the format is proprietary. Workaround: export OneNote content regularly via Make or n8n, as PDF or text. The exported files can then be connected like any other source.
Access control
Data sources can be restricted to specific teams. You create teams in the admin interface and assign them to specific data sources. This lets you control which user groups see which data.
Details on creating and managing teams: Team management.
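Conceptually, team-based restriction is a filter over data sources. The sketch below is a hypothetical model, not the platform's data model: the `DataSource` class, the `allowed_teams` field, and the `visible_sources` function are invented names, and every data source is assumed to have at least one team assigned.

```python
from dataclasses import dataclass, field

@dataclass
class DataSource:
    name: str
    # Teams that may see this data source (hypothetical field)
    allowed_teams: set = field(default_factory=set)

def visible_sources(user_teams, sources):
    """Return names of data sources that share at least one team with the user."""
    user_teams = set(user_teams)
    return [s.name for s in sources if s.allowed_teams & user_teams]

sources = [
    DataSource("HR policies", {"HR"}),
    DataSource("Product specifications", {"Engineering", "Sales"}),
    DataSource("Sales material", {"Sales"}),
]

print(visible_sources({"Sales"}, sources))
```

A user in the Sales team sees the two data sources assigned to Sales, while "HR policies" stays hidden.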
How meinGPT works with files (3-stage model)
Not every request needs the same processing depth. meinGPT decides per request how deep it has to go. There are three stages:
| Stage | What happens | Sufficient for |
|---|---|---|
| 1. Search | The platform searches all configured sources and returns snippets and filenames | Simple questions like "Is there a document about topic X?" |
| 2. Full-text retrieval | The model loads the complete content of individual files that look relevant after stage 1 | Content questions about individual, not overly large documents |
| 3. Code Sandbox | The original file is opened in an isolated environment (the sandbox) and processed with Python | Calculations, analyses, charts from large or structured files (e.g. Excel with many rows) |
You do not have to configure anything manually. More about the sandbox: Code Sandbox.
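The escalation logic in the table can be paraphrased as a small decision function. This is only a reading aid under stated assumptions: in meinGPT the model makes this decision per request, and the two boolean inputs and the names `Stage` and `required_stage` are hypothetical simplifications of the "Sufficient for" column.

```python
from enum import Enum

class Stage(Enum):
    SEARCH = 1        # snippets and filenames only
    FULL_TEXT = 2     # complete content of individual files
    CODE_SANDBOX = 3  # original file processed with Python

def required_stage(needs_file_content: bool, needs_computation: bool) -> Stage:
    """Pick the shallowest stage that can answer the request (simplified)."""
    if needs_computation:
        return Stage.CODE_SANDBOX
    if needs_file_content:
        return Stage.FULL_TEXT
    return Stage.SEARCH

# "Is there a document about topic X?"
print(required_stage(needs_file_content=False, needs_computation=False))
# "Sum the revenue column in this Excel file"
print(required_stage(needs_file_content=True, needs_computation=True))
```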
Note for on-premise setups: Stage 3 (sandbox) temporarily uploads original files into the meinGPT Cloud, because the sandbox environments run there. The files are deleted immediately after processing. For privacy-sensitive setups, communicate this transparently to your stakeholders.
SharePoint Connector vs. data source: which one when?
If you want to use SharePoint data in meinGPT, you have two options: the native Microsoft 365 Connector, or a data source with SharePoint as a source. Both have their strengths.
| Criterion | Microsoft 365 Connector | Data source with SharePoint source |
|---|---|---|
| Search method | Direct access in real time | Pre-built index, sync every 15 minutes |
| Permissions | Respects SharePoint permissions automatically (at user level via OAuth) | Admin configures manually. SharePoint permissions do not apply automatically |
| Authentication | Each user authenticates individually | Centrally configured |
| Scaling | Good for targeted research in single sites or folders | Scales to large data volumes, multiple sources combinable |
| Combining sources | Only SharePoint and OneDrive | Multiple sources in one data source (SharePoint, local files, Drive, …) |
Rule of thumb:
- Microsoft 365 Connector for most SharePoint use cases. Especially when each user should only see what they are allowed to see in SharePoint itself.
- Data source when you want to index large document collections centrally, mix multiple sources, or build a central knowledge base without individual permissions.
Configuration & recommendations
Narrow the data scope
There is no hard data limit. But the more data a data source covers, the more irrelevant hits compete with the relevant ones. Recommendation:
- Connect a targeted set of 500 to 1,000 relevant files per data source rather than the entire SharePoint
- Prefer multiple specialized data sources over one huge one, for example "HR policies", "Product specifications", "Sales material"
- The more focused a data source, the better the results
Write the short description carefully
The short description of a data source is not just documentation. The model uses it to decide whether a data source is relevant for a given request. With a poor description, a data source may not be searched even when it should be.
Good: "Contains all internal HR policies, process descriptions, and onboarding documents."
Weaker: "HR documents."
Reference the tool in the system prompt
In assistant instructions, it pays to reference data sources explicitly. For example: "Start every conversation by retrieving relevant information from the connected data source." This makes data source usage more reliable.
Advanced: Customer-Managed Data Vault (On-Premise)
If you want to run your own on-premise knowledge infrastructure, for example for regulated industries or special security requirements, you can deploy your own Data Vault. Data does not leave your network in this setup. The exception is the temporary sandbox processing, see above.
- Choose network model: On-Premise Connections
- Vault operations and configuration: /integrations/vault
Sources
All supported sources are listed here:
Typical sources:
- SharePoint and OneDrive
- Google Drive
- Confluence
- Amazon S3
- SMB and WebDAV
- Local filesystems
Custom Data Preparation Pipelines
For the dedicated pattern with S3 handover for third-party systems, see:
Related pages
- Code Sandbox: stage 3 of file processing
- Excel mode: structured table analysis
- Microsoft 365 Connector: direct SharePoint, Outlook, and Teams access
- Team management: access control per data source
- Data Sources: full list of source types