Architecture

Components

A DataVault deployment consists of several components that work together to capture, process and store data. The following overview shows the main components and their functions.

The DataVault is the core of a deployment. It reads data from various sources, processes it, stores it in the vector database, and answers search queries from meinGPT. To do this, an AI model called the embedding model converts the data into vectors, which are then stored in the vector database. This enables efficient search for semantically similar text passages across thousands of documents.
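
The flow described above (embed, store, search) can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `toy_embed` and `ToyVectorStore` are hypothetical names, the "embedding" is a hashed bag of words rather than a real AI model, and the store is a plain in-memory list rather than a vector database. It does not reflect how the DataVault is actually implemented.

```python
import hashlib
import math
from dataclasses import dataclass, field

DIM = 1024  # toy vector size, chosen only to make hash collisions unlikely


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        # Stable hash so the same word always lands in the same slot.
        slot = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[slot] += 1.0
    # Normalize to unit length (empty text keeps the zero vector).
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


@dataclass
class ToyVectorStore:
    """Hypothetical in-memory stand-in for the vector database."""
    rows: list[tuple[list[float], str]] = field(default_factory=list)

    def add(self, text: str) -> None:
        # Ingest step: convert the passage to a vector and keep both.
        self.rows.append((toy_embed(text), text))

    def search(self, query: str, limit: int = 1) -> list[str]:
        q = toy_embed(query)
        # Vectors are unit length, so the dot product equals cosine similarity.
        ranked = sorted(
            self.rows,
            key=lambda row: sum(a * b for a, b in zip(q, row[0])),
            reverse=True,
        )
        return [text for _, text in ranked[:limit]]


store = ToyVectorStore()
for passage in [
    "Invoices are archived for ten years",
    "The office closes at six in the evening",
    "Vacation requests need manager approval",
]:
    store.add(passage)

top_hit = store.search("how long are invoices archived")[0]
print(top_hit)
```

Because the toy embedding only counts shared words, it finds the invoice passage through word overlap. A real embedding model would also match passages that express the same meaning in different words.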

The DataVault is provided as a Docker image via Docker Hub at https://hub.docker.com/r/meingpt/datavault and can be operated on any system that supports Docker.

Vector Database

The vector database stores the vectors created by the embedding model and enables efficient search for similar vectors. The meinGPT DataVault uses the Weaviate vector database, which was specifically developed for this type of application.

Weaviate offers:

Efficient similarity search across millions of vectors
Horizontal scalability for large amounts of data
Persistent storage of vectors and metadata
Easy integration via REST API and gRPC
Support for various similarity metrics (Cosine, Dot Product, Euclidean)
Filtering of search results based on metadata
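
To make the similarity metrics and metadata filtering more concrete, the following sketch computes all three metrics by hand and applies a metadata filter before ranking. It uses plain Python and deliberately does not call the Weaviate API; the objects, the `source` and `title` fields, and the `search` helper are hypothetical and exist only for this illustration.

```python
import math


def dot(a, b):
    """Dot product: grows with both alignment and vector length."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity: compares direction only, ignoring length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    """Euclidean distance: smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical stored objects: a vector plus metadata.
objects = [
    {"vector": [1.0, 0.0], "meta": {"source": "wiki", "title": "Onboarding"}},
    {"vector": [0.6, 0.8], "meta": {"source": "wiki", "title": "Travel policy"}},
    {"vector": [2.0, 0.0], "meta": {"source": "mail", "title": "Newsletter"}},
]

query = [1.0, 0.0]


def search(query, source, limit=1):
    """Keep only objects whose metadata matches, then rank by cosine similarity."""
    candidates = [o for o in objects if o["meta"]["source"] == source]
    ranked = sorted(
        candidates,
        key=lambda o: cosine_similarity(query, o["vector"]),
        reverse=True,
    )
    return [o["meta"]["title"] for o in ranked[:limit]]


for obj in objects:
    v = obj["vector"]
    print(
        obj["meta"]["title"],
        round(cosine_similarity(query, v), 3),
        round(dot(query, v), 3),
        round(euclidean_distance(query, v), 3),
    )

print(search(query, source="wiki"))
```

The three metrics can rank the same data differently: cosine similarity treats `[1.0, 0.0]` and `[2.0, 0.0]` as equally good matches for the query, the dot product prefers the longer vector, and Euclidean distance prefers the vector identical to the query.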

The vector database is provided as a Docker container and can be operated together with the DataVault on the same system.

Embedding Model

The embedding model is an AI model that converts text into mathematical vectors (embeddings). These vectors represent the meaning of the text in a high-dimensional space, so that semantically similar texts also generate similar vectors.

The choice of embedding model directly affects both the quality of search results and the resources required:

The better the model, the better it understands the meaning of the texts
The more dimensions the vectors have, the more accurately nuances of meaning can be represented
The larger the model, the more computing power is required
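
The number of dimensions also affects how much space the stored vectors take up. The sketch below is a rough back-of-the-envelope estimate, assuming each dimension is stored as an uncompressed 32-bit float and ignoring index and metadata overhead. The dimension counts and the figure of one million vectors are example values, not properties of any particular model or deployment, and `raw_vector_bytes` is a hypothetical helper.

```python
BYTES_PER_FLOAT32 = 4  # assumption: uncompressed 32-bit floats


def raw_vector_bytes(num_vectors: int, dimensions: int) -> int:
    """Raw size of the vectors alone, without index or metadata overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32


# Example dimension counts, chosen only for illustration.
for dims in (384, 1536, 3072):
    mib = raw_vector_bytes(1_000_000, dims) / 2**20
    print(f"{dims:>5} dimensions: {mib:,.0f} MiB for 1,000,000 vectors")
```

Under these assumptions the raw size grows linearly with the number of dimensions, so doubling the dimensions doubles the space needed for the same number of text passages.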

The DataVault supports various embedding models:

Cloud-based models from OpenAI, Azure and Nebius
Local models from HuggingFace