Architecture
Overview of all components of a DataVault deployment
Components
A DataVault deployment consists of several components that work together to capture, process and store data. The following overview shows the main components and their functions.
DataVault
The DataVault is the heart of a DataVault deployment. Its main tasks are to read data from various sources, process it, store it in the vector database, and answer search queries from meinGPT. To do this, it converts the data into vectors using a special AI model, the embedding model, and stores those vectors in the vector database. This enables efficient search for semantically similar text passages across thousands of documents.
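The read, embed, store and search flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the DataVault implementation: `embed` is a toy hashed bag-of-words function standing in for a real embedding model, and `InMemoryVectorStore` is a hypothetical stand-in for the vector database.

```python
import math
import re
import zlib


def embed(text, dims=4096):
    """Toy stand-in for an embedding model (assumption, not a real model).

    Each word is hashed into one of `dims` buckets and the resulting
    count vector is normalised to unit length.
    """
    vec = [0.0] * dims
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(word.encode("utf-8")) % dims] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


class InMemoryVectorStore:
    """Hypothetical in-memory replacement for the vector database."""

    def __init__(self):
        self._items = []  # list of (vector, passage) pairs

    def add(self, passage):
        # Ingestion: convert the passage to a vector and keep both.
        self._items.append((embed(passage), passage))

    def search(self, query, top_k=3):
        # Query: embed the question, then rank passages by dot product.
        # Vectors are unit length, so this equals cosine similarity.
        q = embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), passage)
            for vec, passage in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


store = InMemoryVectorStore()
for passage in [
    "Invoices must be approved by the finance team within five days.",
    "The office kitchen is cleaned every Friday afternoon.",
    "Travel expenses are reimbursed after the finance team approves them.",
]:
    store.add(passage)

results = store.search("How are invoices approved by finance?", top_k=1)
print(results[0][1])
```

A real embedding model would also match passages that share meaning but no words with the query. The toy function here only rewards word overlap, which is enough to show the data flow.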
The DataVault is provided as a Docker image via Docker Hub at https://hub.docker.com/r/meingpt/datavault and can be operated on any system that supports Docker.
Vector Database
The vector database stores the vectors created by the embedding model and enables efficient search for similar vectors. The meinGPT DataVault uses Weaviate, a vector database designed specifically for this type of application.
Weaviate offers:
- Efficient similarity search across millions of vectors
- Horizontal scalability for large amounts of data
- Persistent storage of vectors and metadata
- Easy integration via REST API and gRPC
- Support for various similarity metrics (Cosine, Dot Product, Euclidean)
- Filtering of search results based on metadata
The vector database is provided as a Docker container and can be operated together with the DataVault on the same system.
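To make the similarity metrics and metadata filtering listed above more concrete, the following sketch computes the three metrics by hand and applies a metadata filter before ranking. It is plain Python and deliberately does not use the Weaviate client. The record layout, the `where` parameter and the function names are assumptions made for illustration only.

```python
import math


def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    # Compares direction only; vector length is ignored.
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )


def euclidean_distance(a, b):
    # Straight-line distance; smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical stored objects: a vector plus metadata per record.
records = [
    {"vector": [0.9, 0.1], "metadata": {"source": "wiki", "lang": "en"}},
    {"vector": [0.8, 0.2], "metadata": {"source": "sharepoint", "lang": "en"}},
    {"vector": [0.1, 0.9], "metadata": {"source": "wiki", "lang": "de"}},
]


def filtered_search(query, records, where, top_k=2):
    """Keep records whose metadata matches `where`, then rank by cosine."""
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in where.items())
    ]
    candidates.sort(
        key=lambda r: cosine_similarity(query, r["vector"]), reverse=True
    )
    return candidates[:top_k]


hits = filtered_search([1.0, 0.0], records, where={"source": "wiki"})
for hit in hits:
    print(hit["metadata"], round(cosine_similarity([1.0, 0.0], hit["vector"]), 3))
```

The metrics can disagree on ranking: `[1.0, 0.0]` and `[2.0, 0.0]` point in the same direction, so their cosine similarity is maximal, yet their Euclidean distance is not zero. Which metric fits best generally depends on how the embedding model was trained.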
Embedding Model
The embedding model is an AI model that converts text into mathematical vectors (embeddings). These vectors represent the meaning of the text in a high-dimensional space, so that semantically similar texts produce similar vectors.
The quality of the embedding model has a direct impact on the quality of search results:
- The better the model, the more accurately it captures the meaning of the texts
- The more dimensions the vectors have, the more precisely nuances of meaning can be represented
- The larger the model, the more computing power is required
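The trade-off in the list above also shows up in storage, because every additional dimension is stored for every vector. The sketch below estimates the raw size of the vectors alone. It assumes 32-bit floats and ignores index structures and metadata, and the dimension counts are illustrative values, not a statement about which models the DataVault uses.

```python
BYTES_PER_FLOAT32 = 4  # assumption: each vector component is a 32-bit float


def raw_vector_bytes(num_vectors, dimensions):
    """Raw size of the vectors only, without index or metadata overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32


# Illustrative dimension counts for one million stored passages.
for dims in (384, 1536, 3072):
    mib = raw_vector_bytes(1_000_000, dims) / (1024 ** 2)
    print(f"{dims:>5} dimensions: {mib:,.0f} MiB")
```

Doubling the number of dimensions doubles this raw footprint and also increases the work per similarity comparison, so a higher-dimensional model is worth it only if the gain in search quality justifies the cost.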
The DataVault supports various embedding models:
- Cloud-based models from OpenAI, Azure and Nebius
- Local models from HuggingFace