Selecting Embedding Models
Always match the vector width of your chosen model
Update the `embeddings.dimensions` Zep server config setting to match the vector width of your chosen model.
Embeddings are vectors (arrays of floats) that represent the semantic meaning of a document. They are generated by language models trained specifically for semantic similarity search.
Zep and Embeddings
Zep automatically embeds both documents and chat messages. Zep stores the text and its embedding for each, and uses the embedding to perform semantic search.
Document Collections are configured to auto-embed by default, but you can disable this behavior. You may instead embed documents yourself and pass both the documents and their embedding vectors to Zep, as sketched below.
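As an illustration, self-managed embedding with the Python SDK might look like the following. The collection name, vector width, and document text are hypothetical, and exact SDK signatures may vary by version, so check the zep-python documentation:

```python
from zep_python import ZepClient
from zep_python.document import Document

# Hypothetical values: adjust the server URL and collection name.
client = ZepClient(base_url="http://localhost:8000")

# Create a collection that will NOT auto-embed; embedding_dimensions
# must match the vector width of the model you embed with.
collection = client.document.add_collection(
    name="mydocs",
    embedding_dimensions=384,
    is_auto_embedded=False,
)

# Pass pre-computed embedding vectors alongside the document text.
collection.add_documents(
    [
        Document(
            content="Zep is a memory store for LLM applications.",
            embedding=[0.0] * 384,  # replace with your model's output
        )
    ]
)
```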
Supported Models
Zep can generate embeddings for you, or you can provide your own; both normalized and unnormalized embeddings are supported.
Embedding Options
Embedding options for Documents and for the Chat Memory Store's Messages can be set separately.
There are two options for embedding models:
- OpenAI's API (the default for docker compose deployments). This may be configured in the Zep server's config file, in `docker-compose.yaml`, or via environment variables. Set `embeddings.service` (or the equivalent environment variable) to `openai` and `embeddings.dimensions` to `1536`.
- Local. This may be configured in the Zep server's config file or via environment variables. Set `embeddings.service` (or the equivalent environment variable) to `local` and `embeddings.dimensions` to the vector width of your chosen model.
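As a minimal sketch, assuming these settings sit under an `embeddings` block as named above (the exact nesting in your `config.yaml` may differ):

```yaml
embeddings:
  service: "local"  # or "openai"
  dimensions: 384   # 1536 for OpenAI; otherwise your model's vector width
```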
Local Embedding Models
For the local option, you may choose from the following models:
- Default: `all-MiniLM-L6-v2`, a small, memory-efficient, very fast, and surprisingly accurate model.
- Other Sentence Transformers-compatible models of your choice. These can be normalized or not. Be mindful of the model's memory requirements and inference speed. See the Sentence Transformers documentation for more information.
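Before setting `embeddings.dimensions`, you can read a candidate model's vector width directly from the `sentence-transformers` package. A short sketch using the default model named above:

```python
from sentence_transformers import SentenceTransformer

# Load any Sentence Transformers-compatible model by name.
model = SentenceTransformer("all-MiniLM-L6-v2")

# The model's vector width; set embeddings.dimensions to this value.
print(model.get_sentence_embedding_dimension())  # 384 for this model
```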
Embedding Model Considerations
- Maximum Sequence Length: models have maximum input sequence lengths. If you exceed this length, the model will truncate your input, which may result in poor embeddings. Select a model that matches your chunking strategy. Note that `all-MiniLM-L6-v2` has a maximum sequence length of 256 tokens (word pieces).
- Speed: Zep, as shipped, uses CPU inference. The larger the model, the slower the inference.
- Memory: the larger the model, the more memory it will require.
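To sanity-check your chunking strategy against a model's sequence limit, you can count tokens with the model's own tokenizer before embedding. A sketch using `sentence-transformers`; the chunk text is illustrative:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk = "Your document chunk goes here..."  # illustrative text

# Tokenize with the model's own tokenizer to count word pieces.
n_tokens = len(model.tokenizer.encode(chunk))

if n_tokens > model.max_seq_length:
    print(f"{n_tokens} tokens: everything past "
          f"{model.max_seq_length} will be truncated.")
```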
Docker on Macs: Local Embedding is slow!
For docker compose deployments, we default to using OpenAI's embedding service.
Zep relies on PyTorch for embedding inference. On macOS, Zep's NLP server runs in a Linux ARM64 container. PyTorch is not optimized for Linux ARM64 and has no access to the M-series acceleration hardware in Apple Silicon Macs, so local embedding inference is slow.
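If you want to pin a docker compose deployment to OpenAI embeddings explicitly, the override might look like this. `ZEP_EMBEDDINGS_SERVICE` is an assumed name following Zep's `ZEP_`-prefixed environment variable convention; confirm the exact variable for your version:

```yaml
services:
  zep:
    environment:
      - ZEP_OPENAI_API_KEY=${ZEP_OPENAI_API_KEY}  # required for OpenAI embeddings
      - ZEP_EMBEDDINGS_SERVICE=openai             # assumed name; verify for your version
```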