Selecting Embedding Models
Always match the vector width of your chosen model
Update the `embeddings.dimensions` Zep server config setting to match the vector width of your chosen model.
Embeddings are vectors (arrays of floats) that represent the semantic meaning of a document. They are generated by language models trained specifically for semantic similarity search.
Zep and Embeddings
Zep automatically embeds both documents and chat messages. Zep stores the text and its embedding for each, and uses the embedding to perform semantic search.
Document Collections are configured to auto-embed by default, but you can disable this behavior. You may instead embed documents yourself and pass both the documents and their embedding vectors to Zep, as sketched below.
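As an illustration, self-managed embedding with the Python SDK might look like the following. The collection name, vector width, and document text are hypothetical, and exact SDK signatures may vary by version, so check the zep-python documentation:

```python
from zep_python import ZepClient
from zep_python.document import Document

# Hypothetical values: adjust the server URL and collection name.
client = ZepClient(base_url="http://localhost:8000")

# Create a collection that will NOT auto-embed; embedding_dimensions
# must match the vector width of the model you embed with.
collection = client.document.add_collection(
    name="mydocs",
    embedding_dimensions=384,
    is_auto_embedded=False,
)

# Pass pre-computed embedding vectors alongside the document text.
collection.add_documents(
    [
        Document(
            content="Zep is a memory store for LLM applications.",
            embedding=[0.0] * 384,  # replace with your model's output
        )
    ]
)
```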
Supported Models
Zep can generate embeddings for you, or you can provide your own; both normalized and unnormalized embeddings are supported.
Embedding Options
Embedding options for Documents and for the Chat Memory Store's Messages can be set separately.
There are two options for embedding models:
- OpenAI's API (the default for docker compose deployments). This may be configured in the Zep server's config file, in `docker-compose.yaml`, or via environment variables. Set `embeddings.service` (or the equivalent environment variable) to `openai` and `embeddings.dimensions` to `1536`.
- Local. This may be configured in the Zep server's config file or via environment variables. Set `embeddings.service` (or the equivalent environment variable) to `local` and `embeddings.dimensions` to the vector width of your chosen model.
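As a minimal sketch, assuming these settings sit under an `embeddings` block as named above (the exact nesting in your `config.yaml` may differ):

```yaml
embeddings:
  service: "local"  # or "openai"
  dimensions: 384   # 1536 for OpenAI; otherwise your model's vector width
```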
Local Embedding Models
For the local option, you may choose from the following models:
- Default: `all-MiniLM-L6-v2`, a small, memory-efficient, very fast, and surprisingly accurate model.
- Other Sentence Transformers-compatible models of your choice. These can be normalized or not. Be mindful of the model's memory requirements and inference speed. See the Sentence Transformers documentation for more information.
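Before setting `embeddings.dimensions`, you can read a candidate model's vector width directly from the `sentence-transformers` package. A short sketch using the default model named above:

```python
from sentence_transformers import SentenceTransformer

# Load any Sentence Transformers-compatible model by name.
model = SentenceTransformer("all-MiniLM-L6-v2")

# The model's vector width; set embeddings.dimensions to this value.
print(model.get_sentence_embedding_dimension())  # 384 for this model
```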
Embedding Model Considerations
- Maximum Sequence Length: models have maximum input sequence lengths. If you exceed this length, the model will truncate your input, which may result in poor embeddings. Select a model that matches your chunking strategy. Note that `all-MiniLM-L6-v2` has a maximum sequence length of 256 tokens (word pieces).
- Speed: Zep, as shipped, uses CPU inference. The larger the model, the slower the inference.
- Memory: the larger the model, the more memory it will require.
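To sanity-check your chunking strategy against a model's sequence limit, you can count tokens with the model's own tokenizer before embedding. A sketch using `sentence-transformers`; the chunk text is illustrative:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk = "Your document chunk goes here..."  # illustrative text

# Tokenize with the model's own tokenizer to count word pieces.
n_tokens = len(model.tokenizer.encode(chunk))

if n_tokens > model.max_seq_length:
    print(f"{n_tokens} tokens: everything past "
          f"{model.max_seq_length} will be truncated.")
```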
Docker on Macs: Local Embedding is slow!
For docker compose deployments, we default to using OpenAI's embedding service.
Zep relies on PyTorch for embedding inference. On macOS, Zep's NLP server runs in a Linux ARM64 container. PyTorch is not optimized for Linux ARM64 and has no access to the M-series acceleration hardware in Apple Silicon Macs, so local embedding inference is slow.
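If you want to pin a docker compose deployment to OpenAI embeddings explicitly, the override might look like this. `ZEP_EMBEDDINGS_SERVICE` is an assumed name following Zep's `ZEP_`-prefixed environment variable convention; confirm the exact variable for your version:

```yaml
services:
  zep:
    environment:
      - ZEP_OPENAI_API_KEY=${ZEP_OPENAI_API_KEY}  # required for OpenAI embeddings
      - ZEP_EMBEDDINGS_SERVICE=openai             # assumed name; verify for your version
```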