RAG (Retrieval-Augmented Generation)¶
MiniBot can index text documents into a Qdrant vector store and retrieve semantically relevant passages at query time using sentence-transformers. It can also optionally rerank the semantic candidate set with a cross-encoder for higher precision on the final results.
This is useful when:
- The `http_request` tool saves an oversized HTTP response to a temp file: the bot can index it with `rag_index` and later answer questions about the content via `rag_search`.
- A user uploads a text document and wants to query it semantically rather than reading the whole file into the context window.
Setup¶
Install torch

Choose CPU or GPU depending on your environment:

    # CPU
    pip install torch --index-url https://download.pytorch.org/whl/cpu

    # GPU (CUDA, default PyPI wheel)
    pip install torch

Install sentence-transformers

    pip install sentence-transformers

Reranking uses `sentence_transformers.CrossEncoder` from this same package, so no separate install is needed beyond `torch` and `sentence-transformers`.

Start Qdrant

Using the pre-downloaded binary (see `qdrant/download_bin.sh`):

    ./qdrant/qdrant

Or via Docker (the service is defined in `docker-compose.yml`):

    docker compose up minibot-qdrant

Enable RAG in `config.toml`:

    [tools.rag]
    enabled = true
    qdrant_url = "http://localhost:6333"
    collection_name = "minibot_chunks"
    chunk_size_tokens = 96
    chunk_overlap_tokens = 20
    search_limit = 5
    truncate_result_tokens = false
    max_result_tokens = 1500

    [tools.rag.embedding]
    model = "sentence-transformers/all-MiniLM-L12-v2"
    dim = 384
    max_sequence_tokens = 128
    # truncate_dim = 256  # Matryoshka truncation, see below

    [tools.rag.rerank]
    enabled = false
    model = "cross-encoder/ms-marco-MiniLM-L2-v2"
    candidate_limit = 50
    max_results = 7

On startup, MiniBot creates the Qdrant collection automatically if it does not exist. If the collection already exists with an incompatible vector size, or dates from the older `source_name` era without the `filename` payload schema, startup fails fast.
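
The startup rule above can be sketched as a small decision function. This is only an illustration under stated assumptions, not MiniBot's code: `CollectionInfo`, `startup_action`, and the way the old schema is detected (a `source_name` payload key without `filename`) are hypothetical, and a real check would read these values from Qdrant.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class CollectionInfo:
    """Hypothetical summary of an existing Qdrant collection."""
    vector_size: int
    payload_keys: Set[str] = field(default_factory=set)


def startup_action(existing: Optional[CollectionInfo], expected_dim: int) -> str:
    """Return "create" or "reuse", or raise if the collection is incompatible."""
    if existing is None:
        # Collection is missing: it gets created automatically.
        return "create"
    if existing.vector_size != expected_dim:
        raise RuntimeError(
            f"vector size {existing.vector_size} != configured dim {expected_dim}"
        )
    # Assumed detection of the older schema: source_name present, filename absent.
    keys = existing.payload_keys
    if "source_name" in keys and "filename" not in keys:
        raise RuntimeError("collection predates the filename payload schema")
    return "reuse"
```

Failing fast here keeps a mismatched collection from silently returning poor or empty search results later.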
Usage¶
Once enabled, the bot has access to four tools:
- `rag_index`: provide a file path plus optional `tags` and `categories` metadata. The bot reads the file, splits it into overlapping chunks, embeds each chunk, and upserts the vectors into Qdrant. Returns the number of chunks indexed and the `document_id` used.
- `rag_search`: provide a natural language query. The bot embeds the query and returns the top-k most relevant chunks with their similarity score and source metadata. Optional `filename`, `tags`, and `categories` filters narrow the result set. Tag/category filters match any of the provided values. Scope filters are bound to the active runtime context; if `user_id`, `agent_id`, or `chat_id` is provided explicitly, it must match the current context. When reranking is enabled, MiniBot first pulls a larger semantic candidate set from Qdrant, reranks it with a cross-encoder, then returns only the final top results. Reranked responses use `score` for the rerank score and also include `semantic_score` from Qdrant.
- `rag_list_metadata`: list available `tags`, `categories`, and `filenames` values, with counts, so the bot can choose real filters before calling `rag_search`.
- `rag_delete`: remove indexed chunks by `document_id` and/or scope tags when the data should no longer be searchable. Optional `tags` and `categories` filters are also supported. The tool call must include at least one explicit filter; context defaults alone do not trigger deletion. Scope filters are bound to the active runtime context; explicit scope values must match it.
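
The scope and deletion rules for `rag_search` and `rag_delete` can be illustrated with a short sketch. The function names and the choice of which keys count as an "explicit filter" are assumptions made for this example; they are not taken from MiniBot's source.

```python
SCOPE_KEYS = ("user_id", "agent_id", "chat_id")
# Assumption: any of these keys counts as an explicit filter for rag_delete.
FILTER_KEYS = ("document_id", "tags", "categories") + SCOPE_KEYS


def resolve_scope(explicit: dict, context: dict) -> dict:
    """Bind scope filters to the active context and reject mismatches."""
    scope = {}
    for key in SCOPE_KEYS:
        active = context.get(key)
        given = explicit.get(key)
        if given is not None and given != active:
            raise ValueError(f"{key} does not match the active context")
        if active is not None:
            scope[key] = active
    return scope


def check_delete_args(explicit: dict) -> None:
    """Require at least one explicit filter; context defaults are not enough."""
    if not any(explicit.get(key) for key in FILTER_KEYS):
        raise ValueError("rag_delete requires at least one explicit filter")
```

With a context of `{"user_id": "u1"}`, a call that passes `user_id="u2"` is rejected, and a delete call with no arguments raises instead of wiping everything in scope.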
Example interaction:
you: index the file at data/files/http_responses/tmp/response-abc.txt
bot: [calls rag_index] Indexed 24 chunks under document ID doc_a394c4126b601889.
you: what does the document say about rate limits?
bot: [calls rag_search] Based on the indexed document, rate limits are ...
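
The chunk count reported by `rag_index` comes from an overlapping split of the document. A minimal sketch of such a sliding window follows; it assumes the input is already a list of tokens (MiniBot counts embedding-model tokens, while any list works here), and `chunk_tokens` is a hypothetical helper name.

```python
def chunk_tokens(tokens, chunk_size=96, overlap=20):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
        start += step
    return chunks
```

The defaults mirror `chunk_size_tokens = 96` and `chunk_overlap_tokens = 20` from the example configuration, so each window advances by 76 tokens.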
Matryoshka embeddings¶
Some models (e.g. BAAI/bge-m3) support Matryoshka Representation Learning, which allows
truncating the embedding to a smaller dimension without retraining.
To use a truncated dimension:
[tools.rag.embedding]
model = "BAAI/bge-m3"
dim = 256 # effective vector size stored in Qdrant
truncate_dim = 256
dim and truncate_dim must match — dim tells MiniBot what size to use when
creating the Qdrant collection, and truncate_dim tells sentence-transformers to truncate
the output to that size.
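
Conceptually, Matryoshka truncation keeps only the leading components of the full embedding. The sketch below shows that idea in plain Python. The renormalization step is an assumption (it matters when vectors are compared as unit vectors), and `truncate_embedding` is a hypothetical helper, not part of sentence-transformers or MiniBot.

```python
import math


def truncate_embedding(vector, truncate_dim):
    """Keep the first truncate_dim components and rescale to unit length."""
    if not 0 < truncate_dim <= len(vector):
        raise ValueError("truncate_dim must be between 1 and len(vector)")
    head = list(vector[:truncate_dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale for an all-zero prefix
    return [x / norm for x in head]
```

Because the stored vectors have `truncate_dim` components, the Qdrant collection must be created with that same size, which is why `dim` and `truncate_dim` have to agree.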
Resetting the collection¶
When switching embedding models (different model or different truncate_dim), existing
vectors are incompatible and the collection must be recreated. MiniBot validates the expected
vector size at startup and fails if the existing collection is incompatible. To clear the collection, run:
./scripts/rag_clear_collection.sh # default collection
./scripts/rag_clear_collection.sh my_chunks # custom name
MiniBot recreates the collection automatically on next startup. This reset is required for
older collections that were indexed before filename replaced source_name.
Configuration reference¶
[tools.rag]
| Key | Default | Description |
|---|---|---|
| `enabled` | | Enable the RAG tools. |
| `qdrant_url` | | Qdrant HTTP endpoint. |
| `collection_name` | | Qdrant collection used for chunk vectors. |
| `chunk_size_tokens` | | Embedding-token count per chunk. Must not exceed `max_sequence_tokens`. |
| `chunk_overlap_tokens` | | Embedding-token overlap between consecutive chunks. Must be less than `chunk_size_tokens`. |
| `search_limit` | | Default final number of results returned by `rag_search`. |
| `truncate_result_tokens` | | Truncate returned … |
| `max_result_tokens` | | Maximum total embedding-token budget for returned … |
| `tags` / `categories` | optional | LLM-supplied string lists stored on each chunk; values are trimmed, lowercased, deduplicated, and can be used as any-match filters in search/delete. |
| | required | RAG reads files through managed storage and inherits its path restrictions. |
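
The tag and category handling described above (trim, lowercase, deduplicate, any-match filtering) can be sketched as follows. Dropping empty strings and preserving first-seen order are assumptions of this example, and both function names are hypothetical.

```python
def normalize_labels(values):
    """Trim, lowercase, and deduplicate a list of tag or category strings."""
    seen = set()
    result = []
    for value in values or []:
        cleaned = value.strip().lower()
        if cleaned and cleaned not in seen:  # assumption: skip empty strings
            seen.add(cleaned)
            result.append(cleaned)
    return result


def matches_any(chunk_labels, wanted):
    """Any-match filter: true if the chunk carries at least one wanted label."""
    return bool(set(chunk_labels) & set(normalize_labels(wanted)))
```

Normalizing both at index time and at query time means a filter such as `"DOCS"` still matches a chunk stored with `"docs"`.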
[tools.rag.rerank]
| Key | Default | Description |
|---|---|---|
| `enabled` | | Enable cross-encoder reranking after the initial semantic Qdrant search. |
| `model` | | Cross-encoder model ID loaded lazily on first reranked search. |
| `candidate_limit` | | Number of semantic candidates to fetch before reranking. MiniBot always fetches at least the requested final result count even if this value is smaller. |
| `max_results` | | Hard cap on final returned results when reranking is enabled. |
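
How `candidate_limit` and `max_results` interact can be sketched in a few lines. This is an illustration only: `plan_limits` and `rerank` are hypothetical names, the hit dictionaries use an assumed shape, and `score_fn` stands in for a cross-encoder that scores a (query, passage) pair.

```python
def plan_limits(requested, candidate_limit=50, max_results=7):
    """Return (candidates to fetch, final results to return)."""
    fetch = max(candidate_limit, requested)  # never fetch fewer than requested
    final = min(requested, max_results)      # hard cap on final results
    return fetch, final


def rerank(query, hits, score_fn, final):
    """Rescore hits, keep the vector-search score as semantic_score, return the top ones."""
    rescored = []
    for hit in hits:
        rescored.append({
            **hit,
            "semantic_score": hit["score"],        # original similarity score
            "score": score_fn(query, hit["text"]),  # rerank score
        })
    rescored.sort(key=lambda h: h["score"], reverse=True)
    return rescored[:final]
```

Fetching a wide candidate set and trimming after rescoring lets the slower cross-encoder correct the ordering without running over the whole collection.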
[tools.rag.embedding]
| Key | Default | Description |
|---|---|---|
| `model` | | Any sentence-transformers compatible model ID. |
| `dim` | | Full output dimension; must match the Qdrant collection vector size. |
| `max_sequence_tokens` | | Hard max input token length for the embedding model. MiniBot fails startup if `chunk_size_tokens` exceeds it. |
| `truncate_dim` | | Matryoshka truncation size. When set, must equal `dim`. |
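
The constraints in this reference can be checked together before startup. The sketch below is a hypothetical validator, not MiniBot's own code, and it assumes that the chunk size limit refers to `max_sequence_tokens`.

```python
def validate_rag_config(chunk_size_tokens, chunk_overlap_tokens,
                        max_sequence_tokens, dim, truncate_dim=None):
    """Return a list of human-readable problems; an empty list means the values agree."""
    errors = []
    if chunk_size_tokens > max_sequence_tokens:
        # Assumed rule: a chunk must fit the embedding model's input limit.
        errors.append("chunk_size_tokens exceeds max_sequence_tokens")
    if not 0 <= chunk_overlap_tokens < chunk_size_tokens:
        errors.append("chunk_overlap_tokens must be less than chunk_size_tokens")
    if truncate_dim is not None and truncate_dim != dim:
        errors.append("truncate_dim must equal dim")
    return errors
```

For the example configuration (`96`, `20`, `128`, `384`, no `truncate_dim`) the function returns an empty list.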