# Knowledge Management
In this cookbook, we will learn how to use Opsmate to manage knowledge.
Note: the knowledge management feature is currently at an early stage of development; the features and the UX are subject to change. At the moment, two types of data source can be ingested as knowledge:
- Any text-based files from your local file system or network-attached storage.
- Any text-based files from GitHub repositories.
We use LanceDB as the underlying vector database to store the knowledge, mainly because of its serverless nature: you can use cloud storage as the backend, which reduces the cost of ownership.
Knowledge retrieval can be achieved via the `KnowledgeRetrieval` tool, which is built into Opsmate.
## Environment variable-based configuration options
### FS_EMBEDDINGS_CONFIG
This is a JSON object of key-value pairs, where each key is the path to a directory to be ingested and the value is the glob pattern matching the files to ingest.
Example usage:
```bash
FS_EMBEDDINGS_CONFIG='{"./docs/cookbooks": "*.md"}'
```
This will ingest all the markdown files in the `./docs/cookbooks` directory.
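Multiple directories can be configured in a single JSON object. A hypothetical multi-entry configuration (the paths are illustrative, and whether recursive `**` globs are honoured for filesystem ingestion is an assumption based on the GitHub example later in this cookbook):

```bash
# hypothetical: ingest cookbook markdown plus runbook text files in one pass
FS_EMBEDDINGS_CONFIG='{"./docs/cookbooks": "*.md", "./runbooks": "**/*.txt"}'
```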
### GITHUB_EMBEDDINGS_CONFIG
This is a JSON object of key-value pairs, where each key takes the form `owner/repo[:branch]` (the branch is optional) and the value is the glob pattern matching the files to ingest.
Example usage:
```bash
GITHUB_EMBEDDINGS_CONFIG='{"opsmate/opsmate": "*.md", "kubernetes/kubernetes:test-branch": "*.txt"}'
```
In the example above, the first entry will ingest all the markdown files in the `opsmate/opsmate` repository. The second entry will ingest all the text files in the `kubernetes/kubernetes` repository on the `test-branch` branch. If the branch is not specified, it defaults to `main`.
**Important:** a GitHub token must be set in the `GITHUB_TOKEN` environment variable.
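For example (the token value is a placeholder):

```bash
# provide the token alongside the ingestion config
export GITHUB_TOKEN=<your-github-token>
export GITHUB_EMBEDDINGS_CONFIG='{"opsmate/opsmate": "*.md"}'
```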
### EMBEDDING_REGISTRY_NAME and EMBEDDING_MODEL_NAME
`EMBEDDING_REGISTRY_NAME` is the name of the embedding registry to use. It defaults to `openai`.

`EMBEDDING_MODEL_NAME` is the name of the embedding model to use. It defaults to `text-embedding-ada-002`.
LanceDB supports a wide range of embedding models; refer to the LanceDB embedding documentation for more details.
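As an illustration, switching to a newer OpenAI embedding model might look like this (whether a given model name is accepted depends on the LanceDB embedding registry, so treat the value below as an assumption):

```bash
# hypothetical override: keep the openai registry but use a newer model
export EMBEDDING_REGISTRY_NAME=openai
export EMBEDDING_MODEL_NAME=text-embedding-3-small
```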
### EMBEDDINGS_DB_PATH
`EMBEDDINGS_DB_PATH` is the path to the LanceDB database. It defaults to `~/.data/opsmate-embeddings`.
Right now it defaults to the local file system, but LanceDB supports a wide range of storage options; refer to the LanceDB storage documentation for more details. The documentation includes a comprehensive diagram that walks through the thought process of choosing the right storage backend.
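For instance, LanceDB can address object storage directly via a URI. A minimal sketch, assuming your environment has the relevant credentials configured and that Opsmate passes the path straight through to LanceDB:

```bash
# hypothetical: back the embeddings database with an S3 bucket
export EMBEDDINGS_DB_PATH=s3://my-bucket/opsmate-embeddings
```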
**Warning:** the ingestion chunk size is currently set to 1000 with an overlap of 0, and the recursive text splitter is the default chunking strategy. At the moment this can only be changed through environment-variable-based configuration (see `SPLITTER_CONFIG` below); more flexible configuration will be supported in the future.
### SPLITTER_CONFIG
Currently there are two types of splitter:
- RecursiveTextSplitter
- MarkdownHeaderTextSplitter
Here are example configurations:

```bash
# this is the default configuration
SPLITTER_CONFIG='{"name": "recursive", "chunk_size": 1000, "chunk_overlap": 0}'
# OR
SPLITTER_CONFIG='{"name": "markdown_header", "headers_to_split_on": [["#", "h1"], ["##", "h2"], ["###", "h3"]]}'
```
## SDK-based data ingestion
You can also choose to ingest knowledge via the SDK, which provides greater flexibility in configuration.
In the example below, we ingest all the markdown files in the `docs/book/src` directory of the `kubernetes-sigs/kubebuilder` repository to learn about Kubebuilder. Note that this will take a while to complete and emit a lot of logs, so we are not going to run it here.
```python
from opsmate.libs.config import config, Config
from opsmate.app.base import on_startup as base_app_on_startup
from opsmate.ingestions import ingest_from_config
from sqlmodel import create_engine, text
import asyncio
import structlog

logger = structlog.get_logger()


async def main():
    engine = create_engine(
        config.db_url,
        connect_args={"check_same_thread": False},
        # echo=True,
    )

    # enable write-ahead logging on the SQLite database
    with engine.connect() as conn:
        conn.execute(text("PRAGMA journal_mode=WAL"))
        conn.close()

    await base_app_on_startup(engine)

    await ingest_from_config(
        Config(
            github_embeddings_config={
                "kubernetes-sigs/kubebuilder:master": "./docs/book/src/**/*.md"
            },
            # By default we categorise the knowledge into categories for better
            # segmentation, but we disable it here for the sake of speed.
            categorise=False,
        ),
        engine=engine,
    )


if __name__ == "__main__":
    asyncio.run(main())
```
You can initiate the ingestion via:

```bash
OPSMATE_DB_URL=sqlite:////tmp/sqlite.db python main.py
```

For the actual ingestion, start the background worker via:

```bash
OPSMATE_DB_URL=sqlite:////tmp/sqlite.db python -m opsmate.dbqapp.app
```
Once the Kubebuilder knowledge is ingested, we can use the `KnowledgeRetrieval` tool to provide retrieval-augmented generation (RAG):
```python
from opsmate.tools import KnowledgeRetrieval

# top-level await assumes an async context such as a notebook
result = await KnowledgeRetrieval(
    query="how to do env test against a real cluster in kubebuilder using environment variables?"
).run()
print(result.summary)
```
````
2025-02-21 17:02:04 [info ] running knowledge retrieval tool query=how to do env test against a real cluster in kubebuilder using environment variables?

To run envtest against a real cluster using Kubebuilder, you need to set specific environment variables to point to the existing cluster's control plane and binaries. Here are the key environment variables to use:

1. **`USE_EXISTING_CLUSTER`**: Set this to `true` to connect to an existing cluster instead of creating a local control plane.
2. **`KUBEBUILDER_ASSETS`**: This should point to the directory containing the binaries needed for your tests (like `kubectl`, `etcd`, and `kube-apiserver`).
3. **`TEST_ASSET_KUBE_APISERVER`, `TEST_ASSET_ETCD`, `TEST_ASSET_KUBECTL`**: These variables can be set to the specific paths of the `kube-apiserver`, `etcd`, and `kubectl` binaries, respectively. They provide a more granular way to specify which binaries to use if they differ from the default ones.

### Example of Setting Variables

You can export the necessary variables in your terminal session before running your tests:

```bash
export USE_EXISTING_CLUSTER=true
export KUBEBUILDER_ASSETS="/path/to/binaries/"
export TEST_ASSET_KUBE_APISERVER="/path/to/kube-apiserver"
export TEST_ASSET_ETCD="/path/to/etcd"
export TEST_ASSET_KUBECTL="/path/to/kubectl"
```

After setting these environment variables, you can run your tests, and they will utilize the existing cluster rather than initializing a new one.
````
The question is fairly obscure; in the past it took me several hours to figure out, with help from a mixture of Google searches and reading the Kubebuilder documentation. With semantic search, the answer is returned in seconds.
## Future capabilities
- Right now the async-based knowledge ingestion is fairly naive and is not designed to run in a distributed, fault-tolerant manner. We need to design a more robust system to support this, potentially bringing in a big gun such as Celery, but ideally something easy to maintain and scale.
- We need to support more data source types, such as databases or other API-based data sources.
- Currently only text-based files are supported; we need to support more file types, such as images, videos, and other binary data.