LLM Clients API¶

kavalai.llm_clients provides a unified, observable interface over LLM and embedding providers. Every call returns a ModelCallStat with token usage and timing, and structured output is validated against a Pydantic response_model.

Run in the browser¶

The browser/ provider runs a model entirely client-side over WebGPU — no API key, no server, no CORS. The same make_client() / make_embedding_client() factories you use on the server return a BrowserLLMClient / BrowserEmbeddingClient, so your code is identical apart from the provider/model string. The two snippets below have a Run in browser ▶ button (the model id comes from the panel’s dropdown):

from kavalai import make_client

client = make_client(f"browser/{KAVAL_BROWSER_MODEL}")
colours = await client.prompt("Name the three primary colours, comma-separated.")
print(colours)

Embeddings work the same way. Embedding models are distinct from chat models; KAVAL_BROWSER_EMBED_MODEL is a small, full-precision Snowflake Arctic model:

from kavalai import make_embedding_client

client = make_embedding_client(f"browser/{KAVAL_BROWSER_EMBED_MODEL}")
texts = [
    "Tallinn is the capital of Estonia.",
    "Estonia's capital city is Tallinn.",
    "I had pasta for dinner last night.",
]
vectors, stats = await client.compute_embeddings(texts, normalize=True)
print(f"{len(vectors)} vectors of dimension {len(vectors[0])}")

# Vectors are L2-normalised, so cosine similarity is just their dot product.
def similarity(a, b):
    return sum(x * y for x, y in zip(a, b))

print(f"sim(0, 1) = {similarity(vectors[0], vectors[1]):.3f}  # same meaning")
print(f"sim(0, 2) = {similarity(vectors[0], vectors[2]):.3f}  # unrelated")

Note

browser/ models need a WebGPU-capable browser (recent Chrome/Edge, or Firefox with dom.webgpu.enabled). The model downloads on first use and is cached by the browser. Outside the browser, use an openai/, gemini/ or ollama/ model instead.

Base client and models¶

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

class kavalai.llm_clients.base_client.LlmClientParameters(*, temperature: float | None = 1.0, top_p: float | None = 0.2, reasoning_effort: str | None = None, service_tier: str | None = None, timeout_seconds: float | None = 30.0)[source]¶

Bases: BaseModel

temperature : float | None¶

top_p : float | None¶

reasoning_effort : str | None¶

service_tier : str | None¶

timeout_seconds : float | None¶

model_config : ClassVar[ConfigDict] = {}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class kavalai.llm_clients.base_client.ChatMessage(*, role: str | None = None, type: str | None = None, content: str | None = None)[source]¶

Bases: BaseModel

Standard chat completion message.

role : str | None¶

type : str | None¶

content : str | None¶

model_config : ClassVar[ConfigDict] = {}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class kavalai.llm_clients.base_client.ChatHistory(*, messages: list[ChatMessage])[source]¶

Bases: BaseModel

messages : list[ChatMessage]¶

model_config : ClassVar[ConfigDict] = {}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

Bases: BaseModel

call_type : Literal['llm', 'embedding']¶

model : str | None¶

request_data : str | None¶

response_data : str | None¶

response_code : int | None¶

prompt_tokens : int | None¶

completion_tokens : int | None¶

total_tokens : int | None¶

batch_size : int | None¶

duration_seconds : float | None¶

model_config : ClassVar[ConfigDict] = {}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class kavalai.llm_clients.base_client.ModelStatsReceiver[source]¶

Bases: object

receive_model_stats(stats: ModelCallStat)[source]¶

class kavalai.llm_clients.base_client.ModelStatsLogger(format_str: str | None = None)[source]¶

Bases: ModelStatsReceiver

Logs model call statistics using a configurable format.

receive_model_stats(stats: ModelCallStat)[source]¶

class kavalai.llm_clients.base_client.BaseLlmClient(llm_client_parameters: LlmClientParameters | None = None, model_stats_receiver: ModelStatsReceiver | None = None)[source]¶

Bases: object

async stream_chat_completions(*, chat_history: ChatHistory, response_model: type[BaseModel] | None = None) → Streamer[source]¶

Execute a chat completion and return a Streamer.

Parameters:¶

chat_history: ChatHistory¶: The history of messages.
response_model: type[BaseModel] | None = None¶: Optional Pydantic model for structured output.

Returns:¶

A Streamer instance that will yield the completion events.

async chat_completions(*, chat_history: ChatHistory, response_model: type[BaseModel] | None = None)[source]¶

async stream_prompt(system_message: str, response_model: type[BaseModel] | None = None) → Streamer[source]¶

async prompt(system_message: str, response_model: type[BaseModel] | None = None)[source]¶

exception kavalai.llm_clients.base_client.LlmClientException[source]¶: Bases: RuntimeError

class kavalai.llm_clients.base_client.BaseEmbeddingClient[source]¶: Bases: object

Provider clients¶

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

class kavalai.llm_clients.openai_client.OpenAIClient(model: str, llm_client_parameters: LlmClientParameters | None = None, model_stats_receiver: ModelStatsReceiver | None = None, api_key: str | None = None, base_url: str | None = None)[source]¶

Bases: BaseLlmClient

OpenAI LLM client implementation using the Responses API and Streamer.

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

class kavalai.llm_clients.gemini_client.GeminiClient(model: str, llm_client_parameters: LlmClientParameters | None = None, model_stats_receiver: ModelStatsReceiver | None = None, api_key: str | None = None)[source]¶

Bases: BaseLlmClient

Gemini LLM client implementation using the Streamer.

kavalai.llm_clients.gemini_client.convert_messages(messages: list[dict[str, Any]]) → tuple[str | None, list[Content]][source]¶

kavalai.llm_clients.gemini_client.remove_additional_properties(schema: dict[str, Any]) → None[source]¶: Recursively remove ‘additionalProperties’ from a JSON schema. Gemini’s API doesn’t support this field.

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

class kavalai.llm_clients.ollama_client.OllamaClient(model: str, llm_client_parameters: LlmClientParameters | None = None, model_stats_receiver: ModelStatsReceiver | None = None, host: str | None = None)[source]¶

Bases: BaseLlmClient

Ollama LLM client implementation using the Streamer.

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

kavalai.llm_clients.browser_client.get_browser_bridge()[source]¶

Return the page’s JS bridge object (window.kavalBrowserLLM).

Shared by the browser LLM client (.chat) and the browser embedding client (.embed). Raises a helpful LlmClientException when not running under Pyodide, or when the page has not loaded a bridge.

class kavalai.llm_clients.browser_client.BrowserLLMClient(model: str, llm_client_parameters: LlmClientParameters | None = None, model_stats_receiver: ModelStatsReceiver | None = None)[source]¶

Bases: BaseLlmClient

LLM client that runs entirely in the browser, with no network calls.

Inference happens inside the page through a tiny JavaScript bridge exposed on window.kavalBrowserLLM, typically backed by a WebGPU engine such as WebLLM. This makes Kaval.AI’s LLM nodes usable inside Pyodide with no API key, no provider account and no CORS constraints — the model is downloaded once and cached by the browser.

Use it through make_client("browser/<model-id>") or construct it directly. <model-id> is passed verbatim to the bridge (e.g. a WebLLM model id like Llama-3.2-1B-Instruct-q4f32_1-MLC).

The bridge contract is a single async function:

window.kavalBrowserLLM.chat(requestJson) -> Promise<resultJson>

where requestJson is a JSON string of {model, messages, temperature, top_p, response_format?} and resultJson is a JSON string of either {content, usage} or {error}. Exchanging plain JSON strings keeps the Python<->JS boundary free of proxy-conversion surprises.

Embeddings¶

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

class kavalai.llm_clients.embeddings.BaseEmbeddingClient(model: str)[source]¶

Bases: object

Common interface for v2 embedding clients.

The model name is bound at construction (the factory splits the provider/model string), so compute_embeddings only takes the texts. Implementations return the embeddings plus a database-ready ModelCallStat (the ORM row) so callers such as RagService can persist usage directly.

async compute_embeddings(texts: list[str], normalize: bool = False, normalizer: Normalizer | None = None, **kwargs) → tuple[list[list[float]], ModelCallStat][source]¶

class kavalai.llm_clients.embeddings.OpenAIEmbeddingClient(model: str, api_key: str | None = None, base_url: str | None = None, timeout: float = 30.0)[source]¶

Bases: BaseEmbeddingClient

OpenAI embeddings (e.g. text-embedding-3-small).

async compute_embeddings(texts: list[str], normalize: bool = False, normalizer: Normalizer | None = None, **kwargs) → tuple[list[list[float]], ModelCallStat][source]¶

class kavalai.llm_clients.embeddings.GeminiEmbeddingClient(model: str, api_key: str | None = None)[source]¶

Bases: BaseEmbeddingClient

Google Gemini embeddings.

async compute_embeddings(texts: list[str], normalize: bool = False, normalizer: Normalizer | None = None, **kwargs) → tuple[list[list[float]], ModelCallStat][source]¶

class kavalai.llm_clients.embeddings.OllamaEmbeddingClient(model: str, host: str | None = None, timeout: float = 30.0)[source]¶

Bases: BaseEmbeddingClient

Ollama (local) embeddings.

async compute_embeddings(texts: list[str], normalize: bool = False, normalizer: Normalizer | None = None, **kwargs) → tuple[list[list[float]], ModelCallStat][source]¶

class kavalai.llm_clients.embeddings.FastEmbedClient(model: str, cache_dir: str | None = None, threads: int | None = None, **kwargs)[source]¶

Bases: BaseEmbeddingClient

Local embeddings via FastEmbed / ONNX Runtime (no API key).

async compute_embeddings(texts: list[str], normalize: bool = False, normalizer: Normalizer | None = None, **kwargs) → tuple[list[list[float]], ModelCallStat][source]¶

class kavalai.llm_clients.embeddings.BrowserEmbeddingClient(model: str)[source]¶

Bases: BaseEmbeddingClient

In-browser embeddings via the WebLLM bridge (Pyodide only, no API key).

Mirrors BrowserLLMClient: inference happens inside the page through window.kavalBrowserLLM, here via its async embed function:

window.kavalBrowserLLM.embed(requestJson) -> Promise<resultJson>

where requestJson is a JSON string of {model, input} (input is the list of texts) and resultJson is a JSON string of either {embeddings, usage} or {error}. The model is downloaded once and cached by the browser — no API key, no provider account, no CORS.

Use it through make_embedding_client("browser/<model-id>"); <model-id> is passed verbatim to the bridge (e.g. a WebLLM embedding id like snowflake-arctic-embed-m-q0f32-MLC-b4).

async compute_embeddings(texts: list[str], normalize: bool = False, normalizer: Normalizer | None = None, **kwargs) → tuple[list[list[float]], ModelCallStat][source]¶

kavalai.llm_clients.embeddings.make_embedding_client(model: str) → BaseEmbeddingClient[source]¶

Construct a v2 embedding client from a provider/model string.

Supported providers: openai, gemini, ollama, fastembed, browser. The provider is split off and the remainder (which may itself contain slashes, e.g. fastembed/BAAI/bge-small-en-v1.5) is the model name. The browser provider runs entirely client-side via a WebLLM bridge (Pyodide only) and needs no API key.

Streaming¶

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

exception kavalai.llm_clients.streamer.StreamerTimeoutException(names: list[str], timeout_seconds: float)[source]¶

Bases: Exception

Raised when no stream chunk arrives within the configured timeout.

Reported by Streamer while waiting on its queue when a timeout_seconds is set; names lists the streamers still active when the timeout elapsed.

class kavalai.llm_clients.streamer.StreamContent(*, type: str, name: str, value: str | None = None)[source]¶

Bases: BaseModel

StreamContent represents a streamed message from a Streamer.

Variables:¶

type : str¶: The type of stream message (e.g., ‘partial’, ‘complete’).
name : str¶: The identifier for the stream source or target.
value : str | None¶: The actual content string.

type : str¶

name : str¶

value : str | None¶

model_config : ClassVar[ConfigDict] = {}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class kavalai.llm_clients.streamer.ValueStreamer(name: str, queue: Queue, response_model: type[BaseModel] | None = None, stream_delta: bool = False, on_complete_callback: callable | None = None)[source]¶

Bases: object

A helper class to manage and push streaming content to an asyncio queue.

Variables:¶

name: str¶: Default name for the stream chunks.
queue: Queue¶: The asyncio.Queue where messages are placed.

get_safe_value() → str[source]¶: Safely parse and return the buffered content as JSON string if response_model is set, otherwise return as string.

async stream_partial(value: str)[source]¶

Push a ‘partial’ chunk to the queue.

Parameters:¶

value: str¶: The partial content to stream.
name: Optional override for the stream name.

async stream_complete()[source]¶

Push a ‘complete’ chunk to the queue, indicating the stream has finished.

Parameters:¶

value: Optional final content to append to the buffer before completing.
name: Optional override for the stream name.

class kavalai.llm_clients.streamer.Streamer(stream_delta: bool = False, timeout_seconds: float | None = None)[source]¶

Bases: object

property queue : Queue¶

get_value_streamer(name: str, stream_delta: bool | None = None, response_model: type[BaseModel] | None = None) → ValueStreamer[source]¶

async stream_error(error: Exception)[source]¶: Push an ‘error’ chunk to the queue.