LLM clients¶
Kaval.AI ships native, async LLM clients with one small interface over every provider. The headline methods are:
prompt(message)— a single call that returns the model’s answer.stream_prompt(message)— the same call, streamed as it is generated.chat_completions(chat_history=...)— a full multi-message conversation.
Each one takes an optional response_model (a Pydantic model) to get
validated, structured output instead of plain text. The clients are
standalone — you can use them on their own, or let a workflow build them for you.
The fastest way to construct one is make_client(), which picks the
right client from a "provider/model" string (openai/…, gemini/…,
ollama/… or browser/…).
Try it in your browser: text vs. structured output¶
The browser/ provider runs a small model right on this page over WebGPU —
no API key, no server — so you can try the two output modes without installing
anything. The snippets below have a Run in browser ▶ button; the model id
comes from the panel’s dropdown (exposed to your code as KAVAL_BROWSER_MODEL).
Text in, text out. With no response_model, prompt returns a plain
string:
from kavalai import make_client
client = make_client(f"browser/{KAVAL_BROWSER_MODEL}")
answer = await client.prompt("In one sentence, what is Tallinn?")
print(type(answer).__name__, "->", answer)
Structured in, structured out. Pass a Pydantic response_model and the
model is constrained to that schema; you get back a validated instance, not a
string to parse:
from pydantic import BaseModel
from kavalai import make_client
class City(BaseModel):
name: str
country: str
fun_fact: str
client = make_client(f"browser/{KAVAL_BROWSER_MODEL}")
city = await client.prompt("Describe Tallinn.", response_model=City)
print(type(city).__name__, "->", city)
print("country :", city.country)
print("fun fact :", city.fun_fact)
That is the whole point of structured output: instead of coaxing facts out of
free-form prose, you declare the shape you want once and read typed fields
(city.country, city.fun_fact) straight off the result. The same
response_model works with every provider below.
Streaming responses¶
For long answers you often do not want to wait for the whole response.
stream_prompt returns a Streamer you can iterate as the
model produces output:
client = make_client("openai/gpt-5.4-mini")
streamer = await client.stream_prompt("Write a short story about a curious robot.")
async for chunk in streamer:
# chunk.type is "partial" while generating and "complete" at the end.
print(chunk.value, end="", flush=True)
Streaming lowers perceived latency (text appears immediately), lets you show progress on long generations, and makes it easy to cancel early. When you stream structured output, every partial chunk is still valid JSON, so a UI can render a partially-filled object safely.
This is just the gist — backpressure, timeouts and structured streaming are covered in the dedicated Streaming results with Streamer tutorial.
Provider clients: OpenAI, Gemini and Ollama¶
Outside the browser, Kaval.AI ships native clients for OpenAI, Google Gemini and Ollama. You can build them directly and pass the API key two ways — straight to the constructor, or via an environment variable:
from kavalai import OpenAIClient, GeminiClient, OllamaClient
# 1) Pass the key to the client...
openai = OpenAIClient("gpt-5.4-mini", api_key="sk-...")
# 2) ...or omit it and the client reads it from the environment.
gemini = GeminiClient("gemini-3.1-flash-lite") # reads GEMINI_API_KEY
ollama = OllamaClient("llama3") # local; reads OLLAMA_HOST
The environment variables and extra options per provider:
Provider |
API key / host env var |
Notes |
|---|---|---|
|
|
|
|
|
Google Gemini models. |
|
|
Runs locally; no API key. |
make_client() is the shortcut — it builds the matching client from
a "provider/model" id and reads the same environment variables:
from kavalai import make_client
client = make_client("openai/gpt-5.4-mini")
reply = await client.prompt("Say hello in Estonian.")
Inside a workflow you rarely construct a client yourself — you name the
model and the engine builds it. Set it per node or as the workflow default with
llm_model="openai/gpt-5.4-mini", or set the KAVALAI_DEFAULT_LLM_MODEL
environment variable so you can omit llm_model entirely:
export KAVALAI_DEFAULT_LLM_MODEL="openai/gpt-5.4-mini"
Using clients without a workflow¶
Everything above used the clients on their own — no workflow, no engine. The
same clients power workflow llm nodes under the hood, but you can drop one
into any async code to call a model, get structured output or stream. Reach for
workflows when you want to orchestrate several steps, branch on
results or call tools; reach for a client directly when you just need a model.
Model statistics and observability¶
Every call reports a ModelCallStat — the model, prompt /
completion / total token counts, the HTTP status and the wall-clock duration. By
default Kaval.AI logs these through loguru with the built-in
ModelStatsLogger. Pass your own ModelStatsReceiver
to send them anywhere — a metrics backend, a database, or just stdout:
from kavalai import OpenAIClient, ModelStatsReceiver, ModelCallStat
class PrintStats(ModelStatsReceiver):
def receive_model_stats(self, stats: ModelCallStat):
print(f"{stats.model}: {stats.total_tokens} tokens "
f"in {stats.duration_seconds:.2f}s")
client = OpenAIClient("gpt-5.4-mini", model_stats_receiver=PrintStats())
await client.prompt("What is 2 + 2?")
Inside workflows these stats are aggregated per run (WorkflowState.token_usage)
and surfaced in the backoffice UI as per-call token and cost metrics — see
Observability.
Timeouts and retries¶
Reliability and sampling are controlled with LlmClientParameters:
from kavalai import OpenAIClient, LlmClientParameters
client = OpenAIClient(
"gpt-5.4-mini",
llm_client_parameters=LlmClientParameters(
temperature=0.2,
timeout_seconds=60, # cap each attempt (default: 30s)
),
)
timeout_seconds bounds each request. On transient failures — rate limits,
timeouts, dropped connections and 5xx errors — the client retries automatically
with exponential backoff (up to 5 attempts, with jitter). It does not retry
on errors you should fix yourself: authentication failures, 404 responses
and other bad requests are raised immediately.
Where to next¶
Streaming results with Streamer — stream text and structured output in depth.
Building agents with Workflows — wire clients into typed, branching workflows.
Agents & tools — the
FunctionKerneland the multi-stepAgentloop.Observability — stats, tracing and the backoffice UI.