Live deployment demo · voice channel

Nimbus Desk

A voice support agent for Nimbus Telecom, a fictional mobile operator. Call it about a confusing charge. It verifies who you are, pulls your bill, explains the line item, and applies a goodwill credit, all through function calls you can watch happen on the right.

OpenAI Realtime API · WebRTC · Next.js + TypeScript · 7 tools, mock billing system · 3 min cap

Uses your microphone. Calls are capped at 3 minutes. You will talk to an AI agent about a fictional phone bill.

Prefer typing, or no mic handy? The same agent runs a text channel. One agent definition, two channels.

Try the golden path

Say a charge on your bill looks wrong. When asked, verify as Maya Fischer, account ending 2210. Ask about the roaming charge.

Try to break it

Ask for account data before verifying. Demand a 50 euro credit. Ask to cancel your contract. It should refuse, cap, or hand off cleanly.

What this proves

Prompt design, function calling against a billing system, identity gating, authority limits, and clean human handoff, in one live voice deployment.

How it's built

Browser

mic + WebRTC · live transcript · executes the agent's function calls · hard 3-min cutoff

OpenAI Realtime API

gpt-realtime-2 · cedar voice · semantic VAD · speech in, speech out, tool calls over a data channel

Next.js server

mints 60s client secrets · real key never ships · per-IP and global daily caps, fail closed

One agent definition, every channel

Instructions + 7 typed tools live in one shared TypeScript module. The voice channel, the text channel, and the eval harness all consume the same definition, so behavior never forks.
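A minimal sketch of what such a shared definition could look like. The `AgentDefinition` shape, the three tool names, and the two config helpers are hypothetical, chosen for illustration; the page does not show the real module or list its seven tools.

```typescript
// Hypothetical shapes: the real module's names and fields are not shown here.
type ToolSpec = {
  name: string;
  description: string;
  parameters: Record<string, "string" | "number">;
};

type AgentDefinition = {
  instructions: string;
  tools: ToolSpec[];
};

// Single source of truth. Only three illustrative tools are listed.
const agent: AgentDefinition = {
  instructions: "You are a billing support agent for Nimbus Telecom.",
  tools: [
    {
      name: "verify_identity",
      description: "Check the caller's name and account digits",
      parameters: { name: "string", last4: "string" },
    },
    { name: "get_bill", description: "Fetch the current bill", parameters: {} },
    {
      name: "apply_credit",
      description: "Apply a goodwill credit",
      parameters: { amountEur: "number" },
    },
  ],
};

// Each channel derives its session config from the same object,
// so a fix to instructions or tools reaches every channel at once.
function voiceConfig(def: AgentDefinition) {
  return { modality: "audio" as const, instructions: def.instructions, tools: def.tools };
}

function textConfig(def: AgentDefinition) {
  return { modality: "text" as const, instructions: def.instructions, tools: def.tools };
}
```

Because both helpers read from `agent` rather than copying it, there is no second place for a channel-specific prompt to drift.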

Mock billing system + session log

Tools hit an in-memory billing service with an identity gate and a 20 euro credit authority, both enforced in code, not in the prompt. Outcomes, latencies, and turns land on /ops.
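One way the gate and the cap could sit inside the service, sketched under assumptions: the class name, method names, error codes, and the bill figure are invented for illustration, and treating the 20 euro authority as a cumulative per-session limit is a guess rather than something the page states.

```typescript
const CREDIT_CAP_EUR = 20; // the credit authority stated on the page

type Result<T> = { ok: true; value: T } | { ok: false; error: string };

// Hypothetical in-memory service for a single session.
class MockBilling {
  private verified = false;
  private creditedEur = 0;

  markVerified(): void {
    this.verified = true;
  }

  getBill(): Result<{ roamingEur: number }> {
    // Identity gate: no account data before verification.
    if (!this.verified) return { ok: false, error: "identity_not_verified" };
    return { ok: true, value: { roamingEur: 14.5 } }; // made-up figure
  }

  applyCredit(amountEur: number): Result<{ appliedEur: number }> {
    if (!this.verified) return { ok: false, error: "identity_not_verified" };
    if (!Number.isFinite(amountEur) || amountEur <= 0) {
      return { ok: false, error: "invalid_amount" };
    }
    const remaining = CREDIT_CAP_EUR - this.creditedEur;
    if (remaining <= 0) return { ok: false, error: "credit_cap_reached" };
    // Cap rather than reject: a 50 euro request yields at most 20.
    const appliedEur = Math.min(amountEur, remaining);
    this.creditedEur += appliedEur;
    return { ok: true, value: { appliedEur } };
  }
}
```

Whatever the model asks for, the tool result reports what was actually applied, so the agent can only describe the outcome, not decide it.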

First live test call, first production lesson: the caller said "Fischer", the speech model wrote "Fisher", and exact-match identity verification refused a legitimate customer. The fix lives in the integration layer, not the prompt: exact match on the account digits, fuzzy match on the spoken name, locked in with a regression test. Every failure becomes a test.
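The fix described above can be sketched as follows. This is an illustration, not the project's code: the function names, the use of Levenshtein edit distance as the fuzzy measure, and the tolerance of two edits are all assumptions.

```typescript
// Levenshtein edit distance, computed one row at a time.
function editDistance(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Lowercase and keep ASCII letters only, so spacing and punctuation do not count.
const normalize = (s: string): string => s.toLowerCase().replace(/[^a-z]/g, "");

function verifyIdentity(
  spokenName: string,
  spokenLast4: string,
  record: { name: string; last4: string },
  maxNameEdits = 2, // assumed tolerance
): boolean {
  // Digits are unambiguous when transcribed, so they must match exactly.
  if (spokenLast4 !== record.last4) return false;
  // Names are spelled by the speech model, so allow a small edit distance.
  return editDistance(normalize(spokenName), normalize(record.name)) <= maxNameEdits;
}

// Regression case from the first live call.
const maya = { name: "Maya Fischer", last4: "2210" };
console.log(verifyIdentity("Maya Fisher", "2210", maya)); // true
console.log(verifyIdentity("Maya Fischer", "2211", maya)); // false
```

The exact digit check carries the security weight; the name check only has to be forgiving enough to survive transcription. This `normalize` drops non-ASCII letters, which a real deployment would need to handle.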

Why it's built this way

Every mechanism here is a decision with a reason and an accepted tradeoff. These are the ones that matter.

Telecom billing, not a generic assistant

A billing dispute has everything a real deployment has: identity, account data, a bounded remedy, and a human fallback. Each step is verifiable through a tool call, so the demo proves integration work rather than conversation. And it resolves inside 90 seconds, which is all the time a visitor gives you.

Speech-to-speech, not a cascaded pipeline

The classic stack chains speech-to-text, an LLM, and text-to-speech. Every hop adds latency, and interruptions need custom handling. OpenAI's Realtime API does speech in, speech out with barge-in built in, which is why the agent stops talking the moment you cut it off. The tradeoff is less per-stage control and a higher price per minute; the caps and the text-mode eval absorb both.

WebRTC with ephemeral secrets

The real API key never leaves the server. The browser gets a client secret that expires in 60 seconds and only gates starting a call. The browser also brings echo cancellation and jitter handling for free, which is most of what makes a voice call feel decent on bad wifi.

A model per task, not one model for everything

Voice runs on gpt-realtime-2 with the cedar voice, picked for a warm, professional register. Chat, the eval personas, and the judge run on gpt-4.1-mini: it passed all 26 behavioral assertions at a fraction of the cost, so paying for a bigger model there buys nothing. Right-sizing the model per task is the habit; every choice is one env var to revisit.

Policy lives in code, the prompt is UX

The 20 euro credit cap, the identity gate, and the rule that escalation ends account actions are all enforced in the billing service. The eval proved why: instructions alone held the escalation rule most of the time, and most of the time is not a policy. The prompt shapes tone and flow; code decides what is allowed.
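As an illustration of the escalation rule living in code, here is a small sketch. The `SessionPolicy` class, the action names, and the reason string are hypothetical; the page states only that escalation ends account actions and that the billing service enforces it.

```typescript
// Hypothetical action names for illustration.
type Action = "get_bill" | "apply_credit" | "escalate_to_human";

type Decision = { allowed: true } | { allowed: false; reason: string };

class SessionPolicy {
  private escalated = false;

  authorize(action: Action): Decision {
    if (action === "escalate_to_human") {
      // Escalation is always permitted and latches for the rest of the session.
      this.escalated = true;
      return { allowed: true };
    }
    // After a handoff, every account action is refused, whatever the model asks.
    if (this.escalated) return { allowed: false, reason: "session_escalated" };
    return { allowed: true };
  }
}
```

The latch has no reset method, so the rule holds every time rather than most of the time.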

Caps on everything, failing closed

Realtime audio is among the most expensive API surfaces to run, and this demo is public. Calls hard-stop at 3 minutes with a visible countdown, each IP gets a daily allowance, and a global daily budget puts the demo to sleep before the bill grows teeth. Every limit fails closed with honest copy instead of a broken page.
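A minimal sketch of an admission check that fails closed. The `CapStore` interface, key format, limit values, and reason codes are assumptions, and the store is synchronous here for brevity; a real counter store would likely be asynchronous.

```typescript
// Hypothetical counter store: increments a key and returns the new count.
type CapStore = { increment(key: string): number };

type Admission = { admitted: true } | { admitted: false; reason: string };

const LIMITS = { perIp: 5, global: 200 }; // illustrative numbers, not the demo's

function admitCall(store: CapStore, ip: string, day: string, limits = LIMITS): Admission {
  try {
    if (store.increment(`ip:${ip}:${day}`) > limits.perIp) {
      return { admitted: false, reason: "ip_daily_cap" };
    }
    if (store.increment(`global:${day}`) > limits.global) {
      return { admitted: false, reason: "global_daily_cap" };
    }
    return { admitted: true };
  } catch {
    // Fail closed: if the counters cannot be read, refuse rather than admit.
    return { admitted: false, reason: "cap_store_unavailable" };
  }
}

// In-memory store, enough for a single-process demo or a test.
function memoryStore(): CapStore {
  const counts = new Map<string, number>();
  return {
    increment(key) {
      const next = (counts.get(key) ?? 0) + 1;
      counts.set(key, next);
      return next;
    },
  };
}
```

The `reason` code is what lets the page show specific, honest copy for each kind of refusal.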

Transcripts never leave the browser

The ops dashboard runs on outcomes, durations, and tool latencies, not on what callers said. That split keeps a public metrics page safe to share and mirrors how a real deployment separates media storage from operational telemetry.
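To make the split concrete, here is a sketch of reducing a session's events to a telemetry record that carries no caller speech. The event and record shapes are hypothetical; the page says only that /ops shows outcomes, durations, tool latencies, and turns.

```typescript
// Hypothetical client-side event stream for one call.
type SessionEvent =
  | { kind: "transcript"; role: "caller" | "agent"; text: string }
  | { kind: "tool_call"; tool: string; latencyMs: number }
  | { kind: "ended"; outcome: "resolved" | "escalated" | "abandoned"; durationMs: number };

type OpsRecord = {
  outcome: string;
  durationMs: number;
  turns: number;
  toolLatenciesMs: Record<string, number[]>;
};

// Build the only payload that leaves the browser for the dashboard.
function toOpsRecord(events: SessionEvent[]): OpsRecord {
  const record: OpsRecord = { outcome: "unknown", durationMs: 0, turns: 0, toolLatenciesMs: {} };
  for (const e of events) {
    if (e.kind === "transcript") {
      record.turns += 1; // count the turn, drop the text
    } else if (e.kind === "tool_call") {
      const list = record.toolLatenciesMs[e.tool] ?? [];
      list.push(e.latencyMs);
      record.toolLatenciesMs[e.tool] = list;
    } else {
      record.outcome = e.outcome;
      record.durationMs = e.durationMs;
    }
  }
  return record;
}
```

Since `OpsRecord` has no field that can hold free text from the caller, a transcript cannot reach the metrics page by accident.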

One agent definition, three surfaces

Instructions and tools live in a single TypeScript module consumed by voice, chat, and the eval harness. When a live call exposed a flaw, the fix landed once and every channel inherited it, with a frozen regression case to keep it fixed. That failure-becomes-a-test loop is the working method this whole project demonstrates.