Building Voice-Enabled Agents with OpenAI WebRTC and Document Context

Step-by-step tutorial on using OpenAI's WebRTC Audio Session API with document context injection. Build a voice agent that references uploaded documents during realtime conversations — with streaming audio, context management, and interruption handling.

June 14, 2026
Tags: openai, webrtc, voice-agents, realtime-audio, document-context, gpt-realtime-2, agent-blueprint, browser-voice

What You'll Build

By the end of this tutorial, you'll have a working browser-based voice agent that:

  • Connects to OpenAI's GPT-Realtime-2 model over WebRTC
  • Lets you paste a document before the conversation starts
  • References that document naturally during the audio conversation
  • Tracks token usage and cost in real time
  • Handles mute/unmute, interruptions, and session teardown cleanly

I built this exact demo as a single HTML page with a small Node.js server for the ephemeral token endpoint. Here's what it looks like in action:

OpenAI WebRTC Audio Session
[●●] (green pulsing indicator when speaking)

Model: gpt-realtime-2  |  Voice: ash  |  Session: Connected

Last transcript:
"Based on the research paper you shared, the key finding is that
retrieval-augmented generation improves accuracy by 34% when the
knowledge base includes at least 5 relevant passages per query."

┌──────────────────────────────────────────────────────┐
│ Session Costs                                        │
│ Input:  $0.0012    Output: $0.0035    Total: $0.0047 │
└──────────────────────────────────────────────────────┘

Note:

You need an OpenAI API key with access to the gpt-realtime-2 model. The Realtime API is no longer in beta — it's GA as of May 2026, so your existing key should work if it has the right model access.

Architecture Overview

The OpenAI WebRTC realtime flow has three actors:

┌──────────────┐     POST /v1/realtime/client_secrets     ┌──────────────┐
│              │ ────────────────────────────────────────→ │              │
│   Your       │     ← {client_secret: {value: "epk_..."}} │   OpenAI     │
│   Server     │                                            │   API        │
│  (Node.js)   │                                            │              │
│              │     POST /v1/realtime/calls + SDP          │              │
└──────┬───────┘ ─────────────────────────────────────────→ └──────────────┘
       │              ← SDP answer                          ↑
       │                                                     │
       │  1. Fetch ephemeral token                           │
       │  2. RTCPeerConnection with SDP                      │
       │  3. Data channel for events                         │
       │  4. Audio tracks flow both ways                     │
       │                                                     │
┌──────▼───────────────────────────────────────────────────────────────┐
│                          Browser Client                                │
│                                                                        │
│  ┌──────────────────┐    ┌──────────────────────┐                     │
│  │  getUserMedia()   │───→│  RTCPeerConnection    │                     │
│  │  (mic audio)      │    │  with data channel    │                     │
│  └──────────────────┘    └──────────┬───────────┘                     │
│                                     │                                    │
│                          ┌──────────▼───────────┐                     │
│                          │  Data Channel Events  │                     │
│                          │  - session.update      │                     │
│                          │  - conversation.item.  │                     │
│                          │    create              │                     │
│                          │  - response.create     │                     │
│                          └────────────────────────┘                     │
└───────────────────────────────────────────────────────────────────────────┘

The key insight: WebRTC handles all the audio transport — jitter buffering, encoding, decoding, and streaming — so you don't need to manually chunk PCM data or manage audio buffer state. You only need to manage the data channel events for session configuration and context injection.

Step 1: The Server-Side Token Endpoint

The browser never sees your main OpenAI API key. Instead, your server mints an ephemeral token that the browser uses for the WebRTC handshake. This keeps your key safe and lets you scope tokens per session.

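Create a server.js. The import syntax requires an ES module, so either set "type": "module" in package.json or name the file server.mjs. The global fetch call needs Node.js 18 or later. Install the two dependencies with npm install express cors: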

import express from 'express';
import cors from 'cors';

const app = express();
app.use(cors());
app.use(express.json());

// The browser POSTs here to get an ephemeral token
app.post('/session', async (req, res) => {
  try {
    const response = await fetch(
      'https://api.openai.com/v1/realtime/client_secrets',
      {
        method: 'POST',
        headers: {
          Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({ model: 'gpt-realtime-2' }),
      }
    );

    if (!response.ok) {
      const error = await response.text();
      console.error('Token error:', error);
      return res.status(500).json({ error: 'Failed to create session' });
    }

    // Forward the ephemeral token payload to the browser as-is
    const data = await response.json();
    res.json(data);
  } catch (err) {
    // Network failures would otherwise leave the request hanging
    console.error('Token request failed:', err);
    res.status(500).json({ error: 'Failed to create session' });
  }
});

app.listen(3001, () => {
  console.log('Token server on http://localhost:3001');
});

Start it:

OPENAI_API_KEY=sk-... node server.js

The response looks like:

{
  "client_secret": {
    "value": "epk_3b1a2c4d5e6f7890abcdef1234567890",
    "expires_at": 1718212345
  }
}

Ephemeral tokens expire after 60 seconds by default. The WebRTC handshake must complete before that window closes.
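
Because the handshake has to finish inside the token's lifetime, it can help to check the remaining time before starting the SDP exchange. The following is a minimal sketch, assuming `expires_at` is a Unix timestamp in seconds as in the sample response above. The helper names `secondsUntilExpiry` and `isTokenFresh` and the 10-second safety margin are hypothetical, not part of OpenAI's API.

```javascript
// Seconds left before an ephemeral token expires.
// Assumes `expires_at` is a Unix timestamp in seconds (as in the sample response).
function secondsUntilExpiry(clientSecret, nowMs = Date.now()) {
  return clientSecret.expires_at - Math.floor(nowMs / 1000);
}

// True when there is still enough time to finish the handshake.
// The 10-second default margin is an arbitrary illustration value.
function isTokenFresh(clientSecret, marginSeconds = 10, nowMs = Date.now()) {
  return secondsUntilExpiry(clientSecret, nowMs) > marginSeconds;
}

const sampleSecret = { value: 'epk_example', expires_at: 1718212345 };
console.log(secondsUntilExpiry(sampleSecret, 1718212300 * 1000)); // 45
console.log(isTokenFresh(sampleSecret, 10, 1718212340 * 1000));   // false (5 s left)
```

If the check fails, the simplest recovery is to request a new token from the /session endpoint and start the handshake again.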

Failure point: CORS

If your browser client is on a different port, make sure your server has CORS enabled. I forgot this on my first attempt and got a baffling CORS error on the POST. The snippet above includes cors() middleware — keep it.
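
Note that cors() with no arguments allows requests from every origin. For anything beyond local testing you may want to narrow that down. The sketch below relies on the `origin` option of the `cors` package, which accepts a function of the form `(origin, callback)`. The helper `makeOriginCheck` and the allowed-origin list are hypothetical.

```javascript
// Builds an origin callback in the shape the `cors` package's `origin` option accepts.
// `allowedOrigins` is a hypothetical list; adjust it to wherever your page is served.
function makeOriginCheck(allowedOrigins) {
  return (origin, callback) => {
    // Requests without an Origin header (curl, same-origin) are let through here.
    if (!origin || allowedOrigins.includes(origin)) {
      callback(null, true);
    } else {
      callback(new Error(`Origin not allowed: ${origin}`));
    }
  };
}

const originCheck = makeOriginCheck(['http://localhost:3000']);

// In server.js this would replace the bare cors() call:
//   app.use(cors({ origin: originCheck }));
originCheck('http://localhost:3000', (err, ok) => console.log(err, ok)); // null true
```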

Step 2: The Browser Client — WebRTC Handshake

This is the core of the application. The browser:

  1. Gets microphone access via getUserMedia()
  2. Fetches an ephemeral token from your server
  3. Creates an RTCPeerConnection with the mic track
  4. Creates a data channel for event signaling
  5. Performs the offer/answer exchange with OpenAI
  6. Starts streaming audio both ways

Here's the JavaScript (ES module, works in any modern browser):

async function createRealtimeSession(audioStream, ephemeralToken, voice, model, documentText) {
  const pc = new RTCPeerConnection();

  // When OpenAI sends audio back, play it
  pc.ontrack = (event) => {
    const audio = new Audio();
    audio.srcObject = event.streams[0];
    audio.play();
  };

  // Add the user's microphone track
  pc.addTrack(audioStream.getTracks()[0]);

  // Data channel for client/server events
  const dc = pc.createDataChannel('oai-events');

  // === DOCUMENT CONTEXT INJECTION ===
  if (documentText) {
    dc.addEventListener('open', () => {
      dc.send(JSON.stringify({
        type: 'conversation.item.create',
        item: {
          type: 'message',
          role: 'user',
          content: [{
            type: 'input_text',
            text: `The user has provided the following document. They want to have a conversation about it. Refer to it when answering their questions.\n\n<document>\n${documentText}\n</document>`,
          }],
        },
      }));
    });
  }

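  // Listen for server events (token usage, transcripts).
  // handleResponseDone is a placeholder for your own handler; the full
  // client in Step 4 inlines this logic (transcript and usage updates).
  dc.addEventListener('message', (event) => {
    const data = JSON.parse(event.data);
    if (data.type === 'response.done' && data.response) {
      handleResponseDone(data);
    }
  });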

  // Create the SDP offer
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  // Send the offer + session config to OpenAI
  const fd = new FormData();
  fd.set('sdp', offer.sdp);
  fd.set('session', JSON.stringify({
    type: 'realtime',
    model,
    audio: { output: { voice } },
  }));

  const resp = await fetch('https://api.openai.com/v1/realtime/calls', {
    method: 'POST',
    headers: { Authorization: `Bearer ${ephemeralToken}` },
    body: fd,
  });

  if (!resp.ok) {
    throw new Error(`Handshake failed: ${resp.status} ${await resp.text()}`);
  }

  // Apply the remote SDP answer
  await pc.setRemoteDescription({
    type: 'answer',
    sdp: await resp.text(),
  });

  return pc;
}

The magic happens in the FormData POST. OpenAI accepts a multipart request with:

  • sdp — the browser's SDP offer (standard WebRTC)
  • session — a JSON blob configuring model, voice, and other options

The response is a raw SDP answer, which you feed directly to setRemoteDescription.

What each part does

| Component | Role |
| --- | --- |
| RTCPeerConnection | Manages the WebRTC session: ICE candidates, STUN/TURN, media tracks |
| ontrack | Fires when the remote peer (OpenAI) sends audio. We pipe it to an <audio> element |
| addTrack | Sends our mic audio to OpenAI |
| DataChannel | A side channel for JSON events: session config, conversation items, token usage |
| createOffer | Generates the initial SDP offer with our capabilities |
| setLocalDescription | Locks in our offer |
| FormData POST | OpenAI's WebRTC endpoint expects multipart form data, not JSON |
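
Since everything besides audio travels as JSON over the data channel, a small dispatcher can keep the `message` handler readable as more event types are added. This is only a sketch: `createEventRouter` is a hypothetical helper, and the event object in the usage lines is a hand-written stand-in, not captured output.

```javascript
// Routes parsed data-channel events to handlers keyed by their `type` field.
function createEventRouter() {
  const handlers = new Map();
  return {
    on(type, handler) {
      handlers.set(type, handler);
      return this;
    },
    // Pass the raw string from a data channel MessageEvent (event.data).
    dispatch(rawMessage) {
      let event;
      try {
        event = JSON.parse(rawMessage);
      } catch {
        return false; // ignore anything that is not JSON
      }
      const handler = event && handlers.get(event.type);
      if (!handler) return false;
      handler(event);
      return true;
    },
  };
}

const router = createEventRouter();
router.on('response.done', (ev) => console.log('usage:', ev.response.usage));

// In the browser: dc.addEventListener('message', (e) => router.dispatch(e.data));
router.dispatch(JSON.stringify({
  type: 'response.done',
  response: { usage: { input_tokens: 1 } },
}));
```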

Step 3: Injecting Document Context

The document context injection is the newest feature (Simon Willison added it June 12, 2026). It works by sending a synthetic user message on the data channel before the conversation starts.

The key: send a conversation.item.create event when the data channel opens, with the document text wrapped in a <document> tag. The model treats this as part of the conversation history and references it naturally.

dc.addEventListener('open', () => {
  // Inject the document as the first conversation item
  dc.send(JSON.stringify({
    type: 'conversation.item.create',
    item: {
      type: 'message',
      role: 'user',
      content: [{
        type: 'input_text',
        text: `The user has provided the following document. ...\n\n<document>\n${documentText}\n</document>`,
      }],
    },
  }));
});

Note what's not happening here: there's no file upload API, no vector store, no RAG pipeline. The document text is injected directly into the model's context window as a conversation item. This works because:

  1. GPT-Realtime-2 has a 128K token context window
  2. The model treats past conversation.item.create events as conversation history
  3. The instruction "Refer to it when answering their questions" primes the model to use the document

What about large documents?

Even with a 128K-token window, audio tokens are relatively expensive, so it's worth keeping the text context under about 32K tokens to keep costs reasonable. That's roughly 24,000 words, which is plenty for research papers, legal documents, or technical documentation.

For longer documents, pre-summarize or chunk the content before sending it. A good heuristic: if it's more than 50 pages of text, summarize the key sections and paste the summaries instead.
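
A rough pre-flight check can tell the user whether a pasted document fits that budget before the session starts. The sketch below uses the same ratio as the figures above (32K tokens for roughly 24,000 words, so about 0.75 words per token). That ratio is only an approximation, real tokenizer counts vary with the text, and `estimateTokens` and `fitsBudget` are hypothetical helper names.

```javascript
const WORDS_PER_TOKEN = 0.75; // approximation: 24,000 words is about 32,000 tokens
const TOKEN_BUDGET = 32000;   // the soft limit suggested in this tutorial

// Rough token estimate from a whitespace word count.
function estimateTokens(text) {
  const trimmed = text.trim();
  if (trimmed === '') return 0;
  const words = trimmed.split(/\s+/).length;
  return Math.ceil(words / WORDS_PER_TOKEN);
}

// True when the estimated token count stays within the budget.
function fitsBudget(text, budget = TOKEN_BUDGET) {
  return estimateTokens(text) <= budget;
}

const pastedDoc = 'word '.repeat(3000);
console.log(estimateTokens(pastedDoc)); // 4000
console.log(fitsBudget(pastedDoc));     // true
```

In the client, this could run on the textarea value before the Start Session click proceeds, warning the user instead of silently sending an oversized document.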

Step 4: The Complete HTML Client

Let me put it all together in a single HTML file. This is the full working client — you can serve it with any static file server and point it at your token server from Step 1.

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Voice Agent + Document Context</title>
  <style>
    * { box-sizing: border-box; }
    body {
      font-family: system-ui, -apple-system, sans-serif;
      max-width: 800px; margin: 0 auto; padding: 20px;
      background: #0f172a; color: #e2e8f0;
    }
    .controls { margin: 20px 0; }
    .form-group { margin-bottom: 15px; }
    label { display: block; margin-bottom: 5px; font-weight: 600; }
    input, select, textarea {
      width: 100%; padding: 10px; font-size: 16px;
      border: 1px solid #334155; border-radius: 8px;
      background: #1e293b; color: #e2e8f0;
    }
    textarea { min-height: 150px; resize: vertical; }
    details {
      background: #1e293b; border: 1px solid #334155;
      border-radius: 8px; padding: 12px; margin-bottom: 15px;
    }
    details summary { cursor: pointer; font-weight: 600; }
    button {
      background: #3b82f6; color: white; border: none;
      padding: 12px 24px; font-size: 16px; border-radius: 8px;
      cursor: pointer;
    }
    button:disabled { background: #475569; cursor: not-allowed; }
    button.danger { background: #ef4444; }
    .status {
      margin-top: 10px; padding: 12px; border-radius: 8px;
    }
    .status.error { background: #7f1d1d; color: #fca5a5; }
    .status.success { background: #14532d; color: #86efac; }
    .transcript {
      background: #1e293b; border-radius: 8px; padding: 16px;
      margin-bottom: 20px; border: 1px solid #334155;
    }
    .transcript-value {
      font-size: 1.1rem; line-height: 1.6; white-space: pre-wrap;
    }
    .placeholder { color: #64748b; font-style: italic; }
    .stats-grid {
      display: grid; grid-template-columns: repeat(3, 1fr);
      gap: 16px; margin-top: 20px;
    }
    .stat-card {
      background: #1e293b; padding: 16px; border-radius: 8px;
      border: 1px solid #334155;
    }
    .stat-card h3 { margin: 0 0 8px 0; font-size: 0.9rem; color: #94a3b8; }
    .stat-value { font-size: 1.2rem; font-weight: 700; color: #3b82f6; }
    @media (max-width: 600px) {
      .stats-grid { grid-template-columns: 1fr; }
    }
    #audioIndicator {
      display: inline-block; width: 16px; height: 16px;
      border-radius: 50%; background: #475569;
      margin-right: 8px; vertical-align: middle;
      transition: background 0.15s;
    }
    #audioIndicator.active { background: #22c55e; }
  </style>
</head>
<body>
  <h1>
    <span id="audioIndicator"></span>
    Voice Agent + Document Context
  </h1>

  <div class="controls">
    <div class="form-group">
      <label for="tokenServer">Token Server URL</label>
      <input type="url" id="tokenServer"
             value="http://localhost:3001/session"
             placeholder="http://localhost:3001/session">
    </div>
    <div class="form-group">
      <label for="voiceSelect">Voice</label>
      <select id="voiceSelect">
        <option value="ash">Ash</option>
        <option value="ballad">Ballad</option>
        <option value="coral">Coral</option>
        <option value="sage">Sage</option>
        <option value="verse">Verse</option>
      </select>
    </div>
    <div class="form-group">
      <label for="modelSelect">Model</label>
      <select id="modelSelect">
        <option value="gpt-realtime-2">gpt-realtime-2</option>
        <option value="gpt-realtime-1.5">gpt-realtime-1.5</option>
        <option value="gpt-realtime-mini">gpt-realtime-mini</option>
      </select>
    </div>
    <details>
      <summary>Document Context <span style="color:#94a3b8;font-weight:normal;font-size:0.9em">(optional)</span></summary>
      <div class="form-group">
        <label for="documentInput">
          Paste a document before starting — the agent will reference it during conversation
        </label>
        <textarea id="documentInput"
          placeholder="Paste your document text here..."></textarea>
      </div>
    </details>
    <div style="display:flex;gap:10px;flex-wrap:wrap;">
      <button id="startBtn">Start Session</button>
      <button id="muteBtn" disabled>Mute Mic</button>
    </div>
  </div>

  <div id="status" class="status"></div>

  <div class="transcript">
    <h2 style="margin:0 0 10px 0;font-size:1rem;">Last Transcript</h2>
    <div id="lastTranscript" class="transcript-value placeholder">
      Waiting for the first response...
    </div>
  </div>

  <div class="stats-grid">
    <div class="stat-card">
      <h3>Input Tokens</h3>
      <div class="stat-value" id="inputTokens">0</div>
    </div>
    <div class="stat-card">
      <h3>Output Tokens</h3>
      <div class="stat-value" id="outputTokens">0</div>
    </div>
    <div class="stat-card">
      <h3>Session Cost</h3>
      <div class="stat-value" id="sessionCost">$0.0000</div>
    </div>
  </div>

  <script type="module">
    const $ = id => document.getElementById(id);
    const startBtn = $('startBtn');
    const muteBtn = $('muteBtn');
    const statusEl = $('status');
    const indicator = $('audioIndicator');
    const transcriptEl = $('lastTranscript');

    let pc = null;
    let audioCtx = null;
    let stream = null;
    let muted = false;
    let runningTotal = { input: 0, output: 0, cost: 0 };

    function setStatus(msg, type = '') {
      statusEl.textContent = msg;
      statusEl.className = 'status' + (type ? ' ' + type : '');
    }

    // ── Audio visualization ──
    function setupAudioVis(s) {
      audioCtx = new AudioContext();
      const src = audioCtx.createMediaStreamSource(s);
      const analyzer = audioCtx.createAnalyser();
      analyzer.fftSize = 256;
      src.connect(analyzer);
      const buf = new Uint8Array(analyzer.frequencyBinCount);
      function tick() {
        if (!audioCtx) return;
        analyzer.getByteFrequencyData(buf);
        const avg = buf.reduce((a, b) => a + b) / buf.length;
        indicator.classList.toggle('active', avg > 25);
        requestAnimationFrame(tick);
      }
      tick();
    }

    // ── Cost calculation ──
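    function calculateCost(stats, model) {
      // gpt-realtime-2 pricing per token. The same rates are applied to every
      // model in the dropdown, so costs for other models are only approximate.
      const p = { audioIn: 0.000032, textIn: 0.000004,
                  cachedIn: 0.0000004, audioOut: 0.000064, textOut: 0.000016 };
      // Cached input is billed at the cached rate (0 if the caller omits it)
      const input = stats.audioInput * p.audioIn + stats.textInput * p.textIn
                  + (stats.cachedInput || 0) * p.cachedIn;
      const output = stats.audioOutput * p.audioOut + stats.textOutput * p.textOut;
      return { input, output, total: input + output };
    }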

    // ── Core session creation ──
    async function createSession(token, voice, model, docText) {
      pc = new RTCPeerConnection();

      pc.ontrack = e => {
        const a = new Audio();
        a.srcObject = e.streams[0];
        a.play();
      };

      pc.addTrack(stream.getTracks()[0]);
      const dc = pc.createDataChannel('oai-events');

      // Document context injection
      if (docText) {
        dc.addEventListener('open', () => {
          dc.send(JSON.stringify({
            type: 'conversation.item.create',
            item: {
              type: 'message',
              role: 'user',
              content: [{
                type: 'input_text',
                text: `The user has provided this document. Refer to it in your answers.\n\n<document>\n${docText}\n</document>`,
              }],
            },
          }));
        });
      }

      // Handle events from the server
      dc.addEventListener('message', e => {
        const ev = JSON.parse(e.data);
        if (ev.type === 'response.done' && ev.response) {
          // Update transcript: scan from the newest output item backwards
          // and stop at the first transcript found
          const output = ev.response.output || [];
          search:
          for (let i = output.length - 1; i >= 0; i--) {
            const content = output[i].content || [];
            for (let j = content.length - 1; j >= 0; j--) {
              if (content[j].transcript) {
                transcriptEl.textContent = content[j].transcript.trim();
                transcriptEl.classList.remove('placeholder');
                break search; // leave both loops so older items don't overwrite it
              }
            }
          }

          // Update token usage
          if (ev.response.usage) {
            const u = ev.response.usage;
            const det = u.input_token_details || {};
            const odet = u.output_token_details || {};
            const stats = {
              // Uncached input tokens per modality
              audioInput: (det.audio_tokens || 0) - ((det.cached_tokens_details?.audio_tokens) || 0),
              textInput: (det.text_tokens || 0) - ((det.cached_tokens_details?.text_tokens) || 0),
              // Cached input tokens, tracked separately because they are priced lower
              cachedInput: det.cached_tokens || 0,
              audioOutput: odet.audio_tokens || 0,
              textOutput: odet.text_tokens || 0,
            };
            const cost = calculateCost(stats, model);

            runningTotal.input += stats.audioInput + stats.textInput + stats.cachedInput;
            runningTotal.output += stats.audioOutput + stats.textOutput;
            runningTotal.cost += cost.total;

            $('inputTokens').textContent = runningTotal.input.toLocaleString();
            $('outputTokens').textContent = runningTotal.output.toLocaleString();
            $('sessionCost').textContent = `$${runningTotal.cost.toFixed(4)}`;
          }
        }
      });

      // SDP offer/answer exchange
      const offer = await pc.createOffer();
      await pc.setLocalDescription(offer);

      const fd = new FormData();
      fd.set('sdp', offer.sdp);
      fd.set('session', JSON.stringify({
        type: 'realtime', model,
        audio: { output: { voice } },
      }));

      const resp = await fetch('https://api.openai.com/v1/realtime/calls', {
        method: 'POST',
        headers: { Authorization: `Bearer ${token}` },
        body: fd,
      });

      if (!resp.ok) {
        throw new Error(`Handshake failed: ${resp.status}`);
      }

      await pc.setRemoteDescription({
        type: 'answer',
        sdp: await resp.text(),
      });

      return pc;
    }

    // ── Session lifecycle ──
    async function startSession() {
      try {
        setStatus('Requesting microphone...');
        stream = await navigator.mediaDevices.getUserMedia({ audio: true });
        setupAudioVis(stream);

        setStatus('Fetching ephemeral token...');
        // The token endpoint from Step 1 is registered with app.post, so use POST
        const tokRes = await fetch($('tokenServer').value, { method: 'POST' });
        if (!tokRes.ok) throw new Error('Token server returned ' + tokRes.status);
        const { client_secret } = await tokRes.json();

        setStatus('Establishing WebRTC session...');
        await createSession(
          client_secret.value,
          $('voiceSelect').value,
          $('modelSelect').value,
          $('documentInput').value.trim()
        );

        setStatus('Session connected!', 'success');
        startBtn.textContent = 'Stop Session';
        startBtn.classList.add('danger');
        muteBtn.disabled = false;
      } catch (err) {
        setStatus('Error: ' + err.message, 'error');
        stopSession();
      }
    }

    function stopSession() {
      if (pc) { pc.close(); pc = null; }
      if (audioCtx) { audioCtx.close(); audioCtx = null; }
      if (stream) { stream.getTracks().forEach(t => t.stop()); stream = null; }
      indicator.classList.remove('active');
      startBtn.textContent = 'Start Session';
      startBtn.classList.remove('danger');
      muteBtn.disabled = true;
      muteBtn.textContent = 'Mute Mic';
      muted = false;
    }

    function toggleMute() {
      if (!stream) return;
      muted = !muted;
      stream.getAudioTracks().forEach(t => t.enabled = !muted);
      muteBtn.textContent = muted ? 'Unmute Mic' : 'Mute Mic';
    }

    // ── Event wiring ──
    startBtn.addEventListener('click', () => {
      pc ? stopSession() : startSession();
    });
    muteBtn.addEventListener('click', toggleMute);
    window.addEventListener('beforeunload', stopSession);
  </script>
</body>
</html>

This is about 350 lines of HTML/CSS/JS, everything in one file. Save it as index.html and serve it:

npx serve .

Then navigate to http://localhost:3000 (or whatever port serve gives you).

Step 5: Running the Full Stack

Here's the exact sequence to get it working:

Terminal 1 — Token server:

OPENAI_API_KEY=sk-proj-xxx node server.js
# → Token server on http://localhost:3001

Terminal 2 — Static file server:

npx serve
# → Serving! Local: http://localhost:3000

Browser:

  1. Open http://localhost:3000
  2. The "Token Server URL" defaults to http://localhost:3001/session — this should work as-is
  3. Select a voice and model
  4. Paste some document text into the "Document Context" section
  5. Click Start Session
  6. Grant microphone access
  7. Wait for "Session connected!" — then start talking

Expected output

When you say something like "What are the key findings from that document?", the model responds in audio, and you'll see the transcript appear in the box:

Last Transcript:
Based on the document you shared, there are three key findings.
First, the experiment showed a 23% improvement in recall when
using the hybrid retrieval approach. Second, the latency impact
was minimal — only 120ms added on average. Third, the approach
works best when documents are chunked at 512 tokens with 128
token overlap.

The stats panel updates after each model response:

Input Tokens: 1,245    Output Tokens: 892    Session Cost: $0.0047

Failure points to watch for

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Handshake failed: 401 | Ephemeral token expired | Your token server and browser must complete the handshake within ~60s. Make sure your server is fast |
| Handshake failed: 400 | Bad session config | Check your JSON in the session FormData field. Common mistake: forgetting the type: 'realtime' field |
| Microphone not working | Browser permissions | Check the site has mic access. On Chrome, click the lock icon in the URL bar |
| No audio output | Autoplay block | The browser may block <audio>.play(). Click somewhere on the page first to register a user gesture |
| Document not referenced | Context injection failed | Check the browser console. The conversation.item.create event must fire after the data channel opens but before the user starts speaking |
| CORS error on POST | Missing CORS headers | Add cors() middleware to your Express server |
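
The two handshake rows can also be surfaced directly in the status bar. The sketch below is one way to do that, with `explainHandshakeFailure` as a hypothetical helper. The mapping covers only the causes listed here and is not a complete account of what these status codes can mean.

```javascript
// Maps an HTTP status from the /v1/realtime/calls handshake to a hint,
// using only the likely causes listed in this tutorial.
function explainHandshakeFailure(status) {
  switch (status) {
    case 401:
      return 'Ephemeral token likely expired: fetch a fresh token and retry.';
    case 400:
      return "Session config likely invalid: check the JSON in the 'session' field.";
    default:
      return `Unexpected handshake status ${status}: check the browser console.`;
  }
}

console.log(explainHandshakeFailure(401));
```

In the Step 4 client, the hint could be appended to the error thrown when `resp.ok` is false, so it ends up in the status element.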

How Interruptions Work

One of the nicest things about the WebRTC approach is that interruptions are handled by default. When the model is speaking and you start talking, the server-side Voice Activity Detection (VAD) detects your speech and the model stops responding. You don't need to implement any special interruption logic.
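
Although no interruption logic is required, you may still want the UI to reflect what is happening. With server-side VAD enabled, the Realtime API emits `input_audio_buffer.speech_started` and `input_audio_buffer.speech_stopped` events on the data channel. Below is a minimal sketch of tracking them. `createSpeechTracker` and its `onChange` callback are hypothetical names.

```javascript
// Tracks whether the server's VAD currently hears the user speaking.
function createSpeechTracker(onChange) {
  let userSpeaking = false;
  return {
    // Feed every parsed data-channel event through this.
    handle(event) {
      if (event.type === 'input_audio_buffer.speech_started') userSpeaking = true;
      else if (event.type === 'input_audio_buffer.speech_stopped') userSpeaking = false;
      else return; // not a VAD event
      onChange(userSpeaking);
    },
    isSpeaking: () => userSpeaking,
  };
}

const tracker = createSpeechTracker((speaking) => {
  console.log(speaking ? 'User is speaking' : 'User stopped speaking');
});
tracker.handle({ type: 'input_audio_buffer.speech_started' });
tracker.handle({ type: 'input_audio_buffer.speech_stopped' });
```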

The VAD configuration is set on the session object. If you want push-to-talk mode instead of always-on listening, you can disable VAD:

fd.set('session', JSON.stringify({
  type: 'realtime',
  model,
  audio: {
    input: {
      turn_detection: null,  // Disable automatic VAD
    },
    output: {
      voice,
    },
  },
}));

With VAD disabled, you control turn-taking manually over the data channel: send input_audio_buffer.commit to close the user's turn and response.create to request a reply, and use input_audio_buffer.clear to discard audio you don't want processed.
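
A push-to-talk flow under these settings might look like the sketch below. It assumes `turn_detection` is disabled as shown above and uses the Realtime API's `input_audio_buffer.clear`, `input_audio_buffer.commit`, and `response.create` client events. `createPushToTalk` is a hypothetical helper, and `sendEvent` stands in for something like `(ev) => dc.send(JSON.stringify(ev))`.

```javascript
// Push-to-talk controller. `sendEvent` should serialize and send an event on
// the data channel; `setMicEnabled` should toggle the microphone track.
function createPushToTalk(sendEvent, setMicEnabled) {
  let talking = false;
  return {
    press() {
      if (talking) return;
      talking = true;
      sendEvent({ type: 'input_audio_buffer.clear' }); // drop stale audio
      setMicEnabled(true);
    },
    release() {
      if (!talking) return;
      talking = false;
      setMicEnabled(false);
      sendEvent({ type: 'input_audio_buffer.commit' }); // close the user turn
      sendEvent({ type: 'response.create' });           // ask the model to answer
    },
  };
}

// Stand-ins so the sketch runs outside the browser.
const sentTypes = [];
const ptt = createPushToTalk((ev) => sentTypes.push(ev.type), () => {});
ptt.press();
ptt.release();
console.log(sentTypes.join(', '));
// input_audio_buffer.clear, input_audio_buffer.commit, response.create
```

In the browser, `setMicEnabled` could reuse the mute logic from Step 4 by setting `enabled` on each track returned by `stream.getAudioTracks()`.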

Token Tracking and Cost Optimization

The response.done event includes detailed token usage:

{
  "type": "response.done",
  "response": {
    "usage": {
      "input_tokens": 1500,
      "output_tokens": 340,
      "input_token_details": {
        "audio_tokens": 1200,
        "text_tokens": 300,
        "cached_tokens": 50,
        "cached_tokens_details": {
          "audio_tokens": 30,
          "text_tokens": 20
        }
      },
      "output_token_details": {
        "audio_tokens": 290,
        "text_tokens": 50
      }
    }
  }
}

GPT-Realtime-2 pricing (as of June 2026):

| Token type | Price per 1K tokens |
| --- | --- |
| Audio input | $0.032 |
| Text input | $0.004 |
| Cached audio input | $0.0004 |
| Cached text input | $0.0004 |
| Audio output | $0.064 |
| Text output | $0.016 |

A 5-minute conversation averages about $0.03–$0.08 depending on how much the model speaks. Adding document context adds a one-time text input cost (the document tokens) — for a 5,000-word document that's about 7,000 tokens of text input at $0.004/1K = $0.028.
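
To make the arithmetic concrete, the sketch below prices the sample `response.done` usage payload from this section using the per-1K rates listed above. It assumes cached tokens are billed at the cached rate and are subtracted from the regular audio and text counts. `estimateCost` and `PRICE_PER_1K` are hypothetical names.

```javascript
// Per-1K-token prices copied from the pricing list above (USD).
const PRICE_PER_1K = {
  audioIn: 0.032, textIn: 0.004, cachedIn: 0.0004,
  audioOut: 0.064, textOut: 0.016,
};

// Prices one `usage` object from a response.done event.
function estimateCost(usage) {
  const inDet = usage.input_token_details || {};
  const outDet = usage.output_token_details || {};
  const cachedDet = inDet.cached_tokens_details || {};
  const cached = inDet.cached_tokens || 0;
  // Uncached counts: subtract the cached share of each modality.
  const audioIn = (inDet.audio_tokens || 0) - (cachedDet.audio_tokens || 0);
  const textIn = (inDet.text_tokens || 0) - (cachedDet.text_tokens || 0);
  const input = (audioIn * PRICE_PER_1K.audioIn + textIn * PRICE_PER_1K.textIn
    + cached * PRICE_PER_1K.cachedIn) / 1000;
  const output = ((outDet.audio_tokens || 0) * PRICE_PER_1K.audioOut
    + (outDet.text_tokens || 0) * PRICE_PER_1K.textOut) / 1000;
  return { input, output, total: input + output };
}

// The sample usage payload shown earlier in this section.
const sampleUsage = {
  input_token_details: {
    audio_tokens: 1200, text_tokens: 300, cached_tokens: 50,
    cached_tokens_details: { audio_tokens: 30, text_tokens: 20 },
  },
  output_token_details: { audio_tokens: 290, text_tokens: 50 },
};

console.log(estimateCost(sampleUsage).total.toFixed(5)); // 0.05794
```

For that payload the input side comes to about $0.0386 (1,170 uncached audio tokens, 280 uncached text tokens, 50 cached tokens) and the output side to about $0.0194.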

Cost-saving tips

  1. Use gpt-realtime-mini for simpler conversations — it's about 3x cheaper
  2. Keep documents under 10K words to minimize one-time text input cost
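  3. Enable caching by reusing the same session: at the prices above, cached text input is 10x cheaper than uncached text, and cached audio input is 80x cheaper than uncached audio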
  4. Use short instructions in the document prompt — every word adds to text input cost

What's Next

This blueprint gives you a self-contained voice agent with document context. From here you can extend it in several directions:

  • Add tool calling: Register functions on the session config and handle response.function_call_arguments.done events on the data channel to give the agent API access
  • Sideband server control: Open a second WebSocket connection from your server to the same realtime session to push session updates or respond to tool calls server-side
  • Conversation persistence: Use OpenAI's Conversation API to save and resume conversations across sessions
  • Multiple documents: Extend the document injection to accept multiple files, labeling each one so the model can distinguish sources
  • Custom VAD: Replace OpenAI's server-side VAD with your own voice activity detection for finer control over turn-taking
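
As one illustration of the multiple-documents idea, each document can be sent as its own `conversation.item.create` event with a label. This is a sketch, not a tested recipe: the `name` attribute on the `<document>` tag is an assumed convention that the API does not interpret, and `buildDocumentEvents` is a hypothetical helper.

```javascript
// Builds one conversation.item.create event per non-empty document, labeled by name.
function buildDocumentEvents(documents) {
  return documents
    .filter((d) => d.text.trim() !== '')
    .map((d) => ({
      type: 'conversation.item.create',
      item: {
        type: 'message',
        role: 'user',
        content: [{
          type: 'input_text',
          text: 'The user has provided this document. Refer to it by name in your answers.\n\n' +
            `<document name="${d.name}">\n${d.text}\n</document>`,
        }],
      },
    }));
}

const docEvents = buildDocumentEvents([
  { name: 'paper.txt', text: 'Retrieval improves accuracy.' },
  { name: 'notes.txt', text: 'Follow up on latency numbers.' },
]);

// In the browser, inside the data channel 'open' handler:
//   docEvents.forEach((ev) => dc.send(JSON.stringify(ev)));
console.log(docEvents.length); // 2
```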

The full working code from this tutorial is available as a single HTML file. Drop in your token server, paste a document, and you have a voice agent that actually reads and references your content. No RAG. No vector database. Just the model's 128K context window and a data channel.

Note:

For production deployments, implement proper authentication on your token server endpoint (your server is minting paid OpenAI tokens, after all). Add rate limiting and user session scoping before exposing it to the internet.
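
As a starting point for the rate limiting mentioned above, here is a minimal in-memory fixed-window limiter. It is a sketch under simplifying assumptions (single process, no persistence, clients keyed by a string such as an IP address or user ID). `createRateLimiter` is a hypothetical helper, and a production setup would more likely use an established middleware or a shared store.

```javascript
// Fixed-window rate limiter: at most `limit` calls per `windowMs` per key.
function createRateLimiter(limit, windowMs) {
  const windows = new Map(); // key -> { start, count }
  return function allow(key, nowMs = Date.now()) {
    const w = windows.get(key);
    if (!w || nowMs - w.start >= windowMs) {
      // First call for this key, or the previous window has ended
      windows.set(key, { start: nowMs, count: 1 });
      return true;
    }
    if (w.count < limit) {
      w.count += 1;
      return true;
    }
    return false;
  };
}

const allowTokenRequest = createRateLimiter(3, 60000); // 3 tokens per minute per client

// In server.js, before minting a token:
//   if (!allowTokenRequest(req.ip)) return res.status(429).json({ error: 'Too many requests' });
console.log([1, 2, 3, 4].map(() => allowTokenRequest('client-a', 0)));
// [ true, true, true, false ]
```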