Gemini Audio & Speech Prompts: Transcription & Analysis

Master audio prompting with Gemini. Learn transcription techniques, speaker diarization, sentiment analysis, meeting summarization, and working with low-quality recordings.

June 14, 2026
Tags: Gemini, Audio, Speech, Transcription, Sentiment, Multimodal

Gemini processes audio natively: speech, music, ambient sounds, and mixed audio sources. It can transcribe speech, distinguish speakers, detect emotion and tone, and describe acoustic properties. Audio prompting still calls for different techniques than text or image prompting, because Gemini hears the audio as one continuous stream rather than as pre-segmented data. The prompt therefore has to state how the output should be segmented: speaker turns, timestamps, and topic breaks.

Core Audio Prompt Pattern

Audio: "customer-call.wav" — 8-minute recorded customer support call
between a customer and a support agent.

1. Full transcription with speaker labels
2. Sentiment analysis: track customer sentiment across the call timeline
3. Identify the moment sentiment shifted (if any)
4. Extract the resolution: was the issue solved? How?
5. Agent performance assessment: clarity, empathy, efficiency

Format the transcription as:
[Agent]: ...
[Customer]: ...
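
The pattern above could be sent to the model roughly as follows. This is a minimal sketch, assuming the google-genai Python SDK is installed and an API key is set in the environment. The helper names `build_call_prompt` and `analyze_call` and the model name are assumptions for illustration and do not come from the original pattern; check the current SDK documentation before relying on the call signatures.

```python
def build_call_prompt(agent_label: str = "Agent", customer_label: str = "Customer") -> str:
    """Assemble the five-part call analysis prompt with explicit speaker labels."""
    tasks = [
        "Full transcription with speaker labels",
        "Sentiment analysis: track customer sentiment across the call timeline",
        "Identify the moment sentiment shifted (if any)",
        "Extract the resolution: was the issue solved? How?",
        "Agent performance assessment: clarity, empathy, efficiency",
    ]
    numbered = "\n".join(f"{i}. {task}" for i, task in enumerate(tasks, start=1))
    return (
        'Audio: "customer-call.wav", an 8-minute recorded customer support call '
        "between a customer and a support agent.\n\n"
        f"{numbered}\n\n"
        "Format the transcription as:\n"
        f"[{agent_label}]: ...\n"
        f"[{customer_label}]: ..."
    )


def analyze_call(audio_path: str, model: str = "gemini-2.5-flash") -> str:
    """Upload the audio file and send it with the prompt.

    Assumption: the model name is a placeholder; use whichever
    audio-capable Gemini model you have access to.
    """
    # Imported lazily so the prompt builder works without the SDK installed.
    from google import genai

    client = genai.Client()  # reads the API key from the environment
    audio_file = client.files.upload(file=audio_path)
    response = client.models.generate_content(
        model=model,
        contents=[build_call_prompt(), audio_file],
    )
    return response.text or ""
```

Keeping the prompt in a builder function makes the speaker labels a parameter, so the same template can be reused for calls with different roles.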

Note:

Always specify your desired speaker labeling convention explicitly. "Use speaker labels [Agent] and [Customer]" produces cleaner results than "label the speakers" — which might give you "Speaker 1," "Speaker A," or names inferred from content.
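
One way to check that the model actually followed the labeling convention is to parse its output and flag any label you did not ask for. The sketch below is illustrative only: the function name `parse_transcript` is hypothetical, and it assumes each turn starts on its own line in the `[Label]: text` form requested above.

```python
import re

# Matches lines such as "[Agent]: Thanks for calling."
LINE_RE = re.compile(r"^\[(?P<speaker>[^\]]+)\]:\s*(?P<text>.*)$")


def parse_transcript(transcript, allowed=("Agent", "Customer")):
    """Split a labeled transcript into (speaker, text) turns.

    Returns the turns plus the set of labels that were not requested,
    e.g. "Speaker 1" when the model ignored the convention.
    """
    turns = []
    unexpected = set()
    for raw_line in transcript.splitlines():
        line = raw_line.strip()
        match = LINE_RE.match(line)
        if not match:
            # Unlabeled, non-empty line: treat it as a continuation of the previous turn.
            if turns and line:
                speaker, text = turns[-1]
                turns[-1] = (speaker, f"{text} {line}")
            continue
        speaker = match.group("speaker")
        if speaker not in allowed:
            unexpected.add(speaker)
        turns.append((speaker, match.group("text")))
    return turns, unexpected
```

If the returned set of unexpected labels is not empty, that is usually a sign the prompt needs a more explicit labeling instruction.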

Transcription Patterns

Verbatim Transcription

Audio: "interview-recording.mp3" — 25-minute research interview.

Produce a verbatim transcript including:
- All filler words (um, uh, like, you know)
- False starts and self-corrections
- Laughter, sighs, and significant pauses [in brackets]
- Overlapping speech marked with //
- Unintelligible segments marked as [unintelligible]

Speaker labels: [Interviewer] and [Participant]

Clean Transcription

Audio: "podcast-episode.mp3" — 45-minute podcast discussion.

Produce a clean, readable transcript:
- Remove filler words (um, uh)
- Complete interrupted sentences where intent is clear
- Omit false starts unless they meaningfully change the statement
- Preserve the speakers' distinctive phrasing and vocabulary
- Add paragraph breaks at topic transitions

Speaker labels: [Host Name] and [Guest Name]

Translation-Capable Transcription

Audio: "multilingual-meeting.mp3" — 30-minute meeting with speakers
switching between English, French, and German.

1. Transcribe each segment in its original language
2. Provide English translations in [brackets] after each non-English segment
3. Mark language switches: [FR] for French, [DE] for German
4. If a speaker mixes languages mid-sentence, preserve the mix

Sentiment and Tone Analysis

Gemini can detect emotional content in speech — not just from words, but from vocal characteristics like pitch, pace, and intensity.

Audio: "sales-pitch-recording.mp3" — 12-minute sales call.

Analyze the prospect's engagement throughout the call:
1. Overall sentiment arc (interested, skeptical, excited, disengaged)
2. Key moments where sentiment shifted (with timestamps)
3. Questions that generated the most positive response
4. Sections where the prospect seemed to disengage
5. Likelihood of conversion based on vocal cues (Low/Medium/High)

For the sentiment arc, use a timeline format:
[0:00-2:30] Neutral, polite, guarded
[2:30-5:00] Increasing interest, faster responses
...
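
Requesting a fixed timeline format pays off when the result feeds another tool. As a rough sketch, the `[m:ss-m:ss] description` lines can be parsed into second offsets like this. The names `SentimentSegment` and `parse_sentiment_timeline` are hypothetical, and the parser assumes the model kept exactly the format shown in the prompt.

```python
import re
from typing import NamedTuple


class SentimentSegment(NamedTuple):
    start_s: int
    end_s: int
    description: str


# Accepts "[0:00-2:30] text"; also tolerates an en dash between the times.
SEGMENT_RE = re.compile(r"^\[(\d+):(\d{2})\s*[-\u2013]\s*(\d+):(\d{2})\]\s*(.+)$")


def parse_sentiment_timeline(text):
    """Extract timeline segments, skipping lines that do not match the format."""
    segments = []
    for line in text.splitlines():
        match = SEGMENT_RE.match(line.strip())
        if not match:
            continue
        start_min, start_sec, end_min, end_sec, description = match.groups()
        start_s = int(start_min) * 60 + int(start_sec)
        end_s = int(end_min) * 60 + int(end_sec)
        if end_s <= start_s:
            continue  # ignore malformed ranges rather than guessing
        segments.append(SentimentSegment(start_s, end_s, description.strip()))
    return segments
```

Lines that do not match, such as free-text commentary the model adds around the timeline, are ignored instead of raising an error.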

Note:

Gemini's sentiment analysis from audio is based on both linguistic content and vocal characteristics, but vocal analysis can be influenced by cultural speaking patterns. A naturally fast-talking, high-energy speaker may be misclassified as "excited" or "agitated." When possible, provide context about the speaker's baseline communication style.

Meeting Summarization

Audio: "weekly-sync.mp3" — 45-minute team sync meeting with 8 participants.

Generate a structured meeting summary:
1. Attendees (list everyone who spoke)
2. Agenda items discussed (with timestamps)
3. Key decisions made
4. Action items with owners and deadlines
5. Topics raised but deferred
6. Follow-up meeting scheduled? (date/time)

Format action items as a markdown task list:
- [ ] @owner: Task description (due: date)
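
Because the action items come back in a fixed task-list format, they can be pulled into structured records for a tracker. The following is a minimal sketch under that assumption; `ActionItem` and `parse_action_items` are hypothetical names, and the due date is kept as free text because the model may return "Friday" as readily as a calendar date.

```python
import re
from typing import NamedTuple, Optional


class ActionItem(NamedTuple):
    owner: str
    task: str
    due: Optional[str]
    done: bool


# Matches "- [ ] @owner: Task description (due: date)"; the due part is optional.
ITEM_RE = re.compile(
    r"^- \[(?P<done>[ xX])\] @(?P<owner>[^:]+):\s*"
    r"(?P<task>.*?)(?:\s*\(due:\s*(?P<due>[^)]+)\))?$"
)


def parse_action_items(summary):
    """Return every task-list line in the summary as an ActionItem."""
    items = []
    for line in summary.splitlines():
        match = ITEM_RE.match(line.strip())
        if not match:
            continue
        due = match.group("due")
        items.append(
            ActionItem(
                owner=match.group("owner").strip(),
                task=match.group("task").strip(),
                due=due.strip() if due else None,
                done=match.group("done").lower() == "x",
            )
        )
    return items
```

Items without an owner or in a different bullet style are skipped, which makes it easy to spot when the model drifted from the requested format.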

Working with Low-Quality Audio

Audio: "field-recording.wav" — 5-minute outdoor interview with
significant wind noise and distance from microphone.

This recording has poor audio quality with wind interference.
Please:
1. Transcribe what you can understand
2. Mark unclear sections: use [unintelligible] where nothing can be made out, and [approximate: "text"] where you can offer a best guess
3. Note any sections where you have low confidence in the transcription
4. If background noise contains identifiable sounds (traffic, birds,
   machinery), note them in [brackets]
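
The uncertainty markers requested above also make it possible to gauge how much of a noisy transcript needs human review. This is an illustrative sketch only: `summarize_uncertainty` is a hypothetical helper, and it assumes the model used the exact marker spellings from the prompt.

```python
import re

UNINTELLIGIBLE_RE = re.compile(r"\[unintelligible\]", re.IGNORECASE)
APPROXIMATE_RE = re.compile(r'\[approximate:\s*"([^"]*)"\]', re.IGNORECASE)


def summarize_uncertainty(transcript):
    """Count unintelligible gaps and collect the model's best-guess phrases."""
    return {
        "unintelligible": len(UNINTELLIGIBLE_RE.findall(transcript)),
        "approximate": APPROXIMATE_RE.findall(transcript),
    }
```

A transcript with many gaps or guesses is a candidate for re-recording or manual correction rather than downstream analysis.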

Multi-Audio Analysis

Audio 1: "call-1-scripted.mp3" — Agent following script
Audio 2: "call-2-adlib.mp3" — Agent ad-libbing
Audio 3: "call-3-trained.mp3" — Agent using Gemini-generated responses

Compare the three support calls:
1. Customer satisfaction signals in each
2. Resolution time for each
3. Script adherence in call 1
4. Key phrases that correlated with positive customer response
5. Which approach produced the best outcome and why

Common Failures

| Failure | Cause | Fix |
| --- | --- | --- |
| Speaker confusion | Similar voices, overlapping speech | Specify number of speakers: "this recording has exactly 2 speakers" |
| Missing non-speech sounds | Prompt focuses on transcription only | Request: "include significant non-speech sounds in [brackets]" |
| Overly literal transcription | No guidance on transcription style | Specify verbatim vs. clean transcription explicitly |
| Cultural tone misread | Vocal patterns vary by culture | Provide speaker baseline context when possible |
| Long audio dropout | Gemini loses thread on 2+ hour recordings | Segment audio into 30-minute chunks |
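
For the long-audio fix, the chunk boundaries can be computed before cutting the file with whatever audio tool you prefer. Below is a minimal sketch: `chunk_boundaries` is a hypothetical helper, and the optional `overlap_s` parameter is an added assumption (not part of the advice above) for cases where you want a sentence that straddles a cut to appear in both chunks.

```python
def chunk_boundaries(duration_s, chunk_s=30 * 60, overlap_s=0):
    """Return (start, end) offsets in seconds covering the whole recording.

    chunk_s defaults to 30 minutes; overlap_s repeats the tail of each
    chunk at the start of the next one.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    if chunk_s <= 0 or not 0 <= overlap_s < chunk_s:
        raise ValueError("need chunk_s > 0 and 0 <= overlap_s < chunk_s")

    bounds = []
    start = 0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        if end == duration_s:
            break
        start = end - overlap_s
    return bounds
```

When transcribing chunk by chunk, repeat the speaker-label instructions in every request so that labels stay consistent across segments.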