Gemini processes audio natively, including speech, music, ambient sounds, and mixed audio sources. It can transcribe, identify speakers, detect emotion and tone, and analyze acoustic properties. Audio prompting requires different techniques than text or image prompting, though, because Gemini hears the audio as a continuous stream rather than as pre-segmented data, so the prompt has to spell out how the output should be segmented: speaker labels, timestamps, and transcription style.

Core Audio Prompt Pattern

Audio: "customer-call.wav" — 8-minute recorded customer support call
between a customer and a support agent.

1. Full transcription with speaker labels
2. Sentiment analysis: track customer sentiment across the call timeline
3. Identify the moment sentiment shifted (if any)
4. Extract the resolution: was the issue solved? How?
5. Agent performance assessment: clarity, empathy, efficiency

Format the transcription as:
[Agent]: ...
[Customer]: ...

Note:

Always specify your desired speaker labeling convention explicitly. "Use speaker labels [Agent] and [Customer]" produces cleaner results than "label the speakers" — which might give you "Speaker 1," "Speaker A," or names inferred from content.
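One way to hold the model to the requested labeling convention is to validate its reply before passing it on. The sketch below assumes the reply is already available as a plain string. `parse_labeled_transcript` is a hypothetical helper written for this example, not part of any Gemini SDK.

```python
import re

# Matches lines of the form "[Label]: text", the format requested in the prompt.
LINE_RE = re.compile(r"^\[(?P<speaker>[^\]]+)\]:\s*(?P<text>.*)$")


def parse_labeled_transcript(transcript, allowed=("Agent", "Customer")):
    """Split a transcript into (speaker, text) turns.

    Hypothetical helper: raises ValueError if a label outside `allowed`
    appears, e.g. "Speaker 1" or a name inferred from the conversation.
    """
    turns = []
    for raw in transcript.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            # Unlabeled line: treat it as a continuation of the previous turn.
            if not turns:
                raise ValueError(f"Text before first speaker label: {line!r}")
            speaker, text = turns[-1]
            turns[-1] = (speaker, f"{text} {line}".strip())
            continue
        speaker = match.group("speaker")
        if speaker not in allowed:
            raise ValueError(f"Unexpected speaker label: [{speaker}]")
        turns.append((speaker, match.group("text")))
    return turns
```

If the check fails, re-sending the prompt with the label instruction restated is usually cheaper than repairing the labels by hand.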

Transcription Patterns

Verbatim Transcription

Audio: "interview-recording.mp3" — 25-minute research interview.

Produce a verbatim transcript including:
- All filler words (um, uh, like, you know)
- False starts and self-corrections
- Laughter, sighs, and significant pauses [in brackets]
- Overlapping speech marked with //
- Unintelligible segments marked as [unintelligible]

Speaker labels: [Interviewer] and [Participant]

Clean Transcription

Audio: "podcast-episode.mp3" — 45-minute podcast discussion.

Produce a clean, readable transcript:
- Remove filler words (um, uh)
- Complete interrupted sentences where intent is clear
- Omit false starts unless they meaningfully change the statement
- Preserve the speakers' distinctive phrasing and vocabulary
- Add paragraph breaks at topic transitions

Speaker labels: [Host Name] and [Guest Name]
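A quick audit can confirm that a "clean" transcript really is clean. The sketch below is a minimal example that assumes the filler list from the prompt above (um, uh). `find_fillers` is a hypothetical name, and the list should be extended for your own material.

```python
import re

# Assumed filler list, taken from the clean-transcription prompt above.
FILLERS = ("um", "uh")
# Word boundaries keep "um" from matching inside words such as "umbrella".
FILLER_RE = re.compile(r"\b(" + "|".join(FILLERS) + r")\b", re.IGNORECASE)


def find_fillers(transcript):
    """Return (line_number, filler) pairs for every filler word still present."""
    hits = []
    for number, line in enumerate(transcript.splitlines(), start=1):
        for match in FILLER_RE.finditer(line):
            hits.append((number, match.group(1).lower()))
    return hits
```

An empty result suggests the clean-up instruction was followed. The same function run on a verbatim transcript should return many hits.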

Translation-Capable Transcription

Audio: "multilingual-meeting.mp3" — 30-minute meeting with speakers
switching between English, French, and German.

1. Transcribe each segment in its original language
2. Provide English translations in [brackets] after each non-English segment
3. Mark language switches: [FR] for French, [DE] for German
4. If a speaker mixes languages mid-sentence, preserve the mix

Sentiment and Tone Analysis

Gemini can detect emotional content in speech — not just from words, but from vocal characteristics like pitch, pace, and intensity.

Audio: "sales-pitch-recording.mp3" — 12-minute sales call.

Analyze the prospect's engagement throughout the call:
1. Overall sentiment arc (interested, skeptical, excited, disengaged)
2. Key moments where sentiment shifted (with timestamps)
3. Questions that generated the most positive response
4. Sections where the prospect seemed to disengage
5. Likelihood of conversion based on vocal cues (Low/Medium/High)

For the sentiment arc, use a timeline format:
[0:00-2:30] Neutral, polite, guarded
[2:30-5:00] Increasing interest, faster responses
...
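Asking for a fixed timeline format pays off when the result is processed further. As a minimal sketch, the parser below assumes the model followed the `[m:ss-m:ss] description` format exactly, with a plain hyphen between the two times. Both function names are hypothetical.

```python
import re

# One timeline entry: "[0:00-2:30] Neutral, polite, guarded"
SEGMENT_RE = re.compile(r"^\[(\d+):(\d{2})-(\d+):(\d{2})\]\s*(.+)$")


def parse_sentiment_timeline(text):
    """Return (start_seconds, end_seconds, description) for each timeline line."""
    segments = []
    for line in text.splitlines():
        match = SEGMENT_RE.match(line.strip())
        if match is None:
            continue  # Skip anything that is not a timeline entry.
        start_min, start_sec, end_min, end_sec, label = match.groups()
        start = int(start_min) * 60 + int(start_sec)
        end = int(end_min) * 60 + int(end_sec)
        if end <= start:
            raise ValueError(f"Segment ends before it starts: {line!r}")
        segments.append((start, end, label.strip()))
    return segments


def find_gaps(segments):
    """Return (previous_end, next_start) pairs where consecutive segments do not meet."""
    return [(a[1], b[0]) for a, b in zip(segments, segments[1:]) if a[1] != b[0]]
```

Checking for gaps is a cheap way to notice when part of the call was left out of the sentiment arc.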

Note:

Gemini's sentiment analysis from audio is based on both linguistic content and vocal characteristics, but vocal analysis can be influenced by cultural speaking patterns. A naturally fast-talking, high-energy speaker may be misclassified as "excited" or "agitated." When possible, provide context about the speaker's baseline communication style.

Meeting Summarization

Audio: "weekly-sync.mp3" — 45-minute team sync meeting with 8 participants.

Generate a structured meeting summary:
1. Attendees (list everyone who spoke)
2. Agenda items discussed (with timestamps)
3. Key decisions made
4. Action items with owners and deadlines
5. Topics raised but deferred
6. Follow-up meeting scheduled? (date/time)

Format action items as a markdown task list:
- [ ] @owner: Task description (due: date)
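Because the action items follow a fixed task-list format, they can be pulled out of the summary mechanically. The sketch below assumes the model reproduced the format exactly as requested. `parse_action_items` is a hypothetical helper, and the due date is kept as free text because the prompt does not constrain its format.

```python
import re

# Matches "- [ ] @owner: Task description (due: date)"; the due part is optional.
ITEM_RE = re.compile(
    r"^- \[(?P<done>[ xX])\] @(?P<owner>[\w.-]+): "
    r"(?P<task>.+?)(?: \(due: (?P<due>[^)]+)\))?$"
)


def parse_action_items(summary):
    """Extract action items from a meeting summary as a list of dicts."""
    items = []
    for line in summary.splitlines():
        match = ITEM_RE.match(line.strip())
        if match is None:
            continue  # Not a task-list line.
        items.append({
            "owner": match.group("owner"),
            "task": match.group("task"),
            "due": match.group("due"),  # None when no deadline was given.
            "done": match.group("done").lower() == "x",
        })
    return items
```

Items with `due` set to `None` are worth reviewing by hand, since a missing deadline may mean none was stated in the meeting or that the model missed it.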

Working with Low-Quality Audio

Audio: "field-recording.wav" — 5-minute outdoor interview with
significant wind noise and distance from microphone.

This recording has poor audio quality with wind interference.
Please:
1. Transcribe what you can understand
2. Mark sections you cannot make out with [unintelligible]; where you can make a best guess, use [approximate: "text"]
3. Note any sections where you have low confidence in the transcription
4. If background noise contains identifiable sounds (traffic, birds,
   machinery), note them in [brackets]
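The markers requested above also give a rough measure of how much of the transcript can be trusted. The following is a minimal sketch that assumes the model used the two markers verbatim, with straight double quotes around the guessed text. `uncertainty_report` is a hypothetical name.

```python
import re

UNINTELLIGIBLE_RE = re.compile(r"\[unintelligible\]", re.IGNORECASE)
# Captures the guessed text inside [approximate: "..."].
APPROX_RE = re.compile(r'\[approximate:\s*"([^"]*)"\]', re.IGNORECASE)


def uncertainty_report(transcript):
    """Count unintelligible gaps and collect the model's best-guess passages."""
    return {
        "unintelligible": len(UNINTELLIGIBLE_RE.findall(transcript)),
        "approximate": APPROX_RE.findall(transcript),
    }
```

A high count of either marker is a signal to listen to the recording yourself before relying on the transcript.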

Multi-Audio Analysis

Audio 1: "call-1-scripted.mp3" — Agent following script
Audio 2: "call-2-adlib.mp3" — Agent ad-libbing
Audio 3: "call-3-trained.mp3" — Agent using Gemini-generated responses

Compare the three support calls:
1. Customer satisfaction signals in each
2. Resolution time for each
3. Script adherence in call 1
4. Key phrases that correlated with positive customer response
5. Which approach produced the best outcome and why

Common Failures

Failure	Cause	Fix
Speaker confusion	Similar voices, overlapping speech	Specify number of speakers: "this recording has exactly 2 speakers"
Missing non-speech sounds	Prompt focuses on transcription only	Request: "include significant non-speech sounds in [brackets]"
Overly literal transcription	No guidance on transcription style	Specify verbatim vs. clean transcription explicitly
Cultural tone misread	Vocal patterns vary by culture	Provide speaker baseline context when possible
Long audio dropout	Gemini loses thread on 2+ hour recordings	Segment audio into 30-minute chunks
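For the last row of the table, the chunk boundaries can be computed before any audio is cut. The sketch below only calculates time ranges in seconds. Splitting the file itself needs an audio tool and is not shown, and `chunk_ranges` is a hypothetical helper.

```python
def chunk_ranges(duration_s, chunk_s=30 * 60):
    """Return (start, end) second offsets covering a recording in fixed-size chunks.

    The default of 30 minutes follows the fix suggested in the table above;
    the final chunk is shorter when the duration is not an exact multiple.
    """
    if duration_s <= 0 or chunk_s <= 0:
        raise ValueError("duration_s and chunk_s must be positive")
    ranges = []
    start = 0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        ranges.append((start, end))
        start = end
    return ranges
```

Fixed boundaries can fall mid-sentence, so it may help to tell the model in each prompt which part of the recording it is hearing and to repeat the speaker labels.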

Video Processing — Audio within video content
Multimodal Workflows — Combining audio with images and text
