Gemini Audio & Speech Prompts: Transcription & Analysis
Master audio prompting with Gemini. Learn transcription techniques, speaker diarization, sentiment analysis, meeting summarization, and working with low-quality recordings.
Gemini processes audio natively: speech, music, ambient sounds, and mixed sources. It can transcribe speech, distinguish speakers, and describe emotion, tone, and other acoustic properties. Audio prompting calls for different techniques than text or image prompting because the recording arrives as one continuous stream, with no built-in markers for speakers, sections, or transcription style, so the prompt has to supply that structure.
Core Audio Prompt Pattern
Audio: "customer-call.wav" — 8-minute recorded customer support call
between a customer and a support agent.
Analyze this call and provide:
1. Full transcription with speaker labels
2. Sentiment analysis: track customer sentiment across the call timeline
3. Identify the moment sentiment shifted (if any)
4. Extract the resolution: was the issue solved? How?
5. Agent performance assessment: clarity, empathy, efficiency
Format the transcription as:
[Agent]: ...
[Customer]: ...
Note:
Always specify your desired speaker labeling convention explicitly. "Use speaker labels [Agent] and [Customer]" produces cleaner results than "label the speakers" — which might give you "Speaker 1," "Speaker A," or names inferred from content.
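A fixed labeling convention also makes the response easy to check in code. The following is a minimal sketch, assuming the model returns one `[Label]: text` turn per line as requested in the prompt above; `Turn` and `parse_turns` are hypothetical helper names, not part of any Gemini SDK.

```python
import re
from typing import NamedTuple


class Turn(NamedTuple):
    speaker: str
    text: str


# Matches lines such as "[Agent]: Thanks for calling."
TURN_RE = re.compile(r"^\[(?P<speaker>[^\]]+)\]:\s*(?P<text>.*)$")


def parse_turns(transcript: str, allowed: set[str]) -> list[Turn]:
    """Split a labeled transcript into turns and reject unexpected labels."""
    turns: list[Turn] = []
    for raw_line in transcript.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = TURN_RE.match(line)
        if match is None:
            # No label: treat the line as a continuation of the previous turn.
            if turns:
                last = turns[-1]
                turns[-1] = Turn(last.speaker, f"{last.text} {line}")
            continue
        speaker = match["speaker"]
        if speaker not in allowed:
            # The model fell back to "Speaker 1", an inferred name, etc.
            raise ValueError(f"unexpected speaker label: {speaker!r}")
        turns.append(Turn(speaker, match["text"]))
    return turns


sample = """[Agent]: Thanks for calling, how can I help?
[Customer]: My order never arrived.
It was due last week.
[Agent]: I'm sorry about that. Let me check."""

turns = parse_turns(sample, allowed={"Agent", "Customer"})
for turn in turns:
    print(f"{turn.speaker:>8} | {turn.text}")
```

Raising on an unknown label is a design choice for this sketch: it surfaces a drifted convention immediately instead of silently mixing label styles.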
Transcription Patterns
Verbatim Transcription
Audio: "interview-recording.mp3" — 25-minute research interview.
Produce a verbatim transcript including:
- All filler words (um, uh, like, you know)
- False starts and self-corrections
- Laughter, sighs, and significant pauses [in brackets]
- Overlapping speech marked with //
- Unintelligible segments marked as [unintelligible]
Speaker labels: [Interviewer] and [Participant]
Clean Transcription
Audio: "podcast-episode.mp3" — 45-minute podcast discussion.
Produce a clean, readable transcript:
- Remove filler words (um, uh)
- Complete interrupted sentences where intent is clear
- Omit false starts unless they meaningfully change the statement
- Preserve the speakers' distinctive phrasing and vocabulary
- Add paragraph breaks at topic transitions
Speaker labels: [Host Name] and [Guest Name]
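If both styles are used regularly, the two prompts above can be generated from one template so the style is always stated explicitly. This is a rough sketch under that assumption; `STYLE_RULES` and `build_transcription_prompt` are hypothetical names, and the function only builds the prompt text (sending it with the audio file is left out).

```python
# Rule lists mirror the verbatim and clean prompts shown in this section.
STYLE_RULES = {
    "verbatim": (
        "Produce a verbatim transcript including:",
        [
            "All filler words (um, uh, like, you know)",
            "False starts and self-corrections",
            "Laughter, sighs, and significant pauses [in brackets]",
            "Overlapping speech marked with //",
            "Unintelligible segments marked as [unintelligible]",
        ],
    ),
    "clean": (
        "Produce a clean, readable transcript:",
        [
            "Remove filler words (um, uh)",
            "Complete interrupted sentences where intent is clear",
            "Omit false starts unless they meaningfully change the statement",
            "Preserve the speakers' distinctive phrasing and vocabulary",
            "Add paragraph breaks at topic transitions",
        ],
    ),
}


def build_transcription_prompt(
    audio_description: str, style: str, speaker_labels: list[str]
) -> str:
    """Return prompt text with an explicit style and labeling convention."""
    if style not in STYLE_RULES:
        raise ValueError(f"style must be one of {sorted(STYLE_RULES)}")
    header, rules = STYLE_RULES[style]
    labels = " and ".join(f"[{label}]" for label in speaker_labels)
    lines = [f"Audio: {audio_description}", header]
    lines += [f"- {rule}" for rule in rules]
    lines.append(f"Speaker labels: {labels}")
    return "\n".join(lines)


prompt = build_transcription_prompt(
    '"podcast-episode.mp3", 45-minute podcast discussion.',
    style="clean",
    speaker_labels=["Host", "Guest"],
)
print(prompt)
```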
Translation-Capable Transcription
Audio: "multilingual-meeting.mp3" — 30-minute meeting with speakers
switching between English, French, and German.
Transcribe the meeting as follows:
1. Transcribe each segment in its original language
2. Provide English translations in [brackets] after each non-English segment
3. Mark language switches: [FR] for French, [DE] for German
4. If a speaker mixes languages mid-sentence, preserve the mix
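The tags requested above can be split back into structured data afterwards. The sketch below assumes one segment per line in the form `[FR] original text [English translation]`, with untagged lines being English; that exact layout is an assumption, since the prompt does not pin it down, and `Segment` and `parse_segment` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    language: str
    original: str
    translation: Optional[str]


# "[FR] Bonjour. [Hello.]" -> tag, original text, optional bracketed translation.
TAGGED_RE = re.compile(r"^\[(FR|DE)\]\s*(.*?)(?:\s*\[([^\[\]]+)\])?\s*$")


def parse_segment(line: str) -> Segment:
    """Parse one transcript line; untagged lines are assumed to be English."""
    stripped = line.strip()
    match = TAGGED_RE.match(stripped)
    if match is None:
        return Segment("EN", stripped, None)
    tag, original, translation = match.groups()
    return Segment(tag, original, translation)


lines = [
    "Let's start with the budget.",
    "[FR] Bonjour à tous. [Hello everyone.]",
    "[DE] Guten Morgen.",
]
segments = [parse_segment(line) for line in lines]

# A tagged segment without a translation means instruction 2 was not followed.
missing = [s for s in segments if s.language != "EN" and s.translation is None]
print(missing)
```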
Sentiment and Tone Analysis
Gemini can detect emotional content in speech — not just from words, but from vocal characteristics like pitch, pace, and intensity.
Audio: "sales-pitch-recording.mp3" — 12-minute sales call.
Analyze the prospect's engagement throughout the call:
1. Overall sentiment arc (interested, skeptical, excited, disengaged)
2. Key moments where sentiment shifted (with timestamps)
3. Questions that generated the most positive response
4. Sections where the prospect seemed to disengage
5. Likelihood of conversion based on vocal cues (Low/Medium/High)
For the sentiment arc, use a timeline format:
[0:00-2:30] Neutral, polite, guarded
[2:30-5:00] Increasing interest, faster responses
...
Note:
Gemini's sentiment analysis from audio is based on both linguistic content and vocal characteristics, but vocal analysis can be influenced by cultural speaking patterns. A naturally fast-talking, high-energy speaker may be misclassified as "excited" or "agitated." When possible, provide context about the speaker's baseline communication style.
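The timeline format requested in the prompt is simple enough to validate mechanically. As a minimal sketch, assuming each line follows `[m:ss-m:ss] label`, the code below converts the timestamps to seconds and reports stretches of the call the model left uncovered; `Span`, `parse_timeline`, and `find_gaps` are hypothetical names.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    start_s: int
    end_s: int
    label: str


# Matches "[0:00-2:30] Neutral, polite, guarded"
SPAN_RE = re.compile(r"^\[(\d+):(\d{2})-(\d+):(\d{2})\]\s*(.+)$")


def parse_timeline(text: str) -> list[Span]:
    """Extract timeline spans, skipping lines that do not match the format."""
    spans = []
    for line in text.splitlines():
        match = SPAN_RE.match(line.strip())
        if match is None:
            continue
        m1, s1, m2, s2, label = match.groups()
        spans.append(
            Span(int(m1) * 60 + int(s1), int(m2) * 60 + int(s2), label.strip())
        )
    return spans


def find_gaps(spans: list[Span], total_s: int) -> list[tuple[int, int]]:
    """Return (start, end) ranges in seconds that no span covers."""
    gaps = []
    cursor = 0
    for span in sorted(spans):
        if span.start_s > cursor:
            gaps.append((cursor, span.start_s))
        cursor = max(cursor, span.end_s)
    if cursor < total_s:
        gaps.append((cursor, total_s))
    return gaps


timeline = """[0:00-2:30] Neutral, polite, guarded
[2:30-5:00] Increasing interest, faster responses"""

spans = parse_timeline(timeline)
gaps = find_gaps(spans, total_s=12 * 60)  # 12-minute call
print(spans)
print(gaps)
```

Here the two example rows cover the first five minutes, so the remaining seven minutes of the 12-minute call show up as a single gap.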
Meeting Summarization
Audio: "weekly-sync.mp3" — 45-minute team sync meeting with 8 participants.
Generate a structured meeting summary:
1. Attendees (list everyone who spoke)
2. Agenda items discussed (with timestamps)
3. Key decisions made
4. Action items with owners and deadlines
5. Topics raised but deferred
6. Follow-up meeting scheduled? (date/time)
Format action items as a markdown task list:
- [ ] @owner: Task description (due: date)
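Asking for a fixed task-list format pays off when the summary feeds a tracker. The following sketch parses lines in the `- [ ] @owner: Task description (due: date)` shape requested above. `ActionItem` and `parse_action_items` are hypothetical names, and the due date is kept as a plain string because the prompt does not fix a date format.

```python
import re
from typing import NamedTuple, Optional


class ActionItem(NamedTuple):
    owner: str
    task: str
    due: Optional[str]
    done: bool


ITEM_RE = re.compile(
    r"^- \[(?P<done>[ xX])\] @(?P<owner>[\w.-]+):\s*"
    r"(?P<task>.*?)(?:\s*\(due:\s*(?P<due>[^)]+)\))?\s*$"
)


def parse_action_items(summary: str) -> list[ActionItem]:
    """Collect task-list lines from a summary, ignoring everything else."""
    items = []
    for line in summary.splitlines():
        match = ITEM_RE.match(line.strip())
        if match is None:
            continue
        items.append(
            ActionItem(
                owner=match["owner"],
                task=match["task"],
                due=match["due"],  # None when no "(due: ...)" suffix is present
                done=match["done"].lower() == "x",
            )
        )
    return items


summary = """Action items:
- [ ] @maria: Draft the Q3 roadmap (due: Friday)
- [x] @lee: Book the room
- Not a task line"""

items = parse_action_items(summary)
for item in items:
    print(item)
```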
Working with Low-Quality Audio
Audio: "field-recording.wav" — 5-minute outdoor interview with
significant wind noise and distance from microphone.
This recording has poor audio quality with wind interference.
Please:
1. Transcribe what you can understand
2. Mark inaudible sections as [unintelligible]; where you can make a best guess, use [approximate: "text"]
3. Note any sections where you have low confidence in the transcription
4. If background noise contains identifiable sounds (traffic, birds,
machinery), note them in [brackets]
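Once the markers are in place, a rough reliability check is possible without listening to the recording again. The sketch below counts the `[unintelligible]` and `[approximate: "text"]` markers from the prompt above and estimates what share of the transcribed words are guesses; `uncertainty_report` is a hypothetical helper, and the word-count heuristic is an assumption rather than a measure the model provides.

```python
import re

UNINTELLIGIBLE_RE = re.compile(r"\[unintelligible\]", re.IGNORECASE)
APPROX_RE = re.compile(r'\[approximate:\s*"([^"]*)"\]')
# Any bracketed note (markers, sounds, speaker labels with an optional colon).
BRACKET_RE = re.compile(r"\[[^\]]*\]:?")


def uncertainty_report(transcript: str) -> dict:
    """Summarize how much of a marked-up transcript is uncertain."""
    guesses = APPROX_RE.findall(transcript)
    unintelligible = len(UNINTELLIGIBLE_RE.findall(transcript))
    # Words outside any brackets count as confidently transcribed.
    plain = BRACKET_RE.sub(" ", transcript)
    clear_words = len(plain.split())
    approx_words = sum(len(guess.split()) for guess in guesses)
    total_words = clear_words + approx_words
    return {
        "unintelligible_markers": unintelligible,
        "approximate_words": approx_words,
        "clear_words": clear_words,
        "approximate_share": approx_words / total_words if total_words else 0.0,
    }


sample = (
    'We moved here in [approximate: "nineteen eighty"] '
    "and [unintelligible] the farm. [wind gust]"
)
report = uncertainty_report(sample)
print(report)
```

A high approximate share or many unintelligible markers suggests the recording should be cleaned up or re-recorded before the transcript is relied on.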
Multi-Audio Analysis
Audio 1: "call-1-scripted.mp3" — Agent following script
Audio 2: "call-2-adlib.mp3" — Agent ad-libbing
Audio 3: "call-3-trained.mp3" — Agent using Gemini-generated responses
Compare the three support calls:
1. Customer satisfaction signals in each
2. Resolution time for each
3. Script adherence in call 1
4. Key phrases that correlated with positive customer response
5. Which approach produced the best outcome and why
Common Failures
| Failure | Cause | Fix |
|---|---|---|
| Speaker confusion | Similar voices, overlapping speech | Specify number of speakers: "this recording has exactly 2 speakers" |
| Missing non-speech sounds | Prompt focuses on transcription only | Request: "include significant non-speech sounds in [brackets]" |
| Overly literal transcription | No guidance on transcription style | Specify verbatim vs. clean transcription explicitly |
| Cultural tone misread | Vocal patterns vary by culture | Provide speaker baseline context when possible |
| Long audio dropout | Gemini can lose track of speakers and context on 2+ hour recordings | Split the audio into 30-minute chunks and process each one separately |
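For the long-audio fix, the chunk boundaries can be computed before any audio tool is involved. This is a minimal sketch of that arithmetic only; `chunk_ranges` is a hypothetical helper, the optional `overlap_s` parameter is an added assumption (a few seconds of overlap can keep a sentence from being cut at a boundary), and the actual cutting is left to whatever audio tool is in use.

```python
def chunk_ranges(
    total_s: int, chunk_s: int = 30 * 60, overlap_s: int = 0
) -> list[tuple[int, int]]:
    """Return (start, end) second offsets covering a recording in chunks."""
    if chunk_s <= 0:
        raise ValueError("chunk_s must be positive")
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap_s must be in [0, chunk_s)")
    ranges = []
    start = 0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        ranges.append((start, end))
        if end >= total_s:
            break
        # Step back by the overlap so adjacent chunks share some audio.
        start = end - overlap_s
    return ranges


# A 2-hour recording split into 30-minute chunks.
two_hours = chunk_ranges(2 * 60 * 60)
print(two_hours)

# A recording just over an hour, with 15 seconds of overlap between chunks.
with_overlap = chunk_ranges(3700, chunk_s=1800, overlap_s=15)
print(with_overlap)
```

When transcribing chunk by chunk, repeating the same speaker labels and style instructions in every prompt keeps the pieces consistent when they are joined.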
Related Pages
- Video Processing — Audio within video content
- Multimodal Workflows — Combining audio with images and text