The killer application for Gemini's 2M-token context window is full-document analysis. Not summarization — that's table stakes. Real analysis: cross-referencing claims across 500 pages, tracing a variable through a 200K-line codebase, finding contradictions across a set of 50 research papers, or verifying that every provision in a 300-page contract is consistent with every other provision.

This page covers production patterns for four document types. Each pattern accounts for the lost-middle problem (details buried mid-context get less reliable attention), attention management, and cost optimization.

Full Book Analysis

I'm providing the complete text of "[BOOK TITLE]" by [AUTHOR].

TASK: Comprehensive literary analysis

1. THEMATIC ANALYSIS
   - Identify the 3-5 central themes
   - For each theme, cite 3+ passages as evidence (with chapter/page)
   - Trace how each theme develops from beginning to end

2. CHARACTER ARC ANALYSIS
   - For each major character, describe their arc across the entire book
   - Identify the turning point chapter for each character
   - Quote the passage that best represents their transformation

3. NARRATIVE STRUCTURE
   - Map the plot structure (exposition, rising action, climax, etc.)
   - Identify the exact passage that represents the climax
   - Note any structural innovations (non-linear timeline, unreliable narrator)

4. STYLISTIC ANALYSIS
   - Describe the author's prose style with specific examples
   - Note any stylistic shifts across the book
   - Compare the opening paragraph to the closing paragraph

5. CONTRADICTIONS & AMBIGUITIES
   - Identify any internal inconsistencies in plot or character behavior
   - Flag passages open to multiple interpretations
   - Note unresolved questions the text raises but doesn't answer

CITATION FORMAT: [Chapter:Paragraph] for every claim.
If a theme or pattern appears across many passages, cite the 3 strongest.

Note:

For full-book analysis, always ask for citations with chapter/paragraph references. Without a citation requirement, Gemini tends to produce claims that sound plausible but cannot be checked, mixing accurate observations with fabrications.
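Citations only help if someone checks them. Below is a minimal sketch of that check, assuming the model returns numeric citations in the `[Chapter:Paragraph]` format (for example `[3:12]`) and that the book has already been split into chapters and paragraphs. The function name `find_bad_citations` and the shape of `book` are hypothetical, chosen for illustration.

```python
import re

# Matches numeric citations such as [3:12] (chapter 3, paragraph 12).
CITATION_RE = re.compile(r"\[(\d+):(\d+)\]")


def find_bad_citations(analysis, book):
    """Return citations that point outside the book.

    `book` maps chapter number -> list of paragraph strings. This is an
    assumed structure; adapt it to however you split the text.
    """
    bad = []
    for match in CITATION_RE.finditer(analysis):
        chapter, paragraph = int(match.group(1)), int(match.group(2))
        paragraphs = book.get(chapter)
        # Paragraph numbers are treated as 1-based.
        if paragraphs is None or not 1 <= paragraph <= len(paragraphs):
            bad.append(match.group(0))
    return bad


book = {1: ["It was a dark night.", "Rain fell."], 2: ["Morning came."]}
analysis = "Isolation appears early [1:2] and returns later [2:5] and [7:1]."
print(find_bad_citations(analysis, book))
```

This only catches citations that cannot exist. A citation that resolves still needs a spot check that the passage actually supports the claim.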

Codebase-Wide Analysis

Gemini can ingest an entire mid-sized codebase in a single prompt. The key is sending the files in a structured format with clear path boundaries.

I'm providing the complete source code for [PROJECT NAME].
The codebase is organized as follows:

[FILE: src/auth/login.ts]
... file contents ...

[FILE: src/auth/session.ts]
... file contents ...

[FILE: src/api/routes.ts]
... file contents ...

... (all project files)

TASK: Codebase architecture review

1. ARCHITECTURE OVERVIEW
   - Identify the architectural pattern (MVC, layered, microservices, etc.)
   - Map the dependency graph: which modules depend on which others?
   - Identify circular dependencies

2. SECURITY AUDIT
   - Find all authentication and authorization checks
   - Identify any routes or functions missing auth checks
   - Check for common vulnerabilities:
     * SQL injection points
     * Unsanitized user input reaching queries
     * Hardcoded secrets or API keys
     * Missing CSRF protection

3. CODE QUALITY
   - Identify the 5 most complex functions (by cyclomatic complexity)
   - Find duplicate or near-duplicate code blocks
   - Flag functions that are too long (>50 lines)
   - Identify dead code or unreachable paths

4. ERROR HANDLING
   - Find all try/catch blocks and error handlers
   - Identify functions that can fail without error handling
   - Check if errors leak sensitive information to users

5. REFACTORING RECOMMENDATIONS
   - Suggest the 3 highest-impact refactorings with specific file references
   - For each: what changes, why it matters, risk level

CITATION: [file:line-range] for every finding.
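The `[file:line-range]` citations can be sanity-checked mechanically once the review comes back. The sketch below assumes citations look like `[src/auth/login.ts:10-42]` and that you still have the same path-to-contents mapping you sent. `check_file_citations` is a hypothetical helper name.

```python
import re

# Matches citations such as [src/auth/login.ts:10-42].
FILE_CITATION_RE = re.compile(r"\[([^\[\]:]+):(\d+)-(\d+)\]")


def check_file_citations(report, files):
    """Return (citation, reason) pairs for citations that cannot be valid.

    `files` maps relative path -> file contents, mirroring what was sent.
    """
    problems = []
    for match in FILE_CITATION_RE.finditer(report):
        path = match.group(1)
        start, end = int(match.group(2)), int(match.group(3))
        if path not in files:
            problems.append((match.group(0), "unknown file"))
            continue
        line_count = len(files[path].splitlines())
        if start < 1 or start > end or end > line_count:
            problems.append((match.group(0), f"range outside 1-{line_count}"))
    return problems


files = {"src/auth/login.ts": "line one\nline two\nline three\n"}
report = (
    "Missing auth check [src/auth/login.ts:2-3]. "
    "Hardcoded key [src/auth/login.ts:2-9]. "
    "SQL injection [src/db/query.ts:1-4]."
)
for citation, reason in check_file_citations(report, files):
    print(citation, "->", reason)
```

A finding that cites a file you never sent, or lines past the end of a file, is a strong hint that the finding itself is fabricated.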

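# Helper: serialize a codebase for Gemini ingestion
import os

# Directories that consume context without adding analysis value
SKIP_DIRS = {".git", "node_modules", "dist", "build", "__pycache__"}

def serialize_codebase(root_dir, extensions=None):
    """Serialize a directory tree into Gemini-friendly format."""
    output = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        # Prune skipped directories in place and walk in a stable order
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        for f in sorted(filenames):
            # str.endswith accepts a tuple of suffixes
            if extensions and not f.endswith(tuple(extensions)):
                continue
            path = os.path.join(dirpath, f)
            rel_path = os.path.relpath(path, root_dir)
            # Replace undecodable bytes instead of crashing on a stray binary file
            with open(path, encoding="utf-8", errors="replace") as fh:
                content = fh.read()
            output.append(f"[FILE: {rel_path}]\n{content}\n[/FILE: {rel_path}]")
    return "\n\n".join(output)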

Note:

Don't blindly dump node_modules, .git, or build artifacts. They'll consume context without adding analysis value. Use .gitignore patterns to filter, and always set file extension filters (.ts, .py, .go, etc.).
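One way to apply that filtering is a small predicate run over every candidate path before serialization. This is a simplified approximation, not full `.gitignore` semantics (no negation or anchoring rules). The directory names, file patterns, and the `should_include` function are illustrative assumptions.

```python
import fnmatch
from pathlib import PurePosixPath

# Illustrative lists; a real project would derive these from its own config.
IGNORED_DIRS = {"node_modules", ".git", "dist", "build"}
IGNORED_FILE_PATTERNS = ["*.min.js", "*.lock", "*.map"]
KEEP_EXTENSIONS = {".ts", ".py", ".go"}


def should_include(rel_path):
    """Decide whether a file (path relative to the repo root) is worth sending."""
    path = PurePosixPath(rel_path.replace("\\", "/"))
    # Reject anything that lives under an ignored directory at any depth.
    if any(part in IGNORED_DIRS for part in path.parts[:-1]):
        return False
    # Reject generated or vendored files by name pattern.
    if any(fnmatch.fnmatch(path.name, pat) for pat in IGNORED_FILE_PATTERNS):
        return False
    # Finally, keep only the source extensions we care about.
    return path.suffix in KEEP_EXTENSIONS


candidates = [
    "src/auth/login.ts",
    "node_modules/lodash/index.ts",
    "packages/api/node_modules/x/y.ts",
    "dist/bundle.min.js",
    "README.md",
    "tools/gen.py",
]
print([p for p in candidates if should_include(p)])
```

Checking directory parts at any depth matters in monorepos, where `node_modules` often appears below the root.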

Legal Document Analysis

I'm providing a commercial lease agreement (187 pages) and
three addendums.

[DOC: lease-agreement.pdf]
... full text ...

[DOC: addendum-1.pdf]
... full text ...

[DOC: addendum-2.pdf]
... full text ...

[DOC: addendum-3.pdf]
... full text ...

TASK: Comprehensive lease review

1. OBLIGATION EXTRACTION
   Extract every obligation for the tenant, organized by:
   - Financial (rent, deposits, fees, increases)
   - Operational (maintenance, insurance, compliance)
   - Reporting (notices, documentation, certifications)

2. RISK ASSESSMENT
   - Identify unusual or tenant-unfavorable provisions
   - Find any "evergreen" clauses (auto-renewal without notice)
   - Locate all termination rights and their conditions
   - Identify any provisions that contradict each other

3. ADDENDUM IMPACT
   - For each addendum, list exactly which provisions it modifies
   - Check if any modifications create new contradictions
   - Verify that all referenced sections actually exist in the main agreement

4. MISSING PROTECTIONS
   - List standard tenant protections NOT present in this agreement
   - Compare against [jurisdiction]'s statutory requirements
   - Flag anything a tenant should negotiate before signing

For every finding, cite the specific section number and paragraph.
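The "verify that all referenced sections actually exist" step can also be run deterministically as a cross-check on the model's answer. The sketch below assumes references are written as `Section 12.3` and that headings in the main agreement start a line in the same form. Real contracts vary ("Article", "Clause", "§"), so treat both regular expressions and the `missing_section_refs` name as placeholders.

```python
import re

# Assumed reference style: "Section 4.1", "Section 12", "Section 9.2.1".
REFERENCE_RE = re.compile(r"Section\s+(\d+(?:\.\d+)*)")
# Assumed heading style: the same form at the start of a line.
HEADING_RE = re.compile(r"^Section\s+(\d+(?:\.\d+)*)", re.MULTILINE)


def missing_section_refs(main_agreement, addendum):
    """Return section numbers the addendum cites that the agreement lacks."""
    defined = set(HEADING_RE.findall(main_agreement))
    missing = []
    for ref in REFERENCE_RE.findall(addendum):
        # Keep first-appearance order and drop repeats.
        if ref not in defined and ref not in missing:
            missing.append(ref)
    return missing


main_agreement = """Section 4.1 Base Rent
Tenant shall pay base rent monthly.
Section 9.2 Insurance
Tenant shall maintain liability coverage.
"""
addendum = "Section 4.1 is amended as follows. Section 14.6 is deleted."
print(missing_section_refs(main_agreement, addendum))
```

If this list and the model's ADDENDUM IMPACT findings disagree, review the disagreement by hand.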

Multi-Document Research Synthesis

I'm providing 30 research papers on [TOPIC]. Each paper is
labeled with [PAPER: ID] markers.

TASK: Cross-paper synthesis

1. CONSENSUS FINDINGS
   - What conclusions do 70%+ of papers agree on?
   - For each consensus finding, list the supporting papers

2. CONTROVERSIES & DISAGREEMENTS
   - Where do papers directly contradict each other?
   - For each controversy, present both sides with supporting papers
   - Which side has stronger evidence?

3. METHODOLOGICAL COMPARISON
   - Compare methodologies across papers
   - Which papers have the strongest experimental design? Why?
   - Identify common methodological weaknesses across the set

4. RESEARCH GAPS
   - What questions do NONE of these papers address?
   - What would the ideal follow-up study look like?

5. META-ANALYSIS CANDIDATES
   - Which papers share sufficiently similar methodology
     that their results could be combined in a meta-analysis?

CITATION FORMAT: [Author Year, Section] for every claim.
Distinguish between: [Finding], [Claim by authors], and [My inference].
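The `[PAPER: ID]` markers are easy to generate programmatically, and doing so lets you enforce unique IDs up front. A minimal sketch follows. The closing `[/PAPER: ID]` marker and the `label_papers` name are assumptions for illustration, not a format the API requires.

```python
def label_papers(papers):
    """Wrap each paper in [PAPER: ID] markers.

    `papers` is a list of (paper_id, text) pairs. IDs must be unique so a
    citation can be traced back to exactly one source.
    """
    ids = [paper_id for paper_id, _ in papers]
    duplicates = sorted({i for i in ids if ids.count(i) > 1})
    if duplicates:
        raise ValueError(f"duplicate paper IDs: {duplicates}")
    blocks = [
        f"[PAPER: {paper_id}]\n{text.strip()}\n[/PAPER: {paper_id}]"
        for paper_id, text in papers
    ]
    return "\n\n".join(blocks)


corpus = label_papers([
    ("Smith2021", "We measured response latency under load."),
    ("Lee2023", "We replicate the earlier latency study."),
])
print(corpus)
```

IDs that already encode author and year make the `[Author Year, Section]` citation format easier to map back to a source.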

Cost Management for Large Documents

Full-document analysis can get expensive fast. Here's the optimization playbook:

| Document Type | Context Size | Cache Strategy | Estimated Cost per Analysis |
|---|---|---|---|
| 300-page book | ~500K tokens | Cache book text, run multiple analyses | Moderate |
| 50K-line codebase | ~300K tokens | Cache source, run security + quality + refactor | Low-Moderate |
| 50 research papers | ~1M tokens | Chunk into 10-paper batches, synthesize results | High |
| Legal contract | ~200K tokens | Cache contract, run multiple review passes | Low |
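The "chunk into 10-paper batches" strategy reduces to a simple split followed by one synthesis call over the per-batch results. Only the split is sketched here. `make_batches` is a hypothetical helper, and the paper names are placeholders.

```python
def make_batches(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


papers = [f"paper-{n:02d}" for n in range(1, 51)]
batches = make_batches(papers, 10)
print(len(batches), [len(b) for b in batches])
```

Each batch gets its own analysis prompt, and a final prompt receives only the batch-level summaries.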

Note:

For large document analysis, always run a cheap "scoping" pass first: ask Gemini to identify which sections are most relevant to your task, then feed only those sections into the expensive detailed analysis. This can cut costs by 60-80% without meaningful quality loss.
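A sketch of the scoping-then-detail pipeline is below. It assumes the document is already split into titled sections and that the scoping pass sees only titles plus a short preview, which is one of several ways to keep that pass cheap. `ask_model` stands in for whatever client call you use. It is a hypothetical callable, not a real API.

```python
def scoped_analysis(sections, task, ask_model, max_sections=5):
    """Two-pass analysis: cheap scoping pass, then a detailed pass.

    `sections` maps section title -> text. `ask_model` is any callable that
    takes a prompt string and returns the reply as a string.
    """
    # Pass 1: send titles and short previews, ask for the relevant titles.
    outline = "\n".join(f"- {t}: {text[:200]}" for t, text in sections.items())
    reply = ask_model(
        f"TASK: {task}\n\nSections:\n{outline}\n\n"
        f"List up to {max_sections} section titles most relevant to the task, "
        "one per line."
    )
    # Keep only titles that exist, so an invented title is ignored.
    named = [line.lstrip("-* ").strip() for line in reply.splitlines()]
    chosen = [t for t in named if t in sections][:max_sections]

    # Pass 2: send the full text of the chosen sections only.
    body = "\n\n".join(f"[SECTION: {t}]\n{sections[t]}" for t in chosen)
    return ask_model(f"TASK: {task}\n\n{body}")


def fake_model(prompt):
    # Stand-in for a real API call so the sketch runs offline.
    if "List up to" in prompt:
        return "- Termination\n- Parking\n- Rent"
    return f"analysed {prompt.count('[SECTION:')} sections"


sections = {
    "Rent": "Base rent is due monthly.",
    "Termination": "Either party may terminate with notice.",
    "Signage": "Tenant may install signage with consent.",
}
print(scoped_analysis(sections, "Find termination risks", fake_model))
```

Filtering the reply against the real section titles is the important design choice: the scoping pass can hallucinate too, and the filter keeps a bad title from breaking the second pass.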

Common Failures

| Failure | Cause | Fix |
|---|---|---|
| Plausible-sounding fabrications | Gemini synthesizes without evidence | Require specific citations for every claim |
| Cross-document hallucination | Finding attributed to the wrong source | Use unique document IDs in every reference |
| Missing key findings | Critical info lost in the middle of the context | Use KEY FINDINGS beacons; run multiple passes |
| Cost explosion | Analyzing the full corpus for every question | Scope pass → detailed pass pipeline |
| Token limit exceeded | Codebase includes build artifacts/deps | Filter aggressively before ingestion |
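The last failure is cheap to catch before any request is sent. The sketch below uses a rough characters-per-token heuristic. The default of 4 is an assumption that varies with language and content, so use the API's own token-counting facility when you need an exact figure. `estimate_tokens`, `check_budget`, and the reserve size are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Very rough estimate; real tokenizers vary by language and content."""
    return int(len(text) / chars_per_token)


def check_budget(documents, limit_tokens, reserve_tokens=50_000):
    """Raise before sending if the estimated prompt would not fit.

    `documents` maps name -> text. `reserve_tokens` leaves room for the
    task instructions and the model's reply.
    """
    estimates = {name: estimate_tokens(text) for name, text in documents.items()}
    total = sum(estimates.values())
    budget = limit_tokens - reserve_tokens
    if total > budget:
        largest = sorted(estimates, key=estimates.get, reverse=True)[:3]
        raise ValueError(
            f"estimated {total} tokens exceeds budget {budget}; "
            f"largest inputs: {largest}"
        )
    return total


documents = {"src/app.ts": "x" * 4_000, "dist/bundle.js": "y" * 4_000_000}
try:
    check_budget(documents, limit_tokens=1_000_000)
except ValueError as err:
    print(err)
```

Naming the largest inputs in the error usually points straight at the build artifact or dependency directory that slipped through the filter.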

1M Token Strategies — Attention placement and retrieval techniques
Context Caching — Cost optimization for repeated analysis