Wednesday, June 17, 2026

TREX — Greptile's AI Code Reviewer That Actually Runs Your Code

Greptile launched TREX today — an execution layer for AI code review that actually runs your code as part of the review process. Not static analysis. Not pattern matching against a training set. Real execution in a sandboxed environment, with screenshots, logs, and video as evidence.

Most AI code reviewers today are sophisticated diff-readers. They analyze pull requests the same way a human would: read the changes, apply patterns from training data, and flag suspicious code. This works for a class of bugs — the ones that announce themselves plainly in the diff. But there's a whole category of bugs that only appear when the program runs: state-dependent logic errors, UI regressions that require rendering, race conditions that need real network requests, and integration failures that span services.

TREX (Test, Run, Execute) is Greptile's answer to that ceiling. Here's how it works under the hood, why the architecture went through three iterations to get there, and what it means for CI/CD pipelines.

The Architecture: Orchestrator Agents Spawning Execution Agents

The most interesting architectural decision in TREX is not the sandbox. It's that TREX subagents are managed from within the main Greptile reviewer agent — a design that took three attempts to get right.

Attempt 1: Standalone Agent

TREX started as a completely separate product from the main Greptile reviewer. It would independently read a PR, generate tests, and run them. The theory was that giving each agent its own context window would keep both focused.

It didn't work. The standalone agent generated tests that weren't relevant to what the PR was actually doing. It produced noisy, low-value output and missed edge cases because it didn't share context with the reviewer agent. Both agents independently explored the same parts of the codebase, duplicating work and wasting compute.

Attempt 2: Single Agent

The obvious fix was to combine both capabilities into one agent. That introduced a different problem: a single agent handling the full review — spinning up services, taking screenshots, running tests, writing reviews — was overloaded. The context window filled with operational details and left no room for the actual analysis.

Attempt 3: Agent-in-Agent Orchestration

The solution was the current architecture. The Greptile reviewer agent acts as an orchestrator. It reads the diff, identifies issues worth investigating, and spins up a dedicated TREX agent per issue — all running in parallel.

Greptile Reviewer Agent (orchestrator)
  ├── Reads PR diff
  ├── Identifies issues worth executing
  ├── Spawns TREX Subagent 1 (auth-gated feature)
  ├── Spawns TREX Subagent 2 (checkout flow)
  └── Spawns TREX Subagent n (API endpoint)
        └── Each subagent:
              ├── Inherits orchestrator context
              ├── Runs in its own context window
              ├── Spins up sandboxed environment
              ├── Executes relevant code paths
              └── Returns multi-modal artifacts

Each TREX subagent inherits what the orchestrator already found, operates in its own context window scoped to the specific problem, and returns evidence — not just conclusions. This avoids the duplication problem of the standalone approach and the context-overload problem of the single-agent approach.
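The fan-out pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not Greptile's implementation: `Issue`, `Finding`, `run_subagent`, and `review` are hypothetical names, and the sandbox run is replaced by a placeholder `await`.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Issue:
    """A suspicious spot the orchestrator found while reading the diff."""
    title: str
    notes: str  # what the orchestrator already learned about this issue


@dataclass
class Finding:
    issue: str
    verdict: str
    artifacts: list[str] = field(default_factory=list)


async def run_subagent(issue: Issue, shared_context: str) -> Finding:
    """Hypothetical subagent: starts from the orchestrator's context,
    keeps its own history, and returns evidence along with its verdict."""
    history = [shared_context, issue.notes]  # private, per-issue context
    await asyncio.sleep(0)  # stand-in for sandbox setup and execution
    history.append(f"executed code paths for {issue.title}")
    return Finding(issue.title, "needs-attention", artifacts=[f"{issue.title}.log"])


async def review(diff_summary: str, issues: list[Issue]) -> list[Finding]:
    """Orchestrator: fan out one subagent per issue and gather the results."""
    tasks = [run_subagent(issue, diff_summary) for issue in issues]
    return await asyncio.gather(*tasks)
```

The point of the sketch is the shape: the orchestrator passes down what it already knows, each subagent keeps its own history, and results come back in the order the issues were raised.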

A concrete example from Greptile's engineering team: a UI feature hidden behind an authentication gate. A TREX subagent figures out the auth setup, navigates through the gate, enables the feature flag, renders the feature, and comes back with a screenshot — all autonomously.

Multi-Modal Artifacts: Bullet Points Are Not Evidence

TREX outputs findings backed by a multi-modal artifact set: screenshots, logs, API traces, execution scripts, and video. Each modality covers a different part of the story.

This wasn't the first version. Initially, TREX returned bullet points: "Tested checkout flow, found failure." That turned out to be worse than useless for two reasons:

  1. Attribution opacity — A human or downstream agent couldn't tell where in the process something failed. Setup error? Assertion bug? Environment flake?
  2. Hallucination — The early agent would sometimes claim it had thoroughly tested something it hadn't. Bullet points gave no way to verify.

The fix was to pair every finding with artifacts that make the run reproducible. The artifact set for a single finding might include:

  • A screenshot of the rendered UI at the point of failure
  • Logs from the service showing error traces
  • API traces showing the request/response chain
  • The execution script the agent used
  • A video of the interaction (animation changes, page transitions)

The principle is that bad evidence is worse than no evidence. Every artifact must give a reviewer enough to verify the run themselves. This is especially important for downstream agents that need to take action on the findings — without the trace, all they have is the answer, not the steps that produced it.
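One way to enforce that principle in code is to treat a finding as incomplete until it carries the artifacts a reviewer would need to re-check it. The sketch below is an assumption about how such a check could look; the `Finding` class, the artifact kind names, and the choice of which kinds are required are all illustrative, not TREX's actual schema.

```python
from dataclasses import dataclass, field

# Assumed minimum for re-running a finding; the kind names are illustrative.
REQUIRED_FOR_VERIFICATION = {"execution_script", "logs"}


@dataclass
class Finding:
    summary: str
    artifacts: dict[str, str] = field(default_factory=dict)  # kind -> path or URL

    def missing_evidence(self) -> set[str]:
        """Return the artifact kinds a reviewer would need but cannot find."""
        return REQUIRED_FOR_VERIFICATION - set(self.artifacts)

    def is_verifiable(self) -> bool:
        return not self.missing_evidence()
```

A bare bullet point such as `Finding("Tested checkout flow, found failure.")` fails `is_verifiable()`, which gives a downstream agent a reason to reject it rather than act on it.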

The Sandbox: Disposable Execution Per Review

Every TREX review spins up a disposable sandboxed environment: an isolated compute instance per review, started fresh in milliseconds, destroyed when the run is complete. These environments run real projects with real dependencies — not just unit tests against mocks, but actual services with actual state.

Key sandbox characteristics:

| Property | Detail |
|---|---|
| Isolation | Per-review, disposable compute instance |
| Startup | Milliseconds (not seconds) |
| Capability | Full project with dependencies, not mocks |
| Lifecycle | Created fresh, destroyed after run |
| Credentials | Rotated per execution |

Building an environment from scratch for every review would be too slow, so TREX keeps startup fast with reusable base images and per-repository snapshots. A repository is cloned once, captured as a snapshot, and resumed for subsequent reviews. Each review still fetches the exact PR commits and rotates credentials before execution begins.

The caching strategy is carefully balanced: a cache that includes too little is slow (cold starts every time), while a cache that includes too much becomes "haunted" — stale state carries over between runs and produces unreliable results. The useful kind of cache is warm enough to move quickly and fresh enough to trust.
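The "warm but not haunted" trade-off can be expressed as a reuse check. The sketch below is hypothetical: the `Snapshot` fields, the 24-hour limit, and the lockfile fingerprint are assumptions chosen for illustration, since the post does not say which signals TREX uses.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    repo: str
    base_commit: str     # commit the snapshot was captured at
    age_hours: float
    lockfile_hash: str   # fingerprint of dependency manifests at capture time


def should_reuse(snapshot: Snapshot, current_lockfile_hash: str,
                 max_age_hours: float = 24.0) -> bool:
    """Reuse a snapshot only if it is recent and its dependencies still match."""
    if snapshot.age_hours > max_age_hours:
        return False  # too old: stale state may carry over
    if snapshot.lockfile_hash != current_lockfile_hash:
        return False  # dependencies changed: rebuild instead
    return True
```

Whatever the real signals are, the useful property is that reuse is an explicit decision with a reason attached, so a rebuilt environment can be explained and a reused one can be trusted.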

Model-Agnostic Evaluation Harness

TREX is designed around a model-agnostic harness that allows hot-swapping between frontier models without rebuilding the pipeline. This is not theoretical flexibility — the main orchestrator agent and the TREX subagents can use different providers simultaneously. Multiple models can run within the same review.
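A model-agnostic harness usually comes down to a narrow interface that every provider adapter satisfies. The following sketch assumes a single `complete` method and uses a stand-in `EchoModel` instead of real provider clients; none of these names come from Greptile.

```python
from typing import Protocol


class ModelClient(Protocol):
    """The only thing the pipeline needs from a provider (assumed interface)."""
    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in provider for the sketch; a real adapter would call an API."""
    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


class Review:
    """Orchestrator and subagent may be backed by different providers."""
    def __init__(self, orchestrator: ModelClient, subagent: ModelClient) -> None:
        self.orchestrator = orchestrator
        self.subagent = subagent

    def run(self, diff: str) -> list[str]:
        plan = self.orchestrator.complete(f"find issues in: {diff}")
        check = self.subagent.complete(f"execute and verify: {plan}")
        return [plan, check]
```

Swapping a model then means passing a different object to `Review`, for example `Review(EchoModel("provider-a"), EchoModel("provider-b"))`, with no change to the pipeline itself.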

The evaluation framework measures two primary metrics:

  • Recall: How many real bugs are caught, measured against open-source PRs or customer data where comments were addressed
  • Precision: defined here as consistency across runs rather than the usual share of flagged issues that are real. If you review the same PR twice, do you find roughly the same set of issues?
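Both metrics reduce to set arithmetic over issue identifiers. The sketch below reads precision as run-to-run consistency, as the text describes it, and scores it with Jaccard similarity; that choice of formula is an assumption, since the post does not say how the team computes it.

```python
def recall(found: set[str], known_bugs: set[str]) -> float:
    """Share of known real bugs that the review caught."""
    return len(found & known_bugs) / len(known_bugs) if known_bugs else 1.0


def run_consistency(run_a: set[str], run_b: set[str]) -> float:
    """Jaccard overlap between the issue sets of two reviews of the same PR."""
    union = run_a | run_b
    return len(run_a & run_b) / len(union) if union else 1.0
```

For example, a review that finds two of four known bugs has a recall of 0.5, and two runs that share two issues out of four distinct ones have a consistency of 0.5.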

Latency is intentionally deprioritized. The team's position is that developers would rather wait longer for an accurate result than get a fast result they can't trust.

The open-source evaluation harness they use performs on par with native provider harnesses. There's no meaningful quality penalty for being model-agnostic — a result the team says they wouldn't have predicted before testing it.

How TREX Compares to Static-Analysis Reviewers

The current AI code review landscape breaks into three categories:

| Category | Examples | Approach | Limitations |
|---|---|---|---|
| Static analysis | CodeRabbit, SonarQube, ESLint | Pattern-match diffs against rules | Miss state-dependent, runtime, and integration bugs |
| LLM-based review | GitHub Copilot Code Review, CodeRabbit AI, Bito | Read diff with LLM context | Ceiling on what can be inferred from static text |
| Execution-based | TREX | Run code in sandbox, observe behavior | Slower, more expensive compute per review |

Static-analysis and LLM-based reviewers can process a review in seconds to a few minutes. They catch a meaningful set of issues — SQL injection patterns, type mismatches, style violations, known vulnerability patterns. But they share the same fundamental ceiling: they can reason about what the code says, not what it does.

TREX is slower and more expensive per run (compute costs for sandboxed execution), but it catches a different class of bugs — the runtime bugs that no amount of diff-reading will find. The two approaches are complementary, not either-or. A practical pipeline runs static analysis fast for the obvious issues and execution-based review deeper for the subtle ones.

Implications for CI/CD Pipelines

Adding code execution to the review loop changes several assumptions about CI/CD:

Pipeline stage placement. Traditional AI review hooks in at PR submission (pre-merge). Execution-based review could also run post-merge for regression suites, or on a schedule for long-running integration tests. The sandbox model makes this practical because environments are disposable and cost scales with usage.

Cost modeling. Code execution consumes real compute. Teams need to factor in execution costs per review — Greptile has transitioned to a base + usage pricing model. The question is whether catching runtime bugs before merge pays for the compute cost. For teams shipping to production multiple times per day, the math is usually favorable.

Feedback latency. Static reviewers return feedback in seconds to a few minutes. Execution-based reviewers take longer because they need to set up environments and run code. TREX intentionally prioritizes accuracy over latency. Teams need to decide whether slower but deeper review fits their workflow. For most teams, the trade-off is acceptable: run the fast static review immediately, run the execution review asynchronously.
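The "fast now, deep later" arrangement can be sketched with `asyncio`: both reviews start together, and each result is posted as soon as it is ready. The function names and the sleep durations are placeholders, not part of any real reviewer's API.

```python
import asyncio


async def static_review(pr: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for a fast diff-only pass
    return f"static feedback for {pr}"


async def execution_review(pr: str) -> str:
    await asyncio.sleep(0.05)  # stand-in for sandbox setup and a real run
    return f"execution feedback for {pr}"


async def review_pr(pr: str, post) -> None:
    """Start both reviews at once and post each result when it finishes."""
    pending = [asyncio.create_task(static_review(pr)),
               asyncio.create_task(execution_review(pr))]
    for finished in asyncio.as_completed(pending):
        post(await finished)
```

Here `post` is any callable that delivers a comment, so the developer sees the quick feedback without waiting for the slower execution run.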

False positive profile. Static analysis and LLM-based reviewers can flag issues that aren't actually bugs (false positives from over-matching). Execution-based reviewers have a different false positive profile — environment instability, flaky tests, timeout issues. The artifact trail makes these easier to diagnose and dismiss.

Pitfalls

Two-agent context management is hard. The three-iteration architecture evolution at Greptile is instructive. Standalone agents missed context, single agents overflowed context, and the agent-in-agent solution required careful scoping. If you're building a similar system, start with the orchestrator-subagent pattern and scope each subagent narrowly.

Artifacts require trust, not just capture. Generating a screenshot or log doesn't automatically make the finding trustworthy. The first version of TREX returned only bullet points, and its hallucinated claims could not be checked. The key was building enough trace information so that a reviewer (human or downstream agent) can independently verify the run. If your artifacts can't be verified, they're decoration.

Cache sizing is critical for sandbox speed. Reusable base images and repository snapshots make execution-based review practical. Too little cache means cold starts dominate runtime. Too much cache produces haunted environments with stale state. Monitor cache hit rates and snapshot freshness as first-class metrics.

Model-agnosticism has no quality penalty — but test it. Greptile's finding that their model-agnostic harness performs on par with native provider harnesses is specific to their evaluation framework. If you build a similar system, measure recall and precision per provider and per model. Don't assume agnosticism is free.

Cost climbs with PR complexity. An orchestrator spawning subagents per issue means complex PRs with many issues to investigate cost more to review. Consider setting per-PR budget limits or using the fast static review as a gate (only run TREX on PRs that pass the static check).
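Both mitigations fit in one small planning function. This is a hypothetical sketch: the cost estimates, the budget unit, and the rule of skipping an issue that does not fit are assumptions, not a description of Greptile's pricing or scheduling.

```python
def plan_execution(issues: list[tuple[str, float]], budget: float,
                   static_passed: bool) -> list[str]:
    """Pick which issues get a subagent, in priority order, within budget.

    issues: (name, estimated_cost) pairs, already ordered by priority.
    """
    if not static_passed:
        return []  # gate: skip execution until the static check passes
    selected: list[str] = []
    spent = 0.0
    for name, cost in issues:
        if spent + cost <= budget:  # skip anything that would exceed the budget
            selected.append(name)
            spent += cost
    return selected
```

With a budget of 5.0 and issues costing 3.0, 4.0, and 2.0, the first and third are selected and the second is skipped, which keeps the per-PR cost bounded however many issues the orchestrator raises.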

The Bottom Line

TREX represents a shift in how AI code review can work — from reading code to running it. The subagent architecture, the artifact pipeline, and the sandboxed execution environment each solve a specific problem that previous iterations didn't.

Greptile's stated vision is a "world with no bugs" — and they're no longer positioning themselves as a code review tool, but as a validation suite: an end-to-end layer that automates what engineering teams have done manually for decades, running on every PR.

Whether that vision lands depends on whether the execution costs are worth the bugs caught. But the architectural patterns — agent-in-agent orchestration, multi-modal evidence chains, disposable sandboxed execution — are worth studying regardless of where TREX lands in the market.