OpenEnv — Open-Source Training Environments for Agentic RL

Complete tutorial on OpenEnv: the community-backed open-source environment standard for training agents with reinforcement learning. Covers architecture, setup, pre-built environments, custom environment building, and GRPO training with TRL.

June 14, 2026
openenv, agentic-rl, reinforcement-learning, grpo, trl, training-environments, huggingface, meta-pytorch

What You'll Build

By the end of this tutorial, you'll have:

  • A running OpenEnv environment (both pre-built and custom)
  • A Python client that connects to an environment, sends actions, and processes observations
  • A custom environment you built from scratch with typed actions and rewards
  • A model fine-tuned with GRPO through TRL's OpenEnv integration
  • An understanding of why the community is standardizing around OpenEnv for agentic RL evaluation

What Is OpenEnv?

OpenEnv is an interoperability layer for agentic reinforcement learning environments. It's not a training framework or a reward system — it's the common socket that environments, training loops, and evaluation harnesses all plug into.

The project started at Hugging Face and recently transitioned to a community-governed project coordinated by a committee including Hugging Face, Meta-PyTorch, NVIDIA, and Unsloth. The goal: make agentic RL environments as composable and standardized as model weights on the Hub.

Architecture

OpenEnv separates environments from training loops with a client/server model:

Layer | What It Does
Environment Server | Runs the actual environment logic (game, code sandbox, web browser, etc.). Packaged as a Docker container or HF Space.
Client Library | Typed Python client that communicates with the server over WebSocket. Exposes reset(), step(), and state().
Training Loop | TRL, Unsloth, SkyRL, or any other framework calls the client. OpenEnv doesn't care which training algorithm you use.
MCP Layer | Model Context Protocol is a first-class citizen. Every environment exposes tools via MCP, making them usable in both training and inference.

Key design decisions:

  • Gymnasium-style API: Every environment exposes reset(), step(), and state(). If you've used Gym/Gymnasium, you already know the shape.
  • HTTP + WebSocket: Environment servers communicate over WebSocket for low-latency multi-step interactions. HTTP for setup and discovery.
  • MCP as a native protocol: Environments speak MCP natively, so any MCP-compliant tool (Claude Code, OpenClaw, custom agents) can interact with them.
  • Docker packaging: Environments are canonically distributed as Docker images for isolation, reproducibility, and deployment portability.
  • Typed actions and observations: Actions and observations are Pydantic models, giving you type safety and IDE autocomplete.
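
To make the reset()/step()/state() shape concrete, here is a minimal in-process sketch. It is not OpenEnv code: GuessNumberEnv, GuessAction, and the other names are hypothetical, and stdlib dataclasses stand in for Pydantic models so the example runs without dependencies.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class GuessAction:
    """Typed action: the agent's guess."""
    value: int


@dataclass
class GuessObservation:
    """Typed observation returned to the agent."""
    hint: str


@dataclass
class StepResult:
    observation: GuessObservation
    reward: float
    done: bool


@dataclass
class EpisodeState:
    episode_id: str = field(default_factory=lambda: str(uuid4()))
    step_count: int = 0


class GuessNumberEnv:
    """Toy environment exposing a reset()/step()/state() surface."""

    def __init__(self, target: int):
        self._target = target
        self._state = EpisodeState()

    def reset(self) -> StepResult:
        self._state = EpisodeState()  # fresh episode id, step counter at 0
        return StepResult(GuessObservation(hint="start"), 0.0, False)

    def step(self, action: GuessAction) -> StepResult:
        self._state.step_count += 1
        if action.value == self._target:
            return StepResult(GuessObservation(hint="correct"), 1.0, True)
        hint = "higher" if action.value < self._target else "lower"
        return StepResult(GuessObservation(hint=hint), 0.0, False)

    def state(self) -> EpisodeState:
        return self._state


def binary_search_agent(env: GuessNumberEnv, low: int, high: int) -> int:
    """Drive one episode with a bisection policy; return the steps taken."""
    env.reset()
    while low <= high:
        guess = (low + high) // 2
        result = env.step(GuessAction(value=guess))
        if result.done:
            break
        if result.observation.hint == "higher":
            low = guess + 1
        else:
            high = guess - 1
    return env.state().step_count
```

The agent only touches the three methods and the typed result, which is the property that lets a training loop stay independent of what the environment does internally.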

Quick Start

Install

The core library is on PyPI:

pip install openenv-core

Or clone the monorepo for environment clients and tutorials:

git clone https://github.com/huggingface/OpenEnv.git
cd OpenEnv
pip install -e .

Connect to a Running Environment

OpenEnv hosts ready-to-use environments as Hugging Face Spaces. Let's connect to the Echo environment — a simple test environment that echoes back anything you send it.

import asyncio
from echo_env import EchoAction, EchoEnv

async def main():
    # Connect to the public HF Space
    async with EchoEnv(
        base_url="https://openenv-echo-env.hf.space"
    ) as env:
        # Reset starts a new episode
        result = await env.reset()
        print(f"Reset: {result.observation.echoed_message}")

        # Step sends an action, gets an observation + reward + done flag
        result = await env.step(
            EchoAction(message="Hello, OpenEnv!")
        )
        print(f"Echoed: '{result.observation.echoed_message}'")
        print(f"Reward: {result.reward}")

asyncio.run(main())

Expected output:

Reset: Echo environment ready!
Echoed: 'Hello, OpenEnv!'
Reward: 0.0

Tip:

Sync vs async. OpenEnv clients are async by default. For Jupyter notebooks and simple scripts, use the .sync() wrapper:

with EchoEnv(base_url="...").sync() as env:
    result = env.reset()
    result = env.step(EchoAction(message="Hello"))
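
For intuition, one way a blocking wrapper over an async client could be built is sketched below. This is not OpenEnv's implementation: AsyncCounterClient and SyncAdapter are hypothetical names. The sketch runs a private event loop on a background thread, an approach that also works when the caller's thread already has a running loop (as in Jupyter).

```python
import asyncio
import threading


class AsyncCounterClient:
    """Hypothetical async client standing in for an environment client."""

    def __init__(self):
        self.count = 0
        self.closed = False

    async def reset(self) -> int:
        await asyncio.sleep(0)  # stands in for network I/O
        self.count = 0
        return self.count

    async def step(self, amount: int) -> int:
        await asyncio.sleep(0)
        self.count += amount
        return self.count

    async def close(self) -> None:
        self.closed = True


class SyncAdapter:
    """Blocking facade that runs coroutines on a background event loop."""

    def __init__(self, client: AsyncCounterClient):
        self._client = client
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def _run(self, coro):
        # Schedule on the background loop and block until the result is ready.
        return asyncio.run_coroutine_threadsafe(coro, self._loop).result()

    def __enter__(self) -> "SyncAdapter":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self._run(self._client.close())
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()

    def reset(self) -> int:
        return self._run(self._client.reset())

    def step(self, amount: int) -> int:
        return self._run(self._client.step(amount))
```

Each call blocks the caller until the coroutine finishes, so this trades concurrency for simplicity; keep the async client when you need many environments in flight at once.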

Install and Run a Pre-Built Environment

You can install environments directly from Hugging Face Spaces:

# Echo environment
pip install "openenv-echo-env @ git+https://huggingface.co/spaces/openenv/echo_env"

# Wordle environment (from TextArena)
pip install "openenv-textarena @ git+https://huggingface.co/spaces/openenv/wordle"

# Catch environment (from OpenSpiel)
pip install "openenv-openspiel-env @ git+https://huggingface.co/spaces/openenv/openspiel_env"

Each installed environment gives you a typed client (EchoEnv, WordleEnv, OpenSpielEnv) that handles the WebSocket connection and serialization. You don't need to worry about JSON parsing, connection management, or retries — the client handles all of that.
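
As a rough picture of the serialization work a typed client hides, the sketch below round-trips an action and a step result through JSON. The frame layout (the "type" and "action" keys) is an assumption made for illustration, not OpenEnv's wire format, and the dataclasses are local stand-ins for the real client types.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class EchoAction:
    message: str


@dataclass
class EchoObservation:
    echoed_message: str


@dataclass
class StepResult:
    observation: EchoObservation
    reward: float
    done: bool


def encode_action(action: EchoAction) -> str:
    """Serialize a typed action into a JSON text frame (hypothetical layout)."""
    return json.dumps({"type": "step", "action": asdict(action)})


def decode_result(frame: str) -> StepResult:
    """Parse a JSON frame into typed objects; bad input raises instead of passing silently."""
    payload = json.loads(frame)
    return StepResult(
        observation=EchoObservation(**payload["observation"]),
        reward=float(payload["reward"]),
        done=bool(payload["done"]),
    )
```

Because decoding builds typed objects, a missing or misspelled field fails at the boundary rather than deep inside the training loop.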

Using Pre-Built Environments: Wordle Example

Let's use the Wordle environment. The agent sees the game state after each guess and receives a reward based on correctness.

import asyncio
from wordle_env import WordleEnv, WordleAction

async def play_wordle():
    async with WordleEnv(
        base_url="https://openenv-wordle.hf.space"
    ) as env:
        state = await env.reset()
        print(f"Game started. Word length: {state.observation.word_length}")

        guesses = ["STARE", "HOUSE", "MOUSE"]
        for guess in guesses:
            result = await env.step(WordleAction(word=guess))
            obs = result.observation
            print(f"'{guess}' → {obs.feedback}  (reward: {result.reward})")

            if result.done:
                if result.reward > 0:
                    print(f"Won! Word was {obs.target_word}")
                break

asyncio.run(play_wordle())

Expected output (simulated):

Game started. Word length: 5
'STARE' → 🟨⬛⬛⬛🟩  (reward: 0.0)
'HOUSE' → ⬛🟩🟩🟩🟩  (reward: 0.0)
'MOUSE' → 🟩🟩🟩🟩🟩  (reward: 1.0)
Won! Word was MOUSE

The reward signal is what the RL training loop optimizes for. In Wordle, it's sparse (1.0 on correct guess, 0.0 otherwise). In a code environment, it might be structured (passes tests → 1.0, compiles → 0.5, etc.).
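
A structured reward of that kind could look like the sketch below. The function name, the test-case format, and the exact tiers are illustrative assumptions; exec() is used for brevity only and is not a sandbox.

```python
def structured_code_reward(code: str, func_name: str, tests: list) -> float:
    """Return 1.0 if every test passes, 0.5 if the code only compiles, 0.0 otherwise.

    `tests` is a list of (args_tuple, expected_value) pairs.
    """
    try:
        compiled = compile(code, "<submission>", "exec")
    except SyntaxError:
        return 0.0  # does not even compile

    namespace: dict = {}
    try:
        exec(compiled, namespace)  # demo only: runs untrusted code in-process
        func = namespace[func_name]
        passed = all(func(*args) == expected for args, expected in tests)
    except Exception:
        return 0.5  # compiled, but crashed or never defined the function
    return 1.0 if passed else 0.5
```

Compared with a sparse 0/1 signal, the intermediate tier gives the policy some gradient toward syntactically valid code before it can solve the task.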

Building a Custom Environment

Now let's build our own environment from scratch. We'll create a Code Feedback environment: the agent writes a Python function, and the environment compiles it and reports whether it runs without errors.

Step 1: Define Action and Observation Types

# my_env/models.py
from pydantic import BaseModel
from typing import Optional

class CodeAction(BaseModel):
    code: str
    function_name: str

class CodeObservation(BaseModel):
    status: str  # "ok", "error", "timeout"
    output: Optional[str] = None
    error: Optional[str] = None
    execution_time_ms: Optional[float] = None

Step 2: Implement the Environment

# my_env/environment.py
from uuid import uuid4
import time
from openenv.core.env_server.interfaces import Environment
from openenv.core.env_server.types import State
from my_env.models import CodeAction, CodeObservation

class CodeFeedbackEnvironment(Environment[CodeAction, CodeObservation, State]):
    """An environment that executes Python code and returns compile/runtime feedback."""

    def __init__(self):
        self._state = State(episode_id=str(uuid4()), step_count=0)

    def reset(self) -> CodeObservation:
        self._state = State(episode_id=str(uuid4()), step_count=0)
        return CodeObservation(
            status="ok",
            output="Environment ready. Submit Python code for feedback."
        )

    def step(self, action: CodeAction) -> tuple[CodeObservation, float, bool]:
        self._state.step_count += 1

        # Unsandboxed exec, for demo only: use a real sandbox in production
        start = time.time()
        try:
            exec(action.code, {"__builtins__": __builtins__}, {})
            elapsed = (time.time() - start) * 1000
            return (
                CodeObservation(
                    status="ok",
                    output=f"Function '{action.function_name}' defined successfully.",
                    execution_time_ms=round(elapsed, 2)
                ),
                1.0,   # reward
                False  # not done
            )
        except Exception as e:
            elapsed = (time.time() - start) * 1000
            return (
                CodeObservation(
                    status="error",
                    error=str(e),
                    execution_time_ms=round(elapsed, 2)
                ),
                -0.5,  # penalty
                False
            )

    @property
    def state(self) -> State:
        return self._state
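
Note that step() reports success whenever the code executes, without checking that function_name was actually defined. A small helper along the following lines could close that gap. The helper name is hypothetical, and like the environment itself it uses unsandboxed exec() for demonstration only.

```python
def defines_callable(code: str, function_name: str) -> bool:
    """Execute code in a fresh namespace and report whether it defines a callable with that name."""
    namespace: dict = {}
    exec(code, namespace)  # demo only: raises on invalid code, runs in-process
    return callable(namespace.get(function_name))
```

Inside step(), the environment could return the 1.0 reward only when this check passes and treat a missing function as an error.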

Step 3: Package and Run

Add a pyproject.toml and a server entry point:

.
├── my_env/
│   ├── __init__.py
│   ├── models.py
│   └── environment.py
├── server/
│   ├── Dockerfile
│   └── run.py
├── pyproject.toml
└── README.md

The server entry point is minimal:

# server/run.py
from openenv.core.env_server import serve
from my_env.environment import CodeFeedbackEnvironment

if __name__ == "__main__":
    serve(
        environment_class=CodeFeedbackEnvironment,
        host="0.0.0.0",
        port=8000,
    )

Run locally without Docker (this assumes pyproject.toml declares a `server` script entry point that accepts these flags):

uv run server --host 0.0.0.0 --port 8000

Or build and run with Docker:

# Build from the repo root so the image can include my_env/ as well as server/
docker build -t code-feedback-env -f server/Dockerfile .
docker run -p 8000:8000 code-feedback-env

Step 4: Test Your Environment

import asyncio
from my_env import CodeFeedbackEnv  # auto-generated client
from my_env.models import CodeAction

async def test():
    async with CodeFeedbackEnv(base_url="http://localhost:8000") as env:
        await env.reset()

        # Test valid code
        result = await env.step(CodeAction(
            code="def add(a, b): return a + b",
            function_name="add"
        ))
        print(result.observation.status)   # "ok"
        print(result.reward)               # 1.0

        # Test invalid code
        result = await env.step(CodeAction(
            code="def broken(",  # SyntaxError
            function_name="broken"
        ))
        print(result.observation.status)   # "error"
        print(result.reward)               # -0.5

asyncio.run(test())

Note:

Real safety. The exec() in this demo is not sandboxed. For production environments that execute untrusted code, use Docker isolation, namespace containers, or a proper sandbox (gVisor, Firecracker). The OpenEnv Docker packaging makes this straightforward — each environment runs in its own container with no host access.
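
As a first step away from in-process exec(), the sketch below runs the submission in a child interpreter with a wall-clock limit, which also makes it possible to produce the "timeout" status that CodeObservation declares. The function name and return shape are assumptions for illustration. A separate process gives crash and hang isolation only; it is not a security sandbox and does not replace container or VM isolation.

```python
import subprocess
import sys


def run_with_timeout(code: str, timeout_s: float = 5.0) -> dict:
    """Run code in a child Python process and classify the result as ok, error, or timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores user site and env vars
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed if it runs past this limit
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "output": None, "error": f"exceeded {timeout_s}s"}

    if proc.returncode != 0:
        return {"status": "error", "output": proc.stdout, "error": proc.stderr.strip()}
    return {"status": "ok", "output": proc.stdout, "error": None}
```

The returned keys mirror the fields of CodeObservation, so the dictionary could be unpacked straight into the model inside step().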

Training an Agent with GRPO and TRL

This is where OpenEnv shines. The Hugging Face TRL library has first-class OpenEnv integration — GRPOTrainer can pull environments directly and use them as training grounds.

How the Integration Works

  1. You define an environment factory — a function that creates environment instances
  2. You define a reward function — a function that maps environment trajectories to rewards
  3. GRPOTrainer handles the rest: rollout generation, advantage calculation, policy updates
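
The advantage calculation in step 3 is the part specific to GRPO: each rollout is scored relative to the other rollouts sampled for the same prompt, so no learned value function is needed. Below is a minimal sketch of that normalization, assuming the standard mean and standard-deviation formulation. The function name and epsilon value are illustrative, and real implementations differ in details such as sample versus population deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-4) -> list:
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt (num_generations=4): two wins, two losses.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Rollouts that beat their group's average get positive advantages and are reinforced. When every rollout in a group earns the same reward, all advantages are zero and that group contributes no learning signal.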

Minimal Training Script

Here's a minimal setup for training a model to play Wordle:

# train_wordle.py
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# 1. Define the prompt dataset
prompts = [
    "Play Wordle. Guess the hidden 5-letter word, submitting one full 5-letter guess per turn.",
]
dataset = Dataset.from_dict({
    "prompt": [[{"role": "user", "content": p}] for p in prompts]
})

# 2. Define environment factory
def env_factory():
    """Create a fresh Wordle environment instance for each rollout."""
    from wordle_env import WordleEnv
    return WordleEnv(base_url="https://openenv-wordle.hf.space")

# 3. Define reward function
def reward_func(environments, **kwargs):
    """Reward: on a win, 1.0 minus 0.1 per extra guess (floored at 0.5); 0.0 otherwise."""
    rewards = []
    for env in environments:
        # env.trajectory contains the step history
        if env.trajectory and env.trajectory[-1].reward > 0:
            # Won — reward based on speed (fewer guesses = higher reward)
            steps = len(env.trajectory)
            rewards.append(max(0.5, 1.0 - (steps - 1) * 0.1))
        else:
            rewards.append(0.0)
    return rewards

# 4. Configure and train
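training_args = GRPOConfig(
    output_dir="./wordle-agent",
    num_generations=4,
    max_steps=100,
    per_device_train_batch_size=4,  # global batch must be divisible by num_generations
    use_vllm=True,
    vllm_mode="server",  # rollouts are generated by a separate `trl vllm-serve` process
    vllm_server_base_url="http://localhost:8000",
)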

trainer = GRPOTrainer(
    model="Qwen/Qwen3-1.7B",
    args=training_args,
    train_dataset=dataset,
    reward_funcs=[reward_func],
    environment_factory=env_factory,
)

trainer.train()
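
Before spending GPU time, it can help to check the reward shaping offline. The sketch below copies the reward logic from the script and feeds it stub objects. StubStep, StubEnv, and episode() are hypothetical test helpers, and the only interface assumed is the env.trajectory list of steps with a reward attribute that the script itself relies on.

```python
from dataclasses import dataclass, field


@dataclass
class StubStep:
    reward: float


@dataclass
class StubEnv:
    trajectory: list = field(default_factory=list)


def reward_func(environments, **kwargs):
    """Same shaping as the training script: speed bonus on a win, 0.0 otherwise."""
    rewards = []
    for env in environments:
        if env.trajectory and env.trajectory[-1].reward > 0:
            steps = len(env.trajectory)
            rewards.append(max(0.5, 1.0 - (steps - 1) * 0.1))
        else:
            rewards.append(0.0)
    return rewards


def episode(num_steps: int, won: bool) -> StubEnv:
    """Build a fake episode: zero reward each step, with an optional winning final step."""
    steps = [StubStep(0.0) for _ in range(num_steps)]
    if won and steps:
        steps[-1] = StubStep(1.0)
    return StubEnv(trajectory=steps)


rewards = reward_func([
    episode(1, won=True),    # solved on the first guess
    episode(3, won=True),    # solved on the third guess
    episode(6, won=False),   # ran out of guesses
    episode(0, won=False),   # empty trajectory
])
```

A check like this catches sign errors and off-by-one mistakes in the shaping before they show up as a flat reward curve.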

Running the Training

For real training, you'd run two terminals:

# Terminal 1: Start the vLLM inference server
CUDA_VISIBLE_DEVICES=0 trl vllm-serve \
    --model Qwen/Qwen3-1.7B \
    --host 0.0.0.0 --port 8000

# Terminal 2: Run GRPO training with OpenEnv
# (train_wordle.py parses no CLI flags; vLLM connection settings belong in GRPOConfig)
CUDA_VISIBLE_DEVICES=1 python train_wordle.py

Expected training output (abbreviated):

[Step 0/100] loss: 0.892 | avg reward: 0.12 | avg trajectory length: 4.2
[Step 10/100] loss: 0.743 | avg reward: 0.31 | avg trajectory length: 3.8
[Step 25/100] loss: 0.612 | avg reward: 0.55 | avg trajectory length: 3.1
[Step 50/100] loss: 0.489 | avg reward: 0.74 | avg trajectory length: 2.8
[Step 75/100] loss: 0.401 | avg reward: 0.81 | avg trajectory length: 2.5
[Step 100/100] loss: 0.356 | avg reward: 0.87 | avg trajectory length: 2.3

Tip:

Start small. You can test the full pipeline end-to-end on a free Colab instance by using a small model (Qwen3-1.7B or SmolLM2) and a lightweight environment like EchoEnv. The OpenEnv course notebooks on GitHub are designed to run top-to-bottom in Colab.

The Standardization Push

OpenEnv's community transition in June 2026 marked a turning point for agentic RL. Three active RFCs are shaping the standard:

RFC 006: Tasksets via Datasets

Environments wire their task definitions directly to Hugging Face datasets. This makes environments as composable as benchmarks — you can mix and match tasks, reward functions, and environments without rewriting anything.

RFC 007: External Rewards

Reward functions can be defined in external libraries (TRL, custom Python, etc.) while OpenEnv handles deployment and the environment interface. This separation of concerns means environment authors don't need to be RL experts, and RL practitioners can reuse environments across reward schemes.

RFC 008: Auto-Validation

A standardized way to measure environment quality and its specific contribution to model learning. This gives the community a scalable way to evaluate environments and drive up quality — think hackathons for environment building with automated scoring.

Ecosystem Support

OpenEnv is already integrated with the major RL training frameworks:

Framework | Integration | Status
TRL | Native environment_factory support in GRPOTrainer | Live
Unsloth | Drop-in acceleration for GRPO training with OpenEnv | Live
SkyRL (UCB) | First-class OpenEnv client integration | Live
Lightning AI | Environment management via Fabric | In development
Axolotl AI | YAML-based env configuration | Preview
vLLM | Inference serving for agent rollouts | Live
TorchForge (Meta) | Deep integration planned | Preview

Why This Matters

Before OpenEnv, every agentic RL project had its own bespoke environment setup. Researchers at different labs couldn't reproduce each other's results without running a maze of custom scripts. Environment authors had to build training loop adapters for every RL framework.

OpenEnv standardizes the middle layer. The result:

  • Reproducibility: Same environment, same reward signal, same training loop — across labs, frameworks, and hardware.
  • Composability: Swap environments without touching your training code. Swap training frameworks without touching your environment.
  • Benchmarking: When every environment has a standard interface, agent performance becomes directly comparable.
  • Production path: Agents trained via RL in an environment can use that environment's tools at inference time through the same MCP interface, with no porting needed.
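
The composability point can be illustrated with a rollout loop written against an interface rather than a concrete environment. Everything in this sketch is hypothetical (the Env protocol, both toy environments, and rollout()), and it uses plain strings and tuples instead of OpenEnv's typed models to stay short.

```python
from typing import Callable, Protocol


class Env(Protocol):
    def reset(self) -> str: ...
    def step(self, action: str) -> tuple: ...


class EchoLikeEnv:
    """Ends after one step and always pays zero reward."""

    def reset(self) -> str:
        return "ready"

    def step(self, action: str) -> tuple:
        return action, 0.0, True


class CountdownEnv:
    """Ends after n steps and pays 1.0 on the final step."""

    def __init__(self, n: int):
        self._n = n
        self._left = n

    def reset(self) -> str:
        self._left = self._n
        return f"{self._left} left"

    def step(self, action: str) -> tuple:
        self._left -= 1
        done = self._left == 0
        return f"{self._left} left", (1.0 if done else 0.0), done


def rollout(env: Env, policy: Callable[[str], str], max_steps: int = 10) -> float:
    """Environment-agnostic loop: return the total reward for one episode."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, done = env.step(policy(obs))
        total += reward
        if done:
            break
    return total
```

The same rollout() call works for either environment, which is the small-scale version of swapping environments without touching training code.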

Key Takeaway

OpenEnv decouples environment execution from RL training. By standardizing the interface layer — typed actions/observations, WebSocket transport, MCP compatibility, and Docker packaging — it makes agentic RL environments as reusable and interoperable as models on the Hub. The community governance ensures no single vendor controls the standard. Start with a pre-built environment from HF Spaces, build a custom one for your domain, and plug it into TRL's GRPOTrainer for your first agentic RL training run.

What's Next

  • Browse the OpenEnv monorepo at github.com/huggingface/OpenEnv
  • Try the openenv-course notebooks on Colab
  • Read the TRL integration docs for deeper training configurations
  • Review the RFCs in the repo — particularly RFC 006 (Tasksets), 007 (External Rewards), and 008 (Auto-validation)
  • Build and publish your own environment to HF Spaces
  • Join the community discussions in the Hugging Face OpenEnv organization