Friday, June 26, 2026
One Command to Run Any Model: vLLM on Hugging Face Jobs

Between the fully managed inference APIs (Together, Anthropic, Replicate) and the raw GPU cloud providers (Lambda, RunPod, Vast) there's been a gap. You either pay per-token on someone else's pricing table, or you spend an afternoon configuring Kubernetes to run vLLM yourself.
Hugging Face just filled that gap with a single command.
As of this week, you can run a private vLLM server on Hugging Face Jobs with hf jobs run — no Dockerfile, no YAML, no GPU orchestration beyond one --flavor flag. It's docker run for inference, and it might be the most practical thing Hugging Face has shipped this year.
The Command
Here's what it looks like:
hf jobs run --flavor a10g-large --expose 8000 --timeout 2h \
    vllm/vllm-openai:latest \
    vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000
You get back a URL like https://<job_id>--8000.hf.jobs — a gated, OpenAI-compatible endpoint that only accepts requests carrying your Hugging Face token. You query it with curl or any OpenAI client:
curl https://<job_id>--8000.hf.jobs/v1/chat/completions \
    -H "Authorization: Bearer $(hf auth token)" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Qwen/Qwen3-4B",
      "messages": [{"role": "user", "content": "Hello!"}]
    }'
That's it. No provisioning, no load balancers, no inference engine config. The vllm/vllm-openai:latest Docker image from the blog post handles the rest.
How It Actually Works
HF Jobs is Hugging Face's ephemeral compute service — you pick a hardware flavor, supply a Docker image and a command, and your container starts on HF-managed GPUs. The vLLM integration adds the --expose flag, which opens a port on a .hf.jobs subdomain and routes traffic to your server with token-based auth.
The security model is straightforward: gated, not public. Every request must carry a Bearer token with read access to the job's namespace. The URL itself leaks nothing — without a valid token, it's a dead endpoint. If you cancel the job (hf jobs cancel <job_id>), the endpoint dies immediately and billing stops.
Pricing Reality Check
Here's the breakdown that matters:
| Flavor | GPU | VRAM | Price |
|---|---|---|---|
| A10G-large | 1× A10G | 24 GB | $1.50/hr |
| A100-large | 1× A100 | 80 GB | $2.50/hr |
| H200 | 1× H200 | 141 GB | $5.00/hr |
| 8× H200 | 8× H200 | 1,128 GB | $40.00/hr |
Plus $0.01/hr for the exposed port. You're billed per-minute while the job is starting or running — build time is free, and failed jobs auto-stop billing.
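To make the billing rules concrete, here is a minimal sketch of a cost estimator. It uses the hourly prices from the table and the $0.01/hr port fee; rounding each job up to a whole billed minute is my assumption about how per-minute billing works, and `job_cost` is a hypothetical helper, not part of any Hugging Face tool.

```python
import math

# Exposed-port fee quoted in this post.
PORT_SURCHARGE_PER_HOUR = 0.01


def job_cost(flavor_price_per_hour, billed_seconds, exposed_port=True):
    """Estimate the bill for one job in dollars.

    Assumption: billed time is rounded up to the next whole minute.
    """
    billed_minutes = math.ceil(billed_seconds / 60)
    hourly = flavor_price_per_hour
    if exposed_port:
        hourly += PORT_SURCHARGE_PER_HOUR
    return round(hourly * billed_minutes / 60, 4)


# A 2-hour A10G-large session with one exposed port.
print(f"${job_cost(1.50, 2 * 3600):.2f}")

# A short H200 run: 10 minutes and 5 seconds of billed time.
print(f"${job_cost(5.00, 10 * 60 + 5):.4f}")
```

Because time spent starting is billed too, pass the startup time plus the run time as `billed_seconds` when estimating a real job.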
Compare this to the managed inference APIs: Together AI charges $0.88/M output tokens on Llama-3 70B. A single H200 running vLLM costs $5/hour, which buys about 5.7M output tokens at that rate. So the GPU pays for itself once it sustains roughly 1,600 output tokens per second across all concurrent requests. If your workload keeps the GPU that busy, the break-even point comes fast, especially for batch workloads, evals, and test suites where you don't need per-token pricing flexibility.
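The break-even arithmetic is easy to rerun with your own numbers. The sketch below uses the two prices quoted above and counts output tokens only; the function names are mine, and it ignores input-token charges and the exposed-port fee.

```python
def break_even_tokens_per_second(gpu_price_per_hour, api_price_per_million_tokens):
    """Sustained output rate at which an hourly GPU matches a per-token API."""
    tokens_per_hour = gpu_price_per_hour / api_price_per_million_tokens * 1_000_000
    return tokens_per_hour / 3600


def cheaper_option(total_tokens, gpu_hours, gpu_price_per_hour, api_price_per_million_tokens):
    """Compare the two bills for one batch and name the cheaper side."""
    gpu_cost = gpu_hours * gpu_price_per_hour
    api_cost = total_tokens / 1_000_000 * api_price_per_million_tokens
    return ("gpu" if gpu_cost < api_cost else "api", gpu_cost, api_cost)


rate = break_even_tokens_per_second(5.00, 0.88)
print(f"Break-even: {rate:.0f} output tokens/s")

# Hypothetical batch: 20M output tokens finished in 2 GPU-hours.
print(cheaper_option(20_000_000, 2, 5.00, 0.88))
```

The hypothetical batch shows the shape of the trade: the faster your GPU finishes a fixed number of tokens, the further the hourly price pulls ahead.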
But the HF Jobs model really shines for intermittent workloads. An always-on dedicated deployment keeps billing while it sits idle, and a per-token API charges the same rate on every token however large the batch. With Jobs, you spin up, run your batch, and cancel. You pay for exactly the compute you used, down to the minute.
Going Big
The blog post from Hugging Face's Quentin Gallouédec includes examples for larger models using tensor parallelism. The same pattern scales up trivially:
hf jobs run --flavor h200x2 --expose 8000 --timeout 2h \
    vllm/vllm-openai:latest \
    vllm serve Qwen/Qwen3.5-122B-A10B \
    --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 \
    --max-model-len 32768 --max-num-seqs 256
One flag changes the hardware, one argument matches the parallelism. If you hit OOM, dial down --max-model-len and --max-num-seqs. If you need to debug, add --ssh and hf jobs ssh <job_id> gives you a shell on the running instance.
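To see why those two knobs matter, it helps to estimate how much memory is left after the weights load. The following is a rough sketch under loud assumptions: 16-bit weights (2 bytes per parameter, which the source does not state), 122B parameters read off the model name, weights split evenly across GPUs, and vLLM's default `--gpu-memory-utilization` of 0.9. It ignores activations and runtime overhead, so treat the result as an upper bound on KV-cache room.

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Memory taken by the weights alone, in GB (assumes 16-bit weights by default)."""
    return n_params * bytes_per_param / 1e9


def kv_cache_headroom_gb(n_params, gpus, vram_per_gpu_gb,
                         bytes_per_param=2, gpu_memory_utilization=0.9):
    """Memory left for the KV cache after weights, summed over all GPUs.

    Assumption: vLLM may use `gpu_memory_utilization` of each GPU's VRAM
    and the weights are sharded evenly by tensor parallelism.
    """
    budget = gpus * vram_per_gpu_gb * gpu_memory_utilization
    return budget - weight_memory_gb(n_params, bytes_per_param)


params = 122_000_000_000  # assumed from the "122B" in the model name
print(f"weights:  {weight_memory_gb(params):.1f} GB")
print(f"headroom: {kv_cache_headroom_gb(params, gpus=2, vram_per_gpu_gb=141):.1f} GB")
```

Under these assumptions the headroom is small next to the weights, which is why shrinking `--max-model-len` and `--max-num-seqs` is the first thing to try on OOM. A lower-precision checkpoint would change the picture considerably.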
This is where the offering gets interesting. You can run a 122B MoE model on H200s without ever touching a GPU cluster config file. That's genuinely new.
What This Means for the Open-Weight Ecosystem
I've been skeptical about Hugging Face's compute play. Inference Endpoints has always felt overpriced compared to the GPU spot market, and the configuration overhead made it a hard sell against just using Together or Replicate.
HF Jobs + vLLM changes my mind for three reasons:
1. It lowers the adoption barrier for open-weight models. The biggest friction point for trying a new model isn't the model quality — it's the deployment overhead. When "try this model" becomes a single hf jobs run command, model discovery and model usage converge. You can eval five models in an afternoon.
2. It's the API-compatible escape hatch. If you're building on the OpenAI SDK and want to switch to an open-weight model, this is the lowest-friction path. Same client code, different base_url and api_key. No vendor lock-in, no proprietary SDK adapters.
3. Pay-per-minute beats pay-per-token for power users. At scale, vLLM's throughput is excellent: a single A10G can serve many concurrent requests through batching. The per-token pricing on managed APIs includes a margin for their GPU utilization risk. With HF Jobs, you're paying the raw hardware cost plus a thin HF margin. If you know your workload, you'll save money.
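For the second point, here is roughly what "same client code, different base_url and api_key" looks like with the OpenAI Python SDK. This is a sketch, not the blog post's code: `job_base_url` and `ask` are hypothetical helpers, the URL follows the `https://<job_id>--8000.hf.jobs` pattern shown earlier, and the job id and token are placeholders you supply yourself.

```python
def job_base_url(job_id, port=8000):
    """Build the OpenAI-style base URL from the endpoint pattern in this post."""
    return f"https://{job_id}--{port}.hf.jobs/v1"


def ask(job_id, hf_token, model, prompt):
    """Send one chat message to the vLLM server behind an HF Job."""
    # Imported lazily so the URL helper works without the openai package.
    from openai import OpenAI

    # The SDK sends api_key as "Authorization: Bearer <key>",
    # which is the same header the curl example uses.
    client = OpenAI(base_url=job_base_url(job_id), api_key=hf_token)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


print(job_base_url("my-job-id"))
# With a running job: ask("<job_id>", "<hf_token>", "Qwen/Qwen3-4B", "Hello!")
```

Switching back to a hosted API means changing only the two arguments passed to `OpenAI(...)`; the request code stays the same.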
The Catch (There's Always One)
The elephant in the room: cold starts. Every hf jobs run has to pull the Docker image and load the model into GPU memory. For a 4B model on an A10G, that's maybe 30-60 seconds. For a 122B model on H200s, it's several minutes. Managed APIs keep models warm; Jobs doesn't.
This makes HF Jobs + vLLM a bad fit for:
- Low-latency production endpoints (cold start kills you)
- Serverless-style bursty traffic from many users (each new job creates a new cold start)
- Workloads that need <100ms time-to-first-token
And a great fit for:
- Batch inference and offline evals
- Testing and prototyping new models
- CI/CD pipelines that need model inference
- Developer previews and demos
- Any workload where minutes of startup time don't matter
How to Try It Right Now
pip install -U "huggingface_hub>=1.20.0"
hf auth login

# Launch Qwen3-4B as a test
hf jobs run --flavor a10g-large --expose 8000 --timeout 30m \
    vllm/vllm-openai:latest \
    vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000

# Wait for "Application startup complete" in the logs
# Then query with the OpenAI client
From login to first response: about 90 seconds. No cloud console, no IAM roles, no cluster config.
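If you'd rather not watch the logs, you can poll the endpoint until it answers. The sketch below probes `/v1/models`, a route vLLM's OpenAI-compatible server exposes. I am assuming the `.hf.jobs` gateway forwards it like any other path, and `server_is_ready` and `wait_until_ready` are hypothetical helpers of my own.

```python
import time
import urllib.error
import urllib.request


def server_is_ready(base_url, token, timeout=5.0):
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    request = urllib.request.Request(
        f"{base_url}/v1/models",
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, TimeoutError, ConnectionError):
        # Covers HTTP errors, refused connections, and timeouts during startup.
        return False


def wait_until_ready(probe, max_wait_s=600.0, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() every interval_s seconds until it succeeds or time runs out."""
    deadline = clock() + max_wait_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Usage with a real job (placeholders left for you to fill in):
# url = "https://<job_id>--8000.hf.jobs"
# wait_until_ready(lambda: server_is_ready(url, "<hf_token>"))
```

Passing the probe as a callable keeps the retry loop independent of the network code, so you can swap in a different readiness check without touching the loop.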
Bottom Line
This is the first thing Hugging Face has shipped this year that makes me want to reach for HF compute instead of grabbing an API key from one of the inference providers. It's not a production inference platform: the cold-start latency and lack of autoscaling rule that out. But as a developer tool for experimenting with, evaluating, and building on open-weight models, it's the best path I've seen yet.
The takeaway is simple: the gap between "I want to try this model" and "I'm running this model" just got much, much smaller. And for a category that lives or dies on adoption friction, that's everything.
Source: Run a vLLM Server on HF Jobs in One Command by Quentin Gallouédec, Hugging Face Blog