PyTorch Cursor Rules: Deep Learning Best Practices

Cursor rules for PyTorch covering tensor device management, nn.Module patterns, DataLoader pipelines, training loop conventions, gradient handling, GPU/CUDA optimization, model persistence, and distributed training with DDP.

June 11, 2026 by PromptGenius Team
pytorch, cursor-rules, deep-learning, machine-learning, python, gpu
Overview

PyTorch is one of the most widely used deep learning frameworks, applied in research and production for computer vision, NLP, generative AI, and reinforcement learning. These cursor rules enforce proper nn.Module design, tensor device management, torch.no_grad() for inference, DataLoader patterns, training loop conventions, gradient handling, GPU optimization, model serialization, and distributed training so AI assistants generate efficient, production-ready PyTorch code.

Note:

Enforces explicit tensor device management (.to(device) over .cuda()), nn.Module patterns (super().__init__(), forward(), training/eval modes), torch.no_grad() for inference, DataLoader with pin_memory and non_blocking, optimizer.zero_grad() before backward, gradient clipping, state_dict serialization, mixed precision with torch.cuda.amp, and DistributedDataParallel over DataParallel.

Rules Configuration

---
description: Enforces PyTorch best practices including tensor device management, nn.Module patterns, DataLoader optimization, training loop structure, gradient handling, GPU memory management, model serialization, and distributed training.
globs: **/*.py,models/**/*.py,train.py,dataset.py,lightning_logs/**/*,**/torch/**
---
# PyTorch Best Practices

You are an expert in PyTorch, deep learning, and high-performance model training.
You understand tensor operations, autograd, GPU optimization, distributed training, and production model deployment.

### Tensor Operations & Device Management
- Use `.to(device)` for explicit device placement over `.cuda()` or `.cpu()`
- Define device once at the top: `device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')`
- Check device availability: `if torch.cuda.is_available():` — never assume CUDA
- Avoid moving tensors between devices inside training loops
- Use `tensor.detach()` to remove from computation graph, not `.data`
- Use `tensor.clone()` for independent copies, not `tensor.new_tensor()` or `tensor.new()`
- Use `torch.cat([a, b])` over `torch.stack([a, b])` when dimension already exists
- Prefer in-place operations with trailing `_` only when memory-constrained: `x.add_(y)`
- NEVER use `.item()` on tensors inside loops — it triggers CPU-GPU sync on every iteration
- Use `torch.set_default_device('cuda')` for global device context (PyTorch 2.0+)
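
A minimal sketch of the device and copy rules above, written to run on CPU or GPU. The tensor names and shapes are illustrative assumptions, not part of the rules.

```python
import torch

# Define the device once and reuse it everywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create tensors directly on the target device where possible.
weights = torch.randn(4, 3, device=device, requires_grad=True)
inputs = torch.randn(8, 4).to(device)  # explicit placement for an existing tensor

outputs = inputs @ weights

# detach() shares storage with `outputs` but drops the autograd history.
frozen = outputs.detach()

# detach().clone() gives an independent copy with no graph attached.
snapshot = outputs.detach().clone()
snapshot.add_(1.0)  # in-place edit; `outputs` is unaffected

# torch.cat joins along an existing dimension; torch.stack adds a new one.
joined = torch.cat([inputs, inputs], dim=0)     # shape (16, 4)
stacked = torch.stack([inputs, inputs], dim=0)  # shape (2, 8, 4)
```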

### nn.Module Design
- ALWAYS call `super().__init__()` in `__init__` as the FIRST line
- Define all layers and parameters in `__init__` — never create layers in `forward()`
- Register tensors with `nn.Parameter()` so they're included in `model.parameters()`
- Put the forward computation in `forward()` and invoke the module as `model(x)`, not `model.forward(x)`, so registered hooks run; auxiliary methods should return tensors that `forward()` consumes
- NEVER call `model.eval()` or `model.train()` inside `forward()` — set mode externally
- Use `self.register_buffer(name, tensor)` for non-parameter tensors that need device tracking
- Override `self.train(mode)` if custom training/eval behavior is needed
- Use `nn.Sequential` for simple sequential architectures
- Use `nn.ModuleList` or `nn.ModuleDict` for dynamic module collections
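
The module rules above could look like this in practice. `ScaledMLP`, its layer sizes, and the `input_mean` buffer are hypothetical names chosen for illustration.

```python
import torch
import torch.nn as nn


class ScaledMLP(nn.Module):
    """Hypothetical example module; names and sizes are illustrative."""

    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int, num_layers: int = 2):
        super().__init__()  # must run before any submodule is assigned
        dims = [in_dim] + [hidden_dim] * num_layers
        # ModuleList registers each layer so its parameters are tracked.
        self.layers = nn.ModuleList(
            [nn.Linear(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:])]
        )
        self.head = nn.Linear(hidden_dim, out_dim)
        # Learnable scalar: shows up in model.parameters().
        self.scale = nn.Parameter(torch.ones(1))
        # Fixed statistics: saved in state_dict and moved by .to(device), but not trained.
        self.register_buffer("input_mean", torch.zeros(in_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x - self.input_mean
        for layer in self.layers:
            x = torch.relu(layer(x))
        return self.head(x) * self.scale


model = ScaledMLP(in_dim=8, hidden_dim=16, out_dim=3)
model.eval()  # the caller sets the mode, never forward()
with torch.no_grad():
    logits = model(torch.randn(4, 8))
```

Because `input_mean` is a buffer rather than a plain attribute, `model.to(device)` moves it along with the weights.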

### Data Loading
- Inherit from `torch.utils.data.Dataset` — implement `__len__` and `__getitem__`
- Return tensors from `__getitem__`, not numpy arrays — DataLoader handles collation
- Set `pin_memory=True` in DataLoader for faster CPU-GPU transfers
- Set `num_workers > 0` for parallel data loading (typically 4-8, not exceeding CPU cores)
- Set `persistent_workers=True` to avoid worker restart between epochs
- Use `torch.utils.data.Sampler` for custom sampling strategies
- NEVER load entire dataset into memory — stream from disk with Dataset class
- Use `collate_fn` for custom batch assembly, especially for variable-length sequences
- Use `torch.utils.data.random_split` for train/val/test splits
- Set `drop_last=True` for consistent batch sizes when using BatchNorm
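
As a sketch of the `Dataset` and `collate_fn` rules above, the following pads variable-length sequences per batch. `TokenDataset` and `pad_collate` are hypothetical names, and the toy data lives in memory only to keep the example self-contained; a real dataset would read each item from disk in `__getitem__`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class TokenDataset(Dataset):
    """Hypothetical dataset of variable-length token id sequences."""

    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        # Return tensors, not lists or numpy arrays.
        return (
            torch.tensor(self.sequences[idx], dtype=torch.long),
            torch.tensor(self.labels[idx], dtype=torch.long),
        )


def pad_collate(batch):
    """Pad each sequence to the longest one in the batch (0 is the pad id)."""
    seqs, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in seqs])
    padded = pad_sequence(list(seqs), batch_first=True, padding_value=0)
    return padded, torch.stack(labels), lengths


dataset = TokenDataset([[5, 2, 9], [7], [3, 4], [8, 1, 6, 2]], [0, 1, 0, 1])
loader = DataLoader(
    dataset,
    batch_size=2,
    shuffle=False,
    collate_fn=pad_collate,
    num_workers=0,  # raise this (with persistent_workers=True) for real workloads
    pin_memory=torch.cuda.is_available(),  # pinning only helps when copying to a GPU
)
padded, labels, lengths = next(iter(loader))
```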

### Training Loop
- Follow the canonical pattern each batch:
  1. `optimizer.zero_grad()` (`set_to_none=True`, the default since PyTorch 2.0, frees gradient memory instead of zero-filling it)
  2. `output = model(inputs)`
  3. `loss = criterion(output, targets)`
  4. `loss.backward()`
  5. `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)` when needed
  6. `optimizer.step()`
- NEVER forget `optimizer.zero_grad()` — gradients accumulate by default
- Use `model.train()` before training, `model.eval()` before validation
- Wrap evaluation in `torch.no_grad()`: `with torch.no_grad(): ...`
- Calculate metrics on detached tensors: `accuracy = (pred.detach().argmax(1) == targets).float().mean()`
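- Use `torch.autocast(device_type='cuda')` for mixed precision training; recent PyTorch releases deprecate the older `torch.cuda.amp.autocast()` spelling
- Use `GradScaler` (`torch.amp.GradScaler` in recent releases) when using float16 mixed precision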
- Implement `on_batch_end` logging without blocking training — use moving averages
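
One way to follow the canonical pattern while keeping logging off the critical path is to accumulate the loss on the device and read it back once per logging window. This is a sketch on synthetic data; `log_every` and the tiny linear model are assumptions for illustration.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)

model = nn.Linear(10, 2).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

log_every = 5  # hypothetical logging interval, in steps
running_loss = torch.zeros((), device=device)  # stays on the device between logs
logged = []

model.train()
for step in range(1, 21):
    # Synthetic batch created directly on the device.
    inputs = torch.randn(16, 10, device=device)
    targets = torch.randint(0, 2, (16,), device=device)

    optimizer.zero_grad(set_to_none=True)
    loss = criterion(model(inputs), targets)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

    running_loss += loss.detach()  # no host sync here
    if step % log_every == 0:
        # One .item() call per window instead of one per batch.
        logged.append(running_loss.item() / log_every)
        running_loss.zero_()
```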

### GPU & CUDA Best Practices
- Enable `torch.backends.cudnn.benchmark = True` for fixed input sizes: cuDNN benchmarks and caches the fastest convolution algorithm per input shape
- Use `torch.backends.cudnn.deterministic = True` ONLY when reproducibility is required (slower)
- Use `torch.cuda.empty_cache()` sparingly — only after deleting large tensors
- Use `torch.cuda.memory_summary()` for debugging memory issues
- Use gradient checkpointing for memory-heavy models: `torch.utils.checkpoint.checkpoint()`
- NEVER transfer data GPU→CPU→GPU — keep tensors on the same device throughout
- Use `.to(device, non_blocking=True)` with pinned memory for async transfers
- Prefer `torch.compile(model)` to JIT-compile the model into optimized kernels while keeping eager-style code (PyTorch 2.0+)
- Use `torch.set_float32_matmul_precision('high')` for TensorCore acceleration
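
The gradient checkpointing rule can be sketched as follows. `CheckpointedMLP` is a hypothetical module; the example runs on CPU and only checks that checkpointing leaves the gradients unchanged. The memory saving itself shows up on larger models.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedMLP(nn.Module):
    """Hypothetical module that optionally checkpoints its main block."""

    def __init__(self, dim: int = 32, use_checkpoint: bool = True):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.head = nn.Linear(dim, 1)
        self.use_checkpoint = use_checkpoint

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.use_checkpoint and self.training:
            # Activations inside `block` are recomputed during backward
            # instead of being stored during forward.
            x = checkpoint(self.block, x, use_reentrant=False)
        else:
            x = self.block(x)
        return self.head(x)


torch.manual_seed(0)
model = CheckpointedMLP()
model.train()
x = torch.randn(4, 32)

# Backward pass with checkpointing enabled.
model(x).sum().backward()
grad_ckpt = model.block[0].weight.grad.clone()

# Same pass without checkpointing, for comparison.
model.zero_grad(set_to_none=True)
model.use_checkpoint = False
model(x).sum().backward()
grad_plain = model.block[0].weight.grad.clone()
```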

### Model Persistence
- Save with `torch.save(model.state_dict(), 'model.pth')` — save state_dict, not the model object
- Load with `model.load_state_dict(torch.load('model.pth', map_location=device))`
- Always set `map_location=device` when loading: tensors are otherwise restored to the device they were saved from, which fails on machines without that device
- Save optimizer and scheduler state for resumable training: `torch.save({'epoch': n, 'model_state_dict': ..., 'optimizer_state_dict': ..., 'scheduler_state_dict': ...}, 'checkpoint.pth')`
- Set `model.eval()` after loading for inference — BatchNorm and Dropout behave differently
- Export for production: `torch.jit.trace(model, example_input)` or `torch.onnx.export(model, ...)`
- NEVER pickle entire model objects — use state_dict for portability
- Version check: `torch.save({'pytorch_version': torch.__version__, ...})` for reproducibility
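
A resumable checkpoint along the lines described above might look like this. The helper names `save_checkpoint`, `load_checkpoint`, and `build` are hypothetical, and passing `weights_only=True` to `torch.load` is an added precaution that assumes a reasonably recent PyTorch release.

```python
import os
import tempfile

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def save_checkpoint(path, model, optimizer, scheduler, epoch):
    """Save everything needed to resume training."""
    torch.save(
        {
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "scheduler_state_dict": scheduler.state_dict(),
            "pytorch_version": str(torch.__version__),  # plain str for portability
        },
        path,
    )


def load_checkpoint(path, model, optimizer, scheduler, device):
    """Restore state in place and return the epoch stored in the checkpoint."""
    # weights_only=True restricts unpickling to tensors and plain containers.
    ckpt = torch.load(path, map_location=device, weights_only=True)
    model.load_state_dict(ckpt["model_state_dict"])
    optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    scheduler.load_state_dict(ckpt["scheduler_state_dict"])
    return ckpt["epoch"]


def build():
    """Hypothetical factory so the restored objects match the saved ones."""
    model = nn.Linear(4, 2).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)
    return model, optimizer, scheduler


model, optimizer, scheduler = build()
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth")
save_checkpoint(path, model, optimizer, scheduler, epoch=3)

restored, restored_opt, restored_sched = build()
start_epoch = load_checkpoint(path, restored, restored_opt, restored_sched, device) + 1
restored.eval()  # switch to inference mode after loading
```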

### Distributed Training (DDP)
- Use `DistributedDataParallel` (DDP) over `DataParallel` (DP) — DDP is faster and avoids Python GIL bottleneck
- Initialize process group: `dist.init_process_group('nccl', rank=rank, world_size=world_size)`
- Use `DistributedSampler` for dataset sharding: `sampler = DistributedSampler(dataset)`
- Call `sampler.set_epoch(epoch)` at the start of each epoch to ensure different shuffling
- Wrap model AFTER moving to device: `model = DDP(model.to(device), device_ids=[local_rank])`
- Launch with `torchrun --nproc_per_node=N train.py` — not `python -m torch.distributed.launch`
- Use `local_rank` from environment: `local_rank = int(os.environ['LOCAL_RANK'])`
- Sync metrics across processes with `dist.all_reduce(tensor, op=dist.ReduceOp.SUM)`
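
Metric syncing with `dist.all_reduce` can be wrapped in a helper that also works in single-process runs. `global_mean` is a hypothetical name; with the NCCL backend the tensor passed in would need to live on the local GPU.

```python
import torch
import torch.distributed as dist


def global_mean(value: torch.Tensor) -> torch.Tensor:
    """Average a metric across all ranks; a no-op outside a process group."""
    if not (dist.is_available() and dist.is_initialized()):
        return value
    value = value.detach().clone()  # all_reduce writes in place
    dist.all_reduce(value, op=dist.ReduceOp.SUM)
    return value / dist.get_world_size()


# In a single-process run this returns the input unchanged.
local_accuracy = torch.tensor(0.75)
accuracy = global_mean(local_accuracy)
```

Under `torchrun`, every rank must call `global_mean` at the same point, since `all_reduce` is a collective operation.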

### Reproducibility
- Set random seeds: `torch.manual_seed(seed)` and `random.seed(seed)` and `np.random.seed(seed)`
- Set `torch.use_deterministic_algorithms(True)` for fully deterministic behavior (slower)
- Set `torch.backends.cudnn.deterministic = True` for deterministic CUDA convolution
- Log the seed and hyperparameters for every experiment
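- Rely on the default zipfile-based format of `torch.save()` (standard since PyTorch 1.6) for consistent serialization; the private `_use_new_zipfile_serialization` flag does not need to be set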
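
The seeding rules above are often collected into one helper. `seed_everything` is a hypothetical name, and treating NumPy as optional is an assumption made so the sketch runs without it.

```python
import random

import torch

try:
    import numpy as np
except ImportError:  # NumPy is treated as optional in this sketch
    np = None


def seed_everything(seed: int, deterministic: bool = False) -> None:
    """Seed Python, NumPy (if installed), and PyTorch RNGs."""
    random.seed(seed)
    if np is not None:
        np.random.seed(seed)
    torch.manual_seed(seed)  # seeds the CPU and all CUDA devices
    if deterministic:
        # Slower, and some ops raise if no deterministic kernel exists.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        torch.use_deterministic_algorithms(True)


seed_everything(1234)
first = torch.rand(3)
seed_everything(1234)
second = torch.rand(3)  # identical to `first` because the seed was reset
```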

Installation

Create pytorch.mdc in your project's .cursor/rules/ directory and paste the configuration above. Cursor and Windsurf both read .cursor/rules/ — Copilot users place it in .github/copilot-instructions.md instead.

pip install torch torchvision torchaudio

# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"

# For multi-GPU training
torchrun --nproc_per_node=4 train.py

Examples

# model.py — Proper nn.Module with forward, training/eval modes
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextClassifier(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int, num_classes: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, 256, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.3)
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(input_ids)
        lstm_out, (hidden, _) = self.lstm(embedded)
        pooled = torch.cat([hidden[-2], hidden[-1]], dim=-1)
        return self.classifier(self.dropout(pooled))

# train.py — Canonical training loop with mixed precision and DDP
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def train_epoch(model, dataloader, optimizer, criterion, scaler, device, epoch):
    model.train()
    sampler = dataloader.sampler
    if isinstance(sampler, DistributedSampler):
        sampler.set_epoch(epoch)

    for batch in dataloader:
        inputs, targets = batch[0].to(device, non_blocking=True), batch[1].to(device, non_blocking=True)

        optimizer.zero_grad(set_to_none=True)

        # torch.autocast replaces the deprecated torch.cuda.amp.autocast spelling
        with torch.autocast(device_type="cuda"):
            outputs = model(inputs)
            loss = criterion(outputs, targets)

        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()


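@torch.no_grad()
def evaluate(model, dataloader, criterion, device):
    model.eval()
    # Accumulate on the device and sync to the CPU once at the end,
    # rather than calling .item() on every batch.
    total_loss = torch.zeros((), device=device)
    correct = torch.zeros((), device=device, dtype=torch.long)
    total = 0

    for batch in dataloader:
        inputs, targets = batch[0].to(device, non_blocking=True), batch[1].to(device, non_blocking=True)
        outputs = model(inputs)
        total_loss += criterion(outputs, targets)
        correct += (outputs.argmax(1) == targets).sum()
        total += targets.size(0)

    # Returns (mean loss per batch, accuracy over all samples)
    return total_loss.item() / len(dataloader), correct.item() / total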