Pandas Cursor Rules: Python Data Analysis

Cursor rules for Pandas covering DataFrame operations, I/O, filtering, groupby/aggregation, merging, datetime handling, vectorized operations, and performance optimization.

June 11, 2026 by PromptGenius Team
pandas, cursor-rules, python, data-science, data-analysis, dataframe

Overview

Pandas is the most widely used Python library for data manipulation and analysis, providing DataFrames and Series for working with tabular data. These cursor rules enforce vectorized operations, proper filtering with loc/iloc, groupby/aggregation patterns, merge strategies, datetime handling, I/O best practices, and performance optimization so AI assistants generate efficient, idiomatic Pandas code.

Note:

Enforces vectorized operations (no Python loops), loc/iloc filtering, groupby with named aggregations, merge/join conventions, datetime index handling, chunked I/O for large files, and categorical dtype usage.
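As a rough illustration of what "vectorized operations (no Python loops)" means in practice, the sketch below contrasts a row loop with its column-wise equivalent. The `orders` frame and its column names are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical order data, used only for illustration
orders = pd.DataFrame({"quantity": [2, 5, 1], "price": [9.5, 3.0, 20.0]})

# Pattern the rules discourage: a Python-level loop over rows
totals_loop = [row["quantity"] * row["price"] for _, row in orders.iterrows()]

# Vectorized equivalent: one column-wise multiplication
orders["total"] = orders["quantity"] * orders["price"]

# Conditional logic without .apply(): np.where operates on whole columns
orders["size"] = np.where(orders["total"] >= 20, "large", "small")
```

Both approaches produce the same totals. The vectorized form pushes the loop into compiled code, which is generally where the speedup on large frames comes from.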

Rules Configuration

---
description: Enforces Pandas best practices including vectorized operations, loc/iloc filtering, groupby aggregation, merge strategies, datetime handling, I/O patterns, and performance optimization.
globs: **/*.py,**/*.ipynb
---
# Pandas Best Practices

You are an expert in Pandas, data analysis, and scientific computing with Python.
You understand DataFrame operations, vectorized computation, time series analysis, and data pipeline design.

### DataFrame Fundamentals
- Create DataFrames from dicts: `pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})`
- Use `df.head()`, `df.tail()`, `df.info()`, `df.describe()` for inspection
- Access columns: `df["column"]` or `df.column` (prefer bracket notation)
- Use `df.shape`, `df.columns`, `df.dtypes`, `df.index` for metadata
- Set index: `df.set_index("id")` or load with `pd.read_csv("data.csv", index_col="id")`
- Reset index: `df.reset_index(drop=True)`

### I/O Operations
- Read CSV: `pd.read_csv("data.csv", parse_dates=["date"])` — always parse dates on read
- Read SQL: `pd.read_sql("SELECT * FROM users", connection)`
- Read JSON: `pd.read_json("data.json", orient="records")`
- Read Excel: `pd.read_excel("data.xlsx", sheet_name="Sheet1")`
- Read Parquet: `pd.read_parquet("data.parquet")` — prefer over CSV for large data
- Write CSV: `df.to_csv("output.csv", index=False)` — always set index=False
- Use `chunksize` parameter for large files: `for chunk in pd.read_csv("large.csv", chunksize=10000)`
- Specify `dtype` on read to avoid inference: `pd.read_csv("data.csv", dtype={"id": "int32"})`

### Filtering & Selection
- Use `.loc[row_labels, col_labels]` for label-based selection
- Use `.iloc[row_positions, col_positions]` for integer position-based selection
- Use `df[df["age"] > 30]` for boolean filtering
- Combine conditions: `df[(df["age"] > 30) & (df["city"] == "NYC")]` — use `&` not `and`
- Use `df.query("age > 30 and city == 'NYC'")` for SQL-like filtering strings
- Use `.isin()` for membership: `df[df["country"].isin(["US", "CA", "UK"])]`
- Use `.between()` for range: `df[df["age"].between(25, 35)]`
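- Use `.str.contains()` for text: `df[df["email"].str.contains("@gmail.com", regex=False, na=False)]` (pass `regex=False` for literal matches, since `.` is a regex wildcard by default; `na=False` treats missing values as non-matches)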

### GroupBy & Aggregation
- Basic group: `df.groupby("category")["value"].mean()`
- Named aggregation: `df.groupby("category").agg(total=("value", "sum"), avg=("value", "mean"), count=("value", "count"))`
- Multiple aggregations: `df.groupby("category").agg(["mean", "std", "min", "max"])`
- Transform (broadcast back): `df["pct"] = df.groupby("category")["value"].transform(lambda x: x / x.sum())`
- Filter groups: `df.groupby("category").filter(lambda x: len(x) > 10)`
- Apply custom functions: `df.groupby("category").apply(lambda x: x.nlargest(3, "value"))`

### Merging & Joining
- Use `pd.merge(left, right, on="key", how="inner")` — how: inner, left, right, outer
- Use `pd.merge(left, right, left_on="lkey", right_on="rkey")` for different column names
- Use `pd.concat([df1, df2])` for row-wise concatenation
- Use `pd.concat([df1, df2], axis=1)` for column-wise concatenation
- Use `df.join(other, on="key")` to join the `key` column of `df` against the index of `other`
- Validate merge results: use `validate="one_to_one"`, `"one_to_many"`, `"many_to_many"`
- Use `indicator=True` to track merge source: adds `_merge` column (both, left_only, right_only)

### Missing Data
- Check nulls: `df.isnull().sum()`
- Drop nulls: `df.dropna(subset=["column"])` or `df.dropna(how="all")`
- Fill nulls: `df["col"].fillna(0)`, `df["col"].fillna(df["col"].mean())`
- Forward fill: `df.ffill()` (`fillna(method="ffill")` is deprecated since pandas 2.1)
- Interpolate: `df.interpolate()`
- Use `pd.NA` (not `np.nan`) for nullable integer columns with `dtype="Int64"`

### DateTime Handling
- Convert to datetime: `pd.to_datetime(df["date"])`
- Extract components: `df["date"].dt.year`, `.dt.month`, `.dt.day`, `.dt.dayofweek`
- Resample time series: `df.resample("ME")["value"].sum()` (needs a DatetimeIndex; from pandas 2.2 onward `"ME"` replaces the deprecated `"M"` alias)
- Rolling windows: `df["value"].rolling(window=7).mean()`
- Shift data: `df["value"].shift(1)` for previous value, `.shift(-1)` for next
- Date ranges: `pd.date_range("2024-01-01", periods=12, freq="ME")`
- Time deltas: `pd.Timedelta(days=1)`, `df["end"] - df["start"]`

### Performance
- Never iterate with `for index, row in df.iterrows()` — use vectorized operations
- Use `.apply()` as a last resort — `.str`, `.dt`, and NumPy ufuncs are faster
- Use `pd.Categorical` for low-cardinality string columns: `df["status"].astype("category")`
- Use `pd.eval()` and `df.eval()` for large expression chains
- Use integer dtypes: `int8`, `int16`, `int32`, `int64`; `int32` is usually enough
- Use `float32` instead of `float64` when precision permits
- Load only needed columns: `pd.read_csv("data.csv", usecols=["name", "age"])`
- Use Parquet over CSV: faster I/O, preserves dtypes, smaller size

Installation

Create pandas.mdc in your project's .cursor/rules/ directory and paste the configuration above. Cursor and Windsurf both read .cursor/rules/ — Copilot users place it in .github/copilot-instructions.md instead.

pip install pandas

# Or with optional dependencies
pip install "pandas[parquet,excel]"

Examples

# analysis.py — Data pipeline with I/O, filtering, and aggregation
import pandas as pd

# Read with date parsing and type hints
df = pd.read_csv(
    "sales.csv",
    parse_dates=["order_date"],
    dtype={"customer_id": "int32", "quantity": "int32", "price": "float32"},
)

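# Filter and compute; .copy() makes `recent` an independent frame so the
# column assignment below does not trigger SettingWithCopyWarning
recent = df[df["order_date"] >= "2024-01-01"].copy()
recent["total"] = recent["quantity"] * recent["price"]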

# Group by customer with named aggregation
customer_stats = (
    recent.groupby("customer_id")
    .agg(
        total_spent=("total", "sum"),
        avg_order=("total", "mean"),
        order_count=("total", "count"),
        first_order=("order_date", "min"),
        last_order=("order_date", "max"),
    )
    .reset_index()
)

# Join with customer info
customers = pd.read_csv("customers.csv")
result = customers.merge(customer_stats, on="customer_id", how="left")

result.to_csv("customer_report.csv", index=False)
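The merge rules (`validate=` and `indicator=True`) can be sketched on a small case. This is a minimal illustration, not part of the pipeline above: the `customers` and `stats` frames and their values are made up for the example.

```python
import pandas as pd

# Hypothetical tables, used only for illustration
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Ben", "Cy"]})
stats = pd.DataFrame({"customer_id": [1, 2], "total_spent": [120.0, 45.5]})

# validate="one_to_one" raises pandas.errors.MergeError if either side
# has duplicate keys; indicator=True adds a _merge column
report = customers.merge(
    stats,
    on="customer_id",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Customers with no matching stats row are flagged "left_only"
no_orders = report.loc[report["_merge"] == "left_only", "name"].tolist()

# A duplicated key on the right violates the one_to_one contract
dupes = pd.concat([stats, stats.iloc[[0]]], ignore_index=True)
try:
    customers.merge(dupes, on="customer_id", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
```

Declaring the expected key relationship up front turns a silent row explosion into an immediate error, which is usually easier to debug than an inflated aggregate further downstream.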

# timeseries.py — Time series analysis with resample and rolling
import pandas as pd

df = pd.read_csv("metrics.csv", parse_dates=["timestamp"], index_col="timestamp")

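# Daily resample; resample("D") already emits a row for every calendar day
# in the range, so days without data appear with NaN statistics and count 0
daily = df["value"].resample("D").agg(["mean", "min", "max", "count"])

# Forward-fill the daily mean across empty days before smoothing
# (fillna(method="ffill") is deprecated; use .ffill())
daily["mean"] = daily["mean"].ffill()

# 7-day rolling average (NaN until the window holds 7 observations)
daily["rolling_avg"] = daily["mean"].rolling(window=7).mean()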

# Monthly summary
monthly = df["value"].resample("ME").sum()
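The `shift` and `rolling` rules from the DateTime section can be tried without an input file. The sketch below builds a small synthetic series; the values and the `frame` name are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical daily metric on a DatetimeIndex, used only for illustration
idx = pd.date_range("2024-01-01", periods=5, freq="D")
series = pd.Series([10.0, 12.0, 9.0, 15.0, 18.0], index=idx, name="value")

frame = series.to_frame()

# Previous day's value; the first row has no predecessor, so it is NaN
frame["prev"] = frame["value"].shift(1)

# Day-over-day change, computed column-wise
frame["change"] = frame["value"] - frame["prev"]

# 3-day rolling mean; min_periods=1 avoids leading NaN values
frame["rolling_3d"] = frame["value"].rolling(window=3, min_periods=1).mean()
```

Whether to set `min_periods` is a judgment call: leaving it at the default keeps partial windows out of the result, while `min_periods=1` trades that strictness for a value on every row.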

# pivot.py — Reshaping with pivot and melt
import pandas as pd

# Reshape from wide to long
sales = pd.DataFrame({
    "product": ["A", "B"],
    "jan": [100, 200],
    "feb": [150, 250],
    "mar": [120, 220],
})

long = sales.melt(id_vars=["product"], var_name="month", value_name="sales")

# Reshape from long to wide
pivot = long.pivot_table(
    index="product",
    columns="month",
    values="sales",
    aggfunc="sum",
)

# Stack/unstack for multi-index manipulation
stacked = pivot.stack()
unstacked = stacked.unstack("month")
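
The chunked I/O and categorical dtype rules can be sketched in a similar way. In this illustration `io.StringIO` stands in for a large CSV on disk, and the `status` and `amount` columns are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a large file on disk
csv_text = "status,amount\n" + "\n".join(
    f"{'paid' if i % 2 == 0 else 'open'},{i}" for i in range(10)
)

# Read 4 rows at a time and keep only per-chunk aggregates in memory
partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    partials.append(chunk.groupby("status")["amount"].sum())

# Combine the partial sums into one total per status
totals = pd.concat(partials).groupby(level=0).sum()

# Declare a low-cardinality string column as categorical at read time
full = pd.read_csv(io.StringIO(csv_text), dtype={"status": "category"})
```

This pattern only works for aggregations that can be combined from partial results, such as sums and counts. A mean, for example, would need the per-chunk sum and count carried separately.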