Can Data Win Your Fantasy League?

Authors: Vidit Tiwari, Navdeep Sirigiri

Dataset Introduction and Question Identification

Data Overview

We use a dataset from Pro Football Reference containing roughly 12,000 player-season entries and about 35 features. It covers standard NFL stats (such as targets, carries, and touchdowns), fantasy metrics (such as points per game and rank), and advanced features such as player grades and red zone usage, for seasons since 2004.

Main Question

Can we predict a player’s final fantasy football rank for the next season using prior-year stats and player grades?

This matters because millions of people play fantasy football, and being able to spot breakout players or avoid busts can give a manager a real edge.

Key Columns

FantPos: Player’s fantasy position (RB, WR, etc.)
Age: Age of the player during the season
Tgt, Rec, Yds, TD: Receiving stats
Att, Yds, TD: Rushing stats
FantPt, PPR: Total and PPR fantasy points scored
PosRank, OvRank: Player’s position and overall fantasy ranking (target)

Data Cleaning and Exploratory Data Analysis

Data Cleaning

To prepare the dataset for analysis, we first addressed symbolic annotations in player names—specifically asterisks (*) indicating Pro Bowl selections and plus signs (+) indicating All-Pro honors. These symbols were part of the original data collection process from Pro Football Reference, where such accolades are embedded directly into player names. To reflect this, we extracted them into two binary columns (is_probowl and is_allpro) and cleaned the Player column to ensure consistent, identifier-friendly names for joins and tracking.
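The symbol extraction described above can be sketched in pandas as follows. This is a minimal illustration, not the authors' exact code: the player names are placeholders, and the regular expression is an assumption about how the symbols were stripped.

```python
import pandas as pd

# Placeholder rows mimicking the raw Player column (names are hypothetical).
raw = pd.DataFrame({"Player": ["Player A*", "Player B*+", "Player C"]})

# Flag the honors before the symbols are removed.
raw["is_probowl"] = raw["Player"].str.contains("*", regex=False).astype(int)
raw["is_allpro"] = raw["Player"].str.contains("+", regex=False).astype(int)

# Strip the symbols and any leftover whitespace so names join cleanly.
raw["Player"] = raw["Player"].str.replace(r"[*+]", "", regex=True).str.strip()

print(raw)
```

Creating the flags before cleaning the names matters, since the information is lost once the symbols are stripped.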

Ambiguous columns like Yds, Yds.1, and TD.1 stemmed from the original website’s formatting for multiple stat types (e.g., passing, rushing), so we renamed them to Pass_yd, Rush_yd, and Rush_TD for clarity and usability in downstream analysis.

Team abbreviations were standardized (e.g., SDG to LAC, STL to LAR) to align with current franchise naming conventions, accounting for relocations that would otherwise fragment team-based aggregations.
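The renaming and team standardization steps might look like the sketch below. The stat values are made up, and the mapping contains only the two relocations named in the text; the full mapping used in the project is not shown in this write-up.

```python
import pandas as pd

# Hypothetical slice of the raw table, with the duplicated column names
# that appear when repeated headers are read into pandas.
df = pd.DataFrame({
    "Tm": ["SDG", "STL", "TEN"],
    "Yds": [4000.0, 0.0, 0.0],
    "Yds.1": [20.0, 900.0, 60.0],
    "TD.1": [0.0, 7.0, 1.0],
})

# Rename the ambiguous stat columns as described in the text.
df = df.rename(columns={"Yds": "Pass_yd", "Yds.1": "Rush_yd", "TD.1": "Rush_TD"})

# Map relocated franchises to their current abbreviations.
# Only the two pairs mentioned in the text are listed here.
TEAM_MAP = {"SDG": "LAC", "STL": "LAR"}
df["Tm"] = df["Tm"].replace(TEAM_MAP)

print(df)
```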

Most missing statistics were filled with zeros, on the assumption that they reflect zero production rather than truly missing data, since the original site leaves a stat blank when no such play occurred. For example, the 2PM column was often NaN because players frequently do not record this stat. For the OvRank column, we imputed the maximum value for the given year, since every NaN in this column belonged to a player who fell below a baseline rank set by Pro Football Reference. We used the maximum because we treated all players below the threshold as equal in rank.
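A minimal sketch of the two imputation rules, using made-up values. The list of stat columns is an assumption; in practice it would include every counting stat that the site leaves blank.

```python
import numpy as np
import pandas as pd

# Toy data: NaN in 2PM means "no such play"; NaN in OvRank means "below the baseline".
df = pd.DataFrame({
    "Year": [2019, 2019, 2019, 2020, 2020],
    "2PM": [np.nan, 1.0, np.nan, np.nan, 2.0],
    "OvRank": [1.0, 34.0, np.nan, 29.0, np.nan],
})

# Counting stats: treat a missing value as zero production.
stat_cols = ["2PM"]  # assumed list; extend with other counting stats
df[stat_cols] = df[stat_cols].fillna(0)

# OvRank: unranked players receive the maximum rank observed in their season.
year_max = df.groupby("Year")["OvRank"].transform("max")
df["OvRank"] = df["OvRank"].fillna(year_max)

print(df)
```

Using `transform` keeps the per-year maximum aligned with the original rows, so the fill happens season by season rather than with one global value.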

Finally, we created a Next_PosRank column by applying a group-wise shift based on player and position—mirroring the year-to-year progression of the NFL season—to support predictive modeling of future fantasy performance.

	Rk	Player	Tm	FantPos	Age	G	GS	Att	Rush_Att	Rush_yd	Y/A	Rush_TD	Tgt	Rec	Rec_yd	Y/R	Rec_TD	Fmb	FL	TD.3	FantPt	PPR	DKPt	FDPt	VBD	PosRank	OvRank	-9999	Year	is_probowl	Next_PosRank
3804	34	A.J. Brown	TEN	WR	22	16	11.0	0.0	3.0	60.0	20.0	1.0	84.0	52.0	1051.0	20.21	8.0	1.0	0.0	9	165.0	217.1	220.1	191.1	36.0	9	34.0	BrowAJ00	2019	0	9.0
601	29	A.J. Brown	TEN	WR	23	14	12.0	0.0	0.0	0.0	0.0	0.0	106.0	70.0	1075.0	15.36	11.0	2.0	1.0	12	178.0	247.5	251.5	212.5	51.0	9	29.0	BrowAJ00	2020	1	32.0
1331	102	A.J. Brown	TEN	WR	24	13	13.0	2.0	2.0	10.0	5.0	0.0	105.0	63.0	869.0	13.79	5.0	0.0	0.0	5	118.0	180.9	183.9	149.4	0.0	32	0.0	BrowAJ00	2021	0	4.0
3134	13	A.J. Brown	PHI	WR	25	17	16.0	0.0	0.0	0.0	0.0	0.0	145.0	88.0	1496.0	17.00	11.0	2.0	2.0	11	212.0	299.6	304.6	255.6	91.0	4	13.0	BrowAJ00	2022	1	8.0
2509	20	A.J. Brown	PHI	WR	26	17	17.0	0.0	0.0	0.0	0.0	0.0	158.0	106.0	1456.0	13.74	7.0	2.0	2.0	7	184.0	289.6	294.6	236.6	51.0	8	20.0	BrowAJ00	2023	1	17.0
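The group-wise shift that produces Next_PosRank can be sketched as follows. The PosRank values are taken from the A.J. Brown rows above; the sort order and grouping keys are assumptions about how the shift was applied.

```python
import pandas as pd

# PosRank values from the sample rows above, deliberately out of order.
df = pd.DataFrame({
    "Player": ["A.J. Brown"] * 5,
    "FantPos": ["WR"] * 5,
    "Year": [2021, 2019, 2023, 2020, 2022],
    "PosRank": [32, 9, 8, 9, 4],
})

# Sort chronologically within each player, then pull next season's rank
# back onto the current season's row.
df = df.sort_values(["Player", "FantPos", "Year"])
df["Next_PosRank"] = df.groupby(["Player", "FantPos"])["PosRank"].shift(-1)

print(df[["Year", "PosRank", "Next_PosRank"]])
```

In this toy frame the 2023 row has no following season, so its Next_PosRank is NaN. Note also that a plain shift pairs each row with the player's next recorded season, which is not necessarily the next calendar year if a season is missing.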

Univariate Analysis

Pie chart of Position distribution

This pie chart displays the distribution of fantasy football positions in our dataset, with wide receivers making up the largest group. This suggests that WRs are the most common fantasy players, which could influence drafting depth and positional strategies.

Bivariate Analysis

Points vs Position BoxPlot

This box plot shows the distribution of fantasy points by position, with quarterbacks having the highest median point totals. This suggests that quarterbacks are generally the most valuable fantasy players, followed by running backs.

Fantasy Points Vs Year Histogram

This histogram shows how the number of fantasy points scored has changed over the years. It reveals an upward trend, indicating that players have been scoring more fantasy points in recent seasons.

Heat Map of Score output per team

This heatmap highlights how the top 10 NFL teams distribute Fantasy Points across positions, revealing which teams are especially strong at specific roles like QB, RB, TE, or WR. It helps identify team-position combinations that consistently produce high fantasy value.

The Prediction Problem

Our goal is to build a regression model that predicts an NFL player’s final fantasy position ranking for the upcoming season, using their performance statistics from the previous season.

Response Variable

Next_PosRank (next season’s fantasy ranking within position) is the response variable we aim to predict.

We chose Next_PosRank because it offers a normalized, position-specific measure of a player’s fantasy value. This makes it especially useful for fantasy football managers when planning draft strategies or identifying breakout candidates within specific roles (e.g., WR, RB, QB).

Type of Prediction

This is a regression problem.
Although ranks are ordinal, we treat them as continuous for prediction purposes, since we aim to estimate the exact or near-exact rank value rather than classify into broad tiers.

We therefore judge performance not only by whether the predicted rank is exactly right, but also by how close it is to the actual outcome.

Evaluation Metric

We use Mean Absolute Error (MAE) as our primary evaluation metric.

We chose MAE because:

It is directly interpretable in the same units as our target variable (ranking position), making it easy to explain and understand.
MAE is less sensitive to outliers than RMSE, which squares each error, and unlike R² it gives a direct sense of how far off predictions are on average.
It aligns well with how fantasy managers might perceive draft accuracy (e.g., off by 2–3 ranks is more intuitive than a % variance).
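To make the contrast between MAE and RMSE concrete, here is a small worked example with made-up ranks, where one prediction misses badly. The helper names `mae` and `rmse` are ours, introduced only for illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the ranks."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error; squaring gives large misses extra weight."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

actual = [5, 12, 20, 40]
predicted = [7, 10, 23, 10]  # absolute errors: 2, 2, 3, 30

print("MAE: ", mae(actual, predicted))   # (2 + 2 + 3 + 30) / 4 = 9.25
print("RMSE:", rmse(actual, predicted))
```

The single 30-rank miss pushes RMSE to roughly 15, well above the MAE of 9.25, even though three of the four predictions are within 3 ranks.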

Baseline Model Predicting Next Position Rank

Model Description

We constructed a baseline linear regression model to predict Next_PosRank (a player’s position rank in the following season) from FantPt, Age, and FantPos.

Features Used

The model used the following features:

FantPt (Fantasy Points): Quantitative
Age: Quantitative
FantPos (Fantasy Position): Nominal

The target variable is Next_PosRank, a numeric measure indicating a player’s projected future ranking within their position.

Feature Types Summary

Feature	Type	Description
FantPt	Quantitative	Continuous numerical feature
Age	Quantitative	Continuous numerical feature
FantPos	Nominal	Categorical feature (e.g., QB, RB)

Quantitative features: 2
Nominal features: 1
Ordinal features: 0

Encoding and Preprocessing

Quantitative features (FantPt, Age) were passed through without transformation using "passthrough".
Nominal feature (FantPos) was encoded using One-Hot Encoding via OneHotEncoder, with handle_unknown="ignore" to avoid errors from unseen categories during testing.
A ColumnTransformer was used to apply these transformations.
These transformations were combined with a LinearRegression model inside a Pipeline.
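The preprocessing and model steps listed above could be assembled roughly as shown below. The training rows are toy values (the first four borrow FantPt, Age, and Next_PosRank from the sample rows earlier; the rest are made up), and the step names are our own.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data; the RB and QB rows are invented for illustration.
train = pd.DataFrame({
    "FantPt": [165.0, 178.0, 118.0, 212.0, 90.0, 250.0],
    "Age": [22, 23, 24, 25, 28, 30],
    "FantPos": ["WR", "WR", "WR", "WR", "RB", "QB"],
    "Next_PosRank": [9.0, 32.0, 4.0, 8.0, 40.0, 5.0],
})

# Quantitative columns pass through unchanged; FantPos is one-hot encoded.
preprocess = ColumnTransformer(transformers=[
    ("num", "passthrough", ["FantPt", "Age"]),
    ("pos", OneHotEncoder(handle_unknown="ignore"), ["FantPos"]),
])

baseline = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", LinearRegression()),
])

X = train[["FantPt", "Age", "FantPos"]]
y = train["Next_PosRank"]
baseline.fit(X, y)

# A position unseen in training (TE here) is encoded as all zeros
# instead of raising an error, thanks to handle_unknown="ignore".
new_player = pd.DataFrame({"FantPt": [150.0], "Age": [26], "FantPos": ["TE"]})
prediction = baseline.predict(new_player)
print(prediction)
```

Keeping the encoder inside the Pipeline means the categories are learned from the training split only, which avoids leaking information from the test set.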

Model Evaluation

A final MAE of 9.45 means that, on average, the model’s predicted rank is off by about 9.5 ranks.

29.92% of predictions fall within 5 ranks
62.08% of predictions fall within 10 ranks
83.68% of predictions fall within 15 ranks

Positional Accuracy within ±10 Ranks:

QB: 71.07%
WR: 56.89%
TE: 56.77%
RB: 64.34%
FB: 0.00%
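The within-k accuracy figures above can be computed with a few lines of pandas. The sketch below uses made-up predictions and assumes that a prediction counts as a hit when its absolute error is at most k ranks (the report does not say whether the bound is inclusive).

```python
import pandas as pd

def within_k(actual, predicted, k):
    """Share of predictions whose absolute error is at most k ranks."""
    return float(((actual - predicted).abs() <= k).mean())

# Made-up predictions; absolute errors are 4, 2, 15, 7, 12, 22.
results = pd.DataFrame({
    "FantPos": ["QB", "QB", "WR", "WR", "RB", "FB"],
    "actual": [5, 12, 30, 8, 20, 3],
    "predicted": [9.0, 14.0, 45.0, 15.0, 32.0, 25.0],
})

# Overall accuracy at each tolerance.
overall = {k: within_k(results["actual"], results["predicted"], k) for k in (5, 10, 15)}

# Accuracy within 10 ranks, broken out by position.
results["hit_10"] = (results["actual"] - results["predicted"]).abs() <= 10
by_position = results.groupby("FantPos")["hit_10"].mean()

print(overall)
print(by_position)
```

With only one FB row in this toy frame, that position's accuracy is either 0 or 1, which mirrors how unstable a per-position figure becomes when a group has very few test examples.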

The reported metrics show that the baseline is useful but imprecise: roughly 60% of predictions land within ±10 ranks, which is already helpful for fantasy managers comparing players within a position. Accuracy varies by position. It is highest for quarterbacks (71.07% within ±10 ranks), lower for wide receivers and tight ends (about 57%), and 0.00% for fullbacks, which leaves clear room for improvement.

Final Model

Feature Engineering

To better capture the underlying patterns in the data and enhance the model’s ability to predict player ranks, we introduced several new features:

Total_TD: Total touchdowns, summed across categories (rushing, passing, etc.).
Touches: Rushing attempts plus receptions, the opportunities that drive fantasy production.
YardsPerTouch: Efficiency per opportunity; more efficient players tend to produce more from the same workload.
CatchRate: Share of targets that a player catches.
TD_PG, RecYards_PG, RushYards_PG: Per-game versions of these stats, which account for differences in games played (especially important given injuries).
FantPt_PerTouch: Fantasy points per touch.
TeamYearRank: A player’s rank by fantasy points within his team and season, a proxy for his share of the team’s production.
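One plausible way to compute these features is sketched below. The exact formulas are assumptions inferred from the descriptions above: Total_TD here sums only rushing and receiving touchdowns (a passing column would be added the same way), YardsPerTouch uses rushing plus receiving yards, and `safe_div` is a hypothetical helper that returns 0 when the denominator is 0. All values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "Player": ["P1", "P2", "P3"],
    "Tm": ["TEN", "TEN", "PHI"],
    "Year": [2020, 2020, 2020],
    "G": [14, 16, 0],
    "Rush_Att": [0.0, 300.0, 0.0],
    "Rush_yd": [0.0, 1500.0, 0.0],
    "Rush_TD": [0.0, 15.0, 0.0],
    "Tgt": [100.0, 30.0, 0.0],
    "Rec": [70.0, 20.0, 0.0],
    "Rec_yd": [1050.0, 100.0, 0.0],
    "Rec_TD": [11.0, 0.0, 0.0],
    "FantPt": [171.0, 250.0, 0.0],
})

def safe_div(num, den):
    """Divide two Series, returning 0.0 wherever the denominator is 0."""
    return (num / den.where(den != 0)).fillna(0.0)

df["Total_TD"] = df["Rush_TD"] + df["Rec_TD"]  # assumed: rushing + receiving only
df["Touches"] = df["Rush_Att"] + df["Rec"]
df["YardsPerTouch"] = safe_div(df["Rush_yd"] + df["Rec_yd"], df["Touches"])
df["CatchRate"] = safe_div(df["Rec"], df["Tgt"])
df["TD_PG"] = safe_div(df["Total_TD"], df["G"])
df["RecYards_PG"] = safe_div(df["Rec_yd"], df["G"])
df["RushYards_PG"] = safe_div(df["Rush_yd"], df["G"])
df["FantPt_PerTouch"] = safe_div(df["FantPt"], df["Touches"])

# Rank 1 is the team's top fantasy scorer in that season.
df["TeamYearRank"] = df.groupby(["Tm", "Year"])["FantPt"].rank(ascending=False, method="min")

print(df[["Player", "Touches", "YardsPerTouch", "CatchRate", "TeamYearRank"]])
```

Guarding the divisions matters because players with zero touches, targets, or games would otherwise produce NaN or infinite values that most models cannot handle.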

Modeling Algorithm and Hyperparameter Selection

We used LogisticAT, an ordinal logistic regression model (the all-threshold variant, as provided by the mord package), as the modeling algorithm. We selected it because our target is ordinal: LogisticAT is designed for ordered outcomes, which makes it more appropriate here than standard regression or unordered classification.

To tune the regularization hyperparameter (alpha), we ran cross-validation on the training set and selected the value that minimized Mean Absolute Error (MAE) on the validation folds. The best-performing alpha was 100.0, which likely provided the right balance between model complexity and regularization.
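The alpha search described above can be sketched with scikit-learn's GridSearchCV scored by MAE. To keep the example runnable without extra packages, it uses Ridge as a stand-in estimator on synthetic data; this is not the model from the report. The alpha grid and the number of folds are also assumptions. An ordinal estimator such as LogisticAT could be substituted for Ridge, assuming it follows the scikit-learn estimator interface and is given integer rank labels.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic features and rank-like targets, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
noise = rng.normal(scale=5, size=120)
y = np.clip(np.round(30 + 10 * X[:, 0] - 5 * X[:, 1] + noise), 1, None)

# Stand-in estimator: Ridge also exposes an `alpha` regularization parameter.
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},  # assumed grid
    scoring="neg_mean_absolute_error",  # sklearn maximizes, so MAE is negated
    cv=5,
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_mae = -search.best_score_  # flip the sign back to a positive MAE
print(best_alpha, best_mae)
```

When the selected value sits at the edge of the grid, it is usually worth extending the grid in that direction to confirm that the minimum has actually been bracketed.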

Final Model vs. Baseline Model

Metric	Baseline Model	Final Model	Improvement
MAE	9.45	8.86	Lower error
Accuracy Score	0.02	0.02	No change
Accuracy ±5 ranks	33.12%	34.56%	Higher
Accuracy ±10 ranks	58.77%	65.52%	Higher
Accuracy ±15 ranks	78.25%	85.12%	Higher
WR ±10 ranks	56.89%	59.88%	Higher
TE ±10 ranks	56.77%	63.23%	Higher
RB ±10 ranks	64.34%	66.43%	Higher
QB ±10 ranks	71.07%	74.84%	Higher
FB ±10 ranks	0.00%	0.00%	No change

The final model improved on the baseline across most key metrics. The raw accuracy score stayed at 0.02, which reflects how hard exact rank prediction is, but MAE fell from 9.45 to 8.86, and accuracy within 10 and 15 ranks each rose by nearly 7 percentage points. Accuracy within 10 ranks increased for every position except FB, with the largest gain at TE (56.77% to 63.23%) and the highest accuracy at QB (74.84%); contextual features such as per-game efficiency and role-adjusted ranking likely drove these gains. Because the dataset contains few FBs, there were too few test examples to evaluate that position reliably.