Predicting the NBA Draft

By Isaac Butz, Zongxia Li, and Laksh Sekhri


Welcome to our tutorial! In this project, the goal is to look at which factors are the most important for NBA draft picks. First, a little bit about the NBA. The National Basketball Association, or NBA, is the highest-level professional basketball league in the world. It is composed of 30 teams that play 82 games each before a playoff bracket is set. The most common way for players to enter the league (after the 2006 rule change) is to be drafted in the NBA draft. Most players drafted come from collegiate basketball, more specifically NCAA D1 Men's Basketball. The draft consists of 2 rounds, with each team getting 1 pick per round, for a total of 60 picks.

The end goal of this project is to try to predict the 2021 NBA draft, which will take place on July 29th. Predicting the position of draft picks would be useful for those in the sports industry. A team would be able to predict which players other teams will select, see whether players are over- or under-rated, or possibly help set sports betting odds. However, to start predicting the draft, we must look at past drafts to see which factors matter most for an NBA draft pick. The project is broken up into 5 sections:

  1. Setup
  2. Data Wrangling
  3. Analysis of Previous Drafts
  4. Building a Model
  5. Conclusion

Over the course of this guide, we hope the reader comes to understand how and why data analysis is done, and will be able to follow similar steps to do some data science on other topics!

Part 0: Setup

This tutorial uses pandas for data handling, matplotlib for plotting, and scikit-learn (sklearn) for modeling. Each library's documentation is a good place to read more.

Part 1: Data Wrangling

Perhaps the most important part is getting the data. The first dataset needed is a collection of past NBA draft picks' college stats. This is what we will analyze and train our machine learning model on. We found the website barttorvik.com, which contains data on NCAA men's basketball players. Second, a dataset is needed for the current NBA draft class. With these two datasets we will be able to inspect our data and do preliminary hypothesis testing, create a model for the data, and lastly try to predict this year's draft order.

Step a: Previous NBA drafts

Unfortunately, the data is generated using JavaScript and we were unable to parse it, so it was collected manually. Here are the steps. On the barttorvik.com player stats page, 'all' was selected in the top drop-down menu to get all-time data. Next, scrolling down the left side, Min%, O-Reb%, D-Reb%, Ast%, Blk%, Stl%, FTR, 2PM, 3PM, and drafted were selected to be shown. Drafted was also set to "<=" and "60", since there are 60 picks in the NBA draft. After that, "name" was clicked twice so the players were sorted by name. Then the "Show 100 more" option was selected repeatedly until the full 1302-element table was shown. Lastly, the entire table was selected with the cursor and copy/pasted into a blank csv file that was read into pandas.

The better approach would be to use a library such as Selenium; however, we had difficulties figuring this out.

The data is very messy. We want to fix it and have the following columns:

The last 8 are hyperlinks to a glossary by basketball-reference.com that further explains what these columns mean.
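As an illustration of the kind of cleanup involved, here is a minimal sketch of reading pasted text into pandas and tidying it. The sample rows, the column names (`name`, `year`, `min_pct`, `pick`), and the specific cleaning steps are all assumptions made for the example; the real pasted file will likely need different handling.

```python
import io
import pandas as pd

# Hypothetical sample of pasted rows: the layout and column names are
# assumptions for illustration, not the real barttorvik.com export.
raw_text = """name,year,min_pct,pick
 Player A ,Fr,78.5,3
Player B,Sr,,45
Player A,Fr,78.5,3
"""

raw_df = pd.read_csv(io.StringIO(raw_text))

clean_df = (
    raw_df
    .assign(name=lambda d: d["name"].str.strip())  # trim stray whitespace in names
    .drop_duplicates()                              # remove rows pasted twice
    .dropna(subset=["min_pct"])                     # drop rows missing a stat
    .reset_index(drop=True)
)
clean_df["pick"] = clean_df["pick"].astype(int)     # picks are whole numbers
```

Stripping whitespace before dropping duplicates matters here: otherwise " Player A " and "Player A" would be treated as different players.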

Step b: This year's draft

Unfortunately, we will not be able to predict everyone in the draft. This is because the deadline for players to declare for the NBA draft is May 30th, and this project is due May 17th. However, most players who will be picked higher up in the rankings declare earlier for the draft, so we can use these early declarations for predictions. This website contains a nice table of all the players who have declared early thus far.

Next, we use this list of names as a cross-reference for the 2021 season stats, getting the 2021 season players' stats from barttorvik.com again. As before, this data is created using JavaScript, so here are the steps for how we manually got the data. First, go to the barttorvik player stats page and make sure the year is 2021. Next, select Min%, O-Reb%, D-Reb%, Ast%, Blk%, Stl%, FTR, 2PM, and 3PM. Once again, "Show 100 more" was selected repeatedly until all 2125 players were visible. Lastly, this was copy/pasted into a csv file, which is read in and processed.

To clean up this DataFrame, we follow almost exactly the same steps as in step a. The columns will be exactly the same, except there won't be a 'pick' column.
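The cross-referencing step can be sketched as follows. This is only an illustration: the player names, the `name` and `min_pct` columns, and the variable names are hypothetical stand-ins for the real list of early declarations and the real 2021 stats table.

```python
import pandas as pd

# Hypothetical stand-ins: names and column labels are illustrative only.
declared_names = ["Player A", "Player C"]
season_df = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C", "Player D"],
    "min_pct": [80.1, 55.0, 62.3, 70.4],
})

# Keep only the players whose names appear in the early-declaration list.
current_df = season_df[season_df["name"].isin(declared_names)].reset_index(drop=True)

# Declared names with no match in the season stats (for example because of
# spelling differences) are worth checking by hand.
missing = sorted(set(declared_names) - set(current_df["name"]))
```

Matching on names is fragile, so checking the `missing` list is a cheap way to catch players who were silently dropped.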

Part 2: Analysis of Previous Drafts

Now that all the data has been collected, the data analysis can begin. The goal is to look for trends between high draft picks and players' stats. If any trends can be observed, they would be something to train our model on. For plotting we will use matplotlib to display the data.

Step a: Player stats

We will start off by plotting possibly the simplest breakdown, age. A common sentiment among basketball fans, and our hypothesis, is that a younger player is drafted earlier. This is because NBA teams want to develop the player on their own and younger players adapt more easily. At the same time, we can plot draft pick against minutes played to see how time spent in game correlates with draft position. Finally, we will look at the players' stats to see if there are any correlations between those stats and draft placement.

Looking at the plot on the left, it seems that if a freshman is going to be drafted, they will be picked early in the draft. Seniors are the opposite, often being picked at the end of the draft. The distribution seems to be somewhat linear, so this could be a good parameter for training our model. This is in line with our hypothesis. By contrast, the distribution of pick number against minutes played shows almost no correlation. It is a total mess, and thus it will not be something that we use in our predictive model.

Next, we will look at how the players' stats impact their draft position. First, we will plot all the players' stats against draft pick to see if there is any basic correlation.

There seems to be no correlation between draft pick and any of these stats. Every plot is a mess. This means that there is no specific player stat that really makes or breaks a player's draft position. This makes sense considering the variety of positions in basketball and the different skill sets. Some years teams may want a defensive-minded player who gets a lot of steals and blocks, while other teams may draft a good shooter really high.
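A scatter plot gives a visual impression; a correlation coefficient puts a number on it. The sketch below uses synthetic data, not our real table: `class_year` is deliberately constructed to track the pick number and `min_pct` is pure noise, so the two cases discussed above can be told apart. The column names are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data (not real draft data).
pick = rng.integers(1, 61, size=n)
demo_df = pd.DataFrame({
    "pick": pick,
    # Built to rise with pick number: 1 = freshman ... 4 = senior.
    "class_year": np.clip(np.round(pick / 20 + rng.normal(0, 0.7, n)), 0, 3) + 1,
    # Unrelated to pick by construction.
    "min_pct": rng.uniform(20, 95, size=n),
})

# Pearson correlation of every stat column with the pick number.
corr_with_pick = demo_df.drop(columns="pick").corrwith(demo_df["pick"])
```

A value near 0 matches the "total mess" plots, while a value closer to 1 or -1 indicates a roughly linear trend.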

Step b: Getting more data

Since there seems to be no correlation there, we are going to loop back in the data science pipeline to collecting data. Perhaps there is a stronger relationship with biometrics: height, weight, vertical, etc. We found this data on Data World. To access it you have to make an account; however, this is free and simple. The data was read in and joined to our existing pastPicks_df.
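A join like this can be sketched with `DataFrame.merge`. The miniature tables below are hypothetical, as are the `name`, `height_in`, and `bench` column names and the choice of a left join on the player name; the real join key and columns may differ.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
pastPicks_df = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "pick": [3, 45, 17],
})
measurements_df = pd.DataFrame({
    "name": ["Player A", "Player C"],
    "height_in": [78.0, 81.5],
    "bench": [None, 12.0],
})

# A left join keeps every past pick and fills NaN where no measurement exists.
merged_df = pastPicks_df.merge(measurements_df, on="name", how="left")

# Non-null counts show how much of each new column is actually usable.
coverage = merged_df.notna().sum()
```

A left join keeps players without measurements in the table, which makes the gaps visible in the non-null counts instead of silently shrinking the dataset.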

Here is a description of the new dataframe. The first 12 columns are the same, and the new columns are mostly self-explanatory. As you can see in the Non-Null Count, the measurements we added were not up to date. Some columns, such as "Bench", only have 284 out of 621 entries. However, this should be enough data to do more hypothesis testing on.

Step c: Further Analysis

In this section, we will make scatter plots of different biometrics against the player's pick and see which biometric affects the pick position the most.

Part 3: Building a Model

Next we can try to train a model to predict the 2021 NBA draft. There aren't too many good parameters, so for the most part we will be training on all of the data points.

Step a: Training and Finding the Best Model

We will use these features as the input of our different models and pick as the output. We will train on the data using a K nearest neighbors model, a support vector machine, logistic regression, and a multilayer perceptron. Then we will see how accurate each model is at fitting the training dataset. Finally, we will pick the 2 models that perform the best and use them to predict the 2021 draft.

Collecting training data and testing data

We take the NBA draft data from previous years, starting from 2008, and use different models to test how accurately each one predicts the draft picks for NBA players. Then we will use the most accurate model to predict the NBA draft picks for 2021.

K-means clustering model

K-means clustering groups data into clusters, each with a center point (centroid); every data point is assigned to the cluster whose centroid it is closest to. We will separate the training data points into clusters and visualize them. Because there are 10 input features for each data point, we need principal component analysis (PCA) to visualize the points in a 2D plane. PCA finds the directions, each a combination of the original features, along which the data varies the most. We keep the top 2 principal components and plot each data point by its coordinates along them.
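The clustering and projection steps can be sketched with scikit-learn as follows. The data here is random and only mimics the shape of our input (10 features per player); the choice of 4 clusters and the standardization step are assumptions for the example, not settings taken from our analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))   # synthetic stand-in: 150 players, 10 features

# Put all features on a common scale so no single stat dominates distances.
X_scaled = StandardScaler().fit_transform(X)

# Assign each player to one of 4 clusters (4 is an arbitrary choice here).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_scaled)

# Project the 10-dimensional points onto the top two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# X_2d[:, 0] and X_2d[:, 1] can then be passed to matplotlib's scatter,
# colored by cluster_labels.
```

`pca.explained_variance_ratio_` reports how much of the total variance the two plotted components capture, which is a useful check on how faithful the 2D picture is.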

Accuracy checking function

The following function checks how accurate the current model is. Given a trained model, a set of input features, and a list of expected outputs, we predict the output using the trained model and the input, then compare the predicted output with the expected output.
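A minimal sketch of such a function is shown below. The name `accuracy`, its argument names, and the toy `AlwaysPickOne` model are hypothetical; the only assumption about the model is that it exposes a `predict` method, as scikit-learn estimators do.

```python
import numpy as np

def accuracy(model, features, expected):
    """Return the fraction of rows where model.predict matches the expected output."""
    predicted = np.asarray(model.predict(features))
    expected = np.asarray(expected)
    return float(np.mean(predicted == expected))


class AlwaysPickOne:
    """Toy stand-in model used only to exercise the function."""
    def predict(self, features):
        return [1] * len(features)


# Two of the four expected picks are 1, so the toy model is right half the time.
demo_score = accuracy(AlwaysPickOne(), [[0.1], [0.2], [0.3], [0.4]], [1, 2, 1, 60])
```

Note that this counts only exact matches: predicting pick 11 for a player taken at 10 is scored the same as predicting pick 60.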

K Nearest Neighbors Model prediction and accuracy checking

In this part, we will use the sklearn.neighbors.KNeighborsClassifier algorithm on the data.
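Here is a minimal sketch of fitting `KNeighborsClassifier` for a few values of k and scoring each on the data it was trained on. The features and pick labels are randomly generated placeholders, and the values of k are chosen only for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 10))       # synthetic features
y_train = rng.integers(1, 61, size=300)    # synthetic pick labels 1..60

knn_scores = {}
for k in (1, 2, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # score() returns the mean accuracy; here it is measured on the training data.
    knn_scores[k] = knn.score(X_train, y_train)
```

Even though these labels are random, k=1 scores perfectly on the training data, because each point's nearest neighbor is the point itself.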

Support Vector Machine Model (SVM)

We will use a linear SVM and a radial basis SVM to test the accuracy of the model. An SVM classifies data by finding the boundary that separates the classes with the largest margin; the radial basis kernel allows that boundary to be nonlinear.
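The two SVM variants can be sketched with scikit-learn's `SVC` by switching the `kernel` argument. The data below is synthetic, only 10 labels are used to keep the example quick, and the scaling step in the pipeline is an assumption rather than something taken from our analysis.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 10))       # synthetic features
y_train = rng.integers(1, 11, size=120)    # synthetic labels 1..10

svm_scores = {}
for kernel in ("linear", "rbf"):
    # Scale the features first, since SVMs are sensitive to feature scale.
    svm = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    svm.fit(X_train, y_train)
    svm_scores[kernel] = svm.score(X_train, y_train)   # training accuracy
```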

Multilayer Perceptron

A multilayer perceptron is a type of deep learning model that uses forward computation, backpropagation, and stochastic gradient descent to train the network.

Logistic Regression

We will also test the accuracy rate of the logistic regression model.

Finally, with all of our different models, we can generate a table of accuracy rates. The goal is to find a model to predict the NBA draft, so the model with the highest accuracy rate should be the one we want.
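One way to build such a table is to fit each model in a loop and collect the scores in a DataFrame. This is a sketch on synthetic data: the hyperparameters (hidden layer size, iteration limits) and the model labels are assumptions, and the resulting numbers say nothing about the real draft data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))       # synthetic features
y = rng.integers(1, 11, size=200)    # synthetic labels 1..10

models = {
    "KNN (k=1)": KNeighborsClassifier(n_neighbors=1),
    "KNN (k=2)": KNeighborsClassifier(n_neighbors=2),
    "SVM (linear)": SVC(kernel="linear"),
    "SVM (rbf)": SVC(kernel="rbf"),
    "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0),
    "Logistic regression": LogisticRegression(max_iter=1000),
}

# Fit each model and record its accuracy on the training data.
rows = [{"model": name, "accuracy": m.fit(X, y).score(X, y)} for name, m in models.items()]
accuracy_table = (
    pd.DataFrame(rows)
    .sort_values("accuracy", ascending=False)
    .reset_index(drop=True)
)
```

The MLP may print a convergence warning with so few iterations; that is harmless for this illustration.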

Step b: Model comparisons and predictions.

Model Comparisons

From the above data, we can see that KNN with 1 neighbor has the highest accuracy: 1.0. We should be skeptical of this, because accuracy is measured on the training data, so with only 1 neighbor every data point has itself as its nearest neighbor and is trivially classified correctly. The model with the second-highest accuracy is KNN with 2 neighbors, at a 53% accuracy rate. The models with the lowest accuracy rate are the multilayer perceptron and the support vector machine with a radial basis kernel, at 6.8% accuracy. Even the best model still has a low accuracy rate, likely in part because we only have 1302 data points as training data. More data points should help improve the accuracy rate.
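To see why a perfect training score deserves doubt, one option (not something done in this tutorial) is to hold out part of the data and score the model on rows it has never seen. The sketch below does this on synthetic data whose labels are unrelated to the features, so any honest accuracy estimate should be close to chance.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))       # synthetic features
y = rng.integers(1, 61, size=400)    # labels unrelated to the features

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

knn1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
train_acc = knn1.score(X_train, y_train)   # each point matches itself
test_acc = knn1.score(X_test, y_test)      # accuracy on unseen rows
```

The gap between `train_acc` and `test_acc` is the memorization effect: the training score rewards the model for looking up answers it has already been given.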

Prediction on 2021 drafts

We will pick the top 2 models and use them to predict the 2021 draft picks for NBA players.
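The prediction step can be sketched as follows: fit a model on past picks, predict a pick for each current prospect, and sort by the predicted value. Everything here is a placeholder, including the random data, the prospect names, the `f0`..`f9` feature columns, and the `predicted_pick` column name.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the past draft data.
X_past = rng.normal(size=(300, 10))
y_past = rng.integers(1, 61, size=300)
model = KNeighborsClassifier(n_neighbors=2).fit(X_past, y_past)

# Synthetic stand-in for the current draft class.
current_df = pd.DataFrame(
    rng.normal(size=(5, 10)), columns=[f"f{i}" for i in range(10)]
)
current_df.insert(0, "name", [f"Prospect {c}" for c in "ABCDE"])

feature_cols = [c for c in current_df.columns if c != "name"]
# Pass a plain array because the model was fitted without column names.
current_df["predicted_pick"] = model.predict(current_df[feature_cols].to_numpy())

# Sort so the earliest predicted picks come first.
draft_board = (
    current_df.sort_values("predicted_pick")[["name", "predicted_pick"]]
    .reset_index(drop=True)
)
```

Because each player is classified independently, several players can receive the same predicted pick, so the result is better read as a ranking than as a strict draft order.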

Conclusion

First, a special shoutout to Maryland's own Aaron Wiggins; he is predicted to go number 7 by our KNN 2 model.

From our predictions, some of the top players include Jose Alvarado, Colin Castleton, and Quentin Grimes. They appear in the top 10 for both our KNN 1 and KNN 2 models. We can output some of the top draft picks from both of our models here.

Now, we won't know the actual results of the draft until July. However, we can compare our guesses to what the sports media thinks by looking at mock drafts, which are predictions of how the draft will play out. We will look at mock drafts from Hashtag Basketball, Bleacher Report, and CBS Sports.

All three of these websites have Cade Cunningham, a freshman from Oklahoma State, as the unanimous number 1 overall pick. Both of our models had him predicted to go around pick 36, much later in the draft. Evan Mobley is predicted to go second overall by Hashtag Basketball and Bleacher Report, and third by CBS. Our KNN 1 model had him picked at number 36, and our KNN 2 model had him picked at number 26.

Comparing our top picks to the mock drafts, one thing to note is that CBS Sports and Bleacher Report only predicted the first round of the draft, so the top 30 picks. That said, Jose Alvarado and Colin Castleton did not appear on any of these three lists as draft picks. Quentin Grimes appeared only on Hashtag Basketball, as the 58th best prospect, barely squeaking into the end of the draft. Sadly, our picks and the sports media's picks do not seem to align.

How to get a better model

From our low model accuracy to the predictions that did not match what most other people are saying, our model is definitely not the best. What would be the best ways to improve it? First, more data. As stated in the beginning, until relatively recently (2006), NBA players did not have to attend college before playing. Many top players were drafted right from high school. If data were collected over the next ten years, the sample size would effectively be doubled, giving better insight into overall trends.

Possibly one of the largest pieces of data we were missing was position. A point guard is typically more offensively talented, shooting, passing, and making plays. A center is larger and will have more dunks, a worse free throw percentage, and a lot of rebounds. Since we did not have positional data, all players were grouped together. Had we divided by position, perhaps there would be a relation between point guard shooting percentage and draft positions, for example.

More data analysis in general could have been beneficial. We did not see any trends in the data, so our model was trained on all of it. If we could better refine which parameters to train our model on, perhaps it would be more accurate.

Of course, the final test of our predictions will be the actual NBA draft. Thank you for following along with our data science tutorial. We hope that you were inspired to do some data analysis on your own!