SQL, Python, and Tableau basketball stats project

Recently I wanted to learn how to run SQL queries. In deciding which data to use, I wanted a dataset where I’d have some level of intuition for what the data points mean. An obvious choice is sports statistics, and basketball is the sport with which I’m most familiar.

I decided this also would be a good opportunity to further practice my python skills, particularly in object-oriented programming. I wrote a program that would simulate a basketball season in an imaginary league with 30 teams, each playing every other team twice. It would store the box score for each game in a csv file, which I could then store in an SQL database and run queries against.

Select the buttons below to reveal information for the components of the project you’d like to learn about.

I want the simulated basketball games to be as close to reality as possible, so let's look into typical game statistics using a Kaggle dataset of real NBA games. The dataset has 5 csv files: teams, players, games, games_details, and ranking. They all contain multiple columns of data, for example:

teams.csv: founding year, mascot, arena name, owner name, etc.
players.csv: player name, player ID, etc.
games.csv: game date, team IDs, stats for each team, etc.
games_details.csv: box scores for each game
ranking.csv: team records for each season

My simulated basketball league will assign to a player expected values for several of their in-game statistics, e.g. three-point field goal percentage and number of rebounds. Let's look at the career three-point field goal percentages for starting players, and plot the distribution for each of the three positions, guard (G), forward, (F), and center (C). It is quick to run an SQL query against the games_details table to get this data:


    SELECT PLAYER_ID, SUM(FG3M) AS Total_FG3M, SUM(FG3A) AS Total_FG3A
    FROM games_details
    WHERE START_POSITION = 'G'
    GROUP BY PLAYER_ID;

This query groups the starting guards by their ID, and sums the values in the columns FG3M and FG3A, returning the sums along with their ID. Only the starting players have a non-null value for START_POSITION, so the WHERE START_POSITION = 'G' line selects only the starting guards. We can do an analogous query for two-point field goals, and we can then export the data to do some exploratory data analysis in Python.

We can use pandas to manipulate the data generated by our SQL query. A processing function can filter out outliers.


    import pandas as pd

    def read_and_process_data(file_path, position, filter_col, filter_value, percent_col, numerator, denominator):
        df = pd.read_csv(file_path)
        # remove rows with 0 attempts
        df = df[df[filter_col] != filter_value]
        # add column for FG% or FG3%
        df[percent_col] = df[numerator] / df[denominator]
        # exclude outliers of 0% and 100%
        df = df[df[percent_col].between(0.01, 0.99)]  
        # add column for position
        df['Position'] = position
        return df

After processing the files, we generate histograms with the plotly library.


    import plotly.express as px

    # Histograms
    fg2_fig = px.histogram(fg2_data, x='FG2%', color='Position', marginal='rug', opacity=0.7, nbins=100, barmode='group')
    fg3_fig = px.histogram(fg3_data, x='FG3%', color='Position', marginal='rug', opacity=0.7, nbins=100, barmode='group')

Click on the histograms to be redirected to an interactive version where you can select/deselect different positions.

It is reasonable to say these follow a normal distribution. The means and standard deviations are:


    FG2%
    G: mean = 0.47, std = 0.09
    F: mean = 0.49, std = 0.10
    C: mean = 0.53, std = 0.10
    FG3%
    G: mean = 0.35, std = 0.09
    F: mean = 0.34, std = 0.09
    C: mean = 0.32, std = 0.12

We can use these values in our simulation when generating a player.

The rebounds, assists, steals, blocks, and turnovers, on the other hand, appear to follow a Poisson distribution. This makes sense, because it is reasonable to approximate that these events (i.e. rebound or assist) occur at a constant mean rate and independently of the time since the last event.

The means for these distributions are:


    G OREB 0.6941601915652641
    G DREB 3.0671553535649707
    G STL 1.1481503200899184
    G BLK 0.29137467624492985
    G TO 2.1170796070957336
    F OREB 1.4905629025794407
    F DREB 4.696712899158432
    F STL 0.9501607874185067
    F BLK 0.6708402975300316
    F TO 1.6754537723953904
    C OREB 2.494358733697034
    C DREB 5.618290608318179
    C STL 0.6815079877202245
    C BLK 1.2430730724858723
    C TO 1.595119375843257

We are now ready to write the code for our simulated league.

The python libraries used are as follows:


    import numpy as np
    import pandas as pd
    import random
    from itertools import combinations
    from math import comb

Creating the rosters

I create a class for each of the three basic positions, using the means and standard deviations (if applicable) for the distributions we investigated earlier. For example, the guard class looks like:


    class Guard(Player):
        def __init__(self, seed=None):
            super().__init__({
                # stat: (mean, std)
                'fg2': (0.47, 0.09),
                'fg3': (0.35, 0.09),
                # mean value
                'dreb': 3.07,
                'oreb': 0.69,
                'ast': 4.6,
                'stl': 1.15,
                'blk': 0.29,
                'to': 2.12
            }, seed=seed)

I then have a player class that, when called, generates a player of a particular position and gives them fixed expected values for their stats, by picking a number either according to a normal or Poisson distribution. This makes each player have some level of uniqueness to them.


    class Player:
        '''
        Generates a player of a given position and sets their expected stats.
        '''
        def __init__(self, typical_stats, seed=None):
            self.expected_stats = {}
            self.seed = seed
            means_stds = {key: value for key, value in typical_stats.items() if type(value) == tuple}
            poissons = {key: value for key, value in typical_stats.items() if type(value) == float}

            for stat, (mean, std) in means_stds.items():
                self.expected_stats[stat] = np.random.normal(mean, std, 1)[0]

            for stat, mean in poissons.items():
                self.expected_stats[stat] = np.random.poisson(mean, 1)[0]

            # create an overall rating of each player based on their expected stats
            self.overall = (self.expected_stats['fg2'] + self.expected_stats['fg3'] + self.expected_stats['dreb'] + self.expected_stats['oreb'] + self.expected_stats['ast']) / (typical_stats['fg2'][0] * typical_stats['fg3'][0] * typical_stats['dreb'] * typical_stats['oreb'] * typical_stats['ast'])

The team class initializes a team with a particular name and generates five players to be on the roster.


    class Team:
        ''' 
        Generates a team of 5 players.  Also sets the expected distribution of stats and the pace.
        '''
        def __init__(self, name, seed=None):
            self.name = name
            self.seed = seed
            # roster
            self.pg = Guard(seed=self.seed)
            self.sg = Guard(seed=self.seed + 1)
            self.sf = Forward(seed=self.seed)
            self.pf = Forward(seed=self.seed + 1)
            self.c = Center(seed=self.seed)

            # dictionaries with position objects
            self.positions_dict = {
                'PG': self.pg,
                'SG': self.sg,
                'SF': self.sf,
                'PF': self.pf,
                'C': self.c
            }

This class also has dictionaries that contain the expected distribution of stats by position, which are weighted according to the player's expected values for those stats. For example, for rebounds we have:


    def normalize_dict(dictionary):
        '''
        Normalize a dictionary of values so that they sum to 1.
        '''
        total = sum(dictionary.values())
        if total == 0:
            return {key: 0.2 for key in dictionary.keys()}
        return {key: value / total for key, value in dictionary.items()}

    class Team:
    ''' 
    Generates a team of 5 players.  Also sets the expected distribution of stats and the pace.
    '''
    def __init__(self, name, seed=None):
        self.name = name
        self.seed = seed
        # roster
        self.pg = Guard(seed=self.seed)
        self.sg = Guard(seed=self.seed + 1)
        self.sf = Forward(seed=self.seed)
        self.pf = Forward(seed=self.seed + 1)
        self.c = Center(seed=self.seed)

        # dictionaries with position objects
        self.positions_dict = {
            'PG': self.pg,
            'SG': self.sg,
            'SF': self.sf,
            'PF': self.pf,
            'C': self.c
        }

        # expected distributions of stats by position
        self.shot_distribution_pos = self.roster_weight('fg')
        self.dreb_distribution_pos = self.roster_weight('dreb')
        self.oreb_distribution_pos = self.roster_weight('oreb')
        self.ast_distribution_pos = self.roster_weight('ast')
        self.stl_distribution_pos = self.roster_weight('stl')
        self.blk_distribution_pos = self.roster_weight('blk')
        self.to_distribution_pos = self.roster_weight('to')
        # pace of team: average number of possessions per game
        self.pace = np.random.normal(75, 5, 1)[0]
        # expected rebounds a team gets per game
        self.expected_dreb = sum(player.expected_stats['dreb'] for player in self.positions_dict.values())
        self.expected_oreb = sum(player.expected_stats['oreb'] for player in self.positions_dict.values())

    def roster_weight(self, stat_to_consider):
        # returns a dictionary of weights for each position based on a given stat.
        weights_dict = {}
        if stat_to_consider == 'fg':
            for position, player in self.positions_dict.items():
                # choose the larger of 2P% and 3P%
                weights_dict[position] = max(player.expected_stats['fg2'], player.expected_stats['fg3'])
            return normalize_dict(weights_dict)
        for position, player in self.positions_dict.items():
            weights_dict[position] = player.expected_stats[stat_to_consider]
        return normalize_dict(weights_dict)

This also sets the pace for the team, which will be used later so that each simulated game does not have the same number of possessions, and the total number of expected rebounds per game.

Simulating a game

The program works by simulating a bunch of individual possessions in a game between two teams. First there are functions which handle the outcome of a shot, assist, rebound, etc. For example:


    def handle_shot(off_team, position, shot_worth, result):
        '''
        Determines if a shot is made and updates the stats accordingly.
        '''
        shot_key = 'fg2' if shot_worth == 2 else 'fg3'
        shot_chance = off_team.positions_dict[position].expected_stats[shot_key]
        if random.random() < shot_chance:
            # update points
            result[position][0] += shot_worth
            # update makes
            make_index = 4 if shot_worth == 2 else 6
            result[position][make_index] += 1
            # update attempts
            result[position][make_index + 1] += 1
            return True
        # update attempts for a miss
        miss_index = 5 if shot_worth == 2 else 7
        result[position][miss_index] += 1

    def handle_assist(off_team, shooter, result):
        '''
        Determines if an assist is made and updates the stats accordingly.
        '''
        if random.random() < ASSIST_CHANCE:
            assister = weighted_random_key(off_team.ast_distribution_pos)
            while assister == shooter:
                assister = weighted_random_key(off_team.ast_distribution_pos)
            result[assister][2] += 1

    def handle_rebound(type, team, result):
        '''
        Updates the rebounding stats.
        '''
        if type == 'oreb':
            rebounder = weighted_random_key(team.oreb_distribution_pos)
            result[rebounder][2] += 1
        else:
            rebounder = weighted_random_key(team.dreb_distribution_pos)
            result[rebounder][1] += 1

These select a player using the weighted_random_key function, which picks according to a distribution of probabilities for the players on the team. If your probability is relatively large compared to your teammates, you are more likely to be selected by the function.


    def weighted_random_key(prob_dict):
        '''
        Pick a random key from a dictionary of probabilities.
        '''
        # generate a random number
        rand_num = random.random()
        cumulative_probability = 0.0

        # iterate through the dictionary
        for key, probability in prob_dict.items():
            cumulative_probability += probability
            # check if the random number is less than the cumulative probability
            if rand_num < cumulative_probability:
                return key

We also initialize the stats for that possession with an initialize_stats function.


    def initialize_stats():
        '''
        Initializes a dictionary of stats for each position on a single possession.
        [PTS, DREB, OREB, AST, 2PM, 2PA, 3PM, 3PA, STL, BLK, TO]
        '''
        return {pos: [0] * 11 for pos in ['PG', 'SG', 'SF', 'PF', 'C']}

The possession function proceeds through a possession, calling the above functions when appropriate. It randomly selects a player from the team to take a shot; the random selection is weighted by the players' field goal percentages, i.e. if you are a better shooter, you are more likely to be the one to take the shot. The function also allows for the possibility of a turnover, and for offensive rebounds.


    def possession(off_team, def_team):
        '''
        Simulates a possession and returns the players' stats.
        '''
        # initialize who will have the ball on the next possession
        next_team = def_team
        # initialize stats for this possession
        off_result, def_result = initialize_stats(), initialize_stats()

        # check for a turnover
        if random.random() < TURNOVER_CHANCE:
            handle_steal(def_team, def_result)
            handle_turnover(off_team, off_result)
            return off_result, def_result, next_team

        # determine who will shoot the ball
        shooter = weighted_random_key(off_team.shot_distribution_pos)
        # decide whether they attempt a two or a three
        shooting_percentages = {2: off_team.positions_dict[shooter].expected_stats['fg2'], 3: off_team.positions_dict[shooter].expected_stats['fg3']}
        shooting_percentages = normalize_dict(shooting_percentages)
        shot_worth = weighted_random_key(shooting_percentages)

        # check for a made shot
        if handle_shot(off_team, shooter, shot_worth, off_result):
            handle_assist(off_team, shooter, off_result)
            return off_result, def_result, next_team

        # check for block
        handle_block(def_team, def_result)

        # check for offensive rebound
        if random.random() < OFF_REBOUND_CHANCE_BASE + OFF_REBOUND_BUFF_FACTOR * ((off_team.expected_oreb / def_team.expected_dreb) - 1):
            handle_rebound('oreb', off_team, off_result)
            next_team = off_team
        # assign defensive rebound
        else:
            handle_rebound('dreb', def_team, def_result)

        return off_result, def_result, next_team

Simulating a full game, then, simply requires us to simulate many possessions and keep track of the players' stats along the way.


    class Game:
        '''
        Simulates a game by simulating a number of possessions between two teams.
        '''
        def __init__(self, team1, team2):
            self.team1 = team1
            self.team2 = team2

        def initialize_box_score(self):
            return {pos: [0] * 11 for pos in ['PG', 'SG', 'SF', 'PF', 'C', 'Team']}

        def update_box_score(self, box_score, position_stats):
            for position, stats in position_stats.items():
                for i in range(11):
                    box_score[position][i] += stats[i]

        def sum_team_stats(self, box_score):
            for i in range(11):
                box_score['Team'][i] = sum(stats[i] for position, stats in box_score.items() if position != 'Team')

        def play_game(self):
            team1_box_score = self.initialize_box_score()
            team2_box_score = self.initialize_box_score()

            num_possessions = int(np.ceil(sum([self.team1.pace, self.team2.pace]) / 2.0))

            off_team, def_team = self.team1, self.team2
            for _ in range(num_possessions * 2):
                off_result, def_result, next_team = possession(off_team, def_team)

                if off_team == self.team1:
                    self.update_box_score(team1_box_score, off_result)
                    self.update_box_score(team2_box_score, def_result)
                else:
                    self.update_box_score(team2_box_score, off_result)
                    self.update_box_score(team1_box_score, def_result)

                off_team, def_team = next_team, self.team1 if next_team == self.team2 else self.team2

            self.sum_team_stats(team1_box_score)
            self.sum_team_stats(team2_box_score)

            return team1_box_score, team2_box_score

Simulating a season

Simulating a season is also straightforward. We enumerate all the matchups that will be played, initialize a DataFrame to store the stats for each game, and then simulate all the games. We also keep track of the winners and losers of each game so that league standings can be recorded.


    class Season:
    '''
    Simulates a season by simulating all games between all teams.  Each team plays every other team twice.
    '''
    def __init__(self, num_teams):
        self.num_teams = num_teams
        self.num_games = 2 * comb(self.num_teams, 2)
        self.teams = [Team(name=team_nicknames[i], seed=i) for i in range(self.num_teams)]
        self.matchups = self.generate_matchups()
        self.df_game_stats = self.initialize_game_stats_df()
        self.team_records = {team_name: [0, 0, 0] for team_name in team_nicknames[:self.num_teams]}

    def generate_matchups(self):
        matchups = np.array(list(combinations(self.teams, 2)))
        # flip the matchups so that each team plays each other team twice
        return np.concatenate((matchups, np.flip(matchups, axis=1)))

    def initialize_game_stats_df(self):
        columns = ['GAME_ID', 'TEAM_NAME', 'OPPONENT_NAME', 'PLAYER_NAME', 'POSITION', 'PTS', 'REB', 'AST', 'FG2M', 'FG2A', 'FG3M', 'FG3A']
        return pd.DataFrame(columns=columns)

    def add_game_stats(self, game_id, team, opponent, team_stats):
        # add the team stats for a single game to the game stats dataframe
        rows = [[game_id, team.name, opponent.name, f"{team.name}_{position}", position] + stats for position, stats in team_stats.items()]
        self.df_game_stats = pd.concat([self.df_game_stats, pd.DataFrame(rows, columns=self.df_game_stats.columns)], ignore_index=True)

    def update_team_records(self, team1, team2, team1_stats, team2_stats):
        if team1_stats['Team'][0] > team2_stats['Team'][0]:
            self.team_records[team1.name][0] += 1
            self.team_records[team2.name][1] += 1
        elif team1_stats['Team'][0] < team2_stats['Team'][0]:
            self.team_records[team1.name][1] += 1
            self.team_records[team2.name][0] += 1
        else:
            self.team_records[team1.name][2] += 1
            self.team_records[team2.name][2] += 1

    def play_season(self):
        for game_id in range(self.num_games):
            team1, team2 = self.matchups[game_id]
            team1_stats, team2_stats = Game(team1, team2).play_game()
            self.add_game_stats(game_id, team1, team2, team1_stats)
            self.add_game_stats(game_id, team2, team1, team2_stats)
            self.update_team_records(team1, team2, team1_stats, team2_stats)

        self.team_records = {key: value for key, value in sorted(self.team_records.items(), key=lambda item: item[1][0], reverse=True)}

For a league with 50 teams, we can then simply run:


    s = Season(50)
    s.play_season()

The box score data is now ready for analysis. I uploaded the csv-file output from my Python simulation to a MySQL server hosted on an Amazon Web Servcies RDS instance, and used the free software DBeaver to interact with the server.

Suppose we want to calculate a fantasy score for each player in each game. We can alter the SQL table with the following sequence of commands to do so.


    ALTER TABLE game_stats 
    ADD fantasy_score INT;
    UPDATE game_stats
    SET fantasy_score = PTS + FG3M - FG2A - FG3A + 2*FG2M + 2*FG3M + OREB + DREB + 2*AST + 4*STL + 4*BLK - 2*`TO`;

We might want to plot a rolling average of a player's fantasy score as the season progresses. To do this, we can make use of a common table expression (CTE) to remove the 'Team' stats from the calculation.


    WITH FantasyScores AS (
        SELECT GAME_ID, GAME_DATE, PLAYER_ID, fantasy_score 
        FROM game_stats
        WHERE `POSITION` != 'Team'
    ),
    SELECT GAME_DATE, PLAYER_ID,
        AVG(fantasy_score) OVER (
            PARTITION BY PLAYER_ID
            ORDER BY GAME_DATE
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW 
        )
    FROM FantasyScores

We can also add a column of data to each row in the game_stats table that indicates whether that player won or lost the game. This is something that could have been done in Python, but I didn't realize I'd want it until after I uploaded the data to the SQL server. The process makes use of CTEs and the joining of tables.


    WITH 
    TeamPoints AS (
        SELECT GAME_ID, TEAM_ID, SUM(PTS) AS PointsFor
        FROM game_stats
        WHERE `POSITION` != 'Team'
        GROUP BY GAME_ID, TEAM_ID
    ),
    TotalPoints AS (
        SELECT GAME_ID, SUM(PTS) AS PointsForPlusPointsAgainst
        FROM game_stats
        WHERE `POSITION` != 'Team'
        GROUP BY GAME_ID
    ),
    JoinedTables AS (
        SELECT a.GAME_ID AS GAME_ID, a.TEAM_ID AS TEAM_ID, a.PointsFor AS PointsFor, b.PointsForPlusPointsAgainst AS PointsForPlusPointsAgainst
        FROM TeamPoints AS a
        INNER JOIN TotalPoints AS b ON a.GAME_ID = b.GAME_ID
    ),
    WLDTable AS 
    (
        SELECT GAME_ID, TEAM_ID, PointsFor, PointsForPlusPointsAgainst, 
            CASE 
                WHEN PointsFor > PointsForPlusPointsAgainst / 2 THEN 1
                WHEN PointsFor < PointsForPlusPointsAgainst / 2 THEN 0
                WHEN PointsFor = PointsForPlusPointsAgainst / 2 THEN 2
            END AS WLD
        FROM JoinedTables
    )
    UPDATE game_stats gs
    JOIN WLDTable wld ON gs.GAME_ID = wld.GAME_ID AND gs.TEAM_ID = wld.TEAM_ID
    SET gs.WLD = wld.WLD;

Check out the Tableau story I created with the data resulting from my Python simulation.