For fun, I took on a data analysis project: examining the past 10+ years of winning Powerball numbers to see whether there are patterns in the winning numbers and the winning states.
The main questions I wanted to answer with this project are:
Who is winning the Powerball? What states are more or less likely to win?
What numbers and combinations of numbers are more or less likely to win? Are there better odds based on previous data?
To start, I had to create my own dataset containing the winning numbers, the winning states' initials, and the jackpot amounts for all drawings from January 4, 2014 to August 5, 2024. This data is publicly available on the official Powerball website.
Each drawing date has its own page, and every page uses the same format for its text values and categories. With 13 values on each of close to 1,260 pages, covering more than 10 years' worth of drawings, copying and pasting each value manually was impractical. Instead, I used a Python script to scrape the web elements from each page automatically. I used the browser's web inspector to find the XPaths for the text elements I needed to pull.
To write the Python code, I relied heavily on online tutorials and videos that walk through the process of creating a web-scraping script. The full code and raw .csv file are available in the GitHub repository: Github Powerball Demo project
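To illustrate the XPath idea described above, here is a minimal sketch. It is not the script from the repository: the page snippet, tag names, class names, and values are all made up for illustration, and it uses only Python's standard library, whose XPath support is limited and which only parses well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one drawing page. The tag names,
# class names, and values are made up; the real Powerball markup differs.
SAMPLE_PAGE = """
<html>
  <body>
    <h5 class="draw-date">2024-08-05</h5>
    <div class="white-ball">1</div>
    <div class="white-ball">2</div>
    <div class="white-ball">3</div>
    <div class="white-ball">4</div>
    <div class="white-ball">5</div>
    <div class="powerball">6</div>
    <span class="jackpot">$20 Million</span>
  </body>
</html>
"""


def extract_drawing(page_source: str) -> dict:
    """Pull the date, ball numbers, and jackpot text out of one page."""
    root = ET.fromstring(page_source.strip())
    # Each XPath-style expression selects elements by tag and class attribute.
    white_balls = [
        int(element.text)
        for element in root.findall(".//div[@class='white-ball']")
    ]
    return {
        "date": root.find(".//h5[@class='draw-date']").text,
        "white_balls": white_balls,
        "powerball": int(root.find(".//div[@class='powerball']").text),
        "jackpot": root.find(".//span[@class='jackpot']").text,
    }


drawing = extract_drawing(SAMPLE_PAGE)
print(drawing)
```

Real web pages are rarely well-formed XML, so a working scraper would more likely load each page with a browser automation tool or a lenient HTML parser and apply the XPaths found in the inspector. Looping a function like this over every drawing date and writing each result as a row would produce the raw .csv file.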
After generating the raw .csv file, I pasted the values into a Google Sheets spreadsheet and organized the columns. I used formulas to clean and simplify the data for use in my analysis later on.
I created tabs to divide the data into categories. The first three show the dates, numbers, winning amounts, and states of winning drawings. The 'State Counts' tab shows the number of winners per state. The 'Number Counts' tab lists how often each number appeared on each ball. The last tab, 'Sheet4', is simply a worksheet that I used to test formulas before applying them to the main tabs.
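I did this counting with spreadsheet formulas, but the same tallies can be sketched in Python. The example below is only an illustration: the rows, field names, and function names are hypothetical and do not come from the spreadsheet.

```python
from collections import Counter

# Made-up sample rows; each row stands for one drawing.
drawings = [
    {"white_balls": [1, 2, 3, 4, 5], "powerball": 6, "winning_states": ["FL", "CA"]},
    {"white_balls": [1, 7, 8, 9, 10], "powerball": 6, "winning_states": ["FL"]},
    {"white_balls": [2, 7, 11, 12, 13], "powerball": 3, "winning_states": []},
]


def count_winners_by_state(rows):
    """Tally how many winners each state has across all drawings."""
    return Counter(state for row in rows for state in row["winning_states"])


def count_numbers_by_ball(rows):
    """Tally how often each number appears in each ball position."""
    counts = {f"ball_{position}": Counter() for position in range(1, 6)}
    counts["powerball"] = Counter()
    for row in rows:
        for position, number in enumerate(row["white_balls"], start=1):
            counts[f"ball_{position}"][number] += 1
        counts["powerball"][row["powerball"]] += 1
    return counts


state_counts = count_winners_by_state(drawings)
number_counts = count_numbers_by_ball(drawings)
print(state_counts.most_common())
print(number_counts["ball_1"].most_common(1))
```

Keeping one counter per ball position mirrors the idea of listing frequencies for each ball separately, rather than pooling all numbers together.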
The full spreadsheet can be found here: Powerball Lotto sheet
The first part of my analysis addresses my first question: who is winning the Powerball? To better visualize the dataset, I created a map and a bar chart in Tableau Public that show the number and density of winners per state.
At a quick glance, Florida has the most lottery winners at 221, a total that includes jackpot, $2 million, and $1 million winners, and California has the second-most winners overall. A few surprises stood out to me: my small home state of MA has 61 total winners, the most among states of a similar size, and NY has more winners than TX, which is larger in both population and land area.
From the bar chart, I see that most winners are $1 million winners, who make up more than half of all winners in most states. Through this dataset, I also discovered that CA has no $2 million winners. Researching the topic further showed me that CA's gambling regulations do not permit the optional Powerball add-on (Power Play) that boosts non-jackpot prizes, which is how a $1 million prize becomes $2 million elsewhere.
The link to the interactive Tableau dashboard is here: Slide 1 of Powerball Lotto Results
To test the hypothesis that the number of Powerball winners is tied to population density, I found a map on Wikipedia, based on US census data, that shows the population density per square mile of all states and territories as of 2020. The light-to-dark shading on the two maps is similar, though not identical. It is likely that overall state population also plays a role in how many people participate in the lottery.
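My comparison of the two maps was purely visual. One way to put a number on it would be a correlation coefficient between winners per state and population density. The sketch below is an assumption about how that could be done, not something I ran for this project, and the values in it are placeholders rather than real state data.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return covariance / sqrt(spread_x * spread_y)


# Placeholder values for four hypothetical states, not real figures.
winners_per_state = [10, 20, 30, 40]
density_per_sq_mile = [100.0, 150.0, 400.0, 350.0]

r = pearson(winners_per_state, density_per_sq_mile)
print(round(r, 3))
```

A value near 1 would support the density hypothesis, a value near 0 would not, and running the same calculation against total population would show which of the two tracks the winner counts more closely.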
Other things I've noticed in the dataset
In conclusion, without detailed information on the number of tickets purchased or on the ticket holders, I cannot say definitively who is winning the lottery and why. Based on this data, the number of Powerball winners in each state appears to be affected by local gambling laws, state-wide gambling culture, overall population, and population density.