
Finding Datasets

updated October 3, 2022

For the final group project, you will need to find a dataset for your data science question. Here is a list that we have compiled over the years to help you get started. Please feel free to use any of these, or find another dataset online.

A few sites may require a free account to download or access the data.

  • US Public Food Assistance ↗
    This dataset focuses on public assistance programs in the United States that provide food, namely SNAP and WIC.
  • DSNY Monthly Tonnage Data ↗
    New York’s Department of Sanitation dataset about trash and recycling in the city, going back to 1990.
  • Las Vegas Strip Dataset ↗
    This dataset includes quantitative and categorical features from online reviews of 21 hotels located on the Las Vegas Strip, extracted from TripAdvisor. (To view the dataset, click Data Folder at the top, then download the .csv.)
  • Bookstore Dataset ↗
    Fictional bookstore dataset containing over 1000 books, with different categories, star ratings, and prices.
  • Meat Washing Survey Data ↗
    YouTuber Adam Ragusea asked his viewers to answer a detailed survey about whether (and why, and how) they wash meat before cooking it. He received more than 13,000 responses, then made a video about what he found and published a spreadsheet of the anonymized answers.
  • NBA RAPTOR ↗
    Datasets of former and current NBA players with statistics calculated by FiveThirtyEight’s RAPTOR algorithm.
  • Private Prisons Dataset ↗
    Political scientist Anna Gunderson has assembled a longitudinal dataset of private prisons at the federal, state, and local levels through 2016. Gunderson’s dataset identifies each facility’s name, location, primary customer, capacity, security level, contract information, and more.
  • U.S. Births ↗
    Dataset used by FiveThirtyEight to analyze whether parents are superstitious about their children being born on Friday the 13th.
  • Emergency Events Database ↗
    Database of more than 21,000 natural and technological disasters in the world, from 1900 to the present. It focuses on disasters that have caused 10+ human deaths, affected 100+ people, sparked a state of emergency, and/or prompted a request for international assistance.
  • TikTok Popular Songs 2022 ↗
    Popular TikTok songs from 2022, with basic information about each song (artist, title, etc.) as well as advanced musical information such as tempo and time signature.
  • Gender by Name Dataset ↗
    Names of babies born in the US, UK, Australia, and Canada. Contains the name, gender, count, and probability based on the aggregate count.
  • Fandango Dataset ↗
    A dataset that contains each movie’s Fandango online rating as well as its ratings from many other online movie review websites.
  • CBP Migrant Children Detention Data ↗
    The Marshall Project has obtained and published official data from US Customs and Border Protection listing 580,000+ times that the agency detained migrant children since early 2017. For each detention, the dataset includes the date and time the child entered and left CBP custody, as well as the child’s age, gender, and citizenship.
  • Online Shoppers Purchasing Intention Dataset ↗
    Dataset about online shoppers’ purchasing habits containing product information and time spent on the website.
  • The Planetary Science Budget Dataset ↗
    This integrated, annualized dataset enables improved adjustments for inflation and allows direct comparisons between past and current planetary exploration efforts within the United States. The project is now led by Bill Nye (the Science Guy).
  • The Speed-dating Data ↗
    Data collected from speed dating events hosted by the researchers for 2 years. More information about the research can be found here.
  • Replication Data for: Deal or No Deal? ↗
    Researchers collected data from over 100 episodes of Deal or No Deal to analyze how people make decisions when the risk is high.
  • Drug Review Dataset ↗
    The dataset provides patient reviews on specific drugs along with related conditions and a 10-star patient rating reflecting overall patient satisfaction.
  • Fraud Detection ↗
    Anonymized credit card transactions labeled as fraudulent or genuine. The data was collected from transactions made in September 2013 by European cardholders.
  • Food World Cup ↗
    Data collected from a survey in which participants were asked about certain international dishes and responded using a fixed set of answer choices.
  • Venmo Transaction Dataset ↗
    The author extracted every public transaction through Venmo’s API to help people become aware of privacy issues. (The full dataset is large; I’d suggest looking at sample.json to get an idea.)
  • College Scorecard - U.S. Department of Education ↗
    The College Scorecard dataset is provided by the U.S. Department of Education and contains information on nearly every college and university in the United States. The dataset includes data on student loan repayment rates, graduation rates, affordability, earnings after graduation, and more.
  • Air Quality Dataset ↗
    Contains the responses of a gas multisensor device deployed in the field in an Italian city. Hourly response averages are recorded along with gas concentration references from a certified analyzer.
  • San Diego Burritos ↗
    Scott Cole is a neuroscience Ph.D. student at UC San Diego who, in his spare time, is leading a project to rate the region’s burritos on a 10-dimensional scale.
  • Breast Cancer Coimbra Dataset ↗
    Observations of 64 patients with breast cancer and 52 healthy controls, intended for making predictions based on certain health features.
  • Kickstarter Datasets ↗
    A Lithuania-based web-scraping company has been collecting data on Kickstarter projects every month. The datasets include each project’s number of backers, amount pledged, and category.
  • County Water Dataset ↗
    Dataset provided by San Diego County. The county’s goal is to decrease water consumption by 15% from 2014 to 2020 and by 20% by 2030.
  • Mushroom Dataset ↗
    Mushrooms described in terms of physical characteristics, each classified as poisonous or edible.
  • Speedtest Data by Ookla ↗
    The dataset provides global fixed broadband and mobile network performance metrics. Download speed, upload speed, and latency are collected via Speedtest by Ookla.
  • SBO Dataset ↗
    The Census Bureau conducts a survey of business owners every 5 years to get better information on the demographics of the economy. More info about the survey is here.
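
Once you have downloaded a dataset, a quick first look at its columns and blank cells can help you decide whether it fits your question. The sketch below is one possible way to do that with Python's standard library, assuming the file is a CSV with a header row. The function name summarize_csv and the sample rows are made up for illustration and are not taken from any dataset listed above.

```python
import csv
import io
from collections import Counter


def summarize_csv(text, max_rows=1000):
    """Summarize CSV text: column names, rows scanned, blank cells per column.

    Assumes the first line is a header row. Scans at most max_rows data rows.
    """
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    missing = Counter()
    rows_scanned = 0
    for row in reader:
        if rows_scanned >= max_rows:
            break
        rows_scanned += 1
        for col in columns:
            # Short rows yield None for absent cells; treat those as blank too.
            if (row.get(col) or "").strip() == "":
                missing[col] += 1
    return {
        "columns": columns,
        "rows_scanned": rows_scanned,
        "missing": dict(missing),
    }


# Invented sample standing in for a downloaded file.
sample = (
    "title,category,stars,price\n"
    "Book A,Fiction,4,12.99\n"
    "Book B,,5,8.50\n"
    "Book C,Poetry,,\n"
)

summary = summarize_csv(sample)
print(summary)
```

For a real download, you could read the file with open(path, newline="", encoding="utf-8").read() and pass the result to summarize_csv, where path is wherever you saved the .csv. Columns with many blank cells are worth noticing early, since they may limit the questions the dataset can answer.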