Reading 3

  • Hadley Wickham, 2014, Tidy Data [Sections 1-3] download ↗
  • Broman & Woo, 2018, Data organization in spreadsheets download ↗

This reading focuses on the process of cleaning data, also known as data “munging” or data “wrangling”. We focus primarily on spreadsheet-style data, because it is likely to be some of the first data you encounter in future classes and it is ubiquitous in real-world scenarios. Tidy data also comes up when dealing with images, text, video, and so on, but the skills and theory behind tidy data (as presented in these readings) are foundational and absolutely required before you can begin wrangling other, more complex data types. We will see an example of wrangling more complex data in the final weeks of class, during a talk on deep learning.

Additional Resources

Here is an example ↗ from Garrett Grolemund on how you can use a programming language such as R to work with a dataset and quickly turn it into tidy data. This walk-through provides some insight into how data is wrangled, as well as some of the benefits afforded to the analyst when working with data that is tidy.
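
To give a rough feel for what “turning a dataset into tidy data” can look like in code, here is a minimal sketch in Python (not R, and not taken from the walk-through above). It assumes a small, made-up “wide” table in which each treatment has its own column, and reshapes it so that every row holds exactly one observation. The table contents and the names `wide_rows`, `to_tidy`, `treatment`, and `result` are hypothetical, chosen only for illustration.

```python
# Hypothetical "wide" table: one row per person, one column per treatment.
# The values are invented purely for illustration.
wide_rows = [
    {"person": "A", "treatment_a": 5, "treatment_b": 2},
    {"person": "B", "treatment_a": 16, "treatment_b": 11},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Reshape wide rows so that each output row holds one observation."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every output row
            tidy.append(
                {
                    id_column: row[id_column],
                    variable_name: column,  # the old column header becomes a value
                    value_name: value,
                }
            )
    return tidy


tidy_rows = to_tidy(wide_rows, "person", "treatment", "result")
for row in tidy_rows:
    print(row)
```

After reshaping, each variable (person, treatment, result) has its own column and each observation has its own row, which is the layout the Tidy Data reading describes. Dedicated tools, such as the R packages used in the walk-through, do this kind of reshaping in a single call.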

The paper “Good enough practices in scientific computing” ↗ by Greg Wilson et al. describes in detail a minimum set of practices to follow when doing data science.