Reading 2

  • Loukides M, Mason H & Patil DJ, Data’s Day of Reckoning download ↗
  • Narayanan and Shmatikov, Myths & Fallacies of “Personally Identifiable Information” download ↗

These readings focus on ethics—specifically, the ethical challenges that come with working with large-scale data. In class, we’ll introduce the core ideas and frameworks in ethics and walk through practical examples. These readings go further by emphasizing how ethical issues show up in real data science work, why they arise, and how you might respond to them in practice.

Patil’s article offers concrete, actionable guidance for thinking through ethical tradeoffs and building responsible practices. It’s especially useful when you write the ethics section in Assignment 1 and in your final project.

Narayanan’s paper takes a deeper dive into the risks of working with personal data, including how collecting, linking, and storing data can create harms—even when intentions are good. It highlights why privacy and data protection are not optional add-ons, but central to responsible analysis.

Reading Guide

  • R2 Reading Guide download↗
    • This reading guide highlights the major ideas to focus on in the paper. You will not turn in the guide—it’s simply here to help you distill the paper and keep track of the main points. Use it before you read (to preview what to look for), while you read (to take notes), and after you read (to review). It will also be a helpful reference when preparing for the reading quiz/exam!

Additional Resources

Security

Here are two excellent sources for security-related information, research, and news.

  • Bruce Schneier’s site ↗ He is a fellow/lecturer at Harvard and on the board of the EFF.
  • Brian Krebs’ blog KrebsOnSecurity ↗ is a resource for current news and investigative reporting on security issues. He worked for the Washington Post as a reporter for over 15 years.

Deon Checklist Tool

Deon view ↗ is a command-line tool that allows you to easily add an ethics checklist to your data science projects. You can use this checklist as an example when you write the ethics portion of Assignment 1 and your final group project.

HIPAA Data

The HIPAA Privacy Rule protects individually identifiable health information. HIPAA defines 18 identifiers that count as personally identifiable information—data that could identify, contact, or locate a person, either on its own or when combined with other sources. When this identifying information is linked to someone’s health condition, care, or payment for care, it is considered Protected Health Information (PHI).

If a dataset contains any of the 18 identifiers (even partial forms, like initials), it is considered identified. A dataset is only de-identified if all 18 identifiers are removed, including dates (all elements except the year), voice recordings, and photographic images:

  • Name
  • Address (all geographic subdivisions smaller than a state, including street address, city, county, and zip code)
  • All elements (except years) of dates related to an individual (including birthdate, admission date, discharge date, date of death, and exact age if over 89)
  • Telephone numbers
  • Fax number
  • Email address
  • Social Security Number
  • Medical record number
  • Health plan beneficiary number
  • Account number
  • Certificate or license number
  • Vehicle identifiers and serial numbers, including license plate numbers
  • Device identifiers and serial numbers
  • Web URL
  • Internet Protocol (IP) Address
  • Finger or voice print
  • Photographic image - Photographic images are not limited to images of the face.
  • Any other characteristic that could uniquely identify the individual
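
To make the list above concrete, here is a minimal Python sketch of how a single record might be scrubbed before analysis. It is an illustration only, not an official HIPAA tool: the column names (such as `medical_record_number` and `admission_date`) and the `scrub_record` helper are hypothetical and would need to match your own dataset. It also cannot catch the last item on the list (any other uniquely identifying characteristic) or identifiers hidden in free-text fields, so a human review is still needed.

```python
from datetime import date

# Hypothetical column names for illustration; adapt them to your own schema.
IDENTIFIER_COLUMNS = {
    "name", "street_address", "city", "county", "zip_code",
    "phone", "fax", "email", "ssn", "medical_record_number",
    "health_plan_id", "account_number", "license_number",
    "vehicle_id", "device_id", "url", "ip_address",
    "fingerprint", "voice_print", "photo",
}
DATE_COLUMNS = {"birth_date", "admission_date", "discharge_date", "death_date"}


def scrub_record(record: dict) -> dict:
    """Return a copy of `record` without listed identifiers and with dates reduced to years."""
    cleaned = {}
    for column, value in record.items():
        if column in IDENTIFIER_COLUMNS:
            continue  # drop the identifier entirely
        if column in DATE_COLUMNS:
            # Keep only the year; drop the field if it is not a real date object.
            if isinstance(value, date):
                cleaned[column.replace("_date", "_year")] = value.year
            continue
        if column == "age" and isinstance(value, int) and value > 89:
            cleaned[column] = "90+"  # exact ages over 89 are not kept
            continue
        cleaned[column] = value
    return cleaned


example = {
    "name": "Jane Doe",
    "zip_code": "12345",
    "admission_date": date(2021, 3, 14),
    "age": 93,
    "diagnosis": "asthma",
}
print(scrub_record(example))
# prints {'admission_year': 2021, 'age': '90+', 'diagnosis': 'asthma'}
```

The sketch drops anything it does not know how to generalize safely (for example, a date stored as a string is removed rather than passed through), on the assumption that losing a field is less harmful than leaking an identifier.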