Introduction to Cultural Analytics & Python

User Ethics & Legal Concerns

Before we dive into collecting data from the internet, we need to discuss some serious questions. Is it legal and/or ethical to computationally collect data from the internet? Is it legal and/or ethical to publish research that includes internet users’ data without their knowledge?

Legal Concerns

If internet data is publicly available (e.g., tweets from a public Twitter account), collecting it is generally considered legal in the United States, even if a particular platform’s terms say that you cannot. In 2019, the Ninth Circuit Court of Appeals ruled in hiQ Labs v. LinkedIn that scraping publicly accessible websites likely does not violate the Computer Fraud and Abuse Act, the main federal anti-hacking law. You can read more about this ruling from the Electronic Frontier Foundation.

Institutional Review Boards (IRBs)

Research that involves human participants (e.g., surveys, interviews, blood draws) needs to be approved by an Institutional Review Board (IRB). But research about publicly available internet data does not typically require IRB approval.

The Cornell Institutional Review Board nonetheless recommends caution when mining data from the internet, and advises seeking “formal confirmation of non-human participant research status”:

If the individual or social media/network site has not placed any restrictions on access to information about himself/herself (e.g., information available on a public website, blog, twitter feed, chat room, etc.), the following best practices should be followed:

  • The researcher should send a project description to the IRB office and seek a formal confirmation of non-human participant research status for the study. We believe that in most cases, this will not be considered human participant research, but caution is recommended before a researcher makes his/her own determination, because of the emerging ethical sensitivities in this area.

User Ethics

Just because something is legal or gets approved by an IRB does not mean it is ethical. Collecting, sharing, and publishing internet data created by or about individuals can lead to unwanted public scrutiny, harm, and other negative consequences for those individuals. For these reasons, some researchers attempt to anonymize internet data before sharing it or before publishing an article that cites a specific post. Yet anonymizing internet data also denies internet users credit as creators and authors.

There is no single, simple answer to the many difficult questions raised by internet data collection. It is important to develop an ethical framework that responds to the specifics of your particular research project or use case (e.g., the platform, the people involved, the context, the potential consequences, etc.).

In my own research, I have started seeking explicit permission from internet users when I want to quote them in a published article. In this book, I only share internet data that meets a certain threshold of publicness, such as tweets from verified Twitter accounts or Reddit posts with a certain number of upvotes. This is an approach that I have developed based on some of the models and readings included below.

Models & Examples

Below are a few examples of how researchers have approached social media data in published research:

  • In Maria Antoniak, David Mimno, and Karen Levy’s article about a Reddit community dedicated to birth stories (r/BabyBumps), they paraphrased the Reddit submissions discussed in the article and then deleted all collected Reddit data after the article was published.

  • In Deen Freelon, Charlton McIlwain, and Meredith D. Clark’s report about the #BlackLivesMatter movement, they included links to tweets rather than the full text of tweets, and they only linked to tweets that had at least 100 retweets and were published by Twitter users who had at least 3,000 followers or were verified. They embargoed their Twitter data for a year and then publicly released a list of tweet IDs. Tweet IDs can be used by third parties to re-download any tweets that have not yet been deleted, as I discuss in the lesson “Twitter Data Sharing”.

  • In Moya Bailey’s article about the #GirlsLikeUs hashtag, created by trans advocate Janet Mock, she asked for Mock’s permission to work on the project before it began and collaborated with Mock to develop research questions and determine the project’s direction.
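
A “threshold of publicness” like the one used in the #BlackLivesMatter report above can be expressed as a simple filter. The following is only a minimal sketch under stated assumptions: the sample records, the field names (`id`, `retweets`, `followers`, `verified`), and the function `meets_publicness_threshold` are all hypothetical and do not come from the report or from any real API. The cutoffs of 100 retweets and 3,000 followers mirror the numbers described in that example, and the appropriate threshold for your own project may well be different.

```python
# Hypothetical tweet records; these field names are assumptions for illustration,
# not the schema of any real API.
tweets = [
    {"id": "1001", "retweets": 250, "followers": 12000, "verified": False},
    {"id": "1002", "retweets": 40, "followers": 50000, "verified": True},
    {"id": "1003", "retweets": 500, "followers": 800, "verified": True},
    {"id": "1004", "retweets": 150, "followers": 900, "verified": False},
]


def meets_publicness_threshold(tweet, min_retweets=100, min_followers=3000):
    """Return True if a tweet is widely shared AND its author is a public figure
    by this (assumed) definition: verified, or above a follower cutoff."""
    widely_shared = tweet["retweets"] >= min_retweets
    public_author = tweet["verified"] or tweet["followers"] >= min_followers
    return widely_shared and public_author


# Keep only the IDs of qualifying tweets, so that tweet text is never shared
shareable_ids = [t["id"] for t in tweets if meets_publicness_threshold(t)]
print(shareable_ids)  # ['1001', '1003']
```

Sharing only the IDs, rather than the text, means that a tweet its author later deletes can no longer be re-downloaded from the list.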

Recommended Reading

  • Doc Now White Paper, Bergis Jules, Ed Summers, Dr. Vernon Mitchell, Jr.

  • No Robots, Spiders, or Scrapers: Legal and Ethical Regulation of Data Collection Methods in Social Media Terms of Service, Casey Fiesler, Nathan Beard, Brian C. Keegan

  • #transform(ing)DH Writing and Research: An Autoethnography of Digital Humanities and Feminist Ethics, Moya Bailey

  • The #TwitterEthics Manifesto, Dorothy Kim and Eunsong Kim

By Melanie Walsh
© Copyright 2021.

Creative Commons License This book is licensed under a Creative Commons BY-NC-SA 4.0 License. The code is licensed under a GNU General Public License v3.0.