As mentioned in the previous unit, I used TAGS to scrape data for a research project in late 2019. Over the span of approximately five months, I collected half a million tweets. Pretty early on in my Twitter scraping, I realized that I was getting more information than I had asked for or expected. For instance, I was only interested in knowing what kinds of tweets were using hashtags and who was tweeting them. The Twitter API, however, also gave me the user's profile location, the number of retweets, the Twitter ID, replies, mentions, media, and the profile image.
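To make that surplus concrete, here is a minimal sketch, in Python with the Tweepy library rather than TAGS, of what a simple hashtag search against the Twitter v2 API hands back. The hashtag, bearer token, and field choices are illustrative assumptions, not the setup I actually used:

```python
# Illustrative sketch (not my original TAGS setup): a hashtag search
# against the Twitter v2 API with Tweepy. The bearer token, hashtag,
# and field choices below are placeholder assumptions.
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

response = client.search_recent_tweets(
    query="#example",                                # hypothetical hashtag
    tweet_fields=["created_at", "public_metrics"],   # retweet/reply/like counts
    user_fields=["location", "profile_image_url"],   # profile location and image
    expansions=["author_id"],                        # attach the author's profile
    max_results=10,
)

users = {u.id: u for u in response.includes["users"]}
for tweet in response.data:
    author = users[tweet.author_id]
    # The query only asked for a hashtag, yet every result carries IDs,
    # engagement counts, a profile location, and a profile image URL.
    print(tweet.id, tweet.public_metrics["retweet_count"],
          author.username, author.location, author.profile_image_url)
```

Even though the query asks for nothing but a hashtag, every row arrives with identifiers and engagement metrics attached, exactly the kind of over-collection I describe above.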
I was struck by how much more data I was able to get. As a graduate student just beginning my data collection, I was elated. This was, to put it simply, a boon: far more data for far less work. At the time, I didn't really think about the power implications or the ethics behind that. It wasn't until 2021, when I started doing participant observation on Twitter, that I began to reflect on my existing Twitter dataset. The question I kept asking myself was: how private is social media data?
And here I am indebted to internet and social media scholars such as Christian Fuchs, Matthew Williams, and Moya Z. Bailey, who have not only written extensively on these questions but also offered us models for conducting social media research in an ethical and transparent manner, especially when working with vulnerable communities or online hate groups. Their scholarship served as a guide for my own work, alerting me to how urgently social media researchers need to think about questions of ethical conduct, however tedious or arduous that work may appear.
Commercialism: Twitter is a public platform driven by commercial interests, and its privacy guidelines should not be the starting point for researchers and academics. Instead, we should think about the implications of quoting a tweet verbatim in our study, even if it comes from a relatively unknown, small Twitter account.
Public data: Even if Twitter says that its platform is public, that does not mean its users are aware of, or even on board with, how their tweets will be used outside the platform. Labeling a dataset public does not make its contents fair game, especially when the users whose words we have collected never consented to appearing in a research study.
Sensitive information: Lastly, any academic use of social media data has the potential to release sensitive information about people. This means we need to consider what constitutes sensitive information in the context of our individual social media research. What I, as a researcher, consider sensitive may be totally different from, or at odds with, what the users who make up my data consider sensitive.
The existing scholarship on internet research ethics suggests that we create context-specific ethical frameworks for our research methodologies. To do so, we can follow a three-step process. The first two steps are research-driven: reading up on existing models of internet research ethics, and getting acquainted with the data privacy laws of the region in which we are conducting our research. The last step is to create a model of our own that is specific to our research context.
In the next few paragraphs, I will give you a brief rundown of the ethical framework for my current research project. Since I work with both vulnerable groups and online hate groups, it was imperative that I find ways not only to protect the privacy of the users in my data but also to protect myself.
For the first step, I read the Association of Internet Researchers' (AoIR) latest installment of ethical guidelines and UCLA's IRB guidelines on how to conduct internet research. I also relied on guidelines produced by the American Anthropological Association and the think tank Data & Society on how to conduct risky research.
I then read up on the data privacy laws of California and India, so that I could learn what I could and could not publish. Reading up on data privacy laws is also a good way to decide what counts as sensitive personal information in the context of your research.
Sensitive personal information usually covers attributes such as race, ethnicity, age, place of residence, telephone number, and name. In social media research we do not have quite the same kind of data, but we still hold sensitive personal information: a username or a profile image, for instance, could constitute sensitive information in the context of my research. The handy map created by Matthew L. Williams on how to publish Twitter data is useful for figuring out what counts as sensitive information.
Based on this research, I concluded that I should submit an IRB application for my study, because I am working with "big data" in which tweets and their accompanying metadata can easily be classified as data collected from humans, albeit algorithmically or computationally.
Upon reading through the existing guidelines on ethical research conduct, I also concluded that I should seek informed consent for any tweets I plan to quote verbatim; otherwise, I will have to paraphrase them. The other task is to de-personalize, or anonymize, my Twitter dataset using tools like Excel or OpenRefine. I will be anonymizing my datasets using the instructions shared in the video below.
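Alongside the point-and-click workflow in the video, here is a minimal scripted sketch of the same de-personalization step, using pandas and assuming a TAGS-style CSV export. The file and column names (tweets.csv, from_user, user_id_str, profile_image_url) are assumptions you will need to adjust to your own data:

```python
# A minimal pseudonymization sketch using pandas instead of Excel or
# OpenRefine. The file and column names (tweets.csv, from_user,
# user_id_str, profile_image_url) are assumptions about a TAGS-style
# export; adjust them to match your own dataset.
import hashlib

import pandas as pd

def pseudonymize(value: str) -> str:
    """Replace an identifier with a short one-way hash."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]

df = pd.read_csv("tweets.csv")

# Replace direct identifiers with stable pseudonyms, so the same user
# can still be linked across rows without being named.
for col in ["from_user", "user_id_str"]:
    if col in df.columns:
        df[col] = df[col].map(pseudonymize)

# Drop fields that identify people but add nothing to the analysis.
df = df.drop(columns=["profile_image_url"], errors="ignore")

df.to_csv("tweets_anonymized.csv", index=False)
```

Keep in mind that hashing usernames is pseudonymization rather than true anonymization: pasting a verbatim tweet into Twitter's search can often re-identify its author, which is one more reason to paraphrase quoted tweets.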
Now it's your turn to start thinking about your own ethical frameworks.
Take 5-10 minutes and read through the first couple of pages of the AoIR guidelines. If you're a UCLA student, I also recommend checking out the IRB guidelines here.
Next, go to your preferred search engine and spend five minutes looking up the data privacy laws of the region you are researching from.
Spend a couple of minutes checking out this decision tree for publishing Twitter data.
Based on this brief research exercise, create a checklist for yourself that outlines the additional reading you will have to do, your decision on whether to submit an IRB application, and a preliminary idea of how to anonymize your data.