©University of Sheffield, all rights reserved
While it may seem that social media data has already been shared and made available to all, it is important to remember that:
The data may not be persistently available, or persistently available under the same access/usage conditions
The data may be owned by the individual and/or the owner of the social media platform.
Making social media data FAIR depends on multiple factors, as we'll see below, but is still possible in some instances.
While some aspects of using social media data will be covered here, a much more comprehensive policy note has been created by the University Ethics team: Policy Note No. 14. Research Involving Social Media.
As mentioned in that policy note, one of the most important things you need to do with regard to social media data is to check the platform’s Terms and Conditions before collecting or sharing any data. This is not only because the ability to use or access a platform’s data can change over time, but also because the Terms and Conditions themselves change frequently. If they do, the new Terms immediately apply to the data you hold. It should also be noted that the use of social media data in research, either primary or secondary, where the content is available, needs ethical approval.
There are multiple ways that social media data can be collected:
Using the social media platform as a way of reaching and contacting people
This could either be to collect their already-existing social media data, or for research by other means, such as sending a link to an online survey.
Mining or scraping data directly from the platform
Purchasing already-collected data from a third party provider
Directly from the platform in an 'observation of online public space'
There are also multiple types of data that could be collected via these methods:
Data created by the user (content)
Data about the content (engagement data): for example, the number of likes, shares, followers, connections, etc.
Data collected by the social media platform company, such as user location, interest in adverts, time spent on platform, etc.
Personal data such as names, photos, user IDs, often found in the user bio[graphy] (personal data could also be contained in the user content).
Each of these collection methods and types of data affects how you might be able to share the data with others. However, most cases will fall into one of a few main categories that may allow you to share social media data (none of which supersede the platform's T&Cs).
It should also be noted that the decision of whether a social media post is private or public mainly comes from the social media user, even if it is openly available (see Policy Note No. 14 for further discussion/examples). The idea that it would be used for research, placed in a dataset and shared onwards might not have occurred to, or be wanted by, the user.
It's important to note that while social media data may be readily available online, it still belongs to the individual (and potentially the platform as well) - therefore, consent is required to enable any planned re-use unless the data is wholly anonymised (see below).
By obtaining individual consent from participants, you can not only ensure that you can use their social media data for your research, but also make sure they are happy for you to make it open afterwards. This allows you to obtain agreement for the long term preservation of their data, which they may choose to delete from the platform (see below for more information). It also allows you to check if the person is who they say they are (such as their age, and whether they are a vulnerable user, e.g. under 18).
Fully anonymised data is no longer considered personal data, which means it can be shared openly (although some platforms do not permit anonymisation). For more information on anonymisation, please see the Personal, sensitive & confidential data page. Effective anonymisation can be difficult in general, but even more so with social media data, as the data (e.g. a post) could still be openly available to all and could be found via a search engine if its full text is used as the query. This means that any removed user names or other information could potentially be found and linked back to the data again.
It is often recommended that paraphrasing is used instead of direct quoting for publications for this reason. However, this may contravene some platforms’ T&Cs, and limit the reusability/reproducibility of the research compared to if, say, the raw data was available.
One of the main reasons why it is difficult to work with and share social media data is the responsibility that the platform has to its users, primarily to uphold their right to delete their data.
If a user deletes content, this should be deleted in all its forms, which would include the instance that you may hold as a researcher. To do this, synchronisation of data is required, which ensures that content is deleted appropriately and in its entirety. For such cases, it is common for the platform’s T&Cs to only allow the sharing of content/post IDs. In theory, this would mean others would only be able to access such content if it has not been deleted (in a process called rehydration).
There could, however, be an issue with this approach as this would mean that a user ID could also be available, and as mentioned previously, this could lead to the individual being identifiable.
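As a rough illustration of the ID-only approach described above, the following Python sketch keeps just the post IDs from a set of collected records and discards the user IDs and post text. The record structure and the field names (`post_id`, `user_id`, `text`) are assumptions made for this example, not any platform's real export format, and whether sharing IDs is permitted at all still depends on the platform's T&Cs.

```python
import csv
import io

# Hypothetical collected records; the field names are assumptions for
# illustration, not any platform's actual export format.
collected_posts = [
    {"post_id": "1001", "user_id": "u42", "text": "Example post A"},
    {"post_id": "1002", "user_id": "u7", "text": "Example post B"},
    {"post_id": "1001", "user_id": "u42", "text": "Example post A"},  # duplicate
]


def extract_post_ids(posts):
    """Return unique post IDs in first-seen order, discarding all other fields."""
    seen = set()
    ids = []
    for post in posts:
        post_id = post["post_id"]
        if post_id not in seen:
            seen.add(post_id)
            ids.append(post_id)
    return ids


def write_id_list(ids, handle):
    """Write one post ID per row under a single 'post_id' header."""
    writer = csv.writer(handle)
    writer.writerow(["post_id"])
    for post_id in ids:
        writer.writerow([post_id])


shareable_ids = extract_post_ids(collected_posts)
buffer = io.StringIO()  # stands in for a file that would be deposited
write_id_list(shareable_ids, buffer)
print(buffer.getvalue())
```

The resulting list contains no user IDs or post text, so anyone rehydrating it would only retrieve posts that are still available on the platform. It does not, on its own, make the data anonymous, since a rehydrated post can still point back to its author.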
In some instances, another approach that could be taken is to make open and available the approach you took to create your dataset: what type of content was sought, the inclusion and exclusion criteria, and the methods or Application Programming Interface (API) functions/code that were used. While this would most likely create a different dataset if ‘reproduced’, it could be seen as a snapshot taken at a different time, using the same approach. Other ways of being open that don’t include open data are to share analysed, sample, or dummy data, to share how you accessed the data (if gained from a third party, or how you got permission from the platform), and to create a metadata-only record in a repository.
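One way to picture such a record is as a small structured description of how the dataset was built, with no posts attached. The Python sketch below is a minimal example under stated assumptions: the field names and the list of required fields are hypothetical and do not reflect any particular repository's schema, and every value is a placeholder to be replaced with the details of the actual study.

```python
import json

# Fields a reader would need to understand how the dataset was built.
# These names are illustrative assumptions, not a repository's required schema.
REQUIRED_FIELDS = (
    "content_sought",
    "inclusion_criteria",
    "exclusion_criteria",
    "collection_method",
    "access_route",
)


def missing_fields(record):
    """Return the required fields that are absent or empty in the record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Placeholder values only; replace with the details of the actual study.
collection_record = {
    "content_sought": "PLACEHOLDER: type of posts collected",
    "inclusion_criteria": ["PLACEHOLDER: criterion for including a post"],
    "exclusion_criteria": ["PLACEHOLDER: criterion for excluding a post"],
    "collection_method": "PLACEHOLDER: API functions or code used",
    "access_route": "PLACEHOLDER: third party or platform permission",
    "data_shared": False,  # metadata-only: the posts themselves are not deposited
}

problems = missing_fields(collection_record)
if problems:
    raise ValueError(f"Record is missing: {', '.join(problems)}")

record_json = json.dumps(collection_record, indent=2)
print(record_json)
```

Checking for empty fields before depositing the record is a simple guard against publishing a description too thin for others to repeat the collection.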
If the social media data that was collected also includes other elements, such as photos or images, these may very well have copyright attached to them. This means that storing them and making them openly available in a repository may not be possible. Caution should be taken if such data has been collected to ensure that nothing is shared that is under copyright and not owned by the researcher(s).
It is important to note that any social media data collected may also be protected by copyright. This is especially the case where the media includes original photos or images, but may also be the case for text - short extracts of as few as 11 words can qualify for copyright protection.
While there are exceptions for the collection of social media data (e.g. section 29A exception for text and data analysis), the terms of the exception have to be adhered to: in this case, it does not allow sharing of the copied material. This means that, depending on how material was collected, storing it and making it openly available, such as in a repository, may not be possible. Caution should be taken with copyrighted data to ensure that nothing is shared or communicated unless a suitable lawful basis exists to support this.
It's worth mentioning again that using social media data to do research and openly sharing that data are different processes, each with its own difficulties. It may be possible to do both, but it might only be possible to use the data and not share it. This is in part because ethical arguments to use the data can differ from legal arguments to share the data. The Ethics Team’s Policy Note No. 14. Research Involving Social Media should be consulted to support decision-making on this point.
This UKRIO (UK Research Integrity Office) webinar on Social Media and Ethics is also a good primer in this area.