SDG Survey Data
Text Analysis Results
This website contains interactive visualisations of four types of text analysis (term extraction, contrast analysis, topic modeling, network mapping), based on survey data in which researchers selected research outputs that are related to the 17 Sustainable Development Goals (SDGs). The results are used as input to improve the current SDG classification model from v4.0 to v5.0.
For you, this website can serve as a resource for finding connections between terms, concepts or keywords that are relevant for mapping research output to the SDGs.
Scroll down to download the Survey data, Text Analysis data and the SDG Search Queries data. Click on the SDG icons to see the text analysis visualisations.
The Sustainable Development Goals are the 17 global challenges set by the United Nations. Within each goal, specific targets and indicators are defined to monitor progress towards reaching those goals by 2030. In an effort to capture how research contributes to moving the needle on those challenges, we earlier made an initial classification model that makes it possible to quickly identify which research output is related to which SDG. (This Aurora SDG dashboard is the initial outcome, as proof of practice.)
The initiative started in the Aurora Universities Network in 2017, in the working group "Societal Impact and Relevance of Research". Its aim is to investigate and make visible 1. what research is done that is relevant to topics or challenges that live in society (for the proof of practice this has been scoped down to the SDGs), and 2. what the effect or impact is of applying those research outcomes to those societal challenges (this has also been scoped down, to research output being cited in policy documents from national and local governments and NGOs). The classification model we have used consists of 17 different search queries on the Scopus database.
About the data and its provenance
For full transparency about what you are looking at, we explain the steps by which we obtained the data. To improve the SDG classification model v4.0, we aggregated all the SDG related papers (metadata only) from the Scopus publication database. Next we distributed a survey in which senior researchers with expertise in a specific SDG were invited to (among other things) go through a random set of 100 publications and accept or reject each paper, depending on whether they consider it relevant for that SDG. (This measures the precision, or soundness, of our v4.0 model.) Those papers, including their judgements (accept=true, reject=false), were analysed on the title, abstract and keywords in the metadata. These steps are illustrated below and for each step a dataset has been made.
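The accept/reject judgements described above translate directly into a precision estimate per SDG: the share of judged papers that respondents accepted. The following is a minimal sketch of that calculation. The record layout (SDG label, paper id, accepted flag) and the sample values are assumptions made for illustration; they are not the schema or content of the published survey dataset.

```python
from collections import defaultdict

# Hypothetical survey records: (sdg, paper_id, accepted).
# Layout and values are illustrative assumptions, not the real dataset.
judgements = [
    ("SDG3", "p1", True),
    ("SDG3", "p2", True),
    ("SDG3", "p3", False),
    ("SDG13", "p4", True),
    ("SDG13", "p5", False),
]

def precision_per_sdg(records):
    """Return, per SDG, the share of judged papers that were accepted."""
    accepted = defaultdict(int)
    total = defaultdict(int)
    for sdg, _paper_id, is_accepted in records:
        total[sdg] += 1
        if is_accepted:
            accepted[sdg] += 1
    return {sdg: accepted[sdg] / total[sdg] for sdg in total}

precision = precision_per_sdg(judgements)
```

In this toy example two of the three SDG3 papers were accepted, so the estimated precision of the SDG3 query is two thirds.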
All data, models and software are licensed under a Creative Commons license. We want to encourage everyone to use and re-use the data and to contribute to improving the classification models. The global goals are there for all of humanity. As this project will stop one day, please use it, contribute to it and make it your own.
To contribute and improve models, please visit our git repository here: https://github.com/Aurora-Network-Global/sdg-queries
To use and reuse our data, e.g. for building and validating your own classification model for SDG related research, please visit our data deposit here: https://zenodo.org/communities/aurora-universities-network/ . In there you'll find:
The Survey data (input and output data):
The Analyses data of these text analyses:
The Classification models (SDG queries) old and improved:
This shows the context of the Text Analysis in the project. It uses the output of the survey as input for the new classification model.
Methods used to do the text analysis
Term Extraction: after text normalisation (stemming, etc.) we extracted 2 terms in bigrams and trigrams that co-occurred the most per document, in the title, abstract and keywords.
Contrast analysis: the co-occurring terms in publications (title, abstract, keywords), split between the papers that respondents have indicated relate to this SDG (y-axis: True) and the papers that have been rejected (x-axis: False). In the top left you'll see term co-occurrences that clearly relate to this SDG. In the bottom right are terms that appear in papers that have been rejected for this SDG. The top-right terms appear frequently in both and cannot be used to discriminate between the two groups.
Network map: This diagram shows the cluster-network of terms co-occurring in the publications related to this SDG, selected by the respondents (accepted publications only).
Topic model: This diagram shows the topics, and the related terms that make up each topic. The number of topics is related to the number of targets of this SDG.
Contingency matrix: This diagram shows the top 10 co-occurring terms that correlate the most.
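To make the term extraction and contrast analysis described above more concrete, here is a simplified sketch. It is not the CorTexT implementation: it skips stemming and stop-word removal, and simply counts in how many texts each bigram or trigram appears, separately for accepted and rejected papers. The function names and the sample texts are hypothetical.

```python
import re
from collections import Counter

def ngrams(text, n):
    """Return the word n-grams of a lower-cased text as tuples."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def term_counts(texts):
    """Count in how many texts each bigram or trigram appears."""
    counts = Counter()
    for text in texts:
        # A set ensures each term is counted once per text.
        counts.update(set(ngrams(text, 2)) | set(ngrams(text, 3)))
    return counts

def contrast(accepted_texts, rejected_texts):
    """Map each term to (share of accepted texts, share of rejected texts)."""
    acc = term_counts(accepted_texts)
    rej = term_counts(rejected_texts)
    n_acc = max(len(accepted_texts), 1)
    n_rej = max(len(rejected_texts), 1)
    return {t: (acc[t] / n_acc, rej[t] / n_rej) for t in set(acc) | set(rej)}

# Hypothetical titles, standing in for title + abstract + keywords.
accepted = ["Clean water access in rural areas", "Rural clean water supply"]
rejected = ["Water cooling in data centres"]
scores = contrast(accepted, rejected)
```

Here `scores[("clean", "water")]` is `(1.0, 0.0)`: the term appears in every accepted text and in no rejected one, which corresponds to the top-left region of the contrast diagram. A term such as `("water", "cooling")` scores `(0.0, 1.0)` and would sit in the bottom right.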
Software used to do the text analyses
CorTexT: The CorTexT Platform is the digital platform of LISIS Unit and a project launched and sustained by IFRIS and INRAE. This platform aims at empowering open research and studies in humanities about the dynamic of science, technology, innovation and knowledge production.
About the SDG survey data | the basis for these text analyses
In total we received 224 completed responses from researchers from the different Aurora Universities and beyond. This diagram shows the distribution of responses per SDG per country. The largest number of respondents came from our Spanish partners.
To improve the SDG classification model, we need a minimum of 5 respondents per SDG. All SDGs passed that minimum, except SDG1 "No poverty". The SDG with the most respondents is SDG3 "Good health and well-being".
Survey outcomes: Precision & Recall, defining the baseline for improvement
To improve the classification model, we first needed to measure how good the initial one is. This is done by measuring the precision and recall. Precision is about soundness: "Are the publications we found for an SDG actually related to that SDG?" Recall is about completeness: "Have we found all publications related to an SDG?" Both measures range from 0 (bad) to 1 (good).
The diagram shows the precision (y-axis) and recall (x-axis) for each of the SDG classification models (SDG queries). The size of each point represents the number of respondents.
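As a small worked example of these two measures, the sketch below computes precision and recall from two sets of paper ids: the papers a query found and the papers that are actually relevant. The ids and the function name are hypothetical, and how the reference set of relevant papers is obtained from the survey is not shown here.

```python
def precision_recall(found, relevant):
    """Precision and recall of the found papers against the relevant papers."""
    found, relevant = set(found), set(relevant)
    hits = len(found & relevant)  # papers that are both found and relevant
    precision = hits / len(found) if found else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical paper ids for one SDG query.
found = {"p1", "p2", "p3", "p4"}
relevant = {"p1", "p2", "p3", "p5", "p6", "p7"}
p, r = precision_recall(found, relevant)
```

In this example three of the four found papers are relevant (precision 0.75), but only three of the six relevant papers were found (recall 0.5): a query that is fairly sound but not complete.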
We can see that most of our current queries are sound, but are far from complete.
(SDG1 is an outlier, since it has almost no data.)