Final Project for Class Fundamentals of Data Science - FGV
By Davi Barreira & Franklin Oliveira
The Curriculum Lattes is a CV recording the academic activities of students and researchers in Brazil. Stored on Plataforma Lattes, this CV format is adopted by most Brazilian academic institutions and research centres, which makes it a rich source of information for studying the country's academic production.
On Plataforma Lattes, researchers state their area of study, which makes it possible to analyse the data by research area.
In this project, we analyse the data for researchers in mathematics. We also collected and analysed the researchers whose areas are listed as computer science or probability and statistics, since we consider these to be part of mathematics as a whole.
Our aim was to get an overall understanding of the mathematics research network formed by the researchers on Plataforma Lattes. For that we:
We are Brazilian math students, do we need to say more?! We wanted to learn about the mathematics research network and answer questions such as: who are the researchers that publish the most, how is research evolving over time, do mathematicians in the network collaborate among themselves, and what are their degrees of separation?
Plataforma Lattes is a public platform owned by the Brazilian government, and its data is freely available to anyone who has the patience to go through the CAPTCHA and collect it. The raw CVs were in XML format and were converted to JSON with Python, then loaded into a MongoDB database. Using MongoDB, we created two main dataframes: one with information about the researchers and another with the published papers.
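The XML-to-JSON step and the split into researcher and paper records can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual pipeline: the sample CV is made up, the tag and attribute names (such as `NUMERO-IDENTIFICADOR` and `DADOS-BASICOS-DO-ARTIGO`) are assumptions modelled on the Lattes XML layout, and `element_to_dict` and `split_records` are hypothetical helper names.

```python
import json
import xml.etree.ElementTree as ET

# Made-up CV for illustration; tag and attribute names are assumptions.
SAMPLE_CV = """
<CURRICULO-VITAE NUMERO-IDENTIFICADOR="0000000000000001">
  <DADOS-GERAIS NOME-COMPLETO="Example Researcher"/>
  <PRODUCAO-BIBLIOGRAFICA>
    <ARTIGOS-PUBLICADOS>
      <ARTIGO-PUBLICADO>
        <DADOS-BASICOS-DO-ARTIGO TITULO-DO-ARTIGO="First paper" ANO-DO-ARTIGO="2015"/>
      </ARTIGO-PUBLICADO>
      <ARTIGO-PUBLICADO>
        <DADOS-BASICOS-DO-ARTIGO TITULO-DO-ARTIGO="Second paper" ANO-DO-ARTIGO="2018"/>
      </ARTIGO-PUBLICADO>
    </ARTIGOS-PUBLICADOS>
  </PRODUCAO-BIBLIOGRAFICA>
</CURRICULO-VITAE>
""".strip()


def element_to_dict(elem):
    """Recursively convert an XML element into a JSON-serialisable dict.

    Attributes are stored under "@name" keys so they cannot collide with
    child tags; children are grouped into lists keyed by their tag.
    """
    node = {"@" + key: value for key, value in elem.attrib.items()}
    for child in elem:
        node.setdefault(child.tag, []).append(element_to_dict(child))
    return node


def split_records(cv_xml):
    """Return one researcher record and a list of paper records for a CV."""
    root = ET.fromstring(cv_xml)
    researcher_id = root.get("NUMERO-IDENTIFICADOR")
    general = root.find("DADOS-GERAIS")
    researcher = {
        "id": researcher_id,
        "name": general.get("NOME-COMPLETO") if general is not None else None,
    }
    papers = []
    for basic in root.iter("DADOS-BASICOS-DO-ARTIGO"):
        year = basic.get("ANO-DO-ARTIGO")
        papers.append({
            "researcher_id": researcher_id,
            "title": basic.get("TITULO-DO-ARTIGO"),
            "year": int(year) if year and year.isdigit() else None,
        })
    return researcher, papers


# Full CV as a nested document, ready to be serialised as JSON.
document = element_to_dict(ET.fromstring(SAMPLE_CV))
cv_json = json.dumps(document, ensure_ascii=False)

# Flat records: one row per researcher, one row per paper.
researcher, papers = split_records(SAMPLE_CV)
```

In a setup like the one described, the nested `document` could be stored in a MongoDB collection (for example with pymongo's `insert_many`), and the flat record lists could be passed to `pandas.DataFrame` to build the researcher and paper tables. Both steps are left out here so the sketch runs without external dependencies.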
The CVs collected were those marked by the platform as belonging to researchers in mathematics, computer science, and probability & statistics. The total number of documents was 11,424, but some of them no longer existed by the time we collected them (or were missing for some reason).
We recommend starting with the Exploratory Analysis, so you can get a better understanding of the data itself, before diving into the network.