Plataforma Lattes is a public website that hosts the Lattes CV of every researcher in the database. Anyone can access any CV, but one first has to solve a CAPTCHA, which makes extracting the documents through web scraping difficult. With some ingenuity and programming skill, we were able to work around this obstacle and extract all the documents we needed, a total of 11,423.
The raw data extracted from the website is an .xml file containing several nested fields. Before doing any cleaning, we needed to organise all these files in a way that would let us easily find and extract the relevant information. To speed up our queries and better deal with all these documents, we converted each of them into a .json file and stored them in a MongoDB database. MongoDB is a NoSQL database that is ideal for storing this type of data.
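The conversion step can be sketched as below. This is a minimal, stdlib-only illustration of turning nested XML into a JSON-like document; the sample XML snippet, tag names, and the commented-out MongoDB insert (via pymongo's `insert_one`) are assumptions for demonstration, not the exact pipeline.

```python
import json
import xml.etree.ElementTree as ET

def element_to_dict(el):
    """Recursively convert an XML element (attributes, children, text) to a dict."""
    node = {f"@{k}": v for k, v in el.attrib.items()}
    for child in el:
        # Group repeated child tags into lists, mirroring the nested structure
        node.setdefault(child.tag, []).append(element_to_dict(child))
    if el.text and el.text.strip():
        node["#text"] = el.text.strip()
    return node

# Hypothetical fragment in the style of a Lattes CV file
xml = ('<CURRICULO-VITAE DATA-ATUALIZACAO="01012020">'
       '<DADOS-GERAIS NOME-COMPLETO="Ana"/>'
       '</CURRICULO-VITAE>')
doc = element_to_dict(ET.fromstring(xml))
print(json.dumps(doc, ensure_ascii=False))

# With a running MongoDB instance, each dict could then be stored with pymongo:
# from pymongo import MongoClient
# MongoClient()["lattes"]["curriculos"].insert_one(doc)  # names are hypothetical
```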
Using MongoDB queries, we constructed two main dataframes to be used with Pandas in Python: one containing information about the researchers, and another with information about the published papers.
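A sketch of how query results become a dataframe, assuming pandas is available. In the real pipeline the records would come from a pymongo `find` call; here two hand-made records with hypothetical column names stand in for the query output.

```python
import pandas as pd

# In the actual pipeline the records come from a MongoDB query, e.g.:
# from pymongo import MongoClient
# coll = MongoClient()["lattes"]["curriculos"]        # hypothetical names
# records = list(coll.find({}, {"_id": 0}))
# Here we fake two records to show the shape of the resulting dataframe.
records = [
    {"NOME": "Ana", "TITULO": "Paper A", "ANO": "2019"},
    {"NOME": "Bruno", "TITULO": "Paper B", "ANO": "2020"},
]

# A list of flat dicts maps directly onto rows of a Pandas dataframe
papers = pd.DataFrame(records)
print(papers)
```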
Fortunately for us, the data stored in Plataforma Lattes is of very good quality, and not much cleaning was needed. We found a few duplicate CVs, but kept only the most recent ones by using the date the information was last updated ("@DATA-ATUALIZACAO"). The information that needed more attention was the titles of the published articles. Analysing the dataset, we noticed it was quite common for researchers to write the title of the same article slightly differently from one another. For example, one person would add a full stop at the end while another wouldn't. Therefore, we created a "key" for each article by taking its title, converting it to uppercase, and removing whitespace, accents and symbols. This key is in the column "CHAVE_ARTIGO" of the papers dataframe.
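The key construction described above can be sketched with Python's standard library; the function name `article_key` and the sample titles are our own illustration, not taken from the original code.

```python
import re
import unicodedata

def article_key(title):
    """Normalize an article title: uppercase, no accents, no spaces or symbols."""
    # NFKD decomposition splits accented letters into base letter + combining mark,
    # so dropping the combining marks strips the accents
    no_accents = "".join(
        c for c in unicodedata.normalize("NFKD", title)
        if not unicodedata.combining(c)
    )
    # Keep only letters and digits, then uppercase the result
    return re.sub(r"[^A-Za-z0-9]", "", no_accents).upper()

# Two spellings of the same (made-up) title collapse to one key
print(article_key("Análise de redes: um estudo."))  # -> ANALISEDEREDESUMESTUDO
print(article_key("Analise de Redes um estudo"))    # -> ANALISEDEREDESUMESTUDO
```

This makes the key insensitive to punctuation, casing and accent differences, which is exactly the kind of variation we observed between CVs.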
In the picture below, we have a representation of the first five rows of the dataframe created by querying published-paper information from MongoDB.
Here is a brief explanation of the content of each column:
Additionally, we included a column with the ID of each record for later checking in the database. Next, we present the dataframe we formed with the authors' information.
Below, there's a brief description of each column:
Then, with the dataframes in hand, we were able to start exploring...