Plataforma Lattes is a public website that hosts the Lattes CV of every researcher in the database. Anyone can access any CV, but one first has to solve a CAPTCHA, which makes extracting the documents through web scraping difficult. With some ingenuity and programming skill, we were able to work around this obstacle and extract all the documents we needed, a total of 11,423.
The raw data extracted from the website is an .xml file containing several nested fields. Before doing any cleaning, we needed to organise all these files in a way that would let us easily find and extract the relevant information. To speed up our queries and better deal with all these documents, we converted each of them into a .json file and stored them in a MongoDB database. MongoDB is a NoSQL database that is ideal for storing this type of data.
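The conversion step can be sketched as below. This is a minimal, stdlib-only illustration of turning nested XML into a JSON-like document; the sample XML snippet, tag names, and the commented-out MongoDB insert (via pymongo's `insert_one`) are assumptions for demonstration, not the exact pipeline.

```python
import json
import xml.etree.ElementTree as ET

def element_to_dict(el):
    """Recursively convert an XML element (attributes, children, text) to a dict."""
    node = {f"@{k}": v for k, v in el.attrib.items()}
    for child in el:
        # Group repeated child tags into lists, mirroring the nested structure
        node.setdefault(child.tag, []).append(element_to_dict(child))
    if el.text and el.text.strip():
        node["#text"] = el.text.strip()
    return node

# Hypothetical fragment in the style of a Lattes CV file
xml = ('<CURRICULO-VITAE DATA-ATUALIZACAO="01012020">'
       '<DADOS-GERAIS NOME-COMPLETO="Ana"/>'
       '</CURRICULO-VITAE>')
doc = element_to_dict(ET.fromstring(xml))
print(json.dumps(doc, ensure_ascii=False))

# With a running MongoDB instance, each dict could then be stored with pymongo:
# from pymongo import MongoClient
# MongoClient()["lattes"]["curriculos"].insert_one(doc)  # names are hypothetical
```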
Using MongoDB queries, we constructed two main dataframes to be used with Pandas in Python: one containing information about the researchers, and another with information about the published papers.
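A sketch of how query results become a dataframe, assuming pandas is available. In the real pipeline the records would come from a pymongo `find` call; here two hand-made records with hypothetical column names stand in for the query output.

```python
import pandas as pd

# In the actual pipeline the records come from a MongoDB query, e.g.:
# from pymongo import MongoClient
# coll = MongoClient()["lattes"]["curriculos"]        # hypothetical names
# records = list(coll.find({}, {"_id": 0}))
# Here we fake two records to show the shape of the resulting dataframe.
records = [
    {"NOME": "Ana", "TITULO": "Paper A", "ANO": "2019"},
    {"NOME": "Bruno", "TITULO": "Paper B", "ANO": "2020"},
]

# A list of flat dicts maps directly onto rows of a Pandas dataframe
papers = pd.DataFrame(records)
print(papers)
```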
Fortunately for us, the data stored in Plataforma Lattes is of very good quality, and not much cleaning was needed. We found a few duplicate CVs, but kept only the most recent ones by using the date the information was last updated ("@DATA-ATUALIZACAO"). The information that needed more attention was the titles of the published articles. Analysing the dataset, we noticed it was quite common for researchers to write the title of the same article slightly differently from one another. For example, one person would add a full stop at the end while another wouldn't. Therefore, we created a "key" for each article by taking its title, converting it to uppercase, and removing whitespace, accents and symbols. This key is in the column "CHAVE_ARTIGO" of the papers dataframe.
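The key construction described above can be sketched with Python's standard library; the function name `article_key` and the sample titles are our own illustration, not taken from the original code.

```python
import re
import unicodedata

def article_key(title):
    """Normalize an article title: uppercase, no accents, no spaces or symbols."""
    # NFKD decomposition splits accented letters into base letter + combining mark,
    # so dropping the combining marks strips the accents
    no_accents = "".join(
        c for c in unicodedata.normalize("NFKD", title)
        if not unicodedata.combining(c)
    )
    # Keep only letters and digits, then uppercase the result
    return re.sub(r"[^A-Za-z0-9]", "", no_accents).upper()

# Two spellings of the same (made-up) title collapse to one key
print(article_key("Análise de redes: um estudo."))  # -> ANALISEDEREDESUMESTUDO
print(article_key("Analise de Redes um estudo"))    # -> ANALISEDEREDESUMESTUDO
```

This makes the key insensitive to punctuation, casing and accent differences, which is exactly the kind of variation we observed between CVs.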
In the picture below, we have a representation of the first five rows of the dataframe created by querying published-paper information from MongoDB.
Here is a brief explanation of the content of each column:
Additionally, we included a column with the ID of each record for later checking in the database. Next, we present the dataframe we formed with the authors' information.
Below, there's a brief description of each column:
Then, with the dataframes in hand, we were able to start exploring...