The first big challenge of this project was obtaining the data and organising it so that the relevant information could be extracted. Although the data is publicly available, more than 11,000 documents (~3.3 GB) had to be downloaded, and CAPTCHAs stood in the way. Fortunately, we were able to overcome this difficulty with some ingenuity.
The objectives at the beginning of this project were to explore the data thoroughly and then build a co-authorship network to analyse the links between mathematics researchers. Both objectives were accomplished:
- The data exploration answered our main questions, such as the total number of researchers, the total number of publications, trending areas of research, the growth in the number of researchers over time, and many others;
- We created the co-authorship network, and it provided considerable insight into how researchers collaborate. Using tools from network science, we analysed the graph from many perspectives and came to an understanding of its overall structure.
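As an illustration of the idea behind a co-authorship network, here is a minimal sketch using only the Python standard library. It is not the code used in this project: the `publications` list, the author names, and the helper functions `coauthorship_edges` and `degrees` are all hypothetical. It assumes each publication is available as a list of author names, links every pair of authors who share a publication, and counts shared publications as the edge weight.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample data: each publication is the list of its authors.
publications = [
    ["Ana", "Bruno", "Carla"],
    ["Ana", "Bruno"],
    ["Carla", "Diego"],
    ["Eva"],  # a single-author paper adds no edge
]


def coauthorship_edges(pubs):
    """Count how many publications each pair of authors shares."""
    edges = Counter()
    for authors in pubs:
        # Sort and deduplicate so (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges


def degrees(edges):
    """Number of distinct co-authors per researcher."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


edges = coauthorship_edges(publications)
deg = degrees(edges)
print(edges.most_common(1))  # the most frequent collaboration
print(deg.most_common(1))    # the researcher with the most co-authors
```

The weighted edge list produced this way could then be loaded into a graph library for measures such as centrality or community detection.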
The data we collected is very rich, and much more information and insight can be obtained from it. Here are some other analyses that we may do in the future:
- Create a network from participation in research projects. The data contains the research projects in which people are engaged. Therefore, one can create a new network based on who is actually working with whom, instead of using co-authorship;
- Topic modelling of the different research projects. The research projects have a field called "Description", where people give a brief overview of what the project is about. Therefore, one can apply NLP to try to extract common topics among the different projects;
- Enrich the dataset and do more exploration. Gather information such as the number of citations and the level of prestige of each journal to create a more robust metric for evaluating the academic performance of each researcher.
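To give a feel for what the text analysis of project descriptions might look like, here is a rough sketch using only the Python standard library. It does not implement a full topic model; it uses TF-IDF keyword extraction as a simpler stand-in that surfaces the terms most specific to each description. The sample `descriptions`, the `STOPWORDS` set, and the functions `tokenize` and `top_terms` are all hypothetical, and real descriptions would come from the "Description" field.

```python
import math
import re
from collections import Counter

# Hypothetical project descriptions, invented for illustration only.
descriptions = [
    "Spectral methods for partial differential equations",
    "Numerical methods for stochastic differential equations",
    "Graph colouring and extremal combinatorics",
]

# A tiny, assumed stopword list; a real analysis would use a fuller one.
STOPWORDS = {"for", "and", "the", "of", "in", "a"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def top_terms(docs, k=2):
    """Return the k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many descriptions each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    result = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        }
        # Sort by score (descending), then alphabetically to break ties.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result.append(ranked[:k])
    return result


for text, terms in zip(descriptions, top_terms(descriptions)):
    print(terms, "<-", text)
```

Terms shared by several descriptions (such as "methods" here) score lower than terms unique to one description, which is the property that makes the ranking useful as a first look before fitting a proper topic model.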