Using the Curriculum Lattes of all researchers whose study area is listed as Mathematics, Computer Science or Probability & Statistics, we created a co-authorship network. This is an undirected network: each researcher is a node, and an edge between two researchers means that they co-authored at least one paper. Each edge has a weight related to the number of papers the pair published together. In this project, we made little use of the edge weights and focused instead on the presence of links between researchers.
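As a minimal sketch of how such a weighted, undirected edge list could be assembled, assume each paper is available as a list of author names. The function name `build_coauthorship_edges` and the toy data are hypothetical, and the real Lattes parsing is not shown.

```python
from collections import Counter
from itertools import combinations

def build_coauthorship_edges(papers):
    """Map each unordered author pair to the number of papers they share."""
    weights = Counter()
    for authors in papers:
        # Deduplicate and sort so that (a, b) and (b, a) use the same key.
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return weights

# Hypothetical toy data: each inner list is the author list of one paper.
papers = [["Ana", "Bruno"], ["Ana", "Bruno", "Carla"], ["Dani"]]
edges = build_coauthorship_edges(papers)
```

A single-author paper such as the last one adds no edge, so its author would show up as an isolated node.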
Let's start with some basic statistics:
Total Nodes = 11419
Total Edges = 16164
Avg. Degree = 2.83
# Connected Components = 4531
The number of connected components tells us that the network is quite disconnected. The average degree is also low: on average, a researcher has co-authored with only about 2.8 other people in the network. Perhaps there is something else going on. An image, together with the degree distribution, can help us sort this out.
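The statistics above follow from simple counts. In an undirected graph the average degree is 2E/N, and 2 × 16164 / 11419 ≈ 2.83 agrees with the reported value. The sketch below computes the same quantities, plus the degree distribution, on a toy graph stored as a dict of neighbour sets. This representation and the function names are assumptions for illustration, not the code used in the project.

```python
from collections import Counter, deque

def connected_components(adj):
    """Return a list of node sets, one per connected component (BFS)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        comp, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    comp.add(nb)
                    queue.append(nb)
        components.append(comp)
    return components

def basic_stats(adj):
    n_nodes = len(adj)
    # Every undirected edge appears in two neighbour sets.
    n_edges = sum(len(nbs) for nbs in adj.values()) // 2
    avg_degree = 2 * n_edges / n_nodes
    return n_nodes, n_edges, avg_degree, len(connected_components(adj))

# Hypothetical toy graph: a triangle, a separate pair, and an isolated node.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
       "d": {"e"}, "e": {"d"}, "f": set()}
stats = basic_stats(adj)
# Degree distribution: how many nodes have each degree.
degree_counts = Counter(len(nbs) for nbs in adj.values())
```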
This image helps a lot in understanding the network as a whole. As we can see, there is a giant connected component (within it, we can get from any node to any other), while the rest of the network is disconnected, sometimes forming small clusters. When we look at the data itself, many of these nodes belong to people who didn't publish anything (or who created a Lattes CV but didn't fill in their publications).
The giant component is the "heart" of the network and where most of the interesting information is. Therefore, we extract this component and analyse it separately.
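Extracting the giant component amounts to finding the largest connected component and keeping only the edges inside it. A minimal sketch, again assuming a dict-of-sets graph (the helper name `giant_component` is hypothetical):

```python
def giant_component(adj):
    """Return the induced adjacency dict of the largest connected component."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        # Iterative depth-first search collecting one component.
        comp, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nb in adj[node]:
                if nb not in comp:
                    comp.add(nb)
                    stack.append(nb)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    # Keep only neighbours that are themselves in the component.
    return {node: adj[node] & best for node in best}

# Hypothetical toy graph: a triangle, a separate pair, and an isolated node.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
       "d": {"e"}, "e": {"d"}, "f": set()}
giant = giant_component(adj)
giant_edges = sum(len(nbs) for nbs in giant.values()) // 2
```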
Total Nodes = 6553
Total Edges = 15803
Avg. Degree = 4.82
# Connected Components = 1
The component contains around 57% of the nodes in the network and 98% of the edges. Once we keep only this giant component, the average degree almost doubles, from 2.83 to 4.82. Let's zoom in on the connected component:
Inside the giant component, the degree distribution is also very skewed: most nodes have only a few links, while some have almost 80. From this image alone we cannot see much structure, but some of the highly connected nodes seem to band together... though this might be too strong an assumption.
Another measure we can look at is the clustering coefficient. This coefficient captures how connected the neighbours of a given node are: a high value indicates that the node's neighbourhood is densely interconnected. In other words, the clustering coefficient of a node is the probability that two of its neighbours, picked at random, are connected to each other.
Avg. Clustering = 0.35
Median Clustering = 0.22
In this case, the average clustering is quite deceiving. Our network has lots of nodes with a clustering of 0. These are the peripheral nodes that can be seen in the image of the giant component. Although we have lots of 0's, we also have lots of 1's! So there is quite a bit of local clustering in the network.
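To make the definition concrete, here is a minimal sketch of the local clustering coefficient: the fraction of pairs of a node's neighbours that are themselves linked. It assumes the common convention that nodes with fewer than two neighbours get a value of 0, which would explain why peripheral nodes contribute zeros. The function name and toy graph are hypothetical.

```python
from itertools import combinations
from statistics import mean, median

def local_clustering(adj, node):
    """Fraction of neighbour pairs of `node` that are connected to each other."""
    nbs = adj[node]
    k = len(nbs)
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than 2 neighbours
    links = sum(1 for u, v in combinations(nbs, 2) if v in adj[u])
    return 2 * links / (k * (k - 1))

# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
values = {node: local_clustering(adj, node) for node in adj}
avg_clustering = mean(values.values())
median_clustering = median(values.values())
```

Here `a` and `b` score 1, `d` scores 0, and `c` scores 1/3, so even this tiny graph mixes 0's and 1's in the way described above.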
An interesting question is: how far apart are the math researchers from one another? This is known as the degrees of separation. A common adage is that the world has 6 degrees of separation, that is, any two of us are linked by a chain of about 6 people.
How does this hold for our giant connected component? The answer in our case is 7.
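A minimal sketch of how such a figure could be obtained, assuming that "degrees of separation" here means the average shortest-path length over all pairs in a connected graph (the text does not say whether the reported 7 is an average or a maximum, so both are computed). Function names and the toy graph are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path length (in edges) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def separation_stats(adj):
    """Average and maximum shortest-path length of a connected graph."""
    total, pairs, longest = 0, 0, 0
    for source in adj:
        dist = bfs_distances(adj, source)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
        longest = max(longest, max(dist.values()))
    return total / pairs, longest

# Hypothetical toy graph: a path a - b - c - d.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
average, diameter = separation_stats(adj)
```

Running one breadth-first search per node is quadratic in the worst case, but it should remain practical for a component of a few thousand nodes.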
The Erdos number describes the "collaborative distance" between mathematician Paul Erdos and another researcher, measured using the co-authorship as in our network.
So why not create a Brazilian version? In Brazil, Elon Lages Lima is a prominent mathematician, but he is completely different from Erdos: Erdos is known for having published an absurd number of papers and having collaborated with an extraordinary number of people. Elon is the precise opposite, having co-authored with only one person (Manfredo Perdigão). Therefore, perhaps Elon is not the "right" person to build our metric around. But we (the ones doing this project) quite enjoyed his book on Real Analysis, so we'll go with him.
Below, we extract the shortest distance from Elon to each researcher in the network. Note that he is in the giant component, so all the researchers that are not in the component will be assigned a distance of 0.
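A minimal sketch of this step, assuming a breadth-first search from Elon over a dict-of-sets graph. Only the Elon–Manfredo link comes from the text; the other nodes (`r1`, `r2`, `isolated`) and the function name are hypothetical.

```python
from collections import deque

def distances_from(adj, source, unreachable=0):
    """Shortest distance from `source` to every node; unreachable nodes get a default."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    # Nodes never reached by the search fall back to the default value.
    return {node: dist.get(node, unreachable) for node in adj}

adj = {
    "Elon": {"Manfredo"},
    "Manfredo": {"Elon", "r1"},
    "r1": {"Manfredo", "r2"},
    "r2": {"r1"},
    "isolated": set(),  # outside the giant component
}
elon_number = distances_from(adj, "Elon")
```

With 0 as the default, Elon himself and every unreachable researcher share the same value. Passing `unreachable=None` would keep the two cases apart if that distinction matters.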
Note that the distribution has lots of zeros, corresponding to the nodes outside the giant component. Once the zeros are discarded, the distribution looks roughly Gaussian, centered around 8, which is about the average degree of separation plus 1. This makes sense, since to get to Elon one must first go through Manfredo.
Now that we have some understanding of the overall behaviour of the network, let's put some names on these nodes. Who are the researchers with the highest degree, that is, the most connections? Would you believe me if I told you that the most connected researcher is none other than... Gauss?!
Next, we measure the centrality of each node using a metric called betweenness. Let's find out who are the most "central" researchers in the network, those that join different areas together and help create this giant component.
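Betweenness counts, for each node, how many shortest paths between other pairs of nodes pass through it, splitting the credit when a pair has several shortest paths. The sketch below uses Brandes' algorithm for an unweighted, undirected graph and returns unnormalised scores. It is an illustration under these assumptions, not necessarily the routine used in the project, and the toy graph is hypothetical.

```python
from collections import deque

def betweenness(adj):
    """Unnormalised betweenness centrality (Brandes' algorithm, unweighted graph)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back towards s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}

# Hypothetical toy graph: two triangles joined through a bridge node d.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d", "f", "g"}, "f": {"e", "g"}, "g": {"e", "f"},
}
centrality = betweenness(adj)
most_central = max(centrality, key=centrality.get)
```

In the toy graph the bridge node `d` scores highest even though its degree is only 2, which is the kind of "joining different areas" role the measure is meant to reveal.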
The researchers' areas of study are divided into "Ciência da Computação", "Matemática" and "Probabilidade & Estatística". How are the areas connected? Do researchers from the same area group together? Most likely!
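One simple way to put a number on this question is the share of edges whose two endpoints belong to the same area. This measure is an assumption added for illustration, since the analysis here relies on a plot coloured by area. The function name and toy data are hypothetical.

```python
def same_area_share(edges, area):
    """Fraction of edges whose two endpoints share the same study area."""
    same = sum(1 for u, v in edges if area[u] == area[v])
    return same / len(edges)

# Hypothetical toy data.
area = {
    "a": "Matemática", "b": "Matemática",
    "c": "Ciência da Computação", "d": "Ciência da Computação",
    "e": "Probabilidade & Estatística",
}
edges = [("a", "b"), ("c", "d"), ("b", "c"), ("d", "e")]
share = same_area_share(edges, area)
```

A share well above what random mixing of areas would produce suggests that researchers mostly collaborate within their own area.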
Using the year of each researcher's first published paper and the year of the first paper shared by each pair of co-authors, we can explore how the network evolved over time.
The graph above is pretty interesting! It shows that the number of nodes grows gradually, while the edges take some time to take off and, by around 2011, surpass the number of nodes. This is somewhat expected: when the number of nodes is small, few connections are expected. What was not expected is that the network seems to be "maturing", with the number of nodes starting to plateau.
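The growth curves discussed above are cumulative counts: for each year, how many nodes (or edges) had already appeared. A minimal sketch, assuming the first-appearance year of each node and edge is already known (the names and years below are hypothetical toy data):

```python
from collections import Counter
from itertools import accumulate

def cumulative_by_year(first_years):
    """Map each year to the number of items that first appeared in or before it."""
    counts = Counter(first_years)
    years = sorted(counts)
    totals = accumulate(counts[y] for y in years)
    return dict(zip(years, totals))

# Hypothetical toy data: year of first paper per node, first shared paper per edge.
node_first_year = {"a": 2001, "b": 2001, "c": 2005, "d": 2010}
edge_first_year = {("a", "b"): 2003, ("a", "c"): 2005, ("b", "c"): 2005,
                   ("c", "d"): 2010, ("a", "d"): 2010}
nodes_over_time = cumulative_by_year(node_first_year.values())
edges_over_time = cumulative_by_year(edge_first_year.values())
```

In this toy example the edges start behind the nodes and overtake them in the last year, the same qualitative pattern as in the real network.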
Below is a video of the network evolving over time.
Finally, we create a graph using Hierarchical Edge Bundling. The idea is to bundle adjacent edges together, reducing the clutter usually observed in complex networks. Again, we isolate the giant connected component to generate the visual. The graph below shows what we inferred from the clustering coefficient, that the network contains many highly local clusters, but it also reveals more. The network is asymmetric: the "left-downward" portion of the image has far fewer connections to the rest of the network, while the "right-downward" portion is more densely connected.
To summarise, our network is composed of one giant component and many nodes that form no connections or only a small number of them. When we looked deeper into the giant component, we discovered that it is "hairy" (it has many nodes with a single connection sticking out of it), but it is also made up of many small local clusters, which can be seen in both the distribution of the clustering coefficient and the hierarchical edge bundling plot. The hierarchical plot also shows that part of the sub-network is more densely interconnected. When plotting the graph by main area of study, we can clearly see that the areas tend to stick together, which was expected. Also, for some reason that we don't yet know, the network seems to be reaching a plateau in the number of nodes, and perhaps even in the number of edges, though the latter is less clear.
Finally, we leave you with an image of the Minimum Spanning Tree of the giant component.
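For reference, a minimum spanning tree keeps just enough edges to connect every node while minimising the total edge weight. The sketch below uses Kruskal's algorithm with a union-find structure. Which weight was used for the image is not stated, so the weights, the function name and the toy graph here are hypothetical.

```python
def minimum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm: return the edges of a minimum spanning tree."""
    parent = {v: v for v in nodes}

    def find(v):
        # Follow parent links to the root, halving the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two different trees, so keep it
            parent[root_u] = root_v
            tree.append((u, v, w))
    return tree

# Hypothetical toy graph with made-up weights.
nodes = ["a", "b", "c", "d"]
weighted_edges = [("a", "b", 1), ("b", "c", 2), ("a", "c", 3), ("c", "d", 1)]
tree = minimum_spanning_tree(nodes, weighted_edges)
```

For a connected graph with N nodes the result always has N - 1 edges, so for the giant component it would have 6552.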