My fiancée is studying History at the renowned Sorbonne university. For her master's degree, she worked on the French Wars of Religion and the spread of Protestantism in France during the 16th century.
She was especially interested in how Protestant ideas propagated in France. Her idea was to analyse a massive corpus of 1,500+ letters (15,000 pages of PDFs) exchanged by French-speaking Protestant pastors in the early 16th century. Because of the amount of data to study, we decided to use my skills in Computer Science and AI to help with the analysis.
I first programmed an algorithm to parse the letters in the PDFs. (Note that the PDFs were computer-generated, not handwritten 16th-century texts!) The goal was to automatically split each letter's body from its metadata (date, sender, recipient, ...). The letters were collected by Aimé-Louis Herminjard in 1868, then typewritten during the 20th century. It goes without saying that the data quality policies of 18th- and 19th-century historians were absolutely awful. The parsing itself was a nightmare, but worth the effort given the amount of data to process.
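To give an idea of what this kind of parsing involves, here is a minimal Python sketch. It is not the code used in the project: the header layout (`number. sender à recipient, place, date`), the `HEADER_RE` pattern and the `split_letters` helper are all assumptions made for illustration, and it relies on a hand-tolerant pattern with the standard `re` module rather than true fuzzy matching.

```python
import re

# Hypothetical header layout, assumed for illustration only:
#   "123. Guillaume Farel à Jean Calvin, Neuchâtel, 12 mars 1538"
HEADER_RE = re.compile(
    r"""
    ^\s*(?P<number>\d{1,4})\s*[.,:]?\s+   # letter number, optional stray punctuation
    (?P<sender>.+?)\s+[àa]\s+             # sender, then the word a/à (accent sometimes lost)
    (?P<recipient>.+?)\s*,\s*             # recipient, up to the first comma
    (?P<place>[^,\d]+?)\s*,\s*            # place of writing
    (?P<date>\d{1,2}\s+\w+\.?\s+\d{4})    # day, month name, four-digit year
    \s*\.?\s*$
    """,
    re.VERBOSE | re.IGNORECASE,
)


def split_letters(lines):
    """Group lines into letters: a header line opens a new letter,
    every following non-empty line is appended to its body."""
    letters, current = [], None
    for line in lines:
        match = HEADER_RE.match(line)
        if match:
            current = {**match.groupdict(), "body": []}
            letters.append(current)
        elif current is not None and line.strip():
            current["body"].append(line.strip())
    return letters


sample = [
    "123. Guillaume Farel à Jean Calvin, Neuchâtel, 12 mars 1538",
    "Gratia et pax a Deo, frater carissime.",
    "  124 , Jean Calvin a Guillaume Farel , Genève , 3 août 1539.",
    "Accepi literas tuas.",
]
letters = split_letters(sample)
```

Each item of `letters` is a dictionary holding the metadata fields plus a `body` list. A real corpus would need a much more forgiving pattern and a manual review of the lines that fail to match.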
Just for fun, here is the fuzzy regex I used to detect letter headers:
Once the data-set was created, I conducted a series of analyses and visualizations at my fiancée's request:
Creating animated maps of the pastors' locations over time (so that she could get insights into how Protestantism spread through the pastors)
Finding the most important pastors of the time (with a modified version of the PageRank algorithm), to see whether some pastors were under-studied by historians, which was her intuition. That turned out to be true.
Generating important working documents for future historians (a "prosopography", as well as individual cards for each letter of the corpus) that can be found here.
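For readers curious about the ranking step, here is a minimal sketch of standard weighted PageRank on a correspondence graph, written in plain Python. It does not reproduce the modified version used in the project. The `pagerank` function, its parameters and the choice of one edge per letter from sender to recipient are assumptions made for illustration.

```python
from collections import defaultdict


def pagerank(letters, damping=0.85, iterations=100, tol=1e-10):
    """Standard weighted PageRank by power iteration.

    letters: iterable of (sender, recipient) pairs, one pair per letter.
    Each letter adds weight 1 to the edge sender -> recipient (an assumption).
    """
    weights = defaultdict(lambda: defaultdict(float))
    nodes = set()
    for sender, recipient in letters:
        weights[sender][recipient] += 1.0
        nodes.update((sender, recipient))
    if not nodes:
        return {}

    nodes = sorted(nodes)
    n = len(nodes)
    out_total = {v: sum(weights.get(v, {}).values()) for v in nodes}
    rank = {v: 1.0 / n for v in nodes}

    for _ in range(iterations):
        # Rank held by pastors who never wrote is spread over everyone.
        dangling = sum(rank[v] for v in nodes if out_total[v] == 0)
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for sender, targets in weights.items():
            for recipient, w in targets.items():
                new_rank[recipient] += damping * rank[sender] * w / out_total[sender]
        converged = sum(abs(new_rank[v] - rank[v]) for v in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank


example = [("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]
scores = pagerank(example)
most_central = max(scores, key=scores.get)
```

With this convention a pastor scores highly when he receives many letters from correspondents who are themselves well connected. Other conventions (reversing the edges, weighting by letter length) are equally plausible and would change the ranking.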
My fiancée's final report can be found here. She received the highest possible grade from the Sorbonne jury for this collaboration between a computer scientist and a historian.
I used this work as a student project for my data-science track at Télécom Paris. My data-set was chosen by a researcher from Télécom to be published in the school's official bank of datasets.
The data-set was automatically cleaned up (merging of near-identical names, correction of OCR errors, etc.) and then manually reviewed by my fiancée to catch errors that only a historian can spot.
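As an illustration of what merging close names can look like, here is a minimal sketch based on `difflib.SequenceMatcher` from the Python standard library. It is not the project's actual cleaning code: the `merge_close_names` helper, the greedy first-seen strategy and the 0.85 similarity threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse whitespace before comparing spellings."""
    return " ".join(name.lower().split())


def merge_close_names(names, threshold=0.85):
    """Map each spelling to the first-seen spelling it closely resembles.

    Greedy and order-dependent: the first spelling encountered becomes
    the canonical one for its group (an assumption of this sketch).
    """
    canonical = []  # first-seen spellings, kept in order
    mapping = {}
    for name in names:
        best, best_score = None, threshold
        for ref in canonical:
            score = SequenceMatcher(None, normalize(name), normalize(ref)).ratio()
            if score >= best_score:
                best, best_score = ref, score
        if best is None:
            canonical.append(name)
            best = name
        mapping[name] = best
    return mapping


spellings = [
    "Guillaume Farel",
    "Guillaume Farell",
    "Jean Calvin",
    "Jean Ca1vin",   # typical OCR confusion between "l" and "1"
    "Pierre Viret",
]
merged = merge_close_names(spellings)
```

A purely string-based merge like this can wrongly fuse two different people with similar names, which is exactly the kind of error the manual review by a historian is meant to catch.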
The animated maps of the pastors' journeys:
Importance of French pastors, computed with a modified version of PageRank:
My fiancée's report (in French):
My report on this collaboration for the Télécom data-science track: