Disclaimer: The views expressed on this website are my own and do not necessarily reflect those of the Banco de España or the Eurosystem.
Since the start of the Russian invasion on February 24, 2022, I have been interested in understanding how the tremendous quantity of information, both official and open-source, published about this major geopolitical event can be digested by the average person who follows the news closely.
This project therefore aims to track information published by a variety of sources so that the end user can get a bird's-eye view of the situation.
The datasets can be downloaded from my Kaggle page (updated weekly).
To obtain actionable data, I leverage three main techniques: web scraping, text mining, and user emulation via Java robots. These techniques allow me to automate the data collection. More information about them can be found on their dedicated page.
Each source requires a specific set of tools, but the collection mostly relies on web scraping and on emulating a human user to avoid detection. As with any web scraping project, the quality of the tool depends on (i) the quality of the code on the webpage and (ii) the consistency of the page structure over time. The collection process is implemented in MATLAB.
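To make the human-emulation idea concrete: a scraper typically sends browser-like headers and waits a randomized, human-plausible interval between page loads. The pipeline itself is in MATLAB; the sketch below is only an illustrative Python equivalent, and the user-agent string and URL are placeholders.

```python
import random
import time
import urllib.request

# Browser-like headers so the request does not look like a bare script.
# The user-agent string below is an illustrative placeholder.
BROWSER_HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def build_request(url):
    """Prepare a GET request that carries browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)

def human_pause(min_s=2.0, max_s=6.0):
    """Wait a random, human-plausible interval between page loads."""
    time.sleep(random.uniform(min_s, max_s))

req = build_request("https://example.org/daily-brief")  # placeholder URL
print(req.get_header("User-agent"))
```

Randomizing the delay matters as much as the headers: perfectly regular request timing is itself a telltale sign of automation.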
I use five sources for this project:
Russian Ministry of Defense
Scraping the Russian Ministry of Defense (MoD) is fairly easy since the website is not protected against webread/websave requests. The Russian MoD publishes a daily report on the, quote, “progress of the special military operation”. It is precisely this consistency in the naming of the daily briefs that allows me to download the HTML code. The last paragraph of text reports the cumulative losses of the Ukrainian Armed Forces.
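As an illustration of this step (the actual pipeline is in MATLAB, so this is only a Python sketch of the idea): build the brief's URL from the date, then pull the number–noun pairs out of the last paragraph of the page. The URL pattern, sample HTML, and loss categories below are all placeholders, not the real mil.ru layout.

```python
import re
from html.parser import HTMLParser

def brief_url(day):
    """Date-stamped URL of the daily brief. Placeholder pattern, not the real path."""
    return f"https://example.mil.ru/briefs/{day:%Y-%m-%d}.htm"

class Paragraphs(HTMLParser):
    """Collect the text of every <p>; the last one holds the cumulative losses."""
    def __init__(self):
        super().__init__()
        self._in_p = False
        self.paragraphs = []
    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")
    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False
    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1] += data

def cumulative_losses(html):
    parser = Paragraphs()
    parser.feed(html)
    last = parser.paragraphs[-1] if parser.paragraphs else ""
    # Pair each "<number> <noun>" occurrence, e.g. "415 aircraft".
    return {kind: int(n) for n, kind in re.findall(r"(\d+)\s+([a-z]+)", last)}

sample = ("<p>Progress report.</p>"
          "<p>In total, 415 aircraft and 230 helicopters were destroyed.</p>")
print(cumulative_losses(sample))  # {'aircraft': 415, 'helicopters': 230}
```

The key assumption, as noted above, is the stability of the URL naming scheme: once it changes, the whole series has to be re-mapped.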
Ukrainian Ministry of Defense
The Ukrainian MoD is protected against headless browsers. To circumvent this limitation, I emulate human behavior from MATLAB with the help of a Java robot. Once I have obtained the HTML code, I capture the data that appears in a bullet list.
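The robot-driven browsing itself (a Java robot replaying keystrokes and clicks from MATLAB) is not shown here. Assuming the resulting HTML has been saved, the bullet-list extraction could be sketched in Python as below; the item labels and the dash separator are illustrative, not the Ukrainian MoD's actual format.

```python
import re
from html.parser import HTMLParser

class BulletItems(HTMLParser):
    """Collect the text of each <li> item in the daily-losses bullet list."""
    def __init__(self):
        super().__init__()
        self._in_li = False
        self.items = []
    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self.items.append("")
    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False
    def handle_data(self, data):
        if self._in_li:
            self.items[-1] += data

def parse_losses(html):
    """Map each bullet's label to its count, e.g. 'tanks — 3000' -> 3000."""
    parser = BulletItems()
    parser.feed(html)
    out = {}
    for item in parser.items:
        m = re.match(r"\s*(.+?)\s*[-–—]\s*(\d+)", item)
        if m:
            out[m.group(1)] = int(m.group(2))
    return out

sample = "<ul><li>tanks — 3000</li><li>artillery systems — 2500</li></ul>"
print(parse_losses(sample))  # {'tanks': 3000, 'artillery systems': 2500}
```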
ORYX
Unlike the two previous sources, ORYX does not keep a history of its reporting but presents only the latest available data. To build a real-time dataset, I use the Wayback Machine, which crawls and archives webpages on a regular basis. The data treatment is more extensive in this case given the level of detail about the equipment. Considerable cleaning is also involved, since the manual reporting on ORYX is error-prone.
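The Wayback Machine exposes its snapshot index through the public CDX API, which makes this approach straightforward to sketch. The Python fragment below (an illustration, not the MATLAB implementation) builds a query collapsing snapshots to one per day and turns a CDX response into replayable snapshot URLs; the sample response is fabricated for demonstration.

```python
import json
from urllib.parse import urlencode

CDX = "https://web.archive.org/cdx/search/cdx"

def cdx_query(target, start, end):
    """CDX query URL: one OK snapshot per day (collapse on YYYYMMDD)."""
    params = {"url": target, "output": "json", "from": start, "to": end,
              "filter": "statuscode:200", "collapse": "timestamp:8"}
    return CDX + "?" + urlencode(params)

def snapshot_urls(cdx_json):
    """Turn a CDX JSON response (first row = header) into replay URLs."""
    rows = json.loads(cdx_json)
    header, entries = rows[0], rows[1:]
    ts, orig = header.index("timestamp"), header.index("original")
    return [f"https://web.archive.org/web/{r[ts]}/{r[orig]}" for r in entries]

# Fabricated sample response, shaped like a real CDX reply.
sample = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/", "20220301120000", "https://example.com/losses",
     "text/html", "200", "ABC", "1000"],
])
print(snapshot_urls(sample))
```

Collapsing on the first eight timestamp digits is what turns an irregular crawl history into an (at most) daily series, which can then be aligned with the other sources.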
The Killed in Ukraine project
Open-source intelligence officers crawl Russian social media and newspapers, searching for confirmation of an officer's name and rank. Both the date of publication and the estimated date of death are provided, so that a real-time dataset can easily be created.
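Having both dates is what makes the dataset "real-time": for any as-of date, one can count only the deaths that had already been published by then. A minimal Python sketch (illustrative data and field layout, not the project's actual schema):

```python
from datetime import date

# Each record pairs the estimated date of death with the date the death
# was published. The records below are fabricated for illustration.
records = [
    (date(2022, 3, 1), date(2022, 3, 4)),
    (date(2022, 3, 1), date(2022, 4, 15)),
    (date(2022, 3, 2), date(2022, 3, 6)),
]

def vintage(records, as_of):
    """Real-time view: count, by death date, only deaths published by `as_of`."""
    counts = {}
    for death, published in records:
        if published <= as_of:
            counts[death] = counts.get(death, 0) + 1
    return counts

# The April publication is not yet visible in the end-of-March vintage.
print(vintage(records, date(2022, 3, 31)))
```

Comparing successive vintages this way also reveals how far back revisions reach, i.e. how long confirmations typically lag the death itself.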
The UALosses project
Open-source intelligence officers crawl social media posts and obituaries, searching for confirmation of a Ukrainian soldier's death.
Notes: Junior Officers are defined as Junior Lieutenant, Lieutenant, Senior Lieutenant, and Captain. Senior Officers are defined as Major, Lieutenant Colonel, and Colonel. Supreme and General Officers are defined as Major General, Lieutenant General, Colonel General, Army General, and Marshal of the Russian Federation. The vertical bars represent the corresponding percentile of the distribution of the duration between the date of death and the date of publication of death, where a higher percentile indicates a higher degree of trustworthiness of the reported data.
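The percentile measure described in the notes can be sketched as follows: compute the lag in days between each estimated date of death and the corresponding publication date, then ask what share of observed lags a given duration covers. This is a Python illustration on fabricated records, not the project's code.

```python
from datetime import date

def percentile_rank(lags, value):
    """Percentage of observed lags at or below `value` (in days)."""
    return 100.0 * sum(1 for lag in lags if lag <= value) / len(lags)

# Fabricated (death, publication) pairs; lags come out as 3, 9, and 30 days.
lags = sorted((pub - death).days for death, pub in [
    (date(2022, 3, 1), date(2022, 3, 4)),
    (date(2022, 3, 2), date(2022, 3, 11)),
    (date(2022, 3, 3), date(2022, 4, 2)),
])
print(lags, percentile_rank(lags, 9))
```

A duration sitting at a high percentile means most reports were published at least that quickly, which is why a higher percentile signals more trustworthy data.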
Important note: as of February 2025, the Russian Ministry of Defense website is no longer accessible for scraping. These series are therefore discontinued.