This section lists job openings, profiles sought for ongoing projects, and other work proposals
within the Health Analytics group @ MOX and/or the Health Data Science Center @ HT.
Health Analytics @PolMi proposals
MD thesis on text mining for health data
1. Generative models for clinical documents
In the healthcare domain, a large amount of textual data is produced, which contains valuable information often missing from structured databases. However, due to privacy concerns, these data are almost never publicly released, not even in anonymized form, and access to them for research purposes is difficult. In recent years, several generative models for textual data have been proposed, from the well-known Generative Pretrained Transformer (GPT) to various adaptations of the Generative Adversarial Network (GAN) architecture, which had previously been used extensively to generate image data. There have been some attempts to use these models to generate medical documents, but many issues remain open, in particular the trade-off between diversity and fidelity with respect to the training data, and the privacy guarantees that these approaches can provide. This thesis aims to compare different state-of-the-art solutions for the generation of medical documents, identifying their limitations and developing improvements to overcome them.
References:
- Guan, Jiaqi, et al. "Generation of synthetic electronic medical record text." 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2018.
- Ive, Julia, et al. "Generation and evaluation of artificial mental health records for natural language processing." NPJ digital medicine 3.1 (2020): 69.
- Sánchez, David, and Montserrat Batet. "C‐sanitized: A privacy model for document redaction and sanitization." Journal of the Association for Information Science and Technology 67.1 (2016): 148-163.
MD thesis on frailty survival models for cancer subtyping
The thesis aims to develop a penalized frailty Cox model, starting from [paper1, paper2, R package], for the selection of relevant predictors of survival in cancer patients (cholangiocarcinoma).
A second possible project is the generalization of a survival clustering framework [paper, repository] using a penalized frailty Cox model, in order to account for the possible multi-center structure of the study design [paper].
Post-lauream or PhD candidate on NLP and text mining
Research Program: Automatic clinical information extraction from unstructured textual data of Italian Electronic Health Records
Texts from electronic health records (EHRs) are a plentiful source of clinical knowledge, containing valuable medical data that are regularly recorded and updated. Nevertheless, automatic information extraction is difficult in this context, as the texts are by their very nature unstructured, high-dimensional and incomplete, and may contain both random errors and systematic biases.
Within this challenging scenario, natural language processing (NLP) systems provide a fruitful framework for the automatic extraction of valuable knowledge from unstructured or semi-structured text. While several applications have showcased the promise of NLP systems in extracting accurate and timely information from English-language EHRs, advances in other languages are still very limited.
Motivated by the aforementioned problem, the successful candidate is expected to develop text mining algorithms to extract tabular information from unstructured textual data, with a particular focus on the Cartella Clinica Elettronica (CCE) associated with the Italian National Health System. The resulting pipelines will aid the identification of the covariates necessary for building epidemiological models with clinical endpoints. The research will focus on, but not be limited to, patients affected by Non-Small Cell Lung Cancer.
Health Data Science @HT proposals
MD thesis on CVD and COVID-19 based on the investigation of Healthcare Utilization Databases
The study aims to investigate three specific aspects: i) the risk of adverse cardiovascular events in the short, medium, and long term following SARS-CoV-2 infection, and how it differs from the risk associated with other infectious diseases; ii) how this risk changes in relation to vaccination and to the type of vaccine; and finally iii) an assessment of the impact on health and on resource consumption of patients affected by SARS-CoV-2.
The database of interest is the administrative database of the beneficiaries of Regione Lombardia.
MD thesis on multiple CVD scoring based on Healthcare Utilization Databases
The study aims to understand how to use secondary/administrative databases to build suitable cardiovascular risk indices.
PhD scholarship on Digital Twins, Epidemiology of maternal and child health