Please see the two project themes below. Please note that these are ideas for different projects: Theme 1 and Theme 2 are two separate themes.
Theme 1: Building a knowledge graph of research skills and strengths across our University, and using it to enhance LLM answering capabilities.
Theme 2: Multiple projects centred around the concept of Data-centric AI and end-to-end explainability (from models back to data)
Theme 1: We are building an ambitious knowledge graph describing our research strengths across our entire University and the full range of our academic disciplines. We want to make the KG available to genAI models to enhance our internal and external research capabilities. This is a collaboration between Computer Science and the Institute for Data and AI in Birmingham.
Background. To be successful in areas of research that are increasingly data-driven, AI-powered, and interdisciplinary, researchers need to work in teams that cover a broad range of research, academic, and professional expertise. One common problem is that researchers do not normally have good visibility of research excellence beyond their immediate areas, with the possible exception of direct collaborators. This limits the types of cross-discipline collaborations that could potentially lead to the joint development of powerful new ideas.
The premise of this project is that we can use genAI in combination with KG technology to overcome this limitation.
For this reason, the Institute for Data and AI aims to create a knowledge-based system that (a) is aware of "who does what" within the University, by extracting data about our research excellence as demonstrated by publication records, impact stories, and our history of funding successes; and (b) is capable of suggesting new networks of academics, possibly at diverse career stages and certainly across all our Colleges, who have the potential to address major research challenges together by exploiting synergies between their expertise.
This is an ambitious project that will cover the entire data-to-knowledge trajectory, requiring (a) data extraction and harvesting from some of the University's internal information systems, as well as from publication repositories; (b) creating Knowledge Graphs that capture the extracted information in a structured way; (c) using LLMs to extract topic- and research-specific information from publications; and (d) demonstrating how foundation models (e.g. LLMs) and Graph Neural Networks can exploit this knowledge to provide competent and effective answers to a variety of queries regarding research excellence across our entire research community.
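To make steps (b) and the collaboration-suggestion goal concrete, here is a minimal sketch of the kind of structure involved. The researcher names, topics, and the networkx representation are illustrative assumptions, not the project's actual data model or tooling.

```python
# Minimal sketch: a toy expertise graph and a naive collaboration suggester.
# Names, topics, and the networkx representation are illustrative only.
import networkx as nx

G = nx.Graph()
# Researcher -> topic edges, which in the real project would be extracted from
# publication records, impact stories, and funding histories.
expertise = {
    "Dr. A (Computer Science)": ["knowledge graphs", "LLMs"],
    "Dr. B (Medicine)": ["clinical trials", "LLMs"],
    "Dr. C (Social Sciences)": ["survey methods", "knowledge graphs"],
}
for researcher, topics in expertise.items():
    for topic in topics:
        G.add_edge(researcher, topic)

def suggest_collaborators(researcher):
    """Suggest researchers who share at least one topic with the given researcher."""
    topics = set(G.neighbors(researcher))
    candidates = {
        other
        for topic in topics
        for other in G.neighbors(topic)
        if other != researcher
    }
    return sorted(candidates)

print(suggest_collaborators("Dr. A (Computer Science)"))
# ['Dr. B (Medicine)', 'Dr. C (Social Sciences)']
```

The actual project would replace this toy graph with a full Knowledge Graph, and replace the naive neighbour lookup with GNN- and LLM-based reasoning as described in point (d).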
Project. Please contact me for further details on how the problem outlined above can be addressed. This is most likely going to be team work requiring multiple students with data engineering and AI ambitions.
Theme 2: Data-centric AI explainability
Context. Research around Explainable AI (XAI) is interpreted differently depending on the type of model, data, and application. In general, however, it has focused primarily on explaining model inference (see e.g. LIME, or occlusion testing for images), with relatively little attention given to linking the inference back to the data used for training.
This is changing, with recent advances in the area of so-called "Data-Centric AI". For example, concepts underpinning data valuation, such as Influence Functions [WFW+20] and, more recently, the Average Marginal Effect (AME) [LLZ+22], help pinpoint the specific data points in the training set that are most responsible for a given inference. Other, relatively older concepts like Explanation Tables [EF+18] have also been "resurrected" with the aim of providing data-centric explanations.
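To illustrate the underlying idea (this is a minimal leave-one-out sketch, not the influence-function or AME estimators from the cited papers), one can score each training point by how much its removal changes the model's confidence on a single test prediction. The dataset and model below are placeholder choices.

```python
# Minimal sketch of data valuation via leave-one-out retraining.
# This illustrates the idea of linking an inference back to training data;
# it is NOT the estimators proposed in [WFW+20] or [LLZ+22].
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
x_test = X[-1:]                  # held-out point whose prediction we want to explain
all_idx = np.arange(len(X) - 1)  # training set excludes the test point

def confidence(train_idx):
    """Probability of class 1 for x_test after training on the given subset."""
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    return model.predict_proba(x_test)[0, 1]

baseline = confidence(all_idx)

# Score each training point by the change in confidence when it is left out.
scores = np.array([
    baseline - confidence(np.delete(all_idx, i))
    for i in range(len(all_idx))
])

# The highest-scoring points are those "most responsible" for the prediction.
print(np.argsort(-np.abs(scores))[:5])
```

Exhaustive retraining like this does not scale, which is precisely why techniques such as Influence Functions and AME approximate these scores more efficiently.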
References
[WFW+20] Wu, Weiyuan, Lampros Flokas, Eugene Wu, and Jiannan Wang. ‘Complaint-Driven Training Data Debugging for Query 2.0’. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 1317–34, 2020. https://dl.acm.org/doi/abs/10.1145/3318464.3389696.
[LLZ+22] Lin, Jinkun, Anqi Zhang, Mathias Lécuyer, Jinyang Li, Aurojit Panda, and Siddhartha Sen. ‘Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments’. In Proceedings of the 39th International Conference on Machine Learning, 13468–504. PMLR, 2022. https://proceedings.mlr.press/v162/lin22h.html.
[EF+18] El Gebaly, Kareem, Guoyao Feng, Lukasz Golab, Flip Korn, and Divesh Srivastava. ‘Explanation Tables’. Sat 5 (2018): 14.
Starting point:
We have been developing a provenance-based tool for capturing the derivations of data through Python/Pandas scripts:
https://github.com/pasqualeleonardolazzaro/PROLIT
The tool is described in this recent presentation: https://www.slideshare.net/slideshow/design-and-development-of-a-provenance-capture-platform-for-data-science/268330702
and in this paper:
Design and Development of a Provenance Capture Platform for Data Science. Gregori, L.; Missier, P.; Stidolph, M.; Torlone, R.; and Wood, A. In Procs. 3rd DATAPLAT workshop, co-located with ICDE 2024, Utrecht, NL, May 2024. IEEE. https://www.dropbox.com/scl/fi/plz8egd5wdvb5bp5vra09/840300a285.pdf?rlkey=gitqo6jzveh915g9fhbsqpqyn&st=8pk9vluh&dl=0
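To give a feel for what a provenance-capture tool of this kind records (this is a hand-rolled sketch, not the actual PROLIT API), consider wrapping Pandas steps so that each one emits a derivation record linking output columns back to the inputs they were computed from:

```python
# Hand-rolled sketch of column-level provenance capture over a Pandas pipeline.
# Illustrative only; it does not reproduce PROLIT's actual API or schema.
import pandas as pd

provenance = []  # list of derivation records (a stand-in for a graph database)

def record(op, inputs, outputs):
    """Append one derivation record: operation name, input and output columns."""
    provenance.append({"op": op, "inputs": inputs, "outputs": outputs})

df = pd.DataFrame({"price": [10.0, 20.0, None], "qty": [1, 2, 3]})

# Step 1: imputation -- 'price' is derived from itself via fillna.
df["price"] = df["price"].fillna(df["price"].mean())
record("fillna(mean)", inputs=["price"], outputs=["price"])

# Step 2: feature creation -- 'total' is derived from 'price' and 'qty'.
df["total"] = df["price"] * df["qty"]
record("multiply", inputs=["price", "qty"], outputs=["total"])

for step in provenance:
    print(step)
```

In PROLIT such derivations are captured from the script itself and stored as a graph in Neo4J, which is the starting point for the concrete projects below.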
Concrete projects:
Using the above as a starting point, we can develop ideas in a number of interesting directions. Please see here for a recent talk that provides more background: https://www.slideshare.net/slideshow/explainable-data-centric-ai-what-are-you-explaininhg-and-to-whom/268463441
1. LLMs exploiting the derivation graphs to provide suggestions on pipeline and data repairs. This project will look at using RAG (Retrieval Augmented Generation), specifically to incorporate results from Neo4J queries (Cypher) into narratives that suggest interventions on data and on pipelines (see the first sketch after this list).
2. Graph analysis of provenance graphs. The derivation graph is natively stored in a Neo4J graph DB. The project will experiment with graph analysis algorithms on these derivations, using the Neo4J Graph Data Science library (see the second sketch after this list).
"Why+" explanations: augmenting data derivations to describe the behaviour of complex data processing algorithms, for example training set optimisation, incremental data cleaning, etc. please see this presentation paper: https://www.dropbox.com/scl/fi/yfpzxtsbrtj9oppc52ymk/DCAI_position_SEBD_24_CR.pdf
I am open to discussing other related ideas for alternative projects.