This project aims to develop and evaluate Large Language Model (LLM)-based mediation agents that will actively participate in online discussions, with or without additional human mediation. Although extensive work has been devoted to toxicity detection for moderation purposes in online discussions (e.g., in social media, news portals, customer review platforms) [1-8], most of that work aimed to flag toxic posts or classify them (e.g., as profanity, hate speech, racism) and eventually remove them from the discussions, either automatically or after checking with a human moderator. With recent instruction-following LLMs [9,10], however, it seems possible to move towards automatic mediation agents, which will actively participate in online discussions to improve their quality, rather than just remove posts [11]. For example, LLMs could be instructed to intervene in discussions whenever they see toxic statements, explain to their authors why those statements could be perceived as toxic by fellow discussants, and suggest rephrasings, possibly even before the statements become visible to others. The LLMs could also be instructed to periodically summarise the main topics of a discussion, the opinions of different (emerging or known) groups, and points of agreement and disagreement, to help discussions reach a consensus. Or they could invite discussants to vote, when discussions seem to have reached a steady state or when the discussion period has ended and decisions need to be made. Recent work [12,13] has demonstrated that LLMs can, at least in principle, undertake such mediation roles. There is still, however, no concrete analysis of (a) the kinds of interventions human mediators typically make (or would ideally make) in online discussions, (b) how/if LLM mediators can be instructed/tuned to perform similar (or other desirable) interventions, and (c) how useful these interventions (by human and LLM mediators) are, as perceived by the people participating in the discussions or by external observers. Answering these questions is necessary in order to develop effective and appealing LLM-based/assisted mediation services.
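As a concrete illustration of the kind of prompting described above, the minimal sketch below asks an instruction-following LLM to decide whether to intervene on a newly submitted post, explain any perceived toxicity to its author, and suggest a rephrasing. The OpenAI client, model name, prompt wording, and NO_INTERVENTION convention are illustrative assumptions, not the project's actual setup; any chat-capable LLM could be substituted.

```python
# Minimal sketch of prompting an instruction-following LLM to act as a mediator.
# The model name, prompt wording, and NO_INTERVENTION convention are
# illustrative assumptions, not the project's actual setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MEDIATOR_INSTRUCTIONS = (
    "You are a mediator in an online discussion. When a new post is submitted, "
    "decide whether to intervene. If the post could be perceived as toxic by "
    "fellow discussants, briefly explain to its author why, and suggest a more "
    "civil rephrasing. Otherwise, reply with exactly NO_INTERVENTION."
)


def mediate(discussion_so_far: str, new_post: str) -> str:
    """Ask the LLM whether (and how) to intervene on a newly submitted post."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        temperature=0.3,
        messages=[
            {"role": "system", "content": MEDIATOR_INSTRUCTIONS},
            {
                "role": "user",
                "content": f"Discussion so far:\n{discussion_so_far}\n\nNew post:\n{new_post}",
            },
        ],
    )
    return response.choices[0].message.content


# In the envisaged setting, the suggested rephrasing could be shown to the
# author before the post becomes visible to the other discussants.
print(mediate("A: I think remote exams work fine.", "B: Only an idiot would think that."))
```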
To address points (a) – (c) above, discussion-based experiments will be undertaken within the academic ecosystem, where we can reach out to volunteers from multiple backgrounds who are familiar with online discussions and curious about LLMs (e.g., students and instructors are already debating how models like ChatGPT should be used in coursework). Students and researchers of AUEB, SU, and EPFL (provisional support established; more institutions will be reached) will be invited to participate in online text-only (chat-like) fora, set up for the purposes of the experiments. Topics of academic interest (e.g., brain gain, funding, academic honesty, LLM use) and beyond (immigration, abortion, LGBTQIA+ rights, etc.) will be selected, and the participants will be randomly divided into groups (four per topic), to be mediated by (i) humans, (ii) LLMs, (iii) both, or (iv) none. Human mediators will be given instructions based on previous literature and existing guidelines of online discussion fora (further extended during the proposed work), explaining what they should aim for (e.g., making sure that minority opinions are heard and that no participants are bullied, helping the discussions converge by clustering and summarising opinions, inviting participants to vote when a steady state has been reached or the discussion period has ended). LLM mediators will be prompted with similar instructions and/or fine-tuned (further trained) on human mediator interventions. Hence, multiple discussion groups will be formed per topic, one per mediator type (i) – (iv). All the discussions will be recorded with the permission of the participants. The datasets (incl. the LLM prompts used) will be made publicly available; they will be of appropriate sizes to fine-tune LLMs and will include metadata (role of each participant, demographics, topic, date/time, etc.). The guidelines given to the human mediators will also be made public. For evaluation, we will monitor: engagement (e.g., response frequency across discussants); toxicity (e.g., via the Perspective API); constructiveness (e.g., via constructiveness markers [15]); politeness (e.g., as in [16]); polarisation (e.g., by asking participants to fill in the political compass and checking whether clusters of users/opinions align with political views); and imparted knowledge, user satisfaction, and perceived dialog quality (e.g., via questionnaires and structured interviews after the discussions, directed to participants, mediators, and external observers). We will also consider appropriately modified measures from dialog systems evaluation [14]. The results will be studied per (and compared across) mediation type (i) – (iv) above, so that we can explore the pros and cons of each type, as well as the most desirable human and LLM-based kinds of interventions.
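To make the toxicity monitoring concrete, the sketch below scores recorded posts with the Perspective API (via the google-api-python-client library) and aggregates a mean toxicity score per mediation condition. The environment variable, helper names, and per-condition aggregation are assumptions made for this example, not the project's actual evaluation pipeline.

```python
# Illustrative sketch of toxicity monitoring with the Perspective API, using the
# google-api-python-client library. The PERSPECTIVE_API_KEY variable, the helper
# names, and the per-condition aggregation are assumptions for this example.
import os
from statistics import mean

from googleapiclient import discovery

perspective = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=os.environ["PERSPECTIVE_API_KEY"],
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)


def toxicity(text: str) -> float:
    """Return the Perspective TOXICITY summary score (0-1) for a single post."""
    body = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = perspective.comments().analyze(body=body).execute()
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]


def mean_toxicity_per_condition(posts_by_condition: dict[str, list[str]]) -> dict[str, float]:
    """Mean toxicity per mediation condition, e.g. (i) human, (ii) LLM, (iii) both, (iv) none."""
    return {
        condition: mean(toxicity(post) for post in posts)
        for condition, posts in posts_by_condition.items()
    }
```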
By shedding light on points (a) – (c) above, we will provide guidance towards developing more effective and appealing LLM-based, but also hybrid (LLM/human-based), mediation services, which will in turn help improve the quality of online discussions, with far-reaching effects on the quality of democracy, social cohesion, governance, and accountability. The publicly available datasets to be constructed and the evaluation measures to be studied will be valuable to other researchers wishing to develop and evaluate their own LLM mediators and ways to combine them with human mediation. The LLM mediators to be developed will be strong baselines for future research. The results of our studies will also determine the most desirable types of interventions (by humans and LLMs), as well as the most effective ways to prompt and/or fine-tune LLM mediators, further informing future work.
The project is funded by Archimedes, a vibrant research hub connecting the global AI and Data Science research community and fostering groundbreaking research. Our work is supported by project MIS 5154714 of the National Recovery and Resilience Plan Greece 2.0, funded by the European Union under the NextGenerationEU Program.
[1] Korre, K., Pavlopoulos, J., Sorensen, J., Laugier, L., Androutsopoulos, I., Dixon, L., and Barrón-Cedeño, A., "Harmful Language Datasets: An Assessment of Robustness". In Proceedings of WOAH, pp. 221-230. 2023.
[2] Sorensen, J., Korre, K., Pavlopoulos, J., Tomanek, K., Thain, N., Dixon, L. and Laugier, L., "JUAGE at SemEval-2023 Task 10: Parameter Efficient Classification". In Proceedings of SemEval, pp. 1195-1203. 2023.
[3] Xenos, A., Pavlopoulos, J., Androutsopoulos, I., Dixon, L., Sorensen, J., and Laugier, L., "Toxicity detection sensitive to conversational context". First Monday (2022).
[4] Pavlopoulos, J., Laugier, L., Xenos, A., Sorensen, J., and Androutsopoulos, I., "From the detection of toxic spans in online discussions to the analysis of toxic-to-civil transfer". In Proceedings of ACL, pp. 3721-3734. 2022.
[5] Xenos, A., Pavlopoulos, J., and Androutsopoulos, I., "Context sensitivity estimation in toxicity detection". In Proceedings of WOAH, pp. 140-145. 2021.
[6] Pavlopoulos, J., Sorensen, J., Laugier, L., and Androutsopoulos, I., "SemEval-2021 task 5: Toxic spans detection". In Proceedings of SemEval, pp. 59-69. 2021.
[7] Pavlopoulos, J., Sorensen, J., Dixon, L., Thain, N., and Androutsopoulos, I., "Toxicity Detection: Does Context Really Matter?". In Proceedings of ACL, pp. 4296-4305. 2020.
[8] Pavlopoulos, J., Malakasiotis, P., and Androutsopoulos, I., "Deeper attention to abusive user content moderation". In Proceedings of EMNLP, pp. 1125-1135. 2017.
[9] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A. and Schulman, J., “Training language models to follow instructions with human feedback”. NeurIPS, 35, pp. 27730-27744. 2022.
[10] Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., et al., "Scaling Instruction-Finetuned Language Models". arXiv preprint arXiv:2210.11416 (2022).
[11] Bakker, M., Chadwick, M., Sheahan, H., Tessler, M., Campbell-Gillingham, L., Balaguer, J., McAleese, N., et al. "Fine-tuning language models to find agreement among humans with diverse preferences." NeurIPS (2022): 38176-38189.
[12] Small, C. T., Vendrov, I., Durmus, E., Homaei, H., Barry, E., Cornebise, J., Suzman, T., Ganguli, D., and Megill, C., "Opportunities and Risks of LLMs for Scalable Deliberation with Polis". arXiv preprint arXiv:2306.11932 (2023).
[13] Fish, S., Gölz, P., Parkes, D. C., Procaccia, A. D., Rusak, G., Shapira, I., and Wüthrich, M., "Generative Social Choice". arXiv preprint arXiv:2309.01291 (2023).
[14] Yeh, Y.-T., Eskenazi, M., and Mehri, S., "A Comprehensive Assessment of Dialog Evaluation Metrics". In Proceedings of the 1st Workshop on Evaluations and Assessments of Neural Conversation Systems, pp. 15-33. 2021.
[15] Niculae, V. and Danescu-Niculescu-Mizil, C., “Conversational Markers of Constructive Discussions”. In Proceedings of NAACL-HLT, pp. 568-578. 2016.
[16] Zhang, J., Chang, J., Danescu-Niculescu-Mizil, C., Dixon, L., Hua, Y., Taraborelli, D., and Thain, N., "Conversations Gone Awry: Detecting Early Signs of Conversational Failure". In Proceedings of ACL, pp. 1350-1361. 2018.
L. Dixon (Google)
Crystal Qian (Google)
J. Sorensen (Jigsaw)
C. Small (Jigsaw)
Stella Markantonatou (RC Athena ILSP & Archimedes)