Reading and Discussion of Research Papers
Devanbu PT. Promises and Perils of LLM- and Agent-Generated Code. Computer. 2025 Dec 31;59(1):184-6.
Asgari A, Panichella A, Derakhshanfar P, Olsthoorn M. What Challenges Do Developers Face in AI Agent Systems? An Empirical Study on Stack Overflow. arXiv preprint arXiv:2510.25423. 2025 Oct 29.
Seminar
Speaker: Xiaoke Han and Hong Zhu
Title: MASTEST: A LLM-Based Multi-Agent System For RESTful API Tests
Abstract: Testing RESTful APIs is increasingly important in the quality assurance of cloud-native applications. Recent advances in machine learning (ML) techniques have demonstrated that various testing activities can be performed automatically by large language models (LLMs) with reasonable accuracy. This paper develops a multi-agent system called MASTEST that combines LLM-based and programmed agents to form a complete tool chain covering the whole workflow of API testing: generating unit and system test scenarios from an API specification in the OpenAPI Swagger format, generating Pytest test scripts, executing the test scripts to interact with web services, and analysing web service response messages to determine test correctness and calculate test coverage. The system also supports the incorporation of human testers in reviewing and correcting LLM-generated test artefacts to ensure the quality of testing activities. The MASTEST system is evaluated on two LLMs, GPT-4o and DeepSeek V3.1 Reasoner, with five public APIs. The performance of the LLMs on various testing activities is measured by a wide range of metrics, including unit and system test scenario coverage and API operation coverage for the quality of generated test scenarios; data type correctness, status code coverage and script syntax correctness for the quality of LLM-generated test scripts; as well as the bug detection ability and usability of LLM-generated test scenarios and scripts. Experimental results demonstrate that both DeepSeek and GPT-4o achieved high overall performance. DeepSeek excels in data type correctness and status code detection, while GPT-4o performs best in API operation coverage. For both models, LLM-generated test scripts maintained 100% syntax correctness and required only minimal manual edits for semantic correctness. These findings indicate the effectiveness and feasibility of MASTEST.
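To make the workflow concrete, the following is a minimal, hypothetical sketch (not taken from the paper) of the kind of Pytest script that such LLM agents could generate from an OpenAPI specification; the base URL, endpoint, payload, and expected status codes are illustrative assumptions only.

import requests

BASE_URL = "https://api.example.com"  # hypothetical web service under test


def test_get_user_returns_200_and_expected_types():
    # System test scenario: retrieve an existing resource and check the response.
    response = requests.get(f"{BASE_URL}/users/1", timeout=10)
    assert response.status_code == 200
    body = response.json()
    # Checks of the kind the paper measures as data type correctness.
    assert isinstance(body.get("id"), int)
    assert isinstance(body.get("name"), str)


def test_create_user_with_invalid_payload_returns_client_error():
    # Negative scenario: the service should reject a malformed request body.
    response = requests.post(f"{BASE_URL}/users", json={"name": 123}, timeout=10)
    assert 400 <= response.status_code < 500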
Note: The presentation and demonstration are based on the following paper:
X. Han and H. Zhu, “MASTEST: A LLM-based multi-agent system for RESTful API tests,” arXiv preprint arXiv:2511.18038, 2025. Submitted to IEEE Transactions on Services Computing.
The paper is also available online at: https://drive.google.com/file/d/1ID7gIcEsUFJS759tyS8BqPVVOP3lnGku/view?usp=sharing
Xia, C.S., Deng, Y., Dunn, S. and Zhang, L., 2025. Agentless: Demystifying LLM-based Software Engineering Agents. Proceedings of the ACM on Software Engineering, 2 (FSE), pp.801-824.
Seminar
Speaker: George Wolf-Jackson
Title: Secret Hacker: a Learning-Based Game for Cyber Security Education
Abstract: Existing cyber security training methods do not deliver sufficient behaviour change to address the growing threat landscape that comes with an ever-more interconnected world. Game-based training methods have been used successfully in a variety of fields, but do not provide a 'silver bullet' solution for cyber security training, as a lack of commercial options, a high upfront cost to develop, and similarity with traditional training mean that new game-based training methods have not had the desired impact on the cyber security landscape. Here I propose Learning-Based Games, an alternative to Game-Based Learning and Gamification, in which a game-first approach develops existing (or newly developed) games into educational interventions, whilst maintaining the core game elements of the original game. As a pilot example of this, Secret Hacker is a game I have developed from the popular card game Secret Hitler, in which players must take on one of two roles and work together with their team to bring about a desirable outcome for their team. In the case of Secret Hacker, these outcomes are either good or bad cyber security conduct. I will also look at some of the winning games from the ECGBL conference and identify lessons that can be learnt from them.
The paper is available online at: https://doi.org/10.34190/ecgbl.19.2.4020
Fang S, Ding W, Mastropaolo A, Xu B. Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation. arXiv preprint arXiv:2506.22776. 2025 Jun 28.
Reading and Discussion of Research Papers
Dong Y, Jiang X, Qian J, Wang T, Zhang K, Jin Z, Li G. A survey on code generation with LLM-based agents. arXiv preprint arXiv:2508.00083. 2025 Jul 31.
Jandaghi P, Ahrabian K. The Fault in Our LLM Leaderboards. In 4th Workshop on Practical Deep Learning (Practical-DL 2025): Toward Robust Compressed Foundation Models in the Real World.
Reading and Discussion of Research Papers
Xue Z, Zhang X, Gao Z, Hu X, Gao S, Xia X, Li S. Clean Code, Better Models: Enhancing LLM Performance with Smell-Cleaned Dataset. arXiv preprint arXiv:2508.11958. 2025 Aug 16.
Reading and Discussion of Research Papers
Lin F, Kim DJ, Li Z, Yang J. RobuNFR: Evaluating the Robustness of Large Language Models on Non-Functional Requirements Aware Code Generation. arXiv preprint arXiv:2503.22851. 2025 Mar 28.
Hou, X., Zhao, Y., Wang, S., & Wang, H. Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions. arXiv:2503.23278v1. 30 March 2025.
Reading and Discussion of Research Papers
Wang, Z., Zhou, Z., Song, D., Huang, Y., Chen, S., Ma, L. and Zhang, T., 2025. Towards Understanding the Characteristics of Code Generation Errors Made by Large Language Models. Preprint.
Pan, J., Shar, R., Pfau, J., Talwalkar, A., He, H. and Chen, V., 2025. When Benchmarks Talk: Re-Evaluating Code LLMs with Interactive Feedback. arXiv preprint arXiv:2502.18413.
Reading and Discussion of Research Papers
Zamfirescu-Pereira, J.D., Jun, E., Terry, M., Yang, Q. and Hartmann, B., 2025. Beyond Code Generation: LLM-supported Exploration of the Program Design Space. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems.
Ho A, Bui AM, Nguyen PT, Di Salle A, Le B. EnseSmells: Deep ensemble and programming language models for automated code smells detection. Journal of Systems and Software. 2025 Feb 15:112375.
Read Research Paper
Cai, Y., Hou, Z., Sanan, D., Luan, X., Lin, Y., Sun, J. and Dong, J.S., 2025. Automated Program Refinement: Guide and Verify Code Large Language Model with Refinement Calculus. Proceedings of the ACM on Programming Languages, 9(POPL), pp.2057-2089.
Li, B., Wu, W., Tang, Z., Shi, L., Yang, J., Li, J., Yao, S., Qian, C., Hui, B., Zhang, Q. and Yu, Z., 2025, January. Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study. In Proceedings of the 31st International Conference on Computational Linguistics (pp. 7511-7531).
Read Research Paper
Djamel Mesbah, Nour El Madhoun, Khaldoun Al Agha, Hani Chalouati. Leveraging Prompt-based Large Language Models for Code Smell Detection: A Comparative Study on the MLCQ Dataset. The 13th International Conference on Emerging Internet, Data & Web Technologies (EIDWT-2025), Feb 2025, Matsue, Japan. hal-04881949.
Pan Z, Song X, Wang Y, Cao R, Li B, Li Y, Liu H. Do Code LLMs Understand Design Patterns? arXiv preprint arXiv:2501.04835. 2025 Jan 8.
Siam, M.K., Gu, H. and Cheng, J.Q., 2024. Programming with AI: Evaluating ChatGPT, Gemini, AlphaCode, and GitHub Copilot for Programmers. arXiv preprint arXiv:2411.09224.
Awal, M.A., Rochan, M. and Roy, C.K., 2024. Comparing Robustness Against Adversarial Attacks in Code Generation: LLM-Generated vs. Human-Written. arXiv preprint arXiv:2411.10565.
Arp D, Quiring E, Pendlebury F, Warnecke A, Pierazzi F, Wressnegger C, Cavallaro L, Rieck K. Pitfalls in Machine Learning for Computer Security. Communications of the ACM. 2024 Nov 1;67(11):104-12.
Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A.J., Welihinda, A., Hayes, A., Radford, A. and Mądry, A., 2024. GPT-4o System Card. arXiv:2410.21276.
Xinyu Gao, Yun Xiong, Deze Wang, Zhenhan Guan, Zejian Shi, Haofen Wang, and Shanshan Li. 2024. Preference-Guided Refactored Tuning for Retrieval Augmented Code Generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE '24). Association for Computing Machinery, New York, NY, USA, 65–77. https://doi.org/10.1145/3691620.3694987
Hao Ding, Ziwei Fan, Ingo Guehring, Gaurav Gupta, Wooseok Ha, Jun Huan, Linbo Liu, Behrooz Omidvar-Tehrani, Shiqi Wang, and Hao Zhou. 2024. Reasoning and Planning with Large Language Models in Code Development. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '24). Association for Computing Machinery, New York, NY, USA, 6480–6490. https://doi.org/10.1145/3637528.3671452
Title: Testing and Evaluation of The Robustness of Large Language Models for Code Generation
Speaker: Debalina Ghosh Paul
Nunez, A., Islam, N.T., Jha, S.K. and Najafirad, P., 2024. AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing. arXiv preprint arXiv:2409.10737.
Title: A Critical Review of Benchmarks and Metrics for Evaluations of Code Generation.
Speaker: Hong Zhu
Abstract: With the rapid development of Large Language Models (LLMs), a large number of machine learning models have been developed to assist with programming tasks, including the generation of program code from natural language input. However, how to evaluate such LLMs for this task is still an open problem, despite the great amount of research effort that has been made and reported on evaluating and comparing them. This talk presents a critical review of the existing work on the testing and evaluation of these tools, with a focus on two key aspects: the benchmarks and the metrics used in the evaluations. Based on the review, further research directions are discussed.
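As one concrete example of the metrics covered in such evaluations, the widely used pass@k metric, introduced with the HumanEval benchmark (Chen et al., 2021, listed further down this page), estimates for each problem the probability that at least one of k programs drawn from n generated samples (of which c pass all unit tests) is correct:

\[ \text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right] \]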
Acknowledgement: The talk is based on the following research paper.
Debalina Ghosh Paul, Hong Zhu and Ian Bayley, Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review. Proceedings of the First IEEE International Workshop on Testing and Evaluation of Large Language Models, Shanghai, July 15-18, 2024.
Wang, C.Y., DaghighFarsoodeh, A. and Pham, H.V., 2024. Selection of Prompt Engineering Techniques for Code Generation through Predicting Code Complexity. arXiv preprint arXiv:2409.16416.
Mikhail Evtikhiev, Egor Bogomolov, Yaroslav Sokolov, and Timofey Bryksin. 2023. Out of the BLEU: How should we assess quality of the Code Generation models? J. Syst. Softw. Vol. 203, Issue C, Sep. 2023. https://doi.org/10.1016/j.jss.2023.111741
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. Vol. 55, No. 12, Article 248, Dec. 2023, https://doi.org/10.1145/3571730
Albert Ziegler, Eirini Kalliamvakou, X. Alice Li, Andrew Rice, Devon Rifkin, Shawn Simister, Ganesh Sittampalam, and Edward Aftandilian. 2024. Measuring GitHub Copilot's Impact on Productivity. Commun. ACM 67, 3 (March 2024), 54–63. https://doi.org/10.1145/3633453
Ruixiang Tang, Yu-Neng Chuang, and Xia Hu. 2024. The Science of Detecting LLM-Generated Text. Commun. ACM 67, 4 (April 2024), 50–59. https://doi.org/10.1145/3624725
Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D. and Steinhardt, J., 2021, August. Measuring Coding Challenge Competence With APPS. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). Also arXiv:2105.09938.
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.D.O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G. and Ray, A., 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
Gemini Team, Gemini: A Family of Highly Capable Multimodal Models, Google DeepMind, Available Online at URL: https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf. Last access on 9 Dec. 2023.
Laskar, M.T.R., Bari, M.S., Rahman, M., Bhuiyan, M.A.H., Joty, S. and Huang, J.X., 2023. A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets. arXiv preprint arXiv:2305.18486.
Pascale Fung, ChatGPT: What It Can and Cannot Do, Centre for Artificial Intelligence Research, Hong Kong University of Science & Technology, March 2023.
The link to the video on YouTube is:
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. Advances in neural information processing systems, 30.
Tubishat, M., Ja'afar, S., Alswaitti, M., Mirjalili, S., Idris, N., Ismail, M.A. and Omar, M.S., 2021. Dynamic salp swarm algorithm for feature selection. Expert Systems with Applications, 164, p.113873.
Zhou, Y., Lin, J. and Guo, H., 2021. Feature subset selection via an improved discretization-based particle swarm optimization. Applied Soft Computing, 98, p.106794.
Bommert, A., Sun, X., Bischl, B., Rahnenführer, J. and Lang, M., 2020. Benchmark for filter methods for feature selection in high-dimensional classification data. Computational Statistics & Data Analysis, 143, p.106839.
Zhang, L., 2023. A Feature Selection Method Using Conditional Correlation Dispersion and Redundancy Analysis. Neural Processing Letters, pp.1-35.
Speaker: Dr. Shiyu Yan
Topic: Metamorphic Testing of Scientific Computing Programs on Differential Equations: a Case Study
Venue: Room B213, Wheatley Campus, Oxford Brookes University, UK (Also at Zoom)
Time: 1pm to 6pm.
Program:
Time Speaker and Title
1:00 – 1:05 Opening
1:05 – 1:30 Hong Zhu, An introduction to datamorphic test methodology
1:30 – 2:00 Reebu Joy, Datamorphic testing of ML regression models for feature selection
2:00 – 2:30 Movin Fernandes, Performance-based feature selection techniques
2:30 – 3:00 Aiden Gourley, Exploring adversarial examples of machine learning image classifiers
3:00 – 3:30 Coffee Break
3:30 – 4:00 Debalina Ghosh Paul, Testing natural language processing applications: A survey
4:00 – 4:30 Aamer Bassmaji, Eleni Elia and Sarah Howcutt, Automating meta-analysis: Advancements and perspectives
4:30 – 5:00 Tanha Miah, Testing ChatGPT’s capability of generating R program code
5:00 – 5:30 Daniel Rodriguez, Data complexity and quality issues in machine learning
5:30 – 6:00 Alexander Rast, Simulation model for testing
6:00 – 6:05 Closing
Click to download the Detailed Technical Program.
Robnik-Šikonja, M. and Kononenko, I., 2003. Theoretical and empirical analysis of ReliefF and RReliefF. Machine learning, 53, pp.23-69.
Dhal, P. and Azad, C., 2022. A comprehensive survey on feature selection in the various fields of machine learning. Applied Intelligence, 52:4543–4581. https://doi.org/10.1007/s10489-021-02550-9
Abdulwahab, H.M., Ajitha, S. and Saif, M.A.N., 2022. Feature selection techniques in the context of big data: taxonomy and analysis. Applied Intelligence, 52(12), pp.13568-13613. https://doi.org/10.1007/s10489-021-03118-3
Speaker:
Ms Debalina Ghosh Paul, Oxford Brookes University, UK
Title: Testing Machine Learning Applications for Natural Language Processing
Slides Video Recording
Jinhan Kim, Robert Feldt, and Shin Yoo. 2022. Evaluating Surprise Adequacy for Deep Learning System Testing. ACM Trans. Softw. Eng. Methodol. Just Accepted (July 2022). https://doi.org/10.1145/3546947
Xiaofei Xie, Tianlin Li, Jian Wang, Lei Ma, Qing Guo, Felix Juefei-Xu, and Yang Liu. 2022. NPC: Neuron Path Coverage via Characterizing Decision Logic of Deep Neural Networks. ACM Trans. Softw. Eng. Methodol. 31, 3, Article 47 (July 2022), 27 pages. https://doi.org/10.1145/3490489
Youcheng Sun, Xiaowei Huang, Daniel Kroening, James Sharp, Matthew Hill, and Rob Ashmore. 2019. Structural Test Coverage Criteria for Deep Neural Networks. ACM Trans. Embed. Comput. Syst. 18, 5s, Article 94 (October 2019), 23 pages. https://doi.org/10.1145/3358233
Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang. 2018. DeepGauge: multi-granularity testing criteria for deep learning systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE '18). Association for Computing Machinery, New York, NY, USA, 120–131. https://doi.org/10.1145/3238147.3238202
Fabrice Harel-Canada, Lingxiao Wang, Muhammad Ali Gulzar, Quanquan Gu, and Miryung Kim. 2020. Is neuron coverage a meaningful measure for testing deep neural networks? In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2020). Association for Computing Machinery, New York, NY, USA, 851–862. https://doi.org/10.1145/3368089.3409754
Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. 2018. DeepTest: automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering (ICSE '18). Association for Computing Machinery, New York, NY, USA, 303–314. https://doi.org/10.1145/3180155.3180220
Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. 2017. DeepXplore: Automated Whitebox Testing of Deep Learning Systems. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP '17). ACM, New York, NY, USA, 1–18. https://doi.org/10.1145/3132747.3132785
Speaker:
Prof. Kuo-Ming Chao, Bournemouth University, UK
Title:
A new machine learning approach to detect frauds in supply chain finance
Abstract:
This talk will start with a brief introduction to supply chain finance, possible frauds, and the challenges of detecting them. I will then describe a new multitask learning framework, based on heterogeneous graph neural networks, for detecting fraudulent transactions and enterprises with explainability. I will present experimental results demonstrating the effectiveness of the fraud detection and the explainability for both transactions and enterprises. The talk will end by showing that the proposed method outperforms others on the F1 and AUC criteria over two open datasets.
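For reference, the F1 criterion mentioned above is the standard harmonic mean of precision and recall, and AUC is the area under the ROC curve:

\[ F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \]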
About the Speaker:
Dr Kuo-Ming Chao obtained his MSc and PhD degrees from Sunderland University, UK. He is currently a distinguished professor at the National Engineering Laboratory for E-Commerce Technologies (NELECT), Fudan University, China, and a professor at Bournemouth University, UK. He joined Coventry University in 2000, gained his professorship in 2009 and left in 2021. Between 2007 and 2008, he joined the British Telecom Research Lab as a short-term research fellow. From 1997, Dr Chao worked at the Engineering Design Centre at Newcastle-upon-Tyne University as a postdoctoral research associate for more than three years.
His research interests include intelligent agents, machine learning, service-oriented computing and big data, and their applications, such as E-business, advanced manufacturing and energy efficiency management. He has over 200 refereed publications in books, journals and conference proceedings. Prof. Chao participates actively in the organization of international conferences and workshops as a program/general/steering conference chair and serves at numerous events as a program committee member. He is the chair of the IEEE Technical Community on Business Informatics and Systems. He is a co-founder and managing editor of Service-Oriented Computing and Applications, a Springer journal established to promote service-oriented computing, and a member of the editorial boards of several international journals (ESCI and EI indexed). In addition, Prof. Chao has contributed to many EU-funded projects as a coordinator or work package leader, and he has been an invited speaker at various international conferences and workshops.
Samer Y. Khamaiseh, Derek Bagagem, Abdullah Al-Alaj, Mathew Mancino, Hakam W. Alomari, "Adversarial Deep Learning: A Survey on Adversarial Attacks and Defense Mechanisms on Image Classification", IEEE Access, vol.10, pp.102266-102291, 2022.
S. -M. Moosavi-Dezfooli, A. Fawzi and P. Frossard, "DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2574-2582, doi: 10.1109/CVPR.2016.282.
Szegedy, Christian, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. "Intriguing properties of neural networks." In 2nd International Conference on Learning Representations (ICLR 2014). 2014.
At this meet-up, the group will read the following research paper:
J. Su, D. V. Vargas and K. Sakurai, "One Pixel Attack for Fooling Deep Neural Networks," IEEE Transactions on Evolutionary Computation, vol. 23, no. 5, pp. 828-841, Oct. 2019, doi: 10.1109/TEVC.2019.2890858.
Please click on the title of the paper to download the paper.
At this online meet-up, the members of the group will share their recent research results and ideas, their research activities, and their plans for the near future.
Juan Wang: User Review Analysis for Requirement Elicitation
Abstract: Online reviews are an important channel for requirement elicitation. However, requirement engineers face challenges when analysing online user reviews, such as data volume, the technical support required, the limitations of existing techniques, and legal barriers.
This thesis proposes a framework for user review analysis for the purpose of requirement elicitation, which sets up a channel from downloading user reviews to producing structured analysis data.
This framework is believed to be able to solve the problems because (a) the structure of this framework is composed of several loosely integrated components, which not only realize the flow of data from downloading raw user reviews to the structured analysis results, but also provide adaptability and flexibility for wider future applications; (b) the reasonable use of linguistic rules makes it possible to adjust and control the internal details of the system in this data flow; (c) natural language processing (NLP) technologies, such as chunking, regular expressions, and especially Stanford dependency trees, provide substantial technical support for this framework.
Three mobile app user review datasets were used to evaluate the functionalities. 6081 user reviews from the first dataset were used for the development of the linguistic rules. The first two datasets were used to enrich the popular opinions and the keywords list. The third dataset acts as a control group. The performance results of the prototype demonstrate that this framework is practical and usable.
The main contributions of this work are: (1) this thesis proposes a framework to solve the user review analysis problem for requirement elicitation; (2) a prototype of this framework was implemented and evaluated on three mobile app user review datasets.
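As an illustration only (not taken from the thesis), the following sketch shows the flavour of regular-expression-based linguistic rules that such a framework could apply to raw reviews to flag candidate feature requests; the patterns and sample reviews are hypothetical.

import re

# Hypothetical linguistic rules for spotting feature requests in app reviews.
FEATURE_REQUEST_PATTERNS = [
    re.compile(r"\b(please|pls)\s+add\b", re.IGNORECASE),
    re.compile(r"\bI\s+wish\b", re.IGNORECASE),
    re.compile(r"\bwould\s+be\s+(great|nice)\s+if\b", re.IGNORECASE),
]


def is_feature_request(review: str) -> bool:
    # A review is flagged if any rule matches its text.
    return any(pattern.search(review) for pattern in FEATURE_REQUEST_PATTERNS)


if __name__ == "__main__":
    for review in [
        "Please add a dark mode.",
        "Crashes every time I open it.",
        "It would be great if I could export my data.",
    ]:
        print(is_feature_request(review), "-", review)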
Hong Zhu: Testing Regression Models of Machine Learning -- A Case Study with a Real Industrial Application
To be determined
At this online meet-up, we will read the following research paper.
Carvalho, D.V.; Pereira, E.M.; Cardoso, J.S. Machine Learning Interpretability: A Survey on Methods and Metrics. Electronics 2019, 8, 832.
Please click on the title of the paper to download from the reading group's repository on Google Drive.
At this online meet-up, we will read the following research paper.
Gunning, D.; Aha, D. DARPA’s Explainable Artificial Intelligence (XAI) Program. AIMag 2019, 40, 44–58.
Please click on the title of the paper to download from the reading group's repository on Google Drive.
At this online meet-up, we will read the following research paper.
Zhou, Jianlong, Amir H. Gandomi, Fang Chen, and Andreas Holzinger. 2021. "Evaluating the Quality of Machine Learning Explanations: A Survey on Methods and Metrics" Electronics 10, no. 5: 593. https://doi.org/10.3390/electronics10050593
Please click on the title of the paper to download from the reading group's repository on Google Drive.
At this meet-up, we will continue to read the following research paper that was not finished at the last meet-up.
Liming Xu, Dave Towey, Andrew P. French, Steve Benford, Zhi Quan Zhou, Tsong Yueh Chen, Using Metamorphic Relations to Verify and Enhance Artcode Classification, arXiv:2108.02694v1, Aug. 2021.
At this online meet-up, we will read the following research paper on testing machine learning applications.
Liming Xu, Dave Towey, Andrew P. French, Steve Benford, Zhi Quan Zhou, Tsong Yueh Chen, Using Metamorphic Relations to Verify and Enhance Artcode Classification, arXiv:2108.02694v1, Aug. 2021.
Please click on the title of the paper to download from the reading group's repository on Google Drive.
At this online meet-up, we will read the following research paper on metrics of data complexity for machine learning.
José Daniel Pascual-Triana, David Charte, Marta Andrés Arroyo, Alberto Fernández and Francisco Herrera, Revisiting data complexity metrics based on morphology for overlap and imbalance: snapshot, new overlap number of balls metrics and singular problems prospect, Knowledge and Information Systems (2021) 63:1961–1989.
Please click on the title of the paper to download from the reading group's repository on Google Drive.
At this online meet-up, we will read the following research paper on metrics of data complexity for machine learning.
Tin Kam Ho and Mitra Basu, Complexity Measures of Supervised Classification Problems, IEEE Transactions On Pattern Analysis And Machine Intelligence, Vol. 24, No. 3, March 2002, pp 289-300.
Please click on the title of the paper to download from the reading group's repository on Google Drive.
At this online meet-up, we will exchange research ideas, recent research outcomes, and plan for future works.
At this online meet-up, we will share the progress in our research projects.
Details to be announced.
At this online meet-up, we will read and discuss the following paper:
Xiaoyuan Xie, Joshua W.K. Ho, Christian Murphy, Gail Kaiser, Baowen Xu, Tsong Yueh Chen, Testing and validating machine learning classifiers by metamorphic testing, Journal of Systems and Software, Volume 84, Issue 4, 2011, Pages 544-558.
Please click on the title of the paper or here to download the paper from Google Drive.
At this online meet-up, we continue reading and discussing the following paper:
A. Sharma and H. Wehrheim, "Testing Machine Learning Algorithms for Balanced Data Usage," in Proc. of 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST), Xi'an, China, 2019, pp. 125-135.
Please click on the title of the paper or here to download the paper from Google Drive.
At this online meet-up, we will read and discuss the following paper:
A. Sharma and H. Wehrheim, "Testing Machine Learning Algorithms for Balanced Data Usage," in Proc. of 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST), Xi'an, China, 2019, pp. 125-135.
Please click on the title of the paper or here to download the paper from Google Drive.
At this meet-up, Hong Zhu will report his work in progress on
Evaluation of Classifiers Based on Exploratory Datamorphic Testing
At this online meet-up, we will read and discuss the following paper:
Huang, J., & Ling, C. (2007). Constructing new and better evaluation measures for machine learning. In Proceedings of the 20th international joint conference on artificial intelligence (IJCAI’2007) (pp. 859–864).
Please click on the title of the paper or here to download the paper from Google Drive.
At this meet-up, we will read the following research paper.
Marina Sokolova and Guy Lapalme, "A systematic analysis of performance measures for classification tasks", Information Processing and Management, 45 (2009), Elsevier, pp. 427–437.
If you are not familiar with the metrics used to evaluate classifiers, the following paper may be useful as an introduction.
Tom Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (2006), pp. 861–874.
(Recommended by Hong Zhu)
Please click on the title of the paper to download the paper from Google Drive.
At this meet-up, we will review the following research paper draft written by Hong Zhu and Ian Bayley.
Hong Zhu and Ian Bayley, "Exploratory Datamorphic Testing of Feature Based Classifiers", Submitted to The Journal of Systems and Software.
Please click on the title of the paper or here to download the paper from Google Drive.
At this meet-up, we will read the following research paper.
Shenao Yan, et al., Correlations between deep neural network model coverage criteria and model quality, in Proc. of ESEC/FSE 2020, Nov. 2020, Pages 775–787.
(Recommended by Hong Zhu)
Please click on the title of the paper to download the paper from Google Drive.
At this meet-up, we will read the following research paper.
Davide Dell’Anna, Fabiano Dalpiaz, Mehdi Dastani, "Validating Goal Models via Bayesian Networks", in Proc. of AIRE 2018.
(Recommended by Rachel Harrison)
Please click on the title of the paper or here to download the paper from Google Drive.