Financial Challenges in Large Language Models - FinLLM

Introduction

With the advent of LLMs in finance, financial text analysis, generation, and decision-making tasks have received growing attention, including financial classification, financial text summarization, and single stock trading. For instance: 1) the financial classification task categorizes sentences as claims or premises; 2) the financial text summarization task abstracts financial texts into concise summaries; 3) the single stock trading task makes informed buy/sell/hold decisions. Although several approaches have achieved remarkable performance with LLMs, their capabilities for comprehensive financial analysis and decision-making remain largely unexplored. To explore these facets, we have prepared an LLM-based financial shared task, FinLLM [1], which contains three subtasks designed to address diverse financial challenges effectively and holistically.

Dataset

We provide three diverse datasets for three tasks to challenge the capabilities of financial LLMs.

 

Task 1: financial classification

This task focuses on argument unit classification and tests the capability of LLMs to identify and categorize texts as premises or claims [2]. Participants receive a financial text and two options, design a prompt query template accordingly, and then classify the text as a claim or premise.

We provide 7.75k training examples and 969 test examples for categorizing sentences as claims or premises.

We use the following prompt template for this task.

Instruction: [task prompt] Text: [input text] Response: [output]

[input text] denotes the financial text in the prompt, and [output] is the predicted label (i.e., "Claim" or "Premise").
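A minimal sketch of filling this template in Python follows; the [task prompt] wording used here is illustrative, not the official one.

```python
# Minimal sketch of filling the Task 1 template; the task prompt
# wording below is illustrative, not the official one.
TEMPLATE = "Instruction: {task_prompt} Text: {input_text} Response:"

TASK_PROMPT = (  # hypothetical wording
    "Classify the following financial sentence as a 'Claim' or a 'Premise'. "
    "Answer with exactly one of the two labels."
)

def build_prompt(sentence: str) -> str:
    return TEMPLATE.format(task_prompt=TASK_PROMPT, input_text=sentence)

print(build_prompt("We expect revenue to grow 10% next quarter."))
```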

We use two metrics, F1 score and Accuracy, to evaluate classification capability.

We use the F1 score as the final ranking metric.
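For reference, both metrics can be computed with scikit-learn as sketched below; the positive label and binary averaging mode are illustrative assumptions, since the exact settings are not specified above.

```python
# Sketch of computing the Task 1 metrics with scikit-learn.
# The positive label and binary averaging are illustrative assumptions.
from sklearn.metrics import accuracy_score, f1_score

gold = ["Claim", "Premise", "Premise", "Claim"]  # reference labels
pred = ["Claim", "Premise", "Claim", "Claim"]    # model predictions

accuracy = accuracy_score(gold, pred)
f1 = f1_score(gold, pred, pos_label="Claim", average="binary")

print(f"Accuracy: {accuracy:.3f}, F1: {f1:.3f}")
```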



Task 2: financial text summarization

This task is designed to test the capability of LLMs to generate coherent summaries [3]. Participants are given a financial news text and must produce a concise summary of it, following the designed prompt template.

We provide 8k training examples and 2k test examples for abstracting financial news articles into concise summaries.

We use the following prompt template for this task.

Instruction: [task prompt] Context: [input context] Response: [output]

[input context] denotes the multi-sentence text of the financial news article, and [output] is the abstractive summary of this text.

We use ROUGE (1, 2, and L) and BERTScore to evaluate the relevance of generated summaries.

We use the ROUGE-1 score as the final ranking metric.
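For reference, the sketch below scores one (reference, candidate) pair with the open-source rouge-score and bert-score packages; whether these defaults (e.g., stemming, the underlying BERTScore model) match the organizers' exact evaluation settings is an assumption.

```python
# Sketch of scoring one (reference, candidate) summary pair.
# Requires: pip install rouge-score bert-score
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "Acme Corp reported record quarterly profit, driven by cloud growth."
candidate = "Acme Corp posted record profit on strong cloud demand."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print("ROUGE-1 F1:", rouge["rouge1"].fmeasure)  # the final ranking metric

P, R, F1 = bert_score([candidate], [reference], lang="en")
print("BERTScore F1:", F1.mean().item())
```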



Task 3: single stock trading

This task aims to evaluate LLMs' ability to make sophisticated decisions in trading activities, an area currently restricted by humans' limited ability to process large volumes of data rapidly [4]. Participants receive a combination of open-source data for stocks and an ETF. The system should output one of three trading decisions ("buy", "sell", or "hold") together with its reasoning.

We provide 291 test examples to evaluate LLMs on sophisticated stock trading decisions.

We use the following prompt template for this task.

Instruction: [task prompt] Context: [input context] Response: [output]

[input context] denotes the financial investment information in the prompt, and [output] must strictly conform to the following JSON format, without any additional content: {"investment_decision": string, "summary_reason": string, "short_memory_index": number, "middle_memory_index": number, "long_memory_index": number, "reflection_memory_index": number}
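As a practical aid, here is a minimal sketch for checking that a model response satisfies this format before submission; it is purely illustrative, not the organizers' official validator.

```python
# Minimal sketch of a Task 3 response validator; illustrative only,
# not the organizers' official checker.
import json

# Required fields and their expected JSON types.
REQUIRED = {
    "investment_decision": str,
    "summary_reason": str,
    "short_memory_index": (int, float),
    "middle_memory_index": (int, float),
    "long_memory_index": (int, float),
    "reflection_memory_index": (int, float),
}

def validate_response(raw: str) -> dict:
    obj = json.loads(raw)  # raises ValueError on malformed JSON or trailing content
    for key, expected_type in REQUIRED.items():
        if key not in obj or not isinstance(obj[key], expected_type):
            raise ValueError(f"bad or missing field: {key}")
    if obj["investment_decision"] not in {"buy", "sell", "hold"}:
        raise ValueError('investment_decision must be "buy", "sell", or "hold"')
    return obj

print(validate_response('{"investment_decision": "hold", "summary_reason": "mixed signals", '
                        '"short_memory_index": 1, "middle_memory_index": 2, '
                        '"long_memory_index": 3, "reflection_memory_index": 0}'))
```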

We offer a comprehensive assessment of profitability, risk management, and decision-making prowess using a series of metrics: Sharpe Ratio (SR), Cumulative Return (CR), Daily Volatility (DV), Annualized Volatility (AV), and Maximum Drawdown (MD).

We use the Sharpe Ratio (SR) as the final ranking metric.

The formulas are as follows: 

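A standard formulation of these metrics is reproduced below, assuming daily returns r_t (t = 1, ..., T), portfolio values P_t, a risk-free rate r_f, and 252 trading days per year; the organizers' exact definitions may differ slightly.

```latex
\begin{aligned}
\mathrm{SR} &= \frac{\bar{r} - r_f}{\sigma}, \qquad
\mathrm{CR} = \prod_{t=1}^{T} (1 + r_t) - 1, \\
\mathrm{DV} &= \sigma = \sqrt{\frac{1}{T-1} \sum_{t=1}^{T} (r_t - \bar{r})^2}, \qquad
\mathrm{AV} = \sqrt{252}\,\mathrm{DV}, \\
\mathrm{MD} &= \max_{t} \frac{\max_{s \le t} P_s - P_t}{\max_{s \le t} P_s}
\end{aligned}
```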

Model Cheating Detection

To measure the risk of the test set leaking into a model's training data (model cheating), we have developed a new metric called the Data Leakage Test (DLT), building on existing research [5].

The DLT calculates the difference between a large language model's (LLM's) perplexity on the training data and on the test data. Specifically, we separately input the training set and the test set into the LLM and compute the perplexity on each: ppl-on-train and ppl-on-test. The DLT value is then obtained by subtracting ppl-on-train from ppl-on-test. A larger difference implies that the LLM is less likely to have seen the test set during training than the training set, suggesting a lower likelihood of model cheating; conversely, a smaller difference implies that the LLM is more likely to have seen the test set during training, suggesting a higher likelihood of model cheating.
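A minimal sketch of this computation with Hugging Face transformers is shown below; the model name, truncation length, and per-document averaging are illustrative assumptions rather than the official implementation.

```python
# Sketch: compute an LLM's perplexity on train/test texts, then the DLT.
# Model name and truncation length are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM under test works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def perplexity(texts):
    """Exponentiated mean token-level NLL over a list of documents."""
    losses = []
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
        out = model(**enc, labels=enc["input_ids"])  # loss = mean token NLL
        losses.append(out.loss.item())
    return float(torch.exp(torch.tensor(losses).mean()))

ppl_on_train = perplexity(["<training set documents here>"])
ppl_on_test = perplexity(["<test set documents here>"])
dlt = ppl_on_test - ppl_on_train  # smaller DLT => higher cheating risk
print(f"DLT = {dlt:.3f}")
```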


In the detection process, we will calculate DLT values for several LLMs to establish a reference baseline for model cheating and to minimize the impact of generalization on the metric. The formula is as follows:
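```latex
\mathrm{DLT} = \mathrm{PPL}_{\mathrm{test}} - \mathrm{PPL}_{\mathrm{train}}
```

where PPL_test and PPL_train denote the ppl-on-test and ppl-on-train values defined above.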

[1] The FinBen: An Holistic Financial Benchmark for Large Language Models (https://arxiv.org/pdf/2402.12659)

[2] Fine-Grained Argument Understanding with BERT Ensemble Techniques: A Deep Dive into Financial Sentiment Analysis (https://aclanthology.org/2023.rocling-1.30.pdf)

[3] Trade the Event: Corporate Events Detection for News-Based Event-Driven Trading (https://arxiv.org/pdf/2105.12825)

[4] FinMem: A Performance-Enhanced LLM Trading Agent with Layered Memory and Character Design (https://arxiv.org/pdf/2311.13743)

[5] Skywork: A More Open Bilingual Foundation Model (https://arxiv.org/pdf/2310.19341)

Schedule

There will be two phases in the challenge. Phase 1 is the challenge stage, which includes registration opening, data release, submission of system outputs, cheating detection, and results release. Phase 2 is the shared task paper stage.





Submission site: https://forms.gle/GYZee2f6xPweeY2U7

Please note that submissions close at 23:59 (AoE) on May 29, 2024.


Leaderboard:  https://huggingface.co/spaces/TheFinAI/IJCAI-2024-FinLLM-Learderboard


Policies


Shared Task Organizers


Contact