Financial Challenges in Large Language Models - FinLLM

Introduction

With the advent of LLMs in finance, financial text analysis, generation, and decision-making tasks have received growing attention, including financial classification, financial text summarization, and single stock trading. For instance: 1) the financial classification task categorizes sentences as claims or premises; 2) the financial text summarization task abstracts financial texts into concise summaries; 3) the single stock trading task makes informed buy/sell/hold decisions. Although several approaches have achieved remarkable performance with LLMs, their capabilities for comprehensive financial analysis and decision-making remain largely unexplored. To explore these facets, we have prepared an LLM-based financial shared task, FinLLM [1], which contains three subtasks designed to address diverse financial challenges effectively and holistically.

Dataset

We provide three diverse datasets for three tasks to challenge the capabilities of financial LLMs.

 

Task 1: financial classification

This task focuses on argument unit classification and tests the capability of LLMs to identify and categorize texts as premises or claims [2]. Participants receive a financial text and two options, design a prompt query template accordingly, and then classify the text as a claim or premise.

We provide 7.75k training examples and 969 test examples for categorizing sentences as claims or premises.

We use the following prompt template for this task.

Instruction: [task prompt] Text: [input text] Response: [output]

[input text] denotes the financial text in the prompt, and [output] is the predicted label (i.e., "Claim" or "Premise").
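A minimal sketch of filling this template in Python follows; the [task prompt] wording used here is illustrative, not the official one.

```python
# Minimal sketch of filling the Task 1 template; the task prompt
# wording below is illustrative, not the official one.
TEMPLATE = "Instruction: {task_prompt} Text: {input_text} Response:"

TASK_PROMPT = (  # hypothetical wording
    "Classify the following financial sentence as a 'Claim' or a 'Premise'. "
    "Answer with exactly one of the two labels."
)

def build_prompt(sentence: str) -> str:
    return TEMPLATE.format(task_prompt=TASK_PROMPT, input_text=sentence)

print(build_prompt("We expect revenue to grow 10% next quarter."))
```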

We use two metrics, F1 score and Accuracy, to evaluate classification capability.

We use the F1 score as the final ranking metric.
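For reference, both metrics can be computed with scikit-learn as sketched below; the positive label and binary averaging mode are illustrative assumptions, since the exact settings are not specified above.

```python
# Sketch of computing the Task 1 metrics with scikit-learn.
# The positive label and binary averaging are illustrative assumptions.
from sklearn.metrics import accuracy_score, f1_score

gold = ["Claim", "Premise", "Premise", "Claim"]  # reference labels
pred = ["Claim", "Premise", "Claim", "Claim"]    # model predictions

accuracy = accuracy_score(gold, pred)
f1 = f1_score(gold, pred, pos_label="Claim", average="binary")

print(f"Accuracy: {accuracy:.3f}, F1: {f1:.3f}")
```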



Task 2: financial text summarization

This task is designed to test the capability of LLMs to generate coherent summaries [3]. Participants are given a financial news text and must produce a concise summary of it, following the designed prompt template.

We provide 8k training examples and 2k test examples for abstracting financial news articles into concise summaries.

We use the following prompt template for this task.

Instruction: [task prompt] Context: [input context] Response: [output]

[input context] denotes the multi-sentence text of the financial news article, and [output] is the abstractive summary of this text.

We use ROUGE (1, 2, and L) and BERTScore to evaluate the relevance of generated summaries.

We use the ROUGE-1 score as the final ranking metric.
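For reference, the sketch below scores one (reference, candidate) pair with the open-source rouge-score and bert-score packages; whether these defaults (e.g., stemming, the underlying BERTScore model) match the organizers' exact evaluation settings is an assumption.

```python
# Sketch of scoring one (reference, candidate) summary pair.
# Requires: pip install rouge-score bert-score
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "Acme Corp reported record quarterly profit, driven by cloud growth."
candidate = "Acme Corp posted record profit on strong cloud demand."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print("ROUGE-1 F1:", rouge["rouge1"].fmeasure)  # the final ranking metric

P, R, F1 = bert_score([candidate], [reference], lang="en")
print("BERTScore F1:", F1.mean().item())
```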



Task 3: single stock trading

This task aims to evaluate LLMs' ability to make sophisticated decisions in trading activities, an area currently restricted by humans' limited ability to process large volumes of data rapidly [4]. Participants receive a combination of open-source data for stocks and an ETF. The system should output one of three trading decisions ("buy", "sell", or "hold") together with its reasoning.

We provide 291 test examples to evaluate LLMs on sophisticated stock trading decisions.

We use the following prompt template for this task.

Instruction: [task prompt] Context: [input context] Response: [output]

[input context] denotes the financial investment information in the prompt, and [output] must strictly conform to the following JSON format, without any additional content: {"investment_decision": string, "summary_reason": string, "short_memory_index": number, "middle_memory_index": number, "long_memory_index": number, "reflection_memory_index": number}
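As a practical aid, here is a minimal sketch for checking that a model response satisfies this format before submission; it is purely illustrative, not the organizers' official validator.

```python
# Minimal sketch of a Task 3 response validator; illustrative only,
# not the organizers' official checker.
import json

# Required fields and their expected JSON types.
REQUIRED = {
    "investment_decision": str,
    "summary_reason": str,
    "short_memory_index": (int, float),
    "middle_memory_index": (int, float),
    "long_memory_index": (int, float),
    "reflection_memory_index": (int, float),
}

def validate_response(raw: str) -> dict:
    obj = json.loads(raw)  # raises ValueError on malformed JSON or trailing content
    for key, expected_type in REQUIRED.items():
        if key not in obj or not isinstance(obj[key], expected_type):
            raise ValueError(f"bad or missing field: {key}")
    if obj["investment_decision"] not in {"buy", "sell", "hold"}:
        raise ValueError('investment_decision must be "buy", "sell", or "hold"')
    return obj

print(validate_response('{"investment_decision": "hold", "summary_reason": "mixed signals", '
                        '"short_memory_index": 1, "middle_memory_index": 2, '
                        '"long_memory_index": 3, "reflection_memory_index": 0}'))
```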

We offer a comprehensive assessment of profitability, risk management, and decision-making prowess using a series of metrics: Sharpe Ratio (SR), Cumulative Return (CR), Daily Volatility (DV), Annualized Volatility (AV), and Maximum Drawdown (MD).

We use the Sharpe Ratio (SR) as the final ranking metric.

The formulas are as follows: 

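A standard formulation of these metrics is reproduced below, assuming daily returns r_t (t = 1, ..., T), portfolio values P_t, a risk-free rate r_f, and 252 trading days per year; the organizers' exact definitions may differ slightly.

```latex
\begin{aligned}
\mathrm{SR} &= \frac{\bar{r} - r_f}{\sigma}, \qquad
\mathrm{CR} = \prod_{t=1}^{T} (1 + r_t) - 1, \\
\mathrm{DV} &= \sigma = \sqrt{\frac{1}{T-1} \sum_{t=1}^{T} (r_t - \bar{r})^2}, \qquad
\mathrm{AV} = \sqrt{252}\,\mathrm{DV}, \\
\mathrm{MD} &= \max_{t} \frac{\max_{s \le t} P_s - P_t}{\max_{s \le t} P_s}
\end{aligned}
```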

Model Cheating Detection

To measure the risk of the test set leaking into a model's training data (model cheating), we have developed a new metric called the Data Leakage Test (DLT), building on existing research [5].

The DLT calculates the difference between a large language model's (LLM's) perplexity on the training data and on the test data. Specifically, we separately input the training set and the test set into the LLM and compute the perplexity on each: ppl-on-train and ppl-on-test. The DLT value is then obtained by subtracting ppl-on-train from ppl-on-test. A larger difference implies that the LLM is less likely to have seen the test set during training than the training set, suggesting a lower likelihood of model cheating; conversely, a smaller difference implies that the LLM is more likely to have seen the test set during training, suggesting a higher likelihood of model cheating.
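A minimal sketch of this computation with Hugging Face transformers is shown below; the model name, truncation length, and per-document averaging are illustrative assumptions rather than the official implementation.

```python
# Sketch: compute an LLM's perplexity on train/test texts, then the DLT.
# Model name and truncation length are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM under test works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def perplexity(texts):
    """Exponentiated mean token-level NLL over a list of documents."""
    losses = []
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
        out = model(**enc, labels=enc["input_ids"])  # loss = mean token NLL
        losses.append(out.loss.item())
    return float(torch.exp(torch.tensor(losses).mean()))

ppl_on_train = perplexity(["<training set documents here>"])
ppl_on_test = perplexity(["<test set documents here>"])
dlt = ppl_on_test - ppl_on_train  # smaller DLT => higher cheating risk
print(f"DLT = {dlt:.3f}")
```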


In the detection process, we will calculate DLT values for several LLMs to establish a reference baseline for model cheating and to minimize the impact of generalization on the metric. The formula is as follows:
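```latex
\mathrm{DLT} = \mathrm{PPL}_{\mathrm{test}} - \mathrm{PPL}_{\mathrm{train}}
```

where PPL_test and PPL_train denote the ppl-on-test and ppl-on-train values defined above.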

[1] The FinBen: An Holistic Financial Benchmark for Large Language Models (https://arxiv.org/pdf/2402.12659)

[2] Fine-Grained Argument Understanding with BERT Ensemble Techniques: A Deep Dive into Financial Sentiment Analysis (https://aclanthology.org/2023.rocling-1.30.pdf)

[3] Trade the Event: Corporate Events Detection for News-Based Event-Driven Trading (https://arxiv.org/pdf/2105.12825)

[4] FinMem: A Performance-Enhanced LLM Trading Agent with Layered Memory and Character Design (https://arxiv.org/pdf/2311.13743)

[5] Skywork: A More Open Bilingual Foundation Model (https://arxiv.org/pdf/2310.19341)

Schedule

There will be two phases in the challenge. Phase 1 is the challenge stage, which includes registration opening, data release, submission of system outputs, cheating detection, and results release. Phase 2 is the shared task paper stage.





Submission site: https://forms.gle/GYZee2f6xPweeY2U7

Please note that submissions close at 23:59 (AoE) on May 29, 2024.


Leaderboard:  https://huggingface.co/spaces/TheFinAI/IJCAI-2024-FinLLM-Learderboard


Policies


Shared Task Organizers


Contact