Recently, large language models (LLMs) have been used as agents to explore and interact with an environment. Previous works such as LLMs as Tool Makers [2] and Voyager [1] aim to create an LLM that can query and manipulate objects in an environment and use high-level reasoning to guide its actions. Although a large amount of research has been done on integrating LLMs with environments, much work remains on teaching LLMs to decompose complex skills into simpler ones and to independently learn new knowledge outside the environment so that these skills can be reused in the target environment. Existing work in this direction includes Chain-of-Thought [3], Graph of Thoughts [4], and CodeChain [5].
In our work, we take advancements from the agentic LLM field and adapt them to a code base. By giving the LLM a problem set and defining a freeform action space in which the LLM can decompose the problem set into simpler skills, we hypothesise that the LLM will learn to modularly build up its skill set, chain actions to solve the simpler tasks, and thereby solve the more complicated problem set. Moreover, we propose a new framework using retrieval-augmented learning to guide the model to learn new skills even when not prompted to, allowing it to extract skills it finds relevant from prompts unrelated to the problem set and use these skills to solve the target problem.
This project takes reference from Voyager [1]. However, where Voyager explores a video-game world through a text API, our Chain of Action instead operates within a programming environment. In both cases, the reward system for training the model is clearly defined through successful task completion within an API or code framework.
The novelty of Chain of Action compared to Voyager lies in the idea of breaking down tasks into subtasks. Taking inspiration from Chain-of-Thought [3] and Graph of Thoughts [4], we utilise the idea that programming tasks can often be broken down into smaller subtasks, and develop a model that attempts to break down programming tasks into subtasks iteratively when they are not directly solvable using existing knowledge. A very recent paper published in October 2023, CodeChain [5], attempts a similar framework, but with a different approach to storing established solutions.
Similar to Voyager [1], we use ChromaDB to store established solutions. To reduce token usage, code is saved into individual .py files, and the corresponding file paths are saved to ChromaDB, embedded in a vector space that is a latent semantic representation of the text description associated with each code sample.
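As a rough illustration, the skill library can be maintained with only a few calls to ChromaDB's Python client. The collection name, directory layout, and helper names below are hypothetical placeholders rather than our exact configuration.

```python
import os
import chromadb

# Hypothetical sketch of the skill library; names and paths are placeholders.
client = chromadb.PersistentClient(path="./skill_library")
skills = client.get_or_create_collection(name="skills")

def add_skill(skill_id: str, description: str, code: str) -> None:
    """Save code to an individual .py file and embed its text description."""
    os.makedirs("skills", exist_ok=True)
    path = f"skills/{skill_id}.py"
    with open(path, "w") as f:
        f.write(code)
    # The description is what gets embedded; the file path travels as metadata.
    skills.add(ids=[skill_id], documents=[description], metadatas=[{"path": path}])

def retrieve_skills(query: str, k: int = 3) -> list[str]:
    """Return the code of the k skills whose descriptions are closest to the query."""
    result = skills.query(query_texts=[query], n_results=k)
    return [open(meta["path"]).read() for meta in result["metadatas"][0]]
```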
Fig. 1: Architecture of Chain of Action model.
Programming problems are posed to the model as text input describing the problem. The model then heuristically iterates through the following steps. First, the model attempts to zero-shot the problem using GPT-4's innate knowledge. Should this fail, GPT-4 decomposes the problem into subproblems, and each subproblem is passed to ChromaDB as a query. Established code is retrieved by minimising the distance between the semantic representation of the subproblem text and that of the code samples saved in the vector database. The retrieved code samples are concatenated and passed to GPT-4 as reference examples for solving the problem. New code that successfully solves a problem is then also embedded into ChromaDB. This process is visualised in Fig. 1.
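A minimal sketch of this loop is shown below. The helpers zero_shot_solve, decompose, and generate_with_examples stand in for our GPT-4 prompts, passes_tests stands in for the Python test harness, and the retry limit of three is an assumption for illustration; none of these are our exact implementation.

```python
def chain_of_action(problem: str, test_cases, max_tries: int = 3):
    """Hypothetical sketch of the Chain of Action loop described above."""
    # Step 1: attempt to zero-shot the problem with GPT-4's innate knowledge.
    code = zero_shot_solve(problem)                     # placeholder GPT-4 call
    if passes_tests(code, test_cases):
        return code

    for attempt in range(max_tries):
        # Step 2: decompose the problem into subproblems (rephrased each attempt).
        subproblems = decompose(problem, attempt)       # placeholder GPT-4 call
        # Step 3: retrieve established code for each subproblem from ChromaDB.
        examples = [snippet for sub in subproblems
                    for snippet in retrieve_skills(sub, k=1)]
        # Step 4: concatenate the retrieved code and pass it to GPT-4 as reference.
        code = generate_with_examples(problem, "\n\n".join(examples))
        if passes_tests(code, test_cases):
            # Step 5: embed the newly verified solution back into the skill library.
            add_skill(str(hash(problem)), problem, code)
            return code
    return None                                         # unsolved after max_tries
```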
The solutions generated are evaluated by running them in a Python environment.
The set of all Python3 problems was scraped from https://leetcode.com/. For each problem, the following data was also scraped: detailed categorical tags (e.g. whether the problem involves arrays, strings, hash tables, etc.), difficulty metrics, and the top 100 upvoted solutions. Each solution was separated into text and code.
Based on the tags, problems were found to be adequately described by the 7 clusters shown in Fig. 2.
Fig. 2b: Distribution of clusters in reduced dimensionality space. As shown, the 7 clusters are distinct.
Fig. 2c: k-means cluster loss against number of clusters. After n=7, the benefit of further clustering falls off.
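As a minimal sketch of how such a clustering analysis over the scraped tag sets might be set up (assuming one-hot encoded tag vectors and scikit-learn's k-means; the actual pipeline may differ):

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import MultiLabelBinarizer

def elbow_curve(problem_tags, max_k: int = 14):
    """Return (k, k-means loss) pairs over one-hot encoded tag sets.

    Hypothetical sketch: `problem_tags` is the scraped list of tag lists, one
    per problem. Plotting loss against k and looking for the elbow gives the
    cut-off used here (n = 7, as in Fig. 2c).
    """
    X = MultiLabelBinarizer().fit_transform(problem_tags)
    losses = []
    for k in range(2, max_k + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        losses.append((k, km.inertia_))
    return losses
```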
For the experiment, 201 problems were randomly chosen from this dataset for testing. For each test case, input values for evaluation were generated by GPT-4, and the corresponding output values were generated by passing these inputs through the code of the top upvoted solution for that problem, on the assumption that the top upvoted solution functions correctly.
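A simplified sketch of how these input/output labels can be used to check a candidate solution is shown below; the entry-point name and the bare use of exec are illustrative assumptions rather than our exact harness.

```python
from typing import Any, Callable

def load_solution(code: str, entry_point: str = "solve") -> Callable[..., Any]:
    """Execute candidate code in a fresh namespace and return its entry function.

    Hypothetical sketch: the real harness runs untrusted code in an isolated
    Python environment; the entry-point name is an assumption.
    """
    namespace: dict = {}
    exec(code, namespace)
    return namespace[entry_point]

def passes_tests(code: str, test_cases) -> bool:
    """Check candidate outputs against labels from the top upvoted solution.

    `test_cases` is a list of (inputs, expected_output) pairs, where inputs is
    a tuple of GPT-4-generated arguments.
    """
    try:
        fn = load_solution(code)
        return all(fn(*inputs) == expected for inputs, expected in test_cases)
    except Exception:
        return False
```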
We further select, independently and at random, another 5% of problems and take the top upvoted solution for each to initialise the database. For each of these problems, GPT-4 is used to generate a summarised description, and the semantic representation of this summarised description is used to embed the code in the vector database.
Since the Chain of Action (CoA) model uses baseline GPT-4 as its first step, and only invokes the novel technique when this first step fails, we do not need to test CoA and the baseline separately. Instead, we run CoA and log the success at each stage (‘oneshot’, ‘chromaDB’) as well as the number of times CoA re-attempts retrieval from ChromaDB (‘try_n’). The success rate for ‘oneshot’ is the effective baseline success rate, and the success rate for ‘chromaDB’ alone corresponds to the performance gain of CoA over the baseline. We apply this pipeline to all 201 problems in the test set using GPT-4 Turbo.
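For concreteness, per-problem results can be logged as simple JSON lines; the function below is a hypothetical sketch, with the record fields mirroring the stage labels above and the file name being a placeholder.

```python
import json

def log_outcome(problem_id: str, stage: str, try_n: int, passed: bool,
                logfile: str = "results.jsonl") -> None:
    """Append one record per problem; field names mirror the labels above.

    Hypothetical sketch: `stage` is 'oneshot' when the baseline zero-shot
    attempt succeeds and 'chromaDB' when the retrieval step is used; `try_n`
    counts the CoA re-attempts. The file name is a placeholder.
    """
    record = {"problem": problem_id, "stage": stage,
              "try_n": try_n, "passed": passed}
    with open(logfile, "a") as f:
        f.write(json.dumps(record) + "\n")
```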
The generated solutions were tested using the input/output labels for each test case. Additionally, two ablation studies were done:
CoA-no-chroma: the model breaks down problems into smaller steps, but does not retrieve reference data from ChromaDB.
CoA-no-rephrase: the model only attempts to break down problems into smaller steps once, and does not iteratively rephrase them.
Using the baseline model, i.e. GPT-4 with a Chain of Thought approach, we achieve success on 121 / 201 problems in the test set.
However, with CoA prompting, the success rate increases to 176 / 201, an improvement of 55 problems (roughly 27 percentage points) over the baseline.
Fig. 3. Success rates partitioned by question difficulties. Percentages shown are calculated by difficulty category, e.g. 66.3% of ‘Easy’ problems could be solved by the baseline model as ‘oneshot’.
We evaluate the improvement of CoA across the varying levels of difficulty (as labeled on Leetcode). Fig. 3 shows that, in every difficulty category, CoA was able to solve problems that the baseline could not.
We observe that no ‘Hard’ problems failed overall. However, given the very small number of ‘Hard’ samples in the dataset, this may be due to statistical variation.
Figure 4 shows the variation in performance across the different tags. We verify that the success rate is not affected by the number of problems carrying each tag by showing that the ratio of each outcome (baseline pass, CoA pass, CoA fail) to the total number of problems for each tag is approximately consistent, as shown in Fig. 5.
Fig. 4. Success rates partitioned by tags. Note that each problem may be tagged with more than one tag. Some tags were not present in the randomly-sampled test set, but we also note that the original dataset had large variance in tag counts.
Fig. 5. Outcome counts against total count, with each tag corresponding to one set of 3 data points. Since the outcome counts scale approximately linearly with the total count, we expect that the tag category and the number of problems per tag do not strongly influence outcomes.
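As a rough illustration of the linearity check behind Fig. 5, one can fit outcome counts against per-tag totals; the helper below is a hypothetical sketch and assumes the two input lists are aligned per tag.

```python
import numpy as np

def outcome_ratio_check(totals, outcome_counts):
    """Fit outcome count ~ slope * total across tags; report slope and R^2.

    Hypothetical sketch: `totals` holds the number of problems per tag and
    `outcome_counts` the count of one outcome (e.g. baseline pass) per tag.
    """
    x = np.asarray(totals, dtype=float)
    y = np.asarray(outcome_counts, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    r_squared = np.corrcoef(x, y)[0, 1] ** 2
    return slope, r_squared
```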
Fig. 6: Distribution of semantic distance from each problem to its nearest entry in the skill library, plotted against the number of entries in the skill library.
Voyager [1] demonstrated that their model showed significant improvements in performance over time as it added to its skill library. To better understand the impact of the skill library on the performance of our model, we track the distance from each query to its nearest solution in ChromaDB as the skill library grows.
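Tracking this only requires reading the distances ChromaDB returns with each query; the helper below is a hypothetical sketch that reuses the `skills` collection from the earlier snippet.

```python
def nearest_distance(skills, query: str):
    """Return (current skill library size, distance from query to nearest entry).

    Hypothetical sketch: ChromaDB returns distances alongside query results,
    so the quantity plotted in Fig. 6 can be logged directly at query time.
    """
    size = skills.count()
    if size == 0:
        return 0, float("inf")
    result = skills.query(query_texts=[query], n_results=1, include=["distances"])
    return size, result["distances"][0][0]
```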
This tracking is visualised in Fig. 6, which also encodes how many attempts were made to refine the prompt at each stage. For example, the leftmost data points in Fig. 6 show that two queries were made at the start: one went through three attempts and failed, while the other passed.
From this data, there is insufficient evidence that increasing the skill library size decreases the average distance to the nearest solution (Fig. 6). However, this may be due to the nature of the solution space, and further work on larger and more diverse test sets is needed. We suggest that future experiments use tasks that are too complex for the GPT-4 baseline to solve on its own, and which therefore require GPT-4 to make use of the subskills stored in the vector database.
Additionally, during these initial studies, the skill library was initialised with fully solved problems from the scraped Leetcode database. This is a realistic use case, as the model would add entire solved problems to the skill library as it continues to work on more problems. However, it also means that each entry in the initialised skill library comprises multiple sub-skills, so the semantic distance to a query, which comprises only one sub-skill, is not minimised. For future work, we wish to investigate how to automatically and accurately pre-process the initial entries for higher specificity.
Fig. 7: Distribution of mean distance to nearest solution in ChromaDB
Finally, from this data we cannot claim that a decrease in average distance to solutions results in the model being better able to generate solutions. We additionally verify this using the frequency distribution plot in Fig. 7, which shows no noticeable difference in the mean distance to the nearest solution in ChromaDB between problems that passed and problems that failed. We suggest that this may be due to the issues of specificity and complexity discussed above. However, we note that, from casual observation, the threshold between relevant and irrelevant retrievals is approximately 1.15, which suggests that the majority of retrieved skills are relevant.
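As a rough sketch, the comparison behind Fig. 7 can be summarised as follows; the record format and the application of the 1.15 cut-off here are assumptions for illustration.

```python
import statistics

def relevance_summary(records, threshold: float = 1.15):
    """Summarise per-problem mean distance to the nearest ChromaDB entry.

    Hypothetical sketch: each record is {'passed': bool, 'mean_distance': float},
    and 1.15 is the casually observed relevant/irrelevant cut-off noted above.
    """
    passed = [r["mean_distance"] for r in records if r["passed"]]
    failed = [r["mean_distance"] for r in records if not r["passed"]]
    return {
        "mean_distance_passed": statistics.mean(passed) if passed else None,
        "mean_distance_failed": statistics.mean(failed) if failed else None,
        "fraction_below_threshold":
            sum(r["mean_distance"] <= threshold for r in records) / len(records),
    }
```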
We repeat the experiment with the same parameters to test for reproducibility. The results are shown in Fig. 8.
Fig. 8: Contingency table comparing outcomes of initial and repeated experiment, showing number of attempts made by CoA and baseline
It is noteworthy that the overall proportions of baseline successes, CoA successes, and failures show almost no variation between the two runs of the experiment. However, among the test cases where the baseline fails but CoA succeeds, only half remain consistent between trials; the other half could be solved one-shot by the baseline model.
This suggests that at least some of the success of CoA should be attributed to forcing the baseline GPT-4 to rephrase or reconsider its mistakes, rather than to ChromaDB. This is consistent with the finding above that the skill library (vector database) has limited benefit.
It is not clear why the majority of cases solved through CoA succeed on the second attempt, while few are solved on the first or third attempt. This requires further investigation.
Fig. 9: Contingency table comparing outcomes of initial and repeated experiment, partitioned by cluster.
We break the test cases down by cluster into the following outcome categories:
"both_oneshot": Both experiments could one-shot the problem with baseline GPT-4
"coth_chroma": Both experiments required CoA
"both_fail": Both experiments failed even with CoA
"swapped": The test case required CoA in one experiment but was one-shot by baseline GPT-4 in the other experiment.
The outcomes are shown in Fig. 9. We perform a chi-square test for independence and find no evidence that the cluster of a test case affects the outcome of the experiment (p = 0.544).
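The test itself is a standard chi-square test of independence on the cluster-by-outcome contingency table; the counts below are illustrative placeholders, not the actual data behind Fig. 9.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative contingency table only: rows are clusters, columns are the four
# outcome categories (both_oneshot, both_chroma, both_fail, swapped).
table = np.array([
    [12, 5, 2, 3],
    [10, 4, 1, 2],
    [ 8, 6, 3, 1],
])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}, dof = {dof}")
```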
From this, we suggest that the variation between experiments can be attributed to GPT-4's inherent randomness: while this affects case-by-case results at the micro level, it does not seem to affect overall results. It is likely that, if the initial baseline GPT-4 one-shot attempt were allowed to repeat several times before turning to ChromaDB, the one-shot success rates would be significantly higher.
The experiments and previous ablation studies seem to suggest a limited impact of the skill library. To confirm this, we perform an ablation study without the skill library, iterating through the same set of problems. By tracking the performance on each problem between the original experiment and the ablation study, we can evaluate whether there is a difference between the outcomes of the two studies. The results are shown in Fig. 10.
Fig. 10: Contingency table comparing the performance of the model with and without the skill library.
From Fig. 10, we see that when the original experiment fails, it is almost certain that the ablation study will fail. However, a substantial number of test cases that failed the ablation study could be solved in the original experiment.
This suggests a significant performance difference when the skill library is omitted from the model, which is confirmed by a chi-square test of independence (p << 0.001).
We further observe that, similar to the case above, of the problems that were solved through CoA but not through the baseline in the original experiment, half could be solved in one try in the ablation study while the other half failed.
For completeness, we verified that the set of 19 problems that could be one-shot in the ablation study but required CoA in the original study is not identical to the set of problems that could be one-shot in the repeat study but required CoA in the original study.
Fig. 11: Comparison of success rates
We define rephrasing as the ability of CoA to try decomposing a problem into subtasks differently on each attempt.
As shown in Fig. 11, rephrasing significantly improves the performance of CoA. It is not clear why the proportion of oneshot_success decreases in the no-rephrasing case. This requires further investigation.
We had intended to run an experiment with an empty skill library. However, based on the outcomes of the main study and ablation studies above, there does not seem to be any benefit to testing with an empty skill library at this point in time, unless experiments are designed in which the skill library plays a stronger role in the outcome.
A novel architecture combining a semantic vector database and an agentic model was proposed, and it shows an improvement in performance over baseline GPT-4. However, ablation studies and closer inspection of the skill library suggest that the improvement may stem largely from (a) random variation in GPT-4 and (b) providing feedback to GPT-4 that the previous attempt did not work. We maintain that a skill library will be beneficial for solving novel problems, as shown by the ablation study without the skill library. However, in this study the problems posed to the model may have been too basic and coarse-grained for the skill library to be beneficial, limiting its impact.
We suggest that future experiments use tasks that are too complex for the GPT-4 baseline to solve on its own, yet still solvable by breaking them down into components. This would require GPT-4 to make use of the subskills stored in the vector database.
[1] Wang, Guanzhi, et al. "Voyager: An open-ended embodied agent with large language models." arXiv preprint arXiv:2305.16291 (2023).
[2] Cai, Tianle, et al. "Large language models as tool makers." arXiv preprint arXiv:2305.17126 (2023).
[3] Wei, Jason, et al. "Chain-of-thought prompting elicits reasoning in large language models." Advances in Neural Information Processing Systems 35 (2022): 24824-24837.
[4] Besta, Maciej, et al. "Graph of thoughts: Solving elaborate problems with large language models." arXiv preprint arXiv:2308.09687 (2023).
[5] Le, Hung, et al. "CodeChain: Towards modular code generation through chain of self-revisions with representative sub-modules." arXiv preprint arXiv:2310.08992 (2023).