LLM, VLM, Planning, Chain-of-Thought, Embodied Agent, Foundation Models, Embodied Reasoning, In-Context Learning, Zero-Shot, Semantic Reasoning, LoRA, Feedback, Reinforcement Learning, DPO, RAG, Memory, Action, Context-Awareness, Efficient-Tuning, Multi-Agents, Intelligence.
[17th, July 2024] Limitation of Prompt Engineering with Gemini-Flash in Embodied Planning from Ego-Centric Video
ICML 2024 WORKSHOP: Multi-modal Foundation Model meets Embodied AI
PDF: https://drive.google.com/file/d/13SImnJ96m8qxHlPPtTs4CjTWGRRgpIqZ/view?usp=sharing
I tackled the problem of predicting the next action to take from a first-person (egocentric) video, choosing one option from four candidates. This was part of a challenge held at the ICML 2024 workshop above. I experimented with various prompt-engineering techniques and explored the limitations of Gemini-Flash.
[20th, Mar 2024] Natural Language as Policies: Reasoning for Coordinate-Level Embodied Control with LLMs
Arxiv: https://arxiv.org/abs/2403.13801
Project page: https://natural-language-as-policies.github.io/
Supervisors: Andrew Melnik, Jun Miura, Ville Hautamäki
I conducted research on controlling a robotic arm with a Large Language Model (LLM). Instead of the traditional approach of simply calling predefined functions, I focused on expressing the robot's skills at the natural-language level to improve generalization performance. This approach enabled grounding the robot's control in natural language.
Skills: Python, LLaVA, ChatGPT, OpenAI, GPT-4, GPT-3, PyTorch, LLaVA-NeXT, Hugging Face.
[2020-2022] Time series forecast using Transformer architecture
I conducted time-series forecasting using the Transformer architecture, implemented in PyTorch. By optimizing how the attention mechanism was applied, I improved forecasting accuracy. Specifically, I designed the attention mechanism to capture relationships between multiple series by carefully applying it to multivariate time-series data. I also used MLflow to streamline experiment management.
Skills: Python, MLflow, Dash, PyTorch
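As an illustration of the attention design described above, here is a minimal sketch (in NumPy, on made-up toy data) of scaled dot-product attention over a multivariate series, where each time step attends to every other step. This is a simplified stand-in, not the project's actual model:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Attention over a (seq_len, d_model) sequence: each time step is
    re-expressed as a softmax-weighted mix of all time steps."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # (seq_len, seq_len) affinities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # rows sum to 1
    return weights @ v, weights

# Toy multivariate series: 8 time steps, 3 variables used directly as features
x = np.random.default_rng(0).normal(size=(8, 3))
out, w = scaled_dot_product_attention(x, x, x)
```

In a full Transformer, the series would first be projected into query/key/value spaces and split into heads; the weight matrix `w` shows how strongly each time step draws on the others.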
[2018-2020] Deep Reinforcement Learning on Educational Robot Arm
I implemented a simple framework for applying deep reinforcement learning to an educational robotic arm. Using a convolutional neural network (CNN), I captured scene information from camera images. The policy was trained to minimize the distance between the arm's destination and the target object. This work demonstrated the feasibility of the approach.
Skills: Python, OpenCV.
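The training signal described above can be sketched as a distance-based reward. This is a hypothetical minimal version for illustration, not the exact reward used in the project:

```python
import math

def distance_reward(destination, target):
    """Negative Euclidean distance: the reward is maximal (0) when the
    arm's destination coincides with the target object, so minimizing
    distance and maximizing reward are the same objective."""
    return -math.dist(destination, target)
```

An RL agent maximizing this reward is implicitly driven to move the arm toward the target.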
Research Interests and Goal:
My primary research focus is how Large Language Models (LLMs) and Vision-Language Models (VLMs) can be applied to enable robots and agents to perform reasoning and decision-making.
In my past work, I used LLMs to control robotic arms by translating high-level human instructions into low-level concrete commands, such as robotic-arm manipulations for tabletop tasks. This research aimed to reduce training costs and improve the generalization capabilities of LLMs in robotics. My research covers a wide range of tasks, from embodied tabletop robotic-arm manipulation to automating player actions in games and automating operations in PC applications for agents.
Ultimately, my research seeks to:
Bridge the gap between high-level reasoning and low-level execution in LLM/VLMs by smoothly & semantically connecting them.
Enable a multi-agent system where each LLM can play different roles by system prompting without large-scale finetuning (without losing generalizability).
Enable a step-by-step and more grounded reasoning process, unlike the traditional one-time transformation from human instruction to agent action. This means the agent can perform both algorithmic reasoning (repeating the same action many times, IF branching) and semantic reasoning efficiently and dynamically.
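The step-by-step reasoning loop above can be sketched as follows; the function names (`plan_step`, `execute`, `goal_reached`) are illustrative placeholders, not an existing API:

```python
def run_agent(plan_step, execute, goal_reached, max_steps=10):
    """Step-by-step loop instead of a one-shot instruction -> action mapping:
    the agent re-plans after every observation, so it can repeat actions
    (algorithmic loop) and branch on conditions (IF / semantic reasoning)."""
    obs = None
    for _ in range(max_steps):
        if goal_reached(obs):          # semantic check on the latest observation
            return True
        action = plan_step(obs)        # e.g. an LLM call conditioned on feedback
        obs = execute(action)          # grounded feedback from the environment
    return False
```

Because the observation re-enters the planner each step, failures can be corrected mid-task rather than only at the initial instruction.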
I pay attention to integrating traditional agentic concepts with LLM-based approaches. Here, "LLM-based approaches" include prompt engineering, chain-of-thought, in-context learning, DPO, RL, natural language processing, RAG, and others.
Current Research and PhD Research Plan:
Currently, I am working on evaluating and improving LLMs' ability to generate low-level commands for robots without requiring large-scale parameter updates. I am experimenting with both open-source and closed-source models, collecting data to analyze their performance across various environments. In these experiments, I especially aim to address the gap between high-level reasoning (e.g., INPUT: "I want to drink apple juice" > OUTPUT: "First, get an apple… then…") and low-level reasoning (e.g., INPUT: "First, get an apple." > OUTPUT: [0.3, 0.2, 0.6]). While LLMs excel at abstract tasks, they often struggle with the precision required for low-level commands, and my research focuses on overcoming these limitations. My hypothesis is that if a VLM has sufficient context awareness and low-level understanding, it can achieve strong generalizability on language-conditioned agentic/robotic tasks.
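To make the low-level side of this gap concrete, here is a hypothetical sketch of how a model's free-form reply could be turned into a machine-usable [x, y, z] command; the reply format and the regex are assumptions for illustration, not part of the actual system:

```python
import json
import re

def parse_coordinates(llm_output: str):
    """Extract the first [x, y, z] list from a model's free-form reply.
    Low-level commands must be machine-parseable, so the prompt would ask
    the model to include a JSON list of exactly three floats."""
    match = re.search(r"\[[^\]]*\]", llm_output)
    if match is None:
        raise ValueError("no coordinate list found in model output")
    coords = json.loads(match.group(0))
    if len(coords) != 3:
        raise ValueError("expected exactly three coordinates")
    return [float(c) for c in coords]

# e.g. a model reply such as "Sure! Move the gripper to [0.3, 0.2, 0.6]."
```

A high-level planner can stay in natural language while this thin layer enforces the precision the robot controller needs.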
For my PhD, I plan to expand this work by investigating how LLMs and VLMs can be applied in multi-agent systems. I hypothesize that LLMs can take on different roles, such as high-level/low-level planners, obstacle detectors, or vision experts, purely through the given system prompt, just as different parts of the human brain are responsible for different capabilities. Furthermore, my goal is to explore how these models can coordinate tasks across multiple agents in dynamic environments, using LLM-based methods such as reinforcement learning from human feedback (RLHF) and Direct Preference Optimization (DPO) to instill additional preferences and context awareness.
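A minimal sketch of the role-by-system-prompt idea, assuming a chat-completion-style message format; `RoleAgent` and the prompts below are illustrative placeholders, and no actual model call is made:

```python
from dataclasses import dataclass, field

@dataclass
class RoleAgent:
    """One LLM instance specialised purely through its system prompt,
    with no finetuning: the role lives entirely in the prompt."""
    role: str
    system_prompt: str
    history: list = field(default_factory=list)

    def build_messages(self, user_msg: str):
        # Chat-completion-style message list ready to send to a model API.
        self.history.append({"role": "user", "content": user_msg})
        return [{"role": "system", "content": self.system_prompt}, *self.history]

# The same base model plays two different roles in the system:
planner = RoleAgent("planner", "You decompose instructions into ordered sub-goals.")
controller = RoleAgent("controller", "You output one [x, y, z] target per sub-goal.")
```

Because specialisation is prompt-only, the base model keeps its general capabilities and roles can be added or swapped without retraining.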
Practical Experience:
In addition to my research, I am currently working as an AI/LLM engineer intern, where I solve natural language processing tasks using both traditional ML models and LLMs, on projects such as text summarization and news-article analysis.
When I was a bachelor's student, I worked at a small startup with a student team.
When I was a master’s student, I completed a three-month internship in Joensuu, Finland, where I built an object-detection model and a web application to demonstrate it to clients.
I believe these practical experiences will be very beneficial for collaborating with industry during my PhD and, eventually, for starting my own business.
Career Goals:
After obtaining my PhD, I plan to pursue a career in industry. I intend to gain practical experience while studying and, after graduation, apply both academic insights and practical skills to industry problems.