This project investigates image data fusion techniques and video analytics that combine image and track data from multiple sensors to achieve higher accuracy and more specific inferences than a single sensor could provide. Our aim is to explore state-of-the-art image processing and video analytics algorithms for effective enhancement, detection, tracking, and video summarization, as in:
Image-to-Text Generation for Active LLM
1. Motivation
- Current smart kiosk systems
Depend mainly on speech and touch input, without any visual information
Produce LLM responses of limited richness because speech is the only input
Operate passively, requiring the user to initiate interaction through the touchscreen
2. Research goal and issue
- Goal : Develop image-to-text conversion technology for an active LLM
- Issue
Current face detection methods have difficulty identifying users
Most image-to-text models require large computational resources
3. Approach
- User detection
Current face detection methods often overlook practical requirements, such as identifying users who actually intend to use the system
Develop criteria for identifying users based on face detection methods
- Image-to-text generation
Image captioning is the task of describing the overall content of an image in words
Scene graph generation, which captures the relationships between objects, is better suited to this purpose
Develop a scene graph generation method with a lightweight architecture
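To illustrate how scene graph output could be turned into text for an LLM, here is a minimal sketch. It assumes the scene graph is available as (subject, predicate, object) triples. The `Relation` type, the `scene_graph_to_text` function, and the output wording are hypothetical names chosen for illustration, not the project's actual implementation.

```python
from typing import Iterable, NamedTuple


class Relation(NamedTuple):
    """One edge of a scene graph (hypothetical representation)."""
    subject: str
    predicate: str
    obj: str


def scene_graph_to_text(relations: Iterable[Relation]) -> str:
    """Flatten scene graph triples into one line of text for an LLM prompt."""
    # Drop duplicate triples while preserving their original order.
    unique = list(dict.fromkeys(relations))
    if not unique:
        return "No relationships detected."
    phrases = [f"{r.subject} {r.predicate} {r.obj}" for r in unique]
    return "Scene: " + "; ".join(phrases) + "."


example = [
    Relation("person", "standing in front of", "kiosk"),
    Relation("person", "holding", "bag"),
    Relation("person", "holding", "bag"),  # duplicate, removed in the output
]
description = scene_graph_to_text(example)
print(description)
```

A flat triple list keeps the text short, which may suit a lightweight pipeline. A captioning model would instead produce free-form sentences.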
4. Result
- Face identification : Identify users by comparing face vectors extracted with a pre-trained model against pre-registered face vectors
- Facial Expression Recognition : Develop a lightweight visual emotion recognition model. Emotion detection performance is suboptimal when the user is seen from the side
- Face Engagement : Engagement estimation is an essential preprocessing step for understanding user emotions. Engagement is determined using key points.
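The face identification and engagement steps above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the project's method. Cosine similarity, the 0.6 threshold, the eye and nose key points, and the 0.25 offset limit are all assumptions. The function names `identify`, `yaw_proxy`, and `is_engaged` are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence, Tuple

Point = Tuple[float, float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify(query: Sequence[float],
             registered: Dict[str, Sequence[float]],
             threshold: float = 0.6) -> Optional[str]:
    """Return the best-matching registered user, or None below the threshold."""
    best_id, best_score = None, -1.0
    for user_id, vector in registered.items():
        score = cosine_similarity(query, vector)
        if score > best_score:
            best_id, best_score = user_id, score
    # The threshold value is an assumption; it would need tuning on real data.
    return best_id if best_score >= threshold else None


def yaw_proxy(left_eye: Point, right_eye: Point, nose: Point) -> float:
    """Horizontal nose offset from the eye midpoint, scaled by eye distance."""
    eye_distance = math.dist(left_eye, right_eye)
    if eye_distance == 0:
        raise ValueError("eye key points coincide")
    mid_x = (left_eye[0] + right_eye[0]) / 2
    return (nose[0] - mid_x) / eye_distance


def is_engaged(left_eye: Point, right_eye: Point, nose: Point,
               max_offset: float = 0.25) -> bool:
    """Assumed heuristic: a roughly frontal face counts as engaged."""
    return abs(yaw_proxy(left_eye, right_eye, nose)) <= max_offset


registered_users = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
print(identify([0.9, 0.1, 0.0], registered_users))
print(is_engaged((30, 40), (70, 40), (50, 60)))
```

Under this heuristic a face seen from the side is reported as not engaged, so such frames could be skipped before emotion recognition, where side views are noted as a weak point.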