Attention Neural Baby Talk:

Captioning of Risk Factors While Driving

Abstract

Driving involves various risk factors, including the possibility of traffic accidents involving pedestrians and/or oncoming vehicles. A driver assistance system that aims to prevent traffic accidents must be able to get the driver’s attention. A practical way to attract that attention is to generate captions from in-vehicle images. Although a number of approaches for caption generation with deep neural networks have been proposed, they are inadequate for the specific risk factors encountered while driving, because conventional captioning methods focus on the entirety of an image rather than on these factors. To tackle this problem, we first created a dataset for attention attraction that considers risk factors during driving. Furthermore, we propose an image captioning method for the assistance system. Our method is based on neural baby talk (NBT) and introduces an attention mask focusing on risk factors in an image. The mask enables our model to generate a caption for each factor. Experimental results on our dataset show that our method can generate captions suitable for attention attraction.

Architecture & Auto annotation system

In this paper, we propose two NBT-based approaches to achieve image captioning suitable for getting the driver’s attention. We first explain a rule-based automatic annotation method for creating a dataset for attention-attracting image captioning during driving. We then introduce an image captioning method that adds an attention mask to the NBT model. Our methods can address the problems of the conventional captioning method and generate captions that focus on risk factors.
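As a rough illustration of the attention-mask idea described above, the following Python sketch builds a binary mask from one bounding box and uses it to suppress feature-map cells outside that region. It is not the authors' implementation: the function names `box_attention_mask` and `apply_mask`, the box layout, and the use of a hard binary mask on plain nested lists are all assumptions made for illustration.

```python
def box_attention_mask(height, width, box):
    """Return a height x width mask that is 1.0 inside `box`, else 0.0.

    `box` is (x_min, y_min, x_max, y_max) in feature-map cells, with the
    max edges exclusive. This layout is an assumption for illustration.
    """
    x_min, y_min, x_max, y_max = box
    return [
        [1.0 if (y_min <= y < y_max and x_min <= x < x_max) else 0.0
         for x in range(width)]
        for y in range(height)
    ]


def apply_mask(features, mask):
    """Multiply every channel of `features` element-wise by `mask`.

    `features` is a list of channels, each a height x width nested list.
    Cells outside the masked region become 0.0, so a downstream caption
    decoder (not shown) would only see the selected risk factor.
    """
    return [
        [[value * m for value, m in zip(row, mask_row)]
         for row, mask_row in zip(channel, mask)]
        for channel in features
    ]


# Hypothetical 1-channel, 4 x 6 feature map and one risk-factor box.
features = [[[2.0] * 6 for _ in range(4)]]
mask = box_attention_mask(4, 6, box=(1, 1, 3, 3))
masked = apply_mask(features, mask)
```

Building one mask per detected risk factor and running the captioner once per mask would yield one caption per factor, which is the behavior the text describes; how the mask actually enters the NBT model is not specified here.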

Our attention mechanism

Our annotation system

This figure shows examples of captions generated by the conventional NBT model and by our method. On the left of the figure, the proposed method generates “There is a person on the sidewalk nearby to the right.” for a woman crossing a road as the result of priority 1. This caption is suitable because it includes words that indicate the appropriate classification, distance, and position. In contrast, the conventional NBT generates “A street with a lot of traffic on it.”, which considers the entirety of the scene but is inadequate for attention attraction. These results demonstrate that our method can generate a caption suitable for getting the driver’s attention, enabling improved safety.
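The caption structure discussed above (classification, distance, and position, ranked by priority) can be sketched as a simple rule-based template. This is a hypothetical illustration, not the authors' annotation rules: the `Detection` fields, the 10 m distance threshold, the left/right split at thirds of the image width, and ranking by distance are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str         # object class, e.g. "person"
    location: str      # surface the object is on, e.g. "sidewalk"
    distance_m: float  # estimated distance in meters (assumed input)
    x_center: float    # horizontal box center, normalized to [0, 1]


NEAR_THRESHOLD_M = 10.0  # assumed cut-off between "nearby" and "far away"


def distance_word(distance_m):
    """Map a distance in meters to a coarse distance word."""
    return "nearby" if distance_m < NEAR_THRESHOLD_M else "far away"


def position_phrase(x_center):
    """Map a normalized horizontal position to a direction phrase."""
    if x_center < 1 / 3:
        return "to the left"
    if x_center > 2 / 3:
        return "to the right"
    return "ahead"


def make_caption(det):
    """Fill the template with classification, distance, and position."""
    return (f"There is a {det.label} on the {det.location} "
            f"{distance_word(det.distance_m)} {position_phrase(det.x_center)}.")


def rank_by_priority(detections):
    """Order detections so the closest object gets priority 1 (assumed rule)."""
    return sorted(detections, key=lambda det: det.distance_m)


detections = [
    Detection("car", "road", 40.0, 0.5),
    Detection("person", "sidewalk", 5.0, 0.8),
]
ranked = rank_by_priority(detections)
captions = [make_caption(det) for det in ranked]
```

With these assumed inputs, the first caption reproduces the example sentence quoted above for the nearby person on the right.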

BibTeX

@inproceedings{Mori2019,
author = {Yuki Mori and Hiroshi Fukui and Tsubasa Hirakawa and Jo Nishiyama and Takayoshi Yamashita and Hironobu Fujiyoshi},
booktitle = {IEEE International Conference on Intelligent Transportation Systems},
title = {{Attention Neural Baby Talk: Captioning of Risk Factors While Driving}},
year = {2019}
}