Paper: 2018, Vol. 36, Issue(3): 522-527
Cite this article:
Qu Shiru, Xi Yuling, Ding Songtao. Image Caption Description of Traffic Scene Based on Deep Learning[J]. Journal of Northwestern Polytechnical University, 2018, 36(3): 522-527

Image Caption Description of Traffic Scene Based on Deep Learning
Qu Shiru, Xi Yuling, Ding Songtao
School of Automation, Northwestern Polytechnical University, Xi'an 710072, China
Abstract:
Accurately describing complex traffic scenes has long been a difficult problem in computer vision. Traffic scenes are complex and changeable, so image captioning is easily disturbed by illumination changes, object occlusion and other factors. To address this problem, we propose an image caption generation method for traffic scenes based on an attention mechanism. A convolutional neural network (CNN) is combined with a recurrent neural network (RNN) to produce end-to-end descriptions of traffic images. Because traffic scenes contain many kinds of targets, an attention mechanism is introduced into the language model so that the generated descriptions are clearly discriminative. To validate the effectiveness of the method, experiments are conducted on the Flickr8K, Flickr30K and MS COCO benchmark datasets. Under different evaluation metrics, the accuracy is improved by 8.6%, 12.4%, 19.3% and 21.5% respectively. Qualitative analysis further shows that the algorithm is robust in four kinds of complex traffic scenes: illumination change, abnormal weather conditions, salient road targets and multiple types of vehicles.
Key words:    intelligent transportation    deep learning    neural network    image captioning    attention mechanism
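
The abstract describes a CNN encoder feeding an RNN language model that attends over image regions while generating each word, in the spirit of Show, Attend and Tell (Xu et al.). No code accompanies this page, so the following is only a minimal PyTorch sketch of that kind of architecture; the ResNet-101 backbone, the 512-dimensional embedding and hidden sizes, and the additive soft-attention form are illustrative assumptions, not the authors' reported configuration.

# Minimal sketch (not the authors' code) of a CNN + RNN captioning model with soft attention.
import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """CNN encoder: extracts a grid of feature vectors (annotation vectors) from an image."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet101(weights=None)                        # backbone choice is an assumption
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])   # keep conv feature maps only

    def forward(self, images):                                         # images: (B, 3, H, W)
        feats = self.backbone(images)                                  # (B, 2048, h, w)
        B, C, h, w = feats.shape
        return feats.view(B, C, h * w).permute(0, 2, 1)                # (B, h*w, 2048)

class AttentionDecoder(nn.Module):
    """LSTM decoder with additive (soft) attention over the encoder's spatial features."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):                                # feats: (B, L, feat_dim)
        B, L, _ = feats.shape
        h = feats.new_zeros(B, self.lstm.hidden_size)
        c = feats.new_zeros(B, self.lstm.hidden_size)
        emb = self.embed(captions)                                     # (B, T, embed_dim)
        logits = []
        for t in range(captions.size(1)):
            # attention weights over the L spatial locations, conditioned on the previous hidden state
            e = self.att_out(torch.tanh(self.att_feat(feats) + self.att_hid(h).unsqueeze(1)))
            alpha = torch.softmax(e, dim=1)                            # (B, L, 1)
            context = (alpha * feats).sum(dim=1)                       # (B, feat_dim) attended context
            h, c = self.lstm(torch.cat([emb[:, t], context], dim=1), (h, c))
            logits.append(self.fc(h))                                  # word distribution at step t
        return torch.stack(logits, dim=1)                              # (B, T, vocab_size)

During training such a model is typically fed ground-truth caption tokens (teacher forcing) and optimised with cross-entropy over the predicted word distributions; at test time the decoder generates one word at a time by greedy or beam search.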
Received: 2017-04-02     Revised:
About the author: Qu Shiru (born 1963), female, professor at Northwestern Polytechnical University; her research interests include image processing and analysis, and traffic information and control.
