LI Lin, ZHANG Shengbing, WU Juan. Design of Deep Learning VLIW Processor for Image Recognition[J]. Journal of Northwestern Polytechnical University

Design of Deep Learning VLIW Processor for Image Recognition
LI Lin1,2, ZHANG Shengbing1, WU Juan3
1. School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an 710072, China;
2. Fourth Design Department, Beijing Institute of Microelectronics Technology, Beijing 100076, China;
3. School of Animation and Software, Xi'an Vocational and Technical College, Xi'an 710077, China
To meet the aerospace sector's demand for high-resolution image recognition and efficient local processing, and to address the insufficient computational parallelism of existing designs, an extensible multiprocessor-cluster deep learning processor architecture based on very long instruction word (VLIW) is designed, building on optimizations of the computation in each layer of deep convolutional neural network models. The design combines parallel processing of feature maps and neurons, VLIW-based instruction-level parallelism, data-level parallelism across multiple processor clusters, and pipelining. Test results on an FPGA prototype system show that the processor can effectively perform image classification and object detection. The peak performance of the processor reaches 128 GOP/s at an operating frequency of 200 MHz. For the selected benchmarks, the processor is at least 12 times as fast as the CPU and 7 times as fast as the GPU. Compared with the results obtained from the software framework, the average error in the processor's test accuracy does not exceed 1%.
Key words:    image recognition    deep learning    convolutional neural networks    very long instruction word (VLIW)    processor    extensible
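As a rough sanity check on the peak-performance figure in the abstract, the sketch below converts 128 GOP/s at 200 MHz into operations per clock cycle. The two input figures come from the abstract; the convention that one multiply-accumulate (MAC) counts as two operations is an assumption made here for illustration, and the function names are hypothetical rather than taken from the paper.

```python
# Back-of-the-envelope check of the peak-throughput figure quoted in the abstract.
PEAK_OPS_PER_SECOND = 128e9   # 128 GOP/s, from the abstract
CLOCK_HZ = 200e6              # 200 MHz, from the abstract
OPS_PER_MAC = 2               # ASSUMPTION: one multiply-accumulate = 2 operations


def ops_per_cycle(peak_ops_per_second: float, clock_hz: float) -> float:
    """Operations that must complete in each clock cycle to hit the peak rate."""
    return peak_ops_per_second / clock_hz


def macs_per_cycle(peak_ops_per_second: float, clock_hz: float,
                   ops_per_mac: int = OPS_PER_MAC) -> float:
    """Equivalent MACs per cycle under the ops_per_mac assumption."""
    return ops_per_cycle(peak_ops_per_second, clock_hz) / ops_per_mac


if __name__ == "__main__":
    print(ops_per_cycle(PEAK_OPS_PER_SECOND, CLOCK_HZ))   # 640.0
    print(macs_per_cycle(PEAK_OPS_PER_SECOND, CLOCK_HZ))  # 320.0
```

Under that assumption, the quoted peak corresponds to 640 operations, or 320 MACs, per cycle in total. How those are distributed across processor clusters and VLIW issue slots is not stated in the abstract.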
Received: 2019-01-08
DOI: 10.1051/jnwpu/20203810216
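About the author: LI Lin (born 1982), PhD candidate at Northwestern Polytechnical University; research focuses on microprocessor architecture and integrated circuit design.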
