Article: 2022, Vol. 40, Issue 2: 344-351
Cite this article:
CHANG Libo, ZHANG Shengbing. A reconfigurable processor for mixed-precision CNNs on FPGA[J]. Journal of Northwestern Polytechnical University, 2022, 40(2): 344-351

A reconfigurable processor for mixed-precision CNNs on FPGA
CHANG Libo 1,2, ZHANG Shengbing 1
1. School of Computer Science, Northwestern Polytechnical University, Xi'an 710072, China;
2. School of Electronic Engineering, Xi'an University of Posts and Telecommunications, Xi'an 710121, China
Abstract:
To solve the problem of the low computing efficiency of existing convolutional neural network (CNN) accelerators, caused by their inability to adapt to the computing modes and memory-access characteristics of mixed-precision quantized CNN models, we propose a reconfigurable CNN processor consisting of a reconfigurable computing unit, an elastic on-chip cache unit and a macro-dataflow instruction set. The multi-core computing array is reconfigured according to the structure of the CNN model and the constraints of the reconfigurable resources, which improves the utilization of computing resources. The elastic on-chip buffer, together with a tile-based dynamic buffer-partitioning strategy and dynamically configured access addresses, improves the reuse of on-chip data. The macro-dataflow instruction set architecture (mISA) fully expresses the characteristics of mixed-precision CNN models and of the reconfigurable processor, which reduces the complexity of mapping CNNs with different network structures and computing modes onto the processor. Implemented on the Ultra96-V2 FPGA, the processor achieves throughputs of 216.6 GOPS on VGG-16 and 214 GOPS on ResNet-50, with computing efficiencies of 0.63 and 0.64 GOPS/DSP, respectively, exceeding those of CNN accelerators based on fixed bit-widths. On the ZCU102 FPGA, the throughput on ResNet-50 reaches 931.8 GOPS with a computing efficiency of 0.40 GOPS/DSP, up to 55.4% higher throughput and 100% higher computing efficiency than state-of-the-art CNN accelerators.
Key words:    mixed-precision quantization    convolutional neural network accelerator    reconfigurable computing   
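To relate the two metrics reported in the abstract, the short Python sketch below (ours, not from the paper) computes the DSP budget implied by each throughput/efficiency pair, using the definition efficiency [GOPS/DSP] = throughput [GOPS] / DSP slices used. The device capacities mentioned in the comments are the nominal DSP counts of the two boards and are an assumption added here for context, not a figure from the paper.

def implied_dsps(throughput_gops: float, efficiency_gops_per_dsp: float) -> float:
    """Number of DSP slices implied by a reported throughput/efficiency pair."""
    return throughput_gops / efficiency_gops_per_dsp

# Figures reported in the abstract: (platform, model, GOPS, GOPS/DSP).
results = [
    ("Ultra96-V2", "VGG-16",    216.6, 0.63),
    ("Ultra96-V2", "ResNet-50", 214.0, 0.64),
    ("ZCU102",     "ResNet-50", 931.8, 0.40),
]

for platform, model, gops, eff in results:
    print(f"{model} on {platform}: {gops} GOPS at {eff} GOPS/DSP "
          f"-> ~{implied_dsps(gops, eff):.0f} DSPs")

The implied budgets (roughly 344 and 334 DSPs on Ultra96-V2, about 2330 on ZCU102) sit just under the nominal capacities of the two devices (360 and 2520 DSP48 slices, respectively, if the standard ZU3EG/ZU9EG figures apply), which is consistent with the efficiency being normalized to the DSPs actually consumed.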
Received: 2021-07-15     Revised:
DOI: 10.1051/jnwpu/20224020344
Funding: supported by the National Key R&D Program of China (2019YFB1803600) and the Open Fund of the China Civil Aviation Airworthiness Certification Center (SH2021111903)
Corresponding author: ZHANG Shengbing (1968-), professor at Northwestern Polytechnical University. His research focuses on microprocessor architecture. E-mail: zhangsb@nwpu.edu.cn
About the author: CHANG Libo (1985-), PhD candidate at Northwestern Polytechnical University. His research focuses on microprocessor architecture and deep learning accelerator design.
