Article: 2022, Vol. 40, Issue 6: 1261-1268
Cite this article:
ZHANG Wei, ZHANG Haochen. A feature extraction method for small sample data based on optimal ensemble random forest[J]. Journal of Northwestern Polytechnical University, 2022, 40(6): 1261-1268

A feature extraction method for small sample data based on optimal ensemble random forest
ZHANG Wei, ZHANG Haochen
School of Mechanical Engineering, Northwestern Polytechnical University, Xi'an 710072, China
Abstract:
High-dimensional small-sample data is a difficult case for data mining. When the traditional random forest algorithm is used for feature selection on such data, the classifier overfits easily, which makes the feature importance ranking unstable and inaccurate. To address these difficulties in dimensionality reduction of small-sample data with random forests, a feature extraction algorithm for small-sample data, OTE-GWRFFS, is proposed. First, the samples are expanded with a generative adversarial network (GAN) to avoid the overfitting of the traditional random forest in small-sample classification. Then, on the expanded data, a weight-based ensemble of optimal trees algorithm is adopted to reduce the influence of distribution error in the generated data on feature extraction accuracy and to improve the overall stability of the decision tree ensemble. Finally, the feature importance ranking is obtained as a weighted average of the per-tree feature importance measures, using the weights of the individual decision trees, which addresses the low accuracy and poor stability of feature selection on small-sample data. In comparison experiments on UCI datasets against the traditional random forest and a weight-based random forest algorithm, OTE-GWRFFS shows higher stability and accuracy on high-dimensional small-sample data.
Key words:    high dimensional small sample data    ensemble of optimal trees    random forest    feature extraction    data expansion
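The tree selection and weighted ranking steps summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: each tree's weight is taken to be its validation accuracy, the "optimal" trees are simply the top fraction by that accuracy, and the function names (`select_top_trees`, `weighted_importance`, `rank_features`) and the `keep_fraction` parameter are hypothetical. The GAN-based sample expansion is not shown.

```python
from typing import List, Sequence


def select_top_trees(accuracies: Sequence[float], keep_fraction: float) -> List[int]:
    """Return indices of the best-scoring trees, highest accuracy first.

    Assumption: a tree's quality is a single validation accuracy.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, int(len(accuracies) * keep_fraction))
    ranked = sorted(range(len(accuracies)), key=lambda i: accuracies[i], reverse=True)
    return ranked[:n_keep]


def weighted_importance(
    accuracies: Sequence[float],
    importances: Sequence[Sequence[float]],
    keep_fraction: float = 0.5,
) -> List[float]:
    """Accuracy-weighted average of per-tree feature importances.

    importances[i][j] is the importance tree i assigns to feature j.
    Only the trees kept by select_top_trees contribute.
    """
    kept = select_top_trees(accuracies, keep_fraction)
    total_weight = sum(accuracies[i] for i in kept)
    if total_weight == 0:
        raise ValueError("selected trees have zero total weight")
    n_features = len(importances[0])
    return [
        sum(accuracies[i] * importances[i][j] for i in kept) / total_weight
        for j in range(n_features)
    ]


def rank_features(scores: Sequence[float]) -> List[int]:
    """Feature indices sorted from most to least important."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


# Toy inputs (made up for illustration): 4 trees, 3 features.
tree_accuracies = [0.9, 0.6, 0.8, 0.5]
tree_importances = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.5, 0.4, 0.1],
    [0.2, 0.2, 0.6],
]
scores = weighted_importance(tree_accuracies, tree_importances, keep_fraction=0.5)
ranking = rank_features(scores)
```

With these toy inputs only the two most accurate trees contribute, so the low-accuracy trees that favor the third feature do not affect the final ranking. A real implementation would obtain the accuracies and importances from trees trained on the expanded data.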
Received: 2022-03-07
DOI: 10.1051/jnwpu/20224061261
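About the author: ZHANG Wei (born 1970) is an associate professor at Northwestern Polytechnical University. His research focuses on intelligent manufacturing and manufacturing data analysis. E-mail: zhangw@nwpu.edu.cn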
