Article: 2015, Vol. 33, Issue 6: 1055-1061
Cite this article:
Liu Yingchun, Chen Meiling. Random Forest Method and Application in Stream Big Data Systems[J]. Journal of Northwestern Polytechnical University

Random Forest Method and Application in Stream Big Data Systems
Liu Yingchun, Chen Meiling
School of Economics and Management, Beihang University, Beijing 100191, China
Abstract:
Big data analysis under the stream computing paradigm remains an open problem, with few research results and little practical experience to draw on. Random forest is one of the most widely applied classification algorithms, but in stream computing scenarios the real-time, volatile, and unordered nature of the data causes its accuracy to degrade gradually over time. To address this problem, this paper analyzes the characteristics of the random forest algorithm and proposes pruning the forest according to the accuracy of its individual decision trees. To adapt to changes in the data, it combines this pruning with the concept of margin and proposes a method for generating, validating, and adding new decision trees. The result is a random forest that updates itself continuously as the data change and meets the requirements that stream big data environments place on the algorithm. The feasibility of the improved method is verified on real data, showing that it achieves higher classification accuracy in a real stream big data scenario. The paper closes with a discussion of how the random forest method can be further studied and improved.
Key words:    decision tree    random forest    big data    stream computing    social network    search engine    classifier    pruning    customer rating    distributed system   
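The update cycle summarized in the abstract (prune by per-tree accuracy, then generate, validate, and add new trees using a margin criterion) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Stump` class is a one-split stand-in for a full decision tree, and the names `update_forest`, `make_batch`, `min_acc`, `min_margin`, and `size`, as well as the threshold values and the synthetic drifting data, are all assumptions made for illustration. The margin used here is the usual vote-share margin (share of votes for the true class minus the largest share for any other class), which may differ from the paper's exact definition.

```python
import random
from collections import Counter


class Stump:
    """One-split tree on a randomly chosen feature (hypothetical stand-in for a full tree)."""

    def __init__(self, X, y, rng):
        self.feature = rng.randrange(len(X[0]))
        column = [row[self.feature] for row in X]
        self.threshold = sum(column) / len(column)  # split at the feature mean
        default = Counter(y).most_common(1)[0][0]
        sides = {True: [], False: []}
        for row, label in zip(X, y):
            sides[row[self.feature] <= self.threshold].append(label)
        # Each side predicts its majority label (overall majority if the side is empty).
        self.leaf = {
            side: Counter(labels).most_common(1)[0][0] if labels else default
            for side, labels in sides.items()
        }

    def predict(self, row):
        return self.leaf[row[self.feature] <= self.threshold]


def accuracy(tree, X, y):
    """Fraction of samples the tree classifies correctly."""
    return sum(tree.predict(row) == label for row, label in zip(X, y)) / len(y)


def mean_margin(trees, X, y):
    """Average of (vote share for the true class - largest vote share for another class)."""
    total = 0.0
    for row, label in zip(X, y):
        votes = Counter(tree.predict(row) for tree in trees)
        best_wrong = max((n for cls, n in votes.items() if cls != label), default=0)
        total += (votes[label] - best_wrong) / len(trees)
    return total / len(y)


def update_forest(trees, train, val, rng, size=15, min_acc=0.6, min_margin=0.0):
    """Prune weak trees, then generate, validate, and add replacements (assumed thresholds)."""
    (X_tr, y_tr), (X_val, y_val) = train, val
    # Step 1: prune trees whose accuracy on the newest held-out data is too low.
    forest = [t for t in trees if accuracy(t, X_val, y_val) >= min_acc]
    # Step 2: grow candidates on bootstrap samples of the newest training data.
    for _ in range(10 * size):  # bounded number of attempts
        if len(forest) >= size:
            break
        idx = [rng.randrange(len(X_tr)) for _ in X_tr]
        cand = Stump([X_tr[i] for i in idx], [y_tr[i] for i in idx], rng)
        # Step 3: accept a candidate only if it is accurate enough on held-out data
        # and the enlarged forest's mean margin stays at or above the floor.
        if (accuracy(cand, X_val, y_val) >= min_acc
                and mean_margin(forest + [cand], X_val, y_val) >= min_margin):
            forest.append(cand)
    return forest


def make_batch(rng, n, flipped):
    """Synthetic batch: feature 0 decides the label, feature 1 is noise; `flipped` simulates drift."""
    X = [[rng.random(), rng.random()] for _ in range(n)]
    y = [int((row[0] > 0.5) != flipped) for row in X]
    return X, y


rng = random.Random(0)
forest = []
for flipped in (False, False, True, True):  # the concept flips halfway through the stream
    train, val = make_batch(rng, 200, flipped), make_batch(rng, 100, flipped)
    forest = update_forest(forest, train, val, rng)
    print(flipped, len(forest), round(mean_margin(forest, *val), 3) if forest else None)
```

In this sketch, trees learned before the simulated drift fall below `min_acc` on the new batch and are pruned, and candidates trained on the new batch take their place. A real system would presumably use full decision trees and thresholds tuned to the data.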
Received: 2015-04-24
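About the author: Liu Yingchun (b. 1980), female, Ph.D. candidate at Beihang University; her main research interests are big data and distributed systems.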