论文:2018,Vol:36,Issue(3):480-486
引用本文:
周渭博, 钟勇, 王阳. MapReduce模型中基于直方图的数据均衡算法[J]. 西北工业大学学报
Zhou Weibo, Zhong Yong, Wang Yang. Data Balance Algorithm Based on Histogram in MapReduce[J]. Northwestern polytechnical university

MapReduce模型中基于直方图的数据均衡算法
周渭博1,2,3, 钟勇1, 王阳1,2
1. 中国科学院成都计算机 应用研究所, 四川 成都 610041;
2. 中国科学院大学, 北京 100049;
3. 成都崇信大数据服务有限公司, 四川 成都 611230
摘要:
MapReduce模型是一种典型的分布式计算模型,被广泛应用于大规模数据处理,其性能很大程度上依赖于数据分布状态。由于数据内容往往都是不均衡的,再加上存储的随机性,因此MapReduce模型在计算过程中容易出现数据倾斜的问题。针对该问题,通过改进的基于MapReduce的数据直方图并行构建算法,对数据块和整个文件分别建立数据直方图,根据数据块分布情况,判断每个存储节点的数据倾斜程度,并定义了文件均衡偏差值作为数据倾斜的度量标准,进而通过数据均衡算法来降低文件均衡偏差值。改进的基于MapReduce的数据直方图并行构建算法能够适应各种类型的数据应用场景,直方图构建过程中Map端向Reduce端只需要传输直方图统计信息,不需要传输文件内容,数据传输量几乎可以忽略不计;基于直方图的数据均衡算法采用了贪心策略,可以获得均衡分布最优解的一个比较好的近似解,经过不同数据多次实验验证,该算法与随机block分布算法相比,可以降低40%左右的文件均衡偏差值,具有更好的数据均衡效果。
关键词:    直方图    并行算法    数据倾斜    数据块    数据均衡    约束优化    实验设计   
Data Balance Algorithm Based on Histogram in MapReduce
Zhou Weibo1,2,3, Zhong Yong1, Wang Yang1,2
1. Chengdu Institute of Computer applications, Chinese Academy of Sciences, Chengdu 610041, China;
2. University of Chinese Academy of Sciences, Beijing 100049, China;
3. Chengdu ChongXin Big Data Services Co., Ltd. Chengdu 611230, China
Abstract:
MapReduce model is a typical distributed computing model, which is widely used in large-scale data processing, and its performance depends largely on the data distribution status. As the data content is often unbalanced, coupled with the storage of randomness, so MapReduce model prone to data skew problem in the calculation process. In order to solve this problem, this paper establishes a data histogram for the data block and the whole file through the improved parallel histogram parallelization algorithm based on MapReduce. According to the data block distribution, we can judge the data skew degree of each storage nodes and define the file equilibrium deviation value as the measure of data skew, and then the data balance algorithm is used to reduce the file equilibrium deviation value. The improved MapReduce-based data histogram parallel construction algorithm can adapt to various types of data application scenarios. In the process of building the histogram, the Map side only needs to transmit histogram statistics to the Reduce side without transmitting the contents of the file. The data transfer can be almost negligible. The data balance algorithm based on histogram employs greedy strategy, which can obtain a better approximate solution of the optimal solution of equilibrium distribution. After several experiments, compared with the random block distribution algorithm, the improved algorithm reduce about 40% of the file balance deviation value and achieves a better data balance performance.
Key words:    histogram    parallel algorithm    data skew    block    data balance    constrained optimization    design of experiments   
收稿日期: 2017-04-10     修回日期:
DOI:
基金项目: 四川省科技支撑计划(2014GZ0013)、四川省科技支撑计划(2015GZ0088)与四川省青年软件创新工程(2014-063)资助
通讯作者:     Email:
作者简介: 周渭博(1981-),中国科学院博士研究生、高级工程师,主要从事大数据理论与应用研究。
相关功能
PDF(1438KB) Free
打印本文
把本文推荐给朋友
作者相关文章
周渭博  在本刊中的所有文章
钟勇  在本刊中的所有文章
王阳  在本刊中的所有文章

参考文献:
[1] 陈吉荣,乐嘉锦. 基于Hadoop生态系统的大数据解决方案综述[J]. 计算机工程与科学, 2013, 35(10):25-35 Chen Jirong, Le Jiajin. Reviewing the Big Data Solution Based on Hadoop Ecosystem[J]. Computer Engineering and Science, 2013, 35(10):25-35(in Chinese)
[2] 李建江, 崔健, 王聃, 等. MapReduce并行编程模型研究综述[J]. 电子学报, 2011, 39(11):2635-2642 Li Jianjiang, Cui Jian, Wang Dan, et al. Survey of MapReduce Parallel Programming Model[J]. Acta Electronica Sinica, 2011, 39(11):2635-2642(in Chinese)
[3] 王刚, 李盛恩. MapReduce中数据倾斜解决方法的研究[J]. 计算机技术与发展, 2016, 26(9):201-204 Wang Gang, Li Sheng'en. Research on Handling Data Skew in MapReduce[J]. Computer Technology and Development, 2016, 26(9):201-204(in Chinese)
[4] Blanas S, Patel J M, Ercegovac V, et al. A Comparison of Join Algorithms for Log Processing in MapReduce[C]//Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, New York, 2010:975-986
[5] Zhang Chi, Li Feifei, Jestes J. Efficient Parallel KNN Joins for Large Data in MapReduce[C]//Proceedings of the 15th International Conference on Extending Database Technology, New York, 2012:38-49
[6] Jestes J, Yi Ke, Li Feifei. Building Wavelet Histograms on Large Data in MapReduce[C]//Proceedings of the Vldb Endowment, 2011:109-120
[7] Shi Yingjie, Meng Xiaofeng, Wang Fusheng, et al. HEDC++:an Extended Histogram Estimator for Data in the Cloud[J]. Journal of Computer Science and Technology, 2013, 28(6):973-988
[8] Tang Mingwang. Efficient and Scalable Monitoring and Summarization of Large Probabilistic Data[C]//Proceedings of the 2013 SIGMOD/PODS Ph D Symposium, New York, 2013:61-66
[9] Kwon Y, Balazinska M, Howe B, et al. A Study of Skew in MapReduce Applications[C]//Open Cirrus Summit, Moscow, Russia, 2011
[10] Gufler B, Augsten N, Reiser A, et al. Load Balancing in MapReduce Based on Scalable Cardinality Estimates[C]//Proceedings of the 2012 IEEE 28th International Conference on Data Engineering(ICDE), Washington, USA, 2012:522-533
[11] Gufler B, Augsten N, Reiser A, et al. Handing Data Skew in MapReduce[C]//Proceedings of the 1st International Conference on Cloud Computing and Services Science, Noordwijkerhout, Netherlands, 2011, 146:574-583
[12] Kolb L, Thor A, Tahm E. Block-Based Load Balancing for Entity Resolution with MapReduce[C]//Proceeding of the 20th ACM International Conference on Information and Knowledge Management Glasgow, UK, 2011:2397-2400
[13] Kolb L, Thor A, Tahm E. Load Balancing for MapReduce-Based Entity Resolution[C]//Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, Washington, USA, 2012:618-629
[14] 王卓, 陈群, 李战怀,等. 基于增量式分区策略的MapReduce数据均衡方法[J]. 计算机学报, 2016(1):19-35 Wang Zhuo, Chen Qun, Li Zhanhuai, et al. An Incremental Partitioning Strategy for Data Balance on MapReduce[J]. Chinese Journal of Computers, 2016(1):19-35(in Chinese)