Paper: 2014, Vol. 32, Issue 4: 658-663
Cite this article:
Wang Lifang, Zhang Zhike, Jiang Zejun, Cai Xiaobin, Peng Chengzhang. A Novel Data Routing Strategy Based on Directories for Deduplication Clusters[J]. Northwestern Polytechnical University

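A Data Routing Strategy Based on File Paths for Deduplication Clusters
Wang Lifang1, Zhang Zhike2, Jiang Zejun1, Cai Xiaobin1, Peng Chengzhang1
1. School of Computer Science, Northwestern Polytechnical University, Xi'an, Shaanxi 710072;
2. State Grid Henan Electric Power Company, Zhengzhou, Henan 450052
Abstract:
Deduplication clusters are an effective way to meet the ever-growing demand for massive data backup. Their key problem is the data routing strategy, that is, how to distribute data sensibly across the nodes of the cluster. Current data routing strategies compute the target node from the minimum chunk signature of a file or data segment, and are referred to as MCS (minimum chunk signature) data routing strategies. When the cluster is small, the storage usage of this approach is close to that of single-node deduplication. When the cluster is large, however, its storage usage is far worse than that of single-node deduplication. To reduce the storage usage of deduplication clusters, a path-based data routing strategy for deduplication clusters is proposed, called DRSD (data routing strategy based on directories). Experimental results show that, for the various node counts tested, the deduplication ratio of DRSD is clearly higher than that of MCS and close to that of single-node deduplication. With 64 nodes, the deduplication ratio of DRSD is 35% higher than that of MCS.
Keywords:    deduplication cluster; stateless data routing algorithm; file path; storage usage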
A Novel Data Routing Strategy Based on Directories for Deduplication Clusters
Wang Lifang1, Zhang Zhike2, Jiang Zejun1, Cai Xiaobin1, Peng Chengzhang1
1. Department of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an 710072, China;
2. State Grid Henan Electric Power Company, Zhengzhou 450052, China
Abstract:
A deduplication cluster is an effective way to meet growing, massive data backup requirements. Its key problem is the data routing strategy, that is, how to distribute data among the nodes of the cluster. Existing data routing strategies use the MCS (Minimum Chunk Signature) of a file or data segment to compute the target node. When the deduplication cluster is small, the storage usage of MCS approaches that of single-node deduplication. However, when the cluster is large, its storage usage is much worse than that of single-node deduplication. To reduce the storage usage of the deduplication cluster, we propose a novel data routing strategy that uses directories, which we call DRSD (Data Routing Strategy Based on Directories). Experimental results and their analysis show preliminarily that, for various numbers of nodes in the deduplication cluster, the deduplication ratios obtained with DRSD are much better than those obtained with MCS and even approach those obtained with single-node deduplication. When the number of nodes is 64, the deduplication ratio obtained with DRSD is 35% better than that obtained with MCS.
Key words:    calculations; cluster computing; data transfer; design; efficiency; experiments; hard disk storage; routing algorithms; simulators; software architecture; deduplication cluster; directory; stateless data routing algorithm; storage utilization
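
The two routing ideas named in the abstract can be contrasted with a minimal sketch. The abstract does not give the actual DRSD rule, so `route_by_directory` below is only a hypothetical stand-in (it hashes the parent directory of a file), and the use of SHA-1, the function names, and the `dedup_ratio` helper are all assumptions made for illustration, not the authors' implementation.

```python
import hashlib
import posixpath


def chunk_signature(chunk: bytes) -> int:
    """SHA-1 digest of a chunk, read as an unsigned integer (assumed hash)."""
    return int.from_bytes(hashlib.sha1(chunk).digest(), "big")


def route_mcs(chunks: list[bytes], num_nodes: int) -> int:
    """Stateless MCS-style routing: the smallest chunk signature picks the node."""
    return min(chunk_signature(c) for c in chunks) % num_nodes


def route_by_directory(path: str, num_nodes: int) -> int:
    """Hypothetical directory routing: hash the file's parent directory.

    Illustration only; the paper's DRSD rule is not stated in the abstract.
    """
    parent = posixpath.dirname(path)
    digest = hashlib.sha1(parent.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_nodes


def dedup_ratio(routed: list[tuple[int, list[bytes]]]) -> float:
    """Logical bytes divided by stored bytes when each node deduplicates alone."""
    stored: dict[int, set[bytes]] = {}
    logical = 0
    physical = 0
    for node, chunks in routed:
        seen = stored.setdefault(node, set())
        for c in chunks:
            logical += len(c)
            if c not in seen:  # a chunk is stored once per node
                seen.add(c)
                physical += len(c)
    return logical / physical
```

Under these assumptions, files in the same directory always land on the same node, so their shared chunks are stored once. Identical chunks routed to different nodes are stored once per node, which is the effect that lowers the cluster-wide deduplication ratio as the node count grows.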
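Received: 2013-11-09     Revised:
DOI:
Funding: Supported by the National Natural Science Foundation of China (61373120) and the Aeronautical Science Foundation of China (2012ZC53040)
Corresponding author:     Email:
About the author: Wang Lifang (born 1964), female, professor at Northwestern Polytechnical University; her research focuses on cloud storage and cloud computing.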
