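Operating System Fault Self-Recovery Technique Based on Compensation and Rollback -- Journal of Northwestern Polytechnical University, 2015, 33(5):709-715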

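	Paper: 2015, Vol. 33, Issue 5: 709-715
	Cite this article:
	Zhu Yian, Shi Jialong. Operating System Fault Self-Recovery Technique Based on Compensation and Rollback[J]. Journal of Northwestern Polytechnical University
	Zhu Yian, Shi Jialong. A New Operating System Fault Recovery Technique Based on Kernel Compensation and Process State Roll-Back[J]. Journal of Northwestern Polytechnical University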

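Operating System Fault Self-Recovery Technique Based on Compensation and Rollback

Zhu Yian, Shi Jialong

School of Computer Science, Northwestern Polytechnical University, Xi'an, Shaanxi 710072, China

Abstract:

Operating system faults can be divided by their propagation characteristics into two classes, process-local and kernel-global, which corrupt process-local data and kernel global state respectively. Existing techniques recover from process-local data errors by restarting the system or the faulty process, but they do not address inconsistency in the kernel global state and therefore cannot guarantee recovery from kernel-global faults. To address this problem, a fault self-recovery technique based on compensation and rollback is proposed. The technique monitors calls to kernel global methods and, once the process-local data has been correctly restored, applies compensation operations to repair the inconsistent kernel global state. This contains fault propagation and reduces the impact of a single-point fault. In addition, the technique is implemented as a kernel module and requires no modification to the target operating system, so it can be extended and ported easily. Fault injection experiments show that, while keeping the system functioning normally, the technique recovers effectively from 91.6% of faults and introduces little system overhead.

Keywords: operating system; kernel compensation; process state rollback; fault self-recovery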
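
The two-step recovery described in the abstract above (roll back process-local data, then compensate for changes to kernel global state) can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the authors' kernel module: `Kernel`, `Process`, `lock`, `alloc_pages`, and `recover` are hypothetical names invented for the illustration, and real kernel state is far richer than a lock set and a page counter.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Kernel:
    """Toy stand-in for kernel global state shared by all processes."""
    held_locks: Set[str] = field(default_factory=set)
    allocated_pages: int = 0


@dataclass
class Process:
    """Toy process: local data, a checkpoint, and a log of compensations."""
    local: Dict[str, int] = field(default_factory=dict)
    checkpoint: Dict[str, int] = field(default_factory=dict)
    undo_log: List[Callable[[], None]] = field(default_factory=list)

    def take_checkpoint(self) -> None:
        # Save process-local data and start a fresh compensation log.
        self.checkpoint = dict(self.local)
        self.undo_log.clear()


def lock(kernel: Kernel, proc: Process, name: str) -> None:
    # Monitored global call: apply it, then record how to compensate for it.
    kernel.held_locks.add(name)
    proc.undo_log.append(lambda: kernel.held_locks.discard(name))


def alloc_pages(kernel: Kernel, proc: Process, n: int) -> None:
    # Monitored global call with a counter-style compensation.
    kernel.allocated_pages += n

    def undo() -> None:
        kernel.allocated_pages -= n

    proc.undo_log.append(undo)


def recover(proc: Process) -> None:
    # Step 1: roll process-local data back to the last checkpoint.
    proc.local = dict(proc.checkpoint)
    # Step 2: run compensations in reverse order to repair global state.
    while proc.undo_log:
        proc.undo_log.pop()()


def demo():
    kernel = Kernel()
    proc = Process(local={"counter": 0})
    proc.take_checkpoint()
    try:
        lock(kernel, proc, "inode_lock")
        alloc_pages(kernel, proc, 4)
        proc.local["counter"] = 99
        raise RuntimeError("injected fault")  # simulated fault
    except RuntimeError:
        recover(proc)
    return kernel, proc
```

Running the compensations in reverse order is a design choice of this sketch: later global calls may depend on earlier ones, so undoing them last-in, first-out avoids leaving an intermediate state that never existed during normal execution.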

A New Operating System Fault Recovery Technique Based on Kernel Compensation and Process State Roll-Back

Zhu Yian, Shi Jialong

Department of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an 710072, China

Abstract:

Sections 1 through 4 of the full paper present and evaluate a new fault recovery technique based on kernel compensation and process state roll-back. The core ideas are as follows: (1) past research on operating system fault recovery focuses mainly on the data loss caused by process-local faults and neglects the global state inconsistency caused by kernel-global faults; we propose a new fault recovery technique, based on a kernel compensation and process state roll-back model, that can minimize the propagation of faults and ensure the consistency of global state; the technique is implemented as a loadable kernel module, which makes its functionality easy to extend; (2) section 2 presents the design of the kernel compensation and process state roll-back model; (3) section 3 presents the implementation details of the technique in the Linux operating system; (4) the evaluation results presented in section 4, together with their analysis, give preliminary evidence of the effectiveness of the proposed technique.

Key words: adaptive algorithms; approximation algorithms; backstepping; conception design; cost functions; computer simulation; computer software design; dynamic models; efficiency; embedded software; embedded systems; estimation; failure modes; fault detection; fault tolerance; global optimization; intelligent systems; mathematical models; models; motion compensation; real time control; reliability analysis; safety engineering; software reliability; fault recovery; kernel compensation; operating system; process state roll-back

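Received: 2015-03-12 Revised:

DOI:

Funding: Supported by the Aerospace Support Technology Fund (2013-HT-XGD(10)), the Shaanxi Province Science and Technology Research and Development Program (2014K05-25 and 2015GY035), and the Aeronautical Science Foundation (20130753006).

Corresponding author: Email:

About the author: Zhu Yian (born 1961) is a professor at Northwestern Polytechnical University whose research covers high-performance computing, cloud computing, and pervasive computing.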



