{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,1]],"date-time":"2025-02-01T05:21:27Z","timestamp":1738387287488,"version":"3.35.0"},"reference-count":24,"publisher":"Wiley","issue":"6","license":[{"start":{"date-parts":[[2008,8,7]],"date-time":"2008-08-07T00:00:00Z","timestamp":1218067200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/summer-heart-0930.chufeiyun1688.workers.dev:443\/http\/onlinelibrary.wiley.com\/termsAndConditions#vor"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Concurrency and Computation"],"published-print":{"date-parts":[[2009,4,25]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>In the rollback recovery of large\u2010scale long\u2010running applications in a distributed environment, pessimistic message logging protocols enable failed processes to recover independently, though at the expense of logging every message synchronously during fault\u2010free execution. In contrast, coordinated checkpointing protocols avoid message logging, but they are poor in scalability with a sharply increased coordinating overhead as the system grows. With the aim of achieving efficient rollback recovery by trading off logging overhead and coordinating overhead, this paper suggests a partitioning of the system into clusters, and then presents a scheme to implement the conversion between these overheads. Using the proposed conversion, coordination can be introduced to reduce the unbearable logging overhead found in some systems, whereas proper logging can be employed to alleviate the unacceptable coordinating overhead in others. Furthermore, heuristics are introduced to address the issue of how to partition the system into clusters in order to speed up the recovery process and to improve recovery efficiency. Performance evaluation results indicate that our scheme can lower the overall system overhead effectively. Copyright \u00a9 2008 John Wiley &amp; Sons, Ltd.<\/jats:p>","DOI":"10.1002\/cpe.1364","type":"journal-article","created":{"date-parts":[[2008,8,7]],"date-time":"2008-08-07T15:23:11Z","timestamp":1218122591000},"page":"819-853","source":"Crossref","is-referenced-by-count":4,"title":["Trading off logging overhead and coordinating overhead to achieve efficient rollback recovery"],"prefix":"10.1002","volume":"21","author":[{"given":"Jin\u2010Min","family":"Yang","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kin Fun","family":"Li","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wen\u2010Wei","family":"Li","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Da\u2010Fang","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"311","published-online":{"date-parts":[[2009,3,18]]},"reference":[{"volume-title":"The Grid: Blueprint for a New Computing Infrastructure","year":"1999","author":"Foster L","key":"e_1_2_8_2_2"},{"key":"e_1_2_8_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/568522.568525"},{"key":"e_1_2_8_4_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0304-3975(02)00566-2"},{"key":"e_1_2_8_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.1987.232562"},{"key":"e_1_2_8_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/32.666828"},{"key":"e_1_2_8_7_2","unstructured":"HuangY WangYM.Why optimistic message logging has not been used in telecommunication systems. Twenty\u2010fifth International Symposium on Fault\u2010tolerant Computing (FTCS\u201025) Pasadena CA U.S.A. 1995;459\u2013463."},{"key":"e_1_2_8_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/69.842260"},{"key":"e_1_2_8_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/TDSC.2004.15"},{"key":"e_1_2_8_10_2","doi-asserted-by":"crossref","unstructured":"SilvaLM SilvaJG.The performance of coordinated and independent checkpointing. Thirteenth International Symposium on Parallel Processing and the 10th Symposium on Parallel and Distributed Processing San Juan Puerto Rico 1999;280\u2013284.","DOI":"10.1109\/IPPS.1999.760487"},{"key":"e_1_2_8_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/359545.359563"},{"key":"e_1_2_8_12_2","first-page":"223","article-title":"Efficient distributed recovery using message logging","author":"Sistla P","year":"1989","journal-title":"ACM Symposium on Principles of Distributed Computing"},{"key":"e_1_2_8_13_2","doi-asserted-by":"crossref","unstructured":"LoweryA RussellJR GoldberyAP.Optimistic failure recovery for very large networks. Symposium on Reliable Distributed Systems Pisa Italy 1991;66\u201375.","DOI":"10.1109\/RELDIS.1991.145407"},{"key":"e_1_2_8_14_2","unstructured":"VaidyaNH.Distributed recovery units: A hybrid and adaptive distributed recovery. Technical Report 93\u2010052 Department of Computer Science Texas A&M University 1993."},{"key":"e_1_2_8_15_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.737"},{"key":"e_1_2_8_16_2","doi-asserted-by":"crossref","unstructured":"MonnetS MorinC BadrinathR.Hybrid checkpointing for parallel applications in cluster federations. 2004 IEEE International Symposium on Cluster Computing and the Grid (CCGrid'04) Chicago IL U.S.A. 2004;773\u2013782.","DOI":"10.1109\/CCGrid.2004.1336712"},{"key":"e_1_2_8_17_2","doi-asserted-by":"crossref","unstructured":"YangJM LiKF ZhangDF.A coarse\u2010grained pessimistic message logging scheme for improving rollback recovery efficiency. Third IEEE International Symposium on Dependable Autonomic and Secure Computing (DASC'07) Columbia MD U.S.A. 2007.","DOI":"10.1109\/DASC.2007.19"},{"key":"e_1_2_8_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/12.689645"},{"issue":"11","key":"e_1_2_8_19_2","doi-asserted-by":"crossref","first-page":"1570","DOI":"10.1006\/jpdc.2001.1757","article-title":"Processor allocation and checkpoint interval selection in cluster computing systems","volume":"61","author":"Plank JS","year":"2001","journal-title":"Journal of Parallel and Distributed Computing"},{"key":"e_1_2_8_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/12.609281"},{"key":"e_1_2_8_21_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2004.01.003"},{"key":"e_1_2_8_22_2","unstructured":"YangJM ZhangDF YangXD.WINDAR: A multithreaded rollback\u2010recovery toolkit on Windows. Tenth IEEE Pacific Rim Dependable Computing International Symposium (PRDC10) Papeete Tahiti French Polynesia 2004;395\u2013400."},{"key":"e_1_2_8_23_2","doi-asserted-by":"crossref","unstructured":"CandeaG CutlerJ FoxA DoshiR GargP GowdaR.Reducing recovery time in a small recursively restartable system. 2002 International Conference on Dependable Systems and Networks (DSN'02) 2002;605\u2013614.","DOI":"10.1109\/DSN.2002.1029006"},{"key":"e_1_2_8_24_2","doi-asserted-by":"crossref","unstructured":"PlankJS ElwasifWR.Experimental assessment of workstation failures and their impact on checkpointing systems. Twenty\u2010eighth International Symposium on Fault\u2010tolerant Computing Munich Germany 1998;48\u201357.","DOI":"10.1109\/FTCS.1998.689454"},{"key":"e_1_2_8_25_2","doi-asserted-by":"publisher","DOI":"10.1002\/spe.771"}],"container-title":["Concurrency and Computation: Practice and Experience"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/summer-heart-0930.chufeiyun1688.workers.dev:443\/https\/api.wiley.com\/onlinelibrary\/tdm\/v1\/articles\/10.1002%2Fcpe.1364","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/summer-heart-0930.chufeiyun1688.workers.dev:443\/https\/onlinelibrary.wiley.com\/doi\/pdf\/10.1002\/cpe.1364","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,1,31]],"date-time":"2025-01-31T10:50:38Z","timestamp":1738320638000},"score":1,"resource":{"primary":{"URL":"https:\/\/summer-heart-0930.chufeiyun1688.workers.dev:443\/https\/onlinelibrary.wiley.com\/doi\/10.1002\/cpe.1364"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2009,3,18]]},"references-count":24,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2009,4,25]]}},"alternative-id":["10.1002\/cpe.1364"],"URL":"https:\/\/summer-heart-0930.chufeiyun1688.workers.dev:443\/https\/doi.org\/10.1002\/cpe.1364","archive":["Portico"],"relation":{},"ISSN":["1532-0626","1532-0634"],"issn-type":[{"type":"print","value":"1532-0626"},{"type":"electronic","value":"1532-0634"}],"subject":[],"published":{"date-parts":[[2009,3,18]]}}}