Table of Contents
Preface vii
Acknowledgment xx
Chapter I Dependability and Fault-Tolerance: Basic Concepts and Terminology 1
Introduction 1
Dependability, Resilient Computing and Fault-Tolerance 1
Fault-Tolerance, Redundancy and Complexity 15
Conclusion 16
References 16
Endnotes 19
Chapter II Fault-Tolerant Software: Basic Concepts and Terminology 21
Introduction and Objectives 21
What is a Fault-Tolerant Program? 22
Dependable Services: The System Model 24
Dependable Services: The Fault Model 27
(In)Famous Accidents 28
Software Fault-Tolerance 34
Software Fault-Tolerance in the Application Layer 36
Strategies, Problems and Key Properties 37
Some Widely Used Software Fault-Tolerance Provisions 39
Conclusion 48
References 49
Endnotes 52
Chapter III Fault-Tolerant Protocols Using Single- and Multiple-Version Software Fault-Tolerance 53
Introduction and Objectives 53
Fault-Tolerant Protocols Using Single- and Multiple-Version Software Fault-Tolerance 54
The EFTOS Tools: The EFTOS Voting Farm 68
The EFTOS Tools: The Watchdog Timer 79
The EFTOS Tools: The EFTOS Trap Handler 81
The EFTOS Tools: Atomic Actions 85
The TIRAN Data Stabilizing Software Tool 90
An Approach to Express Recovery Blocks: The Recovery Meta-Program 104
A Hybrid Case: The RAFTNET Library for Dependable Farmer-Worker Parallel Applications 106
Conclusion 124
References 125
Endnotes 131
Chapter IV Fault-Tolerant Protocols Using Compilers and Translators 133
Introduction and Objectives 133
Fault-Tolerant Protocols Using Compilers and Translators 134
An Example: Reflective and Refractive Variables 141
Adaptive Data Integrity Through Dynamically Redundant Data Structures 150
Conclusion 155
References 156
Chapter V Fault-Tolerant Protocols Using Fault-Tolerance Programming Languages 161
Introduction and Objectives 161
Fault-Tolerant Protocols Using Custom Programming Languages 161
Conclusion 172
References 172
Chapter VI The Recovery Language Approach 175
Introduction and Objectives 175
The Ariel Recovery Language 176
A Distributed Architecture Based on the Recovery Language Approach 191
Integrating Recovery Strategies into a Primary Substation Automation System 224
Summary and Lessons Learned 234
Conclusion 235
References 235
Endnotes 239
Chapter VII Fault-Tolerant Protocols Using Aspect Orientation 242
Introduction and Objectives 242
Fault-Tolerant Protocols Through Aspect Orientation 243
Conclusion 247
References 248
Endnote 249
Chapter VIII Failure Detection Protocols in the Application Layer 250
Introduction and Objectives 250
Failure Detection Protocols in the Application Layer 251
Conclusion 271
References 272
Endnotes 274
Chapter IX Hybrid Approaches 275
Introduction and Objectives 275
A Dependable Parallel Processing Model Based on Generative Communication and Recovery Languages 276
Enhancing a TIRAN Dependable Mechanism 282
Composing Dependable Mechanisms: The Redundant Watchdog 290
Cactus 296
Conclusion 297
References 298
Endnotes 300
Chapter X Measuring and Assessing Tools 301
Introduction and Objectives 301
Reliability Analysis of the TIRAN Distributed Voting Mechanism 301
Performance Analysis of Redundant Variables 305
A Tool for Monitoring and Fault Injection 310
Conclusion 321
References 321
Endnotes 322
Appendix 324
Chapter XI Conclusion 326
An Introduction and Some Conclusions 326
Appendix The Ariel Internals 328
References 349
Endnote 349
About the Author 350
Index 352