Engineering a Safer World

Engineering Systems
Editorial Board: Joel Moses (Chair), Richard de Neufville, Manuel Heitor, Granger Morgan, Elisabeth Paté-Cornell, William Rouse

Flexibility in Engineering Design, by Richard de Neufville and Stefan Scholtes, 2011
Engineering a Safer World, by Nancy G. Leveson, 2011
Engineering Systems, by Olivier L. de Weck, Daniel Roos, and Christopher L. Magee, 2011

ENGINEERING A SAFER WORLD
Systems Thinking Applied to Safety
Nancy G. Leveson
The MIT Press
Cambridge, Massachusetts
London, England

© 2011 Massachusetts Institute of Technology
All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.
For information about special quantity discounts, please email special_sales@mitpress.mit.edu
This book was set in Syntax and Times Roman by Toppan Best-set Premedia Limited. Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data
Leveson, Nancy.
Engineering a safer world : systems thinking applied to safety / Nancy G. Leveson.
p. cm. - (Engineering systems)
Includes bibliographical references and index.
ISBN 978-0-262-01662-9 (hardcover : alk. paper)
1. Industrial safety. 2. System safety. I. Title.
T55.L466 2012
620.8′6-dc23
2011014046
10 9 8 7 6 5 4 3 2 1

We pretend that technology, our technology, is something of a life force, a will, and a thrust of its own, on which we can blame all, with which we can explain all, and in the end by means of which we can excuse ourselves.
- T. Cuyler Young, Man in Nature

To all the great engineers who taught me system safety engineering, particularly Grady Lee who believed in me. Also to those who created the early foundations for applying systems thinking to safety, including C. O.
Miller and the other American aerospace engineers who created System Safety in the United States, as well as Jens Rasmussen’s pioneering work in Europe.

Contents

Series Foreword
Preface

I FOUNDATIONS
1 Why Do We Need Something Different?
2 Questioning the Foundations of Traditional Safety Engineering
2.1 Confusing Safety with Reliability
2.2 Modeling Accident Causation as Event Chains
2.2.1 Direct Causality
2.2.2 Subjectivity in Selecting Events
2.2.3 Subjectivity in Selecting the Chaining Conditions
2.2.4 Discounting Systemic Factors
2.2.5 Including Systems Factors in Accident Models
2.3 Limitations of Probabilistic Risk Assessment
2.4 The Role of Operators in Accidents
2.4.1 Do Operators Cause Most Accidents?
2.4.2 Hindsight Bias
2.4.3 The Impact of System Design on Human Error
2.4.4 The Role of Mental Models
2.4.5 An Alternative View of Human Error
2.5 The Role of Software in Accidents
2.6 Static versus Dynamic Views of Systems
2.7 The Focus on Determining Blame
2.8 Goals for a New Accident Model
3 Systems Theory and Its Relationship to Safety
3.1 An Introduction to Systems Theory
3.2 Emergence and Hierarchy
3.3 Communication and Control
3.4 Using Systems Theory to Understand Accidents
3.5 Systems Engineering and Safety
3.6 Building Safety into the System Design

II STAMP: AN ACCIDENT MODEL BASED ON SYSTEMS THEORY
4 A Systems-Theoretic View of Causality
4.1 Safety Constraints
4.2 The Hierarchical Safety Control Structure
4.3 Process Models
4.4 STAMP
4.5 A General Classification of Accident Causes
4.5.1 Controller Operation
4.5.2 Actuators and Controlled Processes
4.5.3 Coordination and Communication among Controllers and Decision Makers
4.5.4 Context and Environment
4.6 Applying the New Model
5 A Friendly Fire Accident
5.1 Background
5.2 The Hierarchical Safety Control Structure to Prevent Friendly Fire Accidents
5.3 The Accident Analysis Using STAMP
5.3.1 Proximate Events
5.3.2 Physical Process Failures and Dysfunctional Interactions
5.3.3 The Controllers of the Aircraft and Weapons
5.3.4 The ACE and Mission Director
5.3.5 The AWACS Operators
5.3.6 The Higher Levels of Control
5.4 Conclusions from the Friendly Fire Example

III USING STAMP
6 Engineering and Operating Safer Systems Using STAMP
6.1 Why Are Safety Efforts Sometimes Not Cost-Effective?
6.2 The Role of System Engineering in Safety
6.3 A System Safety Engineering Process
6.3.1 Management
6.3.2 Engineering Development
6.3.3 Operations
7 Fundamentals
7.1 Defining Accidents and Unacceptable Losses
7.2 System Hazards
7.2.1 Drawing the System Boundaries
7.2.2 Identifying the High-Level System Hazards
7.3 System Safety Requirements and Constraints
7.4 The Safety Control Structure
7.4.1 The Safety Control Structure for a Technical System
7.4.2 Safety Control Structures in Social Systems
8 STPA: A New Hazard Analysis Technique
8.1 Goals for a New Hazard Analysis Technique
8.2 The STPA Process
8.3 Identifying Potentially Hazardous Control Actions (Step 1)
8.4 Determining How Unsafe Control Actions Could Occur (Step 2)
8.4.1 Identifying Causal Scenarios
8.4.2 Considering the Degradation of Controls over Time
8.5 Human Controllers
8.6 Using STPA on Organizational Components of the Safety Control Structure
8.6.1 Programmatic and Organizational Risk Analysis
8.6.2 Gap Analysis
8.6.3 Hazard Analysis to Identify Organizational and Programmatic Risks
8.6.4 Use of the Analysis and Potential Extensions
8.6.5 Comparisons with Traditional Programmatic Risk Analysis Techniques
8.7 Reengineering a Sociotechnical System: Pharmaceutical Safety and the Vioxx Tragedy
8.7.1 The Events Surrounding the Approval and Withdrawal of Vioxx
8.7.2 Analysis of the Vioxx Case
8.8 Comparison of STPA with Traditional Hazard Analysis Techniques
8.9 Summary
9 Safety-Guided Design
9.1 The Safety-Guided Design Process
9.2 An Example of Safety-Guided Design for an Industrial Robot
9.3 Designing for Safety
9.3.1 Controlled Process and Physical Component Design
9.3.2 Functional Design of the Control Algorithm
9.4 Special Considerations in Designing for Human Controllers
9.4.1 Easy but Ineffective Approaches
9.4.2 The Role of Humans in Control Systems
9.4.3 Human Error Fundamentals
9.4.4 Providing Control Options
9.4.5 Matching Tasks to Human Characteristics
9.4.6 Designing to Reduce Common Human Errors
9.4.7 Support in Creating and Maintaining Accurate Process Models
9.4.8 Providing Information and Feedback
9.5 Summary
10 Integrating Safety into System Engineering
10.1 The Role of Specifications and the Safety Information System
10.2 Intent Specifications
10.3 An Integrated System and Safety Engineering Process
10.3.1 Establishing the Goals for the System
10.3.2 Defining Accidents
10.3.3 Identifying the System Hazards
10.3.4 Integrating Safety into Architecture Selection and System Trade Studies
10.3.5 Documenting Environmental Assumptions
10.3.6 System-Level Requirements Generation
10.3.7 Identifying High-Level Design and Safety Constraints
10.3.8 System Design and Analysis
10.3.9 Documenting System Limitations
10.3.10 System Certification, Maintenance, and Evolution
11 Analyzing Accidents and Incidents (CAST)
11.1 The General Process of Applying STAMP to Accident Analysis
11.2 Creating the Proximal Event Chain
11.3 Defining the System(s) and Hazards Involved in the Loss
11.4 Documenting the Safety Control Structure
11.5 Analyzing the Physical Process
11.6 Analyzing the Higher Levels of the Safety Control Structure
11.7 A Few Words about Hindsight Bias and Examples
11.8 Coordination and Communication
11.9 Dynamics and Migration to a High-Risk State
11.10 Generating Recommendations from the CAST Analysis
11.11 Experimental Comparisons of CAST with Traditional Accident Analysis
11.12 Summary
12 Controlling Safety during Operations
12.1 Operations Based on STAMP
12.2 Detecting Development Process Flaws during Operations
12.3 Managing or Controlling Change
12.3.1 Planned Changes
12.3.2 Unplanned Changes
12.4 Feedback Channels
12.4.1 Audits and Performance Assessments
12.4.2 Anomaly, Incident, and Accident Investigation
12.4.3 Reporting Systems
12.5 Using the Feedback
12.6 Education and Training
12.7 Creating an Operations Safety Management Plan
12.8 Applying STAMP to Occupational Safety
13 Managing Safety and the Safety Culture
13.1 Why Should Managers Care about and Invest in Safety?
13.2 General Requirements for Achieving Safety Goals
13.2.1 Management Commitment and Leadership
13.2.2 Corporate Safety Policy
13.2.3 Communication and Risk Awareness
13.2.4 Controls on System Migration toward Higher Risk
13.2.5 Safety, Culture, and Blame
13.2.6 Creating an Effective Safety Control Structure
13.2.7 The Safety Information System
13.2.8 Continual Improvement and Learning
13.2.9 Education, Training, and Capability Development
13.3 Final Thoughts
14 SUBSAFE: An Example of a Successful Safety Program
14.1 History
14.2 SUBSAFE Goals and Requirements
14.3 SUBSAFE Risk Management Fundamentals
14.4 Separation of Powers
14.5 Certification
14.5.1 Initial Certification
14.5.2 Maintaining Certification
14.6 Audit Procedures and Approach
14.7 Problem Reporting and Critiques
14.8 Challenges
14.9 Continual Training and Education
14.10 Execution and Compliance over the Life of a Submarine
14.11 Lessons to Be Learned from SUBSAFE
Epilogue

APPENDIXES
A Definitions
B The Loss of a Satellite
C A Bacterial Contamination of a Public Water Supply
D A Brief Introduction to System Dynamics Modeling
References
Index

Series Foreword

Engineering Systems is an emerging field that is at the intersection of engineering, management, and the social sciences. Designing complex technological systems requires not only traditional engineering skills but also knowledge of public policy issues and awareness of societal norms and preferences. In order to meet the challenges of rapid technological change and of scaling systems in size, scope, and complexity, Engineering Systems promotes the development of new approaches, frameworks, and theories to analyze, design, deploy, and manage these systems.

This new academic field seeks to expand the set of problems addressed by engineers, and draws on work in the following fields as well as others:
• Technology and Policy
• Systems Engineering
• System and Decision Analysis, Operations Research
• Engineering Management, Innovation, Entrepreneurship
• Manufacturing, Product Development, Industrial Engineering

The Engineering Systems Series will reflect the dynamism of this emerging field and is intended to provide a unique and effective venue for publication of textbooks and scholarly works that push forward research and education in Engineering Systems.

Series Editorial Board:
Joel Moses, Massachusetts Institute of Technology, Chair
Richard de Neufville, Massachusetts Institute of Technology
Manuel Heitor, Instituto Superior Técnico, Technical University of Lisbon
Granger Morgan, Carnegie Mellon University
Elisabeth Paté-Cornell, Stanford University
William Rouse, Georgia Institute of Technology

Preface

I began my adventure in system safety after completing graduate studies in computer science and joining the faculty of a computer science department.
In the first week at my new job, I received a phone call from Marion Moon, a system safety engineer at what was then the Ground Systems Division of Hughes Aircraft Company. Apparently he had been passed between several faculty members, and I was his last hope. He told me about a new problem they were struggling with on a torpedo project, something he called “software safety.” I told him I didn’t know anything about it and that I worked in a completely unrelated field. I added that I was willing to look into the problem. That began what has been a thirty-year search for a solution to that problem and an answer to the more general question of how to build safer systems.

Around the year 2000, I became very discouraged. Although many bright people had been working on the problem of safety for a long time, progress seemed to be stalled. Engineers were diligently performing safety analyses that did not seem to have much impact on accidents. The reason for the lack of progress, I decided, was that the technical foundations and assumptions on which traditional safety engineering efforts are based are inadequate for the complex systems we are building today. The world of engineering has experienced a technological revolution, while the basic engineering techniques applied in safety and reliability engineering, such as fault tree analysis (FTA) and failure modes and effects analysis (FMEA), have changed very little. Few systems are built without digital components, which operate very differently than the purely analog systems they replace. At the same time, the complexity of our systems and the world in which they operate has also increased enormously. The old safety engineering techniques, which were based on a much simpler, analog world, are diminishing in their effectiveness as the cause of accidents changes.

For twenty years I watched engineers in industry struggling to apply the old techniques to new software-intensive systems, expending much energy and having little success.
At the same time, engineers can no longer focus only on technical issues and ignore the social, managerial, and even political factors that impact safety if we are to significantly reduce losses. I decided to search for something new. This book describes the results of that search and the new model of accident causation and system safety techniques that resulted.

The solution, I believe, lies in creating approaches to safety based on modern systems thinking and systems theory. While these approaches may seem new or paradigm changing, they are rooted in system engineering ideas developed after World War II. They also build on the unique approach to engineering for safety, called System Safety, that was pioneered in the 1950s by aerospace engineers such as C. O. Miller, Jerome Lederer, and Willie Hammer, among others. This systems approach to safety was created originally to cope with the increased level of complexity in aerospace systems, particularly military aircraft and ballistic missile systems. Many of these ideas have been lost over the years or have been displaced by the influence of more mainstream engineering practices, particularly reliability engineering.

This book returns to these early ideas and updates them for today’s technology. It also builds on the pioneering work in Europe of Jens Rasmussen and his followers in applying systems thinking to safety and human factors engineering.

Our experience to date is that the new approach described in this book is more effective, less expensive, and easier to use than current techniques. I hope you find it useful.

Relationship to Safeware

My first book, Safeware, presents a broad overview of what is known and practiced in System Safety today and provides a reference for understanding the state of the art. To avoid redundancy, information about basic concepts in safety engineering that appear in Safeware is not, in general, repeated.
To make this book coherent in itself, however, there is some repetition, particularly on topics for which my understanding has advanced since writing Safeware.

Audience

This book is written for the sophisticated practitioner rather than the academic researcher or the general public. Therefore, although references are provided, an attempt is not made to cite or describe everything ever written on the topics or to provide a scholarly analysis of the state of research in this area. The goal is to provide engineers and others concerned about safety with some tools they can use when attempting to reduce accidents and make systems and sophisticated products safer.

It is also written for those who are not safety engineers and those who are not even engineers. The approach described can be applied to any complex, sociotechnical system such as health care and even finance. This book shows you how to “reengineer” your system to improve safety and better manage risk. If preventing potential losses in your field is important, then the answer to your problems may lie in this book.

Contents

The basic premise underlying this new approach to safety is that traditional models of causality need to be extended to handle today’s engineered systems. The most common accident causality models assume that accidents are caused by component failure and that making system components highly reliable or planning for their failure will prevent accidents. While this assumption is true in the relatively simple electromechanical systems of the past, it is no longer true for the types of complex sociotechnical systems we are building today. A new, extended model of accident causation is needed to underlie more effective engineering approaches to improving safety and better managing risk.

The book is divided into three sections.
The first part explains why a new approach is needed, including the limitations of traditional accident models, the goals for a new model, and the fundamental ideas in system theory upon which the new model is based. The second part presents the new, extended causality model. The final part shows how the new model can be used to create new techniques for system safety engineering, including accident investigation and analysis, hazard analysis, design for safety, operations, and management.

This book has been a long time in preparation because I wanted to try the new techniques myself on real systems to make sure they work and are effective. In order not to delay publication further, I will create exercises, more examples, and other teaching and learning aids and provide them for download from a website in the future.

Chapters 6 through 10, on system safety engineering and hazard analysis, are purposely written to be stand-alone and therefore usable in undergraduate and graduate system engineering classes where safety is just one part of the class contents and the practical design aspects of safety are the most relevant.

Acknowledgments

The research that resulted in this book was partially supported by numerous research grants over many years from NSF and NASA. David Eckhardt at the NASA Langley Research Center provided the early funding that got this work started.

I also am indebted to all my students and colleagues who have helped develop these ideas over the years. There are too many to list, but I have tried to give them