Improving Product Reliability and Software Quality: Strategies, Tools, Process and Implementation [2 ed.] 1119179394, 9781119179399

The authoritative guide to the effective design and production of reliable technology products, revised and updated Whil

English Pages 456 [434] Year 2019

Author / Uploaded
Mark A. Levin
Ted T. Kalal
Jonathan Rodin

Table of contents:
Cover
Wiley Series in Quality & Reliability Engineering
Improving Product Reliability and Software Quality: Strategies, Tools, Process and Implementation
© 2019
Dedication
Contents
About the Authors
List of Figures
List of Tables
Series Editor’s Foreword
Series Foreword Second Edition
Series Foreword First Edition
Foreword First Edition
Preface Second Edition
Preface First Edition
Acknowledgments
Glossary
Part I: Reliability and Software Quality – It’s a Matter of Survival
1 The Need for a New Paradigm for Hardware Reliability and Software Quality
2 Barriers to Implementing Hardware Reliability and Software Quality
3 Understanding Why Products Fail
4 Alternative Approaches to Implementing Reliability
Part II: Unraveling the Mystery
5 The Product Life Cycle
6 Reliability Concepts
7 FMEA
8 The Reliability Toolbox
9 Software Quality Goals and Metrics
10 Software Quality Analysis Techniques
11 Software Life Cycles
12 Software Procedures and Techniques
13 Why Hardware Reliability and Software Quality Improvement Efforts Fail
14 Supplier Management
Part III: Steps to Successful Implementation
15 Establishing a Reliability Lab
16 Hiring and Staffing the Right People
17 Implementing the Reliability Process
Part IV: Reliability and Quality Process for Product Development
18 Product Concept Phase
19 Design Concept Phase
20 Product Design Phase
21 Design Validation Phase
22 Software Testing and Debugging
23 Applying Software Quality Procedures
24 Production Phase
25 End-of-Life Phase
26 Field Service
Appendix A
Appendix B
Index

Improving Product Reliability and Software Quality

Wiley Series in Quality & Reliability Engineering
Dr Andre Kleyner
Series Editor

The Wiley series in Quality & Reliability Engineering aims to provide a solid educational foundation for both practitioners and researchers in the Q&R field and to expand the reader’s knowledge base to include the latest developments in this field. The series will provide a lasting and positive contribution to the teaching and practice of engineering. The series coverage will contain, but is not exclusive to,

• statistical methods;
• physics of failure;
• reliability modeling;
• functional safety;
• six-sigma methods;
• lead-free electronics;
• warranty analysis/management; and
• risk and safety analysis.

Wiley Series in Quality & Reliability Engineering

Improving Product Reliability and Software Quality
by Mark A. Levin, Ted T. Kalal, Jonathan Rodin
April 2019

Design for Safety
by Louis J Gullo, Jack Dixon
February 2018

Next Generation HALT and HASS: Robust Design of Electronics and Systems
by Kirk A. Gray, John J. Paschkewitz
May 2016

Reliability and Risk Models: Setting Reliability Requirements, 2nd Edition
by Michael Todinov
September 2015

Applied Reliability Engineering and Risk Analysis: Probabilistic Models and Statistical Inference
by Ilia B. Frenkel, Alex Karagrigoriou, Anatoly Lisnianski, Andre V. Kleyner
September 2013

Design for Reliability
by Dev G. Raheja (Editor), Louis J. Gullo (Editor)
July 2012

Effective FMEAs: Achieving Safe, Reliable, and Economical Products and Processes using Failure Mode and Effects Analysis
by Carl Carlson
April 2012

Failure Analysis: A Practical Guide for Manufacturers of Electronic Components and Systems
by Marius Bazu, Titu Bajenescu
April 2011

Reliability Technology: Principles and Practice of Failure Prevention in Electronic Systems
by Norman Pascoe
April 2011

Improving Product Reliability: Strategies and Implementation
by Mark A. Levin, Ted T. Kalal
March 2003

Test Engineering: A Concise Guide to Cost-effective Design, Development and Manufacture
by Patrick O’Connor
April 2001

Integrated Circuit Failure Analysis: A Guide to Preparation Techniques
by Friedrich Beck
January 1998

Measurement and Calibration Requirements for Quality Assurance to ISO 9000
by Alan S. Morris
October 1997

Electronic Component Reliability: Fundamentals, Modelling, Evaluation, and Assurance
by Finn Jensen
November 1995

Improving Product Reliability and Software Quality
Strategies, Tools, Process and Implementation

Second Edition

Mark A. Levin
Teradyne, Inc.
California, USA

Ted T. Kalal
Retired
Texas, USA

Jonathan Rodin
Teradyne, Inc.
California, USA

This edition first published 2019
© 2019 John Wiley & Sons Ltd

Edition History
John Wiley & Sons, Ltd (1e, 2003)

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Mark A. Levin, Ted T. Kalal and Jonathan Rodin to be identified as the authors of this work has been asserted in accordance with law.

Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Editorial Office
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data
Names: Levin, Mark A., 1959- author. | Kalal, Ted T., author. | Rodin, Jonathan, 1957- author.
Title: Improving product reliability and software quality : strategies, tools, process and implementation / Mark A. Levin, Teradyne, Inc., California, USA, Ted T. Kalal (Retired), Texas, USA, Jonathan Rodin, Teradyne, Inc., California, USA.
Other titles: Improving product reliability
Description: 2nd edition. | Hoboken, NJ : John Wiley & Sons, Inc., [2019] | Revised edition of: Improving product reliability : strategies and implementation / Mark A. Levin and Ted T. Kalal. c2003. | Includes bibliographical references and index. |
Identifiers: LCCN 2018061430 (print) | LCCN 2019000421 (ebook) | ISBN 9781119179412 (Adobe PDF) | ISBN 9781119179436 (ePub) | ISBN 9781119179399 (hardcover)
Subjects: LCSH: Reliability (Engineering) | Manufacturing processes–Data processing. | Computer software–Evaluation.
Classification: LCC TS173 (ebook) | LCC TS173 .L47 2019 (print) | DDC 620/.00452–dc23
LC record available at https://lccn.loc.gov/2018061430

Cover Design: Wiley
Cover Images: (top to bottom): © teekid/Getty Images, © ez_thug/Getty Images, © AK2/Getty Images, Courtesy of Universal Robots/Teradyne Inc.

Set in 10/12pt WarnockPro by SPi Global, Chennai, India
Printed in Great Britain by TJ International Ltd, Padstow, Cornwall

Cary and Darren Kalal

To my beautiful wife, Dana Mischel Levin, for her endless love, support, and patience, and to our sons, Spencer Nathan Levin and Andrew Dylan Levin.

To Brigid, Sam, and Molly Rodin for their support and encouragement.

Contents

About the Authors
List of Figures
List of Tables
Series Editor's Foreword
Series Foreword Second Edition
Series Foreword First Edition
Foreword First Edition
Preface Second Edition
Preface First Edition
Acknowledgments
Glossary

Part I Reliability and Software Quality – It’s a Matter of Survival

1 The Need for a New Paradigm for Hardware Reliability and Software Quality
1.1 Rapidly Shifting Challenges for Hardware Reliability and Software Quality
1.2 Gaining Competitive Advantage
1.3 Competing in the Next Decade – Winners Will Compete on Reliability
1.4 Concurrent Engineering
1.5 Reducing the Number of Engineering Change Orders at Product Release
1.6 Time-to-Market Advantage
1.7 Accelerating Product Development
1.8 Identifying and Managing Risks
1.9 ICM, a Process to Mitigate Risk
1.10 Software Quality Overview
References
Further Reading

2 Barriers to Implementing Hardware Reliability and Software Quality
2.1 Lack of Understanding
2.2 Internal Barriers
2.3 Implementing Change and Change Agents
2.4 Building Credibility
2.5 Perceived External Barriers
2.6 Time to Gain Acceptance
2.7 External Barrier
2.8 Barriers to Software Process Improvement

3 Understanding Why Products Fail
3.1 Why Things Fail
3.2 Parts Have Improved, Everyone Can Build Quality Products
3.3 Hardware Reliability and Software Quality – The New Paradigm
3.4 Reliability vs. Quality Escapes
3.5 Why Software Quality Improvement Programs Are Unsuccessful
Further Reading

4 Alternative Approaches to Implementing Reliability
4.1 Hiring Consultants for HALT Testing
4.2 Outsourcing Reliability Testing
4.3 Using Consultants to Develop and Implement a Reliability Program
4.4 Hiring Reliability Engineers

Part II Unraveling the Mystery

5 The Product Life Cycle
5.1 Six Phases of the Product Life Cycle
5.2 Risk Mitigation
5.2.1 Investigate the Risk
5.2.2 Communicate the Risk
5.2.3 Mitigate the Risk
5.3 The ICM Process for a Small Company
5.4 Design Guidelines
5.5 Warranty
Further Reading
  Reliability Process
  DFM

6 Reliability Concepts
6.1 The Bathtub Curve
6.2 Mean Time between Failure
6.2.1 Mean Time between Repair
6.2.2 Mean Time between Maintenance (MTBM)
6.2.3 Mean Time between Incidents (MTBI)
6.2.4 Mean Time to Failure (MTTF)
6.2.5 Mean Time to Repair (MTTR)
6.2.6 Mean Time to Restore System (MTTRS)
6.3 Warranty Costs
6.4 Availability
6.4.1 On-site Manufacturer Service Personnel
6.4.2 Trained Customer Service Personnel
6.4.3 Manufacturer Training for Customer Service Personnel
6.4.4 Easy-to-Use Service Manuals
6.4.5 Rapid Diagnosis Capability
6.4.6 Repair and Spare Parts Availability
6.4.7 Rapid Response to Customer Requests for Service
6.4.8 Failure Data Tracking
6.5 Reliability Growth
6.6 Reliability Demonstration Testing
6.7 Maintenance and Availability
6.7.1 Preventative Maintenance
6.7.2 Predictive Maintenance
6.7.3 Prognostics and Health Management (PHM)
6.8 Component Derating
6.9 Component Uprating
Reference
Further Reading
  Reliability Growth
  Reliability Demonstration
  Prognostics and Health Management

7 FMEA
7.1 Benefits of FMEA
7.2 Components of FMEA
7.2.1 The Functional Block Diagram (FBD)
7.2.1.1 Generating the Functional Block Diagram
7.2.1.2 Filling in the Functional Block Diagram
7.2.2 The Fault Tree Analysis
7.2.2.1 Building the Fault Tree
7.2.2.2 Brainstorming
7.2.3 Failure Modes and Effects Analysis Spreadsheet
7.3 Preparing for the FMEA
7.4 Barriers to the FMEA Process
7.5 FMEA Ground Rules
7.6 Using Macros to Improve FMEA Efficiency and Effectiveness
7.7 Software FMEA
7.8 Software Fault Tree Analysis (SFTA)
7.9 Process FMEAs
7.10 FMMEA

8 The Reliability Toolbox
8.1 The HALT Process
8.1.1 Types of Stresses Applied in HALT
8.1.2 The Theory behind the HALT Process
8.1.3 HALT Testing Liquid Cooled Products
8.1.4 Planning for HALT Testing
8.2 Highly Accelerated Stress Screening (HASS)
8.2.1 Proof of Screen (POS)
8.2.2 Burn-In
8.2.3 Environmental Stress Screening (ESS)
8.2.4 Economic Impact of HASS
8.2.5 The HASA Process
8.3 HALT and HASS Test Chambers
8.4 Accelerated Reliability Growth (ARG)
8.5 Accelerated Early Life Test (ELT)
8.6 SPC Tool
8.7 FIFO Tool
References
Further Reading
  FMEA
  HALT
  HASS
  Quality
  Burn-in
  ESS
  Up Rating

9 Software Quality Goals and Metrics
9.1 Setting Software Quality Goals
9.2 Software Metrics
9.3 Lines of Code (LOC)
9.4 Defect Density
9.5 Defect Models
9.6 Defect Run Chart
9.7 Escaped Defect Rate
9.8 Code Coverage
References
Further Reading

10 Software Quality Analysis Techniques
10.1 Root Cause Analysis
10.2 The 5 Whys
10.3 Cause and Effect Diagrams
10.4 Pareto Charts
10.5 Defect Prevention, Defect Detection, and Defensive Programming
10.6 Effort Estimation
Reference
Further Reading

11 Software Life Cycles
11.1 Waterfall
11.2 Agile
11.3 CMMI
11.4 How to Choose a Software Life Cycle
Reference
Further Reading

12 Software Procedures and Techniques
12.1 Gathering Requirements
12.2 Documenting Requirements
12.3 Documentation
12.4 Code Comments
12.5 Reviews and Inspections
12.6 Traceability
12.7 Defect Tracking
12.8 Software and Hardware Integration
References
Further Reading

13 Why Hardware Reliability and Software Quality Improvement Efforts Fail
13.1 Lack of Commitment to the Reliability Process
13.2 Inability to Embrace and Mitigate Technologies Risk Issues
13.3 Choosing the Wrong People for the Job
13.4 Inadequate Funding
13.5 Inadequate Resources
13.6 MIL-HDBK 217 – Why It Is Obsolete
13.7 Finding But Not Fixing Problems
13.8 Nondynamic Testing
13.9 Vibration Testing Too Difficult to Implement
13.10 The Impact of Late Hardware or Late Software Delivery
13.11 Supplier Reliability
Reference
Further Reading

14 Supplier Management
14.1 Purchasing Interface
14.2 Identifying Your Critical Suppliers
14.3 Develop a Thorough Supplier Audit Process
14.4 Develop Rapid Nonconformance Feedback
14.5 Develop a Materials Review Board (MRB)
14.6 Counterfeit Parts and Materials

Part III Steps to Successful Implementation

15 Establishing a Reliability Lab
15.1 Staffing for Reliability
15.2 The Reliability Lab
15.3 Facility Requirements
15.4 Liquid Nitrogen Requirements
15.5 Air Compressor Requirements
15.6 Selecting a Reliability Lab Location
15.7 Selecting a HALT Test Chamber
15.7.1 Chamber Size
15.7.2 Machine Overall Height
15.7.3 Power Required and Consumption
15.7.4 Acceptable Operational Noise Levels
15.7.5 Door Swing
15.7.6 Ease of Operation
15.7.7 Profile Creation, Editing, and Storage
15.7.8 Temperature Rates of Change
15.7.9 Built-In Test Instrumentation
15.7.10 Safety
15.7.11 Time from Order to Delivery
15.7.12 Warranty
15.7.13 Technical/Service Support
15.7.14 Compressed Air Requirements
15.7.15 Lighting
15.7.16 Customization
Reference

16 Hiring and Staffing the Right People
16.1 Staffing for Reliability
16.1.1 A Reliability Engineering Background
16.1.2 HALT/HASS and ESS
16.1.3 Shock and Vibration Testing
16.1.4 Statistical Analysis
16.1.5 Failure Budgeting/Estimating
16.1.6 Failure Analysis
16.1.7 Conducting Reliability Training
16.1.8 Persuasive in Implementing New Concepts
16.1.9 A Degree in Engineering and/or Physics
16.2 Staffing for Software Engineers
16.3 Choosing the Wrong People for the Job

17 Implementing the Reliability Process
17.1 Reliability Is Everyone’s Job
17.2 Formalizing the Reliability Process
17.3 Implementing the Reliability Process
17.4 Rolling Out the Reliability Process
17.5 Developing a Reliability Culture
17.6 Setting Reliability Goals
17.7 Training
17.8 Product Life Cycle Defined
17.8.1 Concept Phase
17.8.2 Design Phase
17.8.3 Production Phase
17.8.4 End-of-Life and Obsolescence Phase
17.9 Proactive and Reactive Reliability Activities
Further Reading
  Reliability Process

Part IV Reliability and Quality Process for Product Development

18 Product Concept Phase
18.1 Reliability Activities in the Product Concept Phase
18.2 Establish the Reliability Organization
18.3 Define the Reliability Process
18.4 Define the Product Reliability Requirements
18.5 Capture and Apply Lessons Learned
18.6 Mitigate Risk
18.6.1 Filling Out the Risk Mitigation Form
18.6.1.1 Identify and Analyze Risk
18.6.1.2 Risk Severity
18.6.1.3 Date Risk Is Identified
18.6.1.4 Risk Accepted
18.6.1.5 High-Level Mitigation Plan
18.6.1.6 Resources Required
18.6.1.7 Completion Date
18.6.1.8 Success Metric
18.6.1.9 Investigate Alternative Solutions
18.6.2 Risk Mitigation Meeting

19 Design Concept Phase
19.1 Reliability Activities in the Design Concept Phase
19.2 Set Reliability Requirements and Budgets
19.2.1 Requirements for Product Use Environment
19.2.2 Product Useful Life Requirements
19.2.3 Subsystem and Printed Circuit Board Assembly (PCBA) Reliability Budgets
19.2.4 Service and Repair Requirements
19.3 Define Reliability Design Guidelines
19.4 Revise Risk Mitigation
19.4.1 Identifying Risk Issues
19.4.2 Reflecting Back (Capturing Internal Lessons Learned)
19.4.3 Looking Forward (Capturing New Risk Issues)
19.5 Schedule Reliability Activities and Capital Budgets
19.6 Decide Risk Mitigation Sign-off Day
19.7 Reflect on What Worked Well

20 Product Design Phase
20.1 Product Design Phase
20.2 Reliability Estimates
20.3 Implementing Risk Mitigation Plans
20.3.1 Mitigating Risk Issues Captured Reflecting Back
20.3.1.1 Design Out (or Use an Alternate Part/Supplier)
20.3.1.2 Change Use Conditions
20.3.1.3 Fix Part
20.3.1.4 Fix Process
20.3.2 Mitigating Risk Issues Captured Looking Forward
20.3.2.1 Accelerated Life Testing
20.3.2.2 Risk Mitigation Progress
20.4 Design for Reliability Guidelines (DFR)
20.4.1 Derating Guidelines
20.5 Design FMEA
20.6 Installing a Failure Reporting Analysis and Corrective Action System
20.7 HALT Planning
20.8 HALT Test Development
20.9 Risk Mitigation Meeting
Further Reading
  FMEA
  HALT

21 Design Validation Phase
21.1 Design Validation
21.2 Using HALT to Precipitate Failures
21.2.1 Starting the HALT Test
21.2.2 Room Ambient Test
21.2.3 Tickle Vibration Test
21.2.4 Temperature Step Stress Test and Power Cycling
21.2.5 Vibration Step Stress Test
21.2.6 Combinational Temperature and Vibration Test
21.2.7 Rapid Thermal Cycling Stress Test
21.2.8 Slow Temperature Ramp
21.2.9 Combinational Search Pattern Test
21.2.10 Additional Nonenvironmental Stress Tests
21.2.11 HALT Validation Test
21.3 Proof of Screen (POS)
21.4 Highly Accelerated Stress Screen (HASS)
21.5 Operate FRACAS
21.6 Design FMEA
21.7 Closure of Risk Issues
Further Reading
  FMEA
  Acceleration Methods
  ESS
  HALT

22 Software Testing and Debugging
22.1 Unit Tests
22.2 Integration Tests
22.3 System Tests
22.4 Regression Tests
22.5 Security Tests
22.6 Guidelines for Creating Test Cases
22.7 Test Plans
22.8 Defect Isolation Techniques
22.8.1 Simulation
22.9 Instrumentation and Logging
Further Reading

23 Applying Software Quality Procedures
23.1 Using Defect Model to Create Defect Run Chart
23.2 Using Defect Run Chart to Know When You Have Achieved the Quality Target
23.3 Using Root Cause Analysis on Defects to Improve Organizational Quality Delivery
23.4 Continuous Integration and Test
Further Reading

24 Production Phase
24.1 Accelerating Design Maturity
24.1.1 Product Improvement Tools
24.1.1.1 FRACAS
24.1.1.2 Design Issue Tracking
24.2 Reliability Growth
24.2.1 Accelerated Reliability Growth (ARG)
24.2.2 Accelerated Early Life Testing (ELT)
24.3 Design and Process FMEA
24.3.1 Quality Control Tools
24.3.1.1 SPC
24.3.1.2 Six Sigma
24.3.1.3 HASS and HASA
Further Reading
  FMEA
  Quality
  Reliability Growth
  Burn-In
  HASS

25 End-of-Life Phase
25.1 Managing Obsolescence
25.2 Product Termination
25.3 Project Assessment
Further Reading

26 Field Service
26.1 Design for Ease of Access
26.2 Identify High Replacement Assemblies (FRUs)
26.3 Wearout Replacement
26.4 Preemptive Servicing
26.5 Servicing Tools
26.6 Service Loops
26.7 Availability or Repair Time Turnaround
26.8 Avoid System Failure Through Redundancy
26.9 Random versus Wearout Failures
Further Reading

Appendix A
A.1 Reliability Consultants
A.2 Graduate Reliability Engineering Programs and Reliability Certification Programs
A.3 Reliability Professional Organizations and Societies
A.4 Reliability Training Classes
A.5 Environmental Testing Services
A.6 HALT Test Chambers
A.7 Reliability Websites
A.8 Reliability Software
A.9 Reliability Seminars and Conferences
A.10 Reliability Journals

Appendix B
B.1 MTBF, FIT, and PPM Conversions
B.2 Mean Time Between Failure (MTBF)
B.3 Estimating Field Failures
B.3.1 Comparing Repairable to Nonrepairable Systems

Index

About the Authors

Mark A. Levin is the reliability manager at Teradyne, Inc. and is based in Agoura Hills, California. He received a bachelor of science degree in Electrical Engineering (1982) from the University of Arizona, a master of science degree in Technology Management (1999) from Pepperdine University, and a master of science degree in Reliability Engineering (2009) from the University of Maryland, where he has also completed all but the dissertation for a PhD in Reliability Engineering. He has more than 36 years of electronics experience spanning the aerospace, defense, consumer, and medical electronics industries. He has held several management and research positions at Hughes Aircraft Missiles Systems Group, Hughes Aircraft Microwave Products Division, General Medical Company, and Medical Data Electronics. His experience is diverse, covering manufacturing, design, and research and development. He has developed manufacturing and reliability design guidelines, reliability training classes, workmanship standards, quality programs, JIT manufacturing, and ESD safe work environments, and has established a surface mount production facility.

Ted T. Kalal is a reliability engineer (now retired) who gained much of his understanding of reliability from hands-on experience and from many great mentors. He graduated from the University of Wisconsin (1981) in Business Administration after completing much preliminary study in mathematics, physics, and electronics. He has held many positions as a contract engineer and as a consultant, where he was able to focus on design, quality, and reliability tasks. He has authored several papers on electronic circuitry and holds a patent in the field of power electronics. With two partners, he started a small manufacturing company that makes high-tech power supplies and other scientific apparatus for the bioresearch community.

Jonathan Rodin is a software engineering manager at Teradyne, Inc. A graduate of Columbia University (1981), Jon has 39 years of experience developing software, both as a programmer and as a manager of software development projects. His experience spans companies of many sizes, ranging from early stage startups to companies with more than 100 000 employees. Prior to joining Teradyne, Jon held executive engineering management positions at FTP Software, NaviSite, and Percussion Software. He has led software process reengineering projects numerous times, most recently driving the effort to bring Teradyne’s Semiconductor Test Division to CMMI Level 3.

List of Figures

Figure 1.1 Product cost is determined early in development.
Figure 1.2 Cost to fix a design increases an order of magnitude with each subsequent phase. Source: Courtesy of Teradyne, Inc.
Figure 1.3 The reliability process reduces the number of ECOs required after product release.
Figure 1.4 Including reliability in concurrent engineering reduces time to market.
Figure 1.5 Product introduction relative to competitors.
Figure 1.6 The ICM process.
Figure 2.1 Overcoming reliability hurdles brings significant rewards. Source: Courtesy of Teradyne, Inc.
Figure 5.1 The six phases of the product life cycle.
Figure 5.2 The ICM process.
Figure 5.3 A risk mitigation program (ICM) needs to address risk issues in all aspects of the development program. Source: Courtesy of Teradyne, Inc.
Figure 6.1 The bathtub curve (timescale is logarithmic).
Figure 6.2 Cumulative failure curve.
Figure 6.3 Light bulb theoretical example.
Figure 6.4 Availability as a function of MTBF and MTTR. Note: The curve has a slight ripple in it due to change in the MTBF axis. For the range between 0 and 200, it is marked in 25-hour increments and in 100-hour increments thereafter. This was done for resolution purposes to illustrate the impact of both low MTBFs and long MTTRs.
Figure 6.5 Design maturity testing – accept/reject criteria.
Figure 6.6 Number of fan failures vs. run time.
Figure 6.7 Mechanism that can cause degradation and failure.
Figure 6.8 PHM data collection and processing to detect degradation. Source: courtesy Anto Peter.
Figure 7.1 Functional block diagram.

Figure 7.2 Filled-out functional block diagram.
Figure 7.3 Schematic diagram of a flashlight.
Figure 7.4 Functional block diagram of a flashlight.
Figure 7.5 Functional block diagram of a flashlight using Post-its.
Figure 7.6 Fault tree logic symbols.
Figure 7.7 Fault tree diagram for flashlight using Post-its.
Figure 7.8 Logic flow diagram.
Figure 7.9 Fault tree logic diagram.
Figure 7.10 Flashlight fault tree logic diagram.
Figure 7.11 Functional block diagram for the flashlight process.
Figure 7.12 Example of a SFTA for an execution flow failure.
Figure 8.1 Pareto of failures.
Figure 8.2 HALT failure percentage by stress type.
Figure 8.3 Product design specification limits.
Figure 8.4 Design margin.
Figure 8.5 Some products fail product spec.
Figure 8.6 HALT increases design margin.
Figure 8.7 Soft and hard failures.
Figure 8.8 Impact of HALT on design margins.
Figure 8.9 Two heat exchangers placed in front of chamber forced air.
Figure 8.10 Test setup profile to check out connections and functionality.
Figure 8.11 Temperature step stress with power cycle and end of each step.
Figure 8.12 Vibration step stress.
Figure 8.13 Temperature and vibration step stress.
Figure 8.14 Rapid thermal cycling.
Figure 8.15 Slow temperature ramp.
Figure 8.16 Slow temperature ramp with constantly varying vibration level.
Figure 8.17 HASS stress levels.
Figure 8.18 The bathtub curve.
Figure 8.19 HASA plan. Source: Courtesy of James McLinn.
Figure 8.20 A HALT chamber has six simultaneous degrees of freedom (movement).
Figure 8.21 ARG process flow.
Figure 8.22 Accelerated reliability growth.
Figure 8.23 ARG and ELT acceleration test plans.
Figure 8.24 Selective process control. Source: Courtesy of James McLinn.
Figure 9.1 Quality ROI chart (financial impact of escapes is low).
Figure 9.2 Quality ROI chart (financial impact of escapes is high).

Figure 9.3 Sample line counts.
Figure 9.4 Defect run chart 1.
Figure 9.5 Defect run chart 2.
Figure 9.6 Comparative escape rates.
Figure 10.1 Generic fishbone diagram.
Figure 10.2 Sample fishbone diagram.
Figure 10.3 Sample Pareto chart.
Figure 10.4 Code review root cause Pareto.
Figure 10.5 Try-catch code example.
Figure 11.1 Waterfall life cycle.
Figure 11.2 Quality processes in a waterfall life cycle.
Figure 11.3 Sprint activities.
Figure 11.4 Sprint activities in an epic.
Figure 12.1 Sample requirements.
Figure 12.2 Sample user stories.
Figure 12.3 Code comments example.
Figure 12.4 Sample UART HAL code.
Figure 15.1 ESPEC/Qualmark HALT chamber.
Figure 17.1 The six phases of the product life cycle.
Figure 17.2 The hardware reliability process.
Figure 17.3 Proactive activities in the product life cycle.
Figure 18.1 Product concept phase risk mitigation form.
Figure 18.2 Risk severity scale.
Figure 18.3 ICM sign-off required before proceeding to design concept.
Figure 19.1 Opportunity to affect product cost.
Figure 19.2 The bathtub curve.
Figure 19.3 System MTBF requirement.
Figure 19.4 Subsystem MTBF requirement.
Figure 19.5 180° of reliability risk mitigation.
Figure 19.6 Where to look for new reliability risks.
Figure 19.7 The reliability risk mitigation process.
Figure 19.8 The ICM is an effective gate to determine if the project should proceed.
Figure 20.1 The first phase of the product life cycle.
Figure 20.2 Looking forward to identify risk issues.
Figure 20.3 Risk mitigation strategies for reliability and performance.
Figure 20.4 Risk growth curve shows the rate at which risk issues are identified and mitigated.

xxiii

xxiv

Figure 20.5 DFR guideline for electrolytic capacitor usage. Source: Courtesy of Teradyne, Inc.
Figure 20.6 HALT planning flow.
Figure 20.7 HALT planning checklist.
Figure 20.8 HALT development phase.
Figure 21.1 Reliability activities in the validation phase.
Figure 21.2 HALT process flow.
Figure 21.3 HALT test setup verification test.
Figure 21.4 Temperature step stress.
Figure 21.5 Vibration step stress.
Figure 21.6 Temperature and vibration step stress.
Figure 21.7 Rapid thermal cycling (60 °C min⁻¹).
Figure 21.8 Slow temperature ramp.
Figure 21.9 Slow temperature ramp and sinusoidal amplitude vibration.
Figure 21.10 HALT form to log failures.
Figure 21.11 HALT graph paper for documenting test.
Figure 21.12 HASS stress levels.
Figure 21.13 HASS profile.
Figure 22.1 Assert functions can be used with an appropriate header.
Figure 22.2 Sample test plan.
Figure 22.3 Sample log code.
Figure 22.4 Example log file extract.
Figure 24.1 Achieving quality in the production phase.
Figure 24.2 Design issue tracking chart.
Figure 24.3 Reliability growth chart.
Figure 24.4 Reliability growth chart versus predicted.
Figure 24.5 Duane curve.
Figure 24.6 Phase 5 ARG process flow.
Figure 24.7 Typical SPC chart.

List of Tables

Table 5.1 Functional activities for cross-functional integration of reliability.
Table 6.1 Failures in the warranty period w/different MTBFs.
Table 6.2 Advantages of proactive reliability growth.
Table 6.3 RDT multiplier for failure-free runtime.
Table 6.4 FMMEA for fan bearings (detection omitted).
Table 6.5 Sensors to monitor for overstress in wearout degradation.
Table 6.6 Sensors to monitor bearing degradation.
Table 6.7 Component grade temperature classifications.
Table 7.1 The FMEA spreadsheet.
Table 7.2 RPN ranking table.
Table 7.3 FMEA parking lot for important issue that are not part of the FMEA.
Table 7.4 Common software failure modes.
Table 7.5 Common causes for software failure.
Table 7.6 Failure modes and associated possible causes.
Table 8.1 Agreed upon HALT limits.
Table 8.2 HALT profile for test setup checkout.
Table 8.3 Temperature step stress with power cycle and end of each step.
Table 8.4 Vibration step stress.
Table 8.5 Temperature and vibration step stress.
Table 8.6 Rapid thermal cycling.
Table 8.7 Slow temperature ramp.
Table 8.8 Slow temperature ramp with constantly varying vibration level.
Table 11.1 CMMI process areas.
Table 11.2 CMMI maturity levels.
Table 11.3 Life cycle comparison.
Table 14.1 Industry standards for managing counterfeit material risk.
Table 15.1 Annual sales dollars relative to typical warranty costs.
Table 15.2 HALT facility decision guide.

Table 15.3 HALT machine decision matrix.
Table 16.1 Reliability skill set for various positions.
Table 17.1 Reliability activities for each phase of the product life cycle.
Table 17.2 Reliability activities – what’s required, recommended, and nice to have.
Table 18.1 Product concept phase reliability activities.
Table 19.1 Design concept phase reliability activities.
Table 20.1 Reliability activities for the product design phase.
Table 20.2 Common accelerated life test stresses.
Table 20.3 Environmental stress tests.
Table 21.1 Reliability activities in the design validation phase.
Table 21.2 HALT Profile test limits and test times.
Table 24.1 Reliability activities in the production ramp Phase 5.
Table 24.2 Reliability activities in the production release Phase 6.
Table B.1 Conversion tables for FIT to MTBF and PPM.
Table B.2 Factorials.
Table B.3 Repairable versus nonrepairable systems still operating (in MTBF time units).

Series Editor’s Foreword

Engineering systems are becoming more and more complex, with added functions, capabilities and increasing complexity of the systems architecture. Systems modeling, performance assessment, risk analysis and reliability prediction present increasingly challenging tasks. Continuously growing computing power relegates more and more functions to the software, placing more pressure on delivering faultless hardware-software interaction. Rapid development of autonomous vehicles and growing attention to functional safety bring quality and reliability to the forefront of the product development cycle.

The book you are about to read presents a comprehensive and practical approach to reliability engineering as an integral part of the product design process. Various pieces of the puzzle, such as hardware reliability, physics of failure, FMEA, product validation and test planning, reliability growth, software quality, lifecycle engineering approach, supplier management and others fit nicely into a comprehensive picture of a successful reliability program.

Despite its obvious importance, quality and reliability education is paradoxically lacking in today’s engineering curriculum. Few engineering schools offer degree programs or even a sufficient variety of courses in quality or reliability methods. Therefore, a majority of the quality and reliability practitioners receive their professional training from colleagues, engineering seminars, publications and technical books. The lack of formal education opportunities in this field greatly emphasizes the importance of technical publications, such as this one, for professional development.

We are confident that this book, as well as the whole series, will continue Wiley’s tradition of excellence in technical publishing and provide a lasting and positive contribution to the teaching and practice of engineering.

Dr. Andre Kleyner
Editor of the Wiley Series in Quality & Reliability Engineering


Series Foreword Second Edition

There is a popular saying, “If you fail to plan, you are planning to fail.” I don’t know if there is another discipline in complex product development where this is more true than designing for product reliability. When products are simple, it is possible to achieve high reliability by observing good design practices, but as products become more complex, and include thousands of components and hundreds of thousands of lines of software, a systematic approach is required.

This has played itself out inside of Teradyne over the last decade through two product lines in our Semiconductor Test Division. One product line, the UltraFLEX Test System, was designed internally. Another, the ETS-800 Test System, was designed in a company that Teradyne acquired in 2008.

The UltraFLEX platform was designed using Teradyne’s internal Design for Reliability standards. The principles embodied in those standards are described by the authors. We religiously used an approved parts list of qualified components and suppliers, we analyzed the electrical stress on every circuit, and we calculated predicted reliability for every instrument and the whole system. Once the system was fielded, we tracked MTBF and executed our failure reporting, analysis, and corrective action system (FRACAS) on repeat failure modes. The result is that the UltraFLEX platform, our most complex product, has a field reliability about three times higher than prior-generation products. What makes this more remarkable is that the UltraFLEX has the capability to test two or even four more semiconductor devices in parallel compared to prior testers.

During the development of the UltraFLEX and over the past decade, we also began to deploy and came to rely upon more formal methods to improve software reliability. To be frank, our organizational maturity in software reliability lagged behind our hardware best practices. But through the application of tools like defect models, and especially tracking the reliability of deployed software through automated quality monitors, we were able to both improve the quality of the deployed product and also improve our development methods.

A key tool we use to evaluate software reliability is a metric we call clean sessions. A clean session is a session where an operator starts up the tester, loads a program, executes a task like developing tests, debugging, or just testing devices, finishes the task, and then unloads the program, without encountering any anomalous behavior. When we started tracking this metric at the launch of the UltraFLEX, only about half of the sessions were clean. It took us nearly five years to get to 95% clean sessions, and this has set a benchmark that our competitors struggle to reach. Through the learning achieved in this long struggle, we have been able to achieve 95% clean sessions within three months of the release of our next-generation product.

The ETS-800 is the next-generation version of the successful tester for mixed signal and power devices. When Teradyne acquired the business in 2008, there was no formal reliability program in place, but their products were well regarded in the marketplace and reasonably reliable. The ETS-800 was a big step up in terms of capability from the prior generation. The instruments were two to four times as dense, and the system could support almost twice as many instruments. Further, the tester included a promising new feature that would greatly simplify customer test programs by providing the switching needed to share tester resources between different device pins. From a functional and performance perspective, the ETS-800 was a fantastic success. A single ETS-800 could replace up to eight prior-generation testers.

But we found out the hard way that the informal approach to reliability that worked for simple products did not work for more complex ones. When we initially fielded the ETS-800, it was not a reliable tester. The weak link in the design was the inclusion of thousands of mechanical relays. These relays provided superior electrical performance, but are challenging to use from a reliability perspective. Mechanical relays are highly reliable if they are not hot switched, that is, switched while a current is flowing through the contacts. A hot-switching event causes an arc across the contact surface that rapidly degrades the surface and shortens the life of the relay. If the relays had been designed for reliability, the hot-switching events could have been avoided. The ETS-800 reliability was an order of magnitude below the much more complex UltraFLEX platform, and this put a blemish on the reputation we worked hard to develop for delivering highly reliable products.

We worked for a long time to try to improve the robustness of the relays and reduce the occurrence of hot switching, without making much progress. Ultimately we decided to redesign all of the instrumentation using guidelines from the Teradyne reliability system. We are just beginning the deployment of the redesigned instruments, but in side-by-side testing, they are demonstrating about 100 times higher reliability than the ones that they replace. It was a hard but effective lesson that a systematic approach to hardware reliability and software quality, as the authors have described, is the best way to achieve both high customer satisfaction and good profits.

Gregory S. Smith
President, Semiconductor Test Division
Teradyne, Inc.


Series Foreword First Edition

Modern engineering products, from individual components to large systems, must be designed and manufactured to be reliable. The manufacturing processes must be performed correctly and with the minimum of variation. All of these aspects impact upon the costs of design, development, manufacture, and use, or, as they are often called, the product’s life cycle costs. The challenge of modern competitive engineering is to ensure that life cycle costs are minimized whilst achieving requirements for performance and time to market. If the market for the product is competitive, improved quality and reliability can generate very strong competitive advantages. We have seen the results of this in the way that many products, particularly Japanese cars, machine tools, earthmoving equipment, electronic components, and consumer electronic products have won dominant positions in world markets in the last 30–40 years. Their success has been largely the result of the teaching of the late W. E. Deming, who taught the fundamental connections between quality, productivity, and competitiveness. Today this message is well understood by nearly all the engineering companies that face the new competition, and those that do not understand lose position or fail.

The customers for major systems, particularly the US military, drove the quality and reliability methods that were developed in the West. They reacted to a perceived low achievement by imposing standards and procedures, whilst their suppliers saw little motivation to improve, since they were paid for spares and repairs. The methods included formal systems for quality and reliability management (MIL-Q-9858 and MIL-STD-785) and methods for predicting and measuring reliability (MIL-STD-721, MIL-HDBK-217, MIL-STD-781). MIL-Q-9858 was the model for the international standard on quality systems (ISO9000); the methods for quantifying reliability have been similarly developed and applied to other types of products and have been incorporated into other standards such as ISO60300. These approaches have not proved to be effective and their application has been controversial.

By contrast, the Japanese quality movement was led by an industry that learned how quality provided the key to greatly increased productivity and competitiveness, principally in commercial and consumer markets. The methods that they applied were based on an understanding of the causes of variation and failures, and continuous improvements through the application of process controls and the motivation and management of people at work. It is one of history’s ironies that the foremost teachers of these ideas were Americans, notably P. Drucker, W.A. Shewhart, W.E. Deming, and J.M. Juran.

These two streams of development epitomize the difference between the deductive mentality applied by the Japanese to industry in general, and to engineering in particular, in contrast to the more inductive approach that is typically applied in the West. The deductive approach seeks to generate continuous improvements across a broad front and new ideas are subjected to careful evaluation. The inductive approach leads to inventions and “break-throughs,” and to greater reliance on “systems” for control of people and processes. The deductive approach allows a clearer view, particularly in discriminating between sense and nonsense. However, it is not as conducive to the development of radical new ideas. Obviously these traits are not exclusive, and most engineering work involves elements of both. However, the overall tendency of Japanese thinking shows in their enthusiasm and success in industrial teamwork and in the way that they have adopted the philosophies of western teachers such as Drucker and Deming, whilst their western competitors have found it more difficult to break away from the mold of “scientific” management, with its reliance on systems and more rigid organizations and procedures.

Unfortunately, the development of quality and reliability engineering has been afflicted with more nonsense than any other branch of engineering. This has been the result of the development of methods and systems for analysis and control that contravene the deductive logic that quality and reliability are achieved by knowledge, attention to detail, and continuous improvement on the part of the people involved. Therefore, it can be difficult for students, teachers, engineers, and managers to discriminate effectively, and many have been led down wrong paths.

In this series we will attempt to provide a balanced and practical source covering all aspects of quality and reliability engineering and management, related to present and future conditions, and to the range of new scientific and engineering developments that will shape future products. The goal of this series is to present practical, cost-efficient and effective quality and reliability engineering methods and systems. I hope that the series will make a positive contribution to the teaching and the practice of engineering.

Patrick D.T. O’Connor
February 2003


Foreword First Edition

In my 26 years at Teradyne, I have seen the automated test industry emerge from its infancy and grow into a multibillion-dollar industry. During that period, Teradyne evolved into the world’s leading supplier of automated test equipment (ATE) for testing semiconductors, circuit boards, modules, voice, and broadband telephone networks. As our business grew, the technology necessary to design ATE became increasingly complex, often requiring leading-edge electronics to meet customer performance needs. Our designs have pushed the envelope, demanding advancements in nearly every technological area including process capability, component density, cooling technology, ASIC complexity, and analog/digital signal accuracy.

Our customers, too, insist on the highest performance systems possible to test their products. But performance alone does not provide the product differentiation that wins sales. Customers also demand incomparable reliability. Revenue lost when an ATE system goes down can be staggering, often in the area of tens of thousands of dollars per hour. Furthermore, because of design complexity and system cost, the warranty cost to maintain these systems is increasing. Low reliability severely impacts the bottom line and impedes the ability to gain and hold market share.

To improve product reliability, changes had to be made to the reliability process. We learned that the process needed to be proactive. It had to start early in the product concept stage and include all phases of the product development cycle. In researching solutions for improving product reliability, we found the wealth of information available to be too theoretical and mathematically based. Clearly, we didn’t want a solution that could only be implemented by reliability engineers and statisticians. If the training were overly statistical, the message would be lost. If the process required training everyone to become a reliability engineer, it would be useless. The process had to reduce technical reliability theory into practical processes easily understood by the product development team.

For the reliability program to be successful, we needed a way to provide both management and engineering with practical tools that are easily applied to the product development process. The reliability processes presented in this book achieve this goal. The authors logically present the reliability processes and deliverables for each phase of the product development cycle. The reliability theory is thoughtful, easily grasped, and does not include a complex mathematical basis. Instead, concepts are described using simple analogies and practical processes that a competent product development team can understand and apply. Thus, the reliability process described can be implemented into any electronic or other business, regardless of its size or type, and ultimately helps give customers products with superior performance and superior reliability.

Edward Rogas Jr.
Senior Vice President
Teradyne, Inc.


Preface Second Edition

When this book was first published, the primary focus was on improving product reliability, why reliability improvement efforts fail, and how poor reliability negatively impacts current and future business. We discussed the ease with which consumers can research a product to determine consumer satisfaction and discover issues related to product reliability. To improve product reliability, we presented a comprehensive process for product development and an implementation strategy that any business can start. We also discussed ways to change the corporate culture so that it strives to design reliable products.

The importance of designing reliable products has not changed since the book was first published. However, much has changed in regard to the types of products being developed today compared to when the book was first published. The most significant change is the amount of software and firmware required for new products. The other significant change is the number of new products being developed that connect to the internet (the Internet of Things, IoT) to provide ease of use, communicate with other devices, aid in customer support, and update software remotely. The internet provides the consumer with greater ease of use and a better user experience, but brings with it a new set of risks regarding security and privacy.

We changed the book title to Improving Product Reliability and Software Quality to convey the importance of software in product development. There are many books written about hardware reliability and likewise about software quality. The hardware reliability books do not cover software quality and the software quality books do not address hardware reliability. However, successful product development is dependent on the synergy of these two functional groups working well together. Hardware engineers and software engineers are very different and communicate in different languages; therefore, they do not effectively integrate each other’s requirements and dependencies. Assumptions are often made regarding what other functional groups are delivering, which later turn out to be wrong.

Hardware and software engineers look at bugs very differently. The hardware development team strives to release products without any reliability issues and assumes last-minute discoveries will be fixed with software. Hardware requirements can be fully defined and validated to ensure the release of a reliable product. The software development team does not set a requirement for a 100% bug-free product before product release. In fact, for most products, the software requirements and validation cannot define every use condition and possible state. When the software is released, the team is already working on the next update, tier release, or patch.

In addition to software quality, there is also the issue of software security. Many new products access the internet as a way to quickly and efficiently send out software bug fixes and as a way to improve the customer experience through user apps. A good example is the Nest™ programmable learning thermostat. This connectivity raises software security concerns and new challenges that are often overlooked or underestimated. Some products can communicate via Bluetooth and the Global Positioning System (GPS), which also have the potential to be compromised.

Each new generation of electronic products incorporates significantly more software and firmware than the previous generation. This holds for everything from simple products like a home thermostat to complex ones. Even the mix of development engineers required for product development is shifting. The goal of the book is to provide insight, process, and tools to help meet these changing demands.


Preface First Edition

Nearly every day, we learn of another company that has failed. In the new millennium, this rate of failure will increase. Competitors are rapidly entering the marketplace using technology, innovation, and reliability as their weapons to gain market share. Profit margins are shrinking. Internet shopping challenges the conventional business model. The information highway is changing the way consumers make buying decisions. Consumers have more resources available for product information, bringing them new awareness about product reliability. These changes have made it easier for consumers to choose the best product for their individual needs.

As better-informed shoppers, consumers can now determine their product needs at any place, any time, and for the best price. The information age allows today’s consumer to research an entire market efficiently at any time and with little effort. Conventional shopping is being replaced by “smart” shopping. And a big part of smart shopping is getting the best product for the best price.

As the sources for product information continue to increase, the information available about the quality of the product increases as well. In the past, information on product quality was available through consumer magazines, newspapers, and television. The information was not always current and often did not cover the full breadth of the market. Today’s consumers use global information sources and internet chat to help in their product-selection process. An important part of the consumer’s selection process is information regarding a product’s quality and reliability. Does it really do what the manufacturer claims? Is it easy to use? Is it safe? Will it meet customer expectations of trouble-free use? The list can be very long and very specific to the individual consumer.

Quality versus Reliability

From automobiles to consumer electronics, the list of manufacturers who make high-quality products is continuously evolving. Manufacturers who did not participate in the quality revolution of the last two decades were replaced by those that did. They went out of business because the companies with high-quality systems were producing products at a lower cost. Today, consumers demand products that not only meet their individual needs, but also continue to meet these needs over time. Quality design and manufacturing was the benchmark in the 1980s and 1990s; quality over time (reliability) is becoming the requirement in the twenty-first century. In today’s marketplace, product quality is necessary in order to stay in business. In tomorrow’s marketplace, reliability will be the norm.

Quality and reliability are terms that are often used interchangeably. While strongly connected, they are not the same. In the simplest terms:

• Quality is conformance to specifications.
• Reliability is conformance to specifications over time.

As an example, consider the quality and reliability in the color of a shirt. In solid-color men’s shirts, the color of the sleeves must match the color of the cuffs. They must match so closely that it appears that the material came from the same bolt of cloth. In today’s manufacturing processes, several operations occur simultaneously. One bolt of cloth cannot serve several machines. The colors of several bolts of cloth must be the same, or the end product will be of poor quality. Every bolt of cloth has to match to a specified color standard, or the newest manufacturing technologies cannot be applied to the process. Quality in the material that goes into the product is as important as the quality that comes out. In fact, the quality that goes in becomes a part of the quality that comes out.

After numerous washings, the shirt’s color fades out. The shirt conformed to the consumer’s expectations at the time of first use (quality), but failed to live up to the consumer’s expectations later (reliability). Reliability is the continuation of quality over time. It is simply the length of time over which a product continues to meet the standards of quality during its period of expected use.

Quality is now the standard for doing business. In today’s marketplace and beyond, reliability will be the standard for doing business. The quality revolution is not over; it has just evolved into the reliability revolution.

This book is an effort to guide the user on how to implement and improve product reliability with a product life cycle process. It is written to appeal to most types of businesses regardless of size. To achieve this, the beginning of each chapter discusses issues and principles that are common to all businesses, independent of size. We also segregate businesses into three categories based on size: small, medium, and large. Definitions are summarized in Table I.1. The finance department can quantify more precisely the lost revenue due to warranty claims and poor quality. This loss represents the potential dollars that are recoverable “after” the reliability process improvements have been implemented and have begun to bear fruit.

Table I.1 Business size definition. Company size Metric

Small

Medium

Large

Employee count

100 and