
[Cover figure: how well a function performs — Desired Functionality, Acceptable Functionality, Below specification or not functioning (NOK)]

The New Science of Fixing Things Powerful Insights About Root Cause Analysis That Will Transform Product and Process Performance

David J. Hartshorne

The New Science of Fixing Things Ltd. United Kingdom

ISBN: 979-8-6264-2368-6
Copyright © 2020 David John Hartshorne and The New Science of Fixing Things Ltd
www.tnsft.com

All rights to illustrations and text reserved by David John Hartshorne. This work may not be copied or translated in whole or in part without written permission of the publisher, except for brief excerpts in connection with reviews or scholarly analysis. Use with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methods now known or developed in the future is also strictly forbidden without written permission of the publisher.


Contents

List of Figures
Preface
1 Diagnosing Machine Behavior
2 Top Down Progressive Searches
3 A Problem Solving Case-study
4 Manifestation of a Natural Law
5 Changing the Questions
6 Completing The Progressive Search
7 Knowledge, Information and Data
8 Review of Key Learning Points
Appendix
Bibliography
Index

List of Figures

1.1 Basic Diagnostic Logic Map.
1.2 Ishikawa (Fishbone) Diagram.
2.1 Integers 1 to 128 organized in sequence into an 8 × 16 matrix.
2.2 1 to 128 matrix separated into prime and composite numbers.
2.3 1 to 128 matrix separated into even and odd numbers.
2.4 1 to 128 matrix separated into those greater than 64 and those not.
2.5 1 to 128 matrix separated into those divisible by 5 and those not.
2.6 Example progressive search for the number 15 summarized in a Search Tree.
2.7 Matryoshka metaphor for stratified characterization of performance.
2.8 Functional Isolation of how a process works.
2.9 Structural Dissection into what a product is made up of.
2.10 Dividing a Device into Energetic Networks Using the z-Strategy.
3.1 Assembly station rejects for low peak force.
4.1 Peak force for individual assemblies produced on 4 pallets.
4.2 Conceptualization of the distribution of peak force for daily groups.
4.3 Conceptualization of the distribution of peak force for monthly groups.
4.4 Conceptualization of the distribution of peak force for the combined population.
4.5 Comparison of populations; within a time period, between time periods, between pallets.
4.6 Comparison of populations after reducing variation within a time period.
4.7 Comparison of populations after also reducing variation between time periods.
4.8 Response of Y (insertion force) over the entire range of X1.
4.9 Response of Y (insertion force) over the entire range of X2.
4.10 Response of Y (insertion force) over the entire range of X3.
4.11 Amplifying effect of square-root-of-the-sum-of-the-squares on the Pareto distribution of slopes.
4.12 An essential Search Policy.
5.1 Exploiting Sparsity of Effects by changing the questions.
5.2 Fresh samples collected on the first afternoon of the workshop.
5.3 Small Multiples, or Multivari plot of insertion force.
5.4 An interdependency manifested in a Matryoshka characterization.
5.5 Concise expression of Matryoshka splits as a Search Tree.
5.6 Force and displacement during the assembly process.
5.7 Structural Schematic of Assembly Station.
5.8 Cartoon of assembly cycle key steps.
5.9 Small multiples plot of force and displacement during assembly.
6.1 Search Tree. Continuing the progressive search.
6.2 Isolation strategy tactical approach 1.
6.3 Isolation strategy tactical approach 2.
6.4 Results typical of the Steep X living in the function.
6.5 Results typical of the Steep X living in the inputs.
6.6 Housing machining cell cartoon.
6.7 Historical rejects by machine.
6.8 Historical rejects by nest.
6.9 Assigning nests to the new observations.
6.10 4 out of 256 possible combinations of nests and pallets.
6.11 Position with respect to press axis is the prime suspect.
6.12 Splitting up housing position with respect to press into housing machining positions and pallet position in the press.
6.13 Nest positions on the machines.
6.14 Relative strength of pallet and nest misalignments.
6.15 The end of the progressive search.
7.1 Information content of two different data sets for the Brake Rotor Case.
7.2 Information content of different data sets for the Warranty Rattle Case.
7.3 The three scenarios that emerge when looking at true performance data.
7.4 A Weibull analysis as it relates to true performance.
7.5 Comparing quality control data and the decomposed performance data for the Brake Rotor Case.
7.6 Comparing process capability data and the decomposed fastener performance data for the Warranty Rattle Case.
8.1 The three scenarios that emerge when looking at true performance data.

Preface

In order to solve any problem, a reasonably deep understanding of how and why things happen is required. This knowledge, however arrived at, provides the capacity to take action. Making the link between what we see happening and such knowledge is a process called diagnosis. But how can we go about diagnosing performance and reliability of engineering systems when the required knowledge is not immediately to hand?

This book shows that truly excellent performance is achievable, and it is not that difficult. These insights are powerful, and yet seem to be largely unknown, almost secret. This is a management overview of what effective diagnosis should entail, and what is possible. I want to highlight both the core principles it took me too long to uncover, and the very small number of strategies that we have found to be both effective and efficient, and to show them being applied with a couple of case-studies. I have used the word uncover because I would certainly not claim any discovery. Most have a long history, and calling our company The New Science of Fixing Things is somewhat tongue-in-cheek. Most of the content of this short book is taken from the introductory chapters of Diagnosing Performance and Reliability [Hartshorne(2019)], which explains how to execute these strategies in much greater detail, and in a very wide variety of situations.

The true essence required some unearthing because there has been a tendency over the last thirty or forty years to bury principles under the flim-flam of step-by-step guides, rules and procedures. Because even the helpful ones are wrapped up into marketable tools that are the product of the management consulting industry, it becomes difficult to see the form of the forest when one is up close to the trees. Some procedures appear scientific when they employ numbers and perfectly reasonable mathematical machinery to process those numbers. It is important to step back and consider whether the underlying strategy is reasonable, or if it is fundamentally flawed, ignoring some key principle. We will see how information is often cleverly extracted from data to give a totally false picture about what is actually happening. Let's make sure we avoid that trap.

We have practiced diagnosis of machine and process performance and reliability problems for nearly 30 years. We run workshops on diagnosis. No techniques are powerful enough to compensate for poor diagnostic strategy and tactics. Good strategy enables ordinary engineers to become extraordinary problem solvers. Problems with any manmade machine, device, product or process that have thwarted the efforts of the most experienced engineers and problem solvers, or that have dogged companies for years, are diagnosed in hours or days using the same fundamental methodology illustrated in this book.

It is possible to diagnose system behavior when the knowledge about what is happening is not immediately to hand. It is also straightforward to flush out gaps in knowledge at the earliest possible stage of product and process development. Identifying and closing those gaps will provide enormous competitive advantage: real excellence instead of disappointing mediocrity.

Further Information

The book Diagnosing Performance and Reliability (ISBN: 978-1-5272-5139-7) is an A4-size hardback of over 300 pages, supported by 316 figures, most in color. It can be found at: www.tnsft-bookstore.com

Readers of this book will get a 10% discount by quoting the code TNSFT10% at checkout.

The New Science of Fixing Things website is at: www.tnsft.com

About the Author and TNSFT

David Hartshorne has now spent more than half of his life focused on the science of diagnosing performance and reliability, from the late 1980s to the present. Having worked in the automotive and aerospace industries, he was, along with Tobias Mack and John Allen of TNSFT, a founding member of Shainin LLC. TNSFT primarily provides a service of diagnosing what appear to be the toughest problems for some of the largest global manufacturing companies. It also runs one-week training workshops in client facilities, during which participants learn to diagnose, topographically, their own real-world problems, which in some cases they may have been struggling with for months or longer.


1 Diagnosing Machine Behavior

Diagnosing Product Quality and Reliability

One of the chief concerns of organizations involved in design, development and manufacturing is product quality (meeting performance requirements well, with a suitable margin of safety) and reliability (continuing to perform well for the life of the product). Marginal performance of some critical component or process has ruined some companies, and for the rest it significantly reduces profits and means that the organization is average, preventing it from meeting its goal of being best-in-class. Whenever there is under-performance in this context, there is a gap in knowledge. Knowledge is a model contained within the mind about how and why things happen that provides the capacity to act effectively. Under-performance needs diagnosing, and diagnosis is a process for building knowledge.

Diagnosis (the means by which critical knowledge gaps are urgently filled) and prevention of poor machine performance and reliability is a difficult job when those tasked with doing it do not have knowledge about the links between symptoms and cause, and the causal mechanisms involved. The usual reason for undertaking diagnosis is being faced with a problem that manifests as a system not performing as intended, not performing at all, or performing only intermittently. The problem may be detected in the manufacturing plant, at time = zero (performance at t0), or it may be detected on an endurance test or, worse, by the customer in the field. These problems become reliability problems (performance at tn). Most organizations continually make a few basic mistakes that result in squandering scarce resources by investigating and fixing the wrong things. The consequences of these problems for a business range from lost sales or reduced profit margins, to bad press or unwanted attention from government agencies, to bankruptcy. I want to explain how to save time and money doing the right things, and at the same time deliver products of truly world-class quality and reliability.

Alternative Approaches to Diagnosis

The definition of knowledge has been the subject of philosophical discussion for thousands of years. However, for our purposes, it is useful to simply take the view that knowledge is a model that is contained within the mind about how and why things happen, and that provides its owner the capacity to act effectively. The purpose of knowledge is to improve our lives. Business enterprises use knowledge to increase value. In order to solve engineering problems, we need to use a specific piece of knowledge called a causal explanation.

A causal explanation is a description of the necessary and sufficient conditions, and the how-why mechanism involved, in producing a specific behavior or effect. As an example of a causal explanation, if we describe how energy is released in the combustion of a hydrocarbon fuel oil, we know that it is a molecular chain of carbon and hydrogen atoms. During combustion, the chain is broken up in the presence of oxygen molecules. The carbon atoms are removed as carbon dioxide molecules, and the hydrogen atoms are removed as water molecules. A great deal of heat is produced because the new molecular bonds are stronger than the original ones; in other words, potential energy is released because the new arrangements are energetically more favorable than the old ones. There is an energy barrier to initiating the reaction, which is why a spark and/or pressure is required. If we want to describe why the energy is released, at the physical level it is because entropy always strives to increase. The why may also be because we wish to create a hot reservoir/cold sink which transfers energy (by heat), some of which we can use to transfer energy by work. Such knowledge is combined with other pieces of knowledge to enable actions to be taken, which could range from fire prevention to designing a combustion chamber.

Causal explanations move in the cause direction asking how and in the effect direction asking why (specifically, what phenomena we expect to observe from a mechanism). This is not the same thing as the Five Why methodology (which really ought to be called the Five How method if we were being precise in our use of language). If used correctly, the Five Why methodology is primarily a way of ensuring that the causal explanation goes deep enough to prevent recurrence of a problem. For example, a car's failure to start might be explained in terms of a loose battery cable, or in terms of the manufacturer's mismanagement, an underlying cause. The notion of root cause refers to the most basic reason for an undesirable condition or problem which, if eliminated or corrected, would have prevented it from existing or occurring. This concept is useful in that it ensures diagnosis is carried out deeply enough. A correct understanding of a problem, in terms of knowledge about the mechanisms involved, is not the same thing as solving a problem. Using insights about the how-why mechanisms, in creative and innovative ways, is also required.

It is worth noting here that diagnosis and experimentation/simulation are complementary applications of explanatory knowledge. Diagnosis is the inverse of simulation and experimentation. Simulation and experimentation are concerned with the derivation of the behavior of the process given its structural and functional aspects. Diagnosis, on the other hand, is concerned with deducing function and structure from the behavior.

One nice simple project centered on machining a large brake rotor on a brand new production line. Scrap was about 15% and the part was expensive, with no rework possible. Parts not to print went in the scrap bin. The overwhelming reason for rejection was flatness exceeding the customer's specification, and management wanted to know the root cause. Later in this book there will be much discussion about the value of characteristics such as flatness, as well as the diagnostic steps involved. However, for the moment we only want to consider root causes and causal explanations. Here is the causal explanation: The clamping system was the type that moved a fixed distance when closed, the force varying to overcome impedance to that movement. The forces were applied through a small contact area on three jaws. By clamping on the cast surface of the hub, which did not have a repeatable shape, the forces varied a lot. The different clamping forces were distributed within the rotor, and how well the part held its shape was a function of its compliance (or stiffness). As the cutting tool wore, a pattern was imposed in phase with the chuck jaws, and once the rotor was released from the chuck, it was no longer flat.

Was the root cause the shape variation of the cast surface? Or was it the number of cycles before changing the cutting tool? Both could be shown to have a statistically significant (actually interdependent) relationship to the flatness numbers quality control used to decide if a part should be accepted or rejected. But of what practical significance are they? In fact, the root cause could also be seen as the decision to no longer machine the hub prior to this operation, and to clamp on a cast surface. But that was necessary to make a profit, as was not producing any scrap parts. There is deeper knowledge in the causal explanation that can be combined with other knowledge. There are more choices than trying to make castings with less shape variation, or limiting the life of the cutting tool. A six-jaw chuck that contacted more than twice the contact area was tested. Not only was the scrap eliminated, but the shape greatly improved, for which the customer was grateful.

For most problems, it is sufficient to use a shallow approach to the diagnosis and causal explanation, known as Symptomatic reasoning, in order to take effective action. Symptomatic diagnosis can be summarized as answering the question what's wrong? Symptomatic Knowledge is gleaned from past experience which establishes connections between symptoms and causes. This knowledge is also referred to as compiled, evidential, history-based or case-based knowledge. This type of knowledge is mostly associated with diagnosis by experts. Diagnosis transpires as a rapid recognition process. An expert knows the cause by virtue of having previously encountered similar cases.

When Symptomatic diagnosis is not viable, because of a knowledge gap, we must use what is known as a Topographic approach to reach the causal explanation. Topography in this context means mapping the elements of a system, and their connections. Topographic diagnosis can be summarized as answering the question what's happening? This book focuses on Topographic diagnosis. The relationships between problematic behavior of a system, Symptomatic and Topographic diagnosis, causal explanation (knowledge) and solutions can be summarized in a simple diagram (Figure 1.1), which is the simplified version of a map that John Allen and I devised in 2006, the year we founded TNSFT together, to help us see how the trees of diagnosis make up a forest.

Diagnosis Based on Symptomatic Knowledge

Symptomatic diagnosis is the one used most commonly. However, it won't work independently with novel faults, or where deeper understanding of performance or reliability is sought. The predominant success of the Symptomatic approach is paradoxically its principal weakness: there is often a significant delay in recognizing those problems that cannot be solved because the appropriate symptom-cause relationships are not known or fully understood. By this stage the diagnosis has effectively descended into a random search based upon guessing, without those involved recognizing it.

Figure 1.1: The Basic Diagnostic Logic Map. There are two ways we can diagnose system behavior. Symptomatic is the fastest approach, making use of knowledge built up by experience. Conversely, Topographic is a model-based approach that provides much deeper knowledge and insight.

There are also many approaches to diagnosis that must be recognized as being designed to continue the Symptomatic approach after it has failed, and indeed accelerate the guessing process in an attempt to flush out potential causes. Those approaches are fundamentally flawed, because one of the main reasons for Symptomatic diagnosis failing is that the knowledge is missing. It is a far better strategy to treat a difficult problem as the result of missing knowledge, rather than as a problem of organizing and reviewing existing knowledge.

The Ishikawa (Fishbone) Diagram is one way to capture existing knowledge and organize it into departmental responsibilities (the main branches). This can then be used as a preventive tool to communicate responsibilities for ensuring control systems and procedures are in place; in other words, using existing knowledge to maintain a level of performance. It is not, however, a tool to support a progressive search.

Figure 1.2: The Ishikawa (Fishbone) Diagram. One way existing knowledge is captured and organized.

Another useful way of diagrammatically organizing existing knowledge is with a fault tree, which directly maps symptoms to the most probable causes. There are several other, less popular methods. Appropriating any of these tools as hypothesis generators for guessing at a root cause is not an effective approach. The objective of such a strategy is to consider the potential impact of as many independent causal variables (Xs) as we already know about, then confirm control systems and procedures are in place, look for correlations, or choose some of them to be tested for their effect on the Y. This is a Bottom-Up approach which squanders scarce resources. No approach based upon Symptomatic knowledge can help if the knowledge is missing. None of these tools are appropriate if we need to adopt a Topographic approach. The branches of an Ishikawa Diagram are not testable, only low level factors are, which we will see is a long-winded way of searching, wasteful of resources and with no likelihood of success if we really do lack some key knowledge. Undertaking a Brute Force Search of a large list takes a lot of resource. If what we are searching for is not even on the list, it is also futile.

It appears difficult to abandon the Symptomatic approach. In his especially important book [Dörner and Kimber(1996)], Dietrich Dörner explains why this is easier said than done. In Chapter two he characterizes the difficulties: the way we think when under pressure to fix things, our incomplete or incorrect understanding about systems, and how we delude ourselves. These things inhibit our ability to get effective, timely and proper answers, and steer us toward a so-called solution that we want because it is convenient, or that we think others want, often suffering catastrophic unintended consequences, many of which should have been obvious. On page 42 he says:

An individual's reality model can be right or wrong, complete or incomplete. As a rule it will be both incomplete and wrong. . . People are inclined to insist they are right when they are wrong and when they are beset by uncertainty. It even happens that people prefer their incorrect hypothesis to correct ones and will fight tooth and nail rather than abandon an idea that is demonstrably false.

This problem is reinforced several times throughout Dörner and Kimber's book.

Employing a Topographic Approach to Diagnosis

Fortunately, although for every effect (Y) there are a very large number of potential causes (Xs), including interdependent responses (also called interactions) between at least two independent Xs, we can exploit some of the nature of systems and of the physical world to undertake a progressive search, working from Y to converge on important Xs. In a Progressive Search, each pass is made within a search space already reduced by constraints from previous passes. Each pass provides more constraints to further reduce the search space for subsequent passes. Employing a progressive search, and the choice of how to decompose the system, can be seen as the diagnostic strategy: we take small and rapid steps, generating information that is relevant, timely, dependable, sufficient and explicit.

As an example of a search space, in the game Guess Who, the players each begin with a set of character cards from which to choose. They then take turns asking yes or no questions about the other player's choice. The set of cards is the search space for this game. In chess, the search space is more complicated: it is all possible valid moves.

The key to effective and efficient Topographic diagnosis is in knowing how to execute a progressive search, from a higher level of abstraction, where groups of equipment and functional systems are considered, to a lower level of abstraction, where the properties and function of individual elements are analyzed. The power of the search depends upon the way the system is decomposed (split) into elements, and the quality of the information available (providing evidence for elimination). The search can be based on either a functional or a structural decomposition hierarchy. A structural hierarchy represents the connectivity of the system's elements: for example, what components a product is made of, and what people and equipment it is made by. A functional hierarchy represents the means-end (how-why) relationships of the system's elements, that is, how a product works and how it is made.
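As a minimal illustration of these two kinds of hierarchy (a sketch only; the element names below are invented and not taken from the book), a structural decomposition can be held as nested containment, a functional one as means-end steps, and each pass of a progressive search keeps only the part of the decomposition consistent with a Yes/No observation:

```python
# Hypothetical hierarchies for illustration only; the element names are invented.
# Structural hierarchy: the connectivity of the system's elements (what it is made of).
structural = {
    "machine": {
        "drive": ["motor", "gearbox"],
        "tooling": ["fixture", "cutter"],
        "controls": ["sensor", "plc"],
        "incoming material": ["castings", "fasteners"],
    }
}

# Functional hierarchy: means-end (how-why) relationships (how the process works).
functional = {
    "produce a conforming part": ["load", "clamp", "machine feature", "measure"],
}

def eliminate(candidates, question, answer):
    """One pass of a progressive search: keep only the candidates that are
    consistent with the truthful Yes/No answer to a well-formulated question."""
    return [c for c in candidates if question(c) == answer]

# First pass at a high level of abstraction: does the dominant mechanism live in
# the equipment, or is it carried in on the incoming material? (Answer: not the material.)
subsystems = list(structural["machine"])
remaining = eliminate(subsystems, lambda s: s == "incoming material", answer=False)
print(remaining)  # ['drive', 'tooling', 'controls'] — the next pass splits these further
```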

Chapter 1 Key Points

There are two different approaches to diagnosis. For most problems, it is sufficient to use a shallow approach known as Symptomatic in order to take effective action. Symptomatic Knowledge is gleaned from past experience which establishes connections between symptoms and causes. Symptomatic diagnosis can be summarized as answering the question what's wrong?

However, Symptomatic diagnosis won't work independently with novel faults, or where deeper understanding of performance or reliability is sought. There is also usually a significant delay in recognizing those problems that cannot be solved because the appropriate symptom-cause relationships are not known or fully understood.

When Symptomatic diagnosis is not viable, because of a knowledge gap, we must use what is known as a Topographic approach, mapping the elements of a system and their connections, and using a progressive search to reach the causal explanation. Progressive searches are explained in the next chapter. Topographic diagnosis can be summarized as working in a step by step fashion towards answering the question what's happening?


2 Top Down Progressive Searches

Imagine a parlor game where the objective is to find an unknown number that is an integer between 1 and 128. The game proceeds by asking questions of one person who knows the number, and the goal is to find the unknown number with the minimum of questions. The person answering can only answer Yes or No to any question, but must tell the truth. The only other rule is that players cannot ask the same question twice consecutively, only changing the argument of the expression. In other words, you cannot ask "is the unknown value greater than 64?" immediately followed by "is the unknown value greater than 96?" if the answer is Yes. That is the same expression (y > x); only the argument, x, has been changed from 64 to 96.

As an example, let's see how we might find the unknown number 15. It is not a Prime number, so that eliminates nearly 1/4 of the possibilities. It is not Even (evenly divisible by 2), so that eliminates 1/2 of what was left (only the number 2 belongs to both eliminated groups).


Figure 2.1: The integers 1 to 128 organized in sequence into an 8 × 16 matrix.

It is not greater than 64, so 7/8 of the numbers are now eliminated, leaving 1/8 of the original search space. It is a multiple of 5, and it is not greater than 32, so the last two questions leave only two possibilities: 25 and 15.

The progressive search can be tracked in a Search Tree that concisely diagrams the eliminated and remaining search space at each step. This turns out to be a powerful tool for developing strategy as the search progresses, as well as for communicating to others what the constraints on the search space are. It is not the same as a fault tree; one cannot replace the other.

The key to the game lies in exploiting fundamental ways that the system of numbers is organized, and phrasing the question carefully to include all remaining possibilities. Organizing the numbers into an 8 × 16 matrix, in sequence left to right, is a way of mapping the system of numbers that allows us to split by column in a similar way to splitting by row.
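This elimination game is easy to sketch in code (a sketch of the game as described here, not code from the book). Each truthful Yes/No answer about the hidden number keeps only the consistent part of the search space:

```python
# Sketch of the 1-to-128 parlor game described above (not code from the book).
def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

search_space = set(range(1, 129))
target = 15  # the unknown number, known only to the answerer

# Each question is a predicate over the remaining search space; the truthful
# Yes/No answer keeps only the consistent group and eliminates the rest.
questions = [
    ("is it prime?",            is_prime),
    ("is it even?",             lambda n: n % 2 == 0),
    ("is it greater than 64?",  lambda n: n > 64),
    ("is it divisible by 5?",   lambda n: n % 5 == 0),
    ("is it greater than 32?",  lambda n: n > 32),
]

for text, pred in questions:
    answer = pred(target)
    search_space = {n for n in search_space if pred(n) == answer}
    ans = "Yes" if answer else "No"
    print(f"{text:28s} answer={ans:3s} remaining={len(search_space)}")

print(sorted(search_space))  # [15, 25] — two candidates left, as in the text
```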


Figure 2.2: The 1 to 128 matrix separated into prime and composite numbers.

Figure 2.3: The 1 to 128 matrix separated into even and odd numbers.


Figure 2.4: The 1 to 128 matrix separated into those greater than 64 and those not.

Figure 2.5: The 1 to 128 matrix separated into those divisible by 5 and those not. The same split as odd or even; only the argument has changed from 2 to 5.


Figure 2.6: Example progressive search for the number 15 summarized in a Search Tree. A powerful tool for developing strategy as well as communicating what the constraints on the search space are.

If the rules didn't prevent it, it would have been possible, with the simple example of a table of numbers, to phrase the question as "is it greater than 64?"; the next question could have been the same, but with the dividing point, or argument, moved: "is it greater than 32?", and so on. In the real world of engineering systems, there are opportunities to ask the same question and move the dividing point, but that single strategy is limited and will not get all the way to the causal explanation, which is why we have chosen an analogy where we can divide by using different groupings. In real-world engineering systems we find that there are four primary groupings that we can readily exploit.


In a progressive search, convergence is by eliminating very large proportions of possibilities. A binary search is an example of convergence by elimination, getting answers to a series of questions for which the answer is Yes or No. The questions must be phrased carefully so that all possibilities not yet eliminated are included. There are a few different ways the set of numbers can be divided into groups, even without referring to the table layout: Odd vs Even, Prime vs Composite, and 1 to 64 vs 65 to 128 are examples.

How can this be applied in the world of products and processes? We know of four natural system organizations that are useful because they provide us with the means to make such binary splits and obtain unambiguous Yes-No answers quickly. This is achieved by:

• A hierarchical (Matryoshka) division, which is a binary search from the lowest to highest stratification of the essential system characteristics, the key being the concept of the machine or process cycle.

• Functionally dividing systems by Isolation, which is a Serial binary search at a high level.

• Structurally dividing systems by Dissection, which is a Parallel binary search at a high level.

• Impedance (Z) lumping and dividing the system's network of energetic properties, which allows a binary search at a low level to be undertaken.

In diagnosis, strategy really boils down to the questions asked. In a progressive search, the question should be expressed as a split of the remaining search space.


Figure 2.7: Matryoshka is a metaphor for a stratified characterization of performance based on a device's cycle. Matryoshka, the single most important strategy, will be explained using the central case-study running through this book.

Figure 2.8: Functional Isolation: the dominant causal mechanism for behavior observed on the output side of any function either lives in that function, or it has been unknowingly carried in on the input materiel.

Figure 2.9: Structural Dissection into what a product (in this case) is made up of, because we will be able to associate the dominant causal mechanism with one or two structural subsystems.


Figure 2.10: All devices are really networks that manage energy. The network can be probed at different nodes to divide the many impedance (Z) elements into groups. In essence, generalizing a strategy commonly used to understand the behavior of electrical systems.


Chapter 2 Key Points

Topographic diagnosis is about using various ways in which systems are naturally organized to divide and eliminate using well-formulated questions in a progressive search. In a Progressive Search, each pass is made within a search space already reduced by constraints from previous passes. Each pass provides more constraints to further reduce the search space for subsequent passes.

We know of four natural system organizations that are useful because they provide us with the means to make binary splits and obtain unambiguous Yes-No answers quickly. We call the splitting strategies that make use of these natural organizations Matryoshka, Isolation, Dissection and Impedance Lumping.


3 A Problem Solving Case-study

Background to the Case

This is the story of a relatively straightforward case: a seemingly low-risk function and a small number of in-process rejects that nevertheless represented an opportunity to save money by reducing waste and eliminating the disruptions caused by an automated assembly line producing a small number of rejects. It is the sort of project completed in four afternoons in the course of one of our in-house training workshops.

This simple project illustrates some very important characteristics of the physical world of engineering, and how to exploit nature together with a progressive search to achieve the performance we want, for the minimum of effort. The other side of the same coin is that there are traps that are easy to fall into that result in squandering scarce resources to achieve mediocrity by investigating and fixing the wrong things.


I originally selected the case as good teaching material because some data from the case that we would not normally use had been given to us, and that data nicely illustrated an extremely important principle. Additionally, the product and process technologies involved were simple enough that folks from any background would be able to follow the story. As I wrote it up, however, it became apparent that this single case demonstrated all of the key interconnected principles that over the last 30 years we have learned are important for achieving the best quality and reliability performance in the global marketplace for manufactured products. Everything else I want to explain is the detail of how to apply these principles effectively; this story is about why they matter, although it does give insight into how to achieve business excellence with a minimum of resources.

The case may seem blindingly obvious at the end, and it may seem clear that it should only take a couple of days from start to finish. But that is not what we find happening. In every organization we have ever worked with, people spend weeks on projects like this, still fix the wrong things at the end, and nothing gets better. On the other hand, even the seemingly most complex cases are just as straightforward, that is, once you see the answer! All performance and reliability problems can be made to be diagnosed like this one, and you can avoid getting to the point where the CEO knows about your problem.

Business Excellence or Business Mediocrity?

The story starts in the Business Excellence Department of a corporation, sadly working very hard to achieve business mediocrity, something which is being repeated hundreds of thousands of times each day around the world. A young engineer is assigned a project to reduce the rejects being generated by an automated assembly line. The engineer was instructed to find the root cause and propose how to fix it.

There is a lot of data available. Everything large or valuable enough is labeled with a QR code or a readable chip, including parts going into and coming out of processes, and fixtures, pallets, nests etc. Huge amounts of data can be stored and accessed with query systems, going back for months or even years. Time stamps on data from process sensor readings mean that process parameters can be linked within the database. So, is the explanation for non-conformances in the database?

32 pallets carry housings through sequential assembly and test operations. The specific operation of interest inserted a shaft and bearing sub-assembly, and a seal assembly, into the housing. Peak force during insertion was monitored and recorded every cycle. A minimum peak force was specified, a common way to ensure a minimum push-out force. If the minimum peak force was not reached during the cycle, the pallet was directed to a reject station and unloaded. Over a month, 1.6% were rejected, and during certain periods, interruptions caused line productivity to fall dramatically to a very low level.

The first thing that our young engineer did was to run some data mining queries (Figure 3.1). It was quickly apparent that half of the total rejects were associated with just 2 out of the 32 pallets. That was easy! Remove the two pallets, write out a maintenance request to overhaul them, job done. Tell the story to the plant manager, pat on the back, and best of all, another project towards certification once the 20 PowerPoint slides are made. Everyone is happy, except for the line supervisor who, a month later, had to spend a whole weekend (a holiday weekend) on overtime trying to catch up because there had been so many rejects for peak insertion force the day before.

Figure 3.1: Assembly station rejects for low peak force. In one month, 14,848 parts were produced with 231 rejects (1.6%). Two pallets accounted for 101 of the rejects (almost half, so fixing them would bring the reject rate down to about 0.8%). But is this a case of too much data but not enough information?
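The kind of data-mining query behind Figure 3.1 takes only a few lines; this is a hypothetical sketch (the column names and sample values are invented, not the plant's actual database schema):

```python
import pandas as pd

# Hypothetical cycle records for illustration; real queries would pull a month
# of data from the plant database (column names here are invented).
cycles = pd.DataFrame({
    "pallet_id":     [1, 1, 2, 2, 7, 7, 7, 12, 12, 12],
    "peak_force_kN": [9.4, 10.1, 7.6, 9.9, 7.8, 7.5, 9.0, 10.3, 9.7, 10.0],
})

cycles["reject"] = cycles["peak_force_kN"] < 8.0   # minimum specification limit

by_pallet = (cycles.groupby("pallet_id")["reject"]
                   .agg(total="count", rejects="sum")
                   .assign(reject_rate=lambda d: d["rejects"] / d["total"])
                   .sort_values("rejects", ascending=False))

print(by_pallet)                 # a couple of pallets dominate the reject count
print(cycles["reject"].mean())   # overall reject rate (about 1.6% in the real case)
```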

But overall, it was felt that there had been a temporary reduction in the number of rejects, so it was decided that this was an opportunity for a second project, with another engineer assigned to find "other" root causes and so make further reductions in the reject rate. At first glance, fixing the two bad pallets had looked like a no-brainer, but we need to examine the behavior more closely. Does the story of an apparently fixed problem returning with a vengeance (at the worst possible time) sound familiar? What's going on? Is this what we call incremental improvement?

Anyway, that was last Friday. Right now it is Monday lunchtime, and both the first engineer in the story and our newly assigned person are taking part in our workshop. We have told them that our expectation is that they will get to the bottom of this problem and know how to fix it, so that in future there will be zero rejects for this specific reject code. What's more, they will be presenting clear evidence to the plant manager and his team that they know what they are talking about. They have four afternoons to finish the job!

Chapter 3 Key Points

Along with failing to recognize the absence of adequate symptomatic knowledge, using counts of rejects and other transformed data as measures of performance during diagnosis is another major trap for those trying to improve performance. It will result in a lot of work and the results will be mediocre. Much of the remainder of this book is used to explain why this is so.


4 Manifestation of a Natural Law

To see what is different about the approach they took during the workshop, and why it is so effective, we are going to look first at some of the data the first engineer kindly shared with us, expose information that is commonly missed, explain why this information is so important, and show how to quickly flush it out in every single project.

In Figure 4.1 we can see four of the thirty-two plots that could be extracted from a database query by pallet. Each plot is for assemblies made on one pallet. In these plots, the vertical axis is the peak insertion force, which is used to decide whether the part is accepted or rejected; the horizontal axis is the time when the assembly operation took place. Each point is one assembly. The thicker horizontal line is the minimum specification limit, 8kN, so we can see how different reject rates occur for each pallet, as well as how they vary over time.

Historical data offers so many tempting avenues to explore that it is easy to spend days on analysis. However, data limited to reject counts gives a terribly distorted picture. The more data there is, the more statistically significant small effects will be found, small effects that have no practical significance. Switching to the full picture of performance is an improvement that gives insight into the truth, but it is still ancient history.


Figure 4.1: Peak force (vertical axis) for individual assemblies vs the time they were produced (horizontal axis) over one month, for 4 of 32 pallets. Each point is one assembly. The thicker horizontal line represents the minimum specification limit, 8kN. Products below this line are rejects.

Big Data, Big Danger! may sound like a glib soundbite, but there are some characteristics of the physical world of manufacturing that can be a trap leading to mediocre results from lots of hard work and potentially spending a lot of money. The flip side is that, if understood, these same characteristics can instead be exploited to really achieve excellence and competitive advantage for lower cost and less effort.

Of the four different pallets, two are the seemingly bad pallets and two others were chosen because they were physically adjacent in the line at this point in time.

We are looking at over 1,800 measurements of peak force. There were almost 15,000 measurements of peak force available for this month alone, which could be analyzed six ways to breakfast. This quantity of data, especially when it is ancient history (which in manufacturing can mean a day old if the product has been shipped), is not helpful, and in fact gets in the way. However, it is useful to this discussion, to demonstrate the characteristics of the physical world that we need to understand.

Figure 4.2: Conceptualization of the distribution of peak force for daily groups of assemblies produced on 2 of the same 4 pallets. The three different shaded bands reflect the fact that some of the population can be seen as comfortably in the desired range (white), some are inside a safety margin - call these marginal (gray) and some are out of specification (black). If we could use colors, we would shade them green, orange and red.

If we look at a population, or even a sub-group, we find that the results are not uniformly distributed along the vertical, or Y, axis. More results are clustered around the group average, with fewer and fewer results the higher or lower we look away from the average. This is typical of how populations are distributed for many characteristics. When the group is large enough, distributions with these characteristics take the form of a bell, and that's its common name. If we slice the bell up between different values of our measured characteristic, we see that the areas of the different slices are proportional to the percentage of the group that falls into each slice. For real-world populations the bell is not symmetrical, but we are just using these bells to illustrate a concept and to think about how these groups are themselves distributed. We have used different shades to visualize how many in a group are bad or rejects (black, normally red), how many are acceptable but marginal (gray, normally orange), and how many are good (white, normally green).

Figure 4.3: Conceptualization of the distribution of peak force for monthly groups of assemblies for separate pallets.

Let's get an idea of how the insertion force performance for each individual assembly is distributed over a day, a week or the whole month, for each individual pallet, as well as the distribution for the line as a whole. It is important to note that the performance of a large proportion of products could be considered to be marginal, and any small downward shift of the population will instantly result in a dramatic increase in the reject rate. What is needed is a Causal Explanation for most of what is happening, and then to use that knowledge to take action that will result in a margin of safety between the performance of the entire population and the minimum required performance.

In order to understand the natural laws we have been referring to, let's first consider how performance varies in a small time period for one pallet, for example three hours.

Figure 4.4: Conceptualization of the distribution of peak force for the combined population of one "good" and one "bad" pallet.

Although the population varies from hour to hour, we can see that there is consistently a wide range of peak force values, typically of the order of 8kN, and this is true for all four of the pallets we can see. Now, comparing time periods for each pallet, we can see that the location of the average for the population also varies, whichever pallet we look at. We can see a range of locations for period averages as the population shifts up and down, a range of movement of around 3kN. Finally, if we look at the distribution of all results for each pallet for the entire month, we can see that there are differences in the locations of the distribution from pallet to pallet. Because of the way that these populations are distributed around their averages (the bell-shaped probability distribution), this results in dramatically different counts of rejects from pallet to pallet. Yet the populations of the best and worst pallet differ by only 1kN!

If we analyzed how peak force for the whole population of 14,848 products was distributed, ignoring reference to manufacturing sequence or pallet, we would in fact see a total range in the order of 9kN. We could improve our analysis somewhat, standardizing the measure of spread and using sigma (σ, or standard deviation) when characterizing the populations and sub-groups, but the essential part of the story would remain exactly the same.

The impact of variation time period to time period is much, much bigger than the pallet to pallet differences over the long run. But the impact of variation within a time period for any of the pallets is far greater still. The impact is so great, in fact, that reducing pallet to pallet variation alone will not make a noticeable difference long term, and even reducing time period to time period variation will barely make any difference at all. So, when the two pallets with the high reject rates were removed, it still left a large proportion of products of marginal performance, and when there was a small downward shift of the population two days before the holiday weekend, there was instantly a dramatic increase in the reject rate, meaning production targets could not be met.
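To see why a small downward shift is so punishing, here is a rough numerical sketch using an idealized normal model; the mean and standard deviation below are assumed values chosen only to give a reject rate in the same ballpark as the story, not the plant's measured numbers.

```python
# Illustrative only: an idealized bell-shaped population of peak forces.
# The mean and sigma below are assumed, not the plant's measured values.
from statistics import NormalDist

spec_min = 8.0                                   # kN, minimum specification limit
population = NormalDist(mu=10.8, sigma=1.3)      # assumed marginal population
shifted = NormalDist(mu=10.8 - 1.0, sigma=1.3)   # same population after a 1kN downward shift

print(f"reject fraction before shift: {population.cdf(spec_min):.1%}")  # ~1.6%
print(f"reject fraction after shift:  {shifted.cdf(spec_min):.1%}")     # ~8%
# A shift far smaller than the within-period spread multiplies the reject count,
# which is why removing two bad pallets did not prevent the bad weekend.
```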

Figure 4.5: Comparison of populations; within a time period, between time periods (for each pallet) and between two pallets. Note that the spread or delta within a time period for any pallet ≈ 8 kN, the delta between time periods for each pallet ≈ 3 kN (but the pattern depends on the pallet) and the delta between pallets over all time periods ≈ 1 kN (but changes depending on the time period). 2

Note: the ratio of deltas is 8 : 3 : 1, but the leverage is 64 : 9 : 1. The total range is √(8² + 3² + 1²) ≈ 8.6. This is because of an important result that is known by the name Square-Root-of-the-Sum-of-the-Squares.

A thought experiment: Instead of reducing the impact of variation coming from the pallets, which the first engineer did temporarily by removing them from the line, let's conduct a thought experiment (one which has been replicated in many, many simulations, and more expensively, in real life). In this thought experiment, we imagine that we have the knowledge to halve the variation within the three hour time period from 8kN to 4kN quickly and cheaply. The result of doing that would be dramatic! We can estimate the combined effect of any number of causal variables by using a well established simple calculation: summing the individual variances (the variance is the square of a population's standard deviation, or σ²) and then taking the square root. This important result applies to any multiple of σ too, and so automatically applies to the whole spread, or delta, which is what we will use to avoid unnecessary statistical analysis of the data. Taking the Square-Root-of-the-Sum-of-the-Squares accounts for all of the positive and negative effects that could cancel each other, as well as the differences in magnitude of effects.
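The root-sum-square arithmetic for this thought experiment (and for the second step described below) takes only a few lines; the deltas used are the approximate 8, 3 and 1 kN values quoted in the text:

```python
from math import sqrt

def combined_delta(deltas):
    """Square-Root-of-the-Sum-of-the-Squares combination of independent spreads."""
    return sqrt(sum(d * d for d in deltas))

# Deltas in kN: within a time period, between time periods, between pallets.
print(combined_delta([8, 3, 1]))      # ≈ 8.6, close to the observed ~9 kN total
print(combined_delta([8, 3, 0]))      # ≈ 8.5: removing the pallet delta barely helps
print(combined_delta([4, 3, 1]))      # ≈ 5.1: halving the within-period delta is dramatic
print(combined_delta([4, 1.5, 1]))    # ≈ 4.4: then also halving the between-period delta
```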

Figure 4.6: Thought experiment: How the picture would change if the delta within time periods was halved.


Overall variation would be reduced from 9kN to the order of 5kN. This is because 9 is an approximation of √(8² + 3² + 1²), and reducing 8 to 4 changes the calculation to √(4² + 3² + 1²), which is a little over 5. With the knowledge we have, we would also be able to ensure that the population was predominantly in the upper half of its previous range. The number of rejects would be reduced, and perhaps we would be able to reduce the variation within each three hour period even more, say to 3kN, but let's look elsewhere as a second step in our thought experiment.

In this step, we imagine that we have the knowledge to halve the variation between time periods from 3kN to 1.5kN cost effectively. The result of doing that would be a good improvement too. Overall variation would be reduced further, from just over 5kN to just under 4.4kN, because √(4² + 1.5² + 1²) is just under 4.4. We would still be able to ensure that the population was predominantly in the upper portion of its previous range, and so achieve our goal of zero rejects, but with a small margin of safety. Perhaps understanding and taking action on the differences between pallets is now worthwhile in order to maintain or improve that margin of safety, especially if it really does just involve a little maintenance.

We can use another viewpoint that is helpful to see what is going on here. Scatter plots of paired data characterize the relative strength of the relationship between two data sets by assigning one axis to each variable. Conventionally, the horizontal axis is referred to as the X-axis and the vertical axis is the Y-axis. Within this convention, the variable assigned to the Y-axis is said to be dependent on the (independent) variable assigned to the X-axis. We can see how changes in X cause corresponding changes in an effect, Y.


Figure 4.7: Thought experiment: How the picture would change further if the delta time period to time period was also halved.

Figure 4.8: The response of Y (insertion force) over the entire range of a key factor, X1 that varies pallet to pallet.

We are able to see how the variable assigned to the X-axis is distributed, how the variable assigned to the Y-axis is distributed, and how the two are related on average. The central question asked using a scatter plot set is: how much of the response (Y) variation is explained by the X variation? Notice that in Figures 4.8, 4.9 and 4.10 we have used the common currency of the Y-axis (in this case, insertion force measured in kN) to combine the variables, and the range of each X-axis over which the slope is measured is always the observable operating range for that variable, no more, no less. The magnitude of each slope must be squared, then summed, and then the square root of that sum taken, in order to find the total combined variation.


Figure 4.9: The response of Y (insertion force) over the entire range of a second key factor, X2 that varies time period to time period.

Figure 4.10: The response of Y (insertion force) over the entire range of a third key factor, X3 that varies within a time period of a couple of hours.

In Figure 4.8 we can see that, although there is a cause-effect relationship between X1 and Y, for any value of X1 there is a considerable spread of Y results. This variation, or scatter, is caused predominantly by X3 varying, less by X2 varying, and even less by variation of any other possible causes. In Figure 4.9, there is a stronger cause-effect relationship between X2 and Y, but the spread we can see for a single value of X2 is still mostly driven by X3, and somewhat by X1 varying, etc. In Figure 4.10, the cause-effect relationship between X3 and Y is the strongest (steepest slope), with the scatter being the result of X2 varying, and less of X1 variation. X3 would be called the Steep X.

Figure 4.11: The amplifying effect of square-root-of-the-sum-of-the-squares on the Pareto distribution of slopes. The strongest cause-effect relationship is called the Steep X. The actual Pareto-type distribution of slopes is pictured on the left; the Pareto-type distribution of leverage is pictured on the right.

Any list of independent variables combines into sub-groups in this way, and any list of independent sub-groups combines in this way. We have not named specific variables yet, but simply sub-groups. We have lumped together all the variables associated with pallets, lumped together all the variables that fluctuate over time periods greater than a couple of hours, and lumped together all the variables that can change at least as fast as from one cycle to the next. These are all real effects, showing that all three sub-groups contain at least one variable (but probably several in each group) that affects the Y response we are interested in. This is always the case. However, fortunately their influences are not equal, and if we look at the slopes of variables affecting any Y response in any project, we find that they conform to a Pareto-type distribution, or a Power Law.


the second steepest, which is considerably steeper than the third; but as we go down the ranked list, the differences start to become smaller. The name given to the Pareto distribution of slopes, and their combination by root sum of squares, is the Sparsity of Effects Principle. It is an extremely important result. It is one of the laws of nature that engineering and manufacturing organizations must understand in order to achieve true business excellence. The Sparsity of Effects Principle means that any effect cannot be evenly attributed to a large number of variables.

Figure 4.12: Essential Search Policy: only use contrasts bigger than 3σ of the variation we see in Y.

There are huge implications for managing resources targeted at performance improvement. One implication is that a diagnosis based upon counting rejects in the tail of a population distribution is almost certain to identify flatter Xs. The fact that rejects are produced on a regular basis, no matter how low the reject rate is, automatically means that much of the population has marginal performance, and spending scarce resources on flat Xs is wasteful.

Performance does not get any better in the long term. Paradoxically, the problem gets worse the lower the reject rate. Achieving business excellence through lots of small incremental improvements is a fallacy; only a step change from controlling the Steep X will get there. Actually, searching out flatter Xs will also be more difficult and time consuming, requiring very large sample sizes. Data mining large, historic samples is a sure way to find flat Xs. The good news is that the Steep X becomes easy to find if we apply two simple policies when we search for it. The first policy is to use only observations in Y to make decisions about the answers to our diagnostic questions. These questions always involve contrasts, so the second policy has to be that we only use contrasts bigger than 3 to 4 σ of the variation we see in Y. Smaller than that, and there is a very high risk of becoming confused as we work from Y to the Steep X, because the flat Xs can appear as large as the Steep X in a small window.
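One way to operationalize the second policy is sketched below. The function and the sample values are illustrative assumptions, standing in for a handful of fresh Y observations and two candidate contrasts; they are not data from the case.

```python
from statistics import pstdev

def is_usable_contrast(y_observations, contrast, multiple=3.0):
    """Return True only if the contrast is bigger than `multiple` times the
    spread (sigma) of the Y observations we can already see."""
    return abs(contrast) > multiple * pstdev(y_observations)

# Hypothetical fresh Y observations (kN) and two candidate contrasts.
y = [38.2, 41.5, 36.9, 40.1, 39.4, 37.8]
print(is_usable_contrast(y, contrast=9.0))  # True: big enough to drive a search decision
print(is_usable_contrast(y, contrast=2.0))  # False: too small, a flat X could masquerade as the Steep X
```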

Chapter 4 Key Points

The Sparsity of Effects Principle means that any effect cannot be evenly attributed to a large number of variables. There will always be a single dominant causal mechanism as a result of the squaring effect on the leverage. For simplicity, when explaining the concept, we talk in terms of a Steep X and the rest being flat Xs. This does not mean that one single variable controls everything, although most of the time all the leverage needed can ultimately be achieved through managing one thing. Sparsity of Effects has a number of far-reaching consequences, which are important to understand if you are aiming for business excellence in engineering or manufacturing. One consequence is that we can obtain unambiguous


answers to well-formulated questions in a progressive search, telling us that the dominant mechanism lives here, not there. This is provided that we strictly apply the policies of only using contrasts bigger than 3 to 4 σ of the variation we see, and of only using the Y. Other consequences will be developed in chapter 7.

5

Changing the Questions

We can exploit Sparsity of Effects by changing the questions to get a fast answer, as in Figure 5.1.

Figure 5.1: Exploiting Sparsity of Effects by changing the questions to get a fast answer, laid out as a Search Tree.


These turn out to be extremely powerful questions; answering this very simple sequence of questions is the first step towards ensuring a search stays focused on the Steep X. In diagnosing system behavior, we wish to employ a progressive search strategy, and the questions define the strategy at each stage. The two sides of each branch contain every possible causal mechanism not previously eliminated. We can obtain answers to each question very quickly, and definitively, without the need for complex analysis. We are exploiting the Sparsity of Effects Principle by searching for the causal mechanism with the most leverage over the total observed variation. The concise, generic way of expressing these questions is to use the concept of total variation being divided up into a hierarchy of family groups within the machine system. Consecutive cycles of the same equipment combination are more closely related than any extended family. Because they are so closely related, there are fewer possible X differences between them, and so much will be eliminated if cyclical variation turns out to be the family group responsible for most of the variation in Y. But if cyclical variation is not responsible for most of the variation in Y, then (if it exists) we look at the slightly more extended family group of different pieces of equipment that carry out the same function. These can be manufacturing devices or products, and because they are the structural elements of our engineering system, the name structural variation is given to this family group. There will be fewer X differences between structural elements functioning in one time period than between time periods. Temporal variation is the least closely related family group. Conveniently, since it will be answered ahead of the next two, cyclical variation is more often the family group responsible for most of the variation in Y than structural variation, and structural variation is more often responsible for most of the variation in Y than temporal variation.


Figure 5.2: Fresh samples collected on the first afternoon of the workshop.

It's another tempting trap to ask the questions in the reverse order just because the data happens to be collected that way by the IT system. Engineers spending hours looking at screens analyzing historical data have fallen into that trap. Handing the job over to artificial intelligence employing the same flawed logic will produce the same flawed answers, only faster. One of the beneficial consequences of Sparsity of Effects is that the Steep X will be exposed even in relatively small observation samples. In contrast, the effect of a flat X needs large samples, and sometimes clever statistics, to be identified. The large amount of data on the improved Y may appear as if it provided some good information. It is certainly a nice illustration of Sparsity of Effects, but it is still historical. To find the real causal explanation we need information that we can act upon immediately.


Starting a Progressive Search

The remainder of the story is what happened over the course of a few hours when we asked the right questions. These questions form the progressive search. Such a search can be very fast, but to get quickly to the causal explanation each question must not only progressively reduce the search space, it must also be answerable quickly: hours rather than days. The question after this one may require some minor intervention in the process to answer it, such as supplying specially chosen components to the station. Historical observations aren't very helpful for this; we always want fresh ones, but not very many. Our engineers collected their first observations at the end of the first afternoon. Three consecutive pallets were chosen. Pallets 04 and 09 still appeared to be occasionally problematic, and they had been put back into the line together. Just in front of them was pallet 12. One result was collected for each of the three pallets; we then waited for the pallets to come around again and collected a second set of three results. This was repeated a third time. In Figure 5.2 the results have been plotted as pallet groups, so that the first three connected points show consecutive assemblies made on pallet 12. None of the nine assemblies were rejected, but all were intercepted before shipping and put aside, just in case we wanted to look at any of them more closely. If not, they could be put into the next shipment. The next morning, just before the classroom part of the workshop started, the engineers collected a second set of three results from each of the same three pallets, again putting the assemblies aside, just in case. After lunch, they repeated the exercise a third time. We also want to be able to see the answer clearly without resorting to crunching numbers.


Figure 5.3: Small multiples, or Multivari plot of insertion force for all three time periods.

Fortunately there is an elegant graphical way of showing very nicely which family the Steep X belongs to, which Edward Tufte [Tufte(2001)] refers to as Small Multiples. This particular type of small multiple also goes by the name of a Multivari plot, named by Leonard Seder [Seder(1950a)][Seder(1950b)]. In the plot of Figure 5.3 we have observed the Peak Force for 27 cycles of the shaft, seal and bearing assembly station. In period i we recorded three consecutive cycles for pallets 12, 04 and 09 (adjacent pallets at that point in time). Cycle to cycle variation uses up most of the expected total range. Sparsity of Effects means that no X belonging to the higher order families can be steeper than a variable belonging to the cyclical family. There is no contribution from pallets (the structural family) detectable in the first nine observations. However, being aware that pallets 04 and 09 had produced higher numbers of rejects even after maintenance, the observations were continued for two more time periods.


Figure 5.4: An interdependency manifested in a Matryoshka characterization.

Period ii told the same story as period i but, interestingly, period iii showed a difference in performance between pallets. The total elapsed time was only one and a half shifts, but recognize that by far the Steepest X was identified within a few minutes; because of Sparsity of Effects, we obtained the same critical information as was available in the data set for the whole month, from a fraction of the observations. We were also able to obtain even more important information from these cycles by recording the complete force-displacement picture, and keeping the most interesting parts for further examination. The Sparsity of Effects result requires that the effects (Y) of the cause (X) variables on the list, or sub-groups, are independent of each other. Sometimes an interdependency between variables exists; that is, the slope of one variable depends upon the value of another variable. Statisticians often refer to such interdependence of effects as interactions. This case illustrates what an interdependency looks like: the impact of the secondary mechanism (that is driving time

to time variation) appears to depend upon which pallet we look at (the tertiary mechanism is driving differences in pallet performance), and vice versa. We can develop tactics for searching out both unknown pieces of the causal explanation that exploit this, and achieve the excellence we seek. In such cases, Sparsity of Effects still applies, but the interdependency must be thought of as an item on the list, rather than the separate variables. We can easily spot an effect interdependency between sub-groups, and it simply means that we have to be a bit more selective when choosing which observations to compare.
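A minimal sketch of the bookkeeping behind these questions is shown below. The layout (three time periods, pallets 12, 04 and 09, three consecutive cycles each) follows the case, but the peak-force numbers are invented for illustration.

```python
from statistics import mean

# Hypothetical peak-force observations (kN), keyed by (time period, pallet),
# each holding three consecutive cycles, just as they were collected.
observations = {
    ("i", "12"): [38.0, 43.5, 40.2], ("i", "04"): [39.1, 44.0, 41.3], ("i", "09"): [40.5, 37.9, 42.8],
    ("ii", "12"): [41.2, 38.3, 43.0], ("ii", "04"): [39.8, 44.1, 40.6], ("ii", "09"): [38.7, 42.4, 40.9],
    ("iii", "12"): [42.5, 39.0, 44.2], ("iii", "04"): [36.2, 40.8, 38.1], ("iii", "09"): [37.0, 41.1, 38.6],
}

def spread(values):
    return max(values) - min(values)

# Cyclical contrast: consecutive cycles on the same pallet in the same period.
cyclical = max(spread(cycles) for cycles in observations.values())

# Structural contrast: between pallet averages within each time period.
periods = sorted({p for p, _ in observations})
structural = max(
    spread([mean(c) for (p, _), c in observations.items() if p == period])
    for period in periods
)

# Temporal contrast: between the time-period averages.
temporal = spread([
    mean([x for (p, _), c in observations.items() if p == period for x in c])
    for period in periods
])

print(f"cyclical {cyclical:.1f} kN, structural {structural:.1f} kN, temporal {temporal:.1f} kN")
```

Whichever family shows a contrast that dwarfs the others is where the progressive search goes next; in this illustration, as in the case, the cycle to cycle contrast uses up most of the range.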

Figure 5.5: Concise expression of Matryoshka splits.

The concise representation of these splits is shown in Figure 5.5. Before asking whether cyclical variation is the family group responsible for most of the Y variation, the first split of the progressive search is to determine whether most of the variation can already be observed within a single cycle. This elemental variation is the most closely related family group that we can see; it has the fewest possible X differences and so provides even more constraints if this is where we can see the most Y variation. Conveniently, it turns out to be by far the commonest family group where


most of the Y variation lives; in well over 80% of cases we find that the mechanism with the greatest leverage can be easily explained by observing elemental variation. Elemental variation is where the action is. It does, however, require us to carefully consider what the Y is, fully grasp the concept of a machine cycle, and determine the spatio-temporal datum framework of the cycle. These topics are discussed in a lot of detail in Diagnosing Performance and Reliability [Hartshorne(2019)].

Figure 5.6: Force and displacement during the assembly process. 15 cycles overlaid, 3 high and 3 low highlighted.

The splitting up of a system's behavior into these generic nested family groups is given the name Matryoshka Strategy, after the nested Russian dolls. Information on how well a process is performing may be obtained by characterizing the product coming out of the process (as dimensions, or non-destructively detectable faults and in-homogeneities). It may also be gained from an energetic characterization, either during the function of the process or of the properties of the product. In this case, the Y response (peak force) being used to accept or reject the product at the assembly station is itself an energetic variable measured during the function of the process. However,


energetic variables come in conjugate pairs, such as voltage and current, torque and angle, pressure and volumetric flow. The full story is contained in a picture of the conjugate pair at different steady-state conditions, or during a cycle.

Figure 5.7: Structural Schematic of Assembly Station.

In this case, the energetic conjugate variable pair of interest is force and displacement during the process cycle. This constitutes the elemental family of variation for observing this process. The results of 15 consecutive cycles have been overlaid in this picture, with the 3 highest and 3 lowest peak-force cycles highlighted and related to the physical position of the tooling during the cycle. The cyclical family is where the Steep X manifests itself, but a close study of the elemental family points towards what is actually happening. It is very rare in any diagnosis that the elemental family is not an essential source of information that makes the diagnosis fast and efficient. In order to include the elemental family in a small multiples graphical display, we must expand a single point on a one-dimensional axis into a two-dimensional plot. These are the same observations plotted in the earlier small multiple (multivari).


Figure 5.8: Cartoon of the assembly cycle at key steps.

We can still stratify cyclical variation by overlaying up to 3 consecutive cycles within each small graphic element. In this case we have reduced the observations from three down to only the highest and the lowest of the sub-group, to make it a little cleaner looking. The structural family (pallet to pallet) has been stratified into rows, and the temporal family is shown as columns. Only the highest and lowest final force from the 3 cycles of each pallet in each period are shown, to de-clutter the plot. What we can learn from this small sample is that, although most of the cycle to cycle variation lives between points C and D in our cartoon, a cycle that is low at point A ends up low, and a cycle that is high at point A ends up high. In other words, the differences between cycles exist for the entire cycle. We can also clearly see that the interdependency between pallet and something changing time

period to time period has the same characteristic, but amplified.

Figure 5.9: Small multiples plot of force and displacement during the assembly process. Highest and lowest from each set of 3.

Chapter 5 Key Points

An example of well-formulated questions, Matryoshka is a hierarchical splitting from the lowest to the highest stratification of the essential system characteristics, the key being the concept of the machine or process cycle. Matryoshka is the single most important strategy we employ and teach; characterization in this manner always reveals multiple constraints on the possible causal explanation, as well as focusing the search on the contrast with the greatest leverage.


6

Completing The Progressive Search

We can use the knowledge gained so far to form another progressive search question, and ensure we can answer it quickly. The picture of what is happening within one cycle (elemental variation) is key to developing tactics for the next question, but the strongest contrast is cycle to cycle.

Figure 6.1: Search Tree. Continuing the progressive search.


Because of Sparsity of Effects, we must focus on explaining that behavior, above all other contrasts that we may be able to see in the characterization, such as the temporal (change from time period ii to time period iii) or structural (differences between pallets) variation. On the other hand, knowledge of that behavior, and the interdependency, means that we can avoid getting confused by carefully selecting the hardware we use to answer the next question in the progressive search.

Figure 6.2: Isolation strategy tactical approach 1.

Figure 6.3: Isolation strategy tactical approach 2.

Of the remaining helpful ways that systems are organized, one split that should be quick to make is a functional Isolation. In this case, we should be able to quickly identify the source of cyclical variation as either the Function of the assembly process or the component parts fed in as Inputs. One tactical approach to this could be to disassemble the parts and feed them back into the system so that the cycle is repeated (on the same pallet) with the same inputs.


Figure 6.4: Results typical of the Steep X living in the function, when the function is repeatedly carried out on one set of originally high, one set of originally low inputs.

In theory, by doing this both for a sample of parts that had a high force and for a sample that had a low force when the function was originally performed, we should see whether or not the result can change by something greater than around 6kN in both directions when the function is repeated. The downside with this tactic is that the parts may have been permanently changed during the original cycle, and it may be logistically difficult to feed parts into an automated process at precisely the right moment. Nevertheless, our knowledge about the behavior within a cycle means that this remains a viable tactical approach, even if it needs some refinement.


Figure 6.5: A function repeatedly carried out on one set of originally high, one set of originally low inputs. Results typical of the Steep X living in the inputs.

Matryoshka Characterization always reveals multiple constraints on the possible causal explanation, as well as focusing the search on the contrast with the greatest leverage. The fact that the cycle to cycle difference is maintained from the start to the end of the cycle suggests that the seal and the shaft-bearing assembly are not prime suspects, even if they cannot be eliminated. To make execution easier, we could let them become part of the Function, and make the split between the housings as the Input and everything else defined as part of the Function. Then, if the answer turns out to be that the Steep X lives in the Function, at least the housings are eliminated, and that would still be a big

step forward. So, we could just repeat the assembly process with the same housing, using new seals, shafts and bearings each time, which would be a lot easier. We need to use the same pallet, since we already know that pallets are part of the causal explanation. However, in cases where the Steep X does live in one of the Inputs, it is usual to then find that the problematic behaviour we are investigating is a function of the spatio-temporal characteristics of the processes producing that particular Input. In fact, there are lots of alternative tactical approaches to obtaining an answer, as long as we keep the focus on the strategy, that is, on the split or question: is it inputs or function? Whatever tactical approach we decide upon, we need to understand how the housing machining cell works. The normal sequence of the cell is:

1. Load Op 10 for machine A, machine most surfaces
2. Load Op 10 for machine B, machine most surfaces
3. Unload Op 20 from A, laser etch, place finished housings in trays for transport to subsequent processing
4. Unload Op 20 from B, laser etch, place finished housings in trays for transport to subsequent processing
5. Transfer from Op 10 to Op 20 for machine A, machine bore
6. Transfer from Op 10 to Op 20 for machine B, machine bore

The cell also has protective cages with lock-outs that allow it to be operated with only one machine as well as with both, allowing access for maintenance, etc. Before we executed the second tactical approach, we decided to exploit the fact that a record was kept of which machine and nest each and every housing had been made on, just to see if we could save a bit of work. If that analysis didn't show


anything, it would not mean that the housing was eliminated (a common mistake), and we would only have lost half an hour.

Figure 6.6: Housing machining cell cartoon.

One important implication of Sparsity of Effects that we already identified is that counting rejects from the tails of distributions is a transformation of data that loses critical information and gives a badly distorted picture of what is really happening. Further, we noted that mining large amounts of historical data is not fast and efficient; it is a sure way of finding flat Xs and perpetuating mediocre performance. Extending the original query that led to the incorrect conclusion about the problem living in the pallets to also consider machining, we obtain the pictures in Figures 6.7 and 6.8, which point more in the right direction, but still leave room for searches based on guessing. The warning about big data, big danger still applies! Instead of mining large amounts of data, we can use its availability in a more focused way. Ask the question not about historical rejects, but about the assemblies we already know a lot about, and that we still have in our hands.


Figure 6.7: Historical rejects by machine. The lighter blocks represent the failure mode when force = zero. Still only considers rejects.

Figure 6.8: Historical rejects by nest. The lighter blocks represent the failure mode when force = zero. Still only considers rejects.

In Figure 6.9 we have labeled the multivari data points with the nest on which each housing was machined. Important things emerge: in time period iii, when the behavior of pallets 04 and 09 was quite different from periods i and ii, the source of housings changed from machine A to machine B. We can also see a strong tendency for assemblies made using housings machined in certain nests to have high insertion force, and others low insertion force.


This is real evidence pointing to the position within the cycle of Op 20 (the nest in which the housing bore is machined) as the driver of almost all of the variation in insertion force.

Figure 6.9: Assigning nests to the new observations.

We can shift the standard of evidence against the nests to beyond reasonable doubt by making a prediction about insertion forces for specific parts, and obtaining the predicted result multiple times. Focusing on just two nests and two pallets, we only need to produce four combinations out of the 256 possible. The combinations should include what are expected to be amongst the highest and lowest results. Focusing on only four combinations is the best way of avoiding confusion as we get closer to a causal explanation. The result in Figure 6.10 was obtained using all new parts; the only intervention necessary was to feed the housings we selected to the pallets of interest. It shows that the whole story must be something to do with the geometrical relationships between pallets and housings (specifically the nest on which the bore was machined).


Figure 6.10: 4 out of the 256 possible combinations of housings from specific nests and assembly pallets.

It is not a complex set of causal factors, but one simple relationship driving the majority of the variation in insertion force, and whilst there are two interdependent elements in the relationship, it is really only the factor associated with position in Op 20 (the machining nests) that matters. Fix the nests and the variation in insertion force will be dramatically reduced, enough to eliminate rejects!
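One way to see what shifting the evidence to beyond reasonable doubt looks like in practice is to write the prediction down before the parts are run, as sketched below. Nest B3 and pallets 04 and 12 follow the case, but the second nest label and all the force values are invented for illustration.

```python
# Two nests by two pallets gives four combinations. Predict the ordering of
# insertion force before running the parts, then compare with what is measured.
predicted_low_to_high = [("B3", "04"), ("B3", "12"), ("A2", "04"), ("A2", "12")]

measured_kn = {
    ("B3", "04"): 33.2,  # best nest on the best-aligned pallet: predicted lowest
    ("B3", "12"): 36.5,
    ("A2", "04"): 42.1,
    ("A2", "12"): 45.8,  # predicted highest of the four
}

observed_low_to_high = sorted(measured_kn, key=measured_kn.get)
print("prediction confirmed:", observed_low_to_high == predicted_low_to_high)
```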

Figure 6.11: Position with respect to press axis is the prime suspect.


The progressive search is almost done. The evidence from the Matryoshka and Isolation splits provides so many constraints that the most likely suspect has to be the relative alignment between the housing bore and the assembly press. Since a specific combination of housing machining nest and pallet is very repeatable, we measured both magnitude and direction for the four combinations in our last test. The measurements show a consistent story of greater misalignment resulting in higher insertion force. In other words, monitoring insertion force turns out to be measuring misalignment above all else. The implication is that the assemblies are either all good, in which case the specification limit must change, or else they are all bad, in which case the bore dimensional specification must change. In either case, this should have been found during the product and process validation phase of development. It would definitely have been found if Matryoshka characterization had been conducted at that stage. We measured housing bores and found the position of each nest with respect to the datum. Then, knowing the position of the bores with respect to the press, we could quickly identify the alignment of each pallet with respect to the press. This confirmed that, indeed, pallet 04 (and also pallet 09, not shown here) were the best aligned pallets. As a second confirmation, pallet 04 was checked by maintenance. This explained why, after checking the pallet, they had simply returned it to production. The best nest positions can be seen to be B3 and B4, and the lowest insertion forces obtained were with the combinations of pallets 04 and 09 with nests B3 and B4 (causing most of the rejects). Pallet 12 turned out to be one of the worst aligned, although we hadn't chosen it that way, because simply looking at rejects (or lack of them) does not discriminate between results that are close to and far from specification.


Figure 6.12: Splitting up housing position with respect to the press into a) housing positions with respect to the machining datum, and b) pallet positions with respect to their datum, which is the locating point in the press.

What is important here is that if all 32 pallets had been corrected, there would have been a small reduction in the range of insertion forces, but quite a large increase in the number of rejects from nests B3 and B4. On the other hand, if all 8 nests had been corrected, there would have been a much greater reduction in the range of insertion forces, and a very much greater increase in the number of rejects. So, once again, we can see the impact and importance of Sparsity of Effects. We can see in Figure 6.13 that the misalignments of the bores are not the result of nests being incorrectly set individually, but rather of the offset and rotation of an entire table with respect to the machine axes.


Figure 6.13: The nest positions on the machines are not random.

What we have to remember is that the bore position with respect to its datum is the result of the relationship between Op 10 and Op 20 on each machine, so we need to determine which one is misaligned. Normally, the housing machined on Op 10 B1 is transferred to Op 20 B1, and Op 10 B4 to Op 20 B4, etc. But it is fairly simple to change the program so that Op 10 B1 goes to Op 20 B4 and vice versa. That would be one way to figure out which table was misaligned. Ultimately, the best course of action is to ensure that both tables are correctly aligned, rather than simply adjusting Op 20 to match Op 10, which could potentially cause, or at least fail to fix, other problems. So, a more effective approach may be to check the alignments of both tables on both machines directly. The full causal explanation: insertion force is proportional to the magnitude of misalignment between the press axis and the housing bore axis; greater misalignment results in greater insertion forces. Although pallet alignments with respect to the press axis contribute, it is the housing bore axis with respect to its datum that drives

the observed variation. This is the result of a misalignment between Op 10 and Op 20 Tables in the housing machining cells. Re-aligning pallets will have little or no effect.
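To make the geometry of this explanation concrete, the sketch below combines a bore-to-datum offset with a pallet-to-press offset into the misalignment magnitude the press actually sees. The offset values are invented; the case only tells us that the nest (bore) contribution dominates and that insertion force rises with the combined magnitude.

```python
from math import hypot

def total_misalignment(bore_offset_mm, pallet_offset_mm):
    """Magnitude of the bore-to-press-axis misalignment, given the bore offset
    from its machining datum and the pallet offset from its locating point,
    each as an (x, y) vector in millimetres."""
    dx = bore_offset_mm[0] + pallet_offset_mm[0]
    dy = bore_offset_mm[1] + pallet_offset_mm[1]
    return hypot(dx, dy)

# Hypothetical offsets: a well-aligned nest versus a nest on the shifted table,
# with the same small pallet offset in both cases.
print(total_misalignment(bore_offset_mm=(0.02, 0.01), pallet_offset_mm=(0.01, -0.01)))  # small
print(total_misalignment(bore_offset_mm=(0.30, 0.18), pallet_offset_mm=(0.01, -0.01)))  # dominated by the nest
```

Because the bore offset dominates, correcting the machine tables moves the whole population, while re-aligning pallets changes very little; Sparsity of Effects again.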

Figure 6.14: The relative strength of pallet misalignment (left plot) and nest misalignment (right plot).

We can further add that, with the current process specifications, correcting the alignments will result in most assemblies being rejected. Therefore, we need to understand the relationship between insertion energy, potential energy stored, and push-out energy, to be able to use the force and displacement sensors of the press to detect assemblies that are at risk. Without this knowledge, we only know that either all assemblies are at risk, or none of them are. Whatever the case, the rejects are no different from the rest of the population as far as the quality and reliability of the product is concerned. Armed with the causal explanation for most of what is happening, we must use this knowledge to take action that will result in a margin of safety between the performance of the entire population and the minimum required performance. That is how business excellence is achieved: a very low risk of producing defective product, for all practical purposes a zero parts-per-million defect rate!


Figure 6.15: The end of the progressive search. Insertion force is proportional to the magnitude of misalignment between the press axis and the housing bore axis; greater misalignment results in greater insertion forces. Although pallet alignments with respect to the press axis contribute, it is the housing bore axis with respect to its datum that drives the observed variation. This is the result of a misalignment between the Op 10 and Op 20 Tables in the housing machining cells.

Chapter 6 Key Points

The key lessons from the story are:

• Regular rejects, even at less than 1%, suggest that the performance of a lot of the population is much poorer than desirable, and lacks an operating margin of safety.

• Counting rejects that are from the tails of distributions is a transformation of data that loses critical information and gives a badly distorted picture of what is really happening.

• Data mining does not suit the purposes of Topographic diagnosis. Mining large amounts of historical data is not fast and efficient; it is a sure way of finding Flat Xs and perpetuating mediocre performance.

• Sparsity of Effects means that it is only possible to achieve the step change in performance that excellence demands by controlling Steep Xs.

• Sparsity of Effects also means that, by using the correct Y response for performance, together with progressive search questions, finding the causal explanation for that performance will be fast and resource efficient.

• The nested questions of Matryoshka are an essential opening move for every investigation. The answers to the questions provide a characterization of product or process behavior that drives the progressive search.

Is it difficult? Will it take a lot of resources? I hope you can see it is less effort than the approach most folks use, and it will only need to be done once. What it does take is insight into fundamental natural laws in order to discern the important few from the trivial many. This project conveniently started with an appropriate and information-rich performance characteristic. We easily transitioned from using data of reject counts to measuring insertion force as the press moved during the cycle. That made the project a good introductory example of a real-world progressive search, and a good illustration of the importance of the Sparsity of Effects Principle. Can the same approach be used to explain broken parts, leaks and visual defects, for example? Can it be used for products that are cast, welded, glued, mixed and injection molded, as well as for assemblies made from machined components? Can it be used on materials with properties such as glass


and rubber? Will it work for batteries, microchips, medical devices and spacecraft? The answer is yes, of course, but in the next chapter we look at a key principle for ensuring that performance is characterized in an information-rich way, in order that every project flows like this one.

7

Knowledge, Information and Data

Performance and information content.

We saw earlier that the specific additional knowledge needed to bridge a gap, and so enable us to improve the performance and reliability of manufactured products, was referred to as a causal explanation. This was defined as a description of the necessary and sufficient conditions, and the how-why mechanism involved in producing a specific behavior or effect. In other words, knowledge is a model, contained within the mind, of how and why things happen, and it provides its owner the capacity to act effectively. Business enterprises use knowledge to increase value. A Fault or Problem may be the result of a complete failure of a piece of equipment, or it might be defined as non-optimal performance or out-of-spec products or components. Fault Detection is recognizing that a problem has occurred, even if we cannot yet explain how it happened (in other words, reached a causal explanation that enables corrective


action to be taken). Fault Diagnosis is converging on the causal explanation. We want new, or at least deeper, knowledge from an effective diagnosis. We have seen how the concept of a root cause is usually too simple and restricts us from acting effectively. A statement that just defines a root cause is usually too shallow. That does not mean that a causal explanation should be complicated. If an explanation seems over-complex because it involves a lot of variables, it is certainly wrong. We need it to go deep enough in the how direction to enable us to act effectively, even optimally if possible. A causal explanation cannot be a list of root causes, either. An understanding of Sparsity of Effects removes that possibility. Integrating new information with, and relating it to, existing experience and ways of thinking (given the label paradigm by Thomas Kuhn [Kuhn(1970)]) is how we acquire knowledge. Knowledge is therefore built from relevant information. Information is generally considered to be data (observations) connected in ways which allow us to make decisions. A diagnosis that follows a progressive search involves first asking effective questions, and then assembling information from data in order to get the answers. Information is separate pieces of data (observations) that, in addition to being accurate and timely, have been connected up in ways so that the information:

• Is specific and organized for a purpose.

• Is presented within a context that gives it meaning and relevance.

• Leads to an increase in understanding and decrease in uncertainty.


Information should be considered of no value if, after receiving it, things remain unchanged. Data are any observations that have been captured and stored. In addition to numbers, data can take the form of electrical signals and images, as well as words, symbols and diagrams. Information is made from key connections between data, not from huge quantities of data. Relevance requires that we have made observations with those connections. If the data connectivity is poor, diagnosis will at best take a long time, and is very likely to fail to get to a useful causal explanation. Information content must be sufficient to answer the formulated questions, and explicit, in that it doesn't need further explanation. The best information will provide more constraints than just the answer to a well-formulated question, further reducing the search space. How should we evaluate data connectivity and relevance? For effective diagnosis (knowledge building), information must be built from data connected to performance, and from data connected to a spatio-temporal framework.

Management information or true performance?

Consider the simple brake rotor machining case mentioned in chapter 1 in the context of causal explanations. We asked about the project when we arrived. "Nearly 15% of machined parts don't meet the specification for flatness after the last grinding operation. We need to fix flatness. We know which machine is the problem, but we need your help in finding the root cause" was the answer.


Figure 7.1: Information content of two different data sets for the Brake Rotor Case. The upper data set is derived from the lower.

Unfortunately, we'll see later that fixing flatness is a poor starting point for diagnosis, and we already know that finding a root cause is a poor end point. Worse still, they were wrong about the machine. A lot of the reasons for simple projects dragging on and on are contained in that single statement. Compare the information content of the two data sets in Figure 7.1. The aggregated information needed by the business unit manager to allocate resources is shown in the top part of the picture. However, they have made the critical error of breaking the count data down by machine (in

the cell, two turning machines each directly feed their own grinding machine), which has had the effect of concentrating minds on differences between machines. Sparsity of Effects tells us this is not a good idea. At the bottom is a picture of how the quality control function is deciding what to accept and what to reject. The value on the Y-axis is maximum minus minimum out of 108 observations recorded automatically as each rotor is rotated (36 readings on each of 3 circles) covering the functional surface. There is more data in the top picture than the lower one, but clearly there has been a transforming loss of information going from the data used by quality control to that used by management. When we understand Sparsity of Effects, the information provided by the performance data that quality control use shows there is not a great deal of difference between machines; they both produce parts ranging from very good to reject. However, there is not enough information in that data to diagnose what is happening either. We'll see later exactly what constitutes enough information. Converting variable data into counted data is a transformation of that data by calculating what proportions fall above and below a particular value. Sometimes this happens because we are recording the occurrence of an event, or detecting a fault, rather than characterizing it. Another simple case involved warranty claims for a part on a vehicle that sometimes developed an annoying rattle after a few months. Isolating the source probably took the dealers' technicians quite a while on the first few cases, when it was a novel problem, but the knowledge soon became symptomatic as it was disseminated within the network. A sheet metal part was screwed to a coated casting which housed a relatively expensive sub-system, and as is typical, whenever a customer complained about a rattle, these sub-systems were removed and eventually returned to the supplier, together with a bill for all the associated costs.


Figure 7.2: Information content of different data sets for the Warranty Rattle Case. The upper data set is not derived from the lower data set in this case, but a relationship was assumed.

Engineers found that the three screws could always be tightened a little by hand; this would eliminate the rattle, and it would not recur. The screws were self-tapping, which meant they cut a thread in the soft housing as they were screwed in. There were two assembly lines, and on both lines the screwing process was fully automatic. Screwdriver rotation was stopped as soon as the torque reached a target value. Very occasionally an assembly was sent to a manual rework

station because the torque target was not reached before a time limit expired. The screw was always missing in those cases. It was decided that the solution was simple: increase the target torque value for the automatic screwdriver by a small amount, inspect the assemblies for a couple of days to be sure everything is OK, and watch the warranty costs related to rattles come down over the next few months. But the costs didn't come down, even after a year of hoping and waiting. Was information from characterizing performance used to get to a causal explanation that could be tested? The cost of the under-performance and the time delay in confirming a fix should have demanded more than the unchallenged acceptance of symptomatic knowledge in this case. Once again, there is not enough information in the data that gives us the picture of the performance distribution to diagnose what is happening. But we can get that information. There are three different scenarios that emerge whenever we look at true performance data, illustrated in Figure 7.3. In each case, the vertical axis is what we are considering to be the true product performance measure, or Y response, developed through this chapter. In this illustration it is generic; it may be a measurement of product functionality or a simple characteristic of a component part, such as a geometry or property. One of the most important things to realize is that the information needed to explain what is happening lies above the line that separates OK functionality from NOK. Hardly anything useful lies below the line, on the NOK side. Data only from below the line does not provide contrast, and data transformed into counts is dangerously misleading. Most management data (which is needed to make decisions about allocating resources) is focused on the NOK side of the line, and it should not be applied to explaining what is happening.


Figure 7.3: The three scenarios that emerge when looking at true performance data. On the left is the most desirable. The two on the right are those that exist when failures and quality control rejects occur. The vertical axis is what we are considering to be the true performance measure, or Y response, developed through this chapter.

This is a major source of confusion in problem solving. On the left of Figure 7.3 is the situation we are striving to achieve. The entire population is operating in the desirable range (we would normally think of this as operating in the green range), which is with a comfortable safety margin between the worst performance and unacceptable (which means anything from being outside of specification to a hard failure). The performance scale has two zones, one where performance is acceptable (OK ) and below some limit, where it is not OK (NOK is a common term used for this). In the middle situation, the population ranges all the way from NOK (shown as dark gray, normally thought of as the red range) to the most desirable (the white range, normally considered green). Between those extremes is a large fraction of the population for which the performance

could be considered to be marginal (light gray, or an orange range). Sometimes exactly this scenario exists, but the manufacturer is blissfully unaware until the fateful day that they start producing NOK products, and either they find them or the customer does. The causal explanation in this situation will at least need to explain what drives the difference between the extremes. Without being able to measure performance, the third scenario, on the right of Figure 7.3, would appear to be the same. It has similar consequences if we are only counting failures. The difference is that the performance of the entire population is marginal. It is quite a different situation, and the solution could involve either a process change or a design change. Figure 7.4 relates Weibull analysis to true performance measurement. The objective is to determine the reliability characteristics and trends of a population using a relatively small sample size. This type of analysis is focused on how the population of failures is distributed with respect to the expected lifetime of products. That could be construed as a performance measure, but it is not. The information is only about the failures. Two different illustrative scenarios are shown. One (broken line) is earlier-than-expected wear-out, and on the right of the picture is conceptually how that would look if true performance with a rapid decay was measured over the product lifetime. The other (solid line) is a pattern of failures referred to as infant mortality, which often results from both of the right-hand scenarios of Figure 7.3. In fact this is the most commonly occurring. The implication is that the failing products have a manufactured weakness; it may surprise you that the most common action is to redesign the product. Weibull analysis should be used as a warning flag. The difficulty is that it is usually too expensive before a product is sold (for all but very simple component durability tests), and too late afterwards.


Figure 7.4: Weibull analysis characterizes a population's reliability characteristics and trends, and is a methodology focused on the failures. On the left is a typical graphical result, where the vertical scale is the natural logarithm of the fraction of the population failing and the horizontal scale is the natural logarithm of a life measure such as time or cycles. The use of logarithmic scales provides a way of characterizing the underlying distributions of failures as a function of time (illustrated in the middle of the figure) based on simple linear parameters. To the right of the figure is shown what could be the underlying true performance.

The fact that the information needed to explain what is happening lies above the line that separates OK functionality from NOK is another very important principle, with huge consequences. This includes where the line is, what safety margin we have or don't have, what the performance variation is, and the ability to find the Steep X using a progressive search. Hardly anything we need to know lives on the NOK side of the line. A measurement that exists only below the line does not provide contrast, and transformed data such as counts will be dangerously misleading. As pointed out already, most management data is focused

on the NOK part of the picture, and this can tend to focus those investigating a problem into that zone too. That is a fatal mistake.

Decomposing performance to see what is happening.

It's all about the Y, but the Y isn't just a number: a progressive search is about converging on a causal explanation for the behavior seen in the performance measure Y, so in diagnosis it is important to think about the Y as much more than just a number associated with a product. For the rotor machining case, the problem with flatness as a starting point for diagnosis is that it is calculated by measuring a series of points without respect to a datum framework, and subtracting the smallest from the largest. This transformation of the data discards all of the critical diagnostic information. Other characterizations of geometric form, such as cylindricity (to ensure a cylindrical feature is round enough and straight enough along its axis) or parallelism between two features, work in the same way. The data that remains after the transformation has relevance to the quality control function, which is to ensure that the requirements the designer has specified are met during manufacture, but it no longer carries the information needed to fix a problem. Most data transformations lose information; some do not. Transformations are often useful, many times essential, for fault detection, but never have enough information for topographic diagnosis. In addition to the transformation of variable data into counted data, many other mathematical functions are used to characterize a set of data. Among the most common are averaging,



calculating standard deviations or ranges, integrating and differentiating, and using logarithms to linearize data. The input of all transformations is one data set; the output is another data set. The value of doing this is that it allows us to compare several large data sets, often for the purpose of system health and fitness checks. The downside is the loss of information. Diagnosis of novel faults needs enough information for us to decompose the machine or process behavior (as much data as possible is not the same thing). We have found that obtaining fresh data is always more efficient than trying to squeeze information out of historical data. For diagnostic purposes, it is important to make sure that information is not lost whenever a calculation is made on data. The key test for the loss of information in a data transform is whether or not the transform is reversible. A machine that creates a plane can be characterized by measuring a point on the part some distance from the proper datum, then two other points in the direction of tool travel to form a segment, then a series of segments which form a line, then a series of lines which form the plane. The results can be easily plotted by hand, which is a good idea if you want to move quickly. If the performance is unsatisfactory, then the causal explanation can be deduced, because the shape plotted is easily related to the particular machine function that created the points in space. The data carries all of the information needed to describe size, position and shape. In the top of Figure 7.5 is the performance data that quality control used to decide between OK and NOK parts in the brake rotor case. In the bottom of Figure 7.5 are the results when we keep just 36 measurements (just one circle) on each of two brake rotors, just for illustration. These were one of the best parts measured (solid lines) and one part that would be rejected (broken lines).
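As a minimal sketch of the reversibility test, consider the readings below, which are invented rather than taken from the case. The flatness number is an irreversible transform of the readings, while the readings themselves, kept in order against the chuck datum, still carry the shape.

```python
import math

# 36 hypothetical height deviations (micrometres) around one circle, one every
# 10 degrees from the 0-degree chuck datum: a three-lobed pattern of the kind
# a three-jaw chuck can impose.
readings = [round(3.0 * math.sin(math.radians(3 * 10 * i)), 2) for i in range(36)]

# Quality-control transform: one number, fine for accept/reject decisions ...
flatness = max(readings) - min(readings)

# ... but irreversible: it cannot be turned back into the angular pattern.
# Keeping the readings in order against the datum preserves that pattern.
high_spot_angles = [10 * i for i, h in enumerate(readings) if h == max(readings)]

print(f"flatness = {flatness} um")                   # 6.0
print("high spots at (degrees):", high_spot_angles)  # [30, 150, 270], 120 degrees apart
```

The evenly spaced high spots are exactly the kind of in-phase pattern, discussed below, that points back at the clamping jaws of the lathe; the single flatness number cannot.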


Figure 7.5: Difference in information content between the quality control data and the decomposed performance data of the Brake Rotor Case.

However, there is additional context: these measurements were made after turning but before grinding. Our symptomatic knowledge is that a shape that exists after a turning operation will not be removed by finish grinding, because grinding changes the surface finish, not the shape. To be sure, we measured one part both before and after grinding, and found that the same shape pattern imposed at turning was indeed still present after finish grinding. You might be tempted to say that


this is only a graphical representation of run-out, but that is not the case. The starting point, 0◦ , is a specific location (datum) in the spatial framework of the machine. This gives the ability to see how the initial locating point varies from part-to-part. In addition, 0◦ is always established with respect to the chuck, since a rotor can be located randomly with respect to the chuck. Keeping track of how the part is located provides the ability to see any pattern the lathe imposes on the part as the geometric form is created, but only if the part is marked in some way. The data has been connected spatially. To connect data in time (not shown here), we also made sure that any rotors we measured had a sequence number or time reference. A small sample taken right after the cutting tool was changed can be compared with a sample taken towards the end of the prescribed life of the tool, with no particular expectation of any temporal pattern. Initially, on the first three parts measured, we plotted three circles of increasing diameter, but the pattern was consistent circle to circle, so we decided to reduce the work and only measure the largest circle, where it was easiest to see the pattern. By retaining the orientation with respect to the chuck, we found that the pattern was non-random with respect to the chuck, and similar on serial parts as long as the cutting tool had enough cycles. The causal explanation in the brake rotor case was that the clamping system was the type that moved a fixed distance when closed, the force varying to overcome impedance to that movement. The forces were applied through a small contact area on three jaws. By clamping on the cast surface of the hub, which did not have a repeatable shape, the forces varied a lot. The different clamping forces were distributed within the rotor, and how well the part held its shape was a function of its compliance (or stiffness). As the cutting tool

wore, a pattern was imposed in phase with the chuck jaws, and once the rotor was released from the chuck, it was no longer flat. What we can see in Figures 7.1 and 7.5 is how information is lost as data is aggregated to suit different purposes. For diagnosing a specific system's performance or reliability, we need more information from a smaller sample of the population. For allocating resources in a business, we need much less information about a much wider population. Between the two are the measures quality control use. It is all too common to find problem solving teams attempting diagnosis using business-level data (the insertion force case), which is doomed from the start. It is extremely common to find engineers diagnosing using the quality control measures. That makes life either extremely difficult, or at least much more difficult than it needs to be. That is because, more frequently than not, the quality control measure is a data transformation that has lost critical information. In the warranty rattle case, more relevant information from a very small sample also led quickly to a causal explanation. The relationships shown in Figures 7.2 and 7.6 are exactly parallel to those from Figures 7.1 and 7.5. The claims count (incidents per thousand vehicles, or IPTV) information was enough to point at manufacturing (because both automatic assembly lines shared good and bad days), but not much more. The torque data was used by the automatic process for the quality control function. The engineers thought they had organized it into information useful for diagnosis by connecting it to products returned because they had a rattle. But what about spatio-temporal connections to the manufacturing process? Assembling and holding things together is ubiquitous in manufactured products. Holding things together always involves storing potential energy, and there are many ways of achieving that.


Figure 7.6: A comparison between the process capability data and the decomposed fastener performance data for the Warranty Rattle Case.

Provided the energy working to force the product apart during its life does not exceed the stored potential energy, all is well. If the operating energy does exceed the stored potential energy, there will be leaks, rattles and worse. An obvious means of storing potential energy in a joint is to use tension and compression, together with the friction leverage of a spiral thread. We have already indicated that energetic functions require a conjugate pair of variables to

characterize them. Recall that force, in units of Newtons (N), was conjugated with linear displacement, in units of meters or some fraction such as millimeters (mm), for the insertion process described in chapter 1. The insertion process was in what we refer to as the translational domain. In the rotational domain, torque (Nm) must be conjugated with angular displacement (◦). Whenever we need to perform calculations involving angular displacement, the units must be the natural dimensionless unit of radians (that is, the angle subtended by an arc the same length as its radius, which is 180◦/π, or a little over 57◦); for most other purposes, using degrees is fine. We also want to see torques and corresponding angles as the process cycle proceeds, not just final values. In Figure 7.6 we haven't shown the full set of observations that constituted a Matryoshka characterization. That characterization did quickly lead to converging on the causal explanation, but for simplicity only the contrast between an unsatisfactory cycle and one that achieved the objective with a good safety margin is shown. In the case of the unsatisfactory cycle, too much energy was used during the screwing process, and because the torque was limited, this resulted in too little energy being stored during final clamping (the stored energies are the shaded areas), even though the joint seemed tight. The message here is that the information that makes diagnosis easy is built out of data that shows true performance and that is connected to the time and spatial (spatio-temporal) framework of the cycle of a device. Devices can be both products and processes. A significant portion of Diagnosing Performance and Reliability [Hartshorne(2019)] is dedicated to showing how this needs to be done.
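As a minimal sketch of using the conjugate pair rather than the final torque alone, the code below integrates a made-up torque-angle trace (it is not the data behind Figure 7.6) and splits the energy between thread forming and final clamping, converting angles to radians as noted above.

```python
import math

def energy_joules(torque_nm, angle_deg):
    """Trapezoidal area under a torque-angle trace, with angles in degrees
    converted to radians so the area comes out in joules."""
    energy = 0.0
    for i in range(1, len(torque_nm)):
        d_theta = math.radians(angle_deg[i] - angle_deg[i - 1])
        energy += 0.5 * (torque_nm[i] + torque_nm[i - 1]) * d_theta
    return energy

# Hypothetical trace for one self-tapping screw: thread forming up to 720 degrees
# of rotation, then clamping until the 4 Nm target torque cuts rotation off.
angle = [0, 180, 360, 540, 720, 760, 800, 840]
torque = [0.0, 1.0, 1.8, 2.4, 2.8, 3.2, 3.7, 4.0]

forming = energy_joules(torque[:5], angle[:5])   # energy spent cutting the thread
clamping = energy_joules(torque[4:], angle[4:])  # energy delivered during final clamping
print(f"thread forming {forming:.1f} J, clamping {clamping:.1f} J")
# A cycle that spends too much of its budget forming the thread reaches the torque
# target with little clamping energy: the joint feels tight, yet it can rattle.
```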

³ [Hartshorne(2019)]
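To make the energy bookkeeping concrete, here is a minimal sketch in Python (the traces and numbers are invented for illustration, not taken from the case): it estimates the energy under a sampled torque-angle trace with the trapezoid rule, converting degrees to radians so that newton-meters multiplied by angle give joules.

import math

def energy_under_trace(torque_nm, angle_deg):
    # Trapezoid-rule area under a torque-angle curve; angles are converted
    # to radians so that N.m multiplied by radians yields joules.
    angle_rad = [math.radians(a) for a in angle_deg]
    energy = 0.0
    for i in range(1, len(angle_rad)):
        energy += 0.5 * (torque_nm[i] + torque_nm[i - 1]) * (angle_rad[i] - angle_rad[i - 1])
    return energy

# Invented torque (N.m) versus angle (degrees) samples for the final
# clamping portion of two cycles. A cycle that wastes energy overcoming
# thread friction reaches the torque limit after very little rotation
# and therefore stores much less energy in the joint.
good_clamp = energy_under_trace([2.0, 6.0, 10.0, 12.0], [0, 20, 40, 60])
bad_clamp = energy_under_trace([9.0, 11.0, 12.0], [0, 5, 10])
print(round(good_clamp, 2), "J stored in the good cycle")
print(round(bad_clamp, 2), "J stored in the unsatisfactory cycle")

The same torque limit is reached in both cases, which is why the joint can seem tight while the shaded area, the stored energy, is far too small.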



Chapter 7 Key Points

For effective and efficient diagnosis, we must characterize true performance, above and below the line that separates OK from not OK output. This is the only Y that carries enough information, but it is important to think about the Y as much more than just a number associated with a product.

For effective diagnosis (which is knowledge building), information must be built from Y data connected to true performance which is connected to a spatio-temporal framework. This performance data may be obtained by characterizing the product coming out of the process (as dimensions, or non-destructively detectable faults and in-homogeneities). It may also be gained from an energetic characterization; either during the function of the process, or of the properties of the product.

8 Review of Key Learning Points

Exploiting These Insights.

This is not a book about tools to add to your toolbox. What I have covered are some basic principles, strategies and concepts that have enabled us, as ordinary engineers, to accomplish what people have told us are extraordinary things, in time-frames considered to be extraordinarily short. The intent was to identify the fundamental ideas and illustrate them being applied.

First, I discussed two different approaches to diagnosis. For most problems, it is sufficient to use the shallow Symptomatic approach to diagnosis in order to take effective action. Symptomatic Knowledge is gleaned from past experience, which establishes connections between symptoms and causes. Symptomatic diagnosis can be summarized as answering the question what's wrong? It is the approach used most commonly.



However, it won't work independently with novel faults, or where deeper understanding of performance or reliability is sought. The predominant success of the Symptomatic approach is also its principal weakness: there is often a significant delay in recognizing the problems that cannot be solved because the appropriate symptom-cause relationships are not known or fully understood.

When Symptomatic diagnosis is not viable because of a knowledge gap, we must use what is known as a Topographic approach, mapping the elements of a system and their connections to reach the causal explanation. Topographic diagnosis can be summarized as working in a step by step fashion towards answering the question what's happening?

I also illustrated the extremely important Sparsity of Effects Principle at work, which means that an effect is never evenly attributable to a large number of variables. There will always be a single dominant causal mechanism, as a result of the squaring effect on the leverage. For simplicity, when explaining the concept, we talk in terms of a Steep X and the rest being flat Xs. This does not mean that one single variable controls everything, although most of the time all the leverage needed can ultimately be achieved through managing one thing.
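As a hedged numerical illustration of that squaring effect (the slopes and standard deviations below are invented, not drawn from any case in this book), combining independent contributions as the square root of the sum of the squares shows how quickly one X comes to dominate:

import math

# Invented (slope, standard deviation) pairs with a Pareto-like spread of slopes.
contributions = {"X1": (5.0, 1.0), "X2": (2.0, 1.0), "X3": (1.0, 1.0), "X4": (0.5, 1.0)}

# Each X contributes (slope * sigma) to the variation in Y; independent
# contributions combine as the square root of the sum of the squares,
# so squaring amplifies whichever X has the most leverage.
variances = {name: (slope * sigma) ** 2 for name, (slope, sigma) in contributions.items()}
total_variance = sum(variances.values())

for name, var in sorted(variances.items(), key=lambda item: -item[1]):
    print(f"{name}: {100 * var / total_variance:.1f}% of the variance in Y")
print(f"Combined standard deviation of Y: {math.sqrt(total_variance):.2f}")

Here X1's slope is only two and a half times that of X2, yet after squaring it accounts for more than four fifths of the variation in Y, which is why managing the Steep X delivers nearly all of the available improvement.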

Sparsity of Effects has a number of far-reaching consequences, which are important to understand if you are aiming for business excellence in engineering or manufacturing. One consequence is that we can obtain unambiguous answers to well-formulated questions in a progressive search, telling us that the dominant mechanism lives here, not there. That is what Topographic diagnosis is all about: using the various ways that systems are naturally organized to divide and eliminate. In a Progressive Search, each pass is made within a search space already reduced by constraints from previous passes, and each pass provides more constraints to further reduce the search space for subsequent passes (a minimal sketch of such a search follows the list below).

We know of four natural system organizations that are useful because they provide us with the means to make binary splits and obtain unambiguous Yes-No answers quickly. This is achieved by:

• A hierarchical (Matryoshka) division, which is a binary search from the lowest to highest stratification of the essential system characteristics, the key being the concept of the machine or process cycle. Matryoshka is the single most important strategy; characterization in this manner always reveals multiple constraints on the possible causal explanation, as well as focusing the search on the contrast with the greatest leverage.

• Functionally dividing systems by Isolation, which is a Serial binary search at a high level, based on the fact that the dominant causal mechanism for behavior observed on the output side of any function either lives in that function or has been unknowingly carried in on the input materiel.

• Structurally dividing systems by Dissection, which is a Parallel binary search at a high level. Structural Dissection into what a product (in this case) is made up of enables us to associate the dominant causal mechanism with one or two structural subsystems.

• Impedance (Z) lumping and dividing of the system's network of energetic properties, which allows a binary search at a low level to be undertaken. Mapping a device as an energetic network (all devices are really networks that manage energy) allows us to probe it at different nodes, and so divide the many impedance (Z) elements into groups. This simply generalizes a strategy commonly used to understand the behavior of electrical systems.
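For readers who like to see the mechanics, the following is a minimal sketch of a progressive search in Python, in the spirit of the numbers game of chapter 2 (the candidate set and the question sequence here are invented for illustration); each answered question is a constraint that shrinks the space left for the next pass.

def progressive_search(search_space, passes):
    # Each pass is a (question, predicate, observed answer) triple. Candidates
    # inconsistent with the observed answer are eliminated, so every pass is
    # made within a space already reduced by the constraints from earlier passes.
    space = set(search_space)
    for question, predicate, answer in passes:
        space = {c for c in space if predicate(c) == answer}
        print(f"{question} {'yes' if answer else 'no'}: {len(space)} candidates remain")
    return space

# Hunting one number among 1 to 128 with binary splits.
passes = [
    ("Is it even?", lambda n: n % 2 == 0, False),
    ("Is it greater than 64?", lambda n: n > 64, False),
    ("Is it divisible by 5?", lambda n: n % 5 == 0, True),
    ("Is it greater than 32?", lambda n: n > 32, False),
    ("Is it divisible by 3?", lambda n: n % 3 == 0, True),
]
print(sorted(progressive_search(range(1, 129), passes)))  # converges on [15]

In a real diagnosis the candidates are places where the dominant mechanism could live, and the questions are answered by observing the output response of the system, never by speculating about causes.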

All of the questions need to be asked by looking at what we call the output response, behavior or performance characteristics of the system being diagnosed. Nothing is asked about causes, or their influence. We are able to find the dominant causal mechanism simply by searching for where it lives.

Another consequence of Sparsity of Effects is that a diagnosis based upon counting rejects in the tail of a population distribution is almost certain to identify flatter Xs. The fact that rejects are produced on a regular basis, no matter how low the reject rate is, automatically means that much of the population has marginal performance, and spending scarce resources on flat Xs is wasteful: performance does not get any better in the long term. Paradoxically, the problem gets worse the lower the reject rate. Achieving business excellence through lots of small incremental improvements is a fallacy; only the step change that comes from controlling the Steep X will get there. Searching out flatter Xs will also be more difficult and time consuming, requiring very large sample sizes. Data mining large, historic samples is a sure way to find flat Xs.

Instead, for effective and efficient diagnosis, we must characterize true performance, above and below the line that separates OK from not OK output. This is the only Y that carries enough information, but it is important to think about the Y as much more than just a number associated with a product. For effective diagnosis (which is knowledge building), information must be built from Y data connected to true performance which is connected to a spatio-temporal framework. This performance data may be obtained by characterizing the product coming out of the process (as dimensions, or non-destructively detectable faults and in-homogeneities). It may also be gained from an energetic characterization; either during the function of the process, or of the properties of the product.

The fact that there are three basic scenarios that emerge whenever we look at true performance data, as illustrated in Figure 8.1, is one of the major insights that can really give a business a competitive edge. Remember, in the illustration the vertical axis is what we are considering to be the true product performance measure, or Y response. The picture is generic; it may be a measurement of product functionality or a simple characteristic of a component part, such as a geometry or property.

One facet of this is that the information needed to explain what is happening lies above the line that separates OK functionality from NOK. Hardly any useful information lies below the line, on the NOK side. Data only from below the line (such as how bad a leak is, how much damage there is, and so on) does not provide contrast, and, most importantly, data transformed into counts (which is most of the data needed by management to allocate resource) will be dangerously misleading and a major source of confusion when we try to associate results with causal mechanisms. This is a dilemma, since management's focus then becomes the focus of those investigating a problem: a fatal path to follow.

Information is lost as data is aggregated to suit different purposes. For diagnosing a specific system's performance or reliability, we need more information from a smaller sample of the population.



Figure 8.1: The three scenarios that emerge when looking at true performance data. On the left is the most desirable. The two on the right are those that exist when failures and quality control rejects occur. The vertical axis is what we are considering to be the true performance measure, or Y response.

On the other hand, management needs much less data from each individual product, but information used by management must reflect the general state of the whole population in order to allocate resources in a business. Between these two requirements sit the measures that quality control uses. It is all too common to find problem solving teams attempting diagnosis using business level data (the insertion force case), which is doomed from the start. It is extremely common to find engineers diagnosing using the quality control measures. That makes life either extremely difficult, or at least much more difficult than it needs to be. That is because, more often than not, the quality control measure is a data transformation that has lost critical information.

You may be thinking, as I used to, that there are not really three scenarios, but that there is every possible combination or shade of gray in between. This is not the case. It actually does turn out to be like this when you take the time to look. We've had 30 years of looking.
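To see why a count from below the line carries so little diagnostic information, here is a minimal worked example (the distributions are invented, not data from any case in this book): two normally distributed populations in very different states of health have exactly the same reject rate, so the count alone cannot tell you which situation you are in.

import math

def reject_rate(mean, sd, limit):
    # Probability that a normally distributed Y falls below the NOK limit.
    return 0.5 * (1.0 + math.erf((limit - mean) / (sd * math.sqrt(2.0))))

spec_limit = 10.0  # Y below this line counts as NOK
print(f"Marginal population (mean 12, sd 1): {100 * reject_rate(12.0, 1.0, spec_limit):.2f}% rejects")
print(f"Healthy but spread population (mean 14, sd 2): {100 * reject_rate(14.0, 2.0, spec_limit):.2f}% rejects")

One population runs marginally close to the line everywhere; the other has genuine headroom but more spread. Both show 2.3% rejects, while a handful of continuous Y measurements would separate them immediately, which is the kind of contrast the scenarios of Figure 8.1 describe.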

What About Avoiding Problems?

I also pointed out that most reliability data, for example that summarized with Weibull analysis, is in reality only counted data from below the line dividing OK from NOK product, and is just as misleading in diagnosing causal mechanisms. This leads to a discussion about how we can use these insights to avoid problems in the first place.

We have diagnosed hundreds of product and process performance problems over the last thirty years. The starting point for some projects was problems identified during manufacture, but just as many have involved the primary function of the final product failing durability testing, or failing in the hands of the customer, both of which come too late to avoid very high costs and, in the case of suppliers, unwanted help and scrutiny from the customer. Despite widespread adoption of measures such as Failure Mode and Effect Analysis (FMEA)¹ and best practice validation testing, OEMs in every sector suffer from growing warranty costs and an increasing number of crises from unforeseen failure modes at product launch.

WE HAVE FOUND THAT, IN MOST CASES, THE FUNCTIONS THAT FAILED WERE ONES FOR WHICH TRUE PERFORMANCE WAS NOT CHARACTERIZED.

¹ It is important to recognize, of course, that an FMEA documents existing knowledge and actions about the risks of failures.



In other words, almost all cases involved failed functions (one level down from the primary function) that were either not tested at all or, more usually, tested in a way where the only information available about performance was pass or fail, or else only failures were characterized (that is, below the OK/NOK line). Conversely, those functions with the lowest risk of failure were those where it was known that there was a margin of safety, and where there was a predictable decay rate.

In truth, much of our work over the last 30 years would have been avoided if the high risk resulting from pass-fail testing had been recognized. Once true performance could be characterized (in order to diagnose rapidly), we see that failures are always easily predicted from a device's energetic behavior, using extremely small population samples at the end of the production process, whether failures are few or many ppm, hard or soft. Evaluating the decay in performance is only a little more work than evaluating performance. It cannot be that, over this long time period and with such a large number of projects, the circumstances we have seen are special in some way. Although our experience is with a diverse range of product types, in many different industries, there are a couple of important things that these projects have in common:

• Most economic impact from performance and reliability failures can be classified as infant mortality.

• Many failure modes are categorized as rare because they were of a low parts per million (ppm) occurrence rate, or else intermittent (frequently mislabeled No Fault Found), or else environmentally induced. Sometimes a failure falls into all three categories.

Engineers often believe that the root cause they seek in these cases is very difficult to find, and that there was very little chance of flushing out such failure modes prior to product launch. Much fruitless effort is expended trying, in complete ignorance, to create the right combination of factors that will reproduce the fault. Instead, it is possible to implement a pre-emptive strategy: flushing out the high risk functions using gateway functions and finding out what drives their behavior. Once this is done, effective surveillance is possible using knowledge milestones from the very first prototype until end of production.

Almost all of the manufacturing organizations we deal with today have initiatives aimed at avoiding problems at the design stage of a product's life cycle. Some have been implementing them for ten years or more, to my knowledge. But still failures occur as frequently as ever. What they are missing is quite simple: measure true performance. Progress could and should be dramatically accelerated. Prizes await those who pull this off first; there are enormous knowledge gaps in every organization we work with, yet many of them also represent the best in the world at what they do.

The only way to start to fill gaps in knowledge is first to find out where they are. The places where you will find them are the functions where true performance is not being characterized. Most organizations that conduct an FMEA will also employ some sort of function mapping. This is the place to start. It is always possible to identify between three and ten functions for any engineered system that can act as gateway functions. These are functions that will show how well subsystems are performing. They must be defined in such a way that true performance can be measured, and this measurement must be implemented as the means of surveillance for the product or process life-cycle.

Final Word.

These principles apply to, and the same approach can be used for, explaining problems as diverse as broken or decayed parts, leaks and visual defects, in addition to products simply not meeting specifications. It does not matter whether products are cast, welded, glued, mixed or injection molded, or whether they are assemblies made from machined components. There is nothing special about materials that have different properties, such as steel, glass or rubber. The same principles apply to batteries, microchips, medical devices and spacecraft as well as cars and trucks. In short, all products and all processes.

After 30 years, there isn't much we haven't seen. Everything that is so-called new technology is similar to something that has come before, or its manufacturing processes are. On the other hand, horrendous amounts of time and resource are expended doing it just like everyone else. There are three things that stand four-square in the way: incorrectly characterizing behavior², not understanding sparsity of effects, and consequently not following a progressive search for a causal explanation.

Exactly how to characterize true performance in the case of broken or decayed parts, leaks and visual defects, how to obtain data connected to the spatio-temporal framework of your system, and how to select gateway functions and set up surveillance is explained in detail in Diagnosing Performance and Reliability.

² Incorrectly characterizing behavior is, and will continue to be, the thing that prevents organizations from achieving the potential that their otherwise sensible initiatives should achieve, given their aim of avoiding problems.

Where Next?

This book has been about insights that can be used to great competitive advantage. In a sense, it has covered why you should approach diagnosis and problem prevention somewhat differently from the herd. The book Diagnosing Performance and Reliability (ISBN: 978-1-5272-5139-7) is an A4-size hardback of over 300 pages, with over 40 case studies covering a very wide range of scenarios, supported by 316 figures, most in color. In addition to covering the reasons why, it is a reference book detailing how practitioners should go about it.

Diagnosing Performance and Reliability can be found at: www.tnsft-bookstore.com

Readers of this book will get a 10% discount by quoting the code TNSFT10% at checkout.

The results we achieve are not too good to be true, and we don't get lucky all the time either. Yes, we are practiced in what we do, but more importantly we stay disciplined and keep things simple. We don't fall back on old habits, and we don't allow those we work with to fall back either. TNSFT provide hands-on training workshops and can guide clients through every step of what appear to be the most intransigent of problems.

The New Science of Fixing Things website is at: www.tnsft.com


Appendix: Further Case Examples


Case Example: Minivan Door Auto-close

Warranty failures due to the automatic sliding door of a van intermittently re-opening for no apparent reason.

Problem
A safety feature meant that if an object such as a hand were in the path of the closing door, the door would reverse direction and reopen. On a small number of vehicles, this reversal would intermittently happen when the door actually should have been locking. Attempts to fix the problem used up huge resources, including over 200 engineering changes over a couple of years, with no improvement.

Strategy and Tactics
Rather than attempt to reproduce an event that was quite rare, even on a known faulty vehicle, it is a much more effective strategy to execute a convergent search by characterizing behavior, then splitting impedances. Easier than it sounds, taking just a few hours with simple data acquisition. The result was simply that cable tensions were not being correctly adjusted during vehicle assembly.

Benefits
Warranty costs for this fault were immediately eliminated for new vehicles, with a simple fix for any residual field failures. The type of behavioral characterization employed on this problem should have been applied to the very first vehicle off the line, flushing out this failure before the product got to the customer. Applies to all products.



Case Example: Liquid Crystal Display Faults

An LCD manufacturer experienced warranty failures where digits were incorrectly displayed due to missing segments. Failures started to occur after 3 months of operation, and were very expensive to repair.

Problem
100% pass rate at end of line testing, but 2% or 3% failures during the warranty period.

Strategy and Tactics
Instead of treating this problem as an attribute (on-off) characteristic, the key is to be able to characterize performance both at time zero and after operation. We need to know how well the device is working, not merely whether it works. It turned out that most of the population performed poorly, but good enough to pass, even at the end of the line. A strategy of splitting impedances was able to identify the causal mechanism within a couple of days, allowing a zero cost solution to be implemented very quickly.

Benefits
The solution did not add cost to the product, and the warranty cost quickly subsided. The type of behavioral characterization employed on this problem should be applied to the very first product off the line. It would have flushed out this failure before the product got to the customer.

Case Example: Biological Assay Performance Variation

A manufacturer was concerned with the performance of microtitre plates. These have an array of wells in which sample reactions are assayed by passing high-intensity light to the wells. The color emitted by the reaction happening in each well is quantified by a detector.

Problem
Plates are coated with the appropriate chemistry for whatever application they are assigned, usually involving 2 to 3 different coats, with washing and drying before and after each coat. Testing is destructive, so quality control involves a sampling plan, with a tight color tolerance for each batch to be shipped. Variation was too large, and needed to be diagnosed.

Strategy and Tactics
An Isolation Strategy was employed, splitting between inputs to the process and the functions of the process. Tactically, this only involved the same number of plates as a regular production test sample, and was completed in a couple of hours. The causal mechanism lived in the first drying operation, not coating or washing. Over-aggressive air flow drying caused the variation.

Benefits
The variability of the production sample tests was reduced by 75%. Previously planned, expensive and time-consuming experiments were avoided.


Case Example: Vehicle Instrument Panel Stepper Motor Failures

Stepper motors costing less than $1 were causing warranty repairs of $1000 because the indicator would stick.

Problem
This intermittent but extremely expensive failure was being investigated by a large team from both the supplier and the customer. They had managed to reproduce the failure by running the device continuously for several hours under severe operating conditions. This meant that every test they performed generally took 1 to 2 days.

Strategy and Tactics
Rather than try to reproduce the attribute failure (a stuck pointer), it is much more effective to characterize the device's behavior. By doing this, it was possible to conduct tests in seconds and, within a small production sample, observe the whole range of performance. We then found that, rather than being rare, stepper motors at risk of sticking are in fact very common. It also meant that the cause of the faulty components in the manufacturing process, an injection molding operation, was easy to converge upon within a couple of days.

Benefits
The explanation for how the problem occurred was found in less time than it took to run a single attribute test, dozens of which were planned. Following the diagnostic investigation, the behavioral characterization employed was then applied to the test samples submitted for reliability validation. Some of these samples had poor performance, but they had passed the test. The huge warranty costs could have been easily avoided.


Case Example: Vehicle Brake Failure (ABS)

Complete braking system failure on vehicles within days of assembly.

Problem
The brake pedal would go to the floor, causing vehicles to crash when being driven onto rail cars at the assembly plant (a small number, 1 or 2 in 1000). Brakes would function normally when the pedal was pressed subsequently. No one had succeeded in reproducing the failure.

Strategy and Tactics
If the problem cannot be reproduced, even on a known failure, it is very unlikely that tearing down the assembly will reveal anything; it is more likely that any clues would be destroyed. We need to characterize functionality on vehicles before somebody puts their foot on the pedal. A preliminary mapping of how the system functions indicated that we should start by characterizing the ABS valves. This revealed a large difference between impedance upon driving the vehicle off the assembly line and impedance of the same valves after sitting in the parking lot at −20 °C for two days. There were also large differences between ABS modules from vehicle to vehicle. With this knowledge, and parts from the bad end of the spectrum available, we succeeded in recreating the problem in a refrigerator. Rust-preventer had reacted with brake fluid, then crystallized at low temperature. The crystals would go back into solution when the fluid moved or warmed up, but not before a catastrophic failure.

Benefits
The solution was to avoid that combination of brake fluid and rust-preventer formulations, and the problem was eliminated. Although the problem would have been more difficult to detect, standard reliability testing did include taking the products below −20 °C; therefore, if performance had been tested as we tested it, the problem would likely have been found at an early stage.


Case Example: In-Process Wrinkling

Widespread in-process wrinkling of reinforced polymer sheet rolls.

Problem
In-process wrinkling and related phenomena cause a great deal of disruption and wasted material in a wide range of processes and materials. This large corporation suffered the problem, albeit infrequently, in all of its plants world-wide. With apparently a lot of symptomatic knowledge of cause-effect relationships, why should this be a problem? The company lacked a strategy and tactics with which to attack the problem.

Strategy and Tactics
The nature of a sheet wrinkle gives some indication of the direction of the stresses involved. Wrinkles form as a result of energy being contained and released in the sheet, specifically due to compressive forces. For one particular wrinkle, the force differential was across the roll, the maximum being at a slight angle. All wrinkles had precisely the same angle, indicating a consistent differential. To characterize energy storage for the sheet, we need to think in terms of displacement as well as force. Doing this, it is possible to execute a progressive search without actually seeing a wrinkle. First, we isolated where in the process the cross-direction forces are greatest with respect to process-direction forces, without at first worrying about what causes the differential gradient. At the end, we proved that we had control over the formation of wrinkles, from a causal explanation that fitted all constraints.

Benefits
This particular problem was addressed during the afternoons of a four-day workshop, yet had consumed a great deal of resource over many years. The results applied to the same process in other plants, eliminating much disruption. More importantly, the company now had a strategy and tactics, and applied them to many more wrinkle problems with different causal mechanisms.


Bibliography

[Dörner and Kimber(1996)] Dietrich Dörner and R. Kimber. The Logic of Failure: Why Things Go Wrong and What We Can Do to Make Them Right. Perseus Books, Reading, MA, 1996. ISBN 0-201-47948-6.

[Hartshorne(2019)] David J. Hartshorne. Diagnosing Performance and Reliability. The New Science of Fixing Things Ltd, UK, 2019. ISBN 978-1-5272-5139-7.

[Kuhn(1970)] T. S. Kuhn. The Structure of Scientific Revolutions (2nd ed.). University of Chicago Press, Chicago, USA, 1970.

[Seder(1950a)] Leonard A. Seder. Diagnosis with Diagrams, Part I. Industrial Quality Control, Vol. 6, No. 4, pp. 11-18, Jan 1950.

[Seder(1950b)] Leonard A. Seder. Diagnosis with Diagrams, Part II. Industrial Quality Control, Vol. 6, No. 5, pp. 7-11, Mar 1950.

[Tufte(2001)] Edward R. Tufte. Envisioning Information. Graphics Press, USA, 2001.


Index

bore machining case, 62
brake rotor case, 4, 74, 82, 84, 85
cartooning, 51, 60
data mining, 27, 41, 92
data transformation, 82, 85
diagnostic logic map, 6
diagnostic strategy, 10
experiments (and diagnosis), 4
fault tree, 8
gateway functions, 97
in-homogeneities, 51
insertion force case, 25
interdependencies, 10, 48
intermittent (and rare) faults, 96
Ishikawa diagram, 8
knowledge, 1, 2
knowledge milestones, 97
leverage, 44
Matryoshka
    characterization, 59
    cyclical variation, 44
    elemental variation, 50, 55
    structural variation, 44
    temporal variation, 44
multivari plot, 45, 46, 62
no-fault-found, 96
numbers game, 13
paired data plot, 37
paradigm, 72
Pareto distribution, 39
progressive search, 10, 44, 46, 64, 67, 72, 81
rare (and intermittent) faults, 96
search space, 10, 14
search tree, 14, 44
simulation (and diagnosis), 4
small-multiples plot, 46, 52, 53
sparsity of effects, 40, 41, 90
spatio-temporal (framework), 50, 59, 73, 85, 88, 93
Steep X, 39
stratification, 18, 53, 91
symptomatic
    diagnosis, 5, 11, 89
    knowledge, 5, 6, 11, 77, 83, 89
topographic diagnosis, 6, 11, 90
true performance, 73, 79, 88, 93
warranty rattle case, 77, 85
Weibull analysis (and true performance), 79, 95
