Troubleshooting Tips: Isolating Intermittent Faults
By: Kristin Lewotsky, Contributing Editor
Determining the cause of an intermittent fault requires a disciplined approach.
Probably nothing in engineering is more difficult than finding a fault that only shows up occasionally. The problem could lie in the controls hardware, it could be caused by wiring faults, or it could be introduced by a recent software update. The possible sources are endless, but downtime costs money, so the support team needs to isolate and fix the problem as soon as possible.
In this first of a series of troubleshooting articles, we will focus on tips and techniques for finding and correcting the source of an intermittent fault more efficiently. Fault finding is a dynamic process. Determining the cause of a fault obviously depends on the circumstances. That said, there are some engineers who have a natural talent for the task. We queried some experts in the industry to see what tips they had to offer that might help you solve your own problems faster.
What is your first action upon encountering a transient fault? What do you do next?
First, identify the specific fault. Most digital drives today include a fault register that indicates specific faults like Undervoltage, OverCurrent, or Control Error. Given the specific fault, you need to apply a combination of inductive and deductive reasoning. I generally start with inductive reasoning. If, for example, it is an undervoltage fault, I begin to ask questions like, “What can cause this? An undersized power supply? High acceleration? A bad electrical connection?” If the system has been in production for a long time and suddenly intermittent faults start to appear, the first question to ask yourself is what changed: Software? Hardware? The manufacturing process? Given good initial questions, you can then begin to apply deductive reasoning to systematically eliminate different possible causes.
—John Chandler, Vice President of Sales, North America, Technosoft US
I usually gather more information from whoever has been observing the intermittent issue. The first step is to see if it is a known user issue (user power supply, bug fixed in current code, etc.). A good deal of information for tracking down issues can often be had from the line-level technician. I have learned that asking, “What is wrong?” will be turned into “That is what I was asking you!” The question of “What leads you to believe that the unit is not working?” will provide the set of symptoms you are looking for.
—Donald Labriola, President, QuickSilver Controls Inc.
The first thing to do is document the symptoms. What occurs, and when? Is there usually another activity that precedes the fault?
—Ian Hall, Manager, Application Support Team, Siemens Inc.
What is the most common cause of intermittent faults?
There are far too many site- and application-specific conditions to make a broad statement in this regard, but having worked for 25 years on the controls/drives manufacturing side of the business, I can say with confidence that the most common reason for the return of a component product from the field is, "No problem found." Too often repair technicians are placed under pressure not necessarily to find the root cause, but simply to get the equipment running again. Most of the time, they do this by simply swapping parts out until the machine runs. Once the machine is running, they are left with a box of mostly good parts, happy management, and no real understanding of what the problem was. The parts are then returned to their respective suppliers for fault analysis. Sometimes, this is the most practical thing to do, but seldom is the full cost of this approach accounted for. I place no blame with the field technicians for this behavior.
Interconnections at various levels are the most common cause of the faults I see: loose or contaminated connectors, soldering, wire crimping, etc. Next are unwanted interactions, be they ground loops or interactions of multiple processors finding weaknesses in the hardware or software interfaces. Finally, I see interactions related to aliasing in one form or another getting into data.
From my experience, intermittent faults are most commonly hardware related: either a damaged product or a problem with the installation. Software or firmware is the root cause in some cases. In reality, most software problems are not as intermittent as they may seem.
— Jim Wiley, Application Engineering Team Leader, Parker Hannifin Corp.
On hardware (drives, etc.), it’s most commonly something external to the drive, such as the line supply or an external input. In a motion controller or PLC, it’s often race conditions in user logic.
Is there anything special that you do in terms of data logging to try to isolate the activity?
Some good, detailed questioning of those who have observed the intermittent fault: what they were doing and what the system was doing. Understanding what the system was doing when an intermittent fault occurs can help you make an educated guess at what to try to make it fail faster or more consistently. A repeatedly failing system is much easier to debug than one that only fails every few weeks, for example.
Triggering is the key to finding the root cause of an intermittent fault. Most digital drives on the market today offer a built-in oscilloscope tool with triggering capability. From the "under voltage" example above, a good scope can be configured to trigger on the next occurrence of this fault. Once set, the system can be run for hours, then suddenly when a fault occurs again, you can capture say 0.5 s of key data just prior to the event, such as "supply voltage," "motor current," and "motor speed." From this captured data, you can see if perhaps the supply is dropping out only when the motor is under high acceleration and approaching some nominal speed.
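The pre-trigger capture described above can be sketched in a few lines. This is only an illustration of the idea, not any drive vendor's scope tool: the sample rate, the undervoltage threshold, the signal names, and the simulated drive are all assumptions made for the example.

```python
from collections import deque

SAMPLE_RATE_HZ = 1000                       # assumed logging rate
PRE_TRIGGER_SAMPLES = SAMPLE_RATE_HZ // 2   # 0.5 s of history before the fault
UNDERVOLTAGE_LIMIT_MV = 20000               # assumed trip threshold, in millivolts


def capture_pre_trigger(samples, is_fault, pre_samples):
    """Return the last `pre_samples` samples ending at the first fault, or None."""
    ring = deque(maxlen=pre_samples)   # oldest samples drop off automatically
    for sample in samples:
        ring.append(sample)
        if is_fault(sample):
            return list(ring)          # freeze the history at the trigger
    return None                        # ran to the end without a fault


def simulated_drive():
    """Hypothetical signal source: the supply sags only while accelerating."""
    for t_ms in range(5000):
        accelerating = t_ms >= 3000
        yield {
            "t_ms": t_ms,
            "supply_mv": 24000 - 25 * (t_ms - 3000) if accelerating else 24000,
            "motor_current_ma": 8000 if accelerating else 1000,
        }


capture = capture_pre_trigger(
    simulated_drive(),
    lambda s: s["supply_mv"] < UNDERVOLTAGE_LIMIT_MV,
    PRE_TRIGGER_SAMPLES,
)
print(capture[0]["t_ms"], capture[-1]["t_ms"])
```

Because the ring buffer runs continuously and is frozen only at the trigger, the system can run for hours and the captured window still ends exactly at the fault, which is what makes the correlation between supply voltage and motor current visible.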
Oscilloscopes are great tools for trying to understand faults. In addition to hardware oscilloscopes, many software packages that support drives and controllers include soft oscilloscopes. Visualizing the patterns of encoders, inputs, and other signals can often lead to the cause of the issue.
How do you determine whether the fault lies in hardware, software, or wiring?
Software tends to be very repeatable. A software fault may seem intermittent, but normally this is because a very specific set of conditions needs to be present to cause the fault. Once these conditions are present, however, software will always respond in the same pre-programmed logical way. On the other hand, hardware faults tend to be affected by the end environment, which can include cables and other site-specific conditions.
The first step is backtracking from how the error was observed to what is capable of producing that symptom. Pounding the system with tests based on the various educated guesses at the cause (varying power, injecting noise into power systems, or injecting RF into the system) usually clears several suspects. I like a multiband 5-W handheld ham radio with the antenna almost touching the board. If that does not bother your system, you will not usually find that RF is the problem. Physical stress on cabling and connectors as well as vibration and thermal cycling (heat gun and freeze spray) can often chase out wiring and soldering issues. A good, careful look at the physical connections and soldering on a PCB can often locate soldering or contamination issues. Rotating the board to pick up reflections from a strong light source (or the sun through a window) can show which solder points don’t reflect the light like the others; they “wink” at you out of step with the other solder joints, indicating a different contour angle because the solder has not flowed in the same way. These are good places to begin your search.
The first search in software is to go back to a previous “known good” version with a good history and no evidence of the failure. This is particularly useful if you have been able to find a way to get the failure to repeat more frequently. A change in behavior may help locate the suspect code. Errata in chip operation can be much more frustrating, but do not rule it out. If you can document the issue, factory support will often elevate the issue to the designer and root causes as well as solutions can often be obtained.
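When several releases sit between the known good version and the failing one, the search for the offending change can be run as a binary search. The sketch below assumes a hypothetical `fails` check (in practice, whatever stress test makes the failure repeat) and assumes every version after the offending change also fails; the version names are invented for illustration.

```python
def first_bad_version(versions, fails):
    """Binary-search an ordered version list for the first failing one.

    Assumes versions[0] is known good, versions[-1] is known bad, and
    every version after the offending change also fails.
    """
    good, bad = 0, len(versions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if fails(versions[mid]):
            bad = mid      # failure already present: look earlier
        else:
            good = mid     # still clean: look later
    return versions[bad]


# Hypothetical firmware history; the lambda stands in for a real test run.
history = ["1.0", "1.1", "1.2", "1.3", "1.4", "1.5", "1.6"]
broken_since = "1.4"
culprit = first_bad_version(
    history,
    lambda v: history.index(v) >= history.index(broken_since),
)
print(culprit)
```

Each test run halves the remaining range, so even a long release history needs only a handful of runs, provided the failure can be provoked reliably enough to trust a "pass".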
Aliasing, often a hardware issue, can be another difficult item to examine. A key clue is seeing a waveform from one system in another system or data at a different time scale. Knowing the waveforms from various high-frequency, high-power sections, such as power supplies and drivers, can aid in this. One extreme example was power supply ripple in a flash lamp power supply which dropped into hysteretic mode when the HV supply was charged. The hysteretic recharge was on the order of a couple of milliseconds. The data acquired from an optical system using the flash lamp showed the discharge curve expanded out over multiple seconds to multiple minutes, but just occasionally. The shape of the curve, sometimes inverted, was the clue to the problem. A slight redesign of the power supply to produce much lower levels of hysteresis (and thus charging pulses) in the voltage maintenance operation reduced the noise level in normal operation as well as removing the ghost aliasing that had randomly plagued the system.
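To see how a millisecond-scale ripple can show up stretched over seconds, it helps to compute the apparent (aliased) frequency of a tone after sampling. The frequencies below are illustrative assumptions, not measurements from the flash lamp system; only the "couple of milliseconds" time scale comes from the account above.

```python
def alias_frequency(signal_hz, sample_hz):
    """Apparent frequency of a tone at signal_hz after sampling at sample_hz."""
    nearest_multiple = round(signal_hz / sample_hz) * sample_hz
    return abs(signal_hz - nearest_multiple)


# Assumed numbers: ripple near 500 Hz (about a 2 ms period), sampled at 100 Hz.
for ripple_hz in (500.5, 500.0625):
    apparent_hz = alias_frequency(ripple_hz, 100.0)
    print(f"{ripple_hz} Hz ripple appears at {apparent_hz} Hz, "
          f"a period of {1 / apparent_hz} s")
```

The closer the ripple sits to a multiple of the sample rate, the slower the ghost waveform appears, so a small drift in the ripple frequency can move the artifact from seconds to minutes or make it vanish, which fits a problem that shows up "just occasionally".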
How do you determine whether the fault is in response to external conditions such as temperature swings or power draw from a neighboring machine?
Isolation transformers and careful observation of what the neighboring machine is doing when a fault occurs may help. Several observations of the same click or whirr simultaneously with the failure may help with causality. Sometimes use of a thermal chamber can help with temperature swings, but I’ve seen some issues that only happen over very small temperature ranges and can drive one to consider a job in sales!
A shared power supply in a large instrument caused the system to be returned to the factory after every circuit board and almost every cable in the system had been replaced in the field. The culprit was located by listening to the system with a scope on the “victim signal” and watching system operation. The victim pulse was coincident with the click of a solenoid. Visual inspection showed that the catch diode on the solenoid was spiking a 12 V supply that was shared with the victim circuit. The diode was physically mounted to the solenoid, so it was not fixed by changing the boards or harnesses.
What was the hardest transient fault you ever had to isolate and correct? What could you have done differently to either isolate it more quickly or prevent the fault from happening?
Perhaps this was not the hardest, but definitely one of the strangest. After working on a new drive design for over one year, we found on the first pre-production run that 30% had faulted during a 24-hour burn-in test. Upon closer examination, we found that each drive tested would run for an exact period of time and then go through a reset, but the exact time period was different for each drive: drive one would always reset after exactly one hour, 23 minutes, and 42 seconds of running, but drive two would always reset after six hours, 43 minutes, and 12 seconds of running. The problem turned out to be due to a last-minute software update in which we failed to initialize a variable in RAM. This variable would count down from some unknown starting value, which then led to execution of an illegal address, which in turn caused the reset. The reset time was repeatable, but different from one drive to the next, because any given location within a RAM chip will tend to power on with the same value each time, but from one chip to the next, this value will be different.
The fact that the behavior was repeatable told us from the beginning that the root cause was likely software rather than hardware. It also turned out the processor in our drive held a register for determining what caused the reset, which indicated an illegal address rather than one of four other possible causes. Putting two and two together (repeatable problem, reset caused by illegal address), plus some inductive reasoning, we soon realized that we had failed to initialize a variable. We searched the code, found it, and fixed it.
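A toy model makes the pattern easier to picture: a counter that is never initialized starts from whatever its RAM cell holds at power-on, and that content is fixed per chip. The tick rate and the power-on values below are assumptions, chosen only so the model reproduces the two reset times quoted above; the real firmware details are not known.

```python
TICKS_PER_SECOND = 10   # assumed rate at which the firmware decrements the counter

# Assumed power-on contents of the uninitialized RAM cell, fixed per chip.
POWER_ON_VALUE = {"drive one": 50220, "drive two": 241920}


def seconds_until_reset(drive):
    """Count down from the chip's power-on value; reaching zero forces a reset."""
    counter = POWER_ON_VALUE[drive]   # the bug: the code never sets this value
    ticks = 0
    while counter > 0:
        counter -= 1
        ticks += 1
    return ticks // TICKS_PER_SECOND


def as_hms(seconds):
    """Format a duration in seconds as h:mm:ss."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


for drive in POWER_ON_VALUE:
    # Two power cycles of the same drive give the same time to reset.
    first, second = seconds_until_reset(drive), seconds_until_reset(drive)
    print(drive, as_hms(first), as_hms(second))
```

The same drive resets at the same moment on every power cycle, while different drives disagree with each other, which is the signature that pointed toward software acting on chip-specific starting data.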
We experienced intermittent encoder failures that would shut down a large printing press. The fault would reset and the press would run again, but the time between repeat failures would reduce until a hard fault occurred that could not be reset and the encoder had to be changed. When looking for causes of failures like this, you need to list all possible (and improbable) causes and then work to remove as many of these from the list as possible starting with the most probable and easiest to test/eliminate. We ruled out some simple issues early on then moved onto harder possibilities such as EMC interference, vibration etc.
The conclusion of the testing was that the faults were caused by axial loading on the encoder. A design change to the motor mounting on the press, combined with very tight tolerances on the encoder mounting in the motor, had caused an axial load to be applied through the motor shaft. This pressed the encoder against the motor end cap and loaded the encoder axially, which eventually led to the failure of the encoder bearings. The breakdown of the bearings compromised the integrity of the encoder electronics and so caused intermittent faults that, over time, led to a hard failure.
What is the biggest mistake engineers make trying to isolate intermittent faults?
They sometimes panic and give up before actually trying to find the fault. Or, in an effort to fix something quickly, they take the shotgun approach and miss some obvious things that a more systematic approach would uncover. A cool mind and some sound reasoning can go a long way.
I have seen engineers who are very adept at methodically finding the root cause of these problems and many engineers who flail around for days without any real clear direction. The best piece of advice I have is, before beginning each day's troubleshooting, step back and brainstorm with the team on what you are trying to prove or disprove. Often, I see teams spend days and lose focus on what they were really trying to understand, prove, or figure out as the troubleshooting exercise extends. I like to lead a whiteboard discussion and throw out all the possible things we think could be happening and then figure out tests to prove or disprove those ideas. Then we begin to prioritize the tests depending on complexity and overall gut-feel likelihood of what we think is the root cause.
—Bill Allai, Motion Control Principal Engineer, National Instruments
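The whiteboard prioritization described above can be kept as a simple ranked list. The sketch below is one possible way to do it, not the respondent's method: scoring each idea by likelihood divided by test effort is an assumption, and the hypotheses and numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    likelihood: float   # team's gut-feel probability, 0 to 1
    test_hours: float   # estimated effort to prove or disprove


def prioritize(hypotheses):
    """Order tests so that likely causes that are cheap to check come first."""
    return sorted(
        hypotheses,
        key=lambda h: h.likelihood / h.test_hours,
        reverse=True,
    )


# Hypothetical whiteboard contents.
board = [
    Hypothesis("ground loop", likelihood=0.3, test_hours=4.0),
    Hypothesis("loose connector", likelihood=0.4, test_hours=0.5),
    Hypothesis("EMC interference", likelihood=0.1, test_hours=8.0),
    Hypothesis("firmware regression", likelihood=0.2, test_hours=2.0),
]

for h in prioritize(board):
    print(f"{h.name}: {h.likelihood / h.test_hours:.3f}")
```

Writing the list down also gives the team a record of what each day's tests were meant to prove or disprove, which helps keep focus as the exercise stretches on.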
Early assumptions and not keeping an open mind. Too often we make an initial diagnosis of a problem without considering other possibilities. This can lead to excessive time spent hunting for the wrong solution. Engineers need to be willing to step back and answer what may seem like redundant questions. Often something dismissed as irrelevant at the beginning of the troubleshooting process can lead to a solution later when looked at with fresh eyes.
Failing to document steps taken and results seen. When debugging a complex issue under pressure, it’s easy to move quickly and try many things to resolve the issue. After a while, the engineer may start going in circles, applying fixes that have already been tried and rejected. Documenting the steps taken alleviates this and also structures the thought process, leading to a better fault-finding process.
What is your favorite troubleshooting tool? What do you wish you had known when you first started troubleshooting intermittent faults?
I started troubleshooting stuff from about age seven. It is a way of life, trying to fix stuff instead of just replacing parts. A good understanding of the fundamental physics can significantly improve insight into the system. Good tools can help. One of my favorites is a good near-field H probe to chase down ground noises, EMI, and poor switching-power-supply designs. These probes can see the problem from a half inch or more, which can then help locate the particular aggressor signal. A pair of probes can be used to see time correlation between a noise signal and a switching source, again from a distance. Understanding transmission lines and the physics of inductance can help in designing and debugging switching supplies and parasitic problems. Understanding the assembly-level programming for machines programmed in higher-level languages can be essential to tracking down compiler errors or interpreting unusual execution errors.
How do you isolate and fix the fault without either spending an excessive amount of engineering hours on it or causing excessive downtime?
Before going to troubleshoot a problem in the field, spend time in the lab getting familiar with the diagnostic tools available. Also, make sure that you understand the underlying control system structure so that when abnormal behavior is present, you know what signals to record.
If I can provide a replacement unit to move the fault away from the line, I usually first try this approach. If it clears the fault, the scope of the fault has been determined and the line is back up. The faulting unit can then be probed at a more leisurely rate. Scheduled access to the line outside normal hours is sometimes needed if the replacement part fixes the issue but the replaced part seems to no longer cause any faults. If it is a single-unit fault that will not repeat, keeping the unit in quarantine until additional units show issues or until it is believed that it is a single unit issue (such as ESD to an exposed PCB) can be an acceptable approach. This is usually after some time has been spent on other troubleshooting.
Not every fault can be determined within reasonable time and cost limits. Faults from faulty silicon and other bad parts, if very rare, cannot always be resolved. Customer descriptions of what happened to the unit may not always include all the information needed (especially if confession would void the warranty). We had one unit returned for bad communications that we could not find the cause of at first. Then we found a couple of traces that had been vaporized without scorching the PCB. The black anodized cover showed a copper-colored glow. Only after we discovered this and added jumpers to replace the vaporized traces did the customer allow that there was a chance that 120 V had hit the RS232 connections in their cabinet.
Stick to the basics and narrow the scope of the problem.
Within an engineering team, training and cross pollination of knowledge is essential.
I tracked one “memory problem” that was producing an error-correcting code (ECC) error in a new commercial memory board. The problem was intermittent, only happening approximately every six weeks for a particular system. A couple of very experienced engineers had each gone after the problem for some six weeks before I was landed with it; it was coming very close to shutting down a major product line, as the previous-version (working) boards were almost out of stock and a key part was no longer available to produce those boards.
The new memory board design had been well reviewed and did not appear to be the source of the problem. A careful interview of several of the line technicians revealed the instrument was always in the same procedure when the error occurred. The procedure transferred a significant amount of data between a pair of the processors in the system through shared memory. Taking a (scientific) wild guess that the multiprocessor interaction was the key clue, I had the software engineer assigned to the problem make up some custom code that caused both processors to repeatedly read and write to the shared memory as fast as possible. In less than one day from starting, we were able to get the system to fail within 30 seconds rather than six weeks.
The problem was related to some handshake lines with improper pull-ups and the use of the wrong edge of a clock on one of the processors, which reduced the settling time for the handshake line. The stalled processor ended up seeing the remnants of the transfer acknowledge for the first processor, aided by a bit of noise. Changing to the correct clock edge and changing the pull-up/terminating resistor provided proper timing, and the ECC error was eliminated. Take the time to query the first-level technicians; their observations are worth gold!