Payload EM Short Root Cause Analysis

On Monday 1st of February 2021, the IRIS Payload EM board experienced a short-circuit event. This root cause analysis (RCA) aims to determine what caused the short and to identify actions that will prevent this and similar incidents in the future.

Remember: This is NOT to assign blame; it is to make sure issues are found so we can learn from them and make sure they don't happen again.

1. Test Sequence Timeline

Friday January 22nd: Matt brought the EM board back to campus, hooked it up to the variable power supply, and attempted to communicate with it from a computer over JTAG, but could not get the board to communicate. (This was later found to be because the JTAG connector had been plugged in backwards.)

Monday January 25th: Ali and Joseph discovered that the JTAG was plugged in backwards and corrected this error. Around 1 PM Ali and Joseph read the board-mount thermistor voltage and found that the ADC reading was 4000-something, which corresponded to 2.96 V. The voltage differential (dV) across the thermistor was 0.04 V. Ali and Joseph used the resistance equation on the Thermal ICD wiki page to calculate the resistance as 10,540 Ohms. They then used the temperature equation on the Thermal ICD wiki page to calculate the temperature as 296.965 K (23.8 C). This appeared to be correct, as it was slightly higher than the room temperature (22 C); board heat dissipation could account for the difference.
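As a rough cross-check of the conversion chain described above, the following Python sketch converts an ADC count to a voltage and a thermistor resistance to a temperature. It is only a sketch under stated assumptions: a 12-bit ADC referenced to 3.0 V (suggested by the ADC_VRef (3V0) line, but not confirmed here) and a Beta-model NTC with a hypothetical 10 kOhm nominal resistance at 25 C and a hypothetical Beta of 3950 K. The Thermal ICD wiki equations are not reproduced in this document, so the numbers from this sketch should not be expected to match them exactly.

```python
import math

# All constants below are assumptions for illustration, not values taken
# from the Thermal ICD wiki page.
ADC_FULL_SCALE = 4095     # assumption: 12-bit ADC
ADC_VREF_VOLTS = 3.0      # assumption: 3V0 reference
R0_OHMS = 10_000.0        # assumption: nominal NTC resistance at 25 C
T0_KELVIN = 298.15        # 25 C expressed in kelvin
BETA_KELVIN = 3950.0      # assumption: hypothetical Beta constant


def adc_counts_to_volts(counts):
    """Convert a raw ADC count to volts for the assumed ADC."""
    return counts / ADC_FULL_SCALE * ADC_VREF_VOLTS


def ntc_temperature_kelvin(resistance_ohms):
    """Beta-model estimate of NTC temperature from its resistance."""
    inv_t = 1.0 / T0_KELVIN + math.log(resistance_ohms / R0_OHMS) / BETA_KELVIN
    return 1.0 / inv_t


if __name__ == "__main__":
    # 4040 is a placeholder for the "4000-something" reading.
    print(adc_counts_to_volts(4040))
    print(ntc_temperature_kelvin(10_540.0) - 273.15)
```

A check like this is mainly useful for catching gross formula errors, such as a calculated temperature that moves in the wrong direction when the measured resistance drops.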


Wednesday January 27th: We worked on CAN communication in the morning and then tried verifying the part-mount thermistor circuit, substituting a 10 kOhm resistor for the part-mount thermistor, which had not yet been ordered. It took a while to get the resistor and come back. There were also issues because the two thermistor circuits were labelled incorrectly on the wiki (schematic and pin assignment page), so we had the board-mount and sample-mount thermistors mixed up. The pin assignment is correct on the schematic in the folder: "Iris->Payload->Payload Avionics->Phase C ->Documents" (also the Altium project). Measurements support that they are labelled in reverse on the wiki.


After a break from the CAN work, we went back to the thermistor test. Around 2:30 PM we found that we could read the part-mount thermistor circuit correctly.

We tried reading the board-mount thermistor in the afternoon after the board had been on for a few hours. The resistance differed from Monday's value, reading 5 kOhm instead of 10 kOhm. Joseph touched the thermistor with his finger: the area around it felt hot, and he believes the rest of the board was cooler. It is possible that this is normal, but it felt unusually hot. While the thermistor was physically hot, the board-mount thermistor reading indicated a cold temperature, but this may have been because of an incorrect formula: the calculated temperature was 11 C, whereas multimeter probing showed 5 kOhm, which corresponds to a temperature of 45 C. When Joseph touched the thermistor, it cooled down.


At the end of the day, Ali and Joseph disconnected the power supply by turning the power supply off and then disconnecting the alligator clips from the EM board leads as per the EM Board Operation Guide. Joseph thinks he unplugged the power supply. After that, Joseph and Ali left the university.

Monday February 1st: Joseph was helping Mitesh with the CDH EM in the morning, and the power supply was working well then. It might have tripped once when they were first setting up, so they disconnected and reconnected everything. The CDH board draws much less current than the Payload board did over the previous few days.


Before powering the payload board on, Ali and Joseph removed the camera because, based on Wednesday's tests, they suspected that the board may have had the wrong board-mount thermistor. After removing the camera, they saw that the thermistor was too small for anything to be read from it. Ali and Joseph then connected the leads to the power supply as per the operations guide and, when Joseph went to attach the power lead, the power supply went to constant current mode. Joseph quickly disconnected the leads. As per the operations guide, the power supply was on when the leads were connected.

2. Observed Failure

When the Payload EM board is connected to the variable power supply, the power supply immediately changes from constant voltage to constant current mode. No burning smell or smoke was observed.

This indicates that there is likely a short in the EM board or possibly an issue with the power supply.

3. Post-Failure Actions

Ali and Joseph then tried connecting the board to the power supply a few more times, but it kept going to constant current mode. They then turned the power off and checked for continuity using the multimeter. Joseph checked for continuity across the leads and then measured the resistance across the leads. There was no continuity and the resistance was very high (megaohms). Joseph measured only across the leads, not at any probe points. Ali and Joseph then tried putting the camera back and reconnecting the board to the power supply. It again went to constant current mode.

4. Potential Causes

4.1 Power Supply is Damaged

Description: The power supply was damaged between the Payload sessions on Wednesday January 27th and Monday February 1st. Mitesh said that Jesse was having issues with the power supply entering constant current mode on Wednesday afternoon after Joseph and Ali were done using it. Jesse confirmed that this occurred both when using only the nichrome wire and when a larger load (1-10 kOhm) was applied. On the morning of February 1st, the power supply entered constant current mode once while Mitesh and Joseph were setting up the CDH board, but they did not experience any issues after that.

Discussion: The payload does use more current than the CDH board (400 mA vs. 25 mA), so it is possible that the higher load is causing issues with the power supply. However, the power supply is rated for 0-30 V and 0-5 A, so unless it is damaged, it should easily accommodate the payload board load.

Verification: Make a simple breadboard circuit that draws 400 mA at 6.4 V (a 16 ohm resistor should be sufficient) and confirm that the power supply does not enter constant current mode when powering this circuit.

Test Result 1: The test was performed and the power supply switched to constant current mode, but the resistor burnt out immediately. It is unclear whether the power supply switched modes because the resistor burnt out. The test needs to be repeated with a resistor of higher power rating.

Test Result 2: The test was performed with three 5 ohm resistors on a breadboard. The power supply did not go to constant current mode. Thus, the power supply is good.
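The load sizing behind this verification can be checked with a short Python sketch. The 6.4 V and 400 mA figures come from the verification step above. The series arrangement of the three 5 ohm resistors is an assumption, since the test result does not say how they were wired. The sketch shows that a single 16 ohm resistor would need to dissipate about 2.56 W, which is more than a small breadboard resistor typically tolerates and may explain the burn-out in Test Result 1 (the rating of that resistor is not recorded here).

```python
def load_resistance_ohms(volts, amps):
    """Resistance that draws the requested current at the given voltage."""
    return volts / amps


def dissipated_watts(volts, resistance_ohms):
    """Power dissipated in a resistor with the given voltage across it."""
    return volts ** 2 / resistance_ohms


SUPPLY_VOLTS = 6.4   # from the verification step
TARGET_AMPS = 0.4    # from the verification step (400 mA)

# Single-resistor load from the verification step.
r_single = load_resistance_ohms(SUPPLY_VOLTS, TARGET_AMPS)
p_single = dissipated_watts(SUPPLY_VOLTS, r_single)

# Assumption: the three 5 ohm resistors were wired in series.
R_EACH_OHMS = 5.0
r_series = 3 * R_EACH_OHMS
i_series = SUPPLY_VOLTS / r_series
p_each = i_series ** 2 * R_EACH_OHMS   # power per resistor in the string

if __name__ == "__main__":
    print(r_single, p_single)
    print(r_series, i_series, p_each)
```

Splitting the load across several resistors spreads the dissipation, so each part carries only a fraction of the total power.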

4.2 Solder Joint Broke During Camera Removal

Description: The pin connectors on the board are all very tight and so the board does flex when pin connectors and the cameras are removed. It is possible that a solder joint separated when the camera was removed.

Discussion: While possible, a separated solder joint would cause an open circuit, not a short. If the EM board circuit does not have continuity, it will draw no current and will not cause the power supply to go to constant current mode. The likelihood is further reduced because no snapping noise was heard when removing the camera.

Verification: ???


4.3 EM Power Leads Accidentally Put into Contact

Description: It is possible to put the positive and negative power leads into contact with each other when connecting them to the power supply. Doing so would cause a short.

Discussion: While possible, Joseph and Ali were very careful to avoid this. If the buck converter is damaged, it will lend more credibility to this hypothesis, but the buck converter could be damaged through other means as well.

Verification:

  1. While EM board is NOT powered, check for continuity between the L1 dot terminal and ground. If there is continuity, the buck converter is damaged.

  2. While EM board is NOT powered, check for continuity between the L1 dot terminal and V_PLD_Pos (the positive power input). If there is continuity, the buck converter is damaged.

Test Result: Tested continuity between the L1 dot terminal and ground. There was continuity, indicating that the buck converter is damaged. The test between L1 dot and V_PLD_Pos was not performed, since the first test already showed that the buck converter is damaged. While this confirms that the EM board's power converter was indeed damaged, it does not point to a root cause. The board was functioning normally on Wednesday January 27th but failed immediately on February 1st. No known actions were taken between the tests, and the board was powered down as per the operating instructions. It is possible that the failure was due to a fault in the power supply. Hafis notes that there is a capacitor in-line with the EM board's power circuit which should prevent in-rush current damage.


4.4 Random Component Failure

Description: A component failed in a way that caused a short.

Discussion: While possible, the board was in use for 2 days beforehand and so a component failing is unlikely.

Verification: While EM board is NOT powered, check for continuity between the 3.3 V line and ground to see if there is a short between the two. This checks the power input pins of all components in one sweep and could rule out single component failure.

Test Result: Tested continuity between the 3.3 V line and ground. There was continuity. The inductor is a short for DC, so because the side of the inductor connected to the buck converter is shorted to ground, the other side of the inductor (the 3.3 V line) is shorted as well. This means that while another component failure is possible, the short on the 3.3 V line is likely due to the buck converter failure. This test should be performed again after the buck converter is replaced to check for further damage.


4.5 Damage Caused by Improper Power-On Procedures

Description: The power-on procedures called for the board to be connected to a live power line from the variable power supply. An abrupt change in current through an inductive load produces a voltage spike that is, in theory, unbounded (V = L di/dt). Using the switch on the power supply makes the voltage ramp up in a controlled way for this very reason: to protect inductive loads. Connecting and disconnecting live wires bypasses this protection and can subject the test article to voltages that are likely many tens or hundreds of volts. The board might survive this a few times, but repeated exposure can eventually kill it.

Discussion: This is a very plausible root cause. Previous tests have indicated that the board's buck converter was damaged, which suggests that the failure was likely due to the supplied power. Additionally, no other known actions were taken between powering the board down and turning it on. It is also possible that the failure was due to a fault in the power supply, or to both, and so the power supply should be tested further.

Verification: Test the variable power supply further to determine if it is a possible failure source.
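To give a sense of scale for the V = L di/dt relation cited in the description of this hypothesis, the Python sketch below compares an abrupt current change with a slow, controlled ramp. The 400 mA current is the payload draw quoted earlier in this document. The inductance (2.2 uH) and both transition times are hypothetical values chosen only for illustration and are not taken from the EM board design.

```python
def inductive_voltage(inductance_henries, delta_amps, delta_seconds):
    """Voltage across an inductor for a linear current change: V = L * di/dt."""
    return inductance_henries * delta_amps / delta_seconds


INDUCTANCE_H = 2.2e-6    # hypothetical inductance, not from the schematic
CURRENT_STEP_A = 0.4     # payload draw quoted in this document (400 mA)

# Hypothetical abrupt change, e.g. a live lead making or breaking contact.
abrupt_volts = inductive_voltage(INDUCTANCE_H, CURRENT_STEP_A, 10e-9)

# Hypothetical controlled ramp, e.g. a supply output switched on normally.
ramped_volts = inductive_voltage(INDUCTANCE_H, CURRENT_STEP_A, 1e-3)

if __name__ == "__main__":
    print(abrupt_volts)
    print(ramped_volts)
```

With these illustrative numbers the abrupt case lands in the tens of volts while the ramped case is well under a millivolt, which is the qualitative difference the description relies on.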

5. Things to Investigate Further

The following test will provide support to the hypothesis that the power supply is damaged:

  1. (COMPLETED BUT RE-TEST NECESSARY) Check that the variable power supply works at the expected voltage and current: Make a simple breadboard circuit that draws 400 mA at 6.4 V (a 16 ohm resistor should be sufficient) and confirm that the power supply does not enter constant current mode when powering this circuit.

The following tests will provide support to the hypothesis that the payload EM board is damaged:

  1. (COMPLETED) While EM board is NOT powered, check for continuity between the 3.3 V line and ground to see if there is a short between the two. The supply and ground terminals were already checked for continuity (none was found), so the short could instead be on the 3.3 V line. This checks the power input pins of all components in one sweep and could rule out single component failure.

  2. (COMPLETED) While EM board is NOT powered, check for continuity between the L1 dot terminal and ground. If there is continuity, the buck converter is damaged.

  3. (NOT NECESSARY, L1 DOT TO GROUND TEST SHOWED SHORT) While EM board is NOT powered, check for continuity between the L1 dot terminal and V_PLD_Pos (the positive power input). If there is continuity, the buck converter is damaged.

  4. Re-do the power supply test, to check that the variable power supply works at the expected voltage and current. In this new test, again make a simple breadboard circuit that draws 400 mA at 6.4 V (a 16 ohm resistor should be sufficient) and confirm that the power supply does not enter constant current mode when powering this circuit. However, in this test, use a resistor with a higher power rating that will not burn out at the roughly 2.6 W (6.4 V x 400 mA) this load dissipates.

  5. AFTER the buck converter has been replaced: Check for continuity between the 3.3 V line and ground. If there is continuity, this indicates that other components are damaged.

6. Actions and Lessons Learned

This section describes all lessons learned from this RCA exercise and the actions that must be taken to prevent such incidents in the future.

1) Revise the payload operating procedures for safer power-on and power-off procedures. (Already implemented)

7. Test Notes

Insert test notes as images here.


8. Power Line Checking

To determine if additional power lines have been damaged, we must check for continuity between ground and each power line:

  • MCU_Power (3V3)

    • Check for continuity between resistor R8 and ground.

  • NTC_Excite (3V0)

    • Check for continuity between resistor R2 and ground.

  • NTC_Ref (1V5)

    • Check for continuity between resistor R7 and ground.

  • CAM1_Power (3V3)

    • Check for continuity by probing camera 1's pin 1 and the board's ground line. As seen in Fig. 1, the camera's pin 1 is the bottom left pin, when you are looking at the camera lens with the pins on the bottom.

  • CAM2_Power (3V3)

    • Check for continuity by probing camera 2's pin 1 and the board's ground line. As seen in Fig. 1, the camera's pin 1 is the bottom left pin, when you are looking at the camera lens with the pins on the bottom.


Fig. 1: Camera Pin 1. NOTE: The header has been placed on the opposite side of the board from that shown in this figure.

  • ADC_Power (3V0)

    • Probe for continuity between the right terminal on the R9 resistor shown in Fig. 2 and the EM board ground.

  • ADC_VRef (3V0)

    • Probe for continuity between the left terminal on the R9 resistor shown in Fig. 2 and the EM board ground. To confirm that the R9 resistor itself is not damaged, probe between ground and both sides of the resistor.



9. Power Line Checking Results

All power lines were checked for shorts. Only the MCU_Power (3V3) line was shorted. To attempt to narrow down where the short is occurring, we checked the following:

  • Checked CAN voltage line CAN_VCC, by measuring across ground and R1. No continuity but IC SN65HVD234DR has a second CAN_VCC input so the chip could still be damaged.

  • Checked flash voltage line NAND_VCC by measuring across ground and R35, R36. No continuity but W29N02GVSIAA has a third NAND_VCC input so the chip could still be damaged.

  • Checked LED D3 by measuring across ground and R30. No continuity so LED is ok.

  • Measured between each input to ESDALC6V1W5 and ground. No continuity so ESDALC6V1W5 is ok.

  • Measured between ground and JTAG connector pins 1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 19. Pins 1 and 2 short as expected for a 3V3 line short but the others do not, indicating that the connector itself is likely ok.

Additional MCU_Power (3V3) Line Components to Check

  • U17 REF3130AIDBZT

    • Less likely point of failure as 3V0 ADC power line is still ok.

    • Check between L2 and ground

  • R17 1.5 kOhm Resistor to ground

    • There is a normally open switch between the resistor and ground, so this resistor shorting out shouldn't cause a continuous short.

10. Component Removal Plan

The EM board's MCU_Power 3.3V line has shorted out. To determine which components have been damaged, we must remove components one at a time and re-check the continuity between the 3.3V line and ground until there is no longer continuity. The MCU_Power line connects to the following components:

Power Circuit

  • TPS62142RGTR Buck Converter

    • This is the voltage regulator that converts VBAT to 3V3. It is the first component that external power reaches, so it is the most likely point of failure. It was replaced, but a soldering error could cause the 3V3 line to still fail.

MCU Circuit

  • U1 STM32F765VGT6TR

Camera Circuit

  • U2 QS3VH257QG8

  • U3 QS3VH257QG8

  • U4 QS3VH257QG8

CAN Circuit

  • U6 SN65HVD234DR

Flash Circuit

  • U5 W29N02GVSIAA

Other Power Circuit Components

  • U12 SIP32509DT-T1-GE3 Power Switch IC

    • Less likely point of failure as 3V3 CAM2 power line is still ok.

  • U9 SIP32509DT-T1-GE3 Power Switch IC

    • Less likely point of failure as 3V3 CAM1 power line is still ok.

  • U14 TPS3823-33QDBVRQ1

    • Outputs go through ground via the U1 MCU (STM32F765VGT6TR). The power in and ground could short together but that seems less likely?

  • U17 REF3130AIDBZT

    • Less likely point of failure as 3V0 ADC power line is still ok.

  • U15 MC78LC15NTRG

    • Less likely point of failure as 3V0 NTC power line is still ok.

  • U16 REF3130AIDBZT

    • Less likely point of failure as 1V5 NTC power line is still ok.

  • R17 1.5 kOhm Resistor

    • There is a normally open switch between the resistor and ground, so this resistor shorting out shouldn't cause a continuous short.

Results:

First we removed the buck converter and checked for continuity and resistance. There was still a short in the board. Then, we removed the MCU and checked for continuity and resistance. We kept removing components on the 3V3 line to find the shorted component. In the end, we found U12 to be the shorted component.

(NOTE: Resistance in the board should be on the order of megaohms.)
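The one-at-a-time removal procedure described in this section can be sketched in Python as a simple loop. This is an illustrative model only: the `still_shorted` callback is a hypothetical stand-in for the manual continuity check between the 3V3 line and ground, and the removal order after the buck converter and the MCU is assumed to follow the component list above, since the actual order is not recorded. The simulated fault set uses U12 because that is what the results report.

```python
def find_shorted_component(removal_order, still_shorted):
    """Remove parts one at a time until the short clears.

    still_shorted(removed) is a stand-in for a manual continuity check
    between the 3V3 line and ground with the listed parts removed.
    Returns (part whose removal cleared the short, parts removed so far),
    or (None, parts removed) if the short never clears.
    """
    removed = []
    for part in removal_order:
        removed.append(part)
        if not still_shorted(removed):
            return part, removed
    return None, removed


# Assumption: order follows the component list in this section.
REMOVAL_ORDER = [
    "TPS62142RGTR", "U1", "U2", "U3", "U4", "U6", "U5",
    "U12", "U9", "U14", "U17", "U15", "U16",
]

# Simulated board state for the sketch; U12 matches the reported result.
SIMULATED_FAULTS = {"U12"}


def simulated_check(removed):
    """Board reads shorted while any faulty part is still installed."""
    return any(part not in removed for part in SIMULATED_FAULTS)


culprit, removed_parts = find_shorted_component(REMOVAL_ORDER, simulated_check)

if __name__ == "__main__":
    print(culprit, removed_parts)
```

One limitation of this approach is that it identifies only the last part whose removal cleared the short. If more than one part on the line were shorted, the parts removed earlier would also need to be checked individually before being reinstalled.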