Replacing Monitoring Chips on SGI Fuel

In this post I have tried to capture all the useful information I learned while fixing one of my SGI Fuel machines that would not power on. It's here to share with other people interested in this topic and to remind me what to do the next time I have a dead Fuel on my table.

L1 Console

The machine has low-level monitoring firmware (L1) that, among other things, is responsible for basic power-up tasks and for monitoring system voltage levels, temperatures and fan speeds. It also has a basic logging facility that can help diagnose problems. A user can communicate with the L1 firmware via the higher-level L2 firmware, via a running OS, or over an RS232 link using a connector located on the motherboard. On a dead machine the RS232 connector is the only usable interface. The link parameters are 38400 bps, 1 stop bit, no parity. A null-modem cable or adapter is needed in this case.
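For anyone who prefers a script over a terminal program, here is a minimal Python sketch of the link settings above. It assumes the third-party pyserial package, and it assumes 8 data bits, which I did not state above. The port name and the helper names are made up for illustration.

```python
# Link parameters from the text: 38400 bps, 1 stop bit, no parity.
# ASSUMPTION: 8 data bits (not stated in the text).
L1_SERIAL_SETTINGS = {
    "baudrate": 38400,
    "bytesize": 8,    # same value as serial.EIGHTBITS
    "parity": "N",    # same value as serial.PARITY_NONE
    "stopbits": 1,    # same value as serial.STOPBITS_ONE
    "timeout": 1.0,   # seconds; makes readline() return instead of blocking forever
}


def open_l1_console(port):
    """Open the serial port wired to the L1 connector (hypothetical helper)."""
    import serial  # pyserial; imported here so the rest works without it

    return serial.Serial(port, **L1_SERIAL_SETTINGS)


def read_console_lines(conn, max_lines=20):
    """Read up to max_lines lines from conn, stopping at the first timeout."""
    lines = []
    for _ in range(max_lines):
        raw = conn.readline()
        if not raw:  # empty result means timeout or end of data
            break
        lines.append(raw.decode("ascii", errors="replace").rstrip())
    return lines
```

Usage would look something like `read_console_lines(open_l1_console("/dev/ttyUSB0"))`, where the device path depends entirely on your adapter.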

The L1 system is powered by the 5V AUX rail, so when you cannot be sure about the state of the PSU, it's best to use an external power supply (5V at 400mA was enough in my case). The PSU connector and its pinout are well described here. The first machine I started diagnosing had a bad PSU with a significant voltage drop on the 5V AUX line, and it took me a few hours to track this problem down and get my much-wanted L1 console output.

After applying good 5V AUX power I finally got some output that wasn't very welcoming:

ALERT: Error initializing the NODE 1 monitor, no acknowledge
ALERT: Error initializing the NODE 2 monitor, no acknowledge

This was certainly some progress, but it was also the only thing I got from the L1 console. I expected an L1 version string followed by a prompt, but neither appeared. Long story short, the cause was the wrong type of null-modem cable: it had crossed RX and TX lines but straight-through RTS and CTS flow control lines. After cutting my fingers while dismantling the plastic-sealed cable and fixing the lines, I finally got the desired prompt. This nicely shows the design reasoning behind the behavior: ALERTs are high-priority messages that don't respect flow control.

So I was finally able to execute L1 commands. A user guide for the L1 and L2 firmware can be found here. The last section of the "env" command output shows an overview of the temperature readings from all monitoring chips in the system. The temperature section is useful because, unlike the voltage and fan sections, it doesn't need a powered-up system to read values. In my case its contents matched the alert messages I got before:

                           Advisory  Critical  Fault     Current
Description    State       Temp      Temp      Temp      Temp  
-------------- ----------  --------  --------  --------  ---------
NODE 0           Disabled  Disabled  Disabled  Disabled  20c/ 68F
NODE 1           Disabled  Disabled  Disabled  Disabled   0c/ 32F
NODE 2           Disabled  Disabled  Disabled  Disabled   0c/ 32F
PIMM             Disabled  Disabled  Disabled  Disabled  24c/ 75F
ODYSSEY          Disabled  Disabled  Disabled  Disabled  25c/ 77F
BEDROCK          Disabled  Disabled  Disabled  Disabled  24c/ 75F

So now I knew the monitoring chips for nodes 1 and 2 were dead, but what I didn't know and couldn't find online was where these chips are located on the hardware.
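The comparison between the alerts and the temperature table can also be done mechanically. Below is a minimal Python sketch that assumes the exact table layout shown above. The function names are made up for illustration, and the rule "a reading of exactly 0c means the chip did not answer" is only a heuristic drawn from this one machine.

```python
import re

# A data row starts with the sensor name (which may contain single spaces,
# e.g. "NODE 0"), followed by 2+ spaces, and ends with "<C>c/ <F>F".
ROW_RE = re.compile(
    r"^(?P<name>\S+(?: \S+)*)\s{2,}.*?(?P<c>-?\d+)c/\s*(?P<f>-?\d+)F\s*$"
)


def parse_env_temperatures(text):
    """Return {sensor name: current temperature in Celsius}."""
    temps = {}
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if match:  # header and separator lines do not match and are skipped
            temps[match.group("name")] = int(match.group("c"))
    return temps


def suspect_sensors(temps):
    """Heuristic (assumption): exactly 0 C indoors suggests a silent chip."""
    return sorted(name for name, celsius in temps.items() if celsius == 0)


SAMPLE = """\
Description    State       Temp      Temp      Temp      Temp
-------------- ----------  --------  --------  --------  ---------
NODE 0           Disabled  Disabled  Disabled  Disabled  20c/ 68F
NODE 1           Disabled  Disabled  Disabled  Disabled   0c/ 32F
NODE 2           Disabled  Disabled  Disabled  Disabled   0c/ 32F
PIMM             Disabled  Disabled  Disabled  Disabled  24c/ 75F
ODYSSEY          Disabled  Disabled  Disabled  Disabled  25c/ 77F
BEDROCK          Disabled  Disabled  Disabled  Disabled  24c/ 75F
"""

print(suspect_sensors(parse_env_temperatures(SAMPLE)))
```

On the sample above this flags the same two nodes that the ALERT lines complained about.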

Dallas DS1780 CPU Peripheral Monitor

The usual problem with fixing old hardware is getting spare parts. Luckily the monitoring chips used on the Fuel are still in production and easy to buy online. I bought a batch of five chips on eBay from China and was a little worried about their authenticity, a common problem with that market, but they turned out to be OK. The datasheet is very user friendly and its first page reveals the most important piece of information, the chip pinout.

Inspecting the hardware I could locate five monitoring chips: three on the motherboard (NODE 0-2, going by the "env" output), one on the processor module (PIMM) and one on the GPU (ODYSSEY). However, the temperature section of the "env" output actually lists six sources. My theory is that for BEDROCK the temperature is taken from something other than a DS1780. For completeness, BEDROCK is the codename of a chip that functions as a fast crossbar interconnecting the CPU with peripherals and memory (source). It must be one of the two bigger chips with a heatsink on the motherboard.

The following picture shows all DS1780 located on the motherboard and their NODE numbers. 

Learning the node numbers took me some time. I wasn't able to find them online, so I had to get my hands dirty and reverse engineer them. The DS1780 uses the I2C bus for communication, where every chip has a unique address on the shared bus. The two least significant bits of the address are set by the A0 and A1 pins (pins 1 and 2 on the pinout picture) according to their logic input levels. So the simplest approach is to take a multimeter and read the voltages on these pins for every chip. I also wanted to see the actual communication on the I2C bus, so I soldered wires to the SDA and SCL pins of the node 1 chip and used my little $8 logic analyzer to display the traffic.

After recording the communication following power-up I could see the failed attempts to initiate a data transfer with the dead chips. This helps confirm that the problem lies entirely in the faulty DS1780 chips and is not related to something else, such as a corrupted bus. The 0x5A and 0x5C addresses match the datasheet, which on page 4 gives the default address as 01011(A1)(A0). One quirk of the I2C bus is that bit 0 of the address byte signals the direction of the transaction (read/write), with the 7-bit address sitting in the upper bits. Hence different software may show the address 0x5A as 0x2D. I wouldn't mention this if I hadn't found a special "test i2c" L1 command that can probe all I2C devices in the system. The result was as expected: it complained about a missing acknowledge from devices 0x2D (0x5A) and 0x2E (0x5C).
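The address arithmetic is easy to get wrong, so here is a small Python sketch of it. It only encodes the 01011(A1)(A0) pattern and the read/write bit described above. The function names are my own, and which A1/A0 strapping belongs to which NODE is something to confirm with a multimeter rather than take from this code.

```python
DS1780_BASE_7BIT = 0b0101100  # 01011(A1)(A0) with A1 = A0 = 0, i.e. 0x2C


def ds1780_address(a1, a0):
    """7-bit I2C address for the given A1/A0 pin levels (each 0 or 1)."""
    if a1 not in (0, 1) or a0 not in (0, 1):
        raise ValueError("A1 and A0 must each be 0 or 1")
    return DS1780_BASE_7BIT | (a1 << 1) | a0


def wire_bytes(addr7):
    """(write byte, read byte): 7-bit address shifted left, R/W flag in bit 0."""
    return (addr7 << 1) | 0, (addr7 << 1) | 1


def to_7bit(wire_byte):
    """Drop the R/W bit to recover the 7-bit address."""
    return wire_byte >> 1


for a1 in (0, 1):
    for a0 in (0, 1):
        addr = ds1780_address(a1, a0)
        write_byte, read_byte = wire_bytes(addr)
        print(f"A1={a1} A0={a0}: 7-bit 0x{addr:02X}, "
              f"write 0x{write_byte:02X}, read 0x{read_byte:02X}")
```

With A1=0, A0=1 this gives 0x2D (0x5A on the wire for a write), and with A1=1, A0=0 it gives 0x2E (0x5C), the two addresses that failed to acknowledge.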

Replacing chips

DS1780 chips come in a TSSOP package and are not difficult to replace. I've seen some successful attempts using an ordinary heat gun, but I can imagine it can also get very messy. I have good soldering equipment at work, which made things much simpler. However, these days decent soldering equipment isn't expensive. I purchased a hot air soldering gun for about $25 (this one) and a digital microscope for another $100 (similar to this one). Together with flux, a pair of tweezers and Kapton tape, the entire process is quite painless. And of course, one needs a micro soldering iron!

I started with the NODE 1 chip, which is located next to the yellow battery and the PS/2 ports. It's very convenient to use Kapton tape to shield the surrounding parts so they won't bubble up (or melt, in the case of plastic) when applying hot air. It took me a while to realize that the yellow battery brick is removable and I could have spared it the torture by hot air. After applying some flux and about a minute of ~360 degC hot air, the chip finally let go of the board. I'm not really experienced in these things, so there may be better ways to remove chips like this one. To solder the new part I used a soldering iron with the help of a microscope. I know this can also be done with hot air, but I felt more confident using the other method.

Additional observations

I noticed when browsing forums that for some machines it's enough to disable L1 environment monitoring with the "env off" command and the machine can power on again. I suppose that when a chip doesn't respond at all (L1 generates alerts about a node not acknowledging), as in my case, this method isn't sufficient.
When I was trying to fix another machine I saw different behavior from a DS1780: everything seemed fine except that it reported one voltage rail incorrectly (0.2V instead of 2.5V). In this case it was enough to measure the voltage on the corresponding DS1780 pin to verify that the chip's reading was bad and the actual voltage was fine. This was on a GPU card, and replacing the chip fixed the problem.
I wonder what's behind this high failure rate of the DS1780. My guess is that it's either a bad manufacturing process or a chip design flaw during some period of time. Maybe the chips were more sensitive to higher temperatures, as Fuels are known to have overheating problems due to their poor mechanical design (or something like that). I read that the later Fuel series didn't suffer as much as the initial series, but it would be interesting to know whether they fixed it by using a later series of the DS1780 or whether it was something else. Anyway, it was fun to tinker with them...