Troubleshooting fun!



Airbozo
12-03-2007, 01:40 PM
OK, so I have been working on a build for a customer that will replace an SGI Octane2 in a medical device. Here are the specs:

Supermicro 3U server, 8 drive bays and triple redundant PSUs
Dual quad-core Xeons
16 GB of memory
nVidia 8500 512
Adaptec 3805 RAID controller
3x 300 GB SAS drives in RAID 5 (for the OS and SQL)
3x 500 GB SATA-II drives in RAID 5 (for data)

One of the great features of the 3805 (and the reason we use it) is the ability to connect SAS and SATA drives to the same controller _and_ even in the same RAID set (although that would degrade the performance of the SAS drives to that of the SATA drives, so why would you bother).

I set up the system with the first 3 drives as SAS RAID 5 and the second 3 drives as SATA RAID 5, start the initialization and reboot into the Windows Server 2003 x64 install. Press F6 to install drivers, load drivers, select partition and start the install. The system gets to the point of reboot and when it comes back up I get an error with one of the SAS drives. The RAID controller complains that it cannot find one of the SAS drives. I go into the Adaptec BIOS and check the drives and they all seem fine. Reboot to finish the initialization and I get another failed drive (not the same one).

OK, swap the controller, cables and backplane of the array. Still fails. Called Supermicro and Adaptec and set up the second and third systems to test. Same failures. I know that this configuration works, as these are not the first systems to be set up this way. We have delivered 12 identical systems since March and have had no issues.

I found that the only difference is that the backplanes have been upgraded to version 2.2 and a new company is manufacturing them for Supermicro.

OK, now the funny part. We purchase quite a bit of product from Adaptec and they are close by, so they sent one of their high-level engineering managers over to gather some info and try to find a workaround. The Adaptec rep, Grace Wang, shows up and I take her into the lab to show her the issue. Only problem? I cannot get any of the three systems to fail. I spent the next half hour rebooting systems, waiting for one to fail. Nothing. Typical that the engineer shows up and sees nothing wrong. On a hunch I wiped out both RAID sets on one system, reconfigured them, and rebooted. YAY! It failed!

Turns out the problem only happens when _both_ the SAS and SATA RAID sets are initializing and the system gets rebooted. Currently the workaround is to _only_ configure the SAS RAID set, then install the OS and utilities, then once the SAS drives are initialized (several hours), it is OK to configure and initialize the SATA RAID set. Adaptec considers this a major issue since they tout the fact that you can run both drive types off of one card and the initialization is _supposed_ to be transparent to the operation of the system (except for a performance hit on the drives during initialization).
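
For anyone scripting a build checklist around this, the workaround boils down to two rules: do not create the next RAID set while any set is still initializing, and do not reboot while two or more sets are initializing. Below is a minimal Python sketch of those rules. The state labels, set names, and function names are all made up for illustration; nothing here talks to the controller or reflects Adaptec's tools or output.

```python
from typing import List, Mapping

# Hypothetical state label for illustration only; not Adaptec output.
INITIALIZING = "initializing"


def initializing_sets(states: Mapping[str, str]) -> List[str]:
    """Return the names of RAID sets that are still initializing."""
    return sorted(name for name, state in states.items() if state == INITIALIZING)


def safe_to_create_next_set(states: Mapping[str, str]) -> bool:
    """Workaround rule: only create another set once nothing is initializing."""
    return not initializing_sets(states)


def safe_to_reboot(states: Mapping[str, str]) -> bool:
    """The failure was seen when both sets were initializing across a reboot."""
    return len(initializing_sets(states)) < 2


if __name__ == "__main__":
    # SAS set created first and still building: reboot is fine, SATA must wait.
    states = {"sas_raid5": INITIALIZING}
    print(safe_to_reboot(states))           # True
    print(safe_to_create_next_set(states))  # False
```

In practice the `states` mapping would have to be filled in by hand or from whatever status the controller utility reports; that part is deliberately left out.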

It was a REALLY tough week, but I was able to get my angst out Friday night at the hockey game.

Cannibal23
12-03-2007, 02:56 PM
Wow, that's an annoying little gotcha. Glad you and Adaptec managed to find it. Hopefully there will be a firmware update to fix that. Nice catch, +rep.

crenn
12-04-2007, 06:01 AM
Isn't it funny that when problems happen and someone gets sent to fix them... the problem suddenly disappears!

Airbozo
12-04-2007, 11:58 AM
crenn wrote: "Isn't it funny that when problems happen and someone gets sent to fix them... the problem suddenly disappears!"

It is funny. I have also been known as that person. At McDonnell Douglas, when the guys down on the hangar floor needed me to fix a computer, they would ask for me and remind me to bring my magic screwdriver. How I got that rep is another funny story.

Scotty
12-04-2007, 01:53 PM
crenn wrote: "Isn't it funny that when problems happen and someone gets sent to fix them... the problem suddenly disappears!"

Like my internet: it doesn't work no matter what I try, I call them up and they say they don't have a problem, I put the phone down, hit F5, and it works again... Stupid Virgin Media, stupid crappy internet...

Airbozo
12-04-2007, 02:50 PM
Scotty wrote: "Like my internet: it doesn't work no matter what I try, I call them up and they say they don't have a problem, I put the phone down, hit F5, and it works again... Stupid Virgin Media, stupid crappy internet..."

...Sounds like a name lookup or ARP issue. Once they ping your system, it can suddenly communicate again... Been down that road, and I have more gray hair because of it.
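
One rough way to tell those two cases apart next time it happens is to check name resolution and reachability separately. This is a minimal Python sketch using only the standard `socket` module; the function name, the result strings, and the default port are arbitrary choices for illustration. A "resolved but unreachable" result only says the problem is below name lookup (ARP, routing, a firewall, or the far end), it does not prove ARP specifically.

```python
import socket
from typing import Callable


def classify_failure(
    host: str,
    port: int = 80,
    resolve: Callable[[str], str] = socket.gethostbyname,
    connect: Callable[..., socket.socket] = socket.create_connection,
    timeout: float = 3.0,
) -> str:
    """Separate a name lookup failure from a connectivity failure."""
    try:
        # Step 1: can the hostname be turned into an address at all?
        address = resolve(host)
    except OSError:
        return "name lookup failed"
    try:
        # Step 2: the name resolved, so try to actually reach the address.
        conn = connect((address, port), timeout=timeout)
    except OSError:
        return "resolved but unreachable"
    conn.close()
    return "ok"


if __name__ == "__main__":
    # Replace with whatever site is failing to load.
    print(classify_failure("example.com"))
```

The `resolve` and `connect` parameters exist only so the check can be exercised without a live network; by default it uses the real resolver and a plain TCP connection.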