Hello everyone,
One of our customers recently experienced a very weird power failure, and I have no clue what caused it or how to protect ourselves from such failures in the future. Hopefully somebody will be able to help me...
So basically we have got a panel with two R730 servers in it, each fitted with two redundant power supplies. The primary power supply in each server is fed from UPS1 and the secondary power supply is fed from UPS2, so we have full redundancy.
What happened is (as reported by the customer):
- the site engineer got called out to investigate a failure of the system (the servers are not accessible remotely, used in the oil and gas industry),
- on arrival he found that both servers were powered down,
- he tried to restart them, but was unable to - there was no response to the power button presses,
- he had a look at the server PSUs and noticed that the status lights (the green lights at the back of the PSUs) were off on all power supplies,
- he checked the breakers in the panel (all were on) and measured the voltage in the distribution board (all was OK, 230V coming from both UPS supplies),
- he decided to disconnect all the power leads from the power supplies for approx. 30 seconds and then reconnect them; at this point the green status lights lit up,
- he then pressed the power button again and now the servers responded correctly (i.e. they started and the OS booted).
When I came to site the following day, I reviewed the Windows Event Logs and found that the power supplies failed at approx. 14:30 and the power was restored (OS booted) at approx. 19:00.
However, when I checked in iDRAC there were no events at all to indicate that a power failure occurred at 14:30. There had been no events at all since February (so the servers were running OK all that time), and the first event logged in iDRAC was around 19:00, when the engineer disconnected the power leads from the PSUs. So it looks like iDRAC was not aware of the actual power failure at all and thought that the servers were running OK?
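In case anybody wants to repeat the comparison on their own logs, here is a rough sketch of the check I did by hand: take the outage window from the Windows Event Log and see whether any iDRAC entries fall inside it. Everything in it is illustrative. The function name, the date and the message text are made-up placeholders; only the approximate times (14:30 and 19:00) come from what I saw on site.

```python
from datetime import datetime

def events_in_window(events, start, end):
    """Return the (timestamp, message) events with start <= timestamp < end."""
    return [(ts, msg) for ts, msg in events if start <= ts < end]

# Placeholder date; only the approximate times are taken from the logs.
outage_start = datetime(2000, 1, 1, 14, 30)  # servers went down (Windows Event Log)
outage_end = datetime(2000, 1, 1, 19, 0)     # OS booted again (Windows Event Log)

# Hypothetical stand-in for the iDRAC log: the first entry only appears
# around 19:00, when the power leads were pulled and reconnected.
idrac_events = [
    (datetime(2000, 1, 1, 19, 0), "placeholder: first iDRAC entry after reconnect"),
]

missing = events_in_window(idrac_events, outage_start, outage_end)
print(f"iDRAC entries during the outage: {len(missing)}")
```

An empty result for that window is what I mean by iDRAC not noticing the failure.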
Here are my thoughts:
- this could not have been a UPS failure - it is very unlikely for two power feeds to fail at the same time, plus there is other equipment in the panel (e.g. terminal servers) and it did not fail at all (I checked the logs),
- could it have been a voltage spike/dip that caused the power supplies to "lock up"?
- is it normal behaviour for the power supplies that the power leads have to be disconnected for a few seconds in order to restore normal operation (I assume it caused some sort of a restart)?
Any thoughts appreciated. I could contact Dell support directly (the servers are still in warranty), but because this was such a weird failure, I would rather see first if anybody has experienced something similar. My guess is that there is some sort of a bug in the firmware that caused the PSUs to lock up, but I simply do not know.