Troubleshooting Dell blade server power on issues using Dell CMC and RACADM
By Arun Muthaiyan and Sridevi Chandrasekaran
In modular servers, many gating factors control whether the server is able to power on. When a server does not power on, it is not always straightforward to find the reason why. The steps below provide a methodical way to troubleshoot a Dell blade server that will not turn on.
1. First, check the basics: make sure that the AC power cords and other cables are plugged in properly.
2. Make sure that the Chassis is powered ON.
Command to check the chassis power status remotely via RACADM:
$ racadm getsysinfo -c
Chassis Information:
System Model = PowerEdge M1000e
System AssetTag = 00000
Service Tag = HLSG7R1
Chassis Name = CMC-HLSG7R1
Chassis Location = PT Lab Power Test
Chassis Midplane Version = 1.1
Power Status = OFF
Command to power ON the chassis when it is OFF:
$ racadm chassisaction powerup
Module power operation successful.
The chassis takes about a minute to power up after this command is issued.
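When several chassis need to be checked, the power state can be extracted from the getsysinfo output with a small shell helper. The following is a minimal sketch, assuming the output keeps the "Power Status = <value>" layout shown above; chassis_power_status is a hypothetical function name, not a RACADM subcommand.

```shell
#!/bin/sh
# Hypothetical helper (not part of RACADM): reads "racadm getsysinfo -c"
# output on stdin and prints the value of the "Power Status" field.
chassis_power_status() {
    awk -F'=' '/^[ \t]*Power Status[ \t]*=/ {
        gsub(/^[ \t]+|[ \t]+$/, "", $2)   # trim spaces around the value
        print $2
        exit
    }'
}

# Demo using part of the sample output from this step.
sample='Chassis Information:
System Model = PowerEdge M1000e
Power Status = OFF'

status=$(printf '%s\n' "$sample" | chassis_power_status)
echo "Chassis power status: $status"

# Against a live CMC (assumes racadm is on the PATH):
#   racadm getsysinfo -c | chassis_power_status
```

A script could call racadm chassisaction powerup only when the helper reports OFF, which avoids issuing the power-up command against a chassis that is already on.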
3. Once the chassis is ON, check the status of iDRAC.
Command to check the status of iDRAC:
$ racadm getversion
<Server> <iDRAC Version> <Blade Type> <Gen> <Updatable>
server-1 1.35.35 (Build 03) PowerEdgeM620 iDRAC7 Y
server-3 iDRAC not ready
<Switch> <Model Name> <HW Version> <FW Version>
switch-1 Dell PowerConnect M6348 A02
switch-2 Dell PowerConnect M6220 A12
<CMC> <CMC Version> <Updatable>
cmc-1 4.30.X15.201210050401 Y
cmc-2 4.30.X15.201210050401 Y
$
If the iDRAC is still not ready more than 3 minutes after chassis power-on, try a virtual reseat (step 12).
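On a fully populated chassis it can be easier to list only the blades whose iDRAC is not ready. The sketch below assumes the getversion output marks such blades with the text "iDRAC not ready" right after the server name, as in the sample above; idrac_not_ready is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads "racadm getversion" output on stdin and prints
# the name of every server whose iDRAC is reported as "iDRAC not ready".
idrac_not_ready() {
    awk '/^server-[0-9]+[ \t]+iDRAC not ready/ { print $1 }'
}

# Demo using two server rows from the sample output in this step.
sample='server-1 1.35.35 (Build 03) PowerEdgeM620 iDRAC7 Y
server-3 iDRAC not ready'

printf '%s\n' "$sample" | idrac_not_ready

# Against a live CMC (assumes racadm is on the PATH):
#   racadm getversion | idrac_not_ready
```

For the sample rows, the helper prints server-3.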
4. If the chassis is connected to a 110 VAC source, make sure the “Allow 110 VAC Operation” option is selected. This step applies only to 110 VAC input.
Command to check status of “Allow 110 VAC Operation” option:
$ racadm getconfig -g cfgChassisPower -o cfgChassisAllow110VACOperation
0
$
Command to enable “Allow 110 VAC Operation” option:
$ racadm config -g cfgChassisPower -o cfgChassisAllow110VACOperation 1
Object value modified successfully.
$
5. Make sure you have the latest combination of firmware (CMC, iDRAC, BIOS, CPLD and LC) for the given blade server model. To find the latest firmware versions, go to Dell’s support page (support.dell.com/support) and enter the product’s service tag. The Drivers and Downloads tab shows the latest firmware levels for the different components.
Command to get the current firmware version installed in the system:
$ racadm getversion
For BIOS version: $ racadm getversion -b
For CPLD version: $ racadm getversion -c
For USC version: $ racadm getversion -l
6. If you changed the fabric recently and the blade server has not powered on since then, verify that there is no fabric mismatch in the blade server.
If the DC1 and DC2 states of a server are “OK” or “N/A”, there is no fabric mismatch for that server. If either the DC1 or DC2 state shows anything else, such as “invalid” or “Not Ok”, there is a fabric mismatch. To fix this issue, remove or change the mezzanine card or the IOM.
In the example below, server-1, server-3 and server-11 all have valid fabrics. Server-4 has a mismatched fabric in slot B: its DC1 type is Fibre Channel 16, while the Group B I/O type is 10 GbE KR.
Command to check if there is fabric mismatch:
$ racadm getdcinfo
Group A I/O Type : Gigabit Ethernet
Group B I/O Type : 10 GbE KR
Group C I/O Type : Fibre Channel 16
<IO#> <Type> <State> <Role>
switch-1 Gigabit Ethernet OK Master
switch-2 Gigabit Ethernet OK Master
switch-3 None N/A N/A
switch-4 None N/A N/A
switch-5 Fibre Channel 16 OK Master
switch-6 None N/A N/A
<Server#> <Presence> <DC1 Type> <DC1 State> <DC2 Type> <DC2 State>
server-1 Present 10 GbE KR OK None N/A
server-2 Not Present None N/A None N/A
server-3 Present None N/A None N/A
server-4 Present Fibre Channel 16 Not Ok None N/A
server-5 Not Present None N/A None N/A
server-6 Not Present None N/A None N/A
server-7 Not Present None N/A None N/A
server-8 Not Present None N/A None N/A
server-9 Not Present None N/A None N/A
server-10 Not Present None N/A None N/A
server-11 Present None N/A Fibre Channel 16 OK
server-12 Not Present None N/A None N/A
server-13 Not Present None N/A None N/A
server-14 Not Present None N/A None N/A
server-15 Not Present None N/A None N/A
server-16 Not Present None N/A None N/A
$
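The getdcinfo table is long, so a filter that prints only the mismatched servers can help. The sketch below is one way to do it, under the assumption that a mismatched row contains "Not Ok" (as in the sample above) or "invalid" as a DC state; fabric_mismatch_servers is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads "racadm getdcinfo" output on stdin and prints
# the servers whose row contains a DC state of "Not Ok" or "invalid".
# Assumption: a healthy row only ever shows "OK" or "N/A" as DC states.
fabric_mismatch_servers() {
    awk 'tolower($0) ~ /^server-[0-9]+[ \t].*(not[ \t]+ok|invalid)/ { print $1 }'
}

# Demo using four server rows from the sample output in this step.
sample='server-1 Present 10 GbE KR OK None N/A
server-2 Not Present None N/A None N/A
server-4 Present Fibre Channel 16 Not Ok None N/A
server-11 Present None N/A Fibre Channel 16 OK'

printf '%s\n' "$sample" | fabric_mismatch_servers

# Against a live CMC (assumes racadm is on the PATH):
#   racadm getdcinfo | fabric_mismatch_servers
```

For the sample rows, the helper prints server-4. Rows for empty slots contain "Not Present", which the pattern deliberately does not treat as a mismatch.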
7. If, after a brownout, a blade server that was powered on before does not power on again automatically, the auto-recovery state is probably set to OFF. The auto-recovery state can be set to ON, OFF or LAST state from the BIOS setup (F2).
8. To control blade server power actions remotely, you need the Chassis Control Administrator (Power Commands) privilege, the Server Administrator privilege, or the iDRAC’s administrator account.
9. Make sure Maximum Power Conservation Mode (MPCM) is disabled.
Command to check the status of MPCM:
$ racadm getconfig -g cfgchassispower -o cfgchassismaxpowerconservationmode
0
$
If it returns 0, MPCM is disabled. If it returns a timestamp, MPCM has been enabled since that time.
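The rule above (0 means disabled, anything else is the time MPCM was enabled) can be wrapped in a small function for use in scripts. This is a minimal sketch; mpcm_state is a hypothetical helper name, and treating an empty value as "unknown" is an assumption added here for robustness.

```shell
#!/bin/sh
# Hypothetical helper: interprets the value returned by
#   racadm getconfig -g cfgchassispower -o cfgchassismaxpowerconservationmode
# 0 means MPCM is disabled; any other non-empty value is treated as the
# timestamp since which MPCM has been enabled.
mpcm_state() {
    value=$(printf '%s' "$1" | tr -d '\r\n')   # drop stray line endings
    if [ -z "$value" ]; then
        echo "unknown (no value returned)"
    elif [ "$value" = "0" ]; then
        echo "disabled"
    else
        echo "enabled since $value"
    fi
}

# Demo with the value shown in the sample output of this step.
mpcm_state "0"

# Against a live CMC (assumes racadm is on the PATH):
#   mpcm_state "$(racadm getconfig -g cfgchassispower -o cfgchassismaxpowerconservationmode)"
```

For the sample value 0, the helper prints disabled.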
10. Check the raclog for any messages about insufficient power. Replacing any bad PSUs with new PSUs may help.
Command to check raclog:
$ racadm getraclog
Part of the output:
--------------------------------------------------------------------------------
SeqNumber = 73
Message ID = USR8511
Category = Audit
AgentID = CMC
Severity = Information
Timestamp = 2013-01-11 23:18:34
Message Arg 1 =
Message Arg 2 = 192.168.0.100
Message Arg 3 = root
Message Arg 4 = GUI
Message Arg 5 = 29179
Message = Login success from 192.168.0.100 (username=root, type=GUI, sid=29179)
--------------------------------------------------------------------------------
SeqNumber = 72
Message ID = USR8510
Category = Audit
AgentID = CMC
Severity = Information
Timestamp = 2013-01-11 23:17:17
Message Arg 1 =
Message Arg 2 = root
Message Arg 3 = Serial
Message Arg 4 = 6133
Message = Login success (username=root, type=Serial, sid=6133)
--------------------------------------------------------------------------------
SeqNumber = 71
Message ID = USR8506
Category = Audit
AgentID = CMC
Severity = Information
Timestamp = 2013-01-11 23:17:07
Message Arg 1 = 41269
Message = Session close succeeds: sid=41269
--------------------------------------------------------------------------------
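Because the raclog can be long, it may help to print only the entries whose message mentions a given keyword, such as "power". The sketch below assumes the "Field = value" layout and dashed separator lines shown above; raclog_messages is a hypothetical helper name. The demo searches the sample entry for "login" only because the excerpt above contains no power-related entry.

```shell
#!/bin/sh
# Hypothetical helper: reads "racadm getraclog" output on stdin and prints
# "timestamp | message" for each entry whose Message contains the keyword
# given as $1 (case-insensitive).
raclog_messages() {
    awk -v kw="$1" '
        /^-+$/ { ts = ""; next }                      # start of a new entry
        /^Timestamp[ \t]*=/ { ts = $0; sub(/^[^=]*=[ \t]*/, "", ts); next }
        /^Message[ \t]*=/ {                           # skips "Message ID" and "Message Arg"
            msg = $0; sub(/^[^=]*=[ \t]*/, "", msg)
            if (index(tolower(msg), tolower(kw)) > 0) print ts " | " msg
        }
    '
}

# Demo using one entry from the sample output in this step.
sample='--------------------------------------------------------------------------------
SeqNumber = 73
Message ID = USR8511
Severity = Information
Timestamp = 2013-01-11 23:18:34
Message = Login success from 192.168.0.100 (username=root, type=GUI, sid=29179)
--------------------------------------------------------------------------------'

printf '%s\n' "$sample" | raclog_messages "login"

# Against a live CMC (assumes racadm is on the PATH):
#   racadm getraclog | raclog_messages "power"
```

For the sample entry, the demo prints the timestamp 2013-01-11 23:18:34 followed by the login message.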
11. If you see an amber light on the front of the blade, check the SEL and LC log for any critical messages. This is a scenario where a server turned on but was then turned off because of a hardware failure. Fix the issue using the recommended solution provided with the event log entry.
Command to check the log:
$ racadm getsel
Part of the output:
Mon Feb 25 2013 13:04:04 Critical The power input for power supply 2 is lost.
Mon Feb 25 2013 13:04:04 Critical Power supply redundancy is lost.
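To focus on the entries that matter for this step, the SEL output can be filtered down to its critical lines. This is a minimal sketch, assuming the severity appears as the separate word "Critical" on each line, as in the sample above; sel_critical is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads "racadm getsel" output on stdin and keeps only
# the lines whose severity column reads "Critical".
sel_critical() {
    grep -E '(^|[[:space:]])Critical([[:space:]]|$)'
}

# Demo using the two sample lines from this step.
sample='Mon Feb 25 2013 13:04:04 Critical The power input for power supply 2 is lost.
Mon Feb 25 2013 13:04:04 Critical Power supply redundancy is lost.'

printf '%s\n' "$sample" | sel_critical

# Against a live CMC (assumes racadm is on the PATH):
#   racadm getsel | sel_critical
```

Both sample lines pass the filter. Because grep exits with a non-zero status when nothing matches, a script can also use the helper as a quick test for whether any critical entries exist.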
12. If you cannot ping the iDRAC, perform a virtual reseat.
Command to virtually reseat a blade:
$ racadm serveraction -m server-n reseat -f
Object value modified successfully
$
Here, n is the slot number of the blade (iDRAC).
13. If AC redundancy is set, you may not be able to power on all the highly configured systems. The Server Performance Over Redundancy feature sacrifices redundancy in order to turn on all the servers, so try enabling Server Performance Over Redundancy.
Command to enable Server Performance Over Redundancy:
$ racadm config -g cfgChassisPower -o cfgChassisPerformanceOverRedundancy 1
Object value modified successfully
$