Understanding mcelog ECC errors – Which stick of RAM is broken?

Before reading this, please note that much of the information mcelog reports is hardware-dependent. Your mileage may vary.

Memory gone bad

So one of the servers, running an X99-WS/IPMI board from Asus, began putting errors into /var/log/mcelog. Thankfully, they were all the same, telling me the following:

mcelog: failed to prefill DIMM database from DMI data
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 11 
MISC 90840080008228c ADDR 9ce494000 
TIME 1499161840 Tue Jul 4 09:50:40 2017
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER MS_CHANNEL2_ERR
Transaction: Memory scrubbing error
MemCtrl: Corrected patrol scrub error
STATUS 8c000051000800c2 MCGSTATUS 0
MCGCAP 7000c16 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 79

So, what does this mean?

The first few lines tell us the error was reported by CPU 0 in BANK 11. This wasn’t much help, as the board only has 8 memory banks. (BANK in an MCE record is one of the CPU’s machine-check register banks, not a DIMM slot.) It was suggested in #debian on freenode that the high bank number might be due to dual channel memory, but then how do I pinpoint the physical stick?
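If the log holds many records, it helps to pull out just the ADDR values, since those turn out to be the useful part. A minimal sketch, assuming the "MISC ... ADDR ..." line layout shown above; the function name mce_addrs is made up for illustration:

```shell
# Print the unique ADDR values found in mcelog-style text on stdin.
# Assumes each address follows the literal word ADDR on the same line.
mce_addrs() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "ADDR") print $(i + 1) }' | sort -u
}

# Demo on the line from the log above.
printf 'MISC 90840080008228c ADDR 9ce494000 \n' | mce_addrs
```

On a real system the input would be the log itself, for example mce_addrs < /var/log/mcelog.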

Enter dmidecode

mcelog tells us a weird bank number, but it has something else that’s vitally important: the address. ADDR 9ce494000 is a memory address on the faulty stick, and dmidecode can tell us which stick is responsible for that address:

# dmidecode -t 20
[...snip...]
Handle 0x005E, DMI type 20, 35 bytes
Memory Device Mapped Address
    Starting Address: 0x00800000000
    Ending Address: 0x00BFFFFFFFF
    Range Size: 16 GB
    Physical Device Handle: 0x005D
    Memory Array Mapped Address Handle: 0x0058
    Partition Row Position: 1
[...snip...]

This should be the problematic RAM stick, as address 0x009CE494000 is between 0x00800000000 and 0x00BFFFFFFFF. The stick has “Physical Device Handle” 0x005D. dmidecode can show us more information about this handle:
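The comparison above can also be left to the shell, which understands hex in arithmetic expansion. A rough sketch, assuming dmidecode prints Starting Address, Ending Address and Physical Device Handle in that order, as it does here; in_range and find_handle are illustrative names, not part of any tool:

```shell
# Succeed if address $1 lies within [$2, $3]; all three are hex strings.
# A bare address such as 9ce494000 (as mcelog prints it) gets a 0x prefix.
in_range() {
    addr=$1
    case $addr in 0x*|0X*) ;; *) addr=0x$addr ;; esac
    [ $(($addr)) -ge $(($2)) ] && [ $(($addr)) -le $(($3)) ]
}

# Read `dmidecode -t 20` text on stdin and print the Physical Device Handle
# of every mapped range that contains address $1.
find_handle() {
    awk -F': *' '
        /Starting Address:/       { s = $2 }
        /Ending Address:/         { e = $2 }
        /Physical Device Handle:/ { print s, e, $2 }
    ' | while read -r start end handle; do
        if in_range "$1" "$start" "$end"; then echo "$handle"; fi
    done
}

# Demo with the record shown above; prints the handle of the matching range.
find_handle 9ce494000 <<'EOF'
Handle 0x005E, DMI type 20, 35 bytes
Memory Device Mapped Address
    Starting Address: 0x00800000000
    Ending Address: 0x00BFFFFFFFF
    Range Size: 16 GB
    Physical Device Handle: 0x005D
EOF
```

On a live system the here-document would be replaced by the real output, for example dmidecode -t 20 | find_handle 9ce494000, run as root.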

# dmidecode -t 17
[...snip...]
Handle 0x005D, DMI type 17, 40 bytes
Memory Device
    Array Handle: 0x0057
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 72 bits
    Size: 16384 MB
    Form Factor: RIMM
    Set: None
    Locator: DIMM_B1
    Bank Locator: NODE 1
    Type: DDR4
    Type Detail: Synchronous
    Speed: 2133 MHz
    Manufacturer: Samsung
    Serial Number: 32BFE65D
    Asset Tag: DIMM_B1_AssetTag
    Part Number: M393A2G40DB0-CPB 
    Rank: 2
    Configured Clock Speed: 2133 MHz
    Minimum Voltage: Unknown
    Maximum Voltage: Unknown
    Configured Voltage: Unknown
[...snip...]

Here you can look at the Locator or the Asset Tag field. Both show the memory slot as DIMM_B1. Now that’s something we can use! The motherboard manual, available online, shows where DIMM_B1 is on the board.

So that’s the bad stick, which will be going back to the supplier with an RMA.
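The handle-to-slot lookup can be scripted in the same spirit. A small sketch, assuming the dmidecode -t 17 layout shown above, where each record starts with a "Handle ..." line and contains one "Locator:" line; locator_for is a hypothetical helper name:

```shell
# Read `dmidecode -t 17` text on stdin and print the Locator of handle $1.
locator_for() {
    awk -v h="$1" '
        # Remember which handle the current record belongs to.
        /^Handle / { cur = $2; sub(/,$/, "", cur) }
        # Match "Locator:" but not "Bank Locator:", inside the wanted record.
        (cur "") == (h "") && /^[ \t]*Locator:/ {
            sub(/^[ \t]*Locator:[ \t]*/, "")
            print
        }
    '
}

# Demo with part of the record shown above.
locator_for 0x005D <<'EOF'
Handle 0x005D, DMI type 17, 40 bytes
Memory Device
    Size: 16384 MB
    Locator: DIMM_B1
    Bank Locator: NODE 1
EOF
```

With real data this would be dmidecode -t 17 | locator_for 0x005D, again run as root.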

5 Comments

Jesse says:

2017-10-06 at 09:19

How do I know which DRAM chip and data bit failed on DIMM_B1?

Michael L West says:

2019-04-17 at 22:38

Can we make a bash script that searches all the starting and ending addresses to see whether the bad address falls between them?

Kevin says:

2021-07-28 at 23:41

On my system, dmidecode -t 20 reports nothing. Memory is only reported as an array as type 19:

dmidecode -t 19

Handle 0x0009, DMI type 19, 31 bytes
Memory Array Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x00FFFFFFFFF
Range Size: 64 GB
Physical Array Handle: 0x0008
Partition Width: 16

dmidecode -t 17 reports all memory modules with the same Physical Array Handle.

In this case, I suspect that the BANK information from the original MCE identifies the DIMM location.

2cents says:

2022-10-21 at 16:18

__ list memory addresses on faulty DIMM
grep mcelog:.CPU -A1 messages

__ list address ranges per DIMM
dmidecode | grep -iE 'locator.*node|Address:' | grep Locat -B2

use above outputs to ID bad DIMM

2cents says:

2022-10-21 at 23:10

Since this thread inspired me, here is more that someone may find helpful, including a one-liner to ID the bad DIMM.

grep mcelog.*MISC.*ADDR messages

dmidecode | grep -iE 'locator.*node|Address:' | grep Locat -A2 | grep :

dmidecode | grep -iE 'locator.*node|Address:' | grep Locat -A2 | grep : | sed s/Bank/\#Bank/|tr '\t\n\#' '\ \ \n' | grep Bank

____ one-liner combo of above commands
ADDRS="$(grep mcelog.*MISC.*ADDR messages | awk '{ print $NF }'|sort -u)"; dmidecode | grep -iE 'locator.*node|Address:' | grep Locat -A2 | grep : | sed s/Bank/\#Bank/|tr '\t\n\#' '\ \ \n'| tr -s '\ '| cut -d' ' -f3,6,9 | while read L S E; do echo $L $S $E .. ; for A in $(echo $ADDRS); do C=$(printf "0x%11s\n" "$A" | tr '\ ' '0'); [[ $C -ge $S ]] && [[ $C -le $E ]] && echo $L $S $E .. $C .. BINGO; done; done

___ output
# ADDRS="$(grep mcelog.*MISC.*ADDR messages | awk '{ print $NF }'|sort -u)"; dmidecode | grep -iE 'locator.*node|Address:' | grep Locat -A2 | grep : | sed s/Bank/\#Bank/|tr '\t\n\#' '\ \ \n'| tr -s '\ '| cut -d' ' -f3,6,9 | while read L S E; do echo $L $S $E .. ; for A in $(echo $ADDRS); do C=$(printf "0x%11s\n" "$A" | tr '\ ' '0'); [[ $C -ge $S ]] && [[ $C -le $E ]] && echo $L $S $E .. $C .. BINGO; done; done
..
A0_Node0_Channel0_Dimm0 0x00000000000 0x007FFFFFFFF ..
A1_Node0_Channel0_Dimm1 0x00800000000 0x00FFFFFFFFF ..
B0_Node0_Channel1_Dimm0 0x01000000000 0x017FFFFFFFF ..
B1_Node0_Channel1_Dimm1 0x01800000000 0x01FFFFFFFFF ..
B1_Node0_Channel1_Dimm1 0x01800000000 0x01FFFFFFFFF .. 0x0196aa15000 .. BINGO
B1_Node0_Channel1_Dimm1 0x01800000000 0x01FFFFFFFFF .. 0x0196aa54000 .. BINGO
B1_Node0_Channel1_Dimm1 0x01800000000 0x01FFFFFFFFF .. 0x0196aa55000 .. BINGO
B1_Node0_Channel1_Dimm1 0x01800000000 0x01FFFFFFFFF .. 0x01a27614000 .. BINGO
B1_Node0_Channel1_Dimm1 0x01800000000 0x01FFFFFFFFF .. 0x01a27615000 .. BINGO
B1_Node0_Channel1_Dimm1 0x01800000000 0x01FFFFFFFFF .. 0x01a27654000 .. BINGO
B1_Node0_Channel1_Dimm1 0x01800000000 0x01FFFFFFFFF .. 0x01a27655000 .. BINGO
C0_Node0_Channel2_Dimm0 0x02000000000 0x027FFFFFFFF ..
C1_Node0_Channel2_Dimm1 0x02800000000 0x02FFFFFFFFF ..
D0_Node0_Channel3_Dimm0 0x03000000000 0x037FFFFFFFF ..
D1_Node0_Channel3_Dimm1 0x03800000000 0x03FFFFFFFFF ..
A0_Node1_Channel0_Dimm0 0x04000000000 0x047FFFFFFFF ..
A1_Node1_Channel0_Dimm1 0x04800000000 0x04FFFFFFFFF ..
B0_Node1_Channel1_Dimm0 0x05000000000 0x057FFFFFFFF ..
B1_Node1_Channel1_Dimm1 0x05800000000 0x05FFFFFFFFF ..
C0_Node1_Channel2_Dimm0 0x06000000000 0x067FFFFFFFF ..
C1_Node1_Channel2_Dimm1 0x06800000000 0x06FFFFFFFFF ..
D0_Node1_Channel3_Dimm0 0x07000000000 0x077FFFFFFFF ..
D1_Node1_Channel3_Dimm1 0x07800000000 0x07FFFFFFFFF ..
