Understanding mcelog ECC errors – Which stick of RAM is broken?

Before reading this, please note that much of the information in mcelog is hardware dependent. Your mileage may vary.

Memory gone bad

So one of the servers, running an X99-WS/IPMI board from Asus, began putting errors into /var/log/mcelog. Thankfully, they were all the same, telling me the following:

mcelog: failed to prefill DIMM database from DMI data
Hardware event. This is not a software error.
CPU 0 BANK 11 
MISC 90840080008228c ADDR 9ce494000 
TIME 1499161840 Tue Jul 4 09:50:40 2017
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
Transaction: Memory scrubbing error
MemCtrl: Corrected patrol scrub error
STATUS 8c000051000800c2 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 79

So, what does this mean?

The first few lines tell us this happened from CPU 0 on BANK 11. This wasn’t much help, as the board only has 8 memory banks. It was suggested in #debian on freenode that the high bank number might be due to dual channel memory, but then how do I pinpoint the physical stick?

Enter dmidecode

mcelog tells us a weird bank number, but it has something else that’s vitally important; the address. ADDR 9ce494000 is a memory address on the faulty stick, and dmidecode can tell us which stick is responsible for that address:

# dmidecode -t 20
Handle 0x005E, DMI type 20, 35 bytes
Memory Device Mapped Address
    Starting Address: 0x00800000000
    Ending Address: 0x00BFFFFFFFF
    Range Size: 16 GB
    Physical Device Handle: 0x005D
    Memory Array Mapped Address Handle: 0x0058
    Partition Row Position: 1

This should be the problematic RAM stick, as address 0x009CE494000 is between 0x00800000000 and 0x00BFFFFFFFF. The stick has “Physical Device Handle” 0x005D. dmidecode can show us more information about this handle:

# dmidecode -t 17
Handle 0x005D, DMI type 17, 40 bytes
Memory Device
    Array Handle: 0x0057
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 72 bits
    Size: 16384 MB
    Form Factor: RIMM
    Set: None
    Locator: DIMM_B1
    Bank Locator: NODE 1
    Type: DDR4
    Type Detail: Synchronous
    Speed: 2133 MHz
    Manufacturer: Samsung
    Serial Number: 32BFE65D
    Asset Tag: DIMM_B1_AssetTag
    Part Number: M393A2G40DB0-CPB 
    Rank: 2
    Configured Clock Speed: 2133 MHz
    Minimum Voltage: Unknown
    Maximum Voltage: Unknown
    Configured Voltage: Unknown

Here you can look at the Locator, or the Asset Tag fields. Both show the memory slot as DIMM_B1. Now that’s something we can use! Looking in the motherboard manual, available online, one can see where DIMM_B1 is:

So that’s the bad stick, which will be going back to the supplier with an RMA.


  • Jesse says:

    How do I know which DRAM chip and data bit failed on DIMM_B1

  • Michael L West says:

    Can we make a bash script that searches all the starting and ending addresses to see if bad address can go between them?

  • Kevin says:

    On my system, dmidecode -t 20 reports nothing. Memory is only reported as an array as type 19:

    dmidecode -t 19

    Handle 0x0009, DMI type 19, 31 bytes
    Memory Array Mapped Address
    Starting Address: 0x00000000000
    Ending Address: 0x00FFFFFFFFF
    Range Size: 64 GB
    Physical Array Handle: 0x0008
    Partition Width: 16

    dmidecode -t 17 reports all memory modules with the same Physical Array Handle.

    In this case, I suspect that the BANK information from the original MCE identifies the DIMM location.

Leave a Reply

Your email address will not be published.