Before reading this, please note that much of the information in mcelog is hardware dependent. Your mileage may vary.
Memory gone bad
So one of the servers, running an X99-WS/IPMI board from Asus, began putting errors into /var/log/mcelog. Thankfully, they were all the same, telling me the following:
mcelog: failed to prefill DIMM database from DMI data
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 11
MISC 90840080008228c ADDR 9ce494000
TIME 1499161840 Tue Jul 4 09:50:40 2017
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER MS_CHANNEL2_ERR
Transaction: Memory scrubbing error
MemCtrl: Corrected patrol scrub error
STATUS 8c000051000800c2 MCGSTATUS 0
MCGCAP 7000c16 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 79
So, what does this mean?
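The first few lines tell us the error was reported by CPU 0 in BANK 11. This wasn’t much help, as the board only has 8 memory slots. (In an MCE record, BANK is a machine-check register bank on the CPU rather than a DIMM slot, so it does not name a stick directly.) It was suggested in #debian on freenode that the high bank number might be due to dual channel memory, but then how do I pinpoint the physical stick?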
Enter dmidecode
mcelog gives us a weird bank number, but it also has something vitally important: the address. ADDR 9ce494000 is a physical memory address on the faulty stick, and dmidecode can tell us which stick is responsible for that address:
# dmidecode -t 20
[...snip...]
Handle 0x005E, DMI type 20, 35 bytes
Memory Device Mapped Address
    Starting Address: 0x00800000000
    Ending Address: 0x00BFFFFFFFF
    Range Size: 16 GB
    Physical Device Handle: 0x005D
    Memory Array Mapped Address Handle: 0x0058
    Partition Row Position: 1
[...snip...]
This should be the problematic RAM stick, as address 0x009CE494000 is between 0x00800000000 and 0x00BFFFFFFFF. The stick has “Physical Device Handle” 0x005D. dmidecode can show us more information about this handle:
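The range comparison can also be left to the shell, since bash arithmetic understands hexadecimal numbers as long as they carry a 0x prefix. The following is a minimal sketch; the function name addr_in_range is made up for illustration, and the values are the ones from the output above.

```shell
#!/usr/bin/env bash
# addr_in_range ADDR START END  (hypothetical helper, not part of any tool)
# Succeeds (exit status 0) when START <= ADDR <= END.
# All arguments are hex strings; ADDR may lack the 0x prefix, as in mcelog.
addr_in_range() {
    local addr=$1 start=$2 end=$3
    # Bash arithmetic only reads a number as hex when it starts with 0x.
    [[ $addr == 0[xX]* ]] || addr="0x$addr"
    (( addr >= start && addr <= end ))
}

# ADDR from mcelog, Starting/Ending Address from dmidecode -t 20.
if addr_in_range 9ce494000 0x00800000000 0x00BFFFFFFFF; then
    echo "address is inside this range"
else
    echo "address is outside this range"
fi
```

With the values from this server, the check reports that the address is inside the range.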
# dmidecode -t 17
[...snip...]
Handle 0x005D, DMI type 17, 40 bytes
Memory Device
    Array Handle: 0x0057
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 72 bits
    Size: 16384 MB
    Form Factor: RIMM
    Set: None
    Locator: DIMM_B1
    Bank Locator: NODE 1
    Type: DDR4
    Type Detail: Synchronous
    Speed: 2133 MHz
    Manufacturer: Samsung
    Serial Number: 32BFE65D
    Asset Tag: DIMM_B1_AssetTag
    Part Number: M393A2G40DB0-CPB
    Rank: 2
    Configured Clock Speed: 2133 MHz
    Minimum Voltage: Unknown
    Maximum Voltage: Unknown
    Configured Voltage: Unknown
[...snip...]
Here you can look at the Locator or the Asset Tag field. Both show the memory slot as DIMM_B1. Now that’s something we can use! The motherboard manual, available online, shows where DIMM_B1 sits on the board.
So that’s the bad stick, which will be going back to the supplier with an RMA.
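If the type 17 listing is long, the handle lookup can be scripted instead of scrolled through. This is a rough sketch, assuming the dmidecode output is laid out as shown above; locator_for_handle is a hypothetical helper name, and the demonstration feeds it a short excerpt rather than live output.

```shell
#!/usr/bin/env bash
# locator_for_handle HANDLE  (hypothetical helper, not part of dmidecode)
# Reads `dmidecode -t 17` output on stdin and prints the Locator field of
# the record whose handle matches HANDLE, for example 0x005D.
locator_for_handle() {
    awk -v want="$1" '
        # A record starts with: Handle 0x005D, DMI type 17, 40 bytes
        /^Handle / {
            h = $2
            sub(/,$/, "", h)
            hit = (tolower(h) == tolower(want))
        }
        # The pattern is anchored, so the Bank Locator line is skipped.
        hit && /^[ \t]*Locator:/ {
            sub(/^[ \t]*Locator:[ \t]*/, "")
            print
            exit
        }
    '
}

# On a real machine (needs root):
#   dmidecode -t 17 | locator_for_handle 0x005D
# Demonstration with an excerpt of the output above:
locator_for_handle 0x005D <<'EOF'
Handle 0x005D, DMI type 17, 40 bytes
Memory Device
    Array Handle: 0x0057
    Size: 16384 MB
    Locator: DIMM_B1
    Bank Locator: NODE 1
EOF
```

For the excerpt this prints DIMM_B1, the same slot name found by reading the output by hand.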
5 Comments
How do I know which DRAM chip and data bit failed on DIMM_B1?
Can we make a bash script that searches all the starting and ending addresses to see whether the bad address falls between them?
On my system, dmidecode -t 20 reports nothing. Memory is only reported as an array as type 19:
dmidecode -t 19
Handle 0x0009, DMI type 19, 31 bytes
Memory Array Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x00FFFFFFFFF
Range Size: 64 GB
Physical Array Handle: 0x0008
Partition Width: 16
dmidecode -t 17 reports all memory modules with the same Physical Array Handle.
In this case, I suspect that the BANK information from the original MCE identifies the DIMM location.
__ list memory addresses on faulty DIMM
grep mcelog:.CPU -A1 messages
__ list address ranges per DIMM
dmidecode | grep -iE 'locator.*node|Address:' | grep Locat -B2
use above outputs to ID bad DIMM
Since this thread inspired me, here is more that someone may find helpful, including a one-liner to identify the bad DIMM:
grep mcelog.*MISC.*ADDR messages
dmidecode | grep -iE 'locator.*node|Address:' | grep Locat -A2 | grep :
dmidecode | grep -iE 'locator.*node|Address:' | grep Locat -A2 | grep : | sed s/Bank/\#Bank/|tr '\t\n\#' '\ \ \n' | grep Bank
____ one-liner combo of above commands
ADDRS="$(grep mcelog.*MISC.*ADDR messages | awk '{ print $NF }'|sort -u)"; dmidecode | grep -iE 'locator.*node|Address:' | grep Locat -A2 | grep : | sed s/Bank/\#Bank/|tr '\t\n\#' '\ \ \n'| tr -s '\ '| cut -d' ' -f3,6,9 | while read L S E; do echo $L $S $E .. ; for A in $(echo $ADDRS); do C=$(printf "0x%11s\n" "$A" | tr '\ ' '0'); [[ $C -ge $S ]] && [[ $C -le $E ]] && echo $L $S $E .. $C .. BINGO; done; done
___ output
# ADDRS="$(grep mcelog.*MISC.*ADDR messages | awk '{ print $NF }'|sort -u)"; dmidecode | grep -iE 'locator.*node|Address:' | grep Locat -A2 | grep : | sed s/Bank/\#Bank/|tr '\t\n\#' '\ \ \n'| tr -s '\ '| cut -d' ' -f3,6,9 | while read L S E; do echo $L $S $E .. ; for A in $(echo $ADDRS); do C=$(printf "0x%11s\n" "$A" | tr '\ ' '0'); [[ $C -ge $S ]] && [[ $C -le $E ]] && echo $L $S $E .. $C .. BINGO; done; done
..
A0_Node0_Channel0_Dimm0 0x00000000000 0x007FFFFFFFF ..
A1_Node0_Channel0_Dimm1 0x00800000000 0x00FFFFFFFFF ..
B0_Node0_Channel1_Dimm0 0x01000000000 0x017FFFFFFFF ..
B1_Node0_Channel1_Dimm1 0x01800000000 0x01FFFFFFFFF ..
B1_Node0_Channel1_Dimm1 0x01800000000 0x01FFFFFFFFF .. 0x0196aa15000 .. BINGO
B1_Node0_Channel1_Dimm1 0x01800000000 0x01FFFFFFFFF .. 0x0196aa54000 .. BINGO
B1_Node0_Channel1_Dimm1 0x01800000000 0x01FFFFFFFFF .. 0x0196aa55000 .. BINGO
B1_Node0_Channel1_Dimm1 0x01800000000 0x01FFFFFFFFF .. 0x01a27614000 .. BINGO
B1_Node0_Channel1_Dimm1 0x01800000000 0x01FFFFFFFFF .. 0x01a27615000 .. BINGO
B1_Node0_Channel1_Dimm1 0x01800000000 0x01FFFFFFFFF .. 0x01a27654000 .. BINGO
B1_Node0_Channel1_Dimm1 0x01800000000 0x01FFFFFFFFF .. 0x01a27655000 .. BINGO
C0_Node0_Channel2_Dimm0 0x02000000000 0x027FFFFFFFF ..
C1_Node0_Channel2_Dimm1 0x02800000000 0x02FFFFFFFFF ..
D0_Node0_Channel3_Dimm0 0x03000000000 0x037FFFFFFFF ..
D1_Node0_Channel3_Dimm1 0x03800000000 0x03FFFFFFFFF ..
A0_Node1_Channel0_Dimm0 0x04000000000 0x047FFFFFFFF ..
A1_Node1_Channel0_Dimm1 0x04800000000 0x04FFFFFFFFF ..
B0_Node1_Channel1_Dimm0 0x05000000000 0x057FFFFFFFF ..
B1_Node1_Channel1_Dimm1 0x05800000000 0x05FFFFFFFFF ..
C0_Node1_Channel2_Dimm0 0x06000000000 0x067FFFFFFFF ..
C1_Node1_Channel2_Dimm1 0x06800000000 0x06FFFFFFFFF ..
D0_Node1_Channel3_Dimm0 0x07000000000 0x077FFFFFFFF ..
D1_Node1_Channel3_Dimm1 0x07800000000 0x07FFFFFFFFF ..
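For anyone who would rather read a script than a one-liner, the same idea can be spelled out as below. Treat it as a sketch under assumptions, not a drop-in tool: it assumes dmidecode reports type 20 records with a Physical Device Handle and prints addresses in the 0x... form shown in this thread (on systems that only report type 19 it will print nothing), and the function names dimm_ranges and find_dimm are invented here. It joins type 20 ranges to type 17 locators by handle rather than by line order.

```shell
#!/usr/bin/env bash
# dimm_ranges  (hypothetical helper)
# Reads full `dmidecode` output on stdin and prints one line per type 20
# record in the form: <locator> <start> <end>
dimm_ranges() {
    awk '
        # Record header: Handle 0x005E, DMI type 20, 35 bytes
        /^Handle / { handle = $2; sub(/,$/, "", handle); type = $5 + 0 }
        # Type 17: remember the slot name for this handle.
        type == 17 && /^[ \t]*Locator:/ {
            sub(/^[ \t]*Locator:[ \t]*/, "")
            sub(/[ \t]+$/, "")
            gsub(/[ \t]+/, "_")      # keep the name a single word
            loc[handle] = $0
        }
        # Type 20: collect the address range and the DIMM it points to.
        type == 20 && /Starting Address:/ { start = $NF }
        type == 20 && /Ending Address:/   { end = $NF }
        type == 20 && /Physical Device Handle:/ {
            n++; dev[n] = $NF; s[n] = start; e[n] = end
        }
        END {
            for (i = 1; i <= n; i++)
                print ((dev[i] in loc) ? loc[dev[i]] : dev[i]), s[i], e[i]
        }
    '
}

# find_dimm ADDR...  (hypothetical helper)
# Reads dmidecode output on stdin and reports every range holding an ADDR.
# ADDR is hex, with or without the 0x prefix.
find_dimm() {
    local loc start end addr
    dimm_ranges | while read -r loc start end; do
        for addr in "$@"; do
            [[ $addr == 0[xX]* ]] || addr="0x$addr"
            if (( addr >= start && addr <= end )); then
                echo "$addr is on $loc ($start-$end)"
            fi
        done
    done
}

# On a real machine (needs root), with ADDR values copied from mcelog:
#   dmidecode | find_dimm 9ce494000
# Demonstration with an excerpt from the article:
find_dimm 9ce494000 <<'EOF'
Handle 0x005D, DMI type 17, 40 bytes
Memory Device
    Locator: DIMM_B1
    Bank Locator: NODE 1

Handle 0x005E, DMI type 20, 35 bytes
Memory Device Mapped Address
    Starting Address: 0x00800000000
    Ending Address: 0x00BFFFFFFFF
    Range Size: 16 GB
    Physical Device Handle: 0x005D
EOF
```

The comparison is done with bash arithmetic, so the addresses do not need to be padded to the same width first.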