Understanding mcelog ECC errors – Which stick of RAM is broken?

Before reading this, please note that much of the information in mcelog is hardware dependent. Your mileage may vary.

Memory gone bad

So one of the servers, running an X99-WS/IPMI board from Asus, began putting errors into /var/log/mcelog. Thankfully, they were all the same, telling me the following:

mcelog: failed to prefill DIMM database from DMI data
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 11 
MISC 90840080008228c ADDR 9ce494000 
TIME 1499161840 Tue Jul 4 09:50:40 2017
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER MS_CHANNEL2_ERR
Transaction: Memory scrubbing error
MemCtrl: Corrected patrol scrub error
STATUS 8c000051000800c2 MCGSTATUS 0
MCGCAP 7000c16 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 79

So, what does this mean?

The first few lines tell us this happened on CPU 0, in BANK 11. This wasn’t much help, as the board only has 8 memory slots. It was suggested in #debian on freenode that the high bank number might be due to dual channel memory, although the BANK in an MCE record normally refers to one of the CPU’s machine check register banks rather than to a DIMM slot. Either way, how do I pinpoint the physical stick?
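Most of the decoded lines in the log come straight out of the STATUS word. As a rough sketch, assuming the IA32_MCi_STATUS layout from Intel's Software Developer's Manual (flag bits 57 to 63, and a memory controller error code of the form 0000 0000 1MMM CCCC in the low 16 bits), it can be picked apart in Bash. The function name decode_mci_status is made up for this example, and other CPU vendors or error classes will need different decoding.

```shell
#!/usr/bin/env bash
# decode_mci_status HEX: print the flag bits of an MCi STATUS word and, if the
# low 16 bits look like a memory controller error, its type and channel.
# Assumes the Intel layout described above; hypothetical helper name.
decode_mci_status() {
    local v=$(( 16#${1#0x} ))      # wraps to a negative number, bit tests still work
    local field name pos code
    local -a types=(generic read write address/command scrub)

    for field in 63:VAL 62:OVER 61:UC 60:EN 59:MISCV 58:ADDRV 57:PCC; do
        pos=${field%%:*}
        name=${field#*:}
        printf '%s=%d\n' "$name" $(( (v >> pos) & 1 ))
    done

    code=$(( v & 0xffff ))
    if (( (code & 0xff80) == 0x0080 )); then
        # MMM selects the transaction type, CCCC the channel (15 = unspecified)
        printf 'memory controller error: type=%s channel=%d\n' \
            "${types[(code >> 4) & 7]:-reserved}" $(( code & 15 ))
    fi
}
```

Running decode_mci_status 8c000051000800c2 on the value from the log should report UC=0 (a corrected error), MISCV=1 and ADDRV=1, and a scrub error on channel 2, which lines up with what mcelog printed.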

Enter dmidecode

mcelog tells us a weird bank number, but it has something else that’s vitally important: the address. ADDR 9ce494000 is a memory address on the faulty stick, and dmidecode can tell us which stick is responsible for that address:

# dmidecode -t 20
[...snip...]
Handle 0x005E, DMI type 20, 35 bytes
Memory Device Mapped Address
    Starting Address: 0x00800000000
    Ending Address: 0x00BFFFFFFFF
    Range Size: 16 GB
    Physical Device Handle: 0x005D
    Memory Array Mapped Address Handle: 0x0058
    Partition Row Position: 1
[...snip...]

This should be the problematic RAM stick, as address 0x009CE494000 is between 0x00800000000 and 0x00BFFFFFFFF. The stick has “Physical Device Handle” 0x005D. dmidecode can show us more information about this handle:
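Comparing hex ranges by eye gets tedious when there are many sticks. Here is a minimal sketch that does the comparison, assuming dmidecode prints the Starting Address, Ending Address and Physical Device Handle fields in the format shown above. The function name find_dimm_handle is made up for this example.

```shell
#!/usr/bin/env bash
# find_dimm_handle ADDR: read `dmidecode -t 20` output on stdin and print the
# "Physical Device Handle" of every mapped range that contains ADDR.
# ADDR is hexadecimal, with or without a leading 0x (mcelog prints it without).
find_dimm_handle() {
    local addr=$(( 16#${1#0x} )) start= end= key value
    while IFS=: read -r key value; do
        key=${key#"${key%%[![:space:]]*}"}   # strip leading indentation
        value=${value//[[:space:]]/}         # strip all whitespace
        case $key in
            "Starting Address") start=$value ;;
            "Ending Address")   end=$value ;;
            "Physical Device Handle")
                if [[ -n $start && -n $end ]] && (( addr >= start && addr <= end )); then
                    printf '%s\n' "$value"
                fi
                start= end= ;;
        esac
    done
}
```

As root, something like dmidecode -t 20 | find_dimm_handle 9ce494000 should then print 0x005D on this machine.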

# dmidecode -t 17
[...snip...]
Handle 0x005D, DMI type 17, 40 bytes
Memory Device
    Array Handle: 0x0057
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 72 bits
    Size: 16384 MB
    Form Factor: RIMM
    Set: None
    Locator: DIMM_B1
    Bank Locator: NODE 1
    Type: DDR4
    Type Detail: Synchronous
    Speed: 2133 MHz
    Manufacturer: Samsung
    Serial Number: 32BFE65D
    Asset Tag: DIMM_B1_AssetTag
    Part Number: M393A2G40DB0-CPB 
    Rank: 2
    Configured Clock Speed: 2133 MHz
    Minimum Voltage: Unknown
    Maximum Voltage: Unknown
    Configured Voltage: Unknown
[...snip...]

Here you can look at the Locator, or the Asset Tag fields. Both show the memory slot as DIMM_B1. Now that’s something we can use! Looking in the motherboard manual, available online, one can see where DIMM_B1 is.
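The handle to slot lookup can be scripted as well. This is only a sketch, assuming the dmidecode -t 17 output format shown above; dimm_locator is a made-up name.

```shell
#!/usr/bin/env bash
# dimm_locator HANDLE: read `dmidecode -t 17` output on stdin and print the
# "Locator" field of the memory device with the given handle (e.g. 0x005D).
dimm_locator() {
    local want=${1,,} in_block=0 line
    while IFS= read -r line; do
        if [[ $line == Handle\ * ]]; then
            # "Handle 0x005D, DMI type 17, 40 bytes" -> "0x005D"
            line=${line#Handle }
            line=${line%%,*}
            if [[ ${line,,} == "$want" ]]; then in_block=1; else in_block=0; fi
        elif (( in_block )) && [[ $line =~ ^[[:space:]]*Locator:[[:space:]]*(.*)$ ]]; then
            # Anchored after the indentation, so "Bank Locator:" does not match
            printf '%s\n' "${BASH_REMATCH[1]}"
        fi
    done
}
```

With the output above, dmidecode -t 17 | dimm_locator 0x005D should print DIMM_B1.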

So that’s the bad stick, which will be going back to the supplier with an RMA.

Finding duplicate files with Bash

Someone recently asked, in #bash on Freenode, how to find duplicate files with Bash. Several options were suggested, and the user ended up installing and running “fdupes”. However, this sort of thing should be reasonably easy to do using “find” and a few pipes.

As a quick overview: first find all the files and list their sizes and names, each entry followed by a NUL ($'\0') separator to allow for wonky filenames. Sizes come first because they are the quickest values to compare. Keep only the files that share their size with at least one other file, and discard the rest. Then run some sort of checksum tool over what remains; I chose sha1sum. Again, discard the files with no duplicates. And that’s really all there is to it.
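The steps above can be sketched in a few dozen lines. This is not cdupes itself, just a stripped-down illustration of the size-then-checksum idea, assuming GNU find (for -printf) and sha1sum are available. The function name find_dupes is made up for this example.

```shell
#!/usr/bin/env bash
# find_dupes DIR...: print groups of files with identical content, one path per
# line, with an empty line after each group.
find_dupes() {
    local -A size_count=() sum_count=()
    local -a sizes=() paths=() cand_paths=() cand_sums=()
    local entry size sum i

    # Pass 1: record "size path" for every regular file, NUL-terminated so
    # that newlines and other odd characters in names survive.
    while IFS= read -r -d '' entry; do
        size=${entry%% *}
        sizes+=("$size")
        paths+=("${entry#* }")
        size_count[$size]=$(( ${size_count[$size]:-0} + 1 ))
    done < <(find "$@" -type f -printf '%s %p\0')

    # Pass 2: checksum only the files whose size occurs more than once.
    for i in "${!paths[@]}"; do
        (( ${size_count[${sizes[i]}]} > 1 )) || continue
        sum=$(sha1sum < "${paths[i]}") || continue
        sum=${sum%% *}
        cand_paths+=("${paths[i]}")
        cand_sums+=("$sum")
        sum_count[$sum]=$(( ${sum_count[$sum]:-0} + 1 ))
    done

    # Pass 3: print every checksum group that has at least two members.
    for sum in "${!sum_count[@]}"; do
        (( ${sum_count[$sum]} > 1 )) || continue
        for i in "${!cand_sums[@]}"; do
            if [[ ${cand_sums[i]} == "$sum" ]]; then
                printf '%s\n' "${cand_paths[i]}"
            fi
        done
        echo
    done
}
```

Calling find_dupes /tmp always recurses and prints groups in no particular order. Hard links, permissions, hidden files and error reporting are all ignored here.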

Feeling bored one night, I decided to try this. As usual, the whole thing grew entirely out of proportion, and I ended up duplicating most of fdupes’ functionality in Bash. Still, the resulting script is surprisingly fast, looks nice, has a decent help menu and doesn’t rely on anything not found on virtually every GNU/Linux machine out there. It also supports a few things fdupes does not, such as (as of 2017-07-01) null terminated output. Thus it found its way into my toolbox.

Let me present “cdupes”:

$ cdupes
cdupes: no directories specified

Description:
  Bash script that's functionally similar to the tool "fdupes"

Usage: cdupes [options] <dir> [dir] ...
Options:
  -0
    Null termination. Allows piping filenames with weird characters into other tools
  -5
    md5sum. Uses md5sum instead of the default sha1sum
  -A
    NoHidden. Excludes files which names start with --> . <--
  -c
    Checksum. Show checksum of duplicate files
  -f
    Omit first match from each group. Useful with -m for cleanup scripts
  -l
    Hard link support. Do not consider hard linked files duplicates of eachother
  -m
    Machine readable. No empty lines between groups
    Probably only useful in conjunction with -c or -f
  -n
    NoEmpty. Ignore empty files (size 0)
  -p
    Permissions. Don't consider files with different owner/group or permission bits as duplicates
  -r
    Recurse. For every directory given, follow subdirectories encountered within.
  -S
    Size. Show size of duplicate files
  -q
    Quiet. Hides progress indicators
  -Q
    Quiet errors. Errors will not be printed. Does not hide progress indicators

$ cdupes -r /tmp
Files: 21
Same size: 13
Checksum: 13

/tmp/.fp1
/tmp/fp3

/tmp/foo/arse
/tmp/oh/file-OICAwX
/tmp/oh/file-WUJvo7

Duplicate search exited with error status: 1

ERRORS:
find: ‘/tmp/nonono’: Permission denied
find: ‘/tmp/root’: Permission denied
find: ‘/tmp/.cathedral’: Permission denied
sha1sum: /tmp/.startup.lock: Permission denied
sha1sum: /tmp/dabba: Permission denied
sha1sum: /tmp/fabba: Permission denied

$ sudo cdupes -rcSq /tmp
2ab06f95377aecc42e5a0e85573a3e7e3efa0961 157286400 /tmp/.fp1
2ab06f95377aecc42e5a0e85573a3e7e3efa0961 157286400 /tmp/fp3
da39a3ee5e6b4b0d3255bfef95601890afd80709 0 /tmp/.startup.lock
da39a3ee5e6b4b0d3255bfef95601890afd80709 0 /tmp/dabba
da39a3ee5e6b4b0d3255bfef95601890afd80709 0 /tmp/fabba
da39a3ee5e6b4b0d3255bfef95601890afd80709 0 /tmp/foo/arse
da39a3ee5e6b4b0d3255bfef95601890afd80709 0 /tmp/test/file-3muc5s
da39a3ee5e6b4b0d3255bfef95601890afd80709 0 /tmp/test/file-8ZjpE9
da39a3ee5e6b4b0d3255bfef95601890afd80709 0 /tmp/root/file-ox8qJb

If you’re interested in trying my script, or including it in your own set of SysOp tools, you can find it here.
Have fun!

Cygwin for Windows XP – fast mirror

Cygwin stopped being compatible with Windows XP in 2016. Following this post on Stack Overflow, you can find the setup files and a (very slow) mirror that’s compatible.

I made my own, considerably faster, mirror of the last Windows XP compatible Cygwin release, at http://cygwinxp.cathedral-networks.org/

The mirror date is 2016-08-30, the 30th of August 2016 AD.

The setup file is available in the cathedral subdirectory, http://cygwinxp.cathedral-networks.org/cathedral/, along with a batch file that contains the command line options required to install from that mirror. I suggest you download both and run the batch file. If you only want the setup file, the essential option is the --no-verify one, which disables the use of package signatures. The signatures were not available in the Cygwin archive mirror, so I don’t have them.

Enjoy.

The Intel 82579V on Hyper-V Server 2016, installing unsupported network adapters

Repurposing an old developer machine to run some light VMs, I was tasked with getting Hyper-V Server running on the Asus Maximus V GENE motherboard. Simple enough. Download Hyper-V for free from Microsoft, format a USB stick with FAT32, put all the files from the .ISO on the stick, install.

Diskpart Cheat Sheet

 1) Open the command prompt as Administrator
 2) "diskpart"
 3) "list disk"
 4) "select disk N", where N is the number of the USB drive
 5) "clean"
 6) "create partition primary"
 7) "active"
 8) "format fs=fat32 quick"
 9) "assign"
10) "exit"

But hey, no network! 🙁


“No active network adapters found.” eh?

Turns out Intel isn’t a particularly big fan of people running servers on their consumer gaming boards, but not wanting to wait for a new NIC to be delivered, it was time to hit Google.

I came up with these links:
https://www.iarp.ca/blog/8-hyper-v-and-the-intel-82579
https://jayrbarrios.com/2014/11/19/intel-nuc-d54250wykh-installing-lan-driver-on-windows-hyper-v-server-2012-r2/

So I followed the instructions, but switched out the drivers for the latest ones and the OS for Hyper-V 2016.

NDIS reference

A short reference to Intel’s Windows driver naming scheme

Directory   Desktop OS     Server OS
NDIS62      Windows 7      Server 2008 R2
NDIS63      Windows 8      Server 2012
NDIS64      Windows 8.1    Server 2012 R2
NDIS65      Windows 10     Server 2016

Driver Installation

1 => The latest Intel networking drivers, intended for Windows Server 2016, are available here (local archive). Downloaded and extracted with 7-Zip.

2 => Edited the file PROWinx64\PRO1000\Winx64\NDIS65\e1c65x64.inf, removing the lines under [ControlFlags] and copying the E1502NC lines to the next section as described in the links I found. It ended up looking like this:

3 => Disabled the driver signing, as the signature will now be invalid

bcdedit /set LOADOPTIONS DISABLE_INTEGRITY_CHECKS
bcdedit /set TESTSIGNING ON
bcdedit /set NOINTEGRITYCHECKS ON

4 => Rebooted (IMPORTANT)

5 => Put the driver folder on a USB stick, and installed it from the shell

X:
cd \PROWinx64\PRO1000\Winx64\NDIS65
pnputil -i -a e1c65x64.inf

6 => Agreed to the popup warning about installation of unsigned drivers

7 => Rebooted (IMPORTANT)

8 => Considered restoring the driver signing options with bcdedit, and then ignored it as I had better things to do

That’s it. If you’re uncomfortable modifying the .inf file yourself, you can download my already “unlocked” copy here: PROWinx64.zip

Killing a process tree, killing a PID and all children

Here are a couple of useful functions I wrote to kill a process tree. It’s useful when dealing with tcpserver, which refuses to disconnect a client for as long as a process it started, or a child of that process, still runs. It’s also good for tackling runaway unintentional fork bombs, rogue daemons and other mishaps.

The following script provides two functions:

sf_killtree

sends SIGSTOP to a process, kills all its children recursively, kills the process, sends SIGCONT and waits for it, to avoid “Terminated:” messages

sf_killchildren

kills all children of the current script without killing the script itself

 

Script:

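function sf_killtree
{
    # Usage: sf_killtree <pid> [signal]   (signal defaults to TERM)
    local ppid=$1 sig=${2:-"TERM"} pid
    if ! kill -0 "$ppid"; then
        builtin echo "killtree: ($ppid) - No such process" >&2
        return 1
    fi
    # Freeze the process (unless it is us) so it cannot spawn new children
    (( ppid == $$ )) || kill -STOP "$ppid" 2>/dev/null || :
    # List the direct children; fall back to ps if pgrep is unavailable
    local pids=( $({ pgrep -P "$ppid" || ps -o pid= --ppid "$ppid" || :; } 2>/dev/null ) )
    # Take down each child's own subtree first
    if [[ -n "${pids[*]:-}" ]]; then
        for pid in "${pids[@]}"; do
            sf_killtree "$pid" "$sig" 2>/dev/null || :
        done
    fi
    # Signal the process itself, resume it so the signal is delivered, and
    # reap it to avoid "Terminated:" messages
    kill "-${sig}" "$ppid" 2>/dev/null || :
    (( ppid == $$ )) || kill -CONT "$ppid" 2>/dev/null || :
    wait "$ppid" 2>/dev/null || :
}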

function sf_killchildren
{
    local _term=$(builtin trap -p TERM)
    builtin trap : TERM
    sf_killtree $$ TERM;
    builtin trap - TERM
    builtin eval "$_term"
}

Addendum

Please also note the following from the KILL (2) man page:
“If pid equals 0, then sig is sent to every process in the process group of the calling process.”
While this does kill the script itself, unlike sf_killchildren, it can be a useful thing to end a script with, or to put in an EXIT trap.
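Here is a small sketch of the EXIT trap variant. One assumption worth spelling out: a script started from another non-interactive script normally shares its caller's process group, so kill 0 would take the caller down as well. The demo therefore starts its inner script through setsid (from util-linux) to give it a session and process group of its own. The name run_with_group_cleanup is made up for this example.

```shell
#!/usr/bin/env bash
# run_with_group_cleanup: start a throwaway script in its own session and let
# its EXIT trap signal the whole process group, background jobs included.
run_with_group_cleanup() {
    setsid bash -c '
        # On exit: catch TERM ourselves, then signal every process in our group
        trap "trap : TERM; kill -TERM 0" EXIT
        sleep 30 &      # stray background job that would otherwise linger
        echo "background job: $!"
    '
}
```

The inner script ends right after the echo, and the trap then takes the sleep with it instead of leaving it running for another 30 seconds.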

Making a telnet server with Bash and tcpserver

Telnet servers are everywhere. They are simple, insecure by nature, outdated, and often badly implemented. Let’s do that, shall we? 🙂

TL;DR: Here’s the library.

A simple way to make any command line interface listen to incoming TCP connections is tcpserver, provided as part of the ucspi-tcp or ucspi-tcp-ipv6 package on Debian. Using this, we can make any Bash script available on the network.

Take a look at the following command:

# /usr/bin/tcpserver -c 10 -HR -u 65534 -g 65534 0.0.0.0 2048 /usr/local/bin/script

Here, we’re running tcpserver as root. This is to allow it to later change to a lower privilege level by switching to UID 65534 and GID 65534, nobody:nogroup. We’re also limiting the number of simultaneous connections to 10 and skipping a few reverse lookups. We’re listening on all available interfaces on port 2048.

Each incoming connection will start one instance of the script /usr/local/bin/script, which will be running as the user and group provided. If you do not specify -u and -g, the script will be running as the user that started tcpserver. Be careful with this.

Everything printed by the script to stdout will go to the client, and attempting to read from stdin will read from the client. We can test this with a short script:

#!/usr/bin/env bash
echo -n "Please enter your name: "
read -r name
echo "Hello, $name"
echo

Running this script through tcpserver, I can now connect to it with telnet:

$ telnet localhost 2048
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Please enter your name: Testy McTest
Hello, Testy McTest

Connection closed by foreign host.

That was easy. Now for the annoying stuff. RFC 854 specifies all sorts of things we should pay attention to, such as option negotiation and escape sequences. Most of this isn’t important, but we do need to pay attention to some of it if we want our telnet server to be smarter than a bag of hammers. Handling ^C (ctrl+c) would be good, as would proper telnet line endings: if you try to connect to the above script with telnet on Windows, it won’t be pretty.

Responding properly to escape sequences requires reading incoming characters one by one; we can’t wait for a line ending. Properly formatting our line endings requires either printing every line with an explicit ending, as in printf "Hello!\r\n", or running all output through another command that replaces Linux style newlines (\n) with Telnet style line endings (\r\n).
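To make both ideas concrete, here is a bare-bones sketch. It is not the code from shellfunc_telnet; the function names and return codes are invented for this example, and it ignores telnet option negotiation entirely. It treats LF as the end of a line and drops CR and NUL bytes, which is a simplification.

```shell
#!/usr/bin/env bash
# read_raw_line VAR: read one line from stdin a byte at a time into VAR.
# Returns 0 at LF, 1 at end of input, 2 when ^C (byte 0x03) arrives.
read_raw_line() {
    local LC_ALL=C _ch _buf=
    while IFS= read -r -d '' -n 1 _ch; do
        case $_ch in
            $'\n')    printf -v "$1" '%s' "$_buf"; return 0 ;;
            $'\003')  printf -v "$1" '%s' "$_buf"; return 2 ;;
            $'\r'|'') ;;                 # drop CR and NUL
            *)        _buf+=$_ch ;;
        esac
    done
    printf -v "$1" '%s' "$_buf"
    return 1
}

# to_crlf: filter that rewrites every line ending on stdin to CR LF.
to_crlf() {
    local line
    while IFS= read -r line || [[ -n $line ]]; do
        printf '%s\r\n' "${line%$'\r'}"
    done
}
```

A script could send all of its output through the filter with exec > >(to_crlf). Note that a line based filter like this holds back prompts that do not end in a newline, which is one reason a real implementation needs more care.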

I wrote a set of functions to handle all of this for me. I included them in the Cathedral ShellFunc library, but they can run on their own if a few supporting functions are exported as well.
You can get shellfunc_telnet here if you’re interested.

With it, our script will work with Windows clients, line editing functions correctly, ctrl+c will disconnect properly, line endings are corrected, and we have login support with characters displayed as asterisks.

Here’s an example script that uses shellfunc_telnet:

#!/usr/bin/env bash
source shellfunc_telnet

echo "$TCPREMOTEIP:$TCPREMOTEPORT ++ ($$)" >&2
trap 'sf_killchildren || :; echo "$TCPREMOTEIP:$TCPREMOTEPORT -- ($$)" >&2' EXIT

# the password is "test"
testhash="9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"

sf_tn_init echo

if ! sf_tn_read -t 120 -P "Please enter your name: " name || [[ -z "$name" ]]; then
    echo -e "\nNo name, no entry"
    exit 0
fi
echo "Hello, $name"

if ! sf_tn_read -t 120 -P "Password: " -p pass; then
    echo
    exit 0
fi
echo "Your pass hash is $pass"

if [[ "$pass" != "$testhash" ]]; then
    echo "Password incorrect. Go away."
    exit 0
fi

echo "Password correct. Welcome."
echo "I will now loop and print your lines back to you."
echo "Press ctrl+c or give me an empty line to quit"

while true; do
    sf_tn_read -t 120 line
    rc=$? # save it, as every test below overwrites $?
    if (( rc == 69 )); then # interrupt
        echo "Interrupted!"
    elif (( rc > 128 )); then # timeout
        echo "Too slow!"
    fi

    [[ -n "$line" ]] || break

    echo "Your line: $line"
done

echo "Bye."
echo
sleep 1

Here’s a sample session:

$ telnet localhost 2048
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Please enter your name: Testy McTest
Hello, Testy McTest
Password: ****
Your pass hash is 9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
Password correct. Welcome.
I will now loop and print your lines back to you.
Press ctrl+c or give me an empty line to quit
line 1
Your line: line 1
foo
Your line: foo
Q
Your line: Q
Interrupted!
Bye.

Connection closed by foreign host.

It even works from Windows!

We can make the script run at boot time by adding it to root’s crontab as such:

@reboot /usr/bin/tcpserver -c 10 -HR -u 65534 -g 65534 0.0.0.0 2048 /usr/local/bin/script 2>&1 | logger -t tcp-2048 -p user.notice

Here, we’re starting tcpserver at boot time. Anything our script prints to standard error, which is currently a log of new connections and disconnects, is piped into the system log.

If you’re curious about sf_killchildren, you can read more about it here.

Have fun. Drop me a line if you do something cool with this.

Resource used: The Telnet Protocol (archived here)

apt-get reports Packages not found

So.. this shit again.

# apt-get update
Ign http://ftp.debian.org jessie-updates InRelease
Ign http://ftp.debian.org jessie-updates Release.gpg
Ign http://ftp.debian.org jessie-updates Release
Ign http://ftp.debian.org jessie-updates/main Sources/DiffIndex
Ign http://ftp.debian.org jessie-updates/main i386 Packages/DiffIndex
Ign http://ftp.no.debian.org jessie InRelease
Ign http://ftp.no.debian.org jessie Release.gpg
Ign http://ftp.no.debian.org jessie Release
Ign http://ftp.no.debian.org jessie/main Sources/DiffIndex
Ign http://ftp.no.debian.org jessie/main i386 Packages/DiffIndex
Ign http://security.debian.org jessie/updates InRelease
Ign http://security.debian.org jessie/updates Release.gpg
Ign http://security.debian.org jessie/updates Release
Ign http://security.debian.org jessie/updates/main Sources/DiffIndex
Ign http://security.debian.org jessie/updates/main i386 Packages/DiffIndex
Err http://security.debian.org jessie/updates/main Sources 404 Not Found [IP: 212.211.132.250 80]
Err http://security.debian.org jessie/updates/main i386 Packages 404 Not Found [IP: 212.211.132.250 80]
Err http://ftp.debian.org jessie-updates/main Sources 404 Not Found [IP: 130.89.148.12 80]
Err http://ftp.debian.org jessie-updates/main i386 Packages 404 Not Found [IP: 130.89.148.12 80]
Err http://ftp.no.debian.org jessie/main Sources 404 Not Found [IP: 130.239.18.173 80]
Err http://ftp.no.debian.org jessie/main i386 Packages 404 Not Found [IP: 130.239.18.173 80]
W: Failed to fetch http://ftp.debian.org/debian/dists/jessie-updates/main/source/Sources 404 Not Found [IP: 130.89.148.12 80]
W: Failed to fetch http://ftp.debian.org/debian/dists/jessie-updates/main/binary-i386/Packages 404 Not Found [IP: 130.89.148.12 80]
W: Failed to fetch http://ftp.no.debian.org/debian/dists/jessie/main/source/Sources 404 Not Found [IP: 130.239.18.173 80]
W: Failed to fetch http://ftp.no.debian.org/debian/dists/jessie/main/binary-i386/Packages 404 Not Found [IP: 130.239.18.173 80]
W: Failed to fetch http://security.debian.org/dists/jessie/updates/main/source/Sources 404 Not Found [IP: 212.211.132.250 80]
W: Failed to fetch http://security.debian.org/dists/jessie/updates/main/binary-i386/Packages 404 Not Found [IP: 212.211.132.250 80]
E: Some index files failed to download. They have been ignored, or old ones used instead.

If you really want to see what happens here, you can go after it with Wireshark and look at the raw http calls. I won’t go into detail, but the problem is the partially downloaded package lists, which on Debian reside in /var/lib/apt/lists/partial/

# ll /var/lib/apt/lists/partial/*
-rw-r--r-- 1 root root 140K Aug 16 2016 /var/lib/apt/lists/partial/ftp.debian.org_debian_dists_jessie-updates_InRelease.reverify
-rw-r--r-- 1 root root 2.4K Jun 4 2016 /var/lib/apt/lists/partial/ftp.no.debian.org_debian_dists_jessie_Release.gpg.reverify
-rw-r--r-- 1 root root 62K Aug 15 2016 /var/lib/apt/lists/partial/security.debian.org_dists_jessie_updates_InRelease.reverify

(your file listing may, and probably will, differ)

Anyway, remove them by issuing rm /var/lib/apt/lists/partial/* and update again.
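For machines where this keeps happening, the cleanup can be wrapped in a small function. This is a sketch only; clear_partial_lists is a made-up name, and the directory defaults to the Debian path mentioned above.

```shell
#!/usr/bin/env bash
# clear_partial_lists [DIR]: remove the regular files in APT's partial list
# directory and report how many were removed.
clear_partial_lists() {
    local dir=${1:-/var/lib/apt/lists/partial} f n=0
    for f in "$dir"/*; do
        [[ -f $f ]] || continue          # also skips the unexpanded glob
        rm -f -- "$f" && n=$(( n + 1 ))
    done
    echo "removed $n file(s) from $dir"
}
```

Run as root, clear_partial_lists && apt-get update does the same thing as the manual steps.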

Note that this is but one of the errors that can cause this kind of a mess. Another one I’ve experienced is a somewhat overzealous firewall. If the above does not resolve a problem, check if you’re in any way filtering outbound traffic.

Installing Windows XP on the Dell Vostro 1700

This is basically the same procedure as installing Windows XP on the Fujitsu Siemens Esprimo u9200, in that it needs custom SATA drivers to find the hard disk during install, as well as a bunch of other ones.

For this install, I decided to actually create a driver disk, and use a real floppy drive. I use a very capable ESYNIC USB floppy drive, which I’d highly recommend. Mostly because it works, and reads all the disks I’ve tried. This should be something one could take for granted, but I’ve had two other USB floppy drives before this one and they’ve both had problems.

Anyway, here are the files to stuff onto a floppy disk for the Vostro 1700: f6floppy.zip

Select the ICH8M-E-M drivers during the install:

I couldn’t get the video drivers to install, as the nVidia installer would tell me it found no supported hardware. However, the installer leaves the driver files extracted in C:\dell\drivers\R190066, and pointing the device manager at this directory I was able to install those drivers anyway.


I got 99 updates, but a video driver ain’t one?

As this forum thread reveals, the Vostro 1700 was designed with Windows Vista in mind, so not all the drivers were released for Windows XP. The drivers for the Vostro 1500 can, however, fill in the gaps. I collected all the drivers I used in my install in this zip file: Vostro 1700 WinXP Drivers.zip

Oh, and finally a small rant…

What the heck did manufacturers at the time have against the Windows XP wireless interface?

They literally had to take this…

..and replace it with this..

..why would I want that?!

And of course they leave the default tool inoperable: