Forcing an ARRAY() in postgres subqueries

Quick note to self: when building a complicated query that compares against data in a subquery, the Postgres query planner will sometimes decide to perform the entire outer query without the comparison, cache the results, and only then run the inner query.

Example:

SELECT DISTINCT
    ...
FROM ...
INNER JOIN ... ON ...
WHERE
    ...
    AND ...
    AND ...id IN (
        SELECT user_id
        FROM ...
        WHERE ...
    )

Once this gets complicated enough, with large enough tables, the Postgres planner may decide to run the entire outer query without filtering on the id. We can discourage this by forcing the results of the inner query into an array, the equivalent of running the inner query ourselves and writing WHERE id = ANY(ARRAY[1335, 1336, 1337]), where Postgres will not hesitate to do the comparison on the fly, potentially excluding 99% of the dataset from ever being cached.

A potentially 50x faster example for this scenario, depending on your data and the mood of the query planner:

SELECT DISTINCT
    ...
FROM ...
INNER JOIN ... ON ...
WHERE
    ...
    AND ...
    AND ...id = ANY ( ARRAY (
        SELECT user_id
        FROM ...
        WHERE ...
    ) )

Go play with your local query planner.

TIKI-100 Documentation

Archiving some documentation for the Norwegian microcomputer TIKI-100 here, in case anyone needs it. I’ve sold mine, but Tommy Ovesen of Arctic Retro has graciously scanned the manuals I had for them.

They’re here as well, for redundancy:

Enjoy?

Allowing an empty JSON body, or no Content-Type, in a C# ASP.NET Core API

NOTE: In applications targeting net7.0+, Microsoft has fixed this issue: you can give your object parameter a default value of null (MyClass thingamagic = null) and it will work as expected.
I’m leaving the hack below in here for anyone stuck on net6 or earlier.
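
A minimal sketch of what that note describes, with hypothetical controller and model names:

[HttpPost]
public IActionResult CreateThing([FromBody] MyClass thingamagic = null)
{
    // thingamagic is null when the request had no body (or no Content-Type)
    return Ok(thingamagic is null ? "no body" : "body received");
}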

You may be here because you’re getting this:

{"type":"https://tools.ietf.org/html/rfc7231#section-6.5.13","title":"Unsupported Media Type","status":415,"traceId":"00-0d785f3d1cb4fd71bc36ee25561e4b48-6bc5df1f0d070024-00"}

Or this:

{"type":"https://tools.ietf.org/html/rfc7231#section-6.5.1","title":"One or more validation errors occurred.","status":400,"traceId":"00-8ab7d4ff2d97d5ec040c25058b8d6fff-64b7c8e21e11563b-00","errors":{"":["A non-empty request body is required."]}}

Or even this:

{"type":"https://tools.ietf.org/html/rfc7231#section-6.5.1","title":"One or more validation errors occurred.","status":400,"traceId":"00-2fec89d15f0fbc0a7deaf1dd96656e15-e7fabe160575f645-00","errors":{"$":["The input does not contain any JSON tokens. Expected the input to start with a valid JSON token, when isFinalBlock is true. Path: $ | LineNumber: 0 | BytePositionInLine: 0."]}}

You may have tried adding this:

[FromBody(EmptyBodyBehavior = EmptyBodyBehavior.Allow)]

And you’ve still been unable to accept requests with no Content-Type and a Content-Length of 0.

What you need is an attribute that overrides the missing Content-Type, provides an empty JSON body for the parser, and sets a matching Content-Length. It should only do this when the Content-Type is missing or already JSON, so you keep the rest of the validation.

Luckily, I’ve written one. So here you go. Leave a comment if this helps you?

[AttributeUsage(AttributeTargets.Method)]
public class AllowEmptyJsonBodyAttribute : Attribute, IResourceFilter
{
    private const string StreamOverride = "StreamOverride";

    public void OnResourceExecuting(ResourceExecutingContext context)
    {
        var request = context.HttpContext.Request;
        // Only act when the Content-Type is missing or already JSON, and the body is empty
        if ((!string.IsNullOrEmpty(request.ContentType) && !request.HasJsonContentType()) || (request.ContentLength ?? 0) != 0) return;

        request.ContentType = "application/json";
        context.HttpContext.Items[StreamOverride] = request.Body; // store the original stream
        var emptyPayload = Encoding.UTF8.GetBytes("{}");
        request.Body = new MemoryStream(emptyPayload); // replace the stream
        request.ContentLength = emptyPayload.Length;
    }

    public void OnResourceExecuted(ResourceExecutedContext context)
    {
        if (!context.HttpContext.Items.TryGetValue(StreamOverride, out var o) || o is not Stream s) return;

        var request = context.HttpContext.Request;
        request.Body.Dispose(); // this disposes our injected stream
        request.Body = s; // put the original back, so it can be cleaned up as usual
    }
}
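
Using it is just a matter of decorating the action; a hypothetical example:

[HttpPost]
[AllowEmptyJsonBody]
public IActionResult CreateThing([FromBody] MyClass thingamagic)
{
    // With an empty body, the attribute feeds the parser "{}" instead,
    // so thingamagic arrives as a default-initialized MyClass rather than a 400/415.
    return Ok(thingamagic);
}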

Moving columns in PowerQuery without naming every column

If you have pivoted data, you probably don’t want to name every column just to move one, as many of them will contain dynamic data. The solution is to work with the list of column names, moving the one you want into the desired position, as such:

= Table.ReorderColumns(PreviousStepName, List.InsertRange(List.RemoveItems(Table.ColumnNames(PreviousStepName), {"ColumnToMove"}), 4, {"ColumnToMove"}))

This moves the column “ColumnToMove” to position 4 by removing it from the list of columns from the previous step, and then re-inserting it at index 4 (list indexes are zero-based, so this is the fifth column).

This can also be used directly when merging two tables, putting the other table at a specific position:

= Table.ReorderColumns(Table.NestedJoin(PreviousStepName, {"LookupColumn"}, TargetTable, {"TargetLookupColumn"}, "Table", JoinKind.LeftOuter), List.InsertRange(List.RemoveItems(Table.ColumnNames(PreviousStepName), {"Table"}), 4, {"Table"}))

This joins the PreviousStepName step with the TargetTable (a step or a different query), by looking up the column “LookupColumn” in TargetTable[TargetLookupColumn], saving the join column as “Table” at position 4.

Expanding on this, you can even expand one or more columns from another table, directly into a given position:

= Table.ExpandTableColumn(Table.ReorderColumns(Table.NestedJoin(PreviousStepName, {"LookupColumn"}, TargetTable, {"TargetLookupColumn"}, "Table", JoinKind.LeftOuter), List.InsertRange(List.RemoveItems(Table.ColumnNames(PreviousStepName), {"Table"}), 4, {"Table"})), "Table", {"TargetTableColumnName"}, {"DesiredColumnName"})

The new items here are TargetTableColumnName, which must exist in the table you’re merging with, and the DesiredColumnName, which is the name you’ll end up with. This can, of course, be the same as the TargetTableColumnName if you want.

As a bonus, if the join returns several matches per lookup, you can expand the column as a List like this:

= Table.TransformColumns(PreviousStepName, {"Table", each _[Item]})

This targets the joined “Table” column we created in the previous query, extracting the matching “Item” column entries from TargetTable (see previous query) as a list, instead of multiplying your rows, as Table.ExpandTableColumn() would.

Have fun?

Unpivoting two columns in PostgreSQL

We recently needed to unpivot two columns of item ids so they’d come out as a single column with distinct values. Some of the entries had NULL as the parent id, and NULL should not appear in the output.

One could solve this with a simple UNION, but the view we were using was unfortunately rather expensive to run, so iterating it twice for the UNION was not an option.

So here goes:

SELECT DISTINCT
   UNNEST(
        CASE
           WHEN parent_item_id IS NULL THEN array[item_id]
           ELSE array[item_id, parent_item_id]
        END
   ) AS item_id
FROM rather_expensive_view
WHERE
   owner_id = 123456

In short, UNNEST unpivots arrays to rows, and we avoid printing NULL as an item_id when parent_item_id is NULL by not including it in the array built in the CASE expression. The result is a single pass over the view, and a pretty list of item_ids.

To visualize the operation:

We go from:
---
item_id    parent_item_id
10         NULL
11         4
12         3
---

To:
---
item_id
3
4
10
11
12
---

No NULL in the output :D

Adding additional columns to print along with the UNNEST is fine; they will simply be duplicated across both rows when an item has a parent. The output can, of course, also be joined with something like a sales table, to find sales for each item, including the parents.
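
For example, here’s a sketch of joining the unpivoted item_ids against a hypothetical sales table (assumed to have an item_id column) to count sales per item, parents included:

SELECT
   items.item_id,
   COUNT(*) AS sales_count
FROM (
   SELECT DISTINCT
      UNNEST(
           CASE
              WHEN parent_item_id IS NULL THEN array[item_id]
              ELSE array[item_id, parent_item_id]
           END
      ) AS item_id
   FROM rather_expensive_view
   WHERE
      owner_id = 123456
) AS items
INNER JOIN sales s ON s.item_id = items.item_id -- "sales" is a hypothetical table
GROUP BY items.item_id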

Recursive PostgreSQL user manager query for PowerBI path value, with loop handling

Recursive SQL queries. They’re hard to read, and even harder to wrap your head around. They’re handy when you have a self-referencing table (intermediate tables are allowed, but complicate things) that somehow represents a tree, such as the management tree of a corporation, where each user has a manager_id that is itself a user id.

There’s a complication however. Our data quality isn’t all that, and contains loops, where user 1 is managed by 2, managed by 3, managed by 4, managed by… 1? Yep.

There’s another complication – some managers are missing. Yes, that too.

For the output of this to be compatible with PowerBI, we want to return a pipe-separated path (id1|id2|id3|etc). This would be fine, except that the PowerBI PATH() function doesn’t support loops, and doesn’t like the idea of missing managers either. So we’ll have to do this ourselves, preferably in SQL.

Enough of my waffling. Here’s the query:

WITH RECURSIVE subordinates AS (
    SELECT
        u.id as user_id,
        u.manager_id,
        u.manager_id::text as "path"
    FROM
        companies c
    INNER JOIN companies_users cu ON c.id = cu.company_id
    INNER JOIN users u ON cu.user_id = u.id
    WHERE
        u.manager_id IS NOT NULL
    UNION
        SELECT
            s.user_id,
            m.manager_id,
            CONCAT(s.path, '|', m.manager_id) as "path"
        FROM
            users m
        INNER JOIN subordinates s ON
            s.manager_id = m.id
            AND m.manager_id IS NOT NULL
            AND m.manager_id != s.user_id
            AND CONCAT('|', s.path, '|') NOT LIKE CONCAT('%|', m.manager_id, '|%')
)
SELECT
    user_id,
    MAX(path) AS "path"
FROM
    subordinates
GROUP BY
    user_id

Long-winded explanation:

  1. This query starts out with any user that has a non-null manager_id. Any additional constraints, such as limiting the query to a single starting user_id, should go here (see the sketch after this list). This creates a row for each such user, with the manager_id as the base path.
  2. We union this result with users we find by joining against our own recursive query. This selects the manager(s), and the path is carried along. We only do this if we have more to add to the path, i.e. if the manager_id is not null and we don’t already have this manager in the path. This avoids loops, regardless of how deep they are.
    Note how paths are compared by adding a pipe to each end of each comparand, ensuring that if user 55 is in the path, user 5 will still be added: we’re looking for, for example, |5| in a string of |40|32|55|2|, instead of 5 in a string of 40|32|55|2, which would yield a false positive.
  3. Inside the union query, we concatenate the existing path from the parent user with the manager of the current one, thus building something akin to this, when restricted to a single starting user_id:

    As you can see, the user_id follows throughout this query, as an increasing number of managers get added to the path. Insert reference to Office Space about having too many bosses.
    To make the printout clearer, you can also print intermediate managers here, if you wish, by printing u.id as "intermediate" and m.id as "intermediate" in the source and union queries, respectively, giving you something like this:

    Here we see that user 30 has a manager of 50, 50 has a manager of 92, and so on. User 200 doesn’t have a manager, and thus the join constraints will stop the query at that point. The recursion guard would also make the query stop here if 200 had a manager of, say, 50, which is already in the path, or if it referred to the original user_id.
  4. Finally, we SELECT, from our recursive query, only the user_id and the MAX(path). Since every longer path for a user starts with the shorter one before it, the longest path also sorts as the greatest, so this gives you only the line, per user, with the complete path, thus completing our assignment, and giving us the following for our example user:

    If no restriction is put on the starting user_id, as in the full query above, you’d get one line for each user in your database that has a manager. If you want to include users with no manager, with a null or empty value for the path, make your changes as appropriate 🙂
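
For reference, here’s a sketch of the first part of the recursive query restricted to a single (hypothetical) starting user_id:

    SELECT
        u.id as user_id,
        u.manager_id,
        u.manager_id::text as "path"
    FROM
        companies c
    INNER JOIN companies_users cu ON c.id = cu.company_id
    INNER JOIN users u ON cu.user_id = u.id
    WHERE
        u.manager_id IS NOT NULL
        AND u.id = 12345 -- hypothetical starting user_id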

I hope this helps someone.

Copying data from a restored postgres database to the live database

The correct way to copy data is not to extract to a CSV and import it with some tool, nor to manually create INSERT queries, nor to try filtering the pg_dump output. You should use the COPY command included in Postgres, as such:

psql -c "COPY (SELECT * FROM my_table WHERE created_at > '2012-05-01') TO STDOUT;" source_db | psql -c "COPY my_table FROM STDIN;" target_db

Add psql parameters as necessary.

This does not replace the target table; it simply inserts what you’ve selected. If you want to replace the target, you need to TRUNCATE it first.
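
If that is what you want, here’s a sketch, assuming the same table and databases as above:

psql -c "TRUNCATE my_table;" target_db
psql -c "COPY (SELECT * FROM my_table WHERE created_at > '2012-05-01') TO STDOUT;" source_db | psql -c "COPY my_table FROM STDIN;" target_db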

AWS RDS IAM Login pitfalls and checklist

So, you’re struggling with your IAM role login?

psql: error: FATAL: password authentication failed for user "mydbuser"
psql: error: FATAL: PAM authentication failed for user "mydbuser"

This list is specifically for PostgreSQL, accessed by ECS containers and developers, but most of it would apply to any IAM compatible setup.

  1. First, check that you have allowed IAM logins to your database instance. Click “Modify” on your instance, and set it to “Password and IAM database authentication”:
  2. Make sure you have created a database user to log in as, using CREATE USER in your database. This is not your IAM user, and does not need to have the same name.
  3. Make sure the user is assigned the rds_iam role (GRANT rds_iam TO mydbuser). This should be visible when you do \du in psql, as “Member of” rds_iam
    • The user also needs access to the tables/databases you want them to access
    • A database user (again, not your IAM user) with the rds_iam role cannot log in using their assigned password, so if you’re migrating from database passwords to IAM and still have applications running, create a new user for this.
    • If the rds_iam role is not assigned, a failed login will give you “password authentication failed” instead of “PAM authentication failed”
  4. For ECS containers, create a new role for the tasks to run as. This role needs to allow the rds-db:connect action on your database and user, both of which can be specified as “any” (*) if you want.
    The AWS Console will help you with this:

    • You could test it out with arn:aws:rds-db:*:*:dbuser:*/* to allow everything, but that’s hardly recommended for production use.
    • The rds-db:connect action is not included in the default AmazonRDSFullAccess policy
    • Your role should be added to the ECS task definition as the “Task role”, not the “Task execution role”
    • A fully qualified ARN looks like this:
      "arn:aws:rds-db:eu-central-1:827164351102:dbuser:db-6QRASSC7DASSW7P551LXR666CU/mydbuser"
    • A completely open ARN looks like this:
      "arn:aws:rds-db:*:*:dbuser:*/*"
  5. For developers, you want the same policy as in the previous point, just applied to your user, or a group you’re in
  6. Make sure you are using SSL. For psql you can do this by setting PGSSLMODE=require (or verify-ca if you’ve downloaded the AWS certificate bundle)
  7. If your AWS account is part of an AWS Organizations setup, you need to add rds-db:* to the service control policy (SCP) of the organization unit that the account belongs to
  8. If you’re a developer using AWS MFA, you need an active, functional session to obtain valid database tokens. As of 2021-12-07, the aws rds generate-db-auth-token command is perfectly happy to return a token anyway; it just won’t work, and you’ll get “PAM authentication failed” once more
  9. When creating db-auth-tokens from the CLI, it’s also important to use the correct, and full, database endpoint.
    I.e. you can’t use production-db.mydomain.com, 88.51.256.88 (the DB IP address), or any other reference to it; the hostname needs to be what’s listed as the “Endpoint” in the AWS Console, something like production-db.cdracwecaau0.eu-central-1.rds.amazonaws.com
  10. The previous point goes for ECS containers as well: they need to generate auth tokens for the “Endpoint” of the database, not anything else, so check your connection string. Also check it for SSL.
  11. Access tokens are only valid for 15 minutes by default
  12. When your container is correctly set up and running, you can run an interactive bash shell in it and check that a valid AccessKey is created by running
    curl 169.254.170.2$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI
    This is what the AWS libraries will use to obtain the credentials, so it’s good to know they’re working. The result will also return the name of the role you assigned to the container.
  13. A one-liner for testing your DB connection, inside a container (you’d have to install the relevant tools) or at a console:
    DBENDPOINT="production-db.cdracwecaau0.eu-central-1.rds.amazonaws.com" DBPORT=5432 DBREGION="eu-central-1" DBUSER="mydbuser" DBNAME="business_app"; PGPASSWORD=$(aws rds generate-db-auth-token --region "$DBREGION" --hostname "$DBENDPOINT" --port "$DBPORT" --username "$DBUSER") PGSSLMODE=require psql -h "$DBENDPOINT" -U "$DBUSER" -d "$DBNAME" -p "$DBPORT" <<<"SELECT 'SUCCESS' AS OK"
  14. For C#
    • You can generate credentials based on the environment variables provided by AWS by doing Amazon.Runtime.FallbackCredentialsFactory.GetCredentials(false)
    • You can generate a DB AuthToken directly using Amazon.RDS.Util.RDSAuthTokenGenerator.GenerateAuthToken(host, port, username)
    • You use the DB AuthToken as the password in your connection string
    • The System.Data.Common.DbConnectionStringBuilder class is very useful here, allowing you to easily parse a connection string, replace the password and generate an updated string without having to care about formatting, escaping and the like (see the sketch after this list)
    • Remember to add SSL mode:
      connectionStringBuilder.Add("SSL Mode", "Require");
      connectionStringBuilder.Add("Trust Server Certificate", "true");
    • Also, see this excellent blog post by Steve Gordon (archive) on how the AWS libraries go about obtaining your AccessKey
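
Putting the C# points together, a minimal sketch (the endpoint, database and connection string values are illustrative, not from a real setup):

using Amazon.RDS.Util;
using System.Data.Common;

var host = "production-db.cdracwecaau0.eu-central-1.rds.amazonaws.com";
var port = 5432;
var user = "mydbuser";

// Generate a short-lived IAM auth token to use as the database password
var token = RDSAuthTokenGenerator.GenerateAuthToken(host, port, user);

// Rewrite an existing connection string without caring about formatting or escaping
var connectionStringBuilder = new DbConnectionStringBuilder
{
    ConnectionString = $"Host={host};Port={port};Username={user};Database=business_app"
};
connectionStringBuilder["Password"] = token;
connectionStringBuilder.Add("SSL Mode", "Require");
connectionStringBuilder.Add("Trust Server Certificate", "true");

var connectionString = connectionStringBuilder.ConnectionString; // hand this to e.g. Npgsql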

Allowing RDP remote passwords in Windows 10 in 2021

If you use saved passwords in your .rdp files on Windows, or from the command line on Linux, you may find yourself being prompted for a password anyway, after the latest updates. The option to allow this still exists, however, and can be found by running gpedit.msc (from the Run box, Win+R) and heading over to Local Computer Policy => Computer Configuration => Administrative Templates => Windows Components => Remote Desktop Services => Remote Desktop Session Host => Security.

What you’ll want to do is set “Always prompt for a password upon connection” to “Disabled”, as such:
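
If you prefer to script it, the same policy is, as far as I recall, backed by the fPromptForPassword registry value, so something like this sketch should have the same effect:

# Assumption: "Always prompt for password upon connection" maps to fPromptForPassword
$key = "HKLM:\SOFTWARE\Policies\Microsoft\Windows NT\Terminal Services"
New-Item -Path $key -Force | Out-Null
Set-ItemProperty -Path $key -Name "fPromptForPassword" -Value 0 -Type DWord  # 0 = Disabled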

Happy remoting.

DNS names for VirtualBox client machines

Following up on my previous post on the same topic, for Hyper-V, this is how to get DNS names for your VirtualBox guest VMs into your hosts file on a Windows host.

First things first: you need the VirtualBox Guest Additions installed for the host to see OS-level information about the guest, just as with Hyper-V. Here’s a guide on how to install those on Debian (archive here). With the additions in place, PowerShell on the host machine has access to the IP addresses of the client machine. Well, not technically PowerShell in this case, but it can be used to run VBoxManage.exe and process its output.

Without the Guest Additions, VBoxManage.exe will not show any IP addresses for the client machine.

This is the script from the Hyper-V post, modified to do the same for VirtualBox, with additional support for autostarting virtual machines in headless mode, as I found this lacking in VirtualBox by default. One of my VMs, Sarge, is used for example purposes. Replace it with your own VM names.

$ErrorActionPreference = "Stop"
$hosts = "$($env:windir)\System32\drivers\etc\hosts"

Start-Transcript -Path "C:\custom\vm_hosts.log" -Append

$autostart = @(
    'Sarge'
)

Write-Host -ForegroundColor Yellow "Autostarting machines"
$autostart | ForEach-Object `
{
    $name = $_
    try { & "C:\Program Files\Oracle\VirtualBox\VBoxManage.exe" startvm "$name" --type headless }
    catch {}
}

while ($true)
{
    Write-Host -ForegroundColor Yellow "Scanning for Virtual Machines"
    $content = Get-Content $hosts

    [array] $vms = & 'C:\Program Files\Oracle\VirtualBox\VBoxManage.exe' list runningvms
    $vmListRegex = '^\s*"([^"]+)"'
    $vms = $vms | Where-Object { $_ -match $vmListRegex }
    if ($vms)
    {
        $vms | ForEach-Object `
        {
            $vm = $_
            $name = [regex]::Match($vm, '^\s*"([^"]+)"').Captures.Groups[1].Value
            Write-Host -ForegroundColor Magenta "VM: $($name)"
            [array] $ips = & 'C:\Program Files\Oracle\VirtualBox\VBoxManage.exe' guestproperty get "$name" "/VirtualBox/GuestInfo/Net/0/V4/IP"
            $ip = [regex]::Match(($ips | Select -First 1), ': (\S+)').Captures.Groups[1].Value
            if ($ip)
            {
                Write-Host -ForegroundColor Green "IP: $($ip)"
                $hostsRegex = '^\s*\S+\s+' + $name.ToLower() + '$'
                $matches = $content | Select-String -Pattern $hostsRegex | ForEach-Object { $_.Line } | Out-String
                $wanted = $ip | ForEach-Object { $_ + ' ' + $name.ToLower() } | Out-String
                if ($matches -ne $wanted)
                {
                    $content = ($content | Select-String -Pattern $hostsRegex -NotMatch | ForEach-Object { $_.Line } | Out-String).Trim()
                    $content += "`n" + $wanted
                }
            }

            $content | Set-Content "${hosts}.tmp"
            if ((Get-FileHash "$hosts").Hash -ne (Get-FileHash "${hosts}.tmp").Hash)
            {
                Move-Item -Path "${hosts}.tmp" -Destination "$hosts" -Force
                Write-Host -ForegroundColor Yellow "Hosts file updated!"
            }
        }
    }

    Write-Host -ForegroundColor DarkGray "Sleeping..."
    Start-Sleep -Seconds 60
    Write-Host
}

I run it as a Scheduled Task, same as in the Hyper-V post.
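
For completeness, a sketch of registering such a task, assuming the script is saved as C:\custom\vm_hosts.ps1 (it needs to run elevated to be allowed to write to the hosts file):

$action  = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-ExecutionPolicy Bypass -File C:\custom\vm_hosts.ps1"
$trigger = New-ScheduledTaskTrigger -AtLogOn
Register-ScheduledTask -TaskName "VirtualBox VM hosts" -Action $action -Trigger $trigger -RunLevel Highest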

Now I can easily connect to my virtual machines whenever I need to, without knowing their IP address.