Thanks, Dominic. Well, laying claim to performance improvements of your code is at least three steps away for me, I guess...
Just received my Teensy 4.1. I will probably start with some simple timing experiments using embedded assembler in the Arduino environment -- with all the caching, pipelining and branch prediction in the M7, how quick can I actually react to an incoming clock edge? In parallel I will dig into the proper NXP documentation and IDE to figure out how to program this beast on a low level. Let's see how that goes; my memories of the first steps (and the secod and third steps...) with the STM32 IDE are not that fond.
I will certainly be back with more questions!