As a side note, this is all really good practical experience for me, as one of the things Scot (the author of Tali2) and I had discussed was making the build configurable so that certain components could be omitted or shuffled around. This is a good exercise in seeing what needs to be done to make that happen (and mostly it hasn't been that difficult). Our original expected audience was folks building 32K RAM/32K ROM single-board computers, which will hopefully explain some of the design decisions that were made (like just stuffing the input buffer at the top of RAM).
Your initial code looks good. If you are comfortable with this method, it is probably a good least-effort solution.
I will throw out one more thought: it might be possible (with a bit more work) to keep the header in bank 0, where the word would always be found, and have the word automatically switch to the bank holding its code before jumping there to complete execution. All of your words would be available (if you have space in the first 16K for all of their headers), and each would switch to its bank before running. You could technically even make them "put it back the way it was" as they return. The downside is that these words would always do the switch, so they would run a little more slowly even when you didn't need to be switching banks around.
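To make the control flow concrete, here is a rough toy model in Python. It is not Tali Forth 2 code and not 65C02 assembly; `BankedMachine`, `define`, `execute`, and `select_bank` are all made-up names for illustration, and the real thing would be a small assembly stub in each header. It only shows the order of operations: look up the header (always visible), remember the current bank, switch, run, and optionally switch back.

```python
# Toy model only: every name here is hypothetical, not part of Tali Forth 2.

class BankedMachine:
    def __init__(self, num_banks):
        self.current_bank = 0
        self.switch_count = 0          # how many real bank switches happened
        # One code table per bank: word name -> callable taking the machine
        self.banks = [{} for _ in range(num_banks)]
        # Headers stand in for the always-visible dictionary in bank 0:
        # word name -> number of the bank that holds the word's code
        self.headers = {}

    def select_bank(self, bank):
        # Only count a switch when the bank actually changes.
        if bank != self.current_bank:
            self.current_bank = bank
            self.switch_count += 1

    def define(self, name, bank, code):
        self.headers[name] = bank
        self.banks[bank][name] = code

    def execute(self, name, restore=True):
        target = self.headers[name]    # header is found regardless of bank
        previous = self.current_bank   # remember where we came from
        self.select_bank(target)       # switch to the bank holding the code
        try:
            return self.banks[self.current_bank][name](self)
        finally:
            if restore:                # "put it back the way it was"
                self.select_bank(previous)


machine = BankedMachine(num_banks=4)
machine.define("double-bank", 2, lambda m: m.current_bank * 2)
result = machine.execute("double-bank")
print(result, machine.current_bank, machine.switch_count)  # 4 0 2
```

The `switch_count` of 2 for a single call is the overhead mentioned above: one switch in and one switch back, paid on every call from another bank whether or not the caller cared which bank was selected.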