It would be great if you could add the modelling of N, V and Z to your test suite - perhaps I could help? We can validate the new test suite on a real 6502, and then I can check the various emulators.
Unless I am misunderstanding you here, the code in Appendix B already tests everything: ADC and SBC, valid and invalid BCD values, the accumulator, and the N, V, Z, and C flags. It's basically just an implementation of the formulas from Appendix A (which, despite its title, actually covers everything, not just the undocumented cases).
You'd have to modify the Appendix B program to test the 65C02 or (8-bit) 65816. You'd also have to modify it (specifically the COMPARE routine) to check only valid BCD numbers (steps 1-6) or remove tests (steps 7 and 8).
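For anyone driving the tests from a host-side script rather than from the 6502 program itself, here is a minimal sketch in Python of a filter for valid BCD operands. It assumes "valid BCD" means that each nibble of the byte is in the range 0-9; the function name is hypothetical and is not part of the Appendix B code.

```python
def is_valid_bcd(value):
    """Return True if both nibbles of an 8-bit value are decimal digits (0-9).

    Hypothetical helper for illustration; not taken from the Appendix B program.
    """
    low_nibble = value & 0x0F
    high_nibble = (value >> 4) & 0x0F
    return low_nibble <= 9 and high_nibble <= 9


# Collect every byte that is a valid BCD operand ($00-$99 with decimal digits only).
valid_operands = [v for v in range(256) if is_valid_bcd(v)]
```

A test loop could then skip any operand pair where either byte fails this check.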
Looking at it now, the documentation is rather skimpy -- probably a combination of (a) being close to the end and being somewhat sick and tired of writing about decimal mode, (b) adding comments to code that was written long before the article -- the code was written a short time after the V flag article was written, in fact, and the actual source file is still uncommented to this day -- and (c) what I will call the Hemingway method of writing

probably didn't help.
- AR is the correct value of the accumulator calculated using binary arithmetic
- NF, VF, ZF, and CF are the correct values of the N, V, Z, and C flags calculated using binary arithmetic. Each flag is in its usual position in the P register (e.g. V is in bit 6 of VF), but the values are not masked off, so the other bits of VF can be whatever, since they are ignored
- DA is the value of the accumulator that the processor/simulator produced in decimal mode
- DNVZC is the value of the flags (all of them, including N, V, Z, and C) that the processor/simulator produced in decimal mode.
Then COMPARE just compares the first group (AR, NF, VF, ZF, and CF) to the second group (DA and DNVZC). The idea is that it's fairly easy to modify COMPARE so that it doesn't check things you don't care about (for example, there isn't any benefit that I know of for a simulator to replicate the V flag exactly). The first three instructions, up to and including the first BNE C1, check the accumulator; the next four instructions, up to and including the second BNE C1, check the N flag; and so on. If you care about everything, you don't have to modify COMPARE at all.
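To make the comparison concrete, here is a rough Python sketch of the same logic (not the 6502 source). The variable names AR, NF, VF, ZF, CF, DA, and DNVZC follow the list above; the function name and the `check` parameter are hypothetical, added only to show how individual checks could be dropped. The masks use the standard 6502 P register layout (N in bit 7, V in bit 6, Z in bit 1, C in bit 0).

```python
# Flag positions in the 6502 P register.
N_MASK = 0x80
V_MASK = 0x40
Z_MASK = 0x02
C_MASK = 0x01


def compare(ar, nf, vf, zf, cf, da, dnvzc, check=("A", "N", "V", "Z", "C")):
    """Return True if the decimal-mode results match the predicted ones.

    `check` (a hypothetical parameter) names the items to compare; leave an
    item out to skip that check, e.g. omit "V" to ignore the V flag.
    """
    # Accumulator: the predicted value must equal the observed value exactly.
    if "A" in check and da != ar:
        return False
    # Flags: only the flag's own bit matters, so the other bits of the
    # predicted byte are masked away before comparing.
    for name, predicted, mask in (
        ("N", nf, N_MASK),
        ("V", vf, V_MASK),
        ("Z", zf, Z_MASK),
        ("C", cf, C_MASK),
    ):
        if name in check and (dnvzc ^ predicted) & mask:
            return False
    return True


# A V-flag mismatch fails the full comparison but passes once "V" is skipped.
full = compare(0x99, 0x80, 0x40, 0x00, 0x00, 0x99, 0xA8)
no_v = compare(0x99, 0x80, 0x40, 0x00, 0x00, 0x99, 0xA8,
               check=("A", "N", "Z", "C"))
```

Dropping an entry from `check` here plays the same role as deleting the matching group of instructions from COMPARE.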