I think it's easier to work backwards from the state the subroutine needs the processor to be in to work correctly.
All the additional instructions and addressing modes work in emulation mode, so you don't usually need to change the E bit unless the fixed stack location or interrupt behaviour is important. If you do need to change state then start the function with CLC (or SEC)/XCE/PHP and finish with PLP/XCE before exiting. XCE swaps C and E, so the PHP captures the caller's E in the saved carry and the final PLP/XCE puts it back.
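As a rough sketch of just the mode-change wrapper, here is a routine that needs native mode. The label NativeOnly is made up for illustration and the body is left as a placeholder.
Code:
NativeOnly:             ; hypothetical routine name
        CLC             ; C = 0 requests native mode
        XCE             ; E = 0, C = caller's E
        PHP             ; Saved carry bit now holds caller's E
        .. do something that needs native mode
        PLP             ; C = caller's E again
        XCE             ; E = caller's E
        RTS

Note that the final XCE leaves the routine's own E (0 here) in the carry, so the caller's carry flag is not preserved by this wrapper.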
Changes of index register size (8/16-bit) are tricky because X/Y get truncated when they get smaller, so you need to PHX/PHY and save the current size with a PHP before the X flag is changed.
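A minimal sketch of that save/restore, assuming the routine is called in native mode and wants 8-bit index registers. The label ShortIndex is made up for illustration.
Code:
ShortIndex:             ; hypothetical routine name
        PHX             ; Pushes 1 or 2 bytes, depending on caller's X flag
        PHY
        PHP             ; P records how wide those pushes were
        SEP #$10        ; X flag = 1: 8-bit X/Y, high bytes are zeroed
        .. do something with 8-bit X/Y
        PLP             ; Caller's X flag back, so the pulls match the pushes
        PLY
        PLX
        RTS

The PLP has to come before PLY/PLX, otherwise the pulls could use a different width from the pushes and unbalance the stack.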
If the accumulator is in the 8-bit state then it may be necessary to save B as well unless the routine can guarantee not to modify it.
Putting it all together you get:
Code:
SomeFunction:
    if mode change needed
      if requires native mode
        CLC
      else
        SEC
      endif
        XCE             ; Swap C and E, so C = caller's E
    endif
        PHX             ; Save X/Y in caller's size
        PHY
        PHP             ; Save register sizes (and C, caller's E after XCE)
        SEP #$xx        ; Set required M/X state
        REP #$xx        ; .. omit instruction if #$00
    if 16-bit A
        PHA             ; Save A/B
    else
        PHA             ; Save A
        XBA
        PHA             ; .. and B
        XBA             ; .. then put A/B back in place
    endif
        .. do something
    if 16-bit A
        PLA             ; Restore A/B
    else
        PLA             ; Pull saved B
        XBA             ; .. and move it into B
        PLA             ; Restore A
    endif
        PLP             ; Recover caller's M/X (and E)
        PLY             ; Restore X/Y
        PLX
    if mode change needed
        XCE
    endif
        RTS
You don't always need to save everything, of course; some registers may contain results on exit. But a universal routine will be restricted to 8-bit values in A/B/X/Y.