I've used loads of self-modifying code in my time, mostly on the BBC Micro, where the code runs out of RAM.
One example was a sprite routine that ran in zero page and self-modified plenty of its own operands (the self-modified addresses being in zero page as well, saving another cycle). Here's a snippet of the innermost bit of the loop:
Code:
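innerloop:
read:
        LDA $FFFF,Y      ; sprite data; operand (read+1) is self-modified
        STA maskindex+1  ; sprite byte becomes the low byte of the table address
maskindex:
        LDA masktable    ; page-aligned table, so the low byte acts as the index
write:
        AND $FFFF,Y      ; mask the screen byte; operand (write+1) is self-modified
        ORA maskindex+1  ; OR the sprite byte back in (zero-page read)
        STA (write+1),Y  ; write+1 is in zero page, so it doubles as a pointer
        INY
        ; etc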
That reads a byte of sprite data (from the address held at read+1), looks up its mask in a pre-prepared table, masks the screen background, ORs the sprite data over the masked screen byte, and writes the result back. I figure that the self-modification there saves 4 cycles per byte, which for this particular game saved up to 4900 cycles per frame from that alone, plus other more modest savings elsewhere - not bad!
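For comparison, here is a rough sketch of what the same inner loop might look like without self-modification, with the standard 6502 cycle counts in the comments. It is an assumed baseline, not something taken from the original routine: src, dst and tmp are hypothetical zero-page locations, and X is assumed to be free for use as the table index.
Code:
innerloop:
        LDA (src),Y      ; 5  sprite byte via a zero-page pointer (src is hypothetical)
        STA tmp          ; 3  keep it for the ORA below (tmp is hypothetical)
        TAX              ; 2  use it as the table index
        LDA masktable,X  ; 4  page-aligned table, so no page-cross penalty
        AND (dst),Y      ; 5  mask the screen byte (dst is hypothetical)
        ORA tmp          ; 3  OR the sprite byte back in
        STA (dst),Y      ; 6  write the result to the screen
        INY              ; 2
        ; rest of the loop omitted
On those assumptions this comes to 30 cycles per byte, against 26 for the self-modifying loop (4+3+4+4+3+6+2), ignoring the extra cycle for page crossings on the indexed reads, which can hit either version. That matches the 4-cycle figure, though a different baseline would give a different number.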
I also remember using self-modification to change the branch destination on entry to an IRQ service routine, so that consecutive timer interrupts would go to different places in the code without needing an explicit check.
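A minimal sketch of that idea, assuming two handlers that simply alternate and a JMP whose operand gets patched. All the labels (irq_entry, dispatch, handler_a, handler_b) are made up for illustration, the #< and #> low/high byte operators are ca65-style syntax, and acknowledging the timer, saving X and Y, and chaining to the OS are all left out. The original may well have patched a relative branch offset instead, which only needs a single byte written.
Code:
irq_entry:
        PHA                  ; save A, since the handlers below clobber it
dispatch:
        JMP handler_a        ; operand (dispatch+1, dispatch+2) is self-modified

handler_a:
        ; ... work for the first timer interrupt goes here ...
        LDA #<handler_b      ; point the JMP at handler_b for next time
        STA dispatch+1
        LDA #>handler_b
        STA dispatch+2
        PLA                  ; restore A
        RTI

handler_b:
        ; ... work for the second timer interrupt goes here ...
        LDA #<handler_a      ; and back to handler_a for the one after
        STA dispatch+1
        LDA #>handler_a
        STA dispatch+2
        PLA                  ; restore A
        RTI
Each interrupt then costs one 3-cycle JMP to reach the right handler, rather than loading and testing a state variable first.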