I haven't implemented anything like this either, but here are some thoughts.
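1. You might be able to do without the register architecture. ZP,X addressing is just as fast as absolute addressing (for example, LDA zp,X and LDA abs both take 4 cycles), and it is one byte shorter. So, depending on how much stack space you need, it may be just as simple to use zero page as a stack indexed by X.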
2. The fastest practical token interpreter I've been able to come up with (untested; "practical" meaning only the accumulator is overwritten) takes only 22 cycles (23 on the 65C02), including the JMP to the "interpret next token" routine. The fastest routine I could come up with at all took 17 cycles on a 65C02, but it overwrote X and required Y to be preserved. Normally, not everything needs to run at top speed, so it might be better not to compile everything into machine language (ML). For example, you could use special tokens to switch between interpreting tokens and running code compiled to ML.
3. My philosophy is: if recursion is ugly, so be it. Almost every recursive routine I have ever written was homework.
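To make points 1 and 2 a little more concrete, here is a rough model in Python. It is not 6502 code and says nothing about cycle counts; it only sketches the structure: a data stack kept in a 256-byte "zero page" and indexed by an X-like pointer (growing downward, two bytes per 16-bit cell), plus a fetch/dispatch loop over tokens, with one special token that hands control to a "native" routine standing in for code compiled to ML. The token numbers, the `$80` stack top, the little-endian cell layout, and the `NATIVE` token are all hypothetical choices for illustration.

```python
# Rough model only (not 6502 code): token interpreter with a zero-page data stack.
STACK_TOP = 0x80  # hypothetical: data stack lives in zero page below $80

class Machine:
    """Models 256 bytes of zero page plus an X register used as a stack pointer."""

    def __init__(self, native=()):
        self.zp = bytearray(256)
        self.x = STACK_TOP           # index of the top-of-stack cell (empty stack)
        self.native = list(native)   # hypothetical stand-ins for routines compiled to ML

    def push(self, value):
        self.x = (self.x - 2) & 0xFF                        # like DEX, DEX on a 6502
        self.zp[self.x] = value & 0xFF                      # like STA 0,X (low byte)
        self.zp[(self.x + 1) & 0xFF] = (value >> 8) & 0xFF  # like STA 1,X (high byte)

    def pop(self):
        value = self.zp[self.x] | (self.zp[(self.x + 1) & 0xFF] << 8)
        self.x = (self.x + 2) & 0xFF                        # like INX, INX
        return value

# Hypothetical token numbering; NATIVE is the special "run compiled code" token.
LIT, ADD, DUP, NATIVE, HALT = range(5)

# Each handler receives the index just past its token and returns the next one.
def op_lit(m, code, ip):
    m.push(code[ip] | (code[ip + 1] << 8))  # inline 16-bit literal, low byte first
    return ip + 2

def op_add(m, code, ip):
    b, a = m.pop(), m.pop()
    m.push((a + b) & 0xFFFF)
    return ip

def op_dup(m, code, ip):
    value = m.pop()
    m.push(value)
    m.push(value)
    return ip

def op_native(m, code, ip):
    m.native[code[ip]](m)  # operand byte selects which "compiled" routine to run
    return ip + 1

def op_halt(m, code, ip):
    return None  # stop interpreting

DISPATCH = [op_lit, op_add, op_dup, op_native, op_halt]

def run(code, native=()):
    """Fetch a token, dispatch through the table, repeat until HALT."""
    m = Machine(native)
    ip = 0
    while ip is not None:
        ip = DISPATCH[code[ip]](m, code, ip + 1)
    return m

def double_native(m):
    # Stand-in for a routine compiled to ML: doubles the top of stack.
    m.push((m.pop() * 2) & 0xFFFF)

prog = bytes([LIT, 0x34, 0x12, DUP, ADD, NATIVE, 0, HALT])
m = run(prog, [double_native])
print(hex(m.pop()))  # prints 0x48d0
```

The if-free table lookup is meant to echo the idea of jumping through a table of token handlers; how fast that is on real hardware depends entirely on the 6502 routine, which this model does not attempt to reproduce.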