enso wrote:
I assume you've seen Michael's core,
http://forum.6502.org/viewtopic.php?f=10&t=2997, which has new Forth instructions...
I was not aware of it. I skimmed the thread. He is supporting DTC. That is somewhat interesting, but I was assuming STC. Most FPGAs have abundant memory, both RAM and non-volatile, built in --- there is no need to use DTC or ITC or token-threading or anything else for the purpose of reducing code size --- that was done in the 1970s when an 8KB EPROM was often all you had for non-volatile memory.
STC isn't such a memory hog as is often assumed. A JSR is only 3 bytes, compared to 2 bytes for DTC or ITC, so there isn't all that much difference. Of course, a lot of code will be inlined for speed (that is pretty much the point of using STC). My suggestions for an improved 65c02 involved making common code sequences into instructions so they use less memory when inlined. For example, DNX instead of INX INX and DDX instead of DEX DEX will halve the memory usage there, and these are very common. The (zp,X),Y addressing mode is primarily to allow @ and ! to have significantly reduced code size and improved speed.
Here is one good optimization: The producers (functions that push data onto the data-stack) end with DDX, so they are actually working with memory under that stack prior to the DDX covering up their data. The consumers (functions that pull data from the data-stack) start with DNX, so they are actually working with memory under that stack after the DNX has dropped it. The point of this is that, if you inline a producer followed by a consumer (very common), the DDX and DNX are adjacent and so the peephole-optimizer can remove both instructions (saving 2 bytes). If a producer is inlined and is followed by a JSR to a consumer, then the DDX at the end is not compiled (saving 1 byte) and the JSR to the consumer goes to 1+ the consumer's address so the DNX at the start doesn't execute.
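To make the JSR case concrete, here is a rough Python sketch of what the compiler might do when an inlined producer is followed by a call to a consumer. The function name call_consumer and the label STORE are made up for illustration, and it assumes that DNX is a 1-byte opcode and that every consumer really does start with DNX.

```python
def call_consumer(producer_ops, consumer_label):
    """Append a call to a consumer after an inlined producer.

    Assumption: the consumer begins with a 1-byte DNX, so entering at
    label+1 skips it. DDX and DNX are the proposed instructions, not
    real 65c02 opcodes.
    """
    if producer_ops and producer_ops[-1] == "DDX":
        # Don't compile the trailing DDX; skip the consumer's leading DNX.
        return producer_ops[:-1] + ["JSR " + consumer_label + "+1"]
    # Not a producer ending in DDX: ordinary call to the normal entry point.
    return producer_ops + ["JSR " + consumer_label]


producer = ["LDA #lo", "LDY #hi", "SAY", "DDX"]   # hypothetical literal push
compiled = call_consumer(producer, "STORE")        # STORE is a made-up consumer
```

The compiled sequence is one byte shorter (no DDX) and also skips executing the DNX at run time.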
Also, the producer ends with DDX and it is assumed that it stored tos to the data-stack underneath prior to the DDX using SAY. Let's assume that the high-byte was in Y and the low-byte was in A. The consumer starts with DNX and it is assumed that it loads tos from the data-stack underneath using LAY. This means that, once the DDX and DNX have been removed, SAY and LAY are adjacent and the peephole-optimizer can get rid of those too (saving 2 bytes).
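The two cancellations for the fully inlined case can be sketched as a single stack-based pass. This is only an illustration in Python, not a real assembler: instructions are plain strings, the producer and consumer bodies are invented, and it assumes nothing branches into the middle of the sequence.

```python
# Adjacent pairs that cancel under the assumptions described above.
# DDX, DNX, SAY and LAY are the proposed instructions.
CANCELLING_PAIRS = {("DDX", "DNX"), ("SAY", "LAY")}


def peephole(ops):
    """Remove adjacent cancelling pairs.

    Using the output list as a stack means that removing DDX/DNX
    automatically exposes the SAY/LAY pair to the same pass.
    """
    out = []
    for op in ops:
        if out and (out[-1], op) in CANCELLING_PAIRS:
            out.pop()            # drop the first half; the second is not emitted
        else:
            out.append(op)
    return out


producer = ["LDA #lo", "LDY #hi", "SAY", "DDX"]     # hypothetical literal push
consumer = ["DNX", "LAY", "STA zp_lo", "STY zp_hi"]  # hypothetical consumer body
optimized = peephole(producer + consumer)
```

Four 1-byte instructions disappear here, which matches the 2 + 2 bytes counted above.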
Note that the optimizations described above preclude ISRs from being written in Forth. This is because the data under the stack would get clobbered by the ISR using the stack. One solution here is for ISRs to have their own data-stack.
Here is another instruction that would be useful, which I didn't mention previously: TAY (test the Y:A 16-bit value and set the N and Z flags appropriately). 0BRANCH could then be this:
DNX
LAY
TAY
BEQ xxx
That is only 5 bytes, which isn't much. Also, given the peephole-optimization described above, the DNX and LAY can be discarded and so we are down to 3 bytes.
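The byte counts can be checked with a few lines of Python. The sizes of DNX, LAY and TAY are assumptions (1 byte each, as implied above); BEQ with its relative offset is 2 bytes on a real 65c02.

```python
# Assumed encoded sizes in bytes. TAY here means the proposed 16-bit test
# of Y:A, not the standard 6502 "transfer A to Y" instruction.
SIZE = {"DNX": 1, "LAY": 1, "TAY": 1, "BEQ": 2}


def code_size(ops):
    """Sum the encoded size of a list of 'MNEMONIC operand' strings."""
    return sum(SIZE[op.split()[0]] for op in ops)


zero_branch = ["DNX", "LAY", "TAY", "BEQ xxx"]
full_size = code_size(zero_branch)           # stand-alone 0BRANCH
reduced_size = code_size(zero_branch[2:])    # DNX and LAY removed by the peephole pass
```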
Another good optimization is that a JSR followed by an RTS can be converted into a JMP and the RTS not compiled (saving 1 byte). This also has the advantage of reducing return-stack usage --- a tail-recursive function won't overflow the return-stack.
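A minimal sketch of the JSR/RTS rewrite, again in Python with instructions as strings. It assumes no branch targets the RTS itself; a real compiler would have to check that before dropping it. The word names are invented.

```python
def tail_call(ops):
    """Rewrite 'JSR target' followed by 'RTS' into 'JMP target'.

    Assumption: no label points at the RTS being removed.
    """
    out = []
    for op in ops:
        if op == "RTS" and out and out[-1].startswith("JSR "):
            out[-1] = "JMP " + out[-1][4:]   # reuse the callee's own RTS
        else:
            out.append(op)
    return out


# Hypothetical word that ends by calling itself.
countdown = ["JSR ONE_MINUS", "JSR COUNTDOWN", "RTS"]
rewritten = tail_call(countdown)
```

JSR and JMP absolute are both 3 bytes, so the only size change is the missing 1-byte RTS, and the recursive call no longer pushes a return address.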
All of these optimizations not only reduce code size, but also boost the speed.
enso wrote:
How do you envision cores communicating with each other?
A common-memory area --- maybe 16KB of RAM --- large enough to hold buffers for all the I/O ports.
Typically in a dual-core system, one processor will handle all of the I/O. The other processor has the main program running on it.
In a multi-core system, you could have each processor handling one I/O port, and then one processor that has the main program.
It would be similar to the Parallax Propeller --- except with a 65c02 variant as each processor.
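One possible way to lay out a port buffer in the common-memory area is a ring buffer with one writer and one reader. This is purely a sketch of my own assumption, not a worked-out design: the layout (head byte, tail byte, then data), the class name PortBuffer and the sizes are all invented, and Python stands in for what would be 65c02 code on each core.

```python
COMMON_RAM = bytearray(16 * 1024)   # stands in for the 16KB common-memory area


class PortBuffer:
    """Single-writer, single-reader ring buffer inside the shared RAM.

    Hypothetical layout: byte 0 = head index (written only by the producer
    core), byte 1 = tail index (written only by the consumer core), then
    `size` data bytes. One slot stays empty to tell full from empty.
    """

    def __init__(self, ram, base, size):
        assert 2 <= size <= 256      # indices must fit in one byte
        self.ram, self.base, self.size = ram, base, size

    def put(self, byte):
        head, tail = self.ram[self.base], self.ram[self.base + 1]
        nxt = (head + 1) % self.size
        if nxt == tail:
            return False             # buffer full
        self.ram[self.base + 2 + head] = byte
        self.ram[self.base] = nxt    # publish only after the data is written
        return True

    def get(self):
        head, tail = self.ram[self.base], self.ram[self.base + 1]
        if head == tail:
            return None              # buffer empty
        byte = self.ram[self.base + 2 + tail]
        self.ram[self.base + 1] = (tail + 1) % self.size
        return byte


port = PortBuffer(COMMON_RAM, base=0, size=8)
port.put(0x41)
port.put(0x42)
received = [port.get(), port.get(), port.get()]
```

Because each index is written by only one side, the I/O core and the main-program core would not need a lock for this particular scheme.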