24-bit CFA ?

Rob Finch · Post by **Rob Finch** » Fri Nov 28, 2014 12:06 am

I was just wondering if Forth had been implemented with 24 bit code addresses (24 bit words?) such as are available on the '816.

GARTHWILSON · Post by **GARTHWILSON** » Fri Nov 28, 2014 3:36 am

I learned Forth on my HP-71 with 20-bit addresses and cells. I'm sure just about everything has been done.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Fri Nov 28, 2014 5:21 am

Rob Finch wrote:

I was just wondering if Forth had been implemented with 24 bit code addresses (24 bit words?) such as are available on the '816.

32 bit code is quite a bit more efficient on the 65C816 than 24 bit when working with integers. If integers are processed as words rather than bytes the accumulator can be left in "wide" mode almost all the time, which actually makes the code execute faster. Otherwise, the accumulator has to be changed to "narrow" mode to process bits 16-23 of a 24 bit number. The constant REPs and SEPs in a number processing loop can start gobbling up clock cycles at a rapid rate.

It should be noted that all implied register operations, e.g., INC A or DEX, consume the same number of clock cycles regardless of register width. This is because the 65C816's ALU always processes words, regardless of the actual register width. Also, there's only a one cycle penalty for using instructions that involve 16 bit memory accesses, the extra cycle being expended on the MSB fetch and/or store step in the instruction. The performance improvement is especially dramatic on R-M-W instructions: INC SOMEWHERE when the accumulator is set to 16 bits executes about 250 percent faster than INC SOMEWHERE -- BNE NEXT -- INC SOMEWHERE+1 with the accumulator set to 8 bits, plus the code is more succinct.

When I rewrote the firmware in POC V1 to use 16 bit operations I saw a noticeable improvement in performance, and the code size actually shrank about 15 percent overall. The reclaimed ROM space didn't go to waste: it got used for the SCSI drivers.

barrym95838 · Post by **barrym95838** » Fri Nov 28, 2014 5:45 am

I am not a proficient '816 assembly programmer like BDD, but I almost immediately thought of the same issue. The space saved by using 24-bit addresses vs. 32-bit addresses with the most-significant 8-bits as "don't cares" (or some possibly useful metadata) will be offset by the annoying mode-juggling. In fact, I personally find the whole mode thing to be rather annoying in general, and have even seriously considered dropping the "D" flag from my 65m32 implementation ... it's a lot of baggage to carry, especially if done in a conceptually complete manner (re: flag effects), and I would rather not do something at all than do it "half-fast".

On my m-824 (not publicly introduced), it would be much more natural to use 24-bits, because 8 and 24 are its two native width choices.

Mike

Rob Finch · Post by **Rob Finch** » Fri Nov 28, 2014 7:02 am

So I guess that 32 bit addressing is the way to go for a larger code space. I was thinking of the zero page addressing mode [zp],y which uses 24 bit addresses, but I guess a fourth byte could be added that is always zero. It wastes a byte of zero page to implement the 32 bit addressing.
32 bit CFA addressing on the '816 in Forth would then work like 16 bit addressing.

GARTHWILSON · Post by **GARTHWILSON** » Fri Nov 28, 2014 7:45 am

I don't think direct-page addressing space on the '816 for Forth would be a problem, since the heaviest application I could come up with on the '02 took less than 20% of ZP for the data stack, and if you really wanted multitasking, the '816 can easily have a different direct page for each task. It may seem a waste to zero the 4th byte of each 32-bit cell, but many cells will be data anyway and you may want anywhere from 8 to 32 bits. In 16-bit Forth, we still waste half a cell every time we put a character on the stack. The 65Org32 will handle it much more efficiently -- not in terms of saving memory, but there will be no direct-page limitations, or limitations of hardware stack space, and 32-bit quantities get fetched in one cycle, stored in one cycle, and operated on in one cycle, all at once instead of 8 or even 16 bits at a time.

Dr Jefyll · Post by **Dr Jefyll** » Fri Nov 28, 2014 3:39 pm

Rob Finch wrote:

So I guess that 32 bit addressing is the way to go for a larger code space.

32 is a lot. If I were doing it I'd use only 16 bits actually stored in memory, but with an implied scale. After the 16 bits get fetched, shift 'em left and stick some zeros on the right. If we shift 4 places, say, then we've multiplied the code space by 16 (now 1 MB), at the cost of requiring every entry point to be aligned to a 16-byte boundary. Minor downside: that results in about 8 bytes of wasted code space at the end of every routine.

I'm no fan of wasting RAM but the loss is largely reversed because tokens are 16-bit, not 32. And 16-bit tokens are faster to fetch, so the performance implications are significant.

It was arbitrary to suggest four as the number of shifts -- and it makes my scheme reminiscent of x86 real-mode addressing. But I'm not advocating segmentation (and the necessary use of two registers); instead there'd be one, 32-bit IP register inside the core. The wrinkle is that, when it loads from memory, the ms 12 bits and the ls 4 bits of the IP register load as zeros.
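The scaled-token idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up for the example, the shift of 4 is the arbitrary value from the post, and the padding estimate assumes routine end addresses are uniformly distributed.

```python
SHIFT = 4  # assumed scale factor; the post stresses that four is arbitrary


def token_to_address(token, shift=SHIFT):
    """Expand a stored 16-bit token into a code address (zeros on the right)."""
    if not 0 <= token <= 0xFFFF:
        raise ValueError("token must fit in 16 bits")
    return token << shift


def address_to_token(address, shift=SHIFT):
    """Compress an aligned entry point back into a 16-bit token."""
    if address & ((1 << shift) - 1):
        raise ValueError("entry point is not aligned")
    token = address >> shift
    if not 0 <= token <= 0xFFFF:
        raise ValueError("entry point is outside the reachable code space")
    return token


def align_up(address, shift=SHIFT):
    """Round an address up to the next legal entry point."""
    mask = (1 << shift) - 1
    return (address + mask) & ~mask


def average_padding(shift=SHIFT):
    """Mean bytes lost per routine, assuming uniformly distributed end addresses."""
    align = 1 << shift
    return sum((-end) % align for end in range(align)) / align
```

With a shift of 4 the reachable code space is 65536 x 16 bytes = 1 MB, and `average_padding()` works out to 7.5 bytes, which matches the "about 8 bytes" estimate.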

Four was an arbitrary choice, and other shift values bear consideration. Some scenarios would benefit best from a one-bit shift, or perhaps three. So that's a matter for debate. Or...

The necessity to define the shift (to have it "cast in stone") can be avoided if the shift is determined by a configuration register. (Yes a creeping feature but it needn't be over the top. Maybe allow only two options, say 0- or 4-bit shifting, if the core size is constrained. A fancier core would have a fancier configuration register, and offer, say, 0-, 2-, 4- or 6-bit shifting... ) Anyway I'd hesitate to blow the token size up to 32 bits.

Obviously the Forth compiler -- assuming the user will want one -- needs alterations to deal with this scheme. That's doable, as I know from making similar alterations. At issue was the compiler for another hybrid Forth with 32-bit cells & operations but 16-bit ITC tokens. The tokens aren't scaled, but you do have to add a fixed offset to find the start of the 64K dictionary within the flat 32-bit address space. (Actually it's a flat 20-bit address space, improbably accomplished with real-mode x86 under DOS.)

-- Jeff
Edit: raise the compiler issue, last paragraph

Rob Finch · Post by **Rob Finch** » Sat Nov 29, 2014 2:21 am

Quote:

If I were doing it I'd use only 16 bits actually stored in memory, but with an implied scale. After the 16 bits get fetched, shift 'em left and stick some zeros on the right. If we shift 4 places

I like it. That's a pretty good option for extending the code range without going to 32 bit addresses.

I'm toying with the idea of a Forth accelerator peripheral, rather than modifying the cpu. The first thing to do would be to implement the NEXT routine with hardware.
The peripheral would sit in zero page memory and automatically update a 3 byte IP pointer. It would also automagically fetch the code address vector for W. It could then leave this vector scaled (as suggested) in another zero page location. Then all that needs to be done is a jump to the W vector.

Code for a next routine would look something like this then:

Code:

STZ ForthAccelTrigger ; triggers a DMA operation
JMP ForthAccelW

Not quite as slick a solution as a couple of the other 6502 modifications for Forth, but I think it could be made to work.
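To make the accelerator idea concrete, here is a rough behavioural model in Python. It is a sketch, not a hardware design: the class and method names are invented for illustration, tokens are assumed to be 16 bits stored little-endian (the usual 65xx convention), and the 4-bit scale is borrowed from the earlier suggestion.

```python
class ForthAccel:
    """Hypothetical model of a memory-mapped NEXT accelerator."""

    def __init__(self, memory, ip, shift=4):
        self.memory = memory          # any indexable byte store
        self.ip = ip & 0xFFFFFF       # 3-byte IP, as described in the post
        self.shift = shift            # assumed scale applied to the token
        self.w = 0                    # scaled code address left for JMP

    def trigger(self):
        """Model of a write to ForthAccelTrigger: fetch, advance, scale."""
        lo = self.memory[self.ip]
        hi = self.memory[(self.ip + 1) & 0xFFFFFF]
        token = lo | (hi << 8)                    # assumed little-endian token
        self.ip = (self.ip + 2) & 0xFFFFFF        # auto-increment past the token
        self.w = token << self.shift              # scaled vector for the jump
        return self.w


# Usage: two tokens at $2000, fetched by two successive triggers.
memory = bytearray(0x10000)
memory[0x2000:0x2004] = bytes([0x34, 0x12, 0x78, 0x56])
accel = ForthAccel(memory, 0x2000)
first_target = accel.trigger()
second_target = accel.trigger()
```

In this model the CPU's share of NEXT is just the trigger write and the jump through W; everything else happens in the peripheral.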

Rob Finch · Post by **Rob Finch** » Tue Oct 06, 2015 4:25 pm

Quote:

32 is a lot. If I were doing it I'd use only 16 bits actually stored in memory, but with an implied scale. After the 16 bits get fetched, shift 'em left and stick some zeros on the right. If we shift 4 places, say, then we've multiplied the code space by 16 (now 1 MB), at the cost of requiring every entry point to be aligned to a 16-byte boundary. Minor downside: that results in about 8 bytes of wasted code space at the end of every routine.

Memory is cheap. Shifting the CFA by eight places would place code every 256 bytes, and waste 16MB of memory. But in a system with 128MB maybe that isn't a big issue. Using an eight bit shift eliminates shifting from the NEXT routine. The Forth code could be stored in a compressed file format.

Quote:

but you do have to add a fixed offset to find the start of the 64K dictionary within the flat 32-bit address space.

This would have to be added to the CFA. It might be possible to code this just as a constant addition if the address of the dictionary was fixed. Also using 256 byte areas the least significant byte could always be zero and left out of the addition if the dictionary is aligned at a 64k bank.

Next routine would look something like:

Code:

    LDD   FAR (IP),Y        ; Y is used as the low order 16 bits of IP
    LEAY  2,Y               ; step Y past the 16 bit token just fetched
    BNE   .0001             ; Y did not wrap to zero, so the bank is unchanged
    INC   IP+1              ; IP+2,IP+3=00 64k bank aligned
.0001:
    ADDD  #DictionaryBase   ; dictionary must be in low 16MB
    STD   W+2               ; set bits [23:8]
W  JMP FAR $00000000
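For readers who don't follow the extended 6809 syntax, the routine can be modelled roughly as below. This is a reading of the listing, not a definitive description: the function and parameter names are invented, `dictionary_base` is assumed to be a 16-bit constant expressed in 256-byte units, and the token is read big-endian as on a stock 6809.

```python
def next_step(memory, ip_high, y, dictionary_base):
    """Model one pass of NEXT.

    memory          -- mapping from address to byte value
    ip_high         -- upper 16 bits of IP (IP+0 and IP+1); IP+2, IP+3 are zero
    y               -- low 16 bits of the instruction pointer
    dictionary_base -- assumed 16-bit constant, in units of 256 bytes
    Returns (jump_target, new_ip_high, new_y).
    """
    address = (ip_high << 16) + y
    token = (memory[address] << 8) | memory[address + 1]   # LDD FAR (IP),Y

    y = (y + 2) & 0xFFFF                                   # LEAY 2,Y
    if y == 0:                                             # BNE not taken
        # INC IP+1 is a byte increment: only bits [23:16] of IP change
        ip_high = (ip_high & 0xFF00) | ((ip_high + 1) & 0x00FF)

    d = (token + dictionary_base) & 0xFFFF                 # ADDD #DictionaryBase
    jump_target = d << 8                                   # STD W+2 fills bits [23:8]
    return jump_target, ip_high, y
```

Note that no shift instruction appears anywhere: the multiply by 256 falls out of which operand bytes the store lands on.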

Dr Jefyll · Post by **Dr Jefyll** » Wed Oct 07, 2015 2:46 am

Hi Rob. OK, this wasn't making sense to me at first but now it does. Earlier in the thread I proposed left-shifting four places, and you've increased that to eight because an 8-bit shift in a byte-organized machine can happen for free just by juggling which bytes get accessed. Eight is a bigger number than we might otherwise choose, but it's not that big a deal. We end up with a 16MB dictionary, with every CFA aligned on a 256-byte boundary. (As for compressed file format, that'd be unnecessary if it's source code that's being stored.)

Rob Finch wrote:

Quote:

but you do have to add a fixed offset to find the start of the 64K dictionary within the flat 32-bit address space.

This would have to be added to the CFA. It might be possible to code this just as a constant addition if the address of the dictionary was fixed. Also using 256 byte areas the least significant byte could always be zero and left out of the addition if the dictionary is aligned at a 64k bank.

Good ideas, but probably unnecessary. My remark about adding a fixed offset pertains to the DOS environment where my dictionary is only 64K and I'm not allowed to map it to the bottom of the 1-MB Real-Mode space because DOS already has stuff down there. RFT6809 Forth has far more freedom, mostly because of the far larger dictionary. It could reside at the bottom of the 4GB space -- sacrificing a portion at the bottom if necessary, which is easily affordable since 16MB is oodles. So, no constant addition required.

Code:

    LDD    FAR (IP),Y          ; Y is used as the low order 16 bits of IP

And IP is a four-byte indirect pointer residing in memory, right? I like how your...

Code:

W  JMP FAR $00000000

takes a four-byte operand, and it's the middle two bytes that get written to by the preceding STD instruction. This is your free 8-bit shift.
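The "free shift" can be shown at the byte level with a small Python sketch. The helper names are made up for illustration, the opcode value is a placeholder (the thread does not give the real one), and the operand is assumed to be big-endian as on the 6809.

```python
import struct

JMP_FAR_OPCODE = 0x00  # placeholder only: the actual opcode value is not given


def make_w_vector(fixed_operand=0x00000000):
    """Build the 5-byte 'JMP FAR' instruction: opcode plus 4-byte operand."""
    return bytearray([JMP_FAR_OPCODE]) + bytearray(struct.pack(">I", fixed_operand))


def store_d_at_w_plus_2(w_vector, d):
    """Model of STD W+2: overwrite the middle two operand bytes with D."""
    struct.pack_into(">H", w_vector, 2, d & 0xFFFF)


def jump_target(w_vector):
    """Read back the full 32-bit operand the JMP FAR would use."""
    return struct.unpack_from(">I", w_vector, 1)[0]


w = make_w_vector()
store_d_at_w_plus_2(w, 0x1234)
patched_target = jump_target(w)
```

Writing 0x1234 into operand bytes 1 and 2 leaves the top and bottom bytes untouched, so the jump target reads back as the stored value multiplied by 256.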

-- Jeff

Rob Finch · Post by **Rob Finch** » Wed Oct 07, 2015 1:18 pm

Quote:

Code:

    LDD    FAR (IP),Y          ; Y is used as the low order 16 bits of IP

And IP is a four-byte indirect pointer residing in memory, right? I like how your...

Yes, IP is a four byte indirect pointer. The two LSB's are zero and Y contains the low order two bytes of the IP.

Quote:

Code:

W  JMP FAR $00000000

takes a four-byte operand, and it's the middle two bytes that get written to by the preceding STD instruction. This is your free 8-bit shift.

That is correct. JMP FAR is a single byte opcode extension to the 6809.
It might be better to use the second 16MB bank in order to avoid zero page memory and define the W vector like:

Code:

W    JMP FAR $01000000

It occurs to me this is really a 6809 topic, not a 6502 one. I'll move it over to anycpu.org

scotws · Post by **scotws** » Mon Jan 18, 2016 9:31 pm

About Forth on the 65816 - the problem, it seems to me, is that Forth expects an address as well as any non-double number to fit in a "cell" (see http://www.forth200x.org/documents/html/port.html) so that instructions such as ! (store) and @ (fetch) work. Stuff to do with the Dictionary and XTs would not be a problem because you could just use a token-based system with a 16-bit offset as the XT as described above. But you need to be able to do fetch and store from the command line as well, which would mean 24-bit numbers for addresses, which don't fit in a 16-bit accumulator. Argh.

So it would seem that with the 65816 we can either limit our address space to 64k in the current bank (which would be a shame) or do what we did with the 6502, make the cell size twice the accumulator width (16-bit for the 6502, and 32-bit for the 65816), which would be a waste and take us back to the horrible two-step addition process I had so dearly hoped to get away from. Or am I missing something?

Dr Jefyll · Post by **Dr Jefyll** » Mon Jan 18, 2016 11:16 pm

Quote:

the horrible two-step addition process I had so dearly hoped to get away from

I agree it would be a shame to limit our address space to 64k in the current bank. Back in the 20th century I extended my 16-bit Forth to augment it with "far" versions of @ ! C@ C! CMOVE and so on, and these words accepted each address as a double -- ie, two cells on stack. It worked but I quickly discovered it was a breeding ground for bugs, due to the mixture of single- and double-precision values on stack (eg: SWAP becomes ROT or possibly DSWAP) and the need for single- and double-precision operators (eg: + vs D+). It was hard to write and just as bad to read!! In fact, coding this way was so onerous I found myself reluctant to begin a project unless the data fit within 64K or could easily be hacked into pieces. It clamped a lid on productivity & creativity -- and that is what I dearly hope to get away from.
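A toy Python model shows how the stack juggling changes once an address takes two cells. This is just an illustration: the helper names are invented, and it assumes the usual Forth convention that a double-cell value sits on the stack with its high cell on top.

```python
def swap(stack):
    """( a b -- b a )"""
    stack[-1], stack[-2] = stack[-2], stack[-1]


def rot(stack):
    """( a b c -- b c a )"""
    stack.append(stack.pop(-3))


def push_far_address(stack, address):
    """Push a 24-bit address as two 16-bit cells, high cell on top."""
    stack.append(address & 0xFFFF)
    stack.append((address >> 16) & 0xFFFF)


# Goal: turn ( value addr ) into ( addr value ).
# With a single-cell address, SWAP does the job.
near = [0x55, 0x2345]
swap(near)

# With a double-cell address, the same job needs ROT.
far = [0x55]
push_far_address(far, 0x012345)
rot(far)

# Using SWAP out of habit silently scrambles the address instead.
wrong = [0x55]
push_far_address(wrong, 0x012345)
swap(wrong)
```

After these steps `far` holds the address cells followed by the value, while `wrong` still has the value at the bottom and the two halves of the address exchanged, which is exactly the kind of bug that is easy to write and hard to spot.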

If I ever write another 65xx Forth, every on-stack cell will have 32 bits of storage available (possibly implemented under the hood as separate hi-word & lo-word stacks), and double precision operations will be the default (with strictly optional implementation of single-precision substitutes which just ignore the hi-words of the cells). IOW I will willingly sacrifice execution speed for clean use of the 24-bit data space.

And, percentage-wise, the drop in execution speed won't be as great as one might suppose. That's because the code space (the dictionary space and pointers within it) will remain 16-bit. The considerable number of clock cycles spent on NEXT 0BRANCH NEST UNNEST etc won't increase. The time spent processing data may double, but that doesn't mean the Forth VM as a whole will execute at half speed.

Needless to say, priorities vary, and my solution may not suit everyone. But I'll happily accept less than bleeding edge performance if the upside is de-PITA-ing access to the 816's large, flat data space. The double-precision operations are low-level stuff, written and debugged once. I'm much more concerned about high-level coding, where the write/debug cycle repeats with each new project.

I expressed some similar thoughts, perhaps less succinctly, in a thread I started here.

Cheers,
Jeff

scotws · Post by **scotws** » Wed Jan 20, 2016 6:35 pm

Thanks for the link to the other thread, I hadn't seen that one. Obviously, it would seem that I'm going to have to write two Forths for the 65816 - one "fast and small" (16-bit cells, limited to 64k) and one "big and slow" (32-bit cells, but whole memory). The horror!

Dr Jefyll · Post by **Dr Jefyll** » Wed Jan 20, 2016 8:15 pm

scotws wrote:

it would seem that I'm going to have to write two Forths for the 65816

Well, there's a compromise solution that comes pretty close to letting you have your cake and eat it too. The key point is to make sure your small & fast Forth reserves space for the high-word stack -- the "ghost stack," as Bruce calls it. (The R stack should have space to get ghosted, too.)

Then you can write a complete set of fast, 16-bit words that'll get used by default, and selectively add 32-bit words as needed (the reverse of what I proposed). Either way, you avoid the #1 thing you really, really DON'T wanna do, and that is be forced to use two cells to hold an extended-precision value. Juggling double-cell items on stack, especially when mingled with single items, is what I learned to hate.
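The split-stack arrangement can be sketched as follows. This is a minimal Python model under assumptions, not anyone's actual implementation: the class and method names are invented, there is no underflow or overflow checking, and the 16-bit words simply leave the ghost (high-word) stack untouched.

```python
class SplitStack:
    """Data stack with separate low-word and high-word ("ghost") arrays."""

    def __init__(self, depth=64):
        self.lo = [0] * depth   # used by every word
        self.hi = [0] * depth   # ghost stack, used only by 32-bit words
        self.sp = 0             # one index serves both arrays

    def push16(self, value):
        self.lo[self.sp] = value & 0xFFFF   # ghost entry is ignored
        self.sp += 1

    def pop16(self):
        self.sp -= 1
        return self.lo[self.sp]

    def push32(self, value):
        self.lo[self.sp] = value & 0xFFFF
        self.hi[self.sp] = (value >> 16) & 0xFFFF
        self.sp += 1

    def pop32(self):
        self.sp -= 1
        return (self.hi[self.sp] << 16) | self.lo[self.sp]

    def add16(self):
        """Fast default '+': low words only, wraps at 16 bits."""
        b = self.pop16()
        a = self.pop16()
        self.push16(a + b)

    def add32(self):
        """Extended '+': carries from the low word into the ghost word."""
        b = self.pop32()
        a = self.pop32()
        self.push32(a + b)
```

The point of the layout is that a 24-bit address occupies one stack position either way, so SWAP, DUP and friends keep their usual meaning. One caveat of this particular sketch: a cell written by a 16-bit word has a stale ghost entry, so a real design would need a rule for mixing the two families of words.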

BTW & FWIW, in 1994 I rewrote a Forth that runs on DOS to use 32-bit operations and 32-bit cells. But the dictionary and compiler remain 16-bit, so it's a hybrid just like the '816 Forth I proposed.

The main thing is, @ ! and CMOVE etc "just work" in the large, flat address space (1 MB in this case -- Real Mode x86 -- or 16 MB with my KK or an '816), so you're freed from requiring a 64K-centric context in your thinking. I found it to be a real breath of fresh air.

-- Jeff
