24-bit CFA ?
I was just wondering if Forth had been implemented with 24 bit code addresses (24 bit words?) such as are available on the '816.
- GARTHWILSON
- Forum Moderator
- Posts: 8774
- Joined: 30 Aug 2002
- Location: Southern California
Re: 24-bit CFA ?
I learned Forth on my HP-71 with 20-bit addresses and cells. I'm sure just about everything has been done.
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
- BigDumbDinosaur
- Posts: 9428
- Joined: 28 May 2009
- Location: Midwestern USA (JB Pritzker’s dystopia)
Re: 24-bit CFA ?
Rob Finch wrote:
I was just wondering if Forth had been implemented with 24 bit code addresses (24 bit words?) such as are available on the '816.
It should be noted that all implied register operations, e.g., INC A or DEX, consume the same number of clock cycles regardless of register width. This is because the 65C816's ALU always processes words, regardless of the actual register width. Also, there's only a one-cycle penalty for using instructions that involve 16-bit memory accesses, the extra cycle being expended on the MSB fetch and/or store step in the instruction. The performance improvement is especially dramatic on R-M-W instructions: INC SOMEWHERE when the accumulator is set to 16 bits executes about 250 percent faster than INC SOMEWHERE -- BNE NEXT -- INC SOMEWHERE+1 with the accumulator set to 8 bits, plus the code is more succinct.
When I rewrote the firmware in POC V1 to use 16-bit operations I saw a noticeable improvement in performance, and the code size actually shrank about 15 percent overall. The reclaimed ROM space didn't go to waste: it got used for the SCSI drivers.
x86? We ain't got no x86. We don't NEED no stinking x86!
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: 24-bit CFA ?
I am not a proficient '816 assembly programmer like BDD, but I almost immediately thought of the same issue. The space saved by using 24-bit addresses vs. 32-bit addresses with the most-significant 8 bits as "don't cares" (or some possibly useful metadata) will be offset by the annoying mode-juggling. In fact, I personally find the whole mode thing to be rather annoying in general, and have even seriously considered dropping the "D" flag from my 65m32 implementation ... it's a lot of baggage to carry, especially if done in a conceptually complete manner (re: flag effects), and I would rather not do something at all than do it "half-fast".
On my m-824 (not publicly introduced), it would be much more natural to use 24-bits, because 8 and 24 are its two native width choices.
Mike
Re: 24-bit CFA ?
So I guess that 32-bit addressing is the way to go for a larger code space. I was thinking of the zero page addressing mode [zp],y which uses 24-bit addresses, but I guess a fourth byte could be added that is always zero. It wastes a byte of zero page to implement the 32-bit addressing.
A 32 bit CFA addressing on the '816 in Forth would work like 16 bit addressing then.
- GARTHWILSON
- Forum Moderator
- Posts: 8774
- Joined: 30 Aug 2002
- Location: Southern California
Re: 24-bit CFA ?
I don't think direct-page addressing space on the '816 for Forth would be a problem, since the heaviest application I could come up with on the '02 took less than 20% of ZP for the data stack, and if you really wanted multitasking, the '816 can easily have a different direct page for each task. It may seem a waste to zero the 4th byte of each 32-bit cell, but many cells will be data anyway and you may want anywhere from 8 to 32 bits. In 16-bit Forth, we still waste half a cell every time we put a character on the stack. The 65Org32 will handle it much more efficiently -- not in terms of saving memory, but there will be no direct-page limitations, or limitations of hardware stack space, and 32-bit quantities get fetched in one cycle, stored in one cycle, and operated on in one cycle, all at once instead of 8 or even 16 bits at a time.
Re: 24-bit CFA ?
Rob Finch wrote:
So I guess that 32 bit addressing is the way to go for a larger code space.
I'm no fan of wasting RAM but the loss is largely reversed because tokens are 16-bit, not 32. And 16-bit tokens are faster to fetch, so the performance implications are significant.
It was arbitrary to suggest four as the number of shifts -- and it makes my scheme reminiscent of x86 real-mode addressing. But I'm not advocating segmentation (and the necessary use of two registers); instead there'd be one, 32-bit IP register inside the core. The wrinkle is that, when it loads from memory, the ms 12 bits and the ls 4 bits of the IP register load as zeros.
Four was an arbitrary choice, and other shift values bear consideration. Some scenarios would benefit best from a one-bit shift, or perhaps three. So that's a matter for debate. Or...
The necessity to define the shift (to have it "cast in stone") can be avoided if the shift is determined by a configuration register. (Yes a creeping feature but it needn't be over the top. Maybe allow only two options, say 0- or 4-bit shifting, if the core size is constrained. A fancier core would have a fancier configuration register, and offer, say, 0-, 2-, 4- or 6-bit shifting... ) Anyway I'd hesitate to blow the token size up to 32 bits.
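To make the scaled-token idea concrete, here is a minimal Python sketch of what the IP load would do, assuming the shift comes from a configuration value. The function names (`token_to_ip`, `ip_to_token`, `code_space_bytes`) are made up for illustration and don't belong to any existing Forth or core.

```python
def token_to_ip(token: int, shift: int = 4) -> int:
    """Expand a 16-bit token into a 32-bit IP value by shifting it left.

    With shift=4 the top 12 bits and bottom 4 bits of the result are zero.
    """
    if not 0 <= token <= 0xFFFF:
        raise ValueError("token must fit in 16 bits")
    return (token << shift) & 0xFFFFFFFF


def ip_to_token(address: int, shift: int = 4) -> int:
    """Compiler-side inverse: only aligned entry points have a token."""
    if address & ((1 << shift) - 1):
        raise ValueError("entry point is not aligned to the token scale")
    token = address >> shift
    if token > 0xFFFF:
        raise ValueError("address lies beyond the scaled code space")
    return token


def code_space_bytes(shift: int) -> int:
    """Bytes of code space reachable by a 16-bit token at this scale."""
    return 0x10000 << shift
```

With a 4-bit shift, `code_space_bytes(4)` is 1 MB, and `ip_to_token` rejects any entry point that isn't on a 16-byte boundary, which is the alignment cost the compiler would have to absorb.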
Obviously the Forth compiler -- assuming the user will want one -- needs alterations to deal with this scheme. That's doable, as I know from making similar alterations. At issue was the compiler for another hybrid Forth with 32-bit cells & operations but 16-bit ITC tokens. The tokens aren't scaled, but you do have to add a fixed offset to find the start of the 64K dictionary within the flat 32-bit address space. (Actually it's a flat 20-bit address space, improbably accomplished with real-mode x86 under DOS.)
-- Jeff
Edit: raise the compiler issue, last paragraph
Last edited by Dr Jefyll on Sat Nov 29, 2014 12:11 pm, edited 1 time in total.
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html
Re: 24-bit CFA ?
Quote:
If I were doing it I'd use only 16 bits actually stored in memory, but with an implied scale. After the 16 bits get fetched, shift 'em left and stick some zeros on the right. If we shift 4 places
I'm toying with the idea of a Forth accelerator peripheral, rather than modifying the cpu. The first thing to do would be to implement the NEXT routine with hardware.
The peripheral would sit in zero page memory and automatically update a 3 byte IP pointer. It would also automagically fetch the code address vector for W. It could then leave this vector scaled (as suggested) in another zero page location. Then all that needs to be done is a jump to the W vector.
Code for a next routine would look something like this then:
Code:
STZ ForthAccelTrigger ; triggers a DMA operation
JMP ForthAccelW
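As a rough behavioural model of what such a peripheral might do on each trigger, here is a Python sketch. Everything in it is an assumption for illustration: the class name `ForthAccel`, the 16-bit little-endian token, and the optional `shift` used to scale W. It is not a description of real hardware.

```python
class ForthAccel:
    """Toy model of a memory-mapped NEXT accelerator (hypothetical)."""

    def __init__(self, memory: bytearray, ip: int, shift: int = 0):
        self.memory = memory
        self.ip = ip & 0xFFFFFF   # 3-byte (24-bit) instruction pointer
        self.shift = shift        # optional scaling applied to the fetched token
        self.w = 0                # vector the CPU would jump through

    def trigger(self) -> int:
        """Model of a write to ForthAccelTrigger: fetch, scale, advance IP."""
        lo = self.memory[self.ip]
        hi = self.memory[(self.ip + 1) & 0xFFFFFF]
        self.w = ((hi << 8) | lo) << self.shift   # little-endian, as on the 65xx
        self.ip = (self.ip + 2) & 0xFFFFFF
        return self.w
```

In this model the two-instruction NEXT corresponds to one call to `trigger()` followed by a jump to the address held in `w`.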
Re: 24-bit CFA ?
Quote:
32 is a lot. If I were doing it I'd use only 16 bits actually stored in memory, but with an implied scale. After the 16 bits get fetched, shift 'em left and stick some zeros on the right. If we shift 4 places, say, then we've multiplied the code space by 16 (now 1 MB), at the cost of requiring every entry point to be aligned to a 16-byte boundary. Minor downside: that results in about 8 bytes of wasted code space at the end of every routine.
Quote:
but you do have to add a fixed offset to find the start of the 64K dictionary within the flat 32-bit address space.
Next routine would look something like:
Code:
LDD FAR (IP),Y ; Y is used as the low order 16 bits of IP
LEAY 2,Y
BNE .0001
INC IP+1 ;IP+2,IP+3=00 64k bank aligned
.0001:
ADDD #DictionaryBase ; dictionary must be in low 16MB
STD W+2 ; set bits [23:8]
W JMP FAR $00000000
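For anyone who wants to step through the listing above without the hardware, here is a small Python model of my reading of it. The function name `next_step` and its parameters are hypothetical, and the model assumes big-endian 16-bit tokens (6809 style) and a dictionary base expressed in 256-byte units.

```python
def next_step(memory, ip_base, y, dictionary_base_pages=0):
    """Model one pass through NEXT.

    ip_base is the 32-bit pointer held in IP (low 16 bits zero);
    y supplies the low 16 bits. Returns (ip_base, y, jump_target).
    """
    addr = ip_base + y
    token = (memory[addr] << 8) | memory[addr + 1]   # LDD FAR (IP),Y
    y = (y + 2) & 0xFFFF                             # LEAY 2,Y
    if y == 0:                                       # BNE not taken on wrap
        bank = ((ip_base >> 16) + 1) & 0xFF          # INC IP+1 is an 8-bit increment
        ip_base = (ip_base & 0xFF000000) | (bank << 16)
    d = (token + dictionary_base_pages) & 0xFFFF     # ADDD wraps at 16 bits
    jump_target = d << 8                             # STD W+2 fills bits [23:8]
    return ip_base, y, jump_target
```

The 8-bit shift costs nothing here because the token is simply stored one byte up inside the jump operand, at the price of every code field sitting on a 256-byte boundary.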
Re: 24-bit CFA ?
Hi Rob. OK, this wasn't making sense to me at first but now it does. Earlier in the thread I proposed left-shifting four places, and you've increased that to eight because an 8-bit shift in a byte-organized machine can happen for free just by juggling which bytes get accessed. Eight is a bigger number than we might otherwise choose, but it's not that big a deal. We end up with a 16MB dictionary, with every CFA aligned on a 256-byte boundary. (As for compressed file format, that'd be unnecessary if it's source code that's being stored.)
Rob Finch wrote:
Quote:
but you do have to add a fixed offset to find the start of the 64K dictionary within the flat 32-bit address space.
This would have to be added to the CFA. It might be possible to code this just as a constant addition if the address of the dictionary was fixed. Also using 256 byte areas the least significant byte could always be zero and left out of the addition if the dictionary is aligned at a 64k bank.
Good ideas, but probably unnecessary. My remark about adding a fixed offset pertains to the DOS environment where my dictionary is only 64K and I'm not allowed to map it to the bottom of the 1-MB Real-Mode space because DOS already has stuff down there. RFT6809 Forth has far more freedom, mostly because of the far larger dictionary. It could reside at the bottom of the 4GB space -- sacrificing a portion at the bottom if necessary, which is easily affordable since 16MB is oodles. So, no constant addition required.
Code:
LDD FAR (IP),Y ; Y is used as the low order 16 bits of IP
And IP is a four-byte indirect pointer residing in memory, right? I like how your
Code:
W JMP FAR $00000000
takes a four-byte operand, and it's the middle two bytes that get written to by the preceding STD instruction. This is your free 8-bit shift.
-- Jeff
Re: 24-bit CFA ?
Quote:
Code:
LDD FAR (IP),Y ; Y is used as the low order 16 bits of IP
Quote:
Code:
W JMP FAR $00000000
It might be better to use the second 16MB bank in order to avoid zero page memory and define the W vector like:
Code:
W JMP FAR $01000000
Re: 24-bit CFA ?
About Forth on the 65816: the problem, it seems to me, is that Forth expects an address, as well as any non-double number, to fit in a "cell" (see http://www.forth200x.org/documents/html/port.html) so that instructions such as ! (store) and @ (fetch) work. Stuff to do with the Dictionary and XTs would not be a problem because you could just use a token-based system with a 16-bit offset as the XT as described above. But you need to be able to do fetch and store from the command line as well, which would mean 24-bit numbers for addresses, which don't fit in a 16-bit accumulator. Argh.
So it would seem that with the 65816 we can either limit our address space to 64k in the current bank (which would be a shame) or do what we did with the 6502, make the cell size twice the accumulator width (16-bit for the 6502, and 32-bit for the 65816), which would be a waste and take us back to the horrible two-step addition process I had so dearly hoped to get away from. Or am I missing something?
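For readers who haven't met the "two-step addition" before, here is a short Python sketch of the idea: a 32-bit cell is added as two 16-bit halves with the carry passed between them, which is roughly what a 16-bit accumulator forces. The helper names `add16` and `add32_two_step` are illustrative only.

```python
MASK16 = 0xFFFF


def add16(a: int, b: int, carry_in: int = 0):
    """16-bit add with carry, returning (sum, carry_out)."""
    total = a + b + carry_in
    return total & MASK16, total >> 16


def add32_two_step(a_lo: int, a_hi: int, b_lo: int, b_hi: int):
    """Add two 32-bit cells held as (lo, hi) 16-bit halves."""
    lo, carry = add16(a_lo, b_lo)        # step 1: low words
    hi, _ = add16(a_hi, b_hi, carry)     # step 2: high words plus carry
    return lo, hi


# A 24-bit address loses its bank byte when squeezed into a 16-bit cell.
truncated = 0x012345 & MASK16
```

The last line is the cell-size problem in miniature: `truncated` is 0x2345, so the bank byte is gone and @ or ! would hit the wrong bank.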
Re: 24-bit CFA ?
Quote:
the horrible two-step addition process I had so dearly hoped to get away from
If I ever write another 65xx Forth, every on-stack cell will have 32 bits of storage available (possibly implemented under the hood as separate hi-word & lo-word stacks), and double precision operations will be the default (with strictly optional implementation of single-precision substitutes which just ignore the hi-words of the cells). IOW I will willingly sacrifice execution speed for clean use of the 24-bit data space.
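A minimal Python sketch of the separate hi-word and lo-word stacks, with a full-width add as the default and a single-precision substitute that ignores the hi-words. The class and method names are hypothetical and only meant to show the shape of the idea.

```python
class SplitStack:
    """Data stack of 32-bit cells stored as two parallel 16-bit stacks."""

    def __init__(self):
        self.lo = []   # low 16 bits of each cell
        self.hi = []   # high 16 bits of each cell

    def push(self, value: int) -> None:
        value &= 0xFFFFFFFF
        self.lo.append(value & 0xFFFF)
        self.hi.append(value >> 16)

    def pop(self) -> int:
        return (self.hi.pop() << 16) | self.lo.pop()

    def add(self) -> None:
        """Default '+': full 32-bit addition."""
        b = self.pop()
        a = self.pop()
        self.push(a + b)

    def add_single(self) -> None:
        """Optional fast '+': adds the lo-words only and ignores the hi-words."""
        b_lo = self.lo.pop()
        self.hi.pop()                                  # drop b's hi-word unread
        self.lo[-1] = (self.lo[-1] + b_lo) & 0xFFFF    # a's hi-word is left as is
```

Because each cell always occupies one slot on both stacks, stack-juggling words like SWAP and ROT never have to care whether a value is 16 or 32 bits wide.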
And, percentage-wise, the drop in execution speed won't be as great as one might suppose. That's because the code space (the dictionary space and pointers within it) will remain 16-bit. The considerable number of clock cycles spent on NEXT 0BRANCH NEST UNNEST etc won't increase. The time spent processing data may double, but that doesn't mean the Forth VM as a whole will execute at half speed.
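A back-of-envelope way to see this, as a Python sketch: if only the data-processing share of run time gets slower, the overall slowdown depends on how big that share is. The function name and the example fractions are hypothetical, not measurements of any real Forth.

```python
def relative_speed(data_fraction: float, data_slowdown: float = 2.0) -> float:
    """Speed relative to the 16-bit baseline when only data handling slows down.

    data_fraction is the share of baseline run time spent processing data;
    the rest (NEXT, NEST, UNNEST, branches) is assumed unchanged.
    """
    if not 0.0 <= data_fraction <= 1.0:
        raise ValueError("data_fraction must be between 0 and 1")
    new_time = (1.0 - data_fraction) + data_fraction * data_slowdown
    return 1.0 / new_time
```

For example, if data processing were half the run time and it doubled, `relative_speed(0.5)` gives about 0.67 of the original speed rather than 0.5.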
Needless to say, priorities vary, and my solution may not suit everyone. But I'll happily accept less than bleeding edge performance if the upside is de-PITA-ing access to the 816's large, flat data space. The double-precision operations are low-level stuff, written and debugged once. I'm much more concerned about high-level coding, where the write/debug cycle repeats with each new project.
I expressed some similar thoughts, perhaps less succinctly, in a thread I started here.
Cheers,
Jeff
Re: 24-bit CFA ?
Thanks for the link to the other thread, I hadn't seen that one. So it would seem that I'm going to have to write two Forths for the 65816: one "fast and small" (16-bit cells, limited to 64k) and one "big and slow" (32-bit cells, but whole memory). The horror!
Re: 24-bit CFA ?
scotws wrote:
it would seem that I'm going to have to write two Forths for the 65816
Then you can write a complete set of fast, 16-bit words that'll get used by default, and selectively add 32-bit words as needed (the reverse of what I proposed). Either way, you avoid the #1 thing you really, really DON'T wanna do, and that is be forced to use two cells to hold an extended-precision value. Juggling double-cell items on stack, especially when mingled with single items, is what I learned to hate.
BTW & FWIW, in 1994 I rewrote a Forth that runs on DOS to use 32-bit operations and 32-bit cells. But the dictionary and compiler remain 16-bit, so it's a hybrid just like the '816 Forth I proposed.
The main thing is, @ ! and CMOVE etc "just work" in the large, flat address space (1 MB in this case -- Real Mode x86 -- or 16 MB with my KK or an '816), so you're freed from requiring a 64K-centric context in your thinking. I found it to be a real breath of fresh air.
-- Jeff