Posted: Mon Jun 06, 2016 8:21 am

Joined: Fri Jun 03, 2016 3:42 am
Posts: 158
I'm not much interested in the 65c816 --- I don't know why anybody would program it rather than the modern 16-bit processors such as the MSP-430 and PIC24 that are much more powerful --- other than nostalgia, of course.

I do think the 65c02 has a future, primarily as the processor used in a multi-core FPGA system. You want to have 8-bit registers so you use minimal FPGA resources and hence can have more processors in a particular FPGA than you could if the processors had 16-bit registers. Such a system should be competitive in performance and price with a big (16-bit registers) single-core system --- the Parallax Propeller has had some use in the real world --- this would be similar.

To make the 65c02 work better with Forth however, you would want some upgrades:
1.) A (zp,X),Y addressing-mode so pointers on the data-stack can be used to access memory (I am assuming that X is the data-stack pointer)
2.) DNX (double increment X) and DDX (double decrement X) instructions because #1 above assumes a non-split data-stack (low byte and high byte adjacent in memory)
3.) TSY (transfer S to Y) for accessing local variables on the return-stack
4.) LAY (load A from zp,X and Y from zp+1,X) to help the peephole-optimizer
5.) JMP (adr,Y) for jump tables
6.) Bit instructions for I/O ports in zp or for 1-bit variables in zp
7.) An 8x8 multiply

#1 and #3 are the only ones that are really needed --- the rest are helpful but optional.
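
For reference, a minimal Python model of what the proposed (zp,X),Y mode in #1 would compute. It assumes the mode fetches the pointer the way the stock (zp,X) mode does (wrapping inside page zero) and then adds Y the way the stock (zp),Y mode does. The function name and the wrap behaviour are assumptions for illustration, not taken from any existing core.

```python
def ea_zp_x_ind_y(mem, zp, x, y):
    """Effective address for the hypothetical (zp,X),Y addressing mode."""
    # Pointer low/high bytes come from zero page at zp+X and zp+X+1,
    # wrapping inside page zero as the stock (zp,X) mode does (assumed).
    lo = mem[(zp + x) & 0xFF]
    hi = mem[(zp + x + 1) & 0xFF]
    # Y is added afterwards, as in the stock (zp),Y mode (assumed).
    return (((hi << 8) | lo) + y) & 0xFFFF


mem = bytearray(65536)
mem[0x80], mem[0x81] = 0x34, 0x12   # data-stack cell at X = $80 holds pointer $1234
low_byte_addr = ea_zp_x_ind_y(mem, 0x00, 0x80, 0)
high_byte_addr = ea_zp_x_ind_y(mem, 0x00, 0x80, 1)
```

With Y = 0 and Y = 1 the two bytes of the cell that the pointer refers to can be reached without first copying the pointer off the data-stack, which is the case @ and ! care about.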

How difficult would this be?


Posted: Mon Jun 06, 2016 10:14 pm

Joined: Sat Sep 29, 2012 10:15 pm
Posts: 904
I assume you've seen Michael's core, http://forum.6502.org/viewtopic.php?f=10&t=2997, which has new Forth instructions...

Xilinx FPGAs are also full of multipliers...

How do you envision cores communicating with each other?

_________________
In theory, there is no difference between theory and practice. In practice, there is. ...Jan van de Snepscheut


Posted: Tue Jun 07, 2016 12:00 am

Joined: Fri Jun 03, 2016 3:42 am
Posts: 158
enso wrote:
I assume you've seen Michael's core, http://forum.6502.org/viewtopic.php?f=10&t=2997, which has new Forth instructions...

I was not aware of it. I scanned over the thread a little bit. He is supporting DTC. That is somewhat interesting, but I was assuming STC. Most FPGAs have abundant memory, both RAM and non-volatile, built in --- there is no need to use DTC or ITC or token-threading or anything else for the purpose of reducing code size --- that was done in the 1970s when an 8KB EPROM was often all you had for non-volatile memory.

STC isn't such a memory hog as is often assumed. A JSR is only 3 bytes, compared to 2 bytes for DTC or ITC, so there isn't all that much difference. Of course, a lot of code will be inlined for speed (that is pretty much the point of using STC). My suggestions for an improved 65c02 involved making common code sequences into instructions so they use less memory when inlined. For example, DNX instead of INX INX and DDX instead of DEX DEX will halve the memory usage there, and these are very common. The (zp,X),Y addressing mode is primarily to allow @ and ! to have significantly reduced code size and improved speed.

Here is one good optimization: The producers (functions that push data onto the data-stack) end with DDX, so they are actually working with memory under that stack prior to the DDX covering up their data. The consumers (functions that pull data from the data-stack) start with DNX, so they are actually working with memory under that stack after the DNX has dropped it. The point of this is that, if you inline a producer followed by a consumer (very common), the DDX and DNX are adjacent and so the peephole-optimizer can remove both instructions (saving 2 bytes). If a producer is inlined and is followed by a JSR to a consumer, then the DDX at the end is not compiled (saving 1 byte) and the JSR to the consumer goes to 1+ the consumer's address so the DNX at the start doesn't execute.

Also, the producer ends with DDX and it is assumed that it stored TOS (top of stack) to the data-stack underneath prior to the DDX using SAY (the store counterpart of LAY). Let's assume that the high-byte was in Y and the low-byte was in A. The consumer starts with DNX and it is assumed that it loads TOS from the data-stack underneath using LAY. This means that, once the DDX and DNX have been removed, SAY and LAY are adjacent and the peephole-optimizer can get rid of those too (saving 2 bytes).
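
The two cancellations described above can be sketched as a tiny peephole pass in Python. This is only a model: instructions are plain strings, the mnemonics DDX, DNX, SAY and LAY are the proposed (not existing) opcodes, and it assumes the SAY and the LAY refer to the same stack cell, so that dropping both leaves the value live in Y:A.

```python
# Pairs that cancel when the first is immediately followed by the second.
CANCELLING_PAIRS = {("DDX", "DNX"), ("SAY", "LAY")}


def peephole(code):
    """Remove adjacent cancelling pairs, including pairs exposed by a removal."""
    out = []
    for op in code:
        if out and (out[-1], op) in CANCELLING_PAIRS:
            out.pop()          # drop both halves of the pair
        else:
            out.append(op)
    return out


producer = ["LDA #<1234", "LDY #>1234", "SAY", "DDX"]    # pushes a literal
consumer = ["DNX", "LAY", "STA result", "STY result+1"]  # pops into a variable
inlined = peephole(producer + consumer)
```

Because the output list is used as a stack, removing DDX/DNX exposes SAY/LAY to the same check, so both pairs disappear in one pass. Note that the order matters: DNX followed by DDX is not a cancelling pair in this model.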

Note that the optimizations described above preclude ISRs from being written in Forth. This is because the data under the stack would get clobbered by the ISR using the stack. One solution here is for ISRs to have their own data-stack.

Here is another instruction that would be useful, which I didn't mention previously: TAY (test the Y:A 16-bit value and set the N and Z flags appropriately). 0BRANCH could then be this:
DNX
LAY
TAY
BEQ xxx

That is only 5 bytes, which isn't much. Also, given the peephole-optimization described above, the DNX and LAY can be discarded and so we are down to 3 bytes.

Another good optimization is that a JSR followed by an RTS can be converted into a JMP and the RTS not compiled (saving 1 byte). This also has the advantage of reducing return-stack usage --- a tail-recursive function won't overflow the return-stack.
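
A minimal Python sketch of that JSR/RTS rewrite, again treating instructions as strings. It assumes the RTS is not itself a branch target; a real compiler would have to check that before dropping it.

```python
def tail_call(code):
    """Turn 'JSR x' immediately followed by 'RTS' into 'JMP x'."""
    out = []
    for op in code:
        if op == "RTS" and out and out[-1].startswith("JSR "):
            out[-1] = "JMP " + out[-1][4:]   # 3 + 1 bytes become 3 bytes
        else:
            out.append(op)
    return out


body = tail_call(["JSR foo", "JSR bar", "RTS"])
```

Only the last call is rewritten; earlier JSRs still need their return address.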

All of these optimizations not only reduce code size, but also boost the speed.

enso wrote:
How do you envision cores communicating with each other?

A common-memory area --- maybe 16KB of RAM --- large enough to hold buffers for all the I/O ports.

Typically in a dual-core system, one processor will handle all of the I/O. The other processor has the main program running on it.

In a multi-core system, you could have each processor handling one I/O port, and then one processor that has the main program.

It would be similar to the Parallax Propeller --- except with a 65c02 variant as each processor.


Posted: Tue Jun 07, 2016 12:24 am

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
Hugh Aguilar wrote:
there is no need to use DTC or ITC or token-threading or anything else for the purpose of reducing code size --- that was done in the 1970s when an 8KB EPROM was often all you had for non-volatile memory.

STC isn't such a memory hog as is often assumed.

In this post, Bruce Clark surprised us with how the faster-running STC Forth avoids the expected memory penalties. He gives 9 reasons, starting in the middle of his long post in the middle of the page.

But also, he showed a way to do a two-instruction NEXT in ITC Forth at viewtopic.php?t=584, and then showed a way to do a single-instruction, six-cycle NEXT in DTC Forth at viewtopic.php?t=586 . Impressive.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Posted: Tue Jun 07, 2016 4:59 am

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
Hugh Aguilar wrote:
I'm not much interested in the 65c816 --- I don't know why anybody would program it rather than the modern 16-bit processors such as the MSP-430 and PIC24 that are much more powerful --- other than nostalgia, of course.


Probably the fact that it takes up just about the same amount of resources as a 6502 is as good a reason as any. It having a trivial-to-use, external memory bus is another. Performance-wise, it's not too shabby either. All around good bang for the buck. And since you bring up Forth, which all but requires a 16-bit implementation to be minimally useful at all, the 65816 proves to run code measurably faster than a 6502. So if you cannot or will not run an MSP430 or PIC, the real question is, why wouldn't you choose a 65816?


Posted: Tue Jun 07, 2016 5:56 am

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
Quote:
Here is another instruction that would be useful, which I didn't mention previously: TAY (test the Y:A 16-bit value and set the N and Z flags appropriately).

TAY is already taken, but that isn't a big deal, because the 65xx tradition would support the notion that LAY should have appropriate flag effects all by itself. :-)

Mike B.


Posted: Tue Jun 07, 2016 7:26 am

Joined: Fri Jun 03, 2016 3:42 am
Posts: 158
barrym95838 wrote:
Quote:
Here is another instruction that would be useful, which I didn't mention previously: TAY (test the Y:A 16-bit value and set the N and Z flags appropriately).

TAY is already taken, but that isn't a big deal, because the 65xx tradition would support the notion that LAY should have appropriate flag effects all by itself. :-)

Mike B.

This is true --- I forgot at the moment that the name TAY is already taken (transfer A to Y) --- I was really typing faster than I was thinking, and a few extra seconds of thought before posting should have brought that to my attention.

You are right that a test instruction isn't really needed anyway because LAY can set the flags itself. This is a big reason why a new instruction is needed, because the N flag gets set according to the high bit of Y whereas the Z flag gets set according to all the bits of both Y and A, so doing this with the existing 65c02 instructions would involve a lot of instructions and hence would be slow and bloaty --- but doing it in an FPGA as a new instruction should be pretty easy.
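
The flag rule described above, as a small Python model. The function name is made up for illustration; the behaviour is just the rule as stated (N from bit 7 of Y, Z from all 16 bits of Y:A).

```python
def lay_flags(a, y):
    """Return (N, Z) as the proposed LAY would set them for the 16-bit value Y:A."""
    n = bool(y & 0x80)             # N: high bit of the high byte only
    z = ((y << 8) | a) == 0        # Z: every bit of both bytes must be clear
    return n, z


zero = lay_flags(0x00, 0x00)
negative = lay_flags(0x00, 0x80)
low_byte_bit7 = lay_flags(0x80, 0x00)
```

The last case is the one that a plain LDA followed by LDY gets wrong: bit 7 of A would set N after the LDA, and a zero Y would set Z after the LDY, although the 16-bit value $0080 is neither negative nor zero.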

SAY isn't very important, so if it is only possible to have one, we would want LAY to be it.


Posted: Tue Jun 07, 2016 8:36 am

Joined: Fri Jun 03, 2016 3:42 am
Posts: 158
GARTHWILSON wrote:
Hugh Aguilar wrote:
there is no need to use DTC or ITC or token-threading or anything else for the purpose of reducing code size --- that was done in the 1970s when an 8KB EPROM was often all you had for non-volatile memory.

STC isn't such a memory hog as is often assumed.

In this post, Bruce Clark surprised us with how the faster-running STC Forth avoids the expected memory penalties. He gives 9 reasons, starting in the middle of his long post in the middle of the page.

But also, he showed a way to do a two-instruction NEXT in ITC Forth at viewtopic.php?t=584, and then showed a way to do a single-instruction, six-cycle NEXT in DTC Forth at viewtopic.php?t=586 . Impressive.

What primarily kills the speed of a threaded (DTC or ITC) Forth is iteration. An STC Forth is going to be fast even if everything is compiled as JSR and nothing is inlined except for BRANCH and 0BRANCH being inlined. Inlining LIT also helps considerably.

In my experience, there are a handful of functions that are called quite a lot. Calling these with JSR requires 3 bytes, which can really bloat out a program. If you have an upgraded 65c02 however, then you could have these 1-byte instructions:
RST #0 --- effectively the same as JSR $100
RST #1 --- effectively the same as JSR $108
RST #2 --- effectively the same as JSR $110
RST #3 --- effectively the same as JSR $118
RST #4 --- effectively the same as JSR $120
RST #5 --- effectively the same as JSR $128
RST #6 --- effectively the same as JSR $130
RST #7 --- effectively the same as JSR $138
These would be similar to the RST instructions in the Z80. You get 8 functions for which a call is 1 byte rather than 3 bytes. I put these 8 functions (8 bytes each) in the bottom of page-one. This means that the return-stack has 64 fewer bytes available to it before it overflows. This should not be a problem; in my experience, the return-stack never comes close to overflowing even when local variables are on the return-stack.
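
The target addresses in the list above follow a simple rule, sketched here in Python. The constant names are illustrative only.

```python
RST_BASE = 0x0100   # bottom of page one, as in the list above
RST_SLOT = 8        # bytes reserved for each of the 8 functions


def rst_target(n):
    """Address that the proposed 'RST #n' would call."""
    if not 0 <= n <= 7:
        raise ValueError("RST number must be 0..7")
    return RST_BASE + RST_SLOT * n


targets = [rst_target(n) for n in range(8)]
reserved = RST_SLOT * 8   # page-one bytes no longer available to the return-stack
```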

I think that it is a reasonable prediction that 90% of the JSRs in a Forth program go to 8 highly-popular functions, so reducing these from 3 bytes to 1 byte should significantly reduce the program size --- it is possible that an STC Forth for such an upgraded 65c02 would generate smaller programs than a threaded Forth --- and, of course, an STC Forth is going to be about an order of magnitude faster than a threaded Forth.
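
A back-of-the-envelope check of that size argument in Python. The 90% figure is the prediction made above, not a measurement, and the model counts only the bytes spent on calls; it ignores inlined code, headers and literals.

```python
JSR_BYTES, RST_BYTES, THREAD_BYTES = 3, 1, 2


def call_bytes(total_calls, hot_calls):
    """Bytes spent on calls for plain STC, STC with RST, and threaded code."""
    plain_stc = total_calls * JSR_BYTES
    rst_stc = hot_calls * RST_BYTES + (total_calls - hot_calls) * JSR_BYTES
    threaded = total_calls * THREAD_BYTES
    return plain_stc, rst_stc, threaded


sizes = call_bytes(1000, 900)   # 1000 calls, 900 of them to the 8 hot functions
```

Under these assumptions the average call costs 1.2 bytes with RST, against 2 bytes for a threaded cell and 3 bytes for a plain JSR.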

On the subject of putting local variables on the return-stack, I did that in my cross-compiler (this was a feature I added that wasn't in ISYS Forth). To make this work efficiently, we would need a TSY (transfer S to Y) instruction (we can't use TSX because X is the data-stack pointer).


Posted: Tue Jun 07, 2016 10:07 am

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
Hugh Aguilar wrote:
What primarily kills the speed of a threaded (DTC or ITC) Forth is iteration.
and
Quote:
and, of course, an STC Forth is going to be about an order of magnitude faster than a threaded Forth.

ITC Forth on the '02 spends almost exactly half its time in NEXT; so eliminating it cannot do more than double its speed. Also, ITC's JMP NEXT is only three clocks whereas STC's RTS is six. Straightlining literals will definitely speed them up, but there is substantial material in my 6502 stacks treatise on doing assembly language in a very Forth-like way (essentially STC Forth) and I was appalled at how >R and R> and a couple of others actually had to be a lot longer in STC Forth than in ITC Forth, unless they were straightlined rather than being made to be subroutines.
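
The "cannot do more than double" bound is ordinary arithmetic: if a fraction p of the run time is removed entirely and nothing else changes, the best possible speedup is 1/(1-p). A minimal Python sketch, using the 50% figure quoted above:

```python
def max_speedup(fraction_removed):
    """Upper bound on speedup when that fraction of run time is eliminated."""
    return 1.0 / (1.0 - fraction_removed)


bound_itc = max_speedup(0.5)       # NEXT taking half the time
needed_for_10x = max_speedup(0.9)  # an order of magnitude needs 90% removed
```

This bounds only the gain from eliminating NEXT itself; any further gain would have to come from inlining and other optimizations.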

Quote:
These would be similar to the RST instructions in the Z80. You get 8 functions for which a call is 1 byte rather than 3 bytes. I put these 8 functions (8 bytes each) in the bottom of page-one. This means that the return-stack has 64 fewer bytes available to it before it overflows. This should not be a problem; in my experience, the return-stack never comes close to overflowing even when local variables are on the return-stack.

So true. I measured my stack usage with the most stack-intensive Forth applications and interrupts I could think of, and I never used even 20% of either the return stack or data stack space. The exception would be if you split up the stack area into sections for multitasking and give each task its own portion. I discuss these things in the stacks treatise also.

Quote:
On the subject of putting local variables on the return-stack, I did that in my cross-compiler (this was a feature I added that wasn't in ISYS Forth). To make this work efficiently, we would need a TSY (transfer S to Y) instruction (we can't use TSX because X is the data-stack pointer).

PHX, TSX, TXA, TAY, PLX, INY. The INY could be eliminated by compensating in the operands for Y being off by 1. On the '816, you could just do TSC, TAY, leaving X untouched, and Y would be correct.
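
To check that six-instruction sequence, here is a small register-level simulation in Python. It models only S, X, Y, A and page one, which is all the sequence touches; the function name is made up.

```python
def emulate_tsy(s, x, a):
    """Simulate PHX, TSX, TXA, TAY, PLX, INY and return the final registers."""
    page_one = bytearray(256)
    page_one[s] = x          # PHX: write X at $0100+S ...
    s = (s - 1) & 0xFF       # ... then post-decrement S
    x = s                    # TSX
    a = x                    # TXA (the previous A is lost)
    y = a                    # TAY
    s = (s + 1) & 0xFF       # PLX: pre-increment S ...
    x = page_one[s]          # ... then read X back
    y = (y + 1) & 0xFF       # INY: undo the offset introduced by the PHX
    return {"A": a, "X": x, "Y": y, "S": s}


regs = emulate_tsy(0xF0, 0x42, 0x99)
```

Y ends up equal to S and X is preserved, as intended. The simulation also shows that A is overwritten along the way, so a caller holding part of TOS in A would have to save it first.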



Posted: Tue Jun 07, 2016 3:54 pm

Joined: Fri Jun 03, 2016 3:42 am
Posts: 158
GARTHWILSON wrote:
Quote:
On the subject of putting local variables on the return-stack, I did that in my cross-compiler (this was a feature I added that wasn't in ISYS Forth). To make this work efficiently, we would need a TSY (transfer S to Y) instruction (we can't use TSX because X is the data-stack pointer).

PHX, TSX, TXA, TAY, PLX, INY. The INY could be eliminated by compensating in the operands for Y being off by 1. On the '816, you could just do TSC, TAY, leaving X untouched, and Y would be correct.

I still think that the 65c816 is not a practical processor for the following reasons:
1.) Its primary feature is that it accesses a lot of memory (16MB). Nowadays though, if you want a system with MBs of memory (or even hundreds of KBs of memory), you are best off to use the ARM. The 65c816 is always going to be a poor substitute for an ARM --- being a "me too" is not the best way to take over the market --- the 65c816 will always be an obscure solution chosen more because it is fun than practical.
2.) The future of computers is multi-core systems. Multi-core x86 desktop-computers are already standard, and we are starting to get multi-core micro-controllers (the Parallax Propeller forges the path to the future). To make any existing processor efficient at this, it is going to need some new instructions. The 65c816 doesn't have any room for new instructions (other than the WDM instruction) however, so it is a dead-end.

I think the 65c02 is the best basis for a multi-core micro-controller for these reasons:
1.) The 65c02 has global variables in zp that can be used as pointers without needing to move them to registers as would be done in other processors. Most I/O ports are going to have a circular buffer associated with them. These zp-pointers can be used as the next-available-to-store and next-available-to-load pointers, which will speed up ISRs considerably.
2.) The 65c02 has very few registers, and they are all 8-bit. The save-and-restore of registers done in an ISR is pretty fast. In a lot of micro-controllers, servicing interrupts takes up over half the time, so anything that can be done to speed up ISRs will have a big effect on the overall speed.
3.) The 65c02 has a lot of opcodes undefined. It is possible to add new instructions (new opcodes) to the 65c02 to support Forth and/or to support micro-controller programs in general. There is room to grow. It is not necessary to get rid of any instructions when adding new instructions, so legacy 65c02 software will still work on the new FPGA.
4.) The 65c02 has very few registers, and they are all 8-bit. This means that it uses minimal FPGA resources. I would suspect that it is the smallest processor in the world that is useful (the 6805 is smaller, but is so restricted as to not really be useful). The advantage of having a small footprint in the FPGA is that a multi-core FPGA can have more cores. If we assume that the customized 65c02 (#3 above) has half the footprint of a 65c816, this means that we can have twice as many cores. It is better to have a dual-core super-65c02 system than a single-core 65c816 system --- more than twice as good! --- just an awesome micro-controller in every way!


Posted: Tue Jun 07, 2016 6:02 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
I'm skeptical; I need to see the math behind your speculations.

If I were interested in multi-core microcontrollers, the 6502 would emphatically not be the right choice. It is so bus-hungry that no other core would get a chance to use the bus. You can have at most 2 cores (each running on opposite clock phases). Multiprocessor studies show that beyond 8 cores on a common bus, performance degrades sharply; this figure would be much lower for the 6502 simply because of how few idle bus cycles exist. If you "switch cores" on an idle cycle via a round-robin bus arbiter, you end up with performance on par with hyperthreading, and that's a best-case scenario.

To really get high performance, your cores would need their own private memories, wholly independent of any other core. No NUMA, no SMP, not even AMP. Pure message passing is all you get, and now you're bottlenecked by your interconnect topology.

Finally, a single 65816 can add two 32-bit words faster than two cooperating 6502s can add a 16-bit half-word, simply because it doesn't have to deal with the overhead of cooperating.

You might not want to admit it, but the 65816 really is a better CPU than the 6502 all around. And if someone put a proper 16-bit bus interface on the core, it would be faster still.


Posted: Tue Jun 07, 2016 6:30 pm

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Quote:
I do think the 65c02 has a future, primarily as the processor used in a multi-core FPGA system.

The 65(c)02 uses quite a bit of resources to deal with the irregular instruction set, and the number of cycles per instruction is rather large too. For an FPGA, you'd be better off designing a special purpose CPU that specifically targets the FPGA architecture. If you want to run Forth, something like the J1 processor comes to mind.


Posted: Tue Jun 07, 2016 6:40 pm

Joined: Thu May 28, 2009 9:46 pm
Posts: 8505
Location: Midwestern USA
kc5tja wrote:
I'm skeptical...

As am I. A lot of assumptions exist.

Quote:
To really get high performance, your cores would need their own private memories, wholly independent of any other core. No NUMA, no SMP, not even AMP. Pure message passing is all you get, and now you're bottlenecked by your interconnect topology.

What you just described sums up the multiprocessor design characteristics of the 1980s.

Quote:
Finally, a single 65816 can add two 32-bit words faster than two cooperating 6502s can add a 16-bit half-word, simply because it doesn't have to deal with the overhead of cooperating.

The fact is the 65C816's performance in 32-bit integer math will soundly eclipse that of a 65C02 running at the same clock speed. I've concluded that on average the '816 is, at the minimum, 2-1/2 times faster at multibyte integer math than the 65C02, and that margin widens as the word size is increased.
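
One part of that advantage is easy to see by counting add-with-carry passes. The Python sketch below adds two numbers chunk by chunk, the way a chain of ADC instructions would, and reports how many passes were needed. It counts passes only; real cycle counts also depend on addressing modes and loads/stores, so it does not reproduce the 2-1/2 figure above.

```python
def add_multibyte(a, b, total_bits, chunk_bits):
    """Add a and b in chunk_bits-wide steps with carry; return (sum, steps)."""
    mask = (1 << chunk_bits) - 1
    carry = 0
    result = 0
    steps = 0
    for shift in range(0, total_bits, chunk_bits):
        s = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (s & mask) << shift   # keep this chunk of the sum
        carry = s >> chunk_bits         # carry into the next chunk
        steps += 1
    return result, steps


on_8_bit = add_multibyte(0x0001FFFF, 0x00000001, 32, 8)
on_16_bit = add_multibyte(0x0001FFFF, 0x00000001, 32, 16)
```

A 32-bit add takes four passes with 8-bit chunks and two with 16-bit chunks, and the gap in pass count grows with the word size.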

Quote:
You might not want to admit it, but the 65816 really is a better CPU than the 6502 all around. And if someone put a proper 16-bit bus interface on the core, it would be faster still.

If only WDC would release the 65C816 in a PLCC52 package with A16-A23 brought out on the extra pins... :?

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


Posted: Tue Jun 07, 2016 6:47 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
You know, I was thinking the other day: the nice thing about my Backbone bus project is that, if it works, it'd make an outstanding bus interface for a 65816 clone. Requires 60 non-power pins, which isn't much. A PLCC84 package would be more than sufficient for it plus adequate power pins.


Posted: Tue Jun 07, 2016 7:16 pm

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
I have, for decades, had an interest (not a strong one, but certainly an interest) in multiprocessing. It seems to me that this should be a set of loosely coupled computers (possibly all on the same IC but not necessarily), each having its own buses for memory and I/O and keeping most of its memory to itself; communication with the others should probably max out at multi-port memory, and otherwise they talk to each other through I/O.

I would be interested to see more discussion on how to split up the jobs in order to take maximum advantage of the available processing power, because I do see this as quite a hurdle. I know from reading articles on the subject that ten processors working together definitely do not give ten times the processing power of a single one, and that as the number increases, the return becomes less and less until you asymptotically reach a place where additional processors give no additional benefit. Of course some applications will lend themselves to multiprocessing better than others. I expect that circuit simulation is one where lots of processors will have more benefit, because so many things happen at once, in contrast to much of programming which is sequential, i.e., each step is built on the results of the previous step, meaning they cannot be done at the same time.
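
One common way to put numbers on that diminishing return is Amdahl's law. It is brought in here as an assumption, not something the articles mentioned above necessarily used: if a fraction of the work is strictly sequential, n processors give a speedup of 1 / (serial + (1 - serial) / n). A minimal Python sketch, with a 10% sequential fraction picked purely for illustration:

```python
def speedup(n, serial_fraction):
    """Amdahl's-law speedup for n processors (illustrative model only)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


ten_cores = speedup(10, 0.1)
thousand_cores = speedup(1000, 0.1)
limit = 1.0 / 0.1   # the asymptote as n grows without bound
```

With 10% sequential work, ten processors give a little over 5x, and no number of processors reaches 10x. The model ignores communication and bus contention, which would only make the real figures worse.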

The '816 does however give a ton of benefits even if you never go outside of bank 0, i.e., even if you limit yourself to a 64K memory map. So I would not say that its primary feature is the size of memory it can access.

Implementing ring buffers of less than 256 bytes incurs very little overhead to do something like:
Code:
        INX
        CPX  #8
        BNE  1$
        LDX  #0
 1$:    <continue>
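
The wrap-around in the snippet above can be modelled in Python as follows, next to a mask-based variant of the kind an AND-on-X instruction would allow. The mask variant only works when the buffer size is a power of two. Function names are illustrative.

```python
def next_index_compare(i, size):
    """Mirror of INX / CPX #size / BNE / LDX #0."""
    i += 1
    if i == size:
        i = 0
    return i


def next_index_mask(i, size):
    """Increment and mask; valid only when size is a power of two."""
    return (i + 1) & (size - 1)


def walk(step, size, count):
    """Collect `count` successive indices starting from 0."""
    seq, i = [], 0
    for _ in range(count):
        i = step(i, size)
        seq.append(i)
    return seq


by_compare = walk(next_index_compare, 8, 10)
by_mask = walk(next_index_mask, 8, 10)
```

For a size of 8 both produce the same sequence; for a size such as 6 only the compare-and-reset form is correct.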

An instruction like ANX# (AND the X register, immediate) would be a small benefit, but probably not worth it. (I would limit the buffer size to powers of two, but I don't see that as a problem.) Something like RS-232 does benefit from ring buffers, but I've never needed a ring buffer in any of the microcontroller software I've done for our aircraft communication products. A typical task might do something like:

Examine an I/O bit. Is it set?
If so,
   take the current time and subtract a target time.
   Is the result positive, meaning we've reached the time?
   If so, turn an analog switch on,
          set a flag, and
          increment the state variable. <END_IF>
Exit.

and there are lots of these running at once, watching I/O bits, watching A/D converters, setting or clearing output bits, scanning and debouncing keys (sometimes with auto-repeat on certain ones, also timing the repeat speed and delay before repeat), feeding LEDs with realtime status, generating alert beeps, etc.. The primary (and often the only) interrupt is for a timer that times out every so many tens or hundreds of microseconds, for generating signals, and as a time reference that all time-based tasks refer to.
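
A task of the kind outlined above could look like this in Python. It is a sketch under assumptions: the time reference is taken to be a free-running 16-bit counter that wraps, a difference of zero is treated as "reached", and the state dictionary and its keys are invented for the example.

```python
def time_reached(now, target, bits=16):
    """True once `now` has reached `target` on a wrapping counter."""
    diff = (now - target) & ((1 << bits) - 1)
    return diff < (1 << (bits - 1))    # sign bit clear: result is not negative


def task(state, io_bit_set, now):
    """One pass of the task: act only if the bit is set and the time is up."""
    if io_bit_set and time_reached(now, state["target"]):
        state["switch_on"] = True      # turn the analog switch on
        state["flag"] = True           # set a flag
        state["step"] += 1             # advance the state variable
    return state


state = {"target": 100, "switch_on": False, "flag": False, "step": 0}
task(state, True, 99)     # too early: nothing happens
task(state, False, 200)   # bit clear: nothing happens
early = dict(state)
task(state, True, 100)    # bit set and time reached
```

Doing the subtraction modulo the counter width and testing the sign bit is what lets the comparison keep working when the counter rolls over, provided the two times are less than half the counter range apart.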
