65C816 vs 68000

BigEd · Post by **BigEd** » Tue Dec 20, 2016 8:57 pm

Hmm, I don't think the idea is well-described as NUMA - for me, that's where different parts of the memory system are different distances from the core in question (usually because there are several cores and each is close to some part of the memory system.) Certainly it can make sense to have multiple memory busses, or multiple memory systems which can operate in parallel.

It seems we're agreed that the '816 takes about 5 times the source (C or HDL) to describe compared to the '02. That's not merely a lot more typing, but as you know, it's a lot more verification. And, AFAIK, we have no '816 test suite. We do have Bruce's document, but it's not quite executable.
http://www.6502.org/tutorials/65c816opcodes.html

KC9UDX · Post by **KC9UDX** » Tue Dec 20, 2016 9:15 pm

kc5tja wrote:

cbmeeks wrote:

I still have my Amiga 500 and it still boots today.

But what do you do with it? There are Amiga collectors, and Amigans.

I have three in use. I have an A3000 doing MIDI work, and an A2000 doing video work. You can see them in some short YouTube videos I made. I get asked all the time why I don't just use this Mac or that PC.

I use my wife's A2000/060 to manage our finances. I also use it to back up our Android devices because it has USB.

I've got a dead A500, several dead A2000s, and a PPC-based A3000UX all waiting for me to fix them. But I'm not a collector, I'm a user.

cbmeeks · Post by **cbmeeks** » Tue Dec 20, 2016 9:20 pm

kc5tja wrote:

Run Deluxe Paint on the IIgs and on the A500. Put the IIgs in 16-color, 640x200 mode, and likewise with the Amiga. Grab a brush that is about 1/4 the screen size on the IIgs and do the same on the Amiga. Now, on the IIgs, drag that brush around the screen, and note how, despite being clocked at 2.8MHz, it's quite capable of keeping up with the Amiga doing the same task at 7MHz.

I might have to do that.

I can get DP on my Amiga via a GoTek but not sure how I would get it on the IIgs. I will look into it because I've been wanting to mess with my IIgs lately anyway.

kc5tja wrote:

This is because the 65816 doesn't have to fight the blitter for access to memory,

While that's true, one thing I used to do on my Amiga often was run several applications and games at once and then pull down the menu bar so that I could see the tops of several apps running at what seemed like full speed. I could get 2-3 copies of DP going before I noticed any sort of slow down.

That's what I really meant when I said it felt snappier. I don't think you could do that with the IIgs. But I don't think that's the fault of the '816 at all. Just a different OS, different architecture, etc.

kc5tja wrote:

while the Amiga's custom chips, which gleefully allow 60fps HAM animations at 320x400, will cut into the CPU's processing power like a hot knife when driven at higher horizontal resolutions/bandwidths. Also, it doesn't help that the blitter, though clocked at 7MHz, can only touch memory no faster than 3.5 mega-transfers per second, and only during blanking periods at that. Ouch.

Well, I don't think you can judge the Amiga by the blitter...or the sprites...or the [insert chip name here]. For me, it was the whole package. Of course, most of the time, it comes down to the games. Just scrolling a full screen game on the IIgs took a master at '816 assembly language. I used to do it on the Amiga with compiled BASIC. Full screen, dual playfields with sprites bouncing all over and the CPU was pretty much taking a nap the entire time.

Keep in mind, I'm not insulting the IIgs. I love the IIgs.

kc5tja wrote:

The 65816 is not a slow CPU.

I agree with that statement.

kc5tja wrote:

At 2.8MHz, the IIgs was *faster* at many kinds of graphics updates than the Mac Classic, and Jobs wasn't too happy about that.

That is a sad part of our history. I *LOVE* the classic line of Macs, all the way up to the Quadras (I own several). But the IIgs got the raw end of the deal. It was purposely kept slower so it would not compete with the Mac. Imagine if it had Steve Jobs behind it instead of against it.

kc5tja wrote:

(Running a IIgs at 8MHz was magical; it felt every bit as fast as an Amiga to me.)

I can't comment on that, since I cannot afford an accelerator for the IIgs. I have a memory upgrade (4 MB, I think), but I can't justify the cost of speeding up the CPU. I wish I could.

kc5tja wrote:

GS/OS was, however, written in Pascal and largely based on MacOS System 1 code. I suspect that is where most of its sluggishness comes from.

I don't know about GS/OS, but from what I understand, the OS for Lisa was written in Pascal but the original MacOS was largely (if not completely) written in assembly language. I think there were a few "Desk Ornaments" (as they called them) that were written in Pascal just because they could. But I'm pretty sure the bulk of the OS is optimized assembly. At least, according to the stories at Folklore (and a Byte article I read).

kc5tja wrote:

I say this, BTW, as a die-hard Amiga fan. I still have my Amiga 500 and it still boots today.

Oh, no offense taken. I'm with you there. I collect vintage computers (I have nearly 70). I have several Amigas, several Apples including a IIgs, Macs, Ataris, TIs, etc. I love them all.

I consider the Amiga as one of my "core" machines because I worked the summer bagging groceries as a 16 year old kid so that I could buy one. I still have (and use) it to this day.

I'm also a huge fan of the IIgs. Like I said, it got the raw end of the deal. I was mostly impressed with the sound system. 32 voices!!! And I would argue that a "mildly" expanded IIgs would give the Amiga a run for its money when it comes to audio.

cbmeeks · Post by **cbmeeks** » Tue Dec 20, 2016 9:44 pm

KC9UDX wrote:

But what do you do with it? There are Amiga collectors, and Amigans.

I have three in use. I have an A3000 doing MIDI work, and an A2000 doing video work. You can see them in some short YouTube videos I made. I get asked all the time why I don't just use this Mac or that PC.

I use my wife's A2000/060 to manage our finances. I also use it to back up our Android devices because it has USB.

I've got a dead A500, several dead A2000s, and a PPC-based A3000UX all waiting for me to fix them. But I'm not a collector, I'm a user.

That's impressive.

But the line between collectors and users isn't as B/W as you paint here.

I do *use* my Amigas. What do I do with them? Programming games, game demos or playing games. It's all about Amiga games with me.

I don't do MIDI work. I don't do video work. I prefer my stock (well, 1MB expanded) Amiga 500 over my A1200. That's how it gets use from me.

For the non-game stuff that I do in my daily life, none of *my* Amigas would be up to the task.

But that doesn't mean I'm not an Amiga user.

I am both. A user and a collector.

kc5tja · Post by **kc5tja** » Tue Dec 20, 2016 9:45 pm

BigEd wrote:

Hmm, I don't think the idea is well-described as NUMA - that's where different parts of the memory system are different distances from the core in question

What is a core? Does it have one memory bus? Two? Four? Eight? Or, should we be more precise and consider masters instead? From the POV of memory, it has no knowledge of how many execution engines exist; only a certain number of "ports", to which a "master" can connect.

One core can easily have two masters: one for instruction fetch, and one for data access (in fact, that's the definition of a Harvard architecture system). Or, you can have two cores, with one general purpose port each. Or you can have one core with six master ports: one for instruction fetch, three for integer load/store, and two for floating point (e.g., as might be found on a superscalar architecture).

The definition of core is both hazy and misleading. It's better to think in terms of bus masters and bus slaves.

Quote:

It seems we're agreed that the '816 takes about 5 times the source (C or HDL) to describe compared to the '02.

PLA logic doesn't necessarily scale like that, and if you write your HDL in the manner of a PLA (which is exactly what SMG enables), you gain that benefit. My first 4 attempts to implement the KCP53000 CPU core failed spectacularly; all hand-written Verilog, using the usual case-statement methods, or ternary logic approaches, or what-have-you. I'm going to go on record and risk my reputation saying this here and now:

Don't do that.

Verilog is a disgustingly bad language for expressing state machine decoders. It is so bad, in fact, that when I switched my instruction decode logic from using nested case statements to the result of compiling my SMG code, besides a factor of 4 reduction in lines of code written, the number of logic cells consumed in the FPGA dropped by over a thousand.
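To make the PLA style concrete, here is a minimal sketch in Python. It is not SMG and not the KCP53000 decoder; the input layout, bit patterns and signal names are all invented for illustration. Each product term is a (mask, value) pair tested against the input word, and every term that matches ORs its control signals into the result, so one term can serve many opcodes.

```python
# Hypothetical PLA-style decoder. The AND-plane is a list of (mask, value)
# product terms; the OR-plane is the set of control signals each term asserts.
# Input layout, patterns and signal names are assumptions, not real 65xx data.
from typing import NamedTuple

class Term(NamedTuple):
    mask: int            # which input bits this product term examines
    value: int           # required values of those bits
    signals: frozenset   # control signals asserted when the term matches

def input_word(opcode: int, cycle: int) -> int:
    # Assumed layout: bits 7..0 = opcode, bits 10..8 = cycle counter.
    return (cycle << 8) | opcode

TERMS = [
    # Cycle 0 of every instruction: one term shared by all 256 opcodes.
    Term(0x700, 0x000, frozenset({"FETCH", "PC_INC"})),
    # Cycle 1 of any opcode matching xxx0_1001: read an immediate operand.
    Term(0x71F, 0x109, frozenset({"READ_IMM", "PC_INC"})),
    # Cycle 1 of opcode 0xA9 only: load the accumulator and finish.
    Term(0x7FF, 0x1A9, frozenset({"LOAD_A", "END"})),
]

def decode(opcode: int, cycle: int) -> frozenset:
    word = input_word(opcode, cycle)
    out = frozenset()
    for term in TERMS:
        if (word & term.mask) == term.value:
            out |= term.signals   # OR-plane: matching terms accumulate
    return out
```

Three terms cover every opcode's fetch cycle plus a whole column of immediate-mode opcodes, which is the kind of sharing a nested case statement tends to hide from the synthesizer.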

I'm willing to bet that if someone decapped a 65816 and studied the PLA, they'd find that most of the minterms are shared across all five modes of operation.

What I think is more significant is that you may end up losing precise timing closure with a real 65816, at least if you're targeting an FPGA. Because the 65816 circuitry is level sensitive and not edge-sensitive, it's difficult (or possibly even impossible) to precisely match its timing characteristics. To get everything to match, you have two choices:

1. You'd need to make everything, even clocked logic, out of asynchronous gates, and FPGAs definitely don't like to synthesize such designs, OR,
2. You need to clock your logic 2x as fast as a real 65816, and just bite the bullet and implement your logic so that things happen on alternating even and odd cycles. (This is the approach taken by the 80386 processor, by the way; a 33MHz 80386 is really clocked at 66MHz on the motherboard.)
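A rough Python model of option 2, assuming a design with just two latches: one that the original circuit would update during the first clock phase and one during the second. The class name, latch names and phase assignment are illustrative only and do not describe the real 65816 internals.

```python
# Sketch of emulating two-phase, level-sensitive logic with a single clock
# running at 2x. Even fast-clock ticks stand in for phase 1, odd ticks for
# phase 2. Latch names and their phase assignment are assumptions.
class TwoPhaseModel:
    def __init__(self) -> None:
        self.ticks = 0        # count of fast (2x) clock edges seen so far
        self.addr_latch = 0   # assumed to update in phase 1
        self.data_latch = 0   # assumed to update in phase 2

    def tick(self, next_addr: int, bus_data: int) -> str:
        """Advance one fast-clock edge; return the phase that was emulated."""
        if self.ticks % 2 == 0:
            self.addr_latch = next_addr
            phase = "phase1"
        else:
            self.data_latch = bus_data
            phase = "phase2"
        self.ticks += 1
        return phase
```

Two calls to `tick()` make up one cycle of the original clock, so everything stays edge-triggered from the FPGA's point of view.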

People here lament the lack of a 65816 core; I cannot speak for why others haven't made one, but there are several reasons why I did not make my own (and went the stack CPU route instead), and why I still wouldn't want to:

1. I didn't have a tool like SMG available at the time. Hand writing PLA logic in Verilog is a pain. This issue is now resolved, however.
2. Stack CPUs map naturally to the fully synchronous circuits that FPGAs prefer. My S16X4 is as fast as a 65816 in practice, but consumes only 300 lines of Verilog. Not even exaggerating.
3. I didn't want to walk over WDC's sole source of income. I would never be able to look at myself in the mirror and feel comfortable with myself if I did. If I were to release a 65816 clone, it would receive a significant overhaul (see below), to the point where folks might not want to support WDC anymore. In particular, my 65816 clone would:

a. Throw away current timing constraints. You don't need them in practice. A subroutine call on the latest Intel CPUs no longer takes 25 cycles to complete. More like 2. Most RET instructions consume 0 cycles today.
b. Throw away the multiplexed bus. Unnecessary on an FPGA.
c. Throw away the 8-bit data bus, and replace it with a real 16-bit bus, minimum. I might, in fact, actually go with a 32-bit bus as well, making JML and JSL instructions faster to decode for free.
d. Support single-cycle misaligned accesses via a Motorola 68040-like bus, where all 24 address bits are exposed, and all bus transfers are tagged with a size tag (8-bit, 16-bit, 24-bit, 32-bit transfer, etc). This leaves it up to external logic to split memory accesses into multiple cycles if it wants. If you have a data/instruction cache with long lines, this can really boost performance. It also lets me focus on what's relevant (proper separation of concerns). This is also the approach I take with my KCP53000 CPU, so I speak from experience here.
e. Split instruction and data memory fetches into separate bus masters. Again, it just makes implementing the CPU that much easier. It does push the complexity off into a subsequent stage of circuitry, but it's manageable.
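As a sketch of what the external logic in (d) might do, here is a small Python function that splits one size-tagged transfer into aligned cycles on a narrower bus. The function name, the byte-lane mask encoding and the default 16-bit bus width are assumptions for illustration, and 24-bit address wraparound is ignored.

```python
# Hypothetical bus bridge: break a size-tagged, possibly misaligned transfer
# into aligned bus cycles. Returns (aligned_address, byte_lane_mask) pairs,
# where bit n of the mask enables byte lane n. bus_bytes must be a power of 2.
def split_access(addr: int, size: int, bus_bytes: int = 2):
    cycles = []
    end = addr + size
    while addr < end:
        base = addr & ~(bus_bytes - 1)      # align down to the bus width
        stop = min(end, base + bus_bytes)   # last byte covered by this cycle
        mask = 0
        for lane in range(addr - base, stop - base):
            mask |= 1 << lane
        cycles.append((base, mask))
        addr = stop
    return cycles
```

With a 16-bit bus, `split_access(0x1001, 2)` needs two cycles, while the same transfer at `0x1000` needs one. With a 32-bit bus, a 24-bit transfer at `0x1001` fits in a single cycle. The CPU core never has to know which case it got.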

Extra credit:

f. Provide better support for MMUs, privilege modes, hardware coprocessors, multiple cores, and so forth. It's so easy to instantiate multiple cores in Verilog that it'd be irresponsible of me to not consider these things.
g. Macro-op fusion to allow things like STA, DEX, BNE sequences to execute in a fraction of the time it'd normally take.

Of course, all these performance enhancements would compound.
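For (g), the detection half of macro-op fusion can be sketched in a few lines of Python. This only shows pattern matching over an already-decoded instruction list; a real core would do it in the decode stage and would also have to check operand and flag dependencies. The tuple format and the `STA_DEX_BNE` name are made up for the example.

```python
# Hypothetical macro-op fusion pass over a decoded instruction stream.
# Each instruction is a (mnemonic, operand) tuple. A STA, DEX, BNE run is
# replaced by one fused op carrying all three operands.
FUSIBLE = ("STA", "DEX", "BNE")

def fuse(stream):
    out = []
    i = 0
    while i < len(stream):
        window = stream[i:i + 3]
        if tuple(op for op, _ in window) == FUSIBLE:
            out.append(("STA_DEX_BNE", tuple(arg for _, arg in window)))
            i += 3   # consume all three original instructions
        else:
            out.append(stream[i])
            i += 1
    return out
```

A typical fill loop body such as `STA $2000,X / DEX / BNE loop` collapses into a single entry, which the execution stage could then retire in fewer cycles than the three separate instructions.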

kc5tja · Post by **kc5tja** » Tue Dec 20, 2016 9:50 pm

KC9UDX wrote:

kc5tja wrote:

cbmeeks wrote:

I still have my Amiga 500 and it still boots today.

But what do you do with it? There are Amiga collectors, and Amigans.

Right now, nothing. It is sentimental to me, as my family spent their entire year's savings to get it for me when I was younger. To me, it is a family heirloom.

Quote:

I have three in use. I have an A3000 doing MIDI work, and an A2000 doing video work. You can see them in some short YouTube videos I made. I get asked all the time why I don't just use this Mac or that PC.

I'm not going to get into a wong-waving contest. Use what works for you. The 7MHz Amiga 500 is just barely powerful enough to keep up with even basic Internet terminal usage, much less anything more demanding than that. I'm not a musician. It doesn't have USB. It doesn't have networking. And I've never been able to afford expansions for it until well after I'd already established my PC compatible as a preferred platform for what I do day to day.

Quote:

But I'm not a collector, I'm a user.

Good for you. Now back to the discussion at hand.

White Flame · Post by **White Flame** » Tue Dec 20, 2016 11:37 pm

kc5tja wrote:

This is called "Non-Uniform Memory Access", or "NUMA", in today's literature. Yes, it's doable, and many CPUs already have this facility. I think the latest Intel CPUs have something on the order of *6* channels to memory capable of operating independently. Wait states on any particular bus are introduced only when one channel needs to access another channel's memory.

NUMA is more like hardware memory partitioning, where different address ranges of memory can be accessed at different latencies/speeds, across multi-hop links and such for expandability. I guess with protection bits enforcing code pages vs data pages x86 sort of has a logical notion of Harvard arch, but my understanding of Intel's unganged memory channel setup is that it generally interleaves the addresses at a 64-byte granularity such that there aren't explicitly parallel paths between data & code use.
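To illustrate what I mean by interleaving, here is a simplified Python model. It assumes plain modulo interleaving on 64-byte lines across two channels; real memory controllers may hash more address bits, so treat the function and its defaults as assumptions.

```python
# Simplified model of unganged channel interleaving: consecutive 64-byte
# lines alternate between channels regardless of whether they hold code
# or data. Plain modulo mapping is an assumption made for illustration.
LINE_BYTES = 64

def channel_for(addr: int, channels: int = 2) -> int:
    return (addr // LINE_BYTES) % channels
```

Addresses `0x00` through `0x3F` land on channel 0, `0x40` through `0x7F` on channel 1, and so on. A code page and a data page each end up striped across both channels, so neither channel acts as a dedicated instruction or data path.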

Quote:

My homebrew RISC-V CPU works the same way; instruction and data fetch are on separate memory channels, but with the introduction of an external memory arbiter, may access a common memory pool.

So this is full Harvard, where the different memory channels hit different physical RAM chips, and you have to jump through extra hoops to load data into the code memory?

kc5tja · Post by **kc5tja** » Wed Dec 21, 2016 1:12 am

White Flame wrote:

kc5tja wrote:

Quote:

My homebrew RISC-V CPU works the same way; instruction and data fetch are on separate memory channels, but with the introduction of an external memory arbiter, may access a common memory pool.

So this is full Harvard, where the different memory channels hit different physical RAM chips, and you have to jump through extra hoops to load data into the code memory?

It can be either Von Neumann or Harvard, your choice. Without the arbiter, it's Harvard. The arbiter is used to multiplex a common memory bus across both I and D ports on the CPU. Should both I and D ports require access concurrently, priority goes to the D port in order to allow any pipeline to flush before a new instruction is fetched. (I don't yet implement a pipeline in my CPU, but one will be coming eventually.)
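The priority rule is simple enough to sketch in Python. This is only a behavioural model of the policy described above, not the Verilog module from my repository; the function names and the per-cycle request format are invented for the example.

```python
# Behavioural sketch of a fixed-priority arbiter sharing one memory bus
# between an instruction (I) port and a data (D) port. D wins any conflict.
from typing import Optional

def grant(i_req: bool, d_req: bool) -> Optional[str]:
    if d_req:
        return "D"   # data port always has priority
    if i_req:
        return "I"
    return None      # bus idle this cycle

def simulate(cycles):
    """cycles: iterable of (i_req, d_req) pairs, one per bus cycle.
    Returns a list of (granted_port, i_port_stalled) per cycle."""
    trace = []
    for i_req, d_req in cycles:
        g = grant(i_req, d_req)
        trace.append((g, i_req and g != "I"))
    return trace
```

Feeding it `[(True, False), (True, True), (True, False)]` shows the instruction fetch being stalled for exactly one cycle while the data access goes first, then resuming.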

Also, not sure what "extra hoops" refers to; virtually all CPUs with a pipeline have separate I- and D-channels for memory; it's just that the arbiter logic (the "Bus Interface Unit" as Intel would call it) is usually internal to the core. With my design, I provide the arbiter in the same repository, but it's a separate Verilog module, and can be re-used to support SMP as well.

White Flame · Post by **White Flame** » Wed Dec 21, 2016 4:07 am

At that point, I think it gets specific as to what you mean by "memory" and "bus". Obviously within the CPU an L1 cache tends to have 2 memories and 2 buses, but I'm mostly talking about the largest, general-tier address space memory & bus. For a SoC, that'd still be close to the CPU, but still encompass the complete memory footprint (128KB+ in my example regarding a 65816); otherwise it's external RAM chips and the bus(es) they sit on. Modern x86 has multiple channels on the motherboard, but those aren't I- and D- paths. Of course, if you were to use fully dual-port main memory, then it becomes kind of moot; it'd have the parallelism of Harvard but not the limitations of partitioning.

kakemoms · Post by **kakemoms** » Wed Dec 21, 2016 8:55 am

kc5tja wrote:

People here lament the lack of a 65816 core; I cannot speak for why others haven't made one, but there are several reasons why I did not make my own (and went the stack CPU route instead), and why I still wouldn't want to:

1. I didn't have a tool like SMG available at the time. Hand writing PLA logic in Verilog is a pain. This issue is now resolved, however.
2. Stack CPUs map naturally to the fully synchronous circuits that FPGAs prefer. My S16X4 is as fast as a 65816 in practice, but consumes only 300 lines of Verilog. Not even exaggerating.
3. I didn't want to walk over WDC's sole source of income. I would never be able to look at myself in the mirror and feel comfortable with myself if I did. If I were to release a 65816 clone, it would receive a significant overhaul (see below), to the point where folks might not want to support WDC anymore. In particular, my 65816 clone would:

a. Throw away current timing constraints. You don't need them in practice. A subroutine call on the latest Intel CPUs no longer takes 25 cycles to complete. More like 2. Most RET instructions consume 0 cycles today.
b. Throw away the multiplexed bus. Unnecessary on an FPGA.
c. Throw away the 8-bit data bus, and replace it with a real 16-bit bus, minimum. I might, in fact, actually go with a 32-bit bus as well, making JML and JSL instructions faster to decode for free.
d. Support single-cycle misaligned accesses via a Motorola 68040-like bus, where all 24 address bits are exposed, and all bus transfers are tagged with a size tag (8-bit, 16-bit, 24-bit, 32-bit transfer, etc). This leaves it up to external logic to split memory accesses into multiple cycles if it wants. If you have a data/instruction cache with long lines, this can really boost performance. It also lets me focus on what's relevant (proper separation of concerns). This is also the approach I take with my KCP53000 CPU, so I speak from experience here.
e. Split instruction and data memory fetches into separate bus masters. Again, it just makes implementing the CPU that much easier. It does push the complexity off into a subsequent stage of circuitry, but it's manageable.

Extra credit:

f. Provide better support for MMUs, privilege modes, hardware coprocessors, multiple cores, and so forth. It's so easy to instantiate multiple cores in Verilog that it'd be irresponsible of me to not consider these things.
g. Macro-op fusion to allow things like STA, DEX, BNE sequences to execute in a fraction of the time it'd normally take.

Of course, all these performance enhancements would compound.

These thoughts are really interesting and along some of the lines I had myself. Now, I am not that far into CPU design, as my main competence is in silicon, transistors and basic logic. I have tested the different '02 cores available on a Lattice CPLD, and it is the minimization that attracts me. I really like Arlet's core, which is half the size of WDC's core (at least on the Lattice CPLDs), even if the latter runs faster.

I have been playing with the idea of putting a 64-bit bus on both sides of two cores as a way to speed things up without bloating the code too much. I still remember when IBM minisystems went full 68000 and all our programs got 3-5 times larger (without doing anything more), so keeping the instruction set minimalistic is also important. Expanding address space with unused opcodes is the way to go. Not to keep compatibility, but to keep 16-bit addressing for most of the application, compressing its size.

As for the Amiga discussion, I have an A1000, A4000 and A1200. I want to use them, but mostly I end up using my Vic-20(!). It's unfortunate that the AmigaOS ended up being closed source (basically ending its development 23 years ago). But maybe that is what a 65x16 needs... a translated version of the old AmigaOS with its way of doing things.

cbmeeks · Post by **cbmeeks** » Wed Dec 21, 2016 1:36 pm

kakemoms wrote:

...As for the Amiga discussion, I have an A1000, A4000 and A1200.

I'll give you $25 and my left kidney for that A4000. ;-D

I have several Amigas myself...I love the machines.

I have A500 x 4, A1000, A2000, A1200 and A600. 8 Amigas total.

kakemoms wrote:

I end up using my Vic-20(!)

Now there's an interesting machine. Any computer endorsed by Kirk is good enough for me.

Here's an idea...how about a Vic 20 that was built with modern parts? A 65C816, 512K, a couple of VIAs and a CPLD for video/audio? That would be a fun project!

kakemoms wrote:

But maybe that is what a 65x16 needs.. a translated version of the old AmigaOS with its way of doing things.

This is actually a dream of mine. I've wanted to build a 65C816 based computer that was something between an Amiga 500, Atari ST and Macintosh Classic. Something that would drive basic VGA, nice audio, etc. But something that was GUI based from the start.

I read the stories over at Folklore almost daily for inspiration. That must have been crazy fun times.

White Flame · Post by **White Flame** » Thu Dec 22, 2016 7:17 am

cbmeeks wrote:

Here's an idea...how about a Vic 20 that was built with modern parts? A 65C816, 512K, a couple of VIAs and a CPLD for video/audio? That would be a fun project!

I have a SuperCPU hanging off my Commodore 128, so that's almost the same thing. And at 20MHz, it might be the fastest 16-bit processor for home computing that was sold. (Counting the 680x0 family as 32-bit processors, that is. But still would be competitive or faster than a 20MHz 68000.)

There's also another Commodore 64 that was built "modern": The 64DTV, as it's called, was a C64 reimplementation in an epoxy blob sold in some direct-to-tv joysticks with C64 games bundled. It was used in a few other specific projects as well. It had expanded video modes, more memory, a selectably faster CPU, and pads on the board to hook up a PS/2 keyboard and IEC bus.
