
All times are UTC




Post new topic Reply to topic  [ 59 posts ]  Go to page Previous  1, 2, 3, 4  Next
Author Message
PostPosted: Fri Jul 15, 2022 2:35 am 

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
65LUN02 wrote:
Do you know why the 65(C)02 has those extra states? Were the designers being conservative in what could be done in a state, given they had no automated tools to measure timings? Or did they find some common logic that caused a bit of reuse, traded off for an extra cycle?
I don't know whether the factors you mention played a role, but there are other dynamics at work.

For starters, there may have been a legal issue. According to Garth's post here, Commodore had a patent on a process they used in the 65CE02 that allowed them to eliminate virtually all the dead bus cycles. When creating the 65C02, WDC may not have wanted to risk infringing on Commodore's patent.

Also, Apple was an important prospective customer for the 65C02, and certain Apple legacy hardware (the disk controller, IIRC) relied on the original (NMOS) 6502 timing for certain instructions. Apple didn't want those timings changed. This probably explains (among other things) why the 'C02 applied only a partial fix to the wasteful dead cycle in the Absolute Indexed versions of the RMW instructions. ROL, ROR, ASL and LSR using Absolute Indexed mode got fixed. But INC and DEC using Absolute Indexed mode retain the wasteful dead cycle that the 6502 has. Details here.
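
As a rough illustration of that partial fix, the cycle counts can be tabulated as below. This is only a sketch based on the commonly documented timings (7 cycles for every Absolute Indexed RMW on the NMOS part; 6 cycles, plus one on a page crossing, for the shifts and rotates on the 'C02). The function name is made up for illustration.

```python
# Cycle counts for Absolute,X read-modify-write instructions, as commonly
# documented for the NMOS 6502 and the 65C02. Illustrative sketch only.
SHIFTS = {"ASL", "LSR", "ROL", "ROR"}
INC_DEC = {"INC", "DEC"}

def rmw_abs_x_cycles(mnemonic, cmos, page_crossed=False):
    """Return the cycle count for `mnemonic abs,X` on an NMOS or CMOS part."""
    if mnemonic not in SHIFTS | INC_DEC:
        raise ValueError(f"not an Absolute,X RMW instruction: {mnemonic}")
    if cmos and mnemonic in SHIFTS:
        # 'C02 fix: the dead cycle is skipped unless indexing crosses a page.
        return 7 if page_crossed else 6
    # NMOS 6502 (all six instructions) and INC/DEC on the 'C02: always 7.
    return 7

for op in ("ASL", "INC"):
    print(op, "NMOS:", rmw_abs_x_cycles(op, cmos=False),
          "CMOS:", rmw_abs_x_cycles(op, cmos=True))
```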

ETA: I'm contrasting the 6502 and the 65C02. But perhaps your question embraces them both.

-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


PostPosted: Fri Jul 15, 2022 3:15 am 

Joined: Wed Jun 29, 2022 2:15 am
Posts: 44
Does anyone know why Apple didn't just put a clock into the ][e? Or the //c? In my hindsight-empowered view of the Apple ][, some of the decisions Woz made seem short-sighted, and he had the ][+ and ][e revisions to make fixes. E.g. add a clock so that the Disk ][ and other timing loops could be based on a clock, which then would later allow higher CPU speeds without worrying about messing up those loops.

Along the same lines, why didn't Woz, by the ][e, add an I/O address for horizontal and vertical refresh timing, if not an optional IRQ for those timings? He wanted a computer with good graphics. He knew from his Atari days the importance of flicker-free graphics buffers. But he didn't provide a way to programmatically trigger code at those moments.

It's not like Apple didn't know CPUs could go faster. By 1979 they were already looking at the 10MHz 68000 for the Lisa project, if not already using that chip. Z80s came in speeds above 1MHz. Perhaps it's too much hindsight, but someone should have asked why the 6502 wouldn't be 2MHz by 1980, 4MHz sometime in the 80s, if not matching that 10MHz speed by the time the Lisa shipped.

---

Switching topics... I was reading about the PDP/11 last night. It used a word size of 18-bits, with a matching 18-bit address bus, and thus could address 256K. Like the 6502 it used memory-mapped I/O and as such core memory had to be smaller than that. With core memory that wasn't a problem, as core was wickedly expensive, but the PDP/11 was still selling into the 1990s and along the way switched to DRAM, and along the way needed more than 256K of address space. The solution in that corner of computing was again bank switching (plus virtual memory so that each process thought it had its own 256K). That makes at least three different CPUs where the fix was bank switching instead of wider addresses.

More evidence that either hindsight is too strong to apply here, or that foresight at the time was not following Moore's lead at looking for exponential trends and planning accordingly.


PostPosted: Fri Jul 15, 2022 6:21 am 

Joined: Thu May 28, 2009 9:46 pm
Posts: 8505
Location: Midwestern USA
Dr Jefyll wrote:
For starters, there may have been a legal issue. According to Garth's post here, Commodore had a patent on a process they used in the 65CE02 that allowed them to eliminate virtually all the dead bus cycles. When creating the 65C02, WDC may not have wanted to risk infringing on Commodore's patent.

The improvement CSG made had to do with the instruction pipeline.

As far as the fear of infringement goes, the 65C02 was sampling in 1983 and in production by the end of that year, well before the 65CE02's debut in 1988. The 65CE02 was a derivative of the 65C02, the latter of which would have been considered prior art under U.S. patent law. Therefore, WDC would not have been infringing had they gotten rid of the dead cycles with a different pipeline. In fact, Commodore would have had to license the design from WDC to avoid running afoul of copyright law.

More likely, since Bill Mensch was working on a shoestring budget, he couldn't afford the test gear needed to dig deeply into the C02's instruction timing. Also, prior to the release of the C02 design and its adoption by Apple and Rockwell, WDC didn't have a significant revenue stream. So there was a time exigency involved as well.

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


PostPosted: Fri Jul 15, 2022 7:54 am 

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
65LUN02 wrote:
Along the same lines, why didn't Woz, by the ][e, add an I/O address for horizontal and vertical refresh timing, if not an optional IRQ for those timings? He wanted a computer with good graphics. He knew from his Atari days the importance of flicker-free graphics buffers. But he didn't provide a way to programmatically trigger code at those moments.
The //e and //c offered the vertical scan register in the I/O space, but I confess that I don't remember its address. He was originally aiming for a frugal but expandable design, and I think he made the best of the situation. Interrupts weren't a part of the original unexpanded ][ and ][+, and that was probably just a personal preference.
Quote:
Switching topics... I was reading about the PDP/11 last night. It used a word size of 18-bits, with a matching 18-bit address bus, and thus could address 256K. Like the 6502 it used memory-mapped I/O and as such core memory had to be smaller than that. With core memory that wasn't a problem, as core was wickedly expensive, but the PDP/11 was still selling into the 1990s and along the way switched to DRAM, and along the way needed more than 256K of address space. The solution in that corner of computing was again bank switching (plus virtual memory so that each process thought it had its own 256K). That makes at least three different CPUs where the fix was bank switching instead of wider addresses.

More evidence that either hindsight is too strong to apply here, or that foresight at the time was not following Moore's lead at looking for exponential trends and planning accordingly.
The pdp-11 started out in about 1970 as a 16-bit system. Words and addresses were 16-bits. Years later the addresses were widened out of necessity, but the word size remained 16-bits throughout its entire lifespan, AFAIK.

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


PostPosted: Sat Jul 16, 2022 6:27 pm 

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
65LUN02 wrote:
Have you dug into the Visual 6502 to see how closely your control bits match up with the actual chip? I suspect your logic is very close to theirs. Same with the corresponding states, although I've never seen a definitive guide of 6502 states to compare.

You mention in your README that you removed the extra cycles. Do you know why the 65(C)02 has those extra states? Were the designers being conservative in what could be done in a state, given they had no automated tools to measure timings? Or did they find some common logic that caused a bit of reuse, traded off for an extra cycle?

I haven't looked in detail at Visual 6502. I only discovered it after I made my first 6502 core. For my design, I did look closely at the block diagram, and the original 6502 hardware/software manuals that have cycle sequences mentioned. However, it's not an exact match because I designed mine with synchronous memory, which necessitates some cycle reordering in some cases.

The extra states are all due to the fact that all calculations go through the ALU. I did the same thing in my original 6502 core, and it naturally got the same cycle count for each instruction. That includes using the ALU for address index calculations as well as stack pointer inc/dec. To remove the extra states, I had to add extra adders in several places, including some 16 bit ones. However, in FPGA resources, an adder is just as expensive as a simple OR, or a 2:1 MUX, and in order to feed all the operations through the ALU, you need more muxes, and a bigger state machine, plus it involves routing the signals to and from the ALU. Therefore, it is often more efficient to add an extra adder in a place where you need it.

Using NMOS transistors, however, using a single ALU to do all the calculations made a lot of sense. Muxes are much cheaper than adders, and shared buses take the signals everywhere for little cost.

Quote:
For the 65C24T8, the biggest time suck was trying to get the thread switch to happen in two cycles instead of three. I'd bet $1 that you could squeeze out that extra state by twiddling just the right control lines. My brute force was to give up, add a bit of logic to decrement the PC in the new THRD state and add the _T_ noop in the new TYNC state so that the right opcode was run when returning to a thread and so the opcode from the new thread wasn't run twice.

I haven't studied your changes, but it sounds plausible to do it in a single cycle.


PostPosted: Sat Jul 16, 2022 7:37 pm 

Joined: Fri Dec 12, 2008 10:40 pm
Posts: 1007
Location: Canada
barrym95838 wrote:
The pdp-11 started out in about 1970 as a 16-bit system. Words and addresses were 16-bits. Years later the addresses were widened out of necessity, but the word size remained 16-bits throughout its entire lifespan, AFAIK.
This is correct. The 18-bit machines were the PDP-1, PDP-4, PDP-7, PDP-9 and PDP-15.

DEC had systems ranging from 12 to 36 bits. The most popular computers they produced were the 16-bit PDP-11 and the later 32-bit VAX machines. The first PDP-11/20 could only address 64K of memory. Later machines incorporated memory management that allowed system memory of 256K and later 4MB; however, the 64K limit per user/session was still there.

_________________
Bill


PostPosted: Mon Jul 18, 2022 11:57 pm 

Joined: Mon Feb 15, 2021 2:11 am
Posts: 100
I think I read somewhere that there were later enhancements to allow separate 64K address space for data, 64K address space for program, per user, but I may be mistaken. Certainly the limit had become a major hindrance within a few years of the PDP-11's introduction, hence the VAX project. VAX started out as Virtual Address eXtensions because the initial effort was to expand the address space of the PDP-11 architecture, before it evolved into a 32-bit architecture with PDP-11 compatibility.

I never worked with these systems myself, but early in my career I kept running into remnants of them at work. My employers had transitioned users of Intergraph's PDS (Plant Design System) from Clipper workstations and a VAX server to NT4 workstations and server about two years before I started. In the IT storage closet there was an old hard drive and backup tapes from the VAX system, along with a couple Clipper workstations, until around the 2008 timeframe. A few old-time users still had VAX, VI, and UNIX cheatsheets on their cube walls, and a couple had manuals of that vintage. The old stuff was kept around just in case, but the few times old project data had to be restored they were able to restore from tape to modern servers. I think once an admin did boot up one of the Clippers to recover something from the hard drive, but it was just a few of his old scripts. The VAX drive was never used again, and I think that was the only time in ten years the Clippers ever saw use.

BillO wrote:
barrym95838 wrote:
The pdp-11 started out in about 1970 as a 16-bit system. Words and addresses were 16-bits. Years later the addresses were widened out of necessity, but the word size remained 16-bits throughout its entire lifespan, AFAIK.
This is correct. The 18-bit machines were the PDP-1, PDP-4, PDP-7, PDP-9 and PDP-15.

DEC had systems ranging from 12 to 36 bits. The most popular computers they produced were the 16-bit PDP-11 and the later 32-bit VAX machines. The first PDP-11/20 could only address 64K of memory. Later machines incorporated memory management that allowed system memory of 256K and later 4MB; however, the 64K limit per user/session was still there.


PostPosted: Sun Jul 24, 2022 4:30 am 

Joined: Wed Jun 29, 2022 2:15 am
Posts: 44
I tried calling this project done... but I had yet to finish the vision of expanding the registers along with the address bus. Diving into that over the last few days resulted in https://github.com/lunarmobiscuit/verilog-65C2424-fsm, the mythical 65C2424.

This CPU builds upon the 65C2402. While that design grew the address bus from 16-bits to 24-bits, this design grows the registers from 8-bit to both 16-bit and 24-bit. All of this with 99.9% backward compatibility, no new flags, no modes, with a path to expand the design to a full 32-bit address+data design.

The key change is the use of the $nF opcodes as prefix codes. $0F is the new CPU opcode, which just loads the A register with the capabilities of the chip. The bottom-most nibble is $0 in this version, as I used that to describe the number of threads in the 65C24T8, and this version has no threads. The next nibble is split again into two bit pairs. The bottom pair is 00 for 16-bit addresses, 01 for 24-bit addresses. The top pair is 00 for 8-bit registers, 01 for 16-bit registers, and 10 for 24-bit registers. The 65C2424 has capabilities of $90, i.e. 24-bit registers and 24-bit addresses.

That all sounds complicated, but it matches the prefix codes. $1F is 8-bit registers, 24-bit address. $4F is 16-bit registers, 16-bit address. $8F is 24-bit registers, 16-bit address. For every opcode you pick the widths. E.g. $4F $A9 $34 $12 = LDA #$1234. E.g. $8F $A9 $56 $34 $12 = LDA #$123456. $9F $AD $00 $00 $F0 = LDA $F00000 (loading 24 bits from memory locations $F00000/1/2).
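
To make the bit-pair layout concrete, here is a minimal Python sketch of how the high nibble could be decoded. It follows only the encoding described in this post; the function and table names are hypothetical and are not taken from the Verilog source.

```python
# Width codes as described in the post: each bit pair of the high nibble
# selects a width. Names here are illustrative, not from the Verilog.
ADDR_BITS = {0b00: 16, 0b01: 24}
REG_BITS = {0b00: 8, 0b01: 16, 0b10: 24}

def decode_widths(byte):
    """Return (register_bits, address_bits) from the high nibble of a byte.

    Works both for the capability byte loaded by $0F and for the $nF
    prefix bytes. Undefined bit-pair codes raise KeyError.
    """
    high = (byte >> 4) & 0xF
    return REG_BITS[high >> 2], ADDR_BITS[high & 0b11]

for value in (0x90, 0x1F, 0x4F, 0x8F, 0x9F):
    regs, addr = decode_widths(value)
    print(f"${value:02X}: {regs}-bit registers, {addr}-bit addresses")
```

Because the capability byte and the prefixes share one layout, a single decoder covers both.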

You do the same for opcodes like INC, DEX, TAY, with the prefix (or lack thereof) telling the CPU whether to treat the register as 8 bits, 16 bits, or 24 bits. LDA #$1234 followed by R16 DEA leaves $1233 in the A register, whereas LDA #$1234 followed by a plain DEA leaves $33 in the A register, with the top two bytes not just ignored, but zeroed.
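
The zeroing behaviour can be modelled with a simple mask. This is a sketch of the semantics as described here, not a transcription of the Verilog; the helper name is hypothetical, and the wrap-around at zero is an assumption.

```python
def dea(a, reg_bits=8):
    """Decrement A at the selected width.

    Bytes above the selected width are zeroed, not preserved. Wrapping
    from 0 to all-ones at that width is assumed, not confirmed.
    """
    mask = (1 << reg_bits) - 1
    return (a - 1) & mask

print(hex(dea(0x1234, 16)))  # R16 DEA
print(hex(dea(0x1234)))      # plain 8-bit DEA
```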

Note that I keep saying registers, not data, as this CPU keeps the 8-bit data bus. It thus fits into a 48-pin DIP. This project was originally a thought experiment about what might have been in 1978/79, and back then a 40-pin DIP was considered as big as a DIP should be. Just as today's ARMs have 64-bit addresses but a 40-odd-bit address bus, a 1979 NMOS 652424 could be squeezed into that 40-pin DIP if the address bus is extended to just 17, 18, or 19 bits using the 3 NC pins.

And speaking of 1978/79, one question I asked in the first post on this thread is why Apple didn't ask for a larger address space way back then. I thought I found that answer in an interview with the Motorola 68000 team, who said Jobs negotiated a $15/chip price for their chip. Well... a few days ago I happened across another interview, this one with Bob Schreiner of Synertek, in a footnote on the https://en.wikipedia.org/wiki/Synertek Wikipedia page. Turns out Jobs did ask. The trouble is that he asked Synertek, who had a license to manufacture 6502s, and thus copies of the masks, but neither copies of the designs nor any of the team members from MOS. Bob said he told Steve that Synertek couldn't afford to do the design, but if Apple wanted to pay, he'd be happy to do it. Steve passed on that offer. If only...

In terms of complexity, this variation is 1176 lines of code (including commented debug code) vs. 912 in Arlet's original 65C02. A better measure of transistors is the netlist. This version is 1953 lines vs. 1591 in Arlet's original, so 362/1591 = 23% larger. The 24-bit address upgrade accounted for about 11 points of that growth, so the register scale-up cost basically the same again. Both are small enough that MOS could have very likely made these changes back in the late 1970s if they had the resources, or if Jobs had offered to pay them.

A simulated test run showing 8-bit, 16-bit, and 24-bit loads and stores on the zero page. a: is the width of the address bus. r: is the width of the registers. rb: is the number of bytes to load. DR: remembers the last three bytes from DI (data in). The ALU and the registers always show their full widths in this debug code, as it's too much trouble to not do that.
Code:
$0000: 01 12 23 34 45 56 67 78 89 9a ab bc cd de ef f0 // zero page
$8100: 18 A5 01 A6 02 B4 03 ea 4F A5 01 8F B4 03 ea ea // CLC; LDA $01; LDX $02; LDY $03,X; LDA $01/2; LDY $03,X/3; NOP
$8110: 4F B2 00 8F B2 00 85 00 4F 85 01 8F 85 03 ea ea // LDA ($00)/2; LDA ($00)/3; STA $00; STA $01/2; STA $03/3; NOP
$8120: db ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff // STP

   0 R           PC:xxxxxx AB:xxxxxx a:04 r:00 rb:x DI:xx DR:xx:xx:xx DO:xx IR:ea WE:x ALU:xxxxxx S:ffff A:000000 X:000001 Y:000002 P:------
   1 -      BRK4 PC:fffffa AB:fffffa a:24 r:08 rb:0 DI:xx DR:xx:xx:xx DO:xx IR:xx WE:0 ALU:xxxxxx S:ffff A:000000 X:000001 Y:000002 P:------
   2 -      BRK3 PC:fffffb AB:fffffb a:24 r:08 rb:0 DI:00 DR:xx:xx:xx DO:xx IR:00 WE:0 ALU:xxxxxx S:ffff A:000000 X:000001 Y:000002 P:---I--
   3 -      JMP0 PC:fffffc AB:fffffc a:24 r:08 rb:0 DI:81 DR:00:00:xx DO:xx IR:81 WE:0 ALU:xxxxxx S:ffff A:000000 X:000001 Y:000002 P:---I--
   4 -      JMP1 PC:fffffd AB:008100 a:24 r:08 rb:0 DI:00 DR:81:00:xx DO:xx IR:00 WE:0 ALU:xxxxxx S:ffff A:000000 X:000001 Y:000002 P:---I--
   5 -      SYNC PC:008101 AB:008101 a:24 r:08 rb:0 DI:18 DR:81:00:xx DO:xx IR:18 WE:0 ALU:xxxxxx S:ffff A:000000 X:000001 Y:000002 P:---I--
   6 - CLC  SYNC PC:008102 AB:008102 a:24 r:08 rb:0 DI:a5 DR:81:00:xx DO:xx IR:a5 WE:0 ALU:xxxxxx S:ffff A:000000 X:000001 Y:000002 P:---I--
   7 - LDA  ZPG0 PC:008103 AB:000001 a:16 r:08 rb:0 DI:01 DR:81:00:xx DO:81 IR:01 WE:0 ALU:000081 S:ffff A:000000 X:000001 Y:000002 P:---I--
   8 - LDA  DATA PC:008103 AB:008103 a:16 r:08 rb:0 DI:12 DR:81:00:xx DO:81 IR:12 WE:0 ALU:000081 S:ffff A:000000 X:000001 Y:000002 P:---I--
   9 - LDA  SYNC PC:008104 AB:008104 a:16 r:08 rb:0 DI:a6 DR:12:81:00 DO:81 IR:a6 WE:0 ALU:000012 S:ffff A:000000 X:000001 Y:000002 P:---I--
  10 - LDX  ZPG0 PC:008105 AB:000002 a:16 r:08 rb:0 DI:02 DR:12:81:00 DO:12 IR:02 WE:0 ALU:000012 S:ffff A:000012 X:000001 Y:000002 P:---I--
  11 - LDX  DATA PC:008105 AB:008105 a:16 r:08 rb:0 DI:23 DR:12:12:81 DO:12 IR:23 WE:0 ALU:000012 S:ffff A:000012 X:000001 Y:000002 P:---I--
  12 - LDX  SYNC PC:008106 AB:008106 a:16 r:08 rb:0 DI:b4 DR:23:12:81 DO:12 IR:b4 WE:0 ALU:000023 S:ffff A:000012 X:000001 Y:000002 P:---I--
  13 - LDY  ZPG0 PC:008107 AB:000026 a:16 r:08 rb:0 DI:03 DR:23:12:81 DO:23 IR:03 WE:0 ALU:000023 S:ffff A:000012 X:000023 Y:000002 P:---I--
  14 - LDY  DATA PC:008107 AB:008107 a:16 r:08 rb:0 DI:26 DR:23:12:81 DO:23 IR:26 WE:0 ALU:000023 S:ffff A:000012 X:000023 Y:000002 P:---I--
  15 - LDY  SYNC PC:008108 AB:008108 a:16 r:08 rb:0 DI:ea DR:26:23:12 DO:23 IR:ea WE:0 ALU:000026 S:ffff A:000012 X:000023 Y:000002 P:---I--
  16 - NOP  SYNC PC:008109 AB:008109 a:16 r:08 rb:0 DI:4f DR:26:23:12 DO:23 IR:4f WE:0 ALU:000026 S:ffff A:000012 X:000023 Y:000026 P:---I--
  17 - R16  SYNC PC:00810a AB:00810a a:16 r:08 rb:1 DI:a5 DR:26:23:12 DO:23 IR:a5 WE:0 ALU:000026 S:ffff A:000012 X:000023 Y:000026 P:---I--
  18 - LDA  ZPG0 PC:00810b AB:000001 a:16 r:16 rb:1 DI:01 DR:26:23:12 DO:23 IR:01 WE:0 ALU:002623 S:ffff A:000012 X:000023 Y:000026 P:---I--
  19 - LDA  ZPGR PC:00810b AB:000002 a:16 r:16 rb:0 DI:12 DR:26:26:23 DO:23 IR:12 WE:0 ALU:002623 S:ffff A:000012 X:000023 Y:000026 P:---I--
  20 - LDA  DATA PC:00810b AB:00810b a:16 r:16 rb:0 DI:23 DR:12:26:23 DO:26 IR:23 WE:0 ALU:001226 S:ffff A:000012 X:000023 Y:000026 P:---I--
  21 - LDA  SYNC PC:00810c AB:00810c a:16 r:16 rb:0 DI:8f DR:23:12:26 DO:26 IR:8f WE:0 ALU:002312 S:ffff A:000012 X:000023 Y:000026 P:---I--
  22 - R24  SYNC PC:00810d AB:00810d a:16 r:08 rb:0 DI:b4 DR:23:12:26 DO:26 IR:b4 WE:0 ALU:000023 S:ffff A:002312 X:000023 Y:000026 P:---I--
  23 - LDY  ZPG0 PC:00810e AB:000026 a:16 r:24 rb:1 DI:03 DR:23:12:26 DO:26 IR:03 WE:0 ALU:231226 S:ffff A:002312 X:000023 Y:000026 P:---I--
  24 - LDY  ZPGR PC:00810e AB:000027 a:16 r:24 rb:1 DI:26 DR:23:12:26 DO:26 IR:26 WE:0 ALU:231226 S:ffff A:002312 X:000023 Y:000026 P:---I--
  25 - LDY  ZPGR PC:00810e AB:000028 a:16 r:24 rb:0 DI:27 DR:26:26:23 DO:12 IR:27 WE:0 ALU:262312 S:ffff A:002312 X:000023 Y:000026 P:---I--
  26 - LDY  DATA PC:00810e AB:00810e a:16 r:24 rb:0 DI:28 DR:27:26:23 DO:23 IR:28 WE:0 ALU:272623 S:ffff A:002312 X:000023 Y:000026 P:---I--
  27 - LDY  SYNC PC:00810f AB:00810f a:16 r:24 rb:0 DI:ea DR:28:27:26 DO:23 IR:ea WE:0 ALU:282726 S:ffff A:002312 X:000023 Y:000026 P:---I--
  28 - NOP  SYNC PC:008110 AB:008110 a:16 r:08 rb:0 DI:ea DR:28:27:26 DO:23 IR:ea WE:0 ALU:000028 S:ffff A:002312 X:000023 Y:282726 P:---I--
  29 - NOP  SYNC PC:008111 AB:008111 a:16 r:08 rb:0 DI:4f DR:28:27:26 DO:23 IR:4f WE:0 ALU:000028 S:ffff A:002312 X:000023 Y:282726 P:---I--
  30 - R16  SYNC PC:008112 AB:008112 a:16 r:08 rb:0 DI:b2 DR:28:27:26 DO:23 IR:b2 WE:0 ALU:000028 S:ffff A:002312 X:000023 Y:282726 P:---I--
  31 - LDA  IDX0 PC:008113 AB:000000 a:16 r:16 rb:1 DI:00 DR:28:28:27 DO:23 IR:00 WE:0 ALU:002827 S:ffff A:002312 X:000023 Y:282726 P:---I--
  32 - LDA  IDX1 PC:008113 AB:000001 a:16 r:16 rb:1 DI:01 DR:00:28:27 DO:23 IR:01 WE:0 ALU:000028 S:ffff A:002312 X:000023 Y:282726 P:---I--
  33 - LDA  IDX2 PC:008113 AB:001201 a:16 r:16 rb:0 DI:12 DR:01:00:28 DO:00 IR:12 WE:0 ALU:000100 S:ffff A:002312 X:000023 Y:282726 P:---I--
  34 - LDA  IDXR PC:008113 AB:001202 a:16 r:16 rb:0 DI:02 DR:01:00:28 DO:00 IR:02 WE:0 ALU:000100 S:ffff A:002312 X:000023 Y:282726 P:---I--
  35 - LDA  DATA PC:008113 AB:008113 a:16 r:16 rb:0 DI:f0 DR:02:02:01 DO:01 IR:f0 WE:0 ALU:000201 S:ffff A:002312 X:000023 Y:282726 P:---I--
  36 - LDA  SYNC PC:008114 AB:008114 a:16 r:16 rb:0 DI:8f DR:f0:02:01 DO:01 IR:8f WE:0 ALU:00f002 S:ffff A:002312 X:000023 Y:282726 P:---I--
  37 - R24  SYNC PC:008115 AB:008115 a:16 r:08 rb:2 DI:b2 DR:f0:02:01 DO:01 IR:b2 WE:0 ALU:0000f0 S:ffff A:00f002 X:000023 Y:282726 P:N--I--
  38 - LDA  IDX0 PC:008116 AB:000000 a:16 r:24 rb:2 DI:00 DR:f0:02:01 DO:01 IR:00 WE:0 ALU:f00201 S:ffff A:00f002 X:000023 Y:282726 P:N--I--
  39 - LDA  IDX1 PC:008116 AB:000001 a:16 r:24 rb:2 DI:01 DR:00:00:f0 DO:01 IR:01 WE:0 ALU:00f002 S:ffff A:00f002 X:000023 Y:282726 P:N--I--
  40 - LDA  IDX2 PC:008116 AB:001201 a:16 r:24 rb:2 DI:12 DR:01:00:f0 DO:f0 IR:12 WE:0 ALU:0100f0 S:ffff A:00f002 X:000023 Y:282726 P:N--I--
  41 - LDA  IDXR PC:008116 AB:001202 a:16 r:24 rb:0 DI:02 DR:01:01:00 DO:f0 IR:02 WE:0 ALU:0100f0 S:ffff A:00f002 X:000023 Y:282726 P:N--I--
  42 - LDA  IDXR PC:008116 AB:001203 a:16 r:24 rb:0 DI:f0 DR:02:01:00 DO:00 IR:f0 WE:0 ALU:020100 S:ffff A:00f002 X:000023 Y:282726 P:N--I--
  43 - LDA  DATA PC:008116 AB:008116 a:16 r:24 rb:0 DI:01 DR:f0:f0:02 DO:01 IR:01 WE:0 ALU:f00201 S:ffff A:00f002 X:000023 Y:282726 P:N--I--
  44 - LDA  SYNC PC:008117 AB:008117 a:16 r:24 rb:0 DI:85 DR:01:f0:02 DO:01 IR:85 WE:0 ALU:01f002 S:ffff A:00f002 X:000023 Y:282726 P:N--I--
  45 - STA  ZPG0 PC:008118 AB:000000 a:16 r:08 rb:0 DI:00 DR:01:f0:02 DO:02 IR:00 WE:1 ALU:000002 S:ffff A:01f002 X:000023 Y:282726 P:---I--
  46 - STA  DATA PC:008118 AB:008118 a:16 r:08 rb:0 DI:01 DR:01:f0:02 DO:02 IR:01 WE:0 ALU:000002 S:ffff A:01f002 X:000023 Y:282726 P:---I--
  47 - STA  SYNC PC:008119 AB:008119 a:16 r:08 rb:0 DI:4f DR:01:01:f0 DO:02 IR:4f WE:0 ALU:000002 S:ffff A:01f002 X:000023 Y:282726 P:---I--
  48 - R16  SYNC PC:00811a AB:00811a a:16 r:08 rb:0 DI:85 DR:01:01:f0 DO:02 IR:85 WE:0 ALU:000002 S:ffff A:01f002 X:000023 Y:282726 P:---I--
  49 - STA  ZPG0 PC:00811b AB:000001 a:16 r:16 rb:0 DI:01 DR:01:01:f0 DO:02 IR:01 WE:1 ALU:00f002 S:ffff A:01f002 X:000023 Y:282726 P:---I--
  50 - STA  ZPGR PC:00811b AB:000002 a:16 r:16 rb:0 DI:12 DR:01:01:f0 DO:f0 IR:12 WE:1 ALU:00f002 S:ffff A:01f002 X:000023 Y:282726 P:---I--
  51 - STA  DATA PC:00811b AB:00811b a:16 r:16 rb:0 DI:23 DR:12:12:01 DO:f0 IR:23 WE:0 ALU:00f002 S:ffff A:01f002 X:000023 Y:282726 P:---I--
  52 - STA  SYNC PC:00811c AB:00811c a:16 r:16 rb:0 DI:8f DR:23:12:01 DO:f0 IR:8f WE:0 ALU:00f002 S:ffff A:01f002 X:000023 Y:282726 P:---I--
  53 - R24  SYNC PC:00811d AB:00811d a:16 r:08 rb:2 DI:85 DR:23:12:01 DO:f0 IR:85 WE:0 ALU:000002 S:ffff A:01f002 X:000023 Y:282726 P:---I--
  54 - STA  ZPG0 PC:00811e AB:000003 a:16 r:24 rb:2 DI:03 DR:23:12:01 DO:02 IR:03 WE:1 ALU:01f002 S:ffff A:01f002 X:000023 Y:282726 P:---I--
  55 - STA  ZPGR PC:00811e AB:000004 a:16 r:24 rb:0 DI:34 DR:23:23:12 DO:f0 IR:34 WE:1 ALU:01f002 S:ffff A:01f002 X:000023 Y:282726 P:---I--
  56 - STA  ZPGR PC:00811e AB:000005 a:16 r:24 rb:0 DI:45 DR:34:23:12 DO:01 IR:45 WE:1 ALU:01f002 S:ffff A:01f002 X:000023 Y:282726 P:---I--
  57 - STA  DATA PC:00811e AB:00811e a:16 r:24 rb:0 DI:56 DR:45:45:34 DO:01 IR:56 WE:0 ALU:01f002 S:ffff A:01f002 X:000023 Y:282726 P:---I--
  58 - STA  SYNC PC:00811f AB:00811f a:16 r:24 rb:0 DI:ea DR:56:45:34 DO:01 IR:ea WE:0 ALU:01f002 S:ffff A:01f002 X:000023 Y:282726 P:---I--
  59 - NOP  SYNC PC:008120 AB:008120 a:16 r:08 rb:0 DI:ea DR:56:45:34 DO:01 IR:ea WE:0 ALU:000002 S:ffff A:01f002 X:000023 Y:282726 P:---I--
  60 - NOP  SYNC PC:008121 AB:008121 a:16 r:08 rb:0 DI:db DR:56:45:34 DO:01 IR:db WE:0 ALU:000002 S:ffff A:01f002 X:000023 Y:282726 P:---I--
  61 - STP  SYNC PC:008122 AB:008122 a:16 r:08 rb:0 DI:ff DR:56:45:34 DO:01 IR:ff WE:0 ALU:000002 S:ffff A:01f002 X:000023 Y:282726 P:---I--


PostPosted: Sun Jul 24, 2022 6:36 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Another nice increment!

Have you any microbenchmarks? One reason to justify the cost increase for this second upgrade is performance, and a second would be code density. (The first upgrade, on the address bus, is more of a game changer because it allows larger memory in a simple system.)

(BTW, I'd say this is a single thread machine, not a zero thread machine.)


PostPosted: Sun Jul 24, 2022 7:46 am 

Joined: Wed Jun 29, 2022 2:15 am
Posts: 44
BigEd wrote:
Have you any microbenchmarks? One reason to justify the cost increase for this second upgrade is performance, and a second would be code density. (The first upgrade, on the address bus, is more of a game changer because it allows larger memory in a simple system.)

Wedging changes into the state machine is fun. Hand assembly, not so much. So no, the longest stretch of code I’ve run is 31 bytes long, and the only loop is the one that sets up and launches threads.

I agree the address bus is the big win. It’s what sent me down this rabbit hole. If Apple had a 6502 with flat access to 128K, the Lisa 7/7 OS might have been software on an Apple //L, and affordable.

As for code density and speed, don’t overlook clock rate and memory speeds. The design team of a 7,500 gate 65C2424 should have a much easier time optimizing critical paths vs. Motorola’s 68,000 gate competitor. Motorola was able to clock the 1979 NMOS 68k at 10MHz. Given the resources, Peddle, Mensch, and team should have been able to surpass that speed that year. They just never had the resources to try.

But that said, even 10MHz was faster than DRAM at the time could read. Thus back to benchmarks, one has to assume an L1 cache of SRAM sitting between an upgraded 65k and that expanse of DRAM. And that again circles back to code density.

The nice part of this design is that you only include the bytes you need, plus a one byte cost and one cycle cost for prefix opcodes. If you need to load 1 byte, your choices are the existing opcodes and cycles for ZP, existing opcodes and cycles for $0000-$FFFF, or two extra bytes and cycles for $010000-$FFFFFF. Add one more byte of opcode and cycle for loading 2 bytes, or two extra for 3 bytes.

In comparison, that MOT68000 is going to spend 4 bytes on every opcode no matter how much data you want, but with a 16-bit data bus, each load from memory supplies twice the bytes. Compared to a stock 6502, however, the prefix codes are far more efficient for any code accessing 16-bit or 24-bit values. The one byte prefix is half of the shortest alternative, IMM or ZP addressing, and a third of ABS. Plus in a stock 6502, storing 16 bits requires two registers vs. one in the 65C2424, or all three registers for 24 bits vs. one register, or a lot more cycles storing and manipulating that value in memory instead of registers.

I’ll leave it to someone else to port FORTH or BASIC as a macro benchmark to measure the improved code density. But that wouldn’t necessarily be apples to apples as such a port would allow for native 24-bit variables instead of non-native 16 bit values. Woz said his SWEET16 pseudo-CPU ran more than an order of magnitude slower than the 6502. CLC, LDA #$1234, ADC $234567 is 10 bytes of code and no more than 15 cycles, depending on whether that ADC is one, two, or three bytes of data.
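
The 10-byte figure can be checked by hand-assembling that sequence. The sketch below uses the standard 6502 opcodes (CLC = $18, LDA # = $A9, ADC abs = $6D) and the little-endian operand order shown in the earlier examples. The $5F prefix (16-bit registers, 24-bit address) is inferred from the bit-pair scheme rather than listed explicitly in the thread, and the helper names are hypothetical.

```python
REG_CODE = {8: 0b00, 16: 0b01, 24: 0b10}
ADDR_CODE = {16: 0b00, 24: 0b01}

def prefix(reg_bits, addr_bits):
    """Build the $nF prefix byte, or return b'' when no prefix is needed."""
    nibble = (REG_CODE[reg_bits] << 2) | ADDR_CODE[addr_bits]
    return bytes([(nibble << 4) | 0x0F]) if nibble else b""

def le(value, nbytes):
    """Little-endian operand bytes."""
    return value.to_bytes(nbytes, "little")

program = (
    bytes([0x18])                                        # CLC
    + prefix(16, 16) + bytes([0xA9]) + le(0x1234, 2)     # LDA #$1234, 16-bit A
    + prefix(16, 24) + bytes([0x6D]) + le(0x234567, 3)   # ADC $234567, 16-bit data
)

print(program.hex(" "), "=", len(program), "bytes")
```

Swapping the ADC prefix for the 8-bit or 24-bit register variant changes the prefix value but not the length, which is why the byte count stays at 10 regardless of data width.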

Anyone up for that challenge?


PostPosted: Sun Jul 24, 2022 9:00 am 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
65LUN02 wrote:
Motorola was able to clock the 1979 NMOS 68k at 10MHz. Given the resources, Peddle, Mensch, and team should have been able to surpass that speed that year. They just never had the resources to try.

Actually, they did have 6502s running at 10MHz in the 1970s; but for whatever reason, did not have the ability to market them, whether it was because their criterion was that they had to have it pass the test at twice the marked speed and they had no way to make a tester go 20MHz, or whatever.

Quote:
I’ll leave it to someone else to port FORTH or BASIC as a macro benchmark to measure the improved code density. But that wouldn’t necessarily be apples to apples as such a port would allow for native 24-bit variables instead of non-native 16 bit values.

24 would be nice. 16 is just barely enough for some of the stuff I do.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Sun Jul 24, 2022 2:20 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Yes, the prefix bytes look like they work out quite well for density.

What I've found in previous adventures is that an emulator and an assembler are quite handy at this stage. Failing the emulator, if one uses hardware, then a monitor.

If you can write a plain C model then it's relatively easy to plumb it into PiTubeDirect, at which point it can run as a second processor on a Beeb. Hundreds of people have this setup, so you've got a market... umm... an audience too.


PostPosted: Sun Jul 24, 2022 3:44 pm 

Joined: Wed Jun 29, 2022 2:15 am
Posts: 44
Without Arlet, none of my work would have happened. I’ll now leave it to the next person to move the ball further down the field, by emulator, compiler, or FPGA.

One final note on code density. On the address side, it costs the prefix byte plus each byte to reach more memory. But on the data side, it only costs the prefix byte. LDA ZPG costs two bytes to load one byte. LDA ZPG/2 costs three bytes to load two. But LDA ZPG/3 costs the same three bytes to load three bytes. Expand the design to 65C3232 and it then still costs three bytes but to load four bytes. That cost grows by one byte to load from $0000-$FFFF no matter how many bytes of data you want. And then one more byte to load from the whole address space no matter how much data you load, as it only costs the one prefix opcode to expand both addressing and registers.
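
Those costs reduce to a small function. The following is a sketch of the byte counts as described in this post (opcode, operand bytes, and at most one prefix byte); the mode names and the function name are made up for illustration.

```python
OPERAND_BYTES = {"zp": 1, "abs16": 2, "abs24": 3}

def load_instruction_bytes(addr_mode, data_bytes):
    """Instruction length in bytes for a load.

    addr_mode: "zp", "abs16", or "abs24".
    data_bytes: 1, 2, or 3 (4 on a hypothetical 65C3232).
    A single prefix byte covers wider data, wider addresses, or both.
    """
    needs_prefix = addr_mode == "abs24" or data_bytes > 1
    return 1 + OPERAND_BYTES[addr_mode] + (1 if needs_prefix else 0)

for mode in ("zp", "abs16", "abs24"):
    sizes = [load_instruction_bytes(mode, n) for n in (1, 2, 3, 4)]
    print(f"{mode:6s} loading 1/2/3/4 bytes: {sizes}")
```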

An interesting head to head would be a 65C3232 vs. an ARM1. They’d both be around 10,000 transistors. Load-store vs. addressable registers but otherwise both RISCy. The 65k would win on code density vs. the 32-bit/64-bit instructions of the ARM. But memory speeds would kill the 65k without a cache. I suggest a dedicated ZPG cache, as then the 65k really would have the equivalent of 259 registers, 256 of them 8-bit addressable with 16-bit opcodes, or 128 16-bit registers addressable with 24-bit opcodes, or 85 24-bit or 64 32-bit registers addressable with 24-bit opcodes. All 32 ARM registers take 32 bits to address, but on the speed side, it only takes one cycle to load and process 32 bits.
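
As a quick check on the zero-page-as-register-file counts, the sketch below assumes the pseudo-registers are packed back to back with no overlap. The function name is hypothetical.

```python
ZERO_PAGE_BYTES = 256

def zero_page_registers(width_bits):
    """Number of non-overlapping pseudo-registers of this width in zero page."""
    return ZERO_PAGE_BYTES // (width_bits // 8)

for width in (8, 16, 24, 32):
    print(f"{width}-bit: {zero_page_registers(width)} registers")
```

Under that assumption the 24-bit case leaves one zero-page byte unused.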

Ultimately the width of the data bus is going to drive the benchmarks. It’s why the 8086 beat out the 8080 and why the 68008 was never picked for a computer. But winding back to the constraints of the design, fitting into a 40-pin DIP was important in 1979-1980. You can only do that for a 32-bit CPU by multiplexing address and data, which cuts the memory access speed in half. For the 65C2424, growing the data bus to 24 bits is a rather trivial change that would remove logic and cycles in exchange for an extra mux or three, with no need to change opcodes or break backward compatibility.


PostPosted: Sun Jul 24, 2022 6:41 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
(For strict historical accuracy: Sinclair picked the 68008 for the QL, although they did regret it.)


PostPosted: Mon Jul 25, 2022 7:49 pm 

Joined: Mon Feb 15, 2021 2:11 am
Posts: 100
65LUN02 wrote:
An interesting head to head would be a 65C3232 vs. an ARM1. They’d both be around 10,000 transistors.


Where do you get 10K transistors for the ARM1? Wiki's transistor count article lists 25K citing "ARM's Race to Embedded World Domination", a November 9, 2000 article by Paul Demone at https://www.realworldtech.com/arms-race. An article by Markus Levy titled "The History of The ARM Architecture: From Inception to IPO" from an ARM 20th anniversary special supplement (not sure what it supplements, found the PDF online) mentions 30K transistors for the ARM1.

