Post new topic Reply to topic  [ 27 posts ]  Go to page Previous  1, 2
Author Message
PostPosted: Sun Jan 08, 2017 10:18 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10822
Location: England
I think technically it might be possible to make a significantly higher-performance 6502, in the sense of fewer clocks per instruction, but it would take a very sophisticated microarchitecture, much more than just pipelining. (One caveat: I think it would have to have some means of knowing which accesses are to I/O devices, to make good use of cache.)

The idea is that reads can be satisfied from cache and writes can be written through deep queues and may not need to happen at all. So fetches, reads, and writes can have wider paths than the external path to RAM. More instructions can be in flight, including speculatively executed instructions. I think complex instructions would have to be broken into micro-operations, so those can be reordered and dispatched and executed speculatively as resources allow. As with the x86, you'd need lots more registers than are architecturally defined - you might even bring in some of page zero and some part of the top of stack as if they are registers. And indeed, you'd need to decode more than one instruction per cycle, but that might even be the easy part.
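As a toy illustration of the "zero page as registers" idea above — holding page zero in an internal register file and deferring external writes through a queue — something like the following Python model could be sketched. The class and its structure are purely illustrative, not how any shipped 6502 works:

```python
# Toy model: zero page ($00-$FF) is shadowed in an internal register file,
# reads hit internally, and external writes are deferred until a flush.
class ZeroPageFile:
    def __init__(self, backing):
        self.backing = backing      # external RAM, modeled as {addr: byte}
        self.regs = {}              # zero-page bytes held on-chip
        self.dirty = set()          # zero-page bytes not yet written back

    def read(self, addr):
        if addr < 0x100:
            if addr not in self.regs:
                self.regs[addr] = self.backing.get(addr, 0)  # fill on first use
            return self.regs[addr]  # no external bus cycle needed
        return self.backing.get(addr, 0)

    def write(self, addr, value):
        if addr < 0x100:
            self.regs[addr] = value
            self.dirty.add(addr)    # defer the external write
        else:
            self.backing[addr] = value

    def flush(self):
        # drain the write queue to external RAM (e.g. before I/O access)
        for addr in self.dirty:
            self.backing[addr] = self.regs[addr]
        self.dirty.clear()
```

This also shows why the I/O caveat matters: a write into a memory-mapped I/O region must not sit in the queue, so the core needs to know which addresses require an immediate external cycle.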

One difficulty is this: it would take an absurd amount of effort - with the necessary knowledge and that amount of effort, a person could build a rather good implementation of a more regular machine - RISC-V, any other RISC, or something of their own invention, and they might well do that instead.

Another difficulty is that as complexity increases it gets more difficult to get the clock speed high, especially in FPGA. If the clock speed is not going to be so high, there's less advantage in trying to make such an aggressive microarchitecture.

I should say, I'm not an expert at any of this, just a keen follower of developments, which I find very interesting. Here's a diagram describing AMD's Barcelona generation, as an idea of how things can be arranged:
Attachment:
File comment: From https://books.google.co.uk/books?id=DMxe9AI4-9gC&pg=PA404
Barcelona.png


(From Computer Organization and Design: The Hardware/Software Interface, by David A. Patterson and John L. Hennessy)

There's a heap of good information at http://agner.org/optimize/ too, if you're prepared to consider things from an x86 perspective. It's register-poor and has similar status bits, so is similar enough to 6502 for these purposes.


PostPosted: Fri Jan 13, 2017 7:01 pm 
Offline

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343
Thanks for lots of ideas and background information. A 32-bit (or even 64-bit) data bus feeding a cache is very interesting, and I have already been thinking a lot about that, but I had never seen the 65m02 effort.

From my understanding, a pipelined microarchitecture would run several instructions in parallel, each instruction divided into micro-opcodes (u-opcodes). The idea is that a u-opcode runs faster since it does less, but several of them are needed to complete a full opcode, so more clock cycles are needed per opcode. I am not sure if that is what you mean by the x86-like direction, but that is what I have looked into.

For example, "STA ($ZP,X)" is a quite complicated instruction which requires fetching from zero page, adding X, and storing A. At first I thought zero page could be treated as an area of 256 registers, but in order to run several instructions in parallel, one either has the u-opcodes access a given register at a fixed cycle (for example the 6th cycle from the start of the opcode), or ends up with very complicated pre- or post-fetch handling of such a register.

For example, the above instruction could be divided into 8 u-opcodes (or maybe even more):

1) Fetch instruction byte
2) Fetch data byte
3) Fetch Zeropage to work register
4) Fetch and add X to work register
5) Fetch MSB to storage address
6) Fetch LSB to storage address
7) Fetch data from Accumulator to work register
8) Store work register to storage address
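The eight steps above could be modeled in software roughly like this. The micro-op names are my own invention, and note that the real 6502's ($zp,X) mode adds X to the operand (with zero-page wraparound) before reading the two pointer bytes; the sketch follows that behavior:

```python
def sta_zp_x_indirect(mem, a, x, zp_operand):
    """Toy micro-op trace for STA ($zp,X): one micro-op per cycle,
    eight cycles total, matching the fixed frame described above."""
    trace = ["FETCH_OPCODE", "FETCH_OPERAND"]
    ptr = (zp_operand + x) & 0xFF           # zero-page wraparound
    trace += ["READ_ZP", "ADD_X"]
    lo = mem[ptr]                           # pointer low byte
    hi = mem[(ptr + 1) & 0xFF]              # pointer high byte
    trace += ["FETCH_ADDR_LO", "FETCH_ADDR_HI"]
    addr = (hi << 8) | lo
    trace.append("MOVE_A_TO_WORK")
    mem[addr] = a                           # the actual store
    trace.append("STORE")
    return trace
```

On a real 6502 this instruction takes 6 cycles; the 8-cycle frame here is the hypothetical padded pipeline slot, not the original timing.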

So the instruction would take 8 cycles to complete, but since each u-opcode is simpler, it could (in theory) allow the core to run much faster, which is how the x86 does it. But as you point out, this might not hold for an FPGA.

Anyway, always manipulating the accumulator at cycle 7 would enable one to run many instructions in parallel, so that effectively one has 8 parallel pipelines with one instruction finishing each cycle. Because of this, all instructions would need to take 8 cycles to finish, since for example zero page has to be accessed at cycle 3 to prevent conflicts with other instructions accessing it.

Now, if one stores the accumulator into zero page, that is obviously not going to work, so one will need to insert wait states for this to work in practice (at least for upcoming instructions that manipulate the same data in zero page). This is also true for the X and Y registers (which in the above example are accessed in cycle 4), so that for example TAX and TXA would need wait states if they happened one after the other.
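That wait-state condition is a read-after-write hazard check, which could be sketched minimally like this. The instruction names and their operand sets below are illustrative, not a complete 6502 model:

```python
# Which architectural resources each instruction writes and reads.
# "ZP" stands in for a zero-page location; a real model would track
# the specific address.
WRITES = {"TAX": {"X"}, "TXA": {"A"}, "LDA_ZP": {"A"}, "STA_ZP": {"ZP"}}
READS  = {"TAX": {"A"}, "TXA": {"X"}, "LDA_ZP": {"ZP"}, "STA_ZP": {"A"}}

def needs_wait_state(prev, curr):
    """True if curr reads a resource the still-in-flight prev writes,
    so curr must be delayed until prev's 8-cycle frame completes."""
    return bool(WRITES.get(prev, set()) & READS.get(curr, set()))
```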

So, all in all, I think this would allow for faster 6502 execution, but whether it is actually any faster in real life would depend heavily on the code.

The "other way" would be to fuse several instructions into more complicated ones for internal execution, for example "LDA #$xx + STA $yyyy" into "LSA #$xx $yyyy" (LSA would load A and store it to $yyyy), or even longer sequences of instructions. That would not enable pipelining, but it would reduce the number of cycles per opcode. But maybe the 65m02 is a better way in that case, since it's the loading and storing to RAM that is the bottleneck of most opcodes.
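This fusion could be done as a peephole pass over the instruction stream, sketched below. The opcode values are the real 6502 encodings (LDA # = $A9, STA abs = $8D), but the fused "LSA" operation is invented, as in the example above:

```python
def fuse(code):
    """Scan a 6502 byte stream and fuse LDA #imm ; STA abs pairs
    into a single hypothetical internal LSA operation."""
    out, i = [], 0
    while i < len(code):
        # need 5 bytes: A9 imm 8D lo hi
        if i + 4 < len(code) and code[i] == 0xA9 and code[i + 2] == 0x8D:
            imm = code[i + 1]
            addr = code[i + 3] | (code[i + 4] << 8)   # little-endian address
            out.append(("LSA", imm, addr))            # invented fused op
            i += 5
        else:
            out.append(("BYTE", code[i]))             # pass through unchanged
            i += 1
    return out
```

A real implementation would have to fuse only at known instruction boundaries rather than on raw bytes, but the pattern-matching idea is the same.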

Well, at least those are my thoughts, but then I don't have that much experience with this.


PostPosted: Sat Jan 14, 2017 2:05 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1933
Location: Sacramento, CA, USA
kakemoms wrote:
Thanks for lots of ideas and background information. A 32-bit (or even 64-bit) data bus feeding a cache is very interesting, and I have already been thinking a lot about that, but I had never seen the 65m02 effort.
...

The 65m32 is not ready for prime-time, and is not really a wide 65xx or a wide 68xx, but something kind of in between.

http://anycpu.org/forum/viewtopic.php?f=23&t=300

It has a simpler and more direct addressing scheme than most, but tries to be efficient by staying 32-bit all the time and tucking most of the more common operands into the op-code word ... not entirely unlike the original ARM, as I understand it. It has 6502-like mnemonics, is accumulator-centric, and is heavily bound to memory (or memory cache, if applicable). I am using it to teach myself microprocessor architecture and implementation, and progress is rather slow, for a wide variety of reasons.

Mike B.


PostPosted: Sun Jan 15, 2017 9:10 am 
Offline

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343
Very interesting implementation! And quite ARM-like. The x86 went the other way, using a pre-opcode byte to expand its original 1-byte instruction set to 512 instructions... long before it became today's bloated instruction set (with lots of complications for the compilers). But you all know that history.

I have been looking mostly into the 6509 since I got myself a prototype Commodore B720 board, but I only partly liked its way of increasing the memory area. So I am still at the 64KB limit, but I have thought out a scheme that adds a value "PP" to the PC in order to extend the 64KB area. In contrast to the 6509, my PP (which stands for Program Page) is an internal 24-bit pointer that is added to the MSB of the PC. So the memory area becomes 2^24 pages of 256-byte memory, or effectively 32-bit addressing.

All instructions work as usual within their "current" 64KB area, but that area is defined by the PP register. The PP register itself will be accessed through some new opcodes. At least one of those opcodes will "keep the current page(s) of memory" in cache, effectively moving it along with the PP (for fast moving of data).
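The address arithmetic of that scheme, as I read it, works out as follows. This is a sketch of the proposal above (the function name is arbitrary): a 24-bit page register shifted up by 8 bits and added to the 16-bit PC, giving a 32-bit linear address, i.e. 2^24 pages of 256 bytes each:

```python
def effective_address(pp, pc):
    """Linear address under the proposed PP (Program Page) scheme:
    PP (24-bit) is added to the high byte of the 16-bit PC."""
    assert 0 <= pp < (1 << 24) and 0 <= pc < (1 << 16)
    return ((pp << 8) + pc) & 0xFFFFFFFF
```

Note that because PP is added to (not concatenated with) the PC's high byte, code running near the top of a 64KB window can spill into the next page, which is what makes the "moving window" trick possible.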

At least that's how I have planned it for now.


PostPosted: Fri Feb 03, 2017 9:50 am 
Offline

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343
After some more thinking about the design, it looks like I want to implement a pipeline with 8+ cycle u-opcodes, with the intention of bringing down the minimum cycle time and increasing speed. This will also allow the CPU to start most instructions within one cycle of the last, and could later allow dual instruction execution per cycle, i.e. superscalar execution.

There is also another option I have been looking into, and that is to increase the number of registers and make them less constrained (i.e. wider than a single 8-bit byte). The most feasible way to do this is to use zero page, in which case I would replace the accumulator with a zero-page location. Since zero page resides in internal memory, I am hoping to get away with this without too much increase in processing time or in the number of cycles per instruction.

The way I am thinking of doing this is to add an instruction "SAZ #$ZP" ("set accumulator to zeropage"), which would point to one of the 256 zero-page locations. I would then add another instruction "ZAS #$XX" ("zeropage accumulator size"), which sets the number of bytes the accumulator register would contain.

The implementation with a pipelined u-opcode set would enable loading (or storing) one byte per cycle in/out of the accumulator, so the execution time per instruction would increase, but only by a fraction: since each instruction takes 8 cycles, each additional byte transferred into the accumulator would only add 1 cycle.
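That cost model is simple enough to write down directly. This is just the arithmetic from the paragraph above (the fixed 8-cycle frame is the assumption carried over from earlier in the thread):

```python
def instruction_cycles(acc_size_bytes, frame=8):
    """Cycles for one instruction under the SAZ/ZAS scheme: the fixed
    8-cycle frame plus one cycle per additional accumulator byte."""
    assert acc_size_bytes >= 1
    return frame + (acc_size_bytes - 1)
```

So a 4-byte accumulator would cost 11 cycles per instruction instead of 8, i.e. a 32-bit operation for 37.5% more time rather than 4x.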

Has anyone used the zeropage in such a way before?


PostPosted: Fri Apr 14, 2017 9:03 am 
Offline

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343
I have a question about the new 65C02 core (and probably the original 6502 core as well).

The context is that I am implementing it as a CPU that shares the bus to internal memory in an FPGA. The other device is a 6502, which accesses the memory correctly, and everything there works (i.e. 6502 <-> FPGA memory).

Now, to get the 65C02 core (or Arlet's original 6502 core) to communicate with the internal memory, I assumed that RDY=1 and a simple CLK signal would be enough. But the synthesizer is complaining about the RDY signal; it wants me to clock RDY as well, since the output data (from the 65C02 core) seems to depend on it?

Is this correct, or is it a peculiarity of the Lattice Diamond toolchain? I know that some things seem to compile under Altera or Xilinx but require more precise specification under Lattice.

What are your thoughts on this?


PostPosted: Fri Apr 14, 2017 9:09 am 
Offline
User avatar

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
RDY=1 should be enough. Can you tell us more about the exact complaint you're getting?


PostPosted: Fri Apr 14, 2017 8:16 pm 
Offline

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343
Arlet wrote:
RDY=1 should be enough. Can you tell us more about the exact complaint you're getting?


Well, I am trying to understand what happens, but it's not so easy since I get no errors or (related) warnings. It's basically just not generating logic from cpu.v, which usually indicates some dead state. At first I thought it was due to a conflict between DO, DI and RDY, but I am not so sure anymore.

I will post more details here if I can't get to the bottom of it. My best guess is that it's confused by the separate in and out data wires into cpu.v (I have a common data bus for the other parts of the logic, not separate ones).


PostPosted: Sat Apr 29, 2017 7:51 pm 
Offline

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343
OK, I think I have pinpointed the problem. If I set the reset input high (= 1'b1), it looks like the 6502 FPGA core is held in reset. This is contrary to the real 6502, which is in reset while the line is low and starts running once the reset line goes high.

So connecting a real 6502 to this 6502 Verilog core means that I have to feed it ~reset instead. In fact, if I tie 1'b1 to the reset signal, the synthesizer optimizes away most of the core.

Is this correct? And how is it for the IRQ and NMI signals? The real 6502 has active-low input pins (/IRQ and /NMI), which means a falling edge on the pin should correspond to a rising edge on the core's NMI input to generate an interrupt.


PostPosted: Sat Apr 29, 2017 7:54 pm 
Offline
User avatar

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
This is correct. The core uses positive logic for reset, NMI, and IRQ.


Last edited by Arlet on Thu May 11, 2017 8:07 pm, edited 1 time in total.

PostPosted: Sat Apr 29, 2017 8:31 pm 
Offline

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343
The real 6502 has inverters on the NMI and IRQ inputs, but not on the Reset, so it was a little confusing.

Anyway, thanks for clearing that up.


PostPosted: Wed May 03, 2017 7:23 am 
Offline

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343
(Reply moved to a new thread about Lattice-related cores)

