 Post subject: Re: 32 is the new 8-bit
PostPosted: Thu Jun 06, 2013 9:57 pm 
Online
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
See previous post!


PostPosted: Fri Jun 07, 2013 1:06 am 
Offline

Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
On threads that I start here in the Programmable Logic Section I like to post all kinds of stuff. Sort of like a diary (I've never kept one), but it is useful for me, so that I can review the inputs of others at a later date after I become more educated. Most often due to these inputs. It's a shame Windfall seems to come across abrasively... I, for one, overlook it. Now if someone were to do that in one of my threads, I'd have a mindset more like enso. But it is his thread after all, and I would hate to lose someone with that much enthusiasm. There's not that many of us out there!

_________________
65Org16:https://github.com/ElEctric-EyE/verilog-6502


PostPosted: Fri Jun 07, 2013 8:57 am 
Offline
User avatar

Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
BigDumbDinosaur wrote:
While Windfall's great idea might result in a 65xx core that actually will run faster with wide memory, all we have seen to date is (largely unsupported) theory

No, theory. Not unsupported theory. I've been 'supporting' my ass off, against my better judgement. Except for actually making the thing, I can't do much more.
BigDumbDinosaur wrote:
along with a palpable level of hubris.

I agree. There are two forms of hubris to be found here. One of them from people apparently not able and/or not willing to understand, and translate that, via their ego, to a continuous stream of bears crossing the road, instead of asking for elaboration. Every mention of a register elicits the response 'that needs another cycle'. I mean, come on ! Can I really not expect something better than that here ?

But I suppose it's partially my fault for overestimating the simplicity of my idea. And I really don't mean that as an insult. I blame myself.
BigDumbDinosaur wrote:
Why should others do the work of translating this theory into working hardware—something that will not be a trivial engineering exercise, and might not produce results of value?

Well, it only requires several people who think it will.
BigDumbDinosaur wrote:
None of us has infinite time to invest in something that has been, in my opinion, poorly explained. Unless Windfall commits to creating a working core, I suspect his theory will remain just that. In other words, show us a proof of concept and then we'll talk.

I agree with Enso that we should abandon this thread.

I agree as well. I have a million other things to do.

Moving on ! :D


PostPosted: Fri Jun 07, 2013 9:23 am 
Offline
User avatar

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
Oops -- was gettin ready to post, and now Windfall has slipped in ahead of me!
Windfall wrote:
But I suppose it's partially my fault for overestimating the simplicity of my idea. And I really don't mean that as an insult. I blame myself.
Ok, thanks for that. IMO, you and that other character both allowed the tone of the dialog to get way out of whack. Being the OP doesn't make it OK, and neither does one or the other person being "right." It kinda makes me hope that newcomers, browsing 6502.org for the first time, find this thread LAST. :oops: But, onward and upward. In fairness I'll admit there've been times I've been guilty of lapses of friendly attitude myself. Hey -- I'm workin on it! :lol:

Anyway, following the PM's you & I traded earlier, I created some tidied-up (and more detailed) versions of your diagrams. I hope I've not introduced any distortions.
Attachment: Old_.gif
Attachment: New_.gif

One thing that bothers me is that the new version seems inconsistent with the quote below. Was it supposed to be? Am I missing something still?
Windfall wrote:
Okay, I see. You've shifted the paths a little by preprocessing the opcode while it's not even registered yet. :shock: :D

-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


PostPosted: Fri Jun 07, 2013 3:55 pm 
Offline
User avatar

Joined: Sat Sep 29, 2012 10:15 pm
Posts: 904
In spite of myself, I find myself writing this (OCD). I don't think I or anyone else should withhold information from our community because of bad behaviour of others. The truth is that I've pursued this direction (fast 6502 core with 32-bit RAM) in the past, and abandoned it as too complex a solution. My interest lies more in cycle-compatible retro-machinery anyway, so I am somewhat biased. I will try to stick to the facts and refer to notes from experiments some years back, and, to avoid seeing an onslaught of abuse, windfall is on my 'foe' list. If windfall posts something useful, please rephrase it in non-judgemental technical language for me.

My initial off-hand dismissal of the idea has to do with the complexity of even the most simple solution, and the fact that the poster has not considered even remotely the consequences of this idea (stated inaptly as 'you are talking out your butt', a regrettable choice of words that no doubt caused much of the escalation. However, that is a pretty good summary of windfall's presentation). As I can't fully rely on my memory, I found notes from several years ago when I pursued this direction extensively, in a misguided effort to create a really fast 8-bit CPU core. I certainly learned a lot, but it was a total waste of time. I sincerely hope that no one here repeats my mistakes.

Jeff's diagram is a nice start, but it doesn't even remotely approach the entirety of the issues at hand. All parts of the 6502 core must be modified and integrated in order for this to actually run at full speed (or at all). Here are some of my notes paraphrased for the occasion.

Summary of the idea

A 6502 core with 32-bit memory can be made to run as fast as possible, avoiding multi-cycle address and data fetch cycles as long as the data is available in the pre-fetch buffer. In order to avoid corner conditions, the buffer has to be 8 bytes long.

In order to provide execution fast enough to take advantage of this pre-fetching, the core has to be able to decode an incoming instruction and mux in the data during a single cycle. In fact, most simple instructions would have to decode and execute via combinational logic during the initial cycle. This affects all aspects of the 6502 core. Therefore this is not a simple pre-fetch modification, but an entirely different core.

Pre-fetch unit
Pre-fetcher has to be able to provide 3 bytes of opcode-data. Since the opcode can be positioned anywhere in bytes 0-7, an 8-1 mux is required to provide output. Since data can be one or two bytes, similar muxes must be created to get data from anywhere in the prefetch buffer. A state machine for sequential pre-fetching of the next longword is required, co-ordinating its activities with the PC register. In addition, to avoid a cycle loss after fetching a new byte, bypasses must be implemented, shunting any of the 4 8-bit chunks of the direct memory input to the instruction decoder.
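
To make the buffer behaviour above concrete, here is a minimal behavioural sketch in Python (not HDL, and not taken from the notes). The class name `PrefetchBuffer`, the method `fetch3` and the refill policy are all assumptions made for illustration; bypasses and the real state machine are not modelled.

```python
class PrefetchBuffer:
    """Toy model: an 8-byte window filled by 32-bit (4-byte) memory reads."""

    def __init__(self, memory):
        self.memory = bytes(memory)
        self.base = None            # address held in buffer byte 0 (4-byte aligned)
        self.buf = bytearray(8)
        self.word_reads = 0         # count of 32-bit memory reads issued

    def _read_word(self, addr):
        self.word_reads += 1
        # Pad with zeros past the end of memory so the buffer keeps its size.
        return self.memory[addr:addr + 4].ljust(4, b"\x00")

    def fetch3(self, pc):
        """Return opcode plus two operand bytes starting at pc."""
        hit = self.base is not None and self.base <= pc and pc + 3 <= self.base + 8
        if not hit:
            aligned = pc & ~3
            if self.base is not None and aligned == self.base + 4:
                # Sequential code: keep the upper word, read one new word.
                self.buf[0:4] = self.buf[4:8]
            else:
                # Jump or first use: both words must be read.
                self.buf[0:4] = self._read_word(aligned)
            self.buf[4:8] = self._read_word(aligned + 4)
            self.base = aligned
        off = pc - self.base        # this offset is what the output mux selects on
        return bytes(self.buf[off:off + 3])
```

In this model, sequential code costs one 32-bit read per four bytes consumed, while a jump costs two reads before three bytes are guaranteed available. In hardware, the `off` selection becomes the wide mux described above.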

Instruction Decoder
The instruction decoder must be fast enough to decode an instruction in the first cycle and leave enough time for the ALU and other execution units to finish the task, as well as for the address generator to operate (see next item). The instruction decoder must have a mux connecting 8 prefetch bytes and 4 memory bypasses (12-1 mux) in order to operate at full speed.

Address Generation
For absolute addressing, since the normal 6502 circuitry is not applicable, the absolute address from the prefetch buffer needs to be placed onto the address bus during the same first cycle. This requires muxing all 8 bytes into each of the AB bytes, an adder for indirect X/Y ops muxed to all 8 prefetch bytes, X and Y register, SP etc. combined with the other possibilities (see next items) - resulting in a pretty wide mux.

Memory Indirection
Indirect addressing creates an additional problem. Since we don't want to abandon the prefetch buffer and our memory is 32-bits wide, the address from the prefetch buffer needs to be placed onto AB (see above). The result needs to be placed onto AB for the next cycle. However, since the 16-bit address is not aligned, it may occur across two 32-bit words, requiring an extra cycle 1/4 of the time. No easy solution exists here - bypasses are required for the 3/4 fast operation, and a register-bypass combination must be implemented to handle 32-bit boundary cross. These are extra inputs to the AB, and must be accounted for in the state machine.
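
The 1/4 figure can be checked with a few lines of Python. This is only an illustrative sketch; the function names are made up here, and it assumes pointer addresses are spread evenly over the four byte positions within a 32-bit word.

```python
def straddles_word(addr, size=2, word_bytes=4):
    """True if a size-byte access starting at addr crosses a word boundary."""
    return (addr % word_bytes) + size > word_bytes


def straddle_fraction(size=2, word_bytes=4):
    """Fraction of start offsets within a word that cause a boundary cross."""
    crossings = sum(straddles_word(a, size, word_bytes) for a in range(word_bytes))
    return crossings / word_bytes
```

With the defaults, only a start offset of 3 crosses a boundary, so `straddle_fraction()` gives 0.25.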

Write Operations
In fact, a whole other 4 byte buffer should be implemented for data addressing. Since the RAM is 32-bits wide, every read has to go through a 4-1 mux, and every write operation becomes a read-modify-store operation. The write is more complicated to be efficient: the input of the data register, in addition to coming from RAM, has to have an override for the write from every possible write source. That is every one of the 4 bytes to be written next cycle has to be able to come from any of the registers (or other sources), requiring a crossbar switch or a complex muxing arrangement.
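
As a rough software picture of the read-modify-store step described above, the following Python sketch treats RAM as a list of 32-bit words. The function names and the little-endian byte-lane order are assumptions for illustration only.

```python
def write_byte(ram, addr, value):
    """Store one byte into 32-bit-wide RAM via read-modify-write."""
    index, lane = divmod(addr, 4)
    shift = 8 * lane                       # assumed little-endian lane order
    word = ram[index]                      # read the whole 32-bit word
    word &= ~(0xFF << shift) & 0xFFFFFFFF  # clear the target byte lane
    word |= (value & 0xFF) << shift        # merge in the new byte
    ram[index] = word                      # write the whole word back


def read_byte(ram, addr):
    """Select one byte lane out of a 32-bit word (the 4-1 mux on reads)."""
    index, lane = divmod(addr, 4)
    return (ram[index] >> (8 * lane)) & 0xFF
```

The merge step is where the override from every possible write source would have to be muxed in to avoid losing a cycle.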

Jumps
Direct jump address must be placed onto AB immediately. The result is bypassed into the Instruction register, and the prefetch state machine must fill in the rest of the prefetch buffer.

Relative jumps
The PC must include a 16-bit adder circuit to accommodate relative jumps. The adder must mux in any of the 8 prefetch bytes.
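
For reference, the arithmetic that this adder performs is small; the cost is in the muxing. A sketch in Python, with `branch_target` as a made-up name: the 6502 sign-extends the 8-bit offset and adds it to the address of the instruction following the 2-byte branch.

```python
def branch_target(branch_addr, offset_byte):
    """Target of a 6502 relative branch located at branch_addr."""
    next_pc = (branch_addr + 2) & 0xFFFF           # branch instructions are 2 bytes
    # Sign-extend the 8-bit offset to a Python integer in -128..127.
    offset = offset_byte - 0x100 if offset_byte & 0x80 else offset_byte
    return (next_pc + offset) & 0xFFFF             # 16-bit wrap-around
```

For example, an offset byte of $FE at $1000 branches back to $1000 itself.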

Indirect jumps
Indirect jump address must be placed onto AB immediately. See the discussion above about indirect memory access - a similar state machine with a possible 1-cycle delay for 32-bit boundary crossing must be implemented. AB mux has to have extra inputs from memory bypass (8 8-bit inputs) and the register in case of mis-alignment.

Zero-page operations
Since 0-page memory includes indirection for both addresses and data, these must be accounted for in the mux and the state machine.

Indirect X and Y jumps
An adder on the AB output, muxed in from X and Y is required. In reality, a little more complicated due to state machine issues.

Stack operations
Stack operations require temporary data fetches and stores. In addition, the stack supports 16-bit read and write operations, requiring buffering of 32-bit boundary mismatches which now occur 1/2 the time. To be as fast as possible (that is, incurring only 1 clock delay 1/2 of the time), 16-bit wide muxes and bypasses are required. Consider that since writing is possible, a RAM writing path must be established, and the contents of the prefetch buffer must be flushed. I don't even want to address the full complexity of this in this document.

ALU
Since the ALU no longer uses the restricted 6502 buses and is required to execute in a single cycle, wide muxes must connect the inputs of the ALU to every one of the 8 possible bytes of the prefetch buffer (bypass inputs are not necessary since after a jump an instruction is encountered, not data). But the ALU must also connect to all the registers and memory-read datapaths to maintain full speed.

Register File
Every register capable of loading an immediate must be muxed from every one of the 8 prefetch bytes. Since registers can be written, the paths must be muxed in as well, and the data buffer flushing must occur on write.

Summary

Speed

The complexity of the instruction decoder in even simple addressing modes, combined with the fact that combinational logic must be used to keep the system running at full speed, results in long path delays and unacceptably low maximum clock rates, completely defeating the purpose of the acceleration circuit. Large muxes and routing congestion associated with the core size slow down the design considerably. A fast 8-bit core connected to 8-bit RAM will in most cases outrun a CPU accelerated in this fashion.

Size

The implementation requires many wide (8 and 16-bit wide) muxes, with many inputs (as the prefetch buffer has 8 bytes in addition to all possible paths). 16-1 muxes in FPGA architectures are enormous, leading to unacceptably large core sizes and associated routing congestion.


If anyone still feels that this direction is worth pursuing, please use this input as a positive starting point. I am sorry if my descriptions or conclusions are negative - I am a little bitter about wasting a year of my life pursuing this rabbithole.

Edit: there is a reason 32-bit processors are a lot more complicated, especially ones that allow 8 and 16-bit unaligned access... As I said before, 32 is not at all the new 8 bit.

_________________
In theory, there is no difference between theory and practice. In practice, there is. ...Jan van de Snepscheut


Last edited by enso on Fri Jun 07, 2013 4:59 pm, edited 1 time in total.

PostPosted: Fri Jun 07, 2013 4:47 pm 
Offline
User avatar

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
It counts for a lot when people can "de-escalate," preserving decorum in the face of perceived provocation. It's incongruous that you, Enso -- who admonished us to take the "high road" -- don't acknowledge at least partial responsibility (as Windfall has done). 'Nuff said.

Quote:
diagram is a nice start, but it doesn't even remotely approach the entirety of the issues at hand
True. And, to be clear, the diagrams (original & revised) focus on instruction fetch/decode, and plainly aren't intended to be comprehensive. Your summary raises some pertinent issues, so thank you for that -- I'll review those at leisure.

I like what Tor had to say earlier:
Quote:
Thanks to Windfall anyway, I for one like the speculation threads that pop up now and then, even when an idea may be impractical - there's always some interesting discussions around as to why, and other ideas are spawned. And as I was reading this thread I got some ideas for my own emulators, not directly related, but when something wakes the mind up it starts churning. Ideas generate other ideas, even distant ones.

-Tor

-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


PostPosted: Fri Jun 07, 2013 4:51 pm 
Offline
User avatar

Joined: Sat Sep 29, 2012 10:15 pm
Posts: 904
Dr Jefyll wrote:
It counts for a lot when people can "de-escalate," preserving decorum in the face of perceived provocation. It's incongruous that you, Enso -- who admonished us to take the "high road" -- don't acknowledge at least partial responsibility (as Windfall has done). 'Nuff said.


Oh, I do take the responsibility (and mention the poor choice of words). I was provoked, no question, but I certainly could have done better.

_________________
In theory, there is no difference between theory and practice. In practice, there is. ...Jan van de Snepscheut


PostPosted: Fri Jun 07, 2013 6:17 pm 
Online
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Thanks for the writeup, enso.
I think it's worth noting that there are usually diminishing returns. Implementing some but not all speedups would be the pragmatic approach. It's clear that there is at least a danger of losing a lot of clock speed.
Implementing a write buffer might help with the partial writes problem, especially if you drop support for self-modifying code. Ultimately wide memory access lends itself to cache fills and spills anyway.
Good point about misaligned accesses.
Cheers
Ed


PostPosted: Fri Jun 07, 2013 8:42 pm 
Offline

Joined: Sat Mar 27, 2010 7:50 pm
Posts: 149
Location: Chexbres, VD, Switzerland
Quote:
I certainly learned a lot, but it was a total waste of time. I sincerely hope that no one here repeats my mistakes.

Would you mind explaining more? Are you just saying this because your critical path ended up being too long?

I think the problem would be easily solved by having a 32-bit instruction master, but an 8-bit data master. The bus to which the processor is connected would have to solve the alignment issues on its own. If both masters are cached, that is not even a problem, as they can read/write data in bursts of whatever size (8, 16 or 32) at the maximum bus speed. However, problems of cache coherence will arise.

If only the instruction side is cached then it's the same: the memory width does not matter, and the I-cache would fill itself at maximum bus speed, but then it would be read as 32-bit words by the instruction master, which would be able to inject entire 6502 instructions into the pipeline every cycle. I don't see anything wrong with that, except the coherence problem if you want to do self-modifying code. In this case, a cache with a strong coherence fix in hardware is the solution, and everything is solved, super fast and nice.


PostPosted: Fri Jun 07, 2013 9:08 pm 
Online
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
(I suspect it's true that it would be more or less a new core that would do this, not a minor mod of an existing one.)


PostPosted: Fri Jun 07, 2013 10:20 pm 
Offline
User avatar

Joined: Sat Sep 29, 2012 10:15 pm
Posts: 904
Bregalad wrote:
Quote:
I certainly learned a lot, but it was a total waste of time. I sincerely hope that no one here repeats my mistakes.

Would you mind explaining more? Are you just saying this because your critical path ended up being too long?

I think the problem would be easily solved by having a 32-bit instruction master, but an 8-bit data master. The bus to which the processor is connected would have to solve the alignment issues on its own. If both masters are cached, that is not even a problem, as they can read/write data in bursts of whatever size (8, 16 or 32) at the maximum bus speed. However, problems of cache coherence will arise.

If only the instruction side is cached then it's the same: the memory width does not matter, and the I-cache would fill itself at maximum bus speed, but then it would be read as 32-bit words by the instruction master, which would be able to inject entire 6502 instructions into the pipeline every cycle. I don't see anything wrong with that, except the coherence problem if you want to do self-modifying code. In this case, a cache with a strong coherence fix in hardware is the solution, and everything is solved, super fast and nice.


A RISC core tuned for 6502-style instructions and flags running as a microcode engine would definitely be a better fit than trying to 'accelerate' the core by attaching a 32-bit bus. But the beauty of the 6502 is the straightforwardness of the implementation and hiding things in the pipeline seems to be almost wrong. It makes more sense to just use 65org32 or some other 'wide-load' version of the 6502, if compatibility is not an issue.

Also, note that Arlet's core will run at 100MHz; it is probably faster and smaller than these other approaches that seem so attractive at first.

_________________
In theory, there is no difference between theory and practice. In practice, there is. ...Jan van de Snepscheut


PostPosted: Sat Jun 08, 2013 11:11 am 
Offline
User avatar

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
enso wrote:
Also, note that Arlet's core will run at 100MHz; it is probably faster and smaller than these other approaches that seem so attractive at first.

Also, one could attempt to modify this core to reduce INX, CLC, TAX (and similar) to 1 cycle. This would not involve any of the memory accesses, so it would be a much simpler project, but can provide some insight in changes in resources and speed.


PostPosted: Sat Jun 08, 2013 2:24 pm 
Offline

Joined: Sat Mar 27, 2010 7:50 pm
Posts: 149
Location: Chexbres, VD, Switzerland
(to continue here what you talked about in the other topic): 16-bit data reads and writes are only performed by jsr, rts, rti and interrupt start. It is acceptable to split those instructions into 2 or more cycles (in fact they will probably be split into more than 2 cycles anyway), so they just perform two good old 8-bit reads or writes to the data bus, or to the data cache if there is one.
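
As an illustration of splitting the 16-bit stack accesses into two 8-bit ones, here is a small Python sketch of JSR and RTS following standard 6502 behaviour (JSR pushes the address of its own last byte, high byte first; RTS pulls it and adds one). The function names and the flat `bytearray` memory are assumptions for this example, and cycle timing is not modelled.

```python
def jsr(mem, sp, pc, target):
    """JSR at address pc: push return address as two 8-bit writes."""
    ret = (pc + 2) & 0xFFFF            # address of the last byte of the JSR
    mem[0x0100 + sp] = ret >> 8        # first 8-bit write: high byte
    sp = (sp - 1) & 0xFF
    mem[0x0100 + sp] = ret & 0xFF      # second 8-bit write: low byte
    sp = (sp - 1) & 0xFF
    return sp, target


def rts(mem, sp):
    """RTS: pull return address as two 8-bit reads, then add one."""
    sp = (sp + 1) & 0xFF
    lo = mem[0x0100 + sp]              # first 8-bit read: low byte
    sp = (sp + 1) & 0xFF
    hi = mem[0x0100 + sp]              # second 8-bit read: high byte
    return sp, (((hi << 8) | lo) + 1) & 0xFFFF
```

Because each access is a single byte, neither routine ever has to deal with a 32-bit boundary crossing.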

It would not be too complicated to run all instructions in 1 cycle except the (),Y and (,X) addressing modes. They would be doable in 1 cycle but would add extra read ports of their own. I'll study it in more detail, and try to adapt my ARM core to run some 6502-friendly microcode instead. This will be the hardest part.

Then I will have to write an instruction cache with a special FIFO for misaligned reads, as was already mentioned in this thread. Finally, the work of converting 6502 instructions to microcode will have to be done. This work is going to be a bit long but dead easy. If I feel like it, some kind of branch prediction could be added, in order to improve the core further. And then I'll be able to benchmark this core and remove all bugs that might appear either in the microcode or in the RISC part. Once this is done, I'll be able to compare this with existing 6502 implementations and see if it is more "powerful" or not.

Until then, nobody can guess which one will "win". For now it's just an idea; the first step would be to see how 6502 instructions would decompose into RISC, and I'd do it in a way that takes as few RISC instructions per CISC instruction as possible (1 in most cases). This would not be too hard if I can consider zero page as "registers", but unfortunately we will have to deal with the possibility of using zero page with non-zero-page addressing modes (this has to be executed correctly). A lost cycle when this happens is acceptable though, because this happens rarely.

It would be great if executing from zero page and having zero page shared with the system were possible, but I'd say those are not a necessity. Those would only be possible if zero-page accesses actually show on the bus somehow and a strong coherence mechanism is in place. Unfortunately I currently have no knowledge of how a cache coherence mechanism can be designed on an FPGA. A quick Google search gives me nothing satisfying, which is probably bad news.


PostPosted: Sat Jun 08, 2013 2:44 pm 
Online
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
An alternative to coding up an implementation in HDL is to write a simulation of a proposed microarchitecture: I've seen this done several times at work, although not written one. It might seem like doubling the work, but it's not quite that bad. You do of course need a plan of attack to write a cycle-accurate emulator in a high level language, and I don't have any ideas there.

Again with the diminishing returns point: a core which usually gets the benefit of a wide read buffer is going to be simpler than one which always gets a benefit. Same with 16-bit reads and writes: sometimes they will straddle a boundary and sometimes they won't. (Note that the indexed instructions you mention need to do a 16-bit read of zero page. Again, a separate memory path for zero page and the stack page could help here, both because it can be wider and perhaps because on-FPGA block RAMs are dual-ported.)

Cheers
Ed


PostPosted: Sat Jun 08, 2013 2:53 pm 
Offline
User avatar

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
As a first step, I recommend taking a simple spreadsheet, writing a number of signals in the columns, like ADDR, DATA, PC, ... Add a row for each clock cycle, and write down the contents of each signal at each clock. If you do this for a couple of instruction sequences, you get a good idea of what data is where in the system, and how many cycles you need.
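
The same table can also be generated by a few lines of code instead of by hand. The Python sketch below is only an illustration of the idea: `trace` is a made-up helper that models just two opcodes (LDA immediate, STA absolute) on a plain 8-bit bus with one bus access per cycle, and the origin address is arbitrary.

```python
LDA_IMM, STA_ABS = 0xA9, 0x8D


def trace(program, origin=0x0200):
    """Return one row per clock cycle: (cycle, addr, data, rw)."""
    rows = []
    pc, a = origin, 0
    end = origin + len(program)

    def fetch():
        # One read cycle at the current PC, then advance the PC.
        nonlocal pc
        byte = program[pc - origin]
        rows.append((len(rows) + 1, pc, byte, "R"))
        pc += 1
        return byte

    while pc < end:
        opcode = fetch()
        if opcode == LDA_IMM:
            a = fetch()                              # operand byte
        elif opcode == STA_ABS:
            lo = fetch()
            hi = fetch()
            rows.append((len(rows) + 1, (hi << 8) | lo, a, "W"))
        else:
            raise ValueError(f"opcode {opcode:#04x} not modelled")
    return rows


if __name__ == "__main__":
    print("CYCLE ADDR DATA RW")
    for cycle, addr, data, rw in trace(bytes([0xA9, 0x01, 0x8D, 0x00, 0x04])):
        print(f"{cycle:5d} {addr:04X} {data:02X}   {rw}")
```

Writing a second version of `trace` for a proposed wide-memory scheme and comparing row counts for the same instruction sequence gives a first estimate of the cycles saved.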

