32 is the new 8-bit

player55328 · Post by **player55328** » Tue Oct 15, 2013 2:34 pm

SID is not working yet and the IEC interface is very unstable...

ElEctric_EyE · Post by **ElEctric_EyE** » Tue Oct 15, 2013 9:09 pm

Are you out to make the SID work within the FPGA?
Other efforts that I know of have successfully reproduced the 6581 outside of the FPGA. I'm sure you've heard of them....
Or maybe you desire to do this 100% yourself and have your own version inside the FPGA? That would be awesome IMO, but so much work with the DSP functions.

BTW, great effort on your Atlys system.

GARTHWILSON · Post by **GARTHWILSON** » Tue Oct 15, 2013 9:36 pm

Yes! I forgot about that, even though I have them linked on my links page.
http://roboterclub-freiburg.de/atmega_s ... gaSID.html
http://www.swinkels.tvtom.pl/swinsid/

Bregalad · Post by **Bregalad** » Mon Nov 04, 2013 5:54 pm

So, in order to fix the problem of cache pressure that Morita pointed out, an optimal processor pipeline should be redesigned completely to think in terms of loads and stores instead of instructions.
Instead of aiming to do 1 instruction per cycle (including RMW instructions), I'd aim for 1 memory operation per cycle (or as close as possible).

Instructions that don't use memory:
- immediate instructions: ADC #imm, AND #imm, LDA #imm, etc.
- accumulator instructions: ASL A, ROL A, etc.
- register transfer instructions: TAX, TYA, etc.
- register inc/dec instructions: INX, DEY, etc.
- flag clear/set instructions: CLC, SEC, etc.
These can be executed at any time (simultaneously if possible).

Instructions using absolute or indexed addressing should take 1 cycle to execute (potentially at the same time as an instruction above).

Instructions using RMW on memory (INC $mem, LSR $mem, etc.) should take 2 cycles (by that I mean 2 slots in the pipeline), potentially merged with instructions that don't use any memory accesses in both cycles.

Instructions using indirection (LDA ($xx,X), STA ($xx),Y, etc.) should take 3 cycles (technically 3 slots in the pipeline: 2 reads for the pointer and one read or write for the instruction), also potentially merged with instructions that don't use memory access in the first and last cycles.
I like the idea that chains of instructions like INX INX INX INX can be replaced with ADD X, #4 and be executed in a single cycle.
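As a rough illustration of the accounting above, here is a minimal Python sketch. The mode labels, the `MEM_SLOTS` table and the `ADDX` pseudo-op are made-up names for this example only, and the slot count is just a lower bound on cycles under the assumption that register-only instructions are always merged for free.

```python
# Hypothetical slot model; names and mode labels are illustrative only.
MEM_SLOTS = {
    "implied": 0,    # TAX, INX, CLC, ASL A ...
    "immediate": 0,  # LDA #imm, ADC #imm ...
    "absolute": 1,   # absolute or indexed: one data access
    "rmw": 2,        # INC $mem, LSR $mem: one read + one write
    "indirect": 3,   # LDA ($xx,X), STA ($xx),Y: two pointer reads + one access
}


def memory_slots(program):
    """Total data-memory slots needed by a list of (mnemonic, mode) pairs."""
    return sum(MEM_SLOTS[mode] for _, mode in program)


def fuse_inx(program):
    """Collapse each run of INX into a single ('ADDX', n) pseudo-op."""
    out = []
    for mnemonic, mode in program:
        if mnemonic == "INX" and out and out[-1][0] == "ADDX":
            out[-1] = ("ADDX", out[-1][1] + 1)
        elif mnemonic == "INX":
            out.append(("ADDX", 1))
        else:
            out.append((mnemonic, mode))
    return out


program = [
    ("LDA", "absolute"),
    ("INX", "implied"),
    ("INX", "implied"),
    ("INC", "rmw"),
    ("STA", "indirect"),
]
slots = memory_slots(program)            # 1 + 0 + 0 + 2 + 3
fused = fuse_inx([("INX", "implied")] * 4)
```

Here `slots` is 6 and `fused` is `[("ADDX", 4)]`, i.e. four INX collapse into one register-only operation that needs no memory slot at all.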

So now the need for 3-ported D-cache disappears, however to push performance to the max it would be nice to process multiple instructions at the same time. I know it's possible and that modern processors do it, but I don't know how and it sounds like a crazy thing to do.

We could do it 1 instruction at a time first and try to improve later though. I'd end up with something very close to what player55328 announced, but with a different approach.

player55328 · Post by **player55328** » Fri Nov 08, 2013 7:17 pm

I am not positive, but it looks like you want to keep an 8-bit data bus?

If that is true, it will be hard to execute multiple instructions because of the memory bandwidth needed for loading instructions ahead of time along with doing the operand accesses. That is why my fastest design is a Harvard architecture with completely separate instruction and operand buses.

I think the hardest thing to deal with is flag coherency when trying to execute multiple instructions. Right now I am trying to implement executing a non-taken branch and the next instruction (if it is a single byte) simultaneously, making the non-taken branch effectively a 0-cycle instruction. The reason why it is limited to single-byte instructions is that my pipeline only guarantees there are 3 bytes in the IR when starting to execute an instruction. If I go to a 64-bit bus I would be able to guarantee 5 and should be able to execute any instruction simultaneously with a non-taken branch.
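The byte-count argument can be written down directly. This is only a sketch of the arithmetic, not of the actual core: `can_fuse` and its parameters are hypothetical names, and the only facts used are that a 6502 relative branch is 2 bytes long and that instructions are 1 to 3 bytes.

```python
BRANCH_LEN = 2  # 6502 relative branch: opcode + 8-bit offset


def can_fuse(next_len, ir_bytes):
    """True if a non-taken branch plus the following instruction
    (next_len bytes) both fit in the guaranteed IR window."""
    return BRANCH_LEN + next_len <= ir_bytes


# With 3 guaranteed bytes only single-byte instructions can be paired.
fusable_with_3 = [n for n in (1, 2, 3) if can_fuse(n, 3)]
# With 5 guaranteed bytes every instruction length can be paired.
fusable_with_5 = [n for n in (1, 2, 3) if can_fuse(n, 5)]
```

`fusable_with_3` comes out as `[1]` and `fusable_with_5` as `[1, 2, 3]`, which matches the limits described above.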

If you use a real cache these limitations may not apply but my intention for multi-instruction execution would be to always have the same timing for any instruction sequence rather than having it run faster/slower at times depending on how full the IR/cache is at the time.

The only other way to make my design faster would be to get rid of one of the read pipeline delays, but this cannot be done using the internal BRAMs for RAM and ROM.

For those of you who may have downloaded my CPU, I have found a bug...

change this line
negative <= (x == NEG_ONE) ? H : (x == 8'h7F) ? H : x[7];
to
negative <= (x == NEG_ONE) ? L : (x == 8'h7F) ? H : x[7];
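One way to sanity-check the corrected line is to compare it against a reference over all 256 values. The sketch below rests on assumptions the post does not state: that the expression predicts the N flag of `x + 1` (as an INX-style increment would need), that `NEG_ONE` is `8'hFF`, and that `H`/`L` are 1/0.

```python
NEG_ONE = 0xFF  # assumed value of the NEG_ONE constant


def negative_buggy(x):
    # Original line: (x == NEG_ONE) ? H : (x == 8'h7F) ? H : x[7]
    return 1 if x == NEG_ONE else 1 if x == 0x7F else (x >> 7) & 1


def negative_fixed(x):
    # Corrected line: (x == NEG_ONE) ? L : (x == 8'h7F) ? H : x[7]
    return 0 if x == NEG_ONE else 1 if x == 0x7F else (x >> 7) & 1


def negative_reference(x):
    # Assumed intent: bit 7 of the 8-bit result of x + 1.
    return ((x + 1) & 0xFF) >> 7


bad_buggy = [x for x in range(256) if negative_buggy(x) != negative_reference(x)]
bad_fixed = [x for x in range(256) if negative_fixed(x) != negative_reference(x)]
```

Under those assumptions `bad_fixed` is empty and `bad_buggy` contains only `0xFF`, the wrap-around case where the result is zero and N must be low.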

Bregalad · Post by **Bregalad** » Fri Nov 08, 2013 8:06 pm

I didn't specify it because it didn't change from my last "design-draft", but of course instruction and data masters are completely separate. The instruction master reads 32 bits per cycle while the data master reads or writes 8 bits per cycle.

Since both would probably be cached using FPGA SRAM, access to external memory would only be done when the cache logic requires it, or when an access to memory-mapped I/O is detected (from the data master; no sane person would execute code from memory-mapped I/O). In both cases the processor would of course have to be blocked until the access has completed.

nyef · Post by **nyef** » Sat Nov 09, 2013 12:09 am

Bregalad wrote:

no sane people would execute code from memory mapped IO.

Then you probably don't want to hear about the hardware-detection scheme which involved setting up a DMA process to put NOPs on the data bus every cycle for a bit, followed by something on the order of an absolute JMP or BRK, then jumping to undecoded address space, relying on bus capacitance to hold the DMA data available to the CPU for instruction fetch...

Then again, the group that came up with that was trying to come up with the most twisted ideas they could for code that would run on a particular hardware platform, but not under simulation. Yes, a simulator could be written that would execute the code correctly, but there's the "more trouble than it's worth" argument.

Bregalad · Post by **Bregalad** » Sat Nov 23, 2013 10:32 am

This is an interesting concept. An "instruction-supported DMA" is very cool. This would however be incompatible with the "CPI improvement" that I was looking for.

I think I have an idea to improve CPI drastically, and it is quite different from what I have proposed so far.

Just like before, we have a 32-bit "FIFO" that always shows the bytes at PC, PC+1, PC+2 and PC+3, even if PC is not 32-bit aligned. After a branch, two 32-bit reads will be required to fill this FIFO again. Also, just like before, the next stage says whether it will consume 0, 1, 2, 3 or 4 bytes for the executed instructions. Therefore, the FIFO can shift its bytes by any amount.
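To make the fetch side concrete, here is a small behavioral model in Python, reading the FIFO as a window of four consecutive instruction bytes starting at PC and fed by aligned 32-bit reads. The class and method names are invented for this sketch, and it ignores timing (a real pipeline would issue one read per cycle).

```python
class FetchWindow:
    """Behavioral model of a 4-byte instruction window fed by aligned 32-bit reads."""

    def __init__(self, memory, start=0):
        self.memory = memory   # instruction memory as a bytes object
        self.word_reads = 0    # number of 32-bit reads issued so far
        self.branch(start)

    def _read_word(self):
        # Fetch one aligned 32-bit word; pad with zeros past the end of memory.
        word = bytes(self.memory[self.next_word:self.next_word + 4])
        self.buf += word.ljust(4, b"\x00")
        self.next_word += 4
        self.word_reads += 1

    def _fill(self):
        # Keep at least 4 valid bytes available to the decoder.
        while len(self.buf) < 4:
            self._read_word()

    def branch(self, target):
        # Flush, read the word containing the target, drop the bytes before it.
        self.pc = target
        self.buf = bytearray()
        self.next_word = target & ~3
        self._read_word()
        del self.buf[:target & 3]
        self._fill()

    def window(self):
        return bytes(self.buf[:4])

    def consume(self, n):
        # The decoder reports how many bytes (0..4) it used this cycle.
        if not 0 <= n <= 4:
            raise ValueError("can only consume 0 to 4 bytes")
        del self.buf[:n]
        self.pc += n
        self._fill()


fw = FetchWindow(bytes(range(64)))
reads_before = fw.word_reads
fw.branch(5)                       # unaligned branch target
reads_for_branch = fw.word_reads - reads_before
window_after_branch = fw.window()
```

In this model a branch to the unaligned address 5 costs two 32-bit reads and leaves bytes 5 to 8 in the window; a branch to an aligned address would need only one read.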

However, things will be quite different in the following pipeline stages.
The decoder will try to decode all instructions completely in the FIFO. This can be 1, 2, 3 or 4 instructions depending on the situation. It will immediately inform the previous stage how many bytes are consumed so that PC can be updated.

There are then 4 identical pipeline stages that can execute simple 6502 instructions that don't access memory.
For instance, they can do an ALU operation on A, X, Y and S, and they can change the status flags. Each of those stages has its own copy of the registers and the status flags.
Normally, instruction 1 is mapped to pipeline stage 1, instruction 2 to pipeline stage 2, etc.
This only works when no memory is accessed.

One of those stages (probably the 2nd or 3rd one) is however a "complete" state machine that can execute all 6502 instructions, including those that access memory (be it 1, 2, or 3 accesses; for instance, indirect addressing instructions will need 3 accesses). All data accesses are done in 8-bit quantities (instead of 32-bit like the instruction accesses).

When the decoder detects an instruction which requires access to memory, it of course maps it to the pipeline stage which can do it, and maps the other instructions before/after it to other pipeline stages the best it can.

Each stage has logic to know which values it should use for A, X, Y, S and the status flags. There is logic to detect conflicts and stall the pipeline (unfortunately this will happen often) whenever one of the previous instructions in the pipeline hasn't finished and is going to update a register or the status flags.
To improve efficiency, this logic is separate for each status flag, i.e. if an instruction that is going to affect C is in the last pipeline stage and an ADC has to be executed in the first stage, the ADC has to wait for the other operation that affects C to complete, but if only N and Z were to be affected then it would not stall.
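The per-flag conflict check can be sketched as a set intersection. This is only an illustration with hypothetical names; it covers read-after-write conflicts only and treats each register and each flag as a separately tracked resource.

```python
FLAGS = frozenset("NVDIZC")  # the individual 6502 status flags


def must_stall(reads, pending_writes):
    """Stall if the new instruction reads anything that an unfinished
    earlier instruction is still going to write."""
    return any(reads & writes for writes in pending_writes)


def coarse(resources):
    """Lump all flags into one 'P' resource, for comparison."""
    return {"P" if r in FLAGS else r for r in resources}


ADC_READS = {"A", "C", "D"}       # ADC needs A, carry and the decimal flag
LDX_IMM_WRITES = {"X", "N", "Z"}  # LDX #imm updates X, N and Z
CMP_WRITES = {"N", "Z", "C"}      # CMP updates N, Z and C

stall_after_ldx = must_stall(ADC_READS, [LDX_IMM_WRITES])
stall_after_cmp = must_stall(ADC_READS, [CMP_WRITES])
stall_after_ldx_coarse = must_stall(coarse(ADC_READS), [coarse(LDX_IMM_WRITES)])
```

With per-flag tracking the ADC only stalls behind the CMP (which will change C), not behind the LDX. With a single shared status register it would stall behind both, which is the inefficiency the separate logic avoids.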

This has a theoretical maximum of 4 instructions per cycle and a minimum of 1/3 instruction per cycle (if only indirect addressing instructions were executed), so the performance would vary very widely.
In addition, I still haven't figured out how branches would be done in such a system. Any pipeline stage should be able to execute a branch, but this will cause many problems. The decoder would also have to be somewhat complicated, which will be a problem too.

So there are potentially a lot of issues, but I think it's a nice idea for improving the good old 6502's performance. I know it's just a theoretical wall of text right now, but I'll draw a schematic diagram.
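The two extremes quoted above fall out of a very crude bound: at most 4 instructions can issue per cycle, and the single memory-capable stage handles one 8-bit data access per cycle. A sketch, with `ipc_bound` as a made-up helper that ignores register and flag stalls as well as branches:

```python
from fractions import Fraction
from math import ceil


def ipc_bound(mem_accesses, width=4):
    """Upper bound on instructions per cycle for a non-empty block.

    mem_accesses holds the number of data accesses (0..3) per instruction.
    """
    n = len(mem_accesses)
    cycles = max(ceil(n / width), sum(mem_accesses))
    return Fraction(n, cycles)


best = ipc_bound([0, 0, 0, 0])   # four register-only instructions
worst = ipc_bound([3, 3, 3])     # only indirect addressing instructions
mixed = ipc_bound([0, 1, 0, 3])  # e.g. INX / LDA abs / TAY / STA (zp),Y
```

This gives 4 for the register-only block, 1/3 for the all-indirect block, and 1 for the mixed block, so real code would land somewhere between the two extremes depending on how memory-heavy it is.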

LGB · Post by **LGB** » Sun Dec 01, 2013 5:03 pm

Windfall wrote:

Dear 6502 core writers

Say, you can do 32-bit wide memory accesses. Why not exploit this in the core ? Gets rid of all those pesky byte-wide instruction fetches. Instant speedup. E.g. simply replace every opcode fetch with a 32-bit memory read. Combine result with leftover bytes from any previous read (shift register ? swap between two 32-bit registers and multiplex ?). Voila, opcode fetches have become instruction fetches. Argument bytes instantly available. E.g. LDA abs reduces from 'opcode fetch, low byte fetch, high byte fetch, access' to 'instruction fetch, access'.

Who's first ?

I guess Jeri's C64DTV uses this scheme: the memory is 32 bits wide, and if you enable a single bit in the "processor control" special register, the "65J02" core executes opcodes that can be represented by a single 32-bit "word" in a much shorter time. Sometimes it's even worth putting a NOP into the program code just to have 32-bit-aligned opcodes. As far as I remember, it's even specified that if an opcode fits into a 32-bit word and no other memory access is needed, then it will be executed in a single clock cycle.

Having 32-bit-wide memory is also useful if someone wants to implement DMA (for I/O, memory-to-memory copy or even for display hardware, e.g. generating a VGA/TV signal): a single memory access can produce 32 bits instead of 8. Even with an 8-bit CPU without any 32-bit "hack", some special hardware (DMA/video/etc.) can at least benefit from the larger data bus size! It can be important if VGA/TV signal generation and video RAM access would conflict too much with the CPU's memory access, I think.

I am thinking of using good old 32-bit (36 with parity?) memory, e.g. SIMM modules, for an SBC, because they're really cheap, it's a single memory module, and the speed must be OK with features like hidden refresh, etc., though I don't know much about using DRAMs in practice ...
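The alignment condition, as I understand it from the description above (this is a model of that description only, not verified C64DTV behaviour, and the function names are made up), can be sketched like this:

```python
def fits_in_word(addr, length):
    """True if an instruction of `length` bytes starting at `addr`
    lies entirely inside one aligned 32-bit word."""
    return (addr & 3) + length <= 4


def nop_padding(addr, length):
    """Number of 1-byte NOPs to insert before the instruction so that
    it no longer straddles a 32-bit word boundary."""
    return 0 if fits_in_word(addr, length) else 4 - (addr & 3)


aligned_ok = fits_in_word(0x1000, 3)   # 3-byte instruction at a word start
straddles = fits_in_word(0x1002, 3)    # crosses into the next word
pad = nop_padding(0x1002, 3)           # NOPs needed to push it to 0x1004
```

A 3-byte instruction at $1002 straddles two words, and two NOPs move it to $1004. Whether the padding pays off depends on how often the code runs, since the NOPs cost time too.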

Windfall · Post by **Windfall** » Sat Dec 07, 2013 10:40 am

LGB wrote:

I guess Jeri's C64DTV uses this scheme: the memory is 32 bits wide, and if you enable a single bit in the "processor control" special register, the "65J02" core executes opcodes that can be represented by a single 32-bit "word" in a much shorter time. Sometimes it's even worth putting a NOP into the program code just to have 32-bit-aligned opcodes. As far as I remember, it's even specified that if an opcode fits into a 32-bit word and no other memory access is needed, then it will be executed in a single clock cycle.

Well, I guess I'm in good company, then.

Although what I had in mind is basically a 64-bit cache with always at least 5 instruction bytes in it, including all of the current instruction.

Sadly, the noise of all the wannabees and knowitalls quickly drowned out a proper discussion. That kind of killed it for me.

LGB wrote:

I am thinking of using good old 32-bit (36 with parity?) memory, e.g. SIMM modules, for an SBC, because they're really cheap, it's a single memory module, and the speed must be OK with features like hidden refresh, etc., though I don't know much about using DRAMs in practice ...

You should know that SIMMs also come with SRAM. I bought a set recently, thinking I might do something retro with them. Very cheaply too (E0.10 per 4 MB module). DRAM is such a hassle, and for retro purposes (i.e. low memory demands) often unnecessary.

Klaus2m5 · Post by **Klaus2m5** » Sat Dec 07, 2013 11:07 am

Windfall wrote:

You should know that SIMMs also come with SRAM. I bought a set recently, thinking I might do something retro with them. Very cheaply too (E0.10 per 4 MB module). DRAM is such a hassle, and for retro purposes (i.e. low memory demands) often unnecessary.

SIMMs are DRAM! The S in SIMM stands for single, referring to its double-sided connector carrying the same signals on both sides. SIMMs are made of FPM or EDO DRAMs.

Windfall · Post by **Windfall** » Sat Dec 07, 2013 11:36 am

Klaus2m5 wrote:

SIMMs are DRAM! The S in SIMM stands for single, referring to its double-sided connector carrying the same signals on both sides. SIMMs are made of FPM or EDO DRAMs.

Not all of them. Like I said, they also come with SRAM. This is the one I have (although mine is 4 MB) :

http://pdf1.alldatasheet.com/datasheet- ... S2229.html

I didn't say they're as easy to obtain as DRAM SIMMs. But nevertheless they exist.

GARTHWILSON · Post by **GARTHWILSON** » Sat Dec 07, 2013 4:19 pm

My 4Mx8 5V 10ns SRAM module is easy-to-use SRAM, and the module plugs into sockets that go into normal perfboard with the holes on .100" centers. There's forum discussion on it at viewtopic.php?f=4&t=1908

BigDumbDinosaur · Post by **BigDumbDinosaur** » Sat Dec 07, 2013 7:23 pm

Windfall wrote:

Not all of them. Like I said, they also come with SRAM. This is the one I have (although mine is 4 MB) :

http://pdf1.alldatasheet.com/datasheet- ... S2229.html

I didn't say they're as easy to obtain as DRAM SIMMs. But nevertheless they exist.

You should be aware that that Dallas part is quite slow at 85ns, and is also a 16 bit part, which would complicate its adaptation to a 6502 bus. I wouldn't use it.

If you really need a large quantity of RAM look at Garth's 4MB DIMM, which is built with 10ns SRAMs and is fast enough to stay with a 65C816 running at 20 MHz. The next version of my POC series will have one of these modules, along with an Atmel 1508as CPLD to handle the glue logic.

Windfall · Post by **Windfall** » Sat Dec 07, 2013 9:06 pm

BigDumbDinosaur wrote:

You should be aware that that Dallas part is quite slow at 85ns, and is also a 16 bit part, which would complicate its adaptation to a 6502 bus. I wouldn't use it.

I do not suggest using that particular one. For a bit of warpspeed, you would obviously need to get faster SRAM. But the attraction of the form factor stays. Four sockets. 'Up to 16 MB of RAM !'. That sort of thing.

BigDumbDinosaur wrote:

If you really need a large quantity of RAM look at Garth's 4MB DIMM, which is built with 10ns SRAMs and is fast enough to stay with a 65C816 running at 20 MHz. The next version of my POC series will have one of these modules, along with an Atmel 1508as CPLD to handle the glue logic.

Should be good. At 10 ns I'd have some reservations about an intermediate connector and relatively long tracking, though.
