32 is the new 8-bit
Re: 32 is the new 8-bit
enso wrote:
OK, windfall's idea is to use 32-bit memory. Fetch 32 bits into a 4-byte buffer, then use it as a cache. However, the problems I see are
- Why are you using 32-bit memory with an 8 or 16-bit processor anyway?
enso wrote:
- First read will cost a cycle (with SRAM), unless a bypass of some kind is implemented.
enso wrote:
- Subsequent reads will not be aligned, requiring a delay to read more if instruction does not fit.
enso wrote:
- The complexity of this prefetch circuit greatly overshadows any benefit. It is likely to slow down the fetching, not accelerate it.
enso wrote:
What exactly are you trying to gain here?
Re: 32 is the new 8-bit
Arlet wrote:
It would be very challenging to rewrite the classic 6502 core so it would use fewer cycles for the same instructions.
Arlet wrote:
That much effort would be better spent on a 16 or 32 bit processor to start with.
Arlet wrote:
Another option, which would be easier to implement, is to use a cache inside the FPGA. This cache could have an 8 bit interface to the 6502 core, but a 16 or 32 bit interface to external memory, reducing the amount of memory cycles it would take to read/write a cache line. This would work nicely in combination with SDRAM, and video access. It would increase memory bandwidth, but without complicating the 6502 core itself.
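Arlet's wide-bus cache idea comes down to simple arithmetic: the cycles needed to fill a cache line scale inversely with the external bus width. A quick Python sketch (the 16-byte line size and bus widths are illustrative assumptions, not from any specific design):

```python
# Back-of-envelope model: memory cycles to fill one cache line over
# buses of different widths. Line size is an assumed example value.
LINE_BYTES = 16

def fill_cycles(bus_width_bytes, line_bytes=LINE_BYTES):
    """Memory cycles to read one cache line over a bus of the given width."""
    return line_bytes // bus_width_bytes

print(fill_cycles(1))  # 8-bit external bus: 16 cycles per line
print(fill_cycles(4))  # 32-bit external bus: 4 cycles per line
```

So a 32-bit path to external memory cuts line-fill traffic by 4x without the 6502 core itself ever seeing anything but bytes.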
Last edited by Windfall on Wed Jun 05, 2013 9:07 am, edited 1 time in total.
- GARTHWILSON
- Forum Moderator
- Posts: 8775
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
Re: 32 is the new 8-bit
So, if I understand you correctly, you're suggesting wasting bytes in the 32-bit word if necessary to keep 2- and 3-byte instructions from crossing word boundaries, since memory is cheap enough now to do this to get better processor performance. Would a word be able to handle all combinations of multiple instructions that total no more than four bytes?
Edit: Actually, with less complexity, it might be worth doing this for ROM working with off-the-shelf processors first, since RAM is available in something like 6-8ns but ROM bottoms out at about 45ns, and although ROM would still have branches and jumps, it would not have any variable space, stack space, etc. OTOH it could still have tables and definitely vectors, and an easier way to handle all that would probably be to copy it to faster RAM at boot-up and then switch the ROM out, something we've discussed before.
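The padding scheme Garth describes can be sketched as a tiny packer: whenever the next instruction would straddle a 32-bit word boundary, skip ahead to the next word (the skipped bytes could hold NOPs). This is an illustrative model, not anyone's actual tool; instruction lengths are just example values.

```python
# Hypothetical instruction packer: given instruction lengths in bytes
# (1-3 on a 6502), place each one so it never crosses a 4-byte word
# boundary, padding to the next word when it would.
WORD = 4

def pack(lengths):
    """Return the start address of each instruction after padding."""
    addr = 0
    starts = []
    for n in lengths:
        # would the last byte land in a different word than the first?
        if addr // WORD != (addr + n - 1) // WORD:
            addr = (addr // WORD + 1) * WORD  # pad to next word boundary
        starts.append(addr)
        addr += n
    return starts

print(pack([3, 2, 3]))  # second instruction would straddle, so it pads
print(pack([1, 3]))     # 1+3 bytes fit in one word with no padding
```

As to Garth's question: since no 6502 instruction exceeds 3 bytes, any combination totalling at most 4 bytes (1+3, 2+2, 1+1+2, four 1-byte ops, etc.) fits in one word without padding.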
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
Re: 32 is the new 8-bit
GARTHWILSON wrote:
So, if I understand you correctly, you're suggesting wasting bytes in the 32-bit word if necessary to keep 2- and 3-byte instructions from crossing word boundaries, since memory is cheap enough now to do this to get better processor performance.
Re: 32 is the new 8-bit
Windfall wrote:
Can you explain why? As far as I can see, you can simply build the tiny memory cache and then eliminate all the post-opcode instruction byte fetches.
Quote:
Cache inside an FPGA is rather pointless if you're already using its RAM directly. I do (see above for a link).
Re: 32 is the new 8-bit
Arlet wrote:
Windfall wrote:
Can you explain why? As far as I can see, you can simply build the tiny memory cache and then eliminate all the post-opcode instruction byte fetches.
Arlet wrote:
Also, you may be able to avoid the extra operand fetches, but if your core still does internal processing, you still need those cycles. For instance, looking at my own core, it would require a major rewrite to actually reduce the cycles.
E.g. LDA abs, which was opcode fetch, low byte fetch, high byte fetch, read target address, becomes instruction fetch, read target address.
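The saving in that LDA example generalizes: each eliminated operand fetch saves one cycle, so the wide-fetch cycle count is the classic count minus the operand byte count. A rough Python tally (classic counts are the documented 6502 cycle times; the "wide" column assumes the whole instruction arrives in one fetch, which is the claim under discussion, not an established result):

```python
# Classic 6502 cycle counts and operand byte counts for a few LDA modes.
classic_cycles = {"LDA abs": 4, "LDA zp": 3, "LDA #imm": 2}
operand_bytes  = {"LDA abs": 2, "LDA zp": 1, "LDA #imm": 1}

def wide_cycles(instr):
    """Cycles if opcode + operands arrive in a single instruction fetch:
    every operand byte fetch that the classic core spent a cycle on goes away."""
    return classic_cycles[instr] - operand_bytes[instr]

for i in classic_cycles:
    print(i, classic_cycles[i], "->", wide_cycles(i))
```

By this model LDA abs drops from 4 cycles to 2, matching the opcode/low/high/read vs. fetch/read sequences described above.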
Arlet wrote:
Quote:
Cache inside an FPGA is rather pointless if you're already using its RAM directly. I do (see above for a link).
Re: 32 is the new 8-bit
Windfall wrote:
i.e. there are no control flow changes
The whole idea seems rather complicated. You may be able to save a few cycles here and there, but at the cost of a lower clock frequency due to longer paths.
Re: 32 is the new 8-bit
Arlet wrote:
Windfall wrote:
i.e. there are no control flow changes
Arlet wrote:
The whole idea seems rather complicated. You may be able to save a few cycles here and there, but at the cost of a lower clock frequency due to longer paths.
- ElEctric_EyE
- Posts: 3260
- Joined: 02 Mar 2009
- Location: OH, USA
Re: 32 is the new 8-bit
Windfall wrote:
ElEctric_EyE wrote:
Please do not inject emotion into this area... We all desire knowledge and look down on emotional BS
Re: 32 is the new 8-bit
So, in a nutshell, my suggestion (to able and willing core writers) would be like this: I write the bolt-on, you use it to eliminate your opcode and argument byte registers, and as many argument-byte cycles as possible.
It'd be something like this (verilog-gy):
module memory_gateway
(
clock,
address,
request_instruction,
instruction,
acknowledge_instruction
);
input clock;
input [15:0] address;
input request_instruction;
output [23:0] instruction;
output acknowledge_instruction;
endmodule
I.e. on clock (whatever edge), if request_instruction, address == PC -> wait for acknowledge_instruction on next clock(s) -> instruction out (7:0 == opcode, 15:8 == next byte, 23:16 == next byte).
One could even combine this with other memory accesses, i.e. if NOT request_instruction, address == whatever data byte address -> next clock data out (7:0), and similar for writes. The memory gateway could then use 32-bit, 32-bit aligned accesses with byte enables, which would combine easily with the instruction fetches using same.
I doubt that there'd be any lengthening of paths in the core itself, since opcode + argument bytes are usually already in registers somewhere, and are only somewhere else, now. As long as the instruction fetch itself can be done in one cycle (including a bit of state changes and shifting bits around), there should be no performance penalty at all.
Left as an exercise for later: allow for zero-cycle instruction fetches in the core (if the instruction is already there, e.g. this will often be the case with single-byte instructions).
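The gateway's job can be modeled behaviorally (in Python rather than Verilog, purely to illustrate the byte packing; all names are made up here): memory holds 32-bit little-endian words, and an instruction fetch assembles 3 consecutive bytes starting at the PC, touching either one or two aligned words depending on alignment.

```python
# Behavioral model of the proposed memory_gateway's instruction fetch.
# words: list of 32-bit little-endian words; pc: byte address.
def fetch_instruction(words, pc):
    def byte(addr):
        # extract one byte from the aligned word containing addr
        return (words[addr // 4] >> (8 * (addr % 4))) & 0xFF
    b = [byte(pc + i) for i in range(3)]
    # how many distinct aligned words the 3 bytes touch (1 or 2)
    word_reads = len({(pc + i) // 4 for i in range(3)})
    # pack as in the post: 7:0 = opcode, 15:8 and 23:16 = next bytes
    return b[0] | (b[1] << 8) | (b[2] << 16), word_reads

mem = [0x44332211, 0x88776655]
print(hex(fetch_instruction(mem, 0)[0]))  # bytes 0x11,0x22,0x33 packed
print(fetch_instruction(mem, 2)[1])       # pc=2 spans two words: 2 reads
```

The two-word case is exactly where enso's alignment objection bites: an unaligned fetch needs a second memory cycle unless the gateway buffers the previous word.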
- BitWise
- In Memoriam
- Posts: 996
- Joined: 02 Mar 2004
- Location: Berkshire, UK
- Contact:
Re: 32 is the new 8-bit
Curiously I've been playing with a similar idea in my 65C816 emulator. I'm using a PIC32MX device which has a MIPS M4K core. The MIPS processor normally demands aligned word accesses, but the instruction set contains opcodes that allow the upper and lower portions of a register to be loaded/stored from consecutive words, depending on where the address falls between aligned values.
Fetching the entire instruction as a single word and then extracting the opcode and operands from it is much quicker than accessing the memory with individual byte accesses, but it does raise questions of emulation accuracy at edge cases. I don't think many programs would be coded to wrap around within a bank (it's possible on banks other than zero, where there aren't reserved areas for vectors at the end), but data accesses on the direct page certainly can make use of the wraparounds that occur in indexed address calculation.
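The wraparound BitWise mentions is exactly what a naive linear multi-byte read gets wrong. A minimal sketch of the indexed-address arithmetic (the 16-bit mask models wrap within bank 0; the precise wrap behavior on a 65C816 depends on native vs. emulation mode and the direct-page register, so treat this as illustrative only):

```python
# Indexed effective-address calculation with 16-bit wraparound, as a
# 6502-family direct-page indexed access can produce. A fetcher that
# just reads the next linear bytes would land at 0x10010 instead.
def effective_addr(direct, offset, index):
    return (direct + offset + index) & 0xFFFF  # wraps within the 64K bank

print(hex(effective_addr(0xFF00, 0xF0, 0x20)))  # wraps past 0xFFFF
```

So a word-at-a-time emulator has to special-case addresses near the top of the bank (or page) rather than blindly reading ahead.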
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs
Re: 32 is the new 8-bit
Windfall wrote:
I doubt that there'd be any lengthening of paths in the core itself, since opcode + argument bytes are usually already in registers somewhere, and are only somewhere else, now. As long as the instruction fetch itself can be done in one cycle (including a bit of state changes and shifting bits around), there should be no performance penalty at all.
Re: 32 is the new 8-bit
Why are we arguing about this?
If you are implementing an 8-bit core with 32-bit memory, you _have_ to do something like this to extract 8 bits at a time (although why use 32-bit memory when 64K of 8-bit SRAM is cheap?). Although I would suggest a mux, not a shift register.
With SRAM, instruction fetch on a 6502 is hardly a bottleneck requiring speeding up. 6502s are designed to be tightly bound to RAM.
With DRAM you will get refresh cycles, requiring you to run your DRAM controller at 100MHz to keep up reliably with a 2MHz 6502. A 4-byte buffer will not get you anything.
Caches of any size cause long reload delays, and any gains happen when loops fit inside them. A 4-byte cache is statistically an abomination that is guaranteed to lose speed and at best gain nothing compared to an SRAM. Except all the complexity for no reason.
Windfall, if you still insist that it's a good idea, really, without any resentment, 'go do it and tell us how it works out for you'. These are not words to discourage you - many great inventions happened while everyone said 'it will never work'. However, consider that your post describes a solution to a non-problem.
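The mux enso suggests is just a 4:1 byte select driven by the low two address bits. Modeled behaviorally in Python (a real implementation would be combinational logic in the gateway; this only shows the selection, not timing):

```python
# Byte mux: pick one byte out of a 32-bit little-endian word using the
# two low address bits, instead of shifting the word through a register.
def byte_mux(word32, addr):
    return (word32 >> (8 * (addr & 3))) & 0xFF

print(hex(byte_mux(0xDDCCBBAA, 0)))  # low byte
print(hex(byte_mux(0xDDCCBBAA, 3)))  # high byte
```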
Last edited by enso on Wed Jun 05, 2013 3:21 pm, edited 1 time in total.
In theory, there is no difference between theory and practice. In practice, there is. ...Jan van de Snepscheut
Re: 32 is the new 8-bit
Arlet wrote:
Windfall wrote:
I doubt that there'd be any lengthening of paths in the core itself, since opcode + argument bytes are usually already in registers somewhere, and are only somewhere else, now. As long as the instruction fetch itself can be done in one cycle (including a bit of state changes and shifting bits around), there should be no performance penalty at all.
You fetch an instruction instead of an opcode. The end result of the former is the entire instruction, in a register. The end result of the latter is just the opcode, in a register. There is no penalty. Same for data bytes.
Re: 32 is the new 8-bit
enso wrote:
Why are we arguing about this?
After reading all this bickering I am joining the 'go do it and tell us how it works out for you' camp.
If you are implementing an 8-bit core with 32-bit memory, you _have_ to do something like this to extract 8 bits at a time (although why use 32-bit memory when 64K of 8-bit SRAM is cheap?). If you are not, you don't care. Instruction fetch is hardly a bottleneck requiring speeding up. Either way there is little else to say.