32 is the new 8-bit
Re: 32 is the new 8-bit
enso wrote:
OK, windfall's idea is to use 32-bit memory. Fetch 32 bits into a 4-byte buffer, then use it as a cache. However, the problems I see are
- Why are you using 32-bit memory with an 8 or 16-bit processor anyway?
enso wrote:
- First read will cost a cycle (with SRAM), unless a bypass of some kind is implemented.
enso wrote:
- Subsequent reads will not be aligned, requiring a delay to read more if instruction does not fit.
enso wrote:
- The complexity of this prefetch circuit greatly overshadows any benefit. It is likely to slow down the fetching, not accelerate it.
enso wrote:
What exactly are you trying to gain here?
Re: 32 is the new 8-bit
Arlet wrote:
It would be very challenging to rewrite the classic 6502 core so it would use fewer cycles for the same instructions.
Arlet wrote:
That much effort would be better spent on a 16 or 32 bit processor to start with.
Arlet wrote:
Another option, which would be easier to implement, is to use a cache inside the FPGA. This cache could have an 8 bit interface to the 6502 core, but a 16 or 32 bit interface to external memory, reducing the amount of memory cycles it would take to read/write a cache line. This would work nicely in combination with SDRAM, and video access. It would increase memory bandwidth, but without complicating the 6502 core itself.
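Arlet's wide-bus cache idea comes down to simple arithmetic: the cycles needed to fill a cache line scale inversely with the external bus width. A quick Python sketch (the 16-byte line size and bus widths are illustrative assumptions, not from any specific design):

```python
# Back-of-envelope model: memory cycles to fill one cache line over
# buses of different widths. Line size is an assumed example value.
LINE_BYTES = 16

def fill_cycles(bus_width_bytes, line_bytes=LINE_BYTES):
    """Memory cycles to read one cache line over a bus of the given width."""
    return line_bytes // bus_width_bytes

print(fill_cycles(1))  # 8-bit external bus: 16 cycles per line
print(fill_cycles(4))  # 32-bit external bus: 4 cycles per line
```

So a 32-bit path to external memory cuts line-fill traffic by 4x without the 6502 core itself ever seeing anything but bytes.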
Last edited by Windfall on Wed Jun 05, 2013 9:07 am, edited 1 time in total.
- GARTHWILSON
- Forum Moderator
- Posts: 8775
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
Re: 32 is the new 8-bit
So, if I understand you correctly, you're suggesting wasting bytes in the 32-bit word if necessary to keep 2- and 3-byte instructions from crossing word boundaries, since memory is cheap enough now to do this to get better processor performance. Would a word be able to handle all combinations of multiple instructions that total no more than four bytes?
Edit: Actually, with less complexity, it might be worth doing this for ROM working with off-the-shelf processors first, since RAM is available in something like 6-8ns but ROM bottoms out at about 45ns, and although ROM would still have branches and jumps, it would not have any variable space, stack space, etc. OTOH it could still have tables and definitely vectors, and an easier way to handle all that would probably be to copy it to faster RAM at boot-up and then switch the ROM out, something we've discussed before.
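The padding scheme Garth describes can be sketched as a tiny packer: whenever the next instruction would straddle a 32-bit word boundary, skip ahead to the next word (the skipped bytes could hold NOPs). This is an illustrative model, not anyone's actual tool; instruction lengths are just example values.

```python
# Hypothetical instruction packer: given instruction lengths in bytes
# (1-3 on a 6502), place each one so it never crosses a 4-byte word
# boundary, padding to the next word when it would.
WORD = 4

def pack(lengths):
    """Return the start address of each instruction after padding."""
    addr = 0
    starts = []
    for n in lengths:
        # would the last byte land in a different word than the first?
        if addr // WORD != (addr + n - 1) // WORD:
            addr = (addr // WORD + 1) * WORD  # pad to next word boundary
        starts.append(addr)
        addr += n
    return starts

print(pack([3, 2, 3]))  # second instruction would straddle, so it pads
print(pack([1, 3]))     # 1+3 bytes fit in one word with no padding
```

As to Garth's question: since no 6502 instruction exceeds 3 bytes, any combination totalling at most 4 bytes (1+3, 2+2, 1+1+2, four 1-byte ops, etc.) fits in one word without padding.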
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
Re: 32 is the new 8-bit
GARTHWILSON wrote:
So, if I understand you correctly, you're suggesting wasting bytes in the 32-bit word if necessary to keep 2- and 3-byte instructions from crossing word boundaries, since memory is cheap enough now to do this to get better processor performance.
Re: 32 is the new 8-bit
Windfall wrote:
Can you explain why? As far as I can see, you can simply build the tiny memory cache and then eliminate all the post-opcode instruction byte fetches.
Quote:
Cache inside an FPGA is rather pointless if you're already using its RAM directly. I do (see above for a link).
Re: 32 is the new 8-bit
Arlet wrote:
Windfall wrote:
Can you explain why? As far as I can see, you can simply build the tiny memory cache and then eliminate all the post-opcode instruction byte fetches.
Arlet wrote:
Also, you may be able to avoid the extra operand fetches, but if your core still does internal processing, you still need those cycles. For instance, looking at my own core, it would require a major rewrite to actually reduce the cycles.
E.g. LDA abs, which was opcode fetch, low byte fetch, high byte fetch, read target address, becomes instruction fetch, read target address.
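The saving in that LDA example generalizes: each eliminated operand fetch saves one cycle, so the wide-fetch cycle count is the classic count minus the operand byte count. A rough Python tally (classic counts are the documented 6502 cycle times; the "wide" column assumes the whole instruction arrives in one fetch, which is the claim under discussion, not an established result):

```python
# Classic 6502 cycle counts and operand byte counts for a few LDA modes.
classic_cycles = {"LDA abs": 4, "LDA zp": 3, "LDA #imm": 2}
operand_bytes  = {"LDA abs": 2, "LDA zp": 1, "LDA #imm": 1}

def wide_cycles(instr):
    """Cycles if opcode + operands arrive in a single instruction fetch:
    every operand byte fetch that the classic core spent a cycle on goes away."""
    return classic_cycles[instr] - operand_bytes[instr]

for i in classic_cycles:
    print(i, classic_cycles[i], "->", wide_cycles(i))
```

By this model LDA abs drops from 4 cycles to 2, matching the opcode/low/high/read vs. fetch/read sequences described above.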
Arlet wrote:
Quote:
Cache inside an FPGA is rather pointless if you're already using its RAM directly. I do (see above for a link).
Re: 32 is the new 8-bit
Windfall wrote:
i.e. there are no control flow changes
The whole idea seems rather complicated. You may be able to save a few cycles here and there, but at the cost of a lower clock frequency due to longer paths.
Re: 32 is the new 8-bit
Arlet wrote:
Windfall wrote:
i.e. there are no control flow changes
Arlet wrote:
The whole idea seems rather complicated. You may be able to save a few cycles here and there, but at the cost of a lower clock frequency due to longer paths.
- ElEctric_EyE
- Posts: 3260
- Joined: 02 Mar 2009
- Location: OH, USA
Re: 32 is the new 8-bit
Windfall wrote:
ElEctric_EyE wrote:
Please do not inject emotion into this area... We all desire knowledge and look down on emotional BS
Re: 32 is the new 8-bit
So, in a nutshell, my suggestion (to able and willing core writers) would be like this: I write the bolt-on, you use it to eliminate your opcode and argument byte registers, and as many argument-byte cycles as possible.
It'd be something like this (verilog-gy):
module memory_gateway
(
clock,
address,
request_instruction,
instruction,
acknowledge_instruction
);
input clock;
input [15:0] address;
input request_instruction;
output [23:0] instruction;
output acknowledge_instruction;
endmodule
I.e. on clock (whatever edge), if request_instruction, address == PC -> wait for acknowledge_instruction on next clock(s) -> instruction out (7:0 == opcode, 15:8 == next byte, 23:16 == next byte).
One could even combine this with other memory accesses, i.e. if NOT request_instruction, address == whatever data byte address -> next clock data out (7:0), and similar for writes. The memory gateway could then use 32-bit, 32-bit aligned accesses with byte enables, which would combine easily with the instruction fetches using same.
I doubt that there'd be any lengthening of paths in the core itself, since opcode + argument bytes are usually already in registers somewhere, and are only somewhere else, now. As long as the instruction fetch itself can be done in one cycle (including a bit of state changes and shifting bits around), there should be no performance penalty at all.
Left as an exercise for later: allow for zero-cycle instruction fetches in the core (if the instruction is already there, e.g. this will often be the case with single-byte instructions).
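The gateway's job can be modeled behaviorally (in Python rather than Verilog, purely to illustrate the byte packing; all names are made up here): memory holds 32-bit little-endian words, and an instruction fetch assembles 3 consecutive bytes starting at the PC, touching either one or two aligned words depending on alignment.

```python
# Behavioral model of the proposed memory_gateway's instruction fetch.
# words: list of 32-bit little-endian words; pc: byte address.
def fetch_instruction(words, pc):
    def byte(addr):
        # extract one byte from the aligned word containing addr
        return (words[addr // 4] >> (8 * (addr % 4))) & 0xFF
    b = [byte(pc + i) for i in range(3)]
    # how many distinct aligned words the 3 bytes touch (1 or 2)
    word_reads = len({(pc + i) // 4 for i in range(3)})
    # pack as in the post: 7:0 = opcode, 15:8 and 23:16 = next bytes
    return b[0] | (b[1] << 8) | (b[2] << 16), word_reads

mem = [0x44332211, 0x88776655]
print(hex(fetch_instruction(mem, 0)[0]))  # bytes 0x11,0x22,0x33 packed
print(fetch_instruction(mem, 2)[1])       # pc=2 spans two words: 2 reads
```

The two-word case is exactly where enso's alignment objection bites: an unaligned fetch needs a second memory cycle unless the gateway buffers the previous word.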
- BitWise
- In Memoriam
- Posts: 996
- Joined: 02 Mar 2004
- Location: Berkshire, UK
- Contact:
Re: 32 is the new 8-bit
Curiously I've been playing with a similar idea in my 65C816 emulator. I'm using a PIC32MX device which has a MIPS M4K core. The MIPS processor normally demands aligned word accesses, but the instruction set contains opcodes that allow the upper and lower portions of a register to be loaded/stored from consecutive words, depending on where the address falls between aligned values.
Fetching the entire instruction as a single word and then extracting the opcode and operands from it is much quicker than accessing the memory with individual byte accesses, but it does raise questions of emulation accuracy at edge cases. I don't think many programs would be coded to wrap around within a bank (it's possible on banks other than zero, where there aren't reserved areas for vectors at the end), but data accesses on the direct page certainly can make use of the wraparounds that occur in indexed address calculation.
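The wraparound BitWise mentions is exactly what a naive linear multi-byte read gets wrong. A minimal sketch of the indexed-address arithmetic (the 16-bit mask models wrap within bank 0; the precise wrap behavior on a 65C816 depends on native vs. emulation mode and the direct-page register, so treat this as illustrative only):

```python
# Indexed effective-address calculation with 16-bit wraparound, as a
# 6502-family direct-page indexed access can produce. A fetcher that
# just reads the next linear bytes would land at 0x10010 instead.
def effective_addr(direct, offset, index):
    return (direct + offset + index) & 0xFFFF  # wraps within the 64K bank

print(hex(effective_addr(0xFF00, 0xF0, 0x20)))  # wraps past 0xFFFF
```

So a word-at-a-time emulator has to special-case addresses near the top of the bank (or page) rather than blindly reading ahead.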
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs
Re: 32 is the new 8-bit
Windfall wrote:
I doubt that there'd be any lengthening of paths in the core itself, since opcode + argument bytes are usually already in registers somewhere, and are only somewhere else, now. As long as the instruction fetch itself can be done in one cycle (including a bit of state changes and shifting bits around), there should be no performance penalty at all.
Re: 32 is the new 8-bit
Why are we arguing about this?
If you are implementing an 8-bit core with 32-bit memory, you _have_ to do something like this to extract 8 bits at a time (although why use 32-bit memory when 64K of 8-bit SRAM is cheap?). Although I would suggest a mux, not a shift register.
With SRAM, instruction fetch on a 6502 is hardly a bottleneck requiring speeding up. 6502s are designed to be tightly bound to RAM.
With DRAM you will get refresh cycles, requiring you to run your DRAM controller at 100MHz to keep up reliably with a 2MHz 6502. A 4-byte buffer will not get you anything.
Caches of any size cause long reload delays, and any gains happen when loops fit inside them. A 4-byte cache is statistically an abomination that is guaranteed to lose speed and at best gain nothing compared to an SRAM. Except all the complexity for no reason.
Windfall, if you still insist that it's a good idea, really, without any resentment, 'go do it and tell us how it works out for you'. These are not words to discourage you - many great inventions happened while everyone said 'it will never work'. However, consider that your post describes a solution to a non-problem.
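The mux enso suggests is just a 4:1 byte select driven by the low two address bits. Modeled behaviorally in Python (a real implementation would be combinational logic in the gateway; this only shows the selection, not timing):

```python
# Byte mux: pick one byte out of a 32-bit little-endian word using the
# two low address bits, instead of shifting the word through a register.
def byte_mux(word32, addr):
    return (word32 >> (8 * (addr & 3))) & 0xFF

print(hex(byte_mux(0xDDCCBBAA, 0)))  # low byte
print(hex(byte_mux(0xDDCCBBAA, 3)))  # high byte
```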
Last edited by enso on Wed Jun 05, 2013 3:21 pm, edited 1 time in total.
In theory, there is no difference between theory and practice. In practice, there is. ...Jan van de Snepscheut
Re: 32 is the new 8-bit
Arlet wrote:
Windfall wrote:
I doubt that there'd be any lengthening of paths in the core itself, since opcode + argument bytes are usually already in registers somewhere, and are only somewhere else, now. As long as the instruction fetch itself can be done in one cycle (including a bit of state changes and shifting bits around), there should be no performance penalty at all.
You fetch an instruction instead of an opcode. The end result of the former is the entire instruction, in a register. The end result of the latter is just the opcode, in a register. There is no penalty. Same for data bytes.
Re: 32 is the new 8-bit
enso wrote:
Why are we arguing about this?
After reading all this bickering I am joining the 'go do it and tell us how it works out for you' camp.
If you are implementing an 8-bit core with 32-bit memory, you _have_ to do something like this to extract 8 bits at a time (although why use 32-bit memory when 64K of 8-bit SRAM is cheap?). If you are not, you don't care. Instruction fetch is hardly a bottleneck requiring speeding up. Either way there is little else to say.