Hi Morita,
I've just seen your post, and it's great that you've found flaws in my (entirely theoretical) design, because I can learn from them. I can see you're experienced in this particular domain, so I'm interested.
As for the problems themselves...
Quote:
1) The cache is overbuilt
I guess it's too early to know. The only real reason for a cache is to provide multiple read ports, which are necessary if I hope to execute one 6502 instruction per cycle. As I said before, the first two read ports might or might not be fused somehow (since they will always read two consecutive addresses, mostly in ZP).
I'm not knowledgeable enough about caches to say more, but it looks like I've underestimated the problem, as in "well, I'll just throw more cache/more ports at it and that'll fix it". I'm now sure it's not that simple.
However, the dimensions of the cache can always be reduced, even if it hurts performance, to cut down area or FPGA usage. I'd agree that 80% for cache and 20% for logic is a bit crazy. But if somehow this configuration makes a more efficient processor, then it should not be overlooked. I said IF; I never said I proved that was the case.
Also, I'm not planning to do ASICs, just possibly an FPGA implementation. If I ever get loads of money and loads of potential customers, then why not an ASIC, but hell, the probability of such a thing happening is ridiculously low. An FPGA has SRAM cells whether you use them or not, so using them is good. Duplicating read ports is as simple as duplicating the SRAM itself.
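To illustrate (in software, since I haven't written any HDL for this), here's a minimal sketch of that duplication trick: two RAM copies written in lockstep behave like one memory with two independent read ports. All the names here are made up for the example.

```python
# Sketch of the "duplicate the SRAM" trick: a 2-read/1-write memory
# built from two single-read-port RAM copies written in lockstep.

class DualReadRam:
    def __init__(self, size):
        self.bank_a = [0] * size  # copy serving read port 0
        self.bank_b = [0] * size  # copy serving read port 1

    def write(self, addr, value):
        # Single write port: both copies are updated together,
        # so their contents are always identical.
        self.bank_a[addr] = value
        self.bank_b[addr] = value

    def read2(self, addr0, addr1):
        # Two independent read addresses served in the same "cycle".
        return self.bank_a[addr0], self.bank_b[addr1]

ram = DualReadRam(256)
ram.write(0x12, 0x34)
ram.write(0x13, 0x56)
print(ram.read2(0x12, 0x13))  # the two values written above, in one read
```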
Quote:
2) Basic pipeline has terrible performance
The following parameters can/should be taken into account for branch prediction:
* Which instruction it is (BPL, BMI, etc.)
* Which direction it branches (forwards or backwards)
* What the previous instruction's opcode is
If a thorough analysis is made of a large sample of existing 6502 code, then relatively efficient static branch prediction becomes possible without any cache or SRAM. This doesn't remove the penalty when it guesses wrong, but at least it will guess wrong rarely enough.
For example, a DEX followed by a backwards BNE is more likely to be taken than not, don't you agree?
Of course, if someone somehow manages to write a loop with a backwards "dex bne" that is hardly ever taken, then it will lose ~5 cycles every time, which is terrible. But experience tells me that any processor improvement has some pathological case that makes it worse than the original processor; if that weren't acceptable, pipelines and caches would not exist. It's kind of a hardware equivalent of Huffman compression (for those who see what I'm talking about).
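As a sketch of what such a static predictor could look like, using exactly the three inputs listed above (the rules here are illustrative; a real decision table would come from profiling a large corpus of 6502 code):

```python
# Illustrative static branch predictor for the 6502: no cache, no SRAM,
# just combinational rules over opcode, direction, and previous opcode.

def predict_taken(opcode, backwards, prev_opcode):
    """Return True if the branch is statically predicted taken."""
    # Classic counted loop: DEX/DEY/INX/INY followed by a backwards BNE
    # is almost always taken until the counter finally expires.
    if opcode == "BNE" and backwards and prev_opcode in ("DEX", "DEY", "INX", "INY"):
        return True
    # Generic fallback heuristic: backwards taken, forwards not taken.
    return backwards

print(predict_taken("BNE", True, "DEX"))   # True: counted-loop idiom
print(predict_taken("BEQ", False, "CMP"))  # False: forwards defaults to not-taken
```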
Finally, I think the branch could be resolved at the "operand & fetch" stage by forwarding ("backwarding") the new flag values from the "ALU & flags" stage. This would make the penalty for a wrong prediction about 4 cycles, not much worse than the original 6502's 3 cycles. I know this forwarding technique works because I used it in my ARM processor; however, it can easily become the critical path, since the signals propagate through the entire ALU and only then through the entire branch logic of the previous stage, all within the same clock cycle.
Quote:
3) Bad assumptions made about FIFO
FIFO is just a general concept which means "First In, First Out".
What I mean is a unit that will always provide the 4 bytes following the PC, and that can increment the PC by 1 to 4 bytes every cycle. Since it's possible to read memory in 32-bit words (4 bytes), such a thing is possible (and it was the point of this thread before this guy started getting insulted by everyone). It is not necessarily simple, but it is possible. I call this a FIFO because it "eats bytes in advance" from memory and outputs them at the desired pace. If another name is better suited, just say so.
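A rough software model of what I have in mind (assuming a word-aligned starting PC for simplicity; the class and method names are mine, not part of any design):

```python
# Model of the prefetch unit: it fetches 32-bit words (4 bytes) ahead
# of the PC and always exposes the next 4 bytes, while the decoder
# consumes 1 to 4 bytes per cycle depending on instruction length.

class PrefetchQueue:
    def __init__(self, memory, pc):
        self.memory = memory
        self.fetch_addr = pc   # next word-aligned address to fetch
        self.buffer = []       # bytes queued ahead of the PC

    def _refill(self):
        # Fetch one 4-byte word at a time until >= 4 bytes are queued.
        while len(self.buffer) < 4:
            self.buffer.extend(self.memory[self.fetch_addr:self.fetch_addr + 4])
            self.fetch_addr += 4

    def peek4(self):
        # The 4 bytes following the PC, available every cycle.
        self._refill()
        return self.buffer[:4]

    def consume(self, n):
        # Advance the PC by the instruction's length (1 to 4 bytes).
        assert 1 <= n <= 4
        del self.buffer[:n]

mem = list(range(32))        # dummy memory contents
q = PrefetchQueue(mem, 0)
print(q.peek4())   # [0, 1, 2, 3]
q.consume(3)       # e.g. a 3-byte instruction was decoded
print(q.peek4())   # [3, 4, 5, 6]
```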
Quote:
4) Bad assumption made about Operand Fetch pipeline stage
The result of $1234 + X is available at the end of the cycle. The result of the D-cache read needs to be available at the start of the next clock cycle. This gives the D-cache less than one clock to return the value.
Exactly one clock, no? I mean, it's fairly classical to set the address lines on cycle X and have the data ready on cycle X+1.
However, it can be a problem if the cache is not direct-mapped; then either some trickery or extra dummy pipeline stages are required. I already mentioned this in my previous post, so I won't repeat it. It's probably one of the bigger problems to solve, though.
Quote:
5) Missing bypass from the Writeback stage
We mentioned that the A, X, Y and S registers are asynchronous, because it's not worth using SRAM for only 32 bits of storage. This removes the need for a bypass from the writeback stage - the registers themselves are the bypass.
Quote:
Fundamentally, ISAs with (memory) read-modify-write instructions are not well-suited for a pipelined processor design without weird kludges like breaking into micro-ops.
This is why CISC ISAs are mostly dead.
The reason for this is:
1) RMW instructions require extra pipeline stages
2) Extra pipeline stages increases the branch penalty
3) The amount of performance gained through RMW instructions (typically 2-5%) is much less than the performance loss through the increased branch penalty (>10%)
I believe you; however, such a technique has so far only been tried with the x86 and 68000, as far as I know. Never with the 6502, which, while CISC, is QUITE different from the latter two in terms of instruction set and addressing modes.
Therefore, yes, of course a pipelined implementation of such an ISA may suck completely, but we can't know for sure without trying.
1) has been solved by me (it's only 5 stages so far, while some RISCs have more than 10...), 2) can be solved with smart branch prediction, and 3) can't really be known without further analysis.
I really thought about using RISC microcode, but I came to the conclusion that it would suck in the 6502's case. Why? Let's consider an instruction like adc ($12),Y.
It would translate into something like this (assuming 16-bit registers):
Code:
mov r0, #$12   ; ZP address of the pointer
ldr r1, [r0]   ; fetch the 16-bit pointer from ZP
add r1, ry     ; add the Y index
ldr r0, [r1]   ; fetch the operand byte
adcs ra, r0    ; add with carry into A, updating the flags
This results in 5 RISC instructions and would require 7 cycles to complete, even if the inner processor is multi-issue, because each instruction depends on the previous one (an ldr usually takes at least 2 cycles on pipelined RISCs if its result is used by the next instruction).
This is no faster than the original 6502; in other words, this sucks, as it would overly complicate the processor for no performance gain (except a higher clock speed, which can also be attained without any inner RISC machine or microcode).
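A quick sanity check of that cycle count, under the stated assumption that a load whose result feeds the next instruction costs 2 cycles and everything else costs 1:

```python
# Cycle count of the fully dependent microcode chain for adc ($12),Y.
# Latencies are the assumption stated above, not measured numbers.
LATENCY = {"mov": 1, "ldr": 2, "add": 1, "adcs": 1}

seq = ["mov", "ldr", "add", "ldr", "adcs"]  # each op depends on the previous one

# Since nothing can overlap, even a multi-issue core just sums the latencies.
total = sum(LATENCY[op] for op in seq)
print(total)  # 7
```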
Finally:
Quote:
A cache of this design can support a load-store ISA pipeline that can issue four instructions per clock. The current pipeline design can only issue one instruction per clock.
In fact, no: I'm planning to issue up to 4 instructions per clock, but only in special cases. Some "idioms" could be recognized, bypass the normal instruction decoding, and execute in a single clock.
For example, INX INX INX INX (which is used quite frequently) would translate into something that performs add x, #4 and takes a single cycle.
Of course, most of the time this won't apply, but it could boost performance a little more.
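A sketch of how such idiom recognition could work as a peephole over the decode window (the 4-instruction window and the fused operation's name are illustrative assumptions, not the actual design):

```python
# Peephole idiom fusion: collapse a run of INX at the front of the
# decode window into a single fused add-immediate on X.

def fuse_idioms(window):
    """window: up to 4 decoded opcodes; returns the ops to issue."""
    # Count the leading run of INX instructions.
    run = 0
    while run < len(window) and window[run] == "INX":
        run += 1
    if run >= 2:
        # Collapse the run into one fused op, issued in a single cycle.
        return [("ADD_X_IMM", run)] + [(op,) for op in window[run:]]
    return [(op,) for op in window]

print(fuse_idioms(["INX", "INX", "INX", "INX"]))  # [('ADD_X_IMM', 4)]
print(fuse_idioms(["INX", "LDA", "INX", "INX"]))  # a single leading INX is left alone
```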
Sorry for the wall of text, but I hope this makes my point clear.