6502.org • 65ORG16.b Core - Page 12


Posted: Sat Mar 17, 2012 11:34 am

by BigEd

Hi Arlet
thanks for the re-pipelined code!

Hi EEye
looking at your present code on github, I notice a few things:
- regsel is still a 3-bit signal, so I think you may have only 8 registers total
- A is hard coded as register 0, X as 4, Y as 5 and SP as 6. This is not a bad thing, as it presumably means one can do arithmetic on X, Y and SP. But it might be worth noting. Also you might want to move them around a bit.

Here's an idea: instead of dedicating 4 bits to the shift distance, and therefore having to restrict the variable-distance shift opcodes to just 4 registers, how about allowing just 2 bits of shift distance
00 : shift by 1
01 : shift by 2
10 : shift by 4
11 : shift by 8
and then you'll have room to apply the variable-shift distance to 8 registers.

The most common shifts can still be performed in a single operation, and other shifts can be performed in two or three operations (a shift by 15 would need four).
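As a rough illustration of the 2-bit scheme above, here is a small Python sketch (not part of the core; `SHIFT_CODES`, `shift_steps` and `shift_left` are made-up names) that splits a shift distance into the four encodable steps. It assumes plain left shifts on a 16-bit word. A distance of 15 comes out as four steps, since 15 = 8 + 4 + 2 + 1.

```python
# Hypothetical model of the proposed 2-bit shift-distance field.
# The field values 00, 01, 10, 11 select distances 1, 2, 4, 8.
SHIFT_CODES = {1: 0b00, 2: 0b01, 4: 0b10, 8: 0b11}


def shift_steps(distance):
    """Split a distance (0..15) into single-instruction steps, largest first."""
    if not 0 <= distance <= 15:
        raise ValueError("distance must be 0..15 for a 16-bit word")
    return [step for step in (8, 4, 2, 1) if distance & step]


def shift_left(value, distance, width=16):
    """Apply the steps one at a time, truncating to the register width."""
    mask = (1 << width) - 1
    for step in shift_steps(distance):
        value = (value << step) & mask
    return value
```

For example, `shift_steps(12)` gives `[8, 4]`, so a shift by 12 would take two instructions under this encoding.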

Cheers
Ed

Posted: Sat Mar 17, 2012 1:33 pm

by ElEctric_EyE

BigEd wrote:

Hi Arlet
thanks for the re-pipelined code!

Another thanks! I'll try to implement it in the .b core and measure the results.

BigEd wrote:

...looking at your present code on github, I notice a few things:
- regsel is still a 3-bit signal, so I think you may have only 8 registers total...

That is still at the stage of 3 Acc's. I've not updated GitHub to 16 Acc's yet. Thanks for checking it out though. Just a bit more testing...

BigEd wrote:

...Here's an idea: instead of dedicating 4 bits to the shift distance, and therefore having to restrict the variable-distance shift opcodes to just 4 registers, how about allowing just 2 bits of shift distance
00 : shift by 1
01 : shift by 2
10 : shift by 4
11 : shift by 8
and then you'll have room to apply the variable-shift distance to 8 registers.

The most common shifts can still be performed in a single operation, and other shifts can be performed in two or three operations.

Cheers
Ed

If I did that, IR[11:8] wouldn't follow the rule I have now for 16 Acc's:

Code:

For IR[15:0] 
16'bssdd_ssdd_xxxx_xxxx 

IR[15:14] = src_reg 
IR[13:12] = dst_reg 
IR[11:10] = src_reg 
IR[9:8] = dst_reg 
IR[7:0] = NMOS 6502 opcode
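To make the field layout above concrete, here is a hedged Python sketch of how a 16-bit instruction word could be split up. The post does not say how the two 2-bit halves of each register number are combined, so the ordering used here (IR[15:14] and IR[13:12] as the upper two bits, IR[11:10] and IR[9:8] as the lower two) is purely an assumption, and `decode_ir` is a made-up name.

```python
def decode_ir(ir):
    """Split a 16-bit word laid out as ssdd_ssdd_xxxx_xxxx.

    ASSUMPTION: the pair in the top nibble supplies the upper two bits of
    each 4-bit register number; the thread does not confirm this ordering.
    """
    src = (((ir >> 14) & 0x3) << 2) | ((ir >> 10) & 0x3)
    dst = (((ir >> 12) & 0x3) << 2) | ((ir >> 8) & 0x3)
    opcode = ir & 0xFF  # IR[7:0], the NMOS 6502 opcode
    return src, dst, opcode
```

With all eight upper bits at zero the word decodes to source 0, destination 0 and the plain 6502 opcode, which matches the statement that the A-to-A form is the original instruction.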

BigEd wrote:

...- A is hard coded as register 0, X as 4, Y as 5 and SP as 6. This is not a bad thing, as it presumably means one can do arithmetic on X, Y and SP. But it might be worth noting. Also you might want to move them around a bit...

What do you mean?

Posted: Sat Mar 17, 2012 2:23 pm

by BigEd

ElEctric_EyE wrote:

BigEd wrote:

how about allowing just 2 bits of shift distance

If I did that, IR[11:8] wouldn't follow the rule I have now for 16 Acc's:

Code:

For IR[15:0] 
16'bssdd_ssdd_xxxx_xxxx

How about mapping those groups of bits like this instead:

Code:

For IR[15:0] 
16'bsdsd_ssdd_xxxx_xxxx  ;  s and d are src and dst for non-shift
16'brrsd_ssdd_xxxx_xxxx  ;  rr is shift/rotate distance for shift ops

or like this

Code:

For IR[15:0] 
16'bsdss_sddd_xxxx_xxxx  ;  s and d are src and dst for non-shift
16'brrss_sddd_xxxx_xxxx  ;  rr is shift/rotate distance for shift ops
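A minimal Python sketch of decoding the second layout (`rrss_sddd_xxxx_xxxx`) for a shift instruction, reading the fields straight off the bit pattern: `rr` in IR[15:14], a 3-bit source in IR[13:11] and a 3-bit destination in IR[10:8]. The mapping of `rr` to distances 1, 2, 4 and 8 follows the earlier suggestion; the function and table names are invented for illustration.

```python
# rr field value -> shift/rotate distance, per the 2-bit proposal
SHIFT_DISTANCE = (1, 2, 4, 8)


def decode_shift_ir(ir):
    """Decode rrss_sddd_xxxx_xxxx into (distance, src, dst, opcode)."""
    rr = (ir >> 14) & 0x3    # IR[15:14]
    src = (ir >> 11) & 0x7   # IR[13:11]
    dst = (ir >> 8) & 0x7    # IR[10:8]
    opcode = ir & 0xFF       # IR[7:0]
    return SHIFT_DISTANCE[rr], src, dst, opcode
```

Note that this layout only reaches 8 registers for source and destination, which is the trade-off being proposed.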

Quote:

BigEd wrote:

...- A is hard coded as register 0, X as 4, Y as 5 and SP as 6. This is not a bad thing, as it presumably means one can do arithmetic on X, Y and SP. But it might be worth noting. Also you might want to move them around a bit...

What do you mean?

If you name your accumulators according to the alphabet, and we see which register numbers they land on, with the present github code we get

Code:

0 A
1 B
2 C
3 D
4 E also X
5 F also Y
6 G also SP
7 H

and similarly (but with more lines) in the 16-accumulator case. So, I'm suggesting you'd want to put X, Y and SP at the top end, where you have accumulators N, O, and P (or Q), which are registers 13, 14 and 15.

For example
TAN
will be the same thing as
TAX
which isn't very interesting, but adding something to D and placing the result in N will place it in X (they are different names for the same place), and this might well be useful.

(This is assuming that I've understood correctly the way your Verilog makes use of the register file)

Although it would be yet more work for the assembler, I'd also suggest R0 to R15 as synonyms for the accumulators (or registers) - that would allow a more regular form of assembly which would be less familiar to 6502 fans but might be easier to work with.
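The aliasing idea can be sketched as a name table in front of a 16-entry register file. This is only a Python model under the placement suggested above (X, Y and SP sharing slots 13, 14 and 15 with N, O and P); `REG_INDEX` and `RegFile` are hypothetical names, not anything from the core or its assembler.

```python
# Letter names A..P map to register numbers 0..15.
REG_INDEX = {chr(ord("A") + i): i for i in range(16)}
# Suggested aliases: X, Y, SP share the top three slots with N, O, P.
REG_INDEX.update({"X": 13, "Y": 14, "SP": 15})
# R0..R15 as regular synonyms for the same slots.
REG_INDEX.update({f"R{i}": i for i in range(16)})


class RegFile:
    """Sixteen 16-bit registers addressed by any of their names."""

    def __init__(self):
        self.regs = [0] * 16

    def read(self, name):
        return self.regs[REG_INDEX[name]]

    def write(self, name, value):
        self.regs[REG_INDEX[name]] = value & 0xFFFF

    def add(self, src, dst, operand):
        """dst <= src + operand, in the spirit of an 'add D, store in N' op."""
        self.write(dst, self.read(src) + operand)
```

After `rf.write("D", 0x100)` and `rf.add("D", "N", 0x20)`, reading `"X"` or `"R13"` returns the same value as reading `"N"`, which is the point of making the index registers aliases.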

Cheers
Ed

Posted: Sat Mar 17, 2012 3:20 pm

by ElEctric_EyE

This is what I have for the 16 Acc .b core now:

Code:

reg [4:0] regsel;                       // Select A thru P, X, Y or S register
wire [dw-1:0] regfile = PAXYS[regsel];  // Selected register output

parameter
    SEL_A = 5'd0,
    SEL_B = 5'd1,
    SEL_C = 5'd2,
    SEL_D = 5'd3,
    SEL_E = 5'd4,
    SEL_F = 5'd5,
    SEL_G = 5'd6,
    SEL_H = 5'd7,
    SEL_I = 5'd8,
    SEL_J = 5'd9,
    SEL_K = 5'd10,
    SEL_L = 5'd11,
    SEL_M = 5'd12,
    SEL_N = 5'd13,
    SEL_O = 5'd14,
    SEL_P = 5'd15,
    SEL_X = 5'd16,
    SEL_Y = 5'd17,
    SEL_S = 5'd18;

As far as changing values during a transfer, there's really no need for that. I've done what Arlet suggested to do that kind of thing during the actual math or logical operation. So it looks something like this:

Code:

ADCAopBi #$001F    ; Add $1F to A, store in B

An ADCAopAi is actually the original ADC #$xx (immediate).
Or:

Code:

EORAopBa $FFFFE000    ; EOR value in $FFFFE000 with A, store in B

So the bit structure is pretty much locked in at this point, especially since I've already done a lot of work typing out a good number of macros and running them through simulation.
Just so one can get a grasp on what I am doing: just as there were 256 opcode possibilities for all transfers between 16 Acc's (only 240 useful, since 16 of them transfer a register to itself), there are another 256 opcodes for each and every operation per addressing mode. 256 for ADCXopXi, 256 for ADCXopXa, 256 for ANDXopXi, etc.
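The counts quoted above can be checked by enumerating source/destination pairs. This is just arithmetic in Python, with the assumption that the 16 "not useful" transfer opcodes are the ones where source and destination are the same register.

```python
from itertools import product

NUM_ACC = 16

# Every (source, destination) combination of the 16 accumulators.
pairs = list(product(range(NUM_ACC), repeat=2))

# Transfers from a register to itself do nothing useful.
useful_transfers = [(src, dst) for src, dst in pairs if src != dst]

total_opcodes = len(pairs)              # 16 * 16 = 256
useful_count = len(useful_transfers)    # 256 - 16 = 240
```

For the arithmetic and logical operations all 256 combinations are meaningful, since source and destination being equal is the ordinary 6502 case.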

Also, back to the variable shift. If this cpu is working with audio for example and you wanted to lower the volume of a sample, there would be a need for a variable high speed shift such as this. I would imagine this is also the case with video.

Posted: Sat Mar 17, 2012 3:31 pm

by BigEd

Oh I see - X Y and S are now distinct from your accumulators. Do you see the advantage I mean in making them aliases for 3 of your accumulators? It makes these index registers less special and allows one to perform arithmetic on them.

Posted: Sat Mar 17, 2012 3:35 pm

by BigEd

On the question of variable shift, I'd actually be very interested to know what performance-critical code might need it. Perhaps some kind of decompression. Not volume adjustment, I don't think; that would more likely be a multiplication. (Adding a multiplier won't hurt speed.)

Cheers Ed

Posted: Sat Mar 17, 2012 3:43 pm

by ElEctric_EyE

BigEd wrote:

Oh I see - X Y and S are now distinct from your accumulators. Do you see the advantage I mean in making them aliases for 3 of your accumulators? It makes these index registers less special and allows one to perform arithmetic on them.

This is interesting. The first time you mentioned doing math on an index register it got me thinking... The value that acted on an index register should 'stick' I would think...I will have to come back to this at a later point. Could you give us an example?

As you know, I wanted to do variable shift really just for x4, x8, & x12, but I think I should keep the full 4 bits available.

Multiplication, as in 16x16, won't hurt speed? I have an idea I am going to try real soon: presenting the ALU with two 16-bit-wide registers and a multiply bit. There will also be a 32-bit (or 33-bit?) result register; I've just not had time to get around to it yet!

Posted: Sat Mar 17, 2012 4:03 pm

by Arlet

ElEctric_EyE wrote:

Multiplication, as in 16x16, won't hurt speed?

Probably not too much, if you do it right. If you change this code here:

Code:

AXYS[regsel] <= (state == JSR0) ? DIMUX : ADD;

into something like this:

Code:

AXYS[regsel] <= (state == JSR0) ? DIMUX : 
                              (state == MUL0) ? MUL : ADD;

This assumes an extra MUL0 state in the state machine, that's used in multiply instructions. The 'MUL' value is a register that contains the output of the multiply. Something like this:

Code:

always @(posedge clk)
    MUL <= AI * BI;

Because the output of the multiplier is stored in a register, the synthesis tools will use a pipelined version of the multiplier, which should be sufficiently fast.

If you put the multiply inside the ALU, I would expect a bigger drop in speed, because it can't pipeline it, and even though the multiplier itself is still pretty fast, you also have to consider the routing delays to and from the multipliers, which are only found in certain places on the FPGA.
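The timing of the registered multiplier can be pictured with a cycle-level Python model. This is a simplified sketch, not the core: the state names are plain strings, the register file is reduced to a single register that is written on every clock, and truncating the write-back to 16 bits is an assumption. What it does show is that the product captured in `MUL` on one clock edge is what the `MUL0` state writes back on the next one.

```python
MASK16 = 0xFFFF  # ASSUMPTION: 16-bit register width for the write-back


def writeback_value(state, dimux, mul, add):
    """Mirror of the nested ?: selection feeding the register file."""
    if state == "JSR0":
        return dimux
    if state == "MUL0":
        return mul
    return add


class Datapath:
    """One destination register plus the registered multiplier output."""

    def __init__(self):
        self.mul = 0   # models 'MUL <= AI * BI'
        self.reg = 0   # models the selected AXYS entry

    def clock(self, state, ai, bi, dimux=0, add=0):
        # Non-blocking semantics: both next values are computed from the
        # current state before either register is updated.
        next_reg = writeback_value(state, dimux, self.mul, add) & MASK16
        next_mul = ai * bi
        self.reg, self.mul = next_reg, next_mul
```

Clocking once with `ai=3, bi=5` in an ordinary state and then once in `"MUL0"` leaves 15 in the register: the multiply costs an extra cycle, but the product is not on the ALU path.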

Posted: Sat Mar 17, 2012 4:10 pm

by Arlet

By the way, for volume adjustments in audio, you also might want to consider audio channels with multipliers built into the hardware path. That way, you can write the audio samples into a sample register, and adjust the volume with a separate register.

Of course, having a multiply in the CPU is always useful. For instance, when doing video output address calculation, you'd have to multiply the Y value by the screen width.
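For the video case, the address calculation in question is the usual linear frame-buffer formula. A short Python sketch, with `base`, `width` and the function names chosen only for illustration:

```python
def pixel_address(base, x, y, width):
    """Address of pixel (x, y) in a linear frame buffer 'width' pixels wide."""
    return base + y * width + x


def pixel_address_pow2(base, x, y, log2_width):
    """Same calculation when the width is a power of two: a shift suffices."""
    return base + (y << log2_width) + x
```

With a power-of-two width the multiply collapses to a shift; for widths such as 640 a real multiply (or a pair of shifts and an add) is needed.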

Posted: Sat Mar 17, 2012 4:55 pm

by ElEctric_EyE

Arlet wrote:

...And secondly, it needs to be based on OUT rather than temp:

Code:

assign Z = ~|OUT;

A while ago you suggested making this change inside the cpu module:

Code:

assign AZ = ~|ADD;    // calculate the Z flag inside the cpu.v module

Which one is best?

Posted: Sat Mar 17, 2012 5:06 pm

by Arlet

I think changing it in the ALU module is more elegant, because it keeps all the flags together. For synthesis, it should be exactly the same.
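For readers less used to Verilog, `~|OUT` is a reduction NOR: it is 1 only when every bit of the value is 0. A small Python model, assuming a 16-bit result and using made-up function names, shows that this is the same as comparing the value with zero, wherever in the design it is computed.

```python
def reduction_nor(value, width=16):
    """Bit-by-bit model of Verilog's ~| over the low 'width' bits."""
    any_bit = 0
    for bit in range(width):
        any_bit |= (value >> bit) & 1
    return any_bit ^ 1


def z_flag(value, width=16):
    """Z flag as a plain compare-with-zero on the truncated result."""
    mask = (1 << width) - 1
    return int((value & mask) == 0)
```

Both functions agree for every input, so the choice between the two placements is about code organisation rather than logic.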

Posted: Sat Mar 17, 2012 9:35 pm

by ElEctric_EyE

I've implemented your changes and am up to 98MHz with 16 Acc .b core

... Still tightening constraints on slow laptop (still at work)

103MHz.. Work done. Time to head home and finalize this speed test on a real machine.

Posted: Sun Mar 18, 2012 12:24 am

by ElEctric_EyE

This speed test will have to wait a few days. On my main machine, the modified version of cpu.v and ALU.v is not even passing a 10.5ns constraint which is BS, since the laptop was good down to at least 9.8ns.

Will double check tomorrow.

On a side note, today I've added 512 opcodes for ADCXopXi and SBCXopXi. Testing will continue tomorrow on the 16Acc's.

Posted: Sun Mar 18, 2012 10:59 pm

by ElEctric_EyE

Testing the SBCXopXi opcodes today took all day and pointed out some big-time errors. I won't get specific because that would make for a lot of boring reading. But I got this working:

Code:

START             LDA #$0001     ;LDAi #$0001
                  LDBi
                  .BYTE $0002
                  LDCi
                  .BYTE $0003
                  LDDi
                  .BYTE $0004
                  LDEi
                  .BYTE $0005
                  LDFi
                  .BYTE $0006
                  LDGi
                  .BYTE $0007
                  LDHi
                  .BYTE $0008
                  LDIi
                  .BYTE $0009
                  LDJi
                  .BYTE $000A
                  LDKi
                  .BYTE $000B
                  LDLi
                  .BYTE $000C
                  LDMi
                  .BYTE $000D
                  LDNi
                  .BYTE $000E
                  LDOi
                  .BYTE $000F
                  LDPi
                  .BYTE $0010
                  SEC
AA                SBCPopPi
                  .BYTE $0001
                  BNE AA
BA                SBCOopOi
                  .BYTE $0001
                  BNE BA
CA                SBCNopNi
                  .BYTE $0001
                  BNE CA
DA                SBCMopMi
                  .BYTE $0001
                  BNE DA
EA                SBCLopLi
                  .BYTE $0001
                  BNE EA
                  
                  NOP
                  NOP
                  NOP

and I was able to get rid of the default: case statements.
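What the countdown loops in the listing above should do can be modelled in a few lines of Python. This sketch assumes the core's SBC behaves like the NMOS 6502 binary-mode SBC widened to 16 bits (result = register - operand - (1 - C), with C left set when there is no borrow) and that BNE loops until the result is zero; `sbc` and `countdown` are illustrative names.

```python
MASK = 0xFFFF


def sbc(acc, operand, carry):
    """16-bit subtract with borrow; returns (result, carry_out, zero)."""
    raw = acc - operand - (1 - carry)
    result = raw & MASK
    carry_out = int(raw >= 0)      # carry set means no borrow occurred
    zero = int(result == 0)
    return result, carry_out, zero


def countdown(start, carry=1):
    """Repeat 'SBC #1 / BNE' until zero; returns (iterations, final carry)."""
    acc, count = start, 0
    while True:
        acc, carry, zero = sbc(acc, 1, carry)
        count += 1
        if zero:
            return count, carry
```

Starting from the loaded values, the loop on P ($0010) should run 16 times and the loop on L ($000C) 12 times, and the carry stays set throughout, so the single SEC before the first loop is enough.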

I was debating whether to update the .b core GitHub posting using the 4Acc's, but the .b core has progressed to 16 Acc's and when all is finally "straightened out", it should be easy enough to scale back down to 4Acc's, with everything working and tested in the 16 Acc core. So, if one is following GitHub changes, there are going to be monstrous changes which will be difficult to follow. Now that I have 16 Acc's working, I'll post regular updates in the future.

Posted: Tue Mar 20, 2012 1:35 pm

by ElEctric_EyE

Progress continues, corrected many mistakes. There must be some errors still because when I try the core in my project it is not displaying characters correctly. I would have suspected an issue with the carry flag, or a shifting operation but it worked with the 4Acc Var shift version. So most likely, my opcode decoding is screwed up somewhere. Will troubleshoot some more...

Using SmartXplorer to push top speeds though:

Code:

Timing summary: 
 --------------- 
  
 Timing errors: 0  Score: 0  (Setup/Max: 0, Hold: 0) 
  
 Constraints cover 597244 paths, 0 nets, and 6385 connections 
  
 Design statistics: 
    Minimum period:   9.642ns{1}   (Maximum frequency: 103.713MHz) 
  
  
 ------------------------------------Footnotes----------------------------------- 
 1)  The minimum period statistic assumes all single cycle delays. 
  
 Analysis completed Tue Mar 20 21:15:13 2012  
 --------------------------------------------------------------------------------
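The frequency in the report is simply the reciprocal of the minimum period. A one-line Python helper (the name is invented) makes it easy to convert the constraint values mentioned in the thread:

```python
def max_frequency_mhz(period_ns):
    """Convert a minimum clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


# Periods mentioned in the thread: the report above and the two constraints.
for period in (9.642, 9.8, 10.5):
    print(f"{period} ns -> {max_frequency_mhz(period):.3f} MHz")
```

So the 10.5ns constraint that failed on the main machine corresponds to roughly 95MHz, against roughly 102MHz for the 9.8ns the laptop run achieved.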