6502.org • View topic - 65ORG16.b Core

All times are UTC

65ORG16.b Core

BigEd

Posted: Sat Mar 17, 2012 11:34 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England

Hi Arlet
thanks for the re-pipelined code!

Hi EEye
looking at your present code on github, I notice a few things:
- regsel is still a 3-bit signal, so I think you may have only 8 registers total
- A is hard coded as register 0, X as 4, Y as 5 and SP as 6. This is not a bad thing, as it presumably means one can do arithmetic on X, Y and SP. But it might be worth noting. Also you might want to move them around a bit.

Here's an idea: instead of dedicating 4 bits to the shift distance, and therefore having to restrict the variable-distance shift opcodes to just 4 registers, how about allowing just 2 bits of shift distance:
00 : shift by 1
01 : shift by 2
10 : shift by 4
11 : shift by 8
and then you'll have room to apply the variable-shift distance to 8 registers.

The most common shifts can still be performed in a single operation, and other shifts can be performed in two or three operations.
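A minimal sketch of the 2-bit distance field described above, written as a Python model rather than HDL. The function names are hypothetical and the 16-bit width is assumed from the core's data width; the point is only to show how a distance that has no code of its own splits into several single-operation shifts.

```python
def decode_distance(rr):
    """Map the 2-bit field to a shift distance: 00->1, 01->2, 10->4, 11->8."""
    return 1 << (rr & 0b11)


def shift_plan(distance):
    """Split a shift of 1..15 into a list of 2-bit codes, largest step first."""
    if not 1 <= distance <= 15:
        raise ValueError("distance must be in 1..15")
    plan = []
    for rr in (0b11, 0b10, 0b01, 0b00):
        if distance & decode_distance(rr):
            plan.append(rr)
    return plan


def shift_left(value, distance, width=16):
    """Apply the planned shifts one operation at a time (width is assumed)."""
    mask = (1 << width) - 1
    for rr in shift_plan(distance):
        value = (value << decode_distance(rr)) & mask
    return value
```

With this decomposition, distances 1, 2, 4 and 8 take one operation, a distance of 12 takes two (8 then 4), and 15 is the worst case at four.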

Cheers
Ed

ElEctric_EyE

Posted: Sat Mar 17, 2012 1:33 pm

Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA

BigEd wrote:

Hi Arlet
thanks for the re-pipelined code!

Another thanks! I'll try to implement it in the .b core and measure the results.

BigEd wrote:

...looking at your present code on github, I notice a few things:
- regsel is still a 3-bit signal, so I think you may have only 8 registers total...

That is still at the stage of 3 Acc's. I've not updated Github to 16 Acc's yet. Thanks for checking it out though. Just a bit more testing...

BigEd wrote:

...Here's an idea: instead of dedicating 4 bits to the shift distance, and therefore having to restrict the variable-distance shift opcodes to just 4 registers, how about allowing just 2 bits of shift distance
00 : shift by 1
01 : shift by 2
10 : shift by 4
11 : shift by 8
and then you'll have room to apply the variable-shift distance to 8 registers.

The most common shifts can still be performed in a single operation, and other shifts can be performed in two or three operations.

Cheers
Ed

If I did that, IR[11:8] wouldn't follow the rule I have now for 16 Acc's:

Code:

For IR[15:0] 
16'bssdd_ssdd_xxxx_xxxx 

IR[15:14] = src_reg 
IR[13:12] = dst_reg 
IR[11:10] = src_reg 
IR[9:8] = dst_reg 
IR[7:0] = NMOS 6502 opcode
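For illustration, here is a Python sketch that pulls those fields out of a 16-bit instruction word. The field names are made up for this example, and the way the two 2-bit halves combine into a 4-bit register number (upper field as the high bits) is an assumption; the post does not say which half is which.

```python
def decode_ir(ir):
    """Extract the fields of the ssdd_ssdd_xxxx_xxxx layout from a 16-bit word."""
    ir &= 0xFFFF
    return {
        "src_hi": (ir >> 14) & 0b11,  # IR[15:14]
        "dst_hi": (ir >> 12) & 0b11,  # IR[13:12]
        "src_lo": (ir >> 10) & 0b11,  # IR[11:10]
        "dst_lo": (ir >> 8) & 0b11,   # IR[9:8]
        "opcode": ir & 0xFF,          # IR[7:0], the NMOS 6502 opcode
    }


def register_numbers(ir):
    """Combine the halves into 4-bit register numbers.

    ASSUMPTION: the IR[15:12] fields supply the high two bits and the
    IR[11:8] fields the low two bits. The real core may order them differently.
    """
    f = decode_ir(ir)
    src = (f["src_hi"] << 2) | f["src_lo"]
    dst = (f["dst_hi"] << 2) | f["dst_lo"]
    return src, dst
```

For example, `decode_ir(0x6DA9)` yields opcode `0xA9` (LDA immediate on the NMOS 6502) with the four register fields 01, 10, 11 and 01.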

BigEd wrote:

...- A is hard coded as register 0, X as 4, Y as 5 and SP as 6. This is not a bad thing, as it presumably means one can do arithmetic on X, Y and SP. But it might be worth noting. Also you might want to move them around a bit...

What do you mean?

BigEd

Posted: Sat Mar 17, 2012 2:23 pm

ElEctric_EyE wrote:

BigEd wrote:

how about allowing just 2 bits of shift distance

If I did that, IR[11:8] wouldn't follow the rule I have now for 16 Acc's:

Code:

For IR[15:0] 
16'bssdd_ssdd_xxxx_xxxx 

How about mapping those groups of bits like this instead:

Code:

For IR[15:0] 
16'bsdsd_ssdd_xxxx_xxxx  ;  s and d are src and dst for non-shift
16'brrsd_ssdd_xxxx_xxxx  ;  rr is shift/rotate distance for shift ops

or like this

Code:

For IR[15:0] 
16'bsdss_sddd_xxxx_xxxx  ;  s and d are src and dst for non-shift
16'brrss_sddd_xxxx_xxxx  ;  rr is shift/rotate distance for shift ops

ElEctric_EyE wrote:

What do you mean?

If you name your accumulators according to the alphabet, and we see which register numbers they land on, with the present github code we get:

Code:

A
B
C
D
E also X
F also Y
G also SP
H

and similarly (but with more lines) in the 16-accumulator case. So, I'm suggesting you'd want to put X, Y and SP at the top end, where you have accumulators N, O, and P (or Q), which are registers 13, 14 and 15.

For example
TAN
will be the same thing as
TAX
which isn't very interesting, but adding something to D and placing the result in N also places it in X (they are different names for the same place), and this might well be useful.

(This is assuming that I've understood correctly the way your verilog makes use of the register file)

Although it would be yet more work for the assembler, I'd also suggest R0 to R15 as synonyms for the accumulators (or registers) - that would allow a more regular form of assembly which would be less familiar to 6502 fans but might be easier to work with.
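The aliasing idea can be sketched as a small Python model of the name lookup an assembler might do. Everything here is hypothetical: the placement of X, Y and SP on N, O and P follows the suggestion above, and the `RegisterFile` class is only a stand-in for the Verilog register array.

```python
ACCUMULATORS = "ABCDEFGHIJKLMNOP"          # A = register 0 ... P = register 15
ALIASES = {"X": "N", "Y": "O", "SP": "P"}  # suggested placement, not the current core


def reg_index(name):
    """Resolve an accumulator letter, an alias, or an R0..R15 synonym."""
    name = ALIASES.get(name, name)
    if len(name) == 1 and name in ACCUMULATORS:
        return ACCUMULATORS.index(name)
    if name.startswith("R") and name[1:].isdigit() and int(name[1:]) < 16:
        return int(name[1:])
    raise ValueError(f"unknown register name: {name!r}")


class RegisterFile:
    """Sixteen 16-bit registers addressed by any of their names."""

    def __init__(self):
        self.regs = [0] * 16

    def read(self, name):
        return self.regs[reg_index(name)]

    def write(self, name, value):
        self.regs[reg_index(name)] = value & 0xFFFF
```

Writing `D + 5` into N and then reading X returns the same value, because both names resolve to register 13.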

Cheers
Ed

ElEctric_EyE

Posted: Sat Mar 17, 2012 3:20 pm

This is what I have for the 16 Acc .b core now:

Code:

reg [4:0] regsel;         // Select A thru P, X, Y or S register
wire [dw-1:0] regfile = PAXYS[regsel];   // Selected register output

parameter
   SEL_A = 5'd0,
   SEL_B = 5'd1,
   SEL_C = 5'd2,
   SEL_D = 5'd3,
   SEL_E = 5'd4,
   SEL_F = 5'd5,
   SEL_G = 5'd6,
   SEL_H = 5'd7,
   SEL_I = 5'd8,
   SEL_J = 5'd9,
   SEL_K = 5'd10,
   SEL_L = 5'd11,
   SEL_M = 5'd12,
   SEL_N = 5'd13,
   SEL_O = 5'd14,
   SEL_P = 5'd15,
   SEL_X = 5'd16,
   SEL_Y = 5'd17,
   SEL_S = 5'd18;

As for changing values during a transfer, there's really no need for that. I've followed Arlet's suggestion and do that kind of thing during the actual math or logical operation instead. So it looks something like this:

Code:

ADCAopBi #$001F ; Add $1F to A, store in B

An ADCAopAi is actually the original immediate-mode ADC #$xx.
Or:

Code:

EOR AopBa $FFFFE000 ; EOR value in $FFFFE000 with A, store in B

So the bit structure is pretty much locked in at this point, especially since I've already done a lot of work typing out a good number of macros and running them through simulation.
Just so one can get a grasp on what I am doing: just as there were 256 opcode possibilities for all transfers between 16 Acc's (only 240 useful, the 16 register-to-itself transfers being the exception), there are another 256 opcodes for each and every operation per addressing mode: 256 for ADCXopXi, 256 for ADC XopXa, 256 for ANDXopXi, etc.
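The opcode counts can be checked with a few lines of Python. This is only a counting sketch: the `mnemonic` helper is hypothetical and simply follows the `ADCAopBi` naming pattern shown above (operation, source, `op`, destination, addressing-mode letter).

```python
from itertools import product

REGS = "ABCDEFGHIJKLMNOP"  # the 16 accumulators

# Every (source, destination) pair: 16 * 16 combinations.
ALL_PAIRS = list(product(REGS, repeat=2))

# Pairs where source and destination differ, i.e. the transfers that move data.
DISTINCT_PAIRS = [(s, d) for s, d in ALL_PAIRS if s != d]


def mnemonic(op, src, dst, mode):
    """Build a macro name in the ADCAopBi style (hypothetical helper)."""
    return f"{op}{src}op{dst}{mode}"


# One macro per register pair for a single operation and addressing mode.
ADC_IMMEDIATE = [mnemonic("ADC", s, d, "i") for s, d in ALL_PAIRS]
```

`len(ALL_PAIRS)` is 256 and `len(DISTINCT_PAIRS)` is 240, which matches the figures quoted for the transfer opcodes.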

Also, back to the variable shift. If this cpu is working with audio for example and you wanted to lower the volume of a sample, there would be a need for a variable high speed shift such as this. I would imagine this is also the case with video.

BigEd

Posted: Sat Mar 17, 2012 3:31 pm

Oh I see - X Y and S are now distinct from your accumulators. Do you see the advantage I mean in making them aliases for 3 of your accumulators? It makes these index registers less special and allows one to perform arithmetic on them.

BigEd

Posted: Sat Mar 17, 2012 3:35 pm

On the question of variable shift, I'd actually be very interested to know what performance critical code might need it. Perhaps some kind of decompression. Not volume adjustment I don't think - that would more likely be a multiplication. (Adding a multiplier won't hurt speed)

Cheers Ed

ElEctric_EyE

Posted: Sat Mar 17, 2012 3:43 pm

BigEd wrote:

This is interesting. The first time you mentioned doing math on an index register it got me thinking... The value that acted on an index register should 'stick' I would think...I will have to come back to this at a later point. Could you give us an example?

As you know, I wanted to do variable shift really just for x4, x8, & x12, but I think I should keep the full 4 bits available.

Multiplication, as in 16x16, won't hurt speed? I have an idea I am going to try real soon: presenting the ALU with two 16-bit wide registers and a multiply bit. There will also be a 32-bit (or 33-bit?) wide result register. I've just not had time to get around to it yet!

Arlet

Posted: Sat Mar 17, 2012 4:03 pm

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands

ElEctric_EyE wrote:

Multiplication, as in 16x16, won't hurt speed?

Probably not too much, if you do it right. If you change this code here:

Code:

AXYS[regsel] <= (state == JSR0) ? DIMUX : ADD;

into something like this:

Code:

AXYS[regsel] <= (state == JSR0) ? DIMUX :
                (state == MUL0) ? MUL : ADD;

This assumes an extra MUL0 state in the state machine, that's used in multiply instructions. The 'MUL' value is a register that contains the output of the multiply. Something like this:

Code:

always @(posedge clk)
    MUL <= AI * BI;

Because the output of the multiplier is stored in a register, the synthesis tools will use a pipelined version of the multiplier, which should be sufficiently fast.

If you put the multiply inside the ALU, I would expect a bigger drop in speed, because it can't pipeline it, and even though the multiplier itself is still pretty fast, you also have to consider the routing delays to and from the multipliers, which are only found in certain places on the FPGA.
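To make the timing consequence of the registered multiplier concrete, here is a small cycle-level Python model. It is a sketch, not a simulation of the real core: the class name is hypothetical and the 32-bit product width is an assumption. It shows that the product of the operands presented in one cycle only becomes readable in the next, which is why an extra state such as MUL0 is needed before the result is written back.

```python
class RegisteredMultiplier:
    """Behavioural model of `always @(posedge clk) MUL <= AI * BI;`."""

    def __init__(self, width=16):
        self.mask = (1 << (2 * width)) - 1  # assumed double-width product
        self.mul = 0                        # contents of the MUL register

    def clock(self, ai, bi):
        """Model one rising clock edge.

        Returns the value MUL held before the edge (what other logic could
        read during that cycle), then latches the new product.
        """
        visible = self.mul
        self.mul = (ai * bi) & self.mask
        return visible
```

Clocking in 3 and 4 returns the old register contents (0 after reset in this model); the product 12 is only returned by the following call.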

Arlet

Posted: Sat Mar 17, 2012 4:10 pm

By the way, for volume adjustments in audio, you also might want to consider audio channels with multipliers built into the hardware path. That way, you can write the audio samples into a sample register, and adjust the volume with a separate register.

Of course, having a multiply in the CPU is always useful. For instance, when doing video output address calculation, you'd have to multiply the Y value by the screen width.
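As a sketch of the address calculation mentioned here, in Python with hypothetical names and an assumed linear framebuffer of one word per pixel: the general form needs a multiply, while a fixed width such as 640 can also be built from two shifts and an add.

```python
def pixel_address(x, y, width, base=0):
    """Linear framebuffer address: base + y * width + x (needs a multiply)."""
    return base + y * width + x


def pixel_address_640(x, y, base=0):
    """Same result for width 640 without a multiplier: 640 = 512 + 128."""
    return base + (y << 9) + (y << 7) + x
```

The shift-and-add form only works when the width is known in advance, whereas the multiply form handles any width at run time.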

ElEctric_EyE

Posted: Sat Mar 17, 2012 4:55 pm

Arlet wrote:

...And secondly, it needs to be based on OUT rather than temp:

Code:

assign Z = ~|OUT;

A while ago you suggested making this change inside the cpu module:

Code:

assign AZ = ~|ADD; //calculate the Z flag inside the cpu.v module

Which one is better?

Arlet

Posted: Sat Mar 17, 2012 5:06 pm

I think changing it in the ALU module is more elegant, because it keeps all the flags together. For synthesis, it should be exactly the same.
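For readers less used to Verilog reduction operators, `~|OUT` is a NOR across every bit of the result: it is 1 only when all bits are 0. A Python model of that, with a hypothetical function name and an assumed 16-bit result width:

```python
def z_flag(out, width=16):
    """Model of `assign Z = ~|OUT;`: 1 when every bit of OUT is zero."""
    any_bit_set = 0
    for bit in range(width):
        any_bit_set |= (out >> bit) & 1  # the OR-reduction
    return any_bit_set ^ 1               # the final inversion
```

Because only the low `width` bits are examined, a 16-bit sum that wraps to zero (for example `0xFFFF + 1`) sets Z here even though a carry was produced.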

ElEctric_EyE

Posted: Sat Mar 17, 2012 9:35 pm

I've implemented your changes and am up to 98MHz with the 16 Acc .b core.

... Still tightening constraints on slow laptop (still at work)

103MHz.. Work done. Time to head home and finalize this speed test on a real machine.

ElEctric_EyE

Posted: Sun Mar 18, 2012 12:24 am

This speed test will have to wait a few days. On my main machine, the modified versions of cpu.v and ALU.v are not even passing a 10.5ns constraint, which is BS, since the laptop was good down to at least 9.8ns.

Will double check tomorrow.

On a side note, today I've added 512 opcodes for ADCXopXi and SBCXopXi. Testing will continue tomorrow on the 16Acc's.

ElEctric_EyE

Posted: Sun Mar 18, 2012 10:59 pm

Testing the SBCXopXi opcodes today took all day and pointed out some big-time errors. I won't get specific because that would make for a lot of boring reading. But I got this working:

Code:

START             LDA #$0001     ;LDAi #$0001
                  LDBi
                  .BYTE $0002
                  LDCi
                  .BYTE $0003
                  LDDi
                  .BYTE $0004
                  LDEi
                  .BYTE $0005
                  LDFi
                  .BYTE $0006
                  LDGi
                  .BYTE $0007
                  LDHi
                  .BYTE $0008
                  LDIi
                  .BYTE $0009
                  LDJi
                  .BYTE $000A
                  LDKi
                  .BYTE $000B
                  LDLi
                  .BYTE $000C
                  LDMi
                  .BYTE $000D
                  LDNi
                  .BYTE $000E
                  LDOi
                  .BYTE $000F
                  LDPi
                  .BYTE $0010
                  SEC
AA                SBCPopPi
                  .BYTE $0001
                  BNE AA
BA                SBCOopOi
                  .BYTE $0001
                  BNE BA
CA                SBCNopNi
                  .BYTE $0001
                  BNE CA
DA                SBCMopMi
                  .BYTE $0001
                  BNE DA
EA                SBCLopLi
                  .BYTE $0001
                  BNE EA
                  
                  NOP
                  NOP
                  NOP

and I was able to get rid of the default: case statements.

I was debating whether to update the .b core Github posting using the 4Acc's, but the .b core has progressed to 16 Acc's and when all is finally "straightened out", it should be easy enough to scale back down to 4Acc's, with everything working and tested in the 16 Acc core. So, if one is following Github changes, there are going to be monstrous changes, which will be difficult to follow. Now that I have 16 Acc's working, I'll post regular updates in the future.

ElEctric_EyE

Posted: Tue Mar 20, 2012 1:35 pm

Progress continues, corrected many mistakes. There must be some errors still because when I try the core in my project it is not displaying characters correctly. I would have suspected an issue with the carry flag, or a shifting operation but it worked with the 4Acc Var shift version. So most likely, my opcode decoding is screwed up somewhere. Will troubleshoot some more...

Using SmartXplorer to push top speeds though:

Code:

Timing summary: 
 --------------- 
  
 Timing errors: 0  Score: 0  (Setup/Max: 0, Hold: 0) 
  
 Constraints cover 597244 paths, 0 nets, and 6385 connections 
  
 Design statistics: 
    Minimum period:   9.642ns{1}   (Maximum frequency: 103.713MHz) 
  
  
 ------------------------------------Footnotes----------------------------------- 
 1)  The minimum period statistic assumes all single cycle delays. 
  
 Analysis completed Tue Mar 20 21:15:13 2012  
 -------------------------------------------------------------------------------- 
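The relationship between the minimum period and the maximum frequency in the report is just a reciprocal with a unit conversion. A small Python check, with hypothetical helper names:

```python
def max_frequency_mhz(period_ns):
    """Convert a minimum clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


def period_ns(frequency_mhz):
    """Convert a target frequency in MHz to a period constraint in nanoseconds."""
    return 1000.0 / frequency_mhz
```

`max_frequency_mhz(9.642)` rounds to 103.713, matching the figure in the timing summary, and a 100 MHz target corresponds to a 10 ns constraint.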
