6502.org Forum  Projects  Code  Documents  Tools  Forum
All times are UTC

PostPosted: Sat Mar 17, 2012 11:34 am
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
Hi Arlet
thanks for the re-pipelined code!

Hi EEye
looking at your present code on github, I notice a few things:
- regsel is still a 3-bit signal, so I think you may have only 8 registers total
- A is hard coded as register 0, X as 4, Y as 5 and SP as 6. This is not a bad thing, as it presumably means one can do arithmetic on X, Y and SP. But it might be worth noting. Also you might want to move them around a bit.

Here's an idea: instead of dedicating 4 bits to the shift distance, and therefore having to restrict the variable-distance shift opcodes to just 4 registers, how about allowing just 2 bits of shift distance:
00 : shift by 1
01 : shift by 2
10 : shift by 4
11 : shift by 8
and then you'll have room to apply the variable-shift distance to 8 registers?

The most common shifts can still be performed in a single operation, and other shifts can be performed in two or three operations.
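
The 2-bit encoding above maps directly onto a power-of-two distance. A minimal Verilog sketch, for illustration only: it assumes a 16-bit data path and a left shift, and the module and port names (shift_pow2, rr, din, dout) are hypothetical, not taken from the actual core.

```verilog
// Hypothetical sketch: 2-bit field rr selects a left-shift distance of 1, 2, 4 or 8.
module shift_pow2 #(
    parameter dw = 16               // data width, assumed to be 16 bits
) (
    input  wire [1:0]    rr,        // 00: 1   01: 2   10: 4   11: 8
    input  wire [dw-1:0] din,
    output wire [dw-1:0] dout
);
    // 1 << rr evaluates to 1, 2, 4, 8 for rr = 0..3
    wire [3:0] dist = 4'd1 << rr;

    assign dout = din << dist;
endmodule
```

Other distances would be composed from several such operations, for example a shift by 3 as a shift by 2 followed by a shift by 1. A distance of 15 (8 + 4 + 2 + 1) is the worst case at four operations.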

Cheers
Ed


PostPosted: Sat Mar 17, 2012 1:33 pm
Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
BigEd wrote:
Hi Arlet
thanks for the re-pipelined code!

Another thanks! I'll try to implement it in the .b core and measure the results.

BigEd wrote:
...looking at your present code on github, I notice a few things:
- regsel is still a 3-bit signal, so I think you may have only 8 registers total...

That is still at the stage of 3 Acc's. I've not updated Github to 16 Acc's yet. Thanks for checking it out, though. Just a bit more testing...

BigEd wrote:
...Here's an idea: instead of dedicating 4 bits to the shift distance, and therefore having to restrict the variable-distance shift opcodes to just 4 registers, how about allowing just 2 bits of shift distance
00 : shift by 1
01 : shift by 2
10 : shift by 4
11 : shift by 8
and then you'll have room to apply the variable-shift distance to 8 registers.

The most common shifts can still be performed in a single operation, and other shifts can be performed in two or three operations.

Cheers
Ed

If I did that, IR[11:8] wouldn't follow the rule I have now for 16 Acc's:
Code:
For IR[15:0]
16'bssdd_ssdd_xxxx_xxxx

IR[15:14] = src_reg
IR[13:12] = dst_reg
IR[11:10] = src_reg
IR[9:8] = dst_reg
IR[7:0] = NMOS 6502 opcode
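
A minimal Verilog sketch of that field split, for illustration only. It assumes the upper ss/dd pair holds the high two bits of each 4-bit register number (the rule above does not say which pair is the high half), and the module and signal names are hypothetical.

```verilog
// Hypothetical decode of 16'bssdd_ssdd_xxxx_xxxx into 4-bit src/dst selects.
module ir_fields (
    input  wire [15:0] IR,
    output wire [3:0]  src_reg,     // selects one of 16 accumulators
    output wire [3:0]  dst_reg,     // selects one of 16 accumulators
    output wire [7:0]  opcode       // NMOS 6502 opcode
);
    // ASSUMPTION: IR[15:14] and IR[13:12] are the high halves,
    //             IR[11:10] and IR[9:8]   are the low halves.
    assign src_reg = {IR[15:14], IR[11:10]};
    assign dst_reg = {IR[13:12], IR[9:8]};
    assign opcode  = IR[7:0];
endmodule
```

Under that assumption, an instruction word whose upper eight bits are all zero selects accumulator 0 for both source and destination.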


BigEd wrote:
...- A is hard coded as register 0, X as 4, Y as 5 and SP as 6. This is not a bad thing, as it presumably means one can do arithmetic on X, Y and SP. But it might be worth noting. Also you might want to move them around a bit...

What do you mean?


PostPosted: Sat Mar 17, 2012 2:23 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
ElEctric_EyE wrote:
BigEd wrote:
how about allowing just 2 bits of shift distance

If I did that, IR[11:8] wouldn't follow the rule I have now for 16 Acc's:
Code:
For IR[15:0]
16'bssdd_ssdd_xxxx_xxxx


How about mapping those groups of bits like this instead:
Code:
For IR[15:0]
16'bsdsd_ssdd_xxxx_xxxx  ;  s and d are src and dst for non-shift
16'brrsd_ssdd_xxxx_xxxx  ;  rr is shift/rotate distance for shift ops

or like this
Code:
For IR[15:0]
16'bsdss_sddd_xxxx_xxxx  ;  s and d are src and dst for non-shift
16'brrss_sddd_xxxx_xxxx  ;  rr is shift/rotate distance for shift ops
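
As a rough illustration of the second layout, here is a Verilog sketch of how the two cases might share one decoder. The is_shift input, the bit ordering within each select, and the choice to limit shift ops to registers 0 to 7 are all assumptions made for this example; none of them is stated in the thread.

```verilog
// Hypothetical decode of the second layout:
//   16'bsdss_sddd_xxxx_xxxx   non-shift ops
//   16'brrss_sddd_xxxx_xxxx   shift/rotate ops
module ir_fields_alt (
    input  wire [15:0] IR,
    input  wire        is_shift,    // ASSUMPTION: driven by opcode decode for shift/rotate ops
    output wire [3:0]  src_reg,
    output wire [3:0]  dst_reg,
    output wire [1:0]  rr           // distance code, meaningful only when is_shift is 1
);
    // The low three bits of each select sit in the same place in both layouts.
    // For shift ops the top select bit is forced to 0 (registers 0..7 only),
    // because IR[15:14] carry the distance code instead.
    assign src_reg = {(is_shift ? 1'b0 : IR[15]), IR[13:11]};
    assign dst_reg = {(is_shift ? 1'b0 : IR[14]), IR[10:8]};
    assign rr      = IR[15:14];
endmodule
```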



Quote:
BigEd wrote:
...- A is hard coded as register 0, X as 4, Y as 5 and SP as 6. This is not a bad thing, as it presumably means one can do arithmetic on X, Y and SP. But it might be worth noting. Also you might want to move them around a bit...

What do you mean?

If you name your accumulators according to the alphabet, and we see which register numbers they land on, with the present github code we get
Code:
0 A
1 B
2 C
3 D
4 E also X
5 F also Y
6 G also SP
7 H

and similarly (but with more lines) in the 16-accumulator case. So, I'm suggesting you'd want to put X, Y and SP at the top end, where you have accumulators N, O and P, which are registers 13, 14 and 15.

For example, TAN will be the same thing as TAX, which isn't very interesting, but adding something to D and placing the result in N will also place it in X (they are different names for the same place), and this might well be useful.

(This is assuming that I've understood correctly the way your Verilog makes use of the register file.)
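
The aliasing idea can be sketched in Verilog as follows. This is only an illustration under the assumption of a 16-entry register file with 4-bit selects; the module name and the regs array are hypothetical and do not come from the core.

```verilog
// Hypothetical sketch: X, Y and SP as aliases of the top three accumulators.
module regsel_alias_demo;
    localparam dw = 16;

    localparam [3:0] SEL_N = 4'd13;
    localparam [3:0] SEL_O = 4'd14;
    localparam [3:0] SEL_P = 4'd15;
    localparam [3:0] SEL_X = SEL_N;   // X  is another name for N
    localparam [3:0] SEL_Y = SEL_O;   // Y  is another name for O
    localparam [3:0] SEL_S = SEL_P;   // SP is another name for P

    reg [dw-1:0] regs [0:15];         // 16-entry register file

    initial begin
        regs[SEL_X] = 16'h1234;               // write through the X alias
        $display("N = %h", regs[SEL_N]);      // reads the same storage location
    end
endmodule
```

Because the select values are identical, any instruction that can name N as a destination can write X without a separate transfer.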

Although it would be yet more work for the assembler, I'd also suggest R0 to R15 as synonyms for the accumulators (or registers) - that would allow a more regular form of assembly which would be less familiar to 6502 fans but might be easier to work with.

Cheers
Ed


PostPosted: Sat Mar 17, 2012 3:20 pm
Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
This is what I have for the 16 Acc .b core now:
Code:
reg [4:0] regsel;                        // Select A thru P, X, Y or S register
wire [dw-1:0] regfile = PAXYS[regsel];   // Selected register output

parameter
   SEL_A = 5'd0,
   SEL_B = 5'd1,
   SEL_C = 5'd2,
   SEL_D = 5'd3,
   SEL_E = 5'd4,
   SEL_F = 5'd5,
   SEL_G = 5'd6,
   SEL_H = 5'd7,
   SEL_I = 5'd8,
   SEL_J = 5'd9,
   SEL_K = 5'd10,
   SEL_L = 5'd11,
   SEL_M = 5'd12,
   SEL_N = 5'd13,
   SEL_O = 5'd14,
   SEL_P = 5'd15,
   SEL_X = 5'd16,
   SEL_Y = 5'd17,
   SEL_S = 5'd18;


As far as changing values during a transfer, there's really no need for that. I've done what Arlet suggested to do that kind of thing during the actual math or logical operation. So it looks something like this:
Code:
ADCAopBi #$001F    ; Add $1F to A, store in B

An ADCAopAi is actually the original ADC #$xx (immediate).
or
Code:
EOR AopBa $FFFFE000    ; EOR value in $FFFFE000 with A, store in B


So the bit structure is pretty much locked in at this point, especially since I've already done a lot of work typing out a good number of macros and running them through simulation.
Just so one can get a grasp of what I am doing: just as there were 256 opcode possibilities for all transfers between 16 Acc's (only 240 useful, since 16 of them transfer a register to itself), there are another 256 opcodes for each and every operation per addressing mode: 256 for ADCXopXi, 256 for ADC XopXa, 256 for ANDXopXi, etc.


Also, back to the variable shift. If this cpu is working with audio for example and you wanted to lower the volume of a sample, there would be a need for a variable high speed shift such as this. I would imagine this is also the case with video.


PostPosted: Sat Mar 17, 2012 3:31 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
Oh I see - X Y and S are now distinct from your accumulators. Do you see the advantage I mean in making them aliases for 3 of your accumulators? It makes these index registers less special and allows one to perform arithmetic on them.


PostPosted: Sat Mar 17, 2012 3:35 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
On the question of variable shift, I'd actually be very interested to know what performance-critical code might need it. Perhaps some kind of decompression. Not volume adjustment, I don't think - that would more likely be a multiplication. (Adding a multiplier won't hurt speed.)

Cheers Ed


PostPosted: Sat Mar 17, 2012 3:43 pm
Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
BigEd wrote:
Oh I see - X Y and S are now distinct from your accumulators. Do you see the advantage I mean in making them aliases for 3 of your accumulators? It makes these index registers less special and allows one to perform arithmetic on them.

This is interesting. The first time you mentioned doing math on an index register it got me thinking... The value that acted on an index register should 'stick' I would think...I will have to come back to this at a later point. Could you give us an example?

As you know, I wanted to do variable shift really just for x4, x8, & x12, but I think I should keep the full 4 bits available.

Multiplication, as in 16x16, won't hurt speed? I have an idea I am going to try real soon: presenting the ALU with two 16-bit-wide registers and a multiply bit. There will also be a 32-bit (33?) wide result register; I've just not had time to get around to it yet!


PostPosted: Sat Mar 17, 2012 4:03 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
ElEctric_EyE wrote:
Multiplication, as in 16x16, won't hurt speed?


Probably not too much, if you do it right. If you change this code here:
Code:
AXYS[regsel] <= (state == JSR0) ? DIMUX : ADD;

into something like this:
Code:
AXYS[regsel] <= (state == JSR0) ? DIMUX :
                (state == MUL0) ? MUL   : ADD;

This assumes an extra MUL0 state in the state machine, that's used in multiply instructions. The 'MUL' value is a register that contains the output of the multiply. Something like this:
Code:
always @(posedge clk)
    MUL <= AI * BI;

Because the output of the multiplier is stored in a register, the synthesis tools will use a pipelined version of the multiplier, which should be sufficiently fast.

If you put the multiply inside the ALU, I would expect a bigger drop in speed, because it can't pipeline it, and even though the multiplier itself is still pretty fast, you also have to consider the routing delays to and from the multipliers, which are only found in certain places on the FPGA.
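
To make the registered-multiplier idea concrete, here is a stand-alone Verilog sketch. It is only an illustration: the module name mul_reg and the 16-bit operand width are assumptions, and the full double-width product is kept here even though a write into AXYS[regsel] as shown above would only take the low dw bits.

```verilog
// Hypothetical stand-alone version of the registered multiply.
module mul_reg #(
    parameter dw = 16                   // operand width, assumed to be 16 bits
) (
    input  wire            clk,
    input  wire [dw-1:0]   AI,          // operand names borrowed from the snippet above
    input  wire [dw-1:0]   BI,
    output reg  [2*dw-1:0] MUL          // full product, valid one clock after AI/BI
);
    // Unsigned multiply. The output register sits outside the ALU path,
    // so the multiplier delay is not added to the ALU's combinational logic.
    always @(posedge clk)
        MUL <= AI * BI;
endmodule
```

The state machine would then need the extra MUL0 state described above, so that the result is only written back on the cycle after the operands are presented.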


PostPosted: Sat Mar 17, 2012 4:10 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
By the way, for volume adjustments in audio, you also might want to consider audio channels with multipliers built into the hardware path. That way, you can write the audio samples into a sample register, and adjust the volume with a separate register.

Of course, having a multiply in the CPU is always useful. For instance, when doing video output address calculation, you'd have to multiply the Y value by the screen width.


PostPosted: Sat Mar 17, 2012 4:55 pm
Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
Arlet wrote:
...And secondly, it needs to be based on OUT rather than temp:
Code:
assign Z = ~|OUT;

A while ago you suggested making this change inside the cpu module:
Code:
assign AZ = ~|ADD;    //calculate the Z flag inside the cpu.v module

Which one is best?


PostPosted: Sat Mar 17, 2012 5:06 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
I think changing it in the ALU module is more elegant, because it keeps all the flags together. For synthesis, it should be exactly the same.
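
For reference, ~| is Verilog's reduction NOR, so either form yields 1 only when every bit of the bus is 0. A minimal sketch, with a hypothetical module name and an assumed 16-bit width:

```verilog
// Hypothetical sketch of the Z flag computed from the ALU result.
module zflag_demo #(
    parameter dw = 16               // data width, assumed to be 16 bits
) (
    input  wire [dw-1:0] OUT,       // ALU result
    output wire          Z          // zero flag
);
    // Reduction NOR: OR all bits together, then invert.
    // Equivalent to (OUT == 0).
    assign Z = ~|OUT;
endmodule
```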


PostPosted: Sat Mar 17, 2012 9:35 pm
Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
I've implemented your changes and am up to 98MHz with 16 Acc .b core :lol:... Still tightening constraints on slow laptop (still at work) :(
103MHz.. Work done. Time to head home and finalize this speed test on a real machine.


PostPosted: Sun Mar 18, 2012 12:24 am
Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
This speed test will have to wait a few days. On my main machine, the modified versions of cpu.v and ALU.v are not even passing a 10.5ns constraint, which is BS, since the laptop was good down to at least 9.8ns.

Will double check tomorrow.

On a side note, today I've added 512 opcodes for ADCXopXi and SBCXopXi. Testing will continue tomorrow on the 16Acc's.


PostPosted: Sun Mar 18, 2012 10:59 pm
Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
Testing the SBCXopXi opcodes today took all day and pointed out some big-time errors. I won't get specific because that would make for a lot of boring reading. But I got this working:
Code:
START             LDA #$0001     ;LDAi #$0001
                  LDBi
                  .BYTE $0002
                  LDCi
                  .BYTE $0003
                  LDDi
                  .BYTE $0004
                  LDEi
                  .BYTE $0005
                  LDFi
                  .BYTE $0006
                  LDGi
                  .BYTE $0007
                  LDHi
                  .BYTE $0008
                  LDIi
                  .BYTE $0009
                  LDJi
                  .BYTE $000A
                  LDKi
                  .BYTE $000B
                  LDLi
                  .BYTE $000C
                  LDMi
                  .BYTE $000D
                  LDNi
                  .BYTE $000E
                  LDOi
                  .BYTE $000F
                  LDPi
                  .BYTE $0010
                  SEC
AA                SBCPopPi
                  .BYTE $0001
                  BNE AA
BA                SBCOopOi
                  .BYTE $0001
                  BNE BA
CA                SBCNopNi
                  .BYTE $0001
                  BNE CA
DA                SBCMopMi
                  .BYTE $0001
                  BNE DA
EA                SBCLopLi
                  .BYTE $0001
                  BNE EA
                 
                  NOP
                  NOP
                  NOP

and I was able to get rid of the default: case statements.

I was debating whether to update the .b core Github posting using the 4Acc's, but the .b core has progressed to 16 Acc's and when all is finally "straightened out", it should be easy enough to scale back down to 4Acc's, with everything working and tested in the 16 Acc core. So, if one is following Github changes, there are going to be monstrous changes which will be difficult to follow. Now that I have 16 Acc's working, I'll post regular updates in the future.


PostPosted: Tue Mar 20, 2012 1:35 pm
Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
Progress continues; I've corrected many mistakes. There must be some errors still, because when I try the core in my project it is not displaying characters correctly. I would have suspected an issue with the carry flag or a shifting operation, but it worked with the 4Acc Var shift version. So most likely my opcode decoding is screwed up somewhere. Will troubleshoot some more...

Using SmartXplorer to push top speeds though:
Code:
Timing summary:
 ---------------
 
 Timing errors: 0  Score: 0  (Setup/Max: 0, Hold: 0)
 
 Constraints cover 597244 paths, 0 nets, and 6385 connections
 
 Design statistics:
    Minimum period:   9.642ns{1}   (Maximum frequency: 103.713MHz)
 
 
 ------------------------------------Footnotes-----------------------------------
 1)  The minimum period statistic assumes all single cycle delays.
 
 Analysis completed Tue Mar 20 21:15:13 2012 
 --------------------------------------------------------------------------------

