Posted: Sat Mar 17, 2012 11:34 am
by BigEd
Hi Arlet
thanks for the re-pipelined code!
Hi EEye
looking at your present code on github, I notice a few things:
- regsel is still a 3-bit signal, so I think you may have only 8 registers total
- A is hard-coded as register 0, X as 4, Y as 5 and SP as 6. This is not a bad thing, as it presumably means one can do arithmetic on X, Y and SP. But it might be worth noting. Also you might want to move them around a bit.
Here's an idea: instead of dedicating 4 bits to the shift distance, and therefore having to restrict the variable-distance shift opcodes to just 4 registers, how about allowing just 2 bits of shift distance:
00 : shift by 1
01 : shift by 2
10 : shift by 4
11 : shift by 8
and then you'll have room to apply the variable-shift distance to 8 registers.
The most common shifts can still be performed in a single operation, and other shifts can be performed in two or three operations.
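To make the trade-off concrete, here is a rough Python model of the 2-bit encoding, not anything taken from the core itself. The names `SHIFT_STEPS`, `shift_plan` and `shift_left` are made up for illustration, and the 16-bit register width is an assumption.

```python
# Hypothetical model of the proposed 2-bit shift-distance field.
SHIFT_STEPS = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}

def shift_plan(distance):
    """Return the 2-bit codes whose steps add up to `distance` (1..15), largest first."""
    if not 1 <= distance <= 15:
        raise ValueError("distance must be in 1..15")
    # Each set bit of the distance maps onto exactly one power-of-two shift.
    return [code for code in (0b11, 0b10, 0b01, 0b00)
            if distance & SHIFT_STEPS[code]]

def shift_left(value, distance, width=16):
    """Apply the planned shifts one after another (width of 16 bits is assumed)."""
    mask = (1 << width) - 1
    for code in shift_plan(distance):
        value = (value << SHIFT_STEPS[code]) & mask
    return value
```

With this decomposition, shifts by 1, 2, 4 and 8 take one operation, every other distance up to 14 takes two or three, and only a shift by 15 would need four.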
Cheers
Ed
Posted: Sat Mar 17, 2012 1:33 pm
by ElEctric_EyE
Hi Arlet
thanks for the re-pipelined code!
Another thanks! I'll try to implement it in the .b core and measure the results.
...looking at your present code on github, I notice a few things:
- regsel is still a 3-bit signal, so I think you may have only 8 registers total...
That is still at the stage of 3 Acc's. I've not updated Github to 16 Acc's yet. Thanks for checking it out though. Just a bit more testing...
...Here's an idea: instead of dedicating 4 bits to the shift distance, and therefore having to restrict the variable-distance shift opcodes to just 4 registers, how about allowing just 2 bits of shift distance
If I did that, IR[11:8] wouldn't follow the rule I have now for 16 Acc's:
Code:
For IR[15:0]
16'bssdd_ssdd_xxxx_xxxx
IR[15:14] = src_reg
IR[13:12] = dst_reg
IR[11:10] = src_reg
IR[9:8] = dst_reg
IR[7:0] = NMOS 6502 opcode
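As a sketch of how that layout could be picked apart in software, the following Python splits a 16-bit word into the five fields listed above. How the two 2-bit src pieces (and the two dst pieces) combine into a 4-bit register number is not stated here, so the high:low concatenation below is an assumption, and `decode_ir` is a hypothetical helper name.

```python
def decode_ir(ir):
    """Split a 16-bit instruction word laid out as ssdd_ssdd_xxxx_xxxx."""
    src_hi = (ir >> 14) & 0b11   # IR[15:14]
    dst_hi = (ir >> 12) & 0b11   # IR[13:12]
    src_lo = (ir >> 10) & 0b11   # IR[11:10]
    dst_lo = (ir >> 8) & 0b11    # IR[9:8]
    opcode = ir & 0xFF           # IR[7:0], the NMOS 6502 opcode
    # Assumption: each pair concatenates as high:low into a 4-bit register number.
    return {
        "src": (src_hi << 2) | src_lo,
        "dst": (dst_hi << 2) | dst_lo,
        "opcode": opcode,
    }
```

A word with all upper bits clear, such as $00A9, decodes to src 0 and dst 0, which under this assumption leaves the plain 6502 opcode acting on register 0.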
...- A is hard coded as register 0, X as 4, Y as 5 and SP as 6. This is not a bad thing, as it presumably means one can do arithmetic on X, Y and SP. But it might be worth noting. Also you might want to move them around a bit...
What do you mean?
Posted: Sat Mar 17, 2012 2:23 pm
by BigEd
how about allowing just 2 bits of shift distance
If I did that, IR[11:8] wouldn't follow the rule I have now for 16 Acc's:
Code:
For IR[15:0]
16'bssdd_ssdd_xxxx_xxxx
How about mapping those groups of bits like this instead:
Code:
For IR[15:0]
16'bsdsd_ssdd_xxxx_xxxx ; s and d are src and dst for non-shift
16'brrsd_ssdd_xxxx_xxxx ; rr is shift/rotate distance for shift ops
or like this
Code:
For IR[15:0]
16'bsdss_sddd_xxxx_xxxx ; s and d are src and dst for non-shift
16'brrss_sddd_xxxx_xxxx ; rr is shift/rotate distance for shift ops
...- A is hard coded as register 0, X as 4, Y as 5 and SP as 6. This is not a bad thing, as it presumably means one can do arithmetic on X, Y and SP. But it might be worth noting. Also you might want to move them around a bit...
What do you mean?
If you name your accumulators according to the alphabet, and we see which register numbers they land on, with the present github code we get
Code:
0 A
1 B
2 C
3 D
4 E also X
5 F also Y
6 G also SP
7 H
and similarly (but with more lines) in the 16-accumulator case. So, I'm suggesting you'd want to put X, Y and SP at the top end, where you have accumulators N, O, and P (or Q), which are registers 13, 14 and 15.
For example
TAN
will be the same thing as
TAX
which isn't very interesting, but adding something to D and placing the result in N would place it in X (they are different names for the same place), and this might well be useful.
(This is assuming that I've understood correctly the way your Verilog makes use of the register file.)
Although it would be yet more work for the assembler, I'd also suggest R0 to R15 as synonyms for the accumulators (or registers) - that would allow a more regular form of assembly which would be less familiar to 6502 fans but might be easier to work with.
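The aliasing idea can be sketched as a small assembler-side lookup table. This is only an illustration in Python: the table names and `reg_number` are hypothetical, and placing X, Y and SP on registers 13, 14 and 15 follows the suggestion above rather than the present code.

```python
import string

# Accumulators A..P map to register numbers 0..15.
ACCUMULATORS = {name: n for n, name in enumerate(string.ascii_uppercase[:16])}

# Suggested aliases: index registers and stack pointer at the top end,
# plus R0..R15 as regular synonyms.
ALIASES = {"X": 13, "Y": 14, "SP": 15}
ALIASES.update({f"R{n}": n for n in range(16)})

def reg_number(name):
    """Resolve an accumulator letter or an alias to its register number."""
    name = name.upper()
    if name in ACCUMULATORS:
        return ACCUMULATORS[name]
    if name in ALIASES:
        return ALIASES[name]
    raise KeyError(f"unknown register name: {name}")
```

With such a table, a transfer written with N and one written with X would resolve to the same register number, so they would assemble to the same opcode.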
Cheers
Ed
Posted: Sat Mar 17, 2012 3:20 pm
by ElEctric_EyE
This is what I have for the 16 Acc .b core now:
Code:
reg [4:0] regsel;                      // Select A thru P, X, Y or S register
wire [dw-1:0] regfile = PAXYS[regsel]; // Selected register output

parameter
    SEL_A = 5'd0,
    SEL_B = 5'd1,
    SEL_C = 5'd2,
    SEL_D = 5'd3,
    SEL_E = 5'd4,
    SEL_F = 5'd5,
    SEL_G = 5'd6,
    SEL_H = 5'd7,
    SEL_I = 5'd8,
    SEL_J = 5'd9,
    SEL_K = 5'd10,
    SEL_L = 5'd11,
    SEL_M = 5'd12,
    SEL_N = 5'd13,
    SEL_O = 5'd14,
    SEL_P = 5'd15,
    SEL_X = 5'd16,
    SEL_Y = 5'd17,
    SEL_S = 5'd18;
As far as changing values during a transfer, there's really no need for that. I've done what Arlet suggested to do that kind of thing during the actual math or logical operation. So it looks something like this:
Code:
ADCAopBi #$001F ; Add $1F to A, store in B
An ADCAopAi is actually the original ADC #$xx, immediate.
or
Code:
EOR AopBa $FFFFE000 ; EOR value in $FFFFE000 with A, store in B
So the bit structure is pretty much locked in at this point, especially since I've already done a lot of work typing out a good number of macros and running them through simulation.
Just so one can get a grasp on what I am doing: just as there were 256 opcode possibilities for all Transfers between 16 Acc's (only 240 useful), there are another 256 opcodes for each and every operation per addressing mode. 256 for ADCXopXi, 256 for ADC XopXa, 256 for ANDXopXi, etc.
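The opcode counts are easy to check by enumerating every source/destination pair. The Python below is only a counting aid; `macro_name` is a hypothetical helper that mimics the ADCAopBi naming style and is not the actual macro file.

```python
from itertools import product
import string

ACC_NAMES = string.ascii_uppercase[:16]          # "ABCDEFGHIJKLMNOP"

def macro_name(op, src, dst, mode):
    """Build a name in the style of ADCAopBi (hypothetical helper)."""
    return f"{op}{ACC_NAMES[src]}op{ACC_NAMES[dst]}{mode}"

# Every (src, dst) combination of 16 accumulators: 16 * 16 entries.
all_pairs = list(product(range(16), repeat=2))

# A transfer from a register to itself does nothing, so drop those 16.
useful_transfers = [(s, d) for s, d in all_pairs if s != d]

# One macro per pair for a single operation and addressing mode.
adc_immediate = [macro_name("ADC", s, d, "i") for s, d in all_pairs]
```

Here `all_pairs` has 256 entries and `useful_transfers` has 240, and `adc_immediate` starts with ADCAopAi (the original ADC immediate) followed by ADCAopBi.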
Also, back to the variable shift. If this cpu is working with audio for example and you wanted to lower the volume of a sample, there would be a need for a variable high speed shift such as this. I would imagine this is also the case with video.
Posted: Sat Mar 17, 2012 3:31 pm
by BigEd
Oh I see - X, Y and S are now distinct from your accumulators. Do you see the advantage I mean in making them aliases for 3 of your accumulators? It makes these index registers less special and allows one to perform arithmetic on them.
Posted: Sat Mar 17, 2012 3:35 pm
by BigEd
On the question of variable shift, I'd actually be very interested to know what performance-critical code might need it. Perhaps some kind of decompression. Not volume adjustment, I don't think; that would more likely be a multiplication. (Adding a multiplier won't hurt speed.)
Cheers Ed
Posted: Sat Mar 17, 2012 3:43 pm
by ElEctric_EyE
Oh I see - X Y and S are now distinct from your accumulators. Do you see the advantage I mean in making them aliases for 3 of your accumulators? It makes these index registers less special and allows one to perform arithmetic on them.
This is interesting. The first time you mentioned doing math on an index register it got me thinking... The value that acted on an index register should 'stick' I would think...I will have to come back to this at a later point. Could you give us an example?
As you know, I wanted to do variable shift really just for x4, x8, & x12, but I think I should keep the full 4 bits available.
Multiplication, as in 16x16, won't hurt speed? I have an idea I am going to try real soon: presenting the ALU with two 16-bit-wide registers and a multiply bit. There will also be a 32-bit (33?) wide result register; I've just not had time to get around to it yet!
Posted: Sat Mar 17, 2012 4:03 pm
by Arlet
Multiplication, as in 16x16, won't hurt speed?
Probably not too much, if you do it right. You could change this code here:
Code:
AXYS[regsel] <= (state == JSR0) ? DIMUX : ADD;
into something like this:
Code:
AXYS[regsel] <= (state == JSR0) ? DIMUX :
                (state == MUL0) ? MUL : ADD;
This assumes an extra MUL0 state in the state machine, which is used in multiply instructions. The 'MUL' value is a register that contains the output of the multiply. Something like this:
Code:
always @(posedge clk)
    MUL <= AI * BI;
Because the output of the multiplier is stored in a register, the synthesis tools will use a pipelined version of the multiplier, which should be sufficiently fast.
If you put the multiply inside the ALU, I would expect a bigger drop in speed, because it can't pipeline it, and even though the multiplier itself is still pretty fast, you also have to consider the routing delays to and from the multipliers, which are only found in certain places on the FPGA.
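The one-cycle delay that the registered output introduces can be pictured with a small behavioural model. This Python class is only a sketch of the timing, not of the FPGA multiplier; the class name, the `clock` method and the 16-bit operand width are assumptions.

```python
class RegisteredMultiplier:
    """Behavioural model of `always @(posedge clk) MUL <= AI * BI;`."""

    def __init__(self, width=16):
        self.mask = (1 << (2 * width)) - 1   # product is twice the operand width
        self.mul = 0                         # contents of the MUL register

    def clock(self, ai, bi):
        """One rising edge: return the value MUL held before the edge, then latch AI*BI."""
        previous = self.mul
        self.mul = (ai * bi) & self.mask
        return previous
```

The product of the operands presented on one edge only becomes visible in MUL afterwards, which is why a separate MUL0 state is needed to write it back. For unsigned 16x16 operands the largest product is $FFFE0001, so it fits in 32 bits.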
Posted: Sat Mar 17, 2012 4:10 pm
by Arlet
By the way, for volume adjustments in audio, you also might want to consider audio channels with multipliers built into the hardware path. That way, you can write the audio samples into a sample register, and adjust the volume with a separate register.
Of course, having a multiply in the CPU is always useful. For instance, when doing video output address calculation, you'd have to multiply the Y value by the screen width.
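As a minimal sketch of that address calculation, assuming a row-major frame buffer with one memory word per pixel (the function name, the `base` parameter and the 640-wide screen in the usage note are illustrative only):

```python
def pixel_address(x, y, screen_width, base=0):
    """Linear address of pixel (x, y) in a row-major frame buffer."""
    if not 0 <= x < screen_width:
        raise ValueError("x must lie inside the row")
    # The multiply is the y * screen_width term; x is then a simple add.
    return base + y * screen_width + x
```

For a 640-pixel-wide screen, `pixel_address(10, 2, 640)` gives 1290, that is two full rows plus ten pixels.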
Posted: Sat Mar 17, 2012 4:55 pm
by ElEctric_EyE
...And secondly, it needs to be based on OUT rather than temp:
A while ago you suggested making this change inside the cpu module:
Code:
assign AZ = ~|ADD; //calculate the Z flag inside the cpu.v module
Which one is best?
Posted: Sat Mar 17, 2012 5:06 pm
by Arlet
I think changing it in the ALU module is more elegant, because it keeps all the flags together. For synthesis, it should be exactly the same.
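For readers less used to Verilog reduction operators, `~|ADD` is a NOR across all bits of ADD, so the flag is 1 only when every bit is zero. A tiny Python equivalent, assuming ADD is 16 bits wide (the function name is made up):

```python
def z_flag(add, width=16):
    """Equivalent of `~|ADD`: 1 when all `width` bits of the result are zero."""
    mask = (1 << width) - 1
    return int((add & mask) == 0)
```

Bits above the assumed width, such as a carry out of bit 15, do not affect the flag in this model.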
Posted: Sat Mar 17, 2012 9:35 pm
by ElEctric_EyE
I've implemented your changes and am up to 98MHz with 16 Acc .b core

... Still tightening constraints on slow laptop (still at work)

103MHz.. Work done. Time to head home and finalize this speed test on a real machine.
Posted: Sun Mar 18, 2012 12:24 am
by ElEctric_EyE
This speed test will have to wait a few days. On my main machine, the modified version of cpu.v and ALU.v is not even passing a 10.5ns constraint which is BS, since the laptop was good down to at least 9.8ns.
Will double check tomorrow.
On a side note, today I've added 512 opcodes for ADCXopXi and SBCXopXi. Testing will continue tomorrow on the 16Acc's.
Posted: Sun Mar 18, 2012 10:59 pm
by ElEctric_EyE
Testing the SBCXopXi opcodes today took all day and pointed out some big-time errors. I won't get specific because that would make for a lot of boring reading. But I got this working:
Code:
START   LDA #$0001      ;LDAi #$0001
        LDBi            ;load B..P with $0002..$0010
        .BYTE $0002
        LDCi
        .BYTE $0003
        LDDi
        .BYTE $0004
        LDEi
        .BYTE $0005
        LDFi
        .BYTE $0006
        LDGi
        .BYTE $0007
        LDHi
        .BYTE $0008
        LDIi
        .BYTE $0009
        LDJi
        .BYTE $000A
        LDKi
        .BYTE $000B
        LDLi
        .BYTE $000C
        LDMi
        .BYTE $000D
        LDNi
        .BYTE $000E
        LDOi
        .BYTE $000F
        LDPi
        .BYTE $0010
        SEC             ;carry set = no borrow
AA      SBCPopPi        ;P = P - $0001
        .BYTE $0001
        BNE AA          ;loop until P is zero
BA      SBCOopOi        ;O = O - $0001
        .BYTE $0001
        BNE BA
CA      SBCNopNi        ;N = N - $0001
        .BYTE $0001
        BNE CA
DA      SBCMopMi        ;M = M - $0001
        .BYTE $0001
        BNE DA
EA      SBCLopLi        ;L = L - $0001
        .BYTE $0001
        BNE EA
        NOP
        NOP
        NOP
and I was able to get rid of the default: case statements.
I was debating whether to update the .b core Github posting using the 4Acc's, but the .b core has progressed to 16 Acc's, and when all is finally "straightened out" it should be easy enough to scale back down to 4Acc's, with everything working and tested in the 16 Acc core. So, if one is following Github changes, there are going to be monstrous changes which will be difficult to follow. Now that I have 16 Acc's working, I'll post regular updates in the future.
Posted: Tue Mar 20, 2012 1:35 pm
by ElEctric_EyE
Progress continues, corrected many mistakes. There must be some errors still because when I try the core in my project it is not displaying characters correctly. I would have suspected an issue with the carry flag, or a shifting operation but it worked with the 4Acc Var shift version. So most likely, my opcode decoding is screwed up somewhere. Will troubleshoot some more...
Using SmartXplorer to push top speeds though:
Code:
Timing summary:
---------------
Timing errors: 0 Score: 0 (Setup/Max: 0, Hold: 0)
Constraints cover 597244 paths, 0 nets, and 6385 connections
Design statistics:
Minimum period: 9.642ns{1} (Maximum frequency: 103.713MHz)
------------------------------------Footnotes-----------------------------------
1) The minimum period statistic assumes all single cycle delays.
Analysis completed Tue Mar 20 21:15:13 2012
--------------------------------------------------------------------------------
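The maximum frequency in a report like this is simply the reciprocal of the minimum period. A quick Python check, using the 9.642ns figure above and the 10.5ns and 9.8ns constraints mentioned earlier (the function name is made up):

```python
def max_frequency_mhz(period_ns):
    """Convert a minimum clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns

for period in (10.5, 9.8, 9.642):
    print(f"{period}ns -> {max_frequency_mhz(period):.3f}MHz")
```

So a 10.5ns constraint corresponds to about 95.2MHz, 9.8ns to about 102.0MHz, and 9.642ns to the 103.713MHz shown in the summary.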