65ORG16.b Core
Hi Arlet
thanks for the re-pipelined code!
Hi EEye
looking at your present code on github, I notice a few things:
- regsel is still a 3-bit signal, so I think you may have only 8 registers total
- A is hard coded as register 0, X as 4, Y as 5 and SP as 6. This is not a bad thing, as it presumably means one can do arithmetic on X, Y and SP. But it might be worth noting. Also you might want to move them around a bit.
Here's an idea: instead of dedicating 4 bits to the shift distance, and therefore having to restrict the variable-distance shift opcodes to just 4 registers, how about allowing just 2 bits of shift distance
00 : shift by 1
01 : shift by 2
10 : shift by 4
11 : shift by 8
and then you'll have room to apply the variable-shift distance to 8 registers.
The most common shifts can still be performed in a single operation, and other shifts can be performed in two or three operations.
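A rough sketch of how that 2-bit field could be decoded, in case it helps. The module and signal names (`shift_decode`, `rr`, `din`, `dout`) are made up for illustration and are not taken from the core on github; it only shows a left shift.

```verilog
// Hypothetical sketch: decode a 2-bit shift-distance field into 1, 2, 4 or 8.
// None of these names come from the actual 65Org16 source.
module shift_decode #(parameter dw = 16) (
    input  wire [1:0]    rr,    // 2-bit shift-distance field from the opcode
    input  wire [dw-1:0] din,   // value to be shifted
    output wire [3:0]    dist,  // decoded distance
    output wire [dw-1:0] dout   // din shifted left by dist
);
    assign dist = 4'd1 << rr;   // 00 -> 1, 01 -> 2, 10 -> 4, 11 -> 8
    assign dout = din << dist;
endmodule
```

With this encoding a shift by 3 would be a shift by 1 followed by a shift by 2, and a shift by 7 would take three operations (1, 2 and 4).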
Cheers
Ed
-
ElEctric_EyE
- Posts: 3260
- Joined: 02 Mar 2009
- Location: OH, USA
BigEd wrote:
Hi Arlet
thanks for the re-pipelined code!
BigEd wrote:
...looking at your present code on github, I notice a few things:
- regsel is still a 3-bit signal, so I think you may have only 8 registers total...
BigEd wrote:
...Here's an idea: instead of dedicating 4 bits to the shift distance, and therefore having to restrict the variable-distance shift opcodes to just 4 registers, how about allowing just 2 bits of shift distance
00 : shift by 1
01 : shift by 2
10 : shift by 4
11 : shift by 8
and then you'll have room to apply the variable-shift distance to 8 registers.
The most common shifts can still be performed in a single operation, and other shifts can be performed in two or three operations.
Cheers
Ed
Code: Select all
For IR[15:0]
16'bssdd_ssdd_xxxx_xxxx
IR[15:14] = src_reg
IR[13:12] = dst_reg
IR[11:10] = src_reg
IR[9:8] = dst_reg
IR[7:0] = NMOS 6502 opcode
BigEd wrote:
...- A is hard coded as register 0, X as 4, Y as 5 and SP as 6. This is not a bad thing, as it presumably means one can do arithmetic on X, Y and SP. But it might be worth noting. Also you might want to move them around a bit...
ElEctric_EyE wrote:
BigEd wrote:
how about allowing just 2 bits of shift distance
Code: Select all
For IR[15:0]
16'bssdd_ssdd_xxxx_xxxx
Code: Select all
For IR[15:0]
16'bsdsd_ssdd_xxxx_xxxx ; s and d are src and dst for non-shift
16'brrsd_ssdd_xxxx_xxxx ; rr is shift/rotate distance for shift ops
Code: Select all
For IR[15:0]
16'bsdss_sddd_xxxx_xxxx ; s and d are src and dst for non-shift
16'brrss_sddd_xxxx_xxxx ; rr is shift/rotate distance for shift ops
Quote:
BigEd wrote:
...- A is hard coded as register 0, X as 4, Y as 5 and SP as 6. This is not a bad thing, as it presumably means one can do arithmetic on X, Y and SP. But it might be worth noting. Also you might want to move them around a bit...
Code: Select all
0 A
1 B
2 C
3 D
4 E also X
5 F also Y
6 G also SP
7 H
For example
TAN
will be the same thing as
TAX
which isn't very interesting, but adding something to D placing it in N will be placing it in X (they are different names for the same place) and this might well be useful.
(This is assuming that I've understood correctly the way your verilog makes use of the register file)
Although it would be yet more work for the assembler, I'd also suggest R0 to R15 as synonyms for the accumulators (or registers) - that would allow a more regular form of assembly which would be less familiar to 6502 fans but might be easier to work with.
Cheers
Ed
-
ElEctric_EyE
- Posts: 3260
- Joined: 02 Mar 2009
- Location: OH, USA
This is what I have for the 16 Acc .b core now:
Code: Select all
reg [4:0] regsel; // Select A thru P, X, Y or S register
wire [dw-1:0] regfile = PAXYS[regsel]; // Selected register output
parameter
SEL_A = 5'd0,
SEL_B = 5'd1,
SEL_C = 5'd2,
SEL_D = 5'd3,
SEL_E = 5'd4,
SEL_F = 5'd5,
SEL_G = 5'd6,
SEL_H = 5'd7,
SEL_I = 5'd8,
SEL_J = 5'd9,
SEL_K = 5'd10,
SEL_L = 5'd11,
SEL_M = 5'd12,
SEL_N = 5'd13,
SEL_O = 5'd14,
SEL_P = 5'd15,
SEL_X = 5'd16,
SEL_Y = 5'd17,
SEL_S = 5'd18;
As far as changing values during a transfer, there's really no need for that. I've done what Arlet suggested, which is to do that kind of thing during the actual math or logical operation. So it looks something like this:
Code: Select all
ADCAopBi #$001F ; Add $1F to A, store in B
An ADCAopAi is actually the original ADC #$xx, immediate.
or
Code: Select all
EOR AopBa $FFFFE000 ; EOR value in $FFFFE000 with A, store in B
So the bit structure is pretty much locked in at this point, especially since I've already done a lot of work typing out a good number of macros and running them through simulation.
Just so one can get a grasp on what I am doing: just as there were 256 opcode possibilities for all Transfers between 16 Acc's (only 240 useful), there are another 256 opcodes for each and every operation per addressing mode. 256 for ADCXopXi, 256 for ADC XopXa, 256 for ANDXopXi, etc.
Also, back to the variable shift. If this cpu is working with audio for example and you wanted to lower the volume of a sample, there would be a need for a variable high speed shift such as this. I would imagine this is also the case with video.
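To illustrate the audio case, here is a minimal sketch of volume reduction by a variable arithmetic right shift. It assumes signed two's-complement samples; the module and port names (`volume_shift`, `sample`, `dist`, `scaled`) are hypothetical and not part of the core.

```verilog
// Hypothetical sketch: attenuate a signed audio sample with a variable shift.
// Assumes two's-complement samples; names are illustrative only.
module volume_shift #(parameter dw = 16) (
    input  wire signed [dw-1:0] sample, // signed audio sample
    input  wire        [3:0]    dist,   // shift distance, 0..15
    output wire signed [dw-1:0] scaled  // attenuated sample
);
    // >>> on a signed operand is an arithmetic shift, so the sign bit is
    // replicated and negative samples stay negative.
    assign scaled = sample >>> dist;
endmodule
```

Each extra bit of shift halves the amplitude, which is roughly 6 dB of attenuation per step, so this only gives coarse volume steps.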
-
ElEctric_EyE
- Posts: 3260
- Joined: 02 Mar 2009
- Location: OH, USA
BigEd wrote:
Oh I see - X Y and S are now distinct from your accumulators. Do you see the advantage I mean in making them aliases for 3 of your accumulators? It makes these index registers less special and allows one to perform arithmetic on them.
As you know, I wanted to do variable shift really just for x4, x8, & x12, but I think I should keep the full 4 bits available.
Multiplication, as in 16x16, won't hurt speed? I have an idea I am going to try real soon: presenting the ALU with two 16-bit wide registers and a multiply bit. There will also be a 32 (33?) bit wide result register. I just haven't had time to get around to it yet!
ElEctric_EyE wrote:
Multiplication, as in 16x16, won't hurt speed?
Code: Select all
AXYS[regsel] <= (state == JSR0) ? DIMUX : ADD;
Code: Select all
AXYS[regsel] <= (state == JSR0) ? DIMUX :
                (state == MUL0) ? MUL : ADD;
Code: Select all
always @(posedge clk)
    MUL <= AI * BI;
If you put the multiply inside the ALU, I would expect a bigger drop in speed, because it can't pipeline it, and even though the multiplier itself is still pretty fast, you also have to consider the routing delays to and from the multipliers, which are only found in certain places on the FPGA.
By the way, for volume adjustments in audio, you also might want to consider audio channels with multipliers built into the hardware path. That way, you can write the audio samples into a sample register, and adjust the volume with a separate register.
Of course, having a multiply in the CPU is always useful. For instance, when doing video output address calculation, you'd have to multiply the Y value by the screen width.
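A minimal sketch of that address calculation, with the multiply registered in its own stage rather than sitting inside the ALU path. The names (`pixel_addr`, `SCREEN_WIDTH`, `row_base`) and the 640-pixel width are assumptions for illustration only.

```verilog
// Hypothetical sketch: video address = y * SCREEN_WIDTH + x, split into two
// registered stages. Names and the default width are illustrative assumptions.
module pixel_addr #(
    parameter dw = 16,
    parameter SCREEN_WIDTH = 640
) (
    input  wire            clk,
    input  wire [dw-1:0]   x,
    input  wire [dw-1:0]   y,
    output reg  [2*dw-1:0] addr
);
    reg [2*dw-1:0] row_base;  // stage 1 result: y * SCREEN_WIDTH
    reg [dw-1:0]   x_d;       // x delayed one cycle to line up with row_base

    always @(posedge clk) begin
        row_base <= y * SCREEN_WIDTH;  // stage 1: registered multiply
        x_d      <= x;
        addr     <= row_base + x_d;    // stage 2: add the column offset
    end
endmodule
```

The result appears two clocks after x and y are presented. Registering the product on its own is intended to keep the multiplier and its routing out of the adder path.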
-
ElEctric_EyE
- Posts: 3260
- Joined: 02 Mar 2009
- Location: OH, USA
Arlet wrote:
...And secondly, it needs to be based on OUT rather than temp:
Code: Select all
assign Z = ~|OUT;
Code: Select all
assign AZ = ~|ADD; //calculate the Z flag inside the cpu.v module
-
ElEctric_EyE
- Posts: 3260
- Joined: 02 Mar 2009
- Location: OH, USA
This speed test will have to wait a few days. On my main machine, the modified version of cpu.v and ALU.v is not even passing a 10.5ns constraint which is BS, since the laptop was good down to at least 9.8ns.
Will double check tomorrow.
On a side note, today I've added 512 opcodes for ADCXopXi and SBCXopXi. Testing will continue tomorrow on the 16Acc's.
-
ElEctric_EyE
- Posts: 3260
- Joined: 02 Mar 2009
- Location: OH, USA
Testing the SBCXopXi opcodes today took all day and pointed out some big time errors. I won't get specific because that would make for a lot of boring reading. But I got this working:
Code: Select all
START LDA #$0001 ;LDAi #$0001
LDBi
.BYTE $0002
LDCi
.BYTE $0003
LDDi
.BYTE $0004
LDEi
.BYTE $0005
LDFi
.BYTE $0006
LDGi
.BYTE $0007
LDHi
.BYTE $0008
LDIi
.BYTE $0009
LDJi
.BYTE $000A
LDKi
.BYTE $000B
LDLi
.BYTE $000C
LDMi
.BYTE $000D
LDNi
.BYTE $000E
LDOi
.BYTE $000F
LDPi
.BYTE $0010
SEC
AA SBCPopPi
.BYTE $0001
BNE AA
BA SBCOopOi
.BYTE $0001
BNE BA
CA SBCNopNi
.BYTE $0001
BNE CA
DA SBCMopMi
.BYTE $0001
BNE DA
EA SBCLopLi
.BYTE $0001
BNE EA
NOP
NOP
NOP
and I was able to get rid of the default: case statements.
I was debating whether to update the .b core Github posting using the 4Acc's, but the .b core has progressed to 16 Acc's and when all is finally "straightened out", it should be easy enough to scale back down to 4Acc's, with everything working and tested in the 16 Acc core. So, if one is following Github changes, there are going to be monstrous changes which will be difficult to follow. Now that I have 16 Acc's working, I'll post regular updates in the future.
-
ElEctric_EyE
- Posts: 3260
- Joined: 02 Mar 2009
- Location: OH, USA
Progress continues, corrected many mistakes. There must be some errors still because when I try the core in my project it is not displaying characters correctly. I would have suspected an issue with the carry flag, or a shifting operation but it worked with the 4Acc Var shift version. So most likely, my opcode decoding is screwed up somewhere. Will troubleshoot some more...
Using SmartXplorer to push top speeds though:
Code: Select all
Timing summary:
---------------
Timing errors: 0 Score: 0 (Setup/Max: 0, Hold: 0)
Constraints cover 597244 paths, 0 nets, and 6385 connections
Design statistics:
Minimum period: 9.642ns{1} (Maximum frequency: 103.713MHz)
------------------------------------Footnotes-----------------------------------
1) The minimum period statistic assumes all single cycle delays.
Analysis completed Tue Mar 20 21:15:13 2012
--------------------------------------------------------------------------------