65ORG16.b Core

Arlet · Post by **Arlet** » Sun Apr 22, 2012 2:44 pm

ElEctric_EyE wrote:

I thought involving the regfile, any part of it, in modifying stack page and zero page pointers would make it a 3 ported RAM? My regsel is already up to 5 bits, as I have 16 accumulators, the 3 index registers and the stack register.

No, what makes it a 3 ported RAM is the fact that you accessed the register file directly. As long as you only read the registers through the 'regfile' wires, the tools will make a single ported RAM. The problem with these special registers is that the 'regfile' output only represents the register that's currently in 'regsel'. So, if you're doing something like "STA ZP, X" the core needs both the X register and the A register, and both are funneled through the 'regfile' output, at different cycles of the instruction. Now if you also need the ZP base register, it's going to get crowded, maybe even impossible.

With my new proposal, the register file is only used as a backup. When you write the ZP base register, you write it both to the register file (like any other register), and at the same time, you write it to a dedicated register. When it's needed for the AB high byte, it reads it from the dedicated register, so it's not interfering with the register file at the wrong time. When it's needed for reading, like in a Transfer instruction, it's read back from the register file, like we do for the other registers. This means a minimal amount of change in the core, and the best chance that speed will not be compromised because we haven't added anything to the critical path.
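A minimal sketch of that shadow-write idea, not the actual core code. Only 'regfile', 'regsel' and AB are names from this thread; the module name, the write strobe, the data ports and the SEL_ZPP index are assumptions made for illustration.

```verilog
// Hypothetical sketch: everything except 'regfile', 'regsel' and 'AB'
// is an assumed name, not taken from the real 65Org16.b source.
module zp_shadow_sketch (
    input  wire        clk,
    input  wire        write_register,  // assumed: register-file write strobe
    input  wire [4:0]  regsel,          // register currently selected
    input  wire [15:0] wdata,           // assumed: value being written back
    input  wire [15:0] zp_offset,       // assumed: low half of a zero-page address
    output wire [15:0] regfile,         // the single read port
    output wire [31:0] AB               // address bus
);
    localparam SEL_ZPP = 5'd20;         // assumed index of the ZP base register

    reg [15:0] regs [0:31];             // register file storage
    reg [15:0] ZPP;                     // dedicated shadow copy of the ZP base

    // The only read of the array. Reading regs[SEL_ZPP] somewhere else
    // as well would ask the tools for an additional read port.
    assign regfile = regs[regsel];

    always @(posedge clk)
        if (write_register) begin
            regs[regsel] <= wdata;      // normal write, kept as the backup copy
            if (regsel == SEL_ZPP)
                ZPP <= wdata;           // same value, same cycle, into the shadow
        end

    // Address generation uses the shadow, so 'regsel' stays free for X or A.
    assign AB = { ZPP, zp_offset };
endmodule
```

A transfer that reads the pointer back would just select SEL_ZPP in 'regsel' and take the value from 'regfile', like any other register.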

ElEctric_EyE · Post by **ElEctric_EyE** » Sun Apr 22, 2012 7:24 pm

So something like this should avoid 3 ported RAM?

Code: Select all

BRK0:		AB = { QAWXYS[SEL_SPP], regfile };

Before all the modifications to make a relocatable stack and zero page, with what is presently on GitHub (commit b4acf177ad300aa40a48cb280f66aca6becaeed3), max speed with a constraint of 10.7ns was 97MHz.

If the above is OK (it does seem to work in simulation), max speed is 89MHz with a constraint of 11.2ns; 11.0ns fails. I made 2 opcodes at the base matrix $xx00_xx00_00x0_0111, so there are 32 opcodes total that transfer a value from any accumulator to either the Zero Page Pointer or the Stack Page Pointer. That should be enough, I think.
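As a quick sanity check on that count: reading $xx00_xx00_00x0_0111 as a 16-bit binary pattern with five don't-care bits (an assumption about the notation), a small simulation-only loop can count how many opcodes it covers. The module name is made up for illustration.

```verilog
// Simulation-only sketch; 'count_opcodes_sketch' is a made-up name.
module count_opcodes_sketch;
    integer    i, n;
    reg [15:0] op;

    initial begin
        n = 0;
        for (i = 0; i < 65536; i = i + 1) begin
            op = i;                                  // low 16 bits of the loop counter
            casez (op)
                16'b??00_??00_00?0_0111: n = n + 1;  // '?' marks a don't-care bit
                default: ;                           // any other opcode is not counted
            endcase
        end
        $display("matching opcodes: %0d", n);        // 5 don't-care bits give 2^5 = 32
    end
endmodule
```

Run in a simulator, this should print "matching opcodes: 32", which agrees with 16 accumulators times 2 pointers.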

EDIT: I reread, and see that I still need to make the 'shadow' register. Half way there! I'll post what I have so far on Github since we're confident in a good outcome.

EDIT: Oh, now I follow you! Thanks for the code. Working... 11.0ns passes! Trying 10.8ns constraint... Drat! It fails, but this is on the laptop without the SmartXPlorer optimizations. Firing up the desktop with the optimizations... Fails 11.0ns. Rerunning SmartXPlorer, it's following a different approach this time... Now it passes 11.0ns on the desktop. Trying 10.8ns... Fails.

Conclusion: I didn't retry a SmartXPlorer run on the desktop before adding the shadow write registers, which might have gotten the constraint to pass at 11.0ns as well. But as Arlet has said, with anything added afterwards we would notice a substantially bigger decrease in speed. Overall, I'm glad the speed is back up to an 11.0ns constraint, even with the powerful zero page and stack page relocation feature that's been added!

Testing continues... I see something I don't like with the INC A ($001A) opcode. It's a decode and reg opcode, but the IR seems to only recognize the DECODE portion and not the REG. Everything works, but IR is skewed from 'statename' for the rest of the simulation. The thing is, all other accumulator INCs ($xx1A) display properly in ISim. Any suggestions where the problem may lie?

Here's the proof of speed from ISE13.4 before all recent mods:

Code: Select all

 Timing summary: 
 --------------- 
  
 Timing errors: 0  Score: 0  (Setup/Max: 0, Hold: 0) 
  
 Constraints cover 1115709 paths, 0 nets, and 8144 connections 
  
 Design statistics: 
    Minimum period:  10.246ns{1}   (Maximum frequency:  97.599MHz) 
  
  
 ------------------------------------Footnotes----------------------------------- 
 1)  The minimum period statistic assumes all single cycle delays. 
  
 Analysis completed Mon Apr 23 00:31:08 2012  
 --------------------------------------------------------------------------------

And after all mods with Arlet's shadow write regs and SmartXPlorer:

Code: Select all

 Timing summary: 
 --------------- 
  
 Timing errors: 0  Score: 0  (Setup/Max: 0, Hold: 0) 
  
 Constraints cover 1370332 paths, 0 nets, and 8928 connections 
  
 Design statistics: 
    Minimum period:  10.759ns{1}   (Maximum frequency:  92.945MHz) 
  
  
 ------------------------------------Footnotes----------------------------------- 
 1)  The minimum period statistic assumes all single cycle delays. 
  
 Analysis completed Mon Apr 23 06:30:28 2012  
 --------------------------------------------------------------------------------

ElEctric_EyE · Post by **ElEctric_EyE** » Mon Apr 23, 2012 2:12 am

ElEctric_EyE wrote:

...Testing continues... I see something I don't like with the INC A ($001A) opcode. It's a decode and reg opcode, but the IR seems to only recognize the DECODE portion and not the REG. Everything works, but IR is skewed from 'statename' for the rest of the simulation. The thing is, all other accumulator INCs ($xx1A) display properly in ISim. Any suggestions where the problem may lie?

I noticed 1 more important item now, and I've noticed it before in other simulations. It doesn't have anything to do with $xx1A or any other opcode value. The actual program starts at $FFFFE000 with an LDA #$00 ($00A9, $0000), yet the sim is saying $00A9 starts at $FFFFE001. What gives? It's as if it's starting out at PC+1, not PC. Hmmm, could the assembler be causing the problem? This is what I have for the vectors:

Code: Select all

.ORG	$FFFFFFFA
		
		.WORD	START
		.WORD	START
		.WORD	START
		
		.END

ElEctric_EyE · Post by **ElEctric_EyE** » Mon Apr 23, 2012 5:46 pm

I've been looking at the 65CE02 datasheet today and saw it also had a basepage (zeropage) pointer, and opcodes to transfer X register into the pointer and transfer the pointer back into the X. So far the .b core does not have a pointer to accumulator transfer function. I will add these for the zero page and stack page pointers.

Messing around with the zeropage pointer is neat. It's just like indirect indexed. There's the opportunity to save some cycles compared to indirect indexed I think.

GARTHWILSON · Post by **GARTHWILSON** » Mon Apr 23, 2012 5:51 pm

The 65CE02 was neat. Too bad they hardly made any. I have a sample or two here but it did not work in my 65c02 socket on the workbench computer. I never tried to find out why. The '816 is nicer, and actually I have a 65802 in the 6502 socket now, which is how I developed my bank-0-only '816 Forth.

BigEd · Post by **BigEd** » Mon Apr 23, 2012 6:26 pm

ElEctric_EyE wrote:

I will add these for the zero page and stack page pointers.

Hi EEye, I agree that these base pointers are a neat feature. How much effort is it for you to add each opcode? I fear that you might be doing it in a way which involves more effort or more code than it really needs. (But I can't see your step-by-step progress so I don't know.) As I said before, for the special registers I favour a swap instruction, because a single instruction gives us both a read and a write - less work. For special registers (I mean for example these base registers) I see no need to do arithmetic, or any need to exchange with multiple other registers - so again, that's less work.

Any time there are multiple sources or destinations or operands, I would hope you are able to extract the bitfields and use them directly, without needing to use lots of lines of code.

Cheers
Ed

Arlet · Post by **Arlet** » Mon Apr 23, 2012 6:36 pm

BigEd wrote:

I fear that you might be doing it in a way which involves more effort or more code than it really needs. (But I can't see your step-by-step progress so I don't know.) As I said before, for the special registers I favour a swap instruction, because a single instruction gives us both a read and a write - less work.

I don't think a swap instruction is less work. It would require at least a change to the state machine. Using the existing register-register transfer only requires setting up the proper src_reg/dst_reg registers during DECODE.
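A rough sketch of what that could look like, reusing the opcode pattern EEye posted earlier ($xx00_xx00_00x0_0111, read as binary). Which value of the free bit selects which pointer, the register indices, the 'decode' strobe and the 'acc_sel' input are all assumptions; only src_reg and dst_reg are names from the core.

```verilog
// Hypothetical sketch: encodings, indices and port names are assumptions,
// not taken from the actual 65Org16.b source.
module transfer_decode_sketch (
    input  wire        clk,
    input  wire        decode,   // assumed: high during the DECODE state
    input  wire [15:0] IR,       // assumed: opcode available during DECODE
    input  wire [3:0]  acc_sel,  // assumed: accumulator index already extracted
    output reg  [4:0]  src_reg,  // register read by the transfer
    output reg  [4:0]  dst_reg   // register written by the transfer
);
    localparam SEL_ZPP = 5'd20;  // assumed register-file index, ZP pointer
    localparam SEL_SPP = 5'd21;  // assumed register-file index, stack pointer

    always @(posedge clk)
        if (decode)
            casez (IR)
                16'b??00_??00_0000_0111: begin       // assumed: accumulator to ZP pointer
                    src_reg <= {1'b0, acc_sel};
                    dst_reg <= SEL_ZPP;
                end
                16'b??00_??00_0010_0111: begin       // assumed: accumulator to stack pointer
                    src_reg <= {1'b0, acc_sel};
                    dst_reg <= SEL_SPP;
                end
                default: ;                           // other opcodes are handled elsewhere
            endcase
endmodule
```

The point of the sketch is that only the decode logic grows: the existing transfer states read src_reg and write dst_reg as before, so no new state is needed.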

BigEd · Post by **BigEd** » Mon Apr 23, 2012 6:41 pm

But I did XBA without changing the state machine... ah, it will be because B was outside my register file. So, point conceded! (I would have placed these special registers outside the register file too...)

Arlet · Post by **Arlet** » Mon Apr 23, 2012 7:04 pm

The advantage of putting them in an (extended) register file is that you get a free multiplexer. If you create special registers, it requires a wider mux in the internal data path, and probably a bigger slowdown overall.
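To picture the extra mux: if the pointers lived outside the register file, something like the following would have to sit in the data path after the register-file read. This is only an illustration; the port names and the select encoding are assumptions.

```verilog
// Hypothetical sketch of the extra data-path mux needed when the special
// registers are kept outside the register file. All names are assumed.
module special_reg_mux_sketch (
    input  wire [15:0] regfile,      // output of the register-file read port
    input  wire [15:0] ZPP,          // assumed: stand-alone zero-page pointer
    input  wire [15:0] SPP,          // assumed: stand-alone stack-page pointer
    input  wire [1:0]  special_sel,  // assumed: 0 = regfile, 1 = ZPP, 2 = SPP
    output reg  [15:0] src           // assumed: value fed to the internal data path
);
    always @*
        case (special_sel)
            2'd1:    src = ZPP;
            2'd2:    src = SPP;
            default: src = regfile;  // normal registers
        endcase
endmodule
```

With the pointers inside the register file, the RAM's own address decoding does this selection and 'src' is simply 'regfile'.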

ElEctric_EyE · Post by **ElEctric_EyE** » Mon Apr 23, 2012 8:01 pm

BigEd wrote:

...Any time there are multiple sources or destinations or operands, I would hope you are able to extract the bitfields and use them directly, without needing to use lots of lines of code.

Cheers
Ed

This is where I've had a LOT of practice modifying the .b core: bitfields with 'x's. I made lots of mistakes early on that cost me much time. But once I got the hang of the errors in simulation, I also started to learn where to look for potential errors in my decoding. I make fewer mistakes now, and when I make them I'm faster at finding the problem area. It can challenge one's sanity however, doing this day after day! I think I'm getting OCD with this core.

BigEd · Post by **BigEd** » Mon Apr 23, 2012 8:17 pm

Well, yes, you've had lots of practice, and you're pretty sure you've got the bugs out, but the code you have is verbose and probably difficult to maintain. I'm looking at the big case statements here and here and thinking they should really be done differently.

I think what you'd do is store the necessary bitfields during the decode state so you can use them when you need them. (Maybe you don't even need to do that.) The reason that instruction encodings place register identities into bitfields is so that the CPU can just wire the register address from the IR to the register file through a fairly simple multiplexer arrangement. These big case statements hide that, and maybe hide some bugs at the same time.
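Something along these lines, as a sketch only. The field positions (source in IR[15:12], destination in IR[11:8]) are a hypothetical layout chosen for readability, not the real .b encoding, and the port names are assumptions.

```verilog
// Hypothetical sketch: field positions and port names are assumptions.
module bitfield_regsel_sketch (
    input  wire        clk,
    input  wire        decode,   // assumed: high during the DECODE state
    input  wire [15:0] IR,       // assumed: opcode available during DECODE
    input  wire        use_dst,  // assumed: state machine wants the destination
    output wire [3:0]  regsel    // register-file address
);
    reg [3:0] src_acc;
    reg [3:0] dst_acc;

    // Latch the register fields once, instead of listing one case item
    // per accumulator and per opcode.
    always @(posedge clk)
        if (decode) begin
            src_acc <= IR[15:12];
            dst_acc <= IR[11:8];
        end

    // A simple 2-way mux then drives the register-file address.
    assign regsel = use_dst ? dst_acc : src_acc;
endmodule
```

Adding another accumulator opcode then means no change here at all, as long as it keeps its register numbers in the same fields.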

Cheers
Ed

ElEctric_EyE · Post by **ElEctric_EyE** » Tue Apr 24, 2012 1:31 am

Hmm, earlier I said I was getting OCD. I was mistaken. I have OCO, Obsessive Compulsive Order. This is no disorder, but it is OC!

BigEd wrote:

...These big case statements hide that, and maybe hide some bugs at the same time...

No, I argue against this.
I expanded the decoding statements so that I, as a human still adding opcodes, could make sure a new opcode decode will not interfere with any other predefined opcode decode. This becomes a bit difficult when dealing with 65536 opcode possibilities compared to the original NMOS 8-bit. Look at Arlet's state machine and his load_reg opcode decoding, then look at mine. His is optimized horizontally and vertically. I am presently optimizing my set based on the columns only, and it looks like a mess (scroll to lines 1043-1084 and lines 899-950 on your link), but it works! And it works well, and it's not such a mess after all because at this point I am letting the tools do their work. The tools can optimize... I am out to eradicate the errors in my .b core, and I am doing this by testing every macro I write.
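The non-overlap check itself can also be handed to the simulator. Below is a sketch that writes each pattern as a mask/value pair and counts the opcodes matched by both. The two patterns are ones mentioned earlier in this thread: $xx00_xx00_00x0_0111 read as binary, and $xx1A read as "any upper byte, low byte $1A". Both readings are assumptions, and the module name is made up.

```verilog
// Simulation-only sketch; names and pattern readings are assumptions.
module overlap_check_sketch;
    // Mask bits are 1 where the opcode bit is fixed, 0 where it is an 'x'.
    localparam [15:0] A_MASK = 16'b0011_0011_1101_1111;  // xx00_xx00_00x0_0111
    localparam [15:0] A_VAL  = 16'b0000_0000_0000_0111;
    localparam [15:0] B_MASK = 16'h00FF;                 // $xx1A
    localparam [15:0] B_VAL  = 16'h001A;

    integer i, clashes;

    initial begin
        clashes = 0;
        for (i = 0; i < 65536; i = i + 1)
            if (((i & A_MASK) == A_VAL) && ((i & B_MASK) == B_VAL))
                clashes = clashes + 1;               // opcode claimed by both patterns
        $display("opcodes matched by both patterns: %0d", clashes);
    end
endmodule
```

For these two patterns the count is 0, since their low nibbles (0111 and 1010) can never agree. Repeating the comparison for every pair of decode patterns would flag an accidental overlap before it shows up as a skewed simulation.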

BigEd · Post by **BigEd** » Tue Apr 24, 2012 5:32 am

Hi EEye
I completely agree that handling the non-overlap of the opcode map is crucial, and that the not-quite-so-big case statements at lines 900 and 1047 need to take something like the form they do. It's the other two I mentioned which I think are worth reconsidering.

Cheers
Ed

ElEctric_EyE · Post by **ElEctric_EyE** » Tue Apr 24, 2012 10:37 am

I'm all ears, Ed. Do you (or anyone) have any ideas on how to actually go about this?

BTW, I've finished adding TWX, TWY, TYW, TXW, rearranged 1 or 2 opcodes and finished an update to the .b spec.

ElEctric_EyE · Post by **ElEctric_EyE** » Wed Apr 25, 2012 9:36 pm

I constantly run speed tests on this core as I add features to it, on 2 different x86 machines. I'm wondering whether the opcode decoding in the state machine would get faster if I were to replace an 'x' with a known state, especially a '0'.

I ask only because it seems I am getting 2 different results depending on which machine I run ISE13.4 from in order to optimize this core with SmartXPlorer. My mobile laptop has a Pentium 4. My main desktop has a 4GHz I7 dual core 875K (my main optimizer).

One run takes about 1 hour, the other about 5 minutes, but they give 2 different results.
