
All times are UTC




Posted: Sat Mar 10, 2012 2:48 am
Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
More successful testing of the 'transposing <shift,rotate>' opcodes:
Code:
START             LDA #$05
                  ROL A
                  ROL A
                  LDA #$A5      ;ASLx8 shiftAop, store in A Acc
                  .BYTE $700A
                  LDA #$55      ;ASLx8 shiftAop, store in B Acc
                  .BYTE $740A
                  LDA #$AA      ;ASLx8 shiftAop, store in C Acc
                  .BYTE $780A
                  LDA #$5A      ;ASLx8 shiftAop, store in D Acc
                  .BYTE $7C0A     
                  NOP
                  NOP
                  NOP

Image
Image

Need to observe the carry next...


Posted: Sat Mar 10, 2012 7:21 am
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
ElEctric_EyE wrote:
Hmm, I had always thought the cycles were clearly stated in the MOS Hardware manual. I probably didn't look close enough.


Actually, yes, in the MOS hardware manual on page A-2, it explains the cycle timing of the single byte instructions, and they all perform an extra read of the opcode which is then discarded.

Because my core has a different pipeline, the dummy reads are in different places. The idea behind them is still the same.

The differences between the original 6502 and my core are due to the fact that I only use a single clock, so a TYA instruction takes 2 whole cycles. On the 6502, with 2-phase clocking, it takes 4 half cycles, which gives them more flexibility.


Posted: Sat Mar 10, 2012 2:00 pm
Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
Excellent explanation!

I think I am done adding to the .b core. I am very pleased with its speed of around 95MHz, and adding multipliers would surely slow it down. Perhaps when I have a real need for multipliers, maybe in the Digital Sound Synthesis board, I'll try tackling a .c core. I would like to use the .b core as it is now and start using it on the DevBoard! For anyone interested in the Verilog code, I'll post an update to Github today. I think I'll spend most of my day off next Tuesday writing up a synopsis of all the new opcodes and their values, which I will also make available.

For comparison's sake, here is the utilization of the 144-pin Spartan6 XC6SLX9 with the .b core at its first stage (BCD and D flag removed) versus the latest .b core on the bottom. For as much as was added, not much more has been used.
Image
Image


Posted: Mon Mar 12, 2012 5:08 pm
Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
Looking at the opcodes, there are 4 bits left over in all opcodes except the <shift,rotate> opcodes, which presently operate on the A,B,C,D Accumulators.
I think I am going to try to use the leftover bits to expand src_reg and dst_reg to 4 bits each. This means 16 Acc's total:
12 'general purpose' Acc's that can do ADC/SBC but not ASL/ROL/LSR/ROR, plus transpositional stores to all 16 Acc's, and the 4 existing Acc's that can do everything. This will max out the opcode bits and finalize this .b version.

The writeup has been delayed for just a little. Should be easy to code for this idea and test it in the next couple days.

EDIT(3/13/2012): While trying to jump to 16 Acc's, I've had to back up to my work on 4 Acc's because I found a major flaw in how I was decoding LDA[A..D], STA[A..D] & CMP[A..D]. Also, there is a problem with the variable shifting: the shifted value only lasts 2 cycles, then goes back to the original value. Working on it...


Posted: Tue Mar 13, 2012 6:15 pm
Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
Those problems have been fixed and Github updated!! Now on to 16 Acc's backward compatible with the original .b version.


Posted: Wed Mar 14, 2012 10:17 am
Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
I think I have it worked out. It's a bit convoluted, but it was important to me to have the uppermost 4 bits define the shift distance for the 16-bit barrel shifter used by the ASL, ROL, LSR, ROR opcodes. These opcodes work on 1 of 4 Acc's: A,B,C,D. They can take a value from A,B,C,D, then shift/rotate it, and finally store the shifted/rotated value in A,B,C,D.

Opcode bits for ASL etc. look like this for all addressing modes. 2 bits for source, 2 bits for destination.
For IR[15:0]
16'bxxxx_ssdd_0xxx_x110
16'bxxxx_ssdd_0xxx_1010

IR[15:12] = distance to shift/rotate
IR[11:10] = src_reg
IR[9:8] = dst_reg
IR[7:0] = NMOS 6502 opcode
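
As a rough illustration of the field layout above, the split of IR for a <shift,rotate> opcode could be sketched in Verilog like this. The module and signal names (shift_fields, distance, opcode) are made up for the sketch and are not taken from the .b core source; only the bit positions come from the list above.

```verilog
// Hypothetical helper, not part of the 65Org16.b source.
// Splits a 16-bit <shift,rotate> opcode into the fields listed above.
module shift_fields (
    input  wire [15:0] IR,        // instruction register
    output wire [3:0]  distance,  // IR[15:12]: distance to shift/rotate
    output wire [1:0]  src_reg,   // IR[11:10]: source Acc
    output wire [1:0]  dst_reg,   // IR[9:8]:   destination Acc
    output wire [7:0]  opcode     // IR[7:0]:   NMOS 6502 opcode
);
    assign distance = IR[15:12];
    assign src_reg  = IR[11:10];
    assign dst_reg  = IR[9:8];
    assign opcode   = IR[7:0];
endmodule

// Tiny testbench: decode one arbitrary example word.
module shift_fields_tb;
    reg  [15:0] IR;
    wire [3:0]  distance;
    wire [1:0]  src_reg, dst_reg;
    wire [7:0]  opcode;

    shift_fields dut (
        .IR(IR), .distance(distance),
        .src_reg(src_reg), .dst_reg(dst_reg), .opcode(opcode)
    );

    initial begin
        IR = 16'h360A;  // 0011_0110_0000_1010, low byte $0A = ASL A
        #1 $display("distance=%0d src=%0d dst=%0d opcode=%h",
                    distance, src_reg, dst_reg, opcode);
        // prints: distance=3 src=1 dst=2 opcode=0a
        $finish;
    end
endmodule
```

The example word $360A is arbitrary and was chosen only to give each field a distinct value.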

Now for all other NMOS 6502 opcodes, IR[15:12] together with IR[11:8] define 16 Acc's (A thru P), using 4 bits for source and 4 bits for destination. The structure of IR[11:8] had to be kept the same so we can have instructions that transfer values from A,B,C,D to the other 12 Acc's.

For IR[15:0]
16'bssdd_ssdd_xxxx_xxxx

IR[15:14] = src_reg
IR[13:12] = dst_reg
IR[11:10] = src_reg
IR[9:8] = dst_reg
IR[7:0] = NMOS 6502 opcode
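
One way the two 2-bit halves might be joined into 4-bit register selects is sketched below. This is only a guess at the wiring: it assumes IR[15:14] and IR[13:12] act as the high bits, so that an upper nibble of 0000 still selects A..D as before. The post does not state the bit order, and the names acc_select, src_sel and dst_sel are invented for the sketch.

```verilog
// Hypothetical sketch, not the actual .b core decode.
// Assumption: the upper-nibble bits are the HIGH bits of each select,
// so IR[15:12] = 0000 leaves the original A..D encoding unchanged.
module acc_select (
    input  wire [15:0] IR,
    output wire [3:0]  src_sel,  // 0..15, Acc A thru P
    output wire [3:0]  dst_sel   // 0..15, Acc A thru P
);
    assign src_sel = {IR[15:14], IR[11:10]};
    assign dst_sel = {IR[13:12], IR[9:8]};
endmodule

module acc_select_tb;
    reg  [15:0] IR;
    wire [3:0]  src_sel, dst_sel;

    acc_select dut (.IR(IR), .src_sel(src_sel), .dst_sel(dst_sel));

    initial begin
        IR = 16'h6CA9;  // 01_10_11_00 in the upper byte, arbitrary example
        #1 $display("src=%0d dst=%0d", src_sel, dst_sel);
        // prints: src=7 dst=8
        $finish;
    end
endmodule
```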

Writing up some macros now to make a ROM to test this out. At about 16 macros per instruction, that takes some typing. Having done LDA #$xxxx, LDA $xxxxxxxx, TAX, TXA, TAY, TYA for 16 Acc's, I see a pattern and progress quickens. Just a few more instructions like STA $xxxxxxxx and I can test.


Posted: Wed Mar 14, 2012 1:50 pm
Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
Not having too much success. The Sim is crashing when I try to scroll to the beginning of the waveforms! :lol: The statename is confused, all X's. I'm focusing on the Microcode state machine. All I'm trying to do is load each Acc with a value. Any suggestions? (Quit while I'm ahead with 4 Acc's?)


Posted: Wed Mar 14, 2012 2:08 pm
Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
Well, the good news is I plugged in the new .b core with 4 Acc's + transposing stores + variable shift (the version on Github) and it works on the Devboard with all the software I have written. In addition, Bruce's C'mon works. So it is truly backward compatible!
Instead of working on advancing the .b core to 16 Acc's, I'll try to take advantage of the new opcodes. This is one reason I wanted a cycle counter built into the .b core: to compare the cycle counts of the new opcodes against the original opcodes. I might bring it back...

Will give the brain a rest on 16 Acc's for now...


Posted: Thu Mar 15, 2012 4:32 pm
Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
Enough rest! I got a good sim working for 16 basic LDA[A..P] immediate opcodes and a var shift ASLA working at the end of the test.

The problem was in the microcode state machine, as I thought. I had too quickly assigned 'xxxx' to IR[11:8] for all opcodes, when it should've looked like this:
Code:
      16'bxxxx_00xx_0x00_1000:   state <= PUSH0; // PHP, PHA
      16'bxxxx_00xx_0x10_1000:   state <= PULL0; // PLP, PLA
      16'bxxxx_00xx_0xx1_1000:   state <= REG;   // CLC, SEC, CLI, SEI
      16'b0000_0000_1xx0_00x0:   state <= FETCH; // IMM
      16'b0000_0000_1xx0_1100:   state <= ABS0;  // X/Y abs
      16'bxxxx_00xx_1xxx_1000:   state <= REG;   // DEY, TYA, ...


Will do more testing, but this is good news! Just looking at it I know it's not 100%, but 1 step at a time...
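
To show why the don't-care bits matter, here is a minimal standalone sketch comparing a pattern with IR[11:8] fully 'xxxx' against one with IR[11:10] pinned to 00. It is not the core's state machine: the module name and the loose/tight signals are invented, and only the REG-style pattern from the snippet above is reused.

```verilog
// Hypothetical demo of how casex don't-care bits widen a match.
module casex_demo;
    reg [15:0] IR;
    reg        loose, tight;

    always @* begin
        // IR[11:8] left completely as don't-care
        casex (IR)
            16'bxxxx_xxxx_0xx1_1000: loose = 1'b1;
            default:                 loose = 1'b0;
        endcase
        // IR[11:10] must be 00, as in the corrected patterns
        casex (IR)
            16'bxxxx_00xx_0xx1_1000: tight = 1'b1;
            default:                 tight = 1'b0;
        endcase
    end

    initial begin
        #1 IR = 16'h0418;  // low byte $18 (CLC), but IR[11:10] = 01
        #1 $display("loose=%0d tight=%0d", loose, tight);
        // prints: loose=1 tight=0
        $finish;
    end
endmodule
```

One more thing worth keeping in mind: casex also treats X or Z bits in IR itself as wildcards, so an undefined IR in simulation can match the first pattern it reaches.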


Posted: Fri Mar 16, 2012 12:16 am
Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
I feel some opcode syntax definitions are in order. We are all 6502 fans, and when someone like myself ventures into a project and becomes very involved in the detail, some rules for common ground are needed so we are all on the same page (all 1 of us? :lol:)

Some of my observations, self critiques:

I find myself haphazardly coming up with opcode names like LDB or LDC, or maybe you've seen LD[A..P], which is not proper. Given the amount of time I spend on this core, I intend for all angles to be covered, and if I miss any, please say so.

Also, posting waveforms for 16 Accumulators becomes cumbersome especially for a long piece of software. And I've noticed test routines I write are quite long just to test a basic LDA[A..P], so you may not see many more waveforms, unless there's a serious problem. There I go again! Time for definitions.

No more LD[A..P], which was intended to mean 'load accumulator A thru P' to describe a piece of software, not one opcode. It actually means 16 opcodes. But the syntax itself is incorrect. And this will become important when/if I post code heavily laden with macros which are defined elsewhere and the viewer is unable to see the defs. The new opcode syntax is meant to be 'intuitive', so LD[A..P] will become LDA[A..P] (LoaD Accumulators A thru P) when I speak of a chunk of code, i.e. there is no opcode to load Acc's A thru P with a single value. It means I've done a test which loads values successively into those accumulators.

Anyway, I'm not one to sit here all night typing away. I need to make more Macro's!!!!
__________________________________________________________________________________________________________________________
Today, I found some errors while testing in simulation and have fixed them for the following opcodes so far:

LDA[A..P]i (like LDA #$xxxx, immediate); LoaD Acc [A thru P] immediate. Example: LDAAi #$0000, LDABi #$0001, etc.

TA[A..P]Y (like TAY); Transfer Acc [A thru P] to Y reg. Example: TAAY, TABY, etc.

TXY, TYX (new); Transfer X reg to Y reg and vice versa. Example: TXY, TYX.

TXA[A..P] (like TXA); Transfer X reg to Acc [A thru P]. Example: TXAA, TXAB, etc.

PLA[A..P] (like PLA); PulL from stack and put in Acc [A thru P]. Example: PLAA, PLAB, etc.

PHA[A..P] (like PHA); PusH Acc [A thru P] to stack. Example: PHAA, PHAB, etc.


Posted: Fri Mar 16, 2012 10:11 am
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10828
Location: England
Hi EEye
Instead of waveforms, you can use $display to dump out the state of the machine textually, then paste that into a posting: it's much more concise and readable.

There's a commented-out one in cpu.v, which of course you can adjust according to which registers or signals are of interest. It's normal to put the $display in a higher-level module, indeed into a testbench level, and then to refer to inner-level signals. See here for example:
Code:
initial begin
  $display("+--------------------------------------------------------+");
  $display("|  r1  |  r2  |  ci  | u0.sum | u1.sum | u2.sum | u3.sum |");
  $display("+--------------------------------------------------------+");
  $monitor("|  %h   |  %h   |  %h   |    %h    |   %h   |   %h    |   %h    |",
  r1,r2,ci, tb.U.u0.sum, tb.U.u1.sum, tb.U.u2.sum, tb.U.u3.sum);
end

where the $display and $monitor pick up signals inside sub-modules 3 levels deep.
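
For anyone who wants to try this without the full core, here is a small self-contained sketch of the same idea. The counter module and its signal names are invented purely for the demo and have nothing to do with cpu.v; the point is only that the testbench reaches the internal register count through the hierarchical name tb.U.count.

```verilog
// Hypothetical stand-in for a design with an internal register.
module counter (
    input  wire clk,
    input  wire reset,
    output wire tc        // terminal count, the only visible output
);
    reg [3:0] count;      // internal: not on any port

    always @(posedge clk)
        if (reset) count <= 4'd0;
        else       count <= count + 4'd1;

    assign tc = (count == 4'hF);
endmodule

module tb;
    reg  clk   = 1'b0;
    reg  reset = 1'b1;
    wire tc;

    counter U (.clk(clk), .reset(reset), .tc(tc));

    always #5 clk = ~clk;

    initial begin
        $display("| time | reset | count | tc |");
        // $monitor reprints the line whenever any listed signal changes;
        // tb.U.count is a hierarchical reference into the instance U.
        $monitor("| %4t |   %b   |   %h   | %b  |",
                 $time, reset, tb.U.count, tc);
        #12 reset = 1'b0;
        #60 $finish;
    end
endmodule
```

Only one $monitor can be active at a time, so listing all signals of interest in a single call, as in the example above, is the usual approach.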

Cheers
Ed


Posted: Fri Mar 16, 2012 9:46 pm
Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
I'll have to check that out. Not been doing much in the way of testbench though, except to force logic level on reset, and force clock on O2.

Tested all 256 Acc to Acc transfer opcodes ok. My macro file is becoming as valuable as the core itself!


Posted: Fri Mar 16, 2012 11:41 pm
Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
Are there any parties here, especially Bitwise & TT, that are willing to extend the support from the original 65Org16 core, i.e. IR[15:8] = 0000_0000 to the 65Org16.b I have almost finished?

I'm gonna try to sell this here. Here is what the 65ORG16.b core offers (almost fully tested, and always willing to test):
16 Accumulators [A thru P], although P Acc may change to Q in order to preserve the PHP, PLP processor status opcodes.
Do math (i.e. ADC/SBC/AND/ORA/EOR) on any one Accumulator and store the value in any of the other Acc's. All former addressing modes included.
Multiple shifts (i.e. 1 thru 15 shifts/rotates using ASL, ROL, LSR, ROR opcodes) in 2 cycles, defined by the upper 4 bits of these opcodes. However, these opcodes only work on Acc's A,B,C,D. All former addressing modes included.
Full Acc to Acc transfer ability in 2 cycles.
Just a few new instructions like TXY, TYX. Maybe a new index register called Z that will be like X, and maybe a new one called W that will be like Y.
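
As a rough picture of what a multi-bit ASL looks like in logic, here is a minimal sketch. It is not the core's barrel shifter: the module name asl_var is invented, the sketch shifts by exactly n bits, and it takes the carry as the last bit shifted out. How the real core maps the upper 4 opcode bits to a distance, and how it treats the carry, are assumptions here.

```verilog
// Hypothetical variable-distance ASL, purely illustrative.
// Assumptions: shift by exactly n bits; carry = last bit shifted out.
module asl_var (
    input  wire [15:0] a,       // value from the source Acc
    input  wire [3:0]  n,       // shift distance
    output wire [15:0] result,  // value for the destination Acc
    output wire        carry
);
    // One extra bit on top catches the last bit shifted out of bit 15.
    wire [16:0] tmp = {1'b0, a} << n;

    assign result = tmp[15:0];
    assign carry  = tmp[16];
endmodule

module asl_var_tb;
    reg  [15:0] a;
    reg  [3:0]  n;
    wire [15:0] result;
    wire        carry;

    asl_var dut (.a(a), .n(n), .result(result), .carry(carry));

    initial begin
        a = 16'h00A5; n = 4'd8;
        #1 $display("result=%h carry=%b", result, carry);
        // prints: result=a500 carry=0
        $finish;
    end
endmodule
```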


Posted: Sat Mar 17, 2012 12:16 am
Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
Although I haven't tested the latest 16 Acc .b core in the Devboard yet, I thought I would give it a run and find out the top speed.

Code:
Timing errors: 0  Score: 0  (Setup/Max: 0, Hold: 0)
 
 Constraints cover 957692 paths, 0 nets, and 7099 connections
 
 Design statistics:
    Minimum period:  10.495ns{1}   (Maximum frequency:  95.283MHz)

SWEEEEET! I think this is credit to Arlet's architecture.


Posted: Sat Mar 17, 2012 10:14 am
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
I was just playing around with the 8 bit core (with RDY, but no BCD mode). With the original, I was able to run at 100 MHz. With some small optimizations to the ALU flag handling, I could increase speed to 125 MHz. Pushing it harder with SmartXplorer, I got 133 MHz. It does require adding an extra pipeline stage to the SDRAM interface, otherwise it becomes a huge bottleneck. However, the extra pipeline stage provides some timing slack that can be used to optimize the SDRAM controller later, which could potentially result in a 2 cycle savings in most cases. I've not tested the SDRAM in this new configuration, so no guarantee that the highest speed actually works reliably. :)

The longest path involved the calculation of the Z flag, because it involves a wide OR over all the ALU result bits in "temp", and "temp" is a complex combinatorial expression. This can be avoided by performing the Z flag calculation on "OUT" instead. This is the registered version of the same value. The consequence is that the Z flag is calculated one cycle later, but this can be fixed by making Z a combinatorial output, rather than a registered output. This is easy to implement. First 'Z' needs to be changed from 'reg' to 'wire' in ALU.v:
Code:
wire Z;

And secondly, it needs to be based on OUT rather than temp:
Code:
assign Z = ~|OUT;

The 2nd longest path involved the calculation of the V flag. We can do something similar there. First make registered versions of AI[7] and BI[7] (called AI7 and BI7 here), change V from 'reg' to 'wire', and add the assignment:
Code:
assign V = AI7 ^ BI7 ^ CO ^ N;

Complete code can be downloaded here: alu.v
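
To see the two changes together in isolation, here is a stripped-down sketch. It is not the real ALU.v: it only adds, it drops everything else (logic ops, shifts, RDY), and deriving N from OUT[7] is an assumption of the sketch. The V expression works because N = AI7 ^ BI7 ^ (carry into bit 7), so AI7 ^ BI7 ^ N recovers the carry into bit 7, and XORing that with CO gives signed overflow.

```verilog
// Hypothetical add-only ALU fragment showing Z and V as combinatorial
// outputs computed from registered values.
module alu_flags_sketch (
    input  wire       clk,
    input  wire [7:0] AI,
    input  wire [7:0] BI,
    input  wire       CI,
    output reg  [7:0] OUT,
    output reg        CO,
    output wire       Z,
    output wire       N,
    output wire       V
);
    reg AI7, BI7;                      // registered copies of the sign bits
    wire [8:0] temp = AI + BI + CI;    // 9 bits so the carry out is kept

    always @(posedge clk) begin
        OUT <= temp[7:0];
        CO  <= temp[8];
        AI7 <= AI[7];
        BI7 <= BI[7];
    end

    assign N = OUT[7];                 // assumption: N taken from OUT
    assign Z = ~|OUT;                  // wide NOR on the registered result
    assign V = AI7 ^ BI7 ^ CO ^ N;     // carry into bit 7 XOR carry out
endmodule

module alu_flags_tb;
    reg        clk = 1'b0;
    reg  [7:0] AI, BI;
    reg        CI;
    wire [7:0] OUT;
    wire       CO, Z, N, V;

    alu_flags_sketch dut (
        .clk(clk), .AI(AI), .BI(BI), .CI(CI),
        .OUT(OUT), .CO(CO), .Z(Z), .N(N), .V(V)
    );

    always #5 clk = ~clk;

    initial begin
        AI = 8'h50; BI = 8'h50; CI = 1'b0;   // +80 + +80 overflows signed
        @(posedge clk); #1;
        $display("OUT=%h CO=%b Z=%b N=%b V=%b", OUT, CO, Z, N, V);
        // prints: OUT=a0 CO=0 Z=0 N=1 V=1
        $finish;
    end
endmodule
```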

