6502.org Forum  Projects  Code  Documents  Tools  Forum
All times are UTC




Posted: Mon Jun 12, 2017 11:27 am

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343
Hi

I am working on a 6502 core that I want to improve with respect to multiplication. I am thinking of adding these instructions:

SQA - square of A, result in A (MSB) and X (LSB)
SQX - square of X, result in A (MSB) and X (LSB)
SQR $ZP - square of Zero page, result in A (MSB) and X (LSB)
MXA - multiply A and X, result in A(MSB) and X(LSB)
MUA $ZP - multiply A with Zero Page, result in A(MSB) and X(LSB)

The squares of the 8-bit values will be held in a 512-byte table, while multiplication uses the classical identity a*b = ((a+b)/2)^2 - ((a-b)/2)^2 with lookups in the same table.
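The table-based multiply described above can be sketched in software as follows. This is only an illustration in Python, not the HDL of the core. It uses the usual integer form of the identity, with f(n) = floor(n^2/4), and it assumes a table indexed by the full 9-bit sum (0..510), which is larger than the 512-byte table of squares mentioned in the post. The names `QUARTER_SQUARES`, `mul8_quarter_square` and `split_ax` are hypothetical.

```python
# Quarter-square multiplication: a*b = f(a+b) - f(|a-b|), with f(n) = n*n // 4.
# a+b and a-b always have the same parity, so the rounding errors cancel.
# Assumption: the table covers every possible sum 0..510 (511 16-bit entries).
QUARTER_SQUARES = [(n * n) // 4 for n in range(511)]


def mul8_quarter_square(a, b):
    """Unsigned 8x8 -> 16-bit multiply using two table lookups and one subtract."""
    if not (0 <= a <= 255 and 0 <= b <= 255):
        raise ValueError("operands must be 8-bit unsigned values")
    return QUARTER_SQUARES[a + b] - QUARTER_SQUARES[abs(a - b)]


def split_ax(product):
    """Return (A, X) = (high byte, low byte), as in the proposed MXA result."""
    return (product >> 8) & 0xFF, product & 0xFF


if __name__ == "__main__":
    a_reg, x_reg = split_ax(mul8_quarter_square(255, 255))
    print(f"A=${a_reg:02X} X=${x_reg:02X}")
```

Squaring falls out of the same routine by passing the same value twice, since the difference term is then f(0) = 0.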

Any comments/suggestions?


Posted: Mon Jun 12, 2017 12:07 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
It feels like a good idea! I'm not sure I'd bother with the squaring, since multiplication gives you squaring with only one or two instructions.

Is this an FPGA core? I'm used to FPGAs which have heaps of free multipliers, so single-cycle multiply is really easy. But I think there are some FPGAs without a multiplier. How many clock ticks do you expect your table approach to take?


Posted: Mon Jun 12, 2017 1:26 pm

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
I echo BigEd's recommendation regarding the use of the multipliers built into current FPGA families that you should be considering as hosts for your 6502 core.

If you just want to make a multiplier, I would recommend considering a Booth multiplier. BigEd, myself, and others discussed these types of multipliers on this forum several years ago. A 2-bits-at-a-time Booth multiplier is not particularly large in terms of FPGA resources, and it would yield a full 16-bit signed result in 5 cycles. If you'd rather conserve resources, then you could construct a 1-bit-at-a-time Booth multiplier using the on-chip registers. A 1-bit-at-a-time Booth multiplier will require 9 total cycles to yield a 16-bit signed product. For Booth multipliers, you will need a single 8-bit temporary register in addition to the CPU registers you intend to use. Since the partial product additions are done using right shifts from lsb to msb, you will only need the 8-bit adder function you should already have in the core.

A small amount of additional logic is required to protect the sign bit of the product. Some modifications to the sign bit protection logic ought to suffice if you also intend to provide an unsigned multiply. For greatest flexibility, I would consider including both a multiply instruction for signed operands and another multiply instruction for unsigned operands. Finally, because the X register is used more often, I think you should consider using the {A,Y} register pair instead of the {A,X} register pair.
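As a rough software model of the 1-bit-at-a-time Booth scheme described above, the following Python sketch performs eight add/subtract-and-shift-right steps on an accumulator/multiplier pair. It is not taken from the Verilog repository mentioned in this post. The accumulator is modelled as an unbounded Python integer, which stands in for the sign-bit protection logic, and the mapping of the eight steps onto the nine cycles quoted above is not modelled. The names `to_signed` and `booth_mul8` are hypothetical.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def booth_mul8(multiplicand, multiplier):
    """Signed 8x8 -> 16-bit multiply, radix-2 Booth, returned as a 16-bit pattern."""
    m = to_signed(multiplicand, 8)
    acc = 0                 # high half of the product (signed, kept as a Python int)
    q = multiplier & 0xFF   # low half; starts out holding the multiplier
    q_prev = 0              # the extra bit to the right of q
    for _ in range(8):
        bit = q & 1
        if bit == 1 and q_prev == 0:
            acc -= m        # start of a run of ones: subtract the multiplicand
        elif bit == 0 and q_prev == 1:
            acc += m        # end of a run of ones: add the multiplicand
        # Arithmetic right shift of the combined acc:q:q_prev register.
        q_prev = bit
        q = ((acc & 1) << 7) | (q >> 1)
        acc >>= 1           # Python's >> on negative ints preserves the sign
    return ((acc << 8) | q) & 0xFFFF


if __name__ == "__main__":
    print(hex(booth_mul8(0xFF, 0x02)))  # -1 * 2 as a 16-bit pattern
```

In hardware the accumulator would need one guard bit beyond the 8-bit adder result, which is presumably what the sign protection logic provides.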

I have a repository on github with some (Verilog) code examples for 1-bit, 2-bit, and 4-bit at a time standalone Booth Multipliers that you could review if that's the route you choose to go.

Good luck with your project. There are many ways to skin a cat. Do not let comments on the forum, or previous projects, discourage your efforts. We are always interested in seeing new approaches to the implementation of 6502s and extensions to its basic instruction set.

_________________
Michael A.


Posted: Mon Jun 12, 2017 2:01 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Something to note, which I've never completely got my head around, is that signed multiply and unsigned multiply do have to be tackled differently. I assume it's the case that if you've got one, you can do the other, but I imagine that's a lot easier one way around than the other way around.


Posted: Mon Jun 12, 2017 6:21 pm

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
Quote:
Something to note, which I've never completely got my head around, is that signed multiply and unsigned multiply do have to be tackled differently. I assume it's the case that if you've got one, you can do the other, but I imagine that's a lot easier one way around than the other way around.

In Forth on the 6502, the only multiply primitives typically are unsigned. If you want to do signed numbers, that's done in secondaries, and the numbers' signs are recorded and then the numbers are processed in their absolute values, then the result is negated if appropriate. I did it this way too when I wrote a set of 7-digit decimal floating-point routines for the '02 in approximately 1988. I have not heard whether any hardware multipliers handle negative numbers directly. It would be a separate instruction.
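The sign-and-magnitude approach described above can be sketched like this, scaled down to 8-bit operands to match the rest of the thread. It is an illustration in Python rather than Forth, and `umul8` is only a stand-in for whatever unsigned multiply primitive is available; both function names are hypothetical.

```python
def umul8(a, b):
    """Stand-in for an unsigned 8x8 -> 16-bit multiply primitive."""
    return (a & 0xFF) * (b & 0xFF)


def smul8(a, b):
    """Signed 8x8 -> 16-bit multiply built on the unsigned primitive.

    a and b are 8-bit two's-complement bit patterns (0..255).
    The result is a 16-bit two's-complement bit pattern (0..65535).
    """
    # Record the sign of the result: negative if exactly one operand is negative.
    negative = ((a ^ b) & 0x80) != 0
    # Take absolute values; -128 becomes 128, which still fits in 8 unsigned bits.
    mag_a = (-a) & 0xFF if a & 0x80 else a & 0xFF
    mag_b = (-b) & 0xFF if b & 0x80 else b & 0xFF
    product = umul8(mag_a, mag_b)
    # Negate the unsigned product if the recorded sign says so.
    if negative:
        product = -product
    return product & 0xFFFF


if __name__ == "__main__":
    print(hex(smul8(0xFE, 0x03)))  # -2 * 3 as a 16-bit pattern
```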

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Posted: Tue Jun 13, 2017 7:37 pm

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343
BigEd wrote:
It feels like a good idea! I'm not sure I'd bother with the squaring, since multiplication gives you squaring with only one or two instructions.

Is this an FPGA core? I'm used to FPGAs which have heaps of free multipliers, so single-cycle multiply is really easy. But I think there are some FPGAs without a multiplier. How many clock ticks do you expect your table approach to take?


Oh, well, I made a nano-6502 implementation which I am trying to get as fast as possible. It works in simulation (Active-HDL) and compiles well under Lattice (for MachXO3L), but it is untested in a real chip at the moment. All instructions take 1 cycle and it "runs" at 100MHz, i.e. I still have some positive slack at this speed. Currently it has only 12 instructions and 3 addressing modes (#imm, $ZP and $ZP,X). At the moment it compiles to 155 LUTs with 1KB of internal memory for each core (one block).

The reason I want multiplication is to use it for neural networks (which need multiplication). If I finish it in under 200 LUTs I can get around 50 cores into a MachXO3 or 100 cores into a Lattice ECP5. At 100MHz (or slightly below), 100 cores give the aggregate processing speed of a single 10GHz core with this one-cycle implementation. But the multiplication is probably going to need a few cycles so as not to slow everything else down.

Has anyone else here worked with a multicore 6502?


Posted: Tue Jun 13, 2017 7:55 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
(You might be interested in the machines being discussed over at Arlet's challenge, which are intended to fit within 128 slices.)


Posted: Tue Jun 13, 2017 8:05 pm

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343
MichaelM wrote:
I echo BigEd's recommendation regarding the use of the multipliers built into current FPGA families that you should be considering as hosts for your 6502 core.

If you just want to make a multiplier, I would recommend considering a Booth multiplier. BigEd, myself, and others discussed these types of multipliers on this forum several years ago. A 2-bits-at-a-time Booth multiplier is not particularly large in terms of FPGA resources, and it would yield a full 16-bit signed result in 5 cycles. If you'd rather conserve resources, then you could construct a 1-bit-at-a-time Booth multiplier using the on-chip registers. A 1-bit-at-a-time Booth multiplier will require 9 total cycles to yield a 16-bit signed product. For Booth multipliers, you will need a single 8-bit temporary register in addition to the CPU registers you intend to use. Since the partial product additions are done using right shifts from lsb to msb, you will only need the 8-bit adder function you should already have in the core.

A small amount of additional logic is required to protect the sign bit of the product. Some modifications to the sign bit protection logic ought to suffice if you also intend to provide an unsigned multiply. For greatest flexibility, I would consider including both a multiply instruction for signed operands and another multiply instruction for unsigned operands. Finally, because the X register is used more often, I think you should consider using the {A,Y} register pair instead of the {A,X} register pair.

I have a repository on github with some (Verilog) code examples for 1-bit, 2-bit, and 4-bit at a time standalone Booth Multipliers that you could review if that's the route you choose to go.

Good luck with your project. There are many ways to skin a cat. Do not let comments on the forum, or previous projects, discourage your efforts. We are always interested in seeing new approaches to the implementation of 6502s and extensions to its basic instruction set.


Thanks!

These Booth multipliers look really interesting. I will look into them more and see what speed I get on a MachXO3. I will need to compare them with the other methods in terms of how many resources they need, as I want to have many cores running at the same time. Currently I can have 4 CPUs sharing the same table without congestion.

(My 6502 doesn't have a Y register)


Posted: Tue Jun 13, 2017 8:12 pm

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343
BigEd wrote:
(You might be interested in the machines being discussed over at Arlet's challenge, which are intended to fit within 128 slices.)


Aww, that's a hard one! With only 12 instructions, 3 address modes, 2 registers and 155 LUTs, I'm already above the limit. :shock:


Posted: Tue Jun 13, 2017 8:17 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
No, it's ok - several LUTs in a slice. Is it 2, or maybe 4?


Posted: Tue Jun 13, 2017 9:14 pm

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
I think that you need to normalize the metric. A Xilinx Spartan 6 has 4 6-input LUTs and 8 registers per slice. It is roughly equivalent to 4 Xilinx Spartan 3A slices, which have 2 4-input LUTs and 2 registers per slice. The MachXO2 is roughly equivalent to the Xilinx Spartan 3A.

_________________
Michael A.


Posted: Wed Jun 14, 2017 12:56 pm

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343
Ok, well, I need internal memory, which is excluded in that competition. I am also at 187 LUTs now, with 15 instructions (3 address modes for half of them), and expanding. It still reports a slack of +1ns for a 100MHz CPU clock, so hopefully I can include the last few instructions and keep it at that.

A real-world multi-CPU implementation is not going to run at 100MHz, though. Maybe with 4 copies, but eventually the placement will become less and less optimal for keeping short wires to the clock source and internal memory. I wonder what it would achieve in a 14nm ASIC implementation, apart from huge mask costs. :P

Edit: I added another 4 instructions and it bloated to 228 LUTs :( I still need to implement PHA/PLA and MUA. It's going to be hard to get it below 200.

Edit2: I added a second copy of the CPU and it got to 444 LUTs, so a net increase of 216 LUTs (the rest is glue logic). It still shows a slack of 1ns when running at 100MHz, so it might work.


Posted: Fri Jun 16, 2017 1:25 pm

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343
I finished implementing a 256-byte stack and managed to bloat the code to 256 LUTs (or 128 slices, I guess). I'll look into ways to reduce it, but it's really hard at this point (the non-stack instructions have been optimized and compressed a few times).

The 1 KB memory layout is as follows:

512 bytes = 256 words instruction memory
256 bytes data memory
256 bytes stack

The PC points to a word number (not a byte number) since all instructions are two bytes. This means that you can use your old assembler, but one-byte instructions need a NOP after them to pad them to two bytes. You can branch across 256 words (with wrap-around). I may change this to +/-128 bytes to keep compatibility with existing assemblers.
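The padding and word addressing described above might look like this in a small post-processing step after an ordinary assembler. This is a Python sketch under assumptions: the pad byte is taken to be the standard 6502 NOP opcode $EA (NOP is not in the instruction list further down, so the real core may simply ignore the second byte), and instruction memory is assumed to start at byte offset 0. The names `pad_to_words` and `word_to_byte_offset` are hypothetical.

```python
NOP = 0xEA  # standard 6502 NOP opcode; assumed here as the pad byte


def pad_to_words(instructions):
    """Pad each 1- or 2-byte instruction to exactly two bytes (one word)."""
    image = bytearray()
    for ins in instructions:
        if len(ins) not in (1, 2):
            raise ValueError("only 1- or 2-byte instructions are supported")
        image += bytes(ins)
        if len(ins) == 1:
            image.append(NOP)
    return bytes(image)


def word_to_byte_offset(pc):
    """Map the 8-bit word PC to a byte offset in the 512-byte instruction memory."""
    return (pc & 0xFF) * 2  # the PC wraps around after 256 words


if __name__ == "__main__":
    # CLC ; LDA #$05 ; TAX
    program = pad_to_words([[0x18], [0xA9, 0x05], [0xAA]])
    print(program.hex(" "))
    print(word_to_byte_offset(2))  # byte offset of the third instruction
```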

All instructions take 1 cycle, with the exception of BCC/BCS, which require 1 cycle if the branch is not taken and 2 cycles if it is taken.

The memory clock is twice the CPU clock, i.e. 200MHz at the moment. The core starts by fetching two bytes in the first CPU cycle, then it pre-processes the instruction (half a cycle) and decides whether to read or write ZP or the stack. The opcode is decoded in the second half of the CPU cycle. The next instruction starts immediately, as the next two bytes have been fetched in parallel with pre-processing/(read/write ZP)/decoding. The exception is a branch, which gets an extra NOP (since the next instruction has already been fetched). I could get it to pre-fetch the instruction at the branch target to prevent this (but not within 128 slices).
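The timing rule above (one cycle per instruction, plus one bubble for a taken branch) can be turned into a simple cycle estimate. This Python sketch only counts cycles for a given execution trace; it does not simulate the core, and the names `count_cycles` and `run_time_ns` are hypothetical.

```python
BRANCHES = ("BCC", "BCS")


def count_cycles(trace):
    """Count CPU cycles for a trace of (mnemonic, branch_taken) pairs."""
    total = 0
    for mnemonic, taken in trace:
        # A taken branch costs one extra cycle for the already-fetched instruction.
        total += 2 if mnemonic in BRANCHES and taken else 1
    return total


def run_time_ns(cycles, cpu_clock_mhz):
    """Convert a cycle count to nanoseconds at the given CPU clock."""
    return cycles * 1000.0 / cpu_clock_mhz


if __name__ == "__main__":
    trace = [("LDA", False), ("ADC", False), ("BCC", True),
             ("ADC", False), ("BCC", False)]
    cycles = count_cycles(trace)
    print(cycles, run_time_ns(cycles, 100))
```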

This is how it looks while running some test instructions (I am still debugging, so bear with me):
[Attachment: N6502-simulation.png]


The N6502 supports these instructions:
    // LDA #nn is $A9, %10101001
    // LDA $ZP is $A5, %10100101
    // LDA $ZP,X is $B5, %10110101
    // LDX #nn is $A2, %10100010
    // LDX $ZP is $A6, %10100110
    // STA $ZP is $85, %10000101
    // STA $ZP,X is $95, %10010101
    // STX $ZP is $86, %10000110
    // BCC $xx is $90, %10010000
    // BCS $xx is $B0, %10110000
    // ADC #nn is $69, %01101001
    // ADC $ZP is $65, %01100101
    // ADC $ZP,X is $75, %01110101
    // SBC #nn is $E9, %11101001
    // SBC $ZP is $E5, %11100101
    // SBC $ZP,X is $F5, %11110101
    // CMP #nn is $C9, %11001001
    // CMP $ZP is $C5, %11000101
    // AND #nn is $29, %00101001
    // AND $ZP is $25, %00100101
    // AND $ZP,X is $35, %00110101
    // ORA #nn is $09, %00001001
    // ORA $ZP is $05, %00000101
    // ORA $ZP,X is $15, %00010101
    // CLC is $18, %00011000
    // SEC is $38, %00111000
    // TXA is $8A, %10001010
    // TAX is $AA, %10101010
    // ROL A is $2A, %00101010
    // ROR A is $6A, %01101010
    // ASL A is $0A, %00001010
    // LSR A is $4A, %01001010
    // PHA is $48, %01001000
    // PLA is $68, %01101000

Comments?


Posted: Fri Jun 16, 2017 2:24 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Nice work!


Posted: Fri Jun 16, 2017 3:31 pm

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343
Thanks!

I will have to test it in a real device and see how stable it is at different frequencies before I can decide what CPU clock it can handle. The timing analysis is all over the place and depends more on how the IPP module is routed than on the actual clock signal. I can get it "stable" at anything between 90MHz and 250MHz depending on that! :lol:

Edit:
Wow! It did not run very well at a 200MHz CPU frequency, but at 150MHz it seems to be doing fine! That was a surprise. I changed the synthesis tool to Synplify Pro, which made the design smaller. That made room for some much-needed extra instructions:

    // CPX #nn is $E0, %11100000
    // CPX $ZP is $E4, %11100100
    // DEX is $CA, %11001010
    // TSX is $BA, %10111010
    // TXS is $9A, %10011010

With these it compiles to 251 LUTs or 126 slices.

