6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Sat Oct 05, 2024 4:29 pm

All times are UTC




Post new topic Reply to topic  [ 22 posts ]  Go to page Previous  1, 2
Author Message
PostPosted: Thu Dec 05, 2019 12:03 pm 
Online
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10949
Location: England
Yes, that's the table! And using BIT to test the input could well be a win - with the loop unrolled, it can be a bit immediate, if you have that instruction. But if you're doing an in-place x10, you can LSR the bignum's byte in-place: you'll be updating it from the accumulator when you're done, so no harm in modifying it. Just possibly copying it to zero page first would be a win.


Top
 Profile  
Reply with quote  
PostPosted: Thu Dec 05, 2019 12:14 pm 
Offline
User avatar

Joined: Sat Dec 01, 2018 1:53 pm
Posts: 727
Location: Tokyo, Japan
BigEd wrote:
And using BIT to test the input could well be a win - with the loop unrolled, it can be a bit immediate, if you have that instruction.

I don't. Remember I live in the mid-70s. (For me, the modern 6502s are the ones with ROR!)

Quote:
But if you're doing an in-place x10, you can LSR the bignum's byte in-place: you'll be updating it from the accumulator when you're done, so no harm in modifying it. Just possibly copying it to zero page first would be a win.

The buffers holding these numbers are allocated in the heap, so to use LSR in-place I'd need to use self-modifying code. (Which I'm not entirely ruling out, but it's definitely additional complexity because ROM etc.) But copying to the ZP and LSRing it there sounds like it would work great. (I actually suspect that, because all values will be in the heap, I'm going to have tons of zero page space available, so I'm definitely not worrying about that right now.)

_________________
Curt J. Sampson - github.com/0cjs


Top
 Profile  
Reply with quote  
PostPosted: Thu Dec 05, 2019 12:37 pm 
Offline

Joined: Tue Sep 03, 2002 12:58 pm
Posts: 325
I don't have the tools to test it here, and I am more than a little rusty with 6502 code. But maybe something like this:
Code:
   ldy size
   lda (num),y
   sta srcLow
   lda #0
   sta accHigh
   dey
byteLoop
   lda (num),y
   sta srcHigh
   lda accHigh
   sta accLow
   ldx #7
bitLoop
   lsr srcLow
   bcc bitIs0
   lda accLow
   adc mul10Low,x
   sta accLow
   lda accHigh
   adc mul10High,x
   sta accHigh
bitIs0
   dex
   bpl bitLoop
   lda accLow
   iny
   sta (num),y
   lda srcHIgh
   sta srcLow
   dey
   bne byteLoop

mul10Low   .byte 9, 19, 39, 79, 159, 63, 127, 255
mul10High   .byte 0, 0,  0,  0,  0,   1,  2,   4
srcLow      .byte 0
srcHigh      .byte 0
accLow      .byte 0
accHigh      .byte 0


Each byte that gets processed modifies two bytes of the result. But we need the original value of the next byte. So it keeps two bytes of the input in srcLow/srcHigh. accLow/accHigh is the accumulator used to build each byte of the result. At the end of processing each byte, accHigh becomes the initial value of accLow for the next byte. The table is 10*2^n - 1 to save a clc on each bit - we know that carry is set.

(edit) and I am clearly confused. It does not modify the next byte of the number yet - that's what accHigh gets copied to accLow is for. So there's no need for srcHigh. But copying the byte that we're working on does mean we don't need to pretend that we've got lsr (zp),y.


Last edited by John West on Thu Dec 05, 2019 12:45 pm, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Thu Dec 05, 2019 12:40 pm 
Online
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10949
Location: England
reusing the known-to-be-set carry is a nice trick!


Top
 Profile  
Reply with quote  
PostPosted: Thu Dec 05, 2019 1:11 pm 
Offline
User avatar

Joined: Sat Dec 01, 2018 1:53 pm
Posts: 727
Location: Tokyo, Japan
In case anybody gets enthusiastic enough about this to start assembling code, here are my test cases. (And do let me know if you see any coverage missing.)

Code:
@pytest.mark.parametrize('signed, value', [
    ( 0, [0x03]), (-1, [0xFD]),
    (-1, [0xF4]),                         # min 1 byte input
    (-1, [0x0C]),                         # max 1 byte input
    ( 0, [0x19]),                         # max 1 byte input (unsigned)
    (-1, [0xFF, 0xF3]),
    (-1, [0xF3, 0x34]),                   # min 2 byte input
    (-1, [0x0C, 0xCC]),                   # max 2 byte input
    ( 0, [0x19, 0x99]),                   # max 2 byte input (unsigned)
    (-1, [0xFF, 0x87, 0x65, 0x43, 0x21]), # -20 million and a bit
    (-1, [0x02, 0xab, 0xcd, 0xef, 0x10]), # 11 billion or so
    (-1, [0xF4] + [0x00]*254),            # max length, approaching min value
    (-1, [0x0B] + [0xFF]*254),            # max length, approaching max value
    ( 0, [0x18] + [0xFF]*254),            # max length, (unsigned)
])

The expected output is calculated for each test using standard Python 3 functions, along the lines of:
Code:
    decval   = int.from_bytes(value, byteorder='big', signed=signed)
    expected = (decval*10).to_bytes(len(value), byteorder='big', signed=signed)

These cases include both signed and unsigned multiplies (as explained earlier in this thread, the current routine doesn't know the difference, only the caller does), but I don't see any particular reason off-hand to support unsigned if that makes life more difficult.

Of course, if you don't already have a testing framework in which to plop these cases, you may find it easier just to grab 0cjs/8bitdev from GitHub and just replace my routine in src/bigint.a65 with your own. The test cases are in src/bigint_test.py, and anything you print() will be shown if the test case fails. (You can force failure with assert False.)

_________________
Curt J. Sampson - github.com/0cjs


Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 05, 2020 5:34 pm 
Offline
User avatar

Joined: Thu Jan 02, 2020 7:45 am
Posts: 3
Location: Columbus
thedrip wrote:
The most elegant (in my mind) but maybe not most efficient Multiply by 10 I've seen is to take your byte, shift Left (multiply by 2), store it as a temp, shift left twice more (multiply by 4), then add back in your saved temporary.

For multi byte, just apply the roll left from LSB to MSB


Possibly better: take your byte, make sure there's another copy somewhere in memory (free if you've just copied it into accumulator from elsewhere), shift left twice (so you have original value ×4), add the other copy (giving you original value ×5), shift left once more (yielding original value ×10). Instead of saving a temporary ×2 value, you simply access the original value twice.

But yeah, this still requires shifting the entire number, which is fine if it's a byte or two, but expensive on an arbitrarily big int. The other approach outlined a few posts above is probably best for bigints.

_________________
Retro 6k


Top
 Profile  
Reply with quote  
PostPosted: Mon Jan 06, 2020 7:03 am 
Offline
User avatar

Joined: Sat Dec 01, 2018 1:53 pm
Posts: 727
Location: Tokyo, Japan
vtk wrote:
Possibly better: take your byte, make sure there's another copy somewhere in memory (free if you've just copied it into accumulator from elsewhere), shift left twice (so you have original value ×4), add the other copy (giving you original value ×5), shift left once more (yielding original value ×10). Instead of saving a temporary ×2 value, you simply access the original value twice.

Unfortunately in my application I am working on the original number, rather than a copy of it, as this multiply is used while building the original number (from a string of ASCII digits) in the first place.

That said, this was a really worthwhile comment. Not only did it remind me that I should always continue to think about alternate ways of doing things (using copy ×2 ×2 add ×2 instead of ×2 copy ×2 ×2 add), but it also caused me to reexamine my buffer usage in the current routine and realize that I could incorporate the copy into the first multiply loop and delete the second loop that did the copy:

Code:
 bi_x10      ldy buf0len
-            ;   Multiply the buffer by 2
+            ;   Multiply the buffer by 2, also making a copy for later addition
             clc
 -           lda (buf0ptr),y
             rol
             sta (buf0ptr),y
-            dey
-            bne -
-            ;   Make a copy of the ×2 value for later addition
-            ldy buf0len
--           lda (buf0ptr),y
             sta (bufSptr),y
             dey
             bne -
             ;   Multiply buffer by 2 again, for ×4
             ldy buf0len
             ...

(The diff is perhaps not totally easy to read: after the sta (buf0ptr),y in the first loop, which stores the shifted byte back into the source buffer I add just sta (bufSptr),y to store that same byte in the scratch buffer, and then entirely delete very similar second loop.)

This reduces the code size from 55 to 48 bytes (7 fewer bytes, 13% improvement) and also reduces the execution time by 11% (1 byte 109→97 cycles, 2 bytes 200→178, 255 bytes 24741→21936 cycles). Since the multiply is the majority of the work in reading a bigint from an ASCII string, that translates into a 10% speedup in reading numbers, which is pretty sweet for such a simple change.

_________________
Curt J. Sampson - github.com/0cjs


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 22 posts ]  Go to page Previous  1, 2

All times are UTC


Who is online

Users browsing this forum: No registered users and 19 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: