6502.org Forum

All times are UTC




Post new topic Reply to topic  [ 22 posts ]  Go to page 1, 2  Next
 Post subject: bigint code for the 6502
PostPosted: Thu Nov 21, 2019 4:20 pm 

Joined: Sat Dec 01, 2018 1:53 pm
Posts: 727
Location: Tokyo, Japan
Looking at some of the integer routines here on 6502.org, it occurred to me that it shouldn't be too hard to do "bigint" (large and variable precision) arithmetic on the 6502. A multibyte add is just a CLC followed by ADCs on the bytes in low-to-high order, so stick it in a loop and voilà! (Though of course it's probably not quite that simple.)
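To show what I mean, here's the idea sketched in Python (the language my test suite uses); this is just a model of the CLC-plus-ADC loop, not the 6502 routine itself, and the `bi_add` name and LSB-first list format are made up for illustration:

```python
def bi_add(a, b):
    """Model of a multibyte add: CLC, then ADC each byte pair,
    LSB first, with the carry rippling upward. (Illustrative only.)"""
    assert len(a) == len(b)
    out, carry = [], 0              # CLC
    for x, y in zip(a, b):
        s = x + y + carry           # ADC
        out.append(s & 0xFF)
        carry = s >> 8
    return out, carry               # final carry = overflow out of the MSB

# $01FF + $0003 = $0202, bytes stored LSB first
assert bi_add([0xFF, 0x01], [0x03, 0x00]) == ([0x02, 0x02], 0)
```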

But though I'm a decent programmer and have a pretty good understanding of low-level machine operations, I'm pretty sure I'm terrible at writing good assembler code. So I was hoping to get some thoughts and hints on my attempts to play with this bigint idea on the 6502. (I also am thinking I'll port this to other 8-bit CPUs—8080, 6800, 6809, maybe even RCA 1802—at some point to get an idea of how those work, so more general comments related to doing this kind of thing in any assembly language are welcome.)

I've started with a routine to read in an ASCII hexadecimal number of arbitrary (but obviously <=255) length and turn it into a binary number. I'm currently storing the value in big-endian format because it seems to me that it will be more efficient on the 6502 to work down from the end when doing adds, subtracts, etc. because I can probably avoid some CMP instructions in the loops, though already I'm not totally sure how well that's working out for me.

The following code is also available in a GitHub repo as src/bigint.a65; the unit tests are in src/bigint_test.py. (The top-level `Test` script in that repo assembles the code using ASxxxx and runs the tests using the `py65` 6502 emulator; simply running `./Test src/bigint_test.py` in a Bash shell stands a reasonable chance of showing you passing tests, at least if you're on a Linux system with Python 3 and pip available. If that side of things is of particular interest to anyone, start a topic on it and ping me and I'll be happy to contribute.)

The two routines I present here are:
  • `bi_readhex`, which is mainly what I'm interested in feedback on. It seemed to start out well and then got bogged down with some edge cases when doing things like skipping leading zeros. I actually want to be reading ASCII strings representing positive and negative base 10 integers, but hex seemed a good first step because it avoids having to multiply things by ten.
  • `readascdigit` is a base-41 digit converter that I spent a fair amount of time optimizing; that was more to help me wrap my head around the techniques and tricks of efficiently setting up and re-using existing state instead of always explicitly resetting state to what I want it to be. (Thus the massive comments in that one.)

So, what are your thoughts on both this particular code and my general approach so far?

Code:
;   "bigint": variable precision integers

            .radix h

            .globl readascdigit
            .globl bi_readhex

            .area ZP  (abs,pag)
            .org 0
            .area   _CODE

;   Convert ASCII character in A to binary digit.
;   N flag clear on success, set on error.
;   This reads A-Z[\]^_ and a-z{|}~DEL as 10, 11...40
;   The caller must check the result if using a base less than 41.
readascdigit:
            ;   This routine has extra explanation for 6502 beginners.
            cmp #'9+1
            bcs 5$              ; >'9', convert letter
            ;   At this point we know that the carry is clear, better thought
            ;   of as the borrow being set. Rather than use an extra instr
            ;   to set the carry, we instead subtract 1 from the subtrahend
            ;   because the set borrow will also be subtracted from the result.
            sbc #'0-1           ; convert digit; if <'0', N flag set for error
            ;   Normally for an unsigned comparison we'd check to see if the
            ;   carry is clear, i.e., the borrow was used, to determine
            ;   whether our result was negative. We can't do this here to see
            ;   if our char was < '0' because of the optimization above. But
            ;   it's safe to check the N flag because we know from the check at
            ;   the start that the char was ≤ $39 ('9'), so the result is
            ;   always between $00-$30=-$30 ($D0 in two's complement) and
            ;   $39-$30=$09: valid digits '0'-'9' give $00-$09 with N clear,
            ;   and anything below '0' gives a negative result with N set.
            ;   Since the N flag is our error code, we need not even BMI;
            ;   just let the N flag pass through.
            rts                 ; N clear, no error
5$:
            and #$%11011111     ; clear bit 5 to convert to upper-case
            ;   Now the char is in one of three ranges:
            ;     $1A-$1F  from ':' through '?' (bit 5 now cleared)
            ;     $40-$5F  '@'/'`', letters, and the punctuation after
            ;              them ([\]^_ and {|}~DEL)
            ;     ≥ $80    chars above DEL (bit 7 still set)
            ;   We check the N flag here to see if the high bit is set, which
            ;   means the character is invalid. We have to do the AND first
            ;   because the N bit from the CMP test is based on the result
            ;   of the CMP subtraction.
            bmi 9$              ; high bit set, error
            ;   We subtract 'A'-$0A to bring 'A'-'_' down to the $0A-$28 range.
            ;   No SEC is needed before this SBC; we branched here because the
            ;   carry was already set and AND does not affect the carry.
            sbc #'A-0A
            ;   Values less than $0A are invalid, so error out on those.
            cmp #0A             ; if <$0A, set N flag to indicate error
            ;   The result of SBC is signed, so it may be, e.g., -$01 = $FF.
            ;   However, CMP does only unsigned comparisons so the carry flag
            ;   would be set (indicating ≥) rather than clear (indicating <)
            ;   for those "negative" results. But since we know our value in A
            ;   is in range $00-$28, the most negative possible result produced
            ;   by the SBC is -$28 = $D8, so we can check the N flag instead.
            ;   Since the N flag is our error code, we need not even BMI.
9$:         rts


;   Bigints are stored in big-endian format on 6502: a length byte
;   followed by the integer in two's complement format.
;
;   We use big-endian format, despite being the opposite of native
;   6502 order, because we generally want to loop from the LSB to the
;   MSB, and looping towards zero means we can start with the
;   length and decrement, comparing with zero, saving the CMP we'd
;   need if counting up to the length.

            .area ZP  (abs,pag)
inbuf:      .ds 2               ; pointer to text input buffer
outbuf:     .ds 2               ; pointer to len byte followed by data
            .area   _CODE

;   Read the ASCII hex two's complement representation of an integer
;   and convert it to a bigint. The first bit is the sign, i.e.,
;   `FF00` will be read as -256 decimal; a minus prefix is not
;   allowed.
;
;   A register: inbuf length; 1 <= A <= 255
;        inbuf: input chars
;       outbuf: output buffer, must be length (A+1)/2.
;   All inputs and registers are destroyed.
;
bi_readhex: ldx #0              ; constant for indirect addressing
            ;   Start by skipping past any leading '0's.
            tay                 ; save length
0$:         lda [inbuf,x]
            cmp #'0
            bne 3$              ; no (more) leading zeros; start conversion
            dey                 ; reduce length
            beq 2$              ; but if it's last digit, convert the 0 anyway
            inc inbuf           ; move past leading 0
            bne 1$
            inc inbuf+1
1$:         clc
            bcc 0$
            ;   Conversion, two input digits at a time
2$:         iny
3$:         tya                 ; copy input length to A
            dey                 ; length -1 = last byte of input buffer
4$:         clc
            ror                 ; output length is 1/2 input length
            adc #0              ; plus 1 if input length is odd
            sta [outbuf,x]      ; store output length
            adc outbuf          ; set outbuf to end of buffer
            sta outbuf
            bcc 5$
            inc outbuf+1
5$:         lda [inbuf],y       ; first char for this byte of output buffer
            jsr readascdigit    ; convert ASCII digit to binary
            bmi 9$              ; bad digit, error
            cmp #10
            bcs 9$              ; >=16, error
            sta [outbuf,x]
            dey                 ; second char for this byte of output buffer
            bmi 8$              ; if input done, return success
            lda [inbuf],y
            jsr readascdigit    ; convert ASCII digit to binary
            bmi 9$              ; bad digit, error
            cmp #10
            bcs 9$              ; >=16, error
            asl                 ; shift up to high nybble
            asl
            asl
            asl
            clc
            adc [outbuf,x]
            sta [outbuf,x]
            dec outbuf
            ;   decrement outbuf+1 when outbuf wraps from $00 to $FF
            lda outbuf
            cmp #0FF
            bne 6$
            dec outbuf+1
6$:         dey
            cpy #0FF
            bne 5$              ; next chars
8$:         rts                 ; success; return
9$:         brk                 ; readhex error

(And yeah, I know that the 4$ label in `bi_readhex` isn't used; that was what the earlier BEQ 2$ was jumping to before I realized I was stuck with the "wrong" value in Y at that point.)

_________________
Curt J. Sampson - github.com/0cjs


PostPosted: Thu Nov 21, 2019 10:56 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10947
Location: England
A fine mission! Afraid I don't quite have the energy right now to review the code, but thanks for sharing. Aim for clarity, I'd say, and don't worry at this stage about performance. You already have a test suite: that's good!


PostPosted: Tue Dec 03, 2019 3:39 pm 

Joined: Sat Dec 01, 2018 1:53 pm
Posts: 727
Location: Tokyo, Japan
Ok, after the equivalent of several days of full-time programming over the last week and a half I've finally got what seems to be a working routine to read ASCII decimal integers to multi-precision binary. This is some of the more difficult programming I've ever done (in decades of programming in a lot of different languages), but I assume (or hope!) a good part of that is due to being at an early part of the learning curve for 6502 assembler in particular.

GitHub's web interface is probably the easiest way to view the code and the tests, and it's probably a good idea to read the detailed commit message, which includes a lot of my reasoning (excuses?) behind certain design decisions, to inform your comments. I've also attached the code, tests and commit message as files here in case that's easier for you.

(You should also be able to pretty easily run the tests, at least on a Linux system, by cloning the repo and running ./Test. This does need support for running 32-bit binaries because that's what's available for the assembler; see the README for details.)

Comments are very welcome. And don't hold back or worry about being too harsh or unsupportive; everybody in this forum always seems very nice, but my main aim here is to find out what I'm doing wrong or poorly (and I have thick skin anyway) so I wouldn't want someone to be holding back useful criticism just because they think I might feel bad about it.


Attachments:
commit-message.txt [3.63 KiB]
Downloaded 66 times
bigint.a65 [21.6 KiB]
Downloaded 65 times
bigint_test.py.txt [13.48 KiB]
Downloaded 75 times

_________________
Curt J. Sampson - github.com/0cjs
PostPosted: Tue Dec 03, 2019 8:52 pm 

Joined: Thu May 28, 2009 9:46 pm
Posts: 8411
Location: Midwestern USA
cjs wrote:
GitHub's web interface is probably the easiest way to view the code and the tests, and it's probably a good idea to read the detailed commit message, which includes a lot of my reasoning (excuses?) behind certain design decisions, to inform your comments.

GitHub is a royal pain to use for casual viewing. I'd like to look at your code, but can't open the .a65 file. What's in it?

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


PostPosted: Tue Dec 03, 2019 9:17 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10947
Location: England
I feel like an in-place multiply-by-10 on a bigint stored somewhere, referenced only by a pointer in zero page, might be an interesting challenge for the low level coders around here.

Thanks for publishing your code, and for commenting it so well!


PostPosted: Tue Dec 03, 2019 11:41 pm 

Joined: Tue Oct 02, 2018 4:22 am
Posts: 48
The most elegant (in my mind) but maybe not most efficient multiply-by-10 I've seen is to take your byte, shift left (multiply by 2), store it as a temp, shift left twice more (×4 more, for ×8 total), then add back in your saved temporary.

For multi-byte, just apply the shifts as ROLs from LSB to MSB.
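In Python terms the scheme is (a sketch of the idea just described, with an assumed fixed width standing in for the ROL chain; not actual 6502 code):

```python
def x10(n, nbytes=2):
    """x*10 = x*8 + x*2: double, save the result, double twice
    more, then add the saved value back in."""
    mask = (1 << (8 * nbytes)) - 1   # models the fixed byte width
    n2 = (n << 1) & mask             # one shift: x2 (the saved temp)
    n8 = (n2 << 2) & mask            # two more shifts: x8
    return (n8 + n2) & mask          # x8 + x2 = x10

assert x10(123) == 1230
```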


PostPosted: Wed Dec 04, 2019 3:04 am 

Joined: Sat Dec 01, 2018 1:53 pm
Posts: 727
Location: Tokyo, Japan
BigDumbDinosaur wrote:
GitHub is a royal pain to use for casual viewing. I'd like to look at your code, but can't open the .a65 file. What's in it?

For me, in both Chrome on Linux and Chrome and Edge on Windows 10, clicking on the links in my post above produces a reasonably readable rendered HTML page with the code in it. That page also has a "Raw" button just above the file contents that produces this raw text dump of the file which again renders fine in those browsers/OSes. Could you tell me what browser and OS you're using, what those URLs produce for you and what difficulties you're encountering? I'd like to find ways to help deal with those, though it seems odd that GitHub isn't producing readable, downloadable source code at those links on some platforms.

The .a65 file is just a text file containing assembler source code. You should be able to use File / Open... in your editor, IDE or even Notepad just to open and view it (I confirmed with Notepad on Windows 10), though you may need to change a "File types" or similar dropdown menu in the File / Open... dialogue box to ask it to show all files rather than just, e.g., *.txt files, as Notepad does. (On earlier versions of Windows you may need to use Wordpad instead of Notepad to handle the Unix LF-only line endings.) In Windows you can also right-click on the file and choose Open with... from the context menu, or just rename the file to end in .txt. If you can tell me what tools you usually use to view and edit assembly-language files I may be able to provide further help.

By the way, was the Python code in bigint_test.py.txt easily viewable for you? I had to rename that file before uploading because this site wouldn't let me attach a file named bigint_test.py, but perhaps the best strategy for me to make life convenient for Windows users (if that's what you're using) is to add a .txt extension to every file I upload that contains source code of any kind.

_________________
Curt J. Sampson - github.com/0cjs


PostPosted: Wed Dec 04, 2019 4:28 am 

Joined: Sat Dec 01, 2018 1:53 pm
Posts: 727
Location: Tokyo, Japan
thedrip wrote:
The most elegant (in my mind) but maybe not most efficient Multiply by 10 I've seen is to take your byte, shift Left (multiply by 2), store it as a temp, shift left twice more (multiply by 4), then add back in your saved temporary. For multi byte, just apply the roll left from LSB to MSB

That is, I believe, exactly what I'm doing in my bi_x10 routine, unless I'm misinterpreting what you're saying.

Speaking of which, my version is not exactly fast: uncommenting the line at the end of test_bi_x10() in bigint_test.py shows that it takes 24741 cycles to multiply a 255-byte integer. Not that bad, you might think, until you are using the routine 254 times on 128-byte binary integers (12422 cycles each) in the course of reading a 255-character decimal integer. That routine alone consumes 3.16 M of the 3.53 M cycles needed to read and convert that number, about 90% of the total runtime.

One obvious issue seems to be my use of indirect indexed addressing mode. That's not supported by ROL, so I need to use a LDA [ptr],Y / ROL / STA [ptr],Y sequence (6+2+6=14 cycles) for every byte. So what are my options for attacking that?

Some involve self-modifying code, so perhaps here is the place to mention that I have no particular moral objection to that (at least on small, single-threaded systems like this); my main concern is just the extra complexity of handling that, particularly if the code is in ROM and needs to be copied into RAM.

The other addressing modes I seem to have available (let me know if I'm missing any) are:
  • Indexed addressing: ROL addr,X, 7 cycles.
  • Indexed zero-page addressing: ROL zp,X, 6 cycles.
  • Absolute addressing: ROL addr, 6 cycles.
  • Zero-page addressing: ROL zp, 5 cycles.

My current loop overhead time is 6 cycles for the DEY / BNE.

The non-indexed modes would change this to modify a location operand every loop. What's the best case for this? Probably putting the loop code in the zero page so I can do a 2-cycle INC zp, the same cycle count as DEY, and I still need the BNE so that seems a wash. Or worse, given that I think it requires alignment of the buffer or adding CMPs and the like. Are there any tricks I'm missing here? (I hadn't thought before about putting self-modifying code in the zero page; I will remember that for potential use in other situations.)

The indexed modes offer a bit more hope; I think they don't change my loop overhead at all, right?

So for these I see two strategies: a fixed output buffer location or self-modifying code.

The self-modifying code version modifies the code outside the loop, so we can probably ignore the cost of that. That can get me down to 6 + 6 cycles if I put the buffer in the zero page, but that seems inadvisable for a whole host of reasons. So 6 + 7 cycles for absolute mode it would be; 13/20 is a 35% improvement. Not bad, and it makes buffer handling dead easy since they can be at arbitrary locations in memory. But I do then need to handle the issues with self-modifying code, particularly handling the copy from ROM to RAM where necessary, which feels expensive and difficult. But maybe it's not so bad. I'd imagine on many systems there are small corners in which to tuck it, especially in the zero page. And actually it could be a generic multi-precision ROL used by other multipliers, and perhaps even dividers, as well! Hmm!

A fixed buffer location requires copying the value into and out of the buffer, which at first blush might seem to negate the advantage of this. But actually it might not be so bad, since we might be able to "pull up" the copy to higher levels outside even more loops. In this case, bi_x10 runs over and over again on the same buffer for each ASCII char I read, and actually that buffer needs to be copied anyway for normalization, so that appears to be no extra cost at all to me. The main issue there then appears to be memory parsimony; dedicating 256 or even 128 bytes to "immovable buffer," even if that can be used for a lot of different things, is a pretty hefty cost in a 4 KB system.

Thoughts?

_________________
Curt J. Sampson - github.com/0cjs


PostPosted: Wed Dec 04, 2019 5:20 am 

Joined: Sat Dec 01, 2018 1:53 pm
Posts: 727
Location: Tokyo, Japan
Oops! I've just had it pointed out to me that INC zp is five cycles, not two. (I was reading from the opcode bytes column rather than the cycles column in the reference I was using.) And duh, that makes sense; you can't fetch an instruction and its operand address, then read and write a value, in fewer than four bus cycles.

So that just makes the non-indexed modes completely worse, which is good to know. It seems to me that this may be a general thing, then, now that I think about it: always use an index register to tweak offsets in a (tight) loop where that addressing mode is available.

And the BNE is usually less than I thought, too; I was going by the count in my assembly listing, which gives the worst case of 4 cycles when branching across a page boundary, but it's generally only 3 taken / 2 not taken. Still, I don't think that makes a difference to the analysis, except slightly worse percentage improvement when making the work in the loop faster.

I knew that branches take different amounts of time, depending, but I'd forgotten that. It hadn't occurred to me that not taken is always faster than taken, but of course: the 6502 isn't a pipelined CPU with branch prediction. (Ok, Drass, most 6502 CPUs are not pipelined. :-))

_________________
Curt J. Sampson - github.com/0cjs


PostPosted: Wed Dec 04, 2019 8:43 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10947
Location: England
Thinking about the 10x problem, without actually coding it, I felt that it should be possible to use two 256-byte tables - usually the fastest way! Each byte, when multiplied by ten, gives you two bytes: the LSB is a result byte and the MSB is to be held over to add into the next byte up.

It's possible that multiplying a negative number is a whole different thing, but in the case of initialising a bignum from a decimal string, I'd be inclined first to build the positive value from the digits and then invert the bignum if it's to be negative. One could, possibly, compress this into a single pass for the negative case, maybe with a different pair of tables.
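A quick Python model of the two-table idea (hypothetical names; the real thing would be two 256-byte tables in memory, indexed by the byte being multiplied):

```python
# Low byte of b*10, and the byte carried into the next position up.
LO10 = [(b * 10) & 0xFF for b in range(256)]
HI10 = [(b * 10) >> 8 for b in range(256)]

def x10_tables(value):
    """Multiply an LSB-first byte list by ten using the two tables:
    each byte yields a result byte plus a held-over MSB that is
    added into the next byte up."""
    out, carry = [], 0
    for b in value:
        t = LO10[b] + carry
        out.append(t & 0xFF)
        carry = HI10[b] + (t >> 8)
    out.append(carry)               # the value may grow by one byte
    return out

# 255 * 10 = 2550 = $09F6
assert x10_tables([0xFF]) == [0xF6, 0x09]
```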


PostPosted: Wed Dec 04, 2019 4:35 pm 

Joined: Sat Dec 01, 2018 1:53 pm
Posts: 727
Location: Tokyo, Japan
BigEd wrote:
Thinking about the 10x problem, without actually coding it, I felt that it should be possible to use two 256 byte tables - usually the fastest way! Each byte when multiplied by ten gives you two bytes: the LSB is a result byte and the MSB is to be held over to add into the next byte up.

One of my design goals is that I'd like to be able to have a version of this without too many extra constraints as part of an interpreter that runs in 4K of RAM with no ROM. (The 8080 and 6800 versions would be a direct competitor to Altair (MS) 4K BASIC. By 1977 there were a ton of 6502 computer variants, but were the interpreter a third party product for an Apple I, PET 2001 or Apple II, it would also want to fit into 4 KB RAM, though it might be able to use a few ROM routines.)

That said, though it wouldn't be something I'd be looking at before getting smaller versions working, yes, I should keep tables and the like in mind as a possibility for some systems where I might have extensive ROM space. So thanks for that.
_____________________________________________________

Before replying to your other comments, here's a caveat: I still seem to be having great difficulty wrapping my head around all the corners of multi-precision two's-complement arithmetic (and that's been a source of many of my problems). But I'll present some thoughts based on my experience writing the code linked/attached above. You can see the initial versions of the x10 multiply (bi_x10u) and digits-only decimal read (bi_read_udec) in this commit. (You can see the whole history of my work on this on the dev/cjs/191126/bi-decimal branch in the repo, though I warn you that much of it is a hot mess and may make your head hurt.)

Quote:
It's possible that multiplying a negative number is a whole different thing....

As it turns out, not at all, as far as I can tell.

I had originally thought that the x10 multiply was unsigned, as you can see from my early code where the routine was called bi_x10u. I realized days later, near the end of all this coding, that in classic 6502 style it actually turns out to be both signed and unsigned: you're free to look at it either way. The only trick (and it shares this in common with the 6502 ADC and SBC instructions) is that if you are treating it as signed and over/underflow the MSB (the only byte with ADC/SBC) by one bit you need to remember that the value is still correct as long as you sign-extend it with an additional byte. (For signed values with ADC/SBC, the overflow flag is used to indicate loss of the sign bit. With those instructions the value is still always correct when sign-extended with another byte because you can never overflow the value itself: the A register holds 8 bits but the input is never more than 7 bits in absolute value when considered signed.*)

I thought I'd explained this in the comments, but it turns out I hadn't, so I've added the following to the header comment for that function:
Code:
;   Like the 6502 ADC/SBC instructions, this does not care whether you are
;   multiplying a signed or unsigned value; you decide which way you want
;   to treat it. This allows you an extra bit of precision if you have
;   separately kept track of the sign of the input; you can safely overflow
;   the sign bit out of the MSB and sign-extend afterwards. For example,
;   input [$FF, $00] results in [$30, $30]. If you were considering your
;   input to be unsigned (65280₁₀) it overflowed and the result is
;   incorrect. But if you were considering your input to be signed
;   (-5320₁₀) only the sign bit "overflowed"; you know the output is
;   negative and can simply sign-extend to $FF3030 to have the
;   correct result.


Quote:
...but in the case of initialising a bignum from a decimal string, I'd be inclined first to build the positive value from the digits and then invert the bignum if it's to be negative. One could, possibly, compress this into a single pass for the negative case...

And that's just how I was first inclined, too. This was exactly the way I'd started my original implementation, thus bi_read_decdigits being originally named bi_read_udec and doing only adds. My intent was, exactly as you say, to invert the value during the normalization pass, where I have to copy the read-digits result to remove redundant leading sign bytes anyway. Unfortunately I ran into difficulty with this and so rewrote the digits reader to read positive or negative, making it longer along the way because now I needed separate code for the add loop and subtract loop.

Looking at it now, I think you're right: it should be possible always to do an unsigned digits read and do a straight two's-complement inversion (EOR #$FF and add one) during the normalization pass, followed by a sign extension byte if necessary. But I'm still not quite wrapping my head around the details of that. Also, I'm not sure there's actually any savings in code size, since the instruction savings in the digits reader routine might be lost in the additional complexity of the negation in the normalization copy pass. And it might even be slower, since per-byte calculations would now be done during normalization that previously had already been done for us by the separate SBC instruction in the read-digits pass.
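For what it's worth, the negation step itself models simply in Python (my own sketch of the EOR #$FF-and-add-one pass over a big-endian byte list, not code from the repo):

```python
def negate(value):
    """Two's-complement a big-endian byte list: complement every
    byte (EOR #$FF), then add one starting at the LSB, which is
    the *last* byte in big-endian order."""
    out = [b ^ 0xFF for b in value]      # one's complement
    carry = 1                            # the "+1"
    for i in range(len(out) - 1, -1, -1):
        s = out[i] + carry
        out[i] = s & 0xFF
        carry = s >> 8
        if not carry:                    # stop once the carry dies out
            break
    return out

# -($0100) = $FF00 in 16-bit two's complement
assert negate([0x01, 0x00]) == [0xFF, 0x00]
```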

__________
* Bruce Clark's "The Overflow (V) Flag Explained" probably explains exactly this, but I have to admit I didn't really understand it from my first reading of the article, or even after my first go in detail at doing 6502 arithmetic.

_________________
Curt J. Sampson - github.com/0cjs


PostPosted: Wed Dec 04, 2019 5:02 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10947
Location: England
(I've just remembered something which might be worth bearing in mind further down the road: I think I've noticed previously in bignum libraries the tactic of over-extending when reallocating a growing bignum. That is, if you need an extra byte of space, better to add 2 or 4 or even some fraction of the current size, to give some headroom and avoid continually reallocating.)
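For instance, a growth policy along these lines (in Python, purely illustrative; the quarter-of-current-size factor is an arbitrary choice):

```python
def grown_capacity(capacity, needed):
    """Over-extend a bignum's allocation: grow by a quarter of the
    current size (at least 2 bytes) until the need is met, rather
    than reallocating for every single extra byte."""
    while capacity < needed:
        capacity += max(2, capacity // 4)
    return capacity

# needing one byte more than 8 gets headroom for a couple more
assert grown_capacity(8, 9) == 10
```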


PostPosted: Wed Dec 04, 2019 5:58 pm 

Joined: Sat Dec 01, 2018 1:53 pm
Posts: 727
Location: Tokyo, Japan
BigEd wrote:
(I've just remembered something which might be worth bearing in mind further down the road: I think I've noticed previously in bignum libraries the tactic of over-extending when reallocating a growing bignum. That is, if you need an extra byte of space, better to add 2 or 4 or even some fraction of the current size, to give some headroom and avoid continually reallocating.)

Ah, yes. Definitely worthwhile when using number storage locations where values are expected to be mutated.

But here's yet more background I'd not provided: I'm intending to use this in an interpreter where the values are allocated on the heap and "variables" are actually bindings of names to pointers to the values. (E.g., x = -123456 allocates an object at $7F14 holding -123456, and the binding is ("x", $7F14).) This invokes the spectre of aliasing, so my current plan is to make numeric values in the heap immutable.* (I may later, however, provide a facility for special mutable values in the heap if that seems like it would be useful.)

That said, it would be nice if the bigint library were usable by other applications without this requirement, so I'm going to keep this in mind as I start to get into the memory allocation code and how the bigint library uses it.

______________
* Further explanation, in case the above is not clear: consider the code x = 1; y = x; inc(x). After running it, you'd typically expect y still to be 1, not become 2 even though x has changed. Thus, for the inc(x) a new object must be allocated for "2", leaving y to continue to reference the object "1". In the case where there is no reference to the object "1" after x has been reassigned (x = 1; inc(x)) the object can be garbage-collected.

_________________
Curt J. Sampson - github.com/0cjs


PostPosted: Thu Dec 05, 2019 9:06 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10947
Location: England
Back to the question of how to multiply a bigint by 10... If using a general multiplication algorithm, there'd be a lot of number-sized shifting. But I think for x10 there doesn't need to be any.

If you have a table of multiples of 10, one for each bit position in a byte, that's just 16 bytes of table. Or, better, it's two byte-sized tables of size 8.

For each byte of the big number, you shift it to expose each of the bits, and look up the corresponding two-byte multiple of 10 in the tables. You add that two-byte number into a two-byte accumulator, and after you've dealt with all 8 bits, you have one byte of result and one byte of carry which you use to initialise the accumulator for the next byte of the big number.

You could unroll this loop of 8 actions, which means no loop overhead and no need to index into the tables - absolute addresses will do, or even the loading of the 16 values as immediates. The table vanishes. And now you can choose not to implement or to add the high bytes of the first few multiples of 10, because they are known to be zero. (Indeed, for the final pair, the low byte is zero - extra win!)
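A Python sketch of the scheme (my illustration of the idea, not anyone's posted code; `MUL10` here is the combined 16-bit view of the two 8-entry tables):

```python
MUL10 = [(1 << i) * 10 for i in range(8)]   # 10, 20, 40, ..., 1280

def x10_byte(b, carry_in=0):
    """Multiply one byte of the big number by ten: examine each
    bit, add its multiple of ten into a 16-bit accumulator, and
    return (result byte, carry into the next byte up)."""
    acc = carry_in
    for i in range(8):
        if b & (1 << i):                    # bit shifted out is set
            acc += MUL10[i]                 # add its multiple of 10
    return acc & 0xFF, acc >> 8

# 255*10 = 2550 = $09F6
assert x10_byte(0xFF) == (0xF6, 0x09)
```

Applied byte by byte from LSB to MSB, feeding each carry into the next call, this multiplies the whole bignum by ten with no number-sized shifting at all.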


PostPosted: Thu Dec 05, 2019 11:29 am 

Joined: Sat Dec 01, 2018 1:53 pm
Posts: 727
Location: Tokyo, Japan
The "table" (which as you point out might not be implemented as an actual lookup table) you're talking about is this one, right?

Code:
bit           carry result
 0   $01×10 =        $0a
 1   $02×10 =        $14
 2   $04×10 =        $28
 3   $08×10 =        $50
 4   $10×10 =        $a0
 5   $20×10 =  $01   $40
 6   $40×10 =  $02   $80
 7   $80×10 =  $05   $00

That's freaking brilliant. Particularly cool, from my point of view, is that it also avoids the need for the temporary buffer I was using for the doubled value to add back in to the octupled value.

I'm definitely going to give this a go when I get back to this. (Trying out the AS Macroassembler and hacking out some ideas for memory allocation are currently the next two things on my list.)

(Though as I think about it, it's not sounding so easy to implement, since I need both to shift out the bits of the original value and add to the new value. This is where the 6800's second accumulator would come in quite handy. But I'll play around and see what tricks I can come up with. Maybe I can BIT the input value. And it's not like my current version isn't chock-full of loads and stores anyway.)

_________________
Curt J. Sampson - github.com/0cjs

