6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Fri Nov 22, 2024 6:52 pm

All times are UTC




Post new topic Reply to topic  [ 19 posts ]  Go to page Previous  1, 2
Author Message
PostPosted: Tue Jun 21, 2022 12:47 am 
Offline

Joined: Wed Jun 26, 2013 9:06 pm
Posts: 56
Proxy wrote:
nope, because the ADDs overlap on the upper byte of eachother, which effectly acts like a 8-bit wide carry, and the muliplication never sets the carry bit because the result is only ever 16-bits wide.


Is this for the 8-bit 6502 or the 16-bit 65816?


Top
 Profile  
Reply with quote  
PostPosted: Tue Jun 21, 2022 3:13 pm 
Offline
User avatar

Joined: Fri Aug 03, 2018 8:52 am
Posts: 746
Location: Germany
Aaendi wrote:
Proxy wrote:
nope, because the ADDs overlap on the upper byte of eachother, which effectly acts like a 8-bit wide carry, and the muliplication never sets the carry bit because the result is only ever 16-bits wide.


Is this for the 8-bit 6502 or the 16-bit 65816?

the code that was shown is for the 65816 with the Accumulator in 16-bit mode.
For the 6502/65C02 the concept is the same but since it's lacking native 16 bit operations you just have to split up the adding of the result to the output into 2 seperate "LDA ADC STA" sequences with the carry being cleared before the first ADC but then preserved for the next ADC. which makes it functionally identical to the 65816's single "CLC ADC STA" sequence.
So you only care about the carry within each byte of the result (because the result is larger than 1 byte), but not between seperate results (for example between (W*S) and (W*R) the carry is cleared) because they all overlap in the output.


Top
 Profile  
Reply with quote  
PostPosted: Fri Jun 24, 2022 1:36 am 
Offline

Joined: Wed Aug 21, 2019 6:10 pm
Posts: 217
Proxy wrote:
... but ultimately does that mean there is no feasible way of extending an DIV operation for larger numbers? a bit disappointing but division has always been a bit of an outlier. ...


The way you extend it is by using a fast multiply.

For one approach, suppose you are dividing A by B. A number equal to or smaller than A divided by a number equal to or bigger than B is guaranteed to have an equal or smaller result. Discard the remainder, it is still equal or smaller.

As an example in ordinary decimal integer division, it may work something like:

232/47 -> 230/50 -> 23/5 = 4 rem 3 -> 4.
47*4 = 188
232- 188 = 44, which is < 47, so 44/47 = 0 rem 44
So 230/47 is (4+0) remainder 42.

232/42 -> 230/50 -> 23/5 = 4 rem 4
42*4 = 168
232-168 = 64, which is > 42
64/42 = 1 rem 22
So 230/47 is (4+1) remainder 22

So that is a strategy for combining a smaller width fast hardware divide and a fast hardware multiply to shortcut the number of iterations required for binary long division in software.

With a hardware divide as good or better than an (unsigned) 16 bit by 8bit divide for 16 result and 8 bit remainder, AND a fast hardware multiply of the size of your hardware result and the size of your divisor (say, 16x16->32 for a 32bit/16bit divide routine), you might apply an approach of that class (AFAIU, there are several specific algorithms with various pros and cons of each).

Another strategy for using a fast hardware multiply is to have a table of the first 16 bits of the binary fractional inverses of the integers from 1-255, (rounded down, not rounded closest, because you want a first estimate of the result that is equal to or under the actual result), and multiply by the inverse of one more than the leading non-zero byte to generate the first guess of the result. This might be refined by normalizing the divisor to the first non-zero bit as the high bit and then the following eight bits, which is likely to cover more ground in the first guess, but requires managing the position of that leading bit, which complicates the process.


Top
 Profile  
Reply with quote  
PostPosted: Sat Jul 09, 2022 4:14 pm 
Offline
User avatar

Joined: Tue Aug 11, 2020 3:45 am
Posts: 311
Location: A magnetic field
To Proxy, on Thu Nov 11 2021, I suggested minimal FPU without division and mentioned Raphson and FDIV bug. To John West, I've previously suggested hardware multiply and software division and modulo. Processors, such as TMS9900, typically supported multiply and divide (or not) as a pair. Since the 1990s RISC era, processors usually have multiply and typically support divide and modulo (or not) as a pair. I am quite impressed that Proxy correctly implemented divide and modulo as FPGA state machine, although Proxy is encountering the same rationale as ARMv1 for omitting divide and modulo.

I wrongly assumed that ARMv1 omitted division because it is an ugly operation. I assume understanding of open sets, groups, commutative operations, finite fields, rings and float. An understanding of the Abelian grape joke is optional. Programmers who work with bytes will definitely understand integer overflow and will already be familiar with finite fields. Likewise, programmers should already be aware that 1/3 or 1/10 has rounding error when represented in binary.

If you don't care about integer overflow then finite binary addition, subtraction and multiplication are closed. However, division has the exception case of divide by zero and there is inevitable rounding error. No division instruction? No divide by zero exception. No problem. If there are conditions around a division instruction then it isn't a huge hardship to also have an iterative software loop. I wrongly thought that ARM gave a specious argument that division is infrequent. However, I have no immediate use for 6502 division outside one instance to implement floating point division for perspective correct texture mapping or division operator in BASIC, Forth, C or spreadsheet application. Analysis of Lisp, Smalltalk and Squeak finds similar infrequent use - although frequency in source code or bytecode does not equal frequency of execution. For example, scaling textures requires four divisions per triangle, ideally for a minimum of 180000 triangles per second.

With or without easy carry operations, addition, subtraction and multiplication are composable. Specifically, small 8*8 operations (or similar) can be combined into one large operation. This doesn't apply to divide or modulo - unless multiplication of the inverse is performed. Counter-intuitively, if you only have one multiplication operation and it is significantly larger than 8*8 then make both inputs signed. For example, 31*31 signed multiply hardware can perform 16*8 unsigned multiplication without bit shifting, overflow or branching. 8*8 is an awkward case because it is just at the threshold where look-up table is feasible. I believe that a common attitude is "Go large or go home." An exception is MC6809 which offers 8*8 unsigned multiply - but I strongly suspect they would have offered 16 bit multiply and divide (signed and unsigned) if the transistor budget allowed it.

Going back to ARMv1, I wrongly assumed the correctness of division, the lack of composable cases and the frequency of operations contributed to a decision to not allocate transistors for division or modulo in the first version (and all subsequent versions). Instead, it is preferable to make a smaller, cheaper device with higher yield. However, a finite state machine to implement division adversely affects interrupt response time. This was in the era when MC68000 ran at 12MHz and NS32032 was too slow for Acorn's preferred floppy disk controller. Intel's infamous FDIV bug occurred due to a failed attempt to perform division in base 4. I find this highly amusing because the public sector VIPER documents from 1987 state that correctness proofs of float point were out of scope. Meanwhile, the INMOS T800 documents from 1987 show how to convert provably correct Occam to provably correct microcode. It took Intel a further eight years - and the largest processor recall ever - to discover the merit of verification. Cowboys.

Nowadays, I believe that fast processors perform 64 bit integer division in base 8 for the purpose of keeping out-of-order execution below 32 cycles. This avoids allocating more that 160 shadow registers. While some may appreciate a scheme where 64 bit integer division effectively takes less than 1ns, it is far less impressive to know this is a fancy FIFO scheme with highly restricted throughput.

The speed and density of decimal representation is poor. The major advantage was that 4 bit hardware could operate on arbitrarily long sequences of decimal digits without rounding error. I like the simplicity of GARTHWILSON's iterative decimal subtraction. I have considered optimized decimal algorithms, such as subtracting 3x. In the worst case, division where a digit is 8 incurs the sequence subtract 3x, subtract 3x, subtract 3x, underflow, restore previous value, subtract 1x, subtract 1x, subtract 1x, underflow, restore previous value. However, this technique (and powers of two within one decimal digit) is fiddly while providing minimal advantage compared to the trivial technique.

Given a free choice, I would implement integer addition, subtraction and multiplication in hardware and integer division and modulo in software. If I don't care about modulo, I might be inclined to implement division by casting to float, performing float division and then casting back to integer. If the float mantissa is the same or larger than integer representation then precision is never lost and only one implementation is required. Some versions of BASIC worked in a similar manner. If I do care about modulo, it may be very useful to preserve floating point remainder and cast that to integer. However, it is very well known to the Forth programmers that one's compliment arithmetic or two's compliment arithmetic may affect the rounding of negative inputs.

Choice is influenced by available or anticipated hardware support for integer multiply or float divide. I previously suggested minimal FPU with bitwise support for division, Raphson and similar techniques. Bit at a time division allows low precision approximations. In particular, it allows floating point mantissa to never exceed integer precision. It also allows low latency interrupts. I believe this has many of the properties that Proxy seeks.

There is a mis-match of common operations. All hyperbolic functions can be ignored. Inverse square root is highly desirable. In the Rankin algorithms, square root is implemented as log, half, exp. Inverse square root can be implemented as log, half, negate, exp. Triangle and sawtooth functions are commonly implemented in GPU for texture mapping. I believe that a sawtooth function would also be useful for low precision sine and cosine operating in right angles rather than radians. In the trivial case, cosine may be 1-(x*x) and I've found this or cruder approximations to be sufficient for many graphics applications.

On Thu 11 Nov 2021, I expressed strong preference for squaring as a separate operation. This reduces ten 8*8 multiplications to two reads from 256*16 bit LUT and four 8*8 multiplications. If this is not possible or desirable then caching and ordering of operations will guarantee 100% cache hit on the redundant commutative multiplies; possibly across multiple invocations.

However, stateful multiplication interacts horribly with interrupt latency - with or without memory mapped hardware. In the case of software only multiplication, the call graph of subroutines suspected of being outside interrupt (typically, everything) may update a multiplication cache via a semaphore which does not release a wanted interrupt mask. Obviously, this adversely affects interrupt response time. In the case of write-only hardware latches, subroutines outside of interrupt may require duplicate writes to known memory locations or a semaphore to prevent latches being over-written. In the case of read/write octal latches (where the preferred implementation is LE74HC574K245N), subroutines suspected of being inside interrupt (typically, everything except main) must preserve latch contents before use. Obviously, any implementation of division which uses multiplication is similarly affected.

I like the idea of using address latches and ROM for 8*8 multiplication and other functions. In particular, this doesn't require vast quantities of address-space and retains 6502 compatibility. For 16 bit LUT, only one ROM is required because the bottom address line may be wired directly and one 8 bit ROM may perform double duty. I thought of this after studying AndersNielsen's video circuit. I made XR2801 16 bit ROM programmer with the primary purpose of programming 16*16 pixel fonts but also secondary purposes, such as Gigatron programming. More recently, I found that AndersNielsen's 8*8 pixel font system can be increased to 8*16 or 16*16 with no additional components. In particular, one 8 bit ROM can provide both halfs of 16*16 pixel font face. Similarly, one 8 bit ROM may provide both parts of 16 bit LUT. I believe this reduces your design to one read strobe and one 128KB ROM or larger, two write strobes and two latches for 8 bit inputs and one write strobe and one latch for additional functions. All strobes may be provided by one 74HC138 or similar which may be shared with other system functionality.

I don't trust the long-term storage of cheap flash ROM. I understand that the really cheap ones store up to five bits of data per cell and then use hideous finite field arithmetic to correct bit errors. Understandably, they can only be programmed in blocks due to the checksum calculations. To overcome the inevitable failure of flash ROM, you might want to notch out unreliable rows or columns of LUT. Unfortunately, performance will degrade unsteadily as the ROM fails and it does not prevent arithmetic errors during operation. However, it allows a minimal system where flash ROM and latches are optional.

_________________
Modules | Processors | Boards | Boxes | Beep, Beep! I'm a sheep!


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 19 posts ]  Go to page Previous  1, 2

All times are UTC


Who is online

Users browsing this forum: No registered users and 25 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: