PostPosted: Sun Jan 17, 2010 7:29 am 

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
I know at least one other person on this forum is interested in using multiple CPUs of the 65xx family in a single system. This application of multiprocessing poses a significant challenge, however: how do you synchronize the workloads of the multiple CPUs against each other? The lack of atomic read-modify-write instructions on the order of complexity of compare-and-swap or test-and-set makes implementing semaphores between processors somewhat tricky.

As it happens, tricky doesn't imply impossible. Leslie Lamport discovered an algorithm which accomplishes this task without relying on any customized hardware resources. In other words, you can synchronize any set of processors against some shared resource without the use of dedicated read-modify-write logic.

I thought I'd post the link for the benefit of those looking to explore building SMP systems around the 6502 or 65816 (or, perhaps, mixed CPU environments), both of which lack suitable hardware infrastructure for SMP work.

http://research.microsoft.com/en-us/um/ ... tml#bakery
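
For a flavor of it, here is a minimal sketch of the bakery lock in C (my own transcription, not code from the paper), assuming only that single-byte reads and writes to shared memory are atomic, which holds on a 65xx bus. The "volatile" qualifier merely keeps the compiler from optimizing the busy-wait loops away; it is not a memory barrier.

Code:
/* Lamport's bakery lock for NCPU participants, relying only on atomic
 * single-byte loads and stores to shared memory. */
#include <stdint.h>

#define NCPU 2

static volatile uint8_t entering[NCPU];  /* each slot written by one CPU only */
static volatile uint8_t number[NCPU];    /* ticket numbers; 0 = not interested */

void bakery_lock(int i)                  /* i = this CPU's index */
{
    uint8_t max = 0;

    entering[i] = 1;
    for (int j = 0; j < NCPU; j++)       /* take a ticket one higher than */
        if (number[j] > max)             /* any ticket currently held     */
            max = number[j];
    number[i] = (uint8_t)(max + 1);
    entering[i] = 0;

    for (int j = 0; j < NCPU; j++) {
        while (entering[j])              /* wait until j has picked its ticket */
            ;
        while (number[j] != 0 &&         /* wait while j holds a lower ticket  */
               (number[j] < number[i] || /* (ties broken by CPU index)         */
                (number[j] == number[i] && j < i)))
            ;
    }
}

void bakery_unlock(int i)
{
    number[i] = 0;                       /* give the ticket back */
}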


PostPosted: Sun Jan 17, 2010 10:26 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
Very interesting. I'd heard of this (Lamport's bakery) but not studied it. The Wikipedia article points us to a short but careful discussion.

Isn't it true, though, that the 65816 has a pin which signals a read-modify-write, to allow external hardware to offer an atomic operation? (In fact, André covers this ML output in his page on synchronisation.)


PostPosted: Sun Jan 17, 2010 4:00 pm

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8428
Location: Southern California
Quote:
Isn't it true though that the 65816 has a pin which signals a read-modify-write

WDC's 65c02 also has it, even on the DIP, on pin 5 which is NC on other brands.


PostPosted: Sun Jan 17, 2010 6:13 pm

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 990
Location: near Heidelberg, Germany
kc5tja wrote:
I know at least one other person on this forum is interested in using multiple CPUs of the 65xx family in a single system. This application of multiprocessing poses a significant challenge,


Thanks for an interesting link.

One important thing the algorithm does to help prevent race conditions (it is not strictly required, but it helps) is that each process/processor writes only to memory locations it "owns", which are only read by the other processes/processors.

The algorithm relies, however, on the fact that memory is synchronized across all processors, i.e. one processor always sees what another processor writes. In modern systems this condition is sometimes relaxed (e.g. memory pages that are cached only in the specific processor where the process owning that page runs). Not keeping memory synchronized all the time reduces bus traffic and increases overall throughput, which is the reason for this "optimization". On the other hand, explicit synchronization operations are then required, such as flushing a cache line to "push" the new value to the other processors. But that's stuff that had not been invented (I think) in '74, and it is not (yet?) relevant to the 6502 and its relatives.

Synchronizing multiple CPUs is still hard to implement correctly today.

In my 6502 coprocessor implementation I used a single "global" register that implemented a load-link/store-conditional optimistic locking scheme - which is a (simple) form of hardware locking support.

Does anyone actually know of a pessimistic locking circuit using WDC's "ML" line? I guess it would use (to synchronize a single address) a combination of the select line for CPU A, the select line for CPU B, plus the ML line of one CPU to assert the RDY pin on the other CPU, and vice versa. But somehow the "ownership" of the register needs to be determined, otherwise two CPUs doing a read-modify-write would block each other...
Has anyone seen this somewhere? It would sure be interesting.

André


PostPosted: Sun Jan 17, 2010 6:39 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
Not with the 65xx architecture, but absolutely with the 68000, i386, and PowerPC architectures. Still, here's how it'd work -- it's virtually the exact same algorithm.

VPA/VDA would be decoded to a "bus request" signal to the bus arbiter (on the 68000 series, the AS signal performs this task). RDY is held low until the arbiter grants bus access.

Then, for as long as VPA/VDA indicates a valid memory access *OR* the ML line for that CPU is asserted, the bus arbiter does not advance to the next CPU (regardless of bus request states). Thus, RDY remains asserted.

As soon as VPA/VDA indicates an internal operation AND the ML signal is negated, bus access is rescinded so that other CPUs have a chance at acquiring the bus.
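
Roughly, the arbiter's per-clock decision could be modeled like this in C (a behavioral sketch only; the struct, signal names, and round-robin policy are illustrative, not taken from any existing design):

Code:
#include <stdbool.h>

#define NCPU 2

struct cpu_pins {
    bool vda_vpa;      /* CPU performs a valid memory access this cycle      */
    bool ml;           /* CPU's memory-lock output is asserted (locked RMW)  */
    bool bus_request;  /* decoded from VDA/VPA: CPU wants the bus            */
};

static int granted = 0;    /* index of the CPU currently owning the bus */

/* Evaluated once per clock; returns the CPU whose RDY may be released. */
int arbitrate(const struct cpu_pins cpu[NCPU])
{
    /* The current owner keeps the bus while it is using it or locking it. */
    if (cpu[granted].vda_vpa || cpu[granted].ml)
        return granted;

    /* Otherwise rotate the grant to the next CPU that is requesting it. */
    for (int i = 1; i <= NCPU; i++) {
        int next = (granted + i) % NCPU;
        if (cpu[next].bus_request) {
            granted = next;
            break;
        }
    }
    return granted;
}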

Concerning load link/store conditional functionality, I'm not satisfied with its semantics. Because the link is a single, global register, and because you can receive an interrupt between establishing the link and attempting to conditionally store, you can end up with live-lock between your normal application and a frequently run ISR. Worse, even if an ISR doesn't attempt a load-link, another thread can, potentially resulting in live-lock between threads.

For this reason, I very much prefer compare-and-swap. The way I'd implement it in hardware is with a trigger register in I/O space, which when written to would assert the ML line for, say, 25 cycles (the exact value written isn't important). On a 65816, this is enough time to discretely implement a single compare-and-swap using non-blocking instructions:
Code:
lostLock_tryAgain:
; (re)take a snapshot of the shared pointer so that a retry starts
; from fresh data
LDA p
STA pOriginal

; create a new, private copy of the desired data structure
; and point newP at it.
; . . .

; Lock the bus for the next 22 CPU clocks.  The lock counter takes its
; input from A1-A8, hence the doubling of the cycle count in the
; address offset.
STA BUSLOCK+44

; compare and swap!
LDA p
CMP pOriginal
BNE lostLock_tryAgain   ; p changed underneath us -- rebuild and retry
LDA newP
STA p

I count that at 22 cycles, assuming absolute addressing and 16-bit accesses are used throughout. More time would be needed if our addressing modes were more complex. I can easily envision a trigger system where the number of cycles to lock the bus is taken from address bits A0-A7 (or A1-A8 in 65816 native mode), which means your bus-lock trigger "device" would occupy 256 (or 512) bytes of I/O space.

The hardware requirements for this are pretty low too; essentially it's just an 8-bit down-counter, whose output is ORed with ML.
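
As a behavioral sketch (not HDL), that trigger device could be modeled like so; the base address and names are made up purely for illustration:

Code:
/* Hypothetical model of the BUSLOCK trigger: the lock duration is taken
 * from address bits A1-A8 of the write, and the bus stays locked while
 * the 8-bit down-counter is non-zero. */
#include <stdint.h>
#include <stdbool.h>

#define BUSLOCK_BASE 0xDF00u        /* assumed I/O base; 512-byte window */

static uint8_t lock_counter;        /* the 8-bit down-counter */

/* Called whenever a CPU writes anywhere inside the BUSLOCK window. */
void buslock_write(uint16_t addr)
{
    lock_counter = (uint8_t)((addr - BUSLOCK_BASE) >> 1);   /* A1-A8 */
}

/* Evaluated once per CPU clock; the result is what gets ORed with ML. */
bool bus_locked(bool ml_asserted)
{
    bool locked = ml_asserted || (lock_counter != 0);
    if (lock_counter != 0)
        lock_counter--;             /* count down one cycle */
    return locked;
}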


PostPosted: Mon Jan 18, 2010 12:03 am

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 990
Location: near Heidelberg, Germany
kc5tja wrote:
VPA/VDA would be decoded to a "bus request" signal to the bus arbiter (on the 68000 series, the AS signal performs this task). RDY is held low until the arbiter grants bus access.


You're right, with multiple CPUs you most likely need a kind of bus arbiter. My coprocessor uses a (phi1 vs. phi2) time-shared memory, so I forgot about that.

Quote:
Concerning load link/store conditional functionality, I'm not satisfied with its semantics. Because the link is a single, global register, and because you can receive an interrupt between establishing the link and attempting to conditionally store, you can end up with live-lock between your normal application and a frequently run ISR. Worse, even if an ISR doesn't attempt a load-link, another thread can, potentially resulting in live-lock between threads.


AFAIK, on modern CPUs you can use any address for load-link/store-conditional; it is only my coprocessor board that restricts it to that single global register.

Load-link/store-conditional actually runs a busy-loop on contention. Therefore you should only use it in situations where there is little contention, i.e. the operation between LL and SC should be very short. On the other hand, this functionality is exactly what is used to implement semantics like compare-and-swap:

Code:
LL_LOOP:
     load-link ADDR
     compare ORIG_VALUE
     branch-if-not-equal LOST_LOCK
     store-conditional ADDR, NEW_VALUE
     if-store-failed-branch LL_LOOP
GOT_LOCK:
     ...
LOST_LOCK:
     ...


As far as I understand, "snooping" the bus for other CPUs writing to an address where the local CPU has done a load-link is less costly (in terms of hardware) than actually locking the bus. That is why load-link/store-conditional was invented.

Compared to pessimistic locking with ML, there is no need with LL/SC to make the central bus arbiter more complex. Instead each CPU has a "snooping" facility and a local address register holding the last (N) load-link address(es), against which writes by the other processors are compared.

OTOH with the, let's say "more flexible" memory models in newer (S)MP systems, I understand that more research is going into lockless algorithms (Read-Copy-Update, RCU?), at least on the higher level. I did not have a look into how those are implemented in hardware.

Quote:
For this reason, I very much prefer compare-and-swap. The way I'd implement it in hardware is with a trigger register in I/O space, which when written to would assert the ML line for, say, 25 cycles (the exact value written isn't important). On a 65816, this is enough time to discretely implement a single compare-and-swap using non-blocking instructions:


The problem I see with this approach is that the bus is completely locked for that number of cycles. This increases interrupt latency, for example if the interrupt is on a different CPU, which cannot react because it cannot get the bus (assuming it needs the bus for the interrupt routine).

I am thinking of building a blitter for my self-built 6502, and I have a bad feeling about simply blocking the CPU and transferring, say, 256 bytes in 512 cycles. I am definitely planning a mode where the 6502's unused bus cycles are used: watch SYNC to capture the opcode, add a counter to track which cycle of the opcode we are in, and use a PROM/GAL to enable a transfer in exactly the known unused cycles (e.g. the second cycle of a one-byte opcode).

I got bitten before with my self-built OS, where the interrupt latency actually ruled out higher RS-232 baud rates with the 6551; only the 16550 with its 16-byte buffer made it work well.

Anyway, I haven't seen anything using ML in actual hardware, and that is what counts for me. I'm really interested to hear about something like this; I'd extend my article on 6502 multiprocessing to cover it :-)

André


PostPosted: Mon Jan 18, 2010 1:04 am

Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
Can you use dual-port SRAM? I know it's extremely expensive and you can't truly read and write the same location simultaneously, but isn't the arbitration logic built into the dual-port SRAM itself?

edit: bad spelling 2x

_________________
65Org16:https://github.com/ElEctric-EyE/verilog-6502


Last edited by ElEctric_EyE on Mon Jan 18, 2010 2:52 am, edited 2 times in total.

PostPosted: Mon Jan 18, 2010 2:45 am

Joined: Sun Sep 15, 2002 10:42 pm
Posts: 214
fachat wrote:
...
OTOH with the, let's say "more flexible" memory models in newer (S)MP systems, I understand that more research is going into lockless algorithms (Read-Copy-Update, RCU?), at least on the higher level. I did not have a look into how those are implemented in hardware.
...
André


Depends on what you are referencing, but if you are referencing transactional memory, then I think that is dead. Sun was the main company pushing that, and it was slated to be included in the Rock processor, but Rock is now dead apparently.

Sources:

http://research.sun.com/spotlight/2007/ ... emory.html
http://bits.blogs.nytimes.com/2009/06/1 ... p-project/

Toshi


PostPosted: Wed Jan 20, 2010 9:03 pm

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 990
Location: near Heidelberg, Germany
TMorita wrote:
fachat wrote:
...
OTOH with the, let's say "more flexible" memory models in newer (S)MP systems, I understand that more research is going into lockless algorithms (Read-Copy-Update, RCU?), at least on the higher level. I did not have a look into how those are implemented in hardware.
...
André


Depends on what you are referencing, but if you are referencing transactional memory, then I think that is dead. Sun was the main company pushing that, and it was slated to be included in the Rock processor, but Rock is now dead apparently.



No, I was not referring to transactional memory, but rather to algorithms that require only very little hardware support.

For example, you can build a one-directional ring (round-robin) buffer with a write pointer and a read pointer, if the writer CPU writes the buffer and the write pointer and only reads the read pointer (to avoid overwriting), while the reader CPU reads the buffer, writes the read pointer, and only reads the write pointer.
This is an example of the kind of algorithm where each memory location is written by only a single CPU and read by the others (*).

On a 6502 these things might work out of the box, but on modern systems, with compilers reordering writes, with caches, and with non-uniform memory access (NUMA, i.e. each processor has its own "fast" RAM and reaches the other CPUs' memory only via a "slow" communication channel), it gets more complicated.

For the aforementioned round-robin buffer, for example, it has to be guaranteed that the writer "flushes" the data before it updates the write pointer, i.e. writes need to be ordered, at least at some synchronization point. Otherwise the other CPU could see the updated write pointer but not yet the new data in the actual buffer. And the update of the write pointer must be atomic (why this is so is left as an exercise to the reader). And so on.
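
A minimal sketch of such a buffer, written here in C11 rather than 6502 code (names and sizes are only illustrative): the release/acquire operations stand in for the "flush the data before updating the pointer" rule; on a plain 6502 with uncached shared RAM they would simply be ordinary byte reads and writes.

Code:
#include <stdatomic.h>
#include <stdint.h>
#include <stdbool.h>

#define BUF_SIZE 256                      /* power of two, fits a byte index */

static uint8_t buffer[BUF_SIZE];
static _Atomic uint8_t wr_idx;            /* written only by the writer CPU */
static _Atomic uint8_t rd_idx;            /* written only by the reader CPU */

bool buf_put(uint8_t value)               /* runs on the writer CPU */
{
    uint8_t w = atomic_load_explicit(&wr_idx, memory_order_relaxed);
    uint8_t r = atomic_load_explicit(&rd_idx, memory_order_acquire);
    if ((uint8_t)(w + 1) == r)
        return false;                     /* buffer full: don't overwrite */
    buffer[w] = value;                    /* 1. write the data ...          */
    atomic_store_explicit(&wr_idx, (uint8_t)(w + 1),
                          memory_order_release);  /* 2. ... then the pointer */
    return true;
}

bool buf_get(uint8_t *value)              /* runs on the reader CPU */
{
    uint8_t r = atomic_load_explicit(&rd_idx, memory_order_relaxed);
    uint8_t w = atomic_load_explicit(&wr_idx, memory_order_acquire);
    if (r == w)
        return false;                     /* buffer empty */
    *value = buffer[r];
    atomic_store_explicit(&rd_idx, (uint8_t)(r + 1), memory_order_release);
    return true;
}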

More modern programming languages (or language versions) are specified so that these side effects are well-defined; this specification is usually called the "memory model" of the language. Java, for example, had a major update of its memory model in version 5 to provide improved language semantics for multiprocessor systems, which enabled the many well-thought-out "concurrency-improved" classes in the java.util.concurrent package.

See for example [1] (on the pre-Java-5 memory model) and [2] about the problems one can encounter if concurrency is not understood.

Even C++ (and C?) in their modern versions have similar problems, as can be seen in [3].

In Java 5 and later, the "volatile" keyword is defined to prevent some of the reordering that would otherwise happen. For example, declaring the read and write pointers of the above example as volatile would provide the necessary ordering guarantees.

The "volatile" ordering guarantee is defined as part of the Java language (see the history of volatile in [4]). The Java compiler itself has to implement this guarantee on the platform it runs on. On a single-processor machine that's easy, but on a NUMA platform additional cache flushes have to be inserted, which is normally done with some extra CPU opcodes.

Similar things happen at a Java "synchronized" block, where the guarantees are even stronger. However, Java compilers are now intelligent enough that if a synchronized block never "leaks" to another processor, the synchronization is automatically removed.

Synchronization comes at a cost. The only question is how much....

André

[1] http://www.ibm.com/developerworks/java/ ... j-dcl.html
[2] http://www.cs.umd.edu/~pugh/java/memory ... cking.html
[3] http://www.ddj.com/184405772
[4] http://www.ddj.com/showArticle.jhtml?do ... 08l&pgno=9

(*) There are other such algorithms...

Edit: added [4].


PostPosted: Thu Jan 21, 2010 12:13 am

Joined: Sun Sep 15, 2002 10:42 pm
Posts: 214
fachat wrote:
...
For the aforementioned round-robin buffer, for example, it has to be guaranteed that the writer "flushes" the data before it updates the write pointer, i.e. writes need to be ordered, at least at some synchronization point. Otherwise the other CPU could see the updated write pointer but not yet the new data in the actual buffer. And the update of the write pointer must be atomic (why this is so is left as an exercise to the reader). And so on.
...


The operation you are trying to describe is architecturally called a "memory barrier" or a "memory fence". This is a good description of it:

http://en.wikipedia.org/wiki/Memory_barrier

Also, I believe the update of the write pointer does not need to be atomic, because there is only a single writer.

Toshi


PostPosted: Thu Jan 21, 2010 9:42 am

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 990
Location: near Heidelberg, Germany
Quote:
Also, I believe the update of the write pointer does not need to be atomic, because there is only a single writer.


Now think of a write pointer that is two bytes wide, with the write being non-atomic. The reader could see invalid data!
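
A tiny C sketch of the tearing problem, assuming the pointer is stored low byte first (names are only illustrative):

Code:
/* The writer updates a 16-bit pointer one byte at a time, so a reader
 * sampling between the two stores sees a value that never existed. */
#include <stdint.h>

volatile uint8_t wr_lo, wr_hi;     /* 16-bit write pointer, split into bytes */

void writer_advance_to(uint16_t p) /* e.g. going from 0x00FF to 0x0100 */
{
    wr_lo = (uint8_t)p;            /* after this store a reader sees 0x0000 */
    wr_hi = (uint8_t)(p >> 8);     /* only now is the full 0x0100 visible   */
}

uint16_t reader_sample(void)
{
    return (uint16_t)((wr_hi << 8) | wr_lo);   /* may catch a torn value */
}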

André


PostPosted: Thu Jan 21, 2010 3:49 pm

Joined: Fri Jun 27, 2003 8:12 am
Posts: 618
Location: Meadowbrook
The Galaga method involves a single common RAM which each CPU can access on a priority basis: the master has the highest priority and everyone else has to wait their turn. It uses Z80s and, hardware-wise, it is a pain in the (biblical beast of burden). Galaga uses this as a common RAM shared by all the CPUs. Perhaps it could be used as a communication RAM, with each processor having its own OS and RAM?

_________________
"My biggest dream in life? Building black plywood Habitrails"


PostPosted: Thu Jan 21, 2010 7:24 pm

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 990
Location: near Heidelberg, Germany
TMorita wrote:
The operation you are trying to describe is architecturally called a "memory barrier" or a "memory fence".


I know - I just didn't want to put too many buzzwords into it ;-)

For a further discussion about atomic operations in Java, see
http://www.ibm.com/developerworks/java/ ... -jtp11234/

One interesting part of that article is: "With the addition of the atomic variables classes in the java.util.concurrent.atomic package [in Java5], that has changed. The atomic variable classes all expose a compare-and-set primitive (similar to compare-and-swap), which is implemented using the fastest native construct available on the platform (compare-and-swap, load linked/store conditional, or, in the worst case, spin locks)"

This is (part of) what wikipedia says on spin locks: "Implementing spin locks correctly is difficult because one must take into account the possibility of simultaneous access to the lock to prevent race conditions. Generally this is only possible with special machine language instructions, such as atomic test-and-set operations, and cannot be easily implemented in high-level programming languages or those languages which don't support truly atomic operations.[1] On architectures without such operations, or if high-level language implementation is required, a non-atomic locking algorithm may be used, e.g. Peterson's algorithm. But note that such an implementation may require more memory than a spinlock, be slower to allow progress after unlocking, and may not be implementable in a high-level language if out-of-order execution is allowed"

Note the chicken-and-egg problem here: compare-and-swap refers to spin locks, and spin locks refer to "atomic test-and-set operations" as the efficient way to implement them. Luckily there are algorithms, like the bakery algorithm or Peterson's algorithm, that do not need them.

To get back to our common topic: in the absence of memory locks (except for ML on the 65816, which I still haven't seen used, and there isn't even a compare-and-swap opcode) and of load-linked/store-conditional (except for my coprocessor board), we have to resort to other algorithms for atomic operations and synchronization.

The Commodore IEEE-488 disk drives use a shared memory location to send controller commands from one processor to another (or from the main program to the interrupt routine in the serial IEC drives like the 1541, and even the 1541-derived 2031LP). The main processor writes a command to the command location; the controller processor executes it and writes the result back to the same location. Bit 7 is used to distinguish between a command (128) and a result (0), and each processor knows that it may write only when the bit has the correct value (the main processor writes when bit 7 = 0, and the controller processor writes when bit 7 = 1).
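
Sketched in C, that handshake looks roughly like this (constants, function names and the stub job executor are mine for illustration, not taken from the DOS ROMs):

Code:
/* Single-byte command/result handshake: bit 7 of the shared byte says
 * whose turn it is to write it. */
#include <stdint.h>

#define JOB_PENDING 0x80u            /* bit 7 set: command waiting; clear: result */

volatile uint8_t job;                /* the shared memory location */

static uint8_t execute_job(uint8_t cmd) { return cmd; }  /* stub executor */

/* Main processor: submit a command (0..0x7F) and wait for the result. */
uint8_t submit_command(uint8_t cmd)
{
    while (job & JOB_PENDING)        /* may only write while bit 7 is clear */
        ;
    job = (uint8_t)(cmd | JOB_PENDING);
    while (job & JOB_PENDING)        /* wait for the controller's answer */
        ;
    return job;                      /* result code, bit 7 clear */
}

/* Controller processor: poll for a command, execute it, post the result. */
void controller_poll(void)
{
    if (job & JOB_PENDING) {         /* may only write while bit 7 is set */
        uint8_t result = execute_job(job & 0x7Fu);
        job = (uint8_t)(result & 0x7Fu);
    }
}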

For other problems we might even have to resort to the bakery algorithm. (Note that the "entering" array and the "number" array must technically be of type byte, so that the accesses to them, which are only plain reads and writes, are atomic on the 6502.)
To create atomic operations in a general way, we have to use spin locks, e.g. using Peterson's algorithm as above.
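
For the two-CPU case, Peterson's algorithm is short enough to show. A C sketch, under the same assumptions as the bakery example earlier in the thread (atomic byte accesses to shared RAM, no reordering of writes):

Code:
/* Two-process Peterson lock; "volatile" only stops the compiler from
 * optimizing the spin loop away, it is not a memory barrier. */
#include <stdint.h>

static volatile uint8_t interested[2];  /* each flag written by one CPU only */
static volatile uint8_t turn;           /* written by both, but a single byte */

void peterson_lock(int me)              /* me = 0 or 1 */
{
    int other = 1 - me;

    interested[me] = 1;
    turn = (uint8_t)other;              /* politely yield the tie-break */
    while (interested[other] && turn == other)
        ;                               /* spin until it is our turn */
}

void peterson_unlock(int me)
{
    interested[me] = 0;
}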

Unfortunately, and that sometimes is the problem with research algorithms, there is one problem with the bakery algorithm: overflow. The algorithm does not define what happens when a "number" wraps around. I haven't looked into it in depth; if we're lucky only the execution order changes, but on a second look, the thread holding the highest number (255) might be starved by later-arriving threads which, according to the algorithm, get 255 + 1 = 0 and therefore a higher lock priority.


André


PostPosted: Thu Jan 21, 2010 10:58 pm

Joined: Sun Sep 15, 2002 10:42 pm
Posts: 214
fachat wrote:
Quote:
Also, I believe the update of the write pointer does not need to be atomic, because there is only a single writer.


Now think of a write pointer that is two bytes wide, with the write being non-atomic. The reader could see invalid data!

André


I agree the write needs to be atomic.

However, the entire update does not need to be atomic, e.g. it can be accomplished with separate read and write instructions, rather than an atomic read-modify-write instruction.

Toshi


PostPosted: Thu Jan 21, 2010 11:04 pm

Joined: Sun Sep 15, 2002 10:42 pm
Posts: 214
kc5tja wrote:
..
The lack of atomic read-modify-write instructions on the order of complexity of compare-and-swap or test-and-set makes implementing semaphores between processors somewhat tricky.
...


The WDC 65C02 has MLB, so why not just honor MLB?

Then you get proper atomic read/modify/write instructions.

Toshi

