
All times are UTC




[ 11 posts ]
 Post subject: Some 65816 questions
PostPosted: Wed Nov 10, 2010 5:02 pm 

Joined: Fri Dec 12, 2003 7:22 am
Posts: 259
Location: Heerlen, NL
Hello everyone,

I'm building a 65816 core in VHDL, and I've run into MVN and MVP. These instructions move blocks of data. From WDC's manual:
Code:
Block Move (xyc) addressing is used by the Block Move instructions. The second byte of the instruction
contains the high-order 8 bits of the destination address and the Y Index Register contains the low-
order 16 bits of the destination address.  The third byte of the instruction contains the high-order 8 bits
of the source address and the X Index Register contains the low-order bits of the source address.  The
C Accumulator contains one less than the number of bytes to move.  The second byte of the block
move instructions is also loaded into the Data Bank Register.

So where is the third byte stored then?

The manual also says that these instructions take 7 cycles. Moving 10 bytes means at least 20 extra cycles, so this number cannot include the actual block move. Farther up in the manual a cycle-by-cycle explanation is given, and no, I don't understand it (or I'm too tired after a day of work and a good, delicious meal).

What is PER used for? I (think I) know what it does, but what is it good for?

Thank you for your explanations!

_________________
Code:
    ___
   / __|__
  / /  |_/     Groetjes, Ruud
  \ \__|_\
   \___|       URL: www.baltissen.org



 Post subject:
PostPosted: Wed Nov 10, 2010 5:35 pm 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8428
Location: Southern California
The lower 16 bits of the source and destination go in X and Y respectively. The bank numbers go in the operands. Notes I have in my paper programming manual say the data bank register gets changed to the destination bank address, and that the C input (number of bytes to move, minus 1) must not be FFFF. (Maybe it's ok to do that if you really want to move the whole block though-- I don't know.)
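To make the operand layout concrete, here is a small Python model of how the bank operands combine with X and Y. It is a sketch, not WDC's logic, and the helper name `decode_block_move` is made up for illustration. It assumes the usual WDC convention that `MVN src,dst` in source code assembles to opcode, destination bank, source bank. The destination bank is therefore the first operand byte in memory, and it is the one copied into the DBR.

```python
MVP_OPCODE = 0x44
MVN_OPCODE = 0x54


def decode_block_move(code, x, y):
    """Return (mnemonic, 24-bit source, 24-bit destination, new DBR) for a 3-byte block move."""
    opcode, dest_bank, src_bank = code[0], code[1], code[2]
    names = {MVN_OPCODE: "MVN", MVP_OPCODE: "MVP"}
    if opcode not in names:
        raise ValueError(f"not a block move opcode: {opcode:#04x}")
    source = (src_bank << 16) | (x & 0xFFFF)         # bank from the operand, offset from X
    destination = (dest_bank << 16) | (y & 0xFFFF)   # bank from the operand, offset from Y
    return names[opcode], source, destination, dest_bank  # DBR gets the destination bank


# "MVN $01,$02" (source bank $01, destination bank $02) assembles to 54 02 01.
decoded = decode_block_move(bytes([0x54, 0x02, 0x01]), 0x1234, 0x5678)
```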

You can also use MVP or MVN to zero a section of RAM or fill it with a desired character, by having the source and destination ranges overlap by all but one byte. For example, put the desired value at the low end and use MVN to get it copied over and over to higher and higher addresses until the range is full. (The bank bytes would have to match.)
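Below is a minimal Python model of that fill trick, assuming native mode with 16-bit registers. `block_move` and `fill` are hypothetical helpers. They only mimic the register behaviour described in this thread: move one byte, step X and Y, decrement C, and stop when C wraps to $FFFF. The MVN variant seeds the low end and works upward. The MVP variant seeds the high end and works downward.

```python
def block_move(mem, opcode, src_bank, dst_bank, c, x, y):
    """Model MVN/MVP running to completion; returns the final (C, X, Y)."""
    step = 1 if opcode == "MVN" else -1      # MVN counts up, MVP counts down
    while True:
        mem[(dst_bank << 16) | y] = mem.get((src_bank << 16) | x, 0)
        x = (x + step) & 0xFFFF
        y = (y + step) & 0xFFFF
        c = (c - 1) & 0xFFFF
        if c == 0xFFFF:                      # done once C wraps past zero
            return c, x, y


def fill(mem, bank, start, end, value, opcode="MVN"):
    """Fill start..end (inclusive, at least 2 bytes) by overlapping source and destination."""
    count = end - start - 1                  # bytes still to copy after the seed, minus one
    if opcode == "MVN":
        mem[(bank << 16) | start] = value    # seed the low end, copy upward
        return block_move(mem, "MVN", bank, bank, count, start, start + 1)
    mem[(bank << 16) | end] = value          # seed the high end, copy downward
    return block_move(mem, "MVP", bank, bank, count, end, end - 1)


mem = {}
fill(mem, 0x00, 0x1000, 0x100F, 0xAA)
fill(mem, 0x00, 0x2000, 0x200F, 0x55, opcode="MVP")
```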

It does take some setup overhead, but after that it only takes 7 clocks per byte, and it is interruptible.

PER: The programming manual says, "Because PER's operand is a displacement relative to the current value of the program counter (as with the branch instructions), this instruction is helpful in writing self-relocatable code in which an address within the program (typically of a data area) must be accessed. The address pushed onto the stack will be the runtime address of the data area, regardless of where the program was loaded in memory; it may be pulled into a register, stored in an indirect pointer, or used on the stack with the stack relative indirect indexed addressing mode to access the data at that location." You can do other things with it too, like simulate a branch-to-subroutine instruction with a 16-bit relative address.
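Here is a Python sketch of the PER arithmetic, which may help show why it suits position-independent code. It assumes PER is the 3-byte instruction (opcode $62 plus a 16-bit displacement). It also assumes that, as with BRL, the displacement is added to the address of the following instruction, wrapping within 16 bits. The addresses and helper names are made up for the example. The same encoded displacement yields the data area's runtime address wherever the code is loaded.

```python
def per_displacement(per_addr, target):
    """Displacement an assembler would encode so PER at per_addr refers to target."""
    return (target - (per_addr + 3)) & 0xFFFF


def per_value(per_addr, disp):
    """Address PER pushes: next instruction's address plus the signed 16-bit displacement."""
    return (per_addr + 3 + disp) & 0xFFFF


def push_word(mem, s, value):
    """Native-mode 16-bit push: high byte at S, low byte at S-1, then S -= 2."""
    mem[s] = value >> 8
    mem[(s - 1) & 0xFFFF] = value & 0xFF
    return (s - 2) & 0xFFFF


# Assembled at $2000: PER at offset $10 refers to a data table at offset $40.
disp = per_displacement(0x2010, 0x2040)
# Loaded at $8000 instead, the same bytes push $8040, the table's new address.
stack = {}
s = push_word(stack, 0x01FF, per_value(0x8010, disp))
```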


 Post subject:
PostPosted: Thu Nov 11, 2010 7:58 am 

Joined: Fri Dec 12, 2003 7:22 am
Posts: 259
Location: Heerlen, NL
GARTHWILSON wrote:
... the data bank register gets changed to the destination bank address,

But the source has to be read first! That puzzles me. Using the DBR for both the source and destination addresses I could understand, but the documentation mentions only the source. I can understand that the DBR ends up holding the destination bank address, since nothing says the DBR is restored to its original value.
In VHDL I can program whatever I want, as long as the result emulates the real CPU. Which is better for the FPGA's resource usage: adding an extra register, or updating the DBR every time?

Quote:
and that the C input (number of bytes to move, minus 1) must not be FFFF.

One can copy 256 bytes using X by starting with X=0 and checking whether X=0 after the copy and the increment/decrement of X.
In this case: do the move, decrement A, and check whether it is $FFFF.
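Here is a tiny Python model of that termination rule. It sketches the behaviour described in this thread, not WDC's gate-level logic: move a byte, decrement C, and stop when C has wrapped to $FFFF. It shows why C holds the count minus one, and why a starting value of $FFFF simply means 65,536 bytes.

```python
def bytes_moved(c_start):
    """Count the bytes an MVN/MVP moves for a given starting C."""
    c = c_start & 0xFFFF
    moved = 0
    while True:
        moved += 1                  # the byte is moved first...
        c = (c - 1) & 0xFFFF        # ...then C is decremented
        if c == 0xFFFF:             # ...and the instruction ends when C wraps
            return moved
```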

Quote:
but after that it only takes 7 clocks per byte,

"Only," he says. So I did understand the table correctly. And
Quote:
and it is interruptable.
explains the need to read the opcode every time. This way it fits nicely in my VHDL: I only have to decrease the PC by 3 at the end of the instruction. The instruction decoder then first checks whether an IRQ or NMI is pending and, if not, restarts the move instruction.
I could do the move of a single byte in 2, maybe 3 cycles, but that would automatically mean that both IRQ and NMI are ignored during the whole move.
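Here is a rough Python model of the restart approach described above. It assumes each pass through MVN moves one byte and, unless C has reached $FFFF, leaves the PC pointing back at the MVN opcode. A pending interrupt can then be taken between bytes, and RTI resumes the move. The `run` loop and the `irq_before_pass` parameter are invented for the illustration, and cycle timing is not modelled.

```python
MVN = 0x54


def mvn_pass(cpu, mem):
    """Execute MVN once: fetch opcode and both bank bytes, move one byte."""
    pc = cpu["pc"]
    dst_bank, src_bank = mem[pc + 1], mem[pc + 2]
    cpu["dbr"] = dst_bank                           # destination bank ends up in DBR
    mem[(dst_bank << 16) | cpu["y"]] = mem.get((src_bank << 16) | cpu["x"], 0)
    cpu["x"] = (cpu["x"] + 1) & 0xFFFF
    cpu["y"] = (cpu["y"] + 1) & 0xFFFF
    cpu["c"] = (cpu["c"] - 1) & 0xFFFF
    cpu["pc"] = pc + 3
    if cpu["c"] != 0xFFFF:
        cpu["pc"] -= 3                              # rewind: next fetch is MVN again


def run(cpu, mem, irq_before_pass=None):
    """Step until the PC leaves the MVN; log where a pending IRQ would be taken."""
    passes, log = 0, []
    while mem[cpu["pc"]] == MVN:
        if passes == irq_before_pass:
            log.append(("IRQ taken, return address", cpu["pc"]))
        mvn_pass(cpu, mem)
        passes += 1
    return passes, log


cpu = {"pc": 0x0200, "c": 3, "x": 0x0000, "y": 0x0100, "dbr": 0}
mem = {0x0200: MVN, 0x0201: 0x01, 0x0202: 0x01, 0x0203: 0xDB}  # MVN $01,$01 then STP
for i in range(4):
    mem[0x010000 + i] = 0x10 + i
passes, log = run(cpu, mem, irq_before_pass=2)
```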

 Post subject:
PostPosted: Thu Nov 11, 2010 8:21 am 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8428
Location: Southern California
Quote:
But the source needs to be read first! And that puzzles me. Using the DBR for both the source and destination address I can understand. But the DOC mentions the source only. I can understand that the DBR ends up being loaded with the destination bank address because it isn't mentioned if the DBR is restored with the original value.

As I understand it, it does not matter what the DBR is when you start. The bank addresses are read from the operands, not the DBR. So the DBR is affected but never read. Why it was done this way is puzzling to me. I might have read about it in the programming manual, but at this point I don't remember, maybe since I don't have memory outside bank 0 anyway. I use MVN in my Forth words CMOVE and FILL , and ERASE and BLANK in turn use FILL . I use MVP in CMOVE> .
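Using MVN for CMOVE and MVP for CMOVE> matches the usual rule for overlapping moves. Copy upward (MVN) when the destination is below the source, and downward (MVP, with X and Y starting at the highest addresses) when it is above. A short Python sketch with made-up helper names shows the difference. The "smeared" result of the wrong direction is exactly what the fill trick exploits.

```python
def copy_up(mem, src, dst, n):
    """Lowest address first, like MVN (and Forth's CMOVE)."""
    for i in range(n):
        mem[dst + i] = mem[src + i]


def copy_down(mem, src, dst, n):
    """Highest address first, like MVP (and Forth's CMOVE>)."""
    for i in reversed(range(n)):
        mem[dst + i] = mem[src + i]


up = bytearray(b"ABCD..")
copy_up(up, 0, 1, 4)        # destination above source: the first byte gets smeared
down = bytearray(b"ABCD..")
copy_down(down, 0, 1, 4)    # same overlap, copied high to low: contents preserved
```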


 Post subject:
PostPosted: Thu Nov 11, 2010 8:39 am 

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
As far as I can see, the source bank is loaded into a temporary register, and the destination bank is loaded into the DBR. Thus, you can get by with only one temporary register. I believe the 6502 already requires a temporary register to achieve some of its timing characteristics -- it's probably just re-using that temporary.


 Post subject:
PostPosted: Wed Jul 06, 2011 6:55 pm 

Joined: Thu Mar 11, 2004 7:42 am
Posts: 362
GARTHWILSON wrote:
Notes I have in my paper programming manual say the data bank register gets changed to the destination bank address, and that the C input (number of bytes to move, minus 1) must not be FFFF. (Maybe it's ok to do that if you really want to move the whole block though-- I don't know.)


While I was testing 65C816 things, I tried C=$FFFF with MVN and MVP and it worked fine for both. 65536 bytes are moved, and it wraps at the bank boundary in both the source and destination banks, which was what I expected.

Even filling memory works, even though there's no need to use C > $FFFE since it wraps at the bank boundary.
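Here is a small Python model of that wrap-around, sketching the behaviour reported above. X and Y are 16-bit and wrap from $FFFF to $0000, while the bank bytes stay fixed, so the move never spills into the next bank. `mvn_addresses` is an invented helper.

```python
def mvn_addresses(src_bank, dst_bank, c, x, y):
    """Return the (source, destination) 24-bit addresses MVN touches, in order."""
    pairs = []
    while True:
        pairs.append(((src_bank << 16) | x, (dst_bank << 16) | y))
        x = (x + 1) & 0xFFFF        # X wraps inside the source bank
        y = (y + 1) & 0xFFFF        # Y wraps inside the destination bank
        c = (c - 1) & 0xFFFF
        if c == 0xFFFF:
            return pairs


short = mvn_addresses(0x02, 0x03, 2, 0xFFFE, 0x0010)
whole = mvn_addresses(0x02, 0x03, 0xFFFF, 0x8000, 0x8000)
```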


 Post subject: Block Fill with MVN/MVP
PostPosted: Thu Jul 14, 2011 10:00 pm 

Joined: Thu May 28, 2009 9:46 pm
Posts: 8147
Location: Midwestern USA
dclxvi wrote:
While I was testing 65C816 things, I tried C=$FFFF with MVN and MVP and it worked fine for both. 65536 bytes are moved, and it wraps at the bank boundary in both the source and destination banks, which was what I expected.

Even filling memory works, even though there's no need to use C > $FFFE since it wraps at the bank boundary.

For the benefit of anyone who is not entirely clear on how MVN can be used to fill a block of RAM, here's 65C816 code that fills RAM from $001000 to $001FFF with whatever value is in the LSB of the accumulator:
Code:
;FILL RAM FROM $001000-$001FFF WITH 8 BIT VALUE IN .A
;
saddr    =$1000                ;1st address to fill
eaddr    =$1fff                ;last address for fill
workbank =$00                  ;RAM bank in which to fill
;
         pha                   ;save for posterity
         php                   ;save register sizes
         sep #%00100000        ;select 8 bit accumulator
         sta saddr             ;write to 1st location
         rep #%00110000        ;select 16 bit registers
         lda #eaddr-saddr-1    ;total copy iterations
         ldx #saddr            ;base address
         ldy #saddr+1          ;start fill address
         mvn workbank,workbank ;copy next operation
         plp                   ;restore register sizes
         pla                   ;restore
;
         ...program continues...

When MVN has finished the above operation, .A will contain $FFFF, .X will contain $1FFF and .Y will contain $2000. The above is fully interruptible as long as the interrupt service routine preserves and restores the registers. As MVN runs, it increments .X and .Y and decrements the accumulator. The behavior of MVN and MVP is technically undefined if something disturbs any of those registers while the copy operation is in progress.

As MVN requires seven clock cycles to move one byte, the copy operation in the above code (4,095 bytes) would require 28,665 cycles to complete, plus the time required to execute the setup code and the ending PLP and PLA instructions. Anything you could hand-code would use more than twice the number of clocks to accomplish the same task.
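As a cross-check of the register values and cycle count above, here is a Python model of the routine. It is a sketch that assumes one byte per 7-cycle MVN pass and leaves out setup and teardown cycles. With C = eaddr-saddr-1 = $0FFE, MVN moves 4,095 bytes.

```python
SADDR, EADDR, VALUE = 0x1000, 0x1FFF, 0xE5


def mvn(mem, c, x, y):
    """MVN within one bank: returns final C, X, Y and the number of bytes moved."""
    moved = 0
    while True:
        mem[y] = mem[x]
        x, y = (x + 1) & 0xFFFF, (y + 1) & 0xFFFF
        c = (c - 1) & 0xFFFF
        moved += 1
        if c == 0xFFFF:
            return c, x, y, moved


mem = {SADDR: VALUE}                         # the STA seeds the first location
c, x, y, moved = mvn(mem, EADDR - SADDR - 1, SADDR, SADDR + 1)
cycles = 7 * moved                           # copy loop only, no setup/teardown
```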

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


PostPosted: Wed Mar 01, 2017 5:35 pm 

Joined: Mon Jan 07, 2013 2:42 pm
Posts: 576
Location: Just outside Berlin, Germany
BigDumbDinosaur wrote:
Anything you could hand-code would use more than twice the number of clocks to accomplish the same task.
Let me revive this thread with some actual numbers, because for Liara Forth I have to decide how to save the string from the input buffer to a safe memory area with the word S" ( "string" -- addr u ) when interpreted.

We arrive at this part of the code from PARSE, which has returned ( addr u ) of the string in the input buffer (starting at 00:7C00). Because of the way Liara is set up, u (TOS) is in the Y register and addr (NOS) is in 00,X, with X as the Data Stack Pointer (DSP). To set things up for MVN, we need to do all of this (based on http://6502.org/tutorials/65c816opcodes.html#6.6):
Code:
    tya        ; move length of string to A
    dec.a      ; decrease by one because that's how MVN works
    phx        ; save the DSP for later
    phy        ; save TOS for later
    ldy.dx 00  ; LDY $00,X - get the source from NOS
    tyx        ; put it in X
    ldy.d cp   ; LDY CP - get the Compiler Pointer address as destination
    mvn 0,0    ; do the actual work
    ply        ; get TOS back
    plx        ; get the DSP back
There is more code, of course: NOS needs to point to the new address and CP needs to be updated, but that's the same for both routines. If I've done my math right, we need 11 bytes (not counting the 3 bytes of MVN itself) and 33 cycles for setup and teardown, plus 7 cycles for each byte moved. So to move only one byte, the worst case for this scenario, we need 11 bytes and 40 cycles. Now let's try the same thing the good old way:
Code:
    phy        ; save TOS
    lda.dx 00  ; LDA $00,X - get the source address from NOS
    sta.d tmp1 ; STA TMP1 - put it in temporary storage
    dey        ; convert length of string in TOS to index
    .a8        ; switch to 8 bit A

loop:
    lda.diy tmp1 ; LDA (TMP1),Y - get one byte from source
    sta.diy cp ; STA (CP),Y - store one byte at destination
    dey        ; loop counter
    bpl loop

    .a16       ; return A to 16 bit
    ply        ; get TOS back
Setup and teardown are 9 bytes and 22 cycles, which sounds a lot better until you realize that you need up to 16 cycles (depending on the branch) per byte moved. To move even a single byte, we need 38 cycles. The moment we have to move more than one byte, we are better off with MVN/MVP. And if we know we have to move just one byte, we can do something more effective anyway.
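A quick Python check plugs in the post's own figures: 33 + 7n cycles for the MVN version and 22 + 16n for the loop, taking the 16-cycle worst case per byte as given. It confirms the crossover at two bytes. The function names are just for illustration.

```python
def mvn_cycles(n):
    return 33 + 7 * n        # setup/teardown plus 7 cycles per byte


def loop_cycles(n):
    return 22 + 16 * n       # setup/teardown plus up to 16 cycles per byte


crossover = next(n for n in range(1, 100) if mvn_cycles(n) < loop_cycles(n))
```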


PostPosted: Thu Apr 26, 2018 5:23 am 

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1927
Location: Sacramento, CA, USA
BigDumbDinosaur wrote:
... Anything you could hand-code would use more than twice the number of clocks to accomplish the same task.

If you're going to imply a challenge like that, why did you hard-code such tasty constants for your example?
Code:
;FILL RAM FROM $001000-$001FFF WITH 8 BIT VALUE IN .A
;
saddr    =$1000               ;1st address to fill
eaddr    =$1fff               ;last address for fill
workbank =$00                 ;RAM bank in which to fill
;
     php                      ;save register sizes
     sep  #%00100000          ;8-bit accumulator
     pha
     xba                      ;duplicate lower half in upper
     pla
     rep  #%00110000          ;16-bit acc, 16-bit index
     tsx
     txy                      ;save stack pointer
     ldx  #eaddr
     txs                      ;set stack ptr to last fill loc
     ldx  #(eaddr-saddr+1)/8  ;init counter
again:
     pha
     pha                      ;store eight bytes per iteration
     pha
     pha
     dex
     bne  again               ;21 * 511 + 20 cycles, right?
     tyx
     txs                      ;restore stack pointer
     plp                      ;restore register sizes
;
         ...program continues...

I haven't tested it, but I think you get the gist. It destroys your routine in raw performance (and maybe also a few bytes just below $1000 if an interrupt hits near the end :wink: ).
It should go without saying that your routine has the potential to be far more general-purpose, though ...

Mike B.
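Here is a rough Python model of the PHA loop above. It assumes native mode, where a 16-bit PHA takes 4 cycles, DEX takes 2, and BNE takes 3 when taken and 2 when not. It reproduces the 21 × 511 + 20 = 10,751 figure. The helper name is invented, and interrupts are not modelled. For comparison, the MVN copy alone is 4,095 × 7 = 28,665 cycles, so the push loop is roughly 2.7 times faster.

```python
def pha_fill(mem, saddr, eaddr, value):
    """Fill saddr..eaddr by pushing, four 16-bit PHAs per pass; returns (final S, cycles)."""
    word = (value << 8) | value           # XBA trick: same byte in both halves
    s = eaddr                             # TXS: stack pointer at the last fill location
    passes = (eaddr - saddr + 1) // 8
    cycles = 0
    for i in range(passes):
        for _ in range(4):                # PHA x4
            mem[s] = word >> 8            # high byte goes to S
            mem[s - 1] = word & 0xFF      # low byte to S-1
            s -= 2
        cycles += 4 * 4 + 2               # four PHAs plus DEX
        cycles += 3 if i < passes - 1 else 2   # BNE taken, or falling through
    return s, cycles


mem = {}
s, cycles = pha_fill(mem, 0x1000, 0x1FFF, 0x20)
```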


PostPosted: Thu Apr 26, 2018 7:06 am 
Offline
User avatar

Joined: Thu May 28, 2009 9:46 pm
Posts: 8147
Location: Midwestern USA
barrym95838 wrote:
BigDumbDinosaur wrote:
... Anything you could hand-code would use more than twice the number of clocks to accomplish the same task.

If you're going to imply a challenge like that, why did you hard-code such tasty constants for your example?


I'm aware of that technique, and had used it at one time in POC V1.0's firmware. It will run faster than the MVx instructions, and could be sped up a little more by increasing the number of pushes per pass through the loop.

That said, pushing more than one byte per loop iteration implies the size of the area to be filled is evenly divisible by the number of bytes pushed per iteration, which is only practical in some scenarios. Even if you only push one byte per iteration, you still lose generality because stack accesses are always forced to bank $00, so you can't use this technique to fill extended RAM.

Also, there is a booby-trap lurking in that code, which you already mentioned. :wink: Depending on the ISR involved, up to 13 bytes could be overwritten below SADDR when an interrupt hits. Moreover, if an NMI were to hit right after an IRQ's ISR had preserved the full machine state on the stack, the NMI's ISR could potentially write another 13 bytes to the stack, possibly trashing up to 26 bytes below SADDR. I won't even mention the possibility of nested IRQs...
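Here is a small Python sketch of that booby-trap, with invented names. An interrupt pushes onto wherever S happens to point, so once S is within that many bytes of SADDR, the excess lands below it. The 13-byte figure is taken from the post above. In native mode the hardware itself pushes 4 of those bytes (program bank, PC high, PC low, P), and the rest depend on the ISR.

```python
def clobbered_below(saddr, s, pushed):
    """Number of interrupt-stack bytes written below saddr when S = s at the interrupt."""
    lowest = s - pushed + 1               # pushes go to S, S-1, ..., S-pushed+1
    return min(pushed, max(0, saddr - lowest))


SADDR = 0x1000
```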

 Post subject: Re: Some 65816 questions
PostPosted: Thu Apr 26, 2018 7:50 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
Duff's device! It's rather normal in block-move routines to mop up the few uneven bytes before or after embarking on a loop which moves multiple bytes at once.
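Here is a minimal Python sketch of the mop-up idea. It is not Duff's device itself, which relies on C's switch fall-through to jump into the middle of an unrolled loop. Instead it handles the size modulo the unroll factor one byte at a time, then runs the fixed-size chunks. Names are invented for illustration.

```python
def fill_with_mop_up(mem, start, end, value, chunk=8):
    """Fill start..end inclusive: uneven bytes singly, then whole chunks."""
    leftover = (end - start + 1) % chunk
    addr = start
    for _ in range(leftover):            # mop up the uneven bytes first
        mem[addr] = value
        addr += 1
    passes = 0
    while addr <= end:                    # every remaining pass is a full chunk
        for k in range(chunk):            # stands in for an unrolled loop body
            mem[addr + k] = value
        addr += chunk
        passes += 1
    return leftover, passes


mem = {}
result = fill_with_mop_up(mem, 0x1000, 0x1012, 0x20)
```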

