
All times are UTC




PostPosted: Fri Nov 13, 2020 8:05 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10787
Location: England
The beeb816 project puts a fast 65816 into a BBC Micro. It brings several benefits even in 6502 mode: speed, and paged RAM, and shadowing of video RAM.

But it has 512k of fast RAM, so there's plenty unassigned, and it can be put into 65816 mode. The question is, how to motivate that. How can we show that it's worthwhile to write '816-specific code?

Ideas welcome: preferably relatively specific - we know that operating on 16 bit quantities should be more efficient, but how to demonstrate that in an application? Likewise, we know it's easy to use 24 bit pointers and access lots of memory with 16 bit indexing, but what kind of application will show that this is what is happening?

This is what I've thought of so far:
- using block moves to page memory into and out of the 6502 accessible spaces
- using block moves to paint and unpaint sprites and blobs in video memory
- using 16 bit operations to speed up 32 bit integer arithmetic
- using huge lookup tables to speed up arithmetic and trig/log functions
- using huge lookup tables for a fast Life
- using large memory for a large Life
- running a 65816 specific Basic, especially a BBC Basic with built in assembler (needs porting!)
- running Gordon's native BCPL compiler (which will surely need porting to an odd memory map)
- offering a large fast RAM disk (needs a filing system - difficult!)
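
As one concrete illustration of the lookup-table idea in the list above, here is a minimal sketch of quarter-square multiplication. The operand width, table size and function names are illustrative assumptions, not anything taken from beeb816; on the '816 the table would live in the fast RAM and be indexed with a 16-bit index register.

```python
# Quarter-square multiplication: a*b = q(a+b) - q(|a-b|), where q(n) = n*n // 4.
# Operand width and table size are illustrative assumptions.

OPERAND_BITS = 8
TABLE_SIZE = 2 * (1 << OPERAND_BITS) - 1   # sums of two operands range over 0..510

# One table of 16-bit entries, about 1 KB for 8-bit operands.
QUARTER_SQUARES = [(n * n) // 4 for n in range(TABLE_SIZE)]


def table_multiply(a: int, b: int) -> int:
    """Multiply two unsigned 8-bit values using two lookups and one subtract."""
    if not (0 <= a < 256 and 0 <= b < 256):
        raise ValueError("operands must fit in 8 bits")
    return QUARTER_SQUARES[a + b] - QUARTER_SQUARES[abs(a - b)]
```

The identity holds because a+b and a-b always have the same parity, so the two floor divisions lose the same remainder. Wider operands need a larger table, which is where the extra RAM would come in.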


PostPosted: Sat Nov 14, 2020 12:19 am 

Joined: Sun Apr 26, 2020 3:08 am
Posts: 357
I think a lot of the speed comes from just having the ability to use 16-bit registers, especially when using indexing instructions. On a 6502, one always has to set up at least the hi-byte of a zero-page pointer to get the 16-bit address one wants to collect data from.

And there is an even faster MOVE instruction on the 65816 (MVN) that is quite a bit faster than even moving 2-bytes (16-bits) at a time within a loop.

Depending how the BBC Micro displays its graphics, I saw a really neat trick on an Apple IIGS to display 200 palettes of 16 colors to give 3200 colors on Super Hi-res screen all at the same time. It used the PEA instruction to store each palette in the Palette area that affects the screen. It had to move 32 bytes in the time it takes the VBL of the monitor just to scan one line. Very fast.

Although memory access gains from using 16 bits versus 8 bits, the most significant speed increase would be with the slowest device, which should be the hard drive. Reading 16 bits at a time to read a file from the hard drive was the biggest speed increase I saw using a 65816. It is even faster if the device allows DMA addressing.

And one last mention of some of the instructions that not only save a lot of bytes over the 6502 way of programming, but make programs easier to understand. Some of my favorite instructions are XBA (which swaps hi and lo bytes of the 16-bit Accumulator), TRB/TSB (to set or clear specific bits), INA/DEA (increment or decrement Acc), and BRL (Branch Long).

And honorable mention goes to having a floating zero-page. Don't know if this works on the BBC Micro, but on an Apple IIGS, it allows any page in bank zero to be used as the zero-page. Very handy to have multiple zero-pages to use index addressing modes.


Just an FYI. You don't need a filing system to create a RAM Disk. But having a filing system to save and load programs from a disk would be handy. One wouldn't want to type in the RAM Disk driver every time.

And lastly, if the objective is to use the speed of the 65816, creating a BASIC (IMO) would not be the way to go. Learning to program using an assembler would be far more advantageous.


PostPosted: Sat Nov 14, 2020 12:38 am 

Joined: Sat Dec 01, 2018 1:53 pm
Posts: 727
Location: Tokyo, Japan
BigEd wrote:
- running a 65816 specific Basic, especially a BBC Basic with built in assembler (needs porting!)
- running Gordon's native BCPL compiler (which will surely need porting to an odd memory map)

My first thought on things like this is that it's only useful for large programs where you're willing to take the performance hit to avoid making them fit into 64K. As soon as you start using 24-bit pointers into your heap, you slow everything down drastically compared to using 16-bit pointers, regardless of whether it's a 6502 or 65816.

And to be clear: "64K" is address space for pointers, not actual memory size. You may well effectively use several times that number of bytes of memory (whether on 65816 or 6502).

As an example of this, consider a system with two 64 Kbyte heaps, one for variable-length objects (strings etc.) and another for fixed-length objects (ints, pointers, lists of both). Pointers into the two heaps are easily distinguished via the LSB so long as you align the variable-length-object heap to even bytes. (You can even get a couple more heaps if you are willing to align your allocations to 4 bytes.) If the heaps are GC'd, it would be reasonable to use Cheney's algorithm with the "halves" of the heap being separate 64K banks. (Nice trick: avoid bank-switch overhead by mapping reads to from-space but writes to to-space.) In this case you're quite effectively using 256 KB of memory for a heap totalling 128 KB, but still all with 16-bit pointers.
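
A rough sketch of the LSB-tagging idea, under stated assumptions: both heaps hand out even offsets and the low bit is used purely as a heap tag. The bank numbers and function names are hypothetical, chosen only for illustration.

```python
# 16-bit tagged pointers into two 64K heaps.
# Assumption: allocations are even-aligned, so bit 0 is free to act as a heap tag.

VAR_HEAP = 0     # variable-length objects (strings etc.)
FIXED_HEAP = 1   # fixed-length objects (ints, pointers, lists)

# Hypothetical bank assignment for each heap.
HEAP_BANK = {VAR_HEAP: 0x02, FIXED_HEAP: 0x03}


def make_pointer(heap: int, offset: int) -> int:
    """Pack a heap tag and an even 16-bit offset into one 16-bit pointer."""
    if heap not in HEAP_BANK:
        raise ValueError("unknown heap")
    if offset & 1 or not (0 <= offset < 0x10000):
        raise ValueError("offset must be even and fit in 16 bits")
    return offset | heap


def decode_pointer(pointer: int) -> tuple[int, int]:
    """Return (heap, offset) for a tagged 16-bit pointer."""
    return pointer & 1, pointer & 0xFFFE


def physical_address(pointer: int) -> int:
    """Expand a tagged pointer to a 24-bit address using the heap's bank."""
    heap, offset = decode_pointer(pointer)
    return (HEAP_BANK[heap] << 16) | offset
```

The point of the design is that everything stored in the heap stays 16 bits wide; the bank byte is only supplied at the moment of access.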

_________________
Curt J. Sampson - github.com/0cjs


PostPosted: Sat Nov 14, 2020 1:44 am 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8422
Location: Southern California
IamRob wrote:
And one last mention of some of the instructions that not only save a lot of bytes over the 6502 way of programming, but make programs easier to understand. Some of my favorite instructions are XBA (which swaps hi and lo bytes of the 16-bit Accumulator), TRB/TSB (to set or clear specific bits), INA/DEA (increment or decrement Acc), and BRL (Branch Long).

Note that the 65c02 does have TRB/TSB and INA/DEA. But yes, the '816 will be a lot more efficient at a lot of jobs.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Sat Nov 14, 2020 6:31 am 

Joined: Thu May 28, 2009 9:46 pm
Posts: 8133
Location: Midwestern USA
IamRob wrote:
And there is an even faster MOVE instruction on the 65816 (MVN) that is quite a bit faster than even moving 2-bytes (16-bits) at a time within a loop.

Don't forget MVP. Incidentally, neither instruction is a "move." Both are a "copy," but in opposite directions. Also, MVN can be used as a high-speed fill instruction, such as to zero out a buffer.
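
To show why MVN can act as a fill, here is a small simulation of its copy loop within a single 64K bank. It is a sketch of the instruction's documented behaviour (C holds the byte count minus one, X the source, Y the destination), not code for any particular system; the function names are made up for the example.

```python
def mvn(memory: bytearray, source: int, destination: int, count_minus_one: int) -> None:
    """Simulate MVN within one bank: copy ascending, one byte per iteration."""
    x, y, c = source, destination, count_minus_one
    while True:
        memory[y] = memory[x]
        x = (x + 1) & 0xFFFF
        y = (y + 1) & 0xFFFF
        c = (c - 1) & 0xFFFF
        if c == 0xFFFF:          # the instruction stops when C wraps to $FFFF
            break


def fill(memory: bytearray, start: int, length: int, value: int) -> None:
    """Fill a buffer by seeding one byte and copying it onto its neighbour."""
    memory[start] = value
    if length > 1:
        # Source trails destination by one byte, so each copy re-reads
        # the byte written on the previous iteration.
        mvn(memory, start, start + 1, length - 2)
```

For example, `fill(memory, 0x2000, 256, 0)` on a 64K `bytearray` zeroes $2000 through $20FF and leaves the neighbouring bytes alone.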

Quote:
Depending how the BBC Micro displays its graphics, I saw a really neat trick on an Apple IIGS to display 200 palettes of 16 colors to give 3200 colors on Super Hi-res screen all at the same time. It used the PEA instruction to store each palette in the Palette area that affects the screen. It had to move 32 bytes in the time it takes the VBL of the monitor just to scan one line. Very fast.

In order for PEA to work in such an application, SP has to be pointed at the palette area. However, doing so can create some interesting problems when an interrupt hits and the '816 pushes PB, PC and SR into the palette area. Consider also that PEA is an immediate-mode instruction, which means it can only push data that is embedded in the program code. I actually find PEA somewhat limited in usefulness. A much faster way to saturate an area of memory with static data is with MVN or MVP—and interrupts won't step on your work if your ISR is correctly coded.

Quote:
Although memory access gains from using 16 bits versus 8 bits, the most significant speed increase would be with the slowest device, which should be the hard drive. Reading 16 bits at a time to read a file from the hard drive was the biggest speed increase I saw using a 65816. It is even faster if the device allows DMA addressing.

The only way 16-bit processing will help during mass storage I/O is in the manipulation of indices and pointers. A 16-bit access on the I/O port of a hardware device will touch the port and the register immediately adjacent to it in the memory map, with possibly undefined results. Devices that have a 16-bit data port generally expect all 16 bits to be read/written in a single access cycle, which is not possible with the 65C816.

Note that all memory accesses by the 65C816 are byte-at-a-time operations, even when the target MPU register is set to 16 bits. In the majority of cases, only RAM or ROM accesses will benefit from reading or writing 16 bits at a time. In general, I/O devices must be accessed in 8-bit mode to avoid inadvertently "stepping" on an adjacent register.
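
The adjacent-register hazard can be seen in a toy bus model. This is only a sketch: the register addresses are hypothetical, and the model ignores the finer points of how the '816 wraps addresses in particular addressing modes.

```python
class LoggingBus:
    """Toy bus that records every byte-wide read the CPU performs."""

    def __init__(self) -> None:
        self.reads: list[int] = []

    def read_byte(self, address: int) -> int:
        self.reads.append(address)
        return 0


def cpu_load(bus: LoggingBus, address: int, wide: bool) -> int:
    """Model a load: a 16-bit load is two consecutive 8-bit bus cycles."""
    low = bus.read_byte(address)
    if not wide:
        return low
    high = bus.read_byte((address + 1) & 0xFFFFFF)
    return low | (high << 8)


# Hypothetical device registers, for illustration only.
DATA_PORT = 0xFE40
STATUS_REGISTER = 0xFE41
```

A wide load from `DATA_PORT` leaves `[0xFE40, 0xFE41]` in the read log, so the status register is touched as a side effect, which may matter if reading it clears flags.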

Quote:
And one last mention of some of the instructions that not only save a lot of bytes over the 6502 way of programming, but make programs easier to understand. Some of my favorite instructions are XBA (which swaps hi and lo bytes of the 16-bit Accumulator)...

XBA is indeed quite useful, especially when endianness issues are involved. XBA is slightly slower than a register-to-register copy, but is of considerable value if individual bytes are being processed and it is desired to not clobber .X or .Y.

Quote:
...INA/DEA (increment or decrement Acc)...

A little bit of pedantry: the official mnemonics for incrementing or decrementing the accumulator are INC and DEC, respectively. Standards-based assemblers may expect the pseudo-operand A as part of the instruction, e.g., INC A.

Quote:
...and BRL (Branch Long).

BRL offers no performance advantages and in fact, I do not recommend its use in most programs. As I note in my 65C816 interrupt white paper:

  • Don't use the BRL instruction unless you are writing relocatable code:

    As mentioned earlier, BRA and JMP take three clock cycles to complete, whereas BRL consumes four cycles. BRL confers no advantages in a system where the interrupt service routines are loaded to fixed addresses. While BRA is no faster than JMP it does require one less byte of code, which may be important if available code space is real tight.

The only time BRL is of value is in developing relocatable code or for synthesizing instructions such as BSR (branch to subroutine, an MC6800 instruction).

Quote:
And honorable mention goes to having a floating zero-page. Don't know if this works on the BBC Micro, but on an Apple IIGS, it allows any page in bank zero to be used as the zero-page. Very handy to have multiple zero-pages to use index addressing modes.

Direct page can start on any address in bank $00—the start of direct page is not limited to a page boundary. That MPU characteristic makes it practical to give a function an ephemeral direct page by reserving space on the stack and pointing DP to that space. Obviously, that would not be possible if the start of direct page had to be aligned to a page boundary.

Minor caveat: pointing direct page to anything other than a page boundary will result in a one cycle penalty on each direct page access.
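
A small model of the two points above: carving an ephemeral direct page out of the stack, and the one-cycle cost when its base is not page-aligned. The cycle figures are for LDA with direct addressing (3 cycles, plus one for a 16-bit accumulator, plus one when the low byte of D is non-zero); the function names are made up for the sketch.

```python
def allocate_direct_page(stack_pointer: int, size: int) -> tuple[int, int]:
    """Reserve `size` bytes on the stack; return (new_stack_pointer, direct_page_base).

    The stack pointer addresses the next free byte, so the reserved
    block starts one byte above the new stack pointer.
    """
    new_stack_pointer = (stack_pointer - size) & 0xFFFF
    return new_stack_pointer, (new_stack_pointer + 1) & 0xFFFF


def lda_direct_cycles(direct_page_base: int, accumulator_16bit: bool) -> int:
    """Cycle count for LDA dp in native mode."""
    cycles = 3
    if accumulator_16bit:
        cycles += 1              # extra cycle to fetch the high byte
    if direct_page_base & 0xFF:
        cycles += 1              # penalty when D is not page-aligned
    return cycles
```

With the stack pointer at $01FF and a 16-byte frame, the direct page lands at $01F0, so every direct-page access in that function pays the extra cycle.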


Quote:
And lastly, if the objective is to use the speed of the 65816, creating a BASIC (IMO) would not be the way to go. Learning to program using an assembler would be far more advantageous.

I agree that the utmost performance will be had with carefully-crafted assembly language. However, a BASIC interpreter that has been tailored to the 65C816's capabilities in native mode will run significantly faster than a 65C02 equivalent. My own experience with the POC series shows that integer arithmetic routines that fully exploit the 65C816's features run nearly twice as fast as they would on a 65C02 at the same Ø2 rate. In string handling, the gain is even greater, as much as 3:1 in some cases. Clearly, a BASIC interpreter that can do math 80-90 percent faster and process strings 300 percent faster than achievable with a 65C02 will be quite peppy.

cjs wrote:
As soon as you start using 24-bit pointers into your heap, you slow everything down drastically compared to using 16-bit pointers, regardless of whether it's a 6502 or 65816.

A 24-bit addressing mode of any kind with the '816 will incur a one-cycle penalty per memory access compared to use of 16-bit addressing with the same type of instruction, e.g., LDA [<dpaddr>] (24-bit) versus LDA (<dpaddr>) (16-bit). I would not characterize that as a drastic slowdown. In the overall scheme of things, the performance penalty won't be all that severe, especially if the code is running on a system that can support a high Ø2 rate. Meanwhile, 24-bit indirect addressing combined with a 16-bit index register width will allow you to reach anywhere into address space with succinct code. Most importantly, you will be able to readily index across bank boundaries with no performance penalty.
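
As a sketch of what LDA [<dpaddr>],Y computes with a 16-bit index register, the following models the effective-address calculation. The direct page is represented as a plain byte sequence and the names are illustrative; the cycle figures in the table assume an 8-bit accumulator and a page-aligned direct page.

```python
# Base cycle counts for the two indirect forms compared above
# (8-bit accumulator, page-aligned direct page).
LDA_BASE_CYCLES = {"(dp)": 5, "[dp]": 6}


def load_pointer24(direct_page: bytes, offset: int) -> int:
    """Read a little-endian 24-bit pointer stored in direct page."""
    return (direct_page[offset]
            | (direct_page[offset + 1] << 8)
            | (direct_page[offset + 2] << 16))


def long_indirect_indexed(direct_page: bytes, offset: int, y: int) -> int:
    """Effective address of [dp],Y: 24-bit pointer plus a 16-bit index."""
    return (load_pointer24(direct_page, offset) + (y & 0xFFFF)) & 0xFFFFFF
```

A pointer of $10FFF0 with Y = $0020 resolves to $110010, so the index carries straight into the next bank without any extra code.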

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


PostPosted: Sat Nov 14, 2020 6:34 am 

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
...


Last edited by Arlet on Sun Nov 15, 2020 11:10 am, edited 1 time in total.

PostPosted: Sat Nov 14, 2020 6:39 am 

Joined: Thu May 28, 2009 9:46 pm
Posts: 8133
Location: Midwestern USA
GARTHWILSON wrote:
Note that the 65c02 does have TRB/TSB and INA/DEA. But yes, the '816 will be a lot more efficient at a lot of jobs.

In particular, TRB and TSB with the 65C816 are very handy for processing bit fields. If accumulator/memory is set "wide" a 16-bit bit field can be manipulated with a single atomic instruction.
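
For readers who haven't used them, here is a minimal model of what TSB and TRB do to a 16-bit memory word when the accumulator is wide. It is a sketch of the instruction semantics only; the function names are invented for the example.

```python
def tsb(memory_word: int, accumulator: int) -> tuple[int, bool]:
    """Test and Set Bits: returns (new memory word, Z flag)."""
    zero = (memory_word & accumulator) == 0      # Z reflects A AND M before the write
    return (memory_word | accumulator) & 0xFFFF, zero


def trb(memory_word: int, accumulator: int) -> tuple[int, bool]:
    """Test and Reset Bits: returns (new memory word, Z flag)."""
    zero = (memory_word & accumulator) == 0
    return memory_word & ~accumulator & 0xFFFF, zero
```

For instance, setting bits $8001 in a word holding $00F0 gives $80F1 with Z set, because none of the requested bits were previously set.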

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


PostPosted: Sat Nov 14, 2020 10:42 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10787
Location: England
Thanks for the ideas everyone.

Curt, I quite agree: a language such as Basic could surely do well with several 64k arenas for different types of information. I very much like the idea of a pair of 64k heaps to make for easier garbage collection.

It feels like the least-effort route to a large-memory BBC Basic might be to start with Acorn's BAS128, which already handles a separate space for interpreter purposes, yielding 64k space for program and data. Whether it would be easy or not to separate further, I don't know.


Last edited by BigEd on Sat Nov 14, 2020 11:15 am, edited 3 times in total.

PostPosted: Sat Nov 14, 2020 10:47 am 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8422
Location: Southern California
BigEd wrote:
Whether it would be easy or not to separate further, I don't know.

An early Forth I had contact with on PCs used 16-bit cells but had a 64K space for program, another for data, and another for headers.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Sat Nov 14, 2020 7:02 pm 

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1395
Location: Scotland
BigEd wrote:
- running Gordon's native BCPL compiler (which will surely need porting to an odd memory map)


I'm guessing it shadows host RAM in the Beeb, so Bank 0 from the '816 side is $0000 through $7FFF on the host Beeb side?

If so, then it's not really viable, mostly because of a speed/design constraint I made when implementing the 32-bit VM that it all runs under... Although... My BCPL OS runs under my Ruby OS, which is compatible enough with the Acorn MOS to run e.g. BBC Basic, so it understands how RAM from $0000 through PAGE (typically $E00) is laid out, and in particular zero page (direct page in the '816).

But from PAGE to HIMEM... That's used as the global vector(s) and stack(s) for the VM. The 'trick' here is that I only use the bottom 16 bits of the VM's 32-bit stack register, for speed/efficiency. In practice this means that some programs could work, but maybe only in video mode 7. Running the compiler needs almost all the stack space, so it may not be possible to compile directly on the system, although cross compiling on a Linux desktop is easy.

Actual compiled code and static data is in the rest of RAM from $01.0000 upwards.

It might be possible to relocate the global vectors and stacks to upper RAM, but it would run much slower.

Cheers,

-Gordon

_________________
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/


PostPosted: Sat Nov 14, 2020 7:56 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10787
Location: England
Hi Gordon. Interesting... I think it might not be as bad as all that. Although Bank 00 is where the host memory is found, we keep a fast copy of it on the '816 board, and we only write back when we need to. If we're running in shadow mode, we don't need to, because only the OS VDU code is allowed to write to host memory. So it's full speed ahead, and 32k of (fast) RAM in bank 00. Or 48k if we have a sideways RAM paged in. Maybe we have a chance!


PostPosted: Sun Nov 15, 2020 4:12 am 

Joined: Sat Dec 01, 2018 1:53 pm
Posts: 727
Location: Tokyo, Japan
BigDumbDinosaur wrote:
cjs wrote:
As soon as you start using 24-bit pointers into your heap, you slow everything down drastically compared to using 16-bit pointers, regardless of whether it's a 6502 or 65816.
A 24-bit addressing mode of any kind with the '816 will incur a one-cycle penalty per memory access compared to use of 16-bit addressing with the same type of instruction.... In the overall scheme of things, the performance penalty won't be all that severe, especially if the code is running on a system that can support a high Ø2 rate. Meanwhile, 24-bit indirect addressing combined with a 16-bit index register width will allow you to reach anywhere into address space with succinct code. Most importantly, you will be able to readily index across bank boundaries with no performance penalty.

I'm not sure that a "high φ2 rate" is relevant here; don't modern 8-bit 6502s also support the same clock speeds?

Certainly indexing across bank boundaries is a good advantage. But I'm not sure it really covers everything you'd want for fast 24-bit pointer manipulation. Consider the following fragment of program text in a format like the "crunched" one used by early '80s MS BASIC interpreters. (8F and 89 are the tokens used for the REM and GOTO keywords, respectively, and the target of GOTO is stored as a pointer to the target line's data structure. That's converted to a line number in a saved image, and converted back to a pointer after loading, IIRC.)

Code:
    address    next   linenum   line data (hex)          listing
    ──────────────────────────────────────────────────────────────────
    10FFF2:  FE FF 10  00 20  8F 68 65 6C 6C 6F 00     32 REM hello
    10FFFE:  08 00 11  00 40  89 FE FF 10 00           64 GOTO 64
    110008:  00 00 00

Here if you add a new line, 48 GOTO 64, the new line will be inserted at location $10FFFE, moving the 64 GOTO 64 line and end-of-program marker up in memory. I'm not too familiar with 65816 assembler, but looking through the instruction set this seems to me nearly as complex as doing it on a 6502, and probably not too much faster, since you still have no easy way to load, store and copy 24-bit pointers as a single unit, and you still usually need to keep your pointers in direct page memory "pseudo-registers" instead of just in a register. (That said, at least some 24-bit pointer handling should still be faster on a 65816 than on a bank-switched 6502.)
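
To make the pointer handling concrete, here is a rough model of following the "next" links and of bumping a 24-bit pointer as a 16-bit add plus a carry into the bank byte. It uses only the 64 GOTO 64 line and the end marker from the dump above; the helper names and the dictionary-as-memory are assumptions made for the sketch.

```python
def store(memory: dict[int, int], address: int, data: bytes) -> None:
    """Write bytes into a sparse model of the 24-bit address space."""
    for index, value in enumerate(data):
        memory[(address + index) & 0xFFFFFF] = value


def read24(memory: dict[int, int], address: int) -> int:
    """Read a little-endian 24-bit pointer; its bytes may straddle a bank boundary."""
    return (memory[address]
            | (memory[(address + 1) & 0xFFFFFF] << 8)
            | (memory[(address + 2) & 0xFFFFFF] << 16))


def add24(pointer: int, delta: int) -> int:
    """Add a 16-bit delta: one 16-bit add, then carry into the bank byte."""
    low = (pointer & 0xFFFF) + (delta & 0xFFFF)
    bank = ((pointer >> 16) + (low >> 16)) & 0xFF
    return (bank << 16) | (low & 0xFFFF)


def walk_lines(memory: dict[int, int], first: int) -> list[int]:
    """Return the address of each program line, stopping at the zero link."""
    addresses = []
    address = first
    while read24(memory, address) != 0:
        addresses.append(address)
        address = read24(memory, address)
    return addresses


def build_sample() -> dict[int, int]:
    """Line 64 and the end-of-program marker, as in the dump above."""
    memory: dict[int, int] = {}
    store(memory, 0x10FFFE, bytes.fromhex("08 00 11 00 40 89 FE FF 10 00"))
    store(memory, 0x110008, bytes.fromhex("00 00 00"))
    return memory
```

Note that the link field of the line at $10FFFE has two bytes in bank $10 and one in bank $11, which is exactly the case where a pointer cannot be treated as a 16-bit quantity plus a fixed bank.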

On the other hand, as you point out, the 65816 offers considerable advantage in speed and ease of handling 16-bit arithmetic, so perhaps one should tout the advantage as being that you're getting the memory space of a bank-switched 6502 (many hundreds of kilobytes rather than many tens) with much more speed.

I guess that makes the 65816 more of a PDP-11-style "16 bit" processor than a 68000-style "16-bit" processor? Maybe I should be thinking of the 68000 as really a "16/32-bit" processor, and not expecting the PDP-11, 8086 and 65816 to compare to it in ease of use of >64K address space?

BigEd wrote:
Curt, I quite agree: a language such as Basic could surely do well with several 64k arenas for different types of information. I very much like the idea of a pair of 64k heaps to make for easier garbage collection.

It feels like the least-effort route to a large-memory BBC Basic might be to start with Acorn's BAS128, which already handles a separate space for interpreter purposes, yielding 64k space for program and data. Whether it would be easy or not to separate further, I don't know.

Sure, there's lots of room for easy improvement here in a Microsoft-style BASIC if you're willing to limit the size of any particular area to 64K. These use separate areas for program text, DIM'd arrays, scalar variables and string heap for non-constant strings; all these can be separated into their own 64K address spaces, along with moving disk buffers and the like to their own space as well.

A bit more difficult would be to separate string constants (normally part of program text) to their own area, or even multiple areas by using 24-bit pointers just for those. How much of a win that would be depends on the amount of constant string text in the particular program, but for something like a text adventure game this would probably be a big win.

You could do all of this on a 6502 without huge amounts of difficulty, but as BDD points out, it should all run significantly faster on a 65816 due to easy 16-bit arithmetic and being able to use 24-bit pointers in the direct page rather than having to bank switch.

_________________
Curt J. Sampson - github.com/0cjs


Last edited by cjs on Sun Nov 15, 2020 12:06 pm, edited 1 time in total.

PostPosted: Sun Nov 15, 2020 10:42 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10787
Location: England
It's an interesting thing, in my specific context, that the '816 is already running a lot faster than the 6502 it replaces. So, to some extent, speed or efficiency advantages of the '816 are somewhat hidden, or overwhelmed, by clock speed advantage. It is perhaps the large memory space that's going to appear revolutionary, as opposed to evolutionary.

Maybe some demonstrations will make use of a large flat memory, some will work with 16 bit indexing to offer multiple arenas, and some will make use of the bank structure as a coarse-grained memory extension.

For example... one could imagine a task-swapping approach to multitasking. It might take only tens of milliseconds to swap out a 32k workspace. Perhaps multiprogramming is closer to the term I'm looking for: I swap out my editor to run my assembler, swap out my assembler to run my program, then swap my editor back in.
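
A back-of-envelope check of the "tens of milliseconds" figure, assuming the swap is done with MVN or MVP at 7 cycles per byte and ignoring setup overhead. The clock rate is a parameter here because it is an assumption, not a stated beeb816 figure.

```python
MVN_CYCLES_PER_BYTE = 7   # block-move cost per byte copied


def block_copy_ms(byte_count: int, clock_mhz: float) -> float:
    """Time in milliseconds to copy a block with MVN/MVP at the given clock."""
    cycles = byte_count * MVN_CYCLES_PER_BYTE
    return cycles / (clock_mhz * 1000.0)


def swap_ms(workspace_bytes: int, clock_mhz: float) -> float:
    """A swap is one copy out plus one copy in."""
    return 2 * block_copy_ms(workspace_bytes, clock_mhz)
```

At an assumed 8 MHz, copying 32k takes about 28.7 ms, so a full swap out and in comes to roughly 57 ms; a faster clock scales that down in proportion.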


PostPosted: Sun Nov 15, 2020 12:05 pm 

Joined: Sat Dec 01, 2018 1:53 pm
Posts: 727
Location: Tokyo, Japan
BigEd wrote:
It's an interesting thing, in my specific context, that the '816 is already running a lot faster than the 6502 it replaces. So, to some extent, speed or efficiency advantages of the '816 are somewhat hidden, or overwhelmed, by clock speed advantage.
I'm confused about why you say this. Looking at the data sheets, both the W65C02S (table 6-3 on p. 25) and the W65C816S (table 4-2 on p. 27) appear to have the same range of clock speeds.

Quote:
It is perhaps the large memory space that's going to appear revolutionary, as opposed to evolutionary.
Well, to be fair I think we should be comparing against a 6502 with the same amount of RAM using some reasonable bank-switching scheme. So it's really the better ways of addressing that space, than the amount of space itself, that's the advantage here. (And it is an advantage; the question is how much of an advantage.)

Quote:
For example... one could imagine a task-swapping approach to multitasking. It might take only tens of milliseconds to swap out a 32k workspace. Perhaps multiprogramming is closer to the term I'm looking for: I swap out my editor to run my assembler, swap out my assembler to run my program, then swap my editor back in.
If all these programs fit nicely into less than 64K RAM, or can be broken up into nice pieces of less than that size, that's exactly the kind of situation where I'd imagine that bank switching would handle things nearly as well.

_________________
Curt J. Sampson - github.com/0cjs


PostPosted: Sun Nov 15, 2020 12:44 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10787
Location: England
Ah, perhaps there are two or even three possible threads of discussion going on here.

My own immediate interest, and the reason for posting, is the beeb816 project, where we fit a fast '816 with fast RAM into a BBC Micro, replacing the 2MHz 6502. In this case we're not comparing two architectures at comparable speeds, or two architectures with comparable amounts of RAM.

Indeed, other possible discussions include one on the possible advantages of an '816 vs an '02, given both running at the same speed and both with lots of RAM, the '02 necessarily needing some kind of mapping support. It could be a good one: please feel free to start a thread!

(Another thread, different again, is "On the usefulness of 65816 as a 65C02 alternative". Even older, there's "65816 system engineering pros and cons")

Edit: I tried, when titling this discussion, not to ask "what are the advantages of the 816", because that wasn't the question I had in mind. It's still possible that the title could be better, of course.

