
All times are UTC




Posted: Mon Feb 05, 2018 7:33 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Well done - a milestone! It will be interesting to see the progress as you find optimisations.


Posted: Tue Feb 06, 2018 6:07 pm
Joined: Sat Dec 13, 2003 3:37 pm
Posts: 1004
14MHz on a 700MHz machine is 50 ARM clock cycles per emulated 6502 cycle. Now, I don't know anything about ARM assembly, or how many cycles each ARM instruction takes. On the 6502, a cycle is a clock pulse, so 14MHz is 14M cycles per second.

With an average instruction time of 2 cycles (total swag), that's 7M Instructions (7 MIPS).

So, how many instructions can an ARM do in 50 cycles? That should give you a better idea of what "ultimate performance" might be.

Are you using a table lookup for the instruction processing? I assume so. Fifty cycles sounds like a lot, but it's not that much in this context: at 2-3 cycles per ARM instruction, that's 16-25 ARM instructions for each 6502 instruction.
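The budget arithmetic above can be written out as a small sketch. The function names are made up for illustration, and the 2-3 cycles per ARM instruction figure is the post's rough guess, not a measured number.

```c
/* Host clock cycles available for each emulated 6502 clock cycle. */
static unsigned host_cycles_per_emulated_cycle(unsigned host_hz,
                                               unsigned emulated_hz)
{
    return host_hz / emulated_hz;
}

/* How many host instructions fit in a cycle budget, given an assumed
   average cost per host instruction (integer division rounds down). */
static unsigned host_instructions_in_budget(unsigned budget_cycles,
                                            unsigned cycles_per_host_insn)
{
    return budget_cycles / cycles_per_host_insn;
}
```

With 700MHz and 14MHz the first function gives the 50-cycle budget; dividing that budget by 3 and by 2 gives the 16-25 range quoted above.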

Basic dispatch is: (contrived CPU Agnostic 6502ish Assembly syntax)
Code:
main_loop:
  LDX (PC)
  INC PC
  AND PC, #$FFFF
  JSR (INSTRUCTIONS,X)
  JMP main_loop

There's 4 or 5+ instructions right there, and not "cheap" instructions either.

Then something as simple as LDA <ABS>:
Code:
LDA_ABS:
    LDY (PC)          ; fetch the 16-bit operand address
    INC PC,2
    AND PC, #$FFFF
    LDA (Y)           ; read the byte at that address
    STA A_REGISTER
    RTS


So, LDA ABS is close to 60% of that 14MHz budget already.
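For comparison, here is roughly what the same table dispatch looks like in C. This is only a sketch under assumptions: the `Cpu` struct, the handler names and the flat 64 KB array are invented for illustration, there is no I/O decoding, and flag updates are left out.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t pc;          /* program counter */
    uint8_t  a;           /* accumulator */
    uint8_t  mem[65536];  /* flat 64 KB address space, no I/O decoding */
} Cpu;

typedef void (*Handler)(Cpu *);

static void op_nop(Cpu *cpu) { (void)cpu; }

/* LDA absolute ($AD): fetch a 16-bit little-endian address, then load A
   from that address. N and Z updates are omitted in this sketch. */
static void op_lda_abs(Cpu *cpu)
{
    uint16_t lo = cpu->mem[cpu->pc++];
    uint16_t hi = cpu->mem[cpu->pc++];
    cpu->a = cpu->mem[(uint16_t)(lo | (hi << 8))];
}

static Handler handlers[256];   /* opcodes not listed below stay NULL */

static void init_handlers(void)
{
    handlers[0xEA] = op_nop;
    handlers[0xAD] = op_lda_abs;
}

/* Execute one instruction. Returns 0 if the opcode has no handler. */
static int step(Cpu *cpu)
{
    uint8_t opcode = cpu->mem[cpu->pc++];  /* uint16_t pc wraps at $FFFF */
    Handler h = handlers[opcode];
    if (h == NULL)
        return 0;
    h(cpu);
    return 1;
}
```

Keeping `pc` in a `uint16_t` makes the `AND PC, #$FFFF` step implicit, but the indirect call and return on every instruction are still there.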

Anyway, you may not be able to get much more than 14MHz. It's just interesting how fast it all adds up.

Have you considered JIT compiling the 6502? Zziinnng!!


Posted: Tue Feb 06, 2018 6:18 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
(The PiTubeDirect project manages >200MHz at 1GHz. It's written in very optimised assembly language, and is a fast-as-possible emulator making no effort to match the proportional cycle counts of each instruction.
There's a timeline here showing the progress from the initial 84MHz. The project offers several models of CPU, including another 6502 model based on Ian Piumarta's lib6502 written in C, which runs at about 20MHz. So that's a benchmark for a C model, and is very much in line with 14MHz at 700MHz.)


Posted: Wed Feb 07, 2018 4:44 am
Joined: Mon May 12, 2014 6:18 pm
Posts: 365
whartung wrote:
2-3 cycles per instruction, that's 16-25 instructions for each 6502 instruction.
Aren't a lot of ARM instructions single cycle or close to it? You may have more instructions to work with than 25. OTOH, I'm not sure how the Raspberry Pi is set up and how fast it can run from RAM.
https://hardwarebug.org/2014/05/15/cortex-a7-instruction-cycle-timings/


Posted: Wed Feb 07, 2018 9:59 am
Joined: Tue Mar 02, 2004 8:55 am
Posts: 996
Location: Berkshire, UK
I found the biggest speed gains in my code were from using the processor's native status word and creating P only when it was actually needed. Also, C is the only flag bit that ever needs to be restored, and then only for shifts and arithmetic.
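The "create P only when needed" half of this can be sketched in C. It is an illustration, not BitWise's code: portable C cannot read the host's status word, so the flags are simply kept in separate fields here, and the `Flags` struct and function names are assumptions. The bit layout NV1BDIZC is the standard 6502 one.

```c
#include <stdint.h>

/* Flags kept apart while executing; each field is 0 or 1. */
typedef struct {
    uint8_t n, v, d, i, z, c;
} Flags;

/* Assemble the 6502 P byte (layout NV1BDIZC) only when something needs
   it. break_flag is 1 for PHP/BRK and 0 for hardware interrupts. */
static uint8_t pack_p(const Flags *f, int break_flag)
{
    return (uint8_t)((f->n << 7) | (f->v << 6) | 0x20 |
                     (break_flag ? 0x10 : 0x00) |
                     (f->d << 3) | (f->i << 2) | (f->z << 1) | f->c);
}

/* Split a P byte back into separate flags, as PLP or RTI would. */
static void unpack_p(Flags *f, uint8_t p)
{
    f->n = (p >> 7) & 1;
    f->v = (p >> 6) & 1;
    f->d = (p >> 3) & 1;
    f->i = (p >> 2) & 1;
    f->z = (p >> 1) & 1;
    f->c = p & 1;
}
```

Only PHP, BRK, interrupts, PLP and RTI would call these, so ordinary instructions never pay for building P.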

_________________
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs


Posted: Wed Feb 07, 2018 12:47 pm
Joined: Fri Aug 30, 2002 9:02 pm
Posts: 1748
Location: Sacramento, CA
Wow, last night I posted a lengthy response and this morning it's not here. I always do the preview before I submit, so there's a small chance I had done that and forgot to click submit.

I'll try to recap that post and add more when I get home from work. Thank you to everyone who has offered help!!

Daryl

_________________
Please visit my website -> https://sbc.rictor.org/


Posted: Wed Feb 07, 2018 12:52 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
(I think if you compose something and someone else posts in the meantime, you get a second confirmation stage - I've lost a post or two that way, thinking I'd posted when I hadn't.)


Posted: Wed Feb 07, 2018 7:04 pm
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8546
Location: Southern California
I hope you can find it in your browser history, either by re-opening closed tabs, or by using the back arrow ("Go back one page") in the tab history enough times.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Posted: Thu Feb 08, 2018 12:11 am
Joined: Tue Jul 24, 2012 2:27 am
Posts: 679
BitWise wrote:
I found the biggest speed gains in my code were from using the processor's native status word and creating P only when it was actually needed. Also, C is the only flag bit that ever needs to be restored, and then only for shifts and arithmetic.


I'm not sure if this is the same as your approach, but what I've done for N and Z is when those flags are modified, save the current 8-bit value that would affect them into a "last value" location. Then when a branch on those is needed, pull that last value and use a native branch. Those 2 flags in particular are often modified but ignored, so it doesn't make sense to put work into computing them all the time.

With a custom native asm environment on ARM, you might even dedicate a register for holding that NZ-affecting value.
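A minimal C sketch of the "last value" idea, with made-up names (`nz_last`, `lda_imm`, `flag_z`, `flag_n`): instructions that affect N and Z just store the result byte, and the flags are derived only when a branch asks for them.

```c
#include <stdint.h>

typedef struct {
    uint8_t a;        /* accumulator */
    uint8_t nz_last;  /* last result that would have set N and Z */
} Cpu;

/* LDA #imm: no flag computation, just remember the value. */
static void lda_imm(Cpu *cpu, uint8_t value)
{
    cpu->a = value;
    cpu->nz_last = value;
}

/* Derive the flags on demand, e.g. for BEQ/BNE and BMI/BPL. */
static int flag_z(const Cpu *cpu) { return cpu->nz_last == 0; }
static int flag_n(const Cpu *cpu) { return (cpu->nz_last & 0x80) != 0; }
```

One limitation of a single 8-bit value: it cannot represent N=1 and Z=1 at the same time, which PLP or RTI can produce, so those instructions would need a wider value or separate handling.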

_________________
WFDis Interactive 6502 Disassembler
AcheronVM: A Reconfigurable 16-bit Virtual CPU for the 6502 Microprocessor


Posted: Thu Feb 08, 2018 5:34 am
Joined: Fri Aug 30, 2002 9:02 pm
Posts: 1748
Location: Sacramento, CA
I don't have the energy to recompose the full message from yesterday, but the summary is this:

Thanks for all the help. The tube project is very intriguing, as they have some awesome performance numbers. I have reviewed some of the comments in the sources and have gained a few ideas that I can try on my build.

I have been simulating the Status Register, which admittedly does use up a lot of instructions. The memory load and save also take up quite a bit, due partly to 32-bit word alignment. 16-bit reads are being done with two 8-bit reads and then combining the results. I read something last night about disabling "alignment", which may help me with that issue. I need to experiment some more.
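The two-byte read described above looks roughly like this in C. The names `mem`, `read8` and `read16` are placeholders, and I/O decoding is left out.

```c
#include <stdint.h>

static uint8_t mem[65536];   /* simulated 64 KB address space */

static uint8_t read8(uint16_t addr)
{
    return mem[addr];
}

/* Little-endian 16-bit read built from two byte reads. It works at any
   address, odd or even, and wraps from $FFFF to $0000. */
static uint16_t read16(uint16_t addr)
{
    uint16_t lo = read8(addr);
    uint16_t hi = read8((uint16_t)(addr + 1));
    return (uint16_t)(lo | (hi << 8));
}
```

The byte-wise form never depends on the host's alignment rules, which is why it is the safe default; a single wider load would be the optimization to measure.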

This build uses the same model as my AVR code, using macros to do memory and IO reads and writes, addressing modes, and opcodes. I also use a macro for the fetch process, which gets added to the end of every opcode. I'll explain why in a minute.

Here's a sample of my LDA ABS instruction:

Code:
C_AD:             ; LDA_ABS - opcode $AD
   ABS         
   C_LDA         
   CYCLE 4         
   FETCH         


ABS will read the 2 bytes after the opcode, form the target address, and read the byte at that address (RAM, ROM, or IO).
C_LDA will transfer that data to the Acc register and update the proper flags.
CYCLE 4 is a 1 ARM instruction macro that adds the number to the cycle count.
FETCH will get the opcode from the location pointed to by the PC register and branch to the proper code (label C_AD in the case of an LDA ABS).

By placing the fetch macro at the end of each opcode, I avoid subroutine overhead and/or extra branch instructions. The pipeline architecture of the ARM allows for 4 instructions to be moving through the pipe, starting with the fetch and ending with the operation. If a branch occurs, the pipeline stalls (has to be refreshed), which uses up extra cycles. The cool thing is that most instructions can be conditionally executed, which eliminates the need for forward branches and keeps the pipeline moving. If a condition is not met, a 1 cycle nop replaces the instruction.
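The same "fetch at the end of every opcode" structure can be approximated in C. This sketch assumes the GCC/Clang "labels as values" extension (`&&label` and `goto *`), which is not standard C. `FETCH` and `CYCLE` echo the macro names above, but their bodies, the `run` function and the three-opcode table are invented for illustration.

```c
#include <stdint.h>

/* Each handler ends by fetching and jumping straight to the next one. */
#define FETCH()  goto *table[mem[pc++]]
#define CYCLE(n) (cycles += (n))

/* Runs from pc until BRK ($00) or any unimplemented opcode, and returns
   the cycle count. mem must point at a full 64 KB image. Only NOP and
   LDA #imm are implemented in this sketch. */
static unsigned long run(const uint8_t *mem, uint16_t pc, uint8_t *a_out)
{
    void *table[256];
    unsigned long cycles = 0;
    uint8_t a = 0;
    int i;

    for (i = 0; i < 256; i++)
        table[i] = &&op_stop;       /* default: stop the simulation */
    table[0xA9] = &&op_lda_imm;
    table[0xEA] = &&op_nop;

    FETCH();

op_lda_imm:                         /* LDA #imm, 2 cycles, flags omitted */
    a = mem[pc++];
    CYCLE(2);
    FETCH();

op_nop:                             /* NOP, 2 cycles */
    CYCLE(2);
    FETCH();

op_stop:
    *a_out = a;
    return cycles;
}
```

There is no central loop and no call/return: every handler carries its own indirect jump, which is the C analogue of ending each opcode with the fetch macro.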

I also need to find optimizations in calculating flags for an 8-bit value in a 32-bit register, i.e., the ARM has a V flag, but it's looking at overflow at bit 31, not bit 7.

Bottom line: streamlining the code, removing complexity with memory/IO reads and writes, using native flags when possible vs. simulating them, and reducing pipeline stalls should allow me to speed up the simulation.

More still to come.

Daryl

_________________
Please visit my website -> https://sbc.rictor.org/


Posted: Thu Feb 08, 2018 8:51 am
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Because many ARM instructions offer a free shift, you can consider the technique of keeping 8-bit values left-aligned in their registers - this way, N and V work out much better.
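To show why left alignment helps, here is a C sketch that does an 8-bit add with the operands held in bits 31..24. On the ARM an ADDS on such values would leave N, Z, C and V in the host flags directly; the C code computes them by hand only to make the bit positions visible. The `AddResult` struct and function name are illustrative, and carry-in is left out for brevity.

```c
#include <stdint.h>

typedef struct {
    uint8_t result, n, z, c, v;
} AddResult;

/* 8-bit add (no carry-in) performed on left-aligned 32-bit operands. */
static AddResult add8_left_aligned(uint8_t x, uint8_t y)
{
    uint32_t xa = (uint32_t)x << 24;
    uint32_t ya = (uint32_t)y << 24;
    uint32_t sum = xa + ya;                 /* wraps modulo 2^32 */
    AddResult r;

    r.result = (uint8_t)(sum >> 24);
    r.n = (uint8_t)(sum >> 31);             /* 6502 bit 7 sits in bit 31 */
    r.z = (uint8_t)(sum == 0);
    r.c = (uint8_t)(sum < xa);              /* carry out of bit 31 */
    /* Signed overflow: operands share a sign that the result lacks. */
    r.v = (uint8_t)((~(xa ^ ya) & (xa ^ sum)) >> 31);
    return r;
}
```

For example, $50 + $50 gives $A0 with N and V set, and $FF + $01 gives $00 with Z and C set, matching what the 6502 would report for the 8-bit values.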


Posted: Thu Feb 08, 2018 11:10 am
Joined: Sat Nov 11, 2017 1:08 pm
Posts: 33
With the ARM on the Pi, shifts aren't always free these days; it depends on interlocks.


Posted: Thu Feb 08, 2018 11:48 am
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Oh, that's a surprise!


Posted: Thu Feb 08, 2018 12:59 pm
Joined: Fri Aug 30, 2002 9:02 pm
Posts: 1748
Location: Sacramento, CA
So far, I haven't run into a problem with "free" shifts for operand 2. I haven't seen anything about interlocks in the assembly docs. Can you explain what they are, dp11? I've only been using ARM assembly for about 3 weeks, and have a lot to learn still.

I had not considered keeping the data left aligned, thanks for the tip Ed! I was toying with the idea of filling bits 8-30 with 1's and copying the sign (bit 7) to bit 31. I think I would have the same results, but there'd be too much overhead. I will definitely experiment with left alignment.

Another trick I learned from the tube project was placing simulated memory at address 0x0000 - 0xFFFF, so that the PC can directly address memory. So far, that is working well for me. My only concern is with the VideoCore mailboxes (which I have not played with yet). I think they live in the space between 0x0000 and 0x8000. If that is true and I cannot relocate them, I will not be able to use this trick. I will eventually want to use the video display, especially after I get a 65816 simulator working, which can address 16MB of RAM.

Back to the code.

Daryl

_________________
Please visit my website -> https://sbc.rictor.org/


Posted: Thu Feb 08, 2018 1:28 pm
Joined: Sat Nov 11, 2017 1:08 pm
Posts: 33
With newer ARMs, some of what you thought you knew about timings has changed. A lot isn't documented publicly either. It can also change between ARMv6, v7 and v8.

You will only spot the interlocks if you accurately cycle count the ARM instructions.

Code:
; code that might simulate an LDA
    LDRB R0,[Rx]
    MOVS R1,R0
; the above takes, say, n cycles

    LDRB R0,[Rx]
    MOVS R1,R0,LSL #24
; the above takes n+1 cycles on the ARM core used in the Pi 1



Mailboxes can live wherever you like, thanks to the MMU. Using 0x0000 for the memory doesn't give you a big win in performance. The big wins are from scheduling the loads from memory and keeping the code small enough that it fits in the cache. Dealing with 6502 interrupts can be a big overhead.


Last edited by dp11 on Thu Feb 08, 2018 3:49 pm, edited 1 time in total.
