All times are UTC




PostPosted: Thu Feb 08, 2018 1:33 pm 

Joined: Sun Jun 29, 2014 5:42 am
Posts: 352
In case you missed it (because it's a different thread), a while back I wrote up some of the history of the PiTubeDirect project:
http://www.stardot.org.uk/forums/viewto ... 55#p140955

Dave


PostPosted: Thu Feb 08, 2018 1:45 pm 

Joined: Fri Aug 30, 2002 9:02 pm
Posts: 1748
Location: Sacramento, CA
Yes, I have read that, but only on a cursory level. I do want to go back and dig into it and the source code at some point. The speeds you are getting are outstanding, and I'm sure I could learn a lot from this project. Thanks for sharing this, Dave!

Daryl

_________________
Please visit my website -> https://sbc.rictor.org/


PostPosted: Thu Feb 08, 2018 1:51 pm 

Joined: Fri Aug 30, 2002 9:02 pm
Posts: 1748
Location: Sacramento, CA
dp11 wrote:
With newer ARMs, some of what you thought you knew about timings has changed. A lot isn't documented publicly either. It can also change between ARMv6, v7 and v8.

You will only spot the interlocks if you accurately cycle count the ARM instructions.

Code:
; code that might simulate an LDA
    LDRB R0,[Rx]
    MOVS R1,R0
; say the above takes n cycles

    LDRB R0,[Rx]
    MOVS R1,R0,LSL #24
; the above takes n+1 cycles on the ARM core used in the Pi 1



The mailbox can live wherever you like thanks to the MMU. Using 0x0000 for the memory doesn't give you a big win in performance. The big wins come from scheduling the loads from memory and keeping the code small enough that it fits in the cache. Dealing with 6502 interrupts can be a big overhead.


OK, I see what you are saying. Thanks for sharing. I know using 0x0000 won't boost performance, but it will simplify the code a bit. I have not played with the MMU yet; I'm pretty much just using David Welch's basic "set the stack pointer and go" example. I did manage to enable the L1 caches. I know I have a lot to learn yet, especially with the Pi hardware on top of the ARM.

Daryl

PostPosted: Thu Feb 08, 2018 10:48 pm 

Joined: Tue Jul 24, 2012 2:27 am
Posts: 679
BigEd wrote:
Because many ARM instructions offer a free shift, you can consider the technique of keeping 8-bit values left-aligned in their registers - this way, N and V work out much better.

I think you meant N and Z, not V? I was going to suggest something similar, but refrained because it doesn't support carry coming in from the LSB for addition/subtraction. There are a few solutions that pull in more cases, but they all have downsides:

  1. Do ops in the upper 8 bits: C, N, V, Z work, but not right-input carry (ADC/SBC/ROL) nor ROR/LSR's right-output carry.
  2. Do ops in the upper 8 bits with the lower 24 set: Right-input carry for ADC/SBC now work, but Z breaks.
  3. Do ops in the lower 8 bits, then shift left 24 bits: C, N, Z, right-input, right-output, & left-output carry all work, but V and left-input carry don't.

Across the space of ops, you'd have to find the best balance for the intermediate register storage, but probably handle the individual ops differently. I also admit that I don't know how ARM's carry bit interacts with the shifter, if at all, so that would affect how useful my "shift by 24 after operation to output carry" assumptions are.

You could also do something crazy like keep the accumulator as a 10-bit number to represent what's going on with carry on both ends. But I'm pretty sure you'd always lose V in the process.
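
As a rough check of option 1 above, here is a minimal C sketch that models a 32-bit flag-setting add and compares its N, Z, C, V results against a plain 8-bit add when both operands sit in the upper 8 bits. The names `adds32`, `flags8` and `check_left_aligned_add` are made up for this illustration, and the flag model is my reading of how a 32-bit add sets flags, not code from any of the emulators discussed here. It covers only the no-carry-in case.

```c
#include <assert.h>
#include <stdint.h>

/* Result of a modelled 32-bit add; flags packed as N=8, Z=4, C=2, V=1. */
typedef struct { uint32_t result; unsigned nzcv; } add_result;

/* Models a 32-bit flag-setting add with no carry in (hypothetical helper). */
static add_result adds32(uint32_t a, uint32_t b)
{
    uint64_t wide = (uint64_t)a + b;
    uint32_t r = (uint32_t)wide;
    unsigned n = r >> 31;                           /* sign bit of result    */
    unsigned z = (r == 0);                          /* result is zero        */
    unsigned c = (unsigned)(wide >> 32);            /* carry out of bit 31   */
    unsigned v = ((~(a ^ b) & (a ^ r)) >> 31) & 1u; /* signed overflow       */
    add_result out = { r, (n << 3) | (z << 2) | (c << 1) | v };
    return out;
}

/* Reference: flags of an 8-bit add without carry in, same packing. */
static unsigned flags8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    uint8_t r = (uint8_t)sum;
    unsigned n = r >> 7;
    unsigned z = (r == 0);
    unsigned c = sum >> 8;
    unsigned v = ((~(a ^ b) & (a ^ r)) >> 7) & 1u;
    return (n << 3) | (z << 2) | (c << 1) | v;
}

/* Exhaustively checks that left-aligned operands reproduce the 8-bit flags. */
static void check_left_aligned_add(void)
{
    for (unsigned a = 0; a < 256; a++) {
        for (unsigned b = 0; b < 256; b++) {
            add_result r = adds32((uint32_t)a << 24, (uint32_t)b << 24);
            assert(r.nzcv == flags8((uint8_t)a, (uint8_t)b));
            assert((r.result >> 24) == ((a + b) & 0xFFu));
        }
    }
}
```

Calling `check_left_aligned_add()` from a test harness walks all 65,536 operand pairs. It says nothing about carry coming in from the right, which is exactly the gap described in the list above.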

_________________
WFDis Interactive 6502 Disassembler
AcheronVM: A Reconfigurable 16-bit Virtual CPU for the 6502 Microprocessor


PostPosted: Fri Feb 09, 2018 12:30 am 

Joined: Fri Aug 30, 2002 9:02 pm
Posts: 1748
Location: Sacramento, CA
Having 3 of the 4 flags would still greatly simplify (and speed up) the simulation. V is the hardest to simulate, so ensuring it gets adjusted would be my first priority. N, Z and C are easy to adjust.

I am going to work up a left-aligned model and give that a go. My dev process is pretty fast: downloading the 90k image takes about 20 seconds, and compiling takes 5. It should only take a few hours to work out the bugs.

Daryl

PostPosted: Fri Feb 09, 2018 9:54 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
I've the vaguest recollection that the left-aligned approach works well for everything except ADC and SBC. But you can if you like have a look at a reverse-engineered source for one of Acorn's efforts:
http://stardot.org.uk/forums/viewtopic. ... 32#p127732


PostPosted: Sat Feb 10, 2018 1:05 am 

Joined: Tue Jul 24, 2012 2:27 am
Posts: 679
If just for ADC and SBC you set the low 24 bits to 1 before the operation, then the input carry should ripple up and have its proper effect. You'd need to clear them afterwards, though, to get Z working again.
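
A minimal C sketch of that idea for binary-mode ADC, assuming the accumulator and operand are held in the top byte of a 32-bit word. The names `adc_left_aligned` and `check_adc_left_aligned` are hypothetical, the 64-bit sum stands in for an add-with-carry instruction, and decimal mode is ignored.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint8_t a; unsigned c, v, n, z; } adc_result;

/* 8-bit binary-mode ADC computed in the top byte of a 32-bit word. */
static adc_result adc_left_aligned(uint8_t a, uint8_t m, unsigned carry_in)
{
    uint32_t x = ((uint32_t)a << 24) | 0x00FFFFFFu;    /* fill low 24 bits with 1s */
    uint32_t y = (uint32_t)m << 24;
    uint64_t wide = (uint64_t)x + y + (carry_in & 1u); /* stands in for add-with-carry */
    uint32_t r = (uint32_t)wide;

    adc_result out;
    out.c = (unsigned)(wide >> 32);             /* carry out of bit 31   */
    out.v = ((~(x ^ y) & (x ^ r)) >> 31) & 1u;  /* signed overflow       */
    out.n = r >> 31;
    r &= 0xFF000000u;                           /* clear the filler bits */
    out.z = (r == 0);                           /* Z is valid only after the clear */
    out.a = (uint8_t)(r >> 24);
    return out;
}

/* Compares against a straightforward 8-bit ADC for every input. */
static void check_adc_left_aligned(void)
{
    for (unsigned a = 0; a < 256; a++)
        for (unsigned m = 0; m < 256; m++)
            for (unsigned c = 0; c < 2; c++) {
                unsigned sum = a + m + c;
                unsigned res = sum & 0xFFu;
                adc_result r = adc_left_aligned((uint8_t)a, (uint8_t)m, c);
                assert(r.a == res);
                assert(r.c == (sum >> 8));
                assert(r.v == (((~(a ^ m) & (a ^ res)) >> 7) & 1u));
                assert(r.n == (res >> 7));
                assert(r.z == (unsigned)(res == 0));
            }
}
```

When the carry in is 1, the 24 filler bits overflow into bit 24 and become zero; when it is 0, they stay set, which is why Z has to be taken after the mask.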

PostPosted: Sun Feb 11, 2018 6:07 pm 

Joined: Sat Nov 11, 2017 1:08 pm
Posts: 33
Just been doing a few speed tests with PiTubeDirect on a RPi Zero at 1GHz.

This is with the C version of the 6502 core, not the high-performance assembler version, which is about 275MHz.

With an older GCC (most likely GCC 6) we get about 86.44MHz equivalent performance.
With GCC 7 we get about 73.47MHz equivalent performance.

I haven't investigated this difference yet. It might be that another compiler flag is needed to get GCC 7 working correctly.


PostPosted: Sun Feb 11, 2018 10:36 pm 

Joined: Tue Mar 02, 2004 8:55 am
Posts: 996
Location: Berkshire, UK
Has anyone tried lookup tables? The Pi has plenty of RAM.

_________________
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs


PostPosted: Mon Feb 12, 2018 8:44 am 

Joined: Sat Nov 11, 2017 1:08 pm
Posts: 33
Lookup tables have been slower in the past, possibly due to the code not fitting into the cache. My spreadsheet simulation suggests a lookup table might now be about 0.5% faster if a few other things hold true.
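
The posts above don't say what would be tabulated. Purely as an illustration, here is a C sketch of one common candidate: a 256-entry table giving the 6502 N and Z status bits for each byte value, next to the computed equivalent. The names `nz_table`, `init_nz_table` and `nz_computed` are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

enum { FLAG_N = 0x80, FLAG_Z = 0x02 };  /* 6502 status register bit positions */

static uint8_t nz_table[256];           /* hypothetical table, one byte per value */

/* Fills the table once at start-up. */
static void init_nz_table(void)
{
    for (unsigned v = 0; v < 256; v++)
        nz_table[v] = (uint8_t)((v & FLAG_N) | (v == 0 ? FLAG_Z : 0));
}

/* Computed alternative: no memory access, a few more ALU operations. */
static uint8_t nz_computed(uint8_t v)
{
    return (uint8_t)((v & FLAG_N) | (v == 0 ? FLAG_Z : 0));
}
```

Whether the table wins depends on the trade described above: each lookup is a load that competes for cache space with the emulated memory and the opcode handlers.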


PostPosted: Wed Feb 14, 2018 3:49 pm 

Joined: Fri Aug 30, 2002 9:02 pm
Posts: 1748
Location: Sacramento, CA
I've done some experimenting with left alignment and its effect on the ARM's flags. This method works well with most 6502 instructions. ROR and ROL need some extra help, and I have not started testing with ADC or SBC yet. I did note that the ARM V flag gets changed by many instructions that leave V alone on the 65C02. For instance, INC and DEC have to use ARM ADD and SUB, which change the V flag. It might be more costly to adjust and save/restore the ARM's flags than to continue simulating the 65C02 Status register. Using the left-aligned method does reduce my code some though, so I plan to implement it in the A, X, and Y registers.

I have also written a script to convert my macro-based source to the full assembly for each opcode. This will allow me to streamline each opcode and eliminate some extra register moves the macros created. The Tube project aligns the opcode processing on 16-word boundaries, which makes the fetch and execute control really efficient. But it gives a maximum of 16 ARM instructions per opcode. About half of mine are currently longer than 16. I could use a 32-word boundary, but the 16-word method allows all 256 opcodes to reside in the L1 cache, which provides a big speed boost.
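
The arithmetic behind the 16-word slots can be sketched in C as follows. This is only an illustration of the address calculation, assuming 4-byte ARM instructions; `handler_address` and `table_base` are hypothetical names, not taken from the Tube project's source.

```c
#include <assert.h>
#include <stdint.h>

enum {
    WORDS_PER_SLOT = 16,                              /* ARM instructions per opcode */
    BYTES_PER_WORD = 4,
    SLOT_BYTES     = WORDS_PER_SLOT * BYTES_PER_WORD, /* 64 bytes per opcode         */
    NUM_OPCODES    = 256,
    TABLE_BYTES    = NUM_OPCODES * SLOT_BYTES         /* 16384 bytes in total        */
};

/* Address of the handler for an opcode: one shift and one add, no table load. */
static uint32_t handler_address(uint32_t table_base, uint8_t opcode)
{
    return table_base + ((uint32_t)opcode << 6);      /* opcode * 64 */
}
```

With 16-word slots the whole set of handlers takes 16 KB; 32-word slots would double that to 32 KB. If the L1 instruction cache is 16 KB (my understanding for the original Pi's ARM core, so treat it as an assumption), only the smaller layout fits entirely.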

I need to focus on the memory decoding of the read and write functions, as testing for IO and write protecting the "ROM" takes too many instructions.

This has been a huge learning experience for me... and I'm sure I'll keep finding ways to reduce code and increase speed.

Daryl

PostPosted: Thu Feb 22, 2018 5:40 pm 

Joined: Fri Aug 30, 2002 9:02 pm
Posts: 1748
Location: Sacramento, CA
Here's an update: I've been working on optimizing the ARM assembly code, mostly in the areas of memory reads and writes. I have also worked out the code for keeping the A, X, Y registers left-aligned, which has really simplified the Status register updates. I'm still keeping the SR simulated vs. using the ARM's own SR for N, V, Z, C. BCD math is still not very optimized, but I may try to tackle that. With all that, I was able to bump the speed up to 25MHz. There are a few places where I cut a corner though. A 16-bit read of address $FF will not wrap around to $00. Also, a 16-bit read at the last RAM byte will not get its upper value from the IO page, but rather from the hidden RAM behind the IO page. Both are acceptable to me because there's no logical reason for a program to need to do that.
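
To make the first shortcut concrete, here is a small C sketch contrasting a 16-bit read that simply takes the next byte with one that wraps within page zero, as a zero-page pointer fetch at $FF does on a real 6502. The function names are invented for this sketch and `mem` is assumed to be a flat 64 KB array.

```c
#include <assert.h>
#include <stdint.h>

/* Shortcut: the high byte always comes from addr + 1 (so $FF reads $FF and $100). */
static uint16_t read16_fast(const uint8_t *mem, uint16_t addr)
{
    return (uint16_t)(mem[addr] | (mem[(uint16_t)(addr + 1)] << 8));
}

/* Zero-page pointer fetch with wrap: $FF reads $FF and then $00. */
static uint16_t read16_zp(const uint8_t *mem, uint8_t zp)
{
    return (uint16_t)(mem[zp] | (mem[(uint8_t)(zp + 1)] << 8));
}
```

The two differ only when the low byte sits at $FF, so the wrap costs one extra mask on every zero-page pointer fetch in exchange for covering that single case.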

Not being satisfied with the speed, I started digging into the ARM Architecture Reference Manual (ARM ARM) as I really didn't understand the cache and MMU functions that well. I also dug into David Welch's references and tried to set up the MMU and caches. It seems the ARM comes out of power up with the MMU and caches disabled. After 3 or 4 attempts, I finally got both enabled and my simulated speed went up to over 100MHz. Bingo! And, after reading how the caches are organized, found that disabling the data cache increased the speed by about 10-12 more MHz. My speed test now reads around 120MHz!

I will most likely go back and fix the shortcuts mentioned above and then start reading about the IO available on the Pi to play with. Getting video display working is #1.

I'll package up what I have so far also and post it for those who are interested in giving it a try.

Daryl

PostPosted: Thu Feb 22, 2018 7:33 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Excellent progress!


PostPosted: Thu Feb 22, 2018 8:23 pm 

Joined: Sun Jun 29, 2014 5:42 am
Posts: 352
8BIT wrote:
Bingo! And, after reading how the caches are organized, found that disabling the data cache increased the speed by about 10-12 more MHz.

Interesting... can you say a bit more about why this would be the case?


PostPosted: Thu Feb 22, 2018 8:33 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
(Maybe because cache lines are much bigger than bytes, and an emulator's access doesn't have great locality compared to the size of a cache line??)

