6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Mon Apr 29, 2024 2:29 am

All times are UTC




Post new topic Reply to topic  [ 11 posts ] 
Author Message
PostPosted: Fri Jun 19, 2020 10:51 am 
Offline

Joined: Thu Feb 27, 2020 9:15 am
Posts: 18
Hello

Over the past week, i have been writing an assembler for the 6502 as a hobby project.
What i planned was to create a system that was flexible(easy to add/modify assembler directives, instructions, etc.) and easy to implement(no spaghetti code, everything is tidy, etc.)

I chose to write it in C# as i dont need it to work on anything that doesnt support .net core

The main bulk of the assembler are the statements and statement templates. Statement templates are basically regex strings with some additional information that can also have optional array of operand templates.

In its current version, the preprocessing works and separates source into statements, however it is quite inefficient, as it needs to look through all possible variations of statement templates with their operands(for example in case of a statement with 5 operands that can have 20 different operand variations, there will be 100 different variants to check)

This means that for example given a ~1000 lines of code, it will take roughly 2-5 seconds to process, whereas other compilers like VASM take under half a second.

Is there a point to try and optimise my code even more amd keep working on it in the current iteration or should i rather rethink the whole system and ditch the regex variant matching(or at least make it as fixed regex lookup table), making it more "dumber" and less flexible but getting more speed & memory efficiency in exchange?

I dont really plan for it to be a "commercial development tool" but it is kinda a bit discouraging, seeing that other assemblers perform 100 times better. :cry:


Top
 Profile  
Reply with quote  
PostPosted: Fri Jun 19, 2020 11:33 am 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
You should do it if you want to do it. Is it organised in such a way that you could swap out one mechanism for another? You can put up with the slow mechanism, and fill our the rest of the project, until you have some idea about how to replace it with a faster one. (It's usually true that a better algorithm is a much preferable way forward than an optimised implementation of a slow algorithm.)

It's not unusual, I think, to have to revisit a project as substantial as this, because in the process of making the first incarnation, you learn a lot about how you might prefer to approach it.

So, maybe it's not so much about writing an assembler, as learning how to write an assembler, by doing it.


Top
 Profile  
Reply with quote  
PostPosted: Fri Jun 19, 2020 11:37 am 
Offline
User avatar

Joined: Tue Mar 02, 2004 8:55 am
Posts: 996
Location: Berkshire, UK
A more traditional approach is to tokenise the lines and do a LL(n) parse. Assembly instructions follow a simple syntax and are easy to process.

My assembler (in Java) is based on a framework where each opcode/directive is a object with a compile method. My base classes provide common things like expression parsing and object/listing generation while derived classes customise the tokenising (e.g. $1234 in MOS/Motorola 01234H on Intel/Zilog) and instruction sets.

The generic core of the assembler is here:
https://github.com/andrew-jacobs/dev65/tree/master/src/uk/co/demon/obelisk/xasm
While this customises it to the 65xx family of processors:
https://github.com/andrew-jacobs/dev65/tree/master/src/uk/co/demon/obelisk/w65xx

_________________
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs


Top
 Profile  
Reply with quote  
PostPosted: Fri Jun 19, 2020 12:10 pm 
Offline

Joined: Tue Sep 03, 2002 12:58 pm
Posts: 293
If you can measure the time it takes to assemble 1000 lines, you're doing something very wrong. Regular expressions are not the right tool - I wouldn't continue in that direction.

My assembler does the whole C64 ROM (about 10,000 source lines in total) pretty much instantly. It loads the whole file into memory and converts it into tokens with a simple state-based lexer. If it sees an include directive while doing this, it recursively loads that file and converts it into tokens too. The file on disc is only accessed once, and once it has been converted to tokens, the text is only used for printing lines in error messages or the listing output. The rest of the assembler only cares about tokens.

The parser is fairly ad-hoc, because that's all I need. It looks at the next token and decides what to do based on its type. If it's an identifier, call the label-handling function then go back for anything else that might be on the line. If it's a directive, call the directive-handling function. The instruction handler uses the following tokens (and some hints from what this instruction expects) to build operands.

A more flexible but slightly slower approach might be to allow handlers to register themselves for particular token types. The main loop looks at the next token and iterates over all of the handlers for it. Each handler can have a go at parsing, either succeeding, advancing the token pointer and doing whatever work is needed, or failing, leaving the token pointer where it was for the next handler. A handler can itself be a loop calling its own set of handlers - you'd need that for operands.

I don't know how you'd generate decent error messages with that approach though. A handler can't emit an error when it fails, because a future handler might succeed. But the caller doesn't have information on what went wrong. If multiple handlers all produce their own candidate errors, which do you choose? Error messages are hard.


Top
 Profile  
Reply with quote  
PostPosted: Fri Jun 19, 2020 3:29 pm 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1927
Location: Sacramento, CA, USA
John West wrote:
If you can measure the time it takes to assemble 1000 lines, you're doing something very wrong. Regular expressions are not the right tool - I wouldn't continue in that direction.

The difference between zero seconds and five seconds is five seconds ... just sayin' ... :)
Quote:
A more flexible but slightly slower approach might be to allow handlers to register themselves for particular token types. The main loop looks at the next token and iterates over all of the handlers for it. Each handler can have a go at parsing, either succeeding, advancing the token pointer and doing whatever work is needed, or failing, leaving the token pointer where it was for the next handler. A handler can itself be a loop calling its own set of handlers - you'd need that for operands.

I don't know how you'd generate decent error messages with that approach though. A handler can't emit an error when it fails, because a future handler might succeed. But the caller doesn't have information on what went wrong. If multiple handlers all produce their own candidate errors, which do you choose? Error messages are hard.

That sounds like an interesting path. Maybe each handler could be allocated its own success/failure flag, then the parent loop could prioritize the successes/failures and act accordingly after everyone in the pool of candidates has had a crack at it? IANASE (I am not a software engineer), so my alternative might be to treat all errors as comments. :P

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


Top
 Profile  
Reply with quote  
PostPosted: Fri Jun 19, 2020 3:40 pm 
Offline
User avatar

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1399
Location: Scotland
JustClaire wrote:
I dont really plan for it to be a "commercial development tool" but it is kinda a bit discouraging, seeing that other assemblers perform 100 times better. :cry:


Faster isn't always a measure of Better.

If it works for you, produces the right results, but takes 2 seconds when something else takes under a second then (initially) so what? It's yours, does what you want it to do, so be happy and use it!

The last assembler I wrote was in BBC Basic for my Mk14 (INS8060) - I didn't really care how long it took, but I'll soon be starting a 6502/65816 assembler myself from scratch. I currently use ca65 from the cc65 suite on my Linux desktop but I'm looking to make my Ruby816 SBC as self-hosting as possible and as it currently runs both BBC Basic and BCPL, then I could write it in either, but will be writing it in BCPL - mostly because I've just completed a working screen editor (in BCPL) and I can compile directly on the system, so got to do the "eat your own dog food" thing... (which, if nothing else, will show up any bugs in my editor!)

Good luck with it.

-Gordon

_________________
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/


Top
 Profile  
Reply with quote  
PostPosted: Fri Jun 19, 2020 7:39 pm 
Offline
User avatar

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8428
Location: Southern California
I have a friend who managed software projects for the Navy, with dozens of programmers under him. He said that a complete re-compile of a big project may take 24 hours or longer. Fortunately it wasn't necessary to do a complete re-compile very often; but clearly that would still put a damper on development speed. If you're only talking about a few seconds, don't worry about it! Go ahead and make it do everything you wanted. The run speed of the final software product is much more important than how fast the assembler runs. So is bug-free code, which it sounds like you're more likely to get if the assembler supports the things on your wish list.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
PostPosted: Fri Jun 19, 2020 10:52 pm 
Offline

Joined: Thu Feb 27, 2020 9:15 am
Posts: 18
BigEd wrote:
You should do it if you want to do it. Is it organised in such a way that you could swap out one mechanism for another? You can put up with the slow mechanism, and fill our the rest of the project, until you have some idea about how to replace it with a faster one. (It's usually true that a better algorithm is a much preferable way forward than an optimised implementation of a slow algorithm.)

Yes, after all it is always a good practice to make your code interchangeable.

John West wrote:
A more flexible but slightly slower approach might be to allow handlers to register themselves for particular token types. The main loop looks at the next token and iterates over all of the handlers for it. Each handler can have a go at parsing, either succeeding, advancing the token pointer and doing whatever work is needed, or failing, leaving the token pointer where it was for the next handler. A handler can itself be a loop calling its own set of handlers - you'd need that for operands.

Thank you, that is a good idea, i will try to work from it

Quote:
I don't know how you'd generate decent error messages with that approach though. A handler can't emit an error when it fails, because a future handler might succeed. But the caller doesn't have information on what went wrong. If multiple handlers all produce their own candidate errors, which do you choose? Error messages are hard.

For exception handling, i could loop through all template handlers and let them a go at parsing, where each handler would return a boolean if it has suceeded parsing or not.
The handlers themselves would be responsible for displaying different error/warning messages in case the match was partial(for example the instruction mnemonic matched) and the error/warning handler-specific(ex. 16bit argument instead of 8bit, etc.)
The parsing loop itself can also keep track of current line, in order to display messages.
At the end of the loop, if none of the parsers have succeeded, an error message can be displayed stating that it is unable to parse a statement at whatever line the line pointer is set to


Top
 Profile  
Reply with quote  
PostPosted: Sat Jun 20, 2020 6:25 am 
Offline
User avatar

Joined: Thu May 28, 2009 9:46 pm
Posts: 8154
Location: Midwestern USA
GARTHWILSON wrote:
I have a friend who managed software projects for the Navy, with dozens of programmers under him. He said that a complete re-compile of a big project may take 24 hours or longer.

Back when I was doing professional development on the Commodore 128D and Lt. Kernal, my truck leasing package took two full days to assemble about 100,000 lines of code. Needless to say, my development system was powered by a UPS. :D I would have killed in those days for a development system that could have assembled that code in 24 hours. Fortunately, it was modularized, so full assembly was seldom required. Even so, assembling some parts could take an hour or more. A two megahertz machine can only go so fast...

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


Top
 Profile  
Reply with quote  
PostPosted: Sat Jun 20, 2020 6:32 am 
Offline
User avatar

Joined: Thu May 28, 2009 9:46 pm
Posts: 8154
Location: Midwestern USA
JustClaire wrote:
This means that for example given a ~1000 lines of code, it will take roughly 2-5 seconds to process, whereas other compilers like VASM take under half a second.

Is there a point to try and optimise my code even more amd keep working on it in the current iteration or should i rather rethink the whole system and ditch the regex variant matching(or at least make it as fixed regex lookup table), making it more "dumber" and less flexible but getting more speed & memory efficiency in exchange?

Think of it like building a race car. First you get it to run. Then you get it to run fast.

So, what you should be doing for now is making sure your assembler emits proper code and that all desired features are working. Once that is done, you go back through your code, find the obvious choke points and optimize them. Then you go back through your code again looking for non-obvious inefficiencies and fix them. Finally, you objectively measure performance and decide if it's sufficient for your needs. If it is, you are finished with development—until you decided to add some new features. Otherwise, you again go through your code...

BTW, I suspect going the regex route is consuming a significant part of the total processing time. As Andrew noted, parsing and assembling a line of code is mostly a task of tokenizing and LL(n) processing. Using MOS Technology syntax, you have three significant fields to process: (optional) label, mnemonic and (optional) operand. The 6502 assembly language is very regular in structure (e.g., mnemonics are always three characters) and addressing modes are easily parsed. None of this is particularly difficult to do in an efficient manner, even if your assembler is being written in a high- level language.

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


Top
 Profile  
Reply with quote  
PostPosted: Sat Jun 20, 2020 1:45 pm 
Offline

Joined: Thu Mar 12, 2020 10:04 pm
Posts: 690
Location: North Tejas
BigDumbDinosaur wrote:
GARTHWILSON wrote:
I have a friend who managed software projects for the Navy, with dozens of programmers under him. He said that a complete re-compile of a big project may take 24 hours or longer.

Back when I was doing professional development on the Commodore 128D and Lt. Kernal, my truck leasing package took two full days to assemble about 100,000 lines of code. Needless to say, my development system was powered by a UPS. :D I would have killed in those days for a development system that could have assembled that code in 24 hours. Fortunately, it was modularized, so full assembly was seldom required. Even so, assembling some parts could take an hour or more. A two megahertz machine can only go so fast...



You needed one of these viewtopic.php?f=3&t=6178 ...


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 11 posts ] 

All times are UTC


Who is online

Users browsing this forum: No registered users and 28 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
cron