scotws wrote:
Just out of idle curiosity, is there any text that explains how to write an assembler? Something like the Dragon Book for compilers? Not that I'm planning another project, it would just be interesting to learn how they work internally.
Simple one are reasonably simple, but not trivial.
As with anything else, you have the Lexing and Parsing phases. Lexing breaks the individual assembly source lines in to tokens for use by the parser. Parsing takes those tokens, ensures that they actually make sense, and then acts upon them in some way. The most difficult component for the parser, at the statement level, is the expression parser. Things like: 1+2*(3+4). Folks get hung up on expressions. The rest of it is pretty simple, simple pattern matching really.
Mine is a simple Recursive Descent parser.
Assemblers are simpler that compilers because the semantics, especially with a simple instruction set like the 6502, are very straight forward. You don't have to worry as much about scopes and call stacks, etc. You can take advantage of the fact that your statements are pretty much all contained on a single line, that simplifies things as well.
For the basic case, you need to look up the mnemonic, determine the addressing mode and arguments, and then append the correct instruction sequence to your output. That's what a Forth assembler does.
Forward references complicate things a bit, and that where the "second pass" kicks in.
For example:
Code:
LABEL1: LDA $1234
BEQ LABEL2
JMP LABEL1
LABEL2: ...
When you encounter the JMP statement, it's destination is already defined, so the JMP is readily assembled.
However, for the earlier BEQ, you don't know at that point where, or even if, LABEL2 is defined, so you simply don't have the information necessary to properly assemble the instruction. This statement may even be in error, if LABEL2 never show up or is out of range.
So, the second pass can be done different ways. It could be as simple as literally reassembling the file, but now you have a better idea from the first run what it all looks like. In my assembler, I simply capture instructions that I don't quite know enough yet and at the end of the first phase, run through those few and fill in the blanks. With a modern machine, loading the entire assembly file in to memory is completely doable, this can make multiple passes quick.
My assembler effectively does that. It loaded each instruction into memory, if the instruction is "known" (like JMP above), it captures it, including it's destination address, and keeps moving. If it's not, it flags the instruction for later. At the end, it simply dumps the assembled instructions out to a buffer, that's finally written to the disk. My stuff is simple enough that I don't get "phase errors", as other have mentioned.
For example, if the instruction knows it's talking to ZP, it compiles ZP. If it doesn't know, it assumes it won't be ZP.
If you do something like:
Code:
.ORG $0200
LDA LABEL1
.ORG $0000
LABEL1: .BYTE $01
You'll get AC 00 00 assembled instead of A5 00 assembled. Since all my ZP is defined first, this isn't a problem.
Then there's macros and other stuff, but in the end, the fundamentals are the same: converting an instruction stream in to the binary (or not, mine produces HEX files).
But, that's the basics of it.