Asked  7 Months ago    Answers:  5   Viewed   36 times

I'm currently in the process of building a PHP Parser written in PHP, as no existing parser came up in my previous question. The parser itself works fairly well.

Now obviously a parser by itself does little good (apart from static analysis). I would like to apply transformations to the AST and then compile it back to source code. Applying the transformations isn't much of a problem, a normal Visitor pattern should do.

What my problem currently is, is how to compile the AST back to source. There are basically two possibilities I see:

  1. Compile the code using some predefined scheme
  2. Keep the formatting of the original code and apply 1. only on Nodes that were changed.

For now I would like to concentrate on 1. as 2. seems pretty hard to accomplish (but if you got tips concerning that, I would like to hear them).

But I'm not really sure which design pattern can be used to compile the code. The easiest way I see to implement this, is to add a ->compile method to all Nodes. The drawback I see here, is that it would be pretty hard to change the formatting of the generated output. One would need to change the Nodes itself in order to do that. Thus I'm looking for a different solution.

I have heard that the Visitor pattern can be used for this, too, but I can't really imagine how this is supposed to work. As I understand the visitor pattern you have some NodeTraverser that iterates recursively over all Nodes and calls a ->visit method of a Visitor. This sounds pretty promising for node manipulation, where the Visitor->visit method could simply change the Node it was passed, but I don't know how it can be used for compilation. An obvious idea would be to iterate the node tree from leaves to root and replace the visited nodes with source code. But this somehow doesn't seem a very clean solution?

 Answers

40

The problem of converting an AST back into source code is generally called "prettyprinting". There are two subtle variations: regenerating the text matching the original as much as possible (I call this "fidelity printing"), and (nice) prettyprinting, which generates nicely formatted text. And how you print matters depending on whether coders will be working on the regenerated code (they often want fidelity printing) or your only intention is to compile it (at which point any legal prettyprinting is fine).

To do prettyprinting well requires usually more information than a classic parser collects, aggravated by the fact that most parser generators don't support this extra-information collection. I call parsers that collect enough information to do this well "re-engineering parsers". More details below.

The fundamental way prettyprinting is accomplished is by walking the AST ("Visitor pattern" as you put it), and generating text based on the AST node content. The basic trick is: call children nodes left-to-right (assuming that's the order of the original text) to generate the text they represent, interspersing additional text as appropriate for this AST node type. To prettyprint a block of statements you might have the following psuedocode:

 PrettyPrintBlock:
     Print("{"}; PrintNewline();
     Call PrettyPrint(Node.children[1]); // prints out statements in block
     Print("}"); PrintNewline();
     return;


 PrettyPrintStatements:
     do i=1,number_of_children
         Call PrettyPrint(Node.children[i]); Print(";"); PrintNewline(); // print one statement
     endo
     return;

Note that this spits out text on the fly as you visit the tree.

There's a number of details you need to manage:

  • For AST nodes representing literals, you have to regenerate the literal value. This is harder than it looks if you want the answer to be accurate. Printing floating point numbers without losing any precision is a lot harder than it looks (scientists hate it when you damage the value of Pi). For string literals, you have to regenerate the quotes and the string literal content; you have to be careful to regenerate escape sequences for characters that have to be escaped. PHP doubly-quoted string literals may be a bit more difficult, as they are not represented by single tokens in the AST. (Our PHP Front End (a reengineering parser/prettyprinter represents them essentially as an expression that concatenates the string fragments, enabling transformations to be applied inside the string "literal").

  • Spacing: some languages require whitespace in critical places. The tokens ABC17 42 better not be printed as ABC1742, but it is ok for the tokens ( ABC17 ) to be printed as (ABC17). One way to solve this problem is to put a space wherever it is legal, but people won't like the result: too much whitespace. Not an issue if you are only compiling the result.

  • Newlines: languages that allow arbitrary whitespace can technically be regenerated as a single line of text. People hate this, even if you are going to compile the result; sometimes you have to look at the generated code and this makes it impossible. So you need a way to introduce newlines for AST nodes representing major language elements (statements, blocks, methods, classes, etc.). This isn't usually hard; when visiting a node representing such a construct, print out the construct and append a newline.

  • You will discover, if you want users to accept your result, that you will have to preserve some properties of the source text that you wouldn't normally think to store For literals, you may have to regenerate the radix of the literal; coders having entered a number as a hex literal are not happy when you regenerate the decimal equivalent even though it means exactly the same thing. Likewise strings have to have the "original" quotes; most languages allow either " or ' as string quote characters and people want what they used originally. For PHP, which quote you use matters, and determines which characters in the string literal has to be escaped. Some languages allow upper or lower case keywords (or even abbreviations), and upper and lower case variable names meaning the same variable; again the original authors typically want their original casing back. PHP has funny characters in different type of identifiers (e.g., "$") but you'll discover that it isn't always there (see $ variables in literal strings). Often people want their original layout formatting; to do this you have to store at column-number information for concrete tokens, and have prettyprinting rules about when to use that column-number data to position prettyprinted text where in the same column when possible, and what to do if the so-far-prettyprinted line is filled past that column.

  • Comments: Most standard parsers (including the one you implemented using the Zend parser, I'm pretty sure) throw comments away completely. Again, people hate this, and will reject a prettyprinted answer in which comments are lost. This is the principal reason that some prettyprinters attempt to regenerate code by using the original text (the other is to copy the original code layout for fidelity printing, if you didn't capture column-number information). IMHO, the right trick is to capture the comments in the AST, so that AST transformations can inspect/generate comments too, but everybody makes his own design choice.

All of this "extra" information is collected by a good reenginering parser. Conventional parsers usually don't collect any of it, which makes printing acceptable ASTs difficult.

A more principled approach distinguishes prettyprinting whose purpose is nice formatting, from fidelity printing whose purpose is to regenerate the text to match the original source to a maximal extent. It should be clear that at the level of the terminals, you pretty much want fidelity printing. Depending on your purpose, you can pretty print with nice formatting, or fidelity printing. A strategy we use is to default to fidelity printing when the AST hasn't been changed, and prettyprinting where it has (because often the change machinery doesn't have any information about column numbers or number radixes, etc.). The transformations stamp the AST nodes that are newly generated as "no fidelity data present".

An organized approach to prettyprinting nicely is to understand that virtually all text-based programming language are rendered nicely in terms of rectangular blocks of text. (Knuth's TeX document generator has this idea, too). If you have some set of text boxes representing pieces of the regenerated code (e.g., primitive boxes generated directly for the terminal tokens), you can then imagine operators for composing those boxes: Horizontal composition (stack one box to the right of another), Vertical (stack boxes on top of each other; this in effect replaces printing newlines), Indent (Horizontal composition with a box of blanks), etc. Then you can construct your prettyprinter by building and composing text boxes:

 PrettyPrintBlock:
     Box1=PrimitiveBox("{"); Box2=PrimitiveBox("}");
     ChildBox=PrettyPrint(Node.children[1]); // gets box for statements in block
     ResultBox=VerticalBox(Box1,Indent(3,ChildBox),Box2);
     return ResultBox;

PrettyPrintStatements:
     ResultBox=EmptyBox();
     do i=1,number_of_children
         ResultBox=VerticalBox(ResultBox,HorizontalBox(PrettyPrint(Node.children[i]); PrimitiveBox(";")
     endo
     return;

The real value in this is any node can compose the text boxes produced by its children in arbitrary order with arbitrary intervening text. You can rearrange huge blocks of text this way (imagine VBox'ing the methods of class in method-name order). No text is spit out as encountered; only when the root is reached, or some AST node where it is known that all the children boxes have been generated correctly.

Our DMS Software Reengineering Toolkit uses this approach to prettyprint all the languages it can parse (including PHP, Java, C#, etc.). Instead of attaching the box computations to AST nodes via visitors, we attach the box computations in a domain-specific text-box notation

  • H(...) for Horizontal boxes
  • V(....) for vertical boxes
  • I(...) for indented boxes)

directly to the grammar rules, allowing us to succinctly express the grammar (parser) and the prettyprinter ("anti-parser") in one place. The prettyprinter box rules are compiled automatically by DMS into a visitor. The prettyprinter machinery has to be smart enough to understand how comments play into this, and that's frankly a bit arcane but you only have to do it once. An DMS example:

block = '{' statements '}' ; -- grammar rule to recognize block of statements
<<PrettyPrinter>>: { V('{',I(statements),'}'); };

You can see a bigger example of how this is done for Wirth's Oberon programming language PrettyPrinter showing how grammar rules and prettyprinting rules are combined. The PHP Front End looks like this but its a lot bigger, obviously.

A more complex way to do prettyprinting is to build a syntax-directed translator (means, walk the tree and build text or other data structures in tree-visted order) to produce text-boxes in a special text-box AST. The text-box AST is then prettyprinted by another tree walk, but the actions for it are basically trivial: print the text boxes. See this technical paper: Pretty-printing for software reengineering

An additional point: you can of course go build all this machinery yourself. But the same reason that you choose to use a parser generator (its a lot of work to make one, and that work doesn't contribute to your goal in an interesting way) is the same reason you want to choose an off-the-shelf prettyprinter generator. There are lots of parser generators around. Not many prettyprinter generators. [DMS is one of the few that has both built in.]

Wednesday, March 31, 2021
 
felipsmartins
answered 7 Months ago
17

Here it is.

First, you're gonna buy the ANTLR4 book ;-)

Second, you'll download antlr4 jar and the java grammar (http://pragprog.com/book/tpantlr2/the-definitive-antlr-4-reference)

Then, you can change the grammar a little bit, adding these to the header

    (...)
grammar Java;

options 
{
    language = Java;
}

// starting point for parsing a java file
compilationUnit
    (...)

I'll change a little thing in the grammar just to illustrate something.

/*
methodDeclaration
    :   (type|'void') Identifier formalParameters ('[' ']')*
        ('throws' qualifiedNameList)?
        (   methodBody
        |   ';'
        )
    ;
*/
methodDeclaration
    :   (type|'void') myMethodName formalParameters ('[' ']')*
        ('throws' qualifiedNameList)?
        (   methodBody
        |   ';'
        )
    ;

myMethodName
    :   Identifier
    ;

You see, the original grammar does not let you identify the method identifier from any other identifier, so I've commented the original block and added a new one just to show you how to get what you want.

You'll have to do the same for other elements you want to retrieve, like the comments, that are currently being just skipped. That's for you :-)

Now, create a class like this to generate all the stubs

package mypackage;

public class Gen {

    public static void main(String[] args) {
        String[] arg0 = { "-visitor", "/home/leoks/EclipseIndigo/workspace2/SO/src/mypackage/Java.g4", "-package", "mypackage" };
        org.antlr.v4.Tool.main(arg0);
    }

}

Run Gen, and you'll get some java code created for you in mypackage.

Now create a Visitor. Actually, the visitor will parse itself in this example

package mypackage;

import java.io.FileInputStream;
import java.io.IOException;

import mypackage.JavaParser.MyMethodNameContext;

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

/**
 * @author Leonardo Kenji Feb 4, 2014
 */
public class MyVisitor extends JavaBaseVisitor<Void> {

    /**
     * Main Method
     * 
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        ANTLRInputStream input = new ANTLRInputStream(new FileInputStream("/home/leoks/EclipseIndigo/workspace2/SO/src/mypackage/MyVisitor.java")); // we'll
                                                                                                                                                    // parse
                                                                                                                                                    // this
                                                                                                                                                    // file
        JavaLexer lexer = new JavaLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        JavaParser parser = new JavaParser(tokens);
        ParseTree tree = parser.compilationUnit(); // see the grammar ->
                                                    // starting point for
                                                    // parsing a java file



        MyVisitor visitor = new MyVisitor(); // extends JavaBaseVisitor<Void>
                                                // and overrides the methods
                                                // you're interested
        visitor.visit(tree);
    }

    /**
     * some attribute comment
     */
    private String  someAttribute;

    @Override
    public Void visitMyMethodName(MyMethodNameContext ctx) {
        System.out.println("Method name:" + ctx.getText());
        return super.visitMyMethodName(ctx);
    }

}

and that's it.

You'll get something like

Method name:main
Method name:visitMyMethodName

ps. one more thing. While I was writing this code in eclipse, I've got a strange exception. This is caused by Java 7 and can be fixed just adding these parameters to your compiler (thanks to this link http://java.dzone.com/articles/javalangverifyerror-expecting)

enter image description here

Sunday, August 8, 2021
 
JackTheKnife
answered 3 Months ago
68

Managed to solve the problem by using the getsourceposition() and some string manipulation (no regex was needed)

Wednesday, August 11, 2021
 
aries12
answered 3 Months ago
11

Disclaimer: I only have experience with X86 machine code. Other instruction sets may have, for example, different addressing capabilities, so parts of my advice might not apply. I'm sorry that I don't have time to research instruction sets at the moment.


Well firstly, most compilers don't generate assembly as text, because it's kinda inefficient to serialise the code into assembly only to have it parsed straight away by the assembler, as you have probably realised. It is reasonable to have separate compilation and assembly phases, but not essential.

In the compilation phase, the two strategies I would consider are:

  • (a) generate the assembly as a tree / array of instruction objects which can symbolically refer to one another. In the assembly phase these need to be serialised into bytecode/machinecode. I'd recommend this method, even if it makes your compiler's architecture a little more complex.

  • (b) generate the assembly as machinecode/bytecode into a buffer, with some helper functions; in this case you don't really have a separate assembly phase. I've personally tried this method, and within the scope of a single function it's not bad, but may cause some extra difficulties by not knowing how large a function is going to be before it's assembled.

I would guess that (a) is the approach used by optimising compilers like GCC, while (b) is the approach used by high-speed compilers like TCC.


Let's again consider the if example by examining the code that an existing compiler generates for a simple if/else branch:

source -> disassembly

Note the overlapping jumps in the disassembly - one that skips the 'taken' block and one that skips the 'not-taken' block.

These are relative jumps, so in order to assemble them we need to know how many bytes of instructions are between the jump instruction and the destination.

Here's an example of what the compilation function might look like using strategy (a):

Instruction[] compile_if(IfNode n) {
    Instruction[] code;

    code ~= compile_condition(n.condition);

    Instruction skip_taken = new JumpInstruction(`jz`);
    code ~= skip_taken;

    code ~= compile_block(n.taken_block);

    Instruction skip_nottaken = new JumpInstruction(`jmp`);
    code ~= skip_nottaken;

    Instruction[] nottaken_code = compile_block(n.nottaken_block);
    skip_taken.destination = nottaken_code[0];
    code ~= nottaken_code;

    Instruction end = new NopInstruction();
    skip_nottaken.destination = end;
    code ~= end;

    return code;
};

This should be pretty self-explanatory.

Note how instructions refer to one another symbolically (skip_taken.destination = nottaken_code[0]), rather than by byte-offsets like in serialised machinecode. We leave those offset calculations for the assembler.

Also note how we set the destinations of the JumpInstructions only once they become available.

The NopInstruction at the end is just to give the skip_nottaken jump something to refer to.

Now, how do we actually assemble these jumps into real machinecode/bytecode? here's one possibility (a very basic example):

byte[2] assemble_jz(Instruction[] code, int idx) {
    // assemble the jz instruction at code[idx]

    JumpInstruction jump = code[idx];
    ++idx;

    byte jump_offset = 0;
    while (code[idx] != jump.destination) {
        jump_offset += size_of_instruction(code[idx]);
        ++idx;
    };

    byte[2] machinecode = [
        0x74, // jz short
        jump_offset
    ];
    return machinecode;
};

Because the assembler has access to all the instruction objects, it can calculate the actual offsets for relative jumps by scanning ahead until it finds the destination instruction.


I hope this brief introduction helps you get started with designing your own compiler backend. Obviously I'm not suggesting that you write your compiler exactly like my example, but it should give you some ideas of how to approach the general problem of compiling and assembling non-linear instruction blocks.

You might also want to take a look at some existing assembler APIs such as https://github.com/asmjit/asmjit .

Good luck.

Monday, September 13, 2021
 
Sergey Ryabov
answered 1 Month ago
70

Check out the aptly named Sarcasm project for a reference implementation of a grammar, parser, and AST built on Irony. I found this blog entry by the author to be helpful in building the AST.

The following is a general purpose guide to getting the AST up and running.

  1. Define your grammar (example)
  2. Create an abstract base class (MyBaseNode) deriving from AstNode (example). Copy/Paste the methods from the example
  3. For each terminal and non-terminal create a new class derived from MyBaseNode and

    1. Override Accept method (example):

    public override void Accept(IMyNodeVisitor visitor) { visitor.Visit(this); }

    1. Override Init (mostly on terminals) or InitChildren (non-terminals) as appropriate. This is where the AST magic happens.
  4. Add an interface IMyNodeVisitor and add a Visit method for each class defined in the previous step (example):

    void Visit(MyDerivedNode1 node);

  5. Set the ASTNodeType for each of your terminals and non-terminals in your grammar from step 1.

    1. For terminals - (example)

      MyTerminal1.AstConfig.NodeType = typeof(MyDerivedNode1);

    2. For non-terminals - (example)

      var MyNonTerminal2 = new NonTerminal("MyNonTerminal2", typeof(MyDerivedNode2));

  6. In the grammar enable AST creation: (example)

    LanguageFlags = LanguageFlags.CreateAst;

Wednesday, September 29, 2021
 
user123
answered 3 Weeks ago
Only authorized users can answer the question. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :