Lab Manual for Compiler Design
Prepared by: Dr. Farheen Mohammed

Course Code        Course Title              Core/Elective
                   COMPILER DESIGN LAB       Core

Prerequisite   Contact Hours per Week (L  T  D  P)   CIE   SEE   Credits
-                                      -  -  -  2     25    50      2

Course Objectives
1. To learn the usage of the tools LEX and YACC.
2. To develop a code generator.
3. To implement different code optimization schemes.

Course Outcomes
1. Generate a scanner and a parser from a formal specification.
2. Generate top-down and bottom-up parsing tables using predictive parsing, SLR and LR parsing techniques.
3. Apply the knowledge of YACC to syntax-directed translations for generating intermediate code (three-address code).
4. Build a code generator using different intermediate codes and optimize the target code.

List of Experiments to be performed:
1. Sample programs using LEX.
2. Scanner generation using LEX.
3. Elimination of left recursion in a grammar.
4. Left factoring a grammar.
5. Top-down parsers.
6. Bottom-up parsers.
7. Parser generation using YACC.
8. Intermediate code generation.
9. Target code generation.
10. Code optimization.

1.1 LEX program to count the number of words, characters, blank spaces and lines

Procedure:
1. Create a LEX specification file to recognize words, characters, blank spaces and lines.
2. Compile it with the LEX compiler to obtain a C file.
3. Compile the C file with a C compiler to obtain an executable file.
4. Run the executable, providing the necessary input, to get the desired output.

Program:

%{
int c=0,w=0,l=0,s=0;
%}
%%
[\n]            l++;
[' '\n\t]       s++;
[^' '\t\n]+     { w++; c+=yyleng; }
%%
int main(int argc, char *argv[])
{
    if(argc==2)
    {
        yyin=fopen(argv[1],"r");
        yylex();
        printf("\nNUMBER OF SPACES = %d",s);
        printf("\nCHARACTERS = %d",c);
        printf("\nLINES = %d",l);
        printf("\nWORDS = %d\n",w);
    }
    else
        printf("ERROR");
    return 0;
}

Input File (in.txt):
Hello how are you

Output:
$ lex filename.l
$ cc lex.yy.c -ll
$ ./a.out in.txt

1.2 LEX program to identify the real precision of a given number

Procedure:
1. Create a LEX specification file to recognize whether a given number is an integer or a floating-point (real) number.
2. Compile it with the LEX compiler to obtain a C file.
3. Compile the C file with a C compiler to obtain an executable file.
4. Run the executable, providing the necessary input, to get the desired output.

Program:

%{
/* Program to identify integer/float precision */
%}
integer  ([0-9]+)
float    ([0-9]+\.[0-9]+)|([+-]?[0-9]+\.[0-9]*[eE][+-][0-9]*)
%%
{integer}   printf("\n%s is an integer.\n",yytext);
{float}     printf("\n%s is a floating number.\n",yytext);
%%
int main()
{
    yylex();
    return 0;
}
int yywrap()
{
    return 1;
}

Output:
$ lex filename.l
$ cc lex.yy.c -ll
$ ./a.out          (pass the number as input)
2
2 is an integer
2.3
2.3 is a floating number
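A small variation on the above specification (an illustrative sketch, not one of the prescribed experiments) accepts an optional leading sign and silently discards whitespace and any other unmatched input instead of echoing it. The pattern names are chosen here for illustration only, and the build steps are assumed to be the same lex/cc commands shown above.

%{
/* Sketch: signed integer/float recognition.
   Names and messages are illustrative, not from the manual. */
%}
sign     [+-]?
integer  {sign}[0-9]+
float    {sign}[0-9]+\.[0-9]+([eE][+-]?[0-9]+)?
%%
{integer}   printf("%s is an integer\n",yytext);
{float}     printf("%s is a floating number\n",yytext);
[ \t\n]     ;    /* skip whitespace               */
.           ;    /* ignore any other character    */
%%
int yywrap() { return 1; }
int main() { yylex(); return 0; }

Because Lex chooses the longest possible match, an input such as 2.3 is reported as a floating number rather than as the integer 2 followed by extra characters.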
2.1 Scanner Generation using LEX

Concepts:

Lex helps write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a parsing routine.

Lex source is a table of regular expressions and corresponding program fragments. The table is translated into a program which reads an input stream, copying it to an output stream and partitioning the input into strings which match the given expressions. As each such string is recognized, the corresponding program fragment is executed. The recognition of the expressions is performed by a deterministic finite automaton generated by Lex. The program fragments written by the user are executed in the order in which the corresponding regular expressions occur in the input stream.

The lexical analysis programs written with Lex accept ambiguous specifications and choose the longest match possible at each input point. If necessary, substantial lookahead is performed on the input, but the input stream will be backed up to the end of the current partition, so that the user has general freedom to manipulate it. Lex generates its analyzers in C.

A token is a string of characters, categorized according to the rules as a symbol (e.g., IDENTIFIER, NUMBER, COMMA). The process of forming tokens from an input stream of characters is called tokenization, and the lexer categorizes each token according to a symbol type. A token can look like anything that is useful for processing an input text stream or text file. A lexical analyzer generally does nothing with combinations of tokens, a task left for a parser. For example, a typical lexical analyzer recognizes parentheses as tokens, but does nothing to ensure that each "(" is matched with a ")".

Consider this expression in the C programming language:

sum=3+2;

It is tokenized as in the following table:

Lexeme    Token type
sum       Identifier
=         Assignment operator
3         Integer literal
+         Addition operator
2         Integer literal
;         End of statement

Tokens are frequently defined by regular expressions, which are understood by a lexical analyzer generator such as lex. The lexical analyzer (either generated automatically by a tool like lex, or hand-crafted) reads in a stream of characters, identifies the lexemes in the stream, and categorizes them into tokens. This is called "tokenizing". If the lexer finds an invalid token, it will report an error. Tokenizing is followed by parsing; from there, the interpreted data may be loaded into data structures for general use, interpretation, or compiling.

The structure of a Lex file is intentionally similar to that of a yacc file: files are divided into three sections, separated by lines that contain only two percent signs, as follows:

Definition section
%%
Rules section
%%
C code section

The definition section defines macros and imports header files written in C. It is also possible to write any C code here, which will be copied verbatim into the generated source file.

The rules section associates regular expression patterns with C statements. When the lexer sees text in the input matching a given pattern, it executes the associated C code.

The C code section contains C statements and functions that are copied verbatim to the generated source file. These statements presumably contain code called by the rules in the rules section. In large programs it is more convenient to place this code in a separate file linked in at compile time.
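As a concrete illustration of this three-section layout (a minimal sketch, not one of the prescribed experiments), the following specification counts the lines of its input; the variable name lineno is chosen here for illustration only.

%{
/* Definition section: C declarations copied verbatim into lex.yy.c.
   'lineno' is an illustrative name, not taken from the manual. */
#include <stdio.h>
int lineno = 0;
%}
%%
\n      lineno++;           /* Rules section: a pattern and its C action */
.       ;                   /* ignore every other character              */
%%
/* C code section: copied verbatim; it drives the generated scanner. */
int yywrap() { return 1; }
int main() { yylex(); printf("%d lines\n", lineno); return 0; }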
The following LEX program recognizes the tokens of simple C source statements and returns their token values.

/* LEX program to recognize tokens */
%{
#define LT 256
#define LE 257
#define EQ 258
#define NE 259
#define GT 260
#define GE 261
#define RELOP 262
#define ID 263
#define NUM 264
#define IF 265
#define THEN 266
#define ELSE 267
int attribute;
%}
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
num     {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}    { }
if      { return(IF); }
then    { return(THEN); }
else    { return(ELSE); }
{id}    { return(ID); }
{num}   { return(NUM); }
"<"     { attribute=LT; return(RELOP); }
"<="    { attribute=LE; return(RELOP); }
"<>"    { attribute=NE; return(RELOP); }
"="     { attribute=EQ; return(RELOP); }
">"     { attribute=GT; return(RELOP); }
">="    { attribute=GE; return(RELOP); }
%%
int yywrap(){
    return 1;
}
int main()
{
    int token;
    while((token=yylex())!=0){
        printf("<%d",token);
        switch(token){
            case ID: case NUM:
                printf(",%s>\n",yytext);
                break;
            case RELOP:
                printf(",%d>\n",attribute);
                break;
            default:
                printf(">\n");
                break;
        }
    }
    return 0;
}

OUTPUT:
$ lex filename.l
$ cc lex.yy.c -ll
$ ./a.out
if a>b then a else b
<265>
<263,a>
<262,260>
<263,b>
<266>
<263,a>
<267>
<263,b>

2.2 Implementation of the LEX Analyzer Tool

Lex is a program designed to generate scanners, also known as tokenizers, which recognize lexical patterns in text. Lex is an acronym that stands for "lexical analyzer generator". It is intended primarily for Unix-based systems. The code for Lex was originally developed by Eric Schmidt and Mike Lesk.

Lex can perform simple transformations by itself, but its main purpose is to facilitate lexical analysis: the processing of character sequences, such as source code, to produce the symbol sequences called tokens that are used as input to other programs such as parsers. Lex can be used with a parser generator to perform lexical analysis. It is easy, for example, to interface Lex and Yacc, an open-source program that generates code for a parser in the C programming language.

The general format of Lex source is:

{definitions}
%%
{rules}
%%
{user subroutines}

where the definitions and the user subroutines are often omitted. The second %% is optional, but the first is required to mark the beginning of the rules. The absolute minimum Lex program is thus

%%

(no definitions, no rules), which translates into a program that copies the input to the output unchanged.

In the outline of Lex programs shown above, the rules represent the user's control decisions; they are a table in which the left column contains regular expressions and the right column contains actions, program fragments to be executed when the expressions are recognized. Thus an individual rule might appear as

integer    printf("found keyword INT");

to look for the string integer in the input stream and print the message "found keyword INT" whenever it appears. In this example the host procedural language is C and the C library function printf is used to print the string. The end of the expression is indicated by the first blank or tab character. If the action is merely a single C expression, it can just be given on the right side of the line; if it is compound, or takes more than a line, it should be enclosed in braces.
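To make the rule about braces concrete (an illustrative sketch, not taken from the manual's experiments; the patterns and messages are invented for illustration), a single-statement action may sit on the same line as its pattern, while a multi-statement action must be braced:

%{
#include <stdlib.h>   /* for atoi(); sketch only */
%}
%%
integer     printf("found keyword INT\n");   /* single C statement: no braces needed */
[0-9]+      {                                /* compound action: braces required     */
                int v = atoi(yytext);        /* yytext holds the matched lexeme      */
                printf("found number %d\n", v);
            }
%%
int yywrap() { return 1; }
int main() { yylex(); return 0; }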
As a slightly more useful example, suppose it is desired to change a number of words from British to American spelling. Lex rules such as

colour      printf("color");
mechanise   printf("mechanize");
petrol      printf("gas");

would be a start. These rules are not quite enough, since the word petroleum would become gaseum; a way of dealing with this will be described later.

Lex Regular Expressions

A regular expression specifies a set of strings to be matched. It contains text characters (which match the corresponding characters in the strings being compared) and operator characters (which specify repetitions, choices, and other features). The letters of the alphabet and the digits are always text characters; thus the regular expression

integer

matches the string integer wherever it appears, and the expression

a57D

looks for the string a57D.

Operators

The operator characters are

" \ [ ] ^ - ? . * + | ( ) $ / { } % < >

and if they are to be used as text characters, an escape should be used. The quotation mark operator (") indicates that whatever is contained between a pair of quotes is to be taken as text characters. Thus

xyz"++"

matches the string xyz++ when it appears. Note that a part of a string may be quoted. It is harmless but unnecessary to quote an ordinary text character; the expression

"xyz++"

is the same as the one above. Thus, by quoting every non-alphanumeric character being used as a text character, the user can avoid remembering the list of current operator characters above, and is safe should further extensions to Lex lengthen the list.

An operator character may also be turned into a text character by preceding it with \, as in

xyz\+\+

which is another, less readable, equivalent of the above expressions. Another use of the quoting mechanism is to get a blank into an expression; normally, as explained above, blanks or tabs end a rule. Any blank character not contained within [] (see below) must be quoted. Several normal C escapes with \ are recognized: \n is newline, \t is tab, and \b is backspace. To enter \ itself, use \\. Since newline is illegal in an expression, \n must be used; it is not required to escape tab and backspace. Every character except blank, tab, newline and the characters in the list above is always a text character.

Character classes

Classes of characters can be specified using the operator pair []. The construction [abc] matches a single character, which may be a, b, or c. Within square brackets, most operator meanings are ignored. Only three characters are special: these are \, -, and ^. The - character indicates ranges. For example,

[a-z0-9<>_]

indicates the character class containing all the lower-case letters, the digits, the angle brackets, and underline. Ranges may be given in either order. Using - between any pair of characters which are not both upper-case letters, both lower-case letters, or both digits is implementation dependent and will get a warning message. (For example, [0-z] in ASCII is many more characters than it is in EBCDIC.) If it is desired to include the character - in a character class, it should be first or last; thus

[-+0-9]

matches all the digits and the two signs.

In character classes, the ^ operator must appear as the first character after the left bracket; it indicates that the resulting string is to be complemented with respect to the computer character set. Thus [^abc] matches all characters except a, b, or c, including all special or control characters; and

[^a-zA-Z]

is any character which is not a letter.
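To make the character-class notation concrete (an illustrative sketch, not one of the prescribed experiments; the messages are invented for illustration), the following rules use a range, a complemented class, and a class containing blank, tab and newline:

%%
[0-9]+          printf("digits: %s\n",yytext);         /* a class built from a range             */
[^0-9 \t\n]+    printf("non-digit run: %s\n",yytext);  /* a complemented class                   */
[ \t\n]         ;                                      /* a class containing blank, tab, newline */
%%
int yywrap() { return 1; }
int main() { yylex(); return 0; }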
The \ character provides the usual escapes within character class brackets.

Arbitrary character

To match almost any character, the operator character . (dot) is the class of all characters except newline. Escaping into octal is possible although non-portable:

[\40-\176]

matches all printable characters in the ASCII character set, from octal 40 (blank) to octal 176 (tilde).

Optional expressions

The operator ? indicates an optional element of an expression. Thus

ab?c

matches either ac or abc.

Repeated expressions

Repetitions of classes are indicated by the operators * and +:

a*

is any number of consecutive a characters, including zero, while

a+

is one or more instances of a. For example,

[a-z]+

is all strings of lower-case letters, and

[A-Za-z][A-Za-z0-9]*

indicates all alphanumeric strings with a leading alphabetic character. This is a typical expression for recognizing identifiers in computer languages.

Alternation and Grouping

The operator | indicates alternation:

(ab|cd)

matches either ab or cd. Note that parentheses are used for grouping, although they are not necessary on the outside level; ab|cd would have sufficed. Parentheses can be used for more complex expressions:

(ab|cd+)?(ef)*

matches such strings as abefef, efefef, cdef, or cddd, but not abc, abcd, or abcdef.

ECHO   - echoes (copies) the matched text to the output
yytext - holds the matched string

/* Implementation of the LEX Tool */

%{
/* lex program to recognize a C program */
#include <stdlib.h>     /* for exit() */
int COMMENT=0;
%}
identifier [a-zA-Z][a-zA-Z0-9]*
%%
#.*         { printf("\n%s is a PREPROCESSOR DIRECTIVE",yytext); }
int |
float |
char |
double |
while |
for |
do |
if |
break |
continue |
void |
switch |
case |
long |
struct |
const |
typedef |
return |
else |
goto        { printf("\n\t%s is a KEYWORD",yytext); }
"/*"        { COMMENT=1; printf("\n\n\t%s is a COMMENT\n",yytext); }
"*/"        { COMMENT=0; printf("\n\n\t%s is a COMMENT\n",yytext); }
{identifier}\(              { if(!COMMENT) printf("\n\nFUNCTION\n\t%s",yytext); }
\{                          { if(!COMMENT) printf("\nBLOCK BEGINS"); }
\}                          { if(!COMMENT) printf("\nBLOCK ENDS"); }
{identifier}(\[[0-9]*\])?   { if(!COMMENT) printf("\n%s IDENTIFIER",yytext); }
\".*\"                      { if(!COMMENT) printf("\n\t%s is a STRING",yytext); }
[0-9]+                      { if(!COMMENT) printf("\n\t%s is a NUMBER",yytext); }
\)(\;)?                     { if(!COMMENT) printf("\n\t"); ECHO; printf("\n"); }
\(                          ECHO;
=                           { if(!COMMENT) printf("\n\t%s is an ASSIGNMENT OPERATOR",yytext); }
\<= |
\>= |
\< |
== |
\>                          { if(!COMMENT) printf("\n\t%s is a RELATIONAL OPERATOR",yytext); }
%%
int main(int argc,char** argv)
{
    if(argc>1)
    {
        FILE *file;
        file=fopen(argv[1],"r");
        if(!file)
        {
            printf("could not open %s\n",argv[1]);
            exit(0);
        }
        yyin=file;
    }
    yylex();
    printf("\n\n");
    return 0;
}
int yywrap()        /* input exhausted */
{
    return 1;       /* no more input to process */
}

var.c:

/* This is LEX Tool Program */
#include<stdio.h>
main()
{
int a,b;
}

OUTPUT:
$ lex lextool.l
$ cc lex.yy.c -ll
$ ./a.out var.c

/* is a COMMENT
*/ is a COMMENT
#include<stdio.h> is a PREPROCESSOR DIRECTIVE
FUNCTION main( )
BLOCK BEGINS
int is a KEYWORD
a IDENTIFIER, b IDENTIFIER;
BLOCK ENDS
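As a closing illustration of the operators discussed in this section (a sketch only, not one of the prescribed experiments; the patterns and messages are invented for illustration), the following specification combines repetition, optional parts, alternation of quoted operators, and ECHO:

%%
[A-Za-z][A-Za-z0-9]*                    printf("identifier: %s\n",yytext);           /* repetition with *               */
[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?     printf("number: %s\n",yytext);               /* ? marks the optional parts      */
"<="|">="|"=="|"<"|">"                  printf("relational operator: %s\n",yytext);  /* alternation of quoted operators */
[ \t\n]+                                ;                                            /* discard whitespace              */
.                                       ECHO;                                        /* copy anything else unchanged    */
%%
int yywrap() { return 1; }
int main() { yylex(); return 0; }

Because Lex always takes the longest match, the input <= is reported as a single relational operator rather than as < followed by =.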