[CS143-PA1] Lexer

[CS143-PA1] Lexer

Target

write a cool lexer using flex

0. Preparation


1. Basic Keywords

4 Scanner Results

  • Your implementation needs to define Flex rules that match the regular expressions defining each token defined in cool-parse.h and perform the appropriate action for each matched token.

  • For example, if you match on a token BOOL_CONST, your lexer has to record whether its value is true or false;

  • similarly if you match on a TYPEID token, you need to record the name of the type.

  • Note that not every token requires storing additional information; for example, only returning the token type is sufficient for some tokens like keywords.

find all tokens definition in cool-parse.h , return corresponding token directly

  • 1
    2
    
    #define CLASS 258
    ...
    

grammar in flex for case-insensitive

  • (?r-s:pattern)
    apply option r and omit option s while interpreting pattern. Options may be zero or more of the characters i, s, or x.
    i means case-insensitive. -i means case-sensitive.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
 /* =========
  * operators
  * =========
  */
"=>"            { return (DARROW);      }
"<-"            { return (ASSIGN);      }
"<="            { return (LE);          }

 /* ========
  * keywords
  * ========
  * Keywords are case-insensitive except for the values true and false,
  * which must begin with a lower-case letter.
  */


 /* apply i option for case-insensitive */
(?i:class)      { return CLASS;     }
(?i:else)       { return ELSE;      }
(?i:fi)         { return FI;        }
(?i:if)         { return IF;        }
(?i:in)         { return IN;        }
(?i:inherits)   { return INHERITS;  }
(?i:let)        { return LET;       }
(?i:loop)       { return LOOP;      }
(?i:pool)       { return POOL;      }
(?i:then)       { return THEN;      }
(?i:while)      { return WHILE;     }
(?i:case)       { return CASE;      } 
(?i:esac)       { return ESAC;      }
(?i:of)         { return OF;        }
(?i:new)        { return NEW;       }
(?i:isvoid)     { return ISVOID;    }
(?i:not)        { return NOT;       }


5 Implementation Notes

  • Each call on the scanner returns the next token and lexeme from the input.

  • The second component, the semantic value or lexeme, is placed in the global union cool yylval, which is of type YYSTYPE. The type YYSTYPE is also defined in cool-parse.h.

    1
    2
    3
    4
    5
    6
    
    typedef union YYSTYPE
    {
      Boolean boolean;
      Symbol symbol;
      ...
    } YYSTYPE;
    
  • For class identifiers, object identifiers, integers, and strings, the semantic value should be a Symbol stored in the field cool_yylval.symbol. For boolean constants, the semantic value is stored in the field cool_yylval.boolean.

3 String Tables

cool-tour.pdf

  • An important point about the structure of the Cool compiler is that there are actually three distinct string tables: one for string constants (stringtable), one for integer constants (inttable), and one for identifiers (idtable).
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
 /* INT_CONST */
{DIGIT}+ {
    cool_yylval.symbol = inttable.add_string(yytext);
    return INT_CONST;
}

 /* BOOL_CONST */
t(?i:rue) {
  cool_yylval.boolean = 1;
  return BOOL_CONST;
}

f(?i:alse) {
  cool_yylval.boolean = 0;
  return BOOL_CONST;
}

 /* only differences between type and object id is the leading char case */

 /* TYPEID */
 /* Class names begin with an uppercase letter. */
[A-Z][A-Za-z0-9_]* {
    cool_yylval.symbol = idtable.add_string(yytext);
    return TYPEID;
}

 /* OBJECTID */
[a-z][A-Za-z0-9_]* {
  cool_yylval.symbol = idtable.add_string(yytext);
  return OBJECTID;
}


2. Comment

2 Introduction to Flex

  • When writing rules in Flex, it may be necessary to perform different actions depending on previously encountered tokens.

  • For example, when processing a closing comment token, you might be interested in knowing whether an opening comment was previously encountered.

  • One obvious way to track state is to declare global variables in your declaration section, which are set to true when certain tokens of interest are encountered.

  • Flex also provides syntactic sugar for achieving similar functionality by using state declarations such as:

    %Start COMMENT

    which can be set to true by writing BEGIN(COMMENT).

  • To perform an action only if an opening comment was previously encountered, you can predicate your rule on COMMENT using the syntax:

    1
    2
    3
    
    <COMMENT> {
    // the rest of your rule ...
    }
    
  • There is also a special default state called INITIAL which is active unless you explicitly indicate the
    beginning of a new state.

10.3 Comments

cool-manual.pdf

  • There are two forms of comments in Cool. Any characters between two dashes “–” and the next newline (or EOF, if there is no next newline) are treated as comments. Comments may also be written by enclosing text in (∗ . . . ∗).

4.1 Error Handling

  • If a comment remains open when EOF is encountered, report this error with the message “EOF in comment”. Do not tokenize the comment’s contents simply because the terminator is missing.
  • If you see *) outside a comment, report this error as “Unmatched *)“, rather than tokenizing it
    as * and ).
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
...
%}

DIGIT           [0-9]
%Start          COMMENT
%Start          INLINE_COMMENT

%%
 /* =================
  *  Nested comments
  * ================= 
  */

 /* begin a comment or nested comment */
<INITIAL,COMMENT,INLINE_COMMENT>"(*" {
  comment_layer++;
  BEGIN COMMENT;
}

 /* if we met *) in (the outermost) comment */
<COMMENT>"*)" {
  comment_layer--;
  if(comment_layer == 0) {
    BEGIN 0; }
}

 /* any character other than '\n','(','*' is ok */ 
<COMMENT>[^\n(*]* {  }

 /* ( or ) or * that appears only once */
<COMMENT>[()*] {  }

<COMMENT>\n { curr_lineno++; }

 /*
  * Error handling in comment 
  */

"*)" {
    cool_yylval.error_msg = "Unmatched *)";
    return ERROR;
}

<COMMENT><<EOF>> {
    cool_yylval.error_msg = "EOF in comment";
    BEGIN 0;
    return ERROR;
}


 /* ===============
  * inline comments
  * ===============
  */

 /* if seen "--", start inline comment */
<INITIAL>"--" { BEGIN INLINE_COMMENT; }

 /* any character other than '\n' is a nop in inline comments */ 
<INLINE_COMMENT>[^\n]* { }

 /* if seen '\n' in inline comment, the comment ends */
<INLINE_COMMENT>\n {
    curr_lineno++;
    BEGIN 0;
}


3. Strings

4.3 Strings

  • Your scanner should convert escape characters in string constants to their correct values
  • return error for a string containing the literal null character.

4.1 Error Handling

  • String constant too long
  • Unterminated string constant
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
...
%}
...
%Start STRING

%%
 /* =========
  * STR_CONST
  * =========
  *  String constants (C syntax)
  *  Escape sequence \c is accepted for all characters c. Except for 
  *  \n \t \b \f, the result is c.
  *
  */

 /* if seen '\"', start string */
<INITIAL>\" {
    BEGIN STRING;
    yymore();
}

 /* Cannot read '\\' '\"' '\n' */
<STRING>[^\\\"\n]* { yymore(); }

 /* normal escape characters, not \n */
<STRING>\\[^\n] { yymore(); }

 /* seen a '\\' at the end of a line, the string continues */
<STRING>\\\n {
    curr_lineno++;
    yymore();
}

 /* meet EOF in the middle of a string, error */
<STRING><<EOF>> {
    yylval.error_msg = "EOF in string constant";
    BEGIN 0;
    yyrestart(yyin);
    return ERROR;
}

 /* meet a '\n' in the middle of a string without a '\\', error */
<STRING>\n {
    yylval.error_msg = "Unterminated string constant";
    BEGIN 0;
    curr_lineno++;
    return ERROR;
}

 /* string ends, we need to deal with some escape characters */
<STRING>\" {
   char* ptr = yytext; 
   ptr++; /* delete the first '"' char */

   string_buf_ptr = string_buf;
   int string_len = 0;
   
   while( *ptr != '"' && string_len < MAX_STR_CONST ) {
     if( *ptr == '\0' ) {
         cool_yylval.error_msg = "String contains null character";
         BEGIN 0;
         return ERROR;
     }
     else if( *ptr == '\\' ){
         ptr++;
         switch(*ptr){
           case 'b':
             *string_buf_ptr++ = '\b';
             break;
           case 't':
             *string_buf_ptr++ = '\t';
             break;
           case 'f':
             *string_buf_ptr++ = '\f';
             break;
           case 'n':
             *string_buf_ptr++ = '\n';
             break;
           /* null character escaped --> error */
           case '\0':
             cool_yylval.error_msg = "String contains null character";
             BEGIN 0;
             return ERROR;
           default:
             *string_buf_ptr++ = *ptr;
             break;
         }
         ptr++; string_len++;
     }else{
        *string_buf_ptr++ = *ptr++;
        string_len++;
     }
   }
   if( string_len >= MAX_STR_CONST ) {
     cool_yylval.error_msg = "String constant too long"; 
     BEGIN 0;
     return ERROR;
   }

   *string_buf_ptr++ = '\0'; /* end string */
   cool_yylval.symbol = stringtable.add_string(string_buf);
   BEGIN 0;
   return STR_CONST;
}


4. Others and Error

5 Implementation Notes

  • The tokens for single character symbols (e.g., “;” and “,”) are represented just by the integer (ASCII) value of the character itself.

    All of the single character tokens are listed in the grammar for Cool in the Cool manual.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
 /* ========
  * others 
  * ========
  */

 /* White Space */
[ \f\r\t\v]+ { }

 /* New Line */
"\n" { curr_lineno++; }

 /* all allowed single character symbols */
[:;{}().+\-*/<,[email protected]=] { return *yytext; }

 /* ========
  * error
  * ========
  */
.               {
    cool_yylval.error_msg = yytext;
    return ERROR;
}


5. problems encountered

line number

be careful the order of the rules..

字符串真是匹配到自闭x想了好多种办法,还是一下子全输入进来再匹配比较优雅