I assume you can program in C and understand data structures such as linked lists and trees. The Overview describes the basic building blocks of a compiler and explains the interaction between lex and yacc.
The next two sections describe lex and yacc in more detail. With this background we can construct a sophisticated calculator. Conventional arithmetic operations and control statements, such as if-else and while, are implemented. With minor changes we will convert the calculator into a compiler for a stack-based machine. The remaining sections discuss issues that commonly arise in compiler writing.

For declaring words, the first group of rules sets the state to the type corresponding to the part of speech being declared.
We reset the state to LOOKUP at the beginning of each line so that after we add new words interactively we can test our table of words to determine if it is working correctly.
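The rules just described might be sketched like this; this is an illustrative lex fragment of our own, with state names and the add_word/lookup_word helpers assumed, not the book's exact example:

```lex
%{
/* Illustrative sketch, not the book's exact listing.  The current
 * "state" says what to do with the words we see next. */
enum { LOOKUP = 0, NOUN, VERB };   /* LOOKUP means "not declaring" */
int state = LOOKUP;
%}

%%
\n        { state = LOOKUP; }      /* reset at the start of each line */
^"noun"   { state = NOUN; }        /* the words that follow are nouns */
^"verb"   { state = VERB; }        /* the words that follow are verbs */
[a-zA-Z]+ {
            if (state == LOOKUP)
                lookup_word(yytext);      /* test mode: look it up */
            else
                add_word(state, yytext);  /* declare it */
          }
.         ;                        /* ignore everything else */
```

Because the `^"noun"` and `^"verb"` rules appear before the general word rule, lex prefers them when a declaration keyword starts a line.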
The user subroutines section in the example contains the same skeletal main routine and our two supporting functions. These last two functions create and search a linked list of words. If there are a lot of words, the functions will be slow since, for each word, they might have to search through the entire list.
In a production environment we would use a faster but more complex scheme, probably using a hash table. Our simple example does the job, albeit slowly. Traditionally, a description of such a set of actions is known as a grammar. It seems especially appropriate for our example. Suppose that we wished to recognize common sentences. Here is a list of simple sentence types.
At this point, it seems convenient to introduce some notation for describing grammars. As an added example, we could define an object in the same notation. Indeed, we could expand this definition of sentence to fit a much wider variety of sentences. However, at this stage we would like to build a yacc grammar so we can test our ideas out interactively.
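In the usual notation, each rule names a symbol on the left and the sequence of symbols it may consist of on the right. A hypothetical sentence and object, with symbol names of our own choosing, might be written:

```
sentence -> subject VERB object
subject  -> NOUN | PRONOUN
object   -> NOUN
```

The vertical bar separates alternatives, so a subject is either a noun or a pronoun.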
Before we introduce our yacc grammar, we must modify our lexical analyzer in order to return values useful to our new parser. When you use a lex scanner and a yacc parser together, the parser is the higher level routine.
It calls the lexer yylex whenever it needs a token from the input. The lexer then scans through the input, recognizing tokens. The lexer and the parser have to agree on what the token codes are. We solve this problem by letting yacc define the token codes: yacc defines each token as a small integer using a preprocessor #define. Token code zero is always returned for the logical end of the input. Yacc can optionally write a C header file containing all of the token definitions.
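The actual names and numbers come from yacc's output; as an illustration, the definitions for the parts of speech would look something like this (yacc numbers tokens from 257 upward so they cannot collide with literal character values):

```c
#define NOUN 257
#define PRONOUN 258
#define VERB 259
#define ADVERB 260
#define ADJECTIVE 261
#define PREPOSITION 262
#define CONJUNCTION 263
#define INTERJECTION 264
```

The exact values are an implementation detail; only the names should be relied on in the lexer and parser.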
You include this file, conventionally called y.tab.h, in the lexer. The example shows the declarations and rules sections of the new lexer. There are several important differences here. Among them, we have added return statements to pass to the parser the token codes for the words that it recognizes.
These return statements show that yylex acts like a coroutine. Each time the parser calls it, it takes up processing at the exact point it left off. This allows us to examine and operate upon the input stream incrementally. The backslash in front of the period quotes the period, so this rule matches a period followed by a newline. The other change we made to our lexical analyzer was to omit the main routine as it will now be provided within the parser.
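The word-matching rule and the end-of-input rule might now look something like this (a sketch; lookup_word and the token names are the same assumed ones used earlier):

```lex
[a-zA-Z]+ {
        /* look the word up and hand its token code to the parser */
        int type = lookup_word(yytext);
        if (type != LOOKUP)
            return type;        /* e.g., NOUN, VERB, ... */
        printf("%s: don't recognize\n", yytext);
        /* no return: keep scanning rather than give the parser junk */
    }

\.\n    { return 0; }   /* period + newline: logical end of input */
```

Each `return` suspends yylex; the next call from the parser resumes scanning right where it left off.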
Finally, the example introduces our first cut at the yacc grammar. The structure of a yacc parser is, not by accident, similar to that of a lex lexer. The definition section here contains a C code block; we use it for a C comment (as with lex, C comments belong inside C code blocks, at least within the definition section) and a single include file.
Then come definitions of all the tokens we expect to receive from the lexical analyzer. In this example, they correspond to the eight parts of speech. The name of a token does not have any intrinsic meaning to yacc, although well-chosen token names tell the reader what they represent. Although yacc lets you use any valid C identifier name for a yacc symbol, universal custom dictates that token names be all uppercase and other names in the parser mostly or entirely lowercase.
The routine yyparse is the parser generated by yacc, so our main program repeatedly tries to parse sentences until the input runs out. The rules section describes the actual grammar as a set of production rules or simply rules. Some people also call them productions. By default, the first rule is the highest-level rule. That is, the parser attempts to find a list of tokens which match this initial rule, or more commonly, rules found from the initial rule.
The expression on the right-hand side of the rule is a list of zero or more names. A typical simple rule has a single symbol on the right-hand side as in the object rule which is defined to be a NOUN. The symbol on the left-hand side of the rule can then be used like a token in other rules. From this, we build complex grammars. The parser executes an action at the end of a rule as soon as the rule matches. Since sentence is the top-level symbol, the entire input must match a sentence.
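Rules of this shape, including one with an action, might look like the following sketch (symbol names chosen to match the discussion; not necessarily the book's exact listing):

```yacc
sentence: subject VERB object   { printf("Sentence is valid.\n"); }
        ;

subject:  NOUN
        | PRONOUN
        ;

object:   NOUN
        ;
```

Here `object` has a single symbol on its right-hand side, while `subject` offers two alternatives; the action on `sentence` runs as soon as that rule matches.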
The parser returns to its caller, in this case the main program, when the lexer reports the end of the input. Subsequent calls to yyparse reset the state and begin processing again. If the input doesn't match any of the grammar's rules, the parser calls yyerror, which we provide in the user subroutines section, and then recognizes the special rule error.
You can provide error recovery code that tries to get the parser back into a state where it can continue parsing.
If error recovery fails or, as is the case here, there is no error recovery code, yyparse returns to the caller after it finds an error. This section can contain any C code and is copied, verbatim, into the resulting parser. In our example, we have provided the minimal set of functions necessary for a yacc-generated parser using a lex-generated lexer to compile: main and yyerror.
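A minimal version of those two routines might look like this (a sketch only; yyin and yyparse come from the generated lexer and parser, so this fragment links only against them):

```c
#include <stdio.h>

extern FILE *yyin;      /* lex's input file, which defaults to stdin */
int yyparse(void);      /* the yacc-generated parser */

int main(void)
{
    while (!feof(yyin))
        yyparse();
    return 0;
}

/* yacc calls this when it detects a syntax error */
void yyerror(const char *s)
{
    fprintf(stderr, "%s\n", s);
}
```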
The main routine keeps calling the parser until it reaches the end-of-file on yyin, the lex input file. The only other necessary routine is yylex, which is provided by our lexer. In our final example of this chapter, we expand our earlier grammar to recognize a richer, although by no means complete, set of sentences.
We invite you to experiment further with this example; you will see how difficult English is to describe in an unambiguous way. We have expanded our sentence rule by introducing a traditional grammar formulation from elementary school English class: a sentence can be either a simple sentence or a compound sentence, which contains two or more independent clauses joined by a coordinating conjunction. Our current lexical analyzer does not distinguish a coordinating conjunction from any other kind of conjunction.
We have also introduced recursion into this grammar. Recursion, in which a rule refers directly or indirectly to itself, is a powerful tool for describing grammars, and we use the technique in nearly every yacc grammar we write.

We called our various lexers ch1-N.l, where N is the example number; similarly, we called our parsers ch1-M.y.
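For example, the compound-sentence rule can refer to itself, so that any number of clauses can be chained together (a sketch in the same style as the rules above):

```yacc
compound_sentence: simple_sentence CONJUNCTION simple_sentence
        | compound_sentence CONJUNCTION simple_sentence
        ;
```

The second alternative is left-recursive: a compound sentence already matched can absorb one more conjunction and clause, over and over.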
Then, to build the output, we ran lex, yacc, and the C compiler under UNIX. The first command runs lex over the lex specification and generates a file, lex.yy.c.
In the second command, yacc's -d flag makes it generate both y.tab.c and the token definition header y.tab.h. The next command compiles each of the two C files. The final command links them together and pulls in the routines in the lex library with -ll. In particular, Berkeley yacc and flex will work merely by changing the lex and yacc commands to flex and byacc, and removing the -ll linker flag.
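Concretely, the sequence described above might look like this (the file names are placeholders for the ch1-N.l and ch1-M.y examples):

```shell
lex ch1-N.l                          # writes lex.yy.c
yacc -d ch1-M.y                      # writes y.tab.c and y.tab.h
cc -c lex.yy.c y.tab.c               # compile both generated files
cc -o example lex.yy.o y.tab.o -ll   # link, pulling in the lex library
```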
However, we know of far too many differences to assure the reader that this is true. For example, if we use the GNU replacement bison instead of yacc, it generates files named ch1-M.tab.c and ch1-M.tab.h rather than y.tab.c and y.tab.h. See Appendices A through H for details on the various lex and yacc implementations.
People have often told us that writing a lexer in C is so easy that there is no point in going to the effort to learn lex. Maybe and maybe not. The example shows a lexer written in C suitable for a simple command language that handles commands, numbers, strings, and new lines, ignoring white space and comments.
The example shows an equivalent lexer written in lex; the lex version is a third the length of the C lexer. Lex handles some subtle situations in a natural way that are difficult to get right in a hand-written lexer. When skipping a C comment, for example, a very common bug in C lexers is failing to consider that the character after a star may itself be a star, with the closing slash following that.
In practice, this means that a comment whose body ends with an extra star fails. In the next chapter we delve into the workings of lex more deeply.

Exercises:

1. Extend the English-language parser to handle more complex syntax: prepositional phrases in the subject, adverbs modifying adjectives, etc.
2. Make the parser handle compound verbs better.
3. Some words can be more than one part of speech. How well does this work? When people hear an unfamiliar word, they can usually guess from the context what part of speech it is.
Could the lexer characterize new words on the fly?

All tokens are symbols, but not all symbols are tokens.