Notes for Lecture 10 - Tuesday October 12th, 2004
- Scribe = Kathy King
- Editor =
Administrivia
- Next exam will be November 12 (probably posted before this date and due later)
- Next problem set is to be posted on Friday
Review of what we've done
- Talked about Data Structures and Abstract Data Types
- Recall that an abstract data type is removed from any language or implementation. It is simply the contract.
- A data structure is the actual implementation of an ADT in some language.
- Looked at a particular language (namely Lisp)
- Studied automata
- Finite State Machines (which are models of computation) and
- Regular expressions (which are models of language). Even though computer scientists talk about these as if they're identical, it's important to recall that they're distinct.
- Trees
- Expressions can be represented as trees
- Here's an example, for the expression 4 + 3 x 2 (where x means "multiply"):
      +
     / \
    4   x
       / \
      3   2
- To interpret this tree, we interpret the left child, then the right child, and finally combine these two with the tree root's operation. The base case is "if we have a number, return it."
- A tree itself is an abstract data type (we can use them in any language, without knowing the implementation).
All languages are evaluated using trees! That's why we just reviewed all of this.
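The evaluation rule above (interpret the left child, interpret the right child, combine with the root's operation) can be sketched in Python. The tuple representation of trees here is just an illustration, not anything specified in lecture:

```python
# A tree is either a number (the base case) or a tuple
# (operator, left_subtree, right_subtree).

def evaluate(tree):
    if isinstance(tree, (int, float)):   # base case: "if we have a number, return it"
        return tree
    op, left, right = tree
    a = evaluate(left)                   # interpret the left child...
    b = evaluate(right)                  # ...then the right child...
    return {"+": a + b, "x": a * b}[op]  # ...then combine at the root

# The tree for 4 + 3 x 2 from the example above:
print(evaluate(("+", 4, ("x", 3, 2))))  # 10
```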
Grammars
Every expression language (e.g., English, Java, Lisp) needs a grammar. When you define a programming language, you must create a grammar for it.
What's in a grammar?
Or rather, a context-free grammar, which is what computer scientists usually deal with.
- An alphabet, Sigma
- The "letters" of this alphabet are the things that appear in the string we're trying to parse.
These symbols are called terminals.
- A set of symbols, Gamma
- These symbols are internal to the grammar.
These are called nonterminals.
- A set of rules (also called productions) that explain how the nonterminals and terminals may be combined
Here's an example of a grammar that corresponds to arithmetic expressions using integers (like the one in the tree above):
Sigma = {1,2,3,4,5,6,7,8,9,0,+,x,-,/}
Gamma = {E}
Rules:
E -> E + E
E -> E x E
E -> E - E
E -> E / E
E -> (0 ... 9)(0 ... 9)*    (recall that "*" means "0 or more")
Generating Strings and Parsing
Now, using this grammar, we can generate any arithmetic expression involving integers and the basic operations. We could generate the one we used in the example above like this:
         E
       / | \
      E  +  E
      |   / | \
      4  E  x  E
         |     |
         3     2
Or, more commonly, we might be given a string that we need to evaluate. Using our grammar, we can figure out how to treat each part of the string. When we try this with our example, though, we discover a problem with our grammar:
        x                  +
       / \                / \
      +   2              4   x
     / \                    / \
    4   3                  3   2

  (4 + 3) x 2          4 + (3 x 2)
Both of these trees are valid according to our grammar and the rules of evaluating trees, but they give different answers. For the tree on the left, 4 + 3 x 2 = 7 x 2 = 14. For the tree on the right, 4 + 3 x 2 = 4 + 6 = 10. This is not good. We call a grammar that allows situations like this one ambiguous. In this case, we could make our grammar unambiguous by adding rules about operator precedence, or by demanding that parentheses be used.
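One standard precedence fix (the layered-grammar trick from compiler textbooks, not something worked out in lecture) splits E into levels so that x binds tighter than +. A small Python sketch of a recursive-descent evaluator for that unambiguous grammar:

```python
# Evaluator for the layered, unambiguous grammar
#   E -> T (+ T)*      (addition binds loosest)
#   T -> F (x F)*      (multiplication binds tightest)
#   F -> number

def evaluate(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def factor():                 # F -> number
        nonlocal pos
        value = int(tokens[pos])
        pos += 1
        return value

    def term():                   # T -> F (x F)*
        nonlocal pos
        value = factor()
        while peek() == "x":
            pos += 1
            value *= factor()
        return value

    def expr():                   # E -> T (+ T)*
        nonlocal pos
        value = term()
        while peek() == "+":
            pos += 1
            value += term()
        return value

    return expr()

print(evaluate(["4", "+", "3", "x", "2"]))  # 10: only one parse is possible now
```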
Another Example
Here's a simple grammar for English:
S  -> NP VP
NP -> det noun
VP -> verb_i | verb_t NP
NP -> properNoun | det adj* noun

where our nonterminals are
S  = start symbol (we must begin with a start symbol to get a complete phrase)
NP = noun phrase
VP = verb phrase

and our terminals are represented by the words that begin with lower case (each could be replaced by a specific word of its type):
verb_i = intransitive verb
verb_t = transitive verb
det    = determiner (the, a)
adj    = adjective
We can use this grammar to generate a string, by rolling dice to choose which production we use for NP (which has two possibilities) and to choose what specific word to substitute for each terminal. One possibility is below:
               S
              / \
            NP   VP
            |    / \
    properNoun v_t  NP
            |   |   /  |   \
        Jesus saves det adj* noun
                     |  /  \    \
                     a adj adj   screenshot
                       |    |
                    wicked cool
So, we have generated the phrase, "Jesus saves a wicked cool screenshot."
Note that in this grammar we have a start symbol, but we did not have one in the previous grammar. Our English grammar has multiple nonterminals (S, NP, VP), so we must say which one a derivation begins with; our integer arithmetic grammar had only one (E), so there was no question of where to start. Note that the start symbol could have multiple productions, too.
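The dice-rolling generation described above can be sketched in Python. The word lists are made-up examples, and "adj*" is expanded by a coin flip rather than a true Kleene star, to keep the sketch short:

```python
import random

# Grammar rules mirror the notes; each nonterminal maps to its productions.
grammar = {
    "S":  [["NP", "VP"]],
    "NP": [["properNoun"], ["det", "adj*", "noun"]],
    "VP": [["verb_i"], ["verb_t", "NP"]],
}
# Terminals: each is replaced by a specific word of its type.
words = {
    "properNoun": ["Jesus", "Kathy"],
    "det":    ["the", "a"],
    "adj*":   ["", "wicked cool"],   # "" stands for zero adjectives
    "noun":   ["screenshot", "grammar"],
    "verb_i": ["sleeps"],
    "verb_t": ["saves"],
}

def generate(symbol):
    if symbol in grammar:                  # nonterminal: roll dice to pick a production
        production = random.choice(grammar[symbol])
        return " ".join(filter(None, (generate(s) for s in production)))
    return random.choice(words[symbol])    # terminal: roll dice to pick a word

print(generate("S"))  # e.g. "Jesus saves a wicked cool screenshot"
```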
Big Point
You use parse trees to interpret a language
Lisp
In Lisp:
When we see an expression like this:
((lambda (x) (+ x x)) 4)
it means: evaluate (+ x x) with x = 4.
The basic idea here is that we start with a table that binds operations to symbols (so "+" means add, "/" means divide, etc.). When we're doing an assignment like x=4, we temporarily add to the table (so "x" means 4). (Lisp reality is slightly more complicated than this - there are actually nested lists, but that doesn't really matter for us, for now.)
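A minimal Python sketch of the table idea, with dictionaries standing in for the real environment structure (the nested-list reality mentioned above is ignored here):

```python
import operator

# The starting table binds operations to symbols.
global_env = {"+": operator.add, "/": operator.truediv}

def extend(env, name, value):
    # "Temporarily add to the table": a new table that shadows the old one,
    # leaving the original untouched.
    new_env = dict(env)
    new_env[name] = value
    return new_env

# Binding x = 4, then evaluating (+ x x) by hand against the table:
inner = extend(global_env, "x", 4)
print(inner["+"](inner["x"], inner["x"]))  # 8
```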
Lisp's Grammar
Here's what we need:
program -> S-expression
S-expression -> number
S-expression -> symbol
S-expression -> ()
S-expression -> application
S-expression -> conditional
S-expression -> quotation
S-expression -> define
S-expression -> lambda
And we might want to have S-expression -> set! if we want mutation. This grammar will generate all parsable Scheme programs. These will not necessarily run without error, though, just as you can say something grammatically correct that doesn't actually mean anything. (For example, Chomsky's sentence "Colorless green ideas sleep furiously." is grammatical but meaningless.)
Now, how do we deal with each part of this?
- Number
- Want to return the number itself
S-expression -> number
number -> -?(0...9)(0...9)*(.(0...9)(0...9)*)?
And look! Number is equivalent to a regular expression! ("?" could be constructed from the basic regular expression operations (concatenation, or, and "0 or more") if we wanted to make this look uglier.) Having regular expressions define the terminals is called lexical analysis. To do lexical analysis, we use a lexical analyzer like Lex (such great names). The parsing itself is handled by a parser generator like Yacc ("Yet Another Compiler Compiler").
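As a quick illustration (using Python's re module rather than Lex), the number rule really is an ordinary regular expression. The fractional part is written as optional here so plain integers match too:

```python
import re

# The "number" terminal from the grammar, as a real regular expression:
# optional minus sign, one or more digits, optional ".digits" part.
NUMBER = re.compile(r"-?[0-9]+(\.[0-9]+)?")

for text in ["42", "-3.14", "x"]:
    kind = "number" if NUMBER.fullmatch(text) else "not a number"
    print(text, "->", kind)
```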
- Symbol
- Look up the symbol in the table of symbols (discussed above) for the environment. Then return whatever the symbol is bound to.
- We could write a regular expression for symbol.
- Empty Set
- Return the empty list
- Application
- First, evaluate all of the sub-S-expressions. Then apply the value of the first one to all of the rest.
S-expression -> application
application -> (S-expression S-expression*)
- Conditional
- First, evaluate S-expression_test. Then, depending on the result, evaluate S-expression_result or S-expression_else.
- Note that in Lisp, you must define either if or cond; you can build the other out of whichever you choose.
S-expression -> conditional
conditional -> (if S-expression_test S-expression_result S-expression_else)
- Quotation
- Return the S-expression without evaluating
S-expression -> quotation
quotation -> (quote S-expression)
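The case analysis above (number, symbol, empty list, quotation, conditional, application) can be sketched as a tiny eval in Python, with nested lists standing in for real S-expressions. Note that the conditional check comes before the application check, which is exactly the ordering the ambiguity remark in the next section calls for:

```python
import operator

# Table binding operations to symbols, as described earlier.
env = {"+": operator.add, "x": operator.mul}

def s_eval(expr, env):
    if isinstance(expr, (int, float)):   # number: return the number itself
        return expr
    if isinstance(expr, str):            # symbol: look it up in the table
        return env[expr]
    if expr == []:                       # (): return the empty list
        return []
    if expr[0] == "quote":               # quotation: return without evaluating
        return expr[1]
    if expr[0] == "if":                  # conditional: test first, then one branch
        _, test, result, alt = expr
        return s_eval(result, env) if s_eval(test, env) else s_eval(alt, env)
    # application: evaluate all sub-S-expressions,
    # then apply the value of the first one to all of the rest
    proc, *args = [s_eval(sub, env) for sub in expr]
    return proc(*args)

print(s_eval(["+", 4, ["x", 3, 2]], env))  # 10
```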
Notes on Lisp grammar
- We need to be careful, because right now the grammar we've written is ambiguous; conditional and application could look the same. To fix this, we need a rule to say "if it's a conditional, don't check whether it's an application."
Each grammar production is converted directly into an evaluation rule; by associating productions with rules, you can build an interpreter.
- We still need to create rules for define, lambda, and possibly set!. (NB: define is important only for recursive procedures and requires set! or some other additional machinery.)
This interpreter is commonly called eval. We also need to define apply, which does important work.
- Once we have all of these things, we have an interpreter that can handle all of Lisp, because everything else can be built out of these.
- So now we better understand why Lisp is cool:
- You can use Lisp to write a grammar for Lisp.
- The grammars of Java, Python, and most other languages run to several pages of rules. Lisp is so simple and elegant!