
If you're all about doing it now, skip ahead to This Week's Extensions

Lynn waxes philosophical

This isn't really a new project assignment. Instead, we're continuing last week's assignment but extending it in new directions as you feel comfortable. Remember, the purpose of this assignment is to help you to understand the material covered in this course. It is not to develop your software engineering skills per se (though that may happen as a side effect).

As long as you are programming in Lisp, take advantage of its interpreted nature to try things out. Poke around at your data structures. See how they look and respond. Don't be afraid to experiment.

The shtml tree

HTML is written using tags. Each tag is enclosed in angle brackets: <html> or <b>. Properly, each opening tag has a matching close tag: </html> or </b>. Anything between the matching open and close tags is a child of that tag. In this example:
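    <html>
      <body>
        Some text with <i>italics</i> and <b>boldface</b>
      </body>
    </html>

(The text content here is illustrative; what matters is the tag structure.)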

there are four tagged nodes: the html node is the parent; it has a body node as a child; and the body node contains text, including an i (italic) node and a b (bold) node. The shtml parse of the html above looks like this:
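    (html "\n"
          (body "\n"
                "Some text with "
                (i "italics")
                " and "
                (b "boldface")
                "\n")
          "\n")

(html->shtml actually wraps its result in a (*TOP* ...) node, omitted here for clarity, and the exact whitespace strings depend on how the input is formatted.)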

If we eliminate the newlines ("\n") and blank spaces -- for clarity, though we can certainly write code to do this -- we get:
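    (html (body "Some text with " (i "italics") " and " (b "boldface")))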

The shtml format is a tree. Each node of the tree is a list containing the node label followed by its children. For example, the first list contains two elements, the html tag and the entire body subtree. The body subtree has five elements: the body tag itself, the string "Some text with ", the (i ...) subtree, the string " and ", and the (b ...) subtree.

Some html tags, like a, also contain attributes. For example, a hyperlink is written like this:
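    <a href="http://www.olin.edu">Olin College</a>

(The link text "Olin College" is illustrative.)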

In shtml, the attribute/value pair href (the attribute) and "http://www.olin.edu" (the value) together is a child of the @ (attribute list) node, which is itself a child of the a node:
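    (a (@ (href "http://www.olin.edu"))
       "Olin College")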

Note that the text between <a... and </a> is also a child of the a node. In other words, this is a list of three elements: the a tag, the (@ ...) attribute subtree, and the link text string.

You can generate html from shtml using the shtml->html function in the htmlprag library.
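For example (the require line below assumes htmlprag is installed as a PLT collection; adjust it to however you load the library):

    (require (lib "htmlprag.ss" "htmlprag"))

    (shtml->html '(a (@ (href "http://www.olin.edu")) "Olin College"))
    ;; => "<a href=\"http://www.olin.edu\">Olin College</a>"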

Web Crawler Redux

If you have not yet written your web crawler, you should focus on doing that. That assignment contains three major (separate) pieces:

  1. Recovering shtml from a string denoting a url. (In the original assignment, this is called wget.)

  2. Building a queue ADT.
  3. Given some shtml, recovering the urls contained inside it.

The first piece requires some basic Lisp know-how and some familiarity with the net and htmlprag libraries. If you are having trouble with these pieces, please ask specific questions and we will try to help you connect them together. Remember that Lisp programming generally involves returning new values rather than mutating existing ones.
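For instance, wget might look like the sketch below, assuming PLT's net url library and htmlprag (if your version of html->shtml does not accept an input port, read the port into a string first):

    (require (lib "url.ss" "net"))            ; string->url, get-pure-port
    (require (lib "htmlprag.ss" "htmlprag"))  ; html->shtml

    ;; wget : string -> shtml
    ;; Fetch the page named by url-string and parse it into an shtml tree.
    (define (wget url-string)
      (let* ((port (get-pure-port (string->url url-string)))
             (tree (html->shtml port)))
        (close-input-port port)
        tree))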

The second piece is pure Lisp programming. You should also be able to test it independently of anything else that you do. Make sure that you can enqueue, dequeue, and retrieve elements from your queue.
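A plain list makes a perfectly serviceable first representation; for example (the names here are illustrative, not required):

    ;; A simple functional queue: a list with the front at the head.
    (define empty-queue '())
    (define (queue-empty? q) (null? q))
    (define (enqueue q item) (append q (list item)))  ; add at the rear; O(n), fine to start
    (define (queue-front q) (car q))                  ; retrieve the next element
    (define (dequeue q) (cdr q))                      ; the queue without its front element

    ;; (queue-front (enqueue (enqueue empty-queue 'a) 'b))  => a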

The final piece involves playing with shtml either as a list (after using flatten, from the ProjectAssignment1 page) or as a tree. You can generate your own shtml to test by supplying an html string as an argument to html->shtml, which is built into the htmlprag library. Look for the hrefs as a way to spot the embedded urls. (It's not a guarantee, but it's a good start.)
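Working on the tree directly, one sketch (the function name is ours) is a recursive walk that collects the value of every href pair:

    ;; urls-in-shtml : shtml -> (listof string)
    (define (urls-in-shtml tree)
      (cond ((not (pair? tree)) '())              ; strings and symbols hold no urls
            ((and (eq? (car tree) 'href)          ; an (href "http://...") pair
                  (pair? (cdr tree))
                  (string? (cadr tree)))
             (list (cadr tree)))
            (else (apply append (map urls-in-shtml tree)))))

    ;; (urls-in-shtml '(a (@ (href "http://www.olin.edu")) "Olin College"))
    ;; => ("http://www.olin.edu")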

Combining these three pieces requires genuinely thinking like a Lisp programmer, and it is the hardest part of the assignment until/unless you truly think in Lisp. Talk through this with someone if you're stuck.
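Once the pieces work separately, the main loop itself is short. A sketch, using the illustrative names from the sketches above (note that it does not remember where it has been; a real crawler should):

    ;; enqueue-all : queue (listof item) -> queue
    (define (enqueue-all q items)
      (if (null? items)
          q
          (enqueue-all (enqueue q (car items)) (cdr items))))

    ;; crawl : string number -> (listof string)
    ;; Visit up to n pages breadth-first, returning the urls visited.
    (define (crawl start-url n)
      (let loop ((q (enqueue empty-queue start-url))
                 (visited '())
                 (count 0))
        (if (or (queue-empty? q) (>= count n))
            (reverse visited)
            (let ((url (queue-front q)))
              (loop (enqueue-all (dequeue q) (urls-in-shtml (wget url)))
                    (cons url visited)
                    (+ count 1))))))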

This week's extensions

Once you have figured out how to manipulate web pages this way, there are several extensions that you might try. These are intended to inspire; obviously you should pick only a small (reasonable) number of them.

  1. Build a web page transformer. Change all the <b>s to <i>s. Add a parenthetical note after each anchor reference that identifies the linked url in small italics. Count the number of <img> tags. (For things that produce html, don't forget the shtml->html procedure. See the sketch after this list.)

  2. Screen scrape. Pick a web page and use this sort of web transformation to extract useful information. For example, write a scheme function that will go out and price a particular item at your favorite e-store and return the price, extracted from the web page. "Screenscraping the Senate" gives a nice description of a screen scraping problem (in a totally different language, but the idea is the same). (You can stop reading when he gets to the RDF, or read along and see if it makes sense if you prefer.)

  3. Play with XML (as opposed to HTML).
  4. You can also serve your own transformed web pages. The instructions are in DrScheme's documentation (Help) or the Scheme Cookbook (see http://schemecookbook.org/view/Cookbook/WebPLTWebServer).

  5. Try using map/filter/reduce as different ways of thinking about scheme programming.
  6. Eliza is a fun program to think about writing, but I would need to search for a web page and add it here....
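As a starting point for the first extension, the <b>-to-<i> rewrite can be a small tree walk (a sketch; pair it with shtml->html to get html back out):

    ;; b->i : shtml -> shtml
    ;; Return a copy of the tree with every (b ...) node relabeled as (i ...).
    (define (b->i tree)
      (cond ((not (pair? tree)) tree)            ; strings and symbols pass through
            ((eq? (car tree) 'b)
             (cons 'i (map b->i (cdr tree))))    ; relabel and recurse
            (else (map b->i tree))))

    ;; (shtml->html (b->i (html->shtml "<p>some <b>bold</b> text</p>")))
    ;; => "<p>some <i>bold</i> text</p>"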


What to turn in

Please email the following to Lynn within 12 hours of Tuesday's class (Sept 21):

Please also bring your code to class -- electronically is OK -- and be prepared to talk about what you built.

