ProjectAssignment1.5

Contents

Lynn waxes philosophical
The shtml tree
Web Crawler Redux
This week's extensions
What to turn in

If you're all about doing it now, skip ahead to This Week's Extensions

Lynn waxes philosophical

This isn't really a new project assignment. Instead, we're continuing last week's assignment but extending it in new directions as you feel comfortable. Remember, the purpose of this assignment is to help you to understand the material covered in this course. It is not to develop your software engineering skills per se (though that may happen as a side effect).

As long as you are programming in Lisp, take advantage of its interpreted nature to try things out. Poke around at your data structures. See how they look and respond. Don't be afraid to experiment.

The shtml tree

Html is written using tags. Each tag is enclosed in angle brackets: <html> or <b>. Properly, each opening tag has a matching close tag: </html> or </b>. Anything between the matching open and close tag is a child of that tag. In this example

<html>
  <body>
    This is some text.  It's <i>really, really</i> <b>bold</b> text,
    isn't it?
  </body>
</html>

there are four tagged nodes: The html node is the parent node. It has a body node as a child. The body node contains text including an i (italic) and a b (bold) node. The shtml parse of the html above looks like this:

(|*TOP*|
  (html
    "\n"
    "  "
    (body
      "\n"
      "    This is some text.  It's "
      (i "really, really")
      " "
      (b "bold")
      " text, isn't it?\n"
      "  ")
    "\n"))

If we eliminate the newlines ("\n") and blank spaces -- for clarity, though we can certainly write code to do this -- we get

(|*TOP*|
  (html
    (body
      "    This is some text.  It's "
      (i "really, really")
      (b "bold")
      " text, isn't it?")))

The shtml format is a tree. Each node of the tree is a list containing the node label followed by its children. For example, the first list contains two elements, the html tag and the entire body subtree. The body subtree has five elements:

the body tag
the first text excerpt
the i subtree, including its own child text
the b subtree, including its own child text
the final piece of text

Some html tags, like a, also contain attributes. For example, a hyperlink is written

<a href="http://www.olin.edu/">Olin College</a>

In shtml, the attribute/value pair href (the attribute) and "http://www.olin.edu" (the value) together is a child of the @ (attribute list) node, which is itself a child of the a node:

(a (@ (href "http://www.olin.edu/")) "Olin College")

Note that the text between <a... and </a> is also a child of the a node. In other words, this is a list of three elements:

a, the tag
@, the attribute list node with its own attribute/value pair children, and
"Olin College", the child node of the a tag

You can generate html from shtml using the shtml->html function in the htmlprag library.

Web Crawler Redux

If you have not yet written your web crawler, you should focus on doing that. That assignment contains three major (separate) pieces:

Recovering shtml from a string denoting a url. (In the original
- assignment, this is called wget.)
Building a queue ADT.
Given some shtml, recovering the urls contained inside it.

The first piece requires some basic lisp know-how and some familiarity with the net and htmlprag libraries. If you are having trouble with these pieces, please make sure that you ask specific questions and we will try to help you to connect them together. Remember that lisp programming generally involves returning values rather than changing things per se.

The second piece is pure lisp programming. You should also be able to test it independently of anything else that you do. Make sure that you can enqueue, dequeue, and retrieve elements from your queue.

The final piece involves playing with shtml either as a list (after using flatten, from the ProjectAssignment1 page) or as a tree. You can generate your own shtml to test by supplying an html string as an argument to html->shtml, which is built into the htmlprag library. Look for the hrefs as a way to spot the embedded urls. (It's not a guarantee, but it's a good start.)

Combining these three pieces involves genuinely thinking like a lisp programmer and is the hardest part of the assignment until/unless you truly think in Lisp. Talk through this with someone if you're stuck.

This week's extensions

Once you have figured out how to manipulate web pages this way, there are several extensions that you might try. These are intended to inspire; obviously you'll only pick a small (reasonable) amount of extending....

Build a web page transformer. Change all the <b>s to <i>s. Add a
- parenthetical note after each anchor reference that identifies the
  linked url in small italics. Count the number of <img> tags. (For things that produce html, don't forget the shtml->html procedure.)
Screen scrape. Pick a web page and use this sort of web transformation to extract useful information. For example, write a scheme function that will go out and price a particular item at your favorite e-store and return the price, extracted from the web page. "Screenscraping the Senate" gives a nice description of a screen scraping problem (in a totally different language, but the idea is the same). (You can stop reading when he gets to the RDF, or read along and see if it makes sense if you prefer.)
Play with XML (as opposed to HTML). Resources include
- W3Schools XML Tutorial
- XML.com's Technical Introduction to XML - may be outdated in detail, though
You can also serve your own transformed web pages. The instructions are in DrScheme's documentation (Help) or the Scheme Cookbook (see http://schemecookbook.org/view/Cookbook/WebPLTWebServer)
Try using map/filter/reduce as different ways of thinking about scheme programming.
Eliza is a fun program to think about writing, but I would need to search for a web page and add it here....

Resources:

SSAX XML api for scheme
http://www.w3schools.com/ Lots of free tutorials, including quizzes. Pretty basic but pretty helpful.
Scheme Cookbook
SWeb X Scheme
- http://okmij.org/ftp/Scheme/xml.html Scheme and xml, especially SXML
- HtmlPrag pragmatic HTML parser generates SXML
- Scheme Cookbook on SchemeWiki -- includes recipes; in PLTscheme
  - http://schemewiki.org/Cookbook/WebChapter#Web_Programming Web Programming
  - http://schemewiki.org/view/Cookbook/XmlChapter XML on the SchemeWiki
  - http://schemewiki.org/Cookbook/XMLRecipeRSS Reading and Writing RSS Files from the SchemeWiki

What to turn in

Please email the following to Lynn within 12h of Tuesday (Sept 21)'s class:

A reasonably cleaned up, documented version of your web crawler code.
At your option, one or more examples of additional functionality that you built; however, this should be no more than two or three additional pages of code beyond your web crawler.
A brief, well-written English description of what you did. This should be at least a few sentences but no more than a page. I'm interested in knowing the scope of your work -- what you actually built -- as well as what kinds of things you learned and any places you got particularly stuck.
An estimate of how long you've spent on all course activities over the previous week, broken down into class sessions, reading, written work, project work, and any other categories you consider particularly notable.
A collaboration statement, of course.

Please also bring your code to class -- electronically is OK -- and be prepared to talk about what you built.