
Notes for Project Lecture 4 - Tuesday September 14th, 2004

Expectations

Apologies to the extent that class disorganization causes pain. If it happens again, first stop what you're doing and let Lynn know.

The goal of the project is to reinforce the content of the main part of the course. Some people learn best by coding and by applying the theory learned in the 4u part. If the time you are putting in is not reinforcing what you're learning, then something is wrong.

Webcrawler

For those who got the webcrawler to work, the goal of this week is to manipulate the tree. By next Tuesday, everyone should be comfortable with the webcrawler piece and manipulation of the shtml tree. Some gravy would be to take two shopping sites and compare prices.

Places where people could have gotten stuck

Lisp programming in itself can be hard to wrap your brain around, but one logical way to start thinking about the problem could be the following.

Lisp programming:
string
  -> url
  -> open port
  -> read shtml

This is the normal way to think about the problem, but in Lisp the actual code is written from the bottom up:

(html->shtml (get-pure-port (string->url <string>)))

It is ok, and even common, to write down the logical thinking presented above, but you need to be able to write the code from the bottom up. In Java, this is a lot like how you open a Stream or Reader: the concept is constructing objects around other objects.

Another place to get stuck is in writing the queue abstract data type in Scheme. However, since this is an interpreted language, it is very easy to poke at the code by running it often. It is ok, and even a good idea, to poke at a queue as you are writing it; see the sketch below.
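For instance, here is a minimal sketch of one possible list-based queue to poke at (make-queue, enqueue, front, dequeue, and empty? are placeholder names; use whatever interface you designed):

(define (make-queue) '())                   ; an empty queue is an empty list
(define (empty? q) (null? q))
(define (enqueue x q) (append q (list x)))  ; add to the back
(define (front q) (car q))                  ; peek at the front
(define (dequeue q) (cdr q))                ; the queue without its front

Poking at it in the interpreter:

> (front (enqueue "b" (enqueue "a" (make-queue))))
"a"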

And a third place to get hung up is in writing the piece of code that takes the queue, retrieves the front, dequeues it, and enqueues the newly found strings. We cover this below.

Random note: The nice thing about doing this program in Lisp is that what you get back is much easier to manipulate than what you would get back in other languages.

HTML & SHTML

Simple html document:

<html>
   <body>
     Here's some <i> slanty </i> text and a <a href="http://nowhere" id="foo"> link </a>.
   </body>
</html>

This would display as:
Here's some slanty text and a link.

In general, tags are in angle brackets (<>), and close tags are in angle brackets that start with a slash (</>).

The following is the shtml tree that represents the html seen above:

                 html
                   |
                 body
      _____________|______________
     /           |        |       \
"Here's some"    i   "text and a"  a
                 |                / \
             "slanty"            @   "link"
                                / \
                            href   id
                              |     |
              "http://nowhere"     "foo"

Here's what you get in Lisp:

(html (body "Here's some"
            (i "slanty")
            "text and a "
            (a (@ (href "http://nowhere") (id "foo")) "link")))

Sadly, you cannot simply cdr down a tree the way you can cdr down a flat list; you need to look at the cars too. cdring along a subtree moves across its siblings, while taking the car steps down into a child.

An idea of something to play with is to try changing all i tags to b's. Do this by walking through the tree structure and, instead of consing i onto the result, consing b, as in the sketch below.
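Here is a minimal sketch of that walk (i->b is a hypothetical name; it assumes, as in the shtml above, that a tag is a symbol sitting in the car of its subtree):

(define (i->b tree)
  (cond ((not (pair? tree)) tree)           ; strings pass through unchanged
        ((eq? (car tree) 'i)                ; an i subtree:
         (cons 'b (map i->b (cdr tree))))   ; cons b onto the result instead of i
        (else (map i->b tree))))            ; otherwise recur into every part

For example, (i->b '(body "Here's some" (i "slanty") "text")) returns (body "Here's some" (b "slanty") "text").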

The bottom line is that you should do the webcrawler for next week and be able to do some sort of extension to it.

Notes: get-pure-port is only good for one read. If you have trouble with a particular seed webpage, try another one. uri.ss is great for url strings.

Today, turn in what you have done so far, and give an indication of where you are and why (if) you are stuck. Also give an idea of how much time you are spending: way too much, about right, or too little.

Algorithm for queue

Syntax for let:

(let ((foo 3)
      (bar (+ 5 2)))
  <body>)

Note that let holds the temporary definitions only for the <body> expressions.
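For example, with a body that just adds the two bindings:

(let ((foo 3)
      (bar (+ 5 2)))
  (+ foo bar))
; => 10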

Now let's figure out how to glue together the parts of a webcrawler. Basically, we want to:
Define a crawl procedure over the queue.
Have a queue of url strings.
Take the string at the front.
Turn the string into a url.
Turn the url into a page (shtml).
Find the links in the shtml.
This might result in code that looks like this (where newurl is the string taken from the front of the queue and newqueue is the queue with that front dequeued):
(extracturls (html->shtml (get-pure-port (string->url newurl))) newqueue)

Given some shtml and a queue, extracturls should return the queue with the new urls added.

One way to finish it is a sketch like the following, which walks both the cars and the cdrs and pulls the url string out of every (href ...) pair it finds:

(define extracturls
  (lambda (shtml queue)
    (cond ((not (pair? shtml)) queue)          ; a bare string or symbol holds no links
          ((eq? (car shtml) 'href)             ; an (href "url") pair inside an @ list
           (enqueue (cadr shtml) queue))       ; enqueue comes from your queue ADT
          (else (extracturls (cdr shtml)       ; walk the siblings...
                             (extracturls (car shtml) queue))))))  ; ...and the children
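With extracturls done, the glue expression from above becomes a loop. Here is a hedged sketch, reusing the placeholder empty?, front, and dequeue names from the queue sketch earlier (html->shtml comes from the HtmlPrag library; get-pure-port and string->url come from PLT's net url collection):

(define (crawl queue)
  (if (empty? queue)
      'done                                    ; nothing left to visit
      (let ((newurl (front queue))             ; string at the front
            (newqueue (dequeue queue)))        ; the rest of the queue
        (crawl (extracturls
                (html->shtml (get-pure-port (string->url newurl)))
                newqueue)))))

A real crawl would also want to remember which urls it has already visited, or it can loop forever revisiting the same pages.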
