
NB: I have heard from a lot of people that they're having trouble. It's OK not to be done as long as you've tried and understand where you're wedged.

See the session notes

Start with NotesSeptember7Proj, which describes what we did and what you should do.

Several people have had issues with parts of this assignment. You should come to class with either a web crawler or a strong sense of where you're wedged and why. The goal is not to work until it's done. Remember that the project itself shouldn't be more than about 6h/week. If you're wedged, let me know and we'll figure out how to debug you or the assignment, whichever needs fixing.

This assignment in particular will continue and evolve into the assignment for the following week. See ProjectAssignment1.5.

Oops, my bad

I had forgotten that the shtml returned by html->shtml is a parse tree. That is, it's a structure that is a little bit more complicated than your average list. We will discuss trees in class Real Soon Now, but hadn't by the time that I asked you to work with shtml. If this structure is too complicated for you to manipulate directly right now, you can use the following function to turn it into a flat list:

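;; flatten: turn a nested list (tree) into a flat list of its leaves, in left-to-right order
(define (flatten tree)
  (cond ((null? tree) tree)                     ; empty tree: nothing to add
        ((not (pair? tree)) (list tree))        ; a leaf: wrap it in a one-element list
        ;; otherwise flatten both halves of the pair and join the results
        (else (append (flatten (car tree)) (flatten (cdr tree))))))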

The tokens will still be just as clean, but now you can just cdr down the list. (Hint: Look for the hrefs.) Of course, if you poke around in the shtml you may find that the structure is actually more useful...
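Here is a minimal sketch of the hint above, not the One True Way. It assumes that in the flattened list each link shows up as the symbol href followed immediately by the URL string, which is how attribute lists such as (@ (href "...")) come out after flattening. The name extract-hrefs and the sample tree are made up for illustration; the sample is written by hand rather than produced by html->shtml, so check it against what you actually get back.

```scheme
;; flatten, as defined above, repeated so this example stands alone
(define (flatten tree)
  (cond ((null? tree) tree)
        ((not (pair? tree)) (list tree))
        (else (append (flatten (car tree)) (flatten (cdr tree))))))

;; extract-hrefs (hypothetical name): cdr down a flat token list and
;; collect every string that directly follows the symbol href.
(define (extract-hrefs tokens)
  (cond ((or (null? tokens) (null? (cdr tokens))) '())
        ((and (eq? (car tokens) 'href) (string? (cadr tokens)))
         (cons (cadr tokens) (extract-hrefs (cddr tokens))))
        (else (extract-hrefs (cdr tokens)))))

;; A hand-written tree in the general shape of shtml (an assumption,
;; not real html->shtml output).
(define sample-shtml
  '(*TOP* (html (body (p "See "
                         (a (@ (href "http://example.com/a")) "A")
                         " and "
                         (a (@ (href "/b.html")) "B"))))))

(extract-hrefs (flatten sample-shtml))
;; => ("http://example.com/a" "/b.html")
```

Note that the second link is relative. A real crawler has to turn it into a full URL before fetching it.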

Don't blow up the web (or Olin's network)

Things (not) to do:

One possible approach to web crawling

Write Wget: Already almost exists for PLT Scheme, as we saw in class

Extract info from HTML: Pretty easy to do with htmlprag

URL-extractor: Given some html, find the next (or all) urls

Write Queue

Web Crawler
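The pieces above might fit together roughly as follows. This is a sketch under assumptions, not a reference solution: the queue is just a list (slow but simple), and get-links stands in for "wget the page, then run the URL-extractor on it". All of the names here (make-queue, enqueue, crawl, get-links, max-pages, fake-web) are invented for illustration. The max-pages limit is there so that a buggy crawler stops on its own instead of hammering the network.

```scheme
;; A queue represented as a plain list: add at the back, remove from the front.
(define (make-queue) '())
(define (queue-empty? q) (null? q))
(define (enqueue q item) (append q (list item)))
(define (queue-front q) (car q))
(define (dequeue q) (cdr q))

;; Add every item in items to q, in order.
(define (enqueue-all q items)
  (if (null? items)
      q
      (enqueue-all (enqueue q (car items)) (cdr items))))

;; crawl: breadth-first walk starting at start-url.
;;   get-links  procedure taking a URL string and returning a list of URL strings
;;   max-pages  hard limit on the number of pages visited
;; Returns the visited URLs in the order they were visited.
(define (crawl start-url get-links max-pages)
  (let loop ((q (enqueue (make-queue) start-url))
             (visited '()))
    (cond ((or (queue-empty? q) (>= (length visited) max-pages))
           (reverse visited))
          ((member (queue-front q) visited)      ; already seen: skip it
           (loop (dequeue q) visited))
          (else
           (let ((url (queue-front q)))
             (loop (enqueue-all (dequeue q) (get-links url))
                   (cons url visited)))))))

;; A pretend web, so the crawler can be tested without touching the network.
(define fake-web
  '(("a" "b" "c")
    ("b" "a" "d")
    ("c" "d")
    ("d")))

(define (fake-links url)
  (let ((entry (assoc url fake-web)))
    (if entry (cdr entry) '())))

(crawl "a" fake-links 10)
;; => ("a" "b" "c" "d")
```

Testing against a fake web like this first is a cheap way to find infinite loops before they reach a real server.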

WAY advanced stuff

Make wget/crawler smarter: This requires grovelling through the url package documentation in PLT Scheme to figure out how to extract the hostname.
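To save a little grovelling, here is one possible starting point. It assumes the net/url library, which provides string->url and the url-host accessor; url-hostname and same-host? are hypothetical helper names, and staying on one host is only one way of being "smarter". Type it into a PLT Scheme or Racket REPL, or put it in a module, and check the documentation for your version.

```scheme
(require net/url)

;; url-hostname (hypothetical helper): the host part of a URL string,
;; or #f when the URL has no host (for example, a relative link).
(define (url-hostname str)
  (url-host (string->url str)))

;; same-host? (hypothetical helper): do both URLs name the same host?
;; Host names are case-insensitive, hence string-ci=?.
(define (same-host? a b)
  (let ((ha (url-hostname a))
        (hb (url-hostname b)))
    (and ha hb (string-ci=? ha hb))))

(url-hostname "http://www.olin.edu/index.html")
;; => "www.olin.edu"
```

A crawler could call same-host? on each extracted link and only enqueue the ones that stay on the starting site.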

What to turn in when

NB: I have heard from a lot of people that they're having trouble. It's OK not to be done as long as you've tried and understand where you're wedged.

What this used to say: Bring your code with you (on paper and electronically) to the next lab session. Also email a copy to las. Hopefully your code is readable and as documented as it needs to be....


2013-07-17 10:43