See the session notes
Start with NotesSeptember7Proj, which describes what we did and what you should do.
Several people have had issues with parts of this assignment. You should come to class with either a web crawler or a strong sense of where you're wedged and why. The goal is not to work until it's done. Remember that the project itself shouldn't be more than about 6h/week. If you're wedged, let me know and we'll figure out how to debug you or the assignment, whichever needs fixing.
This assignment in particular will continue/evolve into an assignment for the following week. See ProjectAssignment1.5
Oops, my bad
I had forgotten that the shtml returned by html->shtml is a parse tree. That is, it's a structure that is a little more complicated than your average list. We will discuss trees in class Real Soon Now, but hadn't by the time I asked you to work with shtml. If this structure is too complicated for you to manipulate directly right now, you can use the following function to turn it into a flat list:
(define (flatten tree)
  (cond ((null? tree) tree)
        ((not (pair? tree)) (list tree))
        (else (append (flatten (car tree))
                      (flatten (cdr tree))))))
The tokens will still be just as clean, but now you can just cdr down the list. (Hint: Look for the hrefs.) Of course, if you poke around in the shtml you may find that the structure is actually more useful...
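For example (assuming htmlprag's html->shtml is loaded, however you loaded it in class, and flatten is defined as above), a page with a single link flattens into a plain list of symbols and strings, something like:

(flatten (html->shtml "<html><body><a href=\"http://olin.edu\">Olin</a></body></html>"))
;; => (*TOP* html body a @ href "http://olin.edu" "Olin")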
Don't blow up the web (or Olin's network)
Things (not) to do:
- DO make sure that you bound the number of urls you look at. (100 is *plenty* for testing purposes. 10 might be good to start with.)
- If possible, DO respect robots.txt. This means that you should look at the root of the web site and see if there's a robots.txt file. If so, don't worry about what it says; just ignore that site. (Info on robots.txt is easily google-able -- or see below -- and not every robots.txt file says "don't look at anything at all", but this is a safe/easy approach.)
There's a really clean and simple test page up, with full URLs and whatnot (and no non-html files), at http://s-whispers.olin.edu/temp/crawl/. Bug me if you don't like it, need stuff added, etc. -Jon Tse
One possible approach to web crawling
Write Wget: this already almost exists for PLT Scheme, as we saw in class (a rough sketch follows this list)
- What it does: given a URL (presumably represented as a string), retrieve its contents
Optional: have an option to display the contents
Optional: have an option to save the contents to a file
A good idea: What might go wrong? -- make your wgetter more robust by handling error conditions
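Here is a minimal sketch of such a wgetter, using PLT Scheme's url library (get-pure-port and string->url come from that library); the name wget and the character-reading loop are just one way to do it, and there's no error handling yet:

(require (lib "url.ss" "net"))

(define (wget url-string)
  ;; fetch the page body as one big string; the caller decides what to do with it
  (let ((port (get-pure-port (string->url url-string))))
    (let loop ((chars '()))
      (let ((c (read-char port)))
        (if (eof-object? c)
            (begin (close-input-port port)
                   (list->string (reverse chars)))
            (loop (cons c chars)))))))

;; e.g. (wget "http://s-whispers.olin.edu/temp/crawl/")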
Extract info from HTML: Pretty easy to do with htmlprag (a rough sketch follows this list)
- What it does: extend Wget to extract some piece of information (of your choosing) from the file
- e.g., the title
Hint: if the car is title, where is the actual title?
Don't forget to use flatten if the tree structure weirds you out
A good idea: Handle anomalies
- Don't die if the information you're looking for is missing or ill formed
- What other error conditions might arise?
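A minimal sketch of the title case, assuming htmlprag's html->shtml is loaded and flatten is defined as above (page-title is an illustrative name, not a library function). In the flattened list, the symbol title is immediately followed by the title text -- when there is one:

(define (page-title html-string)
  (let loop ((tokens (flatten (html->shtml html-string))))
    (cond ((or (null? tokens) (null? (cdr tokens))) #f)  ; missing title: return #f, don't die
          ((and (eq? (car tokens) 'title)
                (string? (cadr tokens)))                 ; skip an empty <title></title>
           (cadr tokens))
          (else (loop (cdr tokens))))))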
URL-extractor: Given some html, find the next (or all) urls
- If you're using a list (flattened shtml): cdr down the list -- what are you looking for? (the flattened-list approach is sketched after this list)
- If you're using a tree (raw shtml): you're not cdr'ing, but doing something similar
Challenge: Alternatively, read up on stateful programming (SICP chapter 3) and write an iterator (also sketched after this list)
- given a url (or some html), write code that supports:
- next URL (returns the next URL in the file, possibly destructively) (NB: destructively means to your data, not to the web page!)
- returns "" or null or whatever the appropriate empty value is when there are no more URLs in the file
- what error conditions?
Alternatively, just return a list of URLs from the file all at once (e.g., using filter)
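A minimal sketch of the flattened-list approach (extract-urls is an illustrative name): in flattened shtml, every href attribute shows up as the symbol href immediately followed by the URL string, so cdr down the list and collect whatever follows each href:

(define (extract-urls tokens)
  (cond ((or (null? tokens) (null? (cdr tokens))) '())
        ((eq? (car tokens) 'href)
         (cons (cadr tokens) (extract-urls (cddr tokens))))
        (else (extract-urls (cdr tokens)))))

;; e.g. (extract-urls (flatten (html->shtml (wget "http://s-whispers.olin.edu/temp/crawl/"))))

And a minimal sketch of the stateful iterator version of the challenge, built on extract-urls; make-url-iterator is an illustrative name, and the set! is destructive only to our own list, never to the web page:

(define (make-url-iterator html-string)
  (let ((remaining (extract-urls (flatten (html->shtml html-string)))))
    (lambda ()
      (if (null? remaining)
          ""                                  ; nothing left in this file
          (let ((next (car remaining)))
            (set! remaining (cdr remaining))
            next)))))

;; usage: (define next-url (make-url-iterator some-html))
;;        (next-url) returns the first URL, then the second, ..., then ""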
Write Queue
- build a (possibly bounded) queue (abstract data type) -- a rough sketch follows this list
- deal with errors
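A minimal sketch of one possible bounded queue; the representation (the size bound consed onto a list of items) and all of the names are illustrative, not anything built into PLT Scheme:

(define (make-queue bound) (cons bound '()))            ; (bound . items)
(define (queue-empty? q) (null? (cdr q)))
(define (queue-full? q) (>= (length (cdr q)) (car q)))
(define (enqueue q item)
  (if (queue-full? q)
      q                                                 ; or signal an error instead
      (cons (car q) (append (cdr q) (list item)))))
(define (queue-front q)
  (if (queue-empty? q)
      (error "queue-front: empty queue")
      (cadr q)))
(define (queue-rest q) (cons (car q) (cddr q)))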
Web Crawler
- using wget, your URL-extractor, and the queue: wget a url, then find its urls, sticking them in the queue (until it fills)
- when done with that file, pull the next url off the queue and do the same thing to it (a sketch of this loop follows this list)
Advanced: respect robots.txt and, if possible, ROBOTS meta tag (see below)
Advanced: do something interesting with the URLs as they go by!
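A minimal sketch of the crawl loop, assuming the wget, flatten, extract-urls, and queue sketches above (crawl and enqueue-all are illustrative names); note the bound on how many pages get visited, and note that a real crawler shouldn't die when wget hits a bad URL:

(define (enqueue-all q urls)
  (if (null? urls)
      q
      (enqueue-all (enqueue q (car urls)) (cdr urls))))

(define (crawl start-url max-pages)
  (let loop ((q (enqueue (make-queue 100) start-url))   ; 100 queued urls is plenty
             (visited 0))
    (if (or (queue-empty? q) (>= visited max-pages))
        'done
        (let* ((url (queue-front q))
               (html (wget url)))
          (display url) (newline)                       ; "something interesting" goes here
          (loop (enqueue-all (queue-rest q)
                             (extract-urls (flatten (html->shtml html))))
                (+ visited 1))))))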
WAY advanced stuff
Make wget/crawler smarter: this requires grovelling through the url package documentation in PLT Scheme to figure out how to extract the hostname (a sketch of the robots.txt check is at the end of this section)
- deal with DISALLOW
- check whether there's a robots.txt file at the top level of the server
- if so, don't wget the URL
- don't be fooled by 404
- this requires manipulating the url structure
- be smarter about robots.txt
robots tutorial gives info about robots.txt
- don't die if robots.txt is ill-formed
- add ROBOTS meta tag
documented in the robots meta tag page
- do a dumb version and a smarter one (a dumb-version sketch follows this list)
- don't die if ROBOTS or html are ill formed
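A minimal sketch of the dumb robots.txt check, using PLT Scheme's url library (string->url, url-host, and get-impure-port come from that library); has-robots-txt? is an illustrative name. get-impure-port keeps the headers, so the first line read is the status line, and a "404 Not Found" page won't be mistaken for a real robots.txt:

(require (lib "url.ss" "net"))

(define (has-robots-txt? page-url)
  (let* ((host (url-host (string->url page-url)))
         (port (get-impure-port
                (string->url (string-append "http://" host "/robots.txt"))))
         (status (read-line port)))                     ; e.g. "HTTP/1.1 200 OK"
    (close-input-port port)
    (and (string? status)
         (regexp-match " 200 " status)
         #t)))

;; dumb usage: skip any site where (has-robots-txt? url) returns #t

And a minimal sketch of the dumb ROBOTS meta tag check over the flattened shtml (robots-meta-forbids? is an illustrative name); in flattened shtml, <meta name="robots" content="..."> shows up as ... meta @ name "robots" content "..." ...:

(define (robots-meta-forbids? tokens)
  ;; dumb version: any name="robots" attribute at all means "stay away";
  ;; a smarter version would look at the content value for NOINDEX/NOFOLLOW
  (cond ((or (null? tokens) (null? (cdr tokens))) #f)
        ((and (eq? (car tokens) 'name)
              (string? (cadr tokens))
              (string-ci=? (cadr tokens) "robots"))
         #t)
        (else (robots-meta-forbids? (cdr tokens)))))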
What to turn in when
NB: I have heard from a lot of people that they're having trouble. It's OK not to be done as long as you've tried and understand where you're wedged.
What this used to say: Bring your code with you (on paper and electronically) to the next lab session. Also email a copy to las. Hopefully your code is readable and as documented as it needs to be....