
Notes for Lecture 1 - FoCS Project - Tuesday, September 7th, 2004

Feeling of the Class

  1. The regular class is not going to be about a lot of programming. These first two classes have been very programming intensive, but the work we did with the orders of magnitude today is more typical of what class will be like for the rest of the semester.
  2. In project we will focus more on web programming (for example, semantic web stuff). The project for this week is to program a very simple queue-based web crawler. Add bells and whistles if it doesn't take up enough of your time, for example:
    • honor the robots.txt and robot META tags
    • implement a way to avoid revisiting pages the program has already visited
  3. More specifically, the base assignment is:
    • get the source of a URL
    • extract the URLs from the links in the page into a queue
    • iterate over the queue of URLs, collecting more links into the queue for each page
  4. Scheme is recommended for this assignment; however, you can use Python or Java if you like.
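
The base assignment and the "avoid revisiting" extra can be sketched as one loop over a queue. This is only a minimal sketch, not the assigned solution: the names crawl, get-links, max-pages, fake-web and fake-links are hypothetical, and the page fetching is replaced by a lookup in a hand-written table so the example runs without a network. It does not look at robots.txt.

```scheme
;; crawl: breadth-first walk over the pages reachable from start-url.
;; get-links is a caller-supplied procedure: url -> list of urls (hypothetical).
;; max-pages bounds the crawl so it always stops.
(define (crawl start-url get-links max-pages)
  (let loop ((queue (list start-url))   ; URLs waiting to be fetched
             (visited '()))             ; URLs already fetched, newest first
    (cond
      ((or (null? queue) (>= (length visited) max-pages))
       (reverse visited))
      ((member (car queue) visited)     ; already seen: skip it
       (loop (cdr queue) visited))
      (else
       (let ((url (car queue)))
         (loop (append (cdr queue) (get-links url)) ; enqueue at the back
               (cons url visited)))))))

;; A made-up "web": each entry is a page followed by the pages it links to.
(define fake-web
  '(("a" "b" "c")
    ("b" "a" "d")
    ("c" "d")
    ("d")))

;; Stand-in for "fetch the page and extract its links".
(define (fake-links url)
  (let ((entry (assoc url fake-web)))
    (if entry (cdr entry) '())))

(define crawl-order (crawl "a" fake-links 10))
;; crawl-order is ("a" "b" "c" "d"): each page appears once, in queue order
```

In a real crawler, get-links would fetch the page and pull the URLs out of its links. Keeping it as a parameter makes the queue logic easy to try out on its own.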

The Assignment Description (email from Lynn should be along Real Soon) will include a list of things you should be careful not to do.

url.ss - This PLT library comes with DrScheme. It lets you open connections to web pages, fetch their content, etc.

htmlprag.ss - Download and install this PLT file. It's a library that lets you manipulate HTML more easily in Scheme. After you install it, you treat it like a regular module. The whole download/install thing can be managed in DrScheme by using the File > Install PLT file option. The library is at http://www.neilvandyke.org/htmlprag/htmlprag-0-12.plt and there's some further documentation on the htmlprag site.

require is how you load libraries/modules in PLT Scheme. The syntax is (require (lib "file.ss" "collection")), as in the examples below.

Some code:

Get a connection ("port") to a website:

(require (lib "url.ss" "net"))
(define pt (get-pure-port (string->url "http://www.olin.edu/")))

A pure port gives you the content of the page, stripped of its header. An impure port gives you the unstripped page, complete with header information.

Reading all text from a page:

;; Read Scheme tokens from port until end of file; return them as a list.
(define (read-it-all port)
  (let ((token (read port)))
    (if (eof-object? token)
        '()
        (cons token (read-it-all port)))))
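
To try the reading loop without a network connection, a string port can stand in for the web port. The sketch below is a tail-recursive variant under a hypothetical name, read-all-tokens; it is not the function from class. Note that read parses its input as Scheme data, so the tokens are symbols, numbers and lists rather than HTML tags.

```scheme
;; Tail-recursive variant of the reading loop: collects tokens in reverse,
;; then flips the list once at end of input.
(define (read-all-tokens port)
  (let loop ((tokens '()))
    (let ((token (read port)))
      (if (eof-object? token)
          (reverse tokens)
          (loop (cons token tokens))))))

;; A string port stands in for the port of a web page here.
(define sample-port (open-input-string "hello (a b) 42"))
(define sample-tokens (read-all-tokens sample-port))
;; sample-tokens is the list (hello (a b) 42)
```

Building the list in reverse keeps the loop in tail position, which matters for long pages.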

Getting the first n tokens/words from the read information:

;; Return the first n elements of list (assumes list has at least n elements).
(define first-n
  (lambda (n list)
    (if (= n 0)
        '()
        (cons (car list) (first-n (- n 1) (cdr list))))))
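
A first-n that takes the car n times raises an error when the list has fewer than n elements, which can easily happen with a short page. One possible guard, sketched under the hypothetical name first-n-safe, is to stop as soon as the list runs out:

```scheme
;; Like first-n, but returns the whole list when it has fewer than n elements.
(define (first-n-safe n lst)
  (if (or (= n 0) (null? lst))
      '()
      (cons (car lst) (first-n-safe (- n 1) (cdr lst)))))

(define two-of-three (first-n-safe 2 '(a b c)))   ; (a b)
(define five-of-three (first-n-safe 5 '(a b c)))  ; (a b c), no error
```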

htmlprag gives better HTML tokens than read, which tokenizes the page as Scheme data rather than as HTML.

Convert text HTML to Scheme-HTML:

(require (lib "htmlprag.ss" "htmlprag"))
(html->shtml (get-pure-port (string->url "http://focs.olin.edu/")))

Handy dandy helper function (not defined in class):

(define urlstring->port 
  (lambda (urlstring)
    (get-pure-port (string->url urlstring))))

(html->shtml (urlstring->port "http://focs.olin.edu/"))
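
Once a page is in Scheme-HTML form, pulling out the link targets is ordinary list processing. The sketch below assumes the tree keeps attributes in (@ (name "value") ...) lists, as in SXML. The name collect-hrefs and the sample tree are hypothetical; the tree is written by hand rather than produced by html->shtml, so the example runs without htmlprag installed.

```scheme
;; Collect every href attribute value found anywhere in an SHTML-style tree.
(define (collect-hrefs node)
  (cond
    ((not (pair? node)) '())            ; strings, symbols, empty list
    ((and (eq? (car node) 'href)        ; an (href "...") attribute
          (pair? (cdr node))
          (string? (cadr node)))
     (list (cadr node)))
    (else                               ; any other list: visit each item
     (apply append (map collect-hrefs node)))))

;; Hand-written sample tree in the assumed SHTML shape.
(define sample-page
  '(html (body (p "Intro " (a (@ (href "http://focs.olin.edu/")) "FoCS"))
               (a (@ (href "/about.html") (class "nav")) "About"))))

(define sample-links (collect-hrefs sample-page))
;; sample-links is ("http://focs.olin.edu/" "/about.html")
```

Relative links such as "/about.html" would still need to be resolved against the page's own URL before they go into the crawler's queue.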

2013-07-17 10:43