Potential FoCS projects
...and their link-ins with CourseTopics
Semantic web: FOAF
I built a few programs that process FOAF files: .rdf documents meant to describe a person/organization and his/her/its connections to other people/organizations in a machine-readable format. Some points to bring out:
Data structures -- how the data is stored drastically affects what you can do with it. Tried (the first two are sketched just after this list):
- Lists of tuples
- Nested dictionaries (hashes)
- Binary tree (don't ask me why)
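A minimal Python sketch of those first two, with made-up names, to show why the choice matters:

    # Hypothetical FOAF-ish data under two of the storage schemes above.

    # 1. List of tuples: (person, property, value) -- easy to scan, no indexing.
    triples = [
        ("alice", "knows", "bob"),
        ("alice", "mbox", "alice@example.org"),
        ("bob",   "knows", "carol"),
    ]

    # 2. Nested dictionaries (hashes): person -> property -> list of values.
    people = {}
    for person, prop, value in triples:
        people.setdefault(person, {}).setdefault(prop, []).append(value)

    # "Who does alice know?" is now a direct lookup instead of a full scan:
    print(people["alice"]["knows"])   # ['bob']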
Graph algorithms -- help you express and navigate connections between FOAF entities. Tried:
- Breadth-first shortest-path connecting person A to person B
- List-style (adjacency-list) representation of the person-space, undirected
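Here's roughly what the breadth-first version looks like over an adjacency-list person-space (a sketch with made-up names, not the code I actually wrote):

    from collections import deque

    def shortest_path(graph, start, goal):
        """Breadth-first search; graph maps a person to the people they know."""
        queue = deque([[start]])
        seen = {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for neighbor in graph.get(path[-1], []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(path + [neighbor])
        return None   # A and B aren't connected at all

    knows = {"alice": ["bob"], "bob": ["alice", "carol"], "carol": ["bob"]}
    print(shortest_path(knows, "alice", "carol"))   # ['alice', 'bob', 'carol']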
General algorithms -- searching and sorting are sometimes applicable, depending on how data is being stored and processed. Tried:
- Under tuple-list storage, used mergesort and the cutting-out-early variety of search (quit scanning a sorted list once you're past where the key would be).
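Reading "cutting-out-early" as a linear scan that gives up once it has passed where the key would sit (which only makes sense on sorted data), the pair looks something like:

    def mergesort(items):
        """Classic mergesort: split, sort halves, merge."""
        if len(items) <= 1:
            return items
        mid = len(items) // 2
        left, right = mergesort(items[:mid]), mergesort(items[mid:])
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        return merged + left[i:] + right[j:]

    def early_exit_search(sorted_items, key):
        for item in sorted_items:
            if item == key:
                return True
            if item > key:    # past the key's slot, so it can't be here
                return False
        return False

    names = mergesort(["carol", "alice", "bob"])
    print(early_exit_search(names, "bob"))    # True
    print(early_exit_search(names, "dave"))   # False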
There is probably an IR connection here too, but I haven't quite poked it hard enough.
Prolog
Lends itself to depth-first search, once you get a handle on how backtracking works
- Is a bit of a mind#@!%, which I think is generally a Good Thing
Lynn said something to this effect: Bad: "Prolog is weird and confusing and I don't get it and it sucks." Good: "Prolog is weird and I don't get it -- hey neat! I don't understand all of CS yet; there's more to life than C++!" Great: "Prolog is weird, but I can see how it'd be really useful if you wanted to do m, n, or q."
Writing text games is easy, and has links to backtracking and recursive programming. I wrote HuntTheWumpus in half a morning. This is cool because there's a basic implementation anyone should be able to figure out, but it's extensible up the wazoo for people who like to show off.
Random Preliminary Thoughts
Ethereal handwaving:
- It needs to be something with several parts to it, that we could look at in turn, and cycle back and revise as we learn more. Parts should imply each other or otherwise lead into each other obviously -- it needs to reek of opportunity. (?)
Potential topics:
- Semantic web
- webcrawling in general
Wouldn't it be cool if we could end the course having produced something informative and useful for the rest of the world?
- Semantic web
- the Time is definitely Right -- the cutting edge of the field is NOW
- Olin tends to be good at, or at least tends to "do", SW stuff anyhow
- Little to nothing exists on the web yet approaching SW from this direction
Exploring the Web Crawler theme
Consider using the Google API (1000 queries a day)
- Includes Java support
Consider using XML tools: Cocoon, XSL
Write Wget:
- given a URL (presumably represented as a string), retrieve its contents
- have an option to display the contents
- have an option to save the contents to a file
- what are the error conditions?
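In Python this whole spec is a few lines over the stdlib (urllib.request is the modern name for the old urllib/urllib2); a sketch:

    import sys
    import urllib.request

    def wget(url, save_as=None, display=False):
        """Fetch url; optionally print the contents and/or save them to a file.
        Error conditions to think about: bad hostname or refused connection
        (urllib.error.URLError), 404s and friends (HTTPError), and a malformed
        URL (ValueError) -- all of which propagate to the caller here."""
        with urllib.request.urlopen(url) as response:
            contents = response.read()
        if display:
            sys.stdout.write(contents.decode("utf-8", errors="replace"))
        if save_as is not None:
            with open(save_as, "wb") as f:
                f.write(contents)
        return contents

    # wget("http://example.com/", save_as="page.html", display=True)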
Extract info from HTML:
- extend Wget to extract some piece of information from the file, e.g., the title
- don't die if the title is missing or the HTML is ill-formed
- what other error conditions?
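A sketch using the stdlib's forgiving parser (html.parser, the Python 3 name for HTMLParser), which won't blow up on ill-formed markup and just returns None when there's no title:

    from html.parser import HTMLParser

    class TitleGrabber(HTMLParser):
        """Captures the text of the first <title> element seen."""
        def __init__(self):
            super().__init__()
            self.in_title = False
            self.title = None

        def handle_starttag(self, tag, attrs):
            if tag == "title" and self.title is None:
                self.in_title = True

        def handle_data(self, data):
            if self.in_title:
                self.title = data.strip()
                self.in_title = False

        def handle_endtag(self, tag):
            if tag == "title":
                self.in_title = False

    def extract_title(html):
        parser = TitleGrabber()
        parser.feed(html)
        return parser.title

    print(extract_title("<html><head><title>Hi</title></head>"))   # Hi
    print(extract_title("<p>no title, unclosed tag"))               # None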
Make wget smarter (a stdlib-based sketch follows this list)
- check whether there's a robots.txt file at the top level of the server
  - if there is one, don't wget the URL at all (the deliberately dumb version)
  - don't be fooled by a 404 (an error page is not a robots.txt)
- be smarter about robots.txt: actually parse it and deal with its DISALLOW lines
  - the robots tutorial gives info about robots.txt
  - don't die if robots.txt is ill-formed
- add ROBOTS meta tag support
  - documented in the robots meta tag page
  - do a dumb version and a smarter one
  - don't die if the ROBOTS tag or the html is ill-formed
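Python's stdlib already does most of this: urllib.robotparser fetches and parses robots.txt, skips unparseable lines, and treats a 404 as "no restrictions" rather than being fooled by an error page. A sketch (needs network access; the user-agent string is made up):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()   # a 404 here just means everything is allowed

    if rp.can_fetch("FocsCrawler/0.1", "http://example.com/some/page.html"):
        print("ok to fetch")
    else:
        print("DISALLOWed; skip it")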
Iterator
- given a url (or some html), write code that supports:
- next URL (returns the next URL in the file, possibly destructively) (NB: destructively means to your data, not to the web page!)
- returns "" or null or whatever the appropriate empty value is when there are no more URLs in the file
- what error conditions?
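In Python the natural shape for this is a generator; next() with a default gives you the "empty value when there are no more URLs" behavior for free. A sketch:

    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        """Collects href attributes from <a> tags; tolerant of bad HTML."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    def iter_urls(html):
        collector = LinkCollector()
        collector.feed(html)
        yield from collector.links

    urls = iter_urls('<a href="http://a.example/">a</a> <a href="/b.html">b</a>')
    print(next(urls, None))   # http://a.example/
    print(next(urls, None))   # /b.html
    print(next(urls, None))   # None -- no more URLs in the file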
Write Queue
- build a (possibly bounded) queue (abstract data type)
- deal with errors
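A bounded queue is small enough to write from scratch; this sketch raises on overflow and underflow rather than failing silently:

    class BoundedQueue:
        """FIFO queue with an optional capacity."""
        def __init__(self, capacity=None):
            self.items = []
            self.capacity = capacity

        def is_full(self):
            return self.capacity is not None and len(self.items) >= self.capacity

        def is_empty(self):
            return not self.items

        def enqueue(self, item):
            if self.is_full():
                raise OverflowError("queue is full")
            self.items.append(item)

        def dequeue(self):
            if self.is_empty():
                raise IndexError("queue is empty")
            return self.items.pop(0)

    q = BoundedQueue(capacity=2)
    q.enqueue("a"); q.enqueue("b")
    print(q.dequeue())   # a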
Web Crawler
- using wget, iterator, and queue, wget a url, then iterate over its urls, sticking them in the queue (until it fills)
- when done with that file, pull the next url off the queue and iterate over it
- respect robots.txt and, if possible, ROBOTS meta tag
- do something interesting with the URLs as they go by!
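Gluing the pieces together looks something like this self-contained sketch (stdlib only; the limits, user-agent, and print() are placeholders for the "something interesting"):

    import urllib.request
    import urllib.robotparser
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    def allowed(url, agent="FocsCrawler/0.1"):
        """Check robots.txt at the top level of url's server."""
        root = "{0.scheme}://{0.netloc}".format(urlparse(url))
        rp = urllib.robotparser.RobotFileParser(root + "/robots.txt")
        try:
            rp.read()
        except OSError:
            return True   # couldn't reach robots.txt: assume allowed
        return rp.can_fetch(agent, url)

    def crawl(start_url, max_pages=10, max_queue=100):
        queue, seen = deque([start_url]), {start_url}
        while queue and max_pages > 0:
            url = queue.popleft()
            if not allowed(url):
                continue
            try:
                with urllib.request.urlopen(url) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue   # dead link, bad host, etc.: just move on
            max_pages -= 1
            print(url)    # <- do something interesting with it here
            collector = LinkCollector()
            collector.feed(html)
            for link in collector.links:
                absolute = urljoin(url, link)
                if (urlparse(absolute).scheme in ("http", "https")
                        and absolute not in seen and len(queue) < max_queue):
                    seen.add(absolute)
                    queue.append(absolute)

    # crawl("http://example.com/")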
Changing Queue
- reimplement the queue as a file rather than in-memory data storage
- ideally, this shouldn't change your implementation of the web crawler
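One way to do it: keep the items one per line in a text file, appending at the end and remembering how far we've read. Same enqueue/dequeue interface as before, so the crawler shouldn't notice the swap (assumes items never contain newlines):

    class FileQueue:
        """Queue stored in a file, one item per line."""
        def __init__(self, path):
            self.path = path
            self.read_pos = 0
            open(path, "w").close()   # start with an empty queue file

        def enqueue(self, item):
            with open(self.path, "a") as f:
                f.write(item + "\n")

        def dequeue(self):
            with open(self.path) as f:
                f.seek(self.read_pos)
                line = f.readline()
                self.read_pos = f.tell()
            if not line:
                raise IndexError("queue is empty")
            return line.rstrip("\n")

    q = FileQueue("queue.txt")
    q.enqueue("http://example.com/")
    print(q.dequeue())   # http://example.com/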
Changing Crawler
- this crawler deals with URLs. what else can you crawl? (Amazon book recommendations. Javadoc. FOAF.....)
- what interesting things can you do with the data you have retrieved?
Some Next Steps
- Write web crawler in Java, Python, Scheme
Python web crawler/iterator/queue: WgetCrawler
- Java web crawler/iterator/queue should be trivially similar
Scheme web crawler is a bit more complex; you have to do a lot of things By Hand (unless I'm missing something). See more notes on the WgetCrawler page. -kmr
- involves identifying appropriate libraries
- Java
  java.net.URL and java.io
- Python
  urllib, httplib, HTMLParser; sys and re are also helpful
- Scheme
  Guile (the net or www modules would work if we can find them; the info online is a bit old)
  PLT DrScheme (MzScheme) -- works! use the url.ss submodule in net, in language "PLT Textual (MzScheme)"
- Write html parser that simply builds a tagged parse tree (a rough sketch appears after these notes)
java: wrote HtmlOrganizer; uses java.net.URL, java.io.BufferedReader, and an assortment of java.util stuff.
python: also in HtmlOrganizer. uses urllib, sys, and re.
- yes, there are libraries to do this. should we be using them?
- should we be using lex/yacc?
- find lex/yacc tools for Java, Python, Scheme
Lex/YACC explanation: brilliant, but unfortunately C-based
Lex & Yacc: also C-based, but goes slower and in more depth; very clear
- Java
JavaCC: a YACC-style parser generator for Java
- Python
PLY: Python Lex/YACC (the page also has pointers to other tools)
- am assuming we're mostly only responsible for compliant html; obviously this is a Bad Assumption
- is there a way to invoke html tidy or a validator on it so we can barf cleanly instead of mid-execution?
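For the record, here's the shape of a dumb tree builder in Python (nested lists, stdlib html.parser; not HtmlOrganizer itself, just a sketch that tolerates stray close tags and void elements):

    from html.parser import HTMLParser

    class TreeBuilder(HTMLParser):
        """Builds [tagname, child, child, ...] nodes; text chunks stay strings."""
        VOID = {"br", "hr", "img", "meta", "link", "input"}

        def __init__(self):
            super().__init__()
            self.root = ["document"]
            self.stack = [self.root]

        def handle_starttag(self, tag, attrs):
            node = [tag]
            self.stack[-1].append(node)
            if tag not in self.VOID:   # void elements have no children
                self.stack.append(node)

        def handle_endtag(self, tag):
            # tolerate stray close tags instead of dying on bad html
            if len(self.stack) > 1 and self.stack[-1][0] == tag:
                self.stack.pop()

        def handle_data(self, data):
            if data.strip():
                self.stack[-1].append(data.strip())

    builder = TreeBuilder()
    builder.feed("<html><body><p>hi <b>there</b></p></body></html>")
    print(builder.root)
    # ['document', ['html', ['body', ['p', 'hi', ['b', 'there']]]]]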
Identify RegExp libraries for simplified html screen scraping
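Often you don't need a library at all; for simple pages a regex does the job, with the usual caveat that regexes can't truly parse HTML (comments, script blocks, and exotic quoting will fool this):

    import re

    # Quick-and-dirty href extraction.
    HREF = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

    html = '<a href="http://a.example/">a</a> <A HREF=\'/b.html\'>b</A>'
    print(HREF.findall(html))   # ['http://a.example/', '/b.html']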
- Either write or identify libraries for RDF parsing
rdfxml is a year old, but works -- extracts triples out of rdf; I used it a lot earlier this summer. See FoafingStartpage for more content. -kmr
- I think the xml namespace piece is the only genuinely hard part of this
- one possible output would be a graph
- another possible output would be a database of triples
- this would provide an opportunity for relational algebra tutorial....
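To make the relational-algebra connection concrete: once the triples are in hand (here hard-coded; in practice they'd come from something like rdfxml), SELECT and JOIN fall out in a few lines:

    # Hypothetical triples, shaped like what an RDF parser hands back.
    triples = [
        ("alice", "foaf:knows", "bob"),
        ("bob",   "foaf:knows", "carol"),
        ("alice", "foaf:name",  "Alice"),
        ("bob",   "foaf:name",  "Bob"),
    ]

    def select(rows, pred):
        """Relational SELECT: keep the rows satisfying a predicate."""
        return [r for r in rows if pred(r)]

    def join(rows_a, rows_b):
        """Join rows where a's object equals b's subject."""
        return [(a, b) for a in rows_a for b in rows_b if a[2] == b[0]]

    knows = select(triples, lambda r: r[1] == "foaf:knows")
    names = select(triples, lambda r: r[1] == "foaf:name")

    # "names of people alice knows" = select, join, then project the name column
    for _knows_row, name_row in join(select(knows, lambda r: r[0] == "alice"), names):
        print(name_row[2])   # Bob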
- Forth
Minotaur, which apparently is very much like Forth, lets you use Tcl/Perl/Python from any of the three.