Potential FoCS projects
...and their link-ins with CourseTopics
Semantic web: FOAF
I built a few programs that process FOAF files: .rdf documents meant to describe a person/organization and his/her/its connections to other people/organizations in a machine-readable format. Some points to bring out:
Data structures -- how the data is stored drastically affects what you can do with it. Tried (the first two are sketched just after this list):
- Lists of tuples
- Nested dictionaries (hashes)
- Binary tree (don't ask me why)
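A minimal Python sketch of those first two, with made-up names, to show why the choice matters:

    # Hypothetical FOAF-ish data under two of the storage schemes above.

    # 1. List of tuples: (person, property, value) -- easy to scan, no indexing.
    triples = [
        ("alice", "knows", "bob"),
        ("alice", "mbox", "alice@example.org"),
        ("bob",   "knows", "carol"),
    ]

    # 2. Nested dictionaries (hashes): person -> property -> list of values.
    people = {}
    for person, prop, value in triples:
        people.setdefault(person, {}).setdefault(prop, []).append(value)

    # "Who does alice know?" is now a direct lookup instead of a full scan:
    print(people["alice"]["knows"])   # ['bob']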
Graph algorithms -- help you express and navigate connections between FOAF entities. Tried:
- Breadth-first shortest-path connecting person A to person B
- List-style (adjacency-list) representation of the person-space, undirected
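Here's roughly what the breadth-first version looks like over an adjacency-list person-space (a sketch with made-up names, not the code I actually wrote):

    from collections import deque

    def shortest_path(graph, start, goal):
        """Breadth-first search; graph maps a person to the people they know."""
        queue = deque([[start]])
        seen = {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for neighbor in graph.get(path[-1], []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(path + [neighbor])
        return None   # A and B aren't connected at all

    knows = {"alice": ["bob"], "bob": ["alice", "carol"], "carol": ["bob"]}
    print(shortest_path(knows, "alice", "carol"))   # ['alice', 'bob', 'carol']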
General algorithms -- searching and sorting are sometimes applicable, depending on how data is being stored and processed. Tried:
- Under tuple-list storage, used mergesort and the cutting-out-early variety of search (quit scanning a sorted list once you're past where the key would be).
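Reading "cutting-out-early" as a linear scan that gives up once it has passed where the key would sit (which only makes sense on sorted data), the pair looks something like:

    def mergesort(items):
        """Classic mergesort: split, sort halves, merge."""
        if len(items) <= 1:
            return items
        mid = len(items) // 2
        left, right = mergesort(items[:mid]), mergesort(items[mid:])
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        return merged + left[i:] + right[j:]

    def early_exit_search(sorted_items, key):
        for item in sorted_items:
            if item == key:
                return True
            if item > key:    # past the key's slot, so it can't be here
                return False
        return False

    names = mergesort(["carol", "alice", "bob"])
    print(early_exit_search(names, "bob"))    # True
    print(early_exit_search(names, "dave"))   # False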
There is probably an IR connection here too, but I haven't quite poked it hard enough.
Prolog
Lends itself to depth-first search, once you get a handle on how backtracking works
- Is a bit of a mind#@!%, which I think is generally a Good Thing
Lynn said something to this effect: Bad: "Prolog is weird and confusing and I don't get it and it sucks." Good: "Prolog is weird and I don't get it -- hey neat! I don't understand all of CS yet; there's more to life than C++!" Great: "Prolog is weird, but I can see how it'd be really useful if you wanted to do m, n, or q."
Writing text games is easy, and has links to backtracking and recursive programming. I wrote HuntTheWumpus in half a morning. This is cool because there's a basic implementation anyone should be able to figure out, but it's extensible up the wazoo for people who like to show off.
Random Preliminary Thoughts
Ethereal handwaving:
- It needs to be something with several parts to it, that we could look at in turn, and cycle back and revise as we learn more. Parts should imply each other or otherwise lead into each other obviously -- it needs to reek of opportunity. (?)
Potential topics:
- Semantic web
- webcrawling in general
Wouldn't it be cool if we could end the course having produced something informative and useful for the rest of the world?
- Semantic web
- the Time is definitely Right -- the cutting edge of the field is NOW
- Olin tends to be good at, or at least tends to "do", SW stuff anyhow
- Little to nothing exists on the web yet approaching SW from this direction
Exploring the Web Crawler theme
Consider using the Google API (1000 queries a day)
- Includes Java support
Consider using XML tools: Cocoon, XSL
Write Wget:
- given a URL (presumably represented as a string), retrieve its contents
- have an option to display the contents
- have an option to save the contents to a file
- what are the error conditions?
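In Python this whole spec is a few lines over the stdlib (urllib.request is the modern name for the old urllib/urllib2); a sketch:

    import sys
    import urllib.request

    def wget(url, save_as=None, display=False):
        """Fetch url; optionally print the contents and/or save them to a file.
        Error conditions to think about: bad hostname or refused connection
        (urllib.error.URLError), 404s and friends (HTTPError), and a malformed
        URL (ValueError) -- all of which propagate to the caller here."""
        with urllib.request.urlopen(url) as response:
            contents = response.read()
        if display:
            sys.stdout.write(contents.decode("utf-8", errors="replace"))
        if save_as is not None:
            with open(save_as, "wb") as f:
                f.write(contents)
        return contents

    # wget("http://example.com/", save_as="page.html", display=True)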
Extract info from HTML:
- extend Wget to extract some piece of information from the file, e.g., the title
- don't die if the title is missing or the HTML is ill-formed
- what other error conditions?
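A sketch using the stdlib's forgiving parser (html.parser, the Python 3 name for HTMLParser), which won't blow up on ill-formed markup and just returns None when there's no title:

    from html.parser import HTMLParser

    class TitleGrabber(HTMLParser):
        """Captures the text of the first <title> element seen."""
        def __init__(self):
            super().__init__()
            self.in_title = False
            self.title = None

        def handle_starttag(self, tag, attrs):
            if tag == "title" and self.title is None:
                self.in_title = True

        def handle_data(self, data):
            if self.in_title:
                self.title = data.strip()
                self.in_title = False

        def handle_endtag(self, tag):
            if tag == "title":
                self.in_title = False

    def extract_title(html):
        parser = TitleGrabber()
        parser.feed(html)
        return parser.title

    print(extract_title("<html><head><title>Hi</title></head>"))   # Hi
    print(extract_title("<p>no title, unclosed tag"))               # None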
Make wget smarter (a stdlib-based sketch follows this list)
- check whether there's a robots.txt file at the top level of the server
  - if there is one, don't wget the URL at all (the deliberately dumb version)
  - don't be fooled by a 404 (an error page is not a robots.txt)
- be smarter about robots.txt: actually parse it and deal with its DISALLOW lines
  - the robots tutorial gives info about robots.txt
  - don't die if robots.txt is ill-formed
- add ROBOTS meta tag support
  - documented in the robots meta tag page
  - do a dumb version and a smarter one
  - don't die if the ROBOTS tag or the html is ill-formed
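Python's stdlib already does most of this: urllib.robotparser fetches and parses robots.txt, skips unparseable lines, and treats a 404 as "no restrictions" rather than being fooled by an error page. A sketch (needs network access; the user-agent string is made up):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()   # a 404 here just means everything is allowed

    if rp.can_fetch("FocsCrawler/0.1", "http://example.com/some/page.html"):
        print("ok to fetch")
    else:
        print("DISALLOWed; skip it")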
Iterator
- given a url (or some html), write code that supports:
- next URL (returns the next URL in the file, possibly destructively) (NB: destructively means to your data, not to the web page!)
- returns "" or null or whatever the appropriate empty value is when there are no more URLs in the file
- what error conditions?
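In Python the natural shape for this is a generator; next() with a default gives you the "empty value when there are no more URLs" behavior for free. A sketch:

    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        """Collects href attributes from <a> tags; tolerant of bad HTML."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    def iter_urls(html):
        collector = LinkCollector()
        collector.feed(html)
        yield from collector.links

    urls = iter_urls('<a href="http://a.example/">a</a> <a href="/b.html">b</a>')
    print(next(urls, None))   # http://a.example/
    print(next(urls, None))   # /b.html
    print(next(urls, None))   # None -- no more URLs in the file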
Write Queue
- build a (possibly bounded) queue (abstract data type)
- deal with errors
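A bounded queue is small enough to write from scratch; this sketch raises on overflow and underflow rather than failing silently:

    class BoundedQueue:
        """FIFO queue with an optional capacity."""
        def __init__(self, capacity=None):
            self.items = []
            self.capacity = capacity

        def is_full(self):
            return self.capacity is not None and len(self.items) >= self.capacity

        def is_empty(self):
            return not self.items

        def enqueue(self, item):
            if self.is_full():
                raise OverflowError("queue is full")
            self.items.append(item)

        def dequeue(self):
            if self.is_empty():
                raise IndexError("queue is empty")
            return self.items.pop(0)

    q = BoundedQueue(capacity=2)
    q.enqueue("a"); q.enqueue("b")
    print(q.dequeue())   # a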
Web Crawler
- using wget, iterator, and queue, wget a url, then iterate over its urls, sticking them in the queue (until it fills)
- when done with that file, pull the next url off the queue and iterate over it
- respect robots.txt and, if possible, ROBOTS meta tag
- do something interesting with the URLs as they go by!
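Gluing the pieces together looks something like this self-contained sketch (stdlib only; the limits, user-agent, and print() are placeholders for the "something interesting"):

    import urllib.request
    import urllib.robotparser
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    def allowed(url, agent="FocsCrawler/0.1"):
        """Check robots.txt at the top level of url's server."""
        root = "{0.scheme}://{0.netloc}".format(urlparse(url))
        rp = urllib.robotparser.RobotFileParser(root + "/robots.txt")
        try:
            rp.read()
        except OSError:
            return True   # couldn't reach robots.txt: assume allowed
        return rp.can_fetch(agent, url)

    def crawl(start_url, max_pages=10, max_queue=100):
        queue, seen = deque([start_url]), {start_url}
        while queue and max_pages > 0:
            url = queue.popleft()
            if not allowed(url):
                continue
            try:
                with urllib.request.urlopen(url) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue   # dead link, bad host, etc.: just move on
            max_pages -= 1
            print(url)    # <- do something interesting with it here
            collector = LinkCollector()
            collector.feed(html)
            for link in collector.links:
                absolute = urljoin(url, link)
                if (urlparse(absolute).scheme in ("http", "https")
                        and absolute not in seen and len(queue) < max_queue):
                    seen.add(absolute)
                    queue.append(absolute)

    # crawl("http://example.com/")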
Changing Queue
- reimplement the queue as a file rather than in-memory data storage
- ideally, this shouldn't change your implementation of the web crawler
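One way to do it: keep the items one per line in a text file, appending at the end and remembering how far we've read. Same enqueue/dequeue interface as before, so the crawler shouldn't notice the swap (assumes items never contain newlines):

    class FileQueue:
        """Queue stored in a file, one item per line."""
        def __init__(self, path):
            self.path = path
            self.read_pos = 0
            open(path, "w").close()   # start with an empty queue file

        def enqueue(self, item):
            with open(self.path, "a") as f:
                f.write(item + "\n")

        def dequeue(self):
            with open(self.path) as f:
                f.seek(self.read_pos)
                line = f.readline()
                self.read_pos = f.tell()
            if not line:
                raise IndexError("queue is empty")
            return line.rstrip("\n")

    q = FileQueue("queue.txt")
    q.enqueue("http://example.com/")
    print(q.dequeue())   # http://example.com/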
Changing Crawler
- this crawler deals with URLs. what else can you crawl? (Amazon book recommendations. Javadoc. FOAF.....)
- what interesting things can you do with the data you have retrieved?
Some Next Steps
- Write web crawler in Java, Python, Scheme
Python web crawler/iterator/queue: WgetCrawler
- Java web crawler/iterator/queue should be trivially similar
Scheme web crawler is a bit more complex; you have to do a lot of things By Hand (unless I'm missing something). See more notes on the WgetCrawler page. -kmr
- involves identifying appropriate libraries
- Java
  java.net.URL and java.io
- Python
  urllib, httplib, HTMLParser; sys and re are also helpful
- Scheme
  Guile (the net or www modules would work if we can find them; the info online is a bit old)
  PLT DrScheme (MzScheme) -- works! use the url.ss submodule in net, in language "PLT Textual (MzScheme)"
- Write html parser that simply builds a tagged parse tree (a rough sketch appears after these notes)
java: wrote HtmlOrganizer; uses java.net.URL, java.io.BufferedReader, and an assortment of java.util stuff.
python: also in HtmlOrganizer. uses urllib, sys, and re.
- yes, there are libraries to do this. should we be using them?
- should we be using lex/yacc?
- find lex/yacc tools for Java, Python, Scheme
Lex/YACC explanation: brilliant, but unfortunately C-based
Lex & Yacc: also C-based, but goes slower and in more depth; very clear
- Java
JavaCC: a YACC-style parser generator for Java
- Python
PLY: Python Lex/YACC (the page also has pointers to other tools)
- am assuming we're mostly only responsible for compliant html; obviously this is a Bad Assumption
- is there a way to invoke html tidy or a validator on it so we can barf cleanly instead of mid-execution?
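For the record, here's the shape of a dumb tree builder in Python (nested lists, stdlib html.parser; not HtmlOrganizer itself, just a sketch that tolerates stray close tags and void elements):

    from html.parser import HTMLParser

    class TreeBuilder(HTMLParser):
        """Builds [tagname, child, child, ...] nodes; text chunks stay strings."""
        VOID = {"br", "hr", "img", "meta", "link", "input"}

        def __init__(self):
            super().__init__()
            self.root = ["document"]
            self.stack = [self.root]

        def handle_starttag(self, tag, attrs):
            node = [tag]
            self.stack[-1].append(node)
            if tag not in self.VOID:   # void elements have no children
                self.stack.append(node)

        def handle_endtag(self, tag):
            # tolerate stray close tags instead of dying on bad html
            if len(self.stack) > 1 and self.stack[-1][0] == tag:
                self.stack.pop()

        def handle_data(self, data):
            if data.strip():
                self.stack[-1].append(data.strip())

    builder = TreeBuilder()
    builder.feed("<html><body><p>hi <b>there</b></p></body></html>")
    print(builder.root)
    # ['document', ['html', ['body', ['p', 'hi', ['b', 'there']]]]]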
Identify RegExp libraries for simplified html screen scraping
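Often you don't need a library at all; for simple pages a regex does the job, with the usual caveat that regexes can't truly parse HTML (comments, script blocks, and exotic quoting will fool this):

    import re

    # Quick-and-dirty href extraction.
    HREF = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

    html = '<a href="http://a.example/">a</a> <A HREF=\'/b.html\'>b</A>'
    print(HREF.findall(html))   # ['http://a.example/', '/b.html']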
- Either write or identify libraries for RDF parsing
rdfxml is a year old, but works -- extracts triples out of rdf; I used it a lot earlier this summer. See FoafingStartpage for more content. -kmr
- I think the xml namespace piece is the only genuinely hard part of this
- one possible output would be a graph
- another possible output would be a database of triples
- this would provide an opportunity for relational algebra tutorial....
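To make the relational-algebra connection concrete: once the triples are in hand (here hard-coded; in practice they'd come from something like rdfxml), SELECT and JOIN fall out in a few lines:

    # Hypothetical triples, shaped like what an RDF parser hands back.
    triples = [
        ("alice", "foaf:knows", "bob"),
        ("bob",   "foaf:knows", "carol"),
        ("alice", "foaf:name",  "Alice"),
        ("bob",   "foaf:name",  "Bob"),
    ]

    def select(rows, pred):
        """Relational SELECT: keep the rows satisfying a predicate."""
        return [r for r in rows if pred(r)]

    def join(rows_a, rows_b):
        """Join rows where a's object equals b's subject."""
        return [(a, b) for a in rows_a for b in rows_b if a[2] == b[0]]

    knows = select(triples, lambda r: r[1] == "foaf:knows")
    names = select(triples, lambda r: r[1] == "foaf:name")

    # "names of people alice knows" = select, join, then project the name column
    for _knows_row, name_row in join(select(knows, lambda r: r[0] == "alice"), names):
        print(name_row[2])   # Bob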
- Forth
Minotaur, which apparently is very much like Forth, lets you use Tcl/Perl/Python from any of the three.