Friend of a Friend
intro-level stuff
Coded a wget-alike using httplib, sys, and re
Using it to grab foaf rdf files, particularly http://rdfweb.org/people/danbri/rdfweb/danbri-foaf.rdf
Sourced from "Foaf Explorer" at http://xml.mfd-consult.dk/foaf/explorer/
Counted all instances of "<knows>" as an approximation of how many people a person knows (inaccurate, since a knows tag can also belong to a particular person he knows instead of to him. Not the same thing. But close)
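A sketch of that counting approach -- the FOAF fragment here is made up for illustration, and real files use the foaf: namespace prefix rather than a bare <knows>:

```python
import re

# made-up fragment of FOAF RDF/XML for illustration
foaf = """
<foaf:Person>
  <foaf:name>Dan</foaf:name>
  <foaf:knows><foaf:Person><foaf:name>A</foaf:name></foaf:Person></foaf:knows>
  <foaf:knows><foaf:Person><foaf:name>B</foaf:name></foaf:Person></foaf:knows>
</foaf:Person>
"""

# count opening <foaf:knows> tags as a rough "people known" figure;
# this over-counts when a known person's own description nests further tags
count = len(re.findall(r"<foaf:knows[\s>]", foaf))
print(count)  # 2
```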
Used xml.parsers.expat to "parse" xml file.
- Turned up a BUNCH of empty "data" fields, probably due to the heavy nesting you get in foaf files?
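Those empty "data" fields are consistent with expat firing its character-data handler for the whitespace between nested tags. A quick check with a toy document (not a real foaf file):

```python
import xml.parsers.expat

chunks = []

def char_data(data):
    # expat reports *every* run of character data, including the
    # newlines and indentation sitting between nested elements
    chunks.append(data)

p = xml.parsers.expat.ParserCreate()
p.CharacterDataHandler = char_data
p.Parse("""<a>
  <b>real text</b>
</a>""", True)

# chunks now holds 'real text' plus whitespace-only runs around it
print(chunks)
```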
Incidentally, expat is used in RL: http://rdfweb.org/pipermail/rdfweb-dev/2004-June/013323.html
Also from that discussion: these folks are using PHP, and talking pretty seriously about it. Wonder how the speed compares with Python?
- Google-searched for a python rdf parser, came up with a buncha stuff.
http://infomesh.net/2003/rdfparser/ -- the rdfxml module is small and snappy; all it does is extract triples, which you can put into whatever data structure you like (convenient).
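I can't bundle rdfxml here, but the sink interface it drives looks like this (the parseRDF call is commented out; the two triple() calls below simulate what the parser would do):

```python
class ListSink(object):
    """Collects triples as (subject, predicate, object) tuples."""
    def __init__(self):
        self.triples = []

    def triple(self, s, p, o):
        self.triples.append((s, p, o))

sink = ListSink()
# something like rdfxml.parseRDF(rdf_source, sink=sink) would drive these
# calls for real; simulated here with two hand-made triples
sink.triple("_:id0", "<http://xmlns.com/foaf/0.1/knows>", "_:id1")
sink.triple("_:id1", "<http://xmlns.com/foaf/0.1/name>", '"Chris Murphy"')
print(len(sink.triples))  # 2
```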
Filtering and searching
Not all foafs are organized the same way; in particular, the base node is sometimes named, sometimes not -- unnamed nodes are given ids in order, so an unnamed base node receives the name "_:id0". Files with named base nodes, however, also have some limited number of "_:id0" entries. Eugh.
Easy to get stats on how many nodes there are, get all entries in a node, get all names, compound filters, etc.
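With the triples in a plain list, those filters are just list comprehensions; a sketch over a toy triples list in rdfxml's (subject, predicate, object) shape:

```python
# toy triples list, same shape as what the sink collects
triples = [
    ("_:id0", "<http://xmlns.com/foaf/0.1/knows>", "_:id1"),
    ("_:id1", "<http://xmlns.com/foaf/0.1/name>", '"Chris Murphy"'),
    ("_:id2", "<http://xmlns.com/foaf/0.1/name>", '"Drew Harry"'),
]

nodes = set(s for s, p, o in triples)                   # how many nodes
names = [o for s, p, o in triples
         if p == "<http://xmlns.com/foaf/0.1/name>"]    # all names
id1 = [(p, o) for s, p, o in triples if s == "_:id1"]   # one node's entries

print(len(nodes), names)
```

Compounding filters is just adding clauses to the `if`.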
Code: FoafingPython
Sorting triples list definitely results in a differently ordered data set than that returned from rdfxml. Could be useful for better searching.
Revamped searching a bit to limit the search space; realized that a space sorted by subject would be a PAIN to limit by predicate; thought about binary search trees, implemented one, realized it's not what I wanted. Call it a Learning Experience.
Consider the following: each triple stored as a point in the data space allows you to keep everything sorted all the time. Actual implementation needs some Thing to hold each point together, since if you don't keep track, when you pull a point out of the collapsed space you'll just end up with a bucket of strings that don't relate to each other.
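One cheap Thing to hold each point together is a tuple, or a namedtuple for readable field access: sorting a list of them orders by subject, then predicate, then object for free, and each point stays intact instead of dissolving into loose strings. A sketch:

```python
from collections import namedtuple

Triple = namedtuple("Triple", "s p o")

points = [
    Triple("_:id1", "<http://xmlns.com/foaf/0.1/name>", '"Chris Murphy"'),
    Triple("_:id0", "<http://xmlns.com/foaf/0.1/knows>", "_:id1"),
    Triple("_:id1", "<http://xmlns.com/foaf/0.1/mbox_sha1sum>",
           '"483d02da8c321d9f885298c571c0521dd1b88069"'),
]

# tuples sort field by field: subject, then predicate, then object
points.sort()
print(points[0].s)  # _:id0
```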
On crawling: can negotiate differences between the "names" set and the "knows" set; can get the foaf entry for each id in this final set; can fetch foaf file off the web using uri extracted from foaf entry. So, crawling is highly doable.
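The names-vs-knows negotiation is plain set arithmetic; a sketch with made-up node ids (the real version would then pull each undescribed person's rdfs:seeAlso uri and fetch that file off the web):

```python
# node ids that carry a foaf:name in this file, vs node ids that appear
# as objects of foaf:knows -- the difference is the people who are
# mentioned here but described elsewhere
names_set = {"_:id0", "_:id1"}
knows_set = {"_:id1", "_:id2", "_:id3"}

undescribed = knows_set - names_set
print(sorted(undescribed))  # ['_:id2', '_:id3']
```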
Hashing
- very much a new topic, and I'm not quite sure I grasp it yet. But worth a good shot or two.
Have a structure that's basically a singly nested dictionary: master stores the subjects of the triples as keys, and a dictionary per subject as values. These node dictionaries store predicates as keys, and lists of objects as values. This structure makes the hierarchy of the data clear, but I haven't the slightest whether it is a useful way in which to store it.
- some output:
Reading from out5.txt
_:id1
	<http://xmlns.com/foaf/0.1/name> "Chris Murphy"
	<http://xmlns.com/foaf/0.1/mbox_sha1sum> "483d02da8c321d9f885298c571c0521dd1b88069"
	<http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://unicorn.olin.edu/~chrism/2003/foaf/camfoaf.rdf>
	<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person>
_:id0
	<http://xmlns.com/foaf/0.1/knows> _:id1
_:id2
	<http://xmlns.com/foaf/0.1/mbox_sha1sum> "233d1943c83ee02bb35bb787940248cbe3b9d0d3"
	<http://xmlns.com/foaf/0.1/homepage> <http://katie.rivard.org>
	<http://xmlns.com/foaf/0.1/nick> "Katie"
	<http://xmlns.com/foaf/0.1/firstName> "Kathryn\n"
	<http://xmlns.com/foaf/0.1/schoolHomepage> <http://www.olin.edu>
	<http://xmlns.com/foaf/0.1/title> "Ms"
	<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person>
	<http://xmlns.com/foaf/0.1/name> "Kathryn Rivard"
	<http://xmlns.com/foaf/0.1/surname> "Rivard"
_:id2
	<http://xmlns.com/foaf/0.1/name> "Drew Harry"
	<http://xmlns.com/foaf/0.1/mbox_sha1sum> "1956530d6d839db321259f589d330d93fcce5af0"
	<http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://unicorn.olin.edu/~chrism/2003/foaf/dehfoaf.rdf>
	<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person>
- relevant code:
    class HashSink(object):
        def __init__(self):
            self.master = {}

        def triple(self, s, p, o):
            # one dictionary per subject; one object-list per predicate
            if not self.master.has_key(s):
                self.master[s] = {}
            if self.master[s].has_key(p):
                self.master[s][p].append(o)
            else:
                self.master[s][p] = [o]

        def write(self):
            for i in self.master.iterkeys():
                print i
                for j, k in self.master[i].iteritems():
                    if len(k) > 1:
                        print "\t", j
                        for m in k:
                            print "\t\t", m
                    else:
                        print "\t", j, k[0]
there is a potential need for some external structure to hold processed data -- a person associated with a node, for instance, has their name buried in any of several inconsistent RDF tags, so hacking together one's name takes work. Since the only things in predicates are uris, however, I could add custom keys to each node dictionary to hold this processed data; grokkable plain strings would be assured (I hope) unique.
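A sketch of that idea -- the "display_name" key and the fallback order (name, then firstName + surname, then nick) are my inventions, but the foaf tags are the ones showing up in the output above:

```python
FOAF = "<http://xmlns.com/foaf/0.1/%s>"

# one node dictionary, shaped like HashSink's inner dicts
node = {
    FOAF % "firstName": ["Kathryn"],
    FOAF % "surname": ["Rivard"],
}

def display_name(node):
    # try foaf:name first, then firstName + surname, then nick
    if FOAF % "name" in node:
        return node[FOAF % "name"][0]
    if FOAF % "firstName" in node and FOAF % "surname" in node:
        return node[FOAF % "firstName"][0] + " " + node[FOAF % "surname"][0]
    return node.get(FOAF % "nick", ["?"])[0]

# real predicates are always <...> uris, so a bare string key can't collide
node["display_name"] = display_name(node)
print(node["display_name"])  # Kathryn Rivard
```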
found and fixed a silly bug where I was replacing rather than appending to member dictionaries in self.master; fix reflected in above code.
added functions for finding nodes and predicate members of nodes; extracting foaf uris now takes only 6 lines of main code.
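The uri extraction boils down to walking master for rdfs:seeAlso values; hypothetical helpers along these lines (function names are mine, not the real code's):

```python
SEEALSO = "<http://www.w3.org/2000/01/rdf-schema#seeAlso>"

# toy master dict, shaped like HashSink.master
master = {
    "_:id1": {SEEALSO: ["<http://unicorn.olin.edu/~chrism/2003/foaf/camfoaf.rdf>"]},
    "_:id0": {"<http://xmlns.com/foaf/0.1/knows>": ["_:id1"]},
}

def members(master, node, predicate):
    """All objects for one predicate of one node (empty list if absent)."""
    return master.get(node, {}).get(predicate, [])

# every seeAlso object across every node, angle brackets stripped
foaf_uris = [uri.strip("<>")
             for node in master
             for uri in members(master, node, SEEALSO)]
print(foaf_uris)
```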
have breadth-first person-finding working: goes through some set number of levels of search depth to find a person connected to a root file.
Note to self: writing to disk takes an enormous amount of time. (duh) Or rather: if you don't write to disk, the script is speedyfast.
- Next: apply person-hopping to graphing.
- Ha... do-learn is great. I just read about my algorithm in a textbook.
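The person-hopping is textbook breadth-first search; a minimal sketch over a hypothetical adjacency dict (the names and links are made up -- the real version fetches each foaf file off the web before expanding a node):

```python
from collections import deque

# who appears in whose foaf file, as a plain adjacency dict
graph = {
    "danbri": ["libby", "edd"],
    "libby": ["danbri", "chrism"],
    "edd": ["danbri"],
    "chrism": ["katie"],
}

def find_person(root, target, max_depth):
    """Breadth-first search out to max_depth hops; returns the path or None."""
    queue = deque([[root]])
    seen = {root}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        if len(path) > max_depth:  # depth limit keeps ecademy-sized graphs at bay
            continue
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(find_person("danbri", "katie", 3))  # ['danbri', 'libby', 'chrism', 'katie']
```

Because it expands level by level, the first path found is also a shortest one.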
Graphing
Found breadth-first shortest-path between two names at the end of the last section. Caveats: foaf is a BIG place. If any member of the chain ever links someone on ecademy.com, the search takes freaking forever -- ecademy is an enormous business-networking site, some foaf files there have thousands of nodes, and it's ugly and horrible. Theoretically anyone who's anyone should be within 3 steps, but depending on the community you could reach either 5 people or 5,000 people in 3 steps. Eww.
- Have small library of foaf files with connect-a-point info in list style. Code is horrid to read. Probably need a larger library for doing anything useful with it.
- Next: read a small library of foaf files from disk, then store their connections using graphing methods.