
HTML Organizer

(Clickback to KatieRivard)

  1. Inputs a URL,
  2. reads the file,
  3. recursively loads the tag hierarchy into some data structure.
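The three steps above can be sketched roughly like this in Python. This is not the original HtmlOrganizer code, just a minimal illustration: the names `load`, `build_tree`, `TOKEN` and `VOID` are made up here, the page is assumed to be UTF-8, and the regex-based tokenizer ignores plenty of real-world HTML (script bodies, attributes containing `>`, and so on). The explicit stack is what lets it know where it's been, not just where it is.

```python
import re
from urllib.request import urlopen

# Hypothetical tokenizer: skips comments/doctype, then matches a tag or a text run.
TOKEN = re.compile(
    r"<!--.*?-->|<![^>]*>"
    r"|<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>"
    r"|([^<]+)",
    re.DOTALL,
)

# Tags that never have a closing tag (an assumed, incomplete list).
VOID = {"br", "hr", "img", "input", "link", "meta"}


def build_tree(html):
    """Load the tag hierarchy of `html` into nested dicts."""
    root = {"tag": "#root", "children": []}
    stack = [root]  # open tags, outermost first
    for match in TOKEN.finditer(html):
        closing, name, self_closing, text = match.groups(default="")
        name = name.lower()
        if text:
            if text.strip():
                stack[-1]["children"].append(text.strip())
        elif not name:
            continue  # comment or doctype
        elif closing:
            # Pop back to the matching open tag; ignore stray closers.
            if name in [node["tag"] for node in stack[1:]]:
                while stack.pop()["tag"] != name:
                    pass
        else:
            node = {"tag": name, "children": []}
            stack[-1]["children"].append(node)
            if not self_closing and name not in VOID:
                stack.append(node)
    return root


def load(url):
    """Steps 1 and 2: take a URL and read the file (assumes UTF-8)."""
    with urlopen(url) as response:
        return build_tree(response.read().decode("utf-8", errors="replace"))
```

For example, `build_tree("<p>Hi <b>there</b></p>")` gives a root node whose only child is the `p` node, which in turn holds the text `"Hi"` followed by the `b` node.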

This went through at least three wildly different revisions, including a rather cute little state machine (HtmlParserFsm) that might've worked if doing lambda-like things in Java were easier (for managing the side effects of state changes). (No, it wouldn't've: it needs to know where it's been, not just where it is.) I could probably take the solution below and work it backwards towards something prettier, getting rid of the rather horrid while/if construct I ended up using instead, but I don't know that that's worth the effort for just playing around with stuff.

The other bit of it is that I'm finally getting the hang of hash tables, so I've been tending to use them a lot. Java has TreeMap, which keeps its entries sorted by key and so gives you an ordered table, which is nice. Python dictionaries aren't ordered, but I faked the effect by using lists of (key, item) tuples. Anyhoo, yay for source code.
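For the curious, faking an ordered table with a list of (key, item) tuples might look something like the sketch below. The helper names `put` and `get` are invented for illustration, and keeping the list sorted by key (to mimic TreeMap) is an assumption about what "ordered" means here. Since Python 3.7 a plain dict does remember insertion order, and `sorted(d.items())` gives key order, so the trick is less necessary than it used to be.

```python
def put(table, key, item):
    """Insert or update `key` in a list of (key, item) tuples, kept sorted by key."""
    for i, (existing_key, _) in enumerate(table):
        if existing_key == key:
            table[i] = (key, item)  # update in place, order unchanged
            return
    table.append((key, item))
    table.sort(key=lambda pair: pair[0])


def get(table, key, default=None):
    """Linear lookup; fine for small tables like a page's tag counts."""
    for existing_key, item in table:
        if existing_key == key:
            return item
    return default


tags = []
put(tags, "p", 2)
put(tags, "body", 1)
put(tags, "b", 3)
put(tags, "p", 5)  # overwrites the earlier entry for "p"
```

After those calls, `tags` is `[("b", 3), ("body", 1), ("p", 5)]`, so iterating over it walks the keys in sorted order the way a TreeMap would.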


import java.io.*;
import java.util.*;

public class HtmlOrganizer {


Python (all code)

Sample Output

Lex/YACC Parser

This doesn't build a parse tree, but it does tell you which tags the text it's currently looking at is nested in, and a recursive tree builder wouldn't be too hard to add. Plus, this uses Lex/YACC, which is just cool.

S = statement; T = tagexp; Ot = opentagexp; E = expression; Ct = closetagexp; X = text

Ot E Ct

E E (this doesn't look happy, but...)
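The "tell you what tags the text is nested in" behaviour described for the Lex/YACC parser can be mimicked in a few lines of Python. To be clear, this is a stand-in and not the Lex/YACC grammar itself: `nesting` and `TAG_OR_TEXT` are hypothetical names, and the sketch assumes well-nested tags with no comments or self-closing tags.

```python
import re

# Hypothetical tokenizer: an open/close tag, or a run of text between tags.
TAG_OR_TEXT = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)")


def nesting(html):
    """Yield (open_tags, text) for each non-blank run of text."""
    open_tags = []  # tags currently open, outermost first
    for match in TAG_OR_TEXT.finditer(html):
        closing, name, text = match.groups(default="")
        if text:
            if text.strip():
                yield tuple(open_tags), text.strip()
        elif closing:
            # Only pop when the closer matches the innermost open tag.
            if open_tags and open_tags[-1] == name.lower():
                open_tags.pop()
        else:
            open_tags.append(name.lower())


report = list(nesting("<p>Hi <b>there</b>!</p>"))
```

Here `report` comes out as `[(("p",), "Hi"), (("p", "b"), "there"), (("p",), "!")]`: each piece of text paired with the tags it sits inside. No tree is built, only the current nesting is tracked.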





2013-07-17 10:43