Notes for September 21
Return to the main StudentFrontPage.
- Scribe = Drew Harry
- Editor = Kathy King
- Editor =
- Editor =
- Editor =
Contents
Ball Comparisons
- I have 9 balls that are indistinguishable, except that one is heaver than the others. With only a scales-of-justice style scale - how many operations will it take to figure out (for sure) which ball it is?
- Clicker (8, 4, 3, 2, 1, 0, not sure) - Most people think 3. This problem is related to a sorting problem.
- With 8 operations, it's like a selection sort - comparing two and moving on.
- With 4 operations, compare pairs of balls. Each time, if it's unequal, you've found it. If, for each comparison, they're unequal, then it must be the 9th ball. Remember that this is the WORST case performance of this algorithm. You could find it much faster than this. This is taking advantage of information about what "equals" means in this test.
- With 3 operations, put 4 balls in each pan. If each are equal, it is the 9th ball. Otherwise, take the balls in the heavier pan and split them into pairs. Split again to find exactly which ball is the heaviest.
- With 2 operations, seperate into three groups of three. If the two tested groups are of equal weights, test the group on the side. Do the same for the groups of three. Put one ball on each pan. If they're equal, you know the third ball is the heaviest, otherwise it will be clear which of the two balls is heavier on the scale.
Themes of Today's Class
- Variation in problems that we've seen. These small variations can help us do better than n log(n). The information theoretic bound on sorting depends on comparing two items. If we change the rules slightly (like using equals in the above example) we can do better.
Finding the Minimum/Maximum of a Collection
- As a first attempt, we could sort the collection. This would take n log(n) and then pop the item off the top or bottom.
find min
n-1 comparisons
find max
n-1 comparisons
find min&max simultaneously
3/2 * n (see below for explanation)
Aside: Formality and Expected Knowledge
- We don't need to be able to write the code for all these sorts. We should develop intuition and a toolbox of ways to approach algorithms and sorting. It is important for us to be able to understand why a naive sort would be n^2, and what is different about n log(n). It's not crucial that we can bind the names to the actual sorts.
Finding the Ith element in the list (Quicksort)
Pick an element in the list (randomly). This is the pivot.
- Divide the list into two chunks, those numbers that are larger (in value) than the selected value and those that are less. (this will take n-1 comparisons to split)
- Hopefully this gives us about half and half in each group. This is like the dictionary example, halving the size of the thing for which we are searching. This should end up taking about log(n) passes. (If the random selection of the pivot gets the smallest value in that list, it takes n comparisons and doesn't effectively split the group - every value will be larger than the pivot and so we only end up with another list that's just n-1 long. If this happens every time, this method will suck.)
Taking a random pivot could take as long as n2, but it could also end up a lot better than n2.
It turns out we can do a better job of avoiding the O(n^2) worst case scenario. Take the collection and divide it into groups of five. Find the median of each of these groups. This can be done in six comparisons. Now we have a whole bunch of elements about which we know that there are two things greater than the middle element and two things less than the middle element.best case
O(n)
(1 pass, get lucky)
average case
O(n)
2 = 1 + 1/2 + 1/4 + ... (We would say this is O(2n), except that order is the upper bound, so O(2n) = O(n))
worst case
O(n^2)
see above justification
- Take the collection and separate into n/5 groups of five. (O(n), maybe faster, but def. proportional to n)
- Find the median of each group (2 above, 2 below, 1 median) (n/5 * 6, O(n) - some discussion about the actual cost of finding the median. might be 7?)
- find the median of the medians. (f(n/5))
- Use the median of the medians as a pivot.
- Recurse on what's left. (do it like we did it above, but always use this algorithm to select the pivot.) It turns out that if you do it this way, eliminating about a 1/4 for each cycle, this is still O(n). It also has a huge constant. This means that it's not a good way to find the ith element of a small collection - small collections should be sorted. What's cool is that it's a guaranteed linear way to find the ith element. It's also neat that the number 5 appears, and is crucial to making the math work out. It's especially slick because it isn't a power of 2.
o o o o o|o o o o o | o o o o o|o o o o o _______ | o o o o[o]o o o o o | --------- o o o o|o o o o o o | o o o o|o o o o o o
Each column is a set of five, with the third row being the medians. The center one (bracketed) is the median of the medians. The ones within the boxes (up-right and bottom-left) are eliminated as being greater than or less than the selected pivot.
Aside: Orders of Growth
- The order of growth is about figuring out the shape of the curve, not the specific behavior. To find specific behavior we would need to assign specific costs to things like comparing numbers, storing things, etc. The order notation lets us think about the shape of the curve, not it's specific values. (Recurrance relations are a way in which we could be more formal about this topic.) Sometimes we talk about the number of times you do something (eg n-1, 1 + 1/2 + 1/4, etc) - these are not statements about Order, they're an approximation. Order has a more formal definition. Rely on context to figure out what we're talking about.
Sorting Reprise: Doing better than n log(n)
- Sharon's thought: parallelize the process and ask people for certain cards, eg call out "Aces" and everyone puts down their aces. (This is quite close, and has two great ideas - parallization If you know something about the potential values of the cards, I can sort in the amount of time it takes me to go through the deck by putting things into bins (matching those known values.) This is sorting in order n time. Try sorting on either number or suit. Scan through the list selecting items that are a certain number or certain suit. Now do a bin sort with the other criterion, ie if you did suit the first time, sort each bin (in order) using the other criterion. This ends up preserving the previous sort criteria. === Aside: Parallelization ===
- This is why we sometimes talk about number of cycles instead of time - if we can make it a parallel process, it can be easier to compute in actual time, even if it's more total cycles.
Hash Sort
Example: The Faculty Members at Olin and the classrooms where they teach.
We want something like an array that's order 1 access to retrieve items.
So we define a hash function. For example take the 1st letter of the faculty member's name mod 10 (a = 0). In our example (Stein, 318). This says: put the data (318) in the 8th slot. For Martello, put it in 2 (12 mod 10). To retrieve values, we ask "what is Martello's hash number?" and then ask that array location for its values. This makes it constant time to both put things in _and_ get things out. This is cool.
There are potential glitches, though. What if we want to add somerville? His has resolves to the same as Stein. Collision! There are two ways to solve this problem:
Closed Hashing - Things go in the bin, so we might need many things in each bin. Thus each bin needs to contain a collection data type. If we do it this way, we want to make sure that not too many things wind up in each bin. So #elements/#bin is small and the distribution is relatively even - don't want them to all end up in one bin. This can be perfectly reasonable thing to do, but it depends on being able to guarantee small lists in each bin.
Open Hashing - If you get a collision, you rehash. To a first approximation, just let it trickle down to the next (this is actually a really bad way to do it, but just an example). In general, this strategy give us more flexibility to be smart about our collision resolution strategy. If we can make it be independant of why the collision happened in the first place, that would be great. In general, you can't remove things from this hash. This is also good because you know exactly how much space this hash is going to take.
Take Aways
- There are ways you can cheat if you know more about the problem than the fully generic problem. The hash algorithm in particular is one that is very common. It's great when you have lots of extra space. If you can afford this you can very quickly stick things in and pull them out. Being able to tell if something is there and if there is a value for a certain key. There is another level of tradeoffs between open and closed hashing models. You should also be concious about the actual hash method. A good hash method will be deterministic, spread the values out reasonably well (to avoid collisions). Also, in open hashing, the two hashing functions should be independent so the second hashing function resolves the collision without future collisions.