Notes for September 21

Return to the main StudentFrontPage.

Scribe = Drew Harry
Editor = Kathy King
Editor =
Editor =
Editor =

Contents

Notes for September 21

Ball Comparisons

I have 9 balls that are indistinguishable, except that one is heaver than the others. With only a scales-of-justice style scale - how many operations will it take to figure out (for sure) which ball it is?

Clicker (8, 4, 3, 2, 1, 0, not sure) - Most people think 3. This problem is related to a sorting problem.
- With 8 operations, it's like a selection sort - comparing two and moving on.
- With 4 operations, compare pairs of balls. Each time, if it's unequal, you've found it. If, for each comparison, they're unequal, then it must be the 9th ball. Remember that this is the WORST case performance of this algorithm. You could find it much faster than this. This is taking advantage of information about what "equals" means in this test.
- With 3 operations, put 4 balls in each pan. If each are equal, it is the 9th ball. Otherwise, take the balls in the heavier pan and split them into pairs. Split again to find exactly which ball is the heaviest.
- With 2 operations, seperate into three groups of three. If the two tested groups are of equal weights, test the group on the side. Do the same for the groups of three. Put one ball on each pan. If they're equal, you know the third ball is the heaviest, otherwise it will be clear which of the two balls is heavier on the scale.

Themes of Today's Class

Variation in problems that we've seen. These small variations can help us do better than n log(n). The information theoretic bound on sorting depends on comparing two items. If we change the rules slightly (like using equals in the above example) we can do better.

Finding the Minimum/Maximum of a Collection

As a first attempt, we could sort the collection. This would take n log(n) and then pop the item off the top or bottom.
- find min
  
  n-1 comparisons
  
  find max
  
  n-1 comparisons
  
  find min&max simultaneously
  
  3/2 * n (see below for explanation)
The trick for finding the min/max simultaneously. Maintain a "curMax" and "curMin". Pick up two values at once, and compare them. Then compare the larger one to the current maximum and the smaller one to the minimum. This uses three comparisons to eliminate two values. This is better than other methods which use one comparison to select one value. This ends up scaling as 3/2 * n. This is still order n, but it scales better than the other proposed methods.

Aside: Formality and Expected Knowledge

We don't need to be able to write the code for all these sorts. We should develop intuition and a toolbox of ways to approach algorithms and sorting. It is important for us to be able to understand why a naive sort would be n^2, and what is different about n log(n). It's not crucial that we can bind the names to the actual sorts.

Finding the Ith element in the list (Quicksort)

WikiPedia entry on Quicksort

Pick an element in the list (randomly). This is the pivot.
Divide the list into two chunks, those numbers that are larger (in value) than the selected value and those that are less. (this will take n-1 comparisons to split)

Hopefully this gives us about half and half in each group. This is like the dictionary example, halving the size of the thing for which we are searching. This should end up taking about log(n) passes. (If the random selection of the pivot gets the smallest value in that list, it takes n comparisons and doesn't effectively split the group - every value will be larger than the pivot and so we only end up with another list that's just n-1 long. If this happens every time, this method will suck.)

Taking a random pivot could take as long as n², but it could also end up a lot better than n².

best case	O(n)	(1 pass, get lucky)
average case	O(n)	2 = 1 + 1/2 + 1/4 + ... (We would say this is O(2n), except that order is the upper bound, so O(2n) = O(n))
worst case	O(n^2)	see above justification

It turns out we can do a better job of avoiding the O(n^2) worst case scenario. Take the collection and divide it into groups of five. Find the median of each of these groups. This can be done in six comparisons. Now we have a whole bunch of elements about which we know that there are two things greater than the middle element and two things less than the middle element.

Take the collection and separate into n/5 groups of five. (O(n), maybe faster, but def. proportional to n)
Find the median of each group (2 above, 2 below, 1 median) (n/5 * 6, O(n) - some discussion about the actual cost of finding the median. might be 7?)
find the median of the medians. (f(n/5))
Use the median of the medians as a pivot.
Recurse on what's left. (do it like we did it above, but always use this algorithm to select the pivot.) It turns out that if you do it this way, eliminating about a 1/4 for each cycle, this is still O(n). It also has a huge constant. This means that it's not a good way to find the ith element of a small collection - small collections should be sorted. What's cool is that it's a guaranteed linear way to find the ith element. It's also neat that the number 5 appears, and is crucial to making the math work out. It's especially slick because it isn't a power of 2.

o o o o o|o o o o o
         |
o o o o o|o o o o o
_______  |  
o o o o[o]o o o o o
       |  ---------
o o o o|o o o o o o
       |
o o o o|o o o o o o

Each column is a set of five, with the third row being the medians. The center one (bracketed) is the median of the medians. The ones within the boxes (up-right and bottom-left) are eliminated as being greater than or less than the selected pivot.

Aside: Orders of Growth

The order of growth is about figuring out the shape of the curve, not the specific behavior. To find specific behavior we would need to assign specific costs to things like comparing numbers, storing things, etc. The order notation lets us think about the shape of the curve, not it's specific values. (Recurrance relations are a way in which we could be more formal about this topic.) Sometimes we talk about the number of times you do something (eg n-1, 1 + 1/2 + 1/4, etc) - these are not statements about Order, they're an approximation. Order has a more formal definition. Rely on context to figure out what we're talking about.

Sorting Reprise: Doing better than n log(n)

Sharon's thought: parallelize the process and ask people for certain cards, eg call out "Aces" and everyone puts down their aces. (This is quite close, and has two great ideas - parallization If you know something about the potential values of the cards, I can sort in the amount of time it takes me to go through the deck by putting things into bins (matching those known values.) This is sorting in order n time. Try sorting on either number or suit. Scan through the list selecting items that are a certain number or certain suit. Now do a bin sort with the other criterion, ie if you did suit the first time, sort each bin (in order) using the other criterion. This ends up preserving the previous sort criteria. === Aside: Parallelization ===
- This is why we sometimes talk about number of cycles instead of time - if we can make it a parallel process, it can be easier to compute in actual time, even if it's more total cycles.

Hash Sort

Example: The Faculty Members at Olin and the classrooms where they teach.

We want something like an array that's order 1 access to retrieve items.

So we define a hash function. For example take the 1st letter of the faculty member's name mod 10 (a = 0). In our example (Stein, 318). This says: put the data (318) in the 8th slot. For Martello, put it in 2 (12 mod 10). To retrieve values, we ask "what is Martello's hash number?" and then ask that array location for its values. This makes it constant time to both put things in _and_ get things out. This is cool.

There are potential glitches, though. What if we want to add somerville? His has resolves to the same as Stein. Collision! There are two ways to solve this problem:

Closed Hashing - Things go in the bin, so we might need many things in each bin. Thus each bin needs to contain a collection data type. If we do it this way, we want to make sure that not too many things wind up in each bin. So #elements/#bin is small and the distribution is relatively even - don't want them to all end up in one bin. This can be perfectly reasonable thing to do, but it depends on being able to guarantee small lists in each bin.
Open Hashing - If you get a collision, you rehash. To a first approximation, just let it trickle down to the next (this is actually a really bad way to do it, but just an example). In general, this strategy give us more flexibility to be smart about our collision resolution strategy. If we can make it be independant of why the collision happened in the first place, that would be great. In general, you can't remove things from this hash. This is also good because you know exactly how much space this hash is going to take.

Take Aways

There are ways you can cheat if you know more about the problem than the fully generic problem. The hash algorithm in particular is one that is very common. It's great when you have lots of extra space. If you can afford this you can very quickly stick things in and pull them out. Being able to tell if something is there and if there is a value for a certain key. There is another level of tradeoffs between open and closed hashing models. You should also be concious about the actual hash method. A good hash method will be deterministic, spread the values out reasonably well (to avoid collisions). Also, in open hashing, the two hashing functions should be independent so the second hashing function resolves the collision without future collisions.

find min	n-1 comparisons
find max	n-1 comparisons
find min&max simultaneously	3/2 * n (see below for explanation)