1 2 Pt The Aptly Named Average Hotel Has 100 Rooms Each Belonging To One Of 100 Gues 2774497
-
1.(2 pt.)The aptly-named Average Hotel has 100 rooms, each belonging to one of 100 guests. After an evening soiree, all of the guests (not thinking straight) randomly select a room to sleep in for that night. Multiple guests might end up in the same room.
(a) (1 pt.) What is the expected number of guests that end up returning to their own hotel room?
[We are expecting: A number, and please show your work.]
(b) (1 pt.) What is the expected number of guests that end up in a room with exactly one other person? (Hint: you may find it easier to count by rooms instead of guests.)
[We are expecting a mathematical expression like(6)(8)or a number like48, and please show your work.]2.(3 pt.)LetUkbe the universe of all strings consisting ofknumeric digits. (0000,0123, and9898are part of universeU4butb000,012,9!9!are not.) Letuidenote theithdigit of a stringu?Ukwhere 0=i
LetHkbe a family ofkhash functions mapping universeUkto values{0,1,2, . . . ,9}whereh0?Hk hashes all strings according to their first digit. (For all stringsuwhereu0= 0,h0(u) = 0; for all strings uwhereu0= 1,h0(u) = 1; for all stringsuwhereu0= 9,h0(u) = 9.) Likewise,h1?Hkhashes all strings according to their second digit. Generally, for all stringsu?Ukwhereui=x,hi(u) =xfor 0=i
-
(a) (1 pt.) What is an example of a maximally-sized subsetU3such thatH3is universal for the subset?
[We are expecting: An example subset.]
-
(b) (2 pt.) WouldHkbe a good family of hash functions (where “good” is defined as universal) to useforUkfork=3?
[We are expecting: A short explanation (2-3 sentences) that answers the question.]
-
-
3.(Plagiarism detection) (5 pt.)Hash functions are extremely good at what they do. Unsurprisingly, there are many fancier data structures that can be built on top of them. In this problem we will motivate and explore the idea of a “Bloom Filter,” which is one example of a fancier structure built on top of hash functions.
Suppose you are hired by someone to make a plagiarism detection software for internal use so as to avoid any potentially embarrassing allegations of plagiarism. Specifically, your goal is to make a lightweight (i.e. fast, and relatively low-memory) piece of software that will take a sentence and output one of the following messages: 1) “potentially problematic, please rewrite”, or 2) “fresh like an ocean breeze.” Suppose your goal is the following: if the input sentence is something that you have already seen, you output “potentially problematic” (with probability 1), and if the input is something new, you want to output “fresh” with probability at least 0.99 (its alright if you have a few false-alarms).
-
(a) (1 pt.) First, you decide to use a hash table. You will make a has table that maps a piece of text to a bucket, then scrape the web for all English sentences, and hash each one to your table. Given a new sentence, you will check to see if it hashes to an empty bucket—if so, you will output option “fresh” otherwise you will output “potential plagiarism.” Suppose there are 1 billion unique sentences online. How many buckets will your hash table need to have to have the desired functionality?
[We are expecting: A number (to the nearest order of magnitude) and one to two sentences of justification.]
-
(b) (2 pt.) You decide that is a little too much space usage, and consider the following approach: you choose 10 hash functions,h1,…,h10that each map sentences to the numbers 1 though 10 billion. You initialize an arrayAof 10 billion bits, initially set to 0. For each sentencesthat you encounter, you computeh1(s),h2(s),…,h10(s), and set the corresponding indices ofAto be 1 (namely you setA[h1(s)]?1, A[h2(s)]?1, . . .). Argue that after processing the 1 billion unique sentences, you expect a (1-1/(10 billion))10 billion˜0.37 fraction of the elements to be 0.
For this part, feel free to assume that thehiare “idealized” hash functions that map each keys to a uniformly random bucket.
[We are expecting: A paragraph with your argument.]
(c) (2 pt.) Now, given a sentences, to check if it might be plagiarized, you compute the 10 hashes of s, and check ifA[h1(s)] =A[h2(s)] =. . .=A[h10(s)] = 1. If so, you output “potential problem,” otherwise you output “fresh.” Prove that ifsis actually in your set of 1 Billion sentences, that you will output “potential problem” with probability 1, and that ifsisnotin your set of 1 Billion sentences, you will output “fresh” with probability˜1-(1-0.37)10˜0.99.
Again, feel free to assume that the hash functions are “idealized,” and that the claim of the previous part holds, namely that after processing the 1 Billion sentences, there are 3.7 billion indices in the arrayAwith value 0.
[We are expecting: Informal mathematical justifications for each of the bounds.]