Monday, June 12, 2006

Goooogle interview questions

The other day I got hooked by a friend of mine with a couple of interview questions given by google. Some of them are challenging and quite interesting. Here is one (taken from here): Question #8) Given an array A[string], an array of strings where each string represents a word in a text document. Also given 3 search terms T1, T2, and T3 and 3 corresponding sorted sequences of integers S1, S2, and S3 where each integer in Si represents an index in A where search term Ti occured (i.e. S1, S2, and S3 contain the locations of the search terms in our array of words). Now find a minimal subarray of A that contains all of the search terms T1, T2, and T3. Extend this algorithm for an arbitrary number of search terms. This one in effect is asking you to solve the excerpting problem: Given a text, a couple of search terms with the matching positions in the original text you are asked to return an "interesting" fragment of the original text containing all the search term. In this case the interesting criteria is the length of the fragment: the fragment is most interesting when the length is shorter.

My first solution was to the [max(min(S1), min(S2), min(S3)), min(max(S1), max(S2), max(S3))] which is not good. The reason is that you can have cases in which the interval you choose can have a one or more search terms left out: S1 = {1, 3}, S2 = {5, 7}, S3 = {4, 8}. The interval you would get in this case would have been 1, 3 which will only contain the first term. You can also choose the min(min()..) .. max(max()..) solution which is good as an interval containing all the terms but is too broad (in effect is the longest possible). You can merge the S1, S2, S3 sequences and sort them (having "colored" the indexes first). This will be a proper sequences (it has all the colors in it) but it is too long. We need to shorten it. For this we will look at the sequence ends for the shortest subsequences beginning and ending at each end of the current sequence. For example given this sequence: [R1, B3, Y4, R5, .... R11, B10, B15, Y16] we will look at those 2 subsequences: [R1, B3, Y4], [R11, B10, B15, Y16]. Right now we have 3 sequences: the original, the first one of length 3 and the second one of length 4. Because we are looking for the sequence of minimal length we can break the one with length 4. We will ignore then the Y16 term and go ahead and rebuild the second sequence looking deeper into the original sequence. Once we find a proper one we apply the same rule (break the longest one). This will guarantee that we will always have the shorter one. The algorithm will stop when both the found sequences are in fact the same.

There is a problem somewhere in this algorithm but I'll let a little fun in problem for the readers :-). You can easily extend this to more than 3 terms maintaining the exact complexity. The actual complexity of this is O(n) where n is the sum of the term positions in the original text. Also there is another solution totally different than this one and quite interesting which for multiple terms might be faster.
Maybe I'll post it at some point.

Mihai

No comments: