Tuesday, June 13, 2006

Goooogle interview questions II

This is the follow-up I promised some posts ago for the search problem.

Last time I was hinting at a totally different algorithm which might solve the problem. It's quite interesting and fun but .. it doesn't work :-). It will not find the shortest fragment but a short fragment. I'll give the code later. For now here is a variation of the alg I posted which has the same complexity but it's different :-).

Remember that with the algorithm posted you would merge the index arrays while keeping score of the term each index belonged (coloring them as I said). After that you would start with the minimal valid sequence from either end and walk toward the middle of the complete array (by breaking the longest sequence and then rebuilding again with new data) until you would have the same sequence in either end. This will require you to have the original index sets in memory before starting the work. This might be ok in some cases but if the index arrays are big you might get them from a database or web service or something else as a stream. In this case you would start only from one end and do the same (find a valid seq, break it, find another valid seq .. etc) while keeping score of the shortest one so far. This can enable you to actually apply the alg while reading the index sets (if they are too big).

The other interesting solution is this:
  • sort the shortest index array.
  • for each index in this index array you would find the nearest index in the second array.
  • with those two numbers you would compute the mean find in the third array the closest number to the mean. - now you will have a couple of candidates (the same amount as the number of items in the shortest index array). You need choose the shortest one.
It's a nice algorithm but it doesn't work completely. The counter example is real nice :-). However it is faster because the complexity is something like O(card(S1)) * (1 + O(log(card(S2)) + log(card(S3))) compared with O(card(S1) + card(S2) + card(S3))
as in the first algorithm. In some cases you might be content with the size of the solution it would find. I think is not that far off the real solution. In the case of 3 search terms they are even related :-).

Mihai.

No comments: