Saturday, June 17, 2006

Semantics

Usually, when defining a new programming language, one has to provide two things:
the syntax of the language and the semantics of the language. Now you might ask: why do you need both? Isn't the syntax enough?

Think about this: if I came to you and gave you a paper written in Klingon, would you be able to read it? (Hardcore Star Trek fans would probably get a paper written in rural Romanian :-)). In general the answer would be no. In order to read it you would need to know how to transform the Klingon phrases into English phrases (a language that you already know). You would need at least a Klingon/English dictionary, and to do it properly you would also need to know the grammar rules of the Klingon language. The dictionary and the grammar rules are in fact a semantics of the Klingon language, with the English language as the target (one that the recipient knows!). In effect, the semantics of a language defines the meaning of phrases written in that language. This is another reason why crypto works :-). When you get an encrypted message you really have no clue what meaning is associated with the text you got, and you can deduce virtually zero information about the semantics starting from the text alone. Weak ciphers usually have the semantics embedded into the syntax, or stay too close to the encrypted source language, and leak that information under careful analysis.

You can define the semantics in more than one way (see here for a more detailed presentation of the formal semantics of programming languages). To put it shortly, there are three big approaches: translation, interpretation and logical axioms. When you give a semantics using the translation method, you are actually writing the translation rules from the source language into a known target language (compiling C to ASM does something like this). When using the interpretation method, you are effectively describing what will happen for every source phrase on a known target system (assembler can usually be described like that). I don't have a good example at hand for the third approach, so I will just sketch it: you define axioms for every building block of the language and inference rules for every way of combining building blocks. The meaning of a phrase in the language is then the resulting meaning in the defined logic (XSLT probably is/can be defined using something similar).
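To make the interpretation approach a bit more concrete, here is a toy example of my own (not taken from any real language definition): a tiny expression language in Java, where the eval() method is the semantics, mapping every phrase to a number.

// A toy interpretational semantics: every phrase (Expr) gets its meaning
// from eval().
interface Expr {
    int eval();
}

class Num implements Expr {
    final int value;
    Num(int value) { this.value = value; }
    public int eval() { return value; }  // a literal means itself
}

class Add implements Expr {
    final Expr left, right;
    Add(Expr left, Expr right) { this.left = left; this.right = right; }
    public int eval() { return left.eval() + right.eval(); }  // meaning of "+"
}

class Demo {
    public static void main(String[] args) {
        // the phrase (1 + 2) + 3 means 6 under this semantics
        System.out.println(new Add(new Add(new Num(1), new Num(2)), new Num(3)).eval());
    }
}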

That's it for today :-). See you next time.

Mihai

Wednesday, June 14, 2006

Problem: Detecting duplicated items in an array

And another problem :-).

Someone asked me to give him problems so that he can train for an interview. I gave him the problems from Google posted here, and in turn he gave me another one: you have an array of length n, filled with n random numbers between 0 and n-1. The task is to find out if any of them are repeated.

The trivial solution would be to sort the array and detect any repeated number in the sorted array. This one has O(n * log(n)) time complexity. We should be able to do better. Here is what I came up with after a little thinking.
  • start with the first element of the array whose value is different from the index at which it is located. For int[] arr = {0, 1, 2, 3, 5, 4} I would start at position 4, because the value 5 is different from its index.
  • take the value and go to the position in the array given by that value. Look at the value there. If that value is the same as its index, we have just found our duplicate. If not, remember the new value, put the old value in its place, and repeat.
  • if the next value is the original starting position, you have just arranged a subset of the original array into its proper place. Select the next number matching the condition at step 1, if possible. If not, you are done: the array is sorted and you also know that there are no repeated numbers in it.
The worst case needs fewer than 2 * n operations, i.e. O(n), because every swap puts at least one element into its final place. In the best and average case you will probably finish in fewer than n/2 operations, but this is just a guess; you would have to apply some statistics :-) to find out what the average case really is.
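Here is a minimal Java sketch of the idea (the naming is mine; note that it rearranges the array in place):

// Sketch of the in-place cycle-placement idea described above.
class DuplicateDetector {
    static boolean hasDuplicate(int[] arr) {
        for (int i = 0; i < arr.length; i++) {
            while (arr[i] != i) {              // value is not at its home index
                if (arr[arr[i]] == arr[i]) {   // home slot already holds it
                    return true;               // found a duplicate
                }
                int tmp = arr[arr[i]];         // swap the value into its home
                arr[arr[i]] = arr[i];
                arr[i] = tmp;
            }
        }
        return false;  // the array is now 0..n-1 in order, so no repeats
    }
}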

Mihai.

Velocity compiler again

I posted a while ago about my Velocity "compiler". There is another interesting problem in it: how do you handle internal macros? In Velocity you can define macros, which are pieces of reusable code. The problem is how you translate those into a Java class. The solution would obviously be something like a method call (this is what I use right now, but it might become something totally different). But with a method call you have the typing problem :-).

You start from this:

#macro( macroName $macroParam1 $macroParam2 )
## macro code here
#end

## and a macro call here
#macroName( $actualParam1 $actualParam2 )

and you should end up with something like:

protected void macroName(Type1 macroParam1, Type2 macroParam2) {
// function code translated from macro here.
}

// and a call here
macroName(actualParam1, actualParam2);

The problem is: how do you find out the Type1 and Type2 parameter types? It turns out that there are two approaches: infer the types from the actual calls, or infer them from the macro code itself. I haven't really thought about the latter, so I will talk mostly about the first one.

I used this approach because I already had some code helping me do it: I use it to infer the type of the expression in a #set( $var = $value ) statement in order to properly convert it into an InferredType var = value; statement. So what I ended up doing is this:
- visit the AST of the Velocity macro
- do the translation for all the nodes which are not an ASTDirective named "macro" (this is how an internal macro shows up in the tree).
- every node not translated is put in a queue and remembered for later.
- while processing the other nodes, look for macro call sites.
  - infer the types of the parameters passed to the macro and keep them in a map in the generator.
- after the processing is completed, iterate over the queued macros and build them as methods, using the parameter types inferred earlier.
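Roughly, the call-site bookkeeping looks like this (a sketch with hypothetical names; recordCall and inferType stand in for the real visitor and inference engine, they are not Velocity classes):

import java.util.*;

// Sketch of the call-site type bookkeeping described above.
class MacroTypeRecorder {
    // macro name -> parameter types inferred from the first call site seen
    final Map<String, List<Class<?>>> macroParamTypes = new HashMap<>();

    void recordCall(String macroName, List<Object> actualArgs) {
        List<Class<?>> types = new ArrayList<>();
        for (Object arg : actualArgs) {
            types.add(inferType(arg));
        }
        // the queued macro bodies are later emitted as methods using these
        // inferred types (first call site wins in this sketch)
        if (!macroParamTypes.containsKey(macroName)) {
            macroParamTypes.put(macroName, types);
        }
    }

    Class<?> inferType(Object arg) {
        // placeholder: the real engine walks the expression AST
        return arg == null ? Object.class : arg.getClass();
    }
}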

This works reasonably well in most cases, however there are cases in which it fails :-). Both the failure case and its solution should be visible from the algorithm description..

Mihai.

Tuesday, June 13, 2006

Goooogle interview questions II

This is the follow-up I promised some posts ago for the search problem.

Last time I hinted at a totally different algorithm which might solve the problem. It's quite interesting and fun, but... it doesn't work :-). It will not find the shortest fragment, only a short fragment. I'll give the code later. For now, here is a variation of the algorithm I posted, which has the same complexity but is different :-).

Remember that with the algorithm posted you would merge the index arrays while keeping track of the term each index belonged to (coloring them, as I said). After that you would start with the minimal valid sequence at either end and walk toward the middle of the merged array (by breaking the longest sequence and then rebuilding it with new data) until you have the same sequence at both ends. This requires you to have the original index sets in memory before starting the work. That might be OK in some cases, but if the index arrays are big, you might be getting them from a database, a web service or something else as a stream. In that case you would start from one end only and do the same thing (find a valid sequence, break it, find another valid sequence, and so on) while keeping track of the shortest one so far. This lets you apply the algorithm while still reading the index sets (if they are too big).
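Here is a minimal Java sketch of that one-end sweep (my naming), assuming the merged, colored stream is given as two parallel arrays: positions sorted ascending, and colors in 0..termCount-1. It keeps the shortest window seen so far that contains every color:

// One-pass sweep over the merged colored stream: grow the window to the
// right until it covers all terms, then break it from the left.
class ShortestFragment {
    static int[] shortestWindow(int[] positions, int[] colors, int termCount) {
        int[] count = new int[termCount];  // occurrences of each color in window
        int covered = 0, lo = 0;
        int bestLo = -1, bestHi = -1;
        for (int hi = 0; hi < positions.length; hi++) {
            if (count[colors[hi]]++ == 0) covered++;
            // shrink the window from the left while it still covers all terms
            while (covered == termCount) {
                if (bestLo < 0
                        || positions[hi] - positions[lo]
                           < positions[bestHi] - positions[bestLo]) {
                    bestLo = lo;
                    bestHi = hi;
                }
                if (--count[colors[lo++]] == 0) covered--;
            }
        }
        // endpoints of the shortest fragment, or null if a term never occurs
        return bestLo < 0 ? null : new int[] { positions[bestLo], positions[bestHi] };
    }
}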

The other interesting solution is this:
  • start with the shortest index array (all of them are already sorted).
  • for each index in this array, find the nearest index in the second array.
  • with those two numbers, compute the mean and find in the third array the number closest to that mean.
Now you have a set of candidates (as many as there are items in the shortest index array), and you choose the shortest one.
It's a nice algorithm, but it doesn't work completely; the counterexample is really nice :-). However, it is faster, because the complexity is something like O(card(S1) * (1 + log(card(S2)) + log(card(S3)))), compared with O(card(S1) + card(S2) + card(S3)) for the first algorithm. In some cases you might be content with the size of the solution it finds; I think it is not that far off the real solution. In the case of 3 search terms the two are even related :-).
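For reference, here is a sketch of that approximate algorithm (my naming, assuming three sorted index arrays). It returns the endpoints of a short, but not necessarily shortest, fragment:

import java.util.*;

// Approximate fragment search: one candidate per index of the shortest
// array, built by nearest-neighbour lookups in the other two arrays.
class ApproxFragment {
    static int[] shortFragment(int[] s1, int[] s2, int[] s3) {
        int[] best = null;
        for (int a : s1) {
            int b = nearest(s2, a);             // closest index in S2 to a
            int c = nearest(s3, (a + b) / 2);   // closest in S3 to the mean
            int lo = Math.min(a, Math.min(b, c));
            int hi = Math.max(a, Math.max(b, c));
            if (best == null || hi - lo < best[1] - best[0]) {
                best = new int[] { lo, hi };    // keep the shortest candidate
            }
        }
        return best;
    }

    // binary search for the element of a sorted array closest to target
    static int nearest(int[] sorted, int target) {
        int pos = Arrays.binarySearch(sorted, target);
        if (pos >= 0) return sorted[pos];
        int ins = -pos - 1;  // insertion point
        if (ins == 0) return sorted[0];
        if (ins == sorted.length) return sorted[sorted.length - 1];
        return target - sorted[ins - 1] <= sorted[ins] - target
                ? sorted[ins - 1] : sorted[ins];
    }
}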

Mihai.

Spellchecking :-((

I really need to learn how to use a spellchecker, or at least to read a post again before publishing it.

The trouble is... in general I really don't have something worth telling other people, and when I do, I usually rush to tell them and forget to follow the proper practices. The good news is that I know about the issue and I'm trying to change the habit. In the meantime I'll ask the readers to bear with me and/or report the mistakes they find, if they think they are bad :-).

Mihai

Monday, June 12, 2006

Goooogle interview questions

The other day a friend of mine hooked me with a couple of interview questions given by Google. Some of them are challenging and quite interesting. Here is one (taken from here): Question #8) Given an array A[string], an array of strings where each string represents a word in a text document. Also given 3 search terms T1, T2, and T3 and 3 corresponding sorted sequences of integers S1, S2, and S3, where each integer in Si represents an index in A where search term Ti occurred (i.e. S1, S2, and S3 contain the locations of the search terms in our array of words). Now find a minimal subarray of A that contains all of the search terms T1, T2, and T3. Extend this algorithm to an arbitrary number of search terms.

This is in effect asking you to solve the excerpting problem: given a text and a couple of search terms with their matching positions in the original text, you are asked to return an "interesting" fragment of the original text containing all the search terms. In this case the interestingness criterion is the length of the fragment: the fragment is most interesting when it is shortest.

My first solution was the interval [max(min(S1), min(S2), min(S3)), min(max(S1), max(S2), max(S3))], which is not good. The reason is that the interval can leave one or more search terms out, or not even be a valid interval: for S1 = {1, 3}, S2 = {5, 7}, S3 = {4, 8} you would get [5, 3], which is empty. You could also choose the [min(min(...)), max(max(...))] interval, which does contain all the terms, but is too broad (in effect, it is the longest possible).

Instead, you can merge the S1, S2, S3 sequences and sort them (having "colored" the indexes first, i.e. tagged each one with the term it came from). The merged result is a proper sequence (it has all the colors in it), but it is too long; we need to shorten it. For this we look, at each end of the sequence, for the shortest subsequence starting at that end which contains all the colors. For example, given this sequence: [R1, B3, Y4, R5, ..., R11, B10, B15, Y16] we would look at these 2 subsequences: [R1, B3, Y4] and [R11, B10, B15, Y16]. Right now we have 3 sequences: the original, the first one of length 3 and the second one of length 4. Because we are looking for the sequence of minimal length, we break the one of length 4: we ignore the Y16 term and rebuild the second sequence, looking deeper into the original sequence. Once we find a proper one, we apply the same rule again (break the longest one). This guarantees that we always keep the shorter one. The algorithm stops when the sequences found at the two ends are in fact the same.
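For concreteness, here is a small sketch of just the merge-and-color step (my naming); the walk from both ends would then operate on this merged list:

import java.util.*;

// Merge-and-color: tag each position with the term ("color") it belongs
// to, then merge everything into one list sorted by position.
class ColoredMerge {
    static final class Colored {
        final int index, term;  // term is the "color"
        Colored(int index, int term) { this.index = index; this.term = term; }
    }

    static List<Colored> mergeAndColor(int[]... termPositions) {
        List<Colored> merged = new ArrayList<>();
        for (int term = 0; term < termPositions.length; term++) {
            for (int index : termPositions[term]) {
                merged.add(new Colored(index, term));
            }
        }
        merged.sort(Comparator.comparingInt(c -> c.index));
        return merged;
    }
}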

There is a problem somewhere in this algorithm, but I'll leave a little fun in the problem for the readers :-). You can easily extend this to more than 3 terms while maintaining the same complexity. The actual complexity is O(n), where n is the total number of term positions, card(S1) + card(S2) + card(S3). Also there is another solution, totally different from this one and quite interesting, which for multiple terms might be faster. Maybe I'll post it at some point.

Mihai

Friday, June 02, 2006

Compiling Velocity

So I started to write a Velocity compiler :-). It will convert a Velocity template into a Java class which, when compiled, will produce a generator whose output is comparable (ideally identical) to the original output of the Velocity template.

The main reason for doing this is to see if there is any performance improvement. Right now, the way the templates are used is very inconvenient. Another issue is the fact that the templates are still being edited, so a "one shot" conversion will not do any good. Yet another reason would be the coolness factor :-).

The task is quite challenging, since Velocity is not a typed language and the target language is. The way to do it properly is to have some sort of type inference engine applied to expressions, to find the most likely type for each one. It actually works quite nicely. You also have issues with variable scopes, because of certain usage practices in Velocity macros. For example, you can write something like this:

#set( $var = "" )
#set( $var = $expression )

If you translate this to:

String var = "";
var = expression;

and expression has a different type than String, you are in trouble with javac.

More next time.

Looooong developer cycle.

How long should the code/compile/test cycle be for a developer? Most of the time, if it is longer than 10-20 seconds, it is too long. Well, right now I have the pleasure of working with a codebase where the smallest possible cycle (for trivial changes) is around 30 seconds to a minute. If you need to make real code changes, it will set you back approximately 10 minutes :-). If you need to sync from svn and do a clean build with the proper model changes and such, you pay 90-120 minutes. And that does not account for invisible changes in the database schema: if the model has changed and you need to add a column to a table, you will only find out at runtime :-(.

So... how long is the longest developer cycle?

Mihai