Last weekend I had a few hours to spare, so I decided to work on one of the problems from CodeSprint2, InterviewStreet’s contest to find the best programming talent.

Since I didn’t have long to work, I decided to try to complete the algorithm problem with the highest value. I chose to work on Count Strings (closed now). Given a regular expression, and a string length N, count the number of distinct strings of length N which the regular expression can match.

I looked at the sample test case, spent some time thinking about the problem and decided to try it. Although bounds for test cases were provided, I gave them only a cursory glance. This would turn out to be a mistake.

In my daily work, I typically focus on getting things working rather than spending time on determining the optimal way to do things. It is with this mindset that I decided that I would actually enumerate every string that a regular expression could produce, and then count them. I failed even before I began.

The regular expression language was limited to the alphabet {a, b}, with operators star, union, and concatenation.

I decided to use Haskell, which has the terseness of Ruby or Python, but the performance and type checking of a compiled language. I created a data type to represent the regular expressions. Recursive data types made this representation very straightforward.

data RegularExpression = Symbol Char | Concat RegularExpression RegularExpression | Union RegularExpression RegularExpression | Star RegularExpression deriving (Show, Eq)

I had the enumeration part complete fairly quickly. Pattern matching makes everything really intuitive and readable. The star operation required the most code.

listPossibilities (Symbol c) limit | limit > 0 = [c] | otherwise = [] listPossibilities (Concat r1 r2) limit = combos where o1 = listPossibilities r1 limit o2 = listPossibilities r2 limit combos = [ a ++ b | a <- o1, b <- o2, length (a++b) <= limit ] listPossibilities (Union r1 r2) limit = possibilities where o1 = listPossibilities r1 limit o2 = listPossibilities r2 limit possibilities = o1 ++ o2 listPossibilities (Star _) 0 = [""] listPossibilities (Star r1) limit = possibilities where opt = listPossibilities r1 limit possibilities = "": whileLimit opt limit optSet optSet = Set.fromList opt -- | Uses nub right now, really inefficient. probably should use some sort of memoization whileLimit :: [String] -> Int -> [String] -> [String] whileLimit base lim acc = pos where new = List.nub [a++b | a <- base, b <- acc, length (a++b) <= lim] pos = if null new || length new < length acc then acc else whileLimit base lim (List.nub $ acc ++ new)

The next component I needed was a parser to convert input into these regular expression objects. I used the Parsec parser module to construct the parser. This part actually took a bit longer, as I tried to refine the parser to accept deeply nested expressions.

The regular expression union parser below parses a parenthesized expression or symbol, followed by a pipe, and another parenthesized expression or symbol, returning a RegularExpression object.

reUnion :: GenParser Char st RegularExpression reUnion = do re1 <- pexpr <|> reSymbol char '|' re2 <- pexpr <|> reSymbol return (Union re1 re2)

Once I got the parser working, I glued it together with the enumeration portion, and submitted it to the contest. It immediately failed to compiled due to a dependency on Test.HUnit I had in my code. I reorganized the code so that I could easily remove this dependency, and resubmitted.

This time, it took, but when I went back after a few minutes to check the submission status, it had failed all but 1 of the test cases. It had taken too much time! I went back to the submission guidelines and noted that there was a time limit of 5 minutes.

Originally, I had used an algorithm which enumerated by checking distinctness using a List type. I knew at the time that this was very inefficient, but I focused on finishing the implementation at the time. I went back to this part and used a Set instead. I was optimistic that this would fix my problems and I would be able to knockout at least another test case.

I resubmitted and I failed 10 tests again. Then, I reread the problem, and looks at the boundaries for N. N could be up to 100,000 (IIRC)! Of course enumeration (especially the way I was doing it which would enumerate many strings of less than length N) would fail–I had taken a totally incorrect approach to this problem. I did not need to enumerate–I only needed to give a number of potential strings. This could’ve been done without enumeration. For example, given the expression a*, and a number N, there is only 1 possibility. Given . For example, one of the test cases was , which has 32 possibilities. Another of the expressions was . Since b can only appear once, the possible strings include strings of a’s and a single b. There are N such combinations (the number of positions b can be in, in an N length string): baa…, aba…, aab…, … . This is the approach I should’ve taken.*

I left the solution there, and acknowledged my defeat. Source here: https://github.com/jamiely/count-strings