Former interview question: compression and decompression

In this exercise, you're going to decompress a compressed string.

Your input is a compressed string of the format number[string] and the decompressed output form should be the string written number times. For example:

The input

3[abc]4[ab]c

Would be output as

abcabcabcababababc

Other rules

Number can have more than one digit. For example, 10[a] is allowed, and just means aaaaaaaaaa

One repetition can occur inside another. For example, 2[3[a]b] decompresses into aaabaaab

Characters allowed as input include digits, small English letters and brackets [ ].

Digits are only to represent amount of repetitions.

Letters are just letters.

Brackets are only part of syntax of writing repeated substring.

Input is always valid, so no need to check its validity.

Learning objectives

This question gives you the chance to practice with strings, recursion, algorithm, compilers, automata, and loops. It’s also an opportunity to work on coding with better efficiency.

Hints

Think about a couple of different ways of handling nesting.

Try to find approaches that minimize the number of times substrings are copied around.

For your algorithm, what would worst-case input look like? Can malicious input force N^2 behavior?

Does not having to deal with numbers or brackets in the strings allow doing things more idiomatically in your implementation language?

Consider edge cases: a[]b, 0[abc]

Would your implementation be simpler if you always prepended 1[ and appended ] to the input string?

A basic solution involves use of simple recursion and loops.

Alternatively you can try a “peel the onion" approach, where you first decode top layer without using recursion, and then if there is any repetition group left, do the same with output after first phase. This is not bad, but the first approach is better as it may use a lot less memory.

There are several catches, for example:

numbers with multiple digits, like 10 or 100

handling non-repeated strings

handling repetitions inside

not doing the same job twice

not copying too much

Discussion questions

Once you've tried the problem, try thinking through these questions -- and maybe even write more code to push your skills a step further.

1. How would you test your solution in light of you've practiced in this path?

2. What are some different approaches you could take to your solution -- and how would you test those?

#!/usr/bin/env pythonimport sysdef decomp(text, start=0, times=1):    """    Iterate over and decompress the compressed string.    This approach makes use of nested Python iterators, which is a very clean way    of expressing expansion of nested iterated items.    Args:        text: The entire string to decompress.  It's unobvious, but nice          to have the whole string plus an index; this allows any error          detection code to report the absolute index of a problematic          character.        start: The starting index within 'text'.  We decompress from          'start' up through the matching close-brace or end-of-string.        times: The number of times to repeat decompression.    """# Repeat iteration over our sub-chunk N times.for _ in xrange(times):    i = start    # Until we hit the end of the string, or end of our chunk...    while i < len(text) and text[i] != ']':        # Emit letters as themselves.        if text[i].islower():            yield text[i]    # If it's not a letter, it must be digit(s).  Convert to integer.    else:        sub_times = 0        while text[i].isdigit():            sub_times = sub_times * 10 + int(text[i])            i += 1        i += 1  # Skip past open-bracket        # Start an iterator over the sub-chunk.        for item in decomp(text, i, sub_times):            # Iterator generates many characters, and then at the very end,            # it generates an integer.  Provide characters up to our caller,            # and use the integer to advance our index 'i' to end-of-chunk.            if isinstance(item, basestring):                yield item            else:                i = item        # Advance 'i' to the next letter, or skip the close-bracket, whichever.        i += 1# Don't emit the trailing integer if we are doing the outermost call.  This# test could be moved to the decompress() call instead; we would check there# to see if the result item was basestring or int, just as we do above, but# I suspect that check would be more expensive than a simple integer# comparison here, where the type is known.if start > 0:    yield idef decompress(text):    # We could wrap 'text' to add a leading '1[' and trailing ']' to allow a    # little bit of simplification in the logic in decomp(), but the    # simplification would lead to harder-to-read code, as well as requiring    # O(n) additional time, and a temporary requirement for O(n + 3) additional    # space during the copy operation.    #    # This is O(decompressed-length) for speed (probably), and up to    # O(compressed-length/4) for additional storage.  In this implementation,    # the storage requirement is well-disguised: It appears on the function call    # stack, rather than in declared variables.  E.g., consider a worst-case    # input of: 1[1[1[1[1[1[1[1[1[1[1[1[1[1[1[1[1[1[1[1[xx]]]]]]]]]]]]]]]]]]]]    # which would require a stack depth of 20.    #    # The (probably) for the big-O depends on how well Python implements    # resumption of nested iteration.  At worst, the string above would require    # a full stack descent, then ascent for *each* of the two 'x' characters,    # for a total of O(n^2) time.  Another very well-hidden possible cost.    for letter in decomp(text):        sys.stdout.write(letter)    sys.stdout.write('
')    if __name__ == '__main__':        decompress(sys.argv[1])

A further solution

http://massivealgorithms.blogspot.tw/2016/12/leetcode-471-encode-string-with.html

Resource Library

Other rules

Learning objectives

Hints

Discussion questions

A further solution

Related Topics