Foundations Path

Reach beyond the basics of coding with these carefully curated resources for newer programmers who have already completed 1-2 computing courses. You’re at an exciting point and we want to support you however and wherever you’re learning!

View all sequences

Practice testing and debugging first

Hone your problem-solving skills with this former Google interview question focused on strings. It'll test your ability to optimize code and work with algorithms and data structures.

Take the challenge

Google

The Challenge

Given a string S and a set of words D, find the longest word in D that is a subsequence of S.

Word W is a subsequence of S if some number of characters, possibly zero, can be deleted from S to form W, without reordering the remaining characters.

Note: D can appear in any format (list, hash table, prefix tree, etc.

For example, given the input of S = "abppplee" and D = {"able", "ale", "apple", "bale", "kangaroo"} the correct output would be "apple"

  • The words "able" and "ale" are both subsequences of S, but they are shorter than "apple".
  • The word "bale" is not a subsequence of S because even though S has all the right letters, they are not in the right order.
  • The word "kangaroo" is the longest word in D, but it isn't a subsequence of S.
  • Learning objectives

    This question gives you the chance to practice with algorithms and data structures. It’s also a good example of why careful analysis for Big-O performance is often worthwhile, as is careful exploration of common and worst-case input conditions.

    Solution explained, commentary

    There are a number of different viable approaches (for example, there’s more than one algorithm that applies). This means you can really put your problem-solving ability to the test and build on what you know.

    Some solutions:

    Brute force

    Generate all 2^n possible subsequences, and check each one against the dictionary for a match. A slight optimization is to at least ensure the dictionary is represented by a hash table or prefix tree so that lookups are efficient.

    Check each dictionary word using a greedy algorithm

    You can prove that a greedy algorithm works for checking if a word w is a subsequence of S. Scan S from the beginning for w[0]. After you find it, continue scanning from that point for w[1], and so on, until either you run out of characters in S (w is not a subsequence), or you find all the characters in w (w is a subsequence).

    You could sort the dictionary words in order of descending length, run the above decision procedure for each word, and take the first word that is a subsequence of S. The running time is O(N*W) where W is the number of words in D and N is the number of characters in S.

    O(N*W) is asymptotically optimal if most dictionary words are close to size N (because then the size of the input is O(N*W)). However, it is far from optimal in a worst-case scenario where S may be long and the dictionary strings quite short.

    Define L to be the total number of characters in the dictionary over all words. Then the best possible time complexity (assuming words shorter than S) is O(N + L). However, the existing algorithm is O(N * W), which might be O(N * L) in the case where dictionary words are a small constant in size.

    Improving the greedy approach

    Notice that to check a dictionary word w, you end up needing to know, for each character c in w, the least index i in S where S[i] == c, such that i is greater than some given index j (the index of the previously-matched letter). Observe that the greedy algorithm is slow precisely because it naively scans for this index i.

    You can preprocess S to find such indices much faster. Build a map of letter -> sorted list of indices where letter occurs. For example, if S == "abppplee":

    Java Copied
    a -> [0]b -> [1]p -> [2, 3, 4]l -> [5]e -> [6, 7]

    When you need to know "Given my last matched character was at index X and my next character to match is Y, where does this match occur?", you would look up Y in the map, and then binary search the occurrence list to find the smallest number > X in that list, or report that none exists (the word is not a subsequence then).

    This approach requires one binary search per dictionary letter. Since it takes only O(N) time to preprocess S in this way, the total complexity is O(N + L * log N), which is quite close to the best possible complexity.

    Advanced approach: You might note that since this is a classic "successor-value" problem, specialized data structures exist that can run the search even faster. A vEB tree, for example, improves the runtime to O(N + L * log log N).

    An optimal O(N + L) approach for small alphabets

    If the size of the alphabet is a-z or a small constant, we can give a solution based off the previous idea that runs in O(N + L).

    The real complexity is O(N*A + L), where A is the size of the alphabet. Instead of making the representation sparse, you could make it dense. Instead of p -> [2, 3, 4] you could use p -> [2, 2, 3, 4, -1, -1, -1, -1] where the i-th index directly yields the answer to the query. This requires O(N) space per alphabet letter (O(NA) total) and also that much time during preprocessing. It is therefore not suitable for large alphabets.

    An optimal O(N + L) approach for any alphabet

    Instead of processing the words one at a time, we will process all words simultaneously.

    First, for each word w in D, create a 2-tuple containing w and the number 0 (i.e. (w, 0)). The number refers to the number of characters in w that have already been found in S; initially, no characters have been found. For a tuple t, we'll refer to the word as t.w, and to the number as t.i (since it is an index).

    Group the words by t.w[t.i] (initially t.i is 0, so initially it is by first letter). For our example dictionary D = {"able", "ale", "apple", "bale", "kangaroo"}, we’ll have:

    Java Copied
    a -> [("able", 0), ("ale", 0), ("apple", 0)]b -> [("bale", 0)]k -> [("kangaroo", 0)]

    What this is doing is telling you what words you will make progress on finding for each possible letter that you might see next in S.

    Scan S one character at a time. If the character you just scanned is c, for each tuple t in map[c], increment t.i by 1 and transfer t to map[t.w[t.i]]. The case where t.i == t.w.length is special - transfer these words to a "found" list. In the end, the output is the longest word in the found list.

    Code snippet for solution

    Copied
    #!/usr/bin/env pythonimport collectionsimport sysdef find_longest_word_in_string(letters, words):    letter_positions = collections.defaultdict(list)    # For each letter in 'letters', collect all the indices at which it appears.    # O(#letters) space and speed.    for index, letter in enumerate(letters):        letter_positions[letter].append(index)    # For words, in descending order by length...    # Bails out early on first matched word, and within word on    # impossible letter/position combinations, but worst case is    # O(#words # avg-len) * O(#letters / 26) time; constant space.    # With some work, could be O(#W * avg-len) * log2(#letters/26)    # But since binary search has more overhead    # than simple iteration, log2(#letters) is about as     # expensive as simple iterations as long as     # the length of the arrays for each letter is    # “small”.  If letters are randomly present in the    # search string, the log2 is about equal in speed to simple traversal    # up to lengths of a few hundred characters.                  for word in sorted(words, key=lambda w: len(w), reverse=True):        pos = 0        for letter in word:            if letter not in letter_positions:                break        # Find any remaining valid positions in search string where this        # letter appears.  It would be better to do this with binary search,        # but this is very Python-ic.        possible_positions = [p for p in letter_positions[letter] if p >= pos]        if not possible_positions:            break        pos = possible_positions[0] + 1        else:            # We didn't break out of the loop, so all letters have valid positions              return wordif __name__ == '__main__':    print subdict(sys.argv[1], sys.argv[2:])
    A further solution

    https://www.careercup.com/question?id=5757216146587648

    Related Topics