# Foundations Path

Reach beyond the basics of coding with these carefully curated resources for newer programmers who have already completed 1-2 computing courses. You’re at an exciting point and we want to support you however and wherever you’re learning!

## Practice testing and debugging first

### Former Coding Interview Question: Find longest word in dictionary that is a subsequence of a given string

Hone your problem-solving skills with this former Google interview question focused on strings. It'll test your ability to optimize code and work with algorithms and data structures.

Take the challenge#### The Challenge

Given a string `S`

and a set of words `D`

, find the longest word in `D`

that is a subsequence of `S`

.

Word `W`

is a subsequence of `S`

if some number of characters, possibly zero, can be deleted from `S`

to form `W`

, without reordering the remaining characters.

Note: `D`

can appear in any format (list, hash table, prefix tree, etc.

For example, given the input of `S = "abppplee"`

and `D = {"able", "ale", "apple", "bale", "kangaroo"}`

the correct output would be `"apple"`

##### Learning objectives

This question gives you the chance to practice with algorithms and data structures. It’s also a good example of why careful analysis for Big-O performance is often worthwhile, as is careful exploration of common and worst-case input conditions.

#### Solution explained, commentary

There are a number of different viable approaches (for example, there’s more than one algorithm that applies). This means you can really put your problem-solving ability to the test and build on what you know.

Some solutions:

##### Brute force

Generate all 2^n possible subsequences, and check each one against the dictionary for a match. A slight optimization is to at least ensure the dictionary is represented by a hash table or prefix tree so that lookups are efficient.

##### Check each dictionary word using a greedy algorithm

You can prove that a greedy algorithm works for checking if a word `w`

is a subsequence of `S`

. Scan `S`

from the beginning for `w[0]`

. After you find it, continue scanning from that point for `w[1]`

, and so on, until either you run out of characters in `S`

(`w`

is not a subsequence), or you find all the characters in `w`

(`w`

is a subsequence).

You could sort the dictionary words in order of descending length, run the above decision procedure for each word, and take the first word that is a subsequence of `S`

. The running time is `O(N*W)`

where `W`

is the number of words in `D`

and `N`

is the number of characters in `S`

.

`O(N*W)`

is asymptotically optimal if most dictionary words are close to size `N`

(because then the size of the input is `O(N*W)`

). However, it is far from optimal in a worst-case scenario where `S`

may be long and the dictionary strings quite short.

Define `L`

to be the total number of characters in the dictionary over all words. Then the best possible time complexity (assuming words shorter than `S`

) is `O(N + L)`

. However, the existing algorithm is `O(N * W)`

, which might be `O(N * L)`

in the case where dictionary words are a small constant in size.

##### Improving the greedy approach

Notice that to check a dictionary word w, you end up needing to know, for each character `c`

in `w`

, the least index `i`

in `S`

where `S[i] == c`

, such that `i`

is greater than some given index `j`

(the index of the previously-matched letter). Observe that the greedy algorithm is slow precisely because it naively scans for this index `i`

.

You can preprocess `S`

to find such indices much faster. Build a map of letter -> sorted list of indices where letter occurs. For example, if `S == "abppplee"`

:

`a -> [0]`

`b -> [1]`

`p -> [2, 3, 4]`

`l -> [5]`

`e -> [6, 7]`

When you need to know "Given my last matched character was at index X and my next character to match is Y, where does this match occur?", you would look up `Y`

in the map, and then binary search the occurrence list to find the smallest number > `X`

in that list, or report that none exists (the word is not a subsequence then).

This approach requires one binary search per dictionary letter. Since it takes only `O(N)`

time to preprocess `S`

in this way, the total complexity is `O(N + L * log N)`

, which is quite close to the best possible complexity.

Advanced approach: You might note that since this is a classic "successor-value" problem, specialized data structures exist that can run the search even faster. A vEB tree, for example, improves the runtime to `O(N + L * log log N)`

.

##### An optimal O(N + L) approach for small alphabets

If the size of the alphabet is a-z or a small constant, we can give a solution based off the previous idea that runs in `O(N + L)`

.

The real complexity is `O(N*A + L)`

, where A is the size of the alphabet. Instead of making the representation sparse, you could make it dense. Instead of `p -> [2, 3, 4]`

you could use `p -> [2, 2, 3, 4, -1, -1, -1, -1]`

where the i-th index directly yields the answer to the query. This requires `O(N)`

space per alphabet letter (O(NA) total) and also that much time during preprocessing. It is therefore not suitable for large alphabets.

##### An optimal O(N + L) approach for any alphabet

Instead of processing the words one at a time, we will process all words simultaneously.

First, for each word `w`

in `D`

, create a 2-tuple containing `w`

and the number `0`

(i.e. `(w, 0)`

). The number refers to the number of characters in `w`

that have already been found in `S`

; initially, no characters have been found. For a tuple `t`

, we'll refer to the word as `t.w`

, and to the number as `t.i`

(since it is an index).

Group the words by `t.w[t.i]`

(initially `t.i`

is `0`

, so initially it is by first letter). For our example dictionary `D = {"able", "ale", "apple", "bale", "kangaroo"}`

, we’ll have:

`a -> [("able", 0), ("ale", 0), ("apple", 0)]`

`b -> [("bale", 0)]`

`k -> [("kangaroo", 0)]`

What this is doing is telling you what words you will make progress on finding for each possible letter that you might see next in `S`

.

Scan `S`

one character at a time. If the character you just scanned is `c`

, for each tuple `t`

in `map[c]`

, increment `t.i`

by 1 and transfer `t`

to `map[t.w[t.i]]`

. The case where `t.i == t.w.length`

is special - transfer these words to a "found" list. In the end, the output is the longest word in the found list.

#### Code snippet for solution

`#!/usr/bin/env python`

`import collections`

`import sys`

`def find_longest_word_in_string(letters, words):`

`letter_positions = collections.defaultdict(list)`

`# For each letter in 'letters', collect all the indices at which it appears.`

`# O(#letters) space and speed.`

`for index, letter in enumerate(letters):`

`letter_positions[letter].append(index)`

`# For words, in descending order by length...`

`# Bails out early on first matched word, and within word on`

`# impossible letter/position combinations, but worst case is`

`# O(#words # avg-len) * O(#letters / 26) time; constant space.`

`# With some work, could be O(#W * avg-len) * log2(#letters/26)`

`# But since binary search has more overhead`

`# than simple iteration, log2(#letters) is about as`

`# expensive as simple iterations as long as`

`# the length of the arrays for each letter is`

`# “small”. If letters are randomly present in the`

`# search string, the log2 is about equal in speed to simple traversal`

`# up to lengths of a few hundred characters.`

`for word in sorted(words, key=lambda w: len(w), reverse=True):`

`pos = 0`

`for letter in word:`

`if letter not in letter_positions:`

`break`

`# Find any remaining valid positions in search string where this`

`# letter appears. It would be better to do this with binary search,`

`# but this is very Python-ic.`

`possible_positions = [p for p in letter_positions[letter] if p >= pos]`

`if not possible_positions:`

`break`

`pos = possible_positions[0] + 1`

`else:`

`# We didn't break out of the loop, so all letters have valid positions`

`return word`

`if __name__ == '__main__':`

`print subdict(sys.argv[1], sys.argv[2:])`