In today's lesson, we'll dive into the world of Python sets. We'll focus on leveraging this powerful data structure to solve two common interview problems. Sets' ability to handle unique elements can simplify and optimize our problem-solving process, making it more efficient. So, if you're ready to discover the power and nuances of using sets to solve problems, let's get started!
Our first problem revolves around identifying the first unique string from a list. Imagine you're working on a text analyzing tool that needs to identify the first unique word in a piece of text. This problem simulates such a real-world scenario.
At first glance, a naive approach would be to iterate over the list of words and, for each word, scan the entire list again to count its occurrences. This nested scanning results in an unsightly time complexity of O(N²), as each word incurs another full traversal of the list. As the list grows in size, the time taken by this approach grows quadratically, making it impractical for larger datasets.
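For reference, here is a minimal sketch of that naive approach (the function name is chosen purely for illustration):

```python
def find_unique_string_naive(words):
    # For every word, rescan the whole list to count its occurrences: O(N^2) overall
    for word in words:
        if words.count(word) == 1:
            return word
    return ""
```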
Let's introduce Python sets to the stage! The defining property of a set is that it contains unique elements, making it a perfect fit for our current predicament.
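A quick illustration of that property:

```python
seen = set()
seen.add("apple")
seen.add("apple")  # adding a duplicate has no effect
print(seen)        # {'apple'}
```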
Our strategy consists of two parts, each tailored to leverage the capabilities of sets:
- We scan through the words, keeping track of the previously encountered words in a set called `seen`. A crucial aspect of our solution comes from an inherent feature of sets: if a word is already in `seen`, adding it again does not change the set. Keeping this in mind, we create a second set, `duplicates`, consisting only of words that we have previously seen.
- Once we know which words are duplicates, it becomes elementary to find the first word in our original list that isn't a duplicate. We make another pass over the list, checking each word to see if it's in the `duplicates` set. The first word we find that isn't a duplicate is our answer.
By focusing our solution around sets, we've reduced the problem to two single-pass traversals, giving our solution a linear time complexity of O(N), a significant improvement over the naive approach.
Let's break down our solution:
1. In the initial iteration over `words`, if a `word` already exists in the `seen` set, we identify it as a duplicate and add it to `duplicates`. If not, the `word` is added to `seen`.
```python
def find_unique_string(words):
    seen = set()        # every word we've encountered so far
    duplicates = set()  # words that appeared more than once

    for word in words:
        if word in seen:
            duplicates.add(word)
        seen.add(word)
```
This code creates two sets: `seen` contains all words we've come across, and `duplicates` contains words that have appeared more than once. To visualize it, consider `words` as `['apple', 'banana', 'apple']`. After the above block of code, we'd have `seen` as `{'apple', 'banana'}` and `duplicates` as `{'apple'}`.
2. In the next phase of our solution, we iterate over `words` again, checking if a `word` is in `duplicates`. The first `word` that is not in `duplicates` is our answer, as it's the first unique string in the list. If we don't find any unique string, we return an empty string.
```python
    # Continuing inside find_unique_string: a second pass over words
    for word in words:
        if word not in duplicates:
            return word

    return ""
```
Thanks to two simple traversals, we've efficiently solved a problem that initially seemed complex.
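Putting both passes together, a quick check with the example from above (the sample lists are only for illustration):

```python
print(find_unique_string(['apple', 'banana', 'apple']))  # banana
print(find_unique_string(['apple', 'apple']))            # prints an empty string
```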
Our second problem asks us to identify all pairs of anagrams where each pair is made up of one word from the first list and one from the second list. For those unfamiliar, an anagram is a word or phrase created by rearranging the letters of another word or phrase, using all the original letters exactly once.
Consider a cryptology scenario. You've intercepted two separate messages, each consisting of a list of coded words. You suspect that there might be some connection between the two messages - specifically, that some words from one list are anagrams of words in the other list. Your goal is to find these pairs of anagram words to decipher the code.
The most straightforward approach to this problem would involve checking each word from the first list against each word from the second list to find anagrams. While this would provide correct results, it's an inefficient method with a time complexity of O(N × M × L), where N is the size of the first list of words, M is the size of the second list of words, and L is the average word length (the cost of comparing a single pair of words). As you can see, it gets impractically slow for larger inputs.
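A minimal sketch of that brute-force idea, with a helper function name chosen purely for illustration:

```python
def find_anagram_pairs_naive(list_1, list_2):
    output = []
    for word1 in list_1:
        for word2 in list_2:
            # Two words are anagrams exactly when their sorted characters match
            if sorted(word1) == sorted(word2):
                output.append((word1, word2))
    return output
```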
We can achieve a more efficient solution by representing each word from both lists as a sorted tuple of characters. This gives us a unified form for each set of anagram words, making them easy to compare. If the sorted tuples of characters for two words are the same, then those words are anagrams. Once we have these sorted tuples, we can use Python's set methods to find pairs of words that are anagrams of each other.
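For example, 'listen' and 'silent' collapse to the same sorted tuple:

```python
print(tuple(sorted("listen")))  # ('e', 'i', 'l', 'n', 's', 't')
print(tuple(sorted("silent")))  # ('e', 'i', 'l', 'n', 's', 't')
print(tuple(sorted("listen")) == tuple(sorted("silent")))  # True
```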
Here's how we fulfill the task:
- We first convert every word from both lists to a sorted tuple of its characters to have a unified form for all anagram words.
```python
sorted_tuples_1 = set(tuple(sorted(word)) for word in list_1)
sorted_tuples_2 = set(tuple(sorted(word)) for word in list_2)
```
- Now each of those sets contains the unique character tuples for its list. We find the common tuples between the two, which represent words that have an anagram in the other list.
```python
common_tuples = sorted_tuples_1 & sorted_tuples_2
```
- For the final output, we iterate over the words in the original lists again, and for each word, if its sorted character tuple is present in the `common_tuples` set, we add it to the respective output list.
```python
list_1_output = [word for word in list_1 if tuple(sorted(word)) in common_tuples]  # contains anagrams from the first list
list_2_output = [word for word in list_2 if tuple(sorted(word)) in common_tuples]  # contains anagrams from the second list
```
- Finally, we return a list of tuples, where each tuple is an anagram pair from `list_1_output` and `list_2_output`.
```python
output = []
for word1 in list_1_output:
    for word2 in list_2_output:
        # Traverse every pair of words in the filtered lists
        if tuple(sorted(word1)) == tuple(sorted(word2)):
            # If the words in the pair are anagrams, add them to the output list
            output.append((word1, word2))
return output
```
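To see the whole pipeline in one place, here is one way the snippets above could be assembled into a single function; the function name and sample lists are only for illustration:

```python
def find_anagram_pairs(list_1, list_2):
    sorted_tuples_1 = set(tuple(sorted(word)) for word in list_1)
    sorted_tuples_2 = set(tuple(sorted(word)) for word in list_2)
    common_tuples = sorted_tuples_1 & sorted_tuples_2

    list_1_output = [word for word in list_1 if tuple(sorted(word)) in common_tuples]
    list_2_output = [word for word in list_2 if tuple(sorted(word)) in common_tuples]

    output = []
    for word1 in list_1_output:
        for word2 in list_2_output:
            if tuple(sorted(word1)) == tuple(sorted(word2)):
                output.append((word1, word2))
    return output

print(find_anagram_pairs(["listen", "tower"], ["silent", "inlets"]))
# [('listen', 'silent'), ('listen', 'inlets')]
```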
The average time complexity of this method is O((N + M) × L × log L + P × L), where N and M are the sizes of `list_1` and `list_2`, respectively, L is the average word length, and P is the number of word pairs formed from the `list_1_output` and `list_2_output` lists, which makes this solution much quicker and more scalable than the naive approach. Note that because not every pair in `list_1_output` and `list_2_output` is actually an anagram pair, the time complexity can still degrade to quadratic in rare cases, even when the output contains relatively few anagram pairs.
While this course hasn't covered Python dictionaries yet, they are a very powerful tool that can make this solution even more efficient. We will dig into dictionaries later in the course, but if you're curious, take a moment to walk through the solution below and try to understand how it works. This part is entirely optional, so don't worry if it doesn't fully click yet.
```python
from collections import defaultdict

# Create mapping for `list_1`
mapping_1 = defaultdict(list)
# mapping_1 stores (sorted anagram) -> list[anagrams] mapping for `list_1`
for word in list_1:
    sorted_tuple = tuple(sorted(word))  # unique identifier of the anagram
    mapping_1[sorted_tuple].append(word)
    # `mapping_1[sorted_tuple]` stores all anagrams under the same identifier for `list_1`

# Create mapping for `list_2`
mapping_2 = defaultdict(list)
# mapping_2 stores (sorted anagram) -> list[anagrams] mapping for `list_2`
for word in list_2:
    sorted_tuple = tuple(sorted(word))  # unique identifier of the anagram
    mapping_2[sorted_tuple].append(word)
    # `mapping_2[sorted_tuple]` stores all anagrams under the same identifier for `list_2`

# Intersect keys from mapping_1 and mapping_2 to get common sorted tuples
# Every element in `common_tuples` is an anagram identifier that exists in both lists
common_tuples = set(mapping_1.keys()) & set(mapping_2.keys())

output = []
for anagram_tuple in common_tuples:
    for word1 in mapping_1[anagram_tuple]:
        for word2 in mapping_2[anagram_tuple]:
            # Both word1 and word2 have the same anagram identifier, so they are anagrams
            output.append((word1, word2))

return output
```
This solution only traverses pairs that are actually anagrams, so the complexity is O((N + M) × L × log L + P), where N and M are the sizes of `list_1` and `list_2`, respectively, L is the average word length, and P is the number of anagram pairs in the output. Note that if you only need to count the anagram pairs rather than list them, the time complexity can be reduced further to O((N + M) × L × log L).
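For instance, a count-only variation (a hypothetical extension, not part of the original task) could skip the pair-building loops entirely:

```python
# Count anagram pairs without materializing them
pair_count = sum(len(mapping_1[t]) * len(mapping_2[t]) for t in common_tuples)
```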
In today's lesson, we explored two common interview problems using Python's set data structure. We implemented efficient solutions by exploiting the properties of sets, primarily their ability to manage unique elements and perform operations like intersection. These techniques not only simplified seemingly complex problems but also significantly improved our code's runtime.
Now that we've dipped our toes in the waters of Python sets, it's time to dive deeper! Upcoming practice exercises will provide the opportunity to apply what you've learned today, cementing these skills and concepts. So, stay tuned and get ready to roll up your sleeves for some hands-on practice!