Mastering Algorithmic Problems with Sets in Python

Mastering Algorithms and Data Structures in Python

Hashing, Dictionaries, and Sets in PythonLesson 4

Lesson 4

Introduction

In today's lesson, we'll dive into the world of Python sets. We'll focus on leveraging this powerful data structure to solve two common interview problems. Sets' ability to handle unique elements can simplify and optimize our problem-solving process, making it more efficient. So, if you're ready to discover the power and nuances of using sets to solve problems, let's get started!

Problem 1: Unique String in the List

Our first problem revolves around identifying the first unique string from a list. Imagine you're working on a text analyzing tool that needs to identify the first unique word in a piece of text. This problem simulates such a real-world scenario.

Naive Approach

At first glance, a naive approach would be to iterate over the list of words and, for each word, scan the entire list again to count its occurrences. This method of double-pass scanning the list results in an unsightly time complexity of $O(N^2)$ as each word incurs another full traverse. As the list grows in size, the time taken by this approach grows exponentially, making it impractical for larger datasets.

Efficient Approach Explanation

Let's introduce Python sets to the stage! The defining property of a set is that it contains unique elements, making it a perfect fit for our current predicament.

Our strategy consists of two parts, each tailored to leverage the capabilities of sets:

We scan through the words, keeping track of the previously encountered words in a set called seen. A crucial aspect of our solution comes from an inherent feature of sets: if a word is already in seen, adding it again does not change the set. Keeping this in mind, we create a second set, duplicates, consisting only of words that we have previously seen.
Once we know which words are duplicates, it becomes elementary to find the first word in our original list that isn't a duplicate. We make another pass over the list, checking each word to see if it's in the duplicates set. The first word we find that isn't a duplicate is our answer.

By focusing our solution around sets, we've reduced the problem to two single-pass traversals, giving our solution a linear time complexity of O(2N), a significant improvement over the naive approach.

Solution Building

Let's break down our solution:

1. In the initial iteration over words, if a word already exists in the seen set, we identify it as a duplicate and add it to duplicates. If not, the word is added to seen.

Python
1def find_unique_string(words):
2    seen = set()
3    duplicates = set()
4    
5    for word in words:
6        if word in seen:
7            duplicates.add(word)
8        seen.add(word)

This code creates two sets. seen contains all words we've come across, and duplicates contains words that have appeared more than once. To visualize it, consider words as 'apple', 'banana', 'apple'. After the above block of code, we'd have seen as 'apple', 'banana', and duplicates as 'apple'.

2. In the next phase of our solution, we iterate over words again, checking if a word is in duplicates. The first word that is not in duplicates is our answer as it's the first unique string in the list. If we don't find any unique string, we return an empty string.

Python
1for word in words:
2    if word not in duplicates:
3        return word
4
5return ""

Thanks to two simple traversals, we've efficiently solved a problem that initially seemed complex.

Problem 2: Anagram Pairs in Two Lists

The task at hand for this problem is to identify all pairs of anagrams where each pair is made up of one word from the first list and one from the second list. For those unfamiliar, an anagram is a word or phrase that is created by rearranging the letters of another word or phrase, using all the original letters exactly once.

Problem 2: Problem Actualization

Consider a cryptology scenario. You've intercepted two separate messages, each consisting of a list of coded words. You suspect that there might be some connection between the two messages - specifically, that some words from one list are anagrams of words in the other list. Your goal is to find these pairs of anagram words to decipher the code.

Problem 2: Naive Approach

The most straightforward approach to this problem would involve checking each word from the first list against each word from the second list to find anagrams. While this would provide correct results, it's an inefficient method with a time complexity of $O(n \cdot m \cdot w)$ , where $n$ is the size of the first list of words, $m$ is the size of the second list of words, and $w$ is the average word length. As you can see, it gets impractically slow for larger inputs.

Problem 2: Efficient Approach Explanation

We can achieve a more efficient solution by representing each word from both lists as a sorted tuple of characters. This gives us a unified form for each set of anagram words, making them easy to compare. If the sorted tuples of characters for two words are the same, then those words are anagrams. Once we have these sorted tuples, we can use Python's set methods to find pairs of words that are anagrams of each other.

Problem 2: Solution Building

Here's how we fulfill the task:

We first convert every word from both lists to a sorted tuple of its characters to have a unified form for all anagram words.

Python
1sorted_tuples_1 = set(tuple(sorted(word)) for word in list_1)
2sorted_tuples_2 = set(tuple(sorted(word)) for word in list_2)

Now, those sets themselves have unique character tuples. We find the common tuples between the two, which represent the anagram words.

Python
1common_tuples = sorted_tuples_1 & sorted_tuples_2

For the final output, we iterate over the words in the original lists again, and for each word, if its sorted character tuple is present in common_tuples set, we add it to the respective output list.

Python
1list_1_output = [word for word in list_1 if tuple(sorted(word)) in common_tuples] # contains anagrams from the first list
2list_2_output = [word for word in list_2 if tuple(sorted(word)) in common_tuples] # contains anagrams from the second list

Finally, we return a list of tuples, where each tuple is an anagram pair from list_1_output and list_2_output.

Python
1output = []
2for word1 in list_1_output:
3    for word2 in list_2_output:
4        # traversing every pair of words in filtered lists
5        if tuple(sorted(word1)) == tuple(sorted(word2)):
6            # If words in the pair are anagrams, add them to the output list
7            output.append((word1, word2))
8return output

The average time complexity of this method is $O((n + m) \log (n + m) + P)$ , where $n$ and $m$ are sizes of $list_1$ and $list_2$ , respectively, and $P$ is the number of word pairs in list_1_output and list_2_output lists, which makes this solution much quicker and more scalable than the naive approach. Note that as not every pair in list_1_output and list_2_output is actually an anagram pair, the time complexity of this solution can potentially degrade to quadratic complexity in rare cases while still having not too many anagram pairs in the output.

Problem 2: Solution with Dictionaries (Optional)

While this course didn't cover Python dictionaries yet, they are essentially a very powerful tool that can make the solution here even more effective. We are going to dig into Python dictionaries later in this course, but take a moment to go through and try to understand the solution for this problem if dictionaries would be possible to use. This is totally optional; no worries if you don't understand the solution yet; we will learn dictionaries later in this course.

Python
1from collections import defaultdict
2
3# Create mapping for `list_1`
4mapping_1 = defaultdict(list)
5# mapping_1 stores (sorted anagram) -> list[anagrams] mapping for `list_1`
6for word in list_1:
7    sorted_tuple = tuple(sorted(word)) # unique identifier of the anagram
8    mapping_1[sorted_tuple].append(word)
9    # `mapping_1[sorted_tuple]` stores all anagrams under the same identifier for `list_1`
10
11# Create mapping for `list_2`
12mapping_2 = defaultdict(list)
13# mapping_2 stores (sorted anagram) -> list[anagrams] mapping for `list_2`
14for word in list_2:
15    sorted_tuple = tuple(sorted(word)) # unique identifier of the anagram
16    mapping_2[sorted_tuple].append(word)
17    # `mapping_2[sorted_tuple]` stores all anagrams under the same identifier for `list_2`
18
19# Intersect keys from mapping_1 and mapping_2 to get common sorted tuples
20# Every element in `common_tuples` is an anagram identifier that exists in both lists
21common_tuples = set(mapping_1.keys()) & set(mapping_2.keys())
22
23output = []
24for anagram_tuple in common_tuples:
25    for word1 in mapping_1[anagram_tuple]:
26        for word2 in mapping_2[anagram_tuple]:
27            # Both word1 and word2 have the same anagram identifier, so are anagrams
28            output.append((word1, word2))
29
30return output

This solution will only traverse pairs that are actually anagrams, so the complexity will be $O((n + m) \log (n + m) + P)$ , where $n$ and $m$ are sizes of $list_1$ and $list_2$ , respectively, and $P$ is the number of anagram pairs in the output. Note that if you only need to calculate the number of anagram pairs, the time complexity can be decreased up to $O((n + m) \log (n + m))$ .

Lesson Summary

In today's lesson, we explored two common interview problems using Python's set data structure. We implemented efficient solutions by exploiting the properties of sets— primarily their ability to manage unique elements and perform operations like intersection. These techniques not only simplified seemingly complex problems but also significantly improved our code's runtime.

Practice Exercises

Now that we've dipped our toes in the waters of Python sets, it's time to dive deeper! Upcoming practice exercises will provide the opportunity to apply what you've learned today, cementing these skills and concepts. So, stay tuned and get ready to roll up your sleeves for some hands-on practice!

Enjoy this lesson? Now it's time to practice with Cosmo!

Practice is how you turn knowledge into actual skills.