Hello and welcome! Today, we're exploring practical data manipulation techniques in Python. We'll use Python lists to represent our data stream and perform projection, filtering, and aggregation. And here's the star of the show: our operations will be neatly packaged within a Python class! No mess, all clean code.
Data manipulation is akin to being a sculptor, but for data. We chisel and shape our data to get the desired structure. Python lists are perfect for this, and our operations will be conveniently bundled inside a Python class. So, let's get our toolbox ready! Here's a simple Python class, `DataStream`, that will serve as our toolbox:
```python
class DataStream:
    def __init__(self, data):
        self.data = data
```
Our first stop is data projection. Think of it like capturing a photo of our desired features. Suppose we have data about people. If we're only interested in names and ages, we project our data to include just these details. We'll extend our `DataStream` class with a `project_data` method for this:
```python
class DataStream:
    def __init__(self, data):
        self.data = data

    def project_data(self, keys):
        # Keep only the requested keys; a missing key is filled with None
        projected_data = [{key: d.get(key, None) for key in keys} for d in self.data]
        return projected_data

# Let's use it!
ds = DataStream([
    {'name': 'Alice', 'age': 25, 'profession': 'Engineer'},
    {'name': 'Bob', 'age': 30, 'profession': 'Doctor'},
])
projected_ds = ds.project_data(['name', 'age'])
print(projected_ds)  # Outputs: [{'name': 'Alice', 'age': 25}, {'name': 'Bob', 'age': 30}]
```
As you can see, we now have a new list with just the names and ages!
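One detail worth seeing in action: because `project_data` uses `d.get(key, None)`, a record that lacks a requested key still appears in the result, with `None` in that slot. The sketch below illustrates this with a hypothetical record for "Eve" that has no `age` field; the record is an assumption added for illustration only.

```python
class DataStream:
    def __init__(self, data):
        self.data = data

    def project_data(self, keys):
        # Same logic as the lesson's method: missing keys fall back to None
        return [{key: d.get(key, None) for key in keys} for d in self.data]


# 'Eve' is a hypothetical record that is missing the 'age' field
ds = DataStream([
    {'name': 'Alice', 'age': 25, 'profession': 'Engineer'},
    {'name': 'Eve', 'profession': 'Artist'},
])
projected = ds.project_data(['name', 'age'])
print(projected)
# Outputs: [{'name': 'Alice', 'age': 25}, {'name': 'Eve', 'age': None}]
```

Whether `None` placeholders are desirable depends on what happens downstream; numeric aggregations, for example, would need to skip them.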
Next, we have data filtering, which is like cherry-picking our preferred data entries. We'll extend our `DataStream` class with a `filter_data` method that uses a "test" function to filter data:
```python
class DataStream:
    def __init__(self, data):
        self.data = data

    # ... other methods ...

    def filter_data(self, test_func):
        # Keep only the entries for which test_func returns True
        filtered_data = list(filter(test_func, self.data))
        return filtered_data

# Applying it:
ds = DataStream([
    {'name': 'Alice', 'age': 25, 'profession': 'Engineer'},
    {'name': 'Bob', 'age': 30, 'profession': 'Doctor'},
])
age_test = lambda x: x['age'] > 26  # our "test" function
filtered_ds = ds.filter_data(age_test)
print(filtered_ds)  # Outputs: [{'name': 'Bob', 'age': 30, 'profession': 'Doctor'}]
```
With the filter method, our output is a list with only Bob’s data, as he's the only one who passes the 'age over 26' test.
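The "test" function does not have to be a lambda. As a minimal sketch, here is the same `filter_data` method used with a named function. The helper `is_engineer` and the incomplete record for "Eve" are hypothetical, introduced only to show that using `.get` inside the test avoids a `KeyError` when a field is missing.

```python
class DataStream:
    def __init__(self, data):
        self.data = data

    def filter_data(self, test_func):
        # Same logic as the lesson's method
        return list(filter(test_func, self.data))


def is_engineer(person):
    # Hypothetical test function; .get returns None instead of raising
    # a KeyError when a record has no 'profession' field
    return person.get('profession') == 'Engineer'


ds = DataStream([
    {'name': 'Alice', 'age': 25, 'profession': 'Engineer'},
    {'name': 'Bob', 'age': 30, 'profession': 'Doctor'},
    {'name': 'Eve'},  # hypothetical record with missing fields
])
engineers = ds.filter_data(is_engineer)
print(engineers)
# Outputs: [{'name': 'Alice', 'age': 25, 'profession': 'Engineer'}]
```

A named function is easier to reuse and test than an inline lambda, which can matter once the filtering rule grows beyond a single comparison.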
Last is data aggregation, where we condense our data into a summary. We will add an `aggregate_data` method to our `DataStream` class for this:
```python
class DataStream:
    def __init__(self, data):
        self.data = data

    # ... other methods ...

    def aggregate_data(self, key, agg_func):
        # Collect the values stored under `key`, then summarize them
        values = [d.get(key, None) for d in self.data]
        return agg_func(values)

# Let's put it to use
ds = DataStream([
    {'name': 'Alice', 'age': 25, 'profession': 'Engineer'},
    {'name': 'Bob', 'age': 30, 'profession': 'Doctor'},
])
average_age = ds.aggregate_data('age', lambda ages: sum(ages) / len(ages))
print(average_age)  # Outputs: 27.5
```
With this script, we get the average age of Alice and Bob, which is `27.5`.
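Since `aggregate_data` accepts any function that takes a list, Python built-ins such as `max`, `min`, `sum`, and `len` can be passed directly. The sketch below also includes a hypothetical helper, `safe_mean`, which is an assumption added here rather than part of the lesson: it skips `None` values and returns `None` for an empty list, two cases where the plain `sum(ages) / len(ages)` lambda would raise an error.

```python
class DataStream:
    def __init__(self, data):
        self.data = data

    def aggregate_data(self, key, agg_func):
        # Same logic as the lesson's method
        values = [d.get(key, None) for d in self.data]
        return agg_func(values)


def safe_mean(values):
    # Hypothetical helper: ignore missing values and avoid dividing by zero
    numbers = [v for v in values if v is not None]
    return sum(numbers) / len(numbers) if numbers else None


ds = DataStream([
    {'name': 'Alice', 'age': 25, 'profession': 'Engineer'},
    {'name': 'Bob', 'age': 30, 'profession': 'Doctor'},
])
oldest = ds.aggregate_data('age', max)
youngest = ds.aggregate_data('age', min)
total_age = ds.aggregate_data('age', sum)
count = ds.aggregate_data('age', len)
print(oldest, youngest, total_age, count)  # Outputs: 30 25 55 2

print(ds.aggregate_data('age', safe_mean))              # Outputs: 27.5
print(DataStream([]).aggregate_data('age', safe_mean))  # Outputs: None
```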
Now, let's combine projection, filtering, and aggregation to see the collective power of these techniques. We'll modify our `DataStream` class to include all the methods and then use them together in a workflow. The projection and filtering methods will now return a new `DataStream` instance rather than a plain list, so that we can chain method calls:
```python
class DataStream:
    def __init__(self, data):
        self.data = data

    def project_data(self, keys):
        projected_data = [{key: d.get(key, None) for key in keys} for d in self.data]
        return DataStream(projected_data)  # Return a new DataStream object for chaining

    def filter_data(self, test_func):
        filtered_data = list(filter(test_func, self.data))
        return DataStream(filtered_data)  # Return a new DataStream object for chaining

    def aggregate_data(self, key, agg_func):
        values = [d.get(key, None) for d in self.data]
        return agg_func(values)

# Example usage
ds = DataStream([
    {'name': 'Alice', 'age': 25, 'profession': 'Engineer', 'salary': 70000},
    {'name': 'Bob', 'age': 30, 'profession': 'Doctor', 'salary': 120000},
    {'name': 'Carol', 'age': 35, 'profession': 'Artist', 'salary': 50000},
    {'name': 'David', 'age': 40, 'profession': 'Engineer', 'salary': 90000},
])

# Step 1: Project the data to include only 'name', 'age', and 'salary'
projected_ds = ds.project_data(['name', 'age', 'salary'])

# Step 2: Filter the projected data to include only those with age > 30
filtered_ds = projected_ds.filter_data(lambda x: x['age'] > 30)

# Step 3: Aggregate the filtered data to compute the average salary
average_salary = filtered_ds.aggregate_data('salary', lambda salaries: sum(salaries) / len(salaries))
print(average_salary)  # Outputs: 70000.0
```
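Here:
- Projection: we keep only the `name`, `age`, and `salary` fields from our data. The `project_data` method now returns a `DataStream` object, allowing us to chain multiple operations.
- Filtering: we keep only the people older than 30 (Carol and David). The `filter_data` method also returns a `DataStream` object for chaining.
- Aggregation: we compute the average salary of the remaining people, which is `70,000`.

By combining these methods, our data manipulation becomes both powerful and concise. Try experimenting and see what you can create!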
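Because `project_data` and `filter_data` return `DataStream` objects, the three steps can also be written as one chained expression with no intermediate variables. The following is a minimal sketch of that style, using the same class and data as the combined example:

```python
class DataStream:
    def __init__(self, data):
        self.data = data

    def project_data(self, keys):
        return DataStream([{key: d.get(key, None) for key in keys} for d in self.data])

    def filter_data(self, test_func):
        return DataStream(list(filter(test_func, self.data)))

    def aggregate_data(self, key, agg_func):
        return agg_func([d.get(key, None) for d in self.data])


ds = DataStream([
    {'name': 'Alice', 'age': 25, 'profession': 'Engineer', 'salary': 70000},
    {'name': 'Bob', 'age': 30, 'profession': 'Doctor', 'salary': 120000},
    {'name': 'Carol', 'age': 35, 'profession': 'Artist', 'salary': 50000},
    {'name': 'David', 'age': 40, 'profession': 'Engineer', 'salary': 90000},
])

# Project, then filter, then aggregate, all in a single chain
average_salary = (
    ds.project_data(['name', 'age', 'salary'])
      .filter_data(lambda x: x['age'] > 30)
      .aggregate_data('salary', lambda salaries: sum(salaries) / len(salaries))
)
print(average_salary)  # Outputs: 70000.0
```

Note that the order of the steps matters: the filter reads `x['age']`, so `age` has to be among the projected keys, otherwise the lambda would raise a `KeyError`.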
Brilliant job! You've now grasped the basics of data projection, filtering, and aggregation on Python lists. Plus, you've learned to package these operations in a Python class — a neat bundle of reusable code magic!
Now, why not try applying these fresh skills with some practice exercises? They're just around the corner. Ready? Let's dive into more fun with data manipulation!