Welcome back! This lesson focuses on a powerful SQL technique — joins — specifically using SQL queries within PySpark. While you have previously learned to perform joins using DataFrame operations, SQL queries provide a more intuitive syntax for combining data from different sources.
By using SQL queries for joins, you can seamlessly integrate multiple datasets based on a common key. This lesson will guide you through this process with an emphasis on ease and clarity.
Before diving into joins, let’s prepare our data for the current lesson:
```python
from pyspark.sql import SparkSession

# Initialize a SparkSession
spark = SparkSession.builder.master("local").appName("SQLJoins").getOrCreate()

# Load the customer dataset
df1 = spark.read.csv("customers.csv", header=True, inferSchema=True)

# Load the customer purchase history dataset
df2 = spark.read.csv("customer_purchase_history.csv", header=True, inferSchema=True)

# Convert DataFrames into temporary views
df1.createOrReplaceTempView("customers")
df2.createOrReplaceTempView("purchase_history")
```
In addition to the existing customer dataset, we've loaded the "customer_purchase_history.csv" dataset, which contains information about the purchases made by customers. With both datasets registered as temporary views, we are ready to perform join operations, aligning and integrating these records based on their shared `Customer Id` attribute.
Executing SQL joins in PySpark involves writing SQL queries that combine data from multiple tables (or, in this case, views). Let's dive into a practical example where we perform an INNER JOIN:
```python
# Define an SQL INNER JOIN query
join_query = """
SELECT c.*, p.PurchaseAmount
FROM customers c
INNER JOIN purchase_history p
ON c.`Customer Id` = p.`Customer Id`
"""

# Execute the join query
joined_df = spark.sql(join_query)

# Display the first five rows of the joined DataFrame
joined_df.show(5)
```
In this example, we use an INNER JOIN to combine records from the `customers` and `purchase_history` views where the `Customer Id` values match. The SQL query selects all columns from the `customers` table and only the `PurchaseAmount` column from the `purchase_history` table, combining them into a single cohesive dataset.
The `show()` method presents the resulting combined data:
```
+-----------+----------+-----------+--------------------+--------------------+--------------------+-----------------+--------------+
|Customer Id|First Name|  Last Name|             Country|               Phone|               Email|Subscription Date|PurchaseAmount|
+-----------+----------+-----------+--------------------+--------------------+--------------------+-----------------+--------------+
|          1|     Jamie|      Olson|               Aruba|001-766-675-7997x...|  vwhite@example.com|       2022-07-28|        501.66|
|          9|   Stephen|     Martin|Cocos (Keeling) I...|   459-625-0426x6992|walkerjill@exampl...|       2023-06-10|        614.48|
|         14|     Kevin|       King|           Gibraltar|    001-298-301-1010| vhuerta@example.org|       2021-04-04|        481.67|
|         18|      Kyle|        Lee|            Guernsey|    970-483-7914x349|laurascott@exampl...|       2021-06-28|        731.81|
|         19|    Alyssa|       Wolf|Libyan Arab Jamah...|    001-378-585-7972| amber34@example.net|       2020-05-09|        512.47|
+-----------+----------+-----------+--------------------+--------------------+--------------------+-----------------+--------------+
```
This output provides a richer dataset for more insightful analysis, combining customer details and their purchase history.
In this lesson, you learned to perform SQL join operations in PySpark to combine and analyze data from multiple sources efficiently. As you proceed to the practice exercises, try exploring other types of joins, such as LEFT JOIN or RIGHT JOIN, using the same datasets. Remember, this lesson builds on your foundational skills and prepares you for even more complex analytical challenges in data processing.
Happy querying, and best of luck in your continued learning and application!