Welcome back! This lesson focuses on a powerful SQL technique — joins — specifically using SQL queries within PySpark. While you have previously learned to perform joins using DataFrame operations, SQL queries provide a more intuitive syntax for combining data from different sources.
By using SQL queries for joins, you can seamlessly integrate multiple datasets based on a common key. This lesson will guide you through this process with an emphasis on ease and clarity.
Before diving into joins, let’s prepare our data for the current lesson:
```python
from pyspark.sql import SparkSession

# Initialize a SparkSession
spark = SparkSession.builder.master("local").appName("SQLJoins").getOrCreate()

# Load the customer dataset
df1 = spark.read.csv("customers.csv", header=True, inferSchema=True)

# Load the customer purchase history dataset
df2 = spark.read.csv("customer_purchase_history.csv", header=True, inferSchema=True)

# Convert DataFrames into temporary views
df1.createOrReplaceTempView("customers")
df2.createOrReplaceTempView("purchase_history")
```
In addition to the existing customer dataset, we've loaded the "customer_purchase_history.csv" dataset, which contains information about the purchases made by customers. With both datasets registered as temporary views, we are ready to perform join operations, aligning and integrating these records based on their shared `Customer Id` attribute.
Executing SQL joins in PySpark involves writing SQL queries that combine data from multiple tables (or, in this case, views). Let's dive into a practical example where we perform an INNER JOIN:
```python
# Define an SQL INNER JOIN query
join_query = """
SELECT c.*, p.PurchaseAmount
FROM customers c
INNER JOIN purchase_history p
ON c.`Customer Id` = p.`Customer Id`
"""

# Execute the join query
joined_df = spark.sql(join_query)

# Display the first five rows of the joined DataFrame
joined_df.show(5)
```
In this example, we use an INNER JOIN to combine records from the `customers` and `purchase_history` views where the `Customer Id` values match. The SQL query selects all columns from the `customers` table and only the `PurchaseAmount` column from the `purchase_history` table, combining them into a single cohesive dataset.
The `show()` method presents the resulting combined data:
```
+-----------+----------+-----------+--------------------+--------------------+--------------------+-----------------+--------------+
|Customer Id|First Name|  Last Name|             Country|               Phone|               Email|Subscription Date|PurchaseAmount|
+-----------+----------+-----------+--------------------+--------------------+--------------------+-----------------+--------------+
|          1|     Jamie|      Olson|               Aruba|001-766-675-7997x...|  vwhite@example.com|       2022-07-28|        501.66|
|          9|   Stephen|     Martin|Cocos (Keeling) I...|   459-625-0426x6992|walkerjill@exampl...|       2023-06-10|        614.48|
|         14|     Kevin|       King|           Gibraltar|    001-298-301-1010| vhuerta@example.org|       2021-04-04|        481.67|
|         18|      Kyle|        Lee|            Guernsey|    970-483-7914x349|laurascott@exampl...|       2021-06-28|        731.81|
|         19|    Alyssa|       Wolf|Libyan Arab Jamah...|    001-378-585-7972| amber34@example.net|       2020-05-09|        512.47|
+-----------+----------+-----------+--------------------+--------------------+--------------------+-----------------+--------------+
```
This output provides a richer dataset for more insightful analysis, combining customer details and their purchase history.
In this lesson, you learned to perform SQL join operations in PySpark to combine and analyze data from multiple sources efficiently. As you proceed to the practice exercises, try exploring other types of joins, such as LEFT JOIN or RIGHT JOIN, using the same datasets. Remember, this lesson builds on your foundational skills and prepares you for even more complex analytical challenges in data processing.
Happy querying, and best of luck in your continued learning and application!