Introduction

In this tutorial, we sort a PySpark DataFrame by specific columns. To do this, we use the orderBy() method of the PySpark DataFrame.

Import Libraries

First, we import SparkSession from the pyspark.sql module:

from pyspark.sql import SparkSession

Create SparkSession

Before we can work with PySpark, we need to create a SparkSession. A SparkSession is the entry point to all Spark functionality.

In order to create a basic SparkSession programmatically, we use the following command:

spark = SparkSession \
    .builder \
    .appName("Python PySpark Example") \
    .getOrCreate()

Create PySpark DataFrame

Next, we create a PySpark DataFrame with some example data from a list. To do this, we use the method createDataFrame() and pass the data and the column names as arguments.

column_names = ["language", "framework", "users"]
data = [
    ("Python", "FastAPI", 9000),
    ("JavaScript", "ReactJS", 7000),
    ("Python", "Django", 20000),
    ("Java", "Spring", 12000),
]
df = spark.createDataFrame(data, column_names)
df.show()

Sorting by a single Column

Ascending Order

We would like to sort the DataFrame by the column "users" in ascending order.

To do this, we use the orderBy() method of the DataFrame. We pass the name of the column to sort by as an argument and set the parameter "ascending" to True, which is also the default:

df_sorted = df.orderBy('users', ascending=True)
df_sorted.show()

Descending Order

Now, we would like to sort the DataFrame by the column "users" in descending order.

In this case, we set the parameter "ascending" to False:

df_sorted = df.orderBy('users', ascending=False)
df_sorted.show()

Sorting by multiple Columns

Now, we would like to sort the DataFrame by the column "language" in ascending order and the column "users" in descending order.

To do this, we again use the orderBy() method. We pass a list with the names of the columns to sort by and a list of boolean values giving the sort direction for each column, in the same order:

df_sorted = df.orderBy(['language', 'users'], ascending=[True, False])
df_sorted.show()

Conclusion

Congratulations! Now you are one step closer to becoming an AI Expert. You have seen that it is easy to sort a PySpark DataFrame by one or more columns using the orderBy() method. Try it yourself!
