Pandas - Remove Duplicates from a DataFrame

Introduction

In this tutorial, we want to drop duplicates from a Pandas DataFrame. In order to do this, we use the the drop_duplicates() method of Pandas.

Import Libraries

First, we import the following python modules:

import numpy as np
import pandas as pd

Create Pandas DataFrame

Next, we create a Pandas DataFrame with some example data from a dictionary:

mydict = {
    "column1": [3, 7, 7, 8,  np.nan],
    "column2": [11.3, 12.5, 12.5, 0.77, 9.4],
    "column3": ["AI", "Python", "Python", np.nan, "AI"],
}
df = pd.DataFrame(mydict)
df

Remove duplicate Rows

Keep First Occurences

Now, we would like to remove duplicate rows from the DataFrame based on all columns. The first occurrences should be kept.

To do this, we use the drop_duplicates() method of Pandas and set the parameter "keep" to "first":

df_cleaned = df.drop_duplicates(keep='first')
df_cleaned

Keep Last Occurences

If we want to keep the last occurrences, we have to set the parameter "keep" to "last":

df_cleaned = df.drop_duplicates(keep='last')
df_cleaned

Remove duplicate Rows based on a certain Column

Next, we would like to remove duplicate rows from the DataFrame based on the column "language". The first occurrences should be kept.

To do this, we use the drop_duplicates() method of Pandas with the parameters "keep" and "subset":

df_cleaned = df.drop_duplicates(keep='first', subset=['column3'])
df_cleaned

Conclusion

Congratulations! Now you are one step closer to become an AI Expert. You have seen that it is very easy to drop duplicates from a Pandas DataFrame. We can simply use the drop_duplicates() method of Pandas. Try it yourself!

Instagram

Also check out our Instagram page. We appreciate your like or comment. Feel free to share this post with your friends.

Sieh dir diesen Beitrag auf Instagram an

Ein Beitrag geteilt von Deep Learning Nerds | AI, Data Science & Machine Learning (@deeplearningnerds)

Pandas - Remove Duplicates from a DataFrame

Data Scientist

Power BI - Import Data from JSON file

How to visualize Query Results using the Data Explorer in Microsoft Fabric

Power BI - How to use Conditional Formatting with Icons in a Table

Introduction

Import Libraries

Create Pandas DataFrame

Remove duplicate Rows

Keep First Occurences

Keep Last Occurences

Remove duplicate Rows based on a certain Column

Conclusion

Instagram

How to randomly sample a Subset of a PySpark DataFrame

PySpark - Create Embedding Vectors with Sentence-Transformers

PySpark - Concatenate String Columns of a DataFrame

PySpark - Remove Whitespaces from a String Column of a DataFrame

How to read Excel File into PySpark DataFrame in Databricks