Introduction

In this tutorial, we want to split data into train and test sets. In order to do this, we use the train_test_split() function of Scikit-Learn.

Import Libraries

First, we import the following python modules:

import pandas as pd
from sklearn.model_selection import train_test_split

Load Data

Now, we load our dataset from a csv file. In order to do this, we use the read_csv() function of Pandas.

The dataset consists of 10 examples with the feature "height" (in cm) and the target variable "weight" (in kg). We save the data in a DataFrame.

dataset = pd.read_csv("data.csv", sep=";")
dataset

Split Dataset

Now, we would like to split the dataset into a train set and test set. The data should be divided as follows:

  • Training: 80%
  • Testing: 20%

Besides, we want to shuffle our dataset before applying the split. To do this, we use the function train_test_split() of Scikit-Learn with the following arguments:

X = dataset["height"]
y = dataset["weight"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size = 0.8, shuffle=True)

Train and Test Sets

Let's have a look to our two datasets.

Train Set

The train set consists of 8 examples and looks as follows:

print(X_train)
print(y_train)

Test Set

The test set consists of 2 examples and looks as follows:

print(X_test)
print(y_test)

Conclusion

Congratulations! Now you are one step closer to become an AI Expert. You have seen that it is very easy to split data into train and test sets. We can simply use the train_test_split() function of Scikit-Learn. Try it yourself!

Instagram

Also check out our Instagram page. We appreciate your like or comment. Feel free to share this post with your friends.