Data Preprocessing with Python for Beginners

The following steps are involved in data preprocessing, starting with the library imports.

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

a) data.ndim (number of dimensions)

b) data.shape (number of rows and columns)

c) data.columns (column names)

Telecom dataset columns

d) data.index (row labels)
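As a minimal sketch of what these four attributes return, the example below builds a tiny stand-in DataFrame. The three rows and their values are invented for illustration and are not taken from the telecom dataset.

```python
import pandas as pd

# Hypothetical stand-in for the telecom data (values are made up).
data = pd.DataFrame({
    "Daily Charges MV": [45.07, 27.47, 41.38],
    "Intl Plan": ["no", "no", "yes"],
    "Churn": [0, 0, 1],
})

print(data.ndim)           # number of axes; a DataFrame always has 2
print(data.shape)          # (number of rows, number of columns)
print(list(data.columns))  # column labels
print(data.index)          # row labels
```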

To count the non-NA cells in each column, use the count() method; pass axis=1 to count per row instead.

data.count()

The output above shows that every variable contains 3333 non-NA values except Daily Charges MV.

Daily Charges MV contains 3283 non-NA values.
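A small sketch of how count() treats missing cells, using an invented four-row frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing cells in one column.
data = pd.DataFrame({
    "Daily Charges MV": [45.07, np.nan, 41.38, np.nan],
    "Churn": [0, 0, 1, 0],
})

per_column = data.count()       # non-NA cells in each column
per_row = data.count(axis=1)    # non-NA cells in each row

print(per_column)
print(per_row)
```

Subtracting the count from the number of rows gives the number of missing cells in a column.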

The dataset contains 22 variables. The next step is to identify which are numerical and which are categorical.

data.info()

This dataset contains 16 continuous variables and 6 categorical variables.

To change the data type of a variable, use the astype() function.

After applying astype(), check the data types in the dataset again.

The output above shows that Churn, Intl Plan, VMail Plan, and Area code now have the object data type. Hence the dataset has 16 continuous variables and 6 categorical variables.
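The conversion itself might look like the sketch below. It assumes the columns to convert are stored as integers, which is a guess about the raw data, and it uses a small invented frame.

```python
import pandas as pd

# Hypothetical frame: Area code and Churn are numeric codes, not quantities.
data = pd.DataFrame({
    "Area code": [415, 408, 510],
    "Churn": [0, 0, 1],
    "Daily Charges MV": [45.07, 27.47, 41.38],
})

# Passing a dict to astype() converts only the named columns.
data = data.astype({"Area code": "object", "Churn": "object"})

print(data.dtypes)
```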

During preprocessing, check the dataset for duplicated rows and remove any that are found.

Use the duplicated() function to flag the duplicated rows in the dataset.

To count all the duplicated rows in the dataset, use the following command.

data.duplicated().sum()

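The count reports 3332 duplicated rows. Here I dropped them with dropna(how="all"); note that this removes only rows in which every value is missing, so it clears duplicates only when the duplicated rows are completely empty. For duplicated rows in general, use drop_duplicates().

Now check whether the dataset still contains duplicated rows.

Run duplicated().sum() on the dataset again. It returns 0, so the dataset has no duplicated rows.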
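The difference between dropna(how="all") and drop_duplicates() can be seen on a small invented frame that contains one repeated data row and two completely empty rows. This is only a sketch of the pandas behaviour, not a reconstruction of the telecom file.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 2 repeats row 0, rows 3 and 4 are entirely empty.
data = pd.DataFrame({
    "Area code": [415, 408, 415, np.nan, np.nan],
    "Churn": [0, 1, 0, np.nan, np.nan],
})

n_dupes = data.duplicated().sum()        # repeated row + second empty row

after_dropna = data.dropna(how="all")    # removes only the all-NA rows
deduped = data.drop_duplicates()         # keeps the first copy of each row

print(n_dupes)
print(after_dropna.duplicated().sum())   # the repeated data row is still there
print(deduped.duplicated().sum())
```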

To check the dataset for redundancy, look for homonyms and synonyms.

Homonyms: two columns share the same name. If two column names are the same, rename one of them.

Synonyms: two columns hold the same information. If two columns contain the same information, delete one of them.

This dataset contains no homonyms and no synonyms, so there is no redundancy in it.
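One possible way to run both checks programmatically is sketched below. The column names are hypothetical, and treating "same information" as "identical values in every row" is a simplifying assumption; columns that carry the same information in different units would not be caught.

```python
import pandas as pd

# Hypothetical frame: "Churn" appears twice as a name (homonym) and
# "Calls A" / "Calls B" hold identical values (synonym).
data = pd.DataFrame(
    [[10, 10, 1, 0], [20, 20, 0, 0], [30, 30, 1, 1]],
    columns=["Calls A", "Calls B", "Churn", "Churn"],
)

# Homonyms: column names that occur more than once.
homonyms = data.columns[data.columns.duplicated()].tolist()

# Synonyms: columns whose values repeat an earlier column exactly.
# Transposing turns columns into rows so duplicated() can compare them.
synonyms = data.columns[data.T.duplicated().to_numpy()].tolist()

print(homonyms)
print(synonyms)
```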

To find the missing values in the dataset, use isnull() or isna(). Combined with sum(), they show how many missing values each variable contains.

data.isnull().sum()

None of the variables has null values except Daily Charges MV, which has 50 NA values.

One technique for imputing null values is to replace them with the mean, median, or mode of the variable.

To impute the null values in Daily Charges MV with the median, use the fillna() function with the column median.

data["Daily Charges MV"] = data["Daily Charges MV"].fillna(data["Daily Charges MV"].median())

To confirm, run isnull() or isna() on Daily Charges MV again; it now shows no null values.

Then check the whole dataset for null values once more.

The output shows that there are no missing or null values left in the dataset.
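The whole imputation step, from counting the missing values to confirming that none remain, can be sketched on a small invented column:

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values.
data = pd.DataFrame({"Daily Charges MV": [10.0, np.nan, 30.0, 50.0, np.nan]})

missing_before = data["Daily Charges MV"].isnull().sum()

# median() skips NaN by default, so it is computed from 10, 30 and 50.
median_value = data["Daily Charges MV"].median()

# Assign the result back instead of using inplace=True on a column selection.
data["Daily Charges MV"] = data["Daily Charges MV"].fillna(median_value)

missing_after = data.isnull().sum().sum()

print(missing_before, median_value, missing_after)
```

The median is often preferred over the mean here because it is less affected by extreme values.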

Separate the dataset into continuous variables and categorical variables.

To list the names of the 16 continuous variables, use the following command.
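One way to do the split is with select_dtypes(), shown here as a sketch. It assumes the categorical variables have already been converted to a non-numeric type such as object; the frame is invented.

```python
import pandas as pd

# Hypothetical frame: Churn is stored as object, so it counts as categorical.
data = pd.DataFrame({
    "Daily Charges MV": [45.07, 27.47, 41.38],
    "Intl Plan": ["no", "no", "yes"],
    "Churn": pd.Series([0, 0, 1], dtype="object"),
})

continuous = data.select_dtypes(include="number")
categorical = data.select_dtypes(exclude="number")

print(list(continuous.columns))
print(list(categorical.columns))
```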

To visualize all continuous variables as box plots and check for outliers, use the following approach.

The box plots show the outliers in each continuous variable.
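A box plot drawn with default settings marks as outliers the points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies the same rule numerically, so the outliers can be counted per column without reading them off a figure. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical continuous variables; 100 is an obvious extreme value.
data = pd.DataFrame({
    "Calls A": [1, 2, 3, 4, 100],
    "Calls B": [10, 11, 12, 13, 14],
})

q1 = data.quantile(0.25)
q3 = data.quantile(0.75)
iqr = q3 - q1

# True where a value falls outside the whisker range of a default box plot.
outside = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
outlier_counts = outside.sum()

print(outlier_counts)
```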

Several techniques exist for treating outliers. One of the most commonly used is capping and flooring, which is what I used here.
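Capping and flooring replaces values above an upper limit with that limit and values below a lower limit with that limit. The sketch below uses the 1.5 × IQR bounds as the limits; this choice is an assumption, since the limits used in this walkthrough are not stated, and percentile limits (for example the 1st and 99th) are a common alternative.

```python
import pandas as pd

# Hypothetical variable with one low and one high extreme value.
values = pd.Series([-50.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # floor
upper = q3 + 1.5 * iqr   # cap

# clip() raises values below `lower` and lowers values above `upper`.
capped = values.clip(lower=lower, upper=upper)

print(lower, upper)
print(capped.tolist())
```

Unlike dropping the rows, this keeps every observation and only limits how extreme a value can be.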
