LibGuides: Python for Basic Data Analysis: PD.6 Handling Missing Data

Handling Missing Data

If a data set has missing entries, pandas shows them as NaN, which stands for "Not a Number". NaN is itself a floating-point value, so a numeric column that contains NaN is stored with the float64 dtype, even if its other values are whole numbers.
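As a minimal sketch of the dtype behaviour described above, the snippet below compares two small Series. The values are hypothetical and are not taken from the retail dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, used only to illustrate the dtype change.
complete = pd.Series([10, 20, 30])        # no missing entries
with_gap = pd.Series([10, np.nan, 30])    # one missing entry

print(complete.dtype)    # an integer dtype
print(with_gap.dtype)    # float64, because NaN is a float
print(type(np.nan))      # <class 'float'>
```

The whole-number values in the second Series are converted to floats so that they can sit alongside NaN.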

Pandas lets us locate missing data. pd.isnull() returns True for each NaN entry and pd.notnull() returns True for each entry that is not NaN, so either result can be used to select the matching rows.
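A short sketch of how these two functions might be used to filter rows. The DataFrame and its column names (item, price) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row has no price.
df = pd.DataFrame({
    "item": ["pen", "ink", "pad"],
    "price": [1.5, np.nan, 3.0],
})

missing_mask = pd.isnull(df["price"])    # True where price is NaN
present_mask = pd.notnull(df["price"])   # True where price has a value

rows_missing = df[missing_mask]   # only the rows with a NaN price
rows_present = df[present_mask]   # only the rows with a real price

print(rows_missing)
print(rows_present)
```

Each mask is a Series of True/False values, and passing it inside df[...] keeps only the rows marked True.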

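We can also replace these null values with a value of our choice using fillna(). It returns a new object rather than changing the original, so assign the result back if you want to keep the change.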
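The sketch below shows fillna() on a small, hypothetical Series and illustrates that the original is left unchanged unless the result is assigned back.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry.
prices = pd.Series([1.5, np.nan, 3.0])

filled = prices.fillna(0)   # new Series with NaN replaced by 0

print(filled)
print(prices)               # the original still contains NaN

# To keep the change, assign the result back to the same name.
prices = prices.fillna(0)
```

Zero is only one possible fill value. Which value makes sense depends on what the column represents.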

If other values stand in for missing data but are not stored as NaN (for example, a placeholder string), we can swap them out with replace().
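As an illustration, the sketch below assumes a column in which the text "MISSING" marks an absent value. The Series and its contents are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical column where "MISSING" is a placeholder, not a real NaN.
status = pd.Series(["yes", "MISSING", "no"])

print(pd.isnull(status))    # all False: the placeholder is not seen as missing

relabelled = status.replace("MISSING", "unknown")   # swap in another value
as_nan = status.replace("MISSING", np.nan)          # or turn it into a real NaN

print(relabelled)
print(pd.isnull(as_nan))    # the second entry is now detected as missing
```

Converting placeholders to NaN first can be convenient, since isnull(), notnull() and fillna() then treat them like any other missing entry.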

Video Guides

Removing Missing Data

Activity: Missing Data

We can also acquire specific statistical information using common pandas syntax, and retrieve information with slicing methods similar to those of a list. Try out these examples and take a look at the output.

1. Output the rows which have NaN entries in the net_sales column

2. Output the rows which do not have NaN entries in the order_fufilled column

3. Replace the NaN entries in net_sales with 0 using fillna()

4. Replace the entries which are 'MISSING' in order_fufilled with False using replace()

Retail Dataset

Answers for Activity: Missing Data

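import pandas as pd

# Load the retail dataset (the CSV file must be in the working directory)
df = pd.read_csv("Retail dataset.csv")

# 1. Output rows which have NaN entries in net_sales
print(df[pd.isnull(df['net_sales'])])

# 2. Output rows which do not have NaN entries in order_fufilled
print(df[pd.notnull(df['order_fufilled'])])

# 3. Replace the NaN entries in net_sales with 0 using fillna()
# fillna() returns a new Series, so assign it back to keep the change
df['net_sales'] = df['net_sales'].fillna(0)

# 4. Replace 'MISSING' entries in order_fufilled with False using replace()
# replace() also returns a new Series, so assign it back
df['order_fufilled'] = df['order_fufilled'].replace("MISSING", False)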
