Before data can be analysed, it must first be structured in a form that we can manipulate and perform operations on. In Python, a common way to do this is with the pandas module.
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
The following pages will equip you with a basic understanding of pandas and its applications in data analysis, along with other modules that are often used in conjunction with pandas to achieve these goals.
Note: We will be using Python 3 syntax.
The first and foremost function of pandas is to assist with data structures. We will also be using the following modules to assist with the data visualization process; do import them when trying out the other activities.
%matplotlib inline
import numpy as np
import pandas as pd
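As a quick illustration of the kind of data structure pandas provides, the sketch below builds a small DataFrame by hand and adds a computed column. The column names and values are made up for this example and are not taken from the datasets used in this guide.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration only
sample = pd.DataFrame({
    "item": ["pen", "notebook", "stapler"],
    "quantity": [10, 4, 2],
    "unit_price": [1.5, 3.0, 8.25],
})

# Column-wise arithmetic: multiply two columns to create a third
sample["total"] = sample["quantity"] * sample["unit_price"]

print(sample)
print(sample["total"].sum())
```

Each column behaves like a labelled array, which is why an operation such as the multiplication above applies to every row at once.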
Selecting a Data Source
To better understand how pandas works with datasets, we've prepared large datasets for you to manipulate and test.
Select either the Retail or the BNF Industries dataset:
+ Retail Industry dataset : Retail dataset.csv
You may have noticed that the two files are in different formats (i.e. csv and xlsx). That's the beauty of pandas: it can read both CSV files and Excel workbooks!
To read the dataset with pandas, use the syntax below and assign the result to a variable.
pd.read_csv("FILE_NAME_HERE.FORMAT")
Go ahead and try assigning it to the variable df below!
# Read the data file of choice
df = pd.read_csv("Retail dataset.csv")
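Once a file has been read, it is worth checking that it loaded as expected. The following is a minimal sketch of that check. Since it cannot assume the real dataset is on disk, it reads a few made-up rows from an in-memory string; `csv_text`, `df_demo`, and the column names are hypothetical stand-ins.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = "store,product,units\nA,pen,10\nA,notebook,4\nB,pen,7\n"

# read_csv accepts a file path or any file-like object
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo.head())   # the first rows (up to five by default)
print(df_demo.shape)    # (number of rows, number of columns)
print(df_demo.dtypes)   # the data type pandas inferred for each column
```

The same three calls can be applied to the `df` variable assigned above.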
In this case we are reading a CSV file. When dealing with Excel (xlsx) files instead, simply replace pd.read_csv with pd.read_excel.
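One way to handle both formats in the same notebook is to pick the reader from the file extension. The sketch below is only an illustration: `read_table` is a hypothetical helper, not part of pandas, and the demonstration writes a tiny temporary CSV file with made-up values. Note that `pd.read_excel` depends on an optional engine (for example openpyxl for .xlsx files) being installed.

```python
from pathlib import Path
import tempfile

import pandas as pd


def read_table(path):
    """Hypothetical helper: choose a pandas reader from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix == ".xlsx":
        # Requires an optional Excel engine such as openpyxl
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file format: {suffix}")


# Demonstration with a temporary CSV file containing made-up values
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "example.csv"
    csv_path.write_text("a,b\n1,2\n3,4\n")
    table = read_table(csv_path)

print(table)
```

Raising an error for unknown extensions, rather than guessing a format, keeps a mistyped file name from being silently misread.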
You are expected to comply with University policies and guidelines namely, Appropriate Use of Information Resources Policy, IT Usage Policy and Social Media Policy. Users will be personally liable for any infringement of Copyright and Licensing laws. Unless otherwise stated, all guide content is licensed by CC BY-NC 4.0.