Skip to Main Content
It looks like you're using Internet Explorer 11 or older. This website works best with modern browsers such as the latest versions of Chrome, Firefox, Safari, and Edge. If you continue with this browser, you may see unexpected results.

Python for Basic Data Analysis: PD.9 Combining data

Get started on your learning journey towards data science using Python. Equip yourself with practical skills in Python programming for the purpose of basic data manipulation and analysis.

Combining data

There may be situations where we must combine different datasets to perform other analysis, Pandas allows us to do this in three ways: concat(), join(), and merge() (In ascending order of complexity).

The most common and simplest method is concat([DATAFRAME1,DATAFRAME2]). This function puts together a list of elements along an axis, especially useful when the columns in the dataframe are the same.

join() lets you combine different DataFrame objects which have an index in common. More useful for cases where two dataframes only have one column in common and different other headers.

Example:

DATAFRAME1.join(DATFRAME2, lsuffix='STRING1', rsuffix='STRING2')

Most of what merge() does can be done by join() and at a much lower complexity, you may read the full documentation if you're interested.

Activity: Combining data

Let's move away from our retail dataset for awhile and use the Canada and UK Dataset to see if we can successfully perform our combination operations on these data sets.

1. Using concat(), combine the 2 dataset, assuming all headers are the same

2. Assuming only the column category_id is the same, combine the 2 datasets now. (Hint: Are there extra parameters to take into account if the two dataframes have the same column name but are not to be classified together under one column?)

Answers for Activity: Combining Data

import numpy as np
import pandas as pd

#1. Using concat(), combine the 2 dataset, assuming all headers are the same
df1=pd.read_csv("Canadian Vids Data.csv")
df2=pd.read_csv("UK Vids Data.csv")
print(pd.concat([df1,df2]))
#2. Assuming only the columns title and trending_date are the same, combine the 2 datasets now. 
df1=pd.read_csv("Canadian Vids Data.csv")
df2=pd.read_csv("UK Vids Data.csv")
left=df1.set_index(['category_id'])
right=df2.set_index(['category_id'])
print(left.join(right,lsuffix='_CANADA',rsuffix='_UK').columns)

Video Guides