Skip to Main Content

Python for Basic Data Analysis

Start your data science journey with Python. Learn practical Python programming skills for basic data manipulation and analysis.

Combining data

There may be situations where we must combine different datasets to perform other analysis, Pandas allows us to do this in three ways: concat(), join(), and merge() (In ascending order of complexity).

The most common and simplest method is concat([DATAFRAME1,DATAFRAME2]). This function puts together a list of elements along an axis, especially useful when the columns in the dataframe are the same.

join() lets you combine different DataFrame objects which have an index in common. More useful for cases where two dataframes only have one column in common and different other headers.

Example:

DATAFRAME1.join(DATFRAME2, lsuffix='STRING1', rsuffix='STRING2')

Most of what merge() does can be done by join() and at a much lower complexity, you may read the full documentation if you're interested.

Video Guides

Activity: Combining data

Let's move away from our retail dataset for awhile and use the Canada and UK Dataset to see if we can successfully perform our combination operations on these data sets.

1. Using concat(), combine the 2 dataset, assuming all headers are the same

2. Assuming only the column category_id is the same, combine the 2 datasets now. (Hint: Are there extra parameters to take into account if the two dataframes have the same column name but are not to be classified together under one column?)

Answers for Activity: Combining Data

import numpy as np
import pandas as pd

#1. Using concat(), combine the 2 dataset, assuming all headers are the same
df1=pd.read_csv("Canadian Vids Data.csv")
df2=pd.read_csv("UK Vids Data.csv")
print(pd.concat([df1,df2]))
#2. Assuming only the columns title and trending_date are the same, combine the 2 datasets now. 
df1=pd.read_csv("Canadian Vids Data.csv")
df2=pd.read_csv("UK Vids Data.csv")
left=df1.set_index(['category_id'])
right=df2.set_index(['category_id'])
print(left.join(right,lsuffix='_CANADA',rsuffix='_UK').columns)