When you combine DataFrame
objects using pd.merge()
, the way rows are matched and included in the final result depends heavily on the type of merge or join you perform. This behavior is controlled by the how
parameter within the pd.merge()
function. Think of these merge types as different rules for deciding which rows to keep based on whether the join keys match in the input DataFrame
objects.
Let's set up two simple DataFrame
objects to illustrate these different join types. Imagine we have one DataFrame
with employee information and another with their project assignments.
import pandas as pd
# DataFrame with employee names
employees = pd.DataFrame({
'EmpID': [101, 102, 103, 104],
'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
# DataFrame with project assignments
projects = pd.DataFrame({
'EmpID': [103, 104, 105, 106],
'Project': ['Zeus', 'Apollo', 'Athena', 'Poseidon']
})
print("Employees DataFrame:")
print(employees)
print("\nProjects DataFrame:")
print(projects)
Running this gives us:
Employees DataFrame:
EmpID Name
0 101 Alice
1 102 Bob
2 103 Charlie
3 104 David
Projects DataFrame:
EmpID Project
0 103 Zeus
1 104 Apollo
2 105 Athena
3 106 Poseidon
Notice that EmpID
103 and 104 exist in both DataFrame
objects. EmpID
101 and 102 are only in employees
, while EmpID
105 and 106 are only in projects
. The EmpID
column is our common column, often called the join key, which pd.merge()
will use by default if not specified otherwise.
how='inner'
)An inner join is the default merge type in Pandas. It combines the two DataFrame
objects and keeps only the rows where the join key (EmpID
in our case) exists in both the left and the right DataFrame
objects. Rows with keys present in only one of the DataFrame
objects are discarded. It essentially finds the intersection of the keys.
# Perform an inner join
inner_join_df = pd.merge(employees, projects, on='EmpID', how='inner')
print("Inner Join Result:")
print(inner_join_df)
The output shows only the employees present in both tables:
Inner Join Result:
EmpID Name Project
0 103 Charlie Zeus
1 104 David Apollo
An inner join retains only the common
EmpID
values (103, 104) found in both the Employees and Projects datasets.
how='outer'
)An outer join, sometimes called a full outer join, keeps all rows from both DataFrame
objects. If a row from one DataFrame
doesn't have a matching key in the other, the columns from the other DataFrame
are filled with NaN
(Not a Number) values. This join type finds the union of the keys.
# Perform an outer join
outer_join_df = pd.merge(employees, projects, on='EmpID', how='outer')
print("Outer Join Result:")
print(outer_join_df)
The result includes all employees and all projects:
Outer Join Result:
EmpID Name Project
0 101.0 Alice NaN
1 102.0 Bob NaN
2 103.0 Charlie Zeus
3 104.0 David Apollo
4 105.0 NaN Athena
5 106.0 NaN Poseidon
Notice how Alice (101) and Bob (102) have NaN
in the Project
column because they weren't in the projects
DataFrame
. Similarly, projects assigned to EmpID
105 and 106 have NaN
in the Name
column.
An outer join retains all
EmpID
values from both datasets, usingNaN
where a corresponding match is not found in the other dataset.
how='left'
)A left join keeps all rows from the left DataFrame
(the first one passed to pd.merge()
, which is employees
in our examples) and includes matching rows from the right DataFrame
(projects
). If a key from the left DataFrame
does not exist in the right DataFrame
, the columns from the right DataFrame
are filled with NaN
. Keys that exist only in the right DataFrame
are discarded.
# Perform a left join
left_join_df = pd.merge(employees, projects, on='EmpID', how='left')
print("Left Join Result:")
print(left_join_df)
This keeps all employees and adds project information where available:
Left Join Result:
EmpID Name Project
0 101 Alice NaN
1 102 Bob NaN
2 103 Charlie Zeus
3 104 David Apollo
All EmpID
s from employees
(101, 102, 103, 104) are present. Since 101 and 102 don't have matching entries in projects
, their Project
values are NaN
. EmpID
s 105 and 106 from projects
are excluded because they are not in the left DataFrame
(employees
).
A left join retains all
EmpID
values from the left dataset (Employees) and includes matching data from the right dataset (Projects). Non-matches from the right are filled withNaN
.
how='right'
)A right join is the mirror image of a left join. It keeps all rows from the right DataFrame
(projects
) and includes matching rows from the left DataFrame
(employees
). If a key from the right DataFrame
does not exist in the left DataFrame
, the columns from the left DataFrame
are filled with NaN
. Keys that exist only in the left DataFrame
are discarded.
# Perform a right join
right_join_df = pd.merge(employees, projects, on='EmpID', how='right')
print("Right Join Result:")
print(right_join_df)
This keeps all projects and adds employee information where available:
Right Join Result:
EmpID Name Project
0 103.0 Charlie Zeus
1 104.0 David Apollo
2 105.0 NaN Athena
3 106.0 NaN Poseidon
All EmpID
s from projects
(103, 104, 105, 106) are present. Since 105 and 106 don't have matching entries in employees
, their Name
values are NaN
. EmpID
s 101 and 102 from employees
are excluded because they are not in the right DataFrame
(projects
).
A right join retains all
EmpID
values from the right dataset (Projects) and includes matching data from the left dataset (Employees). Non-matches from the left are filled withNaN
.
Choosing the correct merge type is an important step in data preparation. It depends entirely on what information you need in your final combined dataset. Do you only care about entries present in both sources (inner)? Do you need a complete picture, including entries from only one source (outer)? Or are you primarily interested in augmenting one specific dataset with information from another (left or right)? Understanding these join types allows you to precisely control how your data is combined.
© 2025 ApX Machine Learning