12-DataFrame
Link to the Google Colaboratory notebook: 12-DataFrame
Pandas is a Python package used to manipulate data, especially tabular data. A table is like an Excel sheet, with columns and rows containing data. Pandas enables operations such as sorting, grouping, reshaping, joining, etc.
It is widely used for data analysis and is designed to handle large datasets more efficiently than Python’s built-in data structures.
Installation
If you have Python and pip already installed on your system, installing pandas is very easy.
pip install pandas
If you’re using a notebook, the installation process is a little different: you need to add a ! before pip:
!pip install pandas
Once pandas is installed, you can load it and name it pd to use it:
import pandas as pd
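To quickly check that everything works, you can print the installed version (the exact number will depend on your environment):
# Print the installed pandas version
print(pd.__version__)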
DataFrame
A DataFrame is a two-dimensional labeled data structure, with columns that can have different types.
A first step would be to create a DataFrame from scratch:
import pandas as pd
import numpy as np
data = np.array([[0, 2, 77], [18, -4, 12], [3, -1, 0], [18, 3, 1], [12, 4, 0]])
df = pd.DataFrame(
data=data,
index=["Row1", "Row2", "Row3", "Row4", "Row5"],
columns=["Col1", "Col2", "Col3"]
)
df

|  | Col1 | Col2 | Col3 |
|---|---|---|---|
| Row1 | 0 | 2 | 77 |
| Row2 | 18 | -4 | 12 |
| Row3 | 3 | -1 | 0 |
| Row4 | 18 | 3 | 1 |
| Row5 | 12 | 4 | 0 |
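For reference, the same DataFrame can also be built from a dictionary, where each key becomes a column name:
# Each dictionary key becomes a column
df = pd.DataFrame(
    {"Col1": [0, 18, 3, 18, 12], "Col2": [2, -4, -1, 3, 4], "Col3": [77, 12, 0, 1, 0]},
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
)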
You can then view the data with the .head() method (and give the number of rows to show):
# Show the first 5 rows
df.head(5)

|  | Col1 | Col2 | Col3 |
|---|---|---|---|
| Row1 | 0 | 2 | 77 |
| Row2 | 18 | -4 | 12 |
| Row3 | 3 | -1 | 0 |
| Row4 | 18 | 3 | 1 |
| Row5 | 12 | 4 | 0 |
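In the same way, .tail() shows the last rows of the DataFrame:
# Show the last 2 rows
df.tail(2)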
You can also get the columns or the index:
print(df.columns)
print(df.index)

Index(['Col1', 'Col2', 'Col3'], dtype='object')
Index(['Row1', 'Row2', 'Row3', 'Row4', 'Row5'], dtype='object')
You can convert it into a NumPy array:
df.to_numpy()

array([[ 0,  2, 77],
       [18, -4, 12],
       [ 3, -1,  0],
       [18,  3,  1],
       [12,  4,  0]])
You can sort the values according to a column:
df.sort_values(by="Col2")

|  | Col1 | Col2 | Col3 |
|---|---|---|---|
| Row2 | 18 | -4 | 12 |
| Row3 | 3 | -1 | 0 |
| Row1 | 0 | 2 | 77 |
| Row4 | 18 | 3 | 1 |
| Row5 | 12 | 4 | 0 |
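By default the sort is ascending; you can reverse it with the ascending parameter:
# Sort by Col2 in descending order
df.sort_values(by="Col2", ascending=False)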
Selection
You can select a specific column of a DataFrame:
df["Col1"]Row1 0
Row2 18
Row3 3
Row4 18
Row5 12
Name: Col1, dtype: int64
The selected data is of type Series, which is another core data type of the pandas library.
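If you select with a list of column names instead, you get a DataFrame back rather than a Series:
# A list of columns returns a DataFrame, not a Series
df[["Col1", "Col3"]]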
The selection can also use slices (:), like so:
df[0:3]

|  | Col1 | Col2 | Col3 |
|---|---|---|---|
| Row1 | 0 | 2 | 77 |
| Row2 | 18 | -4 | 12 |
| Row3 | 3 | -1 | 0 |
You can combine them by selecting a specific row and column using .loc:
df.loc["Row1", "Col2"]

2
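.loc also accepts lists and slices of labels, so you can extract a whole sub-block at once:
# Rows Row1 and Row3, columns Col1 and Col2
df.loc[["Row1", "Row3"], ["Col1", "Col2"]]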
The selection can also be done using integer positions, with .iloc:
df.iloc[4, 1:3]

Col2    4
Col3    0
Name: Row5, dtype: int64
Boolean
Instead of directly using the names of the columns or rows (or their indexes), we can use boolean conditions.
For instance, let’s return the rows where Col1 is greater than 10:
df[df["Col1"] > 10]

|  | Col1 | Col2 | Col3 |
|---|---|---|---|
| Row2 | 18 | -4 | 12 |
| Row4 | 18 | 3 | 1 |
| Row5 | 12 | 4 | 0 |
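Conditions can be combined with & (and) and | (or); each condition must be wrapped in parentheses:
# Rows where Col1 is greater than 10 and Col2 is positive
df[(df["Col1"] > 10) & (df["Col2"] > 0)]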
We can also filter on categorical values. To do so, let’s create a new column containing categories:
df["Col4"] = ["A", "A", "B", "B", "C"]
df

|  | Col1 | Col2 | Col3 | Col4 |
|---|---|---|---|---|
| Row1 | 0 | 2 | 77 | A |
| Row2 | 18 | -4 | 12 | A |
| Row3 | 3 | -1 | 0 | B |
| Row4 | 18 | 3 | 1 | B |
| Row5 | 12 | 4 | 0 | C |
Now we can select all the rows whose category is “A” or “C”:
df[df["Col4"].isin(["A", "C"])]

|  | Col1 | Col2 | Col3 | Col4 |
|---|---|---|---|---|
| Row1 | 0 | 2 | 77 | A |
| Row2 | 18 | -4 | 12 | A |
| Row5 | 12 | 4 | 0 | C |
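The ~ operator negates a condition, so you can also keep the rows that are not in those categories:
# Rows whose Col4 is neither "A" nor "C"
df[~df["Col4"].isin(["A", "C"])]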
Operations
You can perform operations on columns, like addition, subtraction, etc.
df["Col12"] = df["Col1"] + df["Col2"]
df["Col1_2"] = df["Col1"] - df["Col2"]
df

|  | Col1 | Col2 | Col3 | Col4 | Col12 | Col1_2 |
|---|---|---|---|---|---|---|
| Row1 | 0 | 2 | 77 | A | 2 | -2 |
| Row2 | 18 | -4 | 12 | A | 14 | 22 |
| Row3 | 3 | -1 | 0 | B | 2 | 4 |
| Row4 | 18 | 3 | 1 | B | 21 | 15 |
| Row5 | 12 | 4 | 0 | C | 16 | 8 |
You can get the mean of a column:
df["Col12"].mean()11.0
You can also define and apply your own function:
df[["Col1"]].agg(lambda x: np.mean(x) * 77)Col1 785.4
dtype: float64
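.agg can also apply different functions to different columns in a single call, using a dictionary:
# Mean of Col1 and sum of Col2 in one call
df.agg({"Col1": "mean", "Col2": "sum"})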
Merge
You can concatenate two DataFrames by row:
df = pd.DataFrame(np.random.randn(3, 4))
df2 = pd.DataFrame(np.random.randn(3, 4))
new_df = pd.concat([df, df2])
new_df

|  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | -0.747020 | -0.381566 | 1.038549 | 0.162756 |
| 1 | -0.797599 | 0.836374 | -1.420236 | 0.622965 |
| 2 | 0.253680 | -0.272074 | -0.947134 | 0.773496 |
| 0 | 0.173705 | 0.483917 | 1.045827 | -1.931953 |
| 1 | -1.028613 | 0.655464 | -0.236323 | 0.722218 |
| 2 | 0.659679 | -0.478420 | 1.029704 | -1.137475 |
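By default concat stacks the rows and keeps the original indexes; ignore_index=True rebuilds a fresh index, and axis=1 concatenates by column instead:
# Stack the rows and renumber them from 0 to 5
pd.concat([df, df2], ignore_index=True)
# Put the two DataFrames side by side
pd.concat([df, df2], axis=1)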
You can also use the merge function to replicate the JOIN operation from SQL:
left = pd.DataFrame({"customer_id": ["1", "2"], "name": ["clement", "guillaume"]})
right = pd.DataFrame({"customer_id": ["1", "3"], "name": ["bernard", "alexis"]})
new_df = pd.merge(left, right, on="customer_id")
new_df

|  | customer_id | name_x | name_y |
|---|---|---|---|
| 0 | 1 | clement | bernard |
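By default merge performs an inner join, so only the matching customer_id is kept. The how parameter gives the other SQL join types:
# Keep every customer_id from both sides, filling the gaps with NaN
pd.merge(left, right, on="customer_id", how="outer")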
Group
Using a DataFrame, you can group by a given category. It will:
- split the data into groups (based on the given criteria)
- apply a function to each group
- combine the results into one DataFrame
df = pd.DataFrame(
{
"A": ["a", "z", "i", "o", "o", "ar", "sf", "ao"],
"B": ["one", "one", "two", "three", "two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8),
}
)
# Group by "B" and take the sum
df.groupby("B")[["C", "D"]].sum()| C | D | |
|---|---|---|
| B | ||
| one | 2.235186 | -0.293617 |
| three | -0.183364 | -0.742939 |
| two | -0.059337 | -0.927755 |
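You can also apply several aggregation functions to each group at once with .agg:
# Sum and mean of C and D for each value of B
df.groupby("B")[["C", "D"]].agg(["sum", "mean"])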