12-DataFrame
Link to the Google Colaboratory notebook: 12-DataFrame
Pandas is a Python package used to manipulate data, especially tabular data. A table is like an Excel sheet, with columns and rows containing data. Pandas enables operations such as sorting, grouping, reshaping, joining, etc.
It is widely used for data analysis and is designed to handle large datasets more efficiently than Python’s built-in data structures.
Installation
If you have Python and pip already installed on your system, installing pandas is very easy.
pip install pandas
If you’re using a notebook, the installation process is a little different: you need to add a ! before pip:
!pip install pandas
Once pandas is installed, you can load it and name it pd to use it:
import pandas as pd
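To quickly check that everything works, you can print the installed version (the exact number will depend on your environment):
# Print the installed pandas version
print(pd.__version__)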
DataFrame
A DataFrame is a two-dimensional labeled data structure, with columns that can have different types.
A first step would be to create a DataFrame from scratch:
import pandas as pd
import numpy as np
data = np.array([[0, 2, 77], [18, -4, 12], [3, -1, 0], [18, 3, 1], [12, 4, 0]])
df = pd.DataFrame(
data=data,
index=["Row1", "Row2", "Row3", "Row4", "Row5"],
columns=["Col1", "Col2", "Col3"]
)
df

|  | Col1 | Col2 | Col3 |
|---|---|---|---|
| Row1 | 0 | 2 | 77 |
| Row2 | 18 | -4 | 12 |
| Row3 | 3 | -1 | 0 |
| Row4 | 18 | 3 | 1 |
| Row5 | 12 | 4 | 0 |
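For reference, the same DataFrame can also be built from a dictionary, where each key becomes a column name:
# Each dictionary key becomes a column
df = pd.DataFrame(
    {"Col1": [0, 18, 3, 18, 12], "Col2": [2, -4, -1, 3, 4], "Col3": [77, 12, 0, 1, 0]},
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
)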
You can then view the data with the .head() method (and give the number of rows to show):
# Show the first 5 rows
df.head(5)

|  | Col1 | Col2 | Col3 |
|---|---|---|---|
| Row1 | 0 | 2 | 77 |
| Row2 | 18 | -4 | 12 |
| Row3 | 3 | -1 | 0 |
| Row4 | 18 | 3 | 1 |
| Row5 | 12 | 4 | 0 |
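In the same way, .tail() shows the last rows of the DataFrame:
# Show the last 2 rows
df.tail(2)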
You can also get the columns or the index:
print(df.columns)
print(df.index)

Index(['Col1', 'Col2', 'Col3'], dtype='object')
Index(['Row1', 'Row2', 'Row3', 'Row4', 'Row5'], dtype='object')
You can convert it into a NumPy array:
df.to_numpy()

array([[ 0,  2, 77],
       [18, -4, 12],
       [ 3, -1,  0],
       [18,  3,  1],
       [12,  4,  0]])
You can sort the values according to a column:
df.sort_values(by="Col2")

|  | Col1 | Col2 | Col3 |
|---|---|---|---|
| Row2 | 18 | -4 | 12 |
| Row3 | 3 | -1 | 0 |
| Row1 | 0 | 2 | 77 |
| Row4 | 18 | 3 | 1 |
| Row5 | 12 | 4 | 0 |
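By default the sort is ascending; you can reverse it with the ascending parameter:
# Sort by Col2 in descending order
df.sort_values(by="Col2", ascending=False)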
Selection
You can select a specific column of a DataFrame:
df["Col1"]Row1 0
Row2 18
Row3 3
Row4 18
Row5 12
Name: Col1, dtype: int64
The selected data is of type Series, which is another core data type of the pandas library.
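If you select with a list of column names instead, you get a DataFrame back rather than a Series:
# A list of columns returns a DataFrame, not a Series
df[["Col1", "Col3"]]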
The selection can also use slices (:), like so:
df[0:3]

|  | Col1 | Col2 | Col3 |
|---|---|---|---|
| Row1 | 0 | 2 | 77 |
| Row2 | 18 | -4 | 12 |
| Row3 | 3 | -1 | 0 |
You can combine them by selecting a specific row and column using .loc:
df.loc["Row1", "Col2"]

2
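.loc also accepts lists and slices of labels, so you can extract a whole sub-block at once:
# Rows Row1 and Row3, columns Col1 and Col2
df.loc[["Row1", "Row3"], ["Col1", "Col2"]]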
The selection can also be done using integer positions, with .iloc:
df.iloc[4, 1:3]

Col2    4
Col3    0
Name: Row5, dtype: int64
Boolean
Instead of directly using the names of the columns or rows (or their indexes), we can use boolean conditions.
For instance, let’s return the rows where Col1 is greater than 10:
df[df["Col1"] > 10]

|  | Col1 | Col2 | Col3 |
|---|---|---|---|
| Row2 | 18 | -4 | 12 |
| Row4 | 18 | 3 | 1 |
| Row5 | 12 | 4 | 0 |
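Conditions can be combined with & (and) and | (or); each condition must be wrapped in parentheses:
# Rows where Col1 is greater than 10 and Col2 is positive
df[(df["Col1"] > 10) & (df["Col2"] > 0)]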
We can also filter on categorical values. To do so, let’s create a new column containing categories:
df["Col4"] = ["A", "A", "B", "B", "C"]
df

|  | Col1 | Col2 | Col3 | Col4 |
|---|---|---|---|---|
| Row1 | 0 | 2 | 77 | A |
| Row2 | 18 | -4 | 12 | A |
| Row3 | 3 | -1 | 0 | B |
| Row4 | 18 | 3 | 1 | B |
| Row5 | 12 | 4 | 0 | C |
Now we can select all the rows whose category is “A” or “C”:
df[df["Col4"].isin(["A", "C"])]

|  | Col1 | Col2 | Col3 | Col4 |
|---|---|---|---|---|
| Row1 | 0 | 2 | 77 | A |
| Row2 | 18 | -4 | 12 | A |
| Row5 | 12 | 4 | 0 | C |
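The ~ operator negates a condition, so you can also keep the rows that are not in those categories:
# Rows whose Col4 is neither "A" nor "C"
df[~df["Col4"].isin(["A", "C"])]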
Operations
You can perform operations on columns, like addition, subtraction, etc.
df["Col12"] = df["Col1"] + df["Col2"]
df["Col1_2"] = df["Col1"] - df["Col2"]
df

|  | Col1 | Col2 | Col3 | Col4 | Col12 | Col1_2 |
|---|---|---|---|---|---|---|
| Row1 | 0 | 2 | 77 | A | 2 | -2 |
| Row2 | 18 | -4 | 12 | A | 14 | 22 |
| Row3 | 3 | -1 | 0 | B | 2 | 4 |
| Row4 | 18 | 3 | 1 | B | 21 | 15 |
| Row5 | 12 | 4 | 0 | C | 16 | 8 |
You can get the mean of a column:
df["Col12"].mean()11.0
You can also define and apply your own function:
df[["Col1"]].agg(lambda x: np.mean(x) * 77)Col1 785.4
dtype: float64
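.agg can also apply different functions to different columns in a single call, using a dictionary:
# Mean of Col1 and sum of Col2 in one call
df.agg({"Col1": "mean", "Col2": "sum"})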
Merge
You can concatenate two DataFrames by row:
df = pd.DataFrame(np.random.randn(3, 4))
df2 = pd.DataFrame(np.random.randn(3, 4))
new_df = pd.concat([df, df2])
new_df

|  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | -0.747020 | -0.381566 | 1.038549 | 0.162756 |
| 1 | -0.797599 | 0.836374 | -1.420236 | 0.622965 |
| 2 | 0.253680 | -0.272074 | -0.947134 | 0.773496 |
| 0 | 0.173705 | 0.483917 | 1.045827 | -1.931953 |
| 1 | -1.028613 | 0.655464 | -0.236323 | 0.722218 |
| 2 | 0.659679 | -0.478420 | 1.029704 | -1.137475 |
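By default concat stacks the rows and keeps the original indexes; ignore_index=True rebuilds a fresh index, and axis=1 concatenates by column instead:
# Stack the rows and renumber them from 0 to 5
pd.concat([df, df2], ignore_index=True)
# Put the two DataFrames side by side
pd.concat([df, df2], axis=1)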
You can also use the merge function to replicate the JOIN operation from SQL:
left = pd.DataFrame({"customer_id": ["1", "2"], "name": ["clement", "guillaume"]})
right = pd.DataFrame({"customer_id": ["1", "3"], "name": ["bernard", "alexis"]})
new_df = pd.merge(left, right, on="customer_id")
new_df

|  | customer_id | name_x | name_y |
|---|---|---|---|
| 0 | 1 | clement | bernard |
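By default merge performs an inner join, so only the matching customer_id is kept. The how parameter gives the other SQL join types:
# Keep every customer_id from both sides, filling the gaps with NaN
pd.merge(left, right, on="customer_id", how="outer")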
Group
Using a DataFrame, you can group by a given category. It will:
- split the data into groups (based on the given criteria)
- apply a function to each group
- combine the results into one DataFrame
df = pd.DataFrame(
{
"A": ["a", "z", "i", "o", "o", "ar", "sf", "ao"],
"B": ["one", "one", "two", "three", "two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8),
}
)
# Group by "B" and take the sum
df.groupby("B")[["C", "D"]].sum()| C | D | |
|---|---|---|
| B | ||
| one | 2.235186 | -0.293617 |
| three | -0.183364 | -0.742939 |
| two | -0.059337 | -0.927755 |
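You can also apply several aggregation functions to each group at once with .agg:
# Sum and mean of C and D for each value of B
df.groupby("B")[["C", "D"]].agg(["sum", "mean"])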