Python basics
  • Tutorial
  • Exercices
  • About
  1. Advanced
  2. 12-DataFrame
  • Introduction
    • Python
  • Fundamentals
    • 0-Basic Syntax
    • 1-Variables
    • 2-Data types
    • 3-Numbers
    • 4-Strings
    • 5-Lists
    • 6-Booleans
    • 7-Loops
    • 8-Functions
    • 9-Dictionary
  • Advanced
    • 10-Numpy
    • 11-Matplotlib
    • 12-DataFrame
  • CLI
    • Exercices
    • Solutions
  • Git
    • Git introduction
    • Git branch
    • Git exercices
  • SQL
    • SQL Exercices
    • SQL Solutions

On this page

  • Installation
  • DataFrame
    • Selection
    • Boolean
    • Operations
    • Merge
    • Group
  1. Advanced
  2. 12-DataFrame

12-DataFrame

Notebook

Link to the Google Colaboratory notebook: 12-DataFrame

Pandas is a Python package that is used to manipulate data, like tabular one. This is like an Excel sheet, with data and rows with data inside. It enables the manipulation of data like sorting, grouping, reshaping, joining, etc.

It is widely used for data analysis, and is made to work better than current Python built-in functions for large data.

Installation

If you have Python and PIP already installed on a system, then installation of pandas is very easy.

pip install pandas
Notebook

If you’re using a notebook, the installation process is a little different: you need to add a ! before the pip:

!pip install pandas

Once pandas is installed, you can load it and name it pd to use it:

import pandas as pd

DataFrame

A DataFrame is a two-dimensional labeled data with columns (that can have different types).

A first step would be to create a DataFrame from scratch:

import pandas as pd
import numpy as np

data = np.array([[0, 2, 77], [18, -4, 12], [3, -1, 0], [18, 3, 1], [12, 4, 0]])
df = pd.DataFrame(
    data=data,
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
    columns=["Col1", "Col2", "Col3"]
)
df
Col1 Col2 Col3
Row1 0 2 77
Row2 18 -4 12
Row3 3 -1 0
Row4 18 3 1
Row5 12 4 0

You can then view the data with the .head() argument (and give the number of row to show):

# Show the first 5 rows
df.head(5)
Col1 Col2 Col3
Row1 0 2 77
Row2 18 -4 12
Row3 3 -1 0
Row4 18 3 1
Row5 12 4 0

You can also get the columns or the index:

print(df.columns)
print(df.index)
Index(['Col1', 'Col2', 'Col3'], dtype='object')
Index(['Row1', 'Row2', 'Row3', 'Row4', 'Row5'], dtype='object')

You can convert it into a numpy array:

df.to_numpy()
array([[ 0,  2, 77],
       [18, -4, 12],
       [ 3, -1,  0],
       [18,  3,  1],
       [12,  4,  0]])

You can sort the values depending on a column:

df.sort_values(by="Col2")
Col1 Col2 Col3
Row2 18 -4 12
Row3 3 -1 0
Row1 0 2 77
Row4 18 3 1
Row5 12 4 0

Selection

You can select a specific column when using dataframes:

df["Col1"]
Row1     0
Row2    18
Row3     3
Row4    18
Row5    12
Name: Col1, dtype: int64

The type of the selected data would be a Series, which is another specific type of the pandas library.

The selection could use slices :, like so:

df[0:3]
Col1 Col2 Col3
Row1 0 2 77
Row2 18 -4 12
Row3 3 -1 0

You can combine them by selecting a specific row and column using .loc:

df.loc["Row1", "Col2"]
2

The selection can be done using indexes:

df.iloc[4, 1:3]
Col2    4
Col3    0
Name: Row5, dtype: int64

Boolean

Instead of using directly the names of the columns or rows (or the indexes), we can use boolean conditions.

For instance, let’s return the rows where the Col1 is greater than 10:

df[df["Col1"] > 10]
Col1 Col2 Col3
Row2 18 -4 12
Row4 18 3 1
Row5 12 4 0

We can also specify categorical values. To do so, let’s create a new column with “categorical values”:

df["Col4"] = ["A", "A", "B", "B", "C"]
df
Col1 Col2 Col3 Col4
Row1 0 2 77 A
Row2 18 -4 12 A
Row3 3 -1 0 B
Row4 18 3 1 B
Row5 12 4 0 C

Now we can check all the values that are in category “A” or “C”:

df[df["Col4"].isin(["A", "C"])]
Col1 Col2 Col3 Col4
Row1 0 2 77 A
Row2 18 -4 12 A
Row5 12 4 0 C

Operations

You can have operations on columns like addition, substraction, etc.

df["Col12"] = df["Col1"] + df["Col2"]
df["Col1_2"] = df["Col1"] - df["Col2"]
df
Col1 Col2 Col3 Col4 Col12 Col1_2
Row1 0 2 77 A 2 -2
Row2 18 -4 12 A 14 22
Row3 3 -1 0 B 2 4
Row4 18 3 1 B 21 15
Row5 12 4 0 C 16 8

You can get the mean of a column:

df["Col12"].mean()
11.0

You can also define your own function:

df[["Col1"]].agg(lambda x: np.mean(x) * 77)
Col1    785.4
dtype: float64

Merge

You can merge two dataframes by row:

df = pd.DataFrame(np.random.randn(3, 4))
df2 = pd.DataFrame(np.random.randn(3, 4))

new_df = pd.concat([df, df2])

new_df
0 1 2 3
0 -0.747020 -0.381566 1.038549 0.162756
1 -0.797599 0.836374 -1.420236 0.622965
2 0.253680 -0.272074 -0.947134 0.773496
0 0.173705 0.483917 1.045827 -1.931953
1 -1.028613 0.655464 -0.236323 0.722218
2 0.659679 -0.478420 1.029704 -1.137475

You can also use join command to replicate the JOIN method from SQL:

left = pd.DataFrame({"customer_id": ["1", "2"], "name": ["clement", "guillaume"]})
right = pd.DataFrame({"customer_id": ["1", "3"], "name": ["bernard", "alexis"]})


new_df = pd.merge(left, right, on = "customer_id")
new_df
customer_id name_x name_y
0 1 clement bernard

Group

Using dataframe, you can group by a given category. It will: - split data into groups (based on the given criteria) - apply a function to each group - combine the results into one dataframe

df = pd.DataFrame(
    {
        "A": ["a", "z", "i", "o", "o", "ar", "sf", "ao"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)


# Group by "B" and take the sum

df.groupby("B")[["C", "D"]].sum()
C D
B
one 2.235186 -0.293617
three -0.183364 -0.742939
two -0.059337 -0.927755
11-Matplotlib
Exercices
 

Copyright 2023, Clément Bernard