Python - Pandas with Series and DataFrame
- What is pandas?
Pandas is a powerful library in Python that is used for data manipulation and analysis. It provides tools for working with data in a tabular form, similar to Excel spreadsheets, SQL tables, or data frames in R.
To start working with pandas, import the package. The community-agreed alias for pandas is pd, so importing pandas as pd is standard practice throughout the pandas documentation.
Install pandas from Jupyter Notebook: First, you need to install the pandas package if you haven't already. You can do this using pip:
%pip install pandas
# or
!pip install pandas
Import pandas:
After installing pandas, import it in your Python script or notebook under the pd alias:
import pandas as pd
Basic data structures in pandas
Pandas provides two classes for handling data, Series and DataFrame:
Series: a one-dimensional labeled array holding data of any type, such as integers, strings, or Python objects.
DataFrame: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.
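As a minimal sketch of the difference between the two classes, the snippet below builds one of each and inspects the number of dimensions and the shape. The variable names ages and people are only illustrative.

```python
import pandas as pd

# A Series: one dimension, one row label per value.
ages = pd.Series([25, 30, 35], name="Age")

# A DataFrame: two dimensions, row labels plus column labels.
people = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]})

print(ages.ndim, ages.shape)      # 1 (3,)
print(people.ndim, people.shape)  # 2 (3, 2)
```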
Create a DataFrame: A DataFrame is a key data structure in pandas. It is like a table in a database or an Excel spreadsheet. Here is an example of how to create a DataFrame:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
    'Product': [44, 25, 56]
}
df = pd.DataFrame(data)
df
| | Name | Age | City | Product |
|---|---|---|---|---|
| 0 | Alice | 25 | New York | 44 |
| 1 | Bob | 30 | Los Angeles | 25 |
| 2 | Charlie | 35 | Chicago | 56 |
You can work with the data in a single column, for example 'City'. When you select a single column of a pandas DataFrame, the result is a pandas Series. To select the column, put the column label between square brackets, df['City'], or use attribute access, df.City.
df['City']
0       New York
1    Los Angeles
2        Chicago
Name: City, dtype: object
df.City
0       New York
1    Los Angeles
2        Chicago
Name: City, dtype: object
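The following sketch checks that both selection styles return the same Series. It also shows one case where attribute access is not a safe substitute for brackets: a column whose label clashes with an existing DataFrame method. The second frame and its "count" column are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "City": ["New York", "Los Angeles", "Chicago"],
})

by_brackets = df["City"]
by_attribute = df.City

print(type(by_brackets).__name__)        # Series
print(by_brackets.equals(by_attribute))  # True

# Hypothetical frame whose column label collides with DataFrame.count().
counts = pd.DataFrame({"count": [1, 2]})
print(callable(counts.count))            # True: this is the method, not the column
print(type(counts["count"]).__name__)    # Series: brackets always reach the column
```

Bracket selection works for any label, so it is the safer default when column names come from external data.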
You can create a Series from scratch as well. A pandas Series has no column labels, as it is just a single column of a DataFrame. A Series does have row labels.
data_series = pd.Series([220, 350, 580, 310, 610, 250], name="Total", dtype=int)
data_series
0    220
1    350
2    580
3    310
4    610
5    250
Name: Total, dtype: int64
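To make the row labels visible, here is a small sketch that passes an explicit index instead of relying on the default 0, 1, 2, ... labels. The month labels are invented for the example.

```python
import pandas as pd

# Illustrative labels; any hashable values can serve as row labels.
totals = pd.Series([220, 350, 580], index=["jan", "feb", "mar"], name="Total")

print(totals.index.tolist())  # ['jan', 'feb', 'mar']
print(totals["feb"])          # 350
print(totals.name)            # Total
```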
For example, suppose you want the largest value in the Product column of the DataFrame, or the smallest value in the Total Series. Select the column (or use the Series directly) and apply max() or min():
# Data Frame
df["Product"].max()
56
# Data Series
data_series.min()
220
As the max() method illustrates, you can do things with a DataFrame or Series. pandas provides a lot of functionality, much of it as methods you can apply to a DataFrame or Series. Because methods such as sum(), min(), max(), and mean() are functions, do not forget the parentheses (). The built-in len() function also works on both objects.
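A short sketch of why the parentheses matter: without them, Python hands back the method object itself and nothing is computed. The Series below reuses the Product values from the earlier DataFrame.

```python
import pandas as pd

products = pd.Series([44, 25, 56], name="Product")

without_call = products.max   # a bound method object, no computation happens
with_call = products.max()    # the method is called and returns the maximum

print(callable(without_call))  # True
print(with_call)               # 56
print(len(products))           # 3 (len is a built-in function, not a method)
```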
# Data Frame
f"Sum of column Product (DataFrame): {df['Product'].sum()}"
'Sum of column Product (DataFrame): 125'
# Data Series
f"Sum of Series Total: {data_series.sum()}"
'Sum of Series Total: 2320'
# Average for Data Series
f"Average: {data_series.sum()/len(data_series):.2f}"
'Average: 386.67'
data_series.head()
0    220
1    350
2    580
3    310
4    610
Name: Total, dtype: int64
data_series.tail()
1    350
2    580
3    310
4    610
5    250
Name: Total, dtype: int64
# sample() picks an element at random, so the output varies between runs
data_series.sample()
2    580
Name: Total, dtype: int64
round(data_series.head(6).mean(), 2)
386.67
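The calls above use the defaults: head() and tail() return five rows and sample() returns one. As a sketch, the same methods accept a row count, and sample() accepts a random_state seed when a repeatable draw is wanted. The seed value 0 is arbitrary.

```python
import pandas as pd

data_series = pd.Series([220, 350, 580, 310, 610, 250], name="Total")

first_two = data_series.head(2)
last_two = data_series.tail(2)

# Same seed, same draw: useful when results must be reproducible.
picked = data_series.sample(n=3, random_state=0)
picked_again = data_series.sample(n=3, random_state=0)

print(first_two.tolist())            # [220, 350]
print(last_two.tolist())             # [610, 250]
print(picked.equals(picked_again))   # True
```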
The describe() method computes summary statistics. By default it summarizes numerical columns, and it can also summarize categorical data when asked to.
# DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
data_series.describe()
count      6.000000
mean     386.666667
std      167.888852
min      220.000000
25%      265.000000
50%      330.000000
75%      522.500000
max      610.000000
Name: Total, dtype: float64
Parameters
percentiles: list-like, optional
The percentiles to include in the output. All should be between 0 and 1. By default, it includes [0.25, 0.5, 0.75], which correspond to the 25th, 50th (median), and 75th percentiles.
include: 'all', list-like of dtypes or None (default)
Specify the data types to include in the summary. For example, to include all data types, use 'all'. To include only specific types, pass a list like [np.number] for numeric types.
exclude: list-like of dtypes or None (default)
Specify the data types to exclude from the summary.
datetime_is_numeric: bool, default False
If True, treats datetime columns as numeric, so the summary includes the mean, min, and max. This parameter was removed in pandas 2.0, where datetime columns are always summarized as numeric.
Returns
Series or DataFrame: Summary statistics of the Series or DataFrame provided.
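The parameters above can be sketched on a small frame. The data is invented, and the exact set of percentile rows may differ slightly between pandas versions, so the example only checks for the rows that were explicitly requested.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50],
    "department": ["HR", "IT", "Finance", "HR", "IT", "Finance"],
})

# Request the 10th and 90th percentiles instead of the default quartiles.
custom = df.describe(percentiles=[0.1, 0.9])
print("10%" in custom.index, "90%" in custom.index)  # True True

# Keep only numeric columns.
numeric_only = df.describe(include=[np.number])
print(numeric_only.columns.tolist())  # ['age']

# Drop numeric columns, leaving the categorical summary.
non_numeric = df.describe(exclude=[np.number])
print(non_numeric.columns.tolist())   # ['department']
```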
Statistical Measures Provided
For numerical columns, describe typically returns the following measures:
count: The number of non-null entries.
mean: The average of the entries.
std: The standard deviation of the entries.
min: The minimum value.
25%: The 25th percentile (first quartile).
50%: The 50th percentile (median or second quartile).
75%: The 75th percentile (third quartile).
max: The maximum value.
For categorical columns, if included, describe returns:
count: The number of non-null entries.
unique: The number of unique values.
top: The most frequent value.
freq: The frequency of the most frequent value.
The describe function in pandas is a useful tool for spotting outliers, such as an age of 1 or 1500 in your data. It provides essential statistics that help you understand the distribution and central tendencies of your dataset, making it an excellent first step in data analysis.
data = {
    'age': [25, 30, 35, 40, 45, 50],
    'salary': [50000, 60000, 70000, 80000, 90000, 100000],
    'department': ['HR', 'IT', 'Finance', 'HR', 'IT', 'Finance']
}
df = pd.DataFrame(data)
# Get the summary statistics for numerical columns
summary_numerical = df.describe()
print("Numerical summary:")
print(summary_numerical)
# Get the summary statistics for all columns, including categorical
summary_all = df.describe(include='all')
print("\nSummary including categorical data:")
print(summary_all)
Numerical summary:
             age         salary
count   6.000000       6.000000
mean   37.500000   75000.000000
std     9.354143   18708.286934
min    25.000000   50000.000000
25%    31.250000   62500.000000
50%    37.500000   75000.000000
75%    43.750000   87500.000000
max    50.000000  100000.000000

Summary including categorical data:
              age         salary department
count    6.000000       6.000000          6
unique        NaN            NaN          3
top           NaN            NaN         HR
freq          NaN            NaN          2
mean    37.500000   75000.000000        NaN
std      9.354143   18708.286934        NaN
min     25.000000   50000.000000        NaN
25%     31.250000   62500.000000        NaN
50%     37.500000   75000.000000        NaN
75%     43.750000   87500.000000        NaN
max     50.000000  100000.000000        NaN
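As a sketch of the outlier use case, the example below plants two implausible ages in a made-up Series. The min and max rows of describe() expose them immediately. The plausibility bounds of 18 and 100 are assumptions chosen only for this illustration.

```python
import pandas as pd

# Invented data with two implausible entries: 1 and 1500.
ages = pd.Series([25, 30, 1, 40, 1500, 50], name="age")

summary = ages.describe()
print(summary["min"], summary["max"])  # 1.0 1500.0

# Assumed bounds for a plausible working age; adjust to your own data.
suspicious = ages[(ages < 18) | (ages > 100)]
print(suspicious.tolist())             # [1, 1500]
```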