Python - Pandas with Series and DataFrame

  • What is pandas?

Pandas is a powerful library in Python that is used for data manipulation and analysis. It provides tools for working with data in a tabular form, similar to Excel spreadsheets, SQL tables, or data frames in R.

To start working with pandas, import the package. The community-agreed alias for pandas is pd, and the pandas documentation assumes this import throughout.

Install pandas from Jupyter Notebook: First, you need to install the pandas package if you haven't already. You can do this using pip:

In [ ]:
%pip install pandas
# or
!pip install pandas
  • Import pandas:

  • After installing pandas, import it in your Python script or notebook under the pd alias:

In [1]:
import pandas as pd
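
Some pandas behavior differs between releases, so it can help to confirm which version the import picked up. A minimal sketch; the variable name version is only illustrative, and the printed value depends on your installation:

```python
import pandas as pd

# pd.__version__ is a plain string; its value depends on the installed release.
version = pd.__version__
print(f"pandas version: {version}")
```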

  • Basic data structures in pandas

  • Pandas provides two classes for handling data, Series and DataFrame:

  • Series: a one-dimensional labeled array holding data of any type, such as integers, strings, or Python objects.

[Figure: Pandas Series]

  • DataFrame: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.

[Figure: Pandas DataFrame]

[Figure: Pandas DataFrame Series]
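
The difference between the two structures can be checked directly. The sketch below builds two small Series and combines them into a DataFrame; the names ages, names, and people are illustrative and not part of the original notebook:

```python
import pandas as pd

# A Series is one-dimensional: a sequence of values plus row labels.
ages = pd.Series([25, 30, 35], name="Age")
names = pd.Series(["Alice", "Bob", "Charlie"], name="Name")

# A DataFrame is two-dimensional: each column is a Series,
# and all columns share the same row labels.
people = pd.DataFrame({"Name": names, "Age": ages})

print(ages.ndim, ages.shape)      # 1 (3,)
print(people.ndim, people.shape)  # 2 (3, 2)
print(list(people.columns))       # ['Name', 'Age']
```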

Create a DataFrame: A DataFrame is a key data structure in pandas. It is like a table in a database or an Excel spreadsheet. Here is an example of how to create a DataFrame:

In [8]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
    'Product': [44, 25, 56]
}
df = pd.DataFrame(data)
df
Out[8]:
Name Age City Product
0 Alice 25 New York 44
1 Bob 30 Los Angeles 25
2 Charlie 35 Chicago 56

Suppose you want to work with the data in the column 'City'. When you select a single column of a pandas DataFrame, the result is a pandas Series. To select the column, put the column label between square brackets, df['City'], or use attribute access, df.City.

In [3]:
df['City']
Out[3]:
0       New York
1    Los Angeles
2        Chicago
Name: City, dtype: object
In [4]:
df.City
Out[4]:
0       New York
1    Los Angeles
2        Chicago
Name: City, dtype: object
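
The two selection styles are not fully interchangeable. The sketch below suggests when the bracket form is required and what a list of labels returns; the 'Home Town' column is a hypothetical addition used only to show a label containing a space:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "City": ["New York", "Los Angeles", "Chicago"],
    "Home Town": ["Boston", "Denver", "Austin"],  # hypothetical column
})

# Both forms return the same Series for a simple column label.
city_by_bracket = df["City"]
city_by_attr = df.City
print(city_by_bracket.equals(city_by_attr))  # True

# A label with a space is not a valid Python attribute name,
# so only the bracket form can select it.
home = df["Home Town"]

# A list of labels selects several columns and returns a DataFrame.
subset = df[["Name", "City"]]
print(type(city_by_bracket).__name__, type(subset).__name__)  # Series DataFrame
```

Bracket selection is the safer default, since attribute access also fails when a column name collides with an existing DataFrame attribute or method.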

You can create a Series from scratch as well. A pandas Series has no column labels, as it is just a single column of a DataFrame. A Series does have row labels.

In [6]:
data_series = pd.Series([220, 350, 580, 310, 610, 250], name="Total", dtype=int)
data_series
Out[6]:
0    220
1    350
2    580
3    310
4    610
5    250
Name: Total, dtype: int64
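
The row labels default to 0, 1, 2, and so on, but you can supply your own. A minimal sketch, assuming hypothetical weekday labels that do not appear in the original data:

```python
import pandas as pd

# index= sets the row labels; the weekday names are illustrative only.
totals = pd.Series(
    [220, 350, 580],
    index=["Mon", "Tue", "Wed"],
    name="Total",
)

print(list(totals.index))  # ['Mon', 'Tue', 'Wed']
print(totals["Tue"])       # 350, selected by row label
print(totals.iloc[0])      # 220, selected by position
```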

For example, suppose you want to know the maximum value of the Product column. On a DataFrame, select the column and apply max(). The same kind of method works directly on a Series; the second cell applies min() to data_series to get the smallest Total:

In [9]:
# Data Frame
df["Product"].max()
Out[9]:
56
In [11]:
# Data Series
data_series.min()
Out[11]:
220
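
Aggregations also work on several columns at once, and idxmax() tells you which row holds the maximum. A short sketch that rebuilds a reduced copy of the DataFrame above; col_max and best_row are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Product": [44, 25, 56],
})

# max() on a multi-column selection returns one value per column.
col_max = df[["Age", "Product"]].max()
print(col_max["Age"], col_max["Product"])  # 35 56

# idxmax() returns the row label of the largest value,
# which can then be used to look up other columns in that row.
best_row = df["Product"].idxmax()
print(df.loc[best_row, "Name"])  # Charlie
```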

As the max() method illustrates, you can compute things directly on a DataFrame or Series. pandas provides many such operations, each of them a method you can apply to a DataFrame or Series. Because methods such as sum(), min(), max(), and mean() are functions, do not forget the parentheses (). Note that len() is a Python built-in function rather than a method, so you pass the object to it: len(data_series).

In [27]:
# Data Frame
f"Sum of column Product (DataFrame): {df['Product'].sum()}"
Out[27]:
'Sum of column Product (DataFrame): 125'
In [25]:
# Data Series
f"Sum of Data Series Total: {data_series.sum()}"
Out[25]:
'Sum of Data Series Total: 2320'
In [16]:
# Average for Data Series
f"Average: {data_series.sum()/len(data_series):.2f}"
Out[16]:
'Average: 386.67'
In [17]:
data_series.head()
Out[17]:
0    220
1    350
2    580
3    310
4    610
Name: Total, dtype: int64
In [22]:
data_series.tail()
Out[22]:
1    350
2    580
3    310
4    610
5    250
Name: Total, dtype: int64
In [24]:
data_series.sample()
Out[24]:
2    580
Name: Total, dtype: int64
In [21]:
round(data_series.head(6).mean(), 2)
Out[21]:
386.67
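
The cells above can be summarized in one sketch: what happens without parentheses, how mean() relates to sum() and len(), and how head(), tail(), and sample() accept a row count. The random_state value is an arbitrary choice that makes the random sample repeatable:

```python
import pandas as pd

data_series = pd.Series([220, 350, 580, 310, 610, 250], name="Total")

# Without parentheses you get the method object itself, not a result.
print(callable(data_series.max))  # True
print(data_series.max())          # 610

# mean() gives the same value as sum() divided by len().
manual_mean = data_series.sum() / len(data_series)
print(round(data_series.mean(), 2) == round(manual_mean, 2))  # True

# head() and tail() return 5 rows by default; pass n to change that.
print(len(data_series.head(2)), len(data_series.tail(3)))  # 2 3

# sample() picks rows at random; a fixed random_state repeats the same pick.
first = data_series.sample(n=2, random_state=0)
second = data_series.sample(n=2, random_state=0)
print(first.equals(second))  # True
```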

The describe() method computes summary statistics. By default it summarizes numerical columns, and it can also summarize categorical data when you ask for it.

In [29]:
# pandas 2.x signature: describe(percentiles=None, include=None, exclude=None)
# (pandas 1.1 to 1.5 also accepted datetime_is_numeric=False)
data_series.describe()
Out[29]:
count      6.000000
mean     386.666667
std      167.888852
min      220.000000
25%      265.000000
50%      330.000000
75%      522.500000
max      610.000000
Name: Total, dtype: float64

Parameters

percentiles: list-like, optional
    The percentiles to include in the output. All should be between 0 and 1. By default, it includes [0.25, 0.5, 0.75], which correspond to the 25th, 50th (median), and 75th percentiles.
include: 'all', list-like of dtypes or None (default)
    Specify the data types to include in the summary. For example, to include all data types, use 'all'. To include only specific types, pass a list like [np.number] for numeric types.
exclude: list-like of dtypes or None (default)
    Specify the data types to exclude from the summary.
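datetime_is_numeric: bool, default False (pandas 1.1 to 1.5 only)
    If True, treats datetime columns as numeric, so the mean, min, and max are included in the summary. This parameter was removed in pandas 2.0, where datetime columns are always summarized as numeric data.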

Returns

Series or DataFrame: Summary statistics of the Series or DataFrame provided.
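
The parameters above can be tried on a small table. This is a sketch under the assumption that numpy is installed alongside pandas (it is a pandas dependency); the variable names are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50],
    "salary": [50000, 60000, 70000, 80000, 90000, 100000],
    "department": ["HR", "IT", "Finance", "HR", "IT", "Finance"],
})

# percentiles: request the 10th and 90th percentiles instead of the defaults.
custom = df.describe(percentiles=[0.1, 0.9])
print(custom.loc[["10%", "90%"]])

# include: keep only numeric columns (this matches the default behavior here).
numeric_only = df.describe(include=[np.number])

# exclude: drop numeric columns, leaving only the text column.
non_numeric = df.describe(exclude=[np.number])
print(non_numeric)
```

Selecting with np.number instead of naming the text dtype avoids depending on how a given pandas version stores string columns.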

Statistical Measures Provided

For numerical columns, describe typically returns the following measures:

count: The number of non-null entries.
mean: The average of the entries.
std: The standard deviation of the entries.
min: The minimum value.
25%: The 25th percentile (first quartile).
50%: The 50th percentile (median or second quartile).
75%: The 75th percentile (third quartile).
max: The maximum value.

For categorical columns, if included, describe returns:

count: The number of non-null entries.
unique: The number of unique values.
top: The most frequent value.
freq: The frequency of the most frequent value.

The describe method is a useful tool for spotting outliers, such as an age of 1 or 1500 in your data, because implausible values show up immediately in the min and max rows. It provides essential statistics that help you understand the distribution and central tendency of your dataset, which makes it an excellent first step in data analysis.

In [30]:
data = {
    'age': [25, 30, 35, 40, 45, 50],
    'salary': [50000, 60000, 70000, 80000, 90000, 100000],
    'department': ['HR', 'IT', 'Finance', 'HR', 'IT', 'Finance']
}

df = pd.DataFrame(data)

# Get the summary statistics for numerical columns
summary_numerical = df.describe()
print("Numerical summary:")
print(summary_numerical)

# Get the summary statistics for all columns, including categorical
summary_all = df.describe(include='all')
print("\nSummary including categorical data:")
print(summary_all)
Numerical summary:
             age         salary
count   6.000000       6.000000
mean   37.500000   75000.000000
std     9.354143   18708.286934
min    25.000000   50000.000000
25%    31.250000   62500.000000
50%    37.500000   75000.000000
75%    43.750000   87500.000000
max    50.000000  100000.000000

Summary including categorical data:
              age         salary department
count    6.000000       6.000000          6
unique        NaN            NaN          3
top           NaN            NaN         HR
freq          NaN            NaN          2
mean    37.500000   75000.000000        NaN
std      9.354143   18708.286934        NaN
min     25.000000   50000.000000        NaN
25%     31.250000   62500.000000        NaN
50%     37.500000   75000.000000        NaN
75%     43.750000   87500.000000        NaN
max     50.000000  100000.000000        NaN
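
To make the outlier idea concrete, the sketch below plants the values 1 and 1500 in a hypothetical age column and uses the describe() output to flag them. The 1.5 × IQR fences are a common rule of thumb, not something this notebook prescribes:

```python
import pandas as pd

# Hypothetical ages; 1 and 1500 are deliberately implausible entries.
ages = pd.Series([25, 30, 1, 40, 1500, 50, 35], name="age")

stats = ages.describe()
print(stats["min"], stats["max"])  # 1.0 1500.0

# Build fences from the quartiles reported by describe().
iqr = stats["75%"] - stats["25%"]
lower = stats["25%"] - 1.5 * iqr
upper = stats["75%"] + 1.5 * iqr

# Keep only the values that fall outside the fences.
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers.tolist())  # [1, 1500]
```

Flagged values still need a manual check, since a value outside the fences can be a genuine observation and not a data entry error.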