Basic Statistics

Describing Data

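Just like in Excel, there are quick ways to pull basic (and even some more complex) statistics in Python. The pandas library has one shortcut method that, for each numeric column, returns:

  • Count of non-missing observations
  • Mean
  • Standard deviation
  • Min and max
  • 25th, 50th, and 75th percentiles (the 50th percentile is the median)

The mode is not part of the numeric summary. For text columns, the same method instead reports the count, the number of unique values, the most frequent value (top), and how often it occurs (freq).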

To use it, call the .describe() method on a DataFrame:

# Combined syntax:
df.describe()

To obtain the high-level statistics for every numeric column in the data set:

import pandas as pd
df = pd.read_csv('filename.csv')
df.describe()
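
As a rough illustration of what the call returns, here is a minimal sketch that swaps the CSV file for a small in-memory DataFrame. The column names and values are made up for this example and are not from any real data set.

```python
import pandas as pd

# Hypothetical sample data standing in for 'filename.csv'
df = pd.DataFrame({
    "salary": [50000, 62000, 58000, 75000],
    "age": [25, 32, 41, 38],
    "department": ["Sales", "IT", "IT", "HR"],
})

# By default, only the numeric columns (salary, age) are summarized
summary = df.describe()
print(summary)

# include="all" also summarizes text columns (unique, top, freq)
full_summary = df.describe(include="all")
print(full_summary)
```

The rows of the default summary are count, mean, std, min, 25%, 50%, 75%, and max, so the median can be read from the 50% row.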

To obtain the high-level statistics of one column in the data set:

import pandas as pd
df = pd.read_csv('filename.csv')
df.column_name.describe()
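
Dot notation only works when the column name is a valid Python identifier. For names that contain spaces, bracket notation does the same job. The sketch below assumes a hypothetical column called "Pay Rate" with made-up values.

```python
import pandas as pd

# Hypothetical data; "Pay Rate" contains a space, so df.Pay Rate is not valid Python
df = pd.DataFrame({"Pay Rate": [20.0, 35.5, 28.0]})

# Bracket notation selects the column by its exact name
pay_summary = df["Pay Rate"].describe()
print(pay_summary)
```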

Alternatively, we could pull the descriptive statistics with the following commands:

Descriptive Statistics    Code
Mean                      df_name.column_name.mean()
Median                    df_name.column_name.median()
Mode                      df_name.column_name.mode()
Maximum Value             df_name.column_name.max()
Minimum Value             df_name.column_name.min()
Standard Deviation        df_name.column_name.std()
Count Instances           df_name.column_name.count()

For example, to obtain the max value of one column in the data set:

import pandas as pd
df = pd.read_csv('filename.csv')
df.column_name.max()

Or, to obtain the count of non-missing values in one column in the data set:

import pandas as pd
df = pd.read_csv('filename.csv')
df.column_name.count()
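
A few details of these commands are easy to miss, so here is a small sketch on made-up data (the "score" column is hypothetical). It shows that missing values are skipped, that .mode() returns a Series because several values can tie, and that .std() computes the sample standard deviation by default.

```python
import pandas as pd

# Hypothetical column with one missing value
df = pd.DataFrame({"score": [3.0, 5.0, 5.0, None, 7.0]})
col = df["score"]

mean = col.mean()      # missing values are skipped
median = col.median()
modes = col.mode()     # a Series, since more than one value can tie
std = col.std()        # sample standard deviation (ddof=1 by default)
count = col.count()    # counts non-missing values only
n_rows = len(col)      # counts every row, including the missing one

print(mean, median, modes.tolist(), std, count, n_rows)
```

Comparing count with len() is a quick way to see how many values are missing in a column.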

Tutorial

To practice what we just learned, let’s use sample data to pull these statistics. Download the following HR data set from Kaggle: https://www.kaggle.com/rhuebner/human-resources-data-set

Part One: Describe Command

Take a look at the following code and analyze the output. What observations can you make from this initial pull of the data?
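
One possible starting point is sketched below. The file name HRDataset_v14.csv and the column names Salary and Department are assumptions about the download and may differ in your copy. The stand-in rows are made up so the sketch runs without the file.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return describe() output for numeric and non-numeric columns."""
    return df.describe(include="all")

# With the downloaded file (the file name is an assumption):
# df = pd.read_csv("HRDataset_v14.csv")

# Made-up stand-in rows; replace with the real data once downloaded
df = pd.DataFrame({
    "Salary": [50000, 60000, 70000, 80000],
    "Department": ["Production", "IT", "Production", "Production"],
})

overview = summarize(df)
print(overview)
```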

Part Two: Alternative Commands

What insights can you gather from looking at the data using these commands?
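
As one example of an insight, comparing the mean with the median hints at skew: a mean well above the median usually means a few large values are pulling the average up. The sketch below uses a made-up Salary column, not values from the HR data set.

```python
import pandas as pd

# Made-up salaries with one unusually high value
salaries = pd.Series([50000, 52000, 55000, 58000, 150000], name="Salary")

mean_salary = salaries.mean()
median_salary = salaries.median()
gap = mean_salary - median_salary

# A large positive gap suggests a right-skewed distribution
print(mean_salary, median_salary, gap)
```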

Mariah Norell
Data Scientist & Lecturer

My research interests include pay equity, diversity and inclusion, and women in leadership.