pydit.wrangling.blanks.check_blanks

pydit.wrangling.blanks.check_blanks(obj, columns=None, include_zeroes=False, include_nullstrings_and_spaces=False, totals_only=True, silent=False)

By default, returns a summary dictionary with column names as keys and the count of blanks as values, for the selected columns (or all columns if no column list is provided).

If “totals_only” is False, it instead returns detailed information about the blanks:

the original/copied dataframe with:

  1. one boolean column per input column, True when that record has a blank in that column

  2. a summary boolean column, True if any of the previous columns is True
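
The two output shapes described above can be approximated in plain pandas. The following is only a sketch and not the library's implementation: it treats just NaN/None as blank, and the "_blanks" suffix and "has_blanks" column name are taken from the example output further down this page.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, None], "B": [None, 2]})

# Summary form: one count of blanks per column (what totals_only=True reports).
totals = {col: int(df[col].isna().sum()) for col in df.columns}

# Detailed form: a copy of the input plus one boolean flag column per input
# column, and a summary flag that is True when any per-column flag is True.
detailed = df.copy()
flag_cols = []
for col in df.columns:
    flag_name = col + "_blanks"  # naming assumed from the Examples section
    detailed[flag_name] = df[col].isna()
    flag_cols.append(flag_name)
detailed["has_blanks"] = detailed[flag_cols].any(axis=1)

print(totals)
print(detailed)
```

Working on a copy keeps the caller's dataframe untouched, which is one reading of "original/copied" above.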

Check out https://github.com/ResidentMario/missingno library for a nice visualization (seems to come with Anaconda)

Parameters:
  • obj (DataFrame or Series) – The dataframe or series to check for blanks

  • columns (list, optional, default None) – The columns to check for blanks. If None, all columns are checked.

  • include_zeroes (bool, optional, default False) – If True, zeroes are also counted as blanks

  • include_nullstrings_and_spaces (bool, optional, default False) – If True, null (empty) strings and strings containing only spaces are also counted as blanks

  • totals_only (bool, optional, default True) – If True, only the total counts are returned; if False, the detailed dataframe is returned

  • silent (bool, optional, default False) – If True, the logging level is set to critical, i.e. no info messages are shown
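
To make the effect of the two "include" flags concrete, here is a minimal pandas sketch of how such a blank test could be composed for a single column. The helper name blank_mask is hypothetical and not part of pydit; the actual implementation may differ, for example in how it handles tabs or other whitespace.

```python
import pandas as pd


def blank_mask(series, include_zeroes=False, include_nullstrings_and_spaces=False):
    """Hypothetical helper: boolean mask of the values treated as blank."""
    # NaN / None / NaT always count as blank.
    mask = series.isna()
    if include_zeroes:
        # Optionally treat values equal to zero as blank too.
        mask = mask | (series == 0)
    if include_nullstrings_and_spaces:
        # Optionally treat '' and whitespace-only strings as blank.
        empty_text = series.map(lambda v: isinstance(v, str) and v.strip() == "")
        mask = mask | empty_text
    return mask


zeros = pd.Series([0, 2, 0])
text = pd.Series(["hello", "", "   ", "world", None])

print(int(blank_mask(zeros).sum()))
print(int(blank_mask(zeros, include_zeroes=True).sum()))
print(int(blank_mask(text, include_nullstrings_and_spaces=True).sum()))
```

Each flag only widens the definition of "blank"; missing values are counted regardless of the flags.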

Returns:

A summary dictionary with the count of blanks in each column (when totals_only is True), or a dataframe with the per-column blank flags and a summary flag (when totals_only is False).

Return type:

dict or pandas.DataFrame

See also

profile_dataframe, includes

Examples

Basic usage with a DataFrame containing NaN values:

>>> import pandas as pd
>>> from pydit.wrangling.blanks import check_blanks
>>> df = pd.DataFrame({
...     'A': [1, 2, None, 4],
...     'B': ['x', 'y', None, 'z'],
...     'C': [1.0, 2.0, 3.0, 4.0]
... })
>>> result = check_blanks(df, silent=True)
>>> result['A']
1
>>> result['B']
1
>>> result['C']
0

Test with specific columns:

>>> result = check_blanks(df, columns=['A', 'B'], silent=True)
>>> len(result)
2
>>> 'C' in result
False

Test including zeroes as blanks:

>>> df_zeros = pd.DataFrame({'A': [1, 0, 3], 'B': [0, 2, 0]})
>>> result = check_blanks(df_zeros, include_zeroes=True, silent=True)
>>> result['A']
1
>>> result['B']
2

Test including null strings and spaces:

>>> df_strings = pd.DataFrame({
...     'text': ['hello', '', '   ', 'world', None]
... })
>>> result = check_blanks(df_strings, include_nullstrings_and_spaces=True, silent=True)
>>> result['text']
3

Test with Series input:

>>> series = pd.Series([1, None, 3, None], name='my_series')
>>> result = check_blanks(series, silent=True)
>>> result['my_series']
2

Test with totals_only=False to get detailed DataFrame:

>>> df_small = pd.DataFrame({'A': [1, None], 'B': [None, 2]})
>>> result = check_blanks(df_small, totals_only=False, silent=True)
>>> 'A_blanks' in result.columns
True
>>> 'has_blanks' in result.columns
True
>>> int(result['has_blanks'].sum())
2