pydit.wrangling

Sub-package (./wrangling) containing the core data wrangling functionality.

The modules are also self-standing: you should be able to copy any .py file and import it in your script, with no dependencies on other modules.

There may be some exceptions to this principle in the logging module, but you should be able to create your own logger object and run with it.
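As a minimal sketch of "creating your own logger object", the snippet below uses only the standard library. The function name make_logger and the logger name "wrangling" are illustrative assumptions, not part of pydit.

```python
import logging


def make_logger(name: str = "wrangling", level: int = logging.INFO) -> logging.Logger:
    """Return a console logger; hypothetical helper, not part of pydit."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Attach a handler only once, so repeated calls do not duplicate output.
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = make_logger()
logger.info("Loaded %d rows", 3)
```

A copied module could then use this logger in place of whatever the package's own logging module provides.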

anonymise

Module for anonymising a key/identifier column

blanks

Checks for various types of nulls/blanks in a dataframe and returns counts.
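To illustrate what counting "various types of nulls/blanks" can look like, here is a rough pandas sketch. The function blank_counts and its three categories (null, empty string, whitespace-only string) are assumptions made for this example; the actual module may classify blanks differently.

```python
import pandas as pd


def blank_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Count nulls, empty strings and whitespace-only strings per column.

    Hypothetical helper for illustration; not the pydit implementation.
    """
    out = {}
    for col in df.columns:
        s = df[col]
        empty = s.map(lambda v: isinstance(v, str) and v == "")
        whitespace = s.map(
            lambda v: isinstance(v, str) and v != "" and v.strip() == ""
        )
        out[col] = {
            "null": int(s.isna().sum()),
            "empty": int(empty.sum()),
            "whitespace": int(whitespace.sum()),
        }
    # One row per input column, one column per blank category.
    return pd.DataFrame.from_dict(out, orient="index")


df = pd.DataFrame({"name": ["Ann", "", "  ", None], "age": [30, None, 41, 52]})
counts = blank_counts(df)
```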

calendar_table

Function to create a calendar DataFrame to be used as a lookup table
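A calendar lookup table can be sketched with plain pandas as below. The function name build_calendar and the chosen columns are assumptions for illustration; the module's own function may expose different columns and options.

```python
import pandas as pd


def build_calendar(start: str, end: str) -> pd.DataFrame:
    """Build one row per day between start and end (inclusive).

    Hypothetical helper for illustration; not the pydit implementation.
    """
    dates = pd.date_range(start, end, freq="D")
    return pd.DataFrame(
        {
            "date": dates,
            "year": dates.year,
            "month": dates.month,
            "day_of_week": dates.dayofweek,  # Monday=0 ... Sunday=6
            "is_weekend": dates.dayofweek >= 5,
        }
    )


cal = build_calendar("2024-01-01", "2024-01-07")
```

Joining a transactions table to such a calendar on the date column gives every row its year, month and weekend flag without recomputing them.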

cleanup_dataframe_columns_names

Module for cleaning up column names of a DataFrame

coalesce_dataframe_columns

Function for coalescing columns in a pandas DataFrame.
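Coalescing means taking, for each row, the first non-null value across a list of columns. A minimal pandas sketch follows; the helper name coalesce and the sample column names are assumptions, and the module's own signature may differ.

```python
import pandas as pd


def coalesce(df: pd.DataFrame, columns: list) -> pd.Series:
    """Return the first non-null value per row, scanning columns left to right.

    Hypothetical helper for illustration; not the pydit implementation.
    """
    result = df[columns[0]]
    for col in columns[1:]:
        # combine_first fills nulls in result with values from the next column.
        result = result.combine_first(df[col])
    return result


df = pd.DataFrame(
    {
        "phone_mobile": [None, "555-0101", None],
        "phone_home": ["555-0199", "555-0102", None],
    }
)
df["phone"] = coalesce(df, ["phone_mobile", "phone_home"])
```

Rows where every listed column is null stay null in the result.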

coalesce_dataframe_values

Creates a new column that keeps the top N most frequent values and replaces the rest with "Other".
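The idea can be sketched in a few lines of pandas. The function name keep_top_n and its parameters are assumptions for this example rather than the module's actual API. In this sketch, null values are also replaced by the "Other" label.

```python
import pandas as pd


def keep_top_n(series: pd.Series, n: int, other: str = "Other") -> pd.Series:
    """Keep the n most frequent values and replace everything else.

    Hypothetical helper for illustration; not the pydit implementation.
    """
    top = series.value_counts().nlargest(n).index
    # where() keeps values where the condition holds, else substitutes `other`.
    return series.where(series.isin(top), other)


s = pd.Series(["a", "a", "a", "b", "b", "c", "d"])
result = keep_top_n(s, 2)
```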

collapse_dataframe_levels

Implementation of the collapse_levels function.

counts

Module that implements a few useful count-related functions. Takes inspiration from the usual counta and countif functions in Excel.

date_time_calculations

Module with functions for date and time calculations.

duplicates

Module for checking for duplicates in a dataframe.

file_utils

File utilities for saving and loading files

fillna

Improving on fillna() with options for various data types and opinionated defaults.

fuzzy_matching

Module with utility functions for fuzzy matching

groupby_text_concatenate

Groups by key and concatenates a text column into a single string per group.
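A rough pandas equivalent is shown below. The helper name concat_text, the separator and the sample columns are assumptions for illustration only.

```python
import pandas as pd


def concat_text(df: pd.DataFrame, by: str, text_col: str, sep: str = "; ") -> pd.DataFrame:
    """Join the text values of each group into one string.

    Hypothetical helper for illustration; not the pydit implementation.
    """
    return (
        df.groupby(by)[text_col]
        # Drop nulls and cast to str so join() never sees a non-string.
        .agg(lambda s: sep.join(s.dropna().astype(str)))
        .reset_index()
    )


df = pd.DataFrame(
    {"ticket": [1, 1, 2], "comment": ["opened", "escalated", "closed"]}
)
result = concat_text(df, "ticket", "comment")
```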

keyword_search_batch

Functions to sweep a dataframe for keywords and return a matrix of matches.

lookup_values(df, key, df_ref, key_ref, ...)

Looks up values from a reference dataframe and returns values from a column. If the key is a list, it will return a list of values.

map_common_values

Module to map/add various values like 1, 2, 3 to "High", "Medium", "Low".

merge

Module to merge dataframes with prefixes or suffixes for all fields, not just those that have collisions.
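For context, pandas' own DataFrame.merge applies its suffixes only to overlapping column names. One way to label every non-key field is to rename the columns before merging, as in this sketch. The function merge_with_prefixes and its parameters are assumptions for illustration, not the module's actual signature.

```python
import pandas as pd


def merge_with_prefixes(left, right, on, left_prefix="l_", right_prefix="r_", how="inner"):
    """Merge two dataframes, prefixing every non-key column.

    Hypothetical helper for illustration; not the pydit implementation.
    """
    left_renamed = left.rename(
        columns={c: f"{left_prefix}{c}" for c in left.columns if c != on}
    )
    right_renamed = right.rename(
        columns={c: f"{right_prefix}{c}" for c in right.columns if c != on}
    )
    return left_renamed.merge(right_renamed, on=on, how=how)


orders = pd.DataFrame({"id": [1, 2], "amount": [10.0, 20.0], "status": ["open", "paid"]})
invoices = pd.DataFrame({"id": [1, 2], "amount": [10.0, 25.0], "due": ["2024-01-31", "2024-02-29"]})
merged = merge_with_prefixes(orders, invoices, on="id", left_prefix="ord_", right_prefix="inv_")
```

With every field labelled by origin, it is immediately clear which source each column of the merged table came from, even for columns that did not collide.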

referential_integrity_check

Module to perform referential integrity checks on two dataframes.
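A referential integrity check between two dataframes boils down to comparing their key sets. The sketch below shows one possible shape; the function name, its return format and the sample data are assumptions made for this example.

```python
import pandas as pd


def check_referential_integrity(child, child_key, parent, parent_key):
    """Compare the key values of two dataframes.

    Hypothetical helper for illustration; not the pydit implementation.
    """
    child_vals = set(child[child_key].dropna().tolist())
    parent_vals = set(parent[parent_key].dropna().tolist())
    return {
        "in_both": child_vals & parent_vals,
        "orphans_in_child": child_vals - parent_vals,  # no matching parent row
        "unused_in_parent": parent_vals - child_vals,  # never referenced
    }


orders = pd.DataFrame({"customer_id": [1, 2, 4]})
customers = pd.DataFrame({"id": [1, 2, 3]})
report = check_referential_integrity(orders, "customer_id", customers, "id")
```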

sequence

Module to check for numerical sequence of DataFrame column or Series
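A typical use is finding gaps in a numbered series such as invoice numbers. The following sketch assumes integer values and a fixed step; the function name missing_in_sequence is hypothetical and the module may cover more cases.

```python
import pandas as pd


def missing_in_sequence(series: pd.Series, step: int = 1) -> list:
    """Return the values missing between the minimum and maximum of the series.

    Hypothetical helper for illustration; not the pydit implementation.
    """
    values = series.dropna().astype(int)
    if values.empty:
        return []
    expected = range(int(values.min()), int(values.max()) + 1, step)
    return sorted(set(expected) - set(values.tolist()))


invoice_numbers = pd.Series([1, 2, 4, 5, 8])
gaps = missing_in_sequence(invoice_numbers)
```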

split_transactions

Utility functions to do analysis/detection of split purchases/expenses

truncate_datetime

Implementation of the truncate_datetime family of functions.

various

Miscellaneous utility functions that are not used directly in the core functions.