pydit.wrangling.coalesce_dataframe_values.coalesce_values

pydit.wrangling.coalesce_dataframe_values.coalesce_values(df_in, cols, top_n_values_to_keep=10, translation_dict=None, other_label='OTHER', nan_label='N/A', case_insensitive=True, show_nan=True)

Creates a new column that keeps the top N most frequent values and replaces all remaining values with other_label ("OTHER" by default).

Optionally takes a translation dictionary, which is applied as a manual mapping before the top N limit.
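
The two steps described above (manual translation first, then the top N cut-off) can be illustrated with plain pandas. This is a minimal sketch of the general technique, not the pydit implementation: the helper name coalesce_top_n_sketch, its parameters, and the output column name fruit_coalesced are all hypothetical.

```python
import pandas as pd


def coalesce_top_n_sketch(series, top_n=10, translation=None, other_label="OTHER"):
    """Illustrative only: keep the top_n most frequent values, relabel the rest."""
    if translation:
        # Manual mapping first; values missing from the dict are left unchanged.
        series = series.map(lambda v: translation.get(v, v))
    # value_counts() sorts by frequency (descending) and skips NaN by default.
    top_values = series.value_counts().index[:top_n]
    # Keep values that are in the top set; everything else becomes other_label.
    # Note: in this sketch NaN is not in the top set, so it is also relabeled.
    return series.where(series.isin(top_values), other_label)


df = pd.DataFrame(
    {"fruit": ["apple", "apple", "apple", "pear", "pear", "Apple Inc", "kiwi", "fig"]}
)
df["fruit_coalesced"] = coalesce_top_n_sketch(
    df["fruit"], top_n=2, translation={"Apple Inc": "apple"}
)
print(df)
```

Here "Apple Inc" is folded into "apple" before counting, so "apple" and "pear" are the two values kept and "kiwi" and "fig" are relabeled "OTHER".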

Parameters:
  • df_in (pandas.DataFrame) – The dataframe to clean up

  • cols (list) – The column names to coalesce

  • top_n_values_to_keep (int, optional, default 10) – The number of top values to keep.

  • translation_dict (dict, optional, default None) – A dictionary to use for manual translation/coalescing.

  • other_label (str or int, optional, default "OTHER") – The label to use for the other values.

  • nan_label (str, optional, default "N/A") – The label to use for NaN values when they are kept as a category.

  • case_insensitive (bool, optional, default True) – Whether to do a case insensitive comparison.

  • dropna (bool, optional, default True) – Whether to ignore np.nan values. If False, NaN values are treated as their own category, labeled "N/A".
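
The case-insensitive comparison and the NaN category described in the parameters above can be sketched in plain pandas as follows. This is an assumption-laden illustration, not pydit's code: the helper normalise_sketch and its keep_nan_as_category flag are hypothetical, and upper-casing is only one possible way to compare values case-insensitively.

```python
import numpy as np
import pandas as pd


def normalise_sketch(series, case_insensitive=True, keep_nan_as_category=True,
                     nan_label="N/A"):
    """Illustrative only: fold case and optionally turn NaN into a labeled category."""
    out = series.copy()
    if case_insensitive:
        # Upper-case string entries only, so NaN and numbers pass through untouched.
        out = out.map(lambda v: v.upper() if isinstance(v, str) else v)
    if keep_nan_as_category:
        # Give missing values an explicit label so they are counted like any other value.
        out = out.fillna(nan_label)
    return out


s = pd.Series(["Red", "red", "RED", np.nan, "blue"])
counts = normalise_sketch(s).value_counts()
print(counts)
```

With case folding, the three spellings of "red" count as one value, and the missing entry shows up under its own label instead of being dropped from the counts.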

Returns:

A pandas DataFrame with a new column containing the coalesced values.

Return type:

pandas.DataFrame