pydit.wrangling.coalesce_dataframe_values.coalesce_values
- pydit.wrangling.coalesce_dataframe_values.coalesce_values(df_in, cols, top_n_values_to_keep=10, translation_dict=None, other_label='OTHER', nan_label='N/A', case_insensitive=True, show_nan=True)
Creates a new column that keeps the top N most frequent values and replaces all remaining values with other_label ("OTHER" by default).
It can also take a translation dictionary, which is applied as a manual mapping before the top N limit.
- Parameters:
df_in (pandas.DataFrame) – The dataframe to clean up
cols (list) – The column names to coalesce
top_n_values_to_keep (int, optional, default 10) – The number of top values to keep.
translation_dict (dict, optional, default None) – A dictionary to use for manual translation/coalescing.
other_label (str or int, optional, default "OTHER") – The label to use for the other values.
case_insensitive (bool, optional, default True) – Whether to do a case insensitive comparison.
dropna (bool, optional, default True) – Whether to ignore np.nan values. If False, NaN values are treated as a category of their own, labelled "N/A".
- Returns:
A pandas DataFrame with a new column containing the coalesced values.
- Return type:
pandas.DataFrame
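The top N coalescing described above can be illustrated with plain pandas. This is a minimal sketch of the general technique, not pydit's implementation: the helper name coalesce_top_n, its arguments, and the sample data are hypothetical, and details such as how NaN and ties are handled may differ from what coalesce_values does.

```python
import pandas as pd


def coalesce_top_n(series, top_n=2, translation=None,
                   other_label="OTHER", nan_label="N/A"):
    """Hypothetical helper (not part of pydit): keep the top_n most
    frequent values of a Series and replace the rest with other_label."""
    # Case-insensitive comparison: upper-case strings, leave other values alone.
    s = series.map(lambda v: v.upper() if isinstance(v, str) else v)
    if translation:
        # Manual mapping applied BEFORE counting, so that synonyms are
        # merged into one value. Keys are assumed to be upper-case.
        s = s.map(lambda v: translation.get(v, v))
    # Treat missing values as a category of their own.
    s = s.fillna(nan_label)
    # value_counts() sorts by frequency, so head() gives the top N values.
    top_values = s.value_counts().head(top_n).index
    # Keep values that are in the top N, replace everything else.
    return s.where(s.isin(top_values), other_label)


df = pd.DataFrame({
    "country": ["uk", "UK", "United Kingdom", "France",
                "france", "Spain", None, "Italy"],
})
df["country_coalesced"] = coalesce_top_n(
    df["country"],
    top_n=2,
    translation={"UK": "UNITED KINGDOM"},
)
print(df)
```

In this sketch the translation step merges "uk" and "UK" into "UNITED KINGDOM" before frequencies are counted, so the merged value ranks first. The missing value is labelled "N/A" and then competes for a top N slot like any other category; here it is not frequent enough and ends up under "OTHER".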