pydit.wrangling.merge.merge_outer_and_split

pydit.wrangling.merge.merge_outer_and_split(dffact, dfdim, on=None, left_on=None, right_on=None, suffixes=(None, None))[source]

Merge two dataframes, and keep the joinable 1:1 or 1:n the nan or unmatched are returned in separate dataframes or Excel sheets. Under the bonnet it will - filter nulls from the key columns (and store in a separate dataframe to return those exceptions - on the clean version will do an outer merge with indicator=True - will split the results into the 3 dataframes (both, left, right) - will write the results to an Excel file if specified

It will also check for duplicates in the key columns and exit if found in the right dataframe (presumed to be a dimension/ file ) or warn if they are in the left dataframe (presumably the transaction/fact file).

Parameters:
  • dffact (pandas.DataFrame) – The left dataframe

  • dfdim (pandas.DataFrame) – The right dataframe

  • left_on (list) – The list of key columns to join on in the left dataframe

  • right_on (list) – The list of key columns to join on in the right dataframe

  • suffixes (tuple, optional) – The suffixes to use for the left and right dataframes, by default (None,None)

Returns:

A tuple of dataframes, (both, left, right, left_na, right_na) both : the rows that matched left : the rows that matched but had nulls in the right dataframe right : the rows that matched but had nulls in the left dataframe left_na : the rows that have nulls in the specified key columns in the left dataframe right_na : the rows that have nulls in the specified key columns in the right dataframe

Return type:

tuple

Raises:

ValueError – If the right dataframe has duplicates in the key columns