pydit.statistics.benford.benford_list_anomalies

pydit.statistics.benford.benford_list_anomalies(df, column_name, top_n_digits=3, first_n_digits=1, return_anomalies_only=False)

Returns the expected and actual Benford’s Law frequencies for a column of values.

Also adds an extra “flag_bf_anomaly” boolean column that is True for records whose first n digits match one of the top N anomalies, that is, the digit groups with the largest percent variation between actual and expected frequencies.

Note that blanks and zeroes are not deemed anomalies; they are simply ignored. Analyse those separately, as they are likely to be data quality anomalies. Also note that the function technically ranks the differences and flags the top N: even if the differences are insignificant or zero, “flag_bf_anomaly” will still be True for the top N “anomalies”. This is possibly something to improve on in the future.
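The ranking behaviour described above can be illustrated with a minimal standard-library sketch. This is not pydit's implementation: the helper names leading_digits and flag_top_anomalies are hypothetical, and the sketch assumes "variation" means the absolute difference between actual and expected frequency, which the library may compute differently.

```python
import math
from collections import Counter


def leading_digits(value, n=1):
    """Return the first n significant digits as an int, or None to skip the value."""
    try:
        x = abs(float(value))
    except (TypeError, ValueError):
        return None  # blanks, None and non-numeric values are ignored
    if x == 0 or math.isnan(x) or math.isinf(x):
        return None  # zeroes are ignored too, not flagged
    # Scientific notation puts the significant digits first, e.g. "4.250000000000000".
    mantissa = f"{x:.15e}".split("e")[0]
    return int(mantissa.replace(".", "")[:n])


def flag_top_anomalies(values, top_n=3, n=1):
    """Flag values whose leading digits are among the top_n largest deviations."""
    digits = [leading_digits(v, n) for v in values]
    counts = Counter(d for d in digits if d is not None)
    total = sum(counts.values())
    if total == 0:
        return [False] * len(digits)
    # Assumption: deviation = |actual frequency - Benford expected frequency|.
    diff = {
        d: abs(counts.get(d, 0) / total - math.log10(1 + 1 / d))
        for d in range(10 ** (n - 1), 10 ** n)
    }
    worst = set(sorted(diff, key=diff.get, reverse=True)[:top_n])
    return [d in worst for d in digits]


values = [9, 9, 9, 9, 9, 1, 2, "", 0, None]
flags = flag_top_anomalies(values, top_n=1)
```

Because the selection is purely rank-based, the sketch always picks top_n digit groups, even when every deviation is negligible. That mirrors the caveat in the note above.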

Parameters:
  • df (DataFrame or Series) – The data to be analyzed.

  • column_name (str) – The column name to be analyzed.

  • top_n_digits (int, optional, default: 3) – Number of top-ranked digit groups to treat as anomalies, ranked by the difference between actual and expected frequency.

  • first_n_digits (int, optional, default: 1) – The number of leading digits to consider. Typically the first 1 or 2 digits are enough.

  • return_anomalies_only (boolean, optional, default: False) – True to return just the anomalies. False to return the full original dataframe.
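To make the effect of first_n_digits concrete, the sketch below computes the standard Benford's Law expected frequency, log10(1 + 1/d), for every possible leading-digit group d. The function name benford_expected is hypothetical and is not part of pydit; it only illustrates how the number of groups grows from 9 (one digit) to 90 (two digits).

```python
import math


def benford_expected(first_n_digits=1):
    """Expected Benford frequency for each leading-digit group of the given length."""
    lo, hi = 10 ** (first_n_digits - 1), 10 ** first_n_digits
    return {d: math.log10(1 + 1 / d) for d in range(lo, hi)}


one_digit = benford_expected(1)   # keys 1..9
two_digits = benford_expected(2)  # keys 10..99

print(round(one_digit[1], 3))  # 0.301
print(round(one_digit[9], 3))  # 0.046
```

With more leading digits each group's expected share becomes small, so larger samples are needed before the ranked differences are meaningful.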

Returns:

A copy of the dataframe with the expected and actual Benford’s Law frequencies, plus an extra “flag_bf_anomaly” boolean column that is True for those records where the first n digits match those identified as top N anomalies.

Return type:

pandas.DataFrame