pydit.wrangling.fuzzy_matching.clean_string

pydit.wrangling.fuzzy_matching.clean_string(t=None, keep_dot=False, keep_dash=False, keep_apostrophe=False, keep_ampersand=False, keep_spaces=True, space_to_underscore=True, to_case='lower')

Sanitises a string.

Cleans the string by applying the following transformations:

- Normalises unicode to remove accents and other symbols
- Keeps only [a-zA-Z0-9]
- Optionally retains the dot
- Converts spaces to underscores
- Removes multiple spaces and strips the ends
- Optionally converts to lowercase

This is a naive, slow implementation, useful for sanitising things like filenames, column headers or small datasets. To clean up large datasets, look into pandas/numpy tools and vectorised functions instead.
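The transformations listed above can be illustrated with a minimal standard-library sketch. This is not pydit's actual implementation: the function name `clean_string_sketch` is hypothetical, it covers only a subset of the options (`keep_dot`, `space_to_underscore`, `to_case`), and the use of NFKD normalisation followed by an ASCII filter is an assumption about how accents might be removed.

```python
import re
import unicodedata


def clean_string_sketch(text, keep_dot=False, space_to_underscore=True, to_case="lower"):
    """Hypothetical stand-in for illustration; NOT pydit's clean_string."""
    # Assumption: decompose accented characters (e.g. "é" -> "e" + combining
    # accent), then drop everything that is not plain ASCII.
    decomposed = unicodedata.normalize("NFKD", str(text))
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")

    # Keep letters, digits and spaces; optionally keep the dot as well.
    allowed = r"a-zA-Z0-9 " + (r"\." if keep_dot else "")
    kept = re.sub(f"[^{allowed}]", "", ascii_only)

    # Collapse runs of spaces into one and strip both ends.
    collapsed = re.sub(r" +", " ", kept).strip()

    if space_to_underscore:
        collapsed = collapsed.replace(" ", "_")
    if to_case == "lower":
        collapsed = collapsed.lower()
    elif to_case == "upper":
        collapsed = collapsed.upper()
    return collapsed


print(clean_string_sketch("  Café   Crème (2023).CSV "))                 # cafe_creme_2023csv
print(clean_string_sketch("  Café   Crème (2023).CSV ", keep_dot=True))  # cafe_creme_2023.csv
```

Each call makes several passes over a single Python string, which is fine for a handful of filenames or column headers but is the kind of per-item work that becomes slow on large datasets.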

Parameters:

t (str) – String to clean
keep_dot (bool, optional, default False) – Whether to keep the dot in the string
keep_dash (bool, optional, default False) – Whether to keep the dash in the string (useful for names)
keep_apostrophe (bool, optional, default False) – Whether to keep the apostrophe in the string (useful for names)
keep_ampersand (True, False, "expand", default False) – Whether to keep the &, remove it, or expand it to “and”
keep_spaces (bool, optional, default True) – Whether to keep the spaces in the string. If True, double spaces are still removed and, by default, spaces are replaced with underscores.
space_to_underscore (bool, optional, default True) – Whether to replace spaces with underscores
to_case (str, optional, default "lower", choices=["lower", "upper"]) – Case to convert the string to

Returns:

Cleaned string

Return type:

str
