pydit.wrangling.keyword_search_batch.keyword_search
- pydit.wrangling.keyword_search_batch.keyword_search(obj, keywords, columns=None, return_data='full', regexp=True, case_sensitive=False, labels=None, key_column=None)
Searches a dataframe or series for keywords and returns a matrix of matches.
Creates one boolean column per keyword, plus a combined column that is True if any of the keyword columns is True. For simplicity, the columns are named sequentially by default, because using the keywords themselves as column names can raise errors on special characters or on duplicated or banned names. If you need meaningful column names, pass them via labels.
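The match matrix described above can be sketched with plain pandas. This is an illustration of the idea, not pydit's actual implementation: the sample data, the column names, and the zero-padded `kw_match_01` naming are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions for illustration.
df = pd.DataFrame(
    {
        "description": ["Payment to ACME Ltd", "cash refund", "Office supplies"],
        "notes": ["urgent", "ACME credit note", "n/a"],
    }
)
keywords = ["acme", r"cash|refund"]

# One boolean column per keyword, named sequentially rather than after the keyword.
hits = pd.DataFrame(index=df.index)
for i, kw in enumerate(keywords, start=1):
    # Test every column for the keyword (case-insensitive regular expression).
    per_column = df.apply(
        lambda col, kw=kw: col.str.contains(kw, case=False, regex=True)
    )
    # A row is a hit if the keyword appears in any searched column.
    hits[f"kw_match_{i:02d}"] = per_column.any(axis=1)

# Combined column: True if any keyword matched in the row.
hits["kw_match_all"] = hits.any(axis=1)

# Roughly what a "full" style result looks like: original data plus hit columns.
result = pd.concat([df, hits], axis=1)
print(result)
```

Naming the columns by position keeps the result valid even when a keyword such as `cash|refund` would make an awkward or duplicated column name.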
- Parameters:
obj (pandas.DataFrame or pandas.Series) – The dataframe or series to search
keywords (list) – The list of regular expressions or string keywords to search for.
columns (list) – The list of columns to search in. If None, all columns are searched.
return_data (str, optional, default="full") – If "full", the full dataframe is returned plus the hit columns. If "target", the target columns and the hits are returned. If "result", only the boolean result columns are returned. If "detail", a dataframe with one hit per row is returned. If you use "full_hits", "target_hits" or "result_hits", only the rows with hits are returned.
regexp (bool, default True) – If True then the keywords are treated as regular expressions, otherwise a simpler string search is performed.
case_sensitive (bool, default False) – If True, the keywords are matched case-sensitively. The default is False because in the most typical case we do NOT care about case. Note: to ignore case for individual keywords only, set case_sensitive=True and add the inline flag (?i) as a prefix to each regular expression that should ignore case, the same way you would in re.findall('(?i)test', s).
labels (list, optional) – The list of labels to use for the columns. If None, the columns are named kw_match_NN. The list must contain exactly one label per keyword. Labels may be repeated, in which case the keywords that share a label are grouped (rolled up) under it.
key_column (str, optional, default=None) – If return_data="detail", this is the column to use as the key for the returned dataframe.
- Returns:
A copy of the dataframe with the new hit columns added, or just the boolean columns for each keyword (depending on return_data), plus a column kw_match_all that is True if any of the other hit columns is True.
- Return type:
DataFrame
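The notes on case_sensitive and labels can be illustrated with a small pandas sketch. It is not pydit's code: the sample series and label names are hypothetical, and combining repeated labels with a logical OR is an assumption about what "rolled up" means.

```python
import re

import pandas as pd

# Hypothetical sample data for illustration.
text = pd.Series(["Invoice 42", "receipt attached", "TEST", "test"])

# Matching is case-sensitive overall; the (?i) prefix turns case
# sensitivity off for the individual patterns that carry it.
patterns = ["(?i)invoice", "(?i)receipt", "TEST"]
# One label per pattern; "document" is repeated on purpose.
labels = ["document", "document", "upper_test"]

matches = pd.DataFrame(index=text.index)
for pattern, label in zip(patterns, labels):
    hit = text.str.contains(pattern, case=True, regex=True)
    if label in matches:
        # Assumed roll-up rule: patterns sharing a label are OR-ed together.
        matches[label] = matches[label] | hit
    else:
        matches[label] = hit

print(matches)

# The same inline flag works with the standard library directly.
print(re.findall("(?i)test", "Test TEST test"))
```

Here the lowercase "test" row does not match the case-sensitive pattern `TEST`, while "Invoice 42" still matches `(?i)invoice` despite the capital letter.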