aif360.datasets.StandardDataset
- class aif360.datasets.StandardDataset(df, label_name, favorable_classes, protected_attribute_names, privileged_classes, instance_weights_name='', scores_name='', categorical_features=[], features_to_keep=[], features_to_drop=[], na_values=[], custom_preprocessing=None, metadata=None)
Base class for every BinaryLabelDataset provided out of the box by aif360.
It is not strictly necessary to inherit this class when adding custom datasets but it may be useful.
This class is very loosely based on code from https://github.com/algofairness/fairness-comparison.
Subclasses of StandardDataset should perform the following before calling super().__init__:
- Load the dataframe from a raw file.
Then, this class will go through a standard preprocessing routine which:
- (optional) Performs some dataset-specific preprocessing (e.g. renaming columns/values, handling missing data).
- Drops unrequested columns (see features_to_keep and features_to_drop for details).
- Drops rows with NA values.
- Creates a one-hot encoding of the categorical variables.
- Maps protected attributes to binary privileged/unprivileged values (1/0).
- Maps labels to binary favorable/unfavorable labels (1/0).
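The routine above can be approximated in plain pandas. The following is a minimal sketch, not the aif360 implementation: the helper name `sketch_standard_preprocessing`, the toy column names and the sample data are all hypothetical, and the sketch always remaps protected attributes and labels, whereas the library only does so when the values are not already numerical.

```python
import pandas as pd


def sketch_standard_preprocessing(df, label_name, favorable_classes,
                                  protected_attribute_names, privileged_classes,
                                  categorical_features=(), features_to_drop=(),
                                  custom_preprocessing=None):
    """Hypothetical helper that mirrors the documented steps; not aif360 code."""
    # (optional) dataset-specific preprocessing supplied by the caller
    if custom_preprocessing is not None:
        df = custom_preprocessing(df)

    # Drop unrequested columns, then rows that contain NA values
    df = df.drop(columns=list(features_to_drop)).dropna()

    # One-hot encode the categorical variables
    if categorical_features:
        df = pd.get_dummies(df, columns=list(categorical_features))

    # Map each protected attribute to 1 (privileged) / 0 (unprivileged);
    # each entry is either a list of values or a boolean function
    for name, priv in zip(protected_attribute_names, privileged_classes):
        is_priv = df[name].apply(priv) if callable(priv) else df[name].isin(priv)
        df[name] = is_priv.astype(float)

    # Map labels to 1 (favorable) / 0 (unfavorable) in the same way
    fav = favorable_classes
    is_fav = df[label_name].apply(fav) if callable(fav) else df[label_name].isin(fav)
    df[label_name] = is_fav.astype(float)
    return df


# Toy data (assumed for illustration only)
raw = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "age": [25, 40, None, 31],
    "job": ["clerk", "nurse", "clerk", "nurse"],
    "id": [1, 2, 3, 4],
    "outcome": ["good", "bad", "good", "good"],
})

processed = sketch_standard_preprocessing(
    raw,
    label_name="outcome",
    favorable_classes=["good"],
    protected_attribute_names=["sex"],
    privileged_classes=[["male"]],
    categorical_features=["job"],
    features_to_drop=["id"],
)
```

In this toy run the row with a missing age is dropped, `job` is expanded into indicator columns, and `sex` and `outcome` become 1/0 columns.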
- Parameters:
df (pandas.DataFrame) – DataFrame on which to perform standard processing.
label_name – Name of the label column in df.
favorable_classes (list or function) – Label values which are considered favorable, or a boolean function which returns True if favorable. All others are unfavorable. Label values are mapped to 1 (favorable) and 0 (unfavorable) if they are not already binary and numerical.
protected_attribute_names (list) – List of names corresponding to protected attribute columns in df.
privileged_classes (list(list or function)) – Each element is a list of values which are considered privileged, or a boolean function which returns True if privileged, for the corresponding column in protected_attribute_names. All others are unprivileged. Values are mapped to 1 (privileged) and 0 (unprivileged) if they are not already numerical.
instance_weights_name (optional) – Name of the instance weights column in df.
categorical_features (optional, list) – List of column names in the DataFrame which are to be expanded into one-hot vectors.
features_to_keep (optional, list) – Column names to keep. All others are dropped except those present in protected_attribute_names, categorical_features, label_name or instance_weights_name. Defaults to all columns if not provided.
features_to_drop (optional, list) – Column names to drop. Note: this overrides features_to_keep.
na_values (optional) – Additional strings to recognize as NA. See pandas.read_csv() for details.
custom_preprocessing (function) – A function object which acts on and returns a DataFrame (f: DataFrame -> DataFrame). If None, no extra preprocessing is applied.
metadata (optional) – Additional metadata to append.
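The interaction between features_to_keep and features_to_drop is easiest to see as set arithmetic. The sketch below is one reading of the parameter descriptions, not the library's code: `resolve_columns` and the column names are hypothetical, and it assumes features_to_drop may remove any column, including ones that features_to_keep would otherwise retain.

```python
def resolve_columns(all_columns, label_name, protected_attribute_names,
                    categorical_features=(), instance_weights_name="",
                    features_to_keep=(), features_to_drop=()):
    """Hypothetical helper: which columns survive, per the parameter docs."""
    # An empty features_to_keep defaults to all columns
    keep = set(features_to_keep) if features_to_keep else set(all_columns)

    # These columns are retained even if absent from features_to_keep
    keep |= set(protected_attribute_names) | set(categorical_features) | {label_name}
    if instance_weights_name:
        keep.add(instance_weights_name)

    # features_to_drop overrides features_to_keep
    keep -= set(features_to_drop)

    # Preserve the original column order
    return [c for c in all_columns if c in keep]


columns = ["sex", "age", "job", "id", "zip", "outcome"]
kept = resolve_columns(
    columns,
    label_name="outcome",
    protected_attribute_names=["sex"],
    categorical_features=["job"],
    features_to_keep=["age", "zip"],
    features_to_drop=["zip"],
)
print(kept)  # ['sex', 'age', 'job', 'outcome']
```

Here `zip` is requested in features_to_keep but still removed because it also appears in features_to_drop, while `sex`, `job` and `outcome` survive without being listed.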
Methods
align_datasets – Align the other dataset features, labels and protected_attributes to this dataset.
convert_to_dataframe – Convert the StructuredDataset to a pandas.DataFrame.
copy – Convenience method to return a copy of this dataset.
export_dataset – Export the dataset and supporting attributes. TODO: The preferred file format is HDF.
import_dataset – Import the dataset and supporting attributes. TODO: The preferred file format is HDF.
split – Split this dataset into multiple partitions.
subset – Subset of the dataset based on position; indexes is an iterable which contains row indexes.
temporarily_ignore – Temporarily add the fields provided to ignore_fields.
validate_dataset – Error checking and type validation.