aif360.datasets.StandardDataset
- class aif360.datasets.StandardDataset(df, label_name, favorable_classes, protected_attribute_names, privileged_classes, instance_weights_name='', scores_name='', categorical_features=[], features_to_keep=[], features_to_drop=[], na_values=[], custom_preprocessing=None, metadata=None)
Base class for every BinaryLabelDataset provided out of the box by aif360.
It is not strictly necessary to inherit from this class when adding custom datasets, but it may be useful.
This class is very loosely based on code from https://github.com/algofairness/fairness-comparison.
Subclasses of StandardDataset should perform the following before calling super().__init__:
Load the dataframe from a raw file.
Then, this class will go through a standard preprocessing routine which:
(optional) Performs some dataset-specific preprocessing (e.g. renaming columns/values, handling missing data).
Drops unrequested columns (see features_to_keep and features_to_drop for details).
Drops rows with NA values.
Creates a one-hot encoding of the categorical variables.
Maps protected attributes to binary privileged/unprivileged values (1/0).
Maps labels to binary favorable/unfavorable labels (1/0).
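The routine above can be approximated in plain pandas. The following is a minimal sketch, not the aif360 implementation: the helper name standard_preprocess, the sample columns, and the simplified column-dropping step are assumptions made for illustration, and the `=` separator for one-hot column names is likewise a choice of this sketch.

```python
import pandas as pd


def standard_preprocess(df, label_name, favorable_classes,
                        protected_attribute_names, privileged_classes,
                        categorical_features=(), features_to_drop=(),
                        custom_preprocessing=None):
    """Hypothetical helper that mirrors the documented steps (not aif360 code)."""
    # (optional) dataset-specific preprocessing
    if custom_preprocessing is not None:
        df = custom_preprocessing(df)
    # Drop unrequested columns (simplified: only an explicit drop list)
    df = df.drop(columns=list(features_to_drop))
    # Drop rows with NA values
    df = df.dropna()
    # One-hot encode the categorical variables
    df = pd.get_dummies(df, columns=list(categorical_features), prefix_sep="=")
    # Map protected attributes to 1 (privileged) / 0 (unprivileged);
    # each entry is either a list of values or a boolean function
    for name, priv in zip(protected_attribute_names, privileged_classes):
        is_priv = df[name].apply(priv) if callable(priv) else df[name].isin(priv)
        df[name] = is_priv.astype(int)
    # Map labels to 1 (favorable) / 0 (unfavorable)
    fav = favorable_classes
    is_fav = df[label_name].apply(fav) if callable(fav) else df[label_name].isin(fav)
    df[label_name] = is_fav.astype(int)
    return df


# Made-up sample data; the last row has a missing value and is dropped.
raw = pd.DataFrame({
    "sex": ["Male", "Female", "Male", None],
    "age": [25, 40, 31, 52],
    "job": ["clerk", "nurse", "nurse", "clerk"],
    "note": ["a", "b", "c", "d"],
    "outcome": ["good", "bad", "good", "good"],
})

clean = standard_preprocess(
    raw,
    label_name="outcome",
    favorable_classes=["good"],
    protected_attribute_names=["sex", "age"],
    privileged_classes=[["Male"], lambda a: a >= 30],
    categorical_features=["job"],
    features_to_drop=["note"],
)
print(clean)
```

The order of the steps matters in this sketch: rows with missing values are removed before encoding, so no one-hot column is created for a value that only appears in a dropped row.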
- Parameters:
df (pandas.DataFrame) – DataFrame on which to perform standard processing.
label_name – Name of the label column in df.
favorable_classes (list or function) – Label values which are considered favorable, or a boolean function which returns True if favorable. All others are unfavorable. Label values are mapped to 1 (favorable) and 0 (unfavorable) if they are not already binary and numerical.
protected_attribute_names (list) – List of names corresponding to protected attribute columns in df.
privileged_classes (list(list or function)) – Each element is a list of values which are considered privileged, or a boolean function which returns True if privileged, for the corresponding column in protected_attribute_names. All others are unprivileged. Values are mapped to 1 (privileged) and 0 (unprivileged) if they are not already numerical.
instance_weights_name (optional) – Name of the instance weights column in df.
categorical_features (optional, list) – List of column names in the DataFrame which are to be expanded into one-hot vectors.
features_to_keep (optional, list) – Column names to keep. All others are dropped except those present in protected_attribute_names, categorical_features, label_name or instance_weights_name. Defaults to all columns if not provided.
features_to_drop (optional, list) – Column names to drop. Note: this overrides features_to_keep.
na_values (optional) – Additional strings to recognize as NA. See pandas.read_csv() for details.
custom_preprocessing (function) – A function object which acts on and returns a DataFrame (f: DataFrame -> DataFrame). If None, no extra preprocessing is applied.
metadata (optional) – Additional metadata to append.
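How features_to_keep and features_to_drop interact is easiest to see on a small example. The following is a sketch of the rule as described above, not the library's code: the function name select_columns and the sample column names are hypothetical, and it assumes the drop list is applied last so that it wins over every other rule.

```python
def select_columns(all_columns, label_name, protected_attribute_names,
                   categorical_features=(), instance_weights_name="",
                   features_to_keep=(), features_to_drop=()):
    """Hypothetical helper: list the surviving columns in their original order."""
    # An empty features_to_keep means "keep every column".
    keep = set(features_to_keep) if features_to_keep else set(all_columns)
    # These columns are retained even when absent from features_to_keep.
    keep |= set(protected_attribute_names) | set(categorical_features) | {label_name}
    if instance_weights_name:
        keep.add(instance_weights_name)
    # Assumption of this sketch: features_to_drop is applied last.
    keep -= set(features_to_drop)
    return [c for c in all_columns if c in keep]


columns = ["age", "sex", "job", "zip", "weight", "outcome"]

kept = select_columns(
    columns,
    label_name="outcome",
    protected_attribute_names=["sex"],
    categorical_features=["job"],
    instance_weights_name="weight",
    features_to_keep=["age", "zip"],
    features_to_drop=["zip"],
)
print(kept)  # ['age', 'sex', 'job', 'weight', 'outcome']
```

Here "zip" is requested in features_to_keep but still removed because it also appears in features_to_drop, while "sex", "job", "weight" and "outcome" survive without being listed.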
Methods
align_datasets – Align the other dataset features, labels and protected_attributes to this dataset.
convert_to_dataframe – Convert the StructuredDataset to a pandas.DataFrame.
copy – Convenience method to return a copy of this dataset.
export_dataset – Export the dataset and supporting attributes. TODO: The preferred file format is HDF.
import_dataset – Import the dataset and supporting attributes. TODO: The preferred file format is HDF.
split – Split this dataset into multiple partitions.
subset – Subset of the dataset based on position. The indexes parameter is an iterable which contains row indexes.
temporarily_ignore – Temporarily add the fields provided to ignore_fields.
validate_dataset – Error checking and type validation.