aif360.datasets.StandardDataset

class aif360.datasets.StandardDataset(df, label_name, favorable_classes, protected_attribute_names, privileged_classes, instance_weights_name='', scores_name='', categorical_features=[], features_to_keep=[], features_to_drop=[], na_values=[], custom_preprocessing=None, metadata=None)

Base class for every BinaryLabelDataset provided out of the box by aif360.

It is not strictly necessary to inherit from this class when adding custom datasets, but it may be useful.

This class is very loosely based on code from https://github.com/algofairness/fairness-comparison.

Subclasses of StandardDataset should perform the following before calling super().__init__:

  1. Load the dataframe from a raw file.

Then, this class will go through a standard preprocessing routine which:

  1. (optional) Performs some dataset-specific preprocessing (e.g. renaming columns/values, handling missing data).

  2. Drops unrequested columns (see features_to_keep and features_to_drop for details).

  3. Drops rows with NA values.

  4. Creates a one-hot encoding of the categorical variables.

  5. Maps protected attributes to binary privileged/unprivileged values (1/0).

  6. Maps labels to binary favorable/unfavorable labels (1/0).
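
The steps above can be sketched with plain pandas. This is only an illustration of the documented behaviour, not the library's implementation: the toy data, the column names, the helper standard_preprocess and the "=" separator in the one-hot column names are all assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
raw = pd.DataFrame({
    "age": [25, 40, None, 31],
    "sex": ["Male", "Female", "Male", "Female"],
    "job": ["clerk", "nurse", "clerk", "driver"],
    "notes": ["a", "b", "c", "d"],
    "outcome": ["good", "bad", "good", "good"],
})


def standard_preprocess(df):
    """Hypothetical helper mirroring steps 2-6 of the routine."""
    # 2. Drop unrequested columns.
    df = df.drop(columns=["notes"])
    # 3. Drop rows with NA values.
    df = df.dropna()
    # 4. One-hot encode the categorical variables.
    df = pd.get_dummies(df, columns=["job"], prefix_sep="=")
    # 5. Map the protected attribute to 1 (privileged) / 0 (unprivileged).
    df["sex"] = df["sex"].isin(["Male"]).astype(float)
    # 6. Map the label to 1 (favorable) / 0 (unfavorable).
    df["outcome"] = df["outcome"].isin(["good"]).astype(float)
    return df


processed = standard_preprocess(raw)
print(processed)
```

In this sketch the row with the missing age is dropped in step 3, so only three rows reach the encoding and mapping steps.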

Parameters:
  • df (pandas.DataFrame) – DataFrame on which to perform standard processing.

  • label_name – Name of the label column in df.

  • favorable_classes (list or function) – Label values which are considered favorable or a boolean function which returns True if favorable. All others are unfavorable. Label values are mapped to 1 (favorable) and 0 (unfavorable) if they are not already binary and numerical.

  • protected_attribute_names (list) – List of names corresponding to protected attribute columns in df.

  • privileged_classes (list(list or function)) – Each element is either a list of values which are considered privileged or a boolean function which returns True if privileged, for the corresponding column in protected_attribute_names. All others are unprivileged. Values are mapped to 1 (privileged) and 0 (unprivileged) if they are not already numerical.

  • instance_weights_name (optional) – Name of the instance weights column in df.

  • categorical_features (optional, list) – List of column names in the DataFrame which are to be expanded into one-hot vectors.

  • features_to_keep (optional, list) – Column names to keep. All others are dropped except those present in protected_attribute_names, categorical_features, label_name or instance_weights_name. Defaults to all columns if not provided.

  • features_to_drop (optional, list) – Column names to drop. Note: this overrides features_to_keep.

  • na_values (optional) – Additional strings to recognize as NA. See pandas.read_csv() for details.

  • custom_preprocessing (function) – A function object which acts on and returns a DataFrame (f: DataFrame -> DataFrame). If None, no extra preprocessing is applied.

  • metadata (optional) – Additional metadata to append.
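
Two of the parameter rules above are easier to see in code: how features_to_keep and features_to_drop combine, and how a list or a boolean function is turned into 1/0 values. The following is a minimal sketch under stated assumptions, not the library's code: resolve_columns and to_binary are hypothetical helpers, the data is invented, and the sketch assumes features_to_drop is applied last.

```python
import pandas as pd


def resolve_columns(df, label_name, protected_attribute_names,
                    categorical_features=(), features_to_keep=(),
                    features_to_drop=(), instance_weights_name=""):
    """Hypothetical helper: return the columns that survive column dropping."""
    # An empty features_to_keep means "keep every column".
    keep = set(features_to_keep) or set(df.columns)
    # These columns are retained even if not listed in features_to_keep.
    keep |= set(protected_attribute_names) | set(categorical_features)
    keep.add(label_name)
    if instance_weights_name:
        keep.add(instance_weights_name)
    # Assumption: features_to_drop overrides everything collected above.
    keep -= set(features_to_drop)
    # Preserve the original column order.
    return [c for c in df.columns if c in keep]


def to_binary(series, classes):
    """Hypothetical helper: map values to 1.0/0.0 from a list or a predicate."""
    if callable(classes):
        return series.apply(classes).astype(float)
    return series.isin(classes).astype(float)


# Invented example data.
df = pd.DataFrame({
    "age": [25, 40, 31],
    "sex": ["Male", "Female", "Male"],
    "zip": ["10001", "94110", "60601"],
    "income": [30, 80, 55],
    "outcome": ["good", "bad", "good"],
})

cols = resolve_columns(df, "outcome", ["sex"],
                       features_to_keep=["age", "zip"],
                       features_to_drop=["zip"])
by_list = to_binary(df["sex"], ["Male"])          # list of privileged values
by_func = to_binary(df["age"], lambda x: x >= 30)  # boolean function
print(cols, by_list.tolist(), by_func.tolist())
```

Here "sex" and "outcome" are kept although they are not in features_to_keep, "income" is dropped because it was not requested, and "zip" is dropped because features_to_drop takes precedence.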

Methods

align_datasets

Align the other dataset features, labels and protected_attributes to this dataset.

convert_to_dataframe

Convert the StructuredDataset to a pandas.DataFrame.

copy

Convenience method to return a copy of this dataset.

export_dataset

Export the dataset and supporting attributes. TODO: the preferred file format is HDF.

import_dataset

Import the dataset and supporting attributes. TODO: the preferred file format is HDF.

split

Split this dataset into multiple partitions.

subset

Subset of the dataset based on position. Parameter indexes: an iterable which contains row indexes.

temporarily_ignore

Temporarily add the fields provided to ignore_fields.

validate_dataset

Error checking and type validation.
