aif360.datasets.StructuredDataset

class aif360.datasets.StructuredDataset(df, label_names, protected_attribute_names, instance_weights_name=None, scores_names=[], unprivileged_protected_attributes=[], privileged_protected_attributes=[], metadata=None)[source]

Base class for all structured datasets.

A StructuredDataset requires data to be stored as numpy.ndarray objects with dtype float64.

Variables:
  • features (numpy.ndarray) – Dataset features for each instance.

  • labels (numpy.ndarray) – Generic label corresponding to each instance (could be ground-truth, predicted, cluster assignments, etc.).

  • scores (numpy.ndarray) – Probability score associated with each label. Same shape as labels. Only valid for binary labels (this includes one-hot categorical labels as well).

  • protected_attributes (numpy.ndarray) – A subset of features for which fairness is desired.

  • feature_names (list(str)) – Names describing each dataset feature.

  • label_names (list(str)) – Names describing each label.

  • protected_attribute_names (list(str)) – A subset of feature_names corresponding to protected_attributes.

  • privileged_protected_attributes (list(numpy.ndarray)) – A subset of protected attribute values which are considered privileged from a fairness perspective.

  • unprivileged_protected_attributes (list(numpy.ndarray)) – The remaining possible protected attribute values which are not included in privileged_protected_attributes.

  • instance_names (list(str)) – Identifiers for each instance. Sequential integers by default.

  • instance_weights (numpy.ndarray) – Weighting for each instance. All equal (ones) by default. Following standard practice for social science data, a weight of 1 means one person or entity; the weights thus act as person or entity multipliers (see: https://www.ibm.com/support/knowledgecenter/en/SS3RA7_15.0.0/com.ibm.spss.modeler.help/netezza_decisiontrees_weights.htm). The weights need not be normalized to sum to 1 across the entire dataset; rather, the nominal (default) weight of each entity/record in the data is 1. This is similar in spirit to the person weight in census microdata samples (see: https://www.census.gov/programs-surveys/acs/technical-documentation/pums/about.html).

  • ignore_fields (set(str)) – Attribute names to ignore when doing equality comparisons. Always at least contains 'metadata'.

  • metadata (dict) –

    Details about the creation of this dataset. For example:

    {
        'transformer': 'Dataset.__init__',
        'params': kwargs,
        'previous': None
    }
    

Parameters:
  • df (pandas.DataFrame) – Input DataFrame with features, labels, and protected attributes. Values should be preprocessed to remove NAs and make all data numerical. Index values are taken as instance names.

  • label_names (iterable) – Names of the label columns in df.

  • protected_attribute_names (iterable) – List of names corresponding to protected attribute columns in df.

  • instance_weights_name (optional) – Column name in df corresponding to instance weights. If not provided, instance_weights will all be set to 1.

  • unprivileged_protected_attributes (optional) – If not provided, all but the highest numerical value of each protected attribute will be considered unprivileged.

  • privileged_protected_attributes (optional) – If not provided, the highest numerical value of each protected attribute will be considered privileged.

  • metadata (optional) – Additional metadata to append.

Raises:
  • TypeError – Certain fields must be np.ndarrays as specified in the class description.

  • ValueError – ndarray shapes must match.
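
A minimal construction sketch for illustration; the toy columns age, sex, and label are invented for this example, and the expected shapes assume the usual behavior that label columns are kept out of features:

>>> import pandas as pd
>>> from aif360.datasets import StructuredDataset
>>> df = pd.DataFrame({'age': [25., 32., 47., 51.],
...                    'sex': [0., 1., 1., 0.],
...                    'label': [0., 1., 1., 0.]})
>>> sd = StructuredDataset(df, label_names=['label'],
...                        protected_attribute_names=['sex'])
>>> sd.features.shape               # 'label' is excluded from features
(4, 2)
>>> sd.protected_attributes.shape   # 'sex' only
(4, 1)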

Methods

align_datasets

Align the other dataset features, labels and protected_attributes to this dataset.

convert_to_dataframe

Convert the StructuredDataset to a pandas.DataFrame.

copy

Convenience method to return a copy of this dataset.

export_dataset

Export the dataset and supporting attributes. TODO: The preferred file format is HDF.

import_dataset

Import the dataset and supporting attributes. TODO: The preferred file format is HDF.

split

Split this dataset into multiple partitions.

subset

Return a subset of this dataset based on position.

temporarily_ignore

Temporarily add the fields provided to ignore_fields.

validate_dataset

Error checking and type validation.

__init__(df, label_names, protected_attribute_names, instance_weights_name=None, scores_names=[], unprivileged_protected_attributes=[], privileged_protected_attributes=[], metadata=None)[source]
Parameters:
  • df (pandas.DataFrame) – Input DataFrame with features, labels, and protected attributes. Values should be preprocessed to remove NAs and make all data numerical. Index values are taken as instance names.

  • label_names (iterable) – Names of the label columns in df.

  • protected_attribute_names (iterable) – List of names corresponding to protected attribute columns in df.

  • instance_weights_name (optional) – Column name in df corresponding to instance weights. If not provided, instance_weights will all be set to 1.

  • unprivileged_protected_attributes (optional) – If not provided, all but the highest numerical value of each protected attribute will be considered unprivileged.

  • privileged_protected_attributes (optional) – If not provided, the highest numerical value of each protected attribute will be considered privileged.

  • metadata (optional) – Additional metadata to append.

Raises:
  • TypeError – Certain fields must be np.ndarrays as specified in the class description.

  • ValueError – ndarray shapes must match.

align_datasets(other)[source]

Align the other dataset features, labels and protected_attributes to this dataset.

Parameters:

other (StructuredDataset) – Other dataset that needs to be aligned.

Returns:

StructuredDataset – New aligned dataset.
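
A sketch of typical usage; here b holds the same toy data as a with its feature columns reordered, and the example assumes align_datasets returns the other dataset's arrays reordered into this dataset's column order:

>>> import pandas as pd
>>> from aif360.datasets import StructuredDataset
>>> df = pd.DataFrame({'age': [25., 32.], 'sex': [0., 1.], 'label': [0., 1.]})
>>> a = StructuredDataset(df, label_names=['label'],
...                       protected_attribute_names=['sex'])
>>> b = StructuredDataset(df[['sex', 'age', 'label']], label_names=['label'],
...                       protected_attribute_names=['sex'])
>>> aligned = a.align_datasets(b)   # b's arrays, reordered to match a
>>> aligned.feature_names == a.feature_names
True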

convert_to_dataframe(de_dummy_code=False, sep='=', set_category=True)[source]

Convert the StructuredDataset to a pandas.DataFrame.

Parameters:
  • de_dummy_code (bool) – Performs de-dummy coding, converting dummy-coded columns to categories. If de_dummy_code is True and this dataset contains mappings for label and/or protected attribute values to strings in the metadata, this method will convert those as well.

  • sep (char) – Separator between the prefix in the dummy indicators and the dummy-coded categorical levels.

  • set_category (bool) – Set the de-dummy coded features to categorical type.

Returns:

(pandas.DataFrame, dict)

  • pandas.DataFrame: Equivalent dataframe for a dataset. All columns will have only numeric values. The protected_attributes field in the dataset will override the values in the features field.

  • dict: Attributes. Will contain additional information pulled from the dataset such as feature_names, label_names, protected_attribute_names, instance_names, instance_weights, privileged_protected_attributes, unprivileged_protected_attributes. The metadata will not be returned.
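
A minimal round-trip sketch using the same toy columns as in the construction example above; the attribute keys shown follow the dict contents described in the bullet above:

>>> import pandas as pd
>>> from aif360.datasets import StructuredDataset
>>> df = pd.DataFrame({'age': [25., 32.], 'sex': [0., 1.], 'label': [0., 1.]})
>>> sd = StructuredDataset(df, label_names=['label'],
...                        protected_attribute_names=['sex'])
>>> out, attrs = sd.convert_to_dataframe()
>>> sorted(out.columns)
['age', 'label', 'sex']
>>> attrs['label_names']
['label']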

export_dataset(export_metadata=False)[source]

Export the dataset and supporting attributes. TODO: The preferred file format is HDF.

import_dataset(import_metadata=False)[source]

Import the dataset and supporting attributes. TODO: The preferred file format is HDF.

split(num_or_size_splits, shuffle=False, seed=None)[source]

Split this dataset into multiple partitions.

Parameters:
  • num_or_size_splits (array or int) – If num_or_size_splits is an int, k, the value is the number of equal-sized folds to make (if k does not evenly divide the dataset these folds are approximately equal-sized). If num_or_size_splits is an array of type int, the values are taken as the indices at which to split the dataset. If the values are floats (< 1.), they are considered to be fractional proportions of the dataset at which to split.

  • shuffle (bool, optional) – Randomly shuffle the dataset before splitting.

  • seed (int or array_like) – Takes the same argument as numpy.random.seed().

Returns:

list – Splits. Contains k or len(num_or_size_splits) + 1 datasets depending on num_or_size_splits.
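
A short sketch of both calling conventions, using the same toy columns as above:

>>> import pandas as pd
>>> from aif360.datasets import StructuredDataset
>>> df = pd.DataFrame({'age': [25., 32., 47., 51., 60.],
...                    'sex': [0., 1., 1., 0., 1.],
...                    'label': [0., 1., 1., 0., 1.]})
>>> sd = StructuredDataset(df, label_names=['label'],
...                        protected_attribute_names=['sex'])
>>> train, test = sd.split([0.8], shuffle=True, seed=42)  # fractional cut
>>> folds = sd.split(5)                                   # five ~equal folds
>>> len(folds)
5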

subset(indexes)[source]

Return a subset of this dataset based on position.

Parameters:

indexes (iterable) – Iterable containing row indexes to select by position.

Returns:

StructuredDataset – Subset of this dataset based on indexes.
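
A minimal sketch, again with the toy columns from above; indexes are positional, so [0, 1] keeps the first two rows:

>>> import pandas as pd
>>> from aif360.datasets import StructuredDataset
>>> df = pd.DataFrame({'age': [25., 32., 47.], 'sex': [0., 1., 1.],
...                    'label': [0., 1., 1.]})
>>> sd = StructuredDataset(df, label_names=['label'],
...                        protected_attribute_names=['sex'])
>>> first_two = sd.subset([0, 1])   # rows selected by position
>>> first_two.features.shape
(2, 2)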

temporarily_ignore(*fields)[source]

Temporarily add the fields provided to ignore_fields.

To be used in a with statement. Upon completing the with block, ignore_fields is restored to its original value.

Parameters:

*fields – Additional fields to ignore for equality comparison within the scope of this context manager, e.g. temporarily_ignore('features', 'labels'). The temporary ignore_fields attribute is the union of the old attribute and the set of these fields.

Examples

>>> sd = StructuredDataset(...)
>>> modified = sd.copy()
>>> modified.labels = sd.labels + 1
>>> assert sd != modified
>>> with sd.temporarily_ignore('labels'):
...     assert sd == modified
>>> assert 'labels' not in sd.ignore_fields

validate_dataset()[source]

Error checking and type validation.

Raises:
  • TypeError – Certain fields must be np.ndarrays as specified in the class description.

  • ValueError – ndarray shapes must match.
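
A hedged sketch of the failure mode; it assumes that shrinking labels so its row count no longer matches features trips the ValueError described above:

>>> import pandas as pd
>>> from aif360.datasets import StructuredDataset
>>> df = pd.DataFrame({'age': [25., 32.], 'sex': [0., 1.], 'label': [0., 1.]})
>>> sd = StructuredDataset(df, label_names=['label'],
...                        protected_attribute_names=['sex'])
>>> sd.labels = sd.labels[:1]       # deliberately break the shape invariant
>>> sd.validate_dataset()
Traceback (most recent call last):
    ...
ValueError: ...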