aif360.datasets.StructuredDataset

class aif360.datasets.StructuredDataset(df, label_names, protected_attribute_names, instance_weights_name=None, scores_names=[], unprivileged_protected_attributes=[], privileged_protected_attributes=[], metadata=None)

Base class for all structured datasets.

A StructuredDataset requires data to be stored in numpy.ndarray objects with dtype float64.

Variables:
- features (numpy.ndarray) – Dataset features for each instance.
- labels (numpy.ndarray) – Generic label corresponding to each instance (could be ground-truth, predicted, cluster assignments, etc.).
- scores (numpy.ndarray) – Probability score associated with each label. Same shape as labels. Only valid for binary labels (this includes one-hot categorical labels as well).
- protected_attributes (numpy.ndarray) – A subset of features for which fairness is desired.
- feature_names (list(str)) – Names describing each dataset feature.
- label_names (list(str)) – Names describing each label.
- protected_attribute_names (list(str)) – A subset of feature_names corresponding to protected_attributes.
- privileged_protected_attributes (list(numpy.ndarray)) – A subset of protected attribute values which are considered privileged from a fairness perspective.
- unprivileged_protected_attributes (list(numpy.ndarray)) – The remaining possible protected attribute values which are not included in privileged_protected_attributes.
- instance_names (list(str)) – Identifiers for each instance. Sequential integers by default.
- instance_weights (numpy.ndarray) – Weighting for each instance. All equal (ones) by default. Following standard practice in social science data, a weight of 1 means one person or entity; these weights are therefore person or entity multipliers (see: https://www.ibm.com/support/knowledgecenter/en/SS3RA7_15.0.0/com.ibm.spss.modeler.help/netezza_decisiontrees_weights.htm). The weights need not be normalized to sum to 1 across the entire dataset; rather, the nominal (default) weight of each entity/record is 1. This is similar in spirit to the person weight in census microdata samples (https://www.census.gov/programs-surveys/acs/technical-documentation/pums/about.html).
- ignore_fields (set(str)) – Attribute names to ignore when doing equality comparisons. Always contains at least 'metadata'.
- metadata (dict) – Details about the creation of this dataset. For example:
  {'transformer': 'Dataset.__init__', 'params': kwargs, 'previous': None}
Parameters:
- df (pandas.DataFrame) – Input DataFrame with features, labels, and protected attributes. Values should be preprocessed to remove NAs and make all data numerical. Index values are taken as instance names.
- label_names (iterable) – Names of the label columns in df.
- protected_attribute_names (iterable) – List of names corresponding to protected attribute columns in df.
- instance_weights_name (optional) – Column name in df corresponding to instance weights. If not provided, instance_weights will be all set to 1.
- unprivileged_protected_attributes (optional) – If not provided, all but the highest numerical value of each protected attribute will be considered not privileged.
- privileged_protected_attributes (optional) – If not provided, the highest numerical value of each protected attribute will be considered privileged.
- metadata (optional) – Additional metadata to append.
Raises:
- TypeError – Certain fields must be np.ndarrays as specified in the class description.
- ValueError – ndarray shapes must match.
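
A minimal construction sketch (the DataFrame below and its column names are illustrative, not part of the library):

>>> import pandas as pd
>>> from aif360.datasets import StructuredDataset
>>> df = pd.DataFrame({'feat': [0., 1., 1.],
...                    'sex': [0., 0., 1.],
...                    'label': [0., 1., 1.]})
>>> sd = StructuredDataset(df=df, label_names=['label'],
...                        protected_attribute_names=['sex'])
>>> sd.features.dtype  # all data is stored as float64
dtype('float64')

Note that the protected attribute column ('sex' here) remains part of features, since protected_attributes is a subset of features.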
Methods
align_datasets – Align the other dataset features, labels and protected_attributes to this dataset.
convert_to_dataframe – Convert the StructuredDataset to a pandas.DataFrame.
copy – Convenience method to return a copy of this dataset.
export_dataset – Export the dataset and supporting attributes. TODO: The preferred file format is HDF.
import_dataset – Import the dataset and supporting attributes. TODO: The preferred file format is HDF.
split – Split this dataset into multiple partitions.
subset – Subset of the dataset based on position.
temporarily_ignore – Temporarily add the fields provided to ignore_fields.
validate_dataset – Error checking and type validation.
__init__(df, label_names, protected_attribute_names, instance_weights_name=None, scores_names=[], unprivileged_protected_attributes=[], privileged_protected_attributes=[], metadata=None)

Parameters:
- df (pandas.DataFrame) – Input DataFrame with features, labels, and protected attributes. Values should be preprocessed to remove NAs and make all data numerical. Index values are taken as instance names.
- label_names (iterable) – Names of the label columns in df.
- protected_attribute_names (iterable) – List of names corresponding to protected attribute columns in df.
- instance_weights_name (optional) – Column name in df corresponding to instance weights. If not provided, instance_weights will be all set to 1.
- unprivileged_protected_attributes (optional) – If not provided, all but the highest numerical value of each protected attribute will be considered not privileged.
- privileged_protected_attributes (optional) – If not provided, the highest numerical value of each protected attribute will be considered privileged.
- metadata (optional) – Additional metadata to append.

Raises:
- TypeError – Certain fields must be np.ndarrays as specified in the class description.
- ValueError – ndarray shapes must match.
align_datasets(other)

Align the other dataset features, labels and protected_attributes to this dataset.

Parameters: other (StructuredDataset) – Other dataset that needs to be aligned.
Returns: StructuredDataset – New aligned dataset.
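
A minimal sketch (other_sd is assumed to be another StructuredDataset with the same columns, possibly in a different order):

>>> aligned = sd.align_datasets(other_sd)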
convert_to_dataframe(de_dummy_code=False, sep='=', set_category=True)

Convert the StructuredDataset to a pandas.DataFrame.

Parameters:
- de_dummy_code (bool) – Performs de-dummy-coding, converting dummy-coded columns to categories. If de_dummy_code is True and this dataset contains mappings for label and/or protected attribute values to strings in the metadata, this method will convert those as well.
- sep (char) – Separator between the prefix in the dummy indicators and the dummy-coded categorical levels.
- set_category (bool) – Set the de-dummy-coded features to categorical type.
Returns: (pandas.DataFrame, dict) –
- pandas.DataFrame: Equivalent dataframe for a dataset. All columns will have only numeric values. The protected_attributes field in the dataset will override the values in the features field.
- dict: Attributes. Will contain additional information pulled from the dataset such as feature_names, label_names, protected_attribute_names, instance_names, instance_weights, privileged_protected_attributes, unprivileged_protected_attributes. The metadata will not be returned.
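A quick usage sketch, continuing the construction example above:

>>> df2, attrs = sd.convert_to_dataframe()
>>> attrs['label_names']
['label']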
export_dataset(export_metadata=False)

Export the dataset and supporting attributes. TODO: The preferred file format is HDF.
import_dataset(import_metadata=False)

Import the dataset and supporting attributes. TODO: The preferred file format is HDF.
split(num_or_size_splits, shuffle=False, seed=None)

Split this dataset into multiple partitions.

Parameters:
- num_or_size_splits (array or int) – If num_or_size_splits is an int, k, the value is the number of equal-sized folds to make (if k does not evenly divide the dataset these folds are approximately equal-sized). If num_or_size_splits is an array of type int, the values are taken as the indices at which to split the dataset. If the values are floats (< 1.), they are considered to be fractional proportions of the dataset at which to split.
- shuffle (bool, optional) – Randomly shuffle the dataset before splitting.
- seed (int or array_like) – Takes the same argument as numpy.random.seed().

Returns: list – Splits. Contains k or len(num_or_size_splits) + 1 datasets depending on num_or_size_splits.
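
For example, a 70/30 train/test split (the proportion and seed here are illustrative):

>>> train, test = sd.split([0.7], shuffle=True, seed=42)
>>> len(train.instance_names) + len(test.instance_names) == len(sd.instance_names)
True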
subset(indexes)

Subset of the dataset based on position.

Parameters: indexes – Iterable which contains row indexes.
Returns: StructuredDataset – Subset of dataset based on indexes.
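
For example, selecting the first two rows by position (a sketch using the dataset constructed above):

>>> pair = sd.subset([0, 1])
>>> len(pair.instance_names)
2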
temporarily_ignore(*fields)

Temporarily add the fields provided to ignore_fields.

To be used in a with statement. Upon completing the with block, ignore_fields is restored to its original value.

Parameters: *fields – Additional fields to ignore for equality comparison within the scope of this context manager, e.g. temporarily_ignore('features', 'labels'). The temporary ignore_fields attribute is the union of the old attribute and the set of these fields.

Examples
>>> sd = StructuredDataset(...)
>>> modified = sd.copy()
>>> modified.labels = sd.labels + 1
>>> assert sd != modified
>>> with sd.temporarily_ignore('labels'):
...     assert sd == modified
>>> assert 'labels' not in sd.ignore_fields
validate_dataset()

Error checking and type validation.

Raises:
- TypeError – Certain fields must be np.ndarrays as specified in the class description.
- ValueError – ndarray shapes must match.