aif360.sklearn.datasets.standardize_dataset

aif360.sklearn.datasets.standardize_dataset(df, prot_attr, target, sample_weight=None, usecols=[], dropcols=[], numeric_only=False, dropna=True)

Separate the data, targets, and (optionally) sample weights, and populate protected attributes as sample properties.

Parameters:
  • df (pandas.DataFrame) – DataFrame with features and target together.
  • prot_attr (single label or list-like) – Label or list of labels corresponding to protected attribute columns. Even if these are dropped from the features, they remain in the index.
  • target (single label or list-like) – Column label (or list of labels) of the target (outcome) variable.
  • sample_weight (single label, optional) – Name of the column containing sample weights.
  • usecols (single label or list-like, optional) – Column(s) to keep. All others are dropped.
  • dropcols (single label or list-like, optional) – Column(s) to drop.
  • numeric_only (bool) – Drop all non-numeric, non-binary feature columns.
  • dropna (bool) – Drop rows with NAs.
Returns:

collections.namedtuple – A tuple-like object where items can be accessed by index or name. Contains the following attributes:

  • X (pandas.DataFrame) – Feature array.
  • y (pandas.DataFrame or pandas.Series) – Target array.
  • sample_weight (pandas.Series, optional) – Sample weights.

Note

The order of execution for the dropping parameters is: numeric_only -> usecols -> dropcols -> dropna.
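
For example, a non-numeric feature column is removed by numeric_only before dropna ever sees the rows (a minimal sketch; the DataFrame and column names are illustrative):

>>> import pandas as pd
>>> from aif360.sklearn.datasets import standardize_dataset
>>> df = pd.DataFrame([[1, 2.0, 0, 'a'], [4, None, 1, 'b'], [7, 8.0, 0, 'c']],
...                   columns=['X', 'y', 'Z', 'W'])
>>> X, y = standardize_dataset(df, prot_attr='Z', target='y',
...                            numeric_only=True)

Here numeric_only drops the string column 'W' first; dropna (True by default) then drops the middle row because of the NA in 'y'.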

Examples

>>> import pandas as pd
>>> from sklearn.linear_model import LinearRegression
>>> from aif360.sklearn.datasets import standardize_dataset
>>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['X', 'y', 'Z'])
>>> train = standardize_dataset(df, prot_attr='Z', target='y')
>>> reg = LinearRegression().fit(*train)
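
Because no sample_weight column is given here, train contains exactly two fields, so fit(*train) unpacks to fit(train.X, train.y).
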
>>> import numpy as np
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> df = pd.DataFrame(np.c_[make_classification(n_features=5)])
>>> X, y = standardize_dataset(df, prot_attr=0, target=5)
>>> X_tr, X_te, y_tr, y_te = train_test_split(X, y)
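
The returned fields can also be accessed by name, which is handy when a sample weights column is supplied (a sketch mirroring the first example; the 'w' column is illustrative):

>>> df = pd.DataFrame([[0.5, 1, 1, 0.75], [-0.5, 0, 0, 0.25]],
...                   columns=['X', 'y', 'Z', 'w'])
>>> train = standardize_dataset(df, prot_attr='Z', target='y',
...                             sample_weight='w')
>>> reg = LinearRegression().fit(train.X, train.y,
...                              sample_weight=train.sample_weight)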