aif360.sklearn.datasets.standardize_dataset

aif360.sklearn.datasets.standardize_dataset(df, *, prot_attr, target, sample_weight=None, usecols=None, dropcols=None, numeric_only=False, dropna=True)

Separate a DataFrame into features, targets, and (optionally) sample weights, and store the protected attributes in the index as sample properties.

Parameters:
  • df (pandas.DataFrame) – DataFrame with features and, optionally, target.

  • prot_attr (label or array-like or list of labels/arrays) – Label, array of the same length as df, or a list containing any combination of the two corresponding to protected attribute columns. Even if these are dropped from the features, they remain in the index. Column(s) indicated by label will be copied from df, not dropped. Column(s) passed explicitly as arrays will not be added to features.

  • target (label or array-like or list of labels/arrays) – Label, array of the same length as df, or a list containing any combination of the two corresponding to the target (outcome) variable. Column(s) indicated by label will be dropped from features.

  • sample_weight (single label or array-like, optional) – Name of the column containing sample weights or an array of sample weights of the same length as df. If a label is passed, the column is dropped from features. Note: the index of a passed Series will be ignored.

  • usecols (list-like, optional) – Column(s) to keep. All others are dropped.

  • dropcols (list-like, optional) – Column(s) to drop. Missing labels are ignored.

  • numeric_only (bool) – Drop all non-numeric, non-binary feature columns.

  • dropna (bool) – Drop rows with NAs.
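The interaction between prot_attr, target, and sample_weight can be approximated in plain pandas. The block below is only a sketch of the behavior described above, not the library's implementation; it assumes a single protected attribute, target, and weight column, each passed by label.

```python
import pandas as pd

df = pd.DataFrame(
    [[0.5, 1, 1, 0.75], [-0.5, 0, 0, 0.25]],
    columns=["X", "y", "Z", "w"],
)

# Copy the protected attribute into the index; drop=False keeps the column,
# mirroring "copied from df, not dropped".
indexed = df.set_index("Z", drop=False)

# Columns named as target or sample_weight leave the feature frame.
y = indexed.pop("y")
sample_weight = indexed.pop("w")
X = indexed

print(X.columns.tolist())
print(list(X.index.names))
```

In this sketch X keeps the columns X and Z, and all three pieces share an index named Z, so the protected attribute stays available even if Z is later removed from the features.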

Returns:

collections.namedtuple – A tuple-like object where items can be accessed by index or name. Contains the following attributes:

  • X (pandas.DataFrame) – Feature array.

  • y (pandas.DataFrame or pandas.Series) – Target array.

  • sample_weight (pandas.Series, optional) – Sample weights.
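A namedtuple return value supports positional unpacking, access by index, and access by name. The snippet below illustrates this with a stand-in type; the class name Dataset and the list values are hypothetical placeholders, not the object the library actually returns.

```python
from collections import namedtuple

# Hypothetical stand-in for the returned type; only the field names
# X, y, and sample_weight come from the documentation above.
Dataset = namedtuple("Dataset", ["X", "y", "sample_weight"])
result = Dataset(X=[[0.5], [-0.5]], y=[1, 0], sample_weight=[0.75, 0.25])

# Positional unpacking, as with an ordinary tuple.
X, y, sample_weight = result

# Index-based and name-based access refer to the same item.
first_by_index = result[0]
first_by_name = result.X

# _asdict() yields a mapping keyed by field name, convenient for **kwargs.
kwargs = result._asdict()
```

The _asdict() form is what makes a call such as fit(**train._asdict()) work when the field names match the estimator's keyword arguments.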

Note

The order of execution for the dropping parameters is: usecols -> dropcols -> numeric_only -> dropna.
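The effect of this ordering can be sketched with plain pandas operations. This is an illustration under simplifying assumptions, not the library's code: the column names are made up, and select_dtypes stands in for numeric_only without reproducing its exception for binary columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "city": ["A", "B", "C"],
    "income": [1.0, 2.0, np.nan],
    "notes": ["x", "y", "z"],
})

usecols = ["age", "city", "income"]
dropcols = ["income", "not_a_column"]

# 1. usecols: keep only the listed columns.
step1 = df[usecols]
# 2. dropcols: drop the listed columns, ignoring labels that are missing.
step2 = step1.drop(columns=dropcols, errors="ignore")
# 3. numeric_only (simplified): keep numeric columns only.
step3 = step2.select_dtypes(include="number")
# 4. dropna: remove rows that still contain NAs.
step4 = step3.dropna()

print(step4)
```

Because dropna runs last, the missing income value in the third row does not remove that row: income has already been dropped by the time rows are filtered. Only the row with a missing age is lost.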

Examples

>>> import pandas as pd
>>> from sklearn.linear_model import LinearRegression
>>> from aif360.sklearn.datasets import standardize_dataset
>>> df = pd.DataFrame([[0.5, 1, 1, 0.75], [-0.5, 0, 0, 0.25]],
...                   columns=['X', 'y', 'Z', 'w'])
>>> train = standardize_dataset(df, prot_attr='Z', target='y',
...                             sample_weight='w')
>>> reg = LinearRegression().fit(**train._asdict())
>>> import numpy as np
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> df = pd.DataFrame(np.column_stack(make_classification(n_features=5)))
>>> X, y = standardize_dataset(df, prot_attr=0, target=5)
>>> X_tr, X_te, y_tr, y_te = train_test_split(X, y)