aif360.sklearn.datasets.standardize_dataset

aif360.sklearn.datasets.standardize_dataset(df, prot_attr, target, sample_weight=None, usecols=[], dropcols=[], numeric_only=False, dropna=True)

Separate data, targets, and possibly sample weights and populate protected attributes as sample properties.
Parameters:
- df (pandas.DataFrame) – DataFrame with features and target together.
- prot_attr (single label or list-like) – Label or list of labels corresponding to protected attribute columns. Even if these are dropped from the features, they remain in the index.
- target (single label or list-like) – Column label(s) of the target (outcome) variable(s).
- sample_weight (single label, optional) – Name of the column containing sample weights.
- usecols (single label or list-like, optional) – Column(s) to keep. All others are dropped.
- dropcols (single label or list-like, optional) – Column(s) to drop.
- numeric_only (bool) – Drop all non-numeric, non-binary feature columns.
- dropna (bool) – Drop rows with NAs.
Returns:
collections.namedtuple – A tuple-like object where items can be accessed by index or name. Contains the following attributes:
- X (pandas.DataFrame) – Feature array.
- y (pandas.DataFrame or pandas.Series) – Target array.
- sample_weight (pandas.Series, optional) – Sample weights.
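Since the result is a namedtuple, callers can unpack it positionally or read the fields by name. A minimal sketch of both access patterns, using a toy frame like the one in the Examples below:

>>> import pandas as pd
>>> from aif360.sklearn.datasets import standardize_dataset
>>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['X', 'y', 'Z'])
>>> train = standardize_dataset(df, prot_attr='Z', target='y')
>>> X, y = train                # positional unpacking
>>> X, y = train.X, train.y    # equivalent access by field name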
Note
The order of execution for the dropping parameters is: numeric_only -> usecols -> dropcols -> dropna.
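To make that order concrete, here is a hedged sketch (the column names 'sex', 'age', 'name', and 'label' are invented for illustration): numeric_only first removes the non-numeric 'name' column, and dropna then removes the row with a missing 'age' value.

>>> import pandas as pd
>>> from aif360.sklearn.datasets import standardize_dataset
>>> df = pd.DataFrame({'sex': [0, 1, 1], 'age': [25.0, None, 30.0],
...                    'name': ['a', 'b', 'c'], 'label': [0, 1, 0]})
>>> # numeric_only drops 'name' first; dropna then drops the row
>>> # where 'age' is missing (hypothetical columns, for illustration)
>>> X, y = standardize_dataset(df, prot_attr='sex', target='label',
...                            numeric_only=True, dropna=True)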
Examples
>>> import pandas as pd
>>> from sklearn.linear_model import LinearRegression
>>> from aif360.sklearn.datasets import standardize_dataset

>>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['X', 'y', 'Z'])
>>> train = standardize_dataset(df, prot_attr='Z', target='y')
>>> reg = LinearRegression().fit(*train)
>>> import numpy as np
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split

>>> df = pd.DataFrame(np.column_stack(make_classification(n_features=5)))
>>> X, y = standardize_dataset(df, prot_attr=0, target=5)
>>> X_tr, X_te, y_tr, y_te = train_test_split(X, y)
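A further sketch of the prot_attr behavior documented above: even when the protected attribute is dropped from the features via dropcols, it remains available in the sample index. The expected outputs here follow from that documented behavior:

>>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['X', 'y', 'Z'])
>>> X, y = standardize_dataset(df, prot_attr='Z', target='y', dropcols='Z')
>>> 'Z' in X.columns            # dropped from the feature columns
False
>>> 'Z' in X.index.names        # but retained in the index
True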