Pipeline API

The physlearn.pipeline module enhances the original Scikit-learn pipeline with an implementation of base boosting. It includes a physlearn.pipeline.ModifiedPipeline class, as well as a physlearn.pipeline.make_pipeline() convenience function.

class physlearn.pipeline.ModifiedPipeline(steps, memory=None, verbose=0, n_jobs=None, target_index=None, base_boosting_options=None)[source]

Bases: Pipeline

Custom pipeline object that supports base boosting.

The object inherits from the original Scikit-learn Pipeline; thus, it sequentially composes a list of named transforms and a final estimator into a new estimator. The modification extends this functionality such that the composed estimator supports base boosting. In other words, the base_boosting_options parameter enables a user to boost an explicit model of the domain by fitting an additive expansion, wherein the intercept term is generated by the explicit model. As such, the final estimator may be any estimator contained in the dictionary of estimators, i.e., the final estimator is not restricted to the decision tree hypothesis class.

Parameters
  • steps (list) – List of tuples, wherein the preceding tuple(s) (name, transform) are transform(s) and the last tuple (name, estimator) is an estimator.

  • memory (str or object with the joblib.Memory interface, optional (default=None)) – Enables fitted transform caching.

  • verbose (int, optional (default=0)) – Determines verbosity.

  • n_jobs (int or None, optional (default=None)) – The number of jobs to run in parallel.

  • target_index (int or None, optional (default=None)) – Specifies the single-target subtask in the multi-target task.

  • base_boosting_options (dict or None, optional (default=None)) –

    A dictionary of base boosting options, wherein the following options must be specified:

    n_estimators int

    The number of basis functions in the noise term of the additive expansion. Note that this option may also be specified as n_regressors; see the example below.

    boosting_loss str

    The loss function utilized in the pseudo-residual computation, where ‘ls’ denotes the squared error loss function, ‘lad’ denotes the absolute error loss function, ‘huber’ denotes the Huber loss function, and ‘quantile’ denotes the quantile loss function.

    line_search_options dict

    A dictionary of line search options, wherein the following options must be specified:

    init_guess int, float, or ndarray

    The initial guess for the expansion coefficient.

    opt_method str

    Choice of optimization method. If 'minimize', then scipy.optimize.minimize, else if 'basinhopping', then scipy.optimize.basinhopping.

    method str or None

    The type of solver utilized in the optimization method.

    tol float or None

    The epsilon tolerance for terminating the optimization method.

    options dict or None

    A dictionary of solver options.

    niter int or None

    The number of iterations in basin-hopping.

    T float or None

    The temperature parameter utilized in basin-hopping, which determines the accept or reject criterion.

    loss str

    The loss function utilized in the line search computation, where ‘ls’ denotes the squared error loss function, ‘lad’ denotes the absolute error loss function, ‘huber’ denotes the Huber loss function, and ‘quantile’ denotes the quantile loss function.

    regularization int or float

    The regularization strength in the line search computation.
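As a minimal sketch of the option layout described above, the nested dictionary for a basin-hopping line search might be assembled as follows. The numeric values are illustrative assumptions rather than recommended settings, and missing_options is a hypothetical helper that is not part of physlearn.

```python
# Option names taken from the parameter description above. The text notes
# that n_estimators may also be given as n_regressors; this sketch only
# checks the n_estimators spelling.
BASE_BOOSTING_KEYS = {"n_estimators", "boosting_loss", "line_search_options"}
LINE_SEARCH_KEYS = {"init_guess", "opt_method", "method", "tol", "options",
                    "niter", "T", "loss", "regularization"}

# Illustrative values only (assumptions, not library defaults).
line_search_options = dict(init_guess=1, opt_method="basinhopping",
                           method=None, tol=None, options=None,
                           niter=100, T=1.0, loss="ls",
                           regularization=0.1)
base_boosting_options = dict(n_estimators=3, boosting_loss="ls",
                             line_search_options=line_search_options)


def missing_options(options, required):
    """Hypothetical helper: return the sorted required keys absent from options."""
    return sorted(required - set(options))


print(missing_options(base_boosting_options, BASE_BOOSTING_KEYS))
print(missing_options(line_search_options, LINE_SEARCH_KEYS))
```

Checking the keys before constructing the pipeline surfaces a forgotten option early, since every listed option must be specified.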

See also

physlearn.pipeline.make_pipeline()

Convenience function for constructing a modified pipeline.

physlearn.supervised.utils._definition

Dictionary of final estimator options.

Examples

>>> from sklearn.linear_model import Ridge
>>> from sklearn.preprocessing import StandardScaler
>>> from physlearn import ModifiedPipeline
>>> from physlearn.datasets import load_benchmark
>>> X_train, X_test, y_train, y_test = load_benchmark(return_split=True)
>>> line_search_options = dict(init_guess=1, opt_method='minimize',
...                            method='Nelder-Mead', tol=1e-7,
...                            options={"maxiter": 10000},
...                            niter=None, T=None, loss='lad',
...                            regularization=0.1)
>>> base_boosting_options = dict(n_regressors=3, boosting_loss='lad',
...                              line_search_options=line_search_options)
>>> pipe = ModifiedPipeline(steps=[('scaler', StandardScaler()), ('reg', Ridge())],
...                         base_boosting_options=base_boosting_options)
>>> pipe.fit(X_train, y_train)
>>> pipe.score(X_test, y_test).round(decimals=2)
    mae    mse  rmse    r2    ev
0  2.17  10.01  3.16  0.97  0.98
1  1.17   3.09  1.76  0.99  0.99
2  0.78   1.20  1.09  1.00  1.00
3  0.83   1.12  1.06  1.00  1.00
4  0.99   2.00  1.42  1.00  1.00

References

  • Alex Wozniakowski, Jayne Thompson, Mile Gu, and Felix C. Binder. “A new formulation of gradient boosting”, Machine Learning: Science and Technology, 2 045022 (2021).

  • John Tukey. “Exploratory Data Analysis”, Addison-Wesley (1977).

  • Jerome Friedman. “Greedy function approximation: A gradient boosting machine”, Annals of Statistics, 29(5):1189–1232 (2001).

  • Trevor Hastie, Robert Tibshirani, and Jerome Friedman. “The Elements of Statistical Learning”, Springer (2009).

  • Lars Buitinck et al. “API design for machine learning software: experiences from the scikit-learn project”, arXiv preprint arXiv:1309.0238 (2013).

Computes the expansion coefficient in physlearn.pipeline.ModifiedPipeline._fit_stages().

Parameters
  • function (callable) – The objective function for the line search.

  • init_guess (int, float, or ndarray) – The initial guess for the expansion coefficient.

  • opt_method (str) – Choice of optimization method. If 'minimize', then scipy.optimize.minimize, else if 'basinhopping', then scipy.optimize.basinhopping.

  • method (str or None, optional (default=None)) – The type of solver utilized in the optimization method.

  • tol (float or None, optional (default=None)) – The epsilon tolerance for terminating the optimization method.

  • options (dict or None, optional (default=None)) – A dictionary of solver options.

  • niter (int or None, optional (default=None)) – The number of iterations in basin-hopping.

  • T (float or None, optional (default=None)) – The temperature parameter utilized in basin-hopping, which determines the accept or reject criterion.

Returns

res.x[0] – The expansion coefficient, i.e., the first element in the solution array.

Return type

float

Notes

The supported optimization methods are scipy.optimize.minimize and scipy.optimize.basinhopping; see the SciPy optimization documentation for further details.
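To illustrate what the returned expansion coefficient represents, the following sketch minimizes a squared error objective over a single coefficient. It is a simplified stand-in rather than the library's routine: a hand-written golden-section search replaces scipy.optimize.minimize, the regularization term is omitted, and the names squared_error_objective and golden_section_search are hypothetical.

```python
import math


def squared_error_objective(y, smooth, basis):
    """Return a function rho -> sum of squared errors of smooth + rho * basis."""
    def objective(rho):
        return sum((yi - (si + rho * bi)) ** 2
                   for yi, si, bi in zip(y, smooth, basis))
    return objective


def golden_section_search(function, lower, upper, tol=1e-8):
    """Minimize a unimodal one-dimensional function on [lower, upper]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lower, upper
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while abs(b - a) > tol:
        if function(c) < function(d):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2


# Toy data (assumed): targets, smooth term, and basis function predictions.
y = [3.0, 5.0, 7.0]
smooth = [2.0, 4.0, 6.0]
basis = [2.0, 2.0, 2.0]

rho = golden_section_search(squared_error_objective(y, smooth, basis), -10.0, 10.0)

# For the squared error loss, the minimizer has a closed form.
residual = [yi - si for yi, si in zip(y, smooth)]
closed_form = (sum(r * b for r, b in zip(residual, basis))
               / sum(b ** 2 for b in basis))
```

For the squared error loss the numerical search agrees with the closed-form minimizer; for the other losses no such shortcut is generally available, which is one reason a numerical optimizer is a natural fit here.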

_fit_stage(X, pseudo_residual, **fit_params_last_step)[source]

Induces a basis function, which is a map from the domain to the pseudo-residual space.

Parameters
  • X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).

  • pseudo_residual (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The negative gradient of the loss function.

  • **fit_params_last_step (dict of string -> object) – Parameters passed to the estimator’s fit method during the stage.
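As a rough illustration of the pseudo-residual, i.e., the negative gradient of the loss with respect to the current prediction, the sketch below evaluates the standard expressions for the four loss names used in this module. The function name pseudo_residual and the delta and alpha parameters are hypothetical and not taken from physlearn; the squared error case assumes the conventional factor of one half in the loss.

```python
def sign(value):
    """Return -1.0, 0.0, or 1.0 according to the sign of value."""
    return float((value > 0) - (value < 0))


def pseudo_residual(y, y_pred, loss="ls", delta=1.0, alpha=0.5):
    """Negative gradient of the loss with respect to the prediction."""
    residual = [yi - pi for yi, pi in zip(y, y_pred)]
    if loss == "ls":
        # L = (y - f)^2 / 2, so -dL/df = y - f.
        return residual
    if loss == "lad":
        # L = |y - f|, so -dL/df = sign(y - f).
        return [sign(r) for r in residual]
    if loss == "huber":
        # Quadratic inside the delta band, linear outside of it.
        return [r if abs(r) <= delta else delta * sign(r) for r in residual]
    if loss == "quantile":
        # Pinball loss at quantile level alpha.
        return [alpha if r > 0 else alpha - 1.0 for r in residual]
    raise ValueError(f"Unknown loss: {loss!r}")


y = [3.0, 1.0, 2.0]
y_pred = [1.0, 2.0, 2.0]
print(pseudo_residual(y, y_pred, loss="ls"))
print(pseudo_residual(y, y_pred, loss="lad"))
```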

_fit_stages(X, y, init_expansion, **fit_params_last_step)[source]

Fits the additive expansion in a greedy stagewise fashion.

This method transfers prior domain knowledge to gradient boosting through the init_expansion parameter, and it is designed to be utilized within physlearn.pipeline.ModifiedPipeline.fit(). The induced basis functions and the learned expansion coefficients can be retrieved with the estimators_ and the coefs_ attributes, respectively.

Parameters
  • X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).

  • y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).

  • init_expansion (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The smooth term in the additive expansion, i.e., the initial guess in gradient boosting.

  • **fit_params_last_step (dict of string -> object) – Parameters passed to the estimator’s fit method during each stage.

estimators_

A list of induced basis functions.

Type

list

coefs_

A list of learned expansion coefficients.

Type

list

Notes

This greedy stagewise algorithm fits an additive expansion that differs from the standard additive expansion: the constant term is a random variable that depends upon the input example, e.g., an element in init_expansion.
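The stagewise procedure can be sketched in plain Python under simplifying assumptions: a squared error loss, a one-dimensional feature, a least-squares line as the basis function, and a closed-form coefficient in place of the line search. The names fit_line and fit_stages are hypothetical, and the sketch is not the library's implementation.

```python
def fit_line(x, target):
    """Fit a least-squares line to (x, target) and return it as a callable."""
    n = len(x)
    x_mean = sum(x) / n
    t_mean = sum(target) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (ti - t_mean)
                for xi, ti in zip(x, target)) / sxx
    intercept = t_mean - slope * x_mean
    return lambda xi: slope * xi + intercept


def fit_stages(x, y, init_expansion, n_estimators):
    """Greedy stagewise fit that starts from an example-dependent smooth term."""
    y_pred = list(init_expansion)
    estimators, coefs = [], []
    for _ in range(n_estimators):
        # Pseudo-residual for the squared error loss.
        residual = [yi - pi for yi, pi in zip(y, y_pred)]
        basis = fit_line(x, residual)
        h = [basis(xi) for xi in x]
        # Closed-form coefficient; zero if the basis function vanishes.
        denom = sum(hi ** 2 for hi in h)
        coef = sum(ri * hi for ri, hi in zip(residual, h)) / denom if denom else 0.0
        y_pred = [pi + coef * hi for pi, hi in zip(y_pred, h)]
        estimators.append(basis)
        coefs.append(coef)
    return estimators, coefs, y_pred


x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [xi ** 2 for xi in x]
init_expansion = [3.0 * xi for xi in x]  # crude explicit model (assumed)
estimators, coefs, y_pred = fit_stages(x, y, init_expansion, n_estimators=2)
```

Here the initial prediction differs from example to example, in contrast to standard gradient boosting, which starts from a single constant.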

fit(X, y, **fit_params)[source]

Sequentially fits the transform(s) then the final estimator.

This method supports base boosting.

Parameters
  • X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).

  • y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).

  • **fit_params (dict of string -> object) – Parameters passed to the fit method of each step of the stagewise _fit_stage method.

Returns

self – The induced pipeline object.

Return type

ModifiedPipeline

_predict(estimator, Xt, coef, **predict_params)[source]

Helper method for parallelizing the noise term predictions.

Parameters
  • estimator (estimator) – An estimator that follows the Scikit-learn API.

  • Xt (array-like of shape = [n_samples, n_features]) – The transformed design matrix.

  • coef (float) – The learned expansion coefficient.

  • **predict_params (dict of string -> object) – Parameters to the predict called at the end of all transformations in the pipeline.

Returns

y_pred – A NumPy array of predictions.

Return type

ndarray

predict(X, **predict_params)[source]

Applies transform(s) to the data, then predicts with the final estimator.

The method supports base boosting.

Parameters
  • X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).

  • **predict_params (dict of string -> object) – Parameters passed to the predict method, which is called after completing all of the pipeline transformations.

Returns

y_pred – A pandas DataFrame or Series of predictions.

Return type

DataFrame or Series

Notes

In base boosting, we decompose the predictions in accord with Tukey’s notion of reroughing. Namely, data = smooth + rough.
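The decomposition can be illustrated with a small sketch, assuming that the smooth term is supplied by the explicit model and that the rough term is the coefficient-weighted sum of the basis function predictions. The function base_boosting_predict and the toy numbers are hypothetical.

```python
def base_boosting_predict(smooth, basis_predictions, coefs):
    """Combine the terms as data = smooth + rough.

    basis_predictions[m][i] is the prediction of the m-th basis function
    for the i-th example, and coefs[m] is its expansion coefficient.
    """
    rough = [sum(c * h[i] for c, h in zip(coefs, basis_predictions))
             for i in range(len(smooth))]
    return [s + r for s, r in zip(smooth, rough)]


smooth = [1.0, 2.0]
basis_predictions = [[0.5, -0.5], [1.0, 1.0]]
coefs = [2.0, 0.5]
y_pred = base_boosting_predict(smooth, basis_predictions, coefs)
```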

score(X, y, multioutput='raw_values', **predict_params)[source]

Computes the supervised score.

Parameters
  • X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).

  • y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).

  • multioutput (str, optional (default='raw_values')) – Defines how multiple output values are aggregated; the string must be either 'raw_values' or 'uniform_average'.

  • **predict_params (dict of string -> object) – Parameters passed to the predict method, which is called after completing all of the pipeline transformations.

Returns

scores – The pandas object of computed scores.

Return type

pd.DataFrame or pd.Series
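The column names that appear in the example outputs (mae, mse, rmse, r2, ev) suggest the usual regression metrics. Assuming they follow the standard definitions of mean absolute error, mean squared error, root mean squared error, coefficient of determination, and explained variance, a single-target sketch is given below; regression_scores is a hypothetical name, and the library's own computation may differ.

```python
import math


def regression_scores(y_true, y_pred):
    """Compute standard regression metrics for a single target."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    y_mean = sum(y_true) / n
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    # r2 penalizes any systematic offset in the errors.
    r2 = 1.0 - sum(e ** 2 for e in errors) / ss_tot
    # Explained variance centers the errors first, so a constant bias is ignored.
    e_mean = sum(errors) / n
    ev = 1.0 - sum((e - e_mean) ** 2 for e in errors) / ss_tot
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "r2": r2, "ev": ev}


# Predictions that are off by a constant: ev stays at 1 while r2 drops.
scores = regression_scores([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
```

Comparing r2 with ev in this way helps to spot a constant bias in the predictions.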

property _pairwise

Attribute _pairwise was deprecated in version 0.24 and will be removed in 1.1 (renaming of 0.26).

Type

DEPRECATED

property classes_

The class labels. Only exists if the last step is a classifier.

property feature_names_in_

Names of features seen during first step fit method.

get_feature_names_out(input_features=None)

Get output feature names for transformation.

Transform input features using the pipeline.

Parameters

input_features (array-like of str or None, default=None) – Input features.

Returns

feature_names_out – Transformed feature names.

Return type

ndarray of str objects

property n_features_in_

Number of features seen during first step fit method.

property named_steps

Access the steps by name.

Read-only attribute to access any step by its given name. Keys are step names and values are the step objects.

physlearn.pipeline.make_pipeline(estimator, transform=None, **kwargs)[source]

Constructs a ModifiedPipeline from the given base estimator.

Parameters
  • estimator (estimator) – A base estimator that follows the Scikit-learn API.

  • transform (str, list, tuple, or None, optional (default=None)) – Choice of transform(s). If the specified choice is a string, then it must be a default option, where 'standardscaler', 'boxcox', 'yeojohnson', 'quantileuniform', and 'quantilenormal' denote sklearn.preprocessing.StandardScaler, sklearn.preprocessing.PowerTransformer with method='box-cox' or method='yeo-johnson', and sklearn.preprocessing.QuantileTransformer with output_distribution='uniform' or output_distribution='normal', respectively.

  • memory (str or object with the joblib.Memory interface) – Enables fitted transform caching.

  • verbose (int) – Determines verbosity.

  • n_jobs (int or None) – The number of jobs to run in parallel.

  • auto_target (bool, optional (default=True)) – Determines whether to automatically handle the pipeline steps or let the user specify the steps.

  • target_index (int or None) – Specifies the single-target subtask in the multi-target task.

  • target_type (str) – Specifies the type of target according to sklearn.utils.multiclass.type_of_target.

  • base_boosting_options (dict or None) –

    A dictionary of base boosting options, wherein the following options must be specified:

    n_estimators int

    The number of basis functions in the noise term of the additive expansion.

    boosting_loss str

    The loss function utilized in the pseudo-residual computation, where ‘ls’ denotes the squared error loss function, ‘lad’ denotes the absolute error loss function, ‘huber’ denotes the Huber loss function, and ‘quantile’ denotes the quantile loss function.

    line_search_options dict

    A dictionary of line search options, wherein the following options must be specified:

    init_guess int, float, or ndarray

    The initial guess for the expansion coefficient.

    opt_method str

    Choice of optimization method. If 'minimize', then scipy.optimize.minimize, else if 'basinhopping', then scipy.optimize.basinhopping.

    method str or None

    The type of solver utilized in the optimization method.

    tol float or None

    The epsilon tolerance for terminating the optimization method.

    options dict or None

    A dictionary of solver options.

    niter int or None

    The number of iterations in basin-hopping.

    T float or None

    The temperature parameter utilized in basin-hopping, which determines the accept or reject criterion.

    loss str

    The loss function utilized in the line search computation, where ‘ls’ denotes the squared error loss function, ‘lad’ denotes the absolute error loss function, ‘huber’ denotes the Huber loss function, and ‘quantile’ denotes the quantile loss function.

    regularization int or float

    The regularization strength in the line search computation.

  • random_state (int, RandomState instance, or None) – Determines the random number generation in sklearn.preprocessing.QuantileTransformer, if transform is either 'quantileuniform' or 'quantilenormal', and also in sklearn.multioutput.RegressorChain.

  • n_quantiles (int or None) – Number of quantiles in sklearn.preprocessing.QuantileTransformer, if transform is either 'quantileuniform' or 'quantilenormal'.

  • cv (int, cross-validation generator, an iterable, or None) – Determines which targets are utilized in sklearn.multioutput.RegressorChain.

  • chain_order (list or None) – Determines the target order in sklearn.multioutput.RegressorChain.
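The default string options for the transform parameter described above can be summarized as a lookup table. The sketch below records each option together with the Scikit-learn class and keyword arguments named in the description; the DEFAULT_TRANSFORMS table and the describe_transform helper are hypothetical and only restate the documented mapping.

```python
# Each entry: option string -> (Scikit-learn class, keyword arguments).
DEFAULT_TRANSFORMS = {
    "standardscaler": ("sklearn.preprocessing.StandardScaler", {}),
    "boxcox": ("sklearn.preprocessing.PowerTransformer",
               {"method": "box-cox"}),
    "yeojohnson": ("sklearn.preprocessing.PowerTransformer",
                   {"method": "yeo-johnson"}),
    "quantileuniform": ("sklearn.preprocessing.QuantileTransformer",
                        {"output_distribution": "uniform"}),
    "quantilenormal": ("sklearn.preprocessing.QuantileTransformer",
                       {"output_distribution": "normal"}),
}


def describe_transform(name):
    """Hypothetical helper: look up the documented class and arguments."""
    try:
        return DEFAULT_TRANSFORMS[name]
    except KeyError:
        valid = ", ".join(sorted(DEFAULT_TRANSFORMS))
        raise ValueError(f"Unknown transform {name!r}; expected one of: {valid}") from None


print(describe_transform("yeojohnson"))
```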

Returns

pipe – The constructed pipeline object.

Return type

ModifiedPipeline

See also

physlearn.pipeline.ModifiedPipeline

Class for creating a modified pipeline of transforms with a final estimator, which supports base boosting.

Examples

>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import Ridge
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.utils.multiclass import type_of_target
>>> from physlearn import make_pipeline
>>> X, y = make_regression(n_targets=3, random_state=42)
>>> X, y = pd.DataFrame(X), pd.DataFrame(y)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
...                                                     random_state=42)
>>> pipe = make_pipeline(Ridge(), 'yeojohnson',
...                      target_type=type_of_target(y))
>>> pipe.fit(X_train, y_train)
>>> pipe.score(X_test, y_test).round(decimals=2)
      mae       mse    rmse    r2    ev
0   58.68   5884.12   76.71  0.67  0.67
1  101.19  14627.70  120.95  0.36  0.36
2   96.31  14450.54  120.21  0.40  0.40

References

  • Alex Wozniakowski, Jayne Thompson, Mile Gu, and Felix C. Binder. “A new formulation of gradient boosting”, Machine Learning: Science and Technology, 2 045022 (2021).

  • Jerome Friedman. “Greedy function approximation: A gradient boosting machine”, Annals of Statistics, 29(5):1189–1232 (2001).

  • Trevor Hastie, Robert Tibshirani, and Jerome Friedman. “The Elements of Statistical Learning”, Springer (2009).