The physlearn.pipeline module enhances the original Scikit-learn
pipeline with an implementation of base boosting. It includes a
physlearn.pipeline.ModifiedPipeline class, as well as a
physlearn.pipeline.make_pipeline() convenience
function.
Bases: Pipeline
Custom pipeline object that supports base boosting.
The object inherits from the original Scikit-learn Pipeline, thus it is
designed to sequentially compose a list of named transforms and a final
estimator into a new estimator. The modification extends this
functionality such that the composed estimator supports base boosting.
In other words, the base_boosting_options parameter enables a user
to boost an explicit model of the domain by fitting an additive
expansion, wherein the intercept term is generated by the explicit
model. As such, the final estimator may be any estimator contained
in the dictionary of estimators, i.e., the final estimator is not
restricted to the decision tree hypothesis class.
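The additive expansion described above can be written out as a worked formula. This is a sketch in assumed notation, not notation taken from the library: f_0 is the explicit model that generates the intercept term, h_m are the induced basis functions, c_m are the expansion coefficients, M is the number of basis functions, and L is the boosting loss.

```latex
% Additive expansion: the explicit model f_0 plays the role of the intercept term.
F_M(x) = f_0(x) + \sum_{m=1}^{M} c_m \, h_m(x)

% Pseudo-residual at stage m: the negative gradient of the loss at the current fit.
r_{i,m} = -\left[ \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)} \right]_{F = F_{m-1}}
```

In standard gradient boosting f_0 would be a constant; here it varies with the input example.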
steps (list) – List of tuples, wherein the preceding tuple(s) (name, transform) are transform(s) and the last tuple (name, estimator) is an estimator.
memory (str or object with the joblib.Memory interface, optional (default=None)) – Enables fitted transform caching.
verbose (int, optional (default=0)) – Determines verbosity.
n_jobs (int or None, optional (default=-1)) – The number of jobs to run in parallel.
target_index (int or None, optional (default=None)) – Specifies the single-target subtask in the multi-target task.
base_boosting_options (dict or None, optional (default=None)) –
A dictionary of base boosting options, wherein the following options must be specified:
int – The number of basis functions in the noise term of the additive expansion. Note that this option may also be specified as n_regressors; see the example below.
boosting_loss (str) – The loss function utilized in the pseudo-residual computation, where ‘ls’ denotes the squared error loss function, ‘lad’ denotes the absolute error loss function, ‘huber’ denotes the Huber loss function, and ‘quantile’ denotes the quantile loss function.
line_search_options (dict) – A dictionary of line search options, wherein the following options must be specified:
init_guess (int, float, or ndarray) – The initial guess for the expansion coefficient.
opt_method (str) – Choice of optimization method. If 'minimize', then scipy.optimize.minimize, else if 'basinhopping', then scipy.optimize.basinhopping.
method (str or None) – The type of solver utilized in the optimization method.
tol (float or None) – The epsilon tolerance for terminating the optimization method.
options (dict or None) – A dictionary of solver options.
niter (int or None) – The number of iterations in basin-hopping.
T (float or None) – The temperature parameter utilized in basin-hopping, which determines the accept or reject criterion.
loss (str) – The loss function utilized in the line search computation, where ‘ls’ denotes the squared error loss function, ‘lad’ denotes the absolute error loss function, ‘huber’ denotes the Huber loss function, and ‘quantile’ denotes the quantile loss function.
regularization (int or float) – The regularization strength in the line search computation.
See also
physlearn.pipeline.make_pipeline() – Convenience function for constructing a modified pipeline.
physlearn.supervised.utils._definition – Dictionary of final estimator options.
Examples
>>> from sklearn.linear_model import Ridge
>>> from sklearn.preprocessing import StandardScaler
>>> from physlearn import ModifiedPipeline
>>> from physlearn.datasets import load_benchmark
>>> X_train, X_test, y_train, y_test = load_benchmark(return_split=True)
>>> line_search_options = dict(init_guess=1, opt_method='minimize',
...                            method='Nelder-Mead', tol=1e-7,
...                            options={"maxiter": 10000},
...                            niter=None, T=None, loss='lad',
...                            regularization=0.1)
>>> base_boosting_options = dict(n_regressors=3, boosting_loss='lad',
...                              line_search_options=line_search_options)
>>> pipe = ModifiedPipeline(steps=[('scaler', StandardScaler()), ('reg', Ridge())],
...                         base_boosting_options=base_boosting_options)
>>> pipe.fit(X_train, y_train)
>>> pipe.score(X_test, y_test).round(decimals=2)
    mae    mse  rmse    r2    ev
0  2.17  10.01  3.16  0.97  0.98
1  1.17   3.09  1.76  0.99  0.99
2  0.78   1.20  1.09  1.00  1.00
3  0.83   1.12  1.06  1.00  1.00
4  0.99   2.00  1.42  1.00  1.00
References
Alex Wozniakowski, Jayne Thompson, Mile Gu, and Felix C. Binder. “A new formulation of gradient boosting”, Machine Learning: Science and Technology, 2 045022 (2021).
John Tukey. “Exploratory Data Analysis”, Addison-Wesley (1977).
Jerome Friedman. “Greedy function approximation: A gradient boosting machine,” Annals of Statistics, 29(5):1189–1232 (2001).
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. “The Elements of Statistical Learning”, Springer (2009).
Lars Buitinck et al. “API design for machine learning software: experiences from the scikit-learn project” arXiv preprint arXiv:1309.0238 (2013).
Computes the expansion coefficient in physlearn.pipeline.ModifiedPipeline._fit_stages().
function (callable) – The objective function for the line search.
init_guess (int, float, or ndarray) – The initial guess for the expansion coefficient.
opt_method (str) – Choice of optimization method. If 'minimize', then
scipy.optimize.minimize, else if 'basinhopping',
then scipy.optimize.basinhopping.
method (str or None, optional (default=None)) – The type of solver utilized in the optimization method.
tol (float or None, optional (default=None)) – The epsilon tolerance for terminating the optimization method.
options (dict or None, optional (default=None)) – A dictionary of solver options.
niter (int or None, optional (default=None)) – The number of iterations in basin-hopping.
T (float or None, optional (default=None)) – The temperature parameter utilized in basin-hopping, which determines the accept or reject criterion.
res.x[0] – The expansion coefficient, i.e., the first element in the solution array.
float
Notes
The supported optimization methods are scipy.optimize.minimize
and scipy.optimize.basinhopping; see the SciPy optimization
documentation for further details.
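To make the role of the expansion coefficient concrete, the following is a minimal, dependency-free sketch of a one-dimensional line search. It swaps in a golden-section search for scipy.optimize.minimize so that it runs without SciPy. The function names, the bracketing interval, and the way the regularization strength enters the objective (a penalty on the absolute value of the coefficient) are assumptions made for illustration, not the library's implementation.

```python
import math


def golden_section_minimize(objective, lower, upper, tol=1e-7, max_iter=10000):
    """Minimize a unimodal scalar function on [lower, upper] (hypothetical helper)."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lower, upper
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = objective(c), objective(d)
    for _ in range(max_iter):
        if abs(b - a) <= tol:
            break
        if fc < fd:
            # The minimum lies in [a, d]; reuse c as the new upper interior point.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = objective(c)
        else:
            # The minimum lies in [c, b]; reuse d as the new lower interior point.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = objective(d)
    return (a + b) / 2.0


def absolute_error_objective(y, current, basis, regularization=0.0):
    """Build the line search objective for the absolute error loss.

    Assumption: the regularization strength multiplies |coef|.
    """
    def objective(coef):
        loss = sum(abs(yi - (fi + coef * hi))
                   for yi, fi, hi in zip(y, current, basis))
        return loss + regularization * abs(coef)
    return objective


# Toy data in which the best coefficient is 2.5 by construction.
current = [0.0, 1.0, 2.0]   # current additive expansion
basis = [1.0, -2.0, 0.5]    # predictions of the newly induced basis function
y = [fi + 2.5 * hi for fi, hi in zip(current, basis)]

objective = absolute_error_objective(y, current, basis, regularization=0.1)
coef = golden_section_minimize(objective, lower=-10.0, upper=10.0)
```

Golden-section search only suits unimodal objectives; basin-hopping, by contrast, targets objectives with several local minima.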
Induces a basis function, which is a map from the domain to the pseudo-residual space.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
pseudo_residual (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The negative gradient of the loss function.
**fit_params_last_step (dict of string -> object) – Parameters passed to the estimator’s fit method during the stage.
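For reference, the pseudo-residual that serves as the regression target here is the negative gradient of the loss with respect to the current prediction. The sketch below writes it out in plain Python for the four loss names used in this module. The function name and the delta and alpha parameters are hypothetical, and the squared error is assumed to carry a factor of one half.

```python
def pseudo_residual(y, y_pred, loss="ls", delta=1.0, alpha=0.5):
    """Negative gradient of the loss with respect to the prediction.

    delta (Huber threshold) and alpha (quantile level) are hypothetical
    parameters introduced for this sketch.
    """
    residuals = [yi - pi for yi, pi in zip(y, y_pred)]

    def sign(value):
        return float((value > 0) - (value < 0))

    if loss == "ls":
        # Squared error, 0.5 * (y - f)^2: the negative gradient is the residual.
        return residuals
    if loss == "lad":
        # Absolute error: the negative gradient is the sign of the residual.
        return [sign(r) for r in residuals]
    if loss == "huber":
        # Huber: residual inside the threshold, clipped sign outside it.
        return [r if abs(r) <= delta else delta * sign(r) for r in residuals]
    if loss == "quantile":
        # Quantile (pinball) loss: alpha above the prediction, alpha - 1 otherwise.
        return [alpha if r > 0 else -(1.0 - alpha) for r in residuals]
    raise ValueError("loss must be 'ls', 'lad', 'huber', or 'quantile'")


y = [1.0, 2.0, 4.0]
y_pred = [1.5, 2.0, 1.0]
ls_target = pseudo_residual(y, y_pred, loss="ls")
lad_target = pseudo_residual(y, y_pred, loss="lad")
huber_target = pseudo_residual(y, y_pred, loss="huber", delta=1.0)
quantile_target = pseudo_residual(y, y_pred, loss="quantile", alpha=0.9)
```

A basis function is then induced by fitting the final estimator to (X, pseudo-residual) pairs.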
Fits the additive expansion in a greedy stagewise fashion.
This method transfers prior domain knowledge to gradient boosting through the
init_expansion parameter, and it is designed to be utilized within
physlearn.pipeline.ModifiedPipeline.fit(). The induced basis functions
and the learned expansion coefficients can be retrieved with the estimators_
and the coefs_ attributes, respectively.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
init_expansion (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The smooth term in the additive expansion, i.e., the initial guess in gradient boosting.
**fit_params_last_step (dict of string -> object) – Parameters passed to the estimator’s fit method during each stage.
A list of induced basis functions.
list
A list of learned expansion coefficients.
list
Notes
This greedy stagewise algorithm fits an additive expansion, which differs
from the standard additive expansion. Namely, the constant term is a random
variable, which depends upon the input example, e.g., an element in
init_expansion.
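The greedy stagewise procedure can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the library's code: the loss is squared error (so the pseudo-residual is the ordinary residual), the basis function is a one-feature least-squares line, and the expansion coefficient uses the closed-form squared-error solution instead of a numerical line search. All function names are hypothetical.

```python
def fit_line(x, target):
    """Least-squares fit of target ~ slope * x + intercept; returns a callable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_t = sum(target) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (ti - mean_t) for xi, ti in zip(x, target))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_t - slope * mean_x
    return lambda value: slope * value + intercept


def fit_stages_sketch(x, y, init_expansion, n_stages=3):
    """Greedy stagewise fit that starts from a per-example initial expansion."""
    current = list(init_expansion)
    estimators, coefs = [], []
    for _ in range(n_stages):
        # Squared error loss: the pseudo-residual is the plain residual.
        residual = [yi - fi for yi, fi in zip(y, current)]
        basis = fit_line(x, residual)
        h = [basis(xi) for xi in x]
        # Closed-form coefficient for squared error (stand-in for the line search).
        denom = sum(hi * hi for hi in h)
        coef = sum(ri * hi for ri, hi in zip(residual, h)) / denom if denom else 0.0
        current = [fi + coef * hi for fi, hi in zip(current, h)]
        estimators.append(basis)
        coefs.append(coef)
    return estimators, coefs, current


x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 * xi + 1.0 for xi in x]       # targets
init_expansion = [1.5 * xi for xi in x]  # explicit model with a slope that is too small
estimators, coefs, fitted = fit_stages_sketch(x, y, init_expansion, n_stages=2)
```

Because the initial expansion differs from example to example, the first stage only has to learn the explicit model's error, which is 0.5 * x + 1 in this toy case.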
Sequentially fits the transform(s) then the final estimator.
This method supports base boosting.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
**fit_params (dict of string -> object) – Parameters passed to the fit method of each step of the stagewise
_fit_stage method.
self – The induced pipeline object.
Helper method for parallelizing the noise term predictions.
estimator (estimator) – An estimator that follows the Scikit-learn API.
Xt (array-like of shape = [n_samples, n_features]) – The transformed design matrix.
coef (float) – The learned expansion coefficient.
**predict_params (dict of string -> object) – Parameters passed to the predict method, which is called at the end of all
transformations in the pipeline.
y_pred – A Numpy array of predictions.
ndarray
Applies transform(s) to the data, then predicts with the final estimator.
The method supports base boosting.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
**predict_params (dict of string -> object) – Parameters passed to the predict method, which is called after completing
all of the pipeline transformations.
y_pred – A pandas DataFrame or Series of predictions.
DataFrame or Series
Notes
In base boosting, we decompose the predictions in accord with Tukey’s notion of reroughing. Namely, data = smooth + rough.
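The decomposition data = smooth + rough can be illustrated with a short plain-Python sketch. It assumes the smooth term is already available as one number per example and that the rough term is the coefficient-weighted sum of the induced basis functions; the function name and the toy basis function are hypothetical.

```python
def predict_sketch(smooth, estimators, coefs, x):
    """Return smooth + rough, where rough sums the weighted basis functions."""
    rough = [sum(coef * basis(xi) for basis, coef in zip(estimators, coefs))
             for xi in x]
    return [si + ri for si, ri in zip(smooth, rough)]


x = [0.0, 1.0, 2.0]
smooth = [0.0, 1.5, 3.0]                          # predictions of the explicit model
estimators = [lambda value: 0.5 * value + 1.0]    # one induced basis function
coefs = [1.0]                                     # its learned expansion coefficient
y_pred = predict_sketch(smooth, estimators, coefs, x)
```

With an empty list of basis functions the prediction reduces to the smooth term alone.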
Computes the supervised score.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
multioutput (str, optional (default='raw_values')) – Defines how multiple output values are aggregated, wherein the string
must be either 'raw_values' or 'uniform_average'.
**predict_params (dict of string -> object) – Parameters passed to the predict method, which is called after completing
all of the pipeline transformations.
scores – The pandas object of computed scores.
pd.DataFrame or pd.Series
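The column names in the example outputs (mae, mse, rmse, r2, ev) suggest the usual regression metrics. The sketch below computes them for a single target in plain Python. Reading ev as the explained variance score is an assumption, as are the function name and the requirement that the targets are not all identical.

```python
import math


def regression_scores(y_true, y_pred):
    """Single-target regression metrics; assumes y_true is not constant."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    total_ss = sum((t - mean_true) ** 2 for t in y_true)
    # Coefficient of determination: one minus residual over total sum of squares.
    r2 = 1.0 - (mse * n) / total_ss
    # Explained variance: like r2, but the mean error is removed first.
    mean_error = sum(errors) / n
    error_var = sum((e - mean_error) ** 2 for e in errors) / n
    ev = 1.0 - error_var / (total_ss / n)
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "ev": ev}


scores = regression_scores([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
```

In a multi-target task, 'raw_values' would correspond to one such row per target, while 'uniform_average' would average the rows with equal weights.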
Attribute _pairwise was deprecated in version 0.24 and will be removed in 1.1 (renaming of 0.26).
DEPRECATED
The class labels. Only exists if the last step is a classifier.
Names of features seen during first step fit method.
Get output feature names for transformation.
Transform input features using the pipeline.
input_features (array-like of str or None, default=None) – Input features.
feature_names_out – Transformed feature names.
ndarray of str objects
Number of features seen during first step fit method.
Access the steps by name.
Read-only attribute to access any step by its given name. Keys are step names and values are the step objects.
Constructs a ModifiedPipeline from the given base estimator.
estimator (estimator) – A base estimator that follows the Scikit-learn API.
transform (str, list, tuple, or None, optional (default=None)) – Choice of transform(s). If the specified choice is a string,
then it must be a default option, where 'standardscaler',
'boxcox', 'yeojohnson', 'quantileuniform', and
'quantilenormal' denote sklearn.preprocessing.StandardScaler,
sklearn.preprocessing.PowerTransformer with method='box-cox'
or method='yeo-johnson', and sklearn.preprocessing.QuantileTransformer
with output_distribution='uniform' or output_distribution='normal',
respectively.
memory (str or object with the joblib.Memory interface) – Enables fitted transform caching.
verbose (int) – Determines verbosity.
n_jobs (int or None) – The number of jobs to run in parallel.
auto_target (bool, optional (default=True)) – Determines whether to automatically handle the pipeline steps or let the user specify the steps.
target_index (int or None) – Specifies the single-target subtask in the multi-target task.
target_type (str) – Specifies the type of target according to sklearn.utils.multiclass.type_of_target.
base_boosting_options (dict or None) –
A dictionary of base boosting options, wherein the following options must be specified:
int – The number of basis functions in the noise term of the additive expansion.
boosting_loss (str) – The loss function utilized in the pseudo-residual computation, where ‘ls’ denotes the squared error loss function, ‘lad’ denotes the absolute error loss function, ‘huber’ denotes the Huber loss function, and ‘quantile’ denotes the quantile loss function.
line_search_options (dict) – A dictionary of line search options, wherein the following options must be specified:
init_guess (int, float, or ndarray) – The initial guess for the expansion coefficient.
opt_method (str) – Choice of optimization method. If 'minimize', then scipy.optimize.minimize, else if 'basinhopping', then scipy.optimize.basinhopping.
method (str or None) – The type of solver utilized in the optimization method.
tol (float or None) – The epsilon tolerance for terminating the optimization method.
options (dict or None) – A dictionary of solver options.
niter (int or None) – The number of iterations in basin-hopping.
T (float or None) – The temperature parameter utilized in basin-hopping, which determines the accept or reject criterion.
loss (str) – The loss function utilized in the line search computation, where ‘ls’ denotes the squared error loss function, ‘lad’ denotes the absolute error loss function, ‘huber’ denotes the Huber loss function, and ‘quantile’ denotes the quantile loss function.
regularization (int or float) – The regularization strength in the line search computation.
random_state (int, RandomState instance, or None) – Determines the random number generation in
sklearn.preprocessing.QuantileTransformer, if pipeline_transform
is either `quantileuniform` or `quantilenormal`, and also in
sklearn.multioutput.RegressorChain.
n_quantiles (int or None) – Number of quantiles in sklearn.preprocessing.QuantileTransformer, if
pipeline_transform is either `quantileuniform` or `quantilenormal`.
cv (int, cross-validation generator, an iterable, or None) – Determines which targets are utilized in sklearn.multioutput.RegressorChain.
chain_order (list or None) – Determines the target order in sklearn.multioutput.RegressorChain.
pipe
See also
physlearn.pipeline.ModifiedPipeline – Class for creating a modified pipeline of transforms with a final estimator, which supports base boosting.
Examples
>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import Ridge
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.utils.multiclass import type_of_target
>>> from physlearn import make_pipeline
>>> X, y = make_regression(n_targets=3, random_state=42)
>>> X, y = pd.DataFrame(X), pd.DataFrame(y)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
...                                                     random_state=42)
>>> pipe = make_pipeline(Ridge(), 'yeojohnson',
...                      target_type=type_of_target(y))
>>> pipe.fit(X_train, y_train)
>>> pipe.score(X_test, y_test).round(decimals=2)
      mae       mse    rmse    r2    ev
0   58.68   5884.12   76.71  0.67  0.67
1  101.19  14627.70  120.95  0.36  0.36
2   96.31  14450.54  120.21  0.40  0.40
References
Alex Wozniakowski, Jayne Thompson, Mile Gu, and Felix C. Binder. “A new formulation of gradient boosting”, Machine Learning: Science and Technology, 2 045022 (2021).
Jerome Friedman. “Greedy function approximation: A gradient boosting machine,” Annals of Statistics, 29(5):1189–1232 (2001).
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. “The Elements of Statistical Learning”, Springer (2009).