The physlearn.supervised.regression module provides machine learning
utilities for solving single-target and multi-target regression tasks. It
includes the physlearn.BaseRegressor and physlearn.Regressor
classes.
Bases: BaseEstimator, RegressorMixin, AdditionalRegressorMixin
Base class for regressor amalgamation.
The object is designed to amalgamate regressors from
Scikit-learn,
LightGBM,
XGBoost,
CatBoost,
and Mlxtend into a unified framework,
which follows the Scikit-learn API. Important methods include fit,
predict, score, dump, load, cross_validate, and
cross_val_score.
regressor_choice (str, optional (default='ridge')) – Specifies the case-insensitive regressor choice.
cv (int, cross-validation generator, an iterable, or None, optional (default=5)) – Determines the cross-validation strategy if the regressor choice is stacking, if the task is multi-target regression and the single-targets are chained, and as the default in the k-fold cross-validation methods.
random_state (int, RandomState instance, or None, optional (default=0)) – Determines the random number generation in the regressor choice
mlxtend.regressor.StackingCVRegressor and in the modified
pipeline construction.
verbose (int, optional (default=0)) – Determines verbosity in either regressor choice:
mlxtend.regressor.StackingRegressor and
mlxtend.regressor.StackingCVRegressor, in the modified
pipeline construction, and in the k-fold cross-validation methods.
n_jobs (int or None, optional (default=-1)) – The number of jobs to run in parallel if the regressor choice is stacking or voting, in the modified pipeline construction, and in the k-fold cross-validation methods.
score_multioutput (str, optional (default='raw_values')) – Defines how multiple output values are aggregated in the score method;
the string must be either 'raw_values', 'uniform_average', or
'variance_weighted'.
scoring (str, callable, list/tuple, or dict, optional (default='neg_mean_absolute_error')) – Determines scoring in the k-fold cross-validation methods.
return_train_score (bool, optional (default=True)) – Determines whether to return the training scores from the k-fold cross-validation methods.
auto_target (bool, optional (default=True)) – Determines whether to automatically handle the pipeline steps or let the user specify the steps.
pipeline_transform (str, list, tuple, or None, optional (default=None)) – Choice of transform(s) used in the modified pipeline construction.
If the specified choice is a string, then it must be a default option,
where 'standardscaler', 'boxcox', 'yeojohnson', 'quantileuniform',
and 'quantilenormal' denote sklearn.preprocessing.StandardScaler,
sklearn.preprocessing.PowerTransformer with method='box-cox'
or method='yeo-johnson', and sklearn.preprocessing.QuantileTransformer
with output_distribution='uniform' or output_distribution='normal',
respectively.
pipeline_memory (str or object with the joblib.Memory interface, optional (default=None)) – Enables fitted transform caching in the modified pipeline construction.
params (dict, list, or None, optional (default=None)) – The choice of (hyper)parameters for the regressor choice. If None, then the default (hyper)parameters are utilized.
target_index (int or None, optional (default=None)) – Specifies the single-target regression subtask in the multi-target regression task.
chain_order (list or None) – Determines the target order in sklearn.multioutput.RegressorChain
during the modified pipeline construction.
stacking_options (dict or None, optional (default=None)) –
A dictionary of stacking options, whereby layers
must be specified:
dict – A dictionary of stacking layer(s).
bool or None (default=True) – Determines whether to shuffle the training data in
mlxtend.regressor.StackingCVRegressor.
bool or None (default=True) – Determines whether to clone and refit the regressors in
mlxtend.regressor.StackingCVRegressor.
bool or None (default=True) – Determines whether to concatenate the original features with
the first stacking layer predictions in
sklearn.ensemble.StackingRegressor,
mlxtend.regressor.StackingRegressor, or
mlxtend.regressor.StackingCVRegressor.
bool or None (default=True) – Determines whether to make the concatenated features
accessible through the attribute train_meta_features_
in mlxtend.regressor.StackingRegressor and
mlxtend.regressor.StackingCVRegressor.
ndarray of shape (n_regressors,) or None (default=None) – Sequence of weights for sklearn.ensemble.VotingRegressor.
base_boosting_options (dict or None, optional (default=None)) –
A dictionary of base boosting options used in the modified pipeline construction, wherein the following options must be specified:
int – The number of basis functions in the noise term of the additive expansion.
Note that this option may also be specified as n_regressors.
str – The loss function utilized in the pseudo-residual computation, where 'ls' denotes the squared error loss function, 'lad' denotes the absolute error loss function, 'huber' denotes the Huber loss function, and 'quantile' denotes the quantile loss function.
dict
int, float, or ndarray – The initial guess for the expansion coefficient.
str – Choice of optimization method. If 'minimize', then
scipy.optimize.minimize, else if 'basinhopping',
then scipy.optimize.basinhopping.
str or None – The type of solver utilized in the optimization method.
float or None – The epsilon tolerance for terminating the optimization method.
dict or None – A dictionary of solver options.
int or None – The number of iterations in basin-hopping.
float or None – The temperature parameter utilized in basin-hopping, which determines the accept or reject criterion.
str – The loss function utilized in the line search computation, where 'ls' denotes the squared error loss function, 'lad' denotes the absolute error loss function, 'huber' denotes the Huber loss function, and 'quantile' denotes the quantile loss function.
int or float – The regularization strength in the line search computation.
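The loss names listed for the pseudo-residual computation can be illustrated with a minimal NumPy sketch. It computes the negative gradient of each loss with respect to the current prediction, which is the standard gradient boosting definition of a pseudo-residual. The function name pseudo_residuals, the fixed Huber threshold delta, and the quantile level alpha are hypothetical choices for illustration, and are not taken from physlearn.

```python
import numpy as np


def pseudo_residuals(y, f, loss="ls", delta=1.0, alpha=0.5):
    """Return the negative gradient of the loss with respect to f.

    delta (Huber threshold) and alpha (quantile level) are assumed,
    illustrative parameters.
    """
    r = y - f
    if loss == "ls":
        # Squared error 0.5 * r**2: the pseudo-residual is the residual.
        return r
    if loss == "lad":
        # Absolute error |r|: the pseudo-residual is the sign of the residual.
        return np.sign(r)
    if loss == "huber":
        # Quadratic for |r| <= delta, linear in the tails.
        return np.where(np.abs(r) <= delta, r, delta * np.sign(r))
    if loss == "quantile":
        # Pinball loss at level alpha.
        return np.where(r > 0, alpha, alpha - 1.0)
    raise ValueError(f"unknown loss: {loss!r}")


y = np.array([3.0, -1.0, 2.0])
f = np.array([1.0, 1.0, 2.5])
residuals = {name: pseudo_residuals(y, f, loss=name)
             for name in ("ls", "lad", "huber", "quantile")}
```

The robust losses ('lad', 'huber', 'quantile') bound the magnitude of each pseudo-residual, so a single outlying example has limited influence on the next basis function.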
Notes
The score method differs from the Scikit-learn usage, as the method is designed
to abstract the regressor metrics, e.g., sklearn.metrics.mean_absolute_error.
See also
physlearn.pipeline.ModifiedPipeline – Class for creating a pipeline.
physlearn.supervised.regression.Regressor – Main class for regressor amalgamation.
Examples
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> from sklearn.model_selection import train_test_split
>>> from physlearn import BaseRegressor
>>> X, y = load_boston(return_X_y=True)
>>> X, y = pd.DataFrame(X), pd.Series(y)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
...                                                     random_state=42)
>>> reg = BaseRegressor(regressor_choice='lgbmregressor',
...                     pipeline_transform='standardscaler')
>>> y_pred = reg.fit(X_train, y_train).predict(X_test)
>>> reg.score(y_test, y_pred)
array([11.63706835])
Checks if regressor adheres to scikit-learn conventions.
Namely, it runs sklearn.utils.estimator_checks.check_estimator.
Retrieves the (hyper)parameters.
deep (bool, optional (default=True)) – This parameter is unused, but it is retained because various Scikit-learn utilities require it.
self.params – (Hyper)parameter names mapped to their values.
dict
Sets the regressor’s (hyper)parameters.
**params (dict) – The regressor’s (hyper)parameters.
self – The base regressor object.
Checks the validity of the data representation(s).
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
out
validated data
Serializes the value with joblib.
value (any Python object) – The object to store to disk.
filename (str, joblib.pathlib.Path, or file object) – The file object or path of the file.
filenames – The list of file names in which the data is stored.
list of str
Deserializes the file object.
filename (str, joblib.pathlib.Path, or file object) – The file object or path of the file.
joblib.load – The object stored in the file.
any Python object
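The serialization and deserialization methods above are described as wrappers around joblib. The underlying round trip can be sketched directly with joblib.dump and joblib.load. The dictionary model_state and the file name regressor.joblib are placeholders for illustration; in practice the stored value would typically be a fitted regressor.

```python
import os
import tempfile

import joblib

# Placeholder object standing in for a fitted regressor.
model_state = {"regressor_choice": "ridge", "alpha": 0.1}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "regressor.joblib")
    # joblib.dump returns the list of file names the data was written to.
    filenames = joblib.dump(model_state, path)
    # joblib.load reconstructs the stored Python object.
    restored = joblib.load(path)
```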
Creates pipe attribute for downstream tasks.
This method constructs a ModifiedPipeline from the given base regressor.
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the
column(s) correspond to the single-target(s). The targets are used to
determine the type of the target, and the number of samples if the
pipeline_transform involves quantile transformers.
n_quantiles (int or None, optional (default=None)) – Number of quantiles in sklearn.preprocessing.QuantileTransformer, if
pipeline_transform is either 'quantileuniform' or 'quantilenormal'.
A ModifiedPipeline object.
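The string options documented for pipeline_transform map onto Scikit-learn transformers, and the number of samples matters for the quantile transformers because n_quantiles cannot usefully exceed it. The following is a minimal sketch of such a mapping, not the physlearn implementation: the helper make_transform and its fallback of 1000 quantiles (the Scikit-learn default) are assumptions.

```python
import numpy as np
from sklearn.preprocessing import (PowerTransformer, QuantileTransformer,
                                   StandardScaler)


def make_transform(name, n_samples, n_quantiles=None):
    """Hypothetical helper mapping a default option name to a transformer."""
    name = name.lower()
    if name == "standardscaler":
        return StandardScaler()
    if name == "boxcox":
        return PowerTransformer(method="box-cox")
    if name == "yeojohnson":
        return PowerTransformer(method="yeo-johnson")
    if name in ("quantileuniform", "quantilenormal"):
        if n_quantiles is None:
            n_quantiles = 1000  # Scikit-learn's default
        # Cap the number of quantiles at the number of samples.
        n_quantiles = min(n_quantiles, n_samples)
        dist = "uniform" if name == "quantileuniform" else "normal"
        return QuantileTransformer(n_quantiles=n_quantiles,
                                   output_distribution=dist)
    raise ValueError(f"unknown transform: {name!r}")


X = np.random.default_rng(0).normal(size=(20, 2))
transform = make_transform("quantilenormal", n_samples=len(X))
X_t = transform.fit_transform(X)
```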
Gets a regressor’s attribute from the ModifiedPipeline object.
The pipe attribute must exist in order to use this method.
attr (str) – The name of the regressor’s attribute.
attr
type of attribute
Automates subtask slicing in multi-target regression.
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the
column(s) correspond to the single-target(s). The targets are used to
determine the type of the target, and the number of samples if the
pipeline_transform involves quantile transformers.
y
array-like of shape = [n_samples] or shape = [n_samples, n_targets]
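Subtask slicing can be pictured as selecting one column of the target matrix. The sketch below assumes that target_index is a positional column index and that a value of None leaves the targets unchanged; the helper name slice_target is hypothetical, and the actual physlearn behavior may differ.

```python
import numpy as np
import pandas as pd


def slice_target(y, target_index=None):
    """Hypothetical helper: select one single-target from a target matrix."""
    if target_index is None:
        return y
    if isinstance(y, pd.DataFrame):
        # Positional column selection keeps the pandas index.
        return y.iloc[:, target_index]
    y = np.asarray(y)
    # One-dimensional targets are already single-target.
    return y[:, target_index] if y.ndim == 2 else y


Y = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
second_target = slice_target(Y, target_index=1)
```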
Helper fit method.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
sample_weight (float, ndarray, or None, optional (default=None)) – Individual weights for each example. If the weight is a float, then every example will have the same weight.
**fit_params (dict of string -> object) – If base boosting, then these parameters are passed to the stagewise
_fit_stages method.
Fits the ModifiedPipeline object.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
sample_weight (float, ndarray, or None, optional (default=None)) – Individual weights for each example. If the weight is a float, then every example will have the same weight.
self.pipe – The induced pipeline object.
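The sample_weight description states that a float gives every example the same weight. A minimal NumPy sketch of that broadcasting rule follows; the helper expand_sample_weight is hypothetical and only illustrates the documented behavior.

```python
import numpy as np


def expand_sample_weight(sample_weight, n_samples):
    """Hypothetical helper: turn a float weight into one weight per example."""
    if sample_weight is None:
        return None
    if np.isscalar(sample_weight):
        # A single float is repeated for every example.
        return np.full(n_samples, float(sample_weight))
    weights = np.asarray(sample_weight, dtype=float)
    if weights.shape != (n_samples,):
        raise ValueError("sample_weight must have shape (n_samples,)")
    return weights


uniform = expand_sample_weight(2.0, n_samples=3)
explicit = expand_sample_weight([1.0, 0.5, 0.0], n_samples=3)
```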
Generates predictions with the ModifiedPipeline object.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y_pred – The predictions generated by the induced ModifiedPipeline object.
array-like of shape = [n_samples] or shape = [n_samples, n_targets]
Computes the supervised score.
y_true (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The observed target matrix, where each row corresponds to an example and the column(s) correspond to the observed single-target(s).
y_pred (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The predicted target matrix, where each row corresponds to an example and the column(s) correspond to the predicted single-target(s).
scoring (str, optional (default='mse')) – The scoring name, which may be mae, mse, rmse, r2, ev, or msle.
multioutput (str, optional (default='raw_values')) – Defines how multiple output values are aggregated; the string
must be either 'raw_values', 'uniform_average', or
'variance_weighted'.
score – The computed score.
float or ndarray of floats
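The scoring names and the multioutput options can be illustrated with the corresponding functions in sklearn.metrics. This is a sketch of the abstraction under the assumption that each name maps to the metric of the same abbreviation; the METRICS dictionary and the standalone score function are illustrative. Note that in Scikit-learn, 'variance_weighted' is only accepted by r2_score and explained_variance_score.

```python
import numpy as np
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, mean_squared_log_error,
                             r2_score)

# Assumed mapping from scoring names to Scikit-learn metrics.
METRICS = {
    "mae": mean_absolute_error,
    "mse": mean_squared_error,
    # Square root of the (possibly aggregated) mean squared error.
    "rmse": lambda yt, yp, multioutput: np.sqrt(
        mean_squared_error(yt, yp, multioutput=multioutput)),
    "r2": r2_score,
    "ev": explained_variance_score,
    "msle": mean_squared_log_error,
}


def score(y_true, y_pred, scoring="mse", multioutput="raw_values"):
    return METRICS[scoring](y_true, y_pred, multioutput=multioutput)


y_true = np.array([[1.0, 2.0], [3.0, 4.0]])
y_pred = np.array([[1.0, 3.0], [5.0, 4.0]])

mae_per_target = score(y_true, y_pred, scoring="mae")
mse_per_target = score(y_true, y_pred, scoring="mse")
mae_average = score(y_true, y_pred, scoring="mae",
                    multioutput="uniform_average")
```

With 'raw_values' the result has one entry per single-target, whereas 'uniform_average' collapses the entries into a single float.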
Helper method to estimate cross-validation fold size.
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
cv (int, cross-validation generator, or an iterable) – Used in order to determine the fold size.
estimate
int
Performs (augmented) cross-validation.
If return_incumbent_score is True, then the incumbent is scored
on the withheld folds. Otherwise, the behavior is the same as in
Scikit-learn.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
return_regressor (bool, optional (default=False)) – Determines whether to return the induced regressor.
error_score ('raise' or numeric, optional (default=np.nan)) – The assigned value if an error occurs while inducing a regressor. If set to ‘raise’, then the specific error is raised. Else if set to a numeric value, then FitFailedWarning is raised.
return_incumbent_score (bool, optional (default=True)) – Determines whether to score the incumbent on the withheld folds, whereby the incumbent is assumed to be an example in the design matrix.
cv (int, cross-validation generator, an iterable, or None, optional (default=None)) – Determines the cross-validation strategy. If None, then the default is 5-fold cross-validation.
fit_params (dict, optional (default=None)) – (Hyper)parameters to pass to the regressor’s fit method.
scores – Array of scores for each run of the cross-validation procedure.
dict of float arrays of shape (n_splits,)
References
Alex Wozniakowski, Jayne Thompson, Mile Gu, and Felix C. Binder. “A new formulation of gradient boosting”, Machine Learning: Science and Technology, 2 045022 (2021).
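When the incumbent is not scored, the behavior is stated to match Scikit-learn. For reference, the following sketch shows the plain Scikit-learn call that returns a dict of arrays of shape (n_splits,). Ridge and the synthetic data from make_regression are stand-ins chosen for illustration, and the keyword values mirror the defaults documented above where they exist.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic single-target data, used only for illustration.
X, y = make_regression(n_samples=50, n_features=3, noise=0.5, random_state=0)

scores = cross_validate(
    Ridge(),
    X,
    y,
    scoring="neg_mean_absolute_error",
    cv=5,                     # 5-fold cross-validation
    return_train_score=True,  # adds the 'train_score' key
    error_score="raise",      # re-raise any error from fitting
)
```

The returned dict contains 'fit_time', 'score_time', 'test_score', and 'train_score', each with one entry per fold.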
Performs (augmented) cross-validation, and wraps the result in a DataFrame.
If return_incumbent_score is True, then the incumbent is scored
on the withheld folds. Otherwise, the behavior is the same as in
Scikit-learn.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
return_regressor (bool, optional (default=False)) – Determines whether to return the induced regressor.
error_score ('raise' or numeric, optional (default=np.nan)) – The assigned value if an error occurs while inducing a regressor. If set to ‘raise’, then the specific error is raised. Else if set to a numeric value, then FitFailedWarning is raised.
return_incumbent_score (bool, optional (default=True)) – Determines whether to score the incumbent on the withheld folds, whereby the incumbent is assumed to be an example in the design matrix.
cv (int, cross-validation generator, an iterable, or None, optional (default=None)) – Determines the cross-validation strategy. If None, then the default is 5-fold cross-validation.
fit_params (dict, optional (default=None)) – (Hyper)parameters to pass to the regressor’s fit method.
scores – DataFrame of scores for each run of the cross-validation procedure.
pd.DataFrame
Notes
Scikit-learn returns negative scores for some metrics, such as mean absolute error (MAE) or mean squared error (MSE). However, we only return nonnegative scores.
References
Alex Wozniakowski, Jayne Thompson, Mile Gu, and Felix C. Binder. “A new formulation of gradient boosting”, Machine Learning: Science and Technology, 2 045022 (2021).
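The sign convention in the Notes can be sketched as a post-processing step on a Scikit-learn style results dict. The helper to_nonnegative_frame is hypothetical, the fold values are toy numbers, and the rule of negating only scorers whose name starts with 'neg_' is an assumption about how the conversion could be done.

```python
import pandas as pd


def to_nonnegative_frame(cv_results, scoring):
    """Hypothetical helper: wrap results in a DataFrame with errors >= 0."""
    frame = pd.DataFrame(cv_results)
    if scoring.startswith("neg_"):
        # Matches 'test_score' and 'train_score', but not 'score_time'.
        cols = [c for c in frame.columns if c.endswith("_score")]
        frame[cols] = -frame[cols]
    return frame


# Toy results in the layout returned by sklearn.model_selection.cross_validate.
toy_results = {
    "fit_time": [0.01, 0.01, 0.02],
    "score_time": [0.001, 0.001, 0.001],
    "test_score": [-1.5, -2.0, -1.0],
    "train_score": [-1.0, -1.2, -0.9],
}
frame = to_nonnegative_frame(toy_results, scoring="neg_mean_absolute_error")
```

Metrics such as R^2 are left untouched by this rule, since they are not reported with a 'neg_' prefix and may legitimately be negative.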
Performs (augmented) cross-validation, then returns the withheld fold score.
If return_incumbent_score is True, then the incumbent is scored
on the withheld folds. Otherwise, the behavior is the same as in
Scikit-learn.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
error_score ('raise' or numeric, optional (default=np.nan)) – The assigned value if an error occurs while inducing a regressor. If set to ‘raise’, then the specific error is raised. Else if set to a numeric value, then FitFailedWarning is raised.
return_incumbent_score (bool, optional (default=True)) – Determines whether to score the incumbent on the withheld folds, whereby the incumbent is assumed to be an example in the design matrix.
cv (int, cross-validation generator, an iterable, or None, optional (default=None)) – Determines the cross-validation strategy. If None, then the default is 5-fold cross-validation.
fit_params (dict, optional (default=None)) – (Hyper)parameters to pass to the regressor’s fit method.
scores – The withheld fold scores for each run of the cross-validation procedure.
pd.Series or pd.DataFrame
Notes
Scikit-learn returns negative scores for some metrics, such as mean absolute error (MAE) or mean squared error (MSE). However, we only return nonnegative scores.
References
Alex Wozniakowski, Jayne Thompson, Mile Gu, and Felix C. Binder. “A new formulation of gradient boosting”, Machine Learning: Science and Technology, 2 045022 (2021).
Bases: BaseRegressor
Main class for regressor amalgamation.
The object is designed to amalgamate regressors from
Scikit-learn,
LightGBM,
XGBoost,
CatBoost,
and Mlxtend into a unified framework,
which follows the Scikit-learn API. Important methods include fit,
predict, score, baseboostcv, search, dump, load,
cross_val_score, and nested_cross_validate.
regressor_choice (str, optional (default='ridge')) – Specifies the case-insensitive regressor choice.
cv (int, cross-validation generator, an iterable, or None, optional (default=5)) – Determines the cross-validation strategy if the regressor choice is stacking, if the task is multi-target regression and the single-targets are chained, and as the default in the k-fold cross-validation methods.
random_state (int, RandomState instance, or None, optional (default=0)) – Determines the random number generation in the regressor choice
mlxtend.regressor.StackingCVRegressor and in the modified
pipeline construction.
verbose (int, optional (default=1)) – Determines verbosity in either regressor choice:
mlxtend.regressor.StackingRegressor and
mlxtend.regressor.StackingCVRegressor, in the modified
pipeline construction, and in the k-fold cross-validation methods.
n_jobs (int or None, optional (default=-1)) – The number of jobs to run in parallel if the regressor choice is stacking or voting, in the modified pipeline construction, and in the k-fold cross-validation methods.
score_multioutput (str, optional (default='raw_values')) – Defines how multiple output values are aggregated in the score method;
the string must be either 'raw_values', 'uniform_average', or
'variance_weighted'.
scoring (str, callable, list/tuple, or dict, optional (default='neg_mean_absolute_error')) – Determines scoring in the k-fold cross-validation methods.
refit (bool, optional (default=True)) – Determines whether to return the refit regressor in the search method.
randomizedcv_n_iter (int, optional (default=20)) – Determines the number of (hyper)parameter settings that are
sampled in the search method, when the chosen search is
'randomizedsearchcv', i.e., RandomizedSearchCV from
Scikit-learn.
bayesoptcv_init_points (int, optional (default=2)) – Determines the number of random exploration steps in the search method,
when the chosen search method is 'bayesoptcv', i.e., Bayesian
Optimization.
Increasing the number corresponds to diversifying the exploration
space.
bayesoptcv_n_iter (int, optional (default=20)) –
Determines the number of Bayesian optimization steps in the search method,
when the chosen search method is 'bayesoptcv', i.e., Bayesian
Optimization.
return_train_score (bool, optional (default=True)) – Determines whether to return the training scores from the k-fold cross-validation methods.
pipeline_transform (str, list, tuple, or None, optional (default='quantilenormal')) – Choice of transform(s) used in the modified pipeline construction.
If the specified choice is a string, then it must be a default option,
where 'standardscaler', 'boxcox', 'yeojohnson', 'quantileuniform',
and 'quantilenormal' denote sklearn.preprocessing.StandardScaler,
sklearn.preprocessing.PowerTransformer with method='box-cox'
or method='yeo-johnson', and sklearn.preprocessing.QuantileTransformer
with output_distribution='uniform' or output_distribution='normal',
respectively.
pipeline_memory (str or object with the joblib.Memory interface, optional (default=None)) – Enables fitted transform caching in the modified pipeline construction.
params (dict, list, or None, optional (default=None)) – The choice of (hyper)parameters for the regressor choice. If None, then the default (hyper)parameters are utilized.
target_index (int or None, optional (default=None)) – Specifies the single-target regression subtask in the multi-target regression task.
chain_order (list or None) – Determines the target order in sklearn.multioutput.RegressorChain
during the modified pipeline construction.
stacking_options (dict or None, optional (default=None)) –
A dictionary of stacking options, whereby layers
must be specified:
dict – A dictionary of stacking layer(s).
bool or None (default=True) – Determines whether to shuffle the training data in
mlxtend.regressor.StackingCVRegressor.
bool or None (default=True) – Determines whether to clone and refit the regressors in
mlxtend.regressor.StackingCVRegressor.
bool or None (default=True) – Determines whether to concatenate the original features with
the first stacking layer predictions in
sklearn.ensemble.StackingRegressor,
mlxtend.regressor.StackingRegressor, or
mlxtend.regressor.StackingCVRegressor.
bool or None (default=True) – Determines whether to make the concatenated features
accessible through the attribute train_meta_features_
in mlxtend.regressor.StackingRegressor and
mlxtend.regressor.StackingCVRegressor.
ndarray of shape (n_regressors,) or None (default=None) – Sequence of weights for sklearn.ensemble.VotingRegressor.
base_boosting_options (dict or None, optional (default=None)) –
A dictionary of base boosting options used in the modified pipeline construction, wherein the following options must be specified:
int – The number of basis functions in the noise term of the additive expansion.
Note that this option may also be specified as n_regressors.
str – The loss function utilized in the pseudo-residual computation, where 'ls' denotes the squared error loss function, 'lad' denotes the absolute error loss function, 'huber' denotes the Huber loss function, and 'quantile' denotes the quantile loss function.
dict
int, float, or ndarray – The initial guess for the expansion coefficient.
str – Choice of optimization method. If 'minimize', then
scipy.optimize.minimize, else if 'basinhopping',
then scipy.optimize.basinhopping.
str or None – The type of solver utilized in the optimization method.
float or None – The epsilon tolerance for terminating the optimization method.
dict or None – A dictionary of solver options.
int or None – The number of iterations in basin-hopping.
float or None – The temperature parameter utilized in basin-hopping, which determines the accept or reject criterion.
str – The loss function utilized in the line search computation, where 'ls' denotes the squared error loss function, 'lad' denotes the absolute error loss function, 'huber' denotes the Huber loss function, and 'quantile' denotes the quantile loss function.
int or float – The regularization strength in the line search computation.
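The line search options describe finding an expansion coefficient with scipy.optimize.minimize, starting from an initial guess and subject to a regularization strength. The sketch below shows one such search for the squared error loss. The function line_search, the L2 form of the penalty, and the choice of the BFGS solver are assumptions for illustration; physlearn may regularize differently.

```python
import numpy as np
from scipy.optimize import minimize


def line_search(y, f, h, alpha=0.0, init=1.0, solver="BFGS", tol=1e-8):
    """Hypothetical helper: find rho minimizing the loss of f + rho * h."""

    def objective(rho):
        resid = y - (f + rho[0] * h)
        # Squared error loss plus an assumed L2 penalty on the coefficient.
        return 0.5 * np.sum(resid ** 2) + 0.5 * alpha * rho[0] ** 2

    result = minimize(objective, x0=np.array([init]), method=solver, tol=tol)
    return result.x[0]


y = np.array([3.0, -1.0, 2.0])   # observed targets
f = np.array([1.0, 1.0, 2.5])    # current predictions
h = np.array([1.0, -1.0, 0.0])   # predictions of the new basis function

rho_plain = line_search(y, f, h, alpha=0.0)
rho_regularized = line_search(y, f, h, alpha=2.0)
```

For this loss the minimizer has the closed form rho = <y - f, h> / (<h, h> + alpha), which gives 2 without regularization and 1 with alpha=2, so the penalty shrinks the coefficient toward zero.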
Notes
The score method differs from the Scikit-learn usage, as the method is designed
to abstract the regressor metrics, e.g., sklearn.metrics.mean_absolute_error.
Moreover, it computes multiple metrics, and returns the scores in a pandas object.
See also
physlearn.pipeline.ModifiedPipeline – Class for creating a pipeline.
physlearn.supervised.regression.BaseRegressor – Base class for regressor amalgamation.
Examples
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> from sklearn.decomposition import PCA, TruncatedSVD
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.pipeline import FeatureUnion
>>> from physlearn import Regressor
>>> X, y = load_boston(return_X_y=True)
>>> X, y = pd.DataFrame(X), pd.Series(y)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
...                                                     random_state=42)
>>> transformer_list = [('pca', PCA(n_components=1)),
...                     ('svd', TruncatedSVD(n_components=2))]
>>> union = FeatureUnion(transformer_list=transformer_list, n_jobs=-1)
>>> stack = dict(regressors=['kneighborsregressor', 'bayesianridge'],
...              final_regressor='lasso')
>>> reg = Regressor(regressor_choice='stackingregressor',
...                 pipeline_transform=('tr', union),
...                 stacking_options=dict(layers=stack))
>>> y_pred = reg.fit(X_train, y_train).predict(X_test)
>>> reg.score(y_test, y_pred)
             mae        mse      rmse        r2       ev      msle
target
0       4.775145  42.874253  6.547843  0.387748  0.40836  0.079818
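The randomizedcv_n_iter and refit parameters documented for this class correspond to the n_iter and refit arguments of RandomizedSearchCV in Scikit-learn. The following sketch calls Scikit-learn directly rather than the search method, whose signature is not shown here. Ridge, the log-uniform range for alpha, and the synthetic data are illustrative assumptions.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic single-target data, used only for illustration.
X, y = make_regression(n_samples=60, n_features=4, noise=0.1, random_state=0)

search = RandomizedSearchCV(
    estimator=Ridge(),
    param_distributions={"alpha": loguniform(1e-3, 1e1)},
    n_iter=5,                          # number of sampled settings
    scoring="neg_mean_absolute_error",
    cv=3,
    refit=True,                        # refit the best setting on all data
    random_state=0,
)
search.fit(X, y)
best_alpha = search.best_params_["alpha"]
```

Because refit=True, the fitted search object exposes best_estimator_, which can be used for prediction without a further call to fit.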
Checks if regressor adheres to scikit-learn conventions.
Namely, it runs sklearn.utils.estimator_checks.check_estimator.
Scikit-learn and Mlxtend stacking regressors, as well as LightGBM,
XGBoost, and CatBoost regressors, do not adhere to these conventions.
Retrieves the (hyper)parameters.
deep (bool, optional (default=True)) – This parameter is unused, but it is retained because various Scikit-learn utilities require it.
self.params – (Hyper)parameter names mapped to their values.
dict
Sets the regressor’s (hyper)parameters.
**params (dict) – The regressor’s (hyper)parameters.
self – The base regressor object.
Serializes the value with joblib.
value (any Python object) – The object to store to disk.
filename (str, joblib.pathlib.Path, or file object) – The file object or path of the file.
filenames – The list of file names in which the data is stored.
list of str
Deserializes the file object.
filename (str, joblib.pathlib.Path, or file object) – The file object or path of the file.
joblib.load – The object stored in the file.
any Python object
Gets a regressor’s attribute from the ModifiedPipeline object.
The pipe attribute must exist in order to use this method.
attr (str) – The name of the regressor’s attribute.
attr
type of attribute
Fits the ModifiedPipeline object.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
sample_weight (float, ndarray, or None, optional (default=None)) – Individual weights for each example. If the weight is a float, then every example will have the same weight.
self.pipe – The induced pipeline object.
Performs augmented cross-validation.
This method is designed to be utilized within
physlearn.supervised.regression.Regressor.baseboostcv(),
as the inbuilt model selection step.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
This flag implies that the incumbent won the inbuilt model selection step.
bool
None
References
Alex Wozniakowski, Jayne Thompson, Mile Gu, and Felix C. Binder. “A new formulation of gradient boosting”, Machine Learning: Science and Technology, 2 045022 (2021).
Base boosting with inbuilt cross-validation.
This method starts with inbuilt cross-validation, which scores both the incumbent and the candidate base boosting algorithm. If the incumbent wins, then the explicit model of the domain is the single-target regressor. Otherwise, base boosting greedily boosts the explicit model of the domain in a stagewise fashion.
In essence, this method acts as a fit method.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
**fit_params (dict of string -> object) – If base boosting, then these parameters are passed to the stagewise
_fit_stages method.
This flag implies that the incumbent won the inbuilt model selection step, and it notifies the predict method.
bool
single-target regressor
References
Alex Wozniakowski, Jayne Thompson, Mile Gu, and Felix C. Binder. “A new formulation of gradient boosting”, Machine Learning: Science and Technology, 2 045022 (2021).
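The inbuilt model selection step can be pictured as a comparison of withheld-fold errors between the incumbent and a candidate. The sketch below is an interpretation under stated assumptions, not the physlearn implementation: the incumbent's predictions are taken to be a column of the design matrix, LinearRegression stands in for the candidate base boosting algorithm, mean absolute error is the comparison metric, and ties go to the incumbent. The function incumbent_wins and its arguments are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold


def incumbent_wins(X, y, candidate, incumbent_column=0, n_splits=5):
    """Hypothetical helper: compare incumbent and candidate on withheld folds."""
    incumbent_scores, candidate_scores = [], []
    for train, test in KFold(n_splits=n_splits).split(X):
        # Assumption: the incumbent's predictions are stored in X.
        incumbent_scores.append(
            mean_absolute_error(y[test], X[test, incumbent_column]))
        # Fit a fresh copy of the candidate on the training folds.
        model = clone(candidate).fit(X[train], y[train])
        candidate_scores.append(
            mean_absolute_error(y[test], model.predict(X[test])))
    # Assumption: ties are resolved in favor of the incumbent.
    return bool(np.mean(incumbent_scores) <= np.mean(candidate_scores))


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# The incumbent column reproduces the target exactly, so it cannot lose.
flag_good = incumbent_wins(X, X[:, 0], LinearRegression())
# The target ignores the incumbent column, so the candidate wins.
flag_bad = incumbent_wins(X, 3.0 * X[:, 1] + 1.0, LinearRegression())
```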
Generates predictions with the ModifiedPipeline object.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y_pred – The predictions generated by the induced ModifiedPipeline object.
array-like of shape = [n_samples] or shape = [n_samples, n_targets]
References
Alex Wozniakowski, Jayne Thompson, Mile Gu, and Felix C. Binder. “A new formulation of gradient boosting”, Machine Learning: Science and Technology, 2 045022 (2021).
Computes the DataFrame of supervised scores.
The scoring metrics include mean squared error, mean absolute error, root mean squared error, R^2, explained variance, and mean squared logarithmic error. If the observed or predicted single-targets contain negative values, then the mean squared logarithmic error is not included, as the score is considered a NaN.
y_true (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The observed target matrix, where each row corresponds to an example and the column(s) correspond to the observed single-target(s).
y_pred (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The predicted target matrix, where each row corresponds to an example and the column(s) correspond to the predicted single-target(s).
path (str or file handle, optional (default=None)) – The file path or object, if the scoring DataFrame is to be saved to a comma-separated values (csv) file.
scores – The pandas object of computed scores.
pd.DataFrame or pd.Series
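The rule for omitting the mean squared logarithmic error can be sketched with NumPy and pandas for the single-target case. The helper score_frame is hypothetical and computes only a subset of the documented metrics (mae, mse, rmse, and msle); it adds msle only when neither the observed nor the predicted targets contain negative values.

```python
import numpy as np
import pandas as pd


def score_frame(y_true, y_pred, path=None):
    """Hypothetical helper: one-row DataFrame of single-target scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    scores = {"mae": np.mean(np.abs(err)), "mse": np.mean(err ** 2)}
    scores["rmse"] = np.sqrt(scores["mse"])
    # The logarithmic error is only defined for nonnegative values.
    if (y_true >= 0).all() and (y_pred >= 0).all():
        log_err = np.log1p(y_true) - np.log1p(y_pred)
        scores["msle"] = np.mean(log_err ** 2)
    frame = pd.DataFrame(scores, index=pd.Index([0], name="target"))
    if path is not None:
        frame.to_csv(path)  # optionally save as comma-separated values
    return frame


with_msle = score_frame([1.0, 2.0], [1.0, 3.0])
without_msle = score_frame([-1.0, 2.0], [1.0, 3.0])
```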
Performs (augmented) cross-validation, and wraps the result in a DataFrame.
If return_incumbent_score is True, then the incumbent is scored
on the withheld folds. Otherwise, the behavior is the same as in
Scikit-learn.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
return_regressor (bool, optional (default=False)) – Determines whether to return the induced regressor.
error_score ('raise' or numeric, optional (default=np.nan)) – The assigned value if an error occurs while inducing a regressor. If set to ‘raise’, then the specific error is raised. Else if set to a numeric value, then FitFailedWarning is raised.
return_incumbent_score (bool, optional (default=True)) – Determines whether to score the incumbent on the withheld folds, whereby the incumbent is assumed to be an example in the design matrix.
cv (int, cross-validation generator, an iterable, or None, optional (default=None)) – Determines the cross-validation strategy. If None, then the default is 5-fold cross-validation.
fit_params (dict, optional (default=None)) – (Hyper)parameters to pass to the regressor’s fit method.
scores – DataFrame of scores for each run of the cross-validation procedure.
pd.DataFrame
Notes
Scikit-learn returns negative scores for some metrics, such as mean absolute error (MAE) or mean squared error (MSE). However, we only return nonnegative scores.
References
Alex Wozniakowski, Jayne Thompson, Mile Gu, and Felix C. Binder. “A new formulation of gradient boosting”, Machine Learning: Science and Technology, 2 045022 (2021).
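The augmented cross-validation described above can be sketched in plain Python. This is an illustration under stated assumptions, not the physlearn implementation. It assumes that the design matrix holds the incumbent's prediction for each example, so the incumbent can be scored directly against the withheld targets. The challenger model, the helper names, and the keys test_score and incumbent_score are all hypothetical.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Assumption: X holds the incumbent's prediction for each example.
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.5, 2.5, 3.5, 4.5, 5.5, 6.5]

rows = []
for train, test in kfold_indices(len(y), n_splits=3):
    # Toy challenger: shift the incumbent by the mean training residual.
    shift = sum(y[i] - X[i] for i in train) / len(train)
    challenger_pred = [X[i] + shift for i in test]
    incumbent_pred = [X[i] for i in test]
    y_test = [y[i] for i in test]
    rows.append({
        "test_score": mae(y_test, challenger_pred),
        "incumbent_score": mae(y_test, incumbent_pred),
    })
```

Each entry of rows pairs the withheld fold score of the induced regressor with the score of the incumbent on the same fold, so the two can be compared fold by fold.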
Performs (augmented) cross-validation, then returns the withheld fold score.
If return_incumbent_score is True, then the incumbent is scored
on the withheld folds. Otherwise, the behavior is the same as in
Scikit-learn.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
error_score ('raise' or numeric, optional (default=np.nan)) – The assigned value if an error occurs while inducing a regressor. If set to 'raise', then the specific error is raised. If set to a numeric value, then a FitFailedWarning is issued.
return_incumbent_score (bool, optional (default=True)) – Determines whether to score the incumbent on the withheld folds, whereby the incumbent is assumed to be an example in the design matrix.
cv (int, cross-validation generator, an iterable, or None, optional (default=None)) – Determines the cross-validation strategy. If None, then the default is 5-fold cross-validation.
fit_params (dict, optional (default=None)) – (Hyper)parameters to pass to the regressor’s fit method.
scores – The withheld fold scores for each run of the cross-validation procedure.
pd.Series or pd.DataFrame
Notes
Scikit-learn returns negative scores for some metrics, such as mean absolute error (MAE) or mean squared error (MSE). However, we only return nonnegative scores.
References
Alex Wozniakowski, Jayne Thompson, Mile Gu, and Felix C. Binder. “A new formulation of gradient boosting”, Machine Learning: Science and Technology, 2 045022 (2021).
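The error_score behavior can be illustrated with a small sketch. It assumes that a failed fit on one fold is either re-raised or replaced by the given numeric value together with a warning. The function fit_and_score is hypothetical, and a generic warning stands in for Scikit-learn's FitFailedWarning.

```python
import math
import warnings


def fit_and_score(fit, score, error_score=math.nan):
    """Fit then score one fold; fall back to error_score if fitting fails."""
    try:
        model = fit()
    except Exception as exc:
        if error_score == "raise":
            raise
        # Generic warning used here in place of FitFailedWarning.
        warnings.warn(f"Fit failed; the score is set to {error_score}: {exc!r}")
        return error_score
    return score(model)


def failing_fit():
    """Toy fit routine that always fails."""
    raise ValueError("singular design matrix")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = fit_and_score(failing_fit, score=lambda model: 0.0, error_score=-1.0)
```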
Helper method for preprocessing (hyper)parameters.
This method automatically preprocesses (hyper)parameter names for the exhaustive search method by determining whether the task is single-target or multi-target regression. In the latter case, it further determines the user's assumption about the independence of the single-targets, namely whether the user wishes to chain them.
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
search_params (dict) – Dictionary with (hyper)parameter names as keys, and either lists of (hyper)parameter settings to try as values or tuples of (hyper)parameter lower and upper bounds to try as values.
search_params – The preprocessed (hyper)parameters.
dict
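One way to picture this preprocessing is as name prefixing, following the Scikit-learn convention that nested parameters are addressed as step__parameter. The sketch below is an assumption about how such prefixing might look. The step name 'reg', the wrapper attribute names, and the function signature are illustrative and are not confirmed physlearn internals.

```python
def preprocess_search_params(search_params, n_targets, chain=False):
    """Prefix parameter names so that they address the wrapped regressor.

    The prefixes used here are hypothetical and chosen for illustration.
    """
    if n_targets == 1:
        prefix = "reg__"  # single-target: the pipeline step itself
    elif chain:
        prefix = "reg__base_estimator__"  # multi-target, chained single-targets
    else:
        prefix = "reg__estimator__"  # multi-target, independent single-targets
    return {prefix + name: values for name, values in search_params.items()}


single = preprocess_search_params({"alpha": [0.1, 1.0]}, n_targets=1)
chained = preprocess_search_params({"alpha": [0.1, 1.0]}, n_targets=3, chain=True)
```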
Helper (hyper)parameter search method.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
search_params (dict) – Dictionary with (hyper)parameter names as keys, and either lists of (hyper)parameter settings to try as values or tuples of (hyper)parameter lower and upper bounds to try as values.
search_method (str, optional (default='gridsearchcv')) – Specifies the search method. If 'gridsearchcv', 'randomizedsearchcv',
or 'bayesoptcv', then the search method is GridSearchCV, RandomizedSearchCV,
or Bayesian Optimization, respectively.
cv (int, cross-validation generator, an iterable, or None, optional (default=None)) – Determines the cross-validation strategy. If None, then the default is 5-fold cross-validation.
An instance of the (hyper)parameter search object.
GridSearchCV, RandomizedSearchCV, BayesianOptimization
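The two accepted shapes of search_params values can be illustrated as follows. Lists enumerate the settings to try, as in a grid search, whereas tuples give lower and upper bounds. The helpers expand_grid and is_bounds are hypothetical and only sketch how the distinction might be handled.

```python
import itertools


def expand_grid(search_params):
    """Return every combination of the listed settings as a list of dicts."""
    names = sorted(search_params)
    value_lists = [search_params[name] for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]


def is_bounds(values):
    """Tuples are read as (lower, upper) bounds rather than explicit settings."""
    return isinstance(values, tuple) and len(values) == 2


# Lists: 3 x 2 = 6 candidate settings for an exhaustive search.
grid = expand_grid({"alpha": [0.1, 1.0, 10.0], "fit_intercept": [True, False]})

# Tuple: a continuous range, as a bounds-based search would expect.
bounds = {"alpha": (0.1, 10.0)}
```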
(Hyper)parameter search method.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
search_params (dict) – Dictionary with (hyper)parameter names as keys, and either lists of (hyper)parameter settings to try as values or tuples of (hyper)parameter lower and upper bounds to try as values.
search_method (str, optional (default='gridsearchcv')) – Specifies the search method. If 'gridsearchcv', 'randomizedsearchcv',
or 'bayesoptcv', then the search method is GridSearchCV, RandomizedSearchCV,
or Bayesian Optimization, respectively.
cv (int, cross-validation generator, an iterable, or None, optional (default=None)) – Determines the cross-validation strategy. If None, then the default is 5-fold cross-validation.
path (str or file handle, optional (default=None)) – The file path or object, if the scoring DataFrame is to be saved to a comma-separated values (CSV) file.
The optimal (hyper)parameters.
pd.Series
The scores for the optimal (hyper)parameters.
pd.Series
Bundles the best_params_, best_score_, and refit_time
into one attribute.
pd.DataFrame
Notes
Scikit-learn returns negative scores for some metrics, such as mean absolute error (MAE) or mean squared error (MSE). However, we only return nonnegative scores.
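Scikit-learn scorers follow a greater-is-better convention, so error metrics are exposed under names such as 'neg_mean_absolute_error' and return negated values. A minimal sketch of the conversion to nonnegative scores, assuming it amounts to a sign flip for the 'neg_' scorers, is shown below. The function to_nonnegative is hypothetical.

```python
def to_nonnegative(scoring, raw_score):
    """Undo the negation that 'neg_*' scorers apply to error metrics."""
    if scoring.startswith("neg_"):
        return scoring[len("neg_"):], -raw_score
    return scoring, raw_score


name, value = to_nonnegative("neg_mean_absolute_error", -0.25)
```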
Helper method for nested cross-validation.
Exhaustively searches over the specified (hyper)parameters in the inner loop then scores the best performing regressor in the outer loop.
pipeline (ModifiedPipeline) – A ModifiedPipeline object.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
scorer (dict) – A dict mapping each scorer name to its validated scorer.
train (list) – A list of indices for the training folds.
test (list) – A list of indices for the withheld folds.
verbose (int) – Determines verbosity.
search_params (dict) – Dictionary with (hyper)parameter names as keys, and either lists of (hyper)parameter settings to try as values or tuples of (hyper)parameter lower and upper bounds to try as values.
search_method (str, optional (default='gridsearchcv')) – Specifies the search method. If 'gridsearchcv', 'randomizedsearchcv',
or 'bayesoptcv', then the search method is GridSearchCV, RandomizedSearchCV,
or Bayesian Optimization, respectively.
cv (int, cross-validation generator, an iterable, or None, optional (default=None)) – Determines the cross-validation strategy. If None, then the default is 5-fold cross-validation.
score
tuple
Notes
Scikit-learn returns negative scores for some metrics, such as mean absolute error (MAE) or mean squared error (MSE). However, we only return nonnegative scores.
Performs a nested cross-validation procedure.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
search_params (dict) – Dictionary with (hyper)parameter names as keys, and either lists of (hyper)parameter settings to try as values or tuples of (hyper)parameter lower and upper bounds to try as values.
search_method (str, optional (default='gridsearchcv')) – Specifies the search method. If 'gridsearchcv', 'randomizedsearchcv',
or 'bayesoptcv', then the search method is GridSearchCV, RandomizedSearchCV,
or Bayesian Optimization, respectively.
outer_cv (int, cross-validation generator, an iterable, or None, optional (default=None)) – Determines the outer loop cross-validation strategy. If None, then the default is 5-fold cross-validation.
inner_cv (int, cross-validation generator, an iterable, or None, optional (default=None)) – Determines the inner loop cross-validation strategy. If None, then the default is 5-fold cross-validation.
return_inner_loop_score (bool, optional (default=False)) – If True, then we return the inner loop score in addition to the outer loop score.
score
pd.Series or tuple
Notes
The procedure does not compute the single best set of (hyper)parameters, as each inner loop may return a different set of optimal (hyper)parameters.
Scikit-learn returns negative scores for some metrics, such as mean absolute error (MAE) or mean squared error (MSE). However, we only return nonnegative scores.
References
Jacques Wainer and Gavin Cawley. “Nested cross-validation when selecting classifiers is overzealous for most practical applications,” arXiv preprint arXiv:1809.09446 (2018).
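The nested procedure can be sketched without any library, using a toy regressor whose single (hyper)parameter shrinks the training mean. The inner loop selects a setting on the training folds only, and the outer loop scores that setting on the withheld fold. All names here are hypothetical, and the toy model is an assumption chosen to keep the example short.

```python
def kfold(indices, n_splits):
    """Split a list of indices into (train, test) pairs of interleaved folds."""
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_predict(y, train, test, shrink):
    """Toy regressor: predict a shrunken training mean for every test example."""
    mean = sum(y[i] for i in train) / len(train)
    return [shrink * mean for _ in test]


def inner_search(y, train, candidates, inner_cv):
    """Return the candidate with the lowest mean inner-loop error."""
    def inner_score(shrink):
        errors = [
            mae([y[i] for i in te], fit_predict(y, tr, te, shrink))
            for tr, te in kfold(train, inner_cv)
        ]
        return sum(errors) / len(errors)
    return min(candidates, key=inner_score)


y = [2.0, 2.2, 1.9, 2.1, 2.0, 1.8, 2.3, 2.0, 2.1, 1.9, 2.2, 2.0]
candidates = [0.5, 0.9, 1.0]

outer_scores, chosen = [], []
for train, test in kfold(list(range(len(y))), n_splits=3):
    # The inner loop never sees the withheld outer fold.
    best = inner_search(y, train, candidates, inner_cv=2)
    chosen.append(best)
    outer_scores.append(mae([y[i] for i in test], fit_predict(y, train, test, best)))
```

The list chosen holds one selected setting per outer fold, which is why the procedure yields a performance estimate rather than a single best set of (hyper)parameters.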
Subsamples from the design and target matrices.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
subsample_proportion (float or None, optional (default=None)) – Determines the proportion of observations to use in the subsampling procedure.
out – A tuple with the X and y data.
tuple
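A minimal sketch of such subsampling draws row indices without replacement and applies them to both matrices, so the rows of X and y stay paired. The seed argument and the rounding rule are assumptions made for the illustration. The function is hypothetical and is not the physlearn method.

```python
import random


def subsample(X, y, subsample_proportion=None, seed=0):
    """Return a random subset of paired rows from X and y."""
    if subsample_proportion is None:
        return X, y
    if not 0 < subsample_proportion <= 1:
        raise ValueError("subsample_proportion must lie in (0, 1].")
    n_samples = len(X)
    n_keep = max(1, int(n_samples * subsample_proportion))
    # Sample indices once, then apply them to both X and y.
    keep = sorted(random.Random(seed).sample(range(n_samples), n_keep))
    return [X[i] for i in keep], [y[i] for i in keep]


X = [[float(i), float(i) ** 2] for i in range(10)]
y = [3.0 * i for i in range(10)]
X_sub, y_sub = subsample(X, y, subsample_proportion=0.5)
```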
Automates subtask slicing in multi-target regression.
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the
column(s) correspond to the single-target(s). The targets are used to
determine the type of the target, and the number of samples if the
pipeline_transform involves quantile transformers.
y
array-like of shape = [n_samples] or shape = [n_samples, n_targets]
Helper method to estimate cross-validation fold size.
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
cv (int, cross-validation generator, or an iterable) – Used in order to determine the fold size.
estimate
int
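The source does not state how the estimate is computed. One plausible reading, offered only as an assumption, is that it bounds the number of training examples available in any k-fold split, which is useful when a transformer's settings must not exceed the training set size. The function below is a hypothetical sketch of that reading.

```python
def estimate_fold_size(n_samples, n_splits):
    """Lower bound on the number of training examples in any k-fold split."""
    largest_test_fold = -(-n_samples // n_splits)  # ceiling division
    return n_samples - largest_test_fold


estimate = estimate_fold_size(n_samples=103, n_splits=5)
```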
Helper fit method.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
sample_weight (float, ndarray, or None, optional (default=None)) – Individual weights for each example. If the weight is a float, then every example will have the same weight.
**fit_params (dict of string -> object) – If base boosting, then these parameters are passed to the stagewise
_fit_stages method.
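The handling of a scalar sample_weight can be sketched as a broadcast to one weight per example. The helper expand_sample_weight is hypothetical, and plain lists stand in for ndarrays.

```python
def expand_sample_weight(sample_weight, n_samples):
    """Return one weight per example, or None if no weights were given."""
    if sample_weight is None:
        return None
    if isinstance(sample_weight, (int, float)):
        # A scalar weight applies equally to every example.
        return [float(sample_weight)] * n_samples
    weights = [float(w) for w in sample_weight]
    if len(weights) != n_samples:
        raise ValueError("sample_weight must contain one weight per example.")
    return weights


uniform = expand_sample_weight(2.0, n_samples=4)
explicit = expand_sample_weight([1, 2, 3, 4], n_samples=4)
```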
Helper method which instantiates the regressor choice.
Performs (augmented) cross-validation.
If return_incumbent_score is True, then the incumbent is scored
on the withheld folds. Otherwise, the behavior is the same as in
Scikit-learn.
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
return_regressor (bool, optional (default=False)) – Determines whether to return the induced regressor.
error_score ('raise' or numeric, optional (default=np.nan)) – The assigned value if an error occurs while inducing a regressor. If set to 'raise', then the specific error is raised. If set to a numeric value, then a FitFailedWarning is issued.
return_incumbent_score (bool, optional (default=True)) – Determines whether to score the incumbent on the withheld folds, whereby the incumbent is assumed to be an example in the design matrix.
cv (int, cross-validation generator, an iterable, or None, optional (default=None)) – Determines the cross-validation strategy. If None, then the default is 5-fold cross-validation.
fit_params (dict, optional (default=None)) – (Hyper)parameters to pass to the regressor’s fit method.
scores – Array of scores for each run of the cross-validation procedure.
dict of float arrays of shape (n_splits,)
References
Alex Wozniakowski, Jayne Thompson, Mile Gu, and Felix C. Binder. “A new formulation of gradient boosting”, Machine Learning: Science and Technology, 2 045022 (2021).
Checks the validity of the data representation(s).
X (array-like of shape = [n_samples, n_features]) – The design matrix, where each row corresponds to an example and the column(s) correspond to the feature(s).
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the column(s) correspond to the single-target(s).
out
validated data
Creates pipe attribute for downstream tasks.
This method constructs a ModifiedPipeline from the given base regressor.
y (array-like of shape = [n_samples] or shape = [n_samples, n_targets]) – The target matrix, where each row corresponds to an example and the
column(s) correspond to the single-target(s). The targets are used to
determine the type of the target, and the number of samples if the
pipeline_transform involves quantile transformers.
n_quantiles (int or None, optional (default=None)) – Number of quantiles in sklearn.preprocessing.QuantileTransformer, if
pipeline_transform is either `quantileuniform` or `quantilenormal`.
A ModifiedPipeline object.
The physlearn.supervised.interface module provides an interface between
physlearn.BaseRegressor and the regressor dictionary. It includes
the physlearn.RegressorDictionaryInterface class.
Bases: AbstractEstimatorDictionaryInterface
BaseRegressor and regressor dictionary interface.
The regressor dictionary collects key-value pairs, whereby each key
is a lower case regressor class name that uniquely identifies the
regressor class, e.g., {'ridge': Ridge}. As such, the interface
manages regressor class retrieval for physlearn.BaseRegressor
as part of the constructor method.
regressor_choice (str) – The dictionary key for lookup in the dictionary of regressors.
The key must be lowercase, e.g., the Scikit-learn
regressor Ridge has key 'ridge'.
params (dict, list, or None, optional (default=None)) – The choice of (hyper)parameters.
stacking_options (dict or None, optional (default=None)) –
A dictionary of stacking options, whereby layers
must be specified:
dict – A dictionary of stacking layer(s).
bool or None (default=True) – Determines whether to shuffle the training data in
mlxtend.regressor.StackingCVRegressor.
bool or None (default=True) – Determines whether to clone and refit the regressors in
mlxtend.regressor.StackingCVRegressor.
bool or None (default=True) – Determines whether to concatenate the original features with
the first stacking layer predictions in
sklearn.ensemble.StackingRegressor,
mlxtend.regressor.StackingRegressor, or
mlxtend.regressor.StackingCVRegressor.
bool or None (default=True) – Determines whether to make the concatenated features
accessible through the attribute train_meta_features_
in mlxtend.regressor.StackingRegressor and
mlxtend.regressor.StackingCVRegressor.
ndarray of shape (n_regressors,) or None (default=None) – Sequence of weights for sklearn.ensemble.VotingRegressor.
Examples
>>> from physlearn import RegressorDictionaryInterface
>>> interface = RegressorDictionaryInterface(regressor_choice='mlpregressor',
...                                          params=dict(alpha=1))
>>> interface.set_params()
MLPRegressor(alpha=1)
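The lookup that the interface performs can be pictured with a plain dictionary from lowercase keys to classes. The sketch below uses stand-in classes rather than the real regressors, and the names REGRESSORS and make_regressor are hypothetical. It only handles params given as a dict, not the list form.

```python
class Ridge:
    """Stand-in for a regressor class; not the Scikit-learn implementation."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha


class Lasso:
    """Second stand-in, so the dictionary has more than one entry."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha


# Each lowercase key uniquely identifies one regressor class.
REGRESSORS = {"ridge": Ridge, "lasso": Lasso}


def make_regressor(regressor_choice, params=None):
    """Look up the class by its lowercase key and instantiate it."""
    try:
        regressor_class = REGRESSORS[regressor_choice]
    except KeyError:
        raise KeyError(f"Unknown regressor choice: {regressor_choice!r}") from None
    # With params=None, the class defaults are used.
    return regressor_class(**(params or {}))


regressor = make_regressor("ridge", params={"alpha": 0.5})
```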
Retrieves the (hyper)parameters.
regressor (estimator) – A regressor that follows the Scikit-learn API.
Notes
The method physlearn.RegressorDictionaryInterface.set_params()
must be called beforehand.
Sets the (hyper)parameters.
If params is None, then the default (hyper)parameters
are set.
cv (int, cross-validation generator, an iterable, or None) – Determines the cross-validation strategy in
sklearn.ensemble.StackingRegressor,
mlxtend.regressor.StackingRegressor, or
mlxtend.regressor.StackingCVRegressor.
verbose (int or None) – Determines verbosity in
mlxtend.regressor.StackingRegressor and
mlxtend.regressor.StackingCVRegressor.
random_state (int, RandomState instance, or None) – Determines the random number generation in
mlxtend.regressor.StackingCVRegressor.
n_jobs (int or None) – The number of jobs to run in parallel.
stacking_options (dict or None, optional (default=None)) –
A dictionary of stacking options, whereby layers
must be specified:
dict – A dictionary of stacking layer(s).
bool or None (default=True) – Determines whether to shuffle the training data in
mlxtend.regressor.StackingCVRegressor.
bool or None (default=True) – Determines whether to clone and refit the regressors in
mlxtend.regressor.StackingCVRegressor.
bool or None (default=True) – Determines whether to concatenate the original features with
the first stacking layer predictions in
sklearn.ensemble.StackingRegressor,
mlxtend.regressor.StackingRegressor, or
mlxtend.regressor.StackingCVRegressor.
bool or None (default=True) – Determines whether to make the concatenated features
accessible through the attribute train_meta_features_
in mlxtend.regressor.StackingRegressor and
mlxtend.regressor.StackingCVRegressor.
ndarray of shape (n_regressors,) or None (default=None) – Sequence of weights for sklearn.ensemble.VotingRegressor.