Models

The Model class (pySAR/model.py) provides a unified interface for building, fitting, evaluating, and saving scikit-learn regression models. All 16 supported algorithms share the same API: construct with feature data and labels, split the dataset, fit, predict, and optionally run hyperparameter tuning or feature selection.

from pySAR.model import Model
import numpy as np

X = np.random.rand(100, 50)   # feature matrix
Y = np.random.rand(100)       # activity labels

model = Model(X, Y, algorithm="randomforest", parameters={"n_estimators": 200})
model.train_test_split(test_split=0.2)
model.fit()
predictions = model.predict()

Supported Algorithms

Algorithm names are resolved by fuzzy string matching (difflib), so approximate names such as "randomforest", "plsreg", or "knn" are accepted.

| Algorithm | Accepted Aliases | scikit-learn Class |
| --- | --- | --- |
| PLS Regression | plsregression | PLSRegression |
| Random Forest | randomforestregressor | RandomForestRegressor |
| AdaBoost | adaboostregressor | AdaBoostRegressor |
| Bagging | baggingregressor | BaggingRegressor |
| Decision Tree | decisiontreeregressor | DecisionTreeRegressor |
| Gradient Boosting | gbr, gradientboost, gradientboostingregressor | GradientBoostingRegressor |
| Histogram Gradient Boosting | histgradientboosting, histgradientboostingregressor, hgbr | HistGradientBoostingRegressor |
| Linear Regression | linearregression | LinearRegression |
| Lasso | lasso | Lasso |
| Ridge | ridge | Ridge |
| ElasticNet | elasticnet | ElasticNet |
| Support Vector Regression | svr, supportvectorregression | SVR |
| Stochastic Gradient Descent | sgd, stochasticgradientdescent | SGDRegressor |
| K-Nearest Neighbours | knn, kneighborsregressor, knearestneighbors | KNeighborsRegressor |
| Extra Trees | extratrees, extratreesregressor | ExtraTreesRegressor |
| Gaussian Process | gaussianprocess, gaussianprocessregressor, gpr | GaussianProcessRegressor |
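
The name resolution described above can be illustrated with the standard library alone. This is a minimal sketch, not pySAR's actual implementation: the `ALIASES` table (a subset of the aliases listed above), the `resolve_algorithm` helper, and the 0.6 similarity cutoff are all assumptions made for illustration.

```python
import difflib

# Hypothetical alias table for illustration: a subset of the documented aliases.
ALIASES = {
    "plsregression": "PLSRegression",
    "randomforestregressor": "RandomForestRegressor",
    "gbr": "GradientBoostingRegressor",
    "gradientboostingregressor": "GradientBoostingRegressor",
    "svr": "SVR",
    "knn": "KNeighborsRegressor",
    "kneighborsregressor": "KNeighborsRegressor",
}


def resolve_algorithm(name, cutoff=0.6):
    """Return the scikit-learn class name whose alias is closest to `name`.

    The cutoff of 0.6 is difflib's default and an assumption here.
    """
    key = name.lower().strip()
    # get_close_matches returns candidates sorted by similarity, best first.
    matches = difflib.get_close_matches(key, list(ALIASES), n=1, cutoff=cutoff)
    if not matches:
        raise ValueError(f"No supported algorithm resembles {name!r}")
    return ALIASES[matches[0]]


print(resolve_algorithm("plsreg"))
print(resolve_algorithm("RandomForest"))
```

Very short strings such as "knn" share few characters with the full class name, which is one plausible reason to register them as explicit aliases rather than rely on similarity alone.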


Parameters

Model.__init__(X, Y, algorithm, parameters={}, test_split=0.2)

| Parameter | Default | Description |
| --- | --- | --- |
| X | (required) | Feature matrix (np.ndarray). |
| Y | (required) | Activity/fitness labels (np.ndarray). |
| algorithm | (required) | Name (or approximate name) of the sklearn regression algorithm to use. |
| parameters | {} | Dict of algorithm-specific parameters passed directly to the sklearn constructor. An empty dict uses all sklearn defaults. |
| test_split | 0.2 | Fraction of data reserved for testing (the default gives an 80/20 train/test split). |

The parameters dict maps directly to the chosen sklearn model’s constructor keyword arguments. For example, to set the number of estimators for a Random Forest:

model = Model(X, Y, algorithm="randomforest", parameters={"n_estimators": 500, "max_depth": 10})

Full parameter lists for each algorithm are available in the scikit-learn documentation.
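
Since the dict is passed directly to the constructor, its effect can be previewed with scikit-learn alone. The snippet below is a sketch that assumes the dict is forwarded unchanged; it does not call pySAR.

```python
from sklearn.ensemble import RandomForestRegressor

parameters = {"n_estimators": 500, "max_depth": 10}

# Unpacking the dict is the same as writing the keyword arguments by hand.
estimator = RandomForestRegressor(**parameters)

# get_params() shows the supplied values alongside the remaining sklearn defaults.
params = estimator.get_params()
print(params["n_estimators"], params["max_depth"])
```

A key that is not a constructor argument of the chosen class raises a TypeError when unpacked this way, so checking keys against `get_params()` beforehand is a cheap sanity test.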


train_test_split

Splits X and Y into training and test sets and optionally applies standard scaling.

model.train_test_split(test_split=0.2, scale=True, random_state=42, shuffle=True)

| Parameter | Default | Description |
| --- | --- | --- |
| test_split | 0.2 | Proportion of observations to reserve for the test set. |
| scale | True | Whether to apply StandardScaler; the scaler is fit on the training set only and then used to transform both sets, preventing data leakage. |
| random_state | None | Seed for reproducible splits. |
| shuffle | True | Whether to shuffle observations before splitting. |
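
The leakage-free scaling described for the scale option can be reproduced with scikit-learn directly. This is a sketch of the general technique under the assumption that pySAR follows the usual split-then-scale order; the synthetic data and variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((100, 50))   # synthetic feature matrix
Y = rng.random(100)         # synthetic activity labels

# Split first, so the test rows never influence the scaling statistics.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42, shuffle=True
)

# Fit the scaler on the training rows only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```

After this step the training columns have zero mean and unit variance, while the test columns are only approximately standardised, which is expected because their statistics were never used.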


fit and predict

After splitting, call fit() to train the model on X_train / Y_train, then predict() to generate predictions on X_test:

model.fit()
predictions = model.predict()

predict() returns an np.ndarray of predicted activity values for the test set.


Hyperparameter Tuning

hyperparameter_tuning() uses scikit-learn’s GridSearchCV to exhaustively search a user-supplied parameter grid and report the best configuration.

param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 5, 10],
}
model.hyperparameter_tuning(param_grid=param_grid, metric="r2", cv=5)

| Parameter | Default | Description |
| --- | --- | --- |
| param_grid | {} | Dict mapping parameter names to lists of candidate values. Keys that are not valid for the current model are silently ignored. |
| metric | "r2" | scikit-learn scoring metric used to rank candidates. Any metric returned by sklearn.metrics.get_scorer_names() is accepted. |
| cv | 5 | Number of cross-validation folds (clamped to the range 5–10). |
| n_jobs | None | Number of parallel jobs. None means 1; -1 uses all available cores. |
| verbose | 2 | Verbosity level passed to GridSearchCV. |
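
The behaviour in the table above can be approximated with plain scikit-learn. The sketch below assumes that invalid grid keys are dropped by comparing against `get_params()` and that cv is clamped with `min`/`max`; how pySAR implements these two steps internally is not confirmed here. The small grid and synthetic data are chosen only to keep the run short.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_train = rng.random((60, 8))   # synthetic training features
Y_train = rng.random(60)        # synthetic training labels

estimator = RandomForestRegressor(random_state=0)
param_grid = {
    "n_estimators": [10, 20],
    "max_depth": [None, 5],
    "not_a_parameter": [1, 2],  # deliberately invalid key
}

# Assumed filtering step: keep only keys the estimator actually accepts.
valid_keys = estimator.get_params().keys()
param_grid = {k: v for k, v in param_grid.items() if k in valid_keys}

# Assumed clamping step: a requested cv of 3 becomes 5; anything above 10 becomes 10.
requested_cv = 3
cv = min(max(requested_cv, 5), 10)

search = GridSearchCV(estimator, param_grid, scoring="r2", cv=cv, n_jobs=None, verbose=0)
search.fit(X_train, Y_train)

print(search.best_params_)
print(search.best_score_)
```

Because the grid is searched exhaustively, the number of fits is the product of the candidate-list lengths times the number of folds, so large grids grow expensive quickly.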

After the search, the best parameters, R² score, RMSE, MSE, MAE, RPD, and explained variance are printed. The GridSearchCV result object is stored in model.grid_result.
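
For reference, the reported metrics can be computed from true and predicted values as follows. This is a sketch using the standard definitions; the `regression_metrics` helper is hypothetical, and RPD is assumed here to be the standard deviation of the observed values divided by the RMSE, which may differ in detail from pySAR's own calculation.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute common regression metrics from observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mse = float(np.mean(residuals ** 2))
    rmse = float(np.sqrt(mse))
    mae = float(np.mean(np.abs(residuals)))

    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot

    # Explained variance ignores any constant bias in the predictions.
    explained_variance = 1.0 - float(np.var(residuals)) / float(np.var(y_true))

    # Assumed definition: ratio of performance to deviation = SD(observed) / RMSE.
    rpd = float(np.std(y_true)) / rmse

    return {
        "r2": r2,
        "rmse": rmse,
        "mse": mse,
        "mae": mae,
        "rpd": rpd,
        "explained_variance": explained_variance,
    }


metrics = regression_metrics([1, 2, 3, 4], [1.5, 2, 3, 3.5])
print(metrics)
```

Note that RPD is undefined for a perfect fit (RMSE of zero), so a production version would need to guard that division.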


Feature Selection

feature_selection() applies a feature-selection method to X and Y and returns the reduced feature matrix.

X_reduced = model.feature_selection(method="rfe")

The following methods are supported. Names are matched approximately, so "kbest" will resolve to selectkbest, for example.

| Method | Description |
| --- | --- |
| selectkbest / chi2 | Selects features by the highest scores under the f_regression scoring function (SelectKBest). |
| variancethreshold | Removes features whose variance does not exceed a threshold (VarianceThreshold). |
| rfe | Recursive Feature Elimination: iteratively removes the least important features as ranked by the model (RFE). |
| selectfrommodel | Selects features whose importance weights exceed a threshold derived from a fitted copy of the model (SelectFromModel). |
| sequentialfeatureselector | Greedy forward/backward selection, adding or removing one feature at a time (SequentialFeatureSelector). |
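
To show what one of these methods does, the sketch below runs scikit-learn's SelectKBest with f_regression on synthetic data in which only the first two columns carry signal. The value k=10 and the data are assumptions for illustration; the number of features pySAR keeps by default is not stated here.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.random((100, 50))
# Synthetic labels: only columns 0 and 1 are informative, plus a little noise.
Y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.random(100)

# Keep the 10 columns with the highest univariate F-scores against Y.
selector = SelectKBest(score_func=f_regression, k=10)
X_reduced = selector.fit_transform(X, Y)

# Indices of the columns that survived selection.
kept = selector.get_support(indices=True)
print(X_reduced.shape)
print(kept)
```

The selector scores each column independently, so it is fast but cannot detect features that are only useful in combination; model-based methods such as RFE address that at higher cost.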


Saving a Model

Trained models are serialised to disk using pickle:

model.save(save_folder="results/", model_name="my_model")
# saves to results/my_model.pkl

The .pkl extension is appended automatically if omitted. To check whether a model has already been fitted before saving:

if model.model_fitted():
    model.save("results/")

Warning

Only load .pkl files from trusted sources. Deserialising a malicious pickle file can execute arbitrary code.
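
The save behaviour described above, including the automatic .pkl extension, can be sketched with the standard library. The `save_pickle` and `load_pickle` helpers are hypothetical and stand in for pySAR's own methods; a plain dict is pickled instead of a fitted model to keep the example self-contained.

```python
import os
import pickle
import tempfile


def save_pickle(obj, save_folder, model_name="model"):
    """Pickle `obj` into `save_folder`, appending .pkl if it is missing."""
    if not model_name.endswith(".pkl"):
        model_name += ".pkl"
    os.makedirs(save_folder, exist_ok=True)
    path = os.path.join(save_folder, model_name)
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)
    return path


def load_pickle(path):
    """Load a pickle file. Only call this on files from a trusted source."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


original = {"algorithm": "randomforest", "n_estimators": 200}

# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = save_pickle(original, tmp, "my_model")
    restored = load_pickle(path)

print(os.path.basename(path))
print(restored)
```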


Config File

The algorithm and its parameters can be set in the "model" section of a pySAR config JSON file:

{
    "model": {
        "algorithm": "randomforest",
        "parameters": {
            "n_estimators": 500,
            "max_depth": 10
        },
        "test_split": 0.2
    }
}

This is equivalent to:

model = Model(X, Y, algorithm="randomforest",
              parameters={"n_estimators": 500, "max_depth": 10},
              test_split=0.2)
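
The mapping from the config file to the constructor call can be made explicit with the standard json module. This is a sketch: the fallback defaults mirror the Model.__init__ signature shown earlier, and how pySAR itself reads the file is not covered here.

```python
import json

config_text = """
{
    "model": {
        "algorithm": "randomforest",
        "parameters": {
            "n_estimators": 500,
            "max_depth": 10
        },
        "test_split": 0.2
    }
}
"""

model_cfg = json.loads(config_text)["model"]

# Build keyword arguments, falling back to the documented defaults.
model_kwargs = {
    "algorithm": model_cfg["algorithm"],
    "parameters": model_cfg.get("parameters", {}),
    "test_split": model_cfg.get("test_split", 0.2),
}
print(model_kwargs)

# With pySAR installed and X, Y defined, the call would then be:
# model = Model(X, Y, **model_kwargs)
```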