Models
The Model class (pySAR/model.py) provides a unified interface for building, fitting,
evaluating, and saving scikit-learn regression models. All 16 supported algorithms share the same
API: construct with feature data and labels, split the dataset, fit, predict, and optionally
run hyperparameter tuning or feature selection.
from pySAR.model import Model
import numpy as np
X = np.random.rand(100, 50) # feature matrix
Y = np.random.rand(100) # activity labels
model = Model(X, Y, algorithm="randomforest", parameters={"n_estimators": 200})
model.train_test_split(test_split=0.2)
model.fit()
predictions = model.predict()
Supported Algorithms
Algorithm names are resolved by fuzzy string matching (difflib), so approximate names
such as "randomforest", "plsreg", or "knn" are accepted.
| Algorithm | Accepted Aliases | scikit-learn Class |
|---|---|---|
| PLS Regression | | |
| Random Forest | | |
| AdaBoost | | |
| Bagging | | |
| Decision Tree | | |
| Gradient Boosting | | |
| Histogram Gradient Boosting | | |
| Linear Regression | | |
| Lasso | | |
| Ridge | | |
| ElasticNet | | |
| Support Vector Regression | | |
| Stochastic Gradient Descent | | |
| K-Nearest Neighbours | | |
| Extra Trees | | |
| Gaussian Process | | |
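The fuzzy name resolution described above can be sketched with the standard library's difflib. This is a minimal illustration, not pySAR's implementation: the CANONICAL_NAMES list, the resolve_algorithm helper, and the 0.6 cutoff are all assumptions made for the example.

```python
import difflib

# Hypothetical canonical names, chosen for illustration only;
# pySAR's internal registry may use different keys.
CANONICAL_NAMES = [
    "plsregression",
    "randomforest",
    "adaboost",
    "decisiontree",
    "gradientboosting",
    "kneighbors",
    "lasso",
    "ridge",
]


def resolve_algorithm(name, cutoff=0.6):
    """Return the canonical name closest to `name`, or raise ValueError."""
    # Normalise case and strip separators before comparing.
    key = name.lower().replace(" ", "").replace("_", "").replace("-", "")
    matches = difflib.get_close_matches(key, CANONICAL_NAMES, n=1, cutoff=cutoff)
    if not matches:
        raise ValueError(f"No supported algorithm resembles {name!r}")
    return matches[0]


print(resolve_algorithm("plsreg"))
print(resolve_algorithm("Random Forrest"))
```

Raising on a failed match, rather than falling back to a default algorithm, keeps a badly misspelled name from silently training the wrong model.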
Parameters
Model.__init__(X, Y, algorithm, parameters={}, test_split=0.2)
| Parameter | Default | Description |
|---|---|---|
| X | required | Feature matrix. |
| Y | required | Activity/fitness labels. |
| algorithm | required | Name (or approximate name) of the sklearn regression algorithm to use. |
| parameters | {} | Dict of algorithm-specific parameters passed directly to the sklearn constructor. An empty dict uses all sklearn defaults. |
| test_split | 0.2 | Fraction of data reserved for testing (default 80/20 train/test split). |
The parameters dict maps directly to the chosen sklearn model’s constructor keyword
arguments. For example, to set the number of estimators for a Random Forest:
model = Model(X, Y, algorithm="randomforest", parameters={"n_estimators": 500, "max_depth": 10})
Full parameter lists for each algorithm are available in the scikit-learn documentation.
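To make the mapping concrete, the sketch below unpacks a parameters dict straight into a scikit-learn constructor, which is roughly what passing it through Model amounts to. The key check against get_params() is an addition for illustration; whether pySAR validates keys this way is not stated here.

```python
from sklearn.ensemble import RandomForestRegressor

parameters = {"n_estimators": 500, "max_depth": 10}

# Illustrative check (an assumption, not documented pySAR behaviour):
# compare the keys against the estimator's known parameters so that a
# typo surfaces with a clear message.
valid_keys = RandomForestRegressor().get_params().keys()
unknown = sorted(set(parameters) - set(valid_keys))
if unknown:
    raise ValueError(f"Unknown parameters: {unknown}")

# Unpack the dict as constructor keyword arguments.
estimator = RandomForestRegressor(**parameters)
print(estimator.get_params()["n_estimators"])
```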
train_test_split
Splits X and Y into training and test sets and optionally applies standard scaling.
model.train_test_split(test_split=0.2, scale=True, random_state=42, shuffle=True)
| Parameter | Default | Description |
|---|---|---|
| test_split | | Proportion of observations to reserve for the test set. |
| scale | | Whether to apply standard scaling. |
| random_state | | Seed for reproducible splits. |
| shuffle | | Whether to shuffle observations before splitting. |
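For orientation, an equivalent split-and-scale step written directly against scikit-learn might look like the sketch below. It assumes the scaler is fitted on the training split only and then applied to the test split, which is the usual way to avoid leaking test statistics; the exact order pySAR uses internally is not confirmed here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((100, 50))  # feature matrix
Y = rng.random(100)        # activity labels

# Reserve 20% of the observations for testing, with a fixed seed.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42, shuffle=True
)

# Assumption: scaling statistics come from the training split only.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```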
fit and predict
After splitting, call fit() to train the model on X_train / Y_train,
then predict() to generate predictions on X_test:
model.fit()
predictions = model.predict()
predict() returns an np.ndarray of predicted activity values for the test set.
Hyperparameter Tuning
hyperparameter_tuning() uses scikit-learn’s GridSearchCV to exhaustively search a
user-supplied parameter grid and report the best configuration.
param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 5, 10],
}
model.hyperparameter_tuning(param_grid=param_grid, metric="r2", cv=5)
| Parameter | Default | Description |
|---|---|---|
| param_grid | | Dict mapping parameter names to lists of candidate values. Keys that are not valid for the current model are silently ignored. |
| metric | | scikit-learn scoring metric used to rank candidates. |
| cv | | Number of cross-validation folds (clamped to the range 5–10). |
| | | Number of parallel jobs. |
| | | Verbosity level passed to GridSearchCV. |
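The behaviours listed in the table (dropping invalid grid keys and clamping the fold count) can be sketched directly with GridSearchCV. The clamp_cv helper, the not_a_parameter key, and the use of a decision tree on synthetic data are illustrative assumptions, not pySAR code.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((60, 5))
Y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=60)

estimator = DecisionTreeRegressor(random_state=0)
param_grid = {"max_depth": [2, 4], "not_a_parameter": [1, 2]}

# Keep only the keys the estimator actually recognises.
valid = estimator.get_params()
filtered_grid = {k: v for k, v in param_grid.items() if k in valid}


def clamp_cv(cv, low=5, high=10):
    """Hypothetical helper: restrict the fold count to [low, high]."""
    return max(low, min(cv, high))


search = GridSearchCV(
    estimator, filtered_grid, scoring="r2", cv=clamp_cv(3), n_jobs=1, verbose=0
)
search.fit(X, Y)
print(search.best_params_)
```

Filtering the grid before the search matters because GridSearchCV itself raises an error on a parameter name the estimator does not accept.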
After the search, the best parameters, R² score, RMSE, MSE, MAE, RPD, and explained
variance are printed. The GridSearchCV result object is stored in model.grid_result.
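For reference, the reported metrics can be computed from observed and predicted values as in the sketch below. The regression_metrics helper is hypothetical, and RPD is taken here as the sample standard deviation of the observed values divided by the RMSE, a common definition that may differ in detail from the one pySAR uses.

```python
import math


def regression_metrics(y_true, y_pred):
    """Hypothetical helper returning the metrics named above as a dict."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    mean_res = sum(residuals) / n

    mse = sum(r * r for r in residuals) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(r) for r in residuals) / n

    # Total sum of squares of the observed values around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot

    # Explained variance uses the variance of the residuals, so it
    # ignores any constant bias in the predictions (unlike R²).
    var_res = sum((r - mean_res) ** 2 for r in residuals) / n
    explained_variance = 1.0 - var_res / (ss_tot / n)

    # Assumed definition: sample standard deviation (n - 1) over RMSE.
    rpd = math.sqrt(ss_tot / (n - 1)) / rmse

    return {
        "r2": r2,
        "rmse": rmse,
        "mse": mse,
        "mae": mae,
        "rpd": rpd,
        "explained_variance": explained_variance,
    }


metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(metrics)
```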
Feature Selection
feature_selection() applies a dimensionality-reduction technique to X and Y
and returns the reduced feature matrix.
X_reduced = model.feature_selection(method="rfe")
The following methods are supported. Names are matched approximately, so "kbest"
will resolve to selectkbest, for example.
| Method | Description |
|---|---|
| selectkbest | Selects the features with the highest scores. |
| | Removes features whose variance does not exceed a threshold. |
| rfe | Recursive Feature Elimination: iteratively removes the least important features as ranked by the model. |
| | Selects features whose importance weights exceed a threshold derived from a fitted copy of the model. |
| | Greedy forward/backward selection, adding or removing one feature at a time. |
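As a rough illustration of the first two approaches in the table, the sketch below chains a variance filter and a univariate top-k selection using scikit-learn directly. The f_regression score function, k=10, and the zero variance threshold are assumptions chosen for the example, not pySAR defaults.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression

rng = np.random.default_rng(0)
X = rng.random((100, 50))
X[:, 0] = 1.0  # constant column with zero variance
Y = 3.0 * X[:, 1] + rng.normal(scale=0.05, size=100)

# Drop features whose variance does not exceed the threshold.
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)

# Keep the 10 features with the highest univariate regression scores.
selector = SelectKBest(score_func=f_regression, k=10)
X_kbest = selector.fit_transform(X_var, Y)

print(X_var.shape, X_kbest.shape)
```

Removing constant columns first also avoids undefined scores, since a univariate correlation test cannot rank a feature that never varies.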
Saving a Model
Trained models are serialised to disk using pickle:
model.save(save_folder="results/", model_name="my_model")
# saves to results/my_model.pkl
The .pkl extension is appended automatically if omitted. To check whether a model
has already been fitted before saving:
if model.model_fitted():
    model.save("results/")
Warning
Only load .pkl files from trusted sources. Deserialising a malicious pickle file
can execute arbitrary code.
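The save behaviour described above, including the automatic .pkl suffix, can be approximated with the standard library as follows. save_pickle and load_pickle are hypothetical helpers written for this sketch, and a plain dict stands in for a fitted model.

```python
import os
import pickle
import tempfile


def save_pickle(obj, save_folder, model_name):
    """Hypothetical helper: pickle `obj` to save_folder/model_name(.pkl)."""
    if not model_name.endswith(".pkl"):
        model_name += ".pkl"
    os.makedirs(save_folder, exist_ok=True)
    path = os.path.join(save_folder, model_name)
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)
    return path


def load_pickle(path):
    """Load a pickle; only call this on files you created or trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_pickle({"algorithm": "randomforest"}, tmp, "my_model")
    restored = load_pickle(saved_path)

print(os.path.basename(saved_path))
```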
Config File
The algorithm and its parameters can be set in the "model" section of a pySAR
config JSON file:
{
    "model": {
        "algorithm": "randomforest",
        "parameters": {
            "n_estimators": 500,
            "max_depth": 10
        },
        "test_split": 0.2
    }
}
This is equivalent to:
model = Model(X, Y, algorithm="randomforest",
              parameters={"n_estimators": 500, "max_depth": 10},
              test_split=0.2)
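One way to bridge the two forms is to read the "model" section with the json module and turn it into keyword arguments. The model_kwargs_from_config helper and its fallback defaults are assumptions for illustration; pySAR's own config loader may behave differently.

```python
import json

config_text = """
{
    "model": {
        "algorithm": "randomforest",
        "parameters": {"n_estimators": 500, "max_depth": 10},
        "test_split": 0.2
    }
}
"""


def model_kwargs_from_config(text):
    """Hypothetical helper: map the "model" section to keyword arguments."""
    section = json.loads(text)["model"]
    return {
        "algorithm": section["algorithm"],
        # Fall back to the constructor defaults when a key is absent.
        "parameters": section.get("parameters", {}),
        "test_split": section.get("test_split", 0.2),
    }


kwargs = model_kwargs_from_config(config_text)
print(kwargs)
```

The resulting dict could then be unpacked into the constructor, for example Model(X, Y, **kwargs).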