Models
======

The ``Model`` class (``pySAR/model.py``) provides a unified interface for building, fitting, evaluating, and saving scikit-learn regression models. All 16 supported algorithms share the same API: construct with feature data and labels, split the dataset, fit, predict, and optionally run hyperparameter tuning or feature selection.

.. code-block:: python

   from pySAR.model import Model
   import numpy as np

   X = np.random.rand(100, 50)   # feature matrix
   Y = np.random.rand(100)       # activity labels

   model = Model(X, Y, algorithm="randomforest", parameters={"n_estimators": 200})
   model.train_test_split(test_split=0.2)
   model.fit()
   predictions = model.predict()

----

Supported Algorithms
--------------------

Algorithm names are matched with fuzzy matching (``difflib``), so approximate strings (e.g. ``"randomforest"``, ``"plsreg"``, ``"knn"``) are accepted.

.. list-table::
   :header-rows: 1
   :widths: 30 35 35

   * - Algorithm
     - Accepted Aliases
     - scikit-learn Class
   * - PLS Regression
     - ``plsregression``
     - ``PLSRegression``
   * - Random Forest
     - ``randomforestregressor``
     - ``RandomForestRegressor``
   * - AdaBoost
     - ``adaboostregressor``
     - ``AdaBoostRegressor``
   * - Bagging
     - ``baggingregressor``
     - ``BaggingRegressor``
   * - Decision Tree
     - ``decisiontreeregressor``
     - ``DecisionTreeRegressor``
   * - Gradient Boosting
     - ``gbr``, ``gradientboost``, ``gradientboostingregressor``
     - ``GradientBoostingRegressor``
   * - Histogram Gradient Boosting
     - ``histgradientboosting``, ``histgradientboostingregressor``, ``hgbr``
     - ``HistGradientBoostingRegressor``
   * - Linear Regression
     - ``linearregression``
     - ``LinearRegression``
   * - Lasso
     - ``lasso``
     - ``Lasso``
   * - Ridge
     - ``ridge``
     - ``Ridge``
   * - ElasticNet
     - ``elasticnet``
     - ``ElasticNet``
   * - Support Vector Regression
     - ``svr``, ``supportvectorregression``
     - ``SVR``
   * - Stochastic Gradient Descent
     - ``sgd``, ``stochasticgradientdescent``
     - ``SGDRegressor``
   * - K-Nearest Neighbours
     - ``knn``, ``kneighborsregressor``, ``knearestneighbors``
     - ``KNeighborsRegressor``
   * - Extra Trees
     - ``extratrees``, ``extratreesregressor``
     - ``ExtraTreesRegressor``
   * - Gaussian Process
     - ``gaussianprocess``, ``gaussianprocessregressor``, ``gpr``
     - ``GaussianProcessRegressor``

----

Parameters
----------

``Model.__init__(X, Y, algorithm, parameters={}, test_split=0.2)``

.. list-table::
   :header-rows: 1
   :widths: 20 15 65

   * - Parameter
     - Default
     - Description
   * - ``X``
     - required
     - Feature matrix (``np.ndarray``).
   * - ``Y``
     - required
     - Activity/fitness labels (``np.ndarray``).
   * - ``algorithm``
     - required
     - Name (or approximate name) of the sklearn regression algorithm to use.
   * - ``parameters``
     - ``{}``
     - Dict of algorithm-specific parameters passed directly to the sklearn constructor. An empty dict uses all sklearn defaults.
   * - ``test_split``
     - ``0.2``
     - Fraction of data reserved for testing (default 80/20 train/test split).

The ``parameters`` dict maps directly to the chosen sklearn model's constructor keyword arguments. For example, to set the number of estimators for a Random Forest:

.. code-block:: python

   model = Model(X, Y, algorithm="randomforest",
                 parameters={"n_estimators": 500, "max_depth": 10})

Full parameter lists for each algorithm are available in the scikit-learn documentation.

----

train_test_split
----------------

Splits ``X`` and ``Y`` into training and test sets and optionally applies standard scaling.

.. code-block:: python

   model.train_test_split(test_split=0.2, scale=True, random_state=42, shuffle=True)

.. list-table::
   :header-rows: 1
   :widths: 20 15 65

   * - Parameter
     - Default
     - Description
   * - ``test_split``
     - ``0.2``
     - Proportion of observations to reserve for the test set.
   * - ``scale``
     - ``True``
     - Whether to apply ``StandardScaler``; the scaler is fit on the training set only and then used to transform both sets, preventing data leakage.
   * - ``random_state``
     - ``None``
     - Seed for reproducible splits.
   * - ``shuffle``
     - ``True``
     - Whether to shuffle observations before splitting.

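The leakage-free scaling described for the ``scale`` option can be illustrated with plain scikit-learn. The following is a minimal sketch of the general split-then-scale pattern, assuming synthetic data like that in the introductory example; it is not taken from pySAR's source, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((100, 50))   # synthetic feature matrix
Y = rng.random(100)         # synthetic activity labels

# Split first, so the test rows never influence the scaler statistics.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42, shuffle=True
)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Each training column ends up with zero mean and unit variance. The test columns are only approximately standardised, which is expected: their statistics were deliberately kept out of the fit.
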

----

fit and predict
---------------

After splitting, call ``fit()`` to train the model on ``X_train`` / ``Y_train``, then ``predict()`` to generate predictions on ``X_test``:

.. code-block:: python

   model.fit()
   predictions = model.predict()

``predict()`` returns an ``np.ndarray`` of predicted activity values for the test set.

----

Hyperparameter Tuning
---------------------

``hyperparameter_tuning()`` uses scikit-learn's ``GridSearchCV`` to exhaustively search a user-supplied parameter grid and report the best configuration.

.. code-block:: python

   param_grid = {
       "n_estimators": [100, 200, 500],
       "max_depth": [None, 5, 10],
   }
   model.hyperparameter_tuning(param_grid=param_grid, metric="r2", cv=5)

.. list-table::
   :header-rows: 1
   :widths: 20 15 65

   * - Parameter
     - Default
     - Description
   * - ``param_grid``
     - ``{}``
     - Dict mapping parameter names to lists of candidate values. Keys that are not valid for the current model are silently ignored.
   * - ``metric``
     - ``"r2"``
     - scikit-learn scoring metric used to rank candidates. Any metric returned by ``sklearn.metrics.get_scorer_names()`` is accepted.
   * - ``cv``
     - ``5``
     - Number of cross-validation folds (clamped to the range 5–10).
   * - ``n_jobs``
     - ``None``
     - Number of parallel jobs. ``None`` means 1; ``-1`` uses all available cores.
   * - ``verbose``
     - ``2``
     - Verbosity level passed to ``GridSearchCV``.

After the search, the best parameters, R² score, RMSE, MSE, MAE, RPD, and explained variance are printed. The ``GridSearchCV`` result object is stored in ``model.grid_result``.

----

Feature Selection
-----------------

``feature_selection()`` applies a dimensionality-reduction technique to ``X`` and ``Y`` and returns the reduced feature matrix.

.. code-block:: python

   X_reduced = model.feature_selection(method="rfe")

The following methods are supported. Names are matched approximately, so ``"kbest"`` will resolve to ``selectkbest``, for example.

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Method
     - Description
   * - ``selectkbest`` / ``chi2``
     - Selects features by the highest scores under the ``f_regression`` scoring function (``SelectKBest``).
   * - ``variancethreshold``
     - Removes features whose variance does not exceed a threshold (``VarianceThreshold``).
   * - ``rfe``
     - Recursive Feature Elimination: iteratively removes the least important features as ranked by the model (``RFE``).
   * - ``selectfrommodel``
     - Selects features whose importance weights exceed a threshold derived from a fitted copy of the model (``SelectFromModel``).
   * - ``sequentialfeatureselector``
     - Greedy forward/backward selection, adding or removing one feature at a time (``SequentialFeatureSelector``).

----

Saving a Model
--------------

Trained models are serialised to disk using ``pickle``:

.. code-block:: python

   model.save(save_folder="results/", model_name="my_model")
   # saves to results/my_model.pkl

The ``.pkl`` extension is appended automatically if omitted. To check whether a model has already been fitted before saving:

.. code-block:: python

   if model.model_fitted():
       model.save("results/")

.. warning::

   Only load ``.pkl`` files from trusted sources. Deserialising a malicious pickle file can execute arbitrary code.

----

Config File
-----------

The algorithm and its parameters can be set in the ``model`` section of a pySAR config JSON file:

.. code-block:: json

   {
     "model": {
       "algorithm": "randomforest",
       "parameters": {
         "n_estimators": 500,
         "max_depth": 10
       },
       "test_split": 0.2
     }
   }

This is equivalent to:

.. code-block:: python

   model = Model(X, Y, algorithm="randomforest",
                 parameters={"n_estimators": 500, "max_depth": 10},
                 test_split=0.2)

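To make the equivalence between the config section and the constructor call concrete, the sketch below parses the same JSON with the standard library and builds the matching keyword arguments. The helper ``model_kwargs_from_config`` is hypothetical and not part of pySAR, which presumably reads its config files through its own loader; the fallback values simply mirror the defaults documented for ``Model.__init__``.

```python
import json

# Config text mirroring the "model" section shown above.
config_text = """
{
  "model": {
    "algorithm": "randomforest",
    "parameters": {"n_estimators": 500, "max_depth": 10},
    "test_split": 0.2
  }
}
"""


def model_kwargs_from_config(text):
    """Hypothetical helper: turn the config's model section into Model kwargs."""
    section = json.loads(text)["model"]
    return {
        "algorithm": section["algorithm"],
        # Fall back to the documented defaults when a key is absent.
        "parameters": section.get("parameters", {}),
        "test_split": section.get("test_split", 0.2),
    }


kwargs = model_kwargs_from_config(config_text)
print(kwargs)

# With pySAR installed, the result could then be unpacked as:
#   model = Model(X, Y, **kwargs)
```

Keeping the parsing step separate from model construction makes it easy to inspect or override individual settings before the model is built.
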