Usage
=====

.. _installation:

Installation
------------

Install the latest release via ``pip``:

.. code-block:: console

   pip install pySAR

Alternatively, clone the repository and install from source:

.. code-block:: console

   git clone -b master https://github.com/amckenna41/pySAR.git
   cd pySAR
   pip install .

Configuration Files
-------------------

pySAR is driven by JSON configuration files. Each dataset requires its own config file that specifies the dataset path, activity column, encoding parameters, and descriptor parameters. The config files are stored in the ``config/`` directory of the project. See ``CONFIG.md`` for a full description of all available parameters.

Config files are passed to the ``PySAR``, ``Encoding``, or ``Descriptors`` classes via the ``config_file`` parameter. All parameter **names** must remain unchanged; only their **values** should be edited. Any unused parameter can be set to ``null``. Parameters can alternatively be passed directly as ``**kwargs`` to each class.

Four example config files are provided in the ``config/`` directory, one per supported dataset: ``thermostability.json``, ``enantioselectivity.json``, ``localization.json``, and ``absorption.json``.

The config file is divided into four top-level sections:

**dataset**
   Defines the input data. ``dataset`` is the path to the sequence/activity file; ``sequence_col`` names the column holding protein sequences; ``activity`` names the target activity column.

**model**
   Specifies the regression algorithm (e.g. ``plsregression``, ``randomforest``, ``svr``), optional hyperparameters, and the train/test split ratio (``test_split``, default ``0.2``).

**descriptors**
   Controls which protein descriptors are calculated and their metaparameters (lag values, properties, window sizes, etc.). ``descriptors_csv`` can point to a pre-calculated descriptor CSV to skip recomputation on repeated runs.

**pyDSP**
   Governs optional Digital Signal Processing applied to AAI-encoded sequences before model training. Set ``use_dsp`` to ``1`` to enable; then configure the ``spectrum`` type (``power``, ``absolute``, ``real``, ``imaginary``), a convolutional ``window`` (e.g. ``hamming``, ``blackman``, ``gaussian``), and an optional ``filter`` (e.g. ``savgol``, ``medfilt``).

A full configuration file looks like:

.. code-block:: json

   {
       "dataset": {
           "dataset": "thermostability.txt",
           "sequence_col": "sequence",
           "activity": "T50"
       },
       "model": {
           "algorithm": "plsregression",
           "parameters": "",
           "test_split": 0.2
       },
       "descriptors": {
           "descriptors_csv": "descriptors_thermostability.csv",
           "moreaubroto_autocorrelation": {
               "lag": 30,
               "properties": ["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102",
                              "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"],
               "normalize": 1
           },
           "ctd": {
               "property": "hydrophobicity",
               "all": 1
           },
           "pseudo_amino_acid_composition": {
               "lambda": 30,
               "weight": 0.05,
               "properties": []
           },
           "charge_distribution": { "ph": 7.4 },
           "kmer_composition": { "k": 2 },
           "reduced_alphabet_composition": { "alphabet_size": 6 },
           "motif_composition": { "motifs": null },
           "aggregation_propensity": {
               "window": 5,
               "hydrophobicity_threshold": 2.0,
               "charge_threshold": 1
           },
           "hydrophobic_moment": {
               "window": 11,
               "angle": 100
           }
       },
       "pyDSP": {
           "use_dsp": 1,
           "spectrum": "power",
           "window": { "type": "hamming" },
           "filter": { "type": null }
       }
   }

Descriptor Encoding
-------------------

pySAR supports 36 protein descriptors via the ``Descriptors`` class. Descriptors are calculated using the ``protpy`` package (>=1.3.0).

**Initialising the Descriptors class:**

.. code-block:: python

   from pySAR.descriptors import Descriptors

   desc = Descriptors(config_file="config/thermostability.json")

Composition Descriptors
~~~~~~~~~~~~~~~~~~~~~~~

**Amino Acid Composition** — frequency of each of the 20 canonical amino acids (N × 20):

.. code-block:: python

   aa_comp = desc.get_amino_acid_composition()
   print(aa_comp.shape)  # (261, 20)
   print(aa_comp.dtypes.iloc[0])  # float64

**Dipeptide Composition** — frequency of all 400 dipeptide pairs (N × 400):

.. code-block:: python

   dp_comp = desc.get_dipeptide_composition()
   print(dp_comp.shape)  # (261, 400)

**Tripeptide Composition** — frequency of all 8000 tripeptide combinations (N × 8000):

.. code-block:: python

   tp_comp = desc.get_tripeptide_composition()
   print(tp_comp.shape)  # (261, 8000)

**GRAVY** — Grand Average of Hydropathicity using Kyte-Doolittle values (N × 1):

.. code-block:: python

   gravy = desc.get_gravy()
   print(gravy.columns.tolist())  # ['GRAVY']

**Aromaticity** — fraction of aromatic residues (F, W, Y, H) in the sequence (N × 1):

.. code-block:: python

   arom = desc.get_aromaticity()
   print(arom.columns.tolist())  # ['Aromaticity']

**Instability Index** — DIWV-based stability score; values ≥ 40 indicate instability (N × 1):

.. code-block:: python

   instab = desc.get_instability_index()
   print(instab.columns.tolist())  # ['InstabilityIndex']

**Isoelectric Point** — estimated pH at which the protein carries no net charge (N × 1):

.. code-block:: python

   pi = desc.get_isoelectric_point()
   print(pi.columns.tolist())  # ['IsoelectricPoint']

**Molecular Weight** — average molecular weight in Daltons, corrected for peptide bonds (N × 1):

.. code-block:: python

   mw = desc.get_molecular_weight()
   print(mw.columns.tolist())  # ['MolecularWeight']

**Charge Distribution** — positive, negative, and net charge at a given pH (default 7.4) (N × 3):

.. code-block:: python

   charge = desc.get_charge_distribution()
   print(charge.columns.tolist())  # ['PositiveCharge', 'NegativeCharge', 'NetCharge']

**Hydrophobic/Polar/Charged Composition** — percentage of residues in each physicochemical group (N × 3):

.. code-block:: python

   hpc = desc.get_hydrophobic_polar_charged_composition()
   print(hpc.columns.tolist())  # ['Hydrophobic', 'Polar', 'Charged']

**Secondary Structure Propensity** — mean Chou-Fasman propensity values for helix, sheet, and coil conformations (N × 3):

.. code-block:: python

   ssp = desc.get_secondary_structure_propensity()
   print(ssp.columns.tolist())  # ['Helix', 'Sheet', 'Coil']

**k-mer Composition** — frequency of all 20^k subsequences; default k=2 gives 400 features (N × 400 by default):

.. code-block:: python

   kmer = desc.get_kmer_composition()
   print(kmer.shape)  # (261, 400)

**Reduced Alphabet Composition** — amino acid composition after mapping residues to a reduced physicochemical alphabet; default alphabet_size=6 (N × 6 by default):

.. code-block:: python

   rac = desc.get_reduced_alphabet_composition()
   print(rac.shape)  # (261, 6)

**Motif Composition** — count of 8 built-in biological sequence motifs (N × 8 by default):

.. code-block:: python

   motifs = desc.get_motif_composition()
   print(motifs.shape)  # (261, 8)

**Amino Acid Pair Composition** — frequency of all 400 residue-pair combinations with physicochemical class annotations (N × 400):

.. code-block:: python

   pair_comp = desc.get_amino_acid_pair_composition()
   print(pair_comp.shape)  # (261, 400)

**Aliphatic Index** — relative volume of aliphatic side chains (Ala, Val, Ile, Leu); higher values correlate with thermostability (N × 1):

.. code-block:: python

   ali = desc.get_aliphatic_index()
   print(ali.columns.tolist())  # ['AliphaticIndex']

**Extinction Coefficient** — molar extinction coefficient at 280 nm from Trp, Tyr, Cys counts; reported for reduced and oxidised states (N × 2):

.. code-block:: python

   ext = desc.get_extinction_coefficient()
   print(ext.columns.tolist())  # ['ExtCoeff_Reduced', 'ExtCoeff_Oxidized']

**Boman Index** — sum of solubility values for all residues divided by sequence length; predicts protein–protein interaction potential (N × 1):

.. code-block:: python

   boman = desc.get_boman_index()
   print(boman.columns.tolist())  # ['BomanIndex']

**Aggregation Propensity** — count and fraction of aggregation-prone windows identified via a sliding-window Kyte-Doolittle + charge-neutrality heuristic (N × 2):

.. code-block:: python

   agg = desc.get_aggregation_propensity()
   print(agg.columns.tolist())  # ['AggregProneRegions', 'AggregProneFraction']

**Hydrophobic Moment** — mean and maximum hydrophobic moment across sliding helical-wheel windows using the Eisenberg hydrophobicity scale (N × 2):

.. code-block:: python

   hm = desc.get_hydrophobic_moment()
   print(hm.columns.tolist())  # ['HydrophobicMoment_Mean', 'HydrophobicMoment_Max']

**Shannon Entropy** — information-theoretic measure of amino acid diversity; 0 = fully repetitive, ~4.322 bits = perfectly uniform over 20 amino acids (N × 1):

.. code-block:: python

   ent = desc.get_shannon_entropy()
   print(ent.columns.tolist())  # ['ShannonEntropy']

Autocorrelation Descriptors
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Autocorrelation descriptors encode the correlation between the physicochemical properties of amino acid residues separated by a given sequence lag. All three variants use up to 8 AAIndex properties and a default lag of 30, producing 8 × 30 = 240 features per descriptor (N × 240).

**MoreauBroto Autocorrelation** — measures the average product of property values at residues separated by lag *d*. It captures the overall strength of property correlation across the sequence without normalising by variance, making it sensitive to the absolute scale of the chosen property:

.. code-block:: python

   mb = desc.get_moreaubroto_autocorrelation()
   print(mb.shape)  # (261, 240)

**Moran Autocorrelation** — a normalised variant of MoreauBroto that divides by the variance of the property values across the sequence. This makes it scale-invariant and directly comparable across different physicochemical properties, reflecting the spatial clustering of similar residues:

.. code-block:: python

   ma = desc.get_moran_autocorrelation()
   print(ma.shape)  # (261, 240)

**Geary Autocorrelation** — measures the mean squared difference between property values at residues separated by lag *d*, normalised by the overall variance. Unlike Moran, values close to 0 indicate strong positive autocorrelation and values greater than 1 indicate negative autocorrelation, making it sensitive to local dissimilarity along the chain:

.. code-block:: python

   ga = desc.get_geary_autocorrelation()
   print(ga.shape)  # (261, 240)

CTD Descriptors
~~~~~~~~~~~~~~~

**CTD** — combined Composition, Transition, and Distribution descriptor (N × 147):

.. code-block:: python

   ctd = desc.get_ctd()
   print(ctd.shape)  # (261, 147)

Sub-components can be accessed individually:

.. code-block:: python

   ctd_c = desc.get_ctd_composition()   # (261, 21)
   ctd_t = desc.get_ctd_transition()    # (261, 21)
   ctd_d = desc.get_ctd_distribution()  # (261, 105)

Conjoint Triad
~~~~~~~~~~~~~~

**Conjoint Triad** — treats each run of three consecutive residues as a unit after grouping the 20 amino acids into 7 classes; produces 7³ = 343 features (N × 343):

.. code-block:: python

   ct = desc.get_conjoint_triad()
   print(ct.shape)  # (261, 343)

Sequence Order Descriptors
~~~~~~~~~~~~~~~~~~~~~~~~~~

**Sequence Order Coupling Number** — dissimilarity between amino acid pairs at varying distances; default lag=30 gives 60 features. Can use Schneider-Wrede and/or Grantham distance matrices (N × 60):

.. code-block:: python

   socn = desc.get_sequence_order_coupling_number()
   print(socn.shape)  # (261, 60)

**Quasi Sequence Order** — extends SOCN with amino acid composition; generates 100 features by default (N × 100):

.. code-block:: python

   qso = desc.get_quasi_sequence_order()
   print(qso.shape)  # (261, 100)

Pseudo Composition Descriptors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Pseudo Amino Acid Composition** — combines amino acid composition with physicochemical correlation factors. Default generates 50 features (N × 50):

.. code-block:: python

   paac = desc.get_pseudo_amino_acid_composition()
   print(paac.shape)  # (261, 50)

**Amphiphilic Pseudo Amino Acid Composition** — extends PAAComp with hydrophobic and hydrophilic distribution patterns along the chain. Default generates 80 features (N × 80):

.. code-block:: python

   apaac = desc.get_amphiphilic_pseudo_amino_acid_composition()
   print(apaac.shape)  # (261, 80)

Calculating All Descriptors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To calculate all 36 descriptors at once and concatenate them into a single DataFrame:

.. code-block:: python

   all_desc = desc.get_all_descriptors()
   print(all_desc.shape)  # (261, )

   # Export to CSV for future reuse (avoids recomputation)
   desc.get_all_descriptors(export=True, descriptors_export_filename="descriptors.csv")

AAI Encoding
------------

Encode sequences using physicochemical indices from the AAIndex1 database, combined with Digital Signal Processing (DSP) features:

.. code-block:: python

   from pySAR.encoding import Encoding

   enc = Encoding(config_file="config/thermostability.json")

   # Encode using a single AAIndex record
   results = enc.aai_encoding(aai_indices="FAUJ880110", sort_by="R2")
   print(results[["Index", "R2", "RMSE"]])

   # Encode using multiple comma-separated AAIndex records
   results = enc.aai_encoding(aai_indices="FAUJ880110, BIGC670101", sort_by="MSE")

Descriptor Encoding
-------------------

Build predictive models using one or more protein descriptors as feature matrices:

.. code-block:: python

   from pySAR.encoding import Encoding

   enc = Encoding(config_file="config/thermostability.json")

   # Single descriptor
   results = enc.descriptor_encoding(descriptors="amino_acid_composition", desc_combo=1, sort_by="R2")

   # Multiple specific descriptors
   results = enc.descriptor_encoding(
       descriptors=["gravy", "molecular_weight", "charge_distribution"],
       desc_combo=1,
       sort_by="R2"
   )

   # All 36 descriptors (empty list = all)
   results = enc.descriptor_encoding(descriptors=[], desc_combo=1, sort_by="R2")
   print(len(results))  # 36

   # All combinations of 2 descriptors
   results = enc.descriptor_encoding(descriptors=[], desc_combo=2, sort_by="R2")
   print(len(results))  # 528

AAI + Descriptor Encoding
--------------------------

Combine AAI-encoded features with descriptor features:

.. code-block:: python

   from pySAR.encoding import Encoding

   enc = Encoding(config_file="config/thermostability.json")

   results = enc.aai_descriptor_encoding(
       aai_indices="FAUJ880110",
       descriptors="amino_acid_composition",
       desc_combo=1,
       sort_by="R2"
   )
   print(results[["Index", "Descriptor", "R2", "RMSE"]])

PySAR Workflow
--------------

The ``PySAR`` class provides the top-level workflow for building and evaluating models:

.. code-block:: python

   from pySAR.pySAR import PySAR

   # AAI encoding workflow
   pysar = PySAR(config_file="config/thermostability.json")
   results = pysar.encode_aai(aai_indices="FAUJ880110")

   # Descriptor encoding workflow
   results = pysar.encode_descriptor(descriptors="amino_acid_composition")

   # AAI + Descriptor encoding workflow
   results = pysar.encode_aai_descriptor(
       aai_indices="FAUJ880110",
       descriptors="amino_acid_composition"
   )
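
Because every workflow above starts from a config file, it can be useful to confirm that a file contains the four top-level sections described under Configuration Files before passing it to pySAR. The following is a minimal sketch using only the Python standard library. ``check_config_sections`` and ``REQUIRED_SECTIONS`` are hypothetical names introduced here for illustration and are not part of the pySAR API; the check assumes the section names shown in the example config (``dataset``, ``model``, ``descriptors``, ``pyDSP``).

.. code-block:: python

   import json

   # Top-level sections named in this guide. The helper below is a
   # hypothetical illustration and is not provided by pySAR.
   REQUIRED_SECTIONS = ("dataset", "model", "descriptors", "pyDSP")


   def check_config_sections(config_text):
       """Return the required top-level sections missing from a JSON config string."""
       config = json.loads(config_text)
       return [name for name in REQUIRED_SECTIONS if name not in config]


   # A deliberately incomplete config: the "pyDSP" section is left out.
   example = """
   {
       "dataset": {"dataset": "thermostability.txt", "sequence_col": "sequence", "activity": "T50"},
       "model": {"algorithm": "plsregression", "parameters": "", "test_split": 0.2},
       "descriptors": {"descriptors_csv": "descriptors_thermostability.csv"}
   }
   """

   missing = check_config_sections(example)
   print(missing)  # ['pyDSP']

To check a file on disk, read its contents first (for example with ``pathlib.Path(path).read_text()``) and pass the resulting string to the helper. An empty list means all four sections are present; it does not validate the parameters inside each section.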