Usage
Installation
Install the latest release via pip:
pip install pySAR
Alternatively, clone the repository and install from source:
git clone -b master https://github.com/amckenna41/pySAR.git
cd pySAR
pip install .
Configuration Files
pySAR is driven by JSON configuration files. Each dataset requires its own config file that
specifies the dataset path, activity column, encoding parameters, and descriptor parameters.
The config files are stored in the config/ directory of the project. See CONFIG.md
for a full description of all available parameters.
Config files are passed to the PySAR, Encoding, or Descriptors classes via the
config_file parameter. All parameter names must remain unchanged; only their values
should be edited. Any unused parameter can be set to null. Parameters can alternatively be
passed directly as **kwargs to each class.
Four example config files are provided in the config/ directory, one per supported dataset: thermostability.json, enantioselectivity.json, localization.json, and absorption.json.
The config file is divided into four top-level sections:
- dataset
Defines the input data. dataset is the path to the sequence/activity file; sequence_col names the column holding protein sequences; activity names the target activity column.
- model
Specifies the regression algorithm (e.g. plsregression, randomforest, svr), optional hyperparameters, and the train/test split ratio (test_split, default 0.2).
- descriptors
Controls which protein descriptors are calculated and their metaparameters (lag values, properties, window sizes, etc.). descriptors_csv can point to a pre-calculated descriptor CSV to skip recomputation on repeated runs.
- pyDSP
Governs optional Digital Signal Processing applied to AAI-encoded sequences before model training. Set use_dsp to 1 to enable; then configure the spectrum type (power, absolute, real, imaginary), a convolutional window (e.g. hamming, blackman, gaussian), and an optional filter (e.g. savgol, medfilt).
A full configuration file looks like:
{
"dataset": {
"dataset": "thermostability.txt",
"sequence_col": "sequence",
"activity": "T50"
},
"model": {
"algorithm": "plsregression",
"parameters": "",
"test_split": 0.2
},
"descriptors": {
"descriptors_csv": "descriptors_thermostability.csv",
"moreaubroto_autocorrelation": {
"lag": 30,
"properties": ["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102",
"CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"],
"normalize": 1
},
"ctd": {
"property": "hydrophobicity",
"all": 1
},
"pseudo_amino_acid_composition": {
"lambda": 30,
"weight": 0.05,
"properties": []
},
"charge_distribution": { "ph": 7.4 },
"kmer_composition": { "k": 2 },
"reduced_alphabet_composition": { "alphabet_size": 6 },
"motif_composition": { "motifs": null },
"aggregation_propensity": {
"window": 5,
"hydrophobicity_threshold": 2.0,
"charge_threshold": 1
},
"hydrophobic_moment": { "window": 11, "angle": 100 }
},
"pyDSP": {
"use_dsp": 1,
"spectrum": "power",
"window": { "type": "hamming" },
"filter": { "type": null }
}
}
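As a rough illustration of the layout above, the following sketch parses a config with Python's standard json module and checks that the four top-level sections are present. The helper name check_config_sections and the trimmed-down config are hypothetical and are not part of pySAR; the library may validate its config differently.

```python
import json

# The four top-level sections described above.
REQUIRED_SECTIONS = ("dataset", "model", "descriptors", "pyDSP")


def check_config_sections(config: dict) -> list:
    """Return the names of any required top-level sections that are missing.

    Hypothetical helper for illustration only; not a pySAR function.
    """
    return [name for name in REQUIRED_SECTIONS if name not in config]


# Trimmed-down config for illustration; a real file carries more keys.
raw = """
{
  "dataset": {"dataset": "thermostability.txt", "sequence_col": "sequence", "activity": "T50"},
  "model": {"algorithm": "plsregression", "parameters": "", "test_split": 0.2},
  "descriptors": {"descriptors_csv": null},
  "pyDSP": {"use_dsp": 1, "spectrum": "power", "window": {"type": "hamming"}, "filter": {"type": null}}
}
"""

config = json.loads(raw)
missing = check_config_sections(config)
print(missing)  # an empty list when all four sections are present
```

Note that JSON null is read as Python None, which is how an unused parameter appears after loading.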
Descriptor Encoding
pySAR supports 36 protein descriptors via the Descriptors class. Descriptors are
calculated using the protpy package (>=1.3.0).
Initialising the Descriptors class:
from pySAR.descriptors import Descriptors
desc = Descriptors(config_file="config/thermostability.json")
Composition Descriptors
Amino Acid Composition — frequency of each of the 20 canonical amino acids (N × 20):
aa_comp = desc.get_amino_acid_composition()
print(aa_comp.shape) # (261, 20)
print(aa_comp.dtypes.iloc[0]) # float64
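To make the N × 20 shape concrete, here is a minimal standalone sketch of amino acid composition for one sequence. It does not call pySAR or protpy, and it reports fractions; the library may scale or round its values differently (for example, as percentages).

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues


def amino_acid_composition(sequence: str) -> dict:
    """Fraction of each canonical amino acid in one sequence (illustrative only)."""
    counts = Counter(sequence.upper())
    length = len(sequence)
    return {aa: counts.get(aa, 0) / length for aa in AMINO_ACIDS}


comp = amino_acid_composition("ACDAA")
print(len(comp))   # one feature per canonical amino acid
print(comp["A"])   # 3 of the 5 residues are alanine
```

Stacking one such 20-element row per sequence gives the N × 20 matrix.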
Dipeptide Composition — frequency of all 400 dipeptide pairs (N × 400):
dp_comp = desc.get_dipeptide_composition()
print(dp_comp.shape) # (261, 400)
Tripeptide Composition — frequency of all 8000 tripeptide combinations (N × 8000):
tp_comp = desc.get_tripeptide_composition()
print(tp_comp.shape) # (261, 8000)
GRAVY — Grand Average of Hydropathicity using Kyte-Doolittle values (N × 1):
gravy = desc.get_gravy()
print(gravy.columns.tolist()) # ['GRAVY']
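The GRAVY score is the mean Kyte-Doolittle hydropathy value over the sequence. The sketch below computes it directly from the published scale; it is a standalone illustration, and the library's rounding may differ.

```python
# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy_score(sequence: str) -> float:
    """Mean hydropathy of a sequence made of canonical residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


hydrophobic = gravy_score("AILV")   # all hydrophobic residues, positive score
charged = gravy_score("RKDE")       # all charged residues, negative score
print(round(hydrophobic, 3), round(charged, 3))
```

Positive values indicate an overall hydrophobic sequence and negative values an overall hydrophilic one.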
Aromaticity — fraction of aromatic residues (F, W, Y, H) in the sequence (N × 1):
arom = desc.get_aromaticity()
print(arom.columns.tolist()) # ['Aromaticity']
Instability Index — DIWV-based stability score; values ≥ 40 indicate instability (N × 1):
instab = desc.get_instability_index()
print(instab.columns.tolist()) # ['InstabilityIndex']
Isoelectric Point — estimated pH at which the protein carries no net charge (N × 1):
pi = desc.get_isoelectric_point()
print(pi.columns.tolist()) # ['IsoelectricPoint']
Molecular Weight — average molecular weight in Daltons, corrected for peptide bonds (N × 1):
mw = desc.get_molecular_weight()
print(mw.columns.tolist()) # ['MolecularWeight']
Charge Distribution — positive, negative, and net charge at a given pH (default 7.4) (N × 3):
charge = desc.get_charge_distribution()
print(charge.columns.tolist()) # ['PositiveCharge', 'NegativeCharge', 'NetCharge']
Hydrophobic/Polar/Charged Composition — percentage of residues in each physicochemical group (N × 3):
hpc = desc.get_hydrophobic_polar_charged_composition()
print(hpc.columns.tolist()) # ['Hydrophobic', 'Polar', 'Charged']
Secondary Structure Propensity — mean Chou-Fasman propensity values for helix, sheet, and coil conformations (N × 3):
ssp = desc.get_secondary_structure_propensity()
print(ssp.columns.tolist()) # ['Helix', 'Sheet', 'Coil']
k-mer Composition — frequency of all 20^k subsequences; default k=2 gives 400 features (N × 400 by default):
kmer = desc.get_kmer_composition()
print(kmer.shape) # (261, 400)
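The 20^k feature count follows from enumerating every possible k-mer over the 20 canonical residues. The standalone sketch below assumes overlapping k-mers and frequency normalisation; the library's exact counting and scaling may differ.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition(sequence: str, k: int = 2) -> dict:
    """Frequency of every possible k-mer, counted over overlapping windows."""
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=k)}
    total = len(sequence) - k + 1
    for i in range(total):
        kmer = sequence[i:i + k]
        if kmer in counts:  # skip windows containing non-canonical characters
            counts[kmer] += 1
    if total <= 0:
        return counts
    return {kmer: n / total for kmer, n in counts.items()}


features = kmer_composition("ACAC", k=2)
print(len(features))                     # 20**2 feature columns
print(len(kmer_composition("ACAC", 3)))  # 20**3 feature columns
```

With k=2 this matches the dipeptide feature space, and with k=3 the tripeptide one.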
Reduced Alphabet Composition — amino acid composition after mapping residues to a reduced physicochemical alphabet; default alphabet_size=6 (N × 6 by default):
rac = desc.get_reduced_alphabet_composition()
print(rac.shape) # (261, 6)
Motif Composition — count of 8 built-in biological sequence motifs (N × 8 by default):
motifs = desc.get_motif_composition()
print(motifs.shape) # (261, 8)
Amino Acid Pair Composition — frequency of all 400 residue-pair combinations with physicochemical class annotations (N × 400):
pair_comp = desc.get_amino_acid_pair_composition()
print(pair_comp.shape) # (261, 400)
Aliphatic Index — relative volume of aliphatic side chains (Ala, Val, Ile, Leu); higher values correlate with thermostability (N × 1):
ali = desc.get_aliphatic_index()
print(ali.columns.tolist()) # ['AliphaticIndex']
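For reference, the aliphatic index is usually computed with Ikai's formula: the mole percent of Ala, plus 2.9 times that of Val, plus 3.9 times the combined mole percent of Ile and Leu. The sketch below applies that formula directly; it is an illustration, and whether the library uses exactly these coefficients is an assumption.

```python
def aliphatic_index(sequence: str) -> float:
    """Aliphatic index from mole percentages of Ala, Val, Ile and Leu."""
    n = len(sequence)

    def mole_percent(residue: str) -> float:
        return 100.0 * sequence.count(residue) / n

    return (
        mole_percent("A")
        + 2.9 * mole_percent("V")
        + 3.9 * (mole_percent("I") + mole_percent("L"))
    )


only_ala = aliphatic_index("AAAA")
mixed = aliphatic_index("AVIL")
print(only_ala, round(mixed, 1))
```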
Extinction Coefficient — molar extinction coefficient at 280 nm from Trp, Tyr, Cys counts; reported for reduced and oxidised states (N × 2):
ext = desc.get_extinction_coefficient()
print(ext.columns.tolist()) # ['ExtCoeff_Reduced', 'ExtCoeff_Oxidized']
Boman Index — sum of solubility values for all residues divided by sequence length; predicts protein–protein interaction potential (N × 1):
boman = desc.get_boman_index()
print(boman.columns.tolist()) # ['BomanIndex']
Aggregation Propensity — count and fraction of aggregation-prone windows identified via a sliding-window Kyte-Doolittle + charge-neutrality heuristic (N × 2):
agg = desc.get_aggregation_propensity()
print(agg.columns.tolist()) # ['AggregProneRegions', 'AggregProneFraction']
Hydrophobic Moment — mean and maximum hydrophobic moment across sliding helical-wheel windows using the Eisenberg hydrophobicity scale (N × 2):
hm = desc.get_hydrophobic_moment()
print(hm.columns.tolist()) # ['HydrophobicMoment_Mean', 'HydrophobicMoment_Max']
Shannon Entropy — information-theoretic measure of amino acid diversity; 0 = fully repetitive, ~4.322 bits = perfectly uniform over 20 amino acids (N × 1):
ent = desc.get_shannon_entropy()
print(ent.columns.tolist()) # ['ShannonEntropy']
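The two reference points quoted above (0 bits and about 4.322 bits) can be checked with a few lines of standard-library Python. This is a standalone sketch of Shannon entropy over residue frequencies, not the library's implementation.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Entropy in bits of the residue frequency distribution of a sequence."""
    n = len(sequence)
    probabilities = [count / n for count in Counter(sequence).values()]
    # p * log2(1/p) is the per-symbol contribution to the entropy.
    return sum(p * math.log2(1.0 / p) for p in probabilities)


repetitive = shannon_entropy("AAAAAAAA")
uniform = shannon_entropy("ACDEFGHIKLMNPQRSTVWY")  # each residue exactly once
print(repetitive, round(uniform, 3))
```

The upper bound is log2(20), which is where the 4.322 figure comes from.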
Autocorrelation Descriptors
Autocorrelation descriptors encode the correlation between the physicochemical properties of amino acid residues separated by a given sequence lag. All three variants use up to 8 AAIndex properties and a default lag of 30, producing 240 features per descriptor (N × 240).
MoreauBroto Autocorrelation — measures the average product of property values at residues separated by lag d. It captures the overall strength of property correlation across the sequence without normalising by variance, making it sensitive to the absolute scale of the chosen property:
mb = desc.get_moreaubroto_autocorrelation()
print(mb.shape) # (261, 240)
Moran Autocorrelation — a normalised variant of MoreauBroto that divides by the variance of the property values across the sequence. This makes it scale-invariant and directly comparable across different physicochemical properties, reflecting the spatial clustering of similar residues:
ma = desc.get_moran_autocorrelation()
print(ma.shape) # (261, 240)
Geary Autocorrelation — measures the mean squared difference between property values at residues separated by lag d, normalised by the overall variance. Unlike Moran, values close to 0 indicate strong positive autocorrelation and values greater than 1 indicate negative autocorrelation, making it sensitive to local dissimilarity along the chain:
ga = desc.get_geary_autocorrelation()
print(ga.shape) # (261, 240)
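The three variants can be compared side by side on a single property and a single lag. The sketch below uses the commonly cited textbook forms and operates on a plain list of per-residue property values; the library's exact normalisation, and any standardisation of AAIndex values beforehand, are not shown and may differ.

```python
def moreau_broto(values, lag):
    """Average product of property values at residues `lag` apart."""
    n = len(values)
    return sum(values[i] * values[i + lag] for i in range(n - lag)) / (n - lag)


def moran(values, lag):
    """Lagged covariance about the mean, divided by the overall variance."""
    n = len(values)
    mean = sum(values) / n
    cov = sum((values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)) / (n - lag)
    var = sum((v - mean) ** 2 for v in values) / n
    return cov / var


def geary(values, lag):
    """Mean squared lagged difference, normalised by the sample variance."""
    n = len(values)
    mean = sum(values) / n
    num = sum((values[i] - values[i + lag]) ** 2 for i in range(n - lag)) / (2 * (n - lag))
    den = sum((v - mean) ** 2 for v in values) / (n - 1)
    return num / den


# Toy per-residue property values (not taken from any AAIndex record).
profile = [1.0, 2.0, 3.0, 4.0]
print(moreau_broto(profile, 1), moran(profile, 1), geary(profile, 1))

# 8 properties x 30 lags gives the 240 columns quoted above.
print(8 * 30)
```

On this smoothly increasing toy profile Moran is positive and Geary is below 1, both indicating positive autocorrelation at lag 1.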
CTD Descriptors
CTD — combined Composition, Transition, and Distribution descriptor (N × 147):
ctd = desc.get_ctd()
print(ctd.shape) # (261, 147)
Sub-components can be accessed individually:
ctd_c = desc.get_ctd_composition() # (261, 21)
ctd_t = desc.get_ctd_transition() # (261, 21)
ctd_d = desc.get_ctd_distribution() # (261, 105)
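The 21/21/105 split can be understood from a single property. In the usual CTD formulation each property divides the residues into three classes, giving 3 composition, 3 transition, and 15 distribution values; seven properties then give 21 + 21 + 105 = 147. The sketch below shows composition and transition for the hydrophobicity grouping commonly used in the literature. It is an illustration under those assumptions, and distribution is omitted for brevity.

```python
# Three-class hydrophobicity grouping commonly used for CTD (assumed here).
HYDROPHOBICITY_CLASSES = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}


def encode_classes(sequence: str) -> str:
    """Replace each residue with the label of its hydrophobicity class."""
    lookup = {aa: label for label, group in HYDROPHOBICITY_CLASSES.items() for aa in group}
    return "".join(lookup[aa] for aa in sequence)


def ctd_composition(encoded: str) -> dict:
    """Fraction of residues falling in each class (3 values)."""
    return {label: encoded.count(label) / len(encoded) for label in "123"}


def ctd_transition(encoded: str) -> dict:
    """Fraction of adjacent pairs that switch between two given classes (3 values)."""
    pairs = list(zip(encoded, encoded[1:]))
    result = {}
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        hits = sum(1 for pair in pairs if pair in ((a, b), (b, a)))
        result[a + b] = hits / len(pairs)
    return result


encoded = encode_classes("RGC")
print(encoded, ctd_composition(encoded), ctd_transition(encoded))

# Per property: 3 composition + 3 transition + 15 distribution values.
print(7 * 3, 7 * 3, 7 * 15, 7 * (3 + 3 + 15))
```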
Conjoint Triad
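Conjoint Triad — groups the 20 amino acids into 7 classes and counts every triad of three consecutive residues, so each residue is described together with its immediate sequence neighbours; 7^3 gives 343 features (N × 343):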
ct = desc.get_conjoint_triad()
print(ct.shape) # (261, 343)
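The 343 columns correspond to 7 × 7 × 7 class triads. In the usual formulation of this descriptor the 20 amino acids are grouped into 7 classes and each window of three consecutive residues is counted. The sketch below uses the class grouping commonly given in the literature and returns raw counts; the library's grouping and normalisation are assumptions here.

```python
from itertools import product

# Seven residue classes commonly used for the conjoint triad (assumed here).
TRIAD_CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]


def conjoint_triad(sequence: str) -> dict:
    """Count each class triad over overlapping windows of three residues."""
    lookup = {aa: str(i) for i, group in enumerate(TRIAD_CLASSES, start=1) for aa in group}
    encoded = "".join(lookup[aa] for aa in sequence)
    counts = {"".join(t): 0 for t in product("1234567", repeat=3)}
    for i in range(len(encoded) - 2):
        counts[encoded[i:i + 3]] += 1
    return counts


triads = conjoint_triad("AGVC")
print(len(triads))                    # 7**3 possible triads
print(triads["111"], triads["117"])   # windows AGV and GVC
```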
Sequence Order Descriptors
Sequence Order Coupling Number — dissimilarity between amino acid pairs at varying distances; default lag=30 gives 60 features. Can use Schneider-Wrede and/or Grantham distance matrices (N × 60):
socn = desc.get_sequence_order_coupling_number()
print(socn.shape) # (261, 60)
Quasi Sequence Order — extends SOCN with amino acid composition; generates 100 features by default (N × 100):
qso = desc.get_quasi_sequence_order()
print(qso.shape) # (261, 100)
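The feature counts above follow from simple arithmetic: 30 lags for each of two distance matrices gives 60 coupling numbers, and (20 composition terms + 30 coupling terms) for each of two matrices gives 100. The sketch below shows the coupling number itself as a sum of squared distances between residues a fixed lag apart. The toy_distance function is a hypothetical placeholder and does not reproduce the Schneider-Wrede or Grantham values.

```python
def sequence_order_coupling(sequence: str, lag: int, distance) -> float:
    """Sum of squared pairwise distances between residues `lag` positions apart."""
    return sum(
        distance(sequence[i], sequence[i + lag]) ** 2
        for i in range(len(sequence) - lag)
    )


def toy_distance(a: str, b: str) -> int:
    """Placeholder distance for illustration only (not a real substitution matrix)."""
    return abs(ord(a) - ord(b))


taus = [sequence_order_coupling("ACDA", lag, toy_distance) for lag in (1, 2)]
print(taus)

# Feature-count arithmetic for the defaults quoted above.
max_lag, n_matrices = 30, 2
print(max_lag * n_matrices)          # coupling numbers
print((20 + max_lag) * n_matrices)   # quasi-sequence-order terms
```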
Pseudo Composition Descriptors
Pseudo Amino Acid Composition — combines amino acid composition with physicochemical correlation factors. Default generates 50 features (N × 50):
paac = desc.get_pseudo_amino_acid_composition()
print(paac.shape) # (261, 50)
Amphiphilic Pseudo Amino Acid Composition — extends PAAComp with hydrophobic and hydrophilic distribution patterns along the chain. Default generates 80 features (N × 80):
apaac = desc.get_amphiphilic_pseudo_amino_acid_composition()
print(apaac.shape) # (261, 80)
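The default sizes follow from the lambda parameter: pseudo amino acid composition has 20 + lambda terms (50 for lambda = 30), and the amphiphilic variant has 20 + 2 × lambda (80). The sketch below shows only the final normalisation step of Chou's formulation, taking the 20 residue frequencies and the lambda correlation factors as given; computing those factors from physicochemical properties is omitted, and the function name is hypothetical.

```python
def pseudo_aac(frequencies, thetas, weight=0.05):
    """Combine 20 residue frequencies with correlation factors into one vector.

    Hypothetical helper: `thetas` are the sequence-order correlation factors,
    assumed to have been computed elsewhere.
    """
    denominator = sum(frequencies) + weight * sum(thetas)
    composition_part = [f / denominator for f in frequencies]
    correlation_part = [weight * t / denominator for t in thetas]
    return composition_part + correlation_part


# Toy inputs: uniform composition and constant correlation factors.
vector = pseudo_aac([0.05] * 20, [1.0] * 30, weight=0.05)
print(len(vector))             # 20 + lambda
print(round(sum(vector), 6))   # the components are normalised to sum to 1
print(20 + 2 * 30)             # size of the amphiphilic variant for lambda = 30
```

The weight of 0.05 and lambda of 30 mirror the values in the example config.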
Calculating All Descriptors
To calculate all 36 descriptors at once and concatenate them into a single DataFrame:
all_desc = desc.get_all_descriptors()
print(all_desc.shape) # (261, <total_features>)
# Export to CSV for future reuse (avoids recomputation)
desc.get_all_descriptors(export=True, descriptors_export_filename="descriptors.csv")
AAI Encoding
Encode sequences using physicochemical indices from the AAIndex1 database, combined with Digital Signal Processing (DSP) features:
from pySAR.encoding import Encoding
enc = Encoding(config_file="config/thermostability.json")
# Encode using a single AAIndex record
results = enc.aai_encoding(aai_indices="FAUJ880110", sort_by="R2")
print(results[["Index", "R2", "RMSE"]])
# Encode using multiple comma-separated AAIndex records
results = enc.aai_encoding(aai_indices="FAUJ880110, BIGC670101", sort_by="MSE")
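To show what the DSP step does to an AAI-encoded sequence, the sketch below maps residues to numbers, applies a Hamming window, and derives the four spectrum types from a discrete Fourier transform. It uses only the standard library and a naive DFT for clarity. The toy_index values are placeholders rather than a real AAIndex record, and the library's own pipeline (windowing, filtering, scaling) may differ.

```python
import cmath
import math


def hamming(n: int) -> list:
    """Hamming window coefficients for a signal of length n (n > 1)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft(signal: list) -> list:
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def spectra(signal: list) -> dict:
    """Window the signal, transform it, and return the four spectrum types."""
    windowed = [x * w for x, w in zip(signal, hamming(len(signal)))]
    coeffs = dft(windowed)
    return {
        "power": [abs(c) ** 2 for c in coeffs],
        "absolute": [abs(c) for c in coeffs],
        "real": [c.real for c in coeffs],
        "imaginary": [c.imag for c in coeffs],
    }


# Placeholder index values for four residues (not a real AAIndex record).
toy_index = {"A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5}
signal = [toy_index[aa] for aa in "ACDE"]
result = spectra(signal)
print(result["power"])
```

In practice a fast Fourier transform from a numerical library would replace the naive dft function.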
Descriptor Encoding
Build predictive models using one or more protein descriptors as feature matrices:
from pySAR.encoding import Encoding
enc = Encoding(config_file="config/thermostability.json")
# Single descriptor
results = enc.descriptor_encoding(descriptors="amino_acid_composition", desc_combo=1, sort_by="R2")
# Multiple specific descriptors
results = enc.descriptor_encoding(
descriptors=["gravy", "molecular_weight", "charge_distribution"],
desc_combo=1, sort_by="R2"
)
# All 36 descriptors (empty list = all)
results = enc.descriptor_encoding(descriptors=[], desc_combo=1, sort_by="R2")
print(len(results)) # 36
# All combinations of 2 descriptors
results = enc.descriptor_encoding(descriptors=[], desc_combo=2, sort_by="R2")
print(len(results)) # 630 (36 choose 2)
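The number of models grows quickly with desc_combo, because every unordered combination of that many descriptors is evaluated. The sketch below illustrates the counting with a short, hypothetical list of descriptor names; it assumes combinations without repetition and does not call pySAR.

```python
from itertools import combinations
from math import comb

# Hypothetical subset of descriptor names, for illustration only.
names = ["amino_acid_composition", "gravy", "molecular_weight", "ctd"]

pairs = list(combinations(names, 2))
print(len(pairs))            # number of unordered pairs from 4 names
print(comb(len(names), 2))   # same count from the closed-form formula
print(pairs[0])              # first pair, in input order
```

Since each combination trains and evaluates a separate model, larger desc_combo values can take considerably longer to run.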
AAI + Descriptor Encoding
Combine AAI-encoded features with descriptor features:
from pySAR.encoding import Encoding
enc = Encoding(config_file="config/thermostability.json")
results = enc.aai_descriptor_encoding(
aai_indices="FAUJ880110",
descriptors="amino_acid_composition",
desc_combo=1,
sort_by="R2"
)
print(results[["Index", "Descriptor", "R2", "RMSE"]])
PySAR Workflow
The PySAR class provides the top-level workflow for building and evaluating models:
from pySAR.pySAR import PySAR
# AAI encoding workflow
pysar = PySAR(config_file="config/thermostability.json")
results = pysar.encode_aai(aai_indices="FAUJ880110")
# Descriptor encoding workflow
results = pysar.encode_descriptor(descriptors="amino_acid_composition")
# AAI + Descriptor encoding workflow
results = pysar.encode_aai_descriptor(
aai_indices="FAUJ880110",
descriptors="amino_acid_composition"
)