Descriptors
The Descriptors class (pySAR/descriptors.py) calculates a comprehensive set of
physicochemical, biochemical, and structural protein descriptors. These 33 descriptors
span composition, autocorrelation, CTD, conjoint triad, sequence order, and pseudo
amino acid composition and produce over 10,000 features in total when all are calculated.
Descriptors are calculated via protpy, a purpose-built open-source package for protein feature engineering. Input sequences must contain only the 20 canonical amino acids; gaps are stripped automatically on initialisation.
```python
from pySAR.descriptors import Descriptors

desc = Descriptors(config_file="config/thermostability.json")

# calculate a single descriptor
aa_comp = desc.get_amino_acid_composition()  # shape: (N, 20)

# calculate all descriptors at once
all_desc = desc.get_all_descriptors()  # shape: (N, 10572+)
```
Instantiation
Descriptors.__init__(config_file, protein_seqs=None, **kwargs)
| Parameter | Default | Description |
|---|---|---|
| `config_file` | - | Path to the JSON configuration file. |
| `protein_seqs` | `None` | Protein sequences; if not supplied, they are read from the dataset CSV. |
| `**kwargs` | - | Keyword arguments. |
On construction the class:
1. Parses the config JSON and loads dataset/descriptor parameters.
2. Reads protein sequences from the dataset CSV if not directly supplied.
3. Strips gaps and validates all sequences against the 20 canonical amino acids.
4. Attempts to import pre-calculated descriptor values from the descriptors_csv path, if it exists.
Importing pre-calculated descriptors is strongly recommended for large datasets. Set
all_desc: 1 in the [descriptors] config section on the first run to generate the
CSV; subsequent runs then load from it directly without recalculating.
Descriptor Groups
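| Group | Descriptors |
|---|---|
| Composition | Amino Acid Composition, Dipeptide Composition, Tripeptide Composition, GRAVY, Aromaticity, Instability Index, Isoelectric Point, Molecular Weight, Charge Distribution, Hydrophobic/Polar/Charged Composition, Secondary Structure Propensity, k-mer Composition, Reduced Alphabet Composition, Motif Composition, Amino Acid Pair Composition, Aliphatic Index, Extinction Coefficient, Boman Index, Aggregation Propensity, Hydrophobic Moment, Shannon Entropy |
| Autocorrelation | MoreauBroto Autocorrelation, Moran Autocorrelation, Geary Autocorrelation |
| CTD | CTD, CTD Composition, CTD Transition, CTD Distribution |
| Conjoint Triad | Conjoint Triad |
| Sequence Order | Sequence Order Coupling Number, Quasi Sequence Order |
| Pseudo Composition | Pseudo Amino Acid Composition, Amphiphilic Pseudo Amino Acid Composition |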
Composition Descriptors
Composition descriptors capture the amino acid content and physicochemical properties of a sequence without considering positional information.
Amino Acid Composition
Method: get_amino_acid_composition() | Features: 20
The fraction of each of the 20 canonical amino acid types within a sequence:
$$AAComp(t) = \frac{AA(t)}{N}$$
where $AA(t)$ is the count of amino acid type $t$ and $N$ is the total sequence length.
aa_comp = desc.get_amino_acid_composition() # shape: (N, 20)
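As a minimal sketch of this definition for a single sequence, the count-and-divide step can be written in plain Python. This is independent of pySAR and protpy; the function name `amino_acid_composition` is illustrative only, and it returns fractions rather than a DataFrame row.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues


def amino_acid_composition(sequence):
    """Return the fraction of each canonical amino acid in one sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    n = len(sequence)
    # Counter returns 0 for residues that never occur.
    return {aa: counts[aa] / n for aa in AMINO_ACIDS}


comp = amino_acid_composition("ACDA")
print(comp["A"], comp["C"], comp["W"])
```

The 20 values sum to 1 for any sequence made up only of canonical residues.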
Dipeptide Composition
Method: get_dipeptide_composition() | Features: 400 (20²)
The fraction of each of the 400 possible dipeptide types:
$$DPComp(s, t) = \frac{AA(s, t)}{N - 1}$$
where $AA(s,t)$ is the count of dipeptide type $(s, t)$ and $N-1$ is the total number of dipeptides in the sequence.
dp_comp = desc.get_dipeptide_composition() # shape: (N, 400)
Tripeptide Composition
Method: get_tripeptide_composition() | Features: 8000 (20³)
The fraction of each of the 8,000 possible tripeptide types. This descriptor is computationally expensive on large datasets; pre-calculation and CSV caching are recommended.
tp_comp = desc.get_tripeptide_composition() # shape: (N, 8000)
GRAVY
Method: get_gravy() | Features: 1
The Grand Average of Hydropathy (GRAVY) is the mean Kyte-Doolittle hydropathy score across all residues. Positive values indicate overall hydrophobicity; negative values indicate hydrophilicity.
gravy = desc.get_gravy() # shape: (N, 1)
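For illustration, GRAVY for one sequence can be sketched as the mean of the published Kyte-Doolittle hydropathy values. The helper name `gravy` is hypothetical and the code does not call pySAR or protpy.

```python
# Kyte-Doolittle hydropathy scale (Kyte & Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Mean Kyte-Doolittle hydropathy over all residues of one sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


hydrophobic_score = gravy("AI")   # alanine + isoleucine, positive
hydrophilic_score = gravy("RK")   # arginine + lysine, negative
```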
Aromaticity
Method: get_aromaticity() | Features: 1
Fraction of aromatic residues (F, W, Y, H) in the sequence.
arom = desc.get_aromaticity() # shape: (N, 1)
Instability Index
Method: get_instability_index() | Features: 1
Computed from dipeptide instability weight values (DIWV). A value below 40 indicates a stable protein; 40 or above suggests instability.
ii = desc.get_instability_index() # shape: (N, 1)
Isoelectric Point
Method: get_isoelectric_point() | Features: 1
The estimated pH at which the protein carries no net charge, calculated iteratively using standard pKa values for ionisable residues.
pi = desc.get_isoelectric_point() # shape: (N, 1)
Molecular Weight
Method: get_molecular_weight() | Features: 1
Average molecular weight (Da) calculated from residue masses, corrected for water lost at each peptide bond.
mw = desc.get_molecular_weight() # shape: (N, 1)
Charge Distribution
Method: get_charge_distribution() | Features: 3
Positive, negative, and net charge contributions of ionisable residues at a specified
pH using the Henderson-Hasselbalch equation (default pH 7.4). Output columns:
PositiveCharge, NegativeCharge, NetCharge.
Config parameter: charge_distribution.ph (default 7.4).
charge = desc.get_charge_distribution() # shape: (N, 3)
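The Henderson-Hasselbalch step can be sketched as below. The pKa values are illustrative textbook-style side-chain values chosen for this example and are an assumption; pySAR and protpy may use different values and may also account for termini or other ionisable groups, which this sketch ignores.

```python
# Assumed side-chain pKa values, for illustration only.
POSITIVE_PKA = {"K": 10.5, "R": 12.5, "H": 6.0}
NEGATIVE_PKA = {"D": 3.65, "E": 4.25}


def charge_distribution(sequence, ph=7.4):
    """Positive, negative and net side-chain charge of one sequence at a given pH."""
    # Fraction of a basic group that is protonated (charge +1).
    positive = sum(
        1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[aa]))
        for aa in sequence if aa in POSITIVE_PKA
    )
    # Fraction of an acidic group that is deprotonated (charge -1).
    negative = -sum(
        1.0 / (1.0 + 10 ** (NEGATIVE_PKA[aa] - ph))
        for aa in sequence if aa in NEGATIVE_PKA
    )
    return {
        "PositiveCharge": positive,
        "NegativeCharge": negative,
        "NetCharge": positive + negative,
    }


half_protonated = charge_distribution("H", ph=6.0)  # pH equal to the assumed pKa
near_neutral = charge_distribution("KD")            # default pH 7.4
```

When the pH equals a group's pKa, that group is half ionised, which gives a quick sanity check of the formula.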
Hydrophobic/Polar/Charged Composition
Method: get_hydrophobic_polar_charged_composition() | Features: 3
Percentage of residues belonging to each of three physicochemical groups:
Hydrophobic: A, C, F, I, L, M, V, W, Y
Polar: G, N, Q, S, T
Charged: D, E, H, K, R
Output columns: Hydrophobic, Polar, Charged.
hpc = desc.get_hydrophobic_polar_charged_composition() # shape: (N, 3)
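A short sketch of the grouping, using exactly the three residue lists given above. Note that proline (P) does not appear in any of the listed groups, so in this sketch the three percentages can sum to less than 100; how pySAR treats proline is not stated here. The function name is illustrative.

```python
GROUPS = {
    "Hydrophobic": set("ACFILMVWY"),
    "Polar": set("GNQST"),
    "Charged": set("DEHKR"),
}


def hpc_composition(sequence):
    """Percentage of residues in each of the three groups for one sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    n = len(sequence)
    return {
        name: 100.0 * sum(aa in members for aa in sequence) / n
        for name, members in GROUPS.items()
    }


# One residue from each group plus a proline, which falls in no group.
hpc_example = hpc_composition("AGDP")
```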
Secondary Structure Propensity
Method: get_secondary_structure_propensity() | Features: 3
Average Chou-Fasman propensity values for alpha-helix, beta-sheet, and random-coil
conformations across all residues. Output columns: Helix, Sheet, Coil.
ssp = desc.get_secondary_structure_propensity() # shape: (N, 3)
k-mer Composition
Method: get_kmer_composition() | Features: $20^k$ (default 400)
Frequency of all possible k-length residue subsequences expressed as a percentage of
total k-mers. Config parameter: kmer_composition.k (default 2, producing 400 features).
kmer = desc.get_kmer_composition() # shape: (N, 400) with k=2
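To make the feature count concrete, here is a minimal sketch of k-mer composition for one sequence. It enumerates all possible k-mers up front, which is why the output length is 20 to the power k regardless of the sequence. The function name is illustrative and the code does not use pySAR.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition(sequence, k=2):
    """Percentage of each possible k-mer among all overlapping k-mers."""
    if k < 1 or len(sequence) < k:
        raise ValueError("need k >= 1 and a sequence of at least k residues")
    # One zero-initialised entry per possible k-mer: 20**k keys.
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=k)}
    total = len(sequence) - k + 1
    for i in range(total):
        counts[sequence[i:i + k]] += 1  # assumes canonical residues only
    return {kmer: 100.0 * c / total for kmer, c in counts.items()}


kmers = kmer_composition("ACAC", k=2)  # k-mers: AC, CA, AC
```

With k=2 this coincides with dipeptide composition expressed as a percentage; with k=3 it yields 8,000 features, matching the tripeptide count.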
Reduced Alphabet Composition
Method: get_reduced_alphabet_composition() | Features: alphabet_size (default 6)
Amino acid composition after mapping residues to a reduced set of physicochemical
groups. Supported alphabet sizes: 2, 3, 4, 6. Config parameter:
reduced_alphabet_composition.alphabet_size (default 6).
rac = desc.get_reduced_alphabet_composition() # shape: (N, 6)
Motif Composition
Method: get_motif_composition() | Features: number of motifs (default 8)
Count of occurrences (including overlapping) of predefined biological sequence motifs,
matched by regular expression. Uses 8 built-in motifs by default; a custom
name → pattern dict can be supplied via motif_composition.motifs in config.
motif = desc.get_motif_composition() # shape: (N, 8)
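Counting overlapping matches is the subtle part, because a plain regex search skips past each match. One way to do it is a zero-width lookahead, sketched below. The two motifs are hypothetical examples chosen for illustration (an N-glycosylation-style pattern and RGD); they are not claimed to be among pySAR's built-in motifs.

```python
import re

# Hypothetical name -> pattern dict, for illustration only.
EXAMPLE_MOTIFS = {
    "N_glycosylation": r"N[^P][ST][^P]",
    "RGD": r"RGD",
}


def count_motifs(sequence, motifs):
    """Count occurrences of each motif, including overlapping ones."""
    counts = {}
    for name, pattern in motifs.items():
        # The lookahead matches at every start position without consuming text.
        lookahead = re.compile(f"(?=(?:{pattern}))")
        counts[name] = sum(1 for _ in lookahead.finditer(sequence))
    return counts


motif_counts = count_motifs("NGSANGTA", EXAMPLE_MOTIFS)
overlapping = count_motifs("AAAA", {"AA": "AA"})  # matches at positions 0, 1, 2
```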
Amino Acid Pair Composition
Method: get_amino_acid_pair_composition() | Features: 400
Frequency of all 400 residue-pair combinations, with column names annotated by the physicochemical class of each residue.
pair = desc.get_amino_acid_pair_composition() # shape: (N, 400)
Aliphatic Index
Method: get_aliphatic_index() | Features: 1
Relative volume occupied by aliphatic side chains (Ala, Val, Ile, Leu). Higher values are associated with greater thermostability.
ai = desc.get_aliphatic_index() # shape: (N, 1)
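For reference, the commonly cited formulation of the aliphatic index (Ikai, 1980) weights the mole percentages of Ala, Val, Ile and Leu with coefficients 1, 2.9, 3.9 and 3.9. The sketch below assumes that formulation; whether pySAR uses exactly these coefficients is an assumption.

```python
def aliphatic_index(sequence):
    """Aliphatic index of one sequence using Ikai's coefficients (assumed)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    n = len(sequence)

    def mole_percent(aa):
        return 100.0 * sequence.count(aa) / n

    return (
        mole_percent("A")
        + 2.9 * mole_percent("V")
        + 3.9 * (mole_percent("I") + mole_percent("L"))
    )


all_aliphatic = aliphatic_index("AVIL")  # 25% each of A, V, I, L
no_aliphatic = aliphatic_index("GGGG")
```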
Extinction Coefficient
Method: get_extinction_coefficient() | Features: 2
Molar extinction coefficient at 280 nm derived from the number of Trp (W), Tyr (Y),
and Cys (C) residues. Reported for both reduced and oxidised states. Output columns:
ExtCoeff_Reduced, ExtCoeff_Oxidized.
ec = desc.get_extinction_coefficient() # shape: (N, 2)
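A sketch of the residue-count calculation, assuming the widely used per-residue molar extinction values at 280 nm (Trp 5500, Tyr 1490, cystine 125, in M⁻¹ cm⁻¹). These constants and the pairing of cysteines into cystines in the oxidised state are assumptions about the method, not taken from pySAR's source.

```python
# Assumed molar extinction values at 280 nm (M^-1 cm^-1).
EXT_TRP = 5500
EXT_TYR = 1490
EXT_CYSTINE = 125


def extinction_coefficient(sequence):
    """Return (reduced, oxidised) extinction coefficients for one sequence."""
    reduced = sequence.count("W") * EXT_TRP + sequence.count("Y") * EXT_TYR
    # In the oxidised state every pair of cysteines is assumed to form a cystine.
    oxidised = reduced + (sequence.count("C") // 2) * EXT_CYSTINE
    return reduced, oxidised


ext_reduced, ext_oxidised = extinction_coefficient("WYCC")
```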
Boman Index
Method: get_boman_index() | Features: 1
Sum of residue solubility values divided by sequence length. Predicts potential for protein-protein interactions.
boman = desc.get_boman_index() # shape: (N, 1)
Aggregation Propensity
Method: get_aggregation_propensity() | Features: 2
Identifies aggregation-prone regions via a sliding-window approach combining
Kyte-Doolittle hydrophobicity and charge neutrality. Output columns:
AggregProneRegions (count of qualifying windows) and AggregProneFraction
(fraction of sequence covered). Config parameters: aggregation_propensity.window
(default 5), .hydrophobicity_threshold (default 2.0), .charge_threshold
(default 1).
agg = desc.get_aggregation_propensity() # shape: (N, 2)
Hydrophobic Moment
Method: get_hydrophobic_moment() | Features: 2
Mean and maximum hydrophobic moment across sliding windows using the Eisenberg
hydrophobicity scale and a helical-wheel projection, capturing amphipathicity. Output
columns: HydrophobicMoment_Mean, HydrophobicMoment_Max. Config parameters:
hydrophobic_moment.window (default 11), .angle (default 100).
hm = desc.get_hydrophobic_moment() # shape: (N, 2)
Shannon Entropy
Method: get_shannon_entropy() | Features: 1
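An information-theoretic measure of amino acid diversity:
$$H = -\sum_{i=1}^{20} p_i \log_2 p_i$$
where $p_i$ is the fraction of residues of amino acid type $i$ (types that do not occur contribute zero).
A value of 0 indicates a completely repetitive sequence; the theoretical maximum of ~4.322 bits ($\log_2 20$) corresponds to a perfectly uniform distribution across all 20 amino acids.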
se = desc.get_shannon_entropy() # shape: (N, 1)
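The two limiting cases described above can be checked with a few lines of standard-library Python. This is a sketch for a single sequence; the function name is illustrative.

```python
import math
from collections import Counter


def shannon_entropy(sequence):
    """Shannon entropy (in bits) of the residue distribution of one sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    n = len(sequence)
    # Only residues that occur are in the Counter, so log2(0) never arises.
    return sum(-(c / n) * math.log2(c / n) for c in Counter(sequence).values())


repetitive = shannon_entropy("AAAA")
two_types = shannon_entropy("AC")
uniform = shannon_entropy("ACDEFGHIKLMNPQRSTVWY")
```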
Autocorrelation Descriptors
Autocorrelation descriptors describe the level of correlation between two positions in a sequence separated by a lag distance $d$, in terms of a specified physicochemical property. Each of the three variants uses a different mathematical formulation. By default, 8 physicochemical properties are used with a lag of 30, generating 240 features per descriptor.
Default properties (8):
| AAIndex Accession | Property |
|---|---|
| CIDH920105 | Normalised Average Hydrophobicity |
| BHAR880101 | Average Flexibility Indices |
| CHAM820101 | Polarizability Parameter |
| CHAM820102 | Free Energy of Solution in Water (kcal/mol) |
| CHOC760101 | Residue Accessible Surface Area in Tripeptide |
| BIGC670101 | Residue Volume |
| CHAM810101 | Steric Parameter |
| DAYM780201 | Relative Mutability |
Config parameters common to all three descriptors: lag (default 30),
properties (list of AAIndex accession numbers), normalize (bool).
Feature count formula: lag × len(properties) → default 30 × 8 = 240.
MoreauBroto Autocorrelation
Method: get_moreaubroto_autocorrelation() | Features: lag × properties (default 240)
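Uses the raw property values of two residues separated by lag $d$:
$$AC(d) = \sum_{i=1}^{N-d} P_i \, P_{i+d}$$
where $P_i$ is the property value of the residue at position $i$ and $N$ is the sequence length.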
Config section: [moreaubroto_autocorrelation].
mb = desc.get_moreaubroto_autocorrelation() # shape: (N, 240)
Moran Autocorrelation
Method: get_moran_autocorrelation() | Features: lag × properties (default 240)
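Uses normalised deviations from the mean property value:
$$I(d) = \frac{\frac{1}{N-d}\sum_{i=1}^{N-d}(P_i - \bar{P})(P_{i+d} - \bar{P})}{\frac{1}{N}\sum_{i=1}^{N}(P_i - \bar{P})^2}$$
where $P_i$ is the property value of the residue at position $i$, $\bar{P}$ is the mean property value over the sequence, and $N$ is the sequence length.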
Config section: [moran_autocorrelation].
moran = desc.get_moran_autocorrelation() # shape: (N, 240)
Geary Autocorrelation
Method: get_geary_autocorrelation() | Features: lag × properties (default 240)
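Uses squared differences between residue property values:
$$C(d) = \frac{\frac{1}{2(N-d)}\sum_{i=1}^{N-d}(P_i - P_{i+d})^2}{\frac{1}{N-1}\sum_{i=1}^{N}(P_i - \bar{P})^2}$$
where $P_i$ is the property value of the residue at position $i$, $\bar{P}$ is the mean property value over the sequence, and $N$ is the sequence length.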
Config section: [geary_autocorrelation].
geary = desc.get_geary_autocorrelation() # shape: (N, 240)
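To compare the three formulations side by side, the sketch below applies the standard textbook definitions of the Moreau-Broto, Moran and Geary statistics to a list of per-residue property values at a single lag. It assumes those standard definitions; pySAR and protpy may additionally standardise the property values or normalise the result, which is not shown. The toy values are not AAIndex data.

```python
def moreau_broto(values, d):
    """Sum of products of property values d positions apart."""
    return sum(values[i] * values[i + d] for i in range(len(values) - d))


def moran(values, d):
    """Covariance at lag d divided by the overall variance."""
    n = len(values)
    mean = sum(values) / n
    num = sum(
        (values[i] - mean) * (values[i + d] - mean) for i in range(n - d)
    ) / (n - d)
    den = sum((v - mean) ** 2 for v in values) / n
    return num / den


def geary(values, d):
    """Mean squared difference at lag d divided by the sample variance."""
    n = len(values)
    mean = sum(values) / n
    num = sum(
        (values[i] - values[i + d]) ** 2 for i in range(n - d)
    ) / (2 * (n - d))
    den = sum((v - mean) ** 2 for v in values) / (n - 1)
    return num / den


toy_property = [1.0, 2.0, 3.0, 4.0]  # hypothetical per-residue values
mb_1 = moreau_broto(toy_property, 1)
moran_1 = moran(toy_property, 1)
geary_1 = geary(toy_property, 1)
```

Evaluating each function for lags 1 through 30 and for each of the 8 properties gives the 30 × 8 = 240 features per descriptor noted above.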
CTD Descriptors
CTD describes the amino acid composition within seven physicochemical property classes (hydrophobicity, volume, polarity, polarisability, charge, secondary structure, solvent accessibility). Each property divides the 20 amino acids into three classes (C1, C2, C3), from which three sub-descriptors are computed.
Using all 7 properties generates 147 features (21 per property).
A subset of properties can be specified via ctd.property in config.
CTD (Combined)
Method: get_ctd() | Features: 147 (all 7 properties)
Contains all CTD sub-descriptors concatenated: Composition + Transition + Distribution.
ctd = desc.get_ctd() # shape: (N, 147)
CTD Composition
Method: get_ctd_composition() | Features: 3 per property (21 total)
Fraction of residues in each of the three classes (C1, C2, C3) for each property.
ctd_c = desc.get_ctd_composition() # shape: (N, 21)
CTD Transition
Method: get_ctd_transition() | Features: 3 per property (21 total)
Fraction of transitions between pairs of property classes in the sequence (C1↔C2, C1↔C3, C2↔C3).
ctd_t = desc.get_ctd_transition() # shape: (N, 21)
CTD Distribution
Method: get_ctd_distribution() | Features: 15 per property (105 total)
For each class, the positions along the sequence (as percentages of sequence length) at which the first residue of that class occurs and at which 25%, 50%, 75%, and 100% of that class's residues have been reached. This captures how each property class is distributed along the sequence.
ctd_d = desc.get_ctd_distribution() # shape: (N, 105)
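The three sub-descriptors can be sketched for a single property as follows. The class assignment used here is the commonly published hydrophobicity grouping (polar RKEDQN, neutral GASTPHY, hydrophobic CLVIMFW) and is an assumption about what pySAR uses. The rounding convention in the distribution step (floor, with a minimum of the first occurrence) is also just one possible choice.

```python
import math

# Assumed hydrophobicity classes: 1 = polar, 2 = neutral, 3 = hydrophobic.
HYDROPHOBICITY_CLASSES = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}
LOOKUP = {aa: c for c, members in HYDROPHOBICITY_CLASSES.items() for aa in members}


def encode(sequence):
    """Replace each residue by its class label."""
    return "".join(LOOKUP[aa] for aa in sequence)


def ctd_composition(encoded):
    """Fraction of residues in each class (3 values)."""
    return [encoded.count(c) / len(encoded) for c in "123"]


def ctd_transition(encoded):
    """Fraction of adjacent pairs that switch between two classes (3 values)."""
    pairs = [encoded[i:i + 2] for i in range(len(encoded) - 1)]
    return [
        (pairs.count(a + b) + pairs.count(b + a)) / len(pairs)
        for a, b in (("1", "2"), ("1", "3"), ("2", "3"))
    ]


def ctd_distribution(encoded):
    """Position (% of length) of the first, 25%, 50%, 75%, 100% residue per class."""
    features = []
    for c in "123":
        positions = [i + 1 for i, x in enumerate(encoded) if x == c]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                features.append(0.0)  # class absent from the sequence
                continue
            k = max(1, math.floor(len(positions) * frac))  # k-th occurrence
            features.append(100.0 * positions[k - 1] / len(encoded))
    return features


encoded = encode("RGCRGC")  # -> "123123"
ctd_features = ctd_composition(encoded) + ctd_transition(encoded) + ctd_distribution(encoded)
```

The concatenated list has 3 + 3 + 15 = 21 values, matching the per-property count stated above.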
Conjoint Triad
Method: get_conjoint_triad() | Features: 343 (7³)
Describes the neighbourhood environment of each residue by considering triplets of adjacent residues, each residue grouped into one of 7 physicochemical classes. The frequency of each of the 7³ = 343 possible triplet combinations is computed.
ct = desc.get_conjoint_triad() # shape: (N, 343)
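A minimal sketch of the triad counting, assuming the seven-class grouping commonly used for this descriptor (AGV, ILFP, YMTS, HNQW, RK, DE, C). Both the grouping and the simple relative-frequency normalisation are assumptions; the library may group or normalise differently.

```python
from itertools import product

# Assumed 7-class grouping of the 20 amino acids.
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: str(i) for i, members in enumerate(CLASSES, start=1) for aa in members}


def conjoint_triad(sequence):
    """Relative frequency of each of the 343 class triplets in one sequence."""
    if len(sequence) < 3:
        raise ValueError("sequence must contain at least 3 residues")
    encoded = "".join(CLASS_OF[aa] for aa in sequence)
    counts = {"".join(t): 0 for t in product("1234567", repeat=3)}
    total = len(encoded) - 2
    for i in range(total):
        counts[encoded[i:i + 3]] += 1
    return {triad: c / total for triad, c in counts.items()}


triads = conjoint_triad("AGVC")  # encoded as "1117": triads 111 and 117
```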
Sequence Order Descriptors
Sequence Order Coupling Number
Method: get_sequence_order_coupling_number() | Features: lag or 2 × lag
Captures long-range interactions by summing the squared differences of a property
between residues $d$ positions apart up to a specified lag. If a single distance matrix
is given in config, lag features are produced; if no matrix is specified both the
Schneider-Wrede and Grantham matrices are used, producing 2 × lag features.
Config section: [sequence_order_coupling_number], params: lag, distance_matrix.
socn = desc.get_sequence_order_coupling_number()
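The coupling numbers can be sketched as one sum of squared residue-residue distances per lag. The `toy_distance` function below is a hypothetical stand-in so that the example is self-contained; it is neither the Schneider-Wrede nor the Grantham matrix.

```python
def toy_distance(a, b):
    """Hypothetical distance between two residues (NOT a real distance matrix)."""
    return abs(ord(a) - ord(b))


def sequence_order_coupling_numbers(sequence, distance, max_lag):
    """Return one coupling number per lag d = 1..max_lag."""
    n = len(sequence)
    if not 1 <= max_lag < n:
        raise ValueError("max_lag must be between 1 and len(sequence) - 1")
    return [
        sum(distance(sequence[i], sequence[i + d]) ** 2 for i in range(n - d))
        for d in range(1, max_lag + 1)
    ]


socn_example = sequence_order_coupling_numbers("ACA", toy_distance, max_lag=2)
```

Running the same function once per distance matrix and concatenating the results gives the 2 × lag layout described above.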
Quasi Sequence Order
Method: get_quasi_sequence_order() | Features: 20 + lag or 2 × (20 + lag)
Extends amino acid composition with sequence-order correlation factors derived from
pairwise residue distance matrices. Feature count: 20 + lag with one distance matrix,
or 2 × (20 + lag) when both Schneider-Wrede and Grantham matrices are used.
Config section: [quasi_sequence_order], params: lag, distance_matrix.
qso = desc.get_quasi_sequence_order()
Pseudo Amino Acid Composition
Pseudo Amino Acid Composition (Type 1)
Method: get_pseudo_amino_acid_composition() | Features: 20 + lambda
Augments amino acid composition (20 features) with lambda sequence-order correlation
factors (correlation along the chain at lags 1 through lambda), capturing both
composition and sequence-order information. Config section:
[pseudo_amino_acid_composition], param: lambda.
paac = desc.get_pseudo_amino_acid_composition() # shape: (N, 20+lambda)
Amphiphilic Pseudo Amino Acid Composition (Type 2)
Method: get_amphiphilic_pseudo_amino_acid_composition() | Features: 20 + 2 × lambda
Extends PseAAC Type 1 by adding separate hydrophobicity and hydrophilicity correlation
factors for each lag, producing 20 + 2 × lambda features. Designed to capture
amphipathic patterns. Config section:
[amphiphilic_pseudo_amino_acid_composition], param: lambda.
apaac = desc.get_amphiphilic_pseudo_amino_acid_composition() # shape: (N, 20+(2*lambda))
All Descriptors Summary
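| Descriptor | Features | Method |
|---|---|---|
| Amino Acid Composition | 20 | `get_amino_acid_composition()` |
| Dipeptide Composition | 400 | `get_dipeptide_composition()` |
| Tripeptide Composition | 8000 | `get_tripeptide_composition()` |
| GRAVY | 1 | `get_gravy()` |
| Aromaticity | 1 | `get_aromaticity()` |
| Instability Index | 1 | `get_instability_index()` |
| Isoelectric Point | 1 | `get_isoelectric_point()` |
| Molecular Weight | 1 | `get_molecular_weight()` |
| Charge Distribution | 3 | `get_charge_distribution()` |
| Hydrophobic/Polar/Charged Composition | 3 | `get_hydrophobic_polar_charged_composition()` |
| Secondary Structure Propensity | 3 | `get_secondary_structure_propensity()` |
| k-mer Composition | $20^k$ (default 400) | `get_kmer_composition()` |
| Reduced Alphabet Composition | alphabet_size (default 6) | `get_reduced_alphabet_composition()` |
| Motif Composition | len(motifs) (default 8) | `get_motif_composition()` |
| Amino Acid Pair Composition | 400 | `get_amino_acid_pair_composition()` |
| Aliphatic Index | 1 | `get_aliphatic_index()` |
| Extinction Coefficient | 2 | `get_extinction_coefficient()` |
| Boman Index | 1 | `get_boman_index()` |
| Aggregation Propensity | 2 | `get_aggregation_propensity()` |
| Hydrophobic Moment | 2 | `get_hydrophobic_moment()` |
| Shannon Entropy | 1 | `get_shannon_entropy()` |
| MoreauBroto Autocorrelation | lag × props (default 240) | `get_moreaubroto_autocorrelation()` |
| Moran Autocorrelation | lag × props (default 240) | `get_moran_autocorrelation()` |
| Geary Autocorrelation | lag × props (default 240) | `get_geary_autocorrelation()` |
| CTD | 147 | `get_ctd()` |
| CTD Composition | 21 | `get_ctd_composition()` |
| CTD Transition | 21 | `get_ctd_transition()` |
| CTD Distribution | 105 | `get_ctd_distribution()` |
| Conjoint Triad | 343 | `get_conjoint_triad()` |
| Sequence Order Coupling Number | lag or 2×lag | `get_sequence_order_coupling_number()` |
| Quasi Sequence Order | 20+lag or 2×(20+lag) | `get_quasi_sequence_order()` |
| Pseudo Amino Acid Composition | 20+λ | `get_pseudo_amino_acid_composition()` |
| Amphiphilic Pseudo Amino Acid Composition | 20+2λ | `get_amphiphilic_pseudo_amino_acid_composition()` |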
Utility Methods
get_all_descriptors()
Calculates every descriptor in sequence and returns a concatenated DataFrame of all features. Also exports to the descriptors_csv path if configured.
all_desc = desc.get_all_descriptors() # shape: (N, ~10572 with defaults)
get_descriptor_encoding(descriptor)
Resolves a descriptor name (with fuzzy matching) and returns its feature DataFrame. Useful when the descriptor name is read from config or supplied at runtime.
df = desc.get_descriptor_encoding("moran") # resolves to moran_autocorrelation
all_descriptors_list()
Returns the list of all 33 descriptor names.
validate_descriptors(descriptors)
Validates that all names in a list (or single string) are recognised descriptor names. Raises InvalidDescriptorError for any unknown names.
get_descriptor_info(name)
Returns a metadata dict for name including feature_count, group, and the associated get_* method.
reset_descriptors()
Clears all descriptor DataFrames back to an empty state, freeing memory without re-instantiating the class.
get_descriptor_columns(name)
Returns the column names of the calculated DataFrame for descriptor name.
Pre-calculated Descriptors
For any new dataset it is recommended to calculate all descriptors once and cache them to a CSV file, which is then loaded automatically on subsequent runs:
1. Set all_desc: 1 and descriptors_csv: "data/descriptors_<dataset>.csv" in the [descriptors] config section.
2. Run once; all descriptor values are calculated and written to the CSV.
3. On every subsequent run, the CSV is detected and imported automatically, with no recalculation required.
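The compute-once, load-afterwards pattern described in these steps can be sketched generically with the standard library. This is not pySAR's implementation: `load_or_compute` and `fake_compute` are hypothetical names, and the demo writes to a temporary directory rather than to data/.

```python
import csv
import tempfile
from pathlib import Path


def load_or_compute(csv_path, compute):
    """Load cached feature rows from csv_path, or compute and cache them."""
    path = Path(csv_path)
    if path.exists():
        # Cache hit: read the rows back and convert the values to floats.
        with path.open(newline="") as handle:
            return [
                {name: float(value) for name, value in row.items()}
                for row in csv.DictReader(handle)
            ]
    # Cache miss: run the expensive calculation once and write it out.
    rows = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return rows


calls = []


def fake_compute():
    """Stand-in for an expensive descriptor calculation."""
    calls.append("computed")
    return [{"A": 0.5, "C": 0.25}, {"A": 0.0, "C": 1.0}]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "descriptors_demo.csv"
    first = load_or_compute(cache, fake_compute)   # computes and writes the CSV
    second = load_or_compute(cache, fake_compute)  # loads from the CSV
```

The second call returns the same rows without invoking the calculation again, which is the behaviour the cached descriptors CSV is meant to provide.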
Pre-calculated descriptor CSVs for the bundled example datasets are included in
data/ and example_datasets/.
Config File
All descriptor parameters are set under the [descriptors] key in the pySAR config JSON:
```json
{
  "descriptors": {
    "descriptors_csv": "data/descriptors_thermostability.csv",
    "all_desc": 1,
    "descriptor": "amino_acid_composition",
    "moreaubroto_autocorrelation": {
      "lag": 30,
      "properties": ["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102",
                     "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"],
      "normalize": 0
    },
    "moran_autocorrelation": {
      "lag": 30,
      "properties": ["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102",
                     "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"],
      "normalize": 0
    },
    "geary_autocorrelation": {
      "lag": 30,
      "properties": ["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102",
                     "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"],
      "normalize": 0
    },
    "ctd": {
      "property": ["hydrophobicity", "volume", "polarity", "polarizability",
                   "charge", "secondaryStructure", "solventAccessibility"],
      "all": 1
    },
    "conjoint_triad": {},
    "sequence_order_coupling_number": {
      "lag": 30,
      "distance_matrix": ""
    },
    "quasi_sequence_order": {
      "lag": 30,
      "distance_matrix": ""
    },
    "pseudo_amino_acid_composition": {
      "lambda": 30,
      "weight": 0.05
    },
    "amphiphilic_pseudo_amino_acid_composition": {
      "lambda": 30,
      "weight": 0.05
    }
  }
}
```
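As a quick check on a config before a long run, the feature-count formulas given earlier can be evaluated directly from the JSON. The sketch below embeds a trimmed copy of the config as a string so that it is self-contained. It assumes that an empty distance_matrix string means no matrix was specified, so both matrices are used; `expected_feature_counts` is a hypothetical helper, not part of pySAR.

```python
import json

CONFIG_TEXT = """
{
  "descriptors": {
    "moran_autocorrelation": {
      "lag": 30,
      "properties": ["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102",
                     "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"]
    },
    "sequence_order_coupling_number": {"lag": 30, "distance_matrix": ""},
    "quasi_sequence_order": {"lag": 30, "distance_matrix": ""},
    "pseudo_amino_acid_composition": {"lambda": 30, "weight": 0.05},
    "amphiphilic_pseudo_amino_acid_composition": {"lambda": 30, "weight": 0.05}
  }
}
"""


def expected_feature_counts(config):
    """Apply the documented feature-count formulas to a parsed config dict."""
    section = config["descriptors"]
    counts = {}

    moran = section["moran_autocorrelation"]
    counts["moran_autocorrelation"] = moran["lag"] * len(moran["properties"])

    # Assumption: "" means no single matrix chosen, so two matrices are used.
    socn = section["sequence_order_coupling_number"]
    counts["sequence_order_coupling_number"] = socn["lag"] * (
        1 if socn["distance_matrix"] else 2
    )

    qso = section["quasi_sequence_order"]
    counts["quasi_sequence_order"] = (20 + qso["lag"]) * (
        1 if qso["distance_matrix"] else 2
    )

    paac = section["pseudo_amino_acid_composition"]
    counts["pseudo_amino_acid_composition"] = 20 + paac["lambda"]

    apaac = section["amphiphilic_pseudo_amino_acid_composition"]
    counts["amphiphilic_pseudo_amino_acid_composition"] = 20 + 2 * apaac["lambda"]
    return counts


feature_counts = expected_feature_counts(json.loads(CONFIG_TEXT))
```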
See the CONFIG.md file and the example config files for the full list of available parameters: