Descriptors

The Descriptors class (pySAR/descriptors.py) calculates a comprehensive set of physicochemical, biochemical, and structural protein descriptors. The 33 descriptors span the composition, autocorrelation, CTD, conjoint triad, sequence order, and pseudo amino acid composition groups, and together produce over 10,000 features when all are calculated with the default parameters.

Descriptors are calculated via protpy, a purpose-built open-source package for protein feature engineering. Input sequences must contain only the 20 canonical amino acids; gaps are stripped automatically on initialisation.

from pySAR.descriptors import Descriptors

desc = Descriptors(config_file="config/thermostability.json")

# calculate a single descriptor
aa_comp = desc.get_amino_acid_composition()   # shape: (N, 20)

# calculate all descriptors at once
all_desc = desc.get_all_descriptors()         # shape: (N, 10572+)

Instantiation

Descriptors.__init__(config_file, protein_seqs=None, n_jobs=1, **kwargs)

Parameter	Default	Description
`config_file`	—	Path to the JSON configuration file. The `.json` extension is appended automatically if omitted.
`protein_seqs`	`None`	Protein sequences as a `pd.Series` or a single string. If `None` or empty, sequences are loaded from the dataset path specified in the config.
`n_jobs`	`1`	Number of worker threads to use for parallel descriptor computation. Values of 0 or below are clamped to 1 (sequential). See Parallelism for details.
`**kwargs`	—	Keyword arguments (`dataset`, `descriptors_csv`) that override the corresponding config file values.

On construction the class:

Parses the config JSON and loads dataset/descriptor parameters.
Reads protein sequences from the dataset CSV if not directly supplied.
Strips gaps and validates all sequences against the 20 canonical amino acids.
Attempts to import pre-calculated descriptor values from the descriptors_csv path, if it exists.
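
The gap-stripping and validation step can be pictured with the following minimal sketch. It is not the package's own code: the function name clean_sequence, the choice of "-" as the gap character, and the upper-casing are assumptions made for illustration.

```python
# Hypothetical helper illustrating the cleaning step; not part of pySAR.
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 canonical amino acids


def clean_sequence(seq: str) -> str:
    """Strip gaps, upper-case, and validate against the canonical residues."""
    cleaned = seq.replace("-", "").upper()  # assumption: gaps are written as "-"
    invalid = sorted(set(cleaned) - CANONICAL)
    if invalid:
        raise ValueError(f"Invalid amino acids in sequence: {invalid}")
    return cleaned


print(clean_sequence("AC-DE-FG"))
```

A sequence containing any other letter (for example B, X, or Z) raises a ValueError in this sketch.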

Importing pre-calculated descriptors is strongly recommended for large datasets. Set all_desc: 1 in the [descriptors] config section on the first run to generate the CSV; subsequent runs then load from it directly without recalculating.

Parallelism

The n_jobs parameter controls how many threads the Descriptors class uses during computation. Two levels of parallelism are available:

Across descriptors (get_all_descriptors): When n_jobs > 1, the 33 descriptors are dispatched concurrently to a ThreadPoolExecutor. Each descriptor (e.g. amino acid composition, CTD) is computed in a separate thread, and results are merged once all futures complete. With n_jobs=1 the descriptors are computed sequentially in a fixed order.
Across sequences within a descriptor (_calculate_descriptor_batch): Individual descriptor methods (e.g. get_amino_acid_composition) iterate over every protein sequence. When n_jobs > 1, sequences are split across threads so that multiple sequences are processed simultaneously; when n_jobs=1 they are processed one at a time.

Both levels are active whenever n_jobs > 1, so their speed-ups can compound when calling get_all_descriptors on large datasets. The actual gain depends on how much of the per-sequence computation releases the GIL.

# sequential (default)
desc = Descriptors(config_file="config/thermostability.json", n_jobs=1)

# use 4 threads — faster on multi-core machines with many sequences
desc = Descriptors(config_file="config/thermostability.json", n_jobs=4)

all_desc = desc.get_all_descriptors()

Note

n_jobs uses Python threads (ThreadPoolExecutor), not processes, so it is most effective when the per-sequence computation releases the GIL (e.g. NumPy operations inside protpy). Values of 0 or below are silently clamped to 1. Pre-calculating descriptors to a CSV (see Pre-calculated Descriptors) remains the fastest option for repeated runs on the same dataset.
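
As a rough illustration of the thread-based dispatch described above, the sketch below clamps n_jobs and maps a set of descriptor functions over a thread pool. The function run_descriptors and the two toy descriptor functions are hypothetical stand-ins, not pySAR internals.

```python
from concurrent.futures import ThreadPoolExecutor


def run_descriptors(funcs, seqs, n_jobs=1):
    """Apply each descriptor function to every sequence, optionally in threads."""
    n_jobs = max(1, n_jobs)  # values of 0 or below fall back to sequential

    def compute(func):
        return [func(seq) for seq in seqs]

    if n_jobs == 1:
        results = [compute(func) for func in funcs.values()]
    else:
        with ThreadPoolExecutor(max_workers=n_jobs) as pool:
            # pool.map preserves input order, so results line up with the names
            results = list(pool.map(compute, funcs.values()))
    return dict(zip(funcs.keys(), results))


# Toy descriptor functions, used only for this example.
toy_funcs = {
    "length": len,
    "ala_fraction": lambda seq: seq.count("A") / len(seq),
}
print(run_descriptors(toy_funcs, ["AACD", "ACDE"], n_jobs=4))
```

Because the results are keyed by name and collected in submission order, the output is the same regardless of the number of threads.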

Descriptor Groups

Group	Descriptors
Composition	`amino_acid_composition`, `dipeptide_composition`, `tripeptide_composition`, `gravy`, `aromaticity`, `instability_index`, `isoelectric_point`, `molecular_weight`, `charge_distribution`, `hydrophobic_polar_charged_composition`, `secondary_structure_propensity`, `kmer_composition`, `reduced_alphabet_composition`, `motif_composition`, `amino_acid_pair_composition`, `aliphatic_index`, `extinction_coefficient`, `boman_index`, `aggregation_propensity`, `hydrophobic_moment`, `shannon_entropy`
Autocorrelation	`moreaubroto_autocorrelation`, `moran_autocorrelation`, `geary_autocorrelation`
CTD	`ctd`, `ctd_composition`, `ctd_transition`, `ctd_distribution`
Conjoint Triad	`conjoint_triad`
Sequence Order	`sequence_order_coupling_number`, `quasi_sequence_order`
Pseudo Composition	`pseudo_amino_acid_composition`, `amphiphilic_pseudo_amino_acid_composition`

Composition Descriptors

Composition descriptors capture the amino acid content and physicochemical properties of a sequence without considering positional information.

Amino Acid Composition

Method: get_amino_acid_composition() | Features: 20

The fraction of each of the 20 canonical amino acid types within a sequence:

\[\text{AAComp}(t) = \frac{AA(t)}{N}\]

where $AA(t)$ is the count of amino acid type $t$ and $N$ is the total sequence length.

aa_comp = desc.get_amino_acid_composition()   # shape: (N, 20)
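
For a single sequence, the formula can be sketched in plain Python as follows. This is an illustration of the definition rather than the protpy implementation, which may scale or round the values differently.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq: str) -> dict:
    """Fraction of each canonical amino acid in one sequence."""
    counts = Counter(seq)
    n = len(seq)
    return {aa: counts[aa] / n for aa in AMINO_ACIDS}


comp = amino_acid_composition("AACD")
print(comp["A"], comp["C"], comp["D"])
```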

Dipeptide Composition

Method: get_dipeptide_composition() | Features: 400 (20²)

The fraction of each of the 400 possible dipeptide types:

\[\text{DPComp}(s,t) = \frac{AA(s,t)}{N - 1}\]

where $AA(s,t)$ is the count of dipeptide type $(s, t)$ and $N-1$ is the total number of dipeptides in the sequence.

dp_comp = desc.get_dipeptide_composition()    # shape: (N, 400)
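
The same idea extends to overlapping residue pairs. The sketch below follows the formula above for one sequence and is illustrative only; the library's output scaling may differ.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dipeptide_composition(seq: str) -> dict:
    """Fraction of each of the 400 dipeptides among the N - 1 overlapping pairs."""
    total = len(seq) - 1
    if total < 1:
        raise ValueError("sequence must contain at least two residues")
    counts = {"".join(pair): 0 for pair in product(AMINO_ACIDS, repeat=2)}
    for i in range(total):
        counts[seq[i:i + 2]] += 1
    return {pair: count / total for pair, count in counts.items()}


dp = dipeptide_composition("AAAC")
print(dp["AA"], dp["AC"])
```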

Tripeptide Composition

Method: get_tripeptide_composition() | Features: 8000 (20³)

The fraction of each of the 8,000 possible tripeptide types. Computationally expensive on large datasets; pre-calculation and CSV caching are recommended.

tp_comp = desc.get_tripeptide_composition()   # shape: (N, 8000)

GRAVY

Method: get_gravy() | Features: 1

The Grand Average of Hydropathy (GRAVY) is the mean Kyte-Doolittle hydropathy score across all residues. Positive values indicate overall hydrophobicity; negative values indicate hydrophilicity.

gravy = desc.get_gravy()   # shape: (N, 1)
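
A minimal sketch of the calculation for one sequence, using the published Kyte-Doolittle hydropathy values. The function name gravy is chosen for the example and is not the library call.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq: str) -> float:
    """Mean hydropathy over all residues of one sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


print(gravy("ILV") > 0, gravy("DEK") < 0)
```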

Aromaticity

Method: get_aromaticity() | Features: 1

Fraction of aromatic residues (F, W, Y, H) in the sequence.

arom = desc.get_aromaticity()   # shape: (N, 1)

Instability Index

Method: get_instability_index() | Features: 1

Computed from dipeptide instability weight values (DIWV). A value below 40 indicates a stable protein; 40 or above suggests instability.

ii = desc.get_instability_index()   # shape: (N, 1)

Isoelectric Point

Method: get_isoelectric_point() | Features: 1

The estimated pH at which the protein carries no net charge, calculated iteratively using standard pK_a values for ionisable residues.

pi = desc.get_isoelectric_point()   # shape: (N, 1)

Molecular Weight

Method: get_molecular_weight() | Features: 1

Average molecular weight (Da) calculated from residue masses, corrected for water lost at each peptide bond.

mw = desc.get_molecular_weight()   # shape: (N, 1)

Charge Distribution

Method: get_charge_distribution() | Features: 3

Positive, negative, and net charge contributions of ionisable residues at a specified pH using the Henderson-Hasselbalch equation (default pH 7.4). Output columns: PositiveCharge, NegativeCharge, NetCharge.

Config parameter: charge_distribution.ph (default 7.4).

charge = desc.get_charge_distribution()   # shape: (N, 3)

Hydrophobic/Polar/Charged Composition

Method: get_hydrophobic_polar_charged_composition() | Features: 3

Percentage of residues belonging to each of three physicochemical groups:

Hydrophobic: A, C, F, I, L, M, V, W, Y
Polar: G, N, Q, S, T
Charged: D, E, H, K, R

Output columns: Hydrophobic, Polar, Charged.

hpc = desc.get_hydrophobic_polar_charged_composition()   # shape: (N, 3)
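
The grouping above can be sketched as follows for one sequence. The treatment of residues that appear in none of the three listed groups (they count towards the sequence length only) is an assumption of this sketch.

```python
# Residue groups exactly as listed above.
GROUPS = {
    "Hydrophobic": "ACFILMVWY",
    "Polar": "GNQST",
    "Charged": "DEHKR",
}


def hpc_composition(seq: str) -> dict:
    """Percentage of residues falling in each physicochemical group."""
    n = len(seq)
    return {
        name: 100.0 * sum(seq.count(aa) for aa in members) / n
        for name, members in GROUPS.items()
    }


print(hpc_composition("AADK"))
```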

Secondary Structure Propensity

Method: get_secondary_structure_propensity() | Features: 3

Average Chou-Fasman propensity values for alpha-helix, beta-sheet, and random-coil conformations across all residues. Output columns: Helix, Sheet, Coil.

ssp = desc.get_secondary_structure_propensity()   # shape: (N, 3)

k-mer Composition

Method: get_kmer_composition() | Features: 20^k (default 400)

Frequency of all possible k-length residue subsequences expressed as a percentage of total k-mers. Config parameter: kmer_composition.k (default 2, producing 400 features).

kmer = desc.get_kmer_composition()   # shape: (N, 400) with k=2
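
To make the 20^k feature count concrete, here is a small sketch that enumerates every possible k-mer and reports each as a percentage of the k-mers in one sequence. It is an illustration, not the library's implementation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition(seq: str, k: int = 2) -> dict:
    """Percentage of each possible k-mer among the overlapping k-mers of seq."""
    total = len(seq) - k + 1
    if total < 1:
        raise ValueError("sequence is shorter than k")
    counts = {"".join(kmer): 0 for kmer in product(AMINO_ACIDS, repeat=k)}
    for i in range(total):
        counts[seq[i:i + k]] += 1
    return {kmer: 100.0 * count / total for kmer, count in counts.items()}


print(len(kmer_composition("ACDA", k=2)), len(kmer_composition("ACDA", k=3)))
```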

Reduced Alphabet Composition

Method: get_reduced_alphabet_composition() | Features: alphabet_size (default 6)

Amino acid composition after mapping residues to a reduced set of physicochemical groups. Supported alphabet sizes: 2, 3, 4, 6. Config parameter: reduced_alphabet_composition.alphabet_size (default 6).

rac = desc.get_reduced_alphabet_composition()   # shape: (N, 6)

Motif Composition

Method: get_motif_composition() | Features: number of motifs (default 8)

Count of occurrences (including overlapping) of predefined biological sequence motifs, matched by regular expression. Uses 8 built-in motifs by default; a custom name → pattern dict can be supplied via motif_composition.motifs in config.

motif = desc.get_motif_composition()   # shape: (N, 8)
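
Counting overlapping regular-expression matches needs a little care, because re.findall on a plain pattern skips overlaps. One common approach is a zero-width lookahead, sketched below. The two motifs in the example dict are illustrative choices, not the package's built-in set.

```python
import re


def count_motif(seq: str, pattern: str) -> int:
    """Count matches of pattern in seq, including overlapping ones."""
    # A zero-width lookahead lets the scan advance one position at a time.
    lookahead = re.compile(f"(?=(?:{pattern}))")
    return sum(1 for _ in lookahead.finditer(seq))


# Hypothetical name -> pattern dict for the example.
example_motifs = {
    "N_glycosylation": "N[^P][ST][^P]",
    "double_alanine": "AA",
}

seq = "NGSANGTAAAA"
print({name: count_motif(seq, pat) for name, pat in example_motifs.items()})
```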

Amino Acid Pair Composition

Method: get_amino_acid_pair_composition() | Features: 400

Frequency of all 400 residue-pair combinations, with column names annotated by the physicochemical class of each residue.

pair = desc.get_amino_acid_pair_composition()   # shape: (N, 400)

Aliphatic Index

Method: get_aliphatic_index() | Features: 1

Relative volume occupied by aliphatic side chains (Ala, Val, Ile, Leu). Higher values are associated with greater thermostability.

ai = desc.get_aliphatic_index()   # shape: (N, 1)
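
The aliphatic index is commonly computed with Ikai's formula, AI = X(Ala) + 2.9 × X(Val) + 3.9 × (X(Ile) + X(Leu)), where X is the mole percentage of each residue. The sketch below assumes this standard formula; whether the package applies exactly the same coefficients is not stated here.

```python
def aliphatic_index(seq: str) -> float:
    """Aliphatic index of one sequence using Ikai's coefficients."""
    n = len(seq)
    mole_pct = {aa: 100.0 * seq.count(aa) / n for aa in "AVIL"}
    return mole_pct["A"] + 2.9 * mole_pct["V"] + 3.9 * (mole_pct["I"] + mole_pct["L"])


print(aliphatic_index("AAAA"))
```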

Extinction Coefficient

Method: get_extinction_coefficient() | Features: 2

Molar extinction coefficient at 280 nm derived from the number of Trp (W), Tyr (Y), and Cys (C) residues. Reported for both reduced and oxidised states. Output columns: ExtCoeff_Reduced, ExtCoeff_Oxidized.

ec = desc.get_extinction_coefficient()   # shape: (N, 2)
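
A sketch of the calculation, assuming the widely used molar coefficients at 280 nm (Trp 5500, Tyr 1490, and 125 per cystine, i.e. per pair of Cys residues). These constants are an assumption about the underlying values; the package may use slightly different ones.

```python
def extinction_coefficient(seq: str) -> tuple:
    """Return (reduced, oxidized) molar extinction coefficients at 280 nm."""
    reduced = 5500 * seq.count("W") + 1490 * seq.count("Y")
    # In the oxidised state every pair of Cys residues is assumed to form a cystine.
    oxidized = reduced + 125 * (seq.count("C") // 2)
    return reduced, oxidized


print(extinction_coefficient("WYCC"))
```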

Boman Index

Method: get_boman_index() | Features: 1

Sum of residue solubility values divided by sequence length. Predicts potential for protein-protein interactions.

boman = desc.get_boman_index()   # shape: (N, 1)

Aggregation Propensity

Method: get_aggregation_propensity() | Features: 2

Identifies aggregation-prone regions via a sliding-window approach combining Kyte-Doolittle hydrophobicity and charge neutrality. Output columns: AggregProneRegions (count of qualifying windows) and AggregProneFraction (fraction of sequence covered). Config parameters: aggregation_propensity.window (default 5), .hydrophobicity_threshold (default 2.0), .charge_threshold (default 1).

agg = desc.get_aggregation_propensity()   # shape: (N, 2)

Hydrophobic Moment

Method: get_hydrophobic_moment() | Features: 2

Mean and maximum hydrophobic moment across sliding windows using the Eisenberg hydrophobicity scale and a helical-wheel projection, capturing amphipathicity. Output columns: HydrophobicMoment_Mean, HydrophobicMoment_Max. Config parameters: hydrophobic_moment.window (default 11), .angle (default 100).

hm = desc.get_hydrophobic_moment()   # shape: (N, 2)

Shannon Entropy

Method: get_shannon_entropy() | Features: 1

An information-theoretic measure of amino acid diversity:

\[H = -\sum_{i=1}^{20} p_i \log_2 p_i\]

A value of 0 indicates a completely repetitive sequence; the theoretical maximum of ~4.322 bits corresponds to a perfectly uniform distribution across all 20 amino acids.

se = desc.get_shannon_entropy()   # shape: (N, 1)
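
The formula translates directly into a few lines of Python. The sketch below works on one sequence and only sums over residues that actually occur, since terms with zero probability contribute nothing.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the residue distribution of one sequence."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


print(shannon_entropy("AACC"))
```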

Autocorrelation Descriptors

Autocorrelation descriptors describe the level of correlation between two positions in a sequence separated by a lag distance $d$, in terms of a specified physicochemical property. Each of the three variants uses a different mathematical formulation. By default, 8 physicochemical properties are used with a lag of 30, generating 240 features per descriptor.

Default properties (8):

AAIndex Accession	Property
CIDH920105	Normalised Average Hydrophobicity
BHAR880101	Average Flexibility Indices
CHAM820101	Polarizability Parameter
CHAM820102	Free Energy of Solution in Water (kcal/mol)
CHOC760101	Residue Accessible Surface Area in Tripeptide
BIGC670101	Residue Volume
CHAM810101	Steric Parameter
DAYM780201	Relative Mutability

Config parameters common to all three descriptors: lag (default 30), properties (list of AAIndex accession numbers), normalize (bool).

Feature count formula: lag × len(properties) → default 30 × 8 = 240.

MoreauBroto Autocorrelation

Method: get_moreaubroto_autocorrelation() | Features: lag × properties (default 240)

Uses the raw property values of two residues separated by lag $d$:

\[\text{MBAuto}(d) = \sum_{i=1}^{N-d} P_i \cdot P_{i+d}\]

Config section: [moreaubroto_autocorrelation].

mb = desc.get_moreaubroto_autocorrelation()   # shape: (N, 240)
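
The sum can be sketched for one property and one sequence as below. The property values in TOY_PROPERTY are made up for the example (a real run would use an AAIndex scale), and no normalisation is applied.

```python
# Toy property values, invented for this example only.
TOY_PROPERTY = {"A": 1.0, "C": 2.0, "D": 3.0}


def moreaubroto(seq: str, prop: dict, d: int) -> float:
    """Sum of products of property values for residues d positions apart."""
    values = [prop[aa] for aa in seq]
    return sum(values[i] * values[i + d] for i in range(len(values) - d))


# One feature per lag value from 1 up to the chosen lag.
features = [moreaubroto("ACDA", TOY_PROPERTY, d) for d in range(1, 3)]
print(features)
```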

Moran Autocorrelation

Method: get_moran_autocorrelation() | Features: lag × properties (default 240)

Uses normalised deviations from the mean property value:

\[\text{MAuto}(d) = \frac{\frac{1}{N-d}\sum_{i=1}^{N-d}(P_i - \bar{P})(P_{i+d} - \bar{P})}{\frac{1}{N}\sum_{i=1}^{N}(P_i - \bar{P})^2}\]

Config section: [moran_autocorrelation].

moran = desc.get_moran_autocorrelation()   # shape: (N, 240)
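
A direct transcription of the formula for one property and one sequence is sketched below. The toy property values are invented for the example, and the sketch does not guard against a zero denominator, which occurs when every residue has the same property value.

```python
# Toy property values, invented for this example only.
TOY_PROPERTY = {"A": 1.0, "C": 2.0, "D": 3.0}


def moran(seq: str, prop: dict, d: int) -> float:
    """Moran autocorrelation at lag d for one property."""
    values = [prop[aa] for aa in seq]
    n = len(values)
    mean = sum(values) / n
    numerator = sum(
        (values[i] - mean) * (values[i + d] - mean) for i in range(n - d)
    ) / (n - d)
    denominator = sum((v - mean) ** 2 for v in values) / n
    return numerator / denominator


print(moran("ACDA", TOY_PROPERTY, 1))
```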

Geary Autocorrelation

Method: get_geary_autocorrelation() | Features: lag × properties (default 240)

Uses squared differences between residue property values:

\[\text{GAuto}(d) = \frac{\frac{1}{2(N-d)}\sum_{i=1}^{N-d}(P_i - P_{i+d})^2}{\frac{1}{N-1}\sum_{i=1}^{N}(P_i - \bar{P})^2}\]

Config section: [geary_autocorrelation].

geary = desc.get_geary_autocorrelation()   # shape: (N, 240)
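
The formula can be transcribed for one property and one sequence as follows. As with the other autocorrelation sketches, the property values are invented for the example and a constant property (zero variance) is not handled.

```python
# Toy property values, invented for this example only.
TOY_PROPERTY = {"A": 1.0, "C": 2.0, "D": 3.0}


def geary(seq: str, prop: dict, d: int) -> float:
    """Geary autocorrelation at lag d for one property."""
    values = [prop[aa] for aa in seq]
    n = len(values)
    mean = sum(values) / n
    numerator = sum(
        (values[i] - values[i + d]) ** 2 for i in range(n - d)
    ) / (2 * (n - d))
    denominator = sum((v - mean) ** 2 for v in values) / (n - 1)
    return numerator / denominator


print(geary("ACDA", TOY_PROPERTY, 1))
```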

CTD Descriptors

CTD describes the amino acid composition within seven physicochemical property classes (hydrophobicity, volume, polarity, polarisability, charge, secondary structure, solvent accessibility). Each property divides the 20 amino acids into three classes (C1, C2, C3), from which three sub-descriptors are computed.

Using all 7 properties generates 147 features (21 per property). A subset of properties can be specified via ctd.property in config.

CTD (Combined)

Method: get_ctd() | Features: 147 (all 7 properties)

Contains all CTD sub-descriptors concatenated: Composition + Transition + Distribution.

ctd = desc.get_ctd()   # shape: (N, 147)

CTD Composition

Method: get_ctd_composition() | Features: 3 per property (21 total)

Fraction of residues in each of the three classes (C1, C2, C3) for each property.

ctd_c = desc.get_ctd_composition()   # shape: (N, 21)

CTD Transition

Method: get_ctd_transition() | Features: 3 per property (21 total)

Fraction of transitions between pairs of property classes in the sequence (C1↔C2, C1↔C3, C2↔C3).

ctd_t = desc.get_ctd_transition()   # shape: (N, 21)

CTD Distribution

Method: get_ctd_distribution() | Features: 15 per property (105 total)

For each class, the sequence positions (as percentages of sequence length) of the residues marking the first occurrence and 25%, 50%, 75%, and 100% of the occurrences of that class, capturing how each property class is distributed along the sequence.

ctd_d = desc.get_ctd_distribution()   # shape: (N, 105)
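
The three sub-descriptors can be sketched for a single property. The example below uses the commonly cited three-class hydrophobicity grouping (polar RKEDQN, neutral GASTPHY, hydrophobic CLVIMFW); the rounding rule used to pick the 25%, 50%, and 75% residues is an assumption of this sketch and may differ from the library.

```python
import math

# Hydrophobicity classes: 1 = polar, 2 = neutral, 3 = hydrophobic.
CLASSES = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}
LOOKUP = {aa: label for label, members in CLASSES.items() for aa in members}


def encode(seq: str) -> str:
    """Replace each residue with its class label."""
    return "".join(LOOKUP[aa] for aa in seq)


def ctd_composition(seq: str) -> list:
    enc = encode(seq)
    return [enc.count(c) / len(enc) for c in "123"]


def ctd_transition(seq: str) -> list:
    enc = encode(seq)
    pairs = [enc[i:i + 2] for i in range(len(enc) - 1)]
    result = []
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        hits = sum(1 for p in pairs if p in (a + b, b + a))
        result.append(hits / len(pairs))
    return result


def ctd_distribution(seq: str) -> list:
    enc = encode(seq)
    result = []
    for c in "123":
        positions = [i + 1 for i, label in enumerate(enc) if label == c]
        if not positions:
            result.extend([0.0] * 5)
            continue
        n = len(positions)
        # First occurrence, then the occurrences at 25/50/75/100% of the class total.
        picks = [1] + [max(1, math.floor(n * q)) for q in (0.25, 0.5, 0.75, 1.0)]
        result.extend(100.0 * positions[k - 1] / len(enc) for k in picks)
    return result


seq = "RGCRGC"
features = ctd_composition(seq) + ctd_transition(seq) + ctd_distribution(seq)
print(len(features))
```

The 3 + 3 + 15 values add up to the 21 features per property quoted above.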

Conjoint Triad

Method: get_conjoint_triad() | Features: 343 (7³)

Describes the neighbourhood environment of each residue by considering triplets of adjacent residues, each residue grouped into one of 7 physicochemical classes. The frequency of each of the 7³ = 343 possible triplet combinations is computed.

ct = desc.get_conjoint_triad()   # shape: (N, 343)
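
A sketch of the encoding step, assuming the seven residue classes usually used for the conjoint triad method ({A,G,V}, {I,L,F,P}, {Y,M,T,S}, {H,N,Q,W}, {R,K}, {D,E}, {C}). It returns raw triad counts; any normalisation the library applies afterwards is omitted.

```python
from itertools import product

# Seven residue classes (assumed grouping for this sketch).
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
LOOKUP = {aa: str(i) for i, members in enumerate(GROUPS, start=1) for aa in members}


def conjoint_triad(seq: str) -> dict:
    """Count each of the 343 class triplets over overlapping windows of three."""
    enc = "".join(LOOKUP[aa] for aa in seq)
    counts = {"".join(t): 0 for t in product("1234567", repeat=3)}
    for i in range(len(enc) - 2):
        counts[enc[i:i + 3]] += 1
    return counts


triads = conjoint_triad("AGVC")
print(len(triads), triads["111"], triads["117"])
```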

Sequence Order Descriptors

Sequence Order Coupling Number

Method: get_sequence_order_coupling_number() | Features: lag or 2 × lag

Captures long-range interactions along the sequence. For each distance $d$ from 1 up to the specified lag, the squared physicochemical distances between all residue pairs $d$ positions apart are summed. If a single distance matrix is given in config, lag features are produced; if no matrix is specified, both the Schneider-Wrede and Grantham matrices are used, producing 2 × lag features.

Config section: [sequence_order_coupling_number], params: lag, distance_matrix.

socn = desc.get_sequence_order_coupling_number()
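
The coupling numbers can be sketched for one sequence as below. A real calculation looks up residue-pair distances in the Schneider-Wrede or Grantham matrix; here toy_distance is a made-up stand-in based on invented property values, and no normalisation is applied.

```python
# Invented values used only to build a toy pairwise distance.
TOY_VALUES = {"A": 0.0, "C": 1.0, "D": 3.0}


def toy_distance(a: str, b: str) -> float:
    """Hypothetical stand-in for a distance-matrix lookup."""
    return abs(TOY_VALUES[a] - TOY_VALUES[b])


def socn(seq: str, distance, lag: int) -> list:
    """One coupling number per distance d = 1..lag."""
    return [
        sum(distance(seq[i], seq[i + d]) ** 2 for i in range(len(seq) - d))
        for d in range(1, lag + 1)
    ]


print(socn("ACDA", toy_distance, lag=2))
```

With two distance matrices the same function would simply be called once per matrix and the results concatenated, giving 2 × lag values.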

Quasi Sequence Order

Method: get_quasi_sequence_order() | Features: 20 + lag or 2 × (20 + lag)

Extends amino acid composition with sequence-order correlation factors derived from pairwise residue distance matrices. Feature count: 20 + lag with one distance matrix, or 2 × (20 + lag) when both Schneider-Wrede and Grantham matrices are used.

Config section: [quasi_sequence_order], params: lag, distance_matrix.

qso = desc.get_quasi_sequence_order()

Pseudo Amino Acid Composition

Pseudo Amino Acid Composition (Type 1)

Method: get_pseudo_amino_acid_composition() | Features: 20 + lambda

Augments amino acid composition (20 features) with lambda sequence-order correlation factors (correlation along the chain at lags 1 through lambda), capturing both composition and sequence-order information. Config section: [pseudo_amino_acid_composition], params: lambda, weight.

paac = desc.get_pseudo_amino_acid_composition()   # shape: (N, 20+lambda)

Amphiphilic Pseudo Amino Acid Composition (Type 2)

Method: get_amphiphilic_pseudo_amino_acid_composition() | Features: 20 + 2 × lambda

Extends PseAAC Type 1 by adding separate hydrophobicity and hydrophilicity correlation factors for each lag, producing 20 + 2 × lambda features. Designed to capture amphipathic patterns. Config section: [amphiphilic_pseudo_amino_acid_composition], params: lambda, weight.

apaac = desc.get_amphiphilic_pseudo_amino_acid_composition()   # shape: (N, 20+(2*lambda))

All Descriptors Summary

Descriptor	Features	Method
Amino Acid Composition	20	`get_amino_acid_composition()`
Dipeptide Composition	400	`get_dipeptide_composition()`
Tripeptide Composition	8000	`get_tripeptide_composition()`
GRAVY	1	`get_gravy()`
Aromaticity	1	`get_aromaticity()`
Instability Index	1	`get_instability_index()`
Isoelectric Point	1	`get_isoelectric_point()`
Molecular Weight	1	`get_molecular_weight()`
Charge Distribution	3	`get_charge_distribution()`
Hydrophobic/Polar/Charged Composition	3	`get_hydrophobic_polar_charged_composition()`
Secondary Structure Propensity	3	`get_secondary_structure_propensity()`
k-mer Composition	20^k (default 400)	`get_kmer_composition()`
Reduced Alphabet Composition	alphabet_size (default 6)	`get_reduced_alphabet_composition()`
Motif Composition	len(motifs) (default 8)	`get_motif_composition()`
Amino Acid Pair Composition	400	`get_amino_acid_pair_composition()`
Aliphatic Index	1	`get_aliphatic_index()`
Extinction Coefficient	2	`get_extinction_coefficient()`
Boman Index	1	`get_boman_index()`
Aggregation Propensity	2	`get_aggregation_propensity()`
Hydrophobic Moment	2	`get_hydrophobic_moment()`
Shannon Entropy	1	`get_shannon_entropy()`
MoreauBroto Autocorrelation	lag × props (default 240)	`get_moreaubroto_autocorrelation()`
Moran Autocorrelation	lag × props (default 240)	`get_moran_autocorrelation()`
Geary Autocorrelation	lag × props (default 240)	`get_geary_autocorrelation()`
CTD	147	`get_ctd()`
CTD Composition	21	`get_ctd_composition()`
CTD Transition	21	`get_ctd_transition()`
CTD Distribution	105	`get_ctd_distribution()`
Conjoint Triad	343	`get_conjoint_triad()`
Sequence Order Coupling Number	lag or 2×lag	`get_sequence_order_coupling_number()`
Quasi Sequence Order	20+lag or 2×(20+lag)	`get_quasi_sequence_order()`
Pseudo Amino Acid Composition	20+λ	`get_pseudo_amino_acid_composition()`
Amphiphilic Pseudo Amino Acid Composition	20+2λ	`get_amphiphilic_pseudo_amino_acid_composition()`

Utility Methods

get_all_descriptors()

Calculates every descriptor and returns a concatenated DataFrame of all features. Also exports to the descriptors_csv path if configured. When n_jobs > 1 the descriptors are computed concurrently; see Parallelism.

all_desc = desc.get_all_descriptors()   # shape: (N, ~10572 with defaults)

get_descriptor_encoding(descriptor)

Resolves a descriptor name (with fuzzy matching) and returns its feature DataFrame. Useful when the descriptor name is read from config or supplied at runtime.

df = desc.get_descriptor_encoding("moran")   # resolves to moran_autocorrelation

all_descriptors_list()

Returns the list of all 33 descriptor names.

validate_descriptors(descriptors)

Validates that all names in a list (or single string) are recognised descriptor names. Raises InvalidDescriptorError for any unknown names.

get_descriptor_info(name)

Returns a metadata dict for name including feature_count, group, and the associated get_* method.

reset_descriptors()

Clears all descriptor DataFrames back to empty state, freeing memory without re-instantiating the class.

get_descriptor_columns(name)

Returns the column names of the calculated DataFrame for descriptor name.

Pre-calculated Descriptors

For any new dataset it is recommended to calculate all descriptors once and cache them to a CSV file, which is then loaded automatically on subsequent runs:

Set all_desc: 1 and descriptors_csv: "data/descriptors_<dataset>.csv" in the [descriptors] config section.
Run once — all descriptor values are calculated and written to the CSV.
On every subsequent run, the CSV is detected and imported automatically — no recalculation required.

Pre-calculated descriptor CSVs for the bundled example datasets are included in data/ and example_datasets/.
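
The load-or-compute pattern behind this workflow can be sketched with the standard library alone. The helper load_or_compute and the compute callback are hypothetical names for the example; the package handles this internally through the descriptors_csv setting.

```python
import csv
import os
import tempfile


def load_or_compute(csv_path, compute):
    """Load features from csv_path if it exists, else compute and cache them."""
    if os.path.isfile(csv_path):
        with open(csv_path, newline="") as fh:
            rows = list(csv.reader(fh))
        header, data = rows[0], rows[1:]
        return header, [[float(v) for v in row] for row in data]
    header, data = compute()
    with open(csv_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(data)
    return header, data


def expensive_calculation():
    # Placeholder for a real descriptor calculation.
    return ["f1", "f2"], [[0.5, 1.0], [0.25, 2.0]]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "descriptors_example.csv")
    first_run = load_or_compute(path, expensive_calculation)   # computes and writes
    second_run = load_or_compute(path, expensive_calculation)  # reads the cached CSV
    print(first_run == second_run)
```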

Config File

All descriptor parameters are set under the [descriptors] key in the pySAR config JSON:

{
    "descriptors": {
        "descriptors_csv": "data/descriptors_thermostability.csv",
        "all_desc": 1,
        "descriptor": "amino_acid_composition",
        "moreaubroto_autocorrelation": {
            "lag": 30,
            "properties": ["CIDH920105","BHAR880101","CHAM820101","CHAM820102",
                           "CHOC760101","BIGC670101","CHAM810101","DAYM780201"],
            "normalize": 0
        },
        "moran_autocorrelation": {
            "lag": 30,
            "properties": ["CIDH920105","BHAR880101","CHAM820101","CHAM820102",
                           "CHOC760101","BIGC670101","CHAM810101","DAYM780201"],
            "normalize": 0
        },
        "geary_autocorrelation": {
            "lag": 30,
            "properties": ["CIDH920105","BHAR880101","CHAM820101","CHAM820102",
                           "CHOC760101","BIGC670101","CHAM810101","DAYM780201"],
            "normalize": 0
        },
        "ctd": {
            "property": ["hydrophobicity","volume","polarity","polarizability",
                         "charge","secondaryStructure","solventAccessibility"],
            "all": 1
        },
        "conjoint_triad": {},
        "sequence_order_coupling_number": {
            "lag": 30,
            "distance_matrix": ""
        },
        "quasi_sequence_order": {
            "lag": 30,
            "distance_matrix": ""
        },
        "pseudo_amino_acid_composition": {
            "lambda": 30,
            "weight": 0.05
        },
        "amphiphilic_pseudo_amino_acid_composition": {
            "lambda": 30,
            "weight": 0.05
        }
    }
}
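
As a quick sanity check on a config, the feature-count formulas given earlier can be applied to the parsed JSON. The sketch below uses a trimmed copy of the config above; the helper expected_feature_counts is a hypothetical name and not part of the package.

```python
import json

CONFIG_TEXT = """
{
    "descriptors": {
        "moran_autocorrelation": {
            "lag": 30,
            "properties": ["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102",
                           "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"]
        },
        "pseudo_amino_acid_composition": {"lambda": 30},
        "amphiphilic_pseudo_amino_acid_composition": {"lambda": 30}
    }
}
"""


def expected_feature_counts(config: dict) -> dict:
    """Apply the documented feature-count formulas to a parsed config."""
    section = config["descriptors"]
    moran = section["moran_autocorrelation"]
    paac = section["pseudo_amino_acid_composition"]
    apaac = section["amphiphilic_pseudo_amino_acid_composition"]
    return {
        "moran_autocorrelation": moran["lag"] * len(moran["properties"]),
        "pseudo_amino_acid_composition": 20 + paac["lambda"],
        "amphiphilic_pseudo_amino_acid_composition": 20 + 2 * apaac["lambda"],
    }


counts = expected_feature_counts(json.loads(CONFIG_TEXT))
print(counts)
```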

See the CONFIG.md file and the example config files for the full list of available parameters: