Digital Signal Processing
The PyDSP class (pySAR/pyDSP.py) transforms numerically encoded protein sequences
into frequency-domain spectral features using the Fast Fourier Transform (FFT). These spectral
features can then be used directly as inputs to a regression model, enabling the
AAI + DSP encoding strategy in pySAR.
Note
PyDSP operates on numerically encoded sequences, not raw amino acid strings.
Each sequence must first be encoded via an AAIndexEncoding step before being passed
to PyDSP.
Pipeline Overview
The DSP encoding pipeline follows these steps:
1. Numerical encoding: protein sequences are encoded with physico-chemical AAI indices, producing a 2-D array of shape (num_sequences, signal_len).
2. Pre-processing: sequences are zero-padded to a uniform length; inf and NaN values are replaced with 0.
3. Window function (optional): a window is multiplied element-wise with each encoded sequence before the FFT to reduce spectral leakage.
4. FFT: scipy.fft.fft is applied to each sequence independently.
5. Filter function (optional): a smoothing or analytic filter is applied to the FFT output of each sequence.
6. Spectrum extraction: the desired spectral representation (power, absolute, real, or imaginary) is extracted from the complex FFT output.

The resulting spectrum_encoding array (shape (num_sequences, signal_len)) is the
feature matrix used for model training.
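
The steps above can be sketched with NumPy and SciPy. This is a minimal illustration, not pySAR's implementation: the helper name `dsp_encode` and its arguments are hypothetical, the input is assumed to be an already zero-padded 2-D array, the optional filter step is left out, and the spectrum definitions follow the Spectra section of this page.

```python
import numpy as np
from scipy.fft import fft
from scipy.signal import get_window


def dsp_encode(encoded, window_type=None, spectrum="power"):
    """Hypothetical helper: clean, window, FFT, and extract one spectrum."""
    encoded = np.asarray(encoded, dtype=float)
    # Pre-processing: replace inf/NaN with 0 so the FFT output stays finite.
    encoded = np.nan_to_num(encoded, nan=0.0, posinf=0.0, neginf=0.0)
    signal_len = encoded.shape[1]

    # Optional window; a vector of ones leaves the signal unchanged.
    if window_type:
        window = get_window(window_type, signal_len)
    else:
        window = np.ones(signal_len)

    # FFT of each row (one row per sequence), computed independently.
    transformed = fft(encoded * window, axis=1)

    spectra = {
        "power": np.abs(transformed),
        "absolute": np.abs(transformed) / signal_len,
        "real": transformed.real,
        "imaginary": transformed.imag,
    }
    return spectra[spectrum]


# Made-up encoded values, including one NaN and one inf to exercise the cleaning step.
encoded_seqs = np.array([[0.62, -0.5, 1.8, np.nan],
                         [0.29, np.inf, -0.9, 2.5]])
features = dsp_encode(encoded_seqs, window_type="hamming", spectrum="power")
print(features.shape)  # (2, 4)
```

The output keeps the input shape, which is why the feature matrix has one column per position of the padded signal.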
Instantiation
PyDSP.__init__(config_file, protein_seqs, **kwargs)
from pySAR.pyDSP import PyDSP

# encoded_seqs is the numerically encoded (AAI) sequence array described in the note above

# from a config file path
dsp = PyDSP(config_file="config/thermostability.json", protein_seqs=encoded_seqs)

# from a config dict
dsp = PyDSP(config_file={"pyDSP": {"spectrum": "power", "window_type": "hamming"}},
            protein_seqs=encoded_seqs)

# via keyword arguments (override or replace config values)
dsp = PyDSP(config_file=None, protein_seqs=encoded_seqs,
            spectrum="absolute", window_type="blackman")

All [pyDSP] config keys are also accepted as keyword arguments and will override
any value read from the config file.
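
The precedence rule can be pictured as a dictionary merge in which keyword arguments are applied last. The helper below is a hypothetical sketch of that rule only; `resolve_dsp_params` is not a pySAR function, and how PyDSP resolves its parameters internally is not shown on this page.

```python
def resolve_dsp_params(config=None, **kwargs):
    """Hypothetical helper: start from config["pyDSP"], then apply kwargs."""
    params = dict((config or {}).get("pyDSP", {}))
    params.update(kwargs)  # keyword arguments take precedence over config values
    return params


config = {"pyDSP": {"spectrum": "power", "window_type": "hamming"}}
params = resolve_dsp_params(config, spectrum="absolute")
print(params)  # {'spectrum': 'absolute', 'window_type': 'hamming'}
```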
Pre-processing
Calling pre_processing() prepares the encoded sequences for the FFT:
- Zero-pads all sequences to the length of the longest sequence in the dataset (signal_len), ensuring a uniform array shape.
- Replaces any inf or NaN values with 0 to prevent FFT errors.
- Initialises the fft_power, fft_real, fft_imag, and fft_abs zero arrays, ready to receive results from encode_sequences().

pre_processing() is called automatically before encode_sequences().
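
As a rough illustration of the padding and cleaning described above, the following sketch pads a ragged list of encoded sequences and zeroes any non-finite values. The function name `pad_and_clean` and the sample values are made up for this example; it is not the pySAR source.

```python
import numpy as np


def pad_and_clean(encoded_seqs):
    """Hypothetical helper: zero-pad to the longest sequence, zero out inf/NaN."""
    signal_len = max(len(seq) for seq in encoded_seqs)
    padded = np.zeros((len(encoded_seqs), signal_len), dtype=float)
    for row, seq in enumerate(encoded_seqs):
        padded[row, :len(seq)] = seq  # the tail of shorter sequences stays 0
    return np.nan_to_num(padded, nan=0.0, posinf=0.0, neginf=0.0)


seqs = [[0.62, -0.5, np.nan], [1.8, np.inf, 0.29, -0.9, 2.5]]
padded = pad_and_clean(seqs)

# Result buffers with the same shape, analogous to the zero arrays listed above.
fft_power = np.zeros(padded.shape)
print(padded.shape)  # (2, 5)
```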
Spectra
The spectrum parameter controls which spectral representation is extracted from the
FFT output. The selected spectrum is stored in dsp.spectrum_encoding.
| Spectrum | Attribute | Description |
|---|---|---|
| power | fft_power | Magnitude of the FFT output: $\lvert X[k] \rvert$ (absolute value of the complex FFT array). |
| absolute | fft_abs | Normalised absolute spectrum: $\lvert X[k] \rvert / N$, where $N$ is signal_len. |
| real | fft_real | Real part of the complex FFT output: $\operatorname{Re}(X[k])$. |
| imaginary | fft_imag | Imaginary part of the complex FFT output: $\operatorname{Im}(X[k])$. |

Set the spectrum in the config file or as a keyword argument:
{
    "pyDSP": {
        "spectrum": "power"
    }
}
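
The four representations can be checked by hand on a tiny signal. The sketch below uses a unit impulse delayed by one sample, whose FFT is known in closed form; the definitions follow the descriptions in this section, and the variable names are illustrative only.

```python
import numpy as np
from scipy.fft import fft

# Unit impulse at index 1: X[k] = exp(-2j*pi*k/4), i.e. [1, -1j, -1, 1j].
signal = np.array([0.0, 1.0, 0.0, 0.0])
transformed = fft(signal)

representations = {
    "power": np.abs(transformed),                    # every bin has magnitude 1
    "absolute": np.abs(transformed) / len(signal),   # every bin is 1/4
    "real": transformed.real,                        # approximately [1, 0, -1, 0]
    "imaginary": transformed.imag,                   # approximately [0, -1, 0, 1]
}
```

All four arrays have the same length as the input signal, so switching the spectrum changes the feature values but not the shape of the feature matrix.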
Window Functions
A window function tapers the encoded signal at its edges before the FFT is applied,
reducing spectral leakage. Set window_type to one of the values below. If
window_type is null or omitted, no window is applied (equivalent to a
rectangular window of ones).
Window names are matched approximately, so "hamm" resolves to "hamming", etc.
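
One way such approximate matching could work is with the standard library's `difflib.get_close_matches`. Whether pySAR uses this function, this cutoff, or this candidate list is an assumption; the names below are an illustrative subset and `resolve_window_name` is a hypothetical helper.

```python
from difflib import get_close_matches

# Illustrative subset of window names; not the full list supported by pySAR.
WINDOW_NAMES = ["hamming", "blackman", "hann", "gaussian", "kaiser", "bartlett"]


def resolve_window_name(name, candidates=WINDOW_NAMES, cutoff=0.6):
    """Hypothetical helper: return the closest known window name."""
    matches = get_close_matches(name.lower(), candidates, n=1, cutoff=cutoff)
    if not matches:
        raise ValueError(f"unrecognised window type: {name!r}")
    return matches[0]


print(resolve_window_name("hamm"))  # hamming
```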
| Window | Description |
|---|---|
| hamming | Raised cosine with non-zero endpoints; good general-purpose choice. |
| blackman | Three-term cosine window; lower sidelobe levels than Hamming. |
| blackmanharris | Four-term cosine window with very low sidelobe levels. |
| bartlett | Triangular window with zero-valued endpoints. |
| gaussian | Bell-curve shaped window; width set by std, with a default used if omitted. |
| kaiser | Flexible window controlled by a shape parameter. |
| hann | Raised cosine with zero-valued endpoints; reduces spectral leakage. |
| barthann | Combination of Bartlett and Hann windows. |
| bohman | Convolution of two half-duration cosine lobes; very low sidelobes. |
| chebwin | Dolph-Chebyshev window; equiripple sidelobes, with a default attenuation unless overridden. |
| cosine | Single half-period cosine window. |
| exponential | Exponentially decaying window. |
| flattop | Flat passband; accurate amplitude measurements in the frequency domain. |
| boxcar | Rectangular window (no tapering); useful as a reference. |
| nuttall | Continuous first-derivative window; very low sidelobes. |
| parzen | B-spline approximation window; smooth tapering. |
| triang | Triangular window with non-zero endpoints. |
| tukey | Cosine-tapered rectangular (top-hat) window, controlled by a taper parameter. |

Set window_type in the config or as a kwarg, and pass optional window parameters
via window_parameters:
{
    "pyDSP": {
        "window_type": "gaussian",
        "window_parameters": {
            "std": 10
        }
    }
}
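
To make the effect of a window concrete, the sketch below builds two windows with `scipy.signal.get_window` and applies one to a small batch of signals. The tuple form `("gaussian", 10)` is SciPy's way of passing a window parameter; how pySAR forwards window_parameters to SciPy is an assumption here, not something this page states.

```python
import numpy as np
from scipy.signal import get_window

signal_len = 8
signals = np.ones((2, signal_len))  # two constant signals, one per row

hamming = get_window("hamming", signal_len)
gaussian = get_window(("gaussian", 10), signal_len)  # std = 10

# Element-wise multiplication; the 1-D window broadcasts across the rows.
windowed = signals * hamming
print(windowed.shape)  # (2, 8)
```

Because the test signals are all ones, each row of `windowed` is simply the window itself, which makes the tapering easy to inspect.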
Filter Functions
An optional filter can be applied to each sequence’s FFT output after the transform.
Set filter_type to one of the values below. If filter_type is null or
omitted, no filter is applied.
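| Filter | Description |
|---|---|
| savgol | Savitzky-Golay smoothing filter (scipy.signal.savgol_filter). |
| medfilt | Median filter (scipy.signal.medfilt). |
| lfilter | Direct-form II transposed IIR filter (scipy.signal.lfilter). |
| hilbert | Hilbert transform (scipy.signal.hilbert). |
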
Set filter_type and optional filter_parameters in the config:
{
    "pyDSP": {
        "filter_type": "savgol",
        "filter_parameters": {
            "window_length": 11,
            "polyorder": 2
        }
    }
}
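
The parameters in the config above map naturally onto `scipy.signal.savgol_filter`. The sketch below smooths the magnitude of an FFT output with those same values; filtering the magnitude rather than the complex array is a simplification chosen for this example, since `savgol_filter` operates on real-valued data, and the test signal is synthetic.

```python
import numpy as np
from scipy.fft import fft
from scipy.signal import savgol_filter

# Synthetic signal: a sinusoid with 4 cycles plus a little seeded noise.
rng = np.random.default_rng(0)
n = np.arange(64)
signal = np.sin(2 * np.pi * 4 * n / 64) + 0.3 * rng.standard_normal(64)

magnitude = np.abs(fft(signal))

# Same parameter names and values as the filter_parameters example above.
smoothed = savgol_filter(magnitude, window_length=11, polyorder=2)
```

Smoothing spreads sharp spectral peaks over neighbouring bins, so the filter parameters trade peak resolution against noise suppression.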
encode_sequences
encode_sequences() is the main computation method. It iterates over each numerically
encoded protein sequence and:
1. Multiplies by the window function (or 1 if no window is set).
2. Applies scipy.fft.fft to obtain the complex FFT.
3. Applies the filter function to the FFT output (if configured).
4. Stores all four spectral arrays: fft_power, fft_abs, fft_real, fft_imag.
5. Sets spectrum_encoding to the array selected by the spectrum parameter.
6. Computes and stores FFT frequencies in fft_freqs.

dsp = PyDSP(config_file="config/thermostability.json", protein_seqs=encoded_seqs)
dsp.encode_sequences()
# access the spectral feature matrix
features = dsp.spectrum_encoding # shape: (num_sequences, signal_len)
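
For intuition about FFT frequencies, the sketch below uses `scipy.fft.fftfreq` to label the bins of a short test signal and locate its dominant frequency. This page does not say how pySAR computes fft_freqs, so the use of `fftfreq` here is an assumption for illustration.

```python
import numpy as np
from scipy.fft import fft, fftfreq

signal_len = 8
n = np.arange(signal_len)
signal = np.cos(2 * np.pi * 2 * n / signal_len)  # two full cycles over 8 samples

freqs = fftfreq(signal_len)        # bin frequencies in cycles per sample
magnitude = np.abs(fft(signal))

# Search only the non-negative-frequency half; a real signal's spectrum is symmetric.
half = signal_len // 2
peak_bin = int(np.argmax(magnitude[:half]))
print(freqs[peak_bin])  # 0.25
```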
Utility Methods
inverse_fft(a, n)
Returns the inverse Fourier Transform of array a truncated/padded to length n (wraps numpy.fft.ifft). The result is stored in dsp.inv_fft.

reconstructed = dsp.inverse_fft(dsp.fft[0], n=len(dsp.fft[0]))

consensus_freq(freqs)
Computes the Consensus Frequency (CF) for a single encoded sequence:

\[CF = \frac{\text{peak position}}{N}\]

where $N$ is the total number of sequences in the dataset. Accepts a 1-D array of FFT frequencies for one sequence.

max_freq(freqs)
Returns (max_F, max_FI), the maximum frequency value and its index, from a 1-D array of FFT frequencies for a single sequence.

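
The utility methods can be approximated with a few lines of NumPy. The standalone functions below are hypothetical stand-ins, not the PyDSP methods: they take the number of sequences as an explicit argument, and they interpret "peak position" as the index of the maximum value, which is an assumption.

```python
import numpy as np


def max_freq(freqs):
    """Return the maximum value in freqs and its index."""
    max_fi = int(np.argmax(freqs))
    return freqs[max_fi], max_fi


def consensus_freq(freqs, num_sequences):
    """CF = peak position / N, with peak position taken as the argmax index."""
    _, peak_position = max_freq(freqs)
    return peak_position / num_sequences


# Inverse FFT round trip: ifft(fft(x)) recovers x up to floating-point error.
signal = np.array([0.62, -0.5, 1.8, 0.29])
transformed = np.fft.fft(signal)
reconstructed = np.fft.ifft(transformed, n=len(transformed))

# Made-up frequency values for one sequence in a dataset of 4 sequences.
freqs = np.array([0.1, 0.7, 0.3, 0.2])
max_f, max_fi = max_freq(freqs)          # 0.7 at index 1
cf = consensus_freq(freqs, num_sequences=4)  # 1 / 4 = 0.25
```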
Config File
All DSP options are set under the [pyDSP] key in the config JSON:
{
    "pyDSP": {
        "spectrum": "power",
        "window_type": "hamming",
        "window_parameters": {},
        "filter_type": "",
        "filter_parameters": {}
    }
}
Example config files for each dataset can be found in the config/ directory of
the repository: