Digital Signal Processing

The PyDSP class (pySAR/pyDSP.py) transforms numerically-encoded protein sequences into frequency-domain spectral features using the Fast Fourier Transform (FFT). These spectral features can then be used directly as inputs to a regression model, enabling the AAI + DSP encoding strategy in pySAR.

Note

PyDSP operates on numerically encoded sequences — not raw amino acid strings. Each sequence must first be encoded via an AAIndexEncoding step before being passed to PyDSP.

Pipeline Overview

The DSP encoding pipeline follows these steps:

Numerical encoding — protein sequences are encoded with physico-chemical AAI indices, producing a 2-D array of shape (num_sequences, signal_len).
Pre-processing — sequences are zero-padded to a uniform length; inf and NaN values are replaced with 0.
Window function (optional) — a window is element-wise multiplied with each encoded sequence before the FFT to reduce spectral leakage.
FFT — scipy.fft.fft is applied to each sequence independently.
Filter function (optional) — a smoothing or analytic filter is applied to the FFT output of each sequence.
Spectrum extraction — the desired spectral representation (power, absolute, real, or imaginary) is extracted from the complex FFT output.

The resulting spectrum_encoding array (shape (num_sequences, signal_len)) is the feature matrix used for model training.

Instantiation

PyDSP.__init__(config_file, protein_seqs, **kwargs)

from pySAR.pyDSP import PyDSP

# from a config file path
dsp = PyDSP(config_file="config/thermostability.json", protein_seqs=encoded_seqs)

# from a config dict
dsp = PyDSP(config_file={"pyDSP": {"spectrum": "power", "window_type": "hamming"}},
            protein_seqs=encoded_seqs)

# via keyword arguments (override or replace config values)
dsp = PyDSP(config_file=None, protein_seqs=encoded_seqs,
            spectrum="absolute", window_type="blackman")

All [pyDSP] config keys are also accepted as keyword arguments and will override any value read from the config file.

Pre-processing

Calling pre_processing() prepares the encoded sequences for the FFT:

Zero-pads all sequences to the length of the longest sequence in the dataset (signal_len), ensuring a uniform array shape.
Replaces any inf or NaN values with 0 to prevent FFT errors.
Initialises the fft_power, fft_real, fft_imag, and fft_abs zero arrays ready to receive results from encode_sequences().

pre_processing() is called automatically before encode_sequences().

Spectra

The spectrum parameter controls which spectral representation is extracted from the FFT output. The selected spectrum is stored in dsp.spectrum_encoding.

Spectrum	Attribute	Description
`power`	`fft_power`	Magnitude of the FFT output: $\|X[k]\|$ (absolute value of the complex FFT array).
`absolute`	`fft_abs`	Normalised absolute spectrum: $\|X[k]\| / N$ where $N$ is `signal_len`.
`real`	`fft_real`	Real part of the complex FFT output: $operatorname{Re}(X[k])$.
`imaginary`	`fft_imag`	Imaginary part of the complex FFT output: $operatorname{Im}(X[k])$.

Set the spectrum in the config file or as a keyword argument:

{
    "pyDSP": {
        "spectrum": "power"
    }
}

Window Functions

A window function tapers the encoded signal at its edges before the FFT is applied, reducing spectral leakage. Set window_type to one of the values below. If window_type is null or omitted, no window is applied (equivalent to a rectangular window of ones).

Window names are matched approximately, so "hamm" resolves to "hamming", etc.

Window	Description
`hamming`	Raised cosine with non-zero endpoints; good general-purpose choice.
`blackman`	Three-term cosine window; lower sidelobe levels than Hamming.
`blackmanharris`	Four-term cosine window with very low sidelobe levels.
`bartlett`	Triangular window with zero-valued endpoints.
`gaussian`	Bell-curve shaped window (default `std=7`); smooth frequency response.
`kaiser`	Flexible window controlled by `beta`; balances main-lobe width against sidelobe amplitude.
`hann`	Raised cosine with zero-valued endpoints; minimises spectral leakage.
`barthann`	Combination of Bartlett and Hann windows.
`bohman`	Convolution of two half-duration Bartlett windows; very low sidelobes.
`chebwin`	Dolph-Chebyshev window; equiripple sidelobes (default attenuation `at=100` dB).
`cosine`	Single half-period cosine window.
`exponential`	Exponentially decaying window.
`flattop`	Flat passband; accurate amplitude measurements in the frequency domain.
`boxcar`	Rectangular window (no tapering); useful as a reference.
`nuttall`	Continuous first-derivative window; very low sidelobes.
`parzen`	B-spline approximation window; smooth tapering.
`triang`	Triangular window with non-zero endpoints.
`tukey`	Cosine-tapered rectangular (top-hat) window, controlled by `alpha`.

Set window_type in the config or as a kwarg, and pass optional window parameters via window_parameters:

{
    "pyDSP": {
        "window_type": "gaussian",
        "window_parameters": {
            "std": 10
        }
    }
}

Filter Functions

An optional filter can be applied to each sequence’s FFT output after the transform. Set filter_type to one of the values below. If filter_type is null or omitted, no filter is applied.

Filter	Description
`savgol`	Savitzky-Golay smoothing filter (`scipy.signal.savgol_filter`). Fits successive sub-sets of adjacent data points with a low-degree polynomial to smooth noisy spectra while preserving peak shape.
`medfilt`	Median filter (`scipy.signal.medfilt`). Replaces each sample with the median of its neighbours; effective at removing spike noise.
`lfilter`	Direct-form II transposed IIR filter (`scipy.signal.lfilter`). Requires `b` and `a` numerator/denominator coefficients to be supplied in `filter_parameters`.
`hilbert`	Hilbert transform (`scipy.signal.hilbert`). Computes the analytic signal; the imaginary part is the Hilbert transform of the input.

Set filter_type and optional filter_parameters in the config:

{
    "pyDSP": {
        "filter_type": "savgol",
        "filter_parameters": {
            "window_length": 11,
            "polyorder": 2
        }
    }
}

encode_sequences

encode_sequences() is the main computation method. It iterates over each numerically encoded protein sequence and:

Multiplies by the window function (or 1 if no window).
Applies scipy.fft.fft to obtain the complex FFT.
Applies the filter function to the FFT output (if configured).
Stores all four spectral arrays: fft_power, fft_abs, fft_real, fft_imag.
Sets spectrum_encoding to the array selected by the spectrum parameter.
Computes and stores FFT frequencies in fft_freqs.

dsp = PyDSP(config_file="config/thermostability.json", protein_seqs=encoded_seqs)
dsp.encode_sequences()

# access the spectral feature matrix
features = dsp.spectrum_encoding   # shape: (num_sequences, signal_len)

Utility Methods

inverse_fft(a, n)

Returns the inverse Fourier Transform of array a truncated/padded to length n (wraps numpy.fft.ifft). The result is stored in dsp.inv_fft.

reconstructed = dsp.inverse_fft(dsp.fft[0], n=len(dsp.fft[0]))

consensus_freq(freqs)

Computes the Consensus Frequency (CF) for a single encoded sequence:

\[CF = \frac{\text{peak position}}{N}\]

where N is the total number of sequences in the dataset. Accepts a 1-D array of FFT frequencies for one sequence.

max_freq(freqs)

Returns (max_F, max_FI) — the maximum frequency value and its index — from a 1-D array of FFT frequencies for a single sequence.

Config File

All DSP options are set under the [pyDSP] key in the config JSON:

{
    "pyDSP": {
        "spectrum": "power",
        "window_type": "hamming",
        "window_parameters": {},
        "filter_type": "",
        "filter_parameters": {}
    }
}

Example config files for each dataset can be found in the config/ directory of the repository: