Digital Signal Processing ========================== The ``PyDSP`` class (``pySAR/pyDSP.py``) transforms numerically-encoded protein sequences into frequency-domain spectral features using the Fast Fourier Transform (FFT). These spectral features can then be used directly as inputs to a regression model, enabling the AAI + DSP encoding strategy in pySAR. .. note:: ``PyDSP`` operates on **numerically encoded** sequences — not raw amino acid strings. Each sequence must first be encoded via an AAIndexEncoding step before being passed to ``PyDSP``. ---- Pipeline Overview ----------------- The DSP encoding pipeline follows these steps: 1. **Numerical encoding** — protein sequences are encoded with physico-chemical AAI indices, producing a 2-D array of shape ``(num_sequences, signal_len)``. 2. **Pre-processing** — sequences are zero-padded to a uniform length; ``inf`` and ``NaN`` values are replaced with ``0``. 3. **Window function** (optional) — a window is element-wise multiplied with each encoded sequence before the FFT to reduce spectral leakage. 4. **FFT** — ``scipy.fft.fft`` is applied to each sequence independently. 5. **Filter function** (optional) — a smoothing or analytic filter is applied to the FFT output of each sequence. 6. **Spectrum extraction** — the desired spectral representation (power, absolute, real, or imaginary) is extracted from the complex FFT output. The resulting ``spectrum_encoding`` array (shape ``(num_sequences, signal_len)``) is the feature matrix used for model training. ---- Instantiation ------------- ``PyDSP.__init__(config_file, protein_seqs, **kwargs)`` .. code-block:: python from pySAR.pyDSP import PyDSP # from a config file path dsp = PyDSP(config_file="config/thermostability.json", protein_seqs=encoded_seqs) # from a config dict dsp = PyDSP(config_file={"pyDSP": {"spectrum": "power", "window_type": "hamming"}}, protein_seqs=encoded_seqs) # via keyword arguments (override or replace config values) dsp = PyDSP(config_file=None, protein_seqs=encoded_seqs, spectrum="absolute", window_type="blackman") All ``[pyDSP]`` config keys are also accepted as keyword arguments and will override any value read from the config file. ---- Pre-processing -------------- Calling ``pre_processing()`` prepares the encoded sequences for the FFT: - Zero-pads all sequences to the length of the longest sequence in the dataset (``signal_len``), ensuring a uniform array shape. - Replaces any ``inf`` or ``NaN`` values with ``0`` to prevent FFT errors. - Initialises the ``fft_power``, ``fft_real``, ``fft_imag``, and ``fft_abs`` zero arrays ready to receive results from ``encode_sequences()``. ``pre_processing()`` is called automatically before ``encode_sequences()``. ---- Spectra ------- The ``spectrum`` parameter controls which spectral representation is extracted from the FFT output. The selected spectrum is stored in ``dsp.spectrum_encoding``. .. list-table:: :header-rows: 1 :widths: 20 30 50 * - Spectrum - Attribute - Description * - ``power`` - ``fft_power`` - Magnitude of the FFT output: $|X[k]|$ (absolute value of the complex FFT array). * - ``absolute`` - ``fft_abs`` - Normalised absolute spectrum: $|X[k]| / N$ where $N$ is ``signal_len``. * - ``real`` - ``fft_real`` - Real part of the complex FFT output: $\operatorname{Re}(X[k])$. * - ``imaginary`` - ``fft_imag`` - Imaginary part of the complex FFT output: $\operatorname{Im}(X[k])$. Set the spectrum in the config file or as a keyword argument: .. code-block:: json { "pyDSP": { "spectrum": "power" } } ---- Window Functions ---------------- A window function tapers the encoded signal at its edges before the FFT is applied, reducing spectral leakage. Set ``window_type`` to one of the values below. If ``window_type`` is ``null`` or omitted, no window is applied (equivalent to a rectangular window of ones). Window names are matched approximately, so ``"hamm"`` resolves to ``"hamming"``, etc. .. list-table:: :header-rows: 1 :widths: 25 75 * - Window - Description * - ``hamming`` - Raised cosine with non-zero endpoints; good general-purpose choice. * - ``blackman`` - Three-term cosine window; lower sidelobe levels than Hamming. * - ``blackmanharris`` - Four-term cosine window with very low sidelobe levels. * - ``bartlett`` - Triangular window with zero-valued endpoints. * - ``gaussian`` - Bell-curve shaped window (default ``std=7``); smooth frequency response. * - ``kaiser`` - Flexible window controlled by ``beta``; balances main-lobe width against sidelobe amplitude. * - ``hann`` - Raised cosine with zero-valued endpoints; minimises spectral leakage. * - ``barthann`` - Combination of Bartlett and Hann windows. * - ``bohman`` - Convolution of two half-duration Bartlett windows; very low sidelobes. * - ``chebwin`` - Dolph-Chebyshev window; equiripple sidelobes (default attenuation ``at=100`` dB). * - ``cosine`` - Single half-period cosine window. * - ``exponential`` - Exponentially decaying window. * - ``flattop`` - Flat passband; accurate amplitude measurements in the frequency domain. * - ``boxcar`` - Rectangular window (no tapering); useful as a reference. * - ``nuttall`` - Continuous first-derivative window; very low sidelobes. * - ``parzen`` - B-spline approximation window; smooth tapering. * - ``triang`` - Triangular window with non-zero endpoints. * - ``tukey`` - Cosine-tapered rectangular (top-hat) window, controlled by ``alpha``. Set ``window_type`` in the config or as a kwarg, and pass optional window parameters via ``window_parameters``: .. code-block:: json { "pyDSP": { "window_type": "gaussian", "window_parameters": { "std": 10 } } } ---- Filter Functions ---------------- An optional filter can be applied to each sequence's FFT output after the transform. Set ``filter_type`` to one of the values below. If ``filter_type`` is ``null`` or omitted, no filter is applied. .. list-table:: :header-rows: 1 :widths: 20 80 * - Filter - Description * - ``savgol`` - Savitzky-Golay smoothing filter (``scipy.signal.savgol_filter``). Fits successive sub-sets of adjacent data points with a low-degree polynomial to smooth noisy spectra while preserving peak shape. * - ``medfilt`` - Median filter (``scipy.signal.medfilt``). Replaces each sample with the median of its neighbours; effective at removing spike noise. * - ``lfilter`` - Direct-form II transposed IIR filter (``scipy.signal.lfilter``). Requires ``b`` and ``a`` numerator/denominator coefficients to be supplied in ``filter_parameters``. * - ``hilbert`` - Hilbert transform (``scipy.signal.hilbert``). Computes the analytic signal; the imaginary part is the Hilbert transform of the input. Set ``filter_type`` and optional ``filter_parameters`` in the config: .. code-block:: json { "pyDSP": { "filter_type": "savgol", "filter_parameters": { "window_length": 11, "polyorder": 2 } } } ---- encode_sequences ---------------- ``encode_sequences()`` is the main computation method. It iterates over each numerically encoded protein sequence and: 1. Multiplies by the window function (or ``1`` if no window). 2. Applies ``scipy.fft.fft`` to obtain the complex FFT. 3. Applies the filter function to the FFT output (if configured). 4. Stores all four spectral arrays: ``fft_power``, ``fft_abs``, ``fft_real``, ``fft_imag``. 5. Sets ``spectrum_encoding`` to the array selected by the ``spectrum`` parameter. 6. Computes and stores FFT frequencies in ``fft_freqs``. .. code-block:: python dsp = PyDSP(config_file="config/thermostability.json", protein_seqs=encoded_seqs) dsp.encode_sequences() # access the spectral feature matrix features = dsp.spectrum_encoding # shape: (num_sequences, signal_len) ---- Utility Methods --------------- ``inverse_fft(a, n)`` Returns the inverse Fourier Transform of array ``a`` truncated/padded to length ``n`` (wraps ``numpy.fft.ifft``). The result is stored in ``dsp.inv_fft``. .. code-block:: python reconstructed = dsp.inverse_fft(dsp.fft[0], n=len(dsp.fft[0])) ``consensus_freq(freqs)`` Computes the Consensus Frequency (CF) for a single encoded sequence: .. math:: CF = \frac{\text{peak position}}{N} where *N* is the total number of sequences in the dataset. Accepts a 1-D array of FFT frequencies for one sequence. ``max_freq(freqs)`` Returns ``(max_F, max_FI)`` — the maximum frequency value and its index — from a 1-D array of FFT frequencies for a single sequence. ---- Config File ----------- All DSP options are set under the ``[pyDSP]`` key in the config JSON: .. code-block:: json { "pyDSP": { "spectrum": "power", "window_type": "hamming", "window_parameters": {}, "filter_type": "", "filter_parameters": {} } } Example config files for each dataset can be found in the ``config/`` directory of the repository: - `thermostability.json `_ - `absorption.json `_ - `enantioselectivity.json `_ - `localization.json `_