Transformation functions¶
Most audio processing involves handling the following inherent elements:
Clipping: A transformation is likely to increase some part of the input waveform above its original level. In the worst case, it can go beyond 1.0 (or below -1.0) and thus saturate/clip when saving the transformed waveform into a file. By default, nothing is done in pitchmeld to prevent this. However, you can use the clipper_knee=0.66 argument in the functions below to apply a clipping effect that will reduce the distortion of any clipping. A minimal sketch of the idea follows this list.
Equalisation and loudness preservation: A transformation is likely to change the spectral balance of the input waveform. The standard way to handle this is to preserve the energy in some frequency bands by applying a loudness equalisation effect. This is done by default in the functions below, but you can disable it by setting eq=False. You might want to disable it when handling very unnatural synthetic signals (e.g. a pure tone).
Multichannels: Multichannel processing is not supported yet (but will be soon). In the present version, the functions below average the channels and process the signal as a monophonic signal. The output channel is then duplicated to match the number of channels of the input, to preserve dimensions.
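For intuition, here is a minimal sketch of what such a knee-based soft clipper does (an illustration of the general technique, not pitchmeld's actual implementation; soft_clip is a hypothetical helper): below the knee amplitude the signal passes through unchanged, and above it the excursion is compressed smoothly so the output never exceeds ±1.0.

import numpy as np

def soft_clip(x: np.ndarray, knee: float = 0.66) -> np.ndarray:
    # Illustrative soft clipper: linear below the knee amplitude,
    # smoothly compressed above it, bounded by +/-1.0.
    y = x.copy()
    over = np.abs(x) > knee
    headroom = 1.0 - knee  # amplitude range left between the knee and 1.0
    # Map the excursion above the knee through tanh so the output
    # saturates at 1.0 instead of clipping hard.
    y[over] = np.sign(x[over]) * (
        knee + headroom * np.tanh((np.abs(x[over]) - knee) / headroom))
    return y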
- Processing flow:
The function transform is based on an Overlap-Add process whose base implementation is freely available here.
The different processing operations are applied internally in a fixed order.
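For reference, here is a minimal sketch of the Overlap-Add principle that transform builds on (a generic textbook illustration, not pitchmeld's internal code; ola_passthrough is a hypothetical helper): the signal is cut into overlapping windowed frames, each frame can be modified, and the frames are summed back together with a normalisation for the window overlap.

import numpy as np

def ola_passthrough(wav: np.ndarray, winlen: int, timestep: int) -> np.ndarray:
    # Hann-windowed analysis/synthesis: with sufficient overlap, summing
    # the re-windowed frames reconstructs the input signal.
    win = np.hanning(winlen)
    out = np.zeros(len(wav))
    norm = np.zeros(len(wav))
    for start in range(0, len(wav) - winlen + 1, timestep):
        frame = wav[start:start + winlen] * win
        # ... a transformation would modify `frame` here ...
        out[start:start + winlen] += frame * win
        norm[start:start + winlen] += win ** 2
    norm[norm < 1e-8] = 1.0  # avoid division by zero at the edges
    return out / norm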
Functions¶
- pitchmeld.transform_timescaling(wav: ndarray, fs: float, **kwargs)¶
Same arguments and return values as transform(). This function alters a few technical arguments of transform() in order to optimize speed for time scaling, without compromising audio quality.
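For example (a hedged usage sketch; the file paths are placeholders, and pbf follows the playback-factor convention documented in transform()):

import pitchmeld
import soundfile

wav, fs = soundfile.read('path/to/audio.wav')
# Half playback speed (the output is about twice as long), pitch unchanged:
syn = pitchmeld.transform_timescaling(wav, fs, pbf=0.5)
soundfile.write('slow.wav', syn, fs)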
- pitchmeld.transform_pitchscaling(wav: ndarray, fs: float, **kwargs)¶
Same arguments and return values as transform(). This function alters a few technical arguments of transform() in order to optimize speed for pitch scaling only, without compromising audio quality.
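For example (a hedged usage sketch; the file paths are placeholders):

import pitchmeld
import soundfile

wav, fs = soundfile.read('path/to/audio.wav')
# One octave up (psf=2.0), duration unchanged:
syn = pitchmeld.transform_pitchscaling(wav, fs, psf=2.0)
soundfile.write('octave_up.wav', syn, fs)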
- pitchmeld.transform(wav: ndarray, fs: float, pbf: float = 1.0, pbfs: ndarray = None, esf: float = 1.0, esp: bool = True, psf: float = 1.0, psfs: ndarray = None, set_f0: float = None, set_f0s: ndarray = None, psf_max: float = 2.0, psf_autotune_enable: bool = False, psf_autotune_snapping_A4: float = 440.0, psf_autotune_snapping_key: str = 'C', psf_autotune_snapping_scale: str = 'chord_major7', psf_autotune_amount_coef: float = 0.80, psf_autotune_retune_delay: float = 0.020, clipper_knee: float = None, winlen_inner: float = 0.020*fs, timestep: float = 0.005*fs, f0_min: float = 27.5, f0_max: float = 3520, eq: bool = True, info: bool = False) → ndarray[float32]¶
This is the generic function to transform a voice signal while applying multiple audio effects. See also below for more functions dedicated to specific tasks.
Note
It assumes the signal is monophonic, like a voice, a flute, a violin, a saxophone, etc.
It is not recommended to use it on polyphonic signals like a piano, a guitar, a drum set, etc.
- Parameters:
wav – Input signal to transform.
fs – Sampling rate [Hz].
pbf – Playback factor for time scaling [coefficient, def. 1.0].
Note
The method is designed so that no global time drift is possible. However, because internal frames need to be time-aligned to ensure signal continuity, audio events might be slightly shifted locally, by at most one frame earlier or later (at most 0.005s by default).
For example, assuming a speed up of 2 and a timestep of 0.005s, an audio event at 60s might end up at 30.005s instead of 30s. Nevertheless, there is no time drift: an audio event at 120s will not end up as far off as 60.010s.
pbfs – Time varying playback factor [2D ndarray, def. None]. A 2D numpy array of shape (N, 2) where N is the number of given [time, pbf] pairs. The first column is the time in seconds, relative to the original signal (not the transformed one). The second column is the pbf playback factor (as above). See the extended example at the end of this page.
esf – Envelope scaling factor [coefficient, def. 1.0].
esp – Preserve spectral envelope [boolean, def. True]. Also known as “formant preservation”.
psf – Pitch scaling factor [coefficient, def. 1.0].
psfs – Time varying pitch scaling factor [2D ndarray, def. None]. A 2D numpy array of shape (N, 2) where N is the number of given [time, psf] pairs. The first column is the time in seconds, relative to the original signal (not the transformed one). The second column is the psf pitch scaling factor (as above).
psf_max – Maximum value for the pitch scaling factor [coefficient, def. 2.0].
psf_autotune_enable – Enable autotune [boolean, def. False]. Automatically snap the pitch to the closest note in the given scale (see the extended example at the end of this page).
psf_autotune_snapping_A4 – A4 reference frequency for autotune [Hz, def. 440.0].
psf_autotune_snapping_key – Key for autotune [string, def. “C”, ex.: “C”, “Db”, “D”, “Eb”, “E”, “F”, “Gb”, “G”, “Ab”, “A”, “Bb”, “B”].
psf_autotune_snapping_scale – Scale for autotune [string, among: “chromatic”, “chord_major”, “chord_major7”, “chord_minor”, “pentatonic_major”, “pentatonic_minor”; def. “chord_major7”].
psf_autotune_amount_coef – Amount coefficient [coefficient in [0.0, 1.0], def. 0.80]. 1.0 means full correction, 0.0 means no correction.
psf_autotune_retune_delay – Retune delay [seconds, def. 0.020]. A shorter delay speeds up the snapping and sounds more robotic.
set_f0 – Force the fundamental frequency to a constant value [Hz, def. None]. psf will be set automatically so that the output fundamental frequency is equal to the given set_f0 value.
set_f0s – Time varying fundamental frequency [2D ndarray, def. None]. A 2D numpy array of shape (N, 2) where N is the number of given [time, set_f0] pairs. The first column is the time in seconds, relative to the original signal (not the transformed one). The second column is the set_f0 fundamental frequency (as above).
clipper_knee – Clipper knee amplitude [linear amplitude, def. None, common 0.66]. This is to prevent the signal from clipping at 1.0 when saving it to a file, which creates audio glitches. The knee amplitude is the point where the clipper starts to act; it keeps the signal from going above ±1.0 in amplitude. The lower the value, the fewer glitches, but the more the signal will be distorted. Set it to None to disable it.
Note
The following arguments are used to optimize the processing's audio quality and speed. It is not recommended to change them unless you know what you are doing. Using transform_timescaling and transform_pitchscaling will automatically set them for you depending on the task.
- Parameters:
eq – Equalisation [boolean, def. True]. This is to preserve the spectral balance of the output waveform compared to the input waveform.
winlen_inner – Inner window length [#samples, def. 0.020*fs]. This is the window length used for the inner processing. The bigger the value, the more stable the sound, but the slower the processing.
timestep – Time step [#samples, def. 0.005*fs]. This is the time step from one frame to the next. The smaller the value, the more stable the sound, but the slower the processing.
f0_min – Minimum value for the fundamental frequency [Hz, def. 440/16=27.5]. This is to prevent the pitch from going too low and creating audio glitches.
f0_max – Maximum value for the fundamental frequency [Hz, def. 440*8=3520]. This is to prevent the pitch from going too high and creating audio glitches.
info – If set to True, also returns an extra dict with various information about how the processing went [boolean, def. False].
- Returns:
- ndarray[float32] - The modified signal.
Shape will be the same as the input signal's. The type will always be float32 since the whole processing runs at float32 precision.
info [dict] - Processing information [optional: only returned if info=True].
- Example:
import pitchmeld
import soundfile

wav, fs = soundfile.read('path/to/audio.wav')

# Shift the pitch up one octave (psf=2.0):
syn = pitchmeld.transform(wav, fs, psf=2.0)
soundfile.write('syn.wav', syn, fs)
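Building on this, here is a hedged sketch of the time-varying and autotune arguments referenced above (the array layouts follow the parameter descriptions; file names and break-point values are placeholders):

import numpy as np
import pitchmeld
import soundfile

wav, fs = soundfile.read('path/to/audio.wav')

# Time-varying playback factor: normal speed at 0s, double speed from 5s.
# Times are in seconds, relative to the original signal.
pbfs = np.array([[0.0, 1.0],
                 [5.0, 2.0]])

# Time-varying pitch scaling: unchanged at 0s, a fifth up (x1.5) at 5s.
psfs = np.array([[0.0, 1.0],
                 [5.0, 1.5]])

syn = pitchmeld.transform(wav, fs, pbfs=pbfs, psfs=psfs, clipper_knee=0.66)
soundfile.write('syn_varying.wav', syn, fs)

# Autotune: snap the pitch to the C major 7 chord, full correction.
syn = pitchmeld.transform(wav, fs,
                          psf_autotune_enable=True,
                          psf_autotune_snapping_key='C',
                          psf_autotune_snapping_scale='chord_major7',
                          psf_autotune_amount_coef=1.0)
soundfile.write('syn_tuned.wav', syn, fs)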