Releases: pytorch/audio
Torchaudio 2.1 Release Note
Highlights
TorchAudio v2.1 introduces the following new features and backward-incompatible changes:
- [BETA] A new API to apply filters, effects and codecs
torchaudio.io.AudioEffector can apply filters, effects and encodings to waveforms in online/offline fashion.
You can use it as a form of augmentation.
Please refer to https://pytorch.org/audio/2.1/tutorials/effector_tutorial.html for examples.
- [BETA] Tools for forced alignment
New functions and a pre-trained model for forced alignment were added.
torchaudio.functional.forced_align computes alignment from an emission, and torchaudio.pipelines.MMS_FA provides access to the model trained for multilingual forced alignment in the "MMS: Scaling Speech Technology to 1000+ languages" project.
Please refer to https://pytorch.org/audio/2.1/tutorials/ctc_forced_alignment_api_tutorial.html for the usage of the forced_align function, and https://pytorch.org/audio/2.1/tutorials/forced_alignment_for_multilingual_data_tutorial.html for how one can use MMS_FA to align transcripts in multiple languages.
- [BETA] TorchAudio-Squim: Models for reference-free speech assessment
Model architectures and pre-trained models from the paper "TorchAudio-Squim: Reference-less Speech Quality and Intelligibility measures in TorchAudio" were added.
You can use the torchaudio.pipelines.SQUIM_SUBJECTIVE and torchaudio.pipelines.SQUIM_OBJECTIVE models to estimate various speech quality and intelligibility metrics. This is helpful when evaluating the quality of speech generation models, such as TTS.
Please refer to https://pytorch.org/audio/2.1/tutorials/squim_tutorial.html for details.
- [BETA] CUDA-based CTC decoder
torchaudio.models.decoder.CUCTCDecoder takes emissions stored in CUDA memory and performs CTC beam search on them on the CUDA device. The beam search is fast. It eliminates the need to move data from the CUDA device to CPU when performing automatic speech recognition. With PyTorch's CUDA support, it is now possible to perform the entire speech recognition pipeline in CUDA.
Please refer to https://pytorch.org/audio/2.1/tutorials/asr_inference_with_cuda_ctc_decoder_tutorial.html for details.
- [Prototype] Utilities for AI music generation
We are working to add utilities that are relevant to music AI. Since the last release, the following APIs were added to the prototype.
Please refer to the respective documentation for usage.
- torchaudio.prototype.chroma_filterbank
- torchaudio.prototype.transforms.ChromaScale
- torchaudio.prototype.transforms.ChromaSpectrogram
- torchaudio.prototype.pipelines.VGGISH
- New recipes for training models.
Recipes for Audio-visual ASR, multi-channel DNN beamforming and TCPGen context-biasing were added.
Please refer to the recipes.
- Update to FFmpeg support
The version of supported FFmpeg libraries was updated.
TorchAudio v2.1 works with FFmpeg 6, 5 and 4.4. Support for 4.3, 4.2 and 4.1 is dropped.
Please refer to https://pytorch.org/audio/2.1/installation.html#optional-dependencies for details of the new FFmpeg integration mechanism.
- Update to libsox integration
TorchAudio now depends on libsox installed separately from torchaudio. The SoX I/O backend no longer supports file-like objects. (These are supported by the FFmpeg backend and soundfile.)
Please refer to https://pytorch.org/audio/2.1/installation.html#optional-dependencies for the detail.
New Features
I/O
- Support overwriting PTS in torchaudio.io.StreamWriter (#3135)
- Include format information after filter in torchaudio.io.StreamReader.get_out_stream_info (#3155)
- Support CUDA frame in torchaudio.io.StreamReader filter graph (#3183, #3479)
- Support YUV444P in GPU decoder (#3199)
- Add additional filter graph processing to torchaudio.io.StreamWriter (#3194)
- Cache and reuse HW device context in GPU decoder (#3178)
- Cache and reuse HW device context in GPU encoder (#3215)
- Support changing the number of channels in torchaudio.io.StreamReader (#3216)
- Support encode spec change in torchaudio.io.StreamWriter (#3207)
- Support encode options such as compression rate and bit rate (#3179, #3203, #3224)
- Add 420p10le support to torchaudio.io.StreamReader CPU decoder (#3332)
- Support multiple FFmpeg versions (#3464, #3476)
- Support writing opus and mp3 with soundfile (#3554)
- Add switch to disable sox integration and ffmpeg integration at runtime (#3500)
Ops
- Add torchaudio.io.AudioEffector (#3163, #3372, #3374)
- Add torchaudio.transforms.SpecAugment (#3309, #3314)
- Add torchaudio.functional.forced_align (#3348, #3355, #3533, #3536, #3354, #3365, #3433, #3357)
- Add torchaudio.functional.merge_tokens (#3535, #3614)
- Add torchaudio.functional.frechet_distance (#3545)
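To illustrate what the forced-alignment helpers listed above produce, here is a minimal pure-Python sketch of collapsing a frame-level CTC alignment into token spans. It follows the idea behind torchaudio.functional.merge_tokens (merge repeated frames, drop blanks), but the Span type, the merge_frames name and the score averaging are assumptions made for illustration, not the library's implementation.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass
class Span:
    """Hypothetical token span covering frames [start, end)."""
    token: int
    start: int
    end: int
    score: float


def merge_frames(tokens: List[int], scores: List[float], blank: int = 0) -> List[Span]:
    """Collapse runs of identical frame labels and drop blank runs."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    spans: List[Span] = []
    start = 0
    for token, run in groupby(tokens):
        length = len(list(run))
        end = start + length
        if token != blank:
            # Average the per-frame scores over the run (an assumption).
            avg = sum(scores[start:end]) / length
            spans.append(Span(token, start, end, avg))
        start = end
    return spans


# Frame-level alignment for a toy utterance; 0 is the blank label.
frames = [0, 3, 3, 0, 0, 5, 5, 5, 0]
frame_scores = [0.9, 0.8, 0.6, 0.9, 0.9, 0.5, 0.7, 0.9, 1.0]
spans = merge_frames(frames, frame_scores)
```

Multiplying the start and end frame indices by the frame duration of the acoustic model would turn these spans into timestamps.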
Models
- Add torchaudio.models.SquimObjective for speech enhancement (#3042, #3087, #3512)
- Add torchaudio.models.SquimSubjective for speech enhancement (#3189)
- Add torchaudio.models.decoder.CUCTCDecoder (#3096)
Pipelines
- Add torchaudio.pipelines.SquimObjectiveBundle for speech enhancement (#3103)
- Add torchaudio.pipelines.SquimSubjectiveBundle for speech enhancement (#3197)
- Add torchaudio.pipelines.MMS_FA Bundle for forced alignment (#3521, #3538)
Tutorials
- Add tutorial for torchaudio.io.AudioEffector (#3226)
- Add tutorials for CTC forced alignment API (#3356, #3443, #3529, #3534, #3542, #3546, #3566)
- Add tutorial for torchaudio.models.decoder.CUCTCDecoder (#3297)
- Add tutorial for real-time av-asr (#3511)
- Add tutorial for TorchAudio-SQUIM pipelines (#3279, #3313)
- Split HW acceleration tutorial into nvdec/nvenc tutorials (#3483, #3478)
Recipe
- Add TCPGen context-biasing Conformer RNN-T (#2890)
- Add AV-ASR recipe (#3278, #3421, #3441, #3489, #3493, #3498, #3492, #3532)
- Add multi-channel DNN beamforming training recipe (#3036)
Backward-incompatible changes
Third-party libraries
In this release, the following third-party libraries are removed from TorchAudio binary distributions. TorchAudio now searches for and links these libraries at runtime. Please install them to use the corresponding APIs.
SoX
libsox is used for various audio I/O and filtering operations.
Pre-built binaries are available via package managers, such as conda, apt and brew. Please refer to the respective documentation.
The APIs affected include:
- torchaudio.load ("sox" backend)
- torchaudio.info ("sox" backend)
- torchaudio.save ("sox" backend)
- torchaudio.sox_effects.apply_effects_tensor
- torchaudio.sox_effects.apply_effects_file
- torchaudio.functional.apply_codec (also deprecated, see below)
Changes related to the removal: #3232, #3246, #3497, #3035
Flashlight Text
flashlight-text is the core of the CTC decoder.
Pre-built packages are available on PyPI. Please refer to https://github.com/flashlight/text for details.
The APIs affected include:
- torchaudio.models.decoder.CTCDecoder
Changes related to the removal: #3232, #3246, #3236, #3339
Kaldi
A custom-built libkaldi was used to implement torchaudio.functional.compute_kaldi_pitch. This function, along with the libkaldi integration, is removed in this release. There is no replacement.
Changes related to the removal: #3368, #3403
I/O
- Switch to the backend dispatcher (#3241)
To make I/O operations more flexible, TorchAudio introduced the backend dispatcher in v2.0, and users could opt-in to use the dispatcher.
In this release, the backend dispatcher becomes the default mechanism for selecting the I/O backend.
You can pass the backend argument to the torchaudio.info, torchaudio.load and torchaudio.save functions to select the I/O backend library on a per-call basis. (If it is omitted, an available backend is automatically selected.)
If you want to use the global backend mechanism, you can set the environment variable TORCHAUDIO_USE_BACKEND_DISPATCHER=0.
Please note, however, that the global backend mechanism is deprecated and is going to be removed in the next release.
Please see #2950 for details of the migration work.
torchaudio.io.StreamReader accepted a byte string wrapped in a 1D torch.Tensor object. This is no longer supported.
Please wrap the underlying data with io.BytesIO instead.
The optional arguments of the add_[audio|video]_stream methods of torchaudio.io.StreamReader and torchaudio.io.StreamWriter are now keyword-only arguments.
- Drop the support of FFmpeg < 4.4 (#3561, #3557)
Previously TorchAudio supported FFmpeg 4 (>=4.1, <=4.4). In this release, TorchAudio supports FFmpeg 4, 5 and 6 (>=4.4, <7). With this change, support for FFmpeg 4.1, 4.2 and 4.3 is dropped.
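As a quick way to reason about the supported range stated above (>=4.4, <7), the following sketch checks a version string against it. The helper names are hypothetical, and it assumes a plain dotted numeric version such as "4.4.2"; distribution-specific prefixes or suffixes would need extra parsing.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as "4.4.2" into (4, 4, 2)."""
    return tuple(int(part) for part in text.split("."))


def is_supported_ffmpeg(version: str) -> bool:
    """Check the range stated in the release note: >=4.4 and <7."""
    parsed = parse_version(version)
    # Compare on (major, minor) for the lower bound, major for the upper bound.
    return (4, 4) <= parsed[:2] and parsed[0] < 7


for candidate in ("4.3.1", "4.4.2", "5.1", "6.0", "7.0"):
    print(candidate, is_supported_ffmpeg(candidate))
```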
Ops
- Use named file in torchaudio.functional.apply_codec (#3397)
In previous versions, TorchAudio shipped a custom-built libsox so that it could perform in-memory decoding and encoding.
Now, in-memory decoding and encoding are handled by the FFmpeg binding, and with the switch to dynamic libsox linking, torchaudio.functional.apply_codec no longer processes audio in memory. Instead, it writes to a temporary file.
For in-memory processing, please use torchaudio.io.AudioEffector.
- Switch to lstsq when solving InverseMelScale (#3280)
Previously, torchaudio.transforms.InverseMelScale ran SGD optimizer to find the inverse of mel-scale transfo...
v2.0.2
TorchAudio 2.0.2 Release Note
This is a minor release, which is compatible with PyTorch 2.0.1 and includes bug fixes, improvements and documentation updates. There is no new feature added.
Bug fix
- #3239 Properly set #samples passed to encoder (#3204)
- #3238 Fix virtual function issue with CTC decoder (#3230)
- #3245 Fix path-like object support in FFmpeg dispatcher (#3243, #3248)
- #3261 Use scaled_dot_product_attention in Wav2vec2/HuBERT's SelfAttention (#3253)
- #3264 Use scaled_dot_product_attention in WavLM attention (#3252, #3265)
Full Changelog: v2.0.1...v2.0.2
Torchaudio 2.0 Release Note
Highlights
TorchAudio 2.0 release includes:
- Data augmentation operators, e.g. convolution, additive noise, speed perturbation
- WavLM and XLS-R models and pre-trained pipelines
- Backend dispatcher powering revised info, load, save functions
- Dropped support of Python 3.7
- Added Python 3.11 support
[Beta] Data augmentation operators
The release adds several data augmentation operators under torchaudio.functional and torchaudio.transforms:
torchaudio.functional.add_noise
torchaudio.functional.convolve
torchaudio.functional.deemphasis
torchaudio.functional.fftconvolve
torchaudio.functional.preemphasis
torchaudio.functional.speed
torchaudio.transforms.AddNoise
torchaudio.transforms.Convolve
torchaudio.transforms.Deemphasis
torchaudio.transforms.FFTConvolve
torchaudio.transforms.Preemphasis
torchaudio.transforms.Speed
torchaudio.transforms.SpeedPerturbation
The operators can be used to synthetically diversify training data to improve the generalizability of downstream models.
For usage details, please refer to the documentation for torchaudio.functional and torchaudio.transforms, and the tutorial “Audio Data Augmentation”.
[Beta] WavLM and XLS-R models and pre-trained pipelines
The release adds two self-supervised learning models for speech and audio.
Besides the model architectures, torchaudio also supports corresponding pre-trained pipelines:
torchaudio.pipelines.WAVLM_BASE
torchaudio.pipelines.WAVLM_BASE_PLUS
torchaudio.pipelines.WAVLM_LARGE
torchaudio.pipelines.WAV2VEC_XLSR_300M
torchaudio.pipelines.WAV2VEC_XLSR_1B
torchaudio.pipelines.WAV2VEC_XLSR_2B
For usage details, please refer to the factory function and pre-trained pipelines documentation.
Backend dispatcher
Release 2.0 introduces new versions of the I/O functions torchaudio.info, torchaudio.load and torchaudio.save, backed by a dispatcher that allows for selecting one of the backends (FFmpeg, SoX, and SoundFile) to use, subject to library availability. Users can enable the new logic in Release 2.0 by setting the environment variable TORCHAUDIO_USE_BACKEND_DISPATCHER=1; the new logic will be enabled by default in Release 2.1.
import torchaudio

# Fetch metadata using FFmpeg
metadata = torchaudio.info("test.wav", backend="ffmpeg")
# Load audio (with no backend parameter value provided, function prioritizes using FFmpeg if it is available)
waveform, rate = torchaudio.load("test.wav")
# Write audio using SoX
torchaudio.save("out.wav", waveform, rate, backend="sox")
Please see the documentation for torchaudio for more details.
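The opt-in can also be done from Python rather than from the shell. The sketch below assumes that the variable is read when torchaudio is imported, so it is set first; the guarded import and the HAVE_TORCHAUDIO flag are only there so the snippet also runs where torchaudio is not installed.

```python
import os

# Opt in to the dispatcher. Assumption: the variable is read at import time,
# so it has to be set before torchaudio is imported.
os.environ["TORCHAUDIO_USE_BACKEND_DISPATCHER"] = "1"

try:
    import torchaudio  # noqa: F401
    HAVE_TORCHAUDIO = True
except Exception:
    # torchaudio may be missing or fail to load in this environment.
    HAVE_TORCHAUDIO = False

print("dispatcher requested:", os.environ["TORCHAUDIO_USE_BACKEND_DISPATCHER"])
```

Setting the variable in the shell before starting Python has the same effect and avoids any dependence on import order.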
Backward-incompatible changes
- Dropped Python 3.7 support (#3020)
Following the upstream PyTorch (pytorch/pytorch#93155), the support for Python 3.7 has been dropped.
- Default to "precise" seek in torchaudio.io.StreamReader.seek (#2737, #2841, #2915, #2916, #2970)
Previously, the StreamReader.seek method moved to the key frame closest to the given timestamp. A new option, mode, has been added, which can switch the behavior to seeking to any type of frame, including non-key frames, that is closest to the given timestamp. This behavior is now the default.
- Removed deprecated/unused/undocumented functions from datasets.utils (#2926, #2927)
The following functions are removed from datasets.utils:
  - stream_url
  - download_url
  - validate_file
  - extract_archive
Deprecations
Ops
- Deprecated 'onesided' init param for MelSpectrogram (#2797, #2799)
torchaudio.transforms.MelSpectrogram assumes the onesided argument to be always True. The forward path fails if its value is False. Therefore this argument is deprecated. Users specifying this argument should stop specifying it.
- Deprecated "sinc_interpolation" and "kaiser_window" option value in favor of "sinc_interp_hann" and "sinc_interp_kaiser" (#2922)
The valid values of the resampling_method argument of resampling operations (torchaudio.transforms.Resample and torchaudio.functional.resample) are changed. "kaiser_window" is now "sinc_interp_kaiser" and "sinc_interpolation" is "sinc_interp_hann". The old values will continue to work, but users are encouraged to update their code.
For the reason behind this change, please refer to #2891.
- Deprecated sox initialization/shutdown public API functions (#3010)
torchaudio.sox_effects.init_sox_effects and torchaudio.sox_effects.shutdown_sox_effects are deprecated. They were required to use libsox-related features, but are called automatically since v0.6, and the initialization/shutdown mechanism has been moved elsewhere. These functions are now no-ops. Users can simply remove the calls to these functions.
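For code bases that store the resampling method in configuration files, a small translation helper can ease the rename described above. This is a sketch only: the helper name and the warning text are hypothetical, and only the two old/new value pairs come from the release note.

```python
import warnings

# Old option values and their replacements, as listed in the deprecation note.
_RENAMED_METHODS = {
    "sinc_interpolation": "sinc_interp_hann",
    "kaiser_window": "sinc_interp_kaiser",
}


def normalize_resampling_method(name: str) -> str:
    """Return the new-style name, warning when a deprecated one is given."""
    if name in _RENAMED_METHODS:
        new_name = _RENAMED_METHODS[name]
        warnings.warn(
            f'"{name}" is deprecated; use "{new_name}" instead.',
            DeprecationWarning,
            stacklevel=2,
        )
        return new_name
    # Already a new-style (or unknown) value: pass it through unchanged.
    return name
```

The returned value can then be passed as the resampling_method argument of torchaudio.transforms.Resample or torchaudio.functional.resample.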
Models
- Deprecated static binding of Flashlight-text based CTC decoder (#3055, #3089)
Since v0.12, TorchAudio binary distributions included the CTC decoder based on flashlight-text project. In a future release, TorchAudio will switch to dynamic binding of underlying CTC decoder implementation, and stop shipping the core CTC decoder implementations. Users who would like to use the CTC decoder need to separately install the CTC decoder from the upstream flashlight-text project. Other functionalities of TorchAudio will continue to work without flashlight-text.
Note: The API and numerical behavior do not change.
For more details, please refer to #3088.
I/O
- Deprecated file-like object support in sox_io (#3033)
In preparation for the switch to dynamically bound libsox, file-like object support in the sox_io backend has been deprecated. It will be removed in the 2.1 release in favor of the dispatcher. This deprecation affects the following functionalities.
  - I/O: torchaudio.load, torchaudio.info and torchaudio.save.
  - Effects: torchaudio.sox_effects.apply_effects_file and torchaudio.functional.apply_codec.
For I/O, to continue using file-like objects, please use the new dispatcher mechanism.
For effects, replacement functions will be added in the next release.
- Deprecated the use of Tensor as a container for byte string in StreamReader (#3086)
torchaudio.io.StreamReader supports decoding media from byte strings contained in 1D tensors of torch.uint8 type. Using the torch.Tensor type as a container for byte strings is now deprecated. To pass byte strings, please wrap them with io.BytesIO.
Deprecated:
data = b"..."
src = torch.frombuffer(data, dtype=torch.uint8)
StreamReader(src)
Migration:
data = b"..."
src = io.BytesIO(data)
StreamReader(src)
Bug Fixes
Ops
- Fixed contiguous error when backpropagating through torchaudio.functional.lfilter (#3080)
Pipelines
- Added layer normalization to wav2vec2 large+ pretrained models (#2873)
In self-supervised learning models such as Wav2Vec 2.0, HuBERT, or WavLM, layer normalization should be applied to waveforms if the convolutional feature extraction module uses layer normalization and is trained on a large-scale dataset. After adding layer normalization to those affected models, the Word Error Rate is significantly reduced.
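As a rough illustration of what applying layer normalization to a waveform means here, the sketch below rescales one utterance to zero mean and unit variance without learnable parameters. It is written in plain Python for clarity; the function name and the eps value are assumptions, and it presumes a non-empty list of samples.

```python
import math
from typing import List


def normalize_waveform(samples: List[float], eps: float = 1e-5) -> List[float]:
    """Scale one utterance to zero mean and (approximately) unit variance."""
    n = len(samples)
    mean = sum(samples) / n
    # Biased variance, as used by layer normalization.
    var = sum((x - mean) ** 2 for x in samples) / n
    scale = 1.0 / math.sqrt(var + eps)
    return [(x - mean) * scale for x in samples]


normalized = normalize_waveform([0.5, -0.25, 0.75, 0.0])
```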
Without the change in #2873, the WER results are:
Model | dev-clean | dev-other | test-clean | test-other |
---|---|---|---|---|
WAV2VEC2_ASR_LARGE_LV60K_10M | 10.59 | 15.62 | 9.58 | 16.33 |
WAV2VEC2_ASR_LARGE_LV60K_100H | 2.80 | 6.01 | 2.82 | 6.34 |
WAV2VEC2_ASR_LARGE_LV60K_960H | 2.36 | 4.43 | 2.41 | 4.96 |
HUBERT_ASR_LARGE | 1.85 | 3.46 | 2.09 | 3.89 |
HUBERT_ASR_XLARGE | 2.21 | 3.40 | 2.26 | 4.05 |
After applying layer normalization, the updated WER results are:
| Model | dev-clean | dev-other | test-clean | test-other |
|:---------------------------------------------------------------------------------...
TorchAudio 0.13.1 Release Note
This is a minor release, which is compatible with PyTorch 1.13.1 and includes bug fixes, improvements and documentation updates. There is no new feature added.
Bug Fix
IO
- Make buffer size configurable in ffmpeg file object operations and set size in backend (#2810)
- Fix issue with the missing video frame in StreamWriter (#2789)
- Fix decimal FPS handling StreamWriter (#2831)
- Fix wrong frame allocation in StreamWriter (#2905)
- Fix duplicated memory allocation in StreamWriter (#2906)
Model
Recipe
torchaudio 0.13.0 Release Note
Highlights
TorchAudio 0.13.0 release includes:
- Source separation models and pre-trained bundles (Hybrid Demucs, ConvTasNet)
- New datasets and metadata mode for the SUPERB benchmark
- Custom language model support for CTC beam search decoding
- StreamWriter for audio and video encoding
[Beta] Source Separation Models and Bundles
Hybrid Demucs is a music source separation model that uses both spectrogram and time domain features. It has demonstrated state-of-the-art performance in the Sony Music DeMixing Challenge. (citation: https://arxiv.org/abs/2111.03600)
The TorchAudio v0.13 release includes the following features
- MUSDB_HQ Dataset, which is used in Hybrid Demucs training (docs)
- Hybrid Demucs model architecture (docs)
- Three factory functions suitable for different sample rate ranges
- Pre-trained pipelines (docs) and tutorial
SDR Results of pre-trained pipelines on MUSDB-HQ test set
Pipeline | All | Drums | Bass | Other | Vocals |
---|---|---|---|---|---|
HDEMUCS_HIGH_MUSDB* | 6.42 | 7.76 | 6.51 | 4.47 | 6.93 |
HDEMUCS_HIGH_MUSDB_PLUS** | 9.37 | 11.38 | 10.53 | 7.24 | 8.32 |
* Trained on the training data of MUSDB-HQ dataset.
** Trained on both training and test sets of MUSDB-HQ and 150 extra songs from an internal database that were specifically produced for Meta.
Special thanks to @adefossez for the guidance.
ConvTasNet model architecture was added in TorchAudio 0.7.0. It is the first source separation model that outperforms the oracle ideal ratio mask. In this release, TorchAudio adds the pre-trained pipeline that is trained within TorchAudio on the Libri2Mix dataset. The pipeline achieves 15.6dB SDR improvement and 15.3dB Si-SNR improvement on the Libri2Mix test set.
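For readers unfamiliar with the Si-SNR figure quoted above, the following sketch computes the scale-invariant signal-to-noise ratio of an estimate against a reference signal. It uses plain Python lists and a small eps for numerical safety, both of which are choices made for this illustration; the reported "improvement" is the Si-SNR of the separated output minus that of the unprocessed mixture, measured against the same reference.

```python
import math
from typing import Sequence


def si_snr(estimate: Sequence[float], reference: Sequence[float], eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-noise ratio in dB for two equal-length signals."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)
    # Remove the mean so that a constant offset does not affect the score.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Project the estimate onto the reference to get the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    target = [dot / ref_energy * r for r in ref]
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise) + eps
    return 10.0 * math.log10(target_energy / noise_energy + eps)


reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 0.1, -1.0, -0.1]
score = si_snr(estimate, reference)
```

Because the estimate is projected onto the reference, rescaling the estimate by a constant factor leaves the score essentially unchanged.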
[Beta] Datasets and Metadata Mode for SUPERB Benchmarks
With the addition of four new audio-related datasets, there is now support for all downstream tasks in version 1 of the SUPERB benchmark. Furthermore, these datasets support metadata mode through a get_metadata function, which enables faster dataset iteration or preprocessing without the need to load or store waveforms.
Datasets with metadata functionality:
- LIBRISPEECH (docs)
- LibriMix (docs)
- QUESST14 (docs)
- SPEECHCOMMANDS (docs)
- (new) FluentSpeechCommands (docs)
- (new) Snips (docs)
- (new) IEMOCAP (docs)
- (new) VoxCeleb1 (Identification, Verification)
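The benefit of metadata mode can be sketched with a toy dataset. ToyDataset and its fields are hypothetical; the only idea taken from the text above is that get_metadata returns the sample's fields with the file path in place of the decoded waveform, so labels can be scanned without decoding any audio.

```python
from typing import List, Tuple


class ToyDataset:
    """Hypothetical dataset; only the get_metadata idea mirrors the text above."""

    def __init__(self, entries: List[Tuple[str, int, str]]):
        # Each entry: (relative path, sample rate, label)
        self._entries = entries
        self.loads = 0  # counts how often audio would be decoded

    def _load_waveform(self, path: str) -> List[float]:
        # Placeholder for real audio decoding, which is the expensive step.
        self.loads += 1
        return [0.0] * 16

    def __len__(self) -> int:
        return len(self._entries)

    def __getitem__(self, n: int):
        path, sample_rate, label = self._entries[n]
        return self._load_waveform(path), sample_rate, label

    def get_metadata(self, n: int) -> Tuple[str, int, str]:
        # Same fields as __getitem__, but the path replaces the waveform.
        return self._entries[n]


dataset = ToyDataset([("a.wav", 16000, "yes"), ("b.wav", 16000, "no")])
labels = [dataset.get_metadata(i)[2] for i in range(len(dataset))]
```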
[Beta] Custom Language Model support in CTC Beam Search Decoding
In release 0.12, TorchAudio released a CTC beam search decoder with KenLM language model support. This release adds functionality for creating custom Python language models that are compatible with the decoder, using the torchaudio.models.decoder.CTCDecoderLM wrapper.
[Beta] StreamWriter
torchaudio.io.StreamWriter is a class for encoding media including audio and video. It can handle a wide variety of codecs, chunk-by-chunk encoding and GPU encoding.
Backward-incompatible changes
- [BC-breaking] Fix momentum in transforms.GriffinLim (#2568)
The GriffinLim implementations in transforms and functional used the momentum parameter differently, resulting in inconsistent results between the two implementations. The transforms.GriffinLim usage of momentum is updated to resolve this discrepancy.
- Make torchaudio.info decode audio to compute num_frames if it is not found in metadata (#2740).
In such cases, torchaudio.info may now return non-zero values for num_frames.
Bug Fixes
- Fix random Gaussian generation (#2639)
torchaudio.compliance.kaldi.fbank with dither option produced a different output from Kaldi because it used a skewed, rather than Gaussian, distribution for dither. This is updated in this release to correctly use a random Gaussian instead.
- Update download link for speech commands (#2777)
The previous download link for SpeechCommands v2 did not include data for the valid and test sets, resulting in errors when trying to use those subsets. The download link is updated to correctly download the whole dataset.
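To make the dither fix above more concrete, here is a minimal sketch of Gaussian dithering: each sample receives zero-mean Gaussian noise scaled by the dither amount. The add_dither helper and its rng argument are hypothetical and written with the standard library; this is not the code in torchaudio.compliance.kaldi.

```python
import random
from typing import List


def add_dither(samples: List[float], dither: float, rng: random.Random) -> List[float]:
    """Add zero-mean Gaussian noise with standard deviation `dither`."""
    if dither == 0.0:
        # No dithering requested: return an unchanged copy.
        return list(samples)
    return [x + dither * rng.gauss(0.0, 1.0) for x in samples]


rng = random.Random(0)
noisy = add_dither([0.0] * 10000, dither=1.0, rng=rng)
# For symmetric Gaussian noise the sample mean should be close to zero.
mean = sum(noisy) / len(noisy)
```

A skewed noise distribution would shift this mean away from zero, which is the kind of mismatch with Kaldi that the fix addresses.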
New Features
IO
- Add metadata to source stream info (#2461, #2464)
- Add utility function to fetch FFmpeg library versions (#2467)
- Add YUV444P support to StreamReader (#2516)
- Add StreamWriter (#2628, #2648, #2505)
- Support in-memory decoding via Tensor wrapper in StreamReader (#2694)
- Add StreamReader Tensor Binding to src (#2699)
- Add StreamWriter media device/streaming tutorial (#2708)
- Add StreamWriter tutorial (#2698)
Ops
- Add ITU-R BS.1770-4 loudness recommendation (#2472)
- Add convolution operator (#2602)
- Add additive noise function (#2608)
Models
- Hybrid Demucs model implementation (#2506)
- Docstring change for Hybrid Demucs (#2542, #2570)
- Add NNLM support to CTC Decoder (#2528, #2658)
- Move hybrid demucs model out of prototype (#2668)
- Move conv_tasnet_base doc out of prototype (#2675)
- Add custom lm example to decoder tutorial (#2762)
Pipelines
- Add SourceSeparationBundle to prototype (#2440, #2559)
- Adding pipeline changes, factory functions to HDemucs (#2547, #2565)
- Create tutorial for HDemucs (#2572)
- Add HDEMUCS_HIGH_MUSDB (#2601)
- Move SourceSeparationBundle and pre-trained ConvTasNet pipeline into Beta (#2669)
- Move Hybrid Demucs pipeline to beta (#2673)
- Update description of HDemucs pipelines
Datasets
- Add fluent speech commands (#2480, #2510)
- Add musdb dataset and tests (#2484)
- Add VoxCeleb1 dataset (#2349)
- Add metadata function for LibriSpeech (#2653)
- Add Speech Commands metadata function (#2687)
- Add metadata mode for various datasets (#2697)
- Add IEMOCAP dataset (#2732)
- Add Snips Dataset (#2738)
- Add metadata for Librimix (#2751)
- Add file name to returned item in Snips dataset (#2775)
- Update IEMOCAP variants and labels (#2778)
Improvements
IO
- Replace runtime_error exception with TORCH_CHECK (#2550, #2551, #2592)
- Refactor StreamReader (#2507, #2508, #2512, #2530, #2531, #2533, #2534)
- Refactor sox C++ (#2636, #2663)
- Delay the import of kaldi_io (#2573)
Ops
- Speed up resample with kernel generation modification (#2553, #2561)
The kernel generation for resampling is optimized in this release. The following table illustrates the performance improvements from the previous release for the torchaudio.functional.resample function using the sinc resampling method, on a float32 tensor with two channels and one second duration.
CPU
torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
---|---|---|---|---|
0.13 | 0.256 | 0.549 | 0.769 | 0.820 |
0.12 | 0.386 | 0.534 | 31.8 | 12.1 |
CUDA
torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
---|---|---|---|---|
0.13 | 0.332 | 0.336 | 0.345 | 0.381 |
0.12 | 0.524 | 0.334 | 64.4 | 22.8 |
- Add normalization parameter on spectrogram and inverse spectrogram (#2554)
- Replace assert with raise for ops (#2579, #2599)
- Replace CHECK_ by TORCH_CHECK_ (#2582)
- Fix argument validation in TorchAudio filtering (#2609)
Models
- Switch to flashlight decoder from upstream (#2557)
- Add dimension and shape check (#2563)
- Replace assert with raise in models (#2578, #2590)
- Migrate CTC decoder code (#2580)
- Enable CTC decoder in Windows (#2587)
Datasets
- Replace assert with raise in datasets (#2571)
- Add unit test for LibriMix dataset (#2659)
- Add gtzan download note (#2763)
Tutorials
- Tweak tutorials (#2630, #2733)
- Update ASR inference tutorial (#2631)
- Update and fix tutorials (#2661, #2701)
- Introduce IO section to getting started tutorials (#2703)
- Update HW video processing tutorial (#2739)
- Update tutorial author information (#2764)
- Fix typos in tacotron2 tutorial (#2761)
- Fix fading in hybrid demucs tutorial (#2771)
- Fix leaking matplotlib figure (#2769)
- Update resampling tutorial (#2773)
Recipes
- Use lazy import for joblib (#2498)
- Revise LibriSpeech Conformer RNN-T recipe (#2535)
- Fix bug in Conformer RNN-T recipe (#2611)
- Replace bg_iterator in examples (#2645)
- Remove obsolete examples (#2655)
- Fix LibriSpeech Conformer RNN-T eval script (#2666)
- Replace IValue::toString()->string() with IValue::toStringRef() (#2700)
- Improve wav2vec2/hubert model for pre-training (#2716)
- Improve hubert recipe for pre-training and fine-tuning (#2744)
WER improvement on LibriSpe...
torchaudio 0.12.1 Release Note
This is a minor release, which is compatible with PyTorch 1.12.1 and includes small bug fixes, improvements and documentation updates. There is no new feature added.
Bug Fix
Improvement
- #2552 Remove unused boost source code
- #2527 Improve speech enhancement tutorial
- #2544 Update forced alignment tutorial
- #2595 Update data augmentation tutorial
For the full feature of v0.12, please refer to the v0.12.0 release note.
v0.12.0
TorchAudio 0.12.0 Release Notes
Highlights
TorchAudio 0.12.0 includes the following:
- CTC beam search decoder
- New beamforming modules and methods
- Streaming API
[Beta] CTC beam search decoder
To support inference-time decoding, the release adds the wav2letter CTC beam search decoder, ported over from Flashlight (GitHub). Both lexicon and lexicon-free decoding are supported, and decoding can be done without a language model or with a KenLM n-gram language model. Compatible token, lexicon, and certain pretrained KenLM files for the LibriSpeech dataset are also available for download.
For usage details, please check out the documentation and ASR inference tutorial.
[Beta] New beamforming modules and methods
To improve flexibility in usage, the release adds two new beamforming modules under torchaudio.transforms: SoudenMVDR and RTFMVDR. They differ from MVDR mainly in that they:
- Use power spectral density (PSD) and relative transfer function (RTF) matrices as inputs instead of time-frequency masks. The module can be integrated with neural networks that directly predict complex-valued STFT coefficients of speech and noise.
- Add reference_channel as an input argument in the forward method to allow users to select the reference channel in model training or dynamically change the reference channel in inference.
Besides the two modules, the release adds new function-level beamforming methods under torchaudio.functional.
For usage details, please check out the documentation at torchaudio.transforms and torchaudio.functional and the Speech Enhancement with MVDR Beamforming tutorial.
[Beta] Streaming API
StreamReader is TorchAudio’s new I/O API. It is backed by FFmpeg† and allows users to
- Decode various audio and video formats, including MP4 and AAC.
- Handle various input forms, such as local files, network protocols, microphones, webcams, screen captures and file-like objects.
- Iterate over and decode media chunk-by-chunk, while changing the sample rate or frame rate.
- Apply various audio and video filters, such as low-pass filter and image scaling.
- Decode video with Nvidia's hardware-based decoder (NVDEC).
For usage details, please check out the documentation and tutorials:
- Media Stream API - Pt.1
- Media Stream API - Pt.2
- Online ASR with Emformer RNN-T
- Device ASR with Emformer RNN-T
- Accelerated Video Decoding with NVDEC
† To use StreamReader, FFmpeg libraries are required. Please install FFmpeg. The coverage of codecs depends on how these libraries are configured. TorchAudio official binaries are compiled to work with FFmpeg 4 libraries; FFmpeg 5 can be used if TorchAudio is built from source.
Backwards-incompatible changes
I/O
- MP3 decoding is now handled by FFmpeg in sox_io backend. (#2419, #2428)
  - FFmpeg is now used as fallback in sox_io backend, and now MP3 decoding is handled by FFmpeg. To load MP3 audio with torchaudio.load, please install a compatible version of FFmpeg (Version 4 when using an official binary distribution).
  - Note that, whereas the previous MP3 decoding scheme pads the output audio, the new scheme does not. As a consequence, the new version returns shorter audio tensors.
  - torchaudio.info now returns num_frames=0 for MP3.
Models
- Change underlying implementation of RNN-T hypothesis to tuple (#2339)
  - In release 0.11, Hypothesis subclassed namedtuple. Containers of namedtuple instances, however, are incompatible with the PyTorch Lite Interpreter. To achieve compatibility, Hypothesis has been modified in release 0.12 to instead alias tuple. This affects RNNTBeamSearch as it accepts and returns a list of Hypothesis instances.
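The practical effect of this change is that fields are read by position instead of by attribute. The sketch below shows the pattern with a simplified, hypothetical two-field hypothesis (tokens and score); the real type carries more fields, and the accessor names here are illustrative only.

```python
from typing import List, NamedTuple, Tuple


# 0.11-style record (simplified, hypothetical fields): attribute access.
class HypothesisOld(NamedTuple):
    tokens: List[int]
    score: float


# 0.12-style alias (simplified): a plain tuple, read by position.
Hypothesis = Tuple[List[int], float]


def get_tokens(hypo: Hypothesis) -> List[int]:
    return hypo[0]


def get_score(hypo: Hypothesis) -> float:
    return hypo[1]


old = HypothesisOld(tokens=[4, 8, 15], score=-1.5)
new: Hypothesis = ([4, 8, 15], -1.5)
# Pick the best hypothesis from a list using the positional accessor.
best = max([new, ([4, 8], -3.0)], key=get_score)
```

Wrapping positional access in small accessor functions keeps calling code readable and confines any future layout change to one place.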
Bug Fixes
Ops
- Fix return dtype in MVDR module (#2376)
  - In release 0.11, the MVDR module converts the dtype of input spectrum to complex128 to improve the precision and robustness of downstream matrix computations. The output dtype, however, is not correctly converted back to the original dtype. In release 0.12, we fix the output dtype to be consistent with the original input dtype.
Build
- Fix Kaldi submodule integration (#2269)
- Pin jinja2 version for build_docs (#2292)
- Use sourceforge url to fetch zlib (#2297)
New Features
I/O
- Add Streaming API (#2041, #2042, #2043, #2044, #2045, #2046, #2047, #2111, #2113, #2114, #2115, #2135, #2164, #2168, #2202, #2204, #2263, #2264, #2312, #2373, #2378, #2402, #2403, #2427, #2429)
- Add YUV420P format support to Streaming API (#2334)
- Support specifying decoder and its options (#2327)
- Add NV12 format support in Streaming API (#2330)
- Add HW acceleration support on Streaming API (#2331)
- Add file-like object support to Streaming API (#2400)
- Make FFmpeg log level configurable (#2439)
- Set the default ffmpeg log level to FATAL (#2447)
Ops
- New beamforming methods (#2227, #2228, #2229, #2230, #2231, #2232, #2369, #2401)
- New MVDR modules (#2367, #2368)
- Add and refactor CTC lexicon beam search decoder (#2075, #2079, #2089, #2112, #2117, #2136, #2174, #2184, #2185, #2273, #2289)
- Add lexicon free CTC decoder (#2342)
- Add Pretrained LM Support for Decoder (#2275)
- Move CTC beam search decoder to beta (#2410)
Datasets
Improvements
I/O
Ops
- Raise error for resampling int waveform (#2318)
- Move multi-channel modules to a separate file (#2382)
- Refactor MVDR module (#2383)
Models
- Add an option to use Tanh instead of ReLU in RNNT joiner (#2319)
- Support GroupNorm and re-ordering Convolution/MHA in Conformer (#2320)
- Add extra arguments to hubert pretrain factory functions (#2345)
- Add feature_grad_mult argument to HuBERTPretrainModel (#2335)
Datasets
Performance
- Make PitchShift faster by caching resampling kernel (#2441)
The following table illustrates the performance improvement over the previous release by comparing the time in milliseconds it takes `torchaudio.transforms.PitchShift`, after its first call, to perform the operation on a `float32` Tensor with two channels and 8000 frames, resampled to 44.1 kHz, across various shifted steps.
TorchAudio Version | 2 | 3 | 4 | 5 |
---|---|---|---|---|
0.12 | 2.76 | 5 | 1860 | 223 |
0.11 | 6.71 | 161 | 8680 | 1450 |
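
To read the table as speedups, the 0.11 timings can be divided by the 0.12 timings. The following is a minimal sketch that uses only the numbers in the table, assuming the column headers (2 to 5) are the shifted steps. The variable names are illustrative and not part of torchaudio.

```python
# Timings in milliseconds copied from the table above.
# Assumption: the column headers (2, 3, 4, 5) are the shifted steps.
times_ms = {
    "0.11": {2: 6.71, 3: 161, 4: 8680, 5: 1450},
    "0.12": {2: 2.76, 3: 5, 4: 1860, 5: 223},
}

# Speedup of 0.12 over 0.11 for each step: old time divided by new time.
speedups = {
    step: times_ms["0.11"][step] / times_ms["0.12"][step]
    for step in times_ms["0.11"]
}

for step, ratio in speedups.items():
    print(f"steps={step}: {ratio:.1f}x faster")
```

With these numbers the ratio ranges from roughly 2.4x (2 steps) to roughly 32x (3 steps).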
Tests
- Add complex dtype support in functional autograd test (#2244)
- Refactor torchscript consistency test in functional (#2246)
- Add unit tests for PyTorch Lightning modules of emformer_rnnt recipes (#2240)
- Refactor batch consistency test in functional (#2245)
- Run smoke tests on regular PRs (#2364)
- Refactor smoke test executions (#2365)
- Move seed to setup (#2425)
- Remove possible manual seeds from test files (#2436)
Build
- Revise the parameterization of third party libraries (#2282)
- Use zlib v1.2.12 with GitHub source (#2300)
- Fix ffmpeg integration for ffmpeg 5.0 (#2326)
- Use custom FFmpeg libraries for torchaudio binary distributions (#2355)
- Adding m1 builds to torchaudio (#2421)
Other
- Add download utility specialized for torchaudio (#2283)
- Use module-level `__getattr__` to implement delayed initialization (#2377)
- Update build_doc job to use Conda CUDA package (#2395)
- Update I/O initialization (#2417)
- Add Python 3.10 (build and test) (#2224)
- Retrieve version from version.txt (#2434)
- Disable OpenMP on mac (#2431)
Examples
Ops
- Add CTC decoder example for librispeech (#2130, #2161)
- Fix LM, arguments in CTC decoding script (#2235, #2315)
- Use pretrained LM API for decoder example (#2317)
Pipelines
- Refactor pipeline_demo.py to support variant EMFORMER_RNNT bundles (#2203)
- Refactor eval and pipeline_demo scripts in emformer_rnnt (#2238)
- Refactor pipeline_demo script in emformer_rnnt recipes (#2239)
- Add EMFORMER_RNNT_BASE_MUSTC into pipeline demo script (#2248)
Tests
- Add unit tests for Emformer RNN-T LibriSpeech recipe (#2216)
- Add fixed random seed for Emformer RNN-T recipe test (#2220)
Training recipes
v0.11.0
torchaudio 0.11.0 Release Note
Highlights
TorchAudio 0.11.0 release includes:
- Emformer (paper) RNN-T components, training recipe, and pre-trained pipeline for streaming ASR
- Voxpopuli pre-trained pipelines
- HuBERTPretrainModel for training HuBERT from scratch
- Conformer model for speech recognition
- Drop Python 3.6 support
[Beta] Emformer RNN-T
To support streaming ASR use cases, the release adds implementations of Emformer (docs), an RNN-T model that uses Emformer (emformer_rnnt_base), and an RNN-T beam search decoder (RNNTBeamSearch). It also includes a pipeline bundle (EMFORMER_RNNT_BASE_LIBRISPEECH) that wraps pre- and post-processing components, the beam search decoder, and the RNN-T Emformer model with weights pre-trained on LibriSpeech, which in whole allow for performing streaming ASR inference out of the box. For reference and reproducibility, the release provides the training recipe used to produce the pre-trained weights in the examples directory.
[Beta] HuBERT Pretrain Model
The masked prediction training of the HuBERT model requires the masked logits, unmasked logits, and feature norm as outputs. The logits are used for the cross-entropy losses and the feature norm is used for the penalty loss. The release adds HuBERTPretrainModel and corresponding factory functions (hubert_pretrain_base, hubert_pretrain_large, and hubert_pretrain_xlarge) to enable training from scratch.
[Beta] Conformer (paper)
The release adds an implementation of Conformer (docs), a convolution-augmented transformer architecture that has achieved state-of-the-art results on speech recognition benchmarks.
Backward-incompatible changes
Ops
- Removed deprecated `F.magphase`, `F.angle`, `F.complex_norm`, and `T.ComplexNorm`. (#1934, #1935, #1942)
  - Utility functions for pseudo complex types were deprecated in 0.10, and now they are removed in 0.11. For the detail of this migration plan, please refer to #1337.
- Dropped pseudo complex support from `F.spectrogram`, `T.Spectrogram`, `F.phase_vocoder`, and `T.TimeStretch` (#1957, #1958)
  - The support for the pseudo complex type was deprecated in 0.10, and now it is removed in 0.11. For the detail of this migration plan, please refer to #1337.
- Removed deprecated `create_fb_matrix` (#1998)
  - `create_fb_matrix` was replaced by `melscale_fbanks` in release 0.10. It is removed in 0.11. Please use `melscale_fbanks`.
Datasets
- Removed deprecated VCTK (#1825)
  - The original VCTK archive file is no longer accessible. Please migrate to the `VCTK_092` class for the latest version of the dataset.
- Removed deprecated dataset utils (#1826)
  - Undocumented methods `diskcache_iterator` and `bg_iterator` were deprecated in 0.10. They are removed in 0.11. Please stop using them.
Models
- Removed unused dimension from pretrained Wav2Vec2 ASR (#1914)
  - The final linear layer of Wav2Vec2 ASR models included dimensions (`<s>`, `<pad>`, `</s>`, `<unk>`) that were not related to ASR tasks and not used. These dimensions were removed.
Build
- Dropped support for Python 3.6 (#2119, #2139)
  - Following the lifecycle of Python 3.6, torchaudio dropped the support for Python 3.6.
New Features
RNN-T Emformer
- Introduced Emformer (#1801)
- Added Emformer RNN-T model (#2003)
- Added RNN-T beam search decoder (#2028)
- Cleaned up Emformer module (#2091)
- Added pretrained Emformer RNN-T streaming ASR inference pipeline (#2093)
- Reorganized RNN-T components in prototype module (#2110)
- Added integration test for Emformer RNN-T LibriSpeech pipeline (#2172)
- Registered RNN-T pipeline global stats constants as buffers (#2175)
- Refactored RNN-T factory function to support num_symbols argument (#2178)
- Fixed output shape description in RNN-T docstrings (#2179)
- Removed invalid token blanking logic from RNN-T decoder (#2180)
- Updated stale prototype references (#2189)
- Revised RNN-T pipeline streaming decoding logic (#2192)
- Cleaned up Emformer (#2207)
- Applied minor fixes to Emformer implementation (#2252)
Conformer
- Introduced Conformer (#2068)
- Removed subsampling and positional embedding logic from Conformer (#2171)
- Moved ASR features out of prototype (#2187)
- Passed bias and dropout args to Conformer convolution block (#2215)
- Adjusted Conformer args (#2223)
Datasets
- Added DR-VCTK dataset (#1819)
Models
- Added HuBERT pretrain model to enable training from scratch (#2064)
- Added feature mean square value to HuBERT Pretrain model output (#2128)
Pipelines
- Added wav2vec2 ASR French pretrained from voxpopuli (#1919)
- Added wav2vec2 ASR Spanish pretrained model from voxpopuli (#1924)
- Added wav2vec2 ASR German pretrained model from voxpopuli (#1953)
- Added wav2vec2 ASR Italian pretrained model from voxpopuli (#1954)
- Added wav2vec2 ASR English pretrained model from voxpopuli (#1956)
Build
- Added CUDA-11.5 builds to torchaudio (#2067)
Improvements
I/O
- Fixed load behavior for 24-bit input (#2084)
Ops
- Added OpenMP support (#1761)
- Improved MVDR stability (#2004)
- Relaxed dtype for MVDR (#2024)
- Added warnings in mu_law* for the wrong input type (#2034)
- Added parameter p to TimeMasking (#2090)
- Removed unused vars from RNN-T loss (#2142)
- Removed complex32 dtype in F.griffinlim (#2233)
Datasets
- Deprecated data utils (#2073)
- Updated URLs for libritts (#2074)
- Added subset support for TEDLIUM release3 dataset (#2157)
Models
- Replaced dropout with Dropout (#1815)
- Inplace initialization of RNN weights (#2010)
- Updated to xavier_uniform and avoid legacy data.uniform_ initialization (#2018)
- Allowed Tacotron2 decode batch_size 1 examples (#2156)
Pipelines
- Added tool to convert voxpopuli model (#1923)
- Refactored wav2vec2 pipeline util (#1925)
- Allowed the customization of axis exclusion for ASR head (#1932)
- Tweaked wav2vec2 checkpoint conversion tool (#1938)
- Added melkwargs setting for MFCC in HuBERT pipeline (#1949)
Documentation
- Added 0.10.0 to version compatibility matrix (#1862)
- Removed MACOSX_DEPLOYMENT_TARGET (#1880)
- Updated intersphinx inventory (#1893)
- Updated compatibility matrix to include LTS version (#1896)
- Updated CONTRIBUTING with doc conventions (#1898)
- Added anaconda stats to README (#1910)
- Updated README.md (#1916)
- Added citation information (#1947)
- Updated CONTRIBUTING.md (#1975)
- Doc fixes (#1982)
- Added tutorial to CONTRIBUTING (#1990)
- Fixed docstring (#2002)
- Fixed minor typo (#2012)
- Updated audio augmentation tutorial (#2082)
- Added Sphinx gallery automatically (#2101)
- Disabled matplotlib warning in tutorial rendering (#2107)
- Updated prototype documentations (#2108)
- Added custom CSS to make signatures appear in multi-line (#2123)
- Updated prototype pipeline documentation (#2148)
- Tweaked documentation (#2152)
Tests
- Refactored integration test (#1922)
- Enabled integration tests on CI (#1939)
- Removed facebook folder in wav2vec unit tests (#2015)
- Temporarily skipped threadpool test (#2025)
- Revised Griffin-Lim transform test to reduce execution time (#2037)
- Fixed CircleCI test failures (#2069)
- Do not auto-skip tests on CI (#2127)
- Relaxed absolute tolerance for Kaldi compat tests (#2165)
- Added tacotron2 unit test with different batch_size (#2176)
Build
- Updated GPU resource class (#1791)
- Updated the main version to 0.11.0 (#1793)
- Updated windows cuda installer 11.1.0 to 11.1.1 (#1795)
- Renamed build_tools to tools (#1812)
- Limit Windows GPU testing to CUDA-11.3 only (#1842)
- Used cu113 for unittest_windows_gpu (#1853)
- USE_CUDA in windows and reduce one vcvarsall (#1854)
- Check torch installation before building package (#1867)
- Install tools from conda instead of brew (#1873)
- Cleaned up setup.py (#1900)
- Moved TorchAudio conda package to use pytorch-mutex (#1904)
- Updated smoke test docker image (#1905)
- Fixed formatting CIRCLECI_TAG when building docs (#1915)
- Fetch third party sources automatically (#1966)
- Disabled SPHINXOPT=-W for local env (#2013)
- Improved installing nightly pytorch (#2026)
- Improved cuda installation on windows (#2032)
- Refactored the library loading mechanism (#2038)
- Cleaned up libtorchaudio customization logic (#2039)
- Refactored and functionize the library definition (#2040)
- Introduced helper function to define extension (#2077)
- Standardized the location of third-party source code (#2086)
- Show lint diff with color (#2102)
- Updated third party submodule setup (#2132)
- Suppressed stderr from subprocess in setup.py (#2133)
- Fixed header include (#2135)
- Updated ROCM version 4.1 -> 4.3.1 and 4.5 (#2186)
- Added "cu102" back (#2190)
- Pinned flake8 version (#2191)
Style
- Removed trailing whitespace (#1803)
- Fixed style checks (#1913)
- Resolved lint warning (#1971)
- Enabled CLANGFORMAT (#1999)
- Fixed style checks in examples/tutorials (#2006)
- OSS config for lint checks (#2066)
- Excluded sphinx-gallery examples (#2071)
- Reverted linting exemptions introduced in #2071 (#2087)
- Applied arc lint to pytorch audio (#2096)
- Enforced lint checks and fix/mute lint errors (#2116)...
torchaudio v0.10.2 Minor release
This is a minor release compatible with PyTorch 1.10.2.
There is no feature change in torchaudio from 0.10.1. For the full feature set of v0.10, please refer to the v0.10.0 release notes.
torchaudio 0.10.1 Release Note
This is a minor release compatible with PyTorch 1.10.1. It includes a small bug fix, improvements, and documentation updates. No new features were added.
Bug Fix
- #2050 Allow whitespace as `TORCH_CUDA_ARCH_LIST` delimiter
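
As an illustration of what accepting whitespace as a delimiter means, the sketch below splits an architecture list on semicolons, whitespace, or both. It is not the torchaudio build code: the helper name `parse_arch_list` is hypothetical, and treating the semicolon as the other accepted delimiter is an assumption.

```python
import re
from typing import List


def parse_arch_list(value: str) -> List[str]:
    """Split an arch list on semicolons and/or whitespace (hypothetical helper)."""
    # One or more semicolons or whitespace characters count as a single delimiter.
    parts = re.split(r"[;\s]+", value.strip())
    # Drop empty strings that appear for empty input or stray delimiters.
    return [part for part in parts if part]


print(parse_arch_list("7.0;7.5;8.0"))      # semicolon-delimited
print(parse_arch_list("7.0 7.5 8.0+PTX"))  # whitespace-delimited
```

Both calls produce a list with three entries, so the two spellings of the variable are treated the same way.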
Improvement
- #2054 Fetch third party source code automatically
  The build process now fetches third party source code (git submodule and cmake external projects)
- #2059 Improve documentation
For the full feature set of v0.10, please refer to the v0.10.0 release note.