Codec SUPERB

Codec SUPERB Challenge @ SLT 2024

Codec Speech processing Universal PERformance Benchmark Challenge


Introduction

Neural audio codecs were initially introduced to compress audio data into compact codes and thereby reduce transmission latency. Researchers have recently discovered their potential as tokenizers that convert continuous audio into discrete codes, which can be used to build audio language models (LMs). This dual role, minimizing data transmission latency and serving as a tokenizer, underscores the codec's critical importance. An ideal neural audio codec should preserve content, paralinguistic, speaker, and general audio information. However, the question of which codec preserves audio information best remains open, because each paper evaluates its models under its own experimental settings. No existing challenge enables a fair comparison of all current codec models or stimulates the development of more advanced codecs. To fill this gap, we propose the Codec-SUPERB challenge.
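
As a rough illustration of the tokenizer role described above, the sketch below encodes a waveform into discrete codes and decodes it back with the open-source EnCodec package ([6] in the references below). This is a minimal example under stated assumptions (a local placeholder file named example.wav and the encodec 0.1-style API); it is not part of the challenge toolkit.

```python
# Minimal sketch: audio -> discrete codec tokens -> audio, using EnCodec.
# Assumptions: `pip install encodec torchaudio`, a local file "example.wav".
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # kbps; controls how many codebooks are used

wav, sr = torchaudio.load("example.wav")                      # placeholder path
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    frames = model.encode(wav)                                # list of (codes, scale)
    codes = torch.cat([c for c, _ in frames], dim=-1)         # [batch, n_codebooks, T]
    recon = model.decode(frames)                              # waveform rebuilt from codes

print(codes.shape)  # these discrete codes are what an audio LM would model
```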

Keynote Speech

Date: Tuesday, December 3
Time: 15:00 - 18:30 (3.5 hours total)
Format: Each speaker will have a 30-minute talk followed by a 5-minute Q&A session.

Notes
  • Parts of the recorded video and slides will be posted on the website and GitHub after the session.

Timeline
  • 15:10 - 15:45 Neil Zeghidour
  • 15:45 - 16:20 Dongchao Yang
  • 16:20 - 16:55 Shang-Wen Li
  • 16:55 - 17:25 Paper Presentation
  • 17:25 - 18:00 Wenwu Wang
  • 18:00 - 18:35 Minje Kim

Keynote Speakers

    Speaker: Prof. Wenwu Wang

    Title: Neural Audio Codecs: Recent Progress and a Case Study with SemantiCodec

    Abstract: The neural audio codec has attracted increasing interest as a highly effective method for audio compression and representation. By transforming continuous audio into discrete tokens, it facilitates the use of large language model (LLM) techniques in audio processing. In this talk, we will report recent progress in neural audio codecs, with a particular focus on SemantiCodec, a new neural audio codec for ultra-low bit rate audio compression and tokenization. SemantiCodec features a dual-encoder architecture: a semantic encoder using a self-supervised AudioMAE, discretized with k-means clustering on extensive audio data, and an acoustic encoder to capture the remaining details. The semantic and acoustic encoder outputs are then used to reconstruct audio via a diffusion-model-based decoder. SemantiCodec presents several advantages over previous codecs, which typically operate at high bitrates, are confined to narrow domains like speech, and lack the semantic information essential for effective language modelling. First, SemantiCodec compresses audio into fewer than 100 tokens per second across various audio types, including speech, general audio, and music, while maintaining high-quality output. Second, it preserves substantially richer semantic information from audio compared to all evaluated codecs. We will illustrate these benefits through benchmarking and conclude by discussing potential directions for future research in this field.
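
    A toy, non-authoritative sketch of the semantic-token idea mentioned above (self-supervised features discretized by k-means) is given below; the feature matrices are random stand-ins for AudioMAE embeddings, and the cluster count is an arbitrary choice rather than SemantiCodec's configuration.

```python
# Toy sketch of semantic tokenization via k-means over SSL features.
# The arrays below are random placeholders, not real AudioMAE outputs.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(5000, 768))      # stand-in frame embeddings
kmeans = KMeans(n_clusters=1024, n_init=4, random_state=0).fit(train_feats)

clip_feats = rng.normal(size=(250, 768))        # embeddings of one audio clip
semantic_tokens = kmeans.predict(clip_feats)    # one discrete token per frame
print(semantic_tokens[:10])
```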

    Bio: Wenwu Wang is a Professor in Signal Processing and Machine Learning, University of Surrey, UK. He is also an AI Fellow at the Surrey Institute for People Centred Artificial Intelligence. His current research interests include signal processing, machine learning and perception, artificial intelligence, machine audition (listening), and statistical anomaly detection. He has (co-)authored over 300 papers in these areas. He is a (co-)author or (co-)recipient of more than 15 accolades, including the 2022 IEEE Signal Processing Society Young Author Best Paper Award, ICAUS 2021 Best Paper Award, DCASE 2020 and 2023 Judge’s Award, DCASE 2019 and 2020 Reproducible System Award, and LVA/ICA 2018 Best Student Paper Award. He is an Associate Editor (2020-2025) for IEEE/ACM Transactions on Audio, Speech, and Language Processing, and an Associate Editor (2024-2026) for IEEE Transactions on Multimedia. He was a Senior Area Editor (2019-2023) and Associate Editor (2014-2018) for IEEE Transactions on Signal Processing. He is the elected Chair (2023-2024) of the IEEE Signal Processing Society (SPS) Machine Learning for Signal Processing Technical Committee, a Board Member (2023-2024) of the IEEE SPS Technical Directions Board, the elected Chair (2025-2027) and Vice Chair (2022-2024) of the EURASIP Technical Area Committee on Acoustic Speech and Music Signal Processing, and an elected Member (2021-2026) of the IEEE SPS Signal Processing Theory and Methods Technical Committee. He has served on the organising committees of INTERSPEECH 2022, IEEE ICASSP 2019 & 2024, IEEE MLSP 2013 & 2024, and SSP 2009. He is Technical Program Co-Chair of IEEE MLSP 2025. He has been an invited Keynote or Plenary Speaker at more than 20 international conferences and workshops.


    Speaker: Prof. Minje Kim

    Title: Future Directions in Neural Speech Communication Codecs

    Abstract: Neural speech codecs promise high-quality speech at low bitrates but face challenges like increased model complexity and suboptimal quality. This talk presents two approaches to address these issues: generative de-quantization and personalization. With LaDiffCodec, we propose separating representation learning from information reconstruction. It combines a typical end-to-end codec that learns low-dimensional discrete tokens for compact representation and a latent diffusion model that de-quantizes these tokens into high-dimensional continuous space in a generative fashion. To prevent over-smooth speech, we employ “midway-infilling” during diffusion. Subjective tests show our model improves popular neural speech codecs' performance. Next, we introduce personalized neural speech codecs, where we explore personalizing codecs to specific user groups to reduce complexity and enhance perceptual quality. By learning speaker embeddings with a Siamese network from the LibriSpeech dataset, we cluster speakers based on perceptual similarity. Subjective tests reveal this strategy enables model compression without sacrificing—and even improving—speech quality. By decoupling key tasks and introducing personalization, these approaches address current limitations and pave the way for superior speech quality and efficiency at low bitrates in neural speech codecs.
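
    The clustering step behind the personalization idea can be pictured with the toy sketch below; the embeddings are random placeholders for the outputs of a Siamese speaker encoder, and the cluster count is arbitrary rather than taken from the work described above.

```python
# Toy sketch: group speakers by embedding similarity, then route a new
# speaker to the nearest group, which could own a small personalized decoder.
# Embeddings are random stand-ins, not real Siamese-network outputs.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
speaker_embs = rng.normal(size=(2000, 192))     # one embedding per speaker
kmeans = KMeans(n_clusters=8, n_init=4, random_state=0).fit(speaker_embs)

new_speaker = rng.normal(size=(1, 192))
group = int(kmeans.predict(new_speaker)[0])
print(f"route new speaker to personalized decoder #{group}")
```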

    Bio: Minje Kim is an associate professor in the Siebel School of Computing and Data Science at the University of Illinois at Urbana-Champaign and a Visiting Academic at Amazon Lab126. Before that, he was an associate professor at Indiana University (2016-2023). He earned his Ph.D. in Computer Science at UIUC (2016) after working as a researcher at ETRI, a national lab in Korea (2006 to 2011). During his career, he has focused on developing machine learning models for audio signal processing applications. He is a recipient of various awards, including the NSF CAREER Award (2021), the IU Trustees Teaching Award (2021), the IEEE SPS Best Paper Award (2020), and Google's and Starkey's grants for outstanding student papers at ICASSP 2013 and 2014, respectively. He is an IEEE Senior Member and the Vice Chair of the IEEE Audio and Acoustic Signal Processing Technical Committee. He serves on the editorial boards of IEEE/ACM TASLP and IEEE SPL as a Senior Area Editor, EURASIP JASMP as an Associate Editor, and IEEE OJSP as a Consulting Associate Editor. He was the General Chair of IEEE WASPAA 2023 and has served as a reviewer, program committee member, or area chair for the major machine learning and signal processing venues. He is named as an inventor on more than 50 patents.


    Speaker: Dongchao Yang

    Title: Challenges in Developing Universal Audio Foundation Model

    Abstract: Building a universal audio foundation model for different audio generation tasks, such as text-to-speech, text-to-audio, singing voice synthesis, voice conversion, and speech dialogue, has attracted great interest in the audio community. Audio codecs and audio modeling strategies are two key components. In this talk, we review the development of audio codecs and the different audio modeling strategies in the literature, and present our recent work on audio codecs and audio modeling methods. Lastly, we summarize the challenges and potential directions in developing universal audio foundation models.

    Bio: Dongchao Yang is a second-year Ph.D. student at The Chinese University of Hong Kong, supervised by Prof. Helen Meng. Before that, he received his master's degree from Peking University. His research interests span the broad domain of speech and language intelligence, including audio foundation models, multi-modal large language models (MLLMs), text-to-speech synthesis (TTS), and cross-modal representation learning, among other related areas. Currently, his work focuses on audio-text foundation models and audio codec models.


    Speaker: Dr. Shang-Wen Li

    Title: VoiceCraft: Zero-Shot Speech Editing and TTS in the Wild

    Abstract: We introduce VoiceCraft, a token-infilling neural codec language model (NCLM) that achieves state-of-the-art (SOTA) performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts. We present the development of VoiceCraft as a case study to understand how NCLMs solve critical speech generation problems, how the codec lays the foundation for SOTA NCLMs, and the challenges in their development and evaluation.

    VoiceCraft leverages EnCodec to tokenize speech signals; it employs a Transformer decoder architecture and introduces a token rearrangement procedure that combines causal masking and delayed stacking to enable token generation within an existing sequence. On speech editing tasks, VoiceCraft produces edited speech that is nearly indistinguishable from unedited recordings in terms of naturalness, as evaluated by humans; for zero-shot TTS, our model outperforms prior SOTA models including VALL-E and XTTS-v2. The models are evaluated on challenging and realistic datasets that consist of diverse accents, speaking styles, recording conditions, and background noise and music, and our model performs consistently well compared to other models and real recordings. In particular, for speech editing evaluation, we introduce a high-quality, challenging, and realistic dataset named RealEdit. We encourage the audience to listen to the demos on this website.
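
    The "delayed stacking" mentioned above is easiest to see on a tiny example. The sketch below shifts each codebook stream by its index, which is the general delay-pattern trick used by several codec LMs; it is an illustration, not VoiceCraft's exact rearrangement procedure.

```python
# Illustration of a delay pattern over multi-codebook codec tokens:
# codebook k is shifted right by k steps, so one decoding step emits
# codebook 0 of frame t together with codebook 1 of frame t-1, and so on.
import numpy as np

n_codebooks, n_frames, PAD = 4, 6, -1
codes = np.arange(n_codebooks * n_frames).reshape(n_codebooks, n_frames)

delayed = np.full((n_codebooks, n_frames + n_codebooks - 1), PAD)
for k in range(n_codebooks):
    delayed[k, k:k + n_frames] = codes[k]

print(codes)
print(delayed)  # staircase layout; PAD marks positions with no token yet
```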

    Bio: Shang-Wen Li is a Research Lead and Manager on Meta's Fundamental AI Research (FAIR) team; he worked at Apple Siri, Amazon Alexa, and AWS before joining FAIR. He completed his PhD in 2016 at MIT in the Spoken Language Systems group of the Computer Science and Artificial Intelligence Laboratory (CSAIL). His recent research focuses on multimodal large language models, multimodal representation learning, and spoken language understanding.


    Speaker: Dr. Neil Zeghidour

    Title: Audio Language Models

    Abstract: Audio analysis and audio synthesis require modeling long-term, complex phenomena and have historically been tackled in an asymmetric fashion, with specific analysis models that differ from their synthesis counterparts. In this presentation, we will introduce the concept of audio language models, a recent innovation aimed at overcoming these limitations. By discretizing audio signals using a neural audio codec, we can frame both audio generation and audio comprehension as similar autoregressive sequence-to-sequence tasks, capitalizing on the well-established Transformer architecture commonly used in language modeling. This approach unlocks novel capabilities in areas such as textless speech modeling, zero-shot voice conversion, text-to-music generation and even real-time spoken dialogue. Furthermore, we will illustrate how the integration of analysis and synthesis within a single model enables the creation of versatile audio models capable of handling a wide range of tasks involving audio as inputs or outputs. We will conclude by highlighting the promising prospects offered by these models and discussing the key challenges that lie ahead in their development.
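
    To make the "audio as a token sequence" framing concrete, here is a minimal, generic sketch of a causal Transformer trained with next-token prediction over codec tokens; the vocabulary size, model size, and random tokens are placeholders and do not describe any specific system from the talk.

```python
# Generic sketch of an audio language model: codec tokens in, next-token
# prediction out. All sizes and the token data are placeholders; positional
# encodings are omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model, seq_len = 1024, 256, 128
tokens = torch.randint(0, vocab, (2, seq_len))            # stand-in codec tokens [B, T]

embed = nn.Embedding(vocab, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(d_model, vocab)

causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
hidden = backbone(embed(tokens), mask=causal_mask)        # causal self-attention
logits = head(hidden)                                     # [B, T, vocab]

# Shifted cross-entropy: position t predicts token t+1.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab), tokens[:, 1:].reshape(-1))
print(loss.item())
```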

    Bio: Neil is co-founder and Chief Modeling Officer of the Kyutai non-profit research lab. Neil and his team recently presented Moshi (https://moshi.chat), the first real-time spoken dialogue model. He was previously at Google DeepMind, where he started and led a team working on generative audio, with contributions including Google's first text-to-music API, a voice-preserving speech-to-speech translation system, and the first neural audio codec that outperforms general-purpose audio codecs. Before that, Neil spent three years at Facebook AI Research, working on automatic speech recognition and audio understanding. He graduated with a PhD in machine learning from École Normale Supérieure (Paris), and holds an MSc in machine learning from École Normale Supérieure (Saclay) and an MSc in quantitative finance from Université Paris Dauphine. In parallel with his research activities, Neil teaches speech processing technologies at the École Normale Supérieure (Saclay).

News

  • 2024-11-22 Keynote speaker information updated
  • 2024-09-21 Codec-SUPERB @ SLT 2024 is available on arXiv
  • 2024-06-10 Evaluation Colab example [Colab Link]
  • 2024-04-29 Rule announced: [Rule with Baselines]
  • Please use the Google Form to register.
  • Please submit the evaluation results by creating a GitHub issue.

Timeline / Important Dates

  • Rule announcement: 2024-04-29 [Rule with Baselines]
  • Submission start: 2024-04-29
  • Submission deadline: 2024-06-20
  • Results announcement and challenge hosting: 2024-12
  • Data available for the public set (the hidden set will remain hidden throughout the challenge)

Organizers

    Academia

  • Hung-yi Lee (NTU) website
  • Haibin Wu (NTU) website
  • Kai-Wei Chang (NTU) website
  • Alexander H. Liu (MIT) website
  • Dongchao Yang (CUHK) website
  • Shinji Watanabe (CMU) website
  • James Glass (MIT) website

    Industrial

  • Songxiang Liu (miHoYo) website
  • Yi-Chiao Wu (Meta) website
  • Xu Tan (Microsoft) website

    Technical Committee

  • Ho-Lam Chung (NTU)
  • Yi-Cheng Lin (NTU)
  • Yuan-Kuei Wu (NTU)
  • Xuanjun Chen (NTU)
  • Ke-Han Lu (NTU)
  • Jiawei Du (NTU)

References

    [1] Haibin Wu et al., "Towards audio language modeling: an overview," arXiv preprint arXiv:2402.13236, 2024.
    [2] Haibin Wu et al., "Codec-SUPERB: An In-Depth Analysis of Sound Codec Models," arXiv preprint arXiv:2402.13071, 2024.
    [3] Neil Zeghidour et al., "SoundStream: An End-to-End Neural Audio Codec," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021.
    [4] Zalán Borsos et al., "AudioLM: A Language Modeling Approach to Audio Generation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
    [5] Felix Kreuk et al., "AudioGen: Textually Guided Audio Generation," arXiv preprint arXiv:2209.15352, 2022.
    [6] Alexandre Défossez et al., "High Fidelity Neural Audio Compression," arXiv preprint arXiv:2210.13438, 2022.
    [7] Chengyi Wang et al., "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers," arXiv preprint arXiv:2301.02111, 2023.
    [8] Andrea Agostinelli et al., "MusicLM: Generating Music from Text," arXiv preprint arXiv:2301.11325, 2023.
    [9] Ziqiang Zhang et al., "Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling," arXiv preprint arXiv:2303.03926, 2023.
    [10] Teerapat Jenrungrot et al., "LMCodec: A Low Bitrate Speech Codec with Causal Transformer Models," in Proc. IEEE ICASSP, 2023.
    [11] Tianrui Wang et al., "VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation," arXiv preprint arXiv:2305.16107, 2023.
    [12] Xue Jiang et al., "Latent-Domain Predictive Neural Speech Coding," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
    [13] Yi-Chiao Wu et al., "AudioDec: An Open-Source Streaming High-Fidelity Neural Audio Codec," in Proc. IEEE ICASSP, 2023, pp. 1–5.
    [14] Dongchao Yang et al., "HiFi-Codec: Group-Residual Vector Quantization for High Fidelity Audio Codec," arXiv preprint arXiv:2305.02765, 2023.
    [15] Zalán Borsos et al., "SoundStorm: Efficient Parallel Audio Generation," arXiv preprint arXiv:2305.09636, 2023.
    [16] Rithesh Kumar et al., "High-Fidelity Audio Compression with Improved RVQGAN," arXiv preprint arXiv:2306.06546, 2023.
    [17] Jade Copet et al., "Simple and Controllable Music Generation," arXiv preprint arXiv:2306.05284, 2023.
    [18] Paul K. Rubenstein et al., "AudioPaLM: A Large Language Model That Can Speak and Listen," arXiv preprint arXiv:2306.12925, 2023.
    [19] Xin Zhang et al., "SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models," arXiv preprint arXiv:2308.16692, 2023.
    [20] Xiaofei Wang et al., "SpeechX: Neural Codec Language Model as a Versatile Speech Transformer," arXiv preprint arXiv:2308.06873, 2023.
    [21] Anton Ratnarajah et al., "M3-AUDIODEC: Multi-Channel Multi-Speaker Multi-Spatial Audio Codec," arXiv preprint arXiv:2309.07416, 2023.
    [22] Zhongweiyang Xu et al., "SpatialCodec: Neural Spatial Speech Coding," arXiv preprint arXiv:2309.07432, 2023.
    [23] Zhihao Du et al., "FunCodec: A Fundamental, Reproducible and Integrable Open-Source Toolkit for Neural Speech Codec," arXiv preprint arXiv:2309.07405, 2023.
    [24] Qian Chen et al., "LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT," arXiv preprint arXiv:2310.04673, 2023.
    [25] Dongchao Yang et al., "UniAudio: An Audio Foundation Model Toward Universal Audio Generation," arXiv preprint arXiv:2310.00704, 2023.
    [26] Shengpeng Ji et al., "Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models," arXiv preprint arXiv:2402.12208, 2024.
    [27] Haohe Liu et al., "SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound," arXiv preprint arXiv:2405.00233, 2024.
    [28] Yang Ai et al., "APCodec: A Neural Audio Codec with Parallel Amplitude and Phase Spectrum Encoding and Decoding," arXiv preprint arXiv:2402.10533, 2024.

Codec-SUPERB @ SLT 2024 Paper

    Codec-SUPERB @ SLT 2024: A lightweight benchmark for neural audio codec models
    [arXiv link]

Contact

    codecsuperb@gmail.com