CosyEdit2 Demo

CosyEdit2: Speech-Editing-Oriented Reinforcement Learning Unlocks Better Zero-Shot TTS

Abstract: Speech editing and zero-shot Text-to-Speech (TTS) share a similar generative foundation conditioned on speech prompts, yet speech editing demands far stricter local acoustic consistency with surrounding unedited content. While prior work has shown that Supervised Fine-Tuning (SFT) enables TTS models to acquire functional editing capability, this approach remains fundamentally bottlenecked by imperfect paired editing data and coarse-grained optimization signals. To address these limitations, we propose CosyEdit2, a speech editing model built on a two-stage post-training framework that progresses from supervised editing initialization to editing-oriented Group Relative Policy Optimization (GRPO) over target-speech-free data. Extensive experiments demonstrate that CosyEdit2 not only substantially advances speech editing performance, but also unlocks better zero-shot TTS capability, revealing a deeper mutual relationship between the two tasks.
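To make the "group relative" part of GRPO concrete, here is a minimal sketch of the advantage normalization that GRPO is generally known for: several outputs are sampled for the same input, and each sample's reward is standardized against its own group. The reward values, the function name `group_relative_advantages`, and the use of a population standard deviation are illustrative assumptions; this page does not specify CosyEdit2's reward design or training code.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and std of its own group.

    Hypothetical helper for illustration; not taken from the CosyEdit2 code.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    # eps avoids division by zero when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical scalar rewards for four edited-speech samples of one input.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Samples that score above their group mean receive a positive advantage and the rest a negative one, so no target speech is needed to rank candidates within a group.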

Contents

Speech Editing Demo
Zero-Shot TTS Demo
Multilingual Zero-Shot TTS Demo
Paralinguistic Event Editing Demo

Speech Editing Demo

Insertion · Deletion · Substitution — English & Chinese

Highlights: Inserted text · Deleted text · Original phrase · Substituted phrase

＋

Insertion Task

Original Transcript

One who performs conjurations is called a conjurer or conjuror

Edited Transcript

One who performs conjurations is commonly called a conjurer or conjuror

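Original Transcript (Chinese)

防止因过快而过急而出现处置风险 (Prevent handling risks caused by moving too fast and too hastily)

Edited Transcript (Chinese)

防止因操作过快而过急而出现处置风险 (Prevent handling risks caused by operating too fast and too hastily)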

－

Deletion Task

Original Transcript

The stems of the tall glasses cracked and broke

Edited Transcript

The stems of the tall glasses

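Original Transcript (Chinese)

另外根据中央气象台的未来七天降雨预报 (In addition, according to the Central Meteorological Observatory's seven-day rainfall forecast)

Edited Transcript (Chinese)

根据中央气象台的未来七天降雨预报 (According to the Central Meteorological Observatory's seven-day rainfall forecast)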

⇄

Substitution Task

Original Transcript

Women are selling grains in a stall in an urban environment

Edited Transcript

Women are selling grains in a stall in a city setting

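Original Transcript (Chinese)

消费者更倾向于用衣服彰显自身的个性与独特 (Consumers are more inclined to use clothing to express their individuality and uniqueness)

Edited Transcript (Chinese)

消费者更倾向于用配饰彰显自身的个性与独特 (Consumers are more inclined to use accessories to express their individuality and uniqueness)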
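The three edit types shown above can be recovered from a transcript pair with an ordinary token-level diff. The sketch below uses Python's standard `difflib.SequenceMatcher`; the helper name `classify_edit` is hypothetical, and this only illustrates how the tasks are defined on text. It is not the authors' data pipeline and does nothing on the audio side.

```python
from difflib import SequenceMatcher


def classify_edit(src_tokens, tgt_tokens):
    """Return (operation, removed_tokens, added_tokens) for each changed span."""
    names = {"insert": "insertion", "delete": "deletion", "replace": "substitution"}
    matcher = SequenceMatcher(a=src_tokens, b=tgt_tokens, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only the spans that actually changed
            edits.append((names[tag], src_tokens[i1:i2], tgt_tokens[j1:j2]))
    return edits


# English transcripts are split on whitespace.
original = "The stems of the tall glasses cracked and broke".split()
edited = "The stems of the tall glasses".split()
print(classify_edit(original, edited))
# [('deletion', ['cracked', 'and', 'broke'], [])]

original = "Women are selling grains in a stall in an urban environment".split()
edited = "Women are selling grains in a stall in a city setting".split()
print(classify_edit(original, edited))
# [('substitution', ['an', 'urban', 'environment'], ['a', 'city', 'setting'])]
```

Chinese transcripts have no whitespace between words, so one simple option is to pass character lists instead, for example `classify_edit(list(original_text), list(edited_text))`.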

Zero-Shot TTS Demo

English & Chinese

English

Standard

Prompt Text (Reference)

Target Text (to synthesize)

Chinese

Standard

Prompt Text (Reference)

Target Text (to synthesize)

English

Hard

Prompt Text (Reference)

Target Text (to synthesize)

Chinese

Hard

Prompt Text (Reference)

Target Text (to synthesize)

Multilingual Zero-Shot TTS Demo

English · Chinese · Japanese · Korean

Paralinguistic Event Editing Demo

Laughter · Cough · Breath

Acknowledgements

This work builds upon several excellent open-source projects:

We borrowed a lot of code from CosyVoice.
We borrowed a lot of code from BigVGAN.
We borrowed a lot of code from verl.
We borrowed a lot of code from WeNet.
The template for this demo page is adapted from FunAudioLLM.

We are deeply grateful to the authors and contributors of these projects for their outstanding work and for making their code publicly available.

Disclaimer

The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.