CosyEdit2: Speech-Editing-Oriented Reinforcement Learning Unlocks Better Zero-Shot TTS

Abstract: Speech editing and zero-shot Text-to-Speech (TTS) share a similar generative foundation conditioned on speech prompts, yet speech editing demands far stricter local acoustic consistency with surrounding unedited content. While prior work has shown that Supervised Fine-Tuning (SFT) enables TTS models to acquire functional editing capability, this approach remains fundamentally bottlenecked by imperfect paired editing data and coarse-grained optimization signals. To address these limitations, we propose CosyEdit2, a speech editing model built on a two-stage post-training framework that progresses from supervised editing initialization to editing-oriented Group Relative Policy Optimization (GRPO) over target-speech-free data. Extensive experiments demonstrate that CosyEdit2 not only substantially advances speech editing performance, but also unlocks better zero-shot TTS capability, revealing a deeper mutual relationship between the two tasks.
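
As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes the rewards of several candidates sampled for the same editing request against their own group mean and standard deviation. It is a generic sketch, not the CosyEdit2 training code: the function name, the reward values, and the choice of the population standard deviation are all assumptions, and the editing-oriented reward itself is not specified on this page.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each candidate relative to the other candidates in its group.

    rewards: one scalar reward per candidate sampled for the same prompt.
    eps: small constant that avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidate edits of one utterance.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

# Candidates above the group mean get a positive advantage, the rest negative.
best = max(range(len(rewards)), key=lambda i: advantages[i])
```

Because the baseline is the group's own mean, no separate value model is needed, and no reference target speech enters the computation; only a scalar reward per candidate is required.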

CosyEdit2 Overview

Speech Editing Demo

Insertion · Deletion · Substitution — English & Chinese

Highlights: Inserted text · Deleted text · Original phrase · Substituted phrase

Insertion Task

EN English Example
Original Transcript
One who performs conjurations is called a conjurer or conjuror
Edited Transcript
One who performs conjurations is commonly called a conjurer or conjuror
ZH Chinese Example
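Original Transcript
防止因过快而过急而出现处置风险 (Prevent handling risks that arise from being too fast and too hasty)
Edited Transcript
防止因操作过快而过急而出现处置风险 (Prevent handling risks that arise from operating too fast and too hastily)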

Deletion Task

EN English Example
Original Transcript
The stems of the tall glasses cracked and broke
Edited Transcript
The stems of the tall glasses
ZH Chinese Example
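Original Transcript
另外根据中央气象台的未来七天降雨预报 (In addition, according to the Central Meteorological Observatory's rainfall forecast for the next seven days)
Edited Transcript
根据中央气象台的未来七天降雨预报 (According to the Central Meteorological Observatory's rainfall forecast for the next seven days)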

Substitution Task

EN English Example
Original Transcript
Women are selling grains in a stall in an urban environment
Edited Transcript
Women are selling grains in a stall in a city setting
ZH Chinese Example
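Original Transcript
消费者更倾向于用衣服彰显自身的个性与独特 (Consumers are more inclined to use clothing to express their individuality and uniqueness)
Edited Transcript
消费者更倾向于用配饰彰显自身的个性与独特 (Consumers are more inclined to use accessories to express their individuality and uniqueness)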
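
The three edit types shown above can be recovered from a transcript pair with an ordinary sequence diff. The sketch below is illustrative only and is not part of CosyEdit2: `edit_operations` is a hypothetical helper built on Python's standard `difflib`, diffing by word for English and by character for Chinese (an assumption, since Chinese text has no spaces between words).

```python
from difflib import SequenceMatcher


def edit_operations(original, edited, by_char=False):
    """Return the non-matching spans between two transcripts.

    Each item is (tag, original_span, edited_span), where tag is one of
    'insert', 'delete', or 'replace' (substitution).
    """
    src = list(original) if by_char else original.split()
    dst = list(edited) if by_char else edited.split()
    sep = "" if by_char else " "
    matcher = SequenceMatcher(None, src, dst, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            ops.append((tag, sep.join(src[i1:i2]), sep.join(dst[j1:j2])))
    return ops


insertion = edit_operations(
    "One who performs conjurations is called a conjurer or conjuror",
    "One who performs conjurations is commonly called a conjurer or conjuror",
)
deletion = edit_operations(
    "The stems of the tall glasses cracked and broke",
    "The stems of the tall glasses",
)
substitution = edit_operations(
    "Women are selling grains in a stall in an urban environment",
    "Women are selling grains in a stall in a city setting",
)
zh_insertion = edit_operations(
    "防止因过快而过急而出现处置风险",
    "防止因操作过快而过急而出现处置风险",
    by_char=True,
)
```

The span boundaries returned here are textual only; locating the corresponding region in the audio would additionally require an alignment between transcript and speech, which this sketch does not attempt.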

Zero-Shot TTS Demo

English & Chinese

EN English · Standard
Prompt Text (Reference)
Target Text (to synthesize)

ZH Chinese · Standard
Prompt Text (Reference)
Target Text (to synthesize)

EN English · Hard
Prompt Text (Reference)
Target Text (to synthesize)

ZH Chinese · Hard
Prompt Text (Reference)
Target Text (to synthesize)

Multilingual Zero-Shot TTS Demo

English · Chinese · Japanese · Korean

Acknowledgements

This work builds upon several excellent open-source projects. We are deeply grateful to the authors and contributors of these projects for their outstanding work and for making their code publicly available.

Disclaimer

The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.