CosyEdit: Unlocking End-to-End Speech Editing Capability from Zero-Shot Text-to-Speech Models

🎧 Demo Page   |   📜 Paper   |   💻 Code

Abstract: Automatic speech editing aims to modify spoken content based on textual instructions, yet traditional cascade systems rely on explicit temporal alignment and complex preprocessing. To address these limitations, we propose CosyEdit, an end-to-end speech editing model adapted from CosyVoice through task-specific post-training and a complementary training paradigm, which internalizes text–speech alignment while ensuring high consistency between the speech before and after editing. Trained on only 250 hours of supervised data from our curated GigaEdit dataset, our 400M-parameter model achieves reliable speech editing performance. Extensive evaluations show that CosyEdit not only outperforms several billion-parameter language model baselines but also approaches state-of-the-art cascade systems. These results show that robust and efficient speech editing can be unlocked from a zero-shot TTS model through post-training, offering a cost-effective end-to-end solution for high-quality speech editing.

Key Advantages:

  • Comfortable Speech Editing ☕
    No external speech–text alignment tools and no complex editing algorithms: an end-to-end model handles everything in a single editing step.

  • Native Multi-Span Editing ✂️
    Natively supports insertion, deletion, and substitution across multiple spans within a single utterance, all completed in one inference pass.

  • Low-Cost, High-Performance ⚡
    Unlocks strong speech editing capabilities from existing zero-shot TTS models, delivering competitive performance with small model size and minimal training cost.

Cascade Speech Editing vs. End-to-End Speech Editing

Cascade vs. End-to-End

Cascade Speech Editing Systems:

  • Step 1) Speech–Text Temporal Alignment. Use an external forced aligner (e.g., Montreal Forced Aligner) to obtain timestamp-level temporal alignment between the original speech signal and its corresponding transcript.

  • Step 2) Text Edit Span Detection. Compare the original text with the target text to identify the textual regions that need to be inserted, deleted, or substituted.

  • Step 3) Speech Segmentation Based on Edits. Leverage the temporal alignment and detected text edit spans to determine the corresponding speech boundaries and split the input speech into preserved and editable segments.

  • Step 4) Edited Speech Generation and Integration. Apply speech editing models to generate the modified speech segments and seamlessly integrate them with the preserved original speech to produce the final edited utterance.
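
Steps 2 and 3 of the cascade pipeline can be illustrated with a minimal Python sketch. It assumes whitespace tokenization and uses the standard-library difflib.SequenceMatcher for the text comparison, which is a stand-in and not the method of any particular cascade system. The function names and the per-word timestamps (word_times, of the kind a forced aligner would produce) are hypothetical.

```python
from difflib import SequenceMatcher

# Map difflib opcode tags to the edit types named in Step 2.
EDIT_TYPES = {"insert": "insertion", "delete": "deletion", "replace": "substitution"}


def detect_edit_spans(original, target):
    """Step 2 (sketch): word-level edit spans that turn `original` into `target`."""
    src, tgt = original.split(), target.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged words belong to preserved speech
        spans.append({
            "type": EDIT_TYPES[tag],
            "src_words": (i1, i2),            # half-open word range in the original
            "tgt_text": " ".join(tgt[j1:j2]),  # text to synthesize (empty for deletion)
        })
    return spans


def span_to_seconds(span, word_times):
    """Step 3 (sketch): map a span to (start, end) seconds in the original speech.

    `word_times` is a hypothetical list of (start, end) pairs, one per original word.
    """
    i1, i2 = span["src_words"]
    if not word_times:
        return 0.0, 0.0
    if i1 < i2:
        # Deletion or substitution: covers existing words i1 .. i2-1.
        return word_times[i1][0], word_times[i2 - 1][1]
    # Insertion: a zero-length cut point before word i1 (or at the end of the audio).
    t = word_times[i1][0] if i1 < len(word_times) else word_times[-1][1]
    return t, t


original = "He added Sydney's address in a postscript, and dispatched his letter that evening."
target = "He added Sydney's address in a highlighted bold, and dispatched his letter that evening."
# Placeholder timestamps of 0.5 s per word; a real system would take these from the aligner.
word_times = [(0.5 * i, 0.5 * (i + 1)) for i in range(len(original.split()))]

for span in detect_edit_spans(original, target):
    print(span["type"], span["src_words"], repr(span["tgt_text"]),
          span_to_seconds(span, word_times))
```

With the placeholder timestamps, this prints a single substitution covering word 6 of the original ("postscript,"), to be replaced by "highlighted bold," between 3.0 s and 3.5 s. The end-to-end approach described next avoids both this explicit span detection and the alignment it depends on.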

End-to-End Speech Editing Models:

  • Just One Step: Directly generate the edited speech from the original speech and the target transcript, optionally conditioning on the original text, without relying on any external alignment tools or complex preprocessing pipelines.

Overview of CosyEdit

CosyEdit Architecture

(a) Examples of the four editing tasks used to construct the speech editing training dataset GigaEdit, which is built on the GigaSpeech-S dataset: insertion, deletion, substitution, and multi-edit.

(b) Schematic diagram of CosyEdit. S, E, and T denote the "start of sequence", "end of sequence", and "transition" tokens, respectively. Dashed lines indicate autoregressive decoding.

(c) Enlarged view of the GOT-CFM module, conditioned on speaker embedding $\mathbf{v}$, concatenated semantic tokens $\mu_Z$, concatenated speech features $\tilde{Z}$, and intermediate state $Z_t$ at timestep $t$. Here, $\mu_Z = [\mu_{\mathrm{ori}}, \mu_{\mathrm{tar}}]$ and $\tilde{Z} = [M_{\mathrm{ori}}, \tilde{M}_{\mathrm{tar}}]$, where $\tilde{M}_{\mathrm{tar}}$ denotes the fully masked target mel-spectrogram.
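
The concatenations in (c) can be illustrated with a short Python sketch using plain lists. This is only a shape-level illustration under stated assumptions, not the actual implementation: the function name is hypothetical, the semantic-token sequences are assumed to be already at the same frame rate as the mel-spectrogram, and the mask value of 0.0 for the fully masked target is a placeholder.

```python
def build_cfm_conditions(mu_ori, mu_tar, mel_ori, n_mels, mask_value=0.0):
    """Build mu_Z = [mu_ori, mu_tar] and Z~ = [M_ori, M~_tar] along the time axis.

    mu_ori, mu_tar: per-frame semantic conditions (assumed frame-aligned with the mel).
    mel_ori: original mel-spectrogram as a list of frames, each of length n_mels.
    The target mel is unknown at inference time, so it is fully masked.
    """
    if len(mel_ori) != len(mu_ori):
        raise ValueError("mel_ori and mu_ori must have the same number of frames")
    if any(len(frame) != n_mels for frame in mel_ori):
        raise ValueError("every mel frame must have n_mels bins")

    # Semantic conditions: original followed by target.
    mu_z = list(mu_ori) + list(mu_tar)

    # Speech features: the real original mel, then one masked frame per target position.
    masked_tar = [[mask_value] * n_mels for _ in range(len(mu_tar))]
    z_tilde = [list(frame) for frame in mel_ori] + masked_tar
    return mu_z, z_tilde


# Toy example: 2 original frames, 3 target frames, 2 mel bins.
mu_z, z_tilde = build_cfm_conditions(
    mu_ori=[11, 12],
    mu_tar=[21, 22, 23],
    mel_ori=[[0.1, 0.2], [0.3, 0.4]],
    n_mels=2,
)
print(len(mu_z), len(z_tilde))
```

Both sequences end up with the same number of frames, so each position carries a semantic condition paired with either a real original frame or a masked target frame to be generated.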

Insertion Task

Example 1
Original Transcript:
yet anytime you and i question the schemes of the dogooders, were denounced as being against their humanitarian goals. they say we're always against things, we're never for anything.
Edited Transcript:
yet anytime you and i question the schemes of the dogooders or dare to dig into any of their motives, were denounced as being against their humanitarian goals. they say we're always against things, we're never for anything.
Ground Truth | CosyEdit (Ours) | FluentSpeech | VoiceCraft | SSR-Speech | Step-Audio-EditX | Mimo-Audio | Ming-UniAudio
Example 2
Original Transcript:
Another thing is anything inside the parentheses turns the opposite so negative becomes positive and a positive becomes negative.
Edited Transcript:
Another thing is that in this equation anything inside the parentheses turns the opposite so negative becomes positive and a positive becomes negative.
Ground Truth | CosyEdit (Ours) | FluentSpeech | VoiceCraft | SSR-Speech | Step-Audio-EditX | Mimo-Audio | Ming-UniAudio

Deletion Task

Example 1
Original Transcript:
Thus was she borne away captive of her dead, neither willing nor unwilling, of life and death equally careless.
Edited Transcript:
Thus was she borne away of life and death equally careless.
Ground Truth | CosyEdit (Ours) | FluentSpeech | VoiceCraft | SSR-Speech | Step-Audio-EditX | Mimo-Audio | Ming-UniAudio
Example 2
Original Transcript:
Interesting and I think this is a comment a much more common phenomenon nowadays is that she just found out that she has a fifth sibling?
Edited Transcript:
Interesting and I think she just found out that she has a fifth sibling?
Ground Truth | CosyEdit (Ours) | FluentSpeech | VoiceCraft | SSR-Speech | Step-Audio-EditX | Mimo-Audio | Ming-UniAudio

Substitution Task

Example 1
Original Transcript:
He added Sydney's address in a postscript, and dispatched his letter that evening.
Edited Transcript:
He added Sydney's address in a highlighted bold, and dispatched his letter that evening.
Ground Truth | CosyEdit (Ours) | FluentSpeech | VoiceCraft | SSR-Speech | Step-Audio-EditX | Mimo-Audio | Ming-UniAudio
Example 2
Original Transcript:
And there are a lot of places in the movie where they could have just slipped it in just a little bit just to confirm that it happened.
Edited Transcript:
And there are actually multiple moments throughout the film where they could have just slipped it in just a little bit just to confirm that it happened.
Ground Truth | CosyEdit (Ours) | FluentSpeech | VoiceCraft | SSR-Speech | Step-Audio-EditX | Mimo-Audio | Ming-UniAudio

Multi-Edit Task

Example 1
Original Transcript:
When I had shown them what I did with it, they were astonished but not displeased, and seemed to like the smell.
Edited Transcript:
When I had shown how I had changed the recipe from the start, they were surprised but not displeased, and seemed to like the smell.
Ground Truth | CosyEdit (Ours) | FluentSpeech | VoiceCraft | SSR-Speech | Step-Audio-EditX | Mimo-Audio | Ming-UniAudio
Example 2
Original Transcript:
it's about the american people about the diversity of experience the resilience and the possibilities of the american future.
Edited Transcript:
it's about the american people who have never seen immigrants before, who never care about the diversity of the country or the possibilities of the american future.
Ground Truth | CosyEdit (Ours) | FluentSpeech | VoiceCraft | SSR-Speech | Step-Audio-EditX | Mimo-Audio | Ming-UniAudio

Acknowledgements

This work builds upon several excellent open-source projects:

  • We borrowed a lot of code from CosyVoice.
  • We borrowed a lot of code from WeNet.
  • The template for this demo page is adapted from FunAudioLLM.

We are deeply grateful to the authors and contributors of these projects for their outstanding work and for making their code publicly available, which has been instrumental in advancing our research.

Disclaimer

The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.