Bug Description
When using FSMN-VAD (iic/speech_fsmn_vad_zh-cn-16k-common-pytorch) with the same audio file and same model checkpoint (model.pt + config.yaml), funasr 1.3.9 produces a single ~52s segment, while funasr 1.3.1 correctly produces 19 separated speech segments.
max_end_silence_time (which controls segmentation granularity) appears to be ignored, or to behave differently, in 1.3.9.
Environment
- Python: 3.10
- Platform: Ubuntu 22.04 (Docker)
- PyTorch: latest
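Since "latest" resolves to different builds over time, the exact versions in each environment can be recorded with a short script. This is a minimal sketch using only the standard library; the package list (funasr, torch, torchaudio) is an assumption about what is relevant here and can be changed.

```python
from importlib import metadata
import platform


def collect_versions(packages=("funasr", "torch", "torchaudio")):
    """Return a mapping of package name to installed version string."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of raising.
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

Running it once in the 1.3.1 environment and once in the 1.3.9 environment would show whether anything besides funasr differs between the two runs.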
Steps to Reproduce
from funasr import AutoModel

model = AutoModel(
    model="path/to/fsmn-vad",  # local directory containing model.pt + config.yaml + am.mvn
    device="cpu",
    disable_update=True,
    disable_pbar=True,
    speech_noise_thres=0.9,
)

result = model.generate(input="test.mp3", max_end_silence_time=300)

# Each entry in result[0]['value'] is a [start_ms, end_ms] pair.
print(len(result[0]['value']), "segments")
for s, e in result[0]['value']:
    print(f" {s/1000:.2f}s ~ {e/1000:.2f}s")
Run with the same 52s audio under both versions:
funasr 1.3.1 (correct)
19 segments
3.15s ~ 4.35s
4.69s ~ 5.93s
...
45.05s ~ 45.97s
funasr 1.3.9 (regression)
1 segments
0.07s ~ 52.57s
Expected Behavior
Same model + same audio + same parameters should produce the same segmentation result.
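One way to make "the same segmentation result" checkable is to compare the two segment lists with a small tolerance. The sketch below is only an illustration: the helper name segments_match and the 50 ms tolerance are assumptions, and the 1.3.1 list contains only the segments quoted in this report (the elided middle ones are left out).

```python
def segments_match(a, b, tol_ms=50):
    """Return True if both lists hold the same number of [start, end] pairs
    and every boundary differs by at most tol_ms milliseconds."""
    if len(a) != len(b):
        return False
    return all(
        abs(s1 - s2) <= tol_ms and abs(e1 - e2) <= tol_ms
        for (s1, e1), (s2, e2) in zip(a, b)
    )


# Partial 1.3.1 output from this report, in milliseconds (elided segments omitted).
old_partial = [[3150, 4350], [4690, 5930], [45050, 45970]]
# Full 1.3.9 output from this report.
new_result = [[70, 52570]]

print(segments_match(old_partial, new_result))  # False
```

In practice, result[0]['value'] from each version could be dumped to JSON and the two files loaded and passed to this function.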
Additional Notes
- The model.pt and config.yaml files used are byte-identical between the two runs.
- Tried passing max_end_silence_time both at AutoModel(...) init time and at model.generate(...) time; neither works in 1.3.9.
- Tried speech_noise_thres ranging from 0.3 to 0.95; no segmentation in 1.3.9 regardless.
- Likely a regression in the VAD post-processing pipeline or in how model_conf overrides are propagated.
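To narrow down where the override is lost, it may help to print the value at each place it could be stored. The sketch below rests on unverified assumptions about funasr internals: that AutoModel keeps its init kwargs on a .kwargs dict and the loaded network on .model, and that the VAD network keeps its options on .vad_opts. The helper name effective_option is hypothetical, and any attribute that is missing is reported as None rather than raising.

```python
from types import SimpleNamespace


def effective_option(auto_model, name):
    """Report the value of `name` at each place it might be stored.

    ASSUMPTION (not verified against funasr source): `auto_model.kwargs` is a
    dict of init kwargs, `auto_model.model` is the loaded network, and the
    network exposes its VAD options on `.vad_opts`.
    """
    kwargs = getattr(auto_model, "kwargs", None)
    inner = getattr(auto_model, "model", None)
    vad_opts = getattr(inner, "vad_opts", None)
    return {
        "kwargs": kwargs.get(name) if isinstance(kwargs, dict) else None,
        "vad_opts": getattr(vad_opts, name, None),
    }


# Stand-in object with arbitrary values, used only to show the output shape.
# It is NOT real funasr output.
fake = SimpleNamespace(
    kwargs={"max_end_silence_time": 300},
    model=SimpleNamespace(vad_opts=SimpleNamespace(max_end_silence_time=800)),
)
print(effective_option(fake, "max_end_silence_time"))
```

With a real model, calling effective_option(model, "max_end_silence_time") under both versions and comparing the output would show whether the two locations disagree, which would point at parameter propagation rather than post-processing.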
Suggested Fix
Please verify whether max_end_silence_time and related streaming-VAD parameters are still being applied to FsmnVADStreaming in 1.3.9. If not, restore the parameter routing used in 1.3.1.
test_vad.py