September 9, 2025
Speech recognition (ASR), FFmpeg, WebSocket, audio preprocessing, voice activity detection (VAD), subtitle timeline, dynamic batching
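Modern ASR-to-subtitle systems rely mainly on end-to-end deep learning architectures. Taking a Transformer-based model as an example, the workflow breaks down into:
1. Audio feature extraction: an 80-dimensional Mel filterbank produces one feature frame every 10 ms
2. Encoder processing: Conformer blocks capture local and global features at the same time
3. Streaming output: training with a CTC or RNN-T loss enables real-time transcription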
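The feature-extraction step above can be sketched with NumPy alone. The text only specifies 80 Mel bins and a 10 ms hop; the 25 ms window, the 512-point FFT, the Hann window, and the helper names `mel_filterbank` and `log_mel_features` are assumptions made for illustration, not the configuration of any particular model.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels=80, n_fft=512, sample_rate=16000):
    """Triangular filters spaced evenly on the Mel scale, shape (n_mels, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def log_mel_features(samples, sample_rate=16000, n_mels=80,
                     win_ms=25, hop_ms=10, n_fft=512):
    """Return log-Mel energies with one row per 10 ms hop."""
    win = sample_rate * win_ms // 1000  # 400 samples at 16 kHz (assumed window)
    hop = sample_rate * hop_ms // 1000  # 160 samples at 16 kHz
    n_frames = 1 + (len(samples) - win) // hop
    # Index matrix that slices the signal into overlapping frames
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = samples[idx] * np.hanning(win)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel_energy + 1e-10)  # small floor avoids log(0)
```

For one second of 16 kHz audio passed as a float NumPy array, this yields 98 frames of 80 coefficients each.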
python
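import json

import ffmpeg
import webrtcvad
import websockets
import numpy as np
from vosk import Model, KaldiRecognizer


class VideoToSubtitle:
    def __init__(self, model_path="vosk-model-en-us-0.22"):
        self.model = Model(model_path)
        self.sample_rate = 16000
        # Mode 3 is webrtcvad's most aggressive speech filter
        self.vad = webrtcvad.Vad(3)

    def extract_audio(self, video_path):
        """Start ffmpeg decoding to 16-bit mono PCM on stdout and return the subprocess."""
        try:
            return (
                ffmpeg.input(video_path)
                .output('pipe:', format='s16le', ac=1, ar=self.sample_rate)
                .run_async(pipe_stdout=True)
            )
        except ffmpeg.Error as e:
            # stderr is only populated when it was captured
            detail = e.stderr.decode() if e.stderr else str(e)
            print(f"FFmpeg error: {detail}")
            return None

    async def transcribe_stream(self, websocket):
        """Yield finalized text segments from binary audio frames received over the socket."""
        rec = KaldiRecognizer(self.model, self.sample_rate)
        while True:
            data = await websocket.recv()
            if not isinstance(data, bytes):
                break  # any non-binary message ends the stream
            if rec.AcceptWaveform(data):
                result = json.loads(rec.Result())
                yield result.get('text', '')
        # Flush whatever the recognizer is still holding
        final = json.loads(rec.FinalResult())
        yield final.get('text', '')

    def generate_srt(self, transcriptions):
        counter = 1
        for start, end, text in self._align_timestamps(transcriptions):
            yield f"{counter}\n{start} --> {end}\n{text}\n\n"
            counter += 1

    def _align_timestamps(self, texts):
        # Align timestamps with a dynamic programming algorithm
        ...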
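SRT cues expect timestamps in `HH:MM:SS,mmm` form, a detail that `generate_srt` leaves to the elided `_align_timestamps`. A minimal sketch of that formatting step, assuming start and end times arrive as float seconds; `format_srt_time` and `srt_cue` are hypothetical helper names and not part of the original class.

```python
def format_srt_time(seconds: float) -> str:
    """Convert a time in seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Build one SRT cue: index line, time range line, text, blank separator."""
    return f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text}\n\n"


print(srt_cue(1, 0.0, 2.25, "hello"), end="")
```

Note that SRT uses a comma, not a period, before the milliseconds.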
bash
-acodec pcm_s16le -ac 1 -ar 16k
python
class StreamBuffer:
    def __init__(self, chunk_size=4000):
        self.buffer = bytearray()
        self.chunk_size = chunk_size

    def add_data(self, data):
        """Append incoming bytes and yield every complete fixed-size chunk."""
        self.buffer.extend(data)
        while len(self.buffer) >= self.chunk_size:
            yield bytes(self.buffer[:self.chunk_size])
            # Drop the emitted chunk in place
            del self.buffer[:self.chunk_size]
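To see what the 4000-byte default means in time, the arithmetic can be sketched as follows, assuming the 16 kHz, 16-bit, mono PCM format used elsewhere in this article. The helper names are hypothetical. The 30 ms frame length reflects the constraint that webrtcvad accepts only 10, 20, or 30 ms frames, so a chunk has to be re-sliced before it is passed to the VAD.

```python
def chunk_duration_ms(chunk_bytes, sample_rate=16000, sample_width=2, channels=1):
    """Duration in milliseconds of a raw PCM chunk of the given byte length."""
    bytes_per_second = sample_rate * sample_width * channels
    return 1000 * chunk_bytes / bytes_per_second


def split_frames(chunk, frame_ms=30, sample_rate=16000, sample_width=2):
    """Slice a mono PCM chunk into whole frames of frame_ms; a trailing partial frame is dropped."""
    frame_bytes = sample_rate * frame_ms // 1000 * sample_width
    return [chunk[i:i + frame_bytes]
            for i in range(0, len(chunk) - frame_bytes + 1, frame_bytes)]


print(chunk_duration_ms(4000))          # 125.0
print(len(split_frames(bytes(4000))))   # 4
```

A 4000-byte chunk therefore carries 125 ms of audio, which splits into four complete 30 ms frames with 5 ms left over.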
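Dynamic time warping (DTW) is used to correct the offset between ASR output timestamps and the true timing:
1. Compute the distance matrix between MFCC features
2. Find the optimal path and stretch time along it
3. Adjust segment boundaries using silence detection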
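The first two steps above can be sketched in plain Python. This is a textbook DTW under simplifying assumptions, not the article's own implementation: the sequences here are scalars compared by absolute difference, whereas with MFCC frames `dist` would typically be a Euclidean distance between feature vectors. The function name `dtw_path` is hypothetical.

```python
def dtw_path(a, b, dist=lambda x, y: abs(x - y)):
    """Return (total_cost, path), where path is a list of aligned (i, j) index pairs."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    # Backtrack from the end, always stepping to the cheapest predecessor
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


total, path = dtw_path([0, 1, 2, 3], [0, 1, 1, 2, 3])
print(total, path)  # 0.0 [(0, 0), (1, 1), (1, 2), (2, 3), (3, 4)]
```

In the example, the repeated `1` in the second sequence is absorbed by mapping two of its indices onto index 1 of the first, which is the "time stretching" the list refers to.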
python
import mmap

# Map the raw audio read-only so large files are paged in on demand
with open('audio.raw', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
python
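import torch

# Allow lower-precision internal computation for float32 matmuls, trading accuracy for speed
torch.set_float32_matmul_precision('medium')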
python
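import asyncio


class ASRErrorHandler:
    @staticmethod
    def handle_retry(exc):
        # Transient network failures are retried; ExponentialBackoff is not defined in this snippet
        if isinstance(exc, (asyncio.TimeoutError, ConnectionResetError)):
            return ExponentialBackoff(max_retries=5)
        ...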
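The handler above returns an `ExponentialBackoff` object that the article never defines. One way such a helper could look is sketched below; the `base_delay` and `max_delay` parameters, the `delays` method, and the `call_with_retry` function are all assumptions introduced for illustration.

```python
import asyncio


class ExponentialBackoff:
    """Hypothetical helper: yields a capped, doubling delay for each retry."""

    def __init__(self, max_retries=5, base_delay=0.5, max_delay=30.0):
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.max_delay = max_delay

    def delays(self):
        for attempt in range(self.max_retries):
            yield min(self.max_delay, self.base_delay * 2 ** attempt)


async def call_with_retry(func, backoff,
                          retry_on=(asyncio.TimeoutError, ConnectionResetError)):
    """Await func(); on a transient error, sleep and retry until the delays run out."""
    for delay in backoff.delays():
        try:
            return await func()
        except retry_on:
            await asyncio.sleep(delay)
    # One last attempt; any error now propagates to the caller
    return await func()
```

With the defaults, the waits between attempts are 0.5, 1, 2, 4, and 8 seconds. Production code would usually add random jitter so that many clients do not retry in lockstep.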
After language identification with LangID, the matching model is loaded dynamically:
python
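import langid

# langid classifies text, so pass a transcript sample from a first pass; raw audio bytes will not work
detected_lang = langid.classify(transcript_sample)[0]
# load_model and the model naming scheme are placeholders, not a Vosk API
model = load_model(f"vosk-model-{detected_lang}")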