Merge pull request #50 from drowe67/dr-codec2-doc

Correcting (19), thanks Bruce Mackinnon
drowe67 2024-05-06 06:48:35 +09:30 committed by GitHub
commit 6930e3c26a
2 changed files with 20 additions and 5 deletions

@@ -81,7 +81,7 @@ This production of this document was kindly supported by an ARDC grant \cite{ard
A speech codec takes speech samples from an A/D converter (e.g. 16 bit samples at 8 kHz or 128 kbit/s) and compresses them down to a low bit rate that can be more easily sent over a narrow bandwidth channel (e.g. 700 bit/s for HF). Speech coding is the art of ``what can we throw away''. We need to lower the bit rate of the speech while retaining speech you can understand, and making it sound as natural as possible.
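For example, $16$ bits $\times$ $8000$ samples/s $= 128000$ bit/s at the A/D converter, so a 700 bit/s codec represents a compression ratio of roughly 180:1.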
At such low bit rates we use a speech production ``model''. The input speech is analysed, and we extract model parameters, which are then sent over the channel. An example of a model based parameter is the pitch of the person speaking. We estimate the pitch of the speaker, quantise it to a 7 bit number, and send that over the channel every 20 ms.
The model based approach used by Codec 2 allows high compression, with some trade-offs such as noticeable artefacts in the decoded speech. Higher bit rate codecs (above 5000 bit/s), such as those used for mobile telephony or voice on the Internet, tend to pay more attention to preserving the speech waveform, or use a hybrid approach of waveform and model based techniques. They sound better but require a higher bit rate.
@@ -488,24 +488,39 @@ Voicing is determined using a variation of the MBE voicing algorithm \cite{griff
For each band we first estimate the complex harmonic amplitude (magnitude and phase) using \cite{griffin1988multiband}:
\begin{equation}
\label{eq:est_amp_mbe1}
B_m = \frac{\sum_{k=a_m}^{b_m} S_w(k) W^* (k - \lfloor mr \rceil)}{\sum_{k=a_m}^{b_m} |W (k - \lfloor mr \rceil)|^2}
\end{equation}
where $r = \omega_0 N_{dft}/2 \pi$ is a constant that maps the $m$-th harmonic to a DFT bin, and $\lfloor x \rceil$ is the rounding operator. To avoid negative array indexes we define the shifted window function:
\begin{equation}
U(k) = W(k-N_{dft}/2)
\end{equation}
such that $U(N_{dft}/2)=W(0)$. As $w(n)$ is real and even, $W(k)$ is also real and even, so we can write:
\begin{equation}
\begin{split}
W^* (k - \lfloor mr \rceil) &= W(k - \lfloor mr \rceil) \\
&= U(k - \lfloor mr \rceil + N_{dft}/2) \\
&= U(k + l) \\
l &= N_{dft}/2 - \lfloor mr \rceil \\
&= \lfloor N_{dft}/2 - mr \rceil
\end{split}
\end{equation}
for even $N_{dft}$. We can therefore write (\ref{eq:est_amp_mbe1}) as:
\begin{equation}
\label{eq:est_amp_mbe}
B_m = \frac{\sum_{k=a_m}^{b_m} S_w(k) U(k + l)}{\sum_{k=a_m}^{b_m} |U (k + l)|^2}
\end{equation}
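As a worked example of the index arithmetic, with values chosen purely for illustration: taking $F_s = 8$ kHz, $F_0 = 100$ Hz, $N_{dft} = 512$, and $\omega_0 = 2 \pi F_0 / F_s$, we have $r = \omega_0 N_{dft} / 2 \pi = F_0 N_{dft} / F_s = 6.4$. For the $m=3$ harmonic, $\lfloor mr \rceil = \lfloor 19.2 \rceil = 19$ and $l = 256 - 19 = 237$, so the indexes $k + l$ in (\ref{eq:est_amp_mbe}) cluster around the window centre $U(N_{dft}/2) = U(256)$ and stay within $[0, N_{dft})$.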
Note this procedure differs from the $A_m$ magnitude estimation in (\ref{eq:mag_est}), and is used only locally for the MBE voicing estimation. Unlike (\ref{eq:mag_est}), the MBE amplitude estimation (\ref{eq:est_amp_mbe}) assumes the energy in the band of $S_w(k)$ is from the DFT of a sine wave, and $B_m$ is complex valued.
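It may also help to note that (\ref{eq:est_amp_mbe}) is the linear least squares estimate of the band amplitude, i.e. the value of $B_m$ that minimises the band error $E_m$ defined below:
\begin{equation}
B_m = \arg \min_{B} \sum_{k=a_m}^{b_m} | S_w(k) - B U(k + l) |^2
\end{equation}
Differentiating with respect to $B$ and setting the result to zero recovers the numerator and denominator of (\ref{eq:est_amp_mbe}).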
The synthesised frequency domain speech for this band is defined as:
\begin{equation}
\hat{S}_w(k) = B_m U(k + l), \quad k=a_m,...,b_m-1
\end{equation}
The error between the input and synthesised speech in this band is then:
\begin{equation}
\begin{split}
E_m &= \sum_{k=a_m}^{b_m-1} |S_w(k) - \hat{S}_w(k)|^2 \\
&=\sum_{k=a_m}^{b_m-1} |S_w(k) - B_m U(k + l)|^2
\end{split}
\end{equation}
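As an illustrative sketch only (not the Codec 2 sources), the per-band amplitude and error computation above might be coded in C as follows. The names \texttt{Sw}, \texttt{U}, and \texttt{est\_band}, and the calling convention, are assumptions for this example:
\begin{verbatim}
#include <complex.h>

/* Sketch: estimate B_m and the band error E_m for one band.
   Sw[] is the DFT of the windowed input speech, U[] the shifted
   (real valued) window DFT; both have N_dft entries. Hypothetical
   helper, not taken from the Codec 2 sources. */
static float complex est_band(const float complex Sw[], const float U[],
                              int am, int bm, int l, float *Em)
{
    float complex num = 0.0f;  /* numerator of the B_m estimate   */
    float den = 0.0f;          /* denominator of the B_m estimate */

    for (int k = am; k <= bm; k++) {
        num += Sw[k] * U[k + l];       /* U is real, so no conj() */
        den += U[k + l] * U[k + l];
    }
    float complex Bm = num / den;

    float E = 0.0f;                    /* band error E_m          */
    for (int k = am; k < bm; k++) {    /* k = a_m, ..., b_m - 1   */
        float complex e = Sw[k] - Bm * U[k + l];
        E += crealf(e) * crealf(e) + cimagf(e) * cimagf(e);
    }
    *Em = E;
    return Bm;
}
\end{verbatim}
A caller would evaluate this for each band $m$, passing the band edges $a_m$, $b_m$ and the offset $l = \lfloor N_{dft}/2 - mr \rceil$.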
A Signal to Noise Ratio (SNR) is defined as: