mirror of https://github.com/drowe67/codec2.git
Merge pull request #50 from drowe67/dr-codec2-doc
Correcting (19), thanks Bruce Mackinnon
commit 6930e3c26a
BIN doc/codec2.pdf
Binary file not shown.
@@ -81,7 +81,7 @@ This production of this document was kindly supported by an ARDC grant \cite{ard

A speech codec takes speech samples from an A/D converter (e.g. 16 bit samples at 8 kHz or 128 kbits/s) and compresses them down to a low bit rate that can be more easily sent over a narrow bandwidth channel (e.g. 700 bits/s for HF). Speech coding is the art of ``what can we throw away". We need to lower the bit rate of the speech while retaining speech you can understand, and making it sound as natural as possible.

-As such low bit rates we use a speech production ``model". The input speech is anlaysed, and we extract model parameters, which are then sent over the channel. An example of a model based parameter is the pitch of the person speaking. We estimate the pitch of the speaker, quantise it to a 7 bit number, and send that over the channel every 20ms.
+At such low bit rates we use a speech production ``model". The input speech is analysed, and we extract model parameters, which are then sent over the channel. An example of a model based parameter is the pitch of the person speaking. We estimate the pitch of the speaker, quantise it to a 7 bit number, and send that over the channel every 20ms.

The model based approach used by Codec 2 allows high compression, with some trade offs such as noticeable artefacts in the decoded speech. Higher bit rate codecs (above 5000 bit/s), such as those used for mobile telephony or voice on the Internet, tend to pay more attention to preserving the speech waveform, or use a hybrid approach of waveform and model based techniques. They sound better but require a higher bit rate.
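The paragraph on pitch above quantises the pitch estimate to a 7 bit number sent every 20 ms; alongside the input rate of 16 bits x 8000 samples/s = 128 kbit/s, that single parameter costs 7/0.02 = 350 bit/s. A minimal sketch of such a quantiser follows. It is a hypothetical illustration only, not the Codec 2 quantiser: the 20 to 160 sample pitch period range and the logarithmic spacing are assumptions.

    /* Hypothetical 7 bit pitch quantiser, for illustration only (not the
     * Codec 2 implementation).  Pitch is expressed as a period P in samples
     * at Fs = 8 kHz, limited to an assumed 20..160 sample (50..400 Hz) range
     * and quantised on a log scale to 128 levels. */

    #include <math.h>
    #include <stdio.h>

    #define P_MIN   20.0   /* assumed shortest pitch period, samples */
    #define P_MAX  160.0   /* assumed longest pitch period, samples  */
    #define LEVELS 128     /* 2^7 levels for a 7 bit index           */

    /* map a pitch period estimate to a 7 bit index */
    static int pitch_encode(double P) {
        if (P < P_MIN) P = P_MIN;
        if (P > P_MAX) P = P_MAX;
        double norm = log(P / P_MIN) / log(P_MAX / P_MIN);   /* 0..1 */
        return (int)floor(norm * (LEVELS - 1) + 0.5);
    }

    /* map a 7 bit index back to a pitch period */
    static double pitch_decode(int index) {
        return P_MIN * pow(P_MAX / P_MIN, (double)index / (LEVELS - 1));
    }

    int main(void) {
        double P = 57.3;                      /* example pitch period, samples */
        int idx = pitch_encode(P);
        printf("P = %.1f samples -> index %d -> %.1f samples\n",
               P, idx, pitch_decode(idx));
        /* one 7 bit index per 20 ms frame is 350 bit/s of the codec bit rate */
        return 0;
    }

A log scale is used in the sketch because pitch perception is roughly logarithmic; the spacing actually used by Codec 2 may differ.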

@@ -488,24 +488,39 @@ Voicing is determined using a variation of the MBE voicing algorithm \cite{griff

For each band we first estimate the complex harmonic amplitude (magnitude and phase) using \cite{griffin1988multiband}:
\begin{equation}
\label{eq:est_amp_mbe1}
B_m = \frac{\sum_{k=a_m}^{b_m} S_w(k) W^* (k - \lfloor mr \rceil)}{|\sum_{k=a_m}^{b_m} W (k - \lfloor mr \rceil)|^2}
\end{equation}
-where $r= \omega_0 N_{dft}/2 \pi$ is a constant that maps the $m$-th harmonic to a DFT bin, and $ \lfloor x \rceil$ is the rounding operator. As $w(n)$ is a real and even, $W(k)$ is real and even so we can write:
+where $r= \omega_0 N_{dft}/2 \pi$ is a constant that maps the $m$-th harmonic to a DFT bin, and $ \lfloor x \rceil$ is the rounding operator. To avoid negative array indexes we define the shifted window function:
+\begin{equation}
+U(k) = W(k-N_{dft}/2)
+\end{equation}
+such that $U(N_{dft}/2)=W(0)$. As $w(n)$ is real and even, $W(k)$ is real and even so we can write:
+\begin{equation}
+\begin{split}
+W^* (k - \lfloor mr \rceil) &= W(k - \lfloor mr \rceil) \\
+&= U(k - \lfloor mr \rceil + N_{dft}/2) \\
+&= U(k + l) \\
+l &= N_{dft}/2 - \lfloor mr \rceil \\
+& = \lfloor N_{dft}/2 - mr \rceil
+\end{split}
+\end{equation}
+for even $N_{dft}$. We can therefore write (\ref{eq:est_amp_mbe1}) as:
\begin{equation}
\label{eq:est_amp_mbe}
-B_m = \frac{\sum_{k=a_m}^{b_m} S_w(k) W (k + \lfloor mr \rceil)}{\sum_{k=a_m}^{b_m} |W (k + \lfloor mr \rceil)|^2}
+B_m = \frac{\sum_{k=a_m}^{b_m} S_w(k) U(k + l)}{\sum_{k=a_m}^{b_m} |U (k + l)|^2}
\end{equation}
Note this procedure is different to the $A_m$ magnitude estimation procedure in (\ref{eq:mag_est}), and is only used locally for the MBE voicing estimation procedure. Unlike (\ref{eq:mag_est}), the MBE amplitude estimation (\ref{eq:est_amp_mbe}) assumes the energy in the band of $S_w(k)$ is from the DFT of a sine wave, and $B_m$ is complex valued.

The synthesised frequency domain speech for this band is defined as:
\begin{equation}
-\hat{S}_w(k) = B_m W(k + \lfloor mr \rceil), \quad k=a_m,...,b_m-1
+\hat{S}_w(k) = B_m U(k + l), \quad k=a_m,...,b_m-1
\end{equation}
The error between the input and synthesised speech in this band is then:
\begin{equation}
\begin{split}
E_m &= \sum_{k=a_m}^{b_m-1} |S_w(k) - \hat{S}_w(k)|^2 \\
-&=\sum_{k=a_m}^{b_m-1} |S_w(k) - B_m W(k + \lfloor mr \rceil)|^2
+&=\sum_{k=a_m}^{b_m-1} |S_w(k) - B_m U(k + l)|^2
\end{split}
\end{equation}
A Signal to Noise Ratio (SNR) is defined as:
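Pulling the band equations above together, the sketch below computes $B_m$, $E_m$ and a band SNR as one might for the voicing decision. It is an illustration under stated assumptions, not the Codec 2 source: the arrays Sw (complex DFT of the windowed speech) and U (the shifted, real window DFT), the fundamental Wo and the band limits a_m, b_m are assumed to be supplied by the caller, the sums run over $k=a_m,\ldots,b_m-1$ as in the error expression, and the SNR is taken as the ratio of summed band signal energy to summed band error energy, an assumed form rather than the definition given in the text.

    /* Per-band MBE amplitude and error estimation, illustration only.
     * Assumptions:
     *   Sw[]  - complex DFT of the windowed input speech, NDFT bins
     *   U[]   - real, shifted DFT of the analysis window, U[k] = W(k - NDFT/2)
     *   Wo    - fundamental frequency in radians per sample
     *   am,bm - first and one-past-last DFT bin of band m
     */

    #include <complex.h>
    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    #define NDFT 512    /* assumed DFT size, example value only */

    /* returns E_m for band m and writes the complex amplitude B_m */
    double band_error(const double complex Sw[], const double U[],
                      double Wo, int m, int am, int bm,
                      double complex *Bm_out)
    {
        double r = Wo * NDFT / (2.0 * M_PI);            /* harmonic to bin mapping */
        int l = (int)floor(NDFT / 2.0 - m * r + 0.5);   /* l = round(NDFT/2 - m*r) */

        /* B_m = sum Sw(k) U(k+l) / sum |U(k+l)|^2 */
        double complex num = 0.0;
        double den = 0.0;
        for (int k = am; k < bm; k++) {
            num += Sw[k] * U[k + l];
            den += U[k + l] * U[k + l];
        }
        double complex Bm = num / den;

        /* E_m = sum |Sw(k) - B_m U(k+l)|^2 */
        double Em = 0.0;
        for (int k = am; k < bm; k++) {
            double complex d = Sw[k] - Bm * U[k + l];
            Em += creal(d * conj(d));
        }

        *Bm_out = Bm;
        return Em;
    }

    /* assumed SNR form: summed band signal energy over summed band error */
    double voicing_snr(const double complex Sw[], const double U[], double Wo,
                       const int a[], const int b[], int nbands)
    {
        double sig = 1e-6, err = 1e-6;   /* small floors avoid divide by zero */
        for (int m = 1; m <= nbands; m++) {
            double complex Bm;
            err += band_error(Sw, U, Wo, m, a[m], b[m], &Bm);
            for (int k = a[m]; k < b[m]; k++)
                sig += creal(Sw[k] * conj(Sw[k]));
        }
        return 10.0 * log10(sig / err);
    }

A natural usage is to sum $E_m$ over the bands up to some cutoff frequency and declare the frame voiced when the resulting SNR exceeds a threshold; the threshold and band range used by Codec 2 are not assumed here.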
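As a quick numerical check of the index mapping used above (the values are example assumptions, not taken from the text): with $N_{dft}=512$, $F_s=8$ kHz and $F_0=50$ Hz we have $r = \omega_0 N_{dft}/2 \pi = F_0 N_{dft}/F_s = 3.2$, and for the $m=7$ harmonic:
\begin{equation*}
\begin{split}
\lfloor mr \rceil &= \lfloor 22.4 \rceil = 22 \\
l &= N_{dft}/2 - \lfloor mr \rceil = 256 - 22 = 234 = \lfloor 256 - 22.4 \rceil \\
U(k + l) &= W(k + 234 - 256) = W(k - 22) = W(k - \lfloor mr \rceil)
\end{split}
\end{equation*}
so the two expressions for $l$ agree, and $U(k+l)$ addresses the same window samples as $W(k - \lfloor mr \rceil)$ while keeping all array indexes non-negative.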