LPCNet for FreeDV

Experimental version of LPCNet being developed for over-the-air Digital Voice experiments with FreeDV.

Reading Further

  1. Original LPCNet Repo with more instructions and background
  2. LPCNet: DSP-Boosted Neural Speech Synthesis
  3. Sample model files

Quantiser Experiments

First build the lpcnet_test C decoder with the latest .h5 model file.

The binary files for these experiments are here.

Exploring Features

Install GNU Octave (if that's your thing).

Extract a feature file, fire up Octave, and mesh plot the 18 cepstrals for the first 100 frames (1 second):

$ ./dump_data --test speech_orig_16k.s16 speech_orig_16k_features.f32
$ cd src
$ octave --no-gui
octave:3> f=load_f32("../speech_orig_16k_features.f32",55);
nrows: 1080
octave:4> mesh(f(1:100,1:18))
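
The load_f32.m script lives in src/. If you're curious what it's doing, here's a minimal sketch, assuming the feature file is a flat array of little-endian float32 values with ncols (here 55) values per frame; the real script may differ in detail:

% load_f32.m (sketch): read a flat float32 file as an nrows x ncols matrix
function f = load_f32(fn, ncols)
  fid = fopen(fn, "rb");
  data = fread(fid, Inf, "float32");                % flat little-endian float32
  fclose(fid);
  nrows = floor(length(data)/ncols);
  printf("nrows: %d\n", nrows);
  f = reshape(data(1:nrows*ncols), ncols, nrows)';  % one frame per row
endfunction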

Uniform Quantisation

Listen to the effects of 4dB step uniform quantisation on cepstrals:

$ cat ~/Downloads/wia.wav | ./dump_data --test - - | ./quant_feat -u 4 | ./test_lpcnet - - | play -q -r 16000 -s -2 -t raw -

This lets us listen to the effect of quantisation error. Once we think it sounds OK, we can compute the variance (average squared quantiser error). A 4dB step size means the error PDF is uniform over -2 to +2 dB. A uniform PDF on [a,b] has variance (b-a)^2/12, so here (2 - (-2))^2/12 = 1.33 dB^2. We can then try to design a quantiser (e.g. a multi-stage VQ) that achieves the same variance.
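
As a quick sanity check on that 1.33 dB^2 figure, here's a toy Octave simulation of the uniform quantiser (just a simulation, not the quant_feat code):

octave:5> step = 4;                        % 4dB quantiser step size
octave:6> x = randn(1e6,1)*10;             % stand-in for cepstral values
octave:7> e = step*round(x/step) - x;      % error, roughly uniform in +/- 2dB
octave:8> var(e)                           % expect step^2/12 = 1.33 dB^2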

Training a Predictive VQ

Check out and build codec2-dev from SVN:

$ svn co https://svn.code.sf.net/p/freetel/code/codec2-dev
$ cd codec2-dev && mkdir build_linux && cd build_linux && cmake ../ && make

In train_pred2.sh, adjust PATH for the location of codec2-dev on your machine.

Generate 5E6 training vectors using the --train option of dump_data, which applies a bunch of different filters, then run the predictive VQ training script:

$ cd LPCNet
$ ./dump_data --train all_speech.s16 all_speech_features_5e6.f32 /dev/null
$ ./train_pred2.sh

The --mbest option keeps the M best candidates after each stage:

cat ~/Downloads/speech_orig_16k.s16 | ./dump_data --test - - | ./quant_feat --mbest 5 -q pred2_stage1.f32,pred2_stage2.f32,pred2_stage3.f32 > /dev/null

In this example, the VQ error variance was reduced from 2.68 to 2.28 dB^2 (I think equivalent to 3 bits), and the number of outliers >2dB reduced from 15% to 10%.
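The --mbest search is a standard M-best (multi-survivor) multi-stage VQ search: rather than greedily taking the single best entry at each stage, the M lowest-error candidates are carried forward, and the winner is chosen after the final stage. A toy Octave sketch of the idea (not the quant_feat implementation):

% Toy M-best multi-stage VQ search. vq{s} is the stage s codebook,
% one row per entry; x is a row vector. Returns the best index path.
function [inds, e2] = mbest_vq(x, vq, M)
  cands = {struct("resid", x, "inds", [])};     % start with the unquantised target
  for s = 1:length(vq)
    new = {};
    for c = 1:length(cands)
      d = sum((vq{s} - cands{c}.resid).^2, 2);  % residual energy for each entry
      [ds, ix] = sort(d);
      for k = 1:min(M, rows(vq{s}))
        new{end+1} = struct("resid", cands{c}.resid - vq{s}(ix(k),:), ...
                            "inds", [cands{c}.inds, ix(k)], "e2", ds(k));
      end
    end
    [~, order] = sort(cellfun(@(c) c.e2, new)); % prune to the M best survivors
    cands = new(order(1:min(M, length(new))));
  end
  inds = cands{1}.inds; e2 = cands{1}.e2;
endfunction

With M=1 this collapses to a conventional greedy multi-stage search; larger M trades search time for lower final error, which is where the variance reduction above comes from.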

Streaming of WIA broadcast material

An interesting mix of speakers and recording conditions, including some not-so-great microphones, with faster speech than the training material.

Basic unquantised LPCNet model:

sox -r 16000 ~/Downloads/wianews-2019-01-20.s16 -t raw - trim 200 | ./dump_data --c2pitch --test - - | ./test_lpcnet - - | aplay -f S16_LE -r 16000

Fully quantised at (44+8)/0.03 = 1733 bits/s:

sox -r 16000 ~/Downloads/wianews-2019-01-20.s16 -t raw - trim 200 | ./dump_data --c2pitch --test - - | ./quant_feat -g 0.25 -o 6 -d 3 -w --mbest 5 -q pred_v2_stage1.f32,pred_v2_stage2.f32,pred_v2_stage3.f32,pred_v2_stage4.f32 | ./test_lpcnet - - | aplay -f S16_LE -r 16000

Fully quantised encoder/decoder programs

The same thing as above, with the quantisation code packaged up into library functions. Between quant_enc and quant_dec the bit stream consists of 52 bit frames every 30ms:

cat ~/Downloads/speech_orig_16k.s16 | ./dump_data --c2pitch --test - - | ./quant_enc | ./quant_dec | ./test_lpcnet - - | aplay -f S16_LE -r 16000

The same thing with everything integrated into standalone encoder and decoder programs:

cat ~/Downloads/speech_orig_16k.s16 | ./lpcnet_enc | ./lpcnet_dec | aplay -f S16_LE -r 16000

The bit stream interface is 1 bit/char, as I find that convenient for my digital voice over radio experiments. The decimation rate, number of VQ stages, and a few other parameters can be set as command line options, for example a 20ms frame rate and a 3 stage VQ (2050 bits/s):

cat ~/Downloads/speech_orig_16k.s16 | ./lpcnet_enc -d 2 -n 3 | ./lpcnet_dec -d 2 -n 3 | aplay -f S16_LE -r 16000

You'll need to use the same set of parameters for the encoder and decoder.
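
To see where these bit rates come from: each VQ stage contributes 11 bits, and I'm assuming the remaining 8 bits cover pitch and energy, consistent with the (44+8) bit frames quoted above. The rate is then just bits/frame divided by the frame period:

octave:1> (4*11 + 8)/0.030    % 4 stage VQ, 30ms frames (-d 3)
ans = 1733.3
octave:2> (3*11 + 8)/0.020    % 3 stage VQ, 20ms frames (-d 2 -n 3)
ans = 2050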

Useful additions would be:

  1. Run time loading of .h5 NN models.
  2. A --packed option to pack the quantised bits tightly, which would make the programs useful for storage applications.
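
As a sketch of what a --packed option might look like, here's a hypothetical pack_bits helper (not code from this repo) that packs the 1 bit/char stream into bytes:

% pack_bits (hypothetical): pack a vector of 0/1 values into bytes, MSB first
function bytes = pack_bits(bits)
  bits = bits(:);
  bits = [bits; zeros(mod(-length(bits), 8), 1)];   % zero-pad to a byte boundary
  B = reshape(bits, 8, [])';                        % one row per output byte
  bytes = uint8(B * (2.^(7:-1:0))');                % MSB-first bit weights
endfunction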

Direct Split VQ

Four stage VQ of the log magnitudes (Ly): 11 bits (2048 entries) per stage, the first 3 stages 18 elements wide, the final stage 12 elements wide. During training this achieved a similar variance to the 4 stage predictive scheme below (on 12 bands). Same bit rate, but direct quantisation is more robust to bit errors, and especially to packet loss.

sox ~/Desktop/deep/quant/wia.wav -t raw - | ./dump_data --c2pitch --test - - | ./quant_feat -d 3 -i -p 0 --mbest 5 -q split_stage1.f32,split_stage2.f32,split_stage3.f32,split_stage4.f32 | ./test_lpcnet - - | aplay -f S16_LE -r 16000

Four stage VQ of Cepstrals (DCT of Ly), 11 bits (2048 entries) per stage, 18 element wide vectors. We quantise the predictor output.

sox ~/Desktop/deep/quant/wia.wav -t raw -  | ./dump_data --c2pitch --test - - | ./quant_feat -d 3 -w --mbest 5 -q pred_v2_stage1.f32,pred_v2_stage2.f32,pred_v2_stage3.f32,pred_v2_stage4.f32 | ./test_lpcnet - - | aplay -f S16_LE -r 16000

Both are decimated by a factor of 3 (so a 30ms parameter update rate, and (44+8)/0.03 = 1733 bits/s).