LPCNet
Low complexity implementation of the WaveRNN-based LPCNet algorithm, as described in:
J.-M. Valin, J. Skoglund, LPCNet: Improving Neural Speech Synthesis Through Linear Prediction, Submitted for ICASSP 2019, arXiv:1810.11846.
Introduction
Work-in-progress software for researching low-CPU-complexity algorithms for speech synthesis and compression by applying Linear Prediction techniques to WaveRNN. High-quality speech can be synthesised on regular CPUs (around 3 GFLOP) with SIMD support (AVX, AVX2/FMA and NEON currently supported).
The BSD licensed software is written in C and Python/Keras. For training, a GTX 1080 Ti or better is recommended.
This software is an open source starting point for WaveRNN-based speech synthesis and coding.
Quickstart
- Set up a Keras system with GPU.
- Generate training data:
make dump_data
./dump_data --train input.s16 features.f32 data.u8
where the first file contains 16 kHz 16-bit raw PCM audio (no header) and the other files are output files. This program makes several passes over the data with different filters to generate a large amount of training data.
- Now that you have your files, train with:
./train_lpcnet.py features.f32 data.u8
and it will generate a wavenet*.h5 file for each iteration. If it stops with a "Failed to allocate RNN reserve space" message, try reducing the batch_size variable in train_wavenet_audio.py.
- You can synthesise speech with Python and your GPU card:
./dump_data --test test_input.s16 test_features.f32
./test_lpcnet.py test_features.f32 test.s16
Note that the .h5 filename is hard coded in test_lpcnet.py; modify it to point at your .h5 file. A quick sanity check of the feature file is sketched in Python just after this list.
- Or with C on a CPU. First extract the model files nnet_data.h and nnet_data.c:
./dump_lpcnet.py lpcnet15_384_10_G16_64.h5
Then you can make the C synthesiser and try synthesising from a test feature file:
make test_lpcnet
./dump_data --test test_input.s16 test_features.f32
./test_lpcnet test_features.f32 test.s16
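Whichever path you take, it can help to sanity-check the feature file before synthesis. Here is a minimal Python sketch, assuming the same layout as the Octave example further down (55 float32 values per frame, one frame every 10 ms); the filename is just an example:
import numpy as np

# Assumed layout: rows of 55 float32 features, one row per 10 ms frame
# (matches load_f32(..., 55) in the Octave example below).
FEATURES_PER_FRAME = 55
FRAME_PERIOD_S = 0.01

feats = np.fromfile("test_features.f32", dtype=np.float32)
assert feats.size % FEATURES_PER_FRAME == 0, "file size is not a multiple of 55 floats"
feats = feats.reshape(-1, FEATURES_PER_FRAME)
print("frames:", feats.shape[0])
print("duration (s):", feats.shape[0] * FRAME_PERIOD_S)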
Speech Material for Training
Suitable training material can be obtained from the McGill University Telecommunications & Signal Processing Laboratory. Download the ISO and extract the 16k-LP7 directory; the src/concat.sh script can be used to generate a headerless file of training samples:
cd 16k-LP7
sh /path/to/concat.sh
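For reference, a rough Python equivalent of that step is sketched below. It assumes the extracted 16k-LP7 directory contains 16 kHz 16-bit mono WAV files; if the files are in another format, stick with src/concat.sh.
import glob
import wave

# Sketch: concatenate WAV files into one headerless 16-bit PCM file for dump_data --train.
# Assumes 16 kHz, 16-bit, mono input (checked below); not a drop-in replacement for concat.sh.
with open("all_speech.s16", "wb") as out:
    for name in sorted(glob.glob("16k-LP7/**/*.wav", recursive=True)):
        with wave.open(name, "rb") as w:
            assert w.getframerate() == 16000 and w.getsampwidth() == 2 and w.getnchannels() == 1
            out.write(w.readframes(w.getnframes()))   # WAV header stripped, raw samples kept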
Reading Further
- LPCNet: DSP-Boosted Neural Speech Synthesis
- Sample model files: https://jmvalin.ca/misc_stuff/lpcnet_models/
David's Quantiser Experiments
First build the C test_lpcnet decoder with the latest .h5 file (see the Quickstart above).
Binary files for these experiments are available here.
Exploring Features
Install GNU Octave (if that's your thing).
Extract a feature file, fire up Octave, and mesh plot the 18 cepstrals for the first 100 frames (1 second):
$ ./dump_data --test speech_orig_16k.s16 speech_orig_16k_features.f32
$ cd src
$ octave --no-gui
octave:3> f=load_f32("../speech_orig_16k_features.f32",55);
nrows: 1080
octave:4> mesh(f(1:100,1:18))
Uniform Quantisation
Listen to the effects of 4dB step uniform quantisation on cepstrals:
$ cat ~/Downloads/wia.wav | ./dump_data --test - - | ./quant_feat -u 4 | ./test_lpcnet - - | play -q -r 16000 -s -2 -t raw -
This lets us listen to the effect of quantisation error. Once we think it sounds OK, we can compute the variance (average squared quantiser error). A 4 dB step size means the error PDF is uniform over the range -2 to +2 dB. A uniform PDF on [a,b] has a variance of (b-a)^2/12, so here (2-(-2))^2/12 = 16/12 ≈ 1.33 dB^2. We can then try to design a quantiser (e.g. a multi-stage VQ) that achieves that variance.
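That figure is easy to check numerically. A short Python sketch (the input values are synthetic and just for illustration):
import numpy as np

# Simulate a uniform quantiser with a 4 dB step and compare the measured error
# variance against the (b-a)^2/12 figure above.
step = 4.0                                  # quantiser step size, dB
x = np.random.uniform(-50, 50, 1000000)     # synthetic cepstral-like values, dB
err = x - step * np.round(x / step)         # quantisation error, uniform on [-2, +2] dB
print("measured variance:", err.var())      # ~1.33 dB^2
print("theory (b-a)^2/12:", step**2 / 12)   # 16/12 = 1.33 dB^2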
Training a Predictive VQ
Checkout and build codec2-dev from SVN:
$ svn co https://svn.code.sf.net/p/freetel/code/codec2-dev
$ cd codec2-dev && mkdir build_linux && cd build_linux && cmake ../ && make
In train_pred2.sh, adjust PATH for the location of codec2-dev on your machine.
Generate 5E6 vectors using the --train option on dump_data to apply a bunch of different filters, then run the predictive VQ training script:
$ cd LPCNet
$ ./dump_data --train all_speech.s16 all_speech_features_5e6.f32 /dev/null
$ ./train_pred2.sh
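The idea behind predictive VQ: predict each frame's features from the previously decoded frame and quantise only the prediction residual, so the codebook only has to cover frame-to-frame changes. A toy single-stage Python sketch (the prediction coefficient, codebook and input here are synthetic placeholders, not what train_pred2.sh produces):
import numpy as np

# Toy predictive VQ over 18-dimensional feature vectors (illustrative only).
rng = np.random.default_rng(0)
frames = np.cumsum(rng.normal(size=(500, 18)), axis=0) * 0.1   # fake slowly varying features
alpha = 0.9                                    # assumed prediction coefficient
codebook = rng.normal(size=(64, 18))           # stand-in for a trained stage-1 codebook

prev = np.zeros(18)                            # previously decoded frame
total = 0.0
for f in frames:
    pred = alpha * prev                        # predict this frame from the last decoded one
    residual = f - pred
    i = np.argmin(((codebook - residual) ** 2).sum(axis=1))   # nearest codebook entry
    decoded = pred + codebook[i]               # what the decoder reconstructs
    total += ((f - decoded) ** 2).mean()
    prev = decoded                             # predict from quantised values, like the decoder
print("mean squared error per coefficient:", total / len(frames))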
Mbest VQ search
Keeps M best candidates after each stage:
cat ~/Downloads/speech_orig_16k.s16 | ./dump_data --test - - | ./quant_feat --mbest 5 -q pred2_stage1.f32,pred2_stage2.f32,pred2_stage3.f32 > /dev/null
In this example, the VQ error variance was reduced from 2.68 to 2.28 dB^2 (I think equivalent to 3 bits), and the number of outliers >2dB reduced from 15% to 10%.
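For reference, the search behind --mbest works like this: a plain multi-stage search keeps only the single best entry at each stage, which can be locally optimal but globally poor, while an M-best search carries the M lowest-error candidates through every stage and picks the overall winner at the end. A rough Python sketch (the codebooks here are random placeholders, not the pred2_stage*.f32 files):
import numpy as np

def mbest_msvq(x, codebooks, M=5):
    # Multi-stage VQ search keeping the M best candidates after each stage (sketch).
    # Each candidate is (indices_so_far, residual); the residual energy is the total
    # quantisation error so far, so it is also the ranking metric.
    cands = [([], x.copy())]
    for cb in codebooks:
        nxt = []
        for idxs, res in cands:
            d = ((cb - res) ** 2).sum(axis=1)      # error if each entry is picked at this stage
            for i in np.argsort(d)[:M]:            # expand the M closest entries per candidate
                nxt.append((idxs + [int(i)], res - cb[i]))
        cands = sorted(nxt, key=lambda c: (c[1] ** 2).sum())[:M]   # prune back to M overall
    idxs, res = cands[0]
    return idxs, (res ** 2).sum()

rng = np.random.default_rng(0)
stages = [rng.normal(size=(256, 18)) for _ in range(3)]            # placeholder codebooks
print(mbest_msvq(rng.normal(size=18), stages, M=5))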
Streaming of WIA broadcast material
An interesting mix of speakers and recording conditions, including some not-so-great microphones, and faster speech than the training material.
Basic unquantised LPCNet model:
sox -r 16000 ~/Downloads/wianews-2019-01-20.s16 -t raw - trim 200 | ./dump_data --c2pitch --test - - | ./test_lpcnet - - | aplay -f S16_LE -r 16000
Fully quantised at (44+8)/0.03 = 1733 bits/s:
sox -r 16000 ~/Downloads/wianews-2019-01-20.s16 -t raw - trim 200 | ./dump_data --c2pitch --test - - | ./quant_feat -g 0.25 -o 6 -d 3 -w --mbest 5 -q pred_v2_stage1.f32,pred_v2_stage2.f32,pred_v2_stage3.f32,pred_v2_stage4.f32 | ./test_lpcnet - - | aplay -f S16_LE -r 16000
The same thing with the quantisation code packaged up into library functions. Between quant_enc and quant_dec, 52-bit frames are passed every 30 ms:
cat ~/Downloads/speech_orig_16k.s16 | ./dump_data --c2pitch --test - - | ./quant_enc | ./quant_dec | ./test_lpcnet - - | aplay -f S16_LE -r 16000
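The bit rates quoted above are just bits per frame divided by the frame period, e.g.:
# 52 bits (the 44 + 8 quoted above) sent every 30 ms:
frame_bits = 44 + 8
frame_period_s = 0.03
print(frame_bits / frame_period_s)   # ~1733 bits/s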