Amplitude Modelling using Neural Nets


David f529237fd2 first pass output from eband_vq_am, 3/4 test samples reasonable		2021-01-03 20:20:32 +10:30
doc	README edits	2020-11-01 15:09:58 +10:30
wav	replace out-of-database sample experiment with samples inside database	2020-01-23 07:07:17 +10:30
README.md	Update README.md	2020-12-29 06:36:40 +10:30
codec2_model.py	first pass at combined VQVAE and rate K->L, running but reults poor	2021-01-01 11:51:52 +10:30
eband_out.py	better pager	2020-12-31 13:49:56 +10:30
eband_synth_one.sh	helper script for processing eband stuff	2020-12-29 16:09:49 +10:30
eband_train.py	helper script for processing eband stuff	2020-12-29 16:09:49 +10:30
eband_vq_am.py	first pass output from eband_vq_am, 3/4 test samples reasonable	2021-01-03 20:20:32 +10:30
eband_vq_am_out.py	first pass output from eband_vq_am, 3/4 test samples reasonable	2021-01-03 20:20:32 +10:30
eband_vq_am_synth_one.sh	first pass output from eband_vq_am, 3/4 test samples reasonable	2021-01-03 20:20:32 +10:30
ebanddec_train.py	…
hyper_params.sh	removed experienced_8k and added pencil, type, and 700C. Results were no clear improvement over 700C	2020-01-23 07:09:04 +10:30
impulse_test.sh	eband->Am experiments using timpulse	2021-01-03 09:07:47 +10:30
model_to_sparse.py	eband->Am experiments using timpulse	2021-01-03 09:07:47 +10:30
newamp1_train.py	…
rateK_train.py	…
test_vq_ewma.py	setting VQ entry externally	2020-11-23 17:47:44 +10:30
vq_kmeans.py	saving and loading NN weights and VQ	2020-12-23 10:44:13 +10:30
vq_kmeans_demo.py	shorter names	2020-11-27 18:42:18 +10:30
vq_pager.py	feature to compare two VQ files	2020-01-23 07:08:26 +10:30
vq_vae_conv1d_2stage.py	modified to remove mean energy, might change back	2020-11-27 18:43:32 +10:30
vq_vae_demo.py	typo	2020-11-23 17:48:09 +10:30
vq_vae_demo_2stage.py	clean up and pager improvements	2020-10-31 10:28:08 +10:30
vq_vae_keras_mnist_example_orig.py	modified for TF2, possible bug in e_latent_loss formation for 2 stages	2020-10-29 19:06:47 +10:30
vq_vae_kmeans_conv1d.py	first pass at vqvae->eband, muffled forr males	2020-12-31 12:37:46 +10:30
vq_vae_kmeans_conv1d_out.py	first pass at vqvae->eband, muffled forr males	2020-12-31 12:37:46 +10:30
vq_vae_kmeans_demo.py	added dense layer at input, plot encoder space	2020-11-30 07:25:09 +10:30
vq_vae_ratek.py	plotting VQ entry scatter on every epoch, to see VQ evolve	2020-03-09 11:17:53 +10:30
vq_vae_ratek_conv1d.py	conv1D VQ-VAE stuck at about 15 dB*dB	2020-03-10 15:48:05 +10:30
vqvae_eband_synth_one.sh	first pass at vqvae->eband, muffled forr males	2020-12-31 12:37:46 +10:30
vqvae_models.py	first pass output from eband_vq_am, 3/4 test samples reasonable	2021-01-03 20:20:32 +10:30
vqvae_synth_one.sh	phase0 output file	2020-12-29 13:15:44 +10:30
vqvae_util.py	refactoring into vqvae_util.py	2020-12-27 15:52:38 +10:30

README.md

Spectral Amplitude Quantisation using NNs

Experiments with time/frequency sample rate conversion and VQ-VAE (Vector Quantised Variational Autoencoder) for quantising the speech spectrum for vocoders.

This plot shows a VQ-VAE in action: The plot is a 2D histogram of the encoder space, white dots are the stage 1 VQ entries. The 16 dimensional data has been reduced to 2 dimensions using PCA. The plot was produced by vq_vae_conv1d_2stage.py

Themes and Key Points

I'm a Neural Net Noob. As a way of learning I am trying out some NN ideas to solve problems I have faced before in speech coding.
Currently using Keras 2.4.3 and TensorFlow 2.3.1.
For convenience I use Codec 2 (an old school, non NN vocoder) to "listen" to results from this work, however the cool kids are using VQ VAE with NN vocoders.
I'm using regression - the NN estimates the actual log10(Am) values, not discrete PDFs.
Making NNs work with sparse, variable rate spectral magnitude vectors, using a sparse target and custom loss functions.
Using NNs for decimation/interpolation (sub-sampling) in time, to reduce the frame rate and hence bit rate.
Extending VQ VAE to two stage vector quantisation. Multi-stage VQ is commonly used in non-NN speech coding.
The simulations (NN training and vector quantisation) work on mean square error in the log(Am) domain, which is (i) equivalent to variance and (ii) proportional to Spectral Distortion (SD) in dB^2, which is very closely correlated with subjective quality. Training is done in dB.
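To make the last point concrete: an error of e in log10(Am) is an error of 20e in dB, so the mean square error in dB^2 is 400 times the mean square error in log10(Am). The sketch below computes that error over the first L entries of a zero padded vector. The function names and the padding scheme are illustrative assumptions, not the custom loss functions used by the scripts in this repo.

```python
import math

def am_to_db(am):
    """Convert a linear harmonic amplitude to dB: 20*log10(Am)."""
    return 20.0 * math.log10(am)

def masked_mse_db2(target_db, estimate_db, L):
    """Mean square error in dB^2 over the first L entries only.

    Assumption: vectors are zero padded to a fixed width, and entries
    beyond the L harmonics of this frame must not affect the loss.
    """
    err = [(t - e) ** 2 for t, e in zip(target_db[:L], estimate_db[:L])]
    return sum(err) / L

# Made-up frame with L=3 harmonics, padded to width 5
target = [40.0, 34.0, 28.0, 0.0, 0.0]
estimate = [41.0, 32.0, 28.0, 9.9, -7.0]  # values in the padding are ignored
print(masked_mse_db2(target, estimate, L=3))
```

In a Keras training script the same idea would normally be written as a custom loss that multiplies the squared error by a mask before averaging.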

Scripts

Script	Description	Useful?
codec2_model.py	Reading and writing Codec 2 model parameters	-
eband_train.py	Constant rate K to time varying rate L, using LPCNet style K=14 vectors	No
ebanddec_train.py	Constant rate K to time varying rate L, with decimation in time	No
eband_out.py	Generates Codec 2 output from NN trained in eband_train/ebanddec_train	No
newamp1_train.py	Similar to eband_train.py, constant rate K to time varying rate L, using Codec 2 newamp1 K=20 vectors	No
vq_pager.py	Step through output rate K vectors	Yes
vq_vae_demo.py	Simple demo of VQVAE, nice visualisation of training in action	Cool demo
vq_kmeans.py	kmeans VQ training in TensorFlow	Yes
vq_kmeans_demo.py	Simple demo of kmeans VQ training in TF	Yes
vq_vae_kmeans_demo.py	Simple demo of kmeans VQVAE in TF	Yes
vq_vae_demo_2stage.py	vq_vae_demo.py extended to two stage VQ	Cool demo
vq_vae_ratek.py	Single stage VQ-VAE with single Dense layer	No
vq_vae_conv1d_2stage.py	Two stage VQ-VAE with two conv1D layers and simple/slow VQ training	Yes, reasonable spectral distortion, cool plots
vqvae_twostage.py	Two stage VQ VAE used by the scripts below	Yes
vq_vae_kmeans_conv1d.py	Two stage kmeans trained VQ-VAE with two conv1D layers	Yes
vq_vae_kmeans_conv1d_out.py	Generates output for Codec 2 from the NN trained above. Not great output at this stage :-)	Yes
vqvae_synth_one.sh	Script to generate output speech files from the above	Yes
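
Several of the scripts above use two stage vector quantisation. As a rough illustration of the idea only (not the code in these scripts), the sketch below quantises a vector with a first codebook, then quantises the remaining error (the residual) with a second codebook. The codebooks and function names are made up for the example.

```python
def nearest(codebook, vec):
    """Index of the codebook entry with the smallest squared error to vec."""
    def dist(entry):
        return sum((c - v) ** 2 for c, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def two_stage_vq(vec, cb1, cb2):
    """Quantise vec with stage 1, then quantise the stage 1 residual with stage 2."""
    i1 = nearest(cb1, vec)
    residual = [v - c for v, c in zip(vec, cb1[i1])]
    i2 = nearest(cb2, residual)
    quantised = [c1 + c2 for c1, c2 in zip(cb1[i1], cb2[i2])]
    return i1, i2, quantised

# Tiny made-up codebooks: 2 dimensions, 2 entries per stage
cb1 = [[0.0, 0.0], [10.0, 10.0]]
cb2 = [[-1.0, 1.0], [1.0, -1.0]]
i1, i2, vq_out = two_stage_vq([10.8, 9.1], cb1, cb2)
print(i1, i2, vq_out)
```

Only the two indexes need to be transmitted. Two small codebooks are much cheaper to store and search than a single codebook with the same total number of bits.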

Amplitude Sample Rate Conversion Using Neural Nets

Some of the scripts in this repo (eband_*.py, newamp1_train.py) explore the use of Neural Networks (NN) in resampling between rate K and rate L for Codec 2.

Codec 2 models speech as a harmonic series of sine waves, each with its own frequency, amplitude and phase. The frequencies are approximated as harmonics of the pitch or fundamental frequency. A reasonable model of the phases can be recovered from the amplitudes.
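
The harmonic model in the paragraph above can be written as s(n) = sum over m of Am*cos(m*Wo*n + phi_m), where Wo = 2*pi*F0/Fs is the fundamental in radians per sample. The sketch below evaluates that sum directly for one frame. It is a toy illustration with made-up amplitudes and zero phases, assuming an 8 kHz sample rate, and is not Codec 2's actual synthesiser.

```python
import math

def synth_frame(amps, phases, f0_hz, n_samples, fs=8000):
    """Sum of harmonics: s(n) = sum_m A_m * cos(m*Wo*n + phi_m), Wo = 2*pi*f0/fs."""
    wo = 2.0 * math.pi * f0_hz / fs
    return [
        sum(a * math.cos((m + 1) * wo * n + p)
            for m, (a, p) in enumerate(zip(amps, phases)))
        for n in range(n_samples)
    ]

# Made-up frame: three harmonics of a 100 Hz fundamental, all phases zero
frame = synth_frame(amps=[1.0, 0.5, 0.25], phases=[0.0, 0.0, 0.0],
                    f0_hz=100.0, n_samples=80)
print(frame[:4])
```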

Accurate representation of the sine wave amplitudes {Am} m=1...L is important for good quality speech. The number of amplitudes in each frame, L, depends on the pitch (L=P/2, where P is the pitch period in samples) and is time varying (typically between L=10 and L=80). However for transmission at a fixed bit rate, a fixed number of parameters is desirable.
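
As a quick check of the L=P/2 relationship, the sketch below computes the number of harmonics for a few pitch values, assuming P is the pitch period in samples and an 8 kHz sample rate. The function name is made up for the example.

```python
FS = 8000  # assumed sample rate in Hz

def num_harmonics(f0_hz, fs=FS):
    """Number of harmonics L for pitch f0_hz: L = floor(P/2), with P = fs/f0."""
    pitch_period = fs / f0_hz      # P, in samples
    return int(pitch_period // 2)  # L = floor(P/2)

for f0 in (50, 100, 200, 400):
    print(f0, num_harmonics(f0))
```

With these assumptions a 50 Hz pitch gives L=80 and a 400 Hz pitch gives L=10, which matches the typical range quoted above.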

In earlier Codec 2 modes such as 3200 down to 1200, the amplitudes were represented using a fixed number of Linear Prediction Coefficients. In more recent modes such as 700C, the amplitudes Am are resampled to a fixed sample rate (K=20), and vector quantised for transmission. At the decoder, the rate K amplitude samples are resampled back to rate L for synthesis. The K=20 vectors use mel spaced sampling so these vectors are similar to the mel-spaced MFCCs used by the NN community.
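
The rate L to rate K resampling and back can be sketched as below. This is a simplified illustration only: it uses linear interpolation in the dB domain, the common 2595*log10(1 + f/700) mel formula, and an assumed band of 200 to 3700 Hz. The interpolation method and band edges actually used by Codec 2 may differ, and all names and data here are made up for the example.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spaced_freqs(K=20, f_lo=200.0, f_hi=3700.0):
    """K frequencies in Hz, evenly spaced on the mel scale (band edges are assumptions)."""
    m_lo, m_hi = hz_to_mel(f_lo), hz_to_mel(f_hi)
    step = (m_hi - m_lo) / (K - 1)
    return [mel_to_hz(m_lo + k * step) for k in range(K)]

def interp(x, xs, ys):
    """Piecewise linear interpolation of (xs, ys) at x, clamped at both ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + t * (ys[i] - ys[i - 1])

def resample(src_freqs, src_db, dst_freqs):
    """Resample a dB spectrum from src_freqs to dst_freqs."""
    return [interp(f, src_freqs, src_db) for f in dst_freqs]

# Made-up frame: F0 = 200 Hz at Fs = 8 kHz, so L = 20 harmonics
f0, L = 200.0, 20
harm_freqs = [m * f0 for m in range(1, L + 1)]
am_db = [60.0 - 0.01 * f for f in harm_freqs]        # made-up spectral tilt

rate_k_freqs = mel_spaced_freqs()
rate_k = resample(harm_freqs, am_db, rate_k_freqs)   # encoder: rate L -> rate K
rate_l = resample(rate_k_freqs, rate_k, harm_freqs)  # decoder: rate K -> rate L
print(len(rate_k), len(rate_l))
```

Whatever the pitch, the encoder side always produces K values, which is what makes a fixed size vector quantiser possible. Harmonics above the top rate K frequency are simply held at the last value in this sketch.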