Amplitude Modelling using Neural Nets
 
 
David f529237fd2 first pass output from eband_vq_am, 3/4 test samples reasonable 2021-01-03 20:20:32 +10:30
doc README edits 2020-11-01 15:09:58 +10:30
wav replace out-of-database sample experiment with samples inside database 2020-01-23 07:07:17 +10:30
README.md Update README.md 2020-12-29 06:36:40 +10:30
codec2_model.py first pass at combined VQVAE and rate K->L, running but reults poor 2021-01-01 11:51:52 +10:30
eband_out.py better pager 2020-12-31 13:49:56 +10:30
eband_synth_one.sh helper script for processing eband stuff 2020-12-29 16:09:49 +10:30
eband_train.py helper script for processing eband stuff 2020-12-29 16:09:49 +10:30
eband_vq_am.py first pass output from eband_vq_am, 3/4 test samples reasonable 2021-01-03 20:20:32 +10:30
eband_vq_am_out.py first pass output from eband_vq_am, 3/4 test samples reasonable 2021-01-03 20:20:32 +10:30
eband_vq_am_synth_one.sh first pass output from eband_vq_am, 3/4 test samples reasonable 2021-01-03 20:20:32 +10:30
ebanddec_train.py
hyper_params.sh removed experienced_8k and added pencil, type, and 700C. Results were no clear improvement over 700C 2020-01-23 07:09:04 +10:30
impulse_test.sh eband->Am experiments using timpulse 2021-01-03 09:07:47 +10:30
model_to_sparse.py eband->Am experiments using timpulse 2021-01-03 09:07:47 +10:30
newamp1_train.py
rateK_train.py
test_vq_ewma.py setting VQ entry externally 2020-11-23 17:47:44 +10:30
vq_kmeans.py saving and loading NN weights and VQ 2020-12-23 10:44:13 +10:30
vq_kmeans_demo.py shorter names 2020-11-27 18:42:18 +10:30
vq_pager.py feature to compare two VQ files 2020-01-23 07:08:26 +10:30
vq_vae_conv1d_2stage.py modified to remove mean energy, might change back 2020-11-27 18:43:32 +10:30
vq_vae_demo.py typo 2020-11-23 17:48:09 +10:30
vq_vae_demo_2stage.py clean up and pager improvements 2020-10-31 10:28:08 +10:30
vq_vae_keras_mnist_example_orig.py modified for TF2, possible bug in e_latent_loss formation for 2 stages 2020-10-29 19:06:47 +10:30
vq_vae_kmeans_conv1d.py first pass at vqvae->eband, muffled forr males 2020-12-31 12:37:46 +10:30
vq_vae_kmeans_conv1d_out.py first pass at vqvae->eband, muffled forr males 2020-12-31 12:37:46 +10:30
vq_vae_kmeans_demo.py added dense layer at input, plot encoder space 2020-11-30 07:25:09 +10:30
vq_vae_ratek.py plotting VQ entry scatter on every epoch, to see VQ evolve 2020-03-09 11:17:53 +10:30
vq_vae_ratek_conv1d.py conv1D VQ-VAE stuck at about 15 dB*dB 2020-03-10 15:48:05 +10:30
vqvae_eband_synth_one.sh first pass at vqvae->eband, muffled forr males 2020-12-31 12:37:46 +10:30
vqvae_models.py first pass output from eband_vq_am, 3/4 test samples reasonable 2021-01-03 20:20:32 +10:30
vqvae_synth_one.sh phase0 output file 2020-12-29 13:15:44 +10:30
vqvae_util.py refactoring into vqvae_util.py 2020-12-27 15:52:38 +10:30

README.md

Spectral Amplitude Quantisation using NNs

Experiments with time/frequency sample rate conversion and VQ-VAE (Vector Quantised Variational Autoencoder) for quantising the speech spectrum for vocoders.

This plot shows a VQ-VAE in action (figure: PCA of VQVAE encoder space). The plot is a 2D histogram of the encoder space; the white dots are the stage 1 VQ entries. The 16 dimensional data has been reduced to 2 dimensions using PCA. The plot was produced by vq_vae_conv1d_2stage.py.

Themes and Key Points

  • I'm a Neural Net Noob. As a way of learning, I am trying out some NN ideas to solve problems I have faced before in speech coding.
  • Currently using Keras 2.4.3 and Tensorflow 2.3.1
  • For convenience I use Codec 2 (an old school, non NN vocoder) to "listen" to results from this work, however the cool kids are using VQ VAE with NN vocoders.
  • I'm using regression - the NN estimates the actual log10(Am) values, not discrete PDFs.
  • Making NNs work with sparse, variable rate spectral magnitude vectors, using a sparse target and custom loss functions.
  • Using NNs for decimation/interpolation (sub-sampling) in time, to reduce the frame rate and hence bit rate.
  • Extending VQ VAE to two stage vector quantisation. Multi-stage VQ is commonly used in non-NN speech coding.
  • The simulations (NN training and vector quantisation) work on mean square error in the log(Am) domain, which is (i) equivalent to variance and (ii) proportional to Spectral Distortion (SD) in dB^2, a measure that is very closely correlated with subjective quality. The NN trains in dB.
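The sparse target and the dB^2 measure in the points above can be illustrated with a small sketch. It is not the loss function used by the scripts in this repo: the function names, the zero padded layout, and the mask are assumptions made for illustration. It assumes Am is a linear amplitude, so dB = 20*log10(Am) and an MSE measured on log10(Am) scales by 20^2 = 400 to give dB^2.

```python
def masked_mse_log10(target, estimate, mask):
    """Mean square error over the entries where mask is 1 (the valid harmonics).

    Entries with mask 0 are padding and contribute nothing to the loss.
    """
    num = sum(w * (t - e) ** 2 for t, e, w in zip(target, estimate, mask))
    den = sum(mask)
    return num / den


def sd_db2(mse_log10):
    """Convert an MSE measured on log10(Am) to dB^2, assuming dB = 20*log10(Am)."""
    return (20.0 ** 2) * mse_log10


# Hypothetical frame: L=3 valid harmonics stored in a vector padded to length 5.
target = [2.0, 1.5, 1.0, 0.0, 0.0]    # log10(Am), zero padded
estimate = [2.1, 1.4, 1.0, 0.7, 0.3]  # NN output; the padded region is ignored
mask = [1, 1, 1, 0, 0]

mse = masked_mse_log10(target, estimate, mask)
sd = sd_db2(mse)
print(f"MSE in log10 domain: {mse:.5f}, SD: {sd:.3f} dB^2")
```

Because the mask removes the padded entries from both the numerator and the denominator, frames with different L are averaged over their own number of harmonics rather than over the fixed vector length.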

Scripts

| Script | Description | Useful? |
| --- | --- | --- |
| codec2_model.py | Reading and writing Codec 2 model parameters | - |
| eband_train.py | Constant rate K to time varying rate L, using LPCNet style K=14 vectors | No |
| ebanddec_train.py | Constant rate K to time varying rate L, with decimation in time | No |
| eband_out.py | Generates Codec 2 output from NN trained in eband_train/ebanddec_train | No |
| newamp1_train.py | Similar to eband_train.py, constant rate K to time varying rate L, using Codec 2 newamp1 K=20 vectors | No |
| vq_pager.py | Step through output rate K vectors | Yes |
| vq_vae_demo.py | Simple demo of VQVAE, nice visualisation of training in action | Cool demo |
| vq_kmeans.py | kmeans VQ training in TensorFlow | Yes |
| vq_kmeans_demo.py | Simple demo of kmeans VQ training in TF | Yes |
| vq_vae_kmeans_demo.py | Simple demo of kmeans VQVAE in TF | Yes |
| vq_vae_demo_2stage.py | vq_vae_demo.py extended to two stage VQ | Cool demo |
| vq_vae_ratek.py | Single stage VQ-VAE with single Dense layer | No |
| vq_vae_conv1d_2stage.py | Two stage VQ-VAE with two conv1D layers and simple/slow VQ training | Yes, reasonable spectral distortion, cool plots |
| vqvae_twostage.py | Two stage VQ VAE used by the scripts below | Yes |
| vq_vae_kmeans_conv1d.py | Two stage kmeans trained VQ-VAE with two conv1D layers | Yes |
| vq_vae_kmeans_conv1d_out.py | Generates output for Codec 2 from the NN trained above. Not great output at this stage :-) | Yes |
| vqvae_synth_one.sh | Script to generate output speech files from the above | Yes |
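Several of the scripts above use two stage vector quantisation. As a rough sketch of the general technique (not the code in these scripts), stage 1 picks the nearest codebook entry, stage 2 quantises the residual left over, and the decoder sums the two entries. The codebooks, the function names, and the test vector below are made up for illustration.

```python
def nearest(codebook, vec):
    """Index of the codebook entry with the smallest squared error to vec."""
    def sq_err(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: sq_err(codebook[i]))


def quantise_two_stage(vec, cb1, cb2):
    """Return the (stage 1, stage 2) indexes for vec."""
    i1 = nearest(cb1, vec)
    # Stage 2 quantises what stage 1 could not represent
    residual = [v - c for v, c in zip(vec, cb1[i1])]
    i2 = nearest(cb2, residual)
    return i1, i2


def reconstruct(i1, i2, cb1, cb2):
    """Decoder side: sum the selected entries from each stage."""
    return [a + b for a, b in zip(cb1[i1], cb2[i2])]


# Hypothetical 2 dimensional codebooks, chosen by hand rather than trained
cb1 = [[0.0, 0.0], [4.0, 4.0]]
cb2 = [[-0.5, 0.5], [0.5, 0.5], [0.0, 0.0]]

vec = [4.4, 4.6]
i1, i2 = quantise_two_stage(vec, cb1, cb2)
vec_hat = reconstruct(i1, i2, cb1, cb2)
print(i1, i2, vec_hat)
```

Only the two indexes need to be transmitted, so two small codebooks can cover the space that would otherwise need one much larger codebook.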

Amplitude Sample Rate Conversion Using Neural Nets

Some of the scripts in this repo (eband_*.py, newamp1_train.py) explore the use of Neural Networks (NN) in resampling between rate K and rate L for Codec 2.

Codec 2 models speech as a harmonic series of sine waves, each with its own frequency, amplitude and phase. The frequencies are approximated as harmonics of the pitch or fundamental frequency. A reasonable model of the phases can be recovered from the amplitudes.
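As a minimal sketch of the harmonic sine wave model described above, one frame can be synthesised by summing L cosines at multiples of the fundamental. This is not Codec 2's synthesiser: the function name, the 8 kHz sample rate default, the frame length, and the example amplitudes and phases are assumptions for illustration.

```python
import math


def synth_frame(amps, phases, f0_hz, fs_hz=8000, n_samples=80):
    """Sum of L harmonics: s(n) = sum over m of A_m * cos(m * w0 * n + phase_m).

    amps and phases hold one value per harmonic, m = 1...L.
    """
    w0 = 2.0 * math.pi * f0_hz / fs_hz  # fundamental in radians per sample
    frame = []
    for n in range(n_samples):
        s = sum(
            a * math.cos((m + 1) * w0 * n + p)
            for m, (a, p) in enumerate(zip(amps, phases))
        )
        frame.append(s)
    return frame


# Hypothetical frame with L=2 harmonics of a 100 Hz fundamental
frame = synth_frame(amps=[1.0, 0.5], phases=[0.0, 0.0], f0_hz=100.0)
print(len(frame), frame[0])
```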

Accurate representation of the sine wave amplitudes {Am} m=1...L is important for good quality speech. The number of amplitudes in each frame, L, depends on the pitch (L=P/2, where P is the pitch period in samples) and is time varying, typically between L=10 and L=80. However, for transmission at a fixed bit rate, a fixed number of parameters is desirable.

In earlier Codec 2 modes such as 3200 down to 1200, the amplitudes were represented using a fixed number of Linear Prediction Coefficients. In more recent modes such as 700C, the amplitudes Am are resampled to a fixed sample rate (K=20), and vector quantised for transmission. At the decoder, the rate K amplitude samples are resampled back to rate L for synthesis. The K=20 vectors use mel spaced sampling, so these vectors are similar to the mel-spaced MFCCs used by the NN community.
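The rate L to rate K resampling described above can be sketched as follows. This is an illustration only, not the Codec 2 implementation: it uses plain linear interpolation and the common 2595*log10(1 + f/700) mel formula, and the 200 to 3700 Hz sampling range, the 8 kHz sample rate, and all function names are assumptions.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_sample_freqs(K=20, f_lo=200.0, f_hi=3700.0):
    """K frequencies in Hz, evenly spaced on the mel scale (range is assumed)."""
    m_lo, m_hi = hz_to_mel(f_lo), hz_to_mel(f_hi)
    step = (m_hi - m_lo) / (K - 1)
    return [mel_to_hz(m_lo + k * step) for k in range(K)]


def interp(x, xp, fp):
    """Linear interpolation of the points (xp, fp) at x, clamped at both ends."""
    if x <= xp[0]:
        return fp[0]
    if x >= xp[-1]:
        return fp[-1]
    for i in range(1, len(xp)):
        if x <= xp[i]:
            t = (x - xp[i - 1]) / (xp[i] - xp[i - 1])
            return fp[i - 1] + t * (fp[i] - fp[i - 1])


def rate_L_to_K(amps_db, f0_hz, K=20):
    """Resample L harmonic amplitudes (dB) to K mel spaced samples."""
    harm_hz = [(m + 1) * f0_hz for m in range(len(amps_db))]
    return [interp(f, harm_hz, amps_db) for f in mel_sample_freqs(K)]


def rate_K_to_L(vec_db, f0_hz, L):
    """Resample K mel spaced samples (dB) back to L harmonic amplitudes."""
    sample_hz = mel_sample_freqs(len(vec_db))
    return [interp((m + 1) * f0_hz, sample_hz, vec_db) for m in range(L)]


# Hypothetical frame: 100 Hz pitch at 8 kHz gives P=80 samples, so L=P/2=40
fs_hz, f0_hz = 8000, 100.0
L = int(fs_hz / f0_hz) // 2
amps_db = [30.0] * L                       # flat spectrum, for a simple check
rate_K = rate_L_to_K(amps_db, f0_hz)       # fixed length, ready for the VQ
amps_db_hat = rate_K_to_L(rate_K, f0_hz, L)
print(L, len(rate_K), len(amps_db_hat))
```

Whatever the pitch, the encoder side always produces K values, which is what makes a fixed size vector quantiser and a fixed bit rate possible.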