From 0d6129d500722022d70020a8ee2fbd603e80dc88 Mon Sep 17 00:00:00 2001
From: drowe67 <45574645+drowe67@users.noreply.github.com>
Date: Sun, 1 Dec 2019 10:25:23 +1030
Subject: [PATCH 1/7] Update README.md

---
 README.md | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 6ff6129..30f99f7 100644
--- a/README.md
+++ b/README.md
@@ -2,13 +2,21 @@
 A project to model sinusoidal codec phase spectra with neural nets.
 
-Recent breakthroughs in NN speech synthesis (WaveNet, WaveRNN, LPCNet and friends) have resulted in exciting improvements in model based synthesised speech quality. These algorithms typically use NNs to estimate the PDF of the next speech sample using a history of previous speech samples. This PDF is then sampled. As such, speech is generated on a sample by sample basis. Computational complexity is high, although steadily being reduced.
+Recent breakthroughs in NN speech synthesis (WaveNet, WaveRNN, LPCNet and friends) have resulted in exciting improvements in model based synthesised speech quality. These algorithms typically use NNs to estimate the PDF of the next speech sample conditioned on input features and a history of previously synthesised speech samples. This PDF is then sampled to obtain the next output speech sample. As the algorithms need all previous output speech samples, speech must be generated on a sample by sample basis. Computational complexity is high, although steadily being reduced.
 
-Speech codecs employing frequency domain, block based techniques such as sinusoidal transform coding can deliver high quality speech using block based synthesis. They typically synthesise speech in blocks of 10-20ms at a time (e.g. 160-320 samples at Fs=16kHz) using efficient overlap-add IDFT techniques. Sinusoidal codecs use a similar parameter set to NN based synthesis systems (amplitude spectra and pitch information).
+Speech codecs employing frequency domain, block based techniques such as sinusoidal transform coding can deliver high quality speech using block based synthesis. They typically synthesise speech in blocks of 10-20ms at a time (e.g. 160-320 samples at Fs=16kHz) using efficient overlap-add IDFT techniques. Sinusoidal codecs use a similar parameter set to the features used for NN based synthesis systems (some form of amplitude spectra, pitch information, voicing).
 
-However for high quality speech, sinusoidal codecs require a suitable set of the sinusoidal harmonic phases for each frame that is synthesised. This work aims to generate the sinusoid phases from amplitude information using NNs, in order to develop a block based NN synthesis engine based on sinusoidal coding.
+For high quality speech, sinusoidal codecs require a suitable set of the sinusoidal harmonic phases for each frame that is synthesised. This work aims to generate the sinusoid phases from amplitude information using NNs, in order to develop a block based NN synthesis engine based on sinusoidal coding.
 
-## Status (Nov 2019)
+## Status (Dec 2019)
 
-Building up techniques for modelling phase using NNs and toy speech models (2nd order filters) in a series of tests.
+Building up techniques for modelling phase using NNs and toy speech models (cascades of 2nd order filters) in a series of tests.
+
+Here is the output from [phase_test11.py](phase_test11.py). The first plot is a series of magnitude spectra of simulated speech frames. The voiced frames have two fairly sharp peaks (formants) beneath Fs/2 with structured phase consisting of linear and dispersive terms. Unvoiced frames have less sharp peaks above Fs/2, and random phases.
+
+![](example_mag.png "Magnitude Spectra")
+
+The next plot shows the original phase spectra (green), the phase spectra with an estimate of the linear phase term removed (red), and the NN output estimated phase (blue). For voiced frames, we would like green (original) and blue (NN estimate) to match. In particular we want to model the accurate phase shift across the peak of the amplitude spectra - this is the dispersive term that shifts the phase of high energy speech harmonics apart and reduces the buzzy/unnatural quality in synthesised speech. For unvoiced speech, we want the NN output (blue) to be random.
+
+![](example_phase.png "Phase Spectra")
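To make the block based synthesis described in PATCH 1 concrete, here is a minimal numpy sketch of synthesising one frame as a sum of L harmonic sinusoids. The pitch, amplitudes and phases are invented for illustration, and a real codec would blend adjacent frames with windowed overlap-add rather than emit a single block:

```python
import numpy as np

Fs = 16000                       # sample rate (Hz)
N  = 160                         # one 10ms synthesis block
F0 = 200.0                       # made-up pitch for this frame
Wo = 2 * np.pi * F0 / Fs         # fundamental, radians/sample
L  = int(np.pi // Wo)            # number of harmonics beneath Fs/2

rng = np.random.default_rng(0)
A   = 1.0 / np.arange(1, L + 1)          # made-up harmonic amplitudes
phi = rng.uniform(-np.pi, np.pi, L)      # made-up harmonic phases

n = np.arange(N)
s = np.zeros(N)
for m in range(1, L + 1):                # sum of L harmonic sinusoids
    s += A[m - 1] * np.cos(n * m * Wo + phi[m - 1])
```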
From 290323bf811abe422593773d1a15e6debe618b4b Mon Sep 17 00:00:00 2001
From: drowe67 <45574645+drowe67@users.noreply.github.com>
Date: Sun, 1 Dec 2019 10:27:46 +1030
Subject: [PATCH 2/7] Update README.md

---
 README.md | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 30f99f7..5d70952 100644
--- a/README.md
+++ b/README.md
@@ -12,11 +12,10 @@ For high quality speech, sinusoidal codecs require a suitable set of the sinusoi
 Building up techniques for modelling phase using NNs and toy speech models (cascades of 2nd order filters) in a series of tests.
 
-Here is the output from [phase_test11.py](phase_test11.py). The first plot is a series of magnitude spectra of simulated speech frames. The voiced frames have two fairly sharp peaks (formants) beneath Fs/2 with structured phase consisting of linear and dispersive terms. Unvoiced frames have less sharp peaks above Fs/2, and random phases.
+Here is the output from [phasenn_test11.py](phasenn_test11.py). The first plot is a series of magnitude spectra of simulated speech frames. The voiced frames have two fairly sharp peaks (formants) beneath Fs/2 with structured phase consisting of linear and dispersive terms. Unvoiced frames have less sharp peaks above Fs/2, and random phases.
 
 ![](example_mag.png "Magnitude Spectra")
+![](example_phase.png "Phase Spectra")
 
 The next plot shows the original phase spectra (green), the phase spectra with an estimate of the linear phase term removed (red), and the NN output estimated phase (blue). For voiced frames, we would like green (original) and blue (NN estimate) to match. In particular we want to model the accurate phase shift across the peak of the amplitude spectra - this is the dispersive term that shifts the phase of high energy speech harmonics apart and reduces the buzzy/unnatural quality in synthesised speech. For unvoiced speech, we want the NN output (blue) to be random.
 
-![](example_phase.png "Phase Spectra")
-
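The "estimate of the linear phase term removed" step mentioned in the patches can be sketched as below. This assumes harmonic phases of the form phi[m] = disp[m] - n0*m*Wo, with n0 the unknown excitation pulse position in samples, and searches for n0 with a magnitude weighted phasor sum - a plausible estimator for illustration, not necessarily the one used in the test scripts:

```python
import numpy as np

def est_n0(phi, A, Wo, n0_max=160):
    """Search for the pulse position n0: when the linear term n0*m*Wo is
    correctly removed the residual phases cluster, so the magnitude
    weighted phasor sum peaks at the best n0."""
    m = np.arange(1, len(phi) + 1)
    metric = [np.abs(np.sum(A * np.exp(1j * (phi + n0 * m * Wo))))
              for n0 in range(n0_max)]
    return int(np.argmax(metric))

def remove_linear(phi, A, Wo):
    """Return the estimated dispersive residual (the red curve in the plots)."""
    m = np.arange(1, len(phi) + 1)
    n0 = est_n0(phi, A, Wo)
    return np.angle(np.exp(1j * (phi + n0 * m * Wo)))
```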
From b3a25900a1328c3113bf7990b72cd6d2d248fe44 Mon Sep 17 00:00:00 2001
From: drowe67 <45574645+drowe67@users.noreply.github.com>
Date: Sun, 1 Dec 2019 10:55:10 +1030
Subject: [PATCH 3/7] Update README.md

---
 README.md | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 5d70952..48aca84 100644
--- a/README.md
+++ b/README.md
@@ -10,12 +10,16 @@ For high quality speech, sinusoidal codecs require a suitable set of the sinusoi
 ## Status (Dec 2019)
 
-Building up techniques for modelling phase using NNs and toy speech models (cascades of 2nd order filters) in a series of tests.
+Building up techniques for modelling phase using NNs and toy speech models (cascades of 2nd order filters excited by impulse trains) in a series of tests.
 
 Here is the output from [phasenn_test11.py](phasenn_test11.py). The first plot is a series of magnitude spectra of simulated speech frames. The voiced frames have two fairly sharp peaks (formants) beneath Fs/2 with structured phase consisting of linear and dispersive terms. Unvoiced frames have less sharp peaks above Fs/2, and random phases.
 
 ![](example_mag.png "Magnitude Spectra")
 ![](example_phase.png "Phase Spectra")
 
-The next plot shows the original phase spectra (green), the phase spectra with an estimate of the linear phase term removed (red), and the NN output estimated phase (blue). For voiced frames, we would like green (original) and blue (NN estimate) to match. In particular we want to model the accurate phase shift across the peak of the amplitude spectra - this is the dispersive term that shifts the phase of high energy speech harmonics apart and reduces the buzzy/unnatural quality in synthesised speech. For unvoiced speech, we want the NN output (blue) to be random.
+The next plot shows the dispersive component of the original phase spectra (green), the phase spectra with an estimate of the linear phase term removed (red), and the NN output estimated dispersive phase (blue). For voiced frames, we would like green (original) and blue (NN estimate) to match. In particular we want to model the rapid phase shift across the peak of the amplitude spectra - this is the dispersive term that shifts the phase of high energy speech harmonics apart and reduces the buzzy/unnatural quality in synthesised speech.
+
+When training from real world data, we have frames of phase spectra with the linear and dispersive phase components combined. We will not know the linear term, and therefore must estimate it. This simulation introduces small errors in the linear term estimation (+/-1 sample), which can lead to large phase differences at high frequencies. The red (original phase with estimated linear term removed) diverges from the true dispersive input (green), as the estimation of the linear term is not perfect. However over the training database these errors tend to have a zero mean - this simulation suggests they are being "trained out" by the NN, resulting in a reasonable model of the dispersive term (blue), albeit with some estimation "noise" at high frequencies. This HF noise may be useful, as it matches the lack of structure of HF phases in real speech.
+
+For unvoiced speech, we want the NN output phases (blue) to be random. They do not need to match the original input phase spectra. The NN appears to presever this random phase structure in this simulation. This may remove the need for a voicing estimate - voicing can be deduced from the magnitude spectra.
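A quick check of the +/-1 sample linear term error discussed in PATCH 3: a timing slip of one sample rotates harmonic m by m*Wo radians, so the phase error grows linearly with frequency (the pitch here is a made-up example value):

```python
import numpy as np

Fs, F0 = 16000, 200.0
Wo = 2 * np.pi * F0 / Fs          # fundamental, radians/sample
for m in (1, 10, 39):             # low, mid, and near Fs/2 harmonics
    err = np.degrees(1 * m * Wo)  # phase rotation for a 1 sample slip
    print(f"harmonic {m:2d}: {err:6.1f} degrees")
```

For these values a single sample slip is only about 4.5 degrees at the first harmonic but nearly 180 degrees near Fs/2, which is why the red curve diverges from green mainly at high frequencies.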
From 900fdbfa3fdd77369d0a87c937d2c1ea87ce0f99 Mon Sep 17 00:00:00 2001
From: drowe67 <45574645+drowe67@users.noreply.github.com>
Date: Sun, 1 Dec 2019 11:03:14 +1030
Subject: [PATCH 4/7] Update README.md

---
 README.md | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 48aca84..a6a354c 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,9 @@
 A project to model sinusoidal codec phase spectra with neural nets.
 
-Recent breakthroughs in NN speech synthesis (WaveNet, WaveRNN, LPCNet and friends) have resulted in exciting improvements in model based synthesised speech quality. These algorithms typically use NNs to estimate the PDF of the next speech sample conditioned on input features and a history of previously synthesised speech samples. This PDF is then sampled to obtain the next output speech sample. As the algorithms need all previous output speech samples, speech must be generated on a sample by sample basis. Computational complexity is high, although steadily being reduced.
+## Introduction
+
+Recent breakthroughs in Neural Net (NN) speech synthesis (WaveNet, WaveRNN, LPCNet and friends) have resulted in exciting improvements in model based synthesised speech quality. These algorithms typically use NNs to estimate the PDF of the next speech sample conditioned on input features and a history of previously synthesised speech samples. This PDF is then sampled to obtain the next output speech sample. As the algorithms need all previous output speech samples, speech must be generated on a sample by sample basis. Computational complexity is high, although steadily being reduced.
 
 Speech codecs employing frequency domain, block based techniques such as sinusoidal transform coding can deliver high quality speech using block based synthesis. They typically synthesise speech in blocks of 10-20ms at a time (e.g. 160-320 samples at Fs=16kHz) using efficient overlap-add IDFT techniques. Sinusoidal codecs use a similar parameter set to the features used for NN based synthesis systems (some form of amplitude spectra, pitch information, voicing).
@@ -12,6 +14,13 @@ For high quality speech, sinusoidal codecs require a suitable set of the sinusoi
 Building up techniques for modelling phase using NNs and toy speech models (cascades of 2nd order filters excited by impulse trains) in a series of tests.
 
+## Challenges
+
+1. We must represent phase (angles) in a NN. Phase is being represented by (cos(angle), sin(angle)) pairs, which, when trained, tend to develop weights that behave like complex numbers/matrix rotations.
+1. The number of phases in the sinusoidal model is time varying, based on the pitch of the current frame. This is being modelled by mapping the sinusoids onto sparse, fixed length vectors.
+
+## Example
+
 Here is the output from [phasenn_test11.py](phasenn_test11.py). The first plot is a series of magnitude spectra of simulated speech frames. The voiced frames have two fairly sharp peaks (formants) beneath Fs/2 with structured phase consisting of linear and dispersive terms. Unvoiced frames have less sharp peaks above Fs/2, and random phases.
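Challenge 1 above (representing angles in a NN) can be illustrated with a short sketch: each phase is encoded as a (cos, sin) pair so the training target stays continuous across the +/-pi wrap, and arctan2 recovers the angle. The exact encoding used by the test scripts may differ:

```python
import numpy as np

def encode_phase(phase):         # radians -> (cos, sin) pairs
    return np.stack([np.cos(phase), np.sin(phase)], axis=-1)

def decode_phase(pairs):         # (cos, sin) pairs -> radians
    return np.arctan2(pairs[..., 1], pairs[..., 0])

phi = np.array([3.1, -3.1])      # nearly equal angles across the +/-pi wrap
assert np.allclose(decode_phase(encode_phase(phi)), phi)
```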
From 57ccb4c1227db02ae8b09da5c4141f37cb1b222b Mon Sep 17 00:00:00 2001
From: drowe67 <45574645+drowe67@users.noreply.github.com>
Date: Sun, 1 Dec 2019 11:06:23 +1030
Subject: [PATCH 5/7] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index a6a354c..c4c1337 100644
--- a/README.md
+++ b/README.md
@@ -17,7 +17,7 @@ Building up techniques for modelling phase using NNs and toy speech models (casc
 ## Challenges
 
 1. We must represent phase (angles) in a NN. Phase is being represented by (cos(angle), sin(angle)) pairs, which, when trained, tend to develop weights that behave like complex numbers/matrix rotations.
-1. The number of phases in the sinusoidal model is time varying, based on the pitch of the current frame. This is being modelled by mapping the sinusoids onto sparse, fixed length vectors.
+1. The number of phases in the sinusoidal model is time varying, based on the pitch of the current frame. This is being addressed by mapping the frequency of each sinusoid onto the index of a sparse, fixed length vector.
 
 ## Example

From 12ccecab6825790443705383f800c3617521a4f3 Mon Sep 17 00:00:00 2001
From: drowe67 <45574645+drowe67@users.noreply.github.com>
Date: Sun, 1 Dec 2019 11:07:45 +1030
Subject: [PATCH 6/7] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index c4c1337..2cedf53 100644
--- a/README.md
+++ b/README.md
@@ -21,12 +21,12 @@ Building up techniques for modelling phase using NNs and toy speech models (casc
 ## Example
 
-Here is the output from [phasenn_test11.py](phasenn_test11.py). The first plot is a series of magnitude spectra of simulated speech frames. The voiced frames have two fairly sharp peaks (formants) beneath Fs/2 with structured phase consisting of linear and dispersive terms. Unvoiced frames have less sharp peaks above Fs/2, and random phases.
+Here is the output from [phasenn_test11.py](phasenn_test11.py). The first plot is a series of (log) magnitude spectra of simulated speech frames. The voiced frames have two fairly sharp peaks (formants) beneath Fs/2 with structured phase consisting of linear and dispersive terms. Unvoiced frames have less sharp peaks above Fs/2, and random phases.
 
 ![](example_mag.png "Magnitude Spectra")
 ![](example_phase.png "Phase Spectra")
 
-The next plot shows the dispersive component of the original phase spectra (green), the phase spectra with an estimate of the linear phase term removed (red), and the NN output estimated dispersive phase (blue). For voiced frames, we would like green (original) and blue (NN estimate) to match. In particular we want to model the rapid phase shift across the peak of the amplitude spectra - this is the dispersive term that shifts the phase of high energy speech harmonics apart and reduces the buzzy/unnatural quality in synthesised speech.
+The next plot shows the dispersive component of the original phase spectra (green), the phase spectra with an estimate of the linear phase term removed (red), and the NN output estimated dispersive phase (blue). The y-axis is the phase angle in degrees. For voiced frames, we would like green (original) and blue (NN estimate) to match. In particular we want to model the rapid phase shift across the peak of the amplitude spectra - this is the dispersive term that shifts the phase of high energy speech harmonics apart and reduces the buzzy/unnatural quality in synthesised speech.
 
 When training from real world data, we have frames of phase spectra with the linear and dispersive phase components combined. We will not know the linear term, and therefore must estimate it. This simulation introduces small errors in the linear term estimation (+/-1 sample), which can lead to large phase differences at high frequencies. The red (original phase with estimated linear term removed) diverges from the true dispersive input (green), as the estimation of the linear term is not perfect. However over the training database these errors tend to have a zero mean - this simulation suggests they are being "trained out" by the NN, resulting in a reasonable model of the dispersive term (blue), albeit with some estimation "noise" at high frequencies. This HF noise may be useful, as it matches the lack of structure of HF phases in real speech.
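The sparse, fixed length vector of Challenge 2 (as reworded in PATCH 5) might look like the sketch below, where harmonic m lands in the bin nearest its frequency m*Wo. The vector width and rounding rule are assumptions for illustration:

```python
import numpy as np

width = 256                                  # assumed fixed NN vector width

def sparse_pairs(Wo, pairs):
    """Scatter L harmonic (cos, sin) pairs onto a fixed width vector,
    indexed by each harmonic's frequency; unused bins stay zero."""
    vec = np.zeros((width, 2))
    for m in range(1, pairs.shape[0] + 1):
        k = min(int(round(m * Wo * width / np.pi)), width - 1)
        vec[k] = pairs[m - 1]                # m*Wo in [0, pi) -> bin k
    return vec
```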
From 0a39182a84d17e5052d173639169398af73fbb8c Mon Sep 17 00:00:00 2001
From: drowe67 <45574645+drowe67@users.noreply.github.com>
Date: Sun, 1 Dec 2019 11:08:43 +1030
Subject: [PATCH 7/7] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 2cedf53..23c19b1 100644
--- a/README.md
+++ b/README.md
@@ -30,5 +30,5 @@ The next plot shows the dispersive component of the original phase spectra (green
 When training from real world data, we have frames of phase spectra with the linear and dispersive phase components combined. We will not know the linear term, and therefore must estimate it. This simulation introduces small errors in the linear term estimation (+/-1 sample), which can lead to large phase differences at high frequencies. The red (original phase with estimated linear term removed) diverges from the true dispersive input (green), as the estimation of the linear term is not perfect. However over the training database these errors tend to have a zero mean - this simulation suggests they are being "trained out" by the NN, resulting in a reasonable model of the dispersive term (blue), albeit with some estimation "noise" at high frequencies. This HF noise may be useful, as it matches the lack of structure of HF phases in real speech.
 
-For unvoiced speech, we want the NN output phases (blue) to be random. They do not need to match the original input phase spectra. The NN appears to presever this random phase structure in this simulation. This may remove the need for a voicing estimate - voicing can be deduced from the magnitude spectra.
+For unvoiced speech, we want the NN output phases (blue) to be random. They do not need to match the original input phase spectra. The NN appears to preserve this random phase structure in this simulation. This may remove the need for a voicing estimate - voicing can be deduced from the magnitude spectra.
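For reference, the toy speech model used in these tests - cascades of 2nd order filters excited by impulse trains - can be approximated as below. Voiced frames use an impulse train at the pitch period while unvoiced frames use noise, which also illustrates why voicing might be deduced from the magnitude spectra alone. The formant frequencies, bandwidths and pitch are invented values; see the phasenn_test scripts for the actual simulations:

```python
import numpy as np
from scipy.signal import lfilter

Fs, N = 16000, 512

def toy_frame(voiced, formants=((800.0, 200.0), (1800.0, 300.0)),
              F0=200.0, seed=0):
    rng = np.random.default_rng(seed)
    if voiced:
        x = np.zeros(N)
        x[::int(Fs / F0)] = 1.0              # impulse train at the pitch period
    else:
        x = rng.standard_normal(N)           # noise excitation, random phases
    for f, bw in formants:                   # cascade of 2nd order resonators
        r = np.exp(-np.pi * bw / Fs)         # pole radius from bandwidth
        w = 2 * np.pi * f / Fs               # pole angle from centre frequency
        x = lfilter([1.0], [1.0, -2 * r * np.cos(w), r * r], x)
    return x
```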