RainGAN — 1 year update | AI generative audio
In August 2022, I wrote a blog post on my hobby project of building a generative AI model for environmental rain sounds. (You can read about my 2022 model here.) Since then, I’ve learned a lot more in the space of state-of-the-art generative models, immersed myself much more in the latest research papers, and made many improvements to my model. This post is my opportunity to reflect on this journey so far.
RainGAN2
Overview
RainGAN v1 learned, from hours of example recordings, to generate 2D images of short (<1 second) rain-sound STFT power spectrograms, which were then converted back to audio with the lossy Griffin-Lim algorithm. RainGAN2 takes a different approach: it generates the waveform directly and converts it to spectrograms during training for discrimination.
RainGAN2 uses the same multi-scale progressive training paradigm introduced by Catch-a-Waveform (CAW) at NeurIPS 2021. CAW trains a fully-convolutional GAN that produces diverse samples from remarkably small datasets, as little as 20 seconds of audio, in a space where hundreds of thousands of training examples are more the norm. This training accessibility made the CAW architecture an attractive starting point for RainGAN2.
The CAW framework supports training at sample rates up to 16 kHz. In my experiments, the model struggles to produce higher-definition audio, especially since the energy of the higher frequencies falls off as the sample rate increases. CAW is also ill-suited to capturing the spectral qualities of rain and other pseudo-random sounds.
One idea is to add more discriminators, coupling the CAW waveform discriminator with the spectrogram discriminators of Improved RVQGAN (IRVQGAN). IRVQGAN is built for a different problem in hi-fi neural audio, namely audio compression for applications like music streaming, enterprise data storage, or voice assistants; it belongs to the same family as SoundStream and Facebook's EnCodec. IRVQGAN uses an autoencoder-GAN model to compress 44.1 kHz audio into RVQ codebooks, achieving compression rates of 90x. Several spectral discriminators are used to enforce sound quality. The one of interest here is the multi-band multi-scale STFT discriminator (MBSD) proposed by the paper. A family of MBSDs, each set at a different STFT FFT window size, observes different frequency bands. The original paper notes that the MBSD's main utility is in reducing small-magnitude aliasing artifacts. Giving different convolutional trunks receptive fields over different frequency ranges is also implied to reduce computational complexity: filters can be allocated more efficiently, since spectrogram features at different frequencies are expected to differ. The original paper's architecture has no overlap between bands, which I hypothesize reduces discriminatory effectiveness at the borders of each frequency band, due to information lost to convolutional padding. I propose that MBSDs could be improved with mechanisms for sharing information between different bands and different resolutions.
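To make the banding idea concrete, here is a minimal sketch of splitting a magnitude STFT into overlapping frequency bands before feeding separate trunks. The band edges and overlap width below are illustrative assumptions, not the exact values used by IRVQGAN or RainGAN2.

```python
import torch

def split_into_bands(spec: torch.Tensor, band_edges, overlap: int = 8):
    """Split a magnitude STFT (B, C, F, T) into overlapping frequency bands.

    band_edges: list of (lo, hi) bin indices per band; the values used in the
    example call are illustrative. `overlap` adds extra bins on each side so
    adjacent trunks share context at the band borders.
    """
    n_bins = spec.shape[2]
    bands = []
    for lo, hi in band_edges:
        lo = max(0, lo - overlap)
        hi = min(n_bins, hi + overlap)
        bands.append(spec[:, :, lo:hi, :])
    return bands

# e.g. a 1025-bin STFT (FFT size 2048) split into three overlapping bands
bands = split_into_bands(torch.randn(1, 1, 1025, 87),
                         band_edges=[(0, 256), (256, 640), (640, 1025)])
```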
Generator
To attention or not to attention? The generator is fully-convolutional. RainGAN2 does not use any self-attention layers (unlike RainGAN v1) or GRU/RNN layers (as in EnCodec). This significantly saves on computation time and model size, favorable properties for a model meant to be trained on a consumer GPU. The inclusion of self-attention or GRU in either the waveform discriminator or the generator was not shown to improve metrics substantially. A single 2-headed self-attention layer in the generator would increase FLOPS roughly 10x compared to the fully-convolutional variant.
Instead, I use squeeze-and-excite (SE) layers at different points in the model as a cost-effective substitute for attention. Implementing concurrent SE, which allows for global attention channel-wise and local attention spatially, is something I'm interested in for future implementations. For RainGAN2, I propose the following implementation of 1D channel-wise SE.
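A minimal PyTorch sketch of the block; the reduction ratio of 8 here is a representative choice rather than a tuned value.

```python
import torch
import torch.nn as nn

class SqueezeExcite1d(nn.Module):
    """Channel-wise squeeze-and-excite for 1D feature maps of shape (B, C, T)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ELU(),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squeeze: global average over time gives one descriptor per channel.
        w = x.mean(dim=-1)            # (B, C)
        # Excite: per-channel gates in (0, 1), broadcast back over time.
        w = self.fc(w).unsqueeze(-1)  # (B, C, 1)
        return x * w
```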
Noise mixing. How input noise is introduced to the model is different than in CAW. CAW prescribes the scale of noise to added to a signal based on the desired energy increase from Fourier analysis. This approach is fine where changes in sound energy are large, but when increasingly smaller energy increases occur for higher and higher frequencies, the generator is given less ability to produce high-frequency features or to correct errors from the previous scales’ generators. For RainGAN2, I let the generator learn the noise amplitude to add. For training stability to not be hampered, it is important that the generator not be able to extinguish the contribution of the previous scale signal.
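One way to realize this is a residual mix where only the noise path carries a learnable gain; the softplus parameterization below is an assumption on my part, not necessarily the exact RainGAN2 formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedNoiseMix(nn.Module):
    """Mix input noise into the previous scale's signal with a learned gain.

    The previous-scale signal passes through untouched (residual path), so the
    generator can shape the noise contribution but never remove the signal it
    is refining. The softplus gain parameterization is an assumption.
    """

    def __init__(self, init_gain: float = 0.1):
        super().__init__()
        self.gain = nn.Parameter(torch.tensor(init_gain))

    def forward(self, prev_signal: torch.Tensor) -> torch.Tensor:
        noise = torch.randn_like(prev_signal)
        # Non-negative learned amplitude for the injected noise.
        return prev_signal + F.softplus(self.gain) * noise
```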
Audio filters. The original CAW paper recommends a pre-emphasis filter in progressive training of audio GANs, to preserve the low-frequency features generated by the previous scale's frozen generator and to increase the current scale's generator's ability to produce high-frequency features. The CAW paper uses a fixed PE filter with a beta of 0.97. I make the simple modification of turning this beta into a learnable parameter, to increase the model's flexibility across different sampling-rate schedules; a sketch follows.
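A minimal sketch of the learnable pre-emphasis filter, initialized at the CAW value of 0.97; the clamp range is an assumption to keep the filter well-behaved.

```python
import torch
import torch.nn as nn

class LearnablePreEmphasis(nn.Module):
    """Pre-emphasis filter y[n] = x[n] - beta * x[n-1] with a learnable beta."""

    def __init__(self, init_beta: float = 0.97):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(init_beta))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1, T) waveform. Keep the first sample as-is, then apply the
        # first-order difference weighted by the (clamped) learnable beta.
        beta = self.beta.clamp(0.0, 1.0)
        return torch.cat([x[..., :1], x[..., 1:] - beta * x[..., :-1]], dim=-1)
```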
Additionally, I notice that in the production of my training samples there is a consistent feature where the highest ~5% of frequencies are sharply cut off. This becomes a feature that the generator struggles to reproduce at every scale, but especially at higher scales. By adding a parametrized sharp cut-off filter directly into the generator model, I remedy this issue. This low-pass cut-off filter can be formulated as a windowed sinc function that is convolved over the input signal. A rectangular window produces ringing artifacts and leaves some high frequencies only partially attenuated. I achieved better results using a Kaiser-Bessel window of width 201 samples. Where an ideal cut-off filter would be infinitely many samples wide, a large width is required for the filter to perform well. It is hypothesized that reproducing such a 201-sample-wide filter is difficult for a generator using kernels no larger than 11 samples. A learnable parameter controls the frequency of the sinc function, which directly determines the portion of frequencies to cut off; a sketch is shown below.
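A sketch of this filter, assuming a Kaiser beta of 8.0 (an illustrative choice) and a cutoff expressed as a fraction of the Nyquist frequency.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableLowPass(nn.Module):
    """Windowed-sinc low-pass filter with a learnable normalized cutoff.

    Uses a 201-tap Kaiser window, as described above; the kernel is rebuilt
    from the current cutoff parameter on every forward pass so gradients flow
    into the cutoff. The Kaiser beta of 8.0 is an illustrative choice.
    """

    def __init__(self, num_taps: int = 201, init_cutoff: float = 0.95):
        super().__init__()
        assert num_taps % 2 == 1
        self.num_taps = num_taps
        # Normalized cutoff in (0, 1), where 1.0 is the Nyquist frequency.
        self.cutoff = nn.Parameter(torch.tensor(init_cutoff))
        window = torch.kaiser_window(num_taps, periodic=False, beta=8.0)
        self.register_buffer("window", window)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1, T) waveform
        cutoff = self.cutoff.clamp(1e-3, 0.999)
        half = (self.num_taps - 1) // 2
        n = torch.arange(-half, half + 1, device=x.device, dtype=x.dtype)
        # Ideal low-pass impulse response, tapered by the Kaiser window.
        kernel = cutoff * torch.sinc(cutoff * n) * self.window
        kernel = kernel / kernel.sum()          # unity DC gain
        kernel = kernel.view(1, 1, -1)
        return F.conv1d(x, kernel, padding=half)
```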
Dr. Youngmoo Kim's digital audio processing channel on YouTube has been especially informative on these concepts.
Linear output. CAW and other generative audio papers tend to use a gated layer as the output layer. In practice, I did not observe significant benefits from this design choice. A simple linear layer appears to be a sufficient replacement, with the added benefit of avoiding the diminishing gradients introduced by the output sigmoid that the GLU unit requires.
Discriminators
Modified MBSD discriminator. I implement several changes to the original MBSD recipe introduced in IRVQGAN. First, height-width reduction is achieved with neural downsamplers made of a Fused-MBConv block (proposed in EfficientNetV2) followed by a bilinear interpolation. The downsampler block halves either the height or the width of the input STFT. Up to two downsamplers in series are used to reduce the STFT dimensions to comparable sizes across the 2048, 1024, and 512 FFT variants; a sketch of the block follows.
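A minimal sketch of such a downsampler; the expansion ratio and kernel size are illustrative choices rather than the exact RainGAN2 settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectrogramDownsampler(nn.Module):
    """Fused-MBConv block followed by bilinear halving of one STFT axis."""

    def __init__(self, channels: int, axis: str = "freq", expand: int = 2):
        super().__init__()
        hidden = channels * expand
        # Fused-MBConv: 3x3 expansion conv, activation, 1x1 projection.
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=3, padding=1),
            nn.ELU(),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )
        self.axis = axis

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, F, T) spectrogram features
        x = x + self.block(x)  # residual connection around the Fused-MBConv
        b, c, f, t = x.shape
        size = (f // 2, t) if self.axis == "freq" else (f, t // 2)
        return F.interpolate(x, size=size, mode="bilinear", align_corners=False)
```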
Following the downsampling step, the spectrogram is broken into different frequency bands, each passed to a separate convolutional trunk composed of more Fused-MBConv blocks and frequency-wise strided convolutions with ELU activations. This design choice reinforces the inductive bias that the spectral features of raindrops stretch far along the frequency dimension and very little along the time dimension.
The outputs of each trunk are concatenated and pass through a two-convolution tail to score patches on realism.
Waveform discriminator. The waveform discriminator takes the 1D audio signal directly as input and is used to control for phase artifacts and other features present directly in the waveform. This discriminator uses feedforward convolutions with concurrent squeeze-and-excite layers. I favor ELU and Mish activation functions to carry smooth gradients throughout the discriminator; Mish is used sparingly, for its self-gating behavior. A fixed pre-emphasis filter is applied to the input signal, and there is no cut-off filter.
Losses
I experimented with a variety of loss functions to guide generation and to stabilize training. A weighted sum of these loss terms is backpropagated through the generator and discriminators.
Reconstruction losses. One of the core ideas behind CAW is enforcing that at least one point in the generator's noise distribution maps to a recreation of the single training example. For a specific reconstruction noise sample, the generator is trained to match the real audio against several metrics, sketched after the list below.
- RMS loss on the waveform helps train the generator to produce audio with the correct signal energy (proportional to the sum of squared amplitudes).
- L1 loss on the waveform ensures the sign of amplitudes is correct, information otherwise lost in the RMS loss.
- STFT loss trains the generator to reconstruct the real audio’s spectrogram for different FFT window sizes.
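A compact sketch of these three terms; the equal weights and the FFT sizes here are illustrative, whereas RainGAN2 uses a tuned weighted sum.

```python
import torch
import torch.nn.functional as F

def _log_mag_stft(x: torch.Tensor, n_fft: int) -> torch.Tensor:
    """Log-magnitude STFT of a (B, 1, T) waveform."""
    window = torch.hann_window(n_fft, device=x.device)
    spec = torch.stft(x.squeeze(1), n_fft=n_fft, hop_length=n_fft // 4,
                      window=window, return_complex=True).abs()
    return torch.log1p(spec)

def reconstruction_loss(fake: torch.Tensor, real: torch.Tensor,
                        fft_sizes=(2048, 1024, 512)) -> torch.Tensor:
    """RMS + waveform L1 + multi-resolution STFT loss (a sketch)."""
    # RMS loss: match overall signal energy.
    rms = lambda x: x.pow(2).mean(dim=-1).sqrt()
    loss = F.l1_loss(rms(fake), rms(real))
    # Waveform L1: keeps amplitude signs aligned, which RMS alone ignores.
    loss = loss + F.l1_loss(fake, real)
    # Spectrogram match at several FFT window sizes.
    for n_fft in fft_sizes:
        loss = loss + F.l1_loss(_log_mag_stft(fake, n_fft),
                                _log_mag_stft(real, n_fft))
    return loss
```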
Adversarial losses. GAN literature describes many different ways to implement adversarial training between discriminator and generator. For RainGAN2, I use the original non-saturating GAN loss proposed in the Goodfellow paper, in line with CAW. The Wasserstein loss is unfavorable for this model because the magnitude of the adversarial loss would be unbounded, making it difficult to control the relative influence of the different components of the total loss, such as the reconstruction losses. Recent additions to the audio GAN space suggest the hinge loss as a viable substitute, something I am currently investigating for RainGAN3.
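For reference, the standard non-saturating objective can be written as binary cross-entropy on discriminator logits, as in this short sketch.

```python
import torch
import torch.nn.functional as F

def d_loss(real_logits: torch.Tensor, fake_logits: torch.Tensor) -> torch.Tensor:
    """Non-saturating GAN loss for the discriminator."""
    real = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    fake = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    return real + fake

def g_loss(fake_logits: torch.Tensor) -> torch.Tensor:
    """Generator maximizes log D(G(z)) instead of minimizing log(1 - D(G(z)))."""
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
```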
V-objective gradient penalty. To encourage the discriminator to present a smoother learning landscape to the generator, a gradient penalty is traditionally applied to the discriminator. Interpolates for the gradient penalty are often formulated as a linear interpolation of real and fake samples. I instead opt for V-objective rotational interpolation between signals, a scheme introduced for noising in diffusion-based audio generation.
Noise and signal can be simplified as orthogonal vectors, where the magnitude of the vector represents the audio energy. In this representation, linear interpolation would not preserve the energy of the vector, whereas a rotational interpolation would.
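A sketch of the rotational interpolate used to build gradient-penalty inputs; sampling one angle per example over [0, π/2] is my assumption about how the interpolation is drawn.

```python
import math
import torch

def rotational_interpolate(real: torch.Tensor, fake: torch.Tensor) -> torch.Tensor:
    """Energy-preserving interpolation between real and fake batches.

    Mixes with cos/sin weights so that cos^2 + sin^2 = 1 keeps the
    interpolate's energy roughly constant, unlike a linear mix.
    """
    # One angle per example, broadcast over the remaining dimensions.
    theta = torch.rand(real.shape[0], *([1] * (real.dim() - 1)),
                       device=real.device) * (math.pi / 2)
    return torch.cos(theta) * real + torch.sin(theta) * fake
```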
Training procedure
Exploding gradient control. Gradient-penalty calculations over the RainGAN2 discriminators have a tendency to explode to infinity. During training, invalid gradients from the gradient-penalty step are set to zero at the parameter level, as sketched below.
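A minimal sketch of this guard, applied between backward() and the optimizer step; the function name is mine.

```python
import torch

def zero_invalid_grads(model: torch.nn.Module) -> None:
    """Replace NaN/Inf gradients with zeros before the optimizer step."""
    for p in model.parameters():
        if p.grad is not None:
            # In-place so the optimizer sees the cleaned gradients.
            torch.nan_to_num_(p.grad, nan=0.0, posinf=0.0, neginf=0.0)
```

In practice this is called on each discriminator right after the gradient-penalty loss has been backpropagated, so one exploding penalty term does not poison the whole update.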
Future work and further reading
RainGAN2 still struggles to generate high-quality audio, especially past 22 kHz. Artifacts at specific frequencies are also generated, materializing as dark width-wise seams across the STFT. These seams appear to correlate with the top frequency ranges of the previous scale's signal. Below I list some potential solutions I am currently investigating.
- Improved concatenated waves — UnaGAN, InfiniGAN, digital audio convolution dynamics
- Turn off low-pass cut-off filter in model evaluation
- Full convolutional previous sample generation: no seams
- Float, not half models
- Generator model tweaks for improved phase reconstruction
- Squeeze-Excite GRU for MBD information sharing between trunks (1 directional information flow)