Neurocomputing 549 (2023) 126428


Solving the reconstruction-generation trade-off: Generative model with implicit embedding learning

Cong Geng, Jia Wang (corresponding author), Li Chen, Zhiyong Gao

Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

https://doi.org/10.1016/j.neucom.2023.126428

Article info

Article history: Received 9 August 2022; Revised 23 April 2023; Accepted 4 June 2023; Available online 8 June 2023.

Keywords: Autoencoder; Generative model; Embedding learning; Latent mapping; Adversarial

Abstract

Variational autoencoder (VAE) and generative adversarial network (GAN) are two classic generative models that generate realistic data from a predefined prior distribution, such as a Gaussian distribution. One advantage of VAE over GAN is its ability to simultaneously generate high-dimensional data and learn latent representations that are useful for data manipulation. However, it has been observed that a trade-off exists between reconstruction and generation in VAE, as matching the prior distribution for the latent representations may destroy the geometric structure of the data manifold. To address this issue, we propose an autoencoder-based generative model that allows the prior to learn the embedding distribution, rather than imposing the latent variables to fit the prior. To preserve the geometric structure of the data manifold as much as possible, the embedding distribution is trained using a simple regularized autoencoder architecture. Then an adversarial strategy is employed to achieve a latent mapping. We provide both theoretical and experimental support for the effectiveness of our method, which eliminates the contradiction between preserving the geometric structure of the data manifold and matching the distribution in latent space. The code is available at https://github.com/gengcong940126/GMIEL.

© 2023 Elsevier B.V. All rights reserved.

1. Introduction

Generative models are typically defined by a generator function that maps a low-dimensional latent prior to high-dimensional outputs described by a complex data distribution. The variational autoencoder (VAE) [1] and the generative adversarial network (GAN) [2] are two prominent deep learning-based generative models. Both sample from a simple prior distribution, such as a Gaussian distribution, which leads to the following limitations. A GAN consists of a generator and a discriminator that play a two-player game: the generator learns to map samples from a simple prior distribution, such as a Gaussian or uniform distribution, to realistic samples that conform to the observed data distribution. However, such a predefined prior is often independent of the data distribution and may lose geometric information of the data manifold, which hinders data interpolation [3], representation [4] and density estimation [5,6]. Compared with GAN, autoencoder-based methods have the benefit of an encoder that learns latent representations of the data inputs, making VAE an effective tool for compressing and understanding the manifold structure of high-dimensional data.

Autoencoder-based methods are typically trained by minimizing a reconstruction error together with a KL divergence term that forces the variational posterior to fit the prior, which creates a trade-off between reconstruction and generation: it is difficult for the latent embedding to simultaneously preserve the geometric properties of the data manifold and conform to a predefined distribution. To address these challenges, an intuitive strategy is to sample from the latent distribution rather than from a predefined prior distribution. However, given the dimensional dilemma in high-dimensional space, building a suitable latent embedding that preserves the structure of the data manifold while remaining easy to sample from is a challenging problem. Moreover, once such a latent embedding is constructed, sampling from this latent space becomes another key challenge. In this paper, we propose a latent mapping that enforces the prior to fit the embedding, so that we can sample from a learned latent representation. To construct a useful latent distribution, we combine ACAI [7] with a cycle consistency loss [8] for embedding learning. This encourages a flat latent representation and a bijective mapping from the data manifold to the latent embedding. Furthermore, to prevent the learned embedding distribution from becoming over-dispersed, we introduce a latent normalization technique [9] on the latent space, motivated by volume concentration [10] in high-dimensional space. After learning an appropriate latent representation, we employ a GAN structure to encourage the prior to match the embedding distribution, yielding high-dimensional generation. Meanwhile, we can also reconstruct observations from latent representations that preserve the geometric properties of the data manifold. Overall, the main contributions of this work are as follows:

- We provide an explanation for the trade-off phenomenon observed in most existing variational autoencoders and introduce our method to eliminate it.
- We propose an autoencoder-based structure with an adversarial interpolated regularizer, combined with a cycle consistency loss and a latent normalization technique, to obtain a learning-facilitated latent embedding representation that preserves the geometric structure of the data manifold.
- We propose a GAN-based latent mapping approach that allows us to sample from a specified prior distribution when generating new high-dimensional data.

2. Problem definition

2.1. Notation

Throughout the paper, we use the following notation. We use calligraphic letters (e.g., X) to denote spaces. Let p(x) be the real data distribution, z be the latent variable, and {x_i}_{i=1}^{n} in X be the training data, where X denotes the data space. The encoder and decoder networks are represented by E and G, with parameters φ and θ, respectively.

2.2. Trade-off in the variational autoencoder

A traditional VAE models the distribution of observations by defining a p_θ(x), which specifies a prior p(z) along with a likelihood p_θ(x|z) that connects it with the observation:

    p_\theta(x) = \int p_\theta(x|z)\, p(z)\, dz.    (1)

The integral for computing p_θ(x) is intractable, making it hard to maximize the marginal likelihood of the model under the data. To overcome this intractability, VAEs instead maximize the Evidence Lower Bound (ELBO) of the marginal likelihood:

    \log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \mathrm{KL}[q_\phi(z|x) \,\|\, p(z)],    (2)

where q_φ(z|x) is the variational posterior. The two neural networks E and G that parameterize q_φ(z|x) and p_θ(x|z) are referred to as the encoder and decoder, respectively. The second term of the ELBO, the Kullback-Leibler (KL) divergence, captures how distinct the posterior distribution given a training sample is from the prior p(z). This objective pushes the encoder to output latent representations with mean 0 and variance 1 for all training data. In that case, the decoder faces the impossible task of reconstructing different samples from completely random noise, which is called "posterior collapse". The following proposition illustrates this phenomenon theoretically.

Proposition 1. Assume p(z) is some specified prior distribution and p(x) is the real data distribution. If we let KL[q_φ(z|x) ‖ p(z)] = 0 for every x in X, then the ELBO is globally optimized only if p_θ(x|z) = p(x) for every z in Z.

From Proposition 1 we see that if we force KL[q_φ(z|x) ‖ p(z)] = 0 in a VAE, then for any two different latent embeddings we simply sample from the same distribution p_θ(x|z), which means the VAE completely loses its reconstruction ability. To avoid this, VAE requires manually fine-tuning the weight between the KL divergence term and the reconstruction term. However, if we put too much emphasis on reconstruction, another extreme case occurs:

Proposition 2. Suppose X, Z are 1-D spaces. Assume p(z) is some specified continuous prior distribution, p(x) is a 1-D continuous distribution with bounded support, and q_φ(z|x) is continuous with a bounded support region for every x. Then

    \sup_x \mathrm{KL}[q_\phi(z|x) \,\|\, p(z)] = +\infty    (3)

when E_{p(x)} E_{q_φ(z|x)}[log p_θ(x|z)] achieves its maximum.

Proposition 2 reveals that if we obtain a perfect reconstruction, the KL divergence term may be infinite at some points; in that case q_φ(z|x) is prone to become a delta function, which is not what we want. These two cases exhibit a contradiction between generation and reconstruction. Thus VAE requires manually balancing the weight between the KL divergence term and the reconstruction term, resulting in a trade-off. Other works [11-13] were proposed to alleviate "posterior collapse" but still try to achieve two conflicting goals: preserving the geometric properties of the data manifold and forcing the posterior distribution to fit the prior. Fig. 1 shows the latent embedding trained by VAE and AAE [11], which illustrates the mismatch between the posterior distribution and the prior. Observations in [3] show that the decoder tends to push the geometric information into the latent representations to make the network easy to train. Therefore, a trade-off problem may exist in all these methods.

Fig. 1. The encoding results with different methods for digit "0" in the MNIST dataset.
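To make the trade-off concrete, the following minimal PyTorch sketch (not the authors' code) shows the two competing ELBO terms for a diagonal Gaussian posterior; the weighting factor beta is an illustrative name for the weight discussed above. Raising it drives the posterior toward the prior and posterior collapse (Proposition 1), while lowering it lets q_φ(z|x) sharpen toward a delta function (Proposition 2).

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_recon, mu, logvar, beta=1.0):
    """Per-batch negative ELBO for a VAE with a diagonal Gaussian posterior."""
    # Reconstruction term (Gaussian decoder up to constants), one z sample.
    recon = F.mse_loss(x_recon, x, reduction="sum") / x.size(0)
    # Closed-form KL[q_phi(z|x) || N(0, I)].
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    # beta is the weight that has to be tuned by hand in standard VAEs.
    return recon + beta * kl
```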

3. Method

To handle the above problems, we propose to separate the two tasks of reconstruction and generation. Instead of imposing the latent representation to match the prior, we let the prior fit the latent embedding distribution. For the purposes of optimization, we rewrite the reconstruction term of the ELBO as follows:

    \mathbb{E}_{p(x)}\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]
        = \int q_\phi(z,x)\,\log p_\theta(x|z)\,dz\,dx
        = \int q_\phi(z,x)\,\log p_\theta(x,z)\,dz\,dx - \int q_\phi(z,x)\,\log p(z)\,dz\,dx
        = \int q_\phi(z,x)\,\log p_\theta(x,z)\,dz\,dx - \int q_\phi(z)\,\log p(z)\,dz,    (4)

which gives

    \int q_\phi(z,x)\,\log p_\theta(x,z)\,dz\,dx = \mathbb{E}_{p(x)}\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] + \int q_\phi(z)\,\log p(z)\,dz.    (5)

If we fix our encoder, the left-hand side of (5) achieves its maximum when p_θ(x,z) = q_φ(z,x). At this point, the two terms on the right-hand side achieve their maxima respectively, imposing p_θ(x|z) = q_φ(x|z) and p(z) = q_φ(z). For a simple autoencoder, if we fix the encoder, it is flexible enough to find a decoder that satisfies p_θ(x|z) = q_φ(x|z). Then we only need to make the prior match the embedding distribution, which is also not hard to achieve. Hence there is no trade-off between generation and reconstruction in our framework. Fig. 2 shows our overall framework.

3.1. Autoencoder with an adversarial regularizer

A major challenge for our method is the construction of a latent representation that is concentrated and easy for the prior to learn. Therefore, inspired by ACAI [7] and MI-AE [14], we add an adversarial regularizer to the original autoencoder loss to improve the latent representation. Similar to ACAI, we use a "critic" network D that plays an adversarial game with the autoencoder to push the decoded linearly interpolated samples to be perceptually as close to real data as possible. First, the critic is trained to distinguish real data from generated data, including both reconstructed and interpolated samples. The adversarial loss term of the critic is:

    L_{dis} = \|D(x)\|^2 + \|D(\gamma x + (1-\gamma)\hat{x}) - \lambda\|^2 + \|D(x_\mu) - \mu - \lambda\|^2,    (6)

where x̂ = G(E(x)) is the reconstruction of x through the autoencoder and x_μ = G(μE(x₁) + (1−μ)E(x₂)). λ and γ are two hyperparameters; we choose γ = 0.2 as suggested in ACAI and MI-AE, and μ is randomly sampled from the uniform distribution on [0, 0.5]. Different from ACAI and MI-AE, the critic predicts a preset λ for γx + (1−γ)x̂ to capture the subtle difference between reconstructed samples and real data. Since the interpolated datapoint x_μ is less realistic than x̂, we let the critic predict μ + λ for x_μ. With this loss function for the critic, the corresponding autoencoder part is modified by adding two adversarial terms:

    L_{ae\_adv} = \|x - G(E(x))\|^2 + \|D(x_\mu)\|^2 + \|D(\hat{x})\|^2.    (7)

With this adversarial loss, we obtain a useful flat latent representation and improve the decoder's ability to map latent embeddings back onto the image manifold. Besides, we add a cycle consistency loss [8] to promote a bijective mapping. Finally, we obtain the total autoencoder loss:

    L_{ae} = L_{ae\_adv} + \omega\|z_\mu - E(G(z_\mu))\|^2,    (8)

where z_μ = μE(x₁) + (1−μ)E(x₂) and ω is a hyper-parameter.
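As an illustration (not the released implementation, which is available in the repository linked in the abstract), the objectives of Eqs. (6)-(8) can be sketched in PyTorch as follows. The batch reductions, the stop-gradient on the critic's inputs, and the assumption that D returns one scalar per sample while E returns a flat latent code are choices made here for concreteness.

```python
import torch

def interpolate_latents(E, x1, x2):
    """z_mu = mu * E(x1) + (1 - mu) * E(x2), with one mu ~ U[0, 0.5] per sample."""
    z1, z2 = E(x1), E(x2)
    mu = 0.5 * torch.rand(z1.size(0), device=z1.device)
    z_mu = mu.unsqueeze(1) * z1 + (1.0 - mu.unsqueeze(1)) * z2
    return mu, z_mu

def critic_loss(D, E, G, x1, x2, gamma=0.2, lam=0.2):
    """Sketch of Eq. (6): the critic regresses 0 on real data, lam on slightly
    blended reconstructions, and mu + lam on decoded interpolations."""
    with torch.no_grad():  # do not update E/G on the critic step
        x_hat = G(E(x1))
        mu, z_mu = interpolate_latents(E, x1, x2)
        x_mu = G(z_mu)
    blend = gamma * x1 + (1.0 - gamma) * x_hat
    return (D(x1).view(-1).pow(2).mean()
            + (D(blend).view(-1) - lam).pow(2).mean()
            + (D(x_mu).view(-1) - (mu + lam)).pow(2).mean())

def autoencoder_loss(D, E, G, x1, x2, omega=1.0):
    """Sketch of Eqs. (7)-(8): reconstruction, two adversarial terms, and the
    cycle-consistency term on the interpolated latent code."""
    x_hat = G(E(x1))
    _, z_mu = interpolate_latents(E, x1, x2)
    x_mu = G(z_mu)
    recon = (x1 - x_hat).pow(2).flatten(1).sum(1).mean()
    adv = D(x_mu).view(-1).pow(2).mean() + D(x_hat).view(-1).pow(2).mean()
    cycle = (z_mu - E(x_mu)).pow(2).sum(1).mean()
    return recon + adv + omega * cycle
```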

3.2. Latent normalization

In high-dimensional space, VAE suffers from a dimensional dilemma that can be interpreted via some counterintuitive geometric facts [15]. It therefore becomes challenging to fit a distribution in high dimensions, since it easily becomes over-dispersed. Based on this observation, we employ a simple batch normalization (BN) trick as the last layer of our encoder to compress the embedding distribution to mean 0 and variance 1 while preserving the geometric structure. This operation makes the embedding distribution much easier for the prior to learn. During training the operation is simply:

    x \xrightarrow{E} z \rightarrow \mathrm{BN}(z) \rightarrow \hat{z} \xrightarrow{G} \hat{x}.    (9)

We use the operation of [9], which is very common in large-model training, as our BN layer. In fact, our BN trick can be viewed as a spherical projection in high-dimensional space because of volume concentration [10]. The benefit of a spherical projection over other latent regularizations is that the 2-Wasserstein distance between two arbitrary sets of random variables drawn randomly on a sphere converges to a constant when the dimension is sufficiently large [16]. This shows that latent variables on the sphere are distribution-robust. After this latent normalization, we can guarantee that the distance between the prior and the latent distribution is small enough to facilitate the subsequent latent mapping.
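A minimal sketch of this latent normalization is given below, assuming the encoder backbone already outputs a flat code of size latent_dim; the affine=False setting (so the normalized codes keep zero mean and unit variance) is an assumption made here, not a detail stated in the paper.

```python
import torch
import torch.nn as nn

class NormalizedEncoder(nn.Module):
    """Encoder with batch normalization applied to the latent code, as in Eq. (9)."""

    def __init__(self, backbone: nn.Module, latent_dim: int):
        super().__init__()
        self.backbone = backbone  # any module mapping x -> (batch, latent_dim)
        self.bn = nn.BatchNorm1d(latent_dim, affine=False)

    def forward(self, x):
        z = self.backbone(x)
        # With zero-mean, unit-variance coordinates, high-dimensional codes
        # concentrate near a sphere of radius sqrt(latent_dim), which is why
        # this layer acts like a soft spherical projection.
        return self.bn(z)
```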

Fig. 2. Overall framework of our method, which includes an autoencoder with an adversarial regularizer and a latent mapping that enforces the prior to match the learned latent representation. μ, λ and γ are hyperparameters explained in Section 3.

3.3. Latent mapping

After obtaining the desired latent representation, we need to make the specified prior easy to match to this latent distribution. Considering volume concentration in high-dimensional space, we first sample from a Gaussian distribution with the same dimension as our latent embedding and then normalize the length of each sample to project it onto the unit sphere; these projected points are our final prior samples. We introduce an adversarial strategy for our latent mapping network g, since GAN-based divergence estimation is widely applied in deep learning, especially in high-dimensional settings. Specifically, we introduce another "critic" network d in the latent space that tries to distinguish between g(z) and E(x), while the latent mapping network g tries to fool the critic. For network training, we employ the hinge version of the adversarial loss used in SAGAN [17] and BigGAN [18], resulting in the following critic objective:

    L_D = \mathbb{E}_{x \sim p(x)}[\mathrm{ReLU}(1 - d(E(x)))] + \mathbb{E}_{z \sim p(z)}[\mathrm{ReLU}(1 + d(g(z)))].    (10)

For the generator g, the loss function is:

    L_g = -\mathbb{E}_{z \sim p(z)}[d(g(z))].    (11)

The numbers of training iterations for the critic and the generator are 5 and 1, respectively, following the common settings of hinge GANs.
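Under the same caveats as the earlier sketches (tensor shapes and reductions are assumptions), the prior sampling on the unit sphere and the hinge objectives of Eqs. (10)-(11) could be written as follows.

```python
import torch
import torch.nn.functional as F

def sample_prior(batch_size, latent_dim, device="cpu"):
    """Gaussian noise projected onto the unit sphere, as described in Section 3.3."""
    z = torch.randn(batch_size, latent_dim, device=device)
    return F.normalize(z, dim=1)

def latent_critic_loss(d, g, E, x, z):
    """Hinge critic loss of Eq. (10): score encoded data above +1, mapped prior below -1."""
    return (F.relu(1.0 - d(E(x))).mean()
            + F.relu(1.0 + d(g(z))).mean())

def latent_mapper_loss(d, g, z):
    """Eq. (11): the mapping network g tries to raise the critic's score on g(z)."""
    return -d(g(z)).mean()
```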

4. Experiments

In this section, we evaluate our method on the Swiss-roll dataset and on four standard benchmarks: the MNIST [19], Cifar10 [20], CelebA [21] and CelebA-HQ [22] datasets. As a preliminary evaluation, we use the low-dimensional Swiss-roll dataset to visually verify that our learned latent embedding preserves the geometric properties of the data manifold while the prior fits the embedding distribution well, resulting in satisfying reconstruction and generation simultaneously. Meanwhile, the four image datasets highlight a variety of challenges that our method should address, and evaluation on them is adequate to support the advantages of our model. We conduct all experiments on a single NVIDIA GeForce RTX 2080 GPU using PyTorch [23].

4.1. Swiss-roll

We compare our model with VAE [1], AAE [11] and ALAE [24]. For each method, we use the same architecture as our method. All networks are multi-layer perceptrons (MLPs) with PReLU [25] activation layers. We uniformly sample 5000 points on a swiss roll and use a 2-D normal distribution as our prior. Fig. 3 shows the reconstruction of the input data manifold and the samples generated from the Gaussian prior. Because of the reconstruction term in both VAE and AAE, they obtain satisfying reconstructed samples. However, their generated samples are less satisfying due to the trade-off in their training. For ALAE, the principal training framework is GAN-based without an explicit reconstruction loss, so manifold reconstruction cannot be achieved at all; even on the generation task, ALAE fails to capture the geometric structure of the original manifold. Fig. 4 further illustrates the relation between the latent representation and the prior distribution. The latent embeddings of VAE and AAE try to fit the Gaussian distribution while preserving some geometric information of the data manifold, but they cannot absolutely match the Gaussian distribution. For ALAE, due to the inherent drawbacks of GAN and the lack of strong autoencoder-based constraints, imposing reciprocity in the latent space gives some advantages in choosing measure metrics, but the significant difference between the latent embedding distribution and the prior distribution learned by ALAE makes it impossible to recover the embedding distribution from the prior, thereby compromising the quality of reconstruction and generation. There is no trade-off problem with our approach: the latent embedding can fully preserve geometric information for reconstruction, and with a latent mapping we obtain satisfying generated samples within the original data manifold from a specified prior through the latent embedding.
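For reference, a 2-D swiss roll of this kind can be generated as in the sketch below; the exact sampling procedure, spiral range and noise level used by the authors are not specified, so the values here are illustrative.

```python
import numpy as np

def sample_swiss_roll_2d(n=5000, seed=0, noise=0.0):
    """Sample points on a 2-D swiss roll (a planar spiral r = t)."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(1.5 * np.pi, 4.5 * np.pi, size=n)
    x = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)
    return x + noise * rng.standard_normal(x.shape)
```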


Fig. 3. The reconstructed and generated data manifolds for a 2-D swiss roll. For each subfigure, the left column is the input manifold, the middle column is the reconstructed manifold, and the right column is the manifold generated from the Gaussian prior.

Fig. 4. Latent embedding, prior and generated latent distribution for a 2-D swiss roll. The leftmost image in each sub-figure is the latent embedding, and the second image is the Gaussian prior. For ALAE and our method, the rightmost image is the latent distribution generated from the Gaussian prior using the latent mapping.

4.2. Ablation study

We perform an ablation study on the MNIST [19] dataset to verify the effect of each model component, the latent embedding dimension and the hyper-parameter selection. We resize the training samples to 32×32 without augmentation and use the official train-test split from PyTorch. We train the model for 35 epochs, since the MNIST dataset converges very quickly. For training we use the Adam [26] optimizer with β1 = 0, β2 = 0.999 and a learning rate of 0.0002. The encoder E and the critic network D are built following the design of the encoder in [27], except for our latent normalization layer, and we use the generator in [27] as our decoder G. The latent mapping network g and the discriminator network d are fully-connected networks, with spectral normalization [28] applied in d.

4.2.1. Effect of each model component

We investigate the effect of each component in our framework, including the adversarial regularizer (AR), the cycle consistency loss (CC) and batch normalization (BN). The latent dimension in this experiment is set to 256 to explore a high-dimensional latent distribution. We show the reconstructed, interpolated and generated results with different combinations of the components in Fig. 5. For the MNIST dataset, all of the compared variants achieve well-performing reconstruction results. However, without BN on the latent embedding we fail to generate digits. This is mainly because the embedding distribution may be over-dispersed in the 256-dimensional latent space, which makes it difficult for the prior to learn; it also brings more challenges to the interpolation task. With the BN operation, the generated images have better visual quality, which justifies the effectiveness of BN, but some generated images overlap two digits. Adding an AR constraint yields meaningful generated digits as well as interpolated results. Since there is no notable difference in visual quality when the other components are combined with an AR constraint, we also report quantitative results in Table 1, using FID [29] for generation and interpolation and MSE [30] for reconstruction. It is worth noting that adding BN or the CC loss to AR individually has little impact on either qualitative or quantitative performance. But when the three components are combined, the generated images carry more semantic information and achieve better performance in both visual quality and numerical metrics. This demonstrates that a bijective mapping between the latent space and the data space can further improve the learning ability when the latent distribution is flat and concentrated. Furthermore, compared with ACAI, our model can generate new samples from the data distribution, while ACAI only focuses on representation learning. From Table 1 we see that for reconstruction ACAI obtains the best result by a slight margin, but for interpolation it is much worse than our model with only the AR component, which further illustrates the effectiveness of our AR component.

Fig. 5. Ablation results on the 32×32 MNIST dataset with different component combinations. "Original" means real data samples from the dataset. BN represents the batch normalization operation, AR the adversarial regularizer, and CC the cycle consistency loss.

Table 1. Quantitative ablation results in terms of FID and MSE on the 32×32 MNIST dataset with different component combinations.

Metric  Type            ACAI    AR      BN+AR   AR+CC   BN+AR+CC
FID     Generation      -       21.00   12.54   11.96   9.76
FID     Interpolation   9.78    3.05    2.93    3.33    3.72
MSE     Reconstruction  0.0093  0.0122  0.0152  0.0124  0.0130

4.2.2. Effect of latent embedding dimension

We investigate the effect of the latent embedding dimension on the performance of our model. Generally, the larger the embedding dimension, the stronger the representation ability for the data manifold. But a large dimension may harm generation, since a high-dimensional latent space makes the training of the latent mapping more difficult. Fig. 6 and Table 2 show the qualitative and quantitative results with embedding dimensions of 32, 64, 128, 256 and 512. From Fig. 6 we see that our model achieves relatively satisfying performance with all latent dimensions, especially for reconstruction. There are no notable differences in visual quality, while for the quantitative results in Table 2 the MSE metric improves as the dimension increases. This is consistent with our intuition, since a larger embedding dimension improves representation ability. For generation and interpolation, the FID score is usually better for smaller dimensions; in our experiments, the FID score is best for both generation and interpolation when the dimension is 64, and when the dimension reaches 256 or more, generation degrades considerably. These results demonstrate that a relatively small dimension is desirable as long as the representation ability of the embedding is sufficient.

Fig. 6. Qualitative results on the 32×32 MNIST dataset with different latent embedding dimensions.

Table 2. Quantitative results for the latent embedding dimension on the 32×32 MNIST dataset.

Metric  Type            32      64      128     256     512
FID     Generation      4.87    4.24    4.91    9.76    10.13
FID     Interpolation   2.85    2.42    3.61    3.72    3.45
MSE     Reconstruction  0.0246  0.0194  0.0148  0.0130  0.0130

4.2.3. Effect of hyper-parameters

We investigate the effect of different values of λ in Eq. (6) and ω in Eq. (8). Table 3 shows the FID results for generation and interpolation and the MSE results for reconstruction. We change the value of ω by factors of 10 from 0.01 to 100, and set λ to 0, 0.2 and 0.5. From Table 3 we observe that when ω = 100 the model fails at generation and interpolation, which indicates that ω cannot be too large. The other values of ω perform well on both FID and MSE, implying that our model is not sensitive to this value, so we choose ω = 1 by default for its simplicity. For varying values of λ, generation becomes better but reconstruction becomes worse as λ increases, with no clear winner among the three values. From Eq. (6), however, λ should be larger than 0, since reconstructed samples are less realistic than real samples, and it should be smaller than 0.5, because the difference between real samples and their reconstructions should be smaller than that between the reconstructions and the middle interpolations. We therefore choose λ = 0.2 by default.

Table 3. Quantitative ablation results for the hyper-parameters on the MNIST dataset.

                        ω                                            λ
Metric  Type            0.01    0.1     1 (default)  10      100      0       0.2 (default)  0.5
FID     Generation      11.19   7.02    9.76         7.96    101.57   12.50   9.76           5.54
FID     Interpolation   3.37    2.75    3.72         6.15    12.51    6.83    3.72           3.92
MSE     Reconstruction  0.0129  0.0151  0.0130       0.0424  0.0425   0.0122  0.0130         0.0381

4.3. Comparison against other methods

In this section, our goal is to compare our method with other autoencoder-based generative models on the Cifar10 and CelebA datasets. We measure performance quantitatively with FID scores [29] for generation and the MSE metric for reconstruction. FID can detect intra-class mode dropping and measures the diversity as well as the quality of generated samples. All training and testing images are resized to 32×32 for Cifar10 and 64×64 for CelebA, without augmentation. Following the standard protocols of Cifar10 and CelebA, we use 50,000 images for training and 10,000 for testing on Cifar10; for CelebA, we use 162,770 images for training and 19,962 for testing. FID is computed from 50 K generated samples, and the reference statistics are precomputed on all training data. The architectures of the networks E, G, g and D, d follow Mimicry's architectures [31]. We use the Adam optimizer with β1 = 0, β2 = 0.999 and a learning rate of 0.0002. The latent embedding dimension is set to 128. We compare our model with other autoencoder-based methods, including VAE [1], AAE [11], infoVAE [12], IntroVAE [32] and AGE [27]. Each model is trained for 100 epochs on Cifar10 and 40 epochs on CelebA. Table 4 shows the quantitative evaluation results on both datasets. Our method obtains the best results on both FID and MSE for both datasets, indicating that it achieves better performance on both reconstruction and generation rather than a trade-off between them. Moreover, we show the qualitative results for our approach and the benchmarks above in Figs. 7-8 for the Cifar10 dataset and Figs. 9-10 for the CelebA dataset. We observe that our method produces visually appealing results in both reconstruction and generation, in line with the quantitative evaluation.

Table 4. Comparison in terms of FID and MSE on the 32×32 Cifar10 and 64×64 CelebA datasets.

Metric  Dataset   VAE     AAE     infoVAE  IntroVAE  AGE     Ours
FID     Cifar10   132     95      147      133       168     76
FID     CelebA    76      71      104      67        62      55
MSE     Cifar10   0.0520  0.0370  0.0413   0.0492    0.0630  0.0297
MSE     CelebA    0.0383  0.0473  0.0516   0.0574    0.0851  0.0331

Fig. 7. The reconstructed results with different methods on the 32×32 Cifar10 dataset.

Fig. 8. The generated results with different methods on the 32×32 Cifar10 dataset.

Fig. 9. The reconstructed results with different methods on the 64×64 CelebA dataset.

Fig. 10. The generated results with different methods on the 64×64 CelebA dataset.

4.4. High-resolution image validation

To validate the effectiveness of our model on high-resolution images, we evaluate generation and reconstruction performance on more challenging benchmarks, specifically the 128×128 CelebA dataset and the 256×256 CelebA-HQ dataset. Our encoder network E and decoder network G use the same architecture as in [33]. Our critic D employs the same network as E, except for a final summation operation that outputs a scalar. The networks g and d are fully-connected networks with spectral normalization in d. The dimension of the latent embedding is set to 256 for images at 128×128 and 512 for images at 256×256. We additionally add a refinement step to further improve the latent mapping from the prior to the embedding distribution. We add this step because high-dimensional data are sparser, so adversarial training between the prior and the embedding distribution alone is not enough. In this refinement step, we re-train the critic D to distinguish between real samples and generated samples with the loss

    L_{dis\_refine} = \|D(x)\|^2 + \|D(G(g(z))) - 1\|^2,    (12)

where x is a real sample and G(g(z)) is the generated sample. The loss function for the network g then becomes

    L_{g\_refine} = -\mathbb{E}_{z \sim p(z)}[d(g(z))] + \|D(G(g(z)))\|^2.    (13)

With this refinement step, some artifacts in the background can be effectively removed from the generated images (see Fig. 11(b)). For the 128×128 CelebA dataset, we remove the latent normalization trick from our encoder, since we found that it slows down the convergence of training. For 256×256 CelebA-HQ images, to avoid the 512-dimensional latent embedding becoming over-dispersed, we use the same normalization operation as for our prior, projecting the embedding onto the unit sphere, as the latent normalization trick. Our model is refined for 50 epochs after normal training for 100 epochs. We report MS-SSIM [34] and FID scores for measuring reconstruction and generation quality. MS-SSIM is computed as an average over 10 K pairs of generated images on the 128×128 CelebA dataset, and FID is computed on 50,000 generated images with training samples as the reference images on the 256×256 CelebA-HQ dataset. As shown in Table 5, our method achieves comparable or better quantitative performance than other GAN- or VAE-based generative models. PGGAN [22] is a GAN-based model which may suffer from mode collapse in generation. IntroVAE [32] and Soft-IntroVAE [33] are competitive VAE-based models, but they cannot even guarantee that the encoder distribution or the generated data distribution is correct at the Nash equilibrium. The visual quality of our model for reconstruction, interpolation and generation is displayed in Fig. 11. Evidently, our model is able to reconstruct and interpolate high-quality samples. For generation, our method focuses more on generating the facial region, which causes some artifacts and makes the background less realistic. Although the refinement step can alleviate this issue, the problem still exists and needs to be explored in the future.
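A sketch of how the refinement objectives of Eqs. (12)-(13) could be written, under the same assumptions as the earlier sketches (D and d return one scalar per sample; reductions are choices made here).

```python
import torch

def refine_critic_loss(D, G, g, x, z):
    """Sketch of Eq. (12): D regresses 0 on real images and 1 on G(g(z))."""
    fake = G(g(z)).detach()
    return D(x).pow(2).mean() + (D(fake) - 1.0).pow(2).mean()

def refine_mapper_loss(D, d, G, g, z):
    """Sketch of Eq. (13): keep the latent hinge term and push D(G(g(z)))
    toward the real-data target of Eq. (12)."""
    w = g(z)
    return -d(w).mean() + D(G(w)).pow(2).mean()
```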

Fig. 11. Qualitative results on the CelebA and CelebA-HQ datasets. For sub-figures (a) and (c), the four rows show real samples, reconstructions, interpolations and generated samples, respectively.

Table 5. Quantitative results on high-resolution images.

Metric    Dataset     PGGAN [22]  IntroVAE  Soft-IntroVAE  Ours
MS-SSIM   CelebA      0.2828      0.2719    -              0.2738
FID       CelebA-HQ   8.03        -         18.63          11.21


5. Conclusion

Despite the recent success of the variational autoencoder and its variants, there is still a trade-off between reconstruction and generation. To overcome this issue, we propose an autoencoder-based generative model that lets the prior fit a learned latent representation which preserves the geometric properties of the data manifold, rather than imposing the latent distribution to match the prior. To learn a latent embedding that is conducive to prior learning, we introduce an adversarial regularizer and a cycle-consistency loss on interpolated samples. Additionally, we apply a latent normalization technique to the latent distribution to obtain a concentrated, flat latent embedding. Our experiments demonstrate that our method preserves most of the global geometric information of the input data manifold while achieving high-quality generation without sampling meaningless points.

CRediT authorship contribution statement

Cong Geng: Conceptualization, Methodology, Writing - original draft, Writing - review & editing. Jia Wang: Conceptualization, Methodology, Writing - review & editing. Li Chen: Conceptualization, Writing - review & editing. Zhiyong Gao: Conceptualization, Writing - review & editing.

Data availability

No data was used for the research described in the article.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Proof of Proposition 1

Proof 1. If KL[q_φ(z|x) ‖ p(z)] = 0 for every x in X, then

    \mathbb{E}_{p(x)}\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] = \int p(x)\,p(z)\,\log p_\theta(x|z)\,dx\,dz
        = \int p(x)\,p(z)\,[\log p_\theta(x,z) - \log p(z)]\,dx\,dz
        = \int p(x)\,p(z)\,\log p_\theta(x,z)\,dx\,dz - \int p(z)\,\log p(z)\,dz
        \le \int p(x)\,p(z)\,\log[p(x)\,p(z)]\,dx\,dz - \int p(z)\,\log p(z)\,dz
        = \int p(x)\,\log p(x)\,dx.    (14)

Thus, only if p_θ(x,z) = p(x)p(z) for every x in X and z in Z, that is, only if p_θ(x|z) = p(x) for every z in Z, can E_{p(x)} E_{q_φ(z|x)}[log p_θ(x|z)] achieve its maximum, and at this point the ELBO achieves its optimum.

Appendix B. Proof of Proposition 2

Proof 2. We first uniformly discretize the bounded real data space into n regions and take the midpoint of each region as x_i. We then use a step function to approximate p(x); suppose the real data space is bounded to a region with measure M. We define:

    \hat{p}_n(x) = \begin{cases} p(x_i), & x_i - \frac{M}{2n} \le x \le x_i + \frac{M}{2n} \\ 0, & \text{otherwise} \end{cases}    (15)

With n → +∞, p̂_n(x) → p(x). We then discretize p_θ(x|z) by a step function in the same way:

    \hat{p}_{\theta,n}(x|z) = \begin{cases} p_\theta(x_i|z), & x_i - \frac{M}{2n} \le x \le x_i + \frac{M}{2n} \\ 0, & \text{otherwise} \end{cases}    (16)

With n → +∞, p̂_{θ,n}(x|z) → p_θ(x|z). Then E_{p(x)} E_{q_φ(z|x)}[log p_θ(x|z)] can be approximated by E_{p̂_n(x)} E_{q_φ(z|x)}[log p̂_{θ,n}(x|z)], because q_φ(z|x), p̂_n(x) ≥ 0 and p̂_{θ,n}(x|z) ≤ M_n. If E_{p̂_n(x)} E_{q_φ(z|x)}[log p̂_{θ,n}(x|z)] achieves its maximum, we obtain, for every x and every z in supp(q_φ(z|x)),

    \hat{p}_{\theta,n}(x|z) = M_n.    (17)

Then we can easily obtain that for all x_i ≠ x_j, supp(q_φ(z|x_i)) ∩ supp(q_φ(z|x_j)) = ∅; that is, the supports of the q_φ(z|x_i) are mutually disjoint. Because supp(q_φ(z|x)) is bounded, we suppose its measure is m; then we obtain:

    \inf_i\, \mathrm{supp}(q_\phi(z|x_i)) \le \frac{m}{n}.    (18)

We can deduce that:

    \lim_{n \to +\infty}\, \inf_i\, \mathrm{supp}(q_\phi(z|x_i)) = 0.    (19)

Let x_0 = argmin supp(q_φ(z|x)). When n → +∞, q_φ(z|x_0) goes to a delta function; at this point, KL[q_φ(z|x_0) ‖ p(z)] → +∞, and thus sup_x KL[q_φ(z|x) ‖ p(z)] = +∞. On the other hand, when n → +∞, E_{p̂_n(x)} E_{q_φ(z|x)}[log p̂_{θ,n}(x|z)] → E_{p(x)} E_{q_φ(z|x)}[log p_θ(x|z)], and the proposition is proved.

References

[1] D.P. Kingma, M. Welling, Auto-encoding variational Bayes, in: Proceedings of the 2nd International Conference on Learning Representations, 2014.
[2] I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks, Adv. Neural Inform. Process. Syst. 3 (2014) 2672-2680.
[3] C. Geng, J. Wang, L. Chen, Z. Gao, Geodesic learning with uniform interpolation on data manifold, IEEE Access 10 (2022) 98662-98669.
[4] L. Engstrom, A. Ilyas, S. Santurkar, D. Tsipras, B. Tran, A. Madry, Adversarial robustness as a prior for learned representations, arXiv preprint arXiv:1906.00945.
[5] G. Arvanitidis, L.K. Hansen, S. Hauberg, Latent space oddity: On the curvature of deep generative models, in: Proceedings of the 6th International Conference on Learning Representations, 2018.
[6] T. Yang, G. Arvanitidis, D. Fu, X. Li, S. Hauberg, Geodesic clustering in deep generative models, arXiv preprint arXiv:1809.04747.
[7] D. Berthelot, C. Raffel, A. Roy, I. Goodfellow, Understanding and improving interpolation in autoencoders via an adversarial regularizer, in: Proceedings of the 7th International Conference on Learning Representations, 2019.
[8] A. Oring, Z. Yakhini, Y. Hel-Or, Faithful autoencoder interpolation by shaping the latent space, arXiv preprint arXiv:2008.01487.
[9] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: International Conference on Machine Learning, Vol. 37, 2015, pp. 448-456.
[10] A. Blum, J. Hopcroft, R. Kannan, Foundations of Data Science, Cambridge University Press, 2020.
[11] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, Adversarial autoencoders, arXiv preprint arXiv:1511.05644.
[12] S. Zhao, J. Song, S. Ermon, InfoVAE: Balancing learning and inference in variational autoencoders, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, 2019, pp. 5885-5892.
[13] I. Tolstikhin, O. Bousquet, S. Gelly, B. Schoelkopf, Wasserstein auto-encoders, in: Proceedings of the 6th International Conference on Learning Representations, 2018.
[14] S. Qian, G. Li, W.-M. Cao, C. Liu, S. Wu, H.-S. Wong, Improving representation learning in autoencoders via multidimensional interpolation and dual regularizations, in: IJCAI, 2019, pp. 3268-3274.
[15] R. van Handel, Probability in High Dimension, Tech. rep., Princeton University, 2014.
[16] D. Zhao, J. Zhu, B. Zhang, Latent variables on spheres for autoencoders in high dimensions, arXiv preprint arXiv:1912.10233.
[17] H. Zhang, I. Goodfellow, D. Metaxas, A. Odena, Self-attention generative adversarial networks, in: International Conference on Machine Learning, 2019, pp. 7354-7363.
[18] A. Brock, J. Donahue, K. Simonyan, Large scale GAN training for high fidelity natural image synthesis, in: Proceedings of the 7th International Conference on Learning Representations, 2019.
[19] Y. LeCun, The MNIST database of handwritten digits, http://yann.lecun.com/exdb/mnist/.
[20] A. Krizhevsky, Learning multiple layers of features from tiny images, Tech. rep., 2009.
[21] Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3730-3738.
[22] T. Karras, T. Aila, S. Laine, J. Lehtinen, Progressive growing of GANs for improved quality, stability, and variation, arXiv preprint arXiv:1710.10196.
[23] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, A. Lerer, Automatic differentiation in PyTorch.
[24] S. Pidhorskyi, D.A. Adjeroh, G. Doretto, Adversarial latent autoencoders, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14104-14113.
[25] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026-1034.
[26] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
[27] D. Ulyanov, A. Vedaldi, V. Lempitsky, It takes (only) two: Adversarial generator-encoder networks, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32, 2018.
[28] T. Miyato, T. Kataoka, M. Koyama, Y. Yoshida, Spectral normalization for generative adversarial networks, arXiv preprint arXiv:1802.05957.
[29] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, GANs trained by a two time-scale update rule converge to a local Nash equilibrium, Adv. Neural Inform. Process. Syst. 30 (2017).
[30] C. Dong, C.C. Loy, K. He, X. Tang, Learning a deep convolutional network for image super-resolution, in: European Conference on Computer Vision, Springer, 2014, pp. 184-199.
[31] K.S. Lee, C. Town, Mimicry: Towards the reproducibility of GAN research.
[32] H. Huang, Z. Li, R. He, Z. Sun, T. Tan, IntroVAE: Introspective variational autoencoders for photographic image synthesis, Adv. Neural Inform. Process. Syst. 31 (2018).
[33] T. Daniel, A. Tamar, Soft-IntroVAE: Analyzing and improving the introspective variational autoencoder, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4391-4400.
[34] A. Odena, C. Olah, J. Shlens, Conditional image synthesis with auxiliary classifier GANs, in: International Conference on Machine Learning, PMLR, 2017, pp. 2642-2651.

Cong Geng is a Ph.D. candidate in Electrical Engineering with the Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai, China. She received the B.S. degree from the University of Electronic Science and Technology of China, China, in 2016. Her research interests include manifold learning, generative models and image processing.

Jia Wang received the B.Sc. degree in electronic engineering, the M.S. degree in pattern recognition and intelligent control, and the Ph.D. degree in electronic engineering from Shanghai Jiao Tong University, China, in 1997, 1999, and 2002, respectively. He is currently a Professor with the Department of Electronic Engineering, Shanghai Jiao Tong University, and also a member of the Shanghai Key Laboratory of Digital Media Processing and Transmission. His research interests include multiuser information theory and mathematics in artificial intelligence. He is an associate editor of Digital Signal Processing.

Li Chen received the B.S. and M.S. degrees from Northwestern Polytechnical University, Xi'an, China, in 1998 and 2000, respectively, and the Ph.D. degree from Shanghai Jiao Tong University, China, in 2006, all in electrical engineering. He is currently an Associate Professor with the Department of Electronic Engineering, Shanghai Jiao Tong University. His current research interests include image and video processing, DSP, and VLSI for image and video processing.

Zhiyong Gao received the B.S. and M.S. degrees from Changsha Institute of Technology, China, in 1981 and 1984, respectively, and the Ph.D. degree from Tsinghua University, Beijing, China, in 1989. From 1994 to 2010, he held several senior technical positions in England, including a video architect with 3DLabs, a consultant engineer with Sony European Semiconductor Design Center, and a digital video architect with Imagination Technologies. Since 2010 he has been a Professor with Shanghai Jiao Tong University. His research interests include video processing, image and video coding, and computer vision.
