Analytic Bijections for Smooth and Interpretable Normalizing Flows

Smooth, closed-form-invertible scalar bijections on the whole real line: building blocks for normalizing flows that keep smoothness, an unbounded domain, a cheap exact inverse, and local expressivity all at once.

On this page

Companion to the ICML 2026 paper Analytic Bijections for Smooth and Interpretable Normalizing Flows. Implementations in the bijx normalizing-flow library.

In scientific applications, a generative model is often judged not only by the samples it produces but by the density it represents, whose values and derivatives feed into calculations. Normalizing flows provide the needed ingredients: neural networks that are invertible and have a tractable Jacobian. However, the usual choices give up smoothness or closed-form inversion to get there. From classical mathematical functions we construct building blocks that give up neither: small, smooth pieces that compose freely into smooth normalizing flows whose architecture can be tuned depending on the application.

Generative models

Let’s first review the basic setup. There are two conceptual layers to think about that are important not to mix up. On the “lower” level we have the configuration space C\mathcal{C}, elements of which are individual samples such as images. Over this configuration space we have a space of probability distributions. Skipping more complex notions of measure theory, these are probability densities p:CR0p: {\mathcal C \to \mathbb{R}_{\geq 0}} which assign to each sample xC{x \in \mathcal C} a density p(x)p(x).

The goal is usually to make a generative model match some (implicitly) specified output distribution. This may be a family of conditional distributions like p(imageprompt){p(\text{image} \mid \text{prompt})}, but for simplicity let’s assume a single unconditional one. There are two common ways in which this target is specified:

  1. As a set of representative samples. The “classic ML” setup. Common losses include maximum likelihood, aka forward Kullback-Leibler (KL) divergence.
  2. As a density function. To connect to energy-based models this is often written as p=eH/Zp = e^{-H} / Z where H:CRH: \mathcal C \to \mathbb R is the unnormalized log-density, and ZZ is an (unknown) normalization constant. The two main reasons generative models are of interest, even when an equation for the density is already known, are that we may not know ZZ and that we cannot directly generate samples.
objectivedraw fromevaluate q(x)evaluate p(x)
forward KL (max. likelihood)p (target)(const.)
reverse KL (variational)q (model)(up to Z)
Figure 1. The same target distribution two ways: a cloud of draws x ∼ p (left) and its density p(x) (right). Hover a sample to ring the same point in both panels — a draw is a location, the density is its value there. The table is why the choice matters: matching p by maximum likelihood (forward KL) only needs draws from p and the ability to score them under the model q; the reverse direction, used when the target is given as an (energy) density, instead needs draws from q and an evaluation of p.

Unsurprisingly, the number of methods around generative models, including how to define loss functions and architectures, is vast and growing. For our purposes, let’s demand a generative model with a density qθq_\theta such that:

  1. We can evaluate the density qθ(x)q_\theta(x) for any sample xCx \in \mathcal C efficiently.
  2. We can efficiently draw samples xqθx \sim q_\theta.

with the overall goal being that qθq_\theta also approximates our target. Here the abstract θ\theta subscript indicates the parameters, to be optimized.

Depending on the application, either one of the above points could be relaxed. For example, we might be satisfied if we can draw samples xqθx \sim q_\theta without knowing exactly the density qθ(x)q_\theta(x) per configuration. This would allow us to use generative adversarial networks (GANs). On the other hand, if we only want to model the (unnormalized) density, qθq_\theta could be parametrized by a simple feed forward neural network.

Transport models

The cleanest way to achieve both exact densities and cheap samples is to let the network map samples to samples1, fθ:CCf_\theta: \mathcal C \to \mathcal C. If we now also choose some initial “prior” distribution ρ\rho for the inputs, often some Gaussian, we can define our generative model to be:

  1. Draw a noise sample ϵρ\epsilon \sim \rho.
  2. Output the transformed sample x=fθ(ϵ)x = f_\theta(\epsilon).

By definition, qθq_\theta should then be the distribution of generated outputs xx: xqθx \sim q_\theta. Let’s investigate how we can compute this density qθ(x)q_\theta(x).

We’ll start with the conceptually simplest case, namely demanding fθf_\theta to be bijective, and then briefly consider alternatives. If fθf_\theta is bijective, every xx corresponds to a single input ϵ=fθ1(x)\epsilon = f^{-1}_\theta(x), whose probability under the prior is ρ(ϵ)\rho(\epsilon). But qθ(x)=ρ(fθ1(x))q_\theta(x) = \rho(f_\theta^{-1}(x)) is not the whole story. To preserve the total probability, we also have to account for local stretching or squeezing of space under the map, which is captured by its derivative:2

qθ(x)=detfθ1ρ(fθ1(x)).q_\theta(x) = |\det \nabla f_\theta|^{-1} \cdot \rho(f_\theta^{-1}(x)) \,.

In this case the generative model is known as a normalizing flow.

Figure 2. A Gaussian (grey, dashed) deformed by a single cubic-rational map. Relabelling the prior gives ρ ∘ f⁻¹ (red, dashed); dividing by the local stretch |h′| gives the true density q (filled). The arrows under each sample point outward and red where the map pulls points apart and the density thins, and inward and blue where it compresses and the density piles up.

The question of how such bijective maps can be constructed as a neural network is the central question of this post. Before turning to it, though, it is worth seeing how the same change-of-variables logic extends to the broader landscape of generative models: relaxing bijectivity, determinism, or discreteness in turn. Optionally, skip straight to Discrete normalizing flows without losing continuity.

If fθf_\theta is not bijective, multiple input values may map to the same output. To know the probability of the output, we would have to sum over all possible noise sources:

qθ(x)=Cδ(ϵfθ1(x))detfθ1ρ(ϵ)dϵ.q_\theta(x) = \int_{\mathcal C} \delta(\epsilon - f_\theta^{-1}(x)) \cdot |\det \nabla f_\theta|^{-1} \cdot \rho(\epsilon) \, \dd{\epsilon}\,.

Here δ\delta is the Dirac delta.3

If fθf_\theta is stochastic, we would have to marginalize over all possible ways in which xx may have been generated. Let’s make this explicit by writing fθ(ϵ;α)f_\theta(\epsilon; \alpha) where ff is deterministic again and the randomness is moved into the auxiliary variable4 αA\alpha \in A drawn from some distribution αr\alpha \sim r. In principle, we would then have to compute

qθ=CAδ(ϵfθ1(x))ρ(ϵ)detfθ1dαdϵ.(1)q_\theta = \int_{\mathcal C} \int_{A} \delta(\epsilon - f_\theta^{-1}(x)) \, \rho(\epsilon)\cdot |\det \nabla f_\theta|^{-1} \, \dd{\alpha} \dd{\epsilon}\,.\tag{1}

Unless the spaces involved are discrete and finite (in which case the integrals become finite sums), we would likely have to resort to some Monte Carlo approximation of these integrals.

An important special case is one where fθf_\theta is itself defined implicitly by solving a continuous equation. Let’s introduce an auxiliary “time” variable tt and set y(t0)=ϵy(t_0) = \epsilon. Then define the output x=y(t1)x = y(t_1) via solving

ddty(t)=gθ(y(t),t)\dv{t} y(t) = g_\theta(y(t), t)

from time t0t_0 to t1t_1. Conventions differ, but we typically make the arbitrary choice of t0=0t_0 = 0 and t1=1t_1 = 1 (or vice versa; especially in diffusion models). This defines a continuous normalizing flow, whose model density we can evaluate using the infinitesimal version of the above change of variables formula:

ddtlogqθ(y(t),t)=ygθ(y(t),t),\dv{t} \log q_\theta(y(t), t) = - \nabla_y \cdot g_\theta(y(t), t) \,,

with boundary condition qθ(y,0)=ρ(y)q_\theta(y, 0) = \rho(y).

In fact, the above can be derived as a special case of the Fokker-Planck equation

tqθ(y,t)=y(gθ(y,t)qθ(y,t))+12yy(Dθ(y,t)qθ(y,t)),\pdv{t} q_\theta(y, t) = - \nabla_y \cdot \bigl(g_\theta(y,t)\, q_\theta(y,t)\bigr) + \frac{1}{2} \nabla_y \cdot \nabla_y \cdot \bigl(D_\theta(y,t)\, q_\theta(y,t)\bigr) \,,

which governs how the distribution evolves under a more general Itô SDE

dy=gθ(y,t)dt+Σθ(y,t)dW,\dd{y} = g_\theta(y, t)\,\dd{t} + \Sigma_\theta(y, t)\,\dd{W} \,,

with diffusion matrix Dθ=ΣθΣθD_\theta = \Sigma_\theta \Sigma_\theta^\top. This governs the broad class of diffusion models, which limit to a specific form of SDE driven by the Stein score (the gradient of the log-density), for which a tractable loss function can be introduced.

It is worth pausing on what these two evolution equations buy us. Both are local statements about how the density changes, whereas the general marginalization integral (1) naively asks us to average over all possible generation paths. That is the whole point: by constraining how randomness enters (not at all, for a deterministic flow; through a controlled diffusion, for an SDE) we trade an intractable integral for a tractable differential equation. This is especially true when only approximations are required, in which case the divergence of gθg_\theta can be estimated stochastically with the Hutchinson trace trick.

Discrete normalizing flows

Now, let’s say that we demand cheap and exact per-sample model densities and do not want to use continuous dynamics (chiefly because doing so is computationally expensive). That leaves us with traditional “discrete” normalizing flows. Discrete, not in the sense that they operate on discrete values, but in the sense that they are defined as a sequence of distinct layers instead of continuous dynamics.

How can we define these discrete layers? For their composition to be bijective, each of them has to be bijective. Furthermore, their det-Jacobians multiply. Consider

fθ=lθ(n)lθ(1).f_\theta = l_\theta^{(n)} \circ \dots \circ l_\theta^{(1)} \,.

Clearly we can invert fθf_\theta if we can invert each of the lθ(i)l^{(i)}_\theta, and by the chain rule5 we get

logdetfθ=i=1nlogdetl(i).\log \det \nabla f_\theta = \sum_{i=1}^n \log \det \nabla l^{(i)} \,.

Effectively, this just shows that if we have one normalizing flow (layer) we can compose multiple of them and get a new normalizing flow. This means each individual layer only needs to provide the properties we need (invertible, with a tractable det-Jacobian) even if each one is individually limited. By stacking many of them, we recover the expressivity any one of them lacks.

Most constructions for the layers ll break the problem down even further. It’s difficult to find a multi-dimensional neural network that is invertible. But perhaps we can start with a scalar function h:RRh: \mathbb R \to \mathbb R that is invertible and apply it to the individual components of xx. The Jacobian of such a map is given directly in terms of hh', which is tractable (it’s the derivative of a 1D function). With that alone, we cannot yet introduce correlations between the components. So the next idea is to let the parameters of hh that transforms a specific component depend on some of the other components. As long as we hold those other components “frozen”, the map remains perfectly invertible: to undo it, we read off the same frozen components and reconstruct the same parameters.

This split (frozen versus transformed degrees of freedom) is exactly what distinguishes the two dominant architectures:

  • Coupling layers pick a mask: they partition the coordinates into a passive set (the frozen coordinates) that is copied through unchanged and an active set that is transformed using parameters computed from the passive set. The mask changes from layer to layer so that, after enough layers, every coordinate has been both frozen and transformed.
  • Autoregressive flows instead pick an ordering. Think of factorizing p(x,y,z)=p(x)p(yx)p(zx,y)p(x, y, z) = p(x)\,p(y \mid x)\,p(z \mid x, y): each conditional is one-dimensional, and the conditioning is on everything earlier in the order.

In both cases the per-component transformation is the same kind of object — a scalar bijection hh whose parameters are produced by a conditioner network — and the Jacobian is triangular, so its log-determinant is just ilogh(xi)\sum_i \log |h'(x_i)|. The choice of hh is therefore the single lever that most directly controls how expressive, smooth, and stable the whole flow is.

Real NVP

The simplest and most obvious choice is an affine map, h(x)=esx+th(x) = e^{s} x + t, with a learnable scale and shift. Plugging it into a coupling layer gives Real NVP (Dinh et al., 2017), and it is a good lens for understanding what a coupling layer actually does.

Concretely, in 2D we can freeze xx and transform yes(x)y+t(x)y \mapsto e^{s(x)} y + t(x), where the conditioner outputs the scale s(x)s(x) and shift t(x)t(x) as functions of the frozen coordinate xx. A single such layer can already turn a round Gaussian into a sheared, stretched, position-dependent blob.

Figure 3. One Real NVP affine coupling acting on a round Gaussian. The frozen coordinate x is copied through, so the vertical grid lines stay vertical and the x-marginal on top is still exactly Gaussian, while y is rescaled and shifted by y ↦ e^{s(x)} y + t(x). The scale s(x) and shift t(x) are whatever the conditioner outputs as a function of the frozen coordinate. Because they vary with x, the horizontal grid lines bend, the round blob becomes a sheared, and its y-marginal on the right is no longer quite Gaussian.

It’s instructive to study what happens when the target is multi-modal or does not line up with the coordinate axes. Take as our go-to example a spiral. Because each layer can only stretch and shift one set of coordinates at a time (conditioned on the other), warping a Gaussian into a spiral arm forces the flow to fold the plane repeatedly. The result works (and can be improved with more layers), but intermediate densities develop characteristic creases. Since the affine map lacks local expressivity, we need many layers and training is often unstable.

layer:
Figure 4. A Real NVP flow approximating a spiral with chained affine coupling layers, initialized from a trained model. Each layer freezes one axis and applies an affine map to the other, alternating horizontal and vertical; the scale s and shift t are smooth functions of the frozen coordinate (here, interactive splines). Vary the number of active layers (or press play) to step through the flow, and note the creases where each axis-aligned fold concentrates probability.

What makes a good scalar bijection?

Real NVP’s affine map is the weakest link: it can only scale and shift globally, so it cannot, on its own, carve a bump into one region of the density without affecting the rest. This is well known, and the field has produced alternatives, but each comes with its own drawbacks.

Splines (monotonic, piecewise rational-quadratic; Durkan et al., 2019) are the standard powerful choice. They parametrize monotonic functions using learnable knots and interpolate between them, giving fine-grained local control. But they have two drawbacks that matter especially in a scientific context. First, they only actively transform a bounded interval; outside it, they default to the identity. This can be partially addressed by adding learnable scalings before and after, but it remains a suboptimal fit for distributions on all of R\mathbb R. Second, they are only CkC^k smooth for finite kk: there are kinks at the knot boundaries. If all you care about are output samples this may be harmless (although it may still reduce training smoothness), but if you care about density or derivatives of logp\log p, those kinks can show up as significant errors.

A different escape route is to relax the closed-form-inverse requirement. If we are willing to invert numerically (by root-finding), we only need to guarantee that hh is monotonic, which opens up very flexible constructions — residual flows (Behrmann et al., 2019; Chen et al., 2019) being the canonical example. They are smooth and expressive, but every inverse, and hence every sample, needs an iterative solve.

Thus, we have a tension. Achieving smoothness, an unbounded domain, a cheap exact inverse, and genuine local expressivity each seems to require giving up one of the others. The question that motivates much of the rest of this post is: can we have all four at once?

Let’s write down the wishlist explicitly. A tractable Jacobian hh' is a basic requirement (without it the bijection cannot be used inside a flow at all), so the real contest is over four properties. We want a scalar bijection hh that is

  1. globally smooth (CC^\infty),
  2. defined on the whole real line,
  3. analytically invertible in closed form, and
  4. expressive: in particular, supports local deformations, not just global scalings/shifts.

This is summarized in the table below. Every standard choice misses at least one column, and the analytic bijections are designed to tick all four at once.

MethodSmoothGlobal domainClosed-form inverseLocal
Affine
SplinesCkC^k only
Residual
Analytic (this work)

It turns out two simple design principles can construct such bijections.

Principle 1: a rational perturbation of the identity. We look for maps of the form h(x)=x+g(x){h(x) = x + g(x)}, where the perturbation gg vanishes as x{|x| \to \infty}. Requiring h(x)x{h(x) \to x} in the tails is what makes the deformation local: far from the interesting region, the map leaves the distribution (and its tails) alone, which tends to be good for training stability. Affine layers already handle global shifts and scalings, so there is no need to risk nonlinear distortion far from the central region; doing so is a reliable path to instability. Now take gg to be a ratio of polynomials, g(x)=n(x)/d(x){g(x) = n(x)/d(x)}. Two constraints follow almost immediately. First, dd must have no real roots (so we never divide by zero), which forces it to have even degree. Second, the denominator has to have higher degree than the numerator, degn<degd\deg n < \deg d, so g0g \to 0 in the tails. The only remaining question is how big to make the degree. Here algebra decides for us: for degd4\deg d \geq 4, clearing denominators leaves a quintic or higher, which has no closed-form solution. Degree 00 collapses back to affine. That leaves degd=2\deg d = 2 as the unique nontrivial choice. Clearing denominators then gives a cubic, which Cardano solved in the 16th century. In summary, this yields the cubic rational bijection.

Principle 2: conjugation with a monotonic function. Given any strictly monotonic gg whose inverse we know, define

h(x)=g1(g(x)+δ).h(x) = g^{-1}\bigl(g(x) + \delta\bigr) \,.

This is automatically a bijection for any shift δR\delta \in \mathbb R (it is ”gg, then shift, then undo gg”), with the tidy derivative h(x)=g(x)/g(h(x))h'(x) = g'(x)/g'(h(x)). To again get h(x)xh(x) \to x in the tails, we need gg to grow faster than linearly6. Two natural superlinear choices give the other two families we introduce:

  • Sinh: g(x)=sinh(x)g(x) = \sinh(x)
  • Cubic: g(x)=ax+bx3g(x) = ax + bx^3

Three families

From the above, we derive three concrete formulas that we can stably parametrize for training. Principle 2 can likely yield many more families, with suitable base functions gg.

Cubic rational. The main difficulty is finding a concrete parametrization that guarantees the rational is monotonic. The form we use is

h(x)=x+λ(xγ)1+(xγ)2/σ2,h(x) = x + \frac{\lambda\,(x - \gamma)}{1 + (x - \gamma)^2/\sigma^2} \,,

with center γ\gamma, width σ>0\sigma > 0, and magnitude λ\lambda. Bijectivity requires 1<λ<8-1 < \lambda < 8 (a bound that comes from demanding h>0h' > 0 everywhere; see the paper). It is purely algebraic — no transcendental functions — and strictly local: h(x)xh(x) \to x as x|x| \to \infty. The inverse is the real root of a cubic, via Cardano’s formula.

Sinh conjugation. Taking g=sinhg = \sinh:

h(x)=σarcsinh ⁣(eμ(eνsinh(xγσ)+δ))+γ.h(x) = \sigma \cdot \operatorname{arcsinh}\!\Bigl(e^{\mu}\bigl(e^{\nu} \sinh(\tfrac{x-\gamma}{\sigma}) + \delta\bigr)\Bigr) + \gamma \,.

This has a special feature: alongside the local deformation controlled by δ\delta, the parameters μ,ν\mu, \nu produce a global shift: distant points are displaced by a constant offset even though h(x)1h'(x) \to 1 in the tails. That is a behavior neither splines nor affine maps can produce.7

Cubic conjugation. Taking g(x)=ax+bx3g(x) = a x + b x^3 with a,b>0a, b > 0:

h(x)=g1(g(xγ)+δ)+γ.h(x) = g^{-1}\bigl(g(x - \gamma) + \delta\bigr) + \gamma \,.

Like the cubic rational, it is purely algebraic, with both forward and inverse maps requiring Cardano’s formula, but it follows the conjugation template rather than the rational-perturbation one. (There is no separate global-shift parameter here; μ,ν\mu, \nu would be redundant with a,b,δa, b, \delta.)

The shared parameter across all three is the center γ\gamma, which sets where the bijection deviates from the identity. The remaining parameters control how wide, how strong, and (for sinh) how much global shift.

Cubic rational

Sinh conjugation

Cubic conjugation

Figure 5. Each family transforms a standard Gaussian (dashed grey) into the coloured density. The cubic rational and cubic conjugation only dent mass locally; sinh conjugation can slide a whole shoulder of mass sideways via μ and ν. The middle panel shows the map h(x) against the identity, the right panel its slope |h′(x)|.

Local versus global

It is worth being precise about the different effects these bijections generate:

  • Global scaling is what affine maps do: uniformly stretch everything. It necessarily touches the entire distribution.
  • Global shift is what sinh\sinh conjugation adds: slide distant mass by a constant offset, while leaving local shape (the derivative) intact in the tails.
  • Local deformation is what all of the analytic bijections can do: stretch one region while compressing a nearby one, so that the change is confined and the tails are untouched.

A spline, extended by the identity outside its active interval, is restricted to purely local deformations. An affine map is restricted to global scaling. The analytic bijections cover local deformation always, plus global shift in the sinh\sinh case — and, crucially, they can be composed. Because a composition of bijections is again a bijection (with log-Jacobians that simply add), we can build a single scalar transformation out of any number of each type, mixing local sculpting and global repositioning at will.

This composition is also how we dial expressivity up and down. Rather than adding knots, as one does with splines, we can stack NN independently parametrized bijections of one or multiple types. Such a stack is a drop-in replacement for the affine or spline transformer inside any coupling or autoregressive layer. It is also the building block for the radial flows below, and for problem-specific transformations such as the zero-mode bijection we explore in the context of lattice field theory in the paper.

Decomposing a transformation coordinate by coordinate into conditional scalar bijections is powerful but not always geometrically natural, as the folded spiral made vivid. A geometrically more intuitive idea: relative to some chosen point, pull probability mass inward or push it outward.

That intuition is exactly a transformation of the radius. Decompose any point xRn{0}{\bm x \in \mathbb R^n \setminus \{\bm 0\}} into a length and a direction, x=rx^\bm x = r\,\hat{\bm x} with r=xr = \|\bm x\| and x^=x/r\hat{\bm x} = \bm x / r. A radial transformation applies a scalar bijection ff to the radius and leaves the direction alone:

g(x)=f(x)xx.g(\bm x) = \frac{f(\|\bm x\|)}{\|\bm x\|}\, \bm x \,.

Every analytic bijection above can serve as that ff (with the mild constraint f(0)=0f(0) = 0, enforced by using f(r)=f~(r)f~(0)f(r) = \tilde f(r) - \tilde f(0) for an unconstrained f~\tilde f). The log-Jacobian has a clean closed form: one term for the radial stretch, and one for the way an nn-dimensional shell of radius rr rescales to radius f(r)f(r):

logdetJg=logf(r)+(n1)log ⁣f(r)r.\log|\det J_g| = \log|f'(r)| + (n-1)\,\log\!\left|\frac{f(r)}{r}\right| \,.

In practice, we also add a learnable center c\bm c (so the transformation happens around a chosen point, not the origin) and an optional per-dimension scaling s\bm s:

g(x)=c+f(r)(xc)r,r=s(xc).g(\bm x) = \bm c + \frac{f(r)\,(\bm x - \bm c)}{r} \,, \qquad r = \|\bm s \odot (\bm x - \bm c)\| \,.
stack:
Figure 6. A radial layer acting on a round Gaussian. Around a draggable center (red), the map stretches the radius by a profile f(r) and leaves direction untouched. Here f is a stack of sinh-conjugation bijections controlled by their local shift δ: δ > 0 pushes mass outward near radius γ (here, into a ring with a depleted center), δ < 0 pulls it inward.

Angle dependence

A purely radial map treats all directions the same. To let it bend differently in different directions, we allow the bijection to depend on the angle: r=f(r,x^)r' = f(r, \hat{\bm x}). The same log-Jacobian formula still holds (with ff' now the partial derivative in rr). The angular dependence can come from an arbitrary neural network, just like a coupling-flow conditioner; in fact an angle-dependent radial flow can be understood as a special kind of coupling flow in spherical coordinates, with the radius as the active coordinate.8

shape:
Figure 7. An angle-dependent radial layer. The map still moves mass only along rays from the draggable center, but now every parameter of the radial profile is a function of the conditioning angle φ. The next figure makes that angular dependence interpretable by expanding it in Fourier modes.

There is a much more interpretable option in 2D. There, the direction is just an angle ϕ=atan2(x2,x1)\phi = \operatorname{atan2}(x_2, x_1), and we can parametrize each bijection parameter as a truncated Fourier series in ϕ\phi:

θj(ϕ)=aj,0+k=1K[aj,kcos(kϕ)+bj,ksin(kϕ)].\theta_j(\phi) = a_{j,0} + \sum_{k=1}^{K}\bigl[a_{j,k}\cos(k\phi) + b_{j,k}\sin(k\phi)\bigr] \,.

This is striking in how few parameters it needs. On the spiral, a single angle-dependent layer with a handful of Fourier terms captures the target to high fidelity with only a few hundred learnable parameters, and because there is no black-box conditioner, the entire transformation is a single, inspectable function f(r,ϕ)f(r, \phi). In higher dimensions this may be generalized to spherical harmonics.

modes K:5param:
stack:
Figure 8. Fourier version of the angle-dependent radial flow, initialized from a small model trained on the spiral. Each stack element's profile parameters are truncated Fourier series in the angle, θ(φ) = a₀ + Σ_{k=1}^{K} [aₖ cos kφ + bₖ sin kφ], with only a handful of modes K, and a stack of such radial bijections is composed. With K = 0 every parameter is constant, and the flow generates concentric circles. Increase K (or “animate”), and each mode bends the density further onto the spiral.

Strenghts and weaknesses

The reason to care about radial flows is a cluster of properties that coupling flows struggle to offer simultaneously:

  • Training stability. Because the parameters can be learned directly (no conditioner network forced to output extreme values), radial flows train stably at much larger learning rates. The geometry is simple enough that the pathological Jacobians which destabilize coupling layers don’t tend to arise.
  • Interpretability. Each layer does one geometrically obvious thing: stretch or compress radially around its center. Angle dependence enters in a controlled, inspectable way through the Fourier terms.
  • Parameter efficiency. On targets with radial structure, radial flows can match coupling-flow quality with orders of magnitude fewer parameters, in our experiments up to a factor of a thousand. And on geometrically structured targets they avoid the axis-aligned “folding” artifacts that affine coupling flows tend to produce (Figure 4).

There is one geometric constraint, however. A single radial layer preserves rays: it can move mass along a ray emanating from the center, but never between rays. Thus, the total mass in any angular wedge is conserved, and a single-center flow can only succeed if the target’s angular mass distribution (as seen from some center) already roughly matches the prior’s. The spiral above happens to satisfy this (every direction crosses the arm about equally), which is why one center suffices there. For more general targets, a stack of layers with different learned centers is required, whose composition lets mass effectively migrate between rays. The price is that the number of layers needed may scale unfavorably with dimension.

Radial transformations are not completely new: simple ones were introduced by Rezende & Mohamed (2015) for variational inference, using r=r+rβ/(α+r)r' = r + r\beta/(\alpha + r). What the analytic bijections let us do is build on that foundation substantially: replacing the single fixed transformation with an expressive, stackable ff, adding multiple learnable centers and per-dimension scalings, and introducing the angular dependence above.

A natural next step generalizes the idea even further: pick a subset of components, apply a (stack of) radial bijections to them, conditioned on the remaining frozen components — a kind of block-wise coupling flow with radial blocks. That clearly points to a whole zoo of possible constructions, and is a natural direction for future work.


Where this leads

The recurring lesson of Part I was that smoothness, an unbounded domain, a closed-form inverse, and local expressivity each seemed to cost one of the others. The analytic bijections are the resolution: a single scalar function RR\mathbb R \to \mathbb R, small enough to write on one line, that holds all four at once. Affine maps give up the local deformation; splines give up smoothness and the full real line; residual flows give up the closed-form inverse. None of these is hard to fix in isolation — the difficulty, and the point, is keeping the other three while you fix it.

That one building block already pays off in two concrete ways. Dropped into a coupling or autoregressive layer in place of an affine or spline transformer, it makes the flow smoother and, often, better. Used as the radius map in a radial flow, it gives an architecture that trains stably, stays inspectable, and — on targets with radial structure — matches coupling-flow quality with up to a thousand times fewer parameters in low dimensions.

We have combined these pieces in only a handful of ways, and most of the interesting combinations are problem-specific. The clearest example is the ϕ4\phi^4 lattice: an analytic bijection placed on the right order parameter preserves an exact symmetry of the target and heads off a mode collapse that the usual training metrics never register (see the paper). We expect many more such constructions, each tuned to the structure of its own problem.

For the rigorous version (derivations, the full experiment suite for 1D, 2D coupling and radial, CIFAR-10, the UCI tabular benchmarks) see the paper. And to try them out, see the JAX implementation and a tutorial notebook in bijx.

Footnotes

  1. We could in principle use different input and output spaces. But if the input space is lower-dimensional than the output and ff is deterministic, the output samples live on a strict submanifold: the output density is singular, with measure zero almost everywhere. Intuitively, we have to put in as many “random degrees of freedom” as we want out — the computer can’t conjure them from thin air. There are subtleties we set aside here, chiefly the stochastic case (and the fact that real computers work with finite-precision, pseudo-random numbers).

  2. Unsurprisingly, this is exactly the same as the change of variables formula when changing integration variables.

  3. We can actually rewrite this in the satisfying form

    qθ(x)=Cδ(fθ(ϵ)x)ρ(ϵ)dϵ.q_\theta(x) = \int_{\mathcal C} \delta(f_\theta(\epsilon) - x) \, \rho(\epsilon) \dd{\epsilon}\,.

    However, this obscures the important change-of-density factor, now implicit in the definition of the Dirac delta distribution.

  4. This is of course how it has to be implemented on a standard computer, since fundamentally they operate on deterministic boolean logic.

  5. using det(AB)=det(A)det(B)\det(AB) = \det(A)\,\det(B)

  6. Taylor expanding gives h(x)=x+δ/g(x)+O(δ2)h(x) = x + \delta/g'(x) + O(\delta^2), which approaches xx exactly when g(x)g'(x) \to \infty, i.e. when gg is superlinear.

  7. The parameter ν\nu could be absorbed into μ\mu and δ\delta, but with this parametrization the forward and inverse maps are symmetric: swap μν\mu \leftrightarrow \nu and flip the signs of δ,μ,ν\delta, \mu, \nu.

  8. Which suggests a natural generalization: also update subsets of angular coordinates, conditioned on the radius and the frozen angles.