A few days ago I was asked what the variational method is, and I found that my previous post, Variational Method for Optimization, barely explains the basics of the variational method. So I will do that in this post.

Data in machine learning are governed by a kind of physics of information. That sounds rather abstract, so let me start with an example from classical mechanics. Consider a ball thrown with velocity $v = (v_x, v_y)$ from position $(x_0, y_0)$, under vertical gravity with constant $g$. From elementary physics, we can solve this parabolic motion of the ball.

In this 2D motion the only force in the system is the constant vertical gravity, so the equations of motion from Newton's law are

$$ F_y = ma_y = m{d^2y \over dt^2} = -mg $$

$$ F_x = 0 $$

The horizontal component of $v$ is constant, and the vertical component is accelerated by gravity, so we can find the path as a function of $t$ and the parameters $v$ and $(x_0, y_0)$.
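Integrating the two equations above twice, and writing the initial position as $(x_0, y_0)$, gives the familiar parabolic path:

$$ x(t) = x_0 + v_x t, \qquad y(t) = y_0 + v_y t - \frac{1}{2} g t^2 $$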

The principle of least action reaches the same solution by a different route; see the accessible explanation by R. Feynman. In short, you vary the path inside a functional $S$, the action, which has the form of an integral, in order to find the equation of motion the ball must follow. The action is built from the kinetic energy and potential energy.
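For the ball above, the action is the time integral of the kinetic energy minus the potential energy,

$$ S = \int_{t_1}^{t_2} \left( \frac{1}{2} m \left( \dot{x}^2 + \dot{y}^2 \right) - m g y \right) dt $$

and requiring $\delta S = 0$ along the true path yields exactly the Newtonian equations written earlier: $m \ddot{y} = -mg$ and $m \ddot{x} = 0$.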

Now look at integral (2) in the post, Variational Method for Optimization. The KL divergence, or relative entropy, in integral (2) is the same as the free energy of Q when the temperature is constant1. Thus the integral plays the role of the action in the mechanical example above2, and by extremizing the integral we can find the equation that the information must follow. Usually such equations of motion are not exactly solvable, so we make some intuitive assumption3 and then find an approximate solution.
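To make the separable (mean-field) assumption concrete, here is a minimal sketch in Python of coordinate-ascent variational inference for a Gaussian with unknown mean and precision, where $Q(\mu, \tau)$ is assumed to factorize as $Q(\mu)\,Q(\tau)$. The model, priors, and names below are my own illustration (the standard conjugate setup, as in Bishop's PRML sec. 10.1.3), not something taken from the original post.

```python
import numpy as np

def cavi_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, iters=50):
    """Mean-field VI for x_i ~ N(mu, 1/tau) with priors
    mu ~ N(mu0, 1/(lam0*tau)) and tau ~ Gamma(a0, b0).
    Assumes Q(mu, tau) = Q(mu) Q(tau); iterates the two updates."""
    n, xbar = len(x), np.mean(x)
    e_tau = a0 / b0                 # initial guess for E[tau]
    a_n = a0 + (n + 1) / 2          # this update is fixed across iterations
    for _ in range(iters):
        # update Q(mu) = N(mu_n, 1/lam_n), holding Q(tau) fixed
        mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
        lam_n = (lam0 + n) * e_tau
        var_mu = 1.0 / lam_n
        # update Q(tau) = Gamma(a_n, b_n), holding Q(mu) fixed
        sq = np.sum((x - mu_n) ** 2) + n * var_mu
        b_n = b0 + 0.5 * (sq + lam0 * ((mu_n - mu0) ** 2 + var_mu))
        e_tau = a_n / b_n
    return mu_n, 1.0 / e_tau        # approximate posterior mean and variance

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.5, size=500)   # true mean 2.0, variance 2.25
mu_hat, var_hat = cavi_gaussian(data)
```

Each update extremizes the free-energy integral with respect to one factor while the other is held fixed, which is exactly the "intuitive assumption, then approximate solution" step described above.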

  1. The free energy of the data is given by E - T S, where E is the energy, T is the temperature, and S is the entropy (not to be confused with the action). I will omit the details; see chapter 33 of the book, or reply to me if you want to know more. ↩︎

  2. In the example of the parabolic motion, we were able to obtain the equation of motion from Newton's law without an action. In the world of data, however, the equation that constrains the relations among the parameters is hard to obtain. ↩︎

  3. In the post, Variational Method for Optimization, we assumed the approximate probability is separable, such that $Q(\mu, \sigma) = Q(\mu)\, Q(\sigma)$. ↩︎