<p>Jekyll 2020-01-21T06:57:07-05:00 http://gregorygundersen.com/feed.xml</p> <p>A Python Implementation of the Multivariate t-distribution 2020-01-20T00:00:00-05:00 http://gregorygundersen.com/blog/2020/01/20/multivariate-t</p> <p>Curiously enough, <a href="https://github.com/scipy/scipy/issues/10042" target="_blank">SciPy does not have an implementation of the multivariate t-distribution</a>. I needed one, but after casting around on the internet, the only thing I found in Python was from <a href="https://stackoverflow.com/questions/29798795/multivariate-student-t-distribution-with-python" target="_blank">this StackOverflow Q&amp;A</a>. I dislike trusting statistical software unless it is widely used or I understand the code, and so I decided to write the function myself. However, once I got underway, I realized that the multivariate t-distribution’s probability density function (PDF),</p> <script type="math/tex; mode=display">f(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}, \nu) = \frac{\Gamma((\nu + p) / 2)}{\Gamma(\nu / 2) \nu^{p/2} \pi^{p/2} |\boldsymbol{\Sigma}|^{1/2}} \Big[ 1 + \frac{1}{\nu} (\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \Big]^{-(\nu + p)/2} \tag{1}</script> <p>looks a lot like the multivariate normal’s PDF,</p> <script type="math/tex; mode=display">f(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2 \pi)^D \det(\boldsymbol{\Sigma})}} \exp\Big\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \Big\}. \tag{2}</script> <p>(Here, $p$ in $(1)$ and $D$ in $(2)$ both denote the dimension of $\mathbf{x}$.) This isn’t surprising since the relationship between Student’s t-distribution and the normal distribution is well known. However, since <a href="/blog/2019/10/30/scipy-multivariate/">I have already written about</a> and re-implemented SciPy’s fast and numerically stable multivariate normal PDF, I thought it made sense to reuse some of the logic in that implementation. 
The basic idea is that if we perform an eigendecomposition of $\boldsymbol{\Sigma}$, which is positive definite, we can easily and stably compute the matrix inverse and log determinant:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \boldsymbol{\Sigma} &= \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^{\top} \\ &\Downarrow \\ \boldsymbol{\Sigma}^{-1} &= \mathbf{Q} \boldsymbol{\Lambda}^{-1} \mathbf{Q}^{\top} \\ &\Downarrow \\ \boldsymbol{\Sigma}^{-1} &= \mathbf{Q} \begin{bmatrix} 1/\lambda_1 & & \\ & \ddots & \\ & & 1/\lambda_D \end{bmatrix} \mathbf{Q}^{\top} \\ \log \det(\boldsymbol{\Sigma}) &= \sum_{d=1}^{D} \log \lambda_d. \end{align} \tag{3} %]]></script> <p>Above, we use the fact that the determinant of a $D \times D$ matrix is the product of its eigenvalues, so its log determinant is the sum of the log eigenvalues. We compute the log PDF and exponentiate it for numerical stability.</p> <p>A nice feature of this approach is that we can use SciPy’s tested implementation and infrastructure for the tricky bits: inverting $\boldsymbol{\Sigma}$, computing the log determinant, and computing a vectorized Mahalanobis distance. In particular, the original author of that module, <a href="https://www.enthought.com/person/joris-vankerschaver/" target="_blank">Joris Vankerschaver</a>, wrote a <a href="https://github.com/scipy/scipy/blob/5da565bef88ac07dfaa844fc953039ea52d09145/scipy/stats/_multivariate.py#L111">utility class</a> for automatically handling $\boldsymbol{\Sigma}$.</p> <p>See <a href="https://github.com/gwgundersen/multivariate-t-distribution/blob/master/multivariatet.py">my GitHub</a> for a fairly complete implementation of a multivariate t-distributed random variable object, which relies heavily on Vankerschaver’s implementation. 
For legibility, here is a simpler version of the PDF without any comments, sanity checks, or error handling:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">scipy.special</span> <span class="kn">import</span> <span class="n">gammaln</span>


<span class="k">def</span> <span class="nf">pdf</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">mean</span><span class="p">,</span> <span class="n">shape</span><span class="p">,</span> <span class="n">df</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">logpdf</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">mean</span><span class="p">,</span> <span class="n">shape</span><span class="p">,</span> <span class="n">df</span><span class="p">))</span>


<span class="k">def</span> <span class="nf">logpdf</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">mean</span><span class="p">,</span> <span class="n">shape</span><span class="p">,</span> <span class="n">df</span><span class="p">):</span>
    <span class="n">p</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span> <span class="k">else</span> <span class="n">x</span><span class="o">.</span><span class="n">size</span>

    <span class="n">vals</span><span class="p">,</span> <span class="n">vecs</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">eigh</span><span class="p">(</span><span class="n">shape</span><span class="p">)</span>
    <span class="n">logdet</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">vals</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">()</span>
    <span class="n">valsinv</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">1.</span><span class="o">/</span><span class="n">v</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">vals</span><span class="p">])</span>
    <span class="n">U</span> <span class="o">=</span> <span class="n">vecs</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">valsinv</span><span class="p">)</span>
    <span class="n">dev</span> <span class="o">=</span> <span class="n">x</span> <span class="o">-</span> <span class="n">mean</span>
    <span class="n">maha</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">square</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">dev</span><span class="p">,</span> <span class="n">U</span><span class="p">))</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>

    <span class="n">t</span> <span class="o">=</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="p">(</span><span class="n">df</span> <span class="o">+</span> <span class="n">p</span><span class="p">)</span>
    <span class="n">A</span> <span class="o">=</span> <span class="n">gammaln</span><span class="p">(</span><span class="n">t</span><span class="p">)</span>
    <span class="n">B</span> <span class="o">=</span> <span class="n">gammaln</span><span class="p">(</span><span class="mf">0.5</span> <span class="o">*</span> <span class="n">df</span><span class="p">)</span>
    <span class="n">C</span> <span class="o">=</span> <span class="n">p</span><span class="o">/</span><span class="mf">2.</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">df</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">pi</span><span class="p">)</span>
    <span class="n">D</span> <span class="o">=</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">logdet</span>
    <span class="n">E</span> <span class="o">=</span> <span class="o">-</span><span class="n">t</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="p">(</span><span class="mf">1.</span><span class="o">/</span><span class="n">df</span><span class="p">)</span> <span class="o">*</span> <span class="n">maha</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">A</span> <span class="o">-</span> <span class="n">B</span> <span class="o">-</span> <span class="n">C</span> <span class="o">-</span> <span class="n">D</span> <span class="o">+</span> <span class="n">E</span>
</code></pre></div></div> <p>I hope to put together a SciPy pull request (PR) for this code in the near future. However, I am not familiar with the workflow and typical duration of SciPy PRs and figured I should just post this now. 
I am also unsure of how to handle the cumulative distribution function, which has no closed-form solution.</p> <p>Gregory Gundersen</p> <p>Why I Keep a Research Blog 2020-01-12T00:00:00-05:00 http://gregorygundersen.com/blog/2020/01/12/why-research-blog</p> <p>Before I started taking writing seriously, I had a loose grasp of many mathematical and technical concepts; and I was not sure how to tackle open-ended problems. Maintaining this research blog has clarified my thoughts and improved how I approach research problems. 
The goal of this post is to explain why I have found the process so valuable.</p> <p>As an overview, here are my reasons with page jumps:</p> <ol> <li><a href="#working-through-confusion">Working through confusion</a></li> <li><a href="#calibrating-confidence">Calibrating confidence</a></li> <li><a href="#learning-with-intention">Learning with intention</a></li> <li><a href="#flanking-the-problem">Flanking the problem</a></li> <li><a href="#solving-through-understanding">Solving through understanding</a></li> <li><a href="#writing-slowly-recalling-quickly">Writing slowly, recalling quickly</a></li> <li><a href="#contributing-to-the-community">Contributing to the community</a></li> </ol> <h2 id="working-through-confusion">Working through confusion</h2> <p>This quote from <a href="https://en.wikiquote.org/wiki/Richard_Feynman" target="_blank">Richard Feynman</a> is at the top of my blog’s landing page:</p> <blockquote> <p>I learned very early the difference between knowing the name of something and knowing something.</p> </blockquote> <p>I would phrase this in my own words as, “Using the word for something does not mean you understand it.” While this is true in general, I hypothesize that jargon is especially susceptible to this kind of misuse because an expert listener might infer a mutual understanding that does not exist. This feeling of verbal common ground can even be gamed. Many of us have done this on exams, hoping for partial credit by stitching together the outline of a proof or using the right words in an essay with the hopes that the professor connects the dots for us.</p> <p>What does this have to do with blogging? Blogging is a public act. Anyone can read this. When I write a blog post, I imagine my supervisor, a respected colleague, or a future employer reading my explanation. These imagined readers force me to ask myself honestly if I understand what I am writing. How do I know when a post is done? 
I write until I stop having significant questions, until my imaginary audience stop raising their hands. The end result is that writing forces me to acknowledge and then work through my confusion.</p> <p>In my mind, the writing style of scientific papers inadvertently contributes to the problem of jargon abuse because it is designed to highlight and convey novelty; any concept that is not a main contribution may be cited and then taken as a given. A novice might mistake this writing style for how a scientist should actually think or speak. Summarizing a paper in your own words restructures the content to focus on learning rather than novelty.</p> <h2 id="calibrating-confidence">Calibrating confidence</h2> <p>It is difficult to know what you should know when you have a lot to learn and are in an intelligence-signaling environment. A side effect of having written detailed technical notes is that I calibrate my confidence on a topic. If I now understand something, I am sure of it and can explain myself clearly. If I don’t understand something, I have a sense of why it is difficult to understand or what prerequisite knowledge I am missing.</p> <p>This idea reminds me of <a href="https://www.glamour.com/story/mindy-kaling-guide-to-killer-confidence" target="_blank">Mindy Kaling’s Guide to Killer Confidence</a>, in which Kaling writes,</p> <blockquote> <p>People talk about confidence without ever bringing up hard work. That’s a mistake… I don’t understand how you could have self-confidence if you don’t do the work.</p> </blockquote> <p>In childhood, college, and even early graduate school, people have many structures that force them to do the work: homework, tests, admission essays, qualifying exams. But as you enter the research life in earnest, these structures mostly disappear. 
For me, writing things down is the best way I have found to ensure that I actually do the work.</p> <p>Furthermore, writing has given me a template for how to feel confident whenever I need to: do the work. In my first year of graduate school, I botched presenting papers in my lab’s group meeting. This is because I passively read papers by just underlining key sentences or writing notes and questions in the margins. As a result, I wasn’t sure if I knew what I knew, and that was clear to others. Blogging has taught me how to read a paper because explaining something is a more active form of understanding. Now I summarize the main contribution in my own words, write out the notation and problem setup, define terms, and rederive the main equations or results. This process mimics the act of presenting and is great practice for it.</p> <h2 id="learning-with-intention">Learning with intention</h2> <p>In his essay, <a href="http://michaelnielsen.org/blog/principles-of-effective-research/" target="_blank"><em>Principles of Effective Research</em></a>, Michael Nielsen writes,</p> <blockquote> <p>In my opinion the reason most people fail to do great research is that they are not willing to pay the price in self-development. Say some new field opens up that combines field $X$ and field $Y$. Researchers from each of these fields flock to the new field. My experience is that <em>virtually none</em> of the researchers in either field will systematically learn the other field in any sort of depth. The few who do put in this effort often achieve spectacular results.</p> </blockquote> <p>I have thought a lot about this quote throughout my PhD, perhaps because of my experience as a self-taught programmer. When I first started teaching myself to program, I felt that I had no imagination. I couldn’t be creative because I was too focused on finding the syntax bug or reasoning about program structure. However, with proficiency came creativity. 
Programming became less important than what I was building and why. When I started my PhD, I hypothesized that the same rules would apply: I wouldn’t be able to think creatively about machine learning until I built up the requisite knowledge base. In programming, you can practice by writing programs; but how can you <em>practice research</em>? For me, writing detailed, expository technical notes is the equivalent of the programmer’s side project: it forces me to intentionally and systematically build my knowledge base by understanding ideas, working through proofs, and implementing models.</p> <p>A watershed moment in my first machine-learning research project came when I decided to systematically work through the background material: I blogged about <a href="/blog/2018/07/17/cca/">canonical correlation analysis</a>, then <a href="/blog/2018/08/08/factor-analysis/">factor analysis</a>, and finally <a href="/blog/2018/09/10/pcca/">probabilistic canonical correlation analysis</a>. My understanding and confidence in the material changed profoundly. I became intellectually committed in a way that was impossible without first understanding the problem.</p> <p>The importance of self-development continues even after one has reached proficiency. There is a famous aphorism, “If all you have is a hammer, everything looks like a nail.” If you ask a person to do a task quickly, they will resort to the tools they know. Similarly, when I am against a publication deadline, stressed about a lagging project, or trying hard to “just think,” I fall back on familiar thought patterns. There is an embarrassing period of my PhD in which all I did was hyperparameter tune a neural network because I didn’t know what else to try. For this reason, I find brainstorming is typically useless for me. Under pressure, my mind, like a cart on a well-worn path, finds the same old ruts. 
Once again, writing breaks this cycle because it requires more active participation.</p> <h2 id="flanking-the-problem">Flanking the problem</h2> <p>Hard problems are intimidating; I often do not know where to start and worry that I will waste my time. Writing blog posts about the larger context of a problem is my way of flanking it, of head faking myself about what I am actually doing. This lowers the psychological stakes because, rather than directly attacking the problem, I am producing something that I know will be valuable either way.</p> <p>Let me give an example. Currently, I am working on a latent variable model of neuron spiking data with a complex Bayesian inference procedure. When I started the project, I did not have the background in Bayesian methods to immediately start working. Furthermore, I started the project in March and had already committed to a summer internship. I knew that it would be hard but important to keep some momentum over the summer. So I decided to flank the project by writing a series of blog posts on topics that I knew were both important for the project and generally useful: the <a href="/blog/2019/05/08/laplaces-method/">Laplace approximation</a>, <a href="/blog/2019/06/27/gp-regression/">Gaussian process regression</a> and <a href="/blog/2019/09/12/practical-gp-regression/">its efficient implementation</a>, <a href="/blog/2019/09/01/sampling/">Monte Carlo methods</a>, <a href="/blog/2019/09/16/poisson-gamma-nb/">Poisson–gamma mixture models</a>, and <a href="/blog/2019/09/20/polya-gamma/">Polya gamma augmentation</a>. These posts, written in the spring and summer, allowed me to start thinking about and preparing for the problem indirectly.</p> <p>As the project got underway in the fall, I wrote more blog posts as needed. 
For example, we tried Hamiltonian Monte Carlo (HMC) inference, and so I wrote about <a href="/blog/2019/10/28/romantic-markov-chains/">ergodicity</a> and <a href="/blog/2019/11/02/metropolis-hastings/">Metropolis–Hastings</a> (MH). In this case, I blogged about MH rather than HMC because I knew that the former leads to the latter and is important foundational knowledge. I was mitigating the risk of HMC not working by teaching myself something I was confident I should know anyway. We are currently exploring using random Fourier features to scale up the method, and I approached this new material by writing about the <a href="/blog/2019/12/10/kernel-trick/">kernel trick</a>, <a href="/blog/2019/12/23/random-fourier-features/">random Fourier features</a>, and <a href="/blog/2020/01/06/kernel-gp-regression/">kernel ridge regression</a>.</p> <p>Research often amounts to long-term gambles. As a junior researcher, I try to mitigate my risk exposure by working on small, promising problems with guidance from my advisor and senior lab members. However, writing is my other way of mitigating risk. If my current project were to fail, the directed and intentional process of systematically attacking the background material will have prepared me well for the next problem.</p> <h2 id="solving-through-understanding">Solving through understanding</h2> <p>The mathematician <a href="http://www.ams.org/notices/200410/fea-grothendieck-part2.pdf" target="_blank">Arthur Ogus explained</a> Alexandre Grothendieck’s approach to problem solving by saying,</p> <blockquote> <p>If you don’t see that what you are working on is almost obvious, then you are not ready to work on that yet.</p> </blockquote> <p>I find this quote comforting because it suggests that good ideas, at least for one famous mathematician, do not come into the mind <em>ex nihilo</em>. 
Rather, good ideas come from so deeply understanding a problem that the solution seems obvious.</p> <p>In my own experience, writing has gotten me closer than anything else to having original research thoughts that feel obvious. So far, these thoughts are always a decade late or unfeasible, but I am happy to be having them. Let me give an example. I have written about <a href="/blog/2018/08/08/factor-analysis/">factor analysis</a>, an efficient implementation of factor analysis using the <a href="/blog/2018/11/30/woodbury/">Woodbury matrix identity</a>, and <a href="/blog/2019/01/17/randomized-svd/">randomized singular value decomposition</a> (SVD). As I wrote about randomized SVD, I thought, “Why isn’t the matrix inversion in factor analysis implemented using randomized SVD? We’re just inverting the loadings, a probabilistic construct; a randomized algorithm seems fine.” This was not a brilliant thought. It’s a little “$A$ plus $B$,” but it was my thought, and I would not have had it without real understanding of the methods. Well, it turns out, <a href="https://github.com/scikit-learn/scikit-learn/blob/0.22.1/sklearn/decomposition/_factor_analysis.py#L210" target="_blank">that’s exactly how Scikit-learn does it</a>.</p> <p>My thinking on this topic has been shaped by how other researchers talk about research: the important thing is not having a big idea but a line of attack; and having a line of attack means understanding the problem. In his talk <a href="https://www.cs.virginia.edu/~robins/YouAndYourResearch.html" target="_blank">You and Your Research</a>, Richard Hamming said,</p> <blockquote> <p>The three outstanding problems in physics, in a certain sense, were never worked on while I was at Bell Labs… We didn’t work on (1) time travel, (2) teleportation, and (3) antigravity. They are not important problems because we do not have an attack. 
It’s not the consequence that makes a problem important, it is that you have a reasonable attack.</p> </blockquote> <p>By understanding problems deeply, you increase the probability that you can work on an important, attackable problem.</p> <h2 id="writing-slowly-recalling-quickly">Writing slowly, recalling quickly</h2> <p>I think of writing-as-learning as database indexing. In a database, an index is a data structure that efficiently keeps track of where rows in a table are located. To insert into a database via an index is slower than simply adding the row to the bottom of the table because the database must do some bookkeeping. However, querying a database is extremely efficient. A layperson’s example is organizing your books alphabetically.</p> <p>When I write, I make the same trade-off. I appreciate that learning through writing takes longer than learning without writing, often by an order of magnitude. I feel this acutely when I am implementing someone else’s method or creating a figure for my blog when it feels like I should be doing research more directly. However, I consistently feel that the trade-off is worth it. (I suspect this will change throughout my career.) For example, my post on the <a href="/blog/2018/12/10/svd/">SVD</a> took many, many hours to produce. However, the SVD is a foundational and ubiquitous mathematical idea, and now my mind grasps a powerful chunk of understanding rather than a vague symbol. I understand the determinant, matrix rank, principal component analysis, and many other ideas better because I understand the SVD. I am glad I took the time to write that post.</p> <p>Furthermore, I have a permanent, easily recoverable store of personalized knowledge. I probably consult my own blog at least once a day if not more. This is not vanity. Typically, the top of my mental stack is the first post listed on my blog’s landing page. 
Terry Tao has a good blog post on this, called <a href="https://terrytao.wordpress.com/career-advice/write-down-what-youve-done/" target="_blank">“Write down what you’ve done.”</a></p> <p>Here is an example of the benefit of recording things. My latest blog post is about <a href="/blog/2020/01/06/kernel-gp-regression/">kernel ridge regression</a>. The equation for ridge regression has the matrix inversion, $(\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I})^{-1}$. When I learned that ridge regression is used to combat multicollinearity (linear dependencies between predictors or columns of $\mathbf{X}$), I could see why: the matrix inversion in classical linear regression is unstable for low-rank $\mathbf{X}$ without adding something to the diagonal. I had this intuition because I had already proven that $\text{rank}(\mathbf{X}) = \text{rank}(\mathbf{X}^{\top} \mathbf{X})$ in my <a href="/blog/2018/12/20/svd-proof/">proof of the SVD</a>. Importantly, I had forgotten if the relationship were true, but it <em>felt correct</em>, and I knew exactly where to look to confirm my guess.</p> <h2 id="contributing-to-the-community">Contributing to the community</h2> <p>The writing process has made me realize how much of understanding and therefore research is simply walking well-worn paths. Richard Schwartz once talked about this in <a href="https://www.quantamagazine.org/richard-schwartz-in-praise-of-simple-problems-20180109/" target="_blank">an interview</a>:</p> <blockquote> <p>Maybe one thing I appreciate more now is that the state of human knowledge is full of holes. When you’re young you have the impression that almost everything is known, but now I have this feeling that almost everything is unknown about mathematics. There are these very thin channels that people have gone along, like ants following each other along a trail. You find these long thin trails of things, and most things are undeveloped. 
I have more of a sense of the openness of it.</p> </blockquote> <p>I appreciate that most of my writing is me, like an ant, simply following someone else’s trail. For example, when I wrote about the <a href="/blog/2019/03/19/exponential-family/">exponential family</a>, I was treading a path in mathematical statistics that is roughly a hundred years old, and even the exponential family is knowledge we carved out of the infinite number of distributions. I may aspire to more detail than average in my explanations, but the end result is typically still reproduction, not production, of knowledge. I think this is okay. A MathOverflow user once asked how the average mathematician can contribute to mathematics. I love <a href="https://mathoverflow.net/questions/43690/#44213" target="_blank">Bill Thurston’s reply</a> and this particular paragraph:</p> <blockquote> <p>In short, mathematics only exists in a living community of mathematicians that spreads understanding and breaths life into ideas both old and new. The real satisfaction from mathematics is in learning from others and sharing with others. All of us have clear understanding of a few things and murky concepts of many more. There is no way to run out of ideas in need of clarification. The question of who is the first person to ever set foot on some square meter of land is really secondary. Revolutionary change does matter, but revolutions are few, and they are not self-sustaining — they depend very heavily on the community of mathematicians.</p> </blockquote> <p>Reforging existing connections, walking well-worn paths, is a contribution to the research community. This means that keeping a research blog is useful for more than just oneself. And perhaps with time and experience, I can occasionally blaze a new trail.</p>Gregory GundersenBefore I started taking writing seriously, I had a loose grasp of many mathematical and technical concepts; and I was not sure how to tackle open-ended problems. 
Maintaining this research blog has clarified my thoughts and improved how I approach research problems. The goal of this post is to explain why I have found the process so valuable.Comparing Kernel Ridge with Gaussian Process Regression2020-01-06T00:00:00-05:002020-01-06T00:00:00-05:00http://gregorygundersen.com/blog/2020/01/06/kernel-gp-regression<p>With appropriate hyperparameters, the posterior mean of a Gaussian process (GP) regressor with observation noise is a kernel ridge regressor. I learned of this relationship from <a href="https://stats.stackexchange.com/questions/327646#comment755005_327961" target="_blank">this StackExchange comment</a> by <a href="https://www.cs.columbia.edu/~amueller/" target="_blank">Andreas Mueller</a>. I was surprised, and the goal of this post is to better understand that comment. Following <a class="citation" href="#welling2013kernel">(Welling, 2013)</a>, I’ll first introduce kernel ridge regression by re-deriving it from ridge regression. We can then compare this to Gaussian process regression, which I have discussed in detail <a href="/blog/2019/06/27/gp-regression/">here</a> and <a href="/blog/2019/09/12/practical-gp-regression/">here</a>. Finally, this post relies on understanding the kernel trick. Please see my <a href="/blog/2019/12/10/kernel-trick/">previous post</a> on the topic if needed.</p> <h2 id="ridge-regression">Ridge regression</h2> <p>Suppose we have a regression problem with data $\{\mathbf{x}_n, y_n\}_{n=1}^{N}$. Classical linear regression is a simple linear model with additive Gaussian noise, $\varepsilon_n \sim \mathcal{N}(0, \sigma^2)$,</p> <script type="math/tex; mode=display">y_n = \mathbf{x}_n^{\top} \boldsymbol{\beta} + \varepsilon_n. \tag{1}</script> <p>Linear regression is fit by minimizing the sum of squared residuals. 
If $\mathbf{X}$ is an $N \times P$ matrix of $N$ data points and $P$ predictors, the normal equations for linear regression are</p> <script type="math/tex; mode=display">\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}. \tag{2}</script> <p>See my <a href="/blog/2020/01/04/classical-linear-regression/">previous post</a> on ordinary least squares regression for details. To deal with multicollinearity (linear relationships between predictors), ridge regression introduces Tikhonov regularization or an $\ell_2$-norm penalty to the weights $\boldsymbol{\beta}$. Ridge regression is so-named because the new optimization amounts to adding values along the diagonal (or “ridge”) of the covariance matrix,</p> <script type="math/tex; mode=display">\hat{\boldsymbol{\beta}} = (\lambda \mathbf{I}_P + \mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}. \tag{3}</script> <p>$\lambda$ is a fixed hyperparameter. If our data matrix $\mathbf{X}$ exhibits multicollinearity, if there are linear dependencies between the columns, then the covariance matrix $\mathbf{X}^{\top} \mathbf{X}$ will be singular. This is because $\text{rank}(\mathbf{A}) = \text{rank}(\mathbf{A}^{\top} \mathbf{A})$ in general. See <a href="/blog/2018/12/20/svd-proof#2-a-and-atop-a-have-the-same-rank">this appendix</a> from a previous post for a proof. In other words, $\mathbf{X}^{\top} \mathbf{X}$ is non-invertible. Thus, from the perspective of numerical linear algebra, ridge regression stabilizes the matrix inversion in $(2)$ by adding small values to the diagonal in $(3)$.</p> <h2 id="kernel-ridge-regression">Kernel ridge regression</h2> <p>Now let’s kernelize ridge regression. First, let’s replace each feature vector or sample $\mathbf{x}_n$ (a row of $\mathbf{X}$) with $\varphi(\mathbf{x}_n)$ where $\varphi: \mathbb{R}^P \mapsto \mathbb{R}^K$. 
Then we can write $(1)$ as</p> <script type="math/tex; mode=display">y_n = [\varphi(\mathbf{x}_n)]^{\top} \boldsymbol{\beta} + \varepsilon_n \tag{4}</script> <p>where $\boldsymbol{\beta}$ is a $K$- rather than $P$-vector. Let $\boldsymbol{\Phi} = [\varphi(\mathbf{x}_1), \dots, \varphi(\mathbf{x}_N)]^{\top}$ be an $N \times K$ matrix. Then the matrix inverse in the kernelized equation $(3)$, rather than being $P \times P$, is $K \times K$,</p> <script type="math/tex; mode=display">\hat{\boldsymbol{\beta}} = (\lambda \mathbf{I}_K + \boldsymbol{\Phi}^{\top} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^{\top} \mathbf{y}. \tag{5}</script> <p>If $K \gg P$ as it is with many choices of $\varphi(\cdot)$, then the $K \times K$ inverse in $(5)$ is expensive or even intractable to compute. Recall that $K$ can even be infinite. Kernel ridge regression solves this problem using the <a href="/blog/2018/11/30/woodbury/">Woodbury matrix identity</a>, rewriting $(5)$ as</p> <script type="math/tex; mode=display">\hat{\boldsymbol{\beta}} = \boldsymbol{\Phi}^{\top} (\lambda \mathbf{I}_N + \boldsymbol{\Phi} \boldsymbol{\Phi}^{\top})^{-1} \mathbf{y}. \tag{6}</script> <p>See <a href="#a1-kernel-ridge-from-the-woodbury-identity">the appendix</a> for a complete derivation of $(6)$. Note that we can rewrite $(6)$ as</p> <script type="math/tex; mode=display">\hat{\boldsymbol{\beta}} = \sum_{n=1}^{N} \alpha_n \varphi(\mathbf{x}_n) \quad\text{where}\quad \boldsymbol{\alpha} = (\lambda \mathbf{I}_N + \boldsymbol{\Phi} \boldsymbol{\Phi}^{\top})^{-1} \mathbf{y}. \tag{6}</script> <p>Mathematically, this is nothing deep. We’re just applying the definition of matrix multiplication. However, it has deep implications. <a class="citation" href="#welling2013kernel">(Welling, 2013)</a> summarizes the idea nicely:</p> <blockquote> <p>The solution $\mathbf{w}$ [our $\boldsymbol{\beta}$] must lie in the span of the data-cases, even if the dimensionality of the feature space is much larger than the number of data-cases. 
This seems intuitively clear, since the algorithm is linear in feature space.</p> </blockquote> <p>Thus, if we are given a new test point $\mathbf{x}$, the predicted $y$ is</p> <script type="math/tex; mode=display">y \equiv [\varphi(\mathbf{x})]^{\top} \hat{\boldsymbol{\beta}} = [\varphi(\mathbf{x})]^{\top} \boldsymbol{\Phi}^{\top} (\lambda \mathbf{I}_N + \boldsymbol{\Phi} \boldsymbol{\Phi}^{\top})^{-1} \mathbf{y}. \tag{7}</script> <p>We’re now ready for the kernel trick, $k(\mathbf{x}, \mathbf{y}) = \langle \varphi(\mathbf{x}), \varphi(\mathbf{y}) \rangle$, for some positive definite kernel function $k$. Let $\mathbf{K}$ be a matrix and $\mathbf{k}(\mathbf{x})$ be a vector such that</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \mathbf{K}_{ij} &= \varphi(\mathbf{x}_i)^{\top} \varphi(\mathbf{x}_j), \\ \mathbf{k}(\mathbf{x}) &= [k(\mathbf{x}, \mathbf{x}_1), \dots, k(\mathbf{x}, \mathbf{x}_N)]^{\top}. \end{align} \tag{8} %]]></script> <p>Since $\mathbf{K} = \boldsymbol{\Phi} \boldsymbol{\Phi}^{\top}$ and $\mathbf{k}(\mathbf{x}) = \boldsymbol{\Phi} \varphi(\mathbf{x})$, we can write $(7)$ as</p> <script type="math/tex; mode=display">y = [\mathbf{k}(\mathbf{x})]^{\top}(\lambda \mathbf{I}_N + \mathbf{K})^{-1}\mathbf{y} \tag{9}</script> <p>using the kernel trick. While $\varphi(\cdot)$ might lift the features into a possibly infinite-dimensional space, we only need to evaluate $k(\cdot, \cdot)$ on our data.</p> <p>As with linear regression, ridge regression can be interpreted from a probabilistic perspective (see <a class="citation" href="#bishop2006pattern">(Bishop, 2006)</a>, section $3.1.1$ or <a href="https://www.cs.princeton.edu/courses/archive/fall18/cos324/files/regularization.pdf">these lecture notes</a> from Ryan P. Adams). In a probabilistic interpretation, $(9)$ is simply the mean of the probabilistic model (see <a class="citation" href="#bishop2006pattern">(Bishop, 2006)</a>, equation $3.53$). However, our current framing is sufficient for this post.</p> <h2 id="gaussian-process-regression">Gaussian process regression</h2> <p>Now let’s see how GP regression is related to kernel ridge regression. 
In GP regression with noisy observations, the model is</p> <script type="math/tex; mode=display">y_n = f(\mathbf{x}_n) + \varepsilon_n, \tag{10}</script> <p>where $\varepsilon_n$ is i.i.d. Gaussian noise or $\varepsilon_n \sim \mathcal{N}(0, \sigma^2)$ and the function $f$ is itself a random variable,</p> <script type="math/tex; mode=display">f \sim \mathcal{N}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x})). \tag{11}</script> <p>In $(11)$, $m(\cdot)$ is the mean function and $k(\cdot, \cdot)$ is a positive definite kernel. If this looks unfamiliar, please see my <a href="/blog/2019/06/27/gp-regression/">previous post</a> on GP regression. If we use $\mathbf{f}$ to denote $[f(\mathbf{x}_1), \dots, f(\mathbf{x}_N)]^{\top}$ and use subscript $*$ to denote testing rather than training variables, then the full GP model is</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{bmatrix} \mathbf{f}_* \\ \mathbf{f} \end{bmatrix} \sim \mathcal{N} \Bigg( \begin{bmatrix} m(\mathbf{X}_*) \\ m(\mathbf{X}) \end{bmatrix}, \begin{bmatrix} K(\mathbf{X}_*, \mathbf{X}_*) & K(\mathbf{X}_*, \mathbf{X}) \\ K(\mathbf{X}, \mathbf{X}_*) & \sigma^2 \mathbf{I} + K(\mathbf{X}, \mathbf{X}) \end{bmatrix} \Bigg). \tag{12} %]]></script> <p>Hopefully it is clear from context that $K(\cdot, \cdot)$ denotes applying the kernel function to every pair in the two matrices (sets of vectors). Since $\mathbb{E}[y_n] = \mathbb{E}[f(\mathbf{x}_n)]$, we are interested in the mean (and covariance) of $\mathbf{f}_{*}$, that is, the model’s posterior mean (and covariance). 
Thankfully, we can use properties of the conditional Gaussian distribution to easily compute these quantities,</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \mathbb{E}[\mathbf{f}_{*}] &= K(\mathbf{X}_*, \mathbf{X}) [\sigma^2 \mathbf{I} + K(\mathbf{X}, \mathbf{X})]^{-1} \mathbf{y} \tag{13} \\ \text{Cov}(\mathbf{f}_{*}) &= K(\mathbf{X}_*, \mathbf{X}_*) - K(\mathbf{X}_*, \mathbf{X}) [\sigma^2 \mathbf{I} + K(\mathbf{X}, \mathbf{X})]^{-1} K(\mathbf{X}, \mathbf{X}_*). \tag{14} \end{align} %]]></script> <p>The expectation in $(13)$ is just a vectorized version of kernel ridge regression in $(9)$ if we assume the same kernel function $k(\cdot, \cdot)$ and if $\sigma^2$, the variance of the GP noise, is equal to the ridge regression regularization hyperparameter $\lambda$.</p> <h2 id="analysis">Analysis</h2> <p>Kernel ridge and GP regression are quite similar, but the major difference is that a GP regressor is a generative model of the response. This has a few major consequences. While kernel ridge regression can only predict $y$, a GP can quantify its uncertainty about $y$. Furthermore, a GP can generate posterior samples through the generative process. Finally, a GP has a marginal likelihood, and this can be used to find optimal hyperparameters via optimization (see <a class="citation" href="#rasmussen2006gaussian">(Rasmussen &amp; Williams, 2006)</a>, chapter $5$); in kernel ridge regression, one must perform grid search over $\lambda$.</p> <p>   </p> <h2 id="appendix">Appendix</h2> <h3 id="a1-kernel-ridge-from-the-woodbury-identity">A1. 
Kernel ridge from the Woodbury identity</h3> <p>If $\mathbf{A}$ is a $P \times P$ full rank matrix that is rank corrected by $\mathbf{UCV}$ where $\mathbf{U} \in \mathbb{R}^{P \times K}$, $\mathbf{C} \in \mathbb{R}^{K \times K}$, and $\mathbf{V} \in \mathbb{R}^{K \times P}$, then the Woodbury identity is</p> <script type="math/tex; mode=display">(\mathbf{A} + \mathbf{UCV})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1} \mathbf{U} (\mathbf{C}^{-1} + \mathbf{V} \mathbf{A}^{-1} \mathbf{U})^{-1} \mathbf{V} \mathbf{A}^{-1}.</script> <p>Apply the Woodbury matrix identity with $\mathbf{A} = \mathbf{C} = \mathbf{I}$. Then</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} (\mathbf{I} + \mathbf{U}\mathbf{V})^{-1} &= \mathbf{I} - \mathbf{U}(\mathbf{I} + \mathbf{V}\mathbf{U})^{-1} \mathbf{V} \\ (\mathbf{I} + \mathbf{U}\mathbf{V})^{-1} \mathbf{U} &= [\mathbf{I} - \mathbf{U}(\mathbf{I} + \mathbf{V}\mathbf{U})^{-1} \mathbf{V}] \mathbf{U} \\ &= \mathbf{U} - \mathbf{U}(\mathbf{I} + \mathbf{V}\mathbf{U})^{-1} \mathbf{V}\mathbf{U} \\ &= \mathbf{U}(\mathbf{I} + \mathbf{V}\mathbf{U})^{-1} [(\mathbf{I} + \mathbf{V}\mathbf{U}) - \mathbf{V}\mathbf{U}] \\ &= \mathbf{U}(\mathbf{I} + \mathbf{V}\mathbf{U})^{-1}. \end{align} %]]></script> <p>This is the <a href="http://engr207b.stanford.edu/lectures/matrix_facts_2015_02_19_01.pdf">push-through identity</a>. 
Now multiply both sides by $\lambda / \lambda = 1$ for a nonzero scalar $\lambda$, and recall that $(k \mathbf{A})^{-1} = k^{-1} \mathbf{A}^{-1}$ for any nonzero scalar $k$:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} (\mathbf{I} + \mathbf{U}\mathbf{V})^{-1} \mathbf{U} &= \mathbf{U}(\mathbf{I} + \mathbf{V}\mathbf{U})^{-1} \\ \frac{\lambda}{\lambda}[(\mathbf{I} + \mathbf{U}\mathbf{V})^{-1} \mathbf{U}] &= \frac{\lambda}{\lambda}[\mathbf{U}(\mathbf{I} + \mathbf{V}\mathbf{U})^{-1}] \\ (\lambda \mathbf{I} + \lambda \mathbf{U}\mathbf{V})^{-1} \lambda\mathbf{U} &= \lambda\mathbf{U}(\lambda \mathbf{I} + \mathbf{V} \lambda \mathbf{U})^{-1} \\ (\lambda \mathbf{I} + \mathbf{W} \mathbf{V})^{-1} \mathbf{W} &= \mathbf{W}(\lambda \mathbf{I} + \mathbf{V} \mathbf{W})^{-1}. \end{align} %]]></script> <p>In the last step, we set $\mathbf{W} = \lambda \mathbf{U}$. For kernel ridge regression, we apply the last line with $\mathbf{W} = \boldsymbol{\Phi}^{\top}$ and $\mathbf{V} = \boldsymbol{\Phi}$.</p>Gregory GundersenWith appropriate hyperparameters, the posterior mean of a Gaussian process (GP) regressor with observation noise is a kernel ridge regressor. I learned of this relationship from this StackExchange comment by Andreas Mueller. I was surprised, and the goal of this post is to better understand that comment. Following (Welling, 2013), I’ll first introduce kernel ridge regression by re-deriving it from ridge regression. We can then compare this to Gaussian process regression, which I have discussed in detail here and here. Finally, this post relies on understanding the kernel trick. Please see my previous post on the topic if needed.Classical Linear Regression2020-01-04T00:00:00-05:002020-01-04T00:00:00-05:00http://gregorygundersen.com/blog/2020/01/04/classical-linear-regression<h2 id="linear-model">Linear model</h2> <p>Suppose we have a regression problem with data $\{\mathbf{x}_n, y_n\}_{n=1}^{N}$. 
The $n$th observation $\mathbf{x}_n$ is a $P$-dimensional vector of <em>predictors</em> with a scalar <em>response</em> $y_n$. In classical linear regression, the model is that the response is a linear function of the predictors. If $\boldsymbol{\beta} = [\beta_1, \dots, \beta_P]^{\top}$ is a $P$-vector of unknown <em>parameters</em> (or “weights” or “coefficients”) and $\varepsilon_n$ is the $n$th observation’s scalar <em>error</em>, the model can be represented as</p> <script type="math/tex; mode=display">y_n = \beta_1 x_{n1} + \beta_2 x_{n2} + \dots + \beta_P x_{nP} + \varepsilon_n. \tag{1}</script> <p>Written as vectors, $(1)$ is</p> <script type="math/tex; mode=display">y_n = \boldsymbol{\beta}^{\top} \mathbf{x}_n + \varepsilon_n. \tag{2}</script> <p>If we stack the observations $\mathbf{x}_n$ into an $N \times P$ matrix $\mathbf{X}$ and define $\mathbf{y} = [y_1, \dots, y_N]^{\top}$ and $\boldsymbol{\varepsilon} = [\varepsilon_1, \dots, \varepsilon_N]^{\top}$, then the model can be written in matrix form as</p> <script type="math/tex; mode=display">\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}. \tag{3}</script> <p>In this context, $\mathbf{X}$ is often called the <em>design matrix</em>. In classical linear regression, $N &gt; P$, and therefore $\mathbf{X}$ is tall and skinny. In future posts, I will write about methods that deal with this assumption breaking down.</p> <p>We can add an intercept to this linear model in the following way. Without loss of generality, let $\beta_1$ be the intercept. Then add a dummy predictor as the first column of $\mathbf{X}$ whose values are all one.</p> <h2 id="normal-equations">Normal equations</h2> <p>Since $\mathbf{X}$ is a tall and skinny matrix, solving for $\boldsymbol{\beta}$ amounts to solving a linear system of $N$ equations with $P$ unknowns. 
Such a system is <a href="https://en.wikipedia.org/wiki/Overdetermined_system" target="_blank"><em>overdetermined</em></a>, and it is unlikely that such a system has an exact solution. Classical linear regression is sometimes called <em>ordinary least squares</em> because the “best” fit coefficients $[\beta_1, \dots, \beta_P]^{\top}$ are found by minimizing the sum of squared <em>residuals</em>,</p> <script type="math/tex; mode=display">\hat{\boldsymbol{\beta}} = \arg\!\min_{\boldsymbol{\beta}} \sum_{n=1}^{N} (y_n - \mathbf{x}_n^{\top} \boldsymbol{\beta})^2. \tag{4}</script> <p>To clarify, the <em>error</em>, $\varepsilon_n$, for the $n$th observation is the difference between what we observe and the underlying true value. The <em>residual</em>, $y_n - \mathbf{x}_n^{\top} \boldsymbol{\beta}$, is the difference between the observed value and what is predicted by the model (Figure $1$, left). Thus, classical linear regression or ordinary least squares <em>minimizes the sum of squared residuals</em>. For a single data point, the squared residual is zero if the prediction is exactly correct. Otherwise, the penalty increases quadratically, meaning classical linear regression heavily penalizes outliers (Figure $1$, right). Other loss functions induce other models. See my <a href="/blog/2019/10/04/expectation-median-opt/">previous post</a> on interpreting these kinds of optimization problems.</p> <p>In vector form, $(4)$ is</p> <script type="math/tex; mode=display">\hat{\boldsymbol{\beta}} = \arg\!\min_{\boldsymbol{\beta}} \lVert \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \rVert_2^2. \tag{5}</script> <p>Linear regression has an analytic or closed-form solution known as the <em>normal equations</em>,</p> <script type="math/tex; mode=display">\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}. \tag{6}</script> <p>See <a href="#1-normal-equations">the appendix</a> for a complete derivation of $(6)$. 
We will see later why this solution, which comes from minimizing the sum of squared residuals, has some nice interpretations.</p> <div class="figure"> <img src="/image/linreg/residuals.png" alt="" style="width: 100%; display: block; margin: 0 auto;" /> <div class="caption"> <span class="caption-label">Figure 1.</span> Ordinary least squares linear regression on Scikit-learn's <a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html" target="_blank" class="mono">make_regression</a> dataset. Data points are in blue. The predicted hyperplane is the red line. <strong>(Left)</strong> Blue dashed vertical lines represent the residuals. <strong>(Right)</strong> Blue boxes represent the squared residuals. </div> </div> <h2 id="a-probabilistic-perspective">A probabilistic perspective</h2> <p>Classical linear regression can be viewed from a probabilistic perspective. Consider again the linear model</p> <script type="math/tex; mode=display">y_n = \mathbf{x}_n^{\top} \boldsymbol{\beta} + \varepsilon_n. \tag{7}</script> <p>If we assume our error $\varepsilon_n$ is additive Gaussian noise, $\varepsilon_n \sim \mathcal{N}(0, \sigma^2)$, then the model is</p> <script type="math/tex; mode=display">p(y_n \mid \mathbf{x}_n, \boldsymbol{\beta}, \sigma^2) = \mathcal{N}(y_n \mid \mathbf{x}_n^{\top} \boldsymbol{\beta}, \sigma^2). \tag{8}</script> <p>If our data is i.i.d., we can write</p> <script type="math/tex; mode=display">p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid \mathbf{x}_n^{\top} \boldsymbol{\beta}, \sigma^2). \tag{9}</script> <p>In this statistical framework, maximum likelihood (ML) estimation gives us the same optimal parameters as before. To compute the ML estimate, we take the derivative of the log likelihood function with respect to the parameters, set it equal to zero, and solve for $\boldsymbol{\beta}$. 
We can represent the log likelihood compactly using a multivariate normal distribution,</p> <script type="math/tex; mode=display">\log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = -\frac{N}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^{\top} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \tag{10}</script> <p>See <a href="#2-multivariate-normal-representation-of-the-log-likelihood">the appendix</a> for a complete derivation of $(10)$. If we take the derivative of this log likelihood function with respect to the parameters, the first term vanishes because it does not depend on $\boldsymbol{\beta}$, and the positive constant $1/2\sigma^2$ does not affect our optimization. Thus, we are looking for</p> <script type="math/tex; mode=display">\boldsymbol{\beta}_{\texttt{MLE}} = \arg\!\max_{\boldsymbol{\beta}} \big\{ -(\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^{\top} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \big\}. \tag{11}</script> <p>Of course, maximizing the negation of a function is the same as minimizing the function directly. Thus, this is the same optimization problem as $(5)$.</p> <p>Furthermore, let $\boldsymbol{\beta}_0$ and $\sigma_0^2$ be the true generative parameters. Then</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \mathbb{E}[\mathbf{y} \mid \mathbf{X}] &= \mathbf{X}\boldsymbol{\beta}_0 \\ \mathbb{V}[\mathbf{y} \mid \mathbf{X}] &= \sigma_0^2 \mathbf{I}. \end{align} \tag{12} %]]></script> <p>See <a href="#3-conditional-expectation-and-variance">the appendix</a> for a derivation of $(12)$. Since the conditional expectation is the minimizer of the mean squared loss (see my <a href="/blog/2019/10/04/expectation-median-opt/">previous post</a> if needed), we know that $\mathbf{X}\boldsymbol{\beta}_0$ would be the best we can do given our model. 
An interpretation of the conditional variance in this context is that it is the smallest expected squared prediction error.</p> <h2 id="orthogonal-projectors">Orthogonal projectors</h2> <p>Note that in $(6)$, the term $(\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top}$ is the <em>pseudoinverse</em> or the <a href="https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse" target="_blank">Moore-Penrose inverse</a> of $\mathbf{X}$,</p> <script type="math/tex; mode=display">\mathbf{X}^{+} \equiv (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top}. \tag{13}</script> <p>A common use of the pseudoinverse is for overdetermined systems of linear equations (tall, skinny matrices) because these generally lack exact solutions. One way to chunk what linear regression is doing is to simply note</p> <script type="math/tex; mode=display">\mathbf{y} = \mathbf{X} \boldsymbol{\beta} \quad\Rightarrow\quad \hat{\boldsymbol{\beta}} = \mathbf{X}^{+} \mathbf{y}.</script> <p>Importantly, by properties of the pseudoinverse, $\mathbf{P} = \mathbf{X} \mathbf{X}^{+}$ is an orthogonal projector. See <a href="#4-orthogonal-projectors">the appendix</a> for a verification of this fact. Thus, given the estimated parameters $\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}$, the predicted values $\hat{\mathbf{y}}$ are</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \hat{\mathbf{y}} &= \mathbf{X} \hat{\boldsymbol{\beta}} \\ &= \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y} \\ &= \mathbf{P} \mathbf{y} \end{align} \tag{14} %]]></script> <p>where $\mathbf{P}$ is an orthogonal projector.</p> <p>There is a nice geometric interpretation of this. When we multiply the response variables $\mathbf{y}$ by $\mathbf{P}$, we are projecting $\mathbf{y}$ into a space spanned by the columns of $\mathbf{X}$. 
This makes sense since the model is constrained to live in the space of linear combinations of the columns of $\mathbf{X}$,</p> <script type="math/tex; mode=display">\mathbf{y} = \beta_1 \begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{N1} \end{bmatrix} + \beta_2 \begin{bmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{N2} \end{bmatrix} + \dots + \beta_P \begin{bmatrix} x_{1P} \\ x_{2P} \\ \vdots \\ x_{NP} \end{bmatrix} \tag{15}</script> <p>and an orthogonal projection is the closest to $\mathbf{y}$ in Euclidean distance that we can get while staying in this constrained space. (One can find many nice visualizations of this fact online.)</p> <p>Finally, note that</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \hat{\boldsymbol{\varepsilon}} &= \mathbf{y} - \hat{\mathbf{y}} \\ &= \mathbf{y} - \mathbf{P}\mathbf{y} \\ &= (\mathbf{I}_N - \mathbf{P}) \mathbf{y}. \end{align} \tag{16} %]]></script> <p>It is easy to verify that $(\mathbf{I}_N - \mathbf{P})$ is also an orthogonal projection. Importantly, this means that $\mathbf{P}$ gives us an efficient way to compute the estimated errors (the residuals) of the model.</p> <h2 id="conclusion">Conclusion</h2> <p>These various views of classical linear regression help justify the use of the sum of squared residuals. First, a sum of squares is mathematically attractive because it is smooth. Compare this to the absolute value, which is continuous but not differentiable at zero. The probabilistic perspective justifies the use if we assume that $\mathbf{y}$ is contaminated by Gaussian noise. Finally, the solution, the pseudoinverse of $\mathbf{X}$, has a nice geometric interpretation: it creates an orthogonal projection of $\mathbf{y}$ onto the span of the columns of $\mathbf{X}$. There are other attractive features not mentioned here, such as the finite sample distributions being well-defined.</p> <p>   </p> <h2 id="appendix">Appendix</h2> <h3 id="1-normal-equations">1. 
Normal equations</h3> <p>We want to find the parameters or coefficients $\boldsymbol{\beta}$ that minimize the sum of squared residuals,</p> <script type="math/tex; mode=display">\hat{\boldsymbol{\beta}} = \arg\!\min_{\boldsymbol{\beta}} \lVert \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \rVert_2^2.</script> <p>Note that we can write</p> <script type="math/tex; mode=display">\lVert \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \rVert_2^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\top} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}).</script> <p>This can be easily seen by writing out the vectorization explicitly. Let $\mathbf{v}$ be a vector such that</p> <script type="math/tex; mode=display">% <![CDATA[ \mathbf{v} = \begin{bmatrix} y_1 - \mathbf{x}_1^{\top} \boldsymbol{\beta} \\ \vdots \\ y_N - \mathbf{x}_N^{\top} \boldsymbol{\beta} \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} - \begin{bmatrix} x_{11} & \dots & x_{1P} \\ \vdots & \ddots & \vdots \\ x_{N1} & \dots & x_{NP} \end{bmatrix} \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_P \end{bmatrix}. %]]></script> <p>The squared L2-norm $\lVert \mathbf{v} \rVert_2^2$ is the sum of the squared components of $\mathbf{v}$. This is equivalent to taking the dot product $\mathbf{v}^{\top} \mathbf{v}$. 
Now define the function $J(\cdot)$ such that</p> <script type="math/tex; mode=display">J(\boldsymbol{\beta}) = \lVert \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \rVert_2^2.</script> <p>To minimize $J(\cdot)$, we take its derivative with respect to $\boldsymbol{\beta}$, set it equal to zero, and solve for $\boldsymbol{\beta}$,</p> <script type="math/tex; mode=display">% <![CDATA[ \require{cancel} \begin{align} \nabla_{\boldsymbol{\beta}} J(\boldsymbol{\beta}) &\stackrel{1}{=} \nabla_{\boldsymbol{\beta}} \Big[ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\top} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \Big] \\ &\stackrel{2}{=} \nabla_{\boldsymbol{\beta}} \Big[ (\mathbf{y}^{\top} - \boldsymbol{\beta}^{\top} \mathbf{X}^{\top}) (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \Big] \\ &\stackrel{3}{=} \nabla_{\boldsymbol{\beta}} \Big[ \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{X}\boldsymbol{\beta} - \mathbf{y}^{\top} \mathbf{X} \boldsymbol{\beta} + \mathbf{y}^{\top} \mathbf{y} - \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{y} \Big] \\ &\stackrel{4}{=} \nabla_{\boldsymbol{\beta}} \,\text{tr} \big( \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{X}\boldsymbol{\beta} - \mathbf{y}^{\top} \mathbf{X} \boldsymbol{\beta} + \mathbf{y}^{\top} \mathbf{y} - \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{y} \big) \\ &\stackrel{5}{=} \nabla_{\boldsymbol{\beta}} \,\text{tr} \big( \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{X}\boldsymbol{\beta} \big) - \nabla_{\boldsymbol{\beta}} \,\text{tr}\big(\mathbf{y}^{\top} \mathbf{X} \boldsymbol{\beta}\big) + \cancel{\nabla_{\boldsymbol{\beta}} \,\text{tr}\big(\mathbf{y}^{\top} \mathbf{y}\big)} - \nabla_{\boldsymbol{\beta}} \,\text{tr}\big(\boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{y} \big) \\ &\stackrel{6}{=} \nabla_{\boldsymbol{\beta}} \,\text{tr} \big( \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{X}\boldsymbol{\beta} \big) - 2 \nabla_{\boldsymbol{\beta}} 
\,\text{tr}\big(\boldsymbol{\beta}^{\top}\mathbf{X}^{\top}\mathbf{y}\big) \\ &\stackrel{7}{=} 2 \mathbf{X}^{\top} \mathbf{X} \boldsymbol{\beta} - 2 \mathbf{X}^{\top} \mathbf{y} \end{align} %]]></script> <p>In step $4$, we use the fact that the trace of a scalar is the scalar. In step $5$, we use the linearity of differentiation and the trace operator. In step $6$, we use the fact that $\text{tr}(\mathbf{A}) = \text{tr}(\mathbf{A}^{\top})$. In step $7$, we take the derivatives of the left and right terms using identities $108$ and $103$ from <a class="citation" href="#petersen2008matrix">(Petersen et al., 2008)</a>, respectively.</p> <p>If we set line $7$ equal to zero and divide both sides of the equation by two, we get the normal equations:</p> <script type="math/tex; mode=display">\mathbf{X}^{\top} \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^{\top} \mathbf{y}.</script> <h3 id="2-multivariate-normal-representation-of-the-log-likelihood">2. Multivariate normal representation of the log likelihood</h3> <p>The probability density function for a $D$-dimensional multivariate normal distribution is</p> <script type="math/tex; mode=display">p(\mathbf{z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^D \det(\boldsymbol{\Sigma})}} \exp\Big\{ -\frac{1}{2} (\mathbf{z} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{z} - \boldsymbol{\mu}) \Big\}</script> <p>The mean parameter $\boldsymbol{\mu}$ is a $D$-vector, and the covariance matrix $\boldsymbol{\Sigma}$ is a $D \times D$ positive definite matrix. In the probabilistic view of classical linear regression, the data are i.i.d. 
Therefore, we can represent the likelihood function as</p> <script type="math/tex; mode=display">p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = \frac{1}{\sqrt{(2\pi\sigma^2)^N}} \exp\Big\{ -\frac{1}{2 \sigma^2} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^{\top} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \Big\}</script> <p>The above formulation leverages two properties from linear algebra. First, if the components of the random vector are independent (in our case, each component is a sample), then $\boldsymbol{\Sigma}$ is diagonal, and its matrix inverse is just a diagonal matrix with each value replaced by its reciprocal. Second, the determinant of a diagonal matrix is just the product of the diagonal elements. Here, $\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}_N$, so $\boldsymbol{\Sigma}^{-1} = \sigma^{-2} \mathbf{I}_N$ and $\det(\boldsymbol{\Sigma}) = (\sigma^2)^N$.</p> <p>The log likelihood is then</p> <script type="math/tex; mode=display">\log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2 \sigma^2} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^{\top} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})</script> <p>as desired.</p> <h3 id="3-conditional-expectation-and-variance">3. 
Conditional expectation and variance</h3> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \mathbb{E}[\mathbf{y} \mid \mathbf{X}] &= \mathbb{E}[\mathbf{X}\boldsymbol{\beta}_0 + \boldsymbol{\varepsilon} \mid \mathbf{X}] \\ &= \mathbb{E}[\mathbf{X}\boldsymbol{\beta}_0 \mid \mathbf{X}] + \mathbb{E}[\boldsymbol{\varepsilon} \mid \mathbf{X}] \\ &= \mathbf{X}\boldsymbol{\beta}_0 + \mathbb{E}[\boldsymbol{\varepsilon}] \\ &= \mathbf{X}\boldsymbol{\beta}_0 \\ \\ \mathbb{V}[\mathbf{y} \mid \mathbf{X}] &= \mathbb{V}[\mathbf{X}\boldsymbol{\beta}_0 + \boldsymbol{\varepsilon} \mid \mathbf{X}] \\ &= \mathbb{V}[\mathbf{X}\boldsymbol{\beta}_0 \mid \mathbf{X}] + \mathbb{V}[\boldsymbol{\varepsilon} \mid \mathbf{X}] \\ &= \mathbf{0} + \mathbb{V}[\boldsymbol{\varepsilon}] \\ &= \sigma_0^2 \mathbf{I} \end{align} %]]></script> <h3 id="4-orthogonal-projectors">4. Orthogonal projectors</h3> <p>A square matrix is a projection if $\mathbf{P} = \mathbf{P}^2$,</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \mathbf{P}^2 &= \mathbf{P}\mathbf{P} \\ &= \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \\ &= \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \\ &= \mathbf{P}. \end{align} %]]></script> <p>A real-valued projection is orthogonal if $\mathbf{P} = \mathbf{P}^{\top}$, and</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \mathbf{P}^{\top} &= (\mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top})^{\top} \\ &= (\mathbf{X}^{\top})^{\top} [(\mathbf{X}^{\top} \mathbf{X})^{-1}]^{\top} \mathbf{X}^{\top} \\ &= \mathbf{P}. 
\end{align} %]]></script>Gregory GundersenLinear modelRandom Fourier Features2019-12-23T00:00:00-05:002019-12-23T00:00:00-05:00http://gregorygundersen.com/blog/2019/12/23/random-fourier-features<h2 id="kernel-machines">Kernel machines</h2> <p>Consider a learning problem with data and targets $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$ where $\mathbf{x}_n \in \mathcal{X}$ and $y_n \in \mathcal{Y}$. Ignoring the bias, a linear model finds a hyperplane $\mathbf{w}$ such that the decision function</p> <script type="math/tex; mode=display">f^{*}(\mathbf{x}) = \mathbf{w}^{\top} \mathbf{x}</script> <p>is optimal for some loss function. For example, in logistic regression, we compute the logistic function of $f(\mathbf{x})$, and then threshold the output probability to produce a binary classifier with $\mathcal{Y} = \{0, 1\}$. Obviously, linear models break down when our data are not linearly separable for classification (Figure $1$, left) or lack a linear relationship between the features and targets for regression.</p> <p>In a <em>kernel machine</em> or a <em>kernel method</em>, the input domain $\mathcal{X}$ is mapped into another space $\mathcal{V}$ in which the targets may be a linear function of the data. The dimension of $\mathcal{V}$ may be high or even infinite, but kernel methods avoid operating explicitly in this space using the kernel trick: if $k: \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$ is a positive definite kernel, then by <a href="https://en.wikipedia.org/wiki/Mercer%27s_theorem" target="_blank">Mercer’s theorem</a> there exists a <em>basis function</em> or <em>feature map</em> $\varphi: \mathcal{X} \mapsto \mathcal{V}$ such that</p> <script type="math/tex; mode=display">k(\mathbf{x}, \mathbf{y}) = \langle \varphi(\mathbf{x}), \varphi(\mathbf{y}) \rangle_{\mathcal{V}}. \tag{1}</script> <p>$\langle \cdot, \cdot \rangle_{\mathcal{V}}$ is an inner product in $\mathcal{V}$. 
If the kernel trick is new to you, please see my <a href="/blog/2019/12/10/kernel-trick/" target="_blank">previous post</a>. Using the kernel trick and a <a href="https://en.wikipedia.org/wiki/Representer_theorem" target="_blank">representer theorem</a>—specifically, the nonparametric representer theorem—kernel methods construct nonlinear models of $\mathcal{X}$ that are linear in $k$,</p> <script type="math/tex; mode=display">f^{*}(\mathbf{x}) = \langle \mathbf{w}, \varphi(\mathbf{x}) \rangle = \sum_{n=1}^{N} \alpha_n k(\mathbf{x}, \mathbf{x}_n). \tag{2}</script> <p>Taken together, Equations $1$ and $2$ say that provided we have a positive definite kernel $k(\cdot, \cdot)$, we can avoid operating in the possibly infinite-dimensional space $\mathcal{V}$ and instead only compute over $N$ data points because the optimal decision rule can be expressed as an expansion in terms of the training samples. See <a class="citation" href="#scholkopf2001generalized">(Schölkopf et al., 2001)</a> for a rigorous treatment on this topic.</p> <p>For some quick intuition on the representer theorem, recall that the posterior mean of a Gaussian process (GP) regressor is</p> <script type="math/tex; mode=display">\mathbb{E}[\mathbf{f}_{*}] = k(\mathbf{X}_*, \mathbf{X}) \overbrace{[\sigma^2 \mathbf{I} + k(\mathbf{X}, \mathbf{X})]^{-1} \mathbf{y}}^{\boldsymbol{\beta}},</script> <p>where $\mathbf{X}$ is an $N \times D$ matrix of training data and $\mathbf{X}_{*}$ is an $M \times D$ matrix of testing data. See <a href="/blog/2019/06/27/gp-regression/">my previous post</a> on GP regression if needed. For each component in the $M$-vector $\mathbb{E}[\mathbf{f}_{*}]$ (each prediction $y_m$), the GP prediction is a linear-in-$\boldsymbol{\beta}$ model of the kernel evaluated at the test data $\mathbf{x}_m$ against all the training points.</p> <p>Perhaps the most famous kernel machine is the nonlinear support vector machine (SVM) (Figure $1$, right). 
However, any algorithm that can be represented as a dot product between pairs of samples can be converted into a kernel method using Equation $1$. Other methods that can be considered kernel methods are GPs, kernel PCA, and kernel regression.</p> <div class="figure"> <img src="/image/rff/motivation.png" alt="" style="width: 100%; display: block; margin: 0 auto;" /> <div class="caption"> <span class="caption-label">Figure 1.</span> Logistic regression (left) with decision boundary denoted with a solid line and SVM with radial basis function (RBF) kernel (right) on the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html" target="_blank">Scikit-learn half-circles</a> dataset. Support vectors are denoted with circles, and the margins are denoted with dashed lines. </div> </div> <p>While the kernel trick is a beautiful idea and the conceptual backbone of kernel machines, the problem is that for large datasets (for huge $N$), the machine must operate on a kernel matrix $\mathbf{K}$ that is $N \times N$,</p> <script type="math/tex; mode=display">% <![CDATA[ \mathbf{K} = \begin{bmatrix} k(\mathbf{x}_1, \mathbf{x}_1) & k(\mathbf{x}_1, \mathbf{x}_2) & \dots & k(\mathbf{x}_1, \mathbf{x}_N) \\ k(\mathbf{x}_2, \mathbf{x}_1) & k(\mathbf{x}_2, \mathbf{x}_2) & \dots & k(\mathbf{x}_2, \mathbf{x}_N) \\ \vdots & \vdots & \ddots & \vdots \\ k(\mathbf{x}_N, \mathbf{x}_1) & k(\mathbf{x}_N, \mathbf{x}_2) & \dots & k(\mathbf{x}_N, \mathbf{x}_N) \end{bmatrix}. %]]></script> <p>To evaluate a test data point, it must evaluate the sum in Equation $2$. While we’re avoiding computing in $\mathcal{V}$, the kernel matrix is still large. 
Thus, in the era of big data, kernel methods do not necessarily scale.</p> <h2 id="random-features">Random features</h2> <p>In their 2007 paper, <em>Random Features for Large-Scale Kernel Machines</em> <a class="citation" href="#rahimi2007random">(Rahimi &amp; Recht, 2007)</a>, Ali Rahimi and Ben Recht propose a different tack: approximate the above inner product (Equation $1$) with a randomized map $\mathbf{z}: \mathbb{R}^{D} \mapsto \mathbb{R}^{R}$ where ideally $R \ll N$,</p> <script type="math/tex; mode=display">k(\mathbf{x}, \mathbf{y}) = \langle \varphi(\mathbf{x}), \varphi(\mathbf{y}) \rangle \approx \mathbf{z}(\mathbf{x})^{\top} \mathbf{z}(\mathbf{y}). \tag{4}</script> <p>See <a href="https://youtu.be/Nqi2iU7kbD0" target="_blank">this talk</a> by Rahimi for details on the theoretical guarantees of this approximation. Why does this work and why is it a good idea? The representer theorem tells us that the optimal decision function is linear in $\mathcal{V}$. If we have a good approximation of $\varphi(\cdot)$, then</p> <script type="math/tex; mode=display">f(\mathbf{x}) = \langle \mathbf{w}, \varphi(\mathbf{x}) \rangle \approx \boldsymbol{\beta}^{\top} \mathbf{z}(\mathbf{x}). \tag{5}</script> <p>Note that the inner product is in a possibly infinite-dimensional space, while the dot product is in $\mathbb{R}^R$. In other words, provided $\mathbf{z}(\cdot)$ is a good approximation of $\varphi(\cdot)$, we can simply project our data using $\mathbf{z}(\cdot)$ and then use fast linear models in $\mathbb{R}^{R}$. So the task at hand is to find a random projection $\mathbf{z}(\cdot)$ that approximates the corresponding nonlinear kernel machine well.</p> <p>According to <a href="http://www.argmin.net/2017/12/05/kitchen-sinks/" target="_blank">this blog post by Rahimi</a>, their 2007 paper was inspired by the following observation. 
Let $\boldsymbol{\omega}$ be a random $D$-dimensional vector such that</p> <script type="math/tex; mode=display">\boldsymbol{\omega} \sim \mathcal{N}_D(\mathbf{0}, \mathbf{I}).</script> <p>Now define $h_{\boldsymbol{\omega}}$ as</p> <script type="math/tex; mode=display">h_{\boldsymbol{\omega}}: \mathbf{x} \mapsto \exp(i \boldsymbol{\omega}^{\top} \mathbf{x}).</script> <p>Above, $i$ is the imaginary unit. Let the superscript $*$ denote the complex conjugate. Importantly, recall that the complex conjugate of $e^{ix}$ is $e^{-ix}$. Then note</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \mathbb{E}[h_{\boldsymbol{\omega}}(\mathbf{x}) h_{\boldsymbol{\omega}}(\mathbf{y})^{*}] &= \mathbb{E}[\exp(i \boldsymbol{\omega}^{\top} \mathbf{x}) \exp(-i \boldsymbol{\omega}^{\top} \mathbf{y})] \\ &= \mathbb{E}[\exp(i \boldsymbol{\omega}^{\top} (\mathbf{x} - \mathbf{y}))] \\ &= \int_{\mathbb{R}^D} p(\boldsymbol{\omega}) \exp(i \boldsymbol{\omega}^{\top} (\mathbf{x} - \mathbf{y})) \text{d}\boldsymbol{\omega} \\ &\stackrel{\star}{=} k(\mathbf{x} - \mathbf{y}). \end{align} %]]></script> <p>Every step except the last is straightforward: apply the definition of $h_{\boldsymbol{\omega}}$, do some algebra, and apply the definition of expectation. The last step invokes Bochner’s theorem <a class="citation" href="#rudin1962fourier">(Rudin, 1962)</a>. Quoting Rahimi and Recht’s version with small modifications for consistent notation, the theorem is:</p> <blockquote> <p><strong>Bochner’s theorem:</strong> A continuous kernel $k(\mathbf{x}, \mathbf{y}) = k(\mathbf{x} - \mathbf{y})$ on $\mathbb{R}^D$ is positive definite if and only if $k(\Delta)$ is the Fourier transform of a non-negative measure.</p> </blockquote> <p>The Fourier transform of a non-negative measure, call it $p(\cdot)$, is</p> <script type="math/tex; mode=display">k(\Delta) = \int p(\omega) \exp(i \omega \Delta) \text{d}\omega.</script> <p>Hence the equality labeled $\star$. 
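</p>
<p>As a quick numerical sanity check of the equality labeled $\star$, here is a minimal NumPy sketch. It assumes the kernel in question is the RBF kernel with $\gamma = 1/2$, $k(\Delta) = \exp(-\lVert \Delta \rVert^2 / 2)$, which is the kernel whose Fourier-side distribution is $\mathcal{N}_D(\mathbf{0}, \mathbf{I})$. The function names, the test points, and the sample size are illustrative choices, not part of the original derivation.</p>

```python
import numpy as np


def rbf_kernel(x, y):
    """Exact RBF kernel exp(-||x - y||^2 / 2), i.e. gamma = 1/2."""
    return np.exp(-0.5 * np.sum((x - y) ** 2))


def monte_carlo_kernel(x, y, n_samples, rng):
    """Average h_w(x) * conj(h_w(y)) over draws w ~ N(0, I)."""
    W = rng.standard_normal((n_samples, x.shape[0]))  # one omega per row
    h_x = np.exp(1j * (W @ x))
    h_y = np.exp(1j * (W @ y))
    return np.mean(h_x * np.conj(h_y))


rng = np.random.default_rng(0)
x = np.array([0.3, -0.2, 0.5])
y = np.array([-0.1, 0.4, 0.2])

exact = rbf_kernel(x, y)
estimate = monte_carlo_kernel(x, y, n_samples=200_000, rng=rng)

# The real part should be close to the exact kernel value and the
# imaginary part close to zero, up to Monte Carlo error.
print(exact, estimate.real, estimate.imag)
```

<p>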
Notice that the only constraint we need for $k(\cdot, \cdot)$, besides it being a positive definite kernel, is that it is shift invariant, i.e. $k(\mathbf{x}, \mathbf{y}) = k(\mathbf{x} - \mathbf{y})$. Rahimi and Recht observe that many popular kernels, such as the radial basis function (RBF) kernel, are shift invariant. The upshot is that the product $h_{\boldsymbol{\omega}}(\mathbf{x}) h_{\boldsymbol{\omega}}(\mathbf{y})^{*}$ is an unbiased estimate of $k(\mathbf{x}, \mathbf{y})$. Since the <em>expectation</em> of $h_{\boldsymbol{\omega}}(\mathbf{x}) h_{\boldsymbol{\omega}}(\mathbf{y})^{*}$ is equal to $k(\mathbf{x}, \mathbf{y})$, we can lower the variance of our estimate by averaging over $R$ i.i.d. realizations $\{\boldsymbol{\omega}_r\}_{r=1}^{R}$.</p> <p>Putting it all together, we have</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} k(\mathbf{x}, \mathbf{y}) &= k(\mathbf{x} - \mathbf{y}) \\ &\stackrel{1}{=} \int \exp(i \boldsymbol{\omega}^{\top}(\mathbf{x} - \mathbf{y})) p(\boldsymbol{\omega}) \text{d}\boldsymbol{\omega} \\ &= \mathbb{E}\big[\exp(i \boldsymbol{\omega}^{\top}(\mathbf{x} - \mathbf{y}))\big] \\ &\stackrel{2}{\approx} \frac{1}{R} \sum_{r=1}^{R} \exp(i \boldsymbol{\omega}_r^{\top}(\mathbf{x} - \mathbf{y})) \\ &= \frac{1}{R} \sum_{r=1}^{R} \exp(i \boldsymbol{\omega}_r^{\top} \mathbf{x}) \exp(-i \boldsymbol{\omega}_r^{\top} \mathbf{y}) \\ &\stackrel{3}{=} \begin{bmatrix} \frac{1}{\sqrt{R}} \exp(i \boldsymbol{\omega}_1^{\top} \mathbf{x}) \\ \frac{1}{\sqrt{R}} \exp(i \boldsymbol{\omega}_2^{\top} \mathbf{x}) \\ \vdots \\ \frac{1}{\sqrt{R}} \exp(i \boldsymbol{\omega}_R^{\top} \mathbf{x}) \end{bmatrix}^{\top} \begin{bmatrix} \frac{1}{\sqrt{R}} \exp(-i \boldsymbol{\omega}_1^{\top} \mathbf{y}) \\ \frac{1}{\sqrt{R}} \exp(-i \boldsymbol{\omega}_2^{\top} \mathbf{y}) \\ \vdots \\ \frac{1}{\sqrt{R}} \exp(-i \boldsymbol{\omega}_R^{\top} \mathbf{y}) \end{bmatrix} \\ &\stackrel{4}{\equiv} \mathbf{h}(\mathbf{x}) \mathbf{h}(\mathbf{y})^{*}. 
\end{align} %]]></script> <p>Step $1$ is the Fourier transform of $k(\cdot, \cdot)$. Step $2$ is a Monte Carlo approximation of the expectation. Step $3$ is a vectorization of the problem. Step $4$ is the definition of a random map $\mathbf{h}$.</p> <p>Before continuing, let’s make two clarifying points. First, note that we use $\mathbf{h}$ rather than $h$ to denote the fact that $\mathbf{h}$ is really an $R$-vector of normalized $h(\cdot)$ transformations. Second, note that we’ve talked about the <em>dot product</em> $\mathbf{z}(\mathbf{x})^{\top} \mathbf{z}(\mathbf{y})$, but above we have $\mathbf{h}(\mathbf{x}) \mathbf{h}(\mathbf{y})^{*}$. As we will see in the next section, the imaginary part of our random map will disappear, and the new transform is what Rahimi and Recht call $\mathbf{z}$.</p> <h2 id="fine-tuning">Fine tuning</h2> <p>Now that we understand the big idea of a low-dimensional, randomized map and why it might work, let’s get into the weeds. First, note that since both our distribution $\mathcal{N}_D(\mathbf{0}, \mathbf{I})$ and the kernel $k(\Delta)$ are real-valued, the imaginary part vanishes in expectation, and we can write</p> <script type="math/tex; mode=display">% <![CDATA[ \require{cancel} \begin{align} \exp(i \boldsymbol{\omega}^{\top} (\mathbf{x} - \mathbf{y})) &\stackrel{\dagger}{=} \cos(\boldsymbol{\omega}^{\top} (\mathbf{x} - \mathbf{y})) + \cancel{i \sin(\boldsymbol{\omega}^{\top} (\mathbf{x} - \mathbf{y}))} \\ &= \cos(\boldsymbol{\omega}^{\top} (\mathbf{x} - \mathbf{y})). \end{align} %]]></script> <p>Step $\dagger$ is Euler’s formula. We can then define $z_{\boldsymbol{\omega}}(\mathbf{x})$—note that this is still not yet the bolded $\mathbf{z}$—without the imaginary unit as</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \boldsymbol{\omega} &\sim \mathcal{N}_D(\mathbf{0}, \mathbf{I}) \\ b &\sim \text{Uniform}(0, 2\pi) \\ z_{\boldsymbol{\omega}}(\mathbf{x}) &= \sqrt{2} \cos(\boldsymbol{\omega}^{\top}\mathbf{x} + b). 
\end{align} %]]></script> <p>This works because</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \mathbb{E}_{\boldsymbol{\omega}}[z_{\boldsymbol{\omega}}(\mathbf{x}) z_{\boldsymbol{\omega}}(\mathbf{y})] &= \mathbb{E}_{\boldsymbol{\omega}}[\sqrt{2} \cos(\boldsymbol{\omega}^{\top}\mathbf{x} + b) \sqrt{2} \cos(\boldsymbol{\omega}^{\top}\mathbf{y} + b)] \\ &\stackrel{\star}{=} \mathbb{E}_{\boldsymbol{\omega}}[ \cos(\boldsymbol{\omega}^{\top}(\mathbf{x} + \mathbf{y}) + 2b)] + \mathbb{E}_{\boldsymbol{\omega}}[\cos(\boldsymbol{\omega}^{\top}(\mathbf{x} - \mathbf{y}))] \\ &\stackrel{\dagger}{=} \mathbb{E}_{\boldsymbol{\omega}}[\cos(\boldsymbol{\omega}^{\top}(\mathbf{x} - \mathbf{y}))]. \end{align} %]]></script> <p>Step $\star$ is just trigonometry. See <a href="#1-trigonometric-identity">the Appendix</a> for a derivation. Step $\dagger$ uses the fact that since $b \sim \text{Uniform}(0, 2\pi)$, the expectation with respect to $b$ is zero:</p> <script type="math/tex; mode=display">\mathbb{E}_{\boldsymbol{\omega}}[ \cos(\boldsymbol{\omega}^{\top}(\mathbf{x} + \mathbf{y}) + 2b)] = \mathbb{E}_{\boldsymbol{\omega}} [\mathbb{E}_b[ \cos(\boldsymbol{\omega}^{\top}(\mathbf{x} + \mathbf{y}) + 2b) \mid \boldsymbol{\omega}]] = 0</script> <p>If you are unconvinced, see <a href="#2-expectation-of-cost--x-is-zero">the Appendix</a>. We are now ready to define the random map $\mathbf{z}: \mathbb{R}^D \mapsto \mathbb{R}^R$ such that Equation $4$ holds. 
Let</p> <script type="math/tex; mode=display">\mathbf{z}(\mathbf{x}) = \begin{bmatrix} \frac{1}{\sqrt{R}} z_{\boldsymbol{\omega}_1}(\mathbf{x}) \\ \frac{1}{\sqrt{R}} z_{\boldsymbol{\omega}_2}(\mathbf{x}) \\ \vdots \\ \frac{1}{\sqrt{R}} z_{\boldsymbol{\omega}_R}(\mathbf{x}) \end{bmatrix},</script> <p>and therefore</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \mathbf{z}(\mathbf{x})^{\top} \mathbf{z}(\mathbf{y}) &= \frac{1}{R} \sum_{r=1}^{R} z_{\boldsymbol{\omega}_r}(\mathbf{x}) z_{\boldsymbol{\omega}_r}(\mathbf{y}) \\ &= \frac{1}{R} \sum_{r=1}^{R} 2 \cos(\boldsymbol{\omega}_r^{\top} \mathbf{x} + b_r) \cos(\boldsymbol{\omega}_r^{\top} \mathbf{y} + b_r) \\ &\approx \frac{1}{R} \sum_{r=1}^{R} \cos(\boldsymbol{\omega}_r^{\top} (\mathbf{x} - \mathbf{y})) \\ &\approx \mathbb{E}_{\boldsymbol{\omega}}[\cos(\boldsymbol{\omega}^{\top} (\mathbf{x} - \mathbf{y}))] \\ &= k(\mathbf{x}, \mathbf{y}). \end{align} %]]></script> <p>The first approximation drops the $\cos(\boldsymbol{\omega}_r^{\top} (\mathbf{x} + \mathbf{y}) + 2 b_r)$ terms, which vanish in expectation. We now have a simple algorithm to estimate a shift invariant, positive definite kernel. Draw $R$ samples of $\boldsymbol{\omega} \sim \mathcal{N}_D(\mathbf{0}, \mathbf{I})$ and $b \sim \text{Uniform}(0, 2\pi)$ and then compute $\mathbf{z}(\mathbf{x})^{\top} \mathbf{z}(\mathbf{y})$.</p> <p>An alternative version of random Fourier features that you might see is</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \boldsymbol{\omega}_r &\stackrel{\texttt{iid}}{\sim} \mathcal{N}_D(\mathbf{0}, \mathbf{I}) \\ \\ z_{\boldsymbol{\omega}_r}(\mathbf{x}) &= \begin{bmatrix} \cos(\boldsymbol{\omega}_r^{\top} \mathbf{x}) \\ \sin(\boldsymbol{\omega}_r^{\top} \mathbf{x}) \end{bmatrix}. 
\end{align} %]]></script> <p>See <a href="#3-alternative-random-fourier-features">the Appendix</a> to see why this works and <a class="citation" href="#sutherland2015error">(Sutherland &amp; Schneider, 2015)</a> for a comparative analysis of each approach.</p> <h2 id="examples">Examples</h2> <h3 id="rbf-kernel-approximation">RBF kernel approximation</h3> <p>Before fitting a more complex model, let’s first approximate an RBF kernel using random Fourier features. Sample $R$ i.i.d. $\boldsymbol{\omega}$ variables and then compute</p> <script type="math/tex; mode=display">\mathbf{z}(\mathbf{x})^{\top} \mathbf{z}(\mathbf{y}) = \frac{1}{R} \sum_{r=1}^{R} z_{\boldsymbol{\omega}_r}(\mathbf{x})^{\top} z_{\boldsymbol{\omega}_r}(\mathbf{y}) = \frac{1}{R} \sum_{r=1}^{R} \cos(\boldsymbol{\omega}_r^{\top}(\mathbf{x} - \mathbf{y}))</script> <p>for each $(\mathbf{x}, \mathbf{y})$ pair in the data. The resultant $N \times N$ object is the approximate kernel. We see in Figure $2$ that as $R$ increases, the kernel approximation improves.</p> <div class="figure"> <img src="/image/rff/rbf.png" alt="" style="width: 100%; display: block; margin: 0 auto;" /> <div class="caption"> <span class="caption-label">Figure 2.</span> A comparison of the exact radial basis function (RBF) kernel (left) on the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_s_curve.html" target="_blank">Scikit-learn S-curve</a> with random Fourier feature approximations using an increasing number of Monte Carlo samples (right four frames). </div> </div> <p>See my <a href="https://github.com/gwgundersen/random-fourier-features" target="_blank">GitHub</a> for the code to generate this figure.</p> <h3 id="kernel-ridge-regression">Kernel ridge regression</h3> <p>As a more complex example and to see concretely why random Fourier features are efficient, let’s look at kernel ridge regression. 
If needed, see <a class="citation" href="#welling2013kernel">(Welling, 2013)</a> for an introduction to the model. And the Scikit-learn docs have a nice <a href="https://scikit-learn.org/stable/auto_examples/gaussian_process/plot_compare_gpr_krr.html" target="_blank">tutorial comparing kernel ridge and Gaussian process regression</a>. Equation $5$ tells us that $f(\mathbf{x})$ is linear in $\mathbf{z}(\mathbf{x})$. Therefore, we just need to convert our input $\mathbf{x}$ into random features and apply linear methods. Concretely, let $\mathbf{Z} \in \mathbb{R}^{N \times R}$ be a matrix such that</p> <script type="math/tex; mode=display">% <![CDATA[ \mathbf{Z} = \begin{bmatrix} \mathbf{z}_{\boldsymbol{\omega}_1}(\mathbf{x}_1) & \mathbf{z}_{\boldsymbol{\omega}_2}(\mathbf{x}_1) & \dots & \mathbf{z}_{\boldsymbol{\omega}_R}(\mathbf{x}_1) \\ \mathbf{z}_{\boldsymbol{\omega}_1}(\mathbf{x}_2) & \mathbf{z}_{\boldsymbol{\omega}_2}(\mathbf{x}_2) & \dots & \mathbf{z}_{\boldsymbol{\omega}_R}(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{z}_{\boldsymbol{\omega}_1}(\mathbf{x}_N) & \mathbf{z}_{\boldsymbol{\omega}_2}(\mathbf{x}_N) & \dots & \mathbf{z}_{\boldsymbol{\omega}_R}(\mathbf{x}_N) \end{bmatrix}. %]]></script> <p>Then we want to solve for the coefficients $\boldsymbol{\beta}$,</p> <script type="math/tex; mode=display">\boldsymbol{\beta} = (\underbrace{\mathbf{Z}^{\top} \mathbf{Z} + \lambda \mathbf{I}_R}_{\mathbf{A}})^{-1} \mathbf{Z}^{\top} \mathbf{y}. \tag{6}</script> <p>Above, $\lambda$ is the ridge regression regularization parameter. 
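</p>
<p>To make Equation $6$ concrete, here is a minimal NumPy sketch of RFF ridge regression. It is an illustration under stated assumptions rather than the implementation in the linked repository: the helper names (<code>make_rff</code>, <code>rff_transform</code>, <code>fit_ridge</code>), the toy sine data, and the values of $R$ and $\lambda$ are all made up for the example, and drawing $\boldsymbol{\omega} \sim \mathcal{N}_D(\mathbf{0}, \mathbf{I})$ corresponds to an RBF kernel with $\gamma = 1/2$.</p>

```python
import numpy as np


def make_rff(n_features, dim, rng):
    """Draw omega_r ~ N(0, I) and b_r ~ Uniform(0, 2*pi), to be reused later."""
    W = rng.standard_normal((n_features, dim))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return W, b


def rff_transform(X, W, b):
    """Map an (N, D) array to the (N, R) random feature matrix Z."""
    R = W.shape[0]
    return np.sqrt(2.0 / R) * np.cos(X @ W.T + b)


def fit_ridge(Z, y, lam):
    """Solve (Z^T Z + lam * I_R) beta = Z^T y, as in Equation 6."""
    A = Z.T @ Z + lam * np.eye(Z.shape[1])  # R x R, not N x N
    return np.linalg.solve(A, Z.T @ y)


rng = np.random.default_rng(0)
N, R = 200, 100

# Toy 1-D regression problem (hypothetical data, for illustration only).
X = rng.uniform(-3.0, 3.0, size=(N, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(N)

W, b = make_rff(R, X.shape[1], rng)
Z = rff_transform(X, W, b)
beta = fit_ridge(Z, y, lam=1e-2)

# Predict on a grid using the same W and b as during training.
X_test = np.linspace(-3.0, 3.0, 50)[:, None]
y_pred = rff_transform(X_test, W, b) @ beta
rmse = np.sqrt(np.mean((y_pred - np.sin(X_test[:, 0])) ** 2))
print(rmse)
```

<p>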
See Figure $3$ for the results of comparing kernel regression with an RBF kernel with RFF regression.</p> <div class="figure"> <img src="/image/rff/kernel_regression.png" alt="" style="width: 100%; display: block; margin: 0 auto;" /> <div class="caption"> <span class="caption-label">Figure 3.</span> Comparison of kernel ridge regression (top) with RFF ridge regression (bottom) on $N = 100$ data points and $(R = 20)$-dimensional random features. </div> </div> <p>With Equation $6$ in mind, it is clear why random Fourier features are efficient: inverting $\mathbf{A}$ has time complexity $\mathcal{O}(R^3)$ rather than $\mathcal{O}(N^3)$. If $R &lt; N$—and especially if $R \ll N$—then we can have big savings. What is not shown is that even on this small data set, RFF regression is over an order of magnitude faster than RBF kernel regression. Since kernel machines scale poorly in $N$, it is easy to make this multiplier larger by increasing $N$ while keeping $R$ fixed.</p> <p>Again, see my <a href="https://github.com/gwgundersen/random-fourier-features" target="_blank">GitHub</a> for a complete implementation. Note that we need to cache the random variables, $\boldsymbol{\omega}_1, \boldsymbol{\omega}_2, \dots, \boldsymbol{\omega}_R$ and $b_1, b_2, \dots b_R$, for generating $\mathbf{Z}$ so that we have the same transformation when we predict on test data.</p> <h2 id="conclusion">Conclusion</h2> <p>Random features for kernel methods is a beautiful, simple, and practical idea; and I see why Rahimi and Recht’s paper won the <a href="https://www.youtube.com/watch?v=Qi1Yry33TQE" target="_blank">Test of Time Award</a> at NeurIPS 2017. Many theoretical results are impractical or complicated to implement. Many empirical results are not well-understood or brittle. 
Random features is neither: relatively speaking, it is a simple idea using established mathematics, yet it comes equipped with good theoretical guarantees and good results with practical, easy-to-implement models.</p> <p>   </p> <h2 id="appendix">Appendix</h2> <h3 id="1-trigonometric-identity">1. Trigonometric identity</h3> <p>Recall from trigonometry that</p> <script type="math/tex; mode=display">\cos(x + y) = \cos(x) \cos(y) - \sin(x) \sin(y).</script> <p>See <a href="https://www.youtube.com/watch?v=0VBQnR2h8XM" target="_blank">this Khan Academy video</a> for a proof. Furthermore, note that $\cos(-x) = \cos(x)$ since the cosine function is symmetric about $x = 0$. This is not true of the sine function. Instead, it has odd symmetry: $\sin(-x) = -\sin(x)$. Thus, with a little clever manipulation, we can write</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \cos(x + y) + \cos(x - y) &= \cos(x + y) + \cos(x + (-y)) \\ &= [\cos(x) \cos(y) - \sin(x) \sin(y)] + [\cos(x) \cos(-y) - \sin(x) \sin(-y)] \\ &= [\cos(x) \cos(y) - \cancel{\sin(x) \sin(y)}] + [\cos(x) \cos(y) + \cancel{\sin(x) \sin(y)}] \\ &= 2 \cos(x) \cos(y). \end{align} %]]></script> <h3 id="2-expectation-of-cost--x-is-zero">2. Expectation of $\cos(t + x)$ is zero.</h3> <p>Note that</p> <script type="math/tex; mode=display">\mathbb{E}_{\boldsymbol{\omega}}[\cos(\boldsymbol{\omega}^{\top}(\mathbf{x} + \mathbf{y}) + 2b)] = \mathbb{E}_{\boldsymbol{\omega}} [\mathbb{E}_b[ \cos(\boldsymbol{\omega}^{\top}(\mathbf{x} + \mathbf{y}) + 2b) \mid \boldsymbol{\omega}]]</script> <p>holds by the law of total expectation. We claim the inner conditional expectation is zero. To ease notation, let $t = \boldsymbol{\omega}^{\top}(\mathbf{x} + \mathbf{y})$ and let $x = 2b$. Since $b \sim \text{Uniform}(0, 2\pi)$, we have $x \sim \text{Uniform}(0, 4\pi)$ and $p(x) = 1/(4\pi)$. 
Then</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \mathbb{E}[\cos(t + x)] &= \int_{0}^{4\pi} \frac{\cos(t + x)}{4\pi} \text{d}x \\ &= \frac{1}{4\pi} \int_{0}^{4\pi} \cos(t + x) \text{d}x \\ &= \frac{1}{4\pi} \Big[ \sin(t + x) \Big|_{0}^{4\pi} \Big] \\ &= \frac{1}{4\pi} \Big[ \sin(t + 4\pi) - \sin(t) \Big] \\ &= 0. \end{align} %]]></script> <p>The last step holds because the sine function is $2\pi$-periodic, so $\sin(t + 4\pi) = \sin(t)$.</p> <h3 id="3-alternative-random-fourier-features">3. Alternative random Fourier features</h3> <p>Let</p> <script type="math/tex; mode=display">\boldsymbol{\omega} \sim \mathcal{N}_D(\mathbf{0}, \mathbf{I}).</script> <p>Then</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} k(\mathbf{x}, \mathbf{y}) &= \mathbb{E}[\cos(\boldsymbol{\omega}^{\top} (\mathbf{x} - \mathbf{y}))] \\ &= \mathbb{E}[\cos(\boldsymbol{\omega}^{\top} \mathbf{x} - \boldsymbol{\omega}^{\top} \mathbf{y})] \\ &\stackrel{\ddagger}{=} \mathbb{E}[\cos(\boldsymbol{\omega}^{\top} \mathbf{x}) \cos(\boldsymbol{\omega}^{\top} \mathbf{y}) + \sin(\boldsymbol{\omega}^{\top} \mathbf{x}) \sin(\boldsymbol{\omega}^{\top} \mathbf{y})] \\ &= \mathbb{E}\Bigg[ \begin{bmatrix} \cos(\boldsymbol{\omega}^{\top} \mathbf{x}) \\ \sin(\boldsymbol{\omega}^{\top} \mathbf{x}) \end{bmatrix}^{\top} \begin{bmatrix} \cos(\boldsymbol{\omega}^{\top} \mathbf{y}) \\ \sin(\boldsymbol{\omega}^{\top} \mathbf{y}) \end{bmatrix}\Bigg] \\ &\approx \frac{1}{R} \sum_{r=1}^{R} \begin{bmatrix} \cos(\boldsymbol{\omega}_r^{\top} \mathbf{x}) \\ \sin(\boldsymbol{\omega}_r^{\top} \mathbf{x}) \end{bmatrix}^{\top} \begin{bmatrix} \cos(\boldsymbol{\omega}_r^{\top} \mathbf{y}) \\ \sin(\boldsymbol{\omega}_r^{\top} \mathbf{y}) \end{bmatrix} \\ &\equiv \frac{1}{R} \sum_{r=1}^{R} z_{\boldsymbol{\omega}_r}(\mathbf{x})^{\top} z_{\boldsymbol{\omega}_r}(\mathbf{y}). \end{align} %]]></script> <p>As a reminder, step $\ddagger$ is a trigonometric identity. 
See <a href="https://www.youtube.com/watch?v=0VBQnR2h8XM" target="_blank">this Khan Academy video</a> for a proof.</p>Gregory GundersenKernel machinesImplicit Lifting and the Kernel Trick2019-12-10T00:00:00-05:002019-12-10T00:00:00-05:00http://gregorygundersen.com/blog/2019/12/10/kernel-trick<h2 id="implicit-lifting">Implicit lifting</h2> <p>Imagine we have some data for a classification problem that is not linearly separable. A classic example is Figure $1$a. We would like to use a linear classifier. How might we do this? One idea is to augment our data’s features so that we can “lift” it into a higher dimensional space in which our data <em>are</em> linearly separable (Figure $1$b).</p> <div class="figure"> <img src="/image/kerneltrick/idea.png" alt="" style="width: 100%; display: block; margin: 0 auto;" /> <div class="caption"> <span class="caption-label">Figure 1:</span> The "lifting trick". (a) A binary classification problem that is not linearly separable in $\mathbb{R}^2$. (b) A lifting of the data into $\mathbb{R}^3$ using a polynomial kernel, $\varphi([x_1 \;\; x_2]) = [x_1^2 \;\; x_2^2 \;\; \sqrt{2} x_1 x_2]$. </div> </div> <p>Let’s formalize this approach. Let our data be $\{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)\}$ where $\mathbf{x}_n \in \mathbb{R}^D$ in general. 
Now consider $D = 2$ and a single data point</p> <script type="math/tex; mode=display">\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}.</script> <p>We might transform each data point with a function (the feature map of the <a href="https://en.wikipedia.org/wiki/Polynomial_kernel" target="_blank">polynomial kernel</a> for the curious),</p> <script type="math/tex; mode=display">\varphi(\mathbf{x}) = \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2} x_1 x_2 \end{bmatrix}.</script> <p>Since our new data, $\varphi(\mathbf{x})$, is in $\mathbb{R}^3$, we might be able to find a hyperplane $\boldsymbol{\beta}$ in 3D to separate our observations,</p> <script type="math/tex; mode=display">\beta_0 + \boldsymbol{\beta}^{\top} \varphi(\mathbf{x}) = \beta_0 + \beta_1 x_1^2 + \beta_2 x_2^2 + \beta_3 \sqrt{2} x_1 x_2 = 0.</script> <p>This idea, while cool, is not the kernel trick, but it deserves a name. Rather than calling it <em>the pre-(kernel trick) trick</em>, let’s just call it <em>the lifting trick</em>. Caveat: I am not aware of a name for this trick, but I find naming things useful. If you loudly call this “the lifting trick” at a machine-learning party, you might get confused looks.</p> <p>In order to find this hyperplane, we need to run a classification algorithm on our data <em>after</em> it has been lifted into three-dimensional space. At this point, we could be done. We take $\mathbb{R}^D$, perform our lifting trick into $\mathbb{R}^J$ where $D &lt; J$, and then use a method like logistic regression to try to linearly classify it. However, this might be expensive for a “fancy” enough $\varphi(\cdot)$. For $N$ data points lifted into $J$ dimensions, we need $NJ$ operations just to preprocess the data. But we can avoid computing $\varphi(\cdot)$ entirely while still doing linear classification in this lifted space if we’re clever. 
This second trick is the kernel trick.</p> <h2 id="the-kernel-trick">The kernel trick</h2> <p>Consider the loss function for a <a href="https://en.wikipedia.org/wiki/Support-vector_machine" target="_blank">support vector machine</a> (SVM):</p> <script type="math/tex; mode=display">L(\mathbf{w}, \boldsymbol{\alpha}) = \sum_n \alpha_n - \frac{1}{2} \sum_n^N \sum_m^N \alpha_n \alpha_m y_n y_m (\mathbf{x}_n^{\top} \mathbf{x}_m)</script> <p>$\mathbf{w}$ is a normed vector representing the linear decision boundary and $\boldsymbol{\alpha}$ is a vector of Lagrange multipliers. If this is new or confusing, please see <a href="https://www.cs.princeton.edu/courses/archive/spring14/cos511/scribe_notes/0325.pdf" target="_blank">these excellent lecture notes</a> from Rob Schapire’s Princeton course on theoretical machine learning. Otherwise, you can elide the details if you like; the upshot is that SVMs require computing a dot product and that, as formulated, the SVM is <em>linear</em>.</p> <p>Now what if we had the data problem in Figure $1$a? Could we use the lifting trick to make our SVM nonlinear? Sure. For the previously specified $\varphi(\cdot)$, we have</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \varphi(\mathbf{x}_n)^{\top} \varphi(\mathbf{x}_m) &= \begin{bmatrix} x_{n,1}^2 & x_{n,2}^2 & \sqrt{2} x_{n,1} x_{n,2} \end{bmatrix} \cdot \begin{bmatrix} x_{m,1}^2 \\ x_{m,2}^2 \\ \sqrt{2} x_{m,1} x_{m,2} \end{bmatrix} \\ &= x_{n,1}^2 x_{m,1}^2 + x_{n,2}^2 x_{m,2}^2 + 2 x_{n,1} x_{n,2} x_{m,1} x_{m,2}. \end{align} %]]></script> <p>We would then need to compute this for all our $N$ data points. As we discussed, the problem with this approach is scalability. 
<em>However</em>, consider the following derivation,</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} (\mathbf{x}_n^{\top} \mathbf{x}_m)^2 &= \Big( \begin{bmatrix} x_{n,1} & x_{n,2} \end{bmatrix} \cdot \begin{bmatrix} x_{m,1} \\ x_{m,2} \end{bmatrix} \Big)^2 \\ &= (x_{n,1} x_{m,1} + x_{n,2} x_{m,2})^2 \\ &= (x_{n,1} x_{m,1})^2 + (x_{n,2} x_{m,2})^2 + 2(x_{n,1} x_{m,1})(x_{n,2} x_{m,2}) \\ &= \varphi(\mathbf{x}_n)^{\top} \varphi(\mathbf{x}_m). \end{align} %]]></script> <p>What just happened? Rather than lifting our data into $\mathbb{R}^3$ and computing an inner product, we just computed an inner product in $\mathbb{R}^2$ and then squared the sum. While both derivations have a similar number of mathematical symbols, the actual number of operations is much smaller for the second approach. This is because an inner product in $\mathbb{R}^2$ is two multiplications and a sum. The square is just the square of a scalar, so 4 operations. The first approach squared three components of two vectors (6 operations), then performed an inner product (3 multiplications, 1 sum) for 10 operations.</p> <p><em>This</em> is the kernel trick: we can avoid expensive operations in high dimensions by finding an appropriate <em>kernel function</em> $k(\mathbf{x}_n,\mathbf{x}_m)$ that is equivalent to the inner product in higher dimensional space. In our example above, $k(\mathbf{x}_n, \mathbf{x}_m) = (\mathbf{x}_n^{\top} \mathbf{x}_m)^2$. In other words, the kernel trick performs the lifting trick for cheap.</p> <h2 id="mercers-theorem">Mercer’s theorem</h2> <p>The mathematical basis for the kernel trick was discovered by <a href="https://en.wikipedia.org/wiki/James_Mercer_(mathematician)" target="_blank">James Mercer</a>. 
Mercer proved that any <a href="https://en.wikipedia.org/wiki/Positive-definite_kernel" target="_blank">positive definite function</a> $k(\mathbf{x}_n, \mathbf{x}_m)$ with $\mathbf{x}_n, \mathbf{x}_m \in \mathbb{R}^D$ defines an inner product in another vector space $\mathcal{V}$. Thus, if you have a function $\varphi(\cdot)$ such that $\langle \varphi(\mathbf{x}_n), \varphi(\mathbf{x}_m) \rangle_{\mathcal{V}}$ is a valid inner product in $\mathcal{V}$, you know a kernel function exists that can perform the lifting trick for cheap. Alternatively, if you have a positive definite kernel, you can deconstruct its implicit <em>basis function</em> $\varphi(\cdot)$.</p> <p>This idea is formalized in Mercer’s Theorem (taken from <a href="https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf" target="_blank">Michael Jordan’s lecture notes</a>):</p> <blockquote> <p><strong>Mercer’s Theorem:</strong> A symmetric function $k(\mathbf{x}, \mathbf{y})$ can be expressed as an inner product</p> <script type="math/tex; mode=display">k(\mathbf{x}, \mathbf{y}) = \langle \varphi(\mathbf{x}), \varphi(\mathbf{y}) \rangle</script> <p>for some $\varphi(\cdot)$ if and only if $k(\mathbf{x}, \mathbf{y})$ is positive semidefinite, i.e.</p> <script type="math/tex; mode=display">\int k(\mathbf{x}, \mathbf{y}) g(\mathbf{x}) g(\mathbf{y}) \text{d}\mathbf{x}\text{d}\mathbf{y} \geq 0, \qquad \forall g</script> <p>or, equivalently, if</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{bmatrix} k(\mathbf{x}_1, \mathbf{x}_1) & k(\mathbf{x}_1, \mathbf{x}_2) & \dots & k(\mathbf{x}_1, \mathbf{x}_N) \\ k(\mathbf{x}_2, \mathbf{x}_1) & k(\mathbf{x}_2, \mathbf{x}_2) & \dots & k(\mathbf{x}_2, \mathbf{x}_N) \\ \vdots & \vdots & \ddots & \vdots \\ k(\mathbf{x}_N, \mathbf{x}_1) & k(\mathbf{x}_N, \mathbf{x}_2) & \dots & k(\mathbf{x}_N, \mathbf{x}_N) \end{bmatrix} %]]></script> <p>is positive semidefinite for any collection $\{\mathbf{x}_1, \dots, \mathbf{x}_N\}$.</p> 
</blockquote> <p>The theorem is an if-and-only-if statement, meaning we could explicitly construct a kernel function $k(\cdot, \cdot)$ for a given $\varphi(\cdot)$ or we could take a kernel function and use it without having an explicit representation of $\varphi(\cdot)$.</p> <p>If we assume everything is real-valued, then we can demonstrate this fact easily. Let $\mathbf{K}$ be the positive semidefinite Gram matrix above. Since it is real and symmetric, it has an eigendecomposition of the form</p> <script type="math/tex; mode=display">\mathbf{K} = \mathbf{U}^{\top} \boldsymbol{\Lambda} \mathbf{U}</script> <p>where $\boldsymbol{\Lambda} = \text{diag}(\lambda_1, \dots, \lambda_N)$. Since $\mathbf{K}$ is positive semidefinite, $\lambda_n \geq 0$ and the square root is real-valued. We can write an element of $\mathbf{K}$ as</p> <script type="math/tex; mode=display">% <![CDATA[ \mathbf{K}_{ij} = \big( \boldsymbol{\Lambda}^{1/2} \mathbf{U}_{:, i} \big)^{\top} \big( \boldsymbol{\Lambda}^{1/2} \mathbf{U}_{:, j} \big). %]]></script> <p>Define $\varphi(\mathbf{x}_i) = \boldsymbol{\Lambda}^{1/2} \mathbf{U}_{:, i}$. Therefore, if our kernel function is positive semidefinite—if it defines a Gram matrix that is positive semidefinite—then there exists a function $\varphi: \mathcal{X} \mapsto \mathcal{V}$ such that</p> <script type="math/tex; mode=display">k(\mathbf{x}, \mathbf{y}) = \varphi(\mathbf{x})^{\top} \varphi(\mathbf{y})</script> <p>where $\mathcal{X}$ is the space of samples.</p> <h2 id="infinite-dimensional-feature-space">Infinite-dimensional feature space</h2> <p>An interesting consequence of the kernel trick is that kernel methods, equipped with the appropriate kernel function, can be viewed as operating in infinite-dimensional feature space. 
As an example, consider the radial basis function (RBF) kernel,</p> <script type="math/tex; mode=display">k_{\texttt{RBF}}(\mathbf{x}, \mathbf{y}) = \exp\Big(-\gamma\lVert\mathbf{x}-\mathbf{y}\rVert^2\Big).</script> <p>Let’s take it for granted that this is a valid positive semidefinite kernel. Let $k_{\texttt{poly(r)}}$ denote a polynomial kernel of degree $r$, and let $\gamma = 1/2$. Then</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} k_{\texttt{RBF}}(\mathbf{x}, \mathbf{y}) &= \exp\Big(-\frac{1}{2} \lVert\mathbf{x}-\mathbf{y}\rVert^2\Big) \\ &= \exp\Big(-\frac{1}{2} \langle \mathbf{x}-\mathbf{y}, \mathbf{x}-\mathbf{y} \rangle \Big) \\ &\stackrel{\star}{=} \exp\Big(-\frac{1}{2} \big[ \langle \mathbf{x}, \mathbf{x}-\mathbf{y} \rangle - \langle \mathbf{y}, \mathbf{x}-\mathbf{y} \rangle \big] \Big) \\ &\stackrel{\star}{=} \exp\Big(-\frac{1}{2} \big[ \langle \mathbf{x}, \mathbf{x} \rangle - \langle \mathbf{x}, \mathbf{y} \rangle - \langle \mathbf{y}, \mathbf{x} \rangle + \langle \mathbf{y}, \mathbf{y} \rangle \big] \Big) \\ &= \exp\Big(-\frac{1}{2} \big[ \langle \mathbf{x}, \mathbf{x} \rangle + \langle \mathbf{y}, \mathbf{y} \rangle - 2 \langle \mathbf{x}, \mathbf{y} \rangle \big] \Big) \\ &= \exp\Big(-\frac{1}{2} \lVert \mathbf{x} \rVert^2 \Big) \exp\Big(-\frac{1}{2} \lVert \mathbf{y} \rVert^2 \Big) \exp\big( \langle \mathbf{x}, \mathbf{y} \rangle \big). \end{align} %]]></script> <p>Above, the two steps labeled $\star$ leverage the fact that</p> <script type="math/tex; mode=display">\langle \mathbf{u} + \mathbf{v}, \mathbf{w} \rangle = \langle \mathbf{u}, \mathbf{w} \rangle + \langle \mathbf{v}, \mathbf{w} \rangle</script> <p>in general for inner products (see <a href="http://mathworld.wolfram.com/InnerProduct.html" target="_blank">here</a>). 
Now define</p> <script type="math/tex; mode=display">C \equiv \exp\Big(-\frac{1}{2} \lVert \mathbf{x} \rVert^2 \Big) \exp\Big(-\frac{1}{2} \lVert \mathbf{y} \rVert^2 \Big),</script> <p>and note that the Taylor expansion of $e^{f(x)}$ is</p> <script type="math/tex; mode=display">e^{f(x)} = \sum_{r=0}^{\infty} \frac{[f(x)]^r}{r!}.</script> <p>We can write the RBF kernel as</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} k_{\texttt{RBF}}(\mathbf{x}, \mathbf{y}) &= C \exp\big( \langle \mathbf{x}, \mathbf{y} \rangle \big) \\ &= C \sum_{r=0}^{\infty} \frac{ \langle \mathbf{x}, \mathbf{y} \rangle^r}{r!} \\ &= C \sum_{r=0}^{\infty} \frac{k_{\texttt{poly(r)}}(\mathbf{x}, \mathbf{y})}{r!}. \end{align} %]]></script> <p>So the RBF kernel can be viewed as an infinite sum over polynomial kernels. As $r$ increases, each polynomial kernel lifts the data into higher dimensions, and the RBF kernel is an infinite sum over these kernels. (NB: a nonnegatively weighted sum of valid kernels is itself a valid kernel.) <a href="http://pages.cs.wisc.edu/~matthewb/pages/notes/pdf/svms/RBFKernel.pdf" target="_blank">Matthew Bernstein</a> has a nice derivation more explicitly showing that $\varphi_{\texttt{RBF}}: \mathbb{R}^D \mapsto \mathbb{R}^{\infty}$, but I think the above logic captures the main point.</p> <h2 id="why-the-distinction">Why the distinction?</h2> <p>Why did I stress the distinction between lifting and the kernel trick? Good research is about having a line of attack on a problem. A layperson might suggest good problems to solve, but researchers find good solvable problems. This is the difference between saying, “We should cure cancer,” and the work done by an oncology researcher.</p> <p>For similar reasons, I think it’s important to disentangle the lifting trick from the kernel trick. Without the mathematics of Mercer and others, we might have discovered the lifting trick but found it entirely useless in practice. 
With high probability, such currently useless solutions exist in the research wild today. It is the <em>mathematical relationship</em> between kernel functions and lifting that is the eureka moment for the kernel trick.</p>Gregory GundersenImplicit liftingAsymptotic Normality of Maximum Likelihood Estimators2019-11-28T00:00:00-05:002019-11-28T00:00:00-05:00http://gregorygundersen.com/blog/2019/11/28/asymptotic-normality-mle<p>Given a statistical model $\mathbb{P}_{\theta}$ and a random variable $X \sim \mathbb{P}_{\theta_0}$ where $\theta_0$ are the true generative parameters, maximum likelihood estimation (MLE) finds a point estimate $\hat{\theta}_n$ such that the resulting distribution “most likely” generated the data. MLE is popular for a number of theoretical reasons, one such reason being that MLE is <em>asymptotically efficient</em>: in the limit, a maximum likelihood estimator achieves the minimum possible variance, the Cramér–Rao lower bound. Recall that point estimators, as functions of $X$, are themselves random variables. Therefore, a low-variance estimator estimates $\theta_0$ more precisely.</p> <p>To state our claim more formally, let $X = \langle X_1, \dots, X_n \rangle$ be a finite sample with each $X_i \sim \mathbb{P}_{\theta_0}$, where $\theta_0 \in \Theta$ is the true but unknown parameter. Let $\rightarrow^p$ denote <em>converges in probability</em> and $\rightarrow^d$ denote <em>converges in distribution</em>. Our claim of asymptotic normality is the following:</p> <blockquote> <p><strong>Asymptotic normality:</strong> Assume $\hat{\theta}_n \rightarrow^p \theta_0$ with $\theta_0 \in \Theta$ and that other regularity conditions hold. 
Then</p> <script type="math/tex; mode=display">\sqrt{n}(\hat{\theta}_n - \theta_0) \rightarrow^d \mathcal{N}(0, \mathcal{I}(\theta_0)^{-1})</script> <p>where $\mathcal{I}(\theta_0)$ is the Fisher information.</p> </blockquote> <p>By “other regularity conditions”, I simply mean that I do not want to make a detailed accounting of every assumption for this post. Obviously, one should consult a standard textbook for a more rigorous treatment.</p> <p>If asymptotic normality holds, then asymptotic efficiency falls out because it immediately implies</p> <script type="math/tex; mode=display">\hat{\theta}_n \rightarrow^d \mathcal{N}(\theta_0, \mathcal{I}_n(\theta_0)^{-1}).</script> <p>I use the notation $\mathcal{I}_n(\theta)$ for the Fisher information for $X$ and $\mathcal{I}(\theta)$ for the Fisher information for a single $X_i$. Therefore, $\mathcal{I}_n(\theta) = n \mathcal{I}(\theta)$ provided the data are i.i.d. See my <a href="/blog/2019/11/21/fisher-information/#reformulation-for-iid-settings">previous post</a> on properties of the Fisher information for details.</p> <p>The goal of this post is to discuss the asymptotic normality of maximum likelihood estimators. 
This post relies on understanding the <a href="/blog/2019/11/21/fisher-information/">Fisher information</a> and the <a href="/blog/2019/11/26/proof-crlb/">Cramér–Rao lower bound</a>.</p> <h2 id="proof-of-asymptotic-normality">Proof of asymptotic normality</h2> <p>To prove asymptotic normality of MLEs, define the <em>normalized</em> log-likelihood function and its first and second derivatives with respect to $\theta$ as</p> <script type="math/tex; mode=display">L_n(\theta) = \frac{1}{n} \log f_X(x; \theta) \qquad L^{\prime}_n(\theta) = \frac{\partial}{\partial \theta} \Big( \frac{1}{n} \log f_X(x; \theta) \Big) \qquad L^{\prime\prime}_n(\theta) = \frac{\partial^2}{\partial \theta^2} \Big( \frac{1}{n} \log f_X(x; \theta) \Big).</script> <p>By definition, the MLE is a maximum of the log likelihood function and therefore,</p> <script type="math/tex; mode=display">\hat{\theta}_n = \arg\!\max_{\theta \in \Theta} \log f_X(x; \theta) \quad \implies \quad L^{\prime}_n(\hat{\theta}_n) = 0.</script> <p>Now let’s apply the mean value theorem,</p> <blockquote> <p><strong>Mean value theorem:</strong> Let $f$ be a continuous function on the closed interval $[a, b]$ and differentiable on the open interval $(a, b)$. Then there exists a point $c \in (a, b)$ such that</p> <script type="math/tex; mode=display">f^{\prime}(c) = \frac{f(a) - f(b)}{a - b}</script> </blockquote> <p>where $f = L_n^{\prime}$, $a = \hat{\theta}_n$ and $b = \theta_0$. Then for some point $\tilde{\theta}$ between $\hat{\theta}_n$ and $\theta_0$, we have</p> <script type="math/tex; mode=display">L^{\prime}_{n}(\hat{\theta}_n) = L_n^{\prime}(\theta_0) + L_n^{\prime\prime}(\tilde{\theta})(\hat{\theta}_n - \theta_0).</script> <p>Above, we have just rearranged terms. (Note that other proofs might apply the more general Taylor’s theorem and show that the higher-order terms are bounded in probability.) 
Now by definition $L^{\prime}_{n}(\hat{\theta}_n) = 0$, and we can write</p> <script type="math/tex; mode=display">\hat{\theta}_n - \theta_0 = - \frac{L_n^{\prime}(\theta_0)}{L_n^{\prime\prime}(\tilde{\theta})} \quad \implies \quad \sqrt{n}(\hat{\theta}_n - \theta_0) = - \frac{\sqrt{n} L_n^{\prime}(\theta_0)}{L_n^{\prime\prime}(\tilde{\theta})}</script> <p>Let’s tackle the numerator and denominator separately. The upshot is that we can show the numerator converges <em>in distribution</em> to a normal distribution using the Central Limit Theorem, and that the denominator converges <em>in probability</em> to a constant value using the Weak Law of Large Numbers. Then we can invoke <a href="https://en.wikipedia.org/wiki/Slutsky%27s_theorem" target="_blank">Slutsky’s theorem</a>.</p> <p>For the numerator, by the linearity of differentiation and the log of products we have</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \sqrt{n} L^{\prime}_n(\theta_0) &= \sqrt{n} \Big( \frac{1}{n} \Big[ \frac{\partial}{\partial \theta} \log f_X(X; \theta_0) \Big] \Big) \\ &= \sqrt{n} \Big( \frac{1}{n} \Big[ \frac{\partial}{\partial \theta} \log \prod_{i=1}^n f_X(X_i; \theta_0) \Big] \Big) \\ &= \sqrt{n} \Big( \frac{1}{n} \sum_{i=1}^{n} \Big[ \frac{\partial}{\partial \theta} \log f_X(X_i; \theta_0) \Big] \Big) \\ &= \sqrt{n} \Big( \frac{1}{n} \sum_{i=1}^{n} \Big[ \frac{\partial}{\partial \theta} \log f_X(X_i; \theta_0) \Big] - \mathbb{E}\Big[\frac{\partial}{\partial \theta} \log f_X(X_1; \theta_0)\Big] \Big) \tag{1}. \end{align} %]]></script> <p>In the last line, we use the fact that the expected value of the score is zero. Without loss of generality, we take $X_1$,</p> <script type="math/tex; mode=display">\mathbb{E}\big[\frac{\partial}{\partial \theta} \log f_X(X_1; \theta_0)\big] = 0.</script> <p>See <a href="/blog/2019/11/21/fisher-information/#expected-score-is-zero">my previous post</a> on properties of the Fisher information for a proof. 
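</p> <p>The zero-mean property of the score used above is easy to check by simulation. The following is a minimal sketch, assuming a Bernoulli model with a hypothetical true parameter <code>p0 = 0.4</code>; the model, the seed, and the variable names are illustrative choices and not part of the proof.</p>

```python
import numpy as np

# Assumption for illustration: X ~ Bernoulli(p0), whose per-observation
# score is d/dp log f(X; p) = X / p - (1 - X) / (1 - p).
rng = np.random.default_rng(0)
p0 = 0.4
n_draws = 200_000

x = rng.binomial(1, p0, size=n_draws)
score = x / p0 - (1 - x) / (1 - p0)

score_mean = score.mean()                 # Monte Carlo estimate of E[score]
score_var = score.var()                   # Monte Carlo estimate of V[score]
fisher_single = 1.0 / (p0 * (1.0 - p0))   # closed form for one Bernoulli draw

print(score_mean, score_var, fisher_single)
```

<p>Under these assumptions, the sample mean of the score should be close to zero, and its sample variance should be close to $1 / (p_0 (1 - p_0))$.</p> <p>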
Equation $1$ allows us to invoke the Central Limit Theorem to say that</p> <script type="math/tex; mode=display">\sqrt{n} L^{\prime}_n(\theta_0) \rightarrow^d \mathcal{N}\Big(0, \mathbb{V}\Big[\frac{\partial}{\partial \theta} \log f_X(X_1; \theta_0)\Big]\Big).</script> <p>This variance is just the Fisher information for a single observation,</p> <script type="math/tex; mode=display">\mathbb{V}\Big[\frac{\partial}{\partial \theta} \log f_X(X_1; \theta_0)\Big] = \mathbb{E}\Big[\Big(\frac{\partial}{\partial \theta} \log f_X(X_1; \theta_0)\Big)^2\Big] - \Big(\underbrace{\mathbb{E}\Big[\frac{\partial}{\partial \theta} \log f_X(X_1; \theta_0)\Big]}_{=\,0}\Big)^2 = \mathcal{I}(\theta_0).</script> <p>For the denominator, we first invoke the Weak Law of Large Numbers (WLLN) for any $\theta$,</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} L_n^{\prime\prime}(\theta) &= \frac{1}{n} \Big( \frac{\partial^2}{\partial \theta^2} \log f_X(X; \theta) \Big) \\ &= \frac{1}{n} \Big( \frac{\partial^2}{\partial \theta^2} \log \prod_{i=1}^n f_X(X_i; \theta) \Big) \\ &= \frac{1}{n} \sum_{i=1}^n \Big( \frac{\partial^2}{\partial \theta^2} \log f_X(X_i; \theta) \Big) \\ &\rightarrow^p \mathbb{E}\Big[ \frac{\partial^2}{\partial \theta^2} \log f_X(X_1; \theta) \Big]. \end{align} %]]></script> <p>In the last step, we invoke the WLLN without loss of generality on $X_1$. Now note that $\tilde{\theta}$ lies between $\hat{\theta}_n$ and $\theta_0$ by construction, and we assume that $\hat{\theta}_n \rightarrow^p \theta_0$. 
Taken together, we have</p> <script type="math/tex; mode=display">L_n^{\prime\prime}(\tilde{\theta}) \rightarrow^p \mathbb{E}\Big[ \frac{\partial^2}{\partial \theta^2} \log f_X(X_1; \theta_0) \Big] = - \mathcal{I}(\theta_0).</script> <p>If you’re unconvinced that the expected value of the derivative of the score is equal to the negative of the Fisher information, once again see <a href="/blog/2019/11/21/fisher-information/#alternative-definition">my previous post</a> on properties of the Fisher information for a proof.</p> <p>To summarize, we have shown that</p> <script type="math/tex; mode=display">\sqrt{n} L^{\prime}_n(\theta_0) \rightarrow^d \mathcal{N}(0, \mathcal{I}(\theta_0))</script> <p>and</p> <script type="math/tex; mode=display">L^{\prime\prime}_n(\tilde{\theta}) \rightarrow^p - \mathcal{I}(\theta_0).</script> <p>We invoke Slutsky’s theorem, and we’re done:</p> <script type="math/tex; mode=display">\sqrt{n}(\hat{\theta}_n - \theta_0) \rightarrow^d \mathcal{N}\Big(0, \frac{1}{\mathcal{I}(\theta_0)} \Big).</script> <p>As discussed in the introduction, asymptotic normality immediately implies</p> <script type="math/tex; mode=display">\hat{\theta}_n \rightarrow^d \mathcal{N}(\theta_0, \mathcal{I}_n(\theta_0)^{-1}).</script> <p>As our finite sample size $n$ increases, the MLE becomes more concentrated, i.e. its variance becomes smaller and smaller. In the limit, MLE achieves the lowest possible variance, the Cramér–Rao lower bound.</p> <h2 id="example-with-bernoulli-distribution">Example with Bernoulli distribution</h2> <p>Let’s look at a complete example. Let $X_1, \dots, X_n$ be i.i.d. samples from a Bernoulli distribution with true parameter $p$. The log likelihood is</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \log f_X(X; p) &= \sum_{i=1}^n \log \big[p^{X_i} (1-p)^{1 - X_i} \big] \\ &= \sum_{i=1}^n \big[ X_i \log p + (1 - X_i) \log (1 - p) \big]. \end{align} %]]></script> <p>This works because $X_i$ only has support $\{0, 1\}$. 
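</p> <p>As a quick numerical illustration of this log likelihood, here is a minimal sketch that evaluates it on a grid of candidate values of $p$ and compares the grid maximizer with the sample mean. The helper name <code>bernoulli_log_lik</code>, the seed, and the grid resolution are illustrative assumptions.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
p0 = 0.4                                  # hypothetical true parameter
X = rng.binomial(1, p0, size=1000)

def bernoulli_log_lik(p, X):
    """Sum over i of X_i log p + (1 - X_i) log(1 - p)."""
    return np.sum(X * np.log(p) + (1 - X) * np.log(1 - p))

# The product form and the sum form agree because each X_i is 0 or 1.
p_test = 0.3
product_form = np.sum(np.log(p_test ** X * (1 - p_test) ** (1 - X)))
sum_form = bernoulli_log_lik(p_test, X)

# Evaluate on a grid that avoids p = 0 and p = 1, where the log diverges.
grid = np.linspace(0.001, 0.999, 999)
log_liks = np.array([bernoulli_log_lik(p, X) for p in grid])
p_grid = grid[np.argmax(log_liks)]

print(product_form, sum_form)
print(p_grid, X.mean())
```

<p>With a grid spacing of $0.001$, the grid maximizer should agree with the sample mean up to the grid resolution.</p> <p>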
If we compute the derivative of this log likelihood, set it equal to zero, and solve for $p$, we’ll have $\hat{p}_n$, the MLE:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \frac{\partial}{\partial p} \log f_X(X; p) &= \sum_{i=1}^n \Big[ \frac{\partial}{\partial p} X_i \log p + \frac{\partial}{\partial p} (1 - X_i)\log (1 - p) \Big] \\ &= \sum_{i=1}^n \Big[ \frac{X_i}{p} - \frac{1 - X_i}{1 - p} \Big] \\ &\Downarrow \\ 0 &= \sum_{i=1}^n \Big[ \frac{X_i}{p} + \frac{X_i}{1 - p} \Big] - \frac{n}{1 - p} \\ \frac{n}{1 - p} &= \sum_{i=1}^n X_i \Big[ \frac{1}{p} + \frac{1}{1 - p} \Big] \\ \frac{p(1 - p)}{1 - p} &= \frac{1}{n} \sum_{i=1}^n X_i \\ &\Downarrow \\ \hat{p}_n &= \frac{1}{n} \sum_{i=1}^n X_i. \end{align} %]]></script> <p>The second derivative is</p> <script type="math/tex; mode=display">\frac{\partial}{\partial p} \sum_{i=1}^n \Big[ \frac{X_i}{p} + \frac{X_i - 1}{1 - p} \Big] = \sum_{i=1}^n \Big[ - \frac{X_i}{p^2} + \frac{X_i - 1}{(1 - p)^2} \Big].</script> <p>The Fisher information is the negative expected value of this second derivative or</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \mathcal{I}_n(p) &= -\mathbb{E}\Big[ \sum_{i=1}^n \Big[ - \frac{X_i}{p^2} + \frac{X_i - 1}{(1 - p)^2} \Big] \Big] \\ &= \sum_{i=1}^n \Big[ \frac{\mathbb{E}[X_i]}{p^2} - \frac{\mathbb{E}[X_i] - 1}{(1 - p)^2} \Big] \\ &= \sum_{i=1}^n \Big[ \frac{1}{p} + \frac{1}{1 - p} \Big] \\ &= \frac{n}{p(1-p)}. 
\end{align} %]]></script> <p>Thus, by the asymptotic normality of the MLE of the Bernoulli distribution (to be completely rigorous, we should show that the Bernoulli distribution meets the required regularity conditions), we know that</p> <script type="math/tex; mode=display">\hat{p}_{n} \rightarrow^d \mathcal{N}\Big(p, \frac{p(1-p)}{n}\Big).</script> <p>We can empirically test this by drawing the probability density function of the above normal distribution, as well as a histogram of $\hat{p}_n$ for many iterations (Figure $1$).</p> <div class="figure"> <img src="/image/asymnorm/bernoulli.png" alt="" style="width: 100%; display: block; margin: 0 auto;" /> <div class="caption"> <span class="caption-label">Figure 1.</span> The probability density function of $\mathcal{N}(p, p(1-p)/n)$ (red), as well as a histogram of $\hat{p}_{n}$ (gray) over many experimental iterations. The true value of $p$ is $0.4$. </div> </div> <p>Here is the minimal code required to generate the above figure:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">scipy.stats</span> <span class="kn">import</span> <span class="n">norm</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>

<span class="n">n</span> <span class="o">=</span> <span class="mi">1000</span>  <span class="c1"># sample size per experiment</span>
<span class="n">p0</span> <span class="o">=</span> <span class="mf">0.4</span>  <span class="c1"># true Bernoulli parameter</span>

<span class="c1"># Density of the approximating normal N(p0, p0 * (1 - p0) / n).</span>
<span class="n">xx</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mf">0.001</span><span class="p">)</span>
<span class="n">yy</span> <span class="o">=</span> <span class="n">norm</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">xx</span><span class="p">,</span> <span class="n">p0</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">((</span><span class="n">p0</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">p0</span><span class="p">))</span> <span class="o">/</span> <span class="n">n</span><span class="p">))</span>

<span class="c1"># The MLE of p is the sample mean; recompute it over many experiments.</span>
<span class="n">mles</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10000</span><span class="p">):</span>
    <span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">binomial</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">p0</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">n</span><span class="p">)</span>
    <span class="n">mles</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">X</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>

<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">xx</span><span class="p">,</span> <span class="n">yy</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">mles</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">density</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div> <p>   </p> <h3 id="acknowledgements">Acknowledgements</h3> <p>I relied on a few different excellent resources to write this post:</p> <ul> <li>My in-class lecture notes for Matias Cattaneo’s <a href="https://registrar.princeton.edu/course-offerings/course-details?term=1202&amp;courseid=009316" target="_blank">ORF 524, Statistical Theory and Methods</a>.</li> <li><a href="https://ocw.mit.edu/courses/mathematics/18-443-statistics-for-applications-fall-2006/lecture-notes/lecture3.pdf" target="_blank">These lecture notes</a> for MIT 18.650, Statistics for Applications.</li> <li><a href="https://stephens999.github.io/fiveMinuteStats/asymptotic_normality_mle.html" target="_blank">This post</a> by Matthew Stephens et al. for a complete example.</li> </ul>Gregory GundersenGiven a statistical model $\mathbb{P}_{\theta}$ and a random variable $X \sim \mathbb{P}_{\theta_0}$ where $\theta_0$ are the true generative parameters, maximum likelihood estimation (MLE) finds a point estimate $\hat{\theta}_n$ such that the resulting distribution “most likely” generated the data. MLE is popular for a number of theoretical reasons, one such reason being that MLE is asymptotically efficient: in the limit, a maximum likelihood estimator achieves the minimum possible variance, the Cramér–Rao lower bound. Recall that point estimators, as functions of $X$, are themselves random variables. Therefore, a low-variance estimator estimates $\theta_0$ more precisely.Proof of the Cramér–Rao Lower Bound2019-11-27T00:00:00-05:002019-11-27T00:00:00-05:00http://gregorygundersen.com/blog/2019/11/27/proof-crlb<p>Given a statistical model $X \sim \mathbb{P}_{\theta}$ with a fixed true parameter $\theta$, the Cramér–Rao lower bound (CRLB) provides a lower bound on the variance of an estimator $T(X)$. 
The CRLB is useful because an unbiased estimator that achieves the CRLB must be a <a href="https://en.wikipedia.org/wiki/Minimum-variance_unbiased_estimator">uniformly minimum–variance unbiased estimator</a>: it is unbiased by construction and has minimum variance by the CRLB. A precise statement of the scalar case of the CRLB is</p> <blockquote> <p><strong>CRLB:</strong> Let $X = (X_1, \dots, X_n) \in \mathbb{R}^n$ be a random vector with joint density $f(X; \theta)$ where $\theta \in \Theta \subseteq \mathbb{R}$. Let $T(X)$ be an estimator of $\theta$, possibly biased. Assume the Fisher information is always defined and that the operations of integration with respect to $X$ and differentiation with respect to $\theta$ can be interchanged. Then</p> <script type="math/tex; mode=display">\mathbb{V}[T(X)] \geq \frac{\big( \frac{d}{d \theta} \mathbb{E}[T(X)] \big)^2}{\mathbb{E} \big[\big(\frac{d}{d\theta} \log f(X; \theta) \big)^2\big]} \equiv \text{CRLB}(\theta).</script> </blockquote> <p>The denominator of the CRLB is the <a href="/blog/2019/11/21/fisher-information/">Fisher information</a>. If the estimator is unbiased, then the numerator is one since $\mathbb{E}[T(X)] = \theta$.</p> <p>To prove the CRLB, let $W$ and $Y$ be two random variables. In general, $\mathbb{E}[W] \neq 0$ and $\mathbb{E}[Y] \neq 0$. However, assume $\mathbb{E}[Y] = 0$. A property of covariance is</p> <script type="math/tex; mode=display">(\text{Cov}[W, Y])^2 \leq \mathbb{V}[W] \mathbb{V}[Y]. \tag{1}</script> <p>This can be derived by applying <a href="https://en.wikipedia.org/wiki/Cauchy%E2%80%93Schwarz_inequality#Probability_theory">Cauchy–Schwarz</a> to random variables (see <a href="/blog/2019/11/26/proof-crlb/#1-covariance-inequality">the Appendix</a>). 
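</p> <p>The covariance inequality in Equation $1$ can also be checked numerically. The following is a minimal sketch on a simulated pair; the particular construction of <code>W</code> and <code>Y</code> is a hypothetical choice made only so that $\mathbb{E}[Y] = 0$ and the two variables are correlated.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical correlated pair: Y has mean zero, W has a nonzero mean.
Y = rng.normal(0.0, 1.0, size=n)
W = 2.0 + 0.5 * Y + rng.normal(0.0, 1.0, size=n)

# Sample covariance and variances, all with the same normalization (ddof=1).
cov_wy = np.cov(W, Y)[0, 1]
lhs = cov_wy ** 2
rhs = W.var(ddof=1) * Y.var(ddof=1)

print(lhs, rhs)
```

<p>Because Cauchy–Schwarz also holds for the sample versions of these quantities, the first printed value should never exceed the second, whatever the seed.</p> <p>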
Now set $W$ and $Y$ to</p> <script type="math/tex; mode=display">W = T(X), \qquad Y = \frac{\partial}{\partial \theta} \log f(X; \theta).</script> <p>We know that the <a href="/blog/2019/11/21/fisher-information/#expected-score-is-zero">expectation of the score (derivative of the log likelihood) is zero</a>. Therefore, $\mathbb{E}[Y] = 0$ as desired. Then Equation $1$ can be rewritten as</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \mathbb{V}[T(X)] \mathbb{V}\big[\frac{\partial}{\partial \theta} \log f(X; \theta)\big] &\geq \big(\text{Cov}\big[T(X), \frac{\partial}{\partial \theta} \log f(X; \theta)\big]\big)^2 \\ \mathbb{V}[T(X)] &\geq \frac{\big(\text{Cov}\big[T(X), \frac{\partial}{\partial \theta} \log f(X; \theta)\big]\big)^2}{ \mathbb{V}\big[\frac{\partial}{\partial \theta} \log f(X; \theta)\big]} \tag{2} \end{align} %]]></script> <p>The numerator is our desired quantity. Ignoring the square, we have</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \text{Cov}\big[T(X), \frac{\partial}{\partial \theta} \log f(X; \theta)\big] &= \mathbb{E}\Big[T(X)\frac{\partial}{\partial \theta} \log f(X; \theta) \Big] - \overbrace{\mathbb{E}[T(X)] \mathbb{E}\Big[\frac{\partial}{\partial \theta} \log f(X; \theta) \Big]}^{=\,0} \\ &= \int T(X) \frac{\partial}{\partial \theta} \log f(X; \theta) f(X; \theta) \text{d}\mu(X) \\ &\stackrel{\star}{=} \int T(X) \frac{\partial}{\partial \theta} f(X; \theta) \text{d}\mu(X) \\ &\stackrel{\dagger}{=} \frac{\partial}{\partial \theta} \int T(X) f(X; \theta) \text{d}\mu(X) \\ &= \frac{\partial}{\partial \theta} \mathbb{E}[T(X)]. \end{align} %]]></script> <p>In step $\star$, we use the fact that if $g(x) = \log h(x)$, $g^{\prime}(x) = h^{\prime}(x) / h(x)$. 
In step $\dagger$, we use our assumption that we can interchange integration and differentiation.</p> <p>The denominator is the desired quantity because the expected score is zero:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \mathbb{V}\big[\frac{\partial}{\partial \theta} \log f(X; \theta)\big] &= \mathbb{E}\big[\big(\frac{\partial}{\partial \theta} \log f(X; \theta)\big)^2\big] - \mathbb{E}\big[\frac{\partial}{\partial \theta} \log f(X; \theta)\big]^2 \\ &= \mathbb{E}\big[\big(\frac{\partial}{\partial \theta} \log f(X; \theta)\big)^2\big]. \end{align} %]]></script> <p>Putting these results together in Equation $2$, we have</p> <script type="math/tex; mode=display">\mathbb{V}[T(X)] \geq \frac{\big( \frac{\partial}{\partial \theta} \mathbb{E}[T(X)] \big)^2}{\mathbb{E}\big[\big(\frac{\partial}{\partial \theta} \log f(X; \theta)\big)^2\big]}</script> <p>as desired.</p> <p>   </p> <h2 id="appendix">Appendix</h2> <h3 id="1-covariance-inequality">1. Covariance inequality</h3> <p>The Cauchy–Schwarz inequality for vectors $\mathbf{u}$ and $\mathbf{v}$ with an inner product $\langle \mathbf{u}, \mathbf{v} \rangle$ is</p> <script type="math/tex; mode=display">| \langle \mathbf{u}, \mathbf{v} \rangle |^2 \leq \langle \mathbf{u}, \mathbf{u} \rangle \cdot \langle \mathbf{v}, \mathbf{v} \rangle.</script> <p>Now note that for real random variables $W$ and $Y$, the <a href="https://en.wikipedia.org/wiki/Inner_product_space#Random_variables">expected value of their product is itself an inner product</a>:</p> <script type="math/tex; mode=display">\langle W, Y \rangle = \mathbb{E}[WY].</script> <p>Now let $\mathbb{E}[W] = \omega$ and $\mathbb{E}[Y] = \gamma$. 
Then if we apply the definition of covariance and Cauchy–Schwarz to this inner product, we have</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} |\text{Cov}[W, Y]|^2 &\triangleq|\mathbb{E}[(W - \omega)(Y - \gamma)] |^2 \\ &= |\langle W - \omega, Y - \gamma \rangle |^2 \\ &\leq \langle W - \omega, W - \omega \rangle \cdot \langle Y - \gamma, Y - \gamma \rangle \\ &= \mathbb{E}[(W - \omega)^2] \mathbb{E}[(Y - \gamma)^2] \\ &= \mathbb{V}[W] \mathbb{V}[Y] \end{align} %]]></script> <p>as desired.</p>Gregory GundersenGiven a statistical model $X \sim \mathbb{P}_{\theta}$ with a fixed true parameter $\theta$, the Cramér–Rao lower bound (CRLB) provides a lower bound on the variance of an estimator $T(X)$. The CRLB is useful because an unbiased estimator that achieves the CRLB must be a uniformly minimum–variance unbiased estimator: it is unbiased by construction and has minimum variance by the CRLB. A precise statement of the scalar case of the CRLB isThe Fisher Information2019-11-21T00:00:00-05:002019-11-21T00:00:00-05:00http://gregorygundersen.com/blog/2019/11/21/fisher-information<p>The goal of this post is to enumerate and derive several key properties of the Fisher information. The Fisher information quantifies how much information a random variable carries about its unknown generative parameters. Let $X = (X_1, \dots, X_n)$ be a random sample from $\mathbb{P}_{\theta} \in \mathcal{P} = \{\mathbb{P}_{\theta} : \theta \in \Theta\}$ with joint density $f_{\theta}(X)$. The log likelihood is</p> <script type="math/tex; mode=display">\mathcal{L}(\theta) = \log f_{\theta}(X).</script> <p>Since any point estimate $\hat{\theta}$ of $\theta$ is itself a random variable (because the point estimate is a function of $X$), we can think of the log likelihood as a “random curve”. The <em>score</em>, or gradient of the log likelihood w.r.t. 
$\theta$, evaluated at a particular point, tells us how sensitive the log likelihood is to changes in parameter values. Under certain regularity conditions, we can show (see below) that</p> <script type="math/tex; mode=display">\mathbb{E}\Big[\frac{\partial}{\partial \theta} \log f_{\theta}(X)\Big] = 0.</script> <p>In words, the expected gradient of the log likelihood is zero. The Fisher information is the <em>variance of the score</em>,</p> <script type="math/tex; mode=display">\mathcal{I}_n(\theta) = \mathbb{E}\Big[\Big( \frac{\partial}{\partial \theta} \log f_{\theta}(X) \Big)^2\Big].</script> <p>This is the variance because for any random variable $Z$, $\mathbb{V}[Z] = \mathbb{E}[Z^2] - \mathbb{E}[Z]^2$, and we just argued that the final term is zero. To quote <a href="https://stats.stackexchange.com/questions/10578/">this StackExchange answer</a>, “The Fisher information determines <em>how quickly</em> the observed score function converges to the shape of the true score function.”</p> <h2 id="properties">Properties</h2> <h3 id="expected-score-is-zero">Expected score is zero</h3> <p>If we can swap integration and differentiation, then</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \mathbb{E}\Big[\frac{\partial}{\partial \theta} \log p(X; \theta)\Big] &\stackrel{\star}{=} \int \Bigg( \frac{\frac{\partial}{\partial \theta} p(x; \theta)}{p(x; \theta)} \Bigg) p(x; \theta) \text{d}x \\ &= \int \frac{\partial}{\partial \theta} p(x; \theta) \text{d}x \\ &= \frac{\partial}{\partial \theta} \int p(x; \theta) \text{d}x \\ &= \frac{\partial}{\partial \theta} 1 \\ &= 0. 
\end{align} %]]></script> <p>In step $\star$, we use the fact that</p> <script type="math/tex; mode=display">g(x) = \log f(x) \implies g^{\prime}(x) = \frac{f^{\prime}(x)}{f(x)}.</script> <h3 id="alternative-definition">Alternative definition</h3> <p>If $f_{\theta}$ is twice differentiable and if we can swap integration and differentiation, then $\mathcal{I}_n(\theta)$ can be equivalently written as</p> <script type="math/tex; mode=display">\mathcal{I}_n(\theta) = - \mathbb{E}\Big[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X) \Big].</script> <p>To see this, first note that</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X) &= \frac{\partial}{\partial \theta} \Big( \frac{\partial}{\partial \theta} \log f_{\theta}(X) \Big) \\ &= \frac{\partial}{\partial \theta} \Big( \frac{\frac{\partial}{\partial \theta} f_{\theta}(X)}{f_{\theta}(X)} \Big) \\ &\stackrel{\star}{=} \frac{f_{\theta}(X) \frac{\partial^2}{\partial \theta^2} f_{\theta}(X) - \frac{\partial}{\partial \theta} f_{\theta} (X) \frac{\partial}{\partial \theta} f_{\theta} (X)}{f_{\theta}(X)^2} \\ &= \frac{\frac{\partial^2}{\partial \theta^2} f_{\theta}(X)}{f_{\theta}(X)} - \Bigg(\frac{\frac{\partial}{\partial \theta} f_{\theta} (X)}{f_{\theta}(X)}\Bigg)^2 \end{align} %]]></script> <p>We use the quotient rule from calculus in step $\star$. Now notice that</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \mathbb{E}\Bigg[\frac{\frac{\partial^2}{\partial \theta^2} f_{\theta}(X)}{f_{\theta}(X)}\Bigg] &= \int \Bigg[\frac{\frac{\partial^2}{\partial \theta^2} f_{\theta}(X)}{f_{\theta}(X)}\Bigg] f_{\theta}(X) \text{d}X \\ &= \int \frac{\partial^2}{\partial \theta^2} f_{\theta}(X) \text{d}X \\ &= \frac{\partial^2}{\partial \theta^2} \int f_{\theta}(X) \text{d}X \\ &= 0. 
\end{align} %]]></script> <p>Taking the expectation of the identity for $\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X)$ above, the first term vanishes and we are left with</p> <script type="math/tex; mode=display">\mathbb{E}\Big[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X)\Big] = - \mathbb{E}\Big[\Big(\frac{\partial}{\partial \theta} \log f_{\theta}(X)\Big)^2\Big] = - \mathcal{I}_n(\theta).</script> <h3 id="nonnegativity">Nonnegativity</h3> <p>Since $\mathcal{I}_n(\theta)$ is the expectation of a squared quantity, $\mathcal{I}_n(\theta) \geq 0$. This should also be obvious since variance is nonnegative.</p> <h3 id="reformulation-for-iid-settings">Reformulation for i.i.d. settings</h3> <p>If $X_1, \dots, X_n$ are i.i.d., then</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \mathcal{I}_n(\theta) &= - \mathbb{E}\Big[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X)\Big] \\ &= - \mathbb{E}\Big[\frac{\partial^2}{\partial \theta^2} \sum_{i=1}^{n} \log f_{\theta}(X_i)\Big] \\ &= - \sum_{i=1}^{n} \mathbb{E}\Big[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X_i)\Big] \\ &= - n \mathbb{E}\Big[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X_1)\Big]. \end{align} %]]></script> <p>Above, we abuse notation a bit by writing the joint density and marginal densities as both $f_{\theta}(\cdot)$. We can distinguish the Fisher information for a single sample as $\mathcal{I}(\theta)$ or</p> <script type="math/tex; mode=display">\mathcal{I}(\theta) = -\mathbb{E}\Big[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X_1)\Big].</script> <p>This means</p> <script type="math/tex; mode=display">\mathcal{I}_n(\theta) = n \mathcal{I}(\theta).</script> <h3 id="reparameterization">Reparameterization</h3> <p>Let $\eta = \psi(\theta)$ be a reparameterization and let $g(x; \eta) = f(x; \psi^{-1}(\eta))$ denote the reparameterized density. Let $\mathcal{I}_g(\cdot)$ and $\mathcal{I}_f(\cdot)$ denote the Fisher information under the two respective densities. 
Then</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \mathcal{I}_{g}(\eta) &= \mathbb{E}_g\Big[\Big(\frac{\partial}{\partial \eta} \log g(x; \eta)\Big)^2\Big] \\ &= \mathbb{E}_g\Big[\Big(\frac{\partial}{\partial \eta} \log f(x; \psi^{-1}(\eta))\Big)^2\Big] \\ &= \mathbb{E}_g\Big[\Big(\frac{\partial}{\partial \theta} \log f(x; \psi^{-1}(\eta)) \frac{\partial}{\partial \eta} \psi^{-1}(\eta) \Big)^2\Big] \\ &= \mathbb{E}_g\Big[\Big(\frac{\partial}{\partial \theta} \log f(x; \psi^{-1}(\eta))\Big)^2 \Big] \Big(\frac{\partial}{\partial \eta} \psi^{-1}(\eta) \Big)^2 \\ &= \mathcal{I}_f(\psi^{-1}(\eta)) \Big(\frac{\partial}{\partial \eta} \psi^{-1}(\eta) \Big)^2 \\ &= \mathcal{I}_f(\psi^{-1}(\eta)) \Big(\frac{1}{\psi^{\prime}(\psi^{-1}(\eta))}\Big)^2 \end{align} %]]></script> <p>The main idea is to apply the chain rule. The last step uses a property of <a href="https://www.khanacademy.org/math/ap-calculus-ab/ab-differentiation-2-new/ab-3-3/v/derivatives-of-inverse-functions">derivatives of inverse functions</a> from calculus, where $\psi^{\prime}$ denotes the derivative of $\psi$ with respect to $\theta$. Intuitively, if a density is reparameterized by $\psi$, the new Fisher information is the old Fisher information divided by the squared derivative of $\psi$, evaluated at $\theta = \psi^{-1}(\eta)$.</p>Gregory GundersenThe goal of this post is to enumerate and derive several key properties of the Fisher information. The Fisher information quantifies how much information a random variable carries about its unknown generative parameters. Let $X = (X_1, \dots, X_n)$ be a random sample from $\mathbb{P}_{\theta} \in \mathcal{P} = \{\mathbb{P}_{\theta} : \theta \in \Theta\}$ with joint density $f_{\theta}(X)$. 
The log likelihood isProof of the Rao–Blackwell Theorem2019-11-15T00:00:00-05:002019-11-15T00:00:00-05:00http://gregorygundersen.com/blog/2019/11/15/proof-rao-blackwell<p>The Rao–Blackwell Theorem <a class="citation" href="#rao1992information">(Rao, 1992; Blackwell, 1947)</a> states:</p> <blockquote> <p>Let $\hat{\theta}$ be an unbiased estimator of $\theta$ with a finite second moment for all $\theta$. Let $T(X)$ be a sufficient statistic for $\theta$. Then for all $\theta$,</p> <ol> <li> <p>$\theta_{\texttt{RB}} \triangleq \mathbb{E}[\hat{\theta} \mid T(X)] = \theta$,</p> </li> <li> <p>$\mathbb{V}[\theta_{\texttt{RB}}] \leq \mathbb{V}[\hat{\theta}]$.</p> </li> </ol> </blockquote> <p>This is a remarkably general result. In words, it says: if we have an unbiased estimator of our statistical parameter $\theta$ and a sufficient statistic of that parameter $T(X)$, then we can construct <em>another</em> estimator $\theta_{\texttt{RB}}$ such that this new estimator is still unbiased and may have less variance.</p> <p>The proof of the first claim is</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \mathbb{E}[\theta_{\texttt{RB}}] &\triangleq \mathbb{E}[\mathbb{E}[\hat{\theta} \mid T(X)]] \\ &= \mathbb{E}[\hat{\theta}] \\ &= \theta. \end{align} %]]></script> <p>The first equality just applies our definition of this new estimator $\theta_{\texttt{RB}}$. The next applies the <a href="/blog/2019/11/14/proof-total-expectation/">law of total expectation</a>. 
The last holds because $\hat{\theta}$ is unbiased.</p> <p>The proof of the second claim is</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \mathbb{V}[\theta_{\texttt{RB}}] &= \mathbb{E}[(\theta_{\texttt{RB}} - \theta)^2] \\ &= \mathbb{E}[(\mathbb{E}[\hat{\theta} \mid T(X)] - \theta)^2] \\ &= \mathbb{E}[(\mathbb{E}[\hat{\theta} - \theta \mid T(X)])^2] \\ &\leq \mathbb{E}[\mathbb{E}[(\hat{\theta} - \theta)^2 \mid T(X)]] \\ &= \mathbb{E}[(\hat{\theta} - \theta)^2] \\ &= \mathbb{V}[\hat{\theta}]. \end{align} %]]></script> <p>Once again, we use the definition of $\theta_{\texttt{RB}}$ (second step) and the law of total expectation (fifth step). The first and last equalities hold because both estimators are unbiased, so each variance is a mean squared error about $\theta$. The third equality holds because $\theta$ is a constant, so $\theta = \mathbb{E}[\theta \mid T(X)]$, together with the linearity of conditional expectation. The inequality holds because, for any random variable $Y$ (here $\hat{\theta} - \theta$ under the conditional distribution given $T(X)$),</p> <script type="math/tex; mode=display">\mathbb{V}[Y] = \mathbb{E}[Y^2] - \mathbb{E}[Y]^2 \geq 0 \implies \mathbb{E}[Y]^2 \leq \mathbb{E}[Y^2].</script> <p>In my mind, the Rao–Blackwell Theorem is remarkable in that (1) the proof is quite simple and (2) the result is quite general.</p> <h3 id="acknowledgements">Acknowledgements</h3> <p>This proof is based on a blackboard proof by <a href="https://cattaneo.princeton.edu/home">Matias Cattaneo</a> in Princeton’s <a href="https://registrar.princeton.edu/course-offerings/course-details?term=1202&amp;courseid=009316">Statistical Theory and Methods</a>.</p> <p>Gregory Gundersen</p>
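<p>As a numerical sanity check of the Rao–Blackwell Theorem proved above, here is a minimal simulation sketch. It is not from the original post: the Bernoulli model, the crude estimator $\hat{\theta} = X_1$, the function name <code>rao_blackwell_demo</code>, and all parameter values are assumptions chosen for illustration. For i.i.d. Bernoulli($\theta$) data, $T(X) = \sum_i X_i$ is sufficient and $\mathbb{E}[X_1 \mid T(X)] = T(X)/n$, so the Rao–Blackwellized estimator is the sample mean.</p>

```python
import numpy as np


def rao_blackwell_demo(theta=0.3, n=10, trials=200_000, seed=0):
    """Compare a crude unbiased estimator with its Rao-Blackwellized version.

    Illustrative setup (an assumption, not from the post):
    X_1, ..., X_n are i.i.d. Bernoulli(theta), the crude estimator is X_1,
    and the sufficient statistic is T(X) = sum_i X_i.
    """
    rng = np.random.default_rng(seed)
    # Each row is one simulated data set of n Bernoulli draws.
    x = rng.binomial(1, theta, size=(trials, n))

    # Crude unbiased estimator: use only the first observation.
    crude = x[:, 0].astype(float)

    # Rao-Blackwellized estimator: E[X_1 | sum_i X_i] = (sum_i X_i) / n.
    rb = x.mean(axis=1)

    return crude.mean(), crude.var(), rb.mean(), rb.var()


crude_mean, crude_var, rb_mean, rb_var = rao_blackwell_demo()
print(f"crude estimator: mean={crude_mean:.4f}, variance={crude_var:.4f}")
print(f"Rao-Blackwell:   mean={rb_mean:.4f}, variance={rb_var:.4f}")
```

<p>Under these assumptions, both empirical means should sit near $\theta$, while the variance of the Rao–Blackwellized estimator should be roughly $\theta(1 - \theta)/n$, compared with $\theta(1 - \theta)$ for the crude one.</p>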