# A Must Read for Logistic Regression

I came across an old technical report written by Michael Jordan (no, not the basketball guy):

“Why the logistic function? A tutorial discussion on probabilities and neural networks”. M. I. Jordan. MIT Computational Cognitive Science Report 9503, August 1995.

The material is amazingly straightforward and easy to understand. It answers, at least partially, a long-standing question of mine: why is the logistic function used in regression? Regardless of how it was first introduced, the report shows that it can be derived from a simple binary classification problem in which we wish to estimate the posterior probability: $P(w_{0}|\mathbf{x}) = \frac{P(\mathbf{x}|w_{0})P(w_{0})}{P(\mathbf{x})}$
where $$w_{0}$$ can be thought of as a class label and $$\mathbf{x}$$ as a feature vector. We can expand the denominator and introduce an exponential:
$P(w_{0}|\mathbf{x}) = \frac{P(\mathbf{x}|w_{0})P(w_{0})}{P(\mathbf{x}|w_{0})P(w_{0})+P(\mathbf{x}|w_{1})P(w_{1})}=\frac{1}{1+\exp\{-\log a - \log b\}}$
where $$a=\frac{P(\mathbf{x}|w_{0})}{P(\mathbf{x}|w_{1})}$$ and $$b= \frac{P(w_{0})}{P(w_{1})}$$. Dividing the numerator and denominator by $$P(\mathbf{x}|w_{0})P(w_{0})$$ leaves $$\frac{1}{1+\frac{1}{ab}}$$, and $$\frac{1}{ab}=\exp\{-\log a - \log b\}$$. So far this is nothing more than algebraic manipulation, yet it already gives a flavor of how the logistic function arises from a simple classification problem. Now, if we specify a particular form for the class-conditional densities $$P(\mathbf{x}|w)$$, for instance Gaussians with a common covariance matrix, then $$\log a$$ becomes linear in $$\mathbf{x}$$ and we recover logistic regression easily.
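To make the Gaussian case concrete, here is a minimal numerical sketch. It assumes both classes are Gaussian with means $$\mu_{0}$$, $$\mu_{1}$$ and one shared covariance matrix $$\Sigma$$. Under that assumption the posterior is a sigmoid of $$\mathbf{w}^{T}\mathbf{x}+c$$, with $$\mathbf{w}=\Sigma^{-1}(\mu_{0}-\mu_{1})$$ and $$c=-\frac{1}{2}\mu_{0}^{T}\Sigma^{-1}\mu_{0}+\frac{1}{2}\mu_{1}^{T}\Sigma^{-1}\mu_{1}+\log b$$. The function names and all parameter values below are made up for illustration; they are not taken from the report.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate Gaussian N(mean, cov) at point x."""
    d = len(mean)
    diff = x - mean
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def posterior_bayes(x, mu0, mu1, cov, prior0):
    """P(w0 | x) computed directly from Bayes' rule."""
    num = gaussian_pdf(x, mu0, cov) * prior0
    den = num + gaussian_pdf(x, mu1, cov) * (1 - prior0)
    return num / den

def posterior_logistic(x, mu0, mu1, cov, prior0):
    """P(w0 | x) written as a sigmoid of a linear function of x."""
    cov_inv = np.linalg.inv(cov)
    w = cov_inv @ (mu0 - mu1)                      # slope, from log a
    c = (-0.5 * mu0 @ cov_inv @ mu0                # constant part of log a
         + 0.5 * mu1 @ cov_inv @ mu1
         + np.log(prior0 / (1 - prior0)))          # log b, the prior log-odds
    return 1.0 / (1.0 + np.exp(-(w @ x + c)))

# Illustrative (made-up) parameters.
mu0 = np.array([1.0, 2.0])
mu1 = np.array([-1.0, 0.5])
cov = np.array([[1.0, 0.3], [0.3, 2.0]])
prior0 = 0.3
x = np.array([0.2, 1.0])

p_bayes = posterior_bayes(x, mu0, mu1, cov, prior0)
p_logistic = posterior_logistic(x, mu0, mu1, cov, prior0)
print(p_bayes, p_logistic)  # the two values should agree up to rounding
```

If the two classes were given different covariance matrices, the quadratic terms in $$\log a$$ would no longer cancel and the argument of the sigmoid would be quadratic in $$\mathbf{x}$$ rather than linear.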

However, the point of the report is not just to show where the logistic function comes from, but also to show that discriminative and generative models in this particular setting are two sides of the same coin. In addition, Jordan demonstrates that these two sides are NOT equivalent and must be treated carefully when different learning criteria are considered. A simple take-away is that the discriminative model (logistic regression) is more “robust” because it relies on fewer assumptions about the class-conditional densities, whereas the generative model may be more accurate if its assumptions are correct.
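The two routes to the same linear form can be sketched side by side. This is a rough illustration on synthetic data, not an experiment from the report: the data, the function names `fit_generative` and `fit_discriminative`, the learning rate, and the step count are all my own assumptions. The generative route estimates class means, a pooled covariance, and the prior, then converts them to weights. The discriminative route fits the weights directly by gradient ascent on the conditional log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data that satisfies the generative assumption:
# two Gaussian classes with the same (identity) covariance.
n = 500
mu0, mu1 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
X = np.vstack([rng.normal(mu0, 1.0, size=(n, 2)),
               rng.normal(mu1, 1.0, size=(n, 2))])
y = np.concatenate([np.ones(n), np.zeros(n)])  # y = 1 marks class w0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_generative(X, y):
    """Estimate Gaussian parameters, then convert them to (w, c)."""
    X0, X1 = X[y == 1], X[y == 0]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    centered = np.vstack([X0 - m0, X1 - m1])
    cov = centered.T @ centered / len(X)       # pooled covariance estimate
    cov_inv = np.linalg.inv(cov)
    prior0 = len(X0) / len(X)
    w = cov_inv @ (m0 - m1)
    c = (-0.5 * m0 @ cov_inv @ m0 + 0.5 * m1 @ cov_inv @ m1
         + np.log(prior0 / (1 - prior0)))
    return w, c

def fit_discriminative(X, y, lr=0.1, steps=2000):
    """Fit (w, c) by gradient ascent on the conditional log-likelihood."""
    w, c = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        err = y - sigmoid(X @ w + c)           # gradient w.r.t. the logit
        w += lr * X.T @ err / len(X)
        c += lr * err.mean()
    return w, c

def accuracy(w, c, X, y):
    """Fraction of points whose predicted class matches the label."""
    return np.mean((sigmoid(X @ w + c) > 0.5) == (y == 1))

w_gen, c_gen = fit_generative(X, y)
w_dis, c_dis = fit_discriminative(X, y)
acc_gen = accuracy(w_gen, c_gen, X, y)
acc_dis = accuracy(w_dis, c_dis, X, y)
print("generative:    ", w_gen, c_gen, acc_gen)
print("discriminative:", w_dis, c_dis, acc_dis)
```

Because the data here really are Gaussian with a shared covariance, both fits should land on similar decision boundaries. One way to probe the robustness claim is to replace the sampling step with data that violates the Gaussian assumption and compare the two fits again.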

For more details, please refer to the report.