
From Chapter 1:


We process experiences both naturally and statistically; however, the way we naturally process experiences often diverges from the methods that classical statistics prescribes. Our purpose in writing this book is to reorient common statistical thinking to accord with our natural instincts.


...


The advantage of the natural process is that it is intuitive and sensible. The advantage of the statistical method is that, by recording experiences as data, we can analyze them more rigorously and efficiently than narratives would allow. Our purpose is to reconcile classical statistics with our natural process in a way that secures the advantages of both approaches.

From Chapter 4:  Relevance Conceptually


Relevance measures the importance of an observation to a prediction. It is the sum of two parts: informativeness and similarity. Simply put, observations that differ from average yet resemble present circumstances are more relevant than those that do not.

 

...


Filtering experience in this way is intuitive. Faced with a new situation, we automatically scour our memory for noteworthy experiences that bear some resemblance to the one we face. A pending snowstorm might recall past storms that kept us stuck at home, others from ski trips, and maybe even other types of severe weather events. A new illness might conjure images of people our age who recently struggled with a similar disease. A new product launch might spark comparisons to a company's history and to related products from competitors. Relevance is a matter of degree. We might include less relevant experiences along with highly relevant ones, but we acknowledge the difference.


...


Predictions arise from this view almost effortlessly. Just take the relevance-weighted average of what occurred in each case, where the occurrence we care about can be anything we wish to predict. 

From Chapter 4:  Relevance Mathematically


We are now able to augment our baseline prediction of the simple average of Y by applying unequal weights to the deviations of Y around its average. As before, we use the term weight loosely: it is a scalar multiple that can be positive, negative, or zero. In fact, the weight of each deviation is that observation's relevance.

$$\hat{y}_t = \bar{y} + \frac{1}{N-1}\sum_{i=1}^{N} r_{it}\,(y_i - \bar{y})$$

Where:

$$r_{it} = \mathrm{sim}(x_i, x_t) + \mathrm{info}(x_i) + \mathrm{info}(x_t)$$

$$\mathrm{sim}(x_i, x_t) = -\tfrac{1}{2}\,(x_i - x_t)'\,\Omega^{-1}\,(x_i - x_t)$$

$$\mathrm{info}(x_i) = \tfrac{1}{2}\,(x_i - \bar{x})'\,\Omega^{-1}\,(x_i - \bar{x})$$

$$\mathrm{info}(x_t) = \tfrac{1}{2}\,(x_t - \bar{x})'\,\Omega^{-1}\,(x_t - \bar{x})$$

Here $x_i$ is the vector of attributes for prior observation $i$, $x_t$ describes current circumstances, $\bar{x}$ and $\bar{y}$ are sample averages, $\Omega$ is the covariance matrix of the attributes, and $N$ is the number of prior observations.

Therefore:

$$\hat{y}_t = \bar{y} + \frac{1}{N-1}\sum_{i=1}^{N}\big[\mathrm{sim}(x_i, x_t) + \mathrm{info}(x_i) + \mathrm{info}(x_t)\big]\,(y_i - \bar{y})$$
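To make the mechanics concrete, the following sketch implements the formulas above in Python. It is our own minimal illustration, not the book's code; names such as relevance_weighted_prediction, X, y, and x_t are assumptions. Over the full sample, the result coincides with the linear regression prediction, which is the equivalence highlighted below.

```python
import numpy as np

def relevance_weighted_prediction(X, y, x_t):
    """Predict y for circumstances x_t as the relevance-weighted
    average of prior outcomes.

    X   : (N, k) attributes of N prior observations
    y   : (N,)  prior outcomes
    x_t : (k,)  current circumstances
    """
    N = len(y)
    x_bar = X.mean(axis=0)
    omega_inv = np.linalg.inv(np.cov(X, rowvar=False, ddof=1))

    def info(v):
        # informativeness: half the Mahalanobis distance from average
        d = v - x_bar
        return 0.5 * d @ omega_inv @ d

    def sim(a, b):
        # similarity: negative half the Mahalanobis distance between two points
        d = a - b
        return -0.5 * d @ omega_inv @ d

    # relevance of each prior observation to the current circumstances
    r = np.array([sim(X[i], x_t) + info(X[i]) + info(x_t) for i in range(N)])

    # baseline average plus relevance-weighted deviations
    return y.mean() + r @ (y - y.mean()) / (N - 1)

# Full-sample check: the result matches the linear regression prediction.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([0.5, -1.0, 0.3]) + rng.normal(size=50)
x_t = rng.normal(size=3)

beta = np.linalg.lstsq(np.column_stack([np.ones(len(y)), X]), y, rcond=None)[0]
print(relevance_weighted_prediction(X, y, x_t))   # equals...
print(beta[0] + x_t @ beta[1:])                   # ...the OLS prediction
```

Because the relevance weights are recomputed for each new set of circumstances, the prediction conditions on x_t observation by observation rather than through a single fixed set of betas.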

Mathematical Equivalences

     VARIANCE = The average of half the squared distance between every pair of observations

$$\sigma^2 = \frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{j \neq i}\frac{(x_i - x_j)^2}{2}$$

     CORRELATION = The information-weighted average of co-occurrence

$$\rho_{xy} = \frac{\sum_{i=1}^{N} \mathrm{info}_i\,\mathrm{cooc}_i}{\sum_{i=1}^{N} \mathrm{info}_i}, \qquad \mathrm{info}_i = \frac{z_{x,i}^2 + z_{y,i}^2}{2}, \qquad \mathrm{cooc}_i = \frac{z_{x,i}\,z_{y,i}}{\mathrm{info}_i}$$

where $z_{x,i}$ and $z_{y,i}$ are the standardized (z-scored) values of observation $i$.

     Y-HAT (linear regression prediction) = The relevance-weighted average of prior outcomes

$$\hat{y}_t = \bar{y} + \frac{1}{N-1}\sum_{i=1}^{N} r_{it}\,(y_i - \bar{y})$$

     R-SQUARED = The information-weighted average fit for all prediction tasks

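These equivalences are straightforward to verify numerically. The sketch below is our own Python illustration; in it we write the informativeness of an observation as the average of its two squared z-scores and its co-occurrence as their normalized product, one formulation consistent with the verbal statements above.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.6 * x + rng.normal(size=200)
N = len(x)

# VARIANCE: the average of half the squared distance between every pair
half_sq_dist = 0.5 * (x[:, None] - x[None, :]) ** 2
var_pairs = half_sq_dist.sum() / (N * (N - 1))    # N(N-1) ordered pairs, i != j
print(np.isclose(var_pairs, x.var(ddof=1)))       # True

# CORRELATION: the information-weighted average of co-occurrence
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
info = 0.5 * (zx ** 2 + zy ** 2)   # informativeness of each observation
cooc = zx * zy / info              # co-occurrence, scaled by informativeness
rho = (info * cooc).sum() / info.sum()
print(np.isclose(rho, np.corrcoef(x, y)[0, 1]))   # True
```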

From Chapter 8:  Biographies — Claude Shannon


Keep in mind that Shannon worked for the phone company, granted, an ultra-elite branch of the phone company, but nonetheless an organization whose primary purpose was to enable fast and accurate communication. The phone company's chief challenge was to transmit signals of information quickly and accurately across great distances in the presence of static or noise, hence the term signal-to-noise ratio. Shannon's first great insight was that the speed of transmission is inversely related to redundancy. That is to say, we can speed up the transmission of information by reducing the redundant symbols used to convey it. For example, in the English language the letter q is almost always followed by the letter u. Therefore, u is almost always redundant when preceded by q, and by removing u whenever it follows q, we can convey the same information with fewer symbols, thereby enabling quicker transmission of the message. Perhaps a starker example of redundancy is the fact that we could remove the word the throughout this entire book without giving up any of the information contained herein. It would not read as well, and it might take the reader a little longer to grasp our meaning, but the information could be transmitted more quickly. This example suggests another feature of Shannon's theory of communication: he was not concerned with conveying meaning, but rather with transmitting messages that might mean different things to different people.

 

...

 

To grasp the profundity of Shannon's information theory, imagine a world without email, the internet, cell phones, or the ability to send a photograph across the world or to download music and videos nearly instantaneously. Consider that all of these capabilities are enabled by 0s and 1s, configured probabilistically, bouncing off satellites orbiting Earth. And to appreciate Shannon's brilliance further, consider that the logical and mathematical structure of information theory explains the transmission of information by the genes in our bodies as well as it explains how we communicate with one another. The mark of a great theory is the magnitude and generality of its influence. By that standard, Shannon's information theory ranks as one of the greatest technological achievements ever.


References for the biographical sketch of Claude Shannon include: Soni, J. and R. Goodman. 2017. A Mind at Play. Simon & Schuster Paperbacks.

A present-day view of Claude Shannon's former residence. 

Relevance-weighted prediction compared to other approaches


Weighted Least Squares applies a fixed external weighting scheme to observations to produce "better" linear regression betas. The weights are used to measure covariances, too.


  • Our approach does not weight the covariance matrix. More importantly, it re-weights observations based on their relevance, which establishes nonlinear conditionality.


  • The relevance we use to weight observations is not arbitrary. It is grounded in information theory and further justified by an important mathematical equivalence. 


Kernel Regressions and Nearest Neighbor Algorithms form predictions from local subsets of observations based on their proximity to current circumstances. 


  • Our approach includes informativeness (difference from average), whereas the above methods consider only similarity (see the sketch after this list).


  • Our approach motivates the Mahalanobis distance through information theory and a full-sample regression equivalence, whereas the above methods treat the distance measure as an arbitrary (empirical) choice.
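To illustrate the first point, the following sketch (our own; kernel_weight, relevance, and bandwidth are illustrative names, and the Gaussian kernel is one common choice) contrasts a kernel weight, which responds to proximity alone, with the relevance weight, which also rewards informativeness. Two observations equally close to the current circumstances receive identical kernel weights, but relevance favors the more unusual one.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))                    # sample of attributes
x_bar = X.mean(axis=0)
omega_inv = np.linalg.inv(np.cov(X, rowvar=False))
x_t = np.array([1.0, 1.0])                       # current circumstances

def kernel_weight(x_i, bandwidth=1.0):
    # kernel / nearest-neighbor logic: the weight reflects proximity alone
    d2 = (x_i - x_t) @ omega_inv @ (x_i - x_t)
    return np.exp(-0.5 * d2 / bandwidth ** 2)

def relevance(x_i):
    # relevance adds informativeness (distance from average) to similarity
    sim = -0.5 * (x_i - x_t) @ omega_inv @ (x_i - x_t)
    info_i = 0.5 * (x_i - x_bar) @ omega_inv @ (x_i - x_bar)
    info_t = 0.5 * (x_t - x_bar) @ omega_inv @ (x_t - x_bar)
    return sim + info_i + info_t

a = np.array([2.0, 2.0])   # unusual observation, same distance to x_t as b
b = np.array([0.0, 0.0])   # ordinary observation near the sample average
print(kernel_weight(a), kernel_weight(b))   # identical weights
print(relevance(a), relevance(b))           # relevance favors the unusual a
```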


Hidden Markov Models and State Space Models more generally (via Kalman Filters) assume some structured time evolution of unobservable states, which makes predictions dependent on current circumstances.


  • Our approach does not make a structural assumption about hidden states, nor does it assume sequential dependency. Thus, our approach applies equally to cross-sectional data. 


  • Further, our approach makes the link between each prior observation and the prediction transparent.


The historical lineage of Relevance


© 2024 by Megan Czasonis, Mark Kritzman, and David Turkington
