Early Implementation of Attention Mechanism

In recent years we have witnessed the popularity and fast development of the Attention Mechanism in the deep learning community. It serves as a pivotal component in most state-of-the-art models for NLP tasks, and continues to be a rapidly evolving research topic in the CV field. Moreover, in recent AI-related scientific breakthroughs such as AlphaFold 2, the Attention Mechanism appears to be an omnipresent component of the models. That is why we (Kevin and I) decided to start a journal club to read and discuss seminal papers on how attention was introduced and further developed. We hope these discussions will give us more intuition about this fancy name, so that we can apply it to problems we are interested in with more confidence.

This blog post is a note on the first discussion, about the paper Bahdanau, et al. (2014) Neural machine translation by jointly learning to align and translate1. As an early (or the first) implementation of the "Attention Mechanism" in the translation task, it helped a lot, at least for me, in understanding what attention is, although the attention here is a little different from that in the later Transformer model.

  1. Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” arXiv preprint arXiv:1409.0473 (2014). 
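To make that difference concrete, here is a minimal numpy sketch of the additive (Bahdanau-style) attention step: an alignment score e_j = v_a^T tanh(W_a s_{i-1} + U_a h_j) for every encoder annotation h_j, a softmax over the scores, and a weighted sum of the annotations as the context vector. The parameter names follow the paper's notation, but the dimensions and random inputs are made up purely for illustration.

```python
import numpy as np

def additive_attention(s_prev, H, W_a, U_a, v_a):
    # e_j = v_a^T tanh(W_a s_{i-1} + U_a h_j): alignment score of source position j
    scores = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a   # (T,)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                     # softmax over source positions
    context = weights @ H                                 # expected annotation c_i
    return context, weights

rng = np.random.default_rng(0)
d_s, d_h, d_a, T = 4, 6, 5, 7                             # made-up dimensions
context, weights = additive_attention(
    rng.normal(size=d_s),          # previous decoder state s_{i-1}
    rng.normal(size=(T, d_h)),     # encoder annotations h_1..h_T
    rng.normal(size=(d_a, d_s)),   # W_a
    rng.normal(size=(d_a, d_h)),   # U_a
    rng.normal(size=d_a),          # v_a
)
print(weights.round(3), context.shape)
```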


Spectral Bias and Positional Encoding

In recent days (this may already be out of date by the time you read this blog), we have seen a "renaissance" of classic multilayer perceptron (MLP) models in the machine learning field. The lesson behind this trend is instructive for researchers: by understanding how a complex black box works, we can make reasonable modifications to improve it, instead of shooting blindly. The majority of this blog post is based on the paper Tancik, Matthew, et al. (2020) Fourier features let networks learn high frequency functions in low dimensional domains.

The basic take-away is that a standard MLP fails to learn high-frequency functions, both in theory and in practice, a phenomenon called Spectral Bias. Based on this finding, a simple Fourier feature mapping (Positional Encoding) can greatly improve the performance of MLPs, especially on low-dimensional regression tasks, e.g., when the inputs are atom coordinates.
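As a minimal sketch (not the authors' code), the random Fourier feature mapping is just gamma(v) = [cos(2*pi*Bv), sin(2*pi*Bv)] with a Gaussian random frequency matrix B; the scale sigma and the feature count m below are illustrative choices rather than values from the paper.

```python
import numpy as np

def fourier_features(x, B):
    # gamma(v) = [cos(2*pi*Bv), sin(2*pi*Bv)]: map (N, d) inputs to (N, 2m) features
    proj = 2.0 * np.pi * x @ B.T
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

rng = np.random.default_rng(0)
d, m, sigma = 3, 256, 10.0                # d = 3, e.g. atom / point coordinates
B = sigma * rng.normal(size=(m, d))       # random Gaussian frequency matrix
x = rng.uniform(size=(100, d))
z = fourier_features(x, B)                # feed z, not raw x, into the MLP
print(z.shape)                            # (100, 512)
```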


Bayesian Inference with Probabilistic Population Codes

This is a summary of the papers Ma, Wei Ji, et al. (2006) Bayesian inference with probabilistic population codes, Nature Neuroscience, and Ma, Wei Ji, et al. (2014) Neural Coding of Uncertainty and Probability, Annual Review of Neuroscience. The authors presented a model, supported by some physiological evidence, of how Bayesian probabilistic computation could be realized neurally in the human brain: probabilistic population codes. This report borrows a lot from Yafah's presentation.
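A toy sketch of the core idea, under simplifying assumptions (independent Poisson neurons, Gaussian tuning curves that tile the stimulus space densely and uniformly, made-up gains and widths rather than anything from the papers): the log posterior is linear in the spike counts, so simply summing the activity of two populations implements Bayes-optimal cue combination.

```python
import numpy as np

rng = np.random.default_rng(1)
s_grid = np.linspace(-10, 10, 401)                    # hypothesis space for the stimulus
pref = np.linspace(-10, 10, 50)                       # preferred stimuli of 50 neurons
phi = np.exp(-0.5 * ((s_grid[None, :] - pref[:, None]) / 2.0) ** 2)  # tuning shapes (50, 401)

def log_posterior(r):
    # Independent Poisson neurons with f_i(s) = gain * phi_i(s) and a dense uniform
    # tiling (so sum_i f_i(s) is ~constant in s): log p(s | r) ~ sum_i r_i log phi_i(s).
    return r @ np.log(phi + 1e-12)

s_idx = np.argmin(np.abs(s_grid - 2.0))               # true stimulus s = 2
r1 = rng.poisson(5.0 * phi[:, s_idx])                 # low-gain (less reliable) population
r2 = rng.poisson(20.0 * phi[:, s_idx])                # high-gain (more reliable) population

# Key PPC claim: summing spike counts combines the cues optimally, because
# log p(s | r1 + r2) = log p(s | r1) + log p(s | r2) + const.
print(np.allclose(log_posterior(r1 + r2), log_posterior(r1) + log_posterior(r2)))
print(s_grid[np.argmax(log_posterior(r1 + r2))])      # MAP estimate of the stimulus
```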


Temporal Difference Methods in Machine Learning

This is a summary of the paper Sutton (1988) Learning to predict by the methods of temporal differences. The paper provides a thorough discussion of temporal-difference methods for the learning-to-predict task, which takes a sequence of observations and tries to predict their eventual outcome, much like a classification problem. This summary borrows a lot of ideas from Tasha's presentation and centers on the comparison with the supervised learning method.
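As a minimal sketch of the method being compared, here is the linear TD(lambda) update: predictions P_t = w·x_t, a TD error P_{t+1} − P_t (with the final prediction replaced by the observed outcome z), and an eligibility trace that accumulates the gradients of past predictions. The learning rate, lambda, and the toy one-hot episode below are arbitrary illustrative choices, not the paper's random-walk experiment.

```python
import numpy as np

def td_lambda_update(w, states, outcome, alpha=0.1, lam=0.8):
    """One episode: states is a list of feature vectors x_1..x_m, outcome is the
    terminal value z. Predictions are linear, P_t = w @ x_t, and P_{m+1} := z."""
    e = np.zeros_like(w)                        # eligibility trace
    for t, x in enumerate(states):
        P_t = w @ x
        P_next = outcome if t == len(states) - 1 else w @ states[t + 1]
        e = lam * e + x                         # accumulate gradients of past predictions
        w = w + alpha * (P_next - P_t) * e      # spread the TD error over the trace
    return w

# Toy usage: five one-hot states visited in order, terminal outcome 1.
w = np.zeros(5)
episode = [np.eye(5)[i] for i in range(5)]
for _ in range(100):
    w = td_lambda_update(w, episode, outcome=1.0)
print(w.round(2))                               # each state's prediction approaches 1
```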


Stability of Memory Allocation with Neuroidal Model

This is a summary of the paper Jacob Beal and Thomas F. Knight, Jr. (2008) Analyzing Composability in a Sparse Encoding Model of Memorization and Association, which is in turn a follow-up to L. Valiant (2005) Memorization and association on a realistic neural model. The two papers discuss a random graph model for understanding basic cognitive tasks such as memorization and association in the brain.
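A toy sketch of the flavor of this model (not the papers' exact algorithm or parameter regime): items are sparse sets of neurons in a random directed graph, and an association-like operation allocates the neurons that receive enough synapses from both parent items. Whether the allocated set keeps a stable size when such operations are composed repeatedly is exactly the stability question the papers analyze. The graph size, edge probability, item size, and threshold below are made-up illustrative numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 0.002                          # n neurons, sparse random connectivity
G = rng.random((n, n)) < p                  # G[i, j] == True: synapse from neuron j to i

def join(a, b, k=1):
    # Associate two stored items: the new item is the set of neurons receiving
    # at least k synapses from item a AND at least k synapses from item b.
    return np.flatnonzero((G[:, a].sum(axis=1) >= k) & (G[:, b].sum(axis=1) >= k))

A = rng.choice(n, size=50, replace=False)   # item A: a sparse set of ~50 neurons
B = rng.choice(n, size=50, replace=False)   # item B
C = join(A, B)                              # neurons allocated to the association of A and B
# The stability question: under repeated composition of such operations, does the
# allocated set stay about the same size, blow up, or die out?
print(len(A), len(B), len(C))
```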