A collection of interesting and important machine learning papers
Jan 27, 2023
7 min read
New and interesting papers
Most are seminal, introducing something new, and a few are reviews that help summarize.
2020 - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale - ViT (Vision Transformer)
- Wikipedia: Vision transformer: Transformers measure the relationships between pairs of input tokens (words in the case of text strings), termed attention. The cost is quadratic in the number of tokens. For images, the basic unit of analysis is the pixel. However, computing relationships for every pixel pair in a typical image is prohibitive in terms of memory and computation. Instead, ViT computes relationships among pixels in various small sections of the image (e.g., 16x16 pixels), at a drastically reduced cost. The sections (with positional embeddings) are placed in a sequence. The embeddings are learnable vectors. Each section is arranged into a linear sequence and multiplied by the embedding matrix. The result, with the position embedding is fed to the transformer.
- ChatGPT summary: .. introduces a new image recognition model called ViT (Vision Transformers). Unlike traditional image recognition models that use convolutional neural networks (CNNs), ViT applies the transformer architecture from NLP to image recognition. ViT uses a patch-based approach, where an image is divided into non-overlapping patches, and each patch is fed into a transformer encoder to obtain a set of features. The features are then used to classify the image. The authors show that ViT outperforms previous state-of-the-art models on several benchmark datasets, demonstrating the effectiveness of the transformer architecture in image recognition. The paper also highlights the scalability of ViT, as it can handle high-resolution images and achieve improved accuracy with larger model sizes.
2020 - Autoencoders - Summary paper
2020 - Language Models are Few-Shot Learners - “GPT-3”
- Wikipedia: GPT-3: The architecture is a decoder-only transformer network with a 2048-token-long context and then-unprecedented size of 175 billion parameters, requiring 800GB to store. The model was trained using generative pre-training; it is trained to predict what the next token is based on previous tokens. The model demonstrated strong zero-shot and few-shot learning on many tasks. The authors described how language understanding performances in natural language processing (NLP) were improved in GPT-n through a process of “generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task.” This eliminated the need for human supervision and for time-intensive hand-labeling.
- ChatGPT summary: … investigates the ability of large language models such as GPT-3 to perform well on a wide range of tasks with only a small amount of fine-tuning or task-specific training data. The authors show that fine-tuning these models on a small dataset can lead to performance that is comparable or even superior to models trained from scratch on the same dataset, demonstrating the effectiveness of pre-training in NLP. The study also highlights the robustness of the models, as they can perform well on a wide range of tasks and domains, even when the tasks are very different from the pre-training data. The paper concludes that large language models are powerful few-shot learners, capable of adapting to new tasks with minimal fine-tuning, and has significant implications for NLP and AI in general.
2019 - Understanding LSTM – a tutorial into Long Short-Term Memory Recurrent Neural Networks - LSTM
2018 - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Wikipedia: BERT: “Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based machine learning technique for natural language processing (NLP) pre-training developed by Google. … The original English-language BERT has two models: (1) the BERTBASE: 12 encoders with 12 bidirectional self-attention heads, and (2) the BERTLARGE: 24 encoders with 16 bidirectional self-attention heads. Both models are pre-trained from unlabeled data extracted from the BooksCorpus with 800M words and English Wikipedia with 2,500M words.”
2017 - Decoupled Weight Decay Regularization - “AdamW”
2017 - Neural Discrete Representation Learning - “VQ-VAE”, vector autoencoders
2017 - Attention Is All You Need - introduces the transformer architecture
- Wikipedia: Transformer: A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. It is used primarily in the fields of natural language processing (NLP) and computer vision (CV).
- Like recurrent neural networks (RNNs), transformers are designed to process sequential input data, such as natural language, with applications towards tasks such as translation and text summarization. However, unlike RNNs, transformers process the entire input all at once. The attention mechanism provides context for any position in the input sequence. For example, if the input data is a natural language sentence, the transformer does not have to process one word at a time. This allows for more parallelization than RNNs and therefore reduces training times."
- Transformers were introduced in 2017 by a team at Google Brain and are increasingly the model of choice for NLP problems, replacing RNN models such as long short-term memory (LSTM). The additional training parallelization allows training on larger datasets. This led to the development of pretrained systems such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer), which were trained with large language datasets, such as the Wikipedia Corpus and Common Crawl, and can be fine-tuned for specific tasks.
- ChatGPT summary: … presents a novel Transformer architecture for neural machine translation (NMT) that relies solely on self-attention mechanisms and dispenses with recurrence and convolutions. The Transformer model consists of an encoder and a decoder, both of which are composed of a stack of self-attention and feed-forward layers. The self-attention mechanism allows each position in the sequence to attend to all other positions, enabling the model to capture long-range dependencies and contextual information.
2016 - WaveNet: A Generative Model for Raw Audio
2016 - Layer Normalization
2016 - XGBoost: A Scalable Tree Boosting System
2015 - Deep Residual Learning for Image Recognition - “ResNet”, introduces “residual blocks” that transform data, then a skip connection from the previous features
2015 - Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift - “batch norm”
2014 - Adam: A Method for Stochastic Optimization - “Adam”
2014 - Dropout: a simple way to prevent neural networks from overfitting
2014 - Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation - “GRU”, a Gated Recurrent Unit
- GRU is a type of recurrent neural network (RNN). It is similar to an LSTM, but only has two gates - a reset gate and an update gate - and notably lacks an output gate. Fewer parameters means GRUs are generally easier/faster to train than their LSTM counterparts
2014 - Generative Adversarial Networks - “GAN”
- ChatGPT summary: … presents a novel architecture for generative models in machine learning. GANs consist of two neural networks, a generator and a discriminator, that are trained simultaneously in a game-theoretic setup where the generator aims to produce samples that are indistinguishable from real data, while the discriminator tries to correctly classify whether each sample is real or fake. The training process continues until the generator produces high-quality samples that the discriminator is unable to reliably distinguish from real data. The authors show that GANs are capable of generating diverse and highly-realistic samples, making them a powerful tool for various applications such as image synthesis, data augmentation, and unsupervised representation learning.
Read more posts like this in the
Software Engineering Toolbox