Notes from Karpathy’s machine learning lectures - Neural Networks: Zero to Hero
These are my notes from the Andrej Karpathy lecture series: Neural Networks: Zero to Hero. Andrej is the former Director of AI at Tesla and an excellent teacher. He demystifies complex ML topics like gradient descent through simple examples. When following these videos, I recommend recreating everything Andrej covers on your local machine. Typing out the example code and running it locally is great practice and builds confidence that powerful models can be built up from simple building blocks.
YouTube playlist: Neural Networks: Zero to Hero
YouTube videos:
Github repos:
Here’s how Wikipedia defines a neural network:
A neural network is a network or circuit of biological neurons, or, in a modern sense, an artificial neural network, composed of artificial neurons or nodes.
Thus, a neural network is either a biological neural network, made up of biological neurons, or an artificial neural network, used for solving artificial intelligence (AI) problems.
The connections of the biological neuron are modeled in artificial neural networks as weights between nodes. A positive weight reflects an excitatory connection, while negative values mean inhibitory connections. All inputs are modified by a weight and summed. This activity is referred to as a linear combination.
Finally, an activation function controls the amplitude of the output. For example, an acceptable range of output is usually between 0 and 1, or it could be −1 and 1.
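To make this concrete, here is a toy single artificial neuron (my own illustration, not from the lectures): a weighted sum of inputs plus a bias, squashed by a tanh activation into the range (−1, 1).
import math

inputs  = [0.5, -1.2, 3.0]
weights = [0.7,  0.4, -0.2]  # positive = excitatory, negative = inhibitory
bias = 0.1

# linear combination: inputs modified by weights and summed
activation = sum(w * x for w, x in zip(weights, inputs)) + bias
# activation function controls the amplitude of the output (here: between -1 and 1)
output = math.tanh(activation)
print(output)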
First, install Python if you haven’t yet.
Use pyenv, similar to rbenv, to easily manage Python versions. Note: I recommend an older Python version like 3.9 because PyTorch, a key dependency later on, doesn’t always support the latest release:
pyenv install -l # list remote
pyenv install 3.9 # install
pyenv local 3.9 # use it locally
pyenv versions # list local
pyenv init
# then update ~/.zshrc with the output to be able to use correct version of
# python via `python` and `pip` commands without calling `python3`, `pip3`, etc
Then, open VS Code and install its Python extension
Then git clone
Andrej’s repositories locally:
git clone git@github.com:karpathy/micrograd.git
git clone git@github.com:karpathy/makemore.git
# Open in VS Code
code micrograd
Open a Jupyter Notebook .ipynb
file in micrograd
and select the pyenv
version you installed, plus install any VSCode recommended extensions for Jupyter Notebooks.
Create a new file called youtube1.ipynb
or something similar so you can run the same commands that Andrej does during his videos.
Now you’re all set to dive into the videos.
Watch the YouTube video: The spelled-out intro to neural networks and backpropagation: building micrograd.
Code source: micrograd
And follow along locally in your own .ipynb
file. I strongly recommend doing this yourself, typing out everything that Andrej does, and running it all locally. This “practice” helps me learn the content better and builds confidence that it can all be recreated locally.
Just for reference (create your own!), here are examples of local notebooks:
Everything needed to understand a neural network is in micrograd. Everything else is just efficiency.
There are only 150 lines of code in micrograd/engine.py
and micrograd/nn.py
.
“Back propagation is recursive application of chain rule, backwards through the computation graph.”
If we can write the local gradients and backpropagate them through each operation, then it doesn’t matter whether we treat it as one compound function or as separate functions composed together (equal to the compound function); the result will be the same.
micrograd
is for scalar values (1.0, etc)
PyTorch is for a tensor, an n-dimensional array of scalars
Gradient descent is the iteration of: forward pass (compute the loss), backward pass (compute the gradients), and update (nudge each parameter against its gradient).
The neural net improved its predictions with each iteration.
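As a minimal sketch of that loop, here is what gradient descent looks like using micrograd’s MLP API (assuming micrograd is cloned/importable; the tiny dataset and hyperparameters below are just illustrative):
# A minimal gradient descent loop with micrograd
from micrograd.nn import MLP

# tiny dataset: 4 examples with 3 inputs each, and desired targets
xs = [[2.0, 3.0, -1.0], [3.0, -1.0, 0.5], [0.5, 1.0, 1.0], [1.0, 1.0, -1.0]]
ys = [1.0, -1.0, -1.0, 1.0]

model = MLP(3, [4, 4, 1])  # 3 inputs -> two hidden layers of 4 -> 1 output

for step in range(20):
    # 1. forward pass: compute predictions and loss
    ypred = [model(x) for x in xs]
    loss = sum((yout - ygt) ** 2 for ygt, yout in zip(ys, ypred))
    # 2. backward pass: zero gradients, then backpropagate
    model.zero_grad()
    loss.backward()
    # 3. update: nudge parameters against the gradient
    for p in model.parameters():
        p.data += -0.05 * p.grad
    print(step, loss.data)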
Most common neural net mistakes from @karpathy’s tweet:
most common neural net mistakes: 1) you didn't try to overfit a single batch first. 2) you forgot to toggle train/eval mode for the net. 3) you forgot to .zero_grad() (in pytorch) before .backward(). 4) you passed softmaxed outputs to a loss that expects raw logits. ; others? :)
— Andrej Karpathy (@karpathy) July 1, 2018
Watch the YouTube video: The spelled-out intro to language modeling: building makemore.
Code source: makemore
Every line in makemore/names.txt
is an example. Each example is a sequence of characters.
We’re building a character-level language model: it learns to predict the next character in the sequence.
As noted in the video, it’s important to learn more about Broadcasting semantics.
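For example (a small illustration of my own, not from the notebook): dividing a (3, 3) matrix by its (3, 1) column of row sums stretches the size-1 dimension across the columns, which is exactly the pattern used to normalize the probability matrix P below.
import torch

M = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.],
                  [7., 8., 9.]])
row_sums = M.sum(1, keepdim=True)  # shape (3, 1)
P_example = M / row_sums           # (3, 1) broadcast against (3, 3): each row divided by its own sum
print(P_example.sum(1))            # tensor([1., 1., 1.])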
This lecture lets us train a bigram language model.
# Setup - common to both approaches
import torch
# data set: 32k first names
words = open('names.txt', 'r').read().splitlines()
chars = sorted(list(set(''.join(words))))
# s to i lookup, setting `.` as 0 index in array and all others + 1
# we'll use `.` to mark the start and end of all words
stoi = {s: i+1 for i, s in enumerate(chars)}
stoi['.'] = 0
# i to s lookup
itos = {i: s for s, i in stoi.items()}
We start training it by counting how frequently any pairing of letters occurs in ~32k names, and then normalizing so we get a nice probability distribution.
# Approach 1: non-neural network approach: count frequency of bigrams and store in `N`
#
# Create a 27x27 matrix with values all set to 0
N = torch.zeros((27, 27), dtype=torch.int32)
# Get the counts
for w in words:
    # use `.` to mark the start and end of all words
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        # integer index of this character in stoi, 0-26
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        N[ix1, ix2] += 1
# prepare probabilities, parameters of our bigram language model
# Apply "model smoothing" using `N + 1` instead of `N`. This prevents 0's in probability matrix P, which could lead to `infinity` for loss measurement.
P = (N + 1).float()
# P is (27, 27); P.sum(1, keepdim=True) is (27, 1)
# The (27, 1) tensor is "broadcastable": its size-1 dimension is stretched across all 27 columns,
# so each row is divided by its own sum
# https://pytorch.org/docs/stable/notes/broadcasting.html?highlight=broadcasting
# Below uses `/=` to avoid creating a new tensor, i.e. more efficient
P /= P.sum(1, keepdim=True)
g = torch.Generator().manual_seed(2147483647)
# Sample
for i in range(5):
    out = []
    ix = 0
    while True:
        p = P[ix]
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])
        # Break when `.` is found, marking the end of the word
        if ix == 0:
            break
    print(''.join(out))
# Output:
# mor.
# axx.
# minaymoryles.
# kondlaisah.
# anchshizarie.
Then we can evaluate the quality of this model.
Goal: summarize probabilities into a single number that measures the quality of this model.
# Goal: summarize probabilities into a single number that measures the quality of this model
log_likelihood = 0.0
n = 0
# for w in ["andrejq"]:
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        logprob = torch.log(prob)
        # This is because: log(a*b*c) = log(a) + log(b) + log(c)
        log_likelihood += logprob
        n += 1
        print(f'{ch1}{ch2}: {prob:.4f} {logprob:.4f}')
print(f'{log_likelihood=}')
# Negative log likelihood is a nice loss function.
# The lowest it can get is 0; the higher it is, the worse the model's predictions are.
nll = -log_likelihood
print(f'{nll=}')
# Above was the sum of the negative log likelihood. Better is the average negative log likelihood.
# So divide that sum by `n` to get the average.
# So the loss this model assigns to the training set is ~2.45. That's the "quality" of this model.
# The lower it is, the better off we are; the higher it is, the worse off we are.
print(f'{nll/n}')
# Output:
# log_likelihood=tensor(-559951.5625)
# nll=tensor(559951.5625)
# 2.4543561935424805
# Approach 2: neural network approach trained on bigrams
# for one hot encoding: `F.one_hot` below
import torch.nn.functional as F
#
# Dataset: 228K bigrams from the 32K example names
#
xs, ys = [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        xs.append(ix1)
        ys.append(ix2)
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()
print('number of examples: ', num)
# initialize the 'network'
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)
Gradient descent
# Gradient descent
for k in range(100):
    # forward pass
    # input to the network: one-hot encoding
    xenc = F.one_hot(xs, num_classes=27).float()
    logits = xenc @ W # predict log-counts
    counts = logits.exp() # counts, equivalent to N
    # probabilities for next character
    probs = counts / counts.sum(1, keepdims=True)
    # regularization loss: `0.01*(W**2).mean()` pulls all W's toward 0
    # a larger coefficient than `0.01` makes the probabilities more uniform (acts like stronger smoothing)
    loss = -probs[torch.arange(num), ys].log().mean() + 0.01 * (W**2).mean()
    print(loss.item())
    # backward pass
    W.grad = None # set the gradient to zero
    loss.backward()
    # update
    W.data += -50 * W.grad
# Earlier we had ~2.45 loss when we manually did the counts.
# So we'd like this neural network approach to reach a similar loss, i.e. be about as "good".
# Sample from neural net model
g = torch.Generator().manual_seed(2147483647)
for i in range(5):
    out = []
    ix = 0
    while True:
        xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
        logits = xenc @ W # predict log-counts
        counts = logits.exp() # counts, equivalent to N
        p = counts / counts.sum(1, keepdims=True) # probabilities for next character
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])
        if ix == 0:
            break
    print(''.join(out))
# Output: almost identical to approach 1 non-neural network with count frequencies
# mor.
# axx.
# minaymoryles.
# kondlaisah.
# anchshizarie.
Watch the YouTube video: Building makemore Part 2: MLP.
A multilayer perceptron (MLP) character-level language model.
Original paper: “A Neural Probabilistic Language Model, Bengio et al. 2003”
We can embed all of the integers in X
as C[X]
thanks to PyTorch’s multi-dimensional indexing.
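For example (the shapes here are illustrative, smaller than in the actual notebook):
import torch

C = torch.randn((27, 2))         # embedding lookup table: 27 characters, 2-dim embeddings
X = torch.tensor([[ 0,  5, 13],  # a batch of 2 examples, each with block_size 3
                  [ 1,  1,  4]])
emb = C[X]                       # index with a whole integer tensor at once
print(emb.shape)                 # torch.Size([2, 3, 2]): one embedding per input character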
We want to randomly select some portion of the dataset, the minibatch. Then only forward, backward, and update on that minibatch.
What’s the right learning rate? Video
import matplotlib.pyplot as plt

# candidate learning rates, exponentially spaced between 10^-3 and 10^0
lre = torch.linspace(-3, 0, 1000)
lrs = 10**lre
lri, stepi, lossi = [], [], []

# ...
# Inside gradient descent, step i uses one candidate learning rate:
lr = lrs[i]

# track stats
# lri.append(lre[i])
stepi.append(i)
lossi.append(loss.log10().item())

# To get the visualization
plt.plot(stepi, lossi)
Split up data into 3 splits:
We use the training split to optimize the parameters of the model, like we’re doing using gradient descent.
Use the dev/validation split to tune the hyperparameters.
Use the test split to evaluate the performance of the model at the end. We should rarely use the test split to avoid overfitting to it and learning from it.
# Build dataset
block_size = 3 # context length: how many characters do we take to predict the next one?
def build_dataset(words):
    X, Y = [], []
    for w in words:
        # print(w)
        context = [0] * block_size
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            # print(''.join(itos[i] for i in context), '--->', itos[ix])
            context = context[1:] + [ix] # crop and append
    X = torch.tensor(X)
    Y = torch.tensor(Y)
    print(X.shape, Y.shape)
    return X, Y
import random
random.seed(42)
random.shuffle(words)
# Get n1 and n2 to help us split up the words into 3 splits.
n1 = int(0.8*len(words))
n2 = int(0.9*len(words))
# Split up data into 3 splits:
# - 1. training split - 80%
# - 2. dev/validation split - 10%
# - 3. test split - 10%
Xtr, Ytr = build_dataset(words[:n1]) # 80%
Xdev, Ydev = build_dataset(words[n1:n2]) # 10%
Xte, Yte = build_dataset(words[n2:]) # 10%
# MLP revisited (from start of "building makemore part 3")
vocab_size = len(itos) # 27: 26 letters + '.'
n_embd = 10 # the dimensionality of the character embedding vectors
n_hidden = 200 # the number of neurons in the hidden layer of the MLP
g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((vocab_size, n_embd), generator=g)
W1 = torch.randn((n_embd * block_size, n_hidden), generator=g)
b1 = torch.randn(n_hidden, generator=g)
W2 = torch.randn((n_hidden, vocab_size), generator=g)
b2 = torch.randn(vocab_size, generator=g)
# BatchNorm parameters
bngain = torch.ones((1, n_hidden))
bnbias = torch.zeros((1, n_hidden))
bnmean_running = torch.zeros((1, n_hidden))
bnstd_running = torch.ones((1, n_hidden))
parameters = [C, W1, b1, W2, b2]
print(sum(p.nelement() for p in parameters)) # number of parameters in total
for p in parameters:
p.requires_grad = True
max_steps = 200000
batch_size = 32
lossi = []
for i in range(max_steps):
    # minibatch construct
    ix = torch.randint(0, Xtr.shape[0], (batch_size,), generator=g)
    Xb, Yb = Xtr[ix], Ytr[ix] # batch X,Y
    # forward pass
    emb = C[Xb] # embed the characters into vectors
    embcat = emb.view(emb.shape[0], -1) # concatenate the vectors
    hpreact = embcat @ W1 + b1 # hidden layer pre-activation
    h = torch.tanh(hpreact) # hidden layer
    logits = h @ W2 + b2 # output layer
    loss = F.cross_entropy(logits, Yb) # loss function
    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()
    # update
    lr = 0.1 if i < 100000 else 0.01 # step learning rate decay
    for p in parameters:
        p.data += -lr * p.grad
    # track stats
    if i % 10000 == 0: # print every once in a while
        print(f'{i:7d}/{max_steps:7d}: {loss.item():.4f}')
    lossi.append(loss.log10().item())
# Evaluate
@torch.no_grad() # this decorator disables gradient tracking
def split_loss(split):
    x, y = {
        'train': (Xtr, Ytr),
        'val': (Xdev, Ydev),
        'test': (Xte, Yte),
    }[split]
    emb = C[x] # (N, block_size, n_embd)
    embcat = emb.view(emb.shape[0], -1) # concat into (N, block_size * n_embd)
    h = torch.tanh(embcat @ W1 + b1) # (N, n_hidden)
    logits = h @ W2 + b2 # (N, vocab_size)
    loss = F.cross_entropy(logits, y)
    print(split, loss.item())
split_loss('train')
split_loss('val')
# Sample from the model
g = torch.Generator().manual_seed(2147483647 + 10)
for _ in range(20):
    out = []
    context = [0] * block_size # initialize with all ...
    while True:
        # forward pass the neural net
        emb = C[torch.tensor([context])] # (1, block_size, n_embd)
        h = torch.tanh(emb.view(1, -1) @ W1 + b1)
        logits = h @ W2 + b2
        probs = F.softmax(logits, dim=1)
        # sample from the distribution
        ix = torch.multinomial(probs, num_samples=1, generator=g).item()
        # shift the context window and track the samples
        context = context[1:] + [ix]
        out.append(ix)
        # if we sample the special '.' token, break
        if ix == 0:
            break
    print(''.join(itos[i] for i in out)) # decode and print the generated word
Underfitting - the loss for the training split is similar to the loss for the dev/validation split. Likely there are too few parameters and the model isn’t powerful enough to fit the training data.
Iterating the model:
Watch the YouTube video: Building makemore Part 3: Activations & Gradients, BatchNorm
Fixing problems from the previous MLP model.
Symptom: the loss curve looks like a hockey stick, falling from a very high initial value into a tighter range.
We’d expect the loss for the newly initialized model with 27 characters to be -torch.tensor(1/27.0).log() == 3.2958
. Instead we’re seeing ~27. Something is wrong with the initialization.
When the network is initialized, we want the logits to be roughly zero, so the predicted distribution starts out close to uniform. To debug, insert break after the first pass of gradient descent and inspect logits[0]. We see extreme values there.
Solution: multiply the weights by 0.01 and the bias by 0 so the initial logits are closer to 0.
W2 = torch.randn((n_hidden, vocab_size), generator=g) * 0.01 # Fix to get initial logits closer to 0
b2 = torch.randn(vocab_size, generator=g) * 0 # Fix to get initial logits closer to 0
Symptom: there are too many -1’s and 1’s coming out of the .tanh. It’s too saturated.
Help diagnose by using a histogram of the values:
# h = torch.tanh(hpreact) # hidden layer
# flattened with h.view(-1); split into 50 buckets/bins
# Figure 1.
plt.hist(h.view(-1).tolist(), 50);
# Figure 2
plt.figure(figsize=(20,10))
# will appear white if boolean (h.abs() > 0.99)) is true, ie when it's in the long tail
plt.imshow(h.abs() > 0.99, cmap='gray', interpolation='nearest')
If an entire column is white, then that’s a “dead neuron”: all of the examples land in the flat tails of tanh, so that neuron will never learn.
A saturated tanh “stops” backpropagation, because its local gradient is near zero.
Solution: similar to above, tighten the range of the initial values, this time in W1 and b1, by multiplying by 0.1, 0.01, or some other small factor.
W1 = torch.randn((n_embd * block_size, n_hidden), generator=g) * 0.2
b1 = torch.randn(n_hidden, generator=g) * 0.01
Instead of just guessing at values like 0.2, we can use Kaiming initialization:
W1 = torch.randn((n_embd * block_size, n_hidden), generator=g) * (5/3)/((n_embd * block_size)**0.5) # kaiming init
b1 = torch.randn(n_hidden, generator=g) * 0.01
We can make the hidden state pre-activations unit Gaussian by standardizing them. We only want to force this at initialization, which is why batch norm adds a learnable gain and bias that let the distribution move during training.
It happens to have a regularizing effect and stabilizes training. People are trying to remove batch normalization, but it works quite well – it is quite effective at controlling activations and their distributions.
The batch normalization layer has its own bias, so there’s no need for a bias (e.g. + b1) in the layer before it.
Summary:
for i in range(max_steps):
    # minibatch construct
    # ...
    # forward pass
    emb = C[Xb] # embed the characters into vectors
    embcat = emb.view(emb.shape[0], -1) # concatenate the vectors
    # Linear layer
    hpreact = embcat @ W1 #+ b1 # hidden layer pre-activation
    # BatchNorm layer
    # -------------------------------------------------------------
    bnmeani = hpreact.mean(0, keepdim=True)
    bnstdi = hpreact.std(0, keepdim=True)
    hpreact = bngain * (hpreact - bnmeani) / bnstdi + bnbias
    with torch.no_grad():
        bnmean_running = 0.999 * bnmean_running + 0.001 * bnmeani
        bnstd_running = 0.999 * bnstd_running + 0.001 * bnstdi
    # -------------------------------------------------------------
    # Non-linearity
    h = torch.tanh(hpreact) # hidden layer
    logits = h @ W2 + b2 # output layer
    loss = F.cross_entropy(logits, Yb) # loss function
    # backward pass ...
    # update ...
    # track stats ...
In our code, we’ll mimic the PyTorch API. See class Linear:, class BatchNorm1d:, etc. here: https://github.com/BrianSigafoos/makemore/blob/4d51f23df80f6757809884a9d2e2888c835507f4/makemore_part3_bn.ipynb#L4
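Here is a condensed sketch of those Linear and BatchNorm1d classes as built in the lecture (trimmed for brevity; the full version in the linked notebook also threads through a random generator and a training flag):
import torch

class Linear:
    def __init__(self, fan_in, fan_out, bias=True):
        self.weight = torch.randn((fan_in, fan_out)) / fan_in**0.5  # kaiming-style scaling
        self.bias = torch.zeros(fan_out) if bias else None

    def __call__(self, x):
        self.out = x @ self.weight
        if self.bias is not None:
            self.out += self.bias
        return self.out

    def parameters(self):
        return [self.weight] + ([] if self.bias is None else [self.bias])


class BatchNorm1d:
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.training = True
        # parameters (trained with backprop)
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)
        # buffers (updated with a running "momentum update", outside of backprop)
        self.running_mean = torch.zeros(dim)
        self.running_var = torch.ones(dim)

    def __call__(self, x):
        if self.training:
            xmean = x.mean(0, keepdim=True)  # batch mean
            xvar = x.var(0, keepdim=True)    # batch variance
        else:
            xmean = self.running_mean
            xvar = self.running_var
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)  # normalize to unit variance
        self.out = self.gamma * xhat + self.beta
        if self.training:
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]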
Diagnostic visualizations to understand whether the neural network is in a good state:
Activation distribution from forward pass
plt.figure(figsize=(20, 4)) # width and height of the plot
legends = []
for i, layer in enumerate(layers[:-1]): # note: exclude the output layer
    if isinstance(layer, Tanh):
        t = layer.out
        print('layer %d (%10s): mean %+.2f, std %.2f, saturated: %.2f%%' % (i, layer.__class__.__name__, t.mean(), t.std(), (t.abs() > 0.97).float().mean()*100))
        hy, hx = torch.histogram(t, density=True)
        plt.plot(hx[:-1].detach(), hy.detach())
        legends.append(f'layer {i} ({layer.__class__.__name__})')
plt.legend(legends);
plt.title('activation distribution')
Gradient distribution from backward pass
plt.figure(figsize=(20, 4)) # width and height of the plot
legends = []
for i, layer in enumerate(layers[:-1]): # note: exclude the output layer
    if isinstance(layer, Tanh):
        t = layer.out.grad
        print('layer %d (%10s): mean %+f, std %e' % (i, layer.__class__.__name__, t.mean(), t.std()))
        hy, hx = torch.histogram(t, density=True)
        plt.plot(hx[:-1].detach(), hy.detach())
        legends.append(f'layer {i} ({layer.__class__.__name__})')
plt.legend(legends);
plt.title('gradient distribution')
Weights gradient distribution from stochastic gradient descent
# visualize histograms
plt.figure(figsize=(20, 4)) # width and height of the plot
legends = []
for i, p in enumerate(parameters):
    t = p.grad
    if p.ndim == 2:
        print('weight %10s | mean %+f | std %e | grad:data ratio %e' % (tuple(p.shape), t.mean(), t.std(), t.std() / p.std()))
        hy, hx = torch.histogram(t, density=True)
        plt.plot(hx[:-1].detach(), hy.detach())
        legends.append(f'{i} {tuple(p.shape)}')
plt.legend(legends)
plt.title('weights gradient distribution');
Updates distribution from stochastic gradient descent
plt.figure(figsize=(20, 4))
legends = []
for i, p in enumerate(parameters):
    if p.ndim == 2:
        plt.plot([ud[j][i] for j in range(len(ud))])
        legends.append('param %d' % i)
plt.plot([0, len(ud)], [-3, -3], 'k') # these ratios should be ~1e-3, indicate on plot
plt.legend(legends);
The `ud` (update-to-data ratio) stats are tracked inside the gradient descent loop:
# Inside gradient descent
ud = []
for i in range(max_steps):
    # ...
    # track stats ...
    # if i % 10000 == 0: # print every once in a while
    #     print(f'{i:7d}/{max_steps:7d}: {loss.item():.4f}')
    # lossi.append(loss.log10().item())
    with torch.no_grad():
        ud.append([((lr*p.grad).std() / p.data.std()).log10().item() for p in parameters])
When creating a deep neural network, the “motif” to stack up for the “forward” pass is: Linear layer → BatchNorm layer → non-linearity (e.g. tanh), repeated.
Watch the video on YouTube Building makemore Part 4: Becoming a Backprop Ninja and follow along in the Google Colab notebook.
It’s worth doing as much as possible yourself to help understand what goes on under the hood when calling .backward
in PyTorch.
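As a tiny warm-up of the same idea (my own example, not from the notebook), you can backpropagate a small expression by hand with the chain rule and check it against what .backward computes:
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(-3.0, requires_grad=True)
z = torch.tanh(x * y + 1.0)
z.backward()  # autograd fills in x.grad and y.grad

# manual chain rule: with q = x*y + 1, dz/dx = (1 - tanh(q)^2) * y and dz/dy = (1 - tanh(q)^2) * x
q = x.detach() * y.detach() + 1.0
dz_dq = 1.0 - torch.tanh(q) ** 2
print(torch.allclose(x.grad, dz_dq * y.detach()))  # True
print(torch.allclose(y.grad, dz_dq * x.detach()))  # True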
Watch the video on YouTube: Building makemore Part 5: Building a WaveNet
Implements a convolutional neural network architecture similar to WaveNet (2016 paper, DeepMind).
We want to make the network deeper. And at each level we want to fuse two consecutive elements.
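A sketch of the FlattenConsecutive layer that groups n consecutive embeddings together (the dimensions below are just for illustration):
import torch

class FlattenConsecutive:
    def __init__(self, n):
        self.n = n  # how many consecutive elements to fuse

    def __call__(self, x):
        B, T, C = x.shape
        x = x.view(B, T // self.n, C * self.n)  # e.g. (32, 8, 10) -> (32, 4, 20)
        if x.shape[1] == 1:
            x = x.squeeze(1)  # drop the leftover singleton time dimension
        self.out = x
        return self.out

    def parameters(self):
        return []

x = torch.randn(32, 8, 10)
print(FlattenConsecutive(2)(x).shape)  # torch.Size([32, 4, 20])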
Debug code to view the name and shape of each layer:
# Layer inspector - code snippet to help debug
for layer in model.layers:
print(layer.__class__.__name__, ':', tuple(layer.out.shape))
# For example:
# Embedding : (32, 8, 10)
# FlattenConsecutive : (32, 80)
# Linear : (32, 200)
# BatchNorm1d : (32, 200)
# Tanh : (32, 200)
# Linear : (32, 27)
We can switch to using torch.nn moving forward, now that this video gave us a better understanding of how to stack layers and mimic the PyTorch API.
What’s the development process like for building a deep neural network?
Future:
Watch the video on YouTube: Let’s build GPT: from scratch, in code, spelled out
Typically sub-word tokenizers have large vocabularies (~50k tokens), like GPT-2’s byte-pair encoding (BPE) tokenizer or Google’s SentencePiece.
Instead, we’ll have 65 characters simply encoded at the character level to keep this example simple.
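The character-level encoder/decoder from the lecture looks roughly like this (a sketch; it assumes text holds the tiny Shakespeare dataset read from input.txt):
text = open('input.txt', 'r').read()  # tiny Shakespeare dataset

# build the vocabulary of characters that occur in the text
chars = sorted(list(set(text)))
vocab_size = len(chars)  # 65 for tiny Shakespeare
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]          # string -> list of integers
decode = lambda l: ''.join(itos[i] for i in l)   # list of integers -> string

print(encode("hii there"))
print(decode(encode("hii there")))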
In this example, the later nodes get information from the earlier nodes but never from the future nodes.
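A small sketch of how that is enforced in the lecture’s attention code: a lower-triangular mask sets scores to -inf wherever a position would look at a future position, so the softmax gives those connections zero weight (the values below are illustrative):
import torch
import torch.nn.functional as F

T = 4                                # number of tokens in the block
wei = torch.randn(T, T)              # raw attention scores (illustrative values)
tril = torch.tril(torch.ones(T, T))  # lower-triangular matrix of ones
wei = wei.masked_fill(tril == 0, float('-inf'))  # block attention to the future
wei = F.softmax(wei, dim=-1)         # each row is a probability distribution
print(wei)                           # upper triangle is 0: no information flows from future tokens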
“Attention can be applied to any arbitrary directed graph. Attention is just a communication mechanism between the nodes. … An attention is just a set of vectors out there in space. They communicate. And if you want them to have a notion of space, you need to specifically add it.” “This is why we need to positionally encode tokens.” (see notes in Colab notebook)
“Attention supports arbitrary connectivity between nodes.”
Using “dropout” is like training an ensemble of subnetworks because during training connections are randomly dropped. More from the paper Dropout: a simple way to prevent neural networks from overfitting:
Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different “thinned” networks.
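A quick illustration (my own, not from the notebook) of dropout’s train vs. eval behavior with torch.nn.Dropout:
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))  # roughly half the units zeroed, survivors scaled by 1/(1-p) = 2.0

drop.eval()
print(drop(x))  # identity at eval time: all ones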
PyTorch docs
PyTorch allows a tensor to be a View of an existing tensor. View tensor shares the same underlying data with its base tensor. Supporting View avoids explicit data copy, thus allows us to do fast and memory efficient reshaping, slicing and element-wise operations.
t = torch.arange(8)
print(t)
print(t.view(2, 4))
# Output:
# tensor([0, 1, 2, 3, 4, 5, 6, 7])
# tensor([[0, 1, 2, 3],
# [4, 5, 6, 7]])
F.cross_entropy - docs. Instead of hand-rolling the commented-out code below, calling F.cross_entropy makes the forward and backward passes much more efficient.
# counts = logits.exp()
# prob = counts / counts.sum(1, keepdims=True)
# loss = -prob[torch.arange(32), Y].log().mean()
# Replaced efficiently with:
F.cross_entropy(logits, Y)
Github repos:
Articles:
More: