ResNet on MNIST from scratch
I have been using variants of ResNet a lot, such as resnet18 or resnet34, but I never knew how they are structured. After reading chapter 14 of the fastbook, I tried to understand what is going on under the hood by experimenting with everything I learned.
In chapter 14, the fastbook walks through writing ResNet from scratch. The book uses Imagenette, but we will use the MNIST dataset instead. Also, we will only train for 2 epochs instead of 5. Let's find out how far we can get with 2 epochs.
!pip install -Uqq fastbook
import fastbook
from fastai.vision.all import *
When experimenting, it is a good idea to start from a simple dataset. We save time and resources this way. When something works well on an easy dataset, we can move on to a slightly more complex dataset, and so on. That is why we will use the MNIST handwritten digits dataset.
Let's briefly look at how the data is divided into training and testing sets with their respective labels.
path = untar_data(URLs.MNIST)
path.ls()
(path/'training').ls()
Then we can build a DataLoaders. With get_data, we can resize our images to anything we want and get a DataLoaders back, which lets us explore different sizes. Let's try training with 28x28 pixel images, which are the full size, and with 14x14 pixel images.
Generally, we can expect better results from higher-resolution images, even though they take longer to train on. Let's take a look at our pictures first.
def get_data(resize=28):
    "Return dataloaders from MNIST dataset"
    return DataBlock(
        blocks=(ImageBlock(PILImageBW), CategoryBlock),
        get_items=get_image_files,
        splitter=GrandparentSplitter(train_name='training', valid_name='testing'),
        get_y=parent_label,
        item_tfms=Resize(resize)
    ).dataloaders(untar_data(URLs.MNIST), bs=256)
dls = get_data()
dls.show_batch()
Let's try size 14. It is harder to see, but we can still tell what the digits are.
dls_14 = get_data(14)
dls_14.show_batch()
We can look at the shape of each batch by grabbing one batch from each DataLoaders. Each batch has the shape [batch_size, channels_in, height, width]. Because we are working with black-and-white images, channels_in is 1 instead of 3.
xb, yb = dls.one_batch()
xb.shape, yb.shape
xb_14, yb_14 = dls_14.one_batch()
xb_14.shape, yb_14.shape
Before we jump into ResNet, let's build a baseline with linear layers first. We can compare it with ResNet later and see how ResNet performs on the MNIST dataset.
model1 = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28*28, 59),
    nn.ReLU(),
    nn.Linear(59, 10))
As we did with get_data(), we will define get_learner(), which returns a Learner with our dataloaders and an accuracy metric. I am just using a CPU, but we can definitely use a GPU as well. When training on a GPU, we use mixed precision with to_fp16(). (I found that a GPU takes longer to train shallow models, possibly because moving data into GPU memory and back costs more than the computation saves compared to simply using the CPU. That is not the case with the deeper models we will use later on.)
def get_learner(m, dls=dls):
    return Learner(dls, m, metrics=accuracy, loss_func=nn.CrossEntropyLoss()).to_fp16()
learn = get_learner(model1)
learn.lr_find()
learn.fit_one_cycle(2, 1e-3)
Okay, let's try training on the resized dataset. With fewer pixels, training should be faster.
model2 = nn.Sequential(
    nn.Flatten(),
    nn.Linear(14*14, 59),
    nn.ReLU(),
    nn.Linear(59, 10))
learn = get_learner(model2, dls=dls_14)
learn.lr_find()
learn.fit_one_cycle(2, 1e-3)
Smaller images did not save us any time, and we got worse performance, as expected. Therefore, we will just use the full-sized dataset.
Let's try adding a dropout and see what happens.
model1_d = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28*28, 100),
    nn.Dropout(),
    nn.ReLU(),
    nn.Linear(100, 10))
learn = get_learner(model1_d)
learn.fit_one_cycle(2, 1e-3)
Adding dropout does not improve our performance.
Let's try convolutional layers now. We will start with a simple stack of convolutional layers using ConvLayer. We set the stride to 2 so that each layer halves the spatial resolution while extracting the important features that will help us gain better performance.
To make the code easier to read, we will refactor the convolutional layer into a block function. That way we can swap block for a different layer later and use get_model() to build our model without writing out nn.Sequential with all the components every time.
def block(ni, nf):
    return ConvLayer(ni, nf, stride=2)

def get_model():
    return nn.Sequential(
        block(1, 8),
        block(8, 16),
        block(16, 32),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(32, dls.c)
    )
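To see what stride 2 does, we can trace a dummy batch through the model and print the activation shape after each layer. This is just a quick sanity check; the fake input assumes 28x28 single-channel images like ours:
m = get_model()
x = torch.randn(1, 1, 28, 28)   # one fake black-and-white 28x28 image
for layer in m:
    x = layer(x)
    print(layer.__class__.__name__, tuple(x.shape))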
learn = get_learner(get_model())
learn.lr_find()
learn.fit_one_cycle(2, 1e-2)
This is amazing: a huge jump from the linear layers. However, we can still improve our results a lot.
With BatchZero, we can use a higher learning rate. Applied to the last layer of a block, it initializes the batchnorm weights to zero, so the block initially contributes nothing; once we add skip connections, this means each block starts out as a true identity path, which helps training. Let's try it out and see whether it is true.
def conv_block(ni, nf, stride=2, norm=NormType.Batch, last_layer=False):
    if last_layer:
        norm = NormType.BatchZero
    return ConvLayer(ni, nf, stride=stride, norm_type=norm)
conv_block(2, 3)
conv_block(3, 1, last_layer=True)
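We can peek at what BatchZero actually changes: the batchnorm weights (gamma) of the BatchZero layer start at zero, so that layer initially outputs nothing. This is just a quick sanity check, assuming fastai's default conv, norm, act ordering inside ConvLayer:
# gamma of a normal block vs. a BatchZero block: initialized to ones vs. zeros
conv_block(3, 1)[1].weight, conv_block(3, 1, last_layer=True)[1].weight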
def block(ni, nf):
    return conv_block(ni, nf)
learn = get_learner(get_model())
learn.lr_find()
learn.summary()
Now let's try with NormType.BatchZero.
def block(ni, nf):
    return conv_block(ni, nf, last_layer=True)
learn2 = get_learner(get_model())
learn2.lr_find()
We do get to use a higher learning rate! Let's try training with the suggested learning rates: 5e-3 with a normal Batch norm vs. 3e-2 with BatchZero is a big difference. This time, we will use more epochs to explore what is going on.
learn.fit_one_cycle(7, 5e-3)
learn2.fit_one_cycle(7, 3e-2)
It does not look like it is training very well, though. It seems like there is no point in using BatchZero, but we will see later on that it does help us train.
What happens if we add dropout?
model3 = nn.Sequential(conv_block(1, 8),
                       nn.Dropout2d(),
                       conv_block(8, 16, stride=2),
                       nn.Dropout2d(),
                       conv_block(16, 32, stride=2),
                       nn.Dropout2d(),
                       conv_block(32, 64, last_layer=True),
                       nn.Flatten())  # flattens the final 64 x 2 x 2 activations into 256 outputs
learn = Learner(dls, model3, metrics=accuracy)
learn.lr_find()
learn.fit_one_cycle(5, 1e-3)
It is not training very well. The validation loss decreases more quickly than the training loss; after five epochs, the validation loss is about half of the training loss. It might be because there is too much regularization.
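If too much regularization really is the problem, one hypothetical tweak (not something the book tries, and I have not tuned it) would be to keep the dropout layers but lower their probability from the default of 0.5:
nn.Dropout2d(p=0.1)   # gentler than the default nn.Dropout2d(), which uses p=0.5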
Let's try using ResBlock, the building block of ResNet. The idea behind ResNet is that when we stack too many layers, gradients can vanish or explode, which makes the model hard to train.
Therefore, with deeper models, we pass an identity mapping, the activations from before the convolutional layers, on to the next layer along with the activations that just came out of the convolutional layers. Basically, it is y = x + conv(x), where y is the input to the next layer, x is the input from the previous layer, and conv() is a convolutional layer. When conv(x) is 0, we simply skip the convolutional layer.
In other words, with conv(x), the model only has to predict the residual y - x. We can picture this as building a tall tower by stacking up bricks. The higher we get, the more unstable the tower becomes and the harder it is to place bricks on top. Now imagine we have magnetic bricks that snap roughly into the right spot, and all we have to do is make a small adjustment to each brick. Building the tower becomes much easier. That is what ResNet is doing.
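Before looking at the fastai-style version below, here is a minimal sketch of that idea in plain PyTorch. MinimalResBlock is a made-up toy class that only handles the easy case (same number of channels in and out, stride 1), just to show the skip connection:
class MinimalResBlock(nn.Module):
    "Toy residual block: y = x + conv(x)"
    def __init__(self, nf):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(nf, nf, 3, padding=1), nn.ReLU(),
            nn.Conv2d(nf, nf, 3, padding=1))

    def forward(self, x):
        return F.relu(x + self.conv(x))   # the skip connection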
def _conv_block(ni, nf, stride=2):
    return nn.Sequential(
        ConvLayer(ni, nf, stride=stride),
        ConvLayer(nf, nf, act_cls=None, norm_type=NormType.BatchZero))

class ResBlock(Module):
    def __init__(self, ni, nf, stride=2):
        self.conv = _conv_block(ni, nf, stride=stride)
        self.id_conv = noop if ni == nf else ConvLayer(ni, nf, ks=1)
        self.pool = noop if stride == 1 else nn.AvgPool2d(2, ceil_mode=True)

    def forward(self, x):
        return F.relu(self.conv(x) + self.id_conv(self.pool(x)))
def block(ni, nf):
    return ResBlock(ni, nf)
learn = get_learner(get_model())
learn.lr_find()
learn.fit_one_cycle(2, 3e-2)
With ResBlock, we reached 98% accuracy with only 2 epochs! This is an amazing result, and we could keep training for an even better one. However, we still have more tricks left: with a resnet_stem and bottleneck layers, we can improve our result further.
def _resnet_stem(*sizes):
    return [
        ConvLayer(sizes[i], sizes[i+1], stride = 2 if i==0 else 1)
        for i in range(len(sizes)-1)
    ] + [nn.MaxPool2d(kernel_size=3, stride=2, padding=1)]
_resnet_stem(2, 3, 4, 5)
class ResNet(nn.Sequential):
    def __init__(self, n_out, layers, expansion=1):
        stem = _resnet_stem(1, 32, 32, 64)
        self.block_szs = [64, 64, 128, 256, 512]
        for i in range(1, 5): self.block_szs[i] *= expansion
        blocks = [self._make_layer(*o) for o in enumerate(layers)]
        super().__init__(*stem, *blocks,
                         nn.AdaptiveAvgPool2d(1), Flatten(),
                         nn.Linear(self.block_szs[-1], n_out))

    def _make_layer(self, idx, n_layers):
        stride = 1 if idx==0 else 2
        ch_in, ch_out = self.block_szs[idx:idx+2]
        return nn.Sequential(*[
            ResBlock(ch_in if i==0 else ch_out, ch_out, stride if i==0 else 1)
            for i in range(n_layers)
        ])
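Since this post started with resnet18 and resnet34: in this class they correspond to layers=[2, 2, 2, 2] and layers=[3, 4, 6, 3] respectively. Here is a quick sketch comparing their parameter counts (the numbers will not match torchvision's models exactly, since our stem takes single-channel input and we use our own ResBlock):
rn18 = ResNet(dls.c, [2, 2, 2, 2])   # resnet18-style: 2 ResBlocks per stage
rn34 = ResNet(dls.c, [3, 4, 6, 3])   # resnet34-style: 3, 4, 6, 3 ResBlocks per stage
sum(p.numel() for p in rn18.parameters()), sum(p.numel() for p in rn34.parameters())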
learn = get_learner(ResNet(dls.c, [2, 2, 2, 2]))
learn.summary()
learn.lr_find()
learn.fit_one_cycle(2, 8e-4)
We approached 99% accuracy! Let's try label smoothing and see how far we can go. The real power of label smoothing comes with many epochs.
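As a quick reminder of what label smoothing does (with fastai's default eps=0.1): instead of a hard one-hot target, the correct class gets 1 - eps + eps/c and every other class gets eps/c, so the model is never pushed toward infinitely confident predictions. A minimal sketch comparing it to plain cross entropy on made-up logits:
preds = torch.randn(4, 10)   # fake logits for 4 samples, 10 classes
targs = tensor([0, 1, 2, 3])
nn.CrossEntropyLoss()(preds, targs), LabelSmoothingCrossEntropy()(preds, targs)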
learn = Learner(dls, get_model(), loss_func=LabelSmoothingCrossEntropy(),
                metrics=accuracy)
learn.lr_find()
learn.fit_one_cycle(15, 1e-2)
That is a good result. Let's try training with normal batchnorm instead of BatchZero. Can we get a better result?
def _conv_block_bn(ni, nf, stride=2):
    return nn.Sequential(
        ConvLayer(ni, nf, stride=stride),
        ConvLayer(nf, nf, act_cls=None))

class ResBlock_bn(Module):
    def __init__(self, ni, nf, stride=2):
        self.conv = _conv_block_bn(ni, nf, stride=stride)
        self.id_conv = noop if ni == nf else ConvLayer(ni, nf, ks=1)
        self.pool = noop if stride == 1 else nn.AvgPool2d(2, ceil_mode=True)

    def forward(self, x):
        return F.relu(self.conv(x) + self.id_conv(self.pool(x)))
def block(ni, nf):
    return ResBlock_bn(ni, nf)
learn = get_learner(ResNet(dls.c, [2,2,2,2]))
learn.lr_find()
learn.fit_one_cycle(5, 1e-4)
Without BatchZero, we cannot get a result as good as before.
With bottleneck layers, we use three conv layers instead of the usual two, but the first and third use a kernel size of 1. In exchange, a bottleneck layer gives us four times as many features as a normal conv block. Because 1x1 kernels are much cheaper to compute, the overall cost stays about the same as two normal 3x3 conv layers, so we get roughly four times the features for the same computing power.
def _conv_block(ni, nf, stride=2):
    return nn.Sequential(
        ConvLayer(ni, nf//4, ks=1, stride=stride),
        ConvLayer(nf//4, nf//4, ks=3),
        ConvLayer(nf//4, nf, ks=1, act_cls=None, norm_type=NormType.BatchZero))
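As a rough check on the "same compute" claim, we can compare parameter counts (a crude proxy for compute) of a regular two-conv block that keeps 64 features with a bottleneck block that outputs 4 times as many. These sizes are made up just for the comparison:
def n_params(m): return sum(p.numel() for p in m.parameters())

# regular block with 64 features vs. bottleneck block with 4x the features
n_params(_conv_block_bn(64, 64)), n_params(_conv_block(64, 256))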
learn = get_learner(ResNet(dls.c, [2, 2, 2, 2], 4))
A summary of learn gives us a good idea of what is going on.
learn.summary()
learn.lr_find()
learn.fit_one_cycle(2, 1e-4)
Contrary to the book, bottleneck layers did not give us better results. It might be because our model is not deep enough, or because we only trained for 2 epochs. So let's just train for more epochs and find out.
learn.fit_one_cycle(8, 3e-4)
It is a good result, but not as good as the one without bottleneck layers. It might not be a good idea to use them in a model as shallow as resnet18.
Here is my version of top_5_accuracy. It seems to work okay.
t = tensor([1, 2, 3, 0, 5, 4, 6, 7])
t.sort(descending=True)
t.sort(descending=True)[0][:3]
def top_5_accuracy(inp, targ, axis=-1):
    acc = 0
    for _inp, _targ in zip(inp, targ):
        # sort the predictions and count a hit if the target is among the top 5
        items, index = _inp.sort(descending=True, dim=axis)
        top5 = index[:5]
        if _targ in top5: acc += 1
    return acc / len(inp)
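For comparison, fastai ships a built-in top_k_accuracy metric. Here is a quick check on made-up logits that the two agree (assuming the fastai version in use exports top_k_accuracy, as recent versions do):
inp = torch.randn(32, 10)            # fake logits
targ = torch.randint(0, 10, (32,))   # fake labels
top_5_accuracy(inp, targ), top_k_accuracy(inp, targ, k=5)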
learn = Learner(dls, model3, metrics=[accuracy, top_5_accuracy])
learn.fit_one_cycle(3, 1e-3)
We went over convolutional layers and ResNet. Compared to linear layers, they give better results because they make our models easier to train, and the easier it is for a model to learn, the better performance we get. A ResBlock essentially gives the model a shortcut, the identity path, to follow. What other ways can we try for better results?