I have been using variants of ResNet a lot, such as resnet18 or resnet34, but I never knew how they were structured. After reading chapter 14 of the fastbook, I tried to understand what is going on under the hood by experimenting with everything I learned.

In chapter 14, the fastbook shows how to write ResNet from scratch. The book uses Imagenette, but we will use the MNIST dataset instead. Also, we will train for only 2 epochs instead of 5. Let's find out how far we can go with 2 epochs.

!pip install -Uqq fastbook
import fastbook
from fastai.vision.all import *

When experimenting, it is a good idea to start with a simple dataset; we save time and resources this way. Once things work well on an easy dataset, we can move on to a slightly more complex one, and so on. That is why we will use the MNIST handwritten digit dataset.

Let's briefly look at how the data is divided into training and testing sets, with one folder per label.

path = untar_data(URLs.MNIST)
path.ls()
(#2) [Path('/root/.fastai/data/mnist_png/training'),Path('/root/.fastai/data/mnist_png/testing')]
(path/'training').ls()
(#10) [Path('/root/.fastai/data/mnist_png/training/0'),Path('/root/.fastai/data/mnist_png/training/6'),Path('/root/.fastai/data/mnist_png/training/5'),Path('/root/.fastai/data/mnist_png/training/9'),Path('/root/.fastai/data/mnist_png/training/3'),Path('/root/.fastai/data/mnist_png/training/4'),Path('/root/.fastai/data/mnist_png/training/2'),Path('/root/.fastai/data/mnist_png/training/8'),Path('/root/.fastai/data/mnist_png/training/1'),Path('/root/.fastai/data/mnist_png/training/7')]

Then we can build DataLoaders. With get_data, we can resize the images to any size we want and get back DataLoaders, which lets us explore different resolutions. Let's try training on 28x28 pixel images, which are full size, and on 14x14 pixel images.

Generally, we can expect better results from higher-resolution images, even though they take longer to train on. Let's take a look at our pictures first.

def get_data(resize=28):
    "Return dataloaders from MNIST dataset"
    return DataBlock(
        blocks=(ImageBlock(PILImageBW), CategoryBlock),
        get_items=get_image_files,
        splitter=GrandparentSplitter(train_name='training', valid_name='testing'),
        get_y=parent_label,
        item_tfms=Resize(resize)
    ).dataloaders(untar_data(URLs.MNIST), bs=256)
dls = get_data()
dls.show_batch()

Let's try size 14. The digits are harder to make out, but we can still tell what they are.

dls_14 = get_data(14)
dls_14.show_batch()

We can look at the shape of each batch by grabbing one batch from each DataLoaders. Each batch has the shape [batch_size, channels_in, height, width]. Because we are working with black-and-white images, channels_in is 1 instead of 3.

xb, yb = dls.one_batch()
xb.shape, yb.shape
(torch.Size([256, 1, 28, 28]), torch.Size([256]))
xb_14, yb_14 = dls_14.one_batch()
xb_14.shape, yb_14.shape
(torch.Size([256, 1, 14, 14]), torch.Size([256]))

Baseline

Before we jump into ResNet, let's build a baseline with linear layers first. We can compare ResNet against it later and see how well it performs on the MNIST dataset.

model1 = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28*28, 59),
    nn.ReLU(),
    nn.Linear(59, 10))

As we did with get_data(), we will define get_learner(), which returns a Learner built from the model, the DataLoaders, and accuracy as the metric. I am just using a CPU here, but we can definitely use a GPU as well. When training on a GPU, we use mixed precision with to_fp16(). (I found that a GPU can take longer than a CPU to train shallow models; it might be that moving the data into GPU memory costs more than the computation saves. That is not the case with the deeper models we use later on.)

def get_learner(m, dls=dls):
    return Learner(dls, m, metrics=accuracy, loss_func=nn.CrossEntropyLoss()).to_fp16()
learn = get_learner(model1)
learn.lr_find()
SuggestedLRs(valley=0.002511886414140463)
learn.fit_one_cycle(2, 1e-3)
epoch train_loss valid_loss accuracy time
0 0.462859 0.339541 0.909200 01:49
1 0.314156 0.293297 0.918800 01:48

Okay. Let's try training on the resized dataset. With fewer pixels, training should be faster.

model2 = nn.Sequential(
    nn.Flatten(),
    nn.Linear(14*14, 59),
    nn.ReLU(),
    nn.Linear(59, 10))
learn = get_learner(model2, dls=dls_14)
learn.lr_find()
SuggestedLRs(valley=0.004365158267319202)
learn.fit_one_cycle(2, 1e-3)
epoch train_loss valid_loss accuracy time
0 0.776017 0.532420 0.874700 01:49
1 0.448006 0.420771 0.894200 01:48

Smaller images did not save us any time, and we got worse performance, as expected. Therefore, we will just use the full-sized dataset.

Let's try adding dropout and see what happens.

model1_d = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28*28, 100),
    nn.Dropout(),
    nn.ReLU(),
    nn.Linear(100, 10))
learn = get_learner(model1_d)
learn.fit_one_cycle(2, 1e-3)
epoch train_loss valid_loss accuracy time
0 0.556546 0.337871 0.910900 01:49
1 0.387193 0.290043 0.921600 01:48

Adding dropout does not improve our performance.

Convolutional Layers

Let's try convolutional layers now, starting with a simple stack of ConvLayers. We will set the stride to 2 so that each layer halves the spatial resolution while extracting the features that should give us better performance.
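
As a quick check (a minimal sketch on a batch of random noise), a stride-2 ConvLayer takes a 28x28 input down to 14x14:

conv = ConvLayer(1, 8, stride=2)          # 3x3 kernel with padding 1 by default
conv(torch.randn(64, 1, 28, 28)).shape    # torch.Size([64, 8, 14, 14]): height and width halved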

To make the code easier to read, we will refactor the convolutional layer into a block function. We can swap block for a different kind of layer later and call get_model() to build the model without writing out nn.Sequential with all the components each time.

def block(ni, nf):
    return ConvLayer(ni, nf, stride=2)
def get_model():
    return nn.Sequential(
        block(1, 8),        
        block(8, 16),       
        block(16, 32),      
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(32, dls.c)
)
learn = get_learner(get_model())
learn.lr_find()
SuggestedLRs(valley=0.0030199517495930195)
learn.fit_one_cycle(2, 1e-2)
epoch train_loss valid_loss accuracy time
0 0.411212 0.289458 0.920200 01:49
1 0.185604 0.168248 0.952300 01:49

This is amazing: a huge jump from the linear layers. And we can still improve our results a lot.

BatchZero

With NormType.BatchZero, we can use a higher learning rate. Applied to the last layer of a block, it initializes the batchnorm weights to zero, so each block starts out close to an identity path, which helps the model train. Let's try it out and see whether that is true.

def conv_block(ni, nf, stride=2, norm=NormType.Batch, last_layer=False):
    if last_layer: 
        norm = NormType.BatchZero
    return ConvLayer(ni, nf, stride=stride, norm_type=norm)
conv_block(2, 3)
ConvLayer(
  (0): Conv2d(2, 3, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  (1): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU()
)
conv_block(3, 1, last_layer=True)
ConvLayer(
  (0): Conv2d(3, 1, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  (1): BatchNorm2d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU()
)
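
The two printouts look identical; the difference is only in how the batchnorm weights are initialized. A quick check (indexing into the ConvLayer, which is just an nn.Sequential) shows that BatchZero starts the batchnorm scale at zero, so that layer initially contributes almost nothing:

conv_block(2, 3)[1].weight.data                    # ordinary batchnorm: tensor([1., 1., 1.])
conv_block(3, 1, last_layer=True)[1].weight.data   # BatchZero: tensor([0.])
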
def block(ni, nf):
    return conv_block(ni, nf)
learn = get_learner(get_model())
learn.lr_find()
SuggestedLRs(valley=0.00363078061491251)
learn.summary()
Sequential (Input shape: 256)
============================================================================
Layer (type)         Output Shape         Param #    Trainable 
============================================================================
                     256 x 8 x 14 x 14   
Conv2d                                    72         True      
BatchNorm2d                               16         True      
ReLU                                                           
____________________________________________________________________________
                     256 x 16 x 7 x 7    
Conv2d                                    1152       True      
BatchNorm2d                               32         True      
ReLU                                                           
____________________________________________________________________________
                     256 x 32 x 4 x 4    
Conv2d                                    4608       True      
BatchNorm2d                               64         True      
ReLU                                                           
AdaptiveAvgPool2d                                              
Flatten                                                        
____________________________________________________________________________
                     256 x 10            
Linear                                    330        True      
____________________________________________________________________________

Total params: 6,274
Total trainable params: 6,274
Total non-trainable params: 0

Optimizer used: <function Adam at 0x7f8b8a8894d0>
Loss function: CrossEntropyLoss()

Model unfrozen

Callbacks:
  - TrainEvalCallback
  - MixedPrecision
  - Recorder
  - ProgressCallback

Now let's try with NormType.BatchZero.

def block(ni, nf):
    return conv_block(ni, nf, last_layer=True)
learn2 = get_learner(get_model())
learn2.lr_find()
SuggestedLRs(valley=0.013182567432522774)

We do get to use a higher learning rate! Let's train each model with a rate close to its suggestion: 5e-3 with normal batchnorm vs. 3e-2 with BatchZero is a big difference. This time, we will use more epochs to explore what is going on.

learn.fit_one_cycle(7, 5e-3)
epoch train_loss valid_loss accuracy time
0 1.384760 0.964067 0.740000 01:50
1 0.339207 0.338183 0.898500 01:49
2 0.184407 0.162223 0.953900 01:49
3 0.137132 0.147511 0.957200 01:49
4 0.109026 0.112336 0.968300 01:47
5 0.100849 0.097497 0.971000 01:48
6 0.091943 0.096191 0.972200 01:49
learn2.fit_one_cycle(7, 3e-2)
epoch train_loss valid_loss accuracy time
0 2.093218 1.799008 0.331600 01:50
1 0.341182 1.041854 0.722100 01:47
2 0.211505 0.810533 0.779500 01:49
3 0.178218 0.320881 0.899200 01:49
4 0.153390 0.154975 0.951400 01:49
5 0.136555 0.128518 0.959500 01:49
6 0.124655 0.122785 0.962200 01:49

The BatchZero version does not look like it is training as well, though. It seems like there is no point in using BatchZero here, but we will see later on that it does help us train.

What happens if we add dropout?

model3 = nn.Sequential(conv_block(1, 8), 
                       nn.Dropout2d(),   
                       conv_block(8, 16, stride=2), 
                       nn.Dropout2d(),                    
                       conv_block(16, 32, stride=2),
                       nn.Dropout2d(),
                       conv_block(32, 64, last_layer=True),
                       nn.Flatten())
learn = Learner(dls, model3, metrics=accuracy)
learn.lr_find()
SuggestedLRs(valley=0.0014454397605732083)
learn.fit_one_cycle(5, 1e-3)
epoch train_loss valid_loss accuracy time
0 5.419976 5.314860 0.518500 01:50
1 4.703518 4.482738 0.747200 01:50
2 4.098514 3.907110 0.782100 01:49
3 3.798431 3.621009 0.790300 01:49
4 3.721098 3.586130 0.788500 01:49

It is not training very well. The valid loss decreases faster than the train loss and stays below it for all five epochs. It might be because there is too much regularization.

ResBlock

Let's try using ResBlock, the building block of ResNet. The motivation behind ResNet is that when we stack too many layers, gradients can vanish or explode, which makes the model hard to train.

Therefore, in deeper models we add an identity mapping, the activations from just before the convolutional layers, to the activations that come out of those layers before passing them on. Basically, it is y = x + conv(x), where y is the input to the next layer, x is the output of the previous layer, and conv() is the convolutional layers. When conv(x) is 0, we are simply skipping the convolutional layers.

In other words, conv(x) only has to predict the residual y - x. We can picture this as building a tall tower by stacking bricks. The higher we go, the more unstable the tower gets and the harder it is to keep building. Now imagine we had magnetic bricks that snap to roughly the right spot, so all we have to do is make a small adjustment to each brick; building the tower becomes much easier. That is what ResNet is doing.
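
Before looking at the full ResBlock, here is the skip connection y = x + conv(x) in plain PyTorch (a minimal sketch with a random input):

x = torch.randn(64, 8, 14, 14)
conv = nn.Conv2d(8, 8, kernel_size=3, padding=1)   # keeps the shape, so the sum is valid
y = x + conv(x)                                    # skip connection: output = input + residual
y.shape                                            # torch.Size([64, 8, 14, 14]), same as x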

def _conv_block(ni, nf, stride=2):
    return nn.Sequential(
        ConvLayer(ni, nf, stride=stride),
        ConvLayer(nf, nf, act_cls=None, norm_type=NormType.BatchZero))
class ResBlock(Module):
    def __init__(self, ni, nf, stride=2):
        self.conv = _conv_block(ni, nf, stride=stride)
        self.id_conv = noop if ni == nf else ConvLayer(ni, nf, ks=1)
        self.pool = noop if stride == 1 else nn.AvgPool2d(2, ceil_mode=True)

    def forward(self, x):
        return F.relu(self.conv(x) + self.id_conv(self.pool(x)))
def block(ni, nf):
    return ResBlock(ni, nf)
learn = get_learner(get_model())
learn.lr_find()
SuggestedLRs(valley=0.02290867641568184)
learn.fit_one_cycle(2, 3e-2)
epoch train_loss valid_loss accuracy time
0 0.099767 0.211910 0.934100 01:53
1 0.043861 0.033711 0.989400 01:54

With ResBlock, we reached over 98% accuracy with only 2 epochs! This is an amazing result, and we could train longer for an even better one. However, we still have more tricks left: with a ResNet stem and bottleneck layers, we can improve our results further.

def _resnet_stem(*sizes):
    return [
            ConvLayer(sizes[i], sizes[i+1], stride = 2 if i==0 else 1)
            for i in range(len(sizes)-1)
    ] + [nn.MaxPool2d(kernel_size=3, stride=2, padding=1)]
_resnet_stem(2, 3, 4, 5)
[ConvLayer(
   (0): Conv2d(2, 3, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
   (1): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (2): ReLU()
 ), ConvLayer(
   (0): Conv2d(3, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (1): BatchNorm2d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (2): ReLU()
 ), ConvLayer(
   (0): Conv2d(4, 5, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (1): BatchNorm2d(5, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (2): ReLU()
 ), MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)]
class ResNet(nn.Sequential):
    def __init__(self, n_out, layers, expansion=1):
        stem = _resnet_stem(1,32,32,64)
        self.block_szs = [64, 64, 128, 256, 512]
        for i in range(1,5): self.block_szs[i] *= expansion
        blocks = [self._make_layer(*o) for o in enumerate(layers)]
        super().__init__(*stem, *blocks,
                         nn.AdaptiveAvgPool2d(1), Flatten(),
                         nn.Linear(self.block_szs[-1], n_out))
    
    def _make_layer(self, idx, n_layers):
        stride = 1 if idx==0 else 2
        ch_in,ch_out = self.block_szs[idx:idx+2]
        return nn.Sequential(*[
            ResBlock(ch_in if i==0 else ch_out, ch_out, stride if i==0 else 1)
            for i in range(n_layers)
        ])
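
The layers argument gives the number of ResBlocks in each of the four stages, so [2, 2, 2, 2] below is a resnet18-style model. As a sketch (we do not train it here), deeper variants only change these counts:

deeper = ResNet(dls.c, [3, 4, 6, 3])   # resnet34-style depth; we stick with [2, 2, 2, 2]
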
learn = get_learner(ResNet(dls.c, [2, 2, 2, 2]))
learn.summary()
ResNet (Input shape: 256)
============================================================================
Layer (type)         Output Shape         Param #    Trainable 
============================================================================
                     256 x 32 x 14 x 14  
Conv2d                                    288        True      
BatchNorm2d                               64         True      
ReLU                                                           
Conv2d                                    9216       True      
BatchNorm2d                               64         True      
ReLU                                                           
____________________________________________________________________________
                     256 x 64 x 14 x 14  
Conv2d                                    18432      True      
BatchNorm2d                               128        True      
ReLU                                                           
MaxPool2d                                                      
Conv2d                                    36864      True      
BatchNorm2d                               128        True      
ReLU                                                           
Conv2d                                    36864      True      
BatchNorm2d                               128        True      
Conv2d                                    36864      True      
BatchNorm2d                               128        True      
ReLU                                                           
Conv2d                                    36864      True      
BatchNorm2d                               128        True      
____________________________________________________________________________
                     256 x 128 x 4 x 4   
Conv2d                                    73728      True      
BatchNorm2d                               256        True      
ReLU                                                           
Conv2d                                    147456     True      
BatchNorm2d                               256        True      
Conv2d                                    8192       True      
BatchNorm2d                               256        True      
ReLU                                                           
AvgPool2d                                                      
Conv2d                                    147456     True      
BatchNorm2d                               256        True      
ReLU                                                           
Conv2d                                    147456     True      
BatchNorm2d                               256        True      
____________________________________________________________________________
                     256 x 256 x 2 x 2   
Conv2d                                    294912     True      
BatchNorm2d                               512        True      
ReLU                                                           
Conv2d                                    589824     True      
BatchNorm2d                               512        True      
Conv2d                                    32768      True      
BatchNorm2d                               512        True      
ReLU                                                           
AvgPool2d                                                      
Conv2d                                    589824     True      
BatchNorm2d                               512        True      
ReLU                                                           
Conv2d                                    589824     True      
BatchNorm2d                               512        True      
____________________________________________________________________________
                     256 x 512 x 1 x 1   
Conv2d                                    1179648    True      
BatchNorm2d                               1024       True      
ReLU                                                           
Conv2d                                    2359296    True      
BatchNorm2d                               1024       True      
Conv2d                                    131072     True      
BatchNorm2d                               1024       True      
ReLU                                                           
AvgPool2d                                                      
Conv2d                                    2359296    True      
BatchNorm2d                               1024       True      
ReLU                                                           
Conv2d                                    2359296    True      
BatchNorm2d                               1024       True      
AdaptiveAvgPool2d                                              
Flatten                                                        
____________________________________________________________________________
                     256 x 10            
Linear                                    5130       True      
____________________________________________________________________________

Total params: 11,200,298
Total trainable params: 11,200,298
Total non-trainable params: 0

Optimizer used: <function Adam at 0x7f8b8a8894d0>
Loss function: CrossEntropyLoss()

Callbacks:
  - TrainEvalCallback
  - MixedPrecision
  - Recorder
  - ProgressCallback
learn.lr_find()
SuggestedLRs(valley=0.0008317637839354575)
learn.fit_one_cycle(2, 8e-4)
epoch train_loss valid_loss accuracy time
0 0.105901 0.069346 0.978600 02:05
1 0.026138 0.025373 0.991800 02:05

We reached over 99% accuracy! Let's try label smoothing and see how far we can go. The real power of label smoothing comes with many epochs.
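
With label smoothing, instead of hard 0/1 targets we train toward slightly softened ones. Here is a minimal sketch of the idea (fastai's LabelSmoothingCrossEntropy folds this into the loss; eps=0.1 and the true class being 3 are just assumptions for illustration):

eps, n_classes = 0.1, 10
smoothed = torch.full((n_classes,), eps / n_classes)   # every class gets eps/N = 0.01
smoothed[3] = 1 - eps + eps / n_classes                # the true class gets 0.91
smoothed.sum()                                         # tensor(1.), still a valid distribution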

learn = Learner(dls, get_model(), loss_func=LabelSmoothingCrossEntropy(),
                metrics=accuracy)
learn.lr_find()
SuggestedLRs(valley=0.013182567432522774)
learn.fit_one_cycle(15, 1e-2)
epoch train_loss valid_loss accuracy time
0 1.482026 1.118772 0.832100 01:52
1 0.648935 0.688249 0.949600 01:51
2 0.593247 0.635145 0.969500 01:52
3 0.570614 0.568556 0.985800 01:52
4 0.555753 0.628088 0.960400 01:52
5 0.547613 0.550607 0.989900 01:52
6 0.540540 0.539059 0.991200 01:52
7 0.533598 0.536510 0.992200 01:52
8 0.529636 0.534598 0.990500 01:53
9 0.525533 0.530576 0.991900 01:53
10 0.521623 0.525659 0.993000 01:52
11 0.518083 0.525331 0.993000 01:52
12 0.515895 0.523791 0.993200 01:51
13 0.515033 0.523518 0.993300 01:51
14 0.514001 0.523306 0.993500 01:50

That is a good result. Let's try training with normal batchnorm instead of using BatchZero. Can we get a better result?

def _conv_block_bn(ni, nf, stride=2):
    return nn.Sequential(
        ConvLayer(ni, nf, stride=stride),
        ConvLayer(nf, nf, act_cls=None))

class ResBlock_bn(Module):
    def __init__(self, ni, nf, stride=2):
        self.conv = _conv_block_bn(ni, nf, stride=stride)
        self.id_conv = noop if ni == nf else ConvLayer(ni, nf, ks=1)
        self.pool = noop if stride == 1 else nn.AvgPool2d(2, ceil_mode=True)

    def forward(self, x):
        return F.relu(self.conv(x) + self.id_conv(self.pool(x)))

def block(ni, nf):
    return ResBlock_bn(ni, nf)
learn = get_learner(ResNet(dls.c, [2,2,2,2]))
learn.lr_find()
SuggestedLRs(valley=0.0004786300996784121)
learn.fit_one_cycle(5, 1e-4)
epoch train_loss valid_loss accuracy time
0 1.312650 0.878860 0.822500 02:02
1 0.256814 0.177439 0.956500 02:02
2 0.118774 0.098630 0.972500 02:03
3 0.087837 0.080848 0.976300 02:02
4 0.078750 0.078552 0.976500 02:02

Without BatchZero, we did not get a result as good as before.

Bottleneck layers

With bottleneck layers, we use three conv layers instead of the usual two: a 1x1 convolution at the start, a 3x3 in the middle, and another 1x1 at the end. Because 1x1 convolutions are much faster, the block can afford four times as many filters (the expansion factor) for roughly the same amount of computation as the normal two-layer block.
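
To see why the 1x1 convolutions are cheap, we can compare parameter counts directly (a quick sketch with an arbitrary channel count of 64):

nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False).weight.numel()   # 36864 weights
nn.Conv2d(64, 64, kernel_size=1, bias=False).weight.numel()              # 4096 weights, 9x fewer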

def _conv_block(ni, nf, stride=2):
    return nn.Sequential(
        ConvLayer(ni, nf//4, ks=1, stride=stride),
        ConvLayer(nf//4, nf//4, ks=3),
        ConvLayer(nf//4, nf, ks=1, act_cls=None, norm_type=NormType.BatchZero))
learn = get_learner(ResNet(dls.c, [2, 2, 2, 2], 4))

learn.summary() gives us a good idea of what is going on.

learn.summary()
ResNet (Input shape: 256)
============================================================================
Layer (type)         Output Shape         Param #    Trainable 
============================================================================
                     256 x 32 x 14 x 14  
Conv2d                                    288        True      
BatchNorm2d                               64         True      
ReLU                                                           
Conv2d                                    9216       True      
BatchNorm2d                               64         True      
ReLU                                                           
____________________________________________________________________________
                     256 x 64 x 14 x 14  
Conv2d                                    18432      True      
BatchNorm2d                               128        True      
ReLU                                                           
MaxPool2d                                                      
Conv2d                                    4096       True      
BatchNorm2d                               128        True      
ReLU                                                           
Conv2d                                    36864      True      
BatchNorm2d                               128        True      
ReLU                                                           
____________________________________________________________________________
                     256 x 256 x 7 x 7   
Conv2d                                    16384      True      
BatchNorm2d                               512        True      
Conv2d                                    16384      True      
BatchNorm2d                               512        True      
ReLU                                                           
____________________________________________________________________________
                     256 x 64 x 7 x 7    
Conv2d                                    16384      True      
BatchNorm2d                               128        True      
ReLU                                                           
Conv2d                                    36864      True      
BatchNorm2d                               128        True      
ReLU                                                           
____________________________________________________________________________
                     256 x 256 x 7 x 7   
Conv2d                                    16384      True      
BatchNorm2d                               512        True      
____________________________________________________________________________
                     256 x 128 x 4 x 4   
Conv2d                                    32768      True      
BatchNorm2d                               256        True      
ReLU                                                           
Conv2d                                    147456     True      
BatchNorm2d                               256        True      
ReLU                                                           
____________________________________________________________________________
                     256 x 512 x 4 x 4   
Conv2d                                    65536      True      
BatchNorm2d                               1024       True      
Conv2d                                    131072     True      
BatchNorm2d                               1024       True      
ReLU                                                           
AvgPool2d                                                      
____________________________________________________________________________
                     256 x 128 x 4 x 4   
Conv2d                                    65536      True      
BatchNorm2d                               256        True      
ReLU                                                           
Conv2d                                    147456     True      
BatchNorm2d                               256        True      
ReLU                                                           
____________________________________________________________________________
                     256 x 512 x 4 x 4   
Conv2d                                    65536      True      
BatchNorm2d                               1024       True      
____________________________________________________________________________
                     256 x 256 x 2 x 2   
Conv2d                                    131072     True      
BatchNorm2d                               512        True      
ReLU                                                           
Conv2d                                    589824     True      
BatchNorm2d                               512        True      
ReLU                                                           
____________________________________________________________________________
                     256 x 1024 x 2 x 2  
Conv2d                                    262144     True      
BatchNorm2d                               2048       True      
Conv2d                                    524288     True      
BatchNorm2d                               2048       True      
ReLU                                                           
AvgPool2d                                                      
____________________________________________________________________________
                     256 x 256 x 2 x 2   
Conv2d                                    262144     True      
BatchNorm2d                               512        True      
ReLU                                                           
Conv2d                                    589824     True      
BatchNorm2d                               512        True      
ReLU                                                           
____________________________________________________________________________
                     256 x 1024 x 2 x 2  
Conv2d                                    262144     True      
BatchNorm2d                               2048       True      
____________________________________________________________________________
                     256 x 512 x 1 x 1   
Conv2d                                    524288     True      
BatchNorm2d                               1024       True      
ReLU                                                           
Conv2d                                    2359296    True      
BatchNorm2d                               1024       True      
ReLU                                                           
____________________________________________________________________________
                     256 x 2048 x 1 x 1  
Conv2d                                    1048576    True      
BatchNorm2d                               4096       True      
Conv2d                                    2097152    True      
BatchNorm2d                               4096       True      
ReLU                                                           
AvgPool2d                                                      
____________________________________________________________________________
                     256 x 512 x 1 x 1   
Conv2d                                    1048576    True      
BatchNorm2d                               1024       True      
ReLU                                                           
Conv2d                                    2359296    True      
BatchNorm2d                               1024       True      
ReLU                                                           
____________________________________________________________________________
                     256 x 2048 x 1 x 1  
Conv2d                                    1048576    True      
BatchNorm2d                               4096       True      
AdaptiveAvgPool2d                                              
Flatten                                                        
____________________________________________________________________________
                     256 x 10            
Linear                                    20490      True      
____________________________________________________________________________

Total params: 13,985,322
Total trainable params: 13,985,322
Total non-trainable params: 0

Optimizer used: <function Adam at 0x7f8b8a8894d0>
Loss function: CrossEntropyLoss()

Callbacks:
  - TrainEvalCallback
  - MixedPrecision
  - Recorder
  - ProgressCallback
learn.lr_find()
SuggestedLRs(valley=0.0002290867705596611)
learn.fit_one_cycle(2, 1e-4)
epoch train_loss valid_loss accuracy time
0 0.289013 0.159714 0.955400 02:06
1 0.113530 0.109414 0.969000 02:06

Contrary to the book, bottleneck layers did not give us better results. It might be because our model is not deep enough, or because we only trained for 2 epochs. So let's train for more epochs and find out.

learn.fit_one_cycle(8, 3e-4)
epoch train_loss valid_loss accuracy time
0 0.077559 0.072035 0.978400 02:06
1 0.050012 0.079854 0.974900 02:04
2 0.025841 0.052087 0.982700 02:03
3 0.013299 0.043280 0.985800 02:05
4 0.004574 0.035069 0.988600 02:05
5 0.001627 0.035244 0.989700 02:05
6 0.000468 0.033405 0.990300 02:04
7 0.000319 0.032500 0.989900 02:05

It is a good result, but not as good as the one without bottleneck layers. It might not be a good idea to use them for a model as shallow as resnet18.

Top 5 accuracy

Here is my version of top_5_accuracy. It seems to work okay.

t = tensor([1, 2, 3, 0, 5, 4, 6, 7])
t.sort(descending=True)
torch.return_types.sort(values=tensor([7, 6, 5, 4, 3, 2, 1, 0]), indices=tensor([7, 6, 4, 5, 2, 1, 0, 3]))
t.sort(descending=True)[0][:3]
tensor([7, 6, 5])
def top_5_accuracy(inp, targ, axis=-1):
    "Fraction of samples whose target is among the 5 highest-scoring classes"
    acc = 0
    for _inp, _targ in zip(inp, targ):
        # sort this sample's class scores from highest to lowest and keep the top 5 indices
        items, index = _inp.sort(descending=True, dim=axis)
        top5 = index[:5]
        if _targ in top5: acc += 1
    return acc / len(inp)
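
As a quick sanity check (with made-up scores), the target only has to appear among the five highest scores to count as correct:

inp = tensor([[0.30, 0.20, 0.15, 0.10, 0.09, 0.07, 0.04, 0.03, 0.01, 0.01]])
top_5_accuracy(inp, tensor([4]))   # class 4 has the 5th highest score -> 1.0
top_5_accuracy(inp, tensor([9]))   # class 9 is outside the top 5 -> 0.0
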
learn = Learner(dls, model3, metrics=[accuracy, top_5_accuracy])
learn.fit_one_cycle(3, 1e-3)

Conclusion

We went over convolutional layers and ResNet. Compared to linear layers, we got better results because these architectures make our models easier to train, and the easier it is for our models to learn, the better performance we get. A ResBlock essentially gives the model a shortcut path to follow. What other tricks can we try for even better results?