ResNet on MNIST from scratch
I have been using variants of ResNet a lot, such as resnet18 or resnet34, but I never knew how they are structured. After reading chapter 14 of the fastbook, I tried to understand what is going on under the hood by experimenting with everything I learned.
In chapter 14, the fastbook walks through writing ResNet from scratch. The book uses Imagenette, but we will use the MNIST dataset instead. Also, we will only train for 2 epochs instead of 5. Let's find out how far we can get with 2 epochs.
!pip install -Uqq fastbook
import fastbook
from fastai.vision.all import *
When experimenting, it is a good idea to start from a simple dataset. We save time and resources this way. When something works well on an easy dataset, we can move on to a slightly more complex dataset, and so on. That is why we will use the MNIST handwritten digits dataset.
Let's briefly look at how the data is divided into training and testing sets with their respective labels.
path = untar_data(URLs.MNIST)
path.ls()
(path/'training').ls()
Then we can build a DataLoaders. With get_data, we can resize our images to anything we want and get a DataLoaders back, which lets us explore different sizes. Let's try training with 28x28 pixel images, which are the full size, and with 14x14 pixel images.
Generally, we can expect better results from higher-resolution images, even though they take longer to train on. Let's take a look at our pictures first.
def get_data(resize=28):
    "Return dataloaders from MNIST dataset"
    return DataBlock(
        blocks=(ImageBlock(PILImageBW), CategoryBlock),
        get_items=get_image_files,
        splitter=GrandparentSplitter(train_name='training', valid_name='testing'),
        get_y=parent_label,
        item_tfms=Resize(resize)
    ).dataloaders(untar_data(URLs.MNIST), bs=256)
dls = get_data()
dls.show_batch()
Let's try size 14. It is harder to see, but we can still tell what the digits are.
dls_14 = get_data(14)
dls_14.show_batch()
We can look at the shape of each batch by grabbing one batch from each DataLoaders. Each batch has the shape [batch_size, channels_in, height, width]. Because we are working with black-and-white images, channels_in is 1 instead of 3.
xb, yb = dls.one_batch()
xb.shape, yb.shape
xb_14, yb_14 = dls_14.one_batch()
xb_14.shape, yb_14.shape
Before we jump into ResNet, let's build a baseline with linear layers first. We can compare it with ResNet later and see how ResNet performs on the MNIST dataset.
model1 = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28*28, 59),
    nn.ReLU(),
    nn.Linear(59, 10))
As we did with get_data(), we will define get_learner(), which returns a Learner with our dataloaders and an accuracy metric. I am just using a CPU, but we can definitely use a GPU as well. When training on a GPU, we use mixed precision with to_fp16(). (I found that a GPU takes longer to train shallow models, possibly because moving data into GPU memory and back costs more than the computation saves compared to simply using the CPU. That is not the case with the deeper models we will use later on.)
def get_learner(m, dls=dls):
    return Learner(dls, m, metrics=accuracy, loss_func=nn.CrossEntropyLoss()).to_fp16()
learn = get_learner(model1)
learn.lr_find()
learn.fit_one_cycle(2, 1e-3)
Okay, let's try training on the resized dataset. With fewer pixels, training should be faster.
model2 = nn.Sequential(
    nn.Flatten(),
    nn.Linear(14*14, 59),
    nn.ReLU(),
    nn.Linear(59, 10))
learn = get_learner(model2, dls=dls_14)
learn.lr_find()
learn.fit_one_cycle(2, 1e-3)
Smaller images did not save us any time, and we got worse performance, as expected. Therefore, we will just use the full-sized dataset.
Let's try adding a dropout and see what happens.
model1_d = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28*28, 100),
    nn.Dropout(),
    nn.ReLU(),
    nn.Linear(100, 10))
learn = get_learner(model1_d)
learn.fit_one_cycle(2, 1e-3)
Adding dropout does not improve our performance.
Let's try convolutional layers now. We will start with a simple stack of convolutional layers using ConvLayer. We set the stride to 2 so that each layer halves the spatial resolution while extracting the important features that will help us gain better performance.
To make the code easier to read, we will refactor the convolutional layer into a block function. That way we can swap block for a different layer later and use get_model() to build our model without writing out nn.Sequential with all the components every time.
def block(ni, nf):
    return ConvLayer(ni, nf, stride=2)

def get_model():
    return nn.Sequential(
        block(1, 8),
        block(8, 16),
        block(16, 32),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(32, dls.c)
    )
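To see what stride 2 does, we can trace a dummy batch through the model and print the activation shape after each layer. This is just a quick sanity check; the fake input assumes 28x28 single-channel images like ours:
m = get_model()
x = torch.randn(1, 1, 28, 28)   # one fake black-and-white 28x28 image
for layer in m:
    x = layer(x)
    print(layer.__class__.__name__, tuple(x.shape))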
learn = get_learner(get_model())
learn.lr_find()
learn.fit_one_cycle(2, 1e-2)
This is amazing: a huge jump from the linear layers. However, we can still improve our results a lot.
With BatchZero, we can use a higher learning rate. Applied to the last layer of a block, it initializes the batchnorm weights to zero, so the block initially contributes nothing; once we add skip connections, this means each block starts out as a true identity path, which helps training. Let's try it out and see whether it is true.
def conv_block(ni, nf, stride=2, norm=NormType.Batch, last_layer=False):
    if last_layer:
        norm = NormType.BatchZero
    return ConvLayer(ni, nf, stride=stride, norm_type=norm)
conv_block(2, 3)
conv_block(3, 1, last_layer=True)
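We can peek at what BatchZero actually changes: the batchnorm weights (gamma) of the BatchZero layer start at zero, so that layer initially outputs nothing. This is just a quick sanity check, assuming fastai's default conv, norm, act ordering inside ConvLayer:
# gamma of a normal block vs. a BatchZero block: initialized to ones vs. zeros
conv_block(3, 1)[1].weight, conv_block(3, 1, last_layer=True)[1].weight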
def block(ni, nf):
    return conv_block(ni, nf)
learn = get_learner(get_model())
learn.lr_find()
learn.summary()
Now let's try with NormType.BatchZero.
def block(ni, nf):
    return conv_block(ni, nf, last_layer=True)
learn2 = get_learner(get_model())
learn2.lr_find()
We do get to use a higher learning rate! Let's try training with the suggested learning rates: 5e-3 with a normal Batch norm vs. 3e-2 with BatchZero is a big difference. This time, we will use more epochs to explore what is going on.
learn.fit_one_cycle(7, 5e-3)
learn2.fit_one_cycle(7, 3e-2)
It does not look like it is training very well, though. It seems like there is no point in using BatchZero, but we will see later on that it does help us train.
What happens if we add dropout?
model3 = nn.Sequential(conv_block(1, 8),
                       nn.Dropout2d(),
                       conv_block(8, 16, stride=2),
                       nn.Dropout2d(),
                       conv_block(16, 32, stride=2),
                       nn.Dropout2d(),
                       conv_block(32, 64, last_layer=True),
                       nn.Flatten())  # flattens the final 64 x 2 x 2 activations into 256 outputs
learn = Learner(dls, model3, metrics=accuracy)
learn.lr_find()
learn.fit_one_cycle(5, 1e-3)
It is not training very well. The validation loss decreases more quickly than the training loss; after five epochs, the validation loss is about half of the training loss. It might be because there is too much regularization.
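If too much regularization really is the problem, one hypothetical tweak (not something the book tries, and I have not tuned it) would be to keep the dropout layers but lower their probability from the default of 0.5:
nn.Dropout2d(p=0.1)   # gentler than the default nn.Dropout2d(), which uses p=0.5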
Let's try using ResBlock, the building block of ResNet. The idea behind ResNet is that when we stack too many layers, gradients can vanish or explode, which makes the model hard to train.
Therefore, with deeper models, we pass an identity mapping, the activations from before the convolutional layers, on to the next layer along with the activations that just came out of the convolutional layers. Basically, it is y = x + conv(x), where y is the input to the next layer, x is the input from the previous layer, and conv() is a convolutional layer. When conv(x) is 0, we simply skip the convolutional layer.
In other words, with conv(x), the model only has to predict the residual y - x. We can picture this as building a tall tower by stacking up bricks. The higher we get, the more unstable the tower becomes and the harder it is to place bricks on top. Now imagine we have magnetic bricks that snap roughly into the right spot, and all we have to do is make a small adjustment to each brick. Building the tower becomes much easier. That is what ResNet is doing.
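Before looking at the fastai-style version below, here is a minimal sketch of that idea in plain PyTorch. MinimalResBlock is a made-up toy class that only handles the easy case (same number of channels in and out, stride 1), just to show the skip connection:
class MinimalResBlock(nn.Module):
    "Toy residual block: y = x + conv(x)"
    def __init__(self, nf):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(nf, nf, 3, padding=1), nn.ReLU(),
            nn.Conv2d(nf, nf, 3, padding=1))

    def forward(self, x):
        return F.relu(x + self.conv(x))   # the skip connection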
def _conv_block(ni, nf, stride=2):
    return nn.Sequential(
        ConvLayer(ni, nf, stride=stride),
        ConvLayer(nf, nf, act_cls=None, norm_type=NormType.BatchZero))

class ResBlock(Module):
    def __init__(self, ni, nf, stride=2):
        self.conv = _conv_block(ni, nf, stride=stride)
        self.id_conv = noop if ni == nf else ConvLayer(ni, nf, ks=1)
        self.pool = noop if stride == 1 else nn.AvgPool2d(2, ceil_mode=True)

    def forward(self, x):
        return F.relu(self.conv(x) + self.id_conv(self.pool(x)))
def block(ni, nf):
    return ResBlock(ni, nf)
learn = get_learner(get_model())
learn.lr_find()
learn.fit_one_cycle(2, 3e-2)
With ResBlock, we reached 98% accuracy with only 2 epochs! This is an amazing result, and we could keep training for an even better one. However, we still have more tricks left: with a resnet_stem and bottleneck layers, we can improve our result further.
def _resnet_stem(*sizes):
    return [
        ConvLayer(sizes[i], sizes[i+1], stride = 2 if i==0 else 1)
        for i in range(len(sizes)-1)
    ] + [nn.MaxPool2d(kernel_size=3, stride=2, padding=1)]
_resnet_stem(2, 3, 4, 5)
class ResNet(nn.Sequential):
    def __init__(self, n_out, layers, expansion=1):
        stem = _resnet_stem(1, 32, 32, 64)
        self.block_szs = [64, 64, 128, 256, 512]
        for i in range(1, 5): self.block_szs[i] *= expansion
        blocks = [self._make_layer(*o) for o in enumerate(layers)]
        super().__init__(*stem, *blocks,
                         nn.AdaptiveAvgPool2d(1), Flatten(),
                         nn.Linear(self.block_szs[-1], n_out))

    def _make_layer(self, idx, n_layers):
        stride = 1 if idx==0 else 2
        ch_in, ch_out = self.block_szs[idx:idx+2]
        return nn.Sequential(*[
            ResBlock(ch_in if i==0 else ch_out, ch_out, stride if i==0 else 1)
            for i in range(n_layers)
        ])
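Since this post started with resnet18 and resnet34: in this class they correspond to layers=[2, 2, 2, 2] and layers=[3, 4, 6, 3] respectively. Here is a quick sketch comparing their parameter counts (the numbers will not match torchvision's models exactly, since our stem takes single-channel input and we use our own ResBlock):
rn18 = ResNet(dls.c, [2, 2, 2, 2])   # resnet18-style: 2 ResBlocks per stage
rn34 = ResNet(dls.c, [3, 4, 6, 3])   # resnet34-style: 3, 4, 6, 3 ResBlocks per stage
sum(p.numel() for p in rn18.parameters()), sum(p.numel() for p in rn34.parameters())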
learn = get_learner(ResNet(dls.c, [2, 2, 2, 2]))
learn.summary()
learn.lr_find()
learn.fit_one_cycle(2, 8e-4)
We approached 99% accuracy! Let's try label smoothing and see how far we can go. The real power of label smoothing comes with many epochs.
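As a quick reminder of what label smoothing does (with fastai's default eps=0.1): instead of a hard one-hot target, the correct class gets 1 - eps + eps/c and every other class gets eps/c, so the model is never pushed toward infinitely confident predictions. A minimal sketch comparing it to plain cross entropy on made-up logits:
preds = torch.randn(4, 10)   # fake logits for 4 samples, 10 classes
targs = tensor([0, 1, 2, 3])
nn.CrossEntropyLoss()(preds, targs), LabelSmoothingCrossEntropy()(preds, targs)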
learn = Learner(dls, get_model(), loss_func=LabelSmoothingCrossEntropy(),
                metrics=accuracy)
learn.lr_find()
learn.fit_one_cycle(15, 1e-2)
That is a good result. Let's try training with normal batchnorm instead of BatchZero. Can we get a better result?
def _conv_block_bn(ni, nf, stride=2):
    return nn.Sequential(
        ConvLayer(ni, nf, stride=stride),
        ConvLayer(nf, nf, act_cls=None))

class ResBlock_bn(Module):
    def __init__(self, ni, nf, stride=2):
        self.conv = _conv_block_bn(ni, nf, stride=stride)
        self.id_conv = noop if ni == nf else ConvLayer(ni, nf, ks=1)
        self.pool = noop if stride == 1 else nn.AvgPool2d(2, ceil_mode=True)

    def forward(self, x):
        return F.relu(self.conv(x) + self.id_conv(self.pool(x)))
def block(ni, nf):
    return ResBlock_bn(ni, nf)
learn = get_learner(ResNet(dls.c, [2,2,2,2]))
learn.lr_find()
learn.fit_one_cycle(5, 1e-4)
Without BatchZero, we cannot get a result as good as before.
With bottleneck layers, we use three conv layers instead of the usual two, but the first and third use a kernel size of 1. In exchange, a bottleneck layer gives us four times as many features as a normal conv block. Because 1x1 kernels are much cheaper to compute, the overall cost stays about the same as two normal 3x3 conv layers, so we get roughly four times the features for the same computing power.
def _conv_block(ni, nf, stride=2):
    return nn.Sequential(
        ConvLayer(ni, nf//4, ks=1, stride=stride),
        ConvLayer(nf//4, nf//4, ks=3),
        ConvLayer(nf//4, nf, ks=1, act_cls=None, norm_type=NormType.BatchZero))
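As a rough check on the "same compute" claim, we can compare parameter counts (a crude proxy for compute) of a regular two-conv block that keeps 64 features with a bottleneck block that outputs 4 times as many. These sizes are made up just for the comparison:
def n_params(m): return sum(p.numel() for p in m.parameters())

# regular block with 64 features vs. bottleneck block with 4x the features
n_params(_conv_block_bn(64, 64)), n_params(_conv_block(64, 256))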
learn = get_learner(ResNet(dls.c, [2, 2, 2, 2], 4))
A summary of learn gives us a good idea of what is going on.
learn.summary()
learn.lr_find()
learn.fit_one_cycle(2, 1e-4)
Contrary to the book, bottleneck layers did not give us better results. It might be because our model is not deep enough, or because we only trained for 2 epochs. So let's just train for more epochs and find out.
learn.fit_one_cycle(8, 3e-4)
It is a good result, but not as good as the one without bottleneck layers. It might not be a good idea to use them in a model as shallow as resnet18.
Here is my version of top_5_accuracy. It seems to work okay.
t = tensor([1, 2, 3, 0, 5, 4, 6, 7])
t.sort(descending=True)
t.sort(descending=True)[0][:3]
def top_5_accuracy(inp, targ, axis=-1):
    acc = 0
    for _inp, _targ in zip(inp, targ):
        # sort the predictions and count a hit if the target is among the top 5
        items, index = _inp.sort(descending=True, dim=axis)
        top5 = index[:5]
        if _targ in top5: acc += 1
    return acc / len(inp)
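For comparison, fastai ships a built-in top_k_accuracy metric. Here is a quick check on made-up logits that the two agree (assuming the fastai version in use exports top_k_accuracy, as recent versions do):
inp = torch.randn(32, 10)            # fake logits
targ = torch.randint(0, 10, (32,))   # fake labels
top_5_accuracy(inp, targ), top_k_accuracy(inp, targ, k=5)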
learn = Learner(dls, model3, metrics=[accuracy, top_5_accuracy])
learn.fit_one_cycle(3, 1e-3)
We went over convolutional layers and ResNet. Compared to linear layers, they give better results because they make our models easier to train, and the easier it is for a model to learn, the better performance we get. A ResBlock essentially gives the model a shortcut, the identity path, to follow. What other ways can we try for better results?