Let's dig deeper into how DataBlock works by exploring how to use it to create datasets and dataloaders. Without understanding DataBlock, it can be tricky to get our data ready for training.

!pip install -Uqq fastbook
from fastai.vision.all import *
path = untar_data(URLs.PETS)/'images'
path.ls()
(#7393) [Path('/root/.fastai/data/oxford-iiit-pet/images/keeshond_74.jpg'),Path('/root/.fastai/data/oxford-iiit-pet/images/havanese_12.jpg'),Path('/root/.fastai/data/oxford-iiit-pet/images/chihuahua_29.jpg'),Path('/root/.fastai/data/oxford-iiit-pet/images/scottish_terrier_59.jpg'),Path('/root/.fastai/data/oxford-iiit-pet/images/american_bulldog_36.jpg'),Path('/root/.fastai/data/oxford-iiit-pet/images/leonberger_161.jpg'),Path('/root/.fastai/data/oxford-iiit-pet/images/great_pyrenees_139.jpg'),Path('/root/.fastai/data/oxford-iiit-pet/images/newfoundland_38.jpg'),Path('/root/.fastai/data/oxford-iiit-pet/images/British_Shorthair_78.jpg'),Path('/root/.fastai/data/oxford-iiit-pet/images/leonberger_182.jpg')...]
path
Path('/root/.fastai/data/oxford-iiit-pet/images')
path.name[0]
'i'
files = get_image_files(path)
files[0]
Path('/root/.fastai/data/oxford-iiit-pet/images/keeshond_74.jpg')
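As an aside, each file name encodes the breed followed by an index. A quick standard-library sketch (not needed below, and the regex is my own) of pulling the breed out:

```python
import re
from pathlib import Path

fname = Path('/root/.fastai/data/oxford-iiit-pet/images/keeshond_74.jpg')
# The stem is '<breed>_<index>'; strip the trailing _<index> to get the breed
breed = re.match(r'(.+)_\d+$', fname.stem).group(1)
print(breed)  # keeshond
```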

Let's try using DataBlock by itself. When nothing is passed, it acts as a simple template that grabs everything from the source. We can check its behaviour by passing a source when we create datasets with DataBlock.

dblock = DataBlock()
dsets = dblock.datasets(files)
dsets
(#7390) [(Path('/root/.fastai/data/oxford-iiit-pet/images/keeshond_74.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/keeshond_74.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/havanese_12.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/havanese_12.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/chihuahua_29.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/chihuahua_29.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/scottish_terrier_59.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/scottish_terrier_59.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/american_bulldog_36.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/american_bulldog_36.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/leonberger_161.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/leonberger_161.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/great_pyrenees_139.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/great_pyrenees_139.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/newfoundland_38.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/newfoundland_38.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/British_Shorthair_78.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/British_Shorthair_78.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/leonberger_182.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/leonberger_182.jpg'))...]

Because we did not define how to distinguish the dependent variable from the independent variable, our datasets contain the same item for both.

dsets.train[0]
(Path('/root/.fastai/data/oxford-iiit-pet/images/Birman_30.jpg'),
 Path('/root/.fastai/data/oxford-iiit-pet/images/Birman_30.jpg'))

Although we did not say anything about splitting our data into training and validation sets, DataBlock automatically holds out 20% of the data for validation.

dsets.train
(#5912) [(Path('/root/.fastai/data/oxford-iiit-pet/images/Bengal_173.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/Bengal_173.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/german_shorthaired_28.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/german_shorthaired_28.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/wheaten_terrier_30.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/wheaten_terrier_30.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/Maine_Coon_169.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/Maine_Coon_169.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/american_bulldog_175.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/american_bulldog_175.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/newfoundland_132.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/newfoundland_132.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/scottish_terrier_15.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/scottish_terrier_15.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/keeshond_53.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/keeshond_53.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/Egyptian_Mau_95.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/Egyptian_Mau_95.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/Abyssinian_56.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/Abyssinian_56.jpg'))...]
dsets.valid
(#1478) [(Path('/root/.fastai/data/oxford-iiit-pet/images/newfoundland_66.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/newfoundland_66.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/Russian_Blue_212.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/Russian_Blue_212.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/Ragdoll_63.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/Ragdoll_63.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/Russian_Blue_263.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/Russian_Blue_263.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/great_pyrenees_34.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/great_pyrenees_34.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/beagle_83.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/beagle_83.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/shiba_inu_130.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/shiba_inu_130.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/chihuahua_196.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/chihuahua_196.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/Sphynx_184.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/Sphynx_184.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/Bengal_71.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/Bengal_71.jpg'))...]
7390 * .2
1478.0
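To my understanding, the default splitter shuffles the indices and cuts off the first 20% for validation. A plain-Python sketch of that behaviour, where `valid_pct` and `seed` are assumed parameter names:

```python
import random

def random_split(items, valid_pct=0.2, seed=42):
    # Shuffle all indices, then cut the first valid_pct off for validation
    rng = random.Random(seed)
    idxs = list(range(len(items)))
    rng.shuffle(idxs)
    cut = int(valid_pct * len(items))
    return idxs[cut:], idxs[:cut]   # (train indices, valid indices)

train, valid = random_split(range(7390))
print(len(train), len(valid))  # 5912 1478
```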

When we try to use our DataBlock template to create dataloaders, it fails because our data cannot be batched yet.

dloaders = dblock.dataloaders(files)
Could not do one pass in your dataloader, there is something wrong in it
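The failure makes sense once you see what batching does: PyTorch's default collate stacks same-position elements across samples, and it only knows how to stack tensors, numpy arrays, numbers, dicts, and lists, not Path objects. A plain-Python sketch of the transpose step at the heart of collation:

```python
# Collation sketch: a list of (x, y) samples becomes one tuple of columns.
samples = [(0.1, 0), (0.2, 1), (0.3, 0)]
xs, ys = zip(*samples)
batch = (list(xs), list(ys))
print(batch)  # ([0.1, 0.2, 0.3], [0, 1, 0])
```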

Let's build up our DataBlock step by step. First, we pass get_items=get_image_files, which specifies how to grab the data; in this case, it grabs only image files. That is not strictly necessary here, since this directory contains nothing but image files, but it is helpful when other files, such as CSV or text files, are mixed in.

dblocks = DataBlock(get_items=get_image_files)
dsets = dblocks.datasets(path)
dsets
(#7390) [(Path('/root/.fastai/data/oxford-iiit-pet/images/keeshond_74.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/keeshond_74.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/havanese_12.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/havanese_12.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/chihuahua_29.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/chihuahua_29.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/scottish_terrier_59.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/scottish_terrier_59.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/american_bulldog_36.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/american_bulldog_36.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/leonberger_161.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/leonberger_161.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/great_pyrenees_139.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/great_pyrenees_139.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/newfoundland_38.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/newfoundland_38.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/British_Shorthair_78.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/British_Shorthair_78.jpg')),(Path('/root/.fastai/data/oxford-iiit-pet/images/leonberger_182.jpg'), Path('/root/.fastai/data/oxford-iiit-pet/images/leonberger_182.jpg'))...]

We cannot make dataloaders yet, as expected.

dblocks.dataloaders(path)
Could not do one pass in your dataloader, there is something wrong in it
<fastai.data.core.DataLoaders at 0x7f1a841e9710>

When the input and the target have the same value, our model is not very useful. With the PETS dataset, we want to tell whether an image shows a cat or a dog. Because cat breeds' file names are capitalized, we can define a function that labels an image as cat or dog from its file name, then pass that function into our DataBlock to produce the target variable.

def is_cat(animal):
    return 'cat' if animal.name[0].isupper() else 'dog'
dblocks = DataBlock(get_y=is_cat,
                    get_items=get_image_files)
dsets = dblocks.datasets(path)
dsets.train[0]
(Path('/root/.fastai/data/oxford-iiit-pet/images/japanese_chin_60.jpg'), 'dog')
dsets.train[-1]
(Path('/root/.fastai/data/oxford-iiit-pet/images/shiba_inu_67.jpg'), 'dog')
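The capitalization rule is easy to sanity-check on plain pathlib paths, without fastai in the loop:

```python
from pathlib import Path

def is_cat(animal):
    # Cat breed file names in the Oxford-IIIT Pet dataset start with a capital
    return 'cat' if animal.name[0].isupper() else 'dog'

print(is_cat(Path('images/Birman_30.jpg')))     # cat
print(is_cat(Path('images/shiba_inu_67.jpg')))  # dog
```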

We are almost ready to make our dataloaders. We just have to transform our Path objects into objects that can be batched, such as tensors, numpy arrays, dicts, or lists.

dloaders = dblocks.dataloaders(path)
dloaders.show_batch()
Could not do one pass in your dataloader, there is something wrong in it
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-68-b1edbbb53044> in <module>()
      1 dloaders = dblocks.dataloaders(path)
----> 2 dloaders.show_batch()

/usr/local/lib/python3.7/dist-packages/fastai/data/core.py in show_batch(self, b, max_n, ctxs, show, unique, **kwargs)
     98             old_get_idxs = self.get_idxs
     99             self.get_idxs = lambda: Inf.zeros
--> 100         if b is None: b = self.one_batch()
    101         if not show: return self._pre_show_batch(b, max_n=max_n)
    102         show_batch(*self._pre_show_batch(b, max_n=max_n), ctxs=ctxs, max_n=max_n, **kwargs)

/usr/local/lib/python3.7/dist-packages/fastai/data/load.py in one_batch(self)
    146     def one_batch(self):
    147         if self.n is not None and len(self)==0: raise ValueError(f'This DataLoader does not contain any batches')
--> 148         with self.fake_l.no_multiproc(): res = first(self)
    149         if hasattr(self, 'it'): delattr(self, 'it')
    150         return res

/usr/local/lib/python3.7/dist-packages/fastcore/basics.py in first(x, f, negate, **kwargs)
    545     x = iter(x)
    546     if f: x = filter_ex(x, f=f, negate=negate, gen=True, **kwargs)
--> 547     return next(x, None)
    548 
    549 # Cell

/usr/local/lib/python3.7/dist-packages/fastai/data/load.py in __iter__(self)
    107         self.before_iter()
    108         self.__idxs=self.get_idxs() # called in context of main process (not workers/subprocesses)
--> 109         for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
    110             if self.device is not None: b = to_device(b, self.device)
    111             yield self.after_batch(b)

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __next__(self)
    519             if self._sampler_iter is None:
    520                 self._reset()
--> 521             data = self._next_data()
    522             self._num_yielded += 1
    523             if self._dataset_kind == _DatasetKind.Iterable and \

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
    559     def _next_data(self):
    560         index = self._next_index()  # may raise StopIteration
--> 561         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    562         if self._pin_memory:
    563             data = _utils.pin_memory.pin_memory(data)

/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     32                 raise StopIteration
     33         else:
---> 34             data = next(self.dataset_iter)
     35         return self.collate_fn(data)
     36 

/usr/local/lib/python3.7/dist-packages/fastai/data/load.py in create_batches(self, samps)
    116         if self.dataset is not None: self.it = iter(self.dataset)
    117         res = filter(lambda o:o is not None, map(self.do_item, samps))
--> 118         yield from map(self.do_batch, self.chunkify(res))
    119 
    120     def new(self, dataset=None, cls=None, **kwargs):

/usr/local/lib/python3.7/dist-packages/fastai/data/load.py in do_batch(self, b)
    142         else: raise IndexError("Cannot index an iterable dataset numerically - must use `None`.")
    143     def create_batch(self, b): return (fa_collate,fa_convert)[self.prebatched](b)
--> 144     def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
    145     def to(self, device): self.device = device
    146     def one_batch(self):

/usr/local/lib/python3.7/dist-packages/fastai/data/load.py in create_batch(self, b)
    141         elif s is None:  return next(self.it)
    142         else: raise IndexError("Cannot index an iterable dataset numerically - must use `None`.")
--> 143     def create_batch(self, b): return (fa_collate,fa_convert)[self.prebatched](b)
    144     def do_batch(self, b): return self.retain(self.create_batch(self.before_batch(b)), b)
    145     def to(self, device): self.device = device

/usr/local/lib/python3.7/dist-packages/fastai/data/load.py in fa_collate(t)
     48     b = t[0]
     49     return (default_collate(t) if isinstance(b, _collate_types)
---> 50             else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
     51             else default_collate(t))
     52 

/usr/local/lib/python3.7/dist-packages/fastai/data/load.py in <listcomp>(.0)
     48     b = t[0]
     49     return (default_collate(t) if isinstance(b, _collate_types)
---> 50             else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
     51             else default_collate(t))
     52 

/usr/local/lib/python3.7/dist-packages/fastai/data/load.py in fa_collate(t)
     49     return (default_collate(t) if isinstance(b, _collate_types)
     50             else type(t[0])([fa_collate(s) for s in zip(*t)]) if isinstance(b, Sequence)
---> 51             else default_collate(t))
     52 
     53 # Cell

/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
     84         return [default_collate(samples) for samples in transposed]
     85 
---> 86     raise TypeError(default_collate_err_msg_format.format(elem_type))

TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'pathlib.PosixPath'>

An easy way to transform our raw inputs is with blocks. ImageBlock turns the input paths into images, and CategoryBlock turns our target labels into tensors of category indices.

dblocks = DataBlock(blocks=(ImageBlock, CategoryBlock),
                    get_y=is_cat,
                    get_items=get_image_files)
dsets = dblocks.datasets(path)
dsets.train[0]
(PILImage mode=RGB size=500x333, TensorCategory(1))

Our datasets now carry a vocab, because internally they store 0 and 1 instead of 'cat' and 'dog' for computation purposes.

dsets.vocab
['cat', 'dog']
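Conceptually, Categorize builds a sorted vocab and maps each label to its index, roughly like this sketch (the variable names are mine, though fastai uses `o2i` for the reverse mapping as well):

```python
labels = ['dog', 'cat', 'dog', 'dog', 'cat']
vocab = sorted(set(labels))                 # ['cat', 'dog']
o2i = {v: i for i, v in enumerate(vocab)}   # label -> index
encoded = [o2i[l] for l in labels]
print(vocab, encoded)  # ['cat', 'dog'] [1, 0, 1, 1, 0]
```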

We can finally build our dataloaders, now that everything is in a format that can be collated.

Even though the targets are stored as 0s and 1s, show_batch maps them back to 'cat' and 'dog' so we can interpret the data.

dloaders = dblocks.dataloaders(path)
dloaders.show_batch()

Successfully making dataloaders does not mean we can train our model efficiently or get a good result. To use our GPU efficiently, we need to resize our images to the same size; we do that with Resize(224) so that every image ends up 224 by 224. We also pass a splitter that divides the data into training and validation sets. By default it holds out 20%, but here I only wanted 10% for validation.

dblocks = DataBlock(blocks=(ImageBlock, CategoryBlock),
                    get_y=is_cat,
                    get_items=get_image_files,
                    splitter=RandomSplitter(valid_pct=0.1),
                    item_tfms=Resize(224))
dsets = dblocks.datasets(path)
dsets.train[0]
(PILImage mode=RGB size=320x480, TensorCategory(1))
dloaders = dblocks.dataloaders(path)
dloaders.show_batch()
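As I understand it, Resize(224) with its default 'crop' method takes a square center crop before scaling down to 224x224. A rough sketch of the crop-box arithmetic:

```python
def center_crop_box(w, h):
    # Square center crop: the largest centered square of a w x h image,
    # returned as (left, top, right, bottom)
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    return (left, top, left + side, top + side)

# A 500x375 image gets a 375x375 centered crop, then is scaled to 224x224
print(center_crop_box(500, 375))  # (62, 0, 437, 375)
```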

We can also use ImageDataLoaders directly to create dataloaders. We just need is_cat_dl, which is almost the same as is_cat except that it expects strings instead of Path objects, because from_name_func passes file names as strings.

def is_cat_dl(x):
    return x[0].isupper()
dloaders = ImageDataLoaders.from_name_func(path,
                                           get_image_files(path),
                                           label_func=is_cat_dl,
                                           item_tfms=Resize(224))
dloaders.show_batch()

When we get an error, or are simply curious about what is going on under the hood when we use DataBlock, we can look at its summary. It shows what data comes in, what transformations are applied, and how batches are put together.

dblocks.summary(path)
Setting-up type transforms pipelines
Collecting items from /root/.fastai/data/oxford-iiit-pet/images
Found 7390 items
2 datasets of sizes 6651,739
Setting up Pipeline: PILBase.create
Setting up Pipeline: is_cat -> Categorize -- {'vocab': None, 'sort': True, 'add_na': False}

Building one sample
  Pipeline: PILBase.create
    starting from
      /root/.fastai/data/oxford-iiit-pet/images/newfoundland_20.jpg
    applying PILBase.create gives
      PILImage mode=RGB size=500x375
  Pipeline: is_cat -> Categorize -- {'vocab': None, 'sort': True, 'add_na': False}
    starting from
      /root/.fastai/data/oxford-iiit-pet/images/newfoundland_20.jpg
    applying is_cat gives
      dog
    applying Categorize -- {'vocab': None, 'sort': True, 'add_na': False} gives
      TensorCategory(1)

Final sample: (PILImage mode=RGB size=500x375, TensorCategory(1))


Collecting items from /root/.fastai/data/oxford-iiit-pet/images
Found 7390 items
2 datasets of sizes 6651,739
Setting up Pipeline: PILBase.create
Setting up Pipeline: is_cat -> Categorize -- {'vocab': None, 'sort': True, 'add_na': False}
Setting up after_item: Pipeline: Resize -- {'size': (224, 224), 'method': 'crop', 'pad_mode': 'reflection', 'resamples': (2, 0), 'p': 1.0} -> ToTensor
Setting up before_batch: Pipeline: 
Setting up after_batch: Pipeline: IntToFloatTensor -- {'div': 255.0, 'div_mask': 1}

Building one batch
Applying item_tfms to the first sample:
  Pipeline: Resize -- {'size': (224, 224), 'method': 'crop', 'pad_mode': 'reflection', 'resamples': (2, 0), 'p': 1.0} -> ToTensor
    starting from
      (PILImage mode=RGB size=500x375, TensorCategory(1))
    applying Resize -- {'size': (224, 224), 'method': 'crop', 'pad_mode': 'reflection', 'resamples': (2, 0), 'p': 1.0} gives
      (PILImage mode=RGB size=224x224, TensorCategory(1))
    applying ToTensor gives
      (TensorImage of size 3x224x224, TensorCategory(1))

Adding the next 3 samples

No before_batch transform to apply

Collating items in a batch

Applying batch_tfms to the batch built
  Pipeline: IntToFloatTensor -- {'div': 255.0, 'div_mask': 1}
    starting from
      (TensorImage of size 4x3x224x224, TensorCategory([1, 1, 0, 1]))
    applying IntToFloatTensor -- {'div': 255.0, 'div_mask': 1} gives
      (TensorImage of size 4x3x224x224, TensorCategory([1, 1, 0, 1]))
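The last step, IntToFloatTensor with div=255.0, just scales 8-bit pixel values into floats in [0, 1]:

```python
# 8-bit pixel values divided by 255.0, as IntToFloatTensor does per element
pixels = [0, 128, 255]
scaled = [p / 255.0 for p in pixels]
print(scaled)
```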

This is a very helpful feature. If you want to learn more about DataBlock, check out the fastai data block tutorial.