FastAI Data block
Let's look at how DataBlock works by digging deeper into it. We explore how to use a DataBlock to create datasets and dataloaders. Without understanding how the data block works, it can be tricky to get our data ready for training.
!pip install -Uqq fastbook
from fastai.vision.all import *
path = untar_data(URLs.PETS)/'images'
path.ls()
path
path.name[0]
files = get_image_files(path)
files[0]
Let's try using DataBlock by itself. When nothing is passed, it is a simple template that grabs anything from the source. We can check its behaviour by passing a source when we create datasets with DataBlock.
dblock = DataBlock()
dsets = dblock.datasets(files)
dsets
Because we did not define how to differentiate the dependent variable from the independent variable, our datasets have the same item for both variables.
dsets.train[0]
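The pairing seen above can be imitated in plain Python. This is only a toy sketch of the idea, not fastai's implementation: identity and toy_datasets are hypothetical names, and the file names are stand-ins for real image paths.

```python
# Toy imitation of an empty DataBlock: with no getters defined,
# the same item is used as both the input and the target.
def identity(item):
    return item

def toy_datasets(items, get_x=identity, get_y=identity):
    # Hypothetical helper; fastai's real pipeline does much more than this.
    return [(get_x(item), get_y(item)) for item in items]

toy_files = ["Abyssinian_1.jpg", "beagle_1.jpg"]  # stand-ins for image paths
pairs = toy_datasets(toy_files)
print(pairs[0])
```

Passing a different get_y to toy_datasets would make the second element of each pair differ from the first, which is what defining get_y on a DataBlock is for.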
Although we did not say anything about splitting our data into training and validation sets, DataBlock automatically sets aside 20% of the data for validation.
dsets.train
dsets.valid
7390 * .2
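A random 80/20 split like this one can be sketched with the standard library alone. The function toy_random_split is a hypothetical name, and the sketch assumes the split works by shuffling indices and cutting off the first valid_pct of them; 7390 is the item count from the arithmetic above.

```python
import random

def toy_random_split(n_items, valid_pct=0.2, seed=None):
    """Shuffle the indices 0..n_items-1 and hold out valid_pct of them for validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    cut = int(valid_pct * n_items)        # number of validation items
    return indices[cut:], indices[:cut]   # (train indices, valid indices)

train_idx, valid_idx = toy_random_split(7390, valid_pct=0.2, seed=42)
print(len(train_idx), len(valid_idx))
```

The two index lists never overlap, so no image ends up in both the training and the validation set.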
When we try to use our DataBlock template to create dataloaders, it fails because our data cannot be batched yet.
dloaders = dblock.dataloaders(files)
Let's slowly build up our DataBlock. First, we pass get_items=get_image_files, which specifies how to grab the data. In this case, it only grabs image files. Although it is not necessary here because there are only image files in this directory, it is helpful when other files are mixed in, such as CSV files, text files, etc.
# get_items now collects the files itself, so the source is the directory, not the file list
dblocks = DataBlock(get_items=get_image_files)
dsets = dblocks.datasets(path)
dsets
We cannot make dataloaders yet, as expected.
dblocks.dataloaders(path)
When the input and the target have the same values, our model is not very useful. With the PETS dataset, we want to tell whether an image shows a cat or a dog. Because the file names of cat images start with a capital letter, we can define a function that tells a cat from a dog by looking at the file name. Then we pass this function to our DataBlock as get_y to produce our target variables.
def is_cat(animal):
    return 'cat' if animal.name[0].isupper() else 'dog'
dblocks = DataBlock(get_y=is_cat,
                    get_items=get_image_files)
dsets = dblocks.datasets(path)
dsets.train[0]
dsets.train[-1]
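The labelling rule can be checked on its own, without loading any data. The sketch below repeats is_cat and applies it to two illustrative paths; the images directory and the file names are placeholders chosen for the example.

```python
from pathlib import Path

def is_cat(animal):
    # An upper-case first letter in the file name means cat; otherwise dog.
    return 'cat' if animal.name[0].isupper() else 'dog'

# Placeholder paths: one capitalized name, one lower-case name.
examples = [Path("images/Abyssinian_1.jpg"), Path("images/beagle_1.jpg")]
labels = [is_cat(p) for p in examples]
print(labels)
```

Because is_cat reads animal.name, only the file name matters and the directory part of the path is ignored.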
We are almost ready to make our dataloaders. We just have to transform our Path objects into objects that can be batched, such as tensors, NumPy arrays, dicts, or lists.
dloaders = dblocks.dataloaders(path)
dloaders.show_batch()
An easy way to transform our raw inputs is with blocks. ImageBlock transforms inputs into images and CategoryBlock transforms our target labels into tensors.
dblocks = DataBlock(blocks=(ImageBlock, CategoryBlock),
                    get_y=is_cat,
                    get_items=get_image_files)
dsets = dblocks.datasets(path)
dsets.train[0]
Our datasets now have a vocab. The labels are stored as 0 and 1 for computation purposes, and the vocab maps those integers back to cat and dog.
dsets.vocab
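The role of the vocab can be illustrated in plain Python. This sketch assumes the vocab is the sorted list of unique labels, so cat maps to 0 and dog to 1; the label list and all variable names are made up for the example and are not fastai's internals.

```python
labels = ['dog', 'cat', 'cat', 'dog']   # made-up labels

# Assumption: the vocab is the sorted list of unique labels.
vocab = sorted(set(labels))
label_to_index = {label: i for i, label in enumerate(vocab)}

encoded = [label_to_index[label] for label in labels]   # integers used for computation
decoded = [vocab[i] for i in encoded]                   # back to readable labels
print(vocab, encoded, decoded)
```

Encoding and then decoding returns the original labels, which is the round trip that lets show_batch display cat and dog instead of 0 and 1.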
We can finally build our dataloaders, since the data is now in a format that can be batched.
Even though the labels are only 0s and 1s, they are transformed back into cat and dog when we call show_batch so that we can interpret the data.
dloaders = dblocks.dataloaders(path)
dloaders.show_batch()
Just because we succeeded in making dataloaders does not mean we can train our model efficiently and get a good result. To use our GPU efficiently, we need to resize our images to the same size. We do that with Resize(224) so that all the images have a size of 224 by 224. We also pass a splitter that splits the data into training and validation sets. By default, 20% is held out, but I only wanted 10% for validation.
dblocks = DataBlock(blocks=(ImageBlock, CategoryBlock),
                    get_y=is_cat,
                    get_items=get_image_files,
                    splitter=RandomSplitter(valid_pct=0.1),
                    item_tfms=Resize(224))
dsets = dblocks.datasets(path)
dsets.train[0]
dloaders = dblocks.dataloaders(path)
dloaders.show_batch()
We can also use ImageDataLoaders directly to create dataloaders. We just need is_cat_dl, which is pretty much the same as is_cat except that it expects strings instead of Path objects. This is because we are using from_name_func, and names are strings. It also returns True or False rather than the strings 'cat' and 'dog'.
def is_cat_dl(x):
    return x[0].isupper()
dloaders = ImageDataLoaders.from_name_func(path,
                                           get_image_files(path),
                                           label_func=is_cat_dl,
                                           item_tfms=Resize(224))
dloaders.show_batch()
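The difference between labelling a name string and labelling a whole path can be seen without fastai. In this sketch the path is a made-up placeholder; it compares is_cat_dl applied to the file name with is_cat_dl applied to the full path string.

```python
from pathlib import Path

def is_cat_dl(x):
    # x is expected to be a file name string, e.g. "Abyssinian_1.jpg".
    return x[0].isupper()

file_path = Path("images/Abyssinian_1.jpg")   # placeholder path

on_name = is_cat_dl(file_path.name)    # looks at "A" from the file name
on_full = is_cat_dl(str(file_path))    # looks at "i" from the directory name
print(on_name, on_full)
```

Only the file name gives the intended answer here, since the first character of the full path belongs to the directory.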
When we get an error or are curious about what is going on under the hood when we use a DataBlock, we can look at its summary. We can find out what kind of data goes in, what kind of transformations are applied, and how batches are put together.
dblocks.summary(path)
This is a very helpful feature. If you want to learn more about the data block API in fastai, check out the fastai tutorial.