* Image: Joshua Sortino on Unsplash

Deep Learning (DL) algorithms need to consume a lot of data. It is therefore essential for every DL enthusiast to learn not only how to use data sets but also how to create them. Every expert has their own way of working their magic, which makes it difficult for beginners to identify which technique is practical for their specific situation.

There are several styles of web scraping. As a beginner, I ran into numerous obstacles and dead ends.

Eventually I found a good combination of libraries that let me perform all the steps in a smooth order: identify items, load URLs, add labels, make the data available to the coding platform, let the code find the files, clean up folders, save the cleaned data set, train a model, view results, evaluate results, build an application, and save the model.

I am sharing them here as a beginner-friendly technique for an end-to-end project.

For those who want a more detailed understanding of how the functions are defined, I’ve included some links to the source code.

Code along with me 🙂

1. Setup.

I used the Colab platform to run the program. If you are just starting out, give yourself at least 1-2 hours of protected time. If you end your session before you have saved the data set (step 4), you may need to run everything again from the first cell.

a. Specify the type of machine you need.

Consider whether you are processing just a few images or whether you expect hundreds or thousands. If you plan to process fewer than a hundred, keep the default setting (CPU).

If you process a lot of images, I recommend using the GPU:

Using the GPU in Colab is free for non-Pro accounts. However, the more you use it, the less access you may get in the future (to ensure fair use). If you are doing your first project, I recommend starting with a CPU. You will notice the need for a GPU if processing takes too long (arbitrarily, more than 5-10 minutes).

The CPU and GPU are separate machines. If you switch between them in the middle of your work, Colab treats it as a freshly opened notebook and you will need to run everything again from the first cell. In exchange, image processing is faster on the GPU.

b. Install and import the following into your Colab notebook:

!pip install -Uqq fastbook
import fastbook
from fastbook import *
#!pip install fastai -U # uncomment if this is your first time using fastai
import fastai
from fastai.vision.all import *

You need to connect your Colab notebook to Google Drive. If this is your first time, do this:

from google.colab import drive
drive.mount('/content/gdrive') # force_remount=True is sometimes needed

If you need more information about this step, see Step 2c of this blog post for simple instructions.

c. Special settings for creating a data set.

Of the options I found for collecting images, the jmd_imagescraper library is the easiest to install and use. It’s free and doesn’t require cloud accounts.

!pip install jmd_imagescraper
from jmd_imagescraper.core import *
from pathlib import Path

We are now ready to capture some pictures!

2. Retrieve images from the web.

a. Set up a path.

root = Path().cwd()/'LOTR_characters_250'
  • Path() lets you organize where your files go. This is necessary.
  • cwd() uses the directory you are currently working in as the base.
  • The last part (in this case ‘LOTR_characters_250’) is the folder in which your files will be stored.
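As a minimal sketch of how these pieces fit together, the snippet below builds the same root path and one label subfolder using only the standard library. The names `base` and `frodo_dir` are illustrative and not part of the original notebook.

```python
from pathlib import Path

# Path.cwd() returns the directory the notebook is currently running in.
base = Path.cwd()

# The / operator appends a folder name to an existing path.
root = base / 'LOTR_characters_250'

# Later on, each label gets its own subfolder under root.
frodo_dir = root / 'Frodo'

print(root.name)                 # last component of the path
print(frodo_dir.parent == root)  # the label folder sits directly under root
```

Nothing is created on disk here; the folders only appear once the scraper (or `mkdir`) writes to them.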

b. Perform a data search and retrieval.

i. Search Engine:

jmd_imagescraper uses DuckDuckGo image search. The call takes the form: (root path for the downloads, subfolder where the image group is stored, search string, maximum number of images you want, layout). For example:

duckduckgo_search(root, 'Folder_Label_A', 'strings specific for finding Label_A', max_results = n, img_layout = ImgLayout.Square)

ii. Subject and intended use of the data:

To get started, I suggest you choose a topic that is reasonably familiar to you, which makes cleaning and interpretation easier later. For example: modes of transport (labels = train, plane, horse, dragon).

Set your goals for the data set you create. I had:

Goal: to associate a sample image with one of the nine members of the Fellowship of the Ring.

iii. Number of pictures:

The code shown below is the result of several runs. Adjust max_results based on your experience.

If this is your first time, it is a good idea to start with a pilot project. A max_results of 25-50 should be sufficient for a first run. With fewer images, you may not have enough data to train your model. With more, you may get distracted or overwhelmed. Once you have a good feel for the whole process, you can move on to larger numbers if necessary.


iv. Layout:

The default image layout is square, but I noticed that ImgLayout.All gave a better set of images.

search = duckduckgo_search
search(root, 'Frodo', 'frodo lord of the rings ', max_results = 250, img_layout=ImgLayout.All)
search(root, 'Aragorn', 'aragorn strider lord of the rings ', max_results = 250, img_layout=ImgLayout.All)
search(root, 'Gandalf', 'gandalf lord of the rings ', max_results = 250, img_layout=ImgLayout.All)
search(root, 'Sam', 'sam lord of the rings ', max_results = 250, img_layout=ImgLayout.All)
search(root, 'Merry', 'merry meriadoc lord of the rings ', max_results = 250, img_layout=ImgLayout.All)
search(root, 'Pippin', 'pippin peregrin lord of the rings ', max_results = 250, img_layout=ImgLayout.All)
search(root, 'Legolas', 'legolas lord of the rings ', max_results = 250, img_layout=ImgLayout.All)
search(root, 'Boromir', 'boromir lord of the rings ', max_results = 250, img_layout=ImgLayout.All)
search(root, 'Gimli', 'gimli lord of the rings ', max_results = 250, img_layout=ImgLayout.All)

The images collected by each search are stored in a subfolder named after the label (in this case, “Gimli”, etc.), inside the root folder that you specified in root.

3. Cleaning the data set

For established datasets, the cleanup step is sometimes unnecessary and often skipped.

Because we are creating a new collection with closely related target labels, cleaning is essential.

a. Opening the cleaner.

from jmd_imagescraper.imagecleaner import *
display_image_cleaner(root)

This will show all the files in the folders specified in the previous step. Notice that each group is labeled with its folder name, which makes it easier to label the data set.

Before you start clicking, read the section below!

b. Remind yourself of the goal you set for your data set.

Your goal determines which images you keep. This is important. You can lose a lot of time reviewing or reloading images if you don’t stay focused on your goal.

For my part, I wanted the model to generalize: could a picture be matched to any of the nine members of the Fellowship of the Ring? That’s why I kept all kinds of representations of the characters and their most significant accessories, including drawings, non-movie images, sculptures, toys, video games, etc.

To ensure that the algorithm only sees one character at a time, images with two or more members visible were removed. Some non-Fellowship characters strongly associated with a target label were retained (e.g., Faramir with Boromir, Bilbo with Frodo, and Saruman with Gandalf).

Easily identifiable duplicates were removed so that the algorithm would not give more weight to a particular image.
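As a supplement to checking by eye, byte-identical copies can be flagged automatically. The sketch below is an assumption-laden helper (the function name `find_exact_duplicates` is hypothetical and not part of jmd_imagescraper): it hashes every file under a folder and reports files whose content matches an earlier file. It will not catch resized or re-encoded copies, which still need a visual check.

```python
import hashlib
from pathlib import Path

def find_exact_duplicates(folder):
    """Return paths whose bytes match an earlier file in the same folder tree."""
    seen = {}        # content hash -> first path seen with that content
    duplicates = []
    for file in sorted(Path(folder).rglob('*')):
        if not file.is_file():
            continue
        digest = hashlib.sha256(file.read_bytes()).hexdigest()
        if digest in seen:
            duplicates.append(file)   # same bytes as an earlier file
        else:
            seen[digest] = file
    return duplicates

# Example (not run here): list candidates before deleting anything.
# for dup in find_exact_duplicates(root): print(dup)
```

Reviewing the list before deleting keeps the decision in your hands, in line with the goal-driven cleaning described above.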

c. Check that each label has a good number of images. If not, you may need to perform a new search with revised keywords or a different max_results.
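One way to check the numbers is to count the image files in each label subfolder. This is a minimal sketch using only the standard library; the helper name `count_images_per_label` and the list of extensions are assumptions to adapt to your own downloads.

```python
from pathlib import Path

# Assumption: adjust this set to the image formats you actually downloaded.
IMAGE_EXTS = {'.jpg', '.jpeg', '.png'}

def count_images_per_label(root):
    """Map each label subfolder name to the number of image files inside it."""
    counts = {}
    for sub in sorted(Path(root).iterdir()):
        if sub.is_dir():
            counts[sub.name] = sum(
                1 for f in sub.iterdir() if f.suffix.lower() in IMAGE_EXTS
            )
    return counts

# Example (not run here):
# print(count_images_per_label(root))
```

A label with far fewer images than the others is a hint to rerun the search for that label.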

4. Save the cleaned data set.

Tip: keep the names consistent, it will help you choose the path.

a. Save the data set to your own computer.

zip_name = 'LOTR_characters_250.zip'
!rm -f {zip_name}
!zip -q -r {zip_name} {root}
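If you prefer to stay in Python rather than use shell commands, the standard library can create the archive as well. This is an alternative sketch, not the method used above, and `zip_folder` is a hypothetical helper name.

```python
import shutil
from pathlib import Path

def zip_folder(folder):
    """Create <folder>.zip next to the folder and return the archive path."""
    folder = Path(folder).resolve()
    archive = shutil.make_archive(
        base_name=str(folder),        # archive path without the .zip suffix
        format='zip',
        root_dir=str(folder.parent),  # paths inside the zip are relative to this
        base_dir=folder.name,         # only this folder is added to the archive
    )
    return Path(archive)

# Example (not run here):
# zip_path = zip_folder(root)
```

The returned path can then be copied to Google Drive in the same way as the zip file created by the shell commands.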

b. Save the data set to Google Drive.

from google.colab import files
from google.colab import drive
import shutil
destination_folder = 'LOTR_characters_250'
drive.mount('/content/drive/')
folder = Path('/content/drive/My Drive')/destination_folder
folder.mkdir(parents=True, exist_ok = True)
shutil.copyfile(zip_name, str(folder/zip_name))

5. Modeling!

a. Path

Provide instructions on where to find the data set. This is an essential step.

path = Path('drive/My Drive/LOTR_characters_250')

b. DataBlock and DataLoaders

For an introduction to or a refresher on DataBlock, see Steps 6a-f in Starting a Dive for Deep Learning.

chars = DataBlock(
    blocks = (ImageBlock, CategoryBlock),
    get_items = get_image_files,
    splitter = RandomSplitter(valid_pct = 0.2, seed = 42),
    get_y = parent_label,
    item_tfms = Resize(460),
    batch_tfms = aug_transforms(size = 200, max_rotate = 30,
                                max_zoom = 1.2))
dls = chars.dataloaders(path/'/content/LOTR_characters_250/')

Note that there is no obvious connection between your previous code and the DataBlock. The connection is hidden in get_items and get_y.

  • get_image_files collects the image files found under the path you pass in.
  • parent_label lets the DataBlock use the name of each file’s parent folder as its label.
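To make the folder-to-label link concrete, here is a minimal standard-library sketch of the idea behind parent_label. It is an illustration only, not fastai's implementation; `label_from_parent` and the file name `001.jpg` are hypothetical.

```python
from pathlib import Path

def label_from_parent(file_path):
    """Return the name of the folder that directly contains the file."""
    return Path(file_path).parent.name

# A file saved by the scraper under the 'Gimli' subfolder is labeled 'Gimli'.
print(label_from_parent('LOTR_characters_250/Gimli/001.jpg'))  # Gimli
```

This is why keeping one subfolder per character during scraping and cleaning matters: the folder names become the labels.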

c. Visualize your data.

dls.train.show_batch(nrows = 2, ncols = 3)

d. Start learning.

A simple introduction to cnn_learner and fine_tune can be found in Taste for deep learning, Steps 7-10.

learn = cnn_learner(dls, resnet50, metrics = error_rate)
learn.fine_tune(8) # 8 epochs, after the default 1 frozen epoch

We were able to get 89% accuracy with a 50-layer model, 1 frozen epoch, 8 epochs, and the default learning rates. This is acceptable given that this is an attempt to generalize across a wide variety of images.

6. Interpret the results of the model.

a. Visualize images with labels.

i. If you are familiar with the subject.


ii. If you are not sure about the labels because you are not familiar with the subject.

  • vocab gives you the indexed labels. For example, 0 = Aragorn, 1 = Boromir.
  • get_preds gives you the actual label of each image. So the actual label for the third image is 1 = Boromir, but the predicted label was Aragorn.
  • The semantics here can be confusing. It helps to remember that the top label is the actual one and the bottom label is the predicted one.
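The index-to-label mapping can be sketched in plain Python. This assumes the vocabulary is the alphabetically sorted list of unique labels, which matches the 0 = Aragorn, 1 = Boromir example above; the variable names are illustrative.

```python
# The nine label folders, in the order they were scraped.
labels = ['Frodo', 'Aragorn', 'Gandalf', 'Sam', 'Merry',
          'Pippin', 'Legolas', 'Boromir', 'Gimli']

# Assumption: the vocabulary is the sorted list of unique labels.
vocab = sorted(set(labels))

# Lookup in both directions: index -> name and name -> index.
index_of = {name: i for i, name in enumerate(vocab)}

print(vocab[0], vocab[1])   # Aragorn Boromir
print(index_of['Boromir'])  # 1
```

With this mapping in hand, a numeric target or prediction can be translated back into a character name.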

b. Look at the numbers.

interp = ClassificationInterpretation.from_learner(learn)

The model was able to tell apart the images of the two men in the validation set (Aragorn, Boromir). It was also able to distinguish most of the hobbits, with the exception of some Merry and Sam images. Given that these two hobbits look alike, the figures above are acceptable. Placing the characters in context (i.e., the Frodo-Sam and Merry-Pippin pairings) could help fine-tune the model.
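For readers who want to see what a confusion matrix counts, here is a minimal sketch in plain Python. The label lists are made up for illustration and are not the results reported in this post; `confusion_counts` is a hypothetical helper.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))

# Made-up validation labels, for illustration only.
actual    = ['Merry', 'Merry', 'Sam', 'Sam',   'Aragorn', 'Boromir']
predicted = ['Merry', 'Sam',   'Sam', 'Merry', 'Aragorn', 'Boromir']

counts = confusion_counts(actual, predicted)

# Pairs where actual == predicted lie on the diagonal of the matrix.
correct = sum(n for (a, p), n in counts.items() if a == p)
accuracy = correct / len(actual)

print(counts[('Merry', 'Sam')])  # Merry images predicted as Sam
print(round(accuracy, 2))
```

Off-diagonal pairs such as (Merry, Sam) are exactly the confusions discussed above.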

c. Let’s check which predictions were the most wrong.

Neural networks learn by analyzing what they have done wrong and trying to correct them in the next round. This is reflected in the loss.

Example: If the model predicts a picture to be Sam when it should be Merry, this counts as a loss. And the more confident the model is in its wrong prediction (i.e., the greater the probability), the greater the loss. This is how neural networks adjust their weights. For more information, see Stochastic Gradient Descent in the Fast.ai mnist_basics section.
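The link between confidence and loss can be shown with a small calculation. This sketch assumes a cross-entropy loss, which is a common choice for single-label classification; the probabilities are made up for illustration.

```python
import math

def cross_entropy(prob_of_true_label):
    """Loss for one image: negative log of the probability given to the true label."""
    return -math.log(prob_of_true_label)

# The picture shows Merry. The more probability the model spends on Sam,
# the less is left for Merry, and the larger the loss becomes.
mildly_wrong = cross_entropy(0.40)  # model gave Merry 40%
very_wrong = cross_entropy(0.05)    # model gave Merry only 5%

print(very_wrong > mildly_wrong)    # True
```

A confidently wrong prediction is penalized much more than a hesitant one, which is why such images show up at the top of the losses.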


We can see that the computer was most wrong with images that even a person familiar with the subject could easily confuse.

I think this model works well. Let’s enjoy our work with a simple application!

7. Application of our model.

a. Upload a sample image to get started.

btn_upload = widgets.FileUpload()

Upload the selected image before proceeding to the next step.

b. Create an application.

img = PILImage.create(btn_upload.data[-1])
out_pl = widgets.Output()
with out_pl: display(img.to_thumb(250))
pred, pred_idx, probs = learn.predict(img)
lbl_pred = widgets.Label()
lbl_pred.value = f'Prediction: {pred}; Probability: {probs[pred_idx]:.04f}'
btn_run = widgets.Button(description = 'Classify')
def on_click_classify(change):
    img = PILImage.create(btn_upload.data[-1])
    with out_pl: display(img.to_thumb(200))
    pred, pred_idx, probs = learn.predict(img)
    lbl_pred.value = f'Prediction: {pred}; Probability: {probs[pred_idx]:0.4f}'
btn_run.on_click(on_click_classify) # run the classifier when the button is clicked
btn_upload = widgets.FileUpload()
from ipywidgets import *
VBox([widgets.Label('Select your LOTR Fellowship character!'),
      btn_upload, btn_run, out_pl, lbl_pred])

c. Be amazed!

I realized that this is probably the mechanism behind those fun quizzes that match a picture of a person to a Disney princess or villain, or in this case, a LOTR character.

Luckily I didn’t include Gollum among the characters ;)

8. Save the model.

learn.export() # saves the trained model as a .pkl file
path = Path()
path.ls(file_exts = '.pkl')

I hope the coding was fun! 🙂


We worked through a set of libraries and code that allowed us to complete the entire process, from creating a data set to building an application.


To the Fast.ai founders and forum contributors.

To Joedockrill for a beginner-friendly library.

To SharpestMinds and Farid Hassainia / Ice Vision for mentoring.

