Artificially Intelligent - Exploring the Custom Vision Service

By Frank La Vigne | August 2018

One of the more fascinating cognitive services available in Microsoft Azure is the Custom Vision Service, which allows users to easily create custom computer vision models. Traditionally, training an image classifier required thousands, if not tens of thousands, of images per class to generate a model accurate enough for practical use. You would also need a strong grasp of the mechanics of neural networks. The Custom Vision Service provides an easy-to-use Web UI that masks this underlying complexity.

Better still, the Custom Vision Service can create fairly accurate custom image classifiers with as few as 15 or 20 images per class, although the documentation recommends a minimum of 50 images per class. Generally, the quality of the image classifier will improve with more training images. The Custom Vision Service also exposes trained models via a REST API to simplify the deployment of these models.

At the Build 2018 conference, one of the keynote demos involved an unmanned aerial vehicle, or drone, flying over a series of pipes to detect defects and maintenance issues. In the demo, a drone streamed video from its camera back to a laptop. The laptop evaluated the images against a trained custom vision model, which alerted the drone operator to any defects seen in the pipes. For this article, I’ll build a similar artificial intelligence (AI)-augmented maintenance model for detecting issues with train tracks. To do so, I’ll explore the Custom Vision Service and demonstrate how easy it is to add custom image classification to any app or Web site.

Sourcing Images to Train On

Creating a computer vision model with the Custom Vision Service involves three simple steps:

  • Upload a set of images and label them.
  • Train and generate the model by clicking on the Train button.
  • Use the Web interface or REST APIs to evaluate the model’s performance.

It’s actually that easy to get started. But care must be taken as you begin sourcing images on which to train your model.

    Recently, for a customer demo, I collected a series of images of train tracks via the Internet. These images included normal train tracks and those with severe defects. With as few as 30 images, I was able to generate a custom vision model to detect track defects with around 80 percent accuracy as part of my presentation. As impressive as that may be, I was quick to point out to the customer that any production-ready image classification model would require images of train tracks in a variety of lighting conditions, angles and so on. This diversity of training data reduces the likelihood of the algorithm isolating incorrect features of the training images, which would wreak havoc with the results.

    There’s a classic, and possibly apocryphal, cautionary tale about training neural networks. The legend explains what researchers in the 1980s faced when the U.S. Army decided to train a neural network to detect tanks. According to the story, the neural network performed very well against the test data as part of the experiment. However, when the neural network was given new images to classify, it performed terribly.

    The researchers were dumbfounded until they realized that all the images containing tanks had been taken on overcast days, while all the images lacking tanks were taken on sunny days. The neural network separated the two classes of photos and chose to distinguish the two sets based on the color of the sky rather than the presence of a tank. Obviously, this is not what the Army needed. For a full exploration of the tank detection story, you can check out Neil Fraser’s original write-up, but this part of his article is worth quoting here:

    “It is a perfect illustration of the biggest problem behind neural networks. Any automatically trained net with more than a few dozen neurons is virtually impossible to analyze and understand. One can’t tell if a net has memorized inputs, or is ‘cheating’ in some other way.”

    Although the article was last updated in 2003, this quote still holds true today. Work is being done to better understand the inner workings of complex neural networks, but this effort is ongoing and is barely beyond its nascent phase. The best practice currently is to source a varied set of training data. This helps the algorithm distinguish images based on the correct variables.

    To avoid image copyright issues, I’ve decided to leverage my children’s large toy train track collection for this article. This also allows me to create training and testing imagery quickly. I’ve taken 34 images and split them into three folders named Broken, Normal and Test. Images with “broken” train tracks are in the Broken folder, and images with contiguous tracks are in the Normal folder. I’ve randomly chosen a series of images to test the model on and placed them in the Test folder. With my source images selected and labeled, it’s time to create a model.
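Setting aside test images at random, as described above, can also be scripted. The following is a minimal sketch rather than the procedure actually used for this article: the file names are hypothetical stand-ins for the 34 photos, and the number of held-out images (`test_count=4`) is an assumption chosen only for illustration.

```python
import random


def split_holdout(image_names, test_count, seed=None):
    """Randomly set aside `test_count` images for testing.

    Returns a (train, test) pair of lists. Every image lands in exactly
    one of the two lists.
    """
    if not 0 <= test_count <= len(image_names):
        raise ValueError("test_count must be between 0 and the number of images")
    rng = random.Random(seed)          # a seed makes the split repeatable
    shuffled = sorted(image_names)     # sort first so the seed fully fixes the order
    rng.shuffle(shuffled)
    return shuffled[test_count:], shuffled[:test_count]


# Hypothetical file names standing in for the 34 photos.
images = [f"track_{i:02d}.jpg" for i in range(34)]
train, test = split_holdout(images, test_count=4, seed=42)
print(len(train), len(test))  # 30 4
```

Keeping the held-out images away from the training upload is what makes the later test meaningful: the model is then judged on pictures it has never seen.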


    Creating the Model

    In a browser, I go to the Custom Vision Service Web site and click on the Sign In button to sign in with my Microsoft account. Once logged in, I click on New Project to create a new custom vision project.

    In the dialog that follows, I enter “Toy Trains” for the project name and a brief description of the project. I leave the Resource Group dropdown list at the default setting of Limited trial, and make sure the radio buttons for Project Types and Domains are set to Classification and General, respectively. Next, I click the Create project button to create the project. Figure 1 shows these settings.

    Figure 1 The Create New Project Dialog

    Uploading the Images

    Once the project is created, the Web site prompts you to upload images. I click Add Images and in the following dialog click Browse local files to find images to upload. First, I upload the images in the Broken folder by selecting them all and clicking Open in the browsing dialog. In the text box below the images, I enter the term Broken and click the blue plus sign to tag the images with the Broken label. I then click the Upload button to upload the 15 images. This step is shown in Figure 2. Next, I click Done to finish this step.

    Figure 2 Tagged Broken Track Images

    Now that the images of broken train tracks are uploaded, it’s time to upload the images in the Normal folder. To do this, I click the icon on the upper left of the pane with the plus sign on it. This button is highlighted in Figure 3. I repeat the previous steps to upload the input images, except this time, I choose the images in the Normal folder and tag the images with the label Normal. I click the Upload button to transfer the images and then click Done to close the dialog.

    Figure 3 Adding More Images to the Training Data

    With all the sample images uploaded and tagged, it’s time to train and test the model.

    Training the Model

    Toward the top of the page, there’s a green button with gears on it. I click on this to train a model with the currently labeled images. It takes a moment or two before the training is complete and the Web page returns with the performance metrics of the model, which are shown in Figure 4.

    Figure 4 Training Results and Performance Metrics

    The two primary metrics displayed are the model’s precision and recall. Precision indicates how often a predicted tag is correct: Of all the images the model labels “Broken,” what fraction really are broken? Recall measures how many of the images that actually belong to a tag the model manages to find. In other words, recall indicates that when the correct answer is “Broken,” how often the model will predict “Broken.” For a further explanation of confusion matrix terminology, read the “Simple Guide to Confusion Matrix Terminology.”
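To make the two metrics concrete, here is a small worked example. The label lists are made up for illustration and are not results from the toy train model; the function simply applies the standard definitions of precision and recall to one tag.

```python
def precision_recall(actual, predicted, tag):
    """Compute precision and recall for a single tag.

    precision = true positives / everything predicted as `tag`
    recall    = true positives / everything that really is `tag`
    """
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == tag and p == tag)
    false_pos = sum(1 for a, p in pairs if a != tag and p == tag)
    false_neg = sum(1 for a, p in pairs if a == tag and p != tag)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Made-up labels: five images are really Broken, three are really Normal.
actual = ["Broken"] * 5 + ["Normal"] * 3
predicted = ["Broken", "Broken", "Broken", "Normal", "Normal",
             "Broken", "Normal", "Normal"]

precision, recall = precision_recall(actual, predicted, "Broken")
print(precision, recall)  # 0.75 0.6
```

Here the model predicted “Broken” four times and was right three times (precision 0.75), but it found only three of the five images that really were broken (recall 0.6).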

    Testing the Model

    Now it’s time to test the model on the images set aside for testing to see how well the model will perform when presented with new images. To the immediate right of the green training button, there’s a button labeled Quick Test. I click on this to bring up the testing dialog and then click the Browse local files button to bring up the file upload dialog. I use this dialog to select an image in the Test folder and click Open to upload the file. After a moment, the algorithm evaluates the image and displays its results. Figure 5 shows that this image has a 79.2 percent probability of being a Normal track, and a 60.2 percent probability of being Broken. Remember that for this project, Normal means the track is unbroken, or contiguous.

    Figure 5 Quick Test Dialog for a Contiguous Set of Tracks
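If the same scores are consumed by code rather than read off the dialog, picking the winning tag is a one-liner. The sketch below assumes each prediction is a dictionary with `tagName` and `probability` keys; that shape is an assumption modeled on a typical prediction response and should be checked against the actual payload. The numbers are the ones shown in Figure 5.

```python
def top_prediction(predictions):
    """Return the prediction with the highest probability."""
    return max(predictions, key=lambda p: p["probability"])


# Scores from the Quick Test in Figure 5 (the dictionary shape is assumed).
predictions = [
    {"tagName": "Normal", "probability": 0.792},
    {"tagName": "Broken", "probability": 0.602},
]

best = top_prediction(predictions)
print(best["tagName"])  # Normal
```

Note that the two scores do not add up to 100 percent, so the sketch compares them directly instead of treating one as the complement of the other.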

    I run the test again by clicking once more on the Browse local files button. This time I pick an image of a broken set of tracks. The model reports that there’s a 95.9 percent chance of this being a broken track and only an 11.2 percent chance of the track being normal. I test the model out with the rest of the files in the Test folder to get an idea of where the model performs well and where it has trouble identifying the correct state of the tracks.
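Running through a whole folder of test images by hand gets tedious, and the REST API mentioned earlier is the natural way to script it. The following is a hedged sketch, not the service’s official client: the `Prediction-Key` header and the `{"Url": ...}` request body reflect my understanding of the prediction API, while `prediction_url` and `prediction_key` are placeholders that must be copied from your own project’s settings.

```python
import json
import urllib.request


def build_prediction_request(prediction_url, prediction_key, image_url):
    """Build (but do not send) a POST request asking the service to classify an image URL.

    prediction_url and prediction_key are placeholders; take the real
    values from your project's settings page.
    """
    body = json.dumps({"Url": image_url}).encode("utf-8")
    return urllib.request.Request(
        prediction_url,
        data=body,
        headers={
            "Prediction-Key": prediction_key,    # assumed header name
            "Content-Type": "application/json",
        },
        method="POST",
    )


def classify_image_url(prediction_url, prediction_key, image_url):
    """Send the request and return the parsed JSON response."""
    request = build_prediction_request(prediction_url, prediction_key, image_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Splitting request construction from sending keeps the first function easy to inspect and test without network access; only `classify_image_url` actually contacts the service.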

    For further experimentation, I perform an image search on the Internet and copy the image URL and paste it into the text box on the Test UI. These images depict train tracks from different angles and on different surfaces in lighting conditions unlike the training set of images. This will provide a perspective on how neural networks work and why a diverse set of sample data is crucial to the success of a production-ready model. The following three URLs point to some example images that I found online (note how the model classifies each of the images):


    Furthermore, it’s useful to know how the classifier will behave when given an image outside of its scope. For example, how will this classifier, designed to determine the state of toy train tracks, work when given an image of a neon sign? This is referred to as negative image handling, and an effective model ought to predict values close to zero in these instances. To test this out, I enter the URL of a neon sign image into the text box. My predictive model produced a result of 1.7 percent for Normal and 0 percent for Broken.
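An app that consumes these scores can turn negative image handling into an explicit rule: if no tag clears a minimum probability, treat the image as out of scope. This is a minimal sketch; the 0.5 threshold is an arbitrary assumption to tune for your own model, and the dictionary shape with `tagName` and `probability` keys is likewise assumed. The scores are the ones reported in this article.

```python
def classify_with_threshold(predictions, threshold=0.5):
    """Return the best tag name, or None if no tag reaches the threshold."""
    best = max(predictions, key=lambda p: p["probability"])
    if best["probability"] < threshold:
        return None  # nothing is confident enough: likely out of scope
    return best["tagName"]


# Scores reported in the article (dictionary shape is assumed).
neon_sign = [
    {"tagName": "Normal", "probability": 0.017},
    {"tagName": "Broken", "probability": 0.0},
]
broken_track = [
    {"tagName": "Broken", "probability": 0.959},
    {"tagName": "Normal", "probability": 0.112},
]

print(classify_with_threshold(neon_sign))     # None
print(classify_with_threshold(broken_track))  # Broken
```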

    Going Further

    For purposes of this example, I created a two-class image classification system, where the model only had to label images of toy train tracks as either broken or normal. However, I can create more complex models. Currently, the Custom Vision Service S0 (Standard) tier can support up to 250 unique tags, meaning that it can be trained to classify 250 separate labels. Additionally, the service can handle up to 50,000 images per project.

    Furthermore, the models generated by the Custom Vision Service can be exported for use on edge computing devices and do not rely on access to the cloud service. This means that I could load the model onto a device and classify images in an offline scenario. Currently, the Custom Vision Service supports model exports in three formats: TensorFlow for Android devices, CoreML for iOS devices and ONNX for Windows devices. Additionally, the Custom Vision Service can create a Windows or Linux container that incorporates a TensorFlow model and code to call the REST API. For more details on exporting models to run on edge devices, be sure to refer to the documentation.

    In this article, I demonstrated how easy it is to get started with the Custom Vision Service and build out a computer vision model with a limited set of training data. One potential use for this technology would be to access image data from traffic cameras and train a model to detect different levels of congestion based solely on that data. Previously, automating such a task would be daunting, as it required both specialized knowledge and a massive amount of labeled training data. However, the Custom Vision Service combines cutting-edge neural network technology with an easy-to-use interface to create a tool that opens up machine vision to more widespread use.

    Frank La Vigne works at Microsoft as an AI Technology Solutions Professional where he helps companies achieve more by getting the most out of their data with analytics and AI. He also co-hosts the DataDriven podcast. He blogs regularly, and you can watch him on his YouTube channel, “Frank’s World TV” (FranksWorld.TV).

    Thanks to the following technical experts for reviewing this article: Andy Leonard and Jonathan Wood