By James McCaffrey | February 2018
The Microsoft Cognitive Toolkit (CNTK) library is a powerful set of functions that allows you to create machine learning (ML) prediction systems. I provided an introduction to version 2 in the July 2017 issue (msdn.com/magazine/mt784662). In this article, I explain how to use CNTK to make a deep neural network classifier. A good way to see where this article is headed is to take a look at the screenshot in Figure 1.
Figure 1 Wheat Seed Variety Prediction Demo
The CNTK library is written in C++ for performance reasons, but the most common way to call into the library functions is to use the CNTK Python language API. I invoked the demo program by running python seeds_dnn.py in an ordinary Windows 10 command shell.
The goal of the demo program is to create a deep neural network that can predict the variety of a wheat seed. Behind the scenes, the demo program uses a set of training data that looks like this:
The training data has 150 items. Each line represents one of three varieties of wheat seed: “Kama,” “Rosa” or “Canadian.” The first seven numeric values on each line are the predictor values, often called attributes or features in machine learning terminology. The predictors are seed area, perimeter, compactness, length, width, asymmetry coefficient, and groove length. The item-to-predict (often called the class or the label) fills the last three columns and is encoded as 1 0 0 for Kama, 0 1 0 for Rosa, and 0 0 1 for Canadian.
The demo program also uses a test data set of 60 items, 20 of each seed variety. The test data has the same format as the training data.
The demo program creates a 7-(4-4-4)-3 deep neural network. The network is illustrated in Figure 2. There are seven input nodes (one for each predictor value), three hidden layers, each of which has four processing nodes, and three output nodes that correspond to the three possible encoded wheat seed varieties.
Figure 2 Deep Neural Network Structure
The demo program trains the network with the stochastic gradient descent (SGD) algorithm, using 5,000 batches of 10 items each. After the prediction model has been trained, it’s applied to the 60-item test data set. The model achieved 78.33 percent accuracy, meaning it correctly predicted 47 of the 60 test items.
The demo program concludes by making a prediction for an unknown wheat seed. The seven input values are (17.6, 15.9, 0.8, 6.2, 3.5, 4.1, 6.1). The computed raw output node values are (1.0530, 2.5276, -3.6578) and the associated output node probability values are (0.1859, 0.8124, 0.0017). Because the middle value is largest, the output maps to (0, 1, 0) which is variety Rosa.
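The step from raw output values to probabilities is the softmax function. The following is a minimal pure-Python sketch of that calculation, using the raw values quoted above; the function and variable names are illustrative and are not taken from the demo program.

```python
import math

def softmax(raw):
    """Convert raw output node values to probabilities that sum to 1."""
    m = max(raw)                          # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in raw]
    total = sum(exps)
    return [e / total for e in exps]

varieties = ["Kama", "Rosa", "Canadian"]   # order of the 1-of-N encoding
raw_outputs = [1.0530, 2.5276, -3.6578]    # raw output node values from the demo

probs = softmax(raw_outputs)
predicted = varieties[probs.index(max(probs))]
print([round(p, 4) for p in probs], predicted)
```

The index of the largest probability selects the predicted variety, which is the same as choosing the position of the 1 in the encoded label.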
This article assumes you have intermediate or better programming skills with a C-family language, and a basic familiarity with neural networks. But regardless of your background, you should be able to follow along without too much trouble. The complete source code for the seeds_dnn.py program is presented in this article. The code, and the associated training and test data files, are also available in the file download that accompanies this article.
Installing CNTK v2
Because CNTK v2 is relatively new, you may not be familiar with the installation process. Briefly, you first install a Python language distribution (I strongly recommend the Anaconda distribution) which contains the core Python language and required Python packages, and then you install CNTK as an additional Python package. In other words, CNTK is not a standalone install.
At the time of this writing, the current version of CNTK is v2.3. Because CNTK is under vigorous development, by the time you read this, there could well be a newer version. I used the Anaconda distribution version 4.1.1 (which contains Python version 3.5.2, NumPy version 1.11.1, and SciPy version 0.17.1). After installing Anaconda, I installed the CPU-only version of CNTK using the pip utility program. Installing CNTK can be a bit tricky if you’re careless with versioning compatibility, but the CNTK documentation describes the installation process in detail.
Creating most machine learning systems starts with the time-consuming and often annoying process of setting up the training and test data files. The raw wheat seeds data set can be found at bit.ly/2idhoRK. The raw 210-item tab-delimited data looks like this:
I wrote a utility program to generate a file in a format that can be easily handled by CNTK. The resulting 210-item file looks like:
The utility program added a leading "|properties" tag to identify the location of the features, and a "|variety" tag to identify the location of the class to predict. The raw class values were 1-of-N encoded (sometimes called one-hot encoding), tabs were replaced by single blank space characters, and all predictor values were formatted to exactly four decimals.
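The conversion described above can be sketched in a few lines of Python. This is not the utility program itself: the function name is hypothetical, the sample line uses illustrative values, and the sketch assumes the raw file stores the variety as a class code of 1, 2, or 3 in the last column.

```python
def to_cntk_line(raw_line, num_classes=3):
    """Convert one whitespace/tab-delimited raw line to CNTK text format."""
    fields = raw_line.split()
    features = [float(v) for v in fields[:-1]]   # the predictor values
    label = int(fields[-1])                      # assumed class code: 1, 2 or 3
    # 1-of-N (one-hot) encoding of the class code
    one_hot = ["1" if k == label else "0" for k in range(1, num_classes + 1)]
    feature_text = " ".join("%0.4f" % v for v in features)
    return "|properties " + feature_text + " |variety " + " ".join(one_hot)

# illustrative raw line: seven predictors followed by a class code
raw = "15.26\t14.84\t0.871\t5.763\t3.312\t2.221\t5.22\t1"
print(to_cntk_line(raw))
```

Formatting every predictor with "%0.4f" produces the fixed four decimals, and joining with single spaces replaces the tabs.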
In most situations you’ll want to normalize numeric predictor values so they all have roughly the same range. I didn’t normalize this data, in order to keep this article a bit simpler. Two common forms of normalization are z-score normalization and min-max normalization. In general, in non-demo scenarios you should normalize your predictor values.
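For reference, the two normalization schemes can be sketched as follows. The demo does not use this code; the function names and the sample values are illustrative only.

```python
import statistics

def z_score(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)        # population standard deviation
    return [(v - mean) / sd for v in values]

def min_max(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# illustrative values for one predictor column, not taken from the data set
areas = [15.26, 14.88, 17.63, 11.23, 12.30]
print([round(v, 4) for v in z_score(areas)])
print([round(v, 4) for v in min_max(areas)])
```

In practice each predictor column would be normalized separately, and the statistics computed from the training data would be reused for the test data.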
Next, I wrote another utility program that took the 210-item data file in CNTK format, and then used the file to generate a 150-item training data file named seeds_train_data.txt (the first 50 of each variety) and a 60-item test file named seeds_test_data.txt (the last 20 of each variety).
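A split of this kind might look like the sketch below. It assumes the 210 lines are grouped by variety in three consecutive blocks of 70, which the text implies but does not state outright; the function name and the stand-in data are hypothetical.

```python
def split_by_variety(lines, per_variety=70, num_train=50):
    """Return (train, test): the first num_train lines of each variety
    block go to train, the remaining lines of the block go to test."""
    train, test = [], []
    for start in range(0, len(lines), per_variety):
        block = lines[start:start + per_variety]
        train.extend(block[:num_train])
        test.extend(block[num_train:])
    return train, test

# stand-in for the 210 CNTK-format lines, grouped by variety
lines = ["item %d" % i for i in range(210)]
train, test = split_by_variety(lines)
print(len(train), len(test))
```

Writing the two lists to seeds_train_data.txt and seeds_test_data.txt would complete the step.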
Because there are seven predictor variables, it’s not feasible to make a full graph of the data. But you can get a rough idea of the data’s structure from the graph of partial data in Figure 3. I used just the seed perimeter and seed compactness predictor values of the 60-item test dataset.
Figure 3 Partial Graph of the Test Data
I used Notepad to write the demo program. I like Notepad but most of my colleagues prefer one of the many excellent Python editors that are available. The free Visual Studio Code editor with the Python language add-in is especially nice. The complete demo program source code, with a few minor edits to save space, is presented in Figure 4. Note that the backslash character is used by Python for line-continuation.
The demo begins by importing the required NumPy and CNTK packages, and assigning shortcut aliases of np and C to them. Function create_reader is a program-defined helper that can be used to read training data (if the is_training parameter is set to True) or test data (if is_training is set to False).
You can consider the create_reader function as boilerplate code for neural classification problems. The only things you’ll need to change in most situations are the two string values of the field arguments in the calls to the StreamDef function, “properties” and “variety” in the demo.
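A reader helper along these lines could be written as follows. This is a sketch built from the description rather than the demo listing: the parameter order and local variable names are assumptions, and the field names are taken from the tags in the data file.

```python
def create_reader(path, is_training, input_dim, output_dim):
    """Build a CNTK minibatch source for a file in CNTK text format."""
    import cntk as C   # imported here; calling this requires the CNTK v2 package

    # field names must match the tags in the data file
    x_strm = C.io.StreamDef(field='properties', shape=input_dim, is_sparse=False)
    y_strm = C.io.StreamDef(field='variety', shape=output_dim, is_sparse=False)
    streams = C.io.StreamDefs(x_src=x_strm, y_src=y_strm)
    deserial = C.io.CTFDeserializer(path, streams)

    # training: shuffle and cycle through the data indefinitely;
    # testing: read the file once, in order
    sweeps = C.io.INFINITELY_REPEAT if is_training else 1
    return C.io.MinibatchSource(deserial, randomize=is_training,
                                max_sweeps=sweeps)
```

The keyword names x_src and y_src passed to StreamDefs are arbitrary; they become the attribute names used later to map data streams to the network's input and output variables.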
All the program control logic is contained in a single main function. All normal error checking code has been removed to keep the size of the demo small and to help keep the main ideas clear. Note that I indent two spaces rather than the more usual four spaces to save space.
The main function begins by setting up the neural network architecture dimensions:
Because CNTK is under rapid development, it’s a good idea to print out or comment the version being used. The demo has three hidden layers, all of which have four nodes. The number of hidden layers, and the number of nodes in each layer, must be determined by trial and error. You can have a different number of nodes in each layer if you wish. For example, hidden_dim = [10, 8, 10, 12] would correspond to a deep network with four hidden layers, with 10, 8, 10, and 12 nodes respectively.
Next, the location of the training and test data files is specified and the network input and output vectors are created:
Notice I put the training and test files in a separate Data subdirectory, which is a common practice because you often have many different data files during model creation. Using the np.float32 data type is much more common than the np.float64 type because the additional precision gained using 64 bits usually isn’t worth the performance penalty you incur.
Next, the network is created:
There’s a lot going on here. The Python with statement is shortcut syntax to apply a set of common values to multiple layers of a network. Here, all weights are given a Gaussian (bell-shaped curve) random value with a standard deviation of 0.1 and a mean of 0. Setting a seed value ensures reproducibility. CNTK supports a large number of initialization algorithms, including “uniform,” “glorot,” “he” and “xavier.” Deep neural networks are often surprisingly sensitive to the choice of initialization algorithm, so when training fails, one of the first things to try is an alternative initialization algorithm.
The three hidden layers are defined using the Dense function, so named because each node is fully connected to the nodes in the layers before and after. The syntax used can be confusing. Here, X acts as input to hidden layer h1. The h1 layer acts as input to hidden layer h2, and so on.
Notice that the output layer uses no activation function so the output nodes will have values that don’t necessarily sum to 1. If you have experience with other neural network libraries, this requires some explanation. With many other neural libraries you’d use softmax activation on the output layer so that the output values always sum to 1 and can be interpreted as probabilities. Then, during training, you’d use cross-entropy error (also called log loss), which requires a set of values that sums to 1.
But, somewhat surprisingly, CNTK v2.3 doesn’t have a basic cross-entropy error function for training. Instead, CNTK has a cross entropy with softmax function. This means that during training, output node values are converted on the fly to probabilities using softmax to compute an error term.
So, with CNTK, you train a deep network on raw output node values, but when making predictions, if you want prediction probabilities as is usually the case, you must apply the softmax function explicitly. The approach used by the demo is to train on the “nnet” object (no activation in the output layer), but create an additional “model” object, with softmax applied, for use when making predictions.
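Putting the pieces together, the network and model creation might look like the sketch below. It is reconstructed from the description and is not the demo's exact listing: the tanh hidden-layer activation, the seed value, and the layer and function names are assumptions.

```python
def create_network(input_dim=7, hidden_dim=4, output_dim=3):
    """Return (X, Y, nnet, model) for a 7-(4-4-4)-3 classifier."""
    import numpy as np
    import cntk as C   # imported here; calling this requires the CNTK v2 package

    X = C.ops.input_variable(input_dim, np.float32)
    Y = C.ops.input_variable(output_dim, np.float32)

    # Gaussian initial weights with standard deviation 0.1;
    # the seed value is arbitrary and only makes runs reproducible
    with C.layers.default_options(init=C.initializer.normal(scale=0.1, seed=2)):
        h1 = C.layers.Dense(hidden_dim, activation=C.ops.tanh, name='hidLayer1')(X)
        h2 = C.layers.Dense(hidden_dim, activation=C.ops.tanh, name='hidLayer2')(h1)
        h3 = C.layers.Dense(hidden_dim, activation=C.ops.tanh, name='hidLayer3')(h2)
        o_layer = C.layers.Dense(output_dim, activation=None, name='outLayer')(h3)

    nnet = o_layer             # raw output values, used for training
    model = C.softmax(nnet)    # probabilities, used when making predictions
    return X, Y, nnet, model
```

Both nnet and model share the same weights, so training nnet also updates what model computes.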
Now, it is, in fact, possible to use softmax activation on the output layer, and then use cross entropy with softmax during training. This approach results in softmax being applied twice, first to raw output values and then again to the normalized output node values. As it turns out, although this approach will work, for rather complex technical reasons, training isn’t as efficient.
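The effect of applying softmax twice can be seen with a small pure-Python calculation. This is only an illustration of the arithmetic for a single item, using the raw output values from the demo run; it is not CNTK code, and the function below merely mimics what a cross entropy with softmax function computes.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_with_softmax(values, target):
    """Cross-entropy error for one item: softmax is applied to the values,
    then the negative log of the target class probability is taken."""
    return -math.log(softmax(values)[target])

raw = [1.0530, 2.5276, -3.6578]   # raw output node values from the demo
target = 1                        # index of the correct class (Rosa)

once = softmax(raw)     # what training sees when the output layer is raw
twice = softmax(once)   # what it sees when the output layer also uses softmax

print([round(p, 4) for p in once], round(cross_entropy_with_softmax(raw, target), 4))
print([round(p, 4) for p in twice], round(cross_entropy_with_softmax(once, target), 4))
```

Because the inputs to the second softmax all lie between 0 and 1, the resulting probabilities are squashed toward equal values, and the error for the same correct prediction is larger. That flattening hints at why training on an already-normalized output layer is less effective.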
Chaining hidden layers is feasible up to a point. For very deep networks, CNTK supports a meta function named Sequential that provides a shortcut syntax for creating multi-layered networks. The CNTK library also has a Dropout function that can be used to help prevent model overfitting. For example, to add dropout to the first hidden layer, you could modify the demo code like this:
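Both ideas can be sketched as follows. The listing is a reconstruction, not the demo code: the 0.50 dropout rate, the tanh activation, the seed value, and all function and layer names are assumptions, and X is expected to be a CNTK input variable with seven values.

```python
def create_network_with_dropout(X, hidden_dim=4, output_dim=3, drop_rate=0.50):
    """Manual chaining, with dropout applied after the first hidden layer."""
    import cntk as C   # imported here; calling this requires the CNTK v2 package

    with C.layers.default_options(init=C.initializer.normal(scale=0.1, seed=2)):
        h1 = C.layers.Dense(hidden_dim, activation=C.ops.tanh, name='hidLayer1')(X)
        d1 = C.layers.Dropout(drop_rate, name='drop1')(h1)   # h1 feeds the dropout
        h2 = C.layers.Dense(hidden_dim, activation=C.ops.tanh, name='hidLayer2')(d1)
        h3 = C.layers.Dense(hidden_dim, activation=C.ops.tanh, name='hidLayer3')(h2)
        o_layer = C.layers.Dense(output_dim, activation=None, name='outLayer')(h3)
    return o_layer

def create_network_sequential(X, hidden_dim=4, output_dim=3):
    """The same three-hidden-layer network expressed with Sequential."""
    import cntk as C

    with C.layers.default_options(init=C.initializer.normal(scale=0.1, seed=2)):
        nnet = C.layers.Sequential([
            C.layers.Dense(hidden_dim, activation=C.ops.tanh),
            C.layers.Dense(hidden_dim, activation=C.ops.tanh),
            C.layers.Dense(hidden_dim, activation=C.ops.tanh),
            C.layers.Dense(output_dim, activation=None)])(X)
    return nnet
```

With Sequential, the layers are listed in order and the input is applied once at the end, so there are no intermediate h1, h2, h3 names to keep track of.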
Many of my colleagues prefer to always use Sequential, even for deep neural networks that only have a few hidden layers. I prefer manual chaining, but this is just a matter of style.
After creating a neural network and model, the demo program creates a Learner object and a Trainer object:
You can think of a Learner as an algorithm and a Trainer as an object that uses the Learner algorithm. The tr_loss (“training loss”) object defines how to measure error between network-computed output values and known correct output values in the training data. For classification, cross entropy is almost always used, but CNTK supports several alternatives. The “with_softmax” part of the function name indicates that the function expects raw output node values rather than values normalized with softmax. This is why the output layer doesn’t use an activation function.
The tr_clas (“training classification error”) object defines how the number of correct and incorrect predictions is calculated during training. CNTK defines a classification error (percentage of incorrect predictions) library function rather than the classification accuracy function used by some other libraries. So, there are two forms of error being calculated during training. The tr_loss error is used to adjust the weights and biases. The tr_clas is used to monitor prediction accuracy.
The Learner object uses the SGD algorithm with a constant learning rate set to 0.01. SGD is the simplest training algorithm but it’s rarely the best-performing one. CNTK supports a large number of learner algorithms, some of which are very complex. As a rule of thumb, I recommend starting with SGD and only trying more exotic algorithms if training fails. The Adam algorithm (Adam isn’t an acronym) is usually my second choice.
Notice the unusual syntax for creating a Trainer object. The two loss function objects are passed as a Python tuple, indicated by the parentheses, but the Learner object is passed as a Python list, indicated by square brackets. You can pass multiple Learner objects to a Trainer, though the demo program passes just one.
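The Learner and Trainer setup described above might be written like this. The tr_loss and tr_clas names and the 0.01 learning rate come from the text; the wrapper function and its parameter names are hypothetical, with nnet being the raw-output network and Y the output variable.

```python
def create_trainer(nnet, Y, learn_rate=0.01):
    """Return a Trainer that applies plain SGD to the raw-output network."""
    import cntk as C   # imported here; calling this requires the CNTK v2 package

    tr_loss = C.cross_entropy_with_softmax(nnet, Y)   # used to adjust weights
    tr_clas = C.classification_error(nnet, Y)         # used for monitoring only
    learner = C.sgd(nnet.parameters, learn_rate)
    # loss functions are passed as a tuple, learners as a list
    return C.Trainer(nnet, (tr_loss, tr_clas), [learner])
```

Swapping in a different algorithm, such as Adam, would mean replacing only the line that creates the learner.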
The code that actually performs training is:
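The listing is not reproduced here, so the following is a sketch of a training loop consistent with the surrounding description: 5,000 batches of 10 items, with progress shown every 1,000 iterations. The function and variable names are assumptions; trainer is a CNTK Trainer, rdr is a minibatch source whose streams are named x_src and y_src, and X and Y are the network's input and output variables.

```python
def train(trainer, rdr, X, Y, max_iter=5000, batch_size=10):
    """Train for max_iter minibatches; return the progress readings."""
    # map the network's variables to the reader's data streams
    input_map = {X: rdr.streams.x_src, Y: rdr.streams.y_src}
    progress = []
    for i in range(max_iter):
        curr_batch = rdr.next_minibatch(batch_size, input_map=input_map)
        trainer.train_minibatch(curr_batch)
        if i % 1000 == 0:
            loss = trainer.previous_minibatch_loss_average
            err = trainer.previous_minibatch_evaluation_average
            acc = (1.0 - err) * 100     # error fraction -> percent accuracy
            print("batch %6d: mean loss = %0.4f, mean accuracy = %0.2f%%"
                  % (i, loss, acc))
            progress.append((i, loss, acc))
    return progress
```

The accuracy figure is derived from the classification error that the Trainer reports for the most recent batch, so it reflects only those 10 items, not the whole training set.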
It’s important to monitor training progress because training often fails. Here, the average cross-entropy error on the just-used batch of 10 training items is displayed every 1,000 iterations. The demo displays the average classification accuracy (percentage of correct predictions on the current 10 items), which I think is a more natural metric than classification error (percentage of incorrect predictions).
Saving the Trained Model
Because there are only 150 training items, the demo neural network can be trained in just a few seconds. But in non-demo scenarios, training a very deep neural network can take hours, days or even longer. After training, you’ll want to save your model so you won’t have to retrain from scratch. Saving and loading a trained CNTK model is very easy. To save, you can add code like this to the demo program:
The first argument passed to the save function is just a filename, possibly including a path. There’s no required file extension, but using “.model” makes sense. The format parameter has the default value ModelFormat.CNTKv2, so it could’ve been omitted. An alternative is to use the new Open Neural Network Exchange (ONNX) format, ModelFormat.ONNX.
Recall that the demo program created both an nnet object (with no softmax on the output) and a model object (with softmax). You’ll normally want to save the softmax version of a trained model, but you can save the non-softmax object if you wish.
Once a model has been saved, you can load it into memory like so:
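Saving and loading could be wrapped as in the sketch below. The two helper functions are hypothetical conveniences, not part of the demo; the underlying calls are the save method of a CNTK Function object and the static load method of the Function class.

```python
def save_model(model, path):
    """Save a trained CNTK function in the default CNTKv2 format."""
    import cntk as C   # imported here; calling this requires the CNTK v2 package
    model.save(path, format=C.ModelFormat.CNTKv2)

def load_model(path):
    """Load a previously saved CNTK function."""
    import cntk as C
    return C.ops.functions.Function.load(path)
```

A call such as save_model(model, ".\\Models\\seeds_dnn.model") followed later by model = load_model(".\\Models\\seeds_dnn.model") would round-trip the softmax version of the network; the path shown is only an example.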
And then the model can be used as if it had just been trained. Notice that there’s a bit of asymmetry in the calls to save and load: save is a method on a Function object and load is a static method from the Function class.
Wrapping Up
Many classification problems can be handled using a simple feed-forward neural network (FNN) with a single hidden layer. In theory, given certain assumptions, an FNN can handle any problem a deep neural network can handle. However, in practice, sometimes a deep neural network is easier to train than an FNN. The mathematical basis for these ideas is called the universal approximation theorem (or sometimes the Cybenko Theorem).
If you’re new to neural network classification, the number of decisions you have to make can seem intimidating. You must decide on the number of hidden layers, the number of nodes in each layer, an initialization scheme and activation function for each hidden layer, a training algorithm, and the training algorithm parameters such as learning rate and momentum term. However, with practice you’ll quickly develop a set of rules of thumb for the types of problems you deal with.
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products, including Internet Explorer and Bing. Dr. McCaffrey can be reached at [email protected]
Thanks to the following Microsoft technical experts who reviewed this article: Chris Lee, Ricky Loynd, Kenneth Tran