By James McCaffrey | February 2018 | Get the Code

Thompson sampling is an algorithm that can be used to find a solution to a multi-armed bandit problem, a term deriving from the fact that gambling slot machines are informally called “one-armed bandits.”

Suppose you’re standing in front of three slot machines. When you pull the arm on one of the machines, it will pay you either zero dollars or one dollar according to some probability distribution that’s unknown to you. For example, the machine might pay with a mean probability of 0.5, so if you pulled the handle on that machine 100 times, you’d expect to be paid zero dollars about 50 times and one dollar about 50 times.

Your goal is to identify the best machine, as efficiently as possible. The problem just described is an example of a Bernoulli bandit problem because the result is either 0 or 1. There are other types of bandit problems. For example, if each machine paid out a value, usually between zero and one dollar (such as 55.0 cents), according to a bell-shaped Gaussian distribution, you’d have a Gaussian process bandit problem. This article addresses only the Bernoulli bandit problem.

There are several algorithms that can be used to try to find the best machine. Suppose you’re limited to a total of 100 pulls on the three machines. You could try 10 pulls on each machine, keeping track of how well each machine performed. Then you could use your remaining 70 pulls on the one machine that paid out the most money during your 30-pull explore phase. The danger with this approach is that you might incorrectly identify the best machine. And if you use many pulls during the explore phase, you won’t have many left during the exploit phase.

Thompson sampling is a clever algorithm that continuously adjusts the estimated probabilities of each machine’s payout. As you’ll see shortly, the key to Thompson sampling for a Bernoulli bandit problem is the mathematical beta distribution.

It’s unlikely you’ll be asked by your boss to analyze slot machines, but multi-armed bandit problems appear in many important, practical scenarios. For example, suppose you work for a drug manufacturer. You’ve just created four new experimental drugs for cancer and you want to establish which of the four drugs is most effective, with a minimum of testing on live subjects. Or perhaps you work for an e-commerce company and you want to determine which of several new advertising campaigns is the most successful.

A good way to see where this article is headed is to take a look at the demo program in **Figure 1**. The demo sets up three Bernoulli machines with probabilities of payout of (0.3, 0.5, 0.7), respectively. In a non-demo scenario, the probabilities are unknown to you, of course. You are allowed a total of 10 pulls. The goal is to determine the best machine (machine #3) and pull its handle the most times.

**Figure 1 Thompson Sampling Demo**

In the first trial, you assume that all three machines have a payout probability of 0.5. The demo uses the beta distribution to generate three probabilities based on that assumption. These random probabilities are (0.7711, 0.1660, 0.1090). Because the probability associated with machine #1 is highest, its arm is pulled. In this case, machine #1 does not pay out.

In the second trial, behind the scenes, the demo believes that the first machine now has a payout probability that is less than 0.5. Beta sampling is used and this time the probabilities are (0.6696, 0.2250, 0.7654), so machine #3 is selected and played, and its estimated probability of payout is adjusted according to whether the machine pays out or not.

Over time, because machine #3 has the highest probability of payout, it will win more often than the other two machines, and each time machine #3 does pay out, the likelihood that it will be selected increases.

After the 10 trials, machine #1 was played three times and paid out just once so the simulation guesses its true probability of payout is approximately 1/3 = 0.33. Machine #2 was played three times and paid out twice (estimated p = 2/3 = .6667). Machine #3 was played four times and paid out three times (estimated p = 3/4 = 0.7500). So, in this example at least, the best machine was identified and was played the most often.

This article assumes you have intermediate or better programming skills, but doesn’t assume you know anything about Thompson sampling or beta distributions. The demo program is coded using C#, but you shouldn’t have too much trouble refactoring the code to another language, such as Visual Basic or Python, if you wish. The code for the demo program is presented in its entirety in this article, and is also available in the accompanying file download.

Thompson sampling for a Bernoulli bandit problem depends on the beta distribution. In order to understand the beta distribution you must understand probability distributions in general. There are many types of probability distributions, each of which has variations that depend on one or two parameters.

You may be familiar with the uniform distribution, which has two parameters, called min and max, or sometimes just a and b. A uniform distribution with min = 0.0 and max = 1.0 will return a p-value between 0.0 and 1.0 where each value is equally likely. Therefore, if you sampled 1,000 times from the uniform distribution, you’d expect to get about 100 p-values between 0.00 and 0.10, about 100 p-values between 0.10 and 0.20, and so on, to about 100 p-values between 0.90 and 1.00. If you graphed the results, you’d see a bar chart with 10 bars, all about the same height.

You might also be familiar with the normal (also called Gaussian) distribution. The normal distribution is also characterized by two parameters, the mean and the standard deviation. If you sampled 1,000 times from the normal distribution with mean = 0.0 and standard deviation = 1.0, you’d expect to get about 380 z-values between -0.5 and +0.5; about 240 z-values between +0.5 and +1.5 (and also between -0.5 and -1.5); about 60 z-values between +1.5 and +2.5 (and also between -1.5 and -2.5); and 10 z-values greater than +2.5 (and 10 less than -2.5). If you graphed the results you’d see a bell-shaped bar chart.

The beta distribution is characterized by two parameters, usually called alpha and beta, or sometimes just a and b. Note the possible confusion between “beta” representing the entire distribution, and “beta,” representing the second of the two distribution parameters.

If you sample from a beta distribution with a = 1 and b = 1, you get the exact same results as from the uniform distribution with mean = 0.5. If a and b have different values, when you sample from the beta distribution you get p-values that average to a / (a+b). For example, if a = 3 and b = 1, and you repeatedly sample, you will get p-values between 0.0 and 1.0, but the average of the p-values will be 3 / (3+1) = 0.75. This means that p-values between 0.90 and 1.00 will be the most common; p-values between 0.80 and 0.90 will be a bit less common; and so on, down to very few p-values between 0.00 and 0.10. The graph in **Figure 2** shows the results of 10,000 samples from the beta distribution with a = 3 and b = 1.

**Figure 2 Sampling from the Beta(3,1) Distribution**

The connection between the beta distribution and a Bernoulli bandit problem is quite subtle. Briefly, the beta distribution is the conjugate prior for the Bernoulli likelihood function. Although the underlying math is very deep, implementation of the Thompson algorithm is, fortunately, (relatively) simple.

The Demo ProgramTo create the demo program I launched Visual Studio and created a new console application named Thompson. The demo has no significant .NET Framework dependencies, so any version of Visual Studio is fine. After the template code loaded into the editor window, I right-clicked on file Program.cs and renamed it to the slightly more descriptive ThompsonProgram.cs, and allowed Visual Studio to automatically rename class Program for me. At the top of the template code, I deleted all references to unneeded namespaces, leaving just the reference to the top-level System namespace.

The overall program structure, with a few minor edits to save space, is presented in **Figure 3**. All the control logic is in the Main method. Sampling from the beta distribution is implemented using the program-defined BetaSampler class. All normal error checking is removed to save space.

The demo begins by setting up four arrays:

The demo uses three machines, but Thompson sampling can work with any number of machines. The mean probabilities of payouts are hardcoded. The closer the mean probabilities are to each other, the more difficult the problem. The array named probs holds the probabilities from a sampling of the beta distribution, which determine which machine to play. The arrays named S (“success”) and F (“failure”) hold the cumulative number of times each machine paid out and didn’t pay out when played.

The demo uses two random number-generating objects:

The rnd object is used to determine whether a selected machine wins or loses. Note that I use the terms win and lose, pay out and not pay out, and success and failure, interchangeably. The bs object is used to sample the beta distribution. The seeds used, 2 and 4, were specified only because they provided a representative demo run.

The main simulation loop begins:

You might want to set the number of trials to a large number, like 1,000, to observe how quickly Thompson sampling zeros in on an optimal machine. The key to the entire demo is the call to the Sample method. The numbers of successes and failures are passed to the method. A constant 1.0 is added as something of a hack because the beta distribution requires the a and b parameters to be greater than 0. If you have some prior knowledge of the machines’ payout probabilities, you could adjust the constant to reflect that information.

After displaying the updated sampling probabilities, the demo selects the machine with the greatest sampling probability:

Because probabilities are sampled, even if a machine has zero wins and many failures, it could still be selected for play. This mechanism enhances the exploration capabilities of the algorithm.

The main iteration loop concludes by playing the selected machine:

The NextDouble method returns a uniformly random value between 0.0 and 1.0 and is used to implement a Bernoulli process. The demo concludes by displaying the estimated probabilities of payout for each machine (not bothering to check for possible division by zero), and the number of times each machine was played.

Somewhat surprisingly, to the best of my knowledge the .NET Framework doesn’t have a library method to sample from the beta distribution. Instead of taking a dependency on an external library, I decided to implement one from scratch. There are many algorithms to generate a beta sample. I used the “BA” algorithm, developed by R. C. H. Cheng in 1978. The entire code is presented in **Figure 4**.

Sampling from the beta distribution is a fascinating topic in its own right. A quick scan of the code should convince you there’s some clever, non-trial math involved. The implementation follows the source research paper closely—the same variables names and so on. Notice the potentially infinite loop, which is common in research pseudo-code, but a definite no-no in production code. You might want to add a loop counter variable and throw an exception if its value exceeds some threshold.

Wrapping UpThis article and its code should give you all the information you need to experiment with Thompson sampling for multi-armed Bernoulli problems. It should also allow you to explore bandit problems with other kinds of payout functions. For example, if you have machines that pay out according to a Poisson likelihood function, instead of using the beta distribution you’d sample from the gamma distribution, which is the conjugate prior for Poisson.

The multi-armed bandit problem is an example of what’s called reinforcement learning (RL). In RL machine learning, instead of creating a prediction system using labeled data that has known correct answers, you generate a prediction model on-the-fly, using some sort of reward function. RL is at the forefront of machine learning research.

**Dr. James McCaffrey** works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products, including Internet Explorer and Bing. Dr. McCaffrey can be reached at [email protected]

Thanks to the following Microsoft technical experts who reviewed this article: Chris Lee, Ricky Loynd, Adith Swaminathan

**Nguồn: msdn.microsoft.com**