Can you beat the bandit?

Last time we went through the basic problem known as the “multi-armed bandit”. If you’re not totally familiar with the subject, or just want a refresher then I’d suggest you start by reading that post. I know, I know… ain’t nobody got time to be reading old posts. The main idea is:

If you want to design a good learning algorithm you must balance exploration (learning more about the world around you) with exploitation (making good decisions based on what you already know).

This good news this week is that we’re going to cut out the maths and you’re going to get a chance to try these algorithms out for yourselves; very exciting. Many thanks to Andrej Karpathy who is an absolute superstar and helped me get up and running with Javascript for this demo.

Your mission

Imagine yourself three months from now: you’ve subscribed to my blog and even read all the maths parts that you skipped the first time through. Well, now that you’re now an expert in sequential decision making (and updated your LinkedIn profile to say so too) you are hired as head of drug development at a prestigious research hospital. Life’s going great, the sign on bonus was insane, but now you find yourself in the middle of a catastrophic medical emergency.

With the release of his new album Bieber fever has mutated to become lethal. Far and wide people are bopping to “baby, baby, baby” until their hearts give out in a modern day dancing plague… The world’s foremost medical researchers have developed several drugs to combat the disease. Unfortunately, nobody knows how well these treatments work. Now it’s up to you to save as many lives as possible.

Luckily, you recognise this as nothing more than an independent bernoulli bandit… no problem!

In the simulation below you’ll choose how many drugs you have to choose from and how many patients you’ll be dealing with. Once you press Play you’ll be able to simulate this sequential experiment problem for yourself. At the end (or whenever you press Plot) you will be able to see the true success probabilities of each drug and compare your performance to that of several basic algorithms:

Greedy
Epsilon-Greedy
UCB
Posterior Sampling

Can you beat the bandit algorithms?

Configuration

Drugs: Patients:

Medical testing

Press "Play" to try for yourself or "Plot" to the bandit algorithms.

Trial results

Experiment results will show up here once you select a drug to trial.

True drug efficacy

Press "Plot" to reveal true drug efficacy.

Lives saved (higher is better)

</br> Press "Plot" to see algorithm performance.

Regret (lower is better)

</br> Press "Plot" to see algorithm performance.

Results

Hopefully you’ll have a play around with the simulation above and get a bit of a feel for the performance of these algorithms. Here are my first takeaways:

Being greedy really sucks. Sometimes you get lucky, but usually it just doesn’t work.
Human intuition and $\epsilon$-greedy (here $\epsilon = 0.1$) aren’t bad on the small problems (drugs and/or patients).
Once you get into the large-scale problems you start to see the benefits of efficient experimentation.
Posterior sampling is consistently one of the best algorithms.

Isn’t it nice when the experiment matches the theory?

Now, if you’re interested in learning more about this I can recommend the paper “An emprical evaluation of Thompson sampling” that really kicked off the recent revival in Thompson sampling research.

Just a few more things before I check out on this post.

This simulation is a multi-armed bandit with independent arms. Problems with similarity between drug treatments are much more interesting. In these settings efficient exploration is even more important.
In the case of finite independent Bernoulli arms you can actually compute the Bayes-optimal solution, which would do better than all of these algorithms. This method of solution is usually called a Gittins index but I’ll leave that for you to dig into another time…

Code

If you fancy having a peak under the hood then just right click anywhere on this page and click “View Page Source”. The only really interesting part of the code is how the different bandit algorithms select actions. Hopefully this should look exactly like the descriptions in my last post.

Algorithm 1: Greedy

function greedy_choice(arm_counts) {
    // greedy_choice uses the greedy empirical estimate to choose an arm.
    //
    // Args:
    //  arm_counts - n_arms x 2 - array of observed counts
    //
    // Returns:
    //  choice - int - arm to be pulled at the next timestep (0 index)
    var p_max = -1
    var choice = -1
    for (i = 0; i < arm_counts.length; i++) {
        var n_pull = (arm_counts[i][0] + arm_counts[i][1])
        if (n_pull < 0.5) {
            var p_hat = 0.5
        } else {
            var p_hat = arm_counts[i][0] / n_pull
        }
        if (p_hat > p_max) {
            p_max = p_hat
            choice = i
        }
    }
    return choice
}

Algorithm 2: Epsilon-Greedy

function egreedy_choice(arm_counts) {
    // egreedy_choice uses epsilon-greedy to choose an arm.
    //
    // Args:
    //  arm_counts - n_arms x 2 - array of observed counts
    //  epsilon - double - probability of random action - now fixed within
    //
    // Returns:
    //  choice - int - arm to be pulled at the next timestep (0 index)
    var epsilon = 0.1
    var choice = -1
    if (Math.random() < epsilon) {
        choice = Math.floor((Math.random() * arm_counts.length));
    } else {
        choice = greedy_choice(arm_counts)
    }
    return choice
}

Algorithm 3: Optimism (UCB)

function ucb_choice(arm_counts, timestep) {
    // ucb_choice uses the UCB algoirthm to choose an arm.
    //
    // Args:
    //  arm_counts - n_arms x 2 - array of observed counts
    //  timestep - int - number of timesteps elapsed
    //
    // Returns:
    //  choice - int - arm to be pulled at the next timestep (0 index)
    var p_max = -1
    var choice = -1
    for (i = 0; i < arm_counts.length; i++) {
        var n_pull = (arm_counts[i][0] + arm_counts[i][1])
        if (n_pull < 0.5) {
            n_pull = 1
        }
        var p_hat = arm_counts[i][0] / n_pull
        var p_upper = p_hat + Math.sqrt(Math.log(timestep) / n_pull)
        if (p_upper > p_max) {
            p_max = p_upper
            choice = i
        }
    }
    return choice
}

Algorithm 4: Posterior sampling

function ps_choice(arm_counts) {
    // ps_choice uses posterior sampling to make a choice of arm.
    //
    // Args:
    //  arm_counts - n_arms x 2 - array of observed counts
    //
    // Returns:
    //  choice - int - arm to be pulled at the next timestep (0 index)
    var p_max = -1
    var choice = -1
    var prior_a = 1
    var prior_b = 1
    for (i = 0; i < arm_counts.length; i++) {
        var p_sample = rbeta(arm_counts[i][0] + prior_a,
                             arm_counts[i][1] + prior_b)
        if (p_sample > p_max) {
            p_max = p_sample
            choice = i
        }
    }
    return choice
}

Until next time…

OpenAI Reinforcement Learning

Ian Osband