MAB.ai 0.9.1

Thabo Koee

Maximize the reward obtained by successively playing gambling machines (the 'arms' of the bandit). The multi-armed bandit problem was introduced in the early 1950s by Robbins to model decision making under uncertainty when the environment is unknown: the reward distributions are not known ahead of time.

Project status: Published/In Market

Virtual Reality, HPC, Game Development, Artificial Intelligence

Intel Technologies
MKL, Movidius NCS, Intel CPU

Code Samples [1]

Overview / Usage

In addition to the bandit demo described below, we have added an Augmented Reality scene to enhance the experience.

Reinforcement learning is learning what to do - how to map situations to actions - so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward, but also the next situation and, through that, all subsequent rewards.

In the simplest scenario, there is a single room that contains two chests. Opening a chest either yields a diamond (a good thing) or a ghost (a bad thing). Opening the same chest multiple times will yield a different sequence of diamonds and ghosts based on some underlying probability of yielding a diamond. For example, a chest that has a probability of 0.5 means that it will yield a 50-50 mix of diamonds and ghosts, while a probability of 0.9 means that it will yield a diamond nine out of every ten times, approximately. Note that each chest has its own true probability that the agent (in this case, the entity deciding which chest to open) is not aware of. Each time an agent selects a chest, they either receive a positive reward in the case of finding a diamond, or a negative reward in the case of finding a ghost. The goal of the agent is to maximize its total reward over a number of trials - in each trial the agent is allowed to select any chest.
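
To make this concrete, each chest can be modeled as a Bernoulli arm that pays +1 for a diamond and -1 for a ghost. The Python sketch below is illustrative only; the class name, reward values, and probabilities are assumptions, not taken from the project code:

```python
import random

class Chest:
    """One bandit arm: yields a diamond (+1) with probability p, a ghost (-1) otherwise."""
    def __init__(self, p):
        self.p = p  # true diamond probability, hidden from the agent

    def open(self):
        return 1 if random.random() < self.p else -1

# Two chests with different true probabilities, unknown to the agent
chests = [Chest(0.5), Chest(0.9)]
print([chests[1].open() for _ in range(10)])  # roughly nine diamonds per ten openings
```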

If the agent is aware of the true underlying probability of each chest, then its task is quite simple: all it has to do is repeatedly select the chest that has the highest probability of yielding a diamond. However, in the absence of this information, the best it can do is intelligently trade off between estimating the probabilities (called exploration) and selecting the chest with the highest estimated probability (called exploitation). An agent that only explores will waste all its trials estimating the probability of each chest without maximizing its reward, while an agent that performs little exploration will waste most of its trials exploiting inaccurate probability estimates. The key question is how to balance exploration and exploitation effectively.
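
One simple way to balance the two is an epsilon-greedy rule: with a small probability the agent explores a random chest, otherwise it exploits the chest with the highest estimated value. The sketch below reuses the hypothetical Chest class above and is only an illustrative baseline, not necessarily the policy used in the demo:

```python
import random

def epsilon_greedy(chests, trials=1000, epsilon=0.1):
    counts = [0] * len(chests)       # how many times each chest was opened
    estimates = [0.0] * len(chests)  # running estimate of each chest's expected reward
    total = 0
    for _ in range(trials):
        if random.random() < epsilon:
            i = random.randrange(len(chests))                        # explore
        else:
            i = max(range(len(chests)), key=lambda j: estimates[j])  # exploit
        r = chests[i].open()
        counts[i] += 1
        estimates[i] += (r - estimates[i]) / counts[i]  # incremental mean update
        total += r
    return total, estimates
```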

The simplest scenario of a single room with two chests can be expanded to multiple rooms with several chests each. In this demo you can choose between a stateless bandit (one room) and a contextual bandit (three rooms). For either scenario you can set the number of chests in each room (two through five), in addition to a few other settings discussed below.
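
For the contextual case, the only conceptual change is that the agent keeps a separate set of estimates per room, since the best chest may differ from room to room. A minimal sketch, assuming the room index is observed before each trial (again illustrative, not the demo's actual code):

```python
import random

def contextual_epsilon_greedy(rooms, trials=1000, epsilon=0.1):
    # rooms: list of rooms, each room a list of Chest objects (the room is the context)
    counts = [[0] * len(room) for room in rooms]
    estimates = [[0.0] * len(room) for room in rooms]
    total = 0
    for _ in range(trials):
        room = random.randrange(len(rooms))  # environment reveals the current room
        if random.random() < epsilon:
            i = random.randrange(len(rooms[room]))
        else:
            i = max(range(len(rooms[room])), key=lambda j: estimates[room][j])
        r = rooms[room][i].open()
        counts[room][i] += 1
        estimates[room][i] += (r - estimates[room][i]) / counts[room][i]
        total += r
    return total, estimates
```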

Methodology / Approach

Each machine 𝑖 has a different (unknown) reward distribution with (unknown) expectation 𝜇𝑖. Successive plays of the same machine yield rewards that are independent and identically distributed, and independence also holds for rewards across machines. The reward is a random variable 𝑋𝑖,𝑛 with 1 ≤ 𝑖 ≤ 𝐾 and 𝑛 ≥ 1, where 𝑖 is the index of the gambling machine, 𝑛 is the number of plays, and 𝜇𝑖 is the expected reward of machine 𝑖. A policy, or allocation strategy, 𝐴 is an algorithm that chooses the next machine to play based on the sequence of past plays and obtained rewards.
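
A classical allocation strategy in this setting is UCB1 (Auer, Cesa-Bianchi and Fischer, 2002), which plays the machine with the highest empirical mean plus a confidence bonus of sqrt(2 ln n / nᵢ). The sketch below is a generic illustration of such a policy; the agent in this project is built with Intel Coach and may use a different algorithm:

```python
import math

def ucb1(arms, trials=1000):
    # arms: objects exposing open() that returns a reward (e.g. the Chest sketch above)
    k = len(arms)
    counts = [1] * k
    means = [arm.open() for arm in arms]  # play each machine once to initialize
    for n in range(k, trials):
        # upper confidence bound: empirical mean + exploration bonus
        scores = [means[i] + math.sqrt(2 * math.log(n) / counts[i]) for i in range(k)]
        i = max(range(k), key=lambda j: scores[j])
        r = arms[i].open()
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]
    return means, counts
```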

Technologies Used

For training:
- Intel Coach
- Movidius Neural Compute Stick
- Intel MKL-DNN 2018

Hardware:
- Lenovo Ideapad 110 (Intel Core i7)

For Augmented Reality:
- Unity3D

Operating System:
- Linux Ubuntu 16.04

Repository

https://github.com/TechTouchABI/MAB-AI
