A Deep Q-Network for the Beer Game: A Reinforcement Learning Algorithm to Solve Inventory Optimization Problems

Afshin OroojlooyJadid


Bethlehem, Pennsylvania


The beer game is a widely used in-class game that is played in supply chain management classes to demonstrate the bullwhip effect. The game is a decentralized, multi-agent, cooperative problem that can be modeled as a serial supply chain network in which agents cooperatively attempt to minimize the total cost of the network even though each agent can only observe its own local information. Each agent chooses order quantities to replenish its stock. Under some conditions, a base-stock replenishment policy is known to be optimal. However, in a decentralized supply chain in which some agents (stages) may act irrationally (as they do in the beer game), there is no known optimal policy for an agent wishing to act optimally. We propose a machine learning algorithm, based on deep Q-networks, to optimize the replenishment decisions at a given stage. When playing alongside agents who follow a base-stock policy, our algorithm obtains near-optimal order quantities. It performs much better than a base-stock policy when the other agents use a more realistic model of human ordering behavior. Unlike most other algorithms in the literature, our algorithm does not have any limits on the beer game parameter values. Like any deep learning algorithm, training the algorithm can be computationally intensive, but this can be performed ahead of time; the algorithm executes in real time when the game is played. Moreover, we propose a transfer learning approach so that the training performed for one agent and one set of cost coefficients can be adapted quickly for other agents and costs. Our algorithm can be extended to other decentralized multi-agent cooperative games with partially observed information, which is a common type of situation in real-world supply chain problems.

Project status: Under Development

Game Development, Artificial Intelligence

Intel Technologies
AI DevCloud / Xeon, Intel Opt ML/DL Framework

Overview / Usage

The beer game is a widely used in-class game that is played in supply chain management classes to demonstrate a phenomenon known as the bullwhip effect. The game consists of a serial supply chain network with four players—a retailer, a wholesaler, a distributor, and a manufacturer.
In each period of the game, the retailer experiences a random demand from customers. Then the four players each decide how much inventory of "beer" to order. The retailer orders from the wholesaler, the wholesaler orders from the distributor, the distributor from the manufacturer, and the manufacturer orders from an external supplier that is not a player in the game. If a player does not have sufficient inventory to serve the current period's demand (or orders), the unmet demands are backordered; that is, the customer or downstream node waits until inventory is available. The manufacturer's supplier, however, never has shortages—it is assumed to have infinite supply. There are deterministic transportation and information lead times, though the total lead time is stochastic due to stockouts at upstream nodes. Each player may have nonzero shortage and holding costs.
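As a rough illustration of these dynamics, the following Python sketch advances a single agent by one period. It is a simplification (for example, lead times are not modeled explicitly here), and the function name, variable names, and default cost values are illustrative, not taken from our implementation.

    def step_one_period(inventory, backlog, arriving_shipment, incoming_order,
                        holding_cost=1.0, shortage_cost=1.0):
        """Advance one agent by one period; return its updated state and period cost."""
        inventory += arriving_shipment        # receive the shipment arriving this period
        demand = incoming_order + backlog     # new downstream order plus previously unmet demand
        shipped = min(inventory, demand)      # ship whatever is on hand
        inventory -= shipped
        backlog = demand - shipped            # unmet demand is backordered, not lost
        cost = holding_cost * inventory + shortage_cost * backlog
        return inventory, backlog, cost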

The players' goal is to minimize the total holding and shortage costs of the whole supply chain. This is therefore a cooperative game, but the players are not allowed to communicate with one another, and moreover they do not know the inventory levels or ordering decisions at the other players—each player only has access to local information.
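In symbols (the notation here is ours, not from the original write-up), the shared objective over a T-period game can be written as

\[
\min \; \sum_{t=1}^{T} \sum_{i=1}^{4} \Bigl( c_h^i \, [IL_t^i]^+ \;+\; c_p^i \, [-IL_t^i]^+ \Bigr),
\qquad [x]^+ := \max\{x, 0\},
\]

where c_h^i and c_p^i are agent i's per-unit holding and shortage costs and IL_t^i is agent i's inventory level at the end of period t (negative when demand is backordered).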

The beer game has been written about quite a bit in the academic literature. Most of this research asks how human players play the beer game, and examines the bullwhip effect and other phenomena that arise from this play. An important question that has not been addressed much in the literature is how players should play the beer game; in other words, what is the optimal inventory policy that a given player should follow? This is a difficult question, especially when the other players do not, themselves, follow an optimal policy. (The beer game can be classified as a multi-agent, cooperative, decentralized, partially observed Markov decision process, a class of problems that is known to be NEXP-complete—that is, very hard.)

Answering the question of how players should play will have implications not only for the beer game, but also for optimizing inventory decisions in supply chains more generally.

Methodology / Approach

We propose a modified Deep Q-Network (DQN) to solve this problem. DQN has been used to solve single-agent games (such as Atari games) and two-agent zero-sum games (such as Go). However, the beer game is a cooperative, non-zero-sum game, so using DQN directly would result in each player minimizing its own cost while ignoring the system-wide cost. To fix this issue, we propose a feedback scheme that causes the agents to learn to minimize the total cost of the system. In particular, we update the observed reward of each agent i in each period t with a correction term that reflects the cost incurred by the entire system, giving each agent feedback on how the team as a whole has played.
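To make the idea concrete, here is a minimal Python/NumPy sketch of this kind of reward-shaping step. It is not the exact formula used in our algorithm; the blending coefficient beta and the specific way the system-wide cost is mixed into each agent's local cost are illustrative assumptions only.

    import numpy as np

    # Hedged sketch of a cooperative feedback signal: after an episode, each
    # agent's locally observed per-period cost is blended with the cost of the
    # whole system, so that minimizing the adjusted cost pushes the agent toward
    # the system-wide objective. "beta" and the blending rule are illustrative.
    def adjust_rewards(local_costs, beta=0.5):
        """local_costs: array of shape (num_agents, T) with each agent's
        per-period holding + shortage cost. Returns adjusted rewards (negated
        costs) of the same shape, to be stored in the DQN replay memory."""
        total_per_period = local_costs.sum(axis=0)   # system-wide cost in each period
        adjusted = local_costs + beta * (total_per_period - local_costs)
        return -adjusted                             # DQN maximizes reward, so negate the cost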

We present the results from a straightforward version of the game in which the DQN plays the role of the wholesaler and the other three players each follow a base-stock policy. (Under a base-stock policy, the player always orders enough to bring its inventory position—on-hand inventory minus backorders plus on-order inventory—up to a fixed value, called the base-stock level. A base-stock policy is known to be optimal for the beer game, assuming all four players use it.) Each stage has a holding cost of $2 per item per period. The retailer has a shortage cost of $2 per item per period, and the other stages have no shortage cost. The customer demand is 0, 1, or 2 in each period. (If these numbers seem small to you, think of them as cases or truckloads.)
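For reference, the base-stock ordering decision is simple enough to state in a few lines of code. This is a generic sketch, not code from our implementation, and the names are ours.

    def base_stock_order(on_hand, backorders, on_order, base_stock_level):
        # Inventory position = on-hand inventory minus backorders plus on-order inventory.
        inventory_position = on_hand - backorders + on_order
        # Order just enough to bring the inventory position up to the base-stock level
        # (never a negative quantity).
        return max(base_stock_level - inventory_position, 0)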

The attached figures plot the total cost of the four players for two cases: (1) when all four players use a base-stock policy, which is optimal for this setting and therefore provides a baseline (blue curve), and (2) when one of the players is replaced by the DQN agent (green curve). The x-axis indicates the number of episodes on which the DQN is trained. The upper subfigure shows the total cost, while the lower subfigure shows the normalized cost, in which the costs of the BS policy and the DQN are each divided by the BS policy's cost. The experimental results show that after a relatively small number of training episodes, the AI has learned to play nearly optimally.
Another attached figure shows the details of the inventory level (IL), on-order quantity (OO), order quantity (a), order-up-to level (OUTL), and reward (r) for the retailer when the retailer is played by the DQN. The BS policy and the DQN have similar IL and OO trends, and as a result their rewards are also very close.

Transfer Learning

The algorithm described above works well; however, for each new action space and each new set of shortage and holding cost parameters we need to train a new policy, since these values define the DQN loss function, and training is a time-consuming procedure. To address this issue, we train a base agent with a given set of cost parameters and a given action space, and then transfer the acquired knowledge to new agents, so that the learned policy can easily be generalized to other agents. The attached table shows the results of training new agents with different cost coefficients and different action spaces. As shown, the average gap is around 8.9%, which demonstrates the effectiveness of the transfer learning approach. The training times for each case are also provided in the table. To obtain the base results, we performed hyper-parameter tuning, which required a total of 13,354,951 seconds of training. With the transfer learning approach, however, no hyper-parameter tuning is needed, and we spent on average 891,117 seconds on training, which is 15 times faster.
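The transfer itself can be implemented by reusing the trained base network and retraining only part of it for the new agent. The TensorFlow/Keras sketch below illustrates one way to do this; the choice to freeze all hidden layers and retrain only a new output head, along with the function name and hyper-parameters, are illustrative assumptions rather than our exact setup.

    import tensorflow as tf

    # Hedged sketch: reuse the hidden layers of a trained base DQN and train
    # only a new output head sized for the new agent's action space. Which
    # layers are frozen (and whether any are fine-tuned) is an illustrative choice.
    def build_transferred_dqn(base_model, new_num_actions, learning_rate=1e-4):
        new_model = tf.keras.Sequential()
        for layer in base_model.layers[:-1]:   # copy every layer except the old output head
            layer.trainable = False            # freeze the transferred layers
            new_model.add(layer)
        new_model.add(tf.keras.layers.Dense(new_num_actions))  # new head for the new action space
        new_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate), loss="mse")
        return new_model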

The results above are surprising (at least to us) and impressive. They demonstrate that the DQN can learn to play nearly optimally using only historical data, without any knowledge of how the other agents play, what their ordering policies are, etc.
These days, most companies have huge quantities of data about their supply chains. This research suggests that these data can be used by ML algorithms to make near-optimal decisions in a supply chain. In the future, we plan to adapt this approach to different supply chain structures with more complex networks and more agents. Stay tuned on this page for more results.

Technologies Used

To develop the algorithm, I used Google's open-source library TensorFlow. Later, as a proof of concept, I trained and tested the neural network on one K80 GPU with the help of NVIDIA CUDA. To run more concrete experiments, I used Intel® AI DevCloud.
