Learning to Play
Black Jack with Artificial Neural Networks

Andrés PEREZ-URIBE 
 Logic Systems Laboratory
Swiss Federal Institute of Technology-Lausanne
 


Blackjack or twenty-one is a card game where the player attempts to beat the dealer, by obtaining a sum of card values that is equal to or less than 21 so that his total is higher than the dealer's. The probabilistic nature of the game makes it an interesting testbed problem for learning algorithms, though the problem of learning a good playing strategy is not obvious. Learning with a teacher systems are not very useful since the target outputs for a given stage of the game are not known. Instead, the learning system has to explore different actions and develop a certain strategy by selectively retaining the actions that maximize the player's performance. We have explored the use of blackjack as a test bed for learning strategies in neural networks, and specifically with reinforcement learning techniques.

Related Work

In the early 1970s, Bernard Widrow [2] used the game of blackjack to make the idea of one of the first reinforcement learning algorithms called selective bootstrap adaptation. He demonstrated how an ADAptive LINear Element [2] learned with a critic to play blackjack without knowing the game and the objective of play. More recently, in [3] neural networks and Temporal Differences(TD) learning were applied to learning play games from experience. Finally, in [7] Monte Carlo methods have been used to learn value-functions of the game of Black Jack.

Our approach

We obtained experimental results similar to those reported in [3].In our approach, we used the Q-learning [4] and SARSA [5] algorithms, and a slightly different state encoding. As a result, our system was able to learn strategies with a higher rate of winning games.

Experimental Results

We have implemented a simplified version of the blackjack game. The deck of 52 cards is shuffled before each hand. An Ace is automatically valued as 11, unless it would cause a bust. In that case it is valued as 1. The state s of the environment corresponds to the sum of card values in the player's hand (i.e., the score), provided that he/she does not have an ace. Therefore, s={4,5,6,...,20}. If the player has an ace, there is a set of 9 extra states s={23,24,...,31} determined by the sum of card values being held plus 11. Two terminal states s=21 and s=-1 were also introduced to indicate a ``perfect'' score or a bust. At each stage of the game, the player chooses to hit a new card, or to stand. Such selection followed an epsilon-greedy policy, that is, the action with the highest Q value is chosen with probability 1-epsilon, otherwise a random action is chosen with probability epsilon, in order to explore new actions.

Performance of fixed and learned strategies

In the first experiments (one experiment consisted of 1000 runs of 100 games) the simulated player used a set of fixed strategies (dealer's, hold and random).The results of such experiments (see table 1) enabled us to rank the learned strategies (0.1-greedy, 0.01-greedy and 0.01-greedy (2)).

Strategy Avg(%) Max(%) Min(%)
dealer's 40.7 57 25
hold 38.3 51 24
random 31.5 46 18
0.1-greedy 39.9 54 26
0.01-greedy 41.9 56 26
0.01-greedy (2) 46.7 49.14 --

Learned Strategies

Such strategy can be interpreted from the learned action-values after the 1000 runs of 100 games.
The learned strategy is very conservative, i.e., not to hit if the score is larger than 11. However, it is interesting to note how the algorithm determines such threshold without knowing the rules of the game, nor the goal of the game (just by experience and the reinforcement signal at the end of each hand). In fact, the SARSA player learns to hit only if it is sure that it will not bust; it discovered by experience the double card value of the ace, and the maximum card value of 10 to determine the following strategy:

hit if (score<11) or (an ACE is held)
stand otherwise

Probabilistic strategies

When one considers a nonstationary task, reinforcement learning algorithms track the best actions as the task changes over time, instead of finding a single best action. When a task's optimal stationary policy is probabilistic, reinforcement learning algorithms find a mapping from states to discrete probability distributions over actions, which is denoted by pi: S -> PD(A) (PD(A) represents the set of discrete probability distributions over the set A, i.e., the actions) [6].

In these experiments, we used the same state encoding previously described, but this time, we used a reinforcement signal defined as follows: r = -1 if loss, 0 otherwise.

We may describe the current learned strategy as follows:

stand if (score>9) or [(score>13) and (score<19) and (an ACE is held)]
else
hit with 50% of probability

The resulting performance was surprisingly high: an average of 49.14% win games, a maximum percentage of 61% win games, and a minimum percentage of 42% win games. However, when using such strategy without further learning during 1000 runs of 100 games, the results are again poor: a mere 40% of games won in average. Nevertheless, we proceed with our experiments with continual learning, and let the system adapt to the nonstationary problem.
The learner developed changing probabilistic strategies over time, and we obtained the following means of win games after 1000, 5000, 10000, and 20000 runs of 100 games: 49.1, 47.4, 46.8, 46.7 and 46.7%.

Java Implementation

Java implementation by F. Meyer, Swiss Federal Institute of Technology-Lausanne, Switzerland.

References

[1] A. Perez-Uribe and E. Sanchez, "Blackjack as a Test Bed for Learning Strategies in Neural Networks", Proceedings of the IEEE International Joint Conference on Neural Networks IJCNN'98 (to appear)

[2] B. Widrow, N. Gupta, and S. Maitra, ``Punish/Reward: Learning with a Critic in Adaptive Threshold Systems,'', IEEE Transactions on Systems, Man and Cybernetics, vol. 3, no.5, pp. 455--465, 1973.

[3] K. Olson, ``Learning to play games from experience: An application of artificial neural networks and temporal difference learning,'' M.S. thesis, Pacific Lutheran University, Washington, 1993.

[4] J.C.H Watkins and P. Dayan, ``Technical note q-learning,'', Machine Learning , vol. 8, pp. 279--292, 1992.

[5] G. Rummery and M. Niranjan, ``On-line q-learning using connectionist systems,'',Tech. Rep. Technical Report CUED/F-INFENG/TR 166, Cambridge, University Engineering Department, 1994.

[6] M.L. Littman, `Markov games as a framework for multi-agent reinforcement learning,'', Proceedings of the Eleven International Conference on Machine Learning, 1994, pp. 157--163, Morgan Kaufmann.

[7] R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998. (see examples 5.1 and 5.3).

[*] Some Reinforcement Learning WWW Links


Version 27.I.98, Last Update 18.III.98
aperez@lslsun.epfl.ch