Blackjack or twenty-one is a card game where the player
attempts to beat the dealer, by obtaining a sum of card values that is
equal to or less than 21 so that his total is higher than the
dealer's. The probabilistic nature of the game makes it an
interesting testbed problem for learning algorithms, though the
problem of learning a good playing strategy is not obvious. Learning
with a teacher systems are not very useful since the target outputs
for a given stage of the game are not known. Instead, the learning
system has to explore different actions and develop a certain strategy
by selectively retaining the actions that maximize the player's
performance. We have explored the use of blackjack as a test bed for
learning strategies in neural networks, and specifically with
reinforcement learning techniques.
Related Work
In the early 1970s, Bernard Widrow [2] used the game
of blackjack to make the idea of one of the first reinforcement
learning algorithms called selective bootstrap adaptation. He
demonstrated how an ADAptive LINear Element [2]
learned with a critic to play blackjack without knowing the game and
the objective of play. More recently, in [3] neural
networks and Temporal Differences(TD) learning were
applied to learning play games from experience. Finally, in [7] Monte Carlo
methods have been used to learn value-functions of the game of Black Jack.
Our approach
We obtained experimental results similar to those reported in
[3].In our approach, we used the Q-learning
[4] and SARSA [5] algorithms, and a
slightly different state encoding. As a result, our system was able to
learn strategies with a higher rate of winning games.
Experimental Results
We have implemented a simplified version of the
blackjack game. The deck of 52 cards is shuffled before each hand. An
Ace is automatically valued as 11, unless it would cause a bust. In
that case it is valued as 1. The state s of the environment
corresponds to the sum of card values in the player's hand (i.e., the
score), provided that he/she does not have an ace. Therefore,
s={4,5,6,...,20}. If the player has an ace, there is a set of 9
extra states s={23,24,...,31} determined by the sum of card values
being held plus 11. Two terminal states s=21 and s=-1 were also
introduced to indicate a ``perfect'' score or a bust. At each stage
of the game, the player chooses to hit a new card, or to stand. Such
selection followed an epsilon-greedy policy, that is, the action
with the highest Q value is chosen with probability 1-epsilon,
otherwise a random action is chosen with probability epsilon, in
order to explore new actions.
Performance of fixed and learned strategies
In the first experiments (one experiment consisted of 1000 runs of 100 games) the simulated player used a set of fixed strategies (dealer's, hold and random).The results of such experiments (see table 1) enabled us to rank the learned
strategies (0.1-greedy, 0.01-greedy and 0.01-greedy (2)).
| Strategy
| Avg(%)
| Max(%)
| Min(%)
|
| dealer's |
40.7 |
57 |
25 |
| hold |
38.3 |
51 |
24 |
random |
31.5 |
46 |
18 |
0.1-greedy |
39.9 |
54 |
26 |
0.01-greedy |
41.9 |
56 |
26 |
0.01-greedy (2) |
46.7 |
49.14 |
-- |
Learned Strategies
Such strategy can be interpreted from the learned action-values after
the 1000 runs of 100 games.
The learned strategy is very conservative, i.e., not to hit if the
score is larger than 11. However, it is interesting to note how the
algorithm determines such threshold without knowing the rules of the
game, nor the goal of the game (just by experience and the
reinforcement signal at the end of each hand). In fact, the SARSA
player learns to hit only if it is sure that it will not bust; it
discovered by experience the double card value of the ace, and the maximum
card value of 10 to determine the following strategy:
hit if (score<11) or (an ACE is held)
stand otherwise
Probabilistic strategies
When one considers a nonstationary task, reinforcement learning
algorithms track the best actions as the task changes over time,
instead of finding a single best action.
When a task's optimal stationary policy is probabilistic, reinforcement
learning algorithms find a mapping from states to discrete probability
distributions over actions, which is denoted by pi: S ->
PD(A) (PD(A) represents the set of discrete probability
distributions over the set A, i.e., the actions) [6].
In these experiments, we used the same state encoding previously
described, but this time, we used a reinforcement signal defined as
follows: r = -1 if loss, 0 otherwise.
We may describe the current learned strategy as follows:
stand if (score>9) or [(score>13) and (score<19) and (an ACE is held)]
else
hit with 50% of probability
The resulting performance was surprisingly high: an average of 49.14%
win games, a maximum percentage of 61% win games, and a minimum
percentage of 42% win games. However, when using such strategy
without further learning during 1000 runs of 100 games, the results are
again poor: a mere 40% of games won in average. Nevertheless, we proceed with our experiments with continual
learning, and let the system adapt to the nonstationary problem.
The learner developed changing probabilistic strategies over
time, and we obtained the following means of win games after 1000, 5000,
10000, and 20000 runs of 100 games: 49.1, 47.4, 46.8, 46.7 and 46.7%.
Java Implementation
Java implementation by F. Meyer, Swiss Federal Institute of Technology-Lausanne, Switzerland.
References
[1] A. Perez-Uribe and E. Sanchez, "Blackjack as a Test Bed for Learning Strategies in Neural Networks", Proceedings of the IEEE International Joint Conference on Neural Networks IJCNN'98 (to appear)
[2] B. Widrow, N. Gupta, and S. Maitra, ``Punish/Reward: Learning with a Critic in Adaptive Threshold Systems,'', IEEE Transactions on Systems, Man and Cybernetics, vol. 3, no.5, pp. 455--465, 1973.
[3] K. Olson, ``Learning to play games from experience: An application of artificial neural networks and temporal difference learning,'' M.S. thesis, Pacific Lutheran University, Washington, 1993.
[4] J.C.H Watkins and P. Dayan, ``Technical note q-learning,'', Machine Learning , vol. 8, pp. 279--292, 1992.
[5] G. Rummery and M. Niranjan, ``On-line q-learning using connectionist systems,'',Tech. Rep. Technical Report CUED/F-INFENG/TR 166, Cambridge, University Engineering Department, 1994.
[6] M.L. Littman, `Markov games as a framework for multi-agent reinforcement learning,'', Proceedings of the Eleven International Conference on Machine Learning, 1994, pp. 157--163, Morgan Kaufmann.
[7] R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998. (see examples 5.1 and 5.3).
[*] Some Reinforcement Learning WWW Links
Version 27.I.98, Last Update 18.III.98
aperez@lslsun.epfl.ch