# Examples in Markov Decision Processes

This excellent book provides approximately 100 examples, illustrating the theory of controlled discrete-time Markov processes. The main attention is paid to counter-intuitive, unexpected properties of optimization problems. Such examples illustrate the importance of conditions imposed in the known theorems on Markov decision processes. The aim was to collect them together in one reference book which should be considered as a complement to existing monographs on Markov decision processes. The book is self-contained and unified in presentation.

The main theoretical statements and constructions are provided, and particular examples can be read independently of others. Examples in Markov Decision Problems, is an essential source of reference for mathematicians and all those who apply the optimal control theory for practical purposes. This book brings together examples based upon such sources, along with several new ones. In addition, it indicates the areas where Markov Decision Processes can be used. Active researchers can refer to this book on applicability of mathematical methods and theorems.

The examples presented either lead to counter-intuitive solutions or illustrate the importance of conditions in the known theorems. Not all examples are equally simple or complicated. Several examples are aimed at undergraduate students, whilst others will be of interest to professional or amateur researchers.

This book has four chapters in line with the four main different types of MDP; the finite-horizon, infinite horizon with total or discounted loss, and average loss over an infinite time interval. Some basic theoretical statements and proofs of auxiliary assertions are included in the two appendixes.

We consider only minimization problems. When formulating theorems and examples published in books, journals, articles or papers devoted to maximization, we always adjust the statements for our case without any special remarks. It should be emphasized that the terminology in MDP is not entirely fixed.

Markov Decision Processes (MDP) is a branch of mathematics based on probability theory, optimal control and mathematical analysis. Many books on the subject with counterexamples/paradoxes in probability are in the literature; it is therefore not surprising that Markov Decision Processes is also replete, with unexpected counter-intuitive examples.

The main goal of this intriguing book is to collect together such examples. Fascinating task! This book should be considered as a complement to scientific textbooks and monographs on MDP. The book can also serve as a reference book to which one can turn for answers to curiosities that arise which studying or teaching MDP. All the examples are self-contained and can be read independently of each other. Concerning uncontrolled Markov chains, we mention the illuminating collection of examples in Suhov and Kelbert (2008).

This book is high recommended. Some examples are aimed at undergraduate students, whilst others will be of interest to advanced undergraduates, graduates and research students in probability theory, optimal control and applied mathematics, looking for a better understanding of the theory; experts in Markov decision processes, professional or amateur researchers.

Chapter 1: Finite-Horizon Models (pp. 1-50)

In this chapter on analyzed the definition and basics of finite-horizon models. The aim is to find an optimal control strategy solving the problem related as performance functional using dynamic programming approach because the probabilistic system evolves through time. If loss functions are not bounded, the situation respect dynamic programming approach becomes more complicated. The Lemma 1, 1, and the corollaries 1.1., and the 1.2, can be helpful in the solution, because the lemma and the corollaries provide sufficient conditions of optimality. The examples presented, in this chapter, are 16 interesting problems, which allows, one extensive and complete illustration of finite-horizon models. All the examples considered are self-contained.

The examples are the following:

1. Non-transitivity of the correlation. The property of being positively correlated is not necessarily transitive. This question was studied widely in the literature.

2. Te more frequently used control is not better. The counterexample proposed is clear.

3. Voting. This is an interesting problem related with the final decision in accordance with the majority among the three opinions of three magistrates which investigate an accused person who is actually guilty.

4. The secretary problem. Here on considered only a very simple version of the classical and famous problem studied in depth in the specialized literature.

5. Constrained optimization. This intriguing example showing that the Bellman principle fails to hold and the optimal control strategy can look strange. Note that, very often, the solution to a constrained MDP is given by a randomized Markov control strategy; however, there is still no reason to consider past-dependent strategies. By introducing artificial random variables, the author should modify the final loss, and in this new model, the Bellman principle holds.

6. Equivalent Markov selectors in non-atomic MDPs. In this situation, on presented two different examples showing that, in the non-atomic case, a selector φ does not exist.

7. Strongly equivalent Markov selector in non atomic MDPs. The notion of strategies π and φ strongly equivalent is important in the theory of mass transportation. The author shows in theorem 1.1, associated to this example, that all conditions established are important for the strongly equivalence.

8. Stock exchange. It is worth paying attention to the different loss function employed, because it should cause curious conclusions in real problems.

9. Markov o non Markov strategy? Randomized or not? When is the Bellman principle violated? The example shows that the requirement concerning the infinities is essential. In the proposed example, the optimal control strategy φ is not uniformly optimal, even being the reasoning correct for an arbitrary selector, meaning that non-randomized strategies cannot satisfy all equalities established and cannot be uniformly optimal.

10. Uniformly optimal, but not optimal strategy. This example is the above slightly modified, by ignoring the initial step and putting another distribution of probability.

11. Martingales and the Bellman principle. If the functions C y c, the loss functions are bounded below, then the estimating process is a martingale if and only if π is optimal. The example 9 and the proposed here, show some difficulties in the above assertion, because the estimating process is not a martingale.

12. Conventions on expectation and infinities. These interesting examples on devoted to the use of another conventions on expectations and infinities, as different authors suggest to calculate the performance criterion, accepting the rule (+∞) + (-∞) = +∞.

13. Nowhere-differentiable function vt (x); discontinuous function vt (x). In this example on shows one proposition in analysis: 2It is known that a functional series can converge (absolutely) to a continuous, but nowhere-differentiable function”.

14. The non-measurable Bellman function. The dynamic programming approach is based on the assumption that optimality equation has a measurable solution. The following example shows that the Bellman function may be not Borel-measurable even in the simplest case having T=1, C(x) ≡0 with a measurable loss function.

15. No-one strategy is uniformly ε-optimal. This example here is an old example considered in the literature.

16. Semi-continuous model.MDP is called semi-continuous if the condition 1.1 is satisfied: (Condition 1.1.), 1) the action space is compact, 2) the transition probabilities is a continuous stochastic kernel and 3) the loss functions are lower semi-continuous and bounded below. If the action space is not compact or the transition probability is not continuous, or the loss functions are not lower semi-continuous, then trivial examples show that the desired selector φ* may not exist.

Chapter 2: Homogeneous Infinite-Horizon Models: Expected Total Loss (pp.51-126)

In this chapter on assumed that the time horizon is not finite. The definitions of strategies and selectors are the same as in the case finite horizon. The goal is to find an optimal control strategy solving the performance functional. The following conditions are always satisfied: “for any control strategy, either expected values (positives or negatives) are finite” and also the Putterman (1994), Section 7.2 on positive models.

MDP is called absorbing if there is a state, for which the controlled process is absorbed in at time T. Absorbing models are considered in 2,7,10,13,16,17,19,20,21,24,28. The examples 3,4,9,13,18 are from the area of optimal stopping in which, on each step, there exists the possibility or putting the controlled process in a special absorbing state, with no future loss.

The homogenous infinite-horizon models with the criteria based in the expected total loss. In this chapter on analyzed 28 nice examples with above criteria.

The examples are the following:

1. Mixed Strategies. Definitions and example quasi-trivial.

2. Multiple solutions to the optimality equation. Application to a discrete –time of queuing model.

3. Finite model: multiple solutions to the optimality equation; conserving but not equalizing strategy. The condition Putterman is satisfied, so the Bellman function coincides with the maximal non-positive solution.

4. The single conserving strategy is not equalizing and not optimal. This example shows that there exist no optimal strategies in this model.

5. When strategy iteration is not successful. If the model is positive and the cost function is bounded then the theorem due to Strauch (1966) holds. For negative models, the theorem due to Putterman (1994), said that if the strategy iteration algorithm terminates, then it returns an optimal strategy.

6. When value iteration is not successful. Example when the first condition is violated.

7. When value iteration is not successful: positive model I. The proposed example: value iteration does not converge to the Bellman function.

8. When value iteration is not successful: positive model II. The exampled showed that the statement, the existence of limit, can fail if the model is not negative.

9. Value iteration and stability in optimal stopping problems.

10. A non-equalizing strategy is uniformly optimal. Example when the second condition is violated.

11. A stationary uniformly ε-optimal selector does not exist (positive model). Example of a stationary uniformly ε-optimal selector does not exist.

12. A stationary uniformly ε-optimal selector does not exist (negative model). Example of a stationary uniformly ε-optimal selector does not exist. The proposed example can be called gambling.

13. Finite-action negative model where a stationary uniformly ε-optimal selector does not exist. An example which can be reformulated as a gamble.

14. Nearly uniformly optimal selectors in negative model. The example shows that if the state space is countable the theorem of Ornstein (1969) holds. The example also shows that this theorem cannot be essentially improved.

15. Semi-continuous models and the blackmailer´s dilemma. Many very powerful results are known for semi-continuous models. On required one additional condition: 1) the action space is compact, b) the loss function for each state is a lower semi-continuous in each action, and c) for each state, the function integral is continuous in each action for every (measurable) bounded function u. The example is the blackmailer´s dilemma (Bertsekas, 1987, page 254).

16. Not a semi-continuous model. If the model is not semi-continuous then one cannot guarantee the existence of optimal strategies. The exampled proposed shows that no one stationary selector is ε-optimal.

17. The Bellman function is non-measurable and no one strategy is uniformly ε-optimal

18. A randomized strategy is better than any selector (finite action space)

19. The fluid approximation does not work

20. The fluid approximation: refined model

21. Occupation measures: phantom solutions. For a fixed control strategy, the occupation measure is one particular measure on XxA. The introduction of this measure, convex analytic approach to the MDP is fruitful, especially in constrained problem. In the sequel on analyzed different illustrating examples.

22. Occupation measures in transient models

23. Occupation measures and duality

24. Occupation measures: compactness

25. The bold strategy in gambling is not optimal (house limit). This classical game can be modeled as an MDP.

26. The bold strategy in gambling is not optimal (inflation). Another example of the gambler problem.

27. Search strategy for a moving target. In this example, it seems plausible that the optimal strategy is simply, to search the location that given the highest probability of find the object.

28. The three-way duel (“Truel”). The sequential truel is a game that generalizes the simple duel. Every marksman wants to maximize his probability of winning the game.

Chapter 3: Homogeneous Infinite-Horizon Models: Discounted Loss (pp.127-176)

This chapter is about the equal problem where β € (0, 1) is the discount factor. As usual, vπ is the performance functional. The discount model is a particular case of an MDP with total expected loss. The problem is now equivalent to the investigation of the modified (absorbing) model, with finite, totally bounded expected absorption time.

Nevertheless, discounted models traditionally constitute a special area in MDP. By this reason the following examples, except some cases, are special modifications which introduced the discount factor. 23 different examples contain this chapter.

The examples are the following:

1. Phantom solutions of the optimality equation

2. When value iteration of the optimality equation

3. A non-optimal strategy π for which v π x solves the optimality equation

4. The single conserving strategy is not equalizing and not optimal

5. Value iteration and convergence of strategies

6. Value iteration in countable models

7. The Bellman function is non-measurable and no one strategy is uniformly ε-optimal

8. No one selector is uniformly ε-optimal

9. Myopic strategies. A stationary strategy, uniformly optimal in the homogenous one-step model, T=1 with terminal loss C(x) =0, is called myopic.

10. Stable and unstable controllers for linear systems

11. Incorrect optimal actions in the model with partial information

12. Occupation measures and stationary strategies

13. Constrained optimization and the Bellman principle

14. Constrained optimization and Lagrange multipliers

15. Constrained optimization: multiple solutions

16. Weighted discounted loss and (N,∞)-stationary selectors

17. Non-constant discounting

18. The nearly optimal strategy is not Blackwell optimal

19. Blackwell optimal strategies and opportunity loss

20. Blackwell optimal and n-discount optimal strategies

21. No Blackwell (Maitra) optimal strategies

22. Optimal strategies as β→ 1 - and MDPs with the average loss -I

23. Optimal strategies as β→ 1 - and MDPs with the average loss -II

Chapter 4: Homogeneous Infinite-Horizon Models: Average Loss and Other Criteria (pp.177-252)

Under rather general conditions, problem of optimization, with average loss (or another criteria), of the performance functional is well defined, e.g. if the loss function c is bounded below. In this context, such strategies will be called AC-optimal, i.e. average-cost-optimal. If the model is finite then there exists a stationary AC-optimal selector. The situation becomes more complicated if either space X or A is not finite. The dynamic programming approach is very adequate for this problem.

Some strategies Marko and Semi-Markov are also considered in the context of AC-optimality.

In this chapter four, an interesting and fascinating selection on analyzed. In particular 31 examples are examined with above criteria.

The examples are the following:

1. Why limsup?

2. AC-optimal non-canonical strategies

3. Canonical triplets and canonical equations

4. Multiple solutions to the canonical equations in finite models

5. No AC-optimal strategies

6. Canonical equations have no solutions: the finite action space

7. No AC-optimal stationary strategies in a finite state model

8. No AC-optimal strategies in a finite-state semi-continuous model

9. Semi-continuous models and the sufficiency of stationary selectors

10. No AC-optimal stationary strategies in a unichain model with a finite action space

11. No AC- ε-optimal stationary strategies in a finite action model

12. No AC- ε-optimal Markov strategies

13. Singular perturbation of an MDP

14. Blackwell optimal strategies and AC-optimality

15. Strategy iteration in a unichain model

16. Unichain strategy iteration in a finite communicating model

17. Strategy iteration in a semi-continuous models

18. When value iteration is not successful

19. The finite-horizon approximation does not work

20. The linear programming approach to finite models

21. Linear programming for infinite models. The linear programming proved to be effective in finite models. In the general case, this approach was developed in the literature, but under special conditions.

22. Linear programs and expected frequencies in finite models

23. Constrained optimization

24. AC-optimal, bias optimal, overtaking optimal and opportunity-cost optimal strategies: periodic model

25. AC-optimal and average-overtaking optimal strategies

26. Blackwell optimal, bias optimal, overtaking optimal and AC-optimal strategies

27. Nearly optimal and average-overtaking optimal strategies

28. Strong-overtaking/average optimal, overtaking optimal, AC-optimal strategies and minimal opportunity loss

29. Strong-overtaking optimal and strong*-overtaking optimal strategies

30. Parrondo´s paradox

31. An optimal service strategy in a queuing system

Afterword

Briefly mention several real-life applications of MDP

- Control of a moving object. The objective can be, for example, reaching the goal with the minimal expected energy. The state is the position of the object subject to random disturbances and the action corresponds to the power of an engine.

- Control of water resources. The performance to be maximized corresponds to the expected utility of the water consumed. The state is the amount of water in a reservoir and on decisions about using the water.

- Consumption-investment problems. The objective is to minimize the total expected consumption over the planning interval.

- Inventory control. The goal is to maximize the total expected profit from selling the product.

- Reliability. In this case, to minimize the total expected loss resulting from failures and from the maintenance cost.

- Financial mathematics. The state is the current wealth along the vector of stock prices in a random market. The action represents the restructuring of the self-financing portfolio. The aim is the maximization of the expected utility associated with the final wealth.

- Selling an asset. The objective is to maximize the total expected profit.

- Gambling. The objective is to maximize the probability of reaching the goal. Gambling examples are given in chapter two, examples 14, 25 and 26.

They are, in the literature other meaningful examples, for example: quality control in a production line; forest management; controlled populations; participating in a quiz show; organizing of teaching and examinations; optimization of publicity efforts; insurance, and so on.

Appendix A: Borel Spaces and Other Theoretical Issues (pp.257-266)

A.1. Main concepts: Definitions, theorems: Tychonoff, Urysohn, etc., metrizable compact spaces, Hilbert cube and so on.

A.2. Probability Measures on Borel Spaces: Definitions, the existence of measurable stochastic kernels, concepts on a metric space, etc.

A.3. Semicontinuous Functions and Measurable Selection: Basic definitions, properties of functions lower and upper semi-continuous and bounded below on metrizable and separable metrizable spaces.

A.4. Abelian (Tauberian) Theorem

Appendix B: Proofs of Auxiliary Statements (pp. 267-280)

In this appendix devoted to the Proofs of auxiliary statements on given the following: Lemmas 2.1 and 3.2., and the propositions 4, 1, 4.2 and 4.4.

Notation (pp. 281-282)

List of the Main Statements (pp. 283-284)

Bibliography (pp. 285-290). They are about 69 interesting references.

Index (pp. 291-293)

CONCLUSION

This book should be considered as a complement to scientific textbooks and monographs on MDP. The book can also serve as a reference book to which one can turn for answers to curiosities that arise which studying or teaching MDP. All the examples are self-contained and can be read independently of each other. When studying or using mathematical methods, the advanced student or researcher must understand what can happen if some of the conditions imposed in rigorous theorems are not satisfied. Many examples confirming the importance of such conditions were published in different journal articles which are often difficult to find.

In my opinion, this remarkable and intriguing book is high recommended. Some examples are aimed at undergraduate students, whilst others will be of interest to advanced undergraduates, graduates and research students in probability theory, optimal control and applied mathematics, looking for a better understanding of the theory; experts in Markov decision processes, professional or amateur researchers. Active researchers can refer to this book on applicability of mathematical methods and theorems.

**Submitted by Anonymous |

**1 / Apr / 2013