1) Shortest path as an MDP
To find the shortest path on a graph (V, E) using a Markov Decision Process, we treat each vertex of the graph as a state and each edge as an action. In contrast to the generally stochastic nature of Markovian transitions, we assume that taking an edge (action) has a 100% chance of reaching the corresponding vertex (state). We also designate the start and goal vertices as the start and goal states. So far, we have accounted for S, A and T.
The real problem is assigning appropriate rewards to each action and state. Since we are required to use a directed graph with the same travel time on every edge, finding the shortest path with only positive rewards can be tricky. However, we do have a factor in our utility function that can help: the discount factor (gamma). If the reward of reaching the goal state sufficiently outweighs the reward of visiting any other vertex (state), then, because the discount factor shrinks that reward a little more with every step taken, a longer path to the goal is worth less than a shorter one, and finding the best path with value iteration becomes straightforward.
So, we would set the reward of the goal state to 1, and of every other state to 0. Our transition function returns 1 for reaching any state that has a direct edge from the previous state, and 0 for every other state. Finally, our gamma could be something like 0.8.
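As a rough illustration (a sketch only, not part of the assignment code; make_mdp_from_graph and its adjacency-matrix input are my own made-up names), this is how such an MDP could be built in MATLAB, with T, R and gamma stored the way my value iteration code below expects them:
% Sketch: build a deterministic MDP from a directed graph's adjacency matrix.
% Action a means "follow the a-th outgoing edge of the current vertex".
function mdp = make_mdp_from_graph(Adj, goal)
    n = size(Adj,1);              % one state per vertex
    maxdeg = max(sum(Adj,2));     % enough actions for the busiest vertex
    for a = 1:maxdeg
        mdp.T{a} = zeros(n,n);
        mdp.R{a} = zeros(n,n);
    end
    for s = 1:n
        succ = find(Adj(s,:));    % vertices reachable over one edge
        for a = 1:numel(succ)
            mdp.T{a}(s, succ(a)) = 1;                  % deterministic transition
            mdp.R{a}(s, succ(a)) = (succ(a) == goal);  % reward 1 only for entering the goal
        end
        for a = numel(succ)+1:maxdeg
            mdp.T{a}(s, s) = 1;   % unused action slots just stay in place
        end
    end
    mdp.gamma = 0.8;              % discounting makes longer paths worth less
end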
2-a) This is my implementation of the value iteration function. It is customized to work exclusively with make_mdp_for_gridworld.m.
% value iteration:
function [V, pi] = value_iteration(mdp, precision)
%IN: mdp, precision
%OUT: V, pi
% Recall: to obtain an estimate of the value function within accuracy of
% "precision" it suffices that one of the following conditions is met:
% (i) max(abs(V_new-V)) <= precision / (2*gamma/(1-gamma))
% (ii) gamma^i * Rmax / (1-gamma) <= precision -- with i the value
% iteration count, and Rmax = max_{s,a,s'} | R(s,a,s') |
%Initializing local variables
U = zeros(1,12);       % value function from the previous iteration
U_temp = zeros(1,12);  % newly computed value function
delta = 10;            % current max change between successive value functions
totaliterations = 0;
% Keep iterating until delta satisfies stopping condition (i) above.
while delta > precision / (2*mdp.gamma/(1-mdp.gamma))
    totaliterations = totaliterations + 1;
    U = U_temp;
    delta = 0;
    % for every state in the mdp:
    for s = 1:12
        % Sum(a) accumulates the backed-up value of taking action a in state s.
        Sum = zeros(1,4);
        % for every action
        for a = 1:4
            % for every destination state
            for s_prime = 1:12
                % Bellman backup: immediate reward plus discounted next value,
                % both weighted by the transition probability.
                Sum(a) = Sum(a) + mdp.T{a}(s,s_prime) * (mdp.R{a}(s,s_prime) + mdp.gamma*U(s_prime));
            end
        end
        U_temp(s) = max(Sum);
        if delta < abs(U(s) - U_temp(s))
            delta = abs(U(s) - U_temp(s));
        end
    end
end
% The most recent sweep is stored in U_temp, so that is our value function.
V = U_temp;
% We have the utility/value function; now extract the greedy policy Pi.
pi = ones(1,12);
for x = 1:12
    for move = 1:4
        moveVal = 0;
        piVal = 0;
        for x_prime = 1:12
            % backed-up value of taking "move" vs. the currently stored policy action
            moveVal = moveVal + mdp.T{move}(x,x_prime) * (mdp.R{move}(x,x_prime) + mdp.gamma*V(x_prime));
            piVal = piVal + mdp.T{pi(x)}(x,x_prime) * (mdp.R{pi(x)}(x,x_prime) + mdp.gamma*V(x_prime));
        end
        if moveVal > piVal
            pi(x) = move;
        end
    end
end
end
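For completeness, this is roughly how the function gets called (I am assuming here that make_mdp_for_gridworld.m takes no arguments and returns the mdp struct; the 3x4 reshape of the 12 states is also just my guess at the grid layout):
% Sketch of a small driver script; adjust if make_mdp_for_gridworld differs.
mdp = make_mdp_for_gridworld();        % the course-provided gridworld MDP
[V, pi] = value_iteration(mdp, 1e-4);  % value function and greedy policy
disp(reshape(V, 3, 4));                % assumed 3x4 layout of the 12 states
disp(reshape(pi, 3, 4));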
And these are the solutions I got; first, for V:
and for Pi:
Which means this would be the grid map:
As always, the code and these results are in a zip file attached to the page.
2-b)
2-b) This part is long, difficult, and extremely time consuming, to the point that I was not able to get the value iteration function to converge in time. Why is this part so difficult? Let me explain.
Each state in this problem is defined by 4 values. Three of them are the number of tokens currently in each of the lists a, b and c, and the last one answers the question: "Which list are we serving right now?". How many states does that make? 3 x 5 x 5 x 5 = 375 states. And that is just the tip of the iceberg. To implement "make_mdp_3queues" we need three things: the transition function, the reward function and gamma. The easiest part is gamma, because it is already given to us. The reward function is also very simple, because all we need to do is add a point whenever we stay at a list that is not empty (see the sketch below). The really, really difficult part is the transition function.
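A sketch of what I mean by the reward (the state layout [current_list, tokens_a, tokens_b, tokens_c] and the function name are my own, not from the handout):
% Sketch: one-step reward in the 3-queues problem, as I understand it.
function r = queue_reward(state)
    counts = state(2:4);          % token counts of lists a, b and c
    if counts(state(1)) > 0
        r = 1;                    % serving a non-empty list earns a point
    else
        r = 0;                    % staying at an empty list earns nothing
    end
end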
You see, the transition function takes 3 things: the action, the current state, and the next state. What kind of a mapping is that? We are mapping every state to every other state, multiplied by the number of possible moves. What does that make? 375 x 375 x 3 = 421,875. So that is a lot of transitions, and it takes a long time to fill in all of them. And we are not even done yet, because the value iteration function does not take strange 4-parameter states; it takes a single number for each state. That is 375 states if we number them. But how do we number the states? In what order? If I do not want to do it by hand, and as a computer science student of course I do not, then I have to find a way to mechanize this numbering. A number should point to exactly one state, and we should be able to compute that number from a given state, and the state back from the number. How do we write code to do that?
Well, the easiest possible way is to use a 4-digit number, each digit of which shows the contents of its corresponding list, plus the current list (see the sketch below). What does that mean for how many indices we need? With 4 digits, even if we save the smallest field (the current list) for the largest place value, the largest index we reach is 3555. Accounting for the fact that this is a state-to-state mapping with 3 possible moves, that makes just the initializing loop 3555 x 3555 x 3 = 37,914,075 iterations long; and this is the easiest numbering imaginable!
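In code, the digit trick could look roughly like this (a sketch with my own helper names; the indexing in my attached code may differ in details):
% Sketch: map a state [current_list, tokens_a, tokens_b, tokens_c] to a single
% integer index using decimal digits, and back again.
function idx = state_to_index(state)
    % the 3-valued current_list takes the thousands digit, the token counts
    % take the hundreds, tens and ones digits
    idx = 1000*state(1) + 100*state(2) + 10*state(3) + state(4);
end

function state = index_to_state(idx)
    state(1) = floor(idx/1000);
    state(2) = floor(mod(idx,1000)/100);
    state(3) = floor(mod(idx,100)/10);
    state(4) = mod(idx,10);
end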
This giant number is just for initializing the MDP values, and that part I managed to do successfully. My problem starts when we put this monster inside the while loop of the value iteration function. There is no foreseeable end to how long that takes, given that we do not even have a maximum on our time steps; it is effectively an online problem either way. (A sketch of how the inner backup could at least be vectorized is below.)
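In hindsight, the loops over destination states are the expensive part; if T and R are stored as matrices like in part 2-a, one full backup sweep can be written as a few matrix operations instead (a sketch only, meant to replace the body of my while loop, which I did not have time to run to convergence):
% Sketch: one value-iteration sweep with the destination-state loop vectorized.
nS = numel(U); nA = numel(mdp.T);
Q = zeros(nS, nA);                 % Q(s,a): backed-up value of action a in state s
for a = 1:nA
    % expected immediate reward plus discounted next value, for all states at once
    Q(:,a) = sum(mdp.T{a} .* mdp.R{a}, 2) + mdp.gamma * (mdp.T{a} * U(:));
end
U_temp = max(Q, [], 2)';           % greedy backup for every state in one shot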
Attached, you can find my full code. I have implemented the MDP function and it runs, but it runs for so long that converging before the deadline proved practically impossible. To show that the MDP itself works, here is a picture of some of the values it generated:
I spent many, many hours on this, so I am sorry, but the branching factor is simply enormous.