Robotic Artificial Intelligence

Youngmoon Lee
Robotics Department

This work is a result of research carried out with support from the Ministry of Education and the National Research Foundation of Korea under the Digital New Technology Talent Development Innovative Sharing University project.

While Artificial Intelligence (AI) is all about computational rationality that maximizes expected utility, Robotic AI is the realization of that computational rationality in physical space, in the form of robotic perception, planning, and control that maximize expected utility.

In the lectures, we will focus on AI, machine learning (especially reinforcement learning), and representation and deep learning.
In the projects, we will develop an intelligent agent that plays Pacman in the context of explore vs. exploit (eat vs. run).

Week1: Introduction

What is AI?

Artificial Intelligence is all about computational rationality (Maximize Expected Utility)
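
As a compact statement of this principle in standard decision-theoretic notation (not a formula taken from the lecture slides), a rational agent chooses

    a^{*} = \arg\max_{a} \mathbb{E}[\,U \mid a\,] = \arg\max_{a} \sum_{s} P(s \mid a)\, U(s),

where P(s | a) is the probability of outcome s when action a is taken and U(s) is its utility. The MDP and reinforcement learning weeks below make the same idea concrete with rewards and value functions.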

AI consists of Machine Learning (ML), Natural Language Processing (NLP), Computer Vision, and Robotics.

How to learn AI?

We will favor breadth over depth, learning as many AI techniques as possible along with their pros and cons, so that we can apply all of them and pick the best (an AI toolbox).

We will focus on Machine Learning by digging into its statistical foundations rather than into ML frameworks.

Lecture1 AI
Lecture2 Computational Rationality

Week2: Mathematical Foundation

This course aims to give you a toolbox for machine learning. Since quite a few of you struggle with math, we will review the following essentials for machine learning.
BUT these are prerequisites for this course! You should be very comfortable with them.

Lecture3 Linear Algebra
Lecture4 Optimization
Lecture5 Probability

Week3: Classification and Probabilistic Setup

Lecture6 Bayes Classifier

  • Bayes classifier

  • Linear Discriminant Analysis (LDA)

Week4: Naive Bayes Classifier and Logistic Regression

Lecture7 NB LR
Lecture8 NB Example

  • Naive Bayes Classifier

  • Logistic Regression

  • Laplace Smoothing

  • Example of Naive Bayes Classifier (Spam filtering, Digit recognition)

Week5: Regression

Lecture9 Regression

Optimization and Convexity Note
Linear Regression Note
1. Least Square Regression
2. Ridge Regression
3. Robust Regression
4. Gradient Descent

Week6: Perceptron

Lecture14 Perceptron

  • Perceptron (or just a linear classifier or regression function)

  • Multi-Layer Perceptron (non-linear model for a regression function or Neural Network)

Week7: Markov Decision Processes

Lecture10 MDP
Lecture11 MDP (Continued)

  • MDP(State, Actions, Transition Function, Reward Function, Start State, Terminal State)

  • Discounting

  • Bellman Equations and Value Iteration

Week8: Reinforcement Learning

Lecture12 RL
Lecture12 RL (Continued)

  • Reinforcement Learning

  • Model-based and Model-free Learning

  • Q-state, Q-Value Iteration, Q-Learning

Week9: Exploration vs. Exploitation

Lecture13 Regret

  • Q-Learning again

  • Explore vs. Exploit

Week10: Neural Network

Lecture15 NN

  • Colab Tutorial

  • Neural Network Structure

  • Activation Function, Softmax

  • Forward propagation, Backpropagation

Week11: Deep Learning

Lecture16 CNN

  • Backpropagation

  • CNN, RNN, DNN, GAN

Week12: Conclusion

Lecture17 Conclusion
Lecture18 Final Review

  • Final Review

  • Final Exam

Midterm Project

Getting Started

The projects for this class assume you use Python 2.7. You may use Linux, Windows 10, or Mac OS.

If you run Windows 10, install Ubuntu by following this:

https://ubuntu.com/tutorials/tutorial-ubuntu-on-windows#1-overview

You need to set up the sudo password.

$sudo apt update

$sudo apt install python2.7 python-pip

$sudo apt install unzip zip vim git

If asked, please enter sudo password.

Install and run Xming

https://sourceforge.net/projects/xming/files/latest/download

Set the display forwarding

$export DISPLAY=localhost:0.0

You need a GitHub account to copy the project files. Please create an account on GitHub and remember your ID/password.
(Try to learn git: https://rogerdudler.github.io/git-guide/index.ko.html)
Download the project files with git clone and, if asked, enter your ID/password.
Throughout the project, commit your changes with git commit -m "change message" so we can track your progress (at least once for each task).

$git clone https://github.com/leeymcj/pacman.git
$cd pacman

You should be able to play a game of Pacman by typing the following at the command line (note: the left key is 'a' and the right key is 'd'):
$python2.7 pacman.py

We will develop an intelligent agent (Alpha-Pacman) that plays Pacman for us.

Submission
Zip the entire project files, including the git log:

$zip <MyName>.zip -r * .git
Then upload <MyName>.zip to the LMS homework submission page.

Pacman Search Project (4 tasks)

In this project, your Pacman agent will find paths through his maze world, both to reach a particular location and to collect food efficiently. You will build general search algorithms and apply them to Pacman scenarios.

This project includes an autograder for you to grade your answers on your machine. This can be run with the command:

python2.7 autograder.py

See the autograder tutorial in Project 0 for more information about using the autograder.

The code for this project consists of several Python files, some of which you will need to read and understand in order to complete the assignment; the rest can be ignored and should not be modified.

Files you'll edit:

search.py Where all of your search algorithms will reside.

searchAgents.py Where all of your search-based agents will reside.

Files you may read but NOT edit:

pacman.py The main file that runs Pacman games. This file describes a Pacman GameState type, which you use in this project.

game.py The logic behind how the Pacman world works. This file describes several supporting types like AgentState, Agent, Direction, and Grid.

util.py Useful data structures for implementing search algorithms.

The simplest agent in searchAgents.py is called the GoWestAgent, which always goes West (a trivial reflex agent). This agent can occasionally win:

python2.7 pacman.py --layout testMaze --pacman GoWestAgent

But, things get ugly for this agent when turning is required:

python2.7 pacman.py --layout tinyMaze --pacman GoWestAgent

If Pacman gets stuck, you can exit the game by typing CTRL-c into your terminal.

Soon, your agent will solve not only tinyMaze, but any maze you want.

Note that pacman.py supports a number of options that can each be expressed in a long way (e.g., --layout) or a short way (e.g., -l). You can see the list of all options and their default values via:

python2.7 pacman.py -h

Task1: Finding a Fixed Food Dot using Depth First Search

In searchAgents.py, you'll find a fully implemented SearchAgent, which plans out a path through Pacman's world and then executes that path step-by-step. The search algorithms for formulating a plan are not implemented -- that's your job. As you work through the following questions, you might find it useful to refer to the object glossary (the second to last tab in the navigation bar above).

First, test that the SearchAgent is working correctly by running:

python2.7 pacman.py -l tinyMaze -p SearchAgent -a fn=tinyMazeSearch

The command above tells the SearchAgent to use tinyMazeSearch as its search algorithm, which is implemented in search.py. Pacman should navigate the maze successfully.

Now it's time to write full-fledged generic search functions to help Pacman plan routes! Pseudocode for the search algorithms you'll write can be found in the lecture slides. Remember that a search node must contain not only a state but also the information necessary to reconstruct the path (plan) which gets to that state.

Important note: All of your search functions need to return a list of actions that will lead the agent from the start to the goal. These actions all have to be legal moves (valid directions, no moving through walls).

Important note: Make sure to use the Stack, Queue and PriorityQueue data structures provided to you in util.py! These data structure implementations have particular properties which are required for compatibility with the autograder.

Hint: Each algorithm is very similar. Algorithms for DFS, BFS, UCS, and A* differ only in the details of how the fringe is managed. So, concentrate on getting DFS right and the rest should be relatively straightforward. Indeed, one possible implementation requires only a single generic search method which is configured with an algorithm-specific queuing strategy. (Your implementation need not be of this form to receive full credit).

Implement the depth-first search (DFS) algorithm in the depthFirstSearch function in search.py. To make your algorithm complete, write the graph search version of DFS, which avoids expanding any already visited states.
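
As a rough illustration of the single generic search method mentioned in the hint above, here is a sketch only, assuming the SearchProblem interface (getStartState, isGoalState, getSuccessors) and the fringe classes in util.py; check those files for the exact method names before relying on it:

import util

def genericGraphSearch(problem, fringe):
    # Graph search parameterized by the fringe data structure.
    # Each fringe entry is (state, list of actions that reaches that state).
    fringe.push((problem.getStartState(), []))
    visited = set()
    while not fringe.isEmpty():
        state, path = fringe.pop()
        if problem.isGoalState(state):
            return path
        if state not in visited:
            visited.add(state)
            for successor, action, stepCost in problem.getSuccessors(state):
                fringe.push((successor, path + [action]))
    return []  # no solution found

def depthFirstSearchSketch(problem):
    # DFS is the generic search with a LIFO fringe (util.Stack);
    # BFS would use util.Queue instead.
    return genericGraphSearch(problem, util.Stack())

The same skeleton gives breadth-first search by swapping util.Stack for util.Queue, which is one reason the hint suggests getting DFS right first.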

Your code should quickly find a solution for:

python2.7 pacman.py -l tinyMaze -p SearchAgent

python2.7 pacman.py -l mediumMaze -p SearchAgent

python2.7 pacman.py -l bigMaze -z .5 -p SearchAgent

The Pacman board will show an overlay of the states explored, and the order in which they were explored (brighter red means earlier exploration). Is the exploration order what you would have expected? Does Pacman actually go to all the explored squares on his way to the goal?

Hint: If you use a Stack as your data structure, the solution found by your DFS algorithm for mediumMaze should have a length of 130 (provided you push successors onto the fringe in the order provided by getSuccessors; you might get 246 if you push them in the reverse order). Is this a least cost solution? If not, think about what depth-first search is doing wrong.

Task2: Breadth First Search

Implement the breadth-first search (BFS) algorithm in the breadthFirstSearch function in search.py. Again, write a graph search algorithm that avoids expanding any already visited states. Test your code the same way you did for depth-first search.

python2.7 pacman.py -l mediumMaze -p SearchAgent -a fn=bfs

python2.7 pacman.py -l bigMaze -p SearchAgent -a fn=bfs -z .5

Does BFS find a least cost solution? If not, check your implementation.

Hint: If Pacman moves too slowly for you, try the option --frameTime 0.

Note: If you've written your search code generically, your code should work equally well for the eight-puzzle search problem without any changes.

python2.7 eightpuzzle.py

Task3: Varying the Cost Function

While BFS will find a fewest-actions path to the goal, we might want to find paths that are "best" in other senses. Consider mediumDottedMaze and mediumScaryMaze.

By changing the cost function, we can encourage Pacman to find different paths. For example, we can charge more for dangerous steps in ghost-ridden areas or less for steps in food-rich areas, and a rational Pacman agent should adjust its behavior in response.

Implement the uniform-cost graph search algorithm in the uniformCostSearch function in search.py. We encourage you to look through util.py for some data structures that may be useful in your implementation. You should now observe successful behavior in all three of the following layouts, where the agents below are all UCS agents that differ only in the cost function they use (the agents and cost functions are written for you):

python2.7 pacman.py -l mediumMaze -p SearchAgent -a fn=ucs

python2.7 pacman.py -l mediumDottedMaze -p StayEastSearchAgent

python2.7 pacman.py -l mediumScaryMaze -p StayWestSearchAgent

Note: You should get very low and very high path costs for the StayEastSearchAgent and StayWestSearchAgent respectively, due to their exponential cost functions (see searchAgents.py for details).

Task4: Greedy Algorithm

Sometimes, even with breadth-first or depth-first search, finding the optimal path through all the dots is hard. In these cases, we'd still like to find a reasonably good path, quickly. In this section, you'll write an agent that always greedily eats the closest dot. ClosestDotSearchAgent is implemented for you in searchAgents.py, but it's missing a key function that finds a path to the closest dot.

Implement the function findPathToClosestDot in searchAgents.py. Our agent solves this maze (suboptimally!) in under a second with a path cost of 350:

python2.7 pacman.py -l bigSearch -p ClosestDotSearchAgent -z .5

Hint: The quickest way to complete findPathToClosestDot is to fill in the AnyFoodSearchProblem, which is missing its goal test. Then, solve that problem with an appropriate search function. The solution should be very short!
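
A minimal sketch of that goal test, assuming AnyFoodSearchProblem stores the remaining food as a Grid in self.food indexed by [x][y] (the attribute name is an assumption; check the constructor in searchAgents.py):

def isGoalState(self, state):
    x, y = state
    # any square that still contains food is an acceptable goal
    return self.food[x][y]

With that in place, findPathToClosestDot can build an AnyFoodSearchProblem for the current game state and hand it to one of your search functions; breadth-first search is a natural choice, since it returns a shortest path to the nearest dot.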

Your ClosestDotSearchAgent won't always find the shortest possible path through the maze. Make sure you understand why and try to come up with a small example where repeatedly going to the closest dot does not result in finding the shortest path for eating all the dots.

Final Project: Pacman AI

In the final project, we will implement value iteration and Q-learning. We will test agents first on Gridworld, then apply them to Pacman.

Getting Started

The projects for this class assume you use Python 2.7. You may use Linux, Windows 10, or Mac OS.

If you run Windows 10, install Ubuntu by following this:

https://ubuntu.com/tutorials/tutorial-ubuntu-on-windows#1-overview

You need to set up the sudo password.

$sudo apt update

$sudo apt install python2.7 python-pip

$sudo apt install unzip zip vim git

If asked, please enter sudo password.

Install and run Xming

https://sourceforge.net/projects/xming/files/latest/download

Set the display forwarding

$export DISPLAY=localhost:0.0

You need a GitHub account to copy the project files. Please create an account on GitHub and remember your ID/password.
(Try to learn git: https://rogerdudler.github.io/git-guide/index.ko.html)
Download the project files with git clone and, if asked, enter your ID/password.
Throughout the project, commit your changes with git commit -m "change message" so we can track your progress (at least once for each task).

$git clone https://github.com/leeymcj/pacman-ai.git
$cd pacman-ai

You should be able to play a game of Pacman by typing the following at the command line (note: the left key is 'a' and the right key is 'd'):
$python2.7 pacman.py

Submission
Zip the entire project files, including the git log:

$zip <YourName>.zip -r * .git
Then upload <YourName>.zip to the LMS homework submission page.


We will develop an intelligent agent (Pacman-AI) that plays Pacman for us.
Files you'll edit:

valueIterationAgents.py A value iteration agent for solving known MDPs.

qlearningAgents.py Q-learning agents for Gridworld and Pacman.

Files you should read but NOT edit:
mdp.py Defines methods on general MDPs.

learningAgents.py Defines the base classes ValueEstimationAgent and QLearningAgent, which your agents will extend.

util.py Utilities, including util.Counter, which is particularly useful for Q-learners.

gridworld.py The Gridworld implementation.

featureExtractors.py Classes for extracting features on (state,action) pairs. Used for the approximate Q-learning agent (in qlearningAgents.py).


To get started, run Gridworld in manual control mode, which uses the arrow keys:

python2.7 gridworld.py -m

You will see the two-exit layout from class. The blue dot is the agent. Note that when you press up, the agent only actually moves north 80% of the time. Such is the life of a Gridworld agent!

You can control many aspects of the simulation. A full list of options is available by running:

python2.7 gridworld.py -h

The default agent moves randomly:

python2.7 gridworld.py -g MazeGrid

You should see the random agent bounce around the grid until it happens upon an exit. Not the finest hour for an AI agent.

Note: The Gridworld MDP is such that you first must enter a pre-terminal state (the double boxes shown in the GUI) and then take the special 'exit' action before the episode actually ends (in the true terminal state called TERMINAL_STATE, which is not shown in the GUI). If you run an episode manually, your total return may be less than you expected, due to the discount rate (-d to change; 0.9 by default).

Look at the console output that accompanies the graphical output (or use -t for all text). You will be told about each transition the agent experiences (to turn this off, use -q).

As in Pacman, positions are represented by (x,y) Cartesian coordinates and any arrays are indexed by [x][y], with 'north' being the direction of increasing y, etc. By default, most transitions will receive a reward of zero, though you can change this with the living reward option (-r).

Task1: MDP and Value Iteration

Recall the value iteration state update equation:

Vk+1(s) = max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ Vk(s') ]

Write a value iteration agent in ValueIterationAgent, which has been partially specified for you in valueIterationAgents.py. Your value iteration agent is an offline planner, not a reinforcement learning agent, and so the relevant training option is the number of iterations of value iteration it should run (option -i) in its initial planning phase. ValueIterationAgent takes an MDP on construction and runs value iteration for the specified number of iterations before the constructor returns.

Value iteration computes k-step estimates of the optimal values, Vk. In addition to running value iteration, implement the following methods for ValueIterationAgent using Vk.

  • computeActionFromValues(state) computes the best action according to the value function given by self.values.

  • computeQValueFromValues(state, action) returns the Q-value of the (state, action) pair given by the value function given by self.values.
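
A minimal sketch of these two methods, assuming self.values is a util.Counter of state values, self.discount holds the discount, and the mdp object exposes getPossibleActions(state), getTransitionStatesAndProbs(state, action), and getReward(state, action, nextState) as in mdp.py (verify the exact names in the starter code):

def computeQValueFromValues(self, state, action):
    # Q(s, a) = sum over s' of T(s, a, s') * (R(s, a, s') + discount * V(s'))
    return sum(prob * (self.mdp.getReward(state, action, nextState)
                       + self.discount * self.values[nextState])
               for nextState, prob in
               self.mdp.getTransitionStatesAndProbs(state, action))

def computeActionFromValues(self, state):
    # Best action under the current value function; None if no actions exist.
    actions = self.mdp.getPossibleActions(state)
    if not actions:
        return None
    return max(actions, key=lambda a: self.computeQValueFromValues(state, a))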

These quantities are all displayed in the GUI: values are numbers in squares, Q-values are numbers in square quarters, and policies are arrows out from each square.

Important: Use the "batch" version of value iteration where each vector Vk is computed from a fixed vector Vk-1 (like in lecture), not the "online" version where one single weight vector is updated in place. This means that when a state's value is updated in iteration k based on the values of its successor states, the successor state values used in the value update computation should be those from iteration k-1 (even if some of the successor states had already been updated in iteration k). The difference is discussed in Sutton & Barto in the 6th paragraph of chapter 4.1.
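
Concretely, the batch sweep might be sketched as follows, under the same assumptions as above (the method name here is illustrative, and whether this loop lives in the constructor or in a helper depends on the starter code; util.Counter comes from the util module already imported there):

def runBatchValueIteration(self):
    for _ in range(self.iterations):
        newValues = util.Counter()          # Vk, built from the frozen Vk-1
        for state in self.mdp.getStates():
            actions = self.mdp.getPossibleActions(state)
            if actions:                     # states with no actions keep the default value 0
                newValues[state] = max(self.computeQValueFromValues(state, a)
                                       for a in actions)
        self.values = newValues             # swap in Vk only after the full sweep

Because self.values is only replaced after the sweep finishes, every update in iteration k reads values from iteration k-1, which is exactly the batch behavior described above.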

Note: A policy synthesized from values of depth k (which reflect the next k rewards) will actually reflect the next k+1 rewards (i.e. you return πk+1). Similarly, the Q-values will also reflect one more reward than the values (i.e. you return Qk+1).

You should return the synthesized policy πk+1.

Hint: You may optionally use the util.Counter class in util.py, which is a dictionary with a default value of zero. However, be careful with argMax: the actual argmax you want may be a key not in the counter!

Note: Make sure to handle the case when a state has no available actions in an MDP (think about what this means for future rewards).

The following command loads your ValueIterationAgent, which will compute a policy and execute it 10 times. Press a key to cycle through values, Q-values, and the simulation. You should find that the value of the start state (V(start), which you can read off of the GUI) and the empirical resulting average reward (printed after the 10 rounds of execution finish) are quite close.

python2.7 gridworld.py -a value -i 100 -k 10

Hint: On the default BookGrid, running value iteration for 5 iterations should give you this output:

python2.7 gridworld.py -a value -i 5

Task2: Q-Learning

Note that your value iteration agent does not actually learn from experience. Rather, it ponders its MDP model to arrive at a complete policy before ever interacting with a real environment. When it does interact with the environment, it simply follows the precomputed policy (e.g. it becomes a reflex agent). This distinction may be subtle in a simulated environment like Gridworld, but it's very important in the real world, where the real MDP is not available.

You will now write a Q-learning agent, which does very little on construction, but instead learns by trial and error from interactions with the environment through its update(state, action, nextState, reward) method. A stub of a Q-learner is specified in QLearningAgent in qlearningAgents.py, and you can select it with the option '-a q'. For this question, you must implement the update, computeValueFromQValues, getQValue, and computeActionFromQValues methods.

Note: For computeActionFromQValues, you should break ties randomly for better behavior. The random.choice() function will help. In a particular state, actions that your agent hasn't seen before still have a Q-value, specifically a Q-value of zero, and if all of the actions that your agent has seen before have a negative Q-value, an unseen action may be optimal.

Important: Make sure that in your computeValueFromQValues and computeActionFromQValues functions, you only access Q-values by calling getQValue. This abstraction will be useful if you later override getQValue to use features of state-action pairs rather than the state-action pairs themselves (as in approximate Q-learning, which this course does not cover).
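
A hedged sketch of the core learning methods, assuming Q-values live in a util.Counter created in the constructor (self.qValues here is an illustrative name), that self.alpha and self.discount hold the learning rate and discount, and that the base class provides self.getLegalActions(state):

def getQValue(self, state, action):
    # unseen (state, action) pairs default to 0.0 in a util.Counter
    return self.qValues[(state, action)]

def computeValueFromQValues(self, state):
    actions = self.getLegalActions(state)
    if not actions:
        return 0.0                          # terminal state: no future reward
    return max(self.getQValue(state, a) for a in actions)

def update(self, state, action, nextState, reward):
    # sample = r + discount * max_a' Q(s', a'); blend it into the running estimate
    sample = reward + self.discount * self.computeValueFromQValues(nextState)
    self.qValues[(state, action)] = ((1 - self.alpha) * self.getQValue(state, action)
                                     + self.alpha * sample)

computeActionFromQValues follows the same pattern as computeValueFromQValues but returns an action, breaking ties with random.choice as noted above.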

With the Q-learning update in place, you can watch your Q-learner learn under manual control, using the keyboard:

python2.7 gridworld.py -a q -k 5 -m

Recall that -k will control the number of episodes your agent gets to learn. Watch how the agent learns about the state it was just in, not the one it moves to, and "leaves learning in its wake." Hint: to help with debugging, you can turn off noise by using the --noise 0.0 parameter (though this obviously makes Q-learning less interesting). If you manually steer Pacman north and then east along the optimal path for four episodes, you should see the following Q-values:

Complete your Q-learning agent by implementing epsilon-greedy action selection in getAction, meaning it chooses random actions an epsilon fraction of the time, and follows its current best Q-values otherwise. Note that choosing a random action may result in choosing the best action - that is, you should not choose a random sub-optimal action, but rather any random legal action.

python2.7 gridworld.py -a q -k 100

Your final Q-values should resemble those of your value iteration agent, especially along well-traveled paths. However, your average returns will be lower than the Q-values predict because of the random actions and the initial learning phase.

You can choose an element from a list uniformly at random by calling the random.choice function. You can simulate a binary variable with probability p of success by using util.flipCoin(p), which returns True with probability p and False with probability 1-p.
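
Putting those two utilities together, an epsilon-greedy getAction might be sketched as follows (assuming self.getLegalActions(state) from the base agent class and the computeActionFromQValues method described above):

import random
import util

def getAction(self, state):
    legalActions = self.getLegalActions(state)
    if not legalActions:
        return None                                  # terminal state: no legal moves
    if util.flipCoin(self.epsilon):
        return random.choice(legalActions)           # explore: any random legal action
    return self.computeActionFromQValues(state)      # exploit: current best action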

Task3: Q-Learning for Pacman

Time to play some Pacman! Pacman will play games in two phases. In the first phase, training, Pacman will begin to learn about the values of positions and actions. Because it takes a very long time to learn accurate Q-values even for tiny grids, Pacman's training games run in quiet mode by default, with no GUI (or console) display. Once Pacman's training is complete, he will enter testing mode. When testing, Pacman's self.epsilon and self.alpha will be set to 0.0, effectively stopping Q-learning and disabling exploration, in order to allow Pacman to exploit his learned policy. Test games are shown in the GUI by default. Without any code changes you should be able to run Q-learning Pacman for very tiny grids as follows:

python2.7 pacman.py -p PacmanQAgent -x 2000 -n 2010 -l smallGrid

Note that PacmanQAgent is already defined for you in terms of the QLearningAgent you've already written. PacmanQAgent is only different in that it has default learning parameters that are more effective for the Pacman problem (epsilon=0.05, alpha=0.2, gamma=0.8). You will receive full credit for this question if the command above works without exceptions and your agent wins at least 80% of the time. The autograder will run 100 test games after the 2000 training games.

Hint: If your QLearningAgent works for gridworld.py but does not seem to be learning a good policy for Pacman on smallGrid, it may be because your getAction and/or computeActionFromQValues methods do not in some cases properly consider unseen actions. In particular, because unseen actions have by definition a Q-value of zero, if all of the actions that have been seen have negative Q-values, an unseen action may be optimal. Beware of the argmax function from util.Counter!

Note: If you want to experiment with learning parameters, you can use the option -a, for example -a epsilon=0.1,alpha=0.3,gamma=0.7. These values will then be accessible as self.epsilon, self.gamma and self.alpha inside the agent.

Note: While a total of 2010 games will be played, the first 2000 games will not be displayed because of the option -x 2000, which designates the first 2000 games for training (no output). Thus, you will only see Pacman play the last 10 of these games. The number of training games is also passed to your agent as the option numTraining.

Note: If you want to watch 10 training games to see what's going on, use the command:

python2.7 pacman.py -p PacmanQAgent -n 10 -l smallGrid -a numTraining=10

During training, you will see output every 100 games with statistics about how Pacman is faring. Epsilon is positive during training, so Pacman will play poorly even after having learned a good policy: this is because he occasionally makes a random exploratory move into a ghost. As a benchmark, it should take between 1,000 and 1,400 games before Pacman's average reward over a 100-episode segment becomes positive, reflecting that he has started winning more than losing. By the end of training, it should remain positive and be fairly high (between 100 and 350).

Make sure you understand what is happening here: the MDP state is the exact board configuration facing Pacman, with the now complex transitions describing an entire ply of change to that state. The intermediate game configurations in which Pacman has moved but the ghosts have not replied are not MDP states, but are bundled into the transitions.

Once Pacman is done training, he should win very reliably in test games (at least 90% of the time), since he is now exploiting his learned policy. For this project, you will get full marks if Pacman wins at least 80% of the test games.

However, you will find that training the same agent on the seemingly simple mediumGrid does not work well. In our implementation, Pacman's average training rewards remain negative throughout training. At test time, he plays badly, probably losing all of his test games. Training will also take a long time, despite its ineffectiveness.

Pacman fails to win on larger layouts because each board configuration is a separate state with separate Q-values. He has no way to generalize that running into a ghost is bad for all positions. Obviously, this approach will not scale. We would need approximate Q-learning to win (not covered in this course).

Submission
Zip the project files, including the git log:

$zip <MyName>.zip -r valueIterationAgents.py qlearningAgents.py .git
Then upload <MyName>.zip to the LMS homework submission page.