Reinforcement learning (RL)
Interaction-Based Learning: The agent learns by taking actions and receiving feedback.
No Labeled Data: Learns from trial and error.
Algorithms: Q-learning, SARSA, Deep Q-Networks (DQN).
Reinforcement Learning: a paradigm in which an agent learns through interactions with an environment.
RL algorithms use a reward-and-punishment paradigm as they process data. They learn from the feedback of each action and self-discover the best processing paths to achieve final outcomes. They are also capable of delayed gratification: the best overall strategy may require accepting small short-term penalties on the way to a larger reward.
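The feedback-driven learning just described is usually implemented as a value update. A minimal sketch of the standard tabular Q-learning rule follows; the symbols and numbers are textbook conventions and made-up values, not specifics from this article. The discount factor gamma is what lets the agent account for delayed rewards.

```python
alpha, gamma = 0.1, 0.9   # learning rate, discount factor (illustrative values)

def q_update(q, state, action, reward, next_q_values):
    # Move Q(s, a) toward reward + discounted best future value.
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (
        reward + gamma * max(next_q_values, default=0.0) - old
    )
    return q

q = q_update({}, "s0", "a0", 1.0, [0.5, 2.0])
print(q[("s0", "a0")])   # ≈ 0.28, i.e. 0.1 * (1.0 + 0.9 * 2.0)
```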
Benefits of reinforcement learning
Complex environments
RL algorithms can be used in complex environments with many rules and dependencies, where even a human with superior knowledge of the environment may not be able to determine the best path to take. Model-free RL algorithms, in particular, adapt quickly to continuously changing environments.
When to Use Model-Free RL
Model-free RL is preferred when the environment is too complex to model accurately, or in high-dimensional and unstructured scenarios. Examples include complex control tasks where environmental transitions are unknown.
Less human interaction
In traditional ML algorithms, humans label data pairs to direct the algorithm. When you use an RL algorithm, this isn’t necessary. It learns by itself. At the same time, it offers mechanisms to integrate human feedback, allowing for systems that adapt to human preferences, expertise, and corrections.
Optimizes for long-term goals
RL inherently focuses on long-term reward maximization, which makes it apt for scenarios where actions have prolonged consequences. It is particularly well-suited for real-world situations where feedback isn't immediately available for every step, since it can learn from delayed rewards.
Example: Decisions about energy consumption or storage might have long-term consequences. RL can be used to optimize long-term energy efficiency and cost. With appropriate architectures, RL agents can also generalize their learned strategies across similar but not identical tasks.
Marketing personalization
In applications like recommendation systems, RL can customize suggestions to individual users based on their interactions. This leads to more personalized experiences.
For example, an application may display ads to a user based on some demographic information. With each ad interaction, the application learns which ads to display to the user to optimize product sales.
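The ad-selection loop described above can be sketched as a simple multi-armed bandit with an epsilon-greedy policy. The ad names and click-through rates below are made up for illustration; the agent estimates each ad's value from observed clicks and gradually concentrates on the best one.

```python
import random

random.seed(0)
true_click_rate = {"ad_A": 0.05, "ad_B": 0.12, "ad_C": 0.08}  # hidden from the agent
counts = {ad: 0 for ad in true_click_rate}
values = {ad: 0.0 for ad in true_click_rate}   # running average reward per ad
epsilon = 0.1                                  # exploration probability

for _ in range(10_000):
    if random.random() < epsilon:              # explore: show a random ad
        ad = random.choice(list(true_click_rate))
    else:                                      # exploit: show the best estimate
        ad = max(values, key=values.get)
    reward = 1.0 if random.random() < true_click_rate[ad] else 0.0
    counts[ad] += 1
    values[ad] += (reward - values[ad]) / counts[ad]   # incremental mean

best = max(values, key=values.get)
```

With each interaction the value estimates sharpen, so the application learns which ad to display to optimize clicks.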
Optimization challenges
Traditional optimization methods solve problems by evaluating and comparing possible solutions based on certain criteria. In contrast, RL introduces learning from interactions to find the best or close-to-best solutions over time.
Financial predictions
The dynamics of financial markets are complex, with statistical properties that change over time. RL algorithms can optimize long-term returns by considering transaction costs and adapting to market shifts.
For instance, an algorithm could observe the rules and patterns of the stock market before it tests actions and records associated rewards. It dynamically creates a value function and develops a strategy to maximize profits.
How does reinforcement learning work?
The learning process of reinforcement learning (RL) algorithms is similar to animal and human reinforcement learning in the field of behavioral psychology. For instance, a child may discover that they receive parental praise when they help a sibling or clean up, but receive negative reactions when they throw toys or yell. Soon, the child learns which combination of activities results in the end reward.
An RL algorithm mimics a similar learning process. It tries different activities to learn the associated negative and positive values to achieve the end reward outcome.
Key concepts
The agent is the ML algorithm (or the autonomous system)
The environment is the adaptive problem space with attributes such as variables, boundary values, rules, and valid actions
The action is a step that the RL agent takes to navigate the environment
The state is the environment at a given point in time
The reward is the positive, negative, or zero value the agent receives for taking an action; in other words, the reward or punishment
The cumulative reward is the sum of all rewards or the end value
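The cumulative reward is usually computed as a discounted sum, so rewards received later count for less. A tiny worked example, with a made-up discount factor and reward sequence:

```python
gamma = 0.9                      # discount factor (illustrative)
rewards = [0.0, 0.0, 1.0, 5.0]   # reward received at steps 0..3

# Cumulative (discounted) reward: sum of gamma^t * r_t over all steps.
cumulative = sum(gamma ** t * r for t, r in enumerate(rewards))
# 0 + 0 + 0.81 * 1.0 + 0.729 * 5.0 ≈ 4.455
```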
RL has a predetermined end goal. While it takes an exploratory approach, the explorations are continuously validated and improved to increase the probability of reaching the end goal. It can teach itself to reach very specific outcomes.
Experimenting with real-world reward and punishment systems may not be practical.
With complex RL algorithms, the reasons a particular sequence of steps was taken may be difficult to ascertain. Which actions in a sequence were the ones that led to the optimal end result? This can cause implementation challenges.
Agent: The decision-maker that performs actions.
Environment: The world or system in which the agent operates.
State: The situation or condition the agent is currently in.
Action: The possible moves or decisions the agent can make.
Reward: The feedback or result from the environment based on the agent’s action.
Implement Reinforcement Learning
Step 1: Import Libraries (such as NumPy and Matplotlib) and Define the Maze, Start, and Goal
The maze is represented as a 2D NumPy array.
Zero values are safe paths; ones are obstacles the agent must avoid.
Start and goal define the positions where the agent begins and where it aims to reach.
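A minimal sketch of this step; the 5x5 layout and positions are illustrative, not the article's actual grid. Matplotlib is only needed later, for the visualization step.

```python
import numpy as np

# 0 = safe path, 1 = obstacle the agent must avoid
maze = np.array([
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
])
start = (0, 0)   # where the agent begins (row, col)
goal = (4, 4)    # where the agent aims to reach
```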
Step 2: Define RL Parameters and Initialize Q-Table
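A sketch of this step with common hyperparameter choices; the values are illustrative starting points, not tuned settings from the article.

```python
import numpy as np

n_rows, n_cols = 5, 5     # maze dimensions from Step 1
n_actions = 4             # up, down, left, right

alpha = 0.1               # learning rate: how strongly each update moves Q-values
gamma = 0.9               # discount factor: weight given to future rewards
epsilon = 0.2             # exploration rate for the epsilon-greedy policy
episodes = 2000           # number of training episodes

# One Q-value per (row, col, action) triple, initialized to zero.
q_table = np.zeros((n_rows, n_cols, n_actions))
```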
Step 3: Helper Function for Maze Validity and Action Selection
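The two helpers can be sketched as follows, reusing the illustrative 5x5 maze from Step 1. `is_valid` and `choose_action` are assumed names; the action-selection rule is standard epsilon-greedy.

```python
import numpy as np

maze = np.array([
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
])
q_table = np.zeros(maze.shape + (4,))
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def is_valid(cell, maze):
    # A cell is valid if it lies inside the grid and is not an obstacle.
    r, c = cell
    return 0 <= r < maze.shape[0] and 0 <= c < maze.shape[1] and maze[r, c] == 0

def choose_action(state, q_table, epsilon):
    # Epsilon-greedy: take a random action with probability epsilon,
    # otherwise exploit the action with the highest Q-value.
    if np.random.rand() < epsilon:
        return int(np.random.randint(len(ACTIONS)))
    r, c = state
    return int(np.argmax(q_table[r, c]))
```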
Step 4: Train the Agent with Q-Learning Algorithm
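A self-contained sketch of the training loop; the maze, reward values, and hyperparameters are illustrative. Each step applies the Q-learning update: nudge Q(s, a) toward the received reward plus the discounted best future value.

```python
import numpy as np

maze = np.array([
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
])
start, goal = (0, 0), (4, 4)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # up, down, left, right
alpha, gamma, epsilon, episodes = 0.1, 0.9, 0.2, 2000

rng = np.random.default_rng(0)                    # seeded for reproducibility
q_table = np.zeros(maze.shape + (len(ACTIONS),))

def valid(r, c):
    return 0 <= r < maze.shape[0] and 0 <= c < maze.shape[1] and maze[r, c] == 0

for _ in range(episodes):
    state = start
    for _ in range(100):                          # cap on episode length
        r, c = state
        if rng.random() < epsilon:                # explore
            a = int(rng.integers(len(ACTIONS)))
        else:                                     # exploit
            a = int(np.argmax(q_table[r, c]))
        nr, nc = r + ACTIONS[a][0], c + ACTIONS[a][1]
        if not valid(nr, nc):
            reward, next_state = -1.0, state      # punish hitting a wall; stay put
        elif (nr, nc) == goal:
            reward, next_state = 10.0, (nr, nc)   # reward reaching the goal
        else:
            reward, next_state = -0.1, (nr, nc)   # small step cost
        # Q-learning update
        best_next = np.max(q_table[next_state])
        q_table[r, c, a] += alpha * (reward + gamma * best_next - q_table[r, c, a])
        state = next_state
        if state == goal:
            break
```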
Step 5: Extract the Optimal Path after Training
The algorithm stops when the goal is reached or no valid next moves are available.
The visited set prevents cycles.
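The extraction loop can be sketched as follows. To keep the example deterministic, the Q-table below is hand-crafted to stand in for the trained one, with 1.0 marking the best action in each cell along one route; all values are illustrative.

```python
import numpy as np

maze = np.array([
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
])
start, goal = (0, 0), (4, 4)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # up, down, left, right

# Hand-crafted stand-in for the trained Q-table from Step 4.
q_table = np.full(maze.shape + (4,), -1.0)
for (r, c), a in {(0, 0): 3, (0, 1): 3, (0, 2): 1, (1, 2): 1,
                  (2, 2): 3, (2, 3): 3, (2, 4): 1, (3, 4): 1}.items():
    q_table[r, c, a] = 1.0

def extract_path(q_table, start, goal, max_steps=100):
    path, visited, state = [start], {start}, start
    for _ in range(max_steps):
        if state == goal:                         # stop when the goal is reached
            break
        r, c = state
        # Try actions from best to worst Q-value; the visited set prevents cycles.
        for a in sorted(range(len(ACTIONS)), key=lambda a: q_table[r, c, a], reverse=True):
            nr, nc = r + ACTIONS[a][0], c + ACTIONS[a][1]
            if (0 <= nr < maze.shape[0] and 0 <= nc < maze.shape[1]
                    and maze[nr, nc] == 0 and (nr, nc) not in visited):
                state = (nr, nc)
                visited.add(state)
                path.append(state)
                break
        else:
            break                                 # no valid unvisited move: stop
    return path

path = extract_path(q_table, start, goal)
```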
Step 6: Visualize the Maze, Robot Path, Start and Goal
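A sketch of the visualization; the maze and path are the illustrative ones from the earlier steps, and the colors, markers, and output filename are arbitrary choices.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")            # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

maze = np.array([
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
])
path = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 3), (2, 4), (3, 4), (4, 4)]
start, goal = path[0], path[-1]

fig, ax = plt.subplots()
ax.imshow(maze, cmap="gray_r")   # obstacles (1s) render dark, free cells light
rows, cols = zip(*path)
ax.plot(cols, rows, "b.-", label="robot path")   # note: x = column, y = row
ax.plot(start[1], start[0], "go", markersize=12, label="start")
ax.plot(goal[1], goal[0], "r*", markersize=14, label="goal")
ax.legend()
fig.savefig("maze_path.png")
```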
Types of Reinforcements
1. Positive Reinforcement: Positive reinforcement occurs when an event, delivered as a consequence of a particular behavior, increases the strength and frequency of that behavior. In other words, it has a positive effect on the behavior.
Advantages: Maximizes performance, helps sustain change over time.
Disadvantages: Overuse of rewards can lead to an overload of rewarded states, which may reduce their effectiveness.
2. Negative Reinforcement: Negative reinforcement strengthens a behavior because a negative condition is stopped or avoided as a consequence of that behavior.
Advantages: Increases behavior frequency, ensures a minimum performance standard.
Disadvantages: It may encourage only enough action to avoid the penalty.
Applications
Robotics: RL is used to automate tasks in structured environments such as manufacturing, where robots learn to optimize movements and improve efficiency.
Games: Advanced RL algorithms have been used to develop strategies for complex games like chess, Go and video games, outperforming human players in many instances.
Industrial Control: RL helps in real-time adjustments and optimization of industrial operations, such as refining processes in the oil and gas industry.
Personalized Training Systems: RL enables the customization of instructional content based on an individual's learning patterns, improving engagement and effectiveness.
Advantages
Solves complex sequential decision problems where other approaches can fail.
Learns from real-time interaction, enabling adaptation to changing environments.
Does not require labeled data, unlike supervised learning.
Can innovate by discovering new strategies beyond human intuition.
Handles uncertainty and stochastic environments effectively.
Disadvantages
Computationally intensive, requiring large amounts of data and processing power.
Reward function design is critical; poor design leads to unintended behaviors.
Not suitable for simple problems where traditional methods are more efficient.
Challenging to debug and interpret, making it hard to explain decisions.
Exploration-exploitation trade-off requires careful balancing to optimize learning.
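One common way to manage the trade-off in the last point is to decay the exploration rate over training: explore heavily at first, then shift toward exploitation. A sketch with illustrative schedule constants:

```python
# Multiplicative epsilon decay with a floor (all constants are illustrative).
eps_start, eps_end, decay = 1.0, 0.05, 0.995

epsilons = []
eps = eps_start
for episode in range(1000):
    epsilons.append(eps)                 # epsilon used for this episode
    eps = max(eps_end, eps * decay)      # decay, but never below the floor

# Early episodes explore almost always; late episodes mostly exploit.
```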
A reward signal is a scalar feedback value in reinforcement learning and behavioral science that indicates the desirability of an action, guiding an agent to maximize cumulative, long-term positive outcomes.
Common Pitfalls in Design
Designing poor reward signals can lead to "reward hacking," where the agent finds a way to earn high rewards without actually solving the intended task.
A well-structured signal must avoid being too sparse (making it hard for the agent to learn) or too dense (potentially confusing the objective).