Introduction to Reinforcement Learning (RL)
Introduction to the RL Framework and its Components
● Reinforcement learning (RL) is an area of machine learning that focuses on how an agent should act in an environment in order to maximize a given reward.
● Reinforcement learning algorithms study the behavior of subjects in such an environment and learn to optimize that behavior.
The concept behind reinforcement learning is that an agent learns from an environment by interacting with it and receiving rewards for performing actions.
Components Of RL Framework
● There are seven main components of the Reinforcement Learning framework: Agent, Environment, Reward, State, Action, Policy, and Value Function.
○ A reward Rt is a scalar feedback signal that indicates how well an agent is doing at step t.
○ Reinforcement learning is based on the reward hypothesis: all goals can be described by the maximization of expected cumulative reward.
○ The agent's job is to maximize cumulative reward; the agent is the one who takes decisions based on rewards.
○ Consider the example of a batsman in cricket. If he misses the ball he gets a negative point; if he hits it he is given a reward. From these positive and negative incentives he learns how to play a particular ball.
○ The environment is the task or simulation the agent interacts with.
○ The agent sends an action to the environment, and after executing each action the environment sends an observation and a reward back to the agent.
○ The state describes the current situation.
○ In a walking-robot program, for example, the state could be the position of the robot's two legs. When the robot takes an action in a state, it receives a reward.
○ An action is the agent's means of interacting with and changing the environment, and thus of moving between states.
○ Each action the agent carries out yields a reward from the environment.
○ A policy is the learning agent's way of behaving at a given time; it maps states to actions.
● Value Function
○ The value function is denoted by V(s).
○ The value function indicates how good it is for the agent to be in a given state.
○ The value function V(s) is the expected cumulative reward for an agent starting from state s.
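The agent-environment loop described above can be sketched in a few lines of Python. The ToyEnvironment class and its reward scheme are invented purely for illustration; real environments (e.g. those in the Gymnasium library) expose a similar step interface.

```python
import random

class ToyEnvironment:
    """A made-up one-dimensional environment: positions 0..3, and the
    episode ends when the agent reaches position 3 (the goal)."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        """Execute one action (+1 = right, -1 = left) and return the
        new observation, the reward, and whether the episode is done."""
        self.state = max(0, min(3, self.state + action))
        reward = 1 if action == 1 else -1   # moving toward the goal pays off
        done = self.state == 3
        return self.state, reward, done

# The agent-environment loop: the agent sends an action, the environment
# answers with an observation and a reward.
env = ToyEnvironment()
state, total_reward, done = 0, 0, False
while not done:
    action = random.choice([-1, 1])          # an untrained, random policy
    state, reward, done = env.step(action)
    total_reward += reward
```

A learning agent would replace the random choice with a policy that it improves from the rewards it collects.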
Examples of Reinforcement Learning Problems
○ Fly stunt maneuvers in a helicopter
○ Defeat the world champion at Backgammon
○ Manage an investment portfolio
○ Control a power station
○ Make a humanoid robot walk
○ Play many different Atari games better than humans
● Examples on Rewards:
○ Manage an investment portfolio
■ +ve reward for each $ in bank
○ Control a power station
■ +ve reward for producing power
■ -ve reward for exceeding safety thresholds
○ Make a humanoid robot walk
■ +ve reward for forward motion
■ -ve reward for falling over
● Difference between reinforcement learning and other machine learning paradigm
○ There is no supervisor, only a reward signal
■ No supervisor tells the agent the correct action or final outcome; the agent learns from trial and error.
○ Feedback is delayed, not instantaneous
○ Time really matters
○ The agent's actions affect the subsequent data it receives
Application Of Reinforcement Learning
● Resources Management in a computer cluster
○ Designing efficient algorithms that allocate limited resources among competing tasks is challenging and typically requires human-generated heuristics.
○ We can use reinforcement learning to automatically learn to allocate and schedule computer resources to waiting jobs, with the objective of minimizing the average job slowdown.
● Robotics
○ Reinforcement learning (RL) enables a robot to autonomously discover optimal behavior through trial-and-error interactions with its environment. Instead of explicitly detailing the solution to a problem, the designer of a control task provides feedback in the form of a scalar objective function that measures the one-step performance of the robot.
● Education and Training
○ Online platforms are beginning to experiment with machine learning to create personalized experiences. Researchers are exploring the use of RL and other forms of machine learning in tutoring and personalized learning systems. RL can lead to training systems that offer personalized instruction and materials tuned to the needs of individual students. Research groups are developing RL algorithms and statistical methods that require less data for future tutoring systems.
● Healthcare
○ The application of reinforcement learning to healthcare systems has also consistently generated better results.
Types of Reinforcement
● There are two types of Reinforcement
○ Positive Reinforcement
○ Negative Reinforcement
○ Positive reinforcement occurs when an event, following a particular action, increases that action's intensity and frequency. This has a positive effect on the behavior the agent takes: by providing a motivating/reinforcing stimulus after the desired behavior is shown, the behavior becomes more likely to happen in the future.
○ This type of reinforcement learning helps us to maximize performance and sustain change for a more extended period.
○ But too much reinforcement may lead to over-optimization of state, which can affect the results.
○ Example of positive reinforcement :
■ A little boy receives a prize (reinforcing stimulus) for every A in his report card.
■ A father gives candy (reinforcing stimulus) to his daughter when she cleans up her toys (behavior).
○ Negative Reinforcement is defined as the strengthening of behavior because a negative condition is stopped or avoided.
○ It helps us to define the minimum standard of performance.
○ The likelihood of the particular behavior occurring again in the future is increased because of removing/avoiding the negative consequence.
○ However, it only provides enough motivation to meet the minimum standard of behavior.
○ Example of negative reinforcement:
■ Parenting provides many real-life opportunities for negative reinforcement. Imagine, for example, a kid who doesn't want to sleep through the night. He wakes up several times a night and cries until his mother comes in to rock him back to sleep. He is effectively teaching his mother with negative reinforcement: he stops crying the moment she comes in to rock him to sleep.
■ Another everyday example of negative reinforcement comes when we are driving. Imagine we drive through rush-hour traffic to get to work. Our commute is very stressful and takes two hours every morning. We get irritated and find another way to get there. The new road has very little traffic, and we make it to work in 45 minutes. We get the same result a week later, so we start taking the new route every day to save time. Removing the negative stimulus of the bad traffic changed our behavior.
○ Punishment is a mechanism by which a consequence immediately follows a behavior that decreases the future frequency of that behavior.
○ Punishment in plain terms refers to any change that decreases the probability of activity happening again in the future.
○ There are two types of punishment:
■ Positive punishment
● In this type of punishment, we add an undesirable stimulus to decrease behavior.
● An example of positive punishment is scolding a student to get the student to stop texting in class. In this case, a stimulus (the reprimand) is added in order to decrease the behavior (texting in class).
■ Negative punishment
● In negative punishment, we remove a pleasant stimulus to decrease behavior.
● For example, a parent may take away a child's favorite toy if the child misbehaves.
○ Extinction refers to the gradual weakening of a conditioned response which results in a reduction or disappearance of the behavior.
○ Let’s understand by example.
■ Imagine a local grocery shop. A mother and her young son regularly visit that shop. The child always screams when he is checking out until his mother agrees to buy the child some candy. For a long time, the mother buys candy during checkout so the child will stop screaming.
■ But one day the mother refuses to buy the child candy. The son becomes increasingly upset when denied candy; however, a few weeks later, the child does not scream for candy.
○ We can say that a previously learned behavior becomes extinct when it is no longer reinforced.
● Q-Learning is a simple type of reinforcement learning that uses Q-values (also called action values) to iteratively improve the learning agent's behavior. Q-learning is an off-policy reinforcement learning algorithm that tries to find the best action to take given the current state.
● It is considered off-policy because the Q-function learns from actions outside the current policy, such as random actions, so the behavior used to gather experience need not be the policy being learned. In particular, Q-learning seeks to learn a policy that maximizes the total reward.
● In this case, "Q" stands for quality: how valuable a given action is in obtaining future reward. The Q-function uses the Bellman equation and takes two inputs: state (s) and action (a).
● Our goal should be to maximize the value of Q.
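Written out, the Bellman update that Q-learning applies at each step is:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]
```

where α is the learning rate, γ is the discount factor, r is the immediate reward, and s' is the next state.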
● Let’s take one example and understand the process of the Q Learning algorithm.
● The question: “How do we train a robot to reach the end goal via the shortest path, without stepping on a mine?”
● The process of Q learning algorithm-
○ Initialize Q-table
○ Choose an action
○ Perform action
○ Measure reward
○ Update the Q-table
● Initialize Q-table
○ This is the first step. We first build a Q-table with n columns and m rows.
○ Here the n columns stand for the number of actions and the m rows stand for the number of states.
○ For the robot example above, there is one column per action (up, down, left, right) and one row per state.
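As a sketch, the Q-table for a hypothetical 5×5 grid version of the robot example (the grid size is an assumption, not from the article) can be initialized with NumPy:

```python
import numpy as np

# Hypothetical sizes: a 5x5 grid gives 25 states; the robot has
# 4 actions (up, down, left, right).
n_states, n_actions = 25, 4

# Initialize the Q-table to zeros: one row per state, one column per action.
q_table = np.zeros((n_states, n_actions))
```

All entries start at zero because the agent knows nothing about the environment yet.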
● Choose and perform an action
○ This step repeats until we stop the training, or until the training loop ends as defined in the code.
○ In this step, we will choose an action(a) in the state(s) based on the Q-table.
○ Exploration means finding more information about an environment
○ Exploitation means exploiting already known information to maximize the rewards.
○ In the beginning, the epsilon rate is higher and every Q-value in the Q-table is 0. Considering the example above, the robot will initially explore the environment and choose actions randomly, because it knows nothing about the environment.
○ As the robot explores the world, the epsilon rate decreases and the robot increasingly exploits the environment instead.
○ Throughout the exploration process, the robot becomes increasingly confident in its estimates of the Q-values.
○ There are four actions to choose from for the robot example: up, down, left, and right. Our robot knows nothing about the environment at the beginning of the training. So the robot chooses a random action, say right.
○ Then we can update the Q-value for being at the start and moving right, using the Bellman equation.
○ Now that we have taken an action and observed an outcome and a reward, we need to update the function Q(s, a).
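The choose-action and update steps can be combined into one Python sketch. The grid size, learning rate, discount factor, and reward values below are illustrative assumptions, not taken from the article:

```python
import random
import numpy as np

n_states, n_actions = 25, 4            # hypothetical 5x5 grid, 4 moves
q_table = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9                # learning rate and discount factor

def choose_action(state, epsilon):
    """Epsilon-greedy: explore with probability epsilon, else exploit."""
    if random.random() < epsilon:
        return random.randrange(n_actions)   # explore: a random action
    return int(np.argmax(q_table[state]))    # exploit: best known action

def update_q(state, action, reward, next_state):
    """Bellman update: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    td_target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += alpha * (td_target - q_table[state, action])

# One illustrative step: from state 0 the robot moves right (action 3),
# receives reward -1, and lands in state 1.
update_q(state=0, action=3, reward=-1.0, next_state=1)
print(q_table[0, 3])  # -0.1, i.e. 0.1 * (-1 + 0.9 * 0 - 0)
```

Running choose_action and update_q in a loop over many episodes, while decaying epsilon, gradually fills the Q-table with useful estimates.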
Website : https://www.societyofai.in/