Concepts of Reward and Punishment in AI Robotics
Opening your home to a robot can evoke a number of scenarios and dynamics: is the robot like a friend, a companion, a servant, a pet? It doesn’t have the interpersonal skills of a fellow human, and we can’t bribe it with treats the way we do our pets. How do we effectively communicate with it, in ways it will understand, in order to tailor the co-living experience to our ideal preferences? The answer lies in the fascinating topic of intangible, digital forms of reward and punishment for AI robotics.
As robots don’t eat or experience discomfort, these rewards aren’t pieces of sliced cheese or toy-free timeouts of the kind that work with our dogs; they are instead intricate digital feedback mechanisms that influence the robot’s learning curve and behavior. Open your mind and welcome to the wonderful and wacky world of domestic robotics training!
In the context of reinforcement learning, the robot attempts to perfect tasks and skills. These tasks could be helpful household actions, like retrieving the correct beverage from the kitchen and delivering it to you in another room, or something fun and frivolous, like perfecting a dance move. Before the robot practices the action, you explain to it that it will receive a numerical value, akin to a grade, for each attempt based on how it performs. Its goal is to maximize its total reward, which means gathering as many high values as possible while avoiding low ones. As you observe each attempt, you determine a “score” to bestow upon it: if the attempt fails or encounters issues, the robot receives a negative value in return; if the attempt is completed or an improvement is made, it receives a positive value. This creates a paradigm in which the numerical value is either a reward or a penalty, and the robot has been instructed to gain the most points possible. It allows you to encourage the behavior you would like to see through a synthetic, numbers-based reward environment.
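To make the grading scheme concrete, here is a minimal Python sketch of the scoring loop described above. The specific score values and the success/improvement criteria are illustrative assumptions on my part, not any particular robot's API:

```python
def grade_attempt(success: bool, improved: bool) -> int:
    """Map an observed attempt to a numeric reward or penalty.
    The values 10, 3, and -5 are arbitrary illustrative choices."""
    if success:
        return 10   # task completed: full reward
    if improved:
        return 3    # partial progress still earns points
    return -5       # failed attempt: penalty

class AttemptLog:
    """Tracks the robot's running score across a training session."""
    def __init__(self):
        self.total = 0
        self.history = []

    def record(self, attempt_id: int, success: bool, improved: bool) -> int:
        score = grade_attempt(success, improved)
        self.total += score
        self.history.append((attempt_id, score))
        return score

# A short training session: one failure, one improvement, one success.
log = AttemptLog()
log.record(1, success=False, improved=False)  # -5
log.record(2, success=False, improved=True)   # +3
log.record(3, success=True, improved=False)   # +10
```

Because the robot is told to maximize `log.total`, it is steered toward repeating whatever produced the positive scores.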
An alternate punishment and reward matrix is the adjustment of system privileges. The human working with the robot can alter the robot’s operational freedom based on its performance when learning new tasks. If the robot is consistently meeting its learning goals, the human temporarily enables it to perform more advanced tasks or grants it more autonomous decision-making. For Tova, this could be the transition from remote operation to teleoperation to full autonomous mode. It could also be a switch out of a low-power or battery-saving mode, which may involve reduced sensor use such as lower-resolution or narrow-angle camera operation or reduced (or disabled) LiDAR scanning, into a richer sensory experience, allowing the robot to equate accurate and efficient skill acquisition with increased sensory awareness. This is a form of reward the robot can work towards and understand as a positive result of consistency and improvement. Alternatively, if the robot repeatedly fails to follow protocols or complete tasks, the operator can restrict certain non-critical functionalities, such as its speed of movement or certain sensors.
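One way to picture this is as a ladder of privilege tiers that the robot climbs through consistent success and slides down after failures. In the sketch below, the tier names (loosely modeled on the Tova modes mentioned above), the promote-after-three-successes rule, and the demote-on-failure policy are all hypothetical design choices for illustration:

```python
# Ordered from most restricted to most autonomous.
TIERS = ["restricted", "remote_operation", "teleoperation", "full_autonomy"]

class PrivilegeManager:
    """Promotes the robot after a streak of met goals; demotes on failure."""
    def __init__(self, streak_to_promote: int = 3):
        self.tier = 1          # start at remote_operation
        self.streak = 0
        self.streak_to_promote = streak_to_promote

    def report(self, goal_met: bool) -> str:
        if goal_met:
            self.streak += 1
            if self.streak >= self.streak_to_promote and self.tier < len(TIERS) - 1:
                self.tier += 1  # reward: unlock the next tier
                self.streak = 0
        else:
            self.streak = 0
            if self.tier > 0:
                self.tier -= 1  # penalty: restrict functionality
        return TIERS[self.tier]
```

Requiring a streak for promotion but demoting on a single failure makes privileges easy to lose and slow to regain, which mirrors the "consistency and improvement" framing above; a gentler policy could demote only after repeated failures.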
Lastly, there are visual light cues and audio alerts that can act as tokens associated with positive improvement or negative stagnation. Taking advantage of the robot’s many sensors, cameras, and microphones, the human can explain to the robot prior to the skill-acquisition training that a flash of blue light means it completed the task properly, while a flash of red light means it did not. The human can then assess the progress of the skill throughout the training session and flash the corresponding colored light at the robot’s sensors to ensure it understands where it stands in the learning process. The same applies to sonic input: the sound of a bell or chime can be associated with positive progress and the satisfactory completion of a task, while the sound of a buzzer can be associated with an execution that needs improvement. These cues can be “heard” by the robot’s microphone array and affiliated with a data tag in the robot’s logs, ensuring the robot knows which tasks have been completed in a satisfactory manner and which have not, and giving it a clear goal to aim for in its training.
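In code, the cue-to-log-tag mapping might look like the sketch below. The cue names, the tag labels, and the flat dictionary are my own assumptions; a real robot would first need to classify camera frames and microphone audio into these cue labels before any tagging happens:

```python
# Mapping from detected light/sound cues to log tags.
# Cue names and tags are illustrative, not a real robot's schema.
CUE_TAGS = {
    "blue_light": "task_complete",
    "red_light": "task_failed",
    "chime": "task_complete",
    "buzzer": "needs_improvement",
}

def tag_task(task_id: str, cue: str, log: list) -> str:
    """Attach the tag for a detected cue to the robot's task log."""
    tag = CUE_TAGS.get(cue, "unknown_cue")
    log.append({"task": task_id, "tag": tag})
    return tag

task_log = []
tag_task("fetch_drink", "blue_light", task_log)  # tagged task_complete
tag_task("dance_move", "buzzer", task_log)       # tagged needs_improvement
```

Keeping the light and sound channels in one table means a chime and a blue flash land in the logs as the same positive signal, so the human can use whichever cue is more convenient in the moment.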
Although not discussed on other platforms, it is my goal to pioneer an experiment with NFC tags given to the robot like treats, each encoded with a message of affirmation and praise for a job well done. Given the robot’s ability to read NFC tags, I believe this will be one of the simplest ways for others to also give praise to the robot for positive behavior and successfully completed tasks. It is one of the most easily translatable and intuitive methods of praise because of its similarity to “treats” for household pets. It is also widely applicable to many dynamics and situations, rather than only to a skill-acquisition training session where the approval matrix has to be predetermined and set forth in advance, as in the case of the robot being instructed that its goal is to gain the most points and receiving numerical “grades” for its attempts. I aim to encode a pack of 25 NFC tags with positive messages and to leave them in a bowl in my living room for guests to try bestowing upon Tova.
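As a rough sketch of how the tag payloads might be handled once read, the example below decodes a praise message and adds it to a running praise log. The `PRAISE:` byte prefix and the payload format are a hypothetical convention of my own; real tags would more likely carry an NDEF text record read through an NFC library, and the hardware read itself is stubbed out here:

```python
# Hypothetical convention: praise tags carry "PRAISE:" + a UTF-8 message.
PRAISE_PREFIX = b"PRAISE:"

def decode_praise(payload: bytes):
    """Return the affirmation text if the tag carries a praise record,
    otherwise None."""
    if payload.startswith(PRAISE_PREFIX):
        return payload[len(PRAISE_PREFIX):].decode("utf-8")
    return None

praise_log = []

def receive_tag(payload: bytes):
    """Called when the robot reads a tag; credits any praise found."""
    message = decode_praise(payload)
    if message is not None:
        praise_log.append(message)
    return message

# A guest taps an encoded tag against the robot's reader.
receive_tag(b"PRAISE:Great job fetching the water, Tova!")
```

Because decoding is just a prefix check, the 25 tags in the bowl can each carry a different message while the robot treats every one of them as the same class of positive event.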
These four approaches to robotics reward and punishment are potential blueprints for the work ahead of us in training our domestic embodied intelligences to adapt to our homes and lives. Without the typical rewards or punishments available for training these new roommates, the proposed systems offer adaptable, immediate feedback processes that will help our robots better understand what we want from them throughout the training process.