1. INTRODUCTION

In recent years, reinforcement learning (RL) has played a major role in artificial intelligence, from AlphaGo defeating the world's top Go players to robots imitating and learning human actions and expressions. In reinforcement learning, an agent interacts with its environment and updates its policy according to a known reward function, so that it learns to take the best actions and maximize the cumulative reward [1]. However, reinforcement learning is not applicable to complex scenarios where the reward function is difficult to specify. For example, driving a vehicle from a starting point to a destination in a complex urban environment must account for traffic rules, passenger comfort, and the efficiency of reaching the destination, which makes it difficult to build a suitable reward function [2]. An experienced driver, on the other hand, can weigh all of these factors. We can therefore learn from drivers' reactions in different driving environments to help machines understand the reward. Inverse reinforcement learning (IRL) reverses the process of reinforcement learning: given optimal strategies or expert demonstrations, it seeks a reward function that explains those strategies. The RL and IRL models are shown in Figure 1, where the dynamic model represents the distribution of the next state given the action taken in the current state, the optimal strategy represents the action taken in the current state that maximizes the reward function, and the reward function is the mapping from a state-action pair to a reward. Inverse reinforcement learning, also known as inverse optimal control, appeared about 20 years ago and has developed rapidly in the past decade with the rise of deep learning. On the one hand, IRL can be applied to decision-making and planning to reduce manual tuning in tasks.
For example, in multi-objective optimization problems, the relative importance of each task can be determined through IRL. On the other hand, by combining IRL and RL, robots can imitate human actions or decisions: imitation learning uses demonstrations to infer the hidden reward function, and then uses the inferred reward function in reinforcement learning to learn an imitation policy [3-4]. Applications of inverse reinforcement learning are shown in Figure 2. Fig (a) shows a robot arm pushing a cup onto a coaster, where the initial position of the cup is random. Fig (b) shows a robot arm inserting a book into an empty slot on a bookshelf. Fig (c) shows a planning task in autonomous driving. Inverse reinforcement learning is a relatively young subject in machine learning, so it faces many challenges. First, because the number of expert examples is limited, a single strategy can correspond to many reward functions, and this ambiguity hinders accurate inference. Second, practical inverse reinforcement learning should generalize well: the model must infer unobserved state-action pairs, and its application cannot be limited to specific scenarios, which makes it challenging to extend a small amount of data correctly to the unobserved space. Third, because environments contain a large amount of noisy data, inaccurate prior information can enter the feature equations of IRL, so the model should be robust while remaining accurate. Finally, IRL must consider space and time complexity. Due to the complexity of the task, the state and action spaces are generally high-dimensional vectors after discretization, and the computation grows exponentially with the state vector, which consumes a great deal of computing time and space [5]. This paper introduces the current state and challenges of inverse reinforcement learning.
The paper is organized as follows: section 1 lists the significance and open problems of IRL; section 2 introduces the basic principles and models of IRL; section 3 introduces two basic IRL methods; section 4 applies an IRL model to a vehicle planning task; and section 5 puts forward possible future directions.

2. BASIC PRINCIPLE

The Markov decision process (MDP) is the model foundation for reinforcement learning, multi-task decision making, and the interaction between agents and the environment. The Markov property assumes that the current state depends only on the state at the previous time step and is independent of earlier states. Solving an MDP means finding an optimal policy that maximizes the expected reward [6]. We define S as a finite set of states and A as a set of actions. A policy is a mapping from the current state to the action at the next moment, defined as π: S → A. The reward function R(s, a) is a scalar return obtained by taking action a in the current state s. The Markov decision process model is shown in Figure 3. Inverse reinforcement learning assumes that expert behavior follows some policy; if that policy is unknown, it can be estimated by observing the state-action pairs of expert behavior. Reward functions are generally defined as linear combinations of reward features, where each feature ϕk is a function of state s and action a with a corresponding weight wk. The IRL algorithm flow is shown in Table 1.

Table 1. IRL algorithm flow [7]

3. METHODS

Current mainstream IRL methods can be divided into margin-based optimization and entropy-based optimization according to how they treat the reward function. The milestone breakthrough among them is Ziebart's maximum entropy optimization method, which resolves the ambiguity of IRL.
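Before turning to the specific methods, the generic IRL loop of Table 1 can be sketched in code: alternately solve the forward RL problem under the current reward guess, then update the linear reward weights so the learner's feature expectations approach the expert's. The toy chain MDP, one-hot features, and gradient-style update below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Toy 1-D chain MDP with a linear reward R(s) = w . phi(s) (one-hot features).
N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9
PHI = np.eye(N_STATES)

def transition(s, a):
    """Deterministic chain: action 0 moves left, action 1 moves right."""
    return max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)

def value_iteration(w, n_iter=100):
    """Inner RL step: greedy policy under the current reward weights."""
    r = PHI @ w
    v = np.zeros(N_STATES)
    for _ in range(n_iter):
        v = np.array([max(r[s] + GAMMA * v[transition(s, a)]
                          for a in range(N_ACTIONS)) for s in range(N_STATES)])
    return np.array([max(range(N_ACTIONS),
                         key=lambda a: r[s] + GAMMA * v[transition(s, a)])
                     for s in range(N_STATES)])

def feature_expectations(policy, start=0, horizon=20):
    """Discounted feature counts of rolling the policy out from `start`."""
    mu, s = np.zeros(N_STATES), start
    for t in range(horizon):
        mu += GAMMA ** t * PHI[s]
        s = transition(s, policy[s])
    return mu

# "Expert" demonstration policy: always move right (prefers the last state).
expert_policy = np.ones(N_STATES, dtype=int)
mu_expert = feature_expectations(expert_policy)

# IRL loop (Table 1): fit w so the learner's feature expectations match.
w = np.zeros(N_STATES)
for _ in range(30):
    mu_learner = feature_expectations(value_iteration(w))
    w += 0.1 * (mu_expert - mu_learner)   # simple gradient-style update

recovered = value_iteration(w)  # should reproduce the expert's behavior
```

At convergence the learned reward makes the rightmost state attractive, so the recovered greedy policy matches the demonstrated one even though only state visits, not the true reward, were observed.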
3.1 Margin optimization method

The margin-based optimization method seeks a reward function under which the example (expert) strategy scores better than the learned strategy, i.e., the margin between the two strategies is maximized [8]. Convergence is declared when the change in the margin falls below a set threshold. For any trajectory τ, the margin formula is as follows. Margin-based optimization resolves the ambiguity of the IRL solution by converging to the maximum margin. Later, Ratliff proposed an improved method, learn to search (LEARCH) [9], which converts the quadratic programming problem into an optimization problem solved via the Hessian matrix and gradient, making high-dimensional continuous-time problems tractable.

3.2 Entropy optimization method

IRL suffers from ambiguity: more than one reward function can explain the expert's behavior, so the maximum margin method introduces a random bias. To eliminate this bias, maximum entropy optimization (MaxEnt) obtains the behavior distribution through the maximum entropy principle. Specifically, this method treats the set of trajectories as a probability distribution and only needs to solve for the probability model that generates the expert trajectory distribution [10]. Since the maximum entropy distribution makes no assumptions about unknown information beyond the constraints, this method avoids ambiguity, and the distribution is parameterized by the weights of the reward function. The MaxEnt method solves for the model with maximum entropy, which is essentially a nonlinear convex optimization problem. The maximum entropy of the policy distribution [11] is as follows, where D represents the space of all distributions. For probability distributions P and Q, the improved relative entropy model [12] is as follows.
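The MaxEnt principle can be illustrated on a problem small enough to enumerate every trajectory: under the maximum entropy model, P(τ) ∝ exp(wᵀf(τ)), and the weights are fit by gradient ascent on the log-likelihood of the expert trajectory, where the gradient is the expert's feature counts minus the expected feature counts under the current distribution. The 3-state chain and visit-count features below are illustrative assumptions.

```python
import numpy as np
from itertools import product

# Tiny chain environment: 3 states, actions move left (-1) or right (+1).
N_STATES, HORIZON = 3, 3

def step(s, a):
    return min(N_STATES - 1, max(0, s + a))

def features(traj):
    """Feature counts of a trajectory: number of visits to each state."""
    f = np.zeros(N_STATES)
    for s in traj:
        f[s] += 1
    return f

# Enumerate all trajectories of HORIZON actions starting in state 0.
trajs = []
for actions in product((-1, 1), repeat=HORIZON):
    s, traj = 0, [0]
    for a in actions:
        s = step(s, a)
        traj.append(s)
    trajs.append(traj)
F = np.array([features(t) for t in trajs])

# "Expert" demonstration: always move right (prefers the last state).
f_expert = features([0, 1, 2, 2])

# Gradient ascent with the MaxEnt IRL gradient: f_expert - E_{P_w}[f].
w = np.zeros(N_STATES)
for _ in range(200):
    logits = F @ w
    p = np.exp(logits - logits.max())
    p /= p.sum()
    w += 0.1 * (f_expert - p @ F)

p_final = np.exp(F @ w - (F @ w).max())
p_final /= p_final.sum()
best = trajs[int(np.argmax(p_final))]  # most likely trajectory under learned w
```

Because the distribution is constrained only by feature matching, probability mass concentrates on the demonstrated trajectory without committing to any single reward function, which is exactly how MaxEnt sidesteps the ambiguity problem.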
4. EXPERIMENT AND ANALYSIS

In this paper, IRL based on maximum entropy optimization is used to improve the trajectory prediction metric in unmanned-driving tasks. Traditional metrics, such as average or final displacement error (ADE/FDE), do not consider the downstream planning task. Therefore, IRL is used to improve the metric, and the linear combination shown in formula (1) is constructed from factors covering vehicle collision, a control term, and the distance to the target point [13], where s_t is the position (x, y) of the ego-vehicle at time t and S_a is the obstacles' positions; the predicted position of the ego-vehicle at time t + 1 is used analogously, and the same applies to the obstacles; u is a control variable, including heading and velocity control. The experiment is built on the gym-collision-avoidance simulation environment and optimized with limited-memory BFGS (L-BFGS). The time step of the model is 0.1 s, and each trajectory has 50 time steps. With the reward features of the above formula, the learned weights are given in Table 2, where the slack variable is the minimum value the optimizer can reach at convergence; one variant uses only velocity control and another uses only heading control.

Table 2. Learned weights and slack variables for different reward features.
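The weight fit in this experiment can be sketched as minimizing the MaxEnt negative log-likelihood (NLL) of the demonstrated trajectory over a set of candidate trajectories with L-BFGS. The synthetic feature matrix and the three feature names (collision risk, control effort, distance to goal) below are illustrative assumptions, not the paper's data.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# 50 candidate trajectories x 3 cost features:
# [collision risk, control effort, distance to goal] (synthetic values).
F = rng.uniform(0.0, 1.0, size=(50, 3))

# Pretend the "expert" chose the trajectory that is safest and closest
# to the goal under some hidden trade-off.
expert_idx = int(np.argmin(F @ np.array([1.0, 0.2, 1.0])))

def nll(w):
    """MaxEnt NLL of the expert trajectory: lower feature cost = higher reward."""
    logits = -F @ w
    logz = np.log(np.exp(logits - logits.max()).sum()) + logits.max()
    return -(logits[expert_idx] - logz)

# L-BFGS weight fit, as in the experiment's optimizer choice.
res = minimize(nll, x0=np.zeros(3), method="L-BFGS-B")
w_learned = res.x  # learned trade-off between the three reward features
```

The learned weights play the role of the entries in Table 2: they quantify how strongly each factor (collision, control, goal distance) must be penalized for the demonstrated trajectory to be the most likely one.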
By training with the first reward function, the convergence results of the model are shown in Figure 4. The left panel is the curve of the slack variable; as the number of iterations increases, it gradually converges toward the best point. The right panel shows the training loss, the negative log-likelihood (NLL). The weight curves of the four reward features are shown in Figure 5; after about 20 iterations, the weights gradually converge to fixed values.

5. SUMMARY AND FUTURE DIRECTION

To address the ambiguity of IRL, the entropy optimization method models the uncertainty of the reward as a probability distribution over trajectories. Besides the two methods above, IRL algorithms also include models based on Bayesian updating, which can be divided, according to the model of the observation likelihood, into Boltzmann distribution (BIRL), Gaussian process (GPIRL) [14], and maximum likelihood estimation approaches. At present, scholars are exploring new solutions for practical applications. When trajectories are too long and the dimensionality too large, suboptimal or locally optimal algorithms become a solution. For example, in a task of driving from city A to city B, mature drivers handle every intersection well, but optimizing the route as a whole remains difficult. In the future, direct and indirect learning algorithms may be studied, and the complexity and accuracy of IRL algorithms analyzed for high-dimensional state spaces, so as to better apply them in practice.

REFERENCES

[1] Ibarz J., Tan J. and Finn C.,
“How to train your robot with deep reinforcement learning: lessons we have learned,” The International Journal of Robotics Research, 698–721 (2021). https://doi.org/10.1177/0278364920987859
[2] Bachute M. R. and Subhedar J. M., “Autonomous driving architectures: insights of machine learning and deep learning algorithms,” Machine Learning with Applications, 6, 100164 (2021). https://doi.org/10.1016/j.mlwa.2021.100164
[3] Shani L., Zahavy T. and Mannor S., “Online apprenticeship learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, 8240–8248 (2022).
[4] Ravichandar H., “Recent advances in robot learning from demonstration,” Annual Review of Control, Robotics, and Autonomous Systems, 297–330 (2020). https://doi.org/10.1146/control.2020.3.issue-1
[5] Saurabh A. and Prashant D., “A survey of inverse reinforcement learning: Challenges, methods and progress,” Artificial Intelligence, 297 (2021).
[6] Zhou D., Gu Q. and Szepesvari C., “Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes,” in Conference on Learning Theory, 4532–4576 (2021).
[7] Arora S. and Doshi P., “A survey of inverse reinforcement learning: Challenges, methods and progress,” Artificial Intelligence, 297, 103500 (2021). https://doi.org/10.1016/j.artint.2021.103500
[8] Bashir E. and Luštrek M., “Inverse Reinforcement Learning Through Max-Margin Algorithm,” in Intelligent Environments 2021: Workshop Proceedings of the 17th International Conference on Intelligent Environments, 109 (2021).
[9] Bashir E. and Luštrek M., “Inverse Reinforcement Learning Through Max-Margin Algorithm,” in Intelligent Environments 2021: Workshop Proceedings of the 17th International Conference on Intelligent Environments, 190 (2021).
[10] Bogert K., Gui Y. and Doshi P., “IRL with Partial Observations using the Principle of Uncertain Maximum Entropy,” arXiv preprint arXiv:2208.06988 (2022).
[11] Mehr N., Wang M. and Schwager M., “Maximum-Entropy Multi-Agent Dynamic Games: Forward and Inverse Solutions,” arXiv preprint arXiv:2110.01027 (2021).
[12] Qu B., Zhao M. and Feng J., “ASRL: An Adaptive GPS Sampling Method Using Deep Reinforcement Learning,” in 2022 23rd IEEE International Conference on Mobile Data Management (MDM), 153–158 (2022).
[13] Boris I. and Pavone M., “Rethinking trajectory forecasting evaluation,” arXiv preprint arXiv:2107.10297 (2021).
[14] Nasernejad P., Sayed T. and Alsaleh R., “Modeling pedestrian behavior in pedestrian-vehicle near misses: A continuous Gaussian Process Inverse Reinforcement Learning (GP-IRL) approach,” Accident Analysis & Prevention, 161, 106355 (2021). https://doi.org/10.1016/j.aap.2021.106355