Policy Iteration Adaptive Dynamic Programming Algorithm for Discrete-Time Nonlinear Systems, 2014, Derong Liu, Fellow, IEEE, and Qinglai Wei, Member, IEEE
This paper is the first to analyze the convergence and stability of a policy iteration (PI) method for discrete-time nonlinear systems. Starting from an initial admissible control policy (obtained by repeated experiments), the iterative value functions are monotonically non-increasing and converge to the optimal value of the HJB equation. It is proved that every iterative control policy stabilizes the nonlinear system. Neural networks are used to approximate the value function and to compute the optimal control, and the convergence of the weight matrices is analyzed.
According to "Discrete-Time Nonlinear HJB Solution Using Approximate Dynamic Programming: Convergence Proof" (Asma Al-Tamimi, Frank L. Lewis, Murad Abu-Khalaf, IEEE Transactions on Systems, Man, and Cybernetics, Part B, 2008), value iteration (VI) starts from a zero initial value function, and its intermediate iterative control policies cannot guarantee stability of the system; only the converged control policy is guaranteed to stabilize it.
The usual VI algorithm is therefore implemented off-line, whereas the PI algorithm is close to an on-line iterative ADP scheme.
- The policy iteration algorithm searches for the optimal control policy.
- Every iterative control policy stabilizes the nonlinear system.
- Under the policy iteration algorithm, the iterative performance index (cost function) converges to the optimal value; the standard recursion is sketched below.
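For reference, here is a sketch of the standard discrete-time policy iteration recursion in the usual ADP notation ($F$ for the system dynamics, $U$ for the utility function, $v_i$ for the $i$-th control policy, $V_i$ for its value function); this is a summary of the general scheme, not a verbatim quote of the paper. Starting from an admissible policy $v_0$, for $i = 0, 1, 2, \dots$:

Policy evaluation:
$$V_i(x_k) = U\big(x_k, v_i(x_k)\big) + V_i\big(F(x_k, v_i(x_k))\big)$$

Policy improvement:
$$v_{i+1}(x_k) = \arg\min_{u_k}\Big\{ U(x_k, u_k) + V_i\big(F(x_k, u_k)\big) \Big\}$$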
To obtain the optimal control policy, the optimal value function must be obtained first, yet before the iterative control policies are introduced, $J^*(x_k)$ is unknown. The traditional DP approach faces the curse of dimensionality: the control sequence is infinite, so it is almost impossible to obtain the optimal control directly from the HJB equation.
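The discrete-time HJB equation mentioned here has the standard form (a sketch in the same notation as above, with $x_{k+1} = F(x_k, u_k)$):

$$J^*(x_k) = \min_{u_k}\Big\{ U(x_k, u_k) + J^*\big(F(x_k, u_k)\big) \Big\}$$

It is a fixed-point equation in the unknown function $J^*$, which is why a direct solution is generally intractable and iterative schemes such as VI and PI are used instead.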
As the iteration index tends to infinity, the PI algorithm converges: the iterative control policies approximate the optimal control policy, and the iterative value functions are monotonically non-increasing and converge to the optimum.
Theorem 3.1 shows that under the PI algorithm the iterative value functions are monotonically non-increasing. Corollary 3.1 then shows that, starting from any admissible control policy, all subsequent iterative control policies are admissible. Theorem 3.2 shows that the iterative value function converges to the optimal performance index; the proof has three parts: (a) the limit of the iterative performance index satisfies the HJB equation; (b) for any admissible control policy, the limiting value function satisfies $V_{\infty}(x_k) \le P(x_k)$; (c) the limiting value function equals the optimal value function. Likewise, Corollary 3.2 shows that the iterative control policies converge to the optimal control policy. The continuous-time and discrete-time PI algorithms differ: their HJB equations are different, and the continuous-time analysis relies on derivatives.
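In symbols, the two main conclusions (a compact restatement of Theorems 3.1 and 3.2, assuming the initial policy $v_0$ is admissible):

$$V_{i+1}(x_k) \le V_i(x_k) \quad \text{for all } i \ge 0, \qquad \lim_{i \to \infty} V_i(x_k) = J^*(x_k)$$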
Algorithm 1 shows how to obtain an admissible control policy: choose a positive semidefinite function and train two neural networks until they converge to a prescribed accuracy. Algorithm 2 gives the discrete-time PI algorithm. After the action network is constructed, the admissibility of its control policy is unknown until the critic network's weights have converged, which is why Algorithm 1 must be run off-line.
Theorem 3.3 gives the condition on the value function under which an admissible control is obtained.
There are two ways to obtain the iterative value function:
- define inner and outer iterations and update them separately, or
- compute it directly as a summation (see the sketch below).
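The direct-summation form is what the evaluate_policy function in the MATLAB code below implements: the infinite-horizon cost of the fixed policy $v_i$ is approximated by truncating the sum after $N$ steps ($N$ = eval_step = 400 in the code; the truncation is a numerical approximation in this implementation, not part of the paper's derivation):

$$V_i(x_k) \approx \sum_{j=0}^{N-1} U\big(x_{k+j}, v_i(x_{k+j})\big), \qquad x_{k+j+1} = F\big(x_{k+j}, v_i(x_{k+j})\big)$$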
Theorem 4.1 shows that the neural-network weights for the target iterative value function and the iterative control policy converge to their optimal values.
For an unstable system, although the optimal control policy can be obtained by either value iteration or policy iteration, under value iteration not every iterative control policy stabilizes the system, the properties of the VI iterative controls cannot be analyzed, and the scheme can only be implemented off-line. Policy iteration, by contrast, guarantees the stability of the closed-loop system.
Simulations: (1) a linear system, comparing the PI algorithm with the conventional solution of the algebraic Riccati equation; (2) a nonlinear system, comparing policy iteration with value iteration; (3) an inverted-pendulum system; (4) nonlinear satellite attitude control.
Outlook: since neural networks are used to approximate the iterative value functions and control policies, approximation errors are unavoidable; in that case neither the convergence of the iterative performance index function nor the stability of the system under the iterative control policies can be proved, and additional criteria are needed in follow-up work.
MATLAB implementation
Checking the feasibility of the PI algorithm on a linear system.
Generating state training data
Given the system matrices and the weighting matrices, generate state training data with random sampling.
%------------------------- generate training data & system information ----------------------------
clear; close all; clc;
% system matrices
A = [ 0, 0.1;...
0.3, -1 ];
B = [ 0;...
0.5 ];
state_dim = size(A,1);
control_dim = size(B,2);
% cost function parameters
Q = 1*eye(state_dim);
R = 1*eye(control_dim);
% training data: sample states at several scales around the origin (plus the origin itself)
x_train = zeros(state_dim,1);
x0 = [1;-1];
for i = 1:50
    x_train = [x_train, zeros(state_dim,1)];
    x_train = [x_train, 2*(rand(state_dim,1)-0.5)];   % states in [-1, 1]
    x_train = [x_train, 1*(rand(state_dim,1)-0.5)];   % states in [-0.5, 0.5]
    x_train = [x_train, 0.5*(rand(state_dim,1)-0.5)]; % states in [-0.25, 0.25]
end
r = randperm(size(x_train,2)); % random permutation of the column indices
x_train = x_train(:, r); % shuffle the training states
if ~exist('training_data','dir'), mkdir('training_data'); end % make sure the output folder exists
save training_data/state_data x_train state_dim control_dim A B Q R x0;
Obtaining an admissible control policy
Compare the optimal gain and Riccati solution obtained from LQR with the initial control policy produced by the trained action network, and evaluate the performance index obtained under each control policy.
%-------------------------- obtain the initial admissible control ----------------------------
clear; close all; clc;
load training_data/state_data.mat
[K, P] = dlqr(A,B,Q,100*R); % discrete-time LQR: gain K and Riccati solution P (R is inflated by 100, so this policy is stabilizing/admissible but not optimal for the original cost)
actor_target = -K*x_train; % control policy from this gain, used as the training target
cover = 1;
if isempty(dir('training_data/actor_init.mat')) == 1 || cover == 1 % train the initial admissible control if it does not exist yet (or when overwriting)
% action network
actor_init_middle_num = 15; % number of hidden-layer neurons of the action network
actor_init_epoch = 10000;
actor_init_err_goal = 1e-9;
actor_init_lr = 0.01;
actor_init = newff(minmax(x_train), [actor_init_middle_num control_dim], {'tansig' 'purelin'},'trainlm');
actor_init.trainParam.epochs = actor_init_epoch;
actor_init.trainParam.goal = actor_init_err_goal;
actor_init.trainParam.show = 10;
actor_init.trainParam.lr = actor_init_lr;
actor_init.biasConnect = [1;0];
actor_init = train(actor_init, x_train, actor_target);
save training_data/actor_init actor_init % save the generated initial admissible control
else
load training_data/actor_init
end
%-------------------------- test the initial control ----------------------------
x = x0;
x_net = x;
xx = x;
xx_net = x_net;
uu = [];
uu_net = [];
Fsamples = 200;
JK = 0; % performance index under the dlqr gain K
Jnet = 0; % performance index under the action network's initial admissible control
h = waitbar(0,'Please wait');
for k = 1:Fsamples
u = -K*x;
u_net = actor_init(x_net);
JK = JK + x'*Q*x + u'*R*u;
Jnet = Jnet + x_net'*Q*x_net + u_net'*R*u_net;
x = A*x + B*u;
xx = [xx x];
x_net = A*x_net + B*u_net;
xx_net = [xx_net x_net];
uu = [uu u];
uu_net = [uu_net u_net];
waitbar(k/Fsamples,h,['Running...',num2str(k/Fsamples*100),'%']);
end
close(h)
JK
Jnet
figure,
plot(0:Fsamples,xx,'b-',0:Fsamples,xx_net,'r--','linewidth',1)
xlabel('Time steps');
ylabel('States');
set(gca,'FontName','Times New Roman','FontSize',14,'linewidth',1);
grid on;
figure,
plot(0:Fsamples-1,uu,'b-',0:Fsamples-1,uu_net,'r--','linewidth',1)
xlabel('Time steps');
ylabel('Control');
set(gca,'FontName','Times New Roman','FontSize',14,'linewidth',1);
grid on;
PI algorithm implementation
Load the training data, build the action and critic networks (for policy improvement and policy evaluation, respectively), and update the value function and the control policy.
function pi_algorithm
% This demo checks the feasibility of the policy iteration adaptive dynamic
% programming algorithm
%-------------------------------- start -----------------------------------
clear; close all; clc;
% information of system & cost function
global A; global B; global Q; global R;
load training_data/state_data.mat;
load training_data/actor_init.mat; % load the pre-generated state data and the initial admissible control policy
% action network
actor = actor_init;
actor_epoch = 20000;
actor_err_goal = 1e-9;
actor_lr = 0.01;
actor.trainParam.epochs = actor_epoch;
actor.trainParam.goal = actor_err_goal;
actor.trainParam.show = 10;
actor.trainParam.lr = actor_lr;
% critic network
critic_middle_num = 15;
critic_epoch = 10000;
critic_err_goal = 1e-9;
critic_lr = 0.01;
critic = newff(minmax(x_train), [critic_middle_num 1], {'tansig' 'purelin'},'trainlm'); % build the critic network
critic.trainParam.epochs = critic_epoch;
critic.trainParam.goal = critic_err_goal; % stop training when this error goal is reached
critic.trainParam.show = 10; % show training progress every 10 epochs
critic.trainParam.lr = critic_lr;
critic.biasConnect = [1;0];
epoch = 10;
eval_step = 400;
performance_index = ones(1,epoch + 1);
figure(1),hold on;
h = waitbar(0,'Please wait');
for i = 1:epoch
% update critic
% evaluate policy
critic_target = evaluate_policy(actor, x_train, eval_step);
critic = train(critic,x_train,critic_target);
performance_index(i) = critic(x0);
figure(1),plot(i,performance_index(i),'*'),xlim([1 epoch]),hold on;
waitbar(i/epoch,h,['Training controller...',num2str(i/epoch*100),'%']);
if i == epoch
break;
end
% update actor
actor_target = zeros(control_dim,size(x_train,2));
for j = 1:size(x_train,2)
x = x_train(:,j);
if x == zeros(state_dim,1)
ud = zeros(control_dim,1);
else
objective = @(u) cost_function(x,u) + critic(controlled_system(x,u)); % right-hand side of the policy improvement step
u0 = actor(x); % warm-start from the current policy
ud = fminunc(objective, u0); % numerical minimization over u
end
actor_target(:,j) = ud;
end
actor = train(actor, x_train, actor_target);
end
close(h)
figure(1),
xlabel('Iterations');
ylabel('$V(x_0)$','Interpreter','latex');
set(gca,'FontName','Times New Roman','FontSize',14,'linewidth',1);
grid on;
hold off;
if ~exist('training_results','dir'), mkdir('training_results'); end % make sure the output folder exists
save training_results/actor_critic actor critic
end
%---------------------------- evaluate policy -----------------------------
function y = evaluate_policy(actor,x,eval_step)
critic_target = zeros(1,size(x,2));
for k = 1:eval_step
uep = actor(x); % action-network output under the current policy
critic_target = critic_target + cost_function(x,uep); % accumulate the stage cost along the trajectory
x = controlled_system(x,uep); % propagate the controlled system one step
end
y = critic_target;
end
%--------------------------- output of the controlled system ----------------------------
function y = controlled_system(x,u)
% system matrices
global A; global B;
y = A*x + B*u; % element-wise operations (.*) should be used for nonlinear systems
end
%----------------------------- cost function ------------------------------
function y = cost_function(x,u)
global Q; global R;
y = (diag(x'*Q*x) + diag(u'*R*u))'; % batch quadratic utility: one stage cost per column of x and u
end
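For the linear-quadratic test case used here, the policy improvement step has a closed form, which is a useful sanity check on the fminunc-based minimization above (a sketch under the assumption that the exact value function of policy $v_i$ is quadratic, $V_i(x) = x^{T} P_i x$; this is the classical LQR policy iteration, not code from the post):

$$v_{i+1}(x_k) = -\big(R + B^{T} P_i B\big)^{-1} B^{T} P_i A\, x_k$$

So as the critic converges, the action network's output should approach the dlqr gain $-K_{\mathrm{opt}} x_k$ that the validation script below compares against.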
Validation of the PI algorithm
Load the pre-generated state data and the trained critic and action networks.
Compare with the optimal control policy and the Riccati solution computed by LQR (the exact solution of the HJB equation in the linear case).
Substitute the converged approximate optimal value function and control policy into the performance index.
clear; close all; clc;
load training_data/state_data.mat;
load training_results/actor_critic.mat
[Kopt, Popt] = dlqr(A,B,Q,R);
x = x0;
x_net = x;
xx = x;
xx_net = x_net;
uu_opt = [];
uu_net = [];
Jreal = 0;
Fsamples = 50;
h = waitbar(0,'Please wait');
for k = 1:Fsamples
uopt = -Kopt*x;
x = A*x + B*(uopt);
xx = [xx x];
u_net = sim(actor,x_net);
Jreal = Jreal + x_net'*Q*x_net + u_net'*R*u_net;
x_net = A*x_net + B*u_net;
xx_net = [xx_net x_net];
uu_opt = [uu_opt uopt];
uu_net = [uu_net u_net];
waitbar(k/Fsamples,h,['Running...',num2str(k/Fsamples*100),'%']);
end
close(h)
Jopt = x0'*Popt*x0
Jnet = critic(x0)
Jreal
figure(1),
plot(0:Fsamples,xx,'b-',0:Fsamples,xx_net,'r--','linewidth',1)
legend('Optimal ','NN','Interpreter','latex');
xlabel('Time steps');
ylabel('States');
set(gca,'FontName','Times New Roman','FontSize',14,'linewidth',1);
grid on;
figure(2),
plot(0:Fsamples-1,uu_opt,'b-',0:Fsamples-1,uu_net,'r--','linewidth',1)
legend('Optimal ','NN','Interpreter','latex');
xlabel('Time steps');
ylabel('Control');
set(gca,'FontName','Times New Roman','FontSize',14,'linewidth',1);
grid on;
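A note on running the four code blocks above in order (the step names are my own labels; they are not file names from the original post):

% Suggested run order:
%   1) data-generation block     -> saves training_data/state_data.mat
%   2) initial-control block     -> saves training_data/actor_init.mat
%   3) pi_algorithm              -> saves training_results/actor_critic.mat
%   4) validation block          -> compares the learned controller with dlqr
% The training_data and training_results folders are created with mkdir
% before saving when they are missing.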