Lecture 13 (Extra Material): Q-Learning

10 months ago · Author: zzz_qing · Category: Toy Blog


Contents

Introduction of Q-Learning

Tips of Q-Learning

Double DQN

Dueling DQN

Prioritized Replay

Multi-step

Noisy Net

Distributional Q-function

Rainbow

Q-Learning for Continuous Actions

Introduction of Q-Learning

Critic: The output values of a critic depend on the actor evaluated.


How to estimate V𝝿(s)? There are two approaches:

① Monte-Carlo(MC) based approach

The critic watches 𝝿 playing the game.


② Temporal-difference (TD) approach


MC vs. TD: TD is the more common choice; MC is used less often.


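MC and TD may well produce different estimates of V𝝿(s). The two methods make different assumptions, so they end up with different results. The lecture illustrates this with an example.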
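A minimal Python sketch of how the two estimators can disagree. The eight episodes below are an assumed toy dataset chosen for illustration (the lecture's own figure is not reproduced here), and `mc_estimate`, `td_estimate`, the step size, and the number of sweeps are hypothetical names and settings.

```python
# Each episode is a list of (state, reward) steps. Assumed toy data.
episodes = (
    [[("a", 0), ("b", 0)]]   # one episode: s_a (r=0) -> s_b (r=0) -> end
    + [[("b", 1)]] * 6       # six episodes: s_b (r=1) -> end
    + [[("b", 0)]]           # one episode: s_b (r=0) -> end
)

def mc_estimate(episodes):
    """Monte-Carlo: average the cumulative reward observed after each visit."""
    returns = {}
    for ep in episodes:
        for t, (s, _) in enumerate(ep):
            g = sum(r for _, r in ep[t:])
            returns.setdefault(s, []).append(g)
    return {s: sum(g) / len(g) for s, g in returns.items()}

def td_estimate(episodes, sweeps=3000, lr=0.01):
    """TD(0): move V(s_t) toward r_t + V(s_{t+1}), with V(end) = 0."""
    v = {s: 0.0 for ep in episodes for s, _ in ep}
    for _ in range(sweeps):
        for ep in episodes:
            for t, (s, r) in enumerate(ep):
                v_next = v[ep[t + 1][0]] if t + 1 < len(ep) else 0.0
                v[s] += lr * (r + v_next - v[s])
    return v

mc = mc_estimate(episodes)
td = td_estimate(episodes)
```

On this data both methods put V(s_b) near 0.75. MC gives V(s_a) = 0, because the only episode that visited s_a had total reward 0. TD gives V(s_a) close to V(s_b), because it assumes V(s_a) = r + V(s_b).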


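Another Critic: the state-action value function Q𝝿(s, a), the expected cumulative reward after taking action a in state s and then following 𝝿.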


On the surface, learning a Q function only lets us evaluate how good a particular actor 𝝿 is. In fact, once we have a Q function, we can do reinforcement learning.


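Q-Learning: given Q𝝿, a better actor is 𝝿'(s) = arg max_a Q𝝿(s, a). 𝝿' has no parameters of its own; it is determined entirely by Q.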
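As a sketch of how a policy can be read off a Q function, the snippet below picks the highest-valued action in each state. The function name `greedy_policy` and the toy Q table are hypothetical, and a table stands in for the Q network.

```python
def greedy_policy(q_table):
    """Return, for each state, the action with the largest Q value."""
    return {s: max(actions, key=actions.get) for s, actions in q_table.items()}

# Assumed toy Q table: state -> {action: Q value}.
q_table = {
    "s0": {"left": 0.1, "right": 0.9},
    "s1": {"left": 0.4, "right": 0.2},
}
policy = greedy_policy(q_table)
```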


Three tips commonly used in Q-Learning:

① Target Network


② Exploration


There are two ways to do exploration: epsilon-greedy and Boltzmann exploration.
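A minimal sketch of two common exploration rules, epsilon-greedy and Boltzmann (softmax) exploration. The function names, the `temperature` parameter, and the `rng` argument are illustrative assumptions, not part of the lecture.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann(q_values, temperature=1.0, rng=random):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    return rng.choices(range(len(q_values)), weights=weights, k=1)[0]
```

Epsilon is usually decayed over training, so the agent explores a lot early on and less later.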


③ Replay Buffer

The experience in the buffer comes from different policies. Drop the old experience if the buffer is full.
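A replay buffer with this drop-oldest behavior can be sketched with a bounded deque. The class name `ReplayBuffer` and the transition layout `(s, a, r, s_next, done)` are assumptions made for illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        # A deque with maxlen drops the oldest item when a new one is
        # appended to a full buffer.
        self.data = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.data.append((s, a, r, s_next, done))

    def sample(self, batch_size, rng=random):
        """Draw a uniform random batch without replacement."""
        return rng.sample(list(self.data), min(batch_size, len(self.data)))

    def __len__(self):
        return len(self.data)
```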


Typical Q-Learning Algorithm:
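The loop below is a rough tabular sketch that combines the three tips: a target table, epsilon-greedy exploration, and a replay buffer. It is not the lecture's exact algorithm. The 5-state chain environment, the discount `gamma`, the learning rate, and all names are assumptions for illustration, and a table replaces the Q network.

```python
import random
from collections import deque

N_STATES, ACTIONS, GOAL = 5, (0, 1), 4  # action 0 = left, 1 = right

def step(s, a):
    """Toy chain: reaching the right end gives reward 1 and ends the episode."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    done = s_next == GOAL
    return s_next, (1.0 if done else 0.0), done

def train(episodes=300, gamma=0.9, lr=0.5, epsilon=0.2, sync_every=20, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    q_target = [row[:] for row in q]          # target table, initially a copy of q
    buffer = deque(maxlen=500)                # replay buffer, drops old experience
    updates = 0
    for _ in range(episodes):
        s = 0
        for _ in range(50):
            # Epsilon-greedy action, with a random choice on ties.
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda x: q[s][x])
            s_next, r, done = step(s, a)
            buffer.append((s, a, r, s_next, done))
            # Sample a batch and regress Q(s, a) toward the fixed target.
            batch = rng.sample(list(buffer), min(8, len(buffer)))
            for bs, ba, br, bs_next, bdone in batch:
                y = br if bdone else br + gamma * max(q_target[bs_next])
                q[bs][ba] += lr * (y - q[bs][ba])
            updates += 1
            if updates % sync_every == 0:     # periodically reset target = Q
                q_target = [row[:] for row in q]
            s = s_next
            if done:
                break
    return q

q = train()
```

After training, the greedy action in every non-terminal state of this toy chain should be "right".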


Tips of Q-Learning

Some tips for training Q-Learning.

Double DQN

Q value is usually over-estimated.


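Double DQN addresses the problem that the target r_t + max_a Q(s_{t+1}, a) is always too large.

In Double DQN, the Q function that selects the action and the Q function that computes the value are not the same one.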
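A small sketch contrasting the two targets. Here the online Q values choose the action and the target network's Q values evaluate it. The function names, the discount `gamma`, and the example numbers are assumptions for illustration.

```python
def dqn_target(r, q_target_next, gamma=0.99):
    """Standard DQN: the same Q function both selects and evaluates the action."""
    return r + gamma * max(q_target_next)

def double_dqn_target(r, q_online_next, q_target_next, gamma=0.99):
    """Double DQN: the online Q selects the action, the target Q evaluates it."""
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[a_star]

# Assumed example: the target network over-estimates action 2.
q_online_next = [1.0, 2.0, 0.5]
q_target_next = [0.8, 1.5, 3.0]
y_dqn = dqn_target(1.0, q_target_next, gamma=0.9)
y_double = double_dqn_target(1.0, q_online_next, q_target_next, gamma=0.9)
```

In this example the Double DQN target is smaller, because the over-estimated action is used only if the other Q function also selects it.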


Dueling DQN

The Q network takes a state as input and outputs the Q value of every action.

Compared with the original DQN, the only difference in Dueling DQN is the network architecture.


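In this architecture the network outputs a scalar V(s) and a vector A(s, a), and computes Q(s, a) = V(s) + A(s, a). Its benefit is as follows. Suppose the training target requires two Q values in one state's column to change to new values. We would like the network to achieve this by updating V(s) rather than A(s, a).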


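The advantage of updating V(s) is that when the first two values in the column change, the third changes too. That is, in a given state we may have sampled only two actions and never the third, yet the third action's Q value is still adjusted. We therefore do not need to sample every state-action pair, and Q values can be estimated more efficiently.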


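In practice, A is given a constraint (for example, normalizing the A values of each state so that they sum to zero) that makes A awkward to update, so the network tends to solve the problem through V.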
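One common way to implement such a constraint is to subtract the mean of the advantages before adding V(s), so that the advantages of each state sum to zero. The sketch below assumes that normalization; `dueling_q` and the numbers are illustrative.

```python
def dueling_q(v, advantages):
    """Combine a scalar V(s) with zero-mean advantages A(s, a) into Q(s, a)."""
    mean_a = sum(advantages) / len(advantages)
    return [v + (a - mean_a) for a in advantages]

# Assumed example: V(s) = 1 and raw advantages 7, 3, 2 (mean 4).
q_values = dueling_q(1.0, [7.0, 3.0, 2.0])
# Raising V(s) by 1 shifts every action's Q value, including unsampled actions.
q_shifted = dueling_q(2.0, [7.0, 3.0, 2.0])
```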


Prioritized Replay

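Prioritized Replay changes the sampling process: data that had a larger TD error in previous training gets a higher probability of being sampled. Because the sampling process changes, the way the parameters are updated changes as well.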
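A minimal sketch of the sampling side only: indices are drawn with probability proportional to a priority derived from the absolute TD error. The function name, the exponent `alpha`, and the small constant `eps` are assumptions. The matching change to the update rule (re-weighting sampled transitions) is not shown.

```python
import random

def sample_prioritized(td_errors, batch_size, alpha=1.0, eps=1e-3, rng=random):
    """Sample buffer indices with probability proportional to (|TD error| + eps)^alpha."""
    priorities = [(abs(e) + eps) ** alpha for e in td_errors]
    return rng.choices(range(len(td_errors)), weights=priorities, k=batch_size)
```

The constant `eps` keeps transitions with zero TD error from never being sampled again.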


Multi-step

Balance between MC and TD

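Advantage: more steps are sampled, and the value is estimated only after N steps, so the estimated part has a smaller influence.

Disadvantage: there are more r terms, and summing N rewards gives a larger variance.

So N has to be tuned to strike a balance between variance and an inaccurate Q.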
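The multi-step target can be sketched as the sum of the sampled rewards plus the target network's estimate at the state reached afterwards. The function name and the optional discount `gamma` are assumptions for illustration.

```python
def n_step_target(rewards, q_target_last, gamma=1.0):
    """Sum the sampled rewards, then add the estimated value of the last state.

    rewards: the rewards r_t, r_{t+1}, ... collected over the sampled steps.
    q_target_last: target-network Q values for the state reached afterwards.
    """
    g = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return g + (gamma ** len(rewards)) * max(q_target_last)
```

With a single reward this reduces to the ordinary TD target. With a whole episode of rewards the estimated term vanishes and it approaches MC.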


Noisy Net

Improve exploration

The epsilon-greedy exploration discussed earlier adds noise in the action space. A better method, Noisy Net, adds noise in the parameter space.


Note: the noise does not change within an episode.

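Noise on Action vs. Noise on Parameters: with noise on actions (epsilon-greedy), the agent may take different actions in the same state. With noise on parameters, the agent takes the same action in the same state throughout an episode, so it explores in a consistent, state-dependent way.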
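A simplified sketch of parameter-space noise: Gaussian noise is added to the weights of a tiny linear Q function once per episode and then held fixed. The class `NoisyLinearQ`, the fixed `sigma`, and the linear model are assumptions for illustration. The published Noisy Net method learns its noise scales, which this sketch does not do.

```python
import random

class NoisyLinearQ:
    def __init__(self, weights, sigma=0.5, seed=0):
        self.weights = weights          # weights[a][i]: weight of feature i for action a
        self.sigma = sigma
        self.rng = random.Random(seed)
        self.resample()

    def resample(self):
        """Call at the start of each episode to draw fresh parameter noise."""
        self.noisy = [
            [w + self.sigma * self.rng.gauss(0.0, 1.0) for w in row]
            for row in self.weights
        ]

    def act(self, state):
        """Greedy action under the current noisy parameters."""
        q = [sum(w * x for w, x in zip(row, state)) for row in self.noisy]
        return max(range(len(q)), key=lambda a: q[a])
```

Because the noise is fixed between calls to `resample`, the same state always maps to the same action within an episode.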


Distributional Q-function

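Model the distribution: instead of estimating only the mean, estimate the whole distribution (each action has its own distribution).

It is not easy to implement, so few people use this technique in practice.

The Q-function is the expected accumulated reward, so a computed Q value is an expectation.

The same Q value can correspond to different distributions. Representing the whole reward with only its expected Q value loses some information.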
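A tiny numeric illustration of two return distributions that share the same mean. The two distributions and the helper names are made up for this example.

```python
def mean(values, probs):
    return sum(v * p for v, p in zip(values, probs))

def variance(values, probs):
    m = mean(values, probs)
    return sum(p * (v - m) ** 2 for v, p in zip(values, probs))

# Assumed return distributions for two actions: (possible returns, probabilities).
safe = ([-1.0, 1.0], [0.5, 0.5])
risky = ([-10.0, 10.0], [0.5, 0.5])
```

Both actions have an expected return of 0, so a mean-only Q function cannot tell them apart, even though their spread differs by a factor of 100 in variance.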


Rainbow

Combine all of the methods above.


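The ablation study removes one technique from Rainbow at a time.

When double DQN is removed, the score is about the same as the original Rainbow. A plausible explanation is that distributional DQN already avoids over-estimating the reward, and avoiding over-estimation is exactly what double DQN is for.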


Q-Learning for Continuous Actions

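One of the biggest problems with Q-Learning is that it does not handle continuous actions easily: choosing an action means solving a = arg max_a Q(s, a), and a continuous action space cannot be enumerated.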


Solution 1: Sample a set of actions {a_1, ..., a_N} and pick the one with the largest Q value.


Solution 2: Using gradient ascent to solve the optimization problem.

Treat a as a parameter and look for the a that maximizes the Q function, using gradient ascent to update the value of a.
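A minimal sketch of gradient ascent over a one-dimensional action. The toy Q function, the finite-difference gradient, and all parameter values are assumptions for illustration. A real network would supply the gradient by backpropagation.

```python
def argmax_action(q, a0=0.0, lr=0.1, steps=200, h=1e-5):
    """Gradient ascent on a scalar action a to maximize q(a) for a fixed state."""
    a = a0
    for _ in range(steps):
        grad = (q(a + h) - q(a - h)) / (2 * h)   # central finite difference
        a += lr * grad
    return a

# Assumed toy Q function for one fixed state, maximized at a = 2.
best_a = argmax_action(lambda a: -(a - 2.0) ** 2)
```

Like any gradient method, this can stop at a local maximum, and it adds an inner optimization loop to every action choice.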

Solution 3: Design a network to make the optimization easy
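One known design of this kind gives Q a quadratic form in the action, so the maximizer can be read off directly. This is offered as an illustration and is an assumption about what the missing slide showed. Given a state s, the network outputs a vector μ(s), a positive-definite matrix Σ(s), and a scalar V(s):

```latex
Q(s, a) = -\bigl(a - \mu(s)\bigr)^{\top} \Sigma(s) \bigl(a - \mu(s)\bigr) + V(s),
\qquad
\arg\max_{a} Q(s, a) = \mu(s)
```

Because Σ(s) is positive definite, the quadratic term is never positive and equals zero only at a = μ(s), so no search over actions is needed.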


Solution 4: Don't use Q-learning

Article source: https://www.toymoban.com/news/detail-436920.html
