Module 6. Reinforcement Learning (Prof. Byung-Jun Lee (이병준), Korea University)

by 제룽, July 15, 2023

Date: July 13, 2023

Part 1. MDP and Planning

MDP stands for Markov Decision Process.

A framework for sequential decision making under uncertainty
The foundation for reinforcement learning (RL)
When the model (transition probability, reward function) is known, use MDP planning (a stochastic control technique)
When the model is unknown and only simulation results (reward values) are available, use reinforcement learning
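
To make the last two points concrete, here is a minimal sketch in Python. The two-state model, the action name "go", and all numbers are assumptions made up for illustration; they are not from the lecture. With the model known, the expected two-step reward is computed directly from p and r. With only a simulator, the same quantity is estimated by averaging observed rewards, which is the simplest Monte Carlo estimate and not a full RL algorithm.

```python
import random

# Hypothetical model; every number here is an illustrative assumption.
P = {  # P[(s, a)] -> {next_state: probability}
    (0, "go"): {0: 0.7, 1: 0.3},
    (1, "go"): {1: 1.0},
}
R = {(0, "go"): 0.0, (1, "go"): 1.0}  # R[(s, a)] -> immediate reward


def exact_two_step_reward(s, a):
    """Model known: use p and r directly to get the expected reward over two steps."""
    return R[(s, a)] + sum(prob * R[(s2, a)] for s2, prob in P[(s, a)].items())


def simulate_step(s, a, rng):
    """Stand-in for a black-box simulator: returns (reward, next_state)."""
    next_states = list(P[(s, a)])
    weights = list(P[(s, a)].values())
    s2 = rng.choices(next_states, weights=weights)[0]
    return R[(s, a)], s2


def estimated_two_step_reward(s, a, n_episodes, rng):
    """Model unknown: average the rewards observed over simulated episodes."""
    total = 0.0
    for _ in range(n_episodes):
        r1, s2 = simulate_step(s, a, rng)
        r2, _ = simulate_step(s2, a, rng)
        total += r1 + r2
    return total / n_episodes


rng = random.Random(0)
exact = exact_two_step_reward(0, "go")
estimate = estimated_two_step_reward(0, "go", 20000, rng)
print(exact, estimate)
```

The estimate only approaches the exact value as the number of simulated episodes grows, which is the price paid for not knowing p and r.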

https://velog.io/@recoder/MDP%EC%9D%98%EA%B0%9C%EB%85%90

S : set of states (state space)

state $s_t \in S$ : the status of the system or environment at time step $t$
If discrete, $S = \{1, 2, \dots, n\}$; if continuous, $S = \mathbb{R}^n$

A : set of actions (action space)

action $a_t \in A$ : the input to the system at time step $t$
If discrete, $A = \{1, 2, \dots, n\}$; if continuous, $A = \mathbb{R}^n$
The decision maker observes the system state and chooses an action either randomly or deterministically
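
The difference between choosing an action deterministically and choosing it randomly can be sketched as follows. The three states, the action names "left" and "right", and the probabilities are hypothetical, and calling the mapping a "policy" is my own label rather than a term used in the notes above.

```python
import random

# Hypothetical 3-state, 2-action example; all values are illustrative assumptions.
deterministic_policy = {0: "left", 1: "right", 2: "right"}  # state -> action

stochastic_policy = {  # state -> {action: probability}
    0: {"left": 0.9, "right": 0.1},
    1: {"left": 0.5, "right": 0.5},
    2: {"left": 0.0, "right": 1.0},
}


def act_deterministic(state):
    """The same state always yields the same action."""
    return deterministic_policy[state]


def act_stochastic(state, rng):
    """The action is drawn from a state-dependent probability distribution."""
    dist = stochastic_policy[state]
    return rng.choices(list(dist), weights=list(dist.values()))[0]


rng = random.Random(0)
a_det = act_deterministic(1)
a_sto = act_stochastic(0, rng)
print(a_det, a_sto)
```

A deterministic rule is just the special case where one action has probability 1, as in state 2 of the stochastic table.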

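p : state transition probability

$p(s' \mid s, a) := \mathrm{Prob}(s_{t+1} = s' \mid s_t = s, a_t = a)$
The probability that the next state is $s'$ given that the current state is $s$ and the action is $a$
In the deterministic case, the probability is 1 for exactly one next state $s'$ and 0 for all others.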
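
As a rough illustration, a discrete transition probability can be stored as one probability row per (state, action) pair. The three states, the action name "move", and the numbers below are assumptions chosen for the example. The deterministic case is simply a row with a single 1.

```python
import random

STATES = [0, 1, 2]  # hypothetical discrete state space

# p[(s, a)][s_next] = p(s_next | s, a); values are illustrative assumptions.
stochastic_p = {(0, "move"): [0.1, 0.8, 0.1]}
deterministic_p = {(0, "move"): [0.0, 1.0, 0.0]}  # exactly one next state has probability 1


def is_valid_distribution(row, tol=1e-9):
    """Each row must be non-negative and sum to 1."""
    return all(x >= 0 for x in row) and abs(sum(row) - 1.0) <= tol


def sample_next_state(p, s, a, rng):
    """Draw s_next according to p(. | s, a)."""
    return rng.choices(STATES, weights=p[(s, a)])[0]


rng = random.Random(0)
stochastic_samples = [sample_next_state(stochastic_p, 0, "move", rng) for _ in range(10)]
deterministic_samples = [sample_next_state(deterministic_p, 0, "move", rng) for _ in range(10)]
print(stochastic_samples, deterministic_samples)
```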

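reward function $r_t$

$r_t = r(s_t, a_t)$ : the reward received for taking action $a_t$ in the current state $s_t$
It indicates how well the agent is doing at the current step $t$
It cannot measure long-term effects; it reflects only the immediate outcome.
Long-term effects are judged later by accumulating these rewards.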
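
A small made-up example shows why the immediate reward alone can mislead. The state names, the actions "grab", "wait", and "stay", and the reward values are all hypothetical. The action with the higher immediate reward ends up with the lower accumulated reward.

```python
# Hypothetical deterministic two-step problem; names and numbers are assumptions.
next_state = {
    ("start", "grab"): "dead_end",
    ("start", "wait"): "jackpot",
    ("dead_end", "stay"): "dead_end",
    ("jackpot", "stay"): "jackpot",
}
reward = {  # reward[(s, a)] = r(s, a)
    ("start", "grab"): 1.0,
    ("start", "wait"): 0.0,
    ("dead_end", "stay"): 0.0,
    ("jackpot", "stay"): 10.0,
}


def cumulative_reward(first_action):
    """Sum the rewards over two steps: the first action, then 'stay'."""
    s = "start"
    total = reward[(s, first_action)]
    s = next_state[(s, first_action)]
    total += reward[(s, "stay")]
    return total


actions = ["grab", "wait"]
immediate_best = max(actions, key=lambda a: reward[("start", a)])
cumulative_best = max(actions, key=cumulative_reward)
print(immediate_best, cumulative_best)  # prints: grab wait
```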

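discount factor $\gamma$

$\gamma \in (0, 1]$
How strongly future rewards are discounted
The closer $\gamma$ is to 0, the more sharply the weight on the future is reduced;
the closer $\gamma$ is to 1, the more nearly equal the weights on the future and the present become.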
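
The effect of $\gamma$ can be checked numerically. The sketch below assumes the standard discounted sum $\sum_t \gamma^t r_t$ as the way rewards are accumulated, which the notes above do not spell out; the reward sequence is made up.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t], for gamma in (0, 1]."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must be in (0, 1]")
    total = 0.0
    # Work backwards so each step applies one more factor of gamma to later rewards.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 1.0, 1.0, 1.0]  # hypothetical reward sequence r_0 .. r_3

g_small = discounted_return(rewards, 0.1)  # 1 + 0.1 + 0.01 + 0.001
g_one = discounted_return(rewards, 1.0)    # plain sum, no discounting
print(g_small, g_one)
```

With $\gamma = 0.1$ almost all of the return comes from the first reward, while $\gamma = 1$ weights all four rewards equally.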
