
Module 6. Reinforcement Learning (Prof. 이병준, Korea University)

by 제룽 2023. 7. 13.

Part 1. MDP and Planning

MDP : short for Markov Decision Process
- a framework for sequential decision making under uncertainty
- the basic framework underlying reinforcement learning (RL)
- when the system model (transition probability, reward function) is known, MDP methods (stochastic control) are used
- when the model is unknown and only simulation outcomes (reward values) are available, reinforcement learning is used

https://velog.io/@recoder/MDP%EC%9D%98%EA%B0%9C%EB%85%90

S : set of states (state space)
- state s_t ∈ S : the status of the system (environment)
- in the discrete case, S = {1, 2, ..., n}; in the continuous case, S = ℝ^n
 
A : set of actions (action space)

- action a_t ∈ A : the input to the system
- in the discrete case, A = {1, 2, ..., n}; in the continuous case, A = ℝ^n
- the decision maker observes the system state and chooses an action, either randomly or deterministically (see the toy sketch below)
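As a minimal sketch, the two spaces can be written out for a hypothetical 2-state / 2-action problem (the state and action names below are illustrative assumptions, not from the lecture):

```python
# Toy discrete state and action spaces (hypothetical example).
S = ["low_battery", "high_battery"]   # state space S
A = ["recharge", "work"]              # action space A

# The decision maker observes the current state s_t and chooses an action a_t.
s_t = "low_battery"
a_t = "recharge"
assert s_t in S and a_t in A
```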

 

p : state transition probability
- p(s′ | s, a) := Prob(s_{t+1} = s′ | s_t = s, a_t = a)
- the probability that the next state is s′, given that the current state is s and the action taken is a
- in the deterministic case, the probability is 1 for exactly one next state s′ and 0 for all the others
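As a rough illustration, p(s′ | s, a) for the toy problem above can be stored as a lookup table; the numbers below are made-up assumptions, and each conditional distribution over s′ must sum to 1:

```python
import random

# Hypothetical transition table p(s' | s, a) for the toy 2-state MDP above.
p = {
    ("low_battery", "recharge"):  {"low_battery": 0.1, "high_battery": 0.9},
    ("low_battery", "work"):      {"low_battery": 0.8, "high_battery": 0.2},
    ("high_battery", "recharge"): {"low_battery": 0.0, "high_battery": 1.0},  # deterministic: one s' has prob. 1
    ("high_battery", "work"):     {"low_battery": 0.4, "high_battery": 0.6},
}

# Each conditional distribution over s' sums to 1.
for dist in p.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9

def sample_next_state(s, a):
    """Sample s' ~ p(. | s, a)."""
    dist = p[(s, a)]
    return random.choices(list(dist.keys()), weights=list(dist.values()))[0]
```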

 

reward function r_t
- r_t = r(s_t, a_t) : the immediate outcome of taking action a_t in the current state s_t
- measures how well the agent is doing at the current step t
- it cannot measure long-term effects; it reflects only the immediate outcome
- long-term effects are evaluated later by accumulating rewards over time (see the sketches below)
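A minimal sketch of such a per-step reward r(s_t, a_t) for the same toy problem; the values are made-up assumptions, and note that it scores only the current step:

```python
# Hypothetical immediate reward r(s, a): only the current step is scored.
def r(s, a):
    if a == "work":
        return 1.0 if s == "high_battery" else -1.0  # working on low battery is penalized
    return 0.0  # recharging yields no immediate reward

r_t = r("high_battery", "work")  # reward at the current step t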

 

discount factor γ
- γ ∈ (0, 1]
- how strongly future rewards are discounted
- the closer γ is to 0, the more heavily the weight on future rewards is reduced
- the closer γ is to 1, the more nearly equal the weights on future and present rewards become (see the sketch below)
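As a rough sketch of how γ weights a reward sequence, the discounted return G = Σ_t γ^t · r_t can be computed as below; the reward sequence is an arbitrary made-up example:

```python
# Discounted return G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r_t for t, r_t in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0, 1.0]           # made-up reward sequence
print(discounted_return(rewards, gamma=0.1))  # ~1.1111: future rewards barely count
print(discounted_return(rewards, gamma=1.0))  # 5.0    : future and present count equally
```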

 
