๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
๋Œ€์™ธํ™œ๋™/2023 LG Aimers 3๊ธฐ

Module 6. Reinforcement Learning (Prof. 이병준, Korea University)

by ์ œ๋ฃฝ 2023. 7. 15.

๋‚ ์งœ: 2023๋…„ 7์›” 13์ผ

Part 1. MDP and Planning

MDP : short for Markov Decision Process

  • A technique for sequential decision making under uncertainty
  • The basic framework underlying reinforcement learning (RL)
  • When the model (transition probability, reward function) is known, MDP planning (a stochastic control technique) is used (see the tuple sketch after this list)
  • When the model is unknown and only simulation results (reward values) are available, reinforcement learning is used

 

https://velog.io/@recoder/MDP%EC%9D%98%EA%B0%9C%EB%85%90

 

 

S : set of states (state space)

  • state s_t ∈ S : the status of the system (environment)
  • If discrete, S = {1, 2, ..., n}; if continuous, S = ℝ^n

A : set of actions (action space)

  • action a_t ∈ A : input to the system
  • If discrete, A = {1, 2, ..., n}; if continuous, A = ℝ^n
  • The decision maker observes the system state and chooses an action, either randomly or deterministically

p : state transition probability

  • p(s' | s, a) := Prob(s_{t+1} = s' | s_t = s, a_t = a)
  • The probability that the next state is s', given that the current state is s and the action is a
  • In the deterministic case, the probability is 1 for exactly one next state s' and 0 for all others (see the sketch below)
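A minimal sketch of a tabular transition probability and how a next state is sampled from it; the states ("s0", "s1") and actions ("stay", "move") are hypothetical, not from the lecture.

```python
import random

# Hypothetical 2-state, 2-action transition model:
# p[(s, a)][s_next] = Prob(s_{t+1} = s_next | s_t = s, a_t = a)
p = {
    ("s0", "stay"): {"s0": 1.0, "s1": 0.0},  # deterministic: one next state has probability 1
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},  # stochastic: probabilities over s' sum to 1
    ("s1", "stay"): {"s0": 0.0, "s1": 1.0},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

def sample_next_state(s, a):
    """Sample s' according to p(s' | s, a)."""
    next_states = list(p[(s, a)].keys())
    weights = list(p[(s, a)].values())
    return random.choices(next_states, weights=weights, k=1)[0]

print(sample_next_state("s0", "move"))  # "s1" about 80% of the time
```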

reward function r_t

  • r_t = r(s_t, a_t) : the outcome of taking action a_t in the current state s_t
  • Measures how well the agent is doing at the current step t
  • It cannot measure long-term effects; it reflects only the immediate outcome
  • Long-term effects are judged later by accumulating these rewards (see the sketch below)

discount factor γ

  • γ ∈ (0, 1]
  • How much the future is discounted
  • The closer γ is to 0, the more heavily future rewards are down-weighted
  • The closer γ is to 1, the more nearly future and present rewards are weighted equally (see the sketch below)