Reinforcement Learning
The agent learns by interacting directly with a given environment; no pre-collected training data is provided.
The agent chooses an action in each state and receives a reward.
During training, the agent improves its policy (the action chosen in each state) in order to maximize the total reward.
Time really matters: feedback is delayed, so the reward for an action may only arrive several steps later.
Agent [ State*(t)* → Policy → Action*(t)* → Rewards*(t)* → State*(t+1)* ]
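The State → Policy → Action → Reward → next State loop can be sketched in a few lines of Python. This is only an illustrative sketch: the 5-cell corridor environment and the names `step`, `random_policy`, and `run_episode` are hypothetical and do not come from the lecture.

```python
import random

# Hypothetical toy environment (an assumption for illustration):
# a corridor of 5 cells, start at cell 0, goal at the rightmost cell.
N_STATES = 5
ACTIONS = (-1, +1)  # move left / move right


def step(state, action):
    """Environment dynamics: return (next_state, reward)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    # The reward is delayed: it is only given when the goal is reached.
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def random_policy(state, rng):
    """A policy maps a state to an action; this one ignores the state."""
    return rng.choice(ACTIONS)


def run_episode(policy, rng, max_steps=100):
    """Run State(t) -> Policy -> Action(t) -> Reward(t) -> State(t+1)."""
    state, total_reward, trajectory = 0, 0.0, []
    for _ in range(max_steps):
        action = policy(state, rng)
        next_state, reward = step(state, action)
        trajectory.append((state, action, reward))
        total_reward += reward
        state = next_state
        if state == N_STATES - 1:  # goal reached, episode ends
            break
    return total_reward, trajectory
```

Calling `run_episode(random_policy, random.Random(0))` returns the total reward and the visited (state, action, reward) triples. Training would then mean replacing `random_policy` with a policy that collects more reward.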
Policy Iteration, Policy Evaluation
Grid World Example
Markov Process (Markov Chain)
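Markov property: the next state depends only on the current state, not on the earlier history.
$P(S_{t+1}=s' \mid S_t=s) = P(S_{t+1}=s' \mid S_0=s_0, S_1=s_1, \ldots, S_t=s)$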
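As a rough illustration of the Markov property, the sketch below samples a chain in which the next state is drawn using only the current state. The 3-state transition matrix `P` and the function names are assumptions made for this example, not part of the lecture.

```python
import random

# Hypothetical 3-state chain (assumed for illustration).
# Row s holds the distribution P(S_{t+1} = s' | S_t = s).
P = [
    [0.9, 0.1, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0],  # state 2 is absorbing
]


def next_state(state, rng):
    """Sample S_{t+1} using only the current state S_t."""
    return rng.choices(range(len(P)), weights=P[state])[0]


def sample_chain(start, n_steps, rng):
    """Return the state sequence [S_0, S_1, ..., S_n]."""
    states = [start]
    for _ in range(n_steps):
        # Only states[-1] is consulted; the earlier history is ignored.
        states.append(next_state(states[-1], rng))
    return states
```

For example, `sample_chain(0, 20, random.Random(0))` yields one sampled trajectory of 21 states. Because `next_state` never looks at the earlier states, the sampled process satisfies the Markov property by construction.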