Deep Reinforcement Learning

Reinforcement Learning
- The agent interacts directly with the given environment; no training data is supplied in advance
- The agent chooses an action in each state and receives a reward
- During training, it improves its policy (the action chosen in each state) in order to maximize the total reward
- Time matters a great deal, and feedback is delayed
- Agent [ State*(t)* → Policy → Action*(t)* → Rewards*(t)* → State*(t+1)* ]
- Policy Iteration, Policy Evaluation
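
The State(t) → Policy → Action(t) → Reward(t) → State(t+1) cycle above can be sketched as a plain loop. This is only an illustration: `CorridorEnv`, its step cost of -0.04, and the two policies are hypothetical names and values chosen for the example, not something defined in these notes.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: positions 0..4, goal at the last position."""

    def __init__(self, length=5, step_cost=-0.04):
        self.length = length
        self.step_cost = step_cost  # assumed small negative reward per step
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the ends of the corridor clamp the move.
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else self.step_cost
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Repeat state -> policy -> action -> reward -> next state until the episode ends."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # the policy maps a state to an action
        state, reward, done = env.step(action)   # the environment answers with feedback
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(state):
    return +1


def random_policy(state):
    return random.choice([-1, +1])
```

Calling `run_episode(CorridorEnv(), always_right)` and `run_episode(CorridorEnv(), random_policy)` side by side shows how the total reward depends on the policy, which is the quantity training tries to maximize.
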
Grid World Example
- Reward
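  - Big Rewards: +1 if the episode ends as intended, -1 otherwise
  - Small Negative Reward ( c ): a small penalty c is subtracted so that the agent does not get stuck in an endless loop or similar cases
- Noisy Movement: the agent does not always move in the direction it intended.
- State Transition Probability: because movement is stochastic, the policy is what matters
- The core of reinforcement learning is finding the optimal policy (the policy that collects the maximum reward over an episode)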
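
A minimal sketch of noisy movement and the resulting state transition probabilities follows. The notes do not give concrete numbers, so everything numeric here is an assumption: a 3x4 grid, an 80% chance of moving in the intended direction with 10% for each perpendicular slip, terminal rewards at two fixed cells, and c = -0.04.

```python
import random

# Assumed layout (not from the notes): 3 rows x 4 columns, states are (row, col).
ROWS, COLS = 3, 4
TERMINALS = {(0, 3): +1.0, (1, 3): -1.0}   # big rewards when the episode ends
STEP_REWARD = -0.04                          # small negative reward c (assumed value)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
# Perpendicular directions the agent may slip into for each intended direction.
SLIPS = {"up": ("left", "right"), "down": ("left", "right"),
         "left": ("up", "down"), "right": ("up", "down")}


def transition_probs(state, action, p_intended=0.8):
    """Return {next_state: probability} for taking `action` in `state`."""
    p_slip = (1.0 - p_intended) / 2
    probs = {}
    for direction, p in [(action, p_intended),
                         (SLIPS[action][0], p_slip),
                         (SLIPS[action][1], p_slip)]:
        dr, dc = MOVES[direction]
        r, c = state[0] + dr, state[1] + dc
        # Bumping into the boundary leaves the agent where it is.
        nxt = (r, c) if 0 <= r < ROWS and 0 <= c < COLS else state
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs


def step(state, action, rng=random):
    """Sample one noisy move and return (next_state, reward, done)."""
    probs = transition_probs(state, action)
    nxt = rng.choices(list(probs), weights=list(probs.values()))[0]
    if nxt in TERMINALS:
        return nxt, TERMINALS[nxt], True
    return nxt, STEP_REWARD, False
```

Because `step` can land somewhere other than the intended cell, a fixed sequence of actions is not enough; the agent needs a policy that says what to do in whichever state it actually ends up in.
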
Markov Process (Markov Chain)
- $P(S_{t+1}=s' \mid S_t=s_t) = P(S_{t+1}=s' \mid S_0=s_0, S_1=s_1, \ldots, S_t=s_t)$
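- Given the current state, the earlier history has no influence at all on the next state
- This is convenient, because the history does not need to be remembered
- A stochastic process that satisfies the Markov property is called a Markov process, and it can be written as a tuple (S, P): the set of states and the state transition probabilities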
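
As a small illustration of the tuple (S, P), the sketch below stores a set of states and a transition table, then samples a trajectory in which each step looks only at the current state. The two weather states and their probabilities are hypothetical values picked for the example.

```python
import random

# Hypothetical two-state chain; states and probabilities are illustrative only.
S = ["sunny", "rainy"]
P = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}


def next_state(current, rng=random):
    """Sample S_{t+1} from the row of P for S_t; earlier states are never consulted."""
    row = P[current]
    return rng.choices(list(row), weights=list(row.values()))[0]


def sample_chain(start, steps, rng=random):
    """Generate a trajectory S_0, S_1, ..., S_steps."""
    trajectory = [start]
    for _ in range(steps):
        trajectory.append(next_state(trajectory[-1], rng))
    return trajectory
```

Note that `next_state` takes only the current state as input, which is exactly why the history does not have to be stored.
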