Multi-Armed Bandit
- Rounds \(t=1,2,\cdots,N\)
- \(k\) arms
- At time \(t\), choose an arm \(i\) and receive reward \(X_{i,t}\sim D_i\) (i.i.d. samples from an unknown distribution)
- Policy: \(\pi=\{i(t),t=1,2,\cdots,N\}\), where \(i(t)\) is the arm pulled at time \(t\)
- Goal: \(\max\mathbb{E}\{\sum_{t=1}^N X_{i(t),t}\}\)
- \({\rm OPT}=N\max_i\mu_i=N\mu^*\), where \(\mu_i=\mathbb{E}\{X_{i,t}\}\)
- \({\rm Reg}={\rm OPT}-\mathbb{E}\{\sum_{t=1}^N X_{i(t),t}\mid\pi\}\)
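The definitions above can be made concrete with a small simulation. This is a minimal sketch, assuming Bernoulli-distributed arms; the `simulate` interface (a policy called as `policy(t, history)`) is a hypothetical choice for illustration, not a standard API. It estimates the regret of a uniformly random baseline policy against \({\rm OPT}=N\mu^*\).

```python
import random

def simulate(policy, mus, N, seed=0):
    """Play N rounds; arm i yields X_{i,t} ~ Bernoulli(mus[i]).

    policy(t, history) returns an arm index; history holds the
    (arm, reward) pairs observed so far. (Hypothetical interface.)
    """
    rng = random.Random(seed)
    history, total = [], 0.0
    for t in range(1, N + 1):
        i = policy(t, history)
        x = 1.0 if rng.random() < mus[i] else 0.0  # reward X_{i,t}
        history.append((i, x))
        total += x
    return total  # realized sum of rewards under the policy

mus = [0.3, 0.5, 0.7]     # arm means mu_i (unknown to the policy)
N = 10_000
opt = N * max(mus)        # OPT = N * mu*

# Baseline: pull a uniformly random arm every round.
rng_pi = random.Random(1)
uniform = lambda t, h: rng_pi.randrange(len(mus))

reward = simulate(uniform, mus, N)
regret = opt - reward     # one-run estimate of Reg = OPT - E{sum X | pi}
```

A policy that ignores feedback earns roughly \(N\cdot\bar\mu\) (the average of the arm means), so its regret grows linearly in \(N\); good bandit algorithms aim for regret sublinear in \(N\).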