Sorry, your browser cannot access this site
This page requires browser support (enable) JavaScript
Learn more >

Multi Armed Bandit

  • \(t=1,2,\cdots,N\)
  • \(k\) arm
  • at \(t\), choose arm \(i\), receive reward \(X_{i,t}\sim{}D_i\)(iid unknown distribution)
  • Policy: \(\pi=\{i(t),t=1,2,\cdots,N\}\)
  • Goal: \(\max\mathbb{E}\{\sum_{t=1}^NX_{i,t}\}\)
  • \({\rm OPT}=N\max_i\mu_i=N\mu,\mu_i=\mathbb{E}\{X_{i,t}\}\)
  • \({\rm Reg}={\rm OPT}-\mathbb{E}\{\sum_{t=1}^NX_{i,t}|\pi\}\)
Math

Introduction

  • Single Hop System, \(L\) Queues
  • \(Q_l(t=1)=(Q_l(t)+A_l(t)-S_l(t))^+=Q_l(t)+A_l(t)-S_l(t)+U_l(t)\)
  • \(U_l(t)=\max\{0,S_l(t)-A_l(t)-Q_l(t)\}\)
Math

Queue Control Problem

\(A_1(t),A_2(t),Q_1(t),Q_2(t),\mu_1(t),\mu_2(t)\)

  • \(A_i(t)\) is iid, \(\mathbb{E}\{A_i(t)\}=\lambda_i\), \(A_i(t)\in[0,A_{\max}]\)
  • \(S_i(t)\) is condition of link \(i\), \(S_i(t)\in\{0,1\}\)
  • \(P_{xy}=P_r\{S_1(t)=x,S_2(t)=y\}\)
  • At every time, serve \(Q_1(t)\) or \(Q_2(t)\)
  • \(\mu_i(t)=1\text{ if } S_i(t)=1,Q_i(t)>0\quad0\text{ otherwise}\)
  • Goal: stabilize both queues

\(~\)

Math

Queue

Definition

  • Arrival: \(A(t)\)
  • Cumulative arrival: \[X[t_1,t_2]=\int_{t_1}^{t_2}A(t)dt\]
  • Service: \(\mu(t)\)
  • Cumulative departure: \[Y[t_1,t_2]=\int_{t_1}^{t_2}Y(t)dt\leq\int_{t_1}^{t_2}\mu(t)dt\]
  • \(Y(t)=Q(t)~if~Q(t)>0\quad0~otherwise\)
  • \(Q(t)=X[0,t]-Y[0,t],Q(0)=0\)
Math