
Multi-Armed Bandit

  • \(t=1,2,\cdots,N\)
  • \(k\) arms
  • At time \(t\), choose arm \(i(t)\) and receive reward \(X_{i(t),t}\sim{}D_{i(t)}\) (i.i.d. from an unknown distribution)
  • Policy: \(\pi=\{i(t),t=1,2,\cdots,N\}\)
  • Goal: \(\max\mathbb{E}\{\sum_{t=1}^NX_{i(t),t}\}\)
  • \({\rm OPT}=N\max_i\mu_i=N\mu\), where \(\mu_i=\mathbb{E}\{X_{i,t}\}\)
  • \({\rm Reg}={\rm OPT}-\mathbb{E}\{\sum_{t=1}^NX_{i(t),t}\mid\pi\}\)
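
As a quick sanity check on these definitions (with illustrative numbers, not from the source): take \(k=2\), \(\mu_1=0.9\), \(\mu_2=0.5\), \(N=100\), and the policy that always plays arm \(2\). Then

\[{\rm OPT}=100\cdot0.9=90,\qquad\mathbb{E}\Big\{\sum_{t=1}^{N}X_{2,t}\Big\}=100\cdot0.5=50,\qquad{\rm Reg}=90-50=40=N(\mu_1-\mu_2).\]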

UCB ALG

  • Initialize: play each arm once
  • Loop: play machine \[i=\arg\max_j\left(\bar{x}_j+\sqrt{\frac{2\ln{}n}{n_j}}\right)\]

where \(\bar{x}_j\) is the average reward of arm \(j\), \(n_j\) is the number of times arm \(j\) has been played so far, and \(n\) is the total number of plays so far.
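
A minimal Python sketch of this loop on simulated Bernoulli arms (the environment, the specific arm means, and the function name are illustrative assumptions, not part of the source):

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Run UCB1 on simulated Bernoulli arms and return the empirical regret."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k          # n_j: number of times arm j has been played
    sums = [0.0] * k          # total reward collected from arm j
    total_reward = 0.0

    for n in range(1, horizon + 1):
        if n <= k:
            i = n - 1         # initialization: play each arm once
        else:
            # play the arm maximizing the upper confidence bound
            i = max(range(k),
                    key=lambda j: sums[j] / counts[j] + math.sqrt(2 * math.log(n) / counts[j]))
        reward = 1.0 if rng.random() < means[i] else 0.0   # Bernoulli reward
        counts[i] += 1
        sums[i] += reward
        total_reward += reward

    return horizon * max(means) - total_reward   # OPT minus collected reward

print(ucb1(means=[0.9, 0.5, 0.4], horizon=10_000))
```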

Theorem

For all \(k>1\), if UCB runs on \(k\) arms with reward distributions \(P_1,P_2,\cdots,P_k\) supported on \([0,1]\), then the expected regret after any number \(n\) of plays is at most:

\[8\sum_{i:\mu_i<\mu}\frac{\ln{}n}{\Delta_i}+\left(1+\frac{\pi^2}{3}\right)\sum_{j=1}^k\Delta_j\]

where \(\Delta_i=\mu-\mu_i\)
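
For concreteness (illustrative numbers, not from the source): with \(k=2\), \(\mu_1=0.9\), \(\mu_2=0.5\) (so \(\Delta_1=0\), \(\Delta_2=0.4\)) and \(n=10^4\),

\[8\cdot\frac{\ln 10^4}{0.4}+\Big(1+\frac{\pi^2}{3}\Big)(0+0.4)\approx8\cdot\frac{9.21}{0.4}+1.72\approx186,\]

so the regret grows only logarithmically in \(n\), while \({\rm OPT}\) grows linearly like \(0.9\,n\).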


Proof.
Define \[C_{t,s}=\sqrt{\frac{2\ln{}t}{s}}\]

\[T_i(n)=1+\sum_{t=k+1}^n\{i(t)=i\}\leq{}l+\sum_{t=k+1}^n\{i(t)=i,T_i(t-1)\geq{}l\}\]

After some calculation, we found that

\[T_i(n)\leq{}l+\sum_{t}\sum_{s=1}^{t-1}\sum_{s_i=l}^{t-1}\{\bar{x}_s^{\star}+C_{t,s}\leq\bar{x}_{i,s_i}+C_{t,s_i}\}\]

Also, it's true that

\[\bar{x}_s^\star+C_{t,s}\leq\bar{x}_{i,s_i}+C_{t,s_i}\]

holds only if at least one of the following holds:

\[\bar{x}_s^\star\leq\mu-C_{t,s},\qquad\bar{x}_{i,s_i}\geq\mu_i+C_{t,s_i},\qquad\mu<\mu_i+2C_{t,s_i}\]

The probabilities of the first two events can be bounded by the Chernoff-Hoeffding bound, and with the choice of \(l\) below the last event is impossible for \(s_i\geq{}l\) (probability \(=0\)).

\[{\rm Pr}\{\bar{x}_s^\star\leq\mu-C_{t,s}\}\leq{}e^{-4\ln{}t}=t^{-4},\qquad{\rm Pr}\{\bar{x}_{i,s_i}\geq\mu_i+C_{t,s_i}\}\leq{}e^{-4\ln{}t}=t^{-4}\]

Finally, choose

\[l=\lceil\frac{8\ln{}n}{\Delta_i^2}\rceil\]

\[\mathbb{E}\{T_i(n)\}\leq{}l+\sum_{t=1}^{\infty}\sum_{s=1}^{t-1}\sum_{s_i=l}^{t-1}2t^{-4}\leq\frac{8\ln{}n}{\Delta_i^2}+1+\frac{\pi^2}{3}\]
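
The second inequality uses \(\lceil{}x\rceil\leq{}x+1\) together with the crude count of at most \(t^2\) pairs \((s,s_i)\) for each \(t\):

\[\sum_{t=1}^{\infty}\sum_{s=1}^{t-1}\sum_{s_i=l}^{t-1}2t^{-4}\leq\sum_{t=1}^{\infty}t^2\cdot2t^{-4}=2\sum_{t=1}^{\infty}\frac{1}{t^2}=\frac{\pi^2}{3}.\]

The regret bound in the theorem then follows from \({\rm Reg}=\sum_{i:\mu_i<\mu}\Delta_i\,\mathbb{E}\{T_i(n)\}\).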

\(\varepsilon\)-greedy ALG

  • Parameter: \(c>0,0<d<1\)
  • Initialize: Define \[\varepsilon_n=\min\{1,\frac{ck}{d^2n}\}\]
  • Loop: for each \(n=1,2,\cdots\), let \(i_n=\arg\max_j\bar{x}_{j,n}\). Play \(i_n\) with probability \(1-\varepsilon_n\) and a uniformly random arm with probability \(\varepsilon_n\) (a minimal sketch follows below)
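
A minimal Python sketch of the decaying-\(\varepsilon\) rule, again on an assumed Bernoulli environment with illustrative parameter values:

```python
import random

def eps_greedy(means, horizon, c=5.0, d=0.3, seed=0):
    """Decaying-epsilon greedy on simulated Bernoulli arms; returns empirical regret."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    total_reward = 0.0

    for n in range(1, horizon + 1):
        eps = min(1.0, c * k / (d ** 2 * n))        # epsilon_n = min{1, ck/(d^2 n)}
        if rng.random() < eps:
            i = rng.randrange(k)                    # explore: uniformly random arm
        else:
            # exploit: play the empirically best arm (unplayed arms default to 0.0)
            i = max(range(k), key=lambda j: sums[j] / counts[j] if counts[j] else 0.0)
        reward = 1.0 if rng.random() < means[i] else 0.0
        counts[i] += 1
        sums[i] += reward
        total_reward += reward

    return horizon * max(means) - total_reward

print(eps_greedy(means=[0.9, 0.5, 0.4], horizon=10_000, c=5.0, d=0.3))
```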

Theorem

For all \(k>1\), suppose \(\varepsilon\)-greedy runs on \(k\) arms with reward distributions \(P_1,P_2,\cdots,P_k\) supported on \([0,1]\), and \(0<d<\min_{i:\mu_i<\mu}\Delta_i\). Then for \(c>5\), the probability that after \(n\geq{}ck/d\) plays \(\varepsilon\)-greedy chooses a suboptimal arm \(j\) is at most \[\mathcal{O}\left(\frac{c}{d^2n}\right)+o\left(\frac{1}{n}\right)\] as \(n\rightarrow\infty\).

Lower Bound

Consider two parameter vectors, i.e., \(\Theta=(\theta_1,\theta_2,\cdots,\theta_k)\) with \(\mu_1>\mu_2\geq\dots\geq\mu_k\), and \(\Theta^{\prime}=(\theta_1,\theta_2^\prime,\cdots,\theta_k)\) with \(\mu_2^{\prime}>\mu_1\geq\dots\geq\mu_k\), which differ only in arm \(2\). Here we want to show that it takes about \(\ln{}n\) plays to tell these two apart.
For a fixed strategy, define:

  • \(x_{j,s}\): reward of arm \(j\) at its \(s\)-th play
  • \(\mathbb{P}\): joint distribution over \(\{I_t,x_{j,s}\}\) under \(\Theta\)
  • \(\mathbb{P}^{\prime}\): joint distribution over \(\{I_t,x_{j,s}\}\) under \(\Theta^\prime\)


Consider \(A\subset\{T_2(n)=n_2\}\)

\[\mathbb{P}^\prime(A)=\int_A\prod_{s=1}^{n_2}\frac{d\mathbb{P}_{\theta_2^\prime}}{d\mathbb{P}_{\theta_2}}(x_{2,s})\,d\mathbb{P}=\int_A\exp\Big\{\sum_{s=1}^{n_2}\ln\frac{d\mathbb{P}_{\theta_2^\prime}}{d\mathbb{P}_{\theta_2}}(x_{2,s})\Big\}\,d\mathbb{P}=\int_Ae^{-L_{n_2}}\,d\mathbb{P}\] where \(L_{n_2}=\sum_{s=1}^{n_2}\ln\frac{d\mathbb{P}_{\theta_2}}{d\mathbb{P}_{\theta_2^\prime}}(x_{2,s})\) is the log-likelihood ratio. If moreover \(A\subset\{L_{n_2}\leq{}C_n\}\), then \[\mathbb{P}^\prime(A)\geq{}e^{-C_n}\mathbb{P}(A)\]
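
To sketch how this yields the \(\ln{}n\) lower bound (a step the notes leave implicit; the constants are informal): take \(A=\{T_2(n)\leq(1-\epsilon)\ln{}n/D(\theta_2\|\theta_2^\prime)\}\). On such an event, typically \(L_{n_2}\approx{}n_2D(\theta_2\|\theta_2^\prime)\leq(1-\epsilon)\ln{}n\), so

\[\mathbb{P}(A)\leq{}e^{C_n}\mathbb{P}^\prime(A)\approx{}n^{1-\epsilon}\,\mathbb{P}^\prime(A).\]

Under \(\Theta^\prime\) arm \(2\) is the unique optimal arm, so any strategy with small regret under \(\Theta^\prime\) must make \(\mathbb{P}^\prime(A)\) vanish faster than \(n^{-(1-\epsilon)}\); hence \(\mathbb{P}(A)\rightarrow0\), and \(\mathbb{E}_\Theta\{T_2(n)\}\gtrsim\ln{}n/D(\theta_2\|\theta_2^\prime)\).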

Thompson Sampling ALG

Here we only consider Bernoulli bandits with two arms, because the general analysis is quite involved.

  • \(X_1,X_2\in\{0,1\}\), \(\mu_1\geq\mu_2\)
  • Initialize: for each \(i\), \(S_i=F_i=0\)
  • Loop: \(t=1,2,\cdots,T\)
    • For each \(i\), sample \(\theta_i(t)\sim{\rm Beta}(S_i+1,F_i+1)\)
    • Play \(i(t)=\arg\max_i\theta_i(t)\), observe reward \(r_t\)
    • If \(r_t=1\), \(S_{i(t)}+=1\), else \(F_{i(t)}+=1\)
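
A minimal Python sketch of this loop for Bernoulli arms (the simulated environment and arm means are illustrative assumptions):

```python
import random

def thompson_bernoulli(means, horizon, seed=0):
    """Thompson Sampling for Bernoulli arms with Beta(1,1) priors; returns empirical regret."""
    rng = random.Random(seed)
    k = len(means)
    successes = [0] * k   # S_i
    failures = [0] * k    # F_i
    total_reward = 0.0

    for _ in range(horizon):
        # sample theta_i(t) ~ Beta(S_i + 1, F_i + 1) and play the argmax
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1) for i in range(k)]
        i = max(range(k), key=lambda j: samples[j])
        reward = 1.0 if rng.random() < means[i] else 0.0
        if reward == 1.0:
            successes[i] += 1
        else:
            failures[i] += 1
        total_reward += reward

    return horizon * max(means) - total_reward

print(thompson_bernoulli(means=[0.9, 0.5], horizon=10_000))
```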

Theorem

For this two-armed Bernoulli bandit, TS achieves

\[\mathbb{E}\{R(T)\}=\mathcal{O}(\frac{\ln{}T}{\Delta}+\frac{1}{\Delta^3})\] where \(\Delta=\mu_\max-\mu_\min\)


Proof.

  • Denote by \(j_0\) the number of plays of arm \(1\) until arm \(2\) has been played \(L=24\ln{}T/\Delta^2\) times.
  • Let \(t_j=\) time of the \(j^{\rm th}\) play of arm \(1\) (\(t_0=0\)). Also let \(y_j=t_j-t_{j-1}-1\), i.e., the number of plays of arm \(2\) between consecutive plays of arm \(1\).
  • Denote by \(S(j)\) the number of successes in the first \(j\) plays of arm \(1\)
  • The expected number of plays of arm \(2\) satisfies: \[\mathbb{E}\{K_2(T)\}\leq{}L+\mathbb{E}\{\sum_{j=j_0}^{T-1}y_j\}\]
  • Denote by \(X(j,s,y)\) the number of attempts before a sample from \({\rm Beta}(s+1,j-s+1)\) exceeds \(y\)

It can be seen that \(X(j,s,y)\) is geometric with success probability \(1-F_{s+1,j-s+1}^{\rm beta}(y)\).


Lemma 1

For all non-negative \(j\) and \(s\leq{}j\) and \(y\in[0,1]\), \[\mathbb{E}\{X(j,s,y)\}=\frac{1}{F_{j+1,y}^{\rm B}(s)}-1\] where \(F_{n,p}^{\rm B}\) denotes the CDF of a \({\rm Binomial}(n,p)\) random variable; this follows from the beta-binomial identity \(1-F_{s+1,j-s+1}^{\rm beta}(y)=F_{j+1,y}^{\rm B}(s)\) and the mean of a geometric variable.
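
A quick numerical sanity check of Lemma 1 (purely illustrative; `binom_cdf`, `sample_X`, and the chosen \(j,s,y\) are assumptions for the demo): it compares a Monte-Carlo estimate of \(\mathbb{E}\{X(j,s,y)\}\) with the closed form.

```python
import math
import random

def binom_cdf(s, n, p):
    """CDF of Binomial(n, p) at s, i.e. F^B_{n,p}(s)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(s + 1))

def sample_X(j, s, y, rng):
    """Number of Beta(s+1, j-s+1) draws that fail before one exceeds y."""
    attempts = 0
    while rng.betavariate(s + 1, j - s + 1) <= y:
        attempts += 1
    return attempts

rng = random.Random(0)
j, s, y = 20, 15, 0.6
empirical = sum(sample_X(j, s, y, rng) for _ in range(100_000)) / 100_000
closed_form = 1.0 / binom_cdf(s, j + 1, y) - 1.0
print(empirical, closed_form)   # the two numbers should be close
```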


  • Consider the number of steps before the event \(\{\theta_1(t)>\mu_2+\Delta/2\}\) first happens
  • Given \(S(j)\), this has the same distribution as \(X(j,S(j),\mu_2+\Delta/2)\)
  • \(y_j\) can be larger than this number only if at some \(t\in(t_j,t_{j+1})\), \(\theta_2(t)>\mu_2+\Delta/2\)
  • Thus: \[\mathbb{E}\{y_j\}\leq\mathbb{E}\{\min(X(j,S(j),\mu_2+\Delta/2),T)\}+\mathbb{E}\{\sum_{t=t_j+1}^{t_{j+1}-1}T\cdot{}I\{\theta_2(t)>\mu_2+\Delta/2\}I\{j>j_0\}\}\]

Sum this up from \(j=j_0\) to \(T-1\) and define the event:

\[E_2(t)=\{\theta_2(t)\leq\mu_2+\Delta/2\lor{}K_2(t)\leq{}L\}\]


Lemma 2

\[\mathbb{P}(E_2(t))\geq1-\frac{2}{T^2}\]

Lemma 3

Consider any \(y<\mu_1\), let \(\Delta^1=\mu_1-y\), let \(R=\frac{\mu_1(1-y)}{y(1-\mu_1)}>1\) and \(D=D(y||\mu_1)=y\ln\frac{y}{\mu_1}+(1-y)\ln\frac{1-y}{1-\mu_1}\). Then we have:

\[\mathbb{E}\{\mathbb{E}\{\min(X(j,S(j),y),T)|S(j)\}\}\leq\frac{16}{T}\quad\text{for}\quad{}j\geq\frac{4\ln{}T}{(\Delta^1)^2}\]


Substituting Lemma 2 and Lemma 3 into the summed-up bound (omitted here) and into the inequality \[\mathbb{E}\{K_2(T)\}\leq{}L+\mathbb{E}\{\sum_{j=j_0}^{T-1}y_j\}\] we obtain (after a long calculation): \[\mathbb{E}\{K_2(T)\}\leq{}\frac{40\ln{}T}{\Delta^2}+\frac{48}{\Delta^4}+18\] and, since \(\mathbb{E}\{R(T)\}=\Delta\,\mathbb{E}\{K_2(T)\}\), \[\mathbb{E}\{R(T)\}=\mathcal{O}(\frac{\ln{}T}{\Delta}+\frac{1}{\Delta^3})\]

Non-Stochastic Bandits: ALG EXP3

  • \(K\) actions
  • \(X(i)=(X_i(1),X_i(2),\cdots),X_i(t)\in[0,1]\)
  • ALG \(=i_1,i_2,\cdots\)
  • \(G_A(T)=\sum_{t=1}^TX_{i_t}(t)\)
  • Regret \(=\max_j\sum_{t=1}^TX_j(t)-G_A(T)\)

ALG EXP3

  • Parameter: Real \(\gamma\in[0,1]\)
  • Initialize: \(w_i(1)=1,\forall{}i=1,2,\cdots,K\)
  • Loop: for each \(t=1,2,\cdots\)
    • Set \[P_i(t)=(1-\gamma)\frac{w_i(t)}{\sum_{j}w_j(t)}+\frac{\gamma}{K}\]
    • Draw \(i_t\) randomly according to \(P_1(t),\cdots,P_K(t)\)
    • Receive reward \(X_{i_t}(t)\in[0,1]\)
    • For \(j=1,2,\cdots,K\):

\[\hat{X_j}(t)=\left\{\begin{aligned}&\frac{X_j(t)}{P_j(t)}&j=i_t\\&0&\text{otherwise}\end{aligned}\right.\quad\quad{}w_j(t+1)=w_j(t)\exp(\frac{\gamma\hat{X_j}(t)}{K})\]
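
A minimal Python sketch of EXP3; the reward callback `reward_fn` and the stochastic "adversary" at the bottom are illustrative assumptions, since the algorithm itself makes no assumption about how rewards are generated:

```python
import math
import random

def exp3(reward_fn, K, horizon, gamma, seed=0):
    """EXP3 against an arbitrary reward sequence; reward_fn(t, i) returns X_i(t) in [0, 1]."""
    rng = random.Random(seed)
    weights = [1.0] * K                                   # w_i(1) = 1
    total_reward = 0.0

    for t in range(1, horizon + 1):
        total_w = sum(weights)
        probs = [(1 - gamma) * w / total_w + gamma / K for w in weights]
        i_t = rng.choices(range(K), weights=probs)[0]     # draw i_t ~ P(t)
        x = reward_fn(t, i_t)
        total_reward += x
        # importance-weighted estimate: only the played arm gets a nonzero update
        x_hat = x / probs[i_t]
        weights[i_t] *= math.exp(gamma * x_hat / K)
        # rescale weights to avoid floating-point overflow; this leaves P_i(t) unchanged
        m = max(weights)
        weights = [w / m for w in weights]

    return total_reward

# illustrative "adversary": arm 0 pays Bernoulli(0.7), the others Bernoulli(0.4)
adv_rng = random.Random(1)
payout = lambda t, i: 1.0 if adv_rng.random() < (0.7 if i == 0 else 0.4) else 0.0
print(exp3(payout, K=3, horizon=10_000, gamma=0.05))
```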

Theorem

For any \(K\) and any \(\gamma\in[0,1]\),

\[G_{\max}-\mathbb{E}\{G_{\exp 3}\}\leq(e-1)\gamma{}G_{\max}+\frac{K\ln{}K}{\gamma}\]

for any assignment of rewards.

Corollary

Assume \(g\geq{}G_{\max}\) and \(\gamma=\min\{1,\sqrt{\frac{K\ln{}K}{(e-1)g}}\}\), then \[G_{\max}-\mathbb{E}\{G_{\exp 3}\}\leq2.63\sqrt{gK\ln{}K}\]
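
Since every per-round reward lies in \([0,1]\), one may always take \(g=T\geq{}G_{\max}\); the corollary then gives

\[G_{\max}-\mathbb{E}\{G_{\exp 3}\}\leq2.63\sqrt{TK\ln{}K},\]

i.e., regret of order \(\sqrt{TK\ln{}K}\) against any reward sequence.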