Multi-Armed Bandit
- \(t=1,2,\cdots,N\)
- \(k\) arms
- at \(t\), choose arm \(i\), receive reward \(X_{i,t}\sim{}D_i\) (i.i.d. draws from an unknown distribution)
- Policy: \(\pi=\{i(t),t=1,2,\cdots,N\}\)
- Goal: \(\max_\pi\mathbb{E}\{\sum_{t=1}^NX_{i(t),t}\}\)
- \({\rm OPT}=N\max_i\mu_i=N\mu\), where \(\mu_i=\mathbb{E}\{X_{i,t}\}\) and \(\mu=\max_i\mu_i\)
- \({\rm Reg}={\rm OPT}-\mathbb{E}\{\sum_{t=1}^NX_{i(t),t}\mid\pi\}\)
UCB ALG
- Initialize: play each arm once
- Loop: play arm \[i=\arg\max_j\left(\bar{x}_j+\sqrt{\frac{2\ln{}n}{n_j}}\right)\]
where \(\bar{x}_j\) is the average reward of arm \(j\), \(n_j\) is the number of times arm \(j\) has been played so far, and \(n\) is the total number of plays so far (a Python sketch follows)
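A minimal Python sketch of this loop (the arm interface `pull(i)`, which returns one reward sample in \([0,1]\), and the horizon `N` are assumptions for illustration):

```python
import math
import random

def ucb1(pull, k, N):
    """UCB1 sketch: pull(i) returns one reward in [0, 1] for arm i."""
    counts = [0] * k          # n_j: number of times arm j was played
    sums = [0.0] * k          # cumulative reward of arm j
    for i in range(k):        # initialization: play each arm once
        sums[i] += pull(i)
        counts[i] = 1
    for n in range(k + 1, N + 1):
        # index = empirical mean + exploration bonus sqrt(2 ln n / n_j)
        index = [sums[j] / counts[j] + math.sqrt(2 * math.log(n) / counts[j])
                 for j in range(k)]
        i = max(range(k), key=lambda j: index[j])
        sums[i] += pull(i)
        counts[i] += 1
    return sums, counts

# e.g. two Bernoulli arms with means 0.9 and 0.5:
# ucb1(lambda i: float(random.random() < [0.9, 0.5][i]), k=2, N=10_000)
```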
Theorem
For all \(k>1\), if UCB runs on \(k\) arms with reward distributions \(P_1,P_2,\cdots,P_k\) supported on \([0,1]\), then the expected regret after any number \(n\) of plays is at most:
\[8\sum_{i:\mu_i<\mu}\frac{\ln{}n}{\Delta_i}+\left(1+\frac{\pi^2}{3}\right)\sum_{j=1}^k\Delta_j\]
where \(\Delta_i=\mu-\mu_i\)
Proof.
Define \[C_{t,s}=\sqrt{\frac{2\ln{}t}{s}}\]
\[T_i(n)=1+\sum_{t=k+1}^nI\{i(t)=i\}\leq{}l+\sum_{t=k+1}^nI\{i(t)=i,T_i(t-1)\geq{}l\}\]
for any integer \(l\geq1\). Bounding each indicator by a union over the possible sample counts \(s\) of the best arm and \(s_i\geq{}l\) of arm \(i\), we get
\[T_i(n)\leq{}l+\sum_{t=1}^{\infty}\sum_{s=1}^{t-1}\sum_{s_i=l}^{t-1}I\{\bar{x}^{\star}_s+C_{t,s}\leq\bar{x}_{i,s_i}+C_{t,s_i}\}\]
The event
\[\bar{x}^\star_s+C_{t,s}\leq\bar{x}_{i,s_i}+C_{t,s_i}\]
implies that at least one of the following holds:
\[\bar{x}^\star_s\leq\mu-C_{t,s},\qquad\bar{x}_{i,s_i}\geq\mu_i+C_{t,s_i},\qquad\mu<\mu_i+2C_{t,s_i}\]
The probabilities of the first two events can be bounded by the Chernoff–Hoeffding bound, while the third is impossible (probability \(0\)) once \(s_i\geq{}l\) for the \(l\) chosen below, since then \(2C_{t,s_i}\leq\Delta_i\):
\[{\rm Pr}\{\bar{x}^\star_s\leq\mu-C_{t,s}\}\leq{}e^{-4\ln{}t},\qquad{\rm Pr}\{\bar{x}_{i,s_i}\geq\mu_i+C_{t,s_i}\}\leq{}e^{-4\ln{}t}\]
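Both bounds follow from the Chernoff–Hoeffding inequality for \([0,1]\)-valued i.i.d. samples with deviation \(a=C_{t,s}\) (resp. \(C_{t,s_i}\)):
\[{\rm Pr}\{\bar{x}^\star_s\leq\mu-C_{t,s}\}\leq{}e^{-2sC_{t,s}^2}=e^{-2s\cdot\frac{2\ln{}t}{s}}=e^{-4\ln{}t}=t^{-4}\]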
Finally, choose
\[l=\lceil\frac{8\ln{}n}{\Delta_i^2}\rceil\]
\[\mathbb{E}\{T_i(n)\}\leq{}l+\sum_{t=1}^{\infty}\sum_{s=1}^{t-1}\sum_{s_i=l}^{t-1}2t^{-4}\leq\frac{8\ln{}n}{\Delta_i^2}+1+\frac{\pi^2}{3}\]
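The inner double sum has at most \(t^2\) terms, so
\[\sum_{t=1}^{\infty}\sum_{s=1}^{t-1}\sum_{s_i=l}^{t-1}2t^{-4}\leq\sum_{t=1}^{\infty}2t^{-2}=\frac{\pi^2}{3},\]
and the extra \(+1\) absorbs the ceiling in \(l\). Since \({\rm Reg}=\sum_{i:\mu_i<\mu}\Delta_i\mathbb{E}\{T_i(n)\}\), multiplying by \(\Delta_i\) and summing over the suboptimal arms gives the stated regret bound.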
\(\varepsilon\)-greedy ALG
- Parameter: \(c>0,0<d<1\)
- Initialize: Define \[\varepsilon_n=\min\{1,\frac{ck}{d^2n}\}\]
- Loop: \(\forall{}n=1,2,\cdots\), let \(i_n=\arg\max_j\bar{x}_{j,n}\). Play \(i_n\) with probability \(1-\varepsilon_n\) and an arm chosen uniformly at random with probability \(\varepsilon_n\) (a Python sketch follows)
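A minimal Python sketch of the \(\varepsilon_n\)-greedy loop, under the same assumed `pull(i)` arm interface:

```python
import random

def eps_greedy(pull, k, N, c, d):
    """epsilon_n-greedy sketch: pull(i) returns a reward in [0, 1] for arm i."""
    counts = [0] * k
    sums = [0.0] * k

    def mean(j):
        # unplayed arms are treated as best so each gets tried at least once
        return sums[j] / counts[j] if counts[j] else float("inf")

    for n in range(1, N + 1):
        eps = min(1.0, c * k / (d * d * n))   # epsilon_n = min{1, ck / (d^2 n)}
        if random.random() < eps:
            i = random.randrange(k)           # explore: uniformly random arm
        else:
            i = max(range(k), key=mean)       # exploit: empirically best arm
        sums[i] += pull(i)
        counts[i] += 1
    return sums, counts
```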
Theorem
For all \(k>1\), if \(\varepsilon\)-greedy runs on \(k\) arms with distributions \(P_1,P_2,\cdots,P_k\) supported on \([0,1]\) and \(0<d<\min_{i:\mu_i<\mu}\Delta_i\), then for \(c>5\) the probability that after \(n\geq{}ck/d\) plays \(\varepsilon\)-greedy chooses a suboptimal arm \(j\) is at most \[\mathcal{O}\left(\frac{c}{d^2n}\right)+o\left(\frac{1}{n}\right)\] as \(n\rightarrow\infty\).
Lower Bound
Consider two \(\Theta\) values,
i.e., \(\Theta=(\theta_1,\theta_2,\cdots,\theta_k)\)
with \(\mu_1>\mu_2\geq\dots\geq\mu_k\); \(\Theta^{\prime}=(\theta_1,\theta_2^\prime,\cdots,\theta_k)\)
with \(\mu_2^{\prime}>\mu_1\geq\dots\geq\mu_k\).
Here we want to show that it takes about \(\ln{}n\) plays to tell the difference between these two instances.
For a fixed strategy, define
- \(x_{j,s}\): reward of arm \(j\) at its \(s\)-th play
- \(\mathbb{P}\): joint distribution of \(\{I_t,x_{j,s}\}\) under \(\Theta\)
- \(\mathbb{P}^{\prime}\): joint distribution of \(\{I_t,x_{j,s}\}\) under \(\Theta^\prime\)
Consider any event \(A\subset\{T_2(n)=n_2\}\). Changing the measure on the rewards of arm \(2\),
\[\mathbb{P}^\prime(A)=\int_A\prod_{s=1}^{n_2}\frac{d\mathbb{P}_{\theta_2^\prime}}{d\mathbb{P}_{\theta_2}}(x_{2,s})\,d\mathbb{P}=\int_A\exp\Big\{\sum_{s=1}^{n_2}\ln\frac{d\mathbb{P}_{\theta_2^\prime}}{d\mathbb{P}_{\theta_2}}(x_{2,s})\Big\}d\mathbb{P}=\int_Ae^{-L_{n_2}}d\mathbb{P},\]
where \(L_{n_2}=\sum_{s=1}^{n_2}\ln\frac{d\mathbb{P}_{\theta_2}}{d\mathbb{P}_{\theta_2^\prime}}(x_{2,s})\) is the log-likelihood ratio. Hence, on any such \(A\) contained in \(\{L_{n_2}\leq{}C_n\}\),
\[\mathbb{P}^\prime(A)\geq{}e^{-C_n}\mathbb{P}(A)\]
Thompson Sampling ALG
Here we only consider Bernoulli bandits with two arms (\(N=2\)), because the general analysis is considerably more involved.
- \(X_1,X_2\in\{0,1\}\), \(\mu_1\geq\mu_2\)
- Initialize: for each \(i\), \(S_i=F_i=0\)
- Loop: \(t=1,2,\cdots,T\)
- For each \(i\), sample \(\theta_i(t)\sim{\rm Beta}(S_i+1,F_i+1)\)
- Play \(i(t)=\arg\max\theta_i(t)\), observe reward \(r_t\)
- If \(r_t=1\), \(S_{i(t)}+=1\), else \(F_{i(t)}+=1\)
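A minimal Python sketch of this Beta–Bernoulli loop (the arm interface `pull(i)`, returning a 0/1 reward, is an assumption for illustration):

```python
import random

def thompson_bernoulli(pull, k, T):
    """Thompson Sampling sketch for Bernoulli arms: pull(i) returns 0 or 1."""
    S = [0] * k   # successes per arm
    F = [0] * k   # failures per arm
    for t in range(1, T + 1):
        # sample theta_i(t) ~ Beta(S_i + 1, F_i + 1) for each arm
        theta = [random.betavariate(S[i] + 1, F[i] + 1) for i in range(k)]
        i = max(range(k), key=lambda j: theta[j])
        r = pull(i)
        if r == 1:
            S[i] += 1
        else:
            F[i] += 1
    return S, F
```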
Theorem
For this two-armed Bernoulli bandit, TS achieves
\[\mathbb{E}\{R(T)\}=\mathcal{O}(\frac{\ln{}T}{\Delta}+\frac{1}{\Delta^3})\] where \(\Delta=\mu_\max-\mu_\min\)
Proof.
- Denote \(j_0=\) the number of plays of arm \(1\) until arm \(2\) has been played \(L=24\ln{}T/\Delta^2\) times.
- Let \(t_j=\) the time of the \(j^{\rm th}\) play of arm \(1\) (\(t_0=0\)). Also, let \(y_j=t_{j+1}-t_j-1\), i.e., the number of plays of arm \(2\) between the \(j^{\rm th}\) and \((j+1)^{\rm th}\) plays of arm \(1\).
- Denote \(S(j)=\) the number of successes in the first \(j\) plays of arm \(1\)
- The expected number of plays of arm \(2\): \[\mathbb{E}\{K_2(T)\}\leq{}L+\mathbb{E}\{\sum_{j=j_0}^{T-1}y_j\}\]
- Denote \(X(j,s,y)=\) the number of attempts before a sample from \({\rm Beta}(s+1,j-s+1)\) exceeds \(y\)
It can be seen that \(X(j,s,y)\) is geometric with success probability \(1-F_{s+1,j-s+1}^{\rm beta}(y)\)
Lemma 1
For all non-negative integers \(j\) and \(s\leq{}j\) and \(y\in[0,1]\), \[\mathbb{E}\{X(j,s,y)\}=\frac{1}{F_{j+1,y}^{\rm B}(s)}-1\] where \(F_{j+1,y}^{\rm B}\) is the CDF of \({\rm Binomial}(j+1,y)\); this follows from the Beta–Binomial identity \(F_{s+1,j-s+1}^{\rm beta}(y)=1-F_{j+1,y}^{\rm B}(s)\) applied to the geometric variable above.
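A quick Monte-Carlo sanity check of this reading of Lemma 1 in Python (`scipy` supplies the Binomial CDF; the values of \(j,s,y\) are illustrative only):

```python
import random
from scipy.stats import binom

def mc_expected_attempts(j, s, y, trials=100_000):
    """Estimate E[X(j,s,y)]: attempts before a Beta(s+1, j-s+1) sample exceeds y."""
    total = 0
    for _ in range(trials):
        attempts = 0
        while random.betavariate(s + 1, j - s + 1) <= y:
            attempts += 1
        total += attempts
    return total / trials

j, s, y = 10, 7, 0.6
print(mc_expected_attempts(j, s, y))        # Monte-Carlo estimate of E[X(j,s,y)]
print(1.0 / binom.cdf(s, j + 1, y) - 1.0)   # Lemma 1: 1 / F^B_{j+1,y}(s) - 1
```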
- Consider the number of steps before the event \(\{\theta_1(t)>\mu_2+\Delta/2\}\) first happens
- Given \(S(j)\), this has the same distribution as \(X(j,S(j),\mu_2+\Delta/2)\)
- \(y_j\) can be larger than this number only if at some \(t\in(t_j,t_{j+1})\), \(\theta_2(t)>\mu_2+\Delta/2\)
- Thus: \[\mathbb{E}\{y_j\}\leq\mathbb{E}\{\min(X(j,S(j),\mu_2+\Delta/2),T)\}+\mathbb{E}\{\sum_{t=t_j+1}^{t_{j+1}-1}T\cdot{}I\{\theta_2(t)>\mu_2+\Delta/2\}I\{j>j_0\}\}\]
Summing this up from \(j=j_0\) to \(T-1\), define the event:
\[E_2(t)=\{\theta_2(t)\leq\mu_2+\Delta/2\lor{}K_2(t)\leq{}L\}\]
Lemma 2
\[\mathbb{P}(E_2(t))\geq1-\frac{2}{T^2}\]
Lemma 3
Consider any \(y<\mu_1\); let \(\Delta^1=\mu_1-y\), \(R=\frac{\mu_1(1-y)}{y(1-\mu_1)}>1\) and \(D=D(y\|\mu_1)=y\ln\frac{y}{\mu_1}+(1-y)\ln\frac{1-y}{1-\mu_1}\). Then:
\[\mathbb{E}\{\mathbb{E}\{\min(X(j,S(j),y),T)|S(j)\}\}\leq\frac{16}{T}\quad{\rm for }\quad{}j\geq\frac{4\ln{}T}{(\Delta^1)^2}\]
Substituting Lemma 2 and Lemma 3 into the summed-up bound (the intermediate calculation is omitted), together with the inequality \[\mathbb{E}\{K_2(T)\}\leq{}L+\mathbb{E}\{\sum_{j=j_0}^{T-1}y_j\},\] we obtain after a lengthy calculation \[\mathbb{E}\{K_2(T)\}\leq{}\frac{40\ln{}T}{\Delta^2}+\frac{48}{\Delta^4}+18\] and hence, since \(\mathbb{E}\{R(T)\}=\Delta\cdot\mathbb{E}\{K_2(T)\}\) for two arms, \[\mathbb{E}\{R(T)\}=\mathcal{O}(\frac{\ln{}T}{\Delta}+\frac{1}{\Delta^3})\]
Non-Stochastic Bandits: ALG EXP3
- \(K\) actions
- \(X(i)=(X_i(1),X_i(2),\cdots),X_i(t)\in[0,1]\)
- ALG \(=i_1,i_2,\cdots,\)
- \(G_A(T)=\sum_{t=1}^TX_{i_t}(t)\)
- Regret \(=\max_j\sum_{t=1}^TX_j(t)-G_A(T)\)
ALG EXP3
- Parameter: Real \(\gamma\in[0,1]\)
- Initialize: \(w_i(1)=1,\forall{}i=1,2,\cdots,K\)
- Loop: for each \(t=1,2,\cdots\)
- Set \[P_i(t)=(1-\gamma)\frac{w_i(t)}{\sum_{j=1}^Kw_j(t)}+\frac{\gamma}{K}\]
- Draw \(i_t\) randomly according to \(P_1(t),\cdots,P_K(t)\)
- Get reward \(X_{i_t}(t)\in[0,1]\)
- For \(j=1,2,\cdots,K\)
\[\hat{X_j}(t)=\left\{\begin{aligned}&\frac{X_j(t)}{P_j(t)}&j=i_t\\&0&\text{otherwise}\end{aligned}\right.\quad\quad{}w_j(t+1)=w_j(t)\exp(\frac{\gamma\hat{X_j}(t)}{K})\]
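A minimal Python sketch of EXP3 (the reward oracle `reward(i, t)`, standing in for the adversary's \(X_i(t)\), is an assumption for illustration):

```python
import math
import random

def exp3(reward, K, T, gamma):
    """EXP3 sketch: reward(i, t) returns X_i(t) in [0, 1]; only the pulled arm is observed."""
    w = [1.0] * K                                      # w_i(1) = 1
    total = 0.0
    for t in range(1, T + 1):
        W = sum(w)
        P = [(1 - gamma) * w[i] / W + gamma / K for i in range(K)]
        i_t = random.choices(range(K), weights=P)[0]   # draw i_t ~ P(t)
        x = reward(i_t, t)
        total += x
        x_hat = x / P[i_t]                             # importance-weighted estimate
        w[i_t] *= math.exp(gamma * x_hat / K)          # only the pulled arm's weight changes
    return total                                       # G_A(T)
```

Only the pulled arm's weight is updated because \(\hat{X}_j(t)=0\) (so \(\exp(0)=1\)) for all other arms.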
Theorem
For any \(K\) and any \(\gamma\in[0,1]\),
\[G_{\max}-\mathbb{E}\{G_{\exp 3}\}\leq(e-1)\gamma{}G_{\max}+\frac{K\ln{}K}{\gamma}\]
for any assignment of rewards
Corollary
Assume \(g\geq{}G_{\max}\) and \(\gamma=\min\{1,\sqrt{\frac{K\ln{}K}{(e-1)g}}\}\), then \[G_{\max}-\mathbb{E}\{G_{\exp 3}\}\leq2.63\sqrt{gK\ln{}K}\]
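The constant comes from balancing the two terms of the theorem: for \(\gamma=\sqrt{\frac{K\ln{}K}{(e-1)g}}\leq1\),
\[(e-1)\gamma{}G_{\max}+\frac{K\ln{}K}{\gamma}\leq(e-1)\gamma{}g+\frac{K\ln{}K}{\gamma}=2\sqrt{(e-1)gK\ln{}K}\leq2.63\sqrt{gK\ln{}K},\]
since \(2\sqrt{e-1}\approx2.62\).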