2013 Active and Passive Learning of Linear Separators under Log-concave and Nearly Log-concave Distributions

From GM-RKB

Subject Headings: Passive Learning System; Passive Learning Task.

Notes

Cited By

Quotes

Author Keywords

Abstract

We provide new results concerning label efficient, polynomial time, passive and active learning of linear separators. We prove that active learning provides an exponential improvement over PAC (passive) learning of homogeneous linear separators under nearly log-concave distributions. Building on this, we provide a computationally efficient PAC algorithm with optimal (up to a constant factor) sample complexity for such problems. This resolves an open question concerning the sample complexity of efficient PAC algorithms under the uniform distribution in the unit ball. Moreover, it provides the first bound for a polynomial-time PAC algorithm that is tight for an interesting infinite class of hypothesis functions under a general and natural class of data-distributions, providing significant progress towards a longstanding open question. We also provide new bounds for active and passive learning in the case that the data might not be linearly separable, both in the agnostic case and under the Tsybakov low-noise condition. To derive our results, we provide new structural results for (nearly) log-concave distributions, which might be of independent interest as well.

1. Introduction

Learning linear separators is one of the central challenges in machine learning. They are widely used and have long been studied in both statistical and computational learning theory. A seminal result of (Blumer et al., 1989), using tools due to (Vapnik and Chervonenkis, 1971), showed that d-dimensional linear separators can be learned to accuracy 1 − ε with probability 1 − δ in the classic PAC model in polynomial time with O((d/ε) log(1/ε) + (1/ε) log(1/δ)) examples. The best known lower bound for linear separators is Ω(d/ε + (1/ε) log(1/δ)), and this holds even in the case in which the distribution is uniform (Long, 1995). Whether the upper bound can be improved to match the lower bound via a polynomial-time algorithm has been a long-standing open question, both for general distributions (Ehrenfeucht et al., 1989; Blumer et al., 1989) and for the case of the uniform distribution in the unit ball (Long, 1995, 2003; Bshouty et al., 2009). In this work we resolve this question in the case where the underlying distribution belongs to the class of log-concave and nearly log-concave distributions, a wide class of distributions that includes the Gaussian distribution and the uniform distribution over any convex set, and which has played an important role in several areas including sampling, optimization, integration, and learning (Lovasz and Vempala, 2007).

We also consider active learning, a major area of research in modern machine learning, in which the algorithm only receives the classifications of examples when it requests them (Dasgupta, 2011).

Our main result here is a polynomial-time active learning algorithm with label complexity that is exponentially better than the label complexity of any passive learning algorithm in these settings. This answers an open question in (Balcan et al., 2007), and it also significantly expands the set of cases for which we can show that active learning provides a clear exponential improvement in 1/ε (without increasing the dependence on d) over passive learning. Remarkably, our analysis for passive learning is done via a connection to our analysis for active learning; to our knowledge, this is the first paper using this technique.

We also study active and passive learning in the case that the data might not be linearly separable. We specifically provide new improved bounds for the widely studied Tsybakov low-noise condition (Mammen and Tsybakov, 1999; Bartlett et al., 2005; Massart and Nedelec, 2006), as well as new bounds on the disagreement coefficient, with implications for the agnostic case (i.e., arbitrary forms of noise).

Passive Learning In the classic passive supervised machine learning setting, the learning algorithm is given a set of labeled examples drawn i.i.d. from some fixed but unknown distribution over the instance space and labeled according to some fixed but unknown target function, and the goal is to output a classifier that does well on new examples coming from the same distribution. This setting has been long studied in both computational learning theory (within the PAC model (Valiant, 1984; Kearns and Vazirani, 1994)) and statistical learning theory (Vapnik, 1982, 1998; Boucheron et al., 2005), and has played a crucial role in the developments and successes of machine learning.

However, despite remarkable progress, the basic question of providing polynomial-time algorithms with tight bounds on the sample complexity has remained open. Several milestone results along these lines that are especially related to our work include the following. The analysis of (Blumer et al., 1989), proved using tools from (Vapnik and Chervonenkis, 1971), implies that linear separators can be learned in polynomial time with O((d/ε) log(1/ε) + (1/ε) log(1/δ)) labeled examples. (Ehrenfeucht et al., 1989) proved a bound that implies an Ω(d/ε + (1/ε) log(1/δ)) lower bound for linear separators and explicitly posed the question of providing tight bounds for this class. (Haussler et al., 1994) established an upper bound of O((d/ε) log(1/δ)), which can be achieved in polynomial time for linear separators.

(Blumer et al., 1989) achieved polynomial-time learning by finding a consistent hypothesis (i.e., a hypothesis which correctly classifies all training examples); this is a special case of ERM (Vapnik, 1982). An intensive line of research in the empirical process and statistical learning theory literature has taken account of “local complexity" to prove stronger bounds for ERM (van der Vaart and Wellner, 1996; van de Geer, 2000; Bartlett et al., 2005; Long, 2003; Mendelson, 2003; Giné and Koltchinskii, 2006; Hanneke, 2007; Hanneke and Yang, 2012). In the context of learning, local complexity takes account of the fact that really bad classifiers can be easily discarded, and the set of “local" classifiers that are harder to disqualify is sometimes not as rich. A recent landmark result of (Giné and Koltchinskii, 2006) (see also (Raginsky and Rakhlin, 2011; Hanneke and Yang, 2012)) is the bound for consistent algorithms of

O((d/ε) log(cap(ε)) + (1/ε) log(1/δ))   (1)

where cap(ε) is the Alexander capacity, which depends on the distribution (Alexander, 1987) (see Section 8 and Appendix A for further discussion). However, this bound can be suboptimal for linear separators.

In particular, for linear separators in the case in which the underlying distribution is uniform in the unit ball, the sample complexity is known (Long, 1995, 2003) to be Θ((d + log(1/δ))/ε), when computational considerations are ignored. (Bshouty et al., 2009), using the doubling dimension (Assouad, 1983), another measure of local complexity, proved a bound of

O((d/ε) √(log(1/ε)) + (1/ε) log(1/δ))   (2)

for a polynomial-time algorithm. As a lower bound of Ω(√d) on cap(ε) for ε = O(1/√d) for the case of linear separators and the uniform distribution is implicit in (Hanneke, 2007), the bound of (Giné and Koltchinskii, 2006) given by (1) cannot yield a bound better than

O((d/ε) min{log d, log(1/ε)} + (1/ε) log(1/δ))   (3)

in this case. In this paper we provide a tight bound (up to constant factors) on the sample complexity of polynomial-time learning of linear separators with respect to log-concave distributions. Specifically, we prove an upper bound of O((d + log(1/δ))/ε) using a polynomial-time algorithm that holds for any zero-mean log-concave distribution. We also prove an information-theoretic lower bound that matches our (computationally efficient) upper bound for each log-concave distribution. This provides the first bound for a polynomial-time algorithm that is tight for an interesting non-finite class of hypothesis functions under a general class of data distributions, and also characterizes (up to a constant factor) the distribution-specific sample complexity for each distribution in the class. In the special case of the uniform distribution, our upper bound closes the existing Θ(min{√(log(1/ε)), log d}) gap between the upper bounds (2) and (3) and the lower bound of (Long, 1995).

Active Learning We also study learning of linear separators in the active learning model; here the learning algorithm can access unlabeled (i.e., unclassified) examples and ask for labels of unlabeled examples of its own choice, and the hope is that a good classifier can be learned with significantly fewer labels by actively directing the queries to informative examples. This has been a major area of machine learning research in the past fifteen years, mainly due to the availability of large amounts of unannotated or raw data in many modern applications (Dasgupta, 2011), with many exciting developments on understanding its underlying principles as well (Freund et al., 1997; Dasgupta, 2005; Balcan et al., 2006, 2007; Hanneke, 2007; Dasgupta et al., 2007; Castro and Nowak, 2007; Balcan et al., 2008; Koltchinskii, 2010; Beygelzimer et al., 2010). However, with a few exceptions (Balcan et al., 2007; Castro and Nowak, 2007; Dasgupta et al., 2005), most of the theoretical developments have focused on the so-called disagreement-based active learning paradigm (Hanneke, 2011; Koltchinskii, 2010); methods and analyses developed in this context are often suboptimal, as they take a conservative approach and consider strategies that query even points on which there is only a small amount of uncertainty (or disagreement) among the classifiers still under consideration given the labels queried so far. The results derived in this manner often show an improvement in the 1/ε factor in the label complexity of active versus passive learning; however, unfortunately, the dependence on the d term typically gets worse.

By analyzing a more aggressive, margin-based active learning algorithm, we prove that we can efficiently (in polynomial time) learn homogeneous linear separators when the underlying distribution is log-concave by using only O((d + log(1/δ) + log log(1/ε)) log(1/ε)) label requests, answering an open question in (Balcan et al., 2007). This represents an exponential improvement of active learning over passive learning, and it significantly broadens the cases for which we can show that the dependence on 1/ε in passive learning can be improved to only O(log(1/ε)) in active learning without increasing the dependence on the dimension d. We note that an improvement of this type was known to be possible only for the case when the underlying distribution is (nearly) uniform in the unit ball (Balcan et al., 2007; Dasgupta et al., 2005; Freund et al., 1997); even for this special case, our analysis improves the results of (Balcan et al., 2007) by a multiplicative log log(1/ε) factor; it also provides better dependence on d than any other previous analyses implementable in a computationally efficient manner (both disagreement-based (Hanneke, 2011, 2007) and more aggressive ones (Dasgupta et al., 2005; Freund et al., 1997)), and over the inefficient splitting index analysis of (Dasgupta, 2005).

Techniques At the core of our results is a novel characterization of the region of disagreement of two linear separators under a log-concave measure. We show that for any two linear separators specified by normal vectors u and v, for any constant c ∈ (0, 1) we can pick a margin as small as γ = Θ(α), where α is the angle between u and v, and still ensure that the probability mass of the region of disagreement outside the band of margin γ of one of them is at most cα (Theorem 4). Using this fact, we then show how we can use a margin-based active learning technique, where in each round we only query points near the hypothesized decision boundary, to get an exponential improvement over passive learning.
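The sketch below illustrates this band structure numerically under a standard Gaussian, one example of an isotropic log-concave distribution: it estimates how much of the disagreement region between two separators at a small angle lies outside a band whose width is a constant multiple of that angle. The dimension, angle, and band constant are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, alpha, c2 = 20, 200_000, 0.05, 2.0

def unit(v):
    return v / np.linalg.norm(v)

u = unit(rng.normal(size=d))
w = rng.normal(size=d)
w = unit(w - (w @ u) * u)                    # direction orthogonal to u
v = np.cos(alpha) * u + np.sin(alpha) * w    # unit vector at angle alpha from u

X = rng.normal(size=(n, d))                  # isotropic Gaussian samples
disagree = np.sign(X @ u) != np.sign(X @ v)
outside_band = np.abs(X @ v) >= c2 * alpha   # outside the margin band of v

print("P[disagree]               ~", disagree.mean(), "(= alpha/pi =", alpha / np.pi, ")")
print("P[disagree, outside band] ~", (disagree & outside_band).mean(),
      "(should be a small fraction of alpha =", alpha, ")")
```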

We then show that any passive learning algorithm that outputs a hypothesis consistent with O(d/ε + (1/ε) log(1/δ)) random examples will, with probability at least 1 − δ, output a hypothesis of error at most ε (Theorem 6). Interestingly, our analysis is quite dissimilar to the classic analyses of ERM. It proceeds by conceptually running the algorithm online on progressively larger chunks of examples, and using the intermediate hypotheses to track the progress of the algorithm. We show, using the same tools as in the active learning analysis, that it is always likely that the algorithm will receive informative examples. Our analysis shows that the algorithm would also achieve 1 − ε accuracy with high probability even if it periodically built preliminary hypotheses using some of the examples, and then only used borderline cases for those preliminary classifiers for further training [1]. To achieve the optimal sample complexity, we have to carefully distribute the confidence parameter, by allowing a higher probability of failure in the later stages, to compensate for the fact that, once the hypothesis is already pretty good, it takes longer to get examples that help to further improve it.

Non-separable case We also study label-efficient learning in the presence of noise. We show how our results for the realizable case can be extended to handle (a variant of) the Tsybakov noise condition, which has received substantial attention in statistical learning theory, both for passive and active learning (Mammen and Tsybakov, 1999; Bartlett et al., 2005; Massart and Nedelec, 2006; Giné and Koltchinskii, 2006; Balcan et al., 2007; Koltchinskii, 2010; Hanneke, 2011); this includes the random classification noise commonly studied in computational learning theory (Kearns and Vazirani, 1994), and the more general bounded (or Massart) noise (Bartlett et al., 2005; Massart and Nedelec, 2006; Giné and Koltchinskii, 2006; Koltchinskii, 2010). Our analysis for Massart noise leads to optimal bounds (up to constant factors) for active and passive learning of linear separators when the marginal distribution on the feature vectors is log-concave, improving the dependence on d over previous best known results. Our analysis for Tsybakov noise leads to bounds on active learning with improved dependence on d over previously known results in this case as well.

We also provide a bound on the Alexander capacity (Alexander, 1987; Giné and Koltchinskii, 2006) and the closely related disagreement coefficient notion (Hanneke, 2007), which have been widely used to characterize the sample complexity of various (active and passive) algorithms (Hanneke, 2007; Koltchinskii, 2010; Giné and Koltchinskii, 2006; Beygelzimer et al., 2010). This immediately implies concrete bounds on the labeled data complexity of several algorithms in the literature, including active learning algorithms designed for the purely agnostic case (i.e., arbitrary forms of noise), e.g., the A2 algorithm (Balcan et al., 2006) and the DHM algorithm (Dasgupta et al., 2007).

Nearly log-concave distributions We also extend our results both for passive and active learning to deal with nearly log-concave distributions; this is a broader class of distributions introduced by (Applegate and Kannan, 1991), which contains mixtures of (not too separated) log-concave distributions. In deriving our results, we provide new tail bounds and structural results for these distributions, which might be of independent interest and utility, both in learning theory and in other areas including sampling and optimization.

We note that our bounds on the disagreement coefficient improve by a factor of Θ(d) over the bounds of (Friedman, 2009) (matching what was known for the much less general case of nearly uniform distributions over the unit sphere); furthermore, they apply to the nearly log-concave case, where we allow an arbitrary number of discontinuities, a case not captured by the (Friedman, 2009) conditions at all. We discuss other related papers in Appendix A.

2. Preliminaries and Notation

We focus on binary classification problems; that is, we consider the problem of predicting a binary label y based on its corresponding input vector x. As in the standard machine learning formulation, we assume that the data points (x, y) are drawn from an unknown underlying distribution D_XY over X × Y; X is called the instance space and Y is the label space. In this paper we assume that Y = {±1} and X = R^d; we also denote the marginal distribution over X by D. Let C be the class of linear separators through the origin, that is, C = {sign(w · x) : w ∈ R^d, ||w|| = 1}. To keep the notation simple, we sometimes refer to a weight vector and the linear classifier with that weight vector interchangeably. Our goal is to output a hypothesis w ∈ C of small error, where err(w) = err_{D_XY}(w) = P_{(x,y)∼D_XY}[sign(w · x) ≠ y].

We consider two learning protocols: passive learning and active learning. In the passive learning setting, the learning algorithm is given a set of labeled examples (x_1, y_1), ..., (x_m, y_m) drawn i.i.d. from D_XY, and the goal is to output a hypothesis of small error by using only a polynomial number of labeled examples. In the (pool-based) active learning setting, a set of labeled examples (x_1, y_1), ..., (x_m, y_m) is also drawn i.i.d. from D_XY; the learning algorithm is permitted direct access to the sequence of x_i values (unlabeled data points), but has to make a label request to obtain the label y_i of example x_i. The hope is that in the active learning setting we can output a classifier of small error by using many fewer label requests than in the passive learning setting by actively directing the queries to informative examples (while keeping the number of unlabeled examples polynomial). For added generality, we also consider the selective sampling active learning model, where the algorithm visits the unlabeled data points x_i in sequence and, for each i, makes a decision on whether or not to request the label y_i based only on the previously observed x_j values (j ≤ i) and corresponding requested labels, and never changes this decision once made. Both our upper and lower bounds apply to both selective sampling and pool-based active learning.

In the “realizable case", we assume that the labels are deterministic and generated by a target function that belongs to C. In the non-realizable case (studied in Sections 8 and 9) we do not make this assumption and instead aim to compete with the best function in C.

Given two vectors u and v and any distribution D̃, we denote by d_D̃(u, v) = P_{x∼D̃}(sign(u · x) ≠ sign(v · x)); we also denote by θ(u, v) the angle between the vectors u and v.

3. Log-Concave Densities

Throughout this paper we focus on the case where the underlying distribution D is log-concave or nearly log-concave. Such distributions have played a key role in the past two decades in several areas including sampling, optimization, and integration algorithms (Lovasz and Vempala, 2007), and more recently in learning theory as well (Kalai et al., 2005; Klivans et al., 2009b; Vempala, 2010). In this section we first summarize known results about such distributions that are useful for our analysis and then prove a novel structural statement that will be key to our analysis (Theorem 4). In Section 6 we describe extensions to nearly log-concave distributions as well.

Definition 1 A distribution over R^d is log-concave if log f(·) is concave, where f is its associated density function. It is isotropic if its mean is the origin and its covariance matrix is the identity.

Log-concave distributions form a broad class of distributions: for example, the Gaussian, logistic, and uniform distribution over any convex set are log-concave distributions. The following lemma summarizes known useful facts about isotropic log-concave distributions (most are from (Lovasz and Vempala, 2007); the upper bound on the density is from (Klivans et al., 2009b)).

Lemma 2 Assume that D is log-concave in R^d and let f be its density function.

(a) If D is isotropic, then P_{x∼D}[||x|| ≥ α√d] ≤ e^{−α+1}. If d = 1, then P_{x∼D}[x ∈ [a, b]] ≤ |b − a|.

(b) If D is isotropic, then f(x) ≥ 2^{−7d} 2^{−9d||x||} whenever 0 ≤ ||x|| ≤ 1/9. Furthermore, 2^{−7d} ≤ f(0) ≤ d(20d)^{d/2}, and f(x) ≤ A(d) exp(−B(d)||x||) for all x (of any norm), where A(d) is 2^{8d} d^{d/2} e and B(d) is a corresponding positive quantity depending only on d.

(c) All marginals of D are log-concave. If D is isotropic, its marginals are isotropic as well.

(d) If E[||X||^2] = C^2, then P[||X|| ≥ RC] ≤ e^{−R+1}.

(e) If D is isotropic and d = 1, we have f(0) ≥ 1/8 and f(x) ≤ 1 for all x.
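For a quick numerical sanity check of some of these facts in one dimension, the sketch below evaluates parts (a) and (e) on two isotropic log-concave examples (a standard Gaussian and a variance-one logistic); the specific distributions and sample sizes are illustrative choices, not part of the lemma.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two 1-d isotropic (mean 0, variance 1) log-concave examples.
gauss = rng.normal(size=1_000_000)
logistic = rng.logistic(scale=np.sqrt(3) / np.pi, size=1_000_000)  # variance 1

def f0(samples, h=0.01):
    # crude density estimate at 0
    return np.mean(np.abs(samples) < h) / (2 * h)

for name, x in [("gaussian", gauss), ("logistic", logistic)]:
    # Lemma 2(e): f(0) >= 1/8 and f(x) <= 1 for isotropic log-concave, d = 1.
    print(name, "f(0) ~", round(f0(x), 3), ">= 1/8")
    # Lemma 2(a), d = 1: P[X in [a, b]] <= |b - a|.
    a, b = 0.2, 0.5
    print(name, "P[X in [0.2, 0.5]] ~", round(np.mean((x >= a) & (x <= b)), 3),
          "<= |b - a| =", b - a)
    # Lemma 2(a): P[||X|| >= alpha*sqrt(d)] <= exp(-alpha + 1), here d = 1, alpha = 3.
    print(name, "P[|X| >= 3] ~", round(np.mean(np.abs(x) >= 3.0), 4),
          "<= exp(-2) ~", round(np.exp(-2.0), 4))
```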

Throughout our paper we will use the fact that there exists a universal constant c such that the probability of disagreement of any two homogeneous linear separators is lower bounded by c times the angle between their normal vectors. This follows by projecting the region of disagreement onto the space given by the two normal vectors, and then using properties of log-concave distributions in two dimensions. The proof is implicit in earlier works (e.g., (Vempala, 2010)); for completeness, we include a proof in Appendix B.

Lemma 3 Assume D is an isotropic log-concave distribution in R^d. Then there exists c such that for any two unit vectors u and v in R^d we have cθ(v, u) ≤ d_D(u, v).
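For intuition, note that under the isotropic Gaussian, one member of the class, d_D(u, v) is exactly θ(u, v)/π, so Lemma 3 holds there with c = 1/π. The short Monte Carlo check below is an illustrative sketch of this relation; the Gaussian choice and sample size are assumptions for the example, not part of the lemma.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 10, 500_000

def unit(v):
    return v / np.linalg.norm(v)

u = unit(rng.normal(size=d))
v = unit(rng.normal(size=d))
theta = np.arccos(np.clip(u @ v, -1.0, 1.0))

X = rng.normal(size=(n, d))                       # isotropic Gaussian samples
d_D = np.mean(np.sign(X @ u) != np.sign(X @ v))   # empirical disagreement mass

print("theta(u, v)/pi      =", theta / np.pi)
print("estimated d_D(u, v) =", d_D)               # should match theta/pi closely
```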

To analyze our active and passive learning algorithms we provide a novel characterization of the region of disagreement of two linear separators under a log-concave measure:

Theorem 4 For any c1 > 0, there is a c2 > 0 such that the following holds. Let u and v be two unit vectors in R^d, and assume that θ(u, v) = α < π/2. If D is isotropic log-concave in R^d, then:

P_{x∼D}[sign(u · x) ≠ sign(v · x) and |v · x| ≥ c2 α] ≤ c1 α.   (4)

Proof Choose c1, c2 > 0. We will show that, if c2 is large enough relative to 1/c1, then (4) holds. Let b = c2 α. Let E be the set whose probability we want to bound. Since the event under consideration only concerns the projection of x onto the span of u and v, Lemma 2(c) implies we can assume without loss of generality that d = 2.

Next, we claim that each member x of E has ||x|| ≥ b/α = c2. Assume without loss of generality that v · x is positive. (The other case is symmetric.) Then u · x < 0, so the angle of x with u is obtuse, i.e. θ(x, u) ≥ π/2. Since θ(u, v) = α, this implies that θ(x, v) ≥ π/2 − α. But x · v ≥ b, and v is unit length, so ||x|| cos θ(x, v) ≥ b, which, since θ(x, v) ≥ π/2 − α, implies ||x|| cos(π/2 − α) ≥ b. This, since cos(π/2 − α) ≤ α for all α ∈ [0, π/2], in turn implies ||x|| ≥ b/α = c2. This implies that, if B(r) is a ball of radius r in R^2, then

P[E] = Σ_{i≥1} P[E ∩ (B((i+1)c2) − B(i c2))].   (5)

To obtain the desired bound, we carefully bound each term in the RHS. Choose i ≥ 1. Let f(x1, x2) be the density of D. We have

P[E ∩ (B((i+1)c2) − B(i c2))] = ∫_{(x1,x2) ∈ B((i+1)c2) − B(i c2)} 1_E(x1, x2) f(x1, x2) dx1 dx2.

Applying the density upper bound from Lemma 2 with d = 2, there are constants C1 and C2 such that

P[E ∩ (B((i+1)c2) − B(i c2))] ≤ ∫_{(x1,x2) ∈ B((i+1)c2) − B(i c2)} 1_E(x1, x2) C1 exp(−C2 c2 i) dx1 dx2
= C1 exp(−C2 c2 i) ∫_{(x1,x2) ∈ B((i+1)c2) − B(i c2)} 1_E(x1, x2) dx1 dx2.

If we include B(i c2) in the integral again, we get

P[E ∩ (B((i+1)c2) − B(i c2))] ≤ C1 exp(−C2 c2 i) ∫_{(x1,x2) ∈ B((i+1)c2)} 1_E(x1, x2) dx1 dx2.

Now, we exploit the fact that the integral above is a rescaling of a probability with respect to the uniform distribution. Let C3 be the volume of the unit ball in R^2. Then, we have

P[E ∩ (B((i+1)c2) − B(i c2))] ≤ C1 exp(−C2 c2 i) C3 (i+1)^2 c2^2 α/π = C4 c2^2 α (i+1)^2 exp(−C2 c2 i),

for C4 = C1 C3/π. Returning to (5), we get

P[E] ≤ Σ_{i≥1} C4 c2^2 α (i+1)^2 exp(−C2 c2 i) = C4 c2^2 × (4e^{2 C2 c2} − 3e^{C2 c2} + 1)/(e^{C2 c2} − 1)^3 × α.

Since lim_{c2→∞} c2^2 × (4e^{2 C2 c2} − 3e^{C2 c2} + 1)/(e^{C2 c2} − 1)^3 = 0, this completes the proof. ∎

We note that a weaker result of this type was proven (via different techniques) for the uniform distribution in the unit ball in (Balcan et al., 2007). In addition to being more general, Theorem 4 is tighter and more refined even for this specific case; this improvement is essential for obtaining tight bounds for polynomial-time algorithms for passive learning (Section 5) and better bounds for active learning as well.

4. Active Learning

In this section we analyze a margin-based algorithm for actively learning linear separators under log-concave distributions (Balcan et al., 2007) (Algorithm 1). Lower bounds proved in Section 7 show that this algorithm needs exponentially fewer labeled examples than any passive learning algorithm.

This algorithm has been previously proposed and analyzed in (Balcan et al., 2007) for the special case of the uniform distribution in the unit ball. In this paper we analyze it for the much more general class of log-concave distributions.

Algorithm 1 Margin-based Active Learning

Input: a sampling oracle for D, a labeling oracle, sequences m_k > 0, k ∈ Z_+ (sample sizes) and b_k > 0, k ∈ Z_+ (cut-off values). Output: weight vector ŵ_s.

• Draw m_1 examples from D, label them and put them in W(1).

• iterate k = 1, ..., s
  − find a hypothesis ŵ_k with ||ŵ_k||_2 = 1 consistent with all labeled examples in W(k).
  − let W(k + 1) = W(k).
  − until m_{k+1} additional data points are labeled, draw sample x from D
    • if |ŵ_k · x| ≥ b_k, then reject x,
    • else, ask for the label of x, and put it into W(k + 1).
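A minimal executable sketch of this procedure is given below; it assumes realizable labels from an isotropic Gaussian marginal, uses b_k = 2^{-k} cut-offs with a fixed number of labels per round, and finds a consistent hypothesis by solving a linear feasibility program (cf. note (1) in Section 5). All of these concrete choices, and helper names such as find_consistent, are illustrative assumptions rather than the paper's prescription.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)

def find_consistent(X, y):
    """Return a unit vector w with y_i (w . x_i) > 0 for all labeled examples,
    via the LP feasibility problem: minimize 0 s.t. y_i (w . x_i) >= 1
    (feasible whenever the labeled set is separable through the origin)."""
    d = X.shape[1]
    A_ub = -(y[:, None] * X)              # encodes -y_i (x_i . w) <= -1
    res = linprog(c=np.zeros(d), A_ub=A_ub, b_ub=-np.ones(len(y)),
                  bounds=[(None, None)] * d, method="highs")
    return res.x / np.linalg.norm(res.x)

def margin_based_active_learning(w_star, d, s, m_k, b_k):
    """Sketch of Algorithm 1: after round 1, only points inside the margin band
    of the current hypothesis are sent to the labeling oracle."""
    X = rng.normal(size=(m_k[0], d))      # isotropic Gaussian unlabeled draws
    y = np.sign(X @ w_star)               # label requests for round 1
    labels = m_k[0]
    w_hat = find_consistent(X, y)
    for k in range(1, s):
        new_X = []
        while len(new_X) < m_k[k]:
            x = rng.normal(size=d)
            if abs(w_hat @ x) >= b_k[k - 1]:
                continue                  # reject: far from the current boundary
            new_X.append(x)
        new_X = np.array(new_X)
        X = np.vstack([X, new_X])
        y = np.concatenate([y, np.sign(new_X @ w_star)])   # label requests
        labels += m_k[k]
        w_hat = find_consistent(X, y)
    return w_hat, labels

d, s = 5, 8
w_star = np.ones(d) / np.sqrt(d)
w_hat, labels = margin_based_active_learning(
    w_star, d, s, m_k=[200] * s, b_k=[2.0 ** (-k) for k in range(1, s + 1)])
print("labels used:", labels,
      "  angle to target:", np.arccos(np.clip(w_hat @ w_star, -1, 1)))
```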

Theorem 5 Assume D is isotropic log-concave in R^d. There exist constants C1, C2 s.t. for d ≥ 4, and for any ε, δ > 0, ε < 1/4, using Algorithm 1 with b_k = C1 2^{−k} and m_k = C2 (d + ln((1 + s − k)/δ)), after s = ⌈log_2(c/ε)⌉ iterations (where c is the constant from Lemma 3) we find a separator of error at most ε with probability 1 − δ. The total number of labeled examples needed is O((d + log(1/δ) + log log(1/ε)) log(1/ε)).

Proof Let c be the constant from Lemma 3. We will show, using induction, that, for all k ≤ s, with probability at least 1 − (δ/4) Σ_{j≤k} 1/(1 + s − j)^2, any ŵ consistent with the data in the working set W(k) has err(ŵ) ≤ c 2^{−k}, so that, in particular, err(ŵ_k) ≤ c 2^{−k}.

The case where k = 1 follows from the standard VC bounds (see e.g., (Vapnik and Chervonenkis, 1971)). Assume now the claim is true for k − 1 (k > 1), and consider the k-th iteration. Let S1 = {x : |ŵ_{k−1} · x| ≤ b_{k−1}} and S2 = {x : |ŵ_{k−1} · x| > b_{k−1}}. By the induction hypothesis, we know that, with probability at least 1 − (δ/4) Σ_{j≤k−1} 1/(1 + s − j)^2, all ŵ consistent with W(k − 1), including ŵ_{k−1}, have error at most c 2^{−(k−1)}. Consider an arbitrary such ŵ. By Lemma 3 we have θ(ŵ, w*) ≤ 2^{−(k−1)} and θ(ŵ_{k−1}, w*) ≤ 2^{−(k−1)}, so θ(ŵ_{k−1}, ŵ) ≤ 4 × 2^{−k}. Applying Theorem 4, there is a choice of C1 (the constant such that b_{k−1} = C1 2^{−(k−1)}) that satisfies P((ŵ · x)(ŵ_{k−1} · x) < 0, x ∈ S2) ≤ c 2^{−k}/4 and P((ŵ_{k−1} · x)(w* · x) < 0, x ∈ S2) ≤ c 2^{−k}/4. So

P((ŵ · x)(w* · x) < 0, x ∈ S2) ≤ c 2^{−k}/2.   (6)

Now let us treat the case that x ∈ S1. Since we are labeling m_k data points in S1 at iteration k − 1, classic Vapnik-Chervonenkis bounds (1971) imply that, if C2 is a large enough absolute constant, then with probability 1 − δ/(4(1 + s − k)^2), for all ŵ consistent with the data in W(k),

err(ŵ | S1) = P((ŵ · x)(w* · x) < 0 | x ∈ S1) ≤ c 2^{−k}/(4 b_{k−1}).   (7)

Finally, since S1 consists of those points that, after projecting onto the direction ŵ_{k−1}, fall into an interval of length 2 b_{k−1}, Lemma 2 implies that P(S1) ≤ 2 b_{k−1}. Putting this together with (6) and (7), with probability 1 − (δ/4) Σ_{j≤k} 1/(1 + s − j)^2, we have err(ŵ) ≤ c 2^{−k}, completing the proof. ∎

5. Passive Learning

In this section we show how an analysis that was inspired by active learning leads to optimal (up to constant factors) bounds for polynomial-time algorithms for passive learning.

Theorem 6 Assume that D is zero mean and log-concave in R^d. There exists an absolute constant C3 s.t. for d ≥ 4, and for any ε, δ > 0, ε < 1/4, any algorithm that outputs a hypothesis that correctly classifies m = C3 (d + log(1/δ))/ε examples finds a separator of error at most ε with probability ≥ 1 − δ.

PROOF SKETCH: We focus here on the case that D is isotropic. We can treat the non-isotropic case by observing that the two cases are equivalent; one may pass between them by applying the whitening transform. (See Appendix C for details.)

While our analysis will ultimately provide a guarantee for any learning algorithm that always outputs a consistent hypothesis, we will use the intermediate hypotheses of Algorithm 1 in the analysis.

Let c be the constant from Lemma 3. While proving Theorem 5, we proved that, if Algorithm 1 is run with b_k = C1 2^{−k} and m_k = C2 (d + ln((1 + s − k)/δ)), then for all k ≤ s, with probability at least 1 − (δ/4) Σ_{j≤k} 1/(1 + s − j)^2, any ŵ consistent with the data in W(k) has err(ŵ) ≤ c 2^{−k}. Thus, after s = O(log(1/ε)) iterations, with probability at least 1 − δ, any linear classifier consistent with all the training data has error ≤ ε, since any such classifier is consistent with the examples in W(s).

Now, let us analyze the number of examples used, including those examples whose labels were not requested by Algorithm 1. Lemma 2 implies that there is a positive constant c1 such that P(S1) ≥ c1 b_k; again, S1 consists of those points that fall into an interval of length 2 b_k after projecting onto ŵ_{k−1}. The density is lower bounded by a constant when b_k ≤ 1/9, and we can use the bound for 1/9 when b_k > 1/9. The expected number of examples that we need before we find m_k elements of S1 is therefore at most m_k/(c1 b_k). Using a Chernoff bound, if we draw 2 m_k/(c1 b_k) examples, the probability that we fail to get m_k members of S1 is at most exp(−m_k/6), which is at most δ/(4(1 + s − k)^2) if C2 is large enough. So, the total number of examples needed, Σ_k 2 m_k/(c1 b_k), is at most a constant factor more than

Σ_{k=1}^{s} 2^k (d + log((1 + s − k)/δ)) = O(2^s (d + log(1/δ))) + Σ_{k=1}^{s} 2^k log(1 + s − k)
= O((d + log(1/δ))/ε) + Σ_{k=1}^{s} 2^k log(1 + s − k).

We can show Σ_{k=1}^{s} 2^k log(1 + s − k) = O(1/ε), completing the proof. ∎

We conclude this section by pointing out several important facts and implications of Theorem 6 and its proof.

(1) The separator in Theorem 6 (and the one in Theorem 5) can be found in polynomial time, for example by using linear programming.

(2) The analysis of Theorem 6 also bounds the number of unlabeled examples needed by the active learning algorithm of Theorem 5. This shows that an algorithm can request a nearly optimally small number of labels without increasing the total number of examples required by more than a constant factor. Specifically, in round k, we only need 2^k (d + ln[(1 + s − k)/δ]) unlabeled examples (whp), where s = O(log(1/ε)), so the total number of unlabeled examples needed over all rounds is O(d/ε + log(1/δ)/ε).

6. More Distributions

In this section we consider learning with respect to a more general class of distributions. We start by providing a general set of conditions on a set D of distributions that is sufficient for efficient passive and active learning w.r.t. distributions in D. We then consider nearly log-concave distributions, an interesting, more general class containing log-concave distributions, considered previously in (Applegate and Kannan, 1991) and (Caramanis and Mannor, 2007). We then prove that isotropic nearly log-concave distributions satisfy our sufficient conditions; in Appendix D, we also show how to remove the assumption that the distribution is isotropic.

Definition 7 A set D of distributions is admissible if it satisfies the following:

• There exists c such that for any D ∈ D and any two unit vectors u and v in R^d we have cθ(v, u) ≤ d_D(u, v).

• For any c1 > 0, there is a c2 > 0 such that the following holds for all D ∈ D. Let u and v be two unit vectors in R^d s.t. θ(u, v) = α < π/2. Then P_{x∼D}[sign(u · x) ≠ sign(v · x), |v · x| ≥ c2 α] ≤ c1 α.

• There are positive constants c3, c4, c5 such that, for any D' ∈ D, for any projection D of D' onto a one-dimensional subspace, the density f of D satisfies f(x) < c3 for all x and f(x) > c4 for all x with |x| < c5.

The proofs of Theorem 5 and Theorem 6 can be used without modification to show:

Theorem 8 If D is admissible, then an arbitrary f ∈ C can be learned with respect to arbitrary distributions in D in polynomial time in the active learning model from O((d + log(1/δ) + log log(1/ε)) log(1/ε)) labeled examples, and in the passive learning model from O((d + log(1/δ))/ε) examples.

6.1. The nearly log-concave case

Definition 9 A density function f : R^n → R is β log-concave if for any λ ∈ [0, 1], x1 ∈ R^n, x2 ∈ R^n, we have f(λ x1 + (1 − λ) x2) ≥ e^{−β} f(x1)^λ f(x2)^{1−λ}.

Clearly, a density function f is log-concave if it is 0-log-concave. An example of an O(1)-log-concave distribution is a mixture of two log-concave distributions whose covariance matrices are I, and whose means μ1 and μ2 have ||μ1 − μ2|| = O(1).
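The following sketch numerically estimates the smallest β that works for such a mixture in one dimension, by maximizing the violation of the β-log-concavity inequality over a grid; the specific means, mixture weights, and grid are illustrative assumptions.

```python
import numpy as np

def mixture_logpdf(x, mu1=0.0, mu2=1.0):
    # Equal-weight mixture of N(mu1, 1) and N(mu2, 1); ||mu1 - mu2|| = O(1).
    p = 0.5 * np.exp(-0.5 * (x - mu1) ** 2) + 0.5 * np.exp(-0.5 * (x - mu2) ** 2)
    return np.log(p / np.sqrt(2 * np.pi))

# beta-log-concavity requires, for all x1, x2 and lam in [0, 1]:
#   log f(lam*x1 + (1-lam)*x2) >= -beta + lam*log f(x1) + (1-lam)*log f(x2).
# Estimate the smallest workable beta by maximizing the violation over a grid.
x1, x2 = np.meshgrid(np.linspace(-5.0, 6.0, 301), np.linspace(-5.0, 6.0, 301))
worst = 0.0
for lam in np.linspace(0.05, 0.95, 19):
    gap = (lam * mixture_logpdf(x1) + (1.0 - lam) * mixture_logpdf(x2)
           - mixture_logpdf(lam * x1 + (1.0 - lam) * x2))
    worst = max(worst, float(gap.max()))
print("empirical beta for the mixture ~", worst)   # stays O(1) when the mean gap is O(1)
```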

In this section we prove that for any sufficiently small constant β ≥ 0, the class of isotropic β log-concave distributions in R^d is admissible and has light tails (this second fact is useful for analyzing the disagreement coefficient in Section 8). In doing so we provide several new properties of such distributions, which could be of independent interest. Detailed proofs of our claims appear in Appendix D.

We start by showing that for any isotropic β log-concave density f there exists a log-concave density f̃ whose center is within e(C − 1)√(Cd) of f's center and that satisfies f(x)/C ≤ f̃(x) ≤ C f(x), for C as small as e^{β ⌈log_2(d+1)⌉}. The fact that C depends only exponentially on log d (as opposed to exponentially on d) is key for being able to argue that such distributions have light tails.

Lemma 10 For any isotropic β log-concave density function f there exists a log-concave density function f̃ that satisfies f(x)/C ≤ f̃(x) ≤ C f(x) and ||∫ x (f(x) − f̃(x)) dx|| ≤ e(C − 1)√(Cd), for C = e^{β ⌈log_2(d+1)⌉}. Moreover, we have 1/C ≤ ∫ (u · x)^2 f̃(x) dx ≤ C for every unit vector u.

PROOF SKETCH: Note that if the density function f is β log-concave, then h = ln f satisfies that for any λ ∈ [0, 1], x1 ∈ R^n, x2 ∈ R^n, we have h(λ x1 + (1 − λ) x2) ≥ −β + λ h(x1) + (1 − λ) h(x2). Let ĥ be the function whose subgraph is the convex hull of the subgraph of h. By using Caratheodory's theorem [2] we can show that

ĥ(x) = max_{Σ_{i=1}^{d+1} α_i = 1, α_i ≥ 0, x = Σ_{i=1}^{d+1} α_i x_i} Σ_{i=1}^{d+1} α_i h(x_i).

This implies h(x) ≤ ĥ(x), and we can prove by induction on log_2(d + 1) that h(x) ≥ ĥ(x) − β ⌈log_2(d + 1)⌉. If we further normalize e^ĥ to make it a density function, we obtain f̃ that is log-concave and satisfies f(x)/C ≤ f̃(x) ≤ C f(x), where C = e^{β ⌈log_2(d+1)⌉}. This implies that for any x we have |f(x) − f̃(x)| ≤ (C − 1) f̃(x).

Using this fact and concentration properties of f̃ (in particular Lemma 2), we can show that the center of f̃ is close to the center of f, as desired. ∎

Theorem 11 Assume β is a sufficiently small non-negative constant and let D be the set of all isotropic β log-concave distributions. (a) D is admissible. (b) Any D ∈ D has light tails. That is: P(||X|| > R√(Cd)) < C e^{−R+1}, for C = e^{β ⌈log_2(d+1)⌉}.

PROOF SKETCH: (a) Choose D ∈ D. As in Lemma 3, consider the plane determined by u and v and let proj_{u,v}(·) denote the projection operator that, given x ∈ R^d, orthogonally projects x onto this plane. If D2 = proj_{u,v}(D) then d_D(u, v) = d_{D2}(u', v'). By using the Prekopa-Leindler inequality (Gardner, 2002) one can show that D2 is β log-concave (see e.g., (Caramanis and Mannor, 2007)). Moreover, if D is isotropic, then D2 is isotropic as well. By Lemma 10 we know that there exists a log-concave distribution D̃2 centered at z, ||z|| ≤ ẽ, satisfying f(x)/C ≤ f̃(x) ≤ C f(x) and 1/C ≤ ∫ (u · x)^2 f̃(x) dx ≤ C for every unit vector u, for the constants C and ẽ = e(C − 1)√(2C) given by Lemma 10 applied in two dimensions. For β sufficiently small we have (1/20 + ẽ)/√(1/C − ẽ^2) ≤ 1/9. Using this, by applying the whitening transform (see Theorem 16 in Appendix D), we can show f̃_2(x) ≥ c for ||x|| ≤ 1/20, which implies f_2(x) ≥ c/C for ||x|| ≤ 1/20. Using a reasoning as in Lemma 3 we get cθ(v, u) ≤ d_D(u, v). The generalization of Theorem 4 follows from a similar proof, except using Theorem 16. The density bounds in the one-dimensional case also follow from Theorem 16.

(b) Since X is isotropic, we have E_f[X · X] = d (where f is its associated density). By Lemma 10, there exists a log-concave density f̃ such that f(x)/C ≤ f̃(x) ≤ C f(x), for C = e^{β ⌈log_2(d+1)⌉}. This implies E_{f̃}[X · X] ≤ Cd. By Lemma 2 we get that under f̃, P(||X|| > R√(Cd)) < e^{−R+1}, so under f we have P(||X|| > R√(Cd)) < C e^{−R+1}. ∎

Using Theorem 8 and Theorem 11(a) we obtain:

Theorem 12 Let β ≥ 0 be a sufficiently small constant. Assume that D is an isotropic β log-concave distribution in R^d. Then an arbitrary f ∈ C can be learned with respect to D in polynomial time in the active learning model from O((d + log(1/δ) + log log(1/ε)) log(1/ε)) labeled examples, and in the passive learning model from O((d + log(1/δ))/ε) examples.

7. Lower Bounds

In this section we give lower bounds on the label complexity of passive and active learning of homogeneous linear separators when the underlying distribution is β log-concave, for a sufficiently small constant β. These lower bounds are information theoretic, applying to any procedure, which need not be computationally efficient. The proof is in Appendix E.

Theorem 13 For a small enough constant β we have: (1) for any β log-concave distribution D whose covariance matrix has full rank, the sample complexity of learning origin-centered linear separators under D in the passive learning model is Ω(d/ε + (1/ε) log(1/δ)); (2) the sample complexity of active learning of linear separators under β log-concave distributions is Ω(d log(1/ε)).

Note that, if the covariance matrix of D does not have full rank, the number of dimensions is effectively less than d, so our lower bound essentially applies for all log-concave distributions.

8. The inseparable case: Disagreement-based active learning

We consider two closely related distribution-dependent capacity notions: the Alexander capacity and the disagreement coefficient; they have been widely used for analyzing the label complexity of non-aggressive active learning algorithms (Hanneke, 2007; Dasgupta et al., 2007; Koltchinskii, 2010; Hanneke, 2011; Beygelzimer et al., 2010). We begin with the definitions. For r > 0, define B(w, r) = {u ∈ C : P_D(sign(u · x) ≠ sign(w · x)) ≤ r}. For any H ⊆ C, define the region of disagreement as DIS(H) = {x ∈ X : ∃ w, u ∈ H s.t. sign(u · x) ≠ sign(w · x)}. Define the Alexander capacity function cap_{w*,D}(·) for w* ∈ C w.r.t. D as: cap_{w*,D}(r) = P_D[DIS(B(w*, r))]/r.

Define the disagreement coefficient for w* ∈ C w.r.t. D as: dis_{w*,D}(ε) = sup_{r ≥ ε} [cap_{w*,D}(r)]. The following is our bound on the disagreement coefficient. Its proof is in Appendix F.

Theorem 14 Let β ≥ 0 be a sufficiently small constant. Assume that D is an isotropic β log-concave distribution in R^d. For any w* and any ε, cap_{w*,D}(ε) is O(d^{1/2 + O(β)} log(1/ε)). Thus dis_{w*,D}(ε) = O(d^{1/2 + O(β)} log(1/ε)).

Theorem 14 immediately leads to concrete bounds on the label complexity of several algorithms in the literature (Hanneke, 2007; Cohn et al., 1994; Balcan et al., 2006; Koltchinskii, 2010; Dasgupta et al., 2007). For example, by composing it with a result of (Dasgupta et al., 2007), we obtain a bound of O(d^{3/2}(log^2(1/ε) + (ν/ε)^2)) for agnostic active learning when D is isotropic log-concave in R^d; that is, we only need O(d^{3/2}(log^2(1/ε) + (ν/ε)^2)) label requests to output a classifier of error at most ν + ε, where ν = min_{w∈C} err(w).
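To make the capacity notion concrete, the sketch below estimates cap_{w*,D}(r) by Monte Carlo under the isotropic Gaussian (for which d_D(u, w*) = θ(u, w*)/π), using the fact that for homogeneous separators a point x lies in DIS(B(w*, r)) exactly when |w* · x|/||x|| ≤ sin(πr); the estimate grows like √d, in line with Theorem 14. The distribution, dimensions, and sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

def capacity_estimate(d, r):
    # For the isotropic Gaussian, B(w*, r) = {u : theta(u, w*) <= pi*r}, and
    # x is in DIS(B(w*, r)) iff |w* . x| / ||x|| <= sin(pi*r) (for r <= 1/2).
    X = rng.normal(size=(n, d))
    in_dis = np.abs(X[:, 0]) / np.linalg.norm(X, axis=1) <= np.sin(np.pi * r)
    return in_dis.mean() / r          # cap(r) = P[DIS(B(w*, r))] / r

r = 0.01
for d in (5, 20, 80):
    print(f"d={d:3d}  cap(r={r}) ~ {capacity_estimate(d, r):6.2f}"
          f"   sqrt(d) = {np.sqrt(d):5.2f}")
```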

9. The Tsybakov condition

In this section we consider a variant of the Tsybakov noise condition (Mammen and Tsybakov, 1999). We assume that the classifier h that minimizes P_{(x,y)∼D_XY}(h(x) ≠ y) is a linear classifier, and that, for the weight vector w* of the optimal classifier, there exist known parameters α ∈ [0, 1) and a > 0 such that, for all w, we have

a (d_D(w, w*))^{1/(1−α)} ≤ err(w) − err(w*).
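As a concrete instance of the α = 0 (Massart-type) case of this condition: if labels are generated by flipping sign(w* · x) independently with probability η < 1/2, then err(w) − err(w*) = (1 − 2η) d_D(w, w*), so the condition holds with a = 1 − 2η. The sketch below verifies this numerically; the isotropic Gaussian marginal, noise rate, and angles are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n, eta = 10, 400_000, 0.2

w_star = np.zeros(d); w_star[0] = 1.0
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)
y[rng.random(n) < eta] *= -1           # random classification noise at rate eta

def err(w):
    return np.mean(np.sign(X @ w) != y)

for angle in (0.1, 0.3, 0.6):
    w = np.array([np.cos(angle), np.sin(angle)] + [0.0] * (d - 2))
    excess = err(w) - err(w_star)
    d_D = angle / np.pi                # exact for the Gaussian marginal
    print(f"angle={angle}: excess ~ {excess:.4f},"
          f" (1 - 2*eta)*d_D = {(1 - 2 * eta) * d_D:.4f}")
```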

By generalizing Theorem 4 so that it provides a stronger bound for larger margins, and combining the result with the other lemmas of this paper and techniques from (Balcan et al., 2007), we get the following.

Theorem 15 Let s = O(log(1/ε)). Assume that the distribution D_XY satisfies the Tsybakov noise condition for constants α ∈ [0, 1) and a > 0, and that the marginal D on R^d is isotropic log-concave. (1) If α = 0, we can find a separator with excess error ≤ ε with probability 1 − δ using O(log(1/ε))(d + log(s/δ)) labeled examples in the active learning model, and O((d + log(1/δ))/ε) labeled examples in the passive learning model. (2) If α > 0, we can find a separator with excess error ≤ ε with probability 1 − δ using O((1/ε)^{2α} log^2(1/ε))(d + log(s/δ)) labeled examples in the active learning model.

In the case α = 0 (which is more general than the Massart noise condition) our analysis leads to optimal bounds for active and passive learning of linear separators under log-concave distributions, improving the dependence on d over previous best known results (Hanneke and Yang, 2012; Giné and Koltchinskii, 2006). Our analysis for Tsybakov noise (α > 0) leads to bounds on active learning with improved dependence on d over previously known results (Hanneke and Yang, 2012) in this case as well. Proofs and further details appear in Appendix G.

10. Discussion and Open Questions

The label sample complexity of our active learning algorithm for learning homogeneous linear separators under isotropic log-concave distributions is O((d + log(1/δ) + log log(1/ε)) log(1/ε)), while our lower bound for this setting is Ω(d log(1/ε)). Our upper bound is achieved by an algorithm that uses a polynomial number of unlabeled training examples and polynomial time. If an unbounded amount of computation time and an unbounded number of unlabeled examples are available, it seems to be easy to learn to accuracy ε using O(d log(1/ε)) label requests, no matter what the value of δ. (Roughly, the algorithm can construct an ε-cover to initialize a set of candidate hypotheses, then repeatedly wait for an unlabeled example that evenly splits the current list of candidates, ask for its label, and eliminate roughly half of the candidates.) It would be interesting to know what is the best label complexity for a polynomial-time algorithm, or even an algorithm that is constrained to use a polynomial number of unlabeled examples.
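A toy version of the candidate-halving strategy sketched in parentheses above, for homogeneous separators in the plane (where an ε-cover is just a grid of angles); the isotropic Gaussian unlabeled stream, the split threshold, and the cover construction are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(7)
eps = 0.01

# epsilon-cover of homogeneous separators in R^2: a grid of angles in [0, pi).
angles = np.arange(0.0, np.pi, eps)
candidates = np.stack([np.cos(angles), np.sin(angles)], axis=1)
alive = np.ones(len(candidates), dtype=bool)

theta_star = 1.234
w_star = np.array([np.cos(theta_star), np.sin(theta_star)])

labels = 0
while alive.sum() > 1:
    x = rng.normal(size=2)                       # unlabeled example
    split = np.mean(np.sign(candidates[alive] @ x) > 0)
    if not (0.25 <= split <= 0.75):              # wait for a roughly even split
        continue
    y = np.sign(w_star @ x)                      # label request
    labels += 1
    alive &= (np.sign(candidates @ x) == y)      # eliminate disagreeing candidates

w_hat = candidates[alive][0]
print("labels used:", labels, " (vs cover size", len(candidates), ")")
print("angle to target:", np.arccos(abs(np.clip(w_hat @ w_star, -1, 1))))
```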

Conceptually, our analysis of ERM for passive learning under (nearly) log-concave distributions is based on a more aggressive localization than those considered previously in the literature. It would be very interesting to extend this analysis, as well as our analysis for active learning, to arbitrary distributions and more general concept spaces.

Acknowledgements We thank Steve Hanneke for a number of useful discussions. This work was supported in part by NSF grant CCF-0953192, AFOSR grant FA9550-09-1-0538, and a Microsoft Research Faculty Fellowship.

Appendix A. Additional Related Work

Learning with noise. Alexander Capacity and the Disagreement Coefficient Roughly speaking, the Alexander capacity (Alexander, 1987; Giné and Koltchinskii, 2006) quantifies how fast the region of disagreement of the set of classifiers at distance r from the optimal classifier collapses as a function of r; [3] the disagreement coefficient (Hanneke, 2007) additionally involves the supremum of r over a range of values. (Friedman, 2009) provides guarantees on these quantities (for sufficiently small r) for general classes of functions in R^d if the underlying data distribution is sufficiently smooth. Our analysis implies much tighter bounds for linear separators under log-concave distributions (matching what was known for the much less general case of nearly uniform distributions over the unit sphere); furthermore, we also analyze the nearly log-concave case, where we allow an arbitrary number of discontinuities, a case not captured by the (Friedman, 2009) conditions at all. This immediately implies concrete bounds on the labeled data complexity of several algorithms in the literature, including the A2 algorithm (Balcan et al., 2006) and the DHM algorithm (Dasgupta et al., 2007), with implications for the purely agnostic case (i.e., arbitrary forms of noise), as well as Koltchinskii's algorithm (Koltchinskii, 2010) and the CAL algorithm (Balcan et al., 2006; Hanneke, 2007, 2011). Furthermore, in the realizable case and under Tsybakov noise, we show even better bounds, by considering aggressive active learning algorithms.

Note that, as opposed to the realizable case, all existing active learning algorithms analyzed under Massart and Tsybakov noise conditions using the learning model analyzed in this paper (including our algorithms in Theorem 15), as well as those for the agnostic setting, are not known to run in time poly(d, 1/ε). In fact, even ignoring the optimality of sample complexity, there are no known algorithms for passive learning that run in time poly(d, 1/ε) for general values of ε, even for the Massart noise condition and under log-concave distributions. Existing works on agnostic passive learning under log-concave distributions either provide running times d^{poly(1/ε)} (e.g., the work of (Kalai et al., 2005)) or can only achieve values of ε that are significantly larger than the noise rate (Klivans et al., 2009a).

Other Work on Active Learning Several papers (Cesa-Bianchi et al., 2010; Dekel et al., 2012) present efficient online learning algorithms in the selective sampling framework, where labels must be actively queried before they are revealed. Under the assumption that the label conditional distribution is a linear function determined by a fixed target vector, they provide bounds on the regret of the algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. As pointed out by (Dekel et al., 2012), these results can be converted to a statistical setting when the instances x_i are drawn i.i.d. and a margin condition is further assumed. In this setting they obtain an exponential improvement in label complexity over passive learning. While very interesting, these results are incomparable to ours; their techniques significantly exploit the linear noise condition to get these improvements; note that such an improvement would not be possible in the realizable case (as pointed out, for example, in (Gonen et al., 2013)).

(Nowak, 2011) considers an interesting abstract “generalized binary search" problem with applications to active learning; while these results apply to more general concept spaces, it is not clear how to implement the resulting procedures in polynomial time and by using access to only a polynomial number of unlabeled samples from the underlying distribution (as required by the active learning model). Another interesting recent work is that of (Gonen et al., 2013), which studies active learning of linear separators via an aggressive algorithm using a margin condition, with a general approximation guarantee on the number of labels requested; note that while these results work for potentially more general distributions, as opposed to ours they do not come with explicit (tight) bounds on the label complexity.

ε-nets, Learning, and Geometry Small ε-nets are useful for many applications, especially in computational geometry (see (Pach and Agarwal, 1995)). The same fundamental techniques of (Vapnik and Chervonenkis, 1971; Vapnik, 1982) have been applied to establish the existence of small ε-nets (Haussler and Welzl, 1987) and to bound the sample complexity of learning (Vapnik, 1982; Blumer et al., 1989), and a number of interesting upper and lower bounds on the smallest possible size of ε-nets have been obtained (Komlos et al., 1992; Clarkson and Varadarajan, 2007; Alon, 2010).

Our analysis implies an O(d/ε) upper bound on the size of an ε-net for the set of regions of disagreement between all possible linear classifiers and the target, when the distribution is zero-mean and log-concave. In particular, since in Theorem 6 we prove that any hypothesis consistent with the training data has error rate ≤ ε with probability 1 − δ, setting δ to a constant gives a proof of an O(d/ε) bound on the size of an ε-net for the following set: {{x : (w · x)(w* · x) < 0} : w ∈ R^d}.

Appendix B. Proof of Lemma 3

Lemma 3. Assume D is an isotropic log-concave distribution in R^d. Then there exists c such that for any two unit vectors u and v in R^d we have cθ(v, u) ≤ d_D(u, v).

Proof Consider two unit vectors u and v. Let proj_{u,v}(·) denote the projection operator that, given x ∈ R^d, orthogonally projects x onto the plane determined by u and v. That is, if we define an orthogonal coordinate system in which coordinates 1, 2 lie in this plane and coordinates 3, ..., d are orthogonal to this plane, then x' = proj_{u,v}(x_1, ..., x_d) = (x_1, x_2). Also, given a distribution D over R^d, define proj_{u,v}(D) to be the distribution given by first picking x ∼ D and then outputting x' = proj_{u,v}(x). That is, proj_{u,v}(D) is just the marginal distribution over coordinates 1, 2 in the above coordinate system. Notice that if x' = proj_{u,v}(x) then u · x = u' · x', where u' = proj_{u,v}(u) and v' = proj_{u,v}(v). So, if D2 = proj_{u,v}(D) then d_D(u, v) = d_{D2}(u', v').

By Lemma 2(c), we have that if D is isotropic and log-concave, then D2 is as well. Let A be the region of disagreement between u' and v' intersected with the ball of radius 1/9 in R^2. The probability mass of A under D2 is at least the volume of A times inf_{x∈A} D2(x). So, using Lemma 2(b),

d_{D2}(u', v') ≥ vol(A) inf_{x∈A} D2(x) ≥ cθ(u, v),

as desired. ∎

Appendix C. Passive Learning

Theorem 6. Assume that D is zero mean and log-concave in R^d. There exists an absolute constant C3 s.t. for d ≥ 4, and for any ε, δ > 0, ε < 1/4, any algorithm that outputs a hypothesis that correctly classifies m = C3 (d + log(1/δ))/ε examples finds a separator of error at most ε with probability ≥ 1 − δ.

Proof First, let us prove the theorem in the case that D is isotropic. We will then treat the general case at the end of the proof.

While our analysis will ultimately provide a guarantee for any learning algorithm that always outputs a consistent hypothesis, we will use the intermediate hypotheses of Algorithm 1 in the analysis.

Let c be the constant from Lemma 3. While proving Theorem 5, we proved that, if Algorithm 1 is run with b_k = C1 2^{−k} and m_k = C2 (d + ln((1 + s − k)/δ)), then for all k ≤ s, with probability at least 1 − (δ/4) Σ_{j≤k} 1/(1 + s − j)^2, any ŵ consistent with the data in W(k) has err(ŵ) ≤ c 2^{−k}. Thus, after s = O(log(1/ε)) iterations, with probability at least 1 − δ, any linear classifier consistent with all the training data has error ≤ ε, since any such classifier is consistent with the examples in W(s).

Now, let us analyze the number of examples used, including those examples whose labels were not requested by Algorithm 1. Lemma 2 implies that there is a positive constant c1 such that P(S1) ≥ c1 b_k; again, S1 consists of those points that fall into an interval of length 2 b_k after projecting onto ŵ_{k−1}. The density is lower bounded by a constant when b_k ≤ 1/9, and we can use the bound for 1/9 when b_k > 1/9.

The expected number of examples that we need before we find m_k elements of S1 is therefore at most m_k/(c1 b_k). Using a Chernoff bound, if we draw 2 m_k/(c1 b_k) examples, the probability that we fail to get m_k members of S1 is at most exp(−m_k/6), which is at most δ/(4(1 + s − k)^2) if C2 is large enough. So, the total number of examples needed, Σ_k 2 m_k/(c1 b_k), is at most a constant factor more than

Σ_{k=1}^{s} 2^k (d + log((1 + s − k)/δ)) = O(2^s (d + log(1/δ))) + Σ_{k=1}^{s} 2^k log(1 + s − k)
= O((d + log(1/δ))/ε) + Σ_{k=1}^{s} 2^k log(1 + s − k).

We claim that Σ_{k=1}^{s} 2^k log(1 + s − k) = O(1/ε). We have

Σ_{k=1}^{s} 2^k log(1 + s − k) ≤ Σ_{k=1}^{s} 2^k (3 + s − k)
≤ ∫_1^{s+1} 2^k (3 + s − k) dk   (since 2^k (3 + s − k) is increasing for k ≤ s + 1)
= (2(2^s − 1)(1 + ln 4) − 2 s ln 2)/ln^2 2
= O(2^s) = O(1/ε),

completing the proof in the case that D is isotropic.

Now let us treat the case in which D is not isotropic. Suppose that Σ is the covariance matrix of D, so that Σ^{−1/2} is the “whitening transform". Suppose, for m = C3 (d + log(1/δ))/ε, an algorithm is given a sample S of examples (x_1, y_1), ..., (x_m, y_m), for x_1, ..., x_m drawn according to D and labeled by a target hypothesis with weight vector v. Note that w is consistent with S if and only if w^T Σ^{1/2} is consistent with (Σ^{−1/2} x_1, y_1), ..., (Σ^{−1/2} x_m, y_m) (and those examples are consistent with v^T Σ^{1/2}). So our analysis of the isotropic case implies that, with probability 1 − δ, for any w consistent with (x_1, y_1), ..., (x_m, y_m), we have

P(sign((w^T Σ^{1/2})(Σ^{−1/2} x)) ≠ sign((v^T Σ^{1/2})(Σ^{−1/2} x))) ≤ ε,

which of course means that P(sign(w^T x) ≠ sign(v^T x)) ≤ ε. ∎

Appendix D. More Distributions

D.1. Isotropic Nearly Log-concave distributions

Lemma 10. For any isotropic β log-concave density function f there exists a log-concave density function f̃ that satisfies f(x)/C ≤ f̃(x) ≤ C f(x) and ||∫ x (f(x) − f̃(x)) dx|| ≤ e(C − 1)√(Cd), for C = e^{β ⌈log_2(d+1)⌉}. Moreover, we have 1/C ≤ ∫ (u · x)^2 f̃(x) dx ≤ C for every unit vector u.

Proof Note that if the density function f is β log-concave, then h = ln f satisfies that for any λ ∈ [0, 1], x1 ∈ R^n, x2 ∈ R^n, we have h(λ x1 + (1 − λ) x2) ≥ −β + λ h(x1) + (1 − λ) h(x2).

Let ĥ be the function whose subgraph is the convex hull of the subgraph of h. That is, ĥ(x) is the maximum of all values of Σ_{i=1}^{k} α_i h(u_i) for any u_1, ..., u_k ∈ R^d and α_1, ..., α_k ∈ [0, 1] such that Σ_{i=1}^{k} α_i = 1 and x = Σ_{i=1}^{k} α_i u_i. Note that, if the components of u_i are u_{i,1}, ..., u_{i,d}, we can get ĥ(x) by starting with

T = {(u_{1,1}, ..., u_{1,d}, h(u_1)), ..., (u_{k,1}, ..., u_{k,d}, h(u_k))},

taking the convex combination of the members of T with mixing coefficients α_1, ..., α_k, and then reading off the last component. Caratheodory's theorem [4] implies that we can get the same result using a mixture of at most d + 1 members of T. In other words, we can assume without loss of generality that k = d + 1, so that

ĥ(x) = max_{Σ_{i=1}^{d+1} α_i = 1, α_i ≥ 0, x = Σ_{i=1}^{d+1} α_i x_i} Σ_{i=1}^{d+1} α_i h(x_i).   (8)

Because of the case where (α_1, ..., α_{d+1}) concentrates all its weight on one component, we have h(x) ≤ ĥ(x). We also claim that

h(x) ≥ ĥ(x) − β ⌈log_2(d + 1)⌉.   (9)

We will prove this by induction on log_2(d + 1), treating the case in which d + 1 is a power of 2. (By padding with zeroes if necessary, we may assume without loss of generality that d + 1 is a power of 2.) The base case, in which d = 1, follows immediately from the definitions. Let k = d + 1. Assume that x = α_1 x_1 + α_2 x_2 + ... + α_k x_k with Σ_{i=1}^{k} α_i = 1, α_i ≥ 0. We can write this as

x = (α_1 + α_2) x_{1,2} + (α_3 + α_4) x_{3,4} + ... + (α_{k−1} + α_k) x_{k−1,k},

where x_{i,i+1} = (α_i x_i + α_{i+1} x_{i+1})/(α_i + α_{i+1}) for all i. Now, by induction we have:

h(x) ≥ −β log_2(k/2) + (α_1 + α_2) h(x_{1,2}) + ... + (α_{k−1} + α_k) h(x_{k−1,k})
≥ −β log_2(k/2) − (α_1 + α_2)β + α_1 h(x_1) + α_2 h(x_2) − (α_3 + α_4)β + α_3 h(x_3) + α_4 h(x_4) + ... − (α_{k−1} + α_k)β + α_{k−1} h(x_{k−1}) + α_k h(x_k)
= −β log_2(k) + α_1 h(x_1) + α_2 h(x_2) + α_3 h(x_3) + ... + α_k h(x_k).

The last step follows from the fact that Σ_{i=1}^{k} α_i = 1.

So, we have proved (9). If we further normalize e^ĥ to make it a density function, we obtain f̃ that is log-concave and satisfies f(x)/C ≤ f̃(x) ≤ C f(x), where C = e^{β ⌈log_2(d+1)⌉}. This implies that for any x we have |f(x) − f̃(x)| ≤ (C − 1) f̃(x).

We now show that the center of f̃ is close to the center of f. We have:

||∫ x (f(x) − f̃(x)) dx|| ≤ ∫ ||x|| |f(x) − f̃(x)| dx ≤ (C − 1) ∫ ||x|| f̃(x) dx.

Using concentration properties of f̃ (in particular Lemma 2), we get

||∫ x (f(x) − f̃(x)) dx|| ≤ (C − 1) √(Cd) ∫_{R=0}^{∞} e^{−R+1} dR = e(C − 1)√(Cd),

as desired. ∎

Theorem 16 (i) Let f : R^2 → R be the density function of a log-concave distribution centered at z and with covariance matrix A = E_f[(X − z)(X − z)^T]. Assume f satisfies ||z|| ≤ ẽ and 1/C ≤ ∫ (u · x)^2 f(x) dx ≤ C for every unit vector u, for a constant C ≥ 1 close to 1. We have: (a) Assume (1/20 + ẽ)/√(1/C − ẽ^2) ≤ 1/9. Then there exists a universal constant c s.t. f(x) ≥ c for all x with 0 ≤ ||x|| ≤ 1/20. (b) Assume C ≤ 1 + 1/5. There exist universal constants C1 and C2 such that f(x) ≤ C1 exp(−C2 ||x||) for all x.

(ii) Let f : R → R be the density function of a log-concave distribution centered at z with standard deviation σ = √(Var_f(X)). Then f(x) ≤ 1/σ for all x. If furthermore f satisfies 1/C ≤ E_f[X^2] ≤ C for C ≥ 1 and ẽ/√(1/C − ẽ^2) ≤ 1/9, then we have f(0) ≥ c for some universal constant c.

Proof (i) Let Y = A_1/2(X 7 2). Then Y is a log-concave distribution in the isotropic position. Moreover, the density function of g is given by g(y) = det(A1/2)f(A1/2y+ 2). Let M = 1E[XXT]. We have

A = 1E[(X 7 g)(X 7 2)T] = R[XXT] 7 2.2T = M 7 2.2T.

Also, the fact 1 / C 3 f (11 - 1)2f (1)d1 g C for every unit vector 11 is equivalent to 1/0 3 11TIE[XXT]11 g 0

for every unit vector 11. Using 1) = (1,0),1) = (0,1), and 1) = (1/\/2,1/\/2) we get that M171 6 [1/C,C], M22 €[1/C, C], and M12— — M2 1 E [1/C 7 C,C 7 1/C]. We also have [|z[| g €and det A(1/2) =(Ax/det ).All these imply that

(1/0 7 £2)? 7 (0 7 1/0)2 g det(A1/2) g 0.

(a) For1 = A1/2y+zwehave H1 7 2H2 = (17.2)(172)T = [|y[|2 1)TA1), where1) = (1/ [|y[|)y is a unit vector, so [|y[| 3 H17 2H /«/1/C 7é2. If [|1[| 3 1/20 we have [|y[| g 1/9, so by Lemma 2 we have g(y) 2 01, so f(y) 2 0, for some universal constants 01, 02, as desired.

(b) We have $f(x) = \frac{1}{\det A^{1/2}}\, g(A^{-1/2}(x - z))$. By Lemma 2(b) we have

$f(x) \le \frac{1}{\det A^{1/2}} \exp\!\left(-c \|A^{-1/2}(x - z)\|\right).$


By the triangle inequality we further obtain:

$f(x) \le \frac{1}{\det A^{1/2}} \exp\!\left(c \|A^{-1/2} z\|\right) \exp\!\left(-c \|A^{-1/2} x\|\right).$

For $C \le 1 + 1/5$, we can show that $\|A^{-1/2} x\| \ge (1/\sqrt{2}) \|x\|$. It is enough to show $\|A^{-1/2} x\|^2 \ge (1/2) \|x\|^2$, or that $2\|v\|^2 \ge \|A^{1/2} v\|^2$, where $v = A^{-1/2} x$ (so $x = A^{1/2} v$). This is equivalent to $2 v^T v \ge v^T A v$, which is true since the matrix $2I - A$ is positive semi-definite.

(ii) Define $Y = (X - \bar{z})/\sigma$. We have $E[Y] = 0$ and $E[Y^2] = 1$. The density $g$ of $Y$ is given by $g(y) = \sigma f(\sigma y + \bar{z})$. Now, since $g$ is isotropic and log-concave, we can apply Lemma 2(e) to $g$, so $g(y) \le 1$ for all $y$. So $\sigma f(\sigma y + \bar{z}) \le 1$ for all $y$, which implies $f(x) \le 1/\sigma$ for all $x$. The second part follows as in the proof of part (i). ∎
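The bound $f(x) \le 1/\sigma$ in part (ii) is easy to sanity-check against a few standard one-dimensional log-concave densities. The snippet below is our own illustration, with the distributions chosen by us; it simply compares the maximum density with $1/\sigma$ in each case.

```python
import numpy as np

# (max density, standard deviation) for some 1-D log-concave distributions.
cases = {
    "uniform on [0,1]": (1.0, np.sqrt(1.0 / 12.0)),
    "exponential(1)":   (1.0, 1.0),
    "Laplace(0,1)":     (0.5, np.sqrt(2.0)),
    "standard normal":  (1.0 / np.sqrt(2 * np.pi), 1.0),
}
for name, (fmax, sigma) in cases.items():
    print(f"{name:18s}  max f = {fmax:.3f}  <=  1/sigma = {1.0 / sigma:.3f}")
    assert fmax <= 1.0 / sigma + 1e-12
```

Note that the exponential distribution attains the bound with equality, so the constant $1/\sigma$ cannot be improved in general.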

D.2. More covariance matrices

In this section, we extend Theorem 5 to the case of arbitrary covariance matrices.

Theorem 17 If all distributions in $\mathcal{D}$ are zero-mean and log-concave in $R^d$, then an arbitrary $f \in \mathcal{C}$ can be learned in polynomial time from arbitrary distributions in $\mathcal{D}$ in the active learning model from $O((d + \log(1/\delta) + \log\log(1/\epsilon)) \log(1/\epsilon))$ labeled examples, and in the passive learning model

from $O\!\left(\frac{d + \log(1/\delta) + \log\log(1/\epsilon)}{\epsilon}\right)$ examples.

Our proof proceeds through a series of lemmas. First, (Lovasz and Vempala, 2007) have shown how to reduce to the nearly isotropic case.

Lemma 18 ((Lovasz and Vempala, 2007)) For any constant $\kappa > 0$, there is a polynomial-time algorithm that, given polynomially many samples from a log-concave distribution $D$, outputs an estimate $\hat{\Sigma}$ of the covariance matrix of $D$ such that, with probability $1 - \delta$, the distribution $D'$ obtained by sampling $x$ from $D$ and producing $\hat{\Sigma}^{-1/2} x$ satisfies $\frac{1}{1+\kappa} \le E((u \cdot x)^2) \le 1 + \kappa$ for all unit vectors $u$.
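A minimal numerical sketch of the reduction provided by Lemma 18, under illustrative choices of ours (a linearly transformed uniform distribution on a cube as the log-concave source, a fixed sample size, and a plain plug-in covariance estimate rather than the estimator of (Lovasz and Vempala, 2007)): whiten with $\hat{\Sigma}^{-1/2}$ and check that $E((u \cdot x)^2)$ is close to 1 in every direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200_000

# A zero-mean log-concave distribution with a non-trivial covariance:
# a uniform distribution on [-1,1]^d pushed through a fixed linear map.
A = rng.normal(size=(d, d))
X = rng.uniform(-1.0, 1.0, size=(n, d)) @ A.T

# Estimate the covariance and whiten (Sigma_hat^{-1/2} x).
Sigma_hat = np.cov(X, rowvar=False)
evals, evecs = np.linalg.eigh(Sigma_hat)
Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
Y = X @ Sigma_inv_sqrt.T

# In every direction u, E[(u . y)^2] should now be close to 1.
U = rng.normal(size=(50, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)
second_moments = np.mean((Y @ U.T) ** 2, axis=0)
print("min/max of E[(u.y)^2] over random directions:",
      second_moments.min().round(3), second_moments.max().round(3))
```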

As a result of Lemma 18, we can assume without loss of generality that the distribution $D$ satisfies $\frac{1}{1+\kappa} \le E((u \cdot x)^2) \le 1 + \kappa$ for an arbitrarily small constant $\kappa$. By Theorem 16, this implies that, without loss of generality, there are constants $c_1, ..., c_4$ such that, for the density $f$ of any one- or two-dimensional marginal $D'$ of $D$, we have

$f(x) \ge c_1$ for all $x$ with $\|x\| \le c_2, \qquad (10)$

and, for all $x$,

$f(x) \le c_3 \exp(-c_4 \|x\|). \qquad (11)$

We will show that these imply that $D$ is admissible.

Lemma 19 (a) There exists $c$ such that for any two unit vectors $u$ and $v$ in $R^d$ we have $c\,\theta(v, u) \le d_D(u, v)$.

(b) For any $c_6 > 0$, there is a $c_7 > 0$ such that the following holds. Let $u$ and $v$ be two unit vectors in $R^d$, and assume that $\theta(u, v) = \alpha < \pi/2$. Then

$P_{x \sim D}[\operatorname{sign}(u \cdot x) \ne \operatorname{sign}(v \cdot x),\ |v \cdot x| \ge c_7 \alpha] \le c_6 \alpha.$


Proof (a) Projecting $D$ onto a subspace can only reduce the norm of its mean and its variance in any direction. Therefore, as in the proof of Lemma 3, we may assume without loss of generality that $d = 2$. Here, let us define $A$ to be the region of disagreement between $u'$ and $v'$ intersected with the ball $B_{c_2}$ of radius $c_2$ in $R^2$. Then we have $d_{D_2}(u', v') \ge \operatorname{vol}(A) \inf_{x \in A} D_2(x) \ge \operatorname{vol}(B_{c_2})\, c_1\, \theta(u, v)$. (b) This proof basically amounts to observing that everything that was needed for the proof of Theorem 4 is true for $D$, because of (10) and (11). ∎

Armed with Lemma 19, to prove Theorem 17, we can just apply Theorem 8.

Appendix E. Lower Bounds

The proof of our lower bounds (Theorem 13) relies on a lower bound on the packing number $M_D(\mathcal{C}, \epsilon)$. Recall that the $\epsilon$-packing number $M_D(\mathcal{C}, \epsilon)$ is the maximal cardinality of an $\epsilon$-separated set of classifiers from $\mathcal{C}$, where we say that $w_1, ..., w_N$ are $\epsilon$-separated w.r.t. $D$ if $d_D(w_i, w_j) > \epsilon$

for any $i \ne j$.

Lemma 20 There is a positive constant $c$ such that, for all $\beta < c$, the following holds. Assume that $D$ is $\beta$-log-concave in $R^d$ and that its covariance matrix has full rank. For all sufficiently small $\epsilon$ and all $d \in N$, we have

$M_D(\mathcal{C}, \epsilon) \ge \frac{1}{2} \left(\frac{c}{2\epsilon}\right)^{d-1} - 1.$

Proof We first prove the lemma in the case that D is isotropic. The proof in this case follows the outline of a proof for the special case of the uniform distribution in (Long, 1995).

Let $U_{BALL_d}$ be the uniform distribution on the surface of the unit ball in $R^d$. By Theorem 11, there exists $c$ such that for any two unit vectors $u$ and $v$ in $R^d$ we have $c\,\theta(v, u) \le d_D(u, v)$. This implies that, for a fixed $u$, the probability that a randomly chosen $v$ has $d_D(u, v) \le \epsilon$ is upper bounded by the volume of those vectors in the interior of the unit ball whose angle with $u$ is at most $\epsilon/c$, divided by the volume of the unit ball. Using known bounds on this ratio (see (Long, 1995)) we have

$P_{v \sim U_{BALL_d}}[d_D(u, v) \le \epsilon] \le \left(\frac{2\epsilon}{c}\right)^{d-1}$, so $P_{u, v \sim U_{BALL_d}}[d_D(u, v) \le \epsilon] \le \left(\frac{2\epsilon}{c}\right)^{d-1}$.

That means that, for a fixed $s$, if we pick $s$ normal vectors at random from the unit ball, then the expected number of pairs of halfspaces that are $\epsilon$-close according to $D$ is at most $\frac{s^2}{2} \left(\frac{2\epsilon}{c}\right)^{d-1}$. Removing one element of each such pair from $S$ yields a set of $s - \frac{s^2}{2} \left(\frac{2\epsilon}{c}\right)^{d-1}$ halfspaces that are $\epsilon$-separated. Setting $s = \left(\frac{c}{2\epsilon}\right)^{d-1}$ leads to the desired result.

To handle the non-isotropic case, suppose that $\Sigma$ is the covariance matrix of $D$, so that $\Sigma^{-1/2}$ is the whitening transform. Let $D'$ be the whitened version of $D$, i.e. the distribution obtained by first choosing $x$ from $D$ and then producing $\Sigma^{-1/2} x$. We have $d_D(v, w) = d_{D'}(v\Sigma^{1/2}, w\Sigma^{1/2})$ (because $\operatorname{sign}(v \cdot x) \ne \operatorname{sign}(w \cdot x)$ iff $\operatorname{sign}((v\Sigma^{1/2}) \cdot (\Sigma^{-1/2} x)) \ne \operatorname{sign}((w\Sigma^{1/2}) \cdot (\Sigma^{-1/2} x))$). So we can use an $\epsilon$-packing w.r.t. $D'$ to construct an $\epsilon$-packing of the same size w.r.t. $D$. ∎
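The probabilistic packing construction in the proof can be mimicked numerically. The sketch below is an illustration under assumptions of ours ($D$ taken to be a standard Gaussian, which is isotropic log-concave, and $d_D$ estimated by Monte Carlo): it draws $s$ random unit vectors, estimates the pairwise disagreement distances, and deletes one endpoint of every $\epsilon$-close pair, leaving an (estimated) $\epsilon$-separated set.

```python
import numpy as np

rng = np.random.default_rng(1)
d, eps, s, n_mc = 4, 0.05, 200, 20_000

# Monte-Carlo samples from an isotropic log-concave D (standard Gaussian here).
X = rng.normal(size=(n_mc, d))

# s random unit normal vectors (candidate origin-centered halfspaces).
W = rng.normal(size=(s, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

signs = np.sign(X @ W.T)                  # n_mc x s matrix of predicted labels
# d_D(w_i, w_j) = P(sign(w_i.x) != sign(w_j.x)), estimated from the samples.
agree = signs.T @ signs / n_mc            # in [-1, 1]
dist = (1.0 - agree) / 2.0

# Greedily delete one endpoint of every eps-close pair.
keep = np.ones(s, dtype=bool)
for i in range(s):
    if not keep[i]:
        continue
    close = (dist[i] <= eps) & keep
    close[i] = False
    keep[close] = False

print(f"kept {keep.sum()} of {s} halfspaces; they form an "
      f"(estimated) {eps}-separated set w.r.t. D")
```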

Now we are ready to prove Theorem 13.

Theorem 13. For a small enough constant $\beta$ we have: (1) for any $\beta$-log-concave distribution $D$ whose covariance matrix has full rank, the sample complexity of learning origin-centered linear separators under $D$ in the passive learning model is $\Omega\!\left(\frac{d}{\epsilon} + \frac{1}{\epsilon}\log\frac{1}{\delta}\right)$; (2) the sample complexity of active learning of linear separators under $\beta$-log-concave distributions is $\Omega\!\left(d \log\frac{1}{\epsilon}\right)$.

Proof First, let us consider passive PAC learning. It is known (Long, 1995) that, for any distribution $D$, the sample complexity of passive PAC learning of origin-centered linear separators w.r.t. $D$ is lower bounded in terms of $d$ and the packing number $M_D(\mathcal{C}, 2\epsilon)$; applying Lemma 20 gives an $\Omega(d/\epsilon)$ lower bound. It is also known (Long, 1995) that if, for each $\epsilon$, there is a pair of classifiers $v, w$ such that $d_D(v, w) = \epsilon$, then the sample complexity of PAC learning is $\Omega((1/\epsilon)\log(1/\delta))$; this requirement is satisfied by $D$.

Now let us consider the sample complexity of active learning. As shown in (Kulkarni et al., 1993), in order to output a hypothesis of error at most $\epsilon$ with probability at least $1 - \delta$, where $\delta \le 1/2$, an active learning algorithm that is allowed to make arbitrary yes-no queries must make $\Omega(\log M_D(\mathcal{C}, \epsilon))$ queries. Using this together with Lemma 20 (which gives $\log M_D(\mathcal{C}, \epsilon) = \Omega(d \log(1/\epsilon))$) we get the desired result. ∎

Appendix F. The inseparable case: Disagreement-based active learning

Theorem 14. Let $\beta \ge 0$ be a sufficiently small constant. Assume that $D$ is an isotropic $\beta$-log-concave distribution in $R^d$. For any $w^*$ and any $\epsilon$, $cap_{w^*,D}(\epsilon)$ is $O(C^{1/2}\sqrt{d}\,\log(1/\epsilon))$, where $C = e^{\beta\lceil\log_2(d+1)\rceil}$; in particular, it is $O(d^{1/2 + O(\beta)}\log(1/\epsilon))$. Thus $dis_{w^*,D}(\epsilon) = O(C^{1/2}\sqrt{d}\,\log(1/\epsilon))$ as well.

Proof Roughly, we will show that almost all $x$ classified by a large enough margin by $w^*$ are not in $DIS(B(w^*, r))$, because all hypotheses in $B(w^*, r)$ agree with $w^*$ about how to classify such $x$, and therefore all pairs of hypotheses agree with each other. Consider $w$ such that $d(w, w^*) \le r$; by Theorem 11 we have $\theta(w, w^*) \le c r$. Define $C = e^{\beta\lceil\log_2(d+1)\rceil}$ as in the proof of Theorem 11. For any $x$ such that $\|x\| \le \sqrt{Cd\log(1/r)}$ we have

$|w \cdot x - w^* \cdot x| \le \|w - w^*\| \cdot \|x\| \le c\, r \sqrt{Cd\log(1/r)}.$

Thus, if $x$ also satisfies $|w^* \cdot x| \ge c_1 r \sqrt{Cd\log(1/r)}$ (for a suitable constant $c_1$), we have $(w^* \cdot x)(w \cdot x) > 0$. Since this is true for all such $w$, any such $x$ is not in $DIS(B(w^*, r))$. By Theorem 11 we have, for a constant $c_2$, that

$P_{x \sim D}\!\left(|w^* \cdot x| \le c_1 r \sqrt{Cd\log(1/r)}\right) \le c_2\, r \sqrt{Cd\log(1/r)}.$ Moreover, by Theorem 11 we also have $P_{x \sim D}\!\left[\|x\| \ge \sqrt{Cd\log(1/r)}\right] \le r$.

These both imply $cap_{w^*,D}(\epsilon) = O(C^{1/2}\sqrt{d}\,\log(1/\epsilon))$. ∎
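The quantity bounded by Theorem 14 can be estimated by simulation. In the sketch below, all concrete choices are ours and only illustrative: $D$ is a standard Gaussian, $B(w^*, r)$ is approximated by a finite set of unit vectors at angle at most $r$ from $w^*$ (an angular proxy that, by Theorem 11, matches $d_D$ up to constant factors), and the estimated $P(DIS(B(w^*, r)))/r$ is printed next to the scale $\sqrt{d}\log(1/r)$ from the theorem, ignoring constants.

```python
import numpy as np

rng = np.random.default_rng(2)
d, r, n_x, n_h = 10, 0.05, 50_000, 100

w_star = np.zeros(d); w_star[0] = 1.0

# Angular proxy for B(w*, r): unit vectors at angle <= r from w*.
Z = rng.normal(size=(n_h, d))
Z -= (Z @ w_star)[:, None] * w_star              # components orthogonal to w*
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
angles = r * rng.uniform(size=(n_h, 1))
H = np.vstack([w_star, np.cos(angles) * w_star + np.sin(angles) * Z])

X = rng.normal(size=(n_x, d))                    # isotropic log-concave D
labels = np.sign(X @ H.T)                        # n_x x (n_h + 1)
p_dis = np.mean(labels.max(axis=1) != labels.min(axis=1))

print(f"estimated P(DIS(B(w*, r))) = {p_dis:.4f}, so P(DIS)/r = {p_dis / r:.2f}; "
      f"compare sqrt(d)*log(1/r) = {np.sqrt(d) * np.log(1 / r):.2f}")
```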

Appendix G. Massart and Tsybakov noise

In this section we analyze label complexity for active learning under the popular Massart and Tsybakov noise conditions, proving Theorem 15.

We consider a variant of the Tsybakov noise condition (Mammen and Tsybakov, 1999). We assume that the classifier $h$ that minimizes $P_{(x,y) \sim D_{XY}}(h(x) \ne y)$ is a linear classifier, and that, for the weight vector $w^*$ of that optimal classifier, there exist known parameters $\alpha$ and $a > 0$ such that, for all $w$, we have

$a\,(d_D(w, w^*))^{1/(1-\alpha)} \le err(w) - err(w^*). \qquad (12)$

By generalizing Theorem 4 so that it provides a stronger bound for larger margins, and combining the result with the other lemmas of this paper and techniques from (Balcan et al., 2007), we get the following.

Theorem 15. Let $s = O(\log(1/\epsilon))$. Assume that the distribution $D_{XY}$ satisfies the Tsybakov noise condition for constants $\alpha \in [0, 1)$ and $a > 0$, and that the marginal $D$ on $R^d$ is isotropic log-concave. (1) If $\alpha = 0$, we can find a separator with excess error $\le \epsilon$ with probability $1 - \delta$ using $O(\log(1/\epsilon)\,(d + \log(s/\delta)))$ labeled examples in the active learning model, and $O\!\left(\frac{d + \log(s/\delta)}{\epsilon}\right)$ labeled examples in the passive learning model. (2) If $\alpha > 0$, we can find a separator with excess error $\le \epsilon$ with probability $1 - \delta$ using $O\!\left((1/\epsilon)^{2\alpha} \log^2(1/\epsilon)\,(d + \log(s/\delta))\right)$ labeled examples in the active learning model.

Note that the case where $\alpha = 0$ is more general than the well-known Massart noise condition (Massart and Nedelec, 2006). In this case, for active learning, Theorem 15 improves over the previously best known results (Hanneke and Yang, 2012) by a (disagreement coefficient) $dis_{w^*,D}(\epsilon)$ factor. For passive learning, the bound on the total number of examples needed improves by a $\log(cap_{w^*,D}(\epsilon))$ factor on the previously known best bound of (Giné and Koltchinskii, 2006). It is consistent with recent lower bounds of (Raginsky and Rakhlin, 2011) that include $\log(cap_{w^*,D}(\epsilon))$, because those bounds are for a worst-case domain distribution, subject to a constraint on $cap_{w^*,D}(\epsilon)$.

When $\alpha > 0$, the previously best result for active learning (Hanneke and Yang, 2012) is

$O\!\left((1/\epsilon)^{2\alpha}\, dis_{w^*,D}(\epsilon)\,(d \log(dis_{w^*,D}(\epsilon)) + \log(1/\delta))\right).$

Combining this with our new bound on $dis_{w^*,D}(\epsilon)$ (Theorem 14) we get a bound of

$O\!\left((1/\epsilon)^{2\alpha}\, d^{3/2} \log(1/\epsilon)\,(\log(d) + \log\log(1/\epsilon)) + \log(1/\delta)\right)$

for log-concave distributions. So our Theorem 15 saves roughly a factor of $\sqrt{d}$, at the expense of an extra $\log(1/\epsilon)$ factor.
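To make the comparison concrete, the snippet below plugs representative values into the two bounds, ignoring leading constants and the $\log(1/\delta)$ / $\log(s/\delta)$ terms (which we have no basis to fix); the numbers are purely illustrative.

```python
import math

d, eps, alpha = 100, 1e-3, 0.5
log1e = math.log(1.0 / eps)

# Previous bound combined with Theorem 14, up to constants:
#   (1/eps)^(2*alpha) * d^(3/2) * log(1/eps) * log(d)
previous = (1.0 / eps) ** (2 * alpha) * d ** 1.5 * log1e * math.log(d)

# Theorem 15(2), up to constants:
#   (1/eps)^(2*alpha) * log^2(1/eps) * d
ours = (1.0 / eps) ** (2 * alpha) * log1e ** 2 * d

print(f"previous / ours = {previous / ours:.1f}; "
      f"sqrt(d)*log(d)/log(1/eps) = {math.sqrt(d) * math.log(d) / log1e:.1f}")
```

For these illustrative values the ratio is roughly $\sqrt{d}\log(d)/\log(1/\epsilon)$, matching the trade-off described above.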

We note that the results in this section can also be extended to nearly log-concave distributions by making use of our results in Section 6.1.

G.1. Proof of Theorem 15

We are now ready to discuss the proof of Theorem 15. As in (Balcan et al., 2007), we will use a different algorithm in the inseparable case (Algorithm 2).

G.1.1. MASSART NOISE ($\alpha = 0$)

We start by analyzing Algorithm 2 in the case that $\alpha = 0$; the resulting assumption is more general than the well-known Massart noise condition.

From the log-concavity assumption, the proof of Theorem 5, with slight modifications, proves that there exists $c$ such that for all $w$ we have

$c\, a\, \theta(w, w^*) \le err(w) - err(w^*). \qquad (13)$

Algorithm 2 Margin-based Active Learning (non-separable case)

Input: a sampling oracle for $D$ and a labeling oracle; a sequence of sample sizes $m_k > 0$, $k \in Z^+$; a sequence of cut-off values $b_k > 0$, $k \in Z^+$; a sequence of hypothesis space radii $r_k > 0$, $k \in Z^+$; a sequence of precision values $\epsilon_k > 0$, $k \in Z^+$.

Output: weight vector $\hat{w}_s$.

- Pick random $\hat{w}_0$ with $\|\hat{w}_0\|_2 = 1$.
- Draw $m_1$ examples from $D_X$, label them, and put them into $W$.
- iterate $k = 1, ..., s$
  * find $\hat{w}_k \in B(\hat{w}_{k-1}, r_k)$ with $\|\hat{w}_k\|_2 = 1$ that approximately minimizes the training error: $\sum_{(x,y)\in W} \ell(\hat{w}_k \cdot x, y) \le \min_{w \in B(\hat{w}_{k-1}, r_k)} \sum_{(x,y)\in W} \ell(w \cdot x, y) + m_k \epsilon_k$, where $\ell$ is the 0-1 (misclassification) loss;
  * clear the working set $W$;
  * until $m_{k+1}$ additional data points are labeled, draw a sample $x$ from $D_X$:
    - if $|\hat{w}_k \cdot x| \ge b_k$, reject $x$;
    - otherwise, ask for the label of $x$ and put the labeled example into $W$.
- end iterate
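Below is a minimal runnable sketch of Algorithm 2. Everything concrete in it is an assumption of ours rather than part of the paper: the parameter schedules $b_k, r_k, m_k$, the synthetic noisy data source, and the use of projected subgradient descent on the hinge loss as a stand-in for the approximate 0-1-loss minimization over $B(\hat{w}_{k-1}, r_k)$ that the analysis assumes (the actual schedules are taken from (Balcan et al., 2007)).

```python
import numpy as np

rng = np.random.default_rng(3)
d, s = 10, 6
w_true = np.zeros(d); w_true[0] = 1.0

def sample_x(n):               # sampling oracle for D_X (isotropic log-concave)
    return rng.normal(size=(n, d))

def label(X):                  # labeling oracle with Massart-style noise
    y = np.sign(X @ w_true)
    flip = rng.uniform(size=len(y)) < 0.05
    return np.where(flip, -y, y)

def approx_erm_in_ball(X, y, w_prev, r, iters=300, lr=0.05):
    # Stand-in for "approximately minimize training error over B(w_prev, r_k)":
    # projected subgradient descent on the hinge loss, then re-normalization.
    w = w_prev.copy()
    for _ in range(iters):
        viol = y * (X @ w) < 1.0
        grad = -(X[viol] * y[viol][:, None]).sum(axis=0) / len(y)
        w = w - lr * grad
        if np.linalg.norm(w - w_prev) > r:        # project back onto the ball
            w = w_prev + r * (w - w_prev) / np.linalg.norm(w - w_prev)
        w /= np.linalg.norm(w)                    # keep ||w||_2 = 1
    return w

# Hypothetical schedules (not from the paper): geometrically shrinking band/radius.
b = [0.5 * 2.0 ** -k for k in range(s + 1)]
r = [np.pi * 2.0 ** -k for k in range(s + 1)]
m = [400 + 200 * k for k in range(s + 2)]

w_hat = rng.normal(size=d); w_hat /= np.linalg.norm(w_hat)
X_w = sample_x(m[1]); y_w = label(X_w)            # initial labeled draw -> W

for k in range(1, s + 1):
    w_hat = approx_erm_in_ball(X_w, y_w, w_hat, r[k])
    # Refill W: label only points inside the band |w_hat . x| < b_k.
    kept = []
    while sum(len(c) for c in kept) < m[k + 1]:
        cand = sample_x(4096)
        kept.append(cand[np.abs(cand @ w_hat) < b[k]])
    X_w = np.vstack(kept)[: m[k + 1]]
    y_w = label(X_w)
    angle = np.arccos(np.clip(w_hat @ w_true, -1.0, 1.0))
    print(f"round {k}: angle(w_hat, w*) = {angle:.3f}")
```

In this sketch the printed angle to $w^*$ typically shrinks from round to round, mirroring the excess-error halving established by the induction below, but the surrogate loss means it carries none of the formal guarantees.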

We prove by induction on $k$ that after $k \le s$ iterations, we have $err(\hat{w}_k) - err(w^*) \le c\,a\,2^{-k}$

with probability $1 - \frac{\delta}{2}\sum_{i < k} \frac{1}{(1+s-i)^2}$. The case $k = 1$ follows from classic bounds (Vapnik, 1998).

Assume now the claim is true for $k - 1$ ($k \ge 2$). Then at the $k$-th iteration, we can let $S_1 = \{x : |\hat{w}_{k-1} \cdot x| \le b_{k-1}\}$ and $S_2 = \{x : |\hat{w}_{k-1} \cdot x| > b_{k-1}\}$. By the induction hypothesis, we know that with probability at least $1 - \frac{\delta}{2}\sum_{i < k-1} \frac{1}{(1+s-i)^2}$, $\hat{w}_{k-1}$ has excess error at most $c\,a\,2^{-(k-1)}$, implying, using (13), that $\theta(\hat{w}_{k-1}, w^*) \le 2^{-(k-1)}$. By assumption, $\theta(\hat{w}_{k-1}, \hat{w}_k) \le 2^{-(k-1)}$. From Theorem 4, recalling that $a$ is a constant, we have both:

$P((\hat{w}_{k-1} \cdot x)(\hat{w}_k \cdot x) < 0,\ x \in S_2) \le c\,a\,2^{-k}/4$
$P((\hat{w}_{k-1} \cdot x)(w^* \cdot x) < 0,\ x \in S_2) \le c\,a\,2^{-k}/4.$

Taking the sum, we obtain:

$P((\hat{w}_k \cdot x)(w^* \cdot x) < 0,\ x \in S_2) \le c\,a\,2^{-k}/2. \qquad (14)$

Therefore:

$err(\hat{w}_k) - err(w^*) \le (err(\hat{w}_k|S_1) - err(w^*|S_1))\,P(S_1) + P((\hat{w}_k \cdot x)(w^* \cdot x) < 0,\ x \in S_2)$
$\le (err(\hat{w}_k|S_1) - err(w^*|S_1))\,c_3\, b_{k-1} + c\,a\,2^{-k}/2.$

By standard Vapnik-Chervonenkis bounds, we can choose $C$ s.t. with $m_k$ samples, we obtain

$err(\hat{w}_k|S_1) - err(w^*|S_1) \le c\,a\,2^{-k}/(2 c_3 b_{k-1})$ with probability $1 - (\delta/2)/(1 + s - k)^2$. Therefore $err(\hat{w}_k) - err(w^*) \le c\,a\,2^{-k}$ with probability

$1 - \frac{\delta}{2}\sum_{i<k}\frac{1}{(1+s-i)^2}$, as desired.

The bound on the total number of examples, labeled and unlabeled, follows the same line of argument as Theorem 6, except with the constants of this analysis.

G.1.2. TSYBAKOV NOISE ($\alpha > 0$)

We now treat the more general Tsybakov noise. For this analysis, we need a generalization of Theorem 4 that provides a stronger bound on the probability of large-margin errors, using a stronger assumption on the margin.

Theorem 21 There is a positive constant $c$ such that the following holds. Let $u$ and $v$ be two unit vectors in $R^d$, and assume that $\theta(u, v) = \gamma < \pi/2$. Assume that $D$ is isotropic log-concave in $R^d$. Then, for any $b \ge c\gamma$, we have

$P_{x \sim D}[\operatorname{sign}(u \cdot x) \ne \operatorname{sign}(v \cdot x) \text{ and } |v \cdot x| \ge b] \le C_5\, \gamma\, \exp(-C_6\, b/\gamma), \qquad (15)$

for absolute constants $C_5$ and $C_6$.

Proof Arguing as in the proof of Lemma 3, we may assume without loss of generality that d = 2.

Next, we claim that each member $x$ of $E$ has $\|x\| \ge b/\gamma$. Assume without loss of generality that $v \cdot x$ is positive. (The other case is symmetric.) Then $u \cdot x < 0$, so the angle of $x$ with $u$ is obtuse, i.e. $\theta(x, u) \ge \pi/2$. Since $\theta(u, v) = \gamma$, this implies that

$\theta(x, v) \ge \pi/2 - \gamma. \qquad (16)$

But $x \cdot v \ge b$, and $v$ is unit length, so $\|x\| \cos\theta(x, v) \ge b$, which, using (16), implies $\|x\| \cos(\pi/2 - \gamma) \ge b$, which, since $\cos(\pi/2 - \gamma) \le \gamma$ for all $\gamma \in [0, \pi/2]$, in turn implies $\|x\| \ge b/\gamma$. This implies that, if $B(r)$ is a ball of radius $r$ in $R^2$, then

$P[E] = \sum_{i \ge 1} P[E \cap (B((i+1)(b/\gamma)) \setminus B(i(b/\gamma)))]. \qquad (17)$

Let us bound one of the terms in the RHS. Choose $i \ge 1$. Let $f(x_1, x_2)$ be the density of $D$. We have

$P[E \cap (B((i+1)(b/\gamma)) \setminus B(i(b/\gamma)))] = \int_{(x_1,x_2) \in B((i+1)(b/\gamma)) \setminus B(i(b/\gamma))} 1_E(x_1, x_2)\, f(x_1, x_2)\, dx_1 dx_2.$

Let $R_i = B((i+1)(b/\gamma)) \setminus B(i(b/\gamma))$. Applying the density upper bound from Lemma 2 with $d = 2$, there are constants $C_1$ and $C_2$ such that

$P[E \cap (B((i+1)(b/\gamma)) \setminus B(i(b/\gamma)))] \le \int_{(x_1,x_2) \in R_i} 1_E(x_1, x_2)\, C_1 \exp(-(b/\gamma) C_2\, i)\, dx_1 dx_2$
$= C_1 \exp(-(b/\gamma) C_2\, i) \int_{(x_1,x_2) \in R_i} 1_E(x_1, x_2)\, dx_1 dx_2.$

If we include $B(i(b/\gamma))$ in the integral again, we get

$P[E \cap (B((i+1)(b/\gamma)) \setminus B(i(b/\gamma)))] \le C_1 \exp(-(b/\gamma) C_2\, i) \int_{(x_1,x_2) \in B((i+1)(b/\gamma))} 1_E(x_1, x_2)\, dx_1 dx_2.$

Now, we exploit the fact that the integral above is a rescaling of a probability with respect to the uniform distribution. Let $C_3$ be the volume of the unit ball in $R^2$. Then, we have

$P[E \cap (B((i+1)(b/\gamma)) \setminus B(i(b/\gamma)))] \le C_1 \exp(-(b/\gamma) C_2\, i)\, C_3 (i+1)^2 (b/\gamma)^2\, \gamma/\pi = C_4 (b/\gamma)^2\, \gamma\, (i+1)^2 \exp(-(b/\gamma) C_2\, i),$

for $C_4 = C_1 C_3/\pi$. Returning to (17), we get

$P[E] \le \sum_{i \ge 1} C_4 (b/\gamma)^2\, \gamma\, (i+1)^2 \exp(-(b/\gamma) C_2\, i)$
$= C_4 (b/\gamma)^2\, \gamma \sum_{i \ge 1} (i+1)^2 \exp(-(b/\gamma) C_2\, i)$
$= C_4 (b/\gamma)^2 \times \frac{4 e^{2(b/\gamma)C_2} - 3 e^{(b/\gamma)C_2} + 1}{(e^{(b/\gamma)C_2} - 1)^3} \times \gamma.$

Now, if $b/\gamma > 4/C_2$, we have

$P[E] \le C_4 (b/\gamma)^2 \times \frac{5 e^{2(b/\gamma)C_2}}{(e^{(b/\gamma)C_2}/2)^3} \times \gamma$
$\le C_5\, \gamma\, (b/\gamma)^2 \exp(-(b/\gamma) C_2) \quad (\text{where } C_5 = 40 C_4)$
$= C_5\, \gamma\, \exp(-(b/\gamma) C_2 + 2\ln(b/\gamma))$
$\le C_5\, \gamma\, \exp(-(b/\gamma) C_2/2),$

completing the proof. ∎
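A quick Monte-Carlo check of (15) in a convenient special case (standard Gaussian $D$, which is isotropic log-concave): fix two unit vectors at angle $\gamma$ and estimate the probability of a sign disagreement together with margin at least $b$ on $v$. The particular $\gamma$, the grid of $b$ values, and the sample size are our illustrative choices; since $C_5$ and $C_6$ are unspecified, the snippet only prints the estimates next to the scale $\gamma\exp(-b/\gamma)$.

```python
import numpy as np

rng = np.random.default_rng(4)
gamma, n = 0.3, 1_000_000

u = np.array([1.0, 0.0])
v = np.array([np.cos(gamma), np.sin(gamma)])     # theta(u, v) = gamma

X = rng.normal(size=(n, 2))                      # isotropic log-concave D (d = 2)
su, sv, mv = np.sign(X @ u), np.sign(X @ v), np.abs(X @ v)

for b in (0.3, 0.6, 0.9):
    p = np.mean((su != sv) & (mv >= b))
    print(f"b = {b:.1f}: P[sign disagreement and |v.x| >= b] = {p:.2e}; "
          f"gamma*exp(-b/gamma) = {gamma * np.exp(-b / gamma):.2e}")
```

The estimated probabilities decay geometrically in $b/\gamma$, which is the behaviour the theorem formalizes.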

Now we are ready to prove Theorem 15 in the case that $\alpha > 0$. Under the noise condition (12) and the log-concavity assumption, we obtain that there exists $c$ such that for all $w$ we have:

$a\, c^{1/(1-\alpha)}\, \theta(w, w^*)^{1/(1-\alpha)} \le err(w) - err(w^*).$

Let us denote $\bar{c} = a\, c^{1/(1-\alpha)}$. For all $w$, we have:

$\bar{c}\, \theta(w, w^*)^{1/(1-\alpha)} \le err(w) - err(w^*). \qquad (18)$

We prove by induction on $k$ that after $k \le s$ iterations, we have

$err(\hat{w}_k) - err(w^*) \le \bar{c}\, 2^{-k}$

with probability $1 - \frac{\delta}{2}\sum_{i<k}\frac{1}{(1+s-i)^2}$. The case $k = 1$ follows from classic bounds.

Assume now the claim is true for $k - 1$ ($k \ge 2$). Then at the $k$-th iteration, we can let $S_1 = \{x : |\hat{w}_{k-1} \cdot x| \le b_{k-1}\}$ and $S_2 = \{x : |\hat{w}_{k-1} \cdot x| > b_{k-1}\}$. By the induction hypothesis, we know that with probability at least $1 - \frac{\delta}{2}\sum_{i<k-1}\frac{1}{(1+s-i)^2}$, $\hat{w}_{k-1}$ has excess error at most $\bar{c}\, 2^{-(k-1)}$, implying, using (18), that

$\theta(\hat{w}_{k-1}, w^*) \le 2^{-(k-1)(1-\alpha)}.$

By assumption, $\theta(\hat{w}_{k-1}, \hat{w}_k) \le 2^{-(k-1)(1-\alpha)}$.

Applying Theorem 21, we have both:

$P((\hat{w}_{k-1} \cdot x)(\hat{w}_k \cdot x) < 0,\ x \in S_2) \le \bar{c}\, 2^{-k}/4$
$P((\hat{w}_{k-1} \cdot x)(w^* \cdot x) < 0,\ x \in S_2) \le \bar{c}\, 2^{-k}/4.$

Taking the sum, we obtain:

$P((\hat{w}_k \cdot x)(w^* \cdot x) < 0,\ x \in S_2) \le \bar{c}\, 2^{-k}/2. \qquad (19)$

Therefore:

$err(\hat{w}_k) - err(w^*) \le (err(\hat{w}_k|S_1) - err(w^*|S_1))\, P(S_1) + P((\hat{w}_k \cdot x)(w^* \cdot x) < 0,\ x \in S_2)$
$\le (err(\hat{w}_k|S_1) - err(w^*|S_1))\, b_{k-1} + \bar{c}\, 2^{-k}/2.$

By standard bounds, we can choose $C_1$, $C_2$ and $C_3$ s.t. with $m_k$ samples, we obtain $err(\hat{w}_k|S_1) - err(w^*|S_1) \le \epsilon_k \le \bar{c}\, 2^{-k}/(2 b_{k-1})$ with probability $1 - (\delta/2)/(1 + s - k)^2$. Therefore $err(\hat{w}_k) - err(w^*) \le \bar{c}\, 2^{-k}$ with probability $1 - \frac{\delta}{2}\sum_{i<k}\frac{1}{(1+s-i)^2}$, as desired, completing the proof of Theorem 15.

  1. Note that such examples would not be i.i.d. from the underlying distribution!
  2. Caratheodory's theorem states that if a point $x$ of $R^d$ lies in the convex hull of a set $P$, then there is a subset $\hat{P}$ of $P$ consisting of $d + 1$ or fewer points such that $x$ lies in the convex hull of $\hat{P}$.
  3. The region of disagreement $DIS(\mathcal{C})$ of a set of classifiers $\mathcal{C}$ is the set of instances $x$ s.t. for each $x \in DIS(\mathcal{C})$ there exist two classifiers $f, g \in \mathcal{C}$ that disagree about the label of $x$.
  4. Caratheodory's theorem states that if a point $x$ of $R^d$ lies in the convex hull of a set $P$, then there is a subset $\hat{P}$ of $P$ consisting of $d + 1$ or fewer points such that $x$ lies in the convex hull of $\hat{P}$.