CS229 Lecture notes
Andrew Ng
Part VI
Learning Theory
1 Bias/variance tradeoff
When talking about linear regression, we discussed the problem of whether to fit a "simple" model such as the linear "y = θ0 + θ1x," or a more "complex" model such as the polynomial "y = θ0 + θ1x + ··· + θ5x^5." We saw the following example:

[Figure: three plots of y versus x (x ranging from 0 to 7) showing, from left to right, a linear fit, a quadratic fit, and a 5th order polynomial fit to the same dataset.]

Fitting a 5th order polynomial to the data (rightmost figure) did not result in a good model. Specifically, even though the 5th order polynomial did a very good job predicting y (say, prices of houses) from x (say, living area) for the examples in the training set, we do not expect the model shown to be a good one for predicting the prices of houses not in the training set. In other words, what has been learned from the training set does not generalize well to other houses. The generalization error (which will be made formal shortly) of a hypothesis is its expected error on examples not necessarily in the training set.
Both the models in the leftmost and the rightmost figures above have large generalization error. However, the problems that the two models suffer from are very different. If the relationship between y and x is not linear,
then even if we were fitting a linear model to a very large amount of training data, the linear model would still fail to accurately capture the structure in the data. Informally, we define the bias of a model to be the expected generalization error even if we were to fit it to a very (say, infinitely) large training set. Thus, for the problem above, the linear model suffers from large bias, and may underfit (i.e., fail to capture structure exhibited by) the data.

Apart from bias, there's a second component to the generalization error, consisting of the variance of a model fitting procedure. Specifically, when fitting a 5th order polynomial as in the rightmost figure, there is a large risk that we're fitting patterns in the data that happened to be present in our small, finite training set, but that do not reflect the wider pattern of the relationship between x and y. This could be, say, because in the training set we just happened by chance to get a slightly more-expensive-than-average house here, and a slightly less-expensive-than-average house there, and so on. By fitting these "spurious" patterns in the training set, we might again obtain a model with large generalization error. In this case, we say the model has large variance.¹
Often, there is a tradeoff between bias and variance. If our model is too "simple" and has very few parameters, then it may have large bias (but small variance); if it is too "complex" and has very many parameters, then it may suffer from large variance (but have smaller bias). In the example above, fitting a quadratic function does better than either of the extremes of a first or a fifth order polynomial.
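The tradeoff just described can be seen numerically. The following sketch fits polynomials of degree 1, 2, and 5 to a small training set drawn from a hypothetical quadratic trend plus noise (the data-generating function and all constants here are illustrative, not from the notes), and compares training error to held-out error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: a quadratic trend plus Gaussian noise.
def true_fn(x):
    return 1.0 + 0.5 * x - 0.05 * x**2

# Small training set; large held-out set to approximate generalization error.
x_train = rng.uniform(0, 7, size=15)
y_train = true_fn(x_train) + rng.normal(0, 0.3, size=15)
x_test = rng.uniform(0, 7, size=1000)
y_test = true_fn(x_test) + rng.normal(0, 0.3, size=1000)

for degree in (1, 2, 5):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Training error can only decrease as the degree grows (the higher-degree model class contains the lower-degree one), while held-out error typically favors the intermediate, quadratic model.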
2 Preliminaries
In this set of notes, we begin our foray into learning theory. Apart from being interesting and enlightening in its own right, this discussion will also help us hone our intuitions and derive rules of thumb about how to best apply learning algorithms in different settings. We will also seek to answer a few questions: First, can we make formal the bias/variance tradeoff that was just discussed? This will also eventually lead us to talk about model selection methods, which can, for instance, automatically decide what order polynomial to fit to a training set. Second, in machine learning it's really
¹In these notes, we will not try to formalize the definitions of bias and variance beyond this discussion. While bias and variance are straightforward to define formally for, e.g., linear regression, there have been several proposals for the definitions of bias and variance for classification, and there is as yet no agreement on what is the "right" and/or the most useful formalism.
generalization error that we care about, but most learning algorithms fit their models to the training set. Why should doing well on the training set tell us anything about generalization error? Specifically, can we relate error on the training set to generalization error? Third and finally, are there conditions under which we can actually prove that learning algorithms will work well?

We start with two simple but very useful lemmas.

Lemma. (The union bound). Let A1, A2, ..., Ak be k different events (that may not be independent). Then
P(A1 ∪ ··· ∪ Ak) ≤ P(A1) + ... + P(Ak).
In probability theory, the union bound is usually stated as an axiom (and thus we won't try to prove it), but it also makes intuitive sense: The probability of any one of k events happening is at most the sum of the probabilities of the k different events.
Lemma. (Hoeffding inequality) Let Z1, ..., Zm be m independent and identically distributed (iid) random variables drawn from a Bernoulli(φ) distribution. I.e., P(Zi = 1) = φ, and P(Zi = 0) = 1 − φ. Let φ̂ = (1/m) Σ_{i=1}^m Zi be the mean of these random variables, and let any γ > 0 be fixed. Then

P(|φ − φ̂| > γ) ≤ 2 exp(−2γ²m)
This lemma (which in learning theory is also called the Chernoff bound) says that if we take φ̂, the average of m Bernoulli(φ) random variables, to be our estimate of φ, then the probability of our being far from the true value is small, so long as m is large. Another way of saying this is that if you have a biased coin whose chance of landing on heads is φ, then if you toss it m times and calculate the fraction of times that it came up heads, that will be a good estimate of φ with high probability (if m is large).
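The coin-tossing reading of Hoeffding's inequality is easy to check by simulation. This sketch (all constants are illustrative) draws many length-m samples of a biased coin and compares the empirical frequency of a large deviation against the bound 2 exp(−2γ²m):

```python
import numpy as np

rng = np.random.default_rng(1)
phi, m, gamma = 0.3, 200, 0.1   # illustrative bias, sample size, tolerance
trials = 20_000

# Draw `trials` samples of m Bernoulli(phi) variables each and measure how
# often the sample mean phi_hat misses phi by more than gamma.
samples = rng.random((trials, m)) < phi
phi_hat = samples.mean(axis=1)
empirical = np.mean(np.abs(phi_hat - phi) > gamma)
bound = 2 * np.exp(-2 * gamma**2 * m)

print(f"empirical P(|phi_hat - phi| > gamma): {empirical:.4f}")
print(f"Hoeffding bound 2exp(-2*gamma^2*m):   {bound:.4f}")
```

The empirical deviation probability comes out well below the bound; Hoeffding's inequality is distribution-free and therefore typically loose for any particular φ.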
Using just these two lemmas, we will be able to prove some of the deepest and most important results in learning theory.
To simplify our exposition, let's restrict our attention to binary classification in which the labels are y ∈ {0, 1}. Everything we'll say here generalizes to other problems, including regression and multi-class classification.
We assume we are given a training set S = {(x^(i), y^(i)); i = 1, ..., m} of size m, where the training examples (x^(i), y^(i)) are drawn iid from some probability distribution D. For a hypothesis h, we define the training error (also called the empirical risk or empirical error in learning theory) to be

ε̂(h) = (1/m) Σ_{i=1}^m 1{h(x^(i)) ≠ y^(i)}.
This is just the fraction of training examples that h misclassifies. When we want to make explicit the dependence of ε̂(h) on the training set S, we may also write this as ε̂_S(h). We also define the generalization error to be
ε(h) = P_{(x,y)~D}(h(x) ≠ y).
I.e., this is the probability that, if we now draw a new example (x, y) from the distribution D, h will misclassify it.
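These two definitions can be made concrete with a small sketch. Here a hypothetical distribution D (x uniform on [0, 1], true threshold at 0.5, labels flipped 10% of the time; all names and constants are illustrative) is used to compute ε̂(h) on a small training set and to estimate ε(h) by Monte Carlo on a large fresh sample:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical D: x ~ Uniform(0,1), y = 1{x >= 0.5} with 10% label noise.
def draw(n):
    x = rng.random(n)
    y = (x >= 0.5).astype(int)
    flip = rng.random(n) < 0.1
    return x, y ^ flip

# A fixed hypothesis h (deliberately slightly off the true threshold).
def h(x):
    return (x >= 0.4).astype(int)

x_train, y_train = draw(50)
train_err = np.mean(h(x_train) != y_train)   # empirical risk on S
x_big, y_big = draw(200_000)
gen_err = np.mean(h(x_big) != y_big)         # Monte Carlo estimate of ε(h)

print(f"training error:            {train_err:.3f}")
print(f"generalization error est.: {gen_err:.3f}")
```

For this D, ε(h) works out analytically to 0.4·0.1 + 0.1·0.9 + 0.5·0.1 = 0.18, while ε̂(h) fluctuates around it from one draw of S to the next, which is exactly the gap the following sections quantify.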
Note that we have assumed that the training data was drawn from the same distribution D with which we're going to evaluate our hypotheses (in the definition of generalization error). This is sometimes also referred to as one of the PAC assumptions.²
Consider the setting of linear classification, and let hθ(x) = 1{θᵀx ≥ 0}. What's a reasonable way of fitting the parameters θ? One approach is to try to minimize the training error, and pick

θ̂ = arg min_θ ε̂(hθ).
We call this process empirical risk minimization (ERM), and the resulting hypothesis output by the learning algorithm is ĥ = h_θ̂. We think of ERM as the most "basic" learning algorithm, and it will be this algorithm that we focus on in these notes. (Algorithms such as logistic regression can also be viewed as approximations to empirical risk minimization.)
In our study of learning theory, it will be useful to abstract away from the specific parameterization of hypotheses and from issues such as whether we're using a linear classifier. We define the hypothesis class H used by a learning algorithm to be the set of all classifiers considered by it. For linear classification, H = {hθ : hθ(x) = 1{θᵀx ≥ 0}, θ ∈ R^{n+1}} is thus the set of all classifiers over X (the domain of the inputs) where the decision boundary is linear. More broadly, if we were studying, say, neural networks, then we could let H be the set of all classifiers representable by some neural network architecture.
Empirical risk minimization can now be thought of as a minimization over the class of functions H, in which the learning algorithm picks the hypothesis:

ĥ = arg min_{h∈H} ε̂(h)
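For a finite class H, this minimization is literally a search over the candidate classifiers. The sketch below (an illustrative stand-in: H is a grid of one-dimensional threshold classifiers h_t(x) = 1{x ≥ t}, and the data come from a hypothetical noisy-threshold distribution) implements ERM by evaluating the training error of every h in H and keeping the best:

```python
import numpy as np

rng = np.random.default_rng(3)

# Finite hypothesis class H: threshold classifiers h_t(x) = 1{x >= t}.
thresholds = np.linspace(0.0, 1.0, 21)

# Training set from a hypothetical D: true threshold 0.5, 10% label noise.
m = 100
x = rng.random(m)
y = ((x >= 0.5).astype(int) ^ (rng.random(m) < 0.1)).astype(int)

# ERM: evaluate the empirical risk of each h in H, pick the minimizer.
train_errors = np.array(
    [np.mean((x >= t).astype(int) != y) for t in thresholds]
)
t_hat = thresholds[np.argmin(train_errors)]
print(f"ERM picks threshold {t_hat:.2f}, training error {train_errors.min():.3f}")
```

Note that ERM only looks at training error; whether the selected ĥ also has small generalization error is precisely the question the next section addresses.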
²PAC stands for "probably approximately correct," which is a framework and set of assumptions under which numerous results on learning theory were proved. Of these, the assumption of training and testing on the same distribution, and the assumption of the independently drawn training examples, were the most important.
3 The case of finite H

Let's start by considering a learning problem in which we have a finite hypothesis class H = {h1, ..., hk} consisting of k hypotheses. Thus, H is just a set of k functions mapping from X to {0, 1}, and empirical risk minimization selects ĥ to be whichever of these k functions has the smallest training error.
We would like to give guarantees on the generalization error of ĥ. Our strategy for doing so will be in two parts: First, we will show that ε̂(h) is a reliable estimate of ε(h) for all h. Second, we will show that this implies an upper bound on the generalization error of ĥ.
Take any one, fixed, hi ∈ H. Consider a Bernoulli random variable Z whose distribution is defined as follows. We're going to sample (x, y) ~ D. Then, we set Z = 1{hi(x) ≠ y}. I.e., we're going to draw one example, and let Z indicate whether hi misclassifies it. Similarly, we also define Zj = 1{hi(x^(j)) ≠ y^(j)}. Since our training set was drawn iid from D, Z and the Zj's have the same distribution.
We see that the misclassification probability on a randomly drawn example, that is, ε(hi), is exactly the expected value of Z (and Zj). Moreover, the training error can be written

ε̂(hi) = (1/m) Σ_{j=1}^m Zj.
Thus, ε̂(hi) is exactly the mean of the m random variables Zj that are drawn iid from a Bernoulli distribution with mean ε(hi). Hence, we can apply the Hoeffding inequality, and obtain

P(|ε(hi) − ε̂(hi)| > γ) ≤ 2 exp(−2γ²m).
This shows that, for our particular hi, training error will be close to generalization error with high probability, assuming m is large. But we don't just want to guarantee that ε(hi) will be close to ε̂(hi) (with high probability) for only one particular hi. We want to prove that this will be true simultaneously for all h ∈ H. To do so, let Ai denote the event that |ε(hi) − ε̂(hi)| > γ. We've already shown that, for any particular Ai, it holds true that P(Ai) ≤ 2 exp(−2γ²m). Thus, using the union bound, we