CS229 Lecture notes
Andrew Ng
Part VI
Learning Theory
1 Bias/variance tradeoff
When talking about linear regression, we discussed the problem of whether to fit a "simple" model such as the linear "y = θ0 + θ1x," or a more "complex" model such as the polynomial "y = θ0 + θ1x + ··· + θ5x^5." We saw the following example:

[Figure: three plots of y versus x (x ranging from 0 to 7) showing, from left to right, a linear fit, a quadratic fit, and a 5th order polynomial fit to the same dataset.]

Fitting a 5th order polynomial to the data (rightmost figure) did not result in a good model. Specifically, even though the 5th order polynomial did a very good job predicting y (say, prices of houses) from x (say, living area) for the examples in the training set, we do not expect the model shown to be a good one for predicting the prices of houses not in the training set. In other words, what has been learned from the training set does not generalize well to other houses. The generalization error (which will be made formal shortly) of a hypothesis is its expected error on examples not necessarily in the training set.
Both the models in the leftmost and the rightmost figures above have large generalization error. However, the problems that the two models suffer from are very different. If the relationship between y and x is not linear,
then even if we were fitting a linear model to a very large amount of training data, the linear model would still fail to accurately capture the structure in the data. Informally, we define the bias of a model to be the expected generalization error even if we were to fit it to a very (say, infinitely) large training set. Thus, for the problem above, the linear model suffers from large bias, and may underfit (i.e., fail to capture structure exhibited by) the data.

Apart from bias, there's a second component to the generalization error, consisting of the variance of a model fitting procedure. Specifically, when fitting a 5th order polynomial as in the rightmost figure, there is a large risk that we're fitting patterns in the data that happened to be present in our small, finite training set, but that do not reflect the wider pattern of the relationship between x and y. This could be, say, because in the training set we just happened by chance to get a slightly more-expensive-than-average house here, and a slightly less-expensive-than-average house there, and so on. By fitting these "spurious" patterns in the training set, we might again obtain a model with large generalization error. In this case, we say the model has large variance.¹
Often, there is a tradeoff between bias and variance. If our model is too "simple" and has very few parameters, then it may have large bias (but small variance); if it is too "complex" and has very many parameters, then it may suffer from large variance (but have smaller bias). In the example above, fitting a quadratic function does better than either of the extremes of a first or a fifth order polynomial.
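The tradeoff just described can be seen numerically. The following sketch fits polynomials of degree 1, 2, and 5 to a small training set drawn from a hypothetical quadratic trend plus noise (the data-generating function and all constants here are illustrative, not from the notes), and compares training error to held-out error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: a quadratic trend plus Gaussian noise.
def true_fn(x):
    return 1.0 + 0.5 * x - 0.05 * x**2

# Small training set; large held-out set to approximate generalization error.
x_train = rng.uniform(0, 7, size=15)
y_train = true_fn(x_train) + rng.normal(0, 0.3, size=15)
x_test = rng.uniform(0, 7, size=1000)
y_test = true_fn(x_test) + rng.normal(0, 0.3, size=1000)

for degree in (1, 2, 5):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Training error can only decrease as the degree grows (the higher-degree model class contains the lower-degree one), while held-out error typically favors the intermediate, quadratic model.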
2 Preliminaries
In this set of notes, we begin our foray into learning theory. Apart from being interesting and enlightening in its own right, this discussion will also help us hone our intuitions and derive rules of thumb about how to best apply learning algorithms in different settings. We will also seek to answer a few questions: First, can we make formal the bias/variance tradeoff that was just discussed? This will also eventually lead us to talk about model selection methods, which can, for instance, automatically decide what order polynomial to fit to a training set. Second, in machine learning it's really
¹In these notes, we will not try to formalize the definitions of bias and variance beyond this discussion. While bias and variance are straightforward to define formally for, e.g., linear regression, there have been several proposals for the definitions of bias and variance for classification, and there is as yet no agreement on what is the "right" and/or the most useful formalism.
generalization error that we care about, but most learning algorithms fit their models to the training set. Why should doing well on the training set tell us anything about generalization error? Specifically, can we relate error on the training set to generalization error? Third and finally, are there conditions under which we can actually prove that learning algorithms will work well?

We start with two simple but very useful lemmas.

Lemma. (The union bound). Let A1, A2, ..., Ak be k different events (that may not be independent). Then
P(A1 ∪ ··· ∪ Ak) ≤ P(A1) + ... + P(Ak).
In probability theory, the union bound is usually stated as an axiom (and thus we won't try to prove it), but it also makes intuitive sense: The probability of any one of k events happening is at most the sum of the probabilities of the k different events.
Lemma. (Hoeffding inequality) Let Z1, ..., Zm be m independent and identically distributed (iid) random variables drawn from a Bernoulli(φ) distribution. I.e., P(Zi = 1) = φ, and P(Zi = 0) = 1 − φ. Let φ̂ = (1/m) Σ_{i=1}^m Zi be the mean of these random variables, and let any γ > 0 be fixed. Then

P(|φ − φ̂| > γ) ≤ 2 exp(−2γ²m)
This lemma (which in learning theory is also called the Chernoff bound) says that if we take φ̂, the average of m Bernoulli(φ) random variables, to be our estimate of φ, then the probability of our being far from the true value is small, so long as m is large. Another way of saying this is that if you have a biased coin whose chance of landing on heads is φ, then if you toss it m times and calculate the fraction of times that it came up heads, that will be a good estimate of φ with high probability (if m is large).
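The coin-tossing reading of Hoeffding's inequality is easy to check by simulation. This sketch (all constants are illustrative) draws many length-m samples of a biased coin and compares the empirical frequency of a large deviation against the bound 2 exp(−2γ²m):

```python
import numpy as np

rng = np.random.default_rng(1)
phi, m, gamma = 0.3, 200, 0.1   # illustrative bias, sample size, tolerance
trials = 20_000

# Draw `trials` samples of m Bernoulli(phi) variables each and measure how
# often the sample mean phi_hat misses phi by more than gamma.
samples = rng.random((trials, m)) < phi
phi_hat = samples.mean(axis=1)
empirical = np.mean(np.abs(phi_hat - phi) > gamma)
bound = 2 * np.exp(-2 * gamma**2 * m)

print(f"empirical P(|phi_hat - phi| > gamma): {empirical:.4f}")
print(f"Hoeffding bound 2exp(-2*gamma^2*m):   {bound:.4f}")
```

The empirical deviation probability comes out well below the bound; Hoeffding's inequality is distribution-free and therefore typically loose for any particular φ.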
Using just these two lemmas, we will be able to prove some of the deepest and most important results in learning theory.
To simplify our exposition, let's restrict our attention to binary classification in which the labels are y ∈ {0, 1}. Everything we'll say here generalizes to other problems, including regression and multi-class classification.
We assume we are given a training set S = {(x^(i), y^(i)); i = 1, ..., m} of size m, where the training examples (x^(i), y^(i)) are drawn iid from some probability distribution D. For a hypothesis h, we define the training error (also called the empirical risk or empirical error in learning theory) to be

ε̂(h) = (1/m) Σ_{i=1}^m 1{h(x^(i)) ≠ y^(i)}.
This is just the fraction of training examples that h misclassifies. When we want to make explicit the dependence of ε̂(h) on the training set S, we may also write this as ε̂_S(h). We also define the generalization error to be
ε(h) = P_{(x,y)~D}(h(x) ≠ y).
I.e., this is the probability that, if we now draw a new example (x, y) from the distribution D, h will misclassify it.
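These two definitions can be made concrete with a small sketch. Here a hypothetical distribution D (x uniform on [0, 1], true threshold at 0.5, labels flipped 10% of the time; all names and constants are illustrative) is used to compute ε̂(h) on a small training set and to estimate ε(h) by Monte Carlo on a large fresh sample:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical D: x ~ Uniform(0,1), y = 1{x >= 0.5} with 10% label noise.
def draw(n):
    x = rng.random(n)
    y = (x >= 0.5).astype(int)
    flip = rng.random(n) < 0.1
    return x, y ^ flip

# A fixed hypothesis h (deliberately slightly off the true threshold).
def h(x):
    return (x >= 0.4).astype(int)

x_train, y_train = draw(50)
train_err = np.mean(h(x_train) != y_train)   # empirical risk on S
x_big, y_big = draw(200_000)
gen_err = np.mean(h(x_big) != y_big)         # Monte Carlo estimate of ε(h)

print(f"training error:            {train_err:.3f}")
print(f"generalization error est.: {gen_err:.3f}")
```

For this D, ε(h) works out analytically to 0.4·0.1 + 0.1·0.9 + 0.5·0.1 = 0.18, while ε̂(h) fluctuates around it from one draw of S to the next, which is exactly the gap the following sections quantify.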
Note that we have assumed that the training data was drawn from the same distribution D with which we're going to evaluate our hypotheses (in the definition of generalization error). This is sometimes also referred to as one of the PAC assumptions.²
Consider the setting of linear classification, and let hθ(x) = 1{θᵀx ≥ 0}. What's a reasonable way of fitting the parameters θ? One approach is to try to minimize the training error, and pick

θ̂ = arg min_θ ε̂(hθ).
We call this process empirical risk minimization (ERM), and the resulting hypothesis output by the learning algorithm is ĥ = h_θ̂. We think of ERM as the most "basic" learning algorithm, and it will be this algorithm that we focus on in these notes. (Algorithms such as logistic regression can also be viewed as approximations to empirical risk minimization.)
In our study of learning theory, it will be useful to abstract away from the specific parameterization of hypotheses and from issues such as whether we're using a linear classifier. We define the hypothesis class H used by a learning algorithm to be the set of all classifiers considered by it. For linear classification, H = {hθ : hθ(x) = 1{θᵀx ≥ 0}, θ ∈ R^{n+1}} is thus the set of all classifiers over X (the domain of the inputs) where the decision boundary is linear. More broadly, if we were studying, say, neural networks, then we could let H be the set of all classifiers representable by some neural network architecture.
Empirical risk minimization can now be thought of as a minimization over the class of functions H, in which the learning algorithm picks the hypothesis:

ĥ = arg min_{h∈H} ε̂(h)
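For a finite class H, this minimization is literally a search over the candidate classifiers. The sketch below (an illustrative stand-in: H is a grid of one-dimensional threshold classifiers h_t(x) = 1{x ≥ t}, and the data come from a hypothetical noisy-threshold distribution) implements ERM by evaluating the training error of every h in H and keeping the best:

```python
import numpy as np

rng = np.random.default_rng(3)

# Finite hypothesis class H: threshold classifiers h_t(x) = 1{x >= t}.
thresholds = np.linspace(0.0, 1.0, 21)

# Training set from a hypothetical D: true threshold 0.5, 10% label noise.
m = 100
x = rng.random(m)
y = ((x >= 0.5).astype(int) ^ (rng.random(m) < 0.1)).astype(int)

# ERM: evaluate the empirical risk of each h in H, pick the minimizer.
train_errors = np.array(
    [np.mean((x >= t).astype(int) != y) for t in thresholds]
)
t_hat = thresholds[np.argmin(train_errors)]
print(f"ERM picks threshold {t_hat:.2f}, training error {train_errors.min():.3f}")
```

Note that ERM only looks at training error; whether the selected ĥ also has small generalization error is precisely the question the next section addresses.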
²PAC stands for "probably approximately correct," which is a framework and set of assumptions under which numerous results on learning theory were proved. Of these, the assumption of training and testing on the same distribution, and the assumption of the independently drawn training examples, were the most important.
3 The case of finite H

Let's start by considering a learning problem in which we have a finite hypothesis class H = {h1, ..., hk} consisting of k hypotheses. Thus, H is just a set of k functions mapping from X to {0, 1}, and empirical risk minimization selects ĥ to be whichever of these k functions has the smallest training error.
We would like to give guarantees on the generalization error of ĥ. Our strategy for doing so will be in two parts: First, we will show that ε̂(h) is a reliable estimate of ε(h) for all h. Second, we will show that this implies an upper bound on the generalization error of ĥ.
Take any one, fixed, hi ∈ H. Consider a Bernoulli random variable Z whose distribution is defined as follows. We're going to sample (x, y) ~ D. Then, we set Z = 1{hi(x) ≠ y}. I.e., we're going to draw one example, and let Z indicate whether hi misclassifies it. Similarly, we also define Zj = 1{hi(x^(j)) ≠ y^(j)}. Since our training set was drawn iid from D, Z and the Zj's have the same distribution.
We see that the misclassification probability on a randomly drawn example, that is, ε(hi), is exactly the expected value of Z (and Zj). Moreover, the training error can be written

ε̂(hi) = (1/m) Σ_{j=1}^m Zj.
Thus, ε̂(hi) is exactly the mean of the m random variables Zj that are drawn iid from a Bernoulli distribution with mean ε(hi). Hence, we can apply the Hoeffding inequality, and obtain

P(|ε(hi) − ε̂(hi)| > γ) ≤ 2 exp(−2γ²m).
This shows that, for our particular hi, training error will be close to generalization error with high probability, assuming m is large. But we don't just want to guarantee that ε(hi) will be close to ε̂(hi) (with high probability) for only one particular hi. We want to prove that this will be true simultaneously for all h ∈ H. To do so, let Ai denote the event that |ε(hi) − ε̂(hi)| > γ. We've already shown that, for any particular Ai, it holds true that P(Ai) ≤ 2 exp(−2γ²m). Thus, using the union bound, we