2009 03 04 Exploratory Data Analysis
4 March 2009
Research Methods for Empirical Computer Science
CMPSCI 691DD
Edwin Hubble
What did Hubble see?
Hubble's Law
V = H_0 r
Where:
V = recessional velocity
H_0 = Hubble constant
r = distance (Mpc)
E. Hubble (1929). A relation between distance and radial velocity among extra-galactic nebulae.
Proceedings of the National Academy of Sciences 15(3).
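As a minimal sketch of how the relation V = H_0 r could be estimated from data, the snippet below fits the slope of a line through the origin by least squares. The function name, the km/s velocity unit, and the data points are assumptions made for illustration; the points are synthetic and noise-free, generated from an assumed H_0, and are not Hubble's measurements.

```python
def fit_hubble_constant(distances_mpc, velocities_km_s):
    """Least-squares slope of a line through the origin, V = H0 * r."""
    numerator = sum(r * v for r, v in zip(distances_mpc, velocities_km_s))
    denominator = sum(r * r for r in distances_mpc)
    if denominator == 0:
        raise ValueError("need at least one nonzero distance")
    return numerator / denominator


# Synthetic, noise-free points generated from an assumed H0 (illustrative only).
assumed_h0 = 70.0  # hypothetical value, in km/s per Mpc
distances = [0.5, 1.0, 1.5, 2.0]  # Mpc
velocities = [assumed_h0 * r for r in distances]  # km/s

estimated_h0 = fit_hubble_constant(distances, velocities)
print(f"estimated H0 = {estimated_h0:.1f} km/s/Mpc")
```

With noise-free input the fit simply recovers the assumed constant; with real measurements the scatter around the line is what exploratory plots would reveal.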
The tool that is so dull that you cannot cut yourself on it is not likely to be sharp enough to be either useful or helpful.
John W. Tukey
Random variables
The embarrassingly dogmatic misnomer
They are neither random, nor are they variables
A random variable is...
a function that maps from instances to scales
the numeric result of a nondeterministic experiment
They can be distinguished from fixed variables, whose value can be set or predetermined before the experiment
They are not the individual values (e.g., 5.92), but rather the process of assigning value to instances or (colloquially) the set of values so assigned
Examples
Recall of an IR system, given query, corpus, and designated relevant documents
Size and speed of code produced by a compiler, given source and a target processor
Number of database rows returned, given an anytime query processor, query, database, and time
Lines of code written, given an assignment, language, development environment, and programmer
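The first example can be sketched in code to show a random variable as a function from an instance to a scale. Here the instance is one query, represented by the documents the system retrieved and the documents designated relevant. The function name and the document identifiers are hypothetical.

```python
def recall(retrieved, relevant):
    """Fraction of the designated relevant documents that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined when no documents are relevant")
    return len(relevant & set(retrieved)) / len(relevant)


# Hypothetical instance: one query with four designated relevant documents.
relevant_docs = {"d1", "d2", "d3", "d4"}
retrieved_docs = ["d2", "d4", "d7"]

score = recall(retrieved_docs, relevant_docs)
print(score)
```

The value of `score` is one observation; the random variable is the mapping itself, applied across many queries.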
Notes
The objects of study are usually the systems that enable random variables (e.g., IR systems), rather than the instances that the measures are on (e.g., queries).
What we define as a random variable for a particular experiment can change as we discover deterministic and causal relationships in a given system
Representation of data instances
i.i.d. instances are commonly assumed
Independent: Knowing something about one instance tells you nothing about another
Identically distributed: Drawn from the same probability distribution
Examples?
Queries in TREC data
Programs in SPEC benchmarks
Data sets in UCI repository
Some alternatives
Time series (e.g., users submitting sets of slightly modified queries)
Relational (e.g., router performance embedded in a network)
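One rough way to probe the independence assumption for ordered data is the lag-1 autocorrelation: if knowing one value tells you something about the next, the statistic moves away from zero. This check is not from the slides; it is a common diagnostic offered here as a sketch, and the two sequences are invented.

```python
def lag1_autocorrelation(values):
    """Correlation between each value and its successor in the sequence."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    denominator = sum(d * d for d in deviations)
    if denominator == 0:
        raise ValueError("autocorrelation is undefined for a constant sequence")
    numerator = sum(deviations[i] * deviations[i + 1] for i in range(n - 1))
    return numerator / denominator


# Invented sequences: a steady trend and a strict alternation.
trend = list(range(10))
alternating = [1, -1] * 5

print(lag1_autocorrelation(trend))        # clearly positive
print(lag1_autocorrelation(alternating))  # clearly negative
```

A value near zero does not prove independence; it only fails to show this one kind of dependence.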
Populations and samples
A population is a specified set of instances
An actual finite set of instances (e.g., the UCI data sets for machine learning research)
A generalization of an actual finite set (e.g., the set of all data sets that might be produced by a particular simulator in infinite time)
A purely hypothetical set which can be described mathematically (e.g., the set of all correct Java programs)
Samples are finite subsets of populations
Examples
Populations                              Actual data samples
All possible IR queries                  The TREC 2005 HARD queries
All possible programs written in Java    The SPECjvm98 benchmarks
All Java programmers active in 2005      Students taking CMPSCI 320 in Fall 2005
The SPECjvm98 benchmarks                 A subset of the benchmarks
Four stages of defining a sample
The target population (e.g., all computer programs)
The sampling frame (all programs written in Java or C++)
The selected sample (all programs written by CS undergraduate students in 200-level courses at UMass)
The actual sample (all programs actually turned in)
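The four stages can be sketched as successive filters, each of which narrows the set and is a place where bias can enter. All records, field names, and values below are invented for illustration.

```python
# Hypothetical records standing in for "all computer programs".
programs = [
    {"id": 1, "language": "Java", "author": "undergrad", "turned_in": True},
    {"id": 2, "language": "C++", "author": "undergrad", "turned_in": False},
    {"id": 3, "language": "Python", "author": "undergrad", "turned_in": True},
    {"id": 4, "language": "Java", "author": "professional", "turned_in": True},
    {"id": 5, "language": "C++", "author": "undergrad", "turned_in": True},
]

# Stage 1: the target population.
target_population = programs
# Stage 2: the sampling frame (programs written in Java or C++).
sampling_frame = [p for p in target_population if p["language"] in {"Java", "C++"}]
# Stage 3: the selected sample (programs written by undergraduates).
selected_sample = [p for p in sampling_frame if p["author"] == "undergrad"]
# Stage 4: the actual sample (programs actually turned in).
actual_sample = [p for p in selected_sample if p["turned_in"]]

stage_sizes = [
    len(target_population),
    len(sampling_frame),
    len(selected_sample),
    len(actual_sample),
]
print(stage_sizes)
```

Comparing what is dropped at each stage (here a Python program, a professional's program, and a program never turned in) is one way to start listing sources of bias.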
Why is sampling difficult?
Sampling problems
The target population
The sampling frame
The selected sample
The actual sample
Random sampling in CS
Random sampling isn't easy in CS
...but it's not easy in most sciences
Answer isn't to give up, but to consider how to get closer to the ideal
Define the ideal population
Identify sources of bias in sampling and in subsequent steps of sample definition
Remove or mitigate as many sources of bias as possible
Modify your confidence in your ability to generalize based on your assessment of the match between your actual sample and your desired population
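When a sampling frame can be enumerated, drawing a simple random sample from it removes one source of bias: the analyst's own choice of instances. The sketch below assumes such a frame exists; the function name and the frame contents are hypothetical, and the fixed seed is there only to make the draw repeatable.

```python
import random


def simple_random_sample(frame, k, seed=0):
    """Draw k distinct instances from the frame, each equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(list(frame), k)


# Hypothetical sampling frame of 100 program identifiers.
frame = [f"program_{i}" for i in range(100)]
sample = simple_random_sample(frame, 10, seed=42)
print(sample)
```

This addresses only the step from sampling frame to selected sample; mismatches between the frame and the target population remain.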
Types of scales
Categorical, discrete, or nominal: Values contain no ordering information (e.g., multiple access protocols for underwater networking)
Ordinal: Values indicate order, but no arithmetic operations are meaningful (e.g., "novice", "experienced", and "expert" as designations of programmers participating in an experiment)
Interval: Distances between values are meaningful, but zero point is not meaningful (e.g., degrees Fahrenheit)
Ratio: Distances are meaningful and a zero point is meaningful (e.g., degrees K)
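The difference between interval and ratio scales can be checked numerically. In the sketch below, the two temperatures are arbitrary illustrative values: their ratio changes when the same readings are converted from Fahrenheit to kelvins, while their difference changes only by the constant factor 5/9.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert a Fahrenheit reading to kelvins."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15


# Arbitrary illustrative temperatures.
temp_a_f, temp_b_f = 40.0, 20.0
temp_a_k = fahrenheit_to_kelvin(temp_a_f)
temp_b_k = fahrenheit_to_kelvin(temp_b_f)

# Ratios depend on where zero sits, so they disagree across the two scales.
ratio_fahrenheit = temp_a_f / temp_b_f
ratio_kelvin = temp_a_k / temp_b_k

# Differences agree up to the unit conversion factor.
difference_f = temp_a_f - temp_b_f
difference_k = temp_a_k - temp_b_k

print(ratio_fahrenheit, ratio_kelvin)
print(difference_f, difference_k)
```

Saying that 40 degrees Fahrenheit is "twice as hot" as 20 is therefore an artifact of the arbitrary zero point, whereas the kelvin ratio refers to a meaningful zero.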
Data transformations
Downgrading type (e.g., interval to ordinal)
Shifting intervals
Tukey's ladder of powers: trans = original^(1-b)
E.g.: b = -2 -> original^3, b = 0.5 -> sqrt(original), b = 2 -> 1/original
Combining several variables
Normalize measurements (e.g., Simsek & Jensen 2005, normalized to optimal)
Remove unwanted factors (e.g., remove file read times from total compile times)
Consider relation of two variables (e.g., Kirkpatrick & Selman, vertex/edge ratio)
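A minimal sketch of the ladder-of-powers transformation, taking the exponent to be 1 - b, which is the reading consistent with the slide's three examples (cube, square root, reciprocal). The function name is hypothetical, and the use of the logarithm when the exponent is zero is the usual ladder convention rather than something stated on the slide.

```python
import math


def ladder_transform(original, b):
    """Apply trans = original ** (1 - b) for a positive input value."""
    if original <= 0:
        raise ValueError("this sketch assumes strictly positive values")
    power = 1 - b
    if power == 0:
        # Assumption: the conventional ladder uses log in place of power 0.
        return math.log(original)
    return original ** power


print(ladder_transform(2.0, -2))   # cube
print(ladder_transform(9.0, 0.5))  # square root
print(ladder_transform(4.0, 2))    # reciprocal
```

Moving up or down the ladder is typically tried interactively, re-plotting after each step to see whether skew or curvature has been reduced.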
Exploratory data analysis
Exploratory data analysis (EDA) employs a variety of techniques to
maximize insight into a data set;
uncover underlying structure;
extract important variables;
detect outliers and anomalies;
test underlying assumptions;
develop parsimonious models; and
determine optimal factor settings
The EDA approach is precisely that: an approach, not a set of techniques, but an attitude/philosophy about how a data analysis should be carried out.
Source: http://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm
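As one concrete instance of "detect outliers and anomalies", the sketch below flags values that fall more than 1.5 interquartile ranges outside the quartiles, the rule commonly used for boxplot whiskers. The rule is not named on the slide, the function name is hypothetical, and the measurements are invented.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [v for v in values if v < low_fence or v > high_fence]


# Invented measurements with one value far from the rest.
measurements = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers = flag_outliers(measurements)
print(outliers)
```

In the EDA spirit, a flagged value is a prompt to look at the instance, not a license to delete it.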
Why EDA?
Data analysis tools are typically used for
Hypothesis testing
Parameter estimation
Graphics tools are typically used for presentation
However, much of the quality of scientific work is determined by the quality of the hypotheses and models used by the researcher
Can data analysis help suggest hypotheses?
Resources
Books
Exploratory Data Analysis, Tukey (1977)
Data Analysis and Regression, Mosteller and Tukey (1977)
Interactive Data Analysis, Hoaglin (1977)
The ABC's of EDA, Velleman and Hoaglin (1981)
Software
Data Desk (Data Description)
Fathom (Keypress)
XGobi (AT&T Research)
Demo