
2009 03 04 Exploratory Data Analysis

This document provides an overview of exploratory data analysis (EDA). It discusses the goals of EDA, which include uncovering the underlying structure of data and determining optimal factor settings. EDA uses techniques like data transformations and graphics to maximize insight and suggest hypotheses, rather than just testing hypotheses. Resources for EDA include classic books by Tukey and software tools like Data Desk and Fathom.


Exploratory Data Analysis

4 March 2009

Research Methods for
Empirical Computer Science
CMPSCI 691DD
Edwin Hubble
What did Hubble see?
Hubble's Law

V = H_0 r
Where:
V = recessional velocity
H_0 = Hubble constant
r = distance (Mpc)

E. Hubble (1929). A relation between distance and radial velocity among extra-galactic nebulae.
Proceedings of the National Academy of Sciences 15(3).
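
As a rough illustration of how a relation such as V = H_0 r can be read off data, the sketch below fits the slope of a line through the origin by least squares. The function name `fit_hubble_constant` and the data points are hypothetical and chosen only for illustration; they are not Hubble's 1929 measurements.

```python
# Least-squares fit of V = H0 * r with no intercept term.
# The data below are made-up illustrative values, NOT Hubble's measurements.

def fit_hubble_constant(distances_mpc, velocities_km_s):
    """Return the slope H0 that minimizes sum((V - H0 * r)^2)."""
    numerator = sum(r * v for r, v in zip(distances_mpc, velocities_km_s))
    denominator = sum(r * r for r in distances_mpc)
    if denominator == 0:
        raise ValueError("need at least one nonzero distance")
    return numerator / denominator

# Hypothetical (distance in Mpc, velocity in km/s) observations.
distances = [0.5, 1.0, 1.5, 2.0]
velocities = [250.0, 500.0, 750.0, 1000.0]

h0 = fit_hubble_constant(distances, velocities)
print(f"estimated H0 = {h0:.1f} km/s per Mpc")
```

Forcing the line through the origin matches the form of the law as written; a fit with a free intercept would be a different model.
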
The tool that is so dull that
you cannot cut yourself on it
is not likely to be sharp enough
to be either useful or helpful.
John W. Tukey
Random variables

- The embarrassingly dogmatic misnomer
  - They are neither random, nor are they variables
- A random variable is...
  - a function that maps from instances to scales
  - the numeric result of a nondeterministic experiment
- They can be distinguished from fixed variables, whose value can be set or predetermined before the experiment
- They are not the individual values (e.g., 5.92), but rather the process of assigning value to instances or (colloquially) the set of values so assigned
Examples

- Recall of an IR system, given query, corpus, and designated relevant documents
- Size and speed of code produced by a compiler, given source and a target processor
- Number of database rows returned, given an anytime query processor, query, database, and time
- Lines of code written, given an assignment, language, development environment, and programmer
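
To make the idea of a random variable as a mapping from instances to a scale concrete, here is a minimal sketch using the first example, recall of an IR system. The query names, document identifiers, and the `recall` helper are hypothetical.

```python
def recall(retrieved, relevant):
    """Fraction of the designated relevant documents that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined without relevant documents")
    return len(relevant & set(retrieved)) / len(relevant)

# Hypothetical instances: (documents the system returned, designated relevant documents).
instances = {
    "query-1": (["d1", "d2", "d3"], ["d1", "d3", "d7", "d9"]),
    "query-2": (["d4", "d5"], ["d4", "d5"]),
}

# The random variable is the mapping itself; this dict holds the values it assigns.
recall_values = {q: recall(ret, rel) for q, (ret, rel) in instances.items()}
print(recall_values)
```

The individual numbers in `recall_values` are values of the random variable, not the random variable itself.
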
Notes

- The objects of study are usually the systems that enable random variables (e.g., IR systems), rather than the instances that the measures are on (e.g., queries).
- What we define as a random variable for a particular experiment can change as we discover deterministic and causal relationships in a given system
Representation of data instances

- i.i.d. instances are commonly assumed
  - Independent: Knowing something about one instance tells you nothing about another
  - Identically distributed: Drawn from the same probability distribution
- Examples?
  - Queries in TREC data
  - Programs in SPEC benchmarks
  - Data sets in UCI repository
- Some alternatives
  - Time series (e.g., users submitting sets of slightly modified queries)
  - Relational (e.g., router performance embedded in a network)
Populations and samples

- A population is a specified set of instances
  - An actual finite set of instances (e.g., the UCI data sets for machine learning research)
  - A generalization of an actual finite set (e.g., the set of all data sets that might be produced by a particular simulator in infinite time)
  - A purely hypothetical set which can be described mathematically (e.g., the set of all correct Java programs)
- Samples are finite subsets of populations
Examples

Populations                              Actual data samples
---------------------------------------  ----------------------------------------
All possible IR queries                  The TREC 2005 HARD queries
All possible programs written in Java    The SPECjvm98 benchmarks
All Java programmers active in 2005      Students taking CMPSCI 320 in Fall 2005
The SPECjvm98 benchmarks                 A subset of the benchmarks
Four stages of defining a sample

- The target population (e.g., all computer programs)
- The sampling frame (all programs written in Java or C++)
- The selected sample (all programs written by CS undergraduate students in 200-level courses at UMass)
- The actual sample (all programs actually turned in)
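
The four stages can be sketched as successive narrowing of a set of instances. Everything in this example is hypothetical: the synthetic population, its attributes, the 80% turn-in rate, and the use of a simple random draw for the selected sample (the slide's example selects by course enrollment instead).

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Target population: synthetic "programs" with made-up attributes.
target_population = [
    {"id": i,
     "language": rng.choice(["Java", "C++", "Python", "Haskell"]),
     "turned_in": rng.random() < 0.8}
    for i in range(1000)
]

# Sampling frame: the part of the population we are able to draw from.
sampling_frame = [p for p in target_population if p["language"] in ("Java", "C++")]

# Selected sample: here a simple random sample from the frame (an assumption).
selected_sample = rng.sample(sampling_frame, k=50)

# Actual sample: the selected instances we really obtain data for.
actual_sample = [p for p in selected_sample if p["turned_in"]]

for name, stage in [("target population", target_population),
                    ("sampling frame", sampling_frame),
                    ("selected sample", selected_sample),
                    ("actual sample", actual_sample)]:
    print(f"{name}: {len(stage)} instances")
```

Each step is a place where the sample can drift away from the target population, which is why the stages are worth tracking separately.
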
Why is sampling difficult?
Sampling problems

- The target population
- The sampling frame
- The selected sample
- The actual sample
Random sampling in CS

- Random sampling isn't easy in CS
  - ...but it's not easy in most sciences
- Answer isn't to give up, but to consider how to get closer to the ideal
  - Define the ideal population
  - Identify sources of bias in sampling and in subsequent steps of sample definition
  - Remove or mitigate as many sources of bias as possible
- Modify your confidence in your ability to generalize based on your assessment of the match between your actual sample and your desired population
Types of scales

- Categorical, discrete, or nominal: Values contain no ordering information (e.g., multiple access protocols for underwater networking)
- Ordinal: Values indicate order, but no arithmetic operations are meaningful (e.g., "novice", "experienced", and "expert" as designations of programmers participating in an experiment)
- Interval: Distances between values are meaningful, but zero point is not meaningful (e.g., degrees Fahrenheit)
- Ratio: Distances are meaningful and a zero point is meaningful (e.g., degrees K)
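
The difference between interval and ratio scales can be checked numerically. In this small sketch (temperatures chosen arbitrarily), the Fahrenheit ratio suggests one reading is "twice" the other, while the same two temperatures on the Kelvin scale differ by only a few percent.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert degrees Fahrenheit to kelvins."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15

a_f, b_f = 40.0, 80.0                      # two arbitrary temperatures
a_k, b_k = fahrenheit_to_kelvin(a_f), fahrenheit_to_kelvin(b_f)

# Differences survive the change of scale (up to the unit factor 5/9).
diff_f = b_f - a_f
diff_k = b_k - a_k

# Ratios do not: the Fahrenheit zero point is arbitrary.
ratio_f = b_f / a_f
ratio_k = b_k / a_k

print(f"difference: {diff_f} F = {diff_k:.2f} K")
print(f"ratio on Fahrenheit scale: {ratio_f:.3f}")
print(f"ratio on Kelvin scale:     {ratio_k:.3f}")
```

Only the Kelvin ratio has a physical interpretation, because only that scale has a meaningful zero.
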
Data transformations

- Downgrading type (e.g., interval to ordinal)
- Shifting intervals
- Tukey's ladder of powers: trans = original^(1 - b)
  - E.g.: b = -2 -> original^3, b = 0.5 -> sqrt(original), b = 2 -> 1/original
- Combining several variables
  - Normalize measurements (e.g., Simsek & Jensen 2005, normalized to optimal)
  - Remove unwanted factors (e.g., remove file read times from total compile times)
  - Consider relation of two variables (e.g., Kirkpatrick & Selman, vertex/edge ratio)
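
A minimal sketch of the ladder of powers, assuming the exponent is 1 - b, so that b = 0.5 gives the square root and b = 2 gives the reciprocal, as in the examples above. The function name `ladder_transform` is hypothetical, and using the logarithm when the exponent is zero is a common convention that the slide itself does not state.

```python
import math

def ladder_transform(values, b):
    """Apply trans = original ** (1 - b) to a list of positive values.

    Assumption: when the exponent is 0 (b == 1) the natural log is used,
    a common convention for the ladder that the slide does not spell out.
    """
    if any(v <= 0 for v in values):
        raise ValueError("this sketch handles positive values only")
    power = 1 - b
    if power == 0:
        return [math.log(v) for v in values]
    return [v ** power for v in values]

data = [1.0, 4.0, 9.0, 16.0]  # made-up measurements
for b in (-2, 0.5, 2):
    print(f"b = {b}: {ladder_transform(data, b)}")
```

Moving b up or down the ladder stretches or compresses the upper tail of the data, which is often used to make a skewed distribution more symmetric.
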
Exploratory data analysis

- Exploratory data analysis (EDA) employs a variety of techniques to
  - maximize insight into a data set;
  - uncover underlying structure;
  - extract important variables;
  - detect outliers and anomalies;
  - test underlying assumptions;
  - develop parsimonious models; and
  - determine optimal factor settings
- The EDA approach is precisely that, an approach: not a set of techniques, but an attitude/philosophy about how a data analysis should be carried out.

Source: http://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm
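
One of the goals listed above, detecting outliers and anomalies, can be illustrated with a quartile summary and a simple fence rule. The 1.5 x IQR fences used here are one common choice rather than something the slides prescribe, and the data are made up.

```python
import statistics

# Made-up measurements with one suspicious value.
data = [12, 13, 13, 14, 15, 15, 16, 17, 18, 95]

# Quartiles (statistics.quantiles requires Python 3.8+).
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Fence rule: flag points more than 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"min={min(data)} q1={q1} median={median} q3={q3} max={max(data)}")
print(f"fences=({lower_fence}, {upper_fence}) outliers={outliers}")
```

A flagged point is a prompt to look more closely at the instance, not a verdict that it should be discarded.
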
Why EDA?

- Data analysis tools are typically used for
  - Hypothesis testing
  - Parameter estimation
- Graphics tools are typically used for presentation
- However, much of the quality of scientific work is determined by the quality of the hypotheses and models used by the researcher
- Can data analysis help suggest hypotheses?
Resources

- Books
  - Exploratory Data Analysis, Tukey (1977)
  - Data Analysis and Regression, Mosteller and Tukey (1977)
  - Interactive Data Analysis, Hoaglin (1977)
  - The ABC's of EDA, Velleman and Hoaglin (1981)
- Software
  - Data Desk (Data Description)
  - Fathom (Keypress)
  - XGobi (AT&T Research)

Demo
