ST2195 Programming For Data Science
Note: Content extracted from ST2195 Subject Guides (available via UOL VLE)
What is Data?
What is Data Science?
Many definitions exist, none of which is widely accepted. Here is a reasonable one:
Data science is the application of computational and statistical techniques on data to address or
gain insight into some problem in the real world.
We can also think of data science as the union of the various techniques that are required to accomplish
the above…
Data science = Data collection + Data (pre-)processing + Big data + Scientific hypotheses +
Business Insights + Visualisation + Machine learning + Statistics + (etc)
In this course we will explore the programming aspect of all the above areas.
What Data Science is Not…
Data Science Is Not (Just) Machine Learning
Predictions can be an important part of data science, but the truly hard elements also involve:
Collecting the data
Defining the problem you are trying to solve (and frequently, re-defining it many times)
Interpreting and understanding the results, and knowing what actions to take based upon this
Data vs Information
Data
Raw, unorganized facts that need to be processed
Unusable until it is organized
Information
Created when data is processed, organized, and structured
Needs to be put in an appropriate context in order to become useful
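The distinction above can be illustrated with a minimal sketch. The readings, city names, and the `summarise` helper are all hypothetical and only serve to show raw data being organized into something usable.

```python
from statistics import mean

# Hypothetical raw data: unorganized "city,temperature" strings,
# as they might arrive from a log file.
raw_readings = [
    "london,18.5",
    "paris,21.0",
    "london,19.5",
    "paris,23.0",
    "london,17.0",
]


def summarise(readings):
    """Organize raw 'city,temperature' strings into per-city averages."""
    by_city = {}
    for line in readings:
        city, value = line.split(",")
        by_city.setdefault(city, []).append(float(value))
    # Round to two decimals so the summary is easy to read.
    return {city: round(mean(values), 2) for city, values in by_city.items()}


information = summarise(raw_readings)
print(information)
```

The list of strings is data: on its own it says little. The per-city averages are information, and they become useful once placed in context, for example when comparing cities for a given month.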
Examples of Data Science Tasks
Example 1: Email Spam
4601 email messages were stored and labelled as spam or not.
The relative frequencies of the 57 most common words are available. See table below for some
of them.
The aim is to design an automatic spam detector that filters out spam before it clogs users'
mailboxes.
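A rough sketch of the idea, assuming a toy setting: compute the relative frequency of a few tracked words in a message and apply a hand-picked rule. The word list, the threshold, and the function names are hypothetical. A real detector would learn its rule from the labelled messages rather than hard-code it.

```python
import re

# Hypothetical list of tracked words; the dataset described above tracks 57.
TRACKED_WORDS = ["free", "money", "remove", "meeting"]


def word_frequencies(message, words=TRACKED_WORDS):
    """Return, for each tracked word, the percentage of tokens that match it."""
    tokens = re.findall(r"[a-z']+", message.lower())
    total = len(tokens)
    if total == 0:
        return {w: 0.0 for w in words}
    return {w: 100.0 * tokens.count(w) / total for w in words}


def looks_like_spam(message, threshold=10.0):
    """Toy rule (an assumption, not a trained model): flag the message when
    'free' and 'money' together exceed `threshold` percent of its words."""
    freqs = word_frequencies(message)
    return freqs["free"] + freqs["money"] > threshold


example = "Free money! Claim your free prize now"
print(word_frequencies(example))
print(looks_like_spam(example))
print(looks_like_spam("Agenda for the project meeting on Monday"))
```

The relative frequencies play the role of input features, and the spam label is the output to predict. Replacing the fixed threshold with a rule fitted to labelled data is what turns this sketch into a machine learning task.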
Examples of Data Science Tasks
Example 2: Real Estate
Determine the neighbourhood characteristics that drive house prices.
The data is from the Boston Housing dataset, available from the “scikit-learn” Python library (versions before 1.2).
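As a minimal sketch of how one characteristic might be related to price, the snippet below fits a straight line by ordinary least squares. The numbers are made up for illustration and are not taken from the Boston Housing dataset. The `fit_line` helper is hypothetical, and a real analysis would consider many characteristics at once.

```python
def fit_line(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustration data (NOT from the real dataset):
# average number of rooms per dwelling, and price in thousands.
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
price = [12.0, 17.0, 22.0, 27.0, 32.0]

slope, intercept = fit_line(rooms, price)
print(f"price = {intercept:.2f} + {slope:.2f} * rooms")
```

The fitted slope estimates how much the price changes for each additional room in this toy data. Whether a characteristic truly drives prices, rather than merely moving with them, is a question of interpretation that the fit alone cannot answer.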
Computer Programming and Data Science
How do we actually do data science? Some of the tasks we need to undertake…
Data collection
Data processing (wrangling)
Data visualisation
Train and apply algorithms from fields such as machine learning, statistics, data mining, optimisation,
image processing, etc.
We will explore programming languages R and Python to perform data science tasks.
Both R and Python are open source, i.e. free software that the user can modify and distribute within
the terms of a licence.
Programming for Data Science Tools
Integrated Development Environments (IDEs)
Software applications that facilitate computer programming and software development. Examples
include RStudio, Spyder, Microsoft Visual Studio, etc.
GitHub
Code hosting platform for version control and collaboration that is based on Git
Largest host of source code in the world
Useful Links, Resources, and References
References
• Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. Springer.