Data Science Case Study & Applications

The document provides information about a case study on data science. It lists the names of three students - Shubham Karnik, Dheren Salian, and Omkar Samal. It then discusses databases and data architectures, including traditional databases with tables, fields, keys, and metadata. It also discusses SQL, database servers, efficiency issues, structured vs unstructured data, object databases, and challenges with real-world databases like data integrity, interoperability, data cleansing and concludes that data science uses scientific methods and algorithms to extract knowledge from structured and unstructured data across many applications.

Uploaded by

KARTIK LADWA



CASE STUDY OF DATA SCIENCE

SUBJECT: DATA SCIENCE AND APPLICATIONS

SHUBHAM KARNIK / 70
DHEREN SALIAN / 78
OMKAR SAMAL / 79
DATABASE INFORMATION

Data Science is currently a popular interest of employers.

Our Industrial Affiliates Partners say there is high demand for students trained in Data Science:
  - databases, warehousing, data architectures
  - data analytics: statistics, machine learning
Big Data: gigabytes/day or more
Examples:
  - Walmart, cable companies (ads linked to content, viewer trends), airlines/Orbitz, HMOs, call centers, Twitter (500M tweets/day), traffic surveillance cameras, detecting fraud, identity theft...
Supports “Business Intelligence”
  - quantitative decision-making and control
  - finance, inventory, pricing/marketing, advertising
Data Architectures
- traditional databases (CSCE 310/608)
  - tables, fields
  - tuples = records or rows
    - <yellowstone,WY,6000000 acres,geysers>
  - key = field with unique values
    - can be used as a reference from one table into another
    - important for normalization, i.e. avoiding redundancy, because redundant copies risk becoming inconsistent
  - join: combining 2 tables using a key
  - metadata: data about the data
    - names of the fields, types (string, int, real, mpeg...)
    - also things like source, date, size, completeness/sampling
Name           HomeTown        Grad school       PhD   teaches   title
John Flaherty  Houston, TX     Rice              2005  CSCE 411  Design and Analysis of Algorithms
Susan Jenkins  Omaha, NE       Univ of Michigan  2004  CSCE 121  Introduction to Computing in C++
Susan Jenkins  Omaha, NE       Univ of Michigan  2004  CSCE 206  Programming in C
Bill Jones     Pittsburgh, PA  Carnegie Mellon   1999  CSCE 314  Programming Languages
Bill Jones     Pittsburgh, PA  Carnegie Mellon   1999  CSCE 206  Programming in C

Instructors:
Name           HomeTown        Grad school       PhD
John Flaherty  Houston, TX     Rice              2005
Susan Jenkins  Omaha, NE       Univ of Michigan  2004
Bill Jones     Pittsburgh, PA  Carnegie Mellon   1999

TeachingAssignments:
Name           teaches
John Flaherty  CSCE 411
Susan Jenkins  CSCE 121
Susan Jenkins  CSCE 206
Bill Jones     CSCE 314
Bill Jones     CSCE 206

Courses:
course    title
CSCE 411  Design and Analysis of Algorithms
CSCE 121  Introduction to Computing in C++
CSCE 314  Programming Languages
CSCE 206  Programming in C
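
The split from one wide table into three smaller ones can be sketched in a few lines of Python. This is only an illustration, assuming the instructor's name and the course number serve as the keys; the function names `normalize` and `join` are made up for this example and do not come from the slides.

```python
# Flat table from the slides: instructor facts repeat once per course taught.
FLAT = [
    ("John Flaherty", "Houston, TX", "Rice", 2005, "CSCE 411", "Design and Analysis of Algorithms"),
    ("Susan Jenkins", "Omaha, NE", "Univ of Michigan", 2004, "CSCE 121", "Introduction to Computing in C++"),
    ("Susan Jenkins", "Omaha, NE", "Univ of Michigan", 2004, "CSCE 206", "Programming in C"),
    ("Bill Jones", "Pittsburgh, PA", "Carnegie Mellon", 1999, "CSCE 314", "Programming Languages"),
    ("Bill Jones", "Pittsburgh, PA", "Carnegie Mellon", 1999, "CSCE 206", "Programming in C"),
]


def normalize(flat):
    """Split the flat rows into three tables keyed by name and course number."""
    instructors = {}   # key: Name -> (HomeTown, Grad school, PhD)
    courses = {}       # key: course number -> title
    assignments = []   # (Name, course) pairs that reference the two keys
    for name, town, school, phd, course, title in flat:
        instructors[name] = (town, school, phd)
        courses[course] = title
        assignments.append((name, course))
    return instructors, assignments, courses


def join(instructors, assignments, courses):
    """Rebuild the flat rows by following the keys (a join on both keys)."""
    return [(n, *instructors[n], c, courses[c]) for n, c in assignments]


instructors, assignments, courses = normalize(FLAT)
print(len(instructors), len(assignments), len(courses))
```

Each instructor and each course title is now stored once, so a change such as a corrected hometown is made in a single place, and the join still recovers the original rows.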
SQL: Structured Query Language
>SELECT Name,HomeTown FROM Instructors WHERE PhD<2000;
Bill Jones Pittsburgh, PA

>SELECT Course,Title FROM Courses ORDER BY Course;

CSCE 121 Introduction to Computing in C++
CSCE 206 Programming in C
CSCE 314 Programming Languages
CSCE 411 Design and Analysis of Algorithms

can also compute sums, counts, means, etc.

example of JOIN: find all courses taught by someone from CMU:

>SELECT TeachingAssignments.teaches
FROM Instructors JOIN TeachingAssignments
ON Instructors.Name=TeachingAssignments.Name
WHERE Instructors."Grad school"='Carnegie Mellon';
CSCE 314
CSCE 206
because they were both taught by Bill Jones
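
The queries on this slide can be tried out with Python's built-in `sqlite3` module. The sketch below is one possible setup, not the deck's own: the in-memory database, the column types, and the `COUNT(*)` query (added to illustrate "sums, counts, means") are assumptions, and an `ORDER BY` is added to the join so the result order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database (assumption)
conn.executescript("""
CREATE TABLE Instructors (Name TEXT PRIMARY KEY, HomeTown TEXT, "Grad school" TEXT, PhD INTEGER);
CREATE TABLE Courses (course TEXT PRIMARY KEY, title TEXT);
CREATE TABLE TeachingAssignments (
    Name    TEXT REFERENCES Instructors(Name),
    teaches TEXT REFERENCES Courses(course)
);
""")
conn.executemany("INSERT INTO Instructors VALUES (?, ?, ?, ?)", [
    ("John Flaherty", "Houston, TX", "Rice", 2005),
    ("Susan Jenkins", "Omaha, NE", "Univ of Michigan", 2004),
    ("Bill Jones", "Pittsburgh, PA", "Carnegie Mellon", 1999),
])
conn.executemany("INSERT INTO Courses VALUES (?, ?)", [
    ("CSCE 411", "Design and Analysis of Algorithms"),
    ("CSCE 121", "Introduction to Computing in C++"),
    ("CSCE 314", "Programming Languages"),
    ("CSCE 206", "Programming in C"),
])
conn.executemany("INSERT INTO TeachingAssignments VALUES (?, ?)", [
    ("John Flaherty", "CSCE 411"),
    ("Susan Jenkins", "CSCE 121"),
    ("Susan Jenkins", "CSCE 206"),
    ("Bill Jones", "CSCE 314"),
    ("Bill Jones", "CSCE 206"),
])

# Simple selection with a filter on the PhD year.
before_2000 = conn.execute(
    "SELECT Name, HomeTown FROM Instructors WHERE PhD < 2000"
).fetchall()

# JOIN on the shared key Name: courses taught by someone from Carnegie Mellon.
cmu_courses = conn.execute("""
    SELECT TeachingAssignments.teaches
    FROM Instructors JOIN TeachingAssignments
      ON Instructors.Name = TeachingAssignments.Name
    WHERE Instructors."Grad school" = 'Carnegie Mellon'
    ORDER BY TeachingAssignments.teaches
""").fetchall()

# Aggregate: number of courses per instructor.
courses_per_instructor = conn.execute(
    "SELECT Name, COUNT(*) FROM TeachingAssignments GROUP BY Name ORDER BY Name"
).fetchall()

print(before_2000, cmu_courses, courses_per_instructor)
```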
SQL servers
- centralized database, required for concurrent access by multiple users
- ODBC: Open DataBase Connectivity, a standard API to connect to servers and run queries and updates from languages like Java, C, Python
- Oracle, IBM DB2: industrial-strength SQL databases
Some efficiency issues with real databases
- indexing
  - how to efficiently find all songs written by Paul Simon in a database with 10,000,000 entries?
  - data structures for representing sorted order on fields
- disk management
  - databases are often too big to fit in RAM; leave most of the data on disk and swap in blocks of records as needed (could be slow)
- concurrency
  - transaction semantics: either all updates in a batch happen or none do (commit or rollback)
  - e.g. delete one record and simultaneously add another, with a guarantee that the database is never left in an inconsistent state
  - other users might be blocked until the transaction is done
- query optimization
  - the order in which you JOIN tables can drastically affect the size of the intermediate tables
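
Two of the points above, indexing and transaction semantics, can be illustrated with a small Python sketch. It is a toy under stated assumptions: the song rows are made-up placeholders, a sorted list with binary search stands in for a real index structure such as a B-tree, and the `records` table and the helper names `songs_by` and `replace_record` are hypothetical.

```python
import bisect
import sqlite3

# --- Indexing: keep (writer, row_id) pairs in sorted order on the writer field ---
# Made-up placeholder rows: (title, writer).
songs = [
    ("song-0", "Paul Simon"),
    ("song-1", "Writer B"),
    ("song-2", "Paul Simon"),
    ("song-3", "Writer C"),
]
writer_index = sorted((writer, row_id) for row_id, (_, writer) in enumerate(songs))


def songs_by(writer):
    """Binary-search the sorted index instead of scanning every row."""
    lo = bisect.bisect_left(writer_index, (writer, -1))
    hi = bisect.bisect_left(writer_index, (writer, len(songs)))
    return [songs[row_id][0] for _, row_id in writer_index[lo:hi]]


# --- Transactions: delete one record and add another, all or nothing ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, value TEXT NOT NULL)")
conn.execute("INSERT INTO records VALUES (1, 'old')")
conn.commit()


def replace_record(conn, old_id, new_id, new_value):
    """Return True if both updates were committed, False if both were rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("DELETE FROM records WHERE id = ?", (old_id,))
            conn.execute("INSERT INTO records VALUES (?, ?)", (new_id, new_value))
        return True
    except sqlite3.IntegrityError:
        return False


failed = replace_record(conn, 1, 2, None)      # violates NOT NULL, so the delete is undone
rows_after_failure = conn.execute("SELECT * FROM records").fetchall()
succeeded = replace_record(conn, 1, 2, "new")  # both statements commit together
rows_after_success = conn.execute("SELECT * FROM records").fetchall()
print(songs_by("Paul Simon"), rows_after_failure, rows_after_success)
```

The lookup touches only a logarithmic number of index entries, and the failed replacement leaves the original record in place rather than a half-finished update.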
Unstructured data
- raw text
  - documents, digital libraries
  - grep, substring indexing, regular expressions
    - like find all instances of “[aA]g+ies” including “agggggies”
  - Information Retrieval (CSCE 470)
    - look for synonyms, similar words (like “car” and “auto”)
    - TF-IDF (term frequency/inverse document frequency): weighting for important words
    - LSI (latent semantic indexing): e.g. ‘dogs’ is similar to ‘canines’ because they are used similarly (both near ‘bark’ and ‘bite’)
  - Natural Language parsing
    - extracting requirements from job postings
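
The regular expression and the TF-IDF weighting mentioned above can be tried with the Python standard library. This is a rough sketch: the sample sentence and the three tiny documents are invented for illustration, and the formula used (term frequency times the natural log of N over document frequency) is one common TF-IDF variant among several.

```python
import math
import re
from collections import Counter

# Regular expression from the slide: 'a' or 'A', one or more 'g', then 'ies'.
pattern = re.compile(r"[aA]g+ies")
text = "Aggies, agggggies and agies match; aies and AGGIES do not."
matches = pattern.findall(text)

# Invented mini-corpus for the TF-IDF illustration.
docs = {
    "d1": "the dog will bark and bite",
    "d2": "the car and the auto",
    "d3": "the dog chased the car",
}


def tf_idf(docs):
    """Weight each word by (count / doc length) * log(N / number of docs containing it)."""
    tokenized = {name: body.split() for name, body in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter(word for words in tokenized.values() for word in set(words))
    weights = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        weights[name] = {
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return weights


weights = tf_idf(docs)
print(matches)
print(sorted(weights["d1"].items(), key=lambda item: -item[1]))
```

A word such as "the" that occurs in every document gets weight zero, while a word found in only one document, such as "bark", is weighted highest within that document.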
Unstructured data (continued)
- images, video (BLOBs = binary large objects)
  - how to extract features? index them? search them?
  - color histograms
  - convolutions/transforms for pattern matching
  - looking for missiles in aerial photos of Cuba
- streams
  - sports ticker, radio, stock quotes...
- XML files
  - with tags indicating field names
<course>
<name>CSCE 411</name>
<title>Design and Analysis of Algorithms</title>
</course>
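
Because the tags name the fields, a record like this can be read back into a field-to-value mapping. A minimal sketch using Python's standard `xml.etree.ElementTree` module follows; the helper name `course_record` is made up for this example.

```python
import xml.etree.ElementTree as ET

XML = """<course>
<name>CSCE 411</name>
<title>Design and Analysis of Algorithms</title>
</course>"""


def course_record(xml_text):
    """Map each child tag (field name) to its text content."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


record = course_record(XML)
print(record)
```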
Object databases

[Diagram: three linked objects]
University: Texas A&M; College Station, TX; Div 1A; 53,299 students
Course: CHEM 102; Intro to Chemistry; TR, 3:00-4:00; prereq: CHEM 101
Instructor/Employee: Dr. Frank Smith; 302 Miller St.; PhD, Cornell; 13 years experience
Links: ClassOfferedAt (CHEM 102 to Texas A&M), TaughtBy (CHEM 102 to Dr. Frank Smith)

In a database with millions of objects, how do you efficiently do queries (i.e. follow pointers) and retrieve information?
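
One way to picture "following pointers" is with plain Python objects that hold references to each other. The sketch below is an in-memory illustration only, not how any particular object database works; the class and attribute names are assumptions, and the dictionary stands in for whatever index a real system would keep.

```python
from dataclasses import dataclass


@dataclass
class University:
    name: str
    location: str
    division: str
    students: int


@dataclass
class Instructor:
    name: str
    address: str
    phd: str
    years_experience: int


@dataclass
class Course:
    number: str
    title: str
    schedule: str
    prereq: str
    offered_at: University  # the ClassOfferedAt link
    taught_by: Instructor   # the TaughtBy link


tamu = University("Texas A&M", "College Station, TX", "Div 1A", 53299)
smith = Instructor("Dr. Frank Smith", "302 Miller St.", "Cornell", 13)
chem102 = Course("CHEM 102", "Intro to Chemistry", "TR, 3:00-4:00", "CHEM 101", tamu, smith)

# A hash index from course number to object avoids scanning all objects.
courses_by_number = {chem102.number: chem102}


def instructor_phd(course_number):
    """Look up the course, then follow the TaughtBy pointer."""
    return courses_by_number[course_number].taught_by.phd


print(instructor_phd("CHEM 102"))
```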
Real-world issues with databases
- it’s all about scaling up to many records (and many users)
- data warehousing:
  - full database is stored in a secure, off-site location
  - slices, snapshots, or views are put on interactive query servers for fast user access (“staging”)
  - might be processed or summarized data
- databases are often distributed
  - different parts of the data held in different sites
  - some queries are local, others are “corporate-wide”
  - how to do distributed queries?
  - how to keep the databases synchronized?
  - CSCE 438: Distributed Object Programming
Data integrity
- missing values
  - how to interpret? not available? 0? use the mean?
- duplicated values
  - including partial matches (Jon Smith = John Smith?)
- inconsistency:
  - multiple addresses for a person
  - out-of-date data
- inconsistent usage:
  - does “destination” mean the destination of the first leg or of the whole flight?
- outliers:
  - salaries that are negative, or in the trillions
- most databases allow “integrity constraints” to be defined that validate newly entered data
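
As an illustration of integrity constraints, the sketch below declares `NOT NULL` and `CHECK` constraints in SQLite through Python's `sqlite3` module. The `employees` table, the salary ceiling of one billion, and the helper `try_insert` are assumptions chosen to match the slide's salary example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE employees (
    name   TEXT NOT NULL,
    salary REAL NOT NULL CHECK (salary >= 0 AND salary < 1e9)
)""")


def try_insert(name, salary):
    """Return True if the row passed the constraints, False if it was rejected."""
    try:
        with conn:  # commit on success, roll back on a constraint violation
            conn.execute("INSERT INTO employees VALUES (?, ?)", (name, salary))
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    try_insert("A", 50000.0),  # plausible salary
    try_insert("B", -10.0),    # negative salary
    try_insert("C", 3e12),     # salary in the trillions
    try_insert("D", None),     # missing value
]
row_count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
print(results, row_count)
```

Only the first row is stored; the database itself rejects the other three at insert time.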
Interoperability
- how can data from one database be compared or combined with another?
  - what if fields are not the same, or not present, or used differently?
  - think of medical or insurance records
- translation/mapping of terms
- standards
  - units like ft/s, or gallons, etc.
  - identifiers like SSN, UIN, ISBN
- “federated” databases: queries that combine information across multiple servers
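
A tiny sketch of the term-mapping and unit-standardization idea: records from a second source are translated into the first source's field names and units. Everything here is hypothetical (the field names, the mapping table, and the choice of metres per second as the common unit); only the conversion factor of 0.3048 metres per foot is a fixed fact.

```python
FT_TO_M = 0.3048  # exact definition of the international foot

# Hypothetical mapping from source B's field names to the shared schema.
FIELD_MAP_B = {"full_name": "name", "speed_ft_s": "speed_m_s"}


def from_source_b(record):
    """Rename known fields, convert ft/s to m/s, and drop fields with no mapping."""
    out = {}
    for field, value in record.items():
        target = FIELD_MAP_B.get(field)
        if target is None:
            continue  # field has no counterpart in the shared schema
        if field == "speed_ft_s":
            value = value * FT_TO_M
        out[target] = value
    return out


converted = from_source_b({"full_name": "X", "speed_ft_s": 10.0, "extra": 1})
print(converted)
```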
Data cleansing
- filling in missing data (imputing values)
- detecting and removing outliers
- smoothing
  - removing noise by averaging values together
- filtering, sampling
  - keeping only selected representative values
- feature extraction
  - e.g. in a photo database, which people are wearing glasses? which photos have more than one person? which are outdoors?
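
Three of the cleansing steps above can be sketched with Python's `statistics` module. The slides do not prescribe specific methods, so the choices here are assumptions: missing values are filled with the mean, outliers are detected with a median-absolute-deviation rule, and smoothing is a simple moving average.

```python
from statistics import mean, median


def impute_mean(values):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


def remove_outliers(values, k=3.0):
    """Drop values more than k median absolute deviations from the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)  # no spread to judge outliers against
    return [v for v in values if abs(v - med) <= k * mad]


def moving_average(values, window=3):
    """Smooth by averaging each run of `window` consecutive values."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


print(impute_mean([1.0, None, 3.0]))
print(remove_outliers([10, 11, 12, 13, 1000]))
print(moving_average([1, 2, 3, 4, 5]))
```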
Conclusion

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and applies that knowledge and those actionable insights across a broad range of application domains. Data science is related to data mining, machine learning and big data.
THANK YOU
