Introduction
✓ Data can be considered raw material, while data science is the process of extracting meaningful information from that data.
✓ Data science is an interdisciplinary field that combines statistics, computer science, and domain expertise to extract knowledge and insights from structured and unstructured data.
✓ Data science is employed in a wide range of industries, including healthcare, finance, agriculture, retail, and more.
✓ It empowers organizations to make data-driven decisions and gain a competitive advantage.

Popularity of Data Science
• Digital devices are cheaper and more powerful.
• Democratization of hardware and software.
• The internet is producing a huge amount of data.
• According to a report of the International Data Corporation (IDC 2021), worldwide data will grow 61% to 175 zettabytes by 2025.
• The same report predicted that 49% of the world's stored data will reside in public cloud environments.
• With the immense computational resources now available, we can perform analytics and train a machine on massive data.

Learning Objectives
The learning objectives of this course are:
1. Understand the fundamentals of Python programming.
   ▪ Acquire a working knowledge of Python syntax and control structures.
   ▪ Learn to write Python functions and work with Python data structures like lists, tuples, dictionaries, and sets.
2. Gain proficiency in using Python's data science libraries.
   ▪ Learn to manipulate and analyze data using Pandas.
   ▪ Perform numerical computations with NumPy.
   ▪ Create visualizations with Matplotlib and Seaborn.
3. Develop skills in data manipulation and cleaning.
   ▪ Understand how to handle missing data, duplicate data, and data errors.
   ▪ Learn data transformation techniques essential for preparing datasets for analysis.
4. Learn the basics of statistical analysis and hypothesis testing.
   ▪ Comprehend the concepts of descriptive statistics and inferential statistics.
   ▪ Perform hypothesis testing using Python to make data-driven inferences.
5. Understand machine learning concepts and apply them using Scikit-learn.
   ▪ Grasp the basics of machine learning algorithms, including supervised and unsupervised learning.
   ▪ Implement machine learning models and understand the principles of model selection and evaluation.
6. Learn to present data insights effectively.
   ▪ Develop the ability to communicate results through visualizations.
   ▪ Create comprehensive reports on the findings from data analysis.

Learning Outcomes
By the end of this course, students should be able to:
1. Write Python scripts and programs for various data science
applications.
2. Efficiently use Python libraries/frameworks for data
preprocessing tasks.
3. Conduct exploratory data analysis to uncover insights and
prepare data for modeling.
4. Apply statistical methods to analyze data and validate
hypotheses using Python.
5. Create meaningful data visualizations to present findings
in a clear and impactful manner.
6. Build, test, and evaluate machine learning models with
Scikit-learn.
7. Demonstrate the ability to extract and interpret data to
inform decision-making processes.
8. Possess a solid foundation in data science that can be
applied to real-world problems across various domains.

Data Science Operations
✓ Data science can be described as a field revolving around five data-related operations:
• Collection
• Storing
• Processing
• Describing
• Modelling

Collection
• Data collection is the process of gathering data (text, video, audio, and others).
• How data is collected depends on the data scientist and the environment in which the data scientist is working.
• Suppose a data scientist is working at an agriculture-based company.
   • Which crops yield higher production in the districts of Maharashtra?
   • What is the effect of seed type, fertilizer, and irrigation on crops?
   • The data already exists.
   • Access it using SQL.
• Suppose a data scientist is working for a political party.
   • What is the winning chance of a candidate in a particular region?
   • Data exists at different sources, such as Facebook and opinion polls.
   • Access it using an API or a crawler.
• Take another example, where a data scientist is working with healthcare systems.
   • What is the effect of new drugs on patients?
   • The data is not available.
   • Experiments need to be designed to collect it.

Essential Expertise Needed for Data Collection
• Knowledge of programming
• Knowledge of statistics (in case experiments must be designed to collect data)
• Knowledge of databases (intermediate)

Storing Data
• Operational Data
   • Operational data is a vital part of an organization.
   • It is generated by day-to-day operations.
   • It is transactional data that reflects business activities and includes sales records, banking transactions, and customer interactions.
• Unstructured Data
   • This data is the wild frontier of data science.
   • It doesn't fit neatly into traditional row-and-column databases and includes:
      • Text
      • Speech
      • Video
     (data coming from the web, social media posts, online reviews, etc.)
• Structured Data
   • Structured data is highly organized and easily searchable.
   • It is stored in relational databases and is what you typically find in an e-commerce setting, for example in the form of customer profiles.
• Data from multiple databases
   • Combined into a common repository

Expertise Needed for Storing Data
• Programming Skills (Basic)
• Understanding of Databases (Basic)
• Understanding of NoSQL databases and semi-structured formats such as JSON and XML
• Understanding of Data Warehouses (which combine data from many different sources into a single data repository)

Processing Data
• Data Wrangling (the process of transforming and mapping data from one "raw" form into another format)
   • Extract
   • Transform and Structure
   • Load
• Data Cleaning
   • Fill missing values
   • Correct spelling errors
   • Identify and remove outliers
• Data scaling (adjusting the range of feature values)
   • E.g., kilometers to miles, rupees to dollars
• Normalizing (values are shifted and rescaled so that they end up ranging between 0 and 1)
   • Min-max scaling
• Standardizing
   • Values are centered around the mean and scaled to unit standard deviation

Expertise Needed for Processing Data
• Programming Skills (Essential)
• SQL and NoSQL Databases (Optional)
• Basic Statistics (Essential)

Describing Data
• Descriptive Statistics
   • Descriptive statistics provide a powerful way to describe and summarize data.
• Visualising Data
   • Visualization is about representing data graphically to uncover patterns, see trends, and communicate information effectively.
• Summarising Data
   • Summarizing data involves reducing large datasets to their basic essence.
   • Mean
   • Median
   • Mode, etc.

Skills Required for Describing Data
• Statistics
• Excel
• Python
• R
• Tableau

Data Modelling
• Statistical Modelling: Underlying relationships
   • What is the relationship between x and y in a dataset?
   • E.g., there is a linear relationship between the number of days of treatment and blood pressure (BP).
   • Models the underlying data distribution
   • Gives statistical guarantees (p-values, goodness-of-fit tests)
• Algorithmic Modelling (Machine Learning Modelling)
   • An alternative to statistical modelling when the relationship between the output and input variables is not simple.
   • Suitable if you care only about the prediction and not about why certain things are happening.
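
The statistical modelling example above (a linear relationship between days of treatment and BP) can be sketched with an ordinary least-squares line fit. This is a minimal illustration using only the Python standard library. The `days` and `bp` values are made-up numbers chosen for the example, not real patient data, and the helper names `fit_line` and `r_squared` are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    y_bar = mean(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data: days of treatment and systolic BP readings.
days = [0, 5, 10, 15, 20]
bp = [160, 152, 147, 139, 132]

slope, intercept = fit_line(days, bp)
r2 = r_squared(days, bp, slope, intercept)
print(f"BP = {intercept:.2f} + ({slope:.2f}) * days, R^2 = {r2:.3f}")
```

The slope estimates the change in BP per additional day of treatment, and R^2 is a simple goodness-of-fit measure. The p-values mentioned above are not computed here; a statistics library would normally supply them.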
Statistical Modelling vs Algorithmic Modelling
• Simple models -- Complex models
• More suited for low-dimensional data -- Can work with high-dimensional data
• Data-lean models -- Data-hungry models
• More of statistics -- More of ML, DL
• Linear Regression, Logistic Regression, Linear Discriminant Analysis -- Linear Regression, Logistic Regression, Linear Discriminant Analysis, Decision Trees, K-NNs, SVMs, Naïve Bayes, Multilayered Neural Networks

Modelling (Skills Required)
• Inferential Statistics
• Probability Theory
• ML and DL
• Python packages and frameworks (NumPy, pandas, scikit-learn, TensorFlow, PyTorch, Keras)
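
As a contrast on the algorithmic side, here is a minimal sketch of a K-NN classifier, one of the models listed above, written in plain Python. It applies min-max scaling first because K-NN relies on distances between feature values. The crop data, the feature names (rainfall in mm, fertilizer in kg), and the helpers `min_max_scale`, `scale_row`, and `knn_predict` are all assumptions made for illustration. In practice a library such as scikit-learn would normally be used instead.

```python
import math
from collections import Counter


def min_max_scale(rows):
    """Rescale every column to [0, 1]; also return each column's (min, max)."""
    bounds = [(min(col), max(col)) for col in zip(*rows)]
    scaled = [scale_row(row, bounds) for row in rows]
    return scaled, bounds


def scale_row(row, bounds):
    """Apply previously computed (min, max) bounds to a single row."""
    return [
        (value - lo) / (hi - lo) if hi > lo else 0.0
        for value, (lo, hi) in zip(row, bounds)
    ]


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k training rows nearest to query."""
    nearest = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: [rainfall_mm, fertilizer_kg] -> yield class.
features = [[200, 10], [220, 12], [800, 40], [850, 45], [780, 38], [250, 15]]
labels = ["low", "low", "high", "high", "high", "low"]

scaled_features, bounds = min_max_scale(features)

# New, unseen observations are scaled with the training bounds before prediction.
wet_field = scale_row([790, 41], bounds)
dry_field = scale_row([210, 11], bounds)

print(knn_predict(scaled_features, labels, wet_field))
print(knn_predict(scaled_features, labels, dry_field))
```

The model gives a prediction without any statement about why the yield is high or low, which is the trade-off described for algorithmic modelling.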