
MODULE 3

Data science in a big data world


Big data is a blanket term for any collection of data sets so large or complex that it becomes
difficult to process them using traditional data management techniques such as relational
database management systems (RDBMS). The widely adopted RDBMS has long been
regarded as a one-size-fits-all solution, but the demands of handling big data have shown
otherwise. Data science involves using methods to analyze massive amounts of data and extract
the knowledge it contains. You can think of the relationship between big data and data science as
being like the relationship between crude oil and an oil refinery. Data science and big data evolved
from statistics and traditional data management but are now considered to be distinct disciplines.
The characteristics of big data are often referred to as the three Vs:
■ Volume—How much data is there?
■ Variety—How diverse are different types of data?
■ Velocity—At what speed is new data generated?
Often these characteristics are complemented with a fourth V, veracity: How accurate is the data?
These four properties make big data different from the data found in traditional data management
tools. Consequently, the challenges they bring can be felt in almost every aspect: data capture,
curation, storage, search, sharing, transfer, and visualization. In addition, big data calls for
specialized techniques to extract the insights.
Data science is an evolutionary extension of statistics capable of dealing with the massive amounts
of data produced today. It adds methods from computer science to the repertoire of statistics. The
main things that set a data scientist apart from a statistician are the ability to work with big data
and experience in machine learning, computing, and algorithm building. Their tools tend to differ
too, with data scientist job descriptions more frequently mentioning the ability to use Hadoop, Pig,
Spark, R, Python, and Java.
Benefits and uses of data science and big data
 Both Commercial & Noncommercial uses
Data science and big data are used almost everywhere in both commercial and noncommercial
settings. The number of use cases is vast, and the examples we’ll provide throughout this book
only scratch the surface of the possibilities. Commercial companies in almost every industry use
data science and big data to gain insights into their customers, processes, staff, competition, and
products. Many companies use data science to offer customers a better user experience, as well as
to cross-sell, up-sell, and personalize their offerings. A good example of this is Google AdSense,
which collects data from internet users so relevant commercial messages can be matched to the
person browsing the internet. MaxPoint (http://maxpoint.com/us) is another example of real-time
personalized advertising.
 People analytics and Text mining
Human resource professionals use people analytics and text mining to screen candidates, monitor
the mood of employees, and study informal networks among coworkers. People analytics is the
central theme in the book Moneyball: The Art of Winning an Unfair Game. In the book (and
movie) we saw that the traditional scouting process for American baseball was random, and
replacing it with correlated signals changed everything. Relying on statistics allowed them to hire
the right players and pit them against the opponents where they would have the biggest advantage.

 Financial institutions
Financial institutions use data science to predict stock markets, determine the risk of lending
money, and learn how to attract new clients for their services. At the time of writing this book, at
least 50% of trades worldwide are performed automatically by machines based on algorithms
developed by quants, as data scientists who work on trading algorithms are often called, with the
help of big data and data science techniques. Governmental organizations are also aware of data’s
value.

 Governmental organizations
Many governmental organizations not only rely on internal data scientists to discover valuable
information, but also share their data with the public. You can use this data to gain insights or build
data-driven applications. Data.gov is but one example; it’s the home of the US Government’s open
data. A data scientist in a governmental organization gets to work on diverse projects such as
detecting fraud and other criminal activity or optimizing project funding. A well-known example
was provided by Edward Snowden, who leaked internal documents of the American National
Security Agency and the British Government Communications Headquarters that show clearly
how they used data science and big data to monitor millions of individuals. Those organizations
collected 5 billion data records from widespread applications such as Google Maps, Angry Birds,
email, and text messages, among many other data sources. Then they applied data science
techniques to distill information.

 Nongovernmental organizations (NGOs)


They are also no strangers to using data. They use it to raise money and defend their causes. The
World Wildlife Fund (WWF), for instance, employs data scientists to increase the effectiveness of
their fundraising efforts. Many data scientists devote part of their time to helping NGOs, because
NGOs often lack the resources to collect data and employ data scientists. DataKind is one such
data scientist group that devotes its time to the benefit of mankind.

 Universities
They use data science in their research but also to enhance the study experience of their students.
The rise of massive open online courses (MOOCs) produces a lot of data, which allows universities
to study how this type of learning can complement traditional classes. MOOCs are an invaluable
asset if you want to become a data scientist and big data professional, so definitely look at a few
of the better-known ones: Coursera, Udacity, and edX. The big data and data science landscape
changes quickly, and MOOCs allow you to stay up to date by following courses from top
universities. If you aren’t acquainted with them yet, take time to do so now; you’ll come to love
them as we have.
Facets of data
In data science and big data, you’ll come across many different types of data, and each of them
tends to require different tools and techniques.
The main categories of data are these:
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
Structured data: Structured data is data that depends on a data model and resides in a fixed field
within a record. As such, it’s often easy to store structured data in tables within databases or Excel
files. SQL, or Structured Query Language, is the preferred way to manage and query data that
resides in databases. You may also come across structured data that might give you a hard time
storing it in a traditional relational database. Hierarchical data such as a family tree is one such
example. The world isn’t made up of structured data, though; it’s imposed upon it by humans and
machines. More often, data comes unstructured.
An Excel table is an example of structured data.
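As a small illustration (not from the original text), structured data can be created and queried with SQL; the sketch below uses Python's built-in sqlite3 module and an invented customers table.

import sqlite3

# Structured data: every record has the same fixed fields.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "Alice", "Delhi"), (2, "Bob", "Mumbai")])

# SQL is the preferred way to query data that resides in fixed fields within records.
for row in conn.execute("SELECT name FROM customers WHERE city = 'Delhi'"):
    print(row)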
Unstructured data: Unstructured data is data that isn’t easy to fit into a data model because the
content is context-specific or varying. One example of unstructured data is your regular email.
Although email contains structured elements such as the sender, title, and body text, it’s a challenge
to find the number of people who have written an email complaint about a specific employee
because so many ways exist to refer to a person, for example. The thousands of different languages
and dialects out there further complicate this. A human-written email is also a perfect example of
natural language data.
Natural language: Natural language is a special type of unstructured data; it’s challenging to
process because it requires knowledge of specific data science techniques and linguistics.
The natural language processing community has had success in entity recognition, topic
recognition, summarization, text completion, and sentiment analysis, but models trained in one
domain don’t generalize well to other domains. Even state-of-the-art techniques aren’t able to
decipher the meaning of every piece of text. This shouldn’t be a surprise though: humans struggle
with natural language as well. It’s ambiguous by nature. The concept of meaning itself is
questionable here. Have two people listen to the same conversation. Will they get the same
meaning? The meaning of the same words can vary when coming from someone upset or joyous.
Machine-generated data: Machine-generated data is information that’s automatically created by
a computer, process, application, or other machine without human intervention. Machine-
generated data is becoming a major data resource and will continue to do so. Wikibon has forecast
that the market value of the industrial Internet (a term coined by Frost & Sullivan to refer to the
integration of complex physical machinery with networked sensors and software) will be
approximately $540 billion in 2020. IDC (International Data Corporation) has estimated there will
be 26 times more connected things than people in 2020. This network is commonly referred to as
the internet of things. The analysis of machine data relies on highly scalable tools, due to its high
volume and speed. Examples of machine data are web server logs, call detail records, network
event logs, and telemetry.

Graph-based or network data:


“Graph data” can be a confusing term because any data can be shown in a graph. “Graph” in this
case points to mathematical graph theory. In graph theory, a graph is a mathematical structure to
model pair-wise relationships between objects. Graph or network data is, in short, data that focuses
on the relationship or adjacency of objects. The graph structures use nodes, edges, and properties
to represent and store graphical data. Graph-based data is a natural way to represent social
networks, and its structure allows you to calculate specific metrics such as the influence of a person
and the shortest path between two people.
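A minimal sketch of such metrics, assuming the third-party networkx library and a small invented social network:

import networkx as nx

# Nodes are people, edges are "knows" relationships.
g = nx.Graph()
g.add_edges_from([("Ann", "Bob"), ("Bob", "Cara"), ("Cara", "Dev"), ("Ann", "Dev")])

# Graph-specific metrics: influence (approximated here by degree centrality)
# and the shortest path between two people.
print(nx.degree_centrality(g))
print(nx.shortest_path(g, "Ann", "Cara"))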
Audio, image, and video: Audio, image, and video are data types that pose specific challenges to
a data scientist. Tasks that are trivial for humans, such as recognizing objects in pictures, turn out
to be challenging for computers. MLBAM (Major League Baseball Advanced Media) announced
in 2014 that they’ll increase video capture to approximately 7 TB per game for the purpose of live,
in-game analytics. High-speed cameras at stadiums will capture ball and athlete movements to
calculate in real time, for example, the path taken by a defender relative to two baselines. Recently
a company called DeepMind succeeded at creating an algorithm that’s capable of learning how to
play video games. This algorithm takes the video screen as input and learns to interpret everything
via a complex process of deep learning. It’s a remarkable feat that prompted Google to buy the
company for their own Artificial Intelligence (AI) development plans. The learning algorithm
takes in data as it’s produced by the computer game; it’s streaming data.
Streaming data: While streaming data can take almost any of the previous forms, it has an extra
property. The data flows into the system when an event happens instead of being loaded into a data
store in a batch. Although this isn’t really a different type of data, we treat it here as such because
you need to adapt your process to deal with this type of information. Examples are the “What’s
trending” on Twitter, live sporting or music events, and the stock market.
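To make the contrast with batch loading concrete, the toy sketch below (plain Python with invented events) handles each event the moment it arrives instead of waiting for a complete data set:

import time

def event_stream():
    # Hypothetical source: events become available one at a time, as they happen.
    for i in range(5):
        time.sleep(0.1)  # simulate waiting for the next event
        yield {"event_id": i, "payload": "tick"}

# Streaming: process each event as it flows into the system.
for event in event_stream():
    print("handling event", event["event_id"])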

The data science process (In Brief)


The data science process typically consists of six steps. We will introduce them briefly here; each
step is discussed in more detail later in this module.
 Setting the research goal
Data science is mostly applied in the context of an organization. When the business asks you to
perform a data science project, you’ll first prepare a project charter. This charter contains
information such as what you’re going to research, how the company benefits from that, what data
and resources you need, a timetable, and deliverables.

 Retrieving data
The second step is to collect data. You’ve stated in the project charter which data you need and
where you can find it. In this step you ensure that you can use the data in your program, which
means checking the existence of, quality, and access to the data. Data can also be delivered by
third-party companies and takes many forms ranging from Excel spreadsheets to different types of
databases.

 Data preparation
Data collection is an error-prone process; in this phase you enhance the quality of the data and
prepare it for use in subsequent steps. This phase consists of three subphases: data cleansing
removes false values from a data source and inconsistencies across data sources, data integration
enriches data sources by combining information from multiple data sources, and data
transformation ensures that the data is in a suitable format for use in your models.

 Data exploration
Data exploration is concerned with building a deeper understanding of your data. You try to
understand how variables interact with each other, the distribution of the data, and whether there
are outliers. To achieve this, you mainly use descriptive statistics, visual techniques, and simple
modeling. This step often goes by the abbreviation EDA, for Exploratory Data Analysis.

 Data modeling or model building


In this phase you use models, domain knowledge, and insights about the data you found in the
previous steps to answer the research question. You select a technique from the fields of statistics,
machine learning, operations research, and so on. Building a model is an iterative process that
involves selecting the variables for the model, executing the model, and model diagnostics.

 Presentation and automation


Finally, you present the results to your business. These results can take many forms, ranging from
presentations to research reports. Sometimes you’ll need to automate the execution of the process
because the business will want to use the insights you gained in another project or enable an
operational process to use the outcome from your model.
AN ITERATIVE PROCESS
The previous description of the data science process gives you the
impression that you walk through this process in a linear way, but in reality you often have to step
back and rework certain findings. For instance, you might find outliers in the data exploration
phase that point to data import errors. As part of the data science process you gain incremental
insights, which may lead to new questions. To prevent rework, make sure that you scope the
business question clearly and thoroughly at the start.

Data Science Process in detail


Following a structured approach to data science helps you to maximize your chances of success in
a data science project at the lowest cost. It also makes it possible to take up a project as a team,
with each team member focusing on what they do best. Take care, however: this approach may
not be suitable for every type of project or be the only way to do good data science. The typical
data science process consists of six steps through which you’ll iterate.
The following list is a short introduction; each of the steps will be discussed in greater depth
throughout this chapter.
1 The first step of this process is setting a research goal. The main purpose here is making sure all
the stakeholders understand the what, how, and why of the project. In every serious project this
will result in a project charter.
2 The second phase is data retrieval. You want to have data available for analysis, so this step
includes finding suitable data and getting access to the data from the data owner. The result is data
in its raw form, which probably needs polishing and transformation before it becomes usable.
3 Now that you have the raw data, it’s time to prepare it. This includes transforming the data from
a raw form into data that’s directly usable in your models. To achieve this, you’ll detect and correct
different kinds of errors in the data, combine data from different data sources, and transform it. If
you have successfully completed this step, you can progress to data visualization and modeling.
4 The fourth step is data exploration. The goal of this step is to gain a deep understanding of the
data. You’ll look for patterns, correlations, and deviations based on visual and descriptive
techniques. The insights you gain from this phase will enable you to start modeling.
5 Finally, we get to model building. It is now that you attempt to gain the insights or make the
predictions stated in your project charter. Now is the time to bring out the heavy guns, but
remember research has taught us that often (but not always) a combination of simple models tends
to outperform one complicated model. If you’ve done this phase right, you’re almost done.
6 The last step of the data science process is presenting your results and automating the analysis, if
needed. One goal of a project is to change a process and/or make better decisions. You may still
need to convince the business that your findings will indeed change the business process as
expected. This is where you can shine in your influencer role. The importance of this step is more
apparent in projects on a strategic and tactical level. Certain projects require you to perform the
business process over and over again, so automating the project will save time.
Step 1: Defining research goals and creating a project charter
A project starts by understanding the what, the why, and the how of your project. What does the
company expect you to do? And why does management place such a value on your research?
a) Spend time understanding the goals and context of your research
An essential outcome is the research goal that states the purpose of your assignment in a clear and
focused manner. Understanding the business goals and context is critical for project success.
Continue asking questions and devising examples until you grasp the exact business expectations,
identify how your project fits in the bigger picture, appreciate how your research is going to change
the business, and understand how they’ll use your results. Nothing is more frustrating than
spending months researching something until you have that one moment of brilliance and solve
the problem, but when you report your findings back to the organization, everyone immediately
realizes that you misunderstood their question. Don’t skim over this phase lightly. Many data
scientists fail here: despite their mathematical wit and scientific brilliance they never seem to grasp
the business goals and context.
b) Create a project charter
Clients like to know upfront what they’re paying for, so after you have a good understanding of
the business problem, try to get a formal agreement on the deliverables. All this information is best
collected in a project charter. For any significant project this would be mandatory.
A project charter requires teamwork, and your input covers at least the following:
■ A clear research goal
■ The project mission and context
■ How you’re going to perform your analysis
■ What resources you expect to use
■ Proof that it’s an achievable project, or proof of concepts
■ Deliverables and a measure of success
■ A timeline
Your client can use this information to make an estimation of the project costs and the data and
people required for your project to become a success.
Step 2: Retrieving data
The next step in data science is to retrieve the required data. Sometimes you need to go into the
field and design a data collection process yourself, but most of the time you won’t be involved in
this step. Many companies will have already collected and stored the data for you, and what they
don’t have can often be bought from third parties. Don’t be afraid to look outside your organization
for data, because more and more organizations are making even high-quality data freely available
for public and commercial use.
Data can be stored in many forms, ranging from simple text files to tables in a database. The
objective now is acquiring all the data you need. This may be difficult, and even if you succeed,
data is often like a diamond in the rough: it needs polishing to be of any use to you.
a) Start with data stored within the company
Your first act should be to assess the relevance and quality of the data that’s readily available
within your company. Most companies have a program for maintaining key data, so much of the
cleaning work may already be done. This data can be stored in official data repositories such as
databases, data marts, data warehouses, and data lakes maintained by a team of IT professionals.
The primary goal of a database is data storage, while a data warehouse is designed for reading and
analyzing that data. A data mart is a subset of the data warehouse and geared toward serving a
specific business unit. While data warehouses and data marts are home to preprocessed data, data
lakes contain data in its natural or raw format. But the possibility exists that your data still resides
in Excel files on the desktop of a domain expert. Finding data even within your own company can
sometimes be a challenge. As companies grow, their data becomes scattered around many places.
Knowledge of the data may be dispersed as people change positions and leave the company.
Getting access to data is another difficult task. Organizations understand the value and sensitivity
of data and often have policies in place so everyone has access to what they need and nothing
more.
b) Don’t be afraid to shop around
If data isn’t available inside your organization, look outside your organization’s walls. Many
companies specialize in collecting valuable information. Although data is considered an asset more
valuable than oil by certain companies, more and more governments and organizations share their
data for free with the world. This data can be of excellent quality; it depends on the institution that
creates and manages it. The information they share covers a broad range of topics such as the
number of accidents or amount of drug abuse in a certain region and its demographics. This data
is helpful when you want to enrich proprietary data but also convenient when training your data
science skills at home.
c) Do data quality checks now to prevent problems later
Expect to spend a good portion of your project time doing data correction and cleansing, sometimes
up to 80%. The retrieval of data is the first time you’ll inspect the data in the data science process.
Most of the errors you’ll encounter during the data gathering phase are easy to spot, but being too
careless will make you spend many hours solving data issues that could have been prevented
during data import.
Step 3: Cleansing, integrating, and transforming data
The data received from the data retrieval phase is likely to be “a diamond in the rough.” Your task
now is to sanitize and prepare it for use in the modeling and reporting phase. Doing so is
tremendously important because your models will perform better and you’ll lose less time trying
to fix strange output. It can’t be mentioned nearly enough times: garbage in equals garbage out.
Your model needs the data in a specific format, so data transformation will always come into play.
It’s a good habit to correct data errors as early on in the process as possible. However, this isn’t
always possible in a realistic setting, so you’ll need to take corrective actions in your program.
a) Cleansing data
Data cleansing is a subprocess of the data science process that focuses on removing errors in your
data so your data becomes a true and consistent representation of the processes it originates from.
 Data entry errors
Data collection and data entry are error-prone processes. They often require human intervention,
and because humans are only human, they make typos or lose their concentration for a second
and introduce an error into the chain. But data collected by machines or computers isn’t free
from errors either.

 Redundant whitespace
Whitespaces tend to be hard to detect but cause errors like other redundant characters would. Who
hasn’t lost a few days in a project because of a bug that was caused by whitespaces at the end of a
string? You ask the program to join two keys and notice that observations are missing from the
output file. After looking for days through the code, you finally find the bug. Then comes the
hardest part: explaining the delay to the project stakeholders. The cleaning during the ETL phase
wasn’t well executed, and keys in one table contained a whitespace at the end of a string. This
caused a mismatch of keys such as “FR ” – “FR”, dropping the observations that couldn't be
matched.
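A minimal illustration of the problem and the fix, assuming the pandas library and two invented tables whose keys should match:

import pandas as pd

# One key carries a trailing space, so "FR " does not equal "FR".
orders = pd.DataFrame({"country": ["FR ", "DE"], "sales": [100, 200]})
countries = pd.DataFrame({"country": ["FR", "DE"], "region": ["EU", "EU"]})

# Strip redundant whitespace from the keys before joining.
orders["country"] = orders["country"].str.strip()
print(orders.merge(countries, on="country", how="left"))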
 Impossible values and sanity checks
Sanity checks are another valuable type of data check. Here you check the value against physically
or theoretically impossible values such as people taller than 3 meters or someone with an age of
299 years. Sanity checks can be directly expressed with rules: check = 0 <= age <= 120
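The same rule can be applied to a whole data set, for example with pandas (an assumption, as are the records below):

import pandas as pd

people = pd.DataFrame({"name": ["Ann", "Bob", "Cara"], "age": [34, 299, 51]})

# Sanity check expressed as a rule: 0 <= age <= 120.
valid = people["age"].between(0, 120)
print(people[~valid])  # rows that fail the check and need correction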
 Outliers
An outlier is an observation that seems to be distant from other observations or, more
specifically, one observation that follows a different logic or generative process than the other
observations. The easiest way to find outliers is to use a plot or a table with the minimum and
maximum values.
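A small sketch of such a check, assuming pandas and an invented series with one distant value; the interquartile-range rule at the end is just one common convention:

import pandas as pd

values = pd.Series([5.1, 4.8, 5.3, 5.0, 42.0])  # one suspiciously distant value

# The simplest check: look at the minimum and maximum.
print(values.describe()[["min", "max"]])

# Or flag points far outside the interquartile range.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
print(values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)])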

 Dealing with missing values


Missing values aren't necessarily wrong, but you still need to handle them separately, because many modeling techniques can't deal with them.
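Two common ways of handling them, sketched with pandas on an invented table: drop the incomplete rows, or impute a simple value such as the column mean.

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34, np.nan, 51], "income": [40000, 52000, np.nan]})

print(df.dropna())                            # option 1: omit incomplete observations
print(df.fillna(df.mean(numeric_only=True)))  # option 2: impute the column mean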

 Deviations from a code book


Detecting errors in larger data sets against a code book or against standardized values can be
done with the help of set operations. A code book is a description of your data, a form of
metadata. It contains things such as the number of variables per observation, the number of
observations, and what each encoding within a variable means. (For instance “0” equals
“negative”, “5” stands for “very positive”.) A code book also tells the type of data you’re
looking at: is it hierarchical, graph, something else? You look at those values that are present
in set A but not in set B. These are values that should be corrected. It’s no coincidence that sets
are the data structure that we’ll use when we’re working in code.
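A tiny illustration of that set operation in plain Python, with invented codes:

# Codes observed in the data (set A) versus codes allowed by the code book (set B).
observed = {"0", "1", "5", "9"}
codebook = {"0", "1", "2", "3", "4", "5"}

# Values present in the data but absent from the code book should be corrected.
print(observed - codebook)  # {'9'}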
 Different units of measurement
When integrating two data sets, you have to pay attention to their respective units of
measurement. An example of this would be when you study the prices of gasoline in the world.
To do this you gather data from different data providers. Some data sets contain prices per gallon
and others prices per liter.
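A sketch of harmonizing the units before integration, assuming pandas and illustrative prices (1 US gallon is about 3.785 liters):

import pandas as pd

per_gallon = pd.DataFrame({"country": ["US"], "price_per_gallon": [3.40]})
per_liter = pd.DataFrame({"country": ["FR"], "price_per_liter": [1.80]})

# Convert everything to a single unit of measurement before combining.
per_gallon["price_per_liter"] = per_gallon["price_per_gallon"] / 3.785
prices = pd.concat([per_gallon[["country", "price_per_liter"]], per_liter],
                   ignore_index=True)
print(prices)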

 Different levels of aggregation


Having different levels of aggregation is similar to having different types of measurement. An
example of this would be a data set containing data per week versus one containing data per
work week. This type of error is generally easy to detect, and summarizing (or the inverse,
expanding) the data sets will fix it.
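For example, daily observations can be summarized to weekly totals so they match a weekly data set; the sketch below assumes pandas and invented daily sales:

import pandas as pd

daily = pd.DataFrame(
    {"sales": range(14)},
    index=pd.date_range("2024-01-01", periods=14, freq="D"),
)

# Summarize to the coarser level of aggregation (weekly totals).
weekly = daily.resample("W").sum()
print(weekly)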

 Correct errors as early as possible


A good practice is to remedy data errors as early as possible in the data collection chain and
to fix as little as possible inside your program while fixing the origin of the problem. Retrieving
data is a difficult task, and organizations spend millions of dollars on it in the hope of making
better decisions.
b) Combining data from different data sources
Your data comes from several different places, and in this substep we focus on integrating
these different sources. Data varies in size, type, and structure, ranging from databases and
Excel files to text documents.
The different ways of combining data
You can perform two operations to combine information from different data sets. The first
operation is joining: enriching an observation from one table with information from another
table. The second operation is appending or stacking: adding the observations of one table to
those of another table.
 Joining tables
Joining tables allows you to combine the information of one observation found in one table
with the information that you find in another table. The focus is on enriching a single
observation. Let’s say that the first table contains information about the purchases of a
customer and the other table contains information about the region where your customer lives.
Joining the tables allows you to combine the information so that you can use it for your model.
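A minimal pandas sketch of such a join, with invented purchase and region tables (pandas itself is an assumption, not something the text prescribes):

import pandas as pd

purchases = pd.DataFrame({"customer_id": [1, 2], "amount": [250, 80]})
regions = pd.DataFrame({"customer_id": [1, 2], "region": ["North", "South"]})

# Each purchase observation is enriched with the customer's region.
print(purchases.merge(regions, on="customer_id", how="left"))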
 Appending tables
Appending or stacking tables is effectively adding observations from one table to another
table.
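The same idea sketched with pandas, stacking two invented monthly tables:

import pandas as pd

jan = pd.DataFrame({"customer_id": [1, 2], "amount": [250, 80]})
feb = pd.DataFrame({"customer_id": [3], "amount": [120]})

# Appending (stacking) adds the observations of one table to those of another.
print(pd.concat([jan, feb], ignore_index=True))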
 Using views to simulate data joins and appends
To avoid duplication of data, you virtually combine data with views. When you physically join or
append tables, the data is duplicated and therefore needs more storage space. For small tables that
may not cause problems, but imagine that every table consists of terabytes of data; then duplicating
it becomes problematic. For this reason, the concept of a view was invented. A view behaves as if
you're working on a table, but this table is nothing but a virtual layer that combines the underlying
tables for you.
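A small illustration of a view, again using Python's built-in sqlite3 module and invented table names:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_2023 (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE sales_2024 (customer_id INTEGER, amount REAL)")
conn.execute("INSERT INTO sales_2023 VALUES (1, 100.0)")
conn.execute("INSERT INTO sales_2024 VALUES (1, 150.0)")

# The view combines both tables virtually; no data is physically duplicated.
conn.execute("""
    CREATE VIEW all_sales AS
    SELECT * FROM sales_2023
    UNION ALL
    SELECT * FROM sales_2024
""")
print(conn.execute("SELECT COUNT(*) FROM all_sales").fetchone())  # (2,)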

 Enriching aggregated measures


Data enrichment can also be done by adding calculated information to the table, such as the
total number of sales or what percentage of total stock has been sold in a certain region.
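A sketch of such an enrichment with pandas, adding each region's total and the row's share of it (numbers invented):

import pandas as pd

sales = pd.DataFrame({"region": ["N", "N", "S"], "amount": [100, 50, 200]})

# Enrich every row with an aggregated measure computed over its region.
sales["region_total"] = sales.groupby("region")["amount"].transform("sum")
sales["share_of_region"] = sales["amount"] / sales["region_total"]
print(sales)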

c) Transforming data
Certain models require their data to be in a certain shape. Now that you’ve cleansed and
integrated the data, this is the next task you’ll perform: transforming your data so it takes a
suitable form for data modeling.

 Transforming data
Relationships between an input variable and an output variable aren't always linear. Take, for
instance, a relationship of the form y = ae^(bx). Taking the logarithm turns this into the linear
relationship log(y) = log(a) + bx, which simplifies the estimation problem dramatically.
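A small numerical illustration, assuming NumPy: after taking the log of y, an ordinary least-squares fit of a straight line recovers a and b.

import numpy as np

a, b = 2.0, 0.5
x = np.linspace(1, 10, 50)
y = a * np.exp(b * x)  # y = a * e^(b*x): nonlinear in x

# log(y) = log(a) + b*x is linear, so a simple line fit estimates the parameters.
slope, intercept = np.polyfit(x, np.log(y), 1)
print(round(slope, 3), round(np.exp(intercept), 3))  # roughly b and a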
 Reducing the number of variables
Sometimes you have too many variables and need to reduce the number because they don’t
add new information to the model. Having too many variables in your model makes the model
difficult to handle, and certain techniques don’t perform well when you overload them with
too many input variables. For instance, all the techniques based on a Euclidean distance
perform well only up to 10 variables.

 Turning variables into dummies


Variables can be turned into dummy variables. Dummy variables can only take two values: true (1)
or false (0). They're used to indicate the presence or absence of a categorical effect that may explain
the observation. In this case you'll make separate columns for the classes stored in one variable and
indicate it with 1 if the class is present and 0 otherwise.
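A one-line way to do this, assuming pandas and an invented categorical variable:

import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})

# One column per class: 1 if the class is present for that observation, 0 otherwise.
print(pd.get_dummies(df, columns=["color"], dtype=int))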
Step 4: Exploratory data analysis
During exploratory data analysis you take a deep dive into the data. Information becomes much
easier to grasp when shown in a picture; therefore you mainly use graphical techniques to gain
an understanding of your data and the interactions between variables. This phase is about
exploring data, so keeping your mind open and your eyes peeled is essential during the
exploratory data analysis phase. The goal isn’t to cleanse the data, but it’s common that you’ll
still discover anomalies you missed before, forcing you to take a step back and fix them.
The visualization techniques you use in this phase range from simple line graphs or histograms
to more complex diagrams such as Sankey and network graphs. Sometimes
it’s useful to compose a composite graph from simple graphs to get even more insight into the
data. Other times the graphs can be animated or made interactive to make it easier and, let’s
admit it, way more fun.
These plots can be combined to provide even more insight. Overlaying several plots is
common practice.
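As a minimal example, assuming matplotlib and synthetic data, a histogram gives a quick first view of a variable's distribution:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.default_rng(42).normal(loc=50, scale=10, size=1000)

# A simple histogram is often the first look at how a variable is distributed.
plt.hist(data, bins=30)
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()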
Step 5: Build the models
With clean data in place and a good understanding of the content, you’re ready to build models
with the goal of making better predictions, classifying objects, or gaining an understanding of
the system that you’re modeling. This phase is much more focused than the exploratory
analysis step, because you know what you’re looking for and what you want the outcome to
be. The techniques you'll use now are borrowed from the fields of machine learning, data
mining, and/or statistics.
Building a model is an iterative process. The way you build your model depends on whether
you go with classic statistics or the somewhat more recent machine learning school, and the
type of technique you want to use. Either way, most models consist of the following main
steps:
1 Selection of a modeling technique and variables to enter in the model
2 Execution of the model
3 Diagnosis and model comparison

 Model and variable selection

You’ll need to select the variables you want to include in your model and a modeling technique.
Your findings from the exploratory analysis should already give a fair idea of what variables
will help you construct a good model. Many modeling techniques are available, and choosing
the right model for a problem requires judgment on your part. You’ll need to consider model
performance and whether your project meets all the requirements to use your model, as well
as other factors:
■ Must the model be moved to a production environment and, if so, would it be
easy to implement?
■ How difficult is the maintenance on the model: how long will it remain relevant
if left untouched?
■ Does the model need to be easy to explain?
 Model execution
Once you’ve chosen a model you’ll need to implement it in code. Luckily, most programming
languages, such as Python, already have libraries such as StatsModels or Scikit-learn. These
packages use several of the most popular techniques. Coding a model is a nontrivial task in most
cases, so having these libraries available can speed up the process. As you can see in the following
code, it’s fairly easy to use linear regression with StatsModels or Scikit-learn. Doing this yourself
would require much more effort even for the simple techniques.
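The book's original listing is not reproduced here; the following is a comparable minimal sketch on synthetic data, fitting the same linear regression once with StatsModels and once with Scikit-learn.

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

# StatsModels: the classic statistics flavour, with full diagnostic output available.
results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.params)

# Scikit-learn: the machine learning flavour, geared toward prediction.
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)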

 Model diagnostics and model comparison


You’ll be building multiple models from which you then choose the best one based on multiple
criteria. Working with a holdout sample helps you pick the best-performing model. A holdout
sample is a part of the data you leave out of the model building so it can be used to evaluate the
model afterward. The principle here is simple: the model should work on unseen data. You use
only a fraction of your data to estimate the model and the other part, the holdout sample, is kept
out of the equation. The model is then unleashed on the unseen data and error measures are
calculated to evaluate it.
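A small sketch of this idea with Scikit-learn's train_test_split on synthetic data; the 70/30 split is an arbitrary choice.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# The holdout sample is kept out of the estimation and only used for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(mean_squared_error(y_test, model.predict(X_test)))

The error on the holdout set, not on the training data, is what model comparison should be based on.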
Step 6: Presenting findings and building applications on top of them
After you’ve successfully analyzed the data and built a well-performing model, you’re ready to
present your findings to the world. This is an exciting part; all your hours of hard work have paid
off and you can explain what you found to the stakeholders. Sometimes people get so excited about
your work that you’ll need to repeat it over and over again because they value the predictions of
your models or the insights that you produced. For this reason, you need to automate your models.
This doesn’t always mean that you have to redo all of your analysis all the time. Sometimes it’s
sufficient that you implement only the model scoring; other times you might build an application
that automatically updates reports, Excel spreadsheets, or PowerPoint presentations. The last stage
of the data science process is where your soft skills will be most useful, and yes, they’re extremely
important.
BIG DATA

Big Data and its main characteristics


Big Data is the term describing large sets of diverse data ‒ structured, unstructured, and semi-
structured ‒ that are continuously generated at a high speed and in high volumes. A growing
number of companies now use this data to uncover meaningful insights and improve their decision-
making, but they can’t store and process it by means of traditional data storage and processing
units.

To understand Big Data, you need to get acquainted with its attributes known as the four V’s.

 Volume is what's “big” in Big Data. This relates to terabytes to petabytes of information
coming from a range of sources such as IoT devices, social media, text files, business
transactions, etc. Just so you can grasp the scale, 1 petabyte is equal to 1,000,000 gigabytes.
Streaming a single HD movie on Netflix uses roughly 4 gigabytes, so 1 petabyte could hold
about 250,000 such movies. And Big Data isn't about 1 petabyte; it's about thousands and
millions of them.
 Velocity is the speed at which the data is generated and processed. It’s represented in terms
of batch reporting, near real-time/real-time processing, and data streaming. The best-case
scenario is when the speed with which the data is produced meets the speed with which it
is processed. Let’s take the transportation industry for example. A single car connected to
the Internet with a telematics device plugged in generates and transmits 25 gigabytes of
data hourly at a near-constant velocity. And most of this data has to be handled in real-time
or near real-time.
 Variety is the vector showing the diversity of Big Data. This data isn’t just about structured
data that resides within relational databases as rows and columns. It comes in all sorts of
forms that differ from one application to another, and most of Big Data is unstructured.
Say, a simple social media post may contain some text information, videos or images, a
timestamp, etc.
 Veracity is the measure of how truthful, accurate, and reliable data is and what value it
brings. Data can be incomplete, inconsistent, or noisy, decreasing the accuracy of the
analytics process. Due to this, data veracity is commonly classified as good, bad, and
undefined. That’s quite a help when dealing with diverse data sets such as medical records,
in which any inconsistencies or ambiguities may have harmful effects.
Knowing the key characteristics, you can understand that not all data can be referred to as Big
Data.

Big data refers to datasets whose size is beyond the ability of typical database software tools to
capture, store, manage and analyse.

Convergence of key trends


• With digital transformation, technologies such as big data, IoT, and cloud storage are at
the top of the agenda for most businesses. Each of these technologies is no longer a
sophisticated ‘nice to have’.
• They have become a necessity. Although the three technologies evolved independently,
they are increasingly becoming intertwined.
• With wider adoption and further development, we are seeing a persistent convergence of
these three technologies, with their capabilities aligned in the best possible way.

Unstructured data
In the modern world of big data, unstructured data is the most abundant. It’s so prolific because
unstructured data could be anything: media, imaging, audio, sensor data, text data, and much more.
Unstructured simply means that it is datasets (typically large collections of files) that aren't stored
in a structured database format. Unstructured data has an internal structure, but it’s not predefined
through data models. It might be human generated, or machine generated in a textual or a non-
textual format.

Examples of unstructured data are:


 Rich media. Media and entertainment data, surveillance data, geo-spatial data, audio, weather
data
 Document collections. Invoices, records, emails, productivity applications

 Internet of Things (IoT). Sensor data, ticker data


 Analytics. Machine learning, artificial intelligence (AI)

Importance of Unstructured data


■ The amount of data (all data, everywhere) is doubling every two years.
■ Our world is becoming more transparent. We, in turn, are beginning to accept this as we
become more comfortable with parting with data that we used to consider sacred and private.
■ Most new data is unstructured. Specifically, unstructured data represents almost 95 percent
of new data, while structured data represents only 5 percent.
■ Unstructured data tends to grow exponentially, unlike structured data, which tends to grow
in a more linear fashion.
■ Unstructured data is vastly underutilized. Imagine huge deposits of oil or other natural
resources that are just sitting there, waiting to be used. That’s the current state of unstructured
data as of today. Tomorrow will be a different story because there’s a lot of money to be made
for smart individuals and companies that can mine unstructured data successfully

From a business perspective, you’ll need to learn how to:


Use Big Data analytics to drive value for your enterprise that aligns with your core competencies
and creates a competitive advantage for your enterprise
■ Capitalize on new technology capabilities and leverage your existing technology assets

■ Enable the appropriate organizational change to move towards fact-based decisions, adoption of
new technologies, and uniting people from multiple disciplines into a single multidisciplinary team
■ Deliver faster and superior results by embracing and capitalizing on the ever-increasing rate of
change that is occurring in the global marketplace.
Big Data analytics uses a wide variety of advanced analytics methods.

These methods of analytics are used to provide

■ Deeper insights. Rather than looking at segments, classifications, regions, groups, or other
summary levels you’ll have insights into all the individuals, all the products, all the parts, all the
events, all the transactions, etc.
■ Broader insights. The world is complex. Operating a business in a global, connected economy
is very complex given constantly evolving and changing conditions. As humans, we simplify
conditions so we can process events and understand what is happening. But our best-laid plans
often go astray because of the estimating or approximating. Big Data analytics takes into account
all the data, including new data sources, to understand the complex, evolving, and interrelated
conditions to produce more accurate insights.

■ Frictionless actions. Increased reliability and accuracy that will allow the deeper and broader
insights to be automated into systematic actions.
The key to success for organizations seeking to take advantage of this opportunity is:

■ Leverage all your current data and enrich it with new data sources
■ Enforce data quality policies and leverage today’s best technology and people to support the
policies
■ Relentlessly seek opportunities to imbue your enterprise with fact-based decision making
■ Embed your analytic insights throughout your organization

Industry Examples of Big Data


Nothing helps us understand Big Data more than examples of how the technology and
approaches are being used in the “real world.”
Digital Marketing and the Non-line World
Google’s digital marketing evangelist and author Avinash Kaushik spent the first 10 years of his
professional career in the world of business intelligence, during which he actually built large
multi-terabyte data warehouses and intelligence platforms.
As Kaushik explains:
When I built data warehouses, one of the things we were constantly in quest of was these single
sources of truth. My specialty was to work with large, complicated multinational companies and
build the single source of truth in a relational database such as Oracle. The single source of truth
would rely on very simplistic data from ERP and other sources.
Now in hindsight the thing I had to learn quickly is that the big data warehouse approach does
not work in the online world. Back then we were tasked to collect all the clickstream data from
our digital activities and it worked great for a few months, then it was a big disaster because
everything that works in the BI world does not work in the online world.
This is a painful lesson that I had to learn. That led me to postulating this thing—that in order for
you to be successful online, you have to embrace multiplicity. In fact, you have to give up on these
deeply embedded principles that companies can survive by building a single source of the truth.
This multiplicity that Kaushik refers to requires multiple skills in the decision-making team,
multiple tools, and multiple types of data (clickstream data, consumer data, competitive
intelligence data, etc.). “The problem is that this approach and thinking is diametrically 100 percent
opposed to the thing that we learned in the BI world,” he says. “That’s why most companies
struggle with making smart decisions online because they cannot, at some level, embrace
multiplicity and they cannot bring themselves to embrace incomplete data and it is against their
blood and DNA; they’re forced to pick perfect data in order to make decisions.”
Avinash Kaushik describes a simplistic scenario involving a major newspaper publisher that
struggled to find out what people want to read when they come to their website or digital
existence or mobile app or iPad app. One obvious action would be to have the analyst or the
marketer log into Google Analytics or Omniture’s clickstream analysis tool and look at the top
viewed pages on their website, mobile apps, and so on. That will help them understand what people
want to read. Do they want to read more sports? International news? In their minds that will help
frame the front page of the digital platform they have. Kaushik then explains the actual reason for
their problem: “The only way that your clickstream analysis tool can collect a piece of data is that
a page has to be rendered, it executes the code for Google Analytics, and it records the page for
you—now you know . . .”
Don’t Abdicate Relationships
Many of today’s marketers are discussing and assessing their approaches to engage consumers in
different ways such as social media marketing. However, Avinash Kaushik believes that a lot of
companies abdicate their primary online existence—their own websites—in favor of the
company’s Facebook page. For example, a large beer company urged consumers who came to the
company’s official website to go to Facebook where they could enter a sweepstakes. Without
pulling any punches, Kaushik says, “It annoys the hell out of me. If you’re going to be on
Facebook, at least be great on Facebook . . . don’t suck at it!” When it comes to marketing, when
it comes to consumer relationships, the thing that gets Kaushik excited is working with companies
that never had the opportunity to have a relationship with a consumer directly because they
abdicate, that is, they relinquish that relationship to their retailers. Now we’ve created this world
where any company, no matter how far removed it is from a consumer, can directly have a
relationship with the consumer.
According to Kaushik, the companies that do this best are consumer packaged goods companies
such as Procter & Gamble (P&G). Now, for example, P&G can plug into the digital marketing
world and get direct access to consumers through social media. Once that connection is made, it
helps the firm with critical business decisions such as new product development. Kaushik advised
this beer giant, with a multibillion-dollar marketing budget, to be on Facebook, but not to abdicate
its own existence, its own outpost in the world, because that is the only place where it can own
the data about its consumers.
Web Analytics:
Web analytics and big data are closely intertwined concepts that play a significant role in
understanding and optimizing online performance, customer behavior, and business strategies.
Here's a breakdown of each:
Web analytics involves collecting, measuring, analyzing, and reporting data related to website
traffic and user interactions. It provides valuable insights into how users find and navigate a
website, what actions they take, and how they engage with content. Key metrics in web analytics
include:
1. Traffic Sources: Understanding where website visitors come from, such as search engines,
social media, referral sites, or direct visits.
2. Page Views and Sessions: Tracking the number of pages viewed and sessions initiated by users
on the website.
3. Bounce Rate: The percentage of visitors who navigate away from the site after viewing only
one page, indicating a lack of engagement.
4. Conversion Rate: The percentage of visitors who complete a desired action, such as making a
purchase, filling out a form, or signing up for a newsletter.
5. User Behavior: Analyzing click patterns, navigation paths, and time spent on different pages to
identify user preferences and areas for improvement.
Web analytics tools like Google Analytics, Adobe Analytics, and others help businesses gather
and interpret this data to make informed decisions about website design, content strategy,
marketing campaigns, and overall user experience.
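As a toy illustration of two of these metrics, the sketch below (assuming pandas and an invented session log) computes a bounce rate and a conversion rate:

import pandas as pd

sessions = pd.DataFrame({
    "session_id": [1, 2, 3, 4, 5],
    "pages_viewed": [1, 4, 1, 7, 2],
    "converted": [0, 1, 0, 1, 0],
})

bounce_rate = (sessions["pages_viewed"] == 1).mean()  # share of one-page visits
conversion_rate = sessions["converted"].mean()        # share of visits with a conversion
print(f"bounce rate: {bounce_rate:.0%}, conversion rate: {conversion_rate:.0%}")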
In the context of web analytics, big data techniques and technologies are used to process and
analyze large volumes of web data to uncover patterns, trends, and correlations that traditional
analytics tools may overlook. This allows businesses to gain deeper insights into user behavior,
personalize experiences, and optimize marketing strategies at scale.
Some common big data technologies and frameworks used in web analytics include Hadoop,
Spark, Apache Kafka, and NoSQL databases like MongoDB and Cassandra.
By leveraging web analytics and big data together, businesses can gain a comprehensive
understanding of their online presence, customer preferences, and market dynamics, enabling them
to make data-driven decisions and stay competitive in today's digital landscape.
Big Data & Marketing
Big data has revolutionized the field of marketing by providing unprecedented insights into
consumer behavior, preferences, and trends. Here's how big data is transforming marketing:
1. Consumer Insights: Big data analytics enables marketers to gather and analyze vast amounts
of data from various sources, including social media, web interactions, purchase history,
demographic information, and more. By understanding consumer behavior and preferences at a
granular level, marketers can create highly targeted and personalized marketing campaigns.
2. Predictive Analytics: Big data enables predictive modeling, allowing marketers to forecast
future trends, identify potential opportunities, and anticipate consumer needs. Predictive analytics
can help optimize marketing strategies, such as determining the best time to launch a campaign,
predicting customer churn, or identifying segments with the highest conversion potential.
3. Segmentation and Targeting: With big data, marketers can segment their audience more
effectively based on demographics, behavior, interests, and other relevant factors. This enables
personalized targeting and messaging, leading to higher engagement and conversion rates.
Advanced segmentation techniques, such as clustering and machine learning algorithms, help
marketers identify niche segments and tailor their marketing efforts accordingly.
4. Real-time Marketing: Big data analytics enables real-time monitoring of consumer interactions
and market trends, allowing marketers to respond quickly to changing conditions and
opportunities. Real-time marketing tactics, such as dynamic pricing, personalized
recommendations, and contextual advertising, can be implemented to deliver timely and relevant
messages to consumers.
5. Marketing Attribution: Big data helps marketers track and analyze the effectiveness of their
marketing campaigns across multiple channels and touchpoints. By attributing conversions and
sales to specific marketing efforts, businesses can optimize their marketing spend, allocate
resources more efficiently, and maximize ROI.
6. Content Optimization: Big data analytics provides insights into content performance, allowing
marketers to understand which types of content resonate with their target audience. By analyzing
metrics such as engagement, click-through rates, and social shares, marketers can optimize content
creation and distribution strategies to drive better results.
7. Customer Experience Enhancement: Big data enables marketers to gain a holistic view of the
customer journey across multiple channels and touchpoints. By analyzing customer interactions
and feedback, businesses can identify pain points, optimize the user experience, and deliver more
personalized and seamless experiences to their customers.
Overall, big data empowers marketers to make data-driven decisions, improve targeting and
personalization, enhance customer experiences, and ultimately drive business growth. However,
it's essential for marketers to prioritize data privacy and security and ensure ethical use of consumer
data in all marketing activities.
Fraud & Big Data
Big data plays a crucial role in fraud detection and prevention across various industries, including
finance, healthcare, e-commerce, and more. Here's how big data is used to combat fraud:
1. Anomaly Detection: Big data analytics can identify unusual patterns or anomalies in vast
datasets, which may indicate fraudulent activities. By analyzing transactional data, user behavior,
and other relevant information in real-time, anomalies such as unusual spending patterns, atypical
login locations, or suspicious account activity can be flagged for further investigation (a brief code sketch follows this list).
2. Machine Learning and Predictive Analytics: Machine learning algorithms trained on
historical data can predict and detect potential fraud with high accuracy. These algorithms can
continuously learn from new data and adapt to evolving fraud patterns, improving detection rates
over time. Predictive analytics models can also forecast future fraudulent activities based on
historical trends and patterns, enabling proactive fraud prevention measures.
3. Behavioral Analysis: Big data analytics can analyze user behavior and interactions to create
behavioral profiles for individuals or entities. Deviations from established behavior patterns, such
as sudden changes in spending habits or transaction frequency, can raise red flags for potential
fraud. Behavioral analysis can help detect both known fraud patterns and emerging threats that
may not be detected by rule-based systems.
4. Network Analysis: Big data techniques can analyze complex networks of relationships and
connections between entities, such as customers, accounts, devices, or IP addresses. By examining
the relationships between different data points, suspicious patterns of behavior, such as collusion
or organized fraud rings, can be identified more effectively.
5. Data Fusion and Enrichment: Big data platforms can integrate and analyze diverse data
sources, including structured and unstructured data from internal and external sources. By
combining transactional data with other relevant information, such as social media activity,
geolocation data, device information, and historical fraud data, a more comprehensive view of
potential fraud can be obtained, enhancing detection capabilities.
6. Real-time Monitoring and Alerts: Big data technologies enable real-time monitoring of
transactions and activities, allowing organizations to detect and respond to fraudulent events as
they occur. Automated alerting systems can notify stakeholders of suspicious activities instantly,
enabling timely intervention and mitigation of fraud risks.
7. Fraud Risk Assessment: Big data analytics can assess the overall fraud risk within an
organization by analyzing various factors, such as transaction volumes, geographic locations,
customer profiles, and industry benchmarks. By quantifying the likelihood and potential impact of
fraud, organizations can allocate resources more effectively to prevent and mitigate fraud risks.
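To make the anomaly-detection idea in point 1 concrete, the following minimal sketch scores a tiny, invented table of transactions with scikit-learn's IsolationForest. The column names, values, and contamination rate are assumptions chosen for illustration, not a description of any particular fraud system.

# Minimal anomaly-detection sketch for transaction data (illustrative only).
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical transaction features: amount spent and hour of day.
transactions = pd.DataFrame({
    "amount": [25.0, 40.0, 32.5, 28.0, 4999.0, 30.0, 35.0, 5200.0],
    "hour":   [13,   18,   12,   15,   3,      14,   19,   2],
})

# Fit an Isolation Forest; 'contamination' is a guess at the fraud rate.
model = IsolationForest(contamination=0.25, random_state=42)
transactions["flag"] = model.fit_predict(transactions[["amount", "hour"]])

# A flag of -1 marks likely anomalies to route for manual review.
print(transactions[transactions["flag"] == -1])

In practice the flagged records would feed an investigation queue or a real-time alerting pipeline rather than a print statement.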
In summary, big data provides powerful tools and techniques for detecting, preventing, and
mitigating fraud across diverse industries. By leveraging advanced analytics, machine learning,
and real-time monitoring capabilities, organizations can stay ahead of evolving fraud threats and
protect their assets, customers, and reputation more effectively.
Risk & Big Data
Big data plays a critical role in managing and mitigating risks across various domains, including
finance, insurance, healthcare, cybersecurity, supply chain management, and more. Here's how big
data is utilized in risk management:
1. Risk Identification: Big data analytics can aggregate and analyze vast amounts of data from
multiple sources to identify potential risks and vulnerabilities. By analyzing historical data, market
trends, customer behavior, and other relevant information, organizations can identify emerging
risks and anticipate potential threats before they materialize.
2. Predictive Analytics: Big data enables predictive modeling and analysis to forecast future risks
and trends. By leveraging machine learning algorithms and statistical techniques, organizations
can identify patterns and correlations in data that may indicate future risks or opportunities.
Predictive analytics can help organizations anticipate and prepare for potential risks, enabling
proactive risk management strategies.
3. Fraud Detection: As mentioned earlier, big data analytics is instrumental in detecting and
preventing fraudulent activities across various industries. By analyzing transactional data, user
behavior, and other relevant information, organizations can identify suspicious patterns and
anomalies that may indicate fraudulent behavior. Real-time monitoring and advanced analytics
techniques can help organizations detect and mitigate fraud risks more effectively.
4. Credit Risk Assessment: In the financial industry, big data analytics is used to assess credit
risk by analyzing borrower profiles, payment history, economic indicators, and other relevant
factors. By leveraging advanced analytics and machine learning algorithms, lenders can evaluate
creditworthiness more accurately and make better-informed lending decisions (a toy scoring sketch
appears after this list).
5. Operational Risk Management: Big data helps organizations identify and mitigate operational
risks associated with business processes, supply chains, and infrastructure. By analyzing data from
sensors, IoT devices, and operational systems, organizations can monitor performance, identify
potential bottlenecks or failures, and take proactive measures to mitigate operational risks.
6. Compliance and Regulatory Risk: Big data analytics can assist organizations in managing
compliance and regulatory risks by analyzing large volumes of data to ensure adherence to laws,
regulations, and industry standards. By monitoring transactions, communications, and activities,
organizations can identify potential compliance violations and take corrective actions to mitigate
regulatory risks.
7. Cybersecurity Risk Management: Big data analytics plays a crucial role in cybersecurity risk
management by analyzing network traffic, system logs, and user behavior to detect and respond to
cyber threats. By leveraging advanced analytics and machine learning algorithms, organizations
can identify suspicious activities, detect security breaches, and strengthen their cybersecurity
posture.
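As a rough illustration of point 4, the sketch below fits a logistic regression to a tiny, invented set of borrower records and scores a new applicant. The features, labels, and decision threshold are assumptions; a real credit-risk model would use far richer data, careful feature engineering, and rigorous validation.

# Toy credit-risk scoring sketch (assumed features, labels, and threshold).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: income (thousands), debt-to-income ratio, late payments last year.
X = np.array([
    [45, 0.40, 2],
    [85, 0.15, 0],
    [30, 0.55, 4],
    [120, 0.10, 0],
    [55, 0.35, 1],
    [25, 0.60, 5],
])
y = np.array([1, 0, 1, 0, 0, 1])  # 1 = defaulted, 0 = repaid

model = LogisticRegression(max_iter=1000).fit(X, y)

# Score a new applicant and turn the default probability into a decision.
applicant = np.array([[60, 0.30, 1]])
p_default = model.predict_proba(applicant)[0, 1]
decision = "approve" if p_default < 0.3 else "refer for manual review"
print(f"estimated default probability: {p_default:.2f} -> {decision}")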
Overall, big data enables organizations to enhance their risk management capabilities by providing
insights into potential risks, predicting future trends, and enabling proactive risk mitigation
strategies. By leveraging advanced analytics and real-time monitoring capabilities, organizations
can identify and address risks more effectively, ultimately enhancing resilience and driving
business success.
Credit Card & Big Data
Big data has significantly impacted the credit card industry, revolutionizing how credit card
issuers assess risk, detect fraud, personalize offers, and enhance customer experiences. Here's
how big data is transforming credit card operations:
1. Credit Risk Assessment: Big data analytics enables credit card issuers to assess credit risk
more accurately by analyzing vast amounts of data, including credit history, payment behavior,
income levels, employment status, and demographic information. By leveraging advanced
analytics and machine learning algorithms, issuers can evaluate creditworthiness more precisely,
approve applications faster, and offer tailored credit limits and interest rates to individual
customers.
2. Fraud Detection and Prevention: Big data analytics is instrumental in detecting and preventing
credit card fraud by analyzing transactional data, user behavior, and other relevant information.
By monitoring patterns and anomalies in real-time, issuers can identify suspicious activities, such
as unauthorized transactions, account takeover attempts, or fraudulent applications, and take
immediate action to mitigate fraud risks.
3. Personalized Offers and Rewards: Big data enables credit card issuers to personalize offers,
rewards, and promotions based on individual customer preferences, spending patterns, and
lifestyle choices. By analyzing transactional data and external sources such as social media activity
and location data, issuers can tailor rewards programs, cashback offers, and promotional
campaigns to meet the unique needs and interests of each customer, driving engagement and
loyalty.
4. Customer Segmentation and Targeting: Big data analytics helps credit card issuers segment
their customer base more effectively and target specific segments with relevant products and
services. By analyzing demographic data, spending behavior, and life events, issuers can identify
high-value segments, such as frequent travelers, business owners, or affluent consumers, and
develop targeted marketing strategies to attract and retain these customers (a simple clustering
sketch appears after this list).
5. Customer Experience Enhancement: Big data enables credit card issuers to enhance the
customer experience by providing personalized recommendations, proactive support, and seamless
interactions across multiple channels. By leveraging data analytics and machine learning, issuers
can anticipate customer needs, resolve issues more efficiently, and deliver tailored services that
meet or exceed customer expectations, fostering loyalty and satisfaction.
6. Risk Management and Compliance: Big data analytics helps credit card issuers manage risk
and ensure compliance with regulatory requirements by analyzing transactional data, monitoring
for suspicious activities, and detecting potential compliance violations. By leveraging advanced
analytics and real-time monitoring capabilities, issuers can identify emerging risks, mitigate fraud
threats, and maintain regulatory compliance more effectively, protecting both customers and the
business.
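The segmentation idea in point 4 can be sketched with basic clustering. The snippet below groups cardholders by two invented spend features using scikit-learn's KMeans; the features and the choice of three segments are assumptions made purely for illustration.

# Illustrative customer segmentation with k-means (assumed features).
import numpy as np
from sklearn.cluster import KMeans

# Columns: average monthly spend, share of spend on travel.
customers = np.array([
    [400, 0.05], [450, 0.10], [3000, 0.60],
    [2800, 0.55], [900, 0.20], [950, 0.25],
])

# Group cardholders into three hypothetical segments.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
for spend, segment in zip(customers, kmeans.labels_):
    print(f"spend profile {spend} -> segment {segment}")

Each resulting segment would then be matched to a different offer or rewards strategy.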
Overall, big data has transformed the credit card industry by enabling issuers to make data-driven
decisions, enhance risk management capabilities, personalize customer experiences, and drive
business growth. By leveraging big data analytics, credit card issuers can stay competitive in a
rapidly evolving market and meet the evolving needs and expectations of their customers.
Big Data & Algorithmic Trading
Big data and algorithmic trading are closely intertwined in the world of finance, particularly in the
domain of high-frequency trading (HFT) and quantitative trading. Here's how big data impacts
algorithmic trading:
1. Data Sources: Big data encompasses vast amounts of financial data from various sources,
including market data feeds, news articles, social media sentiment, economic indicators, company
filings, and more. Algorithmic traders leverage this diverse range of data sources to make informed
trading decisions and generate alpha.
2. Data Processing: Big data technologies and techniques are used to process and analyze large
volumes of financial data in real-time. High-performance computing systems and distributed
processing frameworks enable algorithmic traders to analyze market data quickly and efficiently,
identifying trading opportunities and reacting to market events with minimal latency.
3. Predictive Analytics: Big data analytics enables algorithmic traders to build predictive models
and algorithms that forecast market trends, price movements, and volatility. By analyzing
historical data and identifying patterns and correlations, traders can develop predictive models that
generate signals for buying or selling assets based on predefined criteria.
4. Machine Learning: Machine learning algorithms play a crucial role in algorithmic trading by
analyzing large datasets to identify complex patterns and relationships. Traders use machine
learning techniques such as regression, classification, clustering, and neural networks to develop
predictive models that learn from data and adapt to changing market conditions, improving trading
performance over time.
5. Strategy Optimization: Big data analytics helps algorithmic traders optimize trading strategies
by backtesting and simulating them against historical data. By analyzing the performance of
different trading algorithms under various market conditions, traders can identify the strategies
that deliver the highest returns for the lowest risk and refine them accordingly (a minimal
backtesting sketch appears after this list).
6. Risk Management: Big data analytics enables algorithmic traders to manage risk more
effectively by analyzing portfolio metrics, assessing market conditions, and monitoring for
anomalies or outliers. Traders use risk models and scenario analysis to quantify and mitigate risk
exposure, ensuring that trading strategies remain within acceptable risk limits.
7. Execution Algorithms: Big data technologies are used to optimize trade execution algorithms,
minimizing execution costs and market impact. Traders leverage real-time market data and order
flow information to execute trades efficiently and achieve best execution, taking into account
factors such as liquidity, market depth, and order book dynamics.
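To make the backtesting idea in point 5 concrete, here is a minimal sketch that evaluates a simple moving-average crossover rule on a synthetic price series with pandas. The window lengths and the random price path are assumptions; a realistic backtest would also account for transaction costs, slippage, and out-of-sample testing.

# Minimal moving-average crossover backtest on synthetic prices (illustrative).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0.0005, 0.01, 500))))

fast = prices.rolling(10).mean()
slow = prices.rolling(50).mean()

# Hold the asset (position = 1) whenever the fast average is above the slow one.
position = (fast > slow).astype(int).shift(1).fillna(0)

daily_returns = prices.pct_change().fillna(0)
strategy_returns = position * daily_returns

print("buy-and-hold return:      ", (1 + daily_returns).prod() - 1)
print("crossover strategy return:", (1 + strategy_returns).prod() - 1)

Comparing the two cumulative returns across many historical scenarios is the essence of strategy backtesting.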
Overall, big data plays a crucial role in algorithmic trading by providing access to vast amounts of
financial data, enabling real-time data processing and analysis, facilitating predictive modeling
and machine learning, optimizing trading strategies, managing risk, and enhancing trade execution
efficiency. By leveraging big data analytics, algorithmic traders can gain a competitive edge in
today's fast-paced and data-driven financial markets.
Big Data & Healthcare
Big data has the potential to revolutionize healthcare by transforming how medical data is
collected, analyzed, and utilized to improve patient outcomes, enhance clinical decision-making,
and optimize healthcare delivery. Here's how big data is impacting healthcare:
1. Medical Research and Drug Discovery: Big data analytics enables researchers to analyze vast
amounts of medical data, including electronic health records (EHRs), genomic data, clinical trials
data, and medical literature, to identify patterns, correlations, and insights that can lead to new
discoveries and advancements in medicine. By leveraging big data, researchers can accelerate the
drug discovery process, identify new treatment options, and develop personalized medicine
approaches tailored to individual patients' genetic makeup and medical history.
2. Disease Surveillance and Outbreak Detection: Big data analytics helps public health
organizations and policymakers monitor and track disease outbreaks, epidemics, and health trends
in real-time. By analyzing data from sources such as electronic health records, social media, search
queries, and wearable devices, healthcare authorities can detect early warning signs of infectious
diseases, identify high-risk populations, and implement timely interventions to prevent the spread
of illness and protect public health.
3. Predictive Analytics for Preventive Care: Big data enables predictive modeling and analytics
to identify individuals at risk of developing chronic diseases or experiencing adverse health
outcomes. By analyzing demographic data, medical history, lifestyle factors, and biometric data,
healthcare providers can stratify patient populations by risk profile and proactively intervene
with preventive measures, lifestyle modifications, and targeted interventions to mitigate health
risks and improve outcomes (a small stratification sketch appears after this list).
4. Clinical Decision Support Systems: Big data analytics powers clinical decision support
systems (CDSS) that provide healthcare providers with evidence-based recommendations,
guidelines, and insights at the point of care. By integrating patient data, medical literature, best
practices, and clinical guidelines, CDSS can assist clinicians in diagnosing conditions, selecting
appropriate treatments, and avoiding medical errors, ultimately improving patient safety and
quality of care.
5. Population Health Management: Big data analytics enables population health management
initiatives aimed at improving the health outcomes of entire populations. By aggregating and
analyzing data from diverse sources, including EHRs, claims data, social determinants of health,
and environmental factors, healthcare organizations can identify at-risk populations, allocate
resources more effectively, and implement targeted interventions to address health disparities,
reduce healthcare costs, and improve population health outcomes.
6. Telemedicine and Remote Monitoring: Big data facilitates telemedicine and remote
monitoring initiatives by enabling the collection, analysis, and transmission of health data from
remote locations. By leveraging wearable devices, mobile health apps, and remote monitoring
technologies, healthcare providers can remotely monitor patients' vital signs, track disease
progression, and provide timely interventions and virtual consultations, improving access to care
and patient engagement, especially in rural or underserved areas.
7. Healthcare Operations and Resource Management: Big data analytics helps healthcare
organizations optimize operations, resource allocation, and healthcare delivery processes. By
analyzing data on patient flow, bed occupancy, staffing levels, and resource utilization, healthcare
administrators can identify inefficiencies, streamline workflows, and allocate resources more
effectively to improve patient throughput, reduce wait times, and enhance the overall healthcare
experience.
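As a small illustration of the stratification step in point 3, the sketch below buckets a patient panel into risk tiers from hypothetical predicted-risk scores using pandas. The scores and cut-off thresholds are invented for illustration; real risk scores would come from validated clinical models.

# Toy risk stratification of a patient panel (assumed scores and thresholds).
import pandas as pd

patients = pd.DataFrame({
    "patient_id": ["P001", "P002", "P003", "P004", "P005"],
    "risk_score": [0.05, 0.42, 0.78, 0.12, 0.91],  # e.g., predicted 1-year risk
})

# Bucket patients into tiers that trigger different preventive interventions.
patients["tier"] = pd.cut(
    patients["risk_score"],
    bins=[0, 0.2, 0.6, 1.0],
    labels=["low", "medium", "high"],
)
print(patients)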
Overall, big data has the potential to drive significant advancements in healthcare by enabling
data-driven decision-making, personalized medicine approaches, preventive care interventions,
population health management initiatives, and operational efficiencies, ultimately leading to better
patient outcomes, improved healthcare quality, and reduced costs. However, it's essential to
address privacy, security, and regulatory concerns to ensure responsible and ethical use of
healthcare data while maximizing the benefits of big data in healthcare.
Big Data in Medicine Advertising
Big data has transformed advertising in the medical field, enabling pharmaceutical companies,
healthcare providers, and other stakeholders to target and engage audiences more effectively,
comply with regulations, and measure the impact of their marketing campaigns. Here's how big
data is used in medicine advertising:
1. Audience Targeting: Big data analytics enables medical advertisers to identify and target
specific audience segments based on demographic data, medical history, online behavior, and other
relevant factors. By analyzing vast amounts of data from various sources, including electronic
health records (EHRs), patient databases, and online platforms, advertisers can create highly
targeted and personalized advertising campaigns tailored to individual patient needs, preferences,
and interests.
2. Precision Marketing: Big data facilitates precision marketing approaches that deliver relevant
and timely messages to the right audience at the right time and through the right channels. By
leveraging data analytics and machine learning algorithms, advertisers can optimize ad
placements, messaging, and creative assets to maximize engagement and conversion rates while
minimizing ad spend and wastage.
3. Compliance and Regulation: Big data helps medical advertisers comply with regulations and
guidelines governing the advertising and promotion of pharmaceutical products and medical
services. By analyzing regulatory requirements, monitoring industry guidelines, and leveraging
compliance tools and solutions, advertisers can ensure that their marketing campaigns adhere to
legal and ethical standards, mitigate compliance risks, and maintain trust and credibility with
stakeholders.
4. Adaptive Messaging: Big data enables advertisers to adapt their messaging and content based
on real-time feedback, audience reactions, and market dynamics. By monitoring social media
sentiment, online reviews, and other feedback channels, advertisers can quickly identify emerging
trends, address customer concerns, and refine their messaging to resonate with target audiences
more effectively.
5. Performance Measurement: Big data analytics provides advertisers with actionable insights
and performance metrics to measure the effectiveness of their campaigns and optimize future
marketing efforts. By analyzing key performance indicators (KPIs) such as click-through rate,
conversion rate, return on investment (ROI), and customer lifetime value (CLV), advertisers can
assess campaign performance, identify areas for improvement, and allocate resources more
efficiently to maximize advertising ROI (a short KPI calculation appears after this list).
6. Predictive Analytics: Big data enables predictive modeling and analytics to forecast future
trends, audience behavior, and market dynamics. By leveraging historical data, market trends, and
predictive algorithms, advertisers can anticipate patient needs, identify emerging opportunities,
and develop proactive marketing strategies to stay ahead of the competition and capitalize on
market trends.
7. Data Privacy and Security: Big data in medicine advertising must prioritize data privacy and
security to protect patient confidentiality and comply with healthcare regulations such as the
Health Insurance Portability and Accountability Act (HIPAA). Advertisers must implement robust
data governance frameworks, encryption protocols, and access controls to safeguard patient data
and ensure compliance with privacy regulations while leveraging big data analytics for advertising
purposes.
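The performance metrics mentioned in point 5 reduce to simple arithmetic. The figures below are invented campaign numbers used only to show how click-through rate, conversion rate, and ROI are typically computed.

# Illustrative campaign KPI calculations (all numbers are made up).
impressions = 200_000
clicks = 3_000
conversions = 150
revenue = 22_500.0   # revenue attributed to the campaign
ad_spend = 9_000.0

ctr = clicks / impressions                    # click-through rate
conversion_rate = conversions / clicks        # conversions per click
roi = (revenue - ad_spend) / ad_spend         # return on ad spend invested

print(f"CTR: {ctr:.2%}, conversion rate: {conversion_rate:.2%}, ROI: {roi:.0%}")

With these sample numbers the campaign yields a 1.50% CTR, a 5.00% conversion rate, and a 150% ROI.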
In summary, big data has transformed medicine advertising by enabling audience targeting,
precision marketing, compliance management, adaptive messaging, performance measurement,
predictive analytics, and data privacy and security. By leveraging big data analytics, medical
advertisers can create more personalized and effective advertising campaigns that resonate with
target audiences, drive engagement, and ultimately improve patient outcomes.
Big Data Technologies
Big data technologies encompass a wide range of tools, platforms, and frameworks designed to
process, store, analyze, and visualize large volumes of data. These technologies enable
organizations to extract actionable insights, make data-driven decisions, and derive value from
their data assets. Here are some of the key big data technologies:
1. Hadoop: Hadoop is an open-source framework for distributed storage and processing of large
datasets across clusters of commodity hardware. Its core components are the Hadoop Distributed
File System (HDFS) for storage, YARN for cluster resource management, and MapReduce for parallel
batch processing. Hadoop is widely used for batch processing, data warehousing, and analytics
applications.
2. Apache Spark: Apache Spark is a fast, general-purpose cluster computing framework that
provides in-memory processing capabilities for big data analytics. Spark supports several
programming languages (e.g., Scala, Java, Python, R) and provides high-level APIs for batch
processing, streaming analytics, machine learning, and graph processing (a short PySpark sketch
appears after this list).
3. NoSQL Databases: NoSQL databases, such as MongoDB, Cassandra, and HBase, are designed
to handle large volumes of unstructured and semi-structured data. These databases provide flexible
schema designs, horizontal scalability, and high availability, making them well-suited for big data
applications that require real-time data ingestion and low-latency access.
4. Apache Kafka: Apache Kafka is a distributed streaming platform that enables real-time data
processing and event-driven architectures. Kafka provides high-throughput, fault-tolerant
messaging for streaming data pipelines, enabling organizations to ingest, process, and analyze
large volumes of data in real time (a minimal producer sketch appears after this list).
5. Apache Flink: Apache Flink is a stream processing framework that provides stateful, fault-
tolerant processing of real-time data streams. Flink supports event time processing, windowing,
and exactly-once semantics, making it suitable for high-throughput, low-latency stream processing
applications.
6. Data Lakes: Data lakes are centralized repositories that store raw, unstructured, and structured
data at scale. Technologies such as Amazon S3, Azure Data Lake Storage, and Google Cloud
Storage provide scalable storage solutions for storing diverse data types, including structured data,
semi-structured data, and unstructured data, enabling organizations to perform batch processing,
analytics, and machine learning on large datasets.
7. Distributed Computing Frameworks: Distributed computing frameworks, such as Apache
Hadoop, Apache Spark, and Apache Flink, provide distributed processing capabilities for
parallelizing data processing tasks across clusters of servers. These frameworks enable
organizations to scale their data processing and analytics capabilities horizontally, processing large
volumes of data in parallel to achieve high throughput and performance.
8. Machine Learning Libraries: Machine learning libraries, such as TensorFlow, PyTorch, and
Scikit-learn, provide tools and algorithms for building and training machine learning models on
big data. These libraries support distributed training and processing of large datasets, enabling
organizations to perform advanced analytics, predictive modeling, and pattern recognition on big
data.
9. Data Visualization Tools: Data visualization tools, such as Tableau, Power BI, and Apache
Superset, enable organizations to create interactive dashboards and visualizations for exploring
and analyzing big data. These tools provide intuitive interfaces for visualizing complex datasets,
uncovering insights, and communicating findings to stakeholders effectively.
10. Containerization and Orchestration: Containerization platforms, such as Docker and
Kubernetes, enable organizations to package and deploy big data applications and services in
lightweight, portable containers. Container orchestration frameworks, such as Kubernetes, provide
tools for managing and scaling containerized applications across clusters of servers, ensuring high
availability, fault tolerance, and resource optimization.
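For a concrete sense of the Spark entry above, here is a minimal PySpark sketch that aggregates a small in-memory dataset. The column names and values are assumptions; real workloads would read from HDFS, a data lake, or Kafka rather than a hard-coded list, and would run on a cluster instead of a local session.

# Minimal PySpark aggregation sketch (requires a local Spark installation).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("toy-aggregation").getOrCreate()

# Hypothetical transaction records: (customer, amount).
df = spark.createDataFrame(
    [("alice", 40.0), ("bob", 15.5), ("alice", 22.0), ("carol", 99.0)],
    ["customer", "amount"],
)

# Total spend per customer, computed in parallel across the cluster.
df.groupBy("customer").agg(F.sum("amount").alias("total_spend")).show()

spark.stop()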
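And to illustrate the Kafka entry, the sketch below publishes a few JSON events with the kafka-python client. The broker address and topic name are placeholders, and the example assumes a Kafka broker is already running locally.

# Minimal Kafka producer sketch using the kafka-python client (assumed setup).
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",              # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a few hypothetical transaction events to a topic.
for event in [{"user": "alice", "amount": 40.0}, {"user": "bob", "amount": 15.5}]:
    producer.send("transactions", value=event)

producer.flush()   # make sure buffered messages are actually sent
producer.close()

A downstream consumer, stream processor, or fraud-detection job would then read these events from the topic in real time.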
These are just some of the key big data technologies that organizations leverage to unlock the value
of their data assets and drive innovation in data-driven decision-making, analytics, and business
intelligence. As the field of big data continues to evolve, new technologies and innovations are
emerging to address the growing demands for processing, analyzing, and deriving insights from
large volumes of data.