Q1: Major issues in data mining – performance, scalability, and data quality
Data mining systems must handle large and complex datasets efficiently. Performance
issues arise when algorithms need excessive computation or memory. For example, naive
association-rule mining on millions of transactions can be prohibitively slow. Efficient data
structures (e.g. indexing, hashing) and algorithmic optimizations are needed so that mining
completes in reasonable time unstop.com unstop.com . Scalability refers to how well a method
copes as data volume grows. Many data sources now produce big data that do not fit in
memory. Scalable solutions include parallel/distributed mining (MapReduce/Spark) and
sampling. An algorithm lacking scalability may become impractical even on modestly larger
datasets unstop.com unstop.com .
Data quality is equally critical. Real-world data often contain errors: missing entries, duplicate
records, noisy or inconsistent values. For instance, sensor datasets may have random
missing readings or outliers. Poor data quality can drastically reduce accuracy: “missing
values can harm the accuracy and reliability of models, reduce sample size, [and] introduce
bias” geeksforgeeks.org geeksforgeeks.org . Noise or outliers can skew statistics and lead to incorrect
patterns. To address this, preprocessing steps (cleaning, normalization, outlier detection) are
applied (see Q2, Set-4). For example, imputing missing ages with the median value avoids
losing many records. Finally, these issues interact: dirty or massive data aggravates
performance problems – e.g. noisy, high-dimensional data may require extra cleaning passes
before mining.
Performance – Measured by runtime and memory usage. Slow algorithms (e.g. naive
join) must be avoided or accelerated. E.g., an $O(n^2)$ clustering on millions of points
would be infeasible. Optimized methods (e.g. indexing) are needed unstop.com .
Scalability – Ability to handle growing data size. An algorithm that works on 10K records
may fail on 10M. Scalable approaches use parallelism or data reduction. E.g., Apache
Spark’s MLlib or sampling strategies help large-scale mining unstop.com .
Data Quality – Missing values (blank entries, “NA”) and noise/outliers corrupt the input.
For example, inaccurate sensor readings (outliers) can distort learned models, and
missing survey answers reduce the effective data. As noted, unclean data “reduces the
sample size” and “introduces bias” if not handled geeksforgeeks.org medium.com .
In summary, data mining must incorporate efficient algorithms and hardware to ensure good
performance and scalability, and robust preprocessing to improve data quality unstop.com
geeksforgeeks.org . Addressing these issues is crucial for reliable, timely knowledge discovery.
Data preprocessing transforms raw data into a clean, consistent form ready for mining. Key
steps include data cleaning, integration, transformation, and reduction geeksforgeeks.org
Data Cleaning: Identify and fix errors. In our data, Bob’s Age is missing (NA) and David’s
salary (1,000,000) is an outlier. We handle missing values by imputation (e.g. set Bob’s
age to the median age 30) and remove or cap outliers (e.g. replace David’s salary with a
reasonable max of 60000). Also, “sales” vs “Sales” in Carol’s Department should be
standardized (e.g. lowercase all or map to a code). Data cleaning ensures accuracy and
consistency geeksforgeeks.org .
Data Integration: If data come from multiple sources, integrate them. (In our mini-
example this step is trivial, but in practice one might join tables on keys.) Integration also
includes resolving schema and value conflicts, e.g. mapping synonyms or merging
tables.
Data Transformation: Convert data to suitable formats. Common transformations
include normalization (rescaling numeric attributes) and encoding. For instance, after
cleaning, we might scale Salary to 0–1 range, or one-hot encode “Department” into binary
fields (HR, Sales, Tech) for mining. We might also discretize Age into bins
(young/mid/senior) if needed. These steps (normalization, aggregation, discretization)
make patterns more detectable geeksforgeeks.org .
Data Reduction: Simplify the dataset while preserving key information. Examples:
dropping irrelevant attributes (e.g. “Name” might be dropped as an ID), or reducing
dimensionality (using PCA or feature selection) to improve efficiency. We might also
apply binning to combine values. This results in faster mining with little loss of
meaningful info geeksforgeeks.org .
For our example, the cleaned and transformed output would have Bob’s Age imputed (to the
median, 30), David’s salary capped at 60000, Salary normalized, and Department one-hot
encoded. Preprocessing like this greatly improves mining quality and
performance geeksforgeeks.org geeksforgeeks.org .
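As a rough illustration of these steps, a pandas-based cleaning pass might look like the sketch below; everything except the values named in the text (Bob’s missing Age, David’s 1,000,000 salary, the “sales”/“Sales” inconsistency, the 60000 cap) is made up.
# Hedged sketch with pandas; rows and values not named in the text are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "David"],       # the "Alice" row is invented
    "Age": [25.0, None, 30.0, 35.0],                  # Bob's Age is missing
    "Salary": [48000, 45000, 52000, 1000000],         # David's salary is an outlier
    "Department": ["HR", "Tech", "sales", "Sales"],   # inconsistent casing
})

df["Age"] = df["Age"].fillna(df["Age"].median())      # impute missing age (median is 30 here)
df["Salary"] = df["Salary"].clip(upper=60000)         # cap the outlier at 60000
df["Department"] = df["Department"].str.lower()       # standardize "sales" vs "Sales"
lo, hi = df["Salary"].min(), df["Salary"].max()
df["Salary_norm"] = (df["Salary"] - lo) / (hi - lo)   # min-max normalize to 0-1
df = pd.concat([df, pd.get_dummies(df["Department"])], axis=1)   # one-hot encode Department
print(df)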
The Apriori algorithm finds all frequent itemsets by iteratively expanding smaller frequent
sets. Its principle is that any subset of a frequent itemset must also be frequent. Thus Apriori
prunes the search space using this “downward closure” property geeksforgeeks.org . The algorithm
proceeds level-by-level:
1. First pass (1-itemsets): Scan the transaction database to count each single item’s
support. Discard items below the minimum support threshold. The result is the frequent
1-itemsets $L_1$ geeksforgeeks.org .
2. k>1 passes (k-itemsets): Construct candidate $k$-itemsets ($C_k$) by joining pairs of
$(k-1)$-itemsets in $L_{k-1}$ (only combining those that share their first $k-2$ items) geeksforgeeks.org .
For each candidate, check the Apriori property: every $(k-1)$-subset must be in $L_{k-1}$.
Prune any candidate violating this.
3. Count support: Rescan the database (or use previous counts) to find support counts of
remaining candidates $C_k$. Keep those meeting min support as $L_k$.
4. Repeat: Continue for $k=2,3,\dots$ until no new frequent itemsets emerge.
As an illustrative example, consider five grocery transactions (items are Bread, Butter, Milk)
with min support = 3 (60% of 5 transactions):
TID Items
T1 {Bread, Butter, Milk}
T2 {Bread, Butter}
T3 {Bread, Milk}
T4 {Butter, Milk}
T5 {Bread, Milk}
$L_1$: Count single items: Bread(4), Milk(4), Butter(3). All are ≥3, so $L_1$ = {Bread, Milk,
Butter}.
$C_2$ generation: Form candidate pairs from $L_1$: {Bread, Milk}, {Bread, Butter},
{Butter, Milk}.
Count 2-item support: Scan transactions: Bread-Milk appears in T1, T3, T5 (3 times);
Bread-Butter in T1, T2 (2 times); Butter-Milk in T1, T4 (2 times). Only {Bread, Milk} has
count 3 ≥ 3. So $L_2 = {{Bread,Milk}}$. The others are pruned.
$C_3$ generation: Attempt to form triples, but only one frequent 2-itemset exists, so no
3-candidates. The algorithm ends.
Once {Butter, Milk} was found to fall below the threshold, the Apriori property guaranteed that
any 3-item superset of it would also fail, so such supersets never had to be counted geeksforgeeks.org . The
frequent itemsets can then be used to generate association rules (next question). In
summary, Apriori performs iterative scans and prunes using subset frequency geeksforgeeks.org
geeksforgeeks.org .
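To make the level-wise procedure concrete, a minimal plain-Python sketch of Apriori on the five transactions above (written for clarity, not efficiency) might look like this:
# Minimal level-wise Apriori sketch in plain Python (toy transactions from the example above).
from itertools import combinations

transactions = [
    {"Bread", "Butter", "Milk"},   # T1
    {"Bread", "Butter"},           # T2
    {"Bread", "Milk"},             # T3
    {"Butter", "Milk"},            # T4
    {"Bread", "Milk"},             # T5
]
min_support = 3                     # 60% of 5 transactions

def support(itemset):
    return sum(itemset <= t for t in transactions)

items = sorted({i for t in transactions for i in t})
L = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]   # L1
frequent = list(L)

k = 2
while L:
    # join step: build candidate k-itemsets from L_{k-1}
    candidates = {a | b for a in L for b in L if len(a | b) == k}
    # prune step: every (k-1)-subset of a candidate must itself be frequent
    candidates = {c for c in candidates
                  if all(frozenset(s) in set(L) for s in combinations(c, k - 1))}
    L = [c for c in candidates if support(c) >= min_support]
    frequent.extend(L)
    k += 1

print([set(f) for f in frequent])   # {Bread}, {Butter}, {Milk}, {Bread, Milk} (order may vary)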
FP-Growth is an alternative frequent-pattern mining method that compresses the dataset into
a compact structure and avoids generating candidate itemsets explicitly geeksforgeeks.org . It
works as follows:
1. Build FP-tree: First, scan the transactions to find frequent 1-items and sort them in
descending frequency. Then rescan transactions, inserting each transaction’s frequent
items (in that order) into a prefix-tree (the FP-tree), incrementing counts on shared
prefixes. This one-pass data compression groups common prefixes.
2. Mine the FP-tree recursively: For each frequent item (starting from the least frequent),
extract its conditional pattern base (paths in the tree that lead to that item). Build a
conditional FP-tree for that item and recursively mine it to find frequent itemsets ending
in that item.
3. Combine patterns: The result is the set of all frequent itemsets.
An example: using the same transactional data from Q3 (five transactions of Bread, Milk,
Butter) with min support 60%, the FP-tree will have a shared branch for {Bread, Milk} since
both appear in most transactions. When mining, FP-Growth quickly derives the frequent sets
{Bread}, {Milk}, {Butter}, and {Bread,Milk} from the tree without generating all candidate pairs
explicitly.
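As a rough sketch of step 1 only (the recursive mining step is omitted), the FP-tree for these transactions can be built in a few lines of plain Python:
# Rough sketch of FP-tree construction; conditional-pattern mining is not shown.
from collections import defaultdict

transactions = [
    {"Bread", "Butter", "Milk"}, {"Bread", "Butter"},
    {"Bread", "Milk"}, {"Butter", "Milk"}, {"Bread", "Milk"},
]
min_support = 3

# Pass 1: count items and keep the frequent ones, sorted by descending count.
counts = defaultdict(int)
for t in transactions:
    for item in t:
        counts[item] += 1
order = [i for i, c in sorted(counts.items(), key=lambda kv: -kv[1]) if c >= min_support]

# Pass 2: insert each transaction's frequent items (in that order) into a prefix tree.
class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

root = Node(None)
for t in transactions:
    node = root
    for item in [i for i in order if i in t]:
        node = node.children.setdefault(item, Node(item))
        node.count += 1

def show(node, depth=0):                      # print the tree, indenting by depth
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

show(root)   # shared prefixes such as Bread -> Milk appear once, with aggregated counts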
Differences from Apriori: FP-Growth is generally faster and more scalable. Apriori must
repeatedly scan the database and generate/prune large candidate sets at each level, which is
costly in time and memory. In contrast, FP-Growth scans the data only twice (once to find
frequent items, once to build the FP-tree) and then mines the compact tree structure
geeksforgeeks.org . It eliminates explicit candidate generation entirely. As a result, FP-Growth
“avoids inefficiencies [of Apriori] by compressing the data into an FP-tree” geeksforgeeks.org and
“substantially reduces the number of candidate itemsets” that must be considered
hanj.cs.illinois.edu . In practice, FP-Growth often outperforms Apriori, especially on large or dense
datasets, because it focuses on the most promising patterns via the tree. (However, if the
data has very long frequent patterns, the FP-tree itself can become large.) In summary, both
algorithms produce the same frequent itemsets, but FP-Growth typically uses less time and
memory by leveraging tree compression and avoiding repeated candidate checks
geeksforgeeks.org hanj.cs.illinois.edu .
Decision tree induction builds a classification tree by recursively splitting the data based on
attribute tests. A popular method is the ID3 algorithm, which uses information gain (entropy)
to select the best attribute at each node geeksforgeeks.org . The process is:
1. Start at root: Compute the entropy (uncertainty) of the target class on the full training
set. For each candidate attribute, calculate the information gain (reduction in entropy) that
splitting the data on that attribute would give (formulas shown after this list) geeksforgeeks.org .
2. Select best attribute: Choose the attribute with highest information gain as the root
node’s test. Create one child branch for each attribute value.
3. Split data and recurse: Partition the training set by each branch. For each subset, if all
instances belong to the same class, make that leaf with the class label. Otherwise, repeat
the process: compute gains among remaining attributes and pick the best split for this
subset geeksforgeeks.org .
4. Stop criteria: Stop when all data in a node are of one class, or when no attributes remain
(in which case use majority class). ID3 thus grows a tree until leaves are pure or stopping
conditions are met geeksforgeeks.org .
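For reference, the quantities used in steps 1–3 are the standard entropy and information-gain definitions: $H(S) = -\sum_i p_i \log_2 p_i$ and $\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$, where $p_i$ is the fraction of examples in $S$ belonging to class $i$ and $S_v$ is the subset of $S$ for which attribute $A$ takes value $v$.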
For example, consider the well-known “PlayTennis” dataset with attributes {Outlook, Humidity,
Wind} and binary class {Play, Don’t Play}. ID3 might compute that Outlook yields the highest
information gain on the root data, so it creates branches for Sunny, Overcast, and Rain. The
subset with Outlook=Overcast may already have all “Play = Yes” instances, making a leaf. The
subset Outlook=Sunny is further split (perhaps by Humidity) since it contains mixed classes.
Eventually a complete tree emerges, in which inner nodes test attributes and leaves give a
class decision.
In summary, decision-tree induction iteratively selects the most informative attribute (by
entropy/gain) to split the data geeksforgeeks.org , partitions the dataset accordingly, and recurses
until pure leaves are obtained geeksforgeeks.org . The result is a tree where each path from root to
leaf is a classification rule. (Optional pruning and handling of numeric attributes can refine the
model, but the basic process is as above.)
SET-2
Task primitives are the components used to define a data mining query or request. They
specify what to mine and how geeksforgeeks.org geeksforgeeks.org . The main primitives include:
Task-relevant data: Which dataset or subset to mine. This includes selecting the
database, tables, attributes, or time period of interest geeksforgeeks.org . For instance, an
analyst might specify mining sales data only for Canadian customers.
Kind of knowledge to be mined: The type of pattern or model desired (e.g. classification,
clustering, association, summarization) geeksforgeeks.org . For example, one might choose
“association rules” mining to find itemsets, or “classification” if predicting a target
attribute.
Background knowledge: Domain information or constraints (like concept hierarchies,
taxonomies) that can guide mining. For example, knowing that “City” rolls up into “State”
might constrain or structure the search. Background knowledge can also include expert
beliefs (used in interestingness measures).
Interestingness measures and thresholds: The metrics that judge whether discovered
patterns are significant (e.g. minimum support/confidence for association rules, or a
coverage threshold) geeksforgeeks.org . They allow the user to demand only “strong” rules or
patterns above a threshold.
Output representation: How results should be presented (e.g. charts, tables, rulesets)
geeksforgeeks.org .
These primitives shape a mining query. For example, a query may say “Find all frequent
itemsets (kind) in the retail transactions table (data) with support ≥ 5% (threshold) and display
them as implication rules (representation)”. Each part of the query corresponds to a primitive
above. Thus, task primitives enable precise, user-driven mining: the user specifies data,
objective, evaluation criteria, and output format geeksforgeeks.org geeksforgeeks.org . They also allow
integration with databases: one can often formulate a DM query in an SQL-like language by
plugging in these primitives. In summary, task primitives act like query clauses that define the
scope (data), goal (type of knowledge), guidance (background/thresholds), and display of the
mining task geeksforgeeks.org geeksforgeeks.org . This modular design makes the mining process
controllable and interpretable.
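As a loose illustration (not any standard API), the primitives of such a query could be collected into a small Python structure and handed to a mining routine:
# Hypothetical structure only; the field names are illustrative, not a real library's API.
mining_query = {
    "task_relevant_data": {"table": "retail_transactions", "filter": "country = 'Canada'"},
    "kind_of_knowledge": "association rules",
    "background_knowledge": ["City rolls up to State"],            # concept hierarchy
    "interestingness_thresholds": {"min_support": 0.05, "min_confidence": 0.7},
    "output_representation": "implication rules",
}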
Data visualization presents the data or discovered knowledge graphically, using charts, plots,
or dashboards. Visual primitives include bar charts, histograms, scatter plots, pie charts, line
graphs, heat maps, etc. Visualization turns abstract data into shapes and colors, which
humans interpret quickly. For example, plotting a decision tree or a cluster scatterplot can
reveal structure at a glance. As noted by experts, “data visualization uses graphs and maps to
present information in a simple, clear manner… it helps spot patterns and trends within large
data quickly” geeksforgeeks.org . Visualization is crucial in knowledge presentation: it makes
complex relationships obvious. A peak in a time-series plot may signal an important event, or
a colored cluster view can show grouping.
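For instance, a minimal matplotlib sketch (with made-up support values) that turns rule supports into a bar chart:
# Tiny matplotlib sketch; the rules and support values are made up for illustration.
import matplotlib.pyplot as plt

rules = ["Bread -> Milk", "A -> C", "Butter -> Bread"]
supports = [0.6, 0.6, 0.4]
plt.bar(rules, supports)
plt.ylabel("Support")
plt.title("Support of discovered rules")
plt.show()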
The Apriori method proceeds in passes. Suppose we have this small transactional database
with min support = 2 (40% of the 5 transactions):
TID Items
1 {A, B, C}
2 {A, B}
3 {A, C}
4 {B, C}
5 {A, C}
Pass 1 (1-itemsets): item counts are A(4), B(3), C(4); all meet min support 2, so $L_1$ = {A}, {B}, {C}.
Pass 2 (2-itemsets): candidates from $L_1$ are {A,B}(2), {A,C}(3), {B,C}(2); all meet min support, so $L_2$ = {A,B}, {A,C}, {B,C}.
Pass 3 (3-itemsets): the only candidate is {A,B,C}, which occurs only in transaction 1 (support 1 < 2), so $L_3$ is empty.
The algorithm stops. The frequent itemsets are: {A}, {B}, {C}, {A,B}, {A,C}, {B,C}.
This matches Apriori’s iterative principle: find frequent 1-items (support ≥2) first, then build
candidate 2-itemsets and test them geeksforgeeks.org . If any candidate’s support were below
threshold, it would prune further supersets. For example, if {B,C} had been infrequent, {A,B,C}
would be pruned without counting. Thus Apriori uses the downward-closure property to
reduce work geeksforgeeks.org . In this example all 2-item candidates survived, but the 3-itemset
failed due to support. The resulting frequent sets can then generate association rules (next
question).
Association rules are generated from the frequent itemsets by partitioning each frequent set
into antecedent $X$ and consequent $Y$. For each rule, we compute the confidence and keep
it only if it meets the minimum confidence threshold geeksforgeeks.org .
Procedure: For each frequent itemset $L$ (with $|L|\ge 2$), do:
For every non-empty proper subset $X\subset L$, let $Y = L\setminus X$.
Compute confidence = support($L$) / support($X$).
If confidence ≥ min_conf, output rule $X \to Y$.
Example: From the earlier frequent sets ${A,B},{A,C},{B,C}$, consider rule $A \to C$ derived
from ${A,C}$. Its confidence = support({A,C})/support({A}) = 3/4 = 0.75. If min_conf were 70%,
this rule passes (0.75 ≥ 0.70). We would include “$A \to C$ (0.75)” as a strong rule. In
contrast, rule $B \to A$ from {A,B} has confidence = support({A,B})/support({B}) = 2/3 ≈ 0.67,
which fails if min_conf=0.7.
This rule-generation method relies directly on the support counts found by Apriori. It
systematically explores each frequent set’s splits. In code or pseudo-code:
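A minimal Python sketch, using the support counts from the example frequent itemsets above:
# Hypothetical sketch of rule generation; the support counts come from the example above.
from itertools import combinations

support = {frozenset(k): v for k, v in {
    ("A",): 4, ("B",): 3, ("C",): 4,
    ("A", "B"): 2, ("A", "C"): 3, ("B", "C"): 2,
}.items()}
min_conf = 0.7

for L, sup_L in support.items():
    if len(L) < 2:
        continue                                   # rules need at least two items
    for r in range(1, len(L)):                     # every non-empty proper subset X of L
        for X in map(frozenset, combinations(L, r)):
            conf = sup_L / support[X]              # confidence = support(L) / support(X)
            if conf >= min_conf:
                print(f"{set(X)} -> {set(L - X)}  (conf = {conf:.2f})")
# Prints A -> C and C -> A at confidence 0.75; B -> A (0.67) is rejected.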
GeeksforGeeks illustrates this: for a support threshold of 50%, Apriori example rules were
Bread→Milk with confidence 75% (accepted) and Butter→Bread with 100% geeksforgeeks.org ,
while lower-confidence rules are discarded. By using the confidence measure, we extract only
those association rules that are deemed strong (above threshold) from the frequent itemsets
geeksforgeeks.org geeksforgeeks.org .
Example training rows (Weather, Humidity, Play?):
Sunny High No
Sunny High No
Rainy High No
Steps:
1. Compute information gain for each attribute: Using entropy on “Play?”, ID3 calculates
which attribute best splits the data. In this example, attribute Outlook/Weather often has
the highest gain, so it is chosen as the root geeksforgeeks.org .
2. Split data: Create a root node testing Weather. It has branches Sunny, Overcast, Rainy.
Partition the training set accordingly. For instance, when Weather=Sunny, remaining data
has mixed Play results.
3. Recurse on subsets: For the Sunny subset, compute gains for remaining attributes
(Humidity). If Humidity is best, split again under the Sunny branch. Continue splitting until
leaves are pure (all one class) or no attributes remain geeksforgeeks.org .
4. Build leaf nodes: If a subset has all “Yes” or all “No”, make a leaf with that class. In our
data, the Overcast branch may have all “Yes”, so that branch directly yields “Play=Yes”.
The resulting tree, when drawn, might have Weather at the root, branches for
Sunny/Overcast/Rainy, with further splits under Sunny (e.g. by Humidity) and Rainy (e.g. by
Humidity) until leaves predict “Yes” or “No”. The figure below (PlayTennis example) illustrates
such a tree:
Root: Weather
Sunny → test Humidity : High→No, Normal→Yes.
Overcast → Yes.
Rainy → test Wind : Strong→No, Weak→Yes.
Visualizing the tree: (The figure above is an example Play-Tennis tree.) Each node’s test and
resulting branches correspond to decisions. This tree fully encodes the classification rules
learned from data.
In summary, training with ID3 involves selecting the highest information-gain attribute at each
node and splitting the dataset geeksforgeeks.org , recursively building the tree until leaves are
homogeneous geeksforgeeks.org . The final decision tree can be drawn graphically, with internal
nodes as attribute tests and leaves as class labels. The diagram depicts one such tree
learned from the example data.
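A small Python helper for the entropy and information-gain computation used in step 1 could look like this; the rows below are made up for illustration and are not the full PlayTennis table:
# Hypothetical sketch: entropy and information gain for one candidate split.
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attribute_values, labels):
    n = len(labels)
    remainder = sum(
        (sum(v == val for v in attribute_values) / n)
        * entropy([l for v, l in zip(attribute_values, labels) if v == val])
        for val in set(attribute_values)
    )
    return entropy(labels) - remainder

weather = ["Sunny", "Sunny", "Rainy", "Overcast", "Rainy", "Overcast"]   # made-up rows
play    = ["No",    "No",    "No",    "Yes",      "Yes",   "Yes"]
print(info_gain(weather, play))   # higher gain means a better split on Weather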
SET-3
Data mining tasks fall into two broad categories: descriptive and predictive. Descriptive tasks
summarize or characterize the properties of the data itself, uncovering new patterns without
necessarily making predictions. Examples include clustering (grouping similar items),
association rule mining (finding frequent item co-occurrences), and summarization (e.g.
OLAP cubes) geeksforgeeks.org . For instance, clustering might reveal customer segments, and
association rules might show that “70% of people who buy bread also buy milk.” Such
patterns describe what is in the data.
Predictive tasks, on the other hand, involve building a model that predicts unknown outcomes
for new data. Classic examples are classification and regression. In classification (e.g.
decision trees, KNN, neural networks), the goal is to predict a categorical label (spam vs. not
spam, or disease vs. healthy) for unseen cases. In regression, the target is numeric. These
models are evaluated on prediction accuracy rather than just pattern discovery. For example,
a decision tree trained on past loan applications to predict “approve” or “deny” is a predictive
task geeksforgeeks.org .
Key differences include: descriptive mining characterizes the data at hand (no target attribute
is required, and results are judged by their interestingness), whereas predictive mining learns
a model for a designated target attribute and is judged by its accuracy on unseen data.
Examples: A supermarket might use descriptive mining to find that “20% of transactions
include bread and milk” (association rule), or to segment shoppers into clusters (e.g. budget
shoppers vs. premium shoppers). That same supermarket might use predictive mining to
build a model that predicts whether a new customer will make a purchase (“will buy or not
buy”) based on their demographics (classification) geeksforgeeks.org geeksforgeeks.org .
In practice, many systems use both: first describing the data, then using those insights to
inform predictive modeling. But fundamentally, descriptive and predictive mining address
different questions – one explains the data we have, the other predicts unknown data yet to
come geeksforgeeks.org geeksforgeeks.org .
Output representation primitives specify how mined results are presented to the user. They
include charts, graphs, tables, and specialized structures that make patterns easy to interpret.
Common primitives are:
Charts/plots: e.g. bar charts, pie charts, line graphs, scatter plots, heatmaps. These
visualize numeric summaries or distributions. For instance, plotting the support of
association rules can highlight which rules are strongest.
Tables and lists: e.g. listing association rules or cluster centroids in a table. A sorted
table of rules by confidence lets analysts review top rules.
Decision trees/flowcharts: For classification, a tree diagram is an intuitive
representation.
Graphs and networks: For relationships (e.g. social networks or semantic graphs), nodes
and edges diagrams can be used.
Data cubes and pivot tables: Summarized data in a multi-dimensional grid format (e.g.
average sales by region and quarter).
These primitives are chosen based on the task: e.g. an output primitive for association
rules is often a list of “if-then” rules with their support/confidence. For clustering, output
primitives might include lists of members per cluster and feature bar charts of cluster
profiles.
These representation choices greatly aid interpretation. As one source notes, visualizing
patterns (using “charts, graphs, and maps”) helps present discovered patterns “in a way that
is easy to understand and interpret” geeksforgeeks.org . For example, showing a line chart of sales
over time (from a time-series summarization) immediately reveals trends or seasonality. A
decision tree diagram makes the logic of classification transparent: one glance shows which
attribute splits are most important. Graphical displays leverage human visual perception: we
can quickly spot outliers in a scatterplot or trending behavior in a histogram.
In short, output primitives turn abstract results into concrete visual or tabular forms,
facilitating insight. They transform mining outputs into user-friendly knowledge. By choosing
an appropriate primitive (e.g. bar chart for frequency, tree for classification rules, scatterplot
for clusters), analysts can understand and communicate the mined knowledge effectively
geeksforgeeks.org . Thus, output representation is a crucial step in making data mining results
actionable.
In standard Apriori, generating and testing all candidate itemsets $C_k$ can be expensive,
especially for $k=2$ where the number of pairs grows quickly. A hash-based technique is a
known optimization to prune many candidates early hanj.cs.illinois.edu . The idea (Park–Chen–Yu
technique) is:
During the scan to get $L_1$, also hash every 2-itemset in each transaction into a hash
table of buckets. Each bucket count is incremented for each occurrence of any pair
mapping to that bucket.
After the scan, any bucket whose count is below the support threshold cannot contain
any frequent 2-itemsets. All candidate 2-itemsets that hashed into those low-count
buckets are pruned from $C_2$.
This drastically reduces the number of 2-item candidates to consider. For example, Figure 6.5
in a textbook shows a hash table where buckets 0,1,3,4 had counts below the threshold; thus
all item pairs in those buckets are eliminated hanj.cs.illinois.edu . Only pairs in remaining buckets
(2,5,6) survive.
Example: Suppose transactions generate 10 possible pairs, but hashing shows only 4
buckets meet support. Instead of testing all 10 pairs, the algorithm only tests ~4, saving
effort.
In summary, the hash-based Apriori reduces candidate explosion by filtering using bucket
counts hanj.cs.illinois.edu . It improves efficiency: less memory is needed to store $C_2$, and fewer
scans of the data are required. Overall mining runs faster. The drawback is a bit of extra
computation to build the hash table, but this is usually small compared to the savings in
candidate processing hanj.cs.illinois.edu .
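A rough sketch of the bucket-counting idea on the toy transactions used earlier (Python’s built-in hash stands in for the hash function, so bucket assignments vary between runs):
# Rough PCY-style sketch: hash 2-itemsets into buckets during the first scan, then prune.
from itertools import combinations

transactions = [
    {"Bread", "Butter", "Milk"}, {"Bread", "Butter"},
    {"Bread", "Milk"}, {"Butter", "Milk"}, {"Bread", "Milk"},
]
min_support = 3
num_buckets = 7                      # deliberately small table for illustration

buckets = [0] * num_buckets
for t in transactions:               # same scan that counts 1-items
    for pair in combinations(sorted(t), 2):
        buckets[hash(pair) % num_buckets] += 1

# Keep a candidate pair only if its bucket count reaches the support threshold.
candidates = {
    pair
    for t in transactions
    for pair in combinations(sorted(t), 2)
    if buckets[hash(pair) % num_buckets] >= min_support
}
print(candidates)                    # pairs hashing to low-count buckets never reach C2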
Using the same sample dataset as before, FP-Growth would find the identical frequent
itemsets as Apriori (e.g. in our example, {Bread}, {Milk}, {Butter}, {Bread,Milk}). The difference
lies in efficiency: FP-Growth typically mines faster and uses memory differently.
Time complexity: Apriori makes one data scan per level of itemset size. If the maximum
frequent itemset has size 4, Apriori scans the DB 4 times and generates many intermediate
candidates (which are tested). FP-Growth, by contrast, only needs two full scans – one to get
$L_1$ and one to build the FP-tree – and then works on the compact tree with recursive
processing. As a result, FP-Growth often runs much faster. In general, FP-Growth eliminates
the repeated scans and costly joins of Apriori geeksforgeeks.org . It “avoids inefficiencies … by
compressing the data into an FP-tree” geeksforgeeks.org . In practice, benchmarks show FP-Growth
outperforming Apriori especially on large or dense data.
Space complexity: Apriori must explicitly store all candidate itemsets in memory while
scanning, which can become huge for high-dimensional data. FP-Growth instead stores a
prefix-tree (FP-tree) that compresses common prefixes of transactions. If many transactions
share items, the FP-tree is much smaller than the raw data, reducing memory usage.
However, if data have little overlap, the tree might not compress well. Generally, FP-Growth
uses space for the tree structure and “node links” but does not store large candidate lists.
Thus FP-Growth often uses less memory.
In summary, FP-Growth is usually both faster and more space-efficient. It removes the need
to generate and test thousands of candidate sets, as Apriori does geeksforgeeks.org . This leads to
lower runtime and memory use for most real-world datasets. However, one should note that
Apriori’s simpler approach may use less memory on extremely sparse data, and FP-tree
construction can be an overhead if the tree itself grows large. Overall, FP-Growth’s tree-
compression gives it a significant advantage in efficiency geeksforgeeks.org hanj.cs.illinois.edu .
A multilayer feedforward neural network (MLP) consists of an input layer, one or more hidden
layers, and an output layer, where each layer’s neurons are fully connected to the next (no
cycles) geeksforgeeks.org . Each input neuron represents a feature of the data; hidden layers detect
higher-level patterns by weighted summation and activation; the output neurons produce the
final results (e.g. class scores or numeric predictions). The figure below shows a simple MLP
with one hidden layer:
Figure: Example MLP architecture (input layer, hidden layer, output layer)
Training uses backpropagation to adjust the weights of all connections so that the network’s
predictions match the targets. In a forward pass, input values are propagated through the
network to compute an output. We then compute the error (difference between predicted and
actual output). Backpropagation then propagates this error backward through the network,
layer by layer, computing gradients of the loss with respect to each weight geeksforgeeks.org .
Mathematically, it uses the chain rule of calculus to find how changing each weight would
change the error.
Each weight is then updated by gradient descent, $w \leftarrow w - \eta \frac{\partial E}{\partial w}$,
where $\eta$ is the learning rate. This process (compute output, calculate error, propagate error
backward, update weights) is repeated iteratively (over many epochs) geeksforgeeks.org . Over time,
the weights converge to values that minimize the overall error.
Key points: Backpropagation efficiently updates all weights in a deep network by reusing
partial computations (the gradient) geeksforgeeks.org . It “computes the gradient of the loss with
respect to each weight using the chain rule” geeksforgeeks.org , making training of multi-layer
networks feasible. With each weight update, the network’s predictions become more
accurate. In summary, the MLP architecture learns by iterating forward passes and backward
gradient updates until it effectively models the data patterns.
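A compact numpy sketch of one hidden layer trained with these forward/backward passes (toy data, sigmoid activations, and a squared-error loss are assumed):
# Minimal numpy sketch: forward pass, backpropagated gradients, and gradient-descent updates.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((4, 3))                        # 4 samples, 3 input features (toy data)
y = np.array([[0.0], [1.0], [1.0], [0.0]])    # target outputs

W1, b1 = rng.random((3, 5)), np.zeros(5)      # input -> hidden (5 units)
W2, b2 = rng.random((5, 1)), np.zeros(1)      # hidden -> output
eta = 0.1                                     # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(1000):
    h = sigmoid(X @ W1 + b1)                  # forward pass: hidden activations
    out = sigmoid(h @ W2 + b2)                # forward pass: network output
    d_out = (out - y) * out * (1 - out)       # error term at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)        # error propagated back to the hidden layer
    W2 -= eta * (h.T @ d_out); b2 -= eta * d_out.sum(axis=0)   # w <- w - eta * dE/dw
    W1 -= eta * (X.T @ d_h);   b1 -= eta * d_h.sum(axis=0)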
SET-4
Outliers are extreme or erroneous values that fall far from the rest of the data. A single outlier
can distort means and variances, obscure relationships, and suggest misleading patterns. For example, an
incorrectly entered income of $1 billion (instead of $100k) would drastically increase the
average and weaken correlations. In machine learning, outliers can “skew [the] analysis” and
cause models to overfit to the anomalies medium.com . They can also trigger false pattern
discoveries if not identified.
Preprocessing steps such as imputation, outlier removal or capping, and transformation
“clean” the data so that subsequent mining is reliable. For example, imputing missing values
lets clustering use more of the records, and trimming outliers prevents spurious rules.
Effective preprocessing thus mitigates the impact:
it yields models “that produce accurate and unbiased results” geeksforgeeks.org medium.com . In
summary, we analyze missingness and outlier patterns, then apply appropriate fixes
(imputation, removal, transformation) to ensure high-quality input for mining.
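For instance, a simple way to flag an extreme value like the mistyped income is the common 1.5 × IQR rule (the numbers below are made up):
# Small sketch: flag outliers with the 1.5 * IQR rule (a convention, not a fixed law).
import numpy as np

incomes = np.array([95_000, 100_000, 102_000, 98_000, 1_000_000_000])   # made-up values
q1, q3 = np.percentile(incomes, [25, 75])
iqr = q3 - q1
is_outlier = (incomes < q1 - 1.5 * iqr) | (incomes > q3 + 1.5 * iqr)
print(incomes[is_outlier])    # the billion-dollar entry is flagged for review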
For example, consider association mining on retail data. A novice user might find “beer →
diapers” interesting (it’s a classic surprising rule), while a beer-and-diapers shop owner might
think it obvious. Their domain background determines which rules are highlighted. Similarly, a
financial analyst might only consider patterns involving certain known economic indicators as
relevant.
Thus, domain knowledge guides pattern selection and interpretation. It determines the
search space (via constraints), and how we rank or filter results (via subjective
interestingness). Patterns consistent with known beliefs may be down-weighted, while
anomalies gain attention. In practice, miners allow incorporating such knowledge so that the
final output reflects what the user truly cares about kamaleshvcet.files.wordpress.com . In short, the user’s
domain knowledge and beliefs inform which patterns are tagged “interesting” and how they
are explained.
Apriori and FP-Growth both find the same frequent itemsets, but their time/space profiles
differ greatly.
Apriori: This algorithm generates candidate itemsets level by level and scans the
database multiple times. Its time complexity can be very high when the number of
candidates is large. In the worst case (dense data), the number of candidate $k$-
itemsets is $\binom{n}{k}$, leading to exponential work. Memory-wise, Apriori must store
all candidate sets $C_k$ at each level, which can be huge. Each database scan has to
count every candidate’s support. Thus Apriori’s runtime grows rapidly with more data and
lower thresholds. Its space complexity grows with the number of candidates.
FP-Growth: FP-Growth constructs an FP-tree in (roughly) two scans of the data and then
mines it recursively. Time complexity is typically much lower in practice: after building the
tree, mining it avoids generating all candidate sets. As the GFG article notes, FP-Growth
“avoids inefficiencies” of Apriori such as “multiple scans” and large candidate sets
geeksforgeeks.org . In many cases, FP-Growth takes orders of magnitude less time. Space
complexity is different: FP-Growth must store the entire FP-tree (which encodes all
transactions). The tree may be smaller than the raw data if there is redundancy (many
shared prefixes). In sparse datasets, the tree is compact; in worst-case dense data, the
tree may still be large. However, even then, FP-Growth avoids storing all candidates.
Overall, FP-Growth often uses less memory than Apriori because it compresses the data
into the tree and does not keep large candidate lists.
In summary, FP-Growth typically outperforms Apriori in both time and space. It eliminates
the costly candidate generation phase, greatly reducing runtime geeksforgeeks.org . It also reduces
memory needs by compressing common itemsets into one tree structure, whereas Apriori
must enumerate each candidate set. For example, FP-Growth can mine frequent patterns
from a huge database much faster than Apriori by using this tree-based approach
geeksforgeeks.org hanj.cs.illinois.edu . The trade-off is the overhead of building the tree and maintaining
links, but in practice FP-Growth’s efficiencies make it superior for most large-scale mining
tasks.
Level 1 (1-itemsets): Count each item over the five transactions (min support 3, as in the
earlier example): Bread(4), Milk(4), Butter(3). All qualify, so $L_1$ = {Bread}, {Milk}, {Butter}.
Level 2 (2-itemsets): Candidate pairs are {Bread,Milk}(3), {Bread,Butter}(2), {Butter,Milk}(2);
only {Bread,Milk} meets min support, so $L_2$ = {{Bread,Milk}}.
Level 3 (3-itemsets):
Generate $C_3$ by joining $L_2$ with itself. Only one possible: {Bread,Butter,Milk}.
Check if all its 2-subsets are frequent: {Bread,Butter} and {Butter,Milk} were pruned, so by
the Apriori property we discard {Bread,Butter,Milk} without counting.
Since $L_3$ is empty, the algorithm stops. Frequent itemsets found: all of $L_1$ plus $L_2$:
{Bread}, {Butter}, {Milk}, {Bread,Milk}.
This demonstrates the level-by-level pruning: at each pass $k$, Apriori uses $L_{k-1}$ to form
candidates $C_k$ and prunes any candidates with infrequent subsets geeksforgeeks.org . In the
example, after level 2 we saw that no 3-itemset could be frequent because its subsets weren’t
in $L_2$. At each level, the database is re-scanned (except with hash or transaction
reductions if applied), but by then the candidate list is much smaller. The algorithm returns all
frequent itemsets with support above threshold. Notably, any frequent itemset of size $k$
must have all its $(k-1)$-subsets in $L_{k-1}$, which is the core pruning condition
geeksforgeeks.org .
The k-Nearest Neighbors (k-NN) algorithm classifies each new instance by majority vote of
its $k$ nearest labeled neighbors in feature space. There is no formal “training” phase besides
storing the training data. To illustrate, consider a small 2D dataset:
# Example in Python
X_train = [[1,2], [2,1], [1.5,1.8], # Class 0
[5,6], [6,5], [5,5.5]] # Class 1
y_train = [0, 0, 0, 1, 1, 1]
X_test = [[1,1], [6,6], [3,3]]
y_test = [0, 1, 0] # actual classes
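One way to run this end to end is sketched below, assuming scikit-learn is available; note that an actual run may classify the middle-distance test point differently from the illustrative y_pred discussed next.
# Sketch assuming scikit-learn; continues from the arrays above.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)             # "training" just stores the labeled points
y_pred = knn.predict(X_test)          # majority vote of the 3 nearest neighbors
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))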
If y_pred = [0, 1, 1] vs y_test = [0, 1, 0], then the first point (actual 0, predicted 0) and the
second point (actual 1, predicted 1) are classified correctly, while the third point is
misclassified: actual class 0 but predicted 1 (a false positive for class 1). The confusion
matrix, with rows as actual classes and columns as predicted classes, is:
Actual \ Pred 0 1
0 1 1
1 0 1
(first row: of the two actual 0’s, one is predicted correctly and one is misclassified as 1;
second row: the single actual 1 is predicted correctly.) This matrix allows calculation of
metrics; here accuracy = 2/3.
Thus, by training the k-NN classifier on the training set and applying it to test data, we can
produce a confusion matrix summarizing true vs. predicted classes en.wikipedia.org . The matrix
helps diagnose classifier performance (identifying which types of errors occur). Accuracy is a
simple summary of the diagonal of this matrix (the fraction correct). Other metrics (precision,
recall) can also be derived.
We fit the model with training examples (though fitting just stores data).
We predict on test examples and compute a confusion matrix en.wikipedia.org .
We then compute accuracy = (number correct)/(total).
This approach provides a clear quantitative measure of the classifier’s performance. The
confusion matrix in particular “visualizes performance” and shows if classes are being
confused en.wikipedia.org . By using accuracy and the confusion matrix, we get a full picture of
how well the k-NN model is doing on the dataset.
Citations
http://hanj.cs.illinois.edu/cs412/bk3/06.pdf
https://kamaleshvcet.files.wordpress.com/2017/10/unit-iii1.pdf