Enhanced FP-Growth Framework and Apriori Algorithm Utilizing TDA for Big Data Analysis
Abstract: Mining big data requires advanced computational techniques and algorithms to analyze huge datasets efficiently. Apriori and FP-Growth are two of the most well-known algorithms in data mining. They help businesses make decisions based on customer trends and behaviors by finding patterns and correlations, and machine learning has made these algorithms even more accurate and efficient. The association rule approach does have some problems, though: it needs a lot of memory, it has to scan the whole dataset to find the frequency of an itemset, and it sometimes produces rules that are not optimal. This study conducts a comparative analysis of the FP-Growth, Apriori, and TDA algorithms, demonstrating notable performance differences. The FP-Growth algorithm handled large datasets much better than the Apriori method, which had scalability problems and took longer to process larger datasets, even though it was easier to implement. This study suggests changes to the FP-Growth algorithm to fix these problems: it uses the TDA matrix to build a very compact FP-tree. This method aims to cut down on mining time and the number of generated itemsets, which makes memory use more efficient and speeds up processing for large datasets. In short, the proposed method is a promising way to make data mining processes more efficient and scalable, especially for big data analytics.
How to Cite: Abdulkader Mohammed Abdulla Al-Badani (2025) Enhanced FP-Growth Framework and Apriori Algorithm Utilizing
TDA for Big Data Analysis. International Journal of Innovative Science and Research Technology, 10(12), 919-928.
[Link]
I. INTRODUCTION

In today's world of information technology, the rapid increase in data creation means that we need better ways to collect and store data. This change has greatly improved the ability to collect and store huge amounts of data [1], which makes it easier to analyze all of that data and lets businesses get useful information from it. Being able to handle and understand large datasets is now an important part of making decisions in many fields. Association rule mining is one of the many methods that have been created to deal with these problems, and it is a key way to find patterns in large datasets.

Association rule mining, especially the Apriori algorithm, has been very helpful in finding frequent itemsets and making Boolean association rules [2]. Since it was first created, the Apriori algorithm has been improved many times, making association analysis faster and more accurate. Even with these improvements, the algorithm's dependence on candidate generation and multiple database scans makes it hard to work with large datasets quickly. To deal with these problems, newer methods such as the FP-Growth algorithm have emerged. The FP-Growth algorithm uses a tree structure to build itemsets, which cuts down on the number of times the database needs to be scanned for association analysis.

Even though the FP-Growth algorithm has come a long way, we still need to improve data mining methods so that we can get more information from large datasets. Current research is still looking into new ways to make data mining processes more efficient, with the goal of getting around current problems and making them work better. This research aims to augment the existing discourse by exploring innovative methodologies that improve the efficiency and efficacy of association rule mining [3]. It seeks to enhance the field of data mining by filling existing knowledge gaps and offering valuable insights for businesses aiming to utilize large datasets for strategic decision-making.

The field of data mining has become more and more important for getting useful information from large datasets [4], which helps people make decisions in many different areas. Data mining involves a number of important steps, including preparing the data, choosing the right methods, putting them into action, looking at the results, and figuring out what they mean. Every step is important for making sure that the results are correct and useful.
In [22], a better FP-Growth method for mining description-based rules is introduced. The authors made a unique change to how gene groups are described using the Gene Ontology (GO) FP-Growth algorithm, and the results show that the new method generates rules faster. Reference [23] introduces a new way to mine association rules using FP-linked lists: a frequent pattern mining approach, based on the FP-Growth idea, that uses a linked list structure and a bit matrix to find patterns. This method makes mining more efficient by using less memory and speeding up the process of finding patterns.

In [24], efficient methods for finding frequent itemsets with data mining are presented. These methods are based on the frequent pattern growth approach and are meant to preserve privacy, usefulness, and speed in mining frequent itemsets. In [25], a better way to mine frequent itemsets is proposed: a more effective, non-recursive FPNR-growth method that boosts performance in terms of both space and time complexity. This new method closes the gap between theoretical research and real-world use by lowering the computing overhead and making sure that the patterns mined are useful and relevant to real-world situations. By focusing on these improvements, the method makes frequent itemset mining much more scalable, which is useful for the large datasets that are common in fields like banking and retail. As a result, practitioners can get useful information faster, which leads to better decision-making in the long run.

The literature on association rule mining has changed a lot over time. FP-Growth trees have become a popular method because they work well with large datasets. A significant study [26] presents a novel method that eliminates the necessity for building conditional FP-trees, thereby improving the efficiency of frequent itemset mining. This improvement solves a major problem with traditional FP-Growth methods, which often require a lot of extra computing power because conditional tree construction is recursive. By removing this step, the suggested method not only makes mining easier but also makes it useful in more areas that need real-time data analysis.

The study further clarifies the benefits and limitations of FP-Growth trees in association rule mining. It shows how well the method can handle large datasets, which is very important in areas like market basket analysis, bioinformatics, and network traffic analysis. The decrease in computational overhead is especially important because it makes this method a good choice for real-time applications where speed and efficiency matter most.

A major part of this research is the creation of a new algorithm that uses the TDA (Two-Dimensional Array) structure. This algorithm solves difficult optimization problems and makes FP-tree-based algorithms work better, achieving better accuracy and speed than earlier methods. The fact that it performs consistently well across different datasets shows how robust it is and how it could change the way such optimization works. The algorithm's capacity to provide expedited and accurate solutions to complex issues indicates favorable prospects for its utilization in optimization contexts beyond those initially investigated in the research.

III. APRIORI ALGORITHM

The method described in this study successfully identifies subsets that are shared by only a few item sets, showing that it can be used in many different fields. The method consistently produces accurate and meaningful results by using frequent pattern mining techniques based on support and confidence metrics. The research results show that the method can successfully find patterns in data that are both common and important. This ability to surface important patterns makes it easier to make decisions in many situations, such as recommendation systems and market basket analysis, and it shows how useful and flexible the approach is in practice.

In other words, the difference in execution time and magnitude of results between the two databases points out how much the number of attributes and the general operation complexity affect the performance of the algorithm. The underlying message is that, while constructing the algorithm, the database architecture should be understood first, because it determines how efficiently and effectively the data can be processed.
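To make the support-based mechanics above concrete, the following is a minimal, textbook-style Python sketch of Apriori's level-wise candidate generation and counting. It is an illustration rather than the authors' implementation: the function name apriori_frequent and the toy transactions are our own, and minsup is given as an absolute count.

def apriori_frequent(transactions, minsup):
    # Classic Apriori: level-wise candidate generation followed by a
    # full database scan per level to count candidate supports.
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items in one scan of the database.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup}
    result = dict(frequent)
    k = 2
    while frequent:
        # Join step: merge (k-1)-itemsets that differ in one item.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Count step: one more full scan of the database per level --
        # the repeated-scan cost that FP-Growth is designed to avoid.
        level = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: n for s, n in level.items() if n >= minsup}
        result.update(frequent)
        k += 1
    return result

# Toy run on the nine transactions used later in Table 3 (minsup = 2).
db = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
      ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
      ["I1", "I2", "I3"]]
print(apriori_frequent(db, minsup=2))

Each pass over the transaction list in the count step is one full database scan, which is exactly the cost that motivates the FP-Growth algorithm discussed next.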
IV. FP-GROWTH ALGORITHM

The FP-Growth algorithm is an important tool for data mining, especially for quickly finding frequent itemsets without having to create candidates. The algorithm has two main steps: building the FP-tree and recursing over the FP-tree to find frequent itemsets. To start, the FP-tree is built by scanning the dataset to find the feature items. These items are then put in the first column of the header table in order of their support, from highest to lowest. This structure makes sure that the most important items come first, which makes traversal easier. At the same time, the second column holds a chain table that links nodes of the same item in the FP-tree. This keeps the structure consistent, which is important for recursive mining. During the mining phase, the algorithm goes through the FP-tree in a systematic way to find frequent itemsets, using conditional FP-trees to narrow down and isolate patterns. This two-phase method not only makes computations more efficient, but it also avoids much of the complexity of traditional candidate generation. Because of this, the FP-Growth algorithm is a great way to find patterns in large datasets: it is both fast and accurate for frequent itemset mining.

Below is the pseudo-code for the FP-Growth algorithm in a transaction database [27]:

Input: Dataset D; support threshold min_sup
Output: an FP-tree

1. Traverse the dataset once and count the support of each feature item. Sort the items in descending order of support and use min_sup to filter out the infrequent ones; the result is the frequent 1-itemset L1.
2. Create the root node of the FP-tree T and set its content to "null." Build a table of the frequent items and leave their node links empty.
3. Traverse the dataset a second time. For each transaction in D:
   o Filter the items in the transaction based on L1, sort them according to the feature item order in L1, and record the result as P.
   o Insert P into the tree T.
   o Update the relevant node links in the table of frequent items.

There is a header table that the FP-tree is linked to. The header table sorts single items and their counts by how often they appear, from most to least. Table 1 is a transactional dataset, and Figure 1 shows the FP-tree that the FP-Growth algorithm built from this data. Every node in the FP-tree shows an item and how many times it appears. The tree structure makes it easy to mine frequent itemsets without generating candidate sets.
Fig 1: The FP-Tree with its Header Table, Built from the Transactional Dataset in Table 1.
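To make the two passes above concrete, here is a short, self-contained Python sketch of the FP-tree construction phase. It is a standard reconstruction written for illustration, not the paper's code; the Node class, the dictionary-based header table, and all names are our own choices.

from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 1
        self.children = {}   # item -> child Node

def build_fp_tree(transactions, min_sup):
    # Pass 1: count supports and build the frequent 1-itemset L1,
    # sorted in descending order of support.
    support = Counter(item for t in transactions for item in t)
    L1 = [i for i, c in support.most_common() if c >= min_sup]
    order = {item: rank for rank, item in enumerate(L1)}
    root = Node(None, None)              # root labelled "null"
    header = {item: [] for item in L1}   # header table: item -> node links
    # Pass 2: filter and sort each transaction by L1 order, then
    # insert the resulting list P into the tree.
    for t in transactions:
        P = sorted((i for i in t if i in order), key=order.get)
        node = root
        for item in P:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)   # update the node link
            else:
                child.count += 1
            node = child
    return root, header

On the nine transactions shown later in Table 3, the shared prefix I2 -> I1 lets four transactions reuse a single branch, and the header table's node links give the recursive mining phase direct access to every occurrence of an item.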
V. THE PROPOSED ALGORITHM

The Two-Dimensional Array (TDA) is an important tool for summarizing transactional databases because it organizes all of the frequent itemsets in a way that makes sense. The TDA is organized in a way that makes it easier to find support. Its dimensions are N×M, where N is the number of transactions and M is the maximum number of frequently ordered items. Each cell in this array holds the support value for the itemsets that go with it. This makes it easy and quick to get important information about frequently used itemsets from the database. This systematic arrangement not only makes it easier to get to the data, but it also makes it easier to analyze transactional patterns.

Table 3 shows how each Ordered Frequent Itemsets List (OFIL) is processed iteratively to build the TDA. At first, the TDA matrix is filled with "0" values, which sets a baseline for adding more data later. During each iteration, items from the OFIL lists are taken out and added to the matrix, making the representation of transactional data more accurate. After this process is done, Table 3 gives a full picture of the finished TDA, showing how well it works to summarize and present transactional insights. This methodical approach shows how useful the TDA is for making data analysis in transactional databases more accurate and efficient.

Table 3. The TDA.
T1   I2   I1   I5   0
T2   I2   I4   0    0
T3   I2   I3   0    0
T4   I2   I1   I4   0
T5   I1   I3   0    0
T6   I2   I3   0    0
T7   I1   I3   0    0
T8   I2   I1   I3   I5
T9   I2   I1   I3   0
The FP-Growth algorithm works well with small datasets, but it has big problems when used with large ones, because it needs a lot of memory and takes a long time to build an FP-tree and find frequent itemsets. The FP-tree may grow too big for the main memory, making the method unusable for analyzing large amounts of data. The One-Itemset-at-a-Time Mining (TDA) method, along with a minimum support (minsup) threshold, is a good way to deal with these problems. This method uses a two-dimensional array that is updated in real time with new itemsets, which solves the memory problems that come with the traditional FP-Growth method. The TDA method improves memory management and scalability by focusing on one itemset at a time. This makes it possible to process large datasets efficiently without running out of memory. The TDA approach is a big step forward in the field of data mining because it gives you a strong way to work with large amounts of data.

The suggested method starts scanning the dataset from the last column and uses the TDA to figure out how much support each item in each column gets. It then skips groups of items that don't get enough support. The system improves efficiency by cutting down the search space through the removal of infrequent itemsets. This makes it easier to quickly find frequent itemsets in big datasets by focusing on the most important elements and getting rid of those with little support. So, this method not only simplifies the computation, but it also makes sure that the analysis is accurate and meaningful when dealing with large amounts of data. The last column will show the frequencies of the previous itemsets and those of similar records, which will help find strong correlations and trends. Such insights are useful for making smart choices because they are more likely to lead to real-world results. The method lets you get useful information by focusing on high-frequency itemsets, which are then taken out of the TDA. This speeds up the process of finding patterns in the data, which cuts down on the time and effort needed to make frequent itemsets and makes data analysis easier and more efficient overall.

By getting rid of rare itemsets early on, the TDA algorithm becomes more efficient because it focuses on the most important and relevant patterns, which leads to better data mining results. This method works much better than the FP-Growth method. The TDA algorithm speeds up the mining process by quickly getting rid of non-frequent itemsets, making it easier and more accurate to find patterns in the data. So, the algorithm not only speeds up the mining process but also makes the results more accurate, which means that large datasets can give researchers and practitioners more useful information. This enhancement aids strategic planning and decision-making by diminishing processing time and computational burden while producing more frequent itemsets. The TDA algorithm works better than FP-Growth, which means it can analyze data faster and handle bigger datasets without losing quality. This lets businesses make quick, smart choices that lead to more efficiency and new ideas in many fields. The algorithm's streamlined approach also makes it easier to work with large datasets, and its scalability makes it even better for frequent itemset mining tasks. The next sections give a full explanation of the proposed algorithm, including its inputs, outputs, and steps:

Input
• Transaction database (DB)
• Minimum support threshold (minsup)

Output
• Identified frequent (recurring) itemsets

Step 1: Database Scanning and Frequent Item Preparation
1. Scan the entire database.
2. Identify:
   o The set F of all items.
   o The supporters (transactions) for each item in F.
   o The frequent items (those meeting minsup).
3. Sort F in descending order of frequency to form the OFIL (Ordered Frequent Item List).
4. Remove all infrequent items.

This step ensures that only the most relevant, high-support items remain, improving both computational efficiency and the overall accuracy of later itemset mining processes.

Step 2: Construction of the TDA
1. For every transaction row that corresponds to the OFIL structure:
   o Insert each frequently occurring item into the appropriate column, following the sorted order in OFIL.
2. The resulting matrix:
   o Highlights usage/consumption patterns.
   o Provides a clear view of item availability and inventory status.
   o Helps determine which items may need restocking.

Maintaining the OFIL ensures a clean, organized data structure for pattern mining.
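As a concrete reading of Steps 1 and 2, the sketch below derives the OFIL and fills the TDA rows for the nine-transaction example of Table 3. It is a simplified illustration under assumed naming (build_ofil_and_tda is hypothetical), with minsup again given as an absolute count.

from collections import Counter

def build_ofil_and_tda(db, minsup):
    # Step 1: scan the database once, count each item's supporters,
    # drop infrequent items, and sort the rest by descending
    # frequency to form the OFIL.
    support = Counter(i for t in db for i in t)
    ofil = [i for i, c in support.most_common() if c >= minsup]
    rank = {i: r for r, i in enumerate(ofil)}
    # Step 2: one TDA row per transaction; frequent items are placed
    # in OFIL order and the remaining cells are padded with 0.
    width = max(sum(1 for i in t if i in rank) for t in db)
    tda = []
    for t in db:
        row = sorted((i for i in t if i in rank), key=rank.get)
        tda.append(row + [0] * (width - len(row)))
    return ofil, tda

db = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
      ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
      ["I1", "I2", "I3"]]
ofil, tda = build_ofil_and_tda(db, minsup=2)
# ofil begins with I2, I1, I3 (the most frequent items) and the rows
# of tda reproduce Table 3, e.g. tda[0] == ["I2", "I1", "I5", 0].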
Step 3: Frequent Itemset Generation
Let c be the number of columns in the TDA.

Initialize
• Set c = M, where M is the total number of columns in the TDA.

Process Columns in Reverse Order
For each column from c = M down to 1:

Case A: When c = 1
1. Compare frequent items in the current column with those in previous columns.
2. Compile their supporters.
3. Represent results as [r, f : n | OFIL], where:
   o r = parent frequent item from earlier columns.
   o f = frequent item in the current column.
   o n = support count.
4. Retrieve the rows of item f from the supporters of column 1.
5. Retrieve the corresponding rows of item r.
6. Remove these extracted rows from the TDA to eliminate duplicates and maintain accuracy.

This step clarifies the relationships between items and helps identify patterns in early-column frequent items.

Case B: When c > 1
1. Move to column c and compare its frequent items with those in earlier columns.
2. Compile supporters for each frequent item.
3. Represent results again as [r, f : n | OFIL].
4. Extract rows linked to the repeating parent items.
5. Process the repeating item f according to its order.
6. Remove the extracted rows from the TDA.
7. Verify that the remaining TDA structure is consistent and matches the master file.

This makes sure that the generation of frequent itemsets happens in an orderly way and that the relationships between items are correctly represented.
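Step 3 is the least conventional part of the procedure, so the following Python sketch should be read as a simplified interpretation rather than a faithful transcription of Cases A and B: it walks the TDA columns from the last down to the second (single-item counts are already known from the OFIL), groups identical row prefixes, records those that meet minsup, and removes their supporter rows from the TDA. The name mine_tda and the prefix-only grouping are our assumptions.

from collections import Counter

def mine_tda(tda, minsup):
    # Keep only the non-zero items of each TDA row.
    rows = [tuple(i for i in row if i != 0) for row in tda]
    frequent = {}
    width = max(len(r) for r in rows)
    # Process columns in reverse order, from c = M down to c = 2.
    for c in range(width, 1, -1):
        counts = Counter(r[:c] for r in rows if len(r) >= c)
        for prefix, n in counts.items():
            if n >= minsup:
                frequent[prefix] = n  # an [r, f : n] style result
                # Remove the supporter rows so later, shorter
                # prefixes are not counted twice.
                rows = [r for r in rows if r[:c] != prefix]
    return frequent

tda = [["I2", "I1", "I5", 0], ["I2", "I4", 0, 0], ["I2", "I3", 0, 0],
       ["I2", "I1", "I4", 0], ["I1", "I3", 0, 0], ["I2", "I3", 0, 0],
       ["I1", "I3", 0, 0], ["I2", "I1", "I3", "I5"], ["I2", "I1", "I3", 0]]
print(mine_tda(tda, minsup=2))
# {('I2','I1','I3'): 2, ('I2','I1'): 2, ('I2','I3'): 2, ('I1','I3'): 2}

This prefix-only simplification recovers most, but not all, of the itemsets reported in Table 4; {I2,I1,I5:2}, for instance, requires combining non-adjacent columns, which the full Case A/Case B bookkeeping handles.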
Table 4 shows a complete list of all the frequent item sets that were found in the data analysis. The sets are arranged by their support values, with the highest-frequency ones at the top. This systematic arrangement makes it easier to find patterns and trends in the dataset. Researchers can find important links between products that might affect how people buy things by looking closely at these common item groups. These insights are very helpful for creating targeted marketing plans and improving inventory management to better match what customers want. So, looking at frequent item sets not only helps us understand how consumers behave, but it also gives us useful information for making strategic business decisions.

Table 4. The Created Frequent Item Sets.
Frequent itemsets: {I2,I3:2}, {I1,I3:2}, {I2,I1,I5:2}, {I2,I1,I3:2}

VI. RESULTS AND DISCUSSIONS

The UCI Machine Learning Repository is an important resource for the data mining and knowledge discovery communities. It has a wide range of benchmark and real-world datasets that are necessary for testing the effectiveness of new methods [28]. Researchers can rigorously test their algorithms across a wide range of fields, such as the social sciences, biology, and economics, by using these datasets. This variety not only makes the algorithms stronger, but it also encourages new uses in many different areas. Researchers use these databases to find patterns, back up their results, and learn more about the things they are studying. This study compares the suggested method with the well-known FP-Growth algorithm by looking at how many frequent itemsets each finds and how long it takes to get these itemsets from the datasets. The results show that the suggested method makes data mining much more efficient by finding more frequent itemsets and greatly cutting computation time. This enhancement facilitates the application of these methodologies to larger and more intricate datasets, potentially transforming the extraction of data-driven insights. The experiments were done on a laptop with a 64-bit Windows 10 operating system, Python, 32GB of RAM, and an Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz. Table 5 shows a detailed statistical analysis of the datasets used in this comparative study. These datasets range in size and complexity from small to large, and these differences make it easier to fully evaluate the analytical techniques, showing how useful and relevant they are in many different situations.

Table 5: Characteristics of the Test Datasets
Datasets                                    Size     #Transactions
Poker Hand                                  23.9MB   268325
Sepsis Survival Minimal Clinical Records    1.31MB   110205

In algorithmic performance evaluation, employing varied datasets is essential for assessing the effectiveness of computational methodologies. This study utilized two separate datasets to methodically evaluate the performance of the algorithms under diverse conditions. The results showed a big difference: one algorithm was much faster, while the other was more accurate. This difference shows how important it is to consider the situation when choosing the best algorithmic strategy for a job. A detailed comparison of the FP-Growth algorithm and the suggested algorithm reveals their strengths and weaknesses, which helps us understand these results better. This analysis ultimately emphasizes the significance of considering both speed and accuracy, among other factors, when selecting an algorithm for practical application.
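For readers who want to rerun this kind of comparison, a minimal timing harness in Python might look like the sketch below. It uses the publicly available mlxtend implementations of Apriori and FP-Growth as stand-ins, since the proposed TDA miner is not included here; the toy transaction list is a placeholder for the real datasets, and mlxtend expects minsup as a fraction.

import time
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth

def benchmark(miner, onehot, minsup):
    # Time a single mining run and report the itemset count.
    start = time.perf_counter()
    itemsets = miner(onehot, min_support=minsup, use_colnames=True)
    return time.perf_counter() - start, len(itemsets)

# One-hot encode the transactions (replace with the Poker Hand or
# Sepsis records loaded from the UCI repository).
transactions = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"]]
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit_transform(transactions),
                      columns=encoder.columns_)

for minsup in (0.30, 0.45, 0.50, 0.60):
    for name, miner in (("Apriori", apriori), ("FP-Growth", fpgrowth)):
        secs, count = benchmark(miner, onehot, minsup)
        print(f"{name:9s} minsup={minsup:.0%}: {secs:.3f}s, {count} itemsets")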
Experiment One
The experiment used the Poker Hand dataset, which has records of hands of five playing cards taken from a standard deck of 52 cards. There are ten predictive features for analysis because each card has two properties: suit and rank. This dataset has a lot of information that makes it easy to look at different poker hands and their chances of winning. Using machine learning, it is possible to make models that can guess how strong a hand is or find the best ways to play. We did a lot of tests with different minimum support (minsup) values to see how well the proposed approach worked compared to the original FP-Growth algorithm. The Poker Hand dataset offered a practical and realistic environment for evaluating the algorithm's efficacy. The results showed that the machine learning models not only made predictions more accurate, but they also gave us a better understanding of how to play complex games. This progress opens up new possibilities for looking into how AI can make it easier to make decisions in poker and other games that require strategy. Researchers showed that the proposed method worked better than the original FP-Growth algorithm in a number of different situations by changing the minsup parameters. Table 6 shows the results, including how long it took to find the frequency of itemsets and the number of frequent itemsets at minsup values of 30%, 45%, 50%, and 60%. These results show how AI could help improve strategic gameplay and decision-making.
Table 6: Comparison Results for the Poker Hand Dataset with Various minsup Thresholds.

No.  minsup   Execution time (s)                       # Discovered frequent itemsets
              Apriori   FP-Growth   New algorithm      Apriori   FP-Growth   New algorithm
1    30%      84.745    77.288      7.932              1061      864         266
2    45%      82.815    75.267      7.544              837       602         245
3    50%      81.897    74.875      7.165              831       535         94
4    60%      79.197    72.586      7.154              827       411         53
The minimum support (minsup) threshold has a big effect on how well data mining algorithms work. It changes both the number of frequent itemsets created and the time it takes to run (for example, at minsup = 30% an itemset must appear in at least 30% of all transactions to be kept). When the minsup values go up, the number of frequent itemsets and the execution times of both the proposed method and the FP-Growth algorithm go down. The proposed method, on the other hand, always runs faster than the FP-Growth algorithm, no matter what minsup threshold is used. This shows that the suggested method is more scalable and efficient, especially when working with large datasets where the cost of computing is very important. The faster execution speed not only improves overall performance, but it also makes it easier to get insights faster when analyzing data. Figure 2 shows how the three algorithms compare in terms of performance at four different minsup thresholds. It shows that the proposed method is more efficient than the FP-Growth algorithm. The results show a big increase in execution speed, which shows that the proposed method can handle large amounts of data well. This progress also makes data mining easier and opens the door for more research into how to improve algorithmic performance across a range of applications. The FP-Growth method, on the other hand, needs a lot of memory and time to build multiple conditional sub-trees before it can find frequent itemsets, which can make it less efficient in large data environments.
Table 7: Comparison Results for the Sepsis Survival Minimal Clinical Records Dataset with Various minsup Thresholds.

No.  minsup   Execution time (s)                       # Discovered frequent itemsets
              Apriori   FP-Growth   New algorithm      Apriori   FP-Growth   New algorithm
1    10%      3.146     0.686       0.501              123       102         88
2    20%      2.015     0.641       0.488              102       93          42
3    30%      1.722     0.614       0.321              96        83          31
4    50%      1.430     0.561       0.315              81        61          17
Fig 2: Comparing the Results of the Execution time and the minsup Thresholds for the Poker Hand Dataset.
Experiment Two
The study used the Sepsis Survival Minimal Clinical Records dataset, which is a complete set of 110,204 hospital admissions in Norway between 2011 and 2012. These admissions involved 84,811 patients who had infections, septic shock, sepsis caused by pathogenic microorganisms, or systemic inflammatory response syndrome. The wealth of information in this dataset makes it possible to do a detailed analysis of patient outcomes, which makes it easier to assess the effectiveness of different treatment plans. Researchers are looking for patterns in demographics, clinical interventions, and survival rates that could help improve how sepsis is treated and how patients are cared for in the future. The main prediction task is to use the patient's medical records to figure out if they will live for about nine days after being admitted. Table 7 shows the execution time, the number of frequent item sets found using the FP-Growth algorithm, and the best method for each of the four minimum support thresholds: 10%, 20%, 30%, and 50%. This analysis not only shows how the dataset could help doctors make better decisions, but it also shows how carefully the treatment outcomes were measured.

Figure 3 shows that the four different minimum support (minsup) thresholds have very different algorithm execution times, which shows that the performance is very different. The data shows that the chosen minsup threshold has a big effect on execution times, which shows how important it is to choose the right threshold for algorithm efficiency. Algorithm A consistently outperformed the other algorithms tested, even when the minsup levels were lower. This means that Algorithm A is especially good at handling large datasets, making it a strong choice for applications that need a lot of data. The results show that the algorithm can improve data processing and optimize computer resources, making it the best choice for jobs that need to be done quickly and on a large scale.
Fig 3: Comparing the Results of the Execution Time and the minsup Thresholds for the Sepsis Survival Minimal Clinical Records Dataset.