Improved Data Mining Approach To Find Frequent Itemset Using Support Count Table
Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com Volume 1, Issue 2, July-August 2012 ISSN 2278-6856
Ramratan Ahirwal1, Neelesh Kumar Kori2 and Dr.Y.K. Jain3
1. INTRODUCTION
Mining data streams is an important research topic that has recently attracted considerable attention, because in many cases data is generated by external sources so rapidly that it becomes impossible to store it and analyze it offline. Moreover, in some cases streams of data must be analyzed in real time to provide information about trends, outlier values or regularities that must be signaled as soon as possible. The need for online computation is a notable challenge for classical data mining algorithms [1], [2]. Application fields for stream mining are as diverse as financial applications, network monitoring, security, telecommunication networks, Web applications, sensor networks, and the analysis of atmospheric data. Innovations in computer science have made it possible to acquire and store enormous amounts of data digitally in databases, currently gigabytes or terabytes in a single database and even more in the future. Many fields and systems of human activity have become increasingly
dependent on collected, stored, and processed information. However, the abundance of collected data makes it laborious to find the information essential for a specific purpose. Data mining is the analysis of (often large) observational datasets, drawn from databases, data warehouses or other large repositories of practical application data that may be incomplete, noisy, ambiguous and random, to find unsuspected relationships and to summarize the data in ways that are both understandable and useful to the data owner. It covers data extraction, cleaning, transformation, analysis and modeling, and it automatically discovers the patterns and interesting knowledge hidden in large amounts of data, which helps us make decisions based on a wealth of data. In software development, the task is to collect, analyze and mine the useful information hidden in the communication among developers and between staff and managers, and then to use that knowledge to make decisions. oustead College currently uses database technology to manage its library. Its main purpose is to facilitate the procurement, cataloging and circulation management of books. To better satisfy readers, we must explore their needs and proactively provide the information they need. Most current library evaluation techniques focus on frequencies and aggregate measures; these statistics hide underlying patterns. Discovering these patterns is key to understanding how library services are used [3]. Data mining has been applied to library operations [4]. With the rapid development of technology and growing user requirements, the dynamic elements in data mining are becoming more important, including dynamic databases and knowledge bases, users' interestingness measures, and data that vary with time and space.
In order to solve problems such as low effectiveness, high randomness and hard implementation in dynamic mining, more research on dynamic data mining has been done. In [5], [6], an evolutionary immune mechanism was proposed, based on the fact that the elements involved in the domain can be modeled as those in immune models. It focused on how to utilize the relationship between antigens and antibodies in dynamic data mining such as an
3. LITERATURE REVIEW
In 2011, Jinwei Wang et al. [12] proposed an interpolation technique for missing context data based on Time-Space Relationship and Association Rule Mining (TSRARM) to overcome the shortcomings of existing interpolation techniques for missing data. It performs spatial and time series analysis on sensor data and generates strong association rules to interpolate missing data. A simulation experiment on acquired temperature sensor data verifies the rationality and efficiency of TSRARM. In 2011, M. Chaudhary et al. [13] proposed a new and more optimized algorithm for online rule generation. The advantage of this algorithm is that the graph it generates has fewer edges than the lattice used in the existing algorithm. The proposed algorithm also generates all the essential rules, and no rule is missing. The use of non-redundant association rules helps significantly in reducing irrelevant noise in the data mining process. This graph-theoretic approach, called the adjacency lattice, is crucial for online mining of data. The adjacency lattice can be stored either in main memory or in secondary memory. The idea of the adjacency lattice is to pre-store a number of large itemsets in a special format that reduces the disk I/O required to perform a query. In 2011, Fu et al. [14] observed that mining real-time monitoring data has become a necessary means of improving the operational efficiency, economic safety and fault detection of power plants. Based on a data mining algorithm for interactive association rules, and taking full advantage of the association characteristics of real-time test-spot data during steam turbine operation, they put forward a principle for mining quantitative association rules among the parameters in the real-time monitoring data of a steam turbine.
By analyzing the practical operating results of a steam turbine with the data mining method based on interactive rules, they show that the method can supervise steam turbine operation and condition monitoring, and can provide model reference and decision support for fault diagnosis and condition-based maintenance. In 2011, Xin et al. [15] used association rule learning to process statistical data on the private economy and analyzed the results to improve the quality of those data. The article closes with some exploratory comments and suggestions on applying association rule mining to private economy statistics.
[Table residue: support count table with columns "No." (1, ..., 2^I - 1) and "Itemset (A)"]
4.3 Proposed Method to Find Frequent Itemsets
We propose a method for finding frequent itemsets that may be useful for static as well as streaming databases. The method employs a support count table that requires only a single scan of the database to make an entry in the table for each transaction. The table retains this information until the observation is complete or the frequent itemsets have been found. When transactions are added to the dataset or expire from it, the table is updated at the same time, so the updated support count table always holds the frequency count of each itemset. To find the frequent itemsets for any threshold value we scan the table, not the database. Apriori requires l+1 scans of the dataset and generates candidates to find the frequent sets, whereas our approach needs only a single scan of the database and no candidate generation. The table holds the frequency count of every itemset, not its total support count. The frequency count of an itemset is the number of transactions in the transactional database D that are exactly equal to that itemset. The total support count of an itemset is the number of transactions in D that contain all the items of that itemset. In our scheme the total count is calculated by scanning the table and summing the frequency counts of every table entry that contains the itemset. The total support count is then compared with the threshold S0; if the count is greater than the threshold, the itemset is included in the frequent set. This procedure is repeated for every itemset to find all frequent itemsets.
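The table maintenance described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' Java implementation: the function names `build_table`, `add_transaction` and `expire_transaction` are hypothetical, and transactions are assumed to be plain sets of item identifiers.

```python
from collections import Counter


def build_table(transactions):
    """Scan the database once and count exact occurrences of each itemset."""
    table = Counter()
    for transaction in transactions:
        table[frozenset(transaction)] += 1
    return table


def add_transaction(table, transaction):
    """A new transaction arrives: raise the count of its exact itemset."""
    table[frozenset(transaction)] += 1


def expire_transaction(table, transaction):
    """A transaction expires: lower the count of its exact itemset."""
    key = frozenset(transaction)
    if table[key] == 0:  # Counter returns 0 for keys it has never seen
        raise ValueError("transaction is not recorded in the table")
    table[key] -= 1
    if table[key] == 0:
        del table[key]  # drop entries whose count fell back to zero


# Hypothetical usage: the database itself is never rescanned after the build.
table = build_table([{10}, {10, 20}, {10}])
add_transaction(table, {30, 40})
expire_transaction(table, {10})
```

Each update touches a single table entry, which is what lets the table follow a stream of arriving and expiring transactions without a second pass over D.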
Algorithm: To find frequent itemsets
Input: A database D and the support threshold S0.
Output: Frequent itemsets Fitemset.
Method:
For example, let I = (i1, i2, i3, i4) be the set of items. The itemsets that can be generated from I are {i1}, {i2}, {i3}, ..., {i1, i2, i3, i4}. Every transaction itemset X that may occur in database D is a subset of I and is therefore equal to one of these itemsets. The table is initially created as given below.
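As a rough illustration of how the initial table could be laid out, the sketch below enumerates the 2^I - 1 non-empty subsets of I and sets every count to zero. The helper names `all_itemsets` and `initial_table` are assumptions made for this example; the paper does not say how its table is stored.

```python
from itertools import combinations


def all_itemsets(items):
    """Return the 2^I - 1 non-empty subsets of `items`, smallest first."""
    ordered = sorted(items)
    return [
        frozenset(combo)
        for size in range(1, len(ordered) + 1)
        for combo in combinations(ordered, size)
    ]


def initial_table(items):
    """Create the support count table with every count set to zero."""
    return {itemset: 0 for itemset in all_itemsets(items)}


table = initial_table(["i1", "i2", "i3", "i4"])
print(len(table))  # 15 entries, i.e. 2^4 - 1
```

The table grows exponentially with the number of items, so a layout like this is only practical for a small I, such as the I=5 used in the experiments.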
Step 2: To find the frequent itemsets we make use of the support count table, as follows:

Table 3: Frequency count for the above example

No.   Itemset (A)      Support count (Scount)
1     {10}             2
2     {20}             0
3     {30}             0
4     {40}             1
5     {10,20}          1
6     {10,30}          2
7     {10,40}          0
8     {20,30}          2
9     {20,40}          0
10    {30,40}          2
11    {10,20,30}       2
12    {10,20,40}       0
13    {10,30,40}       0
14    {20,30,40}       2
15    {10,20,30,40}    1
Step 3: for (j = 1; j < 2^I; j++) // repeat Step 3 to find the total count
Step 3.1: If Ai ⊆ Aj then TCount = TCount + Scount(j)
Step 4: If (TCount > S0) then Fitemset = Fitemset ∪ Ai
Step 5: Go to Step 2
Step 6: End
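The steps above can be sketched in Python as follows. Several details are assumptions rather than the authors' code: the table is taken to be a dictionary that maps each itemset (a `frozenset`) to its Scount, the outer loop over every Ai is inferred from "Go to Step 2", the subset test in Step 3.1 is inferred from the description of the total support count, and the strict comparison in Step 4 follows the prose ("greater than the threshold") because the operator is missing from the extracted pseudocode.

```python
def total_count(table, a_i):
    """Steps 3 and 3.1: add Scount(j) for every entry A_j that contains A_i."""
    t_count = 0
    for a_j, s_count in table.items():
        if a_i <= a_j:  # A_i is a subset of A_j
            t_count += s_count
    return t_count


def find_frequent(table, s0):
    """Steps 2 to 6: test every itemset A_i in the table against S0."""
    f_itemset = []
    for a_i in table:
        # Assumption: strict comparison, as stated in the prose.
        if total_count(table, a_i) > s0:
            f_itemset.append(a_i)
    return f_itemset


# Hypothetical two-item table: {a} seen once, {b} never, {a, b} twice.
table = {frozenset("a"): 1, frozenset("b"): 0, frozenset("ab"): 2}
print(total_count(table, frozenset("a")))  # 3
print(find_frequent(table, 2))             # [frozenset({'a'})]
```

Each itemset requires one pass over the table, so the cost of the search depends on the table size (2^I - 1 entries) rather than on the number of transactions in D.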
To better explain our algorithm, consider an example. Let I = (10, 20, 30, 40) be the set of four items, and let the threshold value be 2. D contains 15 transactions, listed below:

tid   Transaction
1     {10}
2     {10,20}
3     {30,40}
4     {10,20,30,40}
5     {10,30}
6     {10,30}
7     {30,40}
8     {20,30,40}
9     {20,30,40}
10    {10,20,30}
11    {20,30}
12    {40}
13    {20,30}
14    {10,20,30}
15    {10}

Step 1: Scanning the database gives the support count table shown in Table 3.
To check whether itemset {10} is frequent, we obtain its total support count by scanning the support count table. From the table, the total support of {10} is 8: the entries {10}, {10,20}, {10,30}, {10,20,30} and {10,20,30,40} contribute 2 + 1 + 2 + 2 + 1. This value is compared with the threshold value 2. Since the total count is greater than the threshold, {10} is a frequent itemset and is included in Fitemset. This process is repeated for every itemset, so that every frequent itemset is obtained from the support count table. The frequent itemsets for the given dataset are: Fitemset = {{10}, {20}, {30}, {40}, {10,20}, {10,30}, {20,30}, {20,40}, {30,40}, {10,20,30}, {20,30,40}}
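The worked example can be checked mechanically. The short Python script below is a sketch written for this purpose, not the authors' implementation: it rebuilds the frequency counts from the 15 transactions, derives the total support counts by summing over containing entries, and applies the threshold with the strict comparison used in the text. The variable names are illustrative.

```python
from collections import Counter
from itertools import combinations

ITEMS = [10, 20, 30, 40]
S0 = 2
TRANSACTIONS = [
    {10}, {10, 20}, {30, 40}, {10, 20, 30, 40}, {10, 30},
    {10, 30}, {30, 40}, {20, 30, 40}, {20, 30, 40}, {10, 20, 30},
    {20, 30}, {40}, {20, 30}, {10, 20, 30}, {10},
]

# Single scan: frequency count of each exact transaction itemset.
scount = Counter(frozenset(t) for t in TRANSACTIONS)

# Every non-empty subset of ITEMS, ordered by size and then by items.
itemsets = [
    frozenset(combo)
    for size in range(1, len(ITEMS) + 1)
    for combo in combinations(ITEMS, size)
]

# Total support: sum the counts of all table entries containing the itemset.
tcount = {a: sum(n for b, n in scount.items() if a <= b) for a in itemsets}

frequent = [sorted(a) for a in itemsets if tcount[a] > S0]

print(tcount[frozenset({10})])  # 8
print(frequent)
```

The script reports a total support of 8 for {10} and the same eleven frequent itemsets as listed above. No itemset in this dataset has a total support of exactly 2, so a non-strict comparison would give the same result here.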
5. RESULT ANALYSIS
To study the performance of the proposed algorithm, we carried out several experiments. The experimental environment is an Intel Core processor running Windows XP. The algorithm is implemented in Java using NetBeans 7.1. The parameters used are as follows: D is the transaction database, I is the number of items in the transactions, and S0 is MINsupport. Table 4 shows the execution time in seconds when I=5, the transactional database D is scaled up from 50 to 1000 transactions and MINsupport S0 is scaled up from 2 to 8. We see from the table
[Figure residue: plot with x-axis "No. of Transactions" (1000 to 6000) and axis ticks from 0 to 120]
Figure 2: Execution time(s), MINsupport(S0=2)
Figure 2 shows that the execution time of the algorithm (for MINsupport S0=2, I=5) increases almost linearly with the size of the dataset, which suggests that the algorithm scales well. To examine scalability further, we increased the dataset D from 1000 to 6000 transactions with the same parameters (MINsupport S0=2, I=5); the result is given in figure 5.
Figure 3: Execution time(s), Transaction database (D=200)
REFERENCES
[1] M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy, Mining data streams: A review, ACM SIGMOD Record, vol. 34, no. 1, 2005.
[2] C. C. Aggarwal, Data Streams: Models and Algorithms, Springer, 2007.
[3] S. Nicholson, The Bibliomining Process: Data Warehousing and Data Mining for Library Decision-Making, Information Technology and Libraries, 22(4):146-151, 2003.
[4] Jiann-Cherng Shieh and Yung-Shun Lin, Bibliomining User Behaviors in the Library, Journal of Educational Media & Library Sciences, 44(1):36-60, 2006.
[5] Yiqing Qin, Bingru Yang, Guangmei Xu, et al., Research on Evolutionary Immune Mechanism in KDD, In: Proceedings of Intelligent Systems and Knowledge Engineering 2007 (ISKE2007), Chengdu, China, October 2007, pp. 94-99.
[6] B. R. Yang, Knowledge discovery based on inner mechanism: construction, realization and application, USA: Elliott & Fitzpatrick Inc., 2004.
[7] Binesh Nair and Amiya Kumar Tripathy, Accelerating Closed Frequent Itemset Mining by Elimination of Null Transactions, Journal of Emerging Trends in Computing and Information Sciences, Volume 2, No. 7, July 2011, pp. 317-324.
[8] E. Ramaraj and N. Venkatesan, Bit Stream Mask-Search Algorithm in Frequent Itemset Mining, European Journal of Scientific Research, ISSN 1450-216X, Vol. 27, No. 2 (2009), pp. 286-297.
[9] Shilpa and Sunita Parashar, Performance Analysis of Apriori Algorithm with Progressive Approach for Mining Data, International Journal of Computer Applications (0975-8887), Volume 31, No. 1, October 2011, pp. 13-18.
[10] G. Cormode and M. Hadjieleftheriou, Finding frequent items in data streams, In Proceedings of the 34th International Conference on Very Large Data Bases (VLDB), pages 1530-1541, Auckland, New Zealand, 2008.
[11] D. Y. Chiu, Y. H. Wu, and A. L. Chen, Efficient frequent sequence mining by a dynamic strategy switching algorithm, The International Journal on Very Large Data Bases (VLDB Journal), 18(1):303-327, 2009.
[12] Jinwei Wang and Haitao Li, An Interpolation Approach for Missing Context Data Based on the Time-Space Relationship and Association Rule Mining, Multimedia Information Networking and Security (MINES), 2011, IEEE.
[13] M. Chaudhary, A. Rana, and G. Dubey, Online Mining of data to generate association rule mining in large databases,