Priyadarshini J. L. College of Engineering, Nagpur: Session 2022-23 Semester-V
ACTIVITY REPORT
SUBJECT:- DATA WAREHOUSING AND MINING
ACTIVITY 1
“DATA MINING TOOL :- WEKA”
▪ Information
▪ History of Weka
▪ Features of Weka
▪ Types of Algorithm by Weka
▪ How to Install Dataset in Weka Tool
▪ Data Mining Techniques
Information
Weka contains a collection of visualization tools and algorithms for data analysis and
predictive modelling, together with graphical user interfaces for easy access to these
functions. The original non-Java version of Weka was a Tcl/Tk front-end to (mostly
third-party) modelling algorithms implemented in other programming languages, plus
data preprocessing utilities in C and a makefile-based system for running machine
learning experiments.
This original version was primarily designed as a tool for analyzing data from
agricultural domains. However, the more recent, fully Java-based version (Weka 3),
whose development began in 1997, is now used in many different application areas,
particularly for educational purposes and research.
The foundation of any Machine Learning application is data: not just a little data, but a
huge volume of it, which in current terminology is termed Big Data.
To train a machine to analyze big data, the data must meet several conditions:
• The data must be clean.
• It should not contain null values.
Besides, not all the columns in the data table will be useful for the type of analytics
that you are trying to achieve. The irrelevant data columns, or ‘features’ as they are
termed in Machine Learning terminology, must be removed before the data is fed into a
machine learning algorithm.
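The two cleaning steps above (dropping rows with null values and removing irrelevant columns) can be sketched in plain Python. This is only an illustration and not how Weka itself does it; the table, the column names, and the choice of "id" as the irrelevant column are all hypothetical.

```python
# Hypothetical toy table; column names and values are illustrative only.
rows = [
    {"id": 1, "age": 34, "income": 52000, "bought": "yes"},
    {"id": 2, "age": None, "income": 48000, "bought": "no"},
    {"id": 3, "age": 45, "income": 61000, "bought": "yes"},
    {"id": 4, "age": 29, "income": None, "bought": "no"},
]

def drop_rows_with_nulls(table):
    """Keep only the rows in which every value is present."""
    return [row for row in table if all(v is not None for v in row.values())]

def drop_columns(table, irrelevant):
    """Remove the named columns from every row."""
    return [{k: v for k, v in row.items() if k not in irrelevant} for row in table]

# Assumption: "id" carries no predictive information, so it is removed.
clean = drop_columns(drop_rows_with_nulls(rows), irrelevant={"id"})
print(clean)
```

Dropping incomplete rows is the simplest policy; replacing missing values with an estimate is a common alternative when data is scarce.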
History of Weka
o In 1993, the University of Waikato in New Zealand began the development of the
original version of Weka, which became a mix of Tcl/Tk, C, and makefiles.
o In 1997, the decision was made to redevelop Weka from scratch in Java,
including implementing modelling algorithms.
o In 2005, Weka received the SIGKDD Data Mining and Knowledge Discovery
Service Award.
o In 2006, Pentaho Corporation acquired an exclusive licence to use Weka for
business intelligence. It forms the data mining and predictive analytics
component of the Pentaho business intelligence suite. Hitachi Vantara has since
acquired Pentaho, and Weka now underpins the PMI (Plugin for Machine
Intelligence) open-source component.
Features of Weka
Weka has the following features:
1. Preprocess
The preprocessing of data is a crucial task in data mining. Because most data is raw, it
may contain empty, duplicate, or garbage values, outliers, extra columns, or
inconsistent naming conventions. All of these degrade the results.
2. Classify
After selecting the desired classifier, we select the test options that determine how it
is evaluated. Some of the options are:
o Use training set: the classifier is tested on the same data it was trained on.
o Supplied test set: the classifier is evaluated on a separate test set.
o Cross-validation Folds: the classifier is assessed by cross-validation using the
specified number of folds.
o Percentage split: the classifier is trained on the specified percentage of the data
and evaluated on the remainder.
Beyond these, more test options are available, such as Preserve order for % split,
Output source code, etc.
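To make the cross-validation and percentage-split options concrete, here is a minimal sketch of how instance indices could be divided in each case. It is an assumption-laden illustration, not Weka's implementation: the function names are hypothetical, and no shuffling or stratification is done, so the percentage split here behaves like the order-preserving variant.

```python
def k_fold_indices(n_instances, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_instances))
    # Deal instances round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def percentage_split(n_instances, train_percent):
    """Use the first train_percent% of instances for training, the rest for testing."""
    cut = round(n_instances * train_percent / 100)
    indices = list(range(n_instances))
    return indices[:cut], indices[cut:]

splits = list(k_fold_indices(10, 5))
for train, test in splits:
    print("train:", train, "test:", test)

train, test = percentage_split(10, 66)
print(len(train), "training instances,", len(test), "test instances")
```

With cross-validation every instance is used for testing exactly once, which is why it usually gives a more reliable estimate than a single split.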
3. Cluster
4. Associate
Association rules highlight all the associations and correlations between items of a
dataset. In short, an association rule is an if-then statement that depicts the
probability of relationships between data items. A classic example of an association is
the connection between the sale of milk and bread.
The tool provides the Apriori, FilteredAssociator, and FPGrowth algorithms for
association rule mining in this category.
5. Select Attributes
Every dataset contains a lot of attributes, but several of them may not be significantly
valuable. Therefore, removing the unnecessary attributes and keeping the relevant ones
is very important for building a good model.
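One common way to judge whether an attribute is valuable is to rank attributes by information gain with respect to the class. The sketch below shows that idea on a hypothetical four-row dataset; the attribute names are invented for illustration, and this is one possible ranking criterion rather than a description of what the Select Attributes tab does internally.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in class entropy obtained by splitting on one attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical data: "outlook" fully determines "play", "windy" tells nothing.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

ranking = sorted(
    ["outlook", "windy"],
    key=lambda a: information_gain(rows, a, "play"),
    reverse=True,
)
print(ranking)
```

Attributes at the bottom of such a ranking are candidates for removal before a model is built.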
6. Visualize
In the visualize tab, different plot matrices and graphs are available to show the trends
and errors identified by the model.
Each algorithm has configuration parameters such as batchSize, debug, etc. Some
configuration parameters are common across all the algorithms, while some are specific.
These parameters can be edited once an algorithm has been selected.
2) After a successful download, open the file location and double-click on the
downloaded file. The Setup wizard will appear. Click on Next.
3) The License Agreement terms will open. Read it thoroughly and click on “I Agree”.
8) After the installation is complete, the following window will appear. Click on Next.
9) Click on Finish.
10) Open the WEKA tool and click on Explorer.
The WEKA Explorer window shows different tabs, starting with Preprocess. Initially,
the Preprocess tab is active, because the data set is preprocessed first, before
algorithms are applied to it and the dataset is explored.
11) Open file
Check the “Confusion Matrix” that is created. If your data set is proper and the
classification was done properly, the status will show OK.
28) You can now visualize the tree; for that, click on this.
34) Now go to clustering for grouping of the data.
Clustering is done properly.
35) Now go to “Associate”
40) You can also visualize each point; adjust this for fast scrolling and to view the
appropriate results.
Data Mining Techniques
Data mining includes the utilization of refined data analysis tools to find previously
unknown, valid patterns and relationships in huge data sets. These tools can incorporate
statistical models, machine learning techniques, and mathematical algorithms, such as
neural networks or decision trees. Thus, data mining incorporates analysis and
prediction.
In recent data mining projects, various major data mining techniques have been
developed and used, including association, classification, clustering, prediction,
sequential patterns, and regression.
1. Classification:
This technique is used to obtain important and relevant information about data and
metadata. This data mining technique helps to classify data into different classes.
2. Clustering:
Clustering analysis is a data mining technique used to identify similar data. This
technique helps to recognize the differences and similarities between the data.
Clustering is similar to classification, but it groups chunks of data together based on
their similarities rather than on predefined class labels.
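Grouping by similarity can be illustrated with a bare-bones k-means loop. This is a sketch under stated assumptions: the points and the starting centroids are hypothetical, similarity is taken to be Euclidean distance, and the starting centroids are supplied by the caller instead of being chosen at random.

```python
def k_means(points, centroids, iterations=10):
    """Plain k-means on 2-D points with caller-supplied starting centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign the point to the nearest centroid (squared Euclidean distance).
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical data: two visibly separate groups of points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(clusters)
```

No class labels are given to the function; the two groups emerge only from the distances between points.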
3. Regression:
Regression analysis is the data mining process used to identify and analyze the
relationship between variables in the presence of other factors. It is used to estimate
the likely value of a specific variable. Regression is primarily a form of planning and
modeling. For example, we might use it to project certain costs, depending on other
factors such as availability, consumer demand, and competition. Primarily, it describes
the relationship between two or more variables in the given data set.
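As an illustration of projecting a cost from another factor, the following sketch fits a straight line by ordinary least squares. It assumes a single predictor and a linear relationship; the demand and cost figures are made up so that the fitted line is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical figures constructed so that cost = 2 * demand + 5.
demand = [10, 20, 30, 40]
cost = [25, 45, 65, 85]

slope, intercept = fit_line(demand, cost)
projected = slope * 50 + intercept  # projected cost at a demand of 50
print(slope, intercept, projected)
```

Real data rarely falls exactly on a line, so the fitted relationship is an estimate and not an exact rule.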
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It finds
a hidden pattern in the data set.
Association rules are if-then statements that help to show the probability of
interactions between data items within large data sets in different types of databases.
Association rule mining has several applications and is commonly used to find sales
correlations in transaction data or patterns in medical data sets.
The algorithm works on a collection of records, for example, a list of grocery items
that you have been buying for the last six months. It calculates the percentage of
items that are purchased together.
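The "percentage of items purchased together" is usually expressed as support and confidence. The sketch below computes both for one rule over a hypothetical list of grocery baskets; it shows only the two measures, not a full rule-mining algorithm such as Apriori.

```python
# Hypothetical grocery baskets, one set of items per purchase.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Rule under test: if milk is bought, then bread is bought.
rule_support = support({"milk", "bread"}, transactions)
rule_confidence = confidence({"milk"}, {"bread"}, transactions)
print(rule_support, rule_confidence)
```

Here milk and bread appear together in 3 of 5 baskets, and 3 of the 4 baskets containing milk also contain bread.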
5. Outlier detection:
This type of data mining technique relates to the observation of data items in the data
set that do not match an expected pattern or expected behavior. This technique may
be used in various domains such as intrusion detection, fraud detection, etc. It is also
known as Outlier Analysis or Outlier Mining. An outlier is a data point that diverges
too much from the rest of the dataset. The majority of real-world datasets contain
outliers. Outlier detection plays a significant role in the data mining field. It is
valuable in numerous fields such as network intrusion identification, credit or debit
card fraud detection, detecting outliers in wireless sensor network data, etc.
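A simple way to flag a point that "diverges too much" is the z-score: how many standard deviations a value lies from the mean. The sketch below is one possible approach, not the only one; the readings and the threshold of 2.0 are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than threshold standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical sensor readings with one obviously abnormal value.
readings = [10, 12, 11, 13, 12, 11, 95]
outliers = z_score_outliers(readings)
print(outliers)
```

Because the outlier itself inflates the mean and standard deviation, more robust measures are often preferred on small or heavily contaminated data sets.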
6. Sequential Patterns:
This technique of data mining helps to discover or recognize similar patterns in
transaction data over a period of time.
7. Prediction: