
Lokmanya Tilak Jankalyan Shikshan Sanstha’s

PRIYADARSHINI J. L. COLLEGE OF ENGINEERING, NAGPUR
An Autonomous Institution Affiliated to R.T.M. Nagpur University
Accredited with Grade “A” by NAAC

SESSION 2022-23 SEMESTER-V


DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING

ACTIVITY REPORT
SUBJECT:- DATA WAREHOUSING AND MINING

ACTIVITY 1
“DATA MINING TOOL :- WEKA”

SUBMITTED BY :- RIYA PRAKASH PATIL


ROLL NO:- 15

Under The Guidance of


Prof. Manisha Vaidya

Sign of Subject Teacher Sign of HOD


CONTENT

▪ Information
▪ History of Weka
▪ Features of Weka
▪ Types of Algorithms by Weka
▪ How to Install Dataset in Weka Tool
▪ Data Mining Techniques
Information

Weka contains a collection of visualization tools and algorithms for data analysis and
predictive modelling, together with graphical user interfaces for easy access to these
functions. The original non-Java version of Weka was a Tcl/Tk front-end to (mostly
third-party) modelling algorithms implemented in other programming languages, plus
data preprocessing utilities in C and a makefile-based system for running machine
learning experiments.

This original version was primarily designed as a tool for analyzing data from
agricultural domains. Still, the more recent fully Java-based version (Weka 3),
developed in 1997, is now used in many different application areas, particularly for
educational purposes and research.

The foundation of any machine learning application is data: not just a little data, but huge volumes of it, termed Big Data in current terminology.
To train a machine to analyze big data, you need to keep several considerations about the data in mind:
• The data must be clean.
• It should not contain null values.
Besides, not all the columns in the data table will be useful for the type of analytics you are trying to achieve. Irrelevant data columns, or 'features' as they are termed in machine learning, must be removed before the data is fed into a machine learning algorithm.
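As a rough illustration of these two considerations, here is a plain-Python sketch (toy records, not a Weka API) that drops rows containing null values and then removes an irrelevant column:

```python
# Hypothetical toy records; "id" is an irrelevant column for analytics.
rows = [
    {"age": 25, "income": 50000, "id": 1, "label": "yes"},
    {"age": None, "income": 62000, "id": 2, "label": "no"},   # null value
    {"age": 40, "income": 48000, "id": 3, "label": "yes"},
]

# 1. Keep only rows with no null (None) values.
clean = [r for r in rows if all(v is not None for v in r.values())]

# 2. Remove the irrelevant 'id' column from every remaining row.
features = [{k: v for k, v in r.items() if k != "id"} for r in clean]

print(len(features))          # 2 rows survive
print("id" in features[0])    # False
```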

History of Weka

o In 1993, the University of Waikato in New Zealand began the development of the
original version of Weka, which became a mix of Tcl/Tk, C, and makefiles.
o In 1997, the decision was made to redevelop Weka from scratch in Java,
including implementing modelling algorithms.
o In 2005, Weka received the SIGKDD Data Mining and Knowledge Discovery
Service Award.
o In 2006, Pentaho Corporation acquired an exclusive licence to use Weka for
business intelligence. It forms the data mining and predictive analytics
component of the Pentaho business intelligence suite. Hitachi Vantara has since
acquired Pentaho, and Weka now underpins the PMI (Plugin for Machine
Intelligence) open-source component.
Features of Weka
Weka has the following features, such as:

1. Preprocess

The preprocessing of data is a crucial task in data mining. Because most data is raw, it may contain empty or duplicate values, garbage values, outliers, extra columns, or inconsistent naming conventions. All of these degrade the results.

To make data cleaner and more consistent, WEKA provides a comprehensive set of options under the filter category. The tool offers both supervised and unsupervised types of operations. Here is a list of some operations for preprocessing:

o ReplaceMissingWithUserConstant: to fix empty or null value issues.


o ReservoirSample: to generate a random subset of sample data.
o NominalToBinary: to convert the data from nominal to binary.
o RemovePercentage: to remove a given percentage of data.
o RemoveRange: to remove a given range of data.
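As an illustration of what a filter such as ReplaceMissingWithUserConstant does conceptually, here is a plain-Python sketch (not the Weka API): every missing value in a column is replaced by a user-chosen constant.

```python
def replace_missing_with_constant(column, constant):
    """Return a copy of the column with None values replaced by a constant."""
    return [constant if v is None else v for v in column]

# Toy column with two missing entries.
ages = [25, None, 40, None]
print(replace_missing_with_constant(ages, -1))  # [25, -1, 40, -1]
```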

2. Classify

Classification is one of the essential functions in machine learning, where we assign classes or categories to items. Classic examples of classification are declaring a brain tumour "malignant" or "benign", or assigning an email to the "spam" or "not_spam" class.

After selecting the desired classifier, we select test options for the training set. Some of
the options are:

o Use training set: the classifier will be tested on the same training set.
o A supplied test set: evaluates the classifier based on a separate test set.
o Cross-validation Folds: assessment of the classifier based on cross-validation
using the number of provided folds.
o Percentage split: the classifier will be judged on a specific percentage of data.

Other than these, we can also use more test options such as Preserve order for %
split, Output source code, etc.
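The Percentage split option can be sketched in plain Python as follows (a toy illustration of the idea, not Weka's internal code): shuffle the data, train on the chosen percentage, and test on the rest.

```python
import random

def percentage_split(dataset, train_percent, seed=1):
    """Shuffle and split the dataset into (train, test) lists."""
    data = list(dataset)
    random.Random(seed).shuffle(data)   # fixed seed for reproducibility
    cut = int(len(data) * train_percent / 100)
    return data[:cut], data[cut:]

# A 66% split of 100 toy instances yields 66 training and 34 test items.
train, test = percentage_split(range(100), 66)
print(len(train), len(test))  # 66 34
```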

3. Cluster

In clustering, a dataset is arranged into different groups/clusters based on some similarities. The items within the same cluster are similar to each other but different from items in other clusters. Examples of clustering include identifying customers with similar behaviours and organizing regions according to homogeneous land use.

4. Associate

Association rules highlight all the associations and correlations between items of a
dataset. In short, it is an if-then statement that depicts the probability of relationships
between data items. A classic example of association refers to a connection between the
sale of milk and bread.

The tool provides Apriori, FilteredAssociator, and FPGrowth algorithms for association
rules mining in this category.

5. Select Attributes

Every dataset contains many attributes, but several of them may not be significantly valuable. Removing unnecessary attributes and keeping the relevant ones is very important for building a good model.

WEKA offers many attribute evaluators and search methods, including BestFirst, GreedyStepwise, and Ranker.

6. Visualize

In the visualize tab, different plot matrices and graphs are available to show the trends
and errors identified by the model.

Types of Algorithms by Weka


WEKA provides many algorithms for machine learning tasks. The algorithms are divided into several groups according to their core nature, and they are available under the Explorer tab of WEKA. Let's look at those groups and their core nature:
o Bayes: consists of algorithms based on Bayes theorem like Naive Bayes
o functions: comprises the algorithms that estimate a function, including Linear
Regression
o lazy: covers all algorithms that use lazy learning similar to KStar, LWL
o meta: consists of those algorithms that use or integrate multiple algorithms for
their work like Stacking, Bagging
o misc: miscellaneous algorithms that do not fit any of the given categories
o rules: combines algorithms that use rules such as OneR, ZeroR
o trees: contains algorithms that use decision trees, such as J48, RandomForest

Each algorithm has configuration parameters such as batchSize, debug, etc. Some configuration parameters are common across all the algorithms, while others are specific. These configurations become editable once an algorithm is selected.
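To give a feel for how simple some of these algorithms are, here is a plain-Python re-implementation of the idea behind ZeroR from the rules group (not Weka's code): it ignores all attributes and always predicts the majority class.

```python
from collections import Counter

def zero_r(labels):
    """Return the most frequent class among the training labels."""
    return Counter(labels).most_common(1)[0][0]

# Toy training labels: "spam" is the majority class, so ZeroR predicts it
# for every future instance, regardless of the instance's attributes.
train_labels = ["spam", "not_spam", "spam", "spam", "not_spam"]
majority = zero_r(train_labels)
print(majority)  # spam
```

ZeroR is useful as a baseline: any real classifier should at least beat the accuracy of always guessing the majority class.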

How to Install Dataset in Weka Tool

1) Download the software from https://sourceforge.net/projects/weka/

2) After the download completes, open the file location and double-click on the
downloaded file. The Setup wizard will appear. Click on Next.
3) The License Agreement terms will open. Read it thoroughly and click on “I Agree”.

4) According to your requirements, select the components to be installed. Full component installation is recommended. Click on Next.
5) Select the destination folder and Click on Next.

6) Then, choose the Start Menu folder and click Install.


7) The installation will then start.

8) After the installation is complete, the following window will appear. Click on Next.
9) Click on Finish.
10) Open the WEKA tool and click on Explorer.

11) Explorer window opens.

The WEKA Explorer windows show different tabs starting with preprocessing. Initially,
the preprocess tab is active, as first the data set is preprocessed before applying
algorithms to it and exploring the dataset.
12) Click on Open file

13) Navigate to Local Disk (C:)

14) Open Program Files

15) Open the Weka-3-8-6 folder

16) Click on the data folder

17) The default datasets are now available
18) Click on "Soybean"

19) Weka now displays the graph for the dataset

20) You can also change the graph

21) You can change its class

22) For example, to "seeds"


23) After preprocessing, click on Classify
24) Choose a classifier
25) Go to trees

26) Then choose J48

27) And click on Start

See the "Confusion Matrix" that is created; if your dataset is proper and the classification was done correctly, the status bar will show OK.
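The confusion matrix itself is easy to reconstruct by hand; the sketch below (plain Python, with hypothetical labels rather than the Soybean output) counts actual-versus-predicted pairs the same way the report does: rows are actual classes, columns are predicted classes.

```python
def confusion_matrix(actual, predicted, classes):
    """Count (actual, predicted) label pairs into a nested dict."""
    m = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    return m

classes = ["yes", "no"]
actual    = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no",  "no", "yes", "yes"]
cm = confusion_matrix(actual, predicted, classes)
print(cm["yes"]["yes"], cm["yes"]["no"])  # 2 1  (diagonal = correct)
print(cm["no"]["no"], cm["no"]["yes"])    # 1 1
```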
28) You can now visualize the tree: right-click on the result entry

29) And go to Visualize tree

30) This is the Weka classifier tree


Now classification is over.
31) Next we will go to clustering

32) Choose SimpleKMeans clustering

33) And click on Start

34) Clustering now groups the data
Clustering is done properly
35) Now go to “Associate”

36) Choose the “Apriori” and click on start


37) The output is ready; Apriori basically works on itemset counts
38) Now you can go to Select attributes
39) Then click Start

40) You can also visualize each data point; adjust the settings for fast scrolling and view the appropriate results
Data Mining Techniques
Data mining involves the use of sophisticated data analysis tools to find previously unknown, valid patterns and relationships in huge data sets. These tools can incorporate statistical models, machine learning techniques, and mathematical algorithms, such as neural networks or decision trees. Thus, data mining incorporates both analysis and prediction.

Drawing on methods and technologies from the intersection of machine learning, database management, and statistics, professionals in data mining have devoted their careers to better understanding how to process huge amounts of data and draw conclusions from it. But what methods do they use to make that happen?

In recent data mining projects, various major data mining techniques have been
developed and used, including association, classification, clustering, prediction,
sequential patterns, and regression.
1. Classification:

This technique is used to obtain important and relevant information about data and metadata, and it helps to classify data into different classes.

Data mining techniques can be classified by different criteria, as follows:

i. Classification of data mining frameworks as per the type of data
sources mined:
This classification is as per the type of data handled. For example, multimedia,
spatial data, text data, time-series data, World Wide Web, and so on.
ii. Classification of data mining frameworks as per the database
involved:
This classification is based on the data model involved. For example, object-oriented databases, transactional databases, relational databases, and so on.
iii. Classification of data mining frameworks as per the kind of
knowledge discovered:
This classification depends on the types of knowledge discovered or data mining
functionalities. For example, discrimination, classification, clustering,
characterization, etc. Some frameworks tend to be extensive frameworks offering
a few data mining functionalities together.
iv. Classification of data mining frameworks according to data mining
techniques used:
This classification is as per the data analysis approach utilized, such as neural
networks, machine learning, genetic algorithms, visualization, statistics, data
warehouse-oriented or database-oriented, etc.
The classification can also take into account the level of user interaction involved
in the data mining procedure, such as query-driven systems, autonomous
systems, or interactive exploratory systems.

2. Clustering:

Clustering is a division of information into groups of connected objects. Describing the data by a few clusters loses certain fine details but achieves simplification: the data is modelled by its clusters. Historically, clustering is rooted in statistics, mathematics, and numerical analysis. From a machine learning point of view, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting framework represents a data concept. From a practical point of view, clustering plays an extraordinary role in data mining applications such as scientific data exploration, text mining, information retrieval, spatial database applications, CRM, Web analysis, computational biology, medical diagnostics, and much more.

In other words, clustering analysis is a data mining technique used to identify similar data. This technique helps to recognize the differences and similarities between data items. Clustering is very similar to classification, but it involves grouping chunks of data together based on their similarities.
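The k-means idea behind algorithms like SimpleKMeans can be sketched in a few lines of plain Python (1-D points and fixed initial centroids for illustration; this is not Weka's implementation):

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate assignment and centroid-update steps, then return centroids."""
    for _ in range(iterations):
        clusters = {c: [] for c in centroids}
        # Assignment step: each point goes to its nearest centroid.
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (keep the old centroid if its cluster is empty).
        centroids = [sum(ps) / len(ps) if ps else c
                     for c, ps in clusters.items()]
    return sorted(centroids)

# Two obvious groups in the toy data: around 2 and around 11.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(kmeans_1d(points, [0.0, 8.0]))  # [2.0, 11.0]
```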

3. Regression:

Regression analysis is a data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is used to estimate the value or likelihood of a specific variable. Regression is primarily a form of planning and modeling. For example, we might use it to project certain costs depending on other factors such as availability, consumer demand, and competition. Primarily, it gives the relationship between two or more variables in the given data set.
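As a small worked example of the idea, here is a least-squares fit of a straight line in plain Python, with hypothetical cost/demand numbers chosen so the relationship is exact:

```python
def fit_line(xs, ys):
    """Return slope and intercept of the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

# Hypothetical data: cost rises with demand as cost = 2*demand + 5.
demand = [10, 20, 30, 40]
cost   = [25, 45, 65, 85]
a, b = fit_line(demand, cost)
print(a, b)  # 2.0 5.0
```

Once fitted, the line can project a cost for an unseen demand level, e.g. a*50 + b.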

4. Association Rules:

This data mining technique helps to discover a link between two or more items. It finds
a hidden pattern in the data set.

Association rules are if-then statements that help to show the probability of interactions between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to help find sales correlations in data or in medical data sets.

The way the algorithm works is that you start with a collection of data, for example a list of grocery items that you have been buying for the last six months. It calculates the percentage of items being purchased together.

These are the three major measurement techniques:

o Support:
This measurement technique measures how often items are purchased together,
compared to the overall dataset.
Support = (Transactions with Item A and Item B) / (Entire dataset)
o Confidence:
This measurement technique measures how often item B is purchased when item
A is purchased as well.
Confidence = (Transactions with Item A and Item B) / (Transactions with Item A)
o Lift:
This measurement technique measures the strength of the confidence relative to
how often item B is purchased overall.
Lift = Confidence / Support(Item B)
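The three measures can be computed directly from a list of transactions; the sketch below uses plain Python and a hypothetical grocery dataset (not Weka's Apriori output):

```python
# Five hypothetical transactions, each a set of purchased items.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread"},
    {"milk"},
    {"milk", "bread"},
]
n = len(transactions)

def count(items):
    """Number of transactions containing every item in the given set."""
    return sum(1 for t in transactions if items <= t)

# For the rule milk -> bread:
sup_ab = count({"milk", "bread"}) / n                    # 3/5 = 0.6
confidence = count({"milk", "bread"}) / count({"milk"})  # 3/4 = 0.75
lift = confidence / (count({"bread"}) / n)               # ~0.9375
print(sup_ab, confidence, lift)
```

A lift below 1, as here, suggests buying milk makes bread slightly less likely than its overall rate; a lift above 1 would indicate a positive association.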

5. Outlier Detection:

This data mining technique relates to the observation of data items in the data set which do not match an expected pattern or expected behavior. It may be used in various domains like intrusion detection, fraud detection, etc. It is also known as Outlier Analysis or Outlier Mining. An outlier is a data point that diverges too much from the rest of the dataset. The majority of real-world datasets contain outliers. Outlier detection plays a significant role in the data mining field and is valuable in numerous areas like network intrusion identification, credit or debit card fraud detection, detecting outlying values in wireless sensor network data, etc.
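One simple way to flag such diverging points is a z-score test; the sketch below (plain Python, with a hypothetical 2-standard-deviation threshold) marks any value far from the mean as an outlier:

```python
def outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) > threshold * std]

# Toy sensor readings: 50 diverges sharply from the rest.
readings = [10, 11, 9, 10, 12, 10, 50]
print(outliers(readings))  # [50]
```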

6. Sequential Patterns:

Sequential pattern mining is a data mining technique specialized for evaluating sequential data to discover sequential patterns. It comprises finding interesting subsequences in a set of sequences, where the interestingness of a sequence can be measured in terms of different criteria like length, occurrence frequency, etc.

In other words, this data mining technique helps to discover or recognize similar patterns in transaction data over a period of time.
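The core counting question in sequential pattern mining, how often a candidate subsequence occurs in order (gaps allowed) across a set of sequences, can be sketched in plain Python on toy purchase data:

```python
def is_subsequence(pattern, sequence):
    """True if pattern's items appear in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)  # `in` advances the iterator

def occurrence_frequency(pattern, sequences):
    """Number of sequences that contain the pattern as a subsequence."""
    return sum(1 for s in sequences if is_subsequence(pattern, s))

# Hypothetical purchase histories, one list per customer, in time order.
purchases = [
    ["laptop", "mouse", "bag"],
    ["laptop", "bag"],
    ["mouse", "laptop", "bag"],
    ["bag", "laptop"],   # wrong order: bag before laptop
]
print(occurrence_frequency(["laptop", "bag"], purchases))  # 3
```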

7. Prediction:

Prediction uses a combination of other data mining techniques such as trends, clustering, classification, etc. It analyzes past events or instances in the right sequence to predict a future event.
