Unstructured Data Classification

The document discusses unstructured data classification and natural language processing techniques. It provides examples of classification algorithms like decision trees and random forests. It also discusses preprocessing steps like stopword removal, bag-of-words, and techniques like TF-IDF for feature extraction from text data. Multiple choice questions are provided about classification, preprocessing, algorithms and their applications to sentiment analysis and spam detection problems.

Uploaded by

Yees BoojPai
Copyright
© All Rights Reserved

Join our channel if you haven’t joined yet https://t.me/fresco_milestone ( @fresco_milestone )

Unstructured Data Classification

Identify the unstructured data from the following.

Answer : image

What kind of classification is our case study 'Spam Detection'?

Answer : Binary

Which pre-processing technique is used to remove the most commonly used words?

Answer : Stopword removal
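
As a rough illustration of stopword removal, the sketch below filters tokens against a small stopword set. The list `STOPWORDS` and the helper `remove_stopwords` are made up for this example; real stopword lists (for instance the one shipped with NLTK) are much longer.

```python
# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "it", "is", "a", "an", "and", "of", "to"}

def remove_stopwords(tokens):
    """Return the tokens that are not stopwords (case-insensitive match)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = "The movie is a masterpiece and it deserves the award".split()
filtered = remove_stopwords(tokens)
print(filtered)  # ['movie', 'masterpiece', 'deserves', 'award']
```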

The cross-validation technique is used to evaluate a classifier by dividing the data set into a training
set to train the classifier and a testing set to test the same.

Answer : True
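
The train/test idea behind cross-validation can be sketched in plain Python. The helper `k_fold_indices` is a hypothetical name used only here; it splits sample indices into k folds so that every sample is used for testing exactly once. It does not shuffle, which a real setup usually would.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```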

True Positive is when the predicted instance and the actual instance are not negative.

Answer : True

True Negative is when the predicted instance and the actual instance are positive.

Answer : False
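
To make the True Positive and True Negative definitions concrete, here is a small sketch that counts the four outcomes for a binary problem. The function name `confusion_counts` and the label lists are invented for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP and FN for a binary classification result."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1      # predicted positive, actually positive
        elif p != positive and a != positive:
            tn += 1      # predicted negative, actually negative
        elif p == positive:
            fp += 1      # predicted positive, actually negative
        else:
            fn += 1      # predicted negative, actually positive
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

actual    = [1, 0, 1, 1, 0, 0, 1, 0]   # made-up ground truth
predicted = [1, 0, 0, 1, 0, 1, 1, 0]   # made-up predictions
counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```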

An algorithm that counts how many times a word appears in a document is __________

Answer : Bag-of-Words (BOW)
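
A minimal Bag-of-Words sketch, assuming a simple lowercase regex tokenizer (the helper `bag_of_words` is a name chosen for this example): each document is reduced to a mapping from word to number of occurrences, and word order is discarded.

```python
import re
from collections import Counter

def bag_of_words(document):
    """Map each lowercase word in the document to its occurrence count."""
    tokens = re.findall(r"[a-z0-9']+", document.lower())
    return Counter(tokens)

bow = bag_of_words("Free prize! Claim your free prize now.")
print(bow["free"], bow["prize"], bow["claim"])  # 2 2 1
```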

Pruning is a technique associated with __________

Answer : Decision tree

Select the correct statement about Nonlinear classification.

Answer : Kernel tricks are used by nonlinear classifiers to achieve maximum-margin hyperplanes (Incorrect)

Stemming and lemmatization give the same result.

Answer : False
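
The difference can be sketched with two toy functions. Both are hypothetical stand-ins, not real NLP algorithms: `toy_stem` only chops a few suffixes, while `toy_lemmatize` looks words up in a tiny hand-written dictionary, which plays the role of the lexical knowledge base a real lemmatizer uses.

```python
# Hypothetical suffix list and lemma dictionary (assumptions for this sketch).
SUFFIXES = ("ies", "ing", "ed", "es", "s")
LEMMA_LOOKUP = {"studies": "study", "better": "good", "mice": "mouse"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Return the dictionary base form, or the word itself if unknown."""
    return LEMMA_LOOKUP.get(word, word)

for word in ["studies", "better", "mice"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
# studies -> stud | study
# better -> better | good
# mice -> mice | mouse
```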

Question Type: Single-Select

a) Download the dataset from https://hrcdn.net/s3_pub/istreet-assets/H4_TQkbOj39HUNoBukluIQ/training.txt and load it to the variable 'sentiment_analysis_data'.

b) Give the column names as 'label' and 'message'.

c) Try out the code snippets and answer the questions.


What is the output of the following command: print(sentiment_analysis_data['label'].unique())

Answer : [1 0]

The most widely used package for machine learning in Python is _________

Answer : sklearn

In Supervised learning, class labels of the training samples are ____________

Answer : Known

Select the pre-processing technique(s) from the following.

Answer : All the options

Model Tuning helps to increase accuracy.

Answer : Cannot say ("True" was marked Incorrect)

Question Type: Single-Select

a) Download the dataset from https://hrcdn.net/s3_pub/istreet-assets/H4_TQkbOj39HUNoBukluIQ/training.txt and load it to the variable 'sentiment_analysis_data'.

b) Give the column names as 'label' and 'message'.

c) Try out the code snippets and answer the questions.

What command should be given to tokenize a sentence into words?

Answer : from nltk.tokenize import word_tokenize; Word_tokens = word_tokenize(sentence)
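
NLTK's `word_tokenize` needs its tokenizer data to be downloaded before it runs. As a rough standard-library approximation (not NLTK's actual algorithm), the sketch below splits a sentence into word tokens and single punctuation tokens. The name `simple_word_tokenize` is made up for this example, and its handling of contractions such as "It's" differs from NLTK's.

```python
import re

def simple_word_tokenize(sentence):
    """Split into runs of word characters and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

word_tokens = simple_word_tokenize("Spam detection is fun.")
print(word_tokens)  # ['Spam', 'detection', 'is', 'fun', '.']
```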

Identify the stop word(s) from the following.

Answer : Both "the" and "it"

The following are performance evaluation measures, except __________

Answer : Decision Tree

Images and documents are examples of ___________

Answer : Unstructured data

Choose the correct sequence for classifier building from the following.

Answer : Initialize -> Train -> Predict -> Evaluate
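
The four steps can be sketched with a toy classifier written in plain Python. `NearestCentroidClassifier` below is a hypothetical class made up for this example; it copies the `fit(X, y)` / `predict(X)` method names that scikit-learn estimators use, but it is not a scikit-learn class. The data is invented.

```python
class NearestCentroidClassifier:
    """Toy classifier: predicts the class whose mean feature vector is closest."""

    def __init__(self):
        self.centroids_ = {}

    def fit(self, X, y):
        """Train: compute one mean vector (centroid) per class label."""
        grouped = {}
        for features, label in zip(X, y):
            grouped.setdefault(label, []).append(features)
        self.centroids_ = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in grouped.items()
        }
        return self

    def predict(self, X):
        """Predict: return the label of the nearest centroid for each row."""
        def sq_dist(a, b):
            return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        return [
            min(self.centroids_, key=lambda lab: sq_dist(x, self.centroids_[lab]))
            for x in X
        ]

# Made-up two-feature data with labels 0 and 1.
X_train = [[0.0, 0.2], [0.1, 0.0], [0.9, 1.0], [1.0, 0.8]]
y_train = [0, 0, 1, 1]
X_test = [[0.05, 0.1], [0.95, 0.9], [0.2, 0.1]]
y_test = [0, 1, 0]

clf = NearestCentroidClassifier()          # 1. Initialize
clf.fit(X_train, y_train)                  # 2. Train
y_pred = clf.predict(X_test)               # 3. Predict
accuracy = sum(p == t for p, t in zip(y_pred, y_test)) / len(y_test)  # 4. Evaluate
print(y_pred, accuracy)  # [0, 1, 0] 1.0
```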

Which of the given hyperparameters, when increased, may cause the random forest to overfit the
data?

Answer : Depth of Tree


The fit (X, y) is used to __________

Answer : Train the classifier

Question Type: Single-Select

a) Download the dataset from https://hrcdn.net/s3_pub/istreet-assets/H4_TQkbOj39HUNoBukluIQ/training.txt and load it to the variable 'sentiment_analysis_data'.

b) Give the column names as 'label' and 'message'.

c) Try out the code snippets and answer the questions.

What does the command sentiment_analysis_data['label'].value_counts() return?

Answer : The count of each unique value in the 'label' column
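
In pandas, `value_counts()` returns one count per distinct value, sorted from most to least frequent. A standard-library approximation of the same idea uses `collections.Counter`. The labels below are made up and are not the real dataset.

```python
from collections import Counter

labels = [1, 0, 1, 1, 0, 1]          # made-up 'label' column
label_counts = Counter(labels)

# most_common() lists (value, count) pairs, most frequent first.
print(label_counts.most_common())    # [(1, 4), (0, 2)]
```

Comparing the counts per class is also a quick way to check for class imbalance.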

What is the purpose of lemmatization?

Answer : To convert words into a proper base form

Clustering is supervised classification.

Answer : False

Supervised learning differs from unsupervised learning as supervised learning requires __________

Answer : Labeled data

Set2:

To view the first 3 rows of the dataset, which of the following commands is used?

Answer : sentiment_analysis_data.head(3)

Inverse Document frequency is used in the term-document matrix.

Answer : True

Can we consider sentiment classification as a text classification problem?

Answer : Yes

In document classification, each document has to be converted from full text to a document vector.

Answer : True

A technique used to depict the performance in a tabular form that has 2 dimensions namely actual
and predicted sets of data is ___________

Answer : Confusion Matrix


Which NLP technique uses a lexical knowledge base to obtain the correct base form of the words?

Answer : Lemmatization

Which numerical statistic is used to identify the importance of a rare word in a document?

Answer : TF-IDF
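
A minimal TF-IDF sketch, assuming the plain textbook form tf × log(N / df). Libraries often use smoothed variants, so their numbers will differ. The helper `tf_idf` and the three documents are invented for this example. It shows that a word found in only one document ("prize") scores higher than one found in several ("free").

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: tf-idf score} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

docs = ["free prize inside", "meeting at noon", "free meeting room"]
scores = tf_idf(docs)
print(scores[0]["prize"] > scores[0]["free"])  # True
```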

Which type of cross-validation is used for an imbalanced dataset?

Answer : K-Fold
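
Plain K-Fold does not look at the labels, so on imbalanced data a fold can end up with few or no minority samples. Stratified splitting, which scikit-learn provides as `StratifiedKFold`, keeps the class ratio roughly equal in every fold. The sketch below is a simplified round-robin version of that idea, not scikit-learn's implementation. The name `stratified_folds` and the labels are made up.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign sample indices to k folds, class by class, in round-robin order."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

labels = [0] * 8 + [1] * 4           # made-up imbalanced labels (2:1)
folds = stratified_folds(labels, 4)
for fold in folds:
    print(fold, [labels[i] for i in fold])   # each fold: two 0s and one 1
```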

Cross-validation causes over-fitting.

Answer : False

a) Download the dataset from https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it to the variable 'sentiment_analysis_data'.

b) Give the column names as 'label' and 'message'.

c) Try out the code snippets and answer the questions.

Is there a class imbalance problem in the given data set?

Answer : Yes

SVM is a _____________

Answer : Supervised learning algorithm

In a Term Document Matrix (TDM), each row represents ____________

Answer : TF-IDF value

Imagine you have just finished training a decision tree for spam classification, and it is showing
abnormally bad performance on both your training and test sets. Assume that your implementation
has no bugs. What could be the reason for this problem?

Answer : All the options

In a Document Term Matrix (DTM), each row represents

Answer : TF-IDF value
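
As a sketch of the usual convention: a Document Term Matrix has one row per document and one column per term, and each cell holds a weight such as a raw count or a TF-IDF value. A Term Document Matrix is its transpose. The helper `document_term_matrix` and the two documents are made up, and raw counts are used as the cell values.

```python
def document_term_matrix(documents):
    """Return (vocabulary, matrix) with one row per document, one column per term."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    matrix = [[tokens.count(term) for term in vocabulary] for tokens in tokenized]
    return vocabulary, matrix

docs = ["spam spam offer", "lunch offer"]
vocabulary, dtm = document_term_matrix(docs)
tdm = [list(col) for col in zip(*dtm)]   # transpose: one row per term

print(vocabulary)  # ['lunch', 'offer', 'spam']
print(dtm)         # [[0, 1, 2], [1, 1, 0]]
print(tdm)         # [[0, 1], [1, 1], [2, 0]]
```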

Email spam data is an example of __________

Answer : Unstructured data

Choose the correct sequence from the following.

Answer : Data Analysis -> Pre-Processing -> Model Building -> Predict

High classification accuracy always indicates a good classifier.


Answer : False
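
A small worked example with made-up numbers shows why: on an imbalanced set with 95 negatives and 5 positives, a classifier that always predicts the negative class reaches 95% accuracy while finding none of the positives.

```python
actual = [0] * 95 + [1] * 5      # made-up imbalanced ground truth
predicted = [0] * 100            # classifier that always predicts class 0

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
recall = tp / (tp + fn)          # share of actual positives that were found

print(accuracy, recall)  # 0.95 0.0
```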

_______ directly achieves multi-class classification (without the support of binary classifiers).

Answer : K Nearest Neighbor
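
A minimal K Nearest Neighbor sketch with three classes shows why no binary sub-classifiers are needed: the prediction is a majority vote among the k closest training points, whatever the number of classes. The function `knn_predict`, the points and the class names are all made up for this example.

```python
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Return the majority label among the k training points closest to x."""
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(X_train, y_train)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up 2-D points in three well-separated groups.
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [10, 0], [10, 1], [9, 0]]
y_train = ["ham", "ham", "ham", "spam", "spam", "spam", "promo", "promo", "promo"]

predictions = [knn_predict(X_train, y_train, x) for x in ([0.5, 0.5], [5.5, 5.5], [9.5, 0.5])]
print(predictions)  # ['ham', 'spam', 'promo']
```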

A classifier that can compute using numeric as well as categorical values is __________

Answer : Random Forest Classifier

Lemmatization offers better precision than stemming.

Answer : True

The following are pre-processing methods used for unstructured data classification, except
_________

Answer : Confusion_matrix

TF-IDF is a feature extraction technique.

Answer : True

The higher value of which of the following hyperparameters is better for the decision tree
algorithm?

Answer : Cannot say

a) Download the dataset from https://hrcdn.net/s3_pub/istreet-assets/H4_TQkbOj39HUNoBukluIQ/training.txt and load it to the variable 'sentiment_analysis_data'.

b) Give the column names as 'label' and 'message'.

c) Try out the code snippets and answer the questions.

What kind of classification is the given case study (Sentiment Analysis dataset)?

Answer : Binary classification

a) Download the dataset from https://hrcdn.net/s3_pub/istreet-assets/H4_TQkbOj39HUNoBukluIQ/training.txt and load it to the variable 'sentiment_analysis_data'.

b) Give the column names as 'label' and 'message'.

c) Try out the code snippets and answer the questions.

Which of the following commands is used to view the dataset SIZE, and what is the value returned?

Answer : sentiment_analysis_data.shape, (6918, 2)
