BERT - Assignment - Jupyter Notebook

In this notebook, you will do Amazon review classification with BERT.

It contains 5 parts, as below. Detailed instructions are given in each cell; please read every comment we have written.
1. Preprocessing
2. Creating a BERT model from TensorFlow Hub
3. Tokenization
4. Getting the pretrained embedding vector for a given review from BERT
5. Using the embedding data, applying a NN to classify the reviews

Instructions:

1. Don't change or manipulate any grader functions. If you manipulate any, it will be considered plagiarism.

2. Please read the instructions in the code cells and markdown cells. We explain what to write.

3. Please return outputs in the same format we ask for. E.g., don't return a list if we are asking for a NumPy array.

4. Please read the external links we have given so that you learn the concepts behind the code you are writing.

5. We give instructions in each section where necessary; please follow them.

Every grader function has to return True.

In [ ]: # all imports
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras.models import Model

In [ ]: tf.test.gpu_device_name()

Grader function 1

In [ ]: def grader_tf_version():
    assert(tf.__version__ > '2')
    return True
grader_tf_version()
Part-1: Preprocessing

In [ ]: # Read the dataset - Amazon fine food reviews
reviews = pd.read_csv(r"D:\ML\Internal DL\NLP\amazon-fine-food-reviews\Reviews.csv")
# check the info of the dataset
reviews.info()

In [ ]: # get only 2 columns - Text, Score
# drop the NaN values

In [ ]: # if score > 3, set score = 1
# if score <= 2, set score = 0
# if score == 3, remove the rows
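A minimal sketch of these preprocessing steps (the exact implementation is up to you; the column names follow the dataset above):

    # keep only the Text and Score columns and drop rows with NaN values
    reviews = reviews[['Text', 'Score']].dropna()

    # remove neutral reviews (score == 3), then binarize:
    # score > 3 -> 1 (positive), score <= 2 -> 0 (negative)
    reviews = reviews[reviews.Score != 3]
    reviews['Score'] = (reviews['Score'] > 3).astype(int)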

Grader function 2

In [ ]: def grader_reviews():
    temp_shape = (reviews.shape == (525814, 2)) and (reviews.Score.value_counts()[1] == 443777)
    assert(temp_shape == True)
    return True
grader_reviews()

In [ ]: def get_wordlen(x):
    return len(x.split())
reviews['len'] = reviews.Text.apply(get_wordlen)
reviews = reviews[reviews.len < 50]
reviews = reviews.sample(n=100000, random_state=30)

In [ ]: # remove HTML from the Text column and save in the Text column only
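One possible way to strip the HTML, as a sketch assuming BeautifulSoup is installed (a regex such as re.sub('<.*?>', ' ', text) would also work):

    from bs4 import BeautifulSoup
    # parse each review and keep only the visible text
    reviews['Text'] = reviews['Text'].apply(
        lambda t: BeautifulSoup(t, 'html.parser').get_text())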

In [ ]: # print head 5

In [ ]: # split the data into train and validation data (20%) with stratified sampling

In [ ]: # plot bar graphs of y_train and y_test
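A sketch of the stratified split and the class-balance plots, assuming scikit-learn and matplotlib are available (the random_state value is our choice, not prescribed):

    from sklearn.model_selection import train_test_split
    import matplotlib.pyplot as plt

    # 80/20 split, preserving the class proportions via stratify
    X_train, X_test, y_train, y_test = train_test_split(
        reviews['Text'].values, reviews['Score'].values,
        test_size=0.2, stratify=reviews['Score'].values, random_state=33)

    # bar graphs of the class counts in train and test
    fig, axes = plt.subplots(1, 2)
    axes[0].bar(*np.unique(y_train, return_counts=True))
    axes[0].set_title('y_train')
    axes[1].bar(*np.unique(y_test, return_counts=True))
    axes[1].set_title('y_test')
    plt.show()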

In [ ]: # saving to disk. If we need, we can load the preprocessed data directly.
reviews.to_csv('preprocessed.csv', index=False)

Part-2: Creating BERT Model


If you want to know more about BERT, you can watch the live sessions on Transformers and BERT. We strongly recommend you read Transformers (https://jalammar.github.io/illustrated-transformer/), the BERT paper (https://arxiv.org/abs/1810.04805), and this blog (https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/).

For this assignment, we are using the BERT uncased Base model (https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1). It uses L=12 hidden layers (i.e., Transformer blocks), a hidden size of H=768, and A=12 attention heads.

In [ ]: ## Loading the pretrained model from TensorFlow Hub
tf.keras.backend.clear_session()

# maximum length of a sequence in the data we have; for now we make it 55
max_seq_length = 55

# BERT takes 3 inputs

# input words: the sequence of words represented as integers
input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_word_ids")

# mask vector if you are padding anything
input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_mask")

# segment vectors. If you are giving only one sentence for the classification, the segment vector is all 0's.
# If you are giving two sentences separated by the [SEP] token, the first sequence's segment
# vector is 0's and the second sequence's segment vector is 1's.
segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="segment_ids")

# BERT layer
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1")
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

# BERT model
# We are using only the pooled output, not the sequence output.
# If you want to know about those, please read https://www.kaggle.com/questio
bert_model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=pooled_output)

In [ ]: bert_model.summary()

In [ ]: bert_model.output

Part-3: Tokenization

In [ ]: # getting the vocab file
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()

In [ ]: # import tokenization - we have given the tokenization.py file


In [ ]: # Create the tokenizer: instantiate FullTokenizer
# the name must be "tokenizer"
# FullTokenizer takes two parameters: 1. vocab_file and 2. do_lower_case
# we created these in the cell above, e.g. FullTokenizer(vocab_file, do_lower_case)
# please check the "tokenization.py" file for the complete implementation
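A minimal sketch, using the vocab_file and do_lower_case created above and the provided tokenization.py:

    import tokenization  # the tokenization.py file given with the assignment
    tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)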

Grader function 3

In [ ]: # it has to give no error
def grader_tokenize(tokenizer):
    out = False
    try:
        out = ('[CLS]' in tokenizer.vocab) and ('[SEP]' in tokenizer.vocab)
    except:
        out = False
    assert(out == True)
    return out
grader_tokenize(tokenizer)

In [ ]: # Create train and test tokens (X_train_tokens, X_test_tokens) from (X_train, X_test);
# a sketch is given after this cell.

# add '[CLS]' at the start of the tokens and '[SEP]' at the end of the tokens.

# the maximum number of tokens is 55 (we already gave this to the BERT layer above)

# if a review has fewer than 55 tokens, add the '[PAD]' token; else truncate the tokens to that length.

# based on the padding, create the masks for train and test (1 for a real token, 0 for padding);
# they will have the same shape as the input tokens (None, 55). Save those in X_train_mask and X_test_mask.

# create a segment input for train and test. We are using only one sentence, so all segment values are 0.

# the type of all the above arrays should be numpy arrays

# after execution of this cell, you have to get
# X_train_tokens, X_train_mask, X_train_segment
# X_test_tokens, X_test_mask, X_test_segment
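A minimal sketch of one way to build the three arrays (the helper name get_inputs is ours; 0 is the id of '[PAD]' in the BERT vocab):

    def get_inputs(texts, tokenizer, max_len=max_seq_length):
        all_tokens, all_masks, all_segments = [], [], []
        for text in texts:
            toks = tokenizer.tokenize(text)[:max_len - 2]    # leave room for [CLS] and [SEP]
            toks = ['[CLS]'] + toks + ['[SEP]']
            ids = tokenizer.convert_tokens_to_ids(toks)
            pad_len = max_len - len(ids)
            all_tokens.append(ids + [0] * pad_len)           # pad with the [PAD] id (0)
            all_masks.append([1] * len(ids) + [0] * pad_len) # 1 for real tokens, 0 for padding
            all_segments.append([0] * max_len)               # single sentence -> all segments are 0
        return np.array(all_tokens), np.array(all_masks), np.array(all_segments)

    X_train_tokens, X_train_mask, X_train_segment = get_inputs(X_train, tokenizer)
    X_test_tokens, X_test_mask, X_test_segment = get_inputs(X_test, tokenizer)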

Example
In [ ]: import pickle

In [ ]: ## save all your results to disk so that there is no need to run everything again
# (the file names here are placeholders; use any names you like)
pickle.dump((X_train, X_train_tokens, X_train_mask, X_train_segment, y_train), open('train_data.pkl', 'wb'))
pickle.dump((X_test, X_test_tokens, X_test_mask, X_test_segment, y_test), open('test_data.pkl', 'wb'))

In [ ]: # you can load from disk
# X_train, X_train_tokens, X_train_mask, X_train_segment, y_train = pickle.load(open('train_data.pkl', 'rb'))
# X_test, X_test_tokens, X_test_mask, X_test_segment, y_test = pickle.load(open('test_data.pkl', 'rb'))

Grader function 4
In [ ]: def grader_alltokens_train():
    out = False
    if type(X_train_tokens) == np.ndarray:
        temp_shapes = (X_train_tokens.shape[1] == max_seq_length) and \
                      (X_train_mask.shape[1] == max_seq_length) and \
                      (X_train_segment.shape[1] == max_seq_length)
        segment_temp = not np.any(X_train_segment)
        mask_temp = np.sum(X_train_mask == 0) == np.sum(X_train_tokens == 0)
        no_cls = np.sum(X_train_tokens == tokenizer.vocab['[CLS]']) == X_train_tokens.shape[0]
        no_sep = np.sum(X_train_tokens == tokenizer.vocab['[SEP]']) == X_train_tokens.shape[0]
        out = temp_shapes and segment_temp and mask_temp and no_cls and no_sep
    else:
        print('Type of all the above token arrays should be numpy array, not list.')
        out = False
    assert(out == True)
    return out

grader_alltokens_train()

Grader function 5

In [ ]: def grader_alltokens_test():
    out = False
    if type(X_test_tokens) == np.ndarray:
        temp_shapes = (X_test_tokens.shape[1] == max_seq_length) and \
                      (X_test_mask.shape[1] == max_seq_length) and \
                      (X_test_segment.shape[1] == max_seq_length)
        segment_temp = not np.any(X_test_segment)
        mask_temp = np.sum(X_test_mask == 0) == np.sum(X_test_tokens == 0)
        no_cls = np.sum(X_test_tokens == tokenizer.vocab['[CLS]']) == X_test_tokens.shape[0]
        no_sep = np.sum(X_test_tokens == tokenizer.vocab['[SEP]']) == X_test_tokens.shape[0]
        out = temp_shapes and segment_temp and mask_temp and no_cls and no_sep
    else:
        print('Type of all the above token arrays should be numpy array, not list.')
        out = False
    assert(out == True)
    return out
grader_alltokens_test()

Part-4: Getting Embeddings from the BERT Model

We already created the BERT model in part-2 and the input data in part-3. We will utilize those two and get the embeddings for each sentence in the train and validation data.

In [ ]: bert_model.input

In [ ]: bert_model.output

In [ ]: # get the train output. The BERT model will give one output, so save it in
# X_train_pooled_output

In [ ]: # get the test output. The BERT model will give one output, so save it in
# X_test_pooled_output
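A sketch of getting the pooled outputs with the bert_model from part-2 (the batch_size is our choice, to keep memory usage manageable):

    # run the frozen BERT model over the token/mask/segment arrays
    X_train_pooled_output = bert_model.predict(
        [X_train_tokens, X_train_mask, X_train_segment], batch_size=128)
    X_test_pooled_output = bert_model.predict(
        [X_test_tokens, X_test_mask, X_test_segment], batch_size=128)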

In [ ]: ## save all your results to disk so that there is no need to run everything again
pickle.dump((X_train_pooled_output, X_test_pooled_output), open('final_output.pkl', 'wb'))

In [ ]: # X_train_pooled_output, X_test_pooled_output = pickle.load(open('final_output.pkl', 'rb'))

Grader function 6

In [ ]: # now we have X_train_pooled_output, y_train
# and X_test_pooled_output, y_test

# please use this grader to evaluate
def greader_output():
    assert(X_train_pooled_output.shape[1] == 768)
    assert(len(y_train) == len(X_train_pooled_output))
    assert(X_test_pooled_output.shape[1] == 768)
    assert(len(y_test) == len(X_test_pooled_output))
    assert(len(y_train.shape) == 1)
    assert(len(X_train_pooled_output.shape) == 2)
    assert(len(y_test.shape) == 1)
    assert(len(X_test_pooled_output.shape) == 2)
    return True
greader_output()

Part-5: Training a NN with 768 features

Create a NN and train it.
1. You have to use AUC as the metric.
2. You can use any architecture you want.
3. You have to use TensorBoard to log all your metrics and losses. You have to send those logs.
4. Print the loss and metric at every epoch.
5. You have to submit without overfitting or underfitting.

In [ ]: ## imports
from tensorflow.keras.layers import Input, Dense, Activation, Dropout
from tensorflow.keras.models import Model

In [ ]: ## create a NN and train it
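A minimal sketch of one architecture that satisfies the requirements above (the layer sizes, dropout rate, log directory, epochs, and batch size are our assumptions, not a prescribed solution):

    import datetime

    # small feed-forward head on top of the 768-d pooled BERT features
    inp = Input(shape=(768,))
    x = Dense(128, activation='relu')(inp)
    x = Dropout(0.4)(x)
    out = Dense(1, activation='sigmoid')(x)  # binary classification
    model = Model(inputs=inp, outputs=out)

    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=[tf.keras.metrics.AUC(name='auc')])

    # TensorBoard callback logs the per-epoch losses and AUC
    log_dir = 'logs/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
    tb = tf.keras.callbacks.TensorBoard(log_dir=log_dir)

    model.fit(X_train_pooled_output, y_train,
              validation_data=(X_test_pooled_output, y_test),
              epochs=10, batch_size=256, callbacks=[tb], verbose=1)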

In [ ]:
