Lecture Notes 6
RNN
General feedforward neural networks and CNNs are not well suited to
time-series data because these networks have no memory of their own.
Recurrent neural networks bring the unique ability to remember
important information over a period of time during training. This makes
them well suited for tasks such as natural language translation, speech
recognition, and image captioning. These networks have states defined
over a timeline and feed the output of the previous state into the
current input, as shown in Figure 1-32.
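The recurrence described above can be sketched in a few lines of NumPy. The weights and dimensions here are illustrative random values, not anything trained in this chapter; the point is only that the hidden state h carries information from earlier time steps into the current one.

```python
import numpy as np

# Illustrative RNN cell: the hidden state h is updated at every time
# step, so the final state depends on the whole sequence.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W_xh = rng.normal(size=(input_dim, hidden_dim)) * 0.1   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1  # hidden-to-hidden weights
b_h = np.zeros(hidden_dim)

def rnn_forward(inputs):
    """Run the recurrence h_t = tanh(x_t W_xh + h_{t-1} W_hh + b)."""
    h = np.zeros(hidden_dim)
    for x_t in inputs:          # one pass per time step
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
    return h                    # final state summarizes the sequence

sequence = rng.normal(size=(5, input_dim))  # a toy sequence of 5 time steps
final_state = rnn_forward(sequence)
print(final_state.shape)  # (4,)
```

Because the previous state feeds into each update, feeding the same steps in a different order generally produces a different final state, which is exactly the memory that feedforward networks lack.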
Chapter 1 Introduction to Machine Learning
Now we are going to use a small dataset and build a deep learning
model to predict the sentiment of a user review. We will make use of
TensorFlow and Keras to build this model. There are a couple of steps
that we need to complete before we train this model in Databricks. We
first need to go to the cluster and click Libraries. On the Libraries tab,
we select the PyPI option and enter Keras to get it installed. Similarly,
we need to enter TensorFlow once Keras is installed.
Once we upload the reviews dataset, we can create a pandas dataframe
like we did in the earlier case.
[In]: df=sparkDF.toPandas()
[In]: df.columns
[Out]: Index(['Sentiment', 'Summary'], dtype='object')
[In]: df.head(10)
[Out]:
[In]: df.Sentiment.value_counts()
[Out]:
1 1000
0 1000
We can confirm the class balance by taking the value counts of the
target column; the data appears well balanced. Before we go ahead with
building the model, since we are dealing with text data, we need to clean it
a little to ensure no unwanted errors are thrown at training time.
Hence, we write a small helper function using regular expressions.
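The same balance check can also be expressed as a proportion with `value_counts(normalize=True)`. The dataframe below is a hypothetical miniature of the reviews data, just to show the call:

```python
import pandas as pd

# A made-up, perfectly balanced stand-in for the reviews dataframe.
df = pd.DataFrame({"Sentiment": [1, 0, 1, 0],
                   "Summary": ["good", "bad", "fine", "poor"]})

# normalize=True returns class proportions instead of raw counts.
balance = df.Sentiment.value_counts(normalize=True)
print(balance.to_dict())  # {1: 0.5, 0: 0.5}
```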
[In]:
import re
def clean_reviews(text):
    text=re.sub("[^a-zA-Z]"," ",str(text))
    return re.sub(r"^\d+\s|\s\d+\s|\s\d+$"," ",text)
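To see what the helper does, we can try it on a sample string. The review text below is made up, and the function is restated so the snippet runs on its own; note the first substitution already replaces every non-letter (including digits) with a space, so the second pattern is a safety net for stray numbers:

```python
import re

def clean_reviews(text):
    # replace anything that is not a letter with a space
    text = re.sub("[^a-zA-Z]", " ", str(text))
    # drop any remaining standalone digit runs
    return re.sub(r"^\d+\s|\s\d+\s|\s\d+$", " ", text)

cleaned = clean_reviews("Great product!!! 10/10 would buy again :)")
print(cleaned)  # punctuation and digits are gone, words survive
```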
[In]: df['Summary']=df.Summary.apply(clean_reviews)
[In]: df.head(10)
[Out]:
The next step is to separate the input and output data. Since the dataset is
small, we are not going to split it into train and test sets; instead, we
will train the model on all the data.
[In]: X=df.Summary
[In]: y=df.Sentiment
We now create the tokenizer object with a vocabulary of 10,000 words
and specify an out-of-vocabulary (OOV) token to stand in for any unseen
words the model is exposed to that were not part of the training data.
[In]: from keras.preprocessing.text import Tokenizer
[In]: tokenizer=Tokenizer(num_words=10000,oov_token='xxxxxxx')
[In]: tokenizer.fit_on_texts(X)
[In]: X_dict=tokenizer.word_index
[In]: len(X_dict)
[Out]: 2018
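Under the hood, `fit_on_texts` builds a frequency-ranked word index, with the OOV token reserved so that unseen words can still be encoded. The plain-Python sketch below mimics that mechanism; it is not the Keras implementation, and the function names here are made up for illustration:

```python
from collections import Counter

def build_word_index(texts, oov_token="xxxxxxx"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(w for t in texts for w in t.lower().split())
    index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index

def encode(texts, index):
    # any word missing from the index falls back to the OOV index (1)
    return [[index.get(w, 1) for w in t.lower().split()] for t in texts]

idx = build_word_index(["good phone", "good battery"])
print(encode(["good screen"], idx))  # [[2, 1]] -- "screen" maps to OOV
```

This is why the OOV token matters: without it, a review containing a word outside the 2,018-word index would have no valid encoding at prediction time.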