Sequence-to-sequence Models
CIS 530, Computational Linguistics: Spring 2018
John Hewitt & Reno Kriz
University of Pennsylvania
Some concepts drawn a bit transparently from Graham Neubig’s excellent
Neural Machine Translation and Sequence-to-sequence Models: A Tutorial
[Link]
We’ve already seen RNNs for language modeling
- The memory vector, or “state”.
- The “word vector” representation of the word.
- The RNN function, which combines the word vector and the previous state to
  create a new state.
[Figure: an RNN reading the example sentence “Only use neural nets” one word at a time.]
How does the RNN function work?
The RNN function takes the current RNN state and a word vector and produces a
subsequent RNN state that “encodes” the sentence so far.
The learned weights represent how to combine past information (the RNN memory)
and current information (the new word vector).
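To make the recurrence concrete, here is a minimal sketch of a single RNN step in plain numpy. The weight names (W_hx, W_hh, b_h) mirror the formalization later in these slides; the sizes are toy values chosen only for illustration.

```python
import numpy as np

def rnn_step(W_hx, W_hh, b_h, x_t, h_prev):
    """One RNN step: combine the word vector x_t with the previous state h_prev."""
    return np.tanh(W_hx @ x_t + W_hh @ h_prev + b_h)

# Toy sizes: 4-dimensional word vectors, 3-dimensional hidden state.
rng = np.random.default_rng(0)
W_hx = rng.normal(size=(3, 4))   # integrates input vector information
W_hh = rng.normal(size=(3, 3))   # integrates information from the previous timestep
b_h  = np.zeros(3)               # bias term

h = np.zeros(3)                       # initial state
for x_t in rng.normal(size=(5, 4)):   # five stand-in "word vectors"
    h = rnn_step(W_hx, W_hh, b_h, x_t, h)
```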
How does the prediction function work?
We’ve seen how RNNs “encode” word sequences. But how do they produce
probability distributions over a vocabulary?
A probability distribution over the vocab is constructed from the RNN memory and
one last learned transformation. The softmax function turns “scores” into a
probability distribution.
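A minimal sketch of that prediction step, continuing the toy numpy setup above; W_dh and b_d are hypothetical output-layer parameters mapping the hidden state to one score per vocabulary word.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of scores into a probability distribution."""
    exps = np.exp(scores - scores.max())   # subtract max for numerical stability
    return exps / exps.sum()

vocab = ["only", "use", "neural", "nets", "</s>"]
rng = np.random.default_rng(1)
W_dh = rng.normal(size=(len(vocab), 3))   # hidden state (size 3) -> one score per word
b_d  = np.zeros(len(vocab))

h = rng.normal(size=3)                    # some RNN memory vector
probs = softmax(W_dh @ h + b_d)           # distribution over the vocabulary
print(dict(zip(vocab, probs.round(2))))
```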
Want to predict things other than the next word?
The model architecture (read: “design”) we’ve seen so far is frequently used in
tasks other than language modeling, because modeling sequential information is
useful in language, apparently.
[Figure: the RNN encoder represents the sentence “Only use neural nets”; from each
state we can predict, e.g., parts of speech (ADV VB ADJ NNS), or even syntax.]
General idea: build a representation
The method of building the representation is called an Encoder and is frequently
an RNN.
Each memory vector in the encoder attempts to represent the sentence so far, but
mostly represents the word most recently input.
General idea: generate the output one token at a time
The model that takes the encoded representation and generates the output is
called the Decoder, and, errrr, is also generally an RNN.
[Figure: the Encoder (seq) reads “Only use neural nets”; the Decoder (2seq)
generates the translation “Jiri naanị netwọk nụ” one token at a time.]
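A minimal sketch of that generation loop, in the same toy Python style as above. Here decoder_step and predict_probs are hypothetical stand-ins for the decoder RNN and the softmax prediction layer; greedy argmax decoding is shown, though real systems often search more broadly.

```python
import numpy as np

def greedy_decode(decoder_step, predict_probs, h_enc_final, vocab, max_len=20):
    """Generate output tokens one at a time, feeding each prediction back in.

    decoder_step(prev_token, s_prev) -> new decoder state
    predict_probs(s) -> probability distribution over vocab (a numpy array)
    """
    s = h_enc_final                # initialize the decoder state from the encoder
    output, token = [], "<s>"
    for _ in range(max_len):
        s = decoder_step(token, s)
        probs = predict_probs(s)
        token = vocab[int(np.argmax(probs))]   # greedy: pick the most probable token
        if token == "</s>":
            break
        output.append(token)
    return output
```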
How is it trained?
In practice, training for a single sentence is done by “forcing” the decoder to generate gold
sequences, and penalizing it for assigning the sequence a low probability. Losses for each
token in the sequence are summed. Then, the summed loss is used to take a step in the right
direction in all model parameters (including word embeddings!) (stochastic gradient descent.)
[Figure: the decoder assigns probability .7 to the gold first token “Jiri”,
incurring a loss of -log(.7).]
Sentence-level training
Almost all such networks are trained using cross-entropy loss. At each step, the network
produces a probability distribution over possible next tokens. This distribution is
penalized for being different from the true distribution (e.g., a probability of 1 on the
actual next token.)
[Figure: for the gold output “Jiri naanị netwọk nụ”, the per-token losses are -log(.7),
-log(.5), -log(.6), and -log(.4). Minimize this:
sum( -log(.7) -log(.5) -log(.6) -log(.4) ) = 1.07]
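For concreteness, here is the arithmetic from the figure as a tiny sketch. The log base only rescales the loss by a constant factor, so it does not change which parameters minimize it; natural logs are the usual choice in practice, while base-10 logs reproduce the figure’s total.

```python
import math

# Probabilities the decoder assigned to each gold token (from the figure).
gold_probs = [0.7, 0.5, 0.6, 0.4]      # Jiri, naanị, netwọk, nụ

# Cross-entropy loss for the sentence: sum of -log p(gold token).
loss_natural = sum(-math.log(p) for p in gold_probs)     # natural log, ≈ 2.48
loss_base10  = sum(-math.log10(p) for p in gold_probs)   # base-10 log, ≈ 1.07
print(loss_natural, loss_base10)
```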
How is it formalized?
Let ht be the RNN hidden state at timestep t.
Let xt be the input vector at timestep t.
The RNN equation posits 2 matrices and 1 vector as parameters:
- Whx integrates input vector information.
- Whh integrates information from the previous timestep.
- bh is a bias term. (What function does this perform?)
The RNN equation is: ht = tanh(Whxxt + Whhht−1 + bh)
How is it formalized?
(Glossing over this slide is totally reasonable. Also feel free to check your phone, ping
your Bitcoin investment, see if your The Boring Company® (Not a) Flamethrower has shipped.)
For prediction, we take the current hidden state, and use it as features in what is
more or less a linear regression.
Let dt be our decision (e.g., word, POS tag) at timestep t. Let D be the set of all
possible decisions. Let st-1 be the most recent decoder hidden state.
dt = argmaxd’ ∈ D p( d’ | x1:n, d1:t-1)
p( * | x1:n, d1:t-1) = softmaxD(WDhst-1 + bD)
Note that WDhst-1 + bD produces a vector of scores. The softmax function
normalizes scores to a probability distribution by exponentiating each dimension
and normalizing by the sum. For some choice k of K, p(k) = escore(k) / ∑k’ ∈ K escore(k’)
The information bottleneck and latent structure
Given the diagram below, what problem do you foresee when translating
progressively longer sentences?
[Figure: the encoder-decoder translating “Only use neural nets” into “Jiri naanị netwọk nụ”.]
The information bottleneck and latent structure
We are trying to encode variable-length structure (e.g., variable-length sentences)
in a fixed-length memory (e.g., only the 300 dimensions of your hidden state.)
The last encoder hidden state is the bottleneck -- all information in the
source sentence must pass through it to get to the decoder.
Finding a solution to this problem was the final advance that made neural
MT competitive with previous approaches.
The information bottleneck and latent structure
The key insight is related to the word alignment work we did last week. We allow
the decoder to look at any encoder state, and let it learn which are important at
each time step!
Learning to pay attention
Attention summarizes the encoder, focusing on specific parts/words.
Step 1: Take the decoder state, and compute an affinity αi with each encoder state.
The affinity function is a dot product, or something similar.
Step 2: Normalize the affinities α0 ... α4 to sum to 1 with the softmax function,
giving weights a0 ... a4. (Note that ∑i ai = 1.)
Step 3: Average the encoder states, weighted by the a distribution. This weighted
average is called the context vector. In this example, since “Jiri” means “use”,
the attention will focus on the vectors around “use”.
Step 4: Use the context vector at prediction, concatenating it to the decoder state.
The resulting vector has the current decoder information, but also a focused summary
of the encoder; the softmax of one last transformation of it gives the probability
distribution over the vocabulary.
[Figure: focus of the context vector over the encoder states while predicting “Jiri”.]
Attention Formalization
(Glossing over this slide is totally reasonable. Feel free to check your phone, ping your
Bitcoin investment, see if your The Boring Company® (Not a) Flamethrower has shipped.)
Attention computes the affinity between the decoder state and all encoder states.
There are many affinity computation methods, but they’re all like a dot product.
Let there be n encoder states. The affinity between encoder state i and the
decoder state is αi. The encoder states are h1:n, and the decoder state is st-1.
Let αi = f(hi, st-1) = hiTst-1
Let weights a = softmax(α).
Let the context c = ∑i=1:n hi ai. (Note that this is a weighted average.)
Attention Formalization
Attention is used as extra information in the final prediction.
Reminder: we let the context c = ∑i=1:n hi ai. (Weighted average of encoder states.)
Let the notation [s;c] mean the concatenation of vectors s and c.
dt = argmaxd’ ∈ D p( d’ | x1:n, d1:t-1)    (same as before, without attention)
p( * | x1:n, d1:t-1) = softmaxD(WD(2h)[st-1;c] + bD)
So, the only difference is that the final prediction uses the context vector
concatenated to the decoder state to make the prediction.
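A minimal numpy sketch of dot-product attention as formalized above; the encoder states, decoder state, and output-layer parameters are random stand-ins chosen only to show the shapes.

```python
import numpy as np

def softmax(v):
    exps = np.exp(v - v.max())
    return exps / exps.sum()

rng = np.random.default_rng(2)
H = rng.normal(size=(4, 3))      # n=4 encoder states h_1:n, each of size 3
s = rng.normal(size=3)           # most recent decoder state s_{t-1}

alpha = H @ s                    # affinities: alpha_i = h_i^T s_{t-1}
a = softmax(alpha)               # attention weights; they sum to 1
c = a @ H                        # context vector: weighted average of encoder states

# Prediction with attention: score the concatenation [s; c].
vocab_size = 5
W = rng.normal(size=(vocab_size, 6))   # maps [s; c] (size 2h = 6) to vocab scores
b = np.zeros(vocab_size)
probs = softmax(W @ np.concatenate([s, c]) + b)
```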
Empirical considerations
There are a lot of “hyperparameter” choices that can greatly affect the quality of
your model. In short, take parameters from papers/tutorials, and grid search
(try many combinations of parameters) around them; a minimal grid-search sketch
follows this list.
- RNN variants: LSTMs have a different (much better) recurrent equation.
- Hidden state size: larger means more memory, but requires more data.
- Embedding size: larger means more representational power, but requires more data.
- Learning rate: the step size you take in learning your parameters. Start this
  “large”, and cut it in half when training stops improving development-set
  performance.
- Regularization: “dropout” prevents overfitting by making each node in your
  hidden state unavailable for an observation with a given probability. Try
  some values around .2 to .3.
- Batch size: the number of observations to group together before performing a
  parameter update step. Larger batches mean less fine-grained training but many
  more observations per minute, especially on a GPU.
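The sketch below shows one way to grid search around values taken from a paper or tutorial; train_and_eval_dev is a hypothetical placeholder for whatever trains your model and returns development-set performance.

```python
import itertools

def train_and_eval_dev(hidden_size, embed_size, learning_rate, dropout, batch_size):
    """Hypothetical: train a model with these hyperparameters, return dev-set score."""
    return 0.0   # placeholder; replace with real training + dev evaluation

grid = {
    "hidden_size":   [256, 512],
    "embed_size":    [128, 300],
    "learning_rate": [1.0, 0.5],   # cut in half when dev performance plateaus
    "dropout":       [0.2, 0.3],
    "batch_size":    [32, 64],
}

best_score, best_config = float("-inf"), None
for values in itertools.product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    score = train_and_eval_dev(**config)
    if score > best_score:
        best_score, best_config = score, config
```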
Case study: text simplification
Text simplification is the process in which a text is transformed into an equivalent
text that can be more easily read by a broader audience (Saggion, 2017).
Simplification can be used as a preprocessing tool for improving performance of
many NLP end-tasks such as parsing, SRL, summarization, information retrieval, etc.
“There’s just one major hitch: the primary purpose of education is to develop
citizens with a wide variety of skills.”
“The purpose of education is to develop many skills.”
Case study: text simplification
Text simplification can be thought of in part as monolingual machine translation.
Problem: The most common rewrite operation is copying from the complex
sentence to the simple sentence.
- One solution: add in reinforcement learning (Zhang and Lapata, 2017) to
  encourage the model to use other rewrite operations, such as deletion,
  substitution, and word reordering.
A brief introduction to Reinforcement Learning
The reinforcement learning framework (Sutton and Barto, 1998)
Case study: text simplification
Basic encoder-decoder model, from (Zhang and Lapata, 2017).
Case study: text simplification
Encoder-Decoder model with reinforcement learning (Zhang and Lapata, 2017).
Derivational morphology
- Process of generating new words from existing words
- Changes semantic meaning
- Often a new part-of-speech
employ       V -> N, Agent            employer
employ       V -> N, Passive          employee
employ       V -> N, Result           employment
employ       V -> Adj, Potential      employable
employable   V -> Adj -> N, Stative   employability
Derivational morphology
[Figure: a character-level Encoder (seq) / Decoder (2seq) model reads
“c o m p o s e” plus the tag VERB-NOM and generates the derived form character
by character (“c o m p o s i …”).]
Derivational morphology: search
[Figure: during decoding, the model considers multiple candidate continuations of
the generated prefix (e.g., “e r”, “n g”, “f i c a t i o n”, “g r o u n d”,
“i q”, “m e n t”, “s”).]
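The search over continuations can be made concrete with a small beam-search sketch; score_next is a hypothetical function standing in for the decoder’s log-probabilities over the next character.

```python
def beam_search(score_next, start="<s>", end="</s>", beam_size=3, max_len=20):
    """Keep the beam_size best partial outputs, extending each one step at a time.

    score_next(prefix) -> dict mapping each possible next symbol to its log-probability.
    """
    beam = [([start], 0.0)]                     # (symbols so far, total log-probability)
    for _ in range(max_len):
        candidates = []
        for symbols, logp in beam:
            if symbols[-1] == end:              # finished hypotheses stay as they are
                candidates.append((symbols, logp))
                continue
            for sym, lp in score_next(symbols).items():
                candidates.append((symbols + [sym], logp + lp))
        # Keep only the highest-scoring beam_size hypotheses.
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(s[-1] == end for s, _ in beam):
            break
    return beam
```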
Reference Sheet
Legend (from the figures):
- A learned parameter (matrix color denotes whether it belongs to the encoder or
  the decoder).
- The memory vector, or “state”.
- The “word vector” representation of the word.
- The RNN function, which combines the word vector and the previous state to
  create a new state.
Equations:
- Whx integrates input vector information.
- Whh integrates information from the previous timestep.
- bh is a bias term.
- The RNN equation is: ht = tanh(Whxxt + Whhht−1 + bh)
- dt is our decision at timestep t.
- dt = argmaxd’ ∈ D p( d’ | x1:n, d1:t-1)
- p( * | x1:n, d1:t-1) = softmaxD(WDhst-1 + bD)