Add deep_qa_1 as baseline
hailiang-wang committed Aug 12, 2017
1 parent b6b111b commit 7a51176
Showing 14 changed files with 585 additions and 62 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -4,4 +4,3 @@ node_modules
 *.pyc
 __pycache__
 _env
-tmp
77 changes: 16 additions & 61 deletions README.md
@@ -1,79 +1,34 @@
-# insuranceqa-corpus-zh
-保险行业语料库 (insurance industry corpus)
+# 保险行业语料库 (Insurance Industry Corpus)
 
 ![](https://camo.githubusercontent.com/ae91a5698ad80d3fe8e0eb5a4c6ee7170e088a7d/687474703a2f2f37786b6571692e636f6d312e7a302e676c622e636c6f7564646e2e636f6d2f61692f53637265656e25323053686f74253230323031372d30342d30342532306174253230382e32302e3437253230504d2e706e67)
 
-## Welcome
+# Welcome
 
-This corpus contains questions and answers collected from the website [Insurance Library](http://www.insurancelibrary.com/).
+Baseline model for [insuranceqa-corpus-zh](https://github.com/Samurais/insuranceqa-corpus-zh/wiki).
 
-To the best of our knowledge, this is the first open QA corpus in the insurance domain:
+Baseline: mini-batch size = 100, hidden_layers = [100, 50], lr = 0.0001.
 
-* The questions were asked by real-world users, and the high-quality answers were provided by professionals with deep domain knowledge, so this is a corpus with real value rather than a toy.
+![](./deep_qa_1/baseline_acc.png)
 
-* In the paper above, the corpus is used for the answer-selection task. Other uses are also possible; for example, through autonomous learning such as reading comprehension of the answers and observational learning, a system could eventually produce its own answers to unseen questions.
+![](./deep_qa_1/baseline_loss.png)
 
-Ideas for further extending this dataset are welcome.
+> Epoch 25, total step 36400, accuracy 0.9031, cost 1.056221.
-## Corpus data
+## Deps
+Python3+
 
-| - | Questions | Answers | Vocabulary (English) |
-| ------------- |-------------| ----- | ----- |
-| Train | 12,889 | 21,325 | 107,889 |
-| Validation | 2,000 | 3,354 | 16,931 |
-| Test | 2,000 | 3,308 | 16,815 |
 
-Each entry contains the question in Chinese and in English, positive answer examples, and negative answer examples. There is at least one positive answer per question, typically *1-5*, all of which are correct. Each question also has *200* negative answers; these are built by retrieval against the question, so they are relevant to it but are not correct answers.
 
-```
-{
-    "INDEX": {
-        "zh": "Chinese text",
-        "en": "English text",
-        "domain": "insurance type",
-        "answers": [""],   # list of positive answers
-        "negatives": [""]  # list of negative answers
-    },
-    more ...
-}
-```
 
-* Train: ```corpus/train.json```
 
-* Validation: ```corpus/valid.json```
 
-* Test: ```corpus/test.json```
 
-* Answers: ```corpus/answers.json```
-There are 27,413 answers in total, in ```json``` format:
 ```
-{
-    "INDEX": {
-        "zh": "Chinese text",
-        "en": "English text"
-    },
-    more ...
-}
+pip install -r Requirements.txt
 ```
 
-### Chinese-English parallel files
 
-#### Question-answer pairs
 
-```
-Format: INDEX ++$++ insurance type ++$++ Chinese ++$++ English
-```
 
-```corpus/train.txt```, ```corpus/valid.txt```, ```corpus/test.txt```.
 
-#### Answers
 
+## Run
+A very simple network as the baseline model.
 ```
-Format: INDEX ++$++ Chinese ++$++ English
+python3 deep_qa_1/network.py
+python3 visual/accuracy.py
+python3 visual/loss.py
 ```
 
-```corpus/answers.txt```
 
 ## Statements
 
 Statement 1: [insuranceqa-corpus-zh](https://github.com/Samurais/insuranceqa-corpus-zh)
@@ -88,4 +43,4 @@ InsuranceQA Corpus, Hai Liang Wang, https://github.com/Samurais/insuranceqa-corp
 
 Statement 2: [insuranceQA](https://github.com/shuzi/insuranceQA)
 
-This dataset is provided for research purposes only. If you publish anything using this data, please cite our paper: [Applying Deep Learning to Answer Selection: A Study and An Open Task](https://arxiv.org/abs/1508.01585). Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, Bowen Zhou @ 2015
+This dataset is provided for research purposes only. If you publish anything using this data, please cite our paper: [Applying Deep Learning to Answer Selection: A Study and An Open Task](https://arxiv.org/abs/1508.01585). Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, Bowen Zhou @ 2015
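
For quick reference, here is a minimal sketch of reading the corpus files in the two formats the original README describes. The JSON field names and the `++$++` separator are assumptions taken from that description, not verified against the published data:

```
import json

def load_qa_json(path):
    # corpus/train.json is described as INDEX -> {zh, en, domain, answers, negatives}
    with open(path, encoding='utf-8') as f:
        data = json.load(f)
    for index, item in data.items():
        yield index, item['zh'], item['answers'], item['negatives']

def parse_pair_line(line):
    # corpus/train.txt lines: INDEX ++$++ insurance type ++$++ Chinese ++$++ English
    index, domain, zh, en = [field.strip() for field in line.split('++$++')]
    return index, domain, zh, en

if __name__ == '__main__':
    print(parse_pair_line("1 ++$++ annuity ++$++ 中文问题 ++$++ english question"))
```

The same layout should apply to ```corpus/valid.json```/```corpus/test.json``` and their ```.txt``` counterparts.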
10 changes: 10 additions & 0 deletions Requirements.txt
@@ -0,0 +1,10 @@
insuranceqa-data==1.3
matplotlib==2.0.2
numpy==1.13.1
pandas==0.20.3
scikit-learn==0.18.1
scipy==0.19.1
six==1.10.0
virtualenv==15.1.0
virtualenv-clone==0.2.4
virtualenvwrapper==4.1.1
7 changes: 7 additions & 0 deletions deep_qa_1/README.md
@@ -0,0 +1,7 @@
# deep_qa_1


## Test data module
```
py.test -s -v -f ./deep_qa_1/data.py
```
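
`deep_qa_1/network.py` is referenced in the README but not included in this diff. As orientation only, here is a minimal numpy sketch of a feed-forward scorer matching the README's `hidden_layers = [100, 50]` over the 120-id packed input built by `data.py`; the weight initialization, the tanh/linear activations, and all names are illustrative assumptions, not the repository's actual network:

```
import numpy as np

INPUT_DIM = 120     # packed question + <GO> + utterance, per data.py
HIDDEN = [100, 50]  # hidden layer sizes from the README
OUTPUT_DIM = 2      # 2-entry label per example
# training (not shown here) would use lr = 0.0001, per the README

rng = np.random.RandomState(42)
sizes = [INPUT_DIM] + HIDDEN + [OUTPUT_DIM]
weights = [rng.randn(m, n) * 0.01 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    '''Feed-forward pass: tanh hidden layers, linear output layer.'''
    a = x
    for w, b in zip(weights[:-1], biases[:-1]):
        a = np.tanh(a @ w + b)
    return a @ weights[-1] + biases[-1]

# score a random batch of 100 packed inputs
logits = forward(rng.randn(100, INPUT_DIM))
print(logits.shape)  # (100, 2)
```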
Empty file added deep_qa_1/__init__.py
Binary file added deep_qa_1/baseline_acc.png
Binary file added deep_qa_1/baseline_loss.png
182 changes: 182 additions & 0 deletions deep_qa_1/data.py
@@ -0,0 +1,182 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#===============================================================================
#
# Copyright (c) 2017 Hai Liang Wang<[email protected]> All Rights Reserved
#
#
# File: /Users/hain/ai/InsuranceQA-Machine-Learning/deep_qa_1/data.py
# Author: Hai Liang Wang
# Date: 2017-08-08:18:32:05
#
#===============================================================================

"""
Data loading module for a simple QA network.
"""
from __future__ import print_function
from __future__ import division

__copyright__ = "Copyright (c) 2017 Hai Liang Wang. All Rights Reserved"
__author__ = "Hai Liang Wang"
__date__ = "2017-08-08:18:32:05"


import os
import sys
curdir = os.path.dirname(os.path.abspath(__file__))
sys.path.insert(0, os.path.dirname(curdir))

if sys.version_info[0] < 3:
    reload(sys)
    sys.setdefaultencoding("utf-8")
    # raise "Must be using Python 3"

import random
import insuranceqa_data as insuranceqa

_train_data = insuranceqa.load_pairs_train()
_test_data = insuranceqa.load_pairs_test()
_valid_data = insuranceqa.load_pairs_valid()


'''
build vocab data with two extra placeholders, <PAD> and <GO>
'''
vocab_data = insuranceqa.load_pairs_vocab()
print("keys", vocab_data.keys())
vocab_size = len(vocab_data['word2id'].keys())
VOCAB_PAD_ID = vocab_size + 1
VOCAB_GO_ID = vocab_size + 2
vocab_data['word2id']['<PAD>'] = VOCAB_PAD_ID
vocab_data['word2id']['<GO>'] = VOCAB_GO_ID
vocab_data['id2word'][VOCAB_PAD_ID] = '<PAD>'
vocab_data['id2word'][VOCAB_GO_ID] = '<GO>'


def _get_corpus_metrics():
    '''
    Report max and average lengths of questions and utterances per split.
    '''
    for cat, data in zip(["valid", "test", "train"], [_valid_data, _test_data, _train_data]):
        max_len_question = 0
        total_len_question = 0
        max_len_utterance = 0
        total_len_utterance = 0
        for x in data:
            total_len_question += len(x['question'])
            total_len_utterance += len(x['utterance'])
            if len(x['question']) > max_len_question:
                max_len_question = len(x['question'])
            if len(x['utterance']) > max_len_utterance:
                max_len_utterance = len(x['utterance'])
        print('max len of %s question : %d, average: %d' % (cat, max_len_question, total_len_question / len(data)))
        print('max len of %s utterance: %d, average: %d' % (cat, max_len_utterance, total_len_utterance / len(data)))


class BatchIter():
    '''
    Load data in mini-batches.
    '''
    def __init__(self, data=None, batch_size=100):
        assert data is not None, "data should not be None."
        self.batch_size = batch_size
        self.data = data

    def next(self):
        random.shuffle(self.data)
        index = 0
        total_num = len(self.data)
        # strict < so the last yield is never an empty slice
        while index < total_num:
            yield self.data[index:index + self.batch_size]
            index += self.batch_size


def padding(lis, pad, size):
    '''
    Right-pad a list with `pad` up to `size`, or truncate it to `size`.
    '''
    if size > len(lis):
        lis += [pad] * (size - len(lis))
    else:
        lis = lis[0:size]
    return lis


def pack_question_n_utterance(q, u, q_length=20, u_length=99):
    '''
    Combine question and utterance as input data for the feed-forward network.
    '''
    assert len(q) > 0 and len(u) > 0, "question and utterance must not be empty"
    q = padding(q, VOCAB_PAD_ID, q_length)
    u = padding(u, VOCAB_PAD_ID, u_length)
    assert len(q) == q_length, "question should be padded to q_length"
    assert len(u) == u_length, "utterance should be padded to u_length"
    return q + [VOCAB_GO_ID] + u


def __resolve_input_data(data, batch_size, question_max_length=20, utterance_max_length=99):
    '''
    Resolve input data into (x, y_) pairs, one mini-batch at a time.
    '''
    batch_iter = BatchIter(data=data, batch_size=batch_size)

    for mini_batch in batch_iter.next():
        result = []
        for o in mini_batch:
            x = pack_question_n_utterance(o['question'], o['utterance'], question_max_length, utterance_max_length)
            y_ = o['label']
            assert len(x) == utterance_max_length + question_max_length + 1, "Wrong length after padding"
            assert VOCAB_GO_ID in x, "<GO> must be in input x"
            assert len(y_) == 2, "label should have two entries."
            result.append([x, y_])
        if len(result) > 0:
            yield result

# export data

def load_train(batch_size=100, question_max_length=20, utterance_max_length=99):
    '''
    load train data
    '''
    return __resolve_input_data(_train_data, batch_size, question_max_length, utterance_max_length)


def load_test(question_max_length=20, utterance_max_length=99):
    '''
    load test data
    '''
    result = []
    for o in _test_data:
        x = pack_question_n_utterance(o['question'], o['utterance'], question_max_length, utterance_max_length)
        y_ = o['label']
        assert len(x) == utterance_max_length + question_max_length + 1, "Wrong length after padding"
        assert VOCAB_GO_ID in x, "<GO> must be in input x"
        assert len(y_) == 2, "label should have two entries."
        result.append((x, y_))
    return result


def load_valid(batch_size=100, question_max_length=20, utterance_max_length=99):
    '''
    load valid data
    '''
    return __resolve_input_data(_valid_data, batch_size, question_max_length, utterance_max_length)


def test_batch():
    '''
    Retrieve test data and check shapes. load_test() returns a flat list of
    (x, y_) pairs, so iterate it directly.
    '''
    for x, y_ in load_test():
        print("length", len(x))
        assert len(y_) == 2, "data size should be 2"

    print("VOCAB_PAD_ID", VOCAB_PAD_ID)
    print("VOCAB_GO_ID", VOCAB_GO_ID)


if __name__ == '__main__':
    test_batch()
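
A small usage sketch for the loaders above, assuming the `insuranceqa-data` package and its corpus download are available; it pulls one training mini-batch and checks the packed input layout:

```
import deep_qa_1.data as corpus

# take the first mini-batch of 100 examples and inspect it
for mini_batch in corpus.load_train(batch_size=100):
    for x, y_ in mini_batch:
        # x = 20 question ids + [<GO>] + 99 utterance ids = 120 ids
        assert len(x) == 20 + 1 + 99
        assert len(y_) == 2  # 2-entry label per example
    break  # one batch is enough for this demo
```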

