Add deep_qa_1 as baseline
hailiang-wang committed Aug 12, 2017
1 parent b6b111b commit 7a51176
Showing 14 changed files with 585 additions and 62 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -4,4 +4,3 @@ node_modules
 *.pyc
 __pycache__
 _env
-tmp
77 changes: 16 additions & 61 deletions README.md
@@ -1,79 +1,34 @@
-# insuranceqa-corpus-zh
-保险行业语料库 (insurance industry corpus)
+# 保险行业语料库 (Insurance Industry Corpus)
 
 ![](https://camo.githubusercontent.com/ae91a5698ad80d3fe8e0eb5a4c6ee7170e088a7d/687474703a2f2f37786b6571692e636f6d312e7a302e676c622e636c6f7564646e2e636f6d2f61692f53637265656e25323053686f74253230323031372d30342d30342532306174253230382e32302e3437253230504d2e706e67)
 
-## Welcome
+# Welcome
 
-This corpus contains questions and answers collected from the website [Insurance Library](http://www.insurancelibrary.com/).
+Baseline model for [insuranceqa-corpus-zh](https://github.com/Samurais/insuranceqa-corpus-zh/wiki).
 
-To the best of our knowledge, this is the first open QA corpus in the insurance domain:
+Baseline: mini-batch size = 100, hidden_layers = [100, 50], lr = 0.0001.
 
-* The questions were asked by real-world users, and the high-quality answers were provided by professionals with deep domain knowledge, so this is a corpus with real value rather than a toy.
+![](./deep_qa_1/baseline_acc.png)
 
-* In the paper above, the corpus is used for the answer-selection task. Other uses are also possible; for example, through autonomous learning such as reading comprehension of the answers and observational learning, a system could eventually produce its own answers to unseen questions.
+![](./deep_qa_1/baseline_loss.png)
 
-Ideas for further extending this dataset are welcome.
+> Epoch 25, total step 36400, accuracy 0.9031, cost 1.056221.
-## Corpus data
+## Deps
+Python3+
 
-| - | Questions | Answers | Vocabulary (English) |
-| ------------- |-------------| ----- | ----- |
-| Train | 12,889 | 21,325 | 107,889 |
-| Validation | 2,000 | 3,354 | 16,931 |
-| Test | 2,000 | 3,308 | 16,815 |
 
-Each entry contains the question in Chinese and in English, positive answer examples, and negative answer examples. There is at least one positive answer per question, typically *1-5*, all of which are correct. Each question also has *200* negative answers; these are built by retrieval against the question, so they are relevant to it but are not correct answers.
 
-```
-{
-    "INDEX": {
-        "zh": "Chinese text",
-        "en": "English text",
-        "domain": "insurance type",
-        "answers": [""],   # list of positive answers
-        "negatives": [""]  # list of negative answers
-    },
-    more ...
-}
-```
 
-* Train: ```corpus/train.json```
 
-* Validation: ```corpus/valid.json```
 
-* Test: ```corpus/test.json```
 
-* Answers: ```corpus/answers.json```
-There are 27,413 answers in total, in ```json``` format:
 ```
-{
-    "INDEX": {
-        "zh": "Chinese text",
-        "en": "English text"
-    },
-    more ...
-}
+pip install -r Requirements.txt
 ```
 
-### Chinese-English parallel files
 
-#### Question-answer pairs
 
-```
-Format: INDEX ++$++ insurance type ++$++ Chinese ++$++ English
-```
 
-```corpus/train.txt```, ```corpus/valid.txt```, ```corpus/test.txt```.
 
-#### Answers
 
+## Run
+A very simple network as the baseline model.
 ```
-Format: INDEX ++$++ Chinese ++$++ English
+python3 deep_qa_1/network.py
+python3 visual/accuracy.py
+python3 visual/loss.py
 ```
 
-```corpus/answers.txt```
 
 ## Statements
 
 Statement 1: [insuranceqa-corpus-zh](https://github.com/Samurais/insuranceqa-corpus-zh)
@@ -88,4 +43,4 @@ InsuranceQA Corpus, Hai Liang Wang, https://github.com/Samurais/insuranceqa-corp
 
 Statement 2: [insuranceQA](https://github.com/shuzi/insuranceQA)
 
-This dataset is provided for research purposes only. If you publish anything using this data, please cite our paper: [Applying Deep Learning to Answer Selection: A Study and An Open Task](https://arxiv.org/abs/1508.01585). Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, Bowen Zhou @ 2015
+This dataset is provided for research purposes only. If you publish anything using this data, please cite our paper: [Applying Deep Learning to Answer Selection: A Study and An Open Task](https://arxiv.org/abs/1508.01585). Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, Bowen Zhou @ 2015
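
For quick reference, here is a minimal sketch of reading the corpus files in the two formats the original README describes. The JSON field names and the `++$++` separator are assumptions taken from that description, not verified against the published data:

```
import json

def load_qa_json(path):
    # corpus/train.json is described as INDEX -> {zh, en, domain, answers, negatives}
    with open(path, encoding='utf-8') as f:
        data = json.load(f)
    for index, item in data.items():
        yield index, item['zh'], item['answers'], item['negatives']

def parse_pair_line(line):
    # corpus/train.txt lines: INDEX ++$++ insurance type ++$++ Chinese ++$++ English
    index, domain, zh, en = [field.strip() for field in line.split('++$++')]
    return index, domain, zh, en

if __name__ == '__main__':
    print(parse_pair_line("1 ++$++ annuity ++$++ 中文问题 ++$++ english question"))
```

The same layout should apply to ```corpus/valid.json```/```corpus/test.json``` and their ```.txt``` counterparts.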
10 changes: 10 additions & 0 deletions Requirements.txt
@@ -0,0 +1,10 @@
insuranceqa-data==1.3
matplotlib==2.0.2
numpy==1.13.1
pandas==0.20.3
scikit-learn==0.18.1
scipy==0.19.1
six==1.10.0
virtualenv==15.1.0
virtualenv-clone==0.2.4
virtualenvwrapper==4.1.1
7 changes: 7 additions & 0 deletions deep_qa_1/README.md
@@ -0,0 +1,7 @@
# deep_qa_1


## Test data module
```
py.test -s -v -f ./deep_qa_1/data.py
```
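
`deep_qa_1/network.py` is referenced in the README but not included in this diff. As orientation only, here is a minimal numpy sketch of a feed-forward scorer matching the README's `hidden_layers = [100, 50]` over the 120-id packed input built by `data.py`; the weight initialization, the tanh/linear activations, and all names are illustrative assumptions, not the repository's actual network:

```
import numpy as np

INPUT_DIM = 120     # packed question + <GO> + utterance, per data.py
HIDDEN = [100, 50]  # hidden layer sizes from the README
OUTPUT_DIM = 2      # 2-entry label per example
# training (not shown here) would use lr = 0.0001, per the README

rng = np.random.RandomState(42)
sizes = [INPUT_DIM] + HIDDEN + [OUTPUT_DIM]
weights = [rng.randn(m, n) * 0.01 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    '''Feed-forward pass: tanh hidden layers, linear output layer.'''
    a = x
    for w, b in zip(weights[:-1], biases[:-1]):
        a = np.tanh(a @ w + b)
    return a @ weights[-1] + biases[-1]

# score a random batch of 100 packed inputs
logits = forward(rng.randn(100, INPUT_DIM))
print(logits.shape)  # (100, 2)
```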
Empty file added deep_qa_1/__init__.py
Binary file added deep_qa_1/baseline_acc.png
Binary file added deep_qa_1/baseline_loss.png
182 changes: 182 additions & 0 deletions deep_qa_1/data.py
@@ -0,0 +1,182 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#===============================================================================
#
# Copyright (c) 2017 Hai Liang Wang<[email protected]> All Rights Reserved
#
#
# File: /Users/hain/ai/InsuranceQA-Machine-Learning/deep_qa_1/data.py
# Author: Hai Liang Wang
# Date: 2017-08-08:18:32:05
#
#===============================================================================

"""
Data loading module for a simple QA network.
"""
from __future__ import print_function
from __future__ import division

__copyright__ = "Copyright (c) 2017 Hai Liang Wang. All Rights Reserved"
__author__ = "Hai Liang Wang"
__date__ = "2017-08-08:18:32:05"


import os
import sys
curdir = os.path.dirname(os.path.abspath(__file__))
sys.path.insert(0, os.path.dirname(curdir))

if sys.version_info[0] < 3:
    reload(sys)
    sys.setdefaultencoding("utf-8")
    # raise "Must be using Python 3"

import random
import insuranceqa_data as insuranceqa

_train_data = insuranceqa.load_pairs_train()
_test_data = insuranceqa.load_pairs_test()
_valid_data = insuranceqa.load_pairs_valid()


'''
build vocab data with two extra placeholders, <PAD> and <GO>
'''
vocab_data = insuranceqa.load_pairs_vocab()
print("keys", vocab_data.keys())
vocab_size = len(vocab_data['word2id'].keys())
VOCAB_PAD_ID = vocab_size + 1
VOCAB_GO_ID = vocab_size + 2
vocab_data['word2id']['<PAD>'] = VOCAB_PAD_ID
vocab_data['word2id']['<GO>'] = VOCAB_GO_ID
vocab_data['id2word'][VOCAB_PAD_ID] = '<PAD>'
vocab_data['id2word'][VOCAB_GO_ID] = '<GO>'


def _get_corpus_metrics():
    '''
    Report max and average lengths of questions and utterances per split.
    '''
    for cat, data in zip(["valid", "test", "train"], [_valid_data, _test_data, _train_data]):
        max_len_question = 0
        total_len_question = 0
        max_len_utterance = 0
        total_len_utterance = 0
        for x in data:
            total_len_question += len(x['question'])
            total_len_utterance += len(x['utterance'])
            if len(x['question']) > max_len_question:
                max_len_question = len(x['question'])
            if len(x['utterance']) > max_len_utterance:
                max_len_utterance = len(x['utterance'])
        print('max len of %s question : %d, average: %d' % (cat, max_len_question, total_len_question / len(data)))
        print('max len of %s utterance: %d, average: %d' % (cat, max_len_utterance, total_len_utterance / len(data)))


class BatchIter():
    '''
    Load data in mini-batches.
    '''
    def __init__(self, data=None, batch_size=100):
        assert data is not None, "data should not be None."
        self.batch_size = batch_size
        self.data = data

    def next(self):
        random.shuffle(self.data)
        index = 0
        total_num = len(self.data)
        # strict < so the last yield is never an empty slice
        while index < total_num:
            yield self.data[index:index + self.batch_size]
            index += self.batch_size


def padding(lis, pad, size):
    '''
    Right-pad a list with `pad` up to `size`, or truncate it to `size`.
    '''
    if size > len(lis):
        lis += [pad] * (size - len(lis))
    else:
        lis = lis[0:size]
    return lis


def pack_question_n_utterance(q, u, q_length=20, u_length=99):
    '''
    Combine question and utterance as input data for the feed-forward network.
    '''
    assert len(q) > 0 and len(u) > 0, "question and utterance must not be empty"
    q = padding(q, VOCAB_PAD_ID, q_length)
    u = padding(u, VOCAB_PAD_ID, u_length)
    assert len(q) == q_length, "question should be padded to q_length"
    assert len(u) == u_length, "utterance should be padded to u_length"
    return q + [VOCAB_GO_ID] + u


def __resolve_input_data(data, batch_size, question_max_length=20, utterance_max_length=99):
    '''
    Resolve input data into (x, y_) pairs, one mini-batch at a time.
    '''
    batch_iter = BatchIter(data=data, batch_size=batch_size)

    for mini_batch in batch_iter.next():
        result = []
        for o in mini_batch:
            x = pack_question_n_utterance(o['question'], o['utterance'], question_max_length, utterance_max_length)
            y_ = o['label']
            assert len(x) == utterance_max_length + question_max_length + 1, "Wrong length after padding"
            assert VOCAB_GO_ID in x, "<GO> must be in input x"
            assert len(y_) == 2, "label should have two entries."
            result.append([x, y_])
        if len(result) > 0:
            yield result

# export data

def load_train(batch_size=100, question_max_length=20, utterance_max_length=99):
    '''
    load train data
    '''
    return __resolve_input_data(_train_data, batch_size, question_max_length, utterance_max_length)


def load_test(question_max_length=20, utterance_max_length=99):
    '''
    load test data
    '''
    result = []
    for o in _test_data:
        x = pack_question_n_utterance(o['question'], o['utterance'], question_max_length, utterance_max_length)
        y_ = o['label']
        assert len(x) == utterance_max_length + question_max_length + 1, "Wrong length after padding"
        assert VOCAB_GO_ID in x, "<GO> must be in input x"
        assert len(y_) == 2, "label should have two entries."
        result.append((x, y_))
    return result


def load_valid(batch_size=100, question_max_length=20, utterance_max_length=99):
    '''
    load valid data
    '''
    return __resolve_input_data(_valid_data, batch_size, question_max_length, utterance_max_length)


def test_batch():
    '''
    Retrieve test data and check shapes. load_test() returns a flat list of
    (x, y_) pairs, so iterate it directly.
    '''
    for x, y_ in load_test():
        print("length", len(x))
        assert len(y_) == 2, "data size should be 2"

    print("VOCAB_PAD_ID", VOCAB_PAD_ID)
    print("VOCAB_GO_ID", VOCAB_GO_ID)


if __name__ == '__main__':
    test_batch()
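
A small usage sketch for the loaders above, assuming the `insuranceqa-data` package and its corpus download are available; it pulls one training mini-batch and checks the packed input layout:

```
import deep_qa_1.data as corpus

# take the first mini-batch of 100 examples and inspect it
for mini_batch in corpus.load_train(batch_size=100):
    for x, y_ in mini_batch:
        # x = 20 question ids + [<GO>] + 99 utterance ids = 120 ids
        assert len(x) == 20 + 1 + 99
        assert len(y_) == 2  # 2-entry label per example
    break  # one batch is enough for this demo
```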

