ch08 rewrites (rasbt#18)

battyone · Jul 27, 2017 · ff05533
1 parent 88bf92b
commit ff05533
Showing 1 changed file with 27 additions and 2 deletions.
diff --git a/code/ch08/ch08.ipynb b/code/ch08/ch08.ipynb
@@ -87,7 +87,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "- [Obtaining the IMDb movie review dataset](#Obtaining-the-IMDb-movie-review-dataset)\n",
+    "- [Preparing the IMDb movie review data for text processing](#Preparing-the-IMDb-movie-review-data-for-text-processing)\n",
+    "  - [Obtaining the IMDb movie review dataset](#Obtaining-the-IMDb-movie-review-dataset)\n",
+    "  - [Preprocessing the movie dataset into more convenient format](#Preprocessing-the-movie-dataset-into-more-convenient-format)\n",
     "- [Introducing the bag-of-words model](#Introducing-the-bag-of-words-model)\n",
     "  - [Transforming words into feature vectors](#Transforming-words-into-feature-vectors)\n",
     "  - [Assessing word relevancy via term frequency-inverse document frequency](#Assessing-word-relevancy-via-term-frequency-inverse-document-frequency)\n",
@@ -113,7 +115,14 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Obtaining the IMDb movie review dataset"
+    "# Preparing the IMDb movie review data for text processing"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Obtaining the IMDb movie review dataset"
    ]
   },
   {
@@ -130,6 +139,13 @@
     "B) If you are working with Windows, download an archiver such as [7Zip](http://www.7-zip.org) to extract the files from the download archive."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Preprocessing the movie dataset into more convenient format"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 2,
@@ -902,6 +918,15 @@
     "                           n_jobs=-1)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Important Note**\n",
+    "\n",
+    "It is highly recommended to use `n_jobs=-1` (instead of `n_jobs=1`) in the previous code example so that the grid search uses all available cores on your machine and runs faster. However, some Windows users have reported errors with the `n_jobs=-1` setting, caused by pickling the `tokenizer` and `tokenizer_porter` functions for multiprocessing. If you run into this, one workaround is to set `n_jobs=1`; another is to replace those two functions, `[tokenizer, tokenizer_porter]`, with `[str.split]`. Note that plain `str.split` does not perform stemming.\n"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 26,