노트북 번역 (Notebook translation)

ruke79 · Jul 31, 2021 · aec01fe
1 parent 1a49c6d
commit aec01fe
Showing 2 changed files with 48 additions and 52 deletions.
diff --git a/notebooks/tabular_data_vectorization.ipynb b/notebooks/tabular_data_vectorization.ipynb
@@ -4,14 +4,14 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Vectorizing tabular fields"
+    "# 표 형식 특성 벡터화하기"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "This notebook covers simple methods to vectorize tabular data using the same dataset as in the other examples, but this time ignoring the text of the question."
+    "이 노트북은 다른 예제와 동일한 데이터셋을 사용해 표 형태 데이터를 벡터화하는 간단한 방법을 다룹니다. 하지만 이번에는 질문 텍스트를 무시합니다."
    ]
   },
   {
@@ -37,7 +37,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Let's pretend we wanted to predict the **score** from the tags, number of comments, and question creation date. Here is what the data looks like"
+    "태그, 코멘트 개수, 질문 날짜로부터 **점수**를 예측한다고 가정해 보죠. 이 데이터는 다음과 같습니다."
    ]
   },
   {
@@ -144,15 +144,13 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "In order to use this data as input to a model, we need to give it a suitable numerical representation. To do so, we will do three things here:\n",
+    "이 데이터를 모델의 입력으로 사용하기 위해 적절한 수치 표현으로 바꾸어야 합니다. 이렇게 하기 위해 다음 세 가지 작업을 합니다:\n",
     "\n",
-    "1. Normalize numerical input features to limit the impact of outliers\n",
+    "1. 수치 입력 특성을 정규화하여 이상치로 인한 영향을 줄입니다.\n",
+    "2. 날짜 특성을 모델이 이해하기 쉬운 형태로 변환합니다.\n",
+    "3. 모델이 범주형 특성을 이해할 수 있도록 더미(dummy) 변수로 바꿉니다.\n",
     "\n",
-    "2. Transform the date feature in a way that makes it easier to understand for a model.\n",
-    "\n",
-    "3. Get dummy variables from categorical features so a model can ingest them.\n",
-    "\n",
-    "First, we normalize the data to reduce the effect of outliers on downstream model performance."
+    "먼저 데이터를 정규화하여 이상치가 모델 성능에 미치는 영향을 줄입니다."
    ]
   },
   {
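The normalization cell itself falls outside the hunks shown here, so its exact method is not visible. As a rough sketch only, assuming a z-score (subtract the mean, divide by the standard deviation) and a hypothetical helper name `z_score_normalize`, the first step could look like this in plain Python:

```python
import statistics


def z_score_normalize(values):
    """Rescale values to zero mean and unit (population) standard deviation.

    Hypothetical helper for illustration; the notebook's own normalization
    code is not part of this diff.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


comment_counts = [0, 1, 2, 3, 44]  # made-up data with one large value
normalized = z_score_normalize(comment_counts)
```

A z-score puts columns on a comparable scale but does not clip extreme values, so the large value above still maps to the largest normalized value.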
@@ -278,7 +276,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Now, let's represent dates in a way that would make it easier for a model to extract patterns (see chapter 4 of the attached book for more information on why we chose these particular features.)"
+    "이제 날짜를 모델이 패턴을 추출하기 쉬운 형태로 표현해 보죠(특정 특성을 선택한 이유에 대한 자세한 내용은 책의 4장을 참고하세요)."
    ]
   },
   {
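As a standard-library illustration of splitting a timestamp into calendar fields, here is a minimal sketch. The timestamp string is made up, and its ISO-like format is an assumption about what `CreationDate` holds; `hour` is included because it appears later in `col_to_keep`.

```python
from datetime import datetime

# Made-up timestamp; the real CreationDate format is assumed, not confirmed.
creation_date = "2019-03-14T09:26:53.590"

parsed = datetime.fromisoformat(creation_date)

# The same calendar fields the notebook reads through the pandas .dt accessor.
date_features = {
    "year": parsed.year,
    "month": parsed.month,
    "day": parsed.day,
    "hour": parsed.hour,
}
```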
@@ -287,10 +285,10 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Convert our date to a pandas datetime\n",
+    "# 날짜를 판다스 datetime으로 변환합니다.\n",
     "tabular_df[\"date\"] = pd.to_datetime(tabular_df[\"CreationDate\"])\n",
     "\n",
-    "# Extract meaningful features from the datetime object\n",
+    "# datetime 객체에서 의미있는 특성을 추출합니다.\n",
     "tabular_df[\"year\"] = tabular_df[\"date\"].dt.year\n",
     "tabular_df[\"month\"] = tabular_df[\"date\"].dt.month\n",
     "tabular_df[\"day\"] = tabular_df[\"date\"].dt.day\n",
@@ -447,7 +445,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "And finally let's transform tags into dummy variables using pandas' [get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) function, with each tag being assigned an index that will take the value \"1\" only if it is present in the given row."
+    "마지막으로 태그를 판다스의 [get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) 함수를 사용해 더미 변수로 변환합니다. 각 태그에는 하나의 더미 변수가 할당되며, 이 변수는 해당 행에 태그가 있을 때만 1이 됩니다."
    ]
   },
   {
@@ -456,17 +454,17 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Select our tags, represented as strings, and transform them into arrays of tags\n",
+    "# 문자열로 표현된 태그를 태그 배열로 변환합니다.\n",
     "tags = tabular_df[\"Tags\"]\n",
     "clean_tags = tags.str.split(\"><\").apply(\n",
     "    lambda x: [a.strip(\"<\").strip(\">\") for a in x])\n",
     "\n",
-    "# Use pandas' get_dummies to get dummy values \n",
-    "# select only tags that appear over 500 times\n",
+    "# 판다스의 get_dummies 함수를 사용해 더미 변수를 만듭니다.\n",
+    "# 500번 넘게 나타난 태그만 선택합니다.\n",
     "tag_columns = pd.get_dummies(clean_tags.apply(pd.Series).stack()).sum(level=0)\n",
     "all_tags = tag_columns.astype(bool).sum(axis=0).sort_values(ascending=False)\n",
     "top_tags = all_tags[all_tags > 500]\n",
-    "top_tag_columns = tag_columns[top_tags.index]\n"
+    "top_tag_columns = tag_columns[top_tags.index]"
    ]
   },
   {
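To make the encoding in the cell above concrete, here is a dependency-free sketch of the same idea: parse the tag string, count how many rows each tag appears in, keep the frequent tags, and emit one 0/1 indicator per kept tag. The helper names `parse_tags` and `multi_hot`, the sample rows, and the threshold of 1 are all illustrative assumptions, not code from the notebook.

```python
from collections import Counter


def parse_tags(raw):
    """Turn '<python><pandas>' into ['python', 'pandas']."""
    return [t.strip("<").strip(">") for t in raw.split("><")]


def multi_hot(rows, min_count=0):
    """Encode each row's tag list as 0/1 indicators, one per kept tag.

    Only tags that occur in more than `min_count` rows are kept.
    """
    # Count rows per tag (set() so a repeated tag in one row counts once).
    counts = Counter(tag for tags in rows for tag in set(tags))
    kept = sorted(tag for tag, n in counts.items() if n > min_count)
    return kept, [[1 if tag in tags else 0 for tag in kept] for tags in rows]


raw_tags = ["<python><pandas>", "<python>", "<numpy><pandas>"]  # made-up rows
rows = [parse_tags(r) for r in raw_tags]
columns, encoded = multi_hot(rows, min_count=1)
```

Separately, the unchanged context line in the hunk above calls `.sum(level=0)`. The `level` keyword on aggregations was deprecated in pandas 1.3 and removed in pandas 2.0; on newer versions the equivalent spelling is `.groupby(level=0).sum()`.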
@@ -583,10 +581,10 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Add our tags back into our initial DataFrame\n",
+    "# 태그를 원래 DataFrame에 추가합니다.\n",
     "final = pd.concat([tabular_df, top_tag_columns], axis=1)\n",
     "\n",
-    "# Keeping only the vectorized features\n",
+    "# 벡터화된 특성만 남깁니다.\n",
     "col_to_keep = [\"year\", \"month\", \"day\", \"hour\", \"NormComment\",\n",
     "               \"NormScore\"] + list(top_tags.index)\n",
     "final_features = final[col_to_keep]"
@@ -749,7 +747,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Voila! Our tabular data is now ready to be used for a model."
+    "좋습니다. 이제 모델에 사용할 데이터가 준비되었습니다."
    ]
   }
  ],

diff --git a/notebooks/third_model.ipynb b/notebooks/third_model.ipynb
@@ -4,9 +4,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Selecting useful features\n",
+    "# 유용한 특성 선택하기\n",
     "\n",
-    "When looking at the second model's feature importances, we saw that the TF-IDF vectorization features were suspiciously absent from the most important features we displayed. To verify that a model can perform well without these features, let's omit them and train a third model."
+    "두 번째 모델의 특성 중요도를 보면 TF-IDF 벡터 특성은 가장 중요한 특성에서 빠져있습니다. 이 특성을 제외하고도 모델이 잘 작동하는지 확인하기 위해 이 특성들을 빼고 세 번째 모델을 훈련해 보겠습니다."
    ]
   },
   {
@@ -78,7 +78,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The function below add features. Feel free to check out the ml_editor source code to see more about what these functions are doing!"
+    "아래 함수는 특성을 추가합니다. 이 함수에 대한 자세한 내용은 `ml_editor` 소스 코드를 참고하세요."
    ]
   },
   {
@@ -128,9 +128,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Model\n",
+    "# 모델\n",
     "\n",
-    "Now that we've added our new features, let's train a new model. We'll use the same architecture as before, but with new features. You can visualize the new features below."
+    "새로운 특성을 추가했으니 새 모델을 훈련해 보겠습니다. 이전과 동일한 구조를 사용하지만 새로운 특성을 사용합니다. 아래 새로운 특성을 출력해 보겠습니다."
    ]
   },
   {
@@ -139,7 +139,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# We split again since we have now added all features. \n",
+    "# 추가된 특성이 있으므로 데이터를 다시 분할합니다.\n",
     "train_df, test_df = get_split_by_author(df, test_size=0.2, random_state=40)"
    ]
   },
@@ -395,11 +395,10 @@
    "source": [
     "def get_feature_vector_and_label(df, feature_names):\n",
     "    \"\"\"\n",
-    "    Generate input and output vectors using the vectors feature and\n",
-    "     the given feature names\n",
-    "    :param df: input DataFrame\n",
-    "    :param feature_names: names of feature columns (other than vectors)\n",
-    "    :return: feature array and label array\n",
+    "    벡터 특성과 특성 이름으로 입력과 출력 벡터를 생성합니다.\n",
+    "    :param df: 입력 DataFrame\n",
+    "    :param feature_names: (벡터가 아닌) 특성 열의 이름 \n",
+    "    :return: 특성 배열과 레이블 배열\n",
     "    \"\"\"\n",
     "    features = df[feature_names].astype(float)\n",
     "    labels = df[\"Score\"] > df[\"Score\"].median()\n",
@@ -455,7 +454,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We train a model using `sklearn`, and measure its performance using the methods we've covered before."
+    "`sklearn`을 사용해 모델을 훈련하고 이전에 설명한 방법으로 성능을 측정합니다."
    ]
   },
   {
@@ -486,28 +485,27 @@
    ],
    "source": [
     "def get_metrics(y_test, y_predicted):  \n",
-    "    # true positives / (true positives+false positives)\n",
+    "    # 진짜 양성 / (진짜 양성 + 가짜 양성)\n",
     "    precision = precision_score(y_test, y_predicted, pos_label=True,\n",
     "                                    average='binary')             \n",
-    "    # true positives / (true positives + false negatives)\n",
+    "    # 진짜 양성 / (진짜 양성 + 가짜 음성)\n",
     "    recall = recall_score(y_test, y_predicted, pos_label=True,\n",
     "                              average='binary')\n",
     "    \n",
-    "    # harmonic mean of precision and recall\n",
+    "    # 정밀도와 재현율의 조화 평균\n",
     "    f1 = f1_score(y_test, y_predicted, pos_label=True, average='binary')\n",
     "    \n",
-    "    # true positives + true negatives/ total\n",
+    "    # (진짜 양성 + 진짜 음성) / 전체\n",
     "    accuracy = accuracy_score(y_test, y_predicted)\n",
     "    return accuracy, precision, recall, f1\n",
     "\n",
     "\n",
-    "\n",
-    "# Training accuracy\n",
-    "# Thanks to https://datascience.stackexchange.com/questions/13151/randomforestclassifier-oob-scoring-method\n",
+    "# 훈련 정확도\n",
+    "# https://datascience.stackexchange.com/questions/13151/randomforestclassifier-oob-scoring-method 참조\n",
     "y_train_pred = np.argmax(clf.oob_decision_function_,axis=1)\n",
     "\n",
     "accuracy, precision, recall, f1 = get_metrics(y_train, y_train_pred)\n",
-    "print(\"Training accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f\" % (accuracy, precision, recall, f1))"
+    "print(\"훈련 정확도 = %.3f, 정밀도 = %.3f, 재현율 = %.3f, f1 = %.3f\" % (accuracy, precision, recall, f1))"
    ]
   },
   {
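The formulas in the `get_metrics` comments can be checked by hand. The sketch below recomputes them from confusion-matrix counts in plain Python; `binary_metrics` is a hypothetical stand-in for the scikit-learn calls, the labels are made up, and returning 0.0 when a denominator is zero is a choice made here for the example.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for boolean labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


y_true = [True, True, False, False, True]  # made-up labels
y_pred = [True, False, False, True, True]  # made-up predictions
accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
```

Note that accuracy divides the sum of true positives and true negatives by the total, so the numerator needs parentheses when written as a formula.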
@@ -525,14 +523,14 @@
    ],
    "source": [
     "accuracy, precision, recall, f1 = get_metrics(y_test, y_predicted)\n",
-    "print(\"Validation accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f\" % (accuracy, precision, recall, f1))"
+    "print(\"검증 정확도 = %.3f, 정밀도 = %.3f, 재현율 = %.3f, f1 = %.3f\" % (accuracy, precision, recall, f1))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Let's save our new model and vectorizer to disk so we can use it later."
+    "새로운 모델과 벡터화 객체를 나중에 사용하기 위해 디스크에 저장합니다."
    ]
   },
   {
@@ -560,9 +558,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Validating that features are useful\n",
+    "## 특성의 유용성 검증하기\n",
     "\n",
-    "Let's look at feature importances to validate that our new features are being used by the new model."
+    "새로운 특성을 모델이 사용하는지 확인하기 위해 특성 중요도를 살펴 보겠습니다."
    ]
   },
   {
@@ -635,20 +633,20 @@
    ],
    "source": [
     "k = 20\n",
-    "print(\"Top %s importances:\\n\" % k)\n",
+    "print(\"상위 %s개 중요도:\\n\" % k)\n",
     "print('\\n'.join([\"%s: %.2g\" % (tup[0], tup[1]) for tup in get_feature_importance(clf, all_feature_names)[:k]]))\n",
     "\n",
-    "print(\"\\nBottom %s importances:\\n\" % k)\n",
+    "print(\"\\n하위 %s개 중요도:\\n\" % k)\n",
     "print('\\n'.join([\"%s: %.2g\" % (tup[0], tup[1]) for tup in get_feature_importance(clf, all_feature_names)[-k:]]))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Comparing predictions to data\n",
+    "## 예측과 데이터 비교하기\n",
     "\n",
-    "This section uses the evaluation methods described in the Comparing Data To Predictions [notebook](https://github.com/hundredblocks/ml-powered-applications/blob/master/notebooks/comparing_data_to_predictions.ipynb) on this third model."
+    "이 섹션은 새로운 모델로 데이터와 예측 비교하기 [노트북](https://github.com/rickiepark/ml-powered-applications/blob/master/notebooks/comparing_data_to_predictions.ipynb)에서 설명한 평가 방법을 사용합니다."
    ]
   },
   {
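The top and bottom importance listing printed in the cell above relies on `get_feature_importance` from `ml_editor`, whose body is not part of this diff. As a sketch of what such a ranking could do, assuming it simply pairs names with importances and sorts them in descending order (`rank_features`, the feature names, and the numbers below are all made up):

```python
def rank_features(importances, feature_names):
    """Pair each name with its importance, sorted from most to least important."""
    return sorted(zip(feature_names, importances),
                  key=lambda pair: pair[1], reverse=True)


# Made-up importances; a fitted scikit-learn forest exposes real ones
# through its feature_importances_ attribute.
names = ["num_words", "num_questions", "num_commas", "year"]
importances = [0.40, 0.05, 0.25, 0.30]

ranked = rank_features(importances, names)
k = 2
top_k, bottom_k = ranked[:k], ranked[-k:]
```

Computing the ranking once and slicing it twice avoids sorting the full feature list for each printout.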
@@ -735,16 +733,16 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "This third model only uses features that can be easily interpreted, and is better calibrated than previous models. This makes it a very good candidate for our application (feel free to look at `Comparing Models` for a more detailed comparison)."
+    "세 번째 모델은 이해하기 쉬운 특성만 사용하고 이전 모델보다 더 보정이 잘 되어 있습니다. 이 애플리케이션에 매우 좋은 후보 모델입니다(자세한 비교 내용은 `Comparing Models`를 참고하세요)."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Running Inference\n",
+    "## 추론 함수\n",
     "\n",
-    "Just like for our first two models, we define an inference function that takes in an arbitrary question and outputs an estimated probability of it receiving a high score according to our model."
+    "이전 두 개의 모델과 마찬가지로 임의의 질문을 받고 높은 점수를 받을 추정 확률을 출력하는 추론 함수를 정의합니다."
    ]
   },
   {