{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Dataset Deduplication for `java-pretrain`\n",
    "To have as much data as possible for our pretraing experiments we combined the training partitions of code2seq's `java-small`, `java-medium` and `java-large` datasets.  \n",
    "However, these datasets are not completely separated and thus, the training partition of `java-large` can potentially contain Java methods from the test partition of `java-small`. To alleviate this issue, we perform a strict deduplication between the new training partition of `java-pretrain` and the valid/test partitions of `java-small` and `java-medium`.  \n",
    "For this, we performed the following steps:\n",
    " 1. Copy the training partitions of `java-small`, `java-medium` and `java-large` into a new `java-pretrain/training` folder. The valid and test partition for this new `java-pretrain` dataset is not important for our use-case as we only use it for self-supervised language model pretraining.\n",
    " 2. Perform harsh project-level deduplication by running this notebook until step 5\n",
    " 3. Compute more granular file-level similarities by running the `deduplicate-java-pretrain.py` script for the datasets and partitions you wish to deduplicate against. In our case, we deduplicated against the valid and train partitions of `java-small` and `java-medium`. This will store a pickled list of files to delete in `java-pretrain`.\n",
    " 4. Load this list in step 5 of this notebook to inspect what is to be deleted and finally run the file-level deletion"
   ]
  },
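  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Step 1 is performed outside of this notebook. A minimal sketch of the copy, assuming the standard code2seq folder layout under `CODE2SEQ_RAW_DATA_PATH` and skipping projects whose name already exists in the target:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only (assumption): step 1 was run once, outside this notebook.\n",
    "# Copy every project folder from the three training partitions into\n",
    "# java-pretrain/training, skipping cross-dataset name collisions.\n",
    "import shutil\n",
    "from pathlib import Path\n",
    "\n",
    "from code_transformer.env import CODE2SEQ_RAW_DATA_PATH\n",
    "\n",
    "target = Path(CODE2SEQ_RAW_DATA_PATH) / 'java-pretrain' / 'training'\n",
    "target.mkdir(parents=True, exist_ok=True)\n",
    "for source_dataset in ['java-small', 'java-medium', 'java-large']:\n",
    "    for project in (Path(CODE2SEQ_RAW_DATA_PATH) / source_dataset / 'training').iterdir():\n",
    "        if project.is_dir() and not (target / project.name).exists():\n",
    "            shutil.copytree(project, target / project.name)"
   ]
  },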
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%cd ../..\n",
    "%reload_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import re\n",
    "import shutil\n",
    "from collections import defaultdict\n",
    "from difflib import SequenceMatcher\n",
    "from pathlib import Path\n",
    "\n",
    "from tqdm import tqdm\n",
    "\n",
    "from code_transformer.utils.io import save_pickled, load_pickled\n",
    "from code_transformer.env import CODE2SEQ_RAW_DATA_PATH"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1 Data Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_path_code2seq = CODE2SEQ_RAW_DATA_PATH"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset = 'java-small'\n",
    "partition = 'test'\n",
    "projects_folder = Path(f\"{data_path_code2seq}/{dataset}/{partition}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reference_dataset = 'java-pretrain'\n",
    "reference_partition = 'training'\n",
    "reference_projects_folder = Path(f\"{data_path_code2seq}/{reference_dataset}/{reference_partition}\")\n",
    "reference_projects = {p for p in reference_projects_folder.iterdir()}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2 Helper Methods"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_files_recursive(folder):\n",
    "    if not folder.is_dir():\n",
    "        return [folder]\n",
    "    results = []\n",
    "    for file in folder.iterdir():\n",
    "        if file.is_dir():\n",
    "            results.extend(get_files_recursive(file))\n",
    "        else:\n",
    "            results.append(file)\n",
    "    return results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def recursive_search(folder, file_name):\n",
    "    print(folder)\n",
    "    results = []\n",
    "    for file in folder.iterdir():\n",
    "        if file.is_dir():\n",
    "            results.extend(recursive_search(file, file_name))\n",
    "        elif file.stem == file_name:\n",
    "            results.append(file)\n",
    "    return results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def remove_comments(string):\n",
    "    pattern = r\"(\\\".*?\\\"|\\'.*?\\')|(/\\*.*?\\*/|//[^\\r\\n]*$)\"\n",
    "    # first group captures quoted strings (double or single)\n",
    "    # second group captures comments (//single-line or /* multi-line */)\n",
    "    regex = re.compile(pattern, re.MULTILINE|re.DOTALL)\n",
    "    def _replacer(match):\n",
    "        # if the 2nd group (capturing comments) is not None,\n",
    "        # it means we have captured a non-quoted (real) comment string.\n",
    "        if match.group(2) is not None:\n",
    "            return \"\" # so we will return empty to remove the comment\n",
    "        else: # otherwise, we will return the 1st group\n",
    "            return match.group(1) # captured quoted-string\n",
    "    return regex.sub(_replacer, string)"
   ]
  },
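  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick sanity check of `remove_comments` on a made-up Java snippet: the `//` inside the string literal must survive while both real comments are removed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "snippet = '''\n",
    "int x = 1; // a trailing comment\n",
    "/* a block\n",
    "   comment */\n",
    "String url = \"http://example.com\"; // keep the literal, drop this comment\n",
    "'''\n",
    "print(remove_comments(snippet))"
   ]
  },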
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def read_file(f):\n",
    "    try:\n",
    "        return f.read_text()\n",
    "    except UnicodeDecodeError:\n",
    "        return f.read_text(encoding='cp1252')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def similar(a, b):\n",
    "    return SequenceMatcher(None, a, remove_comments(read_file(b))).ratio()"
   ]
  },
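  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note the asymmetric signature of `similar`: the first argument is already comment-stripped source text, the second a path to a candidate file. An illustration of the underlying `SequenceMatcher` score on two made-up strings:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustration on plain strings (no file involved); the snippets are hypothetical\n",
    "a = \"public int add(int a, int b) { return a + b; }\"\n",
    "b = \"public int add(int x, int y) { return x + y; }\"\n",
    "print(round(SequenceMatcher(None, a, b).ratio(), 2))"
   ]
  },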
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def delete_project_files(projects_to_delete):\n",
    "    for p in projects_to_delete:\n",
    "        path = Path(f\"{reference_projects_folder}/{p}\")\n",
    "        if path.exists():\n",
    "            print(path)\n",
    "            shutil.rmtree(path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3 File Lookup\n",
    "The underlying assumption is that if a code file is duplicated then its clone will have the same file name. Thus, we create a mapping from file name => path for a quick lookup of potential duplication candidates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "file_lookup = defaultdict(list)\n",
    "for p in tqdm(list(reference_projects_folder.iterdir())):\n",
    "    for f in get_files_recursive(p):\n",
    "        file_lookup[f.stem].append(f)"
   ]
  },
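  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick sanity check of the lookup (`'Main'` is just a hypothetical example stem):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(len(file_lookup), 'distinct file stems')\n",
    "print(len(file_lookup['Main']), \"files with stem 'Main' in java-pretrain\")"
   ]
  },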
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "save_pickled(file_lookup, \"file_lookup\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4 Duplication detection\n",
    " 1. Iterate through all projects in java-small/medium  \n",
    " 2. If a project has the exact same name as one in java-pretrain, we instantly mark the whole project folder to be deleted\n",
    " 3. Get all files within this project  \n",
    " 4. For every file:  \n",
    "     - find duplication candidates by consulting lookup  \n",
    "     - Restrict set of candidate duplicates to files where the file sizes of search and candidate file differ by at most 5%  \n",
    " 5. Detect projects in java-pretrain that have a large overlap with the project, i.e., > 25% of the files in the search project have candidates in the candidate project  \n",
    " 6. These projects will later be deleted from java-pretrain"
   ]
  },
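  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A tiny numeric example (hypothetical values) of the two thresholds used in the next cell:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical sizes/counts, only to illustrate the filters below\n",
    "size1, size2 = 10000, 10300  # bytes; ratio ~0.971 passes the 5% size filter\n",
    "print(0.95 < size1 / size2 < 1.05)\n",
    "num_matches, num_files = 30, 110  # ~27% overlap in a project with >100 files\n",
    "print(num_matches / num_files > 0.25 and num_files > 100)"
   ]
  },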
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "results = dict()\n",
    "projects_to_delete = []\n",
    "for p1 in tqdm(list(projects_folder.iterdir())):\n",
    "    search_files = get_files_recursive(p1)\n",
    "    num_files = len(search_files)\n",
    "    print(p1.stem, num_files)\n",
    "\n",
    "    project_candidates = defaultdict(list)\n",
    "    if p1.stem in {p.stem for p in reference_projects}:\n",
    "        projects_to_delete.append(p1.stem)\n",
    "\n",
    "    for i, search_file in enumerate(search_files):\n",
    "        print(f\"{i}/{num_files}\", end='\\r')\n",
    "        similar_files = []\n",
    "        for candidate_file in file_lookup[search_file.stem]: \n",
    "            size1 = os.path.getsize(search_file)\n",
    "            size2 = os.path.getsize(candidate_file)\n",
    "            if size2 == 0:\n",
    "                size_diff = 1 if size1 == size2 else 0\n",
    "            else:\n",
    "                size_diff = size1 / size2\n",
    "            if size_diff > 0.95 and size_diff < 1.05:\n",
    "                similar_files.append(candidate_file)\n",
    "                project_candidates[candidate_file.parts[len(reference_projects_folder.parts)]].append(candidate_file)\n",
    "\n",
    "\n",
    "    for k in project_candidates.keys():\n",
    "        project_candidates[k] = len(set(project_candidates[k]))\n",
    "    results[p1.stem] = (project_candidates, num_files)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(projects_to_delete)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "delete_project_files(projects_to_delete)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Find project candidates in java-pretrain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "projects_to_delete_2 = set()\n",
    "for project_name, (project_candidates, num_files) in results.items():\n",
    "    for project_candidate, num_matches in project_candidates.items():\n",
    "        if num_matches / num_files > 0.25 and num_files > 100:\n",
    "            projects_to_delete_2.add(project_candidate)\n",
    "            print(project_name, project_candidate)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(projects_to_delete_2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "delete_project_files(projects_to_delete_2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5 File similarity deduplication\n",
    "The above steps are only a course filtering step to already delete obvious duplicates. It can still be that files from different projects are very similar. This can only be detected by comparing the contents. This is an expensive computation and is done by the `deduplicate-java-pretrain.py` script. The result of this script are lists of files that should be deleted. Deletion is done manually here."
   ]
  },
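  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, a minimal sketch of the file-level comparison that `deduplicate-java-pretrain.py` performs, assuming it reuses the helpers and `file_lookup` from above; the 0.9 similarity threshold is a hypothetical value and the actual script may differ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only (assumption): the real work happens in deduplicate-java-pretrain.py\n",
    "files_to_delete_sketch = set()\n",
    "for search_file in get_files_recursive(projects_folder):\n",
    "    content = remove_comments(read_file(search_file))\n",
    "    for candidate_file in file_lookup[search_file.stem]:\n",
    "        # 0.9 is a hypothetical threshold for 'near-identical'\n",
    "        if similar(content, candidate_file) > 0.9:\n",
    "            files_to_delete_sketch.add(candidate_file)"
   ]
  },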
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset = 'java-small'\n",
    "partition = 'test'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "files_to_delete = load_pickled(f\"{CODE2SEQ_RAW_DATA_PATH}/java-pretrain/files_to_delete_{dataset}_{partition}.p\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for f in tqdm(files_to_delete):\n",
    "    if f.exists():\n",
    "        f.unlink()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}