creating-a-custom-model-for-github-copilot.md

title: Creating a custom model for GitHub Copilot
shortTitle: Create a custom model
intro: You can fine-tune {% data variables.product.prodname_copilot_short %} code completion by creating a custom model based on code in your organization's repositories.
permissions: Owners of organizations enrolled in the {% data variables.release-phases.public_preview %}.
product: The organization must belong to an enterprise with a {% data variables.product.prodname_copilot_enterprise_short %} subscription.
versions:
  feature: copilot-custom-models
topics:
  - Copilot
redirect_from:
  - /copilot/managing-copilot/managing-github-copilot-in-your-organization/customizing-copilot-for-your-organization/creating-a-custom-model-for-github-copilot
  - /copilot/managing-copilot/managing-github-copilot-in-your-organization/enhancing-copilot-for-your-organization/creating-a-custom-model-for-github-copilot

[!NOTE] Custom models for {% data variables.product.prodname_copilot_enterprise %} is in {% data variables.release-phases.public_preview %} and is subject to change. During the {% data variables.release-phases.public_preview %}, there is no additional cost to {% data variables.product.prodname_copilot_enterprise_short %} customers enrolled in the {% data variables.release-phases.public_preview %} for creating or using a custom model.

Prerequisite

The code on which you want to train a custom model must be hosted in repositories owned by your organization on {% data variables.product.github %}.

Limitations

  • For the {% data variables.release-phases.public_preview %}, an enterprise can deploy one custom model in a single organization.
  • Code completion suggestions based on the custom model are only available to managed users who get a {% data variables.product.prodname_copilot_enterprise_short %} subscription from the organization in which the custom model is deployed. For more information, see "AUTOTITLE."
  • The custom model is not used for code suggested in responses by {% data variables.product.prodname_copilot_chat %}.

About {% data variables.product.prodname_copilot_short %} custom models

By default, {% data variables.product.prodname_copilot %} uses a large language model that has been trained on a large number of public code repositories, so that it can provide code completion for a wide range of programming languages in many different contexts. You can use this model as the basis for creating a custom large language model that you train specifically on your own code. This process is often known as fine-tuning.

By creating a custom model you enable {% data variables.product.prodname_copilot %} to show you code completion suggestions that are:

  • Based on code in your own designated repositories.
  • Created for proprietary or less publicly represented programming languages.
  • Tailored according to your organization's coding style and guidelines.

This provides:

  • Personalization - {% data variables.product.prodname_copilot_short %} has a detailed knowledge of your codebase, including available modules, functions, and internal libraries. A custom model may be particularly beneficial if your code is not typical of the wide range of code used to train the base model.
  • Efficiency and quality - {% data variables.product.prodname_copilot_short %} is better equipped to help you write code faster and with fewer errors.
  • Privacy - The custom model’s training process, hosting and inferencing are secure and private to your organization. Your data always remains yours, is never used to train another customer’s model, and your custom model is never shared.

About model creation

Currently, in the {% data variables.release-phases.public_preview %}, only one organization in an enterprise is permitted to create a custom model.

As an owner of the organization that's permitted to create a custom model, you can choose which of your organization's repositories to use to train the model. You can train the model on one, several, or all of the repositories in the organization. The model is trained on the content of the default branches of the selected repositories. Optionally, you can specify that only code written in certain programming languages should be used for training. The custom model will be used for generating code completion suggestions in all file types, irrespective of whether that type of file was used for training.

You can also choose whether telemetry data (such as the prompts entered by users and the suggestions generated by {% data variables.product.prodname_copilot_short %}) should be used when training the model. For more information, see "Telemetry data collection and usage for custom models," later in this article.

Once initiated, custom model creation will take many hours to complete. You can check the progress of the training in your organization's settings. When model creation completes - or if it fails to complete - the person who initiated the model training will be notified by email.

If model creation fails, {% data variables.product.prodname_copilot_short %} will continue to use the current model for generating code completion suggestions.

About model usage

As soon as the custom model is successfully created, all managed users in your enterprise who get {% data variables.product.prodname_copilot_enterprise_short %} access from the organization in which the custom model is deployed will start to see {% data variables.product.prodname_copilot_short %} code completion suggestions that are generated using the custom model. The custom model will always be used for any code these users edit, irrespective of where the code resides. Users cannot choose which model is used to generate the code completion suggestions they see.
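
The rules above, together with those under "Limitations," can be summarized in a short sketch. It only restates the documented behavior and is not how the product is implemented. The function name `model_for_request`, the `Feature` enum, and the organization names are hypothetical, and the sketch assumes the user is a managed user.

```python
from enum import Enum
from typing import Optional


class Feature(Enum):
    CODE_COMPLETION = "code_completion"
    CHAT = "chat"


def model_for_request(seat_org: str,
                      custom_model_org: Optional[str],
                      feature: Feature) -> str:
    """Return "custom" or "base" according to the rules in this article.

    seat_org: organization that grants the user's Copilot Enterprise seat.
    custom_model_org: organization where the custom model is deployed,
        or None if no custom model has been created successfully.
    """
    # Code suggested in chat responses never uses the custom model.
    if feature is not Feature.CODE_COMPLETION:
        return "base"
    # Only users whose seat comes from the deploying organization are affected.
    if custom_model_org is not None and seat_org == custom_model_org:
        return "custom"
    return "base"
```

Note that the location of the code being edited is deliberately not an input: the custom model applies irrespective of where the code resides.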

When you can benefit from a custom model

The value of a custom model is most pronounced in environments with:

  • Proprietary or less publicly represented programming languages
  • Internal libraries or custom frameworks
  • Custom standards and company-specific coding practices

However, even in standardized environments, fine-tuning offers an opportunity to align {% data variables.product.prodname_copilot_short %} code completion more closely with your organization’s established coding practices and standards.

Assess the effectiveness of a custom model

While some coding environments are more likely to benefit from fine-tuning, there is no guaranteed correlation between specific behaviors in a codebase and the quality of the results you get from a custom model. It is advisable to assess the use and satisfaction levels of {% data variables.product.prodname_copilot %} code completion suggestions before and after the implementation of a custom model.

  • Use the {% data variables.product.prodname_dotcom %} API to assess the usage of {% data variables.product.prodname_copilot %}. See "AUTOTITLE."
  • Survey developers to assess their level of satisfaction with {% data variables.product.prodname_copilot %} code completion suggestions.

Comparing results from the API and developer survey, from before and after the implementation of a custom model, will give you an indication of the effectiveness of the custom model.
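
As a rough illustration of the before-and-after comparison, the following sketch computes an acceptance rate for each period from daily usage records. The record fields (`date`, `suggestions`, `acceptances`) and the sample numbers are hypothetical placeholders. You would need to map them from whatever the usage API actually returns for your organization.

```python
from datetime import date


def acceptance_rate(records):
    """Accepted suggestions divided by shown suggestions, or None if no data."""
    shown = sum(r["suggestions"] for r in records)
    accepted = sum(r["acceptances"] for r in records)
    return accepted / shown if shown else None


def compare_before_after(records, deployed_on):
    """Split daily records at the deployment date and compare acceptance rates."""
    before = [r for r in records if r["date"] < deployed_on]
    after = [r for r in records if r["date"] >= deployed_on]
    return acceptance_rate(before), acceptance_rate(after)


# Made-up sample data, for illustration only.
sample = [
    {"date": date(2024, 9, 1), "suggestions": 200, "acceptances": 50},
    {"date": date(2024, 9, 2), "suggestions": 300, "acceptances": 75},
    {"date": date(2024, 9, 10), "suggestions": 250, "acceptances": 100},
]
before_rate, after_rate = compare_before_after(sample, date(2024, 9, 5))
print(f"before: {before_rate:.2f}, after: {after_rate:.2f}")
```

A change in acceptance rate on its own is only an indication, so it is best read alongside the developer survey results.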

Creating a custom model

You can use your organization settings to create a custom large language model.

{% data reusables.profile.access_org %} {% data reusables.profile.org_settings %}

  1. In the left sidebar, click {% octicon "copilot" aria-hidden="true" %} {% data variables.product.prodname_copilot_short %} then click Custom model.

  2. On the "Custom models" page, click Train a new custom model.

  3. Under "Select repositories," choose either Selected repositories or All repositories.

  4. If you chose Selected repositories, select the repositories you want to use for training then click Apply.

  5. Optionally, if you want to train your model only on code written in certain programming languages, under "Specify languages," start typing the name of a language you want to include. Select the required language from the list that's displayed. Repeat the process for each language you want to include.

  6. To improve the performance of your model, select the checkbox labeled Include data from prompts and suggestions.

    [!NOTE] If the checkbox isn't available to select, it indicates that the Telemetry data collection policy for custom models has been disabled in your organization's settings. For information on how to change policies for your organization, see "AUTOTITLE."

    By selecting this option you allow {% data variables.product.prodname_copilot_short %} to collect data for prompts that users submitted and the code completion suggestions that were generated. Once sufficient data has been collected, {% data variables.product.prodname_copilot_short %} will use this as part of the model training process, allowing it to produce a more effective model.

    For more information, see "Telemetry data collection and usage for custom models," later in this article.

  7. Click Create new custom model.

Checking the progress of model creation

You can check in your organization settings for an indication of how model creation is progressing.

  1. Go to your organization's settings for {% data variables.product.prodname_copilot_short %} custom models. See "Creating a custom model" above.

  2. The first time you train a model, the page that's displayed shows the training results.

    If this is not the first training, the current and previous training attempts are listed. To see details of the current training process, click the first ellipsis button (...), then click Training details.

Reasons for training failure

Model training may fail for a variety of reasons, including:

  • Not enough data or non-representative data. Lack of data provided for training, or too much replication in the data, may make the fine-tuning unstable.
  • Non-differentiated data. If the data is not sufficiently different from the public data on which the base model was trained, training may fail or the quality of code completion suggestions from the custom model may be only marginally improved.
  • Unexpected file types or formats. A data preprocessing step may encounter unexpected file types and formats, which cause it to fail. A solution may be to specify only certain file types for training.
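
Because too much replication in the training data is one of the listed causes of failure, you might want a rough estimate of duplication in a local checkout before selecting a repository. The sketch below is one possible pre-check, not part of the product and not how the training pipeline preprocesses data. The helper name `duplicate_ratio` and the default suffix filter are assumptions, and only byte-for-byte copies are detected.

```python
import hashlib
from collections import Counter
from pathlib import Path


def duplicate_ratio(root, suffixes=(".py",)):
    """Fraction of matching files that are exact copies of another matching file."""
    digests = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            digests[hashlib.sha256(path.read_bytes()).hexdigest()] += 1
    total = sum(digests.values())
    if total == 0:
        return 0.0
    # Every file beyond the first with a given digest counts as a duplicate.
    return (total - len(digests)) / total
```

For example, `duplicate_ratio("path/to/checkout", suffixes=(".py", ".js"))` returns a value between 0 and 1. The article does not state what level of duplication is acceptable, so treat the number as a prompt for manual review only.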

Retraining or deleting the custom model

As an organization owner, you can update or delete the custom model from your organization's settings page.

Retraining the model updates it to include any new code that has been added to the repositories you selected for training. You can retrain the model once a week.

  1. Go to your organization's settings for {% data variables.product.prodname_copilot_short %} custom models. See "Creating a custom model" above.
  2. On the model training page, click the first ellipsis button (...), then click either Retrain model or Delete model.

If you retrain the model, {% data variables.product.prodname_copilot_short %} will continue to use the current model to generate code completion suggestions until the new model is ready. Once the new model is ready, it will automatically be used for code completion suggestions for all managed users who get a {% data variables.product.prodname_copilot_enterprise_short %} subscription from the organization.

If you delete the custom model, {% data variables.product.prodname_copilot_short %} will use the base model for generating code completion suggestions for all users who get a {% data variables.product.prodname_copilot_short %} subscription from the organization.
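
If you schedule retraining, the once-a-week limit mentioned above can be tracked with a small helper such as the following sketch. It assumes that "once a week" means at least seven days since the previous training was started. This article does not say whether the interval is measured from the start of training, from its completion, or per calendar week, so treat the constant and both function names as hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "once a week" is interpreted as a seven-day gap.
RETRAIN_INTERVAL = timedelta(days=7)


def next_retrain_allowed(last_training_started):
    """Earliest time at which another retraining could be requested."""
    return last_training_started + RETRAIN_INTERVAL


def can_retrain(last_training_started, now):
    """True if the assumed seven-day interval has elapsed."""
    return now >= next_retrain_allowed(last_training_started)


last = datetime(2024, 10, 1, 9, 0, tzinfo=timezone.utc)
print(next_retrain_allowed(last).isoformat())
```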

Telemetry data collection and usage for custom models

When you create a custom model, you can choose to allow {% data variables.product.company_short %} to collect telemetry data for the purposes of training the model. This data is used to improve the quality of the code completion suggestions the model can generate.

What telemetry data is collected?

  • Prompts: This includes all the information sent to the {% data variables.product.prodname_copilot %} language model by the {% data variables.product.prodname_copilot_short %} extension, including context from your open files.
  • Suggestions: The code completion suggestions that {% data variables.product.prodname_copilot_short %} generates.
  • Code snippet: A snapshot of the code 30 seconds after a suggestion is accepted, capturing how the suggestion was integrated into the codebase. This helps determine whether the suggestion was accepted as is or modified by the user before final integration.
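
To make the purpose of the code snippet concrete, the sketch below shows one simple way a suggestion could be compared with a later snapshot to tell whether it was kept as is or modified. It is purely illustrative and is not {% data variables.product.company_short %}'s implementation. The function name, the labels, and the whitespace normalization are all assumptions.

```python
def _normalize(text):
    """Collapse all runs of whitespace so that formatting changes are ignored."""
    return " ".join(text.split())


def classify_acceptance(suggestion, snapshot):
    """Label an accepted suggestion by comparing it with a later code snapshot."""
    if _normalize(suggestion) in _normalize(snapshot):
        return "accepted_as_is"
    return "modified"


print(classify_acceptance("return a + b", "def add(a, b):\n    return a + b\n"))
```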

How is telemetry data used?

Telemetry data is primarily used to fine-tune the {% data variables.product.prodname_copilot_short %} custom model to better understand and predict your organization’s coding patterns. Specifically, it helps:

  • Enhance model accuracy: By analyzing the collected telemetry, {% data variables.product.prodname_copilot_short %} refines your custom model to increase the relevance and accuracy of future coding suggestions.
  • Monitor performance: Telemetry data allows {% data variables.product.company_short %} to monitor how well custom models are performing compared to the base model, enabling ongoing improvements.
  • Feedback loops: The data helps {% data variables.product.company_short %} create feedback loops where the model learns from real-world usage, adapting to your specific coding environment over time.

Data storage and retention

  • Data storage: All telemetry data collected is stored in the {% data variables.product.prodname_copilot_short %} Data Store, a secure and restricted environment. The data is encrypted and isolated to prevent unauthorized access.
  • Retention period: Telemetry data is retained for a rolling 28-day period. After this period, the data is automatically deleted from {% data variables.product.company_short %}'s systems, ensuring that only recent and relevant data is used for model training and improvement.
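
The rolling 28-day window can be pictured with the following sketch, which checks whether a record collected at a given time would still fall inside the window. The function name and the handling of the exact boundary are assumptions. The article does not describe the precise deletion schedule.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=28)


def is_retained(collected_at, now):
    """True while a telemetry record is still inside the rolling 28-day window."""
    return now - collected_at < RETENTION


now = datetime(2024, 10, 29, tzinfo=timezone.utc)
print(is_retained(datetime(2024, 10, 15, tzinfo=timezone.utc), now))
print(is_retained(datetime(2024, 9, 15, tzinfo=timezone.utc), now))
```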

Privacy and data security

{% data variables.product.company_short %} is committed to ensuring that your organization’s data remains private and secure.

  • Exclusive use: The telemetry data collected from your organization is used exclusively for training your custom model and is never shared with other organizations or used to train other customers’ models.
  • Data leakage prevention: {% data variables.product.company_short %} implements strict data isolation protocols to prevent cross-contamination between different organizations’ data. This means that your proprietary code and information are protected from exposure to other organizations or individuals.

Important considerations

  • Opt-in for telemetry: Participation in telemetry data collection is optional and controlled via your organization’s admin policies. Telemetry data is only collected when explicitly enabled for training custom models.

  • Potential risks: Although {% data variables.product.company_short %} takes extensive measures to prevent data leakage, there are scenarios where sensitive data, such as internal links or names, could be included in the telemetry and subsequently used in training. We recommend reviewing and filtering the data you submit for training to minimize these risks.

    For more details about our data-handling practices, see the {% data variables.product.prodname_copilot %} Trust Center or review {% data variables.product.company_short %}’s data protection agreement.