tcooperFL/githubbing

githubbing

This is a GitHub analysis workbench, intended for interactive use through a REPL.

Setup

Leiningen Setup

Be sure you have defined the environment variables GITHUB_TOKEN and GITHUB_USER. To find your token, open your profile in GitHub and navigate to Personal access tokens; if you don't have one yet, click "Generate new token". Set the GITHUB_TOKEN environment variable to that token string, and set the GITHUB_USER environment variable to your GitHub username.
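As a quick sanity check from any Clojure REPL, you can confirm the variables are visible to the process with System/getenv (the helper below is just an illustration, not part of githubbing):

```clojure
(defn check-env
  "Returns a map from each variable name to true/false depending on
  whether it is set to a non-empty value in this process's environment."
  [& var-names]
  (into {}
        (for [v var-names]
          [v (boolean (seq (System/getenv v)))])))

;; Both should be true before starting the REPL workflow.
(check-env "GITHUB_TOKEN" "GITHUB_USER")
```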

Be sure you are in the githubbing top-level folder where the project.clj file is.

Start this program with a Clojure REPL.

$ lein repl

If you see no errors, you are ready to go. If lein isn't found, you'll need to install Leiningen.

Organization Configuration

Update src/githubbing/config.clj with the name of your GitHub organization and the product cues that indicate a repo may be a product repo. This supplies the default organization parameter for most functions that take an organization.
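For orientation, the config namespace presumably looks something like the sketch below. The var names and the cue format are assumptions for illustration; check the actual file for the real shape.

```clojure
(ns githubbing.config)

;; Default GitHub organization slug used by functions that take an
;; optional organization argument.
(def organization "my-organization")

;; Substrings in a repo name that suggest it may be a product repo.
(def product-cues ["-sdk" "-service" "product-"])
```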

Usage

You will be working primarily in the githubbing.core namespace. Here is an overview of the workflow.

  1. Create a collection of repos, either by fetching from github or loading a previously saved collection from a file. For example, to fetch all the repos using defaults from config.clj,

    (def repos (repos/fetch-repos))

  2. Create a report from a collection of repos. It may be the entire collection or any subset. For example, here we create a classification report on the first 10 repos in the collection. Each namespace under githubbing.reports has a report function.

    (def r1 (classy/report (take 10 repos)))

  3. Use an export function to view the report or write it to disk. Here we first preview it and then write it to disk.

    (csv/preview r1)
    (csv/export r1 "my report.csv")

Fetching repository information

You can fetch your repository information directly from GitHub using the GitHub APIs, or load your repositories from a previously saved file.

Fetching from GitHub

A repo is a map defined by the GitHub v3 APIs. You can directly fetch repos for an organization using the repos/fetch-repos function. This returns a lazy sequence of repos, fetched in batches. Note that some may not be fetched until you actually attempt to access repo maps later in the returned collection.
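That batching behavior can be illustrated in plain Clojure. The sketch below stands in for fetch-repos (it is not the real implementation): pages are only requested as the sequence is consumed, so taking two repos touches only the first page.

```clojure
;; Illustration only -- not the real fetch-repos implementation.
(def pages-fetched (atom 0))

(defn fetch-page
  "Pretend REST call returning one page of three repo maps."
  [page-num]
  (swap! pages-fetched inc)
  (map (fn [i] {:name (str "repo-" i)})
       (range (* page-num 3) (* (inc page-num) 3))))

(defn lazy-repos
  "Infinite lazy sequence of repos, fetched one page at a time."
  ([] (lazy-repos 0))
  ([page] (lazy-cat (fetch-page page) (lazy-repos (inc page)))))

(def two (doall (take 2 (lazy-repos))))
@pages-fetched  ; => 1, only the first page was fetched
```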

To get all the repo maps for the default organization,

(def repos (repos/fetch-repos))

Supply a GitHub organization name (actually, a slug) to get repos for that organization, overriding the config default.

(def repos (repos/fetch-repos "mongodb"))

Functions that invoke the GitHub v3 APIs are wrapped in memoize by default, caching results in memory so that these REST endpoints, which are time-consuming and count against API rate limits, are not re-invoked.
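memoize is a core Clojure function; its effect on repeated calls can be seen with a stand-in for the REST call:

```clojure
;; Count how many times the underlying "REST call" actually runs.
(def call-count (atom 0))

(defn fetch-repo*
  "Stand-in for a real GitHub API call."
  [org repo-name]
  (swap! call-count inc)
  {:org org :name repo-name})

(def fetch-repo (memoize fetch-repo*))

(fetch-repo "mongodb" "docs-tools")  ; invokes fetch-repo*
(fetch-repo "mongodb" "docs-tools")  ; served from the cache
@call-count  ; => 1
```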

Loading from a file

Instead of querying GitHub, you can use a previous snapshot for testing purposes. Load up the one in the resources folder.

(def repos (load-repos "resources/my-repos.json"))

Note that this will likely be out-of-date, so you don't want to use it as the basis of any updates.

Save a snapshot of any collection of repo maps into a file to load in at a later time.

(save-repos (take 10 repos) "resources/my-repos.json")

Files generated by save-repos are in standard JSON format and, when reloaded with load-repos, create an equal (=) collection.

Creating reports

A report is the result of some analysis specific to that report. So, for example, when creating a classification report, the repos are accessed and analyzed, and the returned report contains the result of that analysis.

Reports all have 3 keys, as defined in githubbing.reports:

  • :name - A simple human-readable name for this report, used as the default in exports.
  • :headings - A sequence of column descriptors that determines the column order, providing the column header and row accessor for each column.
  • :data - A collection of maps, each representing a row with a key for each heading. Each row is the result of the report analysis on one of the repos provided when creating the report. This can be all the repos in the organization or, as above, just a few to create a test report.
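As an illustration, a report with those three keys might look like this. The heading-descriptor shape here (:header plus :accessor) is an assumption of the sketch; the real descriptors in githubbing.reports may differ.

```clojure
;; Hypothetical report map with the three standard keys.
(def r1
  {:name "classification"
   :headings [{:header "Repo"  :accessor :name}
              {:header "Class" :accessor :classification}]
   :data [{:name "stitch-js-sdk" :classification "product"}
          {:name "docs-tools"    :classification "tooling"}]})

(defn rows
  "Header row followed by one row per data map, ordered by :headings."
  [{:keys [headings data]}]
  (cons (map :header headings)
        (for [row data]
          (map (fn [h] ((:accessor h) row)) headings))))

(rows r1)
;; => (("Repo" "Class")
;;     ("stitch-js-sdk" "product")
;;     ("docs-tools" "tooling"))
```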

Each of the other namespaces under githubbing.reports contains code for a different report. Each has an implementation of create-report specific to that report.

Exporting report data

The githubbing.exports folder has namespaces representing exporters that take reports and export them in various formats. For example, to export a report r1 in CSV format, use the export function in githubbing.exports.csv.

(csv/export r1)

If there is an option to preview the report by writing it to stdout, expect to find a preview function in that namespace.

(csv/preview r1)

For the CSV exporter, you may want to create a report from a subset of the repos, preview the export, and copy-paste it directly into Excel or Numbers to check the layout before committing to the full report.

Updating repo topics with classifications

If you have commit permission for repos, you can ensure that the computed classification appears among the topics for each repo.

For example, suppose you ran a classification on a repo in your organization.

(def repo (repos/fetch-repo "my-organization" "my-repo"))
(def classified (classy/classify repo))
(:classification classified)

If you wish to ensure that this repository has this classification as a topic, and assuming you have commit access, you can cause that classification to be added to the topics for this repo.

(update-topics classified)

If you have commit access to an entire organization, you can ensure that every repo in the organization has a topic corresponding to the classification that githubbing computes.

(doseq [r (repos/fetch-repos "my-organization")]
  (update-repo-topics (classy/classify r)))

Note: All the GET calls to GitHub are memoized, so if the same call is made a second time in this REPL session, the cached result from the first call is used. If you subsequently update a repo and attempt to fetch it again in the same session, you will not see your updates.

To exit, press ^D or type

(exit)

then restart the REPL.

Downloading Org Repos

Suppose you want to download all repos for an organization to your local file system. This toolkit doesn't do the whole job, but it does work in concert with your zsh shell and git command line to get the job done.

Run download-repos given your organization, destination folder, and an optional regular expression to select which repos to include based on a match of the repo name. For example,

(def sdks (download-repos "mongodb" "/tmp/github/mongodb" #".*sdk"))

This example creates a download shell file that contains git commands to clone all mongodb repos whose names contain "sdk". Run this in the REPL and you will get output like this:

Wrote git commands for 3 out of 171 repos provided.
Open a zsh term and execute
    cd /tmp/github/mongodb
    sh ./download_repos.sh |& tee download.log

The generated shell script contains a git clone statement for each matched repo. Then you might go into each clone and analyze the languages with cloc.

    cloc --exclude-dir=.git . |& tee cloc.txt
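The script-generation step can be sketched in a few lines of Clojure. This is an assumption about what download-repos writes, not its actual implementation; :clone_url is the standard GitHub API property for a repo's clone URL.

```clojure
(require '[clojure.string :as string])

(defn clone-script
  "Shell-script text with one `git clone` line per repo whose
  :name matches the regular expression re."
  [repos re]
  (->> repos
       (filter #(re-find re (:name %)))
       (map #(str "git clone " (:clone_url %)))
       (string/join "\n")))

(def sample-repos
  [{:name "java-sdk"   :clone_url "https://github.com/mongodb/java-sdk.git"}
   {:name "docs-tools" :clone_url "https://github.com/mongodb/docs-tools.git"}])

(clone-script sample-repos #".*sdk")
;; => "git clone https://github.com/mongodb/java-sdk.git"
```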

Getting Language Stats

The GitHub API includes a call that returns the distribution of code in a given repo by language, in bytes. You can see the URL for this call in the :languages_url property of the maps returned by fetch-repos. You can call it directly with this function:

(repos/fetch-languages "mongodb" "docs-tools")
=>
{:Python 1086508,
 :JavaScript 810387,
 :CSS 206605,
 :HTML 133138,
 :Makefile 4431,
 :C++ 29}

Add the languages to an existing collection of repos using the add-languages function.

(def repos+ (map repos/add-languages sdks))

That will augment the sequence of repos, adding a :languages property to each with the languages map as the value.
(Note: This does NOT push that back to GitHub!)

To aggregate language counts from a sequence of repos,

(reduce (partial merge-with +) (map :languages repos+))

Then sort it in descending order (*1 is the previous REPL result):

(sort-by second > (seq *1))

You can then use this to filter and easily get into an Excel chart.
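Since the aggregation uses only core functions, it can be tried on sample data directly:

```clojure
;; Two repos with :languages already added, byte counts per language.
(def sample
  [{:name "a" :languages {:Python 100 :CSS 20}}
   {:name "b" :languages {:Python 50 :Go 75}}])

(def totals (reduce (partial merge-with +) (map :languages sample)))
;; => {:Python 150, :CSS 20, :Go 75}

(sort-by second > (seq totals))
;; => ([:Python 150] [:Go 75] [:CSS 20])
```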

Getting actual on-disk repo sizes

The GitHub API to get repository information returns a "size" property, but it does not represent the true size-on-disk of a cloned repository. To get this, you need to actually clone the repo locally and then compute the disk space taken by the full directory tree.

That's just what the function sizing/add-repo-size does. It takes a repo map that came from the Get Repository GitHub API, clones that repo in the $TMPDIR directory, adds up the bytes used by the full directory, then deletes that directory. It then returns a new repo map with :size-on-disk property with the total number of bytes taken by that on-disk repo, and :clone-result with the result of the git clone shell process.
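The directory-size part of that computation can be done with core Clojure and file-seq. The sketch below is an assumption about the approach, not the actual sizing code:

```clojure
(import 'java.io.File)

(defn dir-size
  "Total bytes of all regular files under dir, recursively --
  the kind of sum add-repo-size presumably computes after cloning."
  [^File dir]
  (->> (file-seq dir)
       (filter #(.isFile ^File %))
       (map #(.length ^File %))
       (reduce + 0)))

;; e.g. (dir-size (File. "/tmp/github/mongodb/stitch-js-sdk"))
```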

The following code will return all the sdk repos in the mongodb org, with sizes added. (Note this will take a few minutes as it does all the cloning!)

(def sdks (filter #(re-find #"sdk$" (:name %)) repos))
(def sdks+s (map sizing/add-repo-size sdks))

See the results of the cloning

(map :clone-result sdks+s)
=>
({:exit 0, :out "", :err "Cloning into 'stitch-js-sdk'...\n"}
 {:exit 0, :out "", :err "Cloning into 'stitch-android-sdk'...\n"}
 {:exit 0, :out "", :err "Cloning into 'stitch-ios-sdk'...\n"})

See the size of each

(map #(select-keys % [:name :size-on-disk]) sdks+s)
=>
({:name "stitch-js-sdk", :size-on-disk 46292762}
 {:name "stitch-android-sdk", :size-on-disk 8769580}
 {:name "stitch-ios-sdk", :size-on-disk 33177918})

Getting branch counts

To get the branches of a repo you need to make an additional call to the GitHub API. The sizing/add-branches function does this and returns its repo map argument with a :branches property that contains an array of maps that describe the branches in that repo.

Given the 3 repos that have "sdk" in their names, the following code illustrates how you can access information about branches within each repo.

(def sdk+b (map sizing/add-branches sdks))
(doseq [r sdk+b]
  (printf "\nRepo: %s\n" (:name r))
  (doseq [b (:branches r)]
    (printf "\t%s\n" (:name b))))

Repo: stitch-js-sdk
    REALMC-5838
    Release-STITCH-4077
    STITCH-3299
    master
    mburdette-patch-1
    revert-404-add-support-for-new-app-with-cluster
    support/3.x

Repo: stitch-android-sdk
    Release-4.3.0
    Release-4.3.3
    Release-4.4.0
    Release-4.4.1
    Release-4.5.0
    Release-4.6.0
    Release-4.7.0
    STITCH-3160
    master
    support/4.1.5

Repo: stitch-ios-sdk
    Release-STITCH-3017
    Release-STITCH-3346
    Release-STITCH-3383
    Release-STITCH-3632
    Release-STITCH-4078
    master
    revert-168-STITCH-2716
    sync
=> nil

Branch information returned from GitHub includes:

  • :name - the branch name
  • :commit - the URL and SHA of the branch's head commit
  • :protected - true or false
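With those keys, branch maps can be filtered like any other Clojure data; for example, listing only the protected branches (the sample maps below are illustrative):

```clojure
;; Illustrative branch maps shaped like the GitHub API response.
(def branches
  [{:name "master"      :protected true  :commit {:sha "abc123"}}
   {:name "sync"        :protected false :commit {:sha "def456"}}
   {:name "support/3.x" :protected true  :commit {:sha "789fed"}}])

(map :name (filter :protected branches))
;; => ("master" "support/3.x")
```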

Notes

  1. When you exit the REPL, your environment is not saved, so all your in-memory data is lost.
  2. If you make too many REST calls to GitHub within a period of time, GitHub will fail with an error indicating that you have exceeded your rate limit. There is an API call to find out what that limit is and when it resets; check the GitHub REST API documentation. Generally, if you wait an hour it will reset and you can try again. Since GET results in this toolkit are cached with memoize, if you don't exit the REPL you can restart your analysis and previously fetched results return from the cache without new calls, so your call count resumes only with genuinely new requests.
  3. To manually test the REST calls, use Postman configured for Basic authentication with username $GITHUB_USER and password $GITHUB_TOKEN.

Todo

Consider revising this to use the tentacles GitHub API library.

About

A Github Repo Analyzer in Clojure
