This is a GitHub analysis workbench, intended for interactive use through a REPL.
Be sure you have defined the environment variables `GITHUB_TOKEN` and `GITHUB_USER`.

To find your token, visit your GitHub profile settings and navigate to Personal access tokens. If you don't have one yet, click "Generate new token". Set the `GITHUB_TOKEN` environment variable to this token string, and set the `GITHUB_USER` environment variable to your GitHub username.
Be sure you are in the githubbing top-level folder, where the `project.clj` file is.

Start this program with a Clojure REPL:

```
$ lein repl
```

If you see no errors, then you are ready to go. If `lein` isn't defined, you'll need to install Leiningen.
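As a first sanity check, you can confirm that the REPL's JVM sees the environment variables defined earlier, using the standard `System/getenv`:

```clojure
;; Both should return non-nil strings if the variables are set.
(System/getenv "GITHUB_USER")
(System/getenv "GITHUB_TOKEN")
```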
Update `src/githubbing/config.clj` with the name of your GitHub organization and the product cues that indicate that a repo may be a product repo. This config supplies the default organization argument for most functions that take an organization.

You will be working primarily in the `githubbing.core` namespace. Here is an overview of the workflow.
- Create a collection of repos, either by fetching from GitHub or by loading a previously saved collection from a file. For example, to fetch all the repos using the defaults from `config.clj`:

  ```clojure
  (def repos (repos/fetch-repos))
  ```

- Create a report from a collection of repos. It may be the entire collection or any subset. For example, here we create a classification report on the first 10 repos in the collection. Each namespace under `githubbing.reports` has a `report` function:

  ```clojure
  (def r1 (classy/report (take 10 repos)))
  ```

- Use an export function to view the report or write it to disk. Here we first preview it and then write it to disk:

  ```clojure
  (csv/preview r1)
  (csv/export r1 "my report.csv")
  ```
You can get your repository information directly from GitHub using the GitHub APIs, or load your repositories from a previously saved file.
A repo is a map defined by the GitHub v3 APIs. You can directly fetch repos for an organization using the `repos/fetch-repos` function. This returns a lazy sequence of repos, fetched in batches. Note that some may not be fetched until you actually attempt to access repo maps later in the returned collection.

To get all the repo maps for the default organization:

```clojure
(def repos (repos/fetch-repos))
```
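Because the sequence is lazy, batches are fetched on demand. If you'd rather fetch everything eagerly up front, you can force the sequence with the standard `doall` (a minimal sketch; whether you want this depends on how many repos the organization has):

```clojure
;; Realizes the entire lazy sequence immediately, so all API batches
;; are fetched now rather than on first access.
(def repos (doall (repos/fetch-repos)))
```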
Supply a GitHub organization name (actually, its slug) to get repos for that organization, overriding the config default:

```clojure
(def repos (repos/fetch-repos "mongodb"))
```
Functions that invoke the GitHub v3 APIs are wrapped in `memoize` by default, caching results in memory so that we don't have to worry about re-invoking these REST endpoints, which are time-consuming and may trigger API rate limits.
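To illustrate the pattern (a sketch only; `github-get` here is a hypothetical helper, not the project's actual function):

```clojure
;; Wrapping the fetch in memoize means repeated calls with the same
;; arguments return the in-memory cached result instead of hitting
;; the REST endpoint again.
(def fetch-languages
  (memoize
   (fn [org repo]
     (github-get (str "/repos/" org "/" repo "/languages")))))
```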
Instead of querying GitHub, you can use a previous snapshot for testing purposes. Load the one in the resources folder:

```clojure
(def repos (load-repos "resources/my-repos.json"))
```

Note that this snapshot will likely be out of date, so you don't want to use it as the basis of any updates.
Save a snapshot of any collection of repo maps into a file to load at a later time:

```clojure
(save-repos (take 10 repos) "resources/my-repos.json")
```

Files generated by `save-repos` are in standard JSON format, and when reloaded with `load-repos` they produce a collection equal (`=`) to the original.
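For example, you can verify the round trip in the REPL (assuming you just saved those same 10 repos as above):

```clojure
;; The reloaded collection should be = to what was saved.
(= (take 10 repos)
   (load-repos "resources/my-repos.json"))
;; => true
```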
A report is the result of some analysis specific to that report. For example, when creating a classification report, the repos are accessed and analyzed, and the returned report contains the result of that analysis.

Reports all have 3 keys, as defined in `githubbing.reports`:

- `:name` - A simple human-readable name for this report. Used as a default in exports.
- `:headings` - A sequence of column descriptors used to order the columns, providing a column header and a row accessor for each column.
- `:data` - A collection of maps, each representing a row, with a key for each heading. Each row represents the result of the report analysis on one repo provided when the report was created. This can be all the repos in the organization or, as above, just a few to create a test report.
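Concretely, a report map might look something like this (a hypothetical illustration; the exact descriptor shape and values are assumptions, not copied from the code):

```clojure
;; Hypothetical report shape: names, headings, and rows invented
;; for illustration only.
{:name "classification"
 :headings [{:label "Name" :accessor :name}
            {:label "Classification" :accessor :classification}]
 :data [{:name "repo-a" :classification "product"}
        {:name "repo-b" :classification "internal"}]}
```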
Each of the other namespaces under `githubbing.reports` contains code for a different report. Each has an implementation of `create-report` specific to that report.
The `githubbing.exports` folder has namespaces representing exporters that take reports and export them in various formats. For example, to export a report `r1` in CSV format, use the `export` function in `githubbing.exports.csv`:

```clojure
(csv/export r1)
```

If there is an option to preview the report by writing to stdout, you can expect there to be a `preview` function in that namespace:

```clojure
(csv/preview r1)
```
For the CSV exporter, you may want to create a report from a subset of the repos, preview the export, copy and paste it directly into Excel or Numbers, and check the layout before committing to the full report.
If you have commit permission for repos, you can ensure that the computed classification is among the repo's topics.

For example, suppose you ran a classification on a repo in your organization:

```clojure
(def repo (repos/fetch-repo "my-organization" "my-repo"))
(def classified (classy/classify repo))
(:classification classified)
```

If you wish to ensure that this repository has this classification as a topic, and assuming you have commit access, you can cause that classification to be added to the topics for this repo:

```clojure
(update-topics classified)
```

If you have commit access to an entire organization, you can ensure that every repo in the organization has a topic corresponding to the classification that githubbing makes:

```clojure
(doseq [r (repos/fetch-repos "my-organization")]
  (update-repo-topics (classy/classify r)))
```
Note: all the GET calls to GitHub are memoized, so if the same call is made a second time in this REPL session, it will use the cached result from the first call. If you subsequently update a repo and attempt to fetch it again in the same session, you will not see your updates.
To exit, press `^D` or type `(exit)`, then restart the REPL.
Suppose you want to download all repos for an organization to your local file system. This toolkit doesn't do the whole job, but it does work in concert with your `zsh` shell and the `git` command line to get the job done.

Run `download-repos` with your organization, a destination folder, and an optional regular expression that selects which repos to include based on a match against the repo name. For example:

```clojure
(def sdks (download-repos "mongodb" "/tmp/github/mongodb" #".*sdk"))
```
This example creates a download shell file that contains git commands to clone all mongodb repos with a name containing "sdk". Run this in the REPL and you will get output like this:

```
Wrote git commands for 3 out of 171 repos provided.
```

Open a zsh terminal and execute:

```
cd /tmp/github/mongodb
sh ./download_repos.sh |& tee download.log
```

The generated shell script contains a `git clone` statement for each repo matched.

Then you might go into each repo and analyze its languages with `cloc`:

```
cloc --exclude-dir=.git . |& tee cloc.txt
```
The GitHub API includes a call that returns the distribution of code in a given repo by language, in bytes. You can see the URL for this call in the `:languages_url` property of the maps returned by `fetch-repos`. You can call it directly with the `repos/fetch-languages` function:

```clojure
(repos/fetch-languages "mongodb" "docs-tools")
=>
{:Python 1086508,
 :JavaScript 810387,
 :CSS 206605,
 :HTML 133138,
 :Makefile 4431,
 :C++ 29}
```
Add the languages to an existing collection of repos using the `add-languages` function:

```clojure
(def repos+ (map repos/add-languages sdks))
```

That will augment the sequence of repos, adding a `:languages` property to each, with the languages map as the value. (Note: this does NOT push anything back to GitHub!)
To aggregate language counts from a sequence of repos:

```clojure
(reduce (partial merge-with +) (map :languages repos+))
```

Then sort the result into descending order:

```clojure
(sort-by second > (seq *1))
```

You can then filter this and easily paste it into an Excel chart.
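If you prefer not to rely on the REPL's `*1`, the same computation reads well as a single thread-last pipeline (equivalent to the two steps above):

```clojure
;; Aggregate byte counts per language across all repos, then sort
;; by byte count, descending.
(->> repos+
     (map :languages)
     (reduce (partial merge-with +))
     (sort-by second >))
```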
The GitHub API to get repository information returns a `:size` property, but it does not represent the true size-on-disk of a cloned repository. To get this, you need to actually clone the repo locally and then compute the disk space taken by the full directory tree.

That's just what the `sizing/add-repo-size` function does. It takes a repo map that came from the Get Repository GitHub API, clones that repo in the `$TMPDIR` directory, adds up the bytes used by the full directory tree, then deletes that directory. It then returns a new repo map with a `:size-on-disk` property holding the total number of bytes taken by that on-disk repo, and a `:clone-result` property with the result of the `git clone` shell process.
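In outline, that behavior might look like the following (a sketch only, not the project's actual implementation; it assumes the repo map's standard `:clone_url` field and omits error handling):

```clojure
(require '[clojure.java.shell :as shell]
         '[clojure.java.io :as io])

;; Sum the sizes of all regular files under dir.
(defn- dir-size-bytes [dir]
  (->> (file-seq (io/file dir))
       (filter #(.isFile %))
       (map #(.length %))
       (reduce +)))

;; Clone into $TMPDIR, measure, clean up, and return an augmented map.
(defn add-repo-size-sketch [repo]
  (let [dest   (str (System/getenv "TMPDIR") "/" (:name repo))
        result (shell/sh "git" "clone" (:clone_url repo) dest)
        size   (dir-size-bytes dest)]
    (shell/sh "rm" "-rf" dest)
    (assoc repo :size-on-disk size :clone-result result)))
```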
The following code will return all the sdk repos in the mongodb org, with sizes added. (Note this will take a few minutes, as it does all the cloning!)

```clojure
(def sdks (filter #(re-find #"sdk$" (:name %)) repos))
(def sdks+s (map sizing/add-repo-size sdks))
```

See the results of the cloning:

```clojure
(map :clone-result sdks+s)
=>
({:exit 0, :out "", :err "Cloning into 'stitch-js-sdk'...\n"}
 {:exit 0, :out "", :err "Cloning into 'stitch-android-sdk'...\n"}
 {:exit 0, :out "", :err "Cloning into 'stitch-ios-sdk'...\n"})
```

See the size of each:

```clojure
(map #(select-keys % [:name :size-on-disk]) sdks+s)
=>
({:name "stitch-js-sdk", :size-on-disk 46292762}
 {:name "stitch-android-sdk", :size-on-disk 8769580}
 {:name "stitch-ios-sdk", :size-on-disk 33177918})
```
To get the branches of a repo, you need to make an additional call to the GitHub API. The `sizing/add-branches` function does this and returns its repo map argument with a `:branches` property that contains an array of maps describing the branches in that repo.
Given the 3 repos that have "sdk" in their names, the following code illustrates how you can access information about branches within each repo:

```clojure
(def sdk+b (map sizing/add-branches sdks))
(doseq [r sdk+b]
  (printf "\nRepo: %s\n" (:name r))
  (doseq [b (:branches r)]
    (printf "\t%s\n" (:name b))))
```

```
Repo: stitch-js-sdk
    REALMC-5838
    Release-STITCH-4077
    STITCH-3299
    master
    mburdette-patch-1
    revert-404-add-support-for-new-app-with-cluster
    support/3.x

Repo: stitch-android-sdk
    Release-4.3.0
    Release-4.3.3
    Release-4.4.0
    Release-4.4.1
    Release-4.5.0
    Release-4.6.0
    Release-4.7.0
    STITCH-3160
    master
    support/4.1.5

Repo: stitch-ios-sdk
    Release-STITCH-3017
    Release-STITCH-3346
    Release-STITCH-3383
    Release-STITCH-3632
    Release-STITCH-4078
    master
    revert-168-STITCH-2716
    sync
=> nil
```
Branch information returned from GitHub includes:

- `:name` - the branch name
- `:commit` - the URL and SHA key for the branch commit
- `:protected` - true or false
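For example, given those keys, you could list just the protected branch names of the first repo (a small derived query over the `:branches` data described above):

```clojure
;; Names of branches GitHub reports as protected.
(->> (first sdk+b)
     :branches
     (filter :protected)
     (map :name))
```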
- When you exit the REPL, your environment is not saved, so all your in-memory data is lost.
- If you make too many REST calls to GitHub within a period of time, GitHub will fail with an error indicating that you have exceeded your rate limit. There is an API call you can make to find out what that limit is and when it resets (see the sketch after this list); check the GitHub REST API documentation. Generally, if you wait an hour it will reset and you can try again. Since GET call results in this toolkit are effectively cached with `memoize`, if you don't exit the REPL you can start your analysis over and the GET calls will return cached results without actually making new calls, so your call count will pick up starting with new calls only.
- To manually test the REST calls, use Postman set up for Basic authentication with username `$GITHUB_USER` and password `$GITHUB_TOKEN`.
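A sketch of that rate-limit check from the REPL, assuming the `clj-http` and `cheshire` libraries are on the classpath (they are not necessarily dependencies of this project); the endpoint itself, `GET /rate_limit`, is part of the GitHub REST API:

```clojure
(require '[clj-http.client :as http])

;; Returns the core rate-limit bucket: how many calls you're allowed,
;; how many remain, and when (epoch seconds) the window resets.
(-> (http/get "https://api.github.com/rate_limit"
              {:basic-auth [(System/getenv "GITHUB_USER")
                            (System/getenv "GITHUB_TOKEN")]
               :as :json})
    (get-in [:body :resources :core]))
```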
Consider revising this to use the `tentacles` library.