We realized this wasn’t going to be easy. Most analytics tools are designed for traditional products like websites and desktop apps, not for code and npm packages. The word “telemetry” was bounced around internally, but building such a system would have required more time and effort than we wanted to budget for.
How we collect data
The fact that you’re reading this blog post means that we found another way: scraping GitHub. At Twilio, most of the company’s code lives on an Enterprise GitHub instance with very generous rate limits. This means we could scan the entire Enterprise GitHub instance for the information we were after. For the projects living on regular GitHub, we added their organizations to an ancillary crawl list.
Using the excellent Octokit library, we didn’t have to write much code to get a lot accomplished. Here’s how we grab every organization:
```js
async function getAllOrgs() {
  try {
    const response = await octokit.paginate('GET /organizations');
    return response;
  } catch (error) {
    console.error(error);
  }
}
```
And here’s how we grab every repository under every organization:
```js
async function getAllRelevantReposForOrg(org) {
  try {
    const allRepos = await octokit.paginate('GET /orgs/:org/repos', {
      org,
      type: 'all',
    });
    return cleanReposResponse(allRepos);
  } catch (error) {
    console.error(`[fn 'getAllRelevantReposForOrg']:`, error);
  }
}
```
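Stitching these two together is just a loop over every organization. Here’s a rough sketch of how that driver looks (the function name and shape are illustrative, not our exact crawler code):

```js
// Rough sketch of the crawl driver: fetch every organization, then gather the
// relevant repositories for each one. Illustrative, not our exact code.
async function crawlAllOrgs() {
  const orgs = await getAllOrgs();
  const reposByOrg = {};

  for (const org of orgs) {
    reposByOrg[org.login] = await getAllRelevantReposForOrg(org.login);
  }

  return reposByOrg;
}
```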
The `cleanReposResponse` function (a rough sketch of it follows this list) trims the response by:
- Only keeping the name, language, and last updated fields from the response
- Removing any repositories that haven’t been updated in a few years
- Keeping only the repositories with code in programming languages relevant to our system, like TypeScript and JavaScript
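We won’t reproduce `cleanReposResponse` exactly, but it boils down to a filter-and-pick over the repository list. A rough sketch, with the cutoff and language list as stand-ins for our real configuration:

```js
// Rough sketch of cleanReposResponse; the three-year cutoff and language list
// are stand-ins for our real configuration.
const RELEVANT_LANGUAGES = ['TypeScript', 'JavaScript'];

function cleanReposResponse(allRepos) {
  const cutoff = new Date();
  cutoff.setFullYear(cutoff.getFullYear() - 3);

  return (
    allRepos
      // Drop repositories that haven't been updated in a few years
      .filter((repo) => new Date(repo.updated_at) >= cutoff)
      // Keep only the languages relevant to our system
      .filter((repo) => RELEVANT_LANGUAGES.includes(repo.language))
      // Keep only the fields we need
      .map(({ name, language, updated_at }) => ({ name, language, updated_at }))
  );
}
```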
At this point we’re very close, but there may still be some repositories in this list that don’t pertain to us. So we then fetch the `package.json` files in each repository. Some repositories, such as monorepos, have several `package.json` files, so we first run a search to find their locations:
```js
const response = await octokit.search.code({
  q: `repo:${orgName}/${repo.name}+filename:package.json`,
});
```
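Each item in the search response includes the path of the matching file, which is the piece we care about. Pulling the locations out looks roughly like this (the `packageJsonPaths` variable name is illustrative):

```js
// Each search result item includes the path of the matching file.
// Collect every package.json location in the repository.
const packageJsonPaths = response.data.items
  .filter((item) => item.name === 'package.json')
  .map((item) => item.path);
```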
Then we get the content of the `package.json` files and map them back up to the repository and organization:
```js
async function getPackageJson(owner, repo, packagePath = Endpoints.PACKAGE_JSON) {
  try {
    const response = await octokit.repos.getContent({
      owner,
      repo,
      path: packagePath,
    });
    // Decode the base64-encoded response
    let pkg = JSON.parse(Buffer.from(response.data.content, response.data.encoding).toString());
    // We only care about some packageJson fields, drop the rest for space
    return lodash.pick(pkg, AllowedPackageJsonFields);
  } catch (error) {
    if (error.response == null || error.response.status === 404) {
      console.log(`[getPackageJson] Processing: ${owner}/${repo} -- No package.json found.`);
    } else {
      console.log(error.response);
    }
  }
}
```
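The “map them back up” part is just collecting each decoded file into an object keyed by its path, which later gets nested under the repository and organization. A simplified sketch (function and variable names are illustrative):

```js
// Simplified sketch: fetch and decode each package.json in a repository and
// key it by its path. Function and variable names are illustrative.
async function collectPackageJsons(orgName, repo, packageJsonPaths) {
  const packageJsonsByPath = {};

  for (const packagePath of packageJsonPaths) {
    const pkg = await getPackageJson(orgName, repo.name, packagePath);
    if (pkg != null) {
      packageJsonsByPath[packagePath] = pkg;
    }
  }

  // e.g. { 'package.json': { ... }, 'packages/app/package.json': { ... } }
  return packageJsonsByPath;
}
```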
We now know:
- which organizations have front-end or Node.js code
- which repositories have a `package.json` file
- and all the information contained within their `package.json`, such as project name, version, and dependencies
Since all of the Paste Design System packages are namespaced, we can scan the `package.json` files to find repositories with the `@twilio-paste/` prefix.
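That scan is a simple prefix match over the dependency names. Roughly (assuming the dependency fields survived the earlier field pick):

```js
// Pull every @twilio-paste/ dependency (and its version) out of a parsed
// package.json. Assumes dependencies/devDependencies were kept by the pick.
function getPasteDependencies(pkg) {
  const allDeps = { ...pkg.dependencies, ...pkg.devDependencies };

  return Object.fromEntries(
    Object.entries(allDeps).filter(([name]) => name.startsWith('@twilio-paste/'))
  );
}
```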
Our first report
The very first report we generated looks something like this:
{ "numberOfOrgs": 10, "numberOfRepos": 20, "orgs": { "cool-org": { "cool-repo": { "root-package.json": { "@twilio-paste/core": "6.0.1", "@twilio-paste/icons": "4.0.1" }, "subdir-package.json": { "@twilio-paste/core": "6.0.1", "@twilio-paste/icons": "4.0.1" }, }, ... }, } }
This report shows us how many organizations and repositories at the company are using Paste, plus which packages and versions they're using. Since this is an exhaustive scan of the entire Enterprise GitHub instance, the report is very accurate. Using this information, we tracked our adoption growth from 7 organizations and 11 repositories on March 22, 2020 to 19 organizations and 60 repositories one year later.