title: GitHub Knows
url: http://blog.humphd.org/github-knows/
hash_url: 05f3e49cab
I was reflecting the other day on how useful it would be if GitHub, in addition to the lists it has now like Trending and Explore, could also give me a better view into which projects a) need help; and, more importantly, b) can accept that help when it arrives. Lots of people responded, and I don't think I'm alone in wanting better ways to find things on GitHub.
Lots of GitHub users might not care about this, since you work on what you work on already, and finding even more work to do is the last thing on your mind. For me, my interest stems from the fact that I constantly need to find good projects, bugs, and communities for undergrads wanting to learn how to do open source, since this is what I teach. Doing it well is an unsolved problem, since what works for one set of students automatically disqualifies the next set: you can't repeat your success, since closed bugs (hopefully!) don't re-open.
And because I write about this stuff, I hear from lots of students that I don't teach, students from all over the world who, like my own, are struggling to find a way in, a foothold, a path to get started. It's a hard problem, made harder by the size of the group we're discussing. GitHub's published numbers from 2017 indicate that there are over 500K students using its services, and those are just the ones who have self-identified as such--I'm sure it's much higher.
The usual response I get from people is to use existing queries for labels with some variation of "good first bug". This can work, especially if you get in quickly when a project, or group of projects, does a triage through their issues. For example, this fall I was able to leverage the Hacktoberfest efforts, since many projects took the time to go and label bugs they felt were a good fit for new people (side note: students love this, and I had quite a few get shirts and a sense that they'd become part of the community).
But static labeling of issues doesn't work over time. For example, I could show you thousands of "good first bugs" sitting patiently in projects that have long ago stopped being relevant, developed, or cared about by the developers. It's like finding a "Sale!" sign on goods in an abandoned store, and more of a historical curiosity than a signal to customers. Unless these labels auto-expire, or are mercilessly triaged by the dev team, I don't think they solve the problem.
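To make the auto-expire idea concrete, here is a minimal sketch of filtering labeled issues by how recently their project was active. The issue dictionaries, the `repo_last_commit` field, and the 90-day cutoff are all assumptions for illustration, not anything GitHub provides in this shape.

```python
from datetime import datetime, timedelta, timezone


def fresh_good_first_issues(issues, now, max_idle_days=90):
    """Keep labeled issues whose repository saw a commit recently.

    `issues` is a list of dicts with hypothetical keys:
    "title", "labels" (list of str), "repo_last_commit" (datetime).
    """
    cutoff = now - timedelta(days=max_idle_days)
    return [
        issue
        for issue in issues
        if "good first issue" in issue["labels"]
        and issue["repo_last_commit"] >= cutoff
    ]


now = datetime(2018, 1, 15, tzinfo=timezone.utc)
issues = [
    {
        "title": "Fix typo in docs",
        "labels": ["good first issue"],
        "repo_last_commit": datetime(2018, 1, 10, tzinfo=timezone.utc),
    },
    {
        "title": "Add dark theme",
        "labels": ["good first issue"],
        "repo_last_commit": datetime(2015, 6, 1, tzinfo=timezone.utc),
    },
]
fresh = fresh_good_first_issues(issues, now)
```

With this sample data only the first issue survives; the second is the "Sale!" sign in the abandoned store.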
So what could we do instead? Well, one thing we could do is make better use of the fact that we all now work in a monorepo called github.com. Everyone's favourite distributed version control system has evolved to (hilariously) become the most centralized system we've ever had. As such, GitHub knows about you, your team, and your code and could help us navigate through everything it contains. What I'm going to describe is already starting to happen. For example, if you have any node.js projects on GitHub, you've probably received emails about npm packages being out of date and vulnerable to security issues:
We found a potential security vulnerability in a repository which you have been granted security alert access. Known high severity security vulnerability detected in package-name < X.Y.Z defined in package-lock.json. package-lock.json update suggested: package-name ~> X.Y.Z
Now imagine we take this further. What sorts of things could we do?
This kind of signalling could also help all GitHub users, not just new contributors to a project. For example, I bet GitHub knows when you're approaching burnout. In sport we see coaches and team physios/physicians making data-driven decisions about injury potential for athletes: it's imperative to periodize training and balance volume and recovery so that you don't push beyond what an individual can effectively manage. Why don't we do this with tech? It's easy with programming to "lose yourself" in a problem or piece of code. I've had lots of periods where I'm pushing myself really hard in order to find a solution to some problem. I've also had lots of periods where I'm mentally dead and can't bring myself to write anything. I often don't have good insight into when I'm veering into one or the other of these two extremes.
But GitHub does. If I'm committing code day after day, I'm working day after day. If I'm writing novellas in issues, filing bugs, reviewing code, or even just scrolling my mouse without clicking on this website, I'm working. I don't know what the threshold is for burnout, but I'm sure there are psychologists or other behavioral scientists who do, and who could give some guidelines that could be used to offer a signal.
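As a rough sketch of the kind of signal I mean, the following counts how many days in a row someone has been active. The list of activity dates and the 21-day limit are made-up assumptions; a real threshold would have to come from people who study this.

```python
from datetime import date, timedelta


def current_streak(active_days, today):
    """Count consecutive active days ending at `today`."""
    days = set(active_days)
    streak = 0
    day = today
    while day in days:
        streak += 1
        day -= timedelta(days=1)
    return streak


def needs_rest(active_days, today, limit=21):
    """Hypothetical signal: True once the streak reaches `limit` days."""
    return current_streak(active_days, today) >= limit


# Five active days in a row, preceded by a day off on Jan 3.
activity = [date(2018, 1, d) for d in (1, 2, 4, 5, 6, 7, 8)]
streak = current_streak(activity, date(2018, 1, 8))
```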
What if, in addition to "critical" security warnings for our code modules, we also got helpful info about the state of our project and practices? GitHub could let people know when it sees them cross some threshold of work, to give them some idea of how much they've been doing and to help them realize they need some rest. The same goes for a project overall: when your top few developers are going all out for long stretches, the project as a whole is going to collapse eventually. If everyone is @-mentioning a person in bugs across a bunch of repos, asking for reviews, asking for help, looking for feedback, eventually that person is not going to be able to keep up. What if GitHub showed me a bit of info when I requested a review or @-mentioned you:
"This person is totally swamped right now. Is there someone else that could help?"
Actually, GitHub probably knows that too, since it knows the other devs who commit to the same repos/files on a regular basis. Why not suggest some alternatives to balance the load?
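A minimal sketch of that suggestion, assuming we already have a mapping from file paths to the people who commit to them (the `file_authors` structure and the `busy` set are hypothetical inputs, not a GitHub API):

```python
from collections import Counter


def suggest_alternatives(file_authors, files, busy, limit=3):
    """Rank other committers by how many of the touched files they know.

    file_authors: dict mapping a file path to a list of usernames.
    files: paths touched by the change that needs a reviewer.
    busy: usernames to skip because they are already swamped.
    """
    counts = Counter()
    for path in files:
        for author in file_authors.get(path, []):
            if author not in busy:
                counts[author] += 1
    return [name for name, _ in counts.most_common(limit)]


file_authors = {
    "src/parser.js": ["alice", "bob", "carol"],
    "src/lexer.js": ["alice", "carol"],
    "README.md": ["dave"],
}
suggested = suggest_alternatives(
    file_authors, ["src/parser.js", "src/lexer.js"], busy={"alice"}
)
```

Here `carol` ranks first because she has touched both files, and `bob` second with one.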
Eventually this could go even further. I can't tell you how many times the solution to a bug I have is sitting quietly over in another repo, unknown to me or the rest of my team. Wouldn't it be nice if GitHub could point me at bugs that are like the one I'm looking at now? Doing this perfectly would be hard, I realize. But imagine if GitHub could signal the presence of other similar bugs, code, or people. GitHub knows.
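Even something crude might be a start. The sketch below compares issue titles by word overlap (Jaccard similarity); this is a stand-in technique I'm assuming for illustration, and the 0.5 threshold is arbitrary. Doing it well would need far more than titles.

```python
def jaccard(a, b):
    """Word-overlap similarity between two titles, from 0.0 to 1.0."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def similar_issues(title, others, threshold=0.5):
    """Return titles from `others` that overlap enough, best match first."""
    scored = [(jaccard(title, other), other) for other in others]
    scored.sort(reverse=True)
    return [other for score, other in scored if score >= threshold]


matches = similar_issues(
    "crash on save with empty file",
    ["Crash on save with large file", "Update README badges"],
)
```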
GitHub knows and now it needs to tell us. None of the above can be done immediately, but we need to get there. We are moving into a period of data, machine learning, artificial intelligence, and automation. In 2018 we need our systems to move beyond hosting our data, and our data needs to start working for us.