kNN for Stronger Passwords
I recently came across an extremely useful tool for checking the quality of your passwords, haveibeenpwned.com. You type in a password and it automatically gets checked against a database of ~500M leaked passwords. If any matches were found, you should p̶r̶o̶b̶a̶b̶l̶y̶ definitely change your password.
Let’s see what happens when we try the classic correcthorsebatterystaple.
At least 114 other people have used this and been pwned. Needless to say DO NOT USE it (I’ll patiently wait while you go change your password).
Having a tool like this is great. However, and somewhat unfortunately, only exact matches are returned. In a database of 500 million passwords, this may not be a huge deal since it’s very likely that any iteration of password, p4assw0rd, paSSwOrD, etc, will already be present. But for rarer passwords, this is not always the case. For example, let’s see what happens when we replace the first o with a 0 in c0rrecthorsebatterystaple.
As you can see, no matches were found. So, perfectly safe, right? Not exactly. It turns out that most would-be password hackers are quite savvy to this tactic and it’s easy to break with a dictionary attack. Therefore, it would be great if haveibeenhacked.com also notified you if your password was similar to leaked ones.
This is where ~~machine learning~~ comes in.
Well, sort of… While we could build a super cool deep neural network with thousands of layers trained using some type of contrastive loss between good and bad passwords (and while this may or may not work), let’s start out with something simpler, like k-Nearest-Neighbors (kNNs). The system we’ll be creating takes a user provided password and checks if it is similar enough to any of the passwords in our database of leaked passwords. To determine similarity, we could use any kind of distance metric, and one which seems pretty suited to strings is the edit distance (or the Levenshtein distance if you’re trying to impress your friends).
I was actually surprised to find that none of the mainstream data science libraries (sklearn, numpy, scipy, etc) provide this out of the box. There may be a lot of reasons for this, and one of them might be that it’s not very computationally efficient (at least that’s my guess). In any case, we’ll implement it ourselves given that it’s relatively simple:
This metric function is the meat of our system, the rest is just plumbing. If you’re never written kNN before, it’s probably the easiest machine learning algorithm to implement from scratch. In fact, since it is a non-parametric model, there isn't even any learning that takes place. All it does is take your test data, in our case the user provided password, and compares to every single “training” data point using the distance metric. If two data points are close enough in distance space, then we consider them to be neighbors. Here’s what this looks like in code:
Looks good! As you might have noticed, we glossed over an important detail. If you look on line 15 in the code snippet above you’ll see we’re accessing a file called passwords.txt. This is the database of leaked passwords we use to test the user password against. There’s plenty of places to find leaked passwords online, and for the purpose of this article I’ve used this list I found on GitHub.
All that’s left now is to wire everything together and we’ll have a simple, yet powerful, machine learning model for predicting password strength:
While this does the job, it turns out that users aren’t too keen on typing their passwords as plaintext on a terminal. Instead let’s use getpass to ask the user for their password instead. Also, if we do find matching passwords, we may not want to display them on the terminal since it would be pretty easy to derive the user password from these matches. Instead, let’s make this option available only if the user requests it via the show_matching flag. Lastly, we’ll also add a knob to tune the matching threshold through a max_distance flag. Putting it all together we have:
Finally, let’s see how our system can handle the ultimate test of c0rrecthorsebatterystaple.
Nice! We were able to warn the user that their password is weak without ever having seen an exact match.
With a few lines of code we built a smarter password strength prediction system. While this is a toy example, these ideas have potential applications in the real world. Imagine that when creating a new account, instead of vague heuristics about how many special characters your password has, it was tested for similarity against the millions of already known bad passwords. It could even be provided as a service for other apps to send requests to. Of course, if you’re serious about your online security, you should be using a password manager anyways.
What we’ve built could be improved in several substantial ways, from the efficiency of the current system to the algorithm itself. Can you think of some improvements? If so, all the code can be found at my GitHub. Feel free to fork, steal, send PRs, or anything else you might want to do with it.
Finally, it should go without saying, but please DO NOT USE this system to check your actual passwords and use haveibeenpwned.com instead.
Thanks for reading!
April 7th, 2019