Matchmaker is a machine-learning project to perform similarity matching using a dataset of OkCupid dating profiles.
Given your basic details (such as your sex and orientation) and answers to more interesting questions (such as your religion, education, use of drugs, etc.), Matchmaker will return the best possible matches from a dataset of approximately 60,000 OkCupid profiles. The basic details are matched using logic rules (e.g. a straight woman's candidates are filtered down to non-gay men). The other answers (religion, pets, children, etc.) are run through a k-nearest neighbors model. This means that you and the profiles in the dataset are plotted in an n-dimensional space (think of a graph with many axes), and your matches are the profiles nearest to you in that space, where "nearest" effectively means most similar. Hence the name of the algorithm: "Nearest Neighbors".
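To make the two-stage approach concrete, here is a minimal sketch using pandas and scikit-learn's NearestNeighbors. The column names, encodings and tiny DataFrame are made up for illustration and are not the project's actual schema.

import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Illustrative data only; the real dataset has far more rows and columns.
profiles = pd.DataFrame({
    "sex":         ["m", "m", "m", "f"],
    "orientation": ["straight", "gay", "bisexual", "straight"],
    "age":         [28, 31, 35, 26],
    "drinks":      [1, 3, 2, 1],   # hypothetical encoding: 0 = never ... 3 = often
})

# Stage 1: hard filter with logic rules. A straight woman's candidate pool
# is reduced to men who are not exclusively gay.
candidates = profiles[(profiles["sex"] == "m") & (profiles["orientation"] != "gay")]

# Stage 2: k-nearest neighbors over the remaining numeric dimensions.
features = candidates[["age", "drinks"]].to_numpy()
knn = NearestNeighbors(n_neighbors=2).fit(features)

my_profile = [[28, 1]]                    # same feature order as above
distances, indices = knn.kneighbors(my_profile)
print(candidates.iloc[indices[0]])        # the closest candidate rows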
Weighting is applied so that differences on some dimensions have more influence on similarity than differences on others. For example, a 28-year-old non-smoker would probably be more averse to dating a 27-year-old smoker than to dating a 26-year-old non-smoker. And a vegan is probably more willing to date a vegetarian than an omnivore, since a vegetarian is "nearer" to a vegan than an omnivore is. Note that these weights are calibrated according to my own biases!
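To illustrate how per-feature weights change "nearness" (a sketch of the general technique, not necessarily how this project implements it): after scaling each feature to a comparable range, each axis is multiplied by its weight before computing distances, so heavily weighted differences push profiles further apart. All numbers below are hypothetical.

import numpy as np

# Hypothetical scaled features: [age (0-1), smokes (0/1), diet distance from vegan (0-1)]
me           = np.array([0.5, 0.0, 0.0])   # 28-year-old non-smoking vegan
smoker_27    = np.array([0.4, 1.0, 1.0])   # 27-year-old smoking omnivore
nonsmoker_26 = np.array([0.3, 0.0, 0.5])   # 26-year-old non-smoking vegetarian

weights = np.array([1.0, 5.0, 3.0])        # smoking and diet differences dominate age

def weighted_distance(a, b, w):
    """Euclidean distance after stretching each axis by its weight."""
    return np.linalg.norm((a - b) * w)

print(weighted_distance(me, smoker_27, weights))     # ~5.8: a much worse match
print(weighted_distance(me, nonsmoker_26, weights))  # ~1.5: a better match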
From a technical perspective, the entire thing is done in Python with the pandas, NumPy, SciPy and scikit-learn packages. The model is run in real time, with no pre-calculation or pre-training, so it takes 3 to 4 seconds to run (based on the performance on my laptop).
Note to self: Next time do something easier and less...controversial. The training data has, for example, 217 unique values for ethnicity and 45 for religion, including "radical agnostics", which I'm pretty sure is an oxymoron.
Developed with Python version 3.8.6.
See requirements.txt for packages and versions (and below to install).
Clone the Git repo.
Install the dependencies:
$ pip install -r requirements.txt
The dataset has been included in the data directory as chunked files (don't worry, it's anonymized), so no need to get it yourself. See below for copyright and attribution.
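If you do want to poke at the data directly, the chunks can be loaded and stitched together with pandas along these lines (the glob pattern is an assumption; check the actual file names in data/):

from pathlib import Path

import pandas as pd

# Assumed layout: chunked CSV files under data/; adjust the pattern to match
# the actual file names in the repository.
chunks = sorted(Path("data").glob("*.csv"))
profiles = pd.concat((pd.read_csv(path) for path in chunks), ignore_index=True)
print(len(profiles))   # roughly 60,000 rows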
Run the command-line tool:
$ python matchmaker.py
Help output:
$ python matchmaker.py --help
usage: matchmaker.py [-h] [--matches MATCHES_TO_RETRIEVE] [--force-training]
K-Nearest Neighbors machine learning model to find the best matches within a set of OkCupid profiles.
optional arguments:
  -h, --help            show this help message and exit
  --matches MATCHES_TO_RETRIEVE
                        The number of matching profiles to find (default: 40).
  --force-training      Train the model even if a previously trained and saved model can be loaded and used (default: false).
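For example, to retrieve ten matches and force the model to be retrained:
$ python matchmaker.py --matches 10 --force-training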
As the command-line tool is mainly for development and testing use, it is hard-coded with my profile data. That is easily editable in the script file itself; you will see the override line when you open it up.
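The override is just a hard-coded profile, something along these lines, although the actual variable name and fields are defined in matchmaker.py itself:

# Hypothetical shape of the hard-coded profile you would edit in the script.
MY_PROFILE = {
    "age": 30,
    "sex": "f",
    "orientation": "straight",
    "religion": "agnosticism",
    "smokes": "no",
    "drinks": "socially",
}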
Note that the Django app is a very simple wrapper around the same model and logic that the CLI uses. It just provides an easy-to-use form and a nice UI, and delegates the work to the main package outside of the web app.
It is deployed to Heroku, or you can boot the server yourself:
$ python manage.py runserver
And then open http://localhost:8000/.
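Conceptually, the view layer does little more than the sketch below; the module, function and template names are hypothetical, and the real wiring lives in the Django app itself.

# Hypothetical Django view illustrating the "thin wrapper" idea.
from django.shortcuts import render

from matchmaker import find_matches  # assumed entry point into the core package


def matches_view(request):
    if request.method == "POST":
        profile = request.POST.dict()              # the submitted form fields
        matches = find_matches(profile, count=40)  # delegate to the model/logic
        return render(request, "matches.html", {"matches": matches})
    return render(request, "profile_form.html")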
Unit testing is in place where appropriate, such as for data preprocessing, calculating match scores, etc. The tests are implemented with the unittest module from Python's standard library.
$ python -m unittest discover -v
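As an illustration of the style (the preprocessing function and test below are hypothetical, not the project's actual suite):

import unittest

import pandas as pd


def clip_ages(profiles):
    """Hypothetical preprocessing step: clamp implausible ages."""
    return profiles.assign(age=profiles["age"].clip(lower=18, upper=100))


class PreprocessingTests(unittest.TestCase):
    def test_ages_are_clipped_to_a_plausible_range(self):
        raw = pd.DataFrame({"age": [17, 25, 110]})
        cleaned = clip_ages(raw)
        self.assertTrue(cleaned["age"].between(18, 100).all())


if __name__ == "__main__":
    unittest.main()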
The dataset used in this project was obtained from Larxel on Kaggle. In turn, that dataset was sourced from Albert Y. Kim's GitHub repository which was created for the publication OkCupid Profile Data for Introductory Statistics and Data Science Courses (Journal of Statistics Education, July 2015, Volume 23, Number 2) by Albert Y. Kim and Adriana Escobedo-Land:
We present a data set consisting of user profile data for 59,946 San Francisco OkCupid users (a free online dating website) from June 2012. The data set includes typical user information, lifestyle variables, and text responses to 10 essay questions.