cs224w-wiki-game

Project for cs224w, GNN for wiki game.

Authors: Michael Rybalkin, Cary Xiao, Noah Islam

Dataset

This project uses the English Wikipedia hyperlink network dataset found here: https://snap.stanford.edu/data/enwiki-2013.html. Download the dataset using download_dataset.sh.

Setup

pip install -r requirements.txt
./download_dataset.sh

Usage Instructions

If you would like to play the wiki game yourself on the n=1000 subset of the Wikipedia dataset, run ./human_player.py 1000. The first time you do this, the subsampled graph must be generated, which takes a few minutes; afterwards, it is cached. Before playing, you must provide the names of the starting and ending articles. Run ./node_names.py 1000 to list all valid node names. If you don't want to choose the start and end nodes, run in baseline mode with ./human_player.py 1000 b to play 20 games with randomly selected (seeded) start and end nodes.
If you would like to run the node2vec agent and see it play the wiki game, you first need to generate node2vec embeddings by running ./gen_node2vec_embeddings.py 1000. Then, to simulate 20 trials of the wiki game with the simple node2vec agent (which picks the neighbor whose embedding has the highest cosine similarity with the target embedding), run ./node2vec_player.py 1000 20.
If you would like to run one of the GNN-based approaches (MLP or GraphSAGE), you must first generate the node2vec embeddings (if you have not already done so). Precomputed embeddings for n=1000 and n=10000 are already in the embeddings dir. Pretrained checkpoints are committed at checkpoints/best_model.pt for MLP and checkpoints/graphsage/best_model.pt for GraphSAGE. To run them, try for example ./gnn_player.py 1000 b checkpoints/best_model.pt. If you would like to train your own models using our training data instead, run the training script, for example: python3 gnn/train.py --model mlp --epochs 50. Generated checkpoints are saved to the checkpoints dir. You can then run gnn_player.py and specify the path to the checkpoint as the third argument.
If you would like to compare the paths taken by different agents in the wiki game, use the script ./visualize_test_results.py. Running a player script in baseline mode produces a file containing the paths taken by that player. Every such file placed in the directory players_to_visualize is rendered by the visualization script. Each approach has a unique color, and nodes or edges traversed by multiple approaches are drawn in purple. If the graphs are too cluttered, try removing some of the files from players_to_visualize.
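The simple node2vec agent described above (move to the neighbor whose embedding is most similar to the target's) can be sketched as follows. This is a minimal illustration, not the code in node2vec_player.py: the names greedy_step and play, the dict-based graph and embedding layout, the toy data, and the max_steps cap are all assumptions made for the example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def greedy_step(current, target, neighbors, embeddings):
    """Pick the neighbor of `current` whose embedding is closest to the target's."""
    return max(
        neighbors[current],
        key=lambda v: cosine_similarity(embeddings[v], embeddings[target]),
    )

def play(start, target, neighbors, embeddings, max_steps=50):
    """Follow greedy steps from start until the target is reached.

    max_steps is a hypothetical safety cap: a purely greedy agent can
    revisit nodes and loop forever on some graphs.
    """
    path = [start]
    while path[-1] != target and len(path) <= max_steps:
        if not neighbors[path[-1]]:
            break  # dead end: no outgoing links
        path.append(greedy_step(path[-1], target, neighbors, embeddings))
    return path

# Toy graph and 2-D embeddings, invented purely for illustration.
neighbors = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
embeddings = {
    "A": np.array([1.0, 0.0]),
    "B": np.array([0.9, 0.1]),
    "C": np.array([0.0, 1.0]),
    "D": np.array([0.1, 1.0]),
}
path = play("A", "D", neighbors, embeddings)
print(path)
```

In this toy case the agent prefers C over B at the first step because C's embedding points in nearly the same direction as D's.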

Scripts

download_dataset.sh: Downloads the dataset, sets up the repo directory for the other scripts.
util.py: Contains utils used by other scripts. Is also runnable, and takes an optional argument n (usage: ./util.py 1000). When run, the script subsamples the full dataset to include only the n highest-degree articles, and saves the objects to pickles/.... Edges are treated as bidirectional when degree is counted (both incoming and outgoing links are counted). Defaults to n=1000 when unspecified.
human_player.py: CLI demo of the Wiki Game. Prompts for a start and target article title, then lists all neighboring article titles and prompts the user to make a selection. After the target is reached, displays the path of articles taken from start to target. Takes an optional argument n (usage: ./human_player.py 1000), which subsamples the dataset to include only the n highest-degree articles. If n is not specified, uses the full dataset. To play 20 trials and write results to a file, add an additional argument like so: ./human_player.py 1000 b.
node2vec_player.py: Simulates the simple node2vec agent playing the wiki game. Takes two args. The first is n, the number of nodes. The second is optional and gives the number of trials to run in baseline mode (defaults to 20). Example usage: ./node2vec_player.py 1000 20. In baseline mode, writes the results of the baseline trials to the filesystem.
gnn_player.py: Simulates the GNN agent playing the wiki game. Takes four args. The first is n, the number of nodes (defaults to n=1000). The second is b for baseline mode, where the GNN is evaluated on some number of trials selected randomly with a seed. The third is the filepath to the checkpoint of the trained GNN. The fourth is the number of trials to use in baseline mode (default is 20). Example usage: ./gnn_player.py 1000 b checkpoints/best_model.pt 200. In baseline mode, writes the results of the baseline trials to the filesystem.
visualize_test_results.py: Visualization script used for comparing paths of different approaches. Visualizes all path files in the dir players_to_visualize. Black edges showing untraversed edges between visited nodes can be disabled with the boolean flag SHOW_GRAY_EDGES. This script assumes that all path files were generated in baseline mode using the same src,dst node pairs and the same number of trials. The number of trials in the path files must be consistent with NUM_PROMPTS. Example usage: ./visualize_test_results.py.
node_names.py: Takes an argument n (usage: ./node_names.py 1000), and prints the names of all nodes in the subsampled dataset of top n nodes of highest degree. Requires that the dataset has already been generated.
gen_node2vec_embeddings.py: Uses node2vec to create an embedding for each node in the subgraph. Saves embeddings in the embeddings folder. Takes parameter n for the number of nodes in the graph. Usage: ./gen_node2vec_embeddings.py 1000.
baseline.py: Plays 20 trials of the wiki game using BFS, then reports the shortest path length and number of visited nodes using BFS.
graph_stats.py: This is a simple script used to calculate the mean, standard deviation, and maximum for each strategy in both the 20-game and 200-game benchmarks. To use this script, move all .pkl files whose stats you want to calculate into the test_results/ directory and run python3 graph_stats.py.
graph_train_loss.py: This is a simple script that creates the per-epoch loss graphs shown in the training section of our Medium post. To get each graph, run python3 ./graph_train_loss.py <.json file>, where you specify the path to one of the two JSON files in checkpoints/.
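The subsampling described for util.py and the BFS baseline described for baseline.py can be illustrated together with a short sketch. It is not taken from the repository: the function names top_n_subgraph and bfs, the edge-list input, and the toy edges are hypothetical. Two further assumptions are that traversal follows directed links (only the degree count treats edges as bidirectional) and that "visited" means every node discovered before the target is dequeued.

```python
from collections import Counter, deque

def top_n_subgraph(edges, n):
    """Keep the n highest-degree nodes and the directed edges among them.

    Degree counts both incoming and outgoing links, as described for util.py.
    """
    degree = Counter()
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    keep = {node for node, _ in degree.most_common(n)}
    adjacency = {node: [] for node in keep}
    for src, dst in edges:
        if src in keep and dst in keep:
            adjacency[src].append(dst)
    return adjacency

def bfs(adjacency, start, target):
    """Return (shortest path length in hops, number of nodes discovered).

    The length is None when the target cannot be reached from start.
    """
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node], len(dist)
        for nxt in adjacency[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None, len(dist)

# Toy directed edge list, invented for illustration.
edges = [(1, 2), (2, 3), (3, 4), (1, 3), (5, 1)]
adjacency = top_n_subgraph(edges, 3)   # nodes 4 and 5 have degree 1 and are dropped
result = bfs(adjacency, 1, 3)
print(sorted(adjacency), result)
```

BFS gives the optimal path length for a start and target pair, which makes it a natural reference point when judging how many extra steps a learned agent takes.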
