Node Representation Learning Benchmark
Status | Developing |
---|---|
Author(s) | QIAN Zhiqiang ([email protected]), GUO Zhihao ([email protected]), YU Rico ([email protected]), LIAN Amber ([email protected]) |
Updated | 2021-05 |
We aim to build an automatic, fair, and systematic evaluation platform for comparing the results of different network embedding (NE) models. The implemented or modified models include DeepWalk, node2vec, GCN, NetMF, GAE, featWalk, CAN, LINE, and HOPE.
We have also imported several classic datasets, including Flickr, ACM, Cora, and BlogCatalog. We will continue to implement more representative NE models, and we welcome other researchers to contribute NE models to this platform.
Install all dependent packages:
pip install -r requirement.txt
Then the command below can be run. Datasets, algorithms, and other options are selected through the parameters provided by the command-line interface; their details can be viewed with:
python netBenchmark.py -h
optional arguments:
- -h, --help: show this help message and exit
- --dataset {cora,flickr,blogcatalog,citeseer,pubmed,chameleon,film,squirrel,all}: select an available dataset (default: all)
- --method {featwalk,netmf,deepwalk,node2vec,dgi,gae,can_new,hope,grarep,sdne,netsmf,line,prone,all}: the network embedding algorithm (default: all)
- --task_method {task1,task2,task3}: the evaluation method
- --cuda_device: the CUDA device number (default: 0)
- --training_ratio: a value controlling the total training time (default: 1.0)
- --tuning_method {random,tpe}: the hyper-parameter tuning method (currently random search and TPE search)
An example
python netBenchmark.py --method=all --dataset=all --task_method=task1 --cuda_device=1
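For instance, to benchmark a single algorithm on a single dataset with TPE tuning (all flag values are taken from the help listing above), a run could look like:

python netBenchmark.py --method=deepwalk --dataset=cora --task_method=task3 --tuning_method=tpe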
The dataset component reads data from ./data, and its functions convert the original formats into a dict result. So far, we have imported more than 10 datasets and written about 4 kinds of reading methods. All input datasets inherit from a base class, Datasets, located at ./preprocessing/dataset.py:
class Datasets:
    # Base class for all datasets.
    def __init__(self):
        super(Datasets, self).__init__()

    def get_graph(self, variable_name):
        # Overridden by each dataset: read the source files and
        # return the normalized dict result.
        graph = None
        return graph

    @classmethod
    def attributed(cls):
        # Overridden by each dataset: whether the dataset has node attributes.
        raise NotImplementedError
This class is aimed at dealing with the different formats of input source files and returning a result in a normalized format. We currently have 4 methods that handle different source formats: mat, txt, (tx, ty, x, y), and npz. Each of them returns a dict result, and adj/labels/features are transformed to csc_matrix:
data={"Network":adj,"Label":labels,"Attributes":feature}
All algorithm models inherit from a base class Models (./models/model.py), which itself inherits from torch.nn.Module. The main idea of this class is to tune hyper-parameters and obtain the best result, which is evaluated by node classification or link prediction depending on the task. The functions train_model, get_score, and parameter_tuning are therefore the most important parts of model.py.
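Put together, the base class can be pictured roughly as the skeleton below; this is a sketch based on the method names discussed in this section, not the exact content of ./models/model.py:

import torch

class Models(torch.nn.Module):
    # Sketch of the base-class layout.
    def check_train_parameters(self):
        # overridden per algorithm: return its hyper-parameter search space
        raise NotImplementedError

    def train_model(self, **kwargs):
        # overridden per algorithm: train and return an embedding matrix
        raise NotImplementedError

    def get_score(self, params):
        # shared: evaluate one hyper-parameter setting (shown further below)
        ...

    def parameter_tuning(self):
        # shared: hyperopt search over the space (shown further below)
        ...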
After building our template, we realized that it is important to make our code extensible, i.e., to facilitate importing other algorithms. We therefore built a function train_model through which every algorithm normalizes its input and output. The implementation of train_model differs from algorithm to algorithm, so it is simply replaced when a new algorithm is added. train_model has only one input parameter, kwargs, which is pre-defined and represents all hyper-parameters of the algorithm.
For example,
kwargs={'alpha1': 0.2404249370702901, 'num_paths': 47, 'path_length': 48, 'win_size': 14}
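Such a dictionary is then unpacked directly into the call, along the lines of the snippet below, where model stands for a hypothetical algorithm instance:

emb = model.train_model(**kwargs)  # each key becomes a named hyper-parameter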
After processing by dataset.py, the input data becomes a global variable that can be accessed as self.mat_content. However, it may still not match the format some algorithms expect, so an algorithm can read self.mat_content inside its class and convert it as needed. For example, GraRep converts the graph with nx.from_scipy_sparse_matrix as below:
self.graph=self.mat_content['Network']
self.G = nx.from_scipy_sparse_matrix(self.graph)
According to the above two points, the function needs to be overwritten as below:
def train_model(self, **kwargs):
    # Load the data from the dictionary and preprocess it
    adj, features, labels = load_citationmat()
    # Send the hyper-parameters to the algorithm
    embedding = Featwalk(**kwargs).function()
    return embedding
Thus, the reason we use kwargs is that its value differs on every tuning trial, so it needs to be a variable. Besides, each algorithm's hyper-parameters vary in number, name, and range, so they are recorded in check_train_parameters.
For example,
def check_train_parameters(self):
    # Hyper-parameter search space for this algorithm, in hyperopt format
    space_dtree = {
        'alpha1': hp.uniform('alpha1', 0, 1),
        'num_paths': hp.uniformint('num_paths', 10, 50),
        'path_length': hp.uniformint('path_length', 5, 50),
        'win_size': hp.uniformint('win_size', 5, 15),
    }
    return space_dtree
All in all, a new algorithm can be imported by overwriting just two functions of model.py, namely train_model and check_train_parameters.
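For example, wiring in a hypothetical new algorithm could look roughly like the sketch below; the class name MyAlgorithm and the helper my_training_routine are invented for illustration:

from hyperopt import hp

class MyAlgorithm(Models):
    def check_train_parameters(self):
        # hyper-parameter search space for this algorithm
        return {
            'dim': hp.uniformint('dim', 64, 256),
            'lr': hp.uniform('lr', 1e-4, 1e-1),
        }

    def train_model(self, **kwargs):
        # read the normalized dataset dict and run the actual training code
        adj = self.mat_content['Network']
        emb = my_training_routine(adj, **kwargs)  # hypothetical training routine
        return emb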
In order to tune parameters under different scoring criteria, we wrote the get_score function, which passes the embedding to the appropriate evaluation function through an if-else condition:
def get_score(self, params):
    emb = self.train_model(**params)
    adj = self.mat_content['Network']
    if self.task_method == 'task1':
        # link prediction, scored on the test edges
        adj_train, train_edges, val_edges, val_edges_false, test_edges, test_edges_false = pre.mask_val_test_edges(adj)
        score = link_prediction_Automatic_tuning(emb, edges_pos=test_edges, edges_neg=test_edges_false)
    elif self.task_method == 'task2':
        # link prediction, scored on the validation edges
        adj_train, train_edges, val_edges, val_edges_false, test_edges, test_edges_false = pre.mask_val_test_edges(adj)
        score = link_prediction_Automatic_tuning(emb, edges_pos=val_edges, edges_neg=val_edges_false)
    else:
        # task3: end-to-end node classification
        score = node_classifcation_end2end(np.array(emb), self.mat_content['Label'])
    return -score  # hyperopt minimizes, so return the negative score
Hyperopt is a Python library for serial and parallel optimization over awkward search spaces, which may include real-valued, discrete, and conditional dimensions. Three algorithms are currently implemented in hyperopt: random search, Tree of Parzen Estimators (TPE), and adaptive TPE. We use two of them so far, namely random search and TPE. After tuning and obtaining the best score, parameter_tuning returns the best embedding matrix, the related hyper-parameters, and the tuning time:
def parameter_tuning(self):
    trials = Trials()
    if self.tuning == 'random':
        algo = partial(hyperopt.rand.suggest)
    elif self.tuning == 'tpe':
        algo = partial(tpe.suggest)
    space_dtree = self.check_train_parameters()
    # minimize the negative score returned by get_score
    best = fmin(fn=self.get_score, space=space_dtree, algo=algo, max_evals=1000, trials=trials, timeout=self.stop_time)
    hyperparam = hyperopt.space_eval(space_dtree, best)
    print(hyperparam)
    print('end of training:{:.2f}s'.format(self.stop_time))
    # retrain with the best hyper-parameters to obtain the final embedding
    emb = self.train_model(**hyperparam)
    return emb, best
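With those pieces in place, tuning one algorithm on one dataset reduces to something like the snippet below; MyAlgorithm is the hypothetical class sketched earlier, dataset_dict is the normalized dict from the dataset layer, and the attribute names mirror those used by get_score and parameter_tuning above:

model = MyAlgorithm()
model.mat_content = dataset_dict   # normalized dict from the dataset layer
model.task_method = 'task3'        # node classification
model.tuning = 'tpe'               # TPE search
model.stop_time = 3600             # tuning time budget in seconds
emb, best = model.parameter_tuning()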
The two purposes of the evaluation layer are to tune the parameters and to obtain the final accuracy fairly.
Since evaluating embeddings takes a lot of time, we built two functions each for link prediction and node classification (one used during tuning, one for the final evaluation).
We use five-fold cross-validation to generate the training and test sets.
In the first round (five rounds in total), we tune the hyper-parameters, using 1/10 of the training set as a validation set. The two tuning functions are:
node_classifcation_end2end(feature, labels)
link_prediction_Automatic_tuning(emb, edges_pos, edges_neg)
After we have selected the hyper-parameters, we add the 1/10 validation split back into the training set. The average of the results over ten runs (five rounds in each run) is recorded by the final evaluation functions:
node_classifcation_10time(feature, labels) (10-fold node classification)
link_prediction_10_time(emb, Graph)
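As a rough illustration of what such a node-classification evaluation does, the sketch below trains a scikit-learn classifier on the embeddings with stratified five-fold cross-validation and reports the mean accuracy; the actual node_classifcation_10time implementation may differ in classifier choice and averaging scheme:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

def node_classification_cv(emb, labels, n_splits=5):
    # Five-fold CV on the learned embeddings; returns the mean test accuracy.
    emb = np.asarray(emb)
    labels = np.asarray(labels).ravel()
    scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits=n_splits, shuffle=True).split(emb, labels):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(emb[train_idx], labels[train_idx])
        scores.append(accuracy_score(labels[test_idx], clf.predict(emb[test_idx])))
    return float(np.mean(scores))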