Gene expression analysis of hepatocellular carcinoma and comparison between random forest and support vector machine classifiers
Hepatocellular carcinoma (HCC) is the most frequent malignant tumor of the liver and the third leading cause of cancer death worldwide. The dataset with accession number GSE14520 was obtained from the NCBI Gene Expression Omnibus (GEO). It contains 445 samples; patient samples were collected between 2002 and 2003 at the Liver Cancer Institute (LCI), Fudan University, China, and the Liver Tissue Cell Distribution System (LTCDS) at the University of Minnesota, USA. The dataset comprises 222 cases and 212 controls; case-control information was not available for the remaining eleven samples. With dimensions of 445 × 22,268, the dataset is high-dimensional. Supervised learning was carried out on the gene expression data using random forest (RF) and support vector machine (SVM) classifiers. The R packages randomForest, svmpath, kernlab, and verification were used for the RF and SVM analyses. A ROC curve of false alarm rate (x-axis) versus hit rate (y-axis) was then plotted for both RF and SVM. The variable importance ranking table uses mean decrease in accuracy and mean decrease in Gini index to identify which variables (genes, in this case) are important. When the two classifiers were compared, RF was found to perform better than SVM on this gene expression dataset.
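The ROC comparison described above can be illustrated with a minimal sketch. It is written in plain Python rather than R and is not the implementation used by the verification package. The labels and scores are made-up toy values, not results from GSE14520, and the function names roc_points and auc are hypothetical helpers introduced only for this illustration. The sketch sweeps a decision threshold over classifier scores, records the false alarm rate and hit rate at each threshold, and integrates the resulting curve with the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Return (false_alarm_rate, hit_rate) pairs, one per distinct score threshold.

    labels: 1 for case, 0 for control; scores: higher means more case-like.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one case and one control")

    # Rank samples from most to least case-like.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]  # threshold above every score: nothing called a case
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example (assumed values, for illustration only).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
area = auc(points)
print(round(area, 4))  # 0.8889
```

Under this sketch, a classifier whose curve lies closer to the top-left corner has a larger area, which is one common way to compare two classifiers such as RF and SVM on the same samples.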