<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Resources | Boğaziçi University Text Analytics and BIoInformatics Lab</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link href="static/css/bootstrap.css" rel="stylesheet">
<link href="static/css/flat-ui.css" rel="stylesheet">
<link href="static/css/lab.css" rel="stylesheet">
<link rel="shortcut icon" href="images/favicon.ico">
<!-- HTML5 shim, for IE6-8 support of HTML5 elements. All other JS at the end of file. -->
<!--[if lt IE 9]>
<script src="static/js/html5shiv.js"></script>
<script src="static/js/respond.min.js"></script>
<![endif]-->
</head>
<body data-spy="scroll" data-target="#affix-nav">
<nav class="navbar navbar-default navbar-fixed-top" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#navbar-collapse">
<span class="sr-only">Toggle navigation</span>
</button>
<a class="navbar-brand" href="?" title="Text Analytics and BIoInformatics Lab">TABI</a>
</div>
<div class="collapse navbar-collapse" id="navbar-collapse">
<ul class="nav navbar-nav">
<li >
<a href="?">Home</a>
</li>
<li >
<a href="people.html">People</a>
</li>
<li >
<a href="projects.html">Projects</a>
</li>
<li >
<a href="publications.html">Publications</a>
</li>
<li >
<a href="theses.html">Theses</a>
</li>
<li >
<a href="courses.html">Courses</a>
</li>
<li class="active">
<a href="resources.html">Resources</a>
</li>
</ul>
<ul class="nav navbar-nav navbar-right visible-lg">
<li><a href="http://www.boun.edu.tr/" title="Boğaziçi University">BU</a></li>
<li>
<a href="http://www.cmpe.boun.edu.tr/" title="Boğaziçi University Computer Engineering Department">
CmpE
</a>
</li>
</ul>
</div>
</div>
</nav>
<div class="container">
<div class="row">
<div class="col-md-10 col-md-offset-1">
<h1 class="lab-section-title">Datasets</h1>
<center>
<table width="800"><tbody><tr><td>
<pre style="line-height:140%">
<h2 class="lab-section-title">Review Datasets</h2>
<li><a href="https://www.yelp.com/dataset_challenge">Yelp Dataset</a>: 4.1M reviews and 947K tips by 1M users for 144K businesses [<a href="review_datasets/yelp_dataset_challenge_round9.tar">Download (1.8gb)</a>]</li>
<li><a href="https://sites.google.com/site/nquocdai/resources">SAR14 Dataset</a>: An independent score-associated dataset of 233,600 movie reviews. [<a href="review_datasets/SAR14.zip">Download (120mb)</a>]</li>
<li><a href="https://snap.stanford.edu/data/web-Amazon.html">Amazon Books Reviews</a>: 12,886,488 reviews [<a href="review_datasets/Books.txt.gz">Download (4.4gb)</a>]</li>
<li><a href="https://snap.stanford.edu/data/web-Amazon.html">Amazon Music Reviews</a>: 6,396,350 reviews [<a href="review_datasets/Music.txt.gz">Download (2.1gb)</a>]</li>
<li><a href="https://snap.stanford.edu/data/web-Amazon.html">Amazon Movies &amp; TV Reviews</a>: 7,850,072 reviews [<a href="review_datasets/Movies_&_TV.txt.gz">Download (2.8gb)</a>]</li>
<h2 class="lab-section-title">Sentiment Datasets</h2>
<li><a href="http://ai.stanford.edu/~amaas/data/sentiment/">Large Movie Review Dataset (IMDB Review Dataset)</a>: 25,000 highly polar movie reviews for training,
and 25,000 for testing [<a href="sentiment_datasets/aclImdb_v1.tar.gz">Download (80mb)</a>]</li>
<li><a href="http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html">Multi-Domain Sentiment Dataset</a> [<a href="sentiment_datasets/domain_sentiment_data.tar.gz">Download (30mb)</a>]</li>
<li><a href="http://www.sananalytics.com/lab/twitter-sentiment/">Twitter Sentiment Corpus</a>: 5,513 hand-classified tweets [<a href="sentiment_datasets/sanders-twitter-0.2.zip">Download (150kb)</a>]</li>
<h2 class="lab-section-title">Word Embeddings</h2>
<li><a href="https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md">FastText Turkish Word Embeddings from Wikipedia</a>: 300 dimensions [<a href="word_embeddings/wiki.tr.zip">Download (3.4gb)</a>]</li>
<li><a href="https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md">FastText English Word Embeddings from Wikipedia</a>: 300 dimensions [<a href="word_embeddings/wiki.en.zip">Download (9.6gb)</a>]</li>
<li><a href="http://nlp.stanford.edu/projects/glove/">GloVe Word Embeddings</a>:<ul><li>Wikipedia 2014 + Gigaword 5: 6B tokens (50d, 100d, 200d, 300d) [<a href="word_embeddings/glove.6B.zip">Download (822mb)</a>],
</li><li>Common Crawl: 42B tokens, 300d [<a href="word_embeddings/glove.42B.300d.zip">Download (1.7gb)</a>],
</li><li>Common Crawl: 840B tokens, 300d [<a href="word_embeddings/glove.840B.300d.zip">Download (2gb)</a>],
</li><li>Twitter (2B tweets): 27B tokens (25d, 50d, 100d, 200d) [<a href="word_embeddings/glove.twitter.27B.zip">Download (1.4gb)</a>]</li></ul>
</li><li>Google News Vectors: 300d [<a href="word_embeddings/GoogleNews-vectors-negative300.bin.gz">Download (1.5gb)</a>]</li>
<h2 class="lab-section-title">Question &amp; Answer Datasets</h2>
<li>TREC-QA 2013 [<a href="qa_datasets/jacana-qa-naacl2013-data-results.tar.bz2">Download (9mb)</a>]</li>
<h2 class="lab-section-title">Misc Datasets</h2>
<li><a href="http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html">Books unlabeled data</a> [<a href="misc_datasets/book.unlabeled.gz">Download (2.7mb)</a>]</li>
<li><a href="https://cogcomp.cs.illinois.edu/page/resource_view/89">Yahoo! Answers dataset</a>: 189,467 question and answer pairs from 20 top-level categories from
the Yahoo! Answers website; 10,000 question/answer pairs per category [<a href="misc_datasets/yahoo.answers.tar.gz">Download (127mb)</a>]</li>
<li><a href="https://snap.stanford.edu/data/web-Amazon.html">Titles for all products on Amazon</a> [<a href="misc_datasets/titles.txt.gz">Download (34mb)</a>]</li>
<li><a href="http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html">20 Newsgroups Dataset</a> [<a href="misc_datasets/news20.tar.gz">Download (17mb)</a>]</li>
<li><a href="https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs?srid=uKuKg">Quora Question Pairs Dataset</a>: over 400,000 lines of potential question duplicate pairs [<a href="misc_datasets/quora_duplicate_questions.tsv">Download (55mb)</a>]</li>
<li><a href="http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html">Cornell Movie--Dialogs Corpus</a>: 220,579 conversational exchanges between 10,292 pairs of movie characters [<a href="misc_datasets/cornell_movie_dialogs_corpus.zip">Download (1mb)</a>]</li>
</pre>
</td></tr></tbody></table>
</center>
</div>
</div>
<footer>
<hr>
<p class="text-center">
©
<a href="http://www.boun.edu.tr/">Boğaziçi University</a>
<a href="http://www.cmpe.boun.edu.tr/">Computer Engineering Department</a>
2020
</p>
</footer>
</div>
<script src="//ajax.googleapis.com/ajax/libs/jquery/2.0.3/jquery.min.js"></script>
<script>
window.jQuery || document.write('<script src="static/js/jquery-2.0.3.min.js"><\/script>')
</script>
<script src="static/js/bootstrap.min.js"></script>
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-124225373-1']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
</body>
</html>