socialsensor
diff --git a/README.md b/README.md (+34 -34)
diff --git a/config.properties b/config.properties (+4 -20)
diff --git a/eval.properties b/eval.properties (+22)
diff --git a/pom.xml b/pom.xml (+10 -4)
diff --git a/samples/building_concepts.txt b/samples/building_concepts.txt (+120)
diff --git a/samples/samples.zip b/samples/samples.zip (67.4 MB)
@@ -1,34 +1,41 @@
 Multimedia Geotagging
 ======
 
-Contains the implementation of algorithms that estimate the geographic location of multimedia items based on their textual content and metadata. It includes the <a href="http://ceur-ws.org/Vol-1263/mediaeval2014_submission_44.pdf">participation</a> in the <a href="http://www.multimediaeval.org/mediaeval2014/placing2014/">MediaEval Placing Task 2014</a>. The project's paper can be found <a href="http://link.springer.com/chapter/10.1007/978-3-319-18455-5_2">here</a>.
+This repository contains the implementation of algorithms that estimate the geographic location of multimedia items based on their textual content. The approach is described <a href="http://ceur-ws.org/Vol-1436/Paper58.pdf">here</a> and <a href="http://link.springer.com/chapter/10.1007/978-3-319-18455-5_2">here</a>. It was submitted to the <a href="http://www.multimediaeval.org/mediaeval2016/placing/">MediaEval Placing Task 2016</a>.
 
 
 
 <h2>Main Method</h2>
 
 The approach is a refined language model, including feature selection and weighting schemes and heuristic techniques that improves the accuracy in finer granularities. It is a text-based method, in which a complex geographical-tag model is built from the tags, titles and the locations of a massive amount of geotagged images that are included in a training set, in order to estimate the location of each query image included in a test set.
 
-The main approach comprises two major processing steps, an offline and an online. A pre-processing step fist applied in all images. All punctuation and symbols are removed (e.g. “.%!&”), all characters are transformed to lower case and then all images from the training set with empty tags and title are filtered.
+The main approach comprises two major processing steps, an offline and an online.
 
 <h3>Offline Processing Step</h3>
 
+* Pre-processing
+	* apply URL decoding, lowercase transformation, tokenization
+	* remove accents, punctuation and symbols (e.g. “.%!&”)
+	* discard terms that consist only of digits or have fewer than three characters
+
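The pre-processing bullets above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class and method names (`PreprocessSketch`, `preprocess`) are hypothetical, the input is assumed to be UTF-8 URL-encoded text, and tokenization is assumed to split on anything that is not a letter or a digit. The repository's own implementation may differ in all of these points.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the repository.
final class PreprocessSketch {

    static List<String> preprocess(String raw) {
        // URL decoding (assumes UTF-8; needs Java 10+ for the Charset overload) and lowercasing.
        String text = URLDecoder.decode(raw, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
        // Remove accents: decompose characters, then drop the combining marks.
        text = Normalizer.normalize(text, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");

        List<String> terms = new ArrayList<>();
        // Tokenize on punctuation, symbols and whitespace (assumed tokenizer).
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            // Discard digit-only terms and terms shorter than three characters.
            if (token.length() >= 3 && !token.matches("\\p{N}+")) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(preprocess("Caf%C3%A9+del+Mar,+Ibiza+2016!"));
    }
}
```

In this sketch the year `2016` is dropped because it is digit-only, while `del` and `mar` survive because they have exactly three characters.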
 * Language Model
 	* divide earth surface in rectangular cells with a side length of 0.01°
-	* calculate tag-cell probabilities based on the users that used the tag inside the cell
+	* calculate term-cell probabilities based on the users that used the term inside the cell
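
As a rough illustration of the Language Model bullets, the sketch below maps a coordinate to a grid cell and turns per-cell user counts into term-cell probabilities. Several details are assumptions rather than facts taken from the code: the cell ID is built as `lon_lat` of the cell center (as the output format in this README suggests), the cell side is `10^(-scale)` (as stated in `config.properties`), and the probability is taken to be the number of users in a cell divided by the total over all cells of the term. All names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the repository.
final class GridCellSketch {

    // Cell ID as "centerLon_centerLat" for a grid with side 10^(-scale) degrees.
    static String cellId(double lon, double lat, int scale) {
        double factor = Math.pow(10, scale);
        double centerLon = (Math.floor(lon * factor) + 0.5) / factor;
        double centerLat = (Math.floor(lat * factor) + 0.5) / factor;
        // One extra decimal is enough to print the half-cell offset of the center.
        String fmt = "%." + (scale + 1) + "f_%." + (scale + 1) + "f";
        return String.format(Locale.ROOT, fmt, centerLon, centerLat);
    }

    // Assumed estimate: users of the term in the cell / users of the term in all cells.
    static Map<String, Double> termCellProbabilities(Map<String, Integer> usersPerCell) {
        int total = 0;
        for (int users : usersPerCell.values()) {
            total += users;
        }
        Map<String, Double> probs = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : usersPerCell.entrySet()) {
            probs.put(e.getKey(), total == 0 ? 0.0 : (double) e.getValue() / total);
        }
        return probs;
    }

    public static void main(String[] args) {
        System.out.println(cellId(23.7275, 37.9838, 2));
    }
}
```

Counting distinct users instead of raw occurrences limits the influence of a single user who uploads many images with the same term in one place.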
 
 * Feature selection
-	* cross-validation scheme using the training set only
-	* rank tags based on their accuracy for predicting the location of items in the withheld fold
-	* select tags that surpass a predefined threshold
+	* calculate the locality score of every term in the dataset
+	* locality is based on the term frequency and on the neighboring users that have used the term across its cell distribution
+	* the final set of selected terms consists of the terms with a locality score greater than zero
 
 * Feature weighting using spatial entropy
-	* calculate entropy values applying the Shannon entropy formula in the tag-cell probabilities
-	* build a Gaussian weight function based on the values of the spatial tag entropy
+	* calculate the spatial entropy of every term by applying the Shannon entropy formula to the term-cell probabilities
+	* spatial entropy weights derive from a Gaussian weight function over the spatial entropy of terms
+	* locality weights derive from the relative position of each term in the ranking by locality score
+	* combine the locality and spatial entropy weights to generate the final weights
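
The spatial entropy bullets can be illustrated with the following sketch. It is a sketch only: the logarithm base (natural log here), the exact form of the Gaussian (a normal density with mean `mu` and standard deviation `sigma` over the entropy values), and all names are assumptions, since the README does not state them. How the locality and entropy weights are combined is not shown because the README does not describe the formula.

```java
// Hypothetical helper, not part of the repository.
final class SpatialEntropySketch {

    // Shannon entropy H = -sum(p * ln p) over the term-cell probabilities of one term.
    static double entropy(double[] cellProbabilities) {
        double h = 0.0;
        for (double p : cellProbabilities) {
            if (p > 0.0) {          // cells with zero probability contribute nothing
                h -= p * Math.log(p);
            }
        }
        return h;
    }

    // Assumed weight function: normal density evaluated at the term's entropy.
    // mu and sigma would be estimated from the entropies of all terms.
    static double gaussianWeight(double entropy, double mu, double sigma) {
        double z = (entropy - mu) / sigma;
        return Math.exp(-0.5 * z * z) / (sigma * Math.sqrt(2.0 * Math.PI));
    }

    public static void main(String[] args) {
        double h = entropy(new double[] {0.5, 0.5});
        System.out.println(h + " -> " + gaussianWeight(h, 1.0, 0.5));
    }
}
```

A term concentrated in a single cell has entropy zero, while a term spread evenly over many cells has high entropy, so the weight function decides how strongly each kind of term contributes.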
 
 <h3>Online Processing Step</h3>
 
-* Language Model based estimation
+* Language Model based estimation (prior-estimation)
 	* the probability of each cell is calculated
 	* Most Likely Cell (MLC) considered the cell with the highest probability and used to produce the estimation
 
@@ -46,45 +53,38 @@ The main approach comprises two major processing steps, an offline and an online
 In order to make possible to run the project you have to set all necessary argument in <a href="https://github.com/socialsensor/multimedia-geotagging/blob/master/config.properties">configurations</a>, following the instruction for every argument. The default values may be used. 
 
 
-_Input File_
-The dataset's records, that are fed to the algorithm as training and test set, have to be in the following format. The different metadatas are separated with _tab_ character.
-
-		imageID  imageHashID  userID  title  tags  machineTags  lon  lat  description
-				
-`imageID`: the ID of the image<br>
-`imageHashID`: the Hash ID of the image that was provided by the organizers (optional)<br>
-`userID`: the ID of the user that uploaded the image<br>
-`title`: image's title<br>
-`tags`: image's tags<br>
-`machineTags`: image's machine tags<br>
-`lon`: image's longitude<br>
-`lat`: image's latitude<br>
-`description`: image's description, if it is provided.
+_Input File_<br>
+The input files must be in the same format as the <a href="https://webscope.sandbox.yahoo.com/catalog.php?datatype=i&did=67">YFCC100M dataset</a>.
 
 
-_Output File of the Offline Step_	
-At the end of the training process, the algorithm creates a folder named `TagCellProbabilities` and inside the folder another folder named `scale_(s)`, named appropriately based on the scale `s` of the language model's cells. The format of this file is the following.
+_Output Files_<br>
+At the end of the training process, the algorithm creates a folder named `TermCellProbs` and inside the folder another folder named `scale_(s)`, named appropriately based on the scale `s` of the language model's cells. The format of this file is the following.
 
-	tag	  ent-value   cell1-lon_cell1-lat>cell1-prob   cell2-lon_cell2-lat>cell2-prob...
+	term	cell1-lon_cell1-lat>cell1-prob>cell1-users  cell2-lon_cell2-lat>cell2-prob>cell2-users...
 		
-`tag`: the actual name of the tag<br>
-`ent-value`: the value of the tag's entropy<br>
+`term`: the actual name of the term<br>
 `cellx`: the x most probable cell.<br>
 `cellx-lon_cellx-lat`: the longitude and latitude of center of the `cellx`, which is used as cell ID<br>
-`cellx-prob`: the probability of the `cellx` for the specific tag
+`cellx-prob`: the probability of the `cellx` for the specific term<br>
+`cellx-users`: the number of users that used the specific term in the `cellx`
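
To make the row format above concrete, here is a minimal sketch that parses one `TermCellProbs` row and picks a Most Likely Cell for a set of query terms. It assumes that the term and the cell entries are separated by whitespace, and that cell scores are obtained by summing the term-cell probabilities of the query terms without weights. Both points, and every name in the block, are assumptions rather than the repository's confirmed behavior.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the repository.
final class TermCellProbsSketch {

    // Parses "term  lon_lat>prob>users  lon_lat>prob>users ..." into term -> (cellId -> prob).
    static Map.Entry<String, Map<String, Double>> parseRow(String row) {
        String[] fields = row.trim().split("\\s+");
        Map<String, Double> cellProbs = new LinkedHashMap<>();
        for (int i = 1; i < fields.length; i++) {
            String[] parts = fields[i].split(">");   // [cellId, probability, users]
            cellProbs.put(parts[0], Double.parseDouble(parts[1]));
        }
        return Map.entry(fields[0], cellProbs);
    }

    // Sums the probabilities of every cell over the query terms and returns the best cell,
    // or null when none of the terms is known to the model.
    static String mostLikelyCell(List<String> queryTerms, Map<String, Map<String, Double>> model) {
        Map<String, Double> scores = new HashMap<>();
        for (String term : queryTerms) {
            Map<String, Double> cells = model.getOrDefault(term, Map.of());
            cells.forEach((cell, prob) -> scores.merge(cell, prob, Double::sum));
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }
}
```

Returning `null` for unknown terms keeps the sketch short; a real pipeline would need a fallback location for images whose terms were all filtered out.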
 
-The output of the cross-validation scheme is a file named `tagAccuracies_range_1.0` found in the projects directory. The output file contains the tags with their accuracies in the range of 1km and it is used for the feature selection. 
+The output of the feature weighting scheme is a folder named `Weights` containing two files, one for the locality weights and one for the spatial entropy weights, named `locality_weights` and `spatial_entropy_weights`, respectively. Each row contains a term and its corresponding weight, separated by a tab.
 
-The files that are described above are given as input in the Language Model estimation process. During this process, a folder named `resultsLM` and inside that folder two files named `resultsLM_scale(s)`are created, where are included the MLCs of the query images. Every row contains the imageID and the MLC, separated with a `;`, of the image that corresponds in the respective line in the training set. Also, a file named `confidence_associated_tags` is created in root the root directory, containing the confidence and associated tags with the MLC for every query image.
+The files described above are given as input to the Language Model estimation process. During this process, a folder named `resultsLM` is created, and inside it two files named `resultsLM_scale(s)`, which contain the MLCs of the query images. Every row contains the imageID and the MLC (tab-separated) of the image on the corresponding line of the test set. Also, a file named `resultsLM_scale(s)_conf_evid` is created in the same folder, containing, for every query image, the confidence and the evidence that led to the estimated MLC.
 
 Having estimated the MLCs for both granularity grids, the files are fed to the Multiple Resolution Grids technique, which produce a file named `resultsLM_mg(cs)-(fs)`, where `(cs)` and `(fs)` stands for coarser and finer granularity grid, respectively. Every row of this file contains the image id, the MLC of the coarser language model and the result of the Multiple Resolution Grids technique, separated with a `>`.
 
-In conclusion, the file that is created by the Multiple Resolution Grids technique is used for the final processes of the algorithm, Similarity Search. During this process, a folder named `resultSS` is created, containing the similarity values and the location of the images that containing in the MLG of every image in the test set. The final results are saved in the file specified in the arguments, and the records in each row are the ID of the query image, the estimated latitude, the estimated longitude and the distance between the real and the estimated locations, all separated with the symbol `;`.
+Finally, the file created by the Multiple Resolution Grids technique is used for the final process of the algorithm, Similarity Search. During this process, a folder named `resultSS` is created, containing the similarity values and the locations of the images contained in the MLG of every image in the test set. The final results are saved in the file specified in the arguments; each row contains the ID of the query image, the real longitude and latitude, and the estimated longitude and latitude, all tab-separated.
 
-<h3>Demo Version</h3>
+<h3>Evaluation Framework</h3>
 
-There have been developed a <a href="https://github.com/socialsensor/multimedia-geotagging/tree/demo">demo version</a> and a <a href="https://github.com/socialsensor/multimedia-geotagging/tree/storm">storm module</a> of the approach .
+This <a href="https://github.com/MKLab-ITI/multimedia-geotagging/tree/develop/src/main/java/gr/iti/mklab/mmcomms16">package</a> contains the implementations of the sampling strategies described in the <a href="http://dl.acm.org/citation.cfm?doid=2983554.2983558">MMCommons 2016 paper</a>. In order to run the evaluation framework, you have to set all necessary arguments in the <a href="https://github.com/MKLab-ITI/multimedia-geotagging/blob/master/eval.properties">configuration file</a>, following the instructions for every argument. To run the code, the <a href="https://github.com/MKLab-ITI/multimedia-geotagging/blob/master/src/test/java/gr/iti/mklab/main/Evaluation.java">Evaluation class</a> has to be executed.
+
+Additionally, this <a href="https://github.com/MKLab-ITI/multimedia-geotagging/blob/master/samples/">folder</a> contains the <a href="https://github.com/MKLab-ITI/multimedia-geotagging/blob/master/samples/samples.zip">zip file</a> with the collections generated by the different sampling strategies and the <a href="https://github.com/MKLab-ITI/multimedia-geotagging/blob/master/samples/building_concepts.txt">file</a> of the building concepts. Keep in mind that the geographical uniform sampling, the user uniform sampling and the text diversity sampling generate different files on every code execution because they involve random selections and permutations.
+
+<h3>Demo Version</h3>
 
+A <a href="https://github.com/socialsensor/multimedia-geotagging/tree/demo">demo version</a> and a <a href="https://github.com/socialsensor/multimedia-geotagging/tree/storm">storm module</a> of the approach have also been developed.
 
 <h3>Contact for further details about the project</h3>
 
 
@@ -1,5 +1,5 @@
 #Project directory
-dir=/media/georgekordopatis/New Volume/placing-task/files/
+dir=/home/georgekordopatis/Documents/multimedia-geotagging/images/
 
 #Processes of the program
 #Values:
@@ -12,23 +12,9 @@ dir=/media/georgekordopatis/New Volume/placing-task/files/
 #all = all the processes
 process=train
 
-#Source Data
-sFolder=yfcc100m_dataset/
-sTrain=mediaeval2014_placing_train
-sTest=mediaeval2014_placing_test
-hashFile=yfcc100m_hash
-
-#Training and Test folder and file name
-trainFile=all_train_set_filtered
-testFile=all_test_set
-
-#Filter images of Training set with empty tags and title
-#Boolean: true = filter, false = no filter
-filter=true
-
-#Tag accuracy threshold and tag frequency threshold
-thetaG=0.0
-thetaT=1
+#Folder that contains the training files and Test set file
+trainFolder=/yfcc100m/
+testFile=/testset/2016/mediaeval2016_placing_test
 
 #Scale of Grid
 #side cell = 10^(-scale) (i.e. scale 2 = 0.01)
@@ -38,8 +24,6 @@ finerScale=3
 #Total number of the similar images (k) and the result files of the LM process for multiple grids (input)
 #required for IGSS process
 k=5
-coarserGrid=resultsLM_scale2
-finerGrid=resultsLM_scale3
 
 #Name of the final Result File (output)
 resultFile=results_G2-3_k
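
For readers unfamiliar with the `config.properties` conventions, the following sketch shows one way such a file could be read with `java.util.Properties` and how a scale value translates into a cell side of `10^(-scale)` degrees, as the comment in the file states. The class and method names are hypothetical, and only keys visible in this diff (`finerScale`, `k`) are used; this is not the project's actual loading code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of the repository.
final class ConfigSketch {

    // Parses properties text; lines starting with '#' are treated as comments.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    static int intValue(Properties props, String key) {
        return Integer.parseInt(props.getProperty(key).trim());
    }

    // side cell = 10^(-scale), e.g. scale 2 gives 0.01 degrees.
    static double cellSide(int scale) {
        return Math.pow(10, -scale);
    }

    public static void main(String[] args) {
        Properties props = parse("#Scale of Grid\nfinerScale=3\nk=5\n");
        System.out.println(cellSide(intValue(props, "finerScale")));
    }
}
```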
@@ -0,0 +1,22 @@
+#Paths to the input Files
+testFile=mediaeval2015_placing_test
+placeFile=mediaeval2015_placing_test_places
+conceptFile=mediaeval2015_placing_test_autotags
+resultFile=results
+
+#Sampling Strategy
+#GUS  <--  Geographical Uniform Sampling
+#UUS  <--  User Uniform Sampling
+#TBS  <--  Text-based Sampling
+#TDS  <--  Text Diversity Sampling
+#GFS  <--  Geographically Focused Sampling
+#ABS  <--  Ambiguity-based Sampling
+#VS  <--  Visual Sampling
+#BS  <--  Building Sampling
+#(Empty)  <--  No sampling
+sampling=GUS
+
+#Minimum and Maximum precision range
+#precisionrange = 10^(scale) (i.e. scale -1 --> range 0.1km)
+minRangeScale=-2
+maxRangeScale=3
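
The `minRangeScale` and `maxRangeScale` settings above define precision ranges of `10^(scale)` km. The sketch below shows how such ranges could be enumerated and how a precision value could be computed for each of them. It assumes one range per integer scale between the two bounds (inclusive) and defines precision as the fraction of test items whose error distance is within the range. These are assumptions about the evaluation, and all names are hypothetical.

```java
// Hypothetical helper, not part of the repository.
final class PrecisionRangeSketch {

    // Ranges in km for every integer scale from minScale to maxScale, inclusive.
    static double[] ranges(int minScale, int maxScale) {
        double[] result = new double[maxScale - minScale + 1];
        for (int scale = minScale; scale <= maxScale; scale++) {
            result[scale - minScale] = Math.pow(10, scale);
        }
        return result;
    }

    // Fraction of items whose error distance (km) does not exceed the given range.
    static double precisionAt(double[] errorsKm, double rangeKm) {
        if (errorsKm.length == 0) {
            return 0.0;
        }
        int hits = 0;
        for (double error : errorsKm) {
            if (error <= rangeKm) {
                hits++;
            }
        }
        return (double) hits / errorsKm.length;
    }

    public static void main(String[] args) {
        double[] errorsKm = {0.005, 0.5, 5, 50, 5000};
        for (double range : ranges(-2, 3)) {
            System.out.println(range + " km: " + precisionAt(errorsKm, range));
        }
    }
}
```

With the default values in this file, the sketch covers six ranges, from 0.01 km up to 1000 km.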
@@ -63,11 +63,17 @@
 			<artifactId>commons-math3</artifactId>
 			<version>3.4.1</version>
 		</dependency>
-		
+
 		<dependency>
-			<groupId>org.apache.commons</groupId>
-			<artifactId>commons-lang3</artifactId>
-			<version>3.4</version>
+			<groupId>info.debatty</groupId>
+			<artifactId>java-lsh</artifactId>
+			<version>0.10</version>
+		</dependency>
+
+		<dependency>
+			<groupId>net.sf.geographiclib</groupId>
+			<artifactId>GeographicLib-Java</artifactId>
+			<version>1.42</version>
 		</dependency>
 
 	</dependencies>
 
@@ -0,0 +1,120 @@
+flying buttress
+brussels carpet
+capitol
+rose window
+abbey
+coliseum
+nave
+cathedral
+pantheon
+chateau
+belfry
+gothic
+temple
+aisle
+pointed arch
+rotunda
+organ loft
+onion dome
+palace
+bastion
+campanile
+cloister
+dome
+clock tower
+roman arch
+round arch
+amphitheater
+church
+facade
+frieze
+ceiling
+ballpark
+gargoyle
+colonnade
+manor
+altar
+battlement
+corbel
+castle
+brownstone
+mansion
+fortification
+pediment
+row house
+pedestal
+acropolis
+apartment
+building complex
+skyscraper
+stronghold
+monument
+fortress
+great hall
+tower
+drawbridge
+arch
+portico
+stadium
+field house
+condominium
+fort
+steeple
+steel arch bridge
+memorial
+column
+gable
+stained
+dome building
+watchtower
+marina
+city
+support column
+concrete
+cantilever bridge
+building
+roof
+door knocker
+building structure
+department store
+cityscape
+bazaar
+casino
+baluster
+auditorium
+hall
+truss
+brickwork
+assembly hall
+harbor
+radome
+architecture
+warehouse
+chandelier
+house
+window box
+ruins
+greenhouse
+stairwell
+window
+lighthouse
+mezzanine
+country house
+library
+stairs
+bookshop
+waterfront
+cemetery
+villa
+rafter
+stoop
+resort
+brick
+bannister
+mantel
+wall
+loft
+shelter
+cafeteria
+farmhouse
+cabin