This repository contains a Python implementation of indexing and query algorithms for Social Spatio-temporal Keyword Pattern (S²KP) queries over Multiple Aspect Trajectory (MAT) data, supporting the following episode-level constraints:
- Spatial: episode bounding box `(minx, maxx, miny, maxy)`
- Temporal: episode timeframe `(tmin, tmax)`
- Social rating: rating range `(rating_min, rating_max)`
- Textual: episode keyword set
It provides multiple index structures for efficient candidate pruning and sequence verification over MAT episode sequences.
A Python implementation of the EMT Dual Index (Gryllakis et al., 2022).
`build_emt_dual_index(...)` builds two complementary trees over region-of-interest (ROI) nodes:
- Spatial Box Sort (SBS) Tree: a kd-tree-like structure over spatial bounds, used to prune by ROI.
- Episode Sort (ES) Tree: a kd-tree-like structure over rating/time bounds, used to prune by social/temporal constraints.
Each `EMTNode` aggregates:
- union spatial bounds over its episodes
- union rating bounds
- union time bounds
- an inverted index `keyword -> [episodes]`
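As a concrete illustration, such per-node aggregates can be computed from a node's episodes roughly as follows. This is a minimal sketch over the episode dict layout documented below; `union_bounds` is a hypothetical helper, not the repository's `EMTNode`:

```python
def union_bounds(episodes):
    """Aggregate union spatial/rating/time bounds and an inverted index
    over a node's episodes (illustrative sketch, not the repo's EMTNode)."""
    minx = min(ep["spatial_bounds"][0] for ep in episodes)
    maxx = max(ep["spatial_bounds"][1] for ep in episodes)
    miny = min(ep["spatial_bounds"][2] for ep in episodes)
    maxy = max(ep["spatial_bounds"][3] for ep in episodes)
    rmin = min(ep["social_rankings"][0] for ep in episodes)
    rmax = max(ep["social_rankings"][1] for ep in episodes)
    tmin = min(ep["timeframe"][0] for ep in episodes)
    tmax = max(ep["timeframe"][1] for ep in episodes)
    # keyword -> list of episodes containing it
    inv = {}
    for ep in episodes:
        for kw in ep["keywords"]:
            inv.setdefault(kw, []).append(ep)
    return {
        "spatial": (minx, maxx, miny, maxy),
        "rating": (rmin, rmax),
        "time": (tmin, tmax),
        "inverted_index": inv,
    }
```

A node whose union bounds or inverted index cannot satisfy a constraint can then be skipped along with its whole subtree.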
Query entrypoint: `s2kp_query(emt_index, query_sequence, keyword_freq_map)`
A keyword-frequency heuristic decides whether to start the search from SBS (spatial) or ES (social/temporal), then verifies the full episode sequence.
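One plausible shape for such a heuristic is sketched below; the function name, the `rare_threshold` parameter, and the exact decision rule are illustrative assumptions, not the repository's implementation:

```python
def choose_start_tree(query_sequence, keyword_freq_map, rare_threshold=0.01):
    """Pick the starting tree for an S2KP query (illustrative heuristic).

    If the query contains a rare keyword, keyword-driven pruning on the
    SBS (spatial) tree is likely selective; otherwise start from the ES
    tree and prune on rating/time bounds first.
    """
    total = sum(keyword_freq_map.values()) or 1
    for ea in query_sequence:
        kws = ea.get("keywords")
        if kws in ("*", None):
            continue  # wildcard abstraction: no keyword evidence
        for kw in kws:
            if keyword_freq_map.get(kw, 0) / total < rare_threshold:
                return "SBS"
    return "ES"
```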
`build_hilbert_emt_index(...)` extends EMT with Hilbert curve keys for spatial locality:
- Computes per-node Hilbert key intervals over episode centers.
- Uses those intervals to prune SBS traversal early when a Hilbert interval is available.
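For reference, a standard 2D Hilbert mapping that could produce such keys from grid-quantized episode centers (the classic bit-manipulation formulation; the repository's key layout and quantization may differ):

```python
def hilbert_key(x, y, p=16):
    """Map grid cell (x, y) in a 2**p x 2**p grid to its index along the
    Hilbert curve. Per-node key intervals can then be kept as the
    [min, max] of the keys of the episodes in a subtree."""
    d = 0
    s = 1 << (p - 1)
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate the quadrant so the sub-curve is oriented consistently
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d
```

During SBS traversal, a node whose key interval does not overlap the query's key range can be skipped without inspecting its spatial bounds.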
Query entrypoint: `hemt_query(hemt_index, query_sequence, keyword_freq_map)`
The EMR tree is an enhanced Python implementation, adapted to MAT data, of the Episode-based Semantic trajectory R-tree (ESR) for semantic trajectory data (Gryllakis et al., 2017).
`build_emr_tree(...)` builds a hierarchical structure over episodes:
- Leaves hold up to `max_leaf_size` episodes (sorted by a 4D center heuristic).
- Internal nodes store union bounds across children.
- Each node maintains an inverted file `(keyword -> episode IDs)` for fast keyword pruning.
A separate STR histogram is used for selectivity estimation:
- `build_str_histogram(...)`
- `estimate_selectivity_episode(...)`
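To show the idea behind histogram-based selectivity, here is a minimal stand-in over the time dimension only; the real STR histogram covers more dimensions, and the function names, bucket scheme, and domain bounds below are assumptions:

```python
def build_time_histogram(trajectories, B=10, t_lo=0.0, t_hi=100.0):
    """Equi-width bucket counts over episode timeframe centers
    (illustrative stand-in for the repository's STR histogram)."""
    counts = [0] * B
    width = (t_hi - t_lo) / B
    total = 0
    for t in trajectories:
        for ep in t["episodes"]:
            tmin, tmax = ep["timeframe"]
            center = 0.5 * (tmin + tmax)
            b = min(B - 1, max(0, int((center - t_lo) / width)))
            counts[b] += 1
            total += 1
    return counts, total

def estimate_time_selectivity(counts, total, tmin, tmax, t_lo=0.0, t_hi=100.0):
    """Estimated fraction of episodes whose bucket overlaps [tmin, tmax]."""
    B = len(counts)
    width = (t_hi - t_lo) / B
    lo = min(B - 1, max(0, int((tmin - t_lo) / width)))
    hi = min(B - 1, max(0, int((tmax - t_lo) / width)))
    return sum(counts[lo:hi + 1]) / total if total else 1.0
```

Lower estimated selectivity means fewer candidate episodes, which is what makes an abstraction a good starting point for the search.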
Query entrypoint (selectivity-driven): `emr_sskp_query(root, query_sequence, hist_str, global_if, traj_eps)`
The query selects the most selective episode abstraction in the sequence, filters candidate leaf nodes, then verifies remaining episodes forward/backward.
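The forward/backward verification step can be sketched as follows, using the `next_episode`/`prev_episode` links of the episode dicts; `matches_ea` is a hypothetical per-episode constraint check, and the anchor episode is assumed to have matched already:

```python
def verify_sequence(anchor_ep, query_sequence, anchor_idx, matches_ea):
    """Check whether a candidate episode matched at position anchor_idx
    extends to the full query sequence via next/prev episode links
    (illustrative sketch of the verification phase)."""
    # walk forward through the abstractions after the anchor
    ep = anchor_ep
    for ea in query_sequence[anchor_idx + 1:]:
        ep = ep["next_episode"]
        if ep is None or not matches_ea(ep, ea):
            return False
    # walk backward through the abstractions before the anchor
    ep = anchor_ep
    for ea in reversed(query_sequence[:anchor_idx]):
        ep = ep["prev_episode"]
        if ep is None or not matches_ea(ep, ea):
            return False
    return True
```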
Each trajectory episode is expected to be a dict with at least:
```python
episode = {
    "trajectory_id": int,
    "episode_id": int,
    "spatial_bounds": (minx, maxx, miny, maxy),
    "timeframe": (tmin, tmax),
    "social_rankings": (rmin, rmax),
    "keywords": set([...]),
    "next_episode": episode_or_None,
    "prev_episode": episode_or_None,
}
```

Queries are sequences of "episode abstractions" (constraints). Supported fields:
- `roi_bounds`: `'*'` or `(x1, x2, y1, y2)`
- `timeframe`: `'*'` or `(tmin, tmax)`
- `rating_min`: `'*'` or a number
- `rating_max`: `'*'` or a number
- `keywords`: `'*'`, `None`, or an iterable of keywords

The repository normalizes all abstractions internally via `normalize_ea(...)` to a canonical schema (`roi_bounds`, `tmin`/`tmax`, `rating_min`/`rating_max`, `keywords`).
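A sketch of what that normalization might look like, based only on the schema documented here; the repository's `normalize_ea(...)` may differ in details such as defaults for missing fields:

```python
def normalize_ea(ea):
    """Normalize a raw episode abstraction to the canonical schema
    (illustrative sketch of the documented behavior)."""
    tf = ea.get("timeframe", "*")
    tmin, tmax = (None, None) if tf == "*" else tf
    kws = ea.get("keywords", "*")
    # '*' and None both mean "no keyword constraint"
    kws = None if kws in ("*", None) else set(kws)
    return {
        "roi_bounds": ea.get("roi_bounds", "*"),
        "tmin": tmin,
        "tmax": tmax,
        "rating_min": ea.get("rating_min", "*"),
        "rating_max": ea.get("rating_max", "*"),
        "keywords": kws,
    }
```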
Example:

```python
query_sequence = [
    {"roi_bounds": (0, 10, 0, 10), "timeframe": (0, 100), "rating_min": 3, "rating_max": 5, "keywords": {"park"}},
    {"roi_bounds": "*", "timeframe": (50, 200), "rating_min": 2, "rating_max": 5, "keywords": {"cafe"}},
]
```

Build EMT (or HEMT):
```python
emt = build_emt_dual_index(trajectories_with_episodes)
kw_freq = compute_global_keyword_frequency(emt)
```

Run a sequence query:

```python
matches = s2kp_query(emt, query_sequence, kw_freq)
print(matches)
```
Hilbert-enhanced variant:

```python
hemt = build_hilbert_emt_index(trajectories_with_episodes, p=16)
kw_freq = compute_global_keyword_frequency(hemt)
matches = hemt_query(hemt, query_sequence, kw_freq)
```
EMR variant:

```python
root = build_emr_tree(trajectories_with_episodes, max_leaf_size=20)
hist = build_str_histogram(trajectories_with_episodes, B=10)

# Build a global inverted file for selectivity estimation
global_if = {}
for t in trajectories_with_episodes:
    for ep in t["episodes"]:
        ep_id = (ep["trajectory_id"], ep["episode_id"])
        for kw in ep["keywords"]:
            global_if.setdefault(kw, set()).add(ep_id)

matches = emr_sskp_query(root, query_sequence, hist, global_if, trajectories_with_episodes)
print(matches)
```

This dataset contains 2D trajectory points grouped into trajectories (tracks). Each row represents one observation (one point) along a trajectory with a time, coordinates, a rating, keyword coordinates, and a keyword.
- One row = one timestep for one trajectory
- Rows belonging to the same trajectory share the same trajectory_id
- Within a trajectory, `time` is expected to be monotonic (typically increasing)
Each line has 8 comma-separated fields:
- `trajectory_id` (int): identifier of the trajectory / track (e.g., 0).
- `time` (int): time index or frame number within the trajectory (e.g., 29, 54, 97).
- `x` (float): X coordinate in the dataset's coordinate system.
- `y` (float): Y coordinate in the dataset's coordinate system.
- `rating` (float): social rating.
- `kx` (float): keyword X coordinate.
- `ky` (float): keyword Y coordinate.
- `keyword` (string): the associated keyword (e.g., industry, complex, …).
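A minimal loader for this row format might look like the following; `load_trajectories` is an illustrative helper, not part of the repository:

```python
import csv
from collections import defaultdict

def load_trajectories(lines):
    """Parse the 8-field CSV rows into per-trajectory point lists,
    sorted by time (illustrative loader, not the repository's own)."""
    tracks = defaultdict(list)
    for row in csv.reader(lines):
        if not row or row[0] == "trajectory_id":
            continue  # skip blank lines and a header row if present
        tid = int(row[0])
        tracks[tid].append({
            "time": int(row[1]),
            "x": float(row[2]),
            "y": float(row[3]),
            "rating": float(row[4]),
            "kx": float(row[5]),
            "ky": float(row[6]),
            "keyword": row[7].strip(),
        })
    for pts in tracks.values():
        pts.sort(key=lambda p: p["time"])  # enforce monotonic time
    return dict(tracks)
```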
Example:

```text
trajectory_id,time,x,y,rating,kx,ky,keyword
0,29,34037.70761806609,9833.138607576757,8.4,3.206948757171631,34.005638122558594,industry
0,54,34782.25592839948,9270.427745288769,9.6,-9.44599723815918,16.407695770263672,complex
0,97,36103.710020125865,8309.520759541527,9.1,-19.66252326965332,-37.67576599121094,delicious
```

If you use this code or reproduce these results, please cite:

F. Gryllakis, N. Pelekis, C. Doulkeridis and Y. Theodoridis, "Multiple aspect trajectory indexing with Hilbert curve enhancements for efficient S²KP queries", 2025. (Submitted for publication.)