API Reference
Preprocessing Module
Trajectory preprocessing module for pyehicle.
This module provides algorithms for preprocessing GPS trajectories, including: - Compression: Reduce trajectory point count while preserving shape - Map-matching: Align trajectories to road networks - Filtering: Smooth trajectories using Kalman filtering - Segmentation: Split trajectories into segments - Interpolation: Add points between existing trajectory points - Sampling rate: Calculate trajectory sampling rates
- pyehicle.preprocessing.by_number_of_points(df: DataFrame | DataFrame, num: int, lat_col: str = 'lat', lon_col: str = 'lon', time_col: str = 'time') DataFrame | DataFrame[source]
Resample trajectory to exact number of points using geodesic interpolation.
This function resamples a GPS trajectory to contain exactly num points by interpolating along geodesic (great-circle) paths between original points. Points are distributed uniformly by distance along the trajectory, and timestamps are interpolated proportionally.
This is useful for: - Normalizing trajectory density across datasets - Preparing trajectories for fixed-size neural network inputs - Creating smoother visualizations - Standardizing trajectories for comparison
- Parameters:
df (pd.DataFrame or pl.DataFrame) – Input trajectory with at least latitude, longitude, and optionally time columns. Minimum 2 points required.
num (int) – Target number of points in the output trajectory. Must be > 2. Start and end points are always preserved.
lat_col (str, default='lat') – Name of the latitude column (WGS84 decimal degrees).
lon_col (str, default='lon') – Name of the longitude column (WGS84 decimal degrees).
time_col (str, default='time') – Name of the time column. If present, timestamps are interpolated linearly by cumulative distance. If not present, output contains only lat/lon.
- Returns:
Resampled trajectory with exactly num points. Returns same type as input. Contains columns: lat_col, lon_col, and time_col (if present in input). Points are distributed uniformly by distance along the geodesic path.
- Return type:
pd.DataFrame or pl.DataFrame
Examples
>>> import pandas as pd >>> import pyehicle as pye >>> >>> # Load trajectory with variable spacing >>> df = pd.read_csv('trajectory.csv') >>> print(f"Original: {len(df)} points") >>> >>> # Resample to exactly 100 points >>> resampled = pye.preprocessing.by_number_of_points(df, num=100) >>> print(f"Resampled: {len(resampled)} points") # Exactly 100 >>> >>> # Increase density for smoother visualization >>> dense = pye.preprocessing.by_number_of_points(df, num=500) >>> pye.utilities.visualization.single(dense, name='Dense Trajectory')
Notes
Algorithm: 1. Calculate cumulative geodesic distances along the original trajectory 2. Define num target distances uniformly spaced from 0 to total_distance 3. For each target distance, find containing segment and interpolation fraction 4. Interpolate coordinates geodesically using pyproj Geod.fwd() 5. Interpolate timestamps linearly by distance (if time column present)
Geodesic Interpolation: - Uses WGS84 ellipsoid (pyproj Geod) - Accurate for all distances and latitudes - Interpolates along great-circle paths (shortest distance on sphere) - Preserves exact start and end points
Performance: - Time complexity: O(n + m) where n = input points, m = output points - Vectorized operations for efficiency - Fast even for large trajectories (>10,000 points)
Use Cases: - Standardizing trajectory density for machine learning - Creating smooth animations with fixed frame counts - Upsampling sparse trajectories (num > len(df)) - Downsampling dense trajectories (num < len(df))
See also
by_sampling_rateResample by temporal interval
get_sampling_rateCalculate current sampling rate
- pyehicle.preprocessing.by_sampling_rate(df: DataFrame | DataFrame, target_sampling_rate: float, lat_col: str = 'lat', lon_col: str = 'lon', time_col: str = 'time') DataFrame | DataFrame[source]
Resample trajectory to uniform temporal interval using geodesic interpolation.
This function resamples a GPS trajectory to a constant time interval (sampling rate) by interpolating coordinates at regular temporal intervals. This creates trajectories with uniform time spacing, essential for time-series analysis and temporal modeling.
Coordinates are interpolated geodesically along great-circle paths between original points, maintaining geographic accuracy across all latitudes and distances.
This is useful for: - Normalizing temporal resolution across datasets - Preparing trajectories for time-series analysis - Synchronizing trajectories from different devices - Creating animations with consistent frame timing
- Parameters:
df (pd.DataFrame or pl.DataFrame) – Input trajectory with latitude, longitude, and time columns. Minimum 2 points.
target_sampling_rate (float) – Target time interval in seconds between consecutive points. Must be positive and smaller than the current sampling rate. Typical values: - 1.0s: High-frequency (1 Hz) - 5.0s: Standard GPS logging - 10.0s: Low-frequency tracking
lat_col (str, default='lat') – Name of the latitude column (WGS84 decimal degrees).
lon_col (str, default='lon') – Name of the longitude column (WGS84 decimal degrees).
time_col (str, default='time') – Name of the time column. Must be parseable by pandas.to_datetime().
- Returns:
Resampled trajectory with uniform temporal spacing. Returns same type as input. Contains columns: lat_col, lon_col, time_col. Points are spaced at exact target_sampling_rate intervals. Duplicates removed if present.
- Return type:
pd.DataFrame or pl.DataFrame
- Raises:
If target_sampling_rate <= 0 - If target_sampling_rate > current sampling rate (would require downsampling)
Examples
>>> import pandas as pd >>> import pyehicle as pye >>> >>> # Load trajectory with variable sampling rate >>> df = pd.read_csv('trajectory.csv') >>> current_rate = pye.preprocessing.get_sampling_rate(df) >>> print(f"Current sampling rate: {current_rate}s") >>> >>> # Resample to 5-second intervals >>> resampled = pye.preprocessing.by_sampling_rate(df, target_sampling_rate=5.0) >>> new_rate = pye.preprocessing.get_sampling_rate(resampled) >>> print(f"New sampling rate: {new_rate}s") # Should be ~5.0 >>> >>> # Synchronize multiple trajectories to same temporal resolution >>> traj1 = pye.preprocessing.by_sampling_rate(df1, target_sampling_rate=1.0) >>> traj2 = pye.preprocessing.by_sampling_rate(df2, target_sampling_rate=1.0) >>> # Now both have 1-second intervals for comparison
Notes
Algorithm: 1. Convert timestamps to seconds from start 2. Generate target times at uniform intervals: 0, Δt, 2Δt, 3Δt, … 3. For each target time, find containing time segment 4. Calculate interpolation fraction within segment 5. Interpolate coordinates geodesically at that time 6. Convert back to datetime format
Geodesic Interpolation: - Uses WGS84 ellipsoid for accurate great-circle paths - Coordinates interpolated along shortest distance on sphere - Maintains geographic accuracy at all latitudes - Vectorized operations for performance
Temporal Requirements: - target_sampling_rate must be smaller than current rate - This function increases temporal resolution (more frequent sampling) - For decreasing resolution (downsampling), use data selection instead
Performance: - Time complexity: O(n + m) where n = input points, m = output points - Vectorized numpy operations throughout - Fast for typical trajectories (<10,000 points)
Use Cases: - Preparing data for LSTM/RNN models (uniform time steps) - Creating smooth real-time visualizations - Synchronizing multi-sensor data - Generating training data with consistent temporal resolution
See also
by_number_of_pointsResample by fixed point count
get_sampling_rateCalculate current sampling rate
- pyehicle.preprocessing.by_time(df: DataFrame | DataFrame, time_threshold: float = 30, length_threshold: int = 20, time_col: str = 'time') DataFrame | DataFrame | list[source]
Segment a trajectory into sub-trajectories based on time gaps between consecutive points.
This function splits a trajectory whenever there is a time gap larger than the specified threshold. Only segments longer than the length threshold are kept. This is useful for: - Separating distinct trips recorded in a single file - Identifying breaks in continuous GPS tracking - Removing short, fragmented trajectory segments
The function preserves all original columns and maintains the temporal order within segments.
- Parameters:
df (pd.DataFrame or pl.DataFrame) – The trajectory DataFrame to segment. Must contain a time column.
time_threshold (float, default=30) – Maximum time gap in seconds. Gaps larger than this will cause a split. Common values: - 30s: For removing brief GPS signal losses - 300s (5 min): For separating distinct trips - 3600s (1 hour): For separating different days
length_threshold (int, default=20) – Minimum number of points for a segment to be kept. Segments with fewer points are discarded. Helps filter out very short, potentially noisy segments.
time_col (str, default='time') – Name of the time column. Must be parseable by pandas.to_datetime().
- Returns:
The return type depends on the number of valid segments found: - If only 1 valid segment: Returns a single DataFrame (same type as input) - If multiple valid segments: Returns a list of DataFrames - If no valid segments (all too short): Returns the original DataFrame unchanged - If input has ≤ 1 row: Returns the original DataFrame unchanged
- Return type:
pd.DataFrame, pl.DataFrame, or list
Examples
>>> import pandas as pd >>> import pyehicle as pye >>> >>> # Load a trajectory with multiple trips >>> df = pd.read_csv('full_day_gps.csv') >>> print(f"Total points: {len(df)}") >>> >>> # Split at 10-minute gaps, keep segments with 50+ points >>> segments = pye.preprocessing.by_time( ... df, ... time_threshold=600, # 10 minutes ... length_threshold=50 ... ) >>> >>> # Check results >>> if isinstance(segments, list): ... print(f"Found {len(segments)} trips") ... for i, seg in enumerate(segments): ... print(f"Trip {i+1}: {len(seg)} points") ... else: ... print(f"Single continuous trajectory: {len(segments)} points") >>> >>> # Process each segment separately >>> if isinstance(segments, list): ... for i, segment in enumerate(segments): ... # Apply preprocessing to each trip ... compressed = pye.preprocessing.spatio_temporal_compress(segment) ... matched = pye.preprocessing.leuven(compressed) ... matched.to_csv(f'trip_{i+1}_matched.csv')
Notes
Algorithm: 1. Calculate time differences between consecutive points 2. Identify split points where time_diff > time_threshold 3. Split trajectory at these points 4. Filter out segments with length ≤ length_threshold 5. Return results based on number of valid segments
Performance: - Time complexity: O(n) where n is the number of points - Memory: O(n) for creating new DataFrames - Fast for all trajectory sizes
Edge Cases: - Empty or single-point input: Returns original DataFrame - All segments too short: Returns original DataFrame - Time column not sorted: Results may be unexpected (assumes chronological order)
Use Cases: - Preprocessing multi-day GPS logs - Separating work commutes from personal trips - Removing GPS signal loss periods - Batch processing distinct trajectory segments
- pyehicle.preprocessing.get_sampling_rate(df: DataFrame | DataFrame, time_col: str = 'time') float[source]
Calculate the average sampling rate (time interval) of a GPS trajectory.
This function computes the mean time difference between consecutive trajectory points, providing insight into the temporal resolution of the GPS data. This information is useful for: - Understanding data quality and resolution - Choosing appropriate interpolation parameters - Validating trajectory resampling operations - Comparing trajectories from different sources
The function handles various time formats and returns -1.0 for trajectories with insufficient data (< 2 points).
- Parameters:
df (pd.DataFrame or pl.DataFrame) – The DataFrame containing trajectory data with timestamps.
time_col (str, default='time') – The column name for time values. The column should contain datetime-like values that can be parsed by pandas.to_datetime().
- Returns:
The average sampling rate in seconds, rounded to 3 decimal places. Returns -1.0 if the DataFrame has fewer than 2 rows (cannot calculate rate).
- Return type:
- Raises:
ValueError – If the specified time column is not found in the DataFrame.
Examples
>>> import pandas as pd >>> import pyehicle as pye >>> >>> # Create a trajectory with 5-second intervals >>> df = pd.DataFrame({ ... 'time': pd.date_range('2023-01-01', periods=100, freq='5S'), ... 'lat': np.linspace(56.95, 56.96, 100), ... 'lon': np.linspace(24.10, 24.11, 100) ... }) >>> >>> # Check sampling rate >>> rate = pye.preprocessing.get_sampling_rate(df) >>> print(f"Sampling rate: {rate} seconds") # Output: 5.0 seconds >>> >>> # Use sampling rate to choose interpolation target >>> if rate > 10: ... # Sparse data - may need more interpolation ... target_rate = rate / 2 ... else: ... # Dense data - keep similar rate ... target_rate = rate >>> >>> # Resample trajectory to target rate >>> resampled = pye.preprocessing.by_sampling_rate( ... df, ... target_sampling_rate=target_rate ... )
>>> # Compare sampling rates before and after compression >>> original_rate = pye.preprocessing.get_sampling_rate(df) >>> compressed = pye.preprocessing.spatio_temporal_compress(df) >>> compressed_rate = pye.preprocessing.get_sampling_rate(compressed) >>> print(f"Original: {original_rate}s, After compression: {compressed_rate}s")
Notes
Calculation Method: The function computes: mean(time[i+1] - time[i]) for all consecutive pairs. This provides a robust estimate even for trajectories with variable sampling rates.
Interpretation: - 1.0s: High-frequency GPS (1 Hz sampling) - 5.0s: Standard GPS logging (0.2 Hz) - 10-30s: Low-frequency tracking - >60s: Sparse trajectory or vehicle with intermittent tracking
Limitations: - Assumes time column is in chronological order (not validated) - Doesn’t detect or handle outliers in sampling rate - For highly variable rates, mean may not be representative - Returns -1.0 for empty or single-point trajectories
Performance: - Time complexity: O(n) where n is the number of points - Memory: O(n) for time difference array - Fast even for large trajectories (> 1 million points)
- pyehicle.preprocessing.kalman_aeqd_filter(df: DataFrame | DataFrame, lat_col: str = 'lat', lon_col: str = 'lon', time_col: str | None = 'time', process_noise_std_m_init: float = 1.0, measurement_noise_std_m_init: float = 10.0, em_iters: int = 3, em_tol: float = 0.001, chunk_size: int | None = None, overlap: int = 50, outlier_alpha: float | None = 0.01, return_states: bool = False, per_chunk_aeqd: bool = True, use_mmap: bool = False, mmap_dir: str | None = None, mmap_threshold: int = 200000, return_QR: bool = False, use_numba: bool = False, transformer_cache_precision: int = 6, verbose: bool = False)[source]
Apply Kalman filtering and RTS smoothing to a GPS trajectory with automatic noise parameter learning.
This function smooths noisy GPS trajectories using a Kalman filter with a constant-velocity motion model. It uses an Azimuthal Equidistant (AEQD) projection for accurate metric calculations and can automatically learn optimal noise parameters through the EM algorithm. Optionally detects and flags outliers.
- Parameters:
df (pd.DataFrame or pl.DataFrame) – Input trajectory with latitude, longitude, and optionally time columns.
lat_col (str, default="lat") – Name of the latitude column.
lon_col (str, default="lon") – Name of the longitude column.
time_col (str or None, default="time") – Name of the time column. If None, synthetic timestamps are created.
process_noise_std_m_init (float, default=1.0) – Initial process noise standard deviation in meters (affects position).
measurement_noise_std_m_init (float, default=10.0) – Initial measurement noise standard deviation in meters (GPS error).
em_iters (int, default=3) – Number of EM iterations for learning noise parameters (0 = use initial values).
em_tol (float, default=1e-3) – EM convergence tolerance (relative change in Q and R matrices).
chunk_size (int or None, default=None) – Process trajectory in chunks of this size (useful for very long trajectories).
overlap (int, default=50) – Number of overlapping points between chunks for smooth stitching.
outlier_alpha (float or None, default=0.01) – Significance level for outlier detection (None = no outlier detection).
return_states (bool, default=False) – If True, also return full state vectors (position + velocity).
per_chunk_aeqd (bool, default=True) – If True, use a separate AEQD projection for each chunk.
use_mmap (bool, default=False) – Use memory-mapped arrays for very large trajectories.
mmap_dir (str or None, default=None) – Directory for memory-mapped files.
mmap_threshold (int, default=200000) – Trajectory length above which to use memory mapping.
return_QR (bool, default=False) – If True, also return learned Q and R covariance matrices.
use_numba (bool, default=False) – Use numba JIT compilation for faster processing (if numba is available).
transformer_cache_precision (int, default=6) – Number of decimal places for caching coordinate transformers.
verbose (bool, default=False) – Print progress information during processing.
- Returns:
pd.DataFrame or pl.DataFrame – Smoothed trajectory with the same type as input. Contains smoothed lat/lon and a ‘_kept’ column indicating which points passed outlier detection.
np.ndarray (optional) – If return_states=True, returns full state vectors (n x 4) with position and velocity.
tuple (optional) – If return_QR=True, returns (avg_Q, avg_R, qr_list) with learned covariance matrices.
Notes
The Kalman filter uses a 2D constant-velocity model: - State: [x, vx, y, vy] (position and velocity in AEQD projected coordinates) - Observation: [x, y] (GPS measurements) The EM algorithm learns optimal Q (process noise) and R (measurement noise) from the data.
- pyehicle.preprocessing.leuven(df: DataFrame | DataFrame, lat_col: str = 'lat', lon_col: str = 'lon', delete_cache: bool = False, max_dist: float = 0.01, tile_max_deg: float = 0.5, tile_overlap_deg: float = 0.05, cache_overpass: bool = True, return_projected: bool = False, aeqd_precision: int = 6) DataFrame | DataFrame[source]
Map-match GPS trajectory to road network using Hidden Markov Model algorithm (Leuven).
This function aligns noisy GPS points to the most likely path on the actual road network by fetching OpenStreetMap data and applying HMM-based map-matching. It corrects GPS drift, snaps points to road centerlines, and assigns OSM way IDs for road-network-based analysis.
The algorithm: 1. Fetches driveable roads from Overpass API (tiled for large areas) 2. Projects coordinates to AEQD (Azimuthal Equidistant) for accurate metric calculations 3. Builds in-memory road network graph from OSM data 4. Applies Hidden Markov Model to find most likely road path 5. Returns trajectory with coordinates snapped to roads
This is essential preprocessing for: - Trajectory reconstruction (provides OSM IDs for road identification) - Removing GPS noise and drift - Preparing data for road-network-based routing analysis - Generating training data for machine learning models
- Parameters:
df (pd.DataFrame or pl.DataFrame) – Input GPS trajectory with at least latitude and longitude columns. Can contain additional columns (all will be preserved). Minimum 2 points required.
lat_col (str, default='lat') – Name of the latitude column (WGS84 decimal degrees).
lon_col (str, default='lon') – Name of the longitude column (WGS84 decimal degrees).
delete_cache (bool, default=False) – If True, delete cached Overpass API responses after matching. Useful for saving disk space or forcing fresh data download. Recommended: False (reuse cache).
max_dist (float, default=0.01) – Maximum distance in kilometers for matching GPS points to road nodes. Points farther than this from any road will be excluded from the result. Typical values: - 0.01 km (10m): High-accuracy GPS, urban areas - 0.05 km (50m): Standard GPS accuracy - 0.1 km (100m): Low-accuracy GPS or sparse road networks
tile_max_deg (float, default=0.5) – Maximum tile size in degrees for splitting large bounding boxes. Overpass API has request size limits, so large trajectories are automatically tiled. Larger values = fewer API calls but longer response times. Typical: 0.3-0.7 degrees.
tile_overlap_deg (float, default=0.05) – Overlap between adjacent tiles in degrees to ensure roads crossing tile boundaries are captured. Should be >= typical road segment length. Typical: 0.03-0.1 degrees.
cache_overpass (bool, default=True) – If True, cache Overpass API responses to disk in ./overpass_cache/ for reuse. Dramatically speeds up repeated matching in the same area. Recommended: True.
return_projected (bool, default=False) – If True, return coordinates in AEQD projected meters (x, y) instead of WGS84 degrees (lon, lat). Useful for distance calculations. Typical: False (keep lat/lon).
aeqd_precision (int, default=6) – Decimal precision for rounding AEQD projection center coordinates when caching transformers. Higher precision = more cache entries. Typical: 5-7.
- Returns:
Map-matched trajectory with coordinates snapped to road network. Returns same type as input (pandas → pandas, polars → polars). Output contains: - Matched coordinates (lat/lon or x/y depending on return_projected) - All original columns preserved - Points filtered: only those within max_dist of roads are kept - Order preserved: points remain in chronological sequence
- Return type:
pd.DataFrame or pl.DataFrame
Examples
>>> import pandas as pd >>> import pyehicle as pye >>> >>> # Load raw GPS trajectory >>> df = pd.read_csv('raw_gps.csv') >>> print(f"Raw trajectory: {len(df)} points") >>> >>> # Map-match to OSM roads with default settings >>> matched = pye.preprocessing.leuven(df) >>> print(f"Matched trajectory: {len(matched)} points") >>> >>> # Map-match with custom distance threshold (100m) >>> matched = pye.preprocessing.leuven( ... df, ... max_dist=0.1, # 100 meters ... cache_overpass=True # Reuse cached OSM data ... ) >>> >>> # Use in preprocessing pipeline >>> compressed = pye.preprocessing.spatio_temporal_compress(df) >>> matched = pye.preprocessing.leuven(compressed) # Map-match compressed data >>> filtered = pye.preprocessing.kalman_aeqd_filter(matched) # Apply Kalman filter >>> >>> # Visualize before/after >>> pye.utilities.visualization.multiple( ... [df, matched], ... names=['Raw GPS', 'Map-Matched'], ... show_in_browser=True ... )
Notes
Algorithm: Hidden Markov Model (HMM) - States: Road segments (edges) in the OSM network - Observations: GPS points with noise - Transition probability: Likelihood of traveling from one road to another - Emission probability: Likelihood of observing GPS point given road location - Viterbi algorithm: Finds most likely sequence of states (road path)
Data Sources: - Overpass API: Public OSM query service (https://overpass-api.de/) - Fallback: OSMnx library if Overpass fails - Caching: Responses saved to ./overpass_cache/ as XML files
Dependencies: - leuvenmapmatching: Core HMM matching library - osmium: Fast OSM XML parsing (preferred) - osmnx: Fallback for OSM data fetching and parsing - pyproj: AEQD projection transformations
Projection: AEQD (Azimuthal Equidistant) - Centered on trajectory bounding box center - Preserves distances from center point - Converts lat/lon (degrees) to x/y (meters) - Essential for accurate distance-based matching
Performance: - Time complexity: O(n * m * k) where:
n = trajectory points
m = average candidate roads per point
k = HMM Viterbi path length
For 1000-point trajectory: 30-90 seconds (first run), <5 seconds (cached)
Overpass API calls: ~1-5 tiles for typical city-scale trajectories
Cache significantly speeds up repeated matching in same area
Quality Factors: - GPS accuracy: Better GPS → better matching - Road network density: Urban >> rural - Sampling rate: Higher frequency → better path inference - max_dist: Should match expected GPS error (typically 10-50m)
Limitations: - Requires internet connection (Overpass API or OSMnx) - May fail in areas with sparse OSM coverage - Computationally intensive for very long trajectories (>10,000 points) - Does not handle GPS outages or tunnels well - Assumes trajectory follows driveable roads (not for off-road vehicles)
Troubleshooting: - “No OSM data available”: Check internet connection, try larger tile_max_deg - Slow performance: Enable cache_overpass=True, reduce trajectory size with compression - Poor matching: Increase max_dist, check OSM coverage in area, improve GPS quality - Overpass timeout: Reduce tile_max_deg (split into smaller tiles)
See also
meiliAlternative map-matching using Valhalla routing engine
spatio_temporal_compressReduce trajectory size before map-matching
kalman_aeqd_filterApply Kalman filtering after map-matching
- pyehicle.preprocessing.meili(df: DataFrame | DataFrame, lat_col: str = 'lat', lon_col: str = 'lon', time_col: str = 'time') DataFrame | DataFrame[source]
Map-match a GPS trajectory using Valhalla’s Meili algorithm.
This function requires a Valhalla server running on localhost:8002. Meili uses a Hidden Markov Model to match GPS traces to roads, handling noise and gaps effectively.
- Parameters:
- Returns:
Map-matched trajectory with coordinates snapped to roads. Includes ‘original_lon’ and ‘original_lat’ columns if time_col is present.
- Return type:
pd.DataFrame or pl.DataFrame
Notes
Requires Valhalla server running on http://localhost:8002 Install and run Valhalla before using this function. See: https://github.com/valhalla/valhalla
- pyehicle.preprocessing.spatio_temporal_compress(df: DataFrame | DataFrame, spatial_radius_km: float = 0.1, lat_col: str = 'lat', lon_col: str = 'lon', time_col: str | None = 'time', collapse_clusters: bool = True, time_threshold_s: float | None = None, agg_other: Literal['first', 'last', 'mean'] = 'first', min_samples: int = 1, drop_noise: bool = False, use_aeqd: bool = True, final_geodetic_check: bool = False, geodetic_check_tol_m: float = 0.0) DataFrame | DataFrame[source]
Compress GPS trajectory by clustering nearby points using DBSCAN-style spatial clustering.
This function reduces the number of points in a trajectory while preserving its essential shape by identifying and merging clusters of points that are close together in space (and optionally time). It uses a KD-tree for efficient O(n log n) neighbor searches and can optionally use an Azimuthal Equidistant (AEQD) projection for sub-meter distance accuracy.
The compression algorithm works in several stages: 1. Project coordinates to a local metric system (AEQD centered on trajectory centroid) 2. Build a KD-tree for fast spatial range queries 3. Find clusters of nearby points using DBSCAN 4. Optionally filter clusters by temporal proximity 5. Aggregate or select representatives from each cluster
- Parameters:
df (pd.DataFrame or pl.DataFrame) – Input trajectory with latitude, longitude, and optionally time columns. Must contain at least lat_col and lon_col columns.
spatial_radius_km (float, default=0.1) – Clustering radius in kilometers (converted to meters internally). Points within this distance are considered spatial neighbors. Typical values: 0.01-0.5 km for urban trajectories.
lat_col (str, default="lat") – Name of the latitude column (WGS84 decimal degrees).
lon_col (str, default="lon") – Name of the longitude column (WGS84 decimal degrees).
time_col (str or None, default="time") – Name of the time column. If None, no temporal filtering is applied. If provided, must be parseable by pandas.to_datetime().
collapse_clusters (bool, default=True) – How to represent each cluster in the output: - True: Merge all cluster points into a single point (geographic centroid) - False: Keep one representative point from each cluster (the one with lowest index)
time_threshold_s (float or None, default=None) – Temporal clustering threshold in seconds. If provided, points must be within this time window AND spatial_radius_km to be clustered together. Useful for separating overlapping paths taken at different times.
agg_other ({"first", "last", "mean"}, default="first") – How to aggregate non-spatial columns when collapse_clusters=True: - “first”: Use values from the first point in the cluster - “last”: Use values from the last point in the cluster - “mean”: Average numeric columns, use first value for non-numeric
min_samples (int, default=1) – Minimum number of neighbors (including self) required to form a core point. Similar to DBSCAN’s min_samples parameter. Higher values filter out isolated points more aggressively.
drop_noise (bool, default=False) – If True, drop points that don’t belong to any cluster (noise points). If False, keep noise points as singleton clusters.
use_aeqd (bool, default=True) – If True, use Azimuthal Equidistant projection centered on trajectory centroid for accurate distance calculations. If False or if AEQD fails, fallback to Web Mercator (EPSG:3857). AEQD is recommended for accurate compression.
final_geodetic_check (bool, default=False) – If True, verify each cluster using geodesic distances (Geod.inv) after the initial projection-based clustering. Slower but provides additional accuracy verification. Usually not needed with use_aeqd=True.
geodetic_check_tol_m (float, default=0.0) – Tolerance in meters for the final geodetic check. Cluster members must be within spatial_radius_km + geodetic_check_tol_m of the cluster centroid. Only used if final_geodetic_check=True.
- Returns:
Compressed trajectory with the same type and column structure as input. The order of rows may differ from the input. The number of rows will be less than or equal to the input (depending on clustering results).
- Return type:
pd.DataFrame or pl.DataFrame
- Raises:
ValueError – If lat_col or lon_col are not present in the DataFrame.
Examples
>>> import pandas as pd >>> import pyehicle as pye >>> >>> # Load a dense GPS trajectory >>> df = pd.read_csv('dense_trajectory.csv') # 1000 points >>> print(f"Original points: {len(df)}") >>> >>> # Simple compression with 100m radius >>> compressed = pye.preprocessing.spatio_temporal_compress( ... df, ... spatial_radius_km=0.1, # 100 meters ... collapse_clusters=True ... ) >>> print(f"Compressed points: {len(compressed)}") # ~200 points >>> >>> # Spatio-temporal compression (separate paths at different times) >>> compressed_st = pye.preprocessing.spatio_temporal_compress( ... df, ... spatial_radius_km=0.05, # 50 meters ... time_threshold_s=300, # 5 minutes ... collapse_clusters=True ... ) >>> >>> # Aggressive compression: drop isolated points, larger radius >>> aggressive = pye.preprocessing.spatio_temporal_compress( ... df, ... spatial_radius_km=0.2, # 200 meters ... min_samples=3, # Need at least 3 neighbors ... drop_noise=True, # Remove isolated points ... agg_other="mean" # Average other columns ... )
Notes
Algorithm Details:
The function implements a DBSCAN-like clustering algorithm optimized for GPS data:
Projection: Coordinates are projected to a local metric system for accurate distance calculations. By default, uses AEQD (Azimuthal Equidistant) projection centered on the trajectory’s geographic centroid. This ensures distances are accurate within a few meters across the entire trajectory.
Spatial Indexing: A KD-tree is built on the projected coordinates for O(log n) neighbor queries. Without scipy, falls back to O(n²) brute force.
Clustering: DBSCAN algorithm identifies clusters of spatially connected points: - Core points: Points with ≥ min_samples neighbors within spatial_radius_km - Cluster formation: Core points and their neighbors form clusters - Noise: Points that don’t belong to any cluster (if not dropped)
Temporal Filtering: If time_threshold_s is provided, neighbor relationships are filtered to only include points within the time window. This prevents clustering of points from different trips on the same path.
Aggregation: Clusters are represented either by their geographic centroid (collapse_clusters=True) or by selecting one representative point.
Performance: - With scipy (cKDTree): O(n log n) - fast even for large trajectories - Without scipy: O(n²) - acceptable for <10,000 points, slow for larger datasets - AEQD projection: ~0.1ms per trajectory (one-time cost) - Geodetic check: O(n) extra cost per cluster (usually not needed)
Accuracy: - AEQD projection: Sub-meter accuracy for distances up to ~1000 km from center - Web Mercator: Acceptable for small spatial_radius (<1 km) at mid-latitudes - Geodetic check: Verifies clusters using true geodesic distances (slowest but most accurate)
Typical Use Cases: - Reduce storage size of dense GPS logs (e.g., 1 Hz sampling → 0.1 Hz effective) - Preprocessing for map-matching (remove redundant points) - Noise reduction in stationary periods (e.g., at traffic lights) - Simplify trajectory for visualization
Warnings: - If scipy is not installed, a warning is issued and O(n²) fallback is used - If AEQD projection fails, a warning is issued and Web Mercator is used - Very large spatial_radius_km (>5 km) may produce unexpected results - Very small min_samples (<1) will be clamped to 1
Compression
- pyehicle.preprocessing.spatio_temporal_compress(df: DataFrame | DataFrame, spatial_radius_km: float = 0.1, lat_col: str = 'lat', lon_col: str = 'lon', time_col: str | None = 'time', collapse_clusters: bool = True, time_threshold_s: float | None = None, agg_other: Literal['first', 'last', 'mean'] = 'first', min_samples: int = 1, drop_noise: bool = False, use_aeqd: bool = True, final_geodetic_check: bool = False, geodetic_check_tol_m: float = 0.0) DataFrame | DataFrame[source]
Compress GPS trajectory by clustering nearby points using DBSCAN-style spatial clustering.
This function reduces the number of points in a trajectory while preserving its essential shape by identifying and merging clusters of points that are close together in space (and optionally time). It uses a KD-tree for efficient O(n log n) neighbor searches and can optionally use an Azimuthal Equidistant (AEQD) projection for sub-meter distance accuracy.
The compression algorithm works in several stages: 1. Project coordinates to a local metric system (AEQD centered on trajectory centroid) 2. Build a KD-tree for fast spatial range queries 3. Find clusters of nearby points using DBSCAN 4. Optionally filter clusters by temporal proximity 5. Aggregate or select representatives from each cluster
- Parameters:
df (pd.DataFrame or pl.DataFrame) – Input trajectory with latitude, longitude, and optionally time columns. Must contain at least lat_col and lon_col columns.
spatial_radius_km (float, default=0.1) – Clustering radius in kilometers (converted to meters internally). Points within this distance are considered spatial neighbors. Typical values: 0.01-0.5 km for urban trajectories.
lat_col (str, default="lat") – Name of the latitude column (WGS84 decimal degrees).
lon_col (str, default="lon") – Name of the longitude column (WGS84 decimal degrees).
time_col (str or None, default="time") – Name of the time column. If None, no temporal filtering is applied. If provided, must be parseable by pandas.to_datetime().
collapse_clusters (bool, default=True) – How to represent each cluster in the output: - True: Merge all cluster points into a single point (geographic centroid) - False: Keep one representative point from each cluster (the one with lowest index)
time_threshold_s (float or None, default=None) – Temporal clustering threshold in seconds. If provided, points must be within this time window AND spatial_radius_km to be clustered together. Useful for separating overlapping paths taken at different times.
agg_other ({"first", "last", "mean"}, default="first") – How to aggregate non-spatial columns when collapse_clusters=True: - “first”: Use values from the first point in the cluster - “last”: Use values from the last point in the cluster - “mean”: Average numeric columns, use first value for non-numeric
min_samples (int, default=1) – Minimum number of neighbors (including self) required to form a core point. Similar to DBSCAN’s min_samples parameter. Higher values filter out isolated points more aggressively.
drop_noise (bool, default=False) – If True, drop points that don’t belong to any cluster (noise points). If False, keep noise points as singleton clusters.
use_aeqd (bool, default=True) – If True, use Azimuthal Equidistant projection centered on trajectory centroid for accurate distance calculations. If False or if AEQD fails, fallback to Web Mercator (EPSG:3857). AEQD is recommended for accurate compression.
final_geodetic_check (bool, default=False) – If True, verify each cluster using geodesic distances (Geod.inv) after the initial projection-based clustering. Slower but provides additional accuracy verification. Usually not needed with use_aeqd=True.
geodetic_check_tol_m (float, default=0.0) – Tolerance in meters for the final geodetic check. Cluster members must be within spatial_radius_km + geodetic_check_tol_m of the cluster centroid. Only used if final_geodetic_check=True.
- Returns:
Compressed trajectory with the same type and column structure as input. The order of rows may differ from the input. The number of rows will be less than or equal to the input (depending on clustering results).
- Return type:
pd.DataFrame or pl.DataFrame
- Raises:
ValueError – If lat_col or lon_col are not present in the DataFrame.
Examples
>>> import pandas as pd >>> import pyehicle as pye >>> >>> # Load a dense GPS trajectory >>> df = pd.read_csv('dense_trajectory.csv') # 1000 points >>> print(f"Original points: {len(df)}") >>> >>> # Simple compression with 100m radius >>> compressed = pye.preprocessing.spatio_temporal_compress( ... df, ... spatial_radius_km=0.1, # 100 meters ... collapse_clusters=True ... ) >>> print(f"Compressed points: {len(compressed)}") # ~200 points >>> >>> # Spatio-temporal compression (separate paths at different times) >>> compressed_st = pye.preprocessing.spatio_temporal_compress( ... df, ... spatial_radius_km=0.05, # 50 meters ... time_threshold_s=300, # 5 minutes ... collapse_clusters=True ... ) >>> >>> # Aggressive compression: drop isolated points, larger radius >>> aggressive = pye.preprocessing.spatio_temporal_compress( ... df, ... spatial_radius_km=0.2, # 200 meters ... min_samples=3, # Need at least 3 neighbors ... drop_noise=True, # Remove isolated points ... agg_other="mean" # Average other columns ... )
Notes
Algorithm Details:
The function implements a DBSCAN-like clustering algorithm optimized for GPS data:
Projection: Coordinates are projected to a local metric system for accurate distance calculations. By default, uses AEQD (Azimuthal Equidistant) projection centered on the trajectory’s geographic centroid. This ensures distances are accurate within a few meters across the entire trajectory.
Spatial Indexing: A KD-tree is built on the projected coordinates for O(log n) neighbor queries. Without scipy, falls back to O(n²) brute force.
Clustering: DBSCAN algorithm identifies clusters of spatially connected points: - Core points: Points with ≥ min_samples neighbors within spatial_radius_km - Cluster formation: Core points and their neighbors form clusters - Noise: Points that don’t belong to any cluster (if not dropped)
Temporal Filtering: If time_threshold_s is provided, neighbor relationships are filtered to only include points within the time window. This prevents clustering of points from different trips on the same path.
Aggregation: Clusters are represented either by their geographic centroid (collapse_clusters=True) or by selecting one representative point.
Performance: - With scipy (cKDTree): O(n log n) - fast even for large trajectories - Without scipy: O(n²) - acceptable for <10,000 points, slow for larger datasets - AEQD projection: ~0.1ms per trajectory (one-time cost) - Geodetic check: O(n) extra cost per cluster (usually not needed)
Accuracy: - AEQD projection: Sub-meter accuracy for distances up to ~1000 km from center - Web Mercator: Acceptable for small spatial_radius (<1 km) at mid-latitudes - Geodetic check: Verifies clusters using true geodesic distances (slowest but most accurate)
Typical Use Cases: - Reduce storage size of dense GPS logs (e.g., 1 Hz sampling → 0.1 Hz effective) - Preprocessing for map-matching (remove redundant points) - Noise reduction in stationary periods (e.g., at traffic lights) - Simplify trajectory for visualization
Warnings: - If scipy is not installed, a warning is issued and O(n²) fallback is used - If AEQD projection fails, a warning is issued and Web Mercator is used - Very large spatial_radius_km (>5 km) may produce unexpected results - Very small min_samples (<1) will be clamped to 1
Map-Matching
- pyehicle.preprocessing.leuven(df: DataFrame | DataFrame, lat_col: str = 'lat', lon_col: str = 'lon', delete_cache: bool = False, max_dist: float = 0.01, tile_max_deg: float = 0.5, tile_overlap_deg: float = 0.05, cache_overpass: bool = True, return_projected: bool = False, aeqd_precision: int = 6) DataFrame | DataFrame[source]
Map-match GPS trajectory to road network using Hidden Markov Model algorithm (Leuven).
This function aligns noisy GPS points to the most likely path on the actual road network by fetching OpenStreetMap data and applying HMM-based map-matching. It corrects GPS drift, snaps points to road centerlines, and assigns OSM way IDs for road-network-based analysis.
The algorithm: 1. Fetches driveable roads from Overpass API (tiled for large areas) 2. Projects coordinates to AEQD (Azimuthal Equidistant) for accurate metric calculations 3. Builds in-memory road network graph from OSM data 4. Applies Hidden Markov Model to find most likely road path 5. Returns trajectory with coordinates snapped to roads
This is essential preprocessing for: - Trajectory reconstruction (provides OSM IDs for road identification) - Removing GPS noise and drift - Preparing data for road-network-based routing analysis - Generating training data for machine learning models
- Parameters:
df (pd.DataFrame or pl.DataFrame) – Input GPS trajectory with at least latitude and longitude columns. Can contain additional columns (all will be preserved). Minimum 2 points required.
lat_col (str, default='lat') – Name of the latitude column (WGS84 decimal degrees).
lon_col (str, default='lon') – Name of the longitude column (WGS84 decimal degrees).
delete_cache (bool, default=False) – If True, delete cached Overpass API responses after matching. Useful for saving disk space or forcing fresh data download. Recommended: False (reuse cache).
max_dist (float, default=0.01) – Maximum distance in kilometers for matching GPS points to road nodes. Points farther than this from any road will be excluded from the result. Typical values: - 0.01 km (10m): High-accuracy GPS, urban areas - 0.05 km (50m): Standard GPS accuracy - 0.1 km (100m): Low-accuracy GPS or sparse road networks
tile_max_deg (float, default=0.5) – Maximum tile size in degrees for splitting large bounding boxes. Overpass API has request size limits, so large trajectories are automatically tiled. Larger values = fewer API calls but longer response times. Typical: 0.3-0.7 degrees.
tile_overlap_deg (float, default=0.05) – Overlap between adjacent tiles in degrees to ensure roads crossing tile boundaries are captured. Should be >= typical road segment length. Typical: 0.03-0.1 degrees.
cache_overpass (bool, default=True) – If True, cache Overpass API responses to disk in ./overpass_cache/ for reuse. Dramatically speeds up repeated matching in the same area. Recommended: True.
return_projected (bool, default=False) – If True, return coordinates in AEQD projected meters (x, y) instead of WGS84 degrees (lon, lat). Useful for distance calculations. Typical: False (keep lat/lon).
aeqd_precision (int, default=6) – Decimal precision for rounding AEQD projection center coordinates when caching transformers. Higher precision = more cache entries. Typical: 5-7.
- Returns:
Map-matched trajectory with coordinates snapped to road network. Returns same type as input (pandas → pandas, polars → polars). Output contains: - Matched coordinates (lat/lon or x/y depending on return_projected) - All original columns preserved - Points filtered: only those within max_dist of roads are kept - Order preserved: points remain in chronological sequence
- Return type:
pd.DataFrame or pl.DataFrame
Examples
>>> import pandas as pd >>> import pyehicle as pye >>> >>> # Load raw GPS trajectory >>> df = pd.read_csv('raw_gps.csv') >>> print(f"Raw trajectory: {len(df)} points") >>> >>> # Map-match to OSM roads with default settings >>> matched = pye.preprocessing.leuven(df) >>> print(f"Matched trajectory: {len(matched)} points") >>> >>> # Map-match with custom distance threshold (100m) >>> matched = pye.preprocessing.leuven( ... df, ... max_dist=0.1, # 100 meters ... cache_overpass=True # Reuse cached OSM data ... ) >>> >>> # Use in preprocessing pipeline >>> compressed = pye.preprocessing.spatio_temporal_compress(df) >>> matched = pye.preprocessing.leuven(compressed) # Map-match compressed data >>> filtered = pye.preprocessing.kalman_aeqd_filter(matched) # Apply Kalman filter >>> >>> # Visualize before/after >>> pye.utilities.visualization.multiple( ... [df, matched], ... names=['Raw GPS', 'Map-Matched'], ... show_in_browser=True ... )
Notes
Algorithm: Hidden Markov Model (HMM) - States: Road segments (edges) in the OSM network - Observations: GPS points with noise - Transition probability: Likelihood of traveling from one road to another - Emission probability: Likelihood of observing GPS point given road location - Viterbi algorithm: Finds most likely sequence of states (road path)
Data Sources: - Overpass API: Public OSM query service (https://overpass-api.de/) - Fallback: OSMnx library if Overpass fails - Caching: Responses saved to ./overpass_cache/ as XML files
Dependencies: - leuvenmapmatching: Core HMM matching library - osmium: Fast OSM XML parsing (preferred) - osmnx: Fallback for OSM data fetching and parsing - pyproj: AEQD projection transformations
Projection: AEQD (Azimuthal Equidistant) - Centered on trajectory bounding box center - Preserves distances from center point - Converts lat/lon (degrees) to x/y (meters) - Essential for accurate distance-based matching
Performance: - Time complexity: O(n * m * k) where:
n = trajectory points
m = average candidate roads per point
k = HMM Viterbi path length
For 1000-point trajectory: 30-90 seconds (first run), <5 seconds (cached)
Overpass API calls: ~1-5 tiles for typical city-scale trajectories
Cache significantly speeds up repeated matching in same area
Quality Factors: - GPS accuracy: Better GPS → better matching - Road network density: Urban >> rural - Sampling rate: Higher frequency → better path inference - max_dist: Should match expected GPS error (typically 10-50m)
Limitations: - Requires internet connection (Overpass API or OSMnx) - May fail in areas with sparse OSM coverage - Computationally intensive for very long trajectories (>10,000 points) - Does not handle GPS outages or tunnels well - Assumes trajectory follows driveable roads (not for off-road vehicles)
Troubleshooting: - “No OSM data available”: Check internet connection, try larger tile_max_deg - Slow performance: Enable cache_overpass=True, reduce trajectory size with compression - Poor matching: Increase max_dist, check OSM coverage in area, improve GPS quality - Overpass timeout: Reduce tile_max_deg (split into smaller tiles)
See also
meiliAlternative map-matching using Valhalla routing engine
spatio_temporal_compressReduce trajectory size before map-matching
kalman_aeqd_filterApply Kalman filtering after map-matching
- pyehicle.preprocessing.meili(df: DataFrame | DataFrame, lat_col: str = 'lat', lon_col: str = 'lon', time_col: str = 'time') DataFrame | DataFrame[source]
Map-match a GPS trajectory using Valhalla’s Meili algorithm.
This function requires a Valhalla server running on localhost:8002. Meili uses a Hidden Markov Model to match GPS traces to roads, handling noise and gaps effectively.
- Parameters:
- Returns:
Map-matched trajectory with coordinates snapped to roads. Includes ‘original_lon’ and ‘original_lat’ columns if time_col is present.
- Return type:
pd.DataFrame or pl.DataFrame
Notes
Requires Valhalla server running on http://localhost:8002 Install and run Valhalla before using this function. See: https://github.com/valhalla/valhalla
Kalman Filtering
- pyehicle.preprocessing.kalman_aeqd_filter(df: DataFrame | DataFrame, lat_col: str = 'lat', lon_col: str = 'lon', time_col: str | None = 'time', process_noise_std_m_init: float = 1.0, measurement_noise_std_m_init: float = 10.0, em_iters: int = 3, em_tol: float = 0.001, chunk_size: int | None = None, overlap: int = 50, outlier_alpha: float | None = 0.01, return_states: bool = False, per_chunk_aeqd: bool = True, use_mmap: bool = False, mmap_dir: str | None = None, mmap_threshold: int = 200000, return_QR: bool = False, use_numba: bool = False, transformer_cache_precision: int = 6, verbose: bool = False)[source]
Apply Kalman filtering and RTS smoothing to a GPS trajectory with automatic noise parameter learning.
This function smooths noisy GPS trajectories using a Kalman filter with a constant-velocity motion model. It uses an Azimuthal Equidistant (AEQD) projection for accurate metric calculations and can automatically learn optimal noise parameters through the EM algorithm. Optionally detects and flags outliers.
- Parameters:
df (pd.DataFrame or pl.DataFrame) – Input trajectory with latitude, longitude, and optionally time columns.
lat_col (str, default="lat") – Name of the latitude column.
lon_col (str, default="lon") – Name of the longitude column.
time_col (str or None, default="time") – Name of the time column. If None, synthetic timestamps are created.
process_noise_std_m_init (float, default=1.0) – Initial process noise standard deviation in meters (affects position).
measurement_noise_std_m_init (float, default=10.0) – Initial measurement noise standard deviation in meters (GPS error).
em_iters (int, default=3) – Number of EM iterations for learning noise parameters (0 = use initial values).
em_tol (float, default=1e-3) – EM convergence tolerance (relative change in Q and R matrices).
chunk_size (int or None, default=None) – Process trajectory in chunks of this size (useful for very long trajectories).
overlap (int, default=50) – Number of overlapping points between chunks for smooth stitching.
outlier_alpha (float or None, default=0.01) – Significance level for outlier detection (None = no outlier detection).
return_states (bool, default=False) – If True, also return full state vectors (position + velocity).
per_chunk_aeqd (bool, default=True) – If True, use a separate AEQD projection for each chunk.
use_mmap (bool, default=False) – Use memory-mapped arrays for very large trajectories.
mmap_dir (str or None, default=None) – Directory for memory-mapped files.
mmap_threshold (int, default=200000) – Trajectory length above which to use memory mapping.
return_QR (bool, default=False) – If True, also return learned Q and R covariance matrices.
use_numba (bool, default=False) – Use numba JIT compilation for faster processing (if numba is available).
transformer_cache_precision (int, default=6) – Number of decimal places for caching coordinate transformers.
verbose (bool, default=False) – Print progress information during processing.
- Returns:
pd.DataFrame or pl.DataFrame – Smoothed trajectory with the same type as input. Contains smoothed lat/lon and a ‘_kept’ column indicating which points passed outlier detection.
np.ndarray (optional) – If return_states=True, returns full state vectors (n x 4) with position and velocity.
tuple (optional) – If return_QR=True, returns (avg_Q, avg_R, qr_list) with learned covariance matrices.
Notes
The Kalman filter uses a 2D constant-velocity model: - State: [x, vx, y, vy] (position and velocity in AEQD projected coordinates) - Observation: [x, y] (GPS measurements) The EM algorithm learns optimal Q (process noise) and R (measurement noise) from the data.
Segmentation
- pyehicle.preprocessing.by_time(df: DataFrame | DataFrame, time_threshold: float = 30, length_threshold: int = 20, time_col: str = 'time') DataFrame | DataFrame | list[source]
Segment a trajectory into sub-trajectories based on time gaps between consecutive points.
This function splits a trajectory whenever there is a time gap larger than the specified threshold. Only segments longer than the length threshold are kept. This is useful for: - Separating distinct trips recorded in a single file - Identifying breaks in continuous GPS tracking - Removing short, fragmented trajectory segments
The function preserves all original columns and maintains the temporal order within segments.
- Parameters:
df (pd.DataFrame or pl.DataFrame) – The trajectory DataFrame to segment. Must contain a time column.
time_threshold (float, default=30) – Maximum time gap in seconds. Gaps larger than this will cause a split. Common values: - 30s: For removing brief GPS signal losses - 300s (5 min): For separating distinct trips - 3600s (1 hour): For separating different days
length_threshold (int, default=20) – Minimum number of points for a segment to be kept. Segments with fewer points are discarded. Helps filter out very short, potentially noisy segments.
time_col (str, default='time') – Name of the time column. Must be parseable by pandas.to_datetime().
- Returns:
The return type depends on the number of valid segments found: - If only 1 valid segment: Returns a single DataFrame (same type as input) - If multiple valid segments: Returns a list of DataFrames - If no valid segments (all too short): Returns the original DataFrame unchanged - If input has ≤ 1 row: Returns the original DataFrame unchanged
- Return type:
pd.DataFrame, pl.DataFrame, or list
Examples
>>> import pandas as pd >>> import pyehicle as pye >>> >>> # Load a trajectory with multiple trips >>> df = pd.read_csv('full_day_gps.csv') >>> print(f"Total points: {len(df)}") >>> >>> # Split at 10-minute gaps, keep segments with 50+ points >>> segments = pye.preprocessing.by_time( ... df, ... time_threshold=600, # 10 minutes ... length_threshold=50 ... ) >>> >>> # Check results >>> if isinstance(segments, list): ... print(f"Found {len(segments)} trips") ... for i, seg in enumerate(segments): ... print(f"Trip {i+1}: {len(seg)} points") ... else: ... print(f"Single continuous trajectory: {len(segments)} points") >>> >>> # Process each segment separately >>> if isinstance(segments, list): ... for i, segment in enumerate(segments): ... # Apply preprocessing to each trip ... compressed = pye.preprocessing.spatio_temporal_compress(segment) ... matched = pye.preprocessing.leuven(compressed) ... matched.to_csv(f'trip_{i+1}_matched.csv')
Notes
Algorithm: 1. Calculate time differences between consecutive points 2. Identify split points where time_diff > time_threshold 3. Split trajectory at these points 4. Filter out segments with length ≤ length_threshold 5. Return results based on number of valid segments
Performance: - Time complexity: O(n) where n is the number of points - Memory: O(n) for creating new DataFrames - Fast for all trajectory sizes
Edge Cases: - Empty or single-point input: Returns original DataFrame - All segments too short: Returns original DataFrame - Time column not sorted: Results may be unexpected (assumes chronological order)
Use Cases: - Preprocessing multi-day GPS logs - Separating work commutes from personal trips - Removing GPS signal loss periods - Batch processing distinct trajectory segments
Interpolation
- pyehicle.preprocessing.by_number_of_points(df: DataFrame | DataFrame, num: int, lat_col: str = 'lat', lon_col: str = 'lon', time_col: str = 'time') DataFrame | DataFrame[source]
Resample trajectory to exact number of points using geodesic interpolation.
This function resamples a GPS trajectory to contain exactly num points by interpolating along geodesic (great-circle) paths between original points. Points are distributed uniformly by distance along the trajectory, and timestamps are interpolated proportionally.
This is useful for: - Normalizing trajectory density across datasets - Preparing trajectories for fixed-size neural network inputs - Creating smoother visualizations - Standardizing trajectories for comparison
- Parameters:
df (pd.DataFrame or pl.DataFrame) – Input trajectory with at least latitude, longitude, and optionally time columns. Minimum 2 points required.
num (int) – Target number of points in the output trajectory. Must be > 2. Start and end points are always preserved.
lat_col (str, default='lat') – Name of the latitude column (WGS84 decimal degrees).
lon_col (str, default='lon') – Name of the longitude column (WGS84 decimal degrees).
time_col (str, default='time') – Name of the time column. If present, timestamps are interpolated linearly by cumulative distance. If not present, output contains only lat/lon.
- Returns:
Resampled trajectory with exactly num points. Returns same type as input. Contains columns: lat_col, lon_col, and time_col (if present in input). Points are distributed uniformly by distance along the geodesic path.
- Return type:
pd.DataFrame or pl.DataFrame
Examples
>>> import pandas as pd >>> import pyehicle as pye >>> >>> # Load trajectory with variable spacing >>> df = pd.read_csv('trajectory.csv') >>> print(f"Original: {len(df)} points") >>> >>> # Resample to exactly 100 points >>> resampled = pye.preprocessing.by_number_of_points(df, num=100) >>> print(f"Resampled: {len(resampled)} points") # Exactly 100 >>> >>> # Increase density for smoother visualization >>> dense = pye.preprocessing.by_number_of_points(df, num=500) >>> pye.utilities.visualization.single(dense, name='Dense Trajectory')
Notes
Algorithm: 1. Calculate cumulative geodesic distances along the original trajectory 2. Define num target distances uniformly spaced from 0 to total_distance 3. For each target distance, find containing segment and interpolation fraction 4. Interpolate coordinates geodesically using pyproj Geod.fwd() 5. Interpolate timestamps linearly by distance (if time column present)
Geodesic Interpolation: - Uses WGS84 ellipsoid (pyproj Geod) - Accurate for all distances and latitudes - Interpolates along great-circle paths (shortest distance on sphere) - Preserves exact start and end points
Performance: - Time complexity: O(n + m) where n = input points, m = output points - Vectorized operations for efficiency - Fast even for large trajectories (>10,000 points)
Use Cases: - Standardizing trajectory density for machine learning - Creating smooth animations with fixed frame counts - Upsampling sparse trajectories (num > len(df)) - Downsampling dense trajectories (num < len(df))
See also
by_sampling_rateResample by temporal interval
get_sampling_rateCalculate current sampling rate
- pyehicle.preprocessing.by_sampling_rate(df: DataFrame | DataFrame, target_sampling_rate: float, lat_col: str = 'lat', lon_col: str = 'lon', time_col: str = 'time') DataFrame | DataFrame[source]
Resample trajectory to uniform temporal interval using geodesic interpolation.
This function resamples a GPS trajectory to a constant time interval (sampling rate) by interpolating coordinates at regular temporal intervals. This creates trajectories with uniform time spacing, essential for time-series analysis and temporal modeling.
Coordinates are interpolated geodesically along great-circle paths between original points, maintaining geographic accuracy across all latitudes and distances.
This is useful for: - Normalizing temporal resolution across datasets - Preparing trajectories for time-series analysis - Synchronizing trajectories from different devices - Creating animations with consistent frame timing
- Parameters:
df (pd.DataFrame or pl.DataFrame) – Input trajectory with latitude, longitude, and time columns. Minimum 2 points.
target_sampling_rate (float) – Target time interval in seconds between consecutive points. Must be positive and smaller than the current sampling rate. Typical values: - 1.0s: High-frequency (1 Hz) - 5.0s: Standard GPS logging - 10.0s: Low-frequency tracking
lat_col (str, default='lat') – Name of the latitude column (WGS84 decimal degrees).
lon_col (str, default='lon') – Name of the longitude column (WGS84 decimal degrees).
time_col (str, default='time') – Name of the time column. Must be parseable by pandas.to_datetime().
- Returns:
Resampled trajectory with uniform temporal spacing. Returns same type as input. Contains columns: lat_col, lon_col, time_col. Points are spaced at exact target_sampling_rate intervals. Duplicates removed if present.
- Return type:
pd.DataFrame or pl.DataFrame
- Raises:
If target_sampling_rate <= 0 - If target_sampling_rate > current sampling rate (would require downsampling)
Examples
>>> import pandas as pd >>> import pyehicle as pye >>> >>> # Load trajectory with variable sampling rate >>> df = pd.read_csv('trajectory.csv') >>> current_rate = pye.preprocessing.get_sampling_rate(df) >>> print(f"Current sampling rate: {current_rate}s") >>> >>> # Resample to 5-second intervals >>> resampled = pye.preprocessing.by_sampling_rate(df, target_sampling_rate=5.0) >>> new_rate = pye.preprocessing.get_sampling_rate(resampled) >>> print(f"New sampling rate: {new_rate}s") # Should be ~5.0 >>> >>> # Synchronize multiple trajectories to same temporal resolution >>> traj1 = pye.preprocessing.by_sampling_rate(df1, target_sampling_rate=1.0) >>> traj2 = pye.preprocessing.by_sampling_rate(df2, target_sampling_rate=1.0) >>> # Now both have 1-second intervals for comparison
Notes
Algorithm: 1. Convert timestamps to seconds from start 2. Generate target times at uniform intervals: 0, Δt, 2Δt, 3Δt, … 3. For each target time, find containing time segment 4. Calculate interpolation fraction within segment 5. Interpolate coordinates geodesically at that time 6. Convert back to datetime format
Geodesic Interpolation: - Uses WGS84 ellipsoid for accurate great-circle paths - Coordinates interpolated along shortest distance on sphere - Maintains geographic accuracy at all latitudes - Vectorized operations for performance
Temporal Requirements: - target_sampling_rate must be smaller than current rate - This function increases temporal resolution (more frequent sampling) - For decreasing resolution (downsampling), use data selection instead
Performance: - Time complexity: O(n + m) where n = input points, m = output points - Vectorized numpy operations throughout - Fast for typical trajectories (<10,000 points)
Use Cases: - Preparing data for LSTM/RNN models (uniform time steps) - Creating smooth real-time visualizations - Synchronizing multi-sensor data - Generating training data with consistent temporal resolution
See also
by_number_of_pointsResample by fixed point count
get_sampling_rateCalculate current sampling rate
Sampling Rate
- pyehicle.preprocessing.get_sampling_rate(df: DataFrame | DataFrame, time_col: str = 'time') float[source]
Calculate the average sampling rate (time interval) of a GPS trajectory.
This function computes the mean time difference between consecutive trajectory points, providing insight into the temporal resolution of the GPS data. This information is useful for: - Understanding data quality and resolution - Choosing appropriate interpolation parameters - Validating trajectory resampling operations - Comparing trajectories from different sources
The function handles various time formats and returns -1.0 for trajectories with insufficient data (< 2 points).
- Parameters:
df (pd.DataFrame or pl.DataFrame) – The DataFrame containing trajectory data with timestamps.
time_col (str, default='time') – The column name for time values. The column should contain datetime-like values that can be parsed by pandas.to_datetime().
- Returns:
The average sampling rate in seconds, rounded to 3 decimal places. Returns -1.0 if the DataFrame has fewer than 2 rows (cannot calculate rate).
- Return type:
- Raises:
ValueError – If the specified time column is not found in the DataFrame.
Examples
>>> import pandas as pd >>> import pyehicle as pye >>> >>> # Create a trajectory with 5-second intervals >>> df = pd.DataFrame({ ... 'time': pd.date_range('2023-01-01', periods=100, freq='5S'), ... 'lat': np.linspace(56.95, 56.96, 100), ... 'lon': np.linspace(24.10, 24.11, 100) ... }) >>> >>> # Check sampling rate >>> rate = pye.preprocessing.get_sampling_rate(df) >>> print(f"Sampling rate: {rate} seconds") # Output: 5.0 seconds >>> >>> # Use sampling rate to choose interpolation target >>> if rate > 10: ... # Sparse data - may need more interpolation ... target_rate = rate / 2 ... else: ... # Dense data - keep similar rate ... target_rate = rate >>> >>> # Resample trajectory to target rate >>> resampled = pye.preprocessing.by_sampling_rate( ... df, ... target_sampling_rate=target_rate ... )
>>> # Compare sampling rates before and after compression >>> original_rate = pye.preprocessing.get_sampling_rate(df) >>> compressed = pye.preprocessing.spatio_temporal_compress(df) >>> compressed_rate = pye.preprocessing.get_sampling_rate(compressed) >>> print(f"Original: {original_rate}s, After compression: {compressed_rate}s")
Notes
Calculation Method: The function computes: mean(time[i+1] - time[i]) for all consecutive pairs. This provides a robust estimate even for trajectories with variable sampling rates.
Interpretation: - 1.0s: High-frequency GPS (1 Hz sampling) - 5.0s: Standard GPS logging (0.2 Hz) - 10-30s: Low-frequency tracking - >60s: Sparse trajectory or vehicle with intermittent tracking
Limitations: - Assumes time column is in chronological order (not validated) - Doesn’t detect or handle outliers in sampling rate - For highly variable rates, mean may not be representative - Returns -1.0 for empty or single-point trajectories
Performance: - Time complexity: O(n) where n is the number of points - Memory: O(n) for time difference array - Fast even for large trajectories (> 1 million points)
Reconstructing Module
Reconstructing module for the pyehicle library.
This module provides algorithms for reconstructing and refining GPS trajectories by combining multiple segments, interpolating curves, and enforcing road network continuity.
- pyehicle.reconstructing.curve_interpolation(df, road_network, lower_threshold=20, upper_threshold=80, max_node_distance=10, time_col='time', lat_col='lat', lon_col='lon')[source]
Improve trajectory realism by interpolating points along road curves based on bearing changes.
This is the final step in trajectory reconstruction. It takes a refined trajectory and adds intermediate points at locations where the vehicle direction changes (curves, turns, corners). By detecting bearing changes and filling in the actual road geometry, it creates high-fidelity trajectories that closely follow real-world road paths.
The function analyzes bearing (direction) changes between consecutive points. When a bearing change falls within the specified threshold range, it indicates a curve that would benefit from additional points. The function then: 1. Finds the road network path between the curve endpoints 2. Interpolates multiple points along that path 3. Assigns timestamps proportionally to distance traveled
This is essential for: - Visualizing realistic vehicle paths (trajectories that look like actual driving) - Traffic analysis requiring accurate road following (lane-level positioning) - Route validation and comparison (matching GPS traces to expected routes) - Simulation and animation (smooth, realistic vehicle movement)
Algorithm Overview: 1. Bearing Analysis: Calculate bearing (direction) between each consecutive point pair 2. Bearing Change Detection: Compute absolute change in bearing (accounts for 360° wrap) 3. Curve Identification: Flag points where bearing change ∈ [lower_threshold, upper_threshold] 4. Path Interpolation: For each curve, find road network path and interpolate points 5. Temporal Distribution: Assign timestamps proportional to distance along interpolated path
Why Bearing Thresholds? - Too small (<20°): Captures noise and minor GPS drift, not real curves - Sweet spot (20-80°): Gradual turns, highway exits, curved roads, roundabouts - Too large (>80°): Sharp 90° turns already handled by refine_trajectory()
- Parameters:
df (pd.DataFrame) – Input trajectory DataFrame with columns for latitude, longitude, and timestamp. Should be output from refine_trajectory() for best results. Minimum columns: lat_col, lon_col, time_col.
road_network (igraph.Graph) – Road network graph loaded from OSM data. Must have: - Node attributes: ‘x’ (longitude), ‘y’ (latitude) - Edge attributes: ‘osmid’ (OpenStreetMap way ID), ‘geometry’ (LineString) Typically loaded via utilities.road_network.load_road_network().
lower_threshold (float, default=20) – Minimum bearing change in degrees to trigger curve interpolation. Bearing changes smaller than this are considered straight segments (no interpolation needed). Typical values: - 10°: Very sensitive, captures subtle curves (may add noise) - 20°: Recommended default, captures most real curves - 30°: Less sensitive, only significant curves
upper_threshold (float, default=80) – Maximum bearing change in degrees to trigger curve interpolation. Bearing changes larger than this are considered sharp turns (already handled by refine_trajectory() or too abrupt for curve interpolation). Typical values: - 60°: Captures gradual to moderate curves - 80°: Recommended default, avoids sharp 90° turns - 90°: Includes sharper turns (may overlap with refinement)
max_node_distance (float, default=10) – Maximum distance in meters for snapping trajectory points to road nodes during interpolation. Points farther than this will skip interpolation. Typical values: - 10-15m: High-accuracy GPS (recommended) - 20-30m: Standard GPS accuracy - 40-50m: Low-accuracy GPS or sparse road networks
time_col (str, default='time') – Name of the timestamp column in df. Can be Unix timestamps (numeric) or formatted datetime strings. Supports various formats.
lat_col (str, default='lat') – Name of the latitude column in df (WGS84 decimal degrees).
lon_col (str, default='lon') – Name of the longitude column in df (WGS84 decimal degrees).
- Returns:
Enhanced trajectory with interpolated points at curves. Contains the same columns as the input (lat_col, lon_col, time_col) with: - Original trajectory points preserved - Intermediate points added at detected curves - Timestamps distributed proportionally to distance - Points sorted chronologically - Duplicates removed Output format: Timestamps as strings (‘%Y-%m-%d %H:%M:%S’)
- Return type:
pd.DataFrame
Examples
>>> import pandas as pd >>> import pyehicle as pye >>> >>> # Load refined trajectory (output from refine_trajectory) >>> df = pd.read_csv('refined_trajectory.csv') >>> print(f"Refined trajectory: {len(df)} points") >>> >>> # Load road network >>> G = pye.utilities.road_network.load_road_network('city_roads.graphml') >>> >>> # Interpolate curves with default thresholds (20-80°) >>> final = pye.reconstructing.curve_interpolation( ... df, ... road_network=G, ... lower_threshold=20, ... upper_threshold=80, ... max_node_distance=10 ... ) >>> >>> print(f"Final trajectory: {len(final)} points") >>> print(f"Added {len(final) - len(df)} points at curves") >>> >>> # Visualize before/after >>> pye.utilities.visualization.multiple( ... [df, final], ... names=['Refined', 'With Curves'], ... show_in_browser=True ... )
>>> # Full reconstruction pipeline (all steps) >>> import pyehicle as pye >>> >>> # 1. Preprocessing >>> df = pd.read_csv('raw_gps.csv') >>> compressed = pye.preprocessing.spatio_temporal_compress(df) >>> matched = pye.preprocessing.leuven(compressed, city='Riga', country='Latvia') >>> >>> # 2. Segmentation and combination >>> segments = pye.preprocessing.by_time(matched, time_threshold=300) >>> combined = pye.reconstructing.trajectory_combiner(segments) >>> >>> # 3. Refinement (road network continuity) >>> G = pye.utilities.road_network.load_road_network('riga_roads.graphml') >>> refined = pye.reconstructing.refine_trajectory(combined, G, max_node_distance=10) >>> >>> # 4. Curve interpolation (THIS FUNCTION - final step) >>> final = pye.reconstructing.curve_interpolation( ... refined, G, ... lower_threshold=20, ... upper_threshold=80, ... max_node_distance=10 ... ) >>> >>> # Save final high-fidelity trajectory >>> final.to_csv('reconstructed_trajectory.csv', index=False) >>> print(f"✓ Reconstruction complete: {len(final)} points")
Notes
Algorithm Details: 1. Calculate bearings between consecutive points using geodesic forward azimuth 2. Compute bearing changes: abs(bearing[i] - bearing[i-1]), handling 360° wrap-around 3. For each point where lower_threshold ≤ bearing_change ≤ upper_threshold:
Find shortest path through road network from previous point to current point
Reconstruct LineString geometry from the path
Calculate cumulative distances along the path
Interpolate timestamps proportionally: t = t1 + (d/total_d) * (t2-t1)
Replace the two original points with the interpolated path points
Deduplicate coordinates and sort by timestamp
Bearing Change Calculation: Bearing changes handle the circular nature of angles (360° = 0°):
`python change = abs(bearing2 - bearing1) change = change if change <= 180 else 360 - change `This ensures: 350° → 10° = 20° change (not 340°)Temporal Interpolation: Timestamps are distributed proportionally to distance along the interpolated path, assuming constant speed: - t[i] = t1 + (cumulative_distance[i] / total_distance) * (t2 - t1) - This maintains realistic timing and ensures chronological order
Road Network Filtering: The function filters the road network to the trajectory’s bounding box (0.5° buffer) for performance, similar to refine_trajectory().
Performance: - Time complexity: O(n * m * k) where:
n = trajectory points
m = avg points per interpolated curve
k = avg edges in shortest path
For 1000-point trajectory with 20 curves: 10-60 seconds
Dominated by shortest path computation (Dijkstra’s algorithm)
Quality Tuning: - More aggressive (lower_threshold=5, upper_threshold=60):
Captures more curves
More interpolated points
Smoother trajectories
Slower processing
Risk of over-interpolation
Less aggressive (lower_threshold=15, upper_threshold=40): - Captures only significant curves - Fewer interpolated points - Faster processing - May miss subtle curves
Limitations: - Requires road network with valid geometry and OSM IDs - May fail for disconnected roads or missing road segments - Does not handle U-turns well (>180° bearing changes) - Assumes GPS points are reasonably map-matched (best after refine_trajectory()) - Does not consider speed, acceleration, or vehicle dynamics
Use Cases: - Final step in trajectory reconstruction pipeline - Preparing trajectories for visualization and presentation - Generating realistic vehicle paths for traffic simulation - Route validation and GPS trace comparison - Creating training data for machine learning (realistic vehicle behavior)
When NOT to Use: - Raw GPS data (use refine_trajectory() first) - Trajectories with very high sampling rate (already smooth) - Applications not requiring curve realism (simple analysis)
See also
trajectory_combinerFirst step - combine and sort trajectory segments
refine_trajectorySecond step - enforce road network continuity
calculate_bearings_pyprojCalculate trajectory bearings for curve detection
utilities.road_network.load_road_networkLoad OSM road network graph
- pyehicle.reconstructing.refine_trajectory(df, road_network, max_node_distance=10, time_col='time', lat_col='lat', lon_col='lon')[source]
Refine GPS trajectory by enforcing spatial continuity with the underlying road network.
This is the main trajectory refinement function and the second step in reconstruction. It takes a raw or preprocessed GPS trajectory and “snaps” it to realistic road paths by detecting road transitions and interpolating corner points at intersections. The result is a refined trajectory that follows actual road geometry.
The function processes the trajectory sequentially, detecting when consecutive points lie on different roads (OSM ID changes) and filling gaps with corner points that follow the road network. This is essential for: - Converting GPS traces into road-matched trajectories - Preparing data for traffic analysis and route reconstruction - Generating realistic vehicle paths for simulation - Improving trajectory quality before curve interpolation
Algorithm Overview: 1. Preprocessing: Convert timestamps, remove duplicates, filter road network to bounding box 2. Initialization: Build spatial index, initialize refined trajectory 3. Sequential Processing: For each consecutive pair of trajectory points:
Find nearest road edges (OSM IDs) for both points
If OSM IDs differ (road transition): Interpolate corner points at intersection
If OSM IDs match (same road): Add current point directly
Finalization: Convert back to DataFrame with formatted timestamps
OSM ID Detection: The function uses OpenStreetMap way IDs to identify specific roads. When consecutive trajectory points have different OSM IDs, it indicates a road transition (e.g., turning at an intersection). The function then calls interpolate_corner_point() to fill the gap with realistic corner points.
- Parameters:
df (pd.DataFrame) – Input trajectory DataFrame with columns for latitude, longitude, and timestamp. Can be raw GPS data or preprocessed (e.g., after compression or map-matching). Minimum columns: lat_col, lon_col, time_col.
road_network (igraph.Graph) – Road network graph loaded from OSM data. Must have: - Node attributes: ‘x’ (longitude), ‘y’ (latitude) - Edge attributes: ‘osmid’ (OpenStreetMap way ID), ‘geometry’ (LineString) Typically loaded via utilities.road_network.load_road_network().
max_node_distance (float, default=10) – Maximum distance in meters for snapping trajectory points to road nodes. Points farther than this from any same-road node will skip corner interpolation. Typical values: - 10-15m: High-accuracy GPS (differential GPS, RTK) - 20-30m: Standard GPS (smartphone, consumer device) - 40-50m: Low-accuracy GPS or sparse road networks
time_col (str, default='time') – Name of the timestamp column in df. Must be parseable by pandas.to_datetime(). Supports various formats: ISO 8601, Unix timestamps, custom datetime strings.
lat_col (str, default='lat') – Name of the latitude column in df (WGS84 decimal degrees).
lon_col (str, default='lon') – Name of the longitude column in df (WGS84 decimal degrees).
- Returns:
Refined trajectory with interpolated corner points. Contains the same columns as the input (lat_col, lon_col, time_col) with: - Original trajectory points preserved - Corner points inserted at road transitions - Timestamps distributed proportionally to distance - Points sorted chronologically - Duplicates removed Output format: Timestamps as strings (‘%Y-%m-%d %H:%M:%S’)
- Return type:
pd.DataFrame
Examples
>>> import pandas as pd >>> import pyehicle as pye >>> >>> # Load GPS trajectory >>> df = pd.read_csv('gps_trajectory.csv') >>> print(f"Raw trajectory: {len(df)} points") >>> >>> # Load road network for the area >>> G = pye.utilities.road_network.load_road_network( ... 'city_roads.graphml' ... ) >>> >>> # Refine trajectory with road network continuity >>> refined = pye.reconstructing.refine_trajectory( ... df, ... road_network=G, ... max_node_distance=10, ... time_col='time', ... lat_col='lat', ... lon_col='lon' ... ) >>> >>> print(f"Refined trajectory: {len(refined)} points") >>> print(f"Added {len(refined) - len(df)} corner points") >>> >>> # Visualize before/after >>> pye.utilities.visualization.multiple( ... [df, refined], ... names=['Raw GPS', 'Refined'], ... show_in_browser=True ... )
>>> # Full preprocessing + reconstruction pipeline >>> import pyehicle as pye >>> >>> # 1. Load and preprocess >>> df = pd.read_csv('full_day_gps.csv') >>> compressed = pye.preprocessing.spatio_temporal_compress(df) >>> matched = pye.preprocessing.leuven(compressed, city='Riga', country='Latvia') >>> >>> # 2. Split by time gaps >>> segments = pye.preprocessing.by_time(matched, time_threshold=300) >>> >>> # 3. Combine segments spatially >>> combined = pye.reconstructing.trajectory_combiner(segments) >>> >>> # 4. Refine with road network (this function) >>> G = pye.utilities.road_network.load_road_network('riga_roads.graphml') >>> refined = pye.reconstructing.refine_trajectory(combined, G) >>> >>> # 5. Final curve interpolation >>> final = pye.reconstructing.curve_interpolation(refined, G) >>> >>> # Save result >>> final.to_csv('reconstructed_trajectory.csv', index=False)
Notes
Algorithm Details: The refinement process ensures that trajectories follow realistic road paths: 1. For each consecutive pair of points, check if they’re on the same road (OSM ID) 2. If on different roads → Road transition detected → Interpolate corner points 3. Corner interpolation handles two scenarios:
Adjacent roads with direct intersection → Single corner point
Non-adjacent roads → Path reconstruction along network
Timestamps distributed proportionally to distance (constant speed assumption)
Deduplication prevents duplicate coordinates and timestamps
Road Network Filtering: The function filters the road network to the trajectory’s bounding box (with 0.5° buffer) for performance. This dramatically speeds up spatial queries for large road networks.
Spatial Index: Uses R-tree spatial index for fast nearest-edge queries. Built automatically from the road network using preprocess_road_segments().
Duplicate Handling: - Input duplicates removed before processing (same lat/lon) - Output duplicates removed after interpolation - Prevents issues with zero-length edges and invalid geometries
Temporal Ordering: - Points sorted by timestamp before processing - Corner points assigned interpolated timestamps - Final output re-sorted to ensure chronological order
Performance: - Time complexity: O(n * m) where n = trajectory points, m = avg corner points per transition - For 1000-point trajectory with 50 road transitions: 5-30 seconds - Dominated by corner interpolation (shortest path computations) - Large road networks (>100k edges) benefit from bounding box filtering
Limitations: - Requires accurate OSM IDs from map-matching or spatial queries - May fail for disconnected road networks (e.g., ferry routes) - Assumes roads have valid geometry and OSM IDs in the graph - Corner interpolation may fail if points are too far from roads (>max_node_distance) - Does not consider turn restrictions, traffic rules, or one-way streets
Use Cases: - Preparing trajectories for curve interpolation (next step in reconstruction) - Converting raw GPS logs to road-matched trajectories - Traffic flow analysis and route reconstruction - Vehicle path simulation for autonomous driving research - Improving trajectory quality for visualization and analysis
Quality Assurance: The function is designed to be robust: - Handles missing geometries gracefully - Skips problematic segments instead of failing entirely - Preserves original points when corner interpolation fails - Maintains temporal continuity throughout
See also
trajectory_combinerFirst step - spatially sort trajectory segments
curve_interpolationThird step - interpolate smooth curves at corners
utilities.road_network.load_road_networkLoad OSM road network graph
preprocessing.leuvenHMM map-matching for OSM ID assignment
- pyehicle.reconstructing.trajectory_combiner(dataframes, lat_col='lat', lon_col='lon', time_col='time')[source]
Combine multiple topologically equivalent trajectories and sort them by geographical proximity.
This function takes multiple topologically equivalent trajectory DataFrames (e.g., different bus runs along the same route, multiple GPS recordings of the same path) and intelligently combines them into a single unified trajectory by sorting points based on nearest-neighbor distances. The algorithm starts from the first point and iteratively selects the closest unvisited point, creating a spatially continuous trajectory.
This is particularly useful for: - Combining multiple recordings of the same bus/vehicle route - Merging GPS traces from different trips along the same path - Reconstructing a canonical route from multiple trajectory samples - Preparing topologically equivalent trajectories for refinement
The function uses true geodesic distances (WGS84 ellipsoid) rather than Euclidean distances, ensuring accuracy across all latitudes and distances.
- Parameters:
dataframes (list of pd.DataFrame) – A list of DataFrames containing topologically equivalent trajectories (e.g., multiple GPS recordings of the same route). Each DataFrame must have latitude and longitude columns. The DataFrames can have different lengths and may contain additional columns (all column types will be preserved).
lat_col (str, default='lat') – The column name for latitude values (WGS84 decimal degrees).
lon_col (str, default='lon') – The column name for longitude values (WGS84 decimal degrees).
time_col (str, default='time') – The column name for timestamp values. Used for timestamp re-assignment based on cumulative distance traveled.
- Returns:
A single DataFrame containing the combined trajectory, sorted by time after spatial reordering and timestamp re-assignment. The output may contain fewer points than the input due to duplicate coordinate removal (points at identical lat/lon locations are deduplicated, keeping only the first occurrence). The index is reset (0, 1, 2, …, n-1). All original column types are preserved.
- Return type:
pd.DataFrame
Examples
>>> import pandas as pd >>> import pyehicle as pye >>> >>> # Combine multiple bus runs along the same route >>> run1 = pd.read_csv('bus_route_66_monday_morning.csv') >>> run2 = pd.read_csv('bus_route_66_monday_afternoon.csv') >>> run3 = pd.read_csv('bus_route_66_tuesday_morning.csv') >>> >>> # Combine all runs into a single canonical trajectory >>> combined = pye.reconstructing.trajectory_combiner( ... [run1, run2, run3], ... lat_col='lat', ... lon_col='lon', ... time_col='time' ... ) >>> print(f"Combined {len(run1) + len(run2) + len(run3)} points into {len(combined)} points") >>> # Note: Point count may be less due to duplicate coordinate removal
>>> # Combine multiple GPS recordings of the same delivery route >>> monday_route = pd.read_csv('delivery_monday.csv') >>> tuesday_route = pd.read_csv('delivery_tuesday.csv') >>> wednesday_route = pd.read_csv('delivery_wednesday.csv') >>> >>> # Create canonical route from multiple recordings >>> canonical_route = pye.reconstructing.trajectory_combiner( ... [monday_route, tuesday_route, wednesday_route] ... )
Notes
Algorithm:
The function implements a multi-step reconstruction algorithm:
Concatenation: All input DataFrames are concatenated into one
Spatial Sorting: Greedy nearest-neighbor algorithm: - Start from the first point (index 0) - Calculate geodesic distance from current point to all unvisited points - Select the nearest unvisited point as the next point - Mark the selected point as visited and repeat
Duplicate Removal: Remove points with identical (lat, lon) coordinates, keeping only the first occurrence. This eliminates redundant points from overlapping segments or stationary GPS readings.
Timestamp Re-assignment: Calculate cumulative distances along the spatially- sorted path and re-assign timestamps proportional to distance traveled (constant speed assumption), preserving the original time range.
Time Sorting: Sort by time to maintain the library’s assumption of time-ordered trajectories. Since timestamps are now monotonically increasing along the path, this preserves spatial continuity.
Performance: - Time complexity: O(n²) where n is the total number of points - Space complexity: O(n) for coordinate arrays and visited tracking - For 1000 points: ~1 second - For 10,000 points: ~100 seconds (consider preprocessing to reduce points first)
Accuracy: - Uses pyproj Geod.inv() for true geodesic distances on WGS84 ellipsoid - Accurate to millimeter precision for all distances - No projection distortion (unlike Euclidean distance in lat/lon)
Important Considerations:
Point Count Reduction: The output may have fewer points than the input due to duplicate coordinate removal. Points with identical (lat, lon) values are deduplicated, keeping only the first occurrence. This affects: * Loop trajectories (where paths cross themselves) * Stationary GPS readings (e.g., stopped at traffic lights) * Overlapping segments from multiple sources
Timestamp Modification: Original timestamps are discarded and replaced with new timestamps calculated from cumulative distances. The new timestamps: * Preserve the original time range (min to max) * Assume constant speed along the spatially-sorted path * Are monotonically increasing, ensuring time-ordered output
Order Dependency: The algorithm starts from the first point of the first DataFrame in the list. Different input orders may produce different results if segments overlap spatially.
Greedy Strategy: Uses nearest-neighbor heuristic, not globally optimal. For complex overlapping segments, may not produce the best possible ordering. However, this is fast and works well for most GPS trajectories.
Memory Efficiency: Coordinates are pre-extracted as numpy arrays for faster vectorized operations. Uses boolean array for visited tracking instead of a set for better iteration performance.
Column Compatibility: All DataFrames are concatenated, so column names and types must be compatible across all input DataFrames. All column types are preserved in the output.
Use Cases: - Route Canonicalization: Combine multiple GPS recordings of the same bus/vehicle
route to create a single canonical trajectory representative of that route
Multi-Trip Fusion: Merge GPS traces from different trips along the same path (e.g., daily commutes, delivery routes, public transit routes)
Reconstruction Pipeline: First step before refine_trajectory() and curve_interpolation() for final trajectory refinement
Limitations: - Not suitable for very large datasets (>50,000 points) without preprocessing - Assumes segments are generally spatially separated (not heavily overlapping) - Duplicate coordinates are removed, which may be undesirable for:
Loop trajectories where the path legitimately crosses itself
Scenarios requiring stationary points (vehicle stopped, waiting)
Accurate point-count preservation
Timestamps are synthetic (distance-based), losing original timing information
Constant speed assumption may not reflect actual vehicle behavior
O(n²) complexity makes it slow for large point counts
See also
refine_trajectorySecond step - enforces road network continuity
curve_interpolationFinal step - interpolates smooth curves
by_timeSplits trajectories by temporal gaps
Trajectory Combiner
- pyehicle.reconstructing.trajectory_combiner(dataframes, lat_col='lat', lon_col='lon', time_col='time')[source]
Combine multiple topologically equivalent trajectories and sort them by geographical proximity.
This function takes multiple topologically equivalent trajectory DataFrames (e.g., different bus runs along the same route, multiple GPS recordings of the same path) and intelligently combines them into a single unified trajectory by sorting points based on nearest-neighbor distances. The algorithm starts from the first point and iteratively selects the closest unvisited point, creating a spatially continuous trajectory.
This is particularly useful for: - Combining multiple recordings of the same bus/vehicle route - Merging GPS traces from different trips along the same path - Reconstructing a canonical route from multiple trajectory samples - Preparing topologically equivalent trajectories for refinement
The function uses true geodesic distances (WGS84 ellipsoid) rather than Euclidean distances, ensuring accuracy across all latitudes and distances.
- Parameters:
dataframes (list of pd.DataFrame) – A list of DataFrames containing topologically equivalent trajectories (e.g., multiple GPS recordings of the same route). Each DataFrame must have latitude and longitude columns. The DataFrames can have different lengths and may contain additional columns (all column types will be preserved).
lat_col (str, default='lat') – The column name for latitude values (WGS84 decimal degrees).
lon_col (str, default='lon') – The column name for longitude values (WGS84 decimal degrees).
time_col (str, default='time') – The column name for timestamp values. Used for timestamp re-assignment based on cumulative distance traveled.
- Returns:
A single DataFrame containing the combined trajectory, sorted by time after spatial reordering and timestamp re-assignment. The output may contain fewer points than the input due to duplicate coordinate removal (points at identical lat/lon locations are deduplicated, keeping only the first occurrence). The index is reset (0, 1, 2, …, n-1). All original column types are preserved.
- Return type:
pd.DataFrame
Examples
>>> import pandas as pd >>> import pyehicle as pye >>> >>> # Combine multiple bus runs along the same route >>> run1 = pd.read_csv('bus_route_66_monday_morning.csv') >>> run2 = pd.read_csv('bus_route_66_monday_afternoon.csv') >>> run3 = pd.read_csv('bus_route_66_tuesday_morning.csv') >>> >>> # Combine all runs into a single canonical trajectory >>> combined = pye.reconstructing.trajectory_combiner( ... [run1, run2, run3], ... lat_col='lat', ... lon_col='lon', ... time_col='time' ... ) >>> print(f"Combined {len(run1) + len(run2) + len(run3)} points into {len(combined)} points") >>> # Note: Point count may be less due to duplicate coordinate removal
>>> # Combine multiple GPS recordings of the same delivery route >>> monday_route = pd.read_csv('delivery_monday.csv') >>> tuesday_route = pd.read_csv('delivery_tuesday.csv') >>> wednesday_route = pd.read_csv('delivery_wednesday.csv') >>> >>> # Create canonical route from multiple recordings >>> canonical_route = pye.reconstructing.trajectory_combiner( ... [monday_route, tuesday_route, wednesday_route] ... )
Notes
Algorithm:
The function implements a multi-step reconstruction algorithm:
Concatenation: All input DataFrames are concatenated into one
Spatial Sorting: Greedy nearest-neighbor algorithm: - Start from the first point (index 0) - Calculate geodesic distance from current point to all unvisited points - Select the nearest unvisited point as the next point - Mark the selected point as visited and repeat
Duplicate Removal: Remove points with identical (lat, lon) coordinates, keeping only the first occurrence. This eliminates redundant points from overlapping segments or stationary GPS readings.
Timestamp Re-assignment: Calculate cumulative distances along the spatially- sorted path and re-assign timestamps proportional to distance traveled (constant speed assumption), preserving the original time range.
Time Sorting: Sort by time to maintain the library’s assumption of time-ordered trajectories. Since timestamps are now monotonically increasing along the path, this preserves spatial continuity.
Performance: - Time complexity: O(n²) where n is the total number of points - Space complexity: O(n) for coordinate arrays and visited tracking - For 1000 points: ~1 second - For 10,000 points: ~100 seconds (consider preprocessing to reduce points first)
Accuracy: - Uses pyproj Geod.inv() for true geodesic distances on WGS84 ellipsoid - Accurate to millimeter precision for all distances - No projection distortion (unlike Euclidean distance in lat/lon)
Important Considerations:
Point Count Reduction: The output may have fewer points than the input due to duplicate coordinate removal. Points with identical (lat, lon) values are deduplicated, keeping only the first occurrence. This affects: * Loop trajectories (where paths cross themselves) * Stationary GPS readings (e.g., stopped at traffic lights) * Overlapping segments from multiple sources
Timestamp Modification: Original timestamps are discarded and replaced with new timestamps calculated from cumulative distances. The new timestamps: * Preserve the original time range (min to max) * Assume constant speed along the spatially-sorted path * Are monotonically increasing, ensuring time-ordered output
Order Dependency: The algorithm starts from the first point of the first DataFrame in the list. Different input orders may produce different results if segments overlap spatially.
Greedy Strategy: Uses nearest-neighbor heuristic, not globally optimal. For complex overlapping segments, may not produce the best possible ordering. However, this is fast and works well for most GPS trajectories.
Memory Efficiency: Coordinates are pre-extracted as numpy arrays for faster vectorized operations. Uses boolean array for visited tracking instead of a set for better iteration performance.
Column Compatibility: All DataFrames are concatenated, so column names and types must be compatible across all input DataFrames. All column types are preserved in the output.
Use Cases: - Route Canonicalization: Combine multiple GPS recordings of the same bus/vehicle
route to create a single canonical trajectory representative of that route
Multi-Trip Fusion: Merge GPS traces from different trips along the same path (e.g., daily commutes, delivery routes, public transit routes)
Reconstruction Pipeline: First step before refine_trajectory() and curve_interpolation() for final trajectory refinement
Limitations: - Not suitable for very large datasets (>50,000 points) without preprocessing - Assumes segments are generally spatially separated (not heavily overlapping) - Duplicate coordinates are removed, which may be undesirable for:
Loop trajectories where the path legitimately crosses itself
Scenarios requiring stationary points (vehicle stopped, waiting)
Accurate point-count preservation
Timestamps are synthetic (distance-based), losing original timing information
Constant speed assumption may not reflect actual vehicle behavior
O(n²) complexity makes it slow for large point counts
See also
refine_trajectorySecond step - enforces road network continuity
curve_interpolationFinal step - interpolates smooth curves
by_timeSplits trajectories by temporal gaps
Trajectory Refinement
- pyehicle.reconstructing.refine_trajectory(df, road_network, max_node_distance=10, time_col='time', lat_col='lat', lon_col='lon')[source]
Refine GPS trajectory by enforcing spatial continuity with the underlying road network.
This is the main trajectory refinement function and the second step in reconstruction. It takes a raw or preprocessed GPS trajectory and “snaps” it to realistic road paths by detecting road transitions and interpolating corner points at intersections. The result is a refined trajectory that follows actual road geometry.
The function processes the trajectory sequentially, detecting when consecutive points lie on different roads (OSM ID changes) and filling gaps with corner points that follow the road network. This is essential for: - Converting GPS traces into road-matched trajectories - Preparing data for traffic analysis and route reconstruction - Generating realistic vehicle paths for simulation - Improving trajectory quality before curve interpolation
Algorithm Overview: 1. Preprocessing: Convert timestamps, remove duplicates, filter road network to bounding box 2. Initialization: Build spatial index, initialize refined trajectory 3. Sequential Processing: For each consecutive pair of trajectory points:
Find nearest road edges (OSM IDs) for both points
If OSM IDs differ (road transition): Interpolate corner points at intersection
If OSM IDs match (same road): Add current point directly
Finalization: Convert back to DataFrame with formatted timestamps
OSM ID Detection: The function uses OpenStreetMap way IDs to identify specific roads. When consecutive trajectory points have different OSM IDs, it indicates a road transition (e.g., turning at an intersection). The function then calls interpolate_corner_point() to fill the gap with realistic corner points.
- Parameters:
df (pd.DataFrame) – Input trajectory DataFrame with columns for latitude, longitude, and timestamp. Can be raw GPS data or preprocessed (e.g., after compression or map-matching). Minimum columns: lat_col, lon_col, time_col.
road_network (igraph.Graph) – Road network graph loaded from OSM data. Must have: - Node attributes: ‘x’ (longitude), ‘y’ (latitude) - Edge attributes: ‘osmid’ (OpenStreetMap way ID), ‘geometry’ (LineString) Typically loaded via utilities.road_network.load_road_network().
max_node_distance (float, default=10) – Maximum distance in meters for snapping trajectory points to road nodes. Points farther than this from any same-road node will skip corner interpolation. Typical values: - 10-15m: High-accuracy GPS (differential GPS, RTK) - 20-30m: Standard GPS (smartphone, consumer device) - 40-50m: Low-accuracy GPS or sparse road networks
time_col (str, default='time') – Name of the timestamp column in df. Must be parseable by pandas.to_datetime(). Supports various formats: ISO 8601, Unix timestamps, custom datetime strings.
lat_col (str, default='lat') – Name of the latitude column in df (WGS84 decimal degrees).
lon_col (str, default='lon') – Name of the longitude column in df (WGS84 decimal degrees).
- Returns:
Refined trajectory with interpolated corner points. Contains the same columns as the input (lat_col, lon_col, time_col) with: - Original trajectory points preserved - Corner points inserted at road transitions - Timestamps distributed proportionally to distance - Points sorted chronologically - Duplicates removed Output format: Timestamps as strings (‘%Y-%m-%d %H:%M:%S’)
- Return type:
pd.DataFrame
Examples
>>> import pandas as pd >>> import pyehicle as pye >>> >>> # Load GPS trajectory >>> df = pd.read_csv('gps_trajectory.csv') >>> print(f"Raw trajectory: {len(df)} points") >>> >>> # Load road network for the area >>> G = pye.utilities.road_network.load_road_network( ... 'city_roads.graphml' ... ) >>> >>> # Refine trajectory with road network continuity >>> refined = pye.reconstructing.refine_trajectory( ... df, ... road_network=G, ... max_node_distance=10, ... time_col='time', ... lat_col='lat', ... lon_col='lon' ... ) >>> >>> print(f"Refined trajectory: {len(refined)} points") >>> print(f"Added {len(refined) - len(df)} corner points") >>> >>> # Visualize before/after >>> pye.utilities.visualization.multiple( ... [df, refined], ... names=['Raw GPS', 'Refined'], ... show_in_browser=True ... )
>>> # Full preprocessing + reconstruction pipeline >>> import pyehicle as pye >>> >>> # 1. Load and preprocess >>> df = pd.read_csv('full_day_gps.csv') >>> compressed = pye.preprocessing.spatio_temporal_compress(df) >>> matched = pye.preprocessing.leuven(compressed, city='Riga', country='Latvia') >>> >>> # 2. Split by time gaps >>> segments = pye.preprocessing.by_time(matched, time_threshold=300) >>> >>> # 3. Combine segments spatially >>> combined = pye.reconstructing.trajectory_combiner(segments) >>> >>> # 4. Refine with road network (this function) >>> G = pye.utilities.road_network.load_road_network('riga_roads.graphml') >>> refined = pye.reconstructing.refine_trajectory(combined, G) >>> >>> # 5. Final curve interpolation >>> final = pye.reconstructing.curve_interpolation(refined, G) >>> >>> # Save result >>> final.to_csv('reconstructed_trajectory.csv', index=False)
Notes
Algorithm Details: The refinement process ensures that trajectories follow realistic road paths: 1. For each consecutive pair of points, check if they’re on the same road (OSM ID) 2. If on different roads → Road transition detected → Interpolate corner points 3. Corner interpolation handles two scenarios:
Adjacent roads with direct intersection → Single corner point
Non-adjacent roads → Path reconstruction along network
Timestamps distributed proportionally to distance (constant speed assumption)
Deduplication prevents duplicate coordinates and timestamps
Road Network Filtering: The function filters the road network to the trajectory’s bounding box (with 0.5° buffer) for performance. This dramatically speeds up spatial queries for large road networks.
Spatial Index: Uses R-tree spatial index for fast nearest-edge queries. Built automatically from the road network using preprocess_road_segments().
Duplicate Handling: - Input duplicates removed before processing (same lat/lon) - Output duplicates removed after interpolation - Prevents issues with zero-length edges and invalid geometries
Temporal Ordering: - Points sorted by timestamp before processing - Corner points assigned interpolated timestamps - Final output re-sorted to ensure chronological order
Performance: - Time complexity: O(n * m) where n = trajectory points, m = avg corner points per transition - For 1000-point trajectory with 50 road transitions: 5-30 seconds - Dominated by corner interpolation (shortest path computations) - Large road networks (>100k edges) benefit from bounding box filtering
Limitations: - Requires accurate OSM IDs from map-matching or spatial queries - May fail for disconnected road networks (e.g., ferry routes) - Assumes roads have valid geometry and OSM IDs in the graph - Corner interpolation may fail if points are too far from roads (>max_node_distance) - Does not consider turn restrictions, traffic rules, or one-way streets
Use Cases: - Preparing trajectories for curve interpolation (next step in reconstruction) - Converting raw GPS logs to road-matched trajectories - Traffic flow analysis and route reconstruction - Vehicle path simulation for autonomous driving research - Improving trajectory quality for visualization and analysis
Quality Assurance: The function is designed to be robust: - Handles missing geometries gracefully - Skips problematic segments instead of failing entirely - Preserves original points when corner interpolation fails - Maintains temporal continuity throughout
See also
trajectory_combinerFirst step - spatially sort trajectory segments
curve_interpolationThird step - interpolate smooth curves at corners
utilities.road_network.load_road_networkLoad OSM road network graph
preprocessing.leuvenHMM map-matching for OSM ID assignment
Curve Interpolation
- pyehicle.reconstructing.curve_interpolation(df, road_network, lower_threshold=20, upper_threshold=80, max_node_distance=10, time_col='time', lat_col='lat', lon_col='lon')[source]
Improve trajectory realism by interpolating points along road curves based on bearing changes.
This is the final step in trajectory reconstruction. It takes a refined trajectory and adds intermediate points at locations where the vehicle direction changes (curves, turns, corners). By detecting bearing changes and filling in the actual road geometry, it creates high-fidelity trajectories that closely follow real-world road paths.
The function analyzes bearing (direction) changes between consecutive points. When a bearing change falls within the specified threshold range, it indicates a curve that would benefit from additional points. The function then: 1. Finds the road network path between the curve endpoints 2. Interpolates multiple points along that path 3. Assigns timestamps proportionally to distance traveled
This is essential for: - Visualizing realistic vehicle paths (trajectories that look like actual driving) - Traffic analysis requiring accurate road following (lane-level positioning) - Route validation and comparison (matching GPS traces to expected routes) - Simulation and animation (smooth, realistic vehicle movement)
Algorithm Overview: 1. Bearing Analysis: Calculate bearing (direction) between each consecutive point pair 2. Bearing Change Detection: Compute absolute change in bearing (accounts for 360° wrap) 3. Curve Identification: Flag points where bearing change ∈ [lower_threshold, upper_threshold] 4. Path Interpolation: For each curve, find road network path and interpolate points 5. Temporal Distribution: Assign timestamps proportional to distance along interpolated path
Why Bearing Thresholds? - Too small (<20°): Captures noise and minor GPS drift, not real curves - Sweet spot (20-80°): Gradual turns, highway exits, curved roads, roundabouts - Too large (>80°): Sharp 90° turns already handled by refine_trajectory()
- Parameters:
df (pd.DataFrame) – Input trajectory DataFrame with columns for latitude, longitude, and timestamp. Should be output from refine_trajectory() for best results. Minimum columns: lat_col, lon_col, time_col.
road_network (igraph.Graph) – Road network graph loaded from OSM data. Must have: - Node attributes: ‘x’ (longitude), ‘y’ (latitude) - Edge attributes: ‘osmid’ (OpenStreetMap way ID), ‘geometry’ (LineString) Typically loaded via utilities.road_network.load_road_network().
lower_threshold (float, default=20) – Minimum bearing change in degrees to trigger curve interpolation. Bearing changes smaller than this are considered straight segments (no interpolation needed). Typical values: - 10°: Very sensitive, captures subtle curves (may add noise) - 20°: Recommended default, captures most real curves - 30°: Less sensitive, only significant curves
upper_threshold (float, default=80) – Maximum bearing change in degrees to trigger curve interpolation. Bearing changes larger than this are considered sharp turns (already handled by refine_trajectory() or too abrupt for curve interpolation). Typical values: - 60°: Captures gradual to moderate curves - 80°: Recommended default, avoids sharp 90° turns - 90°: Includes sharper turns (may overlap with refinement)
max_node_distance (float, default=10) – Maximum distance in meters for snapping trajectory points to road nodes during interpolation. Points farther than this will skip interpolation. Typical values: - 10-15m: High-accuracy GPS (recommended) - 20-30m: Standard GPS accuracy - 40-50m: Low-accuracy GPS or sparse road networks
time_col (str, default='time') – Name of the timestamp column in df. Can be Unix timestamps (numeric) or formatted datetime strings. Supports various formats.
lat_col (str, default='lat') – Name of the latitude column in df (WGS84 decimal degrees).
lon_col (str, default='lon') – Name of the longitude column in df (WGS84 decimal degrees).
- Returns:
Enhanced trajectory with interpolated points at curves. Contains the same columns as the input (lat_col, lon_col, time_col) with: - Original trajectory points preserved - Intermediate points added at detected curves - Timestamps distributed proportionally to distance - Points sorted chronologically - Duplicates removed Output format: Timestamps as strings (‘%Y-%m-%d %H:%M:%S’)
- Return type:
pd.DataFrame
Examples
>>> import pandas as pd >>> import pyehicle as pye >>> >>> # Load refined trajectory (output from refine_trajectory) >>> df = pd.read_csv('refined_trajectory.csv') >>> print(f"Refined trajectory: {len(df)} points") >>> >>> # Load road network >>> G = pye.utilities.road_network.load_road_network('city_roads.graphml') >>> >>> # Interpolate curves with default thresholds (20-80°) >>> final = pye.reconstructing.curve_interpolation( ... df, ... road_network=G, ... lower_threshold=20, ... upper_threshold=80, ... max_node_distance=10 ... ) >>> >>> print(f"Final trajectory: {len(final)} points") >>> print(f"Added {len(final) - len(df)} points at curves") >>> >>> # Visualize before/after >>> pye.utilities.visualization.multiple( ... [df, final], ... names=['Refined', 'With Curves'], ... show_in_browser=True ... )
>>> # Full reconstruction pipeline (all steps) >>> import pyehicle as pye >>> >>> # 1. Preprocessing >>> df = pd.read_csv('raw_gps.csv') >>> compressed = pye.preprocessing.spatio_temporal_compress(df) >>> matched = pye.preprocessing.leuven(compressed, city='Riga', country='Latvia') >>> >>> # 2. Segmentation and combination >>> segments = pye.preprocessing.by_time(matched, time_threshold=300) >>> combined = pye.reconstructing.trajectory_combiner(segments) >>> >>> # 3. Refinement (road network continuity) >>> G = pye.utilities.road_network.load_road_network('riga_roads.graphml') >>> refined = pye.reconstructing.refine_trajectory(combined, G, max_node_distance=10) >>> >>> # 4. Curve interpolation (THIS FUNCTION - final step) >>> final = pye.reconstructing.curve_interpolation( ... refined, G, ... lower_threshold=20, ... upper_threshold=80, ... max_node_distance=10 ... ) >>> >>> # Save final high-fidelity trajectory >>> final.to_csv('reconstructed_trajectory.csv', index=False) >>> print(f"✓ Reconstruction complete: {len(final)} points")
Notes
Algorithm Details: 1. Calculate bearings between consecutive points using geodesic forward azimuth 2. Compute bearing changes: abs(bearing[i] - bearing[i-1]), handling 360° wrap-around 3. For each point where lower_threshold ≤ bearing_change ≤ upper_threshold:
Find shortest path through road network from previous point to current point
Reconstruct LineString geometry from the path
Calculate cumulative distances along the path
Interpolate timestamps proportionally: t = t1 + (d/total_d) * (t2-t1)
Replace the two original points with the interpolated path points
Deduplicate coordinates and sort by timestamp
Bearing Change Calculation: Bearing changes handle the circular nature of angles (360° = 0°):
`python change = abs(bearing2 - bearing1) change = change if change <= 180 else 360 - change `This ensures: 350° → 10° = 20° change (not 340°)Temporal Interpolation: Timestamps are distributed proportionally to distance along the interpolated path, assuming constant speed: - t[i] = t1 + (cumulative_distance[i] / total_distance) * (t2 - t1) - This maintains realistic timing and ensures chronological order
Road Network Filtering: The function filters the road network to the trajectory’s bounding box (0.5° buffer) for performance, similar to refine_trajectory().
Performance: - Time complexity: O(n * m * k) where:
n = trajectory points
m = avg points per interpolated curve
k = avg edges in shortest path
For 1000-point trajectory with 20 curves: 10-60 seconds
Dominated by shortest path computation (Dijkstra’s algorithm)
Quality Tuning: - More aggressive (lower_threshold=5, upper_threshold=60):
Captures more curves
More interpolated points
Smoother trajectories
Slower processing
Risk of over-interpolation
Less aggressive (lower_threshold=15, upper_threshold=40): - Captures only significant curves - Fewer interpolated points - Faster processing - May miss subtle curves
Limitations: - Requires road network with valid geometry and OSM IDs - May fail for disconnected roads or missing road segments - Does not handle U-turns well (>180° bearing changes) - Assumes GPS points are reasonably map-matched (best after refine_trajectory()) - Does not consider speed, acceleration, or vehicle dynamics
Use Cases: - Final step in trajectory reconstruction pipeline - Preparing trajectories for visualization and presentation - Generating realistic vehicle paths for traffic simulation - Route validation and GPS trace comparison - Creating training data for machine learning (realistic vehicle behavior)
When NOT to Use: - Raw GPS data (use refine_trajectory() first) - Trajectories with very high sampling rate (already smooth) - Applications not requiring curve realism (simple analysis)
See also
trajectory_combinerFirst step - combine and sort trajectory segments
refine_trajectorySecond step - enforce road network continuity
calculate_bearings_pyprojCalculate trajectory bearings for curve detection
utilities.road_network.load_road_networkLoad OSM road network graph
Utilities Module
Utilities module for the pyehicle library.
This module provides utility functions for road network management, trajectory evaluation metrics, and visualization tools.
- pyehicle.utilities.build_igraph_graph(nodes, edges)[source]
Build igraph.Graph from OpenStreetMap nodes and edges DataFrames.
This function converts raw OSM network data (from pyrosm) into an igraph.Graph representation with spatial attributes. It handles node ID factorization, edge validation, geometry processing, and geodesic length calculation.
The resulting graph is undirected with geographic coordinates stored as vertex attributes and road geometries as edge attributes, ready for spatial analysis and routing operations.
- Parameters:
nodes (pd.DataFrame) – OSM nodes DataFrame from pyrosm.get_network() with columns: - ‘id’: OSM node IDs (int64) - ‘lat’: Latitude in WGS84 degrees (float64) - ‘lon’: Longitude in WGS84 degrees (float64)
edges (pd.DataFrame) – OSM edges DataFrame from pyrosm.get_network() with columns: - ‘id’: OSM way IDs (int64) - ‘u’: Source node ID (int64) - ‘v’: Target node ID (int64) - ‘geometry’: Shapely LineString geometries
- Returns:
Undirected road network graph with attributes: - Vertex attributes:
’node_id’: Original OSM node IDs
’x’: Longitude (WGS84 degrees)
’y’: Latitude (WGS84 degrees)
Edge attributes: - ‘osmid’: OSM way IDs - ‘length’: Geodesic edge length in meters (WGS84) - ‘geometry’: Shapely LineString geometries
- Return type:
igraph.Graph
Notes
Algorithm Steps:
Node Factorization: Convert arbitrary OSM node IDs to sequential 0-based indices required by igraph. Uses pandas.factorize() for efficiency.
Edge Index Mapping: Map edge endpoints (u, v) from OSM IDs to factorized indices using the node ID→index mapping from step 1.
Edge Validation: Remove edges with missing endpoints or indices out of bounds (handles disconnected components and data quality issues).
Graph Construction: Create igraph.Graph with edge list. Graph is undirected as road networks are typically bidirectional.
Geodesic Length Calculation: Compute true great-circle distances for all edges using pyproj Geod.inv() on WGS84 ellipsoid.
Geometry Processing: Ensure all edge geometries are LineStrings. Convert non-LineString geometries (e.g., Points) to simple two-point lines.
Performance: - Time complexity: O(n + m) where n = nodes, m = edges - Factorization: O(n) with pandas - Geodesic calculations: O(m) vectorized with pyproj - Fast even for large networks (100k+ edges in seconds)
Edge Cases: - Edges referencing non-existent nodes are removed - Zero-length edges are kept but have length = 0 - Non-LineString geometries are converted to straight lines - Handles duplicate edges (igraph allows multiple edges)
See also
load_road_networkMain function that calls this builder
preprocess_road_segmentsCreates R-tree index from graph
- pyehicle.utilities.clear_overpass_cache()[source]
Delete the on-disk Overpass cache directory (overpass_cache/).
- pyehicle.utilities.f1(df_matched: DataFrame | DataFrame, df_true: DataFrame | DataFrame, lat_col: str = 'lat', lon_col: str = 'lon', delete_cache: bool = True, tile_max_deg: float = 0.5, tile_overlap_deg: float = 0.05, cache_overpass: bool = True, aeqd_precision: int = 6) float[source]
- pyehicle.utilities.filter_road_network_by_bbox(road_network, matched, buffer=0.01)[source]
Filter road network to bounding box around trajectory for performance optimization.
This function creates a spatial subset of the road network by extracting only the nodes and edges within a bounding box around a GPS trajectory. This significantly reduces memory usage and improves computational performance for graph algorithms when working with large regional or national road networks.
Use this function before running computationally expensive operations like shortest path routing or spatial analysis when you only need roads near a specific trajectory.
- Parameters:
road_network (igraph.Graph) – Full road network graph from load_road_network(). Must have vertex attributes ‘x’ (longitude) and ‘y’ (latitude).
matched (pd.DataFrame) – GPS trajectory DataFrame with columns ‘lat’ and ‘lon’. Used to determine the bounding box for filtering.
buffer (float, default=0.01) – Buffer distance to extend bounding box in degrees (~1.1 km at equator). Ensures roads slightly outside trajectory extent are included. Typical values: - 0.005 (~550m): Tight fit for dense urban trajectories - 0.01 (~1.1km): Standard buffer (default) - 0.02 (~2.2km): Generous buffer for sparse trajectories
- Returns:
Filtered subgraph containing only nodes within the bounding box. Edges are preserved if both endpoints are within the bounding box. All vertex and edge attributes are preserved.
- Return type:
igraph.Graph
Examples
>>> import pandas as pd >>> import pyehicle as pye >>> >>> # Load full country-wide road network (large, slow operations) >>> full_network, _, _ = pye.utilities.road_network.load_road_network( ... 'latvia-latest.osm.pbf', ... save_path='latvia_full.graphml' ... ) >>> print(f"Full network: {len(full_network.vs)} nodes, {len(full_network.es)} edges") >>> >>> # Load trajectory (small region within Latvia) >>> trajectory = pd.read_csv('riga_trajectory.csv') >>> >>> # Filter to relevant subgraph around trajectory >>> filtered_network = pye.utilities.road_network.filter_road_network_by_bbox( ... full_network, ... trajectory, ... buffer=0.01 ... ) >>> print(f"Filtered: {len(filtered_network.vs)} nodes, {len(filtered_network.es)} edges") >>> # Output: Filtered: 8234 nodes, 11567 edges (much smaller!) >>> >>> # Now routing operations are much faster >>> # Use filtered_network for refine_trajectory() or curve_interpolation()
Notes
When to Use:
Large Networks: When road network covers region larger than trajectory (e.g., country-wide network for city trajectory)
Multiple Trajectories: Filter once per trajectory region for efficiency
Routing Operations: Before Dijkstra shortest paths or other graph algorithms
Memory Constraints: When full network exceeds available memory
Performance:
Filtering is O(n) where n = number of nodes
Typical speedup: 10-100x for routing operations on filtered network
Memory savings: Proportional to filtered area (90%+ reduction possible)
Buffer Size Guidelines:
Buffer is in WGS84 degrees (latitude/longitude)
1 degree ≈ 111 km at equator (less at higher latitudes)
0.01 degrees ≈ 1.1 km at equator
Increase buffer if trajectory has gaps or uses routing
Too small: May miss important connecting roads
Too large: Minimal performance benefit
Subgraph Properties:
Uses igraph.induced_subgraph() which preserves all attributes
Vertices are renumbered 0, 1, 2, … in filtered graph
Original ‘node_id’ attribute preserves OSM IDs
Edges preserved only if both endpoints are in bounding box
Graph connectivity may be different (disconnected components possible)
Limitations:
Simple rectangular bounding box (not convex hull or buffer polygon)
Buffer is uniform in degrees (different physical distance at different latitudes)
Does not account for road network connectivity (may cut through roads)
For precise spatial filtering, consider using Shapely buffer operations
See also
load_road_networkLoad full road network
refine_trajectoryUses filtered networks for efficiency
curve_interpolationBenefits from network filtering
- pyehicle.utilities.length_index(df_matched: DataFrame | DataFrame, df_original: DataFrame | DataFrame, lat_col: str = 'lat', lon_col: str = 'lon') float[source]
- pyehicle.utilities.load_road_network(pbf_file_path, bbox=None, save_path=None)[source]
Load road network from OpenStreetMap PBF file or cached GraphML format.
This function loads a driveable road network from OSM data, builds an igraph.Graph representation with spatial attributes, and creates an R-tree spatial index for fast nearest-neighbor queries. It supports caching via GraphML for faster subsequent loads.
The road network is extracted using pyrosm (network_type=”driving”) which includes: - Motorways, trunk roads, primary/secondary/tertiary roads - Residential and service roads - Excludes footways, cycleways, pedestrian paths
- Parameters:
pbf_file_path (str) – Path to OpenStreetMap PBF file (e.g., ‘city-latest.osm.pbf’). PBF files can be downloaded from: - Geofabrik: https://download.geofabrik.de/ - BBBike: https://extract.bbbike.org/ - OpenStreetMap: https://planet.openstreetmap.org/
bbox (tuple of float, optional) – Bounding box to filter the network: (north, south, east, west) in WGS84 degrees. If None, entire PBF file is loaded. Use bbox for large files to reduce memory. Example: (56.98, 56.92, 24.15, 24.05) for part of Riga, Latvia.
save_path (str, optional) – Path to save/load cached GraphML file (e.g., ‘city_roads.graphml’). If save_path exists: loads from cache (fast) If save_path doesn’t exist: loads from PBF (slow), but doesn’t save Caching significantly speeds up subsequent loads (seconds vs minutes).
- Returns:
road_network (igraph.Graph) – Road network graph with attributes: - Vertex (node) attributes:
’node_id’: Original OSM node ID
’x’: Longitude (WGS84 degrees)
’y’: Latitude (WGS84 degrees)
Edge attributes: - ‘osmid’: OSM way ID - ‘length’: Edge length in meters (geodesic distance) - ‘geometry’: Shapely LineString geometry
segment_geometries (np.ndarray) – Array of Shapely LineString geometries for all edges. Same order as road_network.es.
spatial_index (rtree.index.Index) – R-tree spatial index for fast nearest-edge queries. Indexed by edge bounding boxes. Use: nearest_edge = next(spatial_index.nearest((lon, lat, lon, lat), 1))
Examples
>>> import pyehicle as pye >>> >>> # Load road network from PBF file (first time - slow) >>> G, geometries, spatial_index = pye.utilities.road_network.load_road_network( ... 'riga-latest.osm.pbf', ... save_path='riga_roads.graphml' ... ) >>> print(f"Loaded {len(G.vs)} nodes and {len(G.es)} edges") >>> >>> # Load from cache (subsequent times - fast) >>> G, geometries, spatial_index = pye.utilities.road_network.load_road_network( ... 'riga-latest.osm.pbf', # Not used if cache exists ... save_path='riga_roads.graphml' # Loads from here ... ) >>> >>> # Load subset using bounding box >>> bbox = (56.98, 56.92, 24.15, 24.05) # (north, south, east, west) >>> G, _, _ = pye.utilities.road_network.load_road_network( ... 'latvia-latest.osm.pbf', ... bbox=bbox ... )
Notes
Data Source: OSM PBF Files - PBF (Protocolbuffer Binary Format) = compressed OSM data - Much smaller than XML (10x compression) - Fast to parse with pyrosm - Updated regularly on Geofabrik (daily for most regions)
Network Extraction: - Uses pyrosm with network_type=”driving” - Extracts only driveable roads (excludes pedestrian, cycling paths) - Preserves road geometry (LineStrings with multiple points) - Calculates geodesic lengths using pyproj Geod
Caching with GraphML: - GraphML = XML-based graph format supported by igraph - Saves graph structure + all attributes - Geometries stored as WKT strings (Well-Known Text) - Loading from GraphML is 50-100x faster than parsing PBF - Recommended workflow: Load from PBF once, then use cache
Spatial Index (R-tree): - Built automatically by preprocess_road_segments() - Indexes edge bounding boxes for O(log n) nearest-edge queries - Essential for trajectory refinement and map-matching - Fast nearest-neighbor: ~0.01ms per query
Performance: - Loading from PBF: 1-10 minutes depending on file size - Loading from GraphML: 5-30 seconds - Memory: ~100 MB per 100k edges - Spatial index build: ~1-5 seconds
Use Cases: - Trajectory reconstruction: refine_trajectory(), curve_interpolation() - Map-matching: Finding nearest road segments - Routing: Shortest path computation with Dijkstra - Spatial analysis: Road network statistics
See also
build_igraph_graphBuild igraph from node/edge DataFrames
preprocess_road_segmentsCreate R-tree spatial index
filter_road_network_by_bboxFilter network by bounding box
- pyehicle.utilities.optimized_lengths(df_matched: DataFrame | DataFrame, df_true: DataFrame | DataFrame, lat_col: str = 'lat', lon_col: str = 'lon', delete_cache: bool = True, tile_max_deg: float = 0.5, tile_overlap_deg: float = 0.05, cache_overpass: bool = True, aeqd_precision: int = 6) Tuple[float, float, float][source]
- pyehicle.utilities.precision(df_matched: DataFrame | DataFrame, df_true: DataFrame | DataFrame, lat_col: str = 'lat', lon_col: str = 'lon', delete_cache: bool = True, tile_max_deg: float = 0.5, tile_overlap_deg: float = 0.05, cache_overpass: bool = True, aeqd_precision: int = 6) float[source]
- pyehicle.utilities.preprocess_road_segments(road_network)[source]
Build R-tree spatial index for fast nearest-neighbor queries on road segments.
This function creates an R-tree spatial index from road network edge geometries, enabling O(log n) nearest-neighbor queries essential for trajectory reconstruction and map-matching. The R-tree indexes bounding boxes of all road segments for efficient spatial searches.
R-trees are hierarchical data structures that organize spatial objects by their bounding rectangles, providing logarithmic query time for nearest-neighbor and intersection searches. This is critical for performance when matching thousands of GPS points to a road network with tens of thousands of edges.
- Parameters:
road_network (igraph.Graph) – Road network graph from build_igraph_graph() or load_road_network(). Must have edge attribute ‘geometry’ containing Shapely LineString objects.
- Returns:
segment_geometries (np.ndarray) – Array of Shapely LineString geometries for all edges. Length = len(road_network.es). Order matches road_network.es (edge sequence). Can be indexed by edge ID.
spatial_index (rtree.index.Index) – R-tree spatial index for fast nearest-neighbor queries. Indexed by edge IDs. Usage: nearest_edge_id = next(spatial_index.nearest((lon, lat, lon, lat), 1))
Examples
>>> import pyehicle as pye >>> >>> # Load road network >>> G, geometries, spatial_index = pye.utilities.road_network.load_road_network( ... 'city-latest.osm.pbf', ... save_path='city_roads.graphml' ... ) >>> >>> # Find nearest road segment to a GPS point >>> gps_lon, gps_lat = 24.1055, 56.9496 # Riga, Latvia >>> nearest_edge_id = next(spatial_index.nearest((gps_lon, gps_lat, gps_lon, gps_lat), 1)) >>> nearest_geometry = geometries[nearest_edge_id] >>> print(f"Nearest road: OSM way {G.es[nearest_edge_id]['osmid']}") >>> >>> # Find 5 nearest road segments >>> k_nearest = list(spatial_index.nearest((gps_lon, gps_lat, gps_lon, gps_lat), 5)) >>> for edge_id in k_nearest: ... print(f"Edge {edge_id}: length {G.es[edge_id]['length']:.1f}m")
Notes
R-tree Index Structure:
Bounding Boxes: Each road segment is indexed by its minimum bounding rectangle (minx, miny, maxx, maxy).
Hierarchical Structure: R-tree organizes boxes into a tree, grouping nearby segments together at each level.
Query Performance: O(log n) for nearest-neighbor, much faster than brute-force O(n) distance calculations.
Memory Efficiency: Index size scales linearly with number of edges.
Use Cases:
Trajectory Refinement: refine_trajectory() uses this to find nearest roads when transitioning between different OSM ways.
Map-Matching: Fast nearest-road lookups for aligning GPS points.
Curve Interpolation: curve_interpolation() uses this to find road segments for path routing.
Spatial Queries: General-purpose nearest-neighbor searches.
Performance:
Index construction: O(n log n) where n = number of edges
Build time: ~1-5 seconds for 100k edges
Query time: ~0.01ms per nearest-neighbor search
Memory: ~100 bytes per edge (index overhead)
Spatial Index API:
nearest(bbox, k): Returns k nearest edge IDs to bounding box - bbox format: (minx, miny, maxx, maxy) - For point queries: (lon, lat, lon, lat) (same coords twice)
Returns generator, use next() or list() to retrieve results
Results are edge IDs matching road_network.es indices
See also
load_road_networkMain function that calls this preprocessor
build_igraph_graphBuilds graph with geometries
refine_trajectoryUses spatial index for trajectory refinement
- pyehicle.utilities.recall(df_matched: DataFrame | DataFrame, df_true: DataFrame | DataFrame, lat_col: str = 'lat', lon_col: str = 'lon', delete_cache: bool = True, tile_max_deg: float = 0.5, tile_overlap_deg: float = 0.05, cache_overpass: bool = True, aeqd_precision: int = 6) float[source]
- pyehicle.utilities.rmf(df_matched: DataFrame | DataFrame, df_true: DataFrame | DataFrame, lat_col: str = 'lat', lon_col: str = 'lon', delete_cache: bool = True, tile_max_deg: float = 0.5, tile_overlap_deg: float = 0.05, cache_overpass: bool = True, aeqd_precision: int = 6) float[source]
Road Network
- pyehicle.utilities.load_road_network(pbf_file_path, bbox=None, save_path=None)[source]
Load road network from OpenStreetMap PBF file or cached GraphML format.
This function loads a driveable road network from OSM data, builds an igraph.Graph representation with spatial attributes, and creates an R-tree spatial index for fast nearest-neighbor queries. It supports caching via GraphML for faster subsequent loads.
The road network is extracted using pyrosm (network_type=”driving”) which includes: - Motorways, trunk roads, primary/secondary/tertiary roads - Residential and service roads - Excludes footways, cycleways, pedestrian paths
- Parameters:
pbf_file_path (str) – Path to OpenStreetMap PBF file (e.g., ‘city-latest.osm.pbf’). PBF files can be downloaded from: - Geofabrik: https://download.geofabrik.de/ - BBBike: https://extract.bbbike.org/ - OpenStreetMap: https://planet.openstreetmap.org/
bbox (tuple of float, optional) – Bounding box to filter the network: (north, south, east, west) in WGS84 degrees. If None, entire PBF file is loaded. Use bbox for large files to reduce memory. Example: (56.98, 56.92, 24.15, 24.05) for part of Riga, Latvia.
save_path (str, optional) – Path to save/load cached GraphML file (e.g., ‘city_roads.graphml’). If save_path exists: loads from cache (fast) If save_path doesn’t exist: loads from PBF (slow), but doesn’t save Caching significantly speeds up subsequent loads (seconds vs minutes).
- Returns:
road_network (igraph.Graph) – Road network graph with attributes: - Vertex (node) attributes:
’node_id’: Original OSM node ID
’x’: Longitude (WGS84 degrees)
’y’: Latitude (WGS84 degrees)
Edge attributes: - ‘osmid’: OSM way ID - ‘length’: Edge length in meters (geodesic distance) - ‘geometry’: Shapely LineString geometry
segment_geometries (np.ndarray) – Array of Shapely LineString geometries for all edges. Same order as road_network.es.
spatial_index (rtree.index.Index) – R-tree spatial index for fast nearest-edge queries. Indexed by edge bounding boxes. Use: nearest_edge = next(spatial_index.nearest((lon, lat, lon, lat), 1))
Examples
>>> import pyehicle as pye >>> >>> # Load road network from PBF file (first time - slow) >>> G, geometries, spatial_index = pye.utilities.road_network.load_road_network( ... 'riga-latest.osm.pbf', ... save_path='riga_roads.graphml' ... ) >>> print(f"Loaded {len(G.vs)} nodes and {len(G.es)} edges") >>> >>> # Load from cache (subsequent times - fast) >>> G, geometries, spatial_index = pye.utilities.road_network.load_road_network( ... 'riga-latest.osm.pbf', # Not used if cache exists ... save_path='riga_roads.graphml' # Loads from here ... ) >>> >>> # Load subset using bounding box >>> bbox = (56.98, 56.92, 24.15, 24.05) # (north, south, east, west) >>> G, _, _ = pye.utilities.road_network.load_road_network( ... 'latvia-latest.osm.pbf', ... bbox=bbox ... )
Notes
Data Source: OSM PBF Files - PBF (Protocolbuffer Binary Format) = compressed OSM data - Much smaller than XML (10x compression) - Fast to parse with pyrosm - Updated regularly on Geofabrik (daily for most regions)
Network Extraction: - Uses pyrosm with network_type=”driving” - Extracts only driveable roads (excludes pedestrian, cycling paths) - Preserves road geometry (LineStrings with multiple points) - Calculates geodesic lengths using pyproj Geod
Caching with GraphML: - GraphML = XML-based graph format supported by igraph - Saves graph structure + all attributes - Geometries stored as WKT strings (Well-Known Text) - Loading from GraphML is 50-100x faster than parsing PBF - Recommended workflow: Load from PBF once, then use cache
Spatial Index (R-tree): - Built automatically by preprocess_road_segments() - Indexes edge bounding boxes for O(log n) nearest-edge queries - Essential for trajectory refinement and map-matching - Fast nearest-neighbor: ~0.01ms per query
Performance: - Loading from PBF: 1-10 minutes depending on file size - Loading from GraphML: 5-30 seconds - Memory: ~100 MB per 100k edges - Spatial index build: ~1-5 seconds
Use Cases: - Trajectory reconstruction: refine_trajectory(), curve_interpolation() - Map-matching: Finding nearest road segments - Routing: Shortest path computation with Dijkstra - Spatial analysis: Road network statistics
See also
build_igraph_graphBuild igraph from node/edge DataFrames
preprocess_road_segmentsCreate R-tree spatial index
filter_road_network_by_bboxFilter network by bounding box
Evaluation
- pyehicle.utilities.f1(df_matched: DataFrame | DataFrame, df_true: DataFrame | DataFrame, lat_col: str = 'lat', lon_col: str = 'lon', delete_cache: bool = True, tile_max_deg: float = 0.5, tile_overlap_deg: float = 0.05, cache_overpass: bool = True, aeqd_precision: int = 6) float[source]
- pyehicle.utilities.precision(df_matched: DataFrame | DataFrame, df_true: DataFrame | DataFrame, lat_col: str = 'lat', lon_col: str = 'lon', delete_cache: bool = True, tile_max_deg: float = 0.5, tile_overlap_deg: float = 0.05, cache_overpass: bool = True, aeqd_precision: int = 6) float[source]
Visualization
- pyehicle.utilities.visualization.single(df: DataFrame | DataFrame, name: str = None, return_map: bool = False, show_legend: bool = True, legend_name: str = 'Color by Trajectory', show_in_browser: bool = True, lat_col: str = 'lat', lon_col: str = 'lon', tiles: str = 'OpenStreetMap')[source]
Plot a single trajectory on an interactive Folium map.
This function creates an interactive map visualization of a GPS trajectory with customizable appearance and legend. The map is centered on the trajectory’s centroid.
- Parameters:
df (pd.DataFrame or pl.DataFrame) – The trajectory data to plot with latitude and longitude columns.
name (str, optional) – The name of the trajectory to display in the legend. Defaults to “Trajectory”.
return_map (bool, default=False) – If True, returns the folium.Map object instead of displaying it.
show_legend (bool, default=True) – If True, displays a draggable legend on the map.
legend_name (str, default='Color by Trajectory') – The title of the legend box.
show_in_browser (bool, default=True) – If True, opens the map in the default web browser.
lat_col (str, default='lat') – The column name for latitude values.
lon_col (str, default='lon') – The column name for longitude values.
tiles (str, default="OpenStreetMap") – The map tile style. Options include: “OpenStreetMap”, “Stamen Terrain”, “Stamen Toner”, “Stamen Watercolor”, “CartoDB positron”, “CartoDB dark_matter”.
- Returns:
If return_map=True, returns the folium.Map object. Otherwise, returns None.
- Return type:
folium.Map or None
Examples
>>> import pandas as pd >>> import pyehicle as pye >>> df = pd.read_csv('trajectory.csv') >>> pye.utilities.visualization.single(df, name='My Route', show_in_browser=True)
>>> # Use different map tiles >>> pye.utilities.visualization.single(df, tiles="CartoDB positron")
>>> # Get map object for further customization >>> map_obj = pye.utilities.visualization.single(df, return_map=True, show_in_browser=False) >>> # Add custom markers or layers to map_obj >>> map_obj.show_in_browser()
Notes
The trajectory is drawn as an orange polyline with weight=3. The map is automatically centered on the mean coordinates of the trajectory with a default zoom level of 15 (suitable for city-level detail).
- pyehicle.utilities.visualization.multiple(df_list: list, names: list = None, return_map: bool = False, show_legend: bool = True, legend_name: str = 'Color by Trajectory', show_in_browser: bool = True, cmap: str = 'tab10', lat_col: str = 'lat', lon_col: str = 'lon', tiles: str = 'OpenStreetMap')[source]
Plot multiple trajectories on a single interactive Folium map.
This function visualizes multiple GPS trajectories on the same map with different colors for easy comparison. Each trajectory is drawn as a polyline with automatic color assignment from the specified colormap.
- Parameters:
df_list (list of pd.DataFrame or pl.DataFrame) – List of trajectory DataFrames to plot. Each DataFrame must contain latitude and longitude columns.
names (list of str, optional) – List of names for the trajectories to display in the legend. If None, trajectories are labeled as “Trajectory 1”, “Trajectory 2”, etc. If the list is shorter than df_list, remaining trajectories get default names.
return_map (bool, default=False) – If True, returns the folium.Map object instead of displaying it.
show_legend (bool, default=True) – If True, displays a draggable legend on the map showing trajectory names and colors.
legend_name (str, default='Color by Trajectory') – The title of the legend box.
show_in_browser (bool, default=True) – If True, opens the map in the default web browser.
cmap (str, default='tab10') – Matplotlib colormap name for trajectory colors. Popular options include: ‘tab10’, ‘Set1’, ‘Set2’, ‘Paired’, ‘Accent’, ‘viridis’, ‘plasma’.
lat_col (str, default='lat') – The column name for latitude values in all DataFrames.
lon_col (str, default='lon') – The column name for longitude values in all DataFrames.
tiles (str, default="OpenStreetMap") – The map tile style. Options include: “OpenStreetMap”, “Stamen Terrain”, “Stamen Toner”, “Stamen Watercolor”, “CartoDB positron”, “CartoDB dark_matter”.
- Returns:
If return_map=True, returns the folium.Map object. Otherwise returns None.
- Return type:
folium.Map or None
- Raises:
ValueError – If df_list is empty, if any DataFrame is missing required columns, or if no coordinates are available for centering the map.
Examples
>>> import pandas as pd >>> import pyehicle as pye >>> >>> raw = pd.read_csv('raw_trajectory.csv') >>> matched = pd.read_csv('matched_trajectory.csv') >>> reconstructed = pd.read_csv('reconstructed_trajectory.csv') >>> >>> # Compare three trajectories with custom names >>> pye.utilities.visualization.multiple( ... df_list=[raw, matched, reconstructed], ... names=['Raw GPS', 'Map-Matched', 'Reconstructed'], ... show_in_browser=True ... )
>>> # Use different colormap and map style >>> pye.utilities.visualization.multiple( ... df_list=[traj1, traj2, traj3], ... cmap='viridis', ... tiles='CartoDB dark_matter' ... )
>>> # Get map object for further customization >>> map_obj = pye.utilities.visualization.multiple( ... df_list=[traj1, traj2], ... names=['Before', 'After'], ... return_map=True, ... show_in_browser=False ... )
Notes
Trajectories are drawn with weight=3 and opacity=0.7 for better visibility when multiple trajectories overlap.
The map is automatically centered on the centroid of all trajectories combined.
Colors are automatically assigned from the specified colormap to ensure visual distinction between trajectories.