API Reference

Preprocessing Module

Trajectory preprocessing module for pyehicle.

This module provides algorithms for preprocessing GPS trajectories, including: - Compression: Reduce trajectory point count while preserving shape - Map-matching: Align trajectories to road networks - Filtering: Smooth trajectories using Kalman filtering - Segmentation: Split trajectories into segments - Interpolation: Add points between existing trajectory points - Sampling rate: Calculate trajectory sampling rates

pyehicle.preprocessing.by_number_of_points(df: DataFrame | DataFrame, num: int, lat_col: str = 'lat', lon_col: str = 'lon', time_col: str = 'time') → DataFrame | DataFrame[source]

Resample trajectory to exact number of points using geodesic interpolation.

This function resamples a GPS trajectory to contain exactly num points by interpolating along geodesic (great-circle) paths between original points. Points are distributed uniformly by distance along the trajectory, and timestamps are interpolated proportionally.

This is useful for: - Normalizing trajectory density across datasets - Preparing trajectories for fixed-size neural network inputs - Creating smoother visualizations - Standardizing trajectories for comparison

Parameters:

df (pd.DataFrame or pl.DataFrame) – Input trajectory with at least latitude, longitude, and optionally time columns. Minimum 2 points required.
num (int) – Target number of points in the output trajectory. Must be > 2. Start and end points are always preserved.
lat_col (str, default='lat') – Name of the latitude column (WGS84 decimal degrees).
lon_col (str, default='lon') – Name of the longitude column (WGS84 decimal degrees).
time_col (str, default='time') – Name of the time column. If present, timestamps are interpolated linearly by cumulative distance. If not present, output contains only lat/lon.

Returns:

Resampled trajectory with exactly num points. Returns same type as input. Contains columns: lat_col, lon_col, and time_col (if present in input). Points are distributed uniformly by distance along the geodesic path.

Return type:

pd.DataFrame or pl.DataFrame

Examples

>>> import pandas as pd
>>> import pyehicle as pye
>>>
>>> # Load trajectory with variable spacing
>>> df = pd.read_csv('trajectory.csv')
>>> print(f"Original: {len(df)} points")
>>>
>>> # Resample to exactly 100 points
>>> resampled = pye.preprocessing.by_number_of_points(df, num=100)
>>> print(f"Resampled: {len(resampled)} points")  # Exactly 100
>>>
>>> # Increase density for smoother visualization
>>> dense = pye.preprocessing.by_number_of_points(df, num=500)
>>> pye.utilities.visualization.single(dense, name='Dense Trajectory')

Notes

Algorithm: 1. Calculate cumulative geodesic distances along the original trajectory 2. Define num target distances uniformly spaced from 0 to total_distance 3. For each target distance, find containing segment and interpolation fraction 4. Interpolate coordinates geodesically using pyproj Geod.fwd() 5. Interpolate timestamps linearly by distance (if time column present)

Geodesic Interpolation: - Uses WGS84 ellipsoid (pyproj Geod) - Accurate for all distances and latitudes - Interpolates along great-circle paths (shortest distance on sphere) - Preserves exact start and end points

Performance: - Time complexity: O(n + m) where n = input points, m = output points - Vectorized operations for efficiency - Fast even for large trajectories (>10,000 points)

Use Cases: - Standardizing trajectory density for machine learning - Creating smooth animations with fixed frame counts - Upsampling sparse trajectories (num > len(df)) - Downsampling dense trajectories (num < len(df))

See also

by_sampling_rate: Resample by temporal interval
get_sampling_rate: Calculate current sampling rate

pyehicle.preprocessing.by_sampling_rate(df: DataFrame | DataFrame, target_sampling_rate: float, lat_col: str = 'lat', lon_col: str = 'lon', time_col: str = 'time') → DataFrame | DataFrame[source]

Resample trajectory to uniform temporal interval using geodesic interpolation.

This function resamples a GPS trajectory to a constant time interval (sampling rate) by interpolating coordinates at regular temporal intervals. This creates trajectories with uniform time spacing, essential for time-series analysis and temporal modeling.

Coordinates are interpolated geodesically along great-circle paths between original points, maintaining geographic accuracy across all latitudes and distances.

This is useful for: - Normalizing temporal resolution across datasets - Preparing trajectories for time-series analysis - Synchronizing trajectories from different devices - Creating animations with consistent frame timing

Parameters:

df (pd.DataFrame or pl.DataFrame) – Input trajectory with latitude, longitude, and time columns. Minimum 2 points.
target_sampling_rate (float) – Target time interval in seconds between consecutive points. Must be positive and smaller than the current sampling rate. Typical values: - 1.0s: High-frequency (1 Hz) - 5.0s: Standard GPS logging - 10.0s: Low-frequency tracking
lat_col (str, default='lat') – Name of the latitude column (WGS84 decimal degrees).
lon_col (str, default='lon') – Name of the longitude column (WGS84 decimal degrees).
time_col (str, default='time') – Name of the time column. Must be parseable by pandas.to_datetime().

Returns:

Resampled trajectory with uniform temporal spacing. Returns same type as input. Contains columns: lat_col, lon_col, time_col. Points are spaced at exact target_sampling_rate intervals. Duplicates removed if present.

Return type:

pd.DataFrame or pl.DataFrame

Raises:

ValueError –

If target_sampling_rate <= 0 - If target_sampling_rate > current sampling rate (would require downsampling)

Examples

>>> import pandas as pd
>>> import pyehicle as pye
>>>
>>> # Load trajectory with variable sampling rate
>>> df = pd.read_csv('trajectory.csv')
>>> current_rate = pye.preprocessing.get_sampling_rate(df)
>>> print(f"Current sampling rate: {current_rate}s")
>>>
>>> # Resample to 5-second intervals
>>> resampled = pye.preprocessing.by_sampling_rate(df, target_sampling_rate=5.0)
>>> new_rate = pye.preprocessing.get_sampling_rate(resampled)
>>> print(f"New sampling rate: {new_rate}s")  # Should be ~5.0
>>>
>>> # Synchronize multiple trajectories to same temporal resolution
>>> traj1 = pye.preprocessing.by_sampling_rate(df1, target_sampling_rate=1.0)
>>> traj2 = pye.preprocessing.by_sampling_rate(df2, target_sampling_rate=1.0)
>>> # Now both have 1-second intervals for comparison

Notes

Algorithm: 1. Convert timestamps to seconds from start 2. Generate target times at uniform intervals: 0, Δt, 2Δt, 3Δt, … 3. For each target time, find containing time segment 4. Calculate interpolation fraction within segment 5. Interpolate coordinates geodesically at that time 6. Convert back to datetime format

Geodesic Interpolation: - Uses WGS84 ellipsoid for accurate great-circle paths - Coordinates interpolated along shortest distance on sphere - Maintains geographic accuracy at all latitudes - Vectorized operations for performance

Temporal Requirements: - target_sampling_rate must be smaller than current rate - This function increases temporal resolution (more frequent sampling) - For decreasing resolution (downsampling), use data selection instead

Performance: - Time complexity: O(n + m) where n = input points, m = output points - Vectorized numpy operations throughout - Fast for typical trajectories (<10,000 points)

Use Cases: - Preparing data for LSTM/RNN models (uniform time steps) - Creating smooth real-time visualizations - Synchronizing multi-sensor data - Generating training data with consistent temporal resolution

See also

by_number_of_points: Resample by fixed point count
get_sampling_rate: Calculate current sampling rate

pyehicle.preprocessing.by_time(df: DataFrame | DataFrame, time_threshold: float = 30, length_threshold: int = 20, time_col: str = 'time') → DataFrame | DataFrame | list[source]

Segment a trajectory into sub-trajectories based on time gaps between consecutive points.

This function splits a trajectory whenever there is a time gap larger than the specified threshold. Only segments longer than the length threshold are kept. This is useful for: - Separating distinct trips recorded in a single file - Identifying breaks in continuous GPS tracking - Removing short, fragmented trajectory segments

The function preserves all original columns and maintains the temporal order within segments.

Parameters:

df (pd.DataFrame or pl.DataFrame) – The trajectory DataFrame to segment. Must contain a time column.
time_threshold (float, default=30) – Maximum time gap in seconds. Gaps larger than this will cause a split. Common values: - 30s: For removing brief GPS signal losses - 300s (5 min): For separating distinct trips - 3600s (1 hour): For separating different days
length_threshold (int, default=20) – Minimum number of points for a segment to be kept. Segments with fewer points are discarded. Helps filter out very short, potentially noisy segments.
time_col (str, default='time') – Name of the time column. Must be parseable by pandas.to_datetime().

Returns:

The return type depends on the number of valid segments found: - If only 1 valid segment: Returns a single DataFrame (same type as input) - If multiple valid segments: Returns a list of DataFrames - If no valid segments (all too short): Returns the original DataFrame unchanged - If input has ≤ 1 row: Returns the original DataFrame unchanged

Return type:

pd.DataFrame, pl.DataFrame, or list

Examples

>>> import pandas as pd
>>> import pyehicle as pye
>>>
>>> # Load a trajectory with multiple trips
>>> df = pd.read_csv('full_day_gps.csv')
>>> print(f"Total points: {len(df)}")
>>>
>>> # Split at 10-minute gaps, keep segments with 50+ points
>>> segments = pye.preprocessing.by_time(
...     df,
...     time_threshold=600,      # 10 minutes
...     length_threshold=50
... )
>>>
>>> # Check results
>>> if isinstance(segments, list):
...     print(f"Found {len(segments)} trips")
...     for i, seg in enumerate(segments):
...         print(f"Trip {i+1}: {len(seg)} points")
... else:
...     print(f"Single continuous trajectory: {len(segments)} points")
>>>
>>> # Process each segment separately
>>> if isinstance(segments, list):
...     for i, segment in enumerate(segments):
...         # Apply preprocessing to each trip
...         compressed = pye.preprocessing.spatio_temporal_compress(segment)
...         matched = pye.preprocessing.leuven(compressed)
...         matched.to_csv(f'trip_{i+1}_matched.csv')

Notes

Algorithm: 1. Calculate time differences between consecutive points 2. Identify split points where time_diff > time_threshold 3. Split trajectory at these points 4. Filter out segments with length ≤ length_threshold 5. Return results based on number of valid segments

Performance: - Time complexity: O(n) where n is the number of points - Memory: O(n) for creating new DataFrames - Fast for all trajectory sizes

Edge Cases: - Empty or single-point input: Returns original DataFrame - All segments too short: Returns original DataFrame - Time column not sorted: Results may be unexpected (assumes chronological order)

Use Cases: - Preprocessing multi-day GPS logs - Separating work commutes from personal trips - Removing GPS signal loss periods - Batch processing distinct trajectory segments

pyehicle.preprocessing.get_sampling_rate(df: DataFrame | DataFrame, time_col: str = 'time') → float[source]

Calculate the average sampling rate (time interval) of a GPS trajectory.

This function computes the mean time difference between consecutive trajectory points, providing insight into the temporal resolution of the GPS data. This information is useful for: - Understanding data quality and resolution - Choosing appropriate interpolation parameters - Validating trajectory resampling operations - Comparing trajectories from different sources

The function handles various time formats and returns -1.0 for trajectories with insufficient data (< 2 points).

Parameters:

df (pd.DataFrame or pl.DataFrame) – The DataFrame containing trajectory data with timestamps.
time_col (str, default='time') – The column name for time values. The column should contain datetime-like values that can be parsed by pandas.to_datetime().

Returns:

The average sampling rate in seconds, rounded to 3 decimal places. Returns -1.0 if the DataFrame has fewer than 2 rows (cannot calculate rate).

Return type:

float

Raises:

ValueError – If the specified time column is not found in the DataFrame.

Examples

>>> import pandas as pd
>>> import pyehicle as pye
>>>
>>> # Create a trajectory with 5-second intervals
>>> df = pd.DataFrame({
...     'time': pd.date_range('2023-01-01', periods=100, freq='5S'),
...     'lat': np.linspace(56.95, 56.96, 100),
...     'lon': np.linspace(24.10, 24.11, 100)
... })
>>>
>>> # Check sampling rate
>>> rate = pye.preprocessing.get_sampling_rate(df)
>>> print(f"Sampling rate: {rate} seconds")  # Output: 5.0 seconds
>>>
>>> # Use sampling rate to choose interpolation target
>>> if rate > 10:
...     # Sparse data - may need more interpolation
...     target_rate = rate / 2
... else:
...     # Dense data - keep similar rate
...     target_rate = rate
>>>
>>> # Resample trajectory to target rate
>>> resampled = pye.preprocessing.by_sampling_rate(
...     df,
...     target_sampling_rate=target_rate
... )

>>> # Compare sampling rates before and after compression
>>> original_rate = pye.preprocessing.get_sampling_rate(df)
>>> compressed = pye.preprocessing.spatio_temporal_compress(df)
>>> compressed_rate = pye.preprocessing.get_sampling_rate(compressed)
>>> print(f"Original: {original_rate}s, After compression: {compressed_rate}s")

Notes

Calculation Method: The function computes: mean(time[i+1] - time[i]) for all consecutive pairs. This provides a robust estimate even for trajectories with variable sampling rates.

Interpretation: - 1.0s: High-frequency GPS (1 Hz sampling) - 5.0s: Standard GPS logging (0.2 Hz) - 10-30s: Low-frequency tracking - >60s: Sparse trajectory or vehicle with intermittent tracking

Limitations: - Assumes time column is in chronological order (not validated) - Doesn’t detect or handle outliers in sampling rate - For highly variable rates, mean may not be representative - Returns -1.0 for empty or single-point trajectories

Performance: - Time complexity: O(n) where n is the number of points - Memory: O(n) for time difference array - Fast even for large trajectories (> 1 million points)

pyehicle.preprocessing.kalman_aeqd_filter(df: DataFrame | DataFrame, lat_col: str = 'lat', lon_col: str = 'lon', time_col: str | None = 'time', process_noise_std_m_init: float = 1.0, measurement_noise_std_m_init: float = 10.0, em_iters: int = 3, em_tol: float = 0.001, chunk_size: int | None = None, overlap: int = 50, outlier_alpha: float | None = 0.01, return_states: bool = False, per_chunk_aeqd: bool = True, use_mmap: bool = False, mmap_dir: str | None = None, mmap_threshold: int = 200000, return_QR: bool = False, use_numba: bool = False, transformer_cache_precision: int = 6, verbose: bool = False)[source]

Apply Kalman filtering and RTS smoothing to a GPS trajectory with automatic noise parameter learning.

This function smooths noisy GPS trajectories using a Kalman filter with a constant-velocity motion model. It uses an Azimuthal Equidistant (AEQD) projection for accurate metric calculations and can automatically learn optimal noise parameters through the EM algorithm. Optionally detects and flags outliers.

Parameters:

df (pd.DataFrame or pl.DataFrame) – Input trajectory with latitude, longitude, and optionally time columns.
lat_col (str, default="lat") – Name of the latitude column.
lon_col (str, default="lon") – Name of the longitude column.
time_col (str or None, default="time") – Name of the time column. If None, synthetic timestamps are created.
process_noise_std_m_init (float, default=1.0) – Initial process noise standard deviation in meters (affects position).
measurement_noise_std_m_init (float, default=10.0) – Initial measurement noise standard deviation in meters (GPS error).
em_iters (int, default=3) – Number of EM iterations for learning noise parameters (0 = use initial values).
em_tol (float, default=1e-3) – EM convergence tolerance (relative change in Q and R matrices).
chunk_size (int or None, default=None) – Process trajectory in chunks of this size (useful for very long trajectories).
overlap (int, default=50) – Number of overlapping points between chunks for smooth stitching.
outlier_alpha (float or None, default=0.01) – Significance level for outlier detection (None = no outlier detection).
return_states (bool, default=False) – If True, also return full state vectors (position + velocity).
per_chunk_aeqd (bool, default=True) – If True, use a separate AEQD projection for each chunk.
use_mmap (bool, default=False) – Use memory-mapped arrays for very large trajectories.
mmap_dir (str or None, default=None) – Directory for memory-mapped files.
mmap_threshold (int, default=200000) – Trajectory length above which to use memory mapping.
return_QR (bool, default=False) – If True, also return learned Q and R covariance matrices.
use_numba (bool, default=False) – Use numba JIT compilation for faster processing (if numba is available).
transformer_cache_precision (int, default=6) – Number of decimal places for caching coordinate transformers.
verbose (bool, default=False) – Print progress information during processing.

Returns:

pd.DataFrame or pl.DataFrame – Smoothed trajectory with the same type as input. Contains smoothed lat/lon and a ‘_kept’ column indicating which points passed outlier detection.
np.ndarray (optional) – If return_states=True, returns full state vectors (n x 4) with position and velocity.
tuple (optional) – If return_QR=True, returns (avg_Q, avg_R, qr_list) with learned covariance matrices.

Notes

The Kalman filter uses a 2D constant-velocity model: - State: [x, vx, y, vy] (position and velocity in AEQD projected coordinates) - Observation: [x, y] (GPS measurements) The EM algorithm learns optimal Q (process noise) and R (measurement noise) from the data.

pyehicle.preprocessing.leuven(df: DataFrame | DataFrame, lat_col: str = 'lat', lon_col: str = 'lon', delete_cache: bool = False, max_dist: float = 0.01, tile_max_deg: float = 0.5, tile_overlap_deg: float = 0.05, cache_overpass: bool = True, return_projected: bool = False, aeqd_precision: int = 6) → DataFrame | DataFrame[source]

Map-match GPS trajectory to road network using Hidden Markov Model algorithm (Leuven).

This function aligns noisy GPS points to the most likely path on the actual road network by fetching OpenStreetMap data and applying HMM-based map-matching. It corrects GPS drift, snaps points to road centerlines, and assigns OSM way IDs for road-network-based analysis.

The algorithm: 1. Fetches driveable roads from Overpass API (tiled for large areas) 2. Projects coordinates to AEQD (Azimuthal Equidistant) for accurate metric calculations 3. Builds in-memory road network graph from OSM data 4. Applies Hidden Markov Model to find most likely road path 5. Returns trajectory with coordinates snapped to roads

This is essential preprocessing for: - Trajectory reconstruction (provides OSM IDs for road identification) - Removing GPS noise and drift - Preparing data for road-network-based routing analysis - Generating training data for machine learning models

Parameters:

df (pd.DataFrame or pl.DataFrame) – Input GPS trajectory with at least latitude and longitude columns. Can contain additional columns (all will be preserved). Minimum 2 points required.
lat_col (str, default='lat') – Name of the latitude column (WGS84 decimal degrees).
lon_col (str, default='lon') – Name of the longitude column (WGS84 decimal degrees).
delete_cache (bool, default=False) – If True, delete cached Overpass API responses after matching. Useful for saving disk space or forcing fresh data download. Recommended: False (reuse cache).
max_dist (float, default=0.01) – Maximum distance in kilometers for matching GPS points to road nodes. Points farther than this from any road will be excluded from the result. Typical values: - 0.01 km (10m): High-accuracy GPS, urban areas - 0.05 km (50m): Standard GPS accuracy - 0.1 km (100m): Low-accuracy GPS or sparse road networks
tile_max_deg (float, default=0.5) – Maximum tile size in degrees for splitting large bounding boxes. Overpass API has request size limits, so large trajectories are automatically tiled. Larger values = fewer API calls but longer response times. Typical: 0.3-0.7 degrees.
tile_overlap_deg (float, default=0.05) – Overlap between adjacent tiles in degrees to ensure roads crossing tile boundaries are captured. Should be >= typical road segment length. Typical: 0.03-0.1 degrees.
cache_overpass (bool, default=True) – If True, cache Overpass API responses to disk in ./overpass_cache/ for reuse. Dramatically speeds up repeated matching in the same area. Recommended: True.
return_projected (bool, default=False) – If True, return coordinates in AEQD projected meters (x, y) instead of WGS84 degrees (lon, lat). Useful for distance calculations. Typical: False (keep lat/lon).
aeqd_precision (int, default=6) – Decimal precision for rounding AEQD projection center coordinates when caching transformers. Higher precision = more cache entries. Typical: 5-7.

Returns:

Map-matched trajectory with coordinates snapped to road network. Returns same type as input (pandas → pandas, polars → polars). Output contains: - Matched coordinates (lat/lon or x/y depending on return_projected) - All original columns preserved - Points filtered: only those within max_dist of roads are kept - Order preserved: points remain in chronological sequence

Return type:

pd.DataFrame or pl.DataFrame

Examples

>>> import pandas as pd
>>> import pyehicle as pye
>>>
>>> # Load raw GPS trajectory
>>> df = pd.read_csv('raw_gps.csv')
>>> print(f"Raw trajectory: {len(df)} points")
>>>
>>> # Map-match to OSM roads with default settings
>>> matched = pye.preprocessing.leuven(df)
>>> print(f"Matched trajectory: {len(matched)} points")
>>>
>>> # Map-match with custom distance threshold (100m)
>>> matched = pye.preprocessing.leuven(
...     df,
...     max_dist=0.1,  # 100 meters
...     cache_overpass=True  # Reuse cached OSM data
... )
>>>
>>> # Use in preprocessing pipeline
>>> compressed = pye.preprocessing.spatio_temporal_compress(df)
>>> matched = pye.preprocessing.leuven(compressed)  # Map-match compressed data
>>> filtered = pye.preprocessing.kalman_aeqd_filter(matched)  # Apply Kalman filter
>>>
>>> # Visualize before/after
>>> pye.utilities.visualization.multiple(
...     [df, matched],
...     names=['Raw GPS', 'Map-Matched'],
...     show_in_browser=True
... )

Notes

Algorithm: Hidden Markov Model (HMM) - States: Road segments (edges) in the OSM network - Observations: GPS points with noise - Transition probability: Likelihood of traveling from one road to another - Emission probability: Likelihood of observing GPS point given road location - Viterbi algorithm: Finds most likely sequence of states (road path)

Data Sources: - Overpass API: Public OSM query service (https://overpass-api.de/) - Fallback: OSMnx library if Overpass fails - Caching: Responses saved to ./overpass_cache/ as XML files

Dependencies: - leuvenmapmatching: Core HMM matching library - osmium: Fast OSM XML parsing (preferred) - osmnx: Fallback for OSM data fetching and parsing - pyproj: AEQD projection transformations

Projection: AEQD (Azimuthal Equidistant) - Centered on trajectory bounding box center - Preserves distances from center point - Converts lat/lon (degrees) to x/y (meters) - Essential for accurate distance-based matching

Performance: - Time complexity: O(n * m * k) where:

n = trajectory points

m = average candidate roads per point

k = HMM Viterbi path length

For 1000-point trajectory: 30-90 seconds (first run), <5 seconds (cached)
Overpass API calls: ~1-5 tiles for typical city-scale trajectories
Cache significantly speeds up repeated matching in same area

Quality Factors: - GPS accuracy: Better GPS → better matching - Road network density: Urban >> rural - Sampling rate: Higher frequency → better path inference - max_dist: Should match expected GPS error (typically 10-50m)

Limitations: - Requires internet connection (Overpass API or OSMnx) - May fail in areas with sparse OSM coverage - Computationally intensive for very long trajectories (>10,000 points) - Does not handle GPS outages or tunnels well - Assumes trajectory follows driveable roads (not for off-road vehicles)

Troubleshooting: - “No OSM data available”: Check internet connection, try larger tile_max_deg - Slow performance: Enable cache_overpass=True, reduce trajectory size with compression - Poor matching: Increase max_dist, check OSM coverage in area, improve GPS quality - Overpass timeout: Reduce tile_max_deg (split into smaller tiles)

See also

meili: Alternative map-matching using Valhalla routing engine
spatio_temporal_compress: Reduce trajectory size before map-matching
kalman_aeqd_filter: Apply Kalman filtering after map-matching

pyehicle.preprocessing.meili(df: DataFrame | DataFrame, lat_col: str = 'lat', lon_col: str = 'lon', time_col: str = 'time') → DataFrame | DataFrame[source]

Map-match a GPS trajectory using Valhalla’s Meili algorithm.

This function requires a Valhalla server running on localhost:8002. Meili uses a Hidden Markov Model to match GPS traces to roads, handling noise and gaps effectively.

Parameters:

df (pd.DataFrame or pl.DataFrame) – Input trajectory with latitude, longitude, and time columns.
lat_col (str, default='lat') – Name of the latitude column.
lon_col (str, default='lon') – Name of the longitude column.
time_col (str, default='time') – Name of the time column.

Returns:

Map-matched trajectory with coordinates snapped to roads. Includes ‘original_lon’ and ‘original_lat’ columns if time_col is present.

Return type:

pd.DataFrame or pl.DataFrame

Notes

Requires Valhalla server running on http://localhost:8002 Install and run Valhalla before using this function. See: https://github.com/valhalla/valhalla

pyehicle.preprocessing.spatio_temporal_compress(df: DataFrame | DataFrame, spatial_radius_km: float = 0.1, lat_col: str = 'lat', lon_col: str = 'lon', time_col: str | None = 'time', collapse_clusters: bool = True, time_threshold_s: float | None = None, agg_other: Literal['first', 'last', 'mean'] = 'first', min_samples: int = 1, drop_noise: bool = False, use_aeqd: bool = True, final_geodetic_check: bool = False, geodetic_check_tol_m: float = 0.0) → DataFrame | DataFrame[source]

Compress GPS trajectory by clustering nearby points using DBSCAN-style spatial clustering.

This function reduces the number of points in a trajectory while preserving its essential shape by identifying and merging clusters of points that are close together in space (and optionally time). It uses a KD-tree for efficient O(n log n) neighbor searches and can optionally use an Azimuthal Equidistant (AEQD) projection for sub-meter distance accuracy.

The compression algorithm works in several stages: 1. Project coordinates to a local metric system (AEQD centered on trajectory centroid) 2. Build a KD-tree for fast spatial range queries 3. Find clusters of nearby points using DBSCAN 4. Optionally filter clusters by temporal proximity 5. Aggregate or select representatives from each cluster

Parameters:

df (pd.DataFrame or pl.DataFrame) – Input trajectory with latitude, longitude, and optionally time columns. Must contain at least lat_col and lon_col columns.
spatial_radius_km (float, default=0.1) – Clustering radius in kilometers (converted to meters internally). Points within this distance are considered spatial neighbors. Typical values: 0.01-0.5 km for urban trajectories.
lat_col (str, default="lat") – Name of the latitude column (WGS84 decimal degrees).
lon_col (str, default="lon") – Name of the longitude column (WGS84 decimal degrees).
time_col (str or None, default="time") – Name of the time column. If None, no temporal filtering is applied. If provided, must be parseable by pandas.to_datetime().
collapse_clusters (bool, default=True) – How to represent each cluster in the output: - True: Merge all cluster points into a single point (geographic centroid) - False: Keep one representative point from each cluster (the one with lowest index)
time_threshold_s (float or None, default=None) – Temporal clustering threshold in seconds. If provided, points must be within this time window AND spatial_radius_km to be clustered together. Useful for separating overlapping paths taken at different times.
agg_other ({"first", "last", "mean"}, default="first") – How to aggregate non-spatial columns when collapse_clusters=True: - “first”: Use values from the first point in the cluster - “last”: Use values from the last point in the cluster - “mean”: Average numeric columns, use first value for non-numeric
min_samples (int, default=1) – Minimum number of neighbors (including self) required to form a core point. Similar to DBSCAN’s min_samples parameter. Higher values filter out isolated points more aggressively.
drop_noise (bool, default=False) – If True, drop points that don’t belong to any cluster (noise points). If False, keep noise points as singleton clusters.
use_aeqd (bool, default=True) – If True, use Azimuthal Equidistant projection centered on trajectory centroid for accurate distance calculations. If False or if AEQD fails, fallback to Web Mercator (EPSG:3857). AEQD is recommended for accurate compression.
final_geodetic_check (bool, default=False) – If True, verify each cluster using geodesic distances (Geod.inv) after the initial projection-based clustering. Slower but provides additional accuracy verification. Usually not needed with use_aeqd=True.
geodetic_check_tol_m (float, default=0.0) – Tolerance in meters for the final geodetic check. Cluster members must be within spatial_radius_km + geodetic_check_tol_m of the cluster centroid. Only used if final_geodetic_check=True.

Returns:

Compressed trajectory with the same type and column structure as input. The order of rows may differ from the input. The number of rows will be less than or equal to the input (depending on clustering results).

Return type:

pd.DataFrame or pl.DataFrame

Raises:

ValueError – If lat_col or lon_col are not present in the DataFrame.

Examples

>>> import pandas as pd
>>> import pyehicle as pye
>>>
>>> # Load a dense GPS trajectory
>>> df = pd.read_csv('dense_trajectory.csv')  # 1000 points
>>> print(f"Original points: {len(df)}")
>>>
>>> # Simple compression with 100m radius
>>> compressed = pye.preprocessing.spatio_temporal_compress(
...     df,
...     spatial_radius_km=0.1,  # 100 meters
...     collapse_clusters=True
... )
>>> print(f"Compressed points: {len(compressed)}")  # ~200 points
>>>
>>> # Spatio-temporal compression (separate paths at different times)
>>> compressed_st = pye.preprocessing.spatio_temporal_compress(
...     df,
...     spatial_radius_km=0.05,  # 50 meters
...     time_threshold_s=300,     # 5 minutes
...     collapse_clusters=True
... )
>>>
>>> # Aggressive compression: drop isolated points, larger radius
>>> aggressive = pye.preprocessing.spatio_temporal_compress(
...     df,
...     spatial_radius_km=0.2,  # 200 meters
...     min_samples=3,           # Need at least 3 neighbors
...     drop_noise=True,         # Remove isolated points
...     agg_other="mean"         # Average other columns
... )

Notes

Algorithm Details:

The function implements a DBSCAN-like clustering algorithm optimized for GPS data:

Projection: Coordinates are projected to a local metric system for accurate distance calculations. By default, uses AEQD (Azimuthal Equidistant) projection centered on the trajectory’s geographic centroid. This ensures distances are accurate within a few meters across the entire trajectory.
Spatial Indexing: A KD-tree is built on the projected coordinates for O(log n) neighbor queries. Without scipy, falls back to O(n²) brute force.
Clustering: DBSCAN algorithm identifies clusters of spatially connected points: - Core points: Points with ≥ min_samples neighbors within spatial_radius_km - Cluster formation: Core points and their neighbors form clusters - Noise: Points that don’t belong to any cluster (if not dropped)
Temporal Filtering: If time_threshold_s is provided, neighbor relationships are filtered to only include points within the time window. This prevents clustering of points from different trips on the same path.
Aggregation: Clusters are represented either by their geographic centroid (collapse_clusters=True) or by selecting one representative point.

Performance: - With scipy (cKDTree): O(n log n) - fast even for large trajectories - Without scipy: O(n²) - acceptable for <10,000 points, slow for larger datasets - AEQD projection: ~0.1ms per trajectory (one-time cost) - Geodetic check: O(n) extra cost per cluster (usually not needed)

Accuracy: - AEQD projection: Sub-meter accuracy for distances up to ~1000 km from center - Web Mercator: Acceptable for small spatial_radius (<1 km) at mid-latitudes - Geodetic check: Verifies clusters using true geodesic distances (slowest but most accurate)

Typical Use Cases: - Reduce storage size of dense GPS logs (e.g., 1 Hz sampling → 0.1 Hz effective) - Preprocessing for map-matching (remove redundant points) - Noise reduction in stationary periods (e.g., at traffic lights) - Simplify trajectory for visualization

Warnings: - If scipy is not installed, a warning is issued and O(n²) fallback is used - If AEQD projection fails, a warning is issued and Web Mercator is used - Very large spatial_radius_km (>5 km) may produce unexpected results - Very small min_samples (<1) will be clamped to 1

Compression

pyehicle.preprocessing.spatio_temporal_compress(df: DataFrame | DataFrame, spatial_radius_km: float = 0.1, lat_col: str = 'lat', lon_col: str = 'lon', time_col: str | None = 'time', collapse_clusters: bool = True, time_threshold_s: float | None = None, agg_other: Literal['first', 'last', 'mean'] = 'first', min_samples: int = 1, drop_noise: bool = False, use_aeqd: bool = True, final_geodetic_check: bool = False, geodetic_check_tol_m: float = 0.0) → DataFrame | DataFrame[source]

Compress GPS trajectory by clustering nearby points using DBSCAN-style spatial clustering.

This function reduces the number of points in a trajectory while preserving its essential shape by identifying and merging clusters of points that are close together in space (and optionally time). It uses a KD-tree for efficient O(n log n) neighbor searches and can optionally use an Azimuthal Equidistant (AEQD) projection for sub-meter distance accuracy.

The compression algorithm works in several stages: 1. Project coordinates to a local metric system (AEQD centered on trajectory centroid) 2. Build a KD-tree for fast spatial range queries 3. Find clusters of nearby points using DBSCAN 4. Optionally filter clusters by temporal proximity 5. Aggregate or select representatives from each cluster

Parameters:

df (pd.DataFrame or pl.DataFrame) – Input trajectory with latitude, longitude, and optionally time columns. Must contain at least lat_col and lon_col columns.
spatial_radius_km (float, default=0.1) – Clustering radius in kilometers (converted to meters internally). Points within this distance are considered spatial neighbors. Typical values: 0.01-0.5 km for urban trajectories.
lat_col (str, default="lat") – Name of the latitude column (WGS84 decimal degrees).
lon_col (str, default="lon") – Name of the longitude column (WGS84 decimal degrees).
time_col (str or None, default="time") – Name of the time column. If None, no temporal filtering is applied. If provided, must be parseable by pandas.to_datetime().
collapse_clusters (bool, default=True) – How to represent each cluster in the output: - True: Merge all cluster points into a single point (geographic centroid) - False: Keep one representative point from each cluster (the one with lowest index)
time_threshold_s (float or None, default=None) – Temporal clustering threshold in seconds. If provided, points must be within this time window AND spatial_radius_km to be clustered together. Useful for separating overlapping paths taken at different times.
agg_other ({"first", "last", "mean"}, default="first") – How to aggregate non-spatial columns when collapse_clusters=True: - “first”: Use values from the first point in the cluster - “last”: Use values from the last point in the cluster - “mean”: Average numeric columns, use first value for non-numeric
min_samples (int, default=1) – Minimum number of neighbors (including self) required to form a core point. Similar to DBSCAN’s min_samples parameter. Higher values filter out isolated points more aggressively.
drop_noise (bool, default=False) – If True, drop points that don’t belong to any cluster (noise points). If False, keep noise points as singleton clusters.
use_aeqd (bool, default=True) – If True, use Azimuthal Equidistant projection centered on trajectory centroid for accurate distance calculations. If False or if AEQD fails, fallback to Web Mercator (EPSG:3857). AEQD is recommended for accurate compression.
final_geodetic_check (bool, default=False) – If True, verify each cluster using geodesic distances (Geod.inv) after the initial projection-based clustering. Slower but provides additional accuracy verification. Usually not needed with use_aeqd=True.
geodetic_check_tol_m (float, default=0.0) – Tolerance in meters for the final geodetic check. Cluster members must be within spatial_radius_km + geodetic_check_tol_m of the cluster centroid. Only used if final_geodetic_check=True.

Returns:

Compressed trajectory with the same type and column structure as input. The order of rows may differ from the input. The number of rows will be less than or equal to the input (depending on clustering results).

Return type:

pd.DataFrame or pl.DataFrame

Raises:

ValueError – If lat_col or lon_col are not present in the DataFrame.

Examples

>>> import pandas as pd
>>> import pyehicle as pye
>>>
>>> # Load a dense GPS trajectory
>>> df = pd.read_csv('dense_trajectory.csv')  # 1000 points
>>> print(f"Original points: {len(df)}")
>>>
>>> # Simple compression with 100m radius
>>> compressed = pye.preprocessing.spatio_temporal_compress(
...     df,
...     spatial_radius_km=0.1,  # 100 meters
...     collapse_clusters=True
... )
>>> print(f"Compressed points: {len(compressed)}")  # ~200 points
>>>
>>> # Spatio-temporal compression (separate paths at different times)
>>> compressed_st = pye.preprocessing.spatio_temporal_compress(
...     df,
...     spatial_radius_km=0.05,  # 50 meters
...     time_threshold_s=300,     # 5 minutes
...     collapse_clusters=True
... )
>>>
>>> # Aggressive compression: drop isolated points, larger radius
>>> aggressive = pye.preprocessing.spatio_temporal_compress(
...     df,
...     spatial_radius_km=0.2,  # 200 meters
...     min_samples=3,           # Need at least 3 neighbors
...     drop_noise=True,         # Remove isolated points
...     agg_other="mean"         # Average other columns
... )

Notes

Algorithm Details:

The function implements a DBSCAN-like clustering algorithm optimized for GPS data:

Projection: Coordinates are projected to a local metric system for accurate distance calculations. By default, uses AEQD (Azimuthal Equidistant) projection centered on the trajectory’s geographic centroid. This ensures distances are accurate within a few meters across the entire trajectory.
Spatial Indexing: A KD-tree is built on the projected coordinates for O(log n) neighbor queries. Without scipy, falls back to O(n²) brute force.
Clustering: DBSCAN algorithm identifies clusters of spatially connected points: - Core points: Points with ≥ min_samples neighbors within spatial_radius_km - Cluster formation: Core points and their neighbors form clusters - Noise: Points that don’t belong to any cluster (if not dropped)
Temporal Filtering: If time_threshold_s is provided, neighbor relationships are filtered to only include points within the time window. This prevents clustering of points from different trips on the same path.
Aggregation: Clusters are represented either by their geographic centroid (collapse_clusters=True) or by selecting one representative point.

Performance: - With scipy (cKDTree): O(n log n) - fast even for large trajectories - Without scipy: O(n²) - acceptable for <10,000 points, slow for larger datasets - AEQD projection: ~0.1ms per trajectory (one-time cost) - Geodetic check: O(n) extra cost per cluster (usually not needed)

Accuracy: - AEQD projection: Sub-meter accuracy for distances up to ~1000 km from center - Web Mercator: Acceptable for small spatial_radius (<1 km) at mid-latitudes - Geodetic check: Verifies clusters using true geodesic distances (slowest but most accurate)

Typical Use Cases: - Reduce storage size of dense GPS logs (e.g., 1 Hz sampling → 0.1 Hz effective) - Preprocessing for map-matching (remove redundant points) - Noise reduction in stationary periods (e.g., at traffic lights) - Simplify trajectory for visualization

Warnings: - If scipy is not installed, a warning is issued and O(n²) fallback is used - If AEQD projection fails, a warning is issued and Web Mercator is used - Very large spatial_radius_km (>5 km) may produce unexpected results - Very small min_samples (<1) will be clamped to 1

Map-Matching

pyehicle.preprocessing.leuven(df: DataFrame | DataFrame, lat_col: str = 'lat', lon_col: str = 'lon', delete_cache: bool = False, max_dist: float = 0.01, tile_max_deg: float = 0.5, tile_overlap_deg: float = 0.05, cache_overpass: bool = True, return_projected: bool = False, aeqd_precision: int = 6) → DataFrame | DataFrame[source]

Map-match GPS trajectory to road network using Hidden Markov Model algorithm (Leuven).

This function aligns noisy GPS points to the most likely path on the actual road network by fetching OpenStreetMap data and applying HMM-based map-matching. It corrects GPS drift, snaps points to road centerlines, and assigns OSM way IDs for road-network-based analysis.

The algorithm: 1. Fetches driveable roads from Overpass API (tiled for large areas) 2. Projects coordinates to AEQD (Azimuthal Equidistant) for accurate metric calculations 3. Builds in-memory road network graph from OSM data 4. Applies Hidden Markov Model to find most likely road path 5. Returns trajectory with coordinates snapped to roads

This is essential preprocessing for: - Trajectory reconstruction (provides OSM IDs for road identification) - Removing GPS noise and drift - Preparing data for road-network-based routing analysis - Generating training data for machine learning models

Parameters:

df (pd.DataFrame or pl.DataFrame) – Input GPS trajectory with at least latitude and longitude columns. Can contain additional columns (all will be preserved). Minimum 2 points required.
lat_col (str, default='lat') – Name of the latitude column (WGS84 decimal degrees).
lon_col (str, default='lon') – Name of the longitude column (WGS84 decimal degrees).
delete_cache (bool, default=False) – If True, delete cached Overpass API responses after matching. Useful for saving disk space or forcing fresh data download. Recommended: False (reuse cache).
max_dist (float, default=0.01) – Maximum distance in kilometers for matching GPS points to road nodes. Points farther than this from any road will be excluded from the result. Typical values: - 0.01 km (10m): High-accuracy GPS, urban areas - 0.05 km (50m): Standard GPS accuracy - 0.1 km (100m): Low-accuracy GPS or sparse road networks
tile_max_deg (float, default=0.5) – Maximum tile size in degrees for splitting large bounding boxes. Overpass API has request size limits, so large trajectories are automatically tiled. Larger values = fewer API calls but longer response times. Typical: 0.3-0.7 degrees.
tile_overlap_deg (float, default=0.05) – Overlap between adjacent tiles in degrees to ensure roads crossing tile boundaries are captured. Should be >= typical road segment length. Typical: 0.03-0.1 degrees.
cache_overpass (bool, default=True) – If True, cache Overpass API responses to disk in ./overpass_cache/ for reuse. Dramatically speeds up repeated matching in the same area. Recommended: True.
return_projected (bool, default=False) – If True, return coordinates in AEQD projected meters (x, y) instead of WGS84 degrees (lon, lat). Useful for distance calculations. Typical: False (keep lat/lon).
aeqd_precision (int, default=6) – Decimal precision for rounding AEQD projection center coordinates when caching transformers. Higher precision = more cache entries. Typical: 5-7.

Returns:

Map-matched trajectory with coordinates snapped to road network. Returns same type as input (pandas → pandas, polars → polars). Output contains: - Matched coordinates (lat/lon or x/y depending on return_projected) - All original columns preserved - Points filtered: only those within max_dist of roads are kept - Order preserved: points remain in chronological sequence

Return type:

pd.DataFrame or pl.DataFrame

Examples

>>> import pandas as pd
>>> import pyehicle as pye
>>>
>>> # Load raw GPS trajectory
>>> df = pd.read_csv('raw_gps.csv')
>>> print(f"Raw trajectory: {len(df)} points")
>>>
>>> # Map-match to OSM roads with default settings
>>> matched = pye.preprocessing.leuven(df)
>>> print(f"Matched trajectory: {len(matched)} points")
>>>
>>> # Map-match with custom distance threshold (100m)
>>> matched = pye.preprocessing.leuven(
...     df,
...     max_dist=0.1,  # 100 meters
...     cache_overpass=True  # Reuse cached OSM data
... )
>>>
>>> # Use in preprocessing pipeline
>>> compressed = pye.preprocessing.spatio_temporal_compress(df)
>>> matched = pye.preprocessing.leuven(compressed)  # Map-match compressed data
>>> filtered = pye.preprocessing.kalman_aeqd_filter(matched)  # Apply Kalman filter
>>>
>>> # Visualize before/after
>>> pye.utilities.visualization.multiple(
...     [df, matched],
...     names=['Raw GPS', 'Map-Matched'],
...     show_in_browser=True
... )

Notes

Algorithm: Hidden Markov Model (HMM) - States: Road segments (edges) in the OSM network - Observations: GPS points with noise - Transition probability: Likelihood of traveling from one road to another - Emission probability: Likelihood of observing GPS point given road location - Viterbi algorithm: Finds most likely sequence of states (road path)

Data Sources: - Overpass API: Public OSM query service (https://overpass-api.de/) - Fallback: OSMnx library if Overpass fails - Caching: Responses saved to ./overpass_cache/ as XML files

Dependencies: - leuvenmapmatching: Core HMM matching library - osmium: Fast OSM XML parsing (preferred) - osmnx: Fallback for OSM data fetching and parsing - pyproj: AEQD projection transformations

Projection: AEQD (Azimuthal Equidistant) - Centered on trajectory bounding box center - Preserves distances from center point - Converts lat/lon (degrees) to x/y (meters) - Essential for accurate distance-based matching

Performance: - Time complexity: O(n * m * k) where:

n = trajectory points

m = average candidate roads per point

k = HMM Viterbi path length

For 1000-point trajectory: 30-90 seconds (first run), <5 seconds (cached)
Overpass API calls: ~1-5 tiles for typical city-scale trajectories
Cache significantly speeds up repeated matching in same area

Quality Factors: - GPS accuracy: Better GPS → better matching - Road network density: Urban >> rural - Sampling rate: Higher frequency → better path inference - max_dist: Should match expected GPS error (typically 10-50m)

Limitations: - Requires internet connection (Overpass API or OSMnx) - May fail in areas with sparse OSM coverage - Computationally intensive for very long trajectories (>10,000 points) - Does not handle GPS outages or tunnels well - Assumes trajectory follows driveable roads (not for off-road vehicles)

Troubleshooting: - “No OSM data available”: Check internet connection, try larger tile_max_deg - Slow performance: Enable cache_overpass=True, reduce trajectory size with compression - Poor matching: Increase max_dist, check OSM coverage in area, improve GPS quality - Overpass timeout: Reduce tile_max_deg (split into smaller tiles)

See also

meili: Alternative map-matching using Valhalla routing engine
spatio_temporal_compress: Reduce trajectory size before map-matching
kalman_aeqd_filter: Apply Kalman filtering after map-matching

pyehicle.preprocessing.meili(df: DataFrame | DataFrame, lat_col: str = 'lat', lon_col: str = 'lon', time_col: str = 'time') → DataFrame | DataFrame[source]

Map-match a GPS trajectory using Valhalla’s Meili algorithm.

This function requires a Valhalla server running on localhost:8002. Meili uses a Hidden Markov Model to match GPS traces to roads, handling noise and gaps effectively.

Parameters:

df (pd.DataFrame or pl.DataFrame) – Input trajectory with latitude, longitude, and time columns.
lat_col (str, default='lat') – Name of the latitude column.
lon_col (str, default='lon') – Name of the longitude column.
time_col (str, default='time') – Name of the time column.

Returns:

Map-matched trajectory with coordinates snapped to roads. Includes ‘original_lon’ and ‘original_lat’ columns if time_col is present.

Return type:

pd.DataFrame or pl.DataFrame

Notes

Requires Valhalla server running on http://localhost:8002 Install and run Valhalla before using this function. See: https://github.com/valhalla/valhalla

Kalman Filtering

pyehicle.preprocessing.kalman_aeqd_filter(df: DataFrame | DataFrame, lat_col: str = 'lat', lon_col: str = 'lon', time_col: str | None = 'time', process_noise_std_m_init: float = 1.0, measurement_noise_std_m_init: float = 10.0, em_iters: int = 3, em_tol: float = 0.001, chunk_size: int | None = None, overlap: int = 50, outlier_alpha: float | None = 0.01, return_states: bool = False, per_chunk_aeqd: bool = True, use_mmap: bool = False, mmap_dir: str | None = None, mmap_threshold: int = 200000, return_QR: bool = False, use_numba: bool = False, transformer_cache_precision: int = 6, verbose: bool = False)[source]

Apply Kalman filtering and RTS smoothing to a GPS trajectory with automatic noise parameter learning.

This function smooths noisy GPS trajectories using a Kalman filter with a constant-velocity motion model. It uses an Azimuthal Equidistant (AEQD) projection for accurate metric calculations and can automatically learn optimal noise parameters through the EM algorithm. Optionally detects and flags outliers.

Parameters:

df (pd.DataFrame or pl.DataFrame) – Input trajectory with latitude, longitude, and optionally time columns.
lat_col (str, default="lat") – Name of the latitude column.
lon_col (str, default="lon") – Name of the longitude column.
time_col (str or None, default="time") – Name of the time column. If None, synthetic timestamps are created.
process_noise_std_m_init (float, default=1.0) – Initial process noise standard deviation in meters (affects position).
measurement_noise_std_m_init (float, default=10.0) – Initial measurement noise standard deviation in meters (GPS error).
em_iters (int, default=3) – Number of EM iterations for learning noise parameters (0 = use initial values).
em_tol (float, default=1e-3) – EM convergence tolerance (relative change in Q and R matrices).
chunk_size (int or None, default=None) – Process trajectory in chunks of this size (useful for very long trajectories).
overlap (int, default=50) – Number of overlapping points between chunks for smooth stitching.
outlier_alpha (float or None, default=0.01) – Significance level for outlier detection (None = no outlier detection).
return_states (bool, default=False) – If True, also return full state vectors (position + velocity).
per_chunk_aeqd (bool, default=True) – If True, use a separate AEQD projection for each chunk.
use_mmap (bool, default=False) – Use memory-mapped arrays for very large trajectories.
mmap_dir (str or None, default=None) – Directory for memory-mapped files.
mmap_threshold (int, default=200000) – Trajectory length above which to use memory mapping.
return_QR (bool, default=False) – If True, also return learned Q and R covariance matrices.
use_numba (bool, default=False) – Use numba JIT compilation for faster processing (if numba is available).
transformer_cache_precision (int, default=6) – Number of decimal places for caching coordinate transformers.
verbose (bool, default=False) – Print progress information during processing.

Returns:

pd.DataFrame or pl.DataFrame – Smoothed trajectory with the same type as input. Contains smoothed lat/lon and a ‘_kept’ column indicating which points passed outlier detection.
np.ndarray (optional) – If return_states=True, returns full state vectors (n x 4) with position and velocity.
tuple (optional) – If return_QR=True, returns (avg_Q, avg_R, qr_list) with learned covariance matrices.

Notes

The Kalman filter uses a 2D constant-velocity model: - State: [x, vx, y, vy] (position and velocity in AEQD projected coordinates) - Observation: [x, y] (GPS measurements) The EM algorithm learns optimal Q (process noise) and R (measurement noise) from the data.

Segmentation

pyehicle.preprocessing.by_time(df: DataFrame | DataFrame, time_threshold: float = 30, length_threshold: int = 20, time_col: str = 'time') → DataFrame | DataFrame | list[source]

Segment a trajectory into sub-trajectories based on time gaps between consecutive points.

This function splits a trajectory whenever there is a time gap larger than the specified threshold. Only segments longer than the length threshold are kept. This is useful for: - Separating distinct trips recorded in a single file - Identifying breaks in continuous GPS tracking - Removing short, fragmented trajectory segments

The function preserves all original columns and maintains the temporal order within segments.

Parameters:

df (pd.DataFrame or pl.DataFrame) – The trajectory DataFrame to segment. Must contain a time column.
time_threshold (float, default=30) – Maximum time gap in seconds. Gaps larger than this will cause a split. Common values: - 30s: For removing brief GPS signal losses - 300s (5 min): For separating distinct trips - 3600s (1 hour): For separating different days
length_threshold (int, default=20) – Minimum number of points for a segment to be kept. Segments with fewer points are discarded. Helps filter out very short, potentially noisy segments.
time_col (str, default='time') – Name of the time column. Must be parseable by pandas.to_datetime().

Returns:

The return type depends on the number of valid segments found: - If only 1 valid segment: Returns a single DataFrame (same type as input) - If multiple valid segments: Returns a list of DataFrames - If no valid segments (all too short): Returns the original DataFrame unchanged - If input has ≤ 1 row: Returns the original DataFrame unchanged

Return type:

pd.DataFrame, pl.DataFrame, or list

Examples

>>> import pandas as pd
>>> import pyehicle as pye
>>>
>>> # Load a trajectory with multiple trips
>>> df = pd.read_csv('full_day_gps.csv')
>>> print(f"Total points: {len(df)}")
>>>
>>> # Split at 10-minute gaps, keep segments with 50+ points
>>> segments = pye.preprocessing.by_time(
...     df,
...     time_threshold=600,      # 10 minutes
...     length_threshold=50
... )
>>>
>>> # Check results
>>> if isinstance(segments, list):
...     print(f"Found {len(segments)} trips")
...     for i, seg in enumerate(segments):
...         print(f"Trip {i+1}: {len(seg)} points")
... else:
...     print(f"Single continuous trajectory: {len(segments)} points")
>>>
>>> # Process each segment separately
>>> if isinstance(segments, list):
...     for i, segment in enumerate(segments):
...         # Apply preprocessing to each trip
...         compressed = pye.preprocessing.spatio_temporal_compress(segment)
...         matched = pye.preprocessing.leuven(compressed)
...         matched.to_csv(f'trip_{i+1}_matched.csv')

Notes

Algorithm: 1. Calculate time differences between consecutive points 2. Identify split points where time_diff > time_threshold 3. Split trajectory at these points 4. Filter out segments with length ≤ length_threshold 5. Return results based on number of valid segments

Performance: - Time complexity: O(n) where n is the number of points - Memory: O(n) for creating new DataFrames - Fast for all trajectory sizes

Edge Cases: - Empty or single-point input: Returns original DataFrame - All segments too short: Returns original DataFrame - Time column not sorted: Results may be unexpected (assumes chronological order)

Use Cases: - Preprocessing multi-day GPS logs - Separating work commutes from personal trips - Removing GPS signal loss periods - Batch processing distinct trajectory segments

Interpolation

pyehicle.preprocessing.by_number_of_points(df: DataFrame | DataFrame, num: int, lat_col: str = 'lat', lon_col: str = 'lon', time_col: str = 'time') → DataFrame | DataFrame[source]

Resample trajectory to exact number of points using geodesic interpolation.

This function resamples a GPS trajectory to contain exactly num points by interpolating along geodesic (great-circle) paths between original points. Points are distributed uniformly by distance along the trajectory, and timestamps are interpolated proportionally.

This is useful for: - Normalizing trajectory density across datasets - Preparing trajectories for fixed-size neural network inputs - Creating smoother visualizations - Standardizing trajectories for comparison

Parameters:

df (pd.DataFrame or pl.DataFrame) – Input trajectory with at least latitude, longitude, and optionally time columns. Minimum 2 points required.
num (int) – Target number of points in the output trajectory. Must be > 2. Start and end points are always preserved.
lat_col (str, default='lat') – Name of the latitude column (WGS84 decimal degrees).
lon_col (str, default='lon') – Name of the longitude column (WGS84 decimal degrees).
time_col (str, default='time') – Name of the time column. If present, timestamps are interpolated linearly by cumulative distance. If not present, output contains only lat/lon.

Returns:

Resampled trajectory with exactly num points. Returns same type as input. Contains columns: lat_col, lon_col, and time_col (if present in input). Points are distributed uniformly by distance along the geodesic path.

Return type:

pd.DataFrame or pl.DataFrame

Examples

>>> import pandas as pd
>>> import pyehicle as pye
>>>
>>> # Load trajectory with variable spacing
>>> df = pd.read_csv('trajectory.csv')
>>> print(f"Original: {len(df)} points")
>>>
>>> # Resample to exactly 100 points
>>> resampled = pye.preprocessing.by_number_of_points(df, num=100)
>>> print(f"Resampled: {len(resampled)} points")  # Exactly 100
>>>
>>> # Increase density for smoother visualization
>>> dense = pye.preprocessing.by_number_of_points(df, num=500)
>>> pye.utilities.visualization.single(dense, name='Dense Trajectory')

Notes

Algorithm: 1. Calculate cumulative geodesic distances along the original trajectory 2. Define num target distances uniformly spaced from 0 to total_distance 3. For each target distance, find containing segment and interpolation fraction 4. Interpolate coordinates geodesically using pyproj Geod.fwd() 5. Interpolate timestamps linearly by distance (if time column present)

Geodesic Interpolation: - Uses WGS84 ellipsoid (pyproj Geod) - Accurate for all distances and latitudes - Interpolates along great-circle paths (shortest distance on sphere) - Preserves exact start and end points

Performance: - Time complexity: O(n + m) where n = input points, m = output points - Vectorized operations for efficiency - Fast even for large trajectories (>10,000 points)

Use Cases: - Standardizing trajectory density for machine learning - Creating smooth animations with fixed frame counts - Upsampling sparse trajectories (num > len(df)) - Downsampling dense trajectories (num < len(df))

See also

by_sampling_rate: Resample by temporal interval
get_sampling_rate: Calculate current sampling rate

pyehicle.preprocessing.by_sampling_rate(df: DataFrame | DataFrame, target_sampling_rate: float, lat_col: str = 'lat', lon_col: str = 'lon', time_col: str = 'time') → DataFrame | DataFrame[source]

Resample trajectory to uniform temporal interval using geodesic interpolation.

This function resamples a GPS trajectory to a constant time interval (sampling rate) by interpolating coordinates at regular temporal intervals. This creates trajectories with uniform time spacing, essential for time-series analysis and temporal modeling.

Coordinates are interpolated geodesically along great-circle paths between original points, maintaining geographic accuracy across all latitudes and distances.

This is useful for: - Normalizing temporal resolution across datasets - Preparing trajectories for time-series analysis - Synchronizing trajectories from different devices - Creating animations with consistent frame timing

Parameters:

df (pd.DataFrame or pl.DataFrame) – Input trajectory with latitude, longitude, and time columns. Minimum 2 points.
target_sampling_rate (float) – Target time interval in seconds between consecutive points. Must be positive and smaller than the current sampling rate. Typical values: - 1.0s: High-frequency (1 Hz) - 5.0s: Standard GPS logging - 10.0s: Low-frequency tracking
lat_col (str, default='lat') – Name of the latitude column (WGS84 decimal degrees).
lon_col (str, default='lon') – Name of the longitude column (WGS84 decimal degrees).
time_col (str, default='time') – Name of the time column. Must be parseable by pandas.to_datetime().

Returns:

Resampled trajectory with uniform temporal spacing. Returns same type as input. Contains columns: lat_col, lon_col, time_col. Points are spaced at exact target_sampling_rate intervals. Duplicates removed if present.

Return type:

pd.DataFrame or pl.DataFrame

Raises:

ValueError –

If target_sampling_rate <= 0 - If target_sampling_rate > current sampling rate (would require downsampling)

Examples

>>> import pandas as pd
>>> import pyehicle as pye
>>>
>>> # Load trajectory with variable sampling rate
>>> df = pd.read_csv('trajectory.csv')
>>> current_rate = pye.preprocessing.get_sampling_rate(df)
>>> print(f"Current sampling rate: {current_rate}s")
>>>
>>> # Resample to 5-second intervals
>>> resampled = pye.preprocessing.by_sampling_rate(df, target_sampling_rate=5.0)
>>> new_rate = pye.preprocessing.get_sampling_rate(resampled)
>>> print(f"New sampling rate: {new_rate}s")  # Should be ~5.0
>>>
>>> # Synchronize multiple trajectories to same temporal resolution
>>> traj1 = pye.preprocessing.by_sampling_rate(df1, target_sampling_rate=1.0)
>>> traj2 = pye.preprocessing.by_sampling_rate(df2, target_sampling_rate=1.0)
>>> # Now both have 1-second intervals for comparison

Notes

Algorithm: 1. Convert timestamps to seconds from start 2. Generate target times at uniform intervals: 0, Δt, 2Δt, 3Δt, … 3. For each target time, find containing time segment 4. Calculate interpolation fraction within segment 5. Interpolate coordinates geodesically at that time 6. Convert back to datetime format

Geodesic Interpolation: - Uses WGS84 ellipsoid for accurate great-circle paths - Coordinates interpolated along shortest distance on sphere - Maintains geographic accuracy at all latitudes - Vectorized operations for performance

Temporal Requirements: - target_sampling_rate must be smaller than current rate - This function increases temporal resolution (more frequent sampling) - For decreasing resolution (downsampling), use data selection instead

Performance: - Time complexity: O(n + m) where n = input points, m = output points - Vectorized numpy operations throughout - Fast for typical trajectories (<10,000 points)

Use Cases: - Preparing data for LSTM/RNN models (uniform time steps) - Creating smooth real-time visualizations - Synchronizing multi-sensor data - Generating training data with consistent temporal resolution

See also

by_number_of_points: Resample by fixed point count
get_sampling_rate: Calculate current sampling rate

Sampling Rate

pyehicle.preprocessing.get_sampling_rate(df: DataFrame | DataFrame, time_col: str = 'time') → float[source]

Calculate the average sampling rate (time interval) of a GPS trajectory.

This function computes the mean time difference between consecutive trajectory points, providing insight into the temporal resolution of the GPS data. This information is useful for: - Understanding data quality and resolution - Choosing appropriate interpolation parameters - Validating trajectory resampling operations - Comparing trajectories from different sources

The function handles various time formats and returns -1.0 for trajectories with insufficient data (< 2 points).

Parameters:

df (pd.DataFrame or pl.DataFrame) – The DataFrame containing trajectory data with timestamps.
time_col (str, default='time') – The column name for time values. The column should contain datetime-like values that can be parsed by pandas.to_datetime().

Returns:

The average sampling rate in seconds, rounded to 3 decimal places. Returns -1.0 if the DataFrame has fewer than 2 rows (cannot calculate rate).

Return type:

float

Raises:

ValueError – If the specified time column is not found in the DataFrame.

Examples

>>> import pandas as pd
>>> import pyehicle as pye
>>>
>>> # Create a trajectory with 5-second intervals
>>> df = pd.DataFrame({
...     'time': pd.date_range('2023-01-01', periods=100, freq='5S'),
...     'lat': np.linspace(56.95, 56.96, 100),
...     'lon': np.linspace(24.10, 24.11, 100)
... })
>>>
>>> # Check sampling rate
>>> rate = pye.preprocessing.get_sampling_rate(df)
>>> print(f"Sampling rate: {rate} seconds")  # Output: 5.0 seconds
>>>
>>> # Use sampling rate to choose interpolation target
>>> if rate > 10:
...     # Sparse data - may need more interpolation
...     target_rate = rate / 2
... else:
...     # Dense data - keep similar rate
...     target_rate = rate
>>>
>>> # Resample trajectory to target rate
>>> resampled = pye.preprocessing.by_sampling_rate(
...     df,
...     target_sampling_rate=target_rate
... )

>>> # Compare sampling rates before and after compression
>>> original_rate = pye.preprocessing.get_sampling_rate(df)
>>> compressed = pye.preprocessing.spatio_temporal_compress(df)
>>> compressed_rate = pye.preprocessing.get_sampling_rate(compressed)
>>> print(f"Original: {original_rate}s, After compression: {compressed_rate}s")

Notes

Calculation Method: The function computes: mean(time[i+1] - time[i]) for all consecutive pairs. This provides a robust estimate even for trajectories with variable sampling rates.

Interpretation: - 1.0s: High-frequency GPS (1 Hz sampling) - 5.0s: Standard GPS logging (0.2 Hz) - 10-30s: Low-frequency tracking - >60s: Sparse trajectory or vehicle with intermittent tracking

Limitations: - Assumes time column is in chronological order (not validated) - Doesn’t detect or handle outliers in sampling rate - For highly variable rates, mean may not be representative - Returns -1.0 for empty or single-point trajectories

Performance: - Time complexity: O(n) where n is the number of points - Memory: O(n) for time difference array - Fast even for large trajectories (> 1 million points)

Reconstructing Module

Reconstructing module for the pyehicle library.

This module provides algorithms for reconstructing and refining GPS trajectories by combining multiple segments, interpolating curves, and enforcing road network continuity.

pyehicle.reconstructing.curve_interpolation(df, road_network, lower_threshold=20, upper_threshold=80, max_node_distance=10, time_col='time', lat_col='lat', lon_col='lon')[source]

Improve trajectory realism by interpolating points along road curves based on bearing changes.

This is the final step in trajectory reconstruction. It takes a refined trajectory and adds intermediate points at locations where the vehicle direction changes (curves, turns, corners). By detecting bearing changes and filling in the actual road geometry, it creates high-fidelity trajectories that closely follow real-world road paths.

The function analyzes bearing (direction) changes between consecutive points. When a bearing change falls within the specified threshold range, it indicates a curve that would benefit from additional points. The function then: 1. Finds the road network path between the curve endpoints 2. Interpolates multiple points along that path 3. Assigns timestamps proportionally to distance traveled

This is essential for: - Visualizing realistic vehicle paths (trajectories that look like actual driving) - Traffic analysis requiring accurate road following (lane-level positioning) - Route validation and comparison (matching GPS traces to expected routes) - Simulation and animation (smooth, realistic vehicle movement)

Algorithm Overview: 1. Bearing Analysis: Calculate bearing (direction) between each consecutive point pair 2. Bearing Change Detection: Compute absolute change in bearing (accounts for 360° wrap) 3. Curve Identification: Flag points where bearing change ∈ [lower_threshold, upper_threshold] 4. Path Interpolation: For each curve, find road network path and interpolate points 5. Temporal Distribution: Assign timestamps proportional to distance along interpolated path

Why Bearing Thresholds? - Too small (<20°): Captures noise and minor GPS drift, not real curves - Sweet spot (20-80°): Gradual turns, highway exits, curved roads, roundabouts - Too large (>80°): Sharp 90° turns already handled by refine_trajectory()

Parameters:

df (pd.DataFrame) – Input trajectory DataFrame with columns for latitude, longitude, and timestamp. Should be output from refine_trajectory() for best results. Minimum columns: lat_col, lon_col, time_col.
road_network (igraph.Graph) – Road network graph loaded from OSM data. Must have: - Node attributes: ‘x’ (longitude), ‘y’ (latitude) - Edge attributes: ‘osmid’ (OpenStreetMap way ID), ‘geometry’ (LineString) Typically loaded via utilities.road_network.load_road_network().
lower_threshold (float, default=20) – Minimum bearing change in degrees to trigger curve interpolation. Bearing changes smaller than this are considered straight segments (no interpolation needed). Typical values: - 10°: Very sensitive, captures subtle curves (may add noise) - 20°: Recommended default, captures most real curves - 30°: Less sensitive, only significant curves
upper_threshold (float, default=80) – Maximum bearing change in degrees to trigger curve interpolation. Bearing changes larger than this are considered sharp turns (already handled by refine_trajectory() or too abrupt for curve interpolation). Typical values: - 60°: Captures gradual to moderate curves - 80°: Recommended default, avoids sharp 90° turns - 90°: Includes sharper turns (may overlap with refinement)
max_node_distance (float, default=10) – Maximum distance in meters for snapping trajectory points to road nodes during interpolation. Points farther than this will skip interpolation. Typical values: - 10-15m: High-accuracy GPS (recommended) - 20-30m: Standard GPS accuracy - 40-50m: Low-accuracy GPS or sparse road networks
time_col (str, default='time') – Name of the timestamp column in df. Can be Unix timestamps (numeric) or formatted datetime strings. Supports various formats.
lat_col (str, default='lat') – Name of the latitude column in df (WGS84 decimal degrees).
lon_col (str, default='lon') – Name of the longitude column in df (WGS84 decimal degrees).

Returns:

Enhanced trajectory with interpolated points at curves. Contains the same columns as the input (lat_col, lon_col, time_col) with: - Original trajectory points preserved - Intermediate points added at detected curves - Timestamps distributed proportionally to distance - Points sorted chronologically - Duplicates removed Output format: Timestamps as strings (‘%Y-%m-%d %H:%M:%S’)

Return type:

pd.DataFrame

Examples

>>> import pandas as pd
>>> import pyehicle as pye
>>>
>>> # Load refined trajectory (output from refine_trajectory)
>>> df = pd.read_csv('refined_trajectory.csv')
>>> print(f"Refined trajectory: {len(df)} points")
>>>
>>> # Load road network
>>> G = pye.utilities.road_network.load_road_network('city_roads.graphml')
>>>
>>> # Interpolate curves with default thresholds (20-80°)
>>> final = pye.reconstructing.curve_interpolation(
...     df,
...     road_network=G,
...     lower_threshold=20,
...     upper_threshold=80,
...     max_node_distance=10
... )
>>>
>>> print(f"Final trajectory: {len(final)} points")
>>> print(f"Added {len(final) - len(df)} points at curves")
>>>
>>> # Visualize before/after
>>> pye.utilities.visualization.multiple(
...     [df, final],
...     names=['Refined', 'With Curves'],
...     show_in_browser=True
... )

>>> # Full reconstruction pipeline (all steps)
>>> import pyehicle as pye
>>>
>>> # 1. Preprocessing
>>> df = pd.read_csv('raw_gps.csv')
>>> compressed = pye.preprocessing.spatio_temporal_compress(df)
>>> matched = pye.preprocessing.leuven(compressed, city='Riga', country='Latvia')
>>>
>>> # 2. Segmentation and combination
>>> segments = pye.preprocessing.by_time(matched, time_threshold=300)
>>> combined = pye.reconstructing.trajectory_combiner(segments)
>>>
>>> # 3. Refinement (road network continuity)
>>> G = pye.utilities.road_network.load_road_network('riga_roads.graphml')
>>> refined = pye.reconstructing.refine_trajectory(combined, G, max_node_distance=10)
>>>
>>> # 4. Curve interpolation (THIS FUNCTION - final step)
>>> final = pye.reconstructing.curve_interpolation(
...     refined, G,
...     lower_threshold=20,
...     upper_threshold=80,
...     max_node_distance=10
... )
>>>
>>> # Save final high-fidelity trajectory
>>> final.to_csv('reconstructed_trajectory.csv', index=False)
>>> print(f"✓ Reconstruction complete: {len(final)} points")

Notes

Algorithm Details: 1. Calculate bearings between consecutive points using geodesic forward azimuth 2. Compute bearing changes: abs(bearing[i] - bearing[i-1]), handling 360° wrap-around 3. For each point where lower_threshold ≤ bearing_change ≤ upper_threshold:

Find shortest path through road network from previous point to current point

Reconstruct LineString geometry from the path

Calculate cumulative distances along the path

Interpolate timestamps proportionally: t = t1 + (d/total_d) * (t2-t1)

Replace the two original points with the interpolated path points

Deduplicate coordinates and sort by timestamp

Bearing Change Calculation: Bearing changes handle the circular nature of angles (360° = 0°): `python change = abs(bearing2 - bearing1) change = change if change <= 180 else 360 - change ` This ensures: 350° → 10° = 20° change (not 340°)

Temporal Interpolation: Timestamps are distributed proportionally to distance along the interpolated path, assuming constant speed: - t[i] = t1 + (cumulative_distance[i] / total_distance) * (t2 - t1) - This maintains realistic timing and ensures chronological order

Road Network Filtering: The function filters the road network to the trajectory’s bounding box (0.5° buffer) for performance, similar to refine_trajectory().

Performance: - Time complexity: O(n * m * k) where:

n = trajectory points

m = avg points per interpolated curve

k = avg edges in shortest path

For 1000-point trajectory with 20 curves: 10-60 seconds
Dominated by shortest path computation (Dijkstra’s algorithm)

Quality Tuning: - More aggressive (lower_threshold=5, upper_threshold=60):

Captures more curves

More interpolated points

Smoother trajectories

Slower processing

Risk of over-interpolation

Less aggressive (lower_threshold=15, upper_threshold=40): - Captures only significant curves - Fewer interpolated points - Faster processing - May miss subtle curves

Limitations: - Requires road network with valid geometry and OSM IDs - May fail for disconnected roads or missing road segments - Does not handle U-turns well (>180° bearing changes) - Assumes GPS points are reasonably map-matched (best after refine_trajectory()) - Does not consider speed, acceleration, or vehicle dynamics

Use Cases: - Final step in trajectory reconstruction pipeline - Preparing trajectories for visualization and presentation - Generating realistic vehicle paths for traffic simulation - Route validation and GPS trace comparison - Creating training data for machine learning (realistic vehicle behavior)

When NOT to Use: - Raw GPS data (use refine_trajectory() first) - Trajectories with very high sampling rate (already smooth) - Applications not requiring curve realism (simple analysis)

See also

trajectory_combiner: First step - combine and sort trajectory segments
refine_trajectory: Second step - enforce road network continuity
calculate_bearings_pyproj: Calculate trajectory bearings for curve detection
utilities.road_network.load_road_network: Load OSM road network graph

pyehicle.reconstructing.refine_trajectory(df, road_network, max_node_distance=10, time_col='time', lat_col='lat', lon_col='lon')[source]

Refine GPS trajectory by enforcing spatial continuity with the underlying road network.

This is the main trajectory refinement function and the second step in reconstruction. It takes a raw or preprocessed GPS trajectory and “snaps” it to realistic road paths by detecting road transitions and interpolating corner points at intersections. The result is a refined trajectory that follows actual road geometry.

The function processes the trajectory sequentially, detecting when consecutive points lie on different roads (OSM ID changes) and filling gaps with corner points that follow the road network. This is essential for: - Converting GPS traces into road-matched trajectories - Preparing data for traffic analysis and route reconstruction - Generating realistic vehicle paths for simulation - Improving trajectory quality before curve interpolation

Algorithm Overview: 1. Preprocessing: Convert timestamps, remove duplicates, filter road network to bounding box 2. Initialization: Build spatial index, initialize refined trajectory 3. Sequential Processing: For each consecutive pair of trajectory points:

Find nearest road edges (OSM IDs) for both points

If OSM IDs differ (road transition): Interpolate corner points at intersection

If OSM IDs match (same road): Add current point directly

Finalization: Convert back to DataFrame with formatted timestamps

OSM ID Detection: The function uses OpenStreetMap way IDs to identify specific roads. When consecutive trajectory points have different OSM IDs, it indicates a road transition (e.g., turning at an intersection). The function then calls interpolate_corner_point() to fill the gap with realistic corner points.

Parameters:

df (pd.DataFrame) – Input trajectory DataFrame with columns for latitude, longitude, and timestamp. Can be raw GPS data or preprocessed (e.g., after compression or map-matching). Minimum columns: lat_col, lon_col, time_col.
road_network (igraph.Graph) – Road network graph loaded from OSM data. Must have: - Node attributes: ‘x’ (longitude), ‘y’ (latitude) - Edge attributes: ‘osmid’ (OpenStreetMap way ID), ‘geometry’ (LineString) Typically loaded via utilities.road_network.load_road_network().
max_node_distance (float, default=10) – Maximum distance in meters for snapping trajectory points to road nodes. Points farther than this from any same-road node will skip corner interpolation. Typical values: - 10-15m: High-accuracy GPS (differential GPS, RTK) - 20-30m: Standard GPS (smartphone, consumer device) - 40-50m: Low-accuracy GPS or sparse road networks
time_col (str, default='time') – Name of the timestamp column in df. Must be parseable by pandas.to_datetime(). Supports various formats: ISO 8601, Unix timestamps, custom datetime strings.
lat_col (str, default='lat') – Name of the latitude column in df (WGS84 decimal degrees).
lon_col (str, default='lon') – Name of the longitude column in df (WGS84 decimal degrees).

Returns:

Refined trajectory with interpolated corner points. Contains the same columns as the input (lat_col, lon_col, time_col) with: - Original trajectory points preserved - Corner points inserted at road transitions - Timestamps distributed proportionally to distance - Points sorted chronologically - Duplicates removed Output format: Timestamps as strings (‘%Y-%m-%d %H:%M:%S’)

Return type:

pd.DataFrame

Examples

>>> import pandas as pd
>>> import pyehicle as pye
>>>
>>> # Load GPS trajectory
>>> df = pd.read_csv('gps_trajectory.csv')
>>> print(f"Raw trajectory: {len(df)} points")
>>>
>>> # Load road network for the area
>>> G = pye.utilities.road_network.load_road_network(
...     'city_roads.graphml'
... )
>>>
>>> # Refine trajectory with road network continuity
>>> refined = pye.reconstructing.refine_trajectory(
...     df,
...     road_network=G,
...     max_node_distance=10,
...     time_col='time',
...     lat_col='lat',
...     lon_col='lon'
... )
>>>
>>> print(f"Refined trajectory: {len(refined)} points")
>>> print(f"Added {len(refined) - len(df)} corner points")
>>>
>>> # Visualize before/after
>>> pye.utilities.visualization.multiple(
...     [df, refined],
...     names=['Raw GPS', 'Refined'],
...     show_in_browser=True
... )

>>> # Full preprocessing + reconstruction pipeline
>>> import pyehicle as pye
>>>
>>> # 1. Load and preprocess
>>> df = pd.read_csv('full_day_gps.csv')
>>> compressed = pye.preprocessing.spatio_temporal_compress(df)
>>> matched = pye.preprocessing.leuven(compressed, city='Riga', country='Latvia')
>>>
>>> # 2. Split by time gaps
>>> segments = pye.preprocessing.by_time(matched, time_threshold=300)
>>>
>>> # 3. Combine segments spatially
>>> combined = pye.reconstructing.trajectory_combiner(segments)
>>>
>>> # 4. Refine with road network (this function)
>>> G = pye.utilities.road_network.load_road_network('riga_roads.graphml')
>>> refined = pye.reconstructing.refine_trajectory(combined, G)
>>>
>>> # 5. Final curve interpolation
>>> final = pye.reconstructing.curve_interpolation(refined, G)
>>>
>>> # Save result
>>> final.to_csv('reconstructed_trajectory.csv', index=False)

Notes

Algorithm Details: The refinement process ensures that trajectories follow realistic road paths: 1. For each consecutive pair of points, check if they’re on the same road (OSM ID) 2. If on different roads → Road transition detected → Interpolate corner points 3. Corner interpolation handles two scenarios:

Adjacent roads with direct intersection → Single corner point

Non-adjacent roads → Path reconstruction along network

Timestamps distributed proportionally to distance (constant speed assumption)
Deduplication prevents duplicate coordinates and timestamps

Road Network Filtering: The function filters the road network to the trajectory’s bounding box (with 0.5° buffer) for performance. This dramatically speeds up spatial queries for large road networks.

Spatial Index: Uses R-tree spatial index for fast nearest-edge queries. Built automatically from the road network using preprocess_road_segments().

Duplicate Handling: - Input duplicates removed before processing (same lat/lon) - Output duplicates removed after interpolation - Prevents issues with zero-length edges and invalid geometries

Temporal Ordering: - Points sorted by timestamp before processing - Corner points assigned interpolated timestamps - Final output re-sorted to ensure chronological order

Performance: - Time complexity: O(n * m) where n = trajectory points, m = avg corner points per transition - For 1000-point trajectory with 50 road transitions: 5-30 seconds - Dominated by corner interpolation (shortest path computations) - Large road networks (>100k edges) benefit from bounding box filtering

Limitations: - Requires accurate OSM IDs from map-matching or spatial queries - May fail for disconnected road networks (e.g., ferry routes) - Assumes roads have valid geometry and OSM IDs in the graph - Corner interpolation may fail if points are too far from roads (>max_node_distance) - Does not consider turn restrictions, traffic rules, or one-way streets

Use Cases: - Preparing trajectories for curve interpolation (next step in reconstruction) - Converting raw GPS logs to road-matched trajectories - Traffic flow analysis and route reconstruction - Vehicle path simulation for autonomous driving research - Improving trajectory quality for visualization and analysis

Quality Assurance: The function is designed to be robust: - Handles missing geometries gracefully - Skips problematic segments instead of failing entirely - Preserves original points when corner interpolation fails - Maintains temporal continuity throughout

See also

trajectory_combiner: First step - spatially sort trajectory segments
curve_interpolation: Third step - interpolate smooth curves at corners
utilities.road_network.load_road_network: Load OSM road network graph
preprocessing.leuven: HMM map-matching for OSM ID assignment

pyehicle.reconstructing.trajectory_combiner(dataframes, lat_col='lat', lon_col='lon', time_col='time')[source]

Combine multiple topologically equivalent trajectories and sort them by geographical proximity.

This function takes multiple topologically equivalent trajectory DataFrames (e.g., different bus runs along the same route, multiple GPS recordings of the same path) and intelligently combines them into a single unified trajectory by sorting points based on nearest-neighbor distances. The algorithm starts from the first point and iteratively selects the closest unvisited point, creating a spatially continuous trajectory.

This is particularly useful for: - Combining multiple recordings of the same bus/vehicle route - Merging GPS traces from different trips along the same path - Reconstructing a canonical route from multiple trajectory samples - Preparing topologically equivalent trajectories for refinement

The function uses true geodesic distances (WGS84 ellipsoid) rather than Euclidean distances, ensuring accuracy across all latitudes and distances.

Parameters:

dataframes (list of pd.DataFrame) – A list of DataFrames containing topologically equivalent trajectories (e.g., multiple GPS recordings of the same route). Each DataFrame must have latitude and longitude columns. The DataFrames can have different lengths and may contain additional columns (all column types will be preserved).
lat_col (str, default='lat') – The column name for latitude values (WGS84 decimal degrees).
lon_col (str, default='lon') – The column name for longitude values (WGS84 decimal degrees).
time_col (str, default='time') – The column name for timestamp values. Used for timestamp re-assignment based on cumulative distance traveled.

Returns:

A single DataFrame containing the combined trajectory, sorted by time after spatial reordering and timestamp re-assignment. The output may contain fewer points than the input due to duplicate coordinate removal (points at identical lat/lon locations are deduplicated, keeping only the first occurrence). The index is reset (0, 1, 2, …, n-1). All original column types are preserved.

Return type:

pd.DataFrame

Examples

>>> import pandas as pd
>>> import pyehicle as pye
>>>
>>> # Combine multiple bus runs along the same route
>>> run1 = pd.read_csv('bus_route_66_monday_morning.csv')
>>> run2 = pd.read_csv('bus_route_66_monday_afternoon.csv')
>>> run3 = pd.read_csv('bus_route_66_tuesday_morning.csv')
>>>
>>> # Combine all runs into a single canonical trajectory
>>> combined = pye.reconstructing.trajectory_combiner(
...     [run1, run2, run3],
...     lat_col='lat',
...     lon_col='lon',
...     time_col='time'
... )
>>> print(f"Combined {len(run1) + len(run2) + len(run3)} points into {len(combined)} points")
>>> # Note: Point count may be less due to duplicate coordinate removal

>>> # Combine multiple GPS recordings of the same delivery route
>>> monday_route = pd.read_csv('delivery_monday.csv')
>>> tuesday_route = pd.read_csv('delivery_tuesday.csv')
>>> wednesday_route = pd.read_csv('delivery_wednesday.csv')
>>>
>>> # Create canonical route from multiple recordings
>>> canonical_route = pye.reconstructing.trajectory_combiner(
...     [monday_route, tuesday_route, wednesday_route]
... )

Notes

Algorithm:

The function implements a multi-step reconstruction algorithm:

Concatenation: All input DataFrames are concatenated into one
Spatial Sorting: Greedy nearest-neighbor algorithm: - Start from the first point (index 0) - Calculate geodesic distance from current point to all unvisited points - Select the nearest unvisited point as the next point - Mark the selected point as visited and repeat
Duplicate Removal: Remove points with identical (lat, lon) coordinates, keeping only the first occurrence. This eliminates redundant points from overlapping segments or stationary GPS readings.
Timestamp Re-assignment: Calculate cumulative distances along the spatially- sorted path and re-assign timestamps proportional to distance traveled (constant speed assumption), preserving the original time range.
Time Sorting: Sort by time to maintain the library’s assumption of time-ordered trajectories. Since timestamps are now monotonically increasing along the path, this preserves spatial continuity.

Performance: - Time complexity: O(n²) where n is the total number of points - Space complexity: O(n) for coordinate arrays and visited tracking - For 1000 points: ~1 second - For 10,000 points: ~100 seconds (consider preprocessing to reduce points first)

Accuracy: - Uses pyproj Geod.inv() for true geodesic distances on WGS84 ellipsoid - Accurate to millimeter precision for all distances - No projection distortion (unlike Euclidean distance in lat/lon)

Important Considerations:

Point Count Reduction: The output may have fewer points than the input due to duplicate coordinate removal. Points with identical (lat, lon) values are deduplicated, keeping only the first occurrence. This affects: * Loop trajectories (where paths cross themselves) * Stationary GPS readings (e.g., stopped at traffic lights) * Overlapping segments from multiple sources
Timestamp Modification: Original timestamps are discarded and replaced with new timestamps calculated from cumulative distances. The new timestamps: * Preserve the original time range (min to max) * Assume constant speed along the spatially-sorted path * Are monotonically increasing, ensuring time-ordered output
Order Dependency: The algorithm starts from the first point of the first DataFrame in the list. Different input orders may produce different results if segments overlap spatially.
Greedy Strategy: Uses nearest-neighbor heuristic, not globally optimal. For complex overlapping segments, may not produce the best possible ordering. However, this is fast and works well for most GPS trajectories.
Memory Efficiency: Coordinates are pre-extracted as numpy arrays for faster vectorized operations. Uses boolean array for visited tracking instead of a set for better iteration performance.
Column Compatibility: All DataFrames are concatenated, so column names and types must be compatible across all input DataFrames. All column types are preserved in the output.

Use Cases: - Route Canonicalization: Combine multiple GPS recordings of the same bus/vehicle

route to create a single canonical trajectory representative of that route

Multi-Trip Fusion: Merge GPS traces from different trips along the same path (e.g., daily commutes, delivery routes, public transit routes)
Reconstruction Pipeline: First step before refine_trajectory() and curve_interpolation() for final trajectory refinement

Limitations: - Not suitable for very large datasets (>50,000 points) without preprocessing - Assumes segments are generally spatially separated (not heavily overlapping) - Duplicate coordinates are removed, which may be undesirable for:

Loop trajectories where the path legitimately crosses itself

Scenarios requiring stationary points (vehicle stopped, waiting)

Accurate point-count preservation

Timestamps are synthetic (distance-based), losing original timing information
Constant speed assumption may not reflect actual vehicle behavior
O(n²) complexity makes it slow for large point counts

See also

refine_trajectory: Second step - enforces road network continuity
curve_interpolation: Final step - interpolates smooth curves
by_time: Splits trajectories by temporal gaps

Trajectory Combiner

pyehicle.reconstructing.trajectory_combiner(dataframes, lat_col='lat', lon_col='lon', time_col='time')[source]

Combine multiple topologically equivalent trajectories and sort them by geographical proximity.

This function takes multiple topologically equivalent trajectory DataFrames (e.g., different bus runs along the same route, multiple GPS recordings of the same path) and intelligently combines them into a single unified trajectory by sorting points based on nearest-neighbor distances. The algorithm starts from the first point and iteratively selects the closest unvisited point, creating a spatially continuous trajectory.

This is particularly useful for: - Combining multiple recordings of the same bus/vehicle route - Merging GPS traces from different trips along the same path - Reconstructing a canonical route from multiple trajectory samples - Preparing topologically equivalent trajectories for refinement

The function uses true geodesic distances (WGS84 ellipsoid) rather than Euclidean distances, ensuring accuracy across all latitudes and distances.

Parameters:

dataframes (list of pd.DataFrame) – A list of DataFrames containing topologically equivalent trajectories (e.g., multiple GPS recordings of the same route). Each DataFrame must have latitude and longitude columns. The DataFrames can have different lengths and may contain additional columns (all column types will be preserved).
lat_col (str, default='lat') – The column name for latitude values (WGS84 decimal degrees).
lon_col (str, default='lon') – The column name for longitude values (WGS84 decimal degrees).
time_col (str, default='time') – The column name for timestamp values. Used for timestamp re-assignment based on cumulative distance traveled.

Returns:

A single DataFrame containing the combined trajectory, sorted by time after spatial reordering and timestamp re-assignment. The output may contain fewer points than the input due to duplicate coordinate removal (points at identical lat/lon locations are deduplicated, keeping only the first occurrence). The index is reset (0, 1, 2, …, n-1). All original column types are preserved.

Return type:

pd.DataFrame

Examples

>>> import pandas as pd
>>> import pyehicle as pye
>>>
>>> # Combine multiple bus runs along the same route
>>> run1 = pd.read_csv('bus_route_66_monday_morning.csv')
>>> run2 = pd.read_csv('bus_route_66_monday_afternoon.csv')
>>> run3 = pd.read_csv('bus_route_66_tuesday_morning.csv')
>>>
>>> # Combine all runs into a single canonical trajectory
>>> combined = pye.reconstructing.trajectory_combiner(
...     [run1, run2, run3],
...     lat_col='lat',
...     lon_col='lon',
...     time_col='time'
... )
>>> print(f"Combined {len(run1) + len(run2) + len(run3)} points into {len(combined)} points")
>>> # Note: Point count may be less due to duplicate coordinate removal

>>> # Combine multiple GPS recordings of the same delivery route
>>> monday_route = pd.read_csv('delivery_monday.csv')
>>> tuesday_route = pd.read_csv('delivery_tuesday.csv')
>>> wednesday_route = pd.read_csv('delivery_wednesday.csv')
>>>
>>> # Create canonical route from multiple recordings
>>> canonical_route = pye.reconstructing.trajectory_combiner(
...     [monday_route, tuesday_route, wednesday_route]
... )

Notes

Algorithm:

The function implements a multi-step reconstruction algorithm:

Concatenation: All input DataFrames are concatenated into one
Spatial Sorting: Greedy nearest-neighbor algorithm: - Start from the first point (index 0) - Calculate geodesic distance from current point to all unvisited points - Select the nearest unvisited point as the next point - Mark the selected point as visited and repeat
Duplicate Removal: Remove points with identical (lat, lon) coordinates, keeping only the first occurrence. This eliminates redundant points from overlapping segments or stationary GPS readings.
Timestamp Re-assignment: Calculate cumulative distances along the spatially- sorted path and re-assign timestamps proportional to distance traveled (constant speed assumption), preserving the original time range.
Time Sorting: Sort by time to maintain the library’s assumption of time-ordered trajectories. Since timestamps are now monotonically increasing along the path, this preserves spatial continuity.

Performance: - Time complexity: O(n²) where n is the total number of points - Space complexity: O(n) for coordinate arrays and visited tracking - For 1000 points: ~1 second - For 10,000 points: ~100 seconds (consider preprocessing to reduce points first)

Accuracy: - Uses pyproj Geod.inv() for true geodesic distances on WGS84 ellipsoid - Accurate to millimeter precision for all distances - No projection distortion (unlike Euclidean distance in lat/lon)

Important Considerations:

Point Count Reduction: The output may have fewer points than the input due to duplicate coordinate removal. Points with identical (lat, lon) values are deduplicated, keeping only the first occurrence. This affects: * Loop trajectories (where paths cross themselves) * Stationary GPS readings (e.g., stopped at traffic lights) * Overlapping segments from multiple sources
Timestamp Modification: Original timestamps are discarded and replaced with new timestamps calculated from cumulative distances. The new timestamps: * Preserve the original time range (min to max) * Assume constant speed along the spatially-sorted path * Are monotonically increasing, ensuring time-ordered output
Order Dependency: The algorithm starts from the first point of the first DataFrame in the list. Different input orders may produce different results if segments overlap spatially.
Greedy Strategy: Uses nearest-neighbor heuristic, not globally optimal. For complex overlapping segments, may not produce the best possible ordering. However, this is fast and works well for most GPS trajectories.
Memory Efficiency: Coordinates are pre-extracted as numpy arrays for faster vectorized operations. Uses boolean array for visited tracking instead of a set for better iteration performance.
Column Compatibility: All DataFrames are concatenated, so column names and types must be compatible across all input DataFrames. All column types are preserved in the output.

Use Cases: - Route Canonicalization: Combine multiple GPS recordings of the same bus/vehicle

route to create a single canonical trajectory representative of that route

Multi-Trip Fusion: Merge GPS traces from different trips along the same path (e.g., daily commutes, delivery routes, public transit routes)
Reconstruction Pipeline: First step before refine_trajectory() and curve_interpolation() for final trajectory refinement

Limitations: - Not suitable for very large datasets (>50,000 points) without preprocessing - Assumes segments are generally spatially separated (not heavily overlapping) - Duplicate coordinates are removed, which may be undesirable for:

Loop trajectories where the path legitimately crosses itself

Scenarios requiring stationary points (vehicle stopped, waiting)

Accurate point-count preservation

Timestamps are synthetic (distance-based), losing original timing information
Constant speed assumption may not reflect actual vehicle behavior
O(n²) complexity makes it slow for large point counts

See also

refine_trajectory: Second step - enforces road network continuity
curve_interpolation: Final step - interpolates smooth curves
by_time: Splits trajectories by temporal gaps

Trajectory Refinement

pyehicle.reconstructing.refine_trajectory(df, road_network, max_node_distance=10, time_col='time', lat_col='lat', lon_col='lon')[source]

Refine GPS trajectory by enforcing spatial continuity with the underlying road network.

This is the main trajectory refinement function and the second step in reconstruction. It takes a raw or preprocessed GPS trajectory and “snaps” it to realistic road paths by detecting road transitions and interpolating corner points at intersections. The result is a refined trajectory that follows actual road geometry.

The function processes the trajectory sequentially, detecting when consecutive points lie on different roads (OSM ID changes) and filling gaps with corner points that follow the road network. This is essential for: - Converting GPS traces into road-matched trajectories - Preparing data for traffic analysis and route reconstruction - Generating realistic vehicle paths for simulation - Improving trajectory quality before curve interpolation

Algorithm Overview: 1. Preprocessing: Convert timestamps, remove duplicates, filter road network to bounding box 2. Initialization: Build spatial index, initialize refined trajectory 3. Sequential Processing: For each consecutive pair of trajectory points:

Find nearest road edges (OSM IDs) for both points

If OSM IDs differ (road transition): Interpolate corner points at intersection

If OSM IDs match (same road): Add current point directly

Finalization: Convert back to DataFrame with formatted timestamps

OSM ID Detection: The function uses OpenStreetMap way IDs to identify specific roads. When consecutive trajectory points have different OSM IDs, it indicates a road transition (e.g., turning at an intersection). The function then calls interpolate_corner_point() to fill the gap with realistic corner points.

Parameters:

df (pd.DataFrame) – Input trajectory DataFrame with columns for latitude, longitude, and timestamp. Can be raw GPS data or preprocessed (e.g., after compression or map-matching). Minimum columns: lat_col, lon_col, time_col.
road_network (igraph.Graph) – Road network graph loaded from OSM data. Must have: - Node attributes: ‘x’ (longitude), ‘y’ (latitude) - Edge attributes: ‘osmid’ (OpenStreetMap way ID), ‘geometry’ (LineString) Typically loaded via utilities.road_network.load_road_network().
max_node_distance (float, default=10) – Maximum distance in meters for snapping trajectory points to road nodes. Points farther than this from any same-road node will skip corner interpolation. Typical values: - 10-15m: High-accuracy GPS (differential GPS, RTK) - 20-30m: Standard GPS (smartphone, consumer device) - 40-50m: Low-accuracy GPS or sparse road networks
time_col (str, default='time') – Name of the timestamp column in df. Must be parseable by pandas.to_datetime(). Supports various formats: ISO 8601, Unix timestamps, custom datetime strings.
lat_col (str, default='lat') – Name of the latitude column in df (WGS84 decimal degrees).
lon_col (str, default='lon') – Name of the longitude column in df (WGS84 decimal degrees).

Returns:

Refined trajectory with interpolated corner points. Contains the same columns as the input (lat_col, lon_col, time_col) with: - Original trajectory points preserved - Corner points inserted at road transitions - Timestamps distributed proportionally to distance - Points sorted chronologically - Duplicates removed Output format: Timestamps as strings (‘%Y-%m-%d %H:%M:%S’)

Return type:

pd.DataFrame

Examples

>>> import pandas as pd
>>> import pyehicle as pye
>>>
>>> # Load GPS trajectory
>>> df = pd.read_csv('gps_trajectory.csv')
>>> print(f"Raw trajectory: {len(df)} points")
>>>
>>> # Load road network for the area
>>> G = pye.utilities.road_network.load_road_network(
...     'city_roads.graphml'
... )
>>>
>>> # Refine trajectory with road network continuity
>>> refined = pye.reconstructing.refine_trajectory(
...     df,
...     road_network=G,
...     max_node_distance=10,
...     time_col='time',
...     lat_col='lat',
...     lon_col='lon'
... )
>>>
>>> print(f"Refined trajectory: {len(refined)} points")
>>> print(f"Added {len(refined) - len(df)} corner points")
>>>
>>> # Visualize before/after
>>> pye.utilities.visualization.multiple(
...     [df, refined],
...     names=['Raw GPS', 'Refined'],
...     show_in_browser=True
... )

>>> # Full preprocessing + reconstruction pipeline
>>> import pyehicle as pye
>>>
>>> # 1. Load and preprocess
>>> df = pd.read_csv('full_day_gps.csv')
>>> compressed = pye.preprocessing.spatio_temporal_compress(df)
>>> matched = pye.preprocessing.leuven(compressed, city='Riga', country='Latvia')
>>>
>>> # 2. Split by time gaps
>>> segments = pye.preprocessing.by_time(matched, time_threshold=300)
>>>
>>> # 3. Combine segments spatially
>>> combined = pye.reconstructing.trajectory_combiner(segments)
>>>
>>> # 4. Refine with road network (this function)
>>> G = pye.utilities.road_network.load_road_network('riga_roads.graphml')
>>> refined = pye.reconstructing.refine_trajectory(combined, G)
>>>
>>> # 5. Final curve interpolation
>>> final = pye.reconstructing.curve_interpolation(refined, G)
>>>
>>> # Save result
>>> final.to_csv('reconstructed_trajectory.csv', index=False)

Notes

Algorithm Details: The refinement process ensures that trajectories follow realistic road paths: 1. For each consecutive pair of points, check if they’re on the same road (OSM ID) 2. If on different roads → Road transition detected → Interpolate corner points 3. Corner interpolation handles two scenarios:

Adjacent roads with direct intersection → Single corner point

Non-adjacent roads → Path reconstruction along network

Timestamps distributed proportionally to distance (constant speed assumption)
Deduplication prevents duplicate coordinates and timestamps