Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some operators do not report "siri.scheduled_start_time" equals the "gtfs.start_time" #17

Open
ShayAdler opened this issue May 15, 2023 · 2 comments

Comments

@ShayAdler
Copy link
Contributor

Opening this issue to zoom in into a specific problem I found while investigating siri and gtfs matching, that happens in a small operator in Jerusalem and we should look dipper into -

  • Example:
    • Line 234 of operator ירושלים-דרום איחוד, rides in 8/5
    • If we search for matching to the planned rides this day, we only get 7, even that if looking specifically on siri, we see 39 rows on this day
    • The siri.scheduled_start_time rows that didn't match are close, but not equal to the gtfs.start_time, and they also have the milliseconds part in the time not zeroed, which is also different from the pattern we've seen when matching works (same for the rides that did match this day)
image
-- the rides of this line that will match with 5 min tolerance
select siri_ride.id, siri_route.operator_ref, gtfs_route.agency_name, siri_route.line_ref, gtfs_route.route_short_name, gtfs_route.route_long_name, gtfs_ride.start_time = scheduled_start_time, gtfs_ride.start_time, siri_ride.scheduled_start_time, gtfs_route.date
        from gtfs_ride, gtfs_route, siri_route, siri_ride
        where
        gtfs_route.id = gtfs_ride.gtfs_route_id
        and gtfs_route.operator_ref = siri_route.operator_ref
        and gtfs_route.line_ref = siri_route.line_ref
        and siri_route.id = siri_ride.siri_route_id
        and gtfs_route.date between '2023-05-04' and '2023-05-10'
    --     and siri_ride.scheduled_start_time = gtfs_ride.start_time
          and siri_ride.scheduled_start_time > gtfs_ride.start_time - '5 minutes'::interval
        and siri_ride.scheduled_start_time < gtfs_ride.start_time + '5 minutes'::interval
        and scheduled_time_gtfs_ride_id is null and DATE_TRUNC('hour', scheduled_start_time) = '2023-05-08 01:00:00.000000'
    and route_short_name='234'

-- the matched to gtfs
select gtfs_route.id, gr.start_time, scheduled_start_time, scheduled_time_gtfs_ride_id, sr.id from gtfs_route
         join gtfs_ride gr on gtfs_route.id = gr.gtfs_route_id
         left  join  siri_ride sr on gr.id = sr.scheduled_time_gtfs_ride_id
         where date ='2023-05-08' and line_ref=15136 and operator_ref=50

-- all the siri rides from this day
select scheduled_start_time from siri_ride
join siri_route sr on siri_ride.siri_route_id = sr.id
where DATE_TRUNC('day', scheduled_start_time) = '2023-05-08'
and line_ref=15136 and operator_ref=50
  • Questions:
    • In which other operator does this specific issue happen?
    • How common is this in percentages for those operators?
    • Does this happen in specific hours? Is there any pattern (e.g. un-planned rides)?
@ShayAdler
Copy link
Contributor Author

If looking at the missing percentage per operator, we can see that all the top ones are also small operators from Jerusalem, so this is what made me think this is a pattern specifically for them (should be verified of course)-

-- misses per operator
SELECT gr.agency_name,

  COUNT(*) FILTER (WHERE scheduled_time_gtfs_ride_id IS NULL) AS new_null_count,
  COUNT(*) FILTER (WHERE scheduled_time_gtfs_ride_id IS NOT NULL) AS new_not_null_count,
COUNT(*) FILTER (WHERE scheduled_time_gtfs_ride_id IS NULL) * 100.0 / COUNT(*) AS new_null_percentage
FROM siri_ride
join siri_route sr on siri_ride.siri_route_id = sr.id
join gtfs_route gr on sr.line_ref = gr.line_ref
WHERE DATE_TRUNC('day', scheduled_start_time) > '2023-04-29'
and gr.date > '2023-04-29'
GROUP BY gr.agency_name

misses_per_operator_2.csv

@OriHoch
Copy link
Contributor

OriHoch commented May 16, 2023

@ShayAdler when querying gtfs data you must limit by gtfs date, otherwise you are getting duplicated records per gtfs_date

You can see this by doing this query:

    SELECT siri_ride.id, gr.date
    FROM siri_ride
    join siri_route sr on siri_ride.siri_route_id = sr.id
    join gtfs_route gr on sr.line_ref = gr.line_ref
    WHERE DATE_TRUNC('day', scheduled_start_time) > '2023-04-29'
    and gr.date > '2023-04-29'

you will see that each siri_ride.id is duplicated by how many gr.dates there are

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants