fix incorrect datapoints in datapoints table, geostreams db #581
Comments
Can you run a query to show how many datapoints are affected? |
And, which DB and which data points? |
Updated issue to add which database and table. |
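The result of this query
select count(*) from datapoints where ST_Distance(geog, ST_SetSRID(ST_Point(-112.013473,33.039463),4326)::geography) > 10000;
is 18773153, and there are 137686191 datapoints total. So about 13.6% of the records in datapoints are more than 10 km from the Maricopa reference point, i.e. outside Maricopa. I'll be trying out a fix on a much smaller table locally. |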
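For anyone who wants to sanity-check individual points outside the database, here is a minimal Python sketch of the same filter: distance greater than 10000 m from the point (-112.013473, 33.039463). It uses the haversine formula on a sphere, so it only approximates what ST_Distance computes on a geography; the function names and the Urbana coordinates are assumptions for illustration, not something from this thread.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius; a sphere, not the WGS84 spheroid

# Reference point and cutoff taken from the query (lon, lat in SRID 4326; metres).
REF_LON, REF_LAT = -112.013473, 33.039463
THRESHOLD_M = 10000


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_far_from_reference(lon, lat, threshold_m=THRESHOLD_M):
    """True if the point lies beyond the cutoff from the reference point."""
    return haversine_m(REF_LON, REF_LAT, lon, lat) > threshold_m


# Counts reported in this thread.
affected, total = 18773153, 137686191
share = affected / total  # fraction of rows beyond the cutoff

# Approximate coordinates for Urbana, IL (an assumption, not from the thread).
URBANA_LON, URBANA_LAT = -88.2073, 40.1106
print(is_far_from_reference(URBANA_LON, URBANA_LAT), round(share, 4))
```

Note that a point in Urbana is also flagged by this filter, so the count includes any legitimate non-Maricopa sites as well as the bad rows.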
There are likely data points from Urbana, perhaps also some from St Louis and Manhattan, KS
…________________________________
From: Todd Nicholson <[email protected]>
Sent: Thursday, May 23, 2019 5:25:29 PM
To: terraref/computing-pipeline
Cc: LeBauer, David Shaner - (dlebauer); Comment
Subject: Re: [terraref/computing-pipeline] fix incorrect datapoints in datapoints table, geostreams db (#581)
The result of this query
select count(*) from datapoints where ST_Distance(geog, ST_SetSRID(ST_Point(-112.013473,33.039463),4326)::geography) > 10000;
is 18773153
and there are 137686191 datapoints total.
So that means about 18773153 points or approximately 13% of the records in datapoints are outside Maricopa.
I'll be trying out a fix on a much smaller table locally.
|
So I ran some queries to find out where the datapoints in the db are. Total number of datapoints = 137686191. I ran this query based on the original issue #484
The result was 18592001 entries. This accounts for all the entries in datapoints that are neither in Arizona nor near Urbana. I then took one of these entries. The gid is 164759042 and the source dataset is and the geog is
I wrote a Python script to get lat/lon from the geography. It uses shapely.wkb, and for a valid geography the function returns a geometry (Point, MultiPolygon, etc.) with lat/lon within AZ.
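The script itself is not preserved in this thread. As a rough stand-in, here is a stdlib-only sketch that decodes only the simplest case, a 2D point, assuming the geog value is the hex EWKB string that PostGIS prints by default. It does not use shapely.wkb, and the function name and constants are hypothetical.

```python
import struct

# EWKB flag bits carried in the geometry type word.
EWKB_Z_FLAG = 0x80000000
EWKB_M_FLAG = 0x40000000
EWKB_SRID_FLAG = 0x20000000
WKB_POINT = 1


def decode_ewkb_point(hex_ewkb):
    """Return (lon, lat, srid) for a 2D point given as an EWKB hex string."""
    raw = bytes.fromhex(hex_ewkb)
    endian = "<" if raw[0] == 1 else ">"  # first byte: 1 = little-endian
    (type_word,) = struct.unpack_from(endian + "I", raw, 1)
    geom_type = type_word & 0x0FFFFFFF
    if type_word & (EWKB_Z_FLAG | EWKB_M_FLAG) or geom_type != WKB_POINT:
        raise ValueError("only 2D points are handled in this sketch")
    offset = 5
    srid = None
    if type_word & EWKB_SRID_FLAG:  # SRID is present only when flagged
        (srid,) = struct.unpack_from(endian + "I", raw, offset)
        offset += 4
    lon, lat = struct.unpack_from(endian + "2d", raw, offset)
    return lon, lat, srid


# Build a sample value rather than quoting a real row from the table.
sample = struct.pack(
    "<BIIdd", 1, WKB_POINT | EWKB_SRID_FLAG, 4326, -112.013473, 33.039463
).hex()
lon, lat, srid = decode_ewkb_point(sample)
print(lon, lat, srid)
```

Polygons and multi-geometries need a full WKB parser such as shapely.wkb; this sketch raises ValueError for them.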
But when I pass in the geography from one of the 'bad' datapoints, an error is thrown:
The same error is thrown for other datapoints not in AZ or near Urbana. This sort of makes sense, since the query to get them uses lat, lon order instead of lon, lat order. The datasets referenced in the data field of these datapoints entries have the correct lat/lon, so a correct geography can be created from them. I'm not sure how the geog is generated for this table, but I figured I'd report what I found so far. |
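To illustrate why coordinate order matters here, this is a small Python sketch of a range check that flags pairs which only make sense with the two values exchanged. It is an illustration under the assumption that x is meant to be longitude and y latitude; it does not establish how the bad rows in datapoints were actually produced, and the function name is hypothetical.

```python
def classify_lon_lat(lon, lat):
    """Classify an (x, y) pair that is supposed to be (lon, lat) in degrees."""
    if -180 <= lon <= 180 and -90 <= lat <= 90:
        return "valid"
    if -180 <= lat <= 180 and -90 <= lon <= 90:
        return "swapped"  # in range only if the two values are exchanged
    return "invalid"


# The Maricopa reference point from this thread, in both orders.
print(classify_lon_lat(-112.013473, 33.039463))
print(classify_lon_lat(33.039463, -112.013473))
```

A range check like this cannot catch every swap: a pair whose two values both fall within -90..90 passes as valid in either order, so a distance check against the known sites is still needed.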
@tcnichol keep in mind you can use ST_X() and ST_Y() to parse the geog. I use something like |
I had to use ST_AsText on the geog, but that worked: those points have a lat/lon in the Atlantic. Again, I'm not sure of the best way to fix this, since I don't know how this database table is populated. If it's done by an extractor, then resubmitting the datasets might work. If we assume an offset based on the lon summing up to 180, then a script to update the db directly could work. |
I created a datapoints table locally and added points that replicate the error found in datapoints. This query updates the lat/lon
After this, all the points that were formerly in the Atlantic have a lat/lon within Maricopa. My local db table was very small (just 10 datapoints), so I am not sure how long this would take on the full table. I could also copy datapoints and run this update on the copy. |
Please run an EXPLAIN on this query (https://www.postgresql.org/docs/9.1/sql-explain.html) so we can get a sense of how it will be executed, which indices exist, etc. |
Running this, but it seems to take a very long time. Will post the results once it is done.
|
It stopped, but I got this error:
Will be looking into this. |
I found a query that works. I use ST_MakeValid and then take the centroid of each geography. Many of them are polygons that are not closed. This 'explain' query might help:
Will check on the server. |
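The idea of closing an open ring and then taking a centroid can be sketched in plain Python. This is not what ST_MakeValid or ST_Centroid do internally; it is a planar approximation that treats lon/lat as x/y, and the function names are hypothetical.

```python
def close_ring(ring):
    """Return the ring with its first vertex repeated at the end if needed."""
    ring = list(ring)
    if ring and ring[0] != ring[-1]:
        ring.append(ring[0])
    return ring


def ring_centroid(ring):
    """Planar centroid of a simple polygon ring (shoelace formula)."""
    pts = close_ring(ring)
    area2 = cx = cy = 0.0  # area2 accumulates twice the signed area
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate ring with zero area")
    return cx / (3 * area2), cy / (3 * area2)


# An unclosed square: the last vertex does not repeat the first.
open_square = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(ring_centroid(open_square))
```

For the small plot polygons in this table the planar approximation should be close, but the database-side functions remain the reference for the actual fix.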
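According to the docs online, this should show how the update would be executed without actually updating the table. (That holds for plain EXPLAIN; EXPLAIN ANALYZE does execute the statement, so it would need to be wrapped in a transaction and rolled back.)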
|
Some datapoints in the datapoints table in the geostreams db were stored incorrectly. New entries to datapoints will be correct but older ones may need to be corrected.
This is related to issue #484.