Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix incorrect datapoints in datapoints table, geostreams db #581

Open
tcnichol opened this issue May 22, 2019 · 14 comments
Open

fix incorrect datapoints in datapoints table, geostreams db #581

tcnichol opened this issue May 22, 2019 · 14 comments
Assignees

Comments

@tcnichol
Copy link

tcnichol commented May 22, 2019

Some datapoints in the datapoints table in the geostreams db were stored incorrectly. New entries to datapoints will be correct but older ones may need to be corrected.

this is related to this issue

#484

@tcnichol tcnichol self-assigned this May 22, 2019
@robkooper
Copy link
Member

Can you do a query to show how many datapoints are affected.

@dlebauer
Copy link
Member

And, which DB and which data points?

@tcnichol tcnichol changed the title fix incorrect datapoints in db fix incorrect datapoints in datapoints table, geostreams db May 23, 2019
@tcnichol
Copy link
Author

Updated issue to add which database and table.

@tcnichol
Copy link
Author

The result of this query

 select count(*) from datapoints where ST_Distance(geog, ST_SetSRID(ST_Point(-112.013473,33.039463),4326)::geography)  > 10000;

is 18773153

and there are 137686191 datapoints total.

So that means about 18773153 points or approximately 13% of the records in datapoints are outside Maricopa.

I'll be trying out a fix on a much smaller table locally.

@dlebauer
Copy link
Member

dlebauer commented May 24, 2019 via email

@tcnichol
Copy link
Author

So I did some queries to find out where the datapoints in the db were.

Total number datapoints = 137686191
Near Maricopa = 118913038
Near Urbana = 181152

I ran this query based on the original issue #484

select count(*) from datapoints where ST_Distance(geog, ST_SetSRID(ST_Point(33.0752456246289, -68.0249705271937),4326)::geography)  < 5000;

The result was 18592001 entries. This accounts for all the entries in datapoints that are not in Arizona or near Urbana.

So i got one of these entries. The gid is 164759042 and the source dataset is
https://terraref.ncsa.illinois.edu/clowder/datasets/5b3040334f0c2b43c04f962a

and the geog is

0103000020E61000000100000004000000DFF73902A68940401D402BD7990151C0DFF73902A6894040BE7C55019A0151C0241700A2A5894040BE7C55019A0151C0241700A2A58940401D402BD7990151C0

i wrote a python script to get lat/lon from geography. It uses shapely.wkb and this function below returns something (Point, Multigon, etc) with lat/lon within AZ.

wkb.loads(t_geog, hex=True)

But when I use a geography from one of the datapoints that are 'bad' an error is thrown:

shapely.errors.WKBReadingError: Could not create geometry because of errors while reading input

The same error is thrown for other datapoints not in AZ or near Urbana. This sort of makes sense since the query to get them uses the lat,lon order instead of lon, lat order.

The datasets that are included in the data field for these datapoints entries have correct lat lon, so correct geography can be created from that. I'm not sure how it's generated for this table, but figured I'd report on what I found so far.

@robkooper
Copy link
Member

@tcnichol keep in mind you can use ST_X() and ST_Y() to parse the geog. I use something like SELECT ST_Y(ST_Centroid(geog)) AS lat, ST_X(ST_Centroid(geog)) AS lon FROM sensors to get the latitude and longitude of the points.

@tcnichol
Copy link
Author

tcnichol commented May 28, 2019

I had to use the ST_AsText on the geog but this one worked - those points have the lat/lon that is in the Atlantic.

Again, I'm not sure the best way to fix this since I don't know how this database table is populated. If it's done by an extractor then resubmitting datasets might work. If we assume an offset based on the lon summing up to 180, then a script to update the db directly could work.

@tcnichol
Copy link
Author

tcnichol commented Jun 3, 2019

I created a datapoints table locally and added points that replicated the error found in datapoints. This query updates the lat/lon

update datapoints
set geog =  ST_SetSRID(ST_MakePoint(-180 - ST_Y(ST_Centroid( ST_AsText(geog))), ST_X(ST_Centroid( ST_AsText(geog)))), 4326)::geography
where  ST_Distance(geog, ST_SetSRID(ST_Point(33.0752456246289, -68.0249705271937),4326)::geography)  < 5000;

After this, all the points that were formerly in the Atlantic have lat/lon within Maricopa.

My local db table was very small (just 10 datapoints) so I am not sure how long this would take. I could also copy datapoints, and run this update on the copy.

@max-zilla
Copy link
Contributor

please do an explain on this query (https://www.postgresql.org/docs/9.1/sql-explain.html) and we can get a sense of how it will be executed so we can see what indices exist, etc.

@tcnichol
Copy link
Author

tcnichol commented Jun 7, 2019

running this but it seems to take a very long time. Will post the results once it is done.

explain analyze verbose
update datapoints
set geog =  ST_SetSRID(ST_MakePoint(-180 - ST_Y(ST_Centroid( ST_AsText(geog))), ST_X(ST_Centroid( ST_AsText(geog)))), 4326)::geography
where  ST_Distance(geog, ST_SetSRID(ST_Point(33.0752456246289, -68.0249705271937),4326)::geography)  < 5000;

@tcnichol
Copy link
Author

tcnichol commented Jun 7, 2019

It stopped but I got this error :

ERROR:  geometry contains non-closed rings
HINT:  "...,33.0756200671313 -68.0250750753778))" <-- parse error at position 150 within geometry

Will be looking into this.

@tcnichol
Copy link
Author

tcnichol commented Jun 9, 2019

I found a query that works. I use ST_MakeValid and then take a centroid of the geography points. Many of them are polygons that are not closed. This 'explain' query might help :

explain 
update datapoints
set geog =  ST_SetSRID(ST_MakePoint(-180 - ST_Y(ST_Centroid( ST_MakeValid(geog::geometry))), ST_X(ST_Centroid( ST_MakeValid(geog::geometry)))), 4326)::geography
where  ST_Distance(geog, ST_SetSRID(ST_Point(33.0752456246289, -68.0249705271937),4326)::geography)  < 5000;

Will check on the server.

@tcnichol
Copy link
Author

according to the docs online, this should check the time for update without actually updating the table.

begin;
explain analyze
update datapoints
set geog = ST_SetSRID(ST_MakePoint(-180 - ST_Y(ST_AsText(
	ST_Centroid(ST_MakeValid(geog::geometry)))),
							   ST_X(ST_AsText(ST_Centroid(ST_MakeValid(geog::geometry))))), 4326)::geography
								where ST_Distance(geog, ST_SetSRID(ST_Point(33.0752456246289, -68.0249705271937),4326)::geography)  < 5000;
rollback;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants