Clarify media-type and format negotiation of Part 3 collections #427
Comments
No, that is not a correct assumption. The term "collection", as used in draft OGC API - Common - Part 2, and as used in Processes - Part 3: Workflows, does not imply that the data is a collection of vector features. This was discussed in detail in opengeospatial/ogcapi-common#140. An OGC API collection is a "collection of data", typically including 2 or 3 spatial dimensions. If you meant that it implies the existence of a The "STAC collection", which is also modeled after OGC API - Features and also uses So if we consider STAC + COG as a valid "data access mechanism" for Collection Input and/or Output, it needs to be clear that the "input" / "output" is not the items, but the actual COG data (all of the assets combined comprising a single coverage output in Part 1 terms, where assets of different bands would either be considered a dimension or different fields of the coverage, and different items would be at other spatial or temporal coordinates within the coverage -- you could combine the whole thing for a particular AoI/ToI/RoI within a single netCDF file for example). It would also be possible to offer both OGC API - Coverages (
With the first-class collection object, while such a STAC+COG collection could be a valid "Remote Collection Input" to put in "collection" input, the input conceptually would be the actual coverage data, so that you should also be able to actually pass to that process as an input an "href" with a GeoTIFF or a base64-encoded GeoTIFF when not using Collection Input. This does pose some challenges in terms of GeoTIFF being only 2D and not able to encode different bands at different resolution. This goes back to some of our previous discussions, but in terms of the implementation, it does mean that you can implement all of the OGC API - Processes 1/3 and OGC API data access mechanisms client machinery separately from the processes, and feed the process the actual data directly (the subset of the coverage cells retrieved from the COGs as HTTP range requests, after having parsed the STAC items, in this case). After the process receives an input, it does not need to make any further network requests to retrieve data to do the actual processing. This makes the most sense to me thinking of the processes as "functions" from which all this API / remote access stuff should be abstracted away, and in terms of not having to duplicate the same functionality in the implementation of every process. Preprocessing inputs / accessing remote collections is a machinery thing -- not a process thing.
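To make the "machinery, not process" split above a bit more concrete, here is a minimal sketch. It is only an illustration and not what Part 3 prescribes: `resolve_input`, `fetch_href` and `fetch_collection` are hypothetical names, and the fetchers are injected so that the process itself never makes a network request.

```python
import base64
from typing import Callable, Dict

# Hypothetical fetcher type: takes a URL and returns the raw bytes of the data.
Fetcher = Callable[[str], bytes]


def resolve_input(spec: Dict[str, str], fetch_href: Fetcher,
                  fetch_collection: Fetcher) -> bytes:
    """Turn one execution-request input into raw data before the process runs."""
    if "collection" in spec:
        # Machinery concern: negotiate an access mechanism and retrieve the data.
        return fetch_collection(spec["collection"])
    if "href" in spec:
        return fetch_href(spec["href"])
    if "value" in spec:
        # Assumes the value is base64-encoded binary (e.g. a GeoTIFF).
        return base64.b64decode(spec["value"])
    raise ValueError("input must define 'collection', 'href' or 'value'")


def byte_count_process(data: bytes) -> int:
    """Stand-in for a real process: a pure function of already-resolved data."""
    return len(data)
```

With this shape, the same `byte_count_process` can be fed from a remote collection, an "href" or an inline value without knowing which one was used.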
No, because the actual data (the coverage encoded in all the GeoTIFF assets) is not a GeoJSON feature collection, but a 2D gridded coverage (or 3D if the different GeoTIFFs form a temporal dimension). As mentioned above, it would also be possible to offer that coverage data using OGC API - Coverages, OGC API - Tiles, OGC API - EDR, as well as STAC + COG all from within the same |
Given the shared use of
This is exactly my concern. It is not really "clear", unless you happen to know all involved data access mechanisms, which is not the case for most users. "How" do we indicate that clearly? A STAC Item could contain Assets with respective single-band GeoTiff, virtual Assets that pre-combine RGB into another GeoTiff, could also contain Point Cloud LAZ, a DEM, netCDF weather data, and a GeoJSON FeatureCollection with class labels, all at the same time. It could also have many other media-type variants of all this data only to offer alternative/equivalent formats that can be retrieved as desired. In a situation where only the RGB GeoTiff would be the relevant Assets for the Process execution in this case, does that mean it is up to the execution body to pass the appropriate Note I strongly believe providing explicit examples of these convoluted use cases in the specification could help readers understand the "best practices" and expectations.
I agree. This is how I'm implementing it as well. This is why I want to get as many details as possible regarding how the " |
STAC+COG being a special case, we should have a conditional requirement that makes this clear.
That clearly does not fit that usage pattern :) To fit the Collection Input/Output pattern, you would need to organize your STAC items as one collection per asset data type, where all assets together form a single Feature Collection or Coverage or Point Cloud. They could still all be on the same STAC API, just one collection per GeoDataClass data layer.
That could be fine as long as they all describe the same data of the same type, as long as somehow the STAC specification / extensions makes it clear to the client that these are alternative formats that can be used and represent the same data.
I'm not familiar with the STAC API, but the Features CQL2 filter would only allow you to filter at the items level, not the assets level. So while a filter could allow the use of heterogeneous STAC items by resulting in a consistent set of filtered items, I don't think the same would work if it is the assets that are heterogeneous? |
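The item-level versus asset-level distinction can be sketched as two separate steps. This is a simplified illustration: `item_predicate` is a hypothetical stand-in for whatever a CQL2 filter would express on the items, and the asset selection by media type happens client-side because the filter cannot reach it.

```python
from typing import Callable, Dict, List


def select_asset_hrefs(items: List[Dict],
                       item_predicate: Callable[[Dict], bool],
                       media_type: str) -> List[str]:
    """Filter items first, then pick assets of one media type from the survivors."""
    hrefs: List[str] = []
    for item in items:
        # Item-level step: the part a filter on /items could take care of.
        if not item_predicate(item):
            continue
        # Asset-level step: not covered by the item filter, so done here.
        for asset in item.get("assets", {}).values():
            if asset.get("type") == media_type:
                hrefs.append(asset["href"])
    return hrefs
```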
Interestingly enough though, that is the case I'm seeing as the most relevant to support GDC for my server. Cross-collection use cases are also defined in TB20 to combine EO and Climate/Weather data. TBH, I think OGC APIs are just as chaotic. The difference is only that different "data access mechanisms" are used for these various formats instead of STAC extensions. The problem remains the same, it just moves, and a client submitting the request must still indicate somehow which items of interest to resolve, and this will dictate which API endpoints/request to use and data type to extract from a "Collection" that can contain many things. Many of the OGC API endpoints can return similar media-types. Therefore, it is not sufficient to use only this information to resolve which endpoints to request when only the
I strongly disagree with this. I do not see any difference with accessing
I think so too since Assets cannot be filtered currently by Features CQL2. |
I am not saying that it is not valid scenario to provide all those things together. What I am saying is that if you have STAC items for all these, each type of data to use together with "Collection Input / Output", such as the GeoTIFF coverage, the individual climate dataset (though multiple variables can be different fields in the same coverage) and the GeoJSON annotations, should each be their own collections. All collections could still be provided from the same STAC API end-points, or be provided together as multiple inputs to a process.
I am not sure what you are saying here. Are you talking about It is possible to do only ONE of these two things, because both STAC and Features use
I think a key principle here is that for a particular collection, all data access mechanisms (Coverages, Tiles, DGGS, Tiles, Maps, EDR) represent the same data, and accessing the data using any of them should produce equivalent results. In the case of a STAC /items, those items are actually metadata as a stepping-stone to the actual data in the assets. Maps / Map Tiles would normally not be used for processing workflows unless it is the only available access mechanism, because Maps are a visual representation of the data typically not intended for processing, except if the data is (A)RGB imagery. So the choice of the API access mechanism should be agnostic of the process, and it is really up to the implementation of the server and client to decide whether to use Coverages or Coverage Tiles or DGGS or EDR or accessing STAC assets to retrieve the data. For a STAC collection to fit the Collection Input / Output paradigm, the assets it describes must fit the concept that not only the metadata (the STAC items) have a consistent schema, but the actual data of the assets must also have a consistent schema, since these assets are what would actually get fed to a process receiving this collection as an input (not the STAC items). |
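As a rough way to test whether a STAC collection fits this paradigm, one could compare the assets across items. The sketch below is an assumption-laden proxy: it only compares asset keys and media types, which is weaker than a real check that the underlying data share a schema, and the function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def asset_signature(item: Dict) -> Set[Tuple[str, str]]:
    """Return the set of (asset key, media type) pairs of one STAC item."""
    return {(key, asset.get("type", ""))
            for key, asset in item.get("assets", {}).items()}


def assets_are_consistent(items: List[Dict]) -> bool:
    """True when every item exposes the same asset keys with the same media types."""
    signatures = [asset_signature(item) for item in items]
    return all(sig == signatures[0] for sig in signatures[1:])
```

A collection mixing, say, GeoTIFF assets in some items and GeoJSON labels in others would fail this check and would be a candidate for splitting into one collection per data type.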
I do not see why that should be the case. The whole point of "Collection" as per its own definition is that it can be an agglomeration of different data representations and formats. If one is forced to represent a dataset that combines imagery, weather data and annotations representing some "event" into distinct collections because of their different type/format/representations, it defeats the main purpose of that collection/dataset that offers cross-domain annotations. This dataset could be the result of curated expert annotations that cannot "simply" be obtained by joining collections based on AOI/TOI/etc. Splitting into distinct collections also causes another issue. It doesn't "naturally group" together the corresponding data that represents the whole dataset. Users have to "somehow" figure out that
What I meant is that, in OGC APIs, a On the other hand, using a STAC In the end, with either approach, there is a need to filter/join/parse some specific data structure, which depends on specific metadata properties, media-types, etc. returned by the API of choice, and this cannot be done for what the process expects, unless it provides a IMO, the fact that distinct OGC APIs return some specific data types is not any different from parsing STAC Items that happen to have the same data distributed in their Assets. Resolving OGC
STAC could also simply introduce a way to filter
The "same data" is really subjective depending on which user you talk to. Those data access mechanisms all have some spatio-temporal structure that could be represented in some kind of alternate/extended GeoJSON with custom fields or generic Again, it only moves the problem elsewhere. Each approach is valid.
When you're dealing with your own processing server accessing your own data structure, sure, you can arrange things to "just work" together, because you know your own use cases. When you interrogate some remote server maintained by some other entity that has multiple use cases to handle, they couldn't care less about your requirements. Ambiguous combinations emerge to handle more use cases, and you need a way to indicate how to resolve things. In order for server/client to agree, the API must provide some means for resolution. This is the whole reason for this |
Following a few tests on my end while implementing Collection Input (crim-ca/weaver#685), I've found that I need additional parameters like More specifically, the below combinations (not exhaustive) are possible:
{
"inputs": {
"static-collection": {
"$comment": "Static document, resolves to 'test.geojson'.",
"collection": "https://mocked-file-server.com/test",
"format": "geojson-feature-collection",
"schema": "http://www.opengis.net/def/glossary/term/FeatureCollection"
},
"ogc-api-features": {
"collection": "https://mocked-file-server.com/collections/test",
"format": "ogc-features-collection",
"type": "application/geo+json",
"filter-lang": "cql2-text",
"filter": "property.name = 'test'"
},
"ogc-api-coverages": {
"collection": "https://mocked-file-server.com/collections/test",
"format": "ogc-coverage-collection",
"type": "image/tiff; application=geotiff; profile=cloud-optimized",
"filter-lang": "cql2-text",
"filter": "property.name = 'test'"
},
"ogc-api-maps": {
"collection": "https://mocked-file-server.com/collections/test",
"format": "ogc-map-collection",
"type": "image/tiff; application=geotiff; profile=cloud-optimized",
"style": "dark",
"filter-lang": "cql2-text",
"filter": "property.name = 'test'"
},
"stac-api": {
"collection": "https://mocked-file-server.com/collections/test",
"format": "stac-collection",
"type": "image/tiff; application=geotiff; profile=cloud-optimized",
"filter-lang": "cql2-text",
"filter": "property.name = 'test'"
}
}
}
Using
So far, I have only those defined, but there are most definitely other combinations to add to the format table. Also, open to renaming. This is just what felt like appropriate names. Using @jerstlouis |
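One way to read the "format table" idea is as a lookup from the proposed format names to an access mechanism. The sketch below reuses the format names from the comment above; the mechanism labels, the `"auto"` fallback and `select_mechanism` itself are hypothetical and only illustrate the dispatch.

```python
from typing import Dict

# Format names are the ones proposed in this thread; the mechanism labels
# on the right are illustrative only.
FORMAT_TABLE: Dict[str, str] = {
    "geojson-feature-collection": "static-document",
    "ogc-features-collection": "ogc-api-features",
    "ogc-coverage-collection": "ogc-api-coverages",
    "ogc-map-collection": "ogc-api-maps",
    "stac-collection": "stac-api",
}


def select_mechanism(collection_input: Dict[str, str]) -> str:
    """Pick an access mechanism from the optional 'format' of a collection input."""
    fmt = collection_input.get("format")
    if fmt is None:
        # No explicit format: leave the choice to automatic negotiation.
        return "auto"
    try:
        return FORMAT_TABLE[fmt]
    except KeyError:
        raise ValueError(f"unsupported collection format: {fmt!r}") from None
```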
This really goes against the idea that the workflow does not explicitly make this distinction, so that the exact same workflow can be re-used (workflow reusability) with other servers and/or clients supporting different access mechanisms.
Pointing to a collection e.g., sentinel2 with all its bands, does not necessarily mean "all bands", but whichever bands the process to which it is given as input needs. This should apply the same whether using STAC or Coverages. The process can use field selection
The OGC API - Maps access mechanism (and related styles) doesn't really fit into the typical workflows at all, since it doesn't normally provide access to the raw data of the collection, especially when an alternate access mechanism providing the raw data is available. Most workflows would work off the raw data, unless it's explicitly something intended to work on pre-rendered maps. The one use case where Maps may be useful is for RGB imagery being used as part of a workflow, with no other access mechanisms available, though that is probably not great for processing (e.g., 8-bit depth and scaled/clamped values instead of 16-bit original values). This being said, I'm not totally against the idea of adding parameters to force a particular access mechanism for specific scenarios / use cases, but this is absolutely not something that should be required or even the common use case. The whole point of the collection input was to leave it open for negotiation between the server / client hops in the workflow.
Hoping we can first take some time to discuss this together live (probably a long discussion), and in the context of specific experiments. |
And when the workflow is not able to resolve a combination because there are conflicting patterns, what do I do? Shrug and move to another server? The workflow is not reusable either. The idea of reusing
I agree, and never said otherwise. However, even then, the properties would depend on the API. So, rather than trying to handle all possible combinations of property names/structure some user could throw at the process, I find it is much easier to add a
I've often seen replies of the sort. It is not because it is not "YOUR" typical workflow, that it can simply be ignored or that it doesn't exist. We either support powerful and flexible Collection Inputs, or we don't at all. We cannot pick and choose only the preferred and easy use cases... Again, I reiterate, the
Would love that. What date/time would work? There's never enough time in SWG or TB meetings for this. |
Multiple available access patterns should never result in a failed workflow resolution -- on the contrary, they should make successful resolution more likely due to multiple options available intersecting client capabilities. The client picks its preferred option that it supports, but picking any that does work for it should work almost just as well. The idea with Part 3 nested processes is that each client-server hop is responsible for negotiating the immediate exchange. The client side of each hop sees what access mechanisms the server supports, and the client picks what makes the most sense from what is available in the context of the process. In theory, all access mechanisms should give access to the same raw data, whether it's Coverage tiles, DGGS, Coverages, STAC items pointing to COGs... The data should be the same in all cases, so whichever access mechanism the client picks should yield very similar, if not identical results. A hint that the client is free to ignore might make sense in situations where for whatever reason the client makes a bad decision by itself, but ideally the client should make the best decision without needing a hint.
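The per-hop negotiation described above could look roughly like this. It is a sketch under assumptions: the mechanism names are free-form labels, `negotiate` is a hypothetical helper, and the optional hint is treated as advisory, so it is ignored whenever either side does not support it.

```python
from typing import Iterable, Optional, Sequence


def negotiate(server_mechanisms: Iterable[str],
              client_preferences: Sequence[str],
              hint: Optional[str] = None) -> str:
    """Pick the access mechanism for one client-server hop."""
    available = set(server_mechanisms)
    # Advisory hint: honoured only when both sides support it.
    if hint is not None and hint in available and hint in client_preferences:
        return hint
    # Otherwise take the client's most preferred mechanism the server offers.
    for mechanism in client_preferences:
        if mechanism in available:
            return mechanism
    raise LookupError("no access mechanism in common")
```

More server-side options can only enlarge the intersection, which is the sense in which extra access mechanisms should make resolution more likely rather than less.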
If you mean that the workflow definition should have "properties": [ "assets.red", "assets.green", "assets.blue" ] that's not how I was expecting this to work. The "properties" should be exactly the same, regardless of the access mechanism, so it should still simply be
I'm not saying it doesn't exist, just that in general (unless specifically doing post-processing on maps) processes take as input raw data, not rendered maps, so you don't have these two fundamentally different types of inputs one being the raw data, the other rendered maps. In the case of STAC+COG, Coverages, Coverage Tiles, and DGGS data, you get the same data values regardless of the selected access mechanisms, and we should focus initial discussion / experiment on these use cases, rather than thinking about Maps and Styles at this stage.
They might come in handy in particular situations, but I'm really wary about working on these exception scenarios before having done technology integration experiments with the regular cases, because there's a possibility that the automatic resolution might actually just work fine without the hints. So I would really like us to experiment with specific workflow examples both where we agree they are not necessary, and where you think they are really necessary, to have some concrete context to discuss and define these hints.
5:00 PM or later this evening would work for me, any time tomorrow Friday, any time after 10:00 AM Monday, any time Tuesday. |
I disagree. If my auto-resolution implementation happened to prefer STAC-API, but you provided the The process cannot inform about which one is better either, because it does not depend on a Since many APIs can still return 200 OK even if a
Well, they don't, hence the use of different APIs for different purposes.
Correct in the workflow definition.
No, it doesn't! For a Coverages API, it would indicate to extract the corresponding bands, which are defined in the Completely different access strategies and interpretation for all.
It is this kind of focus that creates situations as the
That's the thing. I am implementing it now, and I'm already encountering issues! I also got time Friday 10AM-2PM. Let's try that. |
I believe some (most?) of those issues you point out are non-issues when used as intended, which is why I want to get to the bottom of all these with you in practical scenarios like we're doing here. Let's try to discuss tomorrow around 10:00 AM.
The The STAC access mechanism is a bit of a special case, as we discussed in a previous issue, but how I would understand properties to work in that case, to align with the concept of a data cube available for example as both Coverages and STAC+COG, is that the property selection translates to the selection of an asset or field inside an asset in the case of a single asset containing multiple bands (not to a property of the STAC item metadata).
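A sketch of that reading, where the same property list is translated differently per access mechanism. Assumptions are flagged in the comments: the `/coverage` path and `properties` query parameter follow my understanding of the draft OGC API - Coverages field selection, the STAC case assumes one asset per band keyed by the band name, and the multi-band single-asset case is not handled.

```python
from typing import Dict, List
from urllib.parse import urlencode


def coverage_request_url(collection_url: str, properties: List[str]) -> str:
    """Coverages: the selection becomes a field-selection query parameter (assumed)."""
    query = urlencode({"properties": ",".join(properties)}, safe=",")
    return f"{collection_url}/coverage?{query}"


def stac_asset_hrefs(item: Dict, properties: List[str]) -> List[str]:
    """STAC + COG: the selection becomes a choice of assets, not item metadata."""
    assets = item.get("assets", {})
    # Assumes one asset per band, keyed by the same name as the property.
    return [assets[name]["href"] for name in properties if name in assets]
```

Either way the workflow would only say `["red", "green", "blue"]`; the translation is left to the client side of the hop.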
The
It would normally be the processing engine really (the Part 3 / collection input) which would access the collection, not the process (i.e., execution unit of the application package) itself. But potentially the processing engine could use information about that process (like a format that it accepts which is supported by a particular access mechanism) to make a decision about which access mechanism to pick.
If using a parameter supported as per the conformance declaration associated with the input collection, a 200 while using that parameter should definitely mean that it was applied successfully. Also, using an unsupported parameter with an OGC API should result in a 4xx error if the parameter/value is not supported. |
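The rule above can be written down as a small check. This is only a sketch: `parameter_was_applied` is a hypothetical helper and the conformance class is passed in as an opaque string, so no real conformance URI is assumed.

```python
from typing import Iterable


def parameter_was_applied(status_code: int,
                          declared_conformance: Iterable[str],
                          required_class: str) -> bool:
    """Decide whether a response can be trusted to reflect a query parameter."""
    if required_class not in set(declared_conformance):
        # Not declared: a 200 proves nothing, the server may have ignored it.
        return False
    # Declared: a 200 should mean the parameter was applied; an unsupported
    # parameter or value is expected to produce a 4xx instead.
    return status_code == 200
```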
Given a process executed with Part 3 "collection" (rather than "value" or "href"), the data passed as input or produced as output of the process seems to imply that it will be a GeoJSON FeatureCollection, or an equivalent representation such as STAC Collection or GML.
Given this implication, can it be assumed that only process descriptions that explicitly support "format: geojson-feature-collection" (or alternate "collection-like" formats) can safely employ the "collection" input/output? A process expecting, for example, a GeoTiff would be automatically invalid if a "collection" was specified?
Given that a "collection" can also imply an array of "whatever item the collection contains", would a process expecting an input array of GeoTiff be a valid candidate for a "collection" with a STAC Collection that happens to contain those GeoJSON Items with STAC Assets employing GeoTiff media-types? Would this process still need to indicate "format: geojson-feature-collection" support explicitly (IMO, yes) to avoid ambiguity regarding how those GeoTiff should be parsed?
Are there other use-cases not mentioned above to consider?
The Part 3 document needs to clarify these considerations such that expected behaviors by implementations can somewhat "agree" and improve chances of interoperability. The versatility of "collection", although useful in some cases, somewhat acts as a double-edged sword when describing and executing processes, since processing intentions can become too abstract.