Alternative referencing mechanism #506
Replies: 7 comments 10 replies
-
This may seem somewhat contrived, but just to flesh out a trickier example, would something like this be acceptable?:
Then the
I think ideally a dataset creator who is making these manually would avoid ID and naming situations like this, but for data repos that are building these automatically/programmatically, it's quite nice to rely on the file system structure to ensure uniqueness of |
-
2 cents:
@goeffthomas Your example is a bit contrived, but it seems valid to me. Given that you're generating the Croissant automatically, then yes, you may need to encode paths in ids to ensure their uniqueness. For field ids, maybe there is a better approach than URL-escaping parentheses? Economy%20%28GDP%20per%20Capita%29 is a bit ugly. I also wonder if you could have some logic that only encodes the paths when there is ambiguity, and then maybe only the portion of the path that is sufficient to disambiguate?
@marcenacp We can certainly recommend making ids easy to understand, but we can't really enforce that. We may need to add functionality like pattern matching to the library to improve usability.
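To make the "only encode as much of the path as needed" idea concrete, here is a rough sketch. It is not part of the Croissant library; the function name `shortest_unique_ids` and the example paths are made up for illustration, and it assumes the input paths are themselves distinct and `/`-separated.

```python
from collections import Counter
from urllib.parse import quote


def shortest_unique_ids(paths):
    """Map each path to the shortest trailing portion that is unique.

    Hypothetical helper: starts from the bare file name and only adds
    parent directories for paths whose current id clashes with another.
    """
    parts = {p: p.split("/") for p in paths}
    depth = {p: 1 for p in paths}  # number of trailing components in use

    def suffix(p):
        return "/".join(parts[p][-depth[p]:])

    while True:
        counts = Counter(suffix(p) for p in paths)
        # Paths that still clash and have more parent components to add.
        clashing = [
            p for p in paths
            if counts[suffix(p)] > 1 and depth[p] < len(parts[p])
        ]
        if not clashing:
            break
        for p in clashing:
            depth[p] += 1

    # Percent-encode whatever is left, keeping "/" readable.
    return {p: quote(suffix(p), safe="/") for p in paths}


ids = shortest_unique_ids(
    ["a/train/data.csv", "a/test/data.csv", "a/readme.md"]
)
print(ids)
```

With these inputs, `readme.md` keeps its bare name, and the two `data.csv` files pick up just one parent directory each. Escaping would still apply to names with spaces or parentheses, so this reduces the ugliness rather than removing it.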
-
Proposal LGTM. Should we still mention that the I think we should still explicitly recommend that either |
-
Unless anybody has a strong objection to this proposal, I'd like to move ahead with it. Now is the last chance to speak up. :-)
-
I updated all the examples in the Croissant spec to use the new mechanism (as edit suggestions). Please take a look. If I hear no objections, I will turn the edit suggestions into actual changes.
-
Though not the most pressing thing in the world, I think it may be beneficial for us to align on expectations with respect to encoding sooner rather than later. After giving it some thought, since we can't control whether
And then due to decoding, it would yield something like this during usage:
Do others have any opinions on this? @pierrot0 @marcenacp @benjelloun
-
Closing, as this was added to Croissant 1.0.
-
The current name / reference mechanism imposes strict constraints on names, so that they can be used as identifiers. As an alternative, we can use JSON-LD's standard @id mechanism.
Here is an initial proposal, also in the Croissant spec:
In this approach, every object in a Croissant dataset that needs to be identified defines an @id property. The value of this property is generally a string identifier, which is resolved into a full URI by concatenating it with a base URI (by default, the URL of the page). The important property of @id is that the resulting URI is a global identifier of the object. This implies that no two objects in the dataset should have the same identifier.
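As a rough illustration of the resolution and uniqueness rule, here is a small Python sketch. It is not how any Croissant tooling does it: `resolve_ids` is a hypothetical helper, the base URI is made up, and `urllib.parse.urljoin` stands in for JSON-LD's base-URI resolution (for simple ids and a base ending in `/`, it behaves like plain concatenation).

```python
from urllib.parse import urljoin


def resolve_ids(nodes, base_uri):
    """Return {full_uri: node}, raising if two nodes share an identifier.

    Hypothetical helper; `nodes` is a flat list of dicts with an "@id" key.
    """
    resolved = {}
    for node in nodes:
        uri = urljoin(base_uri, node["@id"])
        if uri in resolved:
            raise ValueError(f"duplicate @id: {uri}")
        resolved[uri] = node
    return resolved


# Assumed base URI, purely for illustration.
base = "https://example.org/datasets/movies/"
nodes = [{"@id": "ratings"}, {"@id": "movies"}]
print(sorted(resolve_ids(nodes, base)))
```

Passing two nodes with the same @id raises `ValueError`, which is the "no two objects share an identifier" constraint in executable form.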
Per the JSON-LD specification, references to objects are also specified using the @id property. It is understood to be a reference if no other property is specified within the same object.
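The "reference if nothing else is specified" rule is easy to state in code. A minimal sketch, assuming objects are plain Python dicts parsed from JSON; `is_reference` is a made-up name, not a library function:

```python
def is_reference(obj):
    """True if obj is a dict whose only key is "@id" (a pure reference)."""
    return isinstance(obj, dict) and set(obj) == {"@id"}


print(is_reference({"@id": "movies/movie_id"}))
print(is_reference({"@id": "movies/movie_id", "name": "movie_id"}))
```

The first call describes a reference to an object defined elsewhere; the second carries another property, so it is a definition of the object itself.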
The name property does not need to follow any constraints anymore, since it's not used as an identifier. In fact, an object may have multiple names, e.g., to support internationalization.
Let's revisit the above examples with this new approach:
A set of JSON files included in a tar archive:
A "foreign key" reference on column "movie_id" from a "ratings" table to a "movies" table:
In the above example, the @id of a field is prefixed by the @id of the corresponding RecordSet. This ensures uniqueness and makes it possible to disambiguate between fields of the same name in different RecordSets. In this example, both the ratings and movies RecordSets have a movie_id field.
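For readers skimming the thread, here is a small Python sketch of how the prefixed ids play out. The dicts are only my approximation of the Croissant JSON-LD shape (the `references` / `field` property layout is an assumption on my part, not copied from the proposal), and the lookup code is illustrative rather than anything from the library.

```python
# Two RecordSets whose fields share the name "movie_id" but have distinct
# @ids, because each @id is prefixed with the RecordSet's own @id.
record_sets = [
    {
        "@type": "cr:RecordSet",
        "@id": "movies",
        "field": [
            {"@type": "cr:Field", "@id": "movies/movie_id", "name": "movie_id"},
        ],
    },
    {
        "@type": "cr:RecordSet",
        "@id": "ratings",
        "field": [
            {
                "@type": "cr:Field",
                "@id": "ratings/movie_id",
                "name": "movie_id",
                # Assumed shape of a "foreign key": a pure @id reference.
                "references": {"field": {"@id": "movies/movie_id"}},
            },
        ],
    },
]

# Index every field by its @id; the prefixes keep the two movie_id fields apart.
fields_by_id = {f["@id"]: f for rs in record_sets for f in rs["field"]}

source = fields_by_id["ratings/movie_id"]
target = fields_by_id[source["references"]["field"]["@id"]]
print(source["@id"], "->", target["@id"])
```

Indexing by `name` instead would collapse the two fields into one entry, which is exactly the ambiguity the prefix avoids.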