Alternative referencing mechanism #506

benjelloun · 2024-02-09T17:34:06Z

benjelloun
Feb 9, 2024
Maintainer

The current name / reference mechanism imposes strict constraints on names, so that they can be used as identifiers. As an alternative, we can use JSON-LD's standard @id mechanism.

Here is an initial proposal, also in the Croissant spec:

In this approach, every object in a Croissant dataset that needs to be identified defines an @id property. The value of this property is generally a string identifier, which will get resolved into a full URI, by concatenating it with a base URI, which defaults to the URL of the page. The important property of @id is that the resulting URI is a global identifier of the object. This implies that no two objects in the dataset should have the same identifier.

Per the JSON-LD specification, references to objects are also specified using the @id property. It is understood to be a reference if no other property is specified within the same object.

The name property does not need to follow any constraints anymore, since it's not used as an identifier. In fact, an object may have multiple names, e.g., to support internationalization.
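The resolution and reference rules above can be sketched in a few lines of Python. This is only an illustration, not how mlcroissant implements it: the base URI is a made-up example, the helper names `resolve_id` and `is_reference` are hypothetical, and `urljoin` is used as a stand-in for JSON-LD's base-relative resolution.

```python
from urllib.parse import urljoin

# Hypothetical base URI; per the proposal it defaults to the URL of the page.
BASE_URI = "https://example.org/datasets/flores200/"


def resolve_id(identifier: str, base_uri: str = BASE_URI) -> str:
    """Resolve a relative @id into a full URI against the base URI."""
    return urljoin(base_uri, identifier)


def is_reference(node: dict) -> bool:
    """A node is a reference when @id is the only property it carries."""
    return set(node) == {"@id"}


file_object = {
    "@type": "cr:FileObject",
    "@id": "flores200_dataset.tar.gz",
    "name": "Flores 200 archive",
}
reference = {"@id": "flores200_dataset.tar.gz"}

print(resolve_id(file_object["@id"]))
print(is_reference(file_object), is_reference(reference))
```

Both the definition and the reference resolve to the same URI, which is what lets a reader link them.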

Let's revisit the above examples with this new approach:

A set of JSON files included in a tar archive:

{
  "@type": "cr:FileObject",
  "@id": "flores200_dataset.tar.gz",
  "name": "Flores 200 archive",
  "description": "Flores 200 is hosted on a webserver.",
  "contentSize": "25585843 B",
  "contentUrl": "https://tinyurl.com/flores200dataset",
  "encodingFormat": "application/x-gziptar",
  "sha256": "b8b0b76783024b85797e5cc75064eb83fc5288b41e9654dabc7be6ae944011f6"
},
{
  "@type": "cr:FileSet",
  "@id": "flores200_dev_files",
  "name": "Flores 200 dev files",
  "description": "dev files are inside the tar.",
  "containedIn": { "@id": "flores200_dataset.tar.gz" },
  "encodingFormat": "application/json",
  "includes": "flores200_dataset/dev/*.dev"
}

A "foreign key" reference on column "movie_id" from a "ratings" table to a "movies" table:

{
  "@type": "cr:RecordSet",
  "@id": "ratings",        
  "name": "IMDB ratings",
  "field": [
     {
       "@type": "cr:Field",
       "@id": "ratings/movie_id",        
       "name": "Movie id",
       "dataType": "sc:Integer",
       "references": {"@id": "movies/movie_id"}
     },...
  ]
}

In the above example, the @id of a field is prefixed by the @id of the corresponding RecordSet. This ensures the uniqueness, and makes it possible to disambiguate between fields of the same name in different RecordSets. In this example, both the ratings and movies RecordSets have a movie_id field.
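A producer could check the uniqueness requirement mechanically. The sketch below walks a JSON-like structure, counts every defined @id, and reports references that point at nothing. The function names and the minimal `movies` RecordSet are assumptions added for illustration; they are not part of the proposal or of mlcroissant.

```python
from collections import Counter


def iter_nodes(node):
    """Yield every dict nested anywhere inside a JSON-like structure."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from iter_nodes(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_nodes(item)


def check_ids(document):
    """Return (duplicated ids, referenced ids that are never defined)."""
    defined = Counter()
    referenced = set()
    for node in iter_nodes(document):
        if "@id" not in node:
            continue
        if set(node) == {"@id"}:  # only @id present: this is a reference
            referenced.add(node["@id"])
        else:
            defined[node["@id"]] += 1
    duplicates = sorted(i for i, n in defined.items() if n > 1)
    dangling = sorted(referenced - set(defined))
    return duplicates, dangling


record_sets = [
    {"@type": "cr:RecordSet", "@id": "ratings", "field": [
        {"@type": "cr:Field", "@id": "ratings/movie_id",
         "references": {"@id": "movies/movie_id"}},
    ]},
    # Assumed counterpart RecordSet, not shown in the example above.
    {"@type": "cr:RecordSet", "@id": "movies", "field": [
        {"@type": "cr:Field", "@id": "movies/movie_id"},
    ]},
]

print(check_ids(record_sets))
```

If both fields had used the bare id `movie_id`, the first list would contain it, which is the ambiguity the prefix avoids.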

goeffthomas · 2024-02-10T00:51:28Z

goeffthomas
Feb 10, 2024
Collaborator

This may seem somewhat contrived, but just to flesh out a trickier example, would something like this be acceptable?:

"distribution": [{
  "@type": "cr:FileObject",
  "@id": "archive.zip",        
  "name": "archive.zip",
  "description": "Describes the dataset",
  "contentSize": "3 GB",
  "contentUrl": "https://www.kaggle.com/api/v1/datasets/download/goeff/my-dataset",
  "encodingFormat": "application/zip",
  "sha256": "<some hash>"
},
{
  "@type": "cr:FileObject",
  "@id": "archive.zip/path/to/csv1/table.csv",        
  "name": "table.csv",
  "description": "Table with same name, but in csv1 directory",
  "containedIn": { "@id": "archive.zip" },
  "contentUrl": "path/to/csv1/table.csv",
  "encodingFormat": "text/csv"
},
{
  "@type": "cr:FileObject",
  "@id": "archive.zip/path/to/csv2/table.csv",        
  "name": "table.csv",
  "description": "Table with same name, but in csv2 directory",
  "containedIn": { "@id": "archive.zip" },
  "contentUrl": "path/to/csv2/table.csv",
  "encodingFormat": "text/csv"
}]

Then the recordSets could look something like:

"recordSets": [{
  "@type": "cr:RecordSet",
  "@id": "archive.zip/path/to/csv1/table.csv_records",        
  "name": "table.csv_records",
  "field": [
    {
      "@type": "cr:Field",
      // Special characters are encoded since it's a URI
      "@id": "archive.zip/path/to/csv1/table.csv_records/Economy%20%28GDP%20per%20Capita%29",        
      "name": "Economy (GDP per Capita)",
      "dataType": "sc:Integer"
    },...
  ]
},
{
  "@type": "cr:RecordSet",
  "@id": "archive.zip/path/to/csv2/table.csv_records",        
  "name": "table.csv_records",
  "field": [
    {
      "@type": "cr:Field",
      // Special characters are encoded since it's a URI
      "@id": "archive.zip/path/to/csv2/table.csv_records/Economy%20%28GDP%20per%20Capita%29",        
      "name": "Economy (GDP per Capita)",
      "dataType": "sc:Integer"
    },...
  ]
}]

I think ideally a dataset creator who is making these manually would avoid ID and naming situations like this, but for data repos that are building these automatically/programmatically, it's quite nice to rely on the file system structure to ensure uniqueness of @ids. I realize this makes the issue of the loader a bit trickier. How do you enable a user to easily load a specific recordSet if the @ids are allowed to be somewhat unwieldy (and names are no longer required to be unique)? Could it accept a glob pattern to match on @ids and then just throw if it encounters more than 1 match? Curious to hear your thoughts on this as well, @marcenacp.
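A rough sketch of the glob idea, using Python's standard `fnmatch` module. The `find_one` helper is hypothetical and not an mlcroissant API. It also raises when nothing matches, which goes slightly beyond the "throw on more than 1 match" rule suggested here.

```python
from fnmatch import fnmatchcase


def find_one(ids, pattern):
    """Return the single id matching a glob pattern, or raise ValueError."""
    matches = [i for i in ids if fnmatchcase(i, pattern)]
    if len(matches) != 1:
        raise ValueError(f"{pattern!r} matched {len(matches)} ids: {matches}")
    return matches[0]


record_set_ids = [
    "archive.zip/path/to/csv1/table.csv_records",
    "archive.zip/path/to/csv2/table.csv_records",
]

print(find_one(record_set_ids, "*csv1/table.csv_records"))
```

Note that `*` in `fnmatch` also matches `/`, so a pattern like `*table.csv_records` matches both ids and raises.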

6 replies

ccl-core Feb 14, 2024
Maintainer

If name is a human-readable description, and we already have a description field, then I am not sure I get what the difference between name and description would be (and why we would need name at all, especially since at the moment it is a mandatory property)?

ccl-core Feb 14, 2024
Maintainer

I agree that ds["table.csv_records"]["Economy%20%28GDP%20per%20Capita%29"] is not very friendly to interact with... Maybe we could have some sort of internal converter in mlcroissant that "translates" between escaped/human readable string representation, allowing users to write something like ds["table.csv_records"]["Economy (GDP per Capita)"]?

marcenacp Feb 14, 2024
Maintainer

@ccl-core, the problem is that "Economy (GDP per Capita)" is not guaranteed to be unique anymore. Should we add this constraint?

Otherwise, could the best practice be to use the current reference mechanism in the IDs?

cr:RecordSet
  id: "annotations"
  name: "Annotations"
  cr:Field:
    id: "annotations/bbox"
    name: "Bounding box"

The library could then return to the user readable objects:

{
  "bbox": [...],
}
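One way the library could produce such readable keys, assuming field ids follow the `<record set id>/<field>` convention. The `readable_record` helper and the sample bounding-box values are made up for illustration.

```python
def readable_record(record_set_id: str, record: dict) -> dict:
    """Strip the '<record set id>/' prefix from each field-id key."""
    prefix = record_set_id + "/"
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in record.items()
    }


# Illustrative record keyed by full field ids.
raw = {"annotations/bbox": [10, 20, 30, 40]}
print(readable_record("annotations", raw))
```

Keys that do not carry the prefix are left untouched, so ids that ignore the convention still work.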

cc @benjelloun

goeffthomas Feb 14, 2024
Collaborator

If name is a human-readable description, and we already have a description field, then I am not sure I get what the difference between name and description would be (and why we would need name at all, especially since at the moment it is a mandatory property)?

I can't speak for all objects, but FileObject and Field illustrate the difference nicely. The name of the file is different from a description of what's in it. And the name of a column is different from metadata describing its purpose, etc.

I agree that ds["table.csv_records"]["Economy%20%28GDP%20per%20Capita%29"] is not very friendly to interact with... Maybe we could have some sort of internal converter in mlcroissant that "translates" between escaped/human readable string representation, allowing users to write something like ds["table.csv_records"]["Economy (GDP per Capita)"]?

If @id is known (by JSON-LD standards) to be a URI, would we not just decode before using? Maybe this is what you mean by an internal converter.

My hunch is that we should avoid having the tool do too much "magic" for users. Maybe I have an unpopular opinion on this, but I think an unwieldy or hard-to-use Croissant file is a problem with the creator of the Croissant. If that's Kaggle, then our logic/algo on how to create IDs needs to be cleaned up. If it's a user making one by hand, they should get feedback from end users that it could use some improvement.

benjelloun Feb 15, 2024
Maintainer Author

@marcenacp I don't think we should require names to be unique.

Re: usage of identifiers in the library: I agree with Goeff that having unwieldy ids is mostly a problem that needs to be addressed by the dataset producer, and that we should avoid too much magic in the library.

Maybe the library can support an explicit aliasing mechanism, so that the user can define -- once -- user-friendly aliases for the ids they want to use, and then use these aliases in the rest of their code?
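A minimal sketch of what such an aliasing mechanism might look like. The `Aliases` class is purely hypothetical and does not exist in mlcroissant.

```python
class Aliases:
    """Maps short, user-chosen aliases to full @id strings."""

    def __init__(self):
        self._to_id = {}

    def define(self, alias: str, full_id: str) -> None:
        """Register an alias once; refuse to silently repoint it."""
        current = self._to_id.get(alias)
        if current is not None and current != full_id:
            raise ValueError(f"alias {alias!r} already points to {current!r}")
        self._to_id[alias] = full_id

    def resolve(self, alias_or_id: str) -> str:
        # Unknown names pass through unchanged so full ids keep working.
        return self._to_id.get(alias_or_id, alias_or_id)


aliases = Aliases()
aliases.define("gdp", "archive.zip/path/to/csv1/table.csv_records/Economy%20%28GDP%20per%20Capita%29")
print(aliases.resolve("gdp"))
```

The alias table is explicit and defined by the user, so the library itself does not have to guess at friendly names.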

benjelloun · 2024-02-12T08:57:12Z

benjelloun
Feb 12, 2024
Maintainer Author

2 cents:

@goeffthomas Your example is a bit contrived, but seems valid to me. Given that you're generating the Croissant automatically, then yes, you may need to encode paths in ids to ensure their uniqueness.

For field ids, maybe there is a better approach than url escaping parentheses? Economy%20%28GDP%20per%20Capita%29 is a bit ugly.

I also wonder if you could have some logic that only encodes the paths when there is ambiguity, and then maybe only some portion of the path that is sufficient to disambiguate?
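One possible reading of that idea: keep only the shortest trailing portion of each path that is still unique. The sketch below assumes the input paths are distinct and `/`-separated; the function name is made up.

```python
from collections import Counter


def shortest_unique_ids(paths):
    """Map each path to its shortest trailing portion that is unique."""
    parts = {p: p.split("/") for p in paths}
    depth = {p: 1 for p in paths}  # number of trailing segments kept
    while True:
        suffix = {p: "/".join(parts[p][-depth[p]:]) for p in paths}
        counts = Counter(suffix.values())
        # Paths whose suffix still clashes and that can be lengthened.
        clashing = [p for p in paths
                    if counts[suffix[p]] > 1 and depth[p] < len(parts[p])]
        if not clashing:
            return suffix
        for p in clashing:
            depth[p] += 1


paths = [
    "path/to/csv1/table.csv",
    "path/to/csv2/table.csv",
    "path/to/readme.md",
]
print(shortest_unique_ids(paths))
```

Here the two tables keep one directory level (`csv1/table.csv`, `csv2/table.csv`), while the unambiguous file keeps only its filename.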

@marcenacp We can certainly recommend making ids easy to understand, but we can't really enforce that. We may need to add functionality like pattern matching to the library to improve usability.

1 reply

goeffthomas Feb 12, 2024
Collaborator

Yes, I purposely brought up that example since I didn't want us to lose sight of special characters and since @id is supposed to be a URI, I assume it's expected to be encoded as such. Definitely open to thinking through other/better ways to name things to try to make them as user-friendly as possible. I think I was more interested in trying to highlight what the new use of @id can do vs what it should do (which I think is a theme throughout a number of the threads/comments here).

For field IDs, if the work is done to try and keep the record set IDs unique, then it seems like we could do something like collapse all special characters down in the ID. And if need be, we can keep track of fields within a record set and do some kind of numbering strategy in the event that a collision results after the collapsing. For example if GDP (per Capita) and GDP per Capita were both in a CSV, the end result would be GDPperCapita1 and GDPperCapita2. Not great, but still readable and on the maintainer of the dataset at that point to make their CSV better.
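The collapse-then-number strategy could look roughly like this. It is a sketch with a hypothetical function name, and it does not handle the corner case where a numbered result collides with another column's own name.

```python
import re
from collections import Counter


def collapse_field_ids(names):
    """Drop non-alphanumeric characters; number names that then collide."""
    collapsed = [re.sub(r"[^A-Za-z0-9]", "", name) for name in names]
    totals = Counter(collapsed)
    seen = Counter()
    result = []
    for base in collapsed:
        if totals[base] > 1:
            seen[base] += 1
            result.append(f"{base}{seen[base]}")
        else:
            result.append(base)
    return result


print(collapse_field_ids(["GDP (per Capita)", "GDP per Capita", "Country"]))
```

Only the colliding columns get a numeric suffix; `Country` is left as is.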

pierrot0 · 2024-02-12T14:03:39Z

pierrot0
Feb 12, 2024
Maintainer

Proposal LGTM.

Should we still mention that the sc:name file property should uniquely identify a file within a dataset?

I think we should still explicitly recommend that either @id or name should contain the original file name (including extension). And we should be explicit about which one of @id or name it is.

2 replies

benjelloun Feb 12, 2024
Maintainer Author

I don't think we want to require sc:name to be unique anymore, for any type of object, including FileObject. We can still recommend using the filename and/or path when describing a FileObject. Per Goeff's example above, we should probably suggest the filename as an sc:name, and the path (if not unique) as the identifier. Does that make sense?

As a side note, once we introduce @id, the sc:name becomes optional, and is only used when we want to give a human readable name to an object.

goeffthomas Feb 17, 2024
Collaborator

+1, for name on a FileObject, it's nice for it to be intuitive that it actually means filename (regardless of whether we enforce, which we don't want to). For @id, it's on the dataset producer to ensure that these are unique at a bare minimum (otherwise they're invalid). Beyond that, the more concise the better. For data repos building these programmatically, the filepath provides a nice guarantee on uniqueness this way, but we should still encourage them to find ways to shorten them.

I've just begun thinking about how we want to do this at Kaggle. I'm happy to share a doc once I've landed on an approach that I think will scale across a variety of dataset structures and edge cases.

benjelloun · 2024-02-16T10:00:07Z

benjelloun
Feb 16, 2024
Maintainer Author

Unless anybody has a strong objection to this proposal, I'd like to move ahead with it. Now is the last chance to speak up. :-)

0 replies

benjelloun · 2024-02-16T17:41:18Z

benjelloun
Feb 16, 2024
Maintainer Author

I updated all the examples in the Croissant spec to use the new mechanism (as edit suggestions):

https://docs.google.com/document/d/11E1x2rIKo_9C2Hh7pMpHtTE30iizVCWUMQ9rDysBoeA/edit?usp=sharing&resourcekey=0-drT2urhsv5QnaBr57G0coQ

Please take a look. If I hear no objections, I will turn the edit suggestions into actual changes.

0 replies

goeffthomas · 2024-02-17T00:46:02Z

goeffthomas
Feb 17, 2024
Collaborator

I purposely brought up that example since I didn't want us to lose sight of special characters and since @id is supposed to be a URI, I assume it's expected to be encoded as such.

Though not the most pressing thing in the world, I think it may be beneficial for us to align on expectations with respect to encoding sooner rather than later. After giving it some thought, since we can't control whether @id is an actual URL or not, I tend to lean toward "better safe than sorry". By that, I mean it may be best to default to decoding @id at the point of ingesting/building the graph. To support situations where users haven't encoded, we could expose a flag in mlcroissant along the lines of encoded_ids: bool = True so users can disable the decoding step if they choose. I think this way, we avoid any "magic" since the default behavior is documented and configurable at runtime. So to pull from the example discussed above, something like this would be okay re: encoding and naming:

"recordSets": [{
  "@type": "cr:RecordSet",
  "@id": "csv1/table.csv_records",        
  "name": "csv1/table.csv records",
  "field": [
    {
      "@type": "cr:Field",
      // Special characters are encoded since it's a URI
      "@id": "csv1/table.csv_records/Economy%20%28GDP%20per%20Capita%29",        
      "name": "Economy (GDP per Capita)",
      "dataType": "sc:Integer"
    },...
  ]
},
{
  "@type": "cr:RecordSet",
  "@id": "csv2/table.csv_records",        
  "name": "csv2/table.csv records",
  "field": [
    {
      "@type": "cr:Field",
      // Special characters are encoded since it's a URI
      "@id": "csv2/table.csv_records/Economy%20%28GDP%20per%20Capita%29",        
      "name": "Economy (GDP per Capita)",
      "dataType": "sc:Integer"
    },...
  ]
}]

And then due to decoding, it would yield something like this during usage:

ds["csv1/table.csv_records"]["Economy (GDP per Capita)"]
ds["csv2/table.csv_records"]["Economy (GDP per Capita)"]
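To make the proposed default concrete, here is a sketch of decoding at ingestion time. The `encoded_ids` flag mirrors the name floated above, but it is only a proposal in this thread, not an existing mlcroissant option, and `index_by_id` is a hypothetical helper.

```python
from urllib.parse import unquote


def index_by_id(nodes, encoded_ids: bool = True):
    """Build an {id: node} lookup, percent-decoding ids unless disabled."""
    index = {}
    for node in nodes:
        key = unquote(node["@id"]) if encoded_ids else node["@id"]
        if key in index:
            raise ValueError(f"duplicate id after decoding: {key!r}")
        index[key] = node
    return index


fields = [
    {"@type": "cr:Field",
     "@id": "csv1/table.csv_records/Economy%20%28GDP%20per%20Capita%29",
     "name": "Economy (GDP per Capita)"},
]

print(list(index_by_id(fields)))
print(list(index_by_id(fields, encoded_ids=False)))
```

With the default, the key is the decoded, readable form; passing `encoded_ids=False` keeps the id exactly as written in the file.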

Do others have any opinions on this? @pierrot0 @marcenacp @benjelloun

1 reply

goeffthomas Feb 17, 2024
Collaborator

And just to be clear, my leaning on this is not very strong. Maybe 60/40 in favor of defaulting to encoding. I understand those URIs don't look great. Unfortunately csv2/table.csv_records/Economy (GDP per Capita) also isn't good because of all that whitespace. Maybe this is just another one of those "it's on the dataset producer" things 🤷

benjelloun · 2024-06-03T15:25:49Z

benjelloun
Jun 3, 2024
Maintainer Author

Closing, as this was added to Croissant 1.0.

0 replies
