Deprecate Nested RecordSets in favor of repeated subField #750

Open
benjelloun opened this issue Sep 27, 2024 · 1 comment

Comments

@benjelloun
Contributor

The Croissant spec allows nesting RecordSets inside RecordSets by using a field with dataType="cr:RecordSet":

https://docs.mlcommons.org/croissant/docs/croissant-spec.html#nested-records

This mechanism has not been used much, is not supported in the mlcroissant library, and adds unneeded complexity.

Instead, we propose using the existing subField mechanism, and specifying repeated=true to represent multiple records.

Here is an example based on the one in the above documentation:

{
  "@type": "cr:RecordSet",
  "@id": "movies_with_ratings",
  "key": { "@id": "movies_with_ratings/movie_id" },
  "field": [
    {
      "@type": "cr:Field",
      "@id": "movies_with_ratings/movie_id",
      "source": { "@id": "movies/movie_id" },
      "references": { "@id": "ratings/movie_id" }
    },
    {
      "@type": "cr:Field",
      "@id": "movies_with_ratings/movie_title",
      "source": { "@id": "movies/title" }
    },
    {
      "@type": "cr:Field",
      "@id": "movies_with_ratings/ratings",
      "repeated": true,
      "subField": [
        {
          "@type": "cr:Field",
          "@id": "movies_with_ratings/ratings/user_id",
          "source": { "@id": "ratings/user_id" }
        },
        {
          "@type": "cr:Field",
          "@id": "movies_with_ratings/ratings/rating",
          "source": { "@id": "ratings/rating" }
        },
        {
          "@type": "cr:Field",
          "@id": "movies_with_ratings/ratings/timestamp",
          "source": { "@id": "ratings/timestamp" }
        }
      ]
    }
  ]
}

Note that using a repeated field with subFields also enables us to get rid of the cumbersome "parentField" property in the previous syntax. Instead, the join with the underlying ratings table is specified on the "movie_id" property.
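To make the intended shape concrete, here is a minimal sketch in plain Python of the records such a RecordSet could yield. It does not use or describe the mlcroissant API; the sample rows and the helper name `movies_with_ratings` are hypothetical, and the grouping on movie_id is only one reading of the "references" join above.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the "movies" and "ratings" tables.
movies = [
    {"movie_id": 1, "title": "Movie A"},
    {"movie_id": 2, "title": "Movie B"},
]
ratings = [
    {"movie_id": 1, "user_id": 10, "rating": 4.0, "timestamp": 1000},
    {"movie_id": 1, "user_id": 11, "rating": 5.0, "timestamp": 1001},
    {"movie_id": 2, "user_id": 10, "rating": 3.5, "timestamp": 1002},
]


def movies_with_ratings(movies, ratings):
    """Group ratings by movie_id and attach them to each movie as a list."""
    by_movie = defaultdict(list)
    for r in ratings:
        # Each entry mirrors the three subFields of the repeated "ratings" field.
        by_movie[r["movie_id"]].append(
            {"user_id": r["user_id"], "rating": r["rating"], "timestamp": r["timestamp"]}
        )
    return [
        {
            "movie_id": m["movie_id"],
            "movie_title": m["title"],
            # Repeated field: zero or more sub-records per movie.
            "ratings": by_movie.get(m["movie_id"], []),
        }
        for m in movies
    ]


records = movies_with_ratings(movies, ratings)
```

Each element of `records` is one movie, keyed by movie_id, with its ratings as a list of sub-records; no "parentField" is needed to express the relationship.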

benjelloun converted this from a draft issue Sep 27, 2024
@csbrown

csbrown commented Dec 19, 2024

This is a common representation for trees in general. Instead of actually nesting the data structure, maintain a flat data structure of all nodes, and have each node point to its immediate children, e.g.:

  tree = {
    root: [1,2],
    1: [3,4],
    2: [5,6],
    3: [7]
  }
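As a rough illustration, the same flat map can be written as a Python dict and walked recursively to recover the nested structure. The function name `descendants` is made up for this sketch.

```python
# The flat tree from above: each key maps to the list of its immediate children.
tree = {"root": [1, 2], 1: [3, 4], 2: [5, 6], 3: [7]}


def descendants(tree, node):
    """Return every node below `node`, in depth-first pre-order."""
    result = []
    for child in tree.get(node, []):  # nodes absent from the map are leaves
        result.append(child)
        result.extend(descendants(tree, child))
    return result


all_nodes = descendants(tree, "root")  # [1, 3, 7, 4, 2, 5, 6]
```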

This flat, reference-based representation is used by, for example, the GraphQL schema. GraphQL uses it because it makes it possible to define potentially infinite trees, where a subType for a type can be the type itself. IMHO, the GraphQL type system is pretty intelligent, and we could learn a lot from the setup there.

To this end, it might make sense to add the ability to define a compound type right in the .json file. For example, perhaps a movie can have a "sequel" field, which in turn is a movie itself, and which might have sequels, and so on.
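A self-referential type of this kind could look roughly like the following Python sketch. The `Movie` class, its fields, and the titles are purely illustrative and are not part of the Croissant spec or of any proposal in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Movie:
    title: str
    # The field's type is the enclosing type itself, so chains can be arbitrarily long.
    sequel: Optional["Movie"] = None


def sequel_chain(movie):
    """Follow the `sequel` references and collect the titles in order."""
    titles = []
    while movie is not None:
        titles.append(movie.title)
        movie = movie.sequel
    return titles


# Hypothetical three-movie series.
series = Movie("Part 1", sequel=Movie("Part 2", sequel=Movie("Part 3")))
titles = sequel_chain(series)  # ["Part 1", "Part 2", "Part 3"]
```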
