Formatting different splits of dataset #682

Irenetema · 2024-06-06T23:44:18Z

Hi,

I am working on a benchmark dataset created from the existing Snapshot Serengeti dataset. The dataset offers different types of novelties for evaluating novelty-detection computer vision systems.

The annotations are split into training, validation, and test sets, each stored as a JSONL file. Each file contains the URLs of the images in that set and label information such as the class.
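For illustration, one line of such a file could be parsed as in the sketch below. The field names are taken from the record set definition further down in this post; the `environment_id` and `novelty_type` values are made-up placeholders, not real annotations.

```python
import json

# One hypothetical JSONL line. Field names follow the record set in this post;
# the environment_id and novelty_type values are placeholders.
line = (
    '{"image_path": "S6/P07/P07_R2/S6_P07_R2_IMAG0077.JPG", '
    '"width": 2048, "height": 1536, '
    '"environment_id": 0, "novelty_type": 0}'
)

# Each line of a JSONL file is an independent JSON object.
record = json.loads(line)
print(record["image_path"], record["width"], record["height"])
```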

I am trying to create the Croissant metadata so that a specific split (train, valid, or test) can be downloaded with the Python library, but I have not been able to figure out how from the documentation.

    distribution = [
        # NOVEL-SS annotations:
        mlc.FileObject(
            id="jsonl-files",
            name="jsonl-files",
            description="Zip archive of the NOVEL-SS annotation files (train, valid, test).",
            content_url="https://raw.githubusercontent.com/Irenetema/NOVEL_SS/master/labels/croissant_jsonl.zip",
            encoding_format="application/zip",
            sha256="6265b65ce08acafc3cd55233d32135625857ad421927e8f3af71c789ad434a85",
        ),
        mlc.FileObject(
            id="train_annotations",
            name="train_annotations",
            description="NOVEL-SS training set image annotations.",
            contained_in=["jsonl-files"],
            content_url="train.jsonl",
            encoding_format="application/jsonlines"
        ),
        mlc.FileObject(
            id="valid_annotations",
            name="valid_annotations",
            description="NOVEL-SS validation set image annotations.",
            contained_in=["jsonl-files"],
            content_url="valid.jsonl",
            encoding_format="application/jsonlines"
        ),
        mlc.FileObject(
            id="test_annotations",
            name="test_annotations",
            description="NOVEL-SS test set image annotations.",
            contained_in=["jsonl-files"],
            content_url="test.jsonl",
            encoding_format="application/jsonlines"
        ),
    ]

    record_sets = [
        # RecordSets contains records in the dataset.
        mlc.RecordSet(
            id="images_and_bbox",
            name="images_and_bbox",
            key="name",
            fields=[
                mlc.Field(
                    id="images_and_bbox/image_path",
                    name="image_path",
                    description="Snapshot Serengeti image path (e.g. S6/P07/P07_R2/S6_P07_R2_IMAG0077.JPG)",
                    data_types=mlc.DataType.TEXT,
                    source=mlc.Source(
                        file_set="train_annotations",
                        extract=mlc.Extract(column="image_path"),
                    ),
                ),
                mlc.Field(
                    id="images_and_bbox/width",
                    name="width",
                    description="Image width (e.g., 2048)",
                    data_types=mlc.DataType.INTEGER,
                    source=mlc.Source(
                        file_set="train_annotations",
                        extract=mlc.Extract(column="width"),
                    ),
                ),
                mlc.Field(
                    id="images_and_bbox/height",
                    name="height",
                    description="Image height (e.g., 1536)",
                    data_types=mlc.DataType.INTEGER,
                    source=mlc.Source(
                        file_set="train_annotations",
                        extract=mlc.Extract(column="height"),
                    ),
                ),
                mlc.Field(
                    id="images_and_bbox/environment_id",
                    name="environment_id",
                    description="id of environment (lighting condition) of the image",
                    data_types=mlc.DataType.INTEGER,
                    source=mlc.Source(
                        file_set="train_annotations",
                        extract=mlc.Extract(column="environment_id"),
                    ),
                ),
                mlc.Field(
                    id="images_and_bbox/novelty_type",
                    name="novelty_type",
                    description="integer identifying the type of novelty in the image",
                    data_types=mlc.DataType.INTEGER,
                    source=mlc.Source(
                        file_set="train_annotations",
                        extract=mlc.Extract(column="novelty_type"),
                    ),
                ),
            ],
        ),
    ]

    # Metadata contains information about the dataset.
    metadata = mlc.Metadata(
        name="NOVEL-SS",
        # Descriptions can contain plain text or markdown.
        description=(
            ""
        ),
        cite_as=(
            ""
        ),
        distribution=distribution,
        record_sets=record_sets
    )

When I download only one set (see file_set="train_annotations" above), it works, but I don't see how to make the split a parameter.

Any idea how to do this?
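To make the goal concrete, here is a minimal sketch of the behaviour I am after, written with the Python standard library only. It is not Croissant API. The helper name `read_split` is hypothetical, and it assumes the zip archive has already been downloaded and contains `train.jsonl`, `valid.jsonl`, and `test.jsonl`, as in the distribution above.

```python
import io
import json
import zipfile

SPLITS = ("train", "valid", "test")


def read_split(archive, split):
    """Return the records of `<split>.jsonl` stored inside a zip archive.

    `archive` is a path or a binary file-like object (hypothetical helper).
    """
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    target = f"{split}.jsonl"
    with zipfile.ZipFile(archive) as zf:
        # Match on the base name so a leading folder inside the zip is tolerated.
        matches = [n for n in zf.namelist() if n.rsplit("/", 1)[-1] == target]
        if not matches:
            raise FileNotFoundError(f"{target} not found in archive")
        with zf.open(matches[0]) as fh:
            text = io.TextIOWrapper(fh, encoding="utf-8")
            # One JSON object per non-empty line.
            return [json.loads(line) for line in text if line.strip()]
```

With the archive saved locally, `read_split("croissant_jsonl.zip", "valid")` would return the validation annotations as a list of dicts. What I am looking for is the equivalent of that `split` argument expressed in the Croissant metadata itself.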
