Attributions for copied content are missing #414

ThinkOpenly · 2025-01-13T15:08:20Z

Describe the bug
We copy a lot of content from the ISA here. The ISA is licensed under "Creative Commons Attribution 4.0 International License". This repository, as far as I see, is entirely licensed under BSD-3-Clause-Clear (see "LICENSE"). This license is similar, but not the same, and we seem to be lacking the proper attribution for the content. https://creativecommons.org/share-your-work/use-remix/.

Additional context
Something like:

Some content derived from:

"The RISC-V Instruction Set Manual, Volume I: Unprivileged Architecture, Version 20240411" Creative Commons Attribution 4.0 International License.

"The RISC-V Instruction Set Manual, Volume II: Privileged Architecture, Version 20240411" Creative Commons Attribution 4.0 International License.

in "LICENSE", perhaps?

jjscheel · 2025-01-14T23:59:54Z

@ThinkOpenly, great catch. We need to give this some thought. This is where we likely need to separate the "DB" content from tooling and IF we must keep them together, then we need to clearly articulate in the repo licenses and associated README content these various licenses.

@kbroch-rivosinc, the RVI Staff will ultimately need to weigh in and help provide guidance here as RISC-V does not have an OSPO. We do, however, have plenty of experienced folks inside the LF who can assist as well.

So, let's start with whether we think this mixing of content is necessary to the long-term project. Thoughts here, @dhower-qc?

ThinkOpenly · 2025-01-15T03:43:44Z

> This is where we likely need to separate the "DB" content from tooling

Do you mean DB content from ISA content?

> IF we must keep them together, then we need to clearly articulate in the repo licenses and associated README content these various licenses.
>
> @kbroch-rivosinc, the RVI Staff will ultimately need to weigh in and help provide guidance here as RISC-V does not have an OSPO. We do, however, have plenty of experienced folks inside the LF who can assist as well.

How do we decide on appropriate attribution? Who decides what is acceptable? (Honest question.) Is my suggestion above insufficient (if vague)? Perhaps the covered content needs to be described more succinctly:

> Instruction descriptions and CSR descriptions derived from:
> - "The RISC-V Instruction Set Manual, Volume I: Unprivileged Architecture, Version 20240411" Creative Commons Attribution 4.0 International License.
> - "The RISC-V Instruction Set Manual, Volume II: Privileged Architecture, Version 20240411" Creative Commons Attribution 4.0 International License.

It looks like some (not all) extensions and profile classes already call out their licensing, e.g. arch/ext/B.yaml:

doc_license:
  name: Creative Commons Attribution 4.0 International License
  url: https://creativecommons.org/licenses/by/4.0/

...similarly the ISA verbatim content, in arch/manual/isa.yaml.

Maybe similar attribution could be added to the instructions, CSRs, and the rest of the extensions, and call it a day?
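As a rough illustration of what "add it to the rest" would involve, here is a minimal sketch that lists entries still lacking a doc_license key. The dictionaries stand in for parsed YAML files; the paths and every key other than doc_license are assumptions for the example, not the actual schema.

```python
# Stand-in for the doc_license block shown above for arch/ext/B.yaml.
CC_BY_4 = {
    "name": "Creative Commons Attribution 4.0 International License",
    "url": "https://creativecommons.org/licenses/by/4.0/",
}

# Hypothetical parsed YAML documents, keyed by path (illustrative only).
entries = {
    "arch/ext/B.yaml": {"name": "B", "doc_license": CC_BY_4},
    "arch/inst/I/addi.yaml": {"name": "addi"},
    "arch/csr/mstatus.yaml": {"name": "mstatus"},
}


def missing_doc_license(parsed):
    """Return the sorted paths whose data carries no doc_license key."""
    return sorted(path for path, data in parsed.items() if "doc_license" not in data)


print(missing_doc_license(entries))
```

A check like this could run in CI so that newly added files cannot silently omit attribution.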

I think we also need to add attribution for the Sail code in the instructions, by the way.

> So, let's start with whether we think this mixing of content is necessary to the long-term project. Thoughts here, @dhower-qc?

If the goal is "one-stop shopping", I'd argue that the mixed content is necessary. If we extrapolate, it could be a desirable outcome that the DB contains the canonical information, and the documentation (ISA) pulls what it needs from the DB.

jjscheel · 2025-01-15T14:36:46Z

> Do you mean DB content from ISA content?

I mean "DB" as in the tooling associated with Unified DB, not the content. The ISA material is a good example of what I'd call content. More generally, I'd argue all yaml file information is "content" versus the tools/code that operate on the yaml files, such as Antora and other things. Make sense?

> How do we decide on appropriate attribution? Who decides what is acceptable? (Honest question.) Is my suggestion above insufficient (if vague)? Perhaps the covered content needs to be described more succinctly:

RVI (specifically Andrea and I, as management) has a responsibility to protect our IP. Thus, questions around licensing and IP come to us. As an example, I'm the one who reviews all specification contributions as part of the process and ensures they are made by members. This work is part of the "Specification Policies" review that occurs before Freeze and Ratification.

So, I'm "here to help". ;-)

> If the goal is "one-stop shopping", I'd argue that the mixed content is necessary. If we extrapolate, it could be a desirable outcome that the DB contains the canonical information, and the documentation (ISA) pulls what it needs from the DB.

While a one-stop shopping goal is always preferable, the question always becomes "from whose perspective?" As an example, we don't put our webpages in the Apache webserver repo, nor do Internet users care.

FWIW, the Debug spec has set some precedent for mixed licensing of tools and content. But honestly, we haven't done much.

dhower-qc · 2025-01-15T15:02:19Z

> @ThinkOpenly, great catch. We need to give this some thought. This is where we likely need to separate the "DB" content from tooling and IF we must keep them together, then we need to clearly articulate in the repo licenses and associated README content these various licenses.
>
> @kbroch-rivosinc, the RVI Staff will ultimately need to weigh in and help provide guidance here as RISC-V does not have an OSPO. We do, however, have plenty of experienced folks inside the LF who can assist as well.
>
> So, let's start with whether we think this mixing of content is necessary to the long-term project. Thoughts here, @dhower-qc?

I do think there is great technical value in keeping text content and executable content together long term. We should be able to come up with a way to attribute them separately. For example, we can have BSD-3 apply to all Ruby/Python/C++/IDL files. We could specify that content in YAML files has mixed licensing: Sail appears to be some form of BSD-2, IDL is BSD-3, prose can be CC BY 4.0, metadata (e.g., encodings) is ?? (ask a lawyer). We should capture attribution of prose that is copied/derived from the ISA manual.
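To make the per-file-type split concrete, here is a minimal sketch of a lookup from file suffix to an SPDX expression. The suffix list, the function name, and the combined expression for YAML files are assumptions for illustration, not a decided policy; the license for metadata such as encodings is deliberately left open, as noted above.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical suffixes for tooling sources (Ruby, Python, C++, IDL).
TOOLING_SUFFIXES = {".rb", ".py", ".cpp", ".h", ".idl"}
TOOLING_LICENSE = "BSD-3-Clause-Clear"

# YAML files may mix IDL, Sail, and ISA-manual prose, so the sketch
# assigns them a compound SPDX expression (an assumption, not legal advice).
YAML_LICENSE = "BSD-3-Clause-Clear AND BSD-2-Clause AND CC-BY-4.0"


def license_for(path: str) -> Optional[str]:
    """Return an SPDX expression for a repo path, or None if undecided."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in TOOLING_SUFFIXES:
        return TOOLING_LICENSE
    if suffix in {".yaml", ".yml"}:
        return YAML_LICENSE
    return None  # anything else still needs a human (or a lawyer) to decide


print(license_for("lib/validate.rb"))
print(license_for("arch/inst/I/addi.yaml"))
```

Returning None for unknown file types keeps the gaps visible rather than defaulting them to a license nobody chose.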

ThinkOpenly · 2025-01-15T17:07:01Z

Does a separation of tools and content offer much benefit? The tools may have a single license within the project scope, but the content, as we're seeing, has multiple origins with their own respective licenses. The need to identify how different subsets of overall project content are licensed persists.

> If the goal is "one-stop shopping", I'd argue that the mixed content is necessary. If we extrapolate, it could be a desirable outcome that the DB contains the canonical information, and the documentation (ISA) pulls what it needs from the DB.
>
> While a one-stop shopping goal is always preferable, the question always becomes "from whose perspective?" As an example, we don't put our webpages in the Apache webserver repo, nor do Internet users care.

It's at least clear to me that "all" of the information particular to an instruction should be in one place: the name, description, syntax, operands/types, encoding, semantics, encompassing extension, implementation notes, etc. In that way, downstream uses (documentation, assemblers, disassemblers, simulators/emulators, hardware) can all go to one well-curated and validated source.

> FWIW, the Debug spec has set some precedent for mixed licensing of tools and content. But honestly, we haven't done much.

Any thoughts on my simple proposal, above:

> Maybe similar attribution could be added to the instructions, CSRs, and the rest of the extensions, and call it a day?

Is what is currently done for the extensions, profile classes, and verbatim ISA content sufficient?

kbroch-rivosinc · 2025-01-15T19:32:48Z

Discussed with @dhower-qc about providing a POC of a technical solution to try and faithfully denote the copyright/licensing of the content (docs and code) of this repo (I had originally said I would mock it up in doc-sig repo but I think providing a PR here would be more useful).

Goals of POC

- allow multiple licenses on a file
- no need to annotate every file with copyright/license (although in the long run this could be a good idea)
- license/copyright validation tooling (reuse) passes
- GitHub UI "understands" the licenses

Non-goals

- won't identify which parts of a file relate to which license in the multi-license case (this could be done with comments in the YAML)
- not legally reviewed (I'm not a lawyer)
- this is by no means the final solution (just a POC, better than not representing the existing licensing)

Implementation

Approach can be summarized by this example file (NOTE: the dual license on this line):

- create a REUSE.toml file with copyright/licensing information (use this instead of dep5, which is deprecated)
- add the correct LICENSE files to the top-level dir of the repo
- symlink those files into the LICENSES dir (that's where the reuse tool expects them, and the GitHub UI can't deal with symlinks)
- add automation to show that license/copyright validation is compliant (use the reuse API and put a badge on the README)

I'm happy to put in a PR for this if others think it would be useful or we can just discuss other existing open source projects that deal with multi-license issues.

jjscheel · 2025-01-15T19:35:01Z

@kbroch-rivosinc, I was starting to reach a similar conclusion. Can we articulate a "requirement" that anything "imported" includes metadata about the importing license? It seems that we'd want to annotate that data somehow (footnote, twistie, etc.)

kbroch-rivosinc · 2025-01-15T21:20:39Z

> @kbroch-rivosinc, I was starting to reach a similar conclusion. Can we articulate a "requirement" that anything "imported" includes metadata about the importing license? It seems that we'd want to annotate that data somehow (footnote, twistie, etc.)

reuse has the notion of snippets https://reuse.software/spec-3.3/#comment-headers to denote sections of a file under different copyright/licensing. This would work for the yaml files that include the "imported" content.

jjscheel · 2025-01-16T12:46:05Z

Nice. Any idea what the "visible" (pdf or html) output looks like, @kbroch-rivosinc?

kbroch-rivosinc · 2025-01-16T15:01:27Z

> Nice. Any idea what the "visible" (pdf or html) output looks like, @kbroch-rivosinc?

I don't think at the moment there's anything in main branch but I'm sure it could be added if needed. Here's an example of a currently generated instruction page: https://riscv-software-src.github.io/riscv-unified-db/manual/html/isa/20240411/insts/addi.html

Just as the Antora footer notes that it is MPL-2.0, the template that generates the adoc for the addi instruction could mention Sail licensing in the Sail section and isa-manual licensing in any section taken from the manual.

kbroch-rivosinc · 2025-01-16T17:26:26Z

I've pushed the POC to a draft PR for those who want to take a look. I think it accomplishes what I listed above, and I also added an example of putting a snippet comment section in one file. Again, this uses the reuse tool, which is actively developed, and it uses SPDX, which is backed by the LF: https://reuse.software/faq/#what-is-spdx

Not saying the PR should be accepted but if something like it was then other features would be:

- the GitHub UI would recognize the 3 different LICENSE-*.txt files
- it could be configured to use the reuse API and denote compliance: https://api.reuse.software/info/github.com/fsfe/reuse-tool

Here's what I see if I run reuse lint on this branch now:

~/rvi/repos/riscv-software-src/riscv-unified-db on dev/kbroch/multi-license-poc:main wip ⇡1 +1 !2                   Py 3.12.4 at 09:12:27 AM
❯ reuse lint
# SUMMARY

* Bad licenses: 0
* Deprecated licenses: 0
* Licenses without file extension: 0
* Missing licenses: 0
* Unused licenses: 0
* Used licenses: BSD-3-Clause-Clear, CC-BY-4.0, BSD-2-Clause
* Read errors: 0
* Files with copyright information: 2130 / 2130
* Files with license information: 2130 / 2130

Congratulations! Your project is compliant with version 3.3 of the REUSE Specification :-)

If you run reuse lint --json you can get specific details on each file. For example the one I put snippet info in:

~/rvi/repos/riscv-software-src/riscv-unified-db on dev/kbroch/multi-license-poc:main wip ⇡1 +1 !2                   Py 3.12.4 at 09:12:27 AM
❯ reuse lint --json | jq '.files[] | select(.path == "arch/inst/I/addi.yaml")'

{
  "path": "arch/inst/I/addi.yaml",
  "copyrights": [
    {
      "value": "FIXME: Sail copyright holders",
      "source": "REUSE.toml",
      "source_type": "reuse-toml"
    },
    {
      "value": "FIXME: isa-manual copyright holders",
      "source": "REUSE.toml",
      "source_type": "reuse-toml"
    },
    {
      "value": "Copyright (c) 2024, Qualcomm Innovation Center, Inc. All rights reserved.",
      "source": "REUSE.toml",
      "source_type": "reuse-toml"
    },
    {
      "value": "SPDX-SnippetCopyrightText: FIXME: Sail copyright holders",
      "source": "arch/inst/I/addi.yaml",
      "source_type": "file-header"
    }
  ],
  "spdx_expressions": [
    {
      "value": "BSD-3-Clause-Clear",
      "source": "REUSE.toml",
      "source_type": "reuse-toml"
    },
    {
      "value": "BSD-2-Clause",
      "source": "REUSE.toml",
      "source_type": "reuse-toml"
    },
    {
      "value": "CC-BY-4.0",
      "source": "REUSE.toml",
      "source_type": "reuse-toml"
    },
    {
      "value": "BSD-2-Clause",
      "source": "arch/inst/I/addi.yaml",
      "source_type": "file-header"
    }
  ]
}
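The same per-file lookup can be done without jq. Below is a minimal Python sketch that filters a report shaped like the JSON above; the embedded report is a trimmed copy of that output, and the helper name is made up for the example.

```python
import json

# Trimmed copy of the `reuse lint --json` output shown above.
REPORT = json.loads("""
{
  "files": [
    {
      "path": "arch/inst/I/addi.yaml",
      "spdx_expressions": [
        {"value": "BSD-3-Clause-Clear", "source": "REUSE.toml", "source_type": "reuse-toml"},
        {"value": "BSD-2-Clause", "source": "REUSE.toml", "source_type": "reuse-toml"},
        {"value": "CC-BY-4.0", "source": "REUSE.toml", "source_type": "reuse-toml"},
        {"value": "BSD-2-Clause", "source": "arch/inst/I/addi.yaml", "source_type": "file-header"}
      ]
    }
  ]
}
""")


def licenses_for(report, path):
    """Return the distinct SPDX expressions recorded for one file path."""
    for entry in report["files"]:
        if entry["path"] == path:
            return sorted({expr["value"] for expr in entry["spdx_expressions"]})
    return []  # path not present in the report


print(licenses_for(REPORT, "arch/inst/I/addi.yaml"))
```

Deduplicating matters here because BSD-2-Clause is reported twice, once from REUSE.toml and once from the snippet header in the file itself.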

ThinkOpenly · 2025-01-16T18:04:49Z

> I've pushed the POC to a draft PR [...]

Missing from the comment above is the implementation that can obviously be found in the PR. In a few lines:

# SPDX-SnippetBegin
# SPDX-SnippetCopyrightText: FIXME: Sail copyright holders
# SPDX-License-Identifier: BSD-2-Clause
sail(): |
  {
    let rs1_val = X(rs1);
    let immext : xlenbits = sign_extend(imm);
    let result : xlenbits = match op {
      RISCV_ADDI  => rs1_val + immext,
[...]
    };
    X(rd) = result;
    RETIRE_SUCCESS
  }

# SPDX-SnippetEnd

It certainly solves the problem in the YAML source.

Subjectively, it's slightly ugly, but not too bad. :-) Note that "Sail copyright holders" is a fairly lengthy text (currently 39 lines/entities). (So, it'll get uglier unless this text is just a short reference to the full text.)

It is not integrated as usable YAML data, so downstream users can't determine the license associated with the YAML values simply by reading the YAML as data, since the annotations are embedded exclusively in comments. To best "protect the IP", I think we want a solution that makes it easy for downstream users of the YAML to associate a license with the covered content.
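To illustrate the extra work this pushes downstream, here is a minimal sketch of what a consumer would have to do to recover snippet licenses: scan the raw text for the SPDX comment tags, because a YAML parser discards comments. The sample text abbreviates the addi example above, and the function is hypothetical; it does not handle nested snippets.

```python
# Abbreviated version of the annotated YAML shown above.
YAML_TEXT = """\
name: addi
# SPDX-SnippetBegin
# SPDX-SnippetCopyrightText: FIXME: Sail copyright holders
# SPDX-License-Identifier: BSD-2-Clause
sail(): |
  {
    X(rd) = result;
    RETIRE_SUCCESS
  }
# SPDX-SnippetEnd
"""


def snippet_licenses(text):
    """Return (first_line, last_line, license) per snippet, 1-indexed."""
    found = []
    start = None
    license_id = None
    for number, line in enumerate(text.splitlines(), start=1):
        comment = line.strip()
        if comment == "# SPDX-SnippetBegin":
            start, license_id = number, None
        elif start is not None and comment.startswith("# SPDX-License-Identifier:"):
            license_id = comment.split(":", 1)[1].strip()
        elif start is not None and comment == "# SPDX-SnippetEnd":
            found.append((start, number, license_id))
            start = None
    return found


print(snippet_licenses(YAML_TEXT))
```

The result is a line range rather than a YAML key, so the consumer still has to map it back to the `sail():` value, which is the gap described above.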

ThinkOpenly added the "bug" (Something isn't working) label Jan 13, 2025

kbroch-rivosinc mentioned this issue Jan 16, 2025

POC using reuse tool to express multi-license #424

Draft
