Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CELEBORN-1031] SBT correct the LICENSE and NOTICE for shaded client jars #1967

Closed
wants to merge 11 commits into from

Conversation

cxzl25
Copy link
Contributor

@cxzl25 cxzl25 commented Oct 10, 2023

What changes were proposed in this pull request?

Flink/Spark jars packaged with SBT use the correct LICENSE and NOTICE.

Why are the changes needed?

#1930 (comment)

Does this PR introduce any user-facing change?

How was this patch tested?

@codecov
Copy link

codecov bot commented Oct 10, 2023

Codecov Report

Merging #1967 (1ad6634) into main (4465a92) will increase coverage by 0.14%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #1967      +/-   ##
==========================================
+ Coverage   46.71%   46.84%   +0.14%     
==========================================
  Files         165      165              
  Lines       10668    10668              
  Branches      972      972              
==========================================
+ Hits         4982     4996      +14     
+ Misses       5362     5350      -12     
+ Partials      324      322       -2     

see 2 files with indirect coverage changes

📣 Codecov offers a browser extension for seamless coverage viewing on GitHub. Try it in Chrome or Firefox today!

@cxzl25
Copy link
Contributor Author

cxzl25 commented Oct 12, 2023

cc @pan3793 @cfmcgrady

case m if m == "META-INF/LICENSE.txt" => MergeStrategy.discard
case m if m == "META-INF/NOTICE.txt" => MergeStrategy.discard
case m if m == "LICENSE.txt" => MergeStrategy.discard
case m if m == "NOTICE.txt" => MergeStrategy.discard
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to know which jar the file comes from?

Copy link
Member

@pan3793 pan3793 Oct 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the ideal goal here is to drop all files **/LICENSE.* and **/NOTICE.* that come from the third-party jars and only keep the one provided by ourselves. I don't find an elegant approach in Maven, hope we can make it happen in SBT

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maven-shade-plugin provides some license transformers.

https://maven.apache.org/plugins/maven-shade-plugin/examples/resource-transformers.html

<transformer implementation="org.apache.maven.plugins.shade.resource.ApacheLicenseResourceTransformer"/>
<transformer implementation="org.apache.maven.plugins.shade.resource.ApacheNoticeResourceTransformer"/>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented a simple ResourceTransformer that can output LICENSE.txt and NOTICE.txt source jar.

import org.apache.maven.plugins.shade.resource.ResourceTransformer;

public class PrintResourceTransformer implements ResourceTransformer {
  @Override
  public boolean canTransformResource(String s) {
    boolean hasFlag = false;
    if (s.contains("LICENSE.txt")) {
      hasFlag = true;
    } else if (s.contains("NOTICE.txt")) {
      hasFlag = true;
    }
    if (hasFlag) {
      URL resource = this.getClass().getResource("/" + s);
      System.err.printf("PrintResourceTransformer %s,%s,%s%n", s, hasFlag, resource);
    }
    return hasFlag;
  }
PrintResourceTransformer META-INF/LICENSE.txt,true,jar:file:/Users/csy/.m2/repository/org/jdom/jdom2/2.0.6.1/jdom2-2.0.6.1.jar!/META-INF/LICENSE.txt
PrintResourceTransformer META-INF/NOTICE.txt,true,jar:file:/Users/csy/.m2/repository/commons-io/commons-io/2.6/commons-io-2.6.jar!/META-INF/NOTICE.txt

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately, the built-in apache license/notice transformer does not work well during my previous test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately, the built-in apache license/notice transformer does not work well during my previous test

ApacheLicenseResourceTransformer just ignores META-INF/LICENSE, META-INF/LICENSE.txt, META-INF/LICENSE.md.

https://github.com/apache/maven-shade-plugin/blob/master/src/main/java/org/apache/maven/plugins/shade/resource/ApacheLicenseResourceTransformer.java#L31

Copy link
Member

@pan3793 pan3793 Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cxzl25 can we implement it in sbt? the previous discussion aims to remove maven and only keep sbt eventually.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the ideal goal here is to drop all files **/LICENSE.* and **/NOTICE.* that come from the third-party jars and only keep the one provided by ourselves. I don't find an elegant approach in Maven, hope we can make it happen in SBT

we can refer to pekko, and achieve this goal with MergeStrategy.concat

assembly / assemblyMergeStrategy := {
        case n if n.endsWith(".class")  => MergeStrategy.first
        case n if n.endsWith(".txt")    => MergeStrategy.concat
        case n if n.endsWith("NOTICE")  => MergeStrategy.concat
        case n if n.endsWith("LICENSE") => MergeStrategy.concat
        case n                          => (assembly / assemblyMergeStrategy).value.apply(n)
      },

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using contains? I tested it locally and verified that it can be excluded.

@cfmcgrady
Copy link
Contributor

I wrote a script and found that SBT's LICENSE and NOTICE are consistent with Maven's.

#!/bin/bash

# Specify the paths for two directories
dir1="sbt"
dir2="maven"

# Use find command to list all files in the first directory
find "$dir1" -type f | while read file1; do
    # Get the relative path
    rel_path="${file1#$dir1}"
    # Construct the corresponding file path in the second directory
    file2="$dir2$rel_path"

    # Check if a file with the same path exists in the second directory
    if [ -e "$file2" ]; then
        # Use md5sum command to compare contents of the two files
        hash1=$(md5sum "$file1" | awk '{print $1}')
        hash2=$(md5sum "$file2" | awk '{print $1}')

        # Compare MD5 hash values
        if [[ "$file1" != *.class && "$hash1" != "$hash2" ]]; then
            echo "Files differ: $file1 and $file2"
        fi
    else
        # Print file path if it doesn't exist in second directory
        echo "File does not exist in second directory: $file1"
    fi
done

# Check if there are files in the second directory not in the first
find "$dir2" -type f | while read file2; do
    # Get the relative path
    rel_path="${file2#$dir2}"
    # Construct the corresponding file path in the first directory
    file1="$dir1$rel_path"

    # Print file path if it doesn't exist in first directory
    if [ ! -e "$file1" ]; then
        echo "File does not exist in first directory: $file2"
    fi
done
Files differ: sbt/celeborn-client-spark-3-shaded_2.12-0.4.0-SNAPSHOT.jar and maven/celeborn-client-spark-3-shaded_2.12-0.4.0-SNAPSHOT.jar
File does not exist in second directory: sbt/org/apache/celeborn/package$.class
File does not exist in second directory: sbt/org/apache/celeborn/package.class
Files differ: sbt/META-INF/services/reactor.blockhound.integration.BlockHoundIntegration and maven/META-INF/services/reactor.blockhound.integration.BlockHoundIntegration
Files differ: sbt/META-INF/MANIFEST.MF and maven/META-INF/MANIFEST.MF
File does not exist in second directory: sbt/META-INF/NOTICE.txt
File does not exist in second directory: sbt/META-INF/LICENSE.txt
File does not exist in second directory: sbt/META-INF/INDEX.LIST
File does not exist in first directory: maven/META-INF/maven/org.apache.celeborn/celeborn-common_2.12/pom.xml
File does not exist in first directory: maven/META-INF/maven/org.apache.celeborn/celeborn-common_2.12/pom.properties
File does not exist in first directory: maven/META-INF/maven/org.apache.celeborn/celeborn-client_2.12/pom.xml
File does not exist in first directory: maven/META-INF/maven/org.apache.celeborn/celeborn-client_2.12/pom.properties
File does not exist in first directory: maven/META-INF/maven/org.apache.celeborn/celeborn-client-spark-3-shaded_2.12/pom.xml
File does not exist in first directory: maven/META-INF/maven/org.apache.celeborn/celeborn-client-spark-3-shaded_2.12/pom.properties
File does not exist in first directory: maven/META-INF/maven/org.apache.celeborn/celeborn-client-spark-common_2.12/pom.xml
File does not exist in first directory: maven/META-INF/maven/org.apache.celeborn/celeborn-client-spark-common_2.12/pom.properties
File does not exist in first directory: maven/META-INF/maven/org.apache.celeborn/celeborn-client-spark-3_2.12/pom.xml
File does not exist in first directory: maven/META-INF/maven/org.apache.celeborn/celeborn-client-spark-3_2.12/pom.properties
File does not exist in first directory: maven/META-INF/DEPENDENCIES

@pan3793
Copy link
Member

pan3793 commented Oct 16, 2023

I wrote a script and found that SBT's LICENSE and NOTICE are consistent with Maven's.

Thanks for the effort, actually, as mentioned above, I don't think the current implementation in Maven is perfect.

The goals of this part are:

  1. filter out all LICENSE.* and NOTICE.* from the third-party jars, no matter from which folders.
  2. only include the LICENSE and NOTICE we provide.

@@ -679,6 +691,8 @@ trait SparkClientProjects {

(assembly / assemblyMergeStrategy) := {
case m if m.toLowerCase(Locale.ROOT).endsWith("manifest.mf") => MergeStrategy.discard
case m if m.startsWith("META-INF/license/") => MergeStrategy.discard
case PathList(ps@_*) if Assembly.isLicenseFile(ps.last) => licenseFileOnlyProjectMergeStrategy(ps.mkString(java.io.File.separator))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sbtassembly.Assembly#isLicenseFile

  /**
   * Returns a flag if the given file is a license file
   *
   * @param fileName the given file to check
   * @return flag representing if the file is a license file
   */
  def isLicenseFile(fileName: String): Boolean = {
    val LicenseFile = """(license|licence|notice|copying)([.]\w+)?$""".r
    fileName.toLowerCase match {
      case LicenseFile(_, ext) if ext != ".class" => true // DISLIKE
      case _                                      => false
    }
  }

@@ -205,6 +205,18 @@ object CelebornCommonSettings {
Dependencies.junit % "test",
// https://www.scala-sbt.org/1.x/docs/Testing.html
Dependencies.junitInterface % "test")

def licenseFileOnlyProjectMergeStrategy(path: String) =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

./build/make-distribution.sh --sbt-enabled
META-INF/LICENSE from Project dependency: Project name = celeborn-client-spark-3-shaded_2.12-0.4.0-SNAPSHOT, target = META-INF/LICENSE
[Discard] META-INF/NOTICE.txt Not from Project dependency: Jar name = commons-lang3-3.12.0.jar, jar org = org.apache.commons, entry target = META-INF/NOTICE.txt
[Discard] META-INF/LICENSE.txt Not from Project dependency: Jar name = commons-lang3-3.12.0.jar, jar org = org.apache.commons, entry target = META-INF/LICENSE.txt
META-INF/NOTICE from Project dependency: Project name = celeborn-client-spark-3-shaded_2.12-0.4.0-SNAPSHOT, target = META-INF/NOTICE

println(s"$path from Project dependency: $dep")
Some(JarEntry(dep.target, dep.stream))
case dep =>
println(s"[Discard] $path Not from Project dependency: $dep")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I roughly remember the sbt has a built-in slf4j logger

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@pan3793 pan3793 Nov 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about

assembly - including file: celeborn-client-spark-3-shaded_2.12:META-INF/LICENSE
assembly - excluding file: commons-lang3-3.12.0.jar:META-INF/LICENSE.txt

?

@@ -205,6 +205,18 @@ object CelebornCommonSettings {
Dependencies.junit % "test",
// https://www.scala-sbt.org/1.x/docs/Testing.html
Dependencies.junitInterface % "test")

def licenseFileOnlyProjectMergeStrategy(log: Logger) =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[info] assembly - excluding file: Jar name = commons-lang3-3.12.0.jar, jar org = org.apache.commons, entry target = META-INF/LICENSE.txt
[info] assembly - excluding file: Jar name = commons-lang3-3.12.0.jar, jar org = org.apache.commons, entry target = META-INF/NOTICE.txt
[info] assembly - including file: META-INF/NOTICE
[info] assembly - including file: META-INF/LICENSE

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the built-in strategy MergeStrategy.preferProject can achieve this goal

merge strategy

MergeStrategy.preferProject will choose the first project file over library files if present. Otherwise, it works like MergeStrategy.first

Copy link
Contributor Author

@cxzl25 cxzl25 Nov 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not? If our project does not have the same file

MergeStrategy.preferProject

image

https://github.com/sbt/sbt-assembly/blob/e4d3b362997a32da21d97c051c9d92ecc5cdb9c2/src/main/scala/sbtassembly/MergeStrategy.scala#L82-L88

sbtassembly.MergeStrategy#preferProject

  /**
   * Picks the dependency that is the first project/internal dependency, otherwise picks the first library/external dependency
   */
  val preferProject: MergeStrategy = MergeStrategy("PreferProject") { conflicts =>
    val entry = conflicts.find(_.isProjectDependency).getOrElse(conflicts.head)
    Right(Vector(JarEntry(entry.target, entry.stream)))
  }

use CustomMergeStrategy

image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it feasible to combine using MergeStrategy.preferProject and MergeStrategy.discard?

case "LICENSE" | "NOTICE" => MergeStrategy.preferProject
case m if Assembly.isLicenseFile(m) => MergeStrategy.discard

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

          case PathList(ps@_*) if ps.last == "LICENSE" || ps.last == "NOTICE" => MergeStrategy.preferProject
          case PathList(ps@_*) if Assembly.isLicenseFile(ps.last) => MergeStrategy.discard
image

@pan3793 pan3793 requested a review from cfmcgrady November 1, 2023 08:15
@@ -870,6 +885,8 @@ trait FlinkClientProjects {

(assembly / assemblyMergeStrategy) := {
case m if m.toLowerCase(Locale.ROOT).endsWith("manifest.mf") => MergeStrategy.discard
case m if m.startsWith("META-INF/license/") => MergeStrategy.discard
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to discard the files in folder META-INF/license/?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MR module depends on the netty 3x version, so there is a META-INF/license/ directory
Check that Spark and Flink do not depend on netty 3x. This configuration is now removed.

https://repo1.maven.org/maven2/io/netty/netty/3.10.6.Final/netty-3.10.6.Final.jar

image

@cfmcgrady
Copy link
Contributor

$ ./build/sbt -Pspark-3.3 celeborn-client-spark-3-shaded/assembly
$ jar tf client-spark/spark-3-shaded/target/scala-2.12/celeborn-client-spark-3-shaded_2.12-0.4.0-SNAPSHOT.jar | grep -i 'license\|notice'
META-INF/LICENSE
META-INF/NOTICE
META-INF/licenses/
META-INF/licenses/LICENSE-protobuf.txt

@cfmcgrady cfmcgrady requested a review from pan3793 November 6, 2023 02:41
case m if m == "META-INF/NOTICE.txt" => MergeStrategy.discard
case m if m == "LICENSE.txt" => MergeStrategy.discard
case m if m == "NOTICE.txt" => MergeStrategy.discard
case PathList(ps@_*) if ps.last == "LICENSE" || ps.last == "NOTICE" => MergeStrategy.preferProject
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
case PathList(ps@_*) if ps.last == "LICENSE" || ps.last == "NOTICE" => MergeStrategy.preferProject
case "LICENSE" | "NOTICE" => MergeStrategy.discard
case PathList(ps@_*) if ps.last == "LICENSE" || ps.last == "NOTICE" => MergeStrategy.preferProject

we should discard the NOTICE/LICENSE files from the scala-library for mr-shade client.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$ ./build/sbt -Pmr celeborn-client-mr-shaded/assembly
$ jar tf client-mr/mr-shaded/target/scala-2.12/celeborn-client-mr-shaded_2.12-0.4.0-SNAPSHOT.jar | grep -i 'license\|notice'
LICENSE
META-INF/LICENSE
META-INF/NOTICE
META-INF/licenses/
META-INF/licenses/LICENSE-protobuf.txt
NOTICE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for finding this problem.
I have now changed to use preferProject only for META-INF/LICENSE and META-INF/NOTICE.

@cfmcgrady
Copy link
Contributor

thanks, merging to main(v0.4.0).

@cfmcgrady cfmcgrady closed this in 23d7c20 Nov 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants