Omics fixes #3924
Closed
47 commits
4a1bbe3 Use S3's built-in SHA256 hashes (dimaryaz)
2be72bd Fix tests (dimaryaz)
1f0b418 Update hashing to match the latest spec (dimaryaz)
9fb90ea cleanup (dimaryaz)
b459f72 Delay deduping until the new top_hash is known (dimaryaz)
3a309b1 oops (dimaryaz)
773e6fb pylint (dimaryaz)
531b6ff Create PARALLEL_CHECKSUMS.md (drernie)
51fcbca Update api/python/quilt3/data_transfer.py (dimaryaz)
05310e2 tweak identifier (drernie)
61b0367 sha2-256-chunked (drernie)
5a4e67e rename to a2-256-chunked (drernie)
d0f1ee2 Don't add a \n to base64-encoded hashes (dimaryaz)
63abad3 add log to sha2-256-chunked (drernie)
b317895 Switch to "modern" checksums for all file sizes (#3892) (dimaryaz)
75dafd3 Update checksum docs with suggestions (drernie)
2270ff3 ceil(log2)) (drernie)
b02e975 Fix broken hashing retries. Fix hashing an empty string. (dimaryaz)
deb39fd Treat an empty file as a single zero-sized block (dimaryaz)
224c39b Switch empty file hash to an empty list of blocks (dimaryaz)
58af8da lint (dimaryaz)
6a32b15 Update CopyFileListFn (dimaryaz)
0a861d8 Update CopyFileListFn again (dimaryaz)
158ac08 PR feedback (dimaryaz)
7638377 Release 6.0.0a1 (dimaryaz)
40e69c7 6.1.0a2 (drernie)
5418f4d clarify testing (drernie)
41fdb26 debug conf_kwargs["signature_version"] (drernie)
baa8d9e Merge branch 'master' into sigv4-6.1.0 (drernie)
781182d disable failing test (drernie)
d616ce6 import Omics test from test_scaling (drernie)
11478db class AccessTest (drernie)
2ec8711 pass test_boto3_access (drernie)
9f2c95c test_package fails on ListObjectVersions (drernie)
6d124c4 handle missing S3Api.LIST_OBJECT_VERSIONS (drernie)
a24c5cd handle inability to get the workflow config (drernie)
077f3b3 pass test_list_object_versions (drernie)
d8562ac write to quilt-sales-staging (drernie)
080015a remove class to stop mocks (drernie)
6bb6bb1 fix push permissions (drernie)
6d98a93 create package! (drernie)
242b6cf revert unneeded changes (drernie)
fc8e867 revert workflow patch (drernie)
1e97084 revert VERSION (drernie)
b79aefe isort cleanup (drernie)
14d2bc1 Use ClientError (drernie)
e32beb2 move fallback into set_dir (drernie)
@@ -1,5 +1,5 @@

-test:
+test: install-local
 	pytest --disable-warnings

 install-local:
@@ -1 +1,3 @@
Use pytest during normal development.

You may first need to run `make install-local`.
@@ -0,0 +1,59 @@
import os
from datetime import UTC, datetime

import boto3
import pytest

import quilt3 as q3
from quilt3.data_transfer import list_object_versions

NOW = datetime.now(UTC).strftime("%Y-%m-%d %H:%M:%S")
BKT = "850787717197-1867753-fepwgrx9iujr5b9pkjudkhpgxwbuhuse1b-s3alias"
SOURCE = f"s3://{BKT}"
DBKT = "quilt-sales-staging"
DEST = f"s3://{DBKT}"
FOLDER = "850787717197/sequenceStore/1867753048/readSet/5447294294"
FILE = "U0a_CGATGT_L001_R1_004.fastq.gz"
KEY = f"{FOLDER}/{FILE}"


@pytest.fixture(autouse=True)
def client():
    os.environ["AWS_PROFILE"] = "sales"
    session = boto3.Session(profile_name="sales")
    return session.client("s3")


def test_boto3_access(client):
    head_object = client.head_object(Bucket=BKT, Key=KEY)
    assert head_object
    print(f"head_object: {head_object}")
    get_object = client.get_object(Bucket=BKT, Key=KEY)
    print(f"get_object: {get_object}")
    list_bucket = client.list_objects(Bucket=BKT, Prefix=FOLDER)
    print(f"list_bucket: {list_bucket}")


def test_package(client):
    dir = f"{SOURCE}/{FOLDER}"
    split = FOLDER.split("/")
    pkg_name = f"{split[-2]}/{split[-1]}"
    msg = f"Today's Date: {NOW}"

    pkg = q3.Package()
    try:
        pkg = pkg.browse(pkg_name, registry=DEST)
    except Exception as e:
        print(f"Error browsing package: {e}")
    print(f"Package: {pkg}")

    print(f"S3 URI: {dir} @ {msg}")
    pkg.set_dir("/", dir, meta={"timestamp": NOW})
    print(f"Package.set_dir: {pkg}")
    assert pkg

    PKG_URI = f"quilt+{DEST}#package={pkg_name}"
    print(f"Pushing {pkg_name} to {DEST}: {PKG_URI}")
    check = client.list_objects(Bucket=DBKT, Prefix=pkg_name)
    assert check.get("Prefix") == pkg_name
    pkg.push(pkg_name, registry=DEST, message=msg, force=True)
@@ -0,0 +1,43 @@
# sha2-256-chunked

## DRAFT multihash codec 0xb510

> Hash of concatenated SHA2-256 digests of 8*2^n MiB source chunks
> where n = max(0, ceil(log2(source_size / (10^4 * 8 MiB))))

This variant of sha2-256 is designed to enable large files
to be efficiently uploaded and hashed in parallel using fixed-size chunks
(8 MiB to start with), with the final result being a "top hash"
that doesn't depend on the upload order.

The algorithm has an upper limit of 10,000 chunks.
If the file is larger than 80,000 MiB,
the chunk size is doubled until the number of chunks
no longer exceeds that limit.
Smaller files always use 8 MiB chunks, so n is never negative.

```pseudocode
n = max(0, ceil(log2(source_size / (10_000 * 8 MiB))))
chunk_size = 8 MiB * 2^n
```
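
As a rough illustration of the rule above, the chunk size can be found with a doubling loop, which sidesteps floating-point `log2` entirely. This is a sketch under stated assumptions, not the quilt3 implementation: the function name `chunk_size_for` and the constant names are hypothetical, and it assumes files at or below 80,000 MiB keep the 8 MiB base size.

```python
MIB = 1024 * 1024
BASE_CHUNK_SIZE = 8 * MIB   # starting chunk size from the spec
MAX_CHUNKS = 10_000         # upper limit on the number of chunks


def chunk_size_for(source_size: int) -> int:
    """Return a chunk size in bytes for a source of `source_size` bytes.

    Hypothetical helper: doubles the 8 MiB base size until the source
    fits in at most MAX_CHUNKS chunks.
    """
    chunk_size = BASE_CHUNK_SIZE
    while chunk_size * MAX_CHUNKS < source_size:
        chunk_size *= 2
    return chunk_size
```

For example, a source of exactly 80,000 MiB still fits in 10,000 chunks of 8 MiB, while one byte more pushes the chunk size to 16 MiB.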

## Inspiration

This algorithm is inspired by Amazon's
[S3 Checksums](https://aws.amazon.com/blogs/aws/new-additional-checksum-algorithms-for-amazon-s3/)
implementation.

The key differences are:

* Fixing the chunk size as (starting from) 8 MiB.
* Always hashing the result (even if source_size < 8 MiB).

It can reuse hashes generated by
[create_multipart_upload](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/create_multipart_upload.html)
as of at least [boto3 1.34.44](https://pypi.org/project/boto3/1.34.44/)
(2024-02-16), simply by rehashing the value if source_size < 8 MiB.
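
The scheme described in this file (hash each chunk, then hash the concatenated digests) can be sketched for in-memory data as follows. The function name `chunked_sha256` is hypothetical, and two details are assumptions drawn from the commit messages rather than from this spec: the result is base64-encoded, and an empty source is treated as an empty list of chunks.

```python
import base64
import hashlib

BASE_CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB


def chunked_sha256(data: bytes, chunk_size: int = BASE_CHUNK_SIZE) -> str:
    """Return a base64 top hash over the SHA-256 digests of each chunk.

    Hypothetical helper for illustration only; real code would stream
    chunks instead of holding the whole source in memory.
    """
    # One digest per chunk; an empty source yields an empty list.
    chunk_digests = [
        hashlib.sha256(data[offset:offset + chunk_size]).digest()
        for offset in range(0, len(data), chunk_size)
    ]
    # Always hash the concatenated digests, even for a single chunk.
    top_hash = hashlib.sha256(b"".join(chunk_digests)).digest()
    return base64.b64encode(top_hash).decode("ascii")
```

Because the chunk digests are joined in chunk order, they can be computed in any order (or in parallel) without changing the top hash. For a source that fits in one chunk, the result is the hash of the plain SHA-256 digest, which is why a single-part checksum can be reused by rehashing it.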

## Status 2024-02-21

It has been submitted for draft
[multiformats registration](https://github.com/multiformats/multiformats/blob/master/contributing.md#multiformats-registrations)
under the name `sha2-256-chunked` using the prefix `0xb510`.
it's still too broad
Fixed in new PR