Omics fixes #3924

Closed
wants to merge 47 commits into from
47 commits
4a1bbe3
Use S3's built-in SHA256 hashes
dimaryaz Apr 1, 2022
2be72bd
Fix tests
dimaryaz May 3, 2023
1f0b418
Update hashing to match the latest spec
dimaryaz May 5, 2023
9fb90ea
cleanup
dimaryaz Jan 8, 2024
b459f72
Delay deduping until the new top_hash is known
dimaryaz Jan 8, 2024
3a309b1
oops
dimaryaz Jan 8, 2024
773e6fb
pylint
dimaryaz Jan 8, 2024
531b6ff
Create PARALLEL_CHECKSUMS.md
drernie Feb 19, 2024
51fcbca
Update api/python/quilt3/data_transfer.py
dimaryaz Feb 20, 2024
05310e2
tweak identifier
drernie Feb 20, 2024
61b0367
sha2-256-chunked
drernie Feb 20, 2024
5a4e67e
rename to a2-256-chunked
drernie Feb 21, 2024
d0f1ee2
Don't add a \n to base64-encoded hashes
dimaryaz Feb 21, 2024
63abad3
add log to sha2-256-chunked
drernie Feb 22, 2024
b317895
Switch to "modern" checksums for all file sizes (#3892)
dimaryaz Feb 22, 2024
75dafd3
Update checksum docs with suggestions
drernie Feb 22, 2024
2270ff3
ceil(log2))
drernie Feb 22, 2024
b02e975
Fix broken hashing retries. Fix hashing an empty string.
dimaryaz Feb 22, 2024
deb39fd
Treat an empty file as a single zero-sized block
dimaryaz Feb 23, 2024
224c39b
Switch empty file hash to an empty list of blocks
dimaryaz Feb 23, 2024
58af8da
lint
dimaryaz Feb 23, 2024
6a32b15
Update CopyFileListFn
dimaryaz Feb 26, 2024
0a861d8
Update CopyFileListFn again
dimaryaz Feb 26, 2024
158ac08
PR feedback
dimaryaz Feb 26, 2024
7638377
Release 6.0.0a1
dimaryaz Feb 26, 2024
40e69c7
6.1.0a2
drernie Mar 28, 2024
5418f4d
clarify testing
drernie Mar 28, 2024
41fdb26
debug conf_kwargs["signature_version"]
drernie Mar 28, 2024
baa8d9e
Merge branch 'master' into sigv4-6.1.0
drernie Apr 2, 2024
781182d
disable failing test
drernie Apr 2, 2024
d616ce6
import Omics test from test_scaling
drernie Apr 2, 2024
11478db
class AccessTest
drernie Apr 2, 2024
2ec8711
pass test_boto3_access
drernie Apr 2, 2024
9f2c95c
test_package fails on ListObjectVersions
drernie Apr 2, 2024
6d124c4
handle missing S3Api.LIST_OBJECT_VERSIONS
drernie Apr 2, 2024
a24c5cd
handle inability to get the workflow config
drernie Apr 2, 2024
077f3b3
pass test_list_object_versions
drernie Apr 2, 2024
d8562ac
write to quilt-sales-staging
drernie Apr 2, 2024
080015a
remove class to stop mocks
drernie Apr 2, 2024
6bb6bb1
fix push permissions
drernie Apr 3, 2024
6d98a93
create package!
drernie Apr 3, 2024
242b6cf
revert unneeded changes
drernie Apr 3, 2024
fc8e867
revert workflow patch
drernie Apr 3, 2024
1e97084
revert VERSION
drernie Apr 3, 2024
b79aefe
isort cleanup
drernie Apr 3, 2024
14d2bc1
Use ClientError
drernie Apr 3, 2024
e32beb2
move fallback into set_dir
drernie Apr 3, 2024
2 changes: 1 addition & 1 deletion api/python/Makefile
@@ -1,5 +1,5 @@

-test:
+test: install-local
 	pytest --disable-warnings

 install-local:
2 changes: 2 additions & 0 deletions api/python/TESTS_README
@@ -1 +1,3 @@
 Use pytest during normal development.
+
+You may need to first `make install-local`.
13 changes: 11 additions & 2 deletions api/python/quilt3/packages.py
@@ -34,6 +34,7 @@
     legacy_calculate_checksum,
     legacy_calculate_checksum_bytes,
     list_object_versions,
+    list_objects,
     list_url,
     put_bytes,
 )
@@ -910,11 +911,19 @@ def set_dir(self, lkey, path=None, meta=None, update_policy="incoming"):
 src_path = src.path
 if src.basename() != '':
     src_path += '/'
-objects, _ = list_object_versions(src.bucket, src_path)
+try:
+    objects, _ = list_object_versions(src.bucket, src_path)
+except botocore.exceptions.ClientError as e:
Member

it's still too broad

Suggested change
-except botocore.exceptions.ClientError as e:
+except botocore.exceptions.ClientError as e:
+    if e.response["Error"]["Code"] != "AccessDenied":
+        raise

Member Author

Fixed in new PR

+    # use list_objects instead
+    print(f"list_object_versions not available; using list_objects:\n{e}")
+    objects = list_objects(src.bucket, src_path, recursive=True)
+    for obj in objects:
+        obj["IsLatest"] = True
+
 for obj in objects:
     if not obj['IsLatest']:
         continue
-    # Skip S3 pseduo directory files and Keys that end in /
+    # Skip S3 pseudo-directory files and Keys that end in /
     if obj['Key'].endswith('/'):
         if obj['Size'] != 0:
             warnings.warn(f'Logical keys cannot end in "/", skipping: {obj["Key"]}')
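
For readers following the review thread, here is a minimal sketch of the narrowed fallback the reviewer asks for: fall back to a plain listing only when the versioned listing fails with `AccessDenied`, and re-raise anything else. It is an illustration, not the code that landed in the follow-up PR. The names `error_code` and `list_latest` and the two listing callables are hypothetical stand-ins for `list_object_versions` and `list_objects`, and the sketch catches `Exception` only so that it runs without botocore installed; the diff above catches `botocore.exceptions.ClientError`.

```python
from typing import Callable, Dict, List, Optional


def error_code(exc: Exception) -> Optional[str]:
    """Return the AWS error code carried by a botocore-style exception, if any."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code")


def list_latest(
    list_versions: Callable[[str, str], List[Dict]],
    list_plain: Callable[[str, str], List[Dict]],
    bucket: str,
    prefix: str,
) -> List[Dict]:
    """List the latest objects, falling back to a plain listing on AccessDenied."""
    try:
        objects = list_versions(bucket, prefix)
    except Exception as exc:  # assumption: ClientError in the real code
        if error_code(exc) != "AccessDenied":
            raise  # any other failure is not a permissions problem, so surface it
        # A plain listing has no version info, so every object is the latest one.
        objects = [dict(obj, IsLatest=True) for obj in list_plain(bucket, prefix)]
    return [obj for obj in objects if obj["IsLatest"]]
```

Passing the listing functions in as arguments keeps the sketch independent of any S3 client, which also makes the error path easy to exercise with fakes.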
59 changes: 59 additions & 0 deletions api/python/tests/integration/test_access.py
@@ -0,0 +1,59 @@
import os
from datetime import UTC, datetime

import boto3
import pytest

import quilt3 as q3
from quilt3.data_transfer import list_object_versions

NOW = datetime.now(UTC).strftime("%Y-%m-%d %H:%M:%S")
BKT = "850787717197-1867753-fepwgrx9iujr5b9pkjudkhpgxwbuhuse1b-s3alias"
SOURCE = f"s3://{BKT}"
DBKT = "quilt-sales-staging"
DEST = f"s3://{DBKT}"
FOLDER = "850787717197/sequenceStore/1867753048/readSet/5447294294"
FILE = "U0a_CGATGT_L001_R1_004.fastq.gz"
KEY = f"{FOLDER}/{FILE}"


@pytest.fixture(autouse=True)
def client():
    os.environ["AWS_PROFILE"] = "sales"
    session = boto3.Session(profile_name="sales")
    return session.client("s3")


def test_boto3_access(client):
    head_object = client.head_object(Bucket=BKT, Key=KEY)
    assert head_object
    print(f"head_object: {head_object}")
    get_object = client.get_object(Bucket=BKT, Key=KEY)
    print(f"get_object: {get_object}")
    list_bucket = client.list_objects(Bucket=BKT, Prefix=FOLDER)
    print(f"list_bucket: {list_bucket}")


def test_package(client):
    dir = f"{SOURCE}/{FOLDER}"
    split = FOLDER.split("/")
    pkg_name = f"{split[-2]}/{split[-1]}"
    msg = f"Today's Date: {NOW}"

    pkg = q3.Package()
    try:
        pkg = pkg.browse(pkg_name, registry=DEST)
    except Exception as e:
        print(f"Error browsing package: {e}")
    print(f"Package: {pkg}")

    print(f"S3 URI: {dir} @ {msg}")
    pkg.set_dir("/", dir, meta={"timestamp": NOW})
    print(f"Package.set_dir: {pkg}")
    assert pkg

    PKG_URI = f"quilt+{DEST}#package={pkg_name}"
    print(f"Pushing {pkg_name} to {DEST}: {PKG_URI}")
    check = client.list_objects(Bucket=DBKT, Prefix=pkg_name)
    assert check.get("Prefix") == pkg_name
    pkg.push(pkg_name, registry=DEST, message=msg, force=True)
43 changes: 43 additions & 0 deletions docs/sha2-256-chunked.md
@@ -0,0 +1,43 @@
# sha2-256-chunked

## DRAFT multihash codec 0xb510

> Hash of concatenated SHA2-256 digests of 8*2^n MiB source chunks
> where n = max(0, ceil(log2(source_size / (10^4 * 8 MiB))))

This variant of sha2-256 is designed so that large files
can be uploaded and hashed efficiently in parallel using fixed-size chunks
(8 MiB to start with). The final result is a "top hash"
that does not depend on the order in which chunks are uploaded.
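
As a rough illustration of the idea, the sketch below hashes each chunk in a thread pool and then hashes the concatenation of the chunk digests in chunk-index order. It is not the quilt3 implementation. The names `chunk_digest` and `top_hash` are hypothetical, the base64 output encoding is an assumption based on the commit messages in this PR, and treating an empty input as an empty list of chunks follows the commit "Switch empty file hash to an empty list of blocks".

```python
import base64
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB, the starting chunk size described above


def chunk_digest(chunk: bytes) -> bytes:
    """SHA2-256 digest of a single chunk."""
    return hashlib.sha256(chunk).digest()


def top_hash(data: bytes, chunk_size: int = CHUNK_SIZE) -> str:
    """Hash of the concatenated per-chunk digests, base64-encoded (assumed encoding)."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor() as pool:
        # Chunks may finish hashing in any order; map() returns results in
        # input order, so the concatenation is always by chunk index.
        digests = list(pool.map(chunk_digest, chunks))
    return base64.b64encode(hashlib.sha256(b"".join(digests)).digest()).decode("ascii")
```

Because the digests are joined by chunk index rather than by completion time, the result is the same however the work is scheduled.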

The algorithm allows at most 10,000 chunks.
If the file is larger than 80,000 MiB (10,000 × 8 MiB),
the chunk size is doubled until the number of chunks
is at or under that limit.

```pseudocode
n = max(0, ceil(log2(source_size / (10_000 * 8 MiB))))
chunk_size = 8 MiB * 2^n
```
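
The same rule can be written as a loop that avoids the logarithm. This is a minimal sketch under the limits stated above (8 MiB base size, 10,000 chunks); the function name `chunk_size_for` and the constant names are hypothetical.

```python
MIB = 1024 * 1024
BASE_CHUNK_SIZE = 8 * MIB  # starting chunk size
MAX_CHUNKS = 10_000        # upper limit on the number of chunks


def chunk_size_for(source_size: int) -> int:
    """Smallest 8 MiB * 2^n chunk size that keeps the chunk count within MAX_CHUNKS."""
    chunk_size = BASE_CHUNK_SIZE
    # -(-a // b) is integer ceiling division, exact for arbitrarily large sizes.
    while -(-source_size // chunk_size) > MAX_CHUNKS:
        chunk_size *= 2
    return chunk_size
```

A file of exactly 80,000 MiB still fits in 10,000 chunks of 8 MiB; one byte more pushes the chunk size to 16 MiB.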

## Inspiration

This algorithm is inspired by Amazon's
[S3 Checksums](https://aws.amazon.com/blogs/aws/new-additional-checksum-algorithms-for-amazon-s3/)
implementation.

The key differences are:

* Fixing the chunk size at 8 MiB, doubling it only when needed to stay within 10,000 chunks.
* Always hashing the result (even if source_size < 8 MiB).

It can reuse hashes generated by
[create_multipart_upload](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/create_multipart_upload.html)
as of at least [boto3 1.34.44](https://pypi.org/project/boto3/1.34.44/)
(2024-02-16), simply by rehashing the value if source_size < 8 MiB.
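
One way to picture that reuse is sketched below. It assumes that S3 reports a plain base64 SHA-256 for an object uploaded in one piece and a composite value with a `-N` part-count suffix for a multipart upload, and that the multipart parts were cut at the same chunk size this algorithm would use. The function name `from_s3_checksum` is hypothetical.

```python
import base64
import hashlib


def from_s3_checksum(checksum_sha256: str) -> str:
    """Convert an S3 SHA-256 checksum string into a chunked top hash (sketch)."""
    value, sep, part_count = checksum_sha256.rpartition("-")
    if sep and part_count.isdigit():
        # Composite checksum: already a hash of the part digests.
        # Only valid if the part size matched the chunk size (assumption).
        return value
    # Plain checksum of a single-piece upload: hash the digest once more.
    digest = base64.b64decode(checksum_sha256)
    return base64.b64encode(hashlib.sha256(digest).digest()).decode("ascii")
```

The standard base64 alphabet does not contain `-`, so a trailing `-N` can only be the part-count suffix.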

## Status 2024-02-21

It has been submitted for draft
[multiformats registration](https://github.com/multiformats/multiformats/blob/master/contributing.md#multiformats-registrations)
under the name `sha2-256-chunked` using the prefix `0xb510`.