Compatibility issues with R's BiocFileCache #27

Open
jwokaty opened this issue Jan 7, 2025 · 3 comments · Fixed by #28

jwokaty commented Jan 7, 2025

I'm experiencing two issues related to compatibility with R's BiocFileCache.

  1. I'm not able to use a cache initially created through pyBiocFileCache with BiocFileCache. When I attempt to open the pyBiocFileCache-created cache in an R session, I get the following error:
BiocFileCache("cache_with_resource_created_with_pybiocfilecache")
Error in if (!schema_version %in% .SUPPORTED_SCHEMA_VERSIONS) stop("unsupported schema version ",  : 
  argument is of length zero

Lori suggested that this is related to a missing schema_version. It is absent from the metadata table of my BiocFileCache.sqlite file, although it should be inserted when the database is created:

conn.execute(
text("""
INSERT INTO metadata (key, value)
VALUES ('schema_version', :version);
"""),
{"version": SCHEMA_VERSION},
)

If I insert it manually into my BiocFileCache.sqlite file, I am able to open the cache that I created via pyBiocFileCache in an R session using BiocFileCache.

  2. I'm not able to get a resource using pyBiocFileCache when the cache was initially created through R's BiocFileCache. I get an RpathTimeoutError:
In [5]: rcache.get("homo_sapien")
---------------------------------------------------------------------------
RpathTimeoutError                         Traceback (most recent call last)
Cell In[5], line 1
----> 1 rcache.get("homo_sapien")

File ~/env/lib/python3.12/site-packages/pybiocfilecache/cache.py:216, in BiocFileCache.get(self, rname)
    214 while not Path(str(resource.rpath)).exists():
    215     if time() - start >= timeout:
--> 216         raise RpathTimeoutError(
    217             f"For resource: '{rname}' the rpath does not exist " f"after {timeout} seconds."
    218         )
    219     sleep(0.1)
    221 # Update access time

RpathTimeoutError: For resource: 'homo_sapien' the rpath does not exist after 30 seconds.
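
For the first issue above, the manual workaround (inserting schema_version into the metadata table) can be sketched with Python's standard sqlite3 module. This is a minimal sketch, not part of pyBiocFileCache: `ensure_schema_version` is a hypothetical helper name, the `metadata (key, value)` layout is taken from the INSERT statement quoted above, and the version string to pass is an assumption that has to match what the installed R BiocFileCache accepts.

```python
import sqlite3


def ensure_schema_version(db_path: str, version: str) -> bool:
    """Insert a schema_version row into the metadata table if it is missing.

    Hypothetical helper: returns True if a row was inserted and False if
    one already existed. Assumes a `metadata` table with `key` and `value`
    columns, as in the INSERT statement quoted in the report.
    """
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT value FROM metadata WHERE key = 'schema_version'"
        ).fetchone()
        if row is not None:
            return False  # already present, leave it untouched
        conn.execute(
            "INSERT INTO metadata (key, value) VALUES ('schema_version', ?)",
            (version,),
        )
        conn.commit()
        return True
    finally:
        conn.close()
```

A call would look like `ensure_schema_version("pycache/BiocFileCache.sqlite", version)`, where `version` is whichever schema version the R package supports. Checking before inserting keeps the helper safe to run more than once.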

Here's how I created the caches.

Create a resource with pybiocfilecache.

from pybiocfilecache import BiocFileCache
import urllib.request

url = "ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz"
urllib.request.urlretrieve(url, "pycache/"+url.split("/")[-1])

pybfc = BiocFileCache("pycache")
pybfc.add("homosapiens", "pycache/"+url.split("/")[-1]) # set rname to "homosapiens" 

Create a resource with BiocFileCache in an R session.

library(BiocFileCache)
rbfc <- BiocFileCache("rcache")
url <- paste(
    "ftp://ftp.ensembl.org/pub/release-71/gtf",
    "homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz",
    sep="/")
path <- bfcrpath(rbfc, url)

In a terminal running Python, try accessing the resource created in the R session.

from pybiocfilecache import BiocFileCache

rbfc = BiocFileCache("rcache")
rbfc.list_resources() # this works
rbfc.get("ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz") # RpathTimeoutError 

In an R session, try creating a cache with the one made in the Python session.

library(BiocFileCache)

pybfc <- BiocFileCache("pycache") # Error in if (!schema_version %in% .SUPPORTED_SCHEMA_VERSIONS) ...

If it's helpful, here are the packages in my virtualenv and my sessionInfo():

# Python 3.12
pip list
Package            Version
------------------ -----------
aiobotocore        2.16.1
aiohappyeyeballs   2.4.4
aiohttp            3.11.11
aioitertools       0.12.0
aiosignal          1.3.2
annotated-types    0.7.0
asciitree          0.3.3
asttokens          3.0.0
attrs              24.3.0
BiocFrame          0.6.1
biocutils          0.2.0
boto3              1.35.88
botocore           1.35.88
certifi            2024.12.14
charset-normalizer 3.4.1
click              8.1.8
coloredlogs        15.0.1
decorator          5.1.1
executing          2.1.0
fasteners          0.19
frozenlist         1.5.0
fsspec             2024.12.0
geniml             0.5.2
GenomicRanges      0.5.0
greenlet           3.1.1
gtars              0.1.1
humanfriendly      10.0
idna               3.10
ipython            8.31.0
IRanges            0.3.0
jedi               0.19.2
jmespath           1.0.1
logmuse            0.2.8
markdown-it-py     3.0.0
matplotlib-inline  0.1.7
mdurl              0.1.2
multidict          6.1.0
ncls               0.0.68
numcodecs          0.13.1
numpy              2.2.1
pandas             2.2.3
parso              0.8.4
pephubclient       0.4.5
peppy              0.40.7
pexpect            4.9.0
pip                24.0
polars             1.18.0
prompt_toolkit     3.0.48
propcache          0.2.1
ptyprocess         0.7.0
pure_eval          0.2.3
pyarrow            18.1.0
pyBiocFileCache    0.6.0
pydantic           2.10.4
pydantic_core      2.27.2
Pygments           2.18.0
python-dateutil    2.9.0.post0
pytz               2024.2
PyYAML             6.0.2
requests           2.32.3
rich               13.9.4
s3fs               2024.12.0
s3transfer         0.10.4
setuptools         75.6.0
shellingham        1.5.4
six                1.17.0
SQLAlchemy         2.0.36
stack-data         0.6.3
traitlets          5.14.3
typer              0.15.1
typing_extensions  4.12.2
tzdata             2024.2
ubiquerg           0.8.0
urllib3            2.3.0
wcwidth            0.2.13
wrapt              1.17.0
yarl               1.18.3
zarr               2.18.4
# R 4.5.0
sessionInfo()
R Under development (unstable) (2024-11-18 r87347)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS

Matrix products: default
BLAS:   /home/fm/R-devel/lib/libRblas.so 
LAPACK: /home/fm/R-devel/lib/libRlapack.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocFileCache_2.15.0 dbplyr_2.5.0        

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5      httr_1.4.7       cli_3.6.3        rlang_1.1.4     
 [5] DBI_1.2.3        purrr_1.0.2      generics_0.1.3   glue_1.8.0      
 [9] bit_4.5.0        fansi_1.0.6      tibble_3.2.1     filelock_1.0.3  
[13] fastmap_1.2.0    lifecycle_1.0.4  memoise_2.0.1    compiler_4.5.0  
[17] dplyr_1.1.4      RSQLite_2.3.8    blob_1.2.4       pkgconfig_2.0.3 
[21] R6_2.5.1         tidyselect_1.2.1 utf8_1.2.4       pillar_1.9.0    
[25] curl_6.0.1       magrittr_2.0.3   withr_3.0.2      tools_4.5.0     
[29] bit64_4.5.2      cachem_1.1.0    

jkanche commented Jan 7, 2025

Hi @jwokaty, thank you for reporting this. The first should be an easy fix.

Let me test out the 2nd scenario. The only other time we ran into this was when the file added to the cache was large and would take a while to move/copy, hence we added a timeout constraint.
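
The timeout constraint mentioned here can be sketched as a simple polling loop, modeled on the loop visible in the traceback in the report. This is only an illustration under stated assumptions: `wait_for_path` is a hypothetical name rather than the library's API, and the 30-second default is taken from the error message.

```python
from pathlib import Path
from time import monotonic, sleep


def wait_for_path(path: str, timeout: float = 30.0, interval: float = 0.1) -> Path:
    """Poll until `path` exists, or raise TimeoutError after `timeout` seconds.

    Hypothetical helper: useful when a large file is still being moved or
    copied into the cache by another step.
    """
    target = Path(path)
    start = monotonic()
    while not target.exists():
        if monotonic() - start >= timeout:
            raise TimeoutError(
                f"The path '{target}' does not exist after {timeout} seconds."
            )
        sleep(interval)  # brief pause before checking again
    return target
```

A loop like this only helps when the file is about to appear. If the stored rpath points to the wrong location, it will wait for the full timeout and then fail, which matches the behavior reported.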


jkanche commented Jan 8, 2025

Hi @jwokaty, GitHub closed this issue automatically, but otherwise this works for me. Please install the latest version of the package.

Starting with Python

import shutil
import urllib.request
from pathlib import Path

import pybiocfilecache as bfc

cache_dir = "./cache_with_py"

# remove any previous cache with the same name (no error if it does not exist)
shutil.rmtree(cache_dir, ignore_errors=True)

# create (or open) the cache
cache = bfc.BiocFileCache(cache_dir)

# download the human gtf reference and save it as `hsapiens.gtf.gz` in the current working directory
url = "ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz"
urllib.request.urlretrieve(url, filename="hsapiens.gtf.gz")

# add to the cache
cache.add(rname="homosapiens", fpath=Path("./hsapiens.gtf.gz"))

cache.list_resources()

Switch to R

library(BiocFileCache)

cache_dir <- "./cache_with_py"
rbfc <- BiocFileCache(cache_dir)
length(rbfc)

bfcinfo(rbfc)
show(rbfc)

rbfc[["BFC1"]]

hsap <- file("./hsapiens.gtf.gz")
add2 <- bfcadd(rbfc, "hsapiens_from_R", "./hsapiens.gtf.gz", download=FALSE)
add2

Roundtrip to Python

cache.list_resources()
cache.get(rname="hsapiens_from_R")  # or, by resource id:
cache.get(rid="BFC2")

downurl = "https://bioconductor.org/packages/stats/bioc/BiocFileCache/BiocFileCache_2024_stats.tab"
add_url = cache.add(rname="download_link", fpath=downurl, rtype="web")

cache.list_resources()

Let me know if you run into any issues


jwokaty commented Jan 9, 2025

Thanks for your quick attention to my issue! I reinstalled the new version. However, after running the second script and opening the Python-created cache in R as rbfc, the cache directory appears twice in the rpath of the Python-created resource, because the rpath stored in SQLite already includes the cache directory.

> bfcinfo(rbfc)
# A tibble: 2 × 10
  rid   rname create_time access_time rpath rtype fpath last_modified_time etag 
  <chr> <chr> <chr>       <chr>       <chr> <chr> <chr> <chr>              <chr>
1 BFC1  homo… 2025-01-09… 2025-01-09… ./ca… rela… hsap… 2025-01-09 17:40:… 79e7…
2 BFC2  hsap… 2025-01-09… 2025-01-09… ./ca… rela… ./hs… NA                 NA   
# ℹ 1 more variable: expires <dbl>
> bfcinfo(rbfc)$rpath
[1] "./cache_with_py/cache_with_py/fb57a31007b249bab60d18885c6f6b00_hsapiens.gtf.gz"
[2] "./cache_with_py/d6464139abac2_hsapiens.gtf.gz"  

When I attempt the round trip, I am still not able to access the resource created in R. I am able to access the resource I previously created in Python. (Note, this is from a different session I tried in the Bioconductor Docker to make sure I'm not experiencing a problem local to my system.)

>>> cache.list_resources()
[<Resource(rid='BFC1', rname='homosapiens', rpath='cache_with_py/ff50a35076d8413282510e1a777c5ccd_hsapiens.gtf.gz')>, <Resource(rid='BFC2', rname='hsapiens_from_R', rpath='a2853e77697_hsapiens.gtf.gz')>]
>>> cache.get(rid="BFC2")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/env/lib/python3.12/site-packages/pybiocfilecache/cache.py", line 220, in get
    raise TimeoutError(
TimeoutError: For resource: 'None' the rpath does not exist after 30 seconds.
>>> cache.get(rid="BFC1")
<Resource(rid='BFC1', rname='homosapiens', rpath='cache_with_py/ff50a35076d8413282510e1a777c5ccd_hsapiens.gtf.gz')>

The resource created in R doesn't have the cache directory in the path, which might be why it's not found.
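
The mismatch can be illustrated with a small lookup that tries both interpretations of a stored rpath: as given (relative to the working directory, as in the Python-created entry) and relative to the cache directory (as in the R-created entry). This is a sketch of the idea only; `resolve_rpath` is a hypothetical helper, not how pyBiocFileCache or BiocFileCache resolve paths.

```python
from pathlib import Path
from typing import Optional


def resolve_rpath(cache_dir: str, rpath: str) -> Optional[Path]:
    """Return the first existing location for a stored rpath, or None.

    Hypothetical helper: tries the rpath as stored, then the rpath joined
    onto the cache directory.
    """
    candidates = [
        Path(rpath),                    # rpath already includes the cache directory
        Path(cache_dir) / rpath,        # rpath is relative to the cache directory
    ]
    for candidate in candidates:
        if candidate.exists():
            return candidate
    return None
```

With the listing above, `resolve_rpath("cache_with_py", "a2853e77697_hsapiens.gtf.gz")` would check `cache_with_py/a2853e77697_hsapiens.gtf.gz` as its second candidate, which is where the R-created file is expected to be.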

jkanche reopened this Jan 15, 2025