NegativeArraySizeException when converting plink bed to hailmatrix on large chromosome #14168
Update: when I tested chr1 with 32355811 variants on a local computer using Singularity instead of Docker, with 200 GB of Spark memory, it also failed.
Although the test is still running, I am pretty sure the following solution solved the problem.
Hey @shengqh ! Yeah, this is a bug in Kryo, a serialization library used by Spark, which cannot handle the size of data you're producing. This is partly a deficiency in Hail: we assume that PLINK files are relatively small, in particular that the number of variants is small. This issue was supposedly resolved in Spark 2.4.0+ and 3.0.0+ by apache/spark@3e03303. You appear to be running Apache Spark version 3.3.2, so I'm surprised you encountered this. Can you confirm which version of the Kryo JAR you have in your environment? Can you also share a bit of information about this PLINK file? Thanks for your feedback and help improving Hail! Related issue: #5564.
I don't know which Kryo JAR is in the environment. I tested on both Docker images, hailgenetics/hail:0.2.126-py3.11 and hailgenetics/hail:0.2.127-py3.11.
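If it helps with checking: a quick sketch (assuming a pip-installed pyspark inside those images, which bundles its jars under `pyspark/jars/`) for listing the Kryo jar from Python:

```python
# Sketch: list the Kryo jar bundled with the pip-installed pyspark.
# Assumes pyspark ships its jars under pyspark/jars/, as the pip package does.
import glob
import os

import pyspark

jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
print(glob.glob(os.path.join(jars_dir, "*kryo*")))
# Expected to print something like [".../pyspark/jars/kryo-shaded-4.0.2.jar"]
```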
We have a 35K-sample cohort. The VCF for chr1 alone is 2.4 TB, so we prefer to deliver PLINK bed format and Hail MatrixTable. The cohort will continue to grow in the future, and I would prefer to keep one file per chromosome. For a large cohort, which format do you recommend: Hail MatrixTable or Hail VDS?
Heh. So, yes, "project" VCFs grow super-linearly in the number of samples. I (and others) are currently pushing very hard for the VCF spec to support two sparse representations: "local alleles" (samtools/hts-specs#434) and "reference blocks" (samtools/hts-specs#435). When using these two sparse representations, you should be able to store 35,000 whole genomes in ~10TiB of GZIP-compressed VCF. What is your calling pipeline? Do you generate GVCFs? If yes, I strongly recommend you use the VDS Combiner to produce a VDS. You can read more details in this recent preprint we wrote, but a VDS of 35,000 whole genomes should be a few terabytes. I'd guess 4 TiB, but it depends on your reference block granularity. I strongly recommend using size 10 GQ buckets.
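For what it's worth, here is a minimal sketch of the GVCF-to-VDS path with the VDS Combiner (paths and the interval setting below are illustrative placeholders, not taken from this thread):

```python
# Minimal sketch: combine per-sample GVCFs into a VDS with the VDS Combiner.
# All paths are hypothetical placeholders.
import hail as hl

hl.init()

gvcfs = [
    "gs://my-bucket/gvcfs/sample1.g.vcf.gz",
    "gs://my-bucket/gvcfs/sample2.g.vcf.gz",
]

combiner = hl.vds.new_combiner(
    output_path="gs://my-bucket/cohort.vds",
    temp_path="gs://my-bucket/tmp/",
    gvcf_paths=gvcfs,
    reference_genome="GRCh38",
    use_genome_default_intervals=True,  # partitioning suited to whole genomes
)
combiner.run()

vds = hl.vds.read_vds("gs://my-bucket/cohort.vds")
```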
Those should use Kryo 4.0.2. OK. My conclusion is that Kryo still has a bug preventing the serialization of very large objects. This becomes a limitation in Hail: we cannot support PLINK files with tens of millions of variants. Our community is largely transitioning to GVCFs and VDS, so I doubt we'll improve our PLINK1 importer to support such large PLINK1 files. That said, PRs are always welcome if loading such large PLINK1 files is a hard requirement for you all.
Looks like VDS is a better solution than a MatrixTable. However, we already have the joint-call result as a VCF. Can the VDS Combiner read a joint-call VCF and then save it in VDS format? I cannot find any example of converting VCF to VDS. Thanks.
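For context, what I can do today is import the joint-call VCF as a dense MatrixTable (a sketch with illustrative paths; this does not produce a VDS):

```python
# Sketch: import a joint-called, block-gzipped VCF as a dense MatrixTable.
# Paths and reference genome are illustrative placeholders; this is not a VDS.
import hail as hl

hl.init()

mt = hl.import_vcf(
    "gs://my-bucket/joint-call/chr1.vcf.bgz",
    reference_genome="GRCh38",
    force_bgz=True,  # treat .gz/.bgz input as block-gzipped
)
mt.write("gs://my-bucket/hail/chr1.mt", overwrite=True)
```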
What happened?
I wrote a bed2hailmatrix workflow and ran it on the Terra platform to convert from PLINK bed format to Hail MatrixTable format.
https://github.com/shengqh/warp/blob/develop/pipelines/vumc_biostatistics/genotype/VUMCBed2HailMatrix.wdl
The code is pretty simple:
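(The snippet itself isn't reproduced here; the following is a sketch of the conversion step the task runs, with file names and reference genome as illustrative assumptions. The actual WDL is linked above.)

```python
# Sketch: import a PLINK bed/bim/fam trio and write it as a Hail MatrixTable.
# File names and reference genome are illustrative.
import hail as hl

hl.init()

mt = hl.import_plink(
    bed="chr12.bed",
    bim="chr12.bim",
    fam="chr12.fam",
    reference_genome="GRCh38",
)
mt.write("chr12.mt", overwrite=True)
```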
When I tested it on chr12, with 34523 samples and 18377527 variants from one of my datasets in Terra (100 GB was allocated for this task), it failed with a NegativeArraySizeException.
The interesting thing is that when I tried to convert exactly the same data on a local computer using Singularity instead of Docker, it worked. Also, the other chromosomes with fewer variants but the same samples, such as chr13, worked well in Terra.
Since we will convert multiple PLINK files to Hail MatrixTables using the Terra platform in the future, I need to figure this problem out. Any advice would be appreciated.
Version
0.2.127
Relevant log output