From 606f4df72e1267131e80459cb69f491929ff9a72 Mon Sep 17 00:00:00 2001 From: jonas-fuchs Date: Mon, 27 Mar 2023 17:12:04 +0200 Subject: [PATCH 1/8] Introduced sanger mode --- README.md | 18 +++---- docs/FAQ.md | 12 +++-- docs/how_varvamp_works.md | 2 +- docs/installation.md | 12 +++++ docs/preparing_the_data.md | 2 +- docs/usage.md | 8 +++- varvamp/command.py | 91 ++++++++++++++++++++++-------------- varvamp/scripts/logging.py | 68 +++++++++++++++------------ varvamp/scripts/reporting.py | 10 ++-- varvamp/scripts/scheme.py | 24 ++++++++++ 10 files changed, 162 insertions(+), 85 deletions(-) diff --git a/README.md b/README.md index c1084d6..b9e204e 100644 --- a/README.md +++ b/README.md @@ -8,24 +8,24 @@ For a lot of virus genera it is difficult to design pan-specific primers. varVAM **varVAMP comes in three different flavors:** -varVAMP logo +varVAMP logo -**SANGER** *(coming soon)*: varVAMP searches for the very best primers and reports back an amplicon which can be used for PCR-based screening approaches. +**SANGER**: varVAMP searches for the very best primers and reports back non-overlapping amplicons which can be used for PCR-based screening approaches. **TILED**: varVAMP uses a graph based approach to design overlapping amplicons that tile the entire viral genome. This designs amplicons that are suitable for Oxford Nanopore or Illumina based full-genome sequencing. **QPCR** *(coming soon)*: varVAMP searches for small amplicons with an internal primer for the probe. It minimizes temperature differences between the primers. -This program is currently being developed and in an alpha state. You are welcome to use this software. If you successfully design primers, drop me a mail. It might be possible to collaborate! +This program is currently being developed and in an alpha state. You are welcome to use this software. If you successfully design primers, drop me a mail. It might be possible to collaborate! Ideas and suggestions are highly welcome. # Documentation -* [Installation](docs/installation.md) -* [Preparing the data](docs/preparing_the_data.md) -* [Usage](docs/usage.md) -* [Output](docs/output.md) -* [How it works](docs/how_varvamp_works.md) -* [FAQ](docs/FAQ.md) +* [Installation](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/installation.md) +* [Preparing the data](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/preparing_the_data.md) +* [Usage](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/usage.md) +* [Output](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/output.md) +* [How it works](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/how_varvamp_works.md) +* [FAQ](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/FAQ.md) --- diff --git a/docs/FAQ.md b/docs/FAQ.md index 8ee18f9..c00dae1 100644 --- a/docs/FAQ.md +++ b/docs/FAQ.md @@ -4,18 +4,22 @@ Start with setting your optimal length of your amplicon, the max length that you can tolerate and the overlap that you want to achieve. If you want to allow ambiguous characters, set them to 2 as a start. Set the threshold to the mean sequence identity of your alignment. Run varvamp and then optimize the output. -2. **How do I optimize the output?** +2. **How do I optimize the output for TILED mode?** It all depends on how many conserved regions varVAMP is able to find! There are two main parameters that influence this. The number of ambiguous bases allowed within a primer and the threshold for consensus nucleotides. Setting the threshold higher or the number of ambiguous bases lower will result in less conserved regions. If you have set the parameters below and get a decent output, increase the threshold until the output gets worse. This will increase the specificity of your primers. Likewise, if you do not have a good output, consider increasing the number of ambiguous bases before you lower the threshold. The console output varVAMP will also give you some suggestions. 3. **varVAMP reported primer dimers. What now?** -In your case varVAMP could not find suitable replacement primers. You can either rerun varVAMP and try different settings or you can perform a third pool that contains a amplicon that has one of the conflicting dimers. Notably, varVAMP also reports the dimer melting temperature. If it is still reasonable low, using a hot start polymerase might still lead to successful PCR amplification. +In your case varVAMP could not find suitable replacement primers in the TILED mode. You can either rerun varVAMP and try different settings or you can perform a third pool that contains a amplicon that has one of the conflicting dimers. Notably, varVAMP also reports the dimer melting temperature. If it is still reasonable low, using a hot start polymerase might still lead to successful PCR amplification. +4. **I have multiple amplicons after SANGER mode. Which should I use?** -4. **How fast is varVAMP?** +varVAMP sorts all amplicons by score and always takes the best one of non-overlapping amplicons. If you are not interested in a specific gene region, amplicon_0 is your best candidate! + +5. **How fast is varVAMP?** + +varVAMP is pretty fast given the complexity of the problem. Running time is depended on the alignment length, number of sequences and the running mode. While the TILED is rather slow, qPCR and SANGER can be faster. An alignment with a few hundred sequences and with a genome size of 10 kb will likely run in under a minute for the TILED mode. For large e.g. DNA viruses (200 kb) it takes considerably longer, but should still finish in minutes. Running time optimizations are planned. -varVAMP is pretty fast given the complexity of the problem. Running time is depended on the alignment length, number of sequences and the running mode. While the TILED is rather slow, qPCR and SANGER are faster. An alignment with a few hundred sequences and with a genome size of 10 kb will likely run in under a minute for the TILED mode. For large e.g. DNA viruses (200 kb) it takes considerably longer, but should still finish in minutes. Running time optimizations are planned. diff --git a/docs/how_varvamp_works.md b/docs/how_varvamp_works.md index 63a53dd..0493ae6 100644 --- a/docs/how_varvamp_works.md +++ b/docs/how_varvamp_works.md @@ -37,7 +37,7 @@ To search for the best scoring amplicon, varVAMP uses a graph based approach. 7. Lastly, the best scoring scheme is evaluated for primer dimers in their respective pools. If a primer dimer pair is found, varVAMP evaluates for each primer their overlapping previously not considered primers (primer search step 2) and again minimizes the score. The scheme and all primers are updated. If no alternative primers can be found, varVAMP issues a warning and reports the unsolvable primer dimers. #### Sanger sequencing -coming soon +varVAMP sorts all amplicons by their score and takes the non-overlapping amplicon with the lowest score! As varVAMP gives a size penalty to amplicons, varVAMP automatically finds amplicons with low primer scores close to your optimal length (if possible). #### qPCR coming soon diff --git a/docs/installation.md b/docs/installation.md index b7ef2b6..b5f3974 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -3,10 +3,22 @@ varVAMP runs on UNIX Systems, MacOSX and Windows with Python3 >=3.9 installed. ## Installation +From PyPI: + ```shell pip install varvamp ``` + +From CONDA (optionally with a new env): + +```shell +conda create --name varvamp +conda activate varvamp +conda install varvamp +``` + That was already it. To check if it worked: + ```shell varvamp -v ``` diff --git a/docs/preparing_the_data.md b/docs/preparing_the_data.md index 9757aae..a850e3a 100644 --- a/docs/preparing_the_data.md +++ b/docs/preparing_the_data.md @@ -12,7 +12,7 @@ mafft my_sequences.fasta > my_alignment.fasta ### BUT... -If your sequences are two diverse, also varVAMP will not perform well. Therefore, it is important that, phylogenically, the input alignment makes sense. To analyze this you can calculate a tree with tools like [iqtree](http://www.iqtree.org/) and then use [TreeCluster](https://github.com/niemasd/TreeCluster) to get phylogenically related sequence clusters. However, this can be also computationally intensive. +If your sequences are too diverse, also varVAMP will not perform well. Therefore, it is important that, phylogenically, the input alignment makes sense. To analyze this you can calculate a tree with tools like [iqtree](http://www.iqtree.org/) and then use [TreeCluster](https://github.com/niemasd/TreeCluster) to get phylogenically related sequence clusters. However, this can be also computationally intensive. We have had good experience in using varVAMP with the sequence identity based clustering algorithm [vsearch](https://github.com/torognes/vsearch). A good starting point is a sequence identity between 0.8 and 0.85. For such clusters varVAMP should perform reasonably well. diff --git a/docs/usage.md b/docs/usage.md index 723df01..1c6b03d 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -16,6 +16,7 @@ In this case varVAMP uses as standard settings: * ```OVERLAP``` = 100 (minimum overlap length) * ```THRESHOLD``` = 0.9 (nucleotide consensus threshold) * ```ALLOWED_AMBIGUOUS``` = 4 (number of allowed ambiguous characters in a primer) +* ```MODE``` = TILED These settings are quite relaxed and can produce decent results for diverse viruses (80-90 % sequence identity). However, you can likely optimize the result. @@ -24,6 +25,10 @@ These settings are quite relaxed and can produce decent results for diverse viru varvamp [OPTIONS] ``` ``` +usage: varvamp [options] + +varvamp: variable virus amplicon design + positional arguments: input alignment file and dir to write results @@ -34,13 +39,14 @@ optional arguments: -ml MAX_LENGTH, --max-length MAX_LENGTH max length of the amplicons -o OVERLAP, --overlap OVERLAP - min overlap of the amplicons + min overlap of the amplicons. Only for TILED mode. -t THRESHOLD, --threshold THRESHOLD threshold for nucleotides in alignment to be considered conserved -a ALLOWED_AMBIGUOUS, --allowed-ambiguous ALLOWED_AMBIGUOUS number of ambiguous characters that are allowed within a primer --console, --no-console show varvamp console output (default: True) + --mode MODE varvamp modes: TILED, SANGER -v, --version show program's version number and exit ``` diff --git a/varvamp/command.py b/varvamp/command.py index 0869739..b3ba6da 100644 --- a/varvamp/command.py +++ b/varvamp/command.py @@ -21,7 +21,7 @@ from varvamp.scripts import scheme -# DEFs +# DEFs^ def get_args(sysargs): """ arg parsing for varvamp @@ -55,7 +55,7 @@ def get_args(sysargs): "--overlap", type=float, default=config.AMPLICON_MIN_OVERLAP, - help="min overlap of the amplicons" + help="min overlap of the amplicons. Only for TILED mode." ) parser.add_argument( "-t", @@ -77,6 +77,12 @@ def get_args(sysargs): default=True, help="show varvamp console output" ) + parser.add_argument( + "--mode", + type = str, + default= "TILED", + help="varvamp modes: TILED, SANGER" + ) parser.add_argument( "-v", "--version", @@ -99,6 +105,7 @@ def main(sysargs=sys.argv[1:]): args = get_args(sysargs) if not args.console: sys.stdout = open(os.devnull, 'w') + args.mode = args.mode.upper() start_time = time.process_time() results_dir, data_dir, log_file = logging.create_dir_structure(args.input[1]) logging.raise_arg_errors(args, log_file) @@ -201,46 +208,57 @@ def main(sysargs=sys.argv[1:]): log_file, exit=True ) - amplicon_graph = scheme.create_amplicon_graph(amplicons, args.overlap) logging.varvamp_progress( log_file, progress=0.8, job="Finding potential amplicons.", - progress_text=str(len(amplicons)) + " potential amplicons" - ) - # search for amplicon scheme - coverage, amplicon_scheme = scheme.find_best_covering_scheme( - amplicons, - amplicon_graph, - all_primers + progress_text= f"{len(amplicons)} potential amplicons" ) - dimers_not_solved = scheme.check_and_solve_heterodimers( - amplicon_scheme, - left_primer_candidates, - right_primer_candidates, - all_primers) - if dimers_not_solved: - logging.raise_error( - f"varVAMP found {len(dimers_not_solved)} primer dimers without replacements. Check the dimer file and perform the PCR for incomaptible amplicons in a sperate reaction.", - log_file + + if args.mode == "TILED": + amplicon_graph = scheme.create_amplicon_graph(amplicons, args.overlap) + # search for amplicon scheme + coverage, amplicon_scheme = scheme.find_best_covering_scheme( + amplicons, + amplicon_graph, + all_primers ) - reporting.write_dimers(dir, dimers_not_solved) - percent_coverage = round(coverage/len(ambiguous_consensus)*100, 2) - logging.varvamp_progress( - log_file, - progress=0.9, - job="Creating amplicon scheme.", - progress_text=f"{percent_coverage} % total coverage with {len(amplicon_scheme[0]) + len(amplicon_scheme[1])} amplicons" - ) - if percent_coverage < 70: - logging.raise_error( - "coverage < 70 %. Possible solutions:\n" - "\t - lower threshold\n" - "\t - increase amplicons lengths\n" - "\t - increase number of ambiguous nucleotides\n" - "\t - relax primer settings (not recommended)\n", - log_file + dimers_not_solved = scheme.check_and_solve_heterodimers( + amplicon_scheme, + left_primer_candidates, + right_primer_candidates, + all_primers) + if dimers_not_solved: + logging.raise_error( + f"varVAMP found {len(dimers_not_solved)} primer dimers without replacements. Check the dimer file and perform the PCR for incomaptible amplicons in a sperate reaction.", + log_file + ) + reporting.write_dimers(dir, dimers_not_solved) + percent_coverage = round(coverage/len(ambiguous_consensus)*100, 2) + logging.varvamp_progress( + log_file, + progress=0.9, + job="Creating amplicon scheme.", + progress_text=f"{percent_coverage} % total coverage with {len(amplicon_scheme[0]) + len(amplicon_scheme[1])} amplicons" ) + if percent_coverage < 70: + logging.raise_error( + "coverage < 70 %. Possible solutions:\n" + "\t - lower threshold\n" + "\t - increase amplicons lengths\n" + "\t - increase number of ambiguous nucleotides\n" + "\t - relax primer settings (not recommended)\n", + log_file + ) + elif args.mode == "SANGER": + amplicon_scheme = scheme.find_best_amplicons(amplicons, all_primers) + logging.varvamp_progress( + log_file, + progress=0.9, + job="Finding low scoring amplicons.", + progress_text="The lowest scoring amplicon is amplicon_0." + ) + # write files reporting.write_alignment(data_dir, alignment_cleaned) reporting.write_fasta(data_dir, "majority_consensus", majority_consensus) @@ -250,7 +268,8 @@ def main(sysargs=sys.argv[1:]): reporting.write_scheme_to_files( results_dir, amplicon_scheme, - ambiguous_consensus + ambiguous_consensus, + args.mode ) reporting.varvamp_plot( results_dir, diff --git a/varvamp/scripts/logging.py b/varvamp/scripts/logging.py index 2ab6d18..85be823 100644 --- a/varvamp/scripts/logging.py +++ b/varvamp/scripts/logging.py @@ -50,7 +50,7 @@ def varvamp_progress(log_file, start_time=None, progress=0, job="", progress_tex else: if progress == 1: stop_time = str(round(time.process_time() - start_time, 2)) - progress_text = f"all done \n\n\rvarVAMP created an amplicon scheme in {stop_time} sec!\n{datetime.datetime.now()}" + progress_text = f"all done \n\n\rvarVAMP finished in {stop_time} sec!\n{datetime.datetime.now()}" job = "Finalizing output." print( "\rJob:\t\t " + job + "\nProgress: \t [{0}] {1}%".format("█"*block + "-"*(barLength-block), progress*100) + "\t" + progress_text, @@ -115,39 +115,47 @@ def raise_arg_errors(args, log_file): log_file, exit=True ) - if args.opt_length < 200 or args.max_length < 200: + if args.mode != "TILED" and args.mode != "SANGER": raise_error( - "your amplicon lengths might be to small. Consider increasing", - log_file - ) - if args.overlap < 0: - raise_error( - "overlap size can not be negative.", - log_file, - exit=True - ) - if args.overlap < 50: - raise_error( - "small overlaps might hinder downstream analyses. Consider increasing.", - log_file - ) - if args.overlap > args.max_length/2 - config.PRIMER_SIZES[1]: - raise_error( - "min overlap must be lower than half of your maximum length - maximum primer length. To achieve optimal results reduce it to at least half of your optimal length", - log_file, - exit=True - ) - if args.overlap > args.opt_length: - raise_error( - "overlap can not be higher than your optimal length.", + "non valid varvamp mode. Use TILED or SANGER.", log_file, exit=True ) - if args.overlap > args.opt_length/2: - raise_error( - "your intended overlap is higher than half of your optimal length. This reduces how well varvamps will find overlapping amplicons. Consider decreasing.", - log_file - ) + # TILED specific warnings + if args.mode == "TILED": + if args.opt_length < 200 or args.max_length < 200: + raise_error( + "your amplicon lengths might be to small. Consider increasing", + log_file + ) + if args.overlap < 0: + raise_error( + "overlap size can not be negative.", + log_file, + exit=True + ) + if args.overlap < 50: + raise_error( + "small overlaps might hinder downstream analyses. Consider increasing.", + log_file + ) + if args.overlap > args.max_length/2 - config.PRIMER_SIZES[1]: + raise_error( + "min overlap must be lower than half of your maximum length - maximum primer length. To achieve optimal results reduce it to at least half of your optimal length", + log_file, + exit=True + ) + if args.overlap > args.opt_length: + raise_error( + "overlap can not be higher than your optimal length.", + log_file, + exit=True + ) + if args.overlap > args.opt_length/2: + raise_error( + "your intended overlap is higher than half of your optimal length. This reduces how well varvamps will find overlapping amplicons. Consider decreasing.", + log_file + ) def confirm_config(args, log_file): diff --git a/varvamp/scripts/reporting.py b/varvamp/scripts/reporting.py index f4e8c9e..b0be4d6 100644 --- a/varvamp/scripts/reporting.py +++ b/varvamp/scripts/reporting.py @@ -102,7 +102,7 @@ def get_permutations(seq): return[''.join(p) for p in itertools.product(*splits)] -def write_scheme_to_files(dir, amplicon_scheme, ambiguous_consensus): +def write_scheme_to_files(dir, amplicon_scheme, ambiguous_consensus, mode): """ write all relevant bed files and a tsv file with all primer stats """ @@ -134,7 +134,11 @@ def write_scheme_to_files(dir, amplicon_scheme, ambiguous_consensus): right = (primer_names[1], amplicon_scheme[pool][amp][primer_names[1]]) # write amplicon bed - print("ambiguous_consensus", left[1][1], right[1][2], new_name, pool, sep="\t", file=bed) + if mode == "TILED": + print("ambiguous_consensus", left[1][1], right[1][2], new_name, pool, sep="\t", file=bed) + elif mode == "SANGER": + print("ambiguous_consensus", left[1][1], right[1][2], new_name, round(left[1][3] + right[1][3], 1), sep="\t", file=bed) + # write primer assignments tabular file print(left[0], right[0], sep="\t", file=tabular) @@ -346,7 +350,7 @@ def varvamp_plot(dir, threshold, alignment_cleaned, conserved_regions, all_prime ax[idx].set_title(primer[0], loc="left") ax[idx].xaxis.set_ticks(np.arange(primer[1][1], primer[1][1]+len(x), 1)) ax[idx].xaxis.set_ticklabels(x, rotation=45) - ax[idx].set_ylabel(ylabel="% of sequences") + ax[idx].set_ylabel(ylabel="freq of sequences") ax[idx].set_xlabel("position") ax[idx].set_ylim(0, 1-threshold) # - to pdf diff --git a/varvamp/scripts/scheme.py b/varvamp/scripts/scheme.py index 1aad401..0c0f104 100644 --- a/varvamp/scripts/scheme.py +++ b/varvamp/scripts/scheme.py @@ -388,3 +388,27 @@ def check_and_solve_heterodimers(amplicon_scheme, left_primer_candidates, right_ primer_dimers = test_scheme_for_dimers(amplicon_scheme) return not_solvable + + +def find_best_amplicons(amplicons, all_primers): + """ + find the best scoring non-overlapping amplicons from all found amplicons + """ + # sort amplicons + sorted_amplicons = sorted(amplicons.items(), key=lambda x: x[1][5]) + to_retain = [sorted_amplicons[0]] + amplicon_range = list(range(sorted_amplicons[0][1][0], sorted_amplicons[0][1][1])) + amplicon_set = set(amplicon_range) + for amp in sorted_amplicons: + amp_positions = list(range(amp[1][0], amp[1][1]+1)) + if not any(x in amp_positions for x in amplicon_set): + amplicon_set.update(amp_positions) + to_retain.append(amp) + # build the final dictionary + scheme_dictionary = {0: {}} + for amp in to_retain: + scheme_dictionary[0][amp[0]] = {} + scheme_dictionary[0][amp[0]][amp[1][2]] = all_primers["+"][amp[1][2]] + scheme_dictionary[0][amp[0]][amp[1][3]] = all_primers["-"][amp[1][3]] + + return scheme_dictionary From 61ceb1a461b8d0904bd59fa6b7623c42b88a6cfa Mon Sep 17 00:00:00 2001 From: jonas-fuchs Date: Mon, 27 Mar 2023 17:14:02 +0200 Subject: [PATCH 2/8] updated version --- varvamp/__init__.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/varvamp/__init__.py b/varvamp/__init__.py index 136d60d..36ac97e 100644 --- a/varvamp/__init__.py +++ b/varvamp/__init__.py @@ -1,3 +1,3 @@ """Tool to design amplicons for highly variable virusgenomes""" _program = "varvamp" -__version__ = "0.3" +__version__ = "0.4" From 1812eb587413884d916cb153865535003c4bd355 Mon Sep 17 00:00:00 2001 From: jonas-fuchs Date: Mon, 27 Mar 2023 17:25:25 +0200 Subject: [PATCH 3/8] updated comment --- varvamp/scripts/scheme.py | 1 + 1 file changed, 1 insertion(+) diff --git a/varvamp/scripts/scheme.py b/varvamp/scripts/scheme.py index 0c0f104..23b4be2 100644 --- a/varvamp/scripts/scheme.py +++ b/varvamp/scripts/scheme.py @@ -399,6 +399,7 @@ def find_best_amplicons(amplicons, all_primers): to_retain = [sorted_amplicons[0]] amplicon_range = list(range(sorted_amplicons[0][1][0], sorted_amplicons[0][1][1])) amplicon_set = set(amplicon_range) + # find lowest non-overlapping for amp in sorted_amplicons: amp_positions = list(range(amp[1][0], amp[1][1]+1)) if not any(x in amp_positions for x in amplicon_set): From 0bcf385165918ab68e883af70be9b6cf3264e05b Mon Sep 17 00:00:00 2001 From: jonas-fuchs Date: Mon, 27 Mar 2023 20:30:48 +0200 Subject: [PATCH 4/8] beautified arg parsing, included report_n arg --- docs/usage.md | 33 ++++++++----------- varvamp/command.py | 65 +++++++++++++++++++++++--------------- varvamp/scripts/config.py | 8 ----- varvamp/scripts/logging.py | 58 ++++++++++++++++++++++++---------- varvamp/scripts/scheme.py | 7 +++- 5 files changed, 101 insertions(+), 70 deletions(-) diff --git a/docs/usage.md b/docs/usage.md index 1c6b03d..5adbf35 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -15,7 +15,7 @@ In this case varVAMP uses as standard settings: * ```MAX_LENGTH``` = 2000 (maximum amplicon length) * ```OVERLAP``` = 100 (minimum overlap length) * ```THRESHOLD``` = 0.9 (nucleotide consensus threshold) -* ```ALLOWED_AMBIGUOUS``` = 4 (number of allowed ambiguous characters in a primer) +* ```N_AMBIG``` = 4 (number of allowed ambiguous characters in a primer) * ```MODE``` = TILED These settings are quite relaxed and can produce decent results for diverse viruses (80-90 % sequence identity). However, you can likely optimize the result. @@ -24,31 +24,26 @@ These settings are quite relaxed and can produce decent results for diverse viru ```shell varvamp [OPTIONS] ``` + ``` -usage: varvamp [options] +usage: varvamp [options] varvamp: variable virus amplicon design positional arguments: - input alignment file and dir to write results + input alignment file and dir to write results + mode varvamp modes: TILED, SANGER optional arguments: - -h, --help show this help message and exit - -ol OPT_LENGTH, --opt-length OPT_LENGTH - optimal length of the amplicons - -ml MAX_LENGTH, --max-length MAX_LENGTH - max length of the amplicons - -o OVERLAP, --overlap OVERLAP - min overlap of the amplicons. Only for TILED mode. - -t THRESHOLD, --threshold THRESHOLD - threshold for nucleotides in alignment to be considered conserved - -a ALLOWED_AMBIGUOUS, --allowed-ambiguous ALLOWED_AMBIGUOUS - number of ambiguous characters that are allowed within a primer - --console, --no-console - show varvamp console output (default: True) - --mode MODE varvamp modes: TILED, SANGER - -v, --version show program's version number and exit - + -h, --help show this help message and exit + -ol , --opt-length optimal length of the amplicons + -ml , --max-length max length of the amplicons + -t , --threshold threshold for conserved nucleotides + -a , --n-ambig max number of ambiguous characters in a primer + -o , --overlap TILED: min overlap of the amplicons + -n , --report-n SANGER: report the top n best hits + --verb, --no-verb show varvamp console output (default: True) + -v, --version show program's version number and exit ``` ## Further customization (advanced) diff --git a/varvamp/command.py b/varvamp/command.py index b3ba6da..b272a0c 100644 --- a/varvamp/command.py +++ b/varvamp/command.py @@ -21,7 +21,6 @@ from varvamp.scripts import scheme -# DEFs^ def get_args(sysargs): """ arg parsing for varvamp @@ -29,60 +28,74 @@ def get_args(sysargs): parser = argparse.ArgumentParser( prog=_program, description='varvamp: variable virus amplicon design', - usage='''varvamp [options]''') + usage='''varvamp [options]''') parser.add_argument( "input", nargs=2, help="alignment file and dir to write results" ) + parser.add_argument( + "mode", + type=str, + nargs='?', + default="TILED", + help="varvamp modes: TILED, SANGER", + ) parser.add_argument( "-ol", "--opt-length", help="optimal length of the amplicons", + metavar="", type=int, - default=config.AMPLICON_OPT_LENGTH + default=1000 ) parser.add_argument( "-ml", "--max-length", help="max length of the amplicons", + metavar="", type=int, - default=config.AMPLICON_MAX_LENGTH - ) - parser.add_argument( - "-o", - "--overlap", - type=float, - default=config.AMPLICON_MIN_OVERLAP, - help="min overlap of the amplicons. Only for TILED mode." + default=2000 ) parser.add_argument( "-t", "--threshold", + metavar="", type=float, - default=config.FREQUENCY_THRESHOLD, - help="threshold for nucleotides in alignment to be considered conserved" + default=0.9, + help="threshold for conserved nucleotides" ) parser.add_argument( "-a", - "--allowed-ambiguous", + "--n-ambig", + metavar="", type=int, - default=config.PRIMER_ALLOWED_N_AMB, - help="number of ambiguous characters that are allowed within a primer" + default=2, + help="max number of ambiguous characters in a primer" ) parser.add_argument( - "--console", + "-o", + "--overlap", + type=int, + metavar="", + default=100, + help="TILED: min overlap of the amplicons" + ) + parser.add_argument( + "-n", + "--report-n", + type=int, + metavar="", + default= float("inf"), + help="SANGER: report the top n best hits" + ) + parser.add_argument( + "--verb", action=argparse.BooleanOptionalAction, default=True, help="show varvamp console output" ) - parser.add_argument( - "--mode", - type = str, - default= "TILED", - help="varvamp modes: TILED, SANGER" - ) parser.add_argument( "-v", "--version", @@ -103,7 +116,7 @@ def main(sysargs=sys.argv[1:]): """ # start varVAMP args = get_args(sysargs) - if not args.console: + if not args.verb: sys.stdout = open(os.devnull, 'w') args.mode = args.mode.upper() start_time = time.process_time() @@ -143,7 +156,7 @@ def main(sysargs=sys.argv[1:]): # generate conserved region list conserved_regions = conserved.find_regions( ambiguous_consensus, - args.allowed_ambiguous + args.n_ambig ) if not conserved_regions: logging.raise_error( @@ -251,7 +264,7 @@ def main(sysargs=sys.argv[1:]): log_file ) elif args.mode == "SANGER": - amplicon_scheme = scheme.find_best_amplicons(amplicons, all_primers) + amplicon_scheme = scheme.find_best_amplicons(amplicons, all_primers, args.report_n) logging.varvamp_progress( log_file, progress=0.9, diff --git a/varvamp/scripts/config.py b/varvamp/scripts/config.py index a78435d..b29d3e3 100644 --- a/varvamp/scripts/config.py +++ b/varvamp/scripts/config.py @@ -6,10 +6,6 @@ # CAN BE CHANGED -# alignment and consensus creation threshold -FREQUENCY_THRESHOLD = 0.9 # freq at which a nucleotide is considered conserved -PRIMER_ALLOWED_N_AMB = 4 # allowed number of ambiguous chars in primer - # basic primer parameters PRIMER_TMP = (57, 63, 60) # temperatur (min, max, opt) PRIMER_GC_RANGE = (40, 60, 50) # gc (min, max, opt) @@ -36,10 +32,6 @@ PRIMER_3_PENALTY = (10, 10, 10) # penalties for 3' mismatches PRIMER_PERMUTATION_PENALTY = 0.1 # penalty for the number of permutations -# amplicon settings -AMPLICON_MIN_OVERLAP = 100 -AMPLICON_OPT_LENGTH = 1000 -AMPLICON_MAX_LENGTH = 2000 # DO NOT CHANGE # nucleotide definitions diff --git a/varvamp/scripts/logging.py b/varvamp/scripts/logging.py index 85be823..9a66027 100644 --- a/varvamp/scripts/logging.py +++ b/varvamp/scripts/logging.py @@ -91,13 +91,13 @@ def raise_arg_errors(args, log_file): log_file, exit=True ) - if args.allowed_ambiguous < 0: + if args.n_ambig < 0: raise_error( "number of ambiguous chars can not be negative", log_file, exit=True ) - if args.allowed_ambiguous > 4: + if args.n_ambig > 4: raise_error( "high number of ambiguous nucleotides in primer leads to a high " "degeneracy. Consider reducing.", @@ -121,8 +121,20 @@ def raise_arg_errors(args, log_file): log_file, exit=True ) + if args.mode == "SANGER": + if args.report_n < 1: + raise_error( + "number of reported amplicons cannot be below 1.", + log_file, + exit=True + ) # TILED specific warnings if args.mode == "TILED": + if args.report_n != float("inf"): + raise_error( + "n-report is for SANGER and ignored for TILED", + log_file + ) if args.opt_length < 200 or args.max_length < 200: raise_error( "your amplicon lengths might be to small. Consider increasing", @@ -167,12 +179,6 @@ def confirm_config(args, log_file): # check if all variables exists all_vars = [ - # arg dependent - "FREQUENCY_THRESHOLD", - "PRIMER_ALLOWED_N_AMB", - "AMPLICON_OPT_LENGTH", - "AMPLICON_MAX_LENGTH", - "AMPLICON_MIN_OVERLAP", # arg independent "PRIMER_TMP", "PRIMER_GC_RANGE", @@ -314,16 +320,36 @@ def confirm_config(args, log_file): var_dic = vars(config) with open(log_file, 'a') as f: print( - "settings that can be adjusted via arguments\n", - f"PRIMER_OPT_LENGTH = {args.opt_length}", - f"PRIMER_MAX_LENGTH = {args.max_length}", - f"PRIMER_MIN_OVERLAP = {args.overlap}", - f"PRIMER_THRESHOLD = {args.threshold}", - f"PRIMER_ALLOWED_N_AMB = {args.allowed_ambiguous}", - "\nconfig settings\n", + f"MODE = {args.mode}", + sep="\n", + file=f + ) + print( + "\nsettings that can be adjusted via arguments\n", + f"OPT_LENGTH = {args.opt_length}", + f"MAX_LENGTH = {args.max_length}", + f"THRESHOLD = {args.threshold}", + f"ALLOWED_N_AMB = {args.n_ambig}", sep="\n", file=f ) + if args.mode == "TILED": + print( + f"MIN_OVERLAP = {args.overlap}", + sep="\n", + file=f + ) + if args.mode == "SANGER": + print( + f"REPORT_N_AMPLICONS = {args.report_n}", + sep="\n", + file=f + ) + print( + "\nconfig settings\n", + sep="\n", + file = f + ) for var in all_vars[5:]: print(f"{var} = {var_dic[var]}", file=f) - print("\nprogress\n", file=f) + print("\nprogress", file=f) diff --git a/varvamp/scripts/scheme.py b/varvamp/scripts/scheme.py index 23b4be2..46617b6 100644 --- a/varvamp/scripts/scheme.py +++ b/varvamp/scripts/scheme.py @@ -390,7 +390,7 @@ def check_and_solve_heterodimers(amplicon_scheme, left_primer_candidates, right_ return not_solvable -def find_best_amplicons(amplicons, all_primers): +def find_best_amplicons(amplicons, all_primers, n): """ find the best scoring non-overlapping amplicons from all found amplicons """ @@ -407,9 +407,14 @@ def find_best_amplicons(amplicons, all_primers): to_retain.append(amp) # build the final dictionary scheme_dictionary = {0: {}} + counter = 1 for amp in to_retain: scheme_dictionary[0][amp[0]] = {} scheme_dictionary[0][amp[0]][amp[1][2]] = all_primers["+"][amp[1][2]] scheme_dictionary[0][amp[0]][amp[1][3]] = all_primers["-"][amp[1][3]] + if n != float("inf"): + if counter == n: + break + counter += 1 return scheme_dictionary From 30ecca1b209b92562d87beb97b6b65830ab407b0 Mon Sep 17 00:00:00 2001 From: jonas-fuchs Date: Mon, 27 Mar 2023 23:12:53 +0200 Subject: [PATCH 5/8] small changes --- README.md | 2 +- docs/usage.md | 4 +--- varvamp/command.py | 5 ++--- varvamp/scripts/scheme.py | 2 +- 4 files changed, 5 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index b9e204e..3140707 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,7 @@ # varVAMP -[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0) +[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0) For a lot of virus genera it is difficult to design pan-specific primers. varVAMP solves this, by introducing ambiguous characters into primers and minimizes mismatches at the 3' end. Primers might not work for some sequences of your input alignment but should recognize the large majority. diff --git a/docs/usage.md b/docs/usage.md index 5adbf35..e22cae2 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -22,12 +22,10 @@ These settings are quite relaxed and can produce decent results for diverse viru **Full usage:** ```shell -varvamp [OPTIONS] +usage: varvamp [options] ``` ``` -usage: varvamp [options] - varvamp: variable virus amplicon design positional arguments: diff --git a/varvamp/command.py b/varvamp/command.py index b272a0c..1f51652 100644 --- a/varvamp/command.py +++ b/varvamp/command.py @@ -13,7 +13,6 @@ from varvamp import __version__ from varvamp.scripts import logging from varvamp.scripts import alignment -from varvamp.scripts import config from varvamp.scripts import consensus from varvamp.scripts import conserved from varvamp.scripts import primers @@ -87,7 +86,7 @@ def get_args(sysargs): "--report-n", type=int, metavar="", - default= float("inf"), + default=float("inf"), help="SANGER: report the top n best hits" ) parser.add_argument( @@ -225,7 +224,7 @@ def main(sysargs=sys.argv[1:]): log_file, progress=0.8, job="Finding potential amplicons.", - progress_text= f"{len(amplicons)} potential amplicons" + progress_text=f"{len(amplicons)} potential amplicons" ) if args.mode == "TILED": diff --git a/varvamp/scripts/scheme.py b/varvamp/scripts/scheme.py index 46617b6..efc9184 100644 --- a/varvamp/scripts/scheme.py +++ b/varvamp/scripts/scheme.py @@ -397,7 +397,7 @@ def find_best_amplicons(amplicons, all_primers, n): # sort amplicons sorted_amplicons = sorted(amplicons.items(), key=lambda x: x[1][5]) to_retain = [sorted_amplicons[0]] - amplicon_range = list(range(sorted_amplicons[0][1][0], sorted_amplicons[0][1][1])) + amplicon_range = list(range(sorted_amplicons[0][1][0], sorted_amplicons[0][1][1]+1)) amplicon_set = set(amplicon_range) # find lowest non-overlapping for amp in sorted_amplicons: From 62faacfc37db42943da30d385efdf38cce95183c Mon Sep 17 00:00:00 2001 From: jonas-fuchs Date: Mon, 27 Mar 2023 23:22:25 +0200 Subject: [PATCH 6/8] added badges --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 3140707..3bab491 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,7 @@ # varVAMP -[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0) +[![License: GPL v3](https://img.shields.io/github/license/jonas-fuchs/varvamp)](https://www.gnu.org/licenses/gpl-3.0) For a lot of virus genera it is difficult to design pan-specific primers. varVAMP solves this, by introducing ambiguous characters into primers and minimizes mismatches at the 3' end. Primers might not work for some sequences of your input alignment but should recognize the large majority. From 127d70d6684dacce3e4a558c22f2c0c9c27faf4b Mon Sep 17 00:00:00 2001 From: jonas-fuchs Date: Mon, 27 Mar 2023 23:32:28 +0200 Subject: [PATCH 7/8] reordered varvamp import --- varvamp/command.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/varvamp/command.py b/varvamp/command.py index 1f51652..66b1d56 100644 --- a/varvamp/command.py +++ b/varvamp/command.py @@ -9,8 +9,6 @@ import argparse # varVAMP -from . import _program -from varvamp import __version__ from varvamp.scripts import logging from varvamp.scripts import alignment from varvamp.scripts import consensus @@ -18,6 +16,8 @@ from varvamp.scripts import primers from varvamp.scripts import reporting from varvamp.scripts import scheme +from varvamp import __version__ +from . import _program def get_args(sysargs): From 20f6316d9fa3d8a13d86f8cdbcb2eeb6d3489133 Mon Sep 17 00:00:00 2001 From: jonas-fuchs Date: Tue, 28 Mar 2023 17:04:19 +0200 Subject: [PATCH 8/8] conda package not ready yet --- docs/installation.md | 8 -------- 1 file changed, 8 deletions(-) diff --git a/docs/installation.md b/docs/installation.md index b5f3974..b98a51d 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -9,14 +9,6 @@ From PyPI: pip install varvamp ``` -From CONDA (optionally with a new env): - -```shell -conda create --name varvamp -conda activate varvamp -conda install varvamp -``` - That was already it. To check if it worked: ```shell