MetaHipMer2 (MHM2): Introduction to and Usage of a Metagenome Short-Read Assembler for Supercomputers


berkeleylab / mhm2 / Downloads — Bitbucket

Paper:

Terabase-scale metagenome coassembly with MetaHipMer | Scientific Reports

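MetaHipMer (MHM) is a de novo metagenome short-read assembler. This is version 2 (MHM2), which is written entirely in UPC++, CUDA and HIP. It runs efficiently on both single servers and multi-node supercomputers, where it can scale up to coassemble terabase-sized metagenomes. More information about MetaHipMer can be found under the ExaBiome project of the Exascale Computing Project and in several publications: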

  • E. Georganas et al., "Extreme Scale De Novo Metagenome Assembly," SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, Dallas, TX, USA, 2018, pp. 122-13.
  • Hofmeyr, S., Egan, R., Georganas, E. et al. Terabase-scale metagenome coassembly with MetaHipMer. Sci Rep 10, 10689 (2020).
  • Awan, M.G., Deslippe, J., Buluc, A. et al. ADEPT: a domain independent sequence alignment strategy for gpu architectures. BMC Bioinformatics 21, 406 (2020).
  • Muaaz Awan, Steven Hofmeyr, Rob Egan et al. "Accelerating large scale de novo metagenome assembly using GPUs.", SC 2021

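The quality of MetaHipMer's assemblies is comparable with that of other leading metagenome assemblers, as documented in the results of the CAMI2 competition, where MetaHipMer was rated first for quality in two of the three datasets and second in the third: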

  • F. Meyer et al., "Critical Assessment of Metagenome Interpretation: the second round of challenges", Nature Methods volume 19, pages 429–440 (2022)

Information about building, installing and running MHM2 can be found in the user guide.

Building and Installing

MHM2 depends on UPC++, with the C++17 standard, and CMake. GPU builds require CUDA and/or HIP.

A script, build.sh, is provided for building and installing MHM2.

Before building MHM2, ensure that either the UPC++ compiler wrapper, upcxx, is in your PATH, or the MHM2_BUILD_ENV environment variable points to a script that loads the appropriate environment. For example, on NERSC's Perlmutter supercomputer, you would set the following for the GNU compiler:

export MHM2_BUILD_ENV=contrib/environments/perlmutter/gnu.sh

There are several scripts provided for different build choices on NERSC's and OLCF's systems, in directories that start with contrib/environments. You do not need to use any such scripts when building on a Linux server, although you may want to create your own when setting up the build. On NERSC and OLCF we recommend using the gnu (contrib/environments/*/gnu.sh) environments. Building with Intel is very slow.

To build a release version (optimized for performance), execute:

./build.sh Release

Alternatively, you can build a debug version with:

./build.sh Debug

This will capture a great deal of information useful for debugging but will run a lot slower (up to 5x slower at scale on multiple nodes).

An alternative to the pure debug version is the "release" debug version, which still captures a reasonable amount of debugging information, but is a lot faster (although still up to 2x slower than the release version):

./build.sh RelWithDebInfo

The ./build.sh script will install the binaries by default into the install/bin subdirectory in the repository root directory. To set a different install directory, set the environment variable MHM2_INSTALL_PATH, e.g.:

MHM2_INSTALL_PATH=/usr/local/share/mhm2 ./build.sh Release

After MHM2 has been built once, you can rebuild with

./build.sh

and it will build using the previously chosen setting (Release, Debug, or RelWithDebInfo).

You can also run

./build.sh clean

to start from scratch. If you run this, then the next call to build.sh should be with one of the three configuration settings.

By default, the build occurs within the root of the repository, in a subdirectory called .build. This is created automatically by the build.sh script.

The MHM2 build uses cmake, which you can call directly instead of going through the build.sh script, e.g.:

mkdir -p .build
cd .build
cmake -DCMAKE_INSTALL_PREFIX=path-to-install ..
make -j all install

Consult the build.sh script to see how it executes these commands.

You'll need to first set the environment, e.g.:

source contrib/environments/perlmutter/gnu.sh

If you see an error message like the following when building:

include could not find load file: GetGitVersion

then you have probably not cloned the git submodules. You need to execute the following from the root directory:

git submodule init
git submodule update

Docker

docker pull robegan21/mhm2

Running

To execute MHM2, run the mhm2.py script located at install/bin. Most parameters have sensible defaults, so it is possible to run with only the read FASTQ files specified. For example, to run with two interleaved reads files, lib1.fastq and lib2.fastq, you could execute:

mhm2.py -r lib1.fastq,lib2.fastq
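When MHM2 is launched from a pipeline rather than typed by hand, the same command can be assembled programmatically. The sketch below is only an illustration: the helper name mhm2_command and its default script path are assumptions, and it relies on nothing beyond the -r option shown above.

```python
import shlex


def mhm2_command(reads, script="install/bin/mhm2.py"):
    """Build the argument list for assembling interleaved reads files.

    Hypothetical helper: only the -r option from the text above is used.
    """
    if not reads:
        raise ValueError("at least one reads file is required")
    # Multiple reads files are passed as one comma-separated value.
    return [script, "-r", ",".join(reads)]


cmd = mhm2_command(["lib1.fastq", "lib2.fastq"])
print(shlex.join(cmd))
# To launch the run for real: subprocess.run(cmd, check=True)
```

Keeping the command as a list avoids shell quoting problems when file names contain spaces.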

A list of all the command line options can be found by running with -h. Because mhm2.py is a Python script that wraps the UPC++ binary, mhm2, there are two levels of options, one from the Python script and one from the binary. Some of the options have a short form (a single dash with a single character) and a long form (starting with a double dash). In the options described below, where both a short form and a long form exist, they are separated by a comma. The type of the option is indicated as one of STRING (a string of characters), INT (an integer), FLOAT (a floating point value) or BOOL (a boolean flag). For BOOL, the option can be given as true, false, yes, no, 0 or 1, or the value can be omitted altogether, in which case the option will be true. If a value is specified, the = must be used, e.g.:

mhm2.py --checkpoint=false
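The BOOL rules above can be restated as a small parser. This is a minimal sketch of the described behaviour, not MHM2's actual option handling; the function names parse_bool and split_option are made up for the illustration, and case-insensitive matching is an assumption.

```python
def parse_bool(value=None):
    """Interpret a BOOL option value; an omitted value counts as true."""
    if value is None:
        return True
    text = value.strip().lower()  # assumption: values are case-insensitive
    if text in ("true", "yes", "1"):
        return True
    if text in ("false", "no", "0"):
        return False
    raise ValueError(f"not a BOOL value: {value!r}")


def split_option(arg):
    """Split '--name=value' into (name, bool); a bare '--name' means true."""
    name, sep, value = arg.partition("=")
    return name.lstrip("-"), parse_bool(value if sep else None)


print(split_option("--checkpoint=false"))
print(split_option("--progress"))
```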

By default, the run will generate files in a specific output directory (see the --output option below). At a minimum, this will include the following files:

  • final_assembly.fasta: the contigs for the assembly, in FASTA format.
  • mhm2.log: a log file containing details about the run, including various quality statistics, details about the assembly process and timing information.
  • mhm2.config: a configuration file containing all the non-default options used for the run.
  • per_thread: a subdirectory containing per-process files that record memory usage and debugging information in Debug mode.

In addition, many more files may be generated according to which command-line options are specified. These are described in detail below where relevant.

The mhm2 binary can be executed directly using upcxx-run, srun or another suitable launcher. Generally we recommend using mhm2.py, because it takes care of many facets of starting the executable in a given environment and provides additional functionality, e.g. automatically restarting on errors, easily enabling communication tracing, etc.

Basic options

These are the most commonly used options.

The input files of reads are specified with either -r, -p, or -u. At least one of these options must be specified. When running on a Lustre file system (such as on OLCF's Frontier), it is recommended that all input files be striped to ensure adequate I/O performance. Usually this means first striping a directory and then moving files into it, e.g. for a file reads.fastq:

mkdir data
lfs setstripe -c 72 data
mv reads.fastq data

-r, --reads STRING,STRING,...

A collection of names of files containing interleaved paired reads in FASTQ format. Multiple files must be comma-separated, or can be separated by spaces. For paired reads in separate files, use the -p option. For unpaired reads, use the -u option. Long lists of read files can be set in a configuration file and loaded with the --config option, to avoid having to type them in on the command line.

-p, --paired-reads STRING,STRING,...

A collection of names of files containing separate paired reads in FASTQ format. Multiple files must be comma-separated, or can be separated by spaces. For each library, the file containing the reads for the first pairs must be followed by the file containing the reads for the second pairs, e.g. for two libraries with separate paired reads files lib1_1.fastq, lib1_2.fastq and lib2_1.fastq, lib2_2.fastq, the option should be specified as:

-p lib1_1.fastq,lib1_2.fastq,lib2_1.fastq,lib2_2.fastq

This option only supports reads where each pair of reads has the same sequence length, usually only seen in raw reads. For support of trimmed reads of possibly different lengths, first interleave the files and then call with the -r option. The separate files can be interleaved with reformat.sh from bbtools.
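To make the meaning of "interleaved" concrete, the sketch below alternates one four-line FASTQ record from each file of a pair. It is an illustrative stand-in, not a replacement for reformat.sh: the function names are hypothetical, and it assumes plain, uncompressed FASTQ with exactly four lines per record and both files in the same read order.

```python
from itertools import islice, zip_longest


def fastq_records(lines):
    """Yield FASTQ records as 4-line tuples (header, sequence, '+', quality)."""
    it = iter(lines)
    while True:
        record = tuple(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(lines1, lines2):
    """Yield lines that alternate one record from each file of a pair."""
    for rec1, rec2 in zip_longest(fastq_records(lines1), fastq_records(lines2)):
        if rec1 is None or rec2 is None:
            raise ValueError("the two files hold different numbers of reads")
        yield from rec1
        yield from rec2


# Usage with real files (file names taken from the example above):
# with open("lib1_1.fastq") as f1, open("lib1_2.fastq") as f2, \
#         open("lib1.fastq", "w") as out:
#     out.writelines(interleave(f1, f2))
```

The resulting lib1.fastq could then be passed with -r instead of -p.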

-u, --unpaired-reads STRING,STRING,...

A collection of names of files containing unpaired reads in FASTQ format. Multiple files must be comma-separated, or can be separated by spaces.

--adapter-refs STRING

A file containing adapter sequences in the FASTA format. If specified, it will be used to trim out all adapters when the input reads are first loaded. Two files containing adapter sequences are provided in the contrib directory: adapters_no_transposase.fa and all_adapters.fa.gz. The latter must be gunzipped before it can be used.

-i, --insert INT:INT

The insert size for paired reads. The first integer is the average insert size for the paired reads and the second integer is the standard deviation of the insert sizes. MHM2 will automatically attempt to compute these values, so this parameter is usually not necessary. However, there are certain cases where it may be useful, for example, if MHM2 prints a warning about being unable to compute the insert size because of the nature of the reads, or if only doing scaffolding. MHM2 will also compare its computed value to any option set on the command line and print a warning if the two differ significantly; this is useful for confirming assumptions about the insert size distribution.
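If insert sizes are already known from another source, for example from a library QC report, they can be turned into the INT:INT form that -i expects. The helper below is a hypothetical sketch: the name insert_option, the use of the population standard deviation and the rounding to whole numbers are all assumptions, and it does not reproduce how MHM2 estimates these values internally.

```python
from statistics import mean, pstdev


def insert_option(insert_sizes):
    """Format observed insert sizes as the AVERAGE:STDDEV value for -i."""
    avg = round(mean(insert_sizes))
    stddev = round(pstdev(insert_sizes))  # assumption: population std. dev.
    return f"{avg}:{stddev}"


print("-i", insert_option([280, 300, 320]))
```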

-k, --kmer-lens INT,INT,...

The k-mer lengths used for the contigging rounds. MHM2 performs one or more contigging rounds, each of which performs k-mer counting, followed by a de Bruijn graph traversal, then alignment and local assembly to extend the contigs. Typically, multiple rounds are used with increasing values of k; the shorter values are useful for low abundance genomes, whereas the longer k values are useful for resolving repeats. This option defaults to -k 21,33,55,77,99, which is fine for reads of length 150. For shorter or longer reads, it may be a good idea to adjust these values; for example, for reads of length 101, a better set is usually -k 21,33,47,63. Also, each round of contigging takes time, so the overall assembly time can be reduced by reducing the number of rounds, although this will likely reduce the quality of the final assembly.

-s, --scaff-kmer-lens INT,INT,...

The k-mer lengths used for the scaffolding rounds. In MHM2, the contigging rounds are followed by one or more scaffolding rounds. These rounds usually proceed from a high k to a low one, i.e. the reverse ordering of contigging. This option defaults to -s 99,33. The first value should always be set to the final k used in contigging, e.g. for reads of length 101 with parameter -k 21,33,47,63, the scaffolding values could be -s 63,33. More rounds may improve contiguity but will likely increase misassemblies. To disable scaffolding altogether, set this value to 0, i.e. -s 0.
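The relationship between the -k and -s lists can be checked before a long run is submitted. The function below is a hypothetical pre-flight check written for this article, not part of MHM2; it encodes only the conventions stated above and reports them as advice, since the text describes the orderings as typical rather than mandatory.

```python
def check_kmer_schedule(contig_ks, scaff_ks):
    """Return a list of remarks about a -k / -s pair; empty means consistent."""
    contig_ks, scaff_ks = list(contig_ks), list(scaff_ks)
    problems = []
    if contig_ks != sorted(set(contig_ks)):
        problems.append("contigging k values typically increase")
    if scaff_ks == [0]:
        return problems  # -s 0 disables scaffolding altogether
    if scaff_ks and scaff_ks[0] != contig_ks[-1]:
        problems.append("first scaffolding k should equal the final contigging k")
    if scaff_ks != sorted(scaff_ks, reverse=True):
        problems.append("scaffolding k values usually decrease")
    return problems


print(check_kmer_schedule([21, 33, 47, 63], [63, 33]))
print(check_kmer_schedule([21, 33, 47, 63], [99, 33]))
```

The second call mixes the read-length-101 contigging set with the default scaffolding values, which is the mismatch the text warns about.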

--min-ctg-print-len INT

The minimum length for contigs to be included in the final assembly, final_assembly.fasta. This defaults to 500.
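A stricter length cut-off can also be applied after the fact to an existing FASTA file. The sketch below is a generic FASTA filter rather than MHM2 code; the function names are hypothetical, and treating the minimum as inclusive (length >= min_len) is an assumption.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def long_contigs(lines, min_len=500):
    """Keep only contigs that are at least min_len bases long."""
    return [(name, seq) for name, seq in read_fasta(lines) if len(seq) >= min_len]


# Usage with a real assembly:
# with open("final_assembly.fasta") as fasta:
#     kept = long_contigs(fasta, min_len=1000)
```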

-o, --output STRING

The name for the output directory. If not specified, it will be set to a default value of the following form:

mhm2-run-<READS_FNAME1>-n<PROCS>-N<NODES>-YYMMDDhhmmss-<JOBID>

where <READS_FNAME1> is the name of the first reads file, PROCS is the number of processes and NODES is the number of nodes. Following this is the date and time when the run was started: YY is the last two digits of the year, MM is the number of the month, DD is the day of the month, hh is the hour of day, mm is the minute and ss is the second. Be warned that if two runs are started at exactly the same time, with the same parameters, then with the default values, they could both end up running in the same output directory, which will lead to corrupted results.
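The default name pattern can be reproduced as follows, for instance to predict where a run will write its results. This is a sketch under assumptions: the function name is made up, and the text does not say how MHM2 derives <READS_FNAME1> (with or without directory and extension) or <JOBID>, so both are simply passed in as given.

```python
from datetime import datetime


def default_output_dir(reads_fname, procs, nodes, started, job_id):
    """Format mhm2-run-<READS_FNAME1>-n<PROCS>-N<NODES>-YYMMDDhhmmss-<JOBID>."""
    stamp = started.strftime("%y%m%d%H%M%S")  # two-digit year, then month..second
    return f"mhm2-run-{reads_fname}-n{procs}-N{nodes}-{stamp}-{job_id}"


print(default_output_dir("lib1.fastq", 128, 2, datetime(2024, 2, 14, 9, 5, 7), "12345"))
```

Because the timestamp has one-second resolution, passing an explicit -o per run is the simplest way to avoid the collision described above.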

If the output directory is created by MHM2 (either as the default or when passed as a parameter), it will automatically be striped in the most effective way on a Lustre filesystem. If using a pre-existing directory that was not created by MHM2, the user should ensure that on Lustre filesystems it is adequately striped.

If the output directory already exists, files produced by a previous run of MHM2 may be overwritten, depending on whether or not this is a restart of a previous run. If there is an existing log file (mhm2.log), it will be renamed with the date appended before the new one is written as mhm2.log, so log information about previous runs will always be retained.

--checkpoint BOOL

Checkpoint runs. If set to true, this will checkpoint the run by saving intermediate files that can later be used to restart the run (see the --restart option below). The intermediate files are FASTA files of contigs, and they are saved at the end of each contigging round (contigs-<k>.fasta) and at the end of each scaffolding round (scaff-contigs-<k>.fasta), where the <k> value is the k-mer size for that round. Checkpointing is on by default and can be disabled by passing --checkpoint=false.

--restart BOOL

Restart a previous incomplete run. If set to true, MHM2 will attempt to restart a run from an existing directory. The output directory option must be specified and must contain a previous checkpointed run. The restart will use the same options as the previous run, and will load the most recent checkpointed contigs file in order to resume. This defaults to false.
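To see which checkpoint a restart would most plausibly resume from, the checkpoint file names can be ranked by stage. The sketch below is a guess at the ordering, not MHM2's restart logic: it assumes contigging rounds ran with increasing k and scaffolding rounds with decreasing k, as described for the -k and -s options, and the function name is hypothetical.

```python
import re

# Matches contigs-<k>.fasta and scaff-contigs-<k>.fasta
CHECKPOINT = re.compile(r"^(scaff-)?contigs-(\d+)\.fasta$")


def latest_checkpoint(filenames):
    """Pick the checkpoint written last, or None if there is none."""
    contigs, scaffs = [], []
    for fname in filenames:
        match = CHECKPOINT.match(fname)
        if match:
            target = scaffs if match.group(1) else contigs
            target.append((int(match.group(2)), fname))
    if scaffs:
        return min(scaffs)[1]   # scaffolding proceeds from high k to low k
    if contigs:
        return max(contigs)[1]  # contigging proceeds from low k to high k
    return None


print(latest_checkpoint(["contigs-21.fasta", "contigs-33.fasta", "mhm2.log"]))
```

A more robust variant could compare file modification times instead of relying on the k ordering.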

--post-asm-align BOOL

Perform alignment of reads to the final assembly after assembly has completed. If set to true, MHM2 will align the original reads to the final assembly and report the results in a file, final_assembly.sam, in SAM format. This defaults to false.

--post-asm-abd BOOL

Compute contig abundances after assembly has completed. If set to true, MHM2 will compute the abundances (depths) for the contigs in the final assembly and write the results to the file, final_assembly_depths.txt. The format of this file is the same as that used by MetaBAT, and so can be used together with the final_assembly.fasta for post-assembly binning, e.g.:

metabat2 -i final_assembly.fasta -a final_assembly_depths.txt -o bins_dir/bin

This defaults to false.

--post-asm-only BOOL

Perform only post-assembly operations. If set to true, this requires an existing directory containing a full run (i.e. with a final_assembly.fasta file), and it will execute any specified post-assembly options (--post-asm-align, --post-asm-abd) on that assembly without any other steps. This provides a convenient means to run alignment and/or abundance calculations on an already completed assembly. By default this post-assembly analysis will use the final_assembly.fasta file in the output directory, but any FASTA file could be used, including those not generated by MHM2 (see the --contigs option in the advanced options section below). This defaults to false.

--write-gfa BOOL

Produce an assembly graph in the GFA2 format. If set to true, MHM2 will output an assembly graph in the GFA2 format in the file, final_assembly.gfa. This represents the assembly graph formed by aligning the reads to the final contigs and using those alignments to infer edges between the contigs. This defaults to false.

-Q, --quality-offset INT

The Phred encoding offset. In most cases, MHM2 will be able to detect this offset from analyzing the reads file, so it usually does not need to be explicitly set.
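For readers unfamiliar with how such an offset can be inferred, a common heuristic looks at the lowest quality character in a sample of reads. The sketch below shows that generic heuristic only; it is not MHM2's detection code, the function name is hypothetical, and it would wrongly report 64 for Phred+33 data in which every sampled base has a very high quality.

```python
def guess_phred_offset(quality_lines):
    """Guess the offset (33 or 64) from the lowest quality character seen.

    Phred+64 quality strings are assumed not to use characters below '@'
    (code 64), so anything lower points to Phred+33.
    """
    lowest = min(ord(ch) for line in quality_lines for ch in line.strip())
    return 33 if lowest < 64 else 64


print(guess_phred_offset(["II#5", "FFFF"]))
print(guess_phred_offset(["hhhe", "ffff"]))
```

When the guess is unreliable, the offset can be set explicitly with -Q.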

--progress BOOL

Display progress indicators during a run. If true, many time-consuming stages will be shown updating with a simple progress bar. The progress bar output will not be written into the log file, mhm2.log. This defaults to false.

-v, --verbose BOOL

Verbose output. If true, MHM2 will produce verbose output, which prints out a lot of additional information about the timing of the run and the various computations that are being performed. All of the information seen in verbose mode will always be written to the log file, mhm2.log. This defaults to false.

--config STRING

Use a config file for the parameters. If this is specified, the options will be loaded from the named config file. The file is a plain text file of the format:

key = value

where key is the name of an option and value is the value of the option. All blank lines and lines beginning with a semicolon will be ignored. When the config file is not specified as an option, MHM2 always writes out all of the non-default options to the file mhm2.config in the output directory. Even when options are loaded from a config file, they can still be overridden by options on the command line. For example, if the config file, test.config, contains the line:

k = 21,33,55,77,99

but the command line is:

mhm2.py --config test.config -k 45,63

then MHM2 will run with k-mer lengths of 45, 63.
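The precedence rule in this example can be sketched in a few lines. This is an illustration of the described file format and override behaviour, not MHM2's own parser; the function names are hypothetical, and representing command-line options as a dictionary is a simplification.

```python
def parse_config(text):
    """Parse 'key = value' lines, skipping blank lines and ';' comments."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'key = value', got: {raw!r}")
        options[key.strip()] = value.strip()
    return options


def effective_options(config_text, command_line):
    """Command-line options override values loaded from the config file."""
    merged = parse_config(config_text)
    merged.update(command_line)
    return merged


config = """
; contigging rounds
k = 21,33,55,77,99
"""
print(effective_options(config, {"k": "45,63"}))
```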

Advanced options

These are additional options for tuning performance or the quality of the output, selecting precisely how to restart a run, or for additional debugging information. Most users will not need any of these options.

Restarting runs

Although the --restart option provides for simple restarts of previous runs, it is possible to restart at very specific points, with different options from those of the original run, e.g. restarting scaffolding with different k-mer values, set using the -s option.

The relevant options are listed below.

-c, --contigs STRING

The name of the file containing contigs in FASTA format that are to be used as the most recent checkpoint for a restart. Any contigs file generated during a checkpointed run can be used, so it is possible to restart at any stage. It is also possible to specify any FASTA file if running only post-assembly analysis (--post-asm-only).

--max-kmer-len INT

The maximum k-mer length that was previously used in contigging. This is usually derived from the -k parameter, and so only needs to be specified if the restart will run scaffolding rounds only. For example, the following command will restart after the scaffolding round with k=99 and will run two more scaffolding rounds with k=55 and k=21:

mhm2.py -o outdir -r reads.fq -c scaff-contigs-99.fasta --max-kmer-len 99 -s 55,21

--prev-kmer-len INT

The k-mer length in the previous contigging round. Only needed if restarting in contigging, e.g.:

mhm2.py -o outdir -r reads.fq -c contigs-77.fasta --prev-kmer-len 55

Tuning assembly quality

There are several additional options for adjusting the quality of the final assembly, apart from the k-mer values specified for the contigging and scaffolding rounds, as described earlier.

--break-scaff-Ns INT

The number of Ns allowed in a gap before the scaffold is broken into two. The default is 10.

--min-depth-thres INT

The minimum depth (abundance) for a k-mer to be considered for the de Bruijn graph traversal. This defaults to 2. Increasing it can reduce errors at the cost of reduced contiguity and genome discovery.

--optimize STRING

Adjust the trade-off in assembly quality between errors and contiguity. There are three settings that can be used: contiguity (improve contiguity at the cost of increased errors), correctness (reduce errors at the cost of contiguity), and default (the default setting, which tries to balance the two).

Adjusting performance and memory usage

There are several options that adjust the trade-off between the memory used and the time taken, or that influence the performance in different ways on different platforms. These usually do not have to be adjusted.

--max-kmer-store INT

The maximum size per process in MB for the aggregation of k-mers during k-mer analysis. This defaults to 1% of available memory. Higher values use more memory, but can potentially result in faster computation.

--max-rpcs-in-flight INT

The maximum number of remote procedure calls (RPCs) outstanding at any given time. The default is 100. Reducing this will reduce memory usage but could increase running time. If set to 0, there are no limits on the outstanding RPCs.

--max-worker-threads INT

The maximum number of background worker threads. These threads are used for a limited number of background computation tasks. The default value is three and should not need to be adjusted.

--pin STRING

Restrict the hardware contexts that processes can run on. There are five options: cpu (restrict each process to a single logical CPU); core (restrict each process to a core); numa (restrict each process to a NUMA domain); rr_numa (restrict each process to a NUMA domain in a round-robin manner); and none (don't restrict the processes). The default is cpu.

--sequencing-depth INT

The expected average sequencing depth. This value is used to estimate the memory requirements for unique k-mers in the first round of contigging. It may only need to be adjusted when memory is scarce and there is a need to better balance initial memory allocations.

--shared-heap INT

Set the shared heap size used by the UPC++ runtime, as a percentage of the available memory. This defaults to 10% and should not need to be adjusted. If MHM2 fails with upcxx::bad_shared_alloc messages, then this value should be increased.

Miscellaneous

--shuffle-reads BOOL

Shuffle reads to improve locality. This defaults to true, and results in greatly enhanced performance. The only reason to disable it is when testing or evaluating the impact of read shuffling.

--use-qf BOOL

Use the TCF filter to reduce memory when running on GPUs. This option will reduce peak GPU memory requirements by about 50%, with a performance impact of less than one percent. The only reason to disable it is for debugging or evaluating the impact of the TCF.

--dump-merged BOOL

Write merged FASTQ input files to the output directory. The file will be named <READ_FILE_NAME>-merged.fastq. The default is false.

--dump-kmers BOOL

Write k-mers to files in the output directory. The k-mers are written in each contigging stage immediately after k-mer counting. The files will be written on a per-process basis into the per_rank subdirectory, named kmers-<k>.txt.gz, where k is the value for the contigging round. The default is false.

--subsample-pct INT

Percentage of input read files to use in the assembly. The value is from 0 to 100. This option enables the user to test a dataset using a smaller part of it, e.g. 10%. All of the input files will be reduced to the percentage specified.

--procs INT

Set the number of processes used when running MHM2. By default, MHM2 automatically detects the number of available cores and runs with one process per core. This setting allows a user to run MHM2 on a subset of available processors, which may be desirable if running on a server that is running other applications.

--nodes INT

Set the number of nodes on which to execute MHM2. By default, MHM2 will use the total available number of nodes from the job launch. On rare occasions it may be desirable to explicitly set this through the MHM2 launch, and not during the external job launch.

--gasnet-stats

Collect GASNet communication statistics. This option should be run with either a Debug or a RelWithDebInfo build. It will produce a summary of communication statistics for each stage, and write the summary to the log file and to stdout. Enabling this will further increase the runtime.

--gasnet-trace

Enable GASNet tracing. This must be run with a Debug or RelWithDebInfo build. This will produce one file per process P, called trace_<P>.txt. For more about what data are collected, consult the GASNet documentation, in the section about GASNet tracing and statistical collection. The trace mask is set to a comprehensive set of defaults, i.e. setting the GASNet environment variable GASNET_TRACEMASK="GPWBNIH". This can be overridden by explicitly setting that environment variable.

--preproc STRING

Preprocessing commands. This is a comma-separated list containing preprocesses and/or options (e.g. valgrind --leak-check=full), or additional options to be passed to upcxx-run.

--binary STRING

The name of the binary file for executing MHM2. This defaults to mhm2. This option can be used to run different binaries from the same script, such as builds with and without GPU support.
