Samtools get consensus sequences

8/30/2023


In ultra-deep sequencing, two read pairs with the same mapping positions may be derived from different original DNA fragments. This becomes more likely when the DNA fragments are shorter. For example, cell-free DNA usually has a peak length of about 167 bp, which is much shorter than the peak length of normally fragmented genomic DNA. To better identify sequencing reads derived from different DNA fragments, a technology called unique molecular identifier (UMI) has been developed.

Reads that cluster together can then be merged into a single read. Because errors usually occur randomly, the inconsistent mismatches within a clustered read group can be removed to generate a consensus read.
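The idea of removing inconsistent mismatches can be illustrated with a per-position majority vote. This is only a minimal sketch, not gencore's actual algorithm: it assumes the reads in a cluster are already aligned and of equal length, it ignores base qualities, and the function name `consensus_read` and the `min_fraction` threshold are hypothetical.

```python
from collections import Counter


def consensus_read(reads, min_fraction=0.5):
    """Build a consensus by majority vote at each position.

    A position becomes 'N' when no base is supported by more than
    `min_fraction` of the reads in the cluster.
    """
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("reads must all have the same length")

    consensus = []
    for column in zip(*reads):  # one tuple of bases per position
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(column) > min_fraction else "N")
    return "".join(consensus)


# Four reads from one cluster; the third carries a random error (T -> A).
cluster = ["ACGTACGT", "ACGTACGT", "ACGAACGT", "ACGTACGT"]
print(consensus_read(cluster))
```

Because the mismatch appears in only one of the four reads, the vote at that position keeps the base shared by the other three.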

High-depth next-generation sequencing (NGS) has been widely used for precision cancer diagnosis and treatment. From such deep sequencing data, somatic mutations can be detected to guide personalized targeted therapy or immunotherapy. Recently, circulating tumor DNA (ctDNA) sequencing has been recognized as a promising biomarker for cancer treatment and monitoring. Since tumor-derived DNA is usually a small part of the total cell-free DNA in blood, the mutant allele frequency (MAF) of a variant detected from ctDNA sequencing data can be very low (as low as 0.1%). To detect such low-frequency variants, the sequencing depth is usually increased, sometimes beyond 10,000x.

However, the processes of making an NGS library and sequencing it are not error-free. As a result of library amplification, NGS data can contain many duplicates. In particular, PCR-based library amplification can cause particular sequences to become overrepresented and consequently produce false positive mutations in the results of NGS data analysis. The duplication level can be much higher for high-depth data generated by sequencing low-input DNA. Traditionally, duplicated reads are simply marked and removed before downstream analysis. For low-depth paired-end NGS data, read pairs with the same start and end mapping positions can be treated as duplicated reads derived from the same original DNA fragment.

Removing duplicates might be considered a well-resolved problem in NGS data processing. However, as NGS technology gains more recognition in clinical applications, researchers are paying more attention to its sequencing errors and prefer to remove these errors while performing deduplication. Recently, a new technology called unique molecular identifier (UMI) has been developed to better identify sequencing reads derived from different DNA fragments. Most existing duplicate removing tools cannot handle UMI-integrated data. Some modern tools can work with UMIs but are usually slow and use too much memory. Furthermore, existing tools rarely report rich statistical results, which are very important for quality control and downstream analysis. These unmet requirements drove us to develop an ultra-fast, simple, lightweight but powerful tool for duplicate removal and sequencing error suppression, with features for handling UMIs and reporting informative results.

This paper presents gencore, an efficient tool for duplicate removal and sequencing error suppression of NGS data. The tool clusters the mapped sequencing reads and merges the reads in each cluster to generate a single consensus read. While the consensus read is generated, the random errors introduced by library construction and sequencing can be removed. This error-suppressing feature makes gencore well suited to detecting ultra-low-frequency mutations from deep sequencing data. When UMI technology is applied, gencore can use the UMIs to identify reads derived from the same original DNA fragment. Gencore reports statistical results in both HTML and JSON formats. The HTML report contains many interactive figures plotting statistical coverage and duplication information. The JSON report contains all the statistical results and can be read by downstream programs.

In conclusion, compared with conventional tools like Picard and SAMtools, gencore greatly reduces the mapping mismatches in the output data, which are mostly caused by errors. Compared with some newer tools like UMI-Reducer and UMI-tools, gencore runs much faster, uses less memory, generates better consensus reads and provides simpler interfaces. To the best of our knowledge, gencore is the only duplicate removing tool that generates both informative HTML and JSON reports.
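The effect of UMIs on clustering can be sketched as follows. This is an illustrative example under simplifying assumptions, not gencore's implementation: the `ReadPair` record and the `cluster_read_pairs` function are hypothetical, UMIs are compared by exact match only, and real tools may additionally tolerate UMI sequencing errors.

```python
from collections import defaultdict
from typing import NamedTuple


class ReadPair(NamedTuple):
    """Hypothetical, simplified record for one mapped read pair."""
    name: str
    chrom: str
    start: int  # leftmost mapping position of the fragment
    end: int    # rightmost mapping position of the fragment
    umi: str    # empty string when no UMI is available


def cluster_read_pairs(pairs, use_umi=True):
    """Group read pairs that share mapping positions and, optionally, a UMI."""
    clusters = defaultdict(list)
    for pair in pairs:
        key = (pair.chrom, pair.start, pair.end, pair.umi if use_umi else "")
        clusters[key].append(pair.name)
    return dict(clusters)


pairs = [
    ReadPair("r1", "chr1", 100, 267, "AACG"),
    ReadPair("r2", "chr1", 100, 267, "AACG"),  # duplicate of r1
    ReadPair("r3", "chr1", 100, 267, "TTGC"),  # same positions, different UMI
    ReadPair("r4", "chr1", 150, 317, "AACG"),
]

by_position = cluster_read_pairs(pairs, use_umi=False)
by_umi = cluster_read_pairs(pairs, use_umi=True)
print(len(by_position), len(by_umi))
```

With positions alone, r3 is grouped with r1 and r2 as a duplicate. Adding the UMI to the key keeps r3 as a separate original fragment, which is the distinction that matters for short fragments in ultra-deep data.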
