CLustClosed

The finalization step of genome assembly, which includes ordering and orienting contigs and closing gaps, is one of the most important steps. Gaps can be produced by the low coverage observed in certain regions of the genome, which generates errors during assembly and makes annotation difficult. This project uses machine learning (ML) methods, which could improve accuracy in closing gaps. Owing to their versatility, ML techniques can help solve several biological problems, such as improving the assembly process by detecting assembly errors and by finishing and closing gaps in the genome.

Project status: Under Development

Artificial Intelligence

Overview / Usage

With the advent of next-generation sequencing (NGS) technologies, released around 2005, the cost and time of sequencing have fallen considerably, leading to an increase in projects such as Whole Genome Sequencing (WGS) and Whole Transcriptome Sequencing (WTS). NGS platforms produce a large amount of data compared to previous technologies. However, several features can still make assembly difficult: moderate sequencing error rates on some platforms, short reads that hamper the assembly process, low-complexity regions, low coverage, and difficulty in resolving repetitive regions. These can compromise the representation of certain regions of the genome and generate gaps, which makes genomes difficult to finalize and is reflected in the large number of draft genomes deposited in public databases.

Methodology / Approach

The CLustGClosed pipeline is based on two steps. The first defines clusters of reads based on their GC content. Then, for each cluster file, the best k-mer is calculated and each group of reads is assembled. Finally, the resulting contigs are used to close the gaps in the traditional genome assembly produced with all reads.
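
The clustering step can be illustrated with a minimal sketch. It is not the pipeline's actual code: the description above does not say which clustering method is used, so this example assumes a small one-dimensional k-means over per-read GC fractions. The function names `gc_content` and `cluster_by_gc` and the parameters `n_clusters` and `n_iter` are hypothetical. K-mer selection, per-cluster assembly, and gap closing are not shown.

```python
from collections import defaultdict


def gc_content(read: str) -> float:
    """Return the fraction of G/C among the A/C/G/T bases of a read."""
    read = read.upper()
    gc = read.count("G") + read.count("C")
    at = read.count("A") + read.count("T")
    total = gc + at  # ambiguous bases such as N are ignored
    return gc / total if total else 0.0


def _assign(values, centroids):
    """Label each value with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
        for v in values
    ]


def cluster_by_gc(reads, n_clusters=3, n_iter=20):
    """Group reads by GC fraction with a tiny 1-D k-means (assumed method)."""
    if not reads:
        return {}, []
    values = [gc_content(r) for r in reads]
    lo, hi = min(values), max(values)
    # Deterministic start: centroids spread evenly between min and max GC.
    if n_clusters == 1:
        centroids = [(lo + hi) / 2]
    else:
        centroids = [lo + (hi - lo) * i / (n_clusters - 1) for i in range(n_clusters)]

    for _ in range(n_iter):
        labels = _assign(values, centroids)
        updated = []
        for c in range(n_clusters):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # Keep the old centroid if no read was assigned to it.
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated

    # Final assignment against the final centroids.
    clusters = defaultdict(list)
    for read, label in zip(reads, _assign(values, centroids)):
        clusters[label].append(read)
    return dict(clusters), centroids


# Toy reads, invented for illustration only.
example_reads = ["ATATATAT", "ATATATAA", "GCGCGCGC", "GGGCCCGC", "ATGCATGC", "AGCTAGCT"]
example_clusters, example_centroids = cluster_by_gc(example_reads, n_clusters=3)
for label in sorted(example_clusters):
    print(label, round(example_centroids[label], 2), example_clusters[label])
```

In a real run, each cluster would be written to its own read file before the k-mer and assembly steps. The number of clusters here is an arbitrary choice for the example.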

Technologies Used

Python
thread library
multiprocessing library
