Whole-genome de novo sequencing, also called ab initio sequencing, is the assembly of sequencing reads by bioinformatics analysis, without any reference sequence, to obtain the whole-genome sequence map of a species. The result provides sequence information for subsequent gene mining and functional validation, and lays the foundation for molecular breeding and synthetic biology research.
Nanopore or HiFi sequencing data are used for genome assembly, and the assembly is then corrected with Illumina NovaSeq or DNBSEQ sequencing data to obtain high-quality contigs (N50 > 1 Mb; some exceed 10 Mb).
The assembled contigs are anchored to chromosomes using Hi-C data to obtain a chromosome-level whole-genome sequence map, which is then annotated and used in downstream analyses to address the scientific questions of the study.
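The contig N50 quoted above can be checked directly from a list of contig lengths. The following is a minimal sketch, not the pipeline's own code; the function name `n50` and the example lengths are illustrative assumptions.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    N50 is the length of the contig at which the running total,
    taken from the longest contig downwards, first reaches half
    of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths (bp), for illustration only.
contig_lengths = [4_000_000, 3_000_000, 1_500_000, 1_000_000, 500_000]
print(n50(contig_lengths))  # 3000000
```

In this toy assembly the N50 is 3 Mb, so it would meet the "N50 > 1 Mb" criterion.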
FAQs
(1) What is genome de novo sequencing?
Genome de novo sequencing targets species whose genome sequence is unknown, as well as genomes that need to be updated. A genomic DNA library is constructed and sequenced, and the resulting data are then assembled and annotated by bioinformatics methods to obtain a complete genome sequence map of the species.
(2) What are the advantages of Nanopore sequencing?
a. Nanopore sequencing does not involve a polymerase synthesis reaction, so enzyme inactivation is not an issue.
b. Ultra-long reads. In DNA sequencing, the average read length can reach tens to hundreds of kb, and the longest reads can exceed 2 Mb. This helps solve the assembly of highly repetitive and highly heterozygous genomes, which is difficult with second-generation sequencing.
c. Nanopore sequencing can directly detect methylation modifications, which is important for epigenetic research.
d. High throughput.
(3) How should genomic samples be selected?
For plants, uncontaminated seedlings or young leaves are preferred; for animals, whole blood or visceral tissue is preferred.
(4) What is the role of Hi-C in aiding genome assembly?
The most important role of Hi-C is to anchor fragmented genomic sequences to chromosomes (which is similar to genetic mapping); it can also perform error correction on assembled genomes.
(5) What is the difference between Hi-C technology and genetic mapping?
Hi-C can complete chromosome construction from a single individual, with a chromosome anchoring rate of 90% or more, but it cannot be used for QTL mapping.
(6) Why must we do the genome survey?
A genome survey is an effective way to evaluate a genome. For species without a reference genome, it is essential to evaluate the genome characteristics before starting a de novo project, because genome size and complexity directly affect the project price, timeline and subsequent assembly strategy.
(7) Does having a genome survey mean that we don't need to do flow cytometry?
No, it does not. We generally recommend that customers run flow cytometry to get a preliminary estimate of the genome size before doing the survey. The reason is as follows: in K-mer analysis, we take the peak containing the most K-mers as the main peak, a peak at half the depth of the main peak as the heterozygous peak, and a peak at twice the depth of the main peak as the repeat peak. This assignment needs to be validated against the flow cytometry result: a genome size is calculated from each candidate peak, and the peak whose estimate best matches the flow cytometry result is the main peak.
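The peak check described above can be sketched as follows. This is a simplified illustration rather than the actual survey pipeline: it uses the standard estimate of genome size as total K-mer count divided by peak depth. The function names, the toy histogram and the flow cytometry value are all assumptions, and real analyses usually also exclude low-depth error K-mers first.

```python
def estimate_genome_size(histogram, peak_depth):
    """Estimate genome size as (total number of K-mers) / (peak depth).

    histogram maps a K-mer depth to the number of distinct K-mers
    observed at that depth.
    """
    total_kmers = sum(depth * count for depth, count in histogram.items())
    return total_kmers / peak_depth


def pick_main_peak(histogram, candidate_depths, flow_size_bp):
    """Return the candidate depth whose size estimate is closest to the
    flow cytometry estimate, together with all estimates."""
    estimates = {d: estimate_genome_size(histogram, d) for d in candidate_depths}
    best = min(estimates, key=lambda d: abs(estimates[d] - flow_size_bp))
    return best, estimates


# Toy histogram (hypothetical): peaks at depths 20, 40 and 80.
histogram = {20: 1_000_000, 40: 9_000_000, 80: 500_000}
flow_cytometry_bp = 11_000_000  # hypothetical flow cytometry estimate

main_peak, sizes = pick_main_peak(histogram, [20, 40, 80], flow_cytometry_bp)
print(main_peak)         # 40
print(sizes[main_peak])  # 10500000.0
```

Here the depth-40 peak gives about 10.5 Mb, closest to the 11 Mb flow cytometry value, so it is taken as the main peak; the depth-20 and depth-80 peaks would then be read as the heterozygous and repeat peaks.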
(8) Why are different K-mer lengths selected for K-mer analysis?
We generally use 17-mers to estimate genome size. A nucleotide fragment of length 17 built from the four bases A, T, C and G has 4^17 ≈ 17 G possible sequences, which is enough to cover a typical genome. With K = 15 there are only about 1 G possible sequences (4^15), which may not be enough to cover a typical genome and leads to inaccurate estimates. For larger genomes (> 15 Gb), we try 19-mers instead.
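The arithmetic behind these K choices is simply 4^K. A short sketch (the function name `possible_kmers` is illustrative):

```python
def possible_kmers(k):
    """Number of distinct sequences of length k over the alphabet A, C, G, T."""
    return 4 ** k


for k in (15, 17, 19):
    print(k, f"{possible_kmers(k):,}")
# 15 1,073,741,824
# 17 17,179,869,184
# 19 274,877,906,944
```

So 15-mers give about 1 G possible sequences, 17-mers about 17 G, and 19-mers about 275 G.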
Because reads contain erroneous bases, a larger K is not always better: the larger the K, the more K-mers contain a given error site.
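To make the effect of a single error concrete, the sketch below counts how many K-mers from one read contain a given base position. The function name and the 150 bp read length are assumptions for illustration.

```python
def kmers_covering_position(read_length, k, position):
    """Count the K-mers of a read that include the base at 0-based `position`."""
    if k < 1 or k > read_length:
        raise ValueError("k must be between 1 and the read length")
    if not 0 <= position < read_length:
        raise ValueError("position must lie inside the read")
    first_start = max(0, position - k + 1)     # leftmost K-mer start covering the base
    last_start = min(position, read_length - k)  # rightmost K-mer start covering the base
    return last_start - first_start + 1


# One erroneous base in the middle of a hypothetical 150 bp read.
print(kmers_covering_position(150, 17, 75))  # 17
print(kmers_covering_position(150, 31, 75))  # 31
```

Away from the read ends, one error therefore corrupts K of the read's K-mers, so the number of erroneous K-mers grows linearly with K.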
In addition, to avoid palindromic sequences (K-mers that are identical to their own reverse complement), K is always chosen to be odd. For highly repetitive genomes, we usually choose longer K-mers, because a longer K-mer can span some highly repetitive regions and gives better assembly results.
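The reason odd K avoids palindromes can be verified by brute force for small K: in an odd-length K-mer the middle base would have to be its own complement, which no base is. The sketch below enumerates all K-mers for a few small K values; the function names are illustrative.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence of A, C, G and T."""
    return seq.translate(COMPLEMENT)[::-1]


def count_palindromic_kmers(k):
    """Count K-mers of length k that equal their own reverse complement."""
    count = 0
    for bases in product("ACGT", repeat=k):
        kmer = "".join(bases)
        if kmer == reverse_complement(kmer):
            count += 1
    return count


for k in (2, 3, 4, 5):
    print(k, count_palindromic_kmers(k))
# 2 4
# 3 0
# 4 16
# 5 0
```

Even K values produce palindromic K-mers (for example ACGT), while odd K values produce none.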