hgfdb

Assembly statistics

　　We published HI-C assembly version genome. We use mecat to do preliminary assembly of raw data and correction of assembly results using Pilon. Chromosome construction was performed using Juicer+3DDNA, and the Scaffold sequence was assembled by Juicer. The alignment ratio was 77.50%, and the ratio of effective data was 61.07%. Next, we construct the cluster through 3DDNA. The 3D DNA is based on unsupervised clustering. According to the heat map results, the final chromosome sequence length table is obtained as follows. 　　

Assessment of Hi-c assembly

Sequenced Read Pairs	353,303,359
Normal Paired	119,503,007 (33.82%)
Chimeric Paired	154,299,354 (43.67%)
Chimeric Ambiguous	72,571,301 (20.54%)
Unmapped	6,929,697 (1.96%)
Ligation Motif Present	304,929,764 (86.31%)
Alignable (Normal+Chimeric Paired)	273,802,361 (77.50%)
Unique Reads	241,663,703 (68.40%)
PCR Duplicates	31,216,962 (8.84%)
Optical Duplicates	921,696 (0.26%)
Library Complexity Estimate	1,099,873,512
Intra-fragment Reads	3,388,387 (0.96% / 1.40%)
Below MAPQ Threshold	22,518,708 (6.37% / 9.32%)
Hi-C Contacts	215,756,608 (61.07% / 89.28%)
Ligation Motif Present	126,288,033 (35.74% / 52.26%)
3' Bias (Long Range)	76% - 24%
Pair Type %(L-I-O-R)	25% - 25% - 25% - 25%
Inter-chromosomal	118,197,762 (33.46% / 48.91%)
Intra-chromosomal	97,558,846 (27.61% / 40.37%)
Short Range (<20Kb)	52,730,723 (14.93% / 21.82%)
Long Range (>20Kb)	44,827,949 (12.69% / 18.55%)

Summary of genome

Stat Type	Revised genome		Final assembly
Stat Type	contig Length	contig Number	scaffold Length	scaffold Number
N90	1,623,455	98	13,092,000	16
N80	4,324,123	58	20,137,632	10
N70	8,097,362	40	61,566,867	7
N60	12,315,090	30	79,303,508	5
N50	14,174,297	22	98,083,000	4
N40	17,358,682	15	113,403,698	3
N30	17,358,682	15	113,403,698	3
N20	30,229,578	6	153,175,500	2
N10	35,139,116	3	199,086,795	1
Max length	42,031,810		199,086,795
Total length	1,044,489,195		1,049,889,730
Total number	1,003		1,127
Average length	1,041,365		931,579

Statistic of BUSCO(aves)

	Gene numbers	Percentage
Complete BUSCOs	4643	94.5%
Complete BUSCOs	4643	94.5%
Complete and single-copy BUSCOs	4586	93.3%
Complete and duplicated BUSCOs	57	1.2%
Fragmented BUSCOs	151	3.1%
Missing BUSCOs	121	2.4%
Total BUSCO groups searched	4915	-

Repeat annotation

Tandem repeats were identified across the genome with the help of the program Tandem Repeats Finder (TRF). Transposable elements (TEs) in the genome were identified by a combination of homology-based and de novo approaches. For homolog based prediction, known repeats were identified using RepeatMasker and RepeatProteinMask against Repbase (Repbase Release 16.10; http://www.girinst.org/repbase/index.html). For de novo prediction, RepeatModeler(http://repeatmasker.org/), LTR FINDER (http://tlife.fudan.edu.cn/ltr_finder/) were used to identify de novo evolved repeats inferred from the assembled genome.

Statistics of Repeats in Numida meleagris Genome

Type	Repeat Size (bp）	% of genome
Trf	10,581,766	1.007893
Repeatmasker	94,624,222	9.012777
Proteinmask	68,314,708	6.506846
De novo	131,715,232	12.545625
Total	143,202,256	13.639743

TEs Content in the Assembled Numida meleagris Genome

	Repbase TEs		TE protiens		De novo		Combined TEs
Type	Length (Bp)	% in genome	Length (Bp)	% in genome	Length (Bp)	% in genome	Length (Bp)	% in genome
DNA	9,432,380	0.898416	4,697,710	0.447448	8,786,941	0.836939	11,517,584	1.097028
LINE	66,938,260	6.375742	55,912,223	5.325533	86,205,096	8.210871	100,871,061	9.607777
SINE	439,058	0.041819	0	0	148,336	0.014129	502,770	0.047888
LTR	17,952,541	1.709945	7,771,775	0.740247	40,304,124	3.838891	44,706,385	4.258198
Other	1,197	0.000114	0	0	0	0	1,197	0.000114
Unknown	0	0	0	0	8,436,472	0.803558	8,436,472	0.803558
Total	94,624,222	9.012777	68,314,708	6.506846	129,576,315	12.341898	135,541,039	12.910026

Note: Repbase TEs: the result of RepeatMasker based on Repbase; TE proteins: the result of RepeatProteinMask based on Repbase; De novo: de novo finding repeats (Reaptmodeler); Combined: combine the results of Repbase TEs, TE proteins and de novo.

Distribution of Divergence Rate of each Type of Numida meleagris's TE. The divergence rate was calculated between the identified TE elements in the genome by homology-based method and the consensus sequence in the Repbase.

Distribution of Divergence Rate of each Type of Numida meleagris's TE. The divergence rate was calculated between the identified TE elements in the genome by De novo method and the consensus sequence in the predicted TE library.

Protein-coding genes prediction

To predict protein-coding genes, we used both homology-based and de novo prediction methods.
Homology based gene prediction
Proteins of all species were used in the homology-based annotation. First, TBLASTN was processed with parameters of “E-value = 1e-5, -F F”. BLAST hits that correspond to reference proteins were concatenated by Solar software, and low-quality records were filtered. Genomic sequence of each reference protein was extended upstream and downstream by 2000 bp to represent a protein-coding region. GeneWise software was used to predict gene structure contained in each protein region. De novo gene prediction
In addition we used three de novo prediction programs: Augustus and GlimmerHMM, GENESCAN. Augustus with gene model parameters trained from Numida meleagris，which seted up by PASA. GlimmerHMM and GENESCAN with gene model parameters trained from human. We filtered partial genes and small genes with less than 150 bp coding lengths.
High-confidence gene model generation
Finally, comprehensive above predicted results, and coupled with the transcriptome comparison data by PASA (http://pasa.sourceforge.net/). All kinds of gene sets merged by EVidenceModeler (EVM entry, http://evidencemodeler.sourceforge.net/) integration software, get a non redundant, a more complete set of genes.

Numida meleagris's gene length distribution

Numida meleagris's exon number

Gene annotation statistics for Numida meleagris genome

Methods	Gene Number	Avg. mRNA Length	Total Exon Number	Avg. Exon Length	Avg. CDS Length	Avg. Exon Number	Total Intron Length
ab initio
augustus	15,034	20854.37535	145,599	168.7409186	1634.196421	9.684648131	288,956,170
glimmerHMM	62,813	10908.09057	296,012	186.8550261	880.5713785	4.712591343	629,858,563
genscan	32,841	23552.69565	275,436	168.7314512	1415.143144	8.38695533	727,019,362
Homology
A.platyrhynchos	14,629	22623.02748	148,438	166.8307441	1692.803473	10.14683164	306,188,247
G.gallus	15,995	20337.90816	150,308	165.3216795	1553.558675	9.397186621	300,455,670
M.gallopavo	17,127	16937.7501	152,014	165.2556278	1466.758276	8.87569335	264,971,677
T.guttata	14,166	21562.46019	138,986	165.3593455	1622.379924	9.811238176	282,471,177
transcript	8,215	39758.06549	98,328	172.3615654	2063.051491	11.96932441	261,206,210
EVM	15,173	27407.56403	165,668	165.6100695	1808.231002	10.91860542	388,418,680

Non-protein-coding genes annotation

The tRNAscan-SE (version 1.23) software with default parameters for eukaryote was used for tRNA annotation. rRNA annotation was based on homology information of plant rRNAs using BLASTN with parameters of “E-value = 1e-5”. The miRNA and snRNA genes were predicted by INFERNAL software (http://infernal.janelia.org, version 0.81) against the Rfam database (Release 11.0).

Statistics for Numida meleagris ncRNA annotation.

Type	Copy Number	Avg. Length (bp)	Total Length (bp)	Pct.in Genome (%)
miRNA	175	84.99428571	14874	0.001417
tRNA	294	75.17687075	22102	0.002105
rRNA	170	265.2941176	45100	0.004296
18S	30	459.5666667	13787	0.001313
28S	110	251.7818182	27696	0.002638
5.8S	7	132.8571429	930	0.000089
5S	23	116.826087	2687	0.000256
snRNA	251	128.9920319	32377	0.003084
CD-box	117	112.9316239	13213	0.001259
HACA-box	58	143.2758621	8310	0.000792
Splicing	63	140.1904762	8832	0.000841

Gene Function annotation

Gene functions were assigned according to the best match alignment using BLASTP against Swiss-Prot or TrEMBL databases. Gene motifs and domains were determined using InterProScan against protein databases including all databases . Gene Ontology IDs were obtained from the corresponding Swiss-Prot and TrEMBL entries. All genes were aligned against the KEGG proteins. The pathway to which the gene might belong was derived from the matching genes in KEGG.

Statistics for Numida meleagris gene function annotation.

	Number	Percent(%)
Total	15,173
InterPro	13,492	88.92111
GO	10,934	72.062216
KEGG	13,430	88.512489
Swissprot	13,883	91.498056
TrEMBL	14,210	93.6532
Annotated	14,373	94.727476
Unanotated	800	5.272524

Jump to section	Assembly statistics	Repeat annotation	Protein-coding genes prediction
Jump to section	Non-protein-coding genes annotation	Gene Function annotation