Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Manually specify where het and hom k-mer peaks are - hifiasm's guess incorrect. #55

Open
conchoecia opened this issue Dec 2, 2020 · 12 comments

Comments

@conchoecia
Copy link

Hi there, trying out my first assemblies with hifiasm with an animal whose genome is around 170 Mb, and is over 4% heterozygous. We know this from >100x coverage k-mer spectra from Illumina WGS reads.

The k-mer spectrum from the PacBio HiFi data tells the same story. I made a k-51 spectrum since I saw that is what hifiasm uses by default. The data are around 44x coverage, and you can see the peak from the het k-mers around 20 and the homozygous k-mers around 40.

CCS_k51_linear_plot

The k-mer spectrum from hifiasm matches the above spectrum:

[M::ha_ft_gen::499.056*[email protected]] ==> filtered out 3811358 k-mers occurring 95 or more times
[M::ha_opt_update_cov] updated max_n_chain to 100
[M::ha_pt_gen::698.676*5.57] ==> counted 18097002 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[6] = 152684
[M::ha_analyze_count] highest: count[19] = 451630
[M::ha_hist_line]     1: ****************************************************************************************************> 7830671
[M::ha_hist_line]     2: ****************************************************************************************************> 770193
[M::ha_hist_line]     3: ******************************************************************** 309013
[M::ha_hist_line]     4: ****************************************** 191939
[M::ha_hist_line]     5: ********************************** 154466
[M::ha_hist_line]     6: ********************************** 152684
[M::ha_hist_line]     7: *********************************** 156256
[M::ha_hist_line]     8: ************************************** 170569
[M::ha_hist_line]     9: ****************************************** 191057
[M::ha_hist_line]    10: ************************************************ 215175
[M::ha_hist_line]    11: ******************************************************* 247482
[M::ha_hist_line]    12: *************************************************************** 283626
[M::ha_hist_line]    13: *********************************************************************** 319430
[M::ha_hist_line]    14: ****************************************************************************** 354206
[M::ha_hist_line]    15: ************************************************************************************** 388878
[M::ha_hist_line]    16: ******************************************************************************************** 417222
[M::ha_hist_line]    17: ************************************************************************************************** 441975
[M::ha_hist_line]    18: **************************************************************************************************** 450105
[M::ha_hist_line]    19: **************************************************************************************************** 451630
[M::ha_hist_line]    20: *************************************************************************************************** 445987
[M::ha_hist_line]    21: ********************************************************************************************** 426016
[M::ha_hist_line]    22: **************************************************************************************** 397584
[M::ha_hist_line]    23: ******************************************************************************** 360297
[M::ha_hist_line]    24: ********************************************************************** 315880
[M::ha_hist_line]    25: ************************************************************* 276191
[M::ha_hist_line]    26: **************************************************** 234698
[M::ha_hist_line]    27: ******************************************** 196798
[M::ha_hist_line]    28: ************************************ 163054
[M::ha_hist_line]    29: ***************************** 132947
[M::ha_hist_line]    30: ************************* 111858
[M::ha_hist_line]    31: ********************* 93468
[M::ha_hist_line]    32: ****************** 81804
[M::ha_hist_line]    33: **************** 73618
[M::ha_hist_line]    34: *************** 68704
[M::ha_hist_line]    35: ************** 64746
[M::ha_hist_line]    36: ************** 63374
[M::ha_hist_line]    37: ************** 62412
[M::ha_hist_line]    38: ************** 62932
[M::ha_hist_line]    39: ************** 62554
[M::ha_hist_line]    40: ************** 61114
[M::ha_hist_line]    41: ************* 58898
[M::ha_hist_line]    42: ************* 57009
[M::ha_hist_line]    43: ************ 54254
[M::ha_hist_line]    44: *********** 51216
[M::ha_hist_line]    45: *********** 48483
[M::ha_hist_line]    46: ********** 45364
[M::ha_hist_line]    47: ********* 41100
[M::ha_hist_line]    48: ******** 37727
[M::ha_hist_line]    49: ******** 34807
[M::ha_hist_line]    50: ******* 31219
[M::ha_hist_line]    51: ****** 27505
[M::ha_hist_line]    52: ***** 24763
[M::ha_hist_line]    53: ***** 22128
[M::ha_hist_line]    54: **** 20128
[M::ha_hist_line]    55: **** 17958
[M::ha_hist_line]    56: **** 15938
[M::ha_hist_line]    57: *** 14744
[M::ha_hist_line]    58: *** 13672
[M::ha_hist_line]    59: *** 12817
[M::ha_hist_line]    60: *** 12142
[M::ha_hist_line]    61: *** 11357
[M::ha_hist_line]    62: ** 10574
[M::ha_hist_line]    63: ** 10200
[M::ha_hist_line]    64: ** 9739
[M::ha_hist_line]    65: ** 9206
[M::ha_hist_line]    66: ** 8849

However, hifiasm gets the position of the homozygous peak incorrect, and calls it at 19, while it should be around 40. The results of running this assembly with --high-het was that the primary assembly was twice as big as it should have been (400Mb instead of 200Mb). So, both haplotypes are ending up in the same primary assembly.

Main genome scaffold total: 753
Main genome contig total:   753
Main genome scaffold sequence total: 422.8 MB
Main genome contig sequence total:   422.8 MB (->  0.0% gap)
Main genome scaffold N/L50: 95/1.4 MB
Main genome contig N/L50:   95/1.4 MB
Number of scaffolds > 50 KB: 558
% main genome in scaffolds > 50 KB: 98.5%

 Minimum    Number    Number     Total        Total     Scaffold
Scaffold      of        of      Scaffold      Contig     Contig
 Length   Scaffolds  Contigs     Length       Length    Coverage
--------  ---------  -------  -----------  -----------  --------
    All       753        753  422,765,292  422,765,292   100.00%
   1 kb       753        753  422,765,292  422,765,292   100.00%
 2.5 kb       753        753  422,765,292  422,765,292   100.00%
   5 kb       753        753  422,765,292  422,765,292   100.00%
  10 kb       753        753  422,765,292  422,765,292   100.00%
  25 kb       727        727  422,203,962  422,203,962   100.00%
  50 kb       558        558  416,220,132  416,220,132   100.00%
 100 kb       458        458  409,207,212  409,207,212   100.00%
 250 kb       361        361  393,262,727  393,262,727   100.00%
 500 kb       274        274  361,301,464  361,301,464   100.00%
   1 mb       140        140  263,063,929  263,063,929   100.00%
 2.5 mb        25         25   83,986,696   83,986,696   100.00%
   5 mb         2          2   10,227,160   10,227,160   100.00%

The secondary assembly is too small:

Main genome scaffold total: 1011
Main genome contig total:   1011
Main genome scaffold sequence total: 25.6 MB
Main genome contig sequence total:   25.6 MB (->  0.0% gap)
Main genome scaffold N/L50: 270/26.2 KB
Main genome contig N/L50:   270/26.2 KB
Number of scaffolds > 50 KB: 49
% main genome in scaffolds > 50 KB: 21.7%

 Minimum    Number    Number     Total        Total     Scaffold
Scaffold      of        of      Scaffold      Contig     Contig
 Length   Scaffolds  Contigs     Length       Length    Coverage
--------  ---------  -------  -----------  -----------  --------
    All     1,011      1,011   25,632,451   25,632,451   100.00%
   1 kb     1,011      1,011   25,632,451   25,632,451   100.00%
 2.5 kb     1,011      1,011   25,632,451   25,632,451   100.00%
   5 kb     1,011      1,011   25,632,451   25,632,451   100.00%
  10 kb     1,008      1,008   25,607,535   25,607,535   100.00%
  25 kb       298        298   13,556,140   13,556,140   100.00%
  50 kb        49         49    5,556,899    5,556,899   100.00%
 100 kb        16         16    3,360,515    3,360,515   100.00%
 250 kb         4          4    1,596,952    1,596,952   100.00%
 500 kb         1          1      798,418      798,418   100.00%
   1 mb         0          0            0            0     0.00%

Is there anything I can do to specify where hifiasm should expect the het and homozygous peaks? I think this would be helpful to others working with highly heterozygous species. Thanks so much!

@chhylp123
Copy link
Owner

chhylp123 commented Dec 2, 2020

Since the het rate is quite high, hifiasm regards the het peak as homozygous peak incorrectly. It doesn't matter too much for assembly, but affects purge_dups a lot. One solution is to set --purge-cov to be homozygous peak (e.g. 45). A decreasing number of -s might be also helpful. However, both of these options probably cannot do enough purging for samples with het rate. If that happens, we recommend you to try Purge_dups (https://github.com/dfguan/purge_dups). In future, I guess we will have a better solution to avoid purge_dups, which has a lot of problems.

@conchoecia
Copy link
Author

Thanks for your message, @chhylp123. I gave it another run, this time specifying -k 19. This time hifiasm correctly guessed the het and hom peaks, but again it erroneously output both haplotypes into the primary assembly, and the alternate assembly is much too small.

[M::ha_analyze_count] lowest: count[9] = 1754283
[M::ha_analyze_count] highest: count[19] = 2980425
[M::ha_hist_line]     2: ****************************************************************************************************> 14170747
[M::ha_hist_line]     3: ****************************************************************************************************> 6095522
[M::ha_hist_line]     4: ****************************************************************************************************> 3517697
[M::ha_hist_line]     5: ********************************************************************************** 2441350
[M::ha_hist_line]     6: ********************************************************************* 2062511
[M::ha_hist_line]     7: *************************************************************** 1874500
[M::ha_hist_line]     8: *********************************************************** 1771091
[M::ha_hist_line]     9: *********************************************************** 1754283
[M::ha_hist_line]    10: ************************************************************ 1795541
[M::ha_hist_line]    11: *************************************************************** 1869441
[M::ha_hist_line]    12: ******************************************************************** 2028087
[M::ha_hist_line]    13: ************************************************************************* 2188209
[M::ha_hist_line]    14: ******************************************************************************* 2354761
[M::ha_hist_line]    15: ************************************************************************************* 2523250
[M::ha_hist_line]    16: ****************************************************************************************** 2692454
[M::ha_hist_line]    17: ************************************************************************************************ 2851163
[M::ha_hist_line]    18: ************************************************************************************************** 2933855
[M::ha_hist_line]    19: **************************************************************************************************** 2980425
[M::ha_hist_line]    20: **************************************************************************************************** 2973333
[M::ha_hist_line]    21: *************************************************************************************************** 2952906
[M::ha_hist_line]    22: ************************************************************************************************ 2865676
[M::ha_hist_line]    23: ******************************************************************************************* 2710854
[M::ha_hist_line]    24: ************************************************************************************* 2518491
[M::ha_hist_line]    25: ****************************************************************************** 2313737
[M::ha_hist_line]    26: ************************************************************************ 2138149
[M::ha_hist_line]    27: ***************************************************************** 1938409
[M::ha_hist_line]    28: ************************************************************ 1787849
[M::ha_hist_line]    29: ******************************************************* 1628415
[M::ha_hist_line]    30: *************************************************** 1529645
[M::ha_hist_line]    31: ************************************************* 1452905
[M::ha_hist_line]    32: *********************************************** 1406267
[M::ha_hist_line]    33: ********************************************** 1375248
[M::ha_hist_line]    34: *********************************************** 1386959
[M::ha_hist_line]    35: *********************************************** 1392977
[M::ha_hist_line]    36: ************************************************ 1418896
[M::ha_hist_line]    37: ************************************************ 1443780
[M::ha_hist_line]    38: ************************************************** 1491147
[M::ha_hist_line]    39: *************************************************** 1517179
[M::ha_hist_line]    40: *************************************************** 1524668
[M::ha_hist_line]    41: ************************************************** 1502562
[M::ha_hist_line]    42: ************************************************** 1480236
[M::ha_hist_line]    43: ************************************************ 1429719
[M::ha_hist_line]    44: ********************************************** 1381645
[M::ha_hist_line]    45: ********************************************* 1329876
[M::ha_hist_line]    46: ****************************************** 1261889
[M::ha_hist_line]    47: **************************************** 1186449
[M::ha_hist_line]    48: ************************************* 1110108
[M::ha_hist_line]    49: *********************************** 1037530
[M::ha_hist_line]    50: ******************************** 966132
[M::ha_hist_line]    51: ****************************** 890500
[M::ha_hist_line]    52: **************************** 826020
[M::ha_hist_line]    53: ************************** 762370
[M::ha_hist_line]    54: ************************ 711653
[M::ha_hist_line]    55: ********************** 657454
[M::ha_hist_line]    56: ********************* 614353
[M::ha_hist_line]    57: ******************* 574305
[M::ha_hist_line]    58: ****************** 548358
[M::ha_hist_line]    59: ****************** 522686
[M::ha_hist_line]    60: ***************** 502057
[M::ha_hist_line]    61: **************** 485945
[M::ha_hist_line]    62: **************** 469769
[M::ha_hist_line]    63: *************** 452361
[M::ha_hist_line]    64: *************** 435805
[M::ha_hist_line]    65: ************** 420092
[M::ha_hist_line]    66: ************** 409057
[M::ha_hist_line]    67: ************* 391407
[M::ha_hist_line]    68: ************* 379454
[M::ha_hist_line]    69: ************ 367346
[M::ha_hist_line]    70: ************ 354757
[M::ha_hist_line]    71: *********** 339978
[M::ha_hist_line]    72: *********** 327008
-
-
-
[M::ha_hist_line]   200: * 15256
[M::ha_hist_line]   201: * 15186
[M::ha_hist_line]   202: * 15098
[M::ha_hist_line]  rest: **************************************************************************************** 2613571
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: count[40] = 1524668
[M::ha_ft_gen] peak_hom: 40; peak_het: 19
[M::ha_ft_gen::327.776*[email protected]] ==> filtered out 2659111 k-mers occurring 200 or more times
[M::ha_opt_update_cov] updated max_n_chain to 200
[M::ha_pt_gen::470.491*6.63] ==> counted 11783100 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[7] = 103020
[M::ha_analyze_count] highest: count[19] = 243814
[M::ha_hist_line]     1: ****************************************************************************************************> 3881309
[M::ha_hist_line]     2: ****************************************************************************************************> 731724
[M::ha_hist_line]     3: ****************************************************************************************************> 285736
[M::ha_hist_line]     4: ******************************************************************* 164048
[M::ha_hist_line]     5: ************************************************* 118919
[M::ha_hist_line]     6: ******************************************** 106851
[M::ha_hist_line]     7: ****************************************** 103020
[M::ha_hist_line]     8: ******************************************* 104750
[M::ha_hist_line]     9: ********************************************** 112275
[M::ha_hist_line]    10: ************************************************** 121525
[M::ha_hist_line]    11: ******************************************************* 135217
[M::ha_hist_line]    12: *************************************************************** 152592
[M::ha_hist_line]    13: ********************************************************************** 170299
[M::ha_hist_line]    14: ***************************************************************************** 187520
[M::ha_hist_line]    15: ************************************************************************************* 206166
[M::ha_hist_line]    16: ****************************************************************************************** 220437
[M::ha_hist_line]    17: ************************************************************************************************* 235482
[M::ha_hist_line]    18: *************************************************************************************************** 240448
[M::ha_hist_line]    19: **************************************************************************************************** 243814
[M::ha_hist_line]    20: **************************************************************************************************** 243184
[M::ha_hist_line]    21: ************************************************************************************************* 236363
[M::ha_hist_line]    22: ********************************************************************************************* 226290
[M::ha_hist_line]    23: ************************************************************************************** 208942
[M::ha_hist_line]    24: ****************************************************************************** 188999
[M::ha_hist_line]    25: ********************************************************************** 170183
[M::ha_hist_line]    26: ************************************************************** 151626
[M::ha_hist_line]    27: ******************************************************* 133053
[M::ha_hist_line]    28: ************************************************* 118693
[M::ha_hist_line]    29: ****************************************** 103494
[M::ha_hist_line]    30: *************************************** 93931
[M::ha_hist_line]    31: *********************************** 85921
[M::ha_hist_line]    32: ********************************* 80779
[M::ha_hist_line]    33: ******************************* 76282
[M::ha_hist_line]    34: ******************************* 76613
[M::ha_hist_line]    35: ******************************* 74620
[M::ha_hist_line]    36: ******************************* 75318
[M::ha_hist_line]    37: ******************************* 75739
[M::ha_hist_line]    38: ******************************** 77902
[M::ha_hist_line]    39: ******************************** 77883
[M::ha_hist_line]    40: ******************************** 78776
[M::ha_hist_line]    41: ******************************* 76017
[M::ha_hist_line]    42: ******************************* 75003
[M::ha_hist_line]    43: ***************************** 71694
[M::ha_hist_line]    44: **************************** 68535
[M::ha_hist_line]    45: *************************** 66454
[M::ha_hist_line]    46: ************************** 62425
[M::ha_hist_line]    47: ************************ 58036
[M::ha_hist_line]    48: ********************** 54488
[M::ha_hist_line]    49: ********************* 50170
[M::ha_hist_line]    50: ******************* 46680
[M::ha_hist_line]    51: ****************** 42821
[M::ha_hist_line]    52: **************** 39110
[M::ha_hist_line]    53: *************** 36031
[M::ha_hist_line]    54: ************** 33860
[M::ha_hist_line]    55: ************* 30924
[M::ha_hist_line]    56: ************ 29141
[M::ha_hist_line]    57: *********** 27028
[M::ha_hist_line]    58: *********** 25620
[M::ha_hist_line]    59: ********** 24737
[M::ha_hist_line]    60: ********** 23265
[M::ha_hist_line]    61: ********* 22523
[M::ha_hist_line]    62: ********* 21687
[M::ha_hist_line]    63: ******** 20549
[M::ha_hist_line]    64: ******** 19945
[M::ha_hist_line]    65: ******** 19109
[M::ha_hist_line]    66: ******** 18588
[M::ha_hist_line]    67: ******* 17721
[M::ha_hist_line]    68: ******* 17247
[M::ha_hist_line]    69: ******* 16661
[M::ha_hist_line]    70: ****** 15704
-
-
-
[M::ha_hist_line]   164: * 1335
[M::ha_hist_line]   165: * 1259
[M::ha_hist_line]  rest: ************* 31330
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: count[40] = 78776
[M::ha_pt_gen] peak_hom: 40; peak_het: 19
[M::ha_pt_gen::521.345*7.64] ==> indexed 216061193 positions
[M::ha_assemble::2779.950*[email protected]] ==> corrected reads for round 1
[M::ha_assemble] # bases: 8054326319; # corrected bases: 14082381; # recorrected bases: 16537
[M::ha_assemble] size of buffer: 15.063GB
[M::ha_pt_gen::2825.302*21.59] ==> counted 9874090 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[7] = 100493
[M::ha_analyze_count] highest: count[19] = 242703
[M::ha_hist_line]     1: ****************************************************************************************************> 2237612
[M::ha_hist_line]     2: ****************************************************************************************************> 543579
[M::ha_hist_line]     3: *************************************************************************************************** 239651
[M::ha_hist_line]     4: ************************************************************* 148520
[M::ha_hist_line]     5: ********************************************** 111364
[M::ha_hist_line]     6: ******************************************* 103166
[M::ha_hist_line]     7: ***************************************** 100493
[M::ha_hist_line]     8: ****************************************** 102670
[M::ha_hist_line]     9: ********************************************* 110027
[M::ha_hist_line]    10: ************************************************* 118852
[M::ha_hist_line]    11: ******************************************************* 132281
[M::ha_hist_line]    12: ************************************************************** 149774
[M::ha_hist_line]    13: ********************************************************************* 167554
[M::ha_hist_line]    14: **************************************************************************** 184708
[M::ha_hist_line]    15: ************************************************************************************ 203351
[M::ha_hist_line]    16: ****************************************************************************************** 217242
[M::ha_hist_line]    17: ************************************************************************************************ 233245
[M::ha_hist_line]    18: *************************************************************************************************** 240099
[M::ha_hist_line]    19: **************************************************************************************************** 242703
[M::ha_hist_line]    20: **************************************************************************************************** 241927
[M::ha_hist_line]    21: ************************************************************************************************** 236969
[M::ha_hist_line]    22: ********************************************************************************************** 227311
[M::ha_hist_line]    23: *************************************************************************************** 211026
[M::ha_hist_line]    24: ******************************************************************************* 190978
[M::ha_hist_line]    25: *********************************************************************** 171183
[M::ha_hist_line]    26: *************************************************************** 153630
[M::ha_hist_line]    27: ******************************************************** 135095
[M::ha_hist_line]    28: ************************************************** 120339
[M::ha_hist_line]    29: ******************************************* 103924
[M::ha_hist_line]    30: *************************************** 94874
[M::ha_hist_line]    31: *********************************** 86055
[M::ha_hist_line]    32: ********************************* 80923
[M::ha_hist_line]    33: ******************************* 75959
[M::ha_hist_line]    34: ******************************* 76099
[M::ha_hist_line]    35: ******************************* 74110
[M::ha_hist_line]    36: ******************************* 75386
[M::ha_hist_line]    37: ******************************* 74682
[M::ha_hist_line]    38: ******************************** 77166
[M::ha_hist_line]    39: ******************************** 77475
[M::ha_hist_line]    40: ********************************* 79001
[M::ha_hist_line]    41: ******************************* 76020
[M::ha_hist_line]    42: ******************************* 75217
[M::ha_hist_line]    43: ***************************** 71586
[M::ha_hist_line]    44: ***************************** 69413
[M::ha_hist_line]    45: **************************** 66756
[M::ha_hist_line]    46: ************************** 63191
[M::ha_hist_line]    47: ************************ 58743
[M::ha_hist_line]    48: *********************** 55832
[M::ha_hist_line]    49: ********************* 50727
[M::ha_hist_line]    50: ******************** 47525
[M::ha_hist_line]    51: ****************** 43052
[M::ha_hist_line]    52: **************** 39946
[M::ha_hist_line]    53: *************** 36391
[M::ha_hist_line]    54: ************** 34694
[M::ha_hist_line]    55: ************* 31186
[M::ha_hist_line]    56: ************ 29236
[M::ha_hist_line]    57: *********** 27356
[M::ha_hist_line]    58: *********** 26237
[M::ha_hist_line]    59: ********** 24596
[M::ha_hist_line]    60: ********** 23546
[M::ha_hist_line]    61: ********* 22721
.
.
.
[M::ha_hist_line]   165: * 1307
[M::ha_hist_line]   166: * 1307
[M::ha_hist_line]  rest: ************* 31683
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: count[40] = 78965
[M::ha_pt_gen] peak_hom: 40; peak_het: 19
[M::ha_pt_gen::7418.119*23.39] ==> indexed 216581152 positions
[M::ha_assemble::7927.976*[email protected]] ==> found overlaps for the final round
[M::ha_print_ovlp_stat] # overlaps: 21150601
[M::ha_print_ovlp_stat] # strong overlaps: 3922010
[M::ha_print_ovlp_stat] # weak overlaps: 17228591
[M::ha_print_ovlp_stat] # exact overlaps: 20329706
[M::ha_print_ovlp_stat] # inexact overlaps: 820895
[M::ha_print_ovlp_stat] # overlaps without large indels: 21111083
[M::ha_print_ovlp_stat] # reverse overlaps: 2530916
Writing reads to disk...
Reads has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
bin files have been written.
Writing raw unitig GFA to disk...
Writing processed unitig GFA to disk...
[M::purge_dups] purge duplication coverage threshold: 50
[M::purge_dups] purge duplication coverage threshold: 50
[M::adjust_utg_by_primary] primary contig coverage range: [32, infinity]
Writing primary contig GFA to disk...
Writing alternate contig GFA to disk...
[M::main] Version: 0.12-r304

primary contigs:

Main genome scaffold total: 773
Main genome contig total:   773
Main genome scaffold sequence total: 420.5 MB
Main genome contig sequence total:   420.5 MB (->  0.0% gap)
Main genome scaffold N/L50: 66/1.9 MB
Main genome contig N/L50:   66/1.9 MB
Number of scaffolds > 50 KB: 452
% main genome in scaffolds > 50 KB: 97.5%

 Minimum    Number    Number     Total        Total     Scaffold
Scaffold      of        of      Scaffold      Contig     Contig
 Length   Scaffolds  Contigs     Length       Length    Coverage
--------  ---------  -------  -----------  -----------  --------
    All       773        773  420,473,782  420,473,782   100.00%
   1 kb       773        773  420,473,782  420,473,782   100.00%
 2.5 kb       773        773  420,473,782  420,473,782   100.00%
   5 kb       773        773  420,473,782  420,473,782   100.00%
  10 kb       773        773  420,473,782  420,473,782   100.00%
  25 kb       727        727  419,474,641  419,474,641   100.00%
  50 kb       452        452  409,886,666  409,886,666   100.00%
 100 kb       354        354  403,275,622  403,275,622   100.00%
 250 kb       279        279  390,826,593  390,826,593   100.00%
 500 kb       222        222  370,606,466  370,606,466   100.00%
   1 mb       133        133  304,500,802  304,500,802   100.00%
 2.5 mb        41         41  154,917,623  154,917,623   100.00%
   5 mb         5          5   30,732,695   30,732,695   100.00%

alternate contigs:

Main genome scaffold total: 603
Main genome contig total:   603
Main genome scaffold sequence total: 28.5 MB
Main genome contig sequence total:   28.5 MB (->  0.0% gap)
Main genome scaffold N/L50: 70/66.0 KB
Main genome contig N/L50:   70/66.0 KB
Number of scaffolds > 50 KB: 95
% main genome in scaffolds > 50 KB: 55.0%

 Minimum    Number    Number     Total        Total     Scaffold
Scaffold      of        of      Scaffold      Contig     Contig
 Length   Scaffolds  Contigs     Length       Length    Coverage
--------  ---------  -------  -----------  -----------  --------
    All       603        603   28,541,231   28,541,231   100.00%
   1 kb       603        603   28,541,231   28,541,231   100.00%
 2.5 kb       603        603   28,541,231   28,541,231   100.00%
   5 kb       603        603   28,541,231   28,541,231   100.00%
  10 kb       602        602   28,531,752   28,531,752   100.00%
  25 kb       336        336   23,702,698   23,702,698   100.00%
  50 kb        95         95   15,683,649   15,683,649   100.00%
 100 kb        43         43   12,259,161   12,259,161   100.00%
 250 kb        11         11    7,373,679    7,373,679   100.00%
 500 kb         7          7    5,998,160    5,998,160   100.00%
   1 mb         2          2    2,266,314    2,266,314   100.00%
 2.5 mb         0          0            0            0     0.00%

Not sure what to try at this point, but I'm interested to keep the conversation going if you all are interested in developing the assembler to handle such cases.

@lh3
Copy link
Collaborator

lh3 commented Dec 2, 2020

The location of the hom/het k-mer peak has little to do with the doubled assembly size. As Haoyu said, try --purge-cov and/or -s. I guess probably the standalone purge_dups will work better in your case.

@plattsad
Copy link

plattsad commented Feb 9, 2021

I'm not sure if we may be seeing a somewhat similar issue. I have a 240MBase plant genome that isn't assembling well - we're getting an N50 of about 110KBase.

The plant should have a moderate to low heterozygosity and we have 11GBase of sequence, so the peak below at x46 looks like the homozygous peak with no evident heterozygous peak. But it looks like hifiasm is picking up on something in the noise as a spurious homozygous peak (maybe a bit of residual contamination?) and is calling "[M::ha_ft_gen] peak_hom: 6; peak_het: -1" which is possibly messing with the downstream assembly? I can't immediately see any way of overriding this?

[M::ha_analyze_count] highest: count[6] = 7615771
[M::ha_hist_line]     2: ****************************************************************************************************> 22408535
[M::ha_hist_line]     3: ****************************************************************************************************> 8657975
[M::ha_hist_line]     4: ************************************************************************************************ 7336391
[M::ha_hist_line]     5: *************************************************************************************************** 7513444
[M::ha_hist_line]     6: **************************************************************************************************** 7615771
[M::ha_hist_line]     7: ******************************************************************************************* 6953884
[M::ha_hist_line]     8: **************************************************************************** 5818652
[M::ha_hist_line]     9: ************************************************************* 4675623
[M::ha_hist_line]    10: ********************************************** 3483947
[M::ha_hist_line]    11: ********************************* 2486463
[M::ha_hist_line]    12: *********************** 1734740
[M::ha_hist_line]    13: ***************** 1273650
[M::ha_hist_line]    14: ************* 988026
[M::ha_hist_line]    15: *********** 816541
[M::ha_hist_line]    16: ********** 744607
[M::ha_hist_line]    17: ********** 744153
[M::ha_hist_line]    18: ********** 778618
[M::ha_hist_line]    19: *********** 831791
[M::ha_hist_line]    20: ************ 885995
[M::ha_hist_line]    21: ************* 967146
[M::ha_hist_line]    22: ************** 1082043
[M::ha_hist_line]    23: *************** 1175861
[M::ha_hist_line]    24: ***************** 1291702
[M::ha_hist_line]    25: ******************* 1418231
[M::ha_hist_line]    26: ********************* 1574217
[M::ha_hist_line]    27: *********************** 1717038
[M::ha_hist_line]    28: ************************* 1867704
[M::ha_hist_line]    29: *************************** 2053383
[M::ha_hist_line]    30: ****************************** 2252547
[M::ha_hist_line]    31: ******************************** 2456543
[M::ha_hist_line]    32: *********************************** 2654839
[M::ha_hist_line]    33: ************************************* 2853282
[M::ha_hist_line]    34: **************************************** 3071010
[M::ha_hist_line]    35: ******************************************* 3268239
[M::ha_hist_line]    36: ********************************************** 3478024
[M::ha_hist_line]    37: ************************************************* 3723711
[M::ha_hist_line]    38: *************************************************** 3914659
[M::ha_hist_line]    39: ****************************************************** 4076444
[M::ha_hist_line]    40: ******************************************************** 4243403
[M::ha_hist_line]    41: ********************************************************* 4350864
[M::ha_hist_line]    42: ********************************************************** 4393110
[M::ha_hist_line]    43: ********************************************************** 4425975
[M::ha_hist_line]    44: ********************************************************** 4441411
[M::ha_hist_line]    45: ********************************************************** 4424218
[M::ha_hist_line]    46: ********************************************************* 4351193
[M::ha_hist_line]    47: ******************************************************** 4275290
[M::ha_hist_line]    48: ****************************************************** 4132816
[M::ha_hist_line]    49: ***************************************************** 4004287
[M::ha_hist_line]    50: ************************************************** 3827829
[M::ha_hist_line]    51: *********************************************** 3613549
[M::ha_hist_line]    52: ********************************************* 3408068
[M::ha_hist_line]    53: ****************************************** 3196954
[M::ha_hist_line]    54: *************************************** 2951828
[M::ha_hist_line]    55: ************************************ 2730826
[M::ha_hist_line]    56: ********************************* 2492887
[M::ha_hist_line]    57: ****************************** 2261824
[M::ha_hist_line]    58: *************************** 2060978
[M::ha_hist_line]    59: ************************ 1844006
[M::ha_hist_line]    60: ********************** 1666759
[M::ha_hist_line]    61: ******************** 1488025
[M::ha_hist_line]    62: ***************** 1298218
[M::ha_hist_line]    63: *************** 1128978
[M::ha_hist_line]    64: ************* 1006326
[M::ha_hist_line]    65: ************ 884390
[M::ha_hist_line]    66: ********** 758372
[M::ha_hist_line]    67: ********* 654228
[M::ha_hist_line]    68: ******* 559453
[M::ha_hist_line]    69: ****** 472609
[M::ha_hist_line]    70: ***** 401628
[M::ha_hist_line]    71: ***** 343492
[M::ha_hist_line]    72: **** 293676
[M::ha_hist_line]    73: *** 256838
[M::ha_hist_line]    74: *** 226872
[M::ha_hist_line]    75: *** 204987
[M::ha_hist_line]    76: ** 180611
[M::ha_hist_line]    77: ** 158380
[M::ha_hist_line]    78: ** 142687
[M::ha_hist_line]    79: ** 131546
[M::ha_hist_line]    80: ** 118504
[M::ha_hist_line]    81: * 108998
[M::ha_hist_line]    82: * 102957
[M::ha_hist_line]    83: * 95738
[M::ha_hist_line]    84: * 94677
[M::ha_hist_line]    85: * 90178
[M::ha_hist_line]    86: * 84027
[M::ha_hist_line]    87: * 82578
[M::ha_hist_line]    88: * 80335
[M::ha_hist_line]    89: * 77907
[M::ha_hist_line]    90: * 77706
[M::ha_hist_line]    91: * 75025
[M::ha_hist_line]    92: * 73079
[M::ha_hist_line]    93: * 69838
[M::ha_hist_line]    94: * 69007
[M::ha_hist_line]    95: * 67334
[M::ha_hist_line]    96: * 67287
[M::ha_hist_line]    97: * 64676
[M::ha_hist_line]    98: * 62391
[M::ha_hist_line]    99: * 60120
[M::ha_hist_line]   100: * 58544
[M::ha_hist_line]   101: * 56452
[M::ha_hist_line]   102: * 53633
[M::ha_hist_line]   103: * 52964
[M::ha_hist_line]   104: * 51004
[M::ha_hist_line]   105: * 50528
[M::ha_hist_line]   106: * 47821
[M::ha_hist_line]   107: * 45735
[M::ha_hist_line]   108: * 45054
[M::ha_hist_line]   109: * 42925
[M::ha_hist_line]   110: * 39438
[M::ha_hist_line]   111: * 38983
[M::ha_hist_line]  rest: ********************************* 2482275
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_ft_gen] peak_hom: 6; peak_het: -1

@conchoecia
Copy link
Author

conchoecia commented Feb 9, 2021

What are your assembly stats for the primary and alternative assemblies, @plattsad ? https://github.com/conchoecia/fasta_stats

@plattsad
Copy link

plattsad commented Feb 9, 2021

With a vanilla configuration (only threads specified)...

---------------- Information for assembly 'fot.asm.p_ctg.gfa.fa' ----------------


                                         Number of scaffolds       5480
                                     Total size of scaffolds  464904639
                                            Longest scaffold    2070765
                                           Shortest scaffold      18620
                                 Number of scaffolds > 1K nt       5480 100.0%
                                Number of scaffolds > 10K nt       5480 100.0%
                               Number of scaffolds > 100K nt       1025  18.7%
                                 Number of scaffolds > 1M nt         11   0.2%
                                Number of scaffolds > 10M nt          0   0.0%
                                          Mean scaffold size      84837
                                        Median scaffold size      47112
                                         N50 scaffold length     117304
                                          L50 scaffold count        821
                                                 scaffold %A      30.62
                                                 scaffold %C      19.38
                                                 scaffold %G      19.38
                                                 scaffold %T      30.62
                                                 scaffold %N       0.00
                                         scaffold %non-ACGTN       0.00
                             Number of scaffold non-ACGTN nt          0

                Percentage of assembly in scaffolded contigs       0.0%
              Percentage of assembly in unscaffolded contigs     100.0%
                      Average number of contigs per scaffold        1.0
Average length of break (>25 Ns) between contigs in scaffold          0

                                           Number of contigs       5480
                              Number of contigs in scaffolds          0
                          Number of contigs not in scaffolds       5480
                                       Total size of contigs  464904639
                                              Longest contig    2070765
                                             Shortest contig      18620
                                   Number of contigs > 1K nt       5480 100.0%
                                  Number of contigs > 10K nt       5480 100.0%
                                 Number of contigs > 100K nt       1025  18.7%
                                   Number of contigs > 1M nt         11   0.2%
                                  Number of contigs > 10M nt          0   0.0%
                                            Mean contig size      84837
                                          Median contig size      47112
                                           N50 contig length     117304
                                            L50 contig count        821
                                                   contig %A      30.62
                                                   contig %C      19.38
                                                   contig %G      19.38
                                                   contig %T      30.62
                                                   contig %N       0.00
                                           contig %non-ACGTN       0.00
                               Number of contig non-ACGTN nt          0
---------------- Information for assembly 'fot.asm.a_ctg.gfa.fa' ----------------


                                         Number of scaffolds       1019
                                     Total size of scaffolds   53586801
                                            Longest scaffold     562572
                                           Shortest scaffold       9051
                                 Number of scaffolds > 1K nt       1019 100.0%
                                Number of scaffolds > 10K nt       1018  99.9%
                               Number of scaffolds > 100K nt         76   7.5%
                                 Number of scaffolds > 1M nt          0   0.0%
                                Number of scaffolds > 10M nt          0   0.0%
                                          Mean scaffold size      52588
                                        Median scaffold size      37402
                                         N50 scaffold length      58138
                                          L50 scaffold count        226
                                                 scaffold %A      30.73
                                                 scaffold %C      19.29
                                                 scaffold %G      19.25
                                                 scaffold %T      30.73
                                                 scaffold %N       0.00
                                         scaffold %non-ACGTN       0.00
                             Number of scaffold non-ACGTN nt          0

                Percentage of assembly in scaffolded contigs       0.0%
              Percentage of assembly in unscaffolded contigs     100.0%
                      Average number of contigs per scaffold        1.0
Average length of break (>25 Ns) between contigs in scaffold          0

                                           Number of contigs       1019
                              Number of contigs in scaffolds          0
                          Number of contigs not in scaffolds       1019
                                       Total size of contigs   53586801
                                              Longest contig     562572
                                             Shortest contig       9051
                                   Number of contigs > 1K nt       1019 100.0%
                                  Number of contigs > 10K nt       1018  99.9%
                                 Number of contigs > 100K nt         76   7.5%
                                   Number of contigs > 1M nt          0   0.0%
                                  Number of contigs > 10M nt          0   0.0%
                                            Mean contig size      52588
                                          Median contig size      37402
                                           N50 contig length      58138
                                            L50 contig count        226
                                                   contig %A      30.73
                                                   contig %C      19.29
                                                   contig %G      19.25
                                                   contig %T      30.73
                                                   contig %N       0.00
                                           contig %non-ACGTN       0.00
                               Number of contig non-ACGTN nt          0

@chhylp123
Copy link
Owner

Looks like we should be able to let users to manually set homozygous peak. Do you think it is helpful @plattsad @conchoecia? I will release v0.14 today if you think it is an acceptable solution.

@plattsad
Copy link

plattsad commented Feb 9, 2021

I think that might be useful. As a test I manually set peak_hom and peak_het in htab.cpp to 46 and 23 respectively. The assembly with only threads set appeared considerably better:

---------------- Information for assembly 'fot.APmod.asm.p_ctg.gfa.fa' ----------------


                                         Number of scaffolds        494
                                     Total size of scaffolds  243557028
                                            Longest scaffold   29379636
                                           Shortest scaffold      21153
                                 Number of scaffolds > 1K nt        494 100.0%
                                Number of scaffolds > 10K nt        494 100.0%
                               Number of scaffolds > 100K nt         60  12.1%
                                 Number of scaffolds > 1M nt         19   3.8%
                                Number of scaffolds > 10M nt         10   2.0%
                                          Mean scaffold size     493030
                                        Median scaffold size      40029
                                         N50 scaffold length   16540259
                                          L50 scaffold count          6
                                                 scaffold %A      30.58
                                                 scaffold %C      19.42
                                                 scaffold %G      19.44
                                                 scaffold %T      30.56
                                                 scaffold %N       0.00
                                         scaffold %non-ACGTN       0.00
                             Number of scaffold non-ACGTN nt          0

                Percentage of assembly in scaffolded contigs       0.0%
              Percentage of assembly in unscaffolded contigs     100.0%
                      Average number of contigs per scaffold        1.0
Average length of break (>25 Ns) between contigs in scaffold          0

                                           Number of contigs        494
                              Number of contigs in scaffolds          0
                          Number of contigs not in scaffolds        494
                                       Total size of contigs  243557028
                                              Longest contig   29379636
                                             Shortest contig      21153
                                   Number of contigs > 1K nt        494 100.0%
                                  Number of contigs > 10K nt        494 100.0%
                                 Number of contigs > 100K nt         60  12.1%
                                   Number of contigs > 1M nt         19   3.8%
                                  Number of contigs > 10M nt         10   2.0%
                                            Mean contig size     493030
                                          Median contig size      40029
                                           N50 contig length   16540259
                                            L50 contig count          6
                                                   contig %A      30.58
                                                   contig %C      19.42
                                                   contig %G      19.44
                                                   contig %T      30.56
                                                   contig %N       0.00
                                           contig %non-ACGTN       0.00
                               Number of contig non-ACGTN nt          0

@conchoecia
Copy link
Author

@plattsad that is a great improvement after manually setting the het and hom peak. Good idea trying that as a direct proof-of-concept. If you have Hi-C data then that should nicely scaffold into chromosomes.

@chhylp123 If you think it will increase the assembly quality, then yes it would be nice to have the option (even if it only affects purgedups). Our examples have shown that the current hifiasm algorithm to detect the het and hom peaks doesn't work in all cases.

By the way, I used Hi-C to scaffold the output of my genome assembly above that was doubled in size, and ended up with a nice chromosome-scale diploid assembly. The heterozygosity seems like it is well over 5% in the animal that I am working with, including major indels and inversions between the haplotypes.

@chhylp123
Copy link
Owner

@plattsad @conchoecia The new version of hifiasm (v0.14) incorporates a new option '--min-hist-cnt' to ignore noisy counts when analyzing the k-mer spectrum. Hope it will be helpful for this problem.

@conchoecia
Copy link
Author

Hi @chhylp123 - I think that this is a good compromise to help the software determine where a good cutoff for the noise is. I take it to mean that --min-hist-cnt is to be in the trough between the heterozygous peak and error peak in the k-mer spectrum? I'll follow up here once I try it out.

@chhylp123
Copy link
Owner

Yes, I think so. Skip the problematic peaks should work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants