---
title: "vignette"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{vignette}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r setup}
library(findGSEP)
```
---
title: "Estimate Genome Size of Polyploid Species Using K-mer Frequencies"
output: html_document
---
## Description
`findGSEP` is a function for multiple polyploidy genome size estimation by fitting k-mer frequencies iteratively with a normal distribution model.
To use `findGSEP`, one needs to prepare a histo file, which contains two tab-separated columns. The first column gives frequencies at which k-mers occur in reads, while the second column gives counts of such distinct k-mers. Parameters k and related histo file are required for any estimation.
Dependencies (R library) required: `pracma`, `fGarch`, etc. - see DESCRIPTION for details.
## Usage
```r
findGSEP(
path,
samples,
sizek,
exp_hom,
ploidy,
range_left,
range_right,
xlimit,
ylimit,
output_dir
)
```
## Arguments
- **path**: is the histo file location (mandatory).
- **samples**: is the histo file name (mandatory).
- **sizek**: is the size of k used to generate the histo file (mandatory). K is involved in calculating heterozygosity if the genome is heterozygous.
- **exp_hom**: a rough average k-mer coverage for finding the homozygous regions. In general, one can get peaks in the k-mer frequencies file, but has to determine which one is for the homozygous regions, and which one is for the heterozygous regions. It is optional, however, it must be provided if one wants to estimate size for a heterozygous genome. VALUE for `exp_hom` must satisfy `fp < VALUE < 2*fp`, where `fp` is the freq for homozygous peak. If not provided, 0 by default assumes the genome is homozygous.
- **ploidy**: is the number of ploidy (mandatory).
- **range_left**: is the left range for estimation, default is `exp_hom * 0.2`, normally do not need to change this (optional).
- **range_right**: is the right range for estimation, default is `exp_hom * 0.2`, normally do not need to change this (optional).
- **xlimit**: is the x-axis range, if not given, then it will automatically calculate a proper range, normally do not need to change this (optional).
- **ylimit**: is the y-axis range, if not given, then it will automatically calculate a proper range, normally do not need to change this (optional).
- **output_dir**: is the path to write output files (optional). If not provided, by default results will be written in the folder where the histo file is.
## Example Usage
To run the algorithm, follow these steps:
1. **Prepare a Path**: Create a directory where the histo file will be stored. For example, create a directory named `test_findGSEP`.
2. **Put Histo File in the Path**: Place your histo file in the `test_findGSEP` directory. In this example, the histo file name is `ara_simulate_4ploidy_25x_rep4.histo`.
3. **Provide Output Directory**: Specify the output directory where the results will be saved. In this example, we use `tempdir()` as the output directory.
4. **Run the Algorithm**: Use the following command to run the algorithm with the specified parameters:
```r
findGSEP(
path = 'test_findGSEP',
samples = 'ara_simulate_4ploidy_25x_rep4.histo',
sizek = 21,
exp_hom = 35,
ploidy = 4,
output_dir = tempdir(),
range_left = 35 * 0.2, ## exp_hom*0.2
range_right = 35 * 0.2, ## exp_hom*0.2
xlimit = -1, ## will calculate automatically
ylimit = -1 ## will calculate automatically
)
```
5. **Output**: The output will include:
- A PDF file named `${samples}._hap_genome_size_est.pdf`, which contains the estimated genome size.
- A CSV file named `${samples}._haploid_size.csv`, which contains the predicted genome size.
## References
Laiyi Fu, Yanxin Xie, Shunkang Ling, and Hequan Sun# etc. al. findGSEP: a web application for estimating ge-nome size of polyploid species using k-mer frequencies
## Session Info
```r
R version 4.3.3 (2024-02-29)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.4.1
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Asia/Shanghai
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] findGSEP_1.2.0 dplyr_1.1.4 png_0.1-8 scales_1.3.0 fGarch_4033.92
[6] pracma_2.4.4 ggplot2_3.5.0 RColorBrewer_1.1-3
loaded via a namespace (and not attached):
[1] Matrix_1.6-5 gtable_0.3.5 compiler_4.3.3 fBasics_4032.96 gbutils_0.5
[6] tidyselect_1.2.1 cvar_0.5 timeSeries_4032.109 yaml_2.3.8 fastmap_1.1.1
[11] lattice_0.22-5 R6_2.5.1 generics_0.1.3 knitr_1.45 rbibutils_2.2.16
[16] tibble_3.2.1 spatial_7.3-17 munsell_0.5.1 timeDate_4032.109 pillar_1.9.0
[21] rlang_1.1.3 utf8_1.2.4 xfun_0.43 pkgload_1.3.4 cli_3.6.2
[26] withr_3.0.0 magrittr_2.0.3 Rdpack_2.6 digest_0.6.35 grid_4.3.3
[31] rstudioapi_0.16.0 lifecycle_1.0.4 vctrs_0.6.5 evaluate_0.23 glue_1.7.0
[36] fansi_1.0.6 colorspace_2.1-0 rmarkdown_2.26 htmltools_0.5.8.1 tools_4.3.3
[41] pkgconfig_2.0.3
```