GenomicBreeding.jl
Overview
The GenomicBreeding
module provides a comprehensive suite of tools for genomic prediction, genome-wide association studies (GWAS), and data handling in genomic breeding. It integrates functionalities from GenomicBreedingCore
, GenomicBreedingIO
, GenomicBreedingModels
, and GenomicBreedingPlots
to offer efficient and scalable solutions for genetic data analysis and visualisation.
Installation
We designed GenomicBreeding.jl to work on an HPC running Linux (the various components, i.e. GenomicBreedingCore.jl, GenomicBreedingIO.jl, GenomicBreedingModels.jl, and GenomicBreedingPlots.jl work on a single Linux PC too).
Currently, we require that you install Julia on your home directory in your HPC cluster via:
curl -fsSL https://install.julialang.org | sh
type -a julia
Currently, GenomicBreedingModels.jl is dependent on R and the package BGLR for Bayes A, Bayes B and Bayes C models. Because of this we require that R and BGLR be installed. To help with this, you may install all the requirements via Conda using the environment file: GenomicBreeding_conda.yml
. We aim to have a pure Julia implementation of Bayesian models using Turing.jl in the near future (we just need to speed-up the models a bit).
Install the GenomicBreeding.jl library in Julia:
using Pkg
Pkg.add("https://github.com/GenomicBreeding/GenomicBreeding.jl")
Feel free to install the GenomicBreeding.jl components as well as various other useful libraries:
using Pkg
GB_components = [
"https://github.com/GenomicBreeding/GenomicBreedingCore.jl",
"https://github.com/GenomicBreeding/GenomicBreedingIO.jl",
"https://github.com/GenomicBreeding/GenomicBreedingModels.jl",
"https://github.com/GenomicBreeding/GenomicBreedingPlots.jl",
]
for P in GB_components
Pkg.add(url=P)
end
Pkg.add(["StatsBase", "MixedModels", "MultivariateStats", "UnicodePlots", "ColorSchemes", "CairoMakie"])
Quickstart
- Genomic prediction
- Genome-wide association study
- Genotype data filtering and imputation
- Phenotype data analyses
- Cluster analyses
- Plotting
- Mating simulations
Genomic prediction
Example 1: using simulated data
Here's a simple example using simulated data to perform replicated k-fold cross validation:
# It is always a good idea to keep all your packages updated
using Pkg; Pkg.update()
# Load GenomicBreeding
using GenomicBreeding
# Load plotting and GP model functions
import GenomicBreeding: plot, lasso, bayesa
# Simulate genotype and phenotype data
genomes = simulategenomes(n=300, l=1_000, verbose=true)
genomes.populations[1:100] .= "pop1"; genomes.populations[101:200] .= "pop2"; genomes.populations[201:300] .= "pop3" # simulate multiple populations
trials, _ = simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, verbose=true);
phenomes = extractphenomes(trials)
fname_geno = writedelimited(genomes, fname="test-geno.tsv")
fname_pheno = writedelimited(phenomes, fname="test-pheno.tsv")
# Input struct documentation
@doc GBInput
# Setup the input struct
input = GBInput(
fname_geno=fname_geno,
fname_pheno=fname_pheno,
anaysis=cv, # set analysis to use the `cv` function for replicated k-fold cross-validation
models = [lasso, bayesa],
n_folds=2,
n_replications=2,
SLURM_job_name="testGB",
SLURM_account_name="dbiopast1",
SLURM_cpus_per_task=2,
SLURM_mem_G=5,
SLURM_time_limit_dd_hhmmss="00-00:15:00",
SLURM_max_array_jobs_running=10,
verbose=true
)
# Preliminary look at the genotype and phenotype data
plot(input=input, format="png", plot_size=(700, 500))
# Documentation of the main user interface function (take note of the currently available analyses)
@doc submitslurmarrayjobs
# Submit the Slurm array jobs
# Note that here you will be prompted to enter YES to proceed.
outdir = submitslurmarrayjobs(input)
# Monitor the Slurm jobs
run(`sh -c 'squeue -u $USER'`)
run(`sh -c 'ls -lhtr slurm-*_*.out'`)
run(`sh -c 'cat slurm-*_*.out'`)
run(`sh -c 'tail slurm-*_*.out'`)
run(`sh -c 'grep -i "err" slurm-*_*.out'`)
run(`sh -c 'grep -i "err" slurm-*_*.out | cut -d: -f1 | sort | uniq'`)
readdir(outdir)
# Once the array jobs have finishes or at least a couple of jobs have finished, run below.
# Rerun as you often as wish to update the plots.
# You may exit Julia and just run the plotting function below after correctly defining input::GBInput above.
plot(input=input, format="png", plot_size=(700, 500), skip_genomes=true, skip_phenomes=true, overwrite=true)
Example 2: using empirical data
# TODO
# with GP per se on unphenotyped set
Example 3: using artificial neural network and GPU support on simulated data
# TODO
# Make sure to set:
# - `SLURM_partition_name` to the name of your HPC's GPU partition nodes,
# - `SLURM_additional_flags` to additional GPU-related flag, e.g. `--gres=gpu:1`
Genome-wide association study
# TODO
Genotype data filtering and imputation
# TODO
Phenotype data analyses
# TODO
Cluster analyses
# TODO
Plotting
# TODO
Mating simulations
# TODO
File formats
Variant call format
See VCFv4.2 and VCFv4.3 for details in the format specifications.
Note that given that we work with autopolyploids, and pools (e.g. half-sib families) more often, the VCF parser prioritise the AF
(allele frequency) field, followed by the AD
(allele depth) field, and finally the GT
(genotype) field. If your VCF file needs depth-based filtering, you may opt for having the AD
field instead of the AF
field, as well as include the DP
(depth) field. A way to generate a VCF file with AD
and DP
fields is via:
time \
bcftools mpileup \
-BI \
-a AD,DP \
-d 4000000 \
-f ${REF} \
-T ${LOCI} \
-b ${BAMS_LIST} \
-Ov > ${PREFIX}-MPILEUP.vcf
where the following flags refer to:
-B
: disable probabilistic realignment for the computation of base alignment quality (BAQ). BAQ is the Phred-scaled probability of a read base being misaligned. Applying this option greatly helps to reduce false SNPs caused by misalignments.-I
: do not perform INDEL calling-a AD,DP
: comma-separated list of FORMAT and INFO tags to output, i.e. Total allelic depth (Number=R,Type=Integer) (AD) and Number of high-quality bases (Number=1,Type=Integer) (DP)-Ou
: output uncompressed BCF, because we're piping into bcftools call-d 4000000
: at each position, read maximally 4 million reads per input file-f ...
: reference genome in fasta format-T ...
: regions or loci specified in a tab-delimited file. The columns of the tab-delimited file can contain either positions (two-column format: CHROM, POS) or intervals (three-column format: CHROM, BEG, END), but not both. Positions are 1-based and inclusive.-b ...
: list of input alignment files, one file per line
Allele frequency table
This is a simple human-readable string-delimited text file (tab-delimited by default). The smallest minimal valid allele frequency table is as follows:
chrom | pos | all_alleles | allele | entry_1 |
---|---|---|---|---|
chr_1 | 123 | A | T | A |
Each column must be sorted exactly as this: starting with "chrom" for chromosome or scaffold name, "pos" for the position in bases, "all_alleles" for a string of reference and alternative alleles separated by pipe/s (|
), "allele" for exactly one of the alleles in the previous column, and finally subsequency column names refer to the names of the entries. Values under the 5th and subsequent columns are assumed to be allele frequencies, i.e. ranges from 0.0
to 1.0
. Missing values may be encoded as any of the following:
- ""
- "missing"
- "NA"
- "na"
- "N/A"
- "n/a"
An additional header may be included, for example:
chrom | pos | all_alleles | allele | entry_1 | entry_2 | entry_3 |
---|---|---|---|---|---|---|
chrom | pos | all_alleles | allele | pop_ABC | pop_DEF | pop_XYZ |
chr_1 | 123 | A|T | A | 0.51 | 0.25 | 0.25 |
where the first header is the same as the detailed above, while the second header replaces the entry names with the corresponding population or group the entries belong to.
JLD2
This is a compressed binary format containing Julia structs (e.g. genomes, and phenomes struct). It is a subset of the scientific data format HDF5.
Trials table
Similar to the allele frequency table format, this is a simple human-readable string-delimited text file (tab-delimited by default). The smallest minimal valid trials table is as follows:
years | seasons | harvests | sites | entries | populations | replications | blocks | rows | cols | trait_1 |
---|---|---|---|---|---|---|---|---|---|---|
2023 | 2023EarlySpring | FLIGHT-2023-09-05 | LOC1_TRT1 | entry_1 | pop_1 | rep_1 | row1 | row1 | col1 | 3.1416 |
The first 10 columns must be named (or as close as possible) as the following in any order:
- "years"
- "seasons"
- "harvests"
- "sites"
- "entries"
- "populations"
- "replications"
- "blocks"
- "rows"
- "cols"
The subsequent columns (column 11 and so on) refer to the name of the traits. Values under the 11th and subsequent columns are assumed to be numeric. Missing values may be encoded as any of the following:
- ""
- "missing"
- "NA"
- "na"
- "N/A"
- "n/a"
Phenomes table
Again, similar to the allele frequency table format, this is a simple human-readable string-delimited text file (tab-delimited by default). The smallest minimal valid phenomes table is as follows:
entries | populations | trait_1 |
---|---|---|
entry_1 | pop_XYZ | 0.5772 |
Each column must be sorted exactly as this: starting with "entries" for the names of the entries, genotypes, families or pools, "populations" for the corresponding population names or group names. Subsequent column names (column 3 and so on) refer to the trait names. Values under the 3rd and subsequent columns are assumed to be numeric. Missing values may be encoded as any of the following:
- ""
- "missing"
- "NA"
- "na"
- "N/A"
- "n/a"
License
The GenomicBreeding
module is licensed under the GPLv3 License. See the LICENSE.md file for more details.
Index
GenomicBreeding.GBInput
Base.:==
Base.hash
GenomicBreeding.checkinputs
GenomicBreeding.cv
GenomicBreeding.fit
GenomicBreeding.gwas
GenomicBreeding.loadcvs
GenomicBreeding.loadfits
GenomicBreeding.loadgenomesphenomes
GenomicBreeding.plot
GenomicBreeding.predict
GenomicBreeding.prepareinputs
GenomicBreeding.prepareoutprefixandoutdir
GenomicBreeding.submitslurmarrayjobs
GenomicBreedingCore.checkdims
GenomicBreedingCore.clone