GenomicBreedingModels
Documentation for GenomicBreedingModels.
GenomicBreedingModels.LπDist
GenomicBreedingModels.NπDist
GenomicBreedingModels.TπDist
Base.maximum
Base.maximum
Base.maximum
Base.minimum
Base.minimum
Base.minimum
Base.rand
Base.rand
Base.rand
Distributions.logpdf
Distributions.logpdf
Distributions.logpdf
GenomicBreedingModels.addnorm
GenomicBreedingModels.bayesa
GenomicBreedingModels.bayesb
GenomicBreedingModels.bayesc
GenomicBreedingModels.bayesian
GenomicBreedingModels.bayesian
GenomicBreedingModels.bglr
GenomicBreedingModels.cvbulk
GenomicBreedingModels.cvleaveonepopulationout
GenomicBreedingModels.cvmultithread!
GenomicBreedingModels.cvpairwisepopulation
GenomicBreedingModels.cvperpopulation
GenomicBreedingModels.epistasisfeatures
GenomicBreedingModels.extractxyetc
GenomicBreedingModels.gwaslmm
GenomicBreedingModels.gwasols
GenomicBreedingModels.gwasprep
GenomicBreedingModels.gwasreml
GenomicBreedingModels.heritabilitynarrow_sense
GenomicBreedingModels.invoneplus
GenomicBreedingModels.lasso
GenomicBreedingModels.log10epsdivlog10eps
GenomicBreedingModels.loglikreml
GenomicBreedingModels.metrics
GenomicBreedingModels.mlp
GenomicBreedingModels.mult
GenomicBreedingModels.ols
GenomicBreedingModels.pearsonscorrelation
GenomicBreedingModels.predict
GenomicBreedingModels.raise
GenomicBreedingModels.reconstitutefeatures
GenomicBreedingModels.ridge
GenomicBreedingModels.square
GenomicBreedingModels.transform1
GenomicBreedingModels.transform2
GenomicBreedingModels.turing_bayesG
GenomicBreedingModels.turing_bayesG_logit
GenomicBreedingModels.turing_bayesGs
GenomicBreedingModels.turing_bayesGπ
GenomicBreedingModels.turing_bayesGπs
GenomicBreedingModels.turing_bayesL
GenomicBreedingModels.turing_bayesLs
GenomicBreedingModels.turing_bayesLπ
GenomicBreedingModels.turing_bayesLπs
GenomicBreedingModels.turing_bayesT
GenomicBreedingModels.turing_bayesTπ
GenomicBreedingModels.validate
GenomicBreedingModels.@string2operations
GenomicBreedingModels.LπDist
— TypeLaplace distribution with a point mass at 0.0
GenomicBreedingModels.NπDist
— TypeGaussian distribution with a point mass at 0.0
GenomicBreedingModels.TπDist
— TypeT-distribution with a point mass at 0.0
Base.maximum
— MethodMaximum value of the LπDist distribution
Base.maximum
— MethodMaximum value of the NπDist distribution
Base.maximum
— MethodMaximum value of the TπDist distribution
Base.minimum
— MethodMinimum value of the LπDist distribution
Base.minimum
— MethodMinimum value of the NπDist distribution
Base.minimum
— MethodMinimum value of the TπDist distribution
Base.rand
— MethodSampling method for LπDist
Examples
d = LπDist(0.1, 0.0, 1.0)
rand(d)
Base.rand
— MethodSampling method for NπDist
Examples
d = NπDist(0.1, 0.0, 1.0)
rand(d)
Base.rand
— MethodSampling method for TπDist
Examples
d = TπDist(0.1, 1.0)
rand(d)
Distributions.logpdf
— Methodlog(pdf) of LπDist
Examples
d = LπDist(0.1, 0.0, 1.0)
logpdf.(d, [-1.0, 0.0, 1.0])
Distributions.logpdf
— Methodlog(pdf) of NπDist
Examples
d = NπDist(0.1, 0.0, 1.0)
logpdf.(d, [-1.0, 0.0, 1.0])
Distributions.logpdf
— Methodlog(pdf) of TπDist
Examples
d = TπDist(0.1, 1.0)
logpdf.(d, [-1.0, 0.0, 1.0])
GenomicBreedingModels.addnorm
— Methodaddnorm(x, y) = (x + y) / 2.0
This function takes two numbers and finds their arithmetic mean. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Inputs and ouput range between zero and one.
GenomicBreedingModels.bayesa
— Methodbayesa(;
genomes::Genomes,
phenomes::Phenomes,
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
n_iter::Int64 = 1_500,
n_burnin::Int64 = 500,
verbose::Bool = false
)::Fit
Fits a Bayes A model for genomic prediction using the BGLR method. Bayes A assumes marker effects follow a scaled-t distribution, allowing for different variances for each marker effect.
Arguments
genomes::Genomes
: Genomic data structure containing allele frequenciesphenomes::Phenomes
: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}
: Optional indices to subset specific entries/individualsidx_loci_alleles::Union{Nothing,Vector{Int64}}
: Optional indices to subset specific genetic markersidx_trait::Int64
: Index of the trait to analyze (default: 1)n_iter::Int64
: Number of iterations for the MCMC algorithm (default: 1,500)n_burnin::Int64
: Number of burn-in iterations to discard (default: 500)verbose::Bool
: Whether to print progress messages (default: false)
Returns
Fit
: A fitted model object containing:model
: Model identifier ("ols")b_hat
: Estimated regression coefficientsb_hat_labels
: Labels for the coefficientstrait
: Name of the analyzed traitentries
: Entry identifierspopulations
: Population identifiersy_true
: Observed phenotypic valuesy_pred
: Predicted phenotypic valuesmetrics
: Dictionary of performance metrics
Notes
- The model assumes a Gaussian distribution for the trait values
- Recommended to check convergence by examining trace plots of key parameters
- The burn-in period should be adjusted based on convergence diagnostics
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> fit = bayesa(genomes=genomes, phenomes=phenomes);
julia> fit.model == "bayesa"
true
julia> fit.metrics["cor"] > 0.50
true
GenomicBreedingModels.bayesb
— Methodbayesb(;
genomes::Genomes,
phenomes::Phenomes,
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
n_iter::Int64 = 1_500,
n_burnin::Int64 = 500,
verbose::Bool = false
)::Fit
Fits a Bayes B model for genomic prediction using the BGLR method. Bayes B assumes that marker effects follow a mixture distribution where some markers have zero effect and others follow a scaled-t distribution.
Arguments
genomes::Genomes
: Genomic data structure containing allele frequenciesphenomes::Phenomes
: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}
: Optional indices to subset specific entries (default: nothing)idx_loci_alleles::Union{Nothing,Vector{Int64}}
: Optional indices to subset specific loci/alleles (default: nothing)idx_trait::Int64
: Index of the trait to analyze (default: 1)n_iter::Int64
: Number of iterations for the MCMC algorithm (default: 1,500)n_burnin::Int64
: Number of burn-in iterations to discard (default: 500)verbose::Bool
: Whether to print progress information (default: false)
Returns
Fit
: A fitted model object containing:model
: Model identifier ("ols")b_hat
: Estimated regression coefficientsb_hat_labels
: Labels for the coefficientstrait
: Name of the analyzed traitentries
: Entry identifierspopulations
: Population identifiersy_true
: Observed phenotypic valuesy_pred
: Predicted phenotypic valuesmetrics
: Dictionary of performance metrics
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> fit = bayesb(genomes=genomes, phenomes=phenomes);
julia> fit.model == "bayesb"
true
julia> fit.metrics["cor"] > 0.50
true
GenomicBreedingModels.bayesc
— Methodbayesc(;
genomes::Genomes,
phenomes::Phenomes,
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
n_iter::Int64 = 1_500,
n_burnin::Int64 = 500,
verbose::Bool = false
)::Fit
Fits a Bayes C model for genomic prediction using the BGLR method. Bayes C assumes that marker effects follow a mixture distribution with a point mass at zero and a normal distribution.
Arguments
genomes::Genomes
: Genomic data structure containing allele frequenciesphenomes::Phenomes
: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}
: Optional indices to subset specific entries/individualsidx_loci_alleles::Union{Nothing,Vector{Int64}}
: Optional indices to subset specific loci/allelesidx_trait::Int64
: Index of the trait to analyze (default: 1)n_iter::Int64
: Number of iterations for MCMC sampling (default: 1,500)n_burnin::Int64
: Number of burn-in iterations to discard (default: 500)verbose::Bool
: Whether to print progress information (default: false)
Returns
Fit
: A fitted model object containing:model
: Model identifier ("ols")b_hat
: Estimated regression coefficientsb_hat_labels
: Labels for the coefficientstrait
: Name of the analyzed traitentries
: Entry identifierspopulations
: Population identifiersy_true
: Observed phenotypic valuesy_pred
: Predicted phenotypic valuesmetrics
: Dictionary of performance metrics
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> fit = bayesc(genomes=genomes, phenomes=phenomes);
julia> fit.model == "bayesc"
true
julia> fit.metrics["cor"] > 0.50
true
GenomicBreedingModels.bayesian
— Methodbayesian(
turing_model::Function;
X::Matrix{Float64},
y::Vector{Float64},
sampler::String = ["NUTS", "HMC", "HMCDA", "MH", "PG"][1],
sampling_method::Int64 = 1,
seed::Int64 = 123,
n_burnin::Int64 = 500,
n_iter::Int64 = 1_500,
verbose::Bool = false,
)::Fit
Fit a Bayesian linear regression models via Turing.jl
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);
julia> phenomes = extractphenomes(trials);
julia> sol = Suppressor.@suppress bayesian(turing_bayesG, genomes=genomes, phenomes=phenomes);
julia> # Slow because not multivariate T-dist: sol_T = Suppressor.@suppress bayesian(turing_bayesT, genomes=genomes, phenomes=phenomes);
julia> # Even slower because of an extra set of distribution to define a non-spherical variance-covariance matrix: sol_Gs = Suppressor.@suppress bayesian(turing_bayesGs, genomes=genomes, phenomes=phenomes);
julia> sol_BGLR = Suppressor.@suppress bayesian("BayesA", genomes=genomes, phenomes=phenomes); sol.metrics["cor"] > sol_BGLR.metrics["cor"]
true
julia> sol.metrics["cor"] > 0.5
true
GenomicBreedingModels.bayesian
— Methodbayesian(
bglr_model::String;
genomes::Genomes,
phenomes::Phenomes,
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
response_type::String = ["gaussian", "ordinal"][1],
n_burnin::Int64 = 500,
n_iter::Int64 = 1_500,
verbose::Bool = false
)::Fit
Fit Bayesian genomic prediction models using the BGLR R package.
Arguments
bglr_model::String
: Type of BGLR model to fit. Options: "BayesA", "BayesB", "BayesC"genomes::Genomes
: Genomic data structure containing allele frequenciesphenomes::Phenomes
: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}
: Optional indices to subset entriesidx_loci_alleles::Union{Nothing,Vector{Int64}}
: Optional indices to subset loci/allelesidx_trait::Int64
: Index of trait to analyze (default: 1)response_type::String
: Type of response variable. Options: "gaussian" or "ordinal" (default: "gaussian")n_burnin::Int64
: Number of burn-in MCMC iterations (default: 500)n_iter::Int64
: Total number of MCMC iterations (default: 1,500)verbose::Bool
: Whether to print progress information (default: false)
Returns
Fit
: Object containing:- Model fit summary
- Estimated genetic effects
- Predicted values
- Model performance metrics
- Input data details
Details
This function provides a Julia interface to the BGLR R package for Bayesian genomic prediction. It uses temporary files to interface with R and automatically cleans up afterward.
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);
julia> phenomes = extractphenomes(trials);
julia> sol = Suppressor.@suppress bayesian("BayesA", genomes=genomes, phenomes=phenomes);
julia> sol.metrics["cor"] > 0.5
true
GenomicBreedingModels.bglr
— Methodbglr(; G::Matrix{Float64}, y::Vector{Float64}, model::String="BayesA",
response_type::String="gaussian", n_iter::Int64=1_500,
n_burnin::Int64=500, verbose::Bool=false)::Vector{Float64}
Perform Bayesian genomic prediction using the BGLR (Bayesian Generalized Linear Regression) R package.
Arguments
G::Matrix{Float64}
: Marker matrix where rows represent individuals and columns represent locus-allele combinationsy::Vector{Float64}
: Vector of phenotypic valuesmodel::String
: Bayesian model type to use. Options: "BayesA", "BayesB", or "BayesC". Default: "BayesA"response_type::String
: Type of response variable. Options: "gaussian" or "ordinal". Default: "gaussian"n_iter::Int64
: Number of iterations for the MCMC chain. Default: 1,500n_burnin::Int64
: Number of burn-in iterations to discard. Default: 500verbose::Bool
: Whether to print progress information. Default: false
Returns
Vector{Float64}
: Estimated effects including the intercept (first element) followed by marker effects
Details
This function creates temporary files to interface with R's BGLR package, runs the Bayesian analysis, and automatically cleans up temporary files afterward. The function uses the system's R installation and requires the BGLR package to be installed in R.
Notes
This is hacky. It invokes Rscript for each instance which should allow multithreading because RCall.jl currently does not allow multithreading.
GenomicBreedingModels.cvbulk
— Methodcvbulk(;
genomes::Genomes,
phenomes::Phenomes,
models::Vector{Function}=[ridge],
n_replications::Int64=5,
n_folds::Int64=5,
seed::Int64=42,
verbose::Bool=true
)::Tuple{Vector{CV}, Vector{String}}
Perform replicated k-fold cross-validation of genomic prediction model(s) across all available traits and entries, ignoring populations.
Arguments
genomes::Genomes
: Genomic data structure containing allele frequenciesphenomes::Phenomes
: Phenotypic data structure containing trait measurementsmodels::Vector{Function}
: Vector of genomic prediction model functions to evaluaten_replications::Int64
: Number of times to repeat k-fold cross-validation randomising k-fold partitioning each time (default: 5)n_folds::Int64
: Number of cross-validation folds (default: 5)seed::Int64
: Random seed for reproducibility (default: 42)verbose::Bool
: Whether to display progress information (default: true)
Returns
- Tuple containing:
- Vector of CV objects with cross-validation results
- Vector of warning messages about skipped cases
Threading
Uses multiple threads if Julia is started with threading enabled. Example startup command: julia --threads 7,1
(7 worker threads, 1 runtime thread)
Notes
- Performs random k-fold partitioning of entries ignoring population structure
- Handles missing/invalid phenotype values
- Validates model inputs and data dimensions
- Returns warnings for cases with insufficient data or zero variance
Example
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false); genomes.populations = StatsBase.sample(string.("pop_", 1:3), length(genomes.entries), replace=true);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, verbose=false);
julia> phenomes = extractphenomes(trials);
julia> cvs, notes = cvbulk(genomes=genomes, phenomes=phenomes, models=[ols, ridge], n_replications=2, n_folds=2, verbose=false);
julia> df_across_entries, df_per_entry = tabularise(cvs);
julia> idx_across = findall((df_across_entries.trait .== "trait_1") .&& (df_across_entries.model .== "ridge") .&& (df_across_entries.replication .== "replication_1") .&& (df_across_entries.fold .== "fold_1"));
julia> idx_per = findall((df_per_entry.trait .== "trait_1") .&& (df_per_entry.model .== "ridge") .&& (df_per_entry.replication .== "replication_1") .&& (df_per_entry.fold .== "fold_1"));
julia> abs(df_across_entries.cor[idx_across][1] - cor(df_per_entry.y_true[idx_per], df_per_entry.y_pred[idx_per])) < 1e-10
true
GenomicBreedingModels.cvleaveonepopulationout
— Methodcvleaveonepopulationout(;
genomes::Genomes,
phenomes::Phenomes,
models::Vector{Function}=[ridge],
n_replications::Int64=5,
n_folds::Int64=5,
seed::Int64=42,
verbose::Bool=true
)::Tuple{Vector{CV}, Vector{String}}
Performs leave-one-population-out cross-validation for genomic prediction models across all available traits.
Arguments
genomes::Genomes
: Genomic data structure containing allele frequenciesphenomes::Phenomes
: Phenotypic data structure containing trait measurementsmodels::Vector{Function}
: Vector of model functions to evaluate (default: [ridge])n_replications::Int64
: Number of replications (not used in this implementation)n_folds::Int64
: Number of folds (not used in this implementation)seed::Int64
: Random seed (not used in this implementation)verbose::Bool
: If true, displays progress information during execution
Returns
Tuple{Vector{CV}, Vector{String}}
: Returns a tuple containing:- Vector of CV objects with cross-validation results
- Vector of warning/error messages for skipped validations
Details
The function implements a leave-one-population-out cross-validation strategy where:
- For each trait and population combination:
- Uses one population as validation set
- Uses remaining populations as training set
- Evaluates multiple genomic prediction models
- Handles missing data and variance checks
- Supports parallel processing via Julia's multi-threading
Threading
To utilize multiple threads, start Julia with: julia --threads n,1
where n is the desired number of threads for computation and 1 is reserved for the runtime.
Supported Models
- ols (Ordinary Least Squares)
- ridge (Ridge Regression)
- lasso (Lasso Regression)
- bayesa (Bayes A)
- bayesb (Bayes B)
- bayesc (Bayes C)
Example
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false); genomes.populations = StatsBase.sample(string.("pop_", 1:3), length(genomes.entries), replace=true);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);
julia> phenomes = extractphenomes(trials);
julia> cvs, notes = cvleaveonepopulationout(genomes=genomes, phenomes=phenomes, models=[ridge, bayesa], n_replications=2, n_folds=2, verbose=false);
julia> df_across_entries, df_per_entry = tabularise(cvs);
julia> sum([sum(split(df_across_entries.training_population[i], ";") .== df_across_entries.validation_population[i]) for i in 1:size(df_across_entries, 1)]) == 0
true
julia> idx_across = findall((df_across_entries.validation_population .== "pop_1") .&& (df_across_entries.trait .== "trait_1") .&& (df_across_entries.model .== "ridge"));
julia> idx_per = findall((df_per_entry.validation_population .== "pop_1") .&& (df_per_entry.trait .== "trait_1") .&& (df_per_entry.model .== "ridge"));
julia> abs(df_across_entries.cor[idx_across][1] - cor(df_per_entry.y_true[idx_per], df_per_entry.y_pred[idx_per])) < 1e-10
true
GenomicBreedingModels.cvmultithread!
— Methodcvmultithread!(cvs::Vector{CV}; genomes::Genomes, phenomes::Phenomes, models_vector, verbose::Bool = true)::Vector{CV}
Perform multi-threaded genomic prediction cross-validation using specified models.
Arguments
cvs::Vector{CV}
: Vector of cross-validation objects to be processedgenomes::Genomes
: Genomic data structure containing allele frequenciesphenomes::Phenomes
: Phenotypic data structure containing trait measurementsmodels_vector
: Vector of model functions to be used for prediction (e.g., [ridge, bayesa])verbose::Bool=true
: Whether to display progress bar during computation
Returns
- Modified
cvs
vector with updated cross-validation results
Threading
Requires Julia to be started with multiple threads to utilize parallel processing. Example startup command: julia --threads 7,1
(7 worker threads, 1 runtime thread)
Details
The function performs cross-validation in parallel for each CV object using the corresponding model from the models_vector. For each fold:
- Extracts training and validation indices
- Fits the specified model using training data
- Validates the model using validation data
- Updates the CV object with prediction results
Example
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);
julia> phenomes = extractphenomes(trials);
julia> idx_training = collect(1:50);
julia> idx_validation_1 = collect(51:75);
julia> idx_validation_2 = collect(76:100);
julia> fit = Fit(n = length(idx_training), l = length(genomes.loci_alleles) + 1); fit.model = "ridge"; fit.trait = "trait_1";
julia> fit.entries = genomes.entries[idx_training]; fit.populations = genomes.populations[idx_training];
julia> fit.b_hat_labels = vcat(["intercept"], genomes.loci_alleles);
julia> cv_1 = CV("replication_1", "fold_1", fit, genomes.populations[idx_validation_1], genomes.entries[idx_validation_1], zeros(length(idx_validation_1)), zeros(length(idx_validation_1)), fit.metrics);
julia> cv_2 = CV("replication_1", "fold_2", fit, genomes.populations[idx_validation_2], genomes.entries[idx_validation_2], zeros(length(idx_validation_2)), zeros(length(idx_validation_2)), fit.metrics);
julia> cvs = [cv_1, cv_2]; models = [ridge, ridge];
julia> cvmultithread!(cvs, genomes=genomes, phenomes=phenomes, models_vector=[ridge, bayesa], verbose=false);
julia> df_across_entries, df_per_entry = tabularise(cvs);
julia> idx_across = findall(df_across_entries.fold .== "fold_2");
julia> idx_per = findall(df_per_entry.fold .== "fold_2");
julia> abs(df_across_entries.cor[idx_across][1] - cor(df_per_entry.y_true[idx_per], df_per_entry.y_pred[idx_per])) < 1e-10
true
GenomicBreedingModels.cvpairwisepopulation
— Methodcvpairwisepopulation(;
genomes::Genomes,
phenomes::Phenomes,
models::Vector{Function}=[ridge],
n_replications::Int64=5,
n_folds::Int64=5,
seed::Int64=42,
verbose::Bool=true
)::Tuple{Vector{CV}, Vector{String}}
Performs pairwise cross-validation between populations for genomic prediction models.
Arguments
genomes::Genomes
: Genomic data structure containing allele frequenciesphenomes::Phenomes
: Phenotypic data structure containing trait measurementsmodels::Vector{Function}
: Vector of genomic prediction model functions to evaluate (default: [ridge])n_replications::Int64
: Number of replications (unused, maintained for API consistency)n_folds::Int64
: Number of folds (unused, maintained for API consistency)seed::Int64
: Random seed (unused, maintained for API consistency)verbose::Bool
: Whether to display progress information
Returns
Tuple{Vector{CV}, Vector{String}}
:- Vector of CV objects containing cross-validation results
- Vector of warning messages for skipped validations
Details
For each pair of populations (pop1, pop2):
- Uses pop1 as training set and pop2 as validation set
- Skips within-population validation (pop1==pop2)
- Evaluates each model on all available traits
- Handles missing/invalid phenotype values
- Validates model inputs and data dimensions
Threading
Requires Julia to be started with multiple threads: julia --threads n,1
where n is number of worker threads
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false); genomes.populations = StatsBase.sample(string.("pop_", 1:3), length(genomes.entries), replace=true);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);
julia> phenomes = extractphenomes(trials);
julia> cvs, notes = cvpairwisepopulation(genomes=genomes, phenomes=phenomes, models=[ols, ridge], n_replications=2, n_folds=2, verbose=false);
julia> df_across_entries, df_per_entry = tabularise(cvs);
julia> sum(df_across_entries.training_population .!= df_across_entries.validation_population) == size(df_across_entries, 1)
true
julia> idx_across = findall((df_across_entries.training_population .== "pop_1") .&& (df_across_entries.validation_population .== "pop_2") .&& (df_across_entries.trait .== "trait_1") .&& (df_across_entries.model .== "ridge"));
julia> idx_per = findall((df_per_entry.training_population .== "pop_1") .&& (df_per_entry.validation_population .== "pop_2") .&& (df_per_entry.trait .== "trait_1") .&& (df_per_entry.model .== "ridge"));
julia> abs(df_across_entries.cor[idx_across][1] - cor(df_per_entry.y_true[idx_per], df_per_entry.y_pred[idx_per])) < 1e-10
true
GenomicBreedingModels.cvperpopulation
— Methodcvperpopulation(;
genomes::Genomes,
phenomes::Phenomes,
models::Vector{Function}=[ridge],
n_replications::Int64=5,
n_folds::Int64=5,
seed::Int64=42,
verbose::Bool=true
)::Tuple{Vector{CV}, Vector{String}}
Performs within-population replicated cross-validation of genomic prediction models across all available traits.
Arguments
genomes::Genomes
: Genomic data structure containing allele frequenciesphenomes::Phenomes
: Phenotypic data structure containing trait measurementsmodels::Vector{Function}=[ridge]
: Vector of genomic prediction model functions to evaluaten_replications::Int64=5
: Number of replications for cross-validationn_folds::Int64=5
: Number of folds for k-fold cross-validationseed::Int64=42
: Random seed for reproducibilityverbose::Bool=true
: Whether to print progress information
Returns
Tuple{Vector{CV}, Vector{String}}
: A tuple containing:- Vector of CV objects with cross-validation results
- Vector of notes/warnings generated during the process
Details
The function performs separate cross-validations for each unique population in the dataset. Supports multiple genomic prediction models including:
ols
: Ordinary Least Squaresridge
: Ridge Regressionlasso
: Lasso Regressionbayesa
: Bayes Abayesb
: Bayes Bbayesc
: Bayes C
Threading
To use multiple threads, invoke Julia with: julia --threads n,1
where n is the desired number of threads for multi-threaded processes and 1 is reserved for the Julia runtime.
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false); genomes.populations = StatsBase.sample(string.("pop_", 1:3), length(genomes.entries), replace=true);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);
julia> phenomes = extractphenomes(trials);
julia> cvs, notes = cvperpopulation(genomes=genomes, phenomes=phenomes, models=[ridge, bayesa], n_replications=2, n_folds=2, verbose=false);
julia> df_across_entries, df_per_entry = tabularise(cvs);
julia> sort(unique(df_across_entries.training_population))
3-element Vector{String}:
"pop_1"
"pop_2"
"pop_3"
julia> df_across_entries.training_population == df_across_entries.validation_population
true
julia> idx_across = findall((df_across_entries.validation_population .== "pop_1") .&& (df_across_entries.trait .== "trait_1") .&& (df_across_entries.model .== "bayesa") .&& (df_across_entries.replication .== "replication_1") .&& (df_across_entries.fold .== "fold_1"));
julia> idx_per = findall((df_per_entry.validation_population .== "pop_1") .&& (df_per_entry.trait .== "trait_1") .&& (df_per_entry.model .== "bayesa") .&& (df_per_entry.replication .== "replication_1") .&& (df_per_entry.fold .== "fold_1"));
julia> abs(df_across_entries.cor[idx_across][1] - cor(df_per_entry.y_true[idx_per], df_per_entry.y_pred[idx_per])) < 1e-10
true
julia> summary_across, summary_per_entry = summarise(cvs);
julia> size(summary_across)
(6, 8)
julia> size(summary_per_entry)
(200, 8)
GenomicBreedingModels.epistasisfeatures
— Methodepistasisfeatures(
genomes::Genomes,
phenomes::Phenomes;
idx_trait::Int64 = 1,
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
transformations1::Vector{Function} = [square, invoneplus, log10epsdivlog10eps],
transformations2::Vector{Function} = [mult, addnorm, raise],
n_new_features_per_transformation::Int64 = 1_000,
n_reps::Int64 = 3,
verbose::Bool = false
)::Genomes
Generate epistasis features by applying various transformations to genomic data.
Arguments
genomes::Genomes
: Genomic data structure containing allele frequenciesphenomes::Phenomes
: Phenotypic data structure containing trait measurementsidx_trait::Int64
: Index of the trait to analyze (default: 1)idx_entries::Union{Nothing,Vector{Int64}}
: Indices of entries to include (default: all)idx_loci_alleles::Union{Nothing,Vector{Int64}}
: Indices of loci/alleles to include (default: all)transformations1::Vector{Function}
: Single-input transformation functions (default: [square, invoneplus, log10epsdivlog10eps])transformations2::Vector{Function}
: Two-input transformation functions (default: [mult, addnorm, raise])n_new_features_per_transformation::Int64
: Number of new features to generate per transformation (default: 1_000)n_reps::Int64
: Number of times to repeat the transformation process (default: 3)verbose::Bool
: Whether to display progress information (default: false)
Returns
Genomes
: Enhanced genomic data structure with additional epistasis features
Description
This function generates epistasis features by applying two types of transformations:
- Single-input transformations (transformations1) applied to individual genomic features
- Two-input transformations (transformations2) applied to pairs of genomic features
The transformations are repeated n_reps
times to create a rich set of derived features.
Notes
- Ensures all generated features are within [0,1] range
- Maintains dimensional consistency of input/output structures
- Automatically handles entry and loci/allele subsetting
- Validates input data structures before processing
Example
julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> genomes_plus_features = epistasisfeatures(genomes, phenomes, n_new_features_per_transformation=50, n_reps=2, verbose=false);
julia> cvs, notes = cvbulk(genomes=genomes_plus_features, phenomes=phenomes, models=[ridge, lasso, bayesa], verbose=false);
julia> cvs_no_epi, notes_no_epi = cvbulk(genomes=genomes, phenomes=phenomes, models=[ridge, lasso, bayesa], verbose=false);
julia> df_across, df_per_entry = GenomicBreedingCore.tabularise(cvs);
julia> df_across_no_epi, df_per_entry_no_epi = GenomicBreedingCore.tabularise(cvs_no_epi);
julia> df_summary = combine(groupby(df_across, [:trait, :model]), [[:cor] => mean, [:cor] => std]);
julia> df_summary_no_epi = combine(groupby(df_across_no_epi, [:trait, :model]), [[:cor] => mean, [:cor] => std]);
julia> mean(df_summary.cor_mean) > mean(df_summary_no_epi.cor_mean)
true
GenomicBreedingModels.extractxyetc
— Methodextractxyetc(
genomes::Genomes,
phenomes::Phenomes;
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
add_intercept::Bool = true
) -> Tuple{Matrix{Float64}, Vector{Float64}, Vector{String}, Vector{String}, Vector{String}}
Extract data matrices and vectors from genomic and phenotypic data for statistical analyses.
Arguments
genomes::Genomes
: Genomic data structure containing allele frequenciesphenomes::Phenomes
: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}
: Optional indices to select specific entries (default: all entries)idx_loci_alleles::Union{Nothing,Vector{Int64}}
: Optional indices to select specific loci-alleles (default: all loci-alleles)idx_trait::Int64
: Index of the trait to analyze (default: 1)add_intercept::Bool
: Whether to add an intercept column to the design matrix (default: true)
Returns
A tuple containing:
X::Matrix{Float64}
: Design matrix with allele frequencies (and intercept if specified)y::Vector{Float64}
: Response vector with phenotypic valuesentries::Vector{String}
: Names of the selected entriespopulations::Vector{String}
: Population identifiers for the selected entriesloci_alleles::Vector{String}
: Names of the selected loci-alleles
Notes
- The function filters out entries with missing, NaN, or infinite phenotype values
- Requires at least 2 valid entries after filtering
- Checks for non-zero variance in the trait values
- Ensures consistency between genomic and phenotypic data dimensions
- Validates all index inputs are within bounds
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> X, y, loci_alleles = extractxyetc(genomes, phenomes);
julia> X == hcat(ones(length(phenomes.entries)), genomes.allele_frequencies)
true
julia> y == phenomes.phenotypes[:, 1]
true
GenomicBreedingModels.gwaslmm
— Methodgwaslmm(
genomes::Genomes,
phenomes::Phenomes;
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
GRM_type::String = ["simple", "ploidy-aware"][1],
verbose::Bool = false
)::Fit
Perform genome-wide association analysis using a linear mixed model (LMM) approach.
Arguments
genomes::Genomes
: Genomic data structure containing allele frequenciesphenomes::Phenomes
: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}
: Optional indices for subsetting entriesidx_loci_alleles::Union{Nothing,Vector{Int64}}
: Optional indices for subsetting loci/allelesidx_trait::Int64
: Index of the trait to analyze (default: 1)GRM_type::String
: Type of genetic relationship matrix to use:- "simple": Standard GRM calculation
- "ploidy-aware": Ploidy-adjusted GRM calculation
verbose::Bool
: Whether to display progress and plots (default: false)
Returns
Fit
: A structure containing GWAS results including:model
: Model identifier ("GWAS_LMM")b_hat
: Vector of test statistics (z-scores) for genetic markers
Details
The function implements a mixed model GWAS using the first principal component of the genetic relationship matrix (GRM) as a fixed effect to control for population structure. The model includes random effects for entries and uses REML estimation.
Notes
- Handles both diploid and polyploid data through the
GRM_type
parameter - Uses multi-threading for parallel computation of marker effects
- Includes automatic convergence retry on fitting failures
- Maximum fitting time per marker is limited to 60 seconds
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> ploidy = 4;
julia> genomes.allele_frequencies = round.(genomes.allele_frequencies .* ploidy) ./ ploidy;
julia> proportion_of_variance = zeros(9, 1); proportion_of_variance[1, 1] = 0.5;
julia> trials, effects = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.05 0.00 0.00;], proportion_of_variance = proportion_of_variance, verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> fit_1 = Suppressor.@suppress gwaslmm(genomes=genomes, phenomes=phenomes, GRM_type="simple");
julia> fit_1.model
"GWAS_LMM"
julia> fit_2 = Suppressor.@suppress gwaslmm(genomes=genomes, phenomes=phenomes, GRM_type="ploidy-aware");
julia> fit_2.model
"GWAS_LMM"
julia> findall(fit_1.b_hat .== maximum(fit_1.b_hat)) == findall(fit_2.b_hat .== maximum(fit_2.b_hat))
true
GenomicBreedingModels.gwasols
— Methodgwasols(
genomes::Genomes,
phenomes::Phenomes;
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
GRM_type::String = ["simple", "ploidy-aware"][1],
verbose::Bool = false,
)::Fit
Perform genome-wide association study (GWAS) using ordinary least squares (OLS) regression with population structure correction.
Arguments
genomes::Genomes
: Genomic data structure containing allele frequenciesphenomes::Phenomes
: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}
: Optional indices to subset entries (default: all entries)idx_loci_alleles::Union{Nothing,Vector{Int64}}
: Optional indices to subset loci/alleles (default: all loci)idx_trait::Int64
: Index of the trait to analyze (default: 1)GRM_type::String
: Type of genetic relationship matrix to use ("simple" or "ploidy-aware") (default: "simple")verbose::Bool
: Whether to display progress and plots (default: false)
Returns
Fit
: A structure containing GWAS results including:model
: Model identifier ("GWAS_OLS")b_hat
: Vector of effect size estimates/t-statistics for each marker- Additional model information
Details
The function implements GWAS using OLS regression while accounting for population structure through the first principal component of the genetic relationship matrix (GRM) as a covariate. Two types of GRM can be used: "simple" assumes diploid organisms, while "ploidy-aware" accounts for different ploidy levels.
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> ploidy = 4;
julia> genomes.allele_frequencies = round.(genomes.allele_frequencies .* ploidy) ./ ploidy;
julia> proportion_of_variance = zeros(9, 1); proportion_of_variance[1, 1] = 0.5;
julia> trials, effects = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.05 0.00 0.00;], proportion_of_variance = proportion_of_variance, verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> fit_1 = gwasols(genomes=genomes, phenomes=phenomes, GRM_type="simple");
julia> fit_1.model
"GWAS_OLS"
julia> fit_2 = gwasols(genomes=genomes, phenomes=phenomes, GRM_type="ploidy-aware");
julia> fit_2.model
"GWAS_OLS"
julia> findall(fit_1.b_hat .== maximum(fit_1.b_hat)) == findall(fit_2.b_hat .== maximum(fit_2.b_hat))
true
GenomicBreedingModels.gwasprep
— Methodgwasprep(
genomes::Genomes,
phenomes::Phenomes;
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
GRM_type::String = ["simple", "ploidy-aware"][1],
standardise::Bool = true,
verbose::Bool = false,
)::Tuple{Matrix{Float64},Vector{Float64},Matrix{Float64},Fit}
Prepare data matrices and structures for genome-wide association studies (GWAS).
Arguments
genomes::Genomes
: Genomic data structure containing allele frequenciesphenomes::Phenomes
: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}
: Optional indices to subset entries (default: all entries)idx_loci_alleles::Union{Nothing,Vector{Int64}}
: Optional indices to subset loci/alleles (default: all loci)idx_trait::Int64
: Index of the trait to analyze (default: 1)GRM_type::String
: Type of genetic relationship matrix to use ("simple" or "ploidy-aware") (default: "simple")standardise::Bool
: Whether to standardize the data matrices (default: true)verbose::Bool
: Whether to print progress information (default: false)
Returns
A tuple containing:
G::Matrix{Float64}
: Standardized allele frequency matrixy::Vector{Float64}
: Standardized phenotype vectorGRM::Matrix{Float64}
: Genetic relationship matrixfit::Fit
: Initialized Fit struct for GWAS results
Details
- Performs data validation and preprocessing for GWAS analysis
- Removes fixed loci with no variation
- Standardizes genomic and phenotypic data if requested
- Constructs appropriate genetic relationship matrix
- Initializes output structure for association results
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> ploidy = 4;
julia> genomes.allele_frequencies = round.(genomes.allele_frequencies .* ploidy) ./ ploidy;
julia> proportion_of_variance = zeros(9, 1); proportion_of_variance[1, 1] = 0.5;
julia> trials, effects = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.05 0.00 0.00;], proportion_of_variance = proportion_of_variance, verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> G, y, GRM, fit = gwasprep(genomes=genomes, phenomes=phenomes);
julia> sum(abs.(mean(G, dims=1)[1,:]) .< 1e-10) == size(G, 2)
true
julia> sum(abs.(std(G, dims=1)[1,:] .- 1) .< 1e-10) == size(G, 2)
true
julia> (abs(mean(y)) < 1e-10, abs(std(y) - 1) < 1e-10)
(true, true)
julia> size(G, 1) == length(y)
true
julia> (size(G, 1), length(y)) == size(GRM)
true
julia> length(fit.entries) == length(y)
true
julia> length(fit.b_hat) == size(G, 2)
true
GenomicBreedingModels.gwasreml
— Methodgwasreml(
genomes::Genomes,
phenomes::Phenomes;
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
GRM_type::String = "simple",
verbose::Bool = false
)::Fit
Performs genome-wide association analysis using restricted maximum likelihood estimation (REML).
Arguments
genomes::Genomes
: Genomic data structure containing allele frequenciesphenomes::Phenomes
: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}
: Optional indices to subset entries (default: nothing)idx_loci_alleles::Union{Nothing,Vector{Int64}}
: Optional indices to subset loci/alleles (default: nothing)idx_trait::Int64
: Index of the trait to analyze (default: 1)GRM_type::String
: Type of genetic relationship matrix to use, either "simple" or "ploidy-aware" (default: "simple")verbose::Bool
: Whether to display progress and plots (default: false)
Returns
::Fit
: A Fit struct containing GWAS results, including effect estimates and test statistics
Details
Implements the REML log-likelihood calculation for a mixed model of the form: y = Xβ + Zu + e where:
- β are fixed effects
- u are random genetic effects with u ~ N(0, σ²_u * GRM)
- e are residual effects with e ~ N(0, σ²_e * I)
The function constructs the variance-covariance matrices and computes the REML transformation to obtain the log-likelihood value used in variance component estimation.
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false);
julia> ploidy = 4;
julia> genomes.allele_frequencies = round.(genomes.allele_frequencies .* ploidy) ./ ploidy;
julia> proportion_of_variance = zeros(9, 1); proportion_of_variance[1, 1] = 0.5;
julia> trials, effects = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.05 0.00 0.00;], proportion_of_variance = proportion_of_variance, verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> fit_1 = gwasreml(genomes=genomes, phenomes=phenomes, GRM_type="simple");
julia> fit_1.model
"GWAS_REML"
julia> fit_2 = gwasreml(genomes=genomes, phenomes=phenomes, GRM_type="ploidy-aware");
julia> fit_2.model
"GWAS_REML"
julia> findall(fit_1.b_hat .== maximum(fit_1.b_hat)) == findall(fit_2.b_hat .== maximum(fit_2.b_hat))
true
GenomicBreedingModels.heritabilitynarrow_sense
— Methodheritabilitynarrow_sense(y_true::Vector{Float64}, y_pred::Vector{Float64})::Float64
Calculate narrow-sense heritability (h²) from true and predicted phenotypic values.
Narrow-sense heritability is the proportion of phenotypic variance that can be attributed to additive genetic effects. It is calculated as the ratio of additive genetic variance (s²a) to total phenotypic variance (s²a + s²e).
Arguments
y_true::Vector{Float64}
: Vector of observed/true phenotypic valuesy_pred::Vector{Float64}
: Vector of predicted genetic values
Returns
Float64
: Narrow-sense heritability (h²) value between 0 and 1
Details
- Returns 0.0 if variance of either input vector is near zero (< 1e-10)
- Additive genetic variance (s²a) is estimated from variance of predictions
- Environmental variance (s²e) is estimated from variance of residuals
- Result is bounded between 0 and 1
GenomicBreedingModels.invoneplus
— Methodinvoneplus(x) = 1 / (1 + x)
This function takes a number, adds one and inverts it. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Both input and ouput range between zero and one.
GenomicBreedingModels.lasso
— Methodlasso(;
genomes::Genomes,
phenomes::Phenomes,
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
verbose::Bool = false,
)::Fit
Fits a LASSO (Least Absolute Shrinkage and Selection Operator) regression model with L1 regularization for genomic prediction.
Arguments
genomes::Genomes
: Genomic data structure containing allele frequenciesphenomes::Phenomes
: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}
: Optional indices to select specific entries/individualsidx_loci_alleles::Union{Nothing,Vector{Int64}}
: Optional indices to select specific loci-allelesidx_trait::Int64
: Index of the trait to analyze (default: 1)verbose::Bool
: If true, prints diagnostic plots and additional information (default: false)
Returns
Fit
: A fitted model object containing:model
: Model identifier ("ols")b_hat
: Estimated regression coefficientsb_hat_labels
: Labels for the coefficientstrait
: Name of the analyzed traitentries
: Entry identifierspopulations
: Population identifiersy_true
: Observed phenotypic valuesy_pred
: Predicted phenotypic valuesmetrics
: Dictionary of performance metrics
Details
The function implements LASSO regression using the GLMNet package, which performs automatic feature selection through L1 regularization. The optimal regularization parameter (λ) is selected using cross-validation to minimize prediction error while promoting sparsity in the coefficients.
Notes
- Standardisation is disabled by default because the allele frequencies across loci are comparable as they all range from zero to one
- The model includes an intercept term
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> fit = lasso(genomes=genomes, phenomes=phenomes);
julia> fit.model == "lasso"
true
julia> fit.metrics["cor"] > 0.50
true
GenomicBreedingModels.log10epsdivlog10eps
— Methodlog10epsdivlog10eps(x) = (log10(abs(x) + eps(Float64))) / log10(eps(Float64))
This function takes a number, adds a very tiny value, takes the log10, and divide it by the log10 of the same very tiny value. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Both input and ouput range between zero and one.
GenomicBreedingModels.loglikreml
— Methodloglikreml(θ::Vector{Float64}, data::Tuple{Vector{Float64},Matrix{Float64},Matrix{Float64}})::Float64
Calculate the restricted maximum likelihood (REML) log-likelihood for a mixed linear model.
Arguments
θ::Vector{Float64}
: Vector of variance components [σ²e, σ²u] where:- σ²_e is the residual variance
- σ²_u is the genetic variance
data::Tuple{Vector{Float64},Matrix{Float64},Matrix{Float64}}
: Tuple containing:- y: Vector of phenotypic observations
- X: Design matrix for fixed effects
- GRM: Genomic relationship matrix
Returns
Float64
: The REML log-likelihood value. ReturnsInf
if matrix operations fail.
Details
Implements the REML log-likelihood calculation for a mixed model of the form: y = Xβ + Zu + e where:
- β are fixed effects
- u are random genetic effects with u ~ N(0, σ²_u * GRM)
- e are residual effects with e ~ N(0, σ²_e * I)
The function constructs the variance-covariance matrices and computes the REML transformation to obtain the log-likelihood value used in variance component estimation.
Examples
```jldoctest; setup = :(using GenomicBreedingCore, GenomicBreedingModels, LinearAlgebra, StatsBase) julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> ploidy = 4;
julia> genomes.allelefrequencies = round.(genomes.allelefrequencies .* ploidy) ./ ploidy;
julia> proportionofvariance = zeros(9, 1); proportionofvariance[1, 1] = 0.5;
julia> trials, effects = GenomicBreedingCore.simulatetrials(genomes=genomes, nyears=1, nseasons=1, nharvests=1, nsites=1, nreplications=1, fadddomepi=[0.05 0.00 0.00;], proportionofvariance = proportionofvariance, verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> G, y, GRM, fit = gwasprep(genomes=genomes, phenomes=phenomes);
julia> loglik = loglikreml([0.53, 0.15], (y, hcat(ones(length(y)), G[:, 1]), GRM));
julia> loglik < 100 true
GenomicBreedingModels.metrics
— Methodmetrics(y_true::Vector{Float64}, y_pred::Vector{Float64})::Dict{String,Float64}
Calculate various metrics comparing predicted vs true values.
Returns a dictionary containing the following metrics:
cor
: Pearson correlation coefficientmad
: Mean absolute deviationmsd
: Mean squared deviationrmsd
: Root mean squared deviationnrmsd
: Normalized root mean squared deviationeuc
: Euclidean distancejac
: Jaccard distancetvar
: Total variation distanceh²
: Narrow-sense heritability
Arguments
y_true::Vector{Float64}
: Vector of true/observed valuesy_pred::Vector{Float64}
: Vector of predicted values
Returns
Dict{String,Float64}
: Dictionary mapping metric names to their calculated values
GenomicBreedingModels.mlp
— Methodmlp(;
genomes::Genomes,
phenomes::Phenomes,
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
n_layers::Int64 = 3,
activation::Function = relu,
max_n_nodes::Int64 = 256,
n_nodes_droprate::Float64 = 0.50,
dropout_droprate::Float64 = 0.25,
n_epochs::Int64 = 100_000,
use_cpu::Bool = false,
seed::Int64 = 123,
verbose::Bool = false
)::Fit
Train a multi-layer perceptron (MLP) model for genomic prediction.
Arguments
genomes::Genomes
: AGenomes
struct containing the genomic data.phenomes::Phenomes
: APhenomes
struct containing the phenomic data.idx_entries::Union{Nothing, Vector{Int64}}
: Indices of entries to include in the model. Ifnothing
, all entries are included. Default isnothing
.idx_loci_alleles::Union{Nothing, Vector{Int64}}
: Indices of loci-alleles to include in the model. Ifnothing
, all loci-alleles are included. Default isnothing
.idx_trait::Int64
: Index of the trait to predict. Default is 1.n_layers::Int64
: Number of layers in the MLP. Default is 3.activation::Function
: Activation function to use in the MLP. Default isrelu
.max_n_nodes::Int64
: Maximum number of nodes in each layer. Default is 256.n_nodes_droprate::Float64
: Drop rate for the number of nodes in each layer. Default is 0.50.dropout_droprate::Float64
: Dropout rate for the layers. Default is 0.25.n_epochs::Int64
: Number of training epochs. Default is 100,000.use_cpu::Bool
: Iftrue
, use CPU for training. Iffalse
, use GPU if available. Default isfalse
.seed::Int64
: Random seed for reproducibility. Default is 123.verbose::Bool
: Iftrue
, prints detailed progress information during training. Default isfalse
.
Returns
Fit
: AFit
struct containing the trained MLP model and performance metrics.
Details
This function trains a multi-layer perceptron (MLP) model on genomic and phenomic data. The function performs the following steps:
- Set Random Seed: Sets the random seed for reproducibility.
- Extract Features and Targets: Extracts the feature matrix
X
, target vectory
, and other relevant information from the genomic and phenomic data. - Instantiate Output Fit: Creates a
Fit
struct to store the model and results. - Construct MLP Layers: Constructs the MLP layers based on the specified number of layers, activation function, and dropout rates.
- Move Data to Device: Moves the data to the appropriate device (CPU or GPU).
- Setup Training State: Initializes the training state with the model parameters and optimizer.
- Train the Model: Trains the MLP model for the specified number of epochs, printing progress if
verbose
istrue
. - Evaluate Performance: Evaluates the model's performance using the specified metrics.
- Output: Returns the
Fit
struct containing the trained model and performance metrics.
Notes
- The function uses the Lux library for constructing and training the MLP model.
- The
verbose
option provides additional insights into the training process by printing progress information. - The function ensures that the trained model and performance metrics are stored in the
Fit
struct.
Throws
ArgumentError
: If theGenomes
orPhenomes
struct is corrupted or if any of the arguments are out of range.ErrorException
: If an error occurs during model training or evaluation.
Example
julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);
julia> phenomes = extractphenomes(trials);
julia> fit_cpu = Suppressor.@suppress mlp(genomes=genomes, phenomes=phenomes, n_epochs=1_000, use_cpu=true, verbose=false);
julia> fit_gpu = Suppressor.@suppress mlp(genomes=genomes, phenomes=phenomes, n_epochs=1_000, use_cpu=false, verbose=false);
julia> fit_cpu.metrics["cor"] >= 0.2
true
julia> fit_gpu.metrics["cor"] >= 0.2
true
GenomicBreedingModels.mult
— Methodmult(x, y) = x * y
This function takes two numbers and multiplies them together. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Inputs and ouput range between zero and one.
GenomicBreedingModels.ols
— Methodols(;
genomes::Genomes,
phenomes::Phenomes,
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
verbose::Bool = false
)::Fit
Fits an ordinary least squares (OLS) regression model to genomic and phenotypic data.
Arguments
genomes::Genomes
: Genomic data structure containing allele frequenciesphenomes::Phenomes
: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}
: Optional indices to select specific entries (default: all entries)idx_loci_alleles::Union{Nothing,Vector{Int64}}
: Optional indices to select specific loci-alleles (default: all loci-alleles)idx_trait::Int64
: Index of the trait to analyze (default: 1)verbose::Bool
: If true, displays diagnostic plots and performance metrics (default: false)
Returns
Fit
: A fitted model object containing:model
: Model identifier ("ols")b_hat
: Estimated regression coefficientsb_hat_labels
: Labels for the coefficientstrait
: Name of the analyzed traitentries
: Entry identifierspopulations
: Population identifiersy_true
: Observed phenotypic valuesy_pred
: Predicted phenotypic valuesmetrics
: Dictionary of performance metrics
Description
Performs ordinary least squares regression on genomic data to predict phenotypic values. The model includes an intercept term and estimates effects for each locus-allele combination.
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> fit = ols(genomes=genomes, phenomes=phenomes);
julia> fit.model == "ols"
true
julia> fit.metrics["cor"] > 0.50
true
GenomicBreedingModels.pearsonscorrelation
— Methodpearsonscorrelation(y_true::Vector{Float64}, y_pred::Vector{Float64})::Float64
Calculate the Pearson correlation coefficient between two vectors.
The Pearson correlation coefficient measures the linear correlation between two variables, giving a value between -1 and +1, where:
- +1 represents perfect positive correlation
- 0 represents no linear correlation
- -1 represents perfect negative correlation
Arguments
y_true::Vector{Float64}
: Vector of true/actual valuesy_pred::Vector{Float64}
: Vector of predicted/estimated values
Returns
Float64
: Pearson correlation coefficient
Notes
- Returns 0.0 if the variance of either input vector is less than 1e-10
- Uses
1 - correlation distance
from Distances.jl package
GenomicBreedingModels.predict
— Methodpredict(; fit::Fit, genomes::Genomes, idx_entries::Vector{Int64})::Vector{Float64}
Calculate predicted phenotypes using a fitted genomic prediction model.
Arguments
fit::Fit
: A fitted genomic prediction model containing coefficients and model informationgenomes::Genomes
: Genomic data containing allele frequenciesidx_entries::Vector{Int64}
: Vector of indices specifying which entries to predict
Returns
Vector{Float64}
: Predicted phenotypic values for the specified entries
Details
Supports various linear genomic prediction models including:
- OLS (Ordinary Least Squares)
- Ridge Regression
- LASSO
- Bayes A
- Bayes B
- Bayes C
The function validates input dimensions and compatibility between the fitted model and genomic data before making predictions.
Throws
ArgumentError
: If the Fit or Genomes structs are corruptedArgumentError
: If entry indices are out of boundsArgumentError
: If loci-alleles in the fitted model don't match the validation setArgumentError
: If the genomic prediction model is not recognized
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);
julia> phenomes = extractphenomes(trials);
julia> fit = ridge(genomes=genomes, phenomes=phenomes, idx_entries=collect(1:90));
julia> y_hat = GenomicBreedingModels.predict(fit=fit, genomes=genomes, idx_entries=collect(91:100));
julia> cor(phenomes.phenotypes[91:100, 1], y_hat) > 0.5
true
GenomicBreedingModels.raise
— Methodraise(x, y) = x^y
This function raises the first value to the power of the second value. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Inputs and ouput range between zero and one.
GenomicBreedingModels.reconstitutefeatures
— Methodreconstitutefeatures(
genomes::Genomes;
feature_names::Vector{String},
verbose::Bool = false
)::Genomes
Reconstruct epistasis features from a Genomes struct using feature names that encode the transformations applied.
Arguments
genomes::Genomes
: Genomic data structure containing allele frequenciesfeature_names::Vector{String}
: Vector of feature names containing encoded transformation informationverbose::Bool
: Whether to show progress bar during reconstruction (default: false)
Returns
Genomes
: A new Genomes struct with reconstructed epistasis features
Details
The function parses the feature names to determine which transformations were applied and reconstructs the features by applying these transformations to the original genomic data. Feature names should contain the transformation operations in a format that can be parsed and evaluated.
Throws
ArgumentError
: If the input Genomes struct is corrupted (invalid dimensions)ErrorException
: If feature reconstruction fails
Example
julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> genomes_epifeat = epistasisfeatures(genomes, phenomes, n_new_features_per_transformation=50, n_reps=2, verbose=false);
julia> feature_names = genomes_epifeat.loci_alleles;
julia> genomes_epifeat_reconstructed = reconstitutefeatures(genomes, feature_names=feature_names);
julia> genomes_epifeat == genomes_epifeat_reconstructed
true
GenomicBreedingModels.ridge
— Methodridge(;
genomes::Genomes,
phenomes::Phenomes,
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
verbose::Bool = false
)::Fit
Fit a ridge (L2) regression model to genomic data. Ridge regression adds an L2 regularization term to the ordinary least squares objective function, which helps prevent overfitting and handles multicollinearity in the predictors.
Arguments
genomes::Genomes
: Genomic data structure containing allele frequenciesphenomes::Phenomes
: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}
: Optional indices to subset specific entries/individualsidx_loci_alleles::Union{Nothing,Vector{Int64}}
: Optional indices to subset specific loci-allelesidx_trait::Int64
: Index of the trait to analyze (default: 1)verbose::Bool
: If true, prints diagnostic plots and additional information (default: false)
Returns
Fit
: A fitted model object containing:model
: Model identifier ("ols")b_hat
: Estimated regression coefficientsb_hat_labels
: Labels for the coefficientstrait
: Name of the analyzed traitentries
: Entry identifierspopulations
: Population identifiersy_true
: Observed phenotypic valuesy_pred
: Predicted phenotypic valuesmetrics
: Dictionary of performance metrics
Notes
- Uses cross-validation to select the optimal regularization parameter (λ)
- Standardizes predictors before fitting
- Includes an intercept term in the model
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> fit = ridge(genomes=genomes, phenomes=phenomes);
julia> fit.model == "ridge"
true
julia> fit.metrics["cor"] > 0.50
true
GenomicBreedingModels.square
— Methodsquare(x) = x^2
This function takes a number and squares it. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Both input and ouput range between zero and one.
GenomicBreedingModels.transform1
— Methodtransform1(
f::Function,
genomes::Genomes,
phenomes::Phenomes;
idx_trait::Int64 = 1,
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
n_new_features_per_transformation::Int64 = 1_000,
ϵ::Float64 = eps(Float64),
use_abs::Bool = false,
σ²_threshold::Float64 = 0.01,
verbose::Bool = false
)::Genomes
Apply a transformation function to each allele frequency in genomic data while considering their effects on phenotypes.
Arguments
f::Function
: Transformation function to apply to allele frequenciesgenomes::Genomes
: Input genomic data structurephenomes::Phenomes
: Corresponding phenotypic data structure
Keywords
idx_trait::Int64
: Index of the trait to analyze (default: 1)idx_entries::Union{Nothing,Vector{Int64}}
: Indices of entries to include (default: all)idx_loci_alleles::Union{Nothing,Vector{Int64}}
: Indices of loci-alleles to include (default: all)n_new_features_per_transformation::Int64
: Maximum number of transformed features to retain (default: 1000)ϵ::Float64
: Small value added to prevent numerical issues (default: machine epsilon)use_abs::Bool
: Whether to use absolute values before transformation (default: false)σ²_threshold::Float64
: Minimum variance threshold for considering a locus (default: 0.01)verbose::Bool
: Whether to show progress and plots (default: false)
Returns
Genomes
: A new Genomes object containing the transformed allele frequencies
Details
The function performs the following steps:
- Extracts allele frequencies and phenotypic data
- Applies the transformation function to each locus
- Estimates effects using ordinary least squares
- Selects the most important transformed features
- Cleans numerical artifacts (values near 0 or 1)
Notes
- Use named functions to ensure transformations can be reconstructed from
loci_alleles
- The function adds
ϵ
to frequencies to prevent numerical issues - Features with variance below
σ²_threshold
are skipped - The output contains at most
n_new_features_per_transformation
features
Example
julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> genomes_transformed = transform1(x -> x^2, genomes, phenomes);
julia> idx = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[1])[1];
julia> mean(sqrt.(genomes_transformed.allele_frequencies[:, 1]) .- genomes.allele_frequencies[:, idx]) < 1e-10
true
julia> squareaddpi(x) = x^2 + pi;
julia> genomes_transformed = transform1(squareaddpi, genomes, phenomes);
julia> idx = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[1])[1];
julia> mean(squareaddpi.(genomes.allele_frequencies[:, idx]) .- genomes_transformed.allele_frequencies[:, 1]) < 1e-10
true
GenomicBreedingModels.transform2
— Methodtransform2(
f::Function,
genomes::Genomes,
phenomes::Phenomes;
idx_trait::Int64 = 1,
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
n_new_features_per_transformation::Int64 = 1_000,
ϵ::Float64 = eps(Float64),
use_abs::Bool = false,
σ²_threshold::Float64 = 0.01,
commutative::Bool = false,
verbose::Bool = false
)::Genomes
Apply a transformation function to pairs of allele frequencies in genomic data while considering their effects on phenotypes.
Arguments
f::Function
: Transformation function to apply to pairs of allele frequenciesgenomes::Genomes
: Input genomic data structurephenomes::Phenomes
: Corresponding phenotypic data structureidx_trait::Int64
: Index of the trait to analyze (default: 1)idx_entries::Union{Nothing,Vector{Int64}}
: Subset of entries to include (default: all)idx_loci_alleles::Union{Nothing,Vector{Int64}}
: Subset of loci-alleles to include (default: all)n_new_features_per_transformation::Int64
: Maximum number of transformed features to retain (default: 1000)ϵ::Float64
: Small value added to frequencies to avoid numerical issues (default: machine epsilon)use_abs::Bool
: Whether to use absolute values of frequencies (default: false)σ²_threshold::Float64
: Minimum variance threshold for considering loci (default: 0.01)commutative::Bool
: Whether the transformation function is commutative (default: false)verbose::Bool
: Whether to display progress and diagnostics (default: false)
Returns
Genomes
: A new Genomes object containing the transformed features
Details
The function performs the following steps:
- Extracts allele frequencies and phenotypic data
- Applies the transformation function to all possible pairs of allele freuqencies (i.e., locus-allele combinations)
- Estimates effects using ordinary least squares
- Selects the most important transformed features
- Cleans numerical artifacts (values near 0 or 1)
Notes
- Use named functions to ensure transformations can be reconstructed from
loci_alleles
- The function adds
ϵ
to frequencies to prevent numerical issues - Features with variance below
σ²_threshold
are skipped - The output contains at most
n_new_features_per_transformation
features
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> genomes_transformed = transform2((x,y) -> (x^2 + sqrt(y)) / 2, genomes, phenomes);
julia> idx_1 = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[1])[1];
julia> idx_2 = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[2])[1];
julia> mean((genomes.allele_frequencies[:,idx_1].^2 .+ sqrt.(genomes.allele_frequencies[:,idx_2])) ./ 2 .- genomes_transformed.allele_frequencies[:,1]) < 1e-10
true
julia> raisexbyythenlog(x, y) = log(abs(x^y));
julia> genomes_transformed = transform2(raisexbyythenlog, genomes, phenomes);
julia> idx_1 = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[1])[1];
julia> idx_2 = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[2])[1];
julia> mean(raisexbyythenlog.(genomes.allele_frequencies[:,idx_1], genomes.allele_frequencies[:,idx_2]) .- genomes_transformed.allele_frequencies[:,1]) < 1e-10
true
GenomicBreedingModels.turing_bayesG
— MethodTuring specification of Bayesian linear regression using a Gaussian prior with common variance
Example usage
# Benchmarking
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
model = turing_bayesG(G, y)
benchmarks = TuringBenchmarking.benchmark_model(
model;
# Check correctness of computations
check=true,
# Automatic differentiation backends to check and benchmark
adbackends=[:forwarddiff, :reversediff, :reversediff_compiled, :zygote]
)
# Test more loci
genomes = GenomicBreedingCore.simulategenomes(n=10, l=10_000)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Check for uninferred types in the model
@code_warntype model = turing_bayesG(G, y)
# Fit
model = turing_bayesG(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.65, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# @time chain = Turing.sample(rng, model, HMC(0.05, 10; adtype=AutoReverseDiff(compile=true)), niter, progress=true);
p = Plots.histogram(chain[:σ²])
Plots.gui(p)
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[501:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)
GenomicBreedingModels.turing_bayesG_logit
— MethodTuring specification of Bayesian logistic regression using a Gaussian prior with common variance
GenomicBreedingModels.turing_bayesGs
— MethodTuring specification of Bayesian linear regression using a Gaussian prior with varying variances
Example usage
# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesGs(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)
GenomicBreedingModels.turing_bayesGπ
— MethodTuring specification of Bayesian linear regression using a Gaussian prior with a point mass at zero and common variance
Example usage
# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesGπ(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)
GenomicBreedingModels.turing_bayesGπs
— MethodTuring specification of Bayesian linear regression using a Gaussian prior with a point mass at zero and varying variances
Example usage
# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesGπs(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)
GenomicBreedingModels.turing_bayesL
— MethodTuring specification of Bayesian linear regression using a Laplacian prior with a common scale
Example usage
# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesL(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)
GenomicBreedingModels.turing_bayesLs
— MethodTuring specification of Bayesian linear regression using a Laplacian prior with varying scales
Example usage
# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesLs(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)
GenomicBreedingModels.turing_bayesLπ
— MethodTuring specification of Bayesian linear regression using a Laplacian prior with a point mass at zero and common scale
Example usage
# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesLπ(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)
GenomicBreedingModels.turing_bayesLπs
— MethodTuring specification of Bayesian linear regression using a Laplacian prior with a point mass at zero and common scale
Example usage
# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesLπs(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)
GenomicBreedingModels.turing_bayesT
— MethodTuring specification of Bayesian linear regression using a T-distribution
Example usage
# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesT(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)
GenomicBreedingModels.turing_bayesTπ
— MethodTuring specification of Bayesian linear regression using a T-distribution with a point mass at zero
Example usage
# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesTπ(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)
GenomicBreedingModels.validate
— Methodvalidate(
fit::Fit,
genomes::Genomes,
phenomes::Phenomes;
idx_validation::Vector{Int64},
replication::String="",
fold::String=""
)::CV
Evaluate the predictive accuracy of a genomic prediction model on a validation dataset.
Arguments
fit::Fit
: A fitted genomic prediction modelgenomes::Genomes
: Genomic data structure containing allele frequenciesphenomes::Phenomes
: Phenotypic data structure containing trait measurementsidx_validation::Vector{Int64}
: Indices of entries to use for validationreplication::String
: Optional identifier for the validation replicationfold::String
: Optional identifier for the cross-validation fold
Returns
CV
: A cross-validation result object containing:- Validation metrics (correlation, RMSE, etc.)
- True and predicted values
- Entry and population information
- Model specifications
Notes
- Performs checks for data leakage between training and validation sets
- Handles missing, NaN, and Inf values in phenotypic data
- Validates dimensions of output CV struct
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);
julia> phenomes = extractphenomes(trials);
julia> fit = ridge(genomes=genomes, phenomes=phenomes, idx_entries=collect(1:90));
julia> cv = validate(fit, genomes, phenomes, idx_validation=collect(91:100));
julia> cv.metrics["cor"] > 0.50
true
GenomicBreedingModels.@string2operations
— Macro@string2operations(x)
Convert a string containing an endofunction formulae for genomic features into a parsed Julia expressions.
Arguments
x
: A string containing mathematical operations on allele frequencies
Returns
- Parsed expression that can be evaluated in Julia