GenomicBreedingModels

Documentation for GenomicBreedingModels.

GenomicBreedingModels.LπDist
GenomicBreedingModels.NπDist
GenomicBreedingModels.TπDist
Base.maximum
Base.maximum
Base.maximum
Base.minimum
Base.minimum
Base.minimum
Base.rand
Base.rand
Base.rand
Distributions.logpdf
Distributions.logpdf
Distributions.logpdf
GenomicBreedingModels.addnorm
GenomicBreedingModels.bayesa
GenomicBreedingModels.bayesb
GenomicBreedingModels.bayesc
GenomicBreedingModels.bayesian
GenomicBreedingModels.bayesian
GenomicBreedingModels.bglr
GenomicBreedingModels.cvbulk
GenomicBreedingModels.cvleaveonepopulationout
GenomicBreedingModels.cvmultithread!
GenomicBreedingModels.cvpairwisepopulation
GenomicBreedingModels.cvperpopulation
GenomicBreedingModels.epistasisfeatures
GenomicBreedingModels.extractxyetc
GenomicBreedingModels.gwaslmm
GenomicBreedingModels.gwasols
GenomicBreedingModels.gwasprep
GenomicBreedingModels.gwasreml
GenomicBreedingModels.heritabilitynarrow_sense
GenomicBreedingModels.invoneplus
GenomicBreedingModels.lasso
GenomicBreedingModels.log10epsdivlog10eps
GenomicBreedingModels.loglikreml
GenomicBreedingModels.metrics
GenomicBreedingModels.mlp
GenomicBreedingModels.mult
GenomicBreedingModels.ols
GenomicBreedingModels.pearsonscorrelation
GenomicBreedingModels.predict
GenomicBreedingModels.raise
GenomicBreedingModels.reconstitutefeatures
GenomicBreedingModels.ridge
GenomicBreedingModels.square
GenomicBreedingModels.transform1
GenomicBreedingModels.transform2
GenomicBreedingModels.turing_bayesG
GenomicBreedingModels.turing_bayesG_logit
GenomicBreedingModels.turing_bayesGs
GenomicBreedingModels.turing_bayesGπ
GenomicBreedingModels.turing_bayesGπs
GenomicBreedingModels.turing_bayesL
GenomicBreedingModels.turing_bayesLs
GenomicBreedingModels.turing_bayesLπ
GenomicBreedingModels.turing_bayesLπs
GenomicBreedingModels.turing_bayesT
GenomicBreedingModels.turing_bayesTπ
GenomicBreedingModels.validate
GenomicBreedingModels.@string2operations

GenomicBreedingModels.LπDist — Type

Laplace distribution with a point mass at 0.0

source

GenomicBreedingModels.NπDist — Type

Gaussian distribution with a point mass at 0.0

source

GenomicBreedingModels.TπDist — Type

T-distribution with a point mass at 0.0

source

Base.maximum — Method

Maximum value of the LπDist distribution

source

Base.maximum — Method

Maximum value of the NπDist distribution

source

Base.maximum — Method

Maximum value of the TπDist distribution

source

Base.minimum — Method

Minimum value of the LπDist distribution

source

Base.minimum — Method

Minimum value of the NπDist distribution

source

Base.minimum — Method

Minimum value of the TπDist distribution

source

Base.rand — Method

Sampling method for LπDist

Examples

d = LπDist(0.1, 0.0, 1.0)
rand(d)

source

Base.rand — Method

Sampling method for NπDist

Examples

d = NπDist(0.1, 0.0, 1.0)
rand(d)

source

Base.rand — Method

Sampling method for TπDist

Examples

d = TπDist(0.1, 1.0)
rand(d)

source

Distributions.logpdf — Method

log(pdf) of LπDist

Examples

d = LπDist(0.1, 0.0, 1.0)
logpdf.(d, [-1.0, 0.0, 1.0])

source

Distributions.logpdf — Method

log(pdf) of NπDist

Examples

d = NπDist(0.1, 0.0, 1.0)
logpdf.(d, [-1.0, 0.0, 1.0])

source

Distributions.logpdf — Method

log(pdf) of TπDist

Examples

d = TπDist(0.1, 1.0)
logpdf.(d, [-1.0, 0.0, 1.0])

source

GenomicBreedingModels.addnorm — Method

addnorm(x, y) = (x + y) / 2.0

This function takes two numbers and finds their arithmetic mean. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Inputs and ouput range between zero and one.

source

GenomicBreedingModels.bayesa — Method

bayesa(;
    genomes::Genomes,
    phenomes::Phenomes,
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    n_iter::Int64 = 1_500,
    n_burnin::Int64 = 500,
    verbose::Bool = false
)::Fit

Fits a Bayes A model for genomic prediction using the BGLR method. Bayes A assumes marker effects follow a scaled-t distribution, allowing for different variances for each marker effect.

Arguments

genomes::Genomes: Genomic data structure containing allele frequencies
phenomes::Phenomes: Phenotypic data structure containing trait measurements
idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset specific entries/individuals
idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset specific genetic markers
idx_trait::Int64: Index of the trait to analyze (default: 1)
n_iter::Int64: Number of iterations for the MCMC algorithm (default: 1,500)
n_burnin::Int64: Number of burn-in iterations to discard (default: 500)
verbose::Bool: Whether to print progress messages (default: false)

Returns

Fit: A fitted model object containing:
- model: Model identifier ("ols")
- b_hat: Estimated regression coefficients
- b_hat_labels: Labels for the coefficients
- trait: Name of the analyzed trait
- entries: Entry identifiers
- populations: Population identifiers
- y_true: Observed phenotypic values
- y_pred: Predicted phenotypic values
- metrics: Dictionary of performance metrics

Notes

The model assumes a Gaussian distribution for the trait values
Recommended to check convergence by examining trace plots of key parameters
The burn-in period should be adjusted based on convergence diagnostics

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> fit = bayesa(genomes=genomes, phenomes=phenomes);

julia> fit.model == "bayesa"
true

julia> fit.metrics["cor"] > 0.50
true

source

GenomicBreedingModels.bayesb — Method

bayesb(;
    genomes::Genomes,
    phenomes::Phenomes,
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    n_iter::Int64 = 1_500,
    n_burnin::Int64 = 500,
    verbose::Bool = false
)::Fit

Fits a Bayes B model for genomic prediction using the BGLR method. Bayes B assumes that marker effects follow a mixture distribution where some markers have zero effect and others follow a scaled-t distribution.

Arguments

genomes::Genomes: Genomic data structure containing allele frequencies
phenomes::Phenomes: Phenotypic data structure containing trait measurements
idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset specific entries (default: nothing)
idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset specific loci/alleles (default: nothing)
idx_trait::Int64: Index of the trait to analyze (default: 1)
n_iter::Int64: Number of iterations for the MCMC algorithm (default: 1,500)
n_burnin::Int64: Number of burn-in iterations to discard (default: 500)
verbose::Bool: Whether to print progress information (default: false)

Returns

Fit: A fitted model object containing:
- model: Model identifier ("ols")
- b_hat: Estimated regression coefficients
- b_hat_labels: Labels for the coefficients
- trait: Name of the analyzed trait
- entries: Entry identifiers
- populations: Population identifiers
- y_true: Observed phenotypic values
- y_pred: Predicted phenotypic values
- metrics: Dictionary of performance metrics

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> fit = bayesb(genomes=genomes, phenomes=phenomes);

julia> fit.model == "bayesb"
true

julia> fit.metrics["cor"] > 0.50
true

source

GenomicBreedingModels.bayesc — Method

bayesc(;
    genomes::Genomes,
    phenomes::Phenomes,
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    n_iter::Int64 = 1_500,
    n_burnin::Int64 = 500,
    verbose::Bool = false
)::Fit

Fits a Bayes C model for genomic prediction using the BGLR method. Bayes C assumes that marker effects follow a mixture distribution with a point mass at zero and a normal distribution.

Arguments

genomes::Genomes: Genomic data structure containing allele frequencies
phenomes::Phenomes: Phenotypic data structure containing trait measurements
idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset specific entries/individuals
idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset specific loci/alleles
idx_trait::Int64: Index of the trait to analyze (default: 1)
n_iter::Int64: Number of iterations for MCMC sampling (default: 1,500)
n_burnin::Int64: Number of burn-in iterations to discard (default: 500)
verbose::Bool: Whether to print progress information (default: false)

Returns

Fit: A fitted model object containing:
- model: Model identifier ("ols")
- b_hat: Estimated regression coefficients
- b_hat_labels: Labels for the coefficients
- trait: Name of the analyzed trait
- entries: Entry identifiers
- populations: Population identifiers
- y_true: Observed phenotypic values
- y_pred: Predicted phenotypic values
- metrics: Dictionary of performance metrics

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> fit = bayesc(genomes=genomes, phenomes=phenomes);

julia> fit.model == "bayesc"
true

julia> fit.metrics["cor"] > 0.50
true

source

GenomicBreedingModels.bayesian — Method

bayesian(
    turing_model::Function;
    X::Matrix{Float64},
    y::Vector{Float64},
    sampler::String = ["NUTS", "HMC", "HMCDA", "MH", "PG"][1],
    sampling_method::Int64 = 1,
    seed::Int64 = 123,
    n_burnin::Int64 = 500,
    n_iter::Int64 = 1_500,
    verbose::Bool = false,
)::Fit

Fit a Bayesian linear regression models via Turing.jl

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);

julia> phenomes = extractphenomes(trials);

julia> sol = Suppressor.@suppress bayesian(turing_bayesG, genomes=genomes, phenomes=phenomes);

julia> # Slow because not multivariate T-dist: sol_T = Suppressor.@suppress bayesian(turing_bayesT, genomes=genomes, phenomes=phenomes);

julia> # Even slower because of an extra set of distribution to define a non-spherical variance-covariance matrix: sol_Gs = Suppressor.@suppress bayesian(turing_bayesGs, genomes=genomes, phenomes=phenomes);

julia> sol_BGLR = Suppressor.@suppress bayesian("BayesA", genomes=genomes, phenomes=phenomes); sol.metrics["cor"] > sol_BGLR.metrics["cor"]
true

julia> sol.metrics["cor"] > 0.5
true

source

GenomicBreedingModels.bayesian — Method

bayesian(
    bglr_model::String;
    genomes::Genomes,
    phenomes::Phenomes,
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    response_type::String = ["gaussian", "ordinal"][1], 
    n_burnin::Int64 = 500,
    n_iter::Int64 = 1_500, 
    verbose::Bool = false
)::Fit

Fit Bayesian genomic prediction models using the BGLR R package.

Arguments

bglr_model::String: Type of BGLR model to fit. Options: "BayesA", "BayesB", "BayesC"
genomes::Genomes: Genomic data structure containing allele frequencies
phenomes::Phenomes: Phenotypic data structure containing trait measurements
idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset entries
idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset loci/alleles
idx_trait::Int64: Index of trait to analyze (default: 1)
response_type::String: Type of response variable. Options: "gaussian" or "ordinal" (default: "gaussian")
n_burnin::Int64: Number of burn-in MCMC iterations (default: 500)
n_iter::Int64: Total number of MCMC iterations (default: 1,500)
verbose::Bool: Whether to print progress information (default: false)

Returns

Fit: Object containing:
- Model fit summary
- Estimated genetic effects
- Predicted values
- Model performance metrics
- Input data details

Details

This function provides a Julia interface to the BGLR R package for Bayesian genomic prediction. It uses temporary files to interface with R and automatically cleans up afterward.

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);

julia> phenomes = extractphenomes(trials);

julia> sol = Suppressor.@suppress bayesian("BayesA", genomes=genomes, phenomes=phenomes);

julia> sol.metrics["cor"] > 0.5
true

source

GenomicBreedingModels.bglr — Method

bglr(; G::Matrix{Float64}, y::Vector{Float64}, model::String="BayesA", 
     response_type::String="gaussian", n_iter::Int64=1_500, 
     n_burnin::Int64=500, verbose::Bool=false)::Vector{Float64}

Perform Bayesian genomic prediction using the BGLR (Bayesian Generalized Linear Regression) R package.

Arguments

G::Matrix{Float64}: Marker matrix where rows represent individuals and columns represent locus-allele combinations
y::Vector{Float64}: Vector of phenotypic values
model::String: Bayesian model type to use. Options: "BayesA", "BayesB", or "BayesC". Default: "BayesA"
response_type::String: Type of response variable. Options: "gaussian" or "ordinal". Default: "gaussian"
n_iter::Int64: Number of iterations for the MCMC chain. Default: 1,500
n_burnin::Int64: Number of burn-in iterations to discard. Default: 500
verbose::Bool: Whether to print progress information. Default: false

Returns

Vector{Float64}: Estimated effects including the intercept (first element) followed by marker effects

Details

This function creates temporary files to interface with R's BGLR package, runs the Bayesian analysis, and automatically cleans up temporary files afterward. The function uses the system's R installation and requires the BGLR package to be installed in R.

Notes

This is hacky. It invokes Rscript for each instance which should allow multithreading because RCall.jl currently does not allow multithreading.

source

GenomicBreedingModels.cvbulk — Method

cvbulk(;
    genomes::Genomes,
    phenomes::Phenomes,
    models::Vector{Function}=[ridge],
    n_replications::Int64=5,
    n_folds::Int64=5,
    seed::Int64=42,
    verbose::Bool=true
)::Tuple{Vector{CV}, Vector{String}}

Perform replicated k-fold cross-validation of genomic prediction model(s) across all available traits and entries, ignoring populations.

Arguments

genomes::Genomes: Genomic data structure containing allele frequencies
phenomes::Phenomes: Phenotypic data structure containing trait measurements
models::Vector{Function}: Vector of genomic prediction model functions to evaluate
n_replications::Int64: Number of times to repeat k-fold cross-validation randomising k-fold partitioning each time (default: 5)
n_folds::Int64: Number of cross-validation folds (default: 5)
seed::Int64: Random seed for reproducibility (default: 42)
verbose::Bool: Whether to display progress information (default: true)

Returns

Tuple containing:
- Vector of CV objects with cross-validation results
- Vector of warning messages about skipped cases

Threading

Uses multiple threads if Julia is started with threading enabled. Example startup command: julia --threads 7,1 (7 worker threads, 1 runtime thread)

Notes

Performs random k-fold partitioning of entries ignoring population structure
Handles missing/invalid phenotype values
Validates model inputs and data dimensions
Returns warnings for cases with insufficient data or zero variance

Example

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false); genomes.populations = StatsBase.sample(string.("pop_", 1:3), length(genomes.entries), replace=true);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, verbose=false);

julia> phenomes = extractphenomes(trials);

julia> cvs, notes = cvbulk(genomes=genomes, phenomes=phenomes, models=[ols, ridge], n_replications=2, n_folds=2, verbose=false);

julia> df_across_entries, df_per_entry = tabularise(cvs);

julia> idx_across = findall((df_across_entries.trait .== "trait_1") .&& (df_across_entries.model .== "ridge") .&& (df_across_entries.replication .== "replication_1") .&& (df_across_entries.fold .== "fold_1"));

julia> idx_per = findall((df_per_entry.trait .== "trait_1") .&& (df_per_entry.model .== "ridge") .&& (df_per_entry.replication .== "replication_1") .&& (df_per_entry.fold .== "fold_1"));

julia> abs(df_across_entries.cor[idx_across][1] - cor(df_per_entry.y_true[idx_per], df_per_entry.y_pred[idx_per])) < 1e-10
true

source

GenomicBreedingModels.cvleaveonepopulationout — Method

cvleaveonepopulationout(;
    genomes::Genomes,
    phenomes::Phenomes,
    models::Vector{Function}=[ridge],
    n_replications::Int64=5,
    n_folds::Int64=5,
    seed::Int64=42,
    verbose::Bool=true
)::Tuple{Vector{CV}, Vector{String}}

Performs leave-one-population-out cross-validation for genomic prediction models across all available traits.

Arguments

genomes::Genomes: Genomic data structure containing allele frequencies
phenomes::Phenomes: Phenotypic data structure containing trait measurements
models::Vector{Function}: Vector of model functions to evaluate (default: [ridge])
n_replications::Int64: Number of replications (not used in this implementation)
n_folds::Int64: Number of folds (not used in this implementation)
seed::Int64: Random seed (not used in this implementation)
verbose::Bool: If true, displays progress information during execution

Returns

Tuple{Vector{CV}, Vector{String}}: Returns a tuple containing:
- Vector of CV objects with cross-validation results
- Vector of warning/error messages for skipped validations

Details

The function implements a leave-one-population-out cross-validation strategy where:

For each trait and population combination:
- Uses one population as validation set
- Uses remaining populations as training set
Evaluates multiple genomic prediction models
Handles missing data and variance checks
Supports parallel processing via Julia's multi-threading

Threading

To utilize multiple threads, start Julia with: julia --threads n,1 where n is the desired number of threads for computation and 1 is reserved for the runtime.

Supported Models

ols (Ordinary Least Squares)
ridge (Ridge Regression)
lasso (Lasso Regression)
bayesa (Bayes A)
bayesb (Bayes B)
bayesc (Bayes C)

Example

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false); genomes.populations = StatsBase.sample(string.("pop_", 1:3), length(genomes.entries), replace=true);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);

julia> phenomes = extractphenomes(trials);

julia> cvs, notes = cvleaveonepopulationout(genomes=genomes, phenomes=phenomes, models=[ridge, bayesa], n_replications=2, n_folds=2, verbose=false);

julia> df_across_entries, df_per_entry = tabularise(cvs);

julia> sum([sum(split(df_across_entries.training_population[i], ";") .== df_across_entries.validation_population[i]) for i in 1:size(df_across_entries, 1)]) == 0
true

julia> idx_across = findall((df_across_entries.validation_population .== "pop_1") .&& (df_across_entries.trait .== "trait_1") .&& (df_across_entries.model .== "ridge"));

julia> idx_per = findall((df_per_entry.validation_population .== "pop_1") .&& (df_per_entry.trait .== "trait_1") .&& (df_per_entry.model .== "ridge"));

julia> abs(df_across_entries.cor[idx_across][1] - cor(df_per_entry.y_true[idx_per], df_per_entry.y_pred[idx_per])) < 1e-10
true

source

GenomicBreedingModels.cvmultithread! — Method

cvmultithread!(cvs::Vector{CV}; genomes::Genomes, phenomes::Phenomes, models_vector, verbose::Bool = true)::Vector{CV}

Perform multi-threaded genomic prediction cross-validation using specified models.

Arguments

cvs::Vector{CV}: Vector of cross-validation objects to be processed
genomes::Genomes: Genomic data structure containing allele frequencies
phenomes::Phenomes: Phenotypic data structure containing trait measurements
models_vector: Vector of model functions to be used for prediction (e.g., [ridge, bayesa])
verbose::Bool=true: Whether to display progress bar during computation

Returns

Modified cvs vector with updated cross-validation results

Threading

Requires Julia to be started with multiple threads to utilize parallel processing. Example startup command: julia --threads 7,1 (7 worker threads, 1 runtime thread)

Details

The function performs cross-validation in parallel for each CV object using the corresponding model from the models_vector. For each fold:

Extracts training and validation indices
Fits the specified model using training data
Validates the model using validation data
Updates the CV object with prediction results

Example

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);

julia> phenomes = extractphenomes(trials);

julia> idx_training = collect(1:50);

julia> idx_validation_1 = collect(51:75);

julia> idx_validation_2 = collect(76:100);

julia> fit = Fit(n = length(idx_training), l = length(genomes.loci_alleles) + 1); fit.model = "ridge"; fit.trait = "trait_1"; 

julia> fit.entries = genomes.entries[idx_training]; fit.populations = genomes.populations[idx_training]; 

julia> fit.b_hat_labels = vcat(["intercept"], genomes.loci_alleles);

julia> cv_1 = CV("replication_1", "fold_1", fit, genomes.populations[idx_validation_1], genomes.entries[idx_validation_1], zeros(length(idx_validation_1)), zeros(length(idx_validation_1)), fit.metrics);

julia> cv_2 = CV("replication_1", "fold_2", fit, genomes.populations[idx_validation_2], genomes.entries[idx_validation_2], zeros(length(idx_validation_2)), zeros(length(idx_validation_2)), fit.metrics);

julia> cvs = [cv_1, cv_2]; models = [ridge, ridge];

julia> cvmultithread!(cvs, genomes=genomes, phenomes=phenomes, models_vector=[ridge, bayesa], verbose=false);

julia> df_across_entries, df_per_entry = tabularise(cvs);

julia> idx_across = findall(df_across_entries.fold .== "fold_2");

julia> idx_per = findall(df_per_entry.fold .== "fold_2");

julia> abs(df_across_entries.cor[idx_across][1] - cor(df_per_entry.y_true[idx_per], df_per_entry.y_pred[idx_per])) < 1e-10
true

source

GenomicBreedingModels.cvpairwisepopulation — Method

cvpairwisepopulation(;
    genomes::Genomes,
    phenomes::Phenomes,
    models::Vector{Function}=[ridge],
    n_replications::Int64=5,
    n_folds::Int64=5, 
    seed::Int64=42,
    verbose::Bool=true
)::Tuple{Vector{CV}, Vector{String}}

Performs pairwise cross-validation between populations for genomic prediction models.

Arguments

genomes::Genomes: Genomic data structure containing allele frequencies
phenomes::Phenomes: Phenotypic data structure containing trait measurements
models::Vector{Function}: Vector of genomic prediction model functions to evaluate (default: [ridge])
n_replications::Int64: Number of replications (unused, maintained for API consistency)
n_folds::Int64: Number of folds (unused, maintained for API consistency)
seed::Int64: Random seed (unused, maintained for API consistency)
verbose::Bool: Whether to display progress information

Returns

Tuple{Vector{CV}, Vector{String}}:
- Vector of CV objects containing cross-validation results
- Vector of warning messages for skipped validations

Details

For each pair of populations (pop1, pop2):

Uses pop1 as training set and pop2 as validation set
Skips within-population validation (pop1==pop2)
Evaluates each model on all available traits
Handles missing/invalid phenotype values
Validates model inputs and data dimensions

Threading

Requires Julia to be started with multiple threads: julia --threads n,1 where n is number of worker threads

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false); genomes.populations = StatsBase.sample(string.("pop_", 1:3), length(genomes.entries), replace=true);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);

julia> phenomes = extractphenomes(trials);

julia> cvs, notes = cvpairwisepopulation(genomes=genomes, phenomes=phenomes, models=[ols, ridge], n_replications=2, n_folds=2, verbose=false);

julia> df_across_entries, df_per_entry = tabularise(cvs);

julia> sum(df_across_entries.training_population .!= df_across_entries.validation_population) == size(df_across_entries, 1)
true

julia> idx_across = findall((df_across_entries.training_population .== "pop_1") .&& (df_across_entries.validation_population .== "pop_2") .&& (df_across_entries.trait .== "trait_1") .&& (df_across_entries.model .== "ridge"));

julia> idx_per = findall((df_per_entry.training_population .== "pop_1") .&& (df_per_entry.validation_population .== "pop_2") .&& (df_per_entry.trait .== "trait_1") .&& (df_per_entry.model .== "ridge"));

julia> abs(df_across_entries.cor[idx_across][1] - cor(df_per_entry.y_true[idx_per], df_per_entry.y_pred[idx_per])) < 1e-10
true

source

GenomicBreedingModels.cvperpopulation — Method

cvperpopulation(;
    genomes::Genomes,
    phenomes::Phenomes,
    models::Vector{Function}=[ridge],
    n_replications::Int64=5,
    n_folds::Int64=5,
    seed::Int64=42,
    verbose::Bool=true
)::Tuple{Vector{CV}, Vector{String}}

Performs within-population replicated cross-validation of genomic prediction models across all available traits.

Arguments

genomes::Genomes: Genomic data structure containing allele frequencies
phenomes::Phenomes: Phenotypic data structure containing trait measurements
models::Vector{Function}=[ridge]: Vector of genomic prediction model functions to evaluate
n_replications::Int64=5: Number of replications for cross-validation
n_folds::Int64=5: Number of folds for k-fold cross-validation
seed::Int64=42: Random seed for reproducibility
verbose::Bool=true: Whether to print progress information

Returns

Tuple{Vector{CV}, Vector{String}}: A tuple containing:
- Vector of CV objects with cross-validation results
- Vector of notes/warnings generated during the process

Details

The function performs separate cross-validations for each unique population in the dataset. Supports multiple genomic prediction models including:

ols: Ordinary Least Squares
ridge: Ridge Regression
lasso: Lasso Regression
bayesa: Bayes A
bayesb: Bayes B
bayesc: Bayes C

Threading

To use multiple threads, invoke Julia with: julia --threads n,1 where n is the desired number of threads for multi-threaded processes and 1 is reserved for the Julia runtime.

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false); genomes.populations = StatsBase.sample(string.("pop_", 1:3), length(genomes.entries), replace=true);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);

julia> phenomes = extractphenomes(trials);

julia> cvs, notes = cvperpopulation(genomes=genomes, phenomes=phenomes, models=[ridge, bayesa], n_replications=2, n_folds=2, verbose=false);

julia> df_across_entries, df_per_entry = tabularise(cvs);

julia> sort(unique(df_across_entries.training_population))
3-element Vector{String}:
 "pop_1"
 "pop_2"
 "pop_3"

julia> df_across_entries.training_population == df_across_entries.validation_population
true

julia> idx_across = findall((df_across_entries.validation_population .== "pop_1") .&& (df_across_entries.trait .== "trait_1") .&& (df_across_entries.model .== "bayesa") .&& (df_across_entries.replication .== "replication_1") .&& (df_across_entries.fold .== "fold_1"));

julia> idx_per = findall((df_per_entry.validation_population .== "pop_1") .&& (df_per_entry.trait .== "trait_1") .&& (df_per_entry.model .== "bayesa") .&& (df_per_entry.replication .== "replication_1") .&& (df_per_entry.fold .== "fold_1"));

julia> abs(df_across_entries.cor[idx_across][1] - cor(df_per_entry.y_true[idx_per], df_per_entry.y_pred[idx_per])) < 1e-10
true

julia> summary_across, summary_per_entry = summarise(cvs);

julia> size(summary_across)
(6, 8)

julia> size(summary_per_entry)
(200, 8)

source

GenomicBreedingModels.epistasisfeatures — Method

epistasisfeatures(
    genomes::Genomes,
    phenomes::Phenomes;
    idx_trait::Int64 = 1,
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    transformations1::Vector{Function} = [square, invoneplus, log10epsdivlog10eps],
    transformations2::Vector{Function} = [mult, addnorm, raise],
    n_new_features_per_transformation::Int64 = 1_000,
    n_reps::Int64 = 3,
    verbose::Bool = false
)::Genomes

Generate epistasis features by applying various transformations to genomic data.

Arguments

genomes::Genomes: Genomic data structure containing allele frequencies
phenomes::Phenomes: Phenotypic data structure containing trait measurements
idx_trait::Int64: Index of the trait to analyze (default: 1)
idx_entries::Union{Nothing,Vector{Int64}}: Indices of entries to include (default: all)
idx_loci_alleles::Union{Nothing,Vector{Int64}}: Indices of loci/alleles to include (default: all)
transformations1::Vector{Function}: Single-input transformation functions (default: [square, invoneplus, log10epsdivlog10eps])
transformations2::Vector{Function}: Two-input transformation functions (default: [mult, addnorm, raise])
n_new_features_per_transformation::Int64: Number of new features to generate per transformation (default: 1_000)
n_reps::Int64: Number of times to repeat the transformation process (default: 3)
verbose::Bool: Whether to display progress information (default: false)

Returns

Genomes: Enhanced genomic data structure with additional epistasis features

Description

This function generates epistasis features by applying two types of transformations:

Single-input transformations (transformations1) applied to individual genomic features
Two-input transformations (transformations2) applied to pairs of genomic features

The transformations are repeated n_reps times to create a rich set of derived features.

Notes

Ensures all generated features are within [0,1] range
Maintains dimensional consistency of input/output structures
Automatically handles entry and loci/allele subsetting
Validates input data structures before processing

Example

julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> genomes_plus_features = epistasisfeatures(genomes, phenomes, n_new_features_per_transformation=50, n_reps=2, verbose=false);

julia> cvs, notes = cvbulk(genomes=genomes_plus_features, phenomes=phenomes, models=[ridge, lasso, bayesa], verbose=false);

julia> cvs_no_epi, notes_no_epi = cvbulk(genomes=genomes, phenomes=phenomes, models=[ridge, lasso, bayesa], verbose=false);

julia> df_across, df_per_entry = GenomicBreedingCore.tabularise(cvs);

julia> df_across_no_epi, df_per_entry_no_epi = GenomicBreedingCore.tabularise(cvs_no_epi);

julia> df_summary = combine(groupby(df_across, [:trait, :model]), [[:cor] => mean, [:cor] => std]);

julia> df_summary_no_epi = combine(groupby(df_across_no_epi, [:trait, :model]), [[:cor] => mean, [:cor] => std]);

julia> mean(df_summary.cor_mean) > mean(df_summary_no_epi.cor_mean)
true

source

GenomicBreedingModels.extractxyetc — Method

extractxyetc(
    genomes::Genomes,
    phenomes::Phenomes;
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    add_intercept::Bool = true
) -> Tuple{Matrix{Float64}, Vector{Float64}, Vector{String}, Vector{String}, Vector{String}}

Extract data matrices and vectors from genomic and phenotypic data for statistical analyses.

Arguments

genomes::Genomes: Genomic data structure containing allele frequencies
phenomes::Phenomes: Phenotypic data structure containing trait measurements
idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to select specific entries (default: all entries)
idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to select specific loci-alleles (default: all loci-alleles)
idx_trait::Int64: Index of the trait to analyze (default: 1)
add_intercept::Bool: Whether to add an intercept column to the design matrix (default: true)

Returns

A tuple containing:

X::Matrix{Float64}: Design matrix with allele frequencies (and intercept if specified)
y::Vector{Float64}: Response vector with phenotypic values
entries::Vector{String}: Names of the selected entries
populations::Vector{String}: Population identifiers for the selected entries
loci_alleles::Vector{String}: Names of the selected loci-alleles

Notes

The function filters out entries with missing, NaN, or infinite phenotype values
Requires at least 2 valid entries after filtering
Checks for non-zero variance in the trait values
Ensures consistency between genomic and phenotypic data dimensions
Validates all index inputs are within bounds

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> X, y, loci_alleles = extractxyetc(genomes, phenomes);

julia> X == hcat(ones(length(phenomes.entries)), genomes.allele_frequencies)
true

julia> y == phenomes.phenotypes[:, 1]
true

source

GenomicBreedingModels.gwaslmm — Method

gwaslmm(
    genomes::Genomes,
    phenomes::Phenomes;
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    GRM_type::String = ["simple", "ploidy-aware"][1],
    verbose::Bool = false
)::Fit

Perform genome-wide association analysis using a linear mixed model (LMM) approach.

Arguments

genomes::Genomes: Genomic data structure containing allele frequencies
phenomes::Phenomes: Phenotypic data structure containing trait measurements
idx_entries::Union{Nothing,Vector{Int64}}: Optional indices for subsetting entries
idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices for subsetting loci/alleles
idx_trait::Int64: Index of the trait to analyze (default: 1)
GRM_type::String: Type of genetic relationship matrix to use:
- "simple": Standard GRM calculation
- "ploidy-aware": Ploidy-adjusted GRM calculation
verbose::Bool: Whether to display progress and plots (default: false)

Returns

Fit: A structure containing GWAS results including:
- model: Model identifier ("GWAS_LMM")
- b_hat: Vector of test statistics (z-scores) for genetic markers

Details

The function implements a mixed model GWAS using the first principal component of the genetic relationship matrix (GRM) as a fixed effect to control for population structure. The model includes random effects for entries and uses REML estimation.

Notes

Handles both diploid and polyploid data through the GRM_type parameter
Uses multi-threading for parallel computation of marker effects
Includes automatic convergence retry on fitting failures
Maximum fitting time per marker is limited to 60 seconds

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> ploidy = 4;

julia> genomes.allele_frequencies = round.(genomes.allele_frequencies .* ploidy) ./ ploidy;

julia> proportion_of_variance = zeros(9, 1); proportion_of_variance[1, 1] = 0.5;

julia> trials, effects = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.05 0.00 0.00;], proportion_of_variance = proportion_of_variance, verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> fit_1 = Suppressor.@suppress gwaslmm(genomes=genomes, phenomes=phenomes, GRM_type="simple");

julia> fit_1.model
"GWAS_LMM"

julia> fit_2 = Suppressor.@suppress gwaslmm(genomes=genomes, phenomes=phenomes, GRM_type="ploidy-aware");

julia> fit_2.model
"GWAS_LMM"

julia> findall(fit_1.b_hat .== maximum(fit_1.b_hat)) == findall(fit_2.b_hat .== maximum(fit_2.b_hat))
true

source

GenomicBreedingModels.gwasols — Method

gwasols(
    genomes::Genomes,
    phenomes::Phenomes;
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    GRM_type::String = ["simple", "ploidy-aware"][1],
    verbose::Bool = false,
)::Fit

Perform genome-wide association study (GWAS) using ordinary least squares (OLS) regression with population structure correction.

Arguments

genomes::Genomes: Genomic data structure containing allele frequencies
phenomes::Phenomes: Phenotypic data structure containing trait measurements
idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset entries (default: all entries)
idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset loci/alleles (default: all loci)
idx_trait::Int64: Index of the trait to analyze (default: 1)
GRM_type::String: Type of genetic relationship matrix to use ("simple" or "ploidy-aware") (default: "simple")
verbose::Bool: Whether to display progress and plots (default: false)

Returns

Fit: A structure containing GWAS results including:
- model: Model identifier ("GWAS_OLS")
- b_hat: Vector of effect size estimates/t-statistics for each marker
- Additional model information

Details

The function implements GWAS using OLS regression while accounting for population structure through the first principal component of the genetic relationship matrix (GRM) as a covariate. Two types of GRM can be used: "simple" assumes diploid organisms, while "ploidy-aware" accounts for different ploidy levels.

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> ploidy = 4;

julia> genomes.allele_frequencies = round.(genomes.allele_frequencies .* ploidy) ./ ploidy;

julia> proportion_of_variance = zeros(9, 1); proportion_of_variance[1, 1] = 0.5;

julia> trials, effects = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.05 0.00 0.00;], proportion_of_variance = proportion_of_variance, verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> fit_1 = gwasols(genomes=genomes, phenomes=phenomes, GRM_type="simple");

julia> fit_1.model
"GWAS_OLS"

julia> fit_2 = gwasols(genomes=genomes, phenomes=phenomes, GRM_type="ploidy-aware");

julia> fit_2.model
"GWAS_OLS"

julia> findall(fit_1.b_hat .== maximum(fit_1.b_hat)) == findall(fit_2.b_hat .== maximum(fit_2.b_hat))
true

source

GenomicBreedingModels.gwasprep — Method

gwasprep(
    genomes::Genomes,
    phenomes::Phenomes;
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    GRM_type::String = ["simple", "ploidy-aware"][1],
    standardise::Bool = true,
    verbose::Bool = false,
)::Tuple{Matrix{Float64},Vector{Float64},Matrix{Float64},Fit}

Prepare data matrices and structures for genome-wide association studies (GWAS).

Arguments

genomes::Genomes: Genomic data structure containing allele frequencies
phenomes::Phenomes: Phenotypic data structure containing trait measurements
idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset entries (default: all entries)
idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset loci/alleles (default: all loci)
idx_trait::Int64: Index of the trait to analyze (default: 1)
GRM_type::String: Type of genetic relationship matrix to use ("simple" or "ploidy-aware") (default: "simple")
standardise::Bool: Whether to standardize the data matrices (default: true)
verbose::Bool: Whether to print progress information (default: false)

Returns

A tuple containing:

G::Matrix{Float64}: Standardized allele frequency matrix
y::Vector{Float64}: Standardized phenotype vector
GRM::Matrix{Float64}: Genetic relationship matrix
fit::Fit: Initialized Fit struct for GWAS results

Details

Performs data validation and preprocessing for GWAS analysis
Removes fixed loci with no variation
Standardizes genomic and phenotypic data if requested
Constructs appropriate genetic relationship matrix
Initializes output structure for association results

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> ploidy = 4;

julia> genomes.allele_frequencies = round.(genomes.allele_frequencies .* ploidy) ./ ploidy;

julia> proportion_of_variance = zeros(9, 1); proportion_of_variance[1, 1] = 0.5;

julia> trials, effects = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.05 0.00 0.00;], proportion_of_variance = proportion_of_variance, verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> G, y, GRM, fit = gwasprep(genomes=genomes, phenomes=phenomes);

julia> sum(abs.(mean(G, dims=1)[1,:]) .< 1e-10) == size(G, 2)
true

julia> sum(abs.(std(G, dims=1)[1,:] .- 1) .< 1e-10) == size(G, 2)
true

julia> (abs(mean(y)) < 1e-10, abs(std(y) - 1) < 1e-10)
(true, true)

julia> size(G, 1) == length(y)
true

julia> (size(G, 1), length(y)) == size(GRM)
true

julia> length(fit.entries) == length(y)
true

julia> length(fit.b_hat) == size(G, 2)
true

source

GenomicBreedingModels.gwasreml — Method

gwasreml(
    genomes::Genomes,
    phenomes::Phenomes;
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    GRM_type::String = "simple",
    verbose::Bool = false
)::Fit

Performs genome-wide association analysis using restricted maximum likelihood estimation (REML).

Arguments

genomes::Genomes: Genomic data structure containing allele frequencies
phenomes::Phenomes: Phenotypic data structure containing trait measurements
idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset entries (default: nothing)
idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset loci/alleles (default: nothing)
idx_trait::Int64: Index of the trait to analyze (default: 1)
GRM_type::String: Type of genetic relationship matrix to use, either "simple" or "ploidy-aware" (default: "simple")
verbose::Bool: Whether to display progress and plots (default: false)

Returns

::Fit: A Fit struct containing GWAS results, including effect estimates and test statistics

Details

Implements the REML log-likelihood calculation for a mixed model of the form: y = Xβ + Zu + e where:

β are fixed effects
u are random genetic effects with u ~ N(0, σ²_u * GRM)
e are residual effects with e ~ N(0, σ²_e * I)

The function constructs the variance-covariance matrices and computes the REML transformation to obtain the log-likelihood value used in variance component estimation.

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false);

julia> ploidy = 4;

julia> genomes.allele_frequencies = round.(genomes.allele_frequencies .* ploidy) ./ ploidy;

julia> proportion_of_variance = zeros(9, 1); proportion_of_variance[1, 1] = 0.5;

julia> trials, effects = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.05 0.00 0.00;], proportion_of_variance = proportion_of_variance, verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> fit_1 = gwasreml(genomes=genomes, phenomes=phenomes, GRM_type="simple");

julia> fit_1.model
"GWAS_REML"

julia> fit_2 = gwasreml(genomes=genomes, phenomes=phenomes, GRM_type="ploidy-aware");

julia> fit_2.model
"GWAS_REML"

julia> findall(fit_1.b_hat .== maximum(fit_1.b_hat)) == findall(fit_2.b_hat .== maximum(fit_2.b_hat))
true

source

GenomicBreedingModels.heritabilitynarrow_sense — Method

heritabilitynarrow_sense(y_true::Vector{Float64}, y_pred::Vector{Float64})::Float64

Calculate narrow-sense heritability (h²) from true and predicted phenotypic values.

Narrow-sense heritability is the proportion of phenotypic variance that can be attributed to additive genetic effects. It is calculated as the ratio of additive genetic variance (s²a) to total phenotypic variance (s²a + s²e).

Arguments

y_true::Vector{Float64}: Vector of observed/true phenotypic values
y_pred::Vector{Float64}: Vector of predicted genetic values

Returns

Float64: Narrow-sense heritability (h²) value between 0 and 1

Details

Returns 0.0 if variance of either input vector is near zero (< 1e-10)
Additive genetic variance (s²a) is estimated from variance of predictions
Environmental variance (s²e) is estimated from variance of residuals
Result is bounded between 0 and 1

source

GenomicBreedingModels.invoneplus — Method

invoneplus(x) = 1 / (1 + x)

This function takes a number, adds one and inverts it. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Both input and ouput range between zero and one.

source

GenomicBreedingModels.lasso — Method

lasso(;
    genomes::Genomes,
    phenomes::Phenomes,
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    verbose::Bool = false,
)::Fit

Fits a LASSO (Least Absolute Shrinkage and Selection Operator) regression model with L1 regularization for genomic prediction.

Arguments

genomes::Genomes: Genomic data structure containing allele frequencies
phenomes::Phenomes: Phenotypic data structure containing trait measurements
idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to select specific entries/individuals
idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to select specific loci-alleles
idx_trait::Int64: Index of the trait to analyze (default: 1)
verbose::Bool: If true, prints diagnostic plots and additional information (default: false)

Returns

Fit: A fitted model object containing:
- model: Model identifier ("ols")
- b_hat: Estimated regression coefficients
- b_hat_labels: Labels for the coefficients
- trait: Name of the analyzed trait
- entries: Entry identifiers
- populations: Population identifiers
- y_true: Observed phenotypic values
- y_pred: Predicted phenotypic values
- metrics: Dictionary of performance metrics

Details

The function implements LASSO regression using the GLMNet package, which performs automatic feature selection through L1 regularization. The optimal regularization parameter (λ) is selected using cross-validation to minimize prediction error while promoting sparsity in the coefficients.

Notes

Standardisation is disabled by default because the allele frequencies across loci are comparable as they all range from zero to one
The model includes an intercept term

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> fit = lasso(genomes=genomes, phenomes=phenomes);

julia> fit.model == "lasso"
true

julia> fit.metrics["cor"] > 0.50
true

source

GenomicBreedingModels.log10epsdivlog10eps — Method

log10epsdivlog10eps(x) = (log10(abs(x) + eps(Float64))) / log10(eps(Float64))

This function takes a number, adds a very tiny value, takes the log10, and divide it by the log10 of the same very tiny value. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Both input and ouput range between zero and one.

source

GenomicBreedingModels.loglikreml — Method

loglikreml(θ::Vector{Float64}, data::Tuple{Vector{Float64},Matrix{Float64},Matrix{Float64}})::Float64

Calculate the restricted maximum likelihood (REML) log-likelihood for a mixed linear model.

Arguments

θ::Vector{Float64}: Vector of variance components [σ²e, σ²u] where:
- σ²_e is the residual variance
- σ²_u is the genetic variance
data::Tuple{Vector{Float64},Matrix{Float64},Matrix{Float64}}: Tuple containing:
- y: Vector of phenotypic observations
- X: Design matrix for fixed effects
- GRM: Genomic relationship matrix

Returns

Float64: The REML log-likelihood value. Returns Inf if matrix operations fail.

Details

Implements the REML log-likelihood calculation for a mixed model of the form: y = Xβ + Zu + e where:

β are fixed effects
u are random genetic effects with u ~ N(0, σ²_u * GRM)
e are residual effects with e ~ N(0, σ²_e * I)

The function constructs the variance-covariance matrices and computes the REML transformation to obtain the log-likelihood value used in variance component estimation.

Examples

```jldoctest; setup = :(using GenomicBreedingCore, GenomicBreedingModels, LinearAlgebra, StatsBase) julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> ploidy = 4;

julia> genomes.allelefrequencies = round.(genomes.allelefrequencies .* ploidy) ./ ploidy;

julia> proportionofvariance = zeros(9, 1); proportionofvariance[1, 1] = 0.5;

julia> trials, effects = GenomicBreedingCore.simulatetrials(genomes=genomes, nyears=1, nseasons=1, nharvests=1, nsites=1, nreplications=1, fadddomepi=[0.05 0.00 0.00;], proportionofvariance = proportionofvariance, verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> G, y, GRM, fit = gwasprep(genomes=genomes, phenomes=phenomes);

julia> loglik = loglikreml([0.53, 0.15], (y, hcat(ones(length(y)), G[:, 1]), GRM));

julia> loglik < 100 true

source

GenomicBreedingModels.metrics — Method

metrics(y_true::Vector{Float64}, y_pred::Vector{Float64})::Dict{String,Float64}

Calculate various metrics comparing predicted vs true values.

Returns a dictionary containing the following metrics:

cor: Pearson correlation coefficient
mad: Mean absolute deviation
msd: Mean squared deviation
rmsd: Root mean squared deviation
nrmsd: Normalized root mean squared deviation
euc: Euclidean distance
jac: Jaccard distance
tvar: Total variation distance
h²: Narrow-sense heritability

Arguments

y_true::Vector{Float64}: Vector of true/observed values
y_pred::Vector{Float64}: Vector of predicted values

Returns

Dict{String,Float64}: Dictionary mapping metric names to their calculated values

source

GenomicBreedingModels.mlp — Method

mlp(;
    genomes::Genomes,
    phenomes::Phenomes,
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    n_layers::Int64 = 3,
    activation::Function = relu,
    max_n_nodes::Int64 = 256,
    n_nodes_droprate::Float64 = 0.50,
    dropout_droprate::Float64 = 0.25,
    n_epochs::Int64 = 100_000,
    use_cpu::Bool = false,
    seed::Int64 = 123,
    verbose::Bool = false
)::Fit

Train a multi-layer perceptron (MLP) model for genomic prediction.

Arguments

genomes::Genomes: A Genomes struct containing the genomic data.
phenomes::Phenomes: A Phenomes struct containing the phenomic data.
idx_entries::Union{Nothing, Vector{Int64}}: Indices of entries to include in the model. If nothing, all entries are included. Default is nothing.
idx_loci_alleles::Union{Nothing, Vector{Int64}}: Indices of loci-alleles to include in the model. If nothing, all loci-alleles are included. Default is nothing.
idx_trait::Int64: Index of the trait to predict. Default is 1.
n_layers::Int64: Number of layers in the MLP. Default is 3.
activation::Function: Activation function to use in the MLP. Default is relu.
max_n_nodes::Int64: Maximum number of nodes in each layer. Default is 256.
n_nodes_droprate::Float64: Drop rate for the number of nodes in each layer. Default is 0.50.
dropout_droprate::Float64: Dropout rate for the layers. Default is 0.25.
n_epochs::Int64: Number of training epochs. Default is 100,000.
use_cpu::Bool: If true, use CPU for training. If false, use GPU if available. Default is false.
seed::Int64: Random seed for reproducibility. Default is 123.
verbose::Bool: If true, prints detailed progress information during training. Default is false.

Returns

Fit: A Fit struct containing the trained MLP model and performance metrics.

Details

This function trains a multi-layer perceptron (MLP) model on genomic and phenomic data. The function performs the following steps:

Set Random Seed: Sets the random seed for reproducibility.
Extract Features and Targets: Extracts the feature matrix X, target vector y, and other relevant information from the genomic and phenomic data.
Instantiate Output Fit: Creates a Fit struct to store the model and results.
Construct MLP Layers: Constructs the MLP layers based on the specified number of layers, activation function, and dropout rates.
Move Data to Device: Moves the data to the appropriate device (CPU or GPU).
Setup Training State: Initializes the training state with the model parameters and optimizer.
Train the Model: Trains the MLP model for the specified number of epochs, printing progress if verbose is true.
Evaluate Performance: Evaluates the model's performance using the specified metrics.
Output: Returns the Fit struct containing the trained model and performance metrics.

Notes

The function uses the Lux library for constructing and training the MLP model.
The verbose option provides additional insights into the training process by printing progress information.
The function ensures that the trained model and performance metrics are stored in the Fit struct.

Throws

ArgumentError: If the Genomes or Phenomes struct is corrupted or if any of the arguments are out of range.
ErrorException: If an error occurs during model training or evaluation.

Example

julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);

julia> phenomes = extractphenomes(trials);

julia> fit_cpu = Suppressor.@suppress mlp(genomes=genomes, phenomes=phenomes, n_epochs=1_000, use_cpu=true, verbose=false);

julia> fit_gpu = Suppressor.@suppress mlp(genomes=genomes, phenomes=phenomes, n_epochs=1_000, use_cpu=false, verbose=false);

julia> fit_cpu.metrics["cor"] >= 0.2
true

julia> fit_gpu.metrics["cor"] >= 0.2
true

source

GenomicBreedingModels.mult — Method

mult(x, y) = x * y

This function takes two numbers and multiplies them together. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Inputs and ouput range between zero and one.

source

GenomicBreedingModels.ols — Method

ols(;
    genomes::Genomes,
    phenomes::Phenomes,
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    verbose::Bool = false
)::Fit

Fits an ordinary least squares (OLS) regression model to genomic and phenotypic data.

Arguments

genomes::Genomes: Genomic data structure containing allele frequencies
phenomes::Phenomes: Phenotypic data structure containing trait measurements
idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to select specific entries (default: all entries)
idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to select specific loci-alleles (default: all loci-alleles)
idx_trait::Int64: Index of the trait to analyze (default: 1)
verbose::Bool: If true, displays diagnostic plots and performance metrics (default: false)

Returns

Fit: A fitted model object containing:
- model: Model identifier ("ols")
- b_hat: Estimated regression coefficients
- b_hat_labels: Labels for the coefficients
- trait: Name of the analyzed trait
- entries: Entry identifiers
- populations: Population identifiers
- y_true: Observed phenotypic values
- y_pred: Predicted phenotypic values
- metrics: Dictionary of performance metrics

Description

Performs ordinary least squares regression on genomic data to predict phenotypic values. The model includes an intercept term and estimates effects for each locus-allele combination.

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> fit = ols(genomes=genomes, phenomes=phenomes);

julia> fit.model == "ols"
true

julia> fit.metrics["cor"] > 0.50
true

source

GenomicBreedingModels.pearsonscorrelation — Method

pearsonscorrelation(y_true::Vector{Float64}, y_pred::Vector{Float64})::Float64

Calculate the Pearson correlation coefficient between two vectors.

The Pearson correlation coefficient measures the linear correlation between two variables, giving a value between -1 and +1, where:

+1 represents perfect positive correlation
0 represents no linear correlation
-1 represents perfect negative correlation

Arguments

y_true::Vector{Float64}: Vector of true/actual values
y_pred::Vector{Float64}: Vector of predicted/estimated values

Returns

Float64: Pearson correlation coefficient

Notes

Returns 0.0 if the variance of either input vector is less than 1e-10
Uses 1 - correlation distance from Distances.jl package

source

GenomicBreedingModels.predict — Method

predict(; fit::Fit, genomes::Genomes, idx_entries::Vector{Int64})::Vector{Float64}

Calculate predicted phenotypes using a fitted genomic prediction model.

Arguments

fit::Fit: A fitted genomic prediction model containing coefficients and model information
genomes::Genomes: Genomic data containing allele frequencies
idx_entries::Vector{Int64}: Vector of indices specifying which entries to predict

Returns

Vector{Float64}: Predicted phenotypic values for the specified entries

Details

Supports various linear genomic prediction models including:

OLS (Ordinary Least Squares)
Ridge Regression
LASSO
Bayes A
Bayes B
Bayes C

The function validates input dimensions and compatibility between the fitted model and genomic data before making predictions.

Throws

ArgumentError: If the Fit or Genomes structs are corrupted
ArgumentError: If entry indices are out of bounds
ArgumentError: If loci-alleles in the fitted model don't match the validation set
ArgumentError: If the genomic prediction model is not recognized

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);

julia> phenomes = extractphenomes(trials);

julia> fit = ridge(genomes=genomes, phenomes=phenomes, idx_entries=collect(1:90));

julia> y_hat = GenomicBreedingModels.predict(fit=fit, genomes=genomes, idx_entries=collect(91:100));

julia> cor(phenomes.phenotypes[91:100, 1], y_hat) > 0.5
true

source

GenomicBreedingModels.raise — Method

raise(x, y) = x^y

This function raises the first value to the power of the second value. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Inputs and ouput range between zero and one.

source

GenomicBreedingModels.reconstitutefeatures — Method

reconstitutefeatures(
    genomes::Genomes;
    feature_names::Vector{String},
    verbose::Bool = false
)::Genomes

Reconstruct epistasis features from a Genomes struct using feature names that encode the transformations applied.

Arguments

genomes::Genomes: Genomic data structure containing allele frequencies
feature_names::Vector{String}: Vector of feature names containing encoded transformation information
verbose::Bool: Whether to show progress bar during reconstruction (default: false)

Returns

Genomes: A new Genomes struct with reconstructed epistasis features

Details

The function parses the feature names to determine which transformations were applied and reconstructs the features by applying these transformations to the original genomic data. Feature names should contain the transformation operations in a format that can be parsed and evaluated.

Throws

ArgumentError: If the input Genomes struct is corrupted (invalid dimensions)
ErrorException: If feature reconstruction fails

Example

julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> genomes_epifeat = epistasisfeatures(genomes, phenomes, n_new_features_per_transformation=50, n_reps=2, verbose=false);

julia> feature_names = genomes_epifeat.loci_alleles;

julia> genomes_epifeat_reconstructed = reconstitutefeatures(genomes, feature_names=feature_names);

julia> genomes_epifeat == genomes_epifeat_reconstructed
true

source

GenomicBreedingModels.ridge — Method

ridge(;
    genomes::Genomes,
    phenomes::Phenomes,
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    verbose::Bool = false
)::Fit

Fit a ridge (L2) regression model to genomic data. Ridge regression adds an L2 regularization term to the ordinary least squares objective function, which helps prevent overfitting and handles multicollinearity in the predictors.

Arguments

genomes::Genomes: Genomic data structure containing allele frequencies
phenomes::Phenomes: Phenotypic data structure containing trait measurements
idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset specific entries/individuals
idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset specific loci-alleles
idx_trait::Int64: Index of the trait to analyze (default: 1)
verbose::Bool: If true, prints diagnostic plots and additional information (default: false)

Returns

Fit: A fitted model object containing:
- model: Model identifier ("ols")
- b_hat: Estimated regression coefficients
- b_hat_labels: Labels for the coefficients
- trait: Name of the analyzed trait
- entries: Entry identifiers
- populations: Population identifiers
- y_true: Observed phenotypic values
- y_pred: Predicted phenotypic values
- metrics: Dictionary of performance metrics

Notes

Uses cross-validation to select the optimal regularization parameter (λ)
Standardizes predictors before fitting
Includes an intercept term in the model

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> fit = ridge(genomes=genomes, phenomes=phenomes);

julia> fit.model == "ridge"
true

julia> fit.metrics["cor"] > 0.50
true

source

GenomicBreedingModels.square — Method

square(x) = x^2

This function takes a number and squares it. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Both input and ouput range between zero and one.

source

GenomicBreedingModels.transform1 — Method

transform1(
    f::Function,
    genomes::Genomes,
    phenomes::Phenomes;
    idx_trait::Int64 = 1,
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    n_new_features_per_transformation::Int64 = 1_000,
    ϵ::Float64 = eps(Float64),
    use_abs::Bool = false,
    σ²_threshold::Float64 = 0.01,
    verbose::Bool = false
)::Genomes

Apply a transformation function to each allele frequency in genomic data while considering their effects on phenotypes.

Arguments

f::Function: Transformation function to apply to allele frequencies
genomes::Genomes: Input genomic data structure
phenomes::Phenomes: Corresponding phenotypic data structure

Keywords

idx_trait::Int64: Index of the trait to analyze (default: 1)
idx_entries::Union{Nothing,Vector{Int64}}: Indices of entries to include (default: all)
idx_loci_alleles::Union{Nothing,Vector{Int64}}: Indices of loci-alleles to include (default: all)
n_new_features_per_transformation::Int64: Maximum number of transformed features to retain (default: 1000)
ϵ::Float64: Small value added to prevent numerical issues (default: machine epsilon)
use_abs::Bool: Whether to use absolute values before transformation (default: false)
σ²_threshold::Float64: Minimum variance threshold for considering a locus (default: 0.01)
verbose::Bool: Whether to show progress and plots (default: false)

Returns

Genomes: A new Genomes object containing the transformed allele frequencies

Details

The function performs the following steps:

Extracts allele frequencies and phenotypic data
Applies the transformation function to each locus
Estimates effects using ordinary least squares
Selects the most important transformed features
Cleans numerical artifacts (values near 0 or 1)

Notes

Use named functions to ensure transformations can be reconstructed from loci_alleles
The function adds ϵ to frequencies to prevent numerical issues
Features with variance below σ²_threshold are skipped
The output contains at most n_new_features_per_transformation features

Example

julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> genomes_transformed = transform1(x -> x^2, genomes, phenomes);

julia> idx = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[1])[1];

julia> mean(sqrt.(genomes_transformed.allele_frequencies[:, 1]) .- genomes.allele_frequencies[:, idx]) < 1e-10
true

julia> squareaddpi(x) = x^2 + pi;

julia> genomes_transformed = transform1(squareaddpi, genomes, phenomes);

julia> idx = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[1])[1];

julia> mean(squareaddpi.(genomes.allele_frequencies[:, idx]) .- genomes_transformed.allele_frequencies[:, 1]) < 1e-10
true

source

GenomicBreedingModels.transform2 — Method

transform2(
    f::Function,
    genomes::Genomes,
    phenomes::Phenomes;
    idx_trait::Int64 = 1,
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    n_new_features_per_transformation::Int64 = 1_000,
    ϵ::Float64 = eps(Float64),
    use_abs::Bool = false,
    σ²_threshold::Float64 = 0.01,
    commutative::Bool = false,
    verbose::Bool = false
)::Genomes

Apply a transformation function to pairs of allele frequencies in genomic data while considering their effects on phenotypes.

Arguments

f::Function: Transformation function to apply to pairs of allele frequencies
genomes::Genomes: Input genomic data structure
phenomes::Phenomes: Corresponding phenotypic data structure
idx_trait::Int64: Index of the trait to analyze (default: 1)
idx_entries::Union{Nothing,Vector{Int64}}: Subset of entries to include (default: all)
idx_loci_alleles::Union{Nothing,Vector{Int64}}: Subset of loci-alleles to include (default: all)
n_new_features_per_transformation::Int64: Maximum number of transformed features to retain (default: 1000)
ϵ::Float64: Small value added to frequencies to avoid numerical issues (default: machine epsilon)
use_abs::Bool: Whether to use absolute values of frequencies (default: false)
σ²_threshold::Float64: Minimum variance threshold for considering loci (default: 0.01)
commutative::Bool: Whether the transformation function is commutative (default: false)
verbose::Bool: Whether to display progress and diagnostics (default: false)

Returns

Genomes: A new Genomes object containing the transformed features

Details

The function performs the following steps:

Extracts allele frequencies and phenotypic data
Applies the transformation function to all possible pairs of allele freuqencies (i.e., locus-allele combinations)
Estimates effects using ordinary least squares
Selects the most important transformed features
Cleans numerical artifacts (values near 0 or 1)

Notes

Use named functions to ensure transformations can be reconstructed from loci_alleles
The function adds ϵ to frequencies to prevent numerical issues
Features with variance below σ²_threshold are skipped
The output contains at most n_new_features_per_transformation features

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> genomes_transformed = transform2((x,y) -> (x^2 + sqrt(y)) / 2, genomes, phenomes);

julia> idx_1 = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[1])[1];

julia> idx_2 = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[2])[1];

julia> mean((genomes.allele_frequencies[:,idx_1].^2 .+ sqrt.(genomes.allele_frequencies[:,idx_2])) ./ 2 .- genomes_transformed.allele_frequencies[:,1]) < 1e-10
true

julia> raisexbyythenlog(x, y) = log(abs(x^y));

julia> genomes_transformed = transform2(raisexbyythenlog, genomes, phenomes);

julia> idx_1 = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[1])[1];

julia> idx_2 = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[2])[1];

julia> mean(raisexbyythenlog.(genomes.allele_frequencies[:,idx_1], genomes.allele_frequencies[:,idx_2]) .- genomes_transformed.allele_frequencies[:,1]) < 1e-10
true

source

GenomicBreedingModels.turing_bayesG — Method

Turing specification of Bayesian linear regression using a Gaussian prior with common variance

Example usage

# Benchmarking
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
model = turing_bayesG(G, y)
benchmarks = TuringBenchmarking.benchmark_model(
    model;
    # Check correctness of computations
    check=true,
    # Automatic differentiation backends to check and benchmark
    adbackends=[:forwarddiff, :reversediff, :reversediff_compiled, :zygote]
)

# Test more loci
genomes = GenomicBreedingCore.simulategenomes(n=10, l=10_000)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Check for uninferred types in the model
@code_warntype model = turing_bayesG(G, y)
# Fit
model = turing_bayesG(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.65, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# @time chain = Turing.sample(rng, model, HMC(0.05, 10; adtype=AutoReverseDiff(compile=true)), niter, progress=true);
p = Plots.histogram(chain[:σ²])
Plots.gui(p)
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[501:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)

source

GenomicBreedingModels.turing_bayesG_logit — Method

Turing specification of Bayesian logistic regression using a Gaussian prior with common variance

source

GenomicBreedingModels.turing_bayesGs — Method

Turing specification of Bayesian linear regression using a Gaussian prior with varying variances

Example usage

# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesGs(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)

source

GenomicBreedingModels.turing_bayesGπ — Method

Turing specification of Bayesian linear regression using a Gaussian prior with a point mass at zero and common variance

Example usage

# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesGπ(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)

source

GenomicBreedingModels.turing_bayesGπs — Method

Turing specification of Bayesian linear regression using a Gaussian prior with a point mass at zero and varying variances

Example usage

# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesGπs(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)

source

GenomicBreedingModels.turing_bayesL — Method

Turing specification of Bayesian linear regression using a Laplacian prior with a common scale

Example usage

# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesL(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)

source

GenomicBreedingModels.turing_bayesLs — Method

Turing specification of Bayesian linear regression using a Laplacian prior with varying scales

Example usage

# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesLs(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)

source

GenomicBreedingModels.turing_bayesLπ — Method

Turing specification of Bayesian linear regression using a Laplacian prior with a point mass at zero and common scale

Example usage

# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesLπ(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)

source

GenomicBreedingModels.turing_bayesLπs — Method

Turing specification of Bayesian linear regression using a Laplacian prior with a point mass at zero and common scale

Example usage

# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesLπs(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)

source

GenomicBreedingModels.turing_bayesT — Method

Turing specification of Bayesian linear regression using a T-distribution

Example usage

# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesT(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)

source

GenomicBreedingModels.turing_bayesTπ — Method

Turing specification of Bayesian linear regression using a T-distribution with a point mass at zero

Example usage

# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesTπ(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)

source

GenomicBreedingModels.validate — Method

validate(
    fit::Fit,
    genomes::Genomes,
    phenomes::Phenomes;
    idx_validation::Vector{Int64},
    replication::String="",
    fold::String=""
)::CV

Evaluate the predictive accuracy of a genomic prediction model on a validation dataset.

Arguments

fit::Fit: A fitted genomic prediction model
genomes::Genomes: Genomic data structure containing allele frequencies
phenomes::Phenomes: Phenotypic data structure containing trait measurements
idx_validation::Vector{Int64}: Indices of entries to use for validation
replication::String: Optional identifier for the validation replication
fold::String: Optional identifier for the cross-validation fold

Returns

CV: A cross-validation result object containing:
- Validation metrics (correlation, RMSE, etc.)
- True and predicted values
- Entry and population information
- Model specifications

Notes

Performs checks for data leakage between training and validation sets
Handles missing, NaN, and Inf values in phenotypic data
Validates dimensions of output CV struct

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);

julia> phenomes = extractphenomes(trials);

julia> fit = ridge(genomes=genomes, phenomes=phenomes, idx_entries=collect(1:90));

julia> cv = validate(fit, genomes, phenomes, idx_validation=collect(91:100));

julia> cv.metrics["cor"] > 0.50
true

source

GenomicBreedingModels.@string2operations — Macro

@string2operations(x)

Convert a string containing an endofunction formulae for genomic features into a parsed Julia expressions.

Arguments

x: A string containing mathematical operations on allele frequencies

Returns

Parsed expression that can be evaluated in Julia

source