GenomicBreedingModels

Documentation for GenomicBreedingModels.

Base.randMethod

Sampling method for LπDist

Examples

d = LπDist(0.1, 0.0, 1.0)
rand(d)
source
Base.randMethod

Sampling method for NπDist

Examples

d = NπDist(0.1, 0.0, 1.0)
rand(d)
source
Base.randMethod

Sampling method for TπDist

Examples

d = TπDist(0.1, 1.0)
rand(d)
source
GenomicBreedingModels.addnormMethod
addnorm(x, y) = (x + y) / 2.0

This function takes two numbers and finds their arithmetic mean. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Inputs and ouput range between zero and one.

source
GenomicBreedingModels.bayesaMethod
bayesa(;
    genomes::Genomes,
    phenomes::Phenomes,
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    n_iter::Int64 = 1_500,
    n_burnin::Int64 = 500,
    verbose::Bool = false
)::Fit

Fits a Bayes A model for genomic prediction using the BGLR method. Bayes A assumes marker effects follow a scaled-t distribution, allowing for different variances for each marker effect.

Arguments

  • genomes::Genomes: Genomic data structure containing allele frequencies
  • phenomes::Phenomes: Phenotypic data structure containing trait measurements
  • idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset specific entries/individuals
  • idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset specific genetic markers
  • idx_trait::Int64: Index of the trait to analyze (default: 1)
  • n_iter::Int64: Number of iterations for the MCMC algorithm (default: 1,500)
  • n_burnin::Int64: Number of burn-in iterations to discard (default: 500)
  • verbose::Bool: Whether to print progress messages (default: false)

Returns

  • Fit: A fitted model object containing:
    • model: Model identifier ("ols")
    • b_hat: Estimated regression coefficients
    • b_hat_labels: Labels for the coefficients
    • trait: Name of the analyzed trait
    • entries: Entry identifiers
    • populations: Population identifiers
    • y_true: Observed phenotypic values
    • y_pred: Predicted phenotypic values
    • metrics: Dictionary of performance metrics

Notes

  • The model assumes a Gaussian distribution for the trait values
  • Recommended to check convergence by examining trace plots of key parameters
  • The burn-in period should be adjusted based on convergence diagnostics

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> fit = bayesa(genomes=genomes, phenomes=phenomes);

julia> fit.model == "bayesa"
true

julia> fit.metrics["cor"] > 0.50
true
source
GenomicBreedingModels.bayesbMethod
bayesb(;
    genomes::Genomes,
    phenomes::Phenomes,
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    n_iter::Int64 = 1_500,
    n_burnin::Int64 = 500,
    verbose::Bool = false
)::Fit

Fits a Bayes B model for genomic prediction using the BGLR method. Bayes B assumes that marker effects follow a mixture distribution where some markers have zero effect and others follow a scaled-t distribution.

Arguments

  • genomes::Genomes: Genomic data structure containing allele frequencies
  • phenomes::Phenomes: Phenotypic data structure containing trait measurements
  • idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset specific entries (default: nothing)
  • idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset specific loci/alleles (default: nothing)
  • idx_trait::Int64: Index of the trait to analyze (default: 1)
  • n_iter::Int64: Number of iterations for the MCMC algorithm (default: 1,500)
  • n_burnin::Int64: Number of burn-in iterations to discard (default: 500)
  • verbose::Bool: Whether to print progress information (default: false)

Returns

  • Fit: A fitted model object containing:
    • model: Model identifier ("ols")
    • b_hat: Estimated regression coefficients
    • b_hat_labels: Labels for the coefficients
    • trait: Name of the analyzed trait
    • entries: Entry identifiers
    • populations: Population identifiers
    • y_true: Observed phenotypic values
    • y_pred: Predicted phenotypic values
    • metrics: Dictionary of performance metrics

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> fit = bayesb(genomes=genomes, phenomes=phenomes);

julia> fit.model == "bayesb"
true

julia> fit.metrics["cor"] > 0.50
true
source
GenomicBreedingModels.bayescMethod
bayesc(;
    genomes::Genomes,
    phenomes::Phenomes,
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    n_iter::Int64 = 1_500,
    n_burnin::Int64 = 500,
    verbose::Bool = false
)::Fit

Fits a Bayes C model for genomic prediction using the BGLR method. Bayes C assumes that marker effects follow a mixture distribution with a point mass at zero and a normal distribution.

Arguments

  • genomes::Genomes: Genomic data structure containing allele frequencies
  • phenomes::Phenomes: Phenotypic data structure containing trait measurements
  • idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset specific entries/individuals
  • idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset specific loci/alleles
  • idx_trait::Int64: Index of the trait to analyze (default: 1)
  • n_iter::Int64: Number of iterations for MCMC sampling (default: 1,500)
  • n_burnin::Int64: Number of burn-in iterations to discard (default: 500)
  • verbose::Bool: Whether to print progress information (default: false)

Returns

  • Fit: A fitted model object containing:
    • model: Model identifier ("ols")
    • b_hat: Estimated regression coefficients
    • b_hat_labels: Labels for the coefficients
    • trait: Name of the analyzed trait
    • entries: Entry identifiers
    • populations: Population identifiers
    • y_true: Observed phenotypic values
    • y_pred: Predicted phenotypic values
    • metrics: Dictionary of performance metrics

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> fit = bayesc(genomes=genomes, phenomes=phenomes);

julia> fit.model == "bayesc"
true

julia> fit.metrics["cor"] > 0.50
true
source
GenomicBreedingModels.bayesianMethod
bayesian(
    turing_model::Function;
    X::Matrix{Float64},
    y::Vector{Float64},
    sampler::String = ["NUTS", "HMC", "HMCDA", "MH", "PG"][1],
    sampling_method::Int64 = 1,
    seed::Int64 = 123,
    n_burnin::Int64 = 500,
    n_iter::Int64 = 1_500,
    verbose::Bool = false,
)::Fit

Fit a Bayesian linear regression models via Turing.jl

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);

julia> phenomes = extractphenomes(trials);

julia> sol = Suppressor.@suppress bayesian(turing_bayesG, genomes=genomes, phenomes=phenomes);

julia> # Slow because not multivariate T-dist: sol_T = Suppressor.@suppress bayesian(turing_bayesT, genomes=genomes, phenomes=phenomes);

julia> # Even slower because of an extra set of distribution to define a non-spherical variance-covariance matrix: sol_Gs = Suppressor.@suppress bayesian(turing_bayesGs, genomes=genomes, phenomes=phenomes);

julia> sol_BGLR = Suppressor.@suppress bayesian("BayesA", genomes=genomes, phenomes=phenomes); sol.metrics["cor"] > sol_BGLR.metrics["cor"]
true

julia> sol.metrics["cor"] > 0.5
true
source
GenomicBreedingModels.bayesianMethod
bayesian(
    bglr_model::String;
    genomes::Genomes,
    phenomes::Phenomes,
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    response_type::String = ["gaussian", "ordinal"][1], 
    n_burnin::Int64 = 500,
    n_iter::Int64 = 1_500, 
    verbose::Bool = false
)::Fit

Fit Bayesian genomic prediction models using the BGLR R package.

Arguments

  • bglr_model::String: Type of BGLR model to fit. Options: "BayesA", "BayesB", "BayesC"
  • genomes::Genomes: Genomic data structure containing allele frequencies
  • phenomes::Phenomes: Phenotypic data structure containing trait measurements
  • idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset entries
  • idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset loci/alleles
  • idx_trait::Int64: Index of trait to analyze (default: 1)
  • response_type::String: Type of response variable. Options: "gaussian" or "ordinal" (default: "gaussian")
  • n_burnin::Int64: Number of burn-in MCMC iterations (default: 500)
  • n_iter::Int64: Total number of MCMC iterations (default: 1,500)
  • verbose::Bool: Whether to print progress information (default: false)

Returns

  • Fit: Object containing:
    • Model fit summary
    • Estimated genetic effects
    • Predicted values
    • Model performance metrics
    • Input data details

Details

This function provides a Julia interface to the BGLR R package for Bayesian genomic prediction. It uses temporary files to interface with R and automatically cleans up afterward.

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);

julia> phenomes = extractphenomes(trials);

julia> sol = Suppressor.@suppress bayesian("BayesA", genomes=genomes, phenomes=phenomes);

julia> sol.metrics["cor"] > 0.5
true
source
GenomicBreedingModels.bglrMethod
bglr(; G::Matrix{Float64}, y::Vector{Float64}, model::String="BayesA", 
     response_type::String="gaussian", n_iter::Int64=1_500, 
     n_burnin::Int64=500, verbose::Bool=false)::Vector{Float64}

Perform Bayesian genomic prediction using the BGLR (Bayesian Generalized Linear Regression) R package.

Arguments

  • G::Matrix{Float64}: Marker matrix where rows represent individuals and columns represent locus-allele combinations
  • y::Vector{Float64}: Vector of phenotypic values
  • model::String: Bayesian model type to use. Options: "BayesA", "BayesB", or "BayesC". Default: "BayesA"
  • response_type::String: Type of response variable. Options: "gaussian" or "ordinal". Default: "gaussian"
  • n_iter::Int64: Number of iterations for the MCMC chain. Default: 1,500
  • n_burnin::Int64: Number of burn-in iterations to discard. Default: 500
  • verbose::Bool: Whether to print progress information. Default: false

Returns

  • Vector{Float64}: Estimated effects including the intercept (first element) followed by marker effects

Details

This function creates temporary files to interface with R's BGLR package, runs the Bayesian analysis, and automatically cleans up temporary files afterward. The function uses the system's R installation and requires the BGLR package to be installed in R.

Notes

This is hacky. It invokes Rscript for each instance which should allow multithreading because RCall.jl currently does not allow multithreading.

source
GenomicBreedingModels.cvbulkMethod
cvbulk(;
    genomes::Genomes,
    phenomes::Phenomes,
    models::Vector{Function}=[ridge],
    n_replications::Int64=5,
    n_folds::Int64=5,
    seed::Int64=42,
    verbose::Bool=true
)::Tuple{Vector{CV}, Vector{String}}

Perform replicated k-fold cross-validation of genomic prediction model(s) across all available traits and entries, ignoring populations.

Arguments

  • genomes::Genomes: Genomic data structure containing allele frequencies
  • phenomes::Phenomes: Phenotypic data structure containing trait measurements
  • models::Vector{Function}: Vector of genomic prediction model functions to evaluate
  • n_replications::Int64: Number of times to repeat k-fold cross-validation randomising k-fold partitioning each time (default: 5)
  • n_folds::Int64: Number of cross-validation folds (default: 5)
  • seed::Int64: Random seed for reproducibility (default: 42)
  • verbose::Bool: Whether to display progress information (default: true)

Returns

  • Tuple containing:
    • Vector of CV objects with cross-validation results
    • Vector of warning messages about skipped cases

Threading

Uses multiple threads if Julia is started with threading enabled. Example startup command: julia --threads 7,1 (7 worker threads, 1 runtime thread)

Notes

  • Performs random k-fold partitioning of entries ignoring population structure
  • Handles missing/invalid phenotype values
  • Validates model inputs and data dimensions
  • Returns warnings for cases with insufficient data or zero variance

Example

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false); genomes.populations = StatsBase.sample(string.("pop_", 1:3), length(genomes.entries), replace=true);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, verbose=false);

julia> phenomes = extractphenomes(trials);

julia> cvs, notes = cvbulk(genomes=genomes, phenomes=phenomes, models=[ols, ridge], n_replications=2, n_folds=2, verbose=false);

julia> df_across_entries, df_per_entry = tabularise(cvs);

julia> idx_across = findall((df_across_entries.trait .== "trait_1") .&& (df_across_entries.model .== "ridge") .&& (df_across_entries.replication .== "replication_1") .&& (df_across_entries.fold .== "fold_1"));

julia> idx_per = findall((df_per_entry.trait .== "trait_1") .&& (df_per_entry.model .== "ridge") .&& (df_per_entry.replication .== "replication_1") .&& (df_per_entry.fold .== "fold_1"));

julia> abs(df_across_entries.cor[idx_across][1] - cor(df_per_entry.y_true[idx_per], df_per_entry.y_pred[idx_per])) < 1e-10
true
source
GenomicBreedingModels.cvleaveonepopulationoutMethod
cvleaveonepopulationout(;
    genomes::Genomes,
    phenomes::Phenomes,
    models::Vector{Function}=[ridge],
    n_replications::Int64=5,
    n_folds::Int64=5,
    seed::Int64=42,
    verbose::Bool=true
)::Tuple{Vector{CV}, Vector{String}}

Performs leave-one-population-out cross-validation for genomic prediction models across all available traits.

Arguments

  • genomes::Genomes: Genomic data structure containing allele frequencies
  • phenomes::Phenomes: Phenotypic data structure containing trait measurements
  • models::Vector{Function}: Vector of model functions to evaluate (default: [ridge])
  • n_replications::Int64: Number of replications (not used in this implementation)
  • n_folds::Int64: Number of folds (not used in this implementation)
  • seed::Int64: Random seed (not used in this implementation)
  • verbose::Bool: If true, displays progress information during execution

Returns

  • Tuple{Vector{CV}, Vector{String}}: Returns a tuple containing:
    • Vector of CV objects with cross-validation results
    • Vector of warning/error messages for skipped validations

Details

The function implements a leave-one-population-out cross-validation strategy where:

  1. For each trait and population combination:
    • Uses one population as validation set
    • Uses remaining populations as training set
  2. Evaluates multiple genomic prediction models
  3. Handles missing data and variance checks
  4. Supports parallel processing via Julia's multi-threading

Threading

To utilize multiple threads, start Julia with: julia --threads n,1 where n is the desired number of threads for computation and 1 is reserved for the runtime.

Supported Models

  • ols (Ordinary Least Squares)
  • ridge (Ridge Regression)
  • lasso (Lasso Regression)
  • bayesa (Bayes A)
  • bayesb (Bayes B)
  • bayesc (Bayes C)

Example

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false); genomes.populations = StatsBase.sample(string.("pop_", 1:3), length(genomes.entries), replace=true);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);

julia> phenomes = extractphenomes(trials);

julia> cvs, notes = cvleaveonepopulationout(genomes=genomes, phenomes=phenomes, models=[ridge, bayesa], n_replications=2, n_folds=2, verbose=false);

julia> df_across_entries, df_per_entry = tabularise(cvs);

julia> sum([sum(split(df_across_entries.training_population[i], ";") .== df_across_entries.validation_population[i]) for i in 1:size(df_across_entries, 1)]) == 0
true

julia> idx_across = findall((df_across_entries.validation_population .== "pop_1") .&& (df_across_entries.trait .== "trait_1") .&& (df_across_entries.model .== "ridge"));

julia> idx_per = findall((df_per_entry.validation_population .== "pop_1") .&& (df_per_entry.trait .== "trait_1") .&& (df_per_entry.model .== "ridge"));

julia> abs(df_across_entries.cor[idx_across][1] - cor(df_per_entry.y_true[idx_per], df_per_entry.y_pred[idx_per])) < 1e-10
true
source
GenomicBreedingModels.cvmultithread!Method
cvmultithread!(cvs::Vector{CV}; genomes::Genomes, phenomes::Phenomes, models_vector, verbose::Bool = true)::Vector{CV}

Perform multi-threaded genomic prediction cross-validation using specified models.

Arguments

  • cvs::Vector{CV}: Vector of cross-validation objects to be processed
  • genomes::Genomes: Genomic data structure containing allele frequencies
  • phenomes::Phenomes: Phenotypic data structure containing trait measurements
  • models_vector: Vector of model functions to be used for prediction (e.g., [ridge, bayesa])
  • verbose::Bool=true: Whether to display progress bar during computation

Returns

  • Modified cvs vector with updated cross-validation results

Threading

Requires Julia to be started with multiple threads to utilize parallel processing. Example startup command: julia --threads 7,1 (7 worker threads, 1 runtime thread)

Details

The function performs cross-validation in parallel for each CV object using the corresponding model from the models_vector. For each fold:

  1. Extracts training and validation indices
  2. Fits the specified model using training data
  3. Validates the model using validation data
  4. Updates the CV object with prediction results

Example

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);

julia> phenomes = extractphenomes(trials);

julia> idx_training = collect(1:50);

julia> idx_validation_1 = collect(51:75);

julia> idx_validation_2 = collect(76:100);

julia> fit = Fit(n = length(idx_training), l = length(genomes.loci_alleles) + 1); fit.model = "ridge"; fit.trait = "trait_1"; 

julia> fit.entries = genomes.entries[idx_training]; fit.populations = genomes.populations[idx_training]; 

julia> fit.b_hat_labels = vcat(["intercept"], genomes.loci_alleles);

julia> cv_1 = CV("replication_1", "fold_1", fit, genomes.populations[idx_validation_1], genomes.entries[idx_validation_1], zeros(length(idx_validation_1)), zeros(length(idx_validation_1)), fit.metrics);

julia> cv_2 = CV("replication_1", "fold_2", fit, genomes.populations[idx_validation_2], genomes.entries[idx_validation_2], zeros(length(idx_validation_2)), zeros(length(idx_validation_2)), fit.metrics);

julia> cvs = [cv_1, cv_2]; models = [ridge, ridge];

julia> cvmultithread!(cvs, genomes=genomes, phenomes=phenomes, models_vector=[ridge, bayesa], verbose=false);

julia> df_across_entries, df_per_entry = tabularise(cvs);

julia> idx_across = findall(df_across_entries.fold .== "fold_2");

julia> idx_per = findall(df_per_entry.fold .== "fold_2");

julia> abs(df_across_entries.cor[idx_across][1] - cor(df_per_entry.y_true[idx_per], df_per_entry.y_pred[idx_per])) < 1e-10
true
source
GenomicBreedingModels.cvpairwisepopulationMethod
cvpairwisepopulation(;
    genomes::Genomes,
    phenomes::Phenomes,
    models::Vector{Function}=[ridge],
    n_replications::Int64=5,
    n_folds::Int64=5, 
    seed::Int64=42,
    verbose::Bool=true
)::Tuple{Vector{CV}, Vector{String}}

Performs pairwise cross-validation between populations for genomic prediction models.

Arguments

  • genomes::Genomes: Genomic data structure containing allele frequencies
  • phenomes::Phenomes: Phenotypic data structure containing trait measurements
  • models::Vector{Function}: Vector of genomic prediction model functions to evaluate (default: [ridge])
  • n_replications::Int64: Number of replications (unused, maintained for API consistency)
  • n_folds::Int64: Number of folds (unused, maintained for API consistency)
  • seed::Int64: Random seed (unused, maintained for API consistency)
  • verbose::Bool: Whether to display progress information

Returns

  • Tuple{Vector{CV}, Vector{String}}:
    • Vector of CV objects containing cross-validation results
    • Vector of warning messages for skipped validations

Details

For each pair of populations (pop1, pop2):

  1. Uses pop1 as training set and pop2 as validation set
  2. Skips within-population validation (pop1==pop2)
  3. Evaluates each model on all available traits
  4. Handles missing/invalid phenotype values
  5. Validates model inputs and data dimensions

Threading

Requires Julia to be started with multiple threads: julia --threads n,1 where n is number of worker threads

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false); genomes.populations = StatsBase.sample(string.("pop_", 1:3), length(genomes.entries), replace=true);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);

julia> phenomes = extractphenomes(trials);

julia> cvs, notes = cvpairwisepopulation(genomes=genomes, phenomes=phenomes, models=[ols, ridge], n_replications=2, n_folds=2, verbose=false);

julia> df_across_entries, df_per_entry = tabularise(cvs);

julia> sum(df_across_entries.training_population .!= df_across_entries.validation_population) == size(df_across_entries, 1)
true

julia> idx_across = findall((df_across_entries.training_population .== "pop_1") .&& (df_across_entries.validation_population .== "pop_2") .&& (df_across_entries.trait .== "trait_1") .&& (df_across_entries.model .== "ridge"));

julia> idx_per = findall((df_per_entry.training_population .== "pop_1") .&& (df_per_entry.validation_population .== "pop_2") .&& (df_per_entry.trait .== "trait_1") .&& (df_per_entry.model .== "ridge"));

julia> abs(df_across_entries.cor[idx_across][1] - cor(df_per_entry.y_true[idx_per], df_per_entry.y_pred[idx_per])) < 1e-10
true
source
GenomicBreedingModels.cvperpopulationMethod
cvperpopulation(;
    genomes::Genomes,
    phenomes::Phenomes,
    models::Vector{Function}=[ridge],
    n_replications::Int64=5,
    n_folds::Int64=5,
    seed::Int64=42,
    verbose::Bool=true
)::Tuple{Vector{CV}, Vector{String}}

Performs within-population replicated cross-validation of genomic prediction models across all available traits.

Arguments

  • genomes::Genomes: Genomic data structure containing allele frequencies
  • phenomes::Phenomes: Phenotypic data structure containing trait measurements
  • models::Vector{Function}=[ridge]: Vector of genomic prediction model functions to evaluate
  • n_replications::Int64=5: Number of replications for cross-validation
  • n_folds::Int64=5: Number of folds for k-fold cross-validation
  • seed::Int64=42: Random seed for reproducibility
  • verbose::Bool=true: Whether to print progress information

Returns

  • Tuple{Vector{CV}, Vector{String}}: A tuple containing:
    • Vector of CV objects with cross-validation results
    • Vector of notes/warnings generated during the process

Details

The function performs separate cross-validations for each unique population in the dataset. Supports multiple genomic prediction models including:

  • ols: Ordinary Least Squares
  • ridge: Ridge Regression
  • lasso: Lasso Regression
  • bayesa: Bayes A
  • bayesb: Bayes B
  • bayesc: Bayes C

Threading

To use multiple threads, invoke Julia with: julia --threads n,1 where n is the desired number of threads for multi-threaded processes and 1 is reserved for the Julia runtime.

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false); genomes.populations = StatsBase.sample(string.("pop_", 1:3), length(genomes.entries), replace=true);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);

julia> phenomes = extractphenomes(trials);

julia> cvs, notes = cvperpopulation(genomes=genomes, phenomes=phenomes, models=[ridge, bayesa], n_replications=2, n_folds=2, verbose=false);

julia> df_across_entries, df_per_entry = tabularise(cvs);

julia> sort(unique(df_across_entries.training_population))
3-element Vector{String}:
 "pop_1"
 "pop_2"
 "pop_3"

julia> df_across_entries.training_population == df_across_entries.validation_population
true

julia> idx_across = findall((df_across_entries.validation_population .== "pop_1") .&& (df_across_entries.trait .== "trait_1") .&& (df_across_entries.model .== "bayesa") .&& (df_across_entries.replication .== "replication_1") .&& (df_across_entries.fold .== "fold_1"));

julia> idx_per = findall((df_per_entry.validation_population .== "pop_1") .&& (df_per_entry.trait .== "trait_1") .&& (df_per_entry.model .== "bayesa") .&& (df_per_entry.replication .== "replication_1") .&& (df_per_entry.fold .== "fold_1"));

julia> abs(df_across_entries.cor[idx_across][1] - cor(df_per_entry.y_true[idx_per], df_per_entry.y_pred[idx_per])) < 1e-10
true

julia> summary_across, summary_per_entry = summarise(cvs);

julia> size(summary_across)
(6, 8)

julia> size(summary_per_entry)
(200, 8)
source
GenomicBreedingModels.epistasisfeaturesMethod
epistasisfeatures(
    genomes::Genomes,
    phenomes::Phenomes;
    idx_trait::Int64 = 1,
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    transformations1::Vector{Function} = [square, invoneplus, log10epsdivlog10eps],
    transformations2::Vector{Function} = [mult, addnorm, raise],
    n_new_features_per_transformation::Int64 = 1_000,
    n_reps::Int64 = 3,
    verbose::Bool = false
)::Genomes

Generate epistasis features by applying various transformations to genomic data.

Arguments

  • genomes::Genomes: Genomic data structure containing allele frequencies
  • phenomes::Phenomes: Phenotypic data structure containing trait measurements
  • idx_trait::Int64: Index of the trait to analyze (default: 1)
  • idx_entries::Union{Nothing,Vector{Int64}}: Indices of entries to include (default: all)
  • idx_loci_alleles::Union{Nothing,Vector{Int64}}: Indices of loci/alleles to include (default: all)
  • transformations1::Vector{Function}: Single-input transformation functions (default: [square, invoneplus, log10epsdivlog10eps])
  • transformations2::Vector{Function}: Two-input transformation functions (default: [mult, addnorm, raise])
  • n_new_features_per_transformation::Int64: Number of new features to generate per transformation (default: 1_000)
  • n_reps::Int64: Number of times to repeat the transformation process (default: 3)
  • verbose::Bool: Whether to display progress information (default: false)

Returns

  • Genomes: Enhanced genomic data structure with additional epistasis features

Description

This function generates epistasis features by applying two types of transformations:

  1. Single-input transformations (transformations1) applied to individual genomic features
  2. Two-input transformations (transformations2) applied to pairs of genomic features

The transformations are repeated n_reps times to create a rich set of derived features.

Notes

  • Ensures all generated features are within [0,1] range
  • Maintains dimensional consistency of input/output structures
  • Automatically handles entry and loci/allele subsetting
  • Validates input data structures before processing

Example

julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> genomes_plus_features = epistasisfeatures(genomes, phenomes, n_new_features_per_transformation=50, n_reps=2, verbose=false);

julia> cvs, notes = cvbulk(genomes=genomes_plus_features, phenomes=phenomes, models=[ridge, lasso, bayesa], verbose=false);

julia> cvs_no_epi, notes_no_epi = cvbulk(genomes=genomes, phenomes=phenomes, models=[ridge, lasso, bayesa], verbose=false);

julia> df_across, df_per_entry = GenomicBreedingCore.tabularise(cvs);

julia> df_across_no_epi, df_per_entry_no_epi = GenomicBreedingCore.tabularise(cvs_no_epi);

julia> df_summary = combine(groupby(df_across, [:trait, :model]), [[:cor] => mean, [:cor] => std]);

julia> df_summary_no_epi = combine(groupby(df_across_no_epi, [:trait, :model]), [[:cor] => mean, [:cor] => std]);

julia> mean(df_summary.cor_mean) > mean(df_summary_no_epi.cor_mean)
true
source
GenomicBreedingModels.extractxyetcMethod
extractxyetc(
    genomes::Genomes,
    phenomes::Phenomes;
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    add_intercept::Bool = true
) -> Tuple{Matrix{Float64}, Vector{Float64}, Vector{String}, Vector{String}, Vector{String}}

Extract data matrices and vectors from genomic and phenotypic data for statistical analyses.

Arguments

  • genomes::Genomes: Genomic data structure containing allele frequencies
  • phenomes::Phenomes: Phenotypic data structure containing trait measurements
  • idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to select specific entries (default: all entries)
  • idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to select specific loci-alleles (default: all loci-alleles)
  • idx_trait::Int64: Index of the trait to analyze (default: 1)
  • add_intercept::Bool: Whether to add an intercept column to the design matrix (default: true)

Returns

A tuple containing:

  1. X::Matrix{Float64}: Design matrix with allele frequencies (and intercept if specified)
  2. y::Vector{Float64}: Response vector with phenotypic values
  3. entries::Vector{String}: Names of the selected entries
  4. populations::Vector{String}: Population identifiers for the selected entries
  5. loci_alleles::Vector{String}: Names of the selected loci-alleles

Notes

  • The function filters out entries with missing, NaN, or infinite phenotype values
  • Requires at least 2 valid entries after filtering
  • Checks for non-zero variance in the trait values
  • Ensures consistency between genomic and phenotypic data dimensions
  • Validates all index inputs are within bounds

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> X, y, loci_alleles = extractxyetc(genomes, phenomes);

julia> X == hcat(ones(length(phenomes.entries)), genomes.allele_frequencies)
true

julia> y == phenomes.phenotypes[:, 1]
true
source
GenomicBreedingModels.gwaslmmMethod
gwaslmm(
    genomes::Genomes,
    phenomes::Phenomes;
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    GRM_type::String = ["simple", "ploidy-aware"][1],
    verbose::Bool = false
)::Fit

Perform genome-wide association analysis using a linear mixed model (LMM) approach.

Arguments

  • genomes::Genomes: Genomic data structure containing allele frequencies
  • phenomes::Phenomes: Phenotypic data structure containing trait measurements
  • idx_entries::Union{Nothing,Vector{Int64}}: Optional indices for subsetting entries
  • idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices for subsetting loci/alleles
  • idx_trait::Int64: Index of the trait to analyze (default: 1)
  • GRM_type::String: Type of genetic relationship matrix to use:
    • "simple": Standard GRM calculation
    • "ploidy-aware": Ploidy-adjusted GRM calculation
  • verbose::Bool: Whether to display progress and plots (default: false)

Returns

  • Fit: A structure containing GWAS results including:
    • model: Model identifier ("GWAS_LMM")
    • b_hat: Vector of test statistics (z-scores) for genetic markers

Details

The function implements a mixed model GWAS using the first principal component of the genetic relationship matrix (GRM) as a fixed effect to control for population structure. The model includes random effects for entries and uses REML estimation.

Notes

  • Handles both diploid and polyploid data through the GRM_type parameter
  • Uses multi-threading for parallel computation of marker effects
  • Includes automatic convergence retry on fitting failures
  • Maximum fitting time per marker is limited to 60 seconds

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> ploidy = 4;

julia> genomes.allele_frequencies = round.(genomes.allele_frequencies .* ploidy) ./ ploidy;

julia> proportion_of_variance = zeros(9, 1); proportion_of_variance[1, 1] = 0.5;

julia> trials, effects = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.05 0.00 0.00;], proportion_of_variance = proportion_of_variance, verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> fit_1 = Suppressor.@suppress gwaslmm(genomes=genomes, phenomes=phenomes, GRM_type="simple");

julia> fit_1.model
"GWAS_LMM"

julia> fit_2 = Suppressor.@suppress gwaslmm(genomes=genomes, phenomes=phenomes, GRM_type="ploidy-aware");

julia> fit_2.model
"GWAS_LMM"

julia> findall(fit_1.b_hat .== maximum(fit_1.b_hat)) == findall(fit_2.b_hat .== maximum(fit_2.b_hat))
true
source
GenomicBreedingModels.gwasolsMethod
gwasols(
    genomes::Genomes,
    phenomes::Phenomes;
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    GRM_type::String = ["simple", "ploidy-aware"][1],
    verbose::Bool = false,
)::Fit

Perform genome-wide association study (GWAS) using ordinary least squares (OLS) regression with population structure correction.

Arguments

  • genomes::Genomes: Genomic data structure containing allele frequencies
  • phenomes::Phenomes: Phenotypic data structure containing trait measurements
  • idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset entries (default: all entries)
  • idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset loci/alleles (default: all loci)
  • idx_trait::Int64: Index of the trait to analyze (default: 1)
  • GRM_type::String: Type of genetic relationship matrix to use ("simple" or "ploidy-aware") (default: "simple")
  • verbose::Bool: Whether to display progress and plots (default: false)

Returns

  • Fit: A structure containing GWAS results including:
    • model: Model identifier ("GWAS_OLS")
    • b_hat: Vector of effect size estimates/t-statistics for each marker
    • Additional model information

Details

The function implements GWAS using OLS regression while accounting for population structure through the first principal component of the genetic relationship matrix (GRM) as a covariate. Two types of GRM can be used: "simple" assumes diploid organisms, while "ploidy-aware" accounts for different ploidy levels.

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> ploidy = 4;

julia> genomes.allele_frequencies = round.(genomes.allele_frequencies .* ploidy) ./ ploidy;

julia> proportion_of_variance = zeros(9, 1); proportion_of_variance[1, 1] = 0.5;

julia> trials, effects = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.05 0.00 0.00;], proportion_of_variance = proportion_of_variance, verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> fit_1 = gwasols(genomes=genomes, phenomes=phenomes, GRM_type="simple");

julia> fit_1.model
"GWAS_OLS"

julia> fit_2 = gwasols(genomes=genomes, phenomes=phenomes, GRM_type="ploidy-aware");

julia> fit_2.model
"GWAS_OLS"

julia> findall(fit_1.b_hat .== maximum(fit_1.b_hat)) == findall(fit_2.b_hat .== maximum(fit_2.b_hat))
true
source
GenomicBreedingModels.gwasprepMethod
gwasprep(
    genomes::Genomes,
    phenomes::Phenomes;
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    GRM_type::String = ["simple", "ploidy-aware"][1],
    standardise::Bool = true,
    verbose::Bool = false,
)::Tuple{Matrix{Float64},Vector{Float64},Matrix{Float64},Fit}

Prepare data matrices and structures for genome-wide association studies (GWAS).

Arguments

  • genomes::Genomes: Genomic data structure containing allele frequencies
  • phenomes::Phenomes: Phenotypic data structure containing trait measurements
  • idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset entries (default: all entries)
  • idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset loci/alleles (default: all loci)
  • idx_trait::Int64: Index of the trait to analyze (default: 1)
  • GRM_type::String: Type of genetic relationship matrix to use ("simple" or "ploidy-aware") (default: "simple")
  • standardise::Bool: Whether to standardize the data matrices (default: true)
  • verbose::Bool: Whether to print progress information (default: false)

Returns

A tuple containing:

  • G::Matrix{Float64}: Standardized allele frequency matrix
  • y::Vector{Float64}: Standardized phenotype vector
  • GRM::Matrix{Float64}: Genetic relationship matrix
  • fit::Fit: Initialized Fit struct for GWAS results

Details

  • Performs data validation and preprocessing for GWAS analysis
  • Removes fixed loci with no variation
  • Standardizes genomic and phenotypic data if requested
  • Constructs appropriate genetic relationship matrix
  • Initializes output structure for association results

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> ploidy = 4;

julia> genomes.allele_frequencies = round.(genomes.allele_frequencies .* ploidy) ./ ploidy;

julia> proportion_of_variance = zeros(9, 1); proportion_of_variance[1, 1] = 0.5;

julia> trials, effects = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.05 0.00 0.00;], proportion_of_variance = proportion_of_variance, verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> G, y, GRM, fit = gwasprep(genomes=genomes, phenomes=phenomes);

julia> sum(abs.(mean(G, dims=1)[1,:]) .< 1e-10) == size(G, 2)
true

julia> sum(abs.(std(G, dims=1)[1,:] .- 1) .< 1e-10) == size(G, 2)
true

julia> (abs(mean(y)) < 1e-10, abs(std(y) - 1) < 1e-10)
(true, true)

julia> size(G, 1) == length(y)
true

julia> (size(G, 1), length(y)) == size(GRM)
true

julia> length(fit.entries) == length(y)
true

julia> length(fit.b_hat) == size(G, 2)
true
source
GenomicBreedingModels.gwasremlMethod
gwasreml(
    genomes::Genomes,
    phenomes::Phenomes;
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    GRM_type::String = "simple",
    verbose::Bool = false
)::Fit

Performs genome-wide association analysis using restricted maximum likelihood estimation (REML).

Arguments

  • genomes::Genomes: Genomic data structure containing allele frequencies
  • phenomes::Phenomes: Phenotypic data structure containing trait measurements
  • idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset entries (default: nothing)
  • idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset loci/alleles (default: nothing)
  • idx_trait::Int64: Index of the trait to analyze (default: 1)
  • GRM_type::String: Type of genetic relationship matrix to use, either "simple" or "ploidy-aware" (default: "simple")
  • verbose::Bool: Whether to display progress and plots (default: false)

Returns

  • ::Fit: A Fit struct containing GWAS results, including effect estimates and test statistics

Details

Implements the REML log-likelihood calculation for a mixed model of the form: y = Xβ + Zu + e where:

  • β are fixed effects
  • u are random genetic effects with u ~ N(0, σ²_u * GRM)
  • e are residual effects with e ~ N(0, σ²_e * I)

The function constructs the variance-covariance matrices and computes the REML transformation to obtain the log-likelihood value used in variance component estimation.

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false);

julia> ploidy = 4;

julia> genomes.allele_frequencies = round.(genomes.allele_frequencies .* ploidy) ./ ploidy;

julia> proportion_of_variance = zeros(9, 1); proportion_of_variance[1, 1] = 0.5;

julia> trials, effects = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.05 0.00 0.00;], proportion_of_variance = proportion_of_variance, verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> fit_1 = gwasreml(genomes=genomes, phenomes=phenomes, GRM_type="simple");

julia> fit_1.model
"GWAS_REML"

julia> fit_2 = gwasreml(genomes=genomes, phenomes=phenomes, GRM_type="ploidy-aware");

julia> fit_2.model
"GWAS_REML"

julia> findall(fit_1.b_hat .== maximum(fit_1.b_hat)) == findall(fit_2.b_hat .== maximum(fit_2.b_hat))
true
source
GenomicBreedingModels.heritabilitynarrow_senseMethod
heritabilitynarrow_sense(y_true::Vector{Float64}, y_pred::Vector{Float64})::Float64

Calculate narrow-sense heritability (h²) from true and predicted phenotypic values.

Narrow-sense heritability is the proportion of phenotypic variance that can be attributed to additive genetic effects. It is calculated as the ratio of additive genetic variance (s²a) to total phenotypic variance (s²a + s²e).

Arguments

  • y_true::Vector{Float64}: Vector of observed/true phenotypic values
  • y_pred::Vector{Float64}: Vector of predicted genetic values

Returns

  • Float64: Narrow-sense heritability (h²) value between 0 and 1

Details

  • Returns 0.0 if variance of either input vector is near zero (< 1e-10)
  • Additive genetic variance (s²a) is estimated from variance of predictions
  • Environmental variance (s²e) is estimated from variance of residuals
  • Result is bounded between 0 and 1
source
GenomicBreedingModels.invoneplusMethod
invoneplus(x) = 1 / (1 + x)

This function takes a number, adds one and inverts it. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Both input and ouput range between zero and one.

source
GenomicBreedingModels.lassoMethod
lasso(;
    genomes::Genomes,
    phenomes::Phenomes,
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    verbose::Bool = false,
)::Fit

Fits a LASSO (Least Absolute Shrinkage and Selection Operator) regression model with L1 regularization for genomic prediction.

Arguments

  • genomes::Genomes: Genomic data structure containing allele frequencies
  • phenomes::Phenomes: Phenotypic data structure containing trait measurements
  • idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to select specific entries/individuals
  • idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to select specific loci-alleles
  • idx_trait::Int64: Index of the trait to analyze (default: 1)
  • verbose::Bool: If true, prints diagnostic plots and additional information (default: false)

Returns

  • Fit: A fitted model object containing:
    • model: Model identifier ("ols")
    • b_hat: Estimated regression coefficients
    • b_hat_labels: Labels for the coefficients
    • trait: Name of the analyzed trait
    • entries: Entry identifiers
    • populations: Population identifiers
    • y_true: Observed phenotypic values
    • y_pred: Predicted phenotypic values
    • metrics: Dictionary of performance metrics

Details

The function implements LASSO regression using the GLMNet package, which performs automatic feature selection through L1 regularization. The optimal regularization parameter (λ) is selected using cross-validation to minimize prediction error while promoting sparsity in the coefficients.

Notes

  • Standardisation is disabled by default because the allele frequencies across loci are comparable as they all range from zero to one
  • The model includes an intercept term

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> fit = lasso(genomes=genomes, phenomes=phenomes);

julia> fit.model == "lasso"
true

julia> fit.metrics["cor"] > 0.50
true
source
GenomicBreedingModels.log10epsdivlog10epsMethod
log10epsdivlog10eps(x) = (log10(abs(x) + eps(Float64))) / log10(eps(Float64))

This function takes a number, adds a very tiny value, takes the log10, and divide it by the log10 of the same very tiny value. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Both input and ouput range between zero and one.

source
GenomicBreedingModels.loglikremlMethod
loglikreml(θ::Vector{Float64}, data::Tuple{Vector{Float64},Matrix{Float64},Matrix{Float64}})::Float64

Calculate the restricted maximum likelihood (REML) log-likelihood for a mixed linear model.

Arguments

  • θ::Vector{Float64}: Vector of variance components [σ²e, σ²u] where:
    • σ²_e is the residual variance
    • σ²_u is the genetic variance
  • data::Tuple{Vector{Float64},Matrix{Float64},Matrix{Float64}}: Tuple containing:
    • y: Vector of phenotypic observations
    • X: Design matrix for fixed effects
    • GRM: Genomic relationship matrix

Returns

  • Float64: The REML log-likelihood value. Returns Inf if matrix operations fail.

Details

Implements the REML log-likelihood calculation for a mixed model of the form: y = Xβ + Zu + e where:

  • β are fixed effects
  • u are random genetic effects with u ~ N(0, σ²_u * GRM)
  • e are residual effects with e ~ N(0, σ²_e * I)

The function constructs the variance-covariance matrices and computes the REML transformation to obtain the log-likelihood value used in variance component estimation.

Examples

```jldoctest; setup = :(using GenomicBreedingCore, GenomicBreedingModels, LinearAlgebra, StatsBase) julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> ploidy = 4;

julia> genomes.allelefrequencies = round.(genomes.allelefrequencies .* ploidy) ./ ploidy;

julia> proportionofvariance = zeros(9, 1); proportionofvariance[1, 1] = 0.5;

julia> trials, effects = GenomicBreedingCore.simulatetrials(genomes=genomes, nyears=1, nseasons=1, nharvests=1, nsites=1, nreplications=1, fadddomepi=[0.05 0.00 0.00;], proportionofvariance = proportionofvariance, verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> G, y, GRM, fit = gwasprep(genomes=genomes, phenomes=phenomes);

julia> loglik = loglikreml([0.53, 0.15], (y, hcat(ones(length(y)), G[:, 1]), GRM));

julia> loglik < 100 true

source
GenomicBreedingModels.metricsMethod
metrics(y_true::Vector{Float64}, y_pred::Vector{Float64})::Dict{String,Float64}

Calculate various metrics comparing predicted vs true values.

Returns a dictionary containing the following metrics:

  • cor: Pearson correlation coefficient
  • mad: Mean absolute deviation
  • msd: Mean squared deviation
  • rmsd: Root mean squared deviation
  • nrmsd: Normalized root mean squared deviation
  • euc: Euclidean distance
  • jac: Jaccard distance
  • tvar: Total variation distance
  • : Narrow-sense heritability

Arguments

  • y_true::Vector{Float64}: Vector of true/observed values
  • y_pred::Vector{Float64}: Vector of predicted values

Returns

  • Dict{String,Float64}: Dictionary mapping metric names to their calculated values
source
GenomicBreedingModels.mlpMethod
mlp(;
    genomes::Genomes,
    phenomes::Phenomes,
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    n_layers::Int64 = 3,
    activation::Function = relu,
    max_n_nodes::Int64 = 256,
    n_nodes_droprate::Float64 = 0.50,
    dropout_droprate::Float64 = 0.25,
    n_epochs::Int64 = 100_000,
    use_cpu::Bool = false,
    seed::Int64 = 123,
    verbose::Bool = false
)::Fit

Train a multi-layer perceptron (MLP) model for genomic prediction.

Arguments

  • genomes::Genomes: A Genomes struct containing the genomic data.
  • phenomes::Phenomes: A Phenomes struct containing the phenomic data.
  • idx_entries::Union{Nothing, Vector{Int64}}: Indices of entries to include in the model. If nothing, all entries are included. Default is nothing.
  • idx_loci_alleles::Union{Nothing, Vector{Int64}}: Indices of loci-alleles to include in the model. If nothing, all loci-alleles are included. Default is nothing.
  • idx_trait::Int64: Index of the trait to predict. Default is 1.
  • n_layers::Int64: Number of layers in the MLP. Default is 3.
  • activation::Function: Activation function to use in the MLP. Default is relu.
  • max_n_nodes::Int64: Maximum number of nodes in each layer. Default is 256.
  • n_nodes_droprate::Float64: Drop rate for the number of nodes in each layer. Default is 0.50.
  • dropout_droprate::Float64: Dropout rate for the layers. Default is 0.25.
  • n_epochs::Int64: Number of training epochs. Default is 100,000.
  • use_cpu::Bool: If true, use CPU for training. If false, use GPU if available. Default is false.
  • seed::Int64: Random seed for reproducibility. Default is 123.
  • verbose::Bool: If true, prints detailed progress information during training. Default is false.

Returns

  • Fit: A Fit struct containing the trained MLP model and performance metrics.

Details

This function trains a multi-layer perceptron (MLP) model on genomic and phenomic data. The function performs the following steps:

  1. Set Random Seed: Sets the random seed for reproducibility.
  2. Extract Features and Targets: Extracts the feature matrix X, target vector y, and other relevant information from the genomic and phenomic data.
  3. Instantiate Output Fit: Creates a Fit struct to store the model and results.
  4. Construct MLP Layers: Constructs the MLP layers based on the specified number of layers, activation function, and dropout rates.
  5. Move Data to Device: Moves the data to the appropriate device (CPU or GPU).
  6. Setup Training State: Initializes the training state with the model parameters and optimizer.
  7. Train the Model: Trains the MLP model for the specified number of epochs, printing progress if verbose is true.
  8. Evaluate Performance: Evaluates the model's performance using the specified metrics.
  9. Output: Returns the Fit struct containing the trained model and performance metrics.

Notes

  • The function uses the Lux library for constructing and training the MLP model.
  • The verbose option provides additional insights into the training process by printing progress information.
  • The function ensures that the trained model and performance metrics are stored in the Fit struct.

Throws

  • ArgumentError: If the Genomes or Phenomes struct is corrupted or if any of the arguments are out of range.
  • ErrorException: If an error occurs during model training or evaluation.

Example

julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);

julia> phenomes = extractphenomes(trials);

julia> fit_cpu = Suppressor.@suppress mlp(genomes=genomes, phenomes=phenomes, n_epochs=1_000, use_cpu=true, verbose=false);

julia> fit_gpu = Suppressor.@suppress mlp(genomes=genomes, phenomes=phenomes, n_epochs=1_000, use_cpu=false, verbose=false);

julia> fit_cpu.metrics["cor"] >= 0.2
true

julia> fit_gpu.metrics["cor"] >= 0.2
true
source
GenomicBreedingModels.multMethod
mult(x, y) = x * y

This function takes two numbers and multiplies them together. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Inputs and ouput range between zero and one.

source
GenomicBreedingModels.olsMethod
ols(;
    genomes::Genomes,
    phenomes::Phenomes,
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    verbose::Bool = false
)::Fit

Fits an ordinary least squares (OLS) regression model to genomic and phenotypic data.

Arguments

  • genomes::Genomes: Genomic data structure containing allele frequencies
  • phenomes::Phenomes: Phenotypic data structure containing trait measurements
  • idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to select specific entries (default: all entries)
  • idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to select specific loci-alleles (default: all loci-alleles)
  • idx_trait::Int64: Index of the trait to analyze (default: 1)
  • verbose::Bool: If true, displays diagnostic plots and performance metrics (default: false)

Returns

  • Fit: A fitted model object containing:
    • model: Model identifier ("ols")
    • b_hat: Estimated regression coefficients
    • b_hat_labels: Labels for the coefficients
    • trait: Name of the analyzed trait
    • entries: Entry identifiers
    • populations: Population identifiers
    • y_true: Observed phenotypic values
    • y_pred: Predicted phenotypic values
    • metrics: Dictionary of performance metrics

Description

Performs ordinary least squares regression on genomic data to predict phenotypic values. The model includes an intercept term and estimates effects for each locus-allele combination.

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> fit = ols(genomes=genomes, phenomes=phenomes);

julia> fit.model == "ols"
true

julia> fit.metrics["cor"] > 0.50
true
source
GenomicBreedingModels.pearsonscorrelationMethod
pearsonscorrelation(y_true::Vector{Float64}, y_pred::Vector{Float64})::Float64

Calculate the Pearson correlation coefficient between two vectors.

The Pearson correlation coefficient measures the linear correlation between two variables, giving a value between -1 and +1, where:

  • +1 represents perfect positive correlation
  • 0 represents no linear correlation
  • -1 represents perfect negative correlation

Arguments

  • y_true::Vector{Float64}: Vector of true/actual values
  • y_pred::Vector{Float64}: Vector of predicted/estimated values

Returns

  • Float64: Pearson correlation coefficient

Notes

  • Returns 0.0 if the variance of either input vector is less than 1e-10
  • Uses 1 - correlation distance from Distances.jl package
source
GenomicBreedingModels.predictMethod
predict(; fit::Fit, genomes::Genomes, idx_entries::Vector{Int64})::Vector{Float64}

Calculate predicted phenotypes using a fitted genomic prediction model.

Arguments

  • fit::Fit: A fitted genomic prediction model containing coefficients and model information
  • genomes::Genomes: Genomic data containing allele frequencies
  • idx_entries::Vector{Int64}: Vector of indices specifying which entries to predict

Returns

  • Vector{Float64}: Predicted phenotypic values for the specified entries

Details

Supports various linear genomic prediction models including:

  • OLS (Ordinary Least Squares)
  • Ridge Regression
  • LASSO
  • Bayes A
  • Bayes B
  • Bayes C

The function validates input dimensions and compatibility between the fitted model and genomic data before making predictions.

Throws

  • ArgumentError: If the Fit or Genomes structs are corrupted
  • ArgumentError: If entry indices are out of bounds
  • ArgumentError: If loci-alleles in the fitted model don't match the validation set
  • ArgumentError: If the genomic prediction model is not recognized

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);

julia> phenomes = extractphenomes(trials);

julia> fit = ridge(genomes=genomes, phenomes=phenomes, idx_entries=collect(1:90));

julia> y_hat = GenomicBreedingModels.predict(fit=fit, genomes=genomes, idx_entries=collect(91:100));

julia> cor(phenomes.phenotypes[91:100, 1], y_hat) > 0.5
true
source
GenomicBreedingModels.raiseMethod
raise(x, y) = x^y

This function raises the first value to the power of the second value. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Inputs and ouput range between zero and one.

source
GenomicBreedingModels.reconstitutefeaturesMethod
reconstitutefeatures(
    genomes::Genomes;
    feature_names::Vector{String},
    verbose::Bool = false
)::Genomes

Reconstruct epistasis features from a Genomes struct using feature names that encode the transformations applied.

Arguments

  • genomes::Genomes: Genomic data structure containing allele frequencies
  • feature_names::Vector{String}: Vector of feature names containing encoded transformation information
  • verbose::Bool: Whether to show progress bar during reconstruction (default: false)

Returns

  • Genomes: A new Genomes struct with reconstructed epistasis features

Details

The function parses the feature names to determine which transformations were applied and reconstructs the features by applying these transformations to the original genomic data. Feature names should contain the transformation operations in a format that can be parsed and evaluated.

Throws

  • ArgumentError: If the input Genomes struct is corrupted (invalid dimensions)
  • ErrorException: If feature reconstruction fails

Example

julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> genomes_epifeat = epistasisfeatures(genomes, phenomes, n_new_features_per_transformation=50, n_reps=2, verbose=false);

julia> feature_names = genomes_epifeat.loci_alleles;

julia> genomes_epifeat_reconstructed = reconstitutefeatures(genomes, feature_names=feature_names);

julia> genomes_epifeat == genomes_epifeat_reconstructed
true
source
GenomicBreedingModels.ridgeMethod
ridge(;
    genomes::Genomes,
    phenomes::Phenomes,
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    idx_trait::Int64 = 1,
    verbose::Bool = false
)::Fit

Fit a ridge (L2) regression model to genomic data. Ridge regression adds an L2 regularization term to the ordinary least squares objective function, which helps prevent overfitting and handles multicollinearity in the predictors.

Arguments

  • genomes::Genomes: Genomic data structure containing allele frequencies
  • phenomes::Phenomes: Phenotypic data structure containing trait measurements
  • idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset specific entries/individuals
  • idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset specific loci-alleles
  • idx_trait::Int64: Index of the trait to analyze (default: 1)
  • verbose::Bool: If true, prints diagnostic plots and additional information (default: false)

Returns

  • Fit: A fitted model object containing:
    • model: Model identifier ("ols")
    • b_hat: Estimated regression coefficients
    • b_hat_labels: Labels for the coefficients
    • trait: Name of the analyzed trait
    • entries: Entry identifiers
    • populations: Population identifiers
    • y_true: Observed phenotypic values
    • y_pred: Predicted phenotypic values
    • metrics: Dictionary of performance metrics

Notes

  • Uses cross-validation to select the optimal regularization parameter (λ)
  • Standardizes predictors before fitting
  • Includes an intercept term in the model

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> fit = ridge(genomes=genomes, phenomes=phenomes);

julia> fit.model == "ridge"
true

julia> fit.metrics["cor"] > 0.50
true
source
GenomicBreedingModels.squareMethod
square(x) = x^2

This function takes a number and squares it. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Both input and ouput range between zero and one.

source
GenomicBreedingModels.transform1Method
transform1(
    f::Function,
    genomes::Genomes,
    phenomes::Phenomes;
    idx_trait::Int64 = 1,
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    n_new_features_per_transformation::Int64 = 1_000,
    ϵ::Float64 = eps(Float64),
    use_abs::Bool = false,
    σ²_threshold::Float64 = 0.01,
    verbose::Bool = false
)::Genomes

Apply a transformation function to each allele frequency in genomic data while considering their effects on phenotypes.

Arguments

  • f::Function: Transformation function to apply to allele frequencies
  • genomes::Genomes: Input genomic data structure
  • phenomes::Phenomes: Corresponding phenotypic data structure

Keywords

  • idx_trait::Int64: Index of the trait to analyze (default: 1)
  • idx_entries::Union{Nothing,Vector{Int64}}: Indices of entries to include (default: all)
  • idx_loci_alleles::Union{Nothing,Vector{Int64}}: Indices of loci-alleles to include (default: all)
  • n_new_features_per_transformation::Int64: Maximum number of transformed features to retain (default: 1000)
  • ϵ::Float64: Small value added to prevent numerical issues (default: machine epsilon)
  • use_abs::Bool: Whether to use absolute values before transformation (default: false)
  • σ²_threshold::Float64: Minimum variance threshold for considering a locus (default: 0.01)
  • verbose::Bool: Whether to show progress and plots (default: false)

Returns

  • Genomes: A new Genomes object containing the transformed allele frequencies

Details

The function performs the following steps:

  1. Extracts allele frequencies and phenotypic data
  2. Applies the transformation function to each locus
  3. Estimates effects using ordinary least squares
  4. Selects the most important transformed features
  5. Cleans numerical artifacts (values near 0 or 1)

Notes

  • Use named functions to ensure transformations can be reconstructed from loci_alleles
  • The function adds ϵ to frequencies to prevent numerical issues
  • Features with variance below σ²_threshold are skipped
  • The output contains at most n_new_features_per_transformation features

Example

julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> genomes_transformed = transform1(x -> x^2, genomes, phenomes);

julia> idx = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[1])[1];

julia> mean(sqrt.(genomes_transformed.allele_frequencies[:, 1]) .- genomes.allele_frequencies[:, idx]) < 1e-10
true

julia> squareaddpi(x) = x^2 + pi;

julia> genomes_transformed = transform1(squareaddpi, genomes, phenomes);

julia> idx = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[1])[1];

julia> mean(squareaddpi.(genomes.allele_frequencies[:, idx]) .- genomes_transformed.allele_frequencies[:, 1]) < 1e-10
true
source
GenomicBreedingModels.transform2Method
transform2(
    f::Function,
    genomes::Genomes,
    phenomes::Phenomes;
    idx_trait::Int64 = 1,
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    n_new_features_per_transformation::Int64 = 1_000,
    ϵ::Float64 = eps(Float64),
    use_abs::Bool = false,
    σ²_threshold::Float64 = 0.01,
    commutative::Bool = false,
    verbose::Bool = false
)::Genomes

Apply a transformation function to pairs of allele frequencies in genomic data while considering their effects on phenotypes.

Arguments

  • f::Function: Transformation function to apply to pairs of allele frequencies
  • genomes::Genomes: Input genomic data structure
  • phenomes::Phenomes: Corresponding phenotypic data structure
  • idx_trait::Int64: Index of the trait to analyze (default: 1)
  • idx_entries::Union{Nothing,Vector{Int64}}: Subset of entries to include (default: all)
  • idx_loci_alleles::Union{Nothing,Vector{Int64}}: Subset of loci-alleles to include (default: all)
  • n_new_features_per_transformation::Int64: Maximum number of transformed features to retain (default: 1000)
  • ϵ::Float64: Small value added to frequencies to avoid numerical issues (default: machine epsilon)
  • use_abs::Bool: Whether to use absolute values of frequencies (default: false)
  • σ²_threshold::Float64: Minimum variance threshold for considering loci (default: 0.01)
  • commutative::Bool: Whether the transformation function is commutative (default: false)
  • verbose::Bool: Whether to display progress and diagnostics (default: false)

Returns

  • Genomes: A new Genomes object containing the transformed features

Details

The function performs the following steps:

  1. Extracts allele frequencies and phenotypic data
  2. Applies the transformation function to all possible pairs of allele freuqencies (i.e., locus-allele combinations)
  3. Estimates effects using ordinary least squares
  4. Selects the most important transformed features
  5. Cleans numerical artifacts (values near 0 or 1)

Notes

  • Use named functions to ensure transformations can be reconstructed from loci_alleles
  • The function adds ϵ to frequencies to prevent numerical issues
  • Features with variance below σ²_threshold are skipped
  • The output contains at most n_new_features_per_transformation features

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;

julia> phenomes = extractphenomes(trials);

julia> genomes_transformed = transform2((x,y) -> (x^2 + sqrt(y)) / 2, genomes, phenomes);

julia> idx_1 = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[1])[1];

julia> idx_2 = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[2])[1];

julia> mean((genomes.allele_frequencies[:,idx_1].^2 .+ sqrt.(genomes.allele_frequencies[:,idx_2])) ./ 2 .- genomes_transformed.allele_frequencies[:,1]) < 1e-10
true

julia> raisexbyythenlog(x, y) = log(abs(x^y));

julia> genomes_transformed = transform2(raisexbyythenlog, genomes, phenomes);

julia> idx_1 = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[1])[1];

julia> idx_2 = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[2])[1];

julia> mean(raisexbyythenlog.(genomes.allele_frequencies[:,idx_1], genomes.allele_frequencies[:,idx_2]) .- genomes_transformed.allele_frequencies[:,1]) < 1e-10
true
source
GenomicBreedingModels.turing_bayesGMethod

Turing specification of Bayesian linear regression using a Gaussian prior with common variance

Example usage

# Benchmarking
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
model = turing_bayesG(G, y)
benchmarks = TuringBenchmarking.benchmark_model(
    model;
    # Check correctness of computations
    check=true,
    # Automatic differentiation backends to check and benchmark
    adbackends=[:forwarddiff, :reversediff, :reversediff_compiled, :zygote]
)

# Test more loci
genomes = GenomicBreedingCore.simulategenomes(n=10, l=10_000)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Check for uninferred types in the model
@code_warntype model = turing_bayesG(G, y)
# Fit
model = turing_bayesG(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.65, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# @time chain = Turing.sample(rng, model, HMC(0.05, 10; adtype=AutoReverseDiff(compile=true)), niter, progress=true);
p = Plots.histogram(chain[:σ²])
Plots.gui(p)
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[501:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)
source
GenomicBreedingModels.turing_bayesGsMethod

Turing specification of Bayesian linear regression using a Gaussian prior with varying variances

Example usage

# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesGs(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)
source
GenomicBreedingModels.turing_bayesGπMethod

Turing specification of Bayesian linear regression using a Gaussian prior with a point mass at zero and common variance

Example usage

# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesGπ(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)
source
GenomicBreedingModels.turing_bayesGπsMethod

Turing specification of Bayesian linear regression using a Gaussian prior with a point mass at zero and varying variances

Example usage

# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesGπs(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)
source
GenomicBreedingModels.turing_bayesLMethod

Turing specification of Bayesian linear regression using a Laplacian prior with a common scale

Example usage

# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesL(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)
source
GenomicBreedingModels.turing_bayesLsMethod

Turing specification of Bayesian linear regression using a Laplacian prior with varying scales

Example usage

# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesLs(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)
source
GenomicBreedingModels.turing_bayesLπMethod

Turing specification of Bayesian linear regression using a Laplacian prior with a point mass at zero and common scale

Example usage

# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesLπ(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)
source
GenomicBreedingModels.turing_bayesLπsMethod

Turing specification of Bayesian linear regression using a Laplacian prior with a point mass at zero and common scale

Example usage

# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesLπs(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)
source
GenomicBreedingModels.turing_bayesTMethod

Turing specification of Bayesian linear regression using a T-distribution

Example usage

# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesT(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)
source
GenomicBreedingModels.turing_bayesTπMethod

Turing specification of Bayesian linear regression using a T-distribution with a point mass at zero

Example usage

# Simulate data
genomes = GenomicBreedingCore.simulategenomes(n=10, l=100)
trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.9 0.01 0.00;])
phenomes = extractphenomes(trials)
# Extract genotype and phenotype data
G::Matrix{Float64} = genomes.allele_frequencies
y::Vector{Float64} = phenomes.phenotypes[:, 1]
# Regress for just 200 iterations for demonstration purposes only. Use way way more iterations, e.g. 10,000.
model = turing_bayesTπ(G, y)
# We use compile=true in AutoReverseDiff() because we do not have any if-statements in our Turing model below
rng::TaskLocalRNG = Random.seed!(123)
niter::Int64 = 1_500
nburnin::Int64 = 500
@time chain = Turing.sample(rng, model, NUTS(nburnin, 0.5, max_depth=5, Δ_max=1000.0, init_ϵ=0.2; adtype=AutoReverseDiff(compile=true)), niter-nburnin, progress=true);
# Use the mean paramter values after 150 burn-in iterations
params = Turing.get_params(chain[150:end, :, :]);
b_hat = vcat(mean(params.intercept), mean(stack(params.coefficients, dims=1)[:, :, 1], dims=2)[:,1]);
# Assess prediction accuracy
y_pred::Vector{Float64} = hcat(ones(size(G,1)), G) * b_hat;
UnicodePlots.scatterplot(y, y_pred)
performance::Dict{String, Float64} = metrics(y, y_pred)
source
GenomicBreedingModels.validateMethod
validate(
    fit::Fit,
    genomes::Genomes,
    phenomes::Phenomes;
    idx_validation::Vector{Int64},
    replication::String="",
    fold::String=""
)::CV

Evaluate the predictive accuracy of a genomic prediction model on a validation dataset.

Arguments

  • fit::Fit: A fitted genomic prediction model
  • genomes::Genomes: Genomic data structure containing allele frequencies
  • phenomes::Phenomes: Phenotypic data structure containing trait measurements
  • idx_validation::Vector{Int64}: Indices of entries to use for validation
  • replication::String: Optional identifier for the validation replication
  • fold::String: Optional identifier for the cross-validation fold

Returns

  • CV: A cross-validation result object containing:
    • Validation metrics (correlation, RMSE, etc.)
    • True and predicted values
    • Entry and population information
    • Model specifications

Notes

  • Performs checks for data leakage between training and validation sets
  • Handles missing, NaN, and Inf values in phenotypic data
  • Validates dimensions of output CV struct

Examples

julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);

julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);

julia> phenomes = extractphenomes(trials);

julia> fit = ridge(genomes=genomes, phenomes=phenomes, idx_entries=collect(1:90));

julia> cv = validate(fit, genomes, phenomes, idx_validation=collect(91:100));

julia> cv.metrics["cor"] > 0.50
true
source
GenomicBreedingModels.@string2operationsMacro
@string2operations(x)

Convert a string containing an endofunction formulae for genomic features into a parsed Julia expressions.

Arguments

  • x: A string containing mathematical operations on allele frequencies

Returns

  • Parsed expression that can be evaluated in Julia
source