GenomicBreedingModels
Documentation for GenomicBreedingModels.
GenomicBreedingModels.addnormGenomicBreedingModels.bayesaGenomicBreedingModels.bayesbGenomicBreedingModels.bayescGenomicBreedingModels.bayesianGenomicBreedingModels.bglrGenomicBreedingModels.cvbulkGenomicBreedingModels.cvleaveonepopulationoutGenomicBreedingModels.cvmultithread!GenomicBreedingModels.cvpairwisepopulationGenomicBreedingModels.cvperpopulationGenomicBreedingModels.epistasisfeaturesGenomicBreedingModels.extractxyetcGenomicBreedingModels.gwaslmmGenomicBreedingModels.gwasolsGenomicBreedingModels.gwasprepGenomicBreedingModels.gwasremlGenomicBreedingModels.heritabilitynarrow_senseGenomicBreedingModels.invoneplusGenomicBreedingModels.lassoGenomicBreedingModels.log10epsdivlog10epsGenomicBreedingModels.loglikremlGenomicBreedingModels.metricsGenomicBreedingModels.multGenomicBreedingModels.olsGenomicBreedingModels.pearsonscorrelationGenomicBreedingModels.predictGenomicBreedingModels.raiseGenomicBreedingModels.reconstitutefeaturesGenomicBreedingModels.ridgeGenomicBreedingModels.squareGenomicBreedingModels.transform1GenomicBreedingModels.transform2GenomicBreedingModels.validateGenomicBreedingModels.@string2operations
GenomicBreedingModels.addnorm — Method
addnorm(x, y) = (x + y) / 2.0This function takes two numbers and finds their arithmetic mean. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Inputs and ouput range between zero and one.
GenomicBreedingModels.bayesa — Method
bayesa(;
genomes::Genomes,
phenomes::Phenomes,
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
n_iter::Int64 = 1_500,
n_burnin::Int64 = 500,
verbose::Bool = false
)::FitFits a Bayes A model for genomic prediction using the BGLR method. Bayes A assumes marker effects follow a scaled-t distribution, allowing for different variances for each marker effect.
Arguments
genomes::Genomes: Genomic data structure containing allele frequenciesphenomes::Phenomes: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset specific entries/individualsidx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset specific genetic markersidx_trait::Int64: Index of the trait to analyze (default: 1)n_iter::Int64: Number of iterations for the MCMC algorithm (default: 1,500)n_burnin::Int64: Number of burn-in iterations to discard (default: 500)verbose::Bool: Whether to print progress messages (default: false)
Returns
Fit: A fitted model object containing:model: Model identifier ("ols")b_hat: Estimated regression coefficientsb_hat_labels: Labels for the coefficientstrait: Name of the analyzed traitentries: Entry identifierspopulations: Population identifiersy_true: Observed phenotypic valuesy_pred: Predicted phenotypic valuesmetrics: Dictionary of performance metrics
Notes
- The model assumes a Gaussian distribution for the trait values
- Recommended to check convergence by examining trace plots of key parameters
- The burn-in period should be adjusted based on convergence diagnostics
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> fit = bayesa(genomes=genomes, phenomes=phenomes);
julia> fit.model == "bayesa"
true
julia> fit.metrics["cor"] > 0.50
trueGenomicBreedingModels.bayesb — Method
bayesb(;
genomes::Genomes,
phenomes::Phenomes,
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
n_iter::Int64 = 1_500,
n_burnin::Int64 = 500,
verbose::Bool = false
)::FitFits a Bayes B model for genomic prediction using the BGLR method. Bayes B assumes that marker effects follow a mixture distribution where some markers have zero effect and others follow a scaled-t distribution.
Arguments
genomes::Genomes: Genomic data structure containing allele frequenciesphenomes::Phenomes: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset specific entries (default: nothing)idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset specific loci/alleles (default: nothing)idx_trait::Int64: Index of the trait to analyze (default: 1)n_iter::Int64: Number of iterations for the MCMC algorithm (default: 1,500)n_burnin::Int64: Number of burn-in iterations to discard (default: 500)verbose::Bool: Whether to print progress information (default: false)
Returns
Fit: A fitted model object containing:model: Model identifier ("ols")b_hat: Estimated regression coefficientsb_hat_labels: Labels for the coefficientstrait: Name of the analyzed traitentries: Entry identifierspopulations: Population identifiersy_true: Observed phenotypic valuesy_pred: Predicted phenotypic valuesmetrics: Dictionary of performance metrics
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> fit = bayesb(genomes=genomes, phenomes=phenomes);
julia> fit.model == "bayesb"
true
julia> fit.metrics["cor"] > 0.50
trueGenomicBreedingModels.bayesc — Method
bayesc(;
genomes::Genomes,
phenomes::Phenomes,
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
n_iter::Int64 = 1_500,
n_burnin::Int64 = 500,
verbose::Bool = false
)::FitFits a Bayes C model for genomic prediction using the BGLR method. Bayes C assumes that marker effects follow a mixture distribution with a point mass at zero and a normal distribution.
Arguments
genomes::Genomes: Genomic data structure containing allele frequenciesphenomes::Phenomes: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset specific entries/individualsidx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset specific loci/allelesidx_trait::Int64: Index of the trait to analyze (default: 1)n_iter::Int64: Number of iterations for MCMC sampling (default: 1,500)n_burnin::Int64: Number of burn-in iterations to discard (default: 500)verbose::Bool: Whether to print progress information (default: false)
Returns
Fit: A fitted model object containing:model: Model identifier ("ols")b_hat: Estimated regression coefficientsb_hat_labels: Labels for the coefficientstrait: Name of the analyzed traitentries: Entry identifierspopulations: Population identifiersy_true: Observed phenotypic valuesy_pred: Predicted phenotypic valuesmetrics: Dictionary of performance metrics
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> fit = bayesc(genomes=genomes, phenomes=phenomes);
julia> fit.model == "bayesc"
true
julia> fit.metrics["cor"] > 0.50
trueGenomicBreedingModels.bayesian — Method
bayesian(
bglr_model::String;
genomes::Genomes,
phenomes::Phenomes,
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
response_type::String = ["gaussian", "ordinal"][1],
n_burnin::Int64 = 500,
n_iter::Int64 = 1_500,
verbose::Bool = false
)::FitFit Bayesian genomic prediction models using the BGLR R package.
Arguments
bglr_model::String: Type of BGLR model to fit. Options: "BayesA", "BayesB", "BayesC"genomes::Genomes: Genomic data structure containing allele frequenciesphenomes::Phenomes: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset entriesidx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset loci/allelesidx_trait::Int64: Index of trait to analyze (default: 1)response_type::String: Type of response variable. Options: "gaussian" or "ordinal" (default: "gaussian")n_burnin::Int64: Number of burn-in MCMC iterations (default: 500)n_iter::Int64: Total number of MCMC iterations (default: 1,500)verbose::Bool: Whether to print progress information (default: false)
Returns
Fit: Object containing:- Model fit summary
- Estimated genetic effects
- Predicted values
- Model performance metrics
- Input data details
Details
This function provides a Julia interface to the BGLR R package for Bayesian genomic prediction. It uses temporary files to interface with R and automatically cleans up afterward.
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);
julia> phenomes = extractphenomes(trials);
julia> sol = Suppressor.@suppress bayesian("BayesA", genomes=genomes, phenomes=phenomes);
julia> sol.metrics["cor"] > 0.5
trueGenomicBreedingModels.bglr — Method
bglr(; G::Matrix{Float64}, y::Vector{Float64}, model::String="BayesA",
response_type::String="gaussian", n_iter::Int64=1_500,
n_burnin::Int64=500, verbose::Bool=false)::Vector{Float64}Perform Bayesian genomic prediction using the BGLR (Bayesian Generalized Linear Regression) R package.
Arguments
G::Matrix{Float64}: Marker matrix where rows represent individuals and columns represent locus-allele combinationsy::Vector{Float64}: Vector of phenotypic valuesmodel::String: Bayesian model type to use. Options: "BayesA", "BayesB", or "BayesC". Default: "BayesA"response_type::String: Type of response variable. Options: "gaussian" or "ordinal". Default: "gaussian"n_iter::Int64: Number of iterations for the MCMC chain. Default: 1,500n_burnin::Int64: Number of burn-in iterations to discard. Default: 500verbose::Bool: Whether to print progress information. Default: false
Returns
Vector{Float64}: Estimated effects including the intercept (first element) followed by marker effects
Details
This function creates temporary files to interface with R's BGLR package, runs the Bayesian analysis, and automatically cleans up temporary files afterward. The function uses the system's R installation and requires the BGLR package to be installed in R.
Notes
This is hacky. It invokes Rscript for each instance which should allow multithreading because RCall.jl currently does not allow multithreading.
GenomicBreedingModels.cvbulk — Method
cvbulk(;
genomes::Genomes,
phenomes::Phenomes,
models::Vector{Function}=[ridge],
n_replications::Int64=5,
n_folds::Int64=5,
seed::Int64=42,
verbose::Bool=true
)::Tuple{Vector{CV}, Vector{String}}Perform replicated k-fold cross-validation of genomic prediction model(s) across all available traits and entries, ignoring populations.
Arguments
genomes::Genomes: Genomic data structure containing allele frequenciesphenomes::Phenomes: Phenotypic data structure containing trait measurementsmodels::Vector{Function}: Vector of genomic prediction model functions to evaluaten_replications::Int64: Number of times to repeat k-fold cross-validation randomising k-fold partitioning each time (default: 5)n_folds::Int64: Number of cross-validation folds (default: 5)seed::Int64: Random seed for reproducibility (default: 42)verbose::Bool: Whether to display progress information (default: true)
Returns
- Tuple containing:
- Vector of CV objects with cross-validation results
- Vector of warning messages about skipped cases
Threading
Uses multiple threads if Julia is started with threading enabled. Example startup command: julia --threads 7,1 (7 worker threads, 1 runtime thread)
Notes
- Performs random k-fold partitioning of entries ignoring population structure
- Handles missing/invalid phenotype values
- Validates model inputs and data dimensions
- Returns warnings for cases with insufficient data or zero variance
Example
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false); genomes.populations = StatsBase.sample(string.("pop_", 1:3), length(genomes.entries), replace=true);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, verbose=false);
julia> phenomes = extractphenomes(trials);
julia> cvs, notes = cvbulk(genomes=genomes, phenomes=phenomes, models=[ols, ridge], n_replications=2, n_folds=2, verbose=false);
julia> df_across_entries, df_per_entry = tabularise(cvs);
julia> idx_across = findall((df_across_entries.trait .== "trait_1") .&& (df_across_entries.model .== "ridge") .&& (df_across_entries.replication .== "replication_1") .&& (df_across_entries.fold .== "fold_1"));
julia> idx_per = findall((df_per_entry.trait .== "trait_1") .&& (df_per_entry.model .== "ridge") .&& (df_per_entry.replication .== "replication_1") .&& (df_per_entry.fold .== "fold_1"));
julia> abs(df_across_entries.cor[idx_across][1] - cor(df_per_entry.y_true[idx_per], df_per_entry.y_pred[idx_per])) < 1e-10
trueGenomicBreedingModels.cvleaveonepopulationout — Method
cvleaveonepopulationout(;
genomes::Genomes,
phenomes::Phenomes,
models::Vector{Function}=[ridge],
n_replications::Int64=5,
n_folds::Int64=5,
seed::Int64=42,
verbose::Bool=true
)::Tuple{Vector{CV}, Vector{String}}Performs leave-one-population-out cross-validation for genomic prediction models across all available traits.
Arguments
genomes::Genomes: Genomic data structure containing allele frequenciesphenomes::Phenomes: Phenotypic data structure containing trait measurementsmodels::Vector{Function}: Vector of model functions to evaluate (default: [ridge])n_replications::Int64: Number of replications (not used in this implementation)n_folds::Int64: Number of folds (not used in this implementation)seed::Int64: Random seed (not used in this implementation)verbose::Bool: If true, displays progress information during execution
Returns
Tuple{Vector{CV}, Vector{String}}: Returns a tuple containing:- Vector of CV objects with cross-validation results
- Vector of warning/error messages for skipped validations
Details
The function implements a leave-one-population-out cross-validation strategy where:
- For each trait and population combination:
- Uses one population as validation set
- Uses remaining populations as training set
- Evaluates multiple genomic prediction models
- Handles missing data and variance checks
- Supports parallel processing via Julia's multi-threading
Threading
To utilize multiple threads, start Julia with: julia --threads n,1 where n is the desired number of threads for computation and 1 is reserved for the runtime.
Supported Models
- ols (Ordinary Least Squares)
- ridge (Ridge Regression)
- lasso (Lasso Regression)
- bayesa (Bayes A)
- bayesb (Bayes B)
- bayesc (Bayes C)
Example
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false); genomes.populations = StatsBase.sample(string.("pop_", 1:3), length(genomes.entries), replace=true);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);
julia> phenomes = extractphenomes(trials);
julia> cvs, notes = cvleaveonepopulationout(genomes=genomes, phenomes=phenomes, models=[ridge, bayesa], n_replications=2, n_folds=2, verbose=false);
julia> df_across_entries, df_per_entry = tabularise(cvs);
julia> sum([sum(split(df_across_entries.training_population[i], ";") .== df_across_entries.validation_population[i]) for i in 1:size(df_across_entries, 1)]) == 0
true
julia> idx_across = findall((df_across_entries.validation_population .== "pop_1") .&& (df_across_entries.trait .== "trait_1") .&& (df_across_entries.model .== "ridge"));
julia> idx_per = findall((df_per_entry.validation_population .== "pop_1") .&& (df_per_entry.trait .== "trait_1") .&& (df_per_entry.model .== "ridge"));
julia> abs(df_across_entries.cor[idx_across][1] - cor(df_per_entry.y_true[idx_per], df_per_entry.y_pred[idx_per])) < 1e-10
trueGenomicBreedingModels.cvmultithread! — Method
cvmultithread!(cvs::Vector{CV}; genomes::Genomes, phenomes::Phenomes, models_vector, verbose::Bool = true)::Vector{CV}Perform multi-threaded genomic prediction cross-validation using specified models.
Arguments
cvs::Vector{CV}: Vector of cross-validation objects to be processedgenomes::Genomes: Genomic data structure containing allele frequenciesphenomes::Phenomes: Phenotypic data structure containing trait measurementsmodels_vector: Vector of model functions to be used for prediction (e.g., [ridge, bayesa])verbose::Bool=true: Whether to display progress bar during computation
Returns
- Modified
cvsvector with updated cross-validation results
Threading
Requires Julia to be started with multiple threads to utilize parallel processing. Example startup command: julia --threads 7,1 (7 worker threads, 1 runtime thread)
Details
The function performs cross-validation in parallel for each CV object using the corresponding model from the models_vector. For each fold:
- Extracts training and validation indices
- Fits the specified model using training data
- Validates the model using validation data
- Updates the CV object with prediction results
Example
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);
julia> phenomes = extractphenomes(trials);
julia> idx_training = collect(1:50);
julia> idx_validation_1 = collect(51:75);
julia> idx_validation_2 = collect(76:100);
julia> fit = Fit(n = length(idx_training), l = length(genomes.loci_alleles) + 1); fit.model = "ridge"; fit.trait = "trait_1";
julia> fit.entries = genomes.entries[idx_training]; fit.populations = genomes.populations[idx_training];
julia> fit.b_hat_labels = vcat(["intercept"], genomes.loci_alleles);
julia> cv_1 = CV("replication_1", "fold_1", fit, genomes.populations[idx_validation_1], genomes.entries[idx_validation_1], zeros(length(idx_validation_1)), zeros(length(idx_validation_1)), fit.metrics);
julia> cv_2 = CV("replication_1", "fold_2", fit, genomes.populations[idx_validation_2], genomes.entries[idx_validation_2], zeros(length(idx_validation_2)), zeros(length(idx_validation_2)), fit.metrics);
julia> cvs = [cv_1, cv_2]; models = [ridge, ridge];
julia> cvmultithread!(cvs, genomes=genomes, phenomes=phenomes, models_vector=[ridge, bayesa], verbose=false);
julia> df_across_entries, df_per_entry = tabularise(cvs);
julia> idx_across = findall(df_across_entries.fold .== "fold_2");
julia> idx_per = findall(df_per_entry.fold .== "fold_2");
julia> abs(df_across_entries.cor[idx_across][1] - cor(df_per_entry.y_true[idx_per], df_per_entry.y_pred[idx_per])) < 1e-10
trueGenomicBreedingModels.cvpairwisepopulation — Method
cvpairwisepopulation(;
genomes::Genomes,
phenomes::Phenomes,
models::Vector{Function}=[ridge],
n_replications::Int64=5,
n_folds::Int64=5,
seed::Int64=42,
verbose::Bool=true
)::Tuple{Vector{CV}, Vector{String}}Performs pairwise cross-validation between populations for genomic prediction models.
Arguments
genomes::Genomes: Genomic data structure containing allele frequenciesphenomes::Phenomes: Phenotypic data structure containing trait measurementsmodels::Vector{Function}: Vector of genomic prediction model functions to evaluate (default: [ridge])n_replications::Int64: Number of replications (unused, maintained for API consistency)n_folds::Int64: Number of folds (unused, maintained for API consistency)seed::Int64: Random seed (unused, maintained for API consistency)verbose::Bool: Whether to display progress information
Returns
Tuple{Vector{CV}, Vector{String}}:- Vector of CV objects containing cross-validation results
- Vector of warning messages for skipped validations
Details
For each pair of populations (pop1, pop2):
- Uses pop1 as training set and pop2 as validation set
- Skips within-population validation (pop1==pop2)
- Evaluates each model on all available traits
- Handles missing/invalid phenotype values
- Validates model inputs and data dimensions
Threading
Requires Julia to be started with multiple threads: julia --threads n,1 where n is number of worker threads
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false); genomes.populations = StatsBase.sample(string.("pop_", 1:3), length(genomes.entries), replace=true);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);
julia> phenomes = extractphenomes(trials);
julia> cvs, notes = cvpairwisepopulation(genomes=genomes, phenomes=phenomes, models=[ols, ridge], n_replications=2, n_folds=2, verbose=false);
julia> df_across_entries, df_per_entry = tabularise(cvs);
julia> sum(df_across_entries.training_population .!= df_across_entries.validation_population) == size(df_across_entries, 1)
true
julia> idx_across = findall((df_across_entries.training_population .== "pop_1") .&& (df_across_entries.validation_population .== "pop_2") .&& (df_across_entries.trait .== "trait_1") .&& (df_across_entries.model .== "ridge"));
julia> idx_per = findall((df_per_entry.training_population .== "pop_1") .&& (df_per_entry.validation_population .== "pop_2") .&& (df_per_entry.trait .== "trait_1") .&& (df_per_entry.model .== "ridge"));
julia> abs(df_across_entries.cor[idx_across][1] - cor(df_per_entry.y_true[idx_per], df_per_entry.y_pred[idx_per])) < 1e-10
trueGenomicBreedingModels.cvperpopulation — Method
cvperpopulation(;
genomes::Genomes,
phenomes::Phenomes,
models::Vector{Function}=[ridge],
n_replications::Int64=5,
n_folds::Int64=5,
seed::Int64=42,
verbose::Bool=true
)::Tuple{Vector{CV}, Vector{String}}Performs within-population replicated cross-validation of genomic prediction models across all available traits.
Arguments
genomes::Genomes: Genomic data structure containing allele frequenciesphenomes::Phenomes: Phenotypic data structure containing trait measurementsmodels::Vector{Function}=[ridge]: Vector of genomic prediction model functions to evaluaten_replications::Int64=5: Number of replications for cross-validationn_folds::Int64=5: Number of folds for k-fold cross-validationseed::Int64=42: Random seed for reproducibilityverbose::Bool=true: Whether to print progress information
Returns
Tuple{Vector{CV}, Vector{String}}: A tuple containing:- Vector of CV objects with cross-validation results
- Vector of notes/warnings generated during the process
Details
The function performs separate cross-validations for each unique population in the dataset. Supports multiple genomic prediction models including:
ols: Ordinary Least Squaresridge: Ridge Regressionlasso: Lasso Regressionbayesa: Bayes Abayesb: Bayes Bbayesc: Bayes C
Threading
To use multiple threads, invoke Julia with: julia --threads n,1 where n is the desired number of threads for multi-threaded processes and 1 is reserved for the Julia runtime.
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false); genomes.populations = StatsBase.sample(string.("pop_", 1:3), length(genomes.entries), replace=true);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);
julia> phenomes = extractphenomes(trials);
julia> cvs, notes = cvperpopulation(genomes=genomes, phenomes=phenomes, models=[ridge, bayesa], n_replications=2, n_folds=2, verbose=false);
julia> df_across_entries, df_per_entry = tabularise(cvs);
julia> sort(unique(df_across_entries.training_population))
3-element Vector{String}:
"pop_1"
"pop_2"
"pop_3"
julia> df_across_entries.training_population == df_across_entries.validation_population
true
julia> idx_across = findall((df_across_entries.validation_population .== "pop_1") .&& (df_across_entries.trait .== "trait_1") .&& (df_across_entries.model .== "bayesa") .&& (df_across_entries.replication .== "replication_1") .&& (df_across_entries.fold .== "fold_1"));
julia> idx_per = findall((df_per_entry.validation_population .== "pop_1") .&& (df_per_entry.trait .== "trait_1") .&& (df_per_entry.model .== "bayesa") .&& (df_per_entry.replication .== "replication_1") .&& (df_per_entry.fold .== "fold_1"));
julia> abs(df_across_entries.cor[idx_across][1] - cor(df_per_entry.y_true[idx_per], df_per_entry.y_pred[idx_per])) < 1e-10
true
julia> summary_across, summary_per_entry = summarise(cvs);
julia> size(summary_across)
(6, 11)
julia> size(summary_per_entry)
(200, 9)GenomicBreedingModels.epistasisfeatures — Method
epistasisfeatures(
genomes::Genomes,
phenomes::Phenomes;
idx_trait::Int64 = 1,
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
transformations1::Vector{Function} = [square, invoneplus, log10epsdivlog10eps],
transformations2::Vector{Function} = [mult, addnorm, raise],
n_new_features_per_transformation::Int64 = 1_000,
n_reps::Int64 = 3,
verbose::Bool = false
)::GenomesGenerate epistasis features by applying various transformations to genomic data.
Arguments
genomes::Genomes: Genomic data structure containing allele frequenciesphenomes::Phenomes: Phenotypic data structure containing trait measurementsidx_trait::Int64: Index of the trait to analyze (default: 1)idx_entries::Union{Nothing,Vector{Int64}}: Indices of entries to include (default: all)idx_loci_alleles::Union{Nothing,Vector{Int64}}: Indices of loci/alleles to include (default: all)transformations1::Vector{Function}: Single-input transformation functions (default: [square, invoneplus, log10epsdivlog10eps])transformations2::Vector{Function}: Two-input transformation functions (default: [mult, addnorm, raise])n_new_features_per_transformation::Int64: Number of new features to generate per transformation (default: 1_000)n_reps::Int64: Number of times to repeat the transformation process (default: 3)verbose::Bool: Whether to display progress information (default: false)
Returns
Genomes: Enhanced genomic data structure with additional epistasis features
Description
This function generates epistasis features by applying two types of transformations:
- Single-input transformations (transformations1) applied to individual genomic features
- Two-input transformations (transformations2) applied to pairs of genomic features
The transformations are repeated n_reps times to create a rich set of derived features.
Notes
- Ensures all generated features are within [0,1] range
- Maintains dimensional consistency of input/output structures
- Automatically handles entry and loci/allele subsetting
- Validates input data structures before processing
Example
julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> genomes_plus_features = epistasisfeatures(genomes, phenomes, n_new_features_per_transformation=50, n_reps=2, verbose=false);
julia> cvs, notes = cvbulk(genomes=genomes_plus_features, phenomes=phenomes, models=[ridge, lasso, bayesa], verbose=false);
julia> cvs_no_epi, notes_no_epi = cvbulk(genomes=genomes, phenomes=phenomes, models=[ridge, lasso, bayesa], verbose=false);
julia> df_across, df_per_entry = GenomicBreedingCore.tabularise(cvs);
julia> df_across_no_epi, df_per_entry_no_epi = GenomicBreedingCore.tabularise(cvs_no_epi);
julia> df_summary = combine(groupby(df_across, [:trait, :model]), [[:cor] => mean, [:cor] => std]);
julia> df_summary_no_epi = combine(groupby(df_across_no_epi, [:trait, :model]), [[:cor] => mean, [:cor] => std]);
julia> mean(df_summary.cor_mean) > mean(df_summary_no_epi.cor_mean)
trueGenomicBreedingModels.extractxyetc — Method
extractxyetc(
genomes::Genomes,
phenomes::Phenomes;
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
add_intercept::Bool = true
) -> Tuple{Matrix{Float64}, Vector{Float64}, Vector{String}, Vector{String}, Vector{String}}Extract data matrices and vectors from genomic and phenotypic data for statistical analyses.
Arguments
genomes::Genomes: Genomic data structure containing allele frequenciesphenomes::Phenomes: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}: Optional indices to select specific entries (default: all entries)idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to select specific loci-alleles (default: all loci-alleles)idx_trait::Int64: Index of the trait to analyze (default: 1)add_intercept::Bool: Whether to add an intercept column to the design matrix (default: true)
Returns
A tuple containing:
X::Matrix{Float64}: Design matrix with allele frequencies (and intercept if specified)y::Vector{Float64}: Response vector with phenotypic valuesentries::Vector{String}: Names of the selected entriespopulations::Vector{String}: Population identifiers for the selected entriesloci_alleles::Vector{String}: Names of the selected loci-alleles
Notes
- The function filters out entries with missing, NaN, or infinite phenotype values
- Requires at least 2 valid entries after filtering
- Checks for non-zero variance in the trait values
- Ensures consistency between genomic and phenotypic data dimensions
- Validates all index inputs are within bounds
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> X, y, loci_alleles = extractxyetc(genomes, phenomes);
julia> X == hcat(ones(length(phenomes.entries)), genomes.allele_frequencies)
true
julia> y == phenomes.phenotypes[:, 1]
trueGenomicBreedingModels.gwaslmm — Method
gwaslmm(
genomes::Genomes,
phenomes::Phenomes;
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
GRM_type::String = ["simple", "ploidy-aware"][1],
verbose::Bool = false
)::FitPerform genome-wide association analysis using a linear mixed model (LMM) approach.
Arguments
genomes::Genomes: Genomic data structure containing allele frequenciesphenomes::Phenomes: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}: Optional indices for subsetting entriesidx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices for subsetting loci/allelesidx_trait::Int64: Index of the trait to analyze (default: 1)GRM_type::String: Type of genetic relationship matrix to use:- "simple": Standard GRM calculation
- "ploidy-aware": Ploidy-adjusted GRM calculation
verbose::Bool: Whether to display progress and plots (default: false)
Returns
Fit: A structure containing GWAS results including:model: Model identifier ("GWAS_LMM")b_hat: Vector of test statistics (z-scores) for genetic markers
Details
The function implements a mixed model GWAS using the first principal component of the genetic relationship matrix (GRM) as a fixed effect to control for population structure. The model includes random effects for entries and uses REML estimation.
Notes
- Handles both diploid and polyploid data through the
GRM_typeparameter - Uses multi-threading for parallel computation of marker effects
- Includes automatic convergence retry on fitting failures
- Maximum fitting time per marker is limited to 60 seconds
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> ploidy = 4;
julia> genomes.allele_frequencies = round.(genomes.allele_frequencies .* ploidy) ./ ploidy;
julia> proportion_of_variance = zeros(9, 1); proportion_of_variance[1, 1] = 0.5;
julia> trials, effects = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.05 0.00 0.00;], proportion_of_variance = proportion_of_variance, verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> fit_1 = Suppressor.@suppress gwaslmm(genomes=genomes, phenomes=phenomes, GRM_type="simple");
julia> fit_1.model
"GWAS_LMM"
julia> fit_2 = Suppressor.@suppress gwaslmm(genomes=genomes, phenomes=phenomes, GRM_type="ploidy-aware");
julia> fit_2.model
"GWAS_LMM"
julia> findall(fit_1.b_hat .== maximum(fit_1.b_hat)) == findall(fit_2.b_hat .== maximum(fit_2.b_hat))
trueGenomicBreedingModels.gwasols — Method
gwasols(
genomes::Genomes,
phenomes::Phenomes;
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
GRM_type::String = ["simple", "ploidy-aware"][1],
verbose::Bool = false,
)::FitPerform genome-wide association study (GWAS) using ordinary least squares (OLS) regression with population structure correction.
Arguments
genomes::Genomes: Genomic data structure containing allele frequenciesphenomes::Phenomes: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset entries (default: all entries)idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset loci/alleles (default: all loci)idx_trait::Int64: Index of the trait to analyze (default: 1)GRM_type::String: Type of genetic relationship matrix to use ("simple" or "ploidy-aware") (default: "simple")verbose::Bool: Whether to display progress and plots (default: false)
Returns
Fit: A structure containing GWAS results including:model: Model identifier ("GWAS_OLS")b_hat: Vector of effect size estimates/t-statistics for each marker- Additional model information
Details
The function implements GWAS using OLS regression while accounting for population structure through the first principal component of the genetic relationship matrix (GRM) as a covariate. Two types of GRM can be used: "simple" assumes diploid organisms, while "ploidy-aware" accounts for different ploidy levels.
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> ploidy = 4;
julia> genomes.allele_frequencies = round.(genomes.allele_frequencies .* ploidy) ./ ploidy;
julia> proportion_of_variance = zeros(9, 1); proportion_of_variance[1, 1] = 0.5;
julia> trials, effects = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.05 0.00 0.00;], proportion_of_variance = proportion_of_variance, verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> fit_1 = gwasols(genomes=genomes, phenomes=phenomes, GRM_type="simple");
julia> fit_1.model
"GWAS_OLS"
julia> fit_2 = gwasols(genomes=genomes, phenomes=phenomes, GRM_type="ploidy-aware");
julia> fit_2.model
"GWAS_OLS"
julia> findall(fit_1.b_hat .== maximum(fit_1.b_hat)) == findall(fit_2.b_hat .== maximum(fit_2.b_hat))
trueGenomicBreedingModels.gwasprep — Method
gwasprep(
genomes::Genomes,
phenomes::Phenomes;
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
GRM_type::String = ["simple", "ploidy-aware"][1],
standardise::Bool = true,
verbose::Bool = false,
)::Tuple{Matrix{Float64},Vector{Float64},Matrix{Float64},Fit}Prepare data matrices and structures for genome-wide association studies (GWAS).
Arguments
genomes::Genomes: Genomic data structure containing allele frequenciesphenomes::Phenomes: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset entries (default: all entries)idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset loci/alleles (default: all loci)idx_trait::Int64: Index of the trait to analyze (default: 1)GRM_type::String: Type of genetic relationship matrix to use ("simple" or "ploidy-aware") (default: "simple")standardise::Bool: Whether to standardize the data matrices (default: true)verbose::Bool: Whether to print progress information (default: false)
Returns
A tuple containing:
G::Matrix{Float64}: Standardized allele frequency matrixy::Vector{Float64}: Standardized phenotype vectorGRM::Matrix{Float64}: Genetic relationship matrixfit::Fit: Initialized Fit struct for GWAS results
Details
- Performs data validation and preprocessing for GWAS analysis
- Removes fixed loci with no variation
- Standardizes genomic and phenotypic data if requested
- Constructs appropriate genetic relationship matrix
- Initializes output structure for association results
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> ploidy = 4;
julia> genomes.allele_frequencies = round.(genomes.allele_frequencies .* ploidy) ./ ploidy;
julia> proportion_of_variance = zeros(9, 1); proportion_of_variance[1, 1] = 0.5;
julia> trials, effects = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.05 0.00 0.00;], proportion_of_variance = proportion_of_variance, verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> G, y, GRM, fit = gwasprep(genomes=genomes, phenomes=phenomes);
julia> sum(abs.(mean(G, dims=1)[1,:]) .< 1e-10) == size(G, 2)
true
julia> sum(abs.(std(G, dims=1)[1,:] .- 1) .< 1e-10) == size(G, 2)
true
julia> (abs(mean(y)) < 1e-10, abs(std(y) - 1) < 1e-10)
(true, true)
julia> size(G, 1) == length(y)
true
julia> (size(G, 1), length(y)) == size(GRM)
true
julia> length(fit.entries) == length(y)
true
julia> length(fit.b_hat) == size(G, 2)
trueGenomicBreedingModels.gwasreml — Method
gwasreml(
genomes::Genomes,
phenomes::Phenomes;
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
GRM_type::String = "simple",
verbose::Bool = false
)::FitPerforms genome-wide association analysis using restricted maximum likelihood estimation (REML).
Arguments
genomes::Genomes: Genomic data structure containing allele frequenciesphenomes::Phenomes: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset entries (default: nothing)idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset loci/alleles (default: nothing)idx_trait::Int64: Index of the trait to analyze (default: 1)GRM_type::String: Type of genetic relationship matrix to use, either "simple" or "ploidy-aware" (default: "simple")verbose::Bool: Whether to display progress and plots (default: false)
Returns
::Fit: A Fit struct containing GWAS results, including effect estimates and test statistics
Details
Implements the REML log-likelihood calculation for a mixed model of the form: y = Xβ + Zu + e where:
- β are fixed effects
- u are random genetic effects with u ~ N(0, σ²_u * GRM)
- e are residual effects with e ~ N(0, σ²_e * I)
The function constructs the variance-covariance matrices and computes the REML transformation to obtain the log-likelihood value used in variance component estimation.
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false);
julia> ploidy = 4;
julia> genomes.allele_frequencies = round.(genomes.allele_frequencies .* ploidy) ./ ploidy;
julia> proportion_of_variance = zeros(9, 1); proportion_of_variance[1, 1] = 0.5;
julia> trials, effects = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.05 0.00 0.00;], proportion_of_variance = proportion_of_variance, verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> fit_1 = gwasreml(genomes=genomes, phenomes=phenomes, GRM_type="simple");
julia> fit_1.model
"GWAS_REML"
julia> fit_2 = gwasreml(genomes=genomes, phenomes=phenomes, GRM_type="ploidy-aware");
julia> fit_2.model
"GWAS_REML"
julia> findall(fit_1.b_hat .== maximum(fit_1.b_hat)) == findall(fit_2.b_hat .== maximum(fit_2.b_hat))
trueGenomicBreedingModels.heritabilitynarrow_sense — Method
heritabilitynarrow_sense(y_true::Vector{Float64}, y_pred::Vector{Float64})::Float64Calculate narrow-sense heritability (h²) from true and predicted phenotypic values.
Narrow-sense heritability is the proportion of phenotypic variance that can be attributed to additive genetic effects. It is calculated as the ratio of additive genetic variance (s²a) to total phenotypic variance (s²a + s²e).
Arguments
y_true::Vector{Float64}: Vector of observed/true phenotypic valuesy_pred::Vector{Float64}: Vector of predicted genetic values
Returns
Float64: Narrow-sense heritability (h²) value between 0 and 1
Details
- Returns 0.0 if variance of either input vector is near zero (< 1e-10)
- Additive genetic variance (s²a) is estimated from variance of predictions
- Environmental variance (s²e) is estimated from variance of residuals
- Result is bounded between 0 and 1
GenomicBreedingModels.invoneplus — Method
invoneplus(x) = 1 / (1 + x)This function takes a number, adds one and inverts it. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Both input and ouput range between zero and one.
GenomicBreedingModels.lasso — Method
lasso(;
genomes::Genomes,
phenomes::Phenomes,
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
verbose::Bool = false,
)::FitFits a LASSO (Least Absolute Shrinkage and Selection Operator) regression model with L1 regularization for genomic prediction.
Arguments
genomes::Genomes: Genomic data structure containing allele frequenciesphenomes::Phenomes: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}: Optional indices to select specific entries/individualsidx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to select specific loci-allelesidx_trait::Int64: Index of the trait to analyze (default: 1)verbose::Bool: If true, prints diagnostic plots and additional information (default: false)
Returns
Fit: A fitted model object containing:model: Model identifier ("ols")b_hat: Estimated regression coefficientsb_hat_labels: Labels for the coefficientstrait: Name of the analyzed traitentries: Entry identifierspopulations: Population identifiersy_true: Observed phenotypic valuesy_pred: Predicted phenotypic valuesmetrics: Dictionary of performance metrics
Details
The function implements LASSO regression using the GLMNet package, which performs automatic feature selection through L1 regularization. The optimal regularization parameter (λ) is selected using cross-validation to minimize prediction error while promoting sparsity in the coefficients.
Notes
- Standardisation is disabled by default because the allele frequencies across loci are comparable as they all range from zero to one
- The model includes an intercept term
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> fit = lasso(genomes=genomes, phenomes=phenomes);
julia> fit.model == "lasso"
true
julia> fit.metrics["cor"] > 0.50
trueGenomicBreedingModels.log10epsdivlog10eps — Method
log10epsdivlog10eps(x) = (log10(abs(x) + eps(Float64))) / log10(eps(Float64))This function takes a number, adds a very tiny value, takes the log10, and divide it by the log10 of the same very tiny value. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Both input and ouput range between zero and one.
GenomicBreedingModels.loglikreml — Method
loglikreml(θ::Vector{Float64}, data::Tuple{Vector{Float64},Matrix{Float64},Matrix{Float64}})::Float64Calculate the restricted maximum likelihood (REML) log-likelihood for a mixed linear model.
Arguments
θ::Vector{Float64}: Vector of variance components [σ²e, σ²u] where:- σ²_e is the residual variance
- σ²_u is the genetic variance
data::Tuple{Vector{Float64},Matrix{Float64},Matrix{Float64}}: Tuple containing:- y: Vector of phenotypic observations
- X: Design matrix for fixed effects
- GRM: Genomic relationship matrix
Returns
Float64: The REML log-likelihood value. ReturnsInfif matrix operations fail.
Details
Implements the REML log-likelihood calculation for a mixed model of the form: y = Xβ + Zu + e where:
- β are fixed effects
- u are random genetic effects with u ~ N(0, σ²_u * GRM)
- e are residual effects with e ~ N(0, σ²_e * I)
The function constructs the variance-covariance matrices and computes the REML transformation to obtain the log-likelihood value used in variance component estimation.
Examples
```jldoctest; setup = :(using GenomicBreedingCore, GenomicBreedingModels, LinearAlgebra, StatsBase) julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> ploidy = 4;
julia> genomes.allelefrequencies = round.(genomes.allelefrequencies .* ploidy) ./ ploidy;
julia> proportionofvariance = zeros(9, 1); proportionofvariance[1, 1] = 0.5;
julia> trials, effects = GenomicBreedingCore.simulatetrials(genomes=genomes, nyears=1, nseasons=1, nharvests=1, nsites=1, nreplications=1, fadddomepi=[0.05 0.00 0.00;], proportionofvariance = proportionofvariance, verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> G, y, GRM, fit = gwasprep(genomes=genomes, phenomes=phenomes);
julia> loglik = loglikreml([0.53, 0.15], (y, hcat(ones(length(y)), G[:, 1]), GRM));
julia> loglik < 100 true
GenomicBreedingModels.metrics — Method
metrics(y_true::Vector{Float64}, y_pred::Vector{Float64})::Dict{String,Float64}Calculate various metrics comparing predicted vs true values.
Returns a dictionary containing the following metrics:
cor: Pearson correlation coefficientmad: Mean absolute deviationmsd: Mean squared deviationrmsd: Root mean squared deviationnrmsd: Normalized root mean squared deviationeuc: Euclidean distancejac: Jaccard distancetvar: Total variation distanceh²: Narrow-sense heritability
Arguments
y_true::Vector{Float64}: Vector of true/observed valuesy_pred::Vector{Float64}: Vector of predicted values
Returns
Dict{String,Float64}: Dictionary mapping metric names to their calculated values
GenomicBreedingModels.mult — Method
mult(x, y) = x * yThis function takes two numbers and multiplies them together. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Inputs and ouput range between zero and one.
GenomicBreedingModels.ols — Method
ols(;
genomes::Genomes,
phenomes::Phenomes,
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
verbose::Bool = false
)::FitFits an ordinary least squares (OLS) regression model to genomic and phenotypic data.
Arguments
genomes::Genomes: Genomic data structure containing allele frequenciesphenomes::Phenomes: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}: Optional indices to select specific entries (default: all entries)idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to select specific loci-alleles (default: all loci-alleles)idx_trait::Int64: Index of the trait to analyze (default: 1)verbose::Bool: If true, displays diagnostic plots and performance metrics (default: false)
Returns
Fit: A fitted model object containing:model: Model identifier ("ols")b_hat: Estimated regression coefficientsb_hat_labels: Labels for the coefficientstrait: Name of the analyzed traitentries: Entry identifierspopulations: Population identifiersy_true: Observed phenotypic valuesy_pred: Predicted phenotypic valuesmetrics: Dictionary of performance metrics
Description
Performs ordinary least squares regression on genomic data to predict phenotypic values. The model includes an intercept term and estimates effects for each locus-allele combination.
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> fit = ols(genomes=genomes, phenomes=phenomes);
julia> fit.model == "ols"
true
julia> fit.metrics["cor"] > 0.50
trueGenomicBreedingModels.pearsonscorrelation — Method
pearsonscorrelation(y_true::Vector{Float64}, y_pred::Vector{Float64})::Float64Calculate the Pearson correlation coefficient between two vectors.
The Pearson correlation coefficient measures the linear correlation between two variables, giving a value between -1 and +1, where:
- +1 represents perfect positive correlation
- 0 represents no linear correlation
- -1 represents perfect negative correlation
Arguments
y_true::Vector{Float64}: Vector of true/actual valuesy_pred::Vector{Float64}: Vector of predicted/estimated values
Returns
Float64: Pearson correlation coefficient
Notes
- Returns 0.0 if the variance of either input vector is less than 1e-10
- Uses
1 - correlation distancefrom Distances.jl package
GenomicBreedingModels.predict — Method
predict(; fit::Fit, genomes::Genomes, idx_entries::Vector{Int64})::Vector{Float64}Calculate predicted phenotypes using a fitted genomic prediction model.
Arguments
fit::Fit: A fitted genomic prediction model containing coefficients and model informationgenomes::Genomes: Genomic data containing allele frequenciesidx_entries::Vector{Int64}: Vector of indices specifying which entries to predict
Returns
Vector{Float64}: Predicted phenotypic values for the specified entries
Details
Supports various linear genomic prediction models including:
- OLS (Ordinary Least Squares)
- Ridge Regression
- LASSO
- Bayes A
- Bayes B
- Bayes C
The function validates input dimensions and compatibility between the fitted model and genomic data before making predictions.
Throws
ArgumentError: If the Fit or Genomes structs are corruptedArgumentError: If entry indices are out of boundsArgumentError: If loci-alleles in the fitted model don't match the validation setArgumentError: If the genomic prediction model is not recognized
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);
julia> phenomes = extractphenomes(trials);
julia> fit = ridge(genomes=genomes, phenomes=phenomes, idx_entries=collect(1:90));
julia> y_hat = GenomicBreedingModels.predict(fit=fit, genomes=genomes, idx_entries=collect(91:100));
julia> cor(phenomes.phenotypes[91:100, 1], y_hat) > 0.5
trueGenomicBreedingModels.raise — Method
raise(x, y) = x^yThis function raises the first value to the power of the second value. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Inputs and ouput range between zero and one.
GenomicBreedingModels.reconstitutefeatures — Method
reconstitutefeatures(
genomes::Genomes;
feature_names::Vector{String},
verbose::Bool = false
)::GenomesReconstruct epistasis features from a Genomes struct using feature names that encode the transformations applied.
Arguments
genomes::Genomes: Genomic data structure containing allele frequenciesfeature_names::Vector{String}: Vector of feature names containing encoded transformation informationverbose::Bool: Whether to show progress bar during reconstruction (default: false)
Returns
Genomes: A new Genomes struct with reconstructed epistasis features
Details
The function parses the feature names to determine which transformations were applied and reconstructs the features by applying these transformations to the original genomic data. Feature names should contain the transformation operations in a format that can be parsed and evaluated.
Throws
ArgumentError: If the input Genomes struct is corrupted (invalid dimensions)ErrorException: If feature reconstruction fails
Example
julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> genomes_epifeat = epistasisfeatures(genomes, phenomes, n_new_features_per_transformation=50, n_reps=2, verbose=false);
julia> feature_names = genomes_epifeat.loci_alleles;
julia> genomes_epifeat_reconstructed = reconstitutefeatures(genomes, feature_names=feature_names);
julia> genomes_epifeat == genomes_epifeat_reconstructed
trueGenomicBreedingModels.ridge — Method
ridge(;
genomes::Genomes,
phenomes::Phenomes,
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
idx_trait::Int64 = 1,
verbose::Bool = false
)::FitFit a ridge (L2) regression model to genomic data. Ridge regression adds an L2 regularization term to the ordinary least squares objective function, which helps prevent overfitting and handles multicollinearity in the predictors.
Arguments
genomes::Genomes: Genomic data structure containing allele frequenciesphenomes::Phenomes: Phenotypic data structure containing trait measurementsidx_entries::Union{Nothing,Vector{Int64}}: Optional indices to subset specific entries/individualsidx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to subset specific loci-allelesidx_trait::Int64: Index of the trait to analyze (default: 1)verbose::Bool: If true, prints diagnostic plots and additional information (default: false)
Returns
Fit: A fitted model object containing:model: Model identifier ("ols")b_hat: Estimated regression coefficientsb_hat_labels: Labels for the coefficientstrait: Name of the analyzed traitentries: Entry identifierspopulations: Population identifiersy_true: Observed phenotypic valuesy_pred: Predicted phenotypic valuesmetrics: Dictionary of performance metrics
Notes
- Uses cross-validation to select the optimal regularization parameter (λ)
- Standardizes predictors before fitting
- Includes an intercept term in the model
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> fit = ridge(genomes=genomes, phenomes=phenomes);
julia> fit.model == "ridge"
true
julia> fit.metrics["cor"] > 0.50
trueGenomicBreedingModels.square — Method
square(x) = x^2This function takes a number and squares it. A member of the set of endofunctions defined in GenomicBreedingModels.jl for building non-linear genomic prediction models. Both input and ouput range between zero and one.
GenomicBreedingModels.transform1 — Method
transform1(
f::Function,
genomes::Genomes,
phenomes::Phenomes;
idx_trait::Int64 = 1,
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
n_new_features_per_transformation::Int64 = 1_000,
ϵ::Float64 = eps(Float64),
use_abs::Bool = false,
σ²_threshold::Float64 = 0.01,
verbose::Bool = false
)::GenomesApply a transformation function to each allele frequency in genomic data while considering their effects on phenotypes.
Arguments
f::Function: Transformation function to apply to allele frequenciesgenomes::Genomes: Input genomic data structurephenomes::Phenomes: Corresponding phenotypic data structure
Keywords
idx_trait::Int64: Index of the trait to analyze (default: 1)idx_entries::Union{Nothing,Vector{Int64}}: Indices of entries to include (default: all)idx_loci_alleles::Union{Nothing,Vector{Int64}}: Indices of loci-alleles to include (default: all)n_new_features_per_transformation::Int64: Maximum number of transformed features to retain (default: 1000)ϵ::Float64: Small value added to prevent numerical issues (default: machine epsilon)use_abs::Bool: Whether to use absolute values before transformation (default: false)σ²_threshold::Float64: Minimum variance threshold for considering a locus (default: 0.01)verbose::Bool: Whether to show progress and plots (default: false)
Returns
Genomes: A new Genomes object containing the transformed allele frequencies
Details
The function performs the following steps:
- Extracts allele frequencies and phenotypic data
- Applies the transformation function to each locus
- Estimates effects using ordinary least squares
- Selects the most important transformed features
- Cleans numerical artifacts (values near 0 or 1)
Notes
- Use named functions to ensure transformations can be reconstructed from
loci_alleles - The function adds
ϵto frequencies to prevent numerical issues - Features with variance below
σ²_thresholdare skipped - The output contains at most
n_new_features_per_transformationfeatures
Example
julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> genomes_transformed = transform1(x -> x^2, genomes, phenomes);
julia> idx = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[1])[1];
julia> mean(sqrt.(genomes_transformed.allele_frequencies[:, 1]) .- genomes.allele_frequencies[:, idx]) < 1e-10
true
julia> squareaddpi(x) = x^2 + pi;
julia> genomes_transformed = transform1(squareaddpi, genomes, phenomes);
julia> idx = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[1])[1];
julia> mean(squareaddpi.(genomes.allele_frequencies[:, idx]) .- genomes_transformed.allele_frequencies[:, 1]) < 1e-10
trueGenomicBreedingModels.transform2 — Method
transform2(
f::Function,
genomes::Genomes,
phenomes::Phenomes;
idx_trait::Int64 = 1,
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
n_new_features_per_transformation::Int64 = 1_000,
ϵ::Float64 = eps(Float64),
use_abs::Bool = false,
σ²_threshold::Float64 = 0.01,
commutative::Bool = false,
verbose::Bool = false
)::GenomesApply a transformation function to pairs of allele frequencies in genomic data while considering their effects on phenotypes.
Arguments
f::Function: Transformation function to apply to pairs of allele frequenciesgenomes::Genomes: Input genomic data structurephenomes::Phenomes: Corresponding phenotypic data structureidx_trait::Int64: Index of the trait to analyze (default: 1)idx_entries::Union{Nothing,Vector{Int64}}: Subset of entries to include (default: all)idx_loci_alleles::Union{Nothing,Vector{Int64}}: Subset of loci-alleles to include (default: all)n_new_features_per_transformation::Int64: Maximum number of transformed features to retain (default: 1000)ϵ::Float64: Small value added to frequencies to avoid numerical issues (default: machine epsilon)use_abs::Bool: Whether to use absolute values of frequencies (default: false)σ²_threshold::Float64: Minimum variance threshold for considering loci (default: 0.01)commutative::Bool: Whether the transformation function is commutative (default: false)verbose::Bool: Whether to display progress and diagnostics (default: false)
Returns
Genomes: A new Genomes object containing the transformed features
Details
The function performs the following steps:
- Extracts allele frequencies and phenotypic data
- Applies the transformation function to all possible pairs of allele freuqencies (i.e., locus-allele combinations)
- Estimates effects using ordinary least squares
- Selects the most important transformed features
- Cleans numerical artifacts (values near 0 or 1)
Notes
- Use named functions to ensure transformations can be reconstructed from
loci_alleles - The function adds
ϵto frequencies to prevent numerical issues - Features with variance below
σ²_thresholdare skipped - The output contains at most
n_new_features_per_transformationfeatures
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(l=1_000, verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);;
julia> phenomes = extractphenomes(trials);
julia> genomes_transformed = transform2((x,y) -> (x^2 + sqrt(y)) / 2, genomes, phenomes);
julia> idx_1 = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[1])[1];
julia> idx_2 = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[2])[1];
julia> mean((genomes.allele_frequencies[:,idx_1].^2 .+ sqrt.(genomes.allele_frequencies[:,idx_2])) ./ 2 .- genomes_transformed.allele_frequencies[:,1]) < 1e-10
true
julia> raisexbyythenlog(x, y) = log(abs(x^y));
julia> genomes_transformed = transform2(raisexbyythenlog, genomes, phenomes);
julia> idx_1 = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[1])[1];
julia> idx_2 = findall(genomes.loci_alleles .== split(split(replace(genomes_transformed.loci_alleles[1], ")" => ""), "(")[2], ",")[2])[1];
julia> mean(raisexbyythenlog.(genomes.allele_frequencies[:,idx_1], genomes.allele_frequencies[:,idx_2]) .- genomes_transformed.allele_frequencies[:,1]) < 1e-10
trueGenomicBreedingModels.validate — Method
validate(
fit::Fit,
genomes::Genomes,
phenomes::Phenomes;
idx_validation::Vector{Int64},
replication::String="",
fold::String=""
)::CVEvaluate the predictive accuracy of a genomic prediction model on a validation dataset.
Arguments
fit::Fit: A fitted genomic prediction modelgenomes::Genomes: Genomic data structure containing allele frequenciesphenomes::Phenomes: Phenotypic data structure containing trait measurementsidx_validation::Vector{Int64}: Indices of entries to use for validationreplication::String: Optional identifier for the validation replicationfold::String: Optional identifier for the cross-validation fold
Returns
CV: A cross-validation result object containing:- Validation metrics (correlation, RMSE, etc.)
- True and predicted values
- Entry and population information
- Model specifications
Notes
- Performs checks for data leakage between training and validation sets
- Handles missing, NaN, and Inf values in phenotypic data
- Validates dimensions of output CV struct
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=1, f_add_dom_epi=[0.1 0.01 0.01;], verbose=false);
julia> phenomes = extractphenomes(trials);
julia> fit = ridge(genomes=genomes, phenomes=phenomes, idx_entries=collect(1:90));
julia> cv = validate(fit, genomes, phenomes, idx_validation=collect(91:100));
julia> cv.metrics["cor"] > 0.50
trueGenomicBreedingModels.@string2operations — Macro
@string2operations(x)Convert a string containing an endofunction formulae for genomic features into a parsed Julia expressions.
Arguments
x: A string containing mathematical operations on allele frequencies
Returns
- Parsed expression that can be evaluated in Julia