GenomicBreedingCore
Documentation for GenomicBreedingCore.
GenomicBreedingCore.AbstractGB
GenomicBreedingCore.BLR
GenomicBreedingCore.CV
GenomicBreedingCore.Fit
GenomicBreedingCore.GRM
GenomicBreedingCore.Genomes
GenomicBreedingCore.Phenomes
GenomicBreedingCore.SimulatedEffects
GenomicBreedingCore.TEBV
GenomicBreedingCore.Trials
Base.:==
Base.:==
Base.:==
Base.:==
Base.:==
Base.:==
Base.:==
Base.:==
Base.:==
Base.filter
Base.filter
Base.filter
Base.hash
Base.hash
Base.hash
Base.hash
Base.hash
Base.hash
Base.hash
Base.hash
Base.hash
Base.merge
Base.merge
Base.merge
Base.sum
GenomicBreedingCore.addcompositetrait
GenomicBreedingCore.addcompositetrait
GenomicBreedingCore.aggregateharvests
GenomicBreedingCore.analyse
GenomicBreedingCore.analyse
GenomicBreedingCore.analyseviaBLR
GenomicBreedingCore.checkandfocalterms
GenomicBreedingCore.checkdims
GenomicBreedingCore.checkdims
GenomicBreedingCore.checkdims
GenomicBreedingCore.checkdims
GenomicBreedingCore.checkdims
GenomicBreedingCore.checkdims
GenomicBreedingCore.checkdims
GenomicBreedingCore.checkdims
GenomicBreedingCore.checkdims
GenomicBreedingCore.clone
GenomicBreedingCore.clone
GenomicBreedingCore.clone
GenomicBreedingCore.clone
GenomicBreedingCore.clone
GenomicBreedingCore.clone
GenomicBreedingCore.clone
GenomicBreedingCore.clone
GenomicBreedingCore.countlevels
GenomicBreedingCore.dimensions
GenomicBreedingCore.dimensions
GenomicBreedingCore.dimensions
GenomicBreedingCore.dimensions
GenomicBreedingCore.dimensions
GenomicBreedingCore.distances
GenomicBreedingCore.distances
GenomicBreedingCore.divideintomockscaffolds
GenomicBreedingCore.estimatedistances
GenomicBreedingCore.estimateld
GenomicBreedingCore.extractXb
GenomicBreedingCore.extracteffects
GenomicBreedingCore.extractmodelinputs
GenomicBreedingCore.extractphenomes
GenomicBreedingCore.extractphenomes
GenomicBreedingCore.filterbymaf
GenomicBreedingCore.filterbypca
GenomicBreedingCore.filterbysnplist
GenomicBreedingCore.filterbysparsity
GenomicBreedingCore.grmploidyaware
GenomicBreedingCore.grmsimple
GenomicBreedingCore.histallelefreqs
GenomicBreedingCore.impute
GenomicBreedingCore.inflatediagonals!
GenomicBreedingCore.instantiateblr
GenomicBreedingCore.knni
GenomicBreedingCore.knnioptim
GenomicBreedingCore.loci
GenomicBreedingCore.loci_alleles
GenomicBreedingCore.makex
GenomicBreedingCore.maskmissing!
GenomicBreedingCore.plot
GenomicBreedingCore.plot
GenomicBreedingCore.plot
GenomicBreedingCore.plot
GenomicBreedingCore.plot
GenomicBreedingCore.removemissnaninf
GenomicBreedingCore.removespatialeffects!
GenomicBreedingCore.simulateallelefreqs!
GenomicBreedingCore.simulatechromstruct
GenomicBreedingCore.simulatecovarianceautocorrelated
GenomicBreedingCore.simulatecovariancediagonal
GenomicBreedingCore.simulatecovariancekinship
GenomicBreedingCore.simulatecovariancerandom
GenomicBreedingCore.simulatecovariancespherical
GenomicBreedingCore.simulateeffects
GenomicBreedingCore.simulategenomes
GenomicBreedingCore.simulategenomiceffects
GenomicBreedingCore.simulateldblocks
GenomicBreedingCore.simulatemating
GenomicBreedingCore.simulateperpopμΣ
GenomicBreedingCore.simulatepopgroups
GenomicBreedingCore.simulateposandalleles
GenomicBreedingCore.simulatetrials
GenomicBreedingCore.slice
GenomicBreedingCore.slice
GenomicBreedingCore.slice
GenomicBreedingCore.sparsities
GenomicBreedingCore.summarise
GenomicBreedingCore.tabularise
GenomicBreedingCore.tabularise
GenomicBreedingCore.tabularise
GenomicBreedingCore.tabularise
GenomicBreedingCore.trialsmodelsfomulae!
GenomicBreedingCore.turingblr
GenomicBreedingCore.turingblrmcmc!
GenomicBreedingCore.@string2formula
GenomicBreedingCore.@stringevaluation
GenomicBreedingCore.AbstractGB
— TypeAbstractGB
The root abstract type for all core types in GenomicBreeding.jl package.
This type serves as the common ancestor for the type hierarchy in the package, enabling shared functionality and type-based dispatch across all derived types.
Extended help
All custom core types in GenomicBreeding.jl should subtype from AbstractGB
to ensure consistency in the type system and to enable generic implementations of common operations.
GenomicBreedingCore.BLR
— TypeBLR <: AbstractGB
Bayesian Linear Regression (BLR) model structure for phenotype analyses, genomic prediction and selection.
Fields
entries::Vector{String}
: Names or identifiers for the observationsXs::Dict{String,Matrix{Union{Bool,Float64}}}
: Design matrices for factors and numeric matrix for covariates, including interceptΣs::Dict{String,Union{Nothing,Matrix{Float64},UniformScaling{Float64}}}
: Variance-covariance matrices for random effectscoefficients::Dict{String, Vector{Float64}}
: Dictionary of estimated coefficients/effects grouped by componentcoefficient_names::Dict{String, Vector{String}}
: Dictionary of names for coefficients grouped by componenty::Vector{Float64}
: Response/dependent variable vectorŷ::Vector{Float64}
: Fitted/predicted valuesϵ::Vector{Float64}
: Residuals (y - ŷ)diagnostics::DataFrame
: Model diagnostics and convergence metrics
Constructor
BLR(; n::Int64, p::Int64, var_comp::Dict{String, Int64} = Dict("σ²" => 1))
Creates a BLR model with specified dimensions and variance components.
Arguments
n::Int64
: Number of observationsp::Int64
: Number of coefficients/effects (including intercept)var_comp::Dict{String, Int64}
: Dictionary specifying variance components and their dimensions
Details
The constructor initializes:
- Design matrices (Xs) with zeros, including an intercept component
- Variance-covariance matrices (Σs) as identity matrices scaled by 1.0
- Dictionaries for coefficients and their names, with "intercept" as default first component
- Zero vectors for y, fitted values (ŷ), and residuals (ϵ)
- Empty strings for entries and coefficient names (except "intercept")
- Empty DataFrame for model diagnostics
Requirements
- At least one variance component (σ²) must be specified for residual variance
- Total number of parameters (p) must equal sum of dimensions in variance components
- Number of coefficients (p) must be ≥ 1
- If p > 1 and only residual variance is specified, a dummy variance component is added with p-1 dimensions
Notes
The structure supports multiple random effects through the dictionary-based design, where each component (except intercept) can have its own design matrix (X), variance-covariance matrix (Σ), coefficients, and coefficient names.
GenomicBreedingCore.CV
— TypeCross-validation struct
Contains genomic prediction cross-validation details.
Fields
replication
: replication namefold
: fold namefit
: genomic prediction model fit on the training setvalidation_populations
: vector of validation populations corresponding to each validation entryvalidation_entries
: corresponding vector of entries in the validation population/svalidation_y_true
: corresponding vector of observed phenotypes in the validation population/svalidation_y_pred
: corresponding vector of predicted phenotypes in the validation population/smetrics
: dictionary of genomic prediction accuracy metrics on the validation population/s
Constructor
Uses the default constructor.
GenomicBreedingCore.Fit
— TypeGenomic prediction model fit
Contains genomic prediction model fit details.
Fields
model
: name of the genomic prediction model usedb_hat_labels
: names of the loci-alleles usedb_hat
: effects of the loci-allelestrait
: name of the traitentries
: names of the entries used in the current cross-validation replication and foldpopulations
: names of the populations used in the current cross-validation replication and foldy_true
: corresponding observed phenotype valuesy_pred
: corresponding predicted phenotype valuesmetrics
: dictionary of genomic prediction accuracy metrics, inluding Pearson's correlation, mean absolute error and root mean-squared errorlux_model
: Nothing or a trained Lux neural network model
Constructor
Fit(; n=1, l=10)
where:
- n::Int64: Number of entries
- l::Int64: Number of loci-alleles
GenomicBreedingCore.GRM
— TypeGenomic relationship matrix struct
Contains genomic relationship matrix as well as the corresponding entries and loci-alleles used to compute it.
Fields
entries
: names of then
entries corresponding to the rows and columns of the genomic relationship matrixloci_alleles
: names of the loci-alleles used to compute the genomic relationship matrixgenomic_relationship_matrix
:n x n
matrix of genomic relationship values between entries
Constructor
Uses the default constructor.
GenomicBreedingCore.Genomes
— TypeGenomes struct
Containes unique entries and loci_alleles where allele frequencies can have missing values
Fields
entries
: names of then
entries or samplespopulations
: name/s of the population/s each entries or samples belong toloci_alleles
: names of thep
loci-alleles combinations (p
=l
loci xa-1
alleles) including the chromsome or scaffold name, position, all alleles and current allele separated by tabs ("\t")allele_frequencies
:n x p
matrix of allele frequencies between 0 and 1 which can have missing valuesmask
:n x p
matrix of boolean mask for selective analyses and slicing
Constructor
Genomes(; n::Int64 = 1, p::Int64 = 2)
where:
n::Int64=1
: Number of entries in the genomic datasetp::Int64=2
: Number of locus-allele combinations in the genomic dataset
Examples
julia> genomes = Genomes(n=2, p=2)
Genomes(["", ""], ["", ""], ["", ""], Union{Missing, Float64}[missing missing; missing missing], Bool[1 1; 1 1])
julia> fieldnames(Genomes)
(:entries, :populations, :loci_alleles, :allele_frequencies, :mask)
julia> genomes.entries = ["entry_1", "entry_2"];
julia> genomes.populations = ["pop_1", "pop_1"];
julia> genomes.loci_alleles = ["chr1\t12345\tA|T\tA", "chr2\t678910\tC|D\tD"];
julia> genomes.allele_frequencies = [0.5 0.25; 0.9 missing];
julia> genomes.mask = [true true; true false];
julia> genomes
Genomes(["entry_1", "entry_2"], ["pop_1", "pop_1"], ["chr1\t12345\tA|T\tA", "chr2\t678910\tC|D\tD"], Union{Missing, Float64}[0.5 0.25; 0.9 missing], Bool[1 1; 1 0])
GenomicBreedingCore.Phenomes
— TypePhenomes struct
Constains unique entries and traits where phenotype data can have missing values
Fields
entries
: names of then
entries or samplespopulations
: name/s of the population/s each entries or samples belong totraits
: names of thet
traitsphenotypes
:n x t
matrix of numeric (R
) phenotype data which can have missing valuesmask
:n x t
matrix of boolean mask for selective analyses and slicing
Constructor
Phenomes(; n::Int64 = 1, t::Int64 = 2)
where:
n::Int64=1
: Number of entries in the phenomic datasett::Int64=2
: Number of traits in the phenomic dataset
Examples
julia> phenomes = Phenomes(n=2, t=2)
Phenomes(["", ""], ["", ""], ["", ""], Union{Missing, Float64}[missing missing; missing missing], Bool[1 1; 1 1])
julia> fieldnames(Phenomes)
(:entries, :populations, :traits, :phenotypes, :mask)
julia> phenomes.entries = ["entry_1", "entry_2"];
julia> phenomes.populations = ["pop_A", "pop_B"];
julia> phenomes.traits = ["height", "yield"];
julia> phenomes.phenotypes = [200.0 2.5; 150.0 missing];
julia> phenomes.mask = [true true; true false];
julia> phenomes
Phenomes(["entry_1", "entry_2"], ["pop_A", "pop_B"], ["height", "yield"], Union{Missing, Float64}[200.0 2.5; 150.0 missing], Bool[1 1; 1 0])
GenomicBreedingCore.SimulatedEffects
— TypeSimulatedEffects struct
Contains the various simulated genetic, environmental and GxE effects.
Fields
id::Vector{String}
: Vector of identifiersyear::Float64
: Year effectseason::Float64
: Season effectsite::Float64
: Site effectseasons_x_year::Float64
: Interaction effect between seasons and yearsharvests_x_season_x_year::Float64
: Interaction effect between harvests, seasons and yearssites_x_harvest_x_season_x_year::Float64
: Interaction effect between sites, harvests, seasons and yearsfield_layout::Matrix{Int64}
: 2D matrix representing field layoutreplications_x_site_x_harvest_x_season_x_year::Vector{Float64}
: Replication interaction effectsblocks_x_site_x_harvest_x_season_x_year::Vector{Float64}
: Block interaction effectsrows_x_site_x_harvest_x_season_x_year::Vector{Float64}
: Row interaction effectscols_x_site_x_harvest_x_season_x_year::Vector{Float64}
: Column interaction effectsadditive_genetic::Vector{Float64}
: Additive genetic effectsdominance_genetic::Vector{Float64}
: Dominance genetic effectsepistasis_genetic::Vector{Float64}
: Epistasis genetic effectsadditive_allele_x_site_x_harvest_x_season_x_year::Vector{Float64}
: Additive allele interaction effectsdominance_allele_x_site_x_harvest_x_season_x_year::Vector{Float64}
: Dominance allele interaction effectsepistasis_allele_x_site_x_harvest_x_season_x_year::Vector{Float64}
: Epistasis allele interaction effects
Constructor
SimulatedEffects()
Creates a new SimulatedEffects instance with default values:
- Empty strings for IDs (vector of size 6)
- 0.0 for all float values
- 4x4 zero matrix for field_layout
- Single-element zero vectors for all vector fields
GenomicBreedingCore.TEBV
— TypeTrial-estimated breeding values (TEBV) struct
Contains trial-estimated breeding values as generated by analyse(trials::Trials, ...)
.
Fields
traits
: names of the traitst
traitsformulae
: best-fitting formula for each traitmodels
: best-fitting linear mixed model for each trait which may be MixedModes.jl model or a tuple of coefficient names, the coefficients and the design matrix from Bayesial linear regressiondf_BLUEs
: vector of data frames of best linear unbiased estimators or fixed effects table of each best fitting modeldf_BLUPs
: vector of data frames of best linear unbiased predictors or random effects table of each best fitting modelphenomes
: vector of Phenomes structs each containing the breeding values
Examples
julia> tebv = TEBV(traits=[""], formulae=[""], models=[MixedModel(@formula(y~1+(1|x)), DataFrame(y=1, x=1))], df_BLUEs=[DataFrame(x=1)], df_BLUPs=[DataFrame(x=1)], phenomes=[Phenomes(n=1, t=1)]);
julia> tebv.traits
1-element Vector{String}:
""
GenomicBreedingCore.Trials
— TypeTrials struct
Contains phenotype data across years, seasons, harvest, sites, populations, replications, blocks, rows, and columns
Fields
phenotypes
:n x t
matrix of numeric phenotype data which can have missing valuestraits
: names of the traitst
traitsyears
: names of the years corresponding to each row in the phenotype matrixseasons
: names of the seasons corresponding to each row in the phenotype matrixharvests
: names of the harvests corresponding to each row in the phenotype matrixsites
: names of the sites corresponding to each row in the phenotype matrixreplications
: names of the replications corresponding to each row in the phenotype matrixblocks
: names of the blocks corresponding to each row in the phenotype matrixrows
: names of the rows corresponding to each row in the phenotype matrixcols
: names of the cols corresponding to each row in the phenotype matrixentries
: names of the entries corresponding to each row in the phenotype matrixpopulations
: names of the populations corresponding to each row in the phenotype matrix
Constructor
Trials(; n::Int64 = 2, p::Int64 = 2)
where:
n::Int64=1
: Number of entries in the trialst::Int64=2
: Number of traits in the trials
Examples
julia> trials = Trials(n=1, t=2)
Trials(Union{Missing, Float64}[missing missing], ["", ""], [""], [""], [""], [""], [""], [""], [""], [""], [""], [""])
julia> fieldnames(Trials)
(:phenotypes, :traits, :years, :seasons, :harvests, :sites, :replications, :blocks, :rows, :cols, :entries, :populations)
Base.:==
— MethodBase.:(==)(x::BLR, y::BLR)::Bool
Compare two BLR
structs for equality based on their hash values.
This method defines equality comparison for BLR
structs by comparing their hash values. Two BLR
structs are considered equal if they have identical hash values, which means they have the same values for all their fields, except the computationally expensive Xs, and Σs fields.
Arguments
x::BLR
: First BLR struct to comparey::BLR
: Second BLR struct to compare
Returns
Bool
:true
if the BLR structs are equal,false
otherwise
Examples
julia> blr_1 = BLR(n=1, p=4);
julia> blr_2 = BLR(n=1, p=4);
julia> blr_3 = BLR(n=1, p=1);
julia> blr_1 == blr_2
true
julia> blr_1 == blr_3
false
Base.:==
— MethodBase.:(==)(x::CV, y::CV)::Bool
Compare two CV (Cross-Validation) structs for equality.
This method overloads the equality operator (==
) for CV structs by comparing their hash values. Two CV structs are considered equal if they have identical values for all fields.
Arguments
x::CV
: First CV struct to comparey::CV
: Second CV struct to compare
Returns
Bool
:true
if the CV structs are equal,false
otherwise
Examples
julia> fit = Fit(n=1, l=2);
julia> cv_1 = CV("replication_1", "fold_1", fit, ["population_1"], ["entry_1"], [0.0], [0.0], fit.metrics);
julia> cv_2 = clone(cv_1);
julia> cv_3 = clone(cv_1); cv_3.replication = "other_replication";
julia> cv_1 == cv_2
true
julia> cv_1 == cv_3
false
Base.:==
— MethodBase.:(==)(x::Fit, y::Fit)::Bool
Compare two Fit
structs for equality based on their hash values.
This method defines equality comparison for Fit
structs by comparing their hash values. Two Fit
structs are considered equal if they have identical hash values, which means they have the same values for all their fields.
Arguments
x::Fit
: First Fit struct to comparey::Fit
: Second Fit struct to compare
Returns
Bool
:true
if the Fit structs are equal,false
otherwise
Examples
julia> fit_1 = Fit(n=1, l=4);
julia> fit_2 = Fit(n=1, l=4);
julia> fit_3 = Fit(n=1, l=2);
julia> fit_1 == fit_2
true
julia> fit_1 == fit_3
false
Base.:==
— MethodBase.:(==)(x::GRM, y::GRM)::Bool
Compare two GRM structs for equality.
Overloads the equality operator (==
) for GRM structs by comparing their hash values. Two GRM structs are considered equal if they have identical values for all their fields.
Arguments
x::GRM
: First GRM struct to comparey::GRM
: Second GRM struct to compare
Returns
Bool
:true
if the GRM structs are equal,false
otherwise
Examples
julia> grm_1 = GRM(string.(["entries_1", "entries_2"]), string.(["chr1 123 A|T A", "chr1 456 C|G G"]), Float64.(rand(2,2)));
julia> grm_2 = clone(grm_1);
julia> grm_3 = GRM(string.(["entries_1", "entries_2"]), string.(["chr1 123 A|T A", "chr1 456 C|G G"]), Float64.(rand(2,2)));
julia> grm_1 == grm_2
true
julia> grm_1 == grm_3
false
Base.:==
— Method==(x::Genomes, y::Genomes)::Bool
Compare two Genomes
structs for equality by comparing their hash values.
This method implements equality comparison for Genomes
structs by utilizing their hash values, ensuring that two genomes are considered equal if and only if they have identical structural properties and content.
Arguments
x::Genomes
: First Genomes struct to comparey::Genomes
: Second Genomes struct to compare
Returns
Bool
:true
if the genomes are equal,false
otherwise
Examples
julia> genomes_1 = genomes = Genomes(n=2,p=4);
julia> genomes_2 = genomes = Genomes(n=2,p=4);
julia> genomes_3 = genomes = Genomes(n=1,p=2);
julia> genomes_1 == genomes_2
true
julia> genomes_1 == genomes_3
false
Base.:==
— Method==(x::Phenomes, y::Phenomes)::Bool
Compare two Phenomes
structs for equality using their hash values.
This method implements equality comparison for Phenomes
objects by comparing their hash values, ensuring that two phenomes with identical structure and content are considered equal.
Arguments
x::Phenomes
: First phenomes object to comparey::Phenomes
: Second phenomes object to compare
Returns
Bool
:true
if the phenomes are equal,false
otherwise
Examples
julia> phenomes_1 = phenomes = Phenomes(n=2, t=4);
julia> phenomes_2 = phenomes = Phenomes(n=2, t=4);
julia> phenomes_3 = phenomes = Phenomes(n=1, t=2);
julia> phenomes_1 == phenomes_2
true
julia> phenomes_1 == phenomes_3
false
Base.:==
— MethodBase.:(==)(x::SimulatedEffects, y::SimulatedEffects)::Bool
Defines equality comparison for SimulatedEffects structs by comparing their hash values.
This method overloads the == operator for SimulatedEffects type and determines if two SimulatedEffects instances are equal by comparing their hash values rather than doing a field-by-field comparison.
Arguments
x::SimulatedEffects
: First SimulatedEffects instance to comparey::SimulatedEffects
: Second SimulatedEffects instance to compare
Returns
Bool
: true if the hash values of both instances are equal, false otherwise
Examples
julia> effects_1 = SimulatedEffects();
julia> effects_2 = SimulatedEffects();
julia> effects_3 = SimulatedEffects(); effects_3.id[1] = "SOMETHING_ELSE";
julia> effects_1 == effects_2
true
julia> effects_1 == effects_3
false
Base.:==
— Method==(x::TEBV, y::TEBV)::Bool
Compare two TEBV (Trial-Estimated Breeding Values) objects for equality.
This method implements equality comparison for TEBV structs by comparing their hash values. Two TEBV objects are considered equal if they have identical values for all their fields: traits, formulae, models, dfBLUEs, dfBLUPs, and phenomes.
Arguments
x::TEBV
: First TEBV object to comparey::TEBV
: Second TEBV object to compare
Returns
Bool
:true
if the TEBV objects are equal,false
otherwise
Examples
julia> tebv_1 = TEBV(traits=[""], formulae=[""], models=[MixedModel(@formula(y~1+(1|x)), DataFrame(y=1, x=1))], df_BLUEs=[DataFrame(x=1)], df_BLUPs=[DataFrame(x=1)], phenomes=[Phenomes(n=1,t=1)]);
julia> tebv_2 = clone(tebv_1);
julia> tebv_3 = TEBV(traits=["SOMETHING_ELSE"], formulae=[""], models=[MixedModel(@formula(y~1+(1|x)), DataFrame(y=1, x=1))], df_BLUEs=[DataFrame(x=1)], df_BLUPs=[DataFrame(x=1)], phenomes=[Phenomes(n=1,t=1)]);
julia> tebv_1 == tebv_2
true
julia> tebv_1 == tebv_3
false
Base.:==
— Method==(x::Trials, y::Trials)::Bool
Compare two Trials
structs for equality by comparing their hash values.
Two Trials
structs are considered equal if they have identical hash values, which implies they have the same configuration parameters (number of trials n
and time steps t
).
Arguments
x::Trials
: First Trials struct to comparey::Trials
: Second Trials struct to compare
Returns
Bool
:true
if the Trials structs are equal,false
otherwise
Examples
julia> trials_1 = trials = Trials(n=2, t=4);
julia> trials_2 = trials = Trials(n=2, t=4);
julia> trials_3 = trials = Trials(n=1, t=2);
julia> trials_1 == trials_2
true
julia> trials_1 == trials_3
false
Base.filter
— Methodfilter(
genomes::Genomes,
maf::Float64;
max_entry_sparsity::Float64 = 0.0,
max_locus_sparsity::Float64 = 0.0,
max_prop_pc_varexp::Float64 = 0.90,
max_entry_sparsity_percentile::Float64 = 0.90,
max_locus_sparsity_percentile::Float64 = 0.50,
chr_pos_allele_ids::Union{Nothing,Vector{String}} = nothing,
match_alleles::Bool = true,
verbose::Bool = false,
)::Tuple{Genomes, Dict{String,Vector{String}}}
Filter genomic data based on multiple criteria including sparsity, minor allele frequency (MAF), principal component analysis (PCA), and a list of loci-allele combinations.
Arguments
genomes::Genomes
: AGenomes
struct containing the genomic data.maf::Float64
: The minimum allele frequency threshold.max_entry_sparsity::Float64
: The maximum allowable sparsity for entries. Default is 0.0.max_locus_sparsity::Float64
: The maximum allowable sparsity for loci. Default is 0.0.max_prop_pc_varexp::Float64
: The maximum proportion of variance explained by the first two principal components (PC1 and PC2). Default is 0.90.max_entry_sparsity_percentile::Float64
: The percentile threshold for entry sparsity. Default is 0.90.max_locus_sparsity_percentile::Float64
: The percentile threshold for locus sparsity. Default is 0.50.chr_pos_allele_ids::Union{Nothing, Vector{String}}
: A vector of loci-allele combination names in the format "chromosome\tposition\tallele". Ifnothing
, no filtering is applied. Default isnothing
.match_alleles::Bool
: Iftrue
, matches alleles exactly when filtering by SNP list. Iffalse
, matches only chromosome and position. Default istrue
.verbose::Bool
: Iftrue
, prints detailed progress information during the filtering process. Default isfalse
.
Returns
Tuple{Genomes, Dict{String, Vector{String}}}
: A tuple containing:- A
Genomes
struct with filtered genomic data. - A dictionary of omitted loci-allele names categorized by the filtering criteria.
- A
Details
This function filters genomic data based on multiple criteria including sparsity, minor allele frequency (MAF), principal component analysis (PCA), and a list of loci-allele combinations. The function performs the following steps:
- Input Validation: Ensures that the
Genomes
struct is not corrupted and that the filtering thresholds are within valid ranges. - Filter by Sparsity: Removes the sparsest entries and loci-alleles until the maximum allowable sparsity thresholds are met.
- Filter by MAF: Removes loci-alleles with minor allele frequencies below the specified threshold.
- Filter by PCA: Removes outlier loci-alleles based on the proportion of variance explained by the first two principal components.
- Filter by SNP List: Retains only the specified loci-allele combinations, with optional exact allele matching.
- Verbose Output: If
verbose
istrue
, prints detailed progress information during each filtering step.
Notes
- The function uses multi-threading to improve performance on large datasets.
- The
match_alleles
parameter provides flexibility in SNP list filtering. - The function ensures that the filtered genomic data retains a minimum number of entries and loci-alleles.
Throws
ArgumentError
: If theGenomes
struct is corrupted or any of the filtering thresholds are out of range.ErrorException
: If no loci-alleles are retained after filtering.
Examples
julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=4, sparsity=0.01, verbose=false);
julia> filtered_genomes_1, omitted_loci_alleles_1 = filter(genomes, 0.1);
julia> filtered_genomes_2, omitted_loci_alleles_2 = filter(genomes, 0.1, chr_pos_allele_ids=genomes.loci_alleles[1:1000]);
julia> size(genomes.allele_frequencies)
(100, 3000)
julia> prod(size(filtered_genomes_1.allele_frequencies)) < prod(size(genomes.allele_frequencies))
true
julia> prod(size(filtered_genomes_2.allele_frequencies)) < prod(size(filtered_genomes_1.allele_frequencies))
true
Base.filter
— Methodfilter(genomes::Genomes; verbose::Bool = false)::Genomes
Filter a Genomes struct by removing entries and loci with missing data based on the mask matrix.
Description
This function filters a Genomes struct by:
- Removing rows (entries) where any column has a false value in the mask matrix
- Removing columns (loci) where any row has a false value in the mask matrix
Arguments
genomes::Genomes
: Input Genomes struct containing genetic data and a mask matrixverbose::Bool
: Optional flag to control verbose output (default: false)
Returns
Genomes
: A new filtered Genomes struct with complete data (no missing values)
Examples
julia> genomes = simulategenomes(verbose=false); genomes.mask[1:10, 42:100] .= false;
julia> filtered_genomes = filter(genomes);
julia> size(filtered_genomes.allele_frequencies)
(90, 9941)
Base.filter
— Methodfilter(phenomes::Phenomes)::Phenomes
Filter a Phenomes struct by removing rows (entries) and columns (traits) as indicated by the mask matrix. An entry or trait is removed if it contains at least one false value in the mask.
Arguments
phenomes::Phenomes
: The Phenomes struct to be filtered, containing entries, populations, traits, phenotypes, and a boolean mask matrix.
Returns
Phenomes
: A new Phenomes struct with filtered entries and traits, where the mask matrix is all true.
Details
The function uses the mean of rows and columns in the mask matrix to identify which entries and traits should be kept. Only entries and traits with a mean of 1.0 (all true values) are retained in the filtered result.
Examples
julia> phenomes = Phenomes(n=10, t=3); phenomes.entries = string.("entry_", 1:10); phenomes.populations .= "pop_1"; phenomes.traits = ["A", "B", "C"]; phenomes.phenotypes = fill(0.0, 10,3);
julia> phenomes.mask .= true; phenomes.mask[6:10, 1] .= false;
julia> filtered_phenomes = filter(phenomes);
julia> size(filtered_phenomes.phenotypes)
(5, 2)
Base.hash
— MethodBase.hash(x::BLR, h::UInt)::UInt
Calculate a hash value for a BLR
struct.
This method implements hashing for the BLR
type by combining the hashes of its fields, excluding the "Xs" and "Σs" fields. Each field's hash is combined sequentially with the input hash value.
Arguments
x::BLR
: The BLR struct to be hashedh::UInt
: The initial hash value to be mixed with
Returns
UInt
: The computed hash value
Notes
- The fields "Xs" and "Σs" are explicitly excluded from the hash computation
- Fields are processed in the order defined in the struct
Example
julia> blr = BLR(n=1, p=1);
julia> typeof(hash(blr))
UInt64
Base.hash
— MethodBase.hash(x::CV, h::UInt)::UInt
Compute a hash value for a CV (Cross-Validation) struct.
This method defines how CV structs should be hashed, which is useful for using CV objects in hash-based collections like Sets or as Dict keys.
Arguments
x::CV
: The CV struct to be hashedh::UInt
: The hash value to be mixed with the new hash
Returns
UInt
: A hash value for the CV struct
Implementation Details
The hash is computed by combining the following fields:
- replication
- fold
- fit
- validation_populations
- validation_entries
- validationytrue
- validationypred
- metrics
Example
julia> fit = Fit(n=1, l=2);
julia> cv = CV("replication_1", "fold_1", fit, ["population_1"], ["entry_1"], [0.0], [0.0], fit.metrics);
julia> typeof(hash(cv))
UInt64
Base.hash
— MethodBase.hash(x::Fit, h::UInt)::UInt
Calculate a hash value for a Fit
struct.
This method implements hashing for the Fit
type by combining the hashes of its core components in a specific order. The hash is computed using the following fields:
- model
- b_hat (estimated effects)
- trait
- entries
- populations
- metrics
- y_true (observed values)
- y_pred (predicted values)
Arguments
x::Fit
: The Fit struct to be hashedh::UInt
: The hash value to be mixed with
Returns
UInt
: The computed hash value
Example
julia> fit = Fit(n=1, l=2);
julia> typeof(hash(fit))
UInt64
Base.hash
— MethodBase.hash(x::GRM, h::UInt)::UInt
Compute a hash value for a GRM (Genomic Relationship Matrix) struct.
This method defines how GRM structs should be hashed, making them usable in hash-based collections like Sets or as Dict keys. The hash is computed by iteratively combining the hash values of all fields in the struct.
Arguments
x::GRM
: The GRM struct to be hashedh::UInt
: The initial hash value to be combined with the struct's hash
Returns
UInt
: A combined hash value for the entire GRM struct
Example
julia> grm = GRM(string.(["entries_1", "entries_2"]), string.(["chr1 123 A|T A", "chr1 456 C|G G"]), Float64.(rand(2,2)));
julia> typeof(hash(grm))
UInt64
Base.hash
— MethodBase.hash(x::Genomes, h::UInt)::UInt
Compute a hash value for a Genomes
struct.
This hash function considers three key components of the Genomes
struct:
- entries
- populations
- loci_alleles
For performance reasons, allele_frequencies
and mask
fields are deliberately excluded from the hash computation.
Arguments
x::Genomes
: The Genomes struct to hashh::UInt
: The hash seed value
Returns
UInt
: A hash value for the Genomes struct
Examples
julia> genomes = Genomes(n=2, p=2);
julia> typeof(hash(genomes))
UInt64
Base.hash
— MethodBase.hash(x::Phenomes, h::UInt)::UInt
Compute a hash value for a Phenomes
struct by recursively hashing its internal fields.
Arguments
x::Phenomes
: The Phenomes struct to be hashedh::UInt
: The hash value to be mixed with
Returns
UInt
: A hash value for the entire Phenomes struct
Note
This function is used for dictionary operations and computing hash-based data structures. The hash is computed by combining hashes of all internal fields: entries, populations, traits, phenotypes, and mask.
Examples
julia> phenomes = Phenomes(n=2, t=2);
julia> typeof(hash(phenomes))
UInt64
Base.hash
— Methodhash(x::SimulatedEffects, h::UInt)::UInt
Compute a hash value for a SimulatedEffects
object.
This method implements custom hashing for SimulatedEffects
by iterating through all fields of the object and combining their hash values with the provided seed hash h
.
Arguments
x::SimulatedEffects
: The object to be hashedh::UInt
: The hash seed value
Returns
UInt
: The computed hash value
Examples
julia> effects = SimulatedEffects();
julia> typeof(hash(effects))
UInt64
Base.hash
— MethodBase.hash(x::TEBV, h::UInt)::UInt
Calculate a hash value for a TEBV (Trial-Estimated Breeding Value) struct.
This method implements hashing for TEBV objects by combining the hash values of selected fields:
- traits: Vector of trait names
- formulae: Vector of formula strings
- phenomes: Vector of Phenomes objects
Note: For performance reasons, the following fields are deliberately excluded from the hash calculation:
- models
- df_BLUEs
- df_BLUPs
Arguments
x::TEBV
: The TEBV struct to be hashedh::UInt
: The hash value to be mixed with the object's hash
Returns
UInt
: A unique hash value for the TEBV object
Examples
julia> tebv = TEBV(traits=[""], formulae=[""], models=[MixedModel(@formula(y~1+(1|x)), DataFrame(y=1, x=1))], df_BLUEs=[DataFrame(x=1)], df_BLUPs=[DataFrame(x=1)], phenomes=[Phenomes(n=1,t=1)]);
julia> typeof(hash(tebv))
UInt64
Base.hash
— MethodBase.hash(x::Trials, h::UInt)::UInt
Compute a hash value for a Trials
struct by recursively hashing all of its fields.
This method implements hash functionality for the Trials
type, allowing Trials
objects to be used as dictionary keys or in hash-based collections.
Arguments
x::Trials
: The Trials struct to be hashedh::UInt
: The hash value to be mixed with the object's hash
Returns
UInt
: A hash value for the entire Trials struct
Examples
julia> trials = Trials(n=2, t=2);
julia> typeof(hash(trials))
UInt64
Base.merge
— Methodmerge(
genomes::Genomes,
other::Genomes;
conflict_resolution::Tuple{Float64,Float64} = (0.5, 0.5),
verbose::Bool = true
)::Genomes
Merge two Genomes structs by combining their entries and loci_alleles while resolving conflicts in allele frequencies.
Arguments
genomes::Genomes
: First Genomes struct to mergeother::Genomes
: Second Genomes struct to mergeconflict_resolution::Tuple{Float64,Float64}
: Weights for resolving conflicts between allele frequencies (must sum to 1.0)verbose::Bool
: If true, displays a progress bar during merging
Returns
Genomes
: A new Genomes struct containing the merged data
Details
The function performs the following operations:
- If the loci_alleles are identical and there are no overlapping entries, performs a quick merge:
- Concatenates entries, populations, allele frequencies, and mask without conflict resolution.
- Combines unique entries and loci_alleles from both input structs
- Resolves population conflicts by concatenating conflicting values
- For overlapping entries and loci:
- If allele frequencies match, uses the existing value
- If frequencies differ, applies weighted average using conflict_resolution
- For missing values, uses available non-missing value
- Resolves mask conflicts using weighted average
Examples
julia> n = 100; l = 5_000; n_alleles = 2;
julia> all = simulategenomes(n=n, l=l, n_alleles=n_alleles, verbose=false);
julia> genomes = slice(all, idx_entries=collect(1:Int(floor(n*0.75))), idx_loci_alleles=collect(1:Int(floor(l*(n_alleles-1)*0.75))));
julia> other = slice(all, idx_entries=collect(Int(floor(n*0.50)):n), idx_loci_alleles=collect(Int(floor(l*(n_alleles-1)*0.50)):l*(n_alleles-1)));
julia> merged_genomes = merge(genomes, other, conflict_resolution=(0.75, 0.25), verbose=false);
julia> size(merged_genomes.allele_frequencies)
(100, 5000)
julia> sum(ismissing.(merged_genomes.allele_frequencies))
123725
Base.merge
— Methodmerge(genomes::Genomes, phenomes::Phenomes; keep_all::Bool=true)::Tuple{Genomes,Phenomes}
Merge Genomes
and Phenomes
structs based on their entries, combining genomic and phenotypic data.
Arguments
genomes::Genomes
: A struct containing genomic data including entries, populations, and allele frequenciesphenomes::Phenomes
: A struct containing phenotypic data including entries, populations, and phenotypeskeep_all::Bool=true
: If true, performs a union of entries; if false, performs an intersection
Returns
Tuple{Genomes,Phenomes}
: A tuple containing:- A new
Genomes
struct with merged entries and corresponding genomic data - A new
Phenomes
struct with merged entries and corresponding phenotypic data
- A new
Details
- Maintains dimensional consistency between input and output structs
- Handles population conflicts by creating a combined population name
- Preserves allele frequencies and phenotypic data for matched entries
- When
keep_all=true
, includes all entries from both structs - When
keep_all=false
, includes only entries present in both structs
Examples
julia> genomes = simulategenomes(n=10, verbose=false);
julia> trials, effects = simulatetrials(genomes=slice(genomes, idx_entries=collect(1:5), idx_loci_alleles=collect(1:length(genomes.loci_alleles))), f_add_dom_epi=[0.90 0.05 0.05;], n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=2, verbose=false);
julia> phenomes = Phenomes(n=5, t=1);
julia> phenomes.entries = trials.entries[1:5];
julia> phenomes.populations = trials.populations[1:5];
julia> phenomes.traits = trials.traits;
julia> phenomes.phenotypes = trials.phenotypes[1:5, :];
julia> phenomes.mask .= true;
julia> genomes_merged_1, phenomes_merged_1 = merge(genomes, phenomes, keep_all=true);
julia> size(genomes_merged_1.allele_frequencies), size(phenomes_merged_1.phenotypes)
((10, 10000), (10, 1))
julia> genomes_merged_2, phenomes_merged_2 = merge(genomes, phenomes, keep_all=false);
julia> size(genomes_merged_2.allele_frequencies), size(phenomes_merged_2.phenotypes)
((5, 10000), (5, 1))
Base.merge
— Methodmerge(
phenomes::Phenomes,
other::Phenomes;
conflict_resolution::Tuple{Float64,Float64} = (0.5, 0.5),
verbose::Bool = true
)::Phenomes
Merge two Phenomes
structs into a single combined struct, handling overlapping entries and traits.
Arguments
phenomes::Phenomes
: The first Phenomes struct to mergeother::Phenomes
: The second Phenomes struct to mergeconflict_resolution::Tuple{Float64,Float64}
: Weights for resolving conflicts between overlapping values (must sum to 1.0)verbose::Bool
: Whether to display a progress bar during merging
Returns
Phenomes
: A new merged Phenomes struct containing all entries and traits from both input structs
Details
The merge operation combines:
- All unique entries from both structs
- All unique traits from both structs
- Phenotype values and masks, using weighted averaging for conflicts
- Population information, marking conflicts with a "CONFLICT" prefix
For overlapping entries and traits:
- Identical values are preserved as-is
- Different values are combined using the weights specified in
conflict_resolution
- Missing values are handled by using the available non-missing value
- Population conflicts are marked in the format "CONFLICT (pop1, pop2)"
Throws
ArgumentError
: If either Phenomes struct is corrupted (invalid dimensions)ArgumentError
: If conflict_resolution weights don't sum to 1.0 or aren't a 2-tupleErrorException
: If the merging operation produces an invalid result
Examples
julia> all = Phenomes(n=10, t=3); all.entries = string.("entry_", 1:10); all.traits = ["A", "B", "C"]; all.phenotypes = rand(10,3);
julia> phenomes = slice(all, idx_entries=collect(1:7), idx_traits=[1,2]);
julia> other = slice(all, idx_entries=collect(5:10), idx_traits=[2,3]);
julia> merged_phenomes = merge(phenomes, other, conflict_resolution=(0.75, 0.25), verbose=false);
julia> size(merged_phenomes.phenotypes)
(10, 3)
julia> sum(ismissing.(merged_phenomes.phenotypes))
7
Base.sum
— Methodsum(effects::SimulatedEffects)::Vector{Float64}
Sum up all simulated effects to generate the simulated phenotype values. The function iterates through all fields of the SimulatedEffects struct (except :id and :field_layout) and adds their values element-wise to produce a vector of phenotypic values.
Arguments
effects::SimulatedEffects
: A struct containing various genetic and environmental effects
Returns
Vector{Float64}
: A vector containing the summed effects (phenotypic values)
Examples
julia> effects = SimulatedEffects();
julia> sum(effects)
1-element Vector{Float64}:
0.0
julia> effects.additive_genetic[1] = pi;
julia> sum(effects)
1-element Vector{Float64}:
3.141592653589793
GenomicBreedingCore.addcompositetrait
— Methodaddcompositetrait(phenomes::Phenomes; composite_trait_name::String, formula_string::String)::Phenomes
Create a new composite trait by combining existing traits using mathematical operations.
Arguments
phenomes::Phenomes
: A Phenomes struct containing the original trait datacomposite_trait_name::String
: Name for the new composite traitformula_string::String
: Mathematical formula describing how to combine existing traits. Supports traits as variables and the following operations:- Basic arithmetic: +, -, *, /, ^, %
- Functions: abs(), sqrt(), log(), log2(), log10()
- Parentheses for operation precedence
Returns
Phenomes
: A new Phenomes struct with the composite trait added as the last column
Examples
julia> phenomes = Phenomes(n=10, t=3); phenomes.entries = string.("entry_", 1:10); phenomes.populations .= "pop_1"; phenomes.traits = ["A", "B", "C"]; phenomes.phenotypes = rand(10,3);
julia> phenomes_new = addcompositetrait(phenomes, composite_trait_name = "some_wild_composite_trait", formula_string = "A");
julia> phenomes_new.phenotypes[:, end] == phenomes.phenotypes[:, 1]
true
julia> phenomes_new = addcompositetrait(phenomes, composite_trait_name = "some_wild_composite_trait", formula_string = "(A^B) + (C/A) - sqrt(abs(B-A)) + log(1.00 + C)");
julia> phenomes_new.phenotypes[:, end] == (phenomes.phenotypes[:,1].^phenomes.phenotypes[:,2]) .+ (phenomes.phenotypes[:,3]./phenomes.phenotypes[:,1]) .- sqrt.(abs.(phenomes.phenotypes[:,2].-phenomes.phenotypes[:,1])) .+ log.(1.00 .+ phenomes.phenotypes[:,3])
true
GenomicBreedingCore.addcompositetrait
— Methodaddcompositetrait(trials::Trials; composite_trait_name::String, formula_string::String)::Trials
Create a new composite trait by combining existing traits using a mathematical formula.
Arguments
trials::Trials
: A Trials struct containing phenotypic datacomposite_trait_name::String
: Name for the new composite traitformula_string::String
: Mathematical formula defining how to combine existing traits
Formula Syntax
The formula can include:
- Trait names (e.g., "trait1", "trait2")
- Mathematical operators: +, -, *, /, ^, %
- Functions: abs(), sqrt(), log(), log2(), log10()
- Parentheses for grouping operations
Returns
Trials
: A new Trials struct with the added composite trait
Examples
julia> trials, _ = simulatetrials(genomes = simulategenomes(verbose=false), verbose=false);
julia> trials_new = addcompositetrait(trials, composite_trait_name = "some_wild_composite_trait", formula_string = "trait_1");
julia> trials_new.phenotypes[:, end] == trials.phenotypes[:, 1]
true
julia> trials_new = addcompositetrait(trials, composite_trait_name = "some_wild_composite_trait", formula_string = "(trait_1^(trait_2/100)) + (trait_3/trait_1) - sqrt(abs(trait_2-trait_1)) + log(1.00 + trait_3)");
julia> trials_new.phenotypes[:, end] == (trials.phenotypes[:,1].^(trials.phenotypes[:,2]/100)) .+ (trials.phenotypes[:,3]./trials.phenotypes[:,1]) .- sqrt.(abs.(trials.phenotypes[:,2].-trials.phenotypes[:,1])) .+ log.(1.00 .+ trials.phenotypes[:,3])
true
GenomicBreedingCore.aggregateharvests
— Methodaggregateharvests(
trials::Trials;
traits::Union{Vector{String}, Nothing} = nothing,
grouping::Vector{String} = ["years", "seasons", "sites", "replications", "blocks", "rows", "cols", "entries", "populations"],
f::Function=x -> sum(skipmissing(x))
)::Tuple{Trials, DataFrame}
Aggregate harvest data from a Trials
struct by specified grouping variables.
Arguments
trials::Trials
: Input trials struct containing harvest datatraits::Union{Vector{String}, Nothing}
: Vector of trait names to aggregate (default: all traits)grouping::Vector{String}
: Vector of column names to group by (default: ["years", "seasons", "sites", "replications", "blocks", "rows", "cols", "entries", "populations"])f::Function
: Aggregation function to apply (default: sum of non-missing values)
Returns
Tuple{Trials, DataFrame}
: A tuple containing:- Aggregated trials struct with combined harvest data
- DataFrame with harvest counts per year, season, and site
Examples
julia> trials, _ = simulatetrials(genomes = simulategenomes(n=5, l=1000, verbose=false), verbose=false);
julia> trials_agg_1, df_n_harvests_1 = aggregateharvests(trials);
julia> length(trials.entries) > length(trials_agg_1.entries)
true
julia> trials_agg_2, df_n_harvests_2 = aggregateharvests(trials, traits=["trait_1"]);
julia> length(trials.traits) > length(trials_agg_2.traits)
true
GenomicBreedingCore.analyse
— Functionanalyse(
trials::Trials,
formula_string::String = "";
traits::Union{Nothing,Vector{String}} = nothing,
max_levels::Int64 = 100,
max_time_per_model::Int64 = 60,
covariates_continuous::Union{Nothing,Vector{String}} = nothing,
verbose::Bool = true
)::TEBV
Analyze trial data using linear mixed models to estimate Best Linear Unbiased Estimates (BLUEs) and Best Linear Unbiased Predictions (BLUPs).
Arguments
trials
: A Trials struct containing the experimental dataformula_string
: Optional model formula string. If empty, automatic model selection is performedtraits
: Optional vector of trait names to analyze. If nothing, all traits are analyzedmax_levels
: Maximum number of levels for non-entry random effects (default: 100)max_time_per_model
: Maximum fitting time in seconds per model (default: 60)covariates_continuous
: Optional vector of continuous covariates to include in modelsverbose
: Whether to display analysis progress (default: true)
Returns
A TEBV
struct containing:
traits
: Vector of analyzed trait namesformulae
: Vector of best-fitting model formulaemodels
: Vector of fitted LinearMixedModel objectsdf_BLUEs
: Vector of DataFrames containing BLUEsdf_BLUPs
: Vector of DataFrames containing BLUPsphenomes
: Vector of Phenomes objects with predicted values
Details
The function implements a mixed model fitting strategy with the following principles:
- Avoids over-parameterization
- Uses unstructured variance-covariance matrix for random effects
- Prefers REML over ML estimation
- Compares BLUEs vs BLUPs of entries
- Handles both parallel and iterative model fitting based on model complexity
Notes
- Models are fitted using REML
- Simple models are fitted in parallel while complex models are fitted iteratively to avoid memory issues
- Returns empty results if no models can be successfully fitted
Examples
julia> trials, _simulated_effects = simulatetrials(genomes = simulategenomes(n=10, verbose=false), n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=10, verbose=false);
julia> tebv_1 = analyse(trials, "trait_1 ~ 1 + (1|entries)", max_levels=50, verbose=false);
julia> tebv_1.traits
3-element Vector{String}:
"trait_1"
"trait_2"
"trait_3"
julia> tebv_2 = analyse(trials, max_levels=50, verbose=false);
julia> mean(tebv_2.phenomes[1].phenotypes) < mean(tebv_2.phenomes[2].phenotypes)
true
julia> trials = addcompositetrait(trials, composite_trait_name = "covariate", formula_string = "(trait_1 + trait_2) / (trait_3 + 0.0001)");
julia> tebv_3 = Suppressor.@suppress analyse(trials, "y ~ 1 + covariate + entries + (1|blocks)", max_levels=50, verbose=false);
julia> mean(tebv_3.phenomes[1].phenotypes) < mean(tebv_3.phenomes[2].phenotypes)
true
julia> tebv_4 = Suppressor.@suppress analyse(trials, max_levels=50, covariates_continuous=["covariate"], verbose=false);
julia> mean(tebv_4.phenomes[1].phenotypes) < mean(tebv_4.phenomes[2].phenotypes)
true
GenomicBreedingCore.analyse
— Methodanalyse(
df::DataFrame;
formulae::Vector{String},
idx_parallel_models::Vector{Int64},
idx_iterative_models::Vector{Int64},
max_time_per_model::Int64 = 60,
verbose::Bool=false
)::Tuple{String, Any, DataFrame, DataFrame, Phenomes}
Fit univariate linear mixed models to extract entry effects from the best-fitting model.
Arguments
df::DataFrame
: Input data frame containing trial data with columns for entries, traits, and other experimental factorsformulae::Vector{String}
: Vector of model formulae strings to be testedidx_parallel_models::Vector{Int64}
: Indices of simpler models to be fitted in parallelidx_iterative_models::Vector{Int64}
: Indices of complex models to be fitted iterativelymax_time_per_model::Int64
: Maximum time in seconds allowed for fitting each model (default: 60)verbose::Bool
: Whether to display progress information (default: false)
Returns
A tuple containing:
- String: Formula of the best-fitting model
- Any: The fitted model object
- DataFrame: BLUEs (Best Linear Unbiased Estimates) results
- DataFrame: BLUPs (Best Linear Unbiased Predictions) results
- Phenomes: Struct containing consolidated phenotypic predictions
Details
The function implements a mixed model fitting strategy with the following principles:
- Avoids over-parameterization
- Uses unstructured variance-covariance matrix for random effects
- Prefers REML over ML estimation
- Compares BLUEs vs BLUPs of entries
- Handles both parallel and iterative model fitting based on model complexity
Notes
- All formulae must model the same trait
- Models are fitted using REML
- Simple models are fitted in parallel while complex models are fitted iteratively to avoid memory issues
- Returns empty results if no models can be successfully fitted
Examples
julia> trials, _ = simulatetrials(genomes = simulategenomes(verbose=false), verbose=false);
julia> df = tabularise(trials);
julia> formulae, n_levels = trialsmodelsfomulae!(df; trait = "trait_1", max_levels = 10);
julia> idx_parallel_models::Vector{Int64} = findall(n_levels .<= (15));
julia> idx_iterative_models::Vector{Int64} = findall((n_levels .<= (15)) .!= true);
julia> formula_string, model, df_BLUEs, df_BLUPs, phenomes = analyse(df, formulae=formulae, idx_parallel_models=idx_parallel_models, idx_iterative_models=idx_iterative_models);
julia> length(phenomes.entries) == length(unique(df.entries))
true
julia> df_2 = df[(df.years .== df.years[1]) .&& (df.harvests .== df.harvests[1]) .&& (df.seasons .== df.seasons[1]) .&& (df.sites .== df.sites[1]) .&& (df.replications .== df.replications[1]), :];
julia> formula_string_2, model_2, df_BLUEs_2, df_BLUPs_2, phenomes_2 = analyse(df_2, formulae=["trait_1 ~ 1 + 1|entries"]);
julia> cor(phenomes_2.phenotypes[sortperm(phenomes_2.entries),1], df_2.trait_1[sortperm(df_2.entries)]) > 0.99
true
GenomicBreedingCore.analyseviaBLR
— MethodanalyseviaBLR(
trials::Trials,
traits::Vector{String};
grm::Union{GRM,Nothing} = nothing,
other_covariates::Union{Vector{String},Nothing} = nothing,
empirical_Σs::Bool = true,
multiple_σs_threshold::Int64 = 500,
n_iter::Int64 = 10_000,
n_burnin::Int64 = 1_000,
seed::Int64 = 1234,
verbose::Bool = false,
)::Tuple{TEBV,Dict{String,DataFrame}}
Perform Bayesian linear mixed model analysis on trial data for genetic evaluation.
Arguments
trials::Trials
: A Trials struct containing the experimental datatraits::Vector{String}
: Vector of trait names to analyze. If empty, all traits in trials will be analyzedgrm::Union{GRM,Nothing}=nothing
: Optional genomic relationship matrixother_covariates::Union{Vector{String},Nothing}=nothing
: Additional covariates to include in the modelmultiple_σs_threshold::Int64=500
: Threshold for determining multiple variance componentsn_iter::Int64=10_000
: Number of MCMC iterationsn_burnin::Int64=1_000
: Number of burn-in iterationsseed::Int64=1234
: Random seed for reproducibilityverbose::Bool=false
: Whether to print progress information
Returns
A tuple containing:
TEBV
: Total estimated breeding values struct with model results. Note that:- only the variance-covariance components represent the p-1 factor levels; while,
- the rest have the full number of levels, i.e. using the one-hot encoding vectors and matrices).
- This means that the
Σs
have less rows and columns than the number of elements incoefficient_names
.
Dict{String,DataFrame}
: Spatial diagnostics information
Details
Performs a two-stage analysis:
- Stage-1: Spatial analysis per harvest-site-year combination
- Creates temporary JLD2 file with spatially adjusted data to manage memory
- Corrects for spatial effects:
- With rows and columns regardless of whether blocks are present:
rows
+cols
+rows:cols
- With blocks and one spatial factor:
blocks
+rows
+blocks:rows
blocks
+cols
+blocks:cols
- With single spatial factor:
blocks
rows
cols
- With rows and columns regardless of whether blocks are present:
- Removes effects of continuous covariates
- Returns spatially adjusted traits with "SPATADJ-" prefix
- Stage-2: GxE modeling excluding spatial effects and continuous covariates 2.a. Genotypic effects: -
entries
(required) 2.b. Environmental main effects: -sites
if present -seasons
if present -years
if present 2.c Environmental interactions: - With all 3 environment factors: +years:sites
+seasons:sites
+entries:seasons:sites
- With 2 environment factors: +years:seasons
+entries:seasons
(no sites) +years:sites
+entries:sites
(no seasons) +seasons:sites
+entries:seasons:sites
(no years) - With 1 environment factor: +entries:years
+entries:seasons
+entries:sites
The analysis includes:
- Genomic relationship matrix (GRM) integration if provided:
- GRM replaces identity matrices for entry-related effects
- Diagonals are inflated if resulting matrices not positive definite
- Inflation repeated up to 10 times to ensure stability
- Variance component estimation:
- Single vs multiple variance scalers determined by threshold
- Separate parameters for complex interaction terms
- MCMC-based Bayesian inference with:
- Burn-in period for chain convergence
- Diagnostic checks for convergence and mixing
Notes
- Automatically handles memory management for large design matrices
- Creates temporary JLD2 file "TEMP-dfspatadj-[hash].jld2" to store spatially adjusted data
- Automatically removes temporary JLD2 file after analysis is complete
- Supports both identity and genomic relationship matrices for genetic effects
- Performs automatic model diagnostics and variance component scaling
- Excludes continuous covariates from Stage-2 as they are handled in Stage-1
Examples
julia> genomes = simulategenomes(n=5, l=1_000, verbose=false);
julia> grm = grmploidyaware(genomes, ploidy=2, max_iter=10);
julia> trials, simulated_effects = simulatetrials(genomes = genomes, n_years=1, n_seasons=2, n_harvests=1, n_sites=3, n_replications=3, verbose=false);
julia> tebv_1, spatial_diagnostics_1 = analyseviaBLR(trials, ["trait_1"], n_iter = 1_000, n_burnin = 100);
julia> tebv_2, spatial_diagnostics_2 = analyseviaBLR(trials, ["trait_1", "trait_2"], other_covariates = ["trait_3"], n_iter = 1_000, n_burnin = 100);
julia> tebv_3, spatial_diagnostics_3 = analyseviaBLR(trials, ["trait_3"], grm = grm, n_iter = 1_000, n_burnin = 100);
julia> (length(tebv_1.phenomes[1].entries) == 5) && (length(tebv_1.phenomes[2].entries) == 30) && (length(spatial_diagnostics_1) == 6)
true
julia> (length(tebv_2.phenomes[1].entries) == 5) && (length(tebv_2.phenomes[2].entries) == 30) && (length(spatial_diagnostics_2) == 12)
true
julia> (length(tebv_3.phenomes[1].entries) == 5) && (length(tebv_3.phenomes[2].entries) == 30) && (length(spatial_diagnostics_3) == 6)
true
GenomicBreedingCore.checkandfocalterms
— Methodcheckandfocalterms(trait::String, factors::Vector{String}, df::DataFrame, other_covariates::Union{Vector{String},Nothing} = nothing)::Vector{String}
Validate input data and generate model terms for Bayesian analysis of field trials.
Arguments
trait::String
: Name of the response variable (trait) column in the DataFramefactors::Vector{String}
: Vector of factor names (categorical variables) to include in the modeldf::DataFrame
: DataFrame containing the trial dataother_covariates::Union{Vector{String},Nothing}
: Optional vector of numeric covariate column names
Returns
Vector{String}
: A vector of model terms including main effects, GxE interaction effects, and spatial effects
Details
The function performs several tasks:
- Validates that all specified columns exist in the DataFrame
- Ensures trait and covariates are numeric
- Converts factors to strings
- Generates appropriate model terms based on available factors:
- Main effects (entries, sites, seasons, years)
- GxE interaction effects (various combinations of entries × environmental factors)
- Spatial effects (blocks, rows, columns and their interactions)
Throws
ArgumentError
: If trait or factors are not found in DataFrameArgumentError
: If trait or covariates are non-numeric or contain missing values
Example
julia> genomes = simulategenomes(n=5, l=1_000, verbose=false);
julia> trials, simulated_effects = simulatetrials(genomes = genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=3, verbose=false);
julia> df = tabularise(trials);
julia> focal_terms_1 = checkandfocalterms(trait = trials.traits[1], factors = ["rows", "cols"], df = df, other_covariates = trials.traits[2:end])
2-element Vector{String}:
"rows"
"cols"
julia> focal_terms_2 = checkandfocalterms(trait = trials.traits[1], factors = ["years", "seasons", "sites"], df = df)
5-element Vector{String}:
"years"
"seasons"
"sites"
"years:sites"
"seasons:sites"
GenomicBreedingCore.checkdims
— Methodcheckdims(blr::BLR)::Bool
Check dimension compatibility of the internal fields of a BLR
struct.
This function verifies that all vector and matrix fields in the BLR
struct have compatible dimensions:
- Length of
entries
,y
,ȳ
, andϵ
must be equal (denoted asn
) - Length of
coefficients
andcoefficient_names
must be equal (denoted asp
) - Matrix
Xs
must have dimensionsn × p
- Matrix
Σs
must have dimensionsp × p
Returns true
if all dimensions are compatible, false
otherwise.
Arguments
blr::BLR
: The BLR struct to check dimensions for
Returns
Bool
:true
if dimensions are compatible,false
otherwise
Examples
julia> blr = BLR(n=1, p=4);
julia> checkdims(blr)
true
julia> blr.coefficient_names["dummy_X"] = ["dummy_coef"];
julia> checkdims(blr)
false
GenomicBreedingCore.checkdims
— Methodcheckdims(cv::CV)::Bool
Check dimension compatibility of the fields of the CV struct.
The function verifies that:
- The fit object dimensions are valid
- The number of validation populations matches the number of validation entries
- The number of validation true values matches the number of validation predictions
- The number of metrics matches the number of metrics in the fit object
Returns:
true
if all dimensions are compatiblefalse
if any dimension mismatch is found
Examples
julia> fit = Fit(n=1, l=2);
julia> cv = CV("replication_1", "fold_1", fit, ["population_1"], ["entry_1"], [0.0], [0.0], fit.metrics);
julia> checkdims(cv)
true
julia> cv.validation_y_true = [0.0, 0.0];
julia> checkdims(cv)
false
GenomicBreedingCore.checkdims
— Methodcheckdims(fit::Fit)::Bool
Check dimension compatibility of the internal fields of a Fit
struct.
This function verifies that all vector fields in the Fit
struct have compatible dimensions:
- Length of
entries
,populations
,y_true
, andy_pred
must be equal (denoted asn
) - Length of
b_hat
andb_hat_labels
must be equal (denoted asl
)
Returns true
if all dimensions are compatible, false
otherwise.
Arguments
fit::Fit
: The Fit struct to check dimensions for
Returns
Bool
:true
if dimensions are compatible,false
otherwise
Examples
julia> fit = Fit(n=1, l=4);
julia> checkdims(fit)
true
julia> fit.b_hat_labels = ["chr1\t1\tA|T\tA"];
julia> checkdims(fit)
false
GenomicBreedingCore.checkdims
— Methodcheckdims(grm::GRM)::Bool
Check dimension compatibility of the GRM (Genomic Relationship Matrix) struct fields.
Arguments
grm::GRM
: A Genomic Relationship Matrix struct containing entries and relationship matrix
Returns
true
if the number of entries matches the dimensions of the genomic relationship matrixfalse
if there is a mismatch between the number of entries and matrix dimensions
Details
The function verifies that:
- The number of entries equals the number of rows in the genomic relationship matrix
- The number of entries equals the number of columns in the genomic relationship matrix
(The genomic relationship matrix should be square with dimensions matching the number of entries)
Examples
julia> grm = GRM(string.(["entries_1", "entries_2"]), string.(["chr1 123 A|T A", "chr1 456 C|G G"]), Float64.(rand(2,2)));
julia> checkdims(grm)
true
julia> grm.entries = ["dummy_entry"];
julia> checkdims(grm)
false
GenomicBreedingCore.checkdims
— Methodcheckdims(genomes::Genomes)::Bool
Check dimension compatibility of the fields in a Genomes
struct.
Returns true
if all dimensions are compatible, false
otherwise.
Arguments
genomes::Genomes
: A Genomes struct containing genomic data
Details
Verifies that:
- Number of entries matches number of populations (n)
- Entry names are unique
- Number of loci alleles matches width of frequency matrix (p)
- Locus-allele combinations are unique
- Entries are unique
- Dimensions of frequency matrix (n×p) match mask matrix dimensions
Examples
julia> genomes = Genomes(n=2,p=4);
julia> checkdims(genomes)
false
julia> genomes.entries = ["entry_1", "entry_2"];
julia> genomes.loci_alleles = ["chr1\t1\tA|T\tA", "chr1\t2\tC|G\tG", "chr2\t3\tA|T\tA", "chr2\t4\tG|T\tG"];
julia> checkdims(genomes)
true
GenomicBreedingCore.checkdims
— Methodcheckdims(y::Phenomes)::Bool
Verify dimensional compatibility between all fields of a Phenomes struct.
Checks if:
- Number of entries matches the number of rows in phenotypes matrix
- All entry names are unique
- Number of populations matches number of entries
- Number of traits matches number of columns in phenotypes matrix
- All trait names are unique
- Dimensions of mask matrix match phenotypes matrix
Arguments
y::Phenomes
: A Phenomes struct containing phenotypic data
Returns
Bool
:true
if all dimensions are compatible,false
otherwise
Examples
julia> y = Phenomes(n=2, t=2);
julia> checkdims(y)
false
julia> y.entries = ["entry_1", "entry_2"];
julia> y.traits = ["trait_1", "trait_2"];
julia> checkdims(y)
true
GenomicBreedingCore.checkdims
— Methodcheckdims(effects::SimulatedEffects)::Bool
Check dimension compatibility of the fields of the SimulatedEffects struct.
Arguments
effects::SimulatedEffects
: A SimulatedEffects struct containing various genetic and experimental effects
Returns
Bool
:true
if all dimensions are compatible,false
otherwise
Verifies that:
id
has length 6field_layout
has 4 columns- All following vectors have the same length (n):
replications_x_site_x_harvest_x_season_x_year
blocks_x_site_x_harvest_x_season_x_year
rows_x_site_x_harvest_x_season_x_year
cols_x_site_x_harvest_x_season_x_year
additive_genetic
dominance_genetic
epistasis_genetic
additive_allele_x_site_x_harvest_x_season_x_year
dominance_allele_x_site_x_harvest_x_season_x_year
epistasis_allele_x_site_x_harvest_x_season_x_year
Examples
julia> effects = SimulatedEffects();
julia> typeof(hash(effects))
UInt64
jldoctest; setup = :(using GenomicBreedingCore) julia> effects = SimulatedEffects();
julia> checkdims(effects) true
julia> effects.id = ["beaking_change"];
julia> checkdims(effects) false ```
GenomicBreedingCore.checkdims
— Methodcheckdims(y::TEBV)::Bool
Check if all fields in the TEBV struct have compatible dimensions. The function verifies that the length of all arrays in the TEBV struct match the number of traits.
Arguments
tebv::TEBV
: A TEBV (Trial-estimated Breeding Values) struct containing traits, formulae, models, BLUEs, BLUPs, and phenomes.
Returns
Bool
: Returnstrue
if all fields have matching dimensions (equal to the number of traits),false
otherwise.
Details
The function checks if the following fields have the same length as traits
:
- formulae
- unique models
- unique BLUEs DataFrames
- unique BLUPs DataFrames
- unique phenomes
Examples
``jldoctest; setup = :(using GenomicBreedingCore, MixedModels, DataFrames)
julia> tebv = TEBV(traits=[""], formulae=[""], models=[MixedModel(@formula(y~1+(1|x)), DataFrame(y=1, x=1))], dfBLUEs=[DataFrame(x=1)], dfBLUPs=[DataFrame(x=1)], phenomes=[Phenomes(n=1,t=1)]);
julia> checkdims(tebv) true ```
GenomicBreedingCore.checkdims
— Methodcheckdims(trials::Trials)::Bool
Check dimension compatibility of all fields in a Trials
struct.
This function verifies that the dimensions of all vector fields in the Trials
struct are consistent with the size of the phenotypes matrix. Specifically, it checks:
- Number of traits (
t
) matches number of columns in phenotypes and length of traits vector - Number of entries (
n
) matches number of rows in phenotypes and length of:- years
- seasons
- harvests
- sites
- replications
- blocks
- rows
- cols
- entries
- populations
Returns true
if all dimensions are compatible, false
otherwise.
Arguments
trials::Trials
: A Trials struct containing trial data
Returns
Bool
:true
if dimensions are compatible,false
otherwise
Examples
julia> trials = Trials(n=1, t=2);
julia> trials.entries = ["entry_1"]; trials.traits = ["trait_1", "trait_2"];
julia> checkdims(trials)
true
julia> trials.entries = ["entering_2_entries", "instead_of_just_1"];
julia> checkdims(trials)
false
GenomicBreedingCore.clone
— Methodclone(x::BLR)::BLR
Create a deep copy of a BLR object, duplicating all its fields.
This function performs a deep clone of the input BLR object, ensuring that all nested structures and arrays are also copied, preventing any shared references between the original and the cloned object.
Arguments
x::BLR
: The BLR object to be cloned
Returns
BLR
: A new BLR object with identical but independent values
Fields copied
entries::Vector{String}
: Names or identifiers for the observationscoefficient_names::Vector{String}
: Names of the model coefficients/effectsy::Vector{Float64}
: Response/dependent variable vectorXs::Dict{String, Matrix{Union{Bool, Float64}}}
: Design matrices for factors and numeric matrix for the other covariatescoefficients::Vector{Float64}
: Estimated coefficients/effectsŷ::Vector{Float64}
: Fitted/predicted valuesϵ::Vector{Float64}
: Residuals (y - ŷ)Σs::Dict{String, Union{Matrix{Float64}, UniformScaling{Float64}}}
: Variance-covariance matrices
Examples
julia> blr = BLR(n=1, p=1);
julia> copy_blr = clone(blr)
BLR([""], Dict{String, Matrix{Union{Bool, Float64}}}("intercept" => [true;;]), Dict{String, Union{Nothing, UniformScaling{Float64}, Matrix{Float64}}}("σ²" => UniformScaling{Float64}(1.0)), Dict("intercept" => [0.0]), Dict("intercept" => ["intercept"]), [0.0], [0.0], [0.0], 0×0 DataFrame)
GenomicBreedingCore.clone
— Methodclone(x::CV)::CV
Create a deep copy of a CV (cross-validation) object.
Creates a new CV object with deep copies of all fields from the input object. The clone function ensures that modifications to the cloned object do not affect the original object.
Arguments
x::CV
: The CV object to be cloned
Returns
CV
: A new CV object containing deep copies of all fields from the input
Example
Clone a CV object
Example
julia> fit = Fit(n=1, l=2);
julia> cv = CV("replication_1", "fold_1", fit, ["population_1"], ["entry_1"], [0.0], [0.0], fit.metrics);
julia> copy_cv = clone(cv)
CV("replication_1", "fold_1", Fit("", ["", ""], [0.0, 0.0], "", [""], [""], [0.0], [0.0], Dict("" => 0.0), nothing), ["population_1"], ["entry_1"], [0.0], [0.0], Dict("" => 0.0))
GenomicBreedingCore.clone
— Methodclone(x::Fit)::Fit
Create a deep copy of a Fit object, duplicating all its fields.
This function performs a deep clone of the input Fit object, ensuring that all nested structures and arrays are also copied, preventing any shared references between the original and the cloned object.
Arguments
x::Fit
: The Fit object to be cloned
Returns
Fit
: A new Fit object with identical but independent values
Fields copied
model
: The statistical modelb_hat_labels
: Labels for the estimated parametersb_hat
: Estimated parameterstrait
: The trait being analyzedentries
: Entry identifierspopulations
: Population identifiersmetrics
: Performance metricsy_true
: Observed valuesy_pred
: Predicted values
Examples
julia> fit = Fit(n=1, l=2);
julia> copy_fit = clone(fit)
Fit("", ["", ""], [0.0, 0.0], "", [""], [""], [0.0], [0.0], Dict("" => 0.0), nothing)
GenomicBreedingCore.clone
— Methodclone(x::GRM)::GRM
Create a deep copy of a GRM (Genomic Relationship Matrix) object.
Creates a new GRM object with deep copies of all fields from the input object. The clone function ensures that modifications to the cloned object do not affect the original object.
Arguments
x::GRM
: The GRM object to be cloned
Returns
GRM
: A new GRM object containing deep copies of the entries, locialleles, and genomicrelationship_matrix fields from the input
Example
julia> grm = GRM(string.(["entries_1", "entries_2"]), string.(["chr1 123 A|T A", "chr1 456 C|G G"]), Float64.(rand(2,2)));
julia> copy_grm = clone(grm);
julia> (copy_grm.entries == grm.entries) && (copy_grm.loci_alleles == grm.loci_alleles) && (copy_grm.genomic_relationship_matrix == grm.genomic_relationship_matrix)
true
GenomicBreedingCore.clone
— Methodclone(x::Genomes)::Genomes
Create a deep copy of a Genomes
object.
This function performs a deep clone of all fields in the Genomes
object, including:
- entries
- populations
- loci_alleles
- allele_frequencies
- mask
Returns a new Genomes
instance with identical but independent data.
Arguments
x::Genomes
: The source Genomes object to clone
Returns
Genomes
: A new Genomes object containing deep copies of all fields
Example
julia> genomes = Genomes(n=2, p=2);
julia> copy_genomes = clone(genomes)
Genomes(["", ""], ["", ""], ["", ""], Union{Missing, Float64}[missing missing; missing missing], Bool[1 1; 1 1])
GenomicBreedingCore.clone
— Methodclone(x::Phenomes)::Phenomes
Create a deep copy of a Phenomes
object, including all its fields.
This function performs a deep copy of the following fields:
- entries: Vector of entry names
- populations: Vector of population identifiers
- traits: Vector of trait names
- phenotypes: Matrix of phenotypic values
- mask: Matrix of boolean masks
Returns a new Phenomes
object with identical structure but independent memory allocation.
Arguments
x::Phenomes
: The source Phenomes object to be cloned
Returns
Phenomes
: A new Phenomes object containing deep copies of all fields
Example
julia> phenomes = Phenomes(n=2, t=2);
julia> copy_phenomes = clone(phenomes)
Phenomes(["", ""], ["", ""], ["", ""], Union{Missing, Float64}[missing missing; missing missing], Bool[1 1; 1 1])
GenomicBreedingCore.clone
— Methodclone(x::TEBV)::TEBV
Create a deep copy of a TEBV (Trial-Estimated Breeding Value) object.
Returns a new TEBV instance with all fields deeply copied from the input object, ensuring complete independence between the original and cloned objects.
Arguments
x::TEBV
: The source TEBV object to be cloned
Returns
TEBV
: A new TEBV object containing deep copies of all fields from the input
Examples
julia> tebv = TEBV(traits=[""], formulae=[""], models=[MixedModel(@formula(y~1+(1|x)), DataFrame(y=1, x=1))], df_BLUEs=[DataFrame(x=1)], df_BLUPs=[DataFrame(x=1)], phenomes=[Phenomes(n=1,t=1)]);
julia> copy_tebv = clone(tebv);
julia> copy_tebv.traits == tebv.traits
true
julia> copy_tebv.phenomes == tebv.phenomes
true
GenomicBreedingCore.clone
— Methodclone(x::Trials)::Trials
Create a deep copy of a Trials
object, including all its fields.
This function performs a complete deep copy of the input Trials
object, ensuring that all nested data structures are also copied rather than referenced.
Arguments
x::Trials
: The sourceTrials
object to be cloned
Returns
Trials
: A newTrials
object containing copies of all data from the input
Example
julia> trials = Trials(n=2, t=2);
julia> copy_trials = clone(trials)
Trials(Union{Missing, Float64}[missing missing; missing missing], ["", ""], ["", ""], ["", ""], ["", ""], ["", ""], ["", ""], ["", ""], ["", ""], ["", ""], ["", ""], ["", ""])
GenomicBreedingCore.countlevels
— Methodcountlevels(df::DataFrame; column_names::Vector{String})::Int64
Count the total number of unique values (factor levels) across specified columns in a DataFrame.
Arguments
df::DataFrame
: Input DataFrame to analyzecolumn_names::Vector{String}
: Vector of column names to count unique values from
Returns
Int64
: Sum of unique values across all specified columns
Throws
ArgumentError
: If any of the specified column names are not found in the DataFrame
GenomicBreedingCore.dimensions
— Methoddimensions(blr::BLR)::Dict{String, Any}
Calculate various dimensional properties of a Bayesian Linear Regression (BLR) model.
Arguments
blr::BLR
: A Bayesian Linear Regression model structure
Returns
A dictionary containing the following keys:
"n_rows"
: Number of observations"n_coefficients"
: Total number of coefficients across all components"n_variance_components"
: Number of variance components"n_entries"
: Number of unique entries"coeff_per_varcomp"
: Dictionary mapping variance component names to their number of coefficients"varex_per_varcomp"
: Dictionary mapping variance component names to their variance explained (normalized)
Details
The function performs the following:
- Validates BLR structure dimensions
- Calculates coefficients per variance component
- Computes variance explained for each component and normalizes by total variance
- Returns dimensional summary as a dictionary
Throws
ArgumentError
: If the BLR structure dimensions are inconsistent
Examples
julia> blr = BLR(n=10, p=6, var_comp = Dict("entries" => 5, "σ²" => 1));
julia> dimensions(blr)
Dict{String, Any} with 6 entries:
"n_coefficients" => 6
"n_variance_components" => 2
"varex_per_varcomp" => Dict("entries"=>0.333333, "σ²"=>0.666667)
"n_entries" => 1
"n_rows" => 10
"coeff_per_varcomp" => Dict("entries"=>5.0, "σ²"=>10.0)
GenomicBreedingCore.dimensions
— Methoddimensions(genomes::Genomes)::Dict{String, Int64}
Calculate various dimensional metrics of a Genomes struct.
Returns a dictionary containing the following metrics:
"n_entries"
: Number of unique entries/samples"n_populations"
: Number of unique populations"n_loci_alleles"
: Total number of loci-allele combinations"n_chr"
: Number of chromosomes"n_loci"
: Number of unique loci across all chromosomes"max_n_alleles"
: Maximum number of alleles observed at any locus"n_missing"
: Count of missing values in allele frequencies
Arguments
genomes::Genomes
: A valid Genomes struct containing genetic data
Returns
Dict{String,Int64}
: Dictionary containing dimensional metrics
Throws
ArgumentError
: If the Genomes struct is corrupted (fails dimension check)
Examples
julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=4, verbose=false);
julia> dimensions(genomes)
Dict{String, Int64} with 7 entries:
"n_entries" => 100
"n_chr" => 7
"n_loci" => 1000
"n_loci_alleles" => 3000
"n_populations" => 1
"n_missing" => 0
"max_n_alleles" => 4
GenomicBreedingCore.dimensions
— Methoddimensions(phenomes::Phenomes)::Dict{String, Int64}
Calculate various dimensional statistics of a Phenomes
struct.
Returns a dictionary containing counts of:
"n_entries"
: unique entries in the dataset"n_populations"
: unique populations"n_traits"
: number of traits"n_total"
: total number of phenotypic observations (entries × traits)"n_zeroes"
: number of zero values in phenotypes"n_missing"
: number of missing values"n_nan"
: number of NaN values"n_inf"
: number of infinite values
Arguments
phenomes::Phenomes
: A Phenomes struct containing phenotypic data
Returns
Dict{String,Int64}
: Dictionary with dimensional statistics
Throws
ArgumentError
: If the Phenomes struct dimensions are inconsistent
Examples
julia> phenomes = Phenomes(n=10, t=3); phenomes.entries = string.("entry_", 1:10); phenomes.populations .= "pop_1"; phenomes.traits = ["A", "B", "C"]; phenomes.phenotypes = fill(0.0, 10,3);
julia> dimensions(phenomes)
Dict{String, Int64} with 8 entries:
"n_total" => 30
"n_zeroes" => 30
"n_nan" => 0
"n_entries" => 10
"n_traits" => 3
"n_inf" => 0
"n_populations" => 1
"n_missing" => 0
GenomicBreedingCore.dimensions
— Methoddimensions(tebv::TEBV)::Dict{String, Int64}
Calculate various dimensional metrics for a TEBV (Trial-Estimated Breeding Values) struct.
Arguments
tebv::TEBV
: A TEBV struct containing traits, formulae, models, BLUEs, BLUPs, and phenomes data
Returns
A dictionary containing the following counts:
"n_entries"
: Number of unique entries across all phenomes"n_populations"
: Number of unique populations across all phenomes"n_traits"
: Number of traits in the TEBV struct"n_total"
: Total number of observations across all traits"n_zeroes"
: Total number of zero values across all traits"n_missing"
: Total number of missing values across all traits"n_nan"
: Total number of NaN values across all traits"n_inf"
: Total number of Infinite values across all traits
Throws
ArgumentError
: If the TEBV struct dimensions are inconsistent or corrupted
Examples
julia> tebv = TEBV(traits=["trait_1"], formulae=["trait_1 ~ 1 + 1|entries"], models=[MixedModel(@formula(y~1+(1|x)), DataFrame(y=1, x=1))], df_BLUEs=[DataFrame(x=1)], df_BLUPs=[DataFrame(x=1)], phenomes=[Phenomes(n=1,t=1)]);
julia> dimensions(tebv)
Dict{String, Int64} with 8 entries:
"n_total" => 1
"n_zeroes" => 0
"n_nan" => 0
"n_entries" => 1
"n_traits" => 1
"n_inf" => 0
"n_populations" => 1
"n_missing" => 1
GenomicBreedingCore.dimensions
— Methoddimensions(trials::Trials)::Dict{String, Int64}
Calculate dimensional statistics of a Trials
struct, returning a dictionary with counts of various elements.
Arguments
trials::Trials
: ATrials
struct containing trial data
Returns
A Dict{String, Int64}
with the following keys:
"n_traits"
: Number of unique traits"n_years"
: Number of unique years"n_seasons"
: Number of unique seasons"n_harvests"
: Number of unique harvests"n_sites"
: Number of unique sites"n_replications"
: Number of unique replications"n_blocks"
: Number of unique blocks"n_rows"
: Number of unique rows"n_cols"
: Number of unique columns"n_entries"
: Number of unique entries"n_populations"
: Number of unique populations"n_total"
: Total number of phenotype observations (entries × traits)"n_zeroes"
: Count of zero values in phenotypes"n_missing"
: Count of missing values in phenotypes"n_nan"
: Count of NaN values in phenotypes"n_inf"
: Count of Inf values in phenotypes
Throws
ArgumentError
: If the Trials struct dimensions are inconsistent
Examples
julia> trials = Trials(n=1, t=2);
julia> trials.entries = ["entry_1"]; trials.traits = ["trait_1", "trait_2"];
julia> dimensions(trials)
Dict{String, Int64} with 16 entries:
"n_zeroes" => 0
"n_harvests" => 1
"n_nan" => 0
"n_entries" => 1
"n_traits" => 2
"n_seasons" => 1
"n_rows" => 1
"n_blocks" => 1
"n_missing" => 2
"n_inf" => 0
"n_total" => 2
"n_replications" => 1
"n_years" => 1
"n_sites" => 1
"n_cols" => 1
"n_populations" => 1
GenomicBreedingCore.distances
— Methoddistances(
genomes::Genomes;
distance_metrics::Vector{String}=["euclidean", "correlation", "mad", "rmsd", "χ²"],
idx_loci_alleles::Union{Nothing, Vector{Int64}} = nothing,
include_loci_alleles::Bool = true,
include_entries::Bool = true,
include_counts::Bool = true,
verbose::Bool = false
)::Tuple{Vector{String},Vector{String},Dict{String,Matrix{Float64}}}
Calculate pairwise distances/similarity metrics between loci-alleles and entries in a Genomes
object.
Arguments
genomes::Genomes
: Input Genomes objectdistance_metrics::Vector{String}
: Vector of distance metrics to calculate. Valid options:- "euclidean": Euclidean distance
- "correlation": Pearson correlation coefficient
- "mad": Mean absolute deviation
- "rmsd": Root mean square deviation
- "χ²": Chi-square distance
idx_loci_alleles::Union{Nothing, Vector{Int64}}
: Optional indices of loci-alleles to include. If nothing, randomly samples 100 loci-alleles.include_loci_alleles::Bool
: Whether to calculate distances between loci-alleles. Defaults to true.include_entries::Bool
: Whether to calculate distances between entries. Defaults to true.include_counts::Bool
: Whether to include matrices showing number of valid pairs used. Defaults to true.verbose::Bool
: Whether to show progress bars. Defaults to false.
Returns
Tuple containing:
- Vector of loci-allele names used
- Vector of entry names
- Dictionary mapping "{dimension}|{metric}" to distance matrices, where:
- dimension is either "loci_alleles" or "entries"
- metric is one of the distance metrics or "counts" (number of valid pairs used)
- matrices contain pairwise distances/correlations (-Inf where insufficient data)
Details
- For loci-alleles, calculates distances between allele frequency profiles across entries
- For entries, calculates distances between entries based on their allele frequencies
- Requires at least 2 valid (non-missing, finite) pairs to calculate metrics
- Includes count matrices showing number of valid pairs used per calculation
- Multi-threaded implementation which uses indexing on pre-allocated vectors and matrices which should avoid data races
Examples
julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=4, verbose=false);
julia> (loci_alleles_names, entries, dist) = distances(genomes, distance_metrics=["correlation", "χ²"]);
julia> sort(string.(keys(dist))) == ["entries|correlation", "entries|counts", "entries|χ²", "loci_alleles|correlation", "loci_alleles|counts", "loci_alleles|χ²"]
true
julia> C = dist["entries|correlation"]; C[diagind(C)] == repeat([1], length(genomes.entries))
true
julia> χ² = dist["loci_alleles|χ²"]; χ²[diagind(χ²)] == repeat([0.0], 100)
true
GenomicBreedingCore.distances
— Methoddistances(
phenomes::Phenomes;
distance_metrics::Vector{String}=["euclidean", "correlation", "mad", "rmsd", "χ²"],
standardise_traits::Bool = false
)::Tuple{Vector{String}, Vector{String}, Dict{String, Matrix{Float64}}}
Calculate pairwise distances/correlations between traits and entries in a phenotypic dataset.
Arguments
phenomes::Phenomes
: A Phenomes struct containing phenotypic datadistance_metrics::Vector{String}
: Vector of distance metrics to compute. Valid options are:- "euclidean": Euclidean distance
- "correlation": Pearson correlation coefficient
- "mad": Mean absolute deviation
- "rmsd": Root mean square deviation
- "χ²": Chi-square distance
standardise_traits::Bool
: If true, standardizes traits to mean=0 and sd=1 before computing distances
Returns
A tuple containing:
- Vector of trait names
- Vector of entry names
- Dictionary mapping "{dimension}|{metric}" to distance matrices, where:
- dimension ∈ ["traits", "entries"]
- metric ∈ distance_metrics ∪ ["counts"]
- "counts" matrices contain the number of non-missing pairs used in calculations
Notes
- Pairs with fewer than 2 non-missing values result in -Inf distance values
- For correlation calculations, traits with near-zero variance (< 1e-7) are skipped
- χ² distance adds machine epsilon to denominator to avoid division by zero
Examples
julia> phenomes = Phenomes(n=10, t=3); phenomes.entries = string.("entry_", 1:10); phenomes.populations .= "pop_1"; phenomes.traits = ["A", "B", "C"]; phenomes.phenotypes = rand(10,3); phenomes.phenotypes[2,2] = missing;
julia> (traits, entries, dist) = distances(phenomes, distance_metrics=["correlation", "χ²"]);
julia> sort(string.(keys(dist))) == ["entries|correlation", "entries|counts", "entries|χ²", "traits|correlation", "traits|counts", "traits|χ²"]
true
julia> C = dist["entries|correlation"]; C[diagind(C)] == repeat([1], length(phenomes.entries))
true
julia> dist["traits|counts"][:, 2] == dist["traits|counts"][2, :] == repeat([9], length(phenomes.traits))
true
julia> dist["entries|counts"][:, 2] == dist["entries|counts"][2, :] == repeat([2], length(phenomes.entries))
true
GenomicBreedingCore.divideintomockscaffolds
— Methodddivideintomockscaffolds(
genomes::Genomes;
max_n_loci_per_chrom::Int64 = 100_000,
verbose::Bool = false,
)::Vector{String}
Divide genomic loci into mock scaffolds based on a maximum number of loci per chromosome.
Arguments
genomes::Genomes
: A Genomes struct containing genomic datamax_n_loci_per_chrom::Int64
: Maximum number of loci per chromosome (default: 100,000)verbose::Bool
: If true, prints additional information during execution (default: false)
Returns
Vector{String}
: A vector containing mock scaffold assignments for each locus
Description
This function takes a Genomes struct and divides the loci into mock scaffolds based on the specified maximum number of loci per chromosome. It creates scaffold names in the format "mockscaffoldX" where X is the scaffold number.
Throws
ArgumentError
: If the Genomes struct dimensions are invalid or corrupted
Example
julia> genomes = simulategenomes(n=20, sparsity=0.3, verbose=false);
julia> mock_scaffolds = divideintomockscaffolds(genomes, max_n_loci_per_chrom=100);
julia> sum(mock_scaffolds .== mock_scaffolds[1]) == Int64(length(genomes.loci_alleles) / 100)
true
GenomicBreedingCore.estimatedistances
— Methodestimatedistances(
genomes::Genomes;
LD::Matrix{Float64},
idx_focal_locus::Int64,
idx_loci_alleles_per_chrom::Vector{Int64},
min_loci_corr::Float64,
min_l_loci::Int64,
verbose::Bool = false,
)::Matrix{Float64}
Estimate pairwise distances between entries based on loci most linked to a focal locus.
Arguments
genomes::Genomes
: AGenomes
struct containing the genomic data.LD::Matrix{Float64}
: A linkage disequilibrium (LD) matrix.idx_focal_locus::Int64
: The local index of the focal locus, i.e. index at the current chromosome or mock scaffold.idx_loci_alleles_per_chrom::Vector{Int64}
: A vector of global indices for loci-alleles per chromosome, i.e. indices corresponding the wholegenomes.allele_frequencies
matrix.min_loci_corr::Float64
: The minimum correlation threshold for loci to be considered linked.min_l_loci::Int64
: The minimum number of loci required to be linked to the focal locus.verbose::Bool
: Iftrue
, prints detailed progress information.
Returns
Matrix{Float64}
: A matrix of pairwise distances between entries, estimated using the mean absolute difference (MAD) in allele frequencies of loci most linked to the focal locus.
Details
This function calculates pairwise distances between entries in a genomic dataset. The distances are based on loci that are most correlated with a specified focal locus. The function uses the following steps:
- Input Validation: Checks the validity of the input arguments, ensuring the Genomes struct is not corrupted, the LD matrix is square, and the indices are within valid ranges.
- Identify Linked Loci: Determines which loci are most linked to the focal locus based on the provided LD matrix and correlation threshold. If the number of linked loci is below the specified minimum, the top loci are selected based on their correlation values.
- Estimate Distances: Calculates pairwise distances between entries using the mean absolute difference (MAD) in allele frequencies of the selected loci.
- Output: Returns a matrix of pairwise distances.
Notes
- The function identifies loci most linked to the focal locus based on the provided LD matrix and correlation threshold.
- If all loci have undetermined level of linkage with the focal locus (i.e. -Inf values in the LD matrix), then all the loci-alleles are selected to estimate distances.
- If the number of linked loci is below the specified minimum, then the minimum number of loci most correlated to the focal locus are selected to estimate distances.
- Pairwise distances are estimated using the mean absolute difference (MAD) in allele frequencies of the selected loci.
- Intermediate results and progress can be printed if verbose is set to true.
Throws
- ArgumentError: If the input arguments are invalid or the Genomes struct is corrupted.
- ArgumentError: If there are less than 2 indices for loci-alleles in the current chromosome.
- ArgumentError: If the LD matrix is not square.
- ArgumentError: If the LD matrix and the vector of indices are incompatible.
- ArgumentError: If the focal locus index is out of range.
- ArgumentError: If the minimum correlation threshold is out of range.
- ArgumentError: If the minimum number of loci is out of range.
Example
julia> genomes = simulategenomes(n=40, l=1_000, sparsity=0.3, verbose=false);
julia> chromosomes = divideintomockscaffolds(genomes, max_n_loci_per_chrom=100);
julia> chroms_uniq, LDs = estimateld(genomes, chromosomes=chromosomes);
julia> rm.(readdir()[.!isnothing.(match.(Regex("jld2"), readdir()))]);
julia> k = 1; LD = LDs[k]; chrom = chroms_uniq[k]; idx_focal_locus = 1;
julia> idx_loci_alleles_per_chrom = findall(chromosomes .== chrom);
julia> min_loci_corr = 0.9; min_l_loci = 2;
julia> D = estimatedistances(genomes, LD=LD, idx_focal_locus=idx_focal_locus, idx_loci_alleles_per_chrom=idx_loci_alleles_per_chrom, min_loci_corr=min_loci_corr, min_l_loci=min_l_loci);
julia> size(D) == (length(genomes.entries), length(genomes.entries))
true
GenomicBreedingCore.estimateld
— Methodestimateld(
genomes::Genomes;
chromosomes::Union{Nothing,Vector{String}} = nothing,
verbose::Bool = false,
)::Tuple{Vector{String},Vector{Matrix{Float64}}}
Calculate linkage disequilibrium (LD) matrices for each chromosome in the given genomic data.
Arguments
genomes::Genomes
: A Genomes struct containing genomic datachromosomes::Union{Nothing, Vector{String}}
: Optional vector of chromosome names to analyse. If nothing, all chromosomes in the data will be usedverbose::Bool
: If true, prints progress information during computation
Returns
Tuple{Vector{String}, Vector{Matrix{Float64}}}
: A tuple containing:- A vector of unique chromosome names.
- A vector of LD matrices corresponding to each chromosome.
Examples
julia> genomes = simulategenomes(n=30, l=1_000, sparsity=0.3, verbose=false);
julia> chroms_uniq, LDs_all_chroms = estimateld(genomes);
julia> chrom, pos, allele = loci_alleles(genomes);
julia> rm.(readdir()[.!isnothing.(match.(Regex("jld2"), readdir()))]);
julia> mock_scaffolds = divideintomockscaffolds(genomes, max_n_loci_per_chrom=100);
julia> mock_scaffolds_uniq, LDs_mock_scaffolds = estimateld(genomes, chromosomes=mock_scaffolds);
julia> rm.(readdir()[.!isnothing.(match.(Regex("jld2"), readdir()))]);
julia> length(LDs_all_chroms) == length(chroms_uniq) == length(unique(chrom))
true
julia> length(LDs_mock_scaffolds) == length(mock_scaffolds_uniq) == Int(length(chrom) / 100)
true
GenomicBreedingCore.extractXb
— MethodextractXb(blr::BLR)::Tuple{Matrix{Float64}, Vector{Float64}, Vector{String}}
Extract design matrix X, coefficients b, and coefficient labels from a BLR (Bayesian Linear Regression) model.
Arguments
blr::BLR
: A BLR struct containing model components
Returns
A tuple containing:
X::Matrix{Float64}
: The design matrix combining all predictorsb::Vector{Float64}
: Vector of estimated coefficientsb_labels::Vector{String}
: Vector of coefficient labels/names
Throws
ArgumentError
: If the dimensions in the BLR struct are inconsistent
Details
Extracts and combines the design matrices, coefficients and their labels from the BLR model components. The intercept is handled separately and placed first, followed by other variance components.
Note
The function assumes the BLR struct has valid "intercept" components and optional additional variance components.
Examples
julia> blr = BLR(n=1, p=4);
julia> X, b, b_labels = extractXb(blr);
julia> size(X) == (1, 4)
true
julia> length(b) == length(b_labels)
true
GenomicBreedingCore.extracteffects
— Methodextracteffects(blr::BLR; verbose::Bool = false)::Dict{String, DataFrame}
Extract and organize effects from a Bayesian Linear Regression (BLR) model.
This function processes both main effects and interaction effects from a fitted BLR model, organizing them into separate DataFrames within a dictionary.
Arguments
blr::BLR
: A fitted Bayesian Linear Regression model structureverbose::Bool=false
: If true, prints intermediate results during processing
Returns
Dict{String, DataFrame}
: A dictionary where:- Keys are effect names (main effects or interaction effects)
- Values are DataFrames containing:
name
: Labels for the effectsvalue
: Corresponding effect values
Details
- Extracts design matrix (X), effects (b), and their labels from the BLR model
- Processes main effects (single factors) and interaction effects (combined factors) separately
- Removes redundant rows based on hashing
- Combines effects with their corresponding design matrix elements
Throws
ArgumentError
: If the BLR struct dimensions are invalid
Example
julia> genomes = simulategenomes(n=5, l=1_000, verbose=false);
julia> trials, simulated_effects = simulatetrials(genomes = genomes, n_years=1, n_seasons=2, n_harvests=1, n_sites=3, n_replications=3, verbose=false);
julia> tebv, spatial_diagnostics = analyseviaBLR(trials, ["trait_1"], n_iter = 1_000, n_burnin = 100);
julia> blr = tebv.models[1];
julia> dfs = extracteffects(blr);
julia> [size(v) for (k, v) in dfs]
5-element Vector{Tuple{Int64, Int64}}:
(2, 2)
(3, 2)
(6, 2)
(5, 2)
(30, 2)
GenomicBreedingCore.extractmodelinputs
— Methodextractmodelinputs(blr::BLR; multiple_σs::Union{Nothing, Dict{String, Bool}}=nothing)::Dict
Extract model inputs from BLR object for Bayesian modelling.
Arguments
blr::BLR
: A Bayesian Linear Regression model objectmultiple_σs::Union{Nothing, Dict{String, Bool}}=nothing
: Optional dictionary specifying whether each variance component should use single (false) or multiple (true) variance scalers. If nothing, defaults to single scaler for all components.
Returns
A dictionary containing:
"y"
: Response variable vector"vector_of_Xs_noint"
: Vector of design matrices for each variance component"vector_of_Δs"
: Vector of variance-covariance matrices or uniform scaling factors"length_of_σs"
: Vector specifying number of variance components for each term"variance_components"
: Vector of variance component names"vector_coefficient_names"
: Vector of coefficient names for all variance components
Notes
- Excludes intercept from variance components
- For each variance component:
- If multiple_σs[component] = false: Uses single variance scaler
- If multiple_σs[component] = true: Uses separate variance scaler per coefficient
- Processes design matrices, variance-covariance structures, and coefficient names from the BLR object
Throws
ErrorException
: If a variance component is not specified in the multiple_σs dictionary
Example
julia> genomes = simulategenomes(n=5, l=1_000, verbose=false);
julia> trials, simulated_effects = simulatetrials(genomes = genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=3, verbose=false);
julia> df = tabularise(trials);
julia> blr = instantiateblr(trait = trials.traits[1], factors = ["rows", "cols"], df = df, other_covariates = trials.traits[2:end], verbose = false);
julia> model_inputs_1 = extractmodelinputs(blr, multiple_σs = nothing);
julia> model_inputs_2 = extractmodelinputs(blr, multiple_σs = Dict("rows" => true, "cols" => true, "rows & cols" => false, "other_covariates" => true));
julia> (model_inputs_1["y"] == model_inputs_2["y"]) && (model_inputs_1["vector_of_Xs_noint"] == model_inputs_2["vector_of_Xs_noint"]) && (model_inputs_1["vector_of_Δs"] == model_inputs_2["vector_of_Δs"]) && (model_inputs_1["vector_coefficient_names"] == model_inputs_2["vector_coefficient_names"])
true
julia> sum(model_inputs_1["length_of_σs"]) < sum(model_inputs_2["length_of_σs"])
true
GenomicBreedingCore.extractphenomes
— Methodextractphenomes(tebv::TEBV)::Phenomes
Extract phenotypic values from a Trial-Estimated Breeding Value (TEBV) object.
This function processes phenotypic data from a TEBV object, handling intercept effects and merging multiple phenomes if present. It performs the following operations:
- Validates input TEBV dimensions
- Processes intercept effects if present by:
- Identifying intercept terms
- Combining intercept values with trait effects
- Adjusting trait names and phenotypic values accordingly
- Merges multiple phenomes if present
- Renames traits to match input TEBV traits if dimensions align
- Validates output Phenomes dimensions
Arguments
tebv::TEBV
: A Trial Estimated Breeding Value object containing phenotypic data
Returns
Phenomes
: A Phenomes object containing processed phenotypic values
Throws
ArgumentError
: If input TEBV or output Phenomes dimensions are invalid
Examples
julia> trials, _simulated_effects = simulatetrials(genomes = simulategenomes(n=10, verbose=false), n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=10, verbose=false);
julia> tebv = analyse(trials, max_levels=50, verbose=false);
julia> phenomes = extractphenomes(tebv);
julia> phenomes.traits == ["trait_1", "trait_2", "trait_3"]
true
GenomicBreedingCore.extractphenomes
— Methodextractphenomes(trials::Trials)::Phenomes
Convert a Trials
struct into a Phenomes
struct by extracting phenotypic values across different environments.
Details
- Combines trait measurements with their environmental contexts
- Creates unique trait identifiers by combining trait names with environment variables
- Environment variables include: years, harvests, seasons, sites, and replications
- For single environment scenarios, trait names remain without environmental suffixes
Arguments
trials::Trials
: A Trials struct containing phenotypic measurements across different environments
Returns
- A Phenomes struct containing:
phenotypes
: Matrix of phenotypic values (entries × traits)entries
: Vector of entry namespopulations
: Vector of population namestraits
: Vector of trait names (with environmental contexts)
Throws
ArgumentError
: If duplicate entries exist within year-harvest-season-site-replication combinationsErrorException
: If dimensional validation fails during Phenomes construction
Examples
julia> trials, _ = simulatetrials(genomes = simulategenomes(verbose=false), verbose=false);
julia> phenomes = extractphenomes(trials);
julia> size(phenomes.phenotypes)
(100, 384)
GenomicBreedingCore.filterbymaf
— Methodfilterbymaf(
genomes::Genomes;
maf::Float64 = 0.01,
verbose::Bool = false,
)::Genomes
Filter genomic data by removing loci with minor allele frequencies (MAF) below a specified threshold.
Arguments
genomes::Genomes
: AGenomes
struct containing the genomic data.maf::Float64
: The minimum allele frequency threshold. Default is 0.01.verbose::Bool
: Iftrue
, prints detailed progress information during the filtering process. Default isfalse
.
Returns
Genomes
: AGenomes
struct with filtered genomic data.
Details
This function filters genomic data by removing loci with minor allele frequencies (MAF) below a specified threshold. The function performs the following steps:
- Input Validation: Ensures that the
Genomes
struct is not corrupted and that themaf
argument is within the valid range (0.0 to 1.0). Throws anArgumentError
if any argument is out of range. - Early Return for maf = 0.0: If
maf
is set to 0.0, the function returns the originalGenomes
struct without filtering. - Calculate MAF: Computes the mean allele frequency for each locus, skipping missing values.
- Filter Loci: Identifies loci that pass the MAF threshold and retains only those loci:
- If all loci pass the MAF threshold, the function returns the original
Genomes
struct. - If no loci pass the MAF threshold, the function throws an
ErrorException
.
- If all loci pass the MAF threshold, the function returns the original
- Verbose Output: If
verbose
istrue
, prints detailed progress information during the filtering process. - Output: Returns the filtered
Genomes
struct.
Notes
- The function uses multi-threading to compute the mean allele frequencies for each locus, improving performance on large datasets.
- The
verbose
option provides additional insights into the filtering process by printing progress information. - The function ensures that the filtered genomic data retains a minimum number of loci.
Throws
ArgumentError
: If theGenomes
struct is corrupted.ArgumentError
: If themaf
argument is out of range.ErrorException
: If all loci are filtered out based on the MAF threshold.
Example
julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=4, verbose=false);
julia> filtered_genomes = filterbymaf(genomes, maf=0.05);
julia> length(genomes.loci_alleles) >= length(filtered_genomes.loci_alleles)
true
GenomicBreedingCore.filterbypca
— Methodfilterbypca(
genomes::Genomes;
max_prop_pc_varexp::Float64 = 0.9,
verbose::Bool = false,
)::Genomes
Filter genomic data by removing outlier loci-alleles based on principal component analysis (PCA).
Arguments
genomes::Genomes
: AGenomes
struct containing the genomic data.max_prop_pc_varexp::Float64
: The maximum proportion of variance explained by the first two principal components (PC1 and PC2). Default is 0.9.verbose::Bool
: Iftrue
, prints detailed progress information during the filtering process. Default isfalse
.
Returns
Genomes
: AGenomes
struct with filtered genomic data.
Details
This function filters genomic data by removing outlier loci-alleles based on principal component analysis (PCA). The function performs the following steps:
- Input Validation: Ensures that the
Genomes
struct is not corrupted and that themax_prop_pc_varexp
argument is within the valid range (0.0 to Inf). Throws anArgumentError
if any argument is out of range. - Early Return for maxproppc_varexp = Inf: If
max_prop_pc_varexp
is set to Inf, the function returns the originalGenomes
struct without filtering. - Extract Non-Missing Loci-Alleles: Identifies loci-alleles that are non-missing and have non-zero variance across all entries.
- Standardize Allele Frequencies: Standardizes the allele frequencies per locus-allele in preparation for PCA.
- Perform PCA: Conducts PCA on the standardized allele frequencies and calculates the proportion of variance explained by the first two principal components (PC1 and PC2).
- Identify Outliers: Identifies loci-alleles that are outliers based on the specified proportion of variance explained by PC1 and PC2.
- Filter Outliers: Removes the identified outlier loci-alleles from the genomic data.
- Verbose Output: If
verbose
istrue
, prints detailed progress information during the filtering process. - Output: Returns the filtered
Genomes
struct.
Notes
- The function uses PCA to identify outlier loci-alleles based on the proportion of variance explained by the first two principal components.
- The
verbose
option provides additional insights into the filtering process by printing progress information. - The function ensures that the filtered genomic data retains a minimum number of loci-alleles.
Throws
ArgumentError
: If theGenomes
struct is corrupted.ArgumentError
: If themax_prop_pc_varexp
argument is out of range.ErrorException
: If there are less than 10 loci-alleles left after removing fixed and missing values across all entries.
Example
julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=4, verbose=false);
julia> filtered_genomes = filterbypca(genomes, max_prop_pc_varexp=0.9);
julia> length(filtered_genomes.loci_alleles) <= length(genomes.loci_alleles)
true
GenomicBreedingCore.filterbysnplist
— Methodfilterbysnplist(
genomes::Genomes;
chr_pos_allele_ids::Union{Nothing,Vector{String}} = nothing,
match_alleles::Bool = true,
verbose::Bool = false,
)::Genomes
Filter genomic data by retaining only the specified loci or loci-allele combinations.
Arguments
genomes::Genomes
: AGenomes
struct containing the genomic data.chr_pos_allele_ids::Union{Nothing, Vector{String}}
: A vector of loci or loci-allele combination names in the format "chromosome\tposition" or "chromosome\tposition\tallele". Ifnothing
, no filtering is applied. Default isnothing
.match_alleles::Bool
: Iftrue
, matches both loci coordinates (chromosome and position) and alleles. Iffalse
, matches only by chromosome and position regardless of alleles. Default istrue
.verbose::Bool
: Iftrue
, prints detailed progress information during the filtering process. Default isfalse
.
Returns
Genomes
: AGenomes
struct with filtered genomic data.
Details
This function filters genomic data by retaining only the specified loci or loci-allele combinations. The function performs the following steps:
- Input Validation: Ensures that the
Genomes
struct is not corrupted. - Early Return for No Filtering: If no IDs are provided, returns the original
Genomes
struct. - Parse Input IDs: Parses the input IDs and validates their format.
- Extract Available IDs: Gets the current loci or loci-allele combinations from the
Genomes
struct. - Filter Data: Identifies and retains only the specified combinations.
- Verbose Output: If enabled, prints detailed progress information.
- Final Check: Ensures that filtering retains at least some data.
Notes
- Uses multi-threading for parsing and filtering operations
- When
match_alleles=false
, matches only chromosome and position - Preserves the original data structure while filtering unwanted entries
Throws
ArgumentError
: If theGenomes
struct is corrupted or input IDs are malformedErrorException
: If no data remains after filtering
Example
julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=4, verbose=false);
julia> chr_pos_allele_ids = sample(genomes.loci_alleles, 100, replace=false); sort!(chr_pos_allele_ids);
julia> filtered_genomes = filterbysnplist(genomes, chr_pos_allele_ids=chr_pos_allele_ids);
julia> size(filtered_genomes.allele_frequencies)
(100, 100)
julia> chr_pos_allele_ids_no_alleles = unique([join(split(x, " ")[1:2], " ") for x in chr_pos_allele_ids]);
julia> filtered_genomes_2 = filterbysnplist(genomes, chr_pos_allele_ids=chr_pos_allele_ids_no_alleles, match_alleles=false);
julia> size(filtered_genomes_2.allele_frequencies, 2) > size(filtered_genomes.allele_frequencies, 2)
true
GenomicBreedingCore.filterbysparsity
— Methodfilterbysparsity(
genomes::Genomes;
max_entry_sparsity::Float64 = 0.0,
max_locus_sparsity::Float64 = 0.0,
max_entry_sparsity_percentile::Float64 = 0.90,
max_locus_sparsity_percentile::Float64 = 0.50,
verbose::Bool = false,
)::Genomes
Filter genomic data by removing entries and loci with high sparsity.
Arguments
genomes::Genomes
: AGenomes
struct containing the genomic data.max_entry_sparsity::Float64
: The maximum allowable sparsity for entries. Default is 0.0.max_locus_sparsity::Float64
: The maximum allowable sparsity for loci. Default is 0.0.max_entry_sparsity_percentile::Float64
: The percentile threshold for entry sparsity used in iteratively filtering out the sparsest entries. Default is 0.90.max_locus_sparsity_percentile::Float64
: The percentile threshold for locus sparsity used in iteratively filtering out sparsest loci. Default is 0.50.verbose::Bool
: Iftrue
, prints detailed progress information during the filtering process. Default isfalse
.
Returns
Genomes
: AGenomes
struct with filtered genomic data.
Details
This function filters genomic data by iteratively removing entries and loci with high sparsity. The function performs the following steps:
- Input Validation: Ensures that the
Genomes
struct is not corrupted and that the sparsity thresholds are within the valid range (0.0 to 1.0). Throws anArgumentError
if any argument is out of range. - Calculate Sparsities: Computes the sparsities of entries and loci in the genomic data.
- Initial Check: Checks if the input
Genomes
struct passes all the filtering thresholds. If so, returns the originalGenomes
struct. - Iterative Filtering: Iteratively removes the sparsest loci and entries until the maximum allowable sparsity thresholds are met:
- Remove Sparsest Loci: Removes loci with sparsity above the specified percentile threshold or maximum loci sparsity, whichever is higher.
- Remove Sparsest Entries: Removes entries with sparsity above the specified percentile threshold or maximum entries sparsity, whichever is higher.
- Verbose Output: If
verbose
istrue
, prints detailed progress information during each iteration. - Final Check: Ensures that there are remaining entries and loci after filtering. Throws an
ErrorException
if all entries or loci are filtered out. - Output: Returns the filtered
Genomes
struct.
Notes
- The function uses percentile thresholds to iteratively remove the sparsest loci and entries.
- The
verbose
option provides additional insights into the filtering process by printing progress information. - The function ensures that the filtered genomic data retains a minimum number of entries and loci.
- If one of
max_entry_sparsity
ormax_locus_sparsity
is set to 1.00, the other threshold may not be met because
filtering out entries/loci further will mean filtering the other but the requirement at 100% maximum sparsity is already met. At instances like this, you may manually reduced the maximum sparsity level from 1.00.
Throws
ArgumentError
: If theGenomes
struct is corrupted.ArgumentError
: If any of the sparsity thresholds are out of range.ErrorException
: If all entries and/or loci are filtered out based on the sparsity thresholds.
Examples
julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=4, sparsity=0.01, verbose=false);
julia> filtered_genomes = filterbysparsity(genomes);
julia> size(genomes.allele_frequencies)
(100, 3000)
julia> size(filtered_genomes.allele_frequencies)
(92, 1239)
julia> entry_sparsities, locus_sparsities = sparsities(filtered_genomes);
julia> maximum(entry_sparsities) == maximum(locus_sparsities) == 0.0
true
julia> filtered_genomes_2 = filterbysparsity(genomes, max_entry_sparsity=1.00, max_locus_sparsity=0.00);
GenomicBreedingCore.grmploidyaware
— Methodgrmploidyaware(
genomes::Genomes;
ploidy::Int64 = 2,
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
verbose::Bool = false
)::Matrix{Float64}
Generate a ploidy-aware genetic relationship matrix (GRM) based on the methods described in Bell et al. (2017) and VanRaden et al. (2008).
Arguments
genomes::Genomes
: Genomic data structure containing allele frequenciesploidy::Int64
: Number of chromosome copies in the organism (default: 2)idx_entries::Union{Nothing,Vector{Int64}}
: Optional indices to select specific entries (default: nothing)idx_loci_alleles::Union{Nothing,Vector{Int64}}
: Optional indices to select specific loci/alleles (default: nothing)verbose::Bool
: If true, displays a heatmap of the resulting GRM (default: false)
Returns
Matrix{Float64}
: A symmetric genetic relationship matrix with dimensions (n × n), where n is the number of entries
Details
The function implements the following steps:
- Extracts and processes genomic data
- Calculates allele frequencies and centers the data
- Computes the GRM using VanRaden's method
- Ensures matrix invertibility by adding small values to the diagonal if necessary
Note
The diagonal elements may be slightly inflated to ensure matrix invertibility for downstream analyses.
Example
julia> genomes = simulategenomes(l=1_000, verbose=false);
julia> grm = grmploidyaware(genomes);
julia> size(grm.genomic_relationship_matrix), issymmetric(grm.genomic_relationship_matrix)
((100, 100), true)
julia> det(grm.genomic_relationship_matrix) > eps(Float64)
true
GenomicBreedingCore.grmsimple
— Methodgrmsimple(
genomes::Genomes;
idx_entries::Union{Nothing,Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
verbose::Bool = false
)::Matrix{Float64}
Generate a simple genetic relationship matrix (GRM) from genomic data.
Arguments
genomes::Genomes
: Genomic data structure containing allele frequenciesidx_entries::Union{Nothing,Vector{Int64}}
: Optional indices to select specific entries/individualsidx_loci_alleles::Union{Nothing,Vector{Int64}}
: Optional indices to select specific loci/allelesverbose::Bool
: If true, displays a heatmap visualization of the GRM
Returns
Matrix{Float64}
: A symmetric positive definite genetic relationship matrix
Details
The function computes a genetic relationship matrix by:
- Converting genomic data to a numerical matrix
- Computing GRM as G * G' / ncol(G)
- Adding small positive values to diagonal elements if necessary to ensure matrix invertibility
Notes
- The resulting matrix is always symmetric
- Diagonal elements may be slightly inflated to ensure matrix invertibility
- The matrix dimensions will be n×n where n is the number of entries/individuals
Example
julia> genomes = simulategenomes(l=1_000, verbose=false);
julia> grm = grmsimple(genomes);
julia> size(grm.genomic_relationship_matrix), issymmetric(grm.genomic_relationship_matrix)
((100, 100), true)
julia> det(grm.genomic_relationship_matrix) > eps(Float64)
true
GenomicBreedingCore.histallelefreqs
— Methodhistallelefreqs(genomes::Genomes)::Nothing
Plot a histogram of allele frequencies from a Genomes object.
Arguments
genomes::Genomes
: A Genomes object containing allele frequency data in itsallele_frequencies
field
Returns
Nothing
: Displays a histogram plot and returns nothing
Description
Creates and displays a vertical histogram of non-missing allele frequencies using UnicodePlots. The histogram shows frequency distribution in the range [0,1] with 50 bins.
Example
julia> genomes = simulategenomes(n=100, l=10_000, n_alleles=3, n_populations=3, verbose=false);
julia> histallelefreqs(genomes)
GenomicBreedingCore.impute
— Methodimpute(
genomes::Genomes;
max_n_loci_per_chrom::Int64 = 100_000,
n_reps::Int64 = 2,
optim_n::Int64 = 10,
min_l_loci::Int64 = 10,
min_k_entries::Int64 = 2,
verbose::Bool = false,
)::Tuple{Genomes, Float64}
Impute missing allele frequencies in a genomic dataset using optimised allele frequency k-nearest neighbours imputation (AF-kNNi).
Arguments
genomes::Genomes
: AGenomes
struct containing the genomic data.max_n_loci_per_chrom::Int64
: The maximum number of loci per chromosome. Default is 100,000.n_reps::Int64
: The number of repetitions for cross-validation during optimisation. Default is 2.optim_n::Int64
: The number of optimisation steps for each parameter. Default is 10.min_l_loci::Int64
: The minimum number of loci required to be linked to the focal locus. Default is 10.min_k_entries::Int64
: The minimum number of nearest neighbours required for imputation. Default is 2.verbose::Bool
: Iftrue
, prints detailed progress information. Default isfalse
.
Returns
Tuple{Genomes, Float64}
: A tuple containing:- An updated
Genomes
struct with imputed allele frequencies. - The mean absolute error (MAE) of the imputed values.
- An updated
Details
This function imputes missing allele frequencies in a genomic dataset using an optimised k-nearest neighbors (KNN) imputation method. The function performs the following steps:
- Input Validation: Checks the validity of the input arguments, ensuring the Genomes struct is not corrupted, the number of entries and loci are within valid ranges, and the parameters are correctly specified.
- Chromosome Division: Divides the allele frequencies into mock scaffolds if the number of loci per chromosome exceeds the specified maximum.
- Linkage Disequilibrium (LD) Estimation: Estimates LD between loci using Pearson's correlation for each chromosome.
- Imputation Process: For each locus-allele per chromosome:
- Identifies entries with missing data.
- Optimises the parameters minlocicorr and maxentriesdist using cross-validation.
- Computes distances between entries using the optimised minlocicorr.
- Imputes missing data using the weighted means of the nearest neighbors based on the optimised maxentriesdist.
- Output: Returns the updated Genomes struct with imputed allele frequencies and the mean absolute error (MAE) of the imputed values.
Notes
- The function is designed to handle the following cases:
- If all loci have undetermined level of linkage with the focal locus (i.e. -Inf values in the LD matrix), then all the loci-alleles are selected to estimate distances.
- If the number of linked loci is below the specified minimum, then the minimum number of loci most correlated to the focal locus are selected to estimate distances.
- If the number of nearest neighbours is less than the specified minimum, then then the minimum number of closest neighbours are used in to compute the weighted average allele frequency.
- The verbose option provides additional insights into the imputation process by printing progress information.
- The function is designed to handle large genomic datasets efficiently by dividing loci into mock scaffolds if necessary.
Throws
- ArgumentError: If the Genomes struct is corrupted.
- ArgumentError: If the number of entries in the Genomes struct is less than 2.
- ArgumentError: If the number of loci per chromosome is less than 10.
- ArgumentError: If n_reps is not between 1 and the number of entries.
- ArgumentError: If minlloci is not between 10 and the number of loci per chromosome.
- ArgumentError: If minkentries is not between 2 and the number of entries.
- ErrorException: If an error occurs during the imputation process.
Example
julia> genomes = simulategenomes(n=100, l=1_000, sparsity=0.3, verbose=false);
julia> genomes_imputed, mae = impute(genomes);
julia> dimensions(genomes)["n_missing"] == 30_000
true
julia> dimensions(genomes_imputed)["n_missing"] == 0
true
GenomicBreedingCore.inflatediagonals!
— Methodinflatediagonals!(X::Matrix{Float64}; max_iter::Int64=1_000, verbose::Bool=false)::Nothing
Ensure matrix invertibility by iteratively inflating diagonal elements until both the determinant is nonzero and the matrix is positive definite.
Arguments
X::Matrix{Float64}
: Input square symmetric matrix to be modified in-placemax_iter::Int64=1_000
: Maximum number of iterationsverbose::Bool=false
: If true, prints information about the inflation process
Details
The function adds progressively larger values to the diagonal elements until the matrix becomes invertible (det(X) > eps(Float64)) and positive definite, or the maximum number of iterations is reached. The initial inflation value ϵ is set to the maximum absolute value in the matrix and increases slightly in each iteration by a factor of (1 + eps(Float64)).
The function returns early if the matrix is already positive definite and has a nonzero determinant.
Throws
ArgumentError
: If the input matrix is not symmetricArgumentError
: If the input matrix is not squareArgumentError
: If the input matrix contains NaN valuesArgumentError
: If the input matrix contains Inf values
Returns
Nothing
. The input matrix X
is modified in-place.
Example
julia> x::Vector{Float64} = rand(10);
julia> X::Matrix{Float64} = x * x';
julia> inflatediagonals!(X);
julia> det(X) > eps(Float64)
true
GenomicBreedingCore.instantiateblr
— Methodinstantiateblr(; trait::String, factors::Vector{String}, df::DataFrame,
other_covariates::Union{Vector{String}, Nothing}=nothing,
saturated_model::Bool=false,
verbose::Bool=false)::BLR
Extract design matrices and response variable for Bayesian modelling of factorial experiments.
Arguments
trait::String
: Name of the response variable (dependent variable) in the DataFramefactors::Vector{String}
: Vector of factor names (independent variables) to be included in the modeldf::DataFrame
: DataFrame containing the dataother_covariates::Union{Vector{String}, Nothing}=nothing
: Additional numeric covariates to include in the modelsaturated_model::Bool=false
: If true, includes all possible interactions between factorsverbose::Bool=false
: If true, prints additional information during execution
Returns
A BLR struct containing:
entries
: Vector of entry identifiersy
: Response variable vectorŷ
: Predicted values vectorϵ
: Residuals vectorXs
: Dict mapping factors to design matricesΣs
: Dict mapping factors to covariance matricescoefficients
: Dict mapping factors to coefficient vectorscoefficient_names
: Dict mapping factors to coefficient namesdiagnostics
: DataFrame with MCMC diagnostics (added after sampling)
Model Details
Creates design matrices for hierarchical factorial experiments with:
- Main effects:
- Genetic effects:
entries
- Environmental effects:
sites
,seasons
,years
- Spatial effects:
blocks
,rows
,cols
- Interaction effects:
- GxE interactions (entries × environment)
- Environmental interactions (between environment factors)
- Spatial interactions (between spatial factors)
Implementation Notes
- Uses FullDummyCoding for categorical factors
- Converts factors to strings
- Validates numeric traits/covariates
- Creates design matrices with full levels
- Handles continuous covariates as Float64
- Uses memory efficient boolean matrices for factors
- Assigns identity matrices as initial covariance structures
Throws
ArgumentError
: For missing/invalid inputsErrorException
: For matrix extraction failures
Example
julia> genomes = simulategenomes(n=5, l=1_000, verbose=false);
julia> trials, simulated_effects = simulatetrials(genomes = genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=3, verbose=false);
julia> df = tabularise(trials);
julia> blr_1 = instantiateblr(trait = trials.traits[1], factors = ["rows", "cols"], df = df, other_covariates = nothing, verbose = false);
julia> blr_2 = instantiateblr(trait = trials.traits[1], factors = ["years", "seasons", "sites", "entries"], df = df, other_covariates = [trials.traits[2]], verbose = false);
julia> length(blr_1.Xs) == 3
true
julia> size(blr_1.Xs["rows"]) == (15, 3)
true
julia> length(blr_2.Xs) == 9
true
julia> size(blr_2.Xs["entries & seasons & sites"]) == (15, 5)
true
GenomicBreedingCore.knni
— Methodknni(; qs::Vector{Float64}, d::Vector{Float64}, max_entries_dist::Float64 = 0.1, min_k_entries::Int64 = 2)::Float64
Impute missing allele frequencies using the k-nearest neighbours (kNN) method based on distances.
Arguments
qs::Vector{Float64}
: A vector of non-missing allele frequencies at the focal locus.d::Vector{Float64}
: A vector of distances corresponding to the entries inqs
.max_entries_dist::Float64
: The maximum distance threshold to consider an entry as a neighbour. Default is 0.1.min_k_entries::Int64
: The minimum number of nearest neighbours required for imputation. Default is 2.
Returns
Float64
: The imputed allele frequency at the focal locus.
Details
This function imputes missing allele frequencies at a focal locus using the k-nearest neighbours (KNN) method. The function performs the following steps:
- Input Validation: Ensures that the lengths of qs and d are equal. Throws an ArgumentError if they are not.
- Identify Nearest Neighbors: Finds the indices of the nearest neighbours within the specified distance threshold (maxentriesdist). If the number of neighbours is less than the minimum required (minkentries), the nearest neighbours are selected based on the smallest distances.
- Impute Missing Data: Calculates the weighted mean of the allele frequencies of the nearest neighbours. The weights are based on the inverse of the distances to the focal locus, ensuring that closer neighbours have a higher influence on the imputed value.
- Output: Returns the imputed allele frequency.
Notes
- The function uses a small epsilon value (eps(Float64)) to avoid division by zero when calculating weights.
- If the sum of distances is infinite, a uniform weight is assigned to each neighbour.
- The function is designed to handle cases where the number of nearest neighbours is less than the specified minimum by selecting the closest available neighbours.
Throws
- ArgumentError: If the lengths of qs and d are not equal.
Example
julia> genomes = simulategenomes(n=50, l=1_000, sparsity=0.3, verbose=false);
julia> chromosomes = divideintomockscaffolds(genomes, max_n_loci_per_chrom=100);
julia> chroms_uniq, LDs = estimateld(genomes, chromosomes=chromosomes);
julia> rm.(readdir()[.!isnothing.(match.(Regex("jld2"), readdir()))]);
julia> k = 1; LD = LDs[k]; chrom = chroms_uniq[k]; idx_focal_locus = 1;
julia> idx_loci_alleles_per_chrom = findall(chromosomes .== chrom);
julia> min_loci_corr = 0.9; min_l_loci = 2;
julia> D = estimatedistances(genomes, LD=LD, idx_focal_locus=idx_focal_locus, idx_loci_alleles_per_chrom=idx_loci_alleles_per_chrom, min_loci_corr=min_loci_corr, min_l_loci=min_l_loci);
julia> i = findall(ismissing.(genomes.allele_frequencies[:, idx_focal_locus]))[1];
julia> idx_entries_not_missing = findall(.!ismissing.(genomes.allele_frequencies[:, idx_focal_locus]));
julia> qs::Vector{Float64} = genomes.allele_frequencies[idx_entries_not_missing, idx_focal_locus];
julia> d = D[i, idx_entries_not_missing];
julia> q̄ = knni(qs=qs, d=d, max_entries_dist=0.25, min_k_entries=2);
julia> q̄ > 0.0
true
GenomicBreedingCore.knnioptim
— Methodknnioptim(
genomes::Genomes;
j::Int64,
idx_focal_locus::Int64,
idx_loci_alleles_per_chrom::Vector{Int64},
idx_entries_not_missing::Vector{Int64},
LD::Matrix{Float64},
n_reps::Int64 = 2,
optim_n::Int64 = 10,
min_l_loci::Int64 = 10,
min_k_entries::Int64 = 2,
verbose::Bool = false,
)::Dict{String,Float64}
Optimise parameters for allele frequency k-nearest neighbours imputation (AF-kNNi) using cross-validation, i.e. the minimum loci correlation (min_loci_corr
) and maximum entries distance (max_entries_dist
).
Arguments
genomes::Genomes
: AGenomes
struct containing the genomic data.j::Int64
: The global index of the locus for which imputation is performed, i.e. index corresponding to the wholegenomes.allele_frequencies
matrix.idx_focal_locus::Int64
: The local index of the focal locus, i.e. index at the current chromosome or mock scaffold.idx_loci_alleles_per_chrom::Vector{Int64}
: A vector of global indices for loci-alleles per chromosome, i.e. indices corresponding the wholegenomes.allele_frequencies
matrix.idx_entries_not_missing::Vector{Int64}
: A vector of indices for entries with non-missing data at the focal locus.LD::Matrix{Float64}
: A linkage disequilibrium (LD) matrix.n_reps::Int64
: The number of repetitions for cross-validation. Default is 2.optim_n::Int64
: The number of optimisation steps for each parameter. Default is 10.min_l_loci::Int64
: The minimum number of loci required to be linked to the focal locus. Default is 10.min_k_entries::Int64
: The minimum number of nearest neighbours required for imputation. Default is 2.verbose::Bool
: Iftrue
, prints detailed progress information and plots the mean absolute error (MAE) heatmap. Default isfalse
.
Returns
Dict{String, Float64}
: A dictionary containing the optimised parameters:"min_loci_corr"
: The optimised minimum correlation threshold for loci to be considered linked."max_entries_dist"
: The optimised maximum distance threshold to consider an entry as a neighbor."mae"
: The mean absolute error (MAE) of the optimised parameters.
Details
This function optimises the minimum loci correlation (min_loci_corr
) and maximum entries distance (max_entries_dist
) for allele frequency k-nearest neighbours imputation (AF-kNNi) using cross-validation. The function performs the following steps:
- Input Validation: Ensures that there are at least 2 entries with non-missing data at the focal locus. Throws an ArgumentError if this condition is not met.
- Parameter Ranges: Defines the ranges for the minimum correlation threshold (
min_loci_corr
) and the maximum distance threshold (maxentriesdist) for optimisation. - Cross-Validation: For each combination of
min_loci_corr
andmax_entries_dist
, the function:- Estimates pairwise distances between entries using the
estimatedistances
function. - Samples entries with non-missing data to simulate missing data.
- Performs AF-kNNi using the
knni
function and calculates the mean absolute error (MAE) for the imputed values.
- Estimates pairwise distances between entries using the
- Optimisation: Identifies the combination of parameters that minimises the MAE.
- Output: Returns a dictionary containing the optimised parameters and the corresponding MAE. If verbose is true, prints detailed progress information and plots a heatmap of the MAE.
Notes
- The function is designed to handle the following cases:
- If all loci have undetermined level of linkage with the focal locus (i.e. -Inf values in the LD matrix), then all the loci-alleles are selected to estimate distances.
- If the number of linked loci is below the specified minimum, then the minimum number of loci most correlated to the focal locus are selected to estimate distances.
- If the number of nearest neighbours is less than the specified minimum, then then the minimum number of closest neighbours are used in to compute the weighted average allele frequency.
- The verbose option provides additional insights into the optimisation process by printing the parameters and plotting the MAE heatmap.
Throws
- ArgumentError: If there are less than 2 entries with non-missing data at the focal locus.
Example
julia> genomes = simulategenomes(n=123, l=1_000, sparsity=0.3, rel_dist_multiplier=10.0, verbose=false);
julia> chromosomes = divideintomockscaffolds(genomes, max_n_loci_per_chrom=100);
julia> chroms_uniq, LDs = estimateld(genomes, chromosomes=chromosomes);
julia> rm.(readdir()[.!isnothing.(match.(Regex("jld2"), readdir()))]);
julia> k = 1; LD = LDs[k]; chrom = chroms_uniq[k]; j = idx_focal_locus = 1;
julia> idx_loci_alleles_per_chrom = findall(chromosomes .== chrom);
julia> min_loci_corr = 0.9; min_l_loci = 2;
julia> D = estimatedistances(genomes, LD=LD, idx_focal_locus=idx_focal_locus, idx_loci_alleles_per_chrom=idx_loci_alleles_per_chrom, min_loci_corr=min_loci_corr, min_l_loci=min_l_loci);
julia> i = findall(ismissing.(genomes.allele_frequencies[:, idx_focal_locus]))[1];
julia> idx_entries_not_missing = findall(.!ismissing.(genomes.allele_frequencies[:, idx_focal_locus]));
julia> params = knnioptim(genomes, j=j, idx_focal_locus=idx_focal_locus, idx_loci_alleles_per_chrom=idx_loci_alleles_per_chrom, idx_entries_not_missing=idx_entries_not_missing, LD=LD);
julia> params["min_loci_corr"] > 0.5
true
julia> params["max_entries_dist"] < 0.5
true
julia> params["mae"] < 0.1
true
GenomicBreedingCore.loci
— Methodloci(genomes::Genomes; verbose::Bool = false)::Tuple{Vector{String},Vector{Int64},Vector{Int64},Vector{Int64}}
Extract genomic positional information from a Genomes
object, returning a tuple of vectors containing chromosome names, positions, and locus boundary indices.
Arguments
genomes::Genomes
: A Genomes object containing genomic dataverbose::Bool = false
: If true, displays a progress bar during computation
Returns
A tuple containing four vectors:
chromosomes::Vector{String}
: Names of chromosomespositions::Vector{Int64}
: Positions within chromosomesloci_ini_idx::Vector{Int64}
: Starting indices for each locusloci_fin_idx::Vector{Int64}
: Ending indices for each locus
Examples
julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=4, verbose=false);
julia> chromosomes, positions, loci_ini_idx, loci_fin_idx = loci(genomes);
julia> length(chromosomes), length(positions), length(loci_ini_idx), length(loci_fin_idx)
(1000, 1000, 1000, 1000)
GenomicBreedingCore.loci_alleles
— Methodloci_alleles(genomes::Genomes; verbose::Bool = false)::Tuple{Vector{String},Vector{Int64},Vector{String}}
Extract chromosomes, positions, and alleles information from a Genomes
object.
Returns a tuple of three vectors containing:
- Chromosomes identifiers as strings
- Base-pair positions as integers
- Allele identifiers as strings
Each vector has length equal to the total number of loci-allele combinations in the genome. The function processes the data in parallel using multiple threads for performance optimization.
Arguments
genomes::Genomes
: A valid Genomes object containing loci and allele informationverbose::Bool = false
: If true, displays a progress bar during extraction
Returns
Tuple{Vector{String},Vector{Int64},Vector{String}}
: A tuple containing three vectors:- chromosomes: Vector of chromosome identifiers
- positions: Vector of base-pair positions
- alleles: Vector of allele identifiers
Throws
ArgumentError
: If the Genomes struct dimensions are invalid or corrupted
Examples
julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=4, verbose=false);
julia> chromosomes, positions, alleles = loci_alleles(genomes);
julia> length(chromosomes), length(positions), length(alleles)
(3000, 3000, 3000)
GenomicBreedingCore.makex
— Methodmakex(varex::Vector{String}; df::DataFrame)::Tuple{Matrix{Float64}, Vector{String}, Vector{String}}
Constructs a design matrix X
based on categorical variables specified in varex
from the given DataFrame df
.
Arguments
varex::Vector{String}
: A vector of column names indf
representing categorical variables to be encoded.df::DataFrame
: The input DataFrame containing the data.
Returns
A tuple (X, X_vars, X_labels)
where:
X::Matrix{Float64}
: The design matrix with one-hot encoded columns for each categorical variable.X_vars::Vector{String}
: A vector indicating the variable name corresponding to each column inX
.X_labels::Vector{String}
: A vector of unique labels for each categorical variable.
Example
GenomicBreedingCore.maskmissing!
— Methodmaskmissing!(genomes::Genomes; verbose::Bool = false)
Update the mask matrix for missing values in the genomes struct.
This function updates the mask matrix in a Genomes
struct by marking positions where allele frequencies are not missing. The mask is set to true
for non-missing values and false
for missing values.
Arguments
genomes::Genomes
: A Genomes struct containing genomic data including allele frequencies and mask matrixverbose::Bool=false
: If true, displays a progress bar during computation
Throws
ArgumentError
: If the dimensions in the Genomes struct are inconsistent
Effects
- Modifies the
mask
field of the inputgenomes
struct in-place
Threads
- Uses multi-threading for parallel computation across loci
- Uses a thread lock for safe concurrent access to shared memory
Example
julia> genomes = simulategenomes(n=10, sparsity=0.3, verbose=false);
julia> round(1.00 - mean(genomes.mask), digits=10)
0.0
julia> maskmissing!(genomes);
julia> round(1.00 - mean(genomes.mask), digits=10)
0.3
GenomicBreedingCore.plot
— Functionplot(genomes::Genomes, seed::Int64 = 42)::Nothing
Generate visualization plots for allele frequencies in genomic data.
For each population in the dataset, creates three plots:
- Histogram of per-entry allele frequencies
- Histogram of mean allele frequencies per locus
- Correlation heatmap of allele frequencies between loci
Arguments
genomes::Genomes
: A Genomes struct containing allele frequency dataseed::Int64=42
: Random seed for reproducibility of sampling loci
Returns
Nothing
: Displays plots but doesn't return any value
Notes
- Uses up to 100 randomly sampled loci for visualization
- Handles missing values in the data
- Displays folded frequency spectra (both q and 1-q)
- Will throw ArgumentError if the Genomes struct is corrupted
Examples
julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=4, verbose=false);
julia> GenomicBreedingCore.plot(genomes)
GenomicBreedingCore.plot
— Functionplot(fit::Fit, distribution::Any=[TDist(1), Normal()][2], α::Float64=0.05)
Generate diagnostic plots for genetic association analysis results.
Arguments
fit::Fit
: A Fit object containing the association analysis results, specifically theb_hat
field with effect sizesdistribution::Any
: The null distribution for p-value calculation. Defaults to Normal distributionα::Float64
: Significance level for multiple testing correction (Bonferroni). Defaults to 0.05
Returns
Displays three plots:
- Histogram showing the distribution of effect sizes
- Manhattan plot showing -log10(p-values) across loci with Bonferroni threshold
- Q-Q plot comparing observed vs expected -log10(p-values)
Examples
julia> distribution = [TDist(1), Normal()][2];
julia> fit = Fit(n=100, l=10_000); fit.b_hat = rand(distribution, 10_000); α = 0.05;
julia> GenomicBreedingCore.plot(fit);
GenomicBreedingCore.plot
— Methodplot(phenomes::Phenomes; nbins::Int64 = 10)::Nothing
Generate diagnostic plots for phenotypic data stored in a Phenomes
struct.
Arguments
phenomes::Phenomes
: A Phenomes struct containing phenotypic datanbins::Int64=10
: Number of bins for the histograms (optional)
Description
For each population in the dataset:
- Creates histograms showing the distribution of each trait
- Generates a heatmap of trait correlations for traits with non-zero variance
Notes
- Skips traits with all missing, NaN, or infinite values
- Only includes traits with variance > 1e-10 in correlation analysis
- Requires at least 3 data points to generate a histogram
- Uses UnicodePlots for visualization in terminal
Examples
julia> phenomes = Phenomes(n=10, t=3); phenomes.entries = string.("entry_", 1:10); phenomes.populations .= "pop_1"; phenomes.traits = ["A", "B", "C"]; phenomes.phenotypes = rand(10,3);
julia> GenomicBreedingCore.plot(phenomes);
GenomicBreedingCore.plot
— Methodplot(tebv::TEBV)
Create a visualization of True Estimated Breeding Values (TEBV) analysis results.
This function extracts phenomes from the TEBV object and generates a plot to visualize the breeding value estimates.
Arguments
tebv::TEBV
: A TEBV object containing the analysis results
Returns
- A plot object representing the visualization of the phenomes data
GenomicBreedingCore.plot
— Methodplot(trials::Trials; nbins::Int64 = 10)::Nothing
Generate a comprehensive visualization of trial data through histograms, correlation heatmaps, and bar plots.
Arguments
trials::Trials
: A Trials struct containing the trial data to be visualizednbins::Int64=10
: Number of bins for the histogram plots (optional)
Details
The function creates three types of plots:
- Histograms for each trait within each population, showing the distribution of trait values
- Correlation heatmaps between traits for each population
- Bar plots showing mean trait values across different trial factors:
- Years
- Seasons
- Harvests
- Sites
- Replications
- Rows
- Columns
- Populations
Missing, NaN, or infinite values are automatically filtered out before plotting.
Returns
Nothing
: The function displays plots but does not return any value
Notes
- Requires valid trial data with non-zero variance for correlation plots
- Uses UnicodePlots for visualization in terminal
- Skips plotting for traits with insufficient data points
Examples
julia> trials, _ = simulatetrials(genomes = simulategenomes(verbose=false), verbose=false);
julia> GenomicBreedingCore.plot(trials);
GenomicBreedingCore.removemissnaninf
— Methodremovemissnaninf(trials::Trials)::Trials
Removes missing (missing
), NaN
, and Inf
values from the Trials
struct.
Arguments
trials::Trials
: ATrials
struct containing phenotypic and trial data.
Returns
- A new
Trials
struct with:- Trait columns that are completely missing,
NaN
, orInf
removed. - Rows with any missing,
NaN
, orInf
values in phenotypes removed.
- Trait columns that are completely missing,
Throws
- If the
Trials
struct is corrupted (fails dimension checks), anArgumentError
is thrown. - If no rows are removed (i.e., all rows are valid), the original
Trials
struct is returned. - If the resulting
Trials
struct fails dimension checks after filtering, aDimensionMismatch
error is thrown.
Example
julia> trials, _ = simulatetrials(genomes = simulategenomes(n=50, l=100, verbose=false), f_add_dom_epi=rand(13,3), sparsity=0.1, verbose=false);
julia> filtered_trials = removemissnaninf(trials);
julia> D1 = dimensions(trials); D2 = dimensions(filtered_trials);
julia> (D1["n_missing"] > D2["n_missing"]) && (D2["n_missing"] == D2["n_nan"] == D2["n_inf"] == 0)
true
GenomicBreedingCore.removespatialeffects!
— Methodremovespatialeffects!(df::DataFrame; factors::Vector{String}, traits::Vector{String},
other_covariates::Union{Vector{String},Nothing} = nothing,
Σ_type::Symbol = :empirical,
ρs::Dict{String, Float64} = Dict("rows" => 0.5, "cols" => 0.5),
n_iter::Int64 = 10_000, n_burnin::Int64 = 1_000,
seed::Int64 = 1234, verbose::Bool = false)::Tuple{Vector{String}, Dict{String, DataFrame}}
Remove spatial effects from trait measurements in field trials using Bayesian linear regression.
Arguments
df::DataFrame
: DataFrame containing trial data with columns for traits, spatial factors, and harvestsfactors::Vector{String}
: Vector of factor names to be considered in the modeltraits::Vector{String}
: Vector of trait names to be spatially adjustedother_covariates::Union{Vector{String},Nothing}
: Optional vector of additional covariatesΣ_type::Symbol
: Type of covariance structure to use (:empirical, :autoregressive, or :spherical) (default: :empirical)ρs::Dict{String, Float64}
: Correlation parameters for autoregressive structure, keys are factor names (default: rows=0.5, cols=0.5)n_iter::Int64
: Number of MCMC iterations (default: 10_000)n_burnin::Int64
: Number of burn-in iterations (default: 1_000)seed::Int64
: Random seed for reproducibility (default: 1234)verbose::Bool
: If true, prints detailed information (default: false)
Returns
In addition to mutating df
by adding the spatially adjusted traits as new columns with the prefix "SPATADJ-", it returns a tuple containing:
- Vector of remaining significant factors after spatial adjustment
- Dictionary mapping harvest-trait combinations to diagnostic DataFrame results
Details
This function performs spatial adjustment for traits measured in field trials by:
- Identifying spatial factors (blocks, rows, columns) and creating design matrices
- Fitting Bayesian linear model per harvest to account for:
- Spatial effects (blocks, rows, columns and interactions)
- Choice of covariance structure:
- Empirical: Estimated using linear shrinkage with unequal variances
- Autoregressive: AR(1) structure with specified correlation parameters
- Spherical: Identity matrix scaled by variance component
- Additional numeric covariates if specified
- Creating spatially adjusted traits by adding intercept and residuals
- Removing redundant factors and retaining only unique design matrices
The spatial adjustment is only performed if blocks, rows or columns are present. Each harvest is treated separately to allow for year and site-specific spatial effects.
Variance component scalers are specified as follows:
- Rows, columns and other covariates use unique variance scalers (multiple_σs = true)
- Row-by-column interactions use a single spherical variance-covariance matrix (multiple_σs = false)
- This improves model tractability while allowing for flexible variance structures where needed
Notes
- Requires "entries" and "harvests" columns in input DataFrame
- Uses Bayesian linear regression via MCMC for spatial modeling
- Creates new columns with "SPATADJ-" prefix for adjusted traits
- Returns original DataFrame if no spatial factors present
- Automatically detects spatial factors via regex pattern "blocks|rows|cols"
- Supports empirical, autoregressive or spherical covariance structures
- Modifies the input DataFrame in-place
Example
julia> genomes = simulategenomes(n=5, l=1_000, verbose=false);
julia> trials, simulated_effects = simulatetrials(genomes = genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=3, verbose=false);
julia> df = tabularise(trials);
julia> factors, diagnostics = removespatialeffects!(df, factors = ["rows", "cols", "blocks"], traits = ["trait_1", "trait_2"], other_covariates=["trait_3"], n_iter=1_000, n_burnin=200, verbose = false);
julia> ("SPATADJ-trait_1" ∈ names(df)) && ("SPATADJ-trait_2" ∈ names(df))
true
julia> [(sum(d.rhat .< 1.01) < nrow(d)) && (sum(d.ess .>= 100) < nrow(d)) for (_, d) in diagnostics] == [true, true]
true
GenomicBreedingCore.simulateallelefreqs!
— Methodsimulateallelefreqs!(
allele_frequencies::Matrix{Union{Float64,Missing}},
locus_counter::Vector{UInt},
pb::Union{Nothing, ProgressMeter.Progress};
μ::Vector{Float64},
Σ::SparseMatrixCSC{Float64},
n_alleles::Int64,
idx_entries_per_population::Vector{Int64},
rng::TaskLocalRNG = Random.GLOBAL_RNG,
)::Nothing
Simulates allele frequencies for multiple loci and populations using a multivariate normal distribution.
Arguments
allele_frequencies::Matrix{Union{Float64,Missing}}
: Matrix to store the simulated allele frequencies for all locilocus_counter::Vector{UInt}
: Counter tracking the current locus positionpb::ProgressMeter.Progress
: Progress bar for tracking simulation progressμ::Vector{Float64}
: Mean vector for the multivariate normal distributionΣ::SparseMatrixCSC{Float64}
: Variance-covariance matrix for the multivariate normal distributionn_alleles::Int64
: Number of alleles per locusidx_entries_per_population::Vector{Int64}
: Indices for entries in each populationrng::TaskLocalRNG
: Random number generator (default: global RNG)
Description
Simulates allele frequencies for multiple populations and loci using a multivariate normal distribution. For each allele (except the last one) at each locus, frequencies are sampled from the specified distribution. For alleles after the first one, frequencies are adjusted to ensure they sum to 1.0 within each locus. All frequencies are bounded between 0.0 and 1.0.
Returns
Nothing
Throws
ArgumentError
: If length of allele_frequencies is less than or equal to length of μArgumentError
: If length of μ does not match the dimensions of ΣArgumentError
: If maximum index in idxentriesper_population exceeds number of entries
Note
The last allele frequency for each locus is implicitly determined as 1 minus the sum of other allele frequencies to ensure frequencies sum to 1.0.
Example
julia> chrom_lengths, chrom_loci_counts = simulatechromstruct(l=100, n_chroms=7, max_pos=135_000_000);
julia> positions, loci_alleles = simulateposandalleles(chrom_lengths=chrom_lengths, chrom_loci_counts=chrom_loci_counts, n_alleles=2);
julia> Σ_base = simulateldblocks(chrom_positions=positions[1], chrom_length=chrom_lengths[1], ld_corr_50perc_kb=1_000);
julia> μ, Σ = simulateperpopμΣ(Σ_base=Σ_base, μ_β_params=(0.5, 0.5));
julia> populations, idx_population_groupings = simulatepopgroups(n=123, n_populations=3);
julia> allele_frequencies::Matrix{Union{Missing, Float64}} = fill(missing, 123, 100); locus_counter::Vector{UInt} = [1]; pb = nothing;
julia> simulateallelefreqs!(allele_frequencies, locus_counter, pb, μ=μ, Σ=Σ, n_alleles=2, idx_entries_per_population=idx_population_groupings[1])
julia> sum(.!ismissing.(allele_frequencies[idx_population_groupings[1], 1:length(μ)])) == (length(idx_population_groupings[1]) * length(μ))
true
GenomicBreedingCore.simulatechromstruct
— Methodsimulatechromstruct(;l::Int64, n_chroms::Int64, max_pos::Int64)
Generate chromosome structure parameters for genome simulation.
Arguments
l::Int64
: Total number of loci to distribute across chromosomes (2 to 1e9)n_chroms::Int64
: Number of chromosomes (1 to 1e6)max_pos::Int64
: Maximum genome size in base pairs (10 to 160e9)
Returns
A tuple containing:
chrom_lengths::Vector{Int64}
: Vector of chromosome lengths in base pairschrom_loci_counts::Vector{Int64}
: Vector of loci counts per chromosome
Examples
julia> chrom_lengths, chrom_loci_counts = simulatechromstruct(l=10_000, n_chroms=7, max_pos=135_000_000)
([19285714, 19285714, 19285714, 19285714, 19285714, 19285714, 19285716], [1428, 1428, 1428, 1428, 1428, 1428, 1432])
GenomicBreedingCore.simulatecovarianceautocorrelated
— Functionsimulatecovarianceautocorrelated(p::Int64, ρ::Float64 = 0.75)::Matrix{Float64}
Generate a p × p autocorrelated covariance matrix where the correlation between elements decays exponentially with their distance.
Arguments
p::Int64
: Size of the square covariance matrixρ::Float64 = 0.75
: Autocorrelation parameter, must be between -1 and 1
Returns
Matrix{Float64}
: A p × p positive definite covariance matrix where Σᵢⱼ = ρ^(|i-j|)
Throws
ArgumentError
: If ρ is not between -1 and 1
Examples
julia> Σ = simulatecovarianceautocorrelated(7, 0.72);
julia> size(Σ)
(7, 7)
julia> (var(Σ) > 0.0) && (abs(det(Σ)) > 0.0)
true
julia> (Σ[:, 1] == Σ[1, :]) && (Σ[:, 2] == Σ[2, :])
true
julia> sum(diff(reverse(Σ[1, :])) .> 0.0) == 6
true
GenomicBreedingCore.simulatecovariancediagonal
— Methodsimulatecovariancediagonal(p::Int64, σ²::Vector{Float64})::Matrix{Float64}
Simulate a diagonal covariance matrix with specified variances σ² on the diagonal.
Arguments
p::Int64
: Dimension of the covariance matrixσ²::Vector{Float64}
: Vector of variance values for diagonal elements
Returns
Matrix{Float64}
: p × p diagonal covariance matrix
Examples
julia> Σ = simulatecovariancediagonal(7, rand(7));
julia> size(Σ)
(7, 7)
julia> var(Σ[diagind(Σ)]) > 0.0
true
GenomicBreedingCore.simulatecovariancekinship
— Methodsimulatecovariancekinship(p::Int64, genomes::Genomes)::Matrix{Float64}
Calculate a genomic relationship matrix (GRM) from simulated genomic data.
Arguments
p::Int64
: Expected number of entries/individuals in the genomic datagenomes::Genomes
: Genomic data structure containing genetic information
Returns
Matrix{Float64}
: A p × p genomic relationship matrix
Throws
ArgumentError
: If the providedp
doesn't match the number of entries ingenomes
Examples
julia> Σ = simulatecovariancekinship(7, simulategenomes(n=7, l=1_000, verbose=false));
julia> size(Σ)
(7, 7)
julia> (var(Σ) > 0.0) && (abs(det(Σ)) > 0.0)
true
GenomicBreedingCore.simulatecovariancerandom
— Functionsimulatecovariancerandom(p::Int64, seed::Int64 = 42)::Matrix{Float64}
Generate a random positive semidefinite covariance matrix of size p × p.
Arguments
p::Int64
: Dimension of the covariance matrix to generateseed::Int64 = 42
: Random seed for reproducibility
Returns
Matrix{Float64}
: A p × p positive semidefinite covariance matrix
Details
The function generates a random covariance matrix by:
- Creating a p × 1 random normal vector
- Computing the outer product of this vector with itself
- Inflating the diagonal elements to ensure positive definiteness
Examples
julia> Σ = simulatecovariancerandom(7, 123);
julia> size(Σ)
(7, 7)
julia> (var(Σ) > 0.0) && (abs(det(Σ)) > 0.0)
true
GenomicBreedingCore.simulatecovariancespherical
— Methodsimulatecovariancespherical(p::Int64, σ²::Float64)::Matrix{Float64}
Simulate a spherical covariance matrix with constant variance σ² on the diagonal.
Arguments
p::Int64
: Dimension of the covariance matrixσ²::Float64
: Constant variance value for diagonal elements
Returns
Matrix{Float64}
: p × p spherical covariance matrix
Examples
julia> Σ = simulatecovariancespherical(7, 2.15);
julia> size(Σ)
(7, 7)
julia> Σ[diagind(Σ)] == fill(2.15, 7)
true
GenomicBreedingCore.simulateeffects
— Methodsimulateeffects(; p::Int64 = 2, q::Int64 = 1, λ::Float64 = 1.00,
covar_details::Tuple{Function,Union{Float64,Vector{Float64},Int64,Genomes}} = (simulatecovariancespherical, 1.00),
seed::Int64 = 42)::Matrix{Float64}
Simulate correlated effects by sampling from a multivariate normal distribution.
This function generates a matrix of correlated effects by:
- Sampling means (μ) from an exponential distribution with parameter λ
- Creating a covariance matrix Σ using the specified covariance function and parameters
- Drawing samples from MvNormal(μ, Σ)
Arguments
p::Int64
: Number of correlated effects to simulate (default = 2)q::Int64
: Number of times to simulate the correlated effects from the same distribution (default = 1)λ::Float64
: Rate parameter of the exponential distribution for sampling means (default = 1.00)covar_details::Tuple{Function,Union{Float64,Vector{Float64},Int64,Genomes}}
: Tuple containing:- First element: Covariance simulation function to use
- Second element: Parameter(s) for the covariance function:
- Float64 for spherical or autocorrelated covariance
- Vector{Float64} for diagonal covariance
- Int64 for random covariance seed
- Genomes object for kinship covariance
seed::Int64
: Random number generator seed for reproducibility (default = 42)
Returns
Matrix{Float64}
: A p × q matrix where each column represents a set of correlated effects
Supported Covariance Functions
simulatecovariancespherical
: Spherical covariance with constant variance parametersimulatecovariancediagonal
: Diagonal covariance with vector of variancessimulatecovariancerandom
: Random covariance with seed parametersimulatecovarianceautocorrelated
: Autocorrelated covariance with correlation parametersimulatecovariancekinship
: Kinship-based covariance with Genomes object
Examples
julia> θ₀ = simulateeffects();
julia> sum(abs.(θ₀)) > 0.0
true
julia> p = 10; q = 5;
julia> θ₁ = simulateeffects(p=p, q=q, λ=0.50, covar_details=(simulatecovariancediagonal, rand(p)));
julia> (size(θ₁) == (p, q)) && (θ₀ != θ₁)
true
julia> θ₂ = simulateeffects(p=p, q=q, λ=0.50, covar_details=(simulatecovariancerandom, 123));
julia> (size(θ₂) == (p, q)) && (θ₁ != θ₂)
true
julia> θ₃ = simulateeffects(p=p, q=q, λ=0.50, covar_details=(simulatecovarianceautocorrelated, 0.71));
julia> (size(θ₃) == (p, q)) && (θ₂ != θ₃)
true
julia> genomes = simulategenomes(n=p, l=1_000, n_alleles=3, verbose=false);
julia> θ₄ = simulateeffects(p=p, q=q, λ=0.50, covar_details=(simulatecovariancekinship, genomes));
julia> (size(θ₄) == (p, q)) && (θ₃ != θ₄)
true
GenomicBreedingCore.simulategenomes
— Methodsimulategenomes(;
n::Int64 = 100,
n_populations::Int64 = 1,
l::Int64 = 10_000,
n_chroms::Int64 = 7,
n_alleles::Int64 = 2,
max_pos::Int64 = 135_000_000,
ld_corr_50perc_kb::Int64 = 1_000,
rel_dist_multiplier::Float64 = 10.0,
μ_β_params::Tuple{Float64,Float64} = (0.5, 0.5),
sparsity::Float64 = 0.0,
seed::Int64 = 42,
verbose::Bool = true
)::Genomes
Simulates genomic data with population structure and linkage disequilibrium.
Arguments
n::Int64
: Number of entries/individuals to simulate (1 to 1e9)n_populations::Int64
: Number of populations to simulate (1 to n)l::Int64
: Number of loci to simulate (2 to 1e9)n_chroms::Int64
: Number of chromosomes (1 to 1e6)n_alleles::Int64
: Number of alleles per locus (2 to 5, representing A, T, C, G, D)max_pos::Int64
: Maximum position in base pairs (10 to 160e9)ld_corr_50perc_kb::Int64
: Distance in kb where correlation decay reaches 50%rel_dist_multiplier::Float64
: Multiplier for maximum relative distance to consider in LD blocksμ_β_params::Tuple{Float64,Float64}
: Shape parameters for Beta distribution of allele frequenciessparsity::Float64
: Proportion of missing data to simulate (0.0 to 1.0)seed::Int64
: Random seed for reproducibilityverbose::Bool
: Whether to show progress bar and final plot
Returns
Genomes
: A struct containing:- entries: Vector of entry IDs
- populations: Vector of population assignments
- loci_alleles: Vector of locus-allele combinations
- allele_frequencies: Matrix of allele frequencies
- mask: Boolean matrix indicating valid data points
Details
- Simulates genomic data by:
- Generating chromosome lengths and loci positions
- Assigning alleles to loci
- Grouping entries into populations
- Simulating allele frequencies with linkage disequilibrium using multivariate normal distribution
- Adding optional sparsity (missing data)
- Chromosome lengths are distributed evenly, with any remainder added to last chromosome
- Loci positions are randomly sampled without replacement within each chromosome
- LD decay follows an exponential function: corr = 1/exp(r*d), where d is normalized distance
- Mean allele frequencies are sampled from Beta(α,β) distribution
- Population structure is implemented by sampling the mean allele frequencies per population
- For each entry and locus, allele frequencies with linkage disequilibrium are simulated by sampling a multivariate normal distribution per chromosome
- Missing data is randomly assigned if sparsity > 0
Throws
ArgumentError
: If input parameters are outside acceptable rangesDimensionMismatch
: If there's an error in the simulation process
Examples
julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=3, verbose=false);
julia> length(genomes.entries)
100
julia> length(genomes.populations)
100
julia> length(genomes.loci_alleles)
2000
julia> size(genomes.allele_frequencies)
(100, 2000)
julia> mean(ismissing.(genomes.allele_frequencies))
0.0
julia> rng::TaskLocalRNG = Random.seed!(123);
julia> idx = StatsBase.sample(rng, range(1, 2_000, step=2), 250, replace = false, ordered = true);
julia> correlations = StatsBase.cor(genomes.allele_frequencies[:, idx]);
julia> correlations[10,10] == 1.00
true
julia> correlations[10,10] > correlations[10,250]
true
julia> genomes = simulategenomes(n=100, l=10_000, n_alleles=3, sparsity=0.25, verbose=false);
julia> mean(ismissing.(genomes.allele_frequencies))
0.25
GenomicBreedingCore.simulategenomiceffects
— Methodsimulategenomiceffects(;
genomes::Genomes,
f_additive::Float64 = 0.01,
f_dominance::Float64 = 0.10,
f_epistasis::Float64 = 0.05,
seed::Int64 = 42,
)::Tuple{Matrix{Float64},Matrix{Float64}}
Simulate additive, dominance, and epistatic effects for multiple loci.
Arguments
genomes::Genomes
: Genome struct containingn
entries xp
loci-alleles combinationsf_additive::Float64
: Proportion of loci with non-zero additive effects (default = 0.01)f_dominance::Float64
: Proportion of additive loci with dominance effects (default = 0.10)f_epistasis::Float64
: Proportion of additive loci with epistasis effects (default = 0.05)seed::Int64
: Random seed for reproducibility (default = 42)
Returns
Tuple{Matrix{Float64},Matrix{Float64}}
:- First matrix (n x 3): Additive, dominance and epistasis effects per entry
- Second matrix (p x 3): Effects per locus-allele combination
Details
The genetic effects are simulated using diagonal variance-covariance matrices:
For additive effects: Uses a diagonal matrix with random variances for each of the
a
loci with maxnalleles-1 allele effects per locus.For dominance effects: Uses a diagonal matrix with random variances for each of the
d
loci, simulating one dominance effect per locus.For epistasis effects: Uses a diagonal matrix with random variances for each of the
e
loci with maxnalleles-1 allele effects per locus. The final epistatic effects are computed by multiplying allele frequencies and effects for all possible pairs of epistatic loci.
For all three types of effects, means are sampled from an exponential distribution (λ=1).
Examples
julia> genomes::Genomes = simulategenomes(n=100, l=2_000, n_alleles=3, verbose=false);
julia> G, B = simulategenomiceffects(genomes=genomes, f_additive=0.05, f_dominance=0.75, f_epistasis=0.25);
julia> size.([G, B])
2-element Vector{Tuple{Int64, Int64}}:
(100, 3)
(4000, 3)
julia> sum(B .!= 0.0, dims=1)
1×3 Matrix{Int64}:
200 75 50
GenomicBreedingCore.simulateldblocks
— Methodsimulateldblocks(;
chrom_positions::Vector{Int64},
chrom_length::Int64,
ld_corr_50perc_kb::Int64,
rel_dist_multiplier::Float64 = 10.0
)::SparseMatrixCSC{Float64}
Simulate linkage disequilibrium (LD) blocks by generating a sparse variance-covariance matrix.
Arguments
chrom_positions::Vector{Int64}
: Vector of positions on the chromosome for each locuschrom_length::Int64
: Total length of the chromosomeld_corr_50perc_kb::Int64
: Distance in kilobases at which the LD correlation decays to 50%rel_dist_multiplier::Float64
: Multiplier for maximum relative distance to consider (default: 10.0)
Returns
SparseMatrixCSC{Float64}
: A sparse variance-covariance matrix representing LD blocks
Details
The function creates a variance-covariance matrix where the correlation between loci decays exponentially with distance. The decay rate is calculated to achieve 50% correlation at the specified distance (ld_corr_50perc_kb
).
For computational efficiency, correlations between loci are set to zero if:
- The normalized distance is greater than reldistmultiplier * q, where q is the normalized LD decay distance (ldcorr50perckb / chromlength)
- The normalized distance is greater than 0.9 (90% of chromosome length)
The computation uses multi-threading with Threads.@threads
to parallelize the calculation of correlation values across loci positions.
Throws
ArgumentError
: If number of loci exceeds chromosome lengthArgumentError
: If LD correlation distance exceeds chromosome lengthArgumentError
: If reldistmultiplier is less than 1.0
Example
julia> chrom_lengths, chrom_loci_counts = simulatechromstruct(l=100, n_chroms=7, max_pos=135_000_000);
julia> positions, loci_alleles = simulateposandalleles(chrom_lengths=chrom_lengths, chrom_loci_counts=chrom_loci_counts, n_alleles=2);
julia> Σ = simulateldblocks(chrom_positions=positions[1], chrom_length=chrom_lengths[1], ld_corr_50perc_kb=1_000);
julia> size(Σ)
(14, 14)
GenomicBreedingCore.simulatemating
— Methodsimulatemating(;
parent_genomes::Genomes,
n_generations::Int = 1,
pop_size_per_gen::Vector{Int64} = [100],
seed::Int64 = 42,
verbose::Bool = false
)::Vector{Genomes}
Simulates mating processes across multiple generations using a multivariate normal distribution approach.
Arguments
parent_genomes::Genomes
: Initial parent genomic information containing allele frequenciesn_generations::Int
: Number of generations to simulate (default: 1)pop_size_per_gen::Vector{Int64}
: Vector of population sizes for each generation (default: [100])seed::Int64
: Random seed for reproducibility (default: 42)verbose::Bool
: Whether to print progress messages (default: false)
Returns
Vector{Genomes}
: Vector of genomes across generations
Description
This function simulates genetic inheritance across generations by:
- Sampling progeny allele frequencies using multivariate normal distribution
- Enforcing biological constraints (frequencies between 0 and 1)
- Normalizing frequencies for multiallelic loci
- Displaying frequency histograms for each generation
The simulation maintains allele frequency correlations within chromosomes and handles multiallelic loci by ensuring their frequencies sum to 1.
Throws
ArgumentError
: If parent genomes contain missing values or invalid dimensions
Example
julia> parent_genomes = simulategenomes(n=5, l=10_000, n_alleles=3, verbose=false);
julia> great_great_offspring_genomes = simulatemating(parent_genomes=parent_genomes, n_generations=3, pop_size_per_gen=[10, 20, 30]);
julia> [length(x.entries) for x in great_great_offspring_genomes] == [5, 10, 20, 30]
true
GenomicBreedingCore.simulateperpopμΣ
— MethodsimulateperpopμΣ(;
Σ_base::SparseMatrixCSC{Float64},
μ_β_params::Tuple{Float64,Float64} = (0.5, 0.5),
rng::TaskLocalRNG = Random.GLOBAL_RNG
) -> Tuple{Vector{Float64}, SparseMatrixCSC{Float64}}
Simulate per-population mean allele frequencies and their variance-covariance matrix.
Arguments
Σ_base::SparseMatrixCSC{Float64}
: Base variance-covariance matrix to be scaledμ_β_params::Tuple{Float64,Float64}
: Parameters (α, β) for Beta distribution used to generate mean allele frequencies (default: (0.5, 0.5))rng::TaskLocalRNG
: Random number generator for reproducibility (default: GLOBAL_RNG)
Returns
Tuple{Vector{Float64}, SparseMatrixCSC{Float64}}
: A tuple containing:- Vector of mean allele frequencies (μ)
- Scaled variance-covariance matrix (Σ)
Details
The function:
- Samples mean allele frequencies from a Beta distribution
- Scales the variance-covariance matrix based on allele frequencies
- Ensures positive definiteness of the resulting matrix through iterative adjustment
The variance scaling is performed such that loci closer to fixation (0.0 or 1.0) have lower variance, following population genetics expectations.
Example
julia> chrom_lengths, chrom_loci_counts = simulatechromstruct(l=100, n_chroms=7, max_pos=135_000_000);
julia> positions, loci_alleles = simulateposandalleles(chrom_lengths=chrom_lengths, chrom_loci_counts=chrom_loci_counts, n_alleles=2);
julia> Σ_base = simulateldblocks(chrom_positions=positions[1], chrom_length=chrom_lengths[1], ld_corr_50perc_kb=1_000);
julia> μ, Σ = simulateperpopμΣ(Σ_base=Σ_base, μ_β_params=(0.5, 0.5));
julia> length(μ) == length(positions[1])
true
julia> size(Σ) == size(Σ_base)
true
julia> abs(sum(Σ .- Σ_base)) > 0.0
true
GenomicBreedingCore.simulatepopgroups
— Methodsimulatepopgroups(; n::Int64, n_populations::Int64)::Tuple{Vector{String}, Vector{Vector{Int64}}}
Simulate population groups by dividing a total number of samples into populations.
Arguments
n::Int64
: Total number of samples (between 1 and 1 billion)n_populations::Int64
: Number of populations to create (between 1 and n)
Returns
A tuple containing:
Vector{String}
: Vector of population labels for each sampleVector{Vector{Int64}}
: Vector of vectors containing indices for each population group
Example
julia> populations, idx_population_groupings = simulatepopgroups(n=100, n_populations=3);
julia> length(populations) == 100
true
julia> length(idx_population_groupings) == 3
true
julia> pops = fill("", 100); [pops[x] .= unique(populations)[i] for (i, x) in enumerate([x for x in (idx_population_groupings)])];
julia> pops == populations
true
GenomicBreedingCore.simulateposandalleles
— Methodsimulateposandalleles(;
chrom_lengths::Vector{Int64},
chrom_loci_counts::Vector{Int64},
n_alleles::Int64,
allele_choices::Vector{String} = ["A", "T", "C", "G", "D"],
allele_weights::Weights{Float64,Float64,Vector{Float64}} = StatsBase.Weights([1.0, 1.0, 1.0, 1.0, 0.1] / sum([1.0, 1.0, 1.0, 1.0, 0.1])),
rng::TaskLocalRNG = Random.GLOBAL_RNG,
verbose::Bool = false
) -> Tuple{Vector{Vector{Int64}}, Vector{String}}
Simulates genomic positions and alleles for multiple chromosomes.
Arguments
chrom_lengths::Vector{Int64}
: Vector containing the length of each chromosomechrom_loci_counts::Vector{Int64}
: Vector containing the number of loci to generate for each chromosomen_alleles::Int64
: Number of alleles to generate per locusallele_choices::Vector{String}
: Vector of possible alleles to choose from (default: ["A", "T", "C", "G", "D"])allele_weights::Weights
: Weights for sampling alleles (default: normalized weights favoring A,T,C,G over D)rng::TaskLocalRNG
: Random number generator for reproducibility (default: global RNG)verbose::Bool
: Whether to show progress bar (default: false)
Returns
Tuple{Vector{Vector{Int64}}, Vector{String}}
: A tuple containing:- A vector of vectors, where each inner vector contains the positions for a chromosome
- A vector of strings, where each string contains tab-separated locus information in the format: "chromN\tposition\tallalleles\tchosen_allele"
Throws
ArgumentError
: If input argument lengths don't match or if invalid number of alleles is requested
Example
julia> chrom_lengths, chrom_loci_counts = simulatechromstruct(l=100, n_chroms=7, max_pos=135_000_000);
julia> positions, loci_alleles = simulateposandalleles(chrom_lengths=chrom_lengths, chrom_loci_counts=chrom_loci_counts, n_alleles=2);
julia> length(positions) == length(chrom_lengths)
true
julia> length(loci_alleles) == sum(chrom_loci_counts)
true
GenomicBreedingCore.simulatetrials
— Methodsimulatetrials(;
genomes::Genomes,
f_add_dom_epi::Matrix{Float64} = [
0.01 0.25 0.10
0.05 0.50 0.25
0.10 0.25 0.00
],
n_years::Int64 = 2,
n_seasons::Int64 = 4,
n_harvests::Int64 = 2,
n_sites::Int64 = 4,
n_replications::Int64 = 2,
n_blocks::Union{Missing,Int64} = missing,
n_rows::Union{Missing,Int64} = missing,
n_cols::Union{Missing,Int64} = missing,
proportion_of_variance::Union{Missing,Matrix{Float64}} = missing,
sparsity::Float64 = 0.0,
seed::Int64 = 42,
verbose::Bool = true,
)::Tuple{Trials,Vector{SimulatedEffects}}
Arguments
genomes
: Genome struct includes then
entries xp
loci-alleles combinations (p
=l
loci xa-1
alleles)f_add_dom_epi
:n_traits
x 3 numeric matrix of loci proportion with additive, dominance and epistasis effects, i.e. each column refers to:f_additive
: proportion of thel
loci with non-zero additive effects on the phenotypef_dominance
: proportion of thel*f_additive
additive effects loci with additional dominance effectsf_epistasis
: proportion of thel*f_additive
additive effects loci with additional epistasis effects- (default = [0.01 0.25 0.10; 0.05 0.50 0.25; 0.10 0.25 0.00])
n_years
: Number of years (default = 2)n_seasons
: Number of seasons (default = 4)n_harvests
: Number of harvests (default = 2)n_sites
: Number of sites (default = 4)n_replications
: Number of replications (default = 2)n_blocks
: Number of blocks across the entire field layout (default = missing)n_rows
: Number of rows across the entire field layout (default = missing)n_cols
: Number of columns across the entire field layout (default = missing)proportion_of_variance
:9
xn_traits
numeric matrix of scaled/non-scaled proportion of variances allocated to genetic and environmental effects (default = missing; values will be sampled from a uniform distribution followed by a biased sample on the first row, i.e. additive effects row). The rows correspond to the variance allocated to:- additive genetic effects
- dominance genetic effects
- epistasis genetic effects
- years effects
- seasons effects
- sites effects
- environmental interactions
- spatial interactions
- GxE interactiions
seed
: Randomisation seed (default = 42)sparsity
: Proportion of missing data (default = 0.0)verbose
: Show trials simulation progress bar? (default = true)
Outputs
Trials
struct of simulated phenotype data- Vector of
SimulatedEffects
each corresponding to each trait-year-season-harvest-site-replication combination
Details
The function simulates trial data by:
- Generating genetic effects (additive, dominance, epistasis)
- Simulating environmental effects for:
- Years, seasons, sites
- Environmental interactions
- Spatial field effects (blocks, rows, columns)
- Genotype-by-environment interactions
- Combining effects according to specified variance proportions
- Applying optional sparsity to create missing data
The field layout is optimized to have:
- Number of rows ≤ number of columns
- Blocks divided along columns
- Even distribution of entries and replications
The function uses various covariance structures to simulate environmental and spatial effects:
- Autocorrelated Covariance: Used for effects such as years, seasons, and blocks. This structure assumes that neighboring elements are more correlated than distant ones.
- Random Covariance: Used for site effects, where the covariance matrix is generated randomly.
- Spherical Covariance: Used for replication effects, assuming uniform correlation across all elements.
- Diagonal Covariance: Used for allele-by-environment interaction effects, where only diagonal elements are non-zero, representing independent effects.
- Custom Covariance: Specific covariance structures can be passed as arguments to simulate effects with tailored correlation patterns.
Examples
julia> genomes::Genomes = simulategenomes(n=100, l=2_000, n_alleles=3, verbose=false);
julia> trials::Trials, vector_of_effects::Array{SimulatedEffects,1} = simulatetrials(genomes=genomes, sparsity=0.25, verbose=false);
julia> size(trials.phenotypes)
(12800, 3)
julia> size(trials.traits)
(3,)
julia> unique(trials.entries) == genomes.entries
true
julia> unique(trials.populations) == unique(genomes.populations)
true
julia> abs(mean(ismissing.(trials.phenotypes)) - 0.25) < 0.00001
true
GenomicBreedingCore.slice
— Methodslice(
genomes::Genomes;
idx_entries::Union{Nothing, Vector{Int64}} = nothing,
idx_loci_alleles::Union{Nothing, Vector{Int64}} = nothing,
verbose::Bool = false
)::Genomes
Create a subset of a Genomes
struct by selecting specific entries and loci-allele combinations.
Arguments
genomes::Genomes
: The source genomic data structure to be slicedidx_entries::Union{Nothing, Vector{Int64}}
: Indices of entries to keep. Ifnothing
, all entries are keptidx_loci_alleles::Union{Nothing, Vector{Int64}}
: Indices of loci-allele combinations to keep. Ifnothing
, all loci-alleles are keptverbose::Bool
: If true, displays a progress bar during slicing. Defaults to false
Returns
Genomes
: A newGenomes
struct containing only the selected entries and loci-allele combinations
Performance Notes
- The function uses multi-threaded implementation for optimal performance
- Progress bar is available when
verbose=true
to monitor the slicing operation - Memory efficient implementation that creates a new pre-allocated structure
Behaviour
- Both index vectors are automatically sorted and deduplicated
- If both
idx_entries
andidx_loci_alleles
arenothing
, returns a clone of the input - Maintains all relationships and structure of the original genomic data
- Preserves population assignments and allele frequencies for selected entries
Validation
- Performs dimension checks on both input and output genomic structures
- Validates that all indices are within proper bounds
- Ensures data consistency throughout the slicing operation
Throws
ArgumentError
: If inputGenomes
struct is corrupted or indices are out of boundsDimensionMismatch
: If the resulting sliced genome has inconsistent dimensions
Examples
julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=4, verbose=false);
julia> sliced_genomes = slice(genomes, idx_entries=collect(1:10), idx_loci_alleles=collect(1:300));
julia> dimensions(sliced_genomes)
Dict{String, Int64} with 7 entries:
"n_entries" => 10
"n_chr" => 1
"n_loci" => 100
"n_loci_alleles" => 300
"n_populations" => 1
"n_missing" => 0
"max_n_alleles" => 4
GenomicBreedingCore.slice
— Methodslice(phenomes::Phenomes; idx_entries::Union{Nothing, Vector{Int64}}=nothing, idx_traits::Union{Nothing, Vector{Int64}}=nothing)::Phenomes
Create a new Phenomes
object containing a subset of the original data by selecting specific entries and traits.
Arguments
phenomes::Phenomes
: The original Phenomes object to sliceidx_entries::Union{Nothing, Vector{Int64}}=nothing
: Indices of entries to keep. Ifnothing
, all entries are keptidx_traits::Union{Nothing, Vector{Int64}}=nothing
: Indices of traits to keep. Ifnothing
, all traits are kept
Returns
Phenomes
: A new Phenomes object containing only the selected entries and traits
Notes
- The function preserves the original structure while reducing dimensions
- Indices must be within valid ranges (1 to nentries/ntraits)
- Duplicate indices are automatically removed
- The resulting object maintains all relationships between entries, populations, traits, and phenotypes
Throws
ArgumentError
: If the input Phenomes struct is corrupted or if indices are out of boundsDimensionMismatch
: If the slicing operation results in invalid dimensions
Examples
julia> phenomes = Phenomes(n=10, t=3); phenomes.entries = string.("entry_", 1:10); phenomes.populations .= "pop_1"; phenomes.traits = ["A", "B", "C"]; phenomes.phenotypes = fill(0.0, 10,3);
julia> sliced_phenomes = slice(phenomes, idx_entries=collect(1:5); idx_traits=collect(2:3));
julia> dimensions(sliced_phenomes)
Dict{String, Int64} with 8 entries:
"n_total" => 10
"n_zeroes" => 10
"n_nan" => 0
"n_entries" => 5
"n_traits" => 2
"n_inf" => 0
"n_populations" => 1
"n_missing" => 0
GenomicBreedingCore.slice
— Methodslice(
trials::Trials;
traits::Union{Nothing, Vector{String}} = nothing,
populations::Union{Nothing, Vector{String}} = nothing,
entries::Union{Nothing, Vector{String}} = nothing,
years::Union{Nothing, Vector{String}} = nothing,
harvests::Union{Nothing, Vector{String}} = nothing,
seasons::Union{Nothing, Vector{String}} = nothing,
sites::Union{Nothing, Vector{String}} = nothing,
replications::Union{Nothing, Vector{String}} = nothing,
blocks::Union{Nothing, Vector{String}} = nothing,
rows::Union{Nothing, Vector{String}} = nothing,
cols::Union{Nothing, Vector{String}} = nothing,
)::Trials
Create a subset of a Trials
struct by filtering its components based on specified criteria.
Arguments
trials::Trials
: The source trials data structure to be slicedtraits::Vector{String}
: Selected trait names to includepopulations::Vector{String}
: Selected population names to includeentries::Vector{String}
: Selected entry names to includeyears::Vector{String}
: Selected years to includeharvests::Vector{String}
: Selected harvest identifiers to includeseasons::Vector{String}
: Selected seasons to includesites::Vector{String}
: Selected site names to includereplications::Vector{String}
: Selected replication identifiers to includeblocks::Vector{String}
: Selected block identifiers to includerows::Vector{String}
: Selected row identifiers to includecols::Vector{String}
: Selected column identifiers to include
All arguments except trials
are optional. When an argument is not provided (i.e., nothing
), all values for that category are included in the slice.
Returns
- A new
Trials
struct containing only the selected data
Throws
ArgumentError
: If invalid names are provided for any category or if no data remains after filteringDimensionMismatch
: If the resulting sliced trials structure has inconsistent dimensionsArgumentError
: If the input trials structure is corrupted
Examples
julia> trials, _ = simulatetrials(genomes = simulategenomes(verbose=false), verbose=false);
julia> sliced_trials = slice(trials, traits=trials.traits[2:3], years=[unique(trials.years)[1]], seasons=unique(trials.seasons)[2:3]);
julia> dimensions(sliced_trials)
Dict{String, Int64} with 16 entries:
"n_zeroes" => 0
"n_harvests" => 2
"n_nan" => 0
"n_entries" => 100
"n_traits" => 2
"n_seasons" => 2
"n_blocks" => 10
"n_rows" => 10
"n_missing" => 0
"n_inf" => 0
"n_total" => 6400
"n_replications" => 2
"n_years" => 1
"n_sites" => 4
"n_cols" => 20
"n_populations" => 1
GenomicBreedingCore.sparsities
— Methodsparsities(genomes::Genomes) -> Tuple{Vector{Float64}, Vector{Float64}}
Calculate the sparsity (proportion of missing data) for each entry and locus in a Genomes
object.
Returns a tuple of two vectors:
- First vector contains sparsity values for each entry (row-wise mean of missing values)
- Second vector contains sparsity values for each locus (column-wise mean of missing values)
The function processes the data in parallel using multiple threads for performance optimization.
Arguments
genomes::Genomes
: A Genomes object containing allele frequency data with potentially missing values
Returns
Tuple{Vector{Float64}, Vector{Float64}}
: A tuple containing:- Vector of entry sparsities (values between 0.0 and 1.0)
- Vector of locus sparsities (values between 0.0 and 1.0)
Example
julia> genomes = simulategenomes(n=100, l=1_000, sparsity=0.25, verbose=false);
julia> entry_sparsities, locus_sparsities = sparsities(genomes);
julia> abs(0.25 - mean(entry_sparsities)) < 0.0001
true
julia> abs(0.25 - mean(locus_sparsities)) < 0.0001
true
GenomicBreedingCore.summarise
— Methodsummarise(cvs::Vector{CV})::Tuple{DataFrame,DataFrame}
Summarize cross-validation results from a vector of CV structs into two DataFrames.
Returns
- A tuple containing two DataFrames:
- Summary DataFrame with mean metrics across entries, replications, and folds
- Contains means and standard deviations of correlation coefficients
- Includes average training and validation set sizes
- Grouped by training population, validation population, trait, and model
- Entry-level DataFrame with phenotype prediction statistics
- Contains true phenotype values, predicted means (μ), and standard deviations (σ)
- Grouped by training population, validation population, trait, model, and entry
- Summary DataFrame with mean metrics across entries, replications, and folds
Arguments
cvs::Vector{CV}
: Vector of CV structs containing cross-validation results
Notes
- Validates dimensions of input CV structs before processing
- Handles missing values in phenotype predictions
Throws
ArgumentError
: If any CV struct in the input vector has inconsistent dimensions
Examples
julia> fit_1 = Fit(n=1, l=2); fit_1.metrics = Dict("cor" => 0.0, "rmse" => 1.0); fit_1.trait = "trait_1";
julia> cv_1 = CV("replication_1", "fold_1", fit_1, ["population_1"], ["entry_1"], [0.0], [0.0], fit_1.metrics);
julia> fit_2 = Fit(n=1, l=2); fit_2.metrics = Dict("cor" => 1.0, "rmse" => 0.0); fit_2.trait = "trait_1";
julia> cv_2 = CV("replication_2", "fold_2", fit_2, ["population_2"], ["entry_2"], [0.0], [0.0], fit_2.metrics);
julia> cvs = [cv_1, cv_2];
julia> df_summary, df_summary_per_entry = summarise(cvs);
julia> size(df_summary)
(2, 8)
julia> size(df_summary_per_entry)
(2, 8)
GenomicBreedingCore.tabularise
— Functiontabularise(fit::Fit, metric::String = "cor")::DataFrame
Convert a Fit struct into a DataFrame for easier data manipulation and analysis.
Arguments
fit::Fit
: A Fit struct containing model results and parametersmetric::String = "cor"
: The metric to extract from fit.metrics dictionary (default: "cor")
Returns
DataFrame
: A DataFrame with the following columns:model
: The model nametrait
: The trait namepopulation
: Semicolon-separated string of unique population namesmetric
: The specified metric value from fit.metricsb_hat_labels
: Labels for the effect sizesb_hat
: Estimated effect sizes
Examples
julia> fit = Fit(n=100, l=10_000); fit.b_hat = rand(10_000); fit.model="some_model"; fit.trait="some_trait";
julia> fit.metrics = Dict("cor" => rand(), "rmse" => rand()); fit.populations .= "pop_1";
julia> df = tabularise(fit);
julia> size(df)
(10000, 6)
GenomicBreedingCore.tabularise
— Methodtabularise(phenomes::Phenomes)::DataFrame
Convert a Phenomes
struct into a tabular format as a DataFrame
.
The resulting DataFrame contains the following columns:
id
: Integer index for each entryentries
: Entry identifierspopulations
: Population assignments- Additional columns for each trait in
phenomes.traits
Arguments
phenomes::Phenomes
: A valid Phenomes struct containing phenotypic data
Returns
DataFrame
: A DataFrame with entries as rows and traits as columns
Throws
ArgumentError
: If the Phenomes struct dimensions are inconsistent
Examples
julia> phenomes = Phenomes(n=10, t=3); phenomes.entries = string.("entry_", 1:10); phenomes.populations .= "pop_1"; phenomes.traits = ["A", "B", "C"]; phenomes.phenotypes = fill(0.0, 10,3);
julia> tabularise(phenomes)
10×6 DataFrame
Row │ id entries populations A B C
│ Int64 String String Float64? Float64? Float64?
─────┼────────────────────────────────────────────────────────────
1 │ 1 entry_1 pop_1 0.0 0.0 0.0
2 │ 2 entry_2 pop_1 0.0 0.0 0.0
3 │ 3 entry_3 pop_1 0.0 0.0 0.0
4 │ 4 entry_4 pop_1 0.0 0.0 0.0
5 │ 5 entry_5 pop_1 0.0 0.0 0.0
6 │ 6 entry_6 pop_1 0.0 0.0 0.0
7 │ 7 entry_7 pop_1 0.0 0.0 0.0
8 │ 8 entry_8 pop_1 0.0 0.0 0.0
9 │ 9 entry_9 pop_1 0.0 0.0 0.0
10 │ 10 entry_10 pop_1 0.0 0.0 0.0
GenomicBreedingCore.tabularise
— Methodtabularise(trials::Trials)::DataFrame
Convert a Trials struct into a DataFrame representation for easier data manipulation and analysis.
Arguments
trials::Trials
: A valid Trials struct containing experimental field trial data.
Returns
DataFrame
: A DataFrame with the following columns:id
: Unique identifier for each trial observationyears
: Year of the trialseasons
: Season identifierharvests
: Harvest identifiersites
: Location/site identifierreplications
: Replication numberblocks
: Block identifierrows
: Row positioncols
: Column positionentries
: Entry identifierpopulations
: Population identifier- Additional columns for each trait in
trials.traits
Throws
ArgumentError
: If the Trials struct dimensions are inconsistent
Examples
julia> trials, _ = simulatetrials(genomes = simulategenomes(verbose=false), verbose=false);
julia> df = tabularise(trials);
julia> size(df)
(12800, 14)
GenomicBreedingCore.tabularise
— Methodtabularise(cvs::Vector{CV})::Tuple{DataFrame,DataFrame}
Convert a vector of CV (Cross-Validation) structs into two DataFrames containing metrics and predictions.
Arguments
cvs::Vector{CV}
: Vector of CV structs containing cross-validation results
Returns
Tuple{DataFrame,DataFrame}
: A tuple of two DataFrames:df_across_entries
: Contains aggregated metrics across entries with columns:training_population
: Semicolon-separated list of training populationsvalidation_population
: Semicolon-separated list of validation populationstrait
: Name of the traitmodel
: Name of the model usedreplication
: Replication identifierfold
: Fold identifiertraining_size
: Number of entries in training setvalidation_size
: Number of entries in validation set- Additional columns for each metric (e.g.,
cor
,rmse
)
df_per_entry
: Contains per-entry predictions with columns:training_population
: Training population identifiervalidation_population
: Validation population identifierentry
: Entry identifiertrait
: Name of the traitmodel
: Name of the model usedreplication
: Replication identifierfold
: Fold identifiery_true
: True valuesy_pred
: Predicted values
Throws
ArgumentError
: If input vector is empty or if any CV struct is corrupted
Notes
- Warns if there are empty CV structs resulting from insufficient training sizes or fixed traits
- Metrics are extracted from the
metrics
dictionary in each CV struct - Population identifiers are sorted and joined with semicolons when multiple populations exist
Examples
julia> fit_1 = Fit(n=1, l=2); fit_1.metrics = Dict("cor" => 0.0, "rmse" => 1.0); fit_1.trait = "trait_1";
julia> cv_1 = CV("replication_1", "fold_1", fit_1, ["population_1"], ["entry_1"], [0.0], [0.0], fit_1.metrics);
julia> fit_2 = Fit(n=1, l=2); fit_2.metrics = Dict("cor" => 1.0, "rmse" => 0.0); fit_2.trait = "trait_1";
julia> cv_2 = CV("replication_2", "fold_2", fit_2, ["population_2"], ["entry_2"], [0.0], [0.0], fit_2.metrics);
julia> cvs = [cv_1, cv_2];
julia> df_across_entries, df_per_entry = tabularise(cvs);
julia> names(df_across_entries)
10-element Vector{String}:
"training_population"
"validation_population"
"trait"
"model"
"replication"
"fold"
"training_size"
"validation_size"
"cor"
"rmse"
julia> df_across_entries[!, [:cor, :rmse]]
2×2 DataFrame
Row │ cor rmse
│ Float64 Float64
─────┼──────────────────
1 │ 0.0 1.0
2 │ 1.0 0.0
julia> names(df_per_entry)
9-element Vector{String}:
"training_population"
"validation_population"
"entry"
"trait"
"model"
"replication"
"fold"
"y_true"
"y_pred"
julia> df_per_entry[!, [:entry, :y_true, :y_pred]]
2×3 DataFrame
Row │ entry y_true y_pred
│ String Float64 Float64
─────┼───────────────────────────
1 │ entry_1 0.0 0.0
2 │ entry_2 0.0 0.0
GenomicBreedingCore.trialsmodelsfomulae!
— Methodtrialsmodelsfomulae!(df::DataFrame; trait::String, max_levels::Int64 = 100)::Tuple{Vector{String},Vector{Int64}}
Generate mixed model formulae for analyzing multi-environment trial data.
Arguments
df::DataFrame
: Input DataFrame containing trial data, will be modified in-placetrait::String
: Name of the response variable columnmax_levels::Int64=100
: Maximum number of factor levels allowed in interaction terms
Returns
A tuple containing:
Vector{String}
: Collection of mixed model formulae with increasing complexityVector{Int64}
: Corresponding number of factor levels for each formula
Details
The function:
- Identifies available trial design variables (nesters, spatial components, targets)
- Creates interaction terms between variables and adds them to the DataFrame
- Generates model formulae considering:
- Single and multi-environment models
- Fixed and random entry effects
- Spatial error components
- Nested random effects
- Filters redundant models and sorts by complexity
Notes
- Warns if trials are unreplicated
- Throws error if only one entry is present
- Automatically removes block effects when both row and column effects are present
- Removes redundant nesting structures
Examples
julia> trials, _simulated_effects = simulatetrials(genomes = simulategenomes(verbose=false), verbose=false);
julia> df = tabularise(trials);
julia> size(df)
(12800, 14)
julia> formulae, n_levels = trialsmodelsfomulae!(df, trait="trait_1");
julia> size(df)
(12800, 134)
julia> length(formulae)
76
julia> sum(n_levels .== sort(n_levels))
76
GenomicBreedingCore.turingblr
— Methodturingblr(vector_of_Xs_noint::Vector{Matrix{Bool}}, y::Vector{Float64})
Bayesian Linear Regression model implemented using Turing.jl.
Arguments
vector_of_Xs_noint::Vector{Matrix{Bool}}
: Vector of predictor matrices, where each matrix contains binary (Bool) predictorsy::Vector{Float64}
: Vector of response variables
Model Details
- Includes an intercept term with Normal(0.0, 10.0) prior
- For each predictor matrix in
vector_of_Xs_noint
:- Variance parameter (σ²) with Exponential(1.0) prior
- Regression coefficients (β) with multivariate normal prior: MvNormal(0, σ² * I)
- Residual variance (σ²) with Exponential(10.0) prior
- Response variable modeled as multivariate normal: MvNormal(μ, σ² * I)
Returns
- A Turing model object that can be used for MCMC sampling
GenomicBreedingCore.turingblrmcmc!
— Methodturingblrmcmc!(
blr::BLR;
multiple_σs::Union{Nothing, Dict{String, Bool}} = nothing,
turing_model::Function = turingblr,
n_iter::Int64 = 10_000,
n_burnin::Int64 = 1_000,
δ::Float64 = 0.65,
max_depth::Int64 = 5,
Δ_max::Float64 = 1000.0,
init_ϵ::Float64 = 0.2,
adtype::AutoReverseDiff = AutoReverseDiff(compile = true),
diagnostics_threshold_std_lt::Float64 = 0.05,
diagnostics_threshold_ess_ge::Int64 = 100,
diagnostics_threshold_rhat_lt::Float64 = 1.01,
seed::Int64 = 1234,
verbose::Bool = true
)::Nothing
Perform MCMC sampling for Bayesian Linear Regression on a BLR model using NUTS sampler.
Arguments
blr::BLR
: The BLR model to fitmultiple_σs::Union{Nothing, Dict{String, Bool}}
: Optional dictionary specifying multiple (true) or single (false) variance scalers per componentturing_model::Function
: The Turing model function to use (default: turingblr)n_iter::Int64
: Number of MCMC iterations (default: 10,000)n_burnin::Int64
: Number of burn-in iterations to discard (default: 1,000)δ::Float64
: Target acceptance rate for dual averaging (default: 0.65)max_depth::Int64
: Maximum doubling tree depth (default: 5)Δ_max::Float64
: Maximum divergence during doubling tree (default: 1000.0)init_ϵ::Float64
: Initial step size; 0 means auto-search using heuristics (default: 0.2)adtype::AutoReverseDiff
: Automatic differentiation type (default: AutoReverseDiff(compile = true))diagnostics_threshold_std_lt::Float64
: Threshold for MCSE/std ratio convergence (default: 0.05)diagnostics_threshold_ess_ge::Int64
: Minimum effective sample size for convergence (default: 100)diagnostics_threshold_rhat_lt::Float64
: Maximum R-hat for convergence (default: 1.01)seed::Int64
: Random seed for reproducibility (default: 1234)verbose::Bool
: Whether to show progress during sampling (default: true)
Returns
Nothing
: The function mutates the input BLR struct in-place, updating its:- coefficients,
- variance components,
- predicted values,
- residuals, and
- MCMC diagnostics
Notes
- Performs model validation checks before fitting
- Uses NUTS (No-U-Turn Sampler) with specified AD type
- Computes convergence diagnostics by splitting the chain into 5 sub-chains to calculate:
- improved Gelman-Rubin statistics (R̂; https://doi.org/10.1214/20-BA1221),
- effective sample size (ESS), and
- Monte Carlo standard error (MCSE)
- Warns if parameters haven't converged based on:
- R̂ >= diagnosticsthresholdrhat_lt
- ESS < diagnosticsthresholdess_ge
- MCSE/std ratio >= diagnosticsthresholdstd_lt
- Updates the model with estimated coefficients, variance components, predicted values, and residuals
- Mutates the input BLR model in-place
julia> genomes = simulategenomes(n=5, l=1_000, verbose=false);
julia> trials, simulated_effects = simulatetrials(genomes = genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=3, verbose=false);
julia> df = tabularise(trials);
julia> blr = instantiateblr(trait = trials.traits[1], factors = ["rows", "cols"], df = df, other_covariates = trials.traits[2:end], verbose = false);
julia> turingblrmcmc!(blr, n_iter=1_000, n_burnin=200, seed=123, verbose=false);
julia> mean([mean(x) for (_, x) in blr.coefficients]) != 0.0
true
julia> cor(blr.y, blr.ŷ) > 0.0
true
julia> (sum(blr.diagnostics.rhat .< 1.01) < nrow(blr.diagnostics))
true
julia> (sum(blr.diagnostics.ess .>= 100) < nrow(blr.diagnostics))
true
GenomicBreedingCore.@string2formula
— Macro@string2formula(x::String)
Convert a string representation of a formula into a Formula
object.
This macro parses a string containing a formula expression and evaluates it into a proper Formula
object that can be used in statistical modelling.
Arguments
x::String
: A string containing the formula expression (e.g., "y ~ x + z")
Returns
Formula
: A Formula object representing the parsed expression
GenomicBreedingCore.@stringevaluation
— Macro@stringevaluation(x)
Parse and evaluate a string expression at compile time.
Arguments
x
: A string containing a Julia expression to be parsed.
Returns
- The parsed expression as an
Expr
object ready for evaluation.