GenomicBreedingCore

Documentation for GenomicBreedingCore.

GenomicBreedingCore.AbstractGBType
AbstractGB

The root abstract type for all core types in GenomicBreeding.jl package.

This type serves as the common ancestor for the type hierarchy in the package, enabling shared functionality and type-based dispatch across all derived types.

Extended help

All custom core types in GenomicBreeding.jl should subtype from AbstractGB to ensure consistency in the type system and to enable generic implementations of common operations.

source
GenomicBreedingCore.BLRType
BLR <: AbstractGB

Bayesian Linear Regression (BLR) model structure for phenotype analyses, genomic prediction and selection.

Fields

  • entries::Vector{String}: Names or identifiers for the observations
  • Xs::Dict{String,Matrix{Union{Bool,Float64}}}: Design matrices for factors and numeric matrix for covariates, including intercept
  • Σs::Dict{String,Union{Nothing,Matrix{Float64},UniformScaling{Float64}}}: Variance-covariance matrices for random effects
  • coefficients::Dict{String, Vector{Float64}}: Dictionary of estimated coefficients/effects grouped by component
  • coefficient_names::Dict{String, Vector{String}}: Dictionary of names for coefficients grouped by component
  • y::Vector{Float64}: Response/dependent variable vector
  • ŷ::Vector{Float64}: Fitted/predicted values
  • ϵ::Vector{Float64}: Residuals (y - ŷ)
  • diagnostics::DataFrame: Model diagnostics and convergence metrics

Constructor

BLR(; n::Int64, p::Int64, var_comp::Dict{String, Int64} = Dict("σ²" => 1))

Creates a BLR model with specified dimensions and variance components.

Arguments

  • n::Int64: Number of observations
  • p::Int64: Number of coefficients/effects (including intercept)
  • var_comp::Dict{String, Int64}: Dictionary specifying variance components and their dimensions

Details

The constructor initializes:

  • Design matrices (Xs) with zeros, including an intercept component
  • Variance-covariance matrices (Σs) as identity matrices scaled by 1.0
  • Dictionaries for coefficients and their names, with "intercept" as default first component
  • Zero vectors for y, fitted values (ŷ), and residuals (ϵ)
  • Empty strings for entries and coefficient names (except "intercept")
  • Empty DataFrame for model diagnostics

Requirements

  • At least one variance component (σ²) must be specified for residual variance
  • Total number of parameters (p) must equal sum of dimensions in variance components
  • Number of coefficients (p) must be ≥ 1
  • If p > 1 and only residual variance is specified, a dummy variance component is added with p-1 dimensions

Notes

The structure supports multiple random effects through the dictionary-based design, where each component (except intercept) can have its own design matrix (X), variance-covariance matrix (Σ), coefficients, and coefficient names.

source
GenomicBreedingCore.CVType

Cross-validation struct

Contains genomic prediction cross-validation details.

Fields

  • replication: replication name
  • fold: fold name
  • fit: genomic prediction model fit on the training set
  • validation_populations: vector of validation populations corresponding to each validation entry
  • validation_entries: corresponding vector of entries in the validation population/s
  • validation_y_true: corresponding vector of observed phenotypes in the validation population/s
  • validation_y_pred: corresponding vector of predicted phenotypes in the validation population/s
  • metrics: dictionary of genomic prediction accuracy metrics on the validation population/s

Constructor

Uses the default constructor.

source
GenomicBreedingCore.FitType

Genomic prediction model fit

Contains genomic prediction model fit details.

Fields

  • model: name of the genomic prediction model used
  • b_hat_labels: names of the loci-alleles used
  • b_hat: effects of the loci-alleles
  • trait: name of the trait
  • entries: names of the entries used in the current cross-validation replication and fold
  • populations: names of the populations used in the current cross-validation replication and fold
  • y_true: corresponding observed phenotype values
  • y_pred: corresponding predicted phenotype values
  • metrics: dictionary of genomic prediction accuracy metrics, inluding Pearson's correlation, mean absolute error and root mean-squared error
  • lux_model: Nothing or a trained Lux neural network model

Constructor

Fit(; n=1, l=10)
where:
- n::Int64: Number of entries
- l::Int64: Number of loci-alleles
source
GenomicBreedingCore.GRMType

Genomic relationship matrix struct

Contains genomic relationship matrix as well as the corresponding entries and loci-alleles used to compute it.

Fields

  • entries: names of the n entries corresponding to the rows and columns of the genomic relationship matrix
  • loci_alleles: names of the loci-alleles used to compute the genomic relationship matrix
  • genomic_relationship_matrix: n x n matrix of genomic relationship values between entries

Constructor

Uses the default constructor.

source
GenomicBreedingCore.GenomesType

Genomes struct

Containes unique entries and loci_alleles where allele frequencies can have missing values

Fields

  • entries: names of the n entries or samples
  • populations: name/s of the population/s each entries or samples belong to
  • loci_alleles: names of the p loci-alleles combinations (p = l loci x a-1 alleles) including the chromsome or scaffold name, position, all alleles and current allele separated by tabs ("\t")
  • allele_frequencies: n x p matrix of allele frequencies between 0 and 1 which can have missing values
  • mask: n x p matrix of boolean mask for selective analyses and slicing

Constructor

Genomes(; n::Int64 = 1, p::Int64 = 2)

where:

  • n::Int64=1: Number of entries in the genomic dataset
  • p::Int64=2: Number of locus-allele combinations in the genomic dataset

Examples

julia> genomes = Genomes(n=2, p=2)
Genomes(["", ""], ["", ""], ["", ""], Union{Missing, Float64}[missing missing; missing missing], Bool[1 1; 1 1])

julia> fieldnames(Genomes)
(:entries, :populations, :loci_alleles, :allele_frequencies, :mask)

julia> genomes.entries = ["entry_1", "entry_2"];

julia> genomes.populations = ["pop_1", "pop_1"];

julia> genomes.loci_alleles = ["chr1\t12345\tA|T\tA", "chr2\t678910\tC|D\tD"];

julia> genomes.allele_frequencies = [0.5 0.25; 0.9 missing];

julia> genomes.mask = [true true; true false];

julia> genomes
Genomes(["entry_1", "entry_2"], ["pop_1", "pop_1"], ["chr1\t12345\tA|T\tA", "chr2\t678910\tC|D\tD"], Union{Missing, Float64}[0.5 0.25; 0.9 missing], Bool[1 1; 1 0])
source
GenomicBreedingCore.PhenomesType

Phenomes struct

Constains unique entries and traits where phenotype data can have missing values

Fields

  • entries: names of the n entries or samples
  • populations: name/s of the population/s each entries or samples belong to
  • traits: names of the t traits
  • phenotypes: n x t matrix of numeric (R) phenotype data which can have missing values
  • mask: n x t matrix of boolean mask for selective analyses and slicing

Constructor

Phenomes(; n::Int64 = 1, t::Int64 = 2)

where:

  • n::Int64=1: Number of entries in the phenomic dataset
  • t::Int64=2: Number of traits in the phenomic dataset

Examples

julia> phenomes = Phenomes(n=2, t=2)
Phenomes(["", ""], ["", ""], ["", ""], Union{Missing, Float64}[missing missing; missing missing], Bool[1 1; 1 1])

julia> fieldnames(Phenomes)
(:entries, :populations, :traits, :phenotypes, :mask)

julia> phenomes.entries = ["entry_1", "entry_2"];

julia> phenomes.populations = ["pop_A", "pop_B"];

julia> phenomes.traits = ["height", "yield"];

julia> phenomes.phenotypes = [200.0 2.5; 150.0 missing];

julia> phenomes.mask = [true true; true false];

julia> phenomes
Phenomes(["entry_1", "entry_2"], ["pop_A", "pop_B"], ["height", "yield"], Union{Missing, Float64}[200.0 2.5; 150.0 missing], Bool[1 1; 1 0])
source
GenomicBreedingCore.SimulatedEffectsType

SimulatedEffects struct

Contains the various simulated genetic, environmental and GxE effects.

Fields

  • id::Vector{String}: Vector of identifiers
  • year::Float64: Year effect
  • season::Float64: Season effect
  • site::Float64: Site effect
  • seasons_x_year::Float64: Interaction effect between seasons and years
  • harvests_x_season_x_year::Float64: Interaction effect between harvests, seasons and years
  • sites_x_harvest_x_season_x_year::Float64: Interaction effect between sites, harvests, seasons and years
  • field_layout::Matrix{Int64}: 2D matrix representing field layout
  • replications_x_site_x_harvest_x_season_x_year::Vector{Float64}: Replication interaction effects
  • blocks_x_site_x_harvest_x_season_x_year::Vector{Float64}: Block interaction effects
  • rows_x_site_x_harvest_x_season_x_year::Vector{Float64}: Row interaction effects
  • cols_x_site_x_harvest_x_season_x_year::Vector{Float64}: Column interaction effects
  • additive_genetic::Vector{Float64}: Additive genetic effects
  • dominance_genetic::Vector{Float64}: Dominance genetic effects
  • epistasis_genetic::Vector{Float64}: Epistasis genetic effects
  • additive_allele_x_site_x_harvest_x_season_x_year::Vector{Float64}: Additive allele interaction effects
  • dominance_allele_x_site_x_harvest_x_season_x_year::Vector{Float64}: Dominance allele interaction effects
  • epistasis_allele_x_site_x_harvest_x_season_x_year::Vector{Float64}: Epistasis allele interaction effects

Constructor

SimulatedEffects()

Creates a new SimulatedEffects instance with default values:

  • Empty strings for IDs (vector of size 6)
  • 0.0 for all float values
  • 4x4 zero matrix for field_layout
  • Single-element zero vectors for all vector fields
source
GenomicBreedingCore.TEBVType

Trial-estimated breeding values (TEBV) struct

Contains trial-estimated breeding values as generated by analyse(trials::Trials, ...).

Fields

  • traits: names of the traits t traits
  • formulae: best-fitting formula for each trait
  • models: best-fitting linear mixed model for each trait which may be MixedModes.jl model or a tuple of coefficient names, the coefficients and the design matrix from Bayesial linear regression
  • df_BLUEs: vector of data frames of best linear unbiased estimators or fixed effects table of each best fitting model
  • df_BLUPs: vector of data frames of best linear unbiased predictors or random effects table of each best fitting model
  • phenomes: vector of Phenomes structs each containing the breeding values

Examples

julia> tebv = TEBV(traits=[""], formulae=[""], models=[MixedModel(@formula(y~1+(1|x)), DataFrame(y=1, x=1))], df_BLUEs=[DataFrame(x=1)], df_BLUPs=[DataFrame(x=1)], phenomes=[Phenomes(n=1, t=1)]);

julia> tebv.traits
1-element Vector{String}:
 ""
source
GenomicBreedingCore.TrialsType

Trials struct

Contains phenotype data across years, seasons, harvest, sites, populations, replications, blocks, rows, and columns

Fields

  • phenotypes: n x t matrix of numeric phenotype data which can have missing values
  • traits: names of the traits t traits
  • years: names of the years corresponding to each row in the phenotype matrix
  • seasons: names of the seasons corresponding to each row in the phenotype matrix
  • harvests: names of the harvests corresponding to each row in the phenotype matrix
  • sites: names of the sites corresponding to each row in the phenotype matrix
  • replications: names of the replications corresponding to each row in the phenotype matrix
  • blocks: names of the blocks corresponding to each row in the phenotype matrix
  • rows: names of the rows corresponding to each row in the phenotype matrix
  • cols: names of the cols corresponding to each row in the phenotype matrix
  • entries: names of the entries corresponding to each row in the phenotype matrix
  • populations: names of the populations corresponding to each row in the phenotype matrix

Constructor

Trials(; n::Int64 = 2, p::Int64 = 2)

where:

  • n::Int64=1: Number of entries in the trials
  • t::Int64=2: Number of traits in the trials

Examples

julia> trials = Trials(n=1, t=2)
Trials(Union{Missing, Float64}[missing missing], ["", ""], [""], [""], [""], [""], [""], [""], [""], [""], [""], [""])

julia> fieldnames(Trials)
(:phenotypes, :traits, :years, :seasons, :harvests, :sites, :replications, :blocks, :rows, :cols, :entries, :populations)
source
Base.:==Method
Base.:(==)(x::BLR, y::BLR)::Bool

Compare two BLR structs for equality based on their hash values.

This method defines equality comparison for BLR structs by comparing their hash values. Two BLR structs are considered equal if they have identical hash values, which means they have the same values for all their fields, except the computationally expensive Xs, and Σs fields.

Arguments

  • x::BLR: First BLR struct to compare
  • y::BLR: Second BLR struct to compare

Returns

  • Bool: true if the BLR structs are equal, false otherwise

Examples

julia> blr_1 = BLR(n=1, p=4);

julia> blr_2 = BLR(n=1, p=4);

julia> blr_3 = BLR(n=1, p=1);

julia> blr_1 == blr_2
true

julia> blr_1 == blr_3
false
source
Base.:==Method
Base.:(==)(x::CV, y::CV)::Bool

Compare two CV (Cross-Validation) structs for equality.

This method overloads the equality operator (==) for CV structs by comparing their hash values. Two CV structs are considered equal if they have identical values for all fields.

Arguments

  • x::CV: First CV struct to compare
  • y::CV: Second CV struct to compare

Returns

  • Bool: true if the CV structs are equal, false otherwise

Examples

julia> fit = Fit(n=1, l=2);

julia> cv_1 = CV("replication_1", "fold_1", fit, ["population_1"], ["entry_1"], [0.0], [0.0], fit.metrics);

julia> cv_2 = clone(cv_1);

julia> cv_3 = clone(cv_1); cv_3.replication = "other_replication";

julia> cv_1 == cv_2
true

julia> cv_1 == cv_3
false
source
Base.:==Method
Base.:(==)(x::Fit, y::Fit)::Bool

Compare two Fit structs for equality based on their hash values.

This method defines equality comparison for Fit structs by comparing their hash values. Two Fit structs are considered equal if they have identical hash values, which means they have the same values for all their fields.

Arguments

  • x::Fit: First Fit struct to compare
  • y::Fit: Second Fit struct to compare

Returns

  • Bool: true if the Fit structs are equal, false otherwise

Examples

julia> fit_1 = Fit(n=1, l=4);

julia> fit_2 = Fit(n=1, l=4);

julia> fit_3 = Fit(n=1, l=2);

julia> fit_1 == fit_2
true

julia> fit_1 == fit_3
false
source
Base.:==Method
Base.:(==)(x::GRM, y::GRM)::Bool

Compare two GRM structs for equality.

Overloads the equality operator (==) for GRM structs by comparing their hash values. Two GRM structs are considered equal if they have identical values for all their fields.

Arguments

  • x::GRM: First GRM struct to compare
  • y::GRM: Second GRM struct to compare

Returns

  • Bool: true if the GRM structs are equal, false otherwise

Examples

julia> grm_1 = GRM(string.(["entries_1", "entries_2"]), string.(["chr1	123	A|T	A", "chr1	456	C|G	G"]), Float64.(rand(2,2)));

julia> grm_2 = clone(grm_1);

julia> grm_3 = GRM(string.(["entries_1", "entries_2"]), string.(["chr1	123	A|T	A", "chr1	456	C|G	G"]), Float64.(rand(2,2)));

julia> grm_1 == grm_2
true

julia> grm_1 == grm_3
false
source
Base.:==Method
==(x::Genomes, y::Genomes)::Bool

Compare two Genomes structs for equality by comparing their hash values.

This method implements equality comparison for Genomes structs by utilizing their hash values, ensuring that two genomes are considered equal if and only if they have identical structural properties and content.

Arguments

  • x::Genomes: First Genomes struct to compare
  • y::Genomes: Second Genomes struct to compare

Returns

  • Bool: true if the genomes are equal, false otherwise

Examples

julia> genomes_1 = genomes = Genomes(n=2,p=4);

julia> genomes_2 = genomes = Genomes(n=2,p=4);

julia> genomes_3 = genomes = Genomes(n=1,p=2);

julia> genomes_1 == genomes_2
true

julia> genomes_1 == genomes_3
false
source
Base.:==Method
==(x::Phenomes, y::Phenomes)::Bool

Compare two Phenomes structs for equality using their hash values.

This method implements equality comparison for Phenomes objects by comparing their hash values, ensuring that two phenomes with identical structure and content are considered equal.

Arguments

  • x::Phenomes: First phenomes object to compare
  • y::Phenomes: Second phenomes object to compare

Returns

  • Bool: true if the phenomes are equal, false otherwise

Examples

julia> phenomes_1 = phenomes = Phenomes(n=2, t=4);

julia> phenomes_2 = phenomes = Phenomes(n=2, t=4);

julia> phenomes_3 = phenomes = Phenomes(n=1, t=2);

julia> phenomes_1 == phenomes_2
true

julia> phenomes_1 == phenomes_3
false
source
Base.:==Method
Base.:(==)(x::SimulatedEffects, y::SimulatedEffects)::Bool

Defines equality comparison for SimulatedEffects structs by comparing their hash values.

This method overloads the == operator for SimulatedEffects type and determines if two SimulatedEffects instances are equal by comparing their hash values rather than doing a field-by-field comparison.

Arguments

  • x::SimulatedEffects: First SimulatedEffects instance to compare
  • y::SimulatedEffects: Second SimulatedEffects instance to compare

Returns

  • Bool: true if the hash values of both instances are equal, false otherwise

Examples

julia> effects_1 = SimulatedEffects();

julia> effects_2 = SimulatedEffects();

julia> effects_3 = SimulatedEffects(); effects_3.id[1] = "SOMETHING_ELSE";

julia> effects_1 == effects_2
true

julia> effects_1 == effects_3
false
source
Base.:==Method
==(x::TEBV, y::TEBV)::Bool

Compare two TEBV (Trial-Estimated Breeding Values) objects for equality.

This method implements equality comparison for TEBV structs by comparing their hash values. Two TEBV objects are considered equal if they have identical values for all their fields: traits, formulae, models, dfBLUEs, dfBLUPs, and phenomes.

Arguments

  • x::TEBV: First TEBV object to compare
  • y::TEBV: Second TEBV object to compare

Returns

  • Bool: true if the TEBV objects are equal, false otherwise

Examples

julia> tebv_1 = TEBV(traits=[""], formulae=[""], models=[MixedModel(@formula(y~1+(1|x)), DataFrame(y=1, x=1))], df_BLUEs=[DataFrame(x=1)], df_BLUPs=[DataFrame(x=1)], phenomes=[Phenomes(n=1,t=1)]);

julia> tebv_2 = clone(tebv_1);

julia> tebv_3 = TEBV(traits=["SOMETHING_ELSE"], formulae=[""], models=[MixedModel(@formula(y~1+(1|x)), DataFrame(y=1, x=1))], df_BLUEs=[DataFrame(x=1)], df_BLUPs=[DataFrame(x=1)], phenomes=[Phenomes(n=1,t=1)]);

julia> tebv_1 == tebv_2
true

julia> tebv_1 == tebv_3
false
source
Base.:==Method
==(x::Trials, y::Trials)::Bool

Compare two Trials structs for equality by comparing their hash values.

Two Trials structs are considered equal if they have identical hash values, which implies they have the same configuration parameters (number of trials n and time steps t).

Arguments

  • x::Trials: First Trials struct to compare
  • y::Trials: Second Trials struct to compare

Returns

  • Bool: true if the Trials structs are equal, false otherwise

Examples

julia> trials_1 = trials = Trials(n=2, t=4);

julia> trials_2 = trials = Trials(n=2, t=4);

julia> trials_3 = trials = Trials(n=1, t=2);

julia> trials_1 == trials_2
true

julia> trials_1 == trials_3
false
source
Base.filterMethod
filter(
    genomes::Genomes,
    maf::Float64;
    max_entry_sparsity::Float64 = 0.0,
    max_locus_sparsity::Float64 = 0.0,
    max_prop_pc_varexp::Float64 = 0.90,
    max_entry_sparsity_percentile::Float64 = 0.90,
    max_locus_sparsity_percentile::Float64 = 0.50,
    chr_pos_allele_ids::Union{Nothing,Vector{String}} = nothing,
    match_alleles::Bool = true,
    verbose::Bool = false,
)::Tuple{Genomes, Dict{String,Vector{String}}}

Filter genomic data based on multiple criteria including sparsity, minor allele frequency (MAF), principal component analysis (PCA), and a list of loci-allele combinations.

Arguments

  • genomes::Genomes: A Genomes struct containing the genomic data.
  • maf::Float64: The minimum allele frequency threshold.
  • max_entry_sparsity::Float64: The maximum allowable sparsity for entries. Default is 0.0.
  • max_locus_sparsity::Float64: The maximum allowable sparsity for loci. Default is 0.0.
  • max_prop_pc_varexp::Float64: The maximum proportion of variance explained by the first two principal components (PC1 and PC2). Default is 0.90.
  • max_entry_sparsity_percentile::Float64: The percentile threshold for entry sparsity. Default is 0.90.
  • max_locus_sparsity_percentile::Float64: The percentile threshold for locus sparsity. Default is 0.50.
  • chr_pos_allele_ids::Union{Nothing, Vector{String}}: A vector of loci-allele combination names in the format "chromosome\tposition\tallele". If nothing, no filtering is applied. Default is nothing.
  • match_alleles::Bool: If true, matches alleles exactly when filtering by SNP list. If false, matches only chromosome and position. Default is true.
  • verbose::Bool: If true, prints detailed progress information during the filtering process. Default is false.

Returns

  • Tuple{Genomes, Dict{String, Vector{String}}}: A tuple containing:
    • A Genomes struct with filtered genomic data.
    • A dictionary of omitted loci-allele names categorized by the filtering criteria.

Details

This function filters genomic data based on multiple criteria including sparsity, minor allele frequency (MAF), principal component analysis (PCA), and a list of loci-allele combinations. The function performs the following steps:

  1. Input Validation: Ensures that the Genomes struct is not corrupted and that the filtering thresholds are within valid ranges.
  2. Filter by Sparsity: Removes the sparsest entries and loci-alleles until the maximum allowable sparsity thresholds are met.
  3. Filter by MAF: Removes loci-alleles with minor allele frequencies below the specified threshold.
  4. Filter by PCA: Removes outlier loci-alleles based on the proportion of variance explained by the first two principal components.
  5. Filter by SNP List: Retains only the specified loci-allele combinations, with optional exact allele matching.
  6. Verbose Output: If verbose is true, prints detailed progress information during each filtering step.

Notes

  • The function uses multi-threading to improve performance on large datasets.
  • The match_alleles parameter provides flexibility in SNP list filtering.
  • The function ensures that the filtered genomic data retains a minimum number of entries and loci-alleles.

Throws

  • ArgumentError: If the Genomes struct is corrupted or any of the filtering thresholds are out of range.
  • ErrorException: If no loci-alleles are retained after filtering.

Examples

julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=4, sparsity=0.01, verbose=false);

julia> filtered_genomes_1, omitted_loci_alleles_1 = filter(genomes, 0.1);

julia> filtered_genomes_2, omitted_loci_alleles_2 = filter(genomes, 0.1, chr_pos_allele_ids=genomes.loci_alleles[1:1000]);

julia> size(genomes.allele_frequencies)
(100, 3000)

julia> prod(size(filtered_genomes_1.allele_frequencies)) < prod(size(genomes.allele_frequencies))
true

julia> prod(size(filtered_genomes_2.allele_frequencies)) < prod(size(filtered_genomes_1.allele_frequencies))
true
source
Base.filterMethod
filter(genomes::Genomes; verbose::Bool = false)::Genomes

Filter a Genomes struct by removing entries and loci with missing data based on the mask matrix.

Description

This function filters a Genomes struct by:

  1. Removing rows (entries) where any column has a false value in the mask matrix
  2. Removing columns (loci) where any row has a false value in the mask matrix

Arguments

  • genomes::Genomes: Input Genomes struct containing genetic data and a mask matrix
  • verbose::Bool: Optional flag to control verbose output (default: false)

Returns

  • Genomes: A new filtered Genomes struct with complete data (no missing values)

Examples

julia> genomes = simulategenomes(verbose=false); genomes.mask[1:10, 42:100] .= false;
    
julia> filtered_genomes = filter(genomes);

julia> size(filtered_genomes.allele_frequencies)
(90, 9941)
source
Base.filterMethod
filter(phenomes::Phenomes)::Phenomes

Filter a Phenomes struct by removing rows (entries) and columns (traits) as indicated by the mask matrix. An entry or trait is removed if it contains at least one false value in the mask.

Arguments

  • phenomes::Phenomes: The Phenomes struct to be filtered, containing entries, populations, traits, phenotypes, and a boolean mask matrix.

Returns

  • Phenomes: A new Phenomes struct with filtered entries and traits, where the mask matrix is all true.

Details

The function uses the mean of rows and columns in the mask matrix to identify which entries and traits should be kept. Only entries and traits with a mean of 1.0 (all true values) are retained in the filtered result.

Examples

julia> phenomes = Phenomes(n=10, t=3); phenomes.entries = string.("entry_", 1:10); phenomes.populations .= "pop_1"; phenomes.traits = ["A", "B", "C"]; phenomes.phenotypes = fill(0.0, 10,3);

julia> phenomes.mask .= true; phenomes.mask[6:10, 1] .= false;
    
julia> filtered_phenomes = filter(phenomes);

julia> size(filtered_phenomes.phenotypes)
(5, 2)
source
Base.hashMethod
Base.hash(x::BLR, h::UInt)::UInt

Calculate a hash value for a BLR struct.

This method implements hashing for the BLR type by combining the hashes of its fields, excluding the "Xs" and "Σs" fields. Each field's hash is combined sequentially with the input hash value.

Arguments

  • x::BLR: The BLR struct to be hashed
  • h::UInt: The initial hash value to be mixed with

Returns

  • UInt: The computed hash value

Notes

  • The fields "Xs" and "Σs" are explicitly excluded from the hash computation
  • Fields are processed in the order defined in the struct

Example

julia> blr = BLR(n=1, p=1);

julia> typeof(hash(blr))
UInt64
source
Base.hashMethod
Base.hash(x::CV, h::UInt)::UInt

Compute a hash value for a CV (Cross-Validation) struct.

This method defines how CV structs should be hashed, which is useful for using CV objects in hash-based collections like Sets or as Dict keys.

Arguments

  • x::CV: The CV struct to be hashed
  • h::UInt: The hash value to be mixed with the new hash

Returns

  • UInt: A hash value for the CV struct

Implementation Details

The hash is computed by combining the following fields:

  • replication
  • fold
  • fit
  • validation_populations
  • validation_entries
  • validationytrue
  • validationypred
  • metrics

Example

julia> fit = Fit(n=1, l=2);

julia> cv = CV("replication_1", "fold_1", fit, ["population_1"], ["entry_1"], [0.0], [0.0], fit.metrics);

julia> typeof(hash(cv))
UInt64
source
Base.hashMethod
Base.hash(x::Fit, h::UInt)::UInt

Calculate a hash value for a Fit struct.

This method implements hashing for the Fit type by combining the hashes of its core components in a specific order. The hash is computed using the following fields:

  • model
  • b_hat (estimated effects)
  • trait
  • entries
  • populations
  • metrics
  • y_true (observed values)
  • y_pred (predicted values)

Arguments

  • x::Fit: The Fit struct to be hashed
  • h::UInt: The hash value to be mixed with

Returns

  • UInt: The computed hash value

Example

julia> fit = Fit(n=1, l=2);

julia> typeof(hash(fit))
UInt64
source
Base.hashMethod
Base.hash(x::GRM, h::UInt)::UInt

Compute a hash value for a GRM (Genomic Relationship Matrix) struct.

This method defines how GRM structs should be hashed, making them usable in hash-based collections like Sets or as Dict keys. The hash is computed by iteratively combining the hash values of all fields in the struct.

Arguments

  • x::GRM: The GRM struct to be hashed
  • h::UInt: The initial hash value to be combined with the struct's hash

Returns

  • UInt: A combined hash value for the entire GRM struct

Example

julia> grm = GRM(string.(["entries_1", "entries_2"]), string.(["chr1	123	A|T	A", "chr1	456	C|G	G"]), Float64.(rand(2,2)));

julia> typeof(hash(grm))
UInt64
source
Base.hashMethod
Base.hash(x::Genomes, h::UInt)::UInt

Compute a hash value for a Genomes struct.

This hash function considers three key components of the Genomes struct:

  • entries
  • populations
  • loci_alleles

For performance reasons, allele_frequencies and mask fields are deliberately excluded from the hash computation.

Arguments

  • x::Genomes: The Genomes struct to hash
  • h::UInt: The hash seed value

Returns

  • UInt: A hash value for the Genomes struct

Examples

julia> genomes = Genomes(n=2, p=2);

julia> typeof(hash(genomes))
UInt64
source
Base.hashMethod
Base.hash(x::Phenomes, h::UInt)::UInt

Compute a hash value for a Phenomes struct by recursively hashing its internal fields.

Arguments

  • x::Phenomes: The Phenomes struct to be hashed
  • h::UInt: The hash value to be mixed with

Returns

  • UInt: A hash value for the entire Phenomes struct

Note

This function is used for dictionary operations and computing hash-based data structures. The hash is computed by combining hashes of all internal fields: entries, populations, traits, phenotypes, and mask.

Examples

julia> phenomes = Phenomes(n=2, t=2);

julia> typeof(hash(phenomes))
UInt64
source
Base.hashMethod
hash(x::SimulatedEffects, h::UInt)::UInt

Compute a hash value for a SimulatedEffects object.

This method implements custom hashing for SimulatedEffects by iterating through all fields of the object and combining their hash values with the provided seed hash h.

Arguments

  • x::SimulatedEffects: The object to be hashed
  • h::UInt: The hash seed value

Returns

  • UInt: The computed hash value

Examples

julia> effects = SimulatedEffects();

julia> typeof(hash(effects))
UInt64
source
Base.hashMethod
Base.hash(x::TEBV, h::UInt)::UInt

Calculate a hash value for a TEBV (Trial-Estimated Breeding Value) struct.

This method implements hashing for TEBV objects by combining the hash values of selected fields:

  • traits: Vector of trait names
  • formulae: Vector of formula strings
  • phenomes: Vector of Phenomes objects

Note: For performance reasons, the following fields are deliberately excluded from the hash calculation:

  • models
  • df_BLUEs
  • df_BLUPs

Arguments

  • x::TEBV: The TEBV struct to be hashed
  • h::UInt: The hash value to be mixed with the object's hash

Returns

  • UInt: A unique hash value for the TEBV object

Examples

julia> tebv = TEBV(traits=[""], formulae=[""], models=[MixedModel(@formula(y~1+(1|x)), DataFrame(y=1, x=1))], df_BLUEs=[DataFrame(x=1)], df_BLUPs=[DataFrame(x=1)], phenomes=[Phenomes(n=1,t=1)]);

julia> typeof(hash(tebv))
UInt64
source
Base.hashMethod
Base.hash(x::Trials, h::UInt)::UInt

Compute a hash value for a Trials struct by recursively hashing all of its fields.

This method implements hash functionality for the Trials type, allowing Trials objects to be used as dictionary keys or in hash-based collections.

Arguments

  • x::Trials: The Trials struct to be hashed
  • h::UInt: The hash value to be mixed with the object's hash

Returns

  • UInt: A hash value for the entire Trials struct

Examples

julia> trials = Trials(n=2, t=2);

julia> typeof(hash(trials))
UInt64
source
Base.mergeMethod
merge(
    genomes::Genomes,
    other::Genomes;
    conflict_resolution::Tuple{Float64,Float64} = (0.5, 0.5),
    verbose::Bool = true
)::Genomes

Merge two Genomes structs by combining their entries and loci_alleles while resolving conflicts in allele frequencies.

Arguments

  • genomes::Genomes: First Genomes struct to merge
  • other::Genomes: Second Genomes struct to merge
  • conflict_resolution::Tuple{Float64,Float64}: Weights for resolving conflicts between allele frequencies (must sum to 1.0)
  • verbose::Bool: If true, displays a progress bar during merging

Returns

  • Genomes: A new Genomes struct containing the merged data

Details

The function performs the following operations:

  1. If the loci_alleles are identical and there are no overlapping entries, performs a quick merge:
    • Concatenates entries, populations, allele frequencies, and mask without conflict resolution.
  2. Combines unique entries and loci_alleles from both input structs
  3. Resolves population conflicts by concatenating conflicting values
  4. For overlapping entries and loci:
    • If allele frequencies match, uses the existing value
    • If frequencies differ, applies weighted average using conflict_resolution
    • For missing values, uses available non-missing value
    • Resolves mask conflicts using weighted average

Examples

julia> n = 100; l = 5_000; n_alleles = 2;

julia> all = simulategenomes(n=n, l=l, n_alleles=n_alleles, verbose=false);

julia> genomes = slice(all, idx_entries=collect(1:Int(floor(n*0.75))), idx_loci_alleles=collect(1:Int(floor(l*(n_alleles-1)*0.75))));

julia> other = slice(all, idx_entries=collect(Int(floor(n*0.50)):n), idx_loci_alleles=collect(Int(floor(l*(n_alleles-1)*0.50)):l*(n_alleles-1)));

julia> merged_genomes = merge(genomes, other, conflict_resolution=(0.75, 0.25), verbose=false);

julia> size(merged_genomes.allele_frequencies)
(100, 5000)

julia> sum(ismissing.(merged_genomes.allele_frequencies))
123725
source
Base.mergeMethod
merge(genomes::Genomes, phenomes::Phenomes; keep_all::Bool=true)::Tuple{Genomes,Phenomes}

Merge Genomes and Phenomes structs based on their entries, combining genomic and phenotypic data.

Arguments

  • genomes::Genomes: A struct containing genomic data including entries, populations, and allele frequencies
  • phenomes::Phenomes: A struct containing phenotypic data including entries, populations, and phenotypes
  • keep_all::Bool=true: If true, performs a union of entries; if false, performs an intersection

Returns

  • Tuple{Genomes,Phenomes}: A tuple containing:
    • A new Genomes struct with merged entries and corresponding genomic data
    • A new Phenomes struct with merged entries and corresponding phenotypic data

Details

  • Maintains dimensional consistency between input and output structs
  • Handles population conflicts by creating a combined population name
  • Preserves allele frequencies and phenotypic data for matched entries
  • When keep_all=true, includes all entries from both structs
  • When keep_all=false, includes only entries present in both structs

Examples

julia> genomes = simulategenomes(n=10, verbose=false);

julia> trials, effects = simulatetrials(genomes=slice(genomes, idx_entries=collect(1:5), idx_loci_alleles=collect(1:length(genomes.loci_alleles))), f_add_dom_epi=[0.90 0.05 0.05;], n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=2, verbose=false);

julia> phenomes = Phenomes(n=5, t=1);

julia> phenomes.entries = trials.entries[1:5];

julia> phenomes.populations = trials.populations[1:5];

julia> phenomes.traits = trials.traits;

julia> phenomes.phenotypes = trials.phenotypes[1:5, :];

julia> phenomes.mask .= true;

julia> genomes_merged_1, phenomes_merged_1 = merge(genomes, phenomes, keep_all=true);

julia> size(genomes_merged_1.allele_frequencies), size(phenomes_merged_1.phenotypes)
((10, 10000), (10, 1))

julia> genomes_merged_2, phenomes_merged_2 = merge(genomes, phenomes, keep_all=false);

julia> size(genomes_merged_2.allele_frequencies), size(phenomes_merged_2.phenotypes)
((5, 10000), (5, 1))
source
Base.mergeMethod
merge(
    phenomes::Phenomes,
    other::Phenomes;
    conflict_resolution::Tuple{Float64,Float64} = (0.5, 0.5),
    verbose::Bool = true
)::Phenomes

Merge two Phenomes structs into a single combined struct, handling overlapping entries and traits.

Arguments

  • phenomes::Phenomes: The first Phenomes struct to merge
  • other::Phenomes: The second Phenomes struct to merge
  • conflict_resolution::Tuple{Float64,Float64}: Weights for resolving conflicts between overlapping values (must sum to 1.0)
  • verbose::Bool: Whether to display a progress bar during merging

Returns

  • Phenomes: A new merged Phenomes struct containing all entries and traits from both input structs

Details

The merge operation combines:

  • All unique entries from both structs
  • All unique traits from both structs
  • Phenotype values and masks, using weighted averaging for conflicts
  • Population information, marking conflicts with a "CONFLICT" prefix

For overlapping entries and traits:

  • Identical values are preserved as-is
  • Different values are combined using the weights specified in conflict_resolution
  • Missing values are handled by using the available non-missing value
  • Population conflicts are marked in the format "CONFLICT (pop1, pop2)"

Throws

  • ArgumentError: If either Phenomes struct is corrupted (invalid dimensions)
  • ArgumentError: If conflict_resolution weights don't sum to 1.0 or aren't a 2-tuple
  • ErrorException: If the merging operation produces an invalid result

Examples

julia> all = Phenomes(n=10, t=3); all.entries = string.("entry_", 1:10); all.traits = ["A", "B", "C"]; all.phenotypes = rand(10,3);

julia> phenomes = slice(all, idx_entries=collect(1:7), idx_traits=[1,2]);

julia> other = slice(all, idx_entries=collect(5:10), idx_traits=[2,3]);

julia> merged_phenomes = merge(phenomes, other, conflict_resolution=(0.75, 0.25), verbose=false);

julia> size(merged_phenomes.phenotypes)
(10, 3)

julia> sum(ismissing.(merged_phenomes.phenotypes))
7
source
Base.sumMethod
sum(effects::SimulatedEffects)::Vector{Float64}

Sum up all simulated effects to generate the simulated phenotype values. The function iterates through all fields of the SimulatedEffects struct (except :id and :field_layout) and adds their values element-wise to produce a vector of phenotypic values.

Arguments

  • effects::SimulatedEffects: A struct containing various genetic and environmental effects

Returns

  • Vector{Float64}: A vector containing the summed effects (phenotypic values)

Examples

julia> effects = SimulatedEffects();

julia> sum(effects)
1-element Vector{Float64}:
 0.0

julia> effects.additive_genetic[1] = pi;

julia> sum(effects)
1-element Vector{Float64}:
 3.141592653589793
source
GenomicBreedingCore.addcompositetraitMethod
addcompositetrait(phenomes::Phenomes; composite_trait_name::String, formula_string::String)::Phenomes

Create a new composite trait by combining existing traits using mathematical operations.

Arguments

  • phenomes::Phenomes: A Phenomes struct containing the original trait data
  • composite_trait_name::String: Name for the new composite trait
  • formula_string::String: Mathematical formula describing how to combine existing traits. Supports traits as variables and the following operations:
    • Basic arithmetic: +, -, *, /, ^, %
    • Functions: abs(), sqrt(), log(), log2(), log10()
    • Parentheses for operation precedence

Returns

  • Phenomes: A new Phenomes struct with the composite trait added as the last column

Examples

julia> phenomes = Phenomes(n=10, t=3); phenomes.entries = string.("entry_", 1:10); phenomes.populations .= "pop_1"; phenomes.traits = ["A", "B", "C"]; phenomes.phenotypes = rand(10,3);

julia> phenomes_new = addcompositetrait(phenomes, composite_trait_name = "some_wild_composite_trait", formula_string = "A");

julia> phenomes_new.phenotypes[:, end] == phenomes.phenotypes[:, 1]
true

julia> phenomes_new = addcompositetrait(phenomes, composite_trait_name = "some_wild_composite_trait", formula_string = "(A^B) + (C/A) - sqrt(abs(B-A)) + log(1.00 + C)");

julia> phenomes_new.phenotypes[:, end] == (phenomes.phenotypes[:,1].^phenomes.phenotypes[:,2]) .+ (phenomes.phenotypes[:,3]./phenomes.phenotypes[:,1]) .- sqrt.(abs.(phenomes.phenotypes[:,2].-phenomes.phenotypes[:,1])) .+ log.(1.00 .+ phenomes.phenotypes[:,3])
true
source
GenomicBreedingCore.addcompositetraitMethod
addcompositetrait(trials::Trials; composite_trait_name::String, formula_string::String)::Trials

Create a new composite trait by combining existing traits using a mathematical formula.

Arguments

  • trials::Trials: A Trials struct containing phenotypic data
  • composite_trait_name::String: Name for the new composite trait
  • formula_string::String: Mathematical formula defining how to combine existing traits

Formula Syntax

The formula can include:

  • Trait names (e.g., "trait1", "trait2")
  • Mathematical operators: +, -, *, /, ^, %
  • Functions: abs(), sqrt(), log(), log2(), log10()
  • Parentheses for grouping operations

Returns

  • Trials: A new Trials struct with the added composite trait

Examples

julia> trials, _ = simulatetrials(genomes = simulategenomes(verbose=false), verbose=false);

julia> trials_new = addcompositetrait(trials, composite_trait_name = "some_wild_composite_trait", formula_string = "trait_1");

julia> trials_new.phenotypes[:, end] == trials.phenotypes[:, 1]
true

julia> trials_new = addcompositetrait(trials, composite_trait_name = "some_wild_composite_trait", formula_string = "(trait_1^(trait_2/100)) + (trait_3/trait_1) - sqrt(abs(trait_2-trait_1)) + log(1.00 + trait_3)");

julia> trials_new.phenotypes[:, end] == (trials.phenotypes[:,1].^(trials.phenotypes[:,2]/100)) .+ (trials.phenotypes[:,3]./trials.phenotypes[:,1]) .- sqrt.(abs.(trials.phenotypes[:,2].-trials.phenotypes[:,1])) .+ log.(1.00 .+ trials.phenotypes[:,3])
true
source
GenomicBreedingCore.aggregateharvestsMethod
aggregateharvests(
    trials::Trials; 
    traits::Union{Vector{String}, Nothing} = nothing,
    grouping::Vector{String} = ["years", "seasons", "sites", "replications", "blocks", "rows", "cols", "entries", "populations"],
    f::Function=x -> sum(skipmissing(x))
)::Tuple{Trials, DataFrame}

Aggregate harvest data from a Trials struct by specified grouping variables.

Arguments

  • trials::Trials: Input trials struct containing harvest data
  • traits::Union{Vector{String}, Nothing}: Vector of trait names to aggregate (default: all traits)
  • grouping::Vector{String}: Vector of column names to group by (default: ["years", "seasons", "sites", "replications", "blocks", "rows", "cols", "entries", "populations"])
  • f::Function: Aggregation function to apply (default: sum of non-missing values)

Returns

  • Tuple{Trials, DataFrame}: A tuple containing:
    • Aggregated trials struct with combined harvest data
    • DataFrame with harvest counts per year, season, and site

Examples

julia> trials, _ = simulatetrials(genomes = simulategenomes(n=5, l=1000, verbose=false), verbose=false);

julia> trials_agg_1, df_n_harvests_1 = aggregateharvests(trials);

julia> length(trials.entries) > length(trials_agg_1.entries)
true

julia> trials_agg_2, df_n_harvests_2 = aggregateharvests(trials, traits=["trait_1"]);

julia> length(trials.traits) > length(trials_agg_2.traits)
true
source
GenomicBreedingCore.analyseFunction
analyse(
    trials::Trials,
    formula_string::String = "";
    traits::Union{Nothing,Vector{String}} = nothing,
    max_levels::Int64 = 100,
    max_time_per_model::Int64 = 60,
    covariates_continuous::Union{Nothing,Vector{String}} = nothing,
    verbose::Bool = true
)::TEBV

Analyze trial data using linear mixed models to estimate Best Linear Unbiased Estimates (BLUEs) and Best Linear Unbiased Predictions (BLUPs).

Arguments

  • trials: A Trials struct containing the experimental data
  • formula_string: Optional model formula string. If empty, automatic model selection is performed
  • traits: Optional vector of trait names to analyze. If nothing, all traits are analyzed
  • max_levels: Maximum number of levels for non-entry random effects (default: 100)
  • max_time_per_model: Maximum fitting time in seconds per model (default: 60)
  • covariates_continuous: Optional vector of continuous covariates to include in models
  • verbose: Whether to display analysis progress (default: true)

Returns

A TEBV struct containing:

  • traits: Vector of analyzed trait names
  • formulae: Vector of best-fitting model formulae
  • models: Vector of fitted LinearMixedModel objects
  • df_BLUEs: Vector of DataFrames containing BLUEs
  • df_BLUPs: Vector of DataFrames containing BLUPs
  • phenomes: Vector of Phenomes objects with predicted values

Details

The function implements a mixed model fitting strategy with the following principles:

  • Avoids over-parameterization
  • Uses unstructured variance-covariance matrix for random effects
  • Prefers REML over ML estimation
  • Compares BLUEs vs BLUPs of entries
  • Handles both parallel and iterative model fitting based on model complexity

Notes

  • Models are fitted using REML
  • Simple models are fitted in parallel while complex models are fitted iteratively to avoid memory issues
  • Returns empty results if no models can be successfully fitted

Examples

julia> trials, _simulated_effects = simulatetrials(genomes = simulategenomes(n=10, verbose=false), n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=10, verbose=false);

julia> tebv_1 = analyse(trials, "trait_1 ~ 1 + (1|entries)", max_levels=50, verbose=false);

julia> tebv_1.traits
3-element Vector{String}:
 "trait_1"
 "trait_2"
 "trait_3"

julia> tebv_2 = analyse(trials, max_levels=50, verbose=false);

julia> mean(tebv_2.phenomes[1].phenotypes) < mean(tebv_2.phenomes[2].phenotypes)
true

julia> trials = addcompositetrait(trials, composite_trait_name = "covariate", formula_string = "(trait_1 + trait_2) / (trait_3 + 0.0001)");

julia> tebv_3 = Suppressor.@suppress analyse(trials, "y ~ 1 + covariate + entries + (1|blocks)", max_levels=50, verbose=false);

julia> mean(tebv_3.phenomes[1].phenotypes) < mean(tebv_3.phenomes[2].phenotypes)
true

julia> tebv_4 = Suppressor.@suppress analyse(trials, max_levels=50, covariates_continuous=["covariate"], verbose=false);

julia> mean(tebv_4.phenomes[1].phenotypes) < mean(tebv_4.phenomes[2].phenotypes)
true
source
GenomicBreedingCore.analyseMethod
analyse(
    df::DataFrame; 
    formulae::Vector{String},
    idx_parallel_models::Vector{Int64},
    idx_iterative_models::Vector{Int64},
    max_time_per_model::Int64 = 60,
    verbose::Bool=false
)::Tuple{String, Any, DataFrame, DataFrame, Phenomes}

Fit univariate linear mixed models to extract entry effects from the best-fitting model.

Arguments

  • df::DataFrame: Input data frame containing trial data with columns for entries, traits, and other experimental factors
  • formulae::Vector{String}: Vector of model formulae strings to be tested
  • idx_parallel_models::Vector{Int64}: Indices of simpler models to be fitted in parallel
  • idx_iterative_models::Vector{Int64}: Indices of complex models to be fitted iteratively
  • max_time_per_model::Int64: Maximum time in seconds allowed for fitting each model (default: 60)
  • verbose::Bool: Whether to display progress information (default: false)

Returns

A tuple containing:

  1. String: Formula of the best-fitting model
  2. Any: The fitted model object
  3. DataFrame: BLUEs (Best Linear Unbiased Estimates) results
  4. DataFrame: BLUPs (Best Linear Unbiased Predictions) results
  5. Phenomes: Struct containing consolidated phenotypic predictions

Details

The function implements a mixed model fitting strategy with the following principles:

  • Avoids over-parameterization
  • Uses unstructured variance-covariance matrix for random effects
  • Prefers REML over ML estimation
  • Compares BLUEs vs BLUPs of entries
  • Handles both parallel and iterative model fitting based on model complexity

Notes

  • All formulae must model the same trait
  • Models are fitted using REML
  • Simple models are fitted in parallel while complex models are fitted iteratively to avoid memory issues
  • Returns empty results if no models can be successfully fitted

Examples

julia> trials, _ = simulatetrials(genomes = simulategenomes(verbose=false), verbose=false);

julia> df = tabularise(trials);

julia> formulae, n_levels = trialsmodelsfomulae!(df; trait = "trait_1", max_levels = 10);

julia> idx_parallel_models::Vector{Int64} = findall(n_levels .<= (15));

julia> idx_iterative_models::Vector{Int64} = findall((n_levels .<= (15)) .!= true);

julia> formula_string, model, df_BLUEs, df_BLUPs, phenomes = analyse(df, formulae=formulae, idx_parallel_models=idx_parallel_models, idx_iterative_models=idx_iterative_models);

julia> length(phenomes.entries) == length(unique(df.entries))
true

julia> df_2 = df[(df.years .== df.years[1]) .&& (df.harvests .== df.harvests[1]) .&& (df.seasons .== df.seasons[1]) .&& (df.sites .== df.sites[1]) .&& (df.replications .== df.replications[1]), :];

julia> formula_string_2, model_2, df_BLUEs_2, df_BLUPs_2, phenomes_2 = analyse(df_2, formulae=["trait_1 ~ 1 + 1|entries"]);

julia> cor(phenomes_2.phenotypes[sortperm(phenomes_2.entries),1], df_2.trait_1[sortperm(df_2.entries)]) > 0.99
true
source
GenomicBreedingCore.analyseviaBLRMethod
analyseviaBLR(
    trials::Trials,
    traits::Vector{String};
    grm::Union{GRM,Nothing} = nothing,
    other_covariates::Union{Vector{String},Nothing} = nothing,
    empirical_Σs::Bool = true,
    multiple_σs_threshold::Int64 = 500,
    n_iter::Int64 = 10_000,
    n_burnin::Int64 = 1_000,
    seed::Int64 = 1234,
    verbose::Bool = false,
)::Tuple{TEBV,Dict{String,DataFrame}}

Perform Bayesian linear mixed model analysis on trial data for genetic evaluation.

Arguments

  • trials::Trials: A Trials struct containing the experimental data
  • traits::Vector{String}: Vector of trait names to analyze. If empty, all traits in trials will be analyzed
  • grm::Union{GRM,Nothing}=nothing: Optional genomic relationship matrix
  • other_covariates::Union{Vector{String},Nothing}=nothing: Additional covariates to include in the model
  • multiple_σs_threshold::Int64=500: Threshold for determining multiple variance components
  • n_iter::Int64=10_000: Number of MCMC iterations
  • n_burnin::Int64=1_000: Number of burn-in iterations
  • seed::Int64=1234: Random seed for reproducibility
  • verbose::Bool=false: Whether to print progress information

Returns

A tuple containing:

  • TEBV: Total estimated breeding values struct with model results. Note that:
    • only the variance-covariance components represent the p-1 factor levels; while,
    • the rest have the full number of levels, i.e. using the one-hot encoding vectors and matrices).
    • This means that the Σs have less rows and columns than the number of elements in coefficient_names.
  • Dict{String,DataFrame}: Spatial diagnostics information

Details

Performs a two-stage analysis:

  1. Stage-1: Spatial analysis per harvest-site-year combination
  • Creates temporary JLD2 file with spatially adjusted data to manage memory
  • Corrects for spatial effects:
    • With rows and columns regardless of whether blocks are present:
      • rows + cols + rows:cols
    • With blocks and one spatial factor:
      • blocks + rows + blocks:rows
      • blocks + cols + blocks:cols
    • With single spatial factor:
      • blocks
      • rows
      • cols
  • Removes effects of continuous covariates
  • Returns spatially adjusted traits with "SPATADJ-" prefix
  1. Stage-2: GxE modeling excluding spatial effects and continuous covariates 2.a. Genotypic effects: - entries (required) 2.b. Environmental main effects: - sites if present - seasons if present - years if present 2.c Environmental interactions: - With all 3 environment factors: + years:sites + seasons:sites + entries:seasons:sites - With 2 environment factors: + years:seasons + entries:seasons (no sites) + years:sites + entries:sites (no seasons) + seasons:sites + entries:seasons:sites (no years) - With 1 environment factor: + entries:years + entries:seasons + entries:sites

The analysis includes:

  • Genomic relationship matrix (GRM) integration if provided:
    • GRM replaces identity matrices for entry-related effects
    • Diagonals are inflated if resulting matrices not positive definite
    • Inflation repeated up to 10 times to ensure stability
  • Variance component estimation:
    • Single vs multiple variance scalers determined by threshold
    • Separate parameters for complex interaction terms
  • MCMC-based Bayesian inference with:
    • Burn-in period for chain convergence
    • Diagnostic checks for convergence and mixing

Notes

  • Automatically handles memory management for large design matrices
  • Creates temporary JLD2 file "TEMP-dfspatadj-[hash].jld2" to store spatially adjusted data
  • Automatically removes temporary JLD2 file after analysis is complete
  • Supports both identity and genomic relationship matrices for genetic effects
  • Performs automatic model diagnostics and variance component scaling
  • Excludes continuous covariates from Stage-2 as they are handled in Stage-1

Examples

julia> genomes = simulategenomes(n=5, l=1_000, verbose=false);

julia> grm = grmploidyaware(genomes, ploidy=2, max_iter=10);

julia> trials, simulated_effects = simulatetrials(genomes = genomes, n_years=1, n_seasons=2, n_harvests=1, n_sites=3, n_replications=3, verbose=false);

julia> tebv_1, spatial_diagnostics_1 = analyseviaBLR(trials, ["trait_1"], n_iter = 1_000, n_burnin = 100);

julia> tebv_2, spatial_diagnostics_2 = analyseviaBLR(trials, ["trait_1", "trait_2"], other_covariates = ["trait_3"], n_iter = 1_000, n_burnin = 100);

julia> tebv_3, spatial_diagnostics_3 = analyseviaBLR(trials, ["trait_3"], grm = grm, n_iter = 1_000, n_burnin = 100);

julia> (length(tebv_1.phenomes[1].entries) == 5) && (length(tebv_1.phenomes[2].entries) == 30) && (length(spatial_diagnostics_1) == 6)
true

julia> (length(tebv_2.phenomes[1].entries) == 5) && (length(tebv_2.phenomes[2].entries) == 30) && (length(spatial_diagnostics_2) == 12)
true

julia> (length(tebv_3.phenomes[1].entries) == 5) && (length(tebv_3.phenomes[2].entries) == 30) && (length(spatial_diagnostics_3) == 6)
true
source
GenomicBreedingCore.checkandfocaltermsMethod
checkandfocalterms(trait::String, factors::Vector{String}, df::DataFrame, other_covariates::Union{Vector{String},Nothing} = nothing)::Vector{String}

Validate input data and generate model terms for Bayesian analysis of field trials.

Arguments

  • trait::String: Name of the response variable (trait) column in the DataFrame
  • factors::Vector{String}: Vector of factor names (categorical variables) to include in the model
  • df::DataFrame: DataFrame containing the trial data
  • other_covariates::Union{Vector{String},Nothing}: Optional vector of numeric covariate column names

Returns

  • Vector{String}: A vector of model terms including main effects, GxE interaction effects, and spatial effects

Details

The function performs several tasks:

  1. Validates that all specified columns exist in the DataFrame
  2. Ensures trait and covariates are numeric
  3. Converts factors to strings
  4. Generates appropriate model terms based on available factors:
    • Main effects (entries, sites, seasons, years)
    • GxE interaction effects (various combinations of entries × environmental factors)
    • Spatial effects (blocks, rows, columns and their interactions)

Throws

  • ArgumentError: If trait or factors are not found in DataFrame
  • ArgumentError: If trait or covariates are non-numeric or contain missing values

Example

julia> genomes = simulategenomes(n=5, l=1_000, verbose=false);

julia> trials, simulated_effects = simulatetrials(genomes = genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=3, verbose=false);

julia> df = tabularise(trials);

julia> focal_terms_1 = checkandfocalterms(trait = trials.traits[1], factors = ["rows", "cols"], df = df, other_covariates = trials.traits[2:end])
2-element Vector{String}:
 "rows"
 "cols"

julia> focal_terms_2 = checkandfocalterms(trait = trials.traits[1], factors = ["years", "seasons", "sites"], df = df)
5-element Vector{String}:
 "years"
 "seasons"
 "sites"
 "years:sites"
 "seasons:sites"
source
GenomicBreedingCore.checkdimsMethod
checkdims(blr::BLR)::Bool

Check dimension compatibility of the internal fields of a BLR struct.

This function verifies that all vector and matrix fields in the BLR struct have compatible dimensions:

  • Length of entries, y, ȳ, and ϵ must be equal (denoted as n)
  • Length of coefficients and coefficient_names must be equal (denoted as p)
  • Matrix Xs must have dimensions n × p
  • Matrix Σs must have dimensions p × p

Returns true if all dimensions are compatible, false otherwise.

Arguments

  • blr::BLR: The BLR struct to check dimensions for

Returns

  • Bool: true if dimensions are compatible, false otherwise

Examples

julia> blr = BLR(n=1, p=4);

julia> checkdims(blr)
true

julia> blr.coefficient_names["dummy_X"] = ["dummy_coef"];

julia> checkdims(blr)
false
source
GenomicBreedingCore.checkdimsMethod
checkdims(cv::CV)::Bool

Check dimension compatibility of the fields of the CV struct.

The function verifies that:

  • The fit object dimensions are valid
  • The number of validation populations matches the number of validation entries
  • The number of validation true values matches the number of validation predictions
  • The number of metrics matches the number of metrics in the fit object

Returns:

  • true if all dimensions are compatible
  • false if any dimension mismatch is found

Examples

julia> fit = Fit(n=1, l=2);

julia> cv = CV("replication_1", "fold_1", fit, ["population_1"], ["entry_1"], [0.0], [0.0], fit.metrics);

julia> checkdims(cv)
true

julia> cv.validation_y_true = [0.0, 0.0];

julia> checkdims(cv)
false
source
GenomicBreedingCore.checkdimsMethod
checkdims(fit::Fit)::Bool

Check dimension compatibility of the internal fields of a Fit struct.

This function verifies that all vector fields in the Fit struct have compatible dimensions:

  • Length of entries, populations, y_true, and y_pred must be equal (denoted as n)
  • Length of b_hat and b_hat_labels must be equal (denoted as l)

Returns true if all dimensions are compatible, false otherwise.

Arguments

  • fit::Fit: The Fit struct to check dimensions for

Returns

  • Bool: true if dimensions are compatible, false otherwise

Examples

julia> fit = Fit(n=1, l=4);

julia> checkdims(fit)
true

julia> fit.b_hat_labels = ["chr1\t1\tA|T\tA"];

julia> checkdims(fit)
false
source
GenomicBreedingCore.checkdimsMethod
checkdims(grm::GRM)::Bool

Check dimension compatibility of the GRM (Genomic Relationship Matrix) struct fields.

Arguments

  • grm::GRM: A Genomic Relationship Matrix struct containing entries and relationship matrix

Returns

  • true if the number of entries matches the dimensions of the genomic relationship matrix
  • false if there is a mismatch between the number of entries and matrix dimensions

Details

The function verifies that:

  1. The number of entries equals the number of rows in the genomic relationship matrix
  2. The number of entries equals the number of columns in the genomic relationship matrix

(The genomic relationship matrix should be square with dimensions matching the number of entries)

Examples

julia> grm = GRM(string.(["entries_1", "entries_2"]), string.(["chr1	123	A|T	A", "chr1	456	C|G	G"]), Float64.(rand(2,2)));

julia> checkdims(grm)
true

julia> grm.entries = ["dummy_entry"];

julia> checkdims(grm)
false
source
GenomicBreedingCore.checkdimsMethod
checkdims(genomes::Genomes)::Bool

Check dimension compatibility of the fields in a Genomes struct.

Returns true if all dimensions are compatible, false otherwise.

Arguments

  • genomes::Genomes: A Genomes struct containing genomic data

Details

Verifies that:

  • Number of entries matches number of populations (n)
  • Entry names are unique
  • Number of loci alleles matches width of frequency matrix (p)
  • Locus-allele combinations are unique
  • Entries are unique
  • Dimensions of frequency matrix (n×p) match mask matrix dimensions

Examples

julia> genomes = Genomes(n=2,p=4);

julia> checkdims(genomes)
false

julia> genomes.entries = ["entry_1", "entry_2"];

julia> genomes.loci_alleles = ["chr1\t1\tA|T\tA", "chr1\t2\tC|G\tG", "chr2\t3\tA|T\tA", "chr2\t4\tG|T\tG"];

julia> checkdims(genomes)
true
source
GenomicBreedingCore.checkdimsMethod
checkdims(y::Phenomes)::Bool

Verify dimensional compatibility between all fields of a Phenomes struct.

Checks if:

  • Number of entries matches the number of rows in phenotypes matrix
  • All entry names are unique
  • Number of populations matches number of entries
  • Number of traits matches number of columns in phenotypes matrix
  • All trait names are unique
  • Dimensions of mask matrix match phenotypes matrix

Arguments

  • y::Phenomes: A Phenomes struct containing phenotypic data

Returns

  • Bool: true if all dimensions are compatible, false otherwise

Examples

julia> y = Phenomes(n=2, t=2);

julia> checkdims(y)
false

julia> y.entries = ["entry_1", "entry_2"];

julia> y.traits = ["trait_1", "trait_2"];

julia> checkdims(y)
true
source
GenomicBreedingCore.checkdimsMethod
checkdims(effects::SimulatedEffects)::Bool

Check dimension compatibility of the fields of the SimulatedEffects struct.

Arguments

  • effects::SimulatedEffects: A SimulatedEffects struct containing various genetic and experimental effects

Returns

  • Bool: true if all dimensions are compatible, false otherwise

Verifies that:

  • id has length 6
  • field_layout has 4 columns
  • All following vectors have the same length (n):
    • replications_x_site_x_harvest_x_season_x_year
    • blocks_x_site_x_harvest_x_season_x_year
    • rows_x_site_x_harvest_x_season_x_year
    • cols_x_site_x_harvest_x_season_x_year
    • additive_genetic
    • dominance_genetic
    • epistasis_genetic
    • additive_allele_x_site_x_harvest_x_season_x_year
    • dominance_allele_x_site_x_harvest_x_season_x_year
    • epistasis_allele_x_site_x_harvest_x_season_x_year

Examples

julia> effects = SimulatedEffects();

julia> typeof(hash(effects))
UInt64

jldoctest; setup = :(using GenomicBreedingCore) julia> effects = SimulatedEffects();

julia> checkdims(effects) true

julia> effects.id = ["beaking_change"];

julia> checkdims(effects) false ```

source
GenomicBreedingCore.checkdimsMethod
checkdims(y::TEBV)::Bool

Check if all fields in the TEBV struct have compatible dimensions. The function verifies that the length of all arrays in the TEBV struct match the number of traits.

Arguments

  • tebv::TEBV: A TEBV (Trial-estimated Breeding Values) struct containing traits, formulae, models, BLUEs, BLUPs, and phenomes.

Returns

  • Bool: Returns true if all fields have matching dimensions (equal to the number of traits), false otherwise.

Details

The function checks if the following fields have the same length as traits:

  • formulae
  • unique models
  • unique BLUEs DataFrames
  • unique BLUPs DataFrames
  • unique phenomes

Examples

``jldoctest; setup = :(using GenomicBreedingCore, MixedModels, DataFrames) julia> tebv = TEBV(traits=[""], formulae=[""], models=[MixedModel(@formula(y~1+(1|x)), DataFrame(y=1, x=1))], dfBLUEs=[DataFrame(x=1)], dfBLUPs=[DataFrame(x=1)], phenomes=[Phenomes(n=1,t=1)]);

julia> checkdims(tebv) true ```

source
GenomicBreedingCore.checkdimsMethod
checkdims(trials::Trials)::Bool

Check dimension compatibility of all fields in a Trials struct.

This function verifies that the dimensions of all vector fields in the Trials struct are consistent with the size of the phenotypes matrix. Specifically, it checks:

  • Number of traits (t) matches number of columns in phenotypes and length of traits vector
  • Number of entries (n) matches number of rows in phenotypes and length of:
    • years
    • seasons
    • harvests
    • sites
    • replications
    • blocks
    • rows
    • cols
    • entries
    • populations

Returns true if all dimensions are compatible, false otherwise.

Arguments

  • trials::Trials: A Trials struct containing trial data

Returns

  • Bool: true if dimensions are compatible, false otherwise

Examples

julia> trials = Trials(n=1, t=2);

julia> trials.entries = ["entry_1"]; trials.traits = ["trait_1", "trait_2"];

julia> checkdims(trials)
true

julia> trials.entries = ["entering_2_entries", "instead_of_just_1"];

julia> checkdims(trials)
false
source
GenomicBreedingCore.cloneMethod
clone(x::BLR)::BLR

Create a deep copy of a BLR object, duplicating all its fields.

This function performs a deep clone of the input BLR object, ensuring that all nested structures and arrays are also copied, preventing any shared references between the original and the cloned object.

Arguments

  • x::BLR: The BLR object to be cloned

Returns

  • BLR: A new BLR object with identical but independent values

Fields copied

  • entries::Vector{String}: Names or identifiers for the observations
  • coefficient_names::Vector{String}: Names of the model coefficients/effects
  • y::Vector{Float64}: Response/dependent variable vector
  • Xs::Dict{String, Matrix{Union{Bool, Float64}}}: Design matrices for factors and numeric matrix for the other covariates
  • coefficients::Vector{Float64}: Estimated coefficients/effects
  • ŷ::Vector{Float64}: Fitted/predicted values
  • ϵ::Vector{Float64}: Residuals (y - ŷ)
  • Σs::Dict{String, Union{Matrix{Float64}, UniformScaling{Float64}}}: Variance-covariance matrices

Examples

julia> blr = BLR(n=1, p=1);

julia> copy_blr = clone(blr)
BLR([""], Dict{String, Matrix{Union{Bool, Float64}}}("intercept" => [true;;]), Dict{String, Union{Nothing, UniformScaling{Float64}, Matrix{Float64}}}("σ²" => UniformScaling{Float64}(1.0)), Dict("intercept" => [0.0]), Dict("intercept" => ["intercept"]), [0.0], [0.0], [0.0], 0×0 DataFrame)
source
GenomicBreedingCore.cloneMethod
clone(x::CV)::CV

Create a deep copy of a CV (cross-validation) object.

Creates a new CV object with deep copies of all fields from the input object. The clone function ensures that modifications to the cloned object do not affect the original object.

Arguments

  • x::CV: The CV object to be cloned

Returns

  • CV: A new CV object containing deep copies of all fields from the input

Example

Clone a CV object

Example

julia> fit = Fit(n=1, l=2);

julia> cv = CV("replication_1", "fold_1", fit, ["population_1"], ["entry_1"], [0.0], [0.0], fit.metrics);

julia> copy_cv = clone(cv)
CV("replication_1", "fold_1", Fit("", ["", ""], [0.0, 0.0], "", [""], [""], [0.0], [0.0], Dict("" => 0.0), nothing), ["population_1"], ["entry_1"], [0.0], [0.0], Dict("" => 0.0))
source
GenomicBreedingCore.cloneMethod
clone(x::Fit)::Fit

Create a deep copy of a Fit object, duplicating all its fields.

This function performs a deep clone of the input Fit object, ensuring that all nested structures and arrays are also copied, preventing any shared references between the original and the cloned object.

Arguments

  • x::Fit: The Fit object to be cloned

Returns

  • Fit: A new Fit object with identical but independent values

Fields copied

  • model: The statistical model
  • b_hat_labels: Labels for the estimated parameters
  • b_hat: Estimated parameters
  • trait: The trait being analyzed
  • entries: Entry identifiers
  • populations: Population identifiers
  • metrics: Performance metrics
  • y_true: Observed values
  • y_pred: Predicted values

Examples

julia> fit = Fit(n=1, l=2);

julia> copy_fit = clone(fit)
Fit("", ["", ""], [0.0, 0.0], "", [""], [""], [0.0], [0.0], Dict("" => 0.0), nothing)
source
GenomicBreedingCore.cloneMethod
clone(x::GRM)::GRM

Create a deep copy of a GRM (Genomic Relationship Matrix) object.

Creates a new GRM object with deep copies of all fields from the input object. The clone function ensures that modifications to the cloned object do not affect the original object.

Arguments

  • x::GRM: The GRM object to be cloned

Returns

  • GRM: A new GRM object containing deep copies of the entries, locialleles, and genomicrelationship_matrix fields from the input

Example

julia> grm = GRM(string.(["entries_1", "entries_2"]), string.(["chr1	123	A|T	A", "chr1	456	C|G	G"]), Float64.(rand(2,2)));

julia> copy_grm = clone(grm);

julia> (copy_grm.entries == grm.entries) && (copy_grm.loci_alleles == grm.loci_alleles) && (copy_grm.genomic_relationship_matrix == grm.genomic_relationship_matrix)
true
source
GenomicBreedingCore.cloneMethod
clone(x::Genomes)::Genomes

Create a deep copy of a Genomes object.

This function performs a deep clone of all fields in the Genomes object, including:

  • entries
  • populations
  • loci_alleles
  • allele_frequencies
  • mask

Returns a new Genomes instance with identical but independent data.

Arguments

  • x::Genomes: The source Genomes object to clone

Returns

  • Genomes: A new Genomes object containing deep copies of all fields

Example

julia> genomes = Genomes(n=2, p=2);

julia> copy_genomes = clone(genomes)
Genomes(["", ""], ["", ""], ["", ""], Union{Missing, Float64}[missing missing; missing missing], Bool[1 1; 1 1])
source
GenomicBreedingCore.cloneMethod
clone(x::Phenomes)::Phenomes

Create a deep copy of a Phenomes object, including all its fields.

This function performs a deep copy of the following fields:

  • entries: Vector of entry names
  • populations: Vector of population identifiers
  • traits: Vector of trait names
  • phenotypes: Matrix of phenotypic values
  • mask: Matrix of boolean masks

Returns a new Phenomes object with identical structure but independent memory allocation.

Arguments

  • x::Phenomes: The source Phenomes object to be cloned

Returns

  • Phenomes: A new Phenomes object containing deep copies of all fields

Example

julia> phenomes = Phenomes(n=2, t=2);

julia> copy_phenomes = clone(phenomes)
Phenomes(["", ""], ["", ""], ["", ""], Union{Missing, Float64}[missing missing; missing missing], Bool[1 1; 1 1])
source
GenomicBreedingCore.cloneMethod
clone(x::TEBV)::TEBV

Create a deep copy of a TEBV (Trial-Estimated Breeding Value) object.

Returns a new TEBV instance with all fields deeply copied from the input object, ensuring complete independence between the original and cloned objects.

Arguments

  • x::TEBV: The source TEBV object to be cloned

Returns

  • TEBV: A new TEBV object containing deep copies of all fields from the input

Examples

julia> tebv = TEBV(traits=[""], formulae=[""], models=[MixedModel(@formula(y~1+(1|x)), DataFrame(y=1, x=1))], df_BLUEs=[DataFrame(x=1)], df_BLUPs=[DataFrame(x=1)], phenomes=[Phenomes(n=1,t=1)]);

julia> copy_tebv = clone(tebv);

julia> copy_tebv.traits == tebv.traits
true

julia> copy_tebv.phenomes == tebv.phenomes
true
source
GenomicBreedingCore.cloneMethod
clone(x::Trials)::Trials

Create a deep copy of a Trials object, including all its fields.

This function performs a complete deep copy of the input Trials object, ensuring that all nested data structures are also copied rather than referenced.

Arguments

  • x::Trials: The source Trials object to be cloned

Returns

  • Trials: A new Trials object containing copies of all data from the input

Example

julia> trials = Trials(n=2, t=2);

julia> copy_trials = clone(trials)
Trials(Union{Missing, Float64}[missing missing; missing missing], ["", ""], ["", ""], ["", ""], ["", ""], ["", ""], ["", ""], ["", ""], ["", ""], ["", ""], ["", ""], ["", ""])
source
GenomicBreedingCore.countlevelsMethod
countlevels(df::DataFrame; column_names::Vector{String})::Int64

Count the total number of unique values (factor levels) across specified columns in a DataFrame.

Arguments

  • df::DataFrame: Input DataFrame to analyze
  • column_names::Vector{String}: Vector of column names to count unique values from

Returns

  • Int64: Sum of unique values across all specified columns

Throws

  • ArgumentError: If any of the specified column names are not found in the DataFrame
source
GenomicBreedingCore.dimensionsMethod
dimensions(blr::BLR)::Dict{String, Any}

Calculate various dimensional properties of a Bayesian Linear Regression (BLR) model.

Arguments

  • blr::BLR: A Bayesian Linear Regression model structure

Returns

A dictionary containing the following keys:

  • "n_rows": Number of observations
  • "n_coefficients": Total number of coefficients across all components
  • "n_variance_components": Number of variance components
  • "n_entries": Number of unique entries
  • "coeff_per_varcomp": Dictionary mapping variance component names to their number of coefficients
  • "varex_per_varcomp": Dictionary mapping variance component names to their variance explained (normalized)

Details

The function performs the following:

  1. Validates BLR structure dimensions
  2. Calculates coefficients per variance component
  3. Computes variance explained for each component and normalizes by total variance
  4. Returns dimensional summary as a dictionary

Throws

  • ArgumentError: If the BLR structure dimensions are inconsistent

Examples

julia> blr = BLR(n=10, p=6, var_comp = Dict("entries" => 5, "σ²" => 1));

julia> dimensions(blr)
Dict{String, Any} with 6 entries:
  "n_coefficients"        => 6
  "n_variance_components" => 2
  "varex_per_varcomp"     => Dict("entries"=>0.333333, "σ²"=>0.666667)
  "n_entries"             => 1
  "n_rows"                => 10
  "coeff_per_varcomp"     => Dict("entries"=>5.0, "σ²"=>10.0)
source
GenomicBreedingCore.dimensionsMethod
dimensions(genomes::Genomes)::Dict{String, Int64}

Calculate various dimensional metrics of a Genomes struct.

Returns a dictionary containing the following metrics:

  • "n_entries": Number of unique entries/samples
  • "n_populations": Number of unique populations
  • "n_loci_alleles": Total number of loci-allele combinations
  • "n_chr": Number of chromosomes
  • "n_loci": Number of unique loci across all chromosomes
  • "max_n_alleles": Maximum number of alleles observed at any locus
  • "n_missing": Count of missing values in allele frequencies

Arguments

  • genomes::Genomes: A valid Genomes struct containing genetic data

Returns

  • Dict{String,Int64}: Dictionary containing dimensional metrics

Throws

  • ArgumentError: If the Genomes struct is corrupted (fails dimension check)

Examples

julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=4, verbose=false);

julia> dimensions(genomes)
Dict{String, Int64} with 7 entries:
  "n_entries"      => 100
  "n_chr"          => 7
  "n_loci"         => 1000
  "n_loci_alleles" => 3000
  "n_populations"  => 1
  "n_missing"      => 0
  "max_n_alleles"  => 4
source
GenomicBreedingCore.dimensionsMethod
dimensions(phenomes::Phenomes)::Dict{String, Int64}

Calculate various dimensional statistics of a Phenomes struct.

Returns a dictionary containing counts of:

  • "n_entries": unique entries in the dataset
  • "n_populations": unique populations
  • "n_traits": number of traits
  • "n_total": total number of phenotypic observations (entries × traits)
  • "n_zeroes": number of zero values in phenotypes
  • "n_missing": number of missing values
  • "n_nan": number of NaN values
  • "n_inf": number of infinite values

Arguments

  • phenomes::Phenomes: A Phenomes struct containing phenotypic data

Returns

  • Dict{String,Int64}: Dictionary with dimensional statistics

Throws

  • ArgumentError: If the Phenomes struct dimensions are inconsistent

Examples

julia> phenomes = Phenomes(n=10, t=3); phenomes.entries = string.("entry_", 1:10); phenomes.populations .= "pop_1"; phenomes.traits = ["A", "B", "C"]; phenomes.phenotypes = fill(0.0, 10,3);

julia> dimensions(phenomes)
Dict{String, Int64} with 8 entries:
  "n_total"       => 30
  "n_zeroes"      => 30
  "n_nan"         => 0
  "n_entries"     => 10
  "n_traits"      => 3
  "n_inf"         => 0
  "n_populations" => 1
  "n_missing"     => 0
source
GenomicBreedingCore.dimensionsMethod
dimensions(tebv::TEBV)::Dict{String, Int64}

Calculate various dimensional metrics for a TEBV (Trial-Estimated Breeding Values) struct.

Arguments

  • tebv::TEBV: A TEBV struct containing traits, formulae, models, BLUEs, BLUPs, and phenomes data

Returns

A dictionary containing the following counts:

  • "n_entries": Number of unique entries across all phenomes
  • "n_populations": Number of unique populations across all phenomes
  • "n_traits": Number of traits in the TEBV struct
  • "n_total": Total number of observations across all traits
  • "n_zeroes": Total number of zero values across all traits
  • "n_missing": Total number of missing values across all traits
  • "n_nan": Total number of NaN values across all traits
  • "n_inf": Total number of Infinite values across all traits

Throws

  • ArgumentError: If the TEBV struct dimensions are inconsistent or corrupted

Examples

julia> tebv = TEBV(traits=["trait_1"], formulae=["trait_1 ~ 1 + 1|entries"], models=[MixedModel(@formula(y~1+(1|x)), DataFrame(y=1, x=1))], df_BLUEs=[DataFrame(x=1)], df_BLUPs=[DataFrame(x=1)], phenomes=[Phenomes(n=1,t=1)]);

julia> dimensions(tebv)
Dict{String, Int64} with 8 entries:
  "n_total"       => 1
  "n_zeroes"      => 0
  "n_nan"         => 0
  "n_entries"     => 1
  "n_traits"      => 1
  "n_inf"         => 0
  "n_populations" => 1
  "n_missing"     => 1
source
GenomicBreedingCore.dimensionsMethod
dimensions(trials::Trials)::Dict{String, Int64}

Calculate dimensional statistics of a Trials struct, returning a dictionary with counts of various elements.

Arguments

  • trials::Trials: A Trials struct containing trial data

Returns

A Dict{String, Int64} with the following keys:

  • "n_traits": Number of unique traits
  • "n_years": Number of unique years
  • "n_seasons": Number of unique seasons
  • "n_harvests": Number of unique harvests
  • "n_sites": Number of unique sites
  • "n_replications": Number of unique replications
  • "n_blocks": Number of unique blocks
  • "n_rows": Number of unique rows
  • "n_cols": Number of unique columns
  • "n_entries": Number of unique entries
  • "n_populations": Number of unique populations
  • "n_total": Total number of phenotype observations (entries × traits)
  • "n_zeroes": Count of zero values in phenotypes
  • "n_missing": Count of missing values in phenotypes
  • "n_nan": Count of NaN values in phenotypes
  • "n_inf": Count of Inf values in phenotypes

Throws

  • ArgumentError: If the Trials struct dimensions are inconsistent

Examples

julia> trials = Trials(n=1, t=2);

julia> trials.entries = ["entry_1"]; trials.traits = ["trait_1", "trait_2"];

julia> dimensions(trials)
Dict{String, Int64} with 16 entries:
  "n_zeroes"       => 0
  "n_harvests"     => 1
  "n_nan"          => 0
  "n_entries"      => 1
  "n_traits"       => 2
  "n_seasons"      => 1
  "n_rows"         => 1
  "n_blocks"       => 1
  "n_missing"      => 2
  "n_inf"          => 0
  "n_total"        => 2
  "n_replications" => 1
  "n_years"        => 1
  "n_sites"        => 1
  "n_cols"         => 1
  "n_populations"  => 1
source
GenomicBreedingCore.distancesMethod
distances(
    genomes::Genomes; 
    distance_metrics::Vector{String}=["euclidean", "correlation", "mad", "rmsd", "χ²"],
    idx_loci_alleles::Union{Nothing, Vector{Int64}} = nothing,
    include_loci_alleles::Bool = true,
    include_entries::Bool = true,
    include_counts::Bool = true,
    verbose::Bool = false
)::Tuple{Vector{String},Vector{String},Dict{String,Matrix{Float64}}}

Calculate pairwise distances/similarity metrics between loci-alleles and entries in a Genomes object.

Arguments

  • genomes::Genomes: Input Genomes object
  • distance_metrics::Vector{String}: Vector of distance metrics to calculate. Valid options:
    • "euclidean": Euclidean distance
    • "correlation": Pearson correlation coefficient
    • "mad": Mean absolute deviation
    • "rmsd": Root mean square deviation
    • "χ²": Chi-square distance
  • idx_loci_alleles::Union{Nothing, Vector{Int64}}: Optional indices of loci-alleles to include. If nothing, randomly samples 100 loci-alleles.
  • include_loci_alleles::Bool: Whether to calculate distances between loci-alleles. Defaults to true.
  • include_entries::Bool: Whether to calculate distances between entries. Defaults to true.
  • include_counts::Bool: Whether to include matrices showing number of valid pairs used. Defaults to true.
  • verbose::Bool: Whether to show progress bars. Defaults to false.

Returns

Tuple containing:

  1. Vector of loci-allele names used
  2. Vector of entry names
  3. Dictionary mapping "{dimension}|{metric}" to distance matrices, where:
    • dimension is either "loci_alleles" or "entries"
    • metric is one of the distance metrics or "counts" (number of valid pairs used)
    • matrices contain pairwise distances/correlations (-Inf where insufficient data)

Details

  • For loci-alleles, calculates distances between allele frequency profiles across entries
  • For entries, calculates distances between entries based on their allele frequencies
  • Requires at least 2 valid (non-missing, finite) pairs to calculate metrics
  • Includes count matrices showing number of valid pairs used per calculation
  • Multi-threaded implementation which uses indexing on pre-allocated vectors and matrices which should avoid data races

Examples

julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=4, verbose=false);

julia> (loci_alleles_names, entries, dist) = distances(genomes, distance_metrics=["correlation", "χ²"]);

julia> sort(string.(keys(dist))) == ["entries|correlation", "entries|counts", "entries|χ²", "loci_alleles|correlation", "loci_alleles|counts", "loci_alleles|χ²"]
true

julia> C = dist["entries|correlation"]; C[diagind(C)] == repeat([1], length(genomes.entries))
true

julia> χ² = dist["loci_alleles|χ²"]; χ²[diagind(χ²)] == repeat([0.0], 100)
true
source
GenomicBreedingCore.distancesMethod
distances(
    phenomes::Phenomes; 
    distance_metrics::Vector{String}=["euclidean", "correlation", "mad", "rmsd", "χ²"],
    standardise_traits::Bool = false
)::Tuple{Vector{String}, Vector{String}, Dict{String, Matrix{Float64}}}

Calculate pairwise distances/correlations between traits and entries in a phenotypic dataset.

Arguments

  • phenomes::Phenomes: A Phenomes struct containing phenotypic data
  • distance_metrics::Vector{String}: Vector of distance metrics to compute. Valid options are:
    • "euclidean": Euclidean distance
    • "correlation": Pearson correlation coefficient
    • "mad": Mean absolute deviation
    • "rmsd": Root mean square deviation
    • "χ²": Chi-square distance
  • standardise_traits::Bool: If true, standardizes traits to mean=0 and sd=1 before computing distances

Returns

A tuple containing:

  1. Vector of trait names
  2. Vector of entry names
  3. Dictionary mapping "{dimension}|{metric}" to distance matrices, where:
    • dimension ∈ ["traits", "entries"]
    • metric ∈ distance_metrics ∪ ["counts"]
    • "counts" matrices contain the number of non-missing pairs used in calculations

Notes

  • Pairs with fewer than 2 non-missing values result in -Inf distance values
  • For correlation calculations, traits with near-zero variance (< 1e-7) are skipped
  • χ² distance adds machine epsilon to denominator to avoid division by zero

Examples

julia> phenomes = Phenomes(n=10, t=3); phenomes.entries = string.("entry_", 1:10); phenomes.populations .= "pop_1"; phenomes.traits = ["A", "B", "C"]; phenomes.phenotypes = rand(10,3); phenomes.phenotypes[2,2] = missing;

julia> (traits, entries, dist) = distances(phenomes, distance_metrics=["correlation", "χ²"]);

julia> sort(string.(keys(dist))) == ["entries|correlation", "entries|counts", "entries|χ²", "traits|correlation", "traits|counts", "traits|χ²"]
true

julia> C = dist["entries|correlation"]; C[diagind(C)] == repeat([1], length(phenomes.entries))
true

julia> dist["traits|counts"][:, 2] == dist["traits|counts"][2, :] == repeat([9], length(phenomes.traits))
true

julia> dist["entries|counts"][:, 2] == dist["entries|counts"][2, :] == repeat([2], length(phenomes.entries))
true
source
GenomicBreedingCore.divideintomockscaffoldsMethod
ddivideintomockscaffolds(
    genomes::Genomes;
    max_n_loci_per_chrom::Int64 = 100_000,
    verbose::Bool = false,
)::Vector{String}

Divide genomic loci into mock scaffolds based on a maximum number of loci per chromosome.

Arguments

  • genomes::Genomes: A Genomes struct containing genomic data
  • max_n_loci_per_chrom::Int64: Maximum number of loci per chromosome (default: 100,000)
  • verbose::Bool: If true, prints additional information during execution (default: false)

Returns

  • Vector{String}: A vector containing mock scaffold assignments for each locus

Description

This function takes a Genomes struct and divides the loci into mock scaffolds based on the specified maximum number of loci per chromosome. It creates scaffold names in the format "mockscaffoldX" where X is the scaffold number.

Throws

  • ArgumentError: If the Genomes struct dimensions are invalid or corrupted

Example

julia> genomes = simulategenomes(n=20, sparsity=0.3, verbose=false);

julia> mock_scaffolds = divideintomockscaffolds(genomes, max_n_loci_per_chrom=100);

julia> sum(mock_scaffolds .== mock_scaffolds[1]) == Int64(length(genomes.loci_alleles) / 100)
true
source
GenomicBreedingCore.estimatedistancesMethod
estimatedistances(
    genomes::Genomes;
    LD::Matrix{Float64},
    idx_focal_locus::Int64,
    idx_loci_alleles_per_chrom::Vector{Int64},
    min_loci_corr::Float64,
    min_l_loci::Int64,
    verbose::Bool = false,
)::Matrix{Float64}

Estimate pairwise distances between entries based on loci most linked to a focal locus.

Arguments

  • genomes::Genomes: A Genomes struct containing the genomic data.
  • LD::Matrix{Float64}: A linkage disequilibrium (LD) matrix.
  • idx_focal_locus::Int64: The local index of the focal locus, i.e. index at the current chromosome or mock scaffold.
  • idx_loci_alleles_per_chrom::Vector{Int64}: A vector of global indices for loci-alleles per chromosome, i.e. indices corresponding the whole genomes.allele_frequencies matrix.
  • min_loci_corr::Float64: The minimum correlation threshold for loci to be considered linked.
  • min_l_loci::Int64: The minimum number of loci required to be linked to the focal locus.
  • verbose::Bool: If true, prints detailed progress information.

Returns

  • Matrix{Float64}: A matrix of pairwise distances between entries, estimated using the mean absolute difference (MAD) in allele frequencies of loci most linked to the focal locus.

Details

This function calculates pairwise distances between entries in a genomic dataset. The distances are based on loci that are most correlated with a specified focal locus. The function uses the following steps:

  • Input Validation: Checks the validity of the input arguments, ensuring the Genomes struct is not corrupted, the LD matrix is square, and the indices are within valid ranges.
  • Identify Linked Loci: Determines which loci are most linked to the focal locus based on the provided LD matrix and correlation threshold. If the number of linked loci is below the specified minimum, the top loci are selected based on their correlation values.
  • Estimate Distances: Calculates pairwise distances between entries using the mean absolute difference (MAD) in allele frequencies of the selected loci.
  • Output: Returns a matrix of pairwise distances.

Notes

  • The function identifies loci most linked to the focal locus based on the provided LD matrix and correlation threshold.
  • If all loci have undetermined level of linkage with the focal locus (i.e. -Inf values in the LD matrix), then all the loci-alleles are selected to estimate distances.
  • If the number of linked loci is below the specified minimum, then the minimum number of loci most correlated to the focal locus are selected to estimate distances.
  • Pairwise distances are estimated using the mean absolute difference (MAD) in allele frequencies of the selected loci.
  • Intermediate results and progress can be printed if verbose is set to true.

Throws

  • ArgumentError: If the input arguments are invalid or the Genomes struct is corrupted.
  • ArgumentError: If there are less than 2 indices for loci-alleles in the current chromosome.
  • ArgumentError: If the LD matrix is not square.
  • ArgumentError: If the LD matrix and the vector of indices are incompatible.
  • ArgumentError: If the focal locus index is out of range.
  • ArgumentError: If the minimum correlation threshold is out of range.
  • ArgumentError: If the minimum number of loci is out of range.

Example

julia> genomes = simulategenomes(n=40, l=1_000, sparsity=0.3, verbose=false);

julia> chromosomes = divideintomockscaffolds(genomes, max_n_loci_per_chrom=100);

julia> chroms_uniq, LDs = estimateld(genomes, chromosomes=chromosomes);

julia> rm.(readdir()[.!isnothing.(match.(Regex("jld2"), readdir()))]);

julia> k = 1; LD = LDs[k]; chrom = chroms_uniq[k]; idx_focal_locus = 1;

julia> idx_loci_alleles_per_chrom = findall(chromosomes .== chrom);

julia> min_loci_corr = 0.9; min_l_loci = 2;

julia> D = estimatedistances(genomes, LD=LD, idx_focal_locus=idx_focal_locus, idx_loci_alleles_per_chrom=idx_loci_alleles_per_chrom, min_loci_corr=min_loci_corr, min_l_loci=min_l_loci);

julia> size(D) == (length(genomes.entries), length(genomes.entries))
true
source
GenomicBreedingCore.estimateldMethod
estimateld(
    genomes::Genomes;
    chromosomes::Union{Nothing,Vector{String}} = nothing,
    verbose::Bool = false,
)::Tuple{Vector{String},Vector{Matrix{Float64}}}

Calculate linkage disequilibrium (LD) matrices for each chromosome in the given genomic data.

Arguments

  • genomes::Genomes: A Genomes struct containing genomic data
  • chromosomes::Union{Nothing, Vector{String}}: Optional vector of chromosome names to analyse. If nothing, all chromosomes in the data will be used
  • verbose::Bool: If true, prints progress information during computation

Returns

  • Tuple{Vector{String}, Vector{Matrix{Float64}}}: A tuple containing:
    • A vector of unique chromosome names.
    • A vector of LD matrices corresponding to each chromosome.

Examples

julia> genomes = simulategenomes(n=30, l=1_000, sparsity=0.3, verbose=false);

julia> chroms_uniq, LDs_all_chroms = estimateld(genomes);

julia> chrom, pos, allele = loci_alleles(genomes);

julia> rm.(readdir()[.!isnothing.(match.(Regex("jld2"), readdir()))]);

julia> mock_scaffolds = divideintomockscaffolds(genomes, max_n_loci_per_chrom=100);

julia> mock_scaffolds_uniq, LDs_mock_scaffolds = estimateld(genomes, chromosomes=mock_scaffolds);

julia> rm.(readdir()[.!isnothing.(match.(Regex("jld2"), readdir()))]);

julia> length(LDs_all_chroms) == length(chroms_uniq) == length(unique(chrom))
true

julia> length(LDs_mock_scaffolds) == length(mock_scaffolds_uniq) == Int(length(chrom) / 100)
true
source
GenomicBreedingCore.extractXbMethod
extractXb(blr::BLR)::Tuple{Matrix{Float64}, Vector{Float64}, Vector{String}}

Extract design matrix X, coefficients b, and coefficient labels from a BLR (Bayesian Linear Regression) model.

Arguments

  • blr::BLR: A BLR struct containing model components

Returns

A tuple containing:

  • X::Matrix{Float64}: The design matrix combining all predictors
  • b::Vector{Float64}: Vector of estimated coefficients
  • b_labels::Vector{String}: Vector of coefficient labels/names

Throws

  • ArgumentError: If the dimensions in the BLR struct are inconsistent

Details

Extracts and combines the design matrices, coefficients and their labels from the BLR model components. The intercept is handled separately and placed first, followed by other variance components.

Note

The function assumes the BLR struct has valid "intercept" components and optional additional variance components.

Examples

julia> blr = BLR(n=1, p=4);

julia> X, b, b_labels = extractXb(blr);

julia> size(X) == (1, 4)
true

julia> length(b) == length(b_labels)
true
source
GenomicBreedingCore.extracteffectsMethod
extracteffects(blr::BLR; verbose::Bool = false)::Dict{String, DataFrame}

Extract and organize effects from a Bayesian Linear Regression (BLR) model.

This function processes both main effects and interaction effects from a fitted BLR model, organizing them into separate DataFrames within a dictionary.

Arguments

  • blr::BLR: A fitted Bayesian Linear Regression model structure
  • verbose::Bool=false: If true, prints intermediate results during processing

Returns

  • Dict{String, DataFrame}: A dictionary where:
    • Keys are effect names (main effects or interaction effects)
    • Values are DataFrames containing:
      • name: Labels for the effects
      • value: Corresponding effect values

Details

  • Extracts design matrix (X), effects (b), and their labels from the BLR model
  • Processes main effects (single factors) and interaction effects (combined factors) separately
  • Removes redundant rows based on hashing
  • Combines effects with their corresponding design matrix elements

Throws

  • ArgumentError: If the BLR struct dimensions are invalid

Example

julia> genomes = simulategenomes(n=5, l=1_000, verbose=false);

julia> trials, simulated_effects = simulatetrials(genomes = genomes, n_years=1, n_seasons=2, n_harvests=1, n_sites=3, n_replications=3, verbose=false);

julia> tebv, spatial_diagnostics = analyseviaBLR(trials, ["trait_1"], n_iter = 1_000, n_burnin = 100);

julia> blr = tebv.models[1];

julia> dfs = extracteffects(blr);

julia> [size(v) for (k, v) in dfs]
5-element Vector{Tuple{Int64, Int64}}:
 (2, 2)
 (3, 2)
 (6, 2)
 (5, 2)
 (30, 2)
source
GenomicBreedingCore.extractmodelinputsMethod
extractmodelinputs(blr::BLR; multiple_σs::Union{Nothing, Dict{String, Bool}}=nothing)::Dict

Extract model inputs from BLR object for Bayesian modelling.

Arguments

  • blr::BLR: A Bayesian Linear Regression model object
  • multiple_σs::Union{Nothing, Dict{String, Bool}}=nothing: Optional dictionary specifying whether each variance component should use single (false) or multiple (true) variance scalers. If nothing, defaults to single scaler for all components.

Returns

A dictionary containing:

  • "y": Response variable vector
  • "vector_of_Xs_noint": Vector of design matrices for each variance component
  • "vector_of_Δs": Vector of variance-covariance matrices or uniform scaling factors
  • "length_of_σs": Vector specifying number of variance components for each term
  • "variance_components": Vector of variance component names
  • "vector_coefficient_names": Vector of coefficient names for all variance components

Notes

  • Excludes intercept from variance components
  • For each variance component:
    • If multiple_σs[component] = false: Uses single variance scaler
    • If multiple_σs[component] = true: Uses separate variance scaler per coefficient
  • Processes design matrices, variance-covariance structures, and coefficient names from the BLR object

Throws

  • ErrorException: If a variance component is not specified in the multiple_σs dictionary

Example

julia> genomes = simulategenomes(n=5, l=1_000, verbose=false);

julia> trials, simulated_effects = simulatetrials(genomes = genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=3, verbose=false);

julia> df = tabularise(trials);

julia> blr = instantiateblr(trait = trials.traits[1], factors = ["rows", "cols"], df = df, other_covariates = trials.traits[2:end], verbose = false);

julia> model_inputs_1 = extractmodelinputs(blr, multiple_σs = nothing);

julia> model_inputs_2 = extractmodelinputs(blr, multiple_σs = Dict("rows" => true, "cols" => true, "rows & cols" => false, "other_covariates" => true));

julia> (model_inputs_1["y"] == model_inputs_2["y"]) && (model_inputs_1["vector_of_Xs_noint"] == model_inputs_2["vector_of_Xs_noint"]) && (model_inputs_1["vector_of_Δs"] == model_inputs_2["vector_of_Δs"]) && (model_inputs_1["vector_coefficient_names"] == model_inputs_2["vector_coefficient_names"])
true

julia> sum(model_inputs_1["length_of_σs"]) < sum(model_inputs_2["length_of_σs"])
true
source
GenomicBreedingCore.extractphenomesMethod
extractphenomes(tebv::TEBV)::Phenomes

Extract phenotypic values from a Trial-Estimated Breeding Value (TEBV) object.

This function processes phenotypic data from a TEBV object, handling intercept effects and merging multiple phenomes if present. It performs the following operations:

  1. Validates input TEBV dimensions
  2. Processes intercept effects if present by:
    • Identifying intercept terms
    • Combining intercept values with trait effects
    • Adjusting trait names and phenotypic values accordingly
  3. Merges multiple phenomes if present
  4. Renames traits to match input TEBV traits if dimensions align
  5. Validates output Phenomes dimensions

Arguments

  • tebv::TEBV: A Trial Estimated Breeding Value object containing phenotypic data

Returns

  • Phenomes: A Phenomes object containing processed phenotypic values

Throws

  • ArgumentError: If input TEBV or output Phenomes dimensions are invalid

Examples

julia> trials, _simulated_effects = simulatetrials(genomes = simulategenomes(n=10, verbose=false), n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=10, verbose=false);

julia> tebv = analyse(trials, max_levels=50, verbose=false);

julia> phenomes = extractphenomes(tebv);

julia> phenomes.traits == ["trait_1", "trait_2", "trait_3"]
true
source
GenomicBreedingCore.extractphenomesMethod
extractphenomes(trials::Trials)::Phenomes

Convert a Trials struct into a Phenomes struct by extracting phenotypic values across different environments.

Details

  • Combines trait measurements with their environmental contexts
  • Creates unique trait identifiers by combining trait names with environment variables
  • Environment variables include: years, harvests, seasons, sites, and replications
  • For single environment scenarios, trait names remain without environmental suffixes

Arguments

  • trials::Trials: A Trials struct containing phenotypic measurements across different environments

Returns

  • A Phenomes struct containing:
    • phenotypes: Matrix of phenotypic values (entries × traits)
    • entries: Vector of entry names
    • populations: Vector of population names
    • traits: Vector of trait names (with environmental contexts)

Throws

  • ArgumentError: If duplicate entries exist within year-harvest-season-site-replication combinations
  • ErrorException: If dimensional validation fails during Phenomes construction

Examples

julia> trials, _ = simulatetrials(genomes = simulategenomes(verbose=false), verbose=false);

julia> phenomes = extractphenomes(trials);

julia> size(phenomes.phenotypes)
(100, 384)
source
GenomicBreedingCore.filterbymafMethod
filterbymaf(
    genomes::Genomes;
    maf::Float64 = 0.01,
    verbose::Bool = false,
)::Genomes

Filter genomic data by removing loci with minor allele frequencies (MAF) below a specified threshold.

Arguments

  • genomes::Genomes: A Genomes struct containing the genomic data.
  • maf::Float64: The minimum allele frequency threshold. Default is 0.01.
  • verbose::Bool: If true, prints detailed progress information during the filtering process. Default is false.

Returns

  • Genomes: A Genomes struct with filtered genomic data.

Details

This function filters genomic data by removing loci with minor allele frequencies (MAF) below a specified threshold. The function performs the following steps:

  1. Input Validation: Ensures that the Genomes struct is not corrupted and that the maf argument is within the valid range (0.0 to 1.0). Throws an ArgumentError if any argument is out of range.
  2. Early Return for maf = 0.0: If maf is set to 0.0, the function returns the original Genomes struct without filtering.
  3. Calculate MAF: Computes the mean allele frequency for each locus, skipping missing values.
  4. Filter Loci: Identifies loci that pass the MAF threshold and retains only those loci:
    • If all loci pass the MAF threshold, the function returns the original Genomes struct.
    • If no loci pass the MAF threshold, the function throws an ErrorException.
  5. Verbose Output: If verbose is true, prints detailed progress information during the filtering process.
  6. Output: Returns the filtered Genomes struct.

Notes

  • The function uses multi-threading to compute the mean allele frequencies for each locus, improving performance on large datasets.
  • The verbose option provides additional insights into the filtering process by printing progress information.
  • The function ensures that the filtered genomic data retains a minimum number of loci.

Throws

  • ArgumentError: If the Genomes struct is corrupted.
  • ArgumentError: If the maf argument is out of range.
  • ErrorException: If all loci are filtered out based on the MAF threshold.

Example

julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=4, verbose=false);

julia> filtered_genomes = filterbymaf(genomes, maf=0.05);

julia> length(genomes.loci_alleles) >= length(filtered_genomes.loci_alleles)
true
source
GenomicBreedingCore.filterbypcaMethod
filterbypca(
    genomes::Genomes;
    max_prop_pc_varexp::Float64 = 0.9,
    verbose::Bool = false,
)::Genomes

Filter genomic data by removing outlier loci-alleles based on principal component analysis (PCA).

Arguments

  • genomes::Genomes: A Genomes struct containing the genomic data.
  • max_prop_pc_varexp::Float64: The maximum proportion of variance explained by the first two principal components (PC1 and PC2). Default is 0.9.
  • verbose::Bool: If true, prints detailed progress information during the filtering process. Default is false.

Returns

  • Genomes: A Genomes struct with filtered genomic data.

Details

This function filters genomic data by removing outlier loci-alleles based on principal component analysis (PCA). The function performs the following steps:

  1. Input Validation: Ensures that the Genomes struct is not corrupted and that the max_prop_pc_varexp argument is within the valid range (0.0 to Inf). Throws an ArgumentError if any argument is out of range.
  2. Early Return for maxproppc_varexp = Inf: If max_prop_pc_varexp is set to Inf, the function returns the original Genomes struct without filtering.
  3. Extract Non-Missing Loci-Alleles: Identifies loci-alleles that are non-missing and have non-zero variance across all entries.
  4. Standardize Allele Frequencies: Standardizes the allele frequencies per locus-allele in preparation for PCA.
  5. Perform PCA: Conducts PCA on the standardized allele frequencies and calculates the proportion of variance explained by the first two principal components (PC1 and PC2).
  6. Identify Outliers: Identifies loci-alleles that are outliers based on the specified proportion of variance explained by PC1 and PC2.
  7. Filter Outliers: Removes the identified outlier loci-alleles from the genomic data.
  8. Verbose Output: If verbose is true, prints detailed progress information during the filtering process.
  9. Output: Returns the filtered Genomes struct.

Notes

  • The function uses PCA to identify outlier loci-alleles based on the proportion of variance explained by the first two principal components.
  • The verbose option provides additional insights into the filtering process by printing progress information.
  • The function ensures that the filtered genomic data retains a minimum number of loci-alleles.

Throws

  • ArgumentError: If the Genomes struct is corrupted.
  • ArgumentError: If the max_prop_pc_varexp argument is out of range.
  • ErrorException: If there are less than 10 loci-alleles left after removing fixed and missing values across all entries.

Example

julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=4, verbose=false);

julia> filtered_genomes = filterbypca(genomes, max_prop_pc_varexp=0.9);

julia> length(filtered_genomes.loci_alleles) <= length(genomes.loci_alleles)
true
source
GenomicBreedingCore.filterbysnplistMethod
filterbysnplist(
    genomes::Genomes;
    chr_pos_allele_ids::Union{Nothing,Vector{String}} = nothing,
    match_alleles::Bool = true,
    verbose::Bool = false,
)::Genomes

Filter genomic data by retaining only the specified loci or loci-allele combinations.

Arguments

  • genomes::Genomes: A Genomes struct containing the genomic data.
  • chr_pos_allele_ids::Union{Nothing, Vector{String}}: A vector of loci or loci-allele combination names in the format "chromosome\tposition" or "chromosome\tposition\tallele". If nothing, no filtering is applied. Default is nothing.
  • match_alleles::Bool: If true, matches both loci coordinates (chromosome and position) and alleles. If false, matches only by chromosome and position regardless of alleles. Default is true.
  • verbose::Bool: If true, prints detailed progress information during the filtering process. Default is false.

Returns

  • Genomes: A Genomes struct with filtered genomic data.

Details

This function filters genomic data by retaining only the specified loci or loci-allele combinations. The function performs the following steps:

  1. Input Validation: Ensures that the Genomes struct is not corrupted.
  2. Early Return for No Filtering: If no IDs are provided, returns the original Genomes struct.
  3. Parse Input IDs: Parses the input IDs and validates their format.
  4. Extract Available IDs: Gets the current loci or loci-allele combinations from the Genomes struct.
  5. Filter Data: Identifies and retains only the specified combinations.
  6. Verbose Output: If enabled, prints detailed progress information.
  7. Final Check: Ensures that filtering retains at least some data.

Notes

  • Uses multi-threading for parsing and filtering operations
  • When match_alleles=false, matches only chromosome and position
  • Preserves the original data structure while filtering unwanted entries

Throws

  • ArgumentError: If the Genomes struct is corrupted or input IDs are malformed
  • ErrorException: If no data remains after filtering

Example

julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=4, verbose=false);

julia> chr_pos_allele_ids = sample(genomes.loci_alleles, 100, replace=false); sort!(chr_pos_allele_ids);

julia> filtered_genomes = filterbysnplist(genomes, chr_pos_allele_ids=chr_pos_allele_ids);

julia> size(filtered_genomes.allele_frequencies)
(100, 100)

julia> chr_pos_allele_ids_no_alleles = unique([join(split(x, "	")[1:2], "	") for x in chr_pos_allele_ids]);

julia> filtered_genomes_2 = filterbysnplist(genomes, chr_pos_allele_ids=chr_pos_allele_ids_no_alleles, match_alleles=false);

julia> size(filtered_genomes_2.allele_frequencies, 2) > size(filtered_genomes.allele_frequencies, 2)
true
source
GenomicBreedingCore.filterbysparsityMethod
filterbysparsity(
    genomes::Genomes;
    max_entry_sparsity::Float64 = 0.0,
    max_locus_sparsity::Float64 = 0.0,
    max_entry_sparsity_percentile::Float64 = 0.90,
    max_locus_sparsity_percentile::Float64 = 0.50,
    verbose::Bool = false,
)::Genomes

Filter genomic data by removing entries and loci with high sparsity.

Arguments

  • genomes::Genomes: A Genomes struct containing the genomic data.
  • max_entry_sparsity::Float64: The maximum allowable sparsity for entries. Default is 0.0.
  • max_locus_sparsity::Float64: The maximum allowable sparsity for loci. Default is 0.0.
  • max_entry_sparsity_percentile::Float64: The percentile threshold for entry sparsity used in iteratively filtering out the sparsest entries. Default is 0.90.
  • max_locus_sparsity_percentile::Float64: The percentile threshold for locus sparsity used in iteratively filtering out sparsest loci. Default is 0.50.
  • verbose::Bool: If true, prints detailed progress information during the filtering process. Default is false.

Returns

  • Genomes: A Genomes struct with filtered genomic data.

Details

This function filters genomic data by iteratively removing entries and loci with high sparsity. The function performs the following steps:

  1. Input Validation: Ensures that the Genomes struct is not corrupted and that the sparsity thresholds are within the valid range (0.0 to 1.0). Throws an ArgumentError if any argument is out of range.
  2. Calculate Sparsities: Computes the sparsities of entries and loci in the genomic data.
  3. Initial Check: Checks if the input Genomes struct passes all the filtering thresholds. If so, returns the original Genomes struct.
  4. Iterative Filtering: Iteratively removes the sparsest loci and entries until the maximum allowable sparsity thresholds are met:
    • Remove Sparsest Loci: Removes loci with sparsity above the specified percentile threshold or maximum loci sparsity, whichever is higher.
    • Remove Sparsest Entries: Removes entries with sparsity above the specified percentile threshold or maximum entries sparsity, whichever is higher.
  5. Verbose Output: If verbose is true, prints detailed progress information during each iteration.
  6. Final Check: Ensures that there are remaining entries and loci after filtering. Throws an ErrorException if all entries or loci are filtered out.
  7. Output: Returns the filtered Genomes struct.

Notes

  • The function uses percentile thresholds to iteratively remove the sparsest loci and entries.
  • The verbose option provides additional insights into the filtering process by printing progress information.
  • The function ensures that the filtered genomic data retains a minimum number of entries and loci.
  • If one of max_entry_sparsity or max_locus_sparsity is set to 1.00, the other threshold may not be met because

filtering out entries/loci further will mean filtering the other but the requirement at 100% maximum sparsity is already met. At instances like this, you may manually reduced the maximum sparsity level from 1.00.

Throws

  • ArgumentError: If the Genomes struct is corrupted.
  • ArgumentError: If any of the sparsity thresholds are out of range.
  • ErrorException: If all entries and/or loci are filtered out based on the sparsity thresholds.

Examples

julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=4, sparsity=0.01, verbose=false);

julia> filtered_genomes = filterbysparsity(genomes);

julia> size(genomes.allele_frequencies)
(100, 3000)

julia> size(filtered_genomes.allele_frequencies)
(92, 1239)

julia> entry_sparsities, locus_sparsities = sparsities(filtered_genomes);

julia> maximum(entry_sparsities) == maximum(locus_sparsities) == 0.0
true

julia> filtered_genomes_2 = filterbysparsity(genomes, max_entry_sparsity=1.00, max_locus_sparsity=0.00);
source
GenomicBreedingCore.grmploidyawareMethod
grmploidyaware(
    genomes::Genomes;
    ploidy::Int64 = 2,
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    verbose::Bool = false
)::Matrix{Float64}

Generate a ploidy-aware genetic relationship matrix (GRM) based on the methods described in Bell et al. (2017) and VanRaden et al. (2008).

Arguments

  • genomes::Genomes: Genomic data structure containing allele frequencies
  • ploidy::Int64: Number of chromosome copies in the organism (default: 2)
  • idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to select specific entries (default: nothing)
  • idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to select specific loci/alleles (default: nothing)
  • verbose::Bool: If true, displays a heatmap of the resulting GRM (default: false)

Returns

  • Matrix{Float64}: A symmetric genetic relationship matrix with dimensions (n × n), where n is the number of entries

Details

The function implements the following steps:

  1. Extracts and processes genomic data
  2. Calculates allele frequencies and centers the data
  3. Computes the GRM using VanRaden's method
  4. Ensures matrix invertibility by adding small values to the diagonal if necessary

Note

The diagonal elements may be slightly inflated to ensure matrix invertibility for downstream analyses.

Example

julia> genomes = simulategenomes(l=1_000, verbose=false);

julia> grm = grmploidyaware(genomes);

julia> size(grm.genomic_relationship_matrix), issymmetric(grm.genomic_relationship_matrix)
((100, 100), true)

julia> det(grm.genomic_relationship_matrix) > eps(Float64)
true
source
GenomicBreedingCore.grmsimpleMethod
grmsimple(
    genomes::Genomes;
    idx_entries::Union{Nothing,Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing,Vector{Int64}} = nothing,
    verbose::Bool = false
)::Matrix{Float64}

Generate a simple genetic relationship matrix (GRM) from genomic data.

Arguments

  • genomes::Genomes: Genomic data structure containing allele frequencies
  • idx_entries::Union{Nothing,Vector{Int64}}: Optional indices to select specific entries/individuals
  • idx_loci_alleles::Union{Nothing,Vector{Int64}}: Optional indices to select specific loci/alleles
  • verbose::Bool: If true, displays a heatmap visualization of the GRM

Returns

  • Matrix{Float64}: A symmetric positive definite genetic relationship matrix

Details

The function computes a genetic relationship matrix by:

  1. Converting genomic data to a numerical matrix
  2. Computing GRM as G * G' / ncol(G)
  3. Adding small positive values to diagonal elements if necessary to ensure matrix invertibility

Notes

  • The resulting matrix is always symmetric
  • Diagonal elements may be slightly inflated to ensure matrix invertibility
  • The matrix dimensions will be n×n where n is the number of entries/individuals

Example

julia> genomes = simulategenomes(l=1_000, verbose=false);

julia> grm = grmsimple(genomes);

julia> size(grm.genomic_relationship_matrix), issymmetric(grm.genomic_relationship_matrix)
((100, 100), true)

julia> det(grm.genomic_relationship_matrix) > eps(Float64)
true
source
GenomicBreedingCore.histallelefreqsMethod
histallelefreqs(genomes::Genomes)::Nothing

Plot a histogram of allele frequencies from a Genomes object.

Arguments

  • genomes::Genomes: A Genomes object containing allele frequency data in its allele_frequencies field

Returns

  • Nothing: Displays a histogram plot and returns nothing

Description

Creates and displays a vertical histogram of non-missing allele frequencies using UnicodePlots. The histogram shows frequency distribution in the range [0,1] with 50 bins.

Example

julia> genomes = simulategenomes(n=100, l=10_000, n_alleles=3, n_populations=3, verbose=false);

julia> histallelefreqs(genomes)
source
GenomicBreedingCore.imputeMethod
impute(
    genomes::Genomes;
    max_n_loci_per_chrom::Int64 = 100_000,
    n_reps::Int64 = 2,
    optim_n::Int64 = 10,
    min_l_loci::Int64 = 10,
    min_k_entries::Int64 = 2,
    verbose::Bool = false,
)::Tuple{Genomes, Float64}

Impute missing allele frequencies in a genomic dataset using optimised allele frequency k-nearest neighbours imputation (AF-kNNi).

Arguments

  • genomes::Genomes: A Genomes struct containing the genomic data.
  • max_n_loci_per_chrom::Int64: The maximum number of loci per chromosome. Default is 100,000.
  • n_reps::Int64: The number of repetitions for cross-validation during optimisation. Default is 2.
  • optim_n::Int64: The number of optimisation steps for each parameter. Default is 10.
  • min_l_loci::Int64: The minimum number of loci required to be linked to the focal locus. Default is 10.
  • min_k_entries::Int64: The minimum number of nearest neighbours required for imputation. Default is 2.
  • verbose::Bool: If true, prints detailed progress information. Default is false.

Returns

  • Tuple{Genomes, Float64}: A tuple containing:
    • An updated Genomes struct with imputed allele frequencies.
    • The mean absolute error (MAE) of the imputed values.

Details

This function imputes missing allele frequencies in a genomic dataset using an optimised k-nearest neighbors (KNN) imputation method. The function performs the following steps:

  • Input Validation: Checks the validity of the input arguments, ensuring the Genomes struct is not corrupted, the number of entries and loci are within valid ranges, and the parameters are correctly specified.
  • Chromosome Division: Divides the allele frequencies into mock scaffolds if the number of loci per chromosome exceeds the specified maximum.
  • Linkage Disequilibrium (LD) Estimation: Estimates LD between loci using Pearson's correlation for each chromosome.
  • Imputation Process: For each locus-allele per chromosome:
    • Identifies entries with missing data.
    • Optimises the parameters minlocicorr and maxentriesdist using cross-validation.
    • Computes distances between entries using the optimised minlocicorr.
    • Imputes missing data using the weighted means of the nearest neighbors based on the optimised maxentriesdist.
  • Output: Returns the updated Genomes struct with imputed allele frequencies and the mean absolute error (MAE) of the imputed values.

Notes

  • The function is designed to handle the following cases:
    • If all loci have undetermined level of linkage with the focal locus (i.e. -Inf values in the LD matrix), then all the loci-alleles are selected to estimate distances.
    • If the number of linked loci is below the specified minimum, then the minimum number of loci most correlated to the focal locus are selected to estimate distances.
    • If the number of nearest neighbours is less than the specified minimum, then then the minimum number of closest neighbours are used in to compute the weighted average allele frequency.
  • The verbose option provides additional insights into the imputation process by printing progress information.
  • The function is designed to handle large genomic datasets efficiently by dividing loci into mock scaffolds if necessary.

Throws

  • ArgumentError: If the Genomes struct is corrupted.
  • ArgumentError: If the number of entries in the Genomes struct is less than 2.
  • ArgumentError: If the number of loci per chromosome is less than 10.
  • ArgumentError: If n_reps is not between 1 and the number of entries.
  • ArgumentError: If minlloci is not between 10 and the number of loci per chromosome.
  • ArgumentError: If minkentries is not between 2 and the number of entries.
  • ErrorException: If an error occurs during the imputation process.

Example

julia> genomes = simulategenomes(n=100, l=1_000, sparsity=0.3, verbose=false);

julia> genomes_imputed, mae = impute(genomes);

julia> dimensions(genomes)["n_missing"] == 30_000
true

julia> dimensions(genomes_imputed)["n_missing"] == 0
true
source
GenomicBreedingCore.inflatediagonals!Method
inflatediagonals!(X::Matrix{Float64}; max_iter::Int64=1_000, verbose::Bool=false)::Nothing

Ensure matrix invertibility by iteratively inflating diagonal elements until both the determinant is nonzero and the matrix is positive definite.

Arguments

  • X::Matrix{Float64}: Input square symmetric matrix to be modified in-place
  • max_iter::Int64=1_000: Maximum number of iterations
  • verbose::Bool=false: If true, prints information about the inflation process

Details

The function adds progressively larger values to the diagonal elements until the matrix becomes invertible (det(X) > eps(Float64)) and positive definite, or the maximum number of iterations is reached. The initial inflation value ϵ is set to the maximum absolute value in the matrix and increases slightly in each iteration by a factor of (1 + eps(Float64)).

The function returns early if the matrix is already positive definite and has a nonzero determinant.

Throws

  • ArgumentError: If the input matrix is not symmetric
  • ArgumentError: If the input matrix is not square
  • ArgumentError: If the input matrix contains NaN values
  • ArgumentError: If the input matrix contains Inf values

Returns

Nothing. The input matrix X is modified in-place.

Example

julia> x::Vector{Float64} = rand(10);

julia> X::Matrix{Float64} = x * x';

julia> inflatediagonals!(X);

julia> det(X) > eps(Float64)
true
source
GenomicBreedingCore.instantiateblrMethod
instantiateblr(; trait::String, factors::Vector{String}, df::DataFrame, 
              other_covariates::Union{Vector{String}, Nothing}=nothing,
              saturated_model::Bool=false,
              verbose::Bool=false)::BLR

Extract design matrices and response variable for Bayesian modelling of factorial experiments.

Arguments

  • trait::String: Name of the response variable (dependent variable) in the DataFrame
  • factors::Vector{String}: Vector of factor names (independent variables) to be included in the model
  • df::DataFrame: DataFrame containing the data
  • other_covariates::Union{Vector{String}, Nothing}=nothing: Additional numeric covariates to include in the model
  • saturated_model::Bool=false: If true, includes all possible interactions between factors
  • verbose::Bool=false: If true, prints additional information during execution

Returns

A BLR struct containing:

  • entries: Vector of entry identifiers
  • y: Response variable vector
  • ŷ: Predicted values vector
  • ϵ: Residuals vector
  • Xs: Dict mapping factors to design matrices
  • Σs: Dict mapping factors to covariance matrices
  • coefficients: Dict mapping factors to coefficient vectors
  • coefficient_names: Dict mapping factors to coefficient names
  • diagnostics: DataFrame with MCMC diagnostics (added after sampling)

Model Details

Creates design matrices for hierarchical factorial experiments with:

  1. Main effects:
  • Genetic effects: entries
  • Environmental effects: sites, seasons, years
  • Spatial effects: blocks, rows, cols
  1. Interaction effects:
  • GxE interactions (entries × environment)
  • Environmental interactions (between environment factors)
  • Spatial interactions (between spatial factors)

Implementation Notes

  • Uses FullDummyCoding for categorical factors
  • Converts factors to strings
  • Validates numeric traits/covariates
  • Creates design matrices with full levels
  • Handles continuous covariates as Float64
  • Uses memory efficient boolean matrices for factors
  • Assigns identity matrices as initial covariance structures

Throws

  • ArgumentError: For missing/invalid inputs
  • ErrorException: For matrix extraction failures

Example

julia> genomes = simulategenomes(n=5, l=1_000, verbose=false);

julia> trials, simulated_effects = simulatetrials(genomes = genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=3, verbose=false);

julia> df = tabularise(trials);

julia> blr_1 = instantiateblr(trait = trials.traits[1], factors = ["rows", "cols"], df = df, other_covariates = nothing, verbose = false);

julia> blr_2 = instantiateblr(trait = trials.traits[1], factors = ["years", "seasons", "sites", "entries"], df = df, other_covariates = [trials.traits[2]], verbose = false);

julia> length(blr_1.Xs) == 3
true

julia> size(blr_1.Xs["rows"]) == (15, 3)
true

julia> length(blr_2.Xs) == 9
true

julia> size(blr_2.Xs["entries & seasons & sites"]) == (15, 5)
true
source
GenomicBreedingCore.knniMethod
knni(; qs::Vector{Float64}, d::Vector{Float64}, max_entries_dist::Float64 = 0.1, min_k_entries::Int64 = 2)::Float64

Impute missing allele frequencies using the k-nearest neighbours (kNN) method based on distances.

Arguments

  • qs::Vector{Float64}: A vector of non-missing allele frequencies at the focal locus.
  • d::Vector{Float64}: A vector of distances corresponding to the entries in qs.
  • max_entries_dist::Float64: The maximum distance threshold to consider an entry as a neighbour. Default is 0.1.
  • min_k_entries::Int64: The minimum number of nearest neighbours required for imputation. Default is 2.

Returns

  • Float64: The imputed allele frequency at the focal locus.

Details

This function imputes missing allele frequencies at a focal locus using the k-nearest neighbours (KNN) method. The function performs the following steps:

  • Input Validation: Ensures that the lengths of qs and d are equal. Throws an ArgumentError if they are not.
  • Identify Nearest Neighbors: Finds the indices of the nearest neighbours within the specified distance threshold (maxentriesdist). If the number of neighbours is less than the minimum required (minkentries), the nearest neighbours are selected based on the smallest distances.
  • Impute Missing Data: Calculates the weighted mean of the allele frequencies of the nearest neighbours. The weights are based on the inverse of the distances to the focal locus, ensuring that closer neighbours have a higher influence on the imputed value.
  • Output: Returns the imputed allele frequency.

Notes

  • The function uses a small epsilon value (eps(Float64)) to avoid division by zero when calculating weights.
  • If the sum of distances is infinite, a uniform weight is assigned to each neighbour.
  • The function is designed to handle cases where the number of nearest neighbours is less than the specified minimum by selecting the closest available neighbours.

Throws

  • ArgumentError: If the lengths of qs and d are not equal.

Example

julia> genomes = simulategenomes(n=50, l=1_000, sparsity=0.3, verbose=false);

julia> chromosomes = divideintomockscaffolds(genomes, max_n_loci_per_chrom=100);

julia> chroms_uniq, LDs = estimateld(genomes, chromosomes=chromosomes);

julia> rm.(readdir()[.!isnothing.(match.(Regex("jld2"), readdir()))]);

julia> k = 1; LD = LDs[k]; chrom = chroms_uniq[k]; idx_focal_locus = 1;

julia> idx_loci_alleles_per_chrom = findall(chromosomes .== chrom);

julia> min_loci_corr = 0.9; min_l_loci = 2;

julia> D = estimatedistances(genomes, LD=LD, idx_focal_locus=idx_focal_locus, idx_loci_alleles_per_chrom=idx_loci_alleles_per_chrom, min_loci_corr=min_loci_corr, min_l_loci=min_l_loci);

julia> i = findall(ismissing.(genomes.allele_frequencies[:, idx_focal_locus]))[1];

julia> idx_entries_not_missing = findall(.!ismissing.(genomes.allele_frequencies[:, idx_focal_locus]));

julia> qs::Vector{Float64} = genomes.allele_frequencies[idx_entries_not_missing, idx_focal_locus];

julia> d = D[i, idx_entries_not_missing];

julia> q̄ = knni(qs=qs, d=d, max_entries_dist=0.25, min_k_entries=2);

julia> q̄ > 0.0
true
source
GenomicBreedingCore.knnioptimMethod
knnioptim(
    genomes::Genomes;
    j::Int64,
    idx_focal_locus::Int64,
    idx_loci_alleles_per_chrom::Vector{Int64},
    idx_entries_not_missing::Vector{Int64},
    LD::Matrix{Float64},
    n_reps::Int64 = 2,
    optim_n::Int64 = 10,
    min_l_loci::Int64 = 10,
    min_k_entries::Int64 = 2,
    verbose::Bool = false,
)::Dict{String,Float64}

Optimise parameters for allele frequency k-nearest neighbours imputation (AF-kNNi) using cross-validation, i.e. the minimum loci correlation (min_loci_corr) and maximum entries distance (max_entries_dist).

Arguments

  • genomes::Genomes: A Genomes struct containing the genomic data.
  • j::Int64: The global index of the locus for which imputation is performed, i.e. index corresponding to the whole genomes.allele_frequencies matrix.
  • idx_focal_locus::Int64: The local index of the focal locus, i.e. index at the current chromosome or mock scaffold.
  • idx_loci_alleles_per_chrom::Vector{Int64}: A vector of global indices for loci-alleles per chromosome, i.e. indices corresponding the whole genomes.allele_frequencies matrix.
  • idx_entries_not_missing::Vector{Int64}: A vector of indices for entries with non-missing data at the focal locus.
  • LD::Matrix{Float64}: A linkage disequilibrium (LD) matrix.
  • n_reps::Int64: The number of repetitions for cross-validation. Default is 2.
  • optim_n::Int64: The number of optimisation steps for each parameter. Default is 10.
  • min_l_loci::Int64: The minimum number of loci required to be linked to the focal locus. Default is 10.
  • min_k_entries::Int64: The minimum number of nearest neighbours required for imputation. Default is 2.
  • verbose::Bool: If true, prints detailed progress information and plots the mean absolute error (MAE) heatmap. Default is false.

Returns

  • Dict{String, Float64}: A dictionary containing the optimised parameters:
    • "min_loci_corr": The optimised minimum correlation threshold for loci to be considered linked.
    • "max_entries_dist": The optimised maximum distance threshold to consider an entry as a neighbor.
    • "mae": The mean absolute error (MAE) of the optimised parameters.

Details

This function optimises the minimum loci correlation (min_loci_corr) and maximum entries distance (max_entries_dist) for allele frequency k-nearest neighbours imputation (AF-kNNi) using cross-validation. The function performs the following steps:

  • Input Validation: Ensures that there are at least 2 entries with non-missing data at the focal locus. Throws an ArgumentError if this condition is not met.
  • Parameter Ranges: Defines the ranges for the minimum correlation threshold (min_loci_corr) and the maximum distance threshold (maxentriesdist) for optimisation.
  • Cross-Validation: For each combination of min_loci_corr and max_entries_dist, the function:
    • Estimates pairwise distances between entries using the estimatedistances function.
    • Samples entries with non-missing data to simulate missing data.
    • Performs AF-kNNi using the knni function and calculates the mean absolute error (MAE) for the imputed values.
  • Optimisation: Identifies the combination of parameters that minimises the MAE.
  • Output: Returns a dictionary containing the optimised parameters and the corresponding MAE. If verbose is true, prints detailed progress information and plots a heatmap of the MAE.

Notes

  • The function is designed to handle the following cases:
    • If all loci have undetermined level of linkage with the focal locus (i.e. -Inf values in the LD matrix), then all the loci-alleles are selected to estimate distances.
    • If the number of linked loci is below the specified minimum, then the minimum number of loci most correlated to the focal locus are selected to estimate distances.
    • If the number of nearest neighbours is less than the specified minimum, then then the minimum number of closest neighbours are used in to compute the weighted average allele frequency.
  • The verbose option provides additional insights into the optimisation process by printing the parameters and plotting the MAE heatmap.

Throws

  • ArgumentError: If there are less than 2 entries with non-missing data at the focal locus.

Example

julia> genomes = simulategenomes(n=123, l=1_000, sparsity=0.3, rel_dist_multiplier=10.0, verbose=false);

julia> chromosomes = divideintomockscaffolds(genomes, max_n_loci_per_chrom=100);

julia> chroms_uniq, LDs = estimateld(genomes, chromosomes=chromosomes);

julia> rm.(readdir()[.!isnothing.(match.(Regex("jld2"), readdir()))]);

julia> k = 1; LD = LDs[k]; chrom = chroms_uniq[k]; j = idx_focal_locus = 1;

julia> idx_loci_alleles_per_chrom = findall(chromosomes .== chrom);

julia> min_loci_corr = 0.9; min_l_loci = 2;

julia> D = estimatedistances(genomes, LD=LD, idx_focal_locus=idx_focal_locus, idx_loci_alleles_per_chrom=idx_loci_alleles_per_chrom, min_loci_corr=min_loci_corr, min_l_loci=min_l_loci);

julia> i = findall(ismissing.(genomes.allele_frequencies[:, idx_focal_locus]))[1];

julia> idx_entries_not_missing = findall(.!ismissing.(genomes.allele_frequencies[:, idx_focal_locus]));

julia> params = knnioptim(genomes, j=j, idx_focal_locus=idx_focal_locus, idx_loci_alleles_per_chrom=idx_loci_alleles_per_chrom, idx_entries_not_missing=idx_entries_not_missing, LD=LD);

julia> params["min_loci_corr"] > 0.5
true

julia> params["max_entries_dist"] < 0.5
true

julia> params["mae"] < 0.1
true
source
GenomicBreedingCore.lociMethod
loci(genomes::Genomes; verbose::Bool = false)::Tuple{Vector{String},Vector{Int64},Vector{Int64},Vector{Int64}}

Extract genomic positional information from a Genomes object, returning a tuple of vectors containing chromosome names, positions, and locus boundary indices.

Arguments

  • genomes::Genomes: A Genomes object containing genomic data
  • verbose::Bool = false: If true, displays a progress bar during computation

Returns

A tuple containing four vectors:

  • chromosomes::Vector{String}: Names of chromosomes
  • positions::Vector{Int64}: Positions within chromosomes
  • loci_ini_idx::Vector{Int64}: Starting indices for each locus
  • loci_fin_idx::Vector{Int64}: Ending indices for each locus

Examples

julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=4, verbose=false);

julia> chromosomes, positions, loci_ini_idx, loci_fin_idx = loci(genomes);

julia> length(chromosomes), length(positions), length(loci_ini_idx), length(loci_fin_idx)
(1000, 1000, 1000, 1000)
source
GenomicBreedingCore.loci_allelesMethod
loci_alleles(genomes::Genomes; verbose::Bool = false)::Tuple{Vector{String},Vector{Int64},Vector{String}}

Extract chromosomes, positions, and alleles information from a Genomes object.

Returns a tuple of three vectors containing:

  1. Chromosomes identifiers as strings
  2. Base-pair positions as integers
  3. Allele identifiers as strings

Each vector has length equal to the total number of loci-allele combinations in the genome. The function processes the data in parallel using multiple threads for performance optimization.

Arguments

  • genomes::Genomes: A valid Genomes object containing loci and allele information
  • verbose::Bool = false: If true, displays a progress bar during extraction

Returns

  • Tuple{Vector{String},Vector{Int64},Vector{String}}: A tuple containing three vectors:
    • chromosomes: Vector of chromosome identifiers
    • positions: Vector of base-pair positions
    • alleles: Vector of allele identifiers

Throws

  • ArgumentError: If the Genomes struct dimensions are invalid or corrupted

Examples

julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=4, verbose=false);

julia> chromosomes, positions, alleles = loci_alleles(genomes);

julia> length(chromosomes), length(positions), length(alleles)
(3000, 3000, 3000)
source
GenomicBreedingCore.makexMethod
makex(varex::Vector{String}; df::DataFrame)::Tuple{Matrix{Float64}, Vector{String}, Vector{String}}

Constructs a design matrix X based on categorical variables specified in varex from the given DataFrame df.

Arguments

  • varex::Vector{String}: A vector of column names in df representing categorical variables to be encoded.
  • df::DataFrame: The input DataFrame containing the data.

Returns

A tuple (X, X_vars, X_labels) where:

  • X::Matrix{Float64}: The design matrix with one-hot encoded columns for each categorical variable.
  • X_vars::Vector{String}: A vector indicating the variable name corresponding to each column in X.
  • X_labels::Vector{String}: A vector of unique labels for each categorical variable.

Example

source
GenomicBreedingCore.maskmissing!Method
maskmissing!(genomes::Genomes; verbose::Bool = false)

Update the mask matrix for missing values in the genomes struct.

This function updates the mask matrix in a Genomes struct by marking positions where allele frequencies are not missing. The mask is set to true for non-missing values and false for missing values.

Arguments

  • genomes::Genomes: A Genomes struct containing genomic data including allele frequencies and mask matrix
  • verbose::Bool=false: If true, displays a progress bar during computation

Throws

  • ArgumentError: If the dimensions in the Genomes struct are inconsistent

Effects

  • Modifies the mask field of the input genomes struct in-place

Threads

  • Uses multi-threading for parallel computation across loci
  • Uses a thread lock for safe concurrent access to shared memory

Example

julia> genomes = simulategenomes(n=10, sparsity=0.3, verbose=false);

julia> round(1.00 - mean(genomes.mask), digits=10)
0.0

julia> maskmissing!(genomes);

julia> round(1.00 - mean(genomes.mask), digits=10)
0.3
source
GenomicBreedingCore.plotFunction
plot(genomes::Genomes, seed::Int64 = 42)::Nothing

Generate visualization plots for allele frequencies in genomic data.

For each population in the dataset, creates three plots:

  1. Histogram of per-entry allele frequencies
  2. Histogram of mean allele frequencies per locus
  3. Correlation heatmap of allele frequencies between loci

Arguments

  • genomes::Genomes: A Genomes struct containing allele frequency data
  • seed::Int64=42: Random seed for reproducibility of sampling loci

Returns

  • Nothing: Displays plots but doesn't return any value

Notes

  • Uses up to 100 randomly sampled loci for visualization
  • Handles missing values in the data
  • Displays folded frequency spectra (both q and 1-q)
  • Will throw ArgumentError if the Genomes struct is corrupted

Examples

julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=4, verbose=false);

julia> GenomicBreedingCore.plot(genomes)
source
GenomicBreedingCore.plotFunction
plot(fit::Fit, distribution::Any=[TDist(1), Normal()][2], α::Float64=0.05)

Generate diagnostic plots for genetic association analysis results.

Arguments

  • fit::Fit: A Fit object containing the association analysis results, specifically the b_hat field with effect sizes
  • distribution::Any: The null distribution for p-value calculation. Defaults to Normal distribution
  • α::Float64: Significance level for multiple testing correction (Bonferroni). Defaults to 0.05

Returns

Displays three plots:

  • Histogram showing the distribution of effect sizes
  • Manhattan plot showing -log10(p-values) across loci with Bonferroni threshold
  • Q-Q plot comparing observed vs expected -log10(p-values)

Examples

julia> distribution = [TDist(1), Normal()][2];

julia> fit = Fit(n=100, l=10_000); fit.b_hat = rand(distribution, 10_000);  α = 0.05;

julia> GenomicBreedingCore.plot(fit);
source
GenomicBreedingCore.plotMethod
plot(phenomes::Phenomes; nbins::Int64 = 10)::Nothing

Generate diagnostic plots for phenotypic data stored in a Phenomes struct.

Arguments

  • phenomes::Phenomes: A Phenomes struct containing phenotypic data
  • nbins::Int64=10: Number of bins for the histograms (optional)

Description

For each population in the dataset:

  1. Creates histograms showing the distribution of each trait
  2. Generates a heatmap of trait correlations for traits with non-zero variance

Notes

  • Skips traits with all missing, NaN, or infinite values
  • Only includes traits with variance > 1e-10 in correlation analysis
  • Requires at least 3 data points to generate a histogram
  • Uses UnicodePlots for visualization in terminal

Examples

julia> phenomes = Phenomes(n=10, t=3); phenomes.entries = string.("entry_", 1:10); phenomes.populations .= "pop_1"; phenomes.traits = ["A", "B", "C"]; phenomes.phenotypes = rand(10,3);

julia> GenomicBreedingCore.plot(phenomes);
source
GenomicBreedingCore.plotMethod
plot(tebv::TEBV)

Create a visualization of True Estimated Breeding Values (TEBV) analysis results.

This function extracts phenomes from the TEBV object and generates a plot to visualize the breeding value estimates.

Arguments

  • tebv::TEBV: A TEBV object containing the analysis results

Returns

  • A plot object representing the visualization of the phenomes data
source
GenomicBreedingCore.plotMethod
plot(trials::Trials; nbins::Int64 = 10)::Nothing

Generate a comprehensive visualization of trial data through histograms, correlation heatmaps, and bar plots.

Arguments

  • trials::Trials: A Trials struct containing the trial data to be visualized
  • nbins::Int64=10: Number of bins for the histogram plots (optional)

Details

The function creates three types of plots:

  1. Histograms for each trait within each population, showing the distribution of trait values
  2. Correlation heatmaps between traits for each population
  3. Bar plots showing mean trait values across different trial factors:
    • Years
    • Seasons
    • Harvests
    • Sites
    • Replications
    • Rows
    • Columns
    • Populations

Missing, NaN, or infinite values are automatically filtered out before plotting.

Returns

  • Nothing: The function displays plots but does not return any value

Notes

  • Requires valid trial data with non-zero variance for correlation plots
  • Uses UnicodePlots for visualization in terminal
  • Skips plotting for traits with insufficient data points

Examples

julia> trials, _ = simulatetrials(genomes = simulategenomes(verbose=false), verbose=false);

julia> GenomicBreedingCore.plot(trials);
source
GenomicBreedingCore.removemissnaninfMethod
removemissnaninf(trials::Trials)::Trials

Removes missing (missing), NaN, and Inf values from the Trials struct.

Arguments

  • trials::Trials: A Trials struct containing phenotypic and trial data.

Returns

  • A new Trials struct with:
    • Trait columns that are completely missing, NaN, or Inf removed.
    • Rows with any missing, NaN, or Inf values in phenotypes removed.

Throws

  • If the Trials struct is corrupted (fails dimension checks), an ArgumentError is thrown.
  • If no rows are removed (i.e., all rows are valid), the original Trials struct is returned.
  • If the resulting Trials struct fails dimension checks after filtering, a DimensionMismatch error is thrown.

Example

julia> trials, _ = simulatetrials(genomes = simulategenomes(n=50, l=100, verbose=false), f_add_dom_epi=rand(13,3), sparsity=0.1, verbose=false);

julia> filtered_trials = removemissnaninf(trials);

julia> D1 = dimensions(trials); D2 = dimensions(filtered_trials);

julia> (D1["n_missing"] > D2["n_missing"]) && (D2["n_missing"] == D2["n_nan"] == D2["n_inf"] == 0)
true
source
GenomicBreedingCore.removespatialeffects!Method
removespatialeffects!(df::DataFrame; factors::Vector{String}, traits::Vector{String}, 
                     other_covariates::Union{Vector{String},Nothing} = nothing,
                     Σ_type::Symbol = :empirical,
                     ρs::Dict{String, Float64} = Dict("rows" => 0.5, "cols" => 0.5),
                     n_iter::Int64 = 10_000, n_burnin::Int64 = 1_000, 
                     seed::Int64 = 1234, verbose::Bool = false)::Tuple{Vector{String}, Dict{String, DataFrame}}

Remove spatial effects from trait measurements in field trials using Bayesian linear regression.

Arguments

  • df::DataFrame: DataFrame containing trial data with columns for traits, spatial factors, and harvests
  • factors::Vector{String}: Vector of factor names to be considered in the model
  • traits::Vector{String}: Vector of trait names to be spatially adjusted
  • other_covariates::Union{Vector{String},Nothing}: Optional vector of additional covariates
  • Σ_type::Symbol: Type of covariance structure to use (:empirical, :autoregressive, or :spherical) (default: :empirical)
  • ρs::Dict{String, Float64}: Correlation parameters for autoregressive structure, keys are factor names (default: rows=0.5, cols=0.5)
  • n_iter::Int64: Number of MCMC iterations (default: 10_000)
  • n_burnin::Int64: Number of burn-in iterations (default: 1_000)
  • seed::Int64: Random seed for reproducibility (default: 1234)
  • verbose::Bool: If true, prints detailed information (default: false)

Returns

In addition to mutating df by adding the spatially adjusted traits as new columns with the prefix "SPATADJ-", it returns a tuple containing:

  1. Vector of remaining significant factors after spatial adjustment
  2. Dictionary mapping harvest-trait combinations to diagnostic DataFrame results

Details

This function performs spatial adjustment for traits measured in field trials by:

  1. Identifying spatial factors (blocks, rows, columns) and creating design matrices
  2. Fitting Bayesian linear model per harvest to account for:
    • Spatial effects (blocks, rows, columns and interactions)
    • Choice of covariance structure:
      • Empirical: Estimated using linear shrinkage with unequal variances
      • Autoregressive: AR(1) structure with specified correlation parameters
      • Spherical: Identity matrix scaled by variance component
    • Additional numeric covariates if specified
  3. Creating spatially adjusted traits by adding intercept and residuals
  4. Removing redundant factors and retaining only unique design matrices

The spatial adjustment is only performed if blocks, rows or columns are present. Each harvest is treated separately to allow for year and site-specific spatial effects.

Variance component scalers are specified as follows:

  • Rows, columns and other covariates use unique variance scalers (multiple_σs = true)
  • Row-by-column interactions use a single spherical variance-covariance matrix (multiple_σs = false)
  • This improves model tractability while allowing for flexible variance structures where needed

Notes

  • Requires "entries" and "harvests" columns in input DataFrame
  • Uses Bayesian linear regression via MCMC for spatial modeling
  • Creates new columns with "SPATADJ-" prefix for adjusted traits
  • Returns original DataFrame if no spatial factors present
  • Automatically detects spatial factors via regex pattern "blocks|rows|cols"
  • Supports empirical, autoregressive or spherical covariance structures
  • Modifies the input DataFrame in-place

Example

julia> genomes = simulategenomes(n=5, l=1_000, verbose=false);

julia> trials, simulated_effects = simulatetrials(genomes = genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=3, verbose=false);

julia> df = tabularise(trials);

julia> factors, diagnostics = removespatialeffects!(df, factors = ["rows", "cols", "blocks"], traits = ["trait_1", "trait_2"], other_covariates=["trait_3"], n_iter=1_000, n_burnin=200, verbose = false);

julia> ("SPATADJ-trait_1" ∈ names(df)) && ("SPATADJ-trait_2" ∈ names(df))
true

julia> [(sum(d.rhat .< 1.01) < nrow(d)) && (sum(d.ess .>= 100) < nrow(d)) for (_, d) in diagnostics] == [true, true]
true
source
GenomicBreedingCore.simulateallelefreqs!Method
simulateallelefreqs!(
    allele_frequencies::Matrix{Union{Float64,Missing}}, 
    locus_counter::Vector{UInt},
    pb::Union{Nothing, ProgressMeter.Progress};
    μ::Vector{Float64}, 
    Σ::SparseMatrixCSC{Float64},
    n_alleles::Int64,
    idx_entries_per_population::Vector{Int64},
    rng::TaskLocalRNG = Random.GLOBAL_RNG,
)::Nothing

Simulates allele frequencies for multiple loci and populations using a multivariate normal distribution.

Arguments

  • allele_frequencies::Matrix{Union{Float64,Missing}}: Matrix to store the simulated allele frequencies for all loci
  • locus_counter::Vector{UInt}: Counter tracking the current locus position
  • pb::ProgressMeter.Progress: Progress bar for tracking simulation progress
  • μ::Vector{Float64}: Mean vector for the multivariate normal distribution
  • Σ::SparseMatrixCSC{Float64}: Variance-covariance matrix for the multivariate normal distribution
  • n_alleles::Int64: Number of alleles per locus
  • idx_entries_per_population::Vector{Int64}: Indices for entries in each population
  • rng::TaskLocalRNG: Random number generator (default: global RNG)

Description

Simulates allele frequencies for multiple populations and loci using a multivariate normal distribution. For each allele (except the last one) at each locus, frequencies are sampled from the specified distribution. For alleles after the first one, frequencies are adjusted to ensure they sum to 1.0 within each locus. All frequencies are bounded between 0.0 and 1.0.

Returns

Nothing

Throws

  • ArgumentError: If length of allele_frequencies is less than or equal to length of μ
  • ArgumentError: If length of μ does not match the dimensions of Σ
  • ArgumentError: If maximum index in idxentriesper_population exceeds number of entries

Note

The last allele frequency for each locus is implicitly determined as 1 minus the sum of other allele frequencies to ensure frequencies sum to 1.0.

Example

julia> chrom_lengths, chrom_loci_counts = simulatechromstruct(l=100, n_chroms=7, max_pos=135_000_000);

julia> positions, loci_alleles = simulateposandalleles(chrom_lengths=chrom_lengths, chrom_loci_counts=chrom_loci_counts, n_alleles=2);

julia> Σ_base = simulateldblocks(chrom_positions=positions[1], chrom_length=chrom_lengths[1], ld_corr_50perc_kb=1_000);

julia> μ, Σ = simulateperpopμΣ(Σ_base=Σ_base, μ_β_params=(0.5, 0.5));

julia> populations, idx_population_groupings = simulatepopgroups(n=123, n_populations=3);

julia> allele_frequencies::Matrix{Union{Missing, Float64}} = fill(missing, 123, 100); locus_counter::Vector{UInt} = [1]; pb = nothing;

julia> simulateallelefreqs!(allele_frequencies, locus_counter, pb, μ=μ, Σ=Σ, n_alleles=2, idx_entries_per_population=idx_population_groupings[1])

julia> sum(.!ismissing.(allele_frequencies[idx_population_groupings[1], 1:length(μ)])) == (length(idx_population_groupings[1]) * length(μ))
true
source
GenomicBreedingCore.simulatechromstructMethod
simulatechromstruct(;l::Int64, n_chroms::Int64, max_pos::Int64)

Generate chromosome structure parameters for genome simulation.

Arguments

  • l::Int64: Total number of loci to distribute across chromosomes (2 to 1e9)
  • n_chroms::Int64: Number of chromosomes (1 to 1e6)
  • max_pos::Int64: Maximum genome size in base pairs (10 to 160e9)

Returns

A tuple containing:

  • chrom_lengths::Vector{Int64}: Vector of chromosome lengths in base pairs
  • chrom_loci_counts::Vector{Int64}: Vector of loci counts per chromosome

Examples

julia> chrom_lengths, chrom_loci_counts = simulatechromstruct(l=10_000, n_chroms=7, max_pos=135_000_000)
([19285714, 19285714, 19285714, 19285714, 19285714, 19285714, 19285716], [1428, 1428, 1428, 1428, 1428, 1428, 1432])
source
GenomicBreedingCore.simulatecovarianceautocorrelatedFunction
simulatecovarianceautocorrelated(p::Int64, ρ::Float64 = 0.75)::Matrix{Float64}

Generate a p × p autocorrelated covariance matrix where the correlation between elements decays exponentially with their distance.

Arguments

  • p::Int64: Size of the square covariance matrix
  • ρ::Float64 = 0.75: Autocorrelation parameter, must be between -1 and 1

Returns

  • Matrix{Float64}: A p × p positive definite covariance matrix where Σᵢⱼ = ρ^(|i-j|)

Throws

  • ArgumentError: If ρ is not between -1 and 1

Examples

julia> Σ = simulatecovarianceautocorrelated(7, 0.72);

julia> size(Σ)
(7, 7)

julia> (var(Σ) > 0.0) && (abs(det(Σ)) > 0.0)
true

julia> (Σ[:, 1] == Σ[1, :]) && (Σ[:, 2] == Σ[2, :])
true

julia> sum(diff(reverse(Σ[1, :])) .> 0.0) == 6
true
source
GenomicBreedingCore.simulatecovariancediagonalMethod
simulatecovariancediagonal(p::Int64, σ²::Vector{Float64})::Matrix{Float64}

Simulate a diagonal covariance matrix with specified variances σ² on the diagonal.

Arguments

  • p::Int64: Dimension of the covariance matrix
  • σ²::Vector{Float64}: Vector of variance values for diagonal elements

Returns

  • Matrix{Float64}: p × p diagonal covariance matrix

Examples

julia> Σ = simulatecovariancediagonal(7, rand(7));

julia> size(Σ)
(7, 7)

julia> var(Σ[diagind(Σ)]) > 0.0
true
source
GenomicBreedingCore.simulatecovariancekinshipMethod
simulatecovariancekinship(p::Int64, genomes::Genomes)::Matrix{Float64}

Calculate a genomic relationship matrix (GRM) from simulated genomic data.

Arguments

  • p::Int64: Expected number of entries/individuals in the genomic data
  • genomes::Genomes: Genomic data structure containing genetic information

Returns

  • Matrix{Float64}: A p × p genomic relationship matrix

Throws

  • ArgumentError: If the provided p doesn't match the number of entries in genomes

Examples

julia> Σ = simulatecovariancekinship(7, simulategenomes(n=7, l=1_000, verbose=false));

julia> size(Σ)
(7, 7)

julia> (var(Σ) > 0.0) && (abs(det(Σ)) > 0.0)
true
source
GenomicBreedingCore.simulatecovariancerandomFunction
simulatecovariancerandom(p::Int64, seed::Int64 = 42)::Matrix{Float64}

Generate a random positive semidefinite covariance matrix of size p × p.

Arguments

  • p::Int64: Dimension of the covariance matrix to generate
  • seed::Int64 = 42: Random seed for reproducibility

Returns

  • Matrix{Float64}: A p × p positive semidefinite covariance matrix

Details

The function generates a random covariance matrix by:

  1. Creating a p × 1 random normal vector
  2. Computing the outer product of this vector with itself
  3. Inflating the diagonal elements to ensure positive definiteness

Examples

julia> Σ = simulatecovariancerandom(7, 123);

julia> size(Σ)
(7, 7)

julia> (var(Σ) > 0.0) && (abs(det(Σ)) > 0.0)
true
source
GenomicBreedingCore.simulatecovariancesphericalMethod
simulatecovariancespherical(p::Int64, σ²::Float64)::Matrix{Float64}

Simulate a spherical covariance matrix with constant variance σ² on the diagonal.

Arguments

  • p::Int64: Dimension of the covariance matrix
  • σ²::Float64: Constant variance value for diagonal elements

Returns

  • Matrix{Float64}: p × p spherical covariance matrix

Examples

julia> Σ = simulatecovariancespherical(7, 2.15);

julia> size(Σ)
(7, 7)

julia> Σ[diagind(Σ)] == fill(2.15, 7)
true
source
GenomicBreedingCore.simulateeffectsMethod
simulateeffects(; p::Int64 = 2, q::Int64 = 1, λ::Float64 = 1.00, 
                covar_details::Tuple{Function,Union{Float64,Vector{Float64},Int64,Genomes}} = (simulatecovariancespherical, 1.00),
                seed::Int64 = 42)::Matrix{Float64}

Simulate correlated effects by sampling from a multivariate normal distribution.

This function generates a matrix of correlated effects by:

  1. Sampling means (μ) from an exponential distribution with parameter λ
  2. Creating a covariance matrix Σ using the specified covariance function and parameters
  3. Drawing samples from MvNormal(μ, Σ)

Arguments

  • p::Int64: Number of correlated effects to simulate (default = 2)
  • q::Int64: Number of times to simulate the correlated effects from the same distribution (default = 1)
  • λ::Float64: Rate parameter of the exponential distribution for sampling means (default = 1.00)
  • covar_details::Tuple{Function,Union{Float64,Vector{Float64},Int64,Genomes}}: Tuple containing:
    • First element: Covariance simulation function to use
    • Second element: Parameter(s) for the covariance function:
      • Float64 for spherical or autocorrelated covariance
      • Vector{Float64} for diagonal covariance
      • Int64 for random covariance seed
      • Genomes object for kinship covariance
  • seed::Int64: Random number generator seed for reproducibility (default = 42)

Returns

  • Matrix{Float64}: A p × q matrix where each column represents a set of correlated effects

Supported Covariance Functions

  • simulatecovariancespherical: Spherical covariance with constant variance parameter
  • simulatecovariancediagonal: Diagonal covariance with vector of variances
  • simulatecovariancerandom: Random covariance with seed parameter
  • simulatecovarianceautocorrelated: Autocorrelated covariance with correlation parameter
  • simulatecovariancekinship: Kinship-based covariance with Genomes object

Examples

julia> θ₀ = simulateeffects();

julia> sum(abs.(θ₀)) > 0.0
true

julia> p = 10; q = 5;

julia> θ₁ = simulateeffects(p=p, q=q, λ=0.50, covar_details=(simulatecovariancediagonal, rand(p)));

julia> (size(θ₁) == (p, q)) && (θ₀ != θ₁)
true

julia> θ₂ = simulateeffects(p=p, q=q, λ=0.50, covar_details=(simulatecovariancerandom, 123));

julia> (size(θ₂) == (p, q)) && (θ₁ != θ₂)
true

julia> θ₃ = simulateeffects(p=p, q=q, λ=0.50, covar_details=(simulatecovarianceautocorrelated, 0.71));

julia> (size(θ₃) == (p, q)) && (θ₂ != θ₃)
true

julia> genomes = simulategenomes(n=p, l=1_000, n_alleles=3, verbose=false);

julia> θ₄ = simulateeffects(p=p, q=q, λ=0.50, covar_details=(simulatecovariancekinship, genomes));

julia> (size(θ₄) == (p, q)) && (θ₃ != θ₄)
true
source
GenomicBreedingCore.simulategenomesMethod
simulategenomes(;
    n::Int64 = 100,
    n_populations::Int64 = 1, 
    l::Int64 = 10_000,
    n_chroms::Int64 = 7,
    n_alleles::Int64 = 2,
    max_pos::Int64 = 135_000_000,
    ld_corr_50perc_kb::Int64 = 1_000,
    rel_dist_multiplier::Float64 = 10.0,
    μ_β_params::Tuple{Float64,Float64} = (0.5, 0.5),
    sparsity::Float64 = 0.0,
    seed::Int64 = 42,
    verbose::Bool = true
)::Genomes

Simulates genomic data with population structure and linkage disequilibrium.

Arguments

  • n::Int64: Number of entries/individuals to simulate (1 to 1e9)
  • n_populations::Int64: Number of populations to simulate (1 to n)
  • l::Int64: Number of loci to simulate (2 to 1e9)
  • n_chroms::Int64: Number of chromosomes (1 to 1e6)
  • n_alleles::Int64: Number of alleles per locus (2 to 5, representing A, T, C, G, D)
  • max_pos::Int64: Maximum position in base pairs (10 to 160e9)
  • ld_corr_50perc_kb::Int64: Distance in kb where correlation decay reaches 50%
  • rel_dist_multiplier::Float64: Multiplier for maximum relative distance to consider in LD blocks
  • μ_β_params::Tuple{Float64,Float64}: Shape parameters for Beta distribution of allele frequencies
  • sparsity::Float64: Proportion of missing data to simulate (0.0 to 1.0)
  • seed::Int64: Random seed for reproducibility
  • verbose::Bool: Whether to show progress bar and final plot

Returns

  • Genomes: A struct containing:
    • entries: Vector of entry IDs
    • populations: Vector of population assignments
    • loci_alleles: Vector of locus-allele combinations
    • allele_frequencies: Matrix of allele frequencies
    • mask: Boolean matrix indicating valid data points

Details

  • Simulates genomic data by:
    • Generating chromosome lengths and loci positions
    • Assigning alleles to loci
    • Grouping entries into populations
    • Simulating allele frequencies with linkage disequilibrium using multivariate normal distribution
    • Adding optional sparsity (missing data)
  • Chromosome lengths are distributed evenly, with any remainder added to last chromosome
  • Loci positions are randomly sampled without replacement within each chromosome
  • LD decay follows an exponential function: corr = 1/exp(r*d), where d is normalized distance
  • Mean allele frequencies are sampled from Beta(α,β) distribution
  • Population structure is implemented by sampling the mean allele frequencies per population
  • For each entry and locus, allele frequencies with linkage disequilibrium are simulated by sampling a multivariate normal distribution per chromosome
  • Missing data is randomly assigned if sparsity > 0

Throws

  • ArgumentError: If input parameters are outside acceptable ranges
  • DimensionMismatch: If there's an error in the simulation process

Examples

julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=3, verbose=false);

julia> length(genomes.entries)
100

julia> length(genomes.populations)
100

julia> length(genomes.loci_alleles)
2000

julia> size(genomes.allele_frequencies)
(100, 2000)

julia> mean(ismissing.(genomes.allele_frequencies))
0.0

julia> rng::TaskLocalRNG = Random.seed!(123);

julia> idx = StatsBase.sample(rng, range(1, 2_000, step=2), 250, replace = false, ordered = true);

julia> correlations = StatsBase.cor(genomes.allele_frequencies[:, idx]);

julia> correlations[10,10] == 1.00
true

julia> correlations[10,10] > correlations[10,250]
true

julia> genomes = simulategenomes(n=100, l=10_000, n_alleles=3, sparsity=0.25, verbose=false);

julia> mean(ismissing.(genomes.allele_frequencies))
0.25
source
GenomicBreedingCore.simulategenomiceffectsMethod
simulategenomiceffects(;
    genomes::Genomes,
    f_additive::Float64 = 0.01,
    f_dominance::Float64 = 0.10,
    f_epistasis::Float64 = 0.05,
    seed::Int64 = 42,
)::Tuple{Matrix{Float64},Matrix{Float64}}

Simulate additive, dominance, and epistatic effects for multiple loci.

Arguments

  • genomes::Genomes: Genome struct containing n entries x p loci-alleles combinations
  • f_additive::Float64: Proportion of loci with non-zero additive effects (default = 0.01)
  • f_dominance::Float64: Proportion of additive loci with dominance effects (default = 0.10)
  • f_epistasis::Float64: Proportion of additive loci with epistasis effects (default = 0.05)
  • seed::Int64: Random seed for reproducibility (default = 42)

Returns

  • Tuple{Matrix{Float64},Matrix{Float64}}:
    • First matrix (n x 3): Additive, dominance and epistasis effects per entry
    • Second matrix (p x 3): Effects per locus-allele combination

Details

The genetic effects are simulated using diagonal variance-covariance matrices:

  1. For additive effects: Uses a diagonal matrix with random variances for each of the a loci with maxnalleles-1 allele effects per locus.

  2. For dominance effects: Uses a diagonal matrix with random variances for each of the d loci, simulating one dominance effect per locus.

  3. For epistasis effects: Uses a diagonal matrix with random variances for each of the e loci with maxnalleles-1 allele effects per locus. The final epistatic effects are computed by multiplying allele frequencies and effects for all possible pairs of epistatic loci.

For all three types of effects, means are sampled from an exponential distribution (λ=1).

Examples

julia> genomes::Genomes = simulategenomes(n=100, l=2_000, n_alleles=3, verbose=false);

julia> G, B = simulategenomiceffects(genomes=genomes, f_additive=0.05, f_dominance=0.75, f_epistasis=0.25);

julia> size.([G, B])
2-element Vector{Tuple{Int64, Int64}}:
 (100, 3)
 (4000, 3)

julia> sum(B .!= 0.0, dims=1)
1×3 Matrix{Int64}:
 200  75  50
source
GenomicBreedingCore.simulateldblocksMethod
simulateldblocks(;
    chrom_positions::Vector{Int64}, 
    chrom_length::Int64, 
    ld_corr_50perc_kb::Int64,
    rel_dist_multiplier::Float64 = 10.0
)::SparseMatrixCSC{Float64}

Simulate linkage disequilibrium (LD) blocks by generating a sparse variance-covariance matrix.

Arguments

  • chrom_positions::Vector{Int64}: Vector of positions on the chromosome for each locus
  • chrom_length::Int64: Total length of the chromosome
  • ld_corr_50perc_kb::Int64: Distance in kilobases at which the LD correlation decays to 50%
  • rel_dist_multiplier::Float64: Multiplier for maximum relative distance to consider (default: 10.0)

Returns

  • SparseMatrixCSC{Float64}: A sparse variance-covariance matrix representing LD blocks

Details

The function creates a variance-covariance matrix where the correlation between loci decays exponentially with distance. The decay rate is calculated to achieve 50% correlation at the specified distance (ld_corr_50perc_kb).

For computational efficiency, correlations between loci are set to zero if:

  1. The normalized distance is greater than reldistmultiplier * q, where q is the normalized LD decay distance (ldcorr50perckb / chromlength)
  2. The normalized distance is greater than 0.9 (90% of chromosome length)

The computation uses multi-threading with Threads.@threads to parallelize the calculation of correlation values across loci positions.

Throws

  • ArgumentError: If number of loci exceeds chromosome length
  • ArgumentError: If LD correlation distance exceeds chromosome length
  • ArgumentError: If reldistmultiplier is less than 1.0

Example

julia> chrom_lengths, chrom_loci_counts = simulatechromstruct(l=100, n_chroms=7, max_pos=135_000_000);

julia> positions, loci_alleles = simulateposandalleles(chrom_lengths=chrom_lengths, chrom_loci_counts=chrom_loci_counts, n_alleles=2);

julia> Σ = simulateldblocks(chrom_positions=positions[1], chrom_length=chrom_lengths[1], ld_corr_50perc_kb=1_000);

julia> size(Σ)
(14, 14)
source
GenomicBreedingCore.simulatematingMethod
simulatemating(;
    parent_genomes::Genomes,
    n_generations::Int = 1,
    pop_size_per_gen::Vector{Int64} = [100],
    seed::Int64 = 42,
    verbose::Bool = false
)::Vector{Genomes}

Simulates mating processes across multiple generations using a multivariate normal distribution approach.

Arguments

  • parent_genomes::Genomes: Initial parent genomic information containing allele frequencies
  • n_generations::Int: Number of generations to simulate (default: 1)
  • pop_size_per_gen::Vector{Int64}: Vector of population sizes for each generation (default: [100])
  • seed::Int64: Random seed for reproducibility (default: 42)
  • verbose::Bool: Whether to print progress messages (default: false)

Returns

  • Vector{Genomes}: Vector of genomes across generations

Description

This function simulates genetic inheritance across generations by:

  1. Sampling progeny allele frequencies using multivariate normal distribution
  2. Enforcing biological constraints (frequencies between 0 and 1)
  3. Normalizing frequencies for multiallelic loci
  4. Displaying frequency histograms for each generation

The simulation maintains allele frequency correlations within chromosomes and handles multiallelic loci by ensuring their frequencies sum to 1.

Throws

  • ArgumentError: If parent genomes contain missing values or invalid dimensions

Example

julia> parent_genomes = simulategenomes(n=5, l=10_000, n_alleles=3, verbose=false);

julia> great_great_offspring_genomes = simulatemating(parent_genomes=parent_genomes, n_generations=3, pop_size_per_gen=[10, 20, 30]);

julia> [length(x.entries) for x in great_great_offspring_genomes] == [5, 10, 20, 30]
true
source
GenomicBreedingCore.simulateperpopμΣMethod
simulateperpopμΣ(; 
    Σ_base::SparseMatrixCSC{Float64}, 
    μ_β_params::Tuple{Float64,Float64} = (0.5, 0.5),
    rng::TaskLocalRNG = Random.GLOBAL_RNG
) -> Tuple{Vector{Float64}, SparseMatrixCSC{Float64}}

Simulate per-population mean allele frequencies and their variance-covariance matrix.

Arguments

  • Σ_base::SparseMatrixCSC{Float64}: Base variance-covariance matrix to be scaled
  • μ_β_params::Tuple{Float64,Float64}: Parameters (α, β) for Beta distribution used to generate mean allele frequencies (default: (0.5, 0.5))
  • rng::TaskLocalRNG: Random number generator for reproducibility (default: GLOBAL_RNG)

Returns

  • Tuple{Vector{Float64}, SparseMatrixCSC{Float64}}: A tuple containing:
    • Vector of mean allele frequencies (μ)
    • Scaled variance-covariance matrix (Σ)

Details

The function:

  1. Samples mean allele frequencies from a Beta distribution
  2. Scales the variance-covariance matrix based on allele frequencies
  3. Ensures positive definiteness of the resulting matrix through iterative adjustment

The variance scaling is performed such that loci closer to fixation (0.0 or 1.0) have lower variance, following population genetics expectations.

Example

julia> chrom_lengths, chrom_loci_counts = simulatechromstruct(l=100, n_chroms=7, max_pos=135_000_000);

julia> positions, loci_alleles = simulateposandalleles(chrom_lengths=chrom_lengths, chrom_loci_counts=chrom_loci_counts, n_alleles=2);

julia> Σ_base = simulateldblocks(chrom_positions=positions[1], chrom_length=chrom_lengths[1], ld_corr_50perc_kb=1_000);

julia> μ, Σ = simulateperpopμΣ(Σ_base=Σ_base, μ_β_params=(0.5, 0.5));

julia> length(μ) == length(positions[1])
true

julia> size(Σ) == size(Σ_base)
true

julia> abs(sum(Σ .- Σ_base)) > 0.0
true
source
GenomicBreedingCore.simulatepopgroupsMethod
simulatepopgroups(; n::Int64, n_populations::Int64)::Tuple{Vector{String}, Vector{Vector{Int64}}}

Simulate population groups by dividing a total number of samples into populations.

Arguments

  • n::Int64: Total number of samples (between 1 and 1 billion)
  • n_populations::Int64: Number of populations to create (between 1 and n)

Returns

A tuple containing:

  • Vector{String}: Vector of population labels for each sample
  • Vector{Vector{Int64}}: Vector of vectors containing indices for each population group

Example

julia> populations, idx_population_groupings = simulatepopgroups(n=100, n_populations=3);

julia> length(populations) == 100
true

julia> length(idx_population_groupings) == 3
true

julia> pops = fill("", 100); [pops[x] .= unique(populations)[i] for (i, x) in enumerate([x for x in (idx_population_groupings)])];

julia> pops == populations
true
source
GenomicBreedingCore.simulateposandallelesMethod
simulateposandalleles(;
    chrom_lengths::Vector{Int64}, 
    chrom_loci_counts::Vector{Int64},
    n_alleles::Int64, 
    allele_choices::Vector{String} = ["A", "T", "C", "G", "D"],
    allele_weights::Weights{Float64,Float64,Vector{Float64}} = StatsBase.Weights([1.0, 1.0, 1.0, 1.0, 0.1] / sum([1.0, 1.0, 1.0, 1.0, 0.1])),
    rng::TaskLocalRNG = Random.GLOBAL_RNG,
    verbose::Bool = false
) -> Tuple{Vector{Vector{Int64}}, Vector{String}}

Simulates genomic positions and alleles for multiple chromosomes.

Arguments

  • chrom_lengths::Vector{Int64}: Vector containing the length of each chromosome
  • chrom_loci_counts::Vector{Int64}: Vector containing the number of loci to generate for each chromosome
  • n_alleles::Int64: Number of alleles to generate per locus
  • allele_choices::Vector{String}: Vector of possible alleles to choose from (default: ["A", "T", "C", "G", "D"])
  • allele_weights::Weights: Weights for sampling alleles (default: normalized weights favoring A,T,C,G over D)
  • rng::TaskLocalRNG: Random number generator for reproducibility (default: global RNG)
  • verbose::Bool: Whether to show progress bar (default: false)

Returns

  • Tuple{Vector{Vector{Int64}}, Vector{String}}: A tuple containing:
    1. A vector of vectors, where each inner vector contains the positions for a chromosome
    2. A vector of strings, where each string contains tab-separated locus information in the format: "chromN\tposition\tallalleles\tchosen_allele"

Throws

  • ArgumentError: If input argument lengths don't match or if invalid number of alleles is requested

Example

julia> chrom_lengths, chrom_loci_counts = simulatechromstruct(l=100, n_chroms=7, max_pos=135_000_000);

julia> positions, loci_alleles = simulateposandalleles(chrom_lengths=chrom_lengths, chrom_loci_counts=chrom_loci_counts, n_alleles=2);

julia> length(positions) == length(chrom_lengths)
true

julia> length(loci_alleles) == sum(chrom_loci_counts)
true
source
GenomicBreedingCore.simulatetrialsMethod
simulatetrials(;
    genomes::Genomes,
    f_add_dom_epi::Matrix{Float64} = [
        0.01 0.25 0.10
        0.05 0.50 0.25
        0.10 0.25 0.00
    ],
    n_years::Int64 = 2,
    n_seasons::Int64 = 4,
    n_harvests::Int64 = 2,
    n_sites::Int64 = 4,
    n_replications::Int64 = 2,
    n_blocks::Union{Missing,Int64} = missing,
    n_rows::Union{Missing,Int64} = missing,
    n_cols::Union{Missing,Int64} = missing,
    proportion_of_variance::Union{Missing,Matrix{Float64}} = missing,
    sparsity::Float64 = 0.0,
    seed::Int64 = 42,
    verbose::Bool = true,
)::Tuple{Trials,Vector{SimulatedEffects}}

Arguments

  • genomes: Genome struct includes the n entries x p loci-alleles combinations (p = l loci x a-1 alleles)
  • f_add_dom_epi: n_traits x 3 numeric matrix of loci proportion with additive, dominance and epistasis effects, i.e. each column refers to:
    • f_additive: proportion of the l loci with non-zero additive effects on the phenotype
    • f_dominance: proportion of the l*f_additive additive effects loci with additional dominance effects
    • f_epistasis: proportion of the l*f_additive additive effects loci with additional epistasis effects
    • (default = [0.01 0.25 0.10; 0.05 0.50 0.25; 0.10 0.25 0.00])
  • n_years: Number of years (default = 2)
  • n_seasons: Number of seasons (default = 4)
  • n_harvests: Number of harvests (default = 2)
  • n_sites: Number of sites (default = 4)
  • n_replications: Number of replications (default = 2)
  • n_blocks: Number of blocks across the entire field layout (default = missing)
  • n_rows: Number of rows across the entire field layout (default = missing)
  • n_cols: Number of columns across the entire field layout (default = missing)
  • proportion_of_variance: 9 x n_traits numeric matrix of scaled/non-scaled proportion of variances allocated to genetic and environmental effects (default = missing; values will be sampled from a uniform distribution followed by a biased sample on the first row, i.e. additive effects row). The rows correspond to the variance allocated to:
    1. additive genetic effects
    2. dominance genetic effects
    3. epistasis genetic effects
    4. years effects
    5. seasons effects
    6. sites effects
    7. environmental interactions
    8. spatial interactions
    9. GxE interactiions
  • seed: Randomisation seed (default = 42)
  • sparsity: Proportion of missing data (default = 0.0)
  • verbose: Show trials simulation progress bar? (default = true)

Outputs

  • Trials struct of simulated phenotype data
  • Vector of SimulatedEffects each corresponding to each trait-year-season-harvest-site-replication combination

Details

The function simulates trial data by:

  1. Generating genetic effects (additive, dominance, epistasis)
  2. Simulating environmental effects for:
    • Years, seasons, sites
    • Environmental interactions
    • Spatial field effects (blocks, rows, columns)
    • Genotype-by-environment interactions
  3. Combining effects according to specified variance proportions
  4. Applying optional sparsity to create missing data

The field layout is optimized to have:

  • Number of rows ≤ number of columns
  • Blocks divided along columns
  • Even distribution of entries and replications

The function uses various covariance structures to simulate environmental and spatial effects:

  1. Autocorrelated Covariance: Used for effects such as years, seasons, and blocks. This structure assumes that neighboring elements are more correlated than distant ones.
  2. Random Covariance: Used for site effects, where the covariance matrix is generated randomly.
  3. Spherical Covariance: Used for replication effects, assuming uniform correlation across all elements.
  4. Diagonal Covariance: Used for allele-by-environment interaction effects, where only diagonal elements are non-zero, representing independent effects.
  5. Custom Covariance: Specific covariance structures can be passed as arguments to simulate effects with tailored correlation patterns.

Examples

julia> genomes::Genomes = simulategenomes(n=100, l=2_000, n_alleles=3, verbose=false);

julia> trials::Trials, vector_of_effects::Array{SimulatedEffects,1} = simulatetrials(genomes=genomes, sparsity=0.25, verbose=false);

julia> size(trials.phenotypes)
(12800, 3)

julia> size(trials.traits)
(3,)

julia> unique(trials.entries) == genomes.entries
true

julia> unique(trials.populations) == unique(genomes.populations)
true

julia> abs(mean(ismissing.(trials.phenotypes)) - 0.25) < 0.00001
true
source
GenomicBreedingCore.sliceMethod
slice(
    genomes::Genomes; 
    idx_entries::Union{Nothing, Vector{Int64}} = nothing,
    idx_loci_alleles::Union{Nothing, Vector{Int64}} = nothing,
    verbose::Bool = false
)::Genomes

Create a subset of a Genomes struct by selecting specific entries and loci-allele combinations.

Arguments

  • genomes::Genomes: The source genomic data structure to be sliced
  • idx_entries::Union{Nothing, Vector{Int64}}: Indices of entries to keep. If nothing, all entries are kept
  • idx_loci_alleles::Union{Nothing, Vector{Int64}}: Indices of loci-allele combinations to keep. If nothing, all loci-alleles are kept
  • verbose::Bool: If true, displays a progress bar during slicing. Defaults to false

Returns

  • Genomes: A new Genomes struct containing only the selected entries and loci-allele combinations

Performance Notes

  • The function uses multi-threaded implementation for optimal performance
  • Progress bar is available when verbose=true to monitor the slicing operation
  • Memory efficient implementation that creates a new pre-allocated structure

Behaviour

  • Both index vectors are automatically sorted and deduplicated
  • If both idx_entries and idx_loci_alleles are nothing, returns a clone of the input
  • Maintains all relationships and structure of the original genomic data
  • Preserves population assignments and allele frequencies for selected entries

Validation

  • Performs dimension checks on both input and output genomic structures
  • Validates that all indices are within proper bounds
  • Ensures data consistency throughout the slicing operation

Throws

  • ArgumentError: If input Genomes struct is corrupted or indices are out of bounds
  • DimensionMismatch: If the resulting sliced genome has inconsistent dimensions

Examples

julia> genomes = simulategenomes(n=100, l=1_000, n_alleles=4, verbose=false);

julia> sliced_genomes = slice(genomes, idx_entries=collect(1:10), idx_loci_alleles=collect(1:300));

julia> dimensions(sliced_genomes)
Dict{String, Int64} with 7 entries:
  "n_entries"      => 10
  "n_chr"          => 1
  "n_loci"         => 100
  "n_loci_alleles" => 300
  "n_populations"  => 1
  "n_missing"      => 0
  "max_n_alleles"  => 4
source
GenomicBreedingCore.sliceMethod
slice(phenomes::Phenomes; idx_entries::Union{Nothing, Vector{Int64}}=nothing, idx_traits::Union{Nothing, Vector{Int64}}=nothing)::Phenomes

Create a new Phenomes object containing a subset of the original data by selecting specific entries and traits.

Arguments

  • phenomes::Phenomes: The original Phenomes object to slice
  • idx_entries::Union{Nothing, Vector{Int64}}=nothing: Indices of entries to keep. If nothing, all entries are kept
  • idx_traits::Union{Nothing, Vector{Int64}}=nothing: Indices of traits to keep. If nothing, all traits are kept

Returns

  • Phenomes: A new Phenomes object containing only the selected entries and traits

Notes

  • The function preserves the original structure while reducing dimensions
  • Indices must be within valid ranges (1 to nentries/ntraits)
  • Duplicate indices are automatically removed
  • The resulting object maintains all relationships between entries, populations, traits, and phenotypes

Throws

  • ArgumentError: If the input Phenomes struct is corrupted or if indices are out of bounds
  • DimensionMismatch: If the slicing operation results in invalid dimensions

Examples

julia> phenomes = Phenomes(n=10, t=3); phenomes.entries = string.("entry_", 1:10); phenomes.populations .= "pop_1"; phenomes.traits = ["A", "B", "C"]; phenomes.phenotypes = fill(0.0, 10,3);

julia> sliced_phenomes = slice(phenomes, idx_entries=collect(1:5); idx_traits=collect(2:3));

julia> dimensions(sliced_phenomes)
Dict{String, Int64} with 8 entries:
  "n_total"       => 10
  "n_zeroes"      => 10
  "n_nan"         => 0
  "n_entries"     => 5
  "n_traits"      => 2
  "n_inf"         => 0
  "n_populations" => 1
  "n_missing"     => 0
source
GenomicBreedingCore.sliceMethod
slice(
    trials::Trials; 
    traits::Union{Nothing, Vector{String}} = nothing,
    populations::Union{Nothing, Vector{String}} = nothing,
    entries::Union{Nothing, Vector{String}} = nothing,
    years::Union{Nothing, Vector{String}} = nothing,
    harvests::Union{Nothing, Vector{String}} = nothing,
    seasons::Union{Nothing, Vector{String}} = nothing,
    sites::Union{Nothing, Vector{String}} = nothing,
    replications::Union{Nothing, Vector{String}} = nothing,
    blocks::Union{Nothing, Vector{String}} = nothing,
    rows::Union{Nothing, Vector{String}} = nothing,
    cols::Union{Nothing, Vector{String}} = nothing,
)::Trials

Create a subset of a Trials struct by filtering its components based on specified criteria.

Arguments

  • trials::Trials: The source trials data structure to be sliced
  • traits::Vector{String}: Selected trait names to include
  • populations::Vector{String}: Selected population names to include
  • entries::Vector{String}: Selected entry names to include
  • years::Vector{String}: Selected years to include
  • harvests::Vector{String}: Selected harvest identifiers to include
  • seasons::Vector{String}: Selected seasons to include
  • sites::Vector{String}: Selected site names to include
  • replications::Vector{String}: Selected replication identifiers to include
  • blocks::Vector{String}: Selected block identifiers to include
  • rows::Vector{String}: Selected row identifiers to include
  • cols::Vector{String}: Selected column identifiers to include

All arguments except trials are optional. When an argument is not provided (i.e., nothing), all values for that category are included in the slice.

Returns

  • A new Trials struct containing only the selected data

Throws

  • ArgumentError: If invalid names are provided for any category or if no data remains after filtering
  • DimensionMismatch: If the resulting sliced trials structure has inconsistent dimensions
  • ArgumentError: If the input trials structure is corrupted

Examples

julia> trials, _ = simulatetrials(genomes = simulategenomes(verbose=false), verbose=false);

julia> sliced_trials = slice(trials, traits=trials.traits[2:3], years=[unique(trials.years)[1]], seasons=unique(trials.seasons)[2:3]);

julia> dimensions(sliced_trials)
Dict{String, Int64} with 16 entries:
  "n_zeroes"       => 0
  "n_harvests"     => 2
  "n_nan"          => 0
  "n_entries"      => 100
  "n_traits"       => 2
  "n_seasons"      => 2
  "n_blocks"       => 10
  "n_rows"         => 10
  "n_missing"      => 0
  "n_inf"          => 0
  "n_total"        => 6400
  "n_replications" => 2
  "n_years"        => 1
  "n_sites"        => 4
  "n_cols"         => 20
  "n_populations"  => 1
source
GenomicBreedingCore.sparsitiesMethod
sparsities(genomes::Genomes) -> Tuple{Vector{Float64}, Vector{Float64}}

Calculate the sparsity (proportion of missing data) for each entry and locus in a Genomes object.

Returns a tuple of two vectors:

  • First vector contains sparsity values for each entry (row-wise mean of missing values)
  • Second vector contains sparsity values for each locus (column-wise mean of missing values)

The function processes the data in parallel using multiple threads for performance optimization.

Arguments

  • genomes::Genomes: A Genomes object containing allele frequency data with potentially missing values

Returns

  • Tuple{Vector{Float64}, Vector{Float64}}: A tuple containing:
    • Vector of entry sparsities (values between 0.0 and 1.0)
    • Vector of locus sparsities (values between 0.0 and 1.0)

Example

julia> genomes = simulategenomes(n=100, l=1_000, sparsity=0.25, verbose=false);

julia> entry_sparsities, locus_sparsities = sparsities(genomes);

julia> abs(0.25 - mean(entry_sparsities)) < 0.0001
true

julia> abs(0.25 - mean(locus_sparsities)) < 0.0001
true
source
GenomicBreedingCore.summariseMethod
summarise(cvs::Vector{CV})::Tuple{DataFrame,DataFrame}

Summarize cross-validation results from a vector of CV structs into two DataFrames.

Returns

  • A tuple containing two DataFrames:
    1. Summary DataFrame with mean metrics across entries, replications, and folds
      • Contains means and standard deviations of correlation coefficients
      • Includes average training and validation set sizes
      • Grouped by training population, validation population, trait, and model
    2. Entry-level DataFrame with phenotype prediction statistics
      • Contains true phenotype values, predicted means (μ), and standard deviations (σ)
      • Grouped by training population, validation population, trait, model, and entry

Arguments

  • cvs::Vector{CV}: Vector of CV structs containing cross-validation results

Notes

  • Validates dimensions of input CV structs before processing
  • Handles missing values in phenotype predictions

Throws

  • ArgumentError: If any CV struct in the input vector has inconsistent dimensions

Examples

julia> fit_1 = Fit(n=1, l=2); fit_1.metrics = Dict("cor" => 0.0, "rmse" => 1.0); fit_1.trait = "trait_1";

julia> cv_1 = CV("replication_1", "fold_1", fit_1, ["population_1"], ["entry_1"], [0.0], [0.0], fit_1.metrics);

julia> fit_2 = Fit(n=1, l=2); fit_2.metrics = Dict("cor" => 1.0, "rmse" => 0.0); fit_2.trait = "trait_1";

julia> cv_2 = CV("replication_2", "fold_2", fit_2, ["population_2"], ["entry_2"], [0.0], [0.0], fit_2.metrics);

julia> cvs = [cv_1, cv_2];

julia> df_summary, df_summary_per_entry = summarise(cvs);

julia> size(df_summary)
(2, 8)

julia> size(df_summary_per_entry)
(2, 8)
source
GenomicBreedingCore.tabulariseFunction
tabularise(fit::Fit, metric::String = "cor")::DataFrame

Convert a Fit struct into a DataFrame for easier data manipulation and analysis.

Arguments

  • fit::Fit: A Fit struct containing model results and parameters
  • metric::String = "cor": The metric to extract from fit.metrics dictionary (default: "cor")

Returns

  • DataFrame: A DataFrame with the following columns:
    • model: The model name
    • trait: The trait name
    • population: Semicolon-separated string of unique population names
    • metric: The specified metric value from fit.metrics
    • b_hat_labels: Labels for the effect sizes
    • b_hat: Estimated effect sizes

Examples

julia> fit = Fit(n=100, l=10_000); fit.b_hat = rand(10_000); fit.model="some_model"; fit.trait="some_trait"; 

julia> fit.metrics = Dict("cor" => rand(), "rmse" => rand()); fit.populations .= "pop_1";

julia> df = tabularise(fit);

julia> size(df)
(10000, 6)
source
GenomicBreedingCore.tabulariseMethod
tabularise(phenomes::Phenomes)::DataFrame

Convert a Phenomes struct into a tabular format as a DataFrame.

The resulting DataFrame contains the following columns:

  • id: Integer index for each entry
  • entries: Entry identifiers
  • populations: Population assignments
  • Additional columns for each trait in phenomes.traits

Arguments

  • phenomes::Phenomes: A valid Phenomes struct containing phenotypic data

Returns

  • DataFrame: A DataFrame with entries as rows and traits as columns

Throws

  • ArgumentError: If the Phenomes struct dimensions are inconsistent

Examples

julia> phenomes = Phenomes(n=10, t=3); phenomes.entries = string.("entry_", 1:10); phenomes.populations .= "pop_1"; phenomes.traits = ["A", "B", "C"]; phenomes.phenotypes = fill(0.0, 10,3);

julia> tabularise(phenomes)
10×6 DataFrame
 Row │ id     entries   populations  A         B         C        
     │ Int64  String    String       Float64?  Float64?  Float64? 
─────┼────────────────────────────────────────────────────────────
   1 │     1  entry_1   pop_1             0.0       0.0       0.0
   2 │     2  entry_2   pop_1             0.0       0.0       0.0
   3 │     3  entry_3   pop_1             0.0       0.0       0.0
   4 │     4  entry_4   pop_1             0.0       0.0       0.0
   5 │     5  entry_5   pop_1             0.0       0.0       0.0
   6 │     6  entry_6   pop_1             0.0       0.0       0.0
   7 │     7  entry_7   pop_1             0.0       0.0       0.0
   8 │     8  entry_8   pop_1             0.0       0.0       0.0
   9 │     9  entry_9   pop_1             0.0       0.0       0.0
  10 │    10  entry_10  pop_1             0.0       0.0       0.0
source
GenomicBreedingCore.tabulariseMethod
tabularise(trials::Trials)::DataFrame

Convert a Trials struct into a DataFrame representation for easier data manipulation and analysis.

Arguments

  • trials::Trials: A valid Trials struct containing experimental field trial data.

Returns

  • DataFrame: A DataFrame with the following columns:
    • id: Unique identifier for each trial observation
    • years: Year of the trial
    • seasons: Season identifier
    • harvests: Harvest identifier
    • sites: Location/site identifier
    • replications: Replication number
    • blocks: Block identifier
    • rows: Row position
    • cols: Column position
    • entries: Entry identifier
    • populations: Population identifier
    • Additional columns for each trait in trials.traits

Throws

  • ArgumentError: If the Trials struct dimensions are inconsistent

Examples

julia> trials, _ = simulatetrials(genomes = simulategenomes(verbose=false), verbose=false);

julia> df = tabularise(trials);

julia> size(df)
(12800, 14)
source
GenomicBreedingCore.tabulariseMethod
tabularise(cvs::Vector{CV})::Tuple{DataFrame,DataFrame}

Convert a vector of CV (Cross-Validation) structs into two DataFrames containing metrics and predictions.

Arguments

  • cvs::Vector{CV}: Vector of CV structs containing cross-validation results

Returns

  • Tuple{DataFrame,DataFrame}: A tuple of two DataFrames:

    1. df_across_entries: Contains aggregated metrics across entries with columns:

      • training_population: Semicolon-separated list of training populations
      • validation_population: Semicolon-separated list of validation populations
      • trait: Name of the trait
      • model: Name of the model used
      • replication: Replication identifier
      • fold: Fold identifier
      • training_size: Number of entries in training set
      • validation_size: Number of entries in validation set
      • Additional columns for each metric (e.g., cor, rmse)
    2. df_per_entry: Contains per-entry predictions with columns:

      • training_population: Training population identifier
      • validation_population: Validation population identifier
      • entry: Entry identifier
      • trait: Name of the trait
      • model: Name of the model used
      • replication: Replication identifier
      • fold: Fold identifier
      • y_true: True values
      • y_pred: Predicted values

Throws

  • ArgumentError: If input vector is empty or if any CV struct is corrupted

Notes

  • Warns if there are empty CV structs resulting from insufficient training sizes or fixed traits
  • Metrics are extracted from the metrics dictionary in each CV struct
  • Population identifiers are sorted and joined with semicolons when multiple populations exist

Examples

julia> fit_1 = Fit(n=1, l=2); fit_1.metrics = Dict("cor" => 0.0, "rmse" => 1.0); fit_1.trait = "trait_1";

julia> cv_1 = CV("replication_1", "fold_1", fit_1, ["population_1"], ["entry_1"], [0.0], [0.0], fit_1.metrics);

julia> fit_2 = Fit(n=1, l=2); fit_2.metrics = Dict("cor" => 1.0, "rmse" => 0.0); fit_2.trait = "trait_1";

julia> cv_2 = CV("replication_2", "fold_2", fit_2, ["population_2"], ["entry_2"], [0.0], [0.0], fit_2.metrics);

julia> cvs = [cv_1, cv_2];

julia> df_across_entries, df_per_entry = tabularise(cvs);

julia> names(df_across_entries)
10-element Vector{String}:
 "training_population"
 "validation_population"
 "trait"
 "model"
 "replication"
 "fold"
 "training_size"
 "validation_size"
 "cor"
 "rmse"

julia> df_across_entries[!, [:cor, :rmse]]
2×2 DataFrame
 Row │ cor      rmse    
     │ Float64  Float64 
─────┼──────────────────
   1 │     0.0      1.0
   2 │     1.0      0.0

julia> names(df_per_entry)
9-element Vector{String}:
 "training_population"
 "validation_population"
 "entry"
 "trait"
 "model"
 "replication"
 "fold"
 "y_true"
 "y_pred"

julia> df_per_entry[!, [:entry, :y_true, :y_pred]]
2×3 DataFrame
 Row │ entry    y_true   y_pred  
     │ String   Float64  Float64 
─────┼───────────────────────────
   1 │ entry_1      0.0      0.0
   2 │ entry_2      0.0      0.0
source
GenomicBreedingCore.trialsmodelsfomulae!Method
trialsmodelsfomulae!(df::DataFrame; trait::String, max_levels::Int64 = 100)::Tuple{Vector{String},Vector{Int64}}

Generate mixed model formulae for analyzing multi-environment trial data.

Arguments

  • df::DataFrame: Input DataFrame containing trial data, will be modified in-place
  • trait::String: Name of the response variable column
  • max_levels::Int64=100: Maximum number of factor levels allowed in interaction terms

Returns

A tuple containing:

  • Vector{String}: Collection of mixed model formulae with increasing complexity
  • Vector{Int64}: Corresponding number of factor levels for each formula

Details

The function:

  1. Identifies available trial design variables (nesters, spatial components, targets)
  2. Creates interaction terms between variables and adds them to the DataFrame
  3. Generates model formulae considering:
    • Single and multi-environment models
    • Fixed and random entry effects
    • Spatial error components
    • Nested random effects
  4. Filters redundant models and sorts by complexity

Notes

  • Warns if trials are unreplicated
  • Throws error if only one entry is present
  • Automatically removes block effects when both row and column effects are present
  • Removes redundant nesting structures

Examples

julia> trials, _simulated_effects = simulatetrials(genomes = simulategenomes(verbose=false), verbose=false);

julia> df = tabularise(trials);

julia> size(df)
(12800, 14)

julia> formulae, n_levels = trialsmodelsfomulae!(df, trait="trait_1");

julia> size(df)
(12800, 134)

julia> length(formulae)
76

julia> sum(n_levels .== sort(n_levels))
76
source
GenomicBreedingCore.turingblrMethod
turingblr(vector_of_Xs_noint::Vector{Matrix{Bool}}, y::Vector{Float64})

Bayesian Linear Regression model implemented using Turing.jl.

Arguments

  • vector_of_Xs_noint::Vector{Matrix{Bool}}: Vector of predictor matrices, where each matrix contains binary (Bool) predictors
  • y::Vector{Float64}: Vector of response variables

Model Details

  • Includes an intercept term with Normal(0.0, 10.0) prior
  • For each predictor matrix in vector_of_Xs_noint:
    • Variance parameter (σ²) with Exponential(1.0) prior
    • Regression coefficients (β) with multivariate normal prior: MvNormal(0, σ² * I)
  • Residual variance (σ²) with Exponential(10.0) prior
  • Response variable modeled as multivariate normal: MvNormal(μ, σ² * I)

Returns

  • A Turing model object that can be used for MCMC sampling
source
GenomicBreedingCore.turingblrmcmc!Method
turingblrmcmc!(
    blr::BLR;
    multiple_σs::Union{Nothing, Dict{String, Bool}} = nothing,
    turing_model::Function = turingblr,
    n_iter::Int64 = 10_000,
    n_burnin::Int64 = 1_000,
    δ::Float64 = 0.65,
    max_depth::Int64 = 5,
    Δ_max::Float64 = 1000.0,
    init_ϵ::Float64 = 0.2,
    adtype::AutoReverseDiff = AutoReverseDiff(compile = true),
    diagnostics_threshold_std_lt::Float64 = 0.05,
    diagnostics_threshold_ess_ge::Int64 = 100, 
    diagnostics_threshold_rhat_lt::Float64 = 1.01,
    seed::Int64 = 1234,
    verbose::Bool = true
)::Nothing

Perform MCMC sampling for Bayesian Linear Regression on a BLR model using NUTS sampler.

Arguments

  • blr::BLR: The BLR model to fit
  • multiple_σs::Union{Nothing, Dict{String, Bool}}: Optional dictionary specifying multiple (true) or single (false) variance scalers per component
  • turing_model::Function: The Turing model function to use (default: turingblr)
  • n_iter::Int64: Number of MCMC iterations (default: 10,000)
  • n_burnin::Int64: Number of burn-in iterations to discard (default: 1,000)
  • δ::Float64: Target acceptance rate for dual averaging (default: 0.65)
  • max_depth::Int64: Maximum doubling tree depth (default: 5)
  • Δ_max::Float64: Maximum divergence during doubling tree (default: 1000.0)
  • init_ϵ::Float64: Initial step size; 0 means auto-search using heuristics (default: 0.2)
  • adtype::AutoReverseDiff: Automatic differentiation type (default: AutoReverseDiff(compile = true))
  • diagnostics_threshold_std_lt::Float64: Threshold for MCSE/std ratio convergence (default: 0.05)
  • diagnostics_threshold_ess_ge::Int64: Minimum effective sample size for convergence (default: 100)
  • diagnostics_threshold_rhat_lt::Float64: Maximum R-hat for convergence (default: 1.01)
  • seed::Int64: Random seed for reproducibility (default: 1234)
  • verbose::Bool: Whether to show progress during sampling (default: true)

Returns

  • Nothing: The function mutates the input BLR struct in-place, updating its:
    • coefficients,
    • variance components,
    • predicted values,
    • residuals, and
    • MCMC diagnostics

Notes

  • Performs model validation checks before fitting
  • Uses NUTS (No-U-Turn Sampler) with specified AD type
  • Computes convergence diagnostics by splitting the chain into 5 sub-chains to calculate:
    • improved Gelman-Rubin statistics (R̂; https://doi.org/10.1214/20-BA1221),
    • effective sample size (ESS), and
    • Monte Carlo standard error (MCSE)
  • Warns if parameters haven't converged based on:
    • R̂ >= diagnosticsthresholdrhat_lt
    • ESS < diagnosticsthresholdess_ge
    • MCSE/std ratio >= diagnosticsthresholdstd_lt
  • Updates the model with estimated coefficients, variance components, predicted values, and residuals
  • Mutates the input BLR model in-place
julia> genomes = simulategenomes(n=5, l=1_000, verbose=false);

julia> trials, simulated_effects = simulatetrials(genomes = genomes, n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=3, verbose=false);

julia> df = tabularise(trials);

julia> blr = instantiateblr(trait = trials.traits[1], factors = ["rows", "cols"], df = df, other_covariates = trials.traits[2:end], verbose = false);

julia> turingblrmcmc!(blr, n_iter=1_000, n_burnin=200, seed=123, verbose=false);

julia> mean([mean(x) for (_, x) in blr.coefficients]) != 0.0
true

julia> cor(blr.y, blr.ŷ) > 0.0
true

julia> (sum(blr.diagnostics.rhat .< 1.01) < nrow(blr.diagnostics))
true

julia> (sum(blr.diagnostics.ess .>= 100) < nrow(blr.diagnostics))
true
source
GenomicBreedingCore.@string2formulaMacro
@string2formula(x::String)

Convert a string representation of a formula into a Formula object.

This macro parses a string containing a formula expression and evaluates it into a proper Formula object that can be used in statistical modelling.

Arguments

  • x::String: A string containing the formula expression (e.g., "y ~ x + z")

Returns

  • Formula: A Formula object representing the parsed expression
source
GenomicBreedingCore.@stringevaluationMacro
@stringevaluation(x)

Parse and evaluate a string expression at compile time.

Arguments

  • x: A string containing a Julia expression to be parsed.

Returns

  • The parsed expression as an Expr object ready for evaluation.
source