GenomicBreedingIO
Documentation for GenomicBreedingIO.
GenomicBreedingIO.isfuzzymatch
GenomicBreedingIO.levenshteindistance
GenomicBreedingIO.readdelimited
GenomicBreedingIO.readdelimited
GenomicBreedingIO.readdelimited
GenomicBreedingIO.readjld2
GenomicBreedingIO.readvcf
GenomicBreedingIO.vcfchunkify
GenomicBreedingIO.vcfcountlocialleles
GenomicBreedingIO.vcfextractallelefreqs!
GenomicBreedingIO.vcfextractentriesandformats
GenomicBreedingIO.vcfextractinfo
GenomicBreedingIO.vcfinstantiateoutput
GenomicBreedingIO.vcfparsecoordinates
GenomicBreedingIO.writedelimited
GenomicBreedingIO.writedelimited
GenomicBreedingIO.writedelimited
GenomicBreedingIO.writejld2
GenomicBreedingIO.writevcf
GenomicBreedingIO.isfuzzymatch
— Method
isfuzzymatch(a::String, b::String; threshold::Float64=0.3)::Bool
Determines if two strings approximately match each other using Levenshtein distance.
The function compares two strings and returns true if they are considered similar enough based on the Levenshtein edit distance and a threshold value. The threshold is applied as a fraction of the length of the shorter string. Additionally, the function normalizes specific string inputs (e.g., "#chr" is replaced with "chrom") before comparison.
Arguments
- a::String: First string to compare
- b::String: Second string to compare
- threshold::Float64=0.3: Maximum allowed edit distance as a fraction of the shorter string length
Returns
- Bool: true if the strings match within the threshold, false otherwise
Examples
julia> isfuzzymatch("populations", "populations")
true
julia> isfuzzymatch("populations", "poplation")
true
julia> isfuzzymatch("populations", "entry")
false
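The threshold rule can be illustrated with a minimal sketch that takes an already computed edit distance as input. The helper name `withinthreshold` is hypothetical, and the exact comparison (`<=` against `threshold` times the shorter length) is an assumption based on the argument description, not the package's confirmed implementation.

```julia
# Hypothetical helper: decide a match from an already computed edit distance.
# Assumed rule: accept when distance <= threshold * length of the shorter string.
function withinthreshold(distance::Int, a::AbstractString, b::AbstractString; threshold::Float64 = 0.3)::Bool
    return distance <= threshold * min(length(a), length(b))
end

# Using the distances reported in the levenshteindistance examples:
withinthreshold(2, "populations", "poplation")  # 2 <= 0.3 * 9, so true
withinthreshold(3, "populations", "entry")      # 3 >  0.3 * 5, so false
```

Scaling the threshold by the shorter string keeps short column names from matching almost anything.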
GenomicBreedingIO.levenshteindistance
— Method
levenshteindistance(a::String, b::String)::Int64
Calculate the Levenshtein distance (edit distance) between two strings.
The Levenshtein distance is a measure of the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.
Arguments
- a::String: First input string
- b::String: Second input string
Returns
- Int64: The minimum number of edits needed to transform string a into string b
Examples
julia> levenshteindistance("populations", "populations")
0
julia> levenshteindistance("populations", "poplation")
2
julia> levenshteindistance("populations", "entry")
3
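For readers unfamiliar with the algorithm, the following is a minimal sketch of the textbook dynamic-programming formulation of edit distance. The name `editdistance` is hypothetical, and the package's own implementation may differ in its details.

```julia
# Hypothetical helper: textbook Levenshtein distance via dynamic programming.
function editdistance(a::AbstractString, b::AbstractString)::Int
    x, y = collect(a), collect(b)
    # prev[j + 1] is the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    prev = collect(0:length(y))
    for i in eachindex(x)
        curr = similar(prev)
        curr[1] = i                      # delete the first i characters of `a`
        for j in eachindex(y)
            cost = x[i] == y[j] ? 0 : 1  # substitution is free when characters agree
            curr[j+1] = min(prev[j+1] + 1, curr[j] + 1, prev[j] + cost)
        end
        prev = curr
    end
    return prev[end]
end

editdistance("kitten", "sitting")         # 3
editdistance("populations", "poplation")  # 2
```

Only two rows of the table are kept at a time, so memory use grows with the length of `b` rather than with the product of both lengths.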
GenomicBreedingIO.readdelimited
— Method
readdelimited(
    type::Type{Genomes};
    fname::String,
    sep::String = "\t",
    parse_populations_from_entries::Union{Nothing,Function} = nothing,
    all_alleles_column::Bool = true,
    verbose::Bool = false
)::Genomes
Load genotype data from a delimited text file into a Genomes struct.
Arguments
- type::Type{Genomes}: Type parameter (always Genomes)
- fname::String: Path to the input file
- sep::String: Delimiter character (default: tab)
- parse_populations_from_entries::Union{Nothing,Function}: Optional function to extract population names from entry names
- all_alleles_column::Bool: Whether the input file contains the all-alleles column (default: true)
- verbose::Bool: Whether to show a progress bar during loading
File Format
The input file should be structured as follows:
- Supported extensions: .tsv, .csv, or .txt
- Comments and headers start with '#'
- Header format (2 lines where the second line is optional):
  - Column names:
    - With all_alleles: "chrom,pos,all_alleles,allele,entry_1,entry_2,..."
    - Without all_alleles: "chrom,pos,allele,entry_1,entry_2,..."
  - Population names (optional): Same format as line 1 but with population names
- Data columns:
  - chromosome identifier
  - position (integer)
  - all alleles at locus (if all_alleles_column=true)
  - specific allele
  - allele frequency for each entry (one column per entry)
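As a rough illustration of the header layout described above, the sketch below splits a header line into its fixed columns and entry names. The helper `parseheader` is hypothetical and only mirrors the documented column order; it is not the parser used by readdelimited.

```julia
# Hypothetical helper: split a header line into fixed columns and entry names.
function parseheader(line::AbstractString; sep::String = "\t", all_alleles_column::Bool = true)
    fields = String.(split(lstrip(line, '#'), sep))
    n_fixed = all_alleles_column ? 4 : 3   # chrom, pos, [all_alleles], allele
    return (fixed = fields[1:n_fixed], entries = fields[n_fixed+1:end])
end

h = parseheader("#chrom\tpos\tall_alleles\tallele\tentry_1\tentry_2")
h.entries  # ["entry_1", "entry_2"]
```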
Returns
- Genomes: A populated Genomes struct containing the loaded data
Throws
- ErrorException: If file doesn't exist or has invalid format
- ArgumentError: If column names don't match expected format
- OverflowError: If allele frequencies are outside [0,1] range
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(n=10, verbose=false);
julia> genomes.entries = [string(genomes.populations[i], "-", genomes.entries[i]) for i in eachindex(genomes.populations)];
julia> fname = writedelimited(genomes);
julia> genomes_reloaded = readdelimited(Genomes, fname=fname);
julia> genomes == genomes_reloaded
true
julia> fname = writedelimited(genomes, include_population_header=false);
julia> genomes_reloaded = readdelimited(Genomes, fname=fname);
julia> unique(genomes_reloaded.populations) == ["Unknown_population"]
true
julia> genomes_reloaded = readdelimited(Genomes, fname=fname, parse_populations_from_entries=x -> split(x, "-")[1]);
julia> genomes == genomes_reloaded
true
GenomicBreedingIO.readdelimited
— Method
readdelimited(type::Type{Phenomes}; fname::String, sep::String = "\t", verbose::Bool = false)::Phenomes
Load phenotypic data from a delimited text file into a Phenomes struct.
Arguments
- type::Type{Phenomes}: Type parameter (must be Phenomes)
- fname::String: Path to the input file
- sep::String: Delimiter character (default: tab "\t")
- verbose::Bool: Whether to show progress bar during loading (default: false)
File Format
The file should be a delimited text file with:
- Header row containing column names
- First column: Entry identifiers
- Second column: Population identifiers
- Remaining columns: Phenotypic trait values (numeric or missing)
Missing values can be specified as "missing", "NA", "na", "N/A", "n/a" or empty string.
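The missing-value convention can be sketched as a small cell parser. The helper name `parsecell` is hypothetical, and the sketch assumes a cell is either one of the listed tokens or a valid number.

```julia
# Hypothetical helper: convert one text cell into a Float64 or `missing`.
function parsecell(cell::AbstractString)::Union{Missing,Float64}
    # Tokens listed in the documentation as representing missing values.
    cell in ("missing", "NA", "na", "N/A", "n/a", "") && return missing
    return parse(Float64, cell)
end

row = split("entry_1\tpop_1\t1.5\tNA\t", "\t")
parsed = parsecell.(row[3:end])  # [1.5, missing, missing]
```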
Returns
- Phenomes: A Phenomes struct containing the loaded phenotypic data
Throws
- ErrorException: If file doesn't exist or has invalid format
- ArgumentError: If required columns are missing or misnamed
- ErrorException: If duplicate entries or traits are found
- ErrorException: If numeric values cannot be parsed
Notes
- Comments starting with '#' are ignored
- Empty lines are skipped
- Mathematical operators (+,-,*,/,%) in trait names are replaced with underscores
- Performs dimension checks on the loaded data
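The trait-name clean-up mentioned in the notes might look like the following sketch. The helper `sanitizetrait` is hypothetical, and the one-underscore-per-operator behaviour is an assumption.

```julia
# Hypothetical helper: replace arithmetic operators in a trait name with underscores.
sanitizetrait(name::AbstractString) = replace(name, r"[+\-*/%]" => "_")

sanitizetrait("yield/plant")   # "yield_plant"
sanitizetrait("height+width")  # "height_width"
```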
Examples
julia> phenomes = Phenomes(n=10, t=3); phenomes.entries = string.("entry_", 1:10); phenomes.populations .= "pop1"; phenomes.traits = ["A", "B", "C"]; phenomes.phenotypes = rand(10,3); phenomes.phenotypes[1,1] = missing; phenomes.mask .= true;
julia> fname = writedelimited(phenomes);
julia> phenomes_reloaded = readdelimited(Phenomes, fname=fname);
julia> phenomes == phenomes_reloaded
true
GenomicBreedingIO.readdelimited
— Method
readdelimited(type::Type{Trials}; fname::String, sep::String = "\t", verbose::Bool = false)::Trials
Load a Trials struct from a string-delimited file.
Arguments
- type::Type{Trials}: Type parameter (must be Trials)
- fname::String: Path to the input file
- sep::String = "\t": Delimiter character (default is tab)
- verbose::Bool = false: Whether to display progress information
Required File Structure
The input file must contain the following 10 identifier columns:
- years: Year identifiers
- seasons: Season identifiers
- harvests: Harvest identifiers
- sites: Site identifiers
- entries: Entry identifiers
- populations: Population identifiers
- replications: Replication identifiers
- blocks: Block identifiers
- rows: Row identifiers
- cols: Column identifiers
All remaining columns are treated as numeric phenotype measurements. Column names are fuzzy-matched to accommodate slight spelling variations.
Returns
- Trials: A populated Trials struct containing the loaded data
Notes
- Missing values can be represented as "missing", "NA", "na", "N/A", "n/a", or empty strings
- Trait names containing mathematical operators (+, -, *, /, %) are converted to underscores
- Duplicate trait names are not allowed
Throws
- ErrorException: If the input file doesn't exist or has invalid format
- ArgumentError: If required columns are missing or ambiguous
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(n=10, verbose=false);
julia> trials, _ = GenomicBreedingCore.simulatetrials(genomes=genomes, sparsity=0.1, verbose=false);
julia> fname = writedelimited(trials);
julia> trials_reloaded = readdelimited(Trials, fname=fname);
julia> trials == trials_reloaded
true
GenomicBreedingIO.readjld2
— Method
readjld2(type::Type; fname::String)::Type
Load a core (Genomes, Phenomes, and Trials), simulation (SimulatedEffects), or model (TEBV) struct from a JLD2 file.
Arguments
- type::Type: The type of struct to load (Genomes, Phenomes, Trials, SimulatedEffects, or TEBV)
- fname::String: Path to the JLD2 file to read from
Returns
- The loaded struct of the specified type
Throws
- ArgumentError: If the specified file does not exist
- DimensionMismatch: If the loaded struct is corrupted
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(n=2, verbose=false);
julia> fname = writejld2(genomes);
julia> readjld2(Genomes, fname=fname) == genomes
true
julia> phenomes = Phenomes(n=2, t=2); phenomes.entries = ["entry_1", "entry_2"]; phenomes.traits = ["trait_1", "trait_2"];
julia> fname = writejld2(phenomes);
julia> readjld2(Phenomes, fname=fname) == phenomes
true
julia> trials, _ = simulatetrials(genomes=genomes, verbose=false);
julia> fname = writejld2(trials);
julia> readjld2(Trials, fname=fname) == trials
true
GenomicBreedingIO.readvcf
— Method
readvcf(; fname::String, field::String = "any", min_depth::Int64 = 5, max_depth::Int64 = 100, verbose::Bool = false)::Genomes
Read genetic data from a VCF (Variant Call Format) file into a Genomes struct.
Arguments
- fname::String: Path to the VCF file. Can be gzipped (.vcf.gz or .vcf.bgz) or uncompressed (.vcf)
- field::String="any": Which FORMAT field to extract from VCF. Default "any" tries to automatically detect genotype field
- min_depth::Int64=5: Minimum read depth threshold for AD (Allele Depth) field
- max_depth::Int64=100: Maximum read depth threshold for AD field
- verbose::Bool=false: Whether to print progress and debug information
Returns
- Genomes: A Genomes struct containing the loaded genetic data with fields:
  - allele_frequencies: Matrix of allele frequencies
  - loci_alleles: Vector of locus-allele combination strings
  - mask: Boolean matrix indicating missing data
  - samples: Vector of sample names
Details
Reads VCF files in parallel using multiple threads. Handles multi-allelic variants and different ploidies. Field priority (when field="any"):
- AF (Allele Frequency)
- AD (Allele Depth)
- GT (Genotype)
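The priority order above can be sketched as a lookup over the FORMAT definitions in the VCF header. The helper `pickfield` is hypothetical and assumes the `##FORMAT=<ID=...>` layout shown in the vcfextractentriesandformats example; it is not the detection routine used by readvcf.

```julia
# Hypothetical helper: pick the first available FORMAT ID in the order AF > AD > GT.
function pickfield(format_lines::Vector{String}; priority = ["AF", "AD", "GT"])
    ids = String[]
    for line in format_lines
        m = match(r"^##FORMAT=<ID=([^,>]+)", line)
        m === nothing || push!(ids, m.captures[1])
    end
    idx = findfirst(in(ids), priority)
    return idx === nothing ? nothing : priority[idx]
end

format_lines = [
    "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">",
    "##FORMAT=<ID=AD,Number=2,Type=Float,Description=\"Allele Depth\">",
    "##FORMAT=<ID=AF,Number=2,Type=Float,Description=\"Allele Frequency\">",
]
pickfield(format_lines)       # "AF"
pickfield(format_lines[1:2])  # "AD"
```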
Performs various checks on the input data including:
- File existence
- No duplicate loci-allele combinations
- Consistent dimensions in output struct
Throws
- ErrorException: If file doesn't exist, has duplicates, or output dimensions are invalid
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(n=10, verbose=false);
julia> fname = writevcf(genomes);
julia> fname_gz = writevcf(genomes, gzip=true);
julia> genomes_reloaded = readvcf(fname=fname);
julia> genomes_reloaded_gz = readvcf(fname=fname_gz);
julia> genomes.entries == genomes_reloaded.entries == genomes_reloaded_gz.entries
true
julia> dimensions(genomes) == dimensions(genomes_reloaded) == dimensions(genomes_reloaded_gz)
true
julia> ismissing.(genomes.allele_frequencies) == ismissing.(genomes_reloaded.allele_frequencies) == ismissing.(genomes_reloaded_gz.allele_frequencies)
true
GenomicBreedingIO.vcfchunkify
— Method
vcfchunkify(fname::String; n_loci::Int64, verbose::Bool = false)::Tuple{Vector{Int64},Vector{Int64},Vector{Int64},Vector{Int64}}
Divide a VCF file into chunks for parallel processing.
Arguments
- fname::String: Path to the VCF file (can be .vcf, .vcf.gz, or .vcf.bgz)
- n_loci::Int64: Total number of loci in the VCF file
- verbose::Bool=false: If true, prints progress information
Returns
A tuple containing four Vector{Int64} arrays:
- Starting loci indices for each thread
- Ending loci indices for each thread
- Starting file positions for each thread
- Ending file positions for each thread
Details
- Automatically detects if the input file is gzipped
- Divides the workload evenly across available threads
- Skips header lines (starting with '#')
- Handles both regular and gzipped VCF files
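One way to divide loci evenly across threads is sketched below. The helper `chunkranges` is hypothetical; its boundary convention (starts beginning at 0, stops ending at `n_loci`) is an assumption chosen to agree with the checks in the examples that follow, and the file-position bookkeeping is left out.

```julia
# Hypothetical helper: split `n_loci` loci into at most `n_chunks` contiguous ranges.
function chunkranges(n_loci::Int, n_chunks::Int)
    n_chunks = min(n_chunks, n_loci)  # never create more chunks than loci
    edges = round.(Int, range(0, stop = n_loci, length = n_chunks + 1))
    starts = edges[1:end-1]           # exclusive lower bounds; the first is 0
    stops = edges[2:end]              # inclusive upper bounds; the last is n_loci
    return starts, stops
end

starts, stops = chunkranges(10_000, 4)
# starts == [0, 2500, 5000, 7500]; stops == [2500, 5000, 7500, 10000]
```

In practice the chunk count would usually come from `Threads.nthreads()`.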
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(n=10, verbose=false);
julia> fname = writevcf(genomes);
julia> _, n_loci, n_alt_alleles = vcfcountlocialleles(fname);
julia> idx_loci_per_thread_ini, idx_loci_per_thread_fin, file_pos_per_thread_ini, file_pos_per_thread_fin = vcfchunkify(fname, n_loci=n_loci);
julia> length(idx_loci_per_thread_ini) == length(idx_loci_per_thread_fin) == length(file_pos_per_thread_ini) == length(file_pos_per_thread_fin)
true
julia> (idx_loci_per_thread_ini[1] == 0) && (sum(idx_loci_per_thread_ini .== 0) == 1)
true
julia> (idx_loci_per_thread_fin[end] == n_loci) && (sum(idx_loci_per_thread_fin .== 0) == 0)
true
julia> (sum(file_pos_per_thread_ini .== 0) == 0) && (sum(file_pos_per_thread_fin .== 0) == 0)
true
julia> rm(fname);
GenomicBreedingIO.vcfcountlocialleles
— Method
vcfcountlocialleles(fname::String; verbose::Bool = false)::Tuple{Int64,Int64,Int64}
Count the total lines, the loci, and the alternative alleles in a VCF file.
Arguments
- fname::String: Path to the VCF file. Can be either a plain text VCF file or a gzipped VCF file (with extensions .vcf.gz or .vcf.bgz)
- verbose::Bool: If true, prints progress messages and results to stdout. Defaults to false.
Returns
- Tuple{Int64,Int64,Int64}: A tuple containing:
  - First element: Total number of lines in the file (including headers)
  - Second element: Number of data lines (variants/loci) excluding header lines
  - Third element: Total number of alternative alleles across all loci
Description
Reads through a VCF (Variant Call Format) file and counts:
- Total lines in the file (including headers)
- Number of data lines (variants/loci) that don't start with '#'
- Total number of alternative alleles across the data lines
The function automatically detects and handles different file formats:
- Plain text VCF files (.vcf)
- Gzipped VCF files (.vcf.gz)
- BGZipped VCF files (.vcf.bgz)
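The two line counts can be illustrated on an in-memory VCF fragment. The helper `countlinesandloci` is hypothetical and reads from any `IO`; a gzipped file would first need a decompressing stream, which is not shown here.

```julia
# Hypothetical helper: count all lines and the lines that do not start with '#'.
function countlinesandloci(io::IO)
    n_lines, n_loci = 0, 0
    for line in eachline(io)
        n_lines += 1
        startswith(line, "#") || (n_loci += 1)
    end
    return n_lines, n_loci
end

vcf_text = """
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tentry_1
chr1\t100\t.\tA\tT\t.\t.\t.\tGT\t0/1
chr1\t200\t.\tG\tC\t.\t.\t.\tGT\t1/1
"""
countlinesandloci(IOBuffer(vcf_text))  # (4, 2)
```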
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(n=10, verbose=false);
julia> fname = writevcf(genomes);
julia> fname_gz = writevcf(genomes, gzip=true);
julia> n_1, p_1, l_1 = vcfcountlocialleles(fname);
julia> n_2, p_2, l_2 = vcfcountlocialleles(fname_gz);
julia> n_1 == n_2 == 10_009
true
julia> p_1 == p_2 == 10_000
true
julia> l_1 == l_2 == 10_000
true
julia> rm.([fname, fname_gz]);
GenomicBreedingIO.vcfextractallelefreqs!
— Method
vcfextractallelefreqs!(genomes::Genomes, pb::Union{Nothing,Progress}, i::Vector{Int64};
    fname::String, line::Vector{String}, line_counter::Int64,
    field::String, min_depth::Int64=10, max_depth::Int64=100,
    verbose::Bool=false)
Extract allele frequencies from VCF file data and update a Genomes object.
Arguments
- genomes::Genomes: Object to store genomic data
- pb::Union{Nothing,Progress}: Progress bar object or nothing
- i::Vector{Int64}: Single-element vector containing current locus-allele index
- fname::String: Name of VCF file being processed
- line::Vector{String}: Current line from VCF file split into fields
- line_counter::Int64: Current line number in VCF file
- field::String: Type of field to extract ("AF", "AD", or "GT")
- min_depth::Int64=10: Minimum read depth threshold for AD field
- max_depth::Int64=100: Maximum read depth threshold for AD field
- verbose::Bool=false: Whether to display progress updates
Returns
Nothing; updates the input parameters in place:
- genomes: Updated with new allele frequencies and loci information
- pb: Advanced if verbose=true
- i: Index incremented based on processed alleles
Description
Processes VCF data to extract allele frequencies of the alternative allele/s using one of three methods:
- AF field: Direct frequency values from VCF
- AD field: Calculated from read depths (filtered by min_depth and max_depth)
- GT field: Calculated from genotype calls
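The AD and GT calculations can be sketched for the biallelic case as follows. Both helpers are hypothetical: applying the depth filter to the total depth, and returning `missing` for filtered or uncalled samples, are assumptions rather than confirmed package behaviour. The sketch also does not handle a "." placeholder in the AD field.

```julia
# Hypothetical helper: alternative-allele frequency from "ref,alt" read depths.
function freqfromdepth(ad::AbstractString; min_depth::Int = 10, max_depth::Int = 100)
    depths = parse.(Float64, split(ad, ","))
    total = sum(depths)
    # Assumed filter: discard samples whose total depth falls outside the bounds.
    (total < min_depth || total > max_depth) && return missing
    return depths[2] / total
end

# Hypothetical helper: alternative-allele frequency from a genotype call such as "0/1".
function freqfromgenotype(gt::AbstractString)
    calls = split(gt, r"[/|]")         # unphased "/" or phased "|" separators
    any(==("."), calls) && return missing
    return count(==("1"), calls) / length(calls)
end

freqfromdepth("12,4")     # 0.25
freqfromgenotype("0/1")   # 0.5
```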
Updates Genomes object with:
- Loci-allele identifiers (chromosome, position, alleles)
- Allele frequencies for each sample
Throws
- ArgumentError: If field parameter is not "AF", "AD", or "GT"
- ErrorException: If unable to parse AF or AD fields from VCF
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(n=10, verbose=false);
julia> fname = writevcf(genomes);
julia> _, _, n_alt_alleles = vcfcountlocialleles(fname);
julia> entries, format_lines = vcfextractentriesandformats(fname);
julia> field, n_alleles, _ = vcfextractinfo(fname, format_lines=format_lines);
julia> genomes_instantiated = vcfinstantiateoutput(fname, entries=entries, n_alt_alleles=n_alt_alleles);
julia> sum(ismissing.(genomes_instantiated.allele_frequencies[:, 1])) == length(entries)
true
julia> file = open(fname, "r"); line::Vector{String} = split([readline(file) for i in 1:10][end], "\t"); close(file);
julia> vcfextractallelefreqs!(genomes_instantiated, nothing, [0], fname=fname, line=line, line_counter=10, field=field);
julia> sum(ismissing.(genomes_instantiated.allele_frequencies[:, 1])) == 0
true
julia> rm(fname);
GenomicBreedingIO.vcfextractentriesandformats
— Method
vcfextractentriesandformats(fname::String; verbose::Bool = false)::Tuple{Vector{String},Vector{String}}
Extract sample entries and format definitions from a VCF file.
Arguments
- fname::String: Path to the VCF file (can be gzipped with extensions .vcf.gz or .vcf.bgz)
- verbose::Bool=false: If true, prints progress information to stdout
Returns
A tuple containing:
- Vector{String}: List of sample names from the VCF header
- Vector{String}: List of FORMAT field definitions from the VCF metadata
Description
Reads a VCF file and extracts two key pieces of information:
- Sample names from the header line (columns after FORMAT field)
- FORMAT field definitions from metadata lines starting with "##FORMAT"
The function validates the presence and correct order of standard VCF columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, and FORMAT
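The header validation described above might be sketched like this. The helper `samplenames` is hypothetical, and its error messages are illustrative only.

```julia
# Hypothetical helper: validate the fixed VCF columns and return the sample names.
function samplenames(header::AbstractString)
    expected = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]
    cols = String.(split(lstrip(header, '#'), "\t"))
    length(cols) >= length(expected) || throw(ArgumentError("too few columns"))
    cols[1:9] == expected || throw(ArgumentError("unexpected column names or order"))
    return cols[10:end]  # everything after FORMAT is a sample name
end

samplenames("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tentry_01\tentry_02")
# ["entry_01", "entry_02"]
```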
Throws
- ArgumentError: If VCF has fewer than expected columns or column names don't match VCF format
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(n=10, verbose=false);
julia> fname = writevcf(genomes);
julia> entries, format_lines = vcfextractentriesandformats(fname);
julia> entries
10-element Vector{String}:
"entry_01"
"entry_02"
"entry_03"
"entry_04"
"entry_05"
"entry_06"
"entry_07"
"entry_08"
"entry_09"
"entry_10"
julia> format_lines
3-element Vector{String}:
"##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
"##FORMAT=<ID=AD,Number=2,Type=Float,Description=\"Allele Depth\">"
"##FORMAT=<ID=AF,Number=2,Type=Float,Description=\"Allele Frequency\">"
julia> rm(fname);
GenomicBreedingIO.vcfextractinfo
— Method
vcfextractinfo(fname::String; format_lines::Vector{String}, field::String="any", verbose::Bool=false)::Tuple{String,Int64,Int64}
Extract information about genotype fields from a VCF file.
Arguments
- fname::String: Path to the VCF file (can be gzipped)
- format_lines::Vector{String}: Vector containing FORMAT lines from the VCF header
- field::String="any": Specific field to extract ("GT", "AD", "AF", or "any")
- verbose::Bool=false: If true, prints progress information
Returns
A tuple containing:
- field::String: The identified genotype field
- n_alleles::Int64: Maximum number of alleles per locus
- ploidy::Int64: Ploidy level (only meaningful for GT field; set to typemax(Int64) for AD and AF fields)
Details
- If field is "any", searches for fields in priority order: AF > AD > GT
- For GT field, scans entire file to determine maximum number of alleles and ploidy
- For AF and AD fields, extracts allele count from format header
- Supports both gzipped (.gz, .bgz) and uncompressed VCF files
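Reading the allele count from a FORMAT definition can be sketched with a regular expression. The helper `formatnumber` is hypothetical and assumes a numeric `Number=` value, as in the header lines shown in the vcfextractentriesandformats example; symbolic values such as `A` or `R` would raise an error in this sketch.

```julia
# Hypothetical helper: read the numeric `Number=` value of one FORMAT definition line.
function formatnumber(format_line::AbstractString)::Int
    m = match(r"Number=(\d+)", format_line)
    m === nothing && error("cannot parse the number of alleles from: " * format_line)
    return parse(Int, m.captures[1])
end

formatnumber("##FORMAT=<ID=AF,Number=2,Type=Float,Description=\"Allele Frequency\">")  # 2
```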
Throws
- ArgumentError: If specified field is not found in the VCF file
- ErrorException: If unable to parse number of alleles from format header
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(n=10, verbose=false);
julia> fname = writevcf(genomes);
julia> _, format_lines = vcfextractentriesandformats(fname);
julia> field, n_alleles, ploidy = vcfextractinfo(fname, format_lines=format_lines);
julia> (field == "AF") && (n_alleles == 2) && (ploidy == typemax(Int64))
true
julia> rm(fname);
GenomicBreedingIO.vcfinstantiateoutput
— Method
vcfinstantiateoutput(fname::String; entries::Vector{String}, n_alt_alleles::Int64, verbose::Bool = false)::Genomes
Create and initialize a Genomes struct from VCF file parameters.
Arguments
- fname::String: Name of the VCF file being processed
- entries::Vector{String}: Vector containing entry identifiers
- n_alt_alleles::Int64: Total number of alternative alleles across all loci
- verbose::Bool=false: If true, prints progress information
Returns
- Genomes: An initialized Genomes struct with:
  - dimensions n × p where n is the number of entries and p = n_alt_alleles
  - entry names assigned from input entries
  - populations set to "unknown"
  - mask set to true
Throws
- ErrorException: If duplicate entries are found in the VCF file
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(n=10, verbose=false);
julia> fname = writevcf(genomes);
julia> entries, format_lines = vcfextractentriesandformats(fname);
julia> _, _, n_alt_alleles = vcfcountlocialleles(fname);
julia> genomes_instantiated = vcfinstantiateoutput(fname, entries=entries, n_alt_alleles=n_alt_alleles);
julia> size(genomes_instantiated.allele_frequencies)
(10, 10000)
julia> genomes_instantiated.entries == entries
true
julia> rm(fname);
GenomicBreedingIO.vcfparsecoordinates
— Method
vcfparsecoordinates(; line::Vector{String}, line_counter::Int64, field::String)::Union{Nothing,Tuple{Int64,String,Int64,Vector{String}}}
Parse coordinates and allele information from a VCF file line.
Arguments
- line::Vector{String}: A vector containing the split line from VCF file
- line_counter::Int64: Current line number being processed in the VCF file
- field::String: The field name to extract allele frequencies from
Returns
- Nothing: If the specified field is not found in the line
- Tuple{Int64,String,Int64,Vector{String}}: A tuple containing:
  - Field index
  - Chromosome name
  - Position
  - Combined reference and alternative alleles
Throws
- ErrorException: If the position field cannot be parsed as an integer
Note
The function validates the line format and extracts genomic coordinates and allele information from a VCF file line. It handles missing alternative alleles (denoted by ".") and performs necessary type conversions.
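The coordinate extraction can be sketched on an already split VCF line. The helper `parsecoords` is hypothetical; it omits the field-index lookup and assumes the standard VCF column order (CHROM, POS, ID, REF, ALT).

```julia
# Hypothetical helper: pull chromosome, position, and alleles out of a split VCF line.
function parsecoords(line::Vector{String}, line_counter::Int)
    chrom = line[1]
    pos = tryparse(Int, line[2])
    pos === nothing && error("cannot parse position on line $line_counter")
    # A "." in the ALT column means no alternative allele was called.
    alts = line[5] == "." ? String[] : String.(split(line[5], ","))
    return chrom, pos, vcat(line[4], alts)
end

parsecoords(["chr1", "100", ".", "A", "T,G", ".", ".", ".", "GT", "0/1"], 10)
# ("chr1", 100, ["A", "T", "G"])
```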
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(n=10, verbose=false);
julia> fname = writevcf(genomes);
julia> entries, format_lines = vcfextractentriesandformats(fname);
julia> field, _, _ = vcfextractinfo(fname, format_lines=format_lines);
julia> file = open(fname, "r"); line::Vector{String} = split([readline(file) for i in 1:10][end], "\t"); close(file);
julia> idx_field, chrom, pos, refalts = vcfparsecoordinates(line=line, line_counter=10, field=field);
julia> (idx_field == 3) && (chrom == line[1]) && (pos == parse(Int64, line[2])) && (refalts == line[4:5])
true
julia> rm(fname);
GenomicBreedingIO.writedelimited
— Method
writedelimited(
    genomes::Genomes;
    fname::Union{Missing,String} = missing,
    sep::String = "\t",
    include_population_header::Bool = true
)::String
Write genomic data to a delimited text file.
Arguments
- genomes::Genomes: A Genomes struct containing the genomic data to be written
- fname::Union{Missing,String}: Output filename. If missing, generates an automatic filename with timestamp
- sep::String: Delimiter character for the output file (default: tab)
- include_population_header::Bool: Whether to include population information in the header (default: true)
Returns
- String: Path to the created output file
File Format
The output file contains:
- Header lines (prefixed with '#'):
- First line: chromosome, position, alleles, and entry information
- Second line (optional): population information
- Data rows with the following columns:
- Column 1: Chromosome identifier
- Column 2: Position
- Column 3: All alleles at the locus (pipe-separated)
- Column 4: Specific allele
- Remaining columns: Frequency data for each entry
Supported File Extensions
- '.tsv' (tab-separated, default)
- '.csv' (comma-separated)
- '.txt' (custom delimiter)
Throws
- DimensionMismatch: If the input Genomes struct is corrupted
- ErrorException: If the output file already exists
- ArgumentError: If the file extension is invalid or the output directory doesn't exist
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(n=2, verbose=false);
julia> writedelimited(genomes, fname="test_genomes.tsv")
"test_genomes.tsv"
GenomicBreedingIO.writedelimited
— Method
writedelimited(phenomes::Phenomes; fname::Union{Missing,String} = missing, sep::String = "\t")::String
Write phenotypic data from a Phenomes struct to a delimited text file.
Arguments
- phenomes::Phenomes: A Phenomes struct containing phenotypic data
- fname::Union{Missing,String} = missing: Output filename. If missing, generates an automatic filename with timestamp
- sep::String = "\t": Delimiter character for the output file
Returns
- String: The name of the created file
File Format
- Header line starts with '#' containing column names
- First column: Entry names
- Second column: Population names
- Remaining columns: Trait values
- Missing values are represented as "NA"
File Extensions
Supported file extensions:
- .tsv for tab-separated files (default)
- .csv for comma-separated files
- .txt for other delimiters
Throws
- DimensionMismatch: If the Phenomes struct dimensions are inconsistent
- ErrorException: If the output file already exists
- ArgumentError: If the file extension is invalid or the directory doesn't exist
Examples
julia> phenomes = Phenomes(n=2, t=2); phenomes.entries = ["entry_1", "entry_2"]; phenomes.traits = ["trait_1", "trait_2"];
julia> writedelimited(phenomes, fname="test_phenomes.tsv")
"test_phenomes.tsv"
GenomicBreedingIO.writedelimited
— Method
writedelimited(trials::Trials; fname::Union{Missing,String} = missing, sep::String = "\t",
    overwrite::Bool = false, verbose::Bool = false)::String
Write a Trials struct to a delimited text file, returning the filename.
Arguments
- trials::Trials: The trials data structure to be written
- fname::Union{Missing,String} = missing: Output filename. If missing, generates automatic filename with timestamp
- sep::String = "\t": Delimiter character between fields
- overwrite::Bool = false: Whether to overwrite existing output file if it exists
- verbose::Bool = false: Whether to show progress bar during writing
Returns
- String: The name of the file that was written
File Format
The output file contains one header line and one line per trial entry. Header line is prefixed with '#' and contains column names.
Fixed Columns (1-10)
- years
- seasons
- harvests
- sites
- entries
- populations
- replications
- blocks
- rows
- cols
Variable Columns (11+)
- Additional columns contain phenotype traits values
- Missing values are written as "NA"
Notes
- Supported file extensions: .tsv, .csv, or .txt
- File extension is automatically determined based on separator if filename is missing:
  - \t → .tsv
  - , or ; → .csv
  - other → .txt
- Will throw error if file exists and overwrite=false
- Directory must exist if path is specified in filename
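The separator-to-extension mapping in the notes can be written as a small lookup. The helper `extensionfor` is hypothetical and only restates the documented rule.

```julia
# Hypothetical helper: choose a file extension from the separator.
function extensionfor(sep::String)::String
    sep == "\t" && return ".tsv"
    sep in (",", ";") && return ".csv"
    return ".txt"
end

extensionfor("\t")  # ".tsv"
extensionfor(";")   # ".csv"
extensionfor("|")   # ".txt"
```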
Examples
julia> trials = Trials(n=1, t=2); trials.years = ["year_1"]; trials.seasons = ["season_1"]; trials.harvests = ["harvest_1"]; trials.sites = ["site_1"]; trials.entries = ["entry_1"]; trials.populations = ["population_1"]; trials.replications = ["replication_1"]; trials.blocks = ["block_1"]; trials.rows = ["row_1"]; trials.cols = ["col_1"]; trials.traits = ["trait_1", "trait_2"];
julia> writedelimited(trials, fname="test_trials.tsv")
"test_trials.tsv"
GenomicBreedingIO.writejld2
— Method
writejld2(A::Union{Genomes,Phenomes,Trials,SimulatedEffects,TEBV}; fname::Union{Missing,String} = missing, overwrite::Bool=false)::String
Save genomic breeding core data structures to a JLD2 file (HDF5-compatible format).
Arguments
- A: A genomic breeding data structure (Genomes, Phenomes, Trials, SimulatedEffects, or TEBV)
- fname: Optional. Output filename. If missing, generates an automatic name with timestamp
- overwrite: Optional. If true, overwrites existing file with same name. Default is false
Returns
- String: Path to the saved JLD2 file
File Naming
- If fname is not provided, generates name: "output-[Type]-[Timestamp].jld2"
- If fname is provided, must have ".jld2" extension
Throws
- DimensionMismatch: If input structure has invalid dimensions
- ErrorException: If output file already exists and overwrite=false
- ArgumentError: If invalid file extension or directory path
Notes
- Files are saved with compression enabled
- Data is stored as a Dictionary with single key-value pair
- Key is the string representation of the input type
Examples
julia> genomes = GenomicBreedingCore.simulategenomes(n=2, verbose=false);
julia> writejld2(genomes, fname="test_genomes.jld2")
"test_genomes.jld2"
julia> genomes_reloaded = load("test_genomes.jld2");
julia> genomes_reloaded[collect(keys(genomes_reloaded))[1]] == genomes
true
julia> phenomes = Phenomes(n=2, t=2); phenomes.entries = ["entry_1", "entry_2"]; phenomes.traits = ["trait_1", "trait_2"];
julia> writejld2(phenomes, fname="test_phenomes.jld2")
"test_phenomes.jld2"
julia> phenomes_reloaded = load("test_phenomes.jld2");
julia> phenomes_reloaded[collect(keys(phenomes_reloaded))[1]] == phenomes
true
julia> trials, _ = simulatetrials(genomes=genomes, verbose=false);
julia> writejld2(trials, fname="test_trials.jld2")
"test_trials.jld2"
julia> trials_reloaded = load("test_trials.jld2");
julia> trials_reloaded[collect(keys(trials_reloaded))[1]] == trials
true
julia> simulated_effects = SimulatedEffects();
julia> writejld2(simulated_effects, fname="test_simulated_effects.jld2")
"test_simulated_effects.jld2"
julia> simulated_effects_reloaded = load("test_simulated_effects.jld2");
julia> simulated_effects_reloaded[collect(keys(simulated_effects_reloaded))[1]] == simulated_effects
true
julia> trials, _simulated_effects = GenomicBreedingCore.simulatetrials(genomes = GenomicBreedingCore.simulategenomes(n=10, verbose=false), n_years=1, n_seasons=1, n_harvests=1, n_sites=1, n_replications=10, verbose=false);
julia> tebv = analyse(trials, max_levels=50, verbose=false);
julia> writejld2(tebv, fname="test_tebv.jld2")
"test_tebv.jld2"
julia> tebv_reloaded = load("test_tebv.jld2");
julia> tebv_reloaded[collect(keys(tebv_reloaded))[1]] == tebv
true
GenomicBreedingIO.writevcf
— Method
writevcf(genomes::Genomes; fname::Union{Missing,String} = missing, ploidy::Int64 = 0,
    max_depth::Int64 = 100, n_decimal_places::Int64 = 4, gzip::Bool = false)::String
Write genomic data to a Variant Call Format (VCF) file.
Arguments
- genomes::Genomes: A Genomes object containing the genetic data to be written.
- fname::Union{Missing,String} = missing: Output filename. If missing, generates a default name with timestamp.
- ploidy::Int64 = 0: The ploidy level of the organisms (e.g., 2 for diploid).
- max_depth::Int64 = 100: Maximum depth for allele depth calculation.
- n_decimal_places::Int64 = 4: Number of decimal places for rounding allele frequencies.
- gzip::Bool = false: Whether to compress the output file using gzip.
Returns
- String: The name of the created VCF file.
Description
Creates a VCF v4.2 format file containing genomic variants data. The function processes allele frequencies and depths, calculates genotypes based on ploidy, and formats the data according to VCF specifications. The output includes:
- Standard VCF header information
- Sample information with FORMAT fields:
- GT (Genotype)
- AD (Allele Depth)
- AF (Allele Frequency)
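How a genotype call could be derived from an allele frequency and a ploidy level is sketched below for the biallelic case. The helper `genotypecall` is hypothetical, and the rounding rule and allele ordering are assumptions; writevcf may compute the GT field differently.

```julia
# Hypothetical helper: build a GT string from an alternative-allele frequency.
function genotypecall(alt_freq::Float64, ploidy::Int)::String
    n_alt = round(Int, alt_freq * ploidy)  # assumed: nearest whole number of alt copies
    return join(vcat(fill("0", ploidy - n_alt), fill("1", n_alt)), "/")
end

genotypecall(0.5, 2)   # "0/1"
genotypecall(0.75, 4)  # "0/1/1/1"
```

This is why the tetraploid example below rounds frequencies to multiples of 1/4 before writing.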
Throws
- DimensionMismatch: If the input Genomes object has inconsistent dimensions
- ErrorException: If the output file already exists
- ArgumentError: If the file extension is not '.vcf' or if the specified directory doesn't exist
Examples
julia> genomes_1 = GenomicBreedingCore.simulategenomes(n=2, verbose=false);
julia> writevcf(genomes_1, fname="test_genomes_1.vcf")
"test_genomes_1.vcf"
julia> genomes_2 = GenomicBreedingCore.simulategenomes(n=2, n_alleles=3, verbose=false);
julia> genomes_2.allele_frequencies = round.(genomes_2.allele_frequencies .* 4) ./ 4;
julia> writevcf(genomes_2, fname="test_genomes_2.vcf", ploidy=4)
"test_genomes_2.vcf"
julia> genomes_3 = GenomicBreedingCore.simulategenomes(n=3, verbose=false);
julia> writevcf(genomes_3, fname="test_genomes_3.vcf", gzip=true)
"test_genomes_3.vcf.gz"