API reference¶
Summary¶
Defining a Dinf model¶
|
A container that describes the components of a Dinf model. |
Specifying parameters¶
|
An ordered collection of inferrable parameters. |
|
An inferrable parameter with |
Feature extraction¶
|
A factory for feature matrices consisting of haplotypes and relative positions. |
|
A factory for labelled collections of |
|
A factory for feature matrices of pseudo-haplotypes. |
|
A factory for labelled collections of |
|
A collection of indexed VCF or BCF files. |
Training¶
|
Train a discriminator network. |
Inference¶
|
Sample features and make predictions using the discriminator. |
|
Adversarial Monte Carlo. |
|
Run the MCMC GAN. |
|
PG-GAN style simulated annealing. |
Classification¶
|
Wrapper for the discriminator neural network that classifies genotype matrices. |
Discriminator networks¶
|
An exchangeable CNN for the discriminator. |
|
An exchangeable CNN for the discriminator using the PG-GAN architecture. |
|
Network layer that summarises over a given axis in a way that is invariant to permutations of the input along that axis. |
Miscellaneous¶
|
Save discriminator predictions and parameter values to a .npz file. |
|
Load descriminator predictions and parameter values from a .npz file. |
|
Get the individuals corresponding to the tree sequence's samples. |
|
Get sample IDs for 1000 Genomes Project populations. |
|
Load contig lengths from a whitespace-separated file like an fai (fasta index). |
|
A sequence of folders with names "0", "1", ... |
Plotting¶
|
Plot a feature matrix as a heatmap. |
|
Plot multiple feature matrices as heatmaps. |
|
Plot training metrics from a discriminator neural network. |
|
Plot a histogram. |
|
Plot a 2d histogram. |
Reference¶
Type aliases¶
Defining a Dinf model¶
- class dinf.DinfModel(parameters, generator_func, target_func, discriminator_network=None)[source]¶
A container that describes the components of a Dinf model.
Constructing a Dinf model requires specifying:
the inferrable
parameters
,a
generator_func
function that accepts concrete parameter values, produces data under some simulation model, and returns a feature matrix (or matrices), anda
target_func
function that samples from the target dataset, and returns a feature matrix (or matrices).
import dinf import msprime parameters = dinf.Parameters(...) features = dinf.BinnedHaplotypeMatrix(...) vcfs = dinf.BagOfVcf(...) def generator(seed, ...): ... # E.g. simulate using msprime to get a succinct tree sequence (ts). ts = msprime.sim_ancestry(...) mts = msprime.sim_mutations(ts, ...) return features.from_ts(mts) def target(seed): ... return features.from_vcf(vcfs, ...) dinf_model = dinf.DinfModel( parameters=parameters, generator_func=generator, target_func=target, )
- Parameters:
parameters (Parameters) – A collection of inferrable parameters, one for each keyword argument provided to the
generator_func
.generator_func (Callable[..., FeatureCollection]) –
A function that returns features for concrete parameter values.
The first (positional) argument to the function is a
seed
, that may be used to seed a random number generator. The subsequent (keyword) arguments correspond to concrete values for theparameters
. The return type is either:a feature matrix, i.e. an n-dimensional numpy array, or
multiple feature matrices, i.e. a dictionary of n-dimensional numpy arrays, where the dictionary keys are arbitrary labels such as population names.
The signature of the function will follow the patterns below, where
p0
,p1
, etc. are the names of the parameters inparameters
. Type annotations are provided here for clarity, but are not required in user-defined functions. In function signatures, positional-only parameters preceed a/
and keyword-only parameters follow a*
.import dinf import numpy as np parameters = dinf.Parameters( p0=dinf.Param(...), p1=dinf.Param(...), ... ) # Signature for returning a single feature matrix. def generator_func1( seed: np.random.SeedSequence, /, *, p0: float, p1: float, ... ) -> np.ndarray: ... # Signature for returning multiple feature matrices. def generator_func2( seed: np.random.SeedSequence, /, *, p0: float, p1: float, ... ) -> dict[str, np.ndarray]: ... # For generator functions accepting large numbers of parameters, # the following pattern using ``**kwargs`` may be preferred. def generator_func3( seed: np.random.SeedSequence, /, **kwargs: float )-> dict[str, np.ndarray]: assert kwargs.keys() == parameters.keys() # do something with p0 do_something(kwargs["p0"]) # do something with p1 do_something_else(kwargs["p1"]) ...
target_func (Callable[[numpy.random.SeedSequence], FeatureCollection] | None) –
A function that returns features sampled from the target dataset.
If
None
(which must be specified explicitly), thegenerator_func
will be used to simulate the target dataset using each parameter’struth
value.The function takes a single (positional) argument, a
seed
, that may be used to seed a random number generator. The return type must match the return type ofgenerator_func
.discriminator_network (nn.Module | None) – A flax neural network. If not specified,
ExchangeableCNN
will be used.
- parameters: Parameters¶
The inferrable parameters.
- generator_func: Callable[..., FeatureCollection]¶
Function that simulates features using concrete parameter values.
- generator_func_v: Callable[[Tuple[numpy.random.SeedSequence, Iterable[float]]], FeatureCollection]¶
Wrapper for
generator_func
that accepts a single argument containing the seed and an iterable of parameter values (as opposed to keyword arguments).
- target_func: Callable[[numpy.random.SeedSequence], FeatureCollection] | None¶
Function that samples features from the target distribution.
- filename: pathlib.Path | None = None¶
Path to the file from which the model was loaded (if any). May be
None
.
- check(seed=None)[source]¶
Basic health checks: draw parameters and call the functions.
We don’t do this when initialising the DinfModel, because calling
generator_func()
andtarget_func()
are potentially time consuming, which could lead to annoying delays for the command line interface.
- static from_file(filename)[source]¶
Load the symbol “dinf_model” from a file.
https://docs.python.org/3/library/importlib.html#importing-a-source-file-directly
- Parameters:
filename (str | pathlib.Path) –
- Return type:
Specifying parameters¶
- class dinf.Parameters(**kwargs)[source]¶
An ordered collection of inferrable parameters.
A collection of parameters is built by passing
Param
objects to the constructor by keyword. The keyword will be used as thename
for the given parameter. This class implements Python’s mapping protocol.import dinf # Parameters named "a", "b", and "c". parameters = dinf.Parameters( a=dinf.Param(low=0, high=1), b=dinf.Param(low=5, high=100), c=dinf.Param(low=1e-6, high=1e-3), ) # Iterate over parameters. assert len(parameters) == 3 assert list(parameters) == ["a", "b", "c"] assert [p.high for p in parameters.values()] == [1, 100, 1e-3] # Lookup a parameter by name. a = parameters["a"] assert a.low == 0 assert a.high == 1 assert a.name == "a"
- sample_prior(*, size, rng)[source]¶
Get a random sample of parameter values from their prior distributions.
- Parameters:
size (int) – The sample size.
rng (numpy.random.Generator) – The numpy random number generator.
- Returns:
A 2d numpy array of parameter values, where ret[j][k] is the j’th draw for the k’th parameter.
- Return type:
- sample_kde(thetas, /, *, probs, size, rng, shrinkage=True)[source]¶
Sample from a smoothed set of weighted observations.
Samples are drawn from
thetas
, weighted by their probability. New points are drawn within a neighbourhood of the sampled thetas using a mulivariate normal whose covariance is calculated from the thetas. This is equivalent to sampling from a Gaussian KDE, but avoids doing an explicit density estimation. Values are sampled until they are within the parameter bounds. Scott’s rule of thumb is used for bandwidth selection.- Parameters:
thetas (np.ndarray) – Parameter values to sample from.
probs (np.ndarray | None) – Discriminator predictions corresponding to the
thetas
.size (int) – Number of samples to draw.
rng (numpy.random.Generator) – Numpy random generator.
shinkage – If True, shrink the thetas towards their mean. This corrects for variance inflation of the distribution due to the Gaussian sampling. See West 1993, https://doi.org/10.1111/j.2517-6161.1993.tb01911.x
shrinkage (bool) –
- Returns:
The sampled values.
- Return type:
np.ndarray
- sample_ball(theta, /, *, size, cov=None, cov_factor=1.0, rng)[source]¶
Sample a ball around the given
theta
value.The size of the ball is controlled by the covariance matrix,
cov
, which is multiplied bycov_factor
.Warning
Samples are not gauranteed to be within the parameter bounds.
- Parameters:
theta (np.ndarray) – Parameter value around which the ball is centered.
size (int) – Number of samples to draw.
cov (np.ndarray | None) – The covariance matrix. If not specified, a diagonal matrix will be used that has entry cov[j][j] set to the squared range of parameter j: (high - low)**2.
cov_factor (float) – Left-multiply the covariance matrix by this value.
rng (numpy.random.Generator) – Numpy random generator.
- Returns:
The sampled values.
- transform(thetas, /)[source]¶
Transform values bounded by [param.low, param.high] to [-inf, inf].
See
itransform()
for the inverse transformation.
- itransform(thetas, /)[source]¶
Transform values on [-inf, inf] to be bounded by [param.low, param.high].
Performs the inverse of
transform()
.
- reflect(thetas, /)[source]¶
Reflect values that are out of bounds by the amount they are out.
As reflecting does not gaurantee values will be within the bounds, values are first truncated to (2*low - high, 2*high - low), then reflected. For example, with bounds low=0, high=10, a value of -11 will be truncated to -10, then reflected to attain a final value of 10.
- static geometric_median(thetas, /, *, probs=None)[source]¶
Get the multivariate median of a weighted sample.
- Parameters:
thetas (np.ndarray) – Parameter values. thetas[j][k] is the value of the k’th parameter for the j’th multivariate sample.
probs (np.ndarray | None) – Discriminator predictions corresponding to the
thetas
.
- Returns:
Median position in multivariate space.
- Return type:
np.ndarray
- class dinf.Param(low, high, truth=None, name=None)[source]¶
An inferrable parameter with
Uniform(low, high)
prior distribution.- Parameters:
low (float) – The lower bound for the parameter.
high (float) – The uppper bound for the parameter.
truth (float | None) – The true value of the parameter. Set to None if not known (e.g. for parameters corresponding to an empirical dataset). If the value is not None, this value may be used as the truth in a simulation study.
name (str | None) –
- sample_prior(*, size, rng)[source]¶
Get a random sample from the
Uniform(low, high)
prior distribution.- Parameters:
size (int) – The sample size.
rng (numpy.random.Generator) – The numpy random number generator.
- Returns:
A numpy array of parameter values.
- Return type:
- transform(x, /)[source]¶
Transform bounded values on [low, high] to unbounded values on [-inf, inf].
Performs a logit transformation.
- itransform(x, /)[source]¶
Transform unbounded values on [-inf, inf] to bounded values on [low, high].
Performs an inverse logit transformation (aka expit, aka sigmoid).
- reflect(x, /)[source]¶
Reflect values that are out of bounds by the amount they are out.
As reflecting does not gaurantee values will be within the bounds, values are first truncated to (2*low - high, 2*high - low), then reflected. For example, with bounds low=0, high=10, a value of -11 will be truncated to -10, then reflected to attain a final value of 10.
Feature extraction¶
- class dinf.HaplotypeMatrix(*, num_individuals, num_loci, ploidy, phased, maf_thresh=None)[source]¶
A factory for feature matrices consisting of haplotypes and relative positions.
The feature is an \(n \times m \times c\) array, where the channel dimension \(c\) is 2. The first channel is a haplotype matrix and the second channel is a matrix of relative SNP positions.
The haplotype matrix is an \(n \times m\) matrix, where \(n\) corresponds to the number of haplotypes (or number of individuals, for unphased data) and \(m\) corresponds to the number of SNPs along the sequence. For phased data, each entry is a 0 or 1 corresponding to the major or minor allele respectively. For unphased data, each entry is the count of minor alleles across all chromosomes in the individual. Only polymorphic SNPs are considered, and for multiallelic sites, only the first two alleles are used. Alleles are polarised by choosing the most frequent allele to be encoded as 0 (the major allele), and the second most frequent allele as 1 (the minor allele).
The position matrix is an \(n \times m\) matrix, where the vector of \(m\) inter-SNP distances are repeated \(n\) times—once for each haplotype (or each individual, for unphased data). Each entry is the distance from the previous SNP (as a proportion of the sequence length). The first inter-SNP distance in the vector is always zero.
Chan et al. 2018, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7687905/Wang et al. 2021, https://doi.org/10.1111/1755-0998.13386- Parameters:
num_individuals (int) – The number of individuals to include in the feature matrix.
num_loci (int) – The number of SNP sites to extract. The central
num_loci
SNPs in the sequence will be used. If there are fewer thannum_loci
SNPs, the feature matrix will be padded on both sides with zeros.ploidy (int) – Ploidy of the individuals.
maf_thresh (float | None) – Minor allele frequency (MAF) threshold. Sites with MAF lower than this value are ignored. If None, only invariant sites will be excluded.
phased (bool) – If True, the individuals’ haplotypes will each be included as independent rows in the feature matrix and the shape of the feature matrix will be
(ploidy * num_individuals, num_loci, c)
. If False, the allele counts for each individual will be summed across their chromosome copies and the shape of the feature matrix will be(num_individuals, num_loci, c)
.
- from_ts(ts)¶
Create a feature matrix from a tskit tree sequence.
- Parameters:
ts (TreeSequence) – The tree sequence.
- Returns:
Array with shape
(num_pseudo_haplotypes, num_loci, c)
.- Return type:
- from_vcf(vb, *, sequence_length, max_missing_genotypes, min_seg_sites, rng)¶
Create a feature matrix from a region of a VCF/BCF.
The genomic window for a feature matrix is drawn uniformly at random from the contigs defined in the given
BagOfVcf
,vb
.Individuals in the VCFs are sampled (without replacement) for inclusion in the output matrix. The size of the feature space can therefore be vastly increased by having more individuals in the VCFs than are needed for the feature dimensions.
- Parameters:
vb (BagOfVcf) – The BagOfVcf object that describes the VCF/BCF files.
sequence_length (int) – Length of the genomic window to be sampled.
max_missing_genotypes (int) – Consider only sites with at most this many missing genotype calls.
min_seg_sites (int) – Sampled genotype matrix must have at least this many variable sites (after filtering sites for missingness).
rng (numpy.random.Generator) – Numpy random number generator.
- Returns:
Array with shape
(num_pseudo_haplotypes, num_loci, c)
.- Return type:
- class dinf.MultipleHaplotypeMatrices(*, num_individuals, num_loci, ploidy, global_maf_thresh=None, global_phased)[source]¶
A factory for labelled collections of
HaplotypeMatrix
objects.One feature matrix is produced for each label. Labels correspond to collections of individuals that will be treated as exchangeable (e.g. populations).
- Parameters:
num_individuals (Mapping[str, int]) – A dict that maps labels to the number of individuals in the feature matrix.
num_loci (Mapping[str, int]) – A dict that maps labels to the number of feature loci to be extracted from the sequence.
ploidy (Mapping[str, int]) – A dict that maps labels to the ploidy of the individuals.
global_maf_thresh (float | None) – Minor allele frequency (MAF) threshold. Sites with MAF lower than this value are ignored. MAF is calculated across all individuals. If None, only invariant sites will be excluded.
global_phased (bool) – If True, the individuals’ haplotypes will each be included as independent rows in each feature matrix and the shape of the feature matrix for label
l
will be(ploidy[l] * num_individuals[l], num_loci[l], c)
. If False, the allele counts for each individual will be summed across their chromosome copies and the shape of the feature matrix will be(num_individuals[l], num_loci[l], c)
.
- from_ts(ts, *, individuals)¶
Create pseudo-genotype matrices from a tree sequence.
- Parameters:
ts (TreeSequence) – The tree sequence.
rng (numpy.random.Generator) – Numpy random number generator.
individuals (Mapping[str, ndarray[Any, dtype[integer]]]) – A mapping from label to an array of individuals.
- Returns:
A dictionary mapping a label
l
to a feature array. Each array has shape(num_pseudo_haplotypes[l], num_loci[l], c)
. For an array \(M\), the \(M[i][j][0]\)’th entry is the count of minor alleles in the \(j\)’th bin of psdeudo-haplotype \(i\).- Return type:
- from_vcf(vb, *, sequence_length, max_missing_genotypes, min_seg_sites, rng)¶
Create pseudo-genotype matrices from a region of a VCF/BCF.
The genomic window is drawn uniformly at random from the sequences defined in the given
BagOfVcf
.Individuals in the VCFs are sampled (without replacement) for inclusion in the output matrix. The size of the feature space can therefore be vastly increased by having more individuals in the VCFs than are needed for the feature dimensions.
- Parameters:
vb (BagOfVcf) – A collection of indexed VCF/BCF files.
sequence_length (int) – Length of the genomic window to be sampled.
max_missing_genotypes (int) – Consider only sites with fewer missing genotype calls than this number.
min_seg_sites (int) – Sampled genotype matrix must have at least this many variable sites (after filtering sites for missingness).
rng (numpy.random.Generator) – Numpy random number generator.
- Returns:
A dictionary mapping a label
l
to a feature array. Each array has shape(num_pseudo_haplotypes[l], num_loci[l], c)
. For an array \(M\), the \(M[i][j][0]\)’th entry is the count of minor alleles in the \(j\)’th bin of psdeudo-haplotype \(i\).- Return type:
- class dinf.BinnedHaplotypeMatrix(*, num_individuals, num_loci, ploidy, phased, maf_thresh=None)[source]¶
A factory for feature matrices of pseudo-haplotypes.
The binned haplotype matrix is an \(n \times m\) matrix, where \(n\) corresponds to the number of haplotypes (or polyploid genotypes, for unphased data) and \(m\) corresponds to a set of equally sized bins along the sequence length. Each matrix entry contains the count of minor alleles in an individual’s haplotype (or polyploid genotype) in a given bin.
Only polymorphic SNPs are considered, and for multiallelic sites, only the first two alleles are used. Alleles are polarised by choosing the most frequent allele to be encoded as 0, and the second most frequent allele as 1.
As the features are intended to be passed to a covolutional neural network, the output dimensions are actually \(n \times m \times 1\), where the final dimension is the (unused) “channels” dimension for the convolution.
Gower et al. 2021, https://doi.org/10.7554/eLife.64669
- Parameters:
num_individuals (int) – The number of individuals to include in the feature matrix.
num_loci (int) – The number of bins into which the sequence is partitioned. Each bin spans
sequence_length / num_loci
base pairs.ploidy (int) – Ploidy of the individuals.
phased (bool) – If True, the individuals’ haplotypes will each be included as independent rows in the feature matrix and the shape of the feature matrix will be
(ploidy * num_individuals, num_loci, 1)
. If False, the allele counts for each individual will be summed across their chromosome copies and the shape of the feature matrix will be(num_individuals, num_loci, 1)
.maf_thresh (float | None) – Minor allele frequency (MAF) threshold. Sites with MAF lower than this value are ignored. If None, only invariant sites will be excluded.
- from_ts(ts)¶
Create a feature matrix from a tskit tree sequence.
- Parameters:
ts (TreeSequence) – The tree sequence.
- Returns:
Array with shape
(num_pseudo_haplotypes, num_loci, c)
.- Return type:
- from_vcf(vb, *, sequence_length, max_missing_genotypes, min_seg_sites, rng)¶
Create a feature matrix from a region of a VCF/BCF.
The genomic window for a feature matrix is drawn uniformly at random from the contigs defined in the given
BagOfVcf
,vb
.Individuals in the VCFs are sampled (without replacement) for inclusion in the output matrix. The size of the feature space can therefore be vastly increased by having more individuals in the VCFs than are needed for the feature dimensions.
- Parameters:
vb (BagOfVcf) – The BagOfVcf object that describes the VCF/BCF files.
sequence_length (int) – Length of the genomic window to be sampled.
max_missing_genotypes (int) – Consider only sites with at most this many missing genotype calls.
min_seg_sites (int) – Sampled genotype matrix must have at least this many variable sites (after filtering sites for missingness).
rng (numpy.random.Generator) – Numpy random number generator.
- Returns:
Array with shape
(num_pseudo_haplotypes, num_loci, c)
.- Return type:
- class dinf.MultipleBinnedHaplotypeMatrices(*, num_individuals, num_loci, ploidy, global_maf_thresh=None, global_phased)[source]¶
A factory for labelled collections of
BinnedHaplotypeMatrix
objects.One feature matrix is produced for each label. Labels correspond to collections of individuals that will be treated as exchangeable (e.g. populations).
- Parameters:
num_individuals (Mapping[str, int]) – A dict that maps labels to the number of individuals in the feature matrix.
num_loci (Mapping[str, int]) – A dict that maps labels to the number of feature loci to be extracted from the sequence.
ploidy (Mapping[str, int]) – A dict that maps labels to the ploidy of the individuals.
global_maf_thresh (float | None) – Minor allele frequency (MAF) threshold. Sites with MAF lower than this value are ignored. MAF is calculated across all individuals. If None, only invariant sites will be excluded.
global_phased (bool) – If True, the individuals’ haplotypes will each be included as independent rows in each feature matrix and the shape of the feature matrix for label
l
will be(ploidy[l] * num_individuals[l], num_loci[l], c)
. If False, the allele counts for each individual will be summed across their chromosome copies and the shape of the feature matrix will be(num_individuals[l], num_loci[l], c)
.
- from_ts(ts, *, individuals)¶
Create pseudo-genotype matrices from a tree sequence.
- Parameters:
ts (TreeSequence) – The tree sequence.
rng (numpy.random.Generator) – Numpy random number generator.
individuals (Mapping[str, ndarray[Any, dtype[integer]]]) – A mapping from label to an array of individuals.
- Returns:
A dictionary mapping a label
l
to a feature array. Each array has shape(num_pseudo_haplotypes[l], num_loci[l], c)
. For an array \(M\), the \(M[i][j][0]\)’th entry is the count of minor alleles in the \(j\)’th bin of psdeudo-haplotype \(i\).- Return type:
- from_vcf(vb, *, sequence_length, max_missing_genotypes, min_seg_sites, rng)¶
Create pseudo-genotype matrices from a region of a VCF/BCF.
The genomic window is drawn uniformly at random from the sequences defined in the given
BagOfVcf
.Individuals in the VCFs are sampled (without replacement) for inclusion in the output matrix. The size of the feature space can therefore be vastly increased by having more individuals in the VCFs than are needed for the feature dimensions.
- Parameters:
vb (BagOfVcf) – A collection of indexed VCF/BCF files.
sequence_length (int) – Length of the genomic window to be sampled.
max_missing_genotypes (int) – Consider only sites with fewer missing genotype calls than this number.
min_seg_sites (int) – Sampled genotype matrix must have at least this many variable sites (after filtering sites for missingness).
rng (numpy.random.Generator) – Numpy random number generator.
- Returns:
A dictionary mapping a label
l
to a feature array. Each array has shape(num_pseudo_haplotypes[l], num_loci[l], c)
. For an array \(M\), the \(M[i][j][0]\)’th entry is the count of minor alleles in the \(j\)’th bin of psdeudo-haplotype \(i\).- Return type:
- class dinf.BagOfVcf(files, /, *, samples=None, contig_lengths=None)[source]¶
A collection of indexed VCF or BCF files.
VCF data are sometimes contained in a single file, and sometimes split into multiple files by chromosome. To remove the burden of dealing with both of these common cases, this class maps a contig ID to a
cyvcf2.VCF
object for that contig. The class implements Python’s mapping protocol, plus methods for sampling regions of the genome at random.import dinf vcfs = dinf.BagOfVcf(["chr1.bcf", "chr2.bcf", "chr3.bcf"]) # Iterate over contig IDs. assert len(vcfs) == 3 assert list(vcfs) == ["chr1", "chr2", "chr3"] # Lookup a cyvcf2.VCF object by contig ID. chr1 = vcfs["chr1"] assert chr1.fname == "chr1.bcf" first_variant = next(chr1) assert first_variant.CHROM == "chr1"
- Parameters:
files (Iterable[str | pathlib.Path]) – An iterable of filenames. Each file must be an indexed VCF or BCF, and must contain the FORMAT/GT field.
contig_lengths (collections.abc.Mapping[str, int] | None) – A dict mapping a contig name to contig length. Only the contigs in this dict will be used.
samples (collections.abc.Mapping[str, List[str]] | None) – A dictionary that maps a label to a list of individual names, where the individual names correspond to the VCF columns for which genotypes will be sampled.
- lengths: np.ndarray¶
Lengths of the contigs in the bag.
The order matches the contig order obtained when iterating.
- samples: collections.abc.Mapping[str, List[str]] | None¶
A dictionary that maps a label to a list of individual names.
The individual names correspond to the VCF columns for which genotypes will be sampled. This is a bookkeeping device that records which genotypes belong to which label (e.g. which population). If None, it is assumed that all individuals in the VCF will be treated as exchangeable.
- sample_regions(size, sequence_length, rng)[source]¶
Sample a list of (chrom, start, end) triplets.
- Parameters:
size (int) – Number of genomic windows to sample.
sequence_length (int) – Length of the sequence to sample.
rng (numpy.random.Generator) – The numpy random number generator.
- Returns:
List of 3-tuples: (chrom, start, end). The start and end coordinates are 1-based and inclusive, to match the usual convention for ‘chrom:start-end’, e.g. in bcftools.
- Return type:
- sample_genotype_matrix(*, sequence_length, min_seg_sites, max_missing_genotypes, require_phased, rng, retries=1000)[source]¶
Sample a genotype matrix uniformly at random from the genome.
- Parameters:
sequence_length (int) – Length of the sequence to sample.
max_missing_genotypes (int) – Consider only sites with at most this many missing genotype calls.
min_seg_sites (int) – Sampled genotype matrix must have at least this many variable sites (after filtering sites for missingness).
require_phased (bool) – If True, raise an error if genotypes are not phased.
rng (numpy.random.Generator) – Numpy random number generator.
retries (int) – Maximum number of attempts allowed to find a genomic window with the required number of segregating sites.
- Returns:
- A 2-tuple of (G, positions) where
G
is a (num_sites, num_individuals, ploidy) genotype matrix,positions
are the site coordinates, as an offset from the the start of the genomic window.
- Return type:
Training¶
- dinf.train(*, dinf_model, training_replicates, test_replicates, epochs, parallelism=None, seed=None, callbacks=None)[source]¶
Train a discriminator network.
- Parameters:
dinf_model (DinfModel) – DinfModel object that describes the model components.
training_replicates (int) – Size of the dataset used to train the discriminator.
test_replicates (int) – Size of the test dataset used to evaluate the discriminator after each training epoch.
epochs (int) – Number of full passes over the training dataset when training the discriminator.
parallelism (None | int) – Number of processes to use for parallelising calls to the
DinfModel.generator_func()
andDinfModel.target_func()
.seed (None | int) – Seed for the random number generator.
callbacks (dict | None) –
- Returns:
The trained discriminator.
- Return type:
Inference¶
- dinf.predict(*, discriminator, dinf_model, replicates, sample_target=False, parallelism=None, seed=None, callbacks=None)[source]¶
Sample features and make predictions using the discriminator.
Features are sampled from the generator by default. To instead sample features from the target dataset, use the
sample_target
option.- Parameters:
dinf_model (DinfModel) – DinfModel object that describes the model components.
replicates (int) – Number of features to extract.
sample_target (bool) – If True, sample features from the target dataset. If False (the default), features are sampled from the generator.
parallelism (None | int) – Number of processes to use for parallelising calls to the
DinfModel.generator_func()
andDinfModel.target_func()
.seed (int | None) – Seed for the random number generator.
discriminator (Discriminator) –
callbacks (dict | None) –
- Returns:
A 2-tuple of (thetas, probs), where
thetas
are the drawn parameters (orNone
ifsample_target
is True) andprobs
are the discriminator predictions. thetas[j][k] is the j’th draw for the k’th parameter, and probs[j] is the discriminator prediction for the j’th draw.- Return type:
Tuple[np.ndarray | None, np.ndarray]
- dinf.mc(*, dinf_model, iterations, training_replicates, test_replicates, proposal_replicates, epochs, top_n=None, output_folder=None, parallelism=None, seed=None, callbacks=None)[source]¶
Adversarial Monte Carlo.
In the first iteration, p[0] is the prior distribution. The following steps are taken for iteration j:
sample training and proposal datasets from distribution p[j],
train the discriminator,
make predictions with the discriminator on the proposal dataset,
construct distribution p[j+1] as a weighted KDE of the proposals, where the weights are given by the discriminator predictions.
- Parameters:
dinf_model (DinfModel) – DinfModel object that describes the dinf model.
iterations (int) – Number of iterations.
training_replicates (int) – Size of the dataset used to train the discriminator. This dataset is constructed once each iteration.
test_replicates (int) – Size of the dataset used to evaluate the discriminator after each training epoch. This dataset is constructed once before the iteration begins, and is reused in each iteration.
proposal_replicates (int) – Number of Monte Carlo proposals in each iteration.
epochs (int) – Number of full passes over the training dataset when training the discriminator.
top_n (int | None) – If not None, do ABC rejection sampling in each iteraction by taking the
top_n
best samples, ranked by discriminator prediction.top_n
must be smaller thanproposal_replicates
.output_folder (None | str | pathlib.Path) – Folder to output results. If not specified, the current directory will be used.
parallelism (None | int) – Number of processes to use for parallelising calls to the
DinfModel.generator_func()
andDinfModel.target_func()
.seed (None | int) – Seed for the random number generator.
callbacks (dict | None) –
- dinf.mcmc(*, dinf_model, iterations, training_replicates, test_replicates, epochs, walkers, steps, Dx_replicates, output_folder=None, parallelism=None, seed=None, callbacks=None)[source]¶
Run the MCMC GAN.
In the first iteration, p[0] is the prior distribution. The following steps are taken for iteration j:
sample training dataset from the distribution p[j],
train the discriminator,
run the MCMC,
obtain distribution p[j+1] as a KDE of the MCMC sample.
- Parameters:
dinf_model (DinfModel) – DinfModel object that describes the model components.
iterations (int) – Number of GAN iterations.
training_replicates (int) – Size of the dataset used to train the discriminator. This dataset is constructed once each GAN iteration.
test_replicates (int) – Size of the test dataset used to evaluate the discriminator after each training epoch.
epochs (int) – Number of full passes over the training dataset when training the discriminator.
walkers (int) – Number of independent MCMC chains.
steps (int) – The chain length for each MCMC walker.
Dx_replicates (int) – Number of generator replicates for approximating E[D(G(θ))].
output_folder (None | str | pathlib.Path) – Folder to output results. If not specified, the current directory will be used.
parallelism (None | int) – Number of processes to use for parallelising calls to the
DinfModel.generator_func()
andDinfModel.target_func()
.seed (None | int) – Seed for the random number generator.
callbacks (dict | None) –
- dinf.pg_gan(*, dinf_model, iterations, training_replicates, test_replicates, epochs, Dx_replicates, num_proposals=10, proposal_stddev=0.06666666666666667, pretraining_method='dinf', proposals_method='rr', max_pretraining_iterations=100, output_folder=None, parallelism=None, seed=None)[source]¶
PG-GAN style simulated annealing.
Wang et al. 2021, https://doi.org/10.1111/1755-0998.13386
- Parameters:
dinf_model (DinfModel) – DinfModel object that describes the dinf model.
iterations (int) – Number of GAN iterations.
training_replicates (int) – Size of the dataset used to train the discriminator. This dataset is constructed once each GAN iteration.
test_replicates (int) – Size of the test dataset used to evaluate the discriminator after each training epoch.
epochs (int) – Number of full passes over the training dataset when training the discriminator.
Dx_replicates (int) – Number of generator replicates for approximating E[D(G(θ))].
num_proposals (int) – Number proposals per parameter in each iteration.
proposal_stddev (float) – The standard deviation for the proposal distribution is: temperature * proposal_stddev * (param.high - param.low) The default is 1 / 15, as used by PG-GAN.
pretraining_method (str) –
A string indicating which method should be used to pretrain the discriminator and choose a starting point in parameter space.
”dinf” (default): Train the discriminator on data sampled from the prior, until it can distinguish randomly chosen points from the target dataset with accuracy 0.9, or until
max_pretraining_iterations
is exhausted. In practice, 10,000–20,000 training instances are sufficient.After pretraining, we choose the starting point to be the point with the highest log loss from a fresh set of candidates drawn from the prior.
”pg-gan”: PG-GAN starts at a point in the parameter space that is favourable to the discriminator, and it only trains the discriminator enough to identify such a point.
In each pretraining iteration we train the discriminator on a single point sampled from the prior. If the discriminator can distinguish between that point and the target dataset with accuracy>=0.9, this point is used as the starting point. Otherwise, a new point is chosen by sampling from the prior. If
max_pretraining_iterations
is exhausted without reaching an accuracy of 0.9, the point at which the discriminator had the highest accuracy is used as the starting point.
proposals_method (str) –
A string indicating which method should be used to produce the proposal distribution. In each case,
num_proposals
× P proposals are produced, where P is the number of parameters (to match original PG-GAN behaviour).”rr” (default): Round-robin proposals. Proposals are generated for one parameter only, keeping the other parameters fixed. A different parameter is chosen in each iteration, cycling through all parameters in order. Proposals are drawn from a normal distribution with standard deviation: temperature ×
proposal_stddev
× (param.high - param.low).”mvn”: Multivariate-normal proposals. Proposals are drawn from a multivariate normal distribution with covariance matrix defined to have diagonals: (temperature ×
proposal_stddev
× (param.high - param.low))**2 / P, where P is the number of parameters and the high/low bounds vary by the corresponding parameter. Off-diagonals of the covariance matrix are set to zero.”pg-gan”: Proposals like PG-GAN.
num_proposals
proposals are generated for each parameter, keeping the other parameters fixed, thus producingnum_proposals
× P total proposals (for P parameters). Proposals are drawn from a normal distribution with standard deviation: temperature ×proposal_stddev
× (param.high - param.low).
max_pretraining_iterations (int) – The maximum number of pretraining iterations.
output_folder (None | str | pathlib.Path) – Folder to output results. If not specified, the current directory will be used.
parallelism (None | int) – Number of processes to use for parallelising calls to the
DinfModel.generator_func()
andDinfModel.target_func()
.seed (None | int) – Seed for the random number generator.
Classification¶
- class dinf.Discriminator(network=None, state=None, trained=False, metrics=<factory>, format_version=4, _inited=False)[source]¶
Wrapper for the discriminator neural network that classifies genotype matrices.
- Parameters:
network (nn.Module | None) –
state (Any | None) –
trained (bool) –
metrics (collections.abc.Mapping) –
format_version (int) –
_inited (bool) –
- classmethod from_file(filename, /, *, network=None, state=None)[source]¶
Load neural network weights from the given file.
- Parameters:
filename (str | pathlib.Path) – The filename of the saved model.
network – The flax neural network.
state – The flax train state.
- Returns:
A new Discriminator object.
- to_file(filename, /)[source]¶
Save neural network to the given file.
- Parameters:
filename (str | pathlib.Path) – The path where the model will be saved.
- Return type:
None
- fit(*, train_x, train_y, val_x=None, val_y=None, batch_size=64, epochs=1, ss, reset_metrics=False, entropy_regularisation=False, callbacks=None)[source]¶
Fit discriminator to training data.
- Parameters:
train_x – Training data.
train_y – Labels for training data (zeros and ones).
val_x – Validation data.
val_y – Labels for validation data (zeros and ones).
batch_size (int) – Size of minibatch for gradient update step.
epochs (int) – The number of full passes over the training data.
ss (numpy.random.SeedSequence) – Numpy random number seed sequence.
reset_metrics (bool) – If true, remove loss/accuracy metrics from previous calls to fit() (if any). If false, loss/accuracy metrics will be appended to the existing metrics.
entropy_regularisation (bool) –
callbacks (dict | None) –
Discriminator networks¶
- class dinf.ExchangeableCNN(sizes1=(32, 64), symmetric=Symmetric( # attributes func = 'max' k = None ), sizes2=(64, ), parent=<flax.linen.module._Sentinel object>, name=None)[source]¶
An exchangeable CNN for the discriminator.
Each feature matrix in the input has shape (batch_size, num_haplotypes, num_loci, channels). Two 1-d convolution layers are applied along haplotypes and feature size is reduced in each convolution using a stride of 2. Then the max function is used to collapse features across the num_haplotypes dimension. A third 1-d convolution follows (with stride of 1), and the max function is used to collapse features across the num_loci dimension. Elu activation is used throughout, except for the final output which has no activation.
The number of trainable parameters in the network is dependent on the number of feature matrices in the input, but independent of the size of the feature matrices.
- Parameters:
sizes1 (Tuple[int, ...]) – A tuple of feature sizes for the convolution layers before the symmetric function.
symmetric (Module) – The symmetric network layer.
sizes2 (Tuple[int, ...]) – A tuple of feature sizes for the convolution layers afer the symmetric function.
parent (Optional[Union[Type[Module], Type[Scope], Type[_Sentinel]]]) –
- class dinf.ExchangeablePGGAN(sizes1=(32, 64), symmetric=Symmetric( # attributes func = 'sum' k = None ), sizes2=(128, 128), parent=<flax.linen.module._Sentinel object>, name=None)[source]¶
An exchangeable CNN for the discriminator using the PG-GAN architecture.
This is a faithful translation of the network implemented in PG-GAN [1], with the addition of batch normalisation. Each haplotype is treated as exchangeable with any other haplotype within the same feature matrix (i.e. exchangeability applies within labelled groups).
Each feature matrix in the input has shape (batch_size, num_haplotypes, num_loci, channels). Two 1-d convolution layers are applied along haplotypes, and feature size is reduced following each convolution using 1-d max pooling with size and stride of 2. The sum function is then used to collapse features across the num_haplotypes dimension. Two fully connected layers with dropout follow. Relu activation is used throughout, except for the final output which has no activation.
The number of trainable parameters in the network is dependent on the number of feature matrices in the input, and dependent on the num_loci dimension of the feature matrices.
[1] Wang et al. 2021, https://doi.org/10.1111/1755-0998.13386- Parameters:
sizes1 (Tuple[int, ...]) – A tuple of feature sizes for the convolution layers before the symmetric function. Convolution weights are tied between convolution layers of the distinct labelled feature matrices.
symmetric (Module) – The symmetric network layer.
sizes2 (Tuple[int, ...]) – A tuple of feature sizes for the dense (fully connected) layers afer the symmetric function.
parent (Optional[Union[Type[Module], Type[Scope], Type[_Sentinel]]]) –
- class dinf.Symmetric(func, k=None, parent=<flax.linen.module._Sentinel object>, name=None)[source]¶
Network layer that summarises over a given axis in a way that is invariant to permutations of the input along that axis.
Chan et al. 2018, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7687905/- Parameters:
func (str) –
The permutation-invariant function to apply. Must be one of:
”max”: maximum value.
”mean”: mean value.
”sum”: sum of the values.
”var”: variance of the values.
”moments”: first
k
moments, \(E[x], E[x^2], ..., E[x^k]\).”central-moments”: first
k
central moments, \(E[x], E[(x - E[x])^2], ..., E[(x - E[x])^k]\).
k (int | None) – The number of moments to take when
func
is “moments” or “central-moments”.parent (Optional[Union[Type[Module], Type[Scope], Type[_Sentinel]]]) –
Miscellaneous¶
- dinf.save_results(filename, /, *, thetas, probs, parameters)[source]¶
Save discriminator predictions and parameter values to a .npz file.
- Parameters:
filename (str | pathlib.Path) – The output file.
thetas (np.ndarray | None) – Parameter values.
probs (np.ndarray) – Discriminator predictions corresponding to the
thetas
.parameters (Parameters) – Sequence of parameter names.
- dinf.load_results(filename, /, *, parameters=None)[source]¶
Load descriminator predictions and parameter values from a .npz file.
- Parameters:
filename (str | pathlib.Path) – The input file.
parameters (Parameters | None) – Sequence of parameter names against which the file’s arrays will be checked. A ValueError exception is raised if the names are not the same (and in the same order). If None, the parameter names are not checked.
- Returns:
A numpy structured array, where the first column is the probabilities (named
_Pr
), and the subsequent columns are the parameter values (if any).- Return type:
np.ndarray
- dinf.ts_individuals(ts, /, population=None)[source]¶
Get the individuals corresponding to the tree sequence’s samples.
- Parameters:
ts (tskit.TreeSequence) – The tree sequence.
population (str | int | None) – Only return individuals from this population. The population may be a string identifier (which will be matched against the ‘name’ metadata field in the population table of the tree sequence), or an integer population id. If not specified, all sampled individuals will be returned.
- Returns:
An array of individual IDs (indices into the individuals table).
- Return type:
npt.NDArray[np.integer]
- dinf.get_samples_from_1kgp_metadata(filename, /, *, populations)[source]¶
Get sample IDs for 1000 Genomes Project populations. Related individuals are removed based on the FatherID and MotherID.
- Parameters:
filename (str) – Path to the file “20130606_g1k_3202_samples_ped_population.txt” This file can be downloaded from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/
populations (list) – List of populations to extract.
- Returns:
A dict mapping population names to a list of sample IDs.
- Return type:
- dinf.get_contig_lengths(filename, /, keep_contigs=None)[source]¶
Load contig lengths from a whitespace-separated file like an fai (fasta index).
The file must have (at least) two columns, the first specifies the contig ID, and the second is the contig length. Additional columns are ignored.
- Parameters:
filename (pathlib.Path | str) – The path to the file.
keep_contigs (Iterable[str] | None) – Use only the specified contigs.
- Return type:
- class dinf.Store(base, *, create=False)[source]¶
A sequence of folders with names “0”, “1”, …
This class implements the
collections.abc.Sequence
protocol.- Variables:
base (pathlib.Path) – Base directory containing the sequence.
- Parameters:
base (str | pathlib.Path) – The base directory containing the sequence.
create (bool) – If True, create the base directory if it doesn’t exist. If False (default), raise an error if the base directory doesn’t exist.
Plotting¶
- dinf.plot.feature(mat, /, *, ax, channel=0, cmap=None, cb=True, cb_label=None, vmax=None)[source]¶
Plot a feature matrix as a heatmap.
- Parameters:
mat (np.ndarray) – The feature matrix.
ax (matplotlib.axes.Axes) – Matplotlib axes onto which the plot will be drawn.
channel (int) – Which channel of the feature matrix to draw. I.e., the index into the last dimension of
mat
.cmap (str | matplotlib.colors.Colourmap | None) – Matplotlib colour map.
cb (bool) – If True, add a colour bar that indicates how colours relate to the values in the feature matrix.
cb_label (str | None) – Label on the colour bar.
vmax (float | None) –
- dinf.plot.features(mats, /, *, subplots_kw=None)[source]¶
Plot multiple feature matrices as heatmaps.
- Parameters:
mats (Dict[str, np.ndarray]) – The multiple feature matrices.
subplots_kw (dict | None) – Additional keyword arguments that will be passed to
matplotlib.pyplot.subplots()
.
- Return type:
- Returns:
A (figure, axes) tuple.
- dinf.plot.metrics(metrics_collection, /, *, legend_title=None, subplot_mosaic_kw=None)[source]¶
Plot training metrics from a discriminator neural network.
- Parameters:
metrics_collection (Dict[str, Dict[str, Any]]) – A dictionary mapping labels to metrics dictionaries. The labels will be used in the legend to identify each discriminator.
legend_title (str | None) – A title for the legend.
subplot_mosaic_kw (dict | None) – Additional keyword arguments that will be passed to
matplotlib.pyplot.subplot_mosaic()
.
- Return type:
tuple[matplotlib.figure.Figure, dict[str, matplotlib.axes.Axes]]
- Returns:
A (figure, axes_dict) tuple, where axes_dict is a dictionary mapping metrics to the
matplotlib.axes.Axes
objects. (As obtained frommatplotlib.pyplot.subplot_mosaic()
).
- dinf.plot.hist2d(x, y, /, *, x_label=None, y_label=None, x_truth=None, y_truth=None, ax=None, hist2d_kw=None)[source]¶
Plot a 2d histogram.
- Parameters:
x (np.ndarray) – Values for the horizontal axis.
y (np.ndarray) – Values for the vertical axis.
x_label (float | None) – Name of the parameter on the horizontal axis.
y_label (float | None) – Name of the parameter on the vertical axis.
x_truth (float | None) – True value of the parameter on the horizontal axis.
y_truth (float | None) – True value of the parameter on the vertical axis.
ax (matplotlib.axes.Axis | None) – Axes onto which the histogram will be drawn. If None, an axes will be created with
matplotlib.pyplot.subplots()
.hist2d_kw (dict | None) – Further parameters passed to
matplotlib.axes.Axes.hist2d()
.
- Return type:
- Returns:
A (figure, axes) tuple.
- dinf.plot.hist(x, /, *, ax=None, ci=False, truth=None, hist_kw=None)[source]¶
Plot a histogram.
- Parameters:
x (np.ndarray) – Array of values for the histogram.
ax (matplotlib.axes.Axes | None) – Axes onto which the histogram will be drawn. If None, an axes will be created with
matplotlib.pyplot.subplots()
.ci (bool) – If True, draw a 95% interval at the bottom.
truth (float | None) – If not None, draw a red vertical line at this location.
hist_kw (dict | None) – Further parameters passed to
matplotlib.axes.Axes.hist()
.
- Return type:
- Returns:
A (figure, axes) tuple.
- dinf.plot.entropy(ws, /, *, ax=None)[source]¶
Line plot of relative entropy over successive iterations.
- Parameters:
ws (list[np.ndarray]) – Sequence of proposal weights from each iteration.
ax (matplotlib.axes.Axes | None) – Axes onto which the plot will be drawn. If None, an axes will be created with
matplotlib.pyplot.subplots()
.
- Return type:
- Returns:
A (figure, axes) tuple.