API reference¶

Summary¶

Defining a Dinf model¶

DinfModel(parameters, generator_func, ...[, ...])

A container that describes the components of a Dinf model.

Specifying parameters¶

`Parameters`(**kwargs)	An ordered collection of inferrable parameters.
`Param`(low, high[, truth, name])	An inferrable parameter with `Uniform(low, high)` prior distribution.

Feature extraction¶

`HaplotypeMatrix`(*, num_individuals, ...[, ...])	A factory for feature matrices consisting of haplotypes and relative positions.
`MultipleHaplotypeMatrices`(*, ...[, ...])	A factory for labelled collections of `HaplotypeMatrix` objects.
`BinnedHaplotypeMatrix`(*, num_individuals, ...)	A factory for feature matrices of pseudo-haplotypes.
`MultipleBinnedHaplotypeMatrices`(*, ...[, ...])	A factory for labelled collections of `BinnedHaplotypeMatrix` objects.
`BagOfVcf`(files, /, *[, samples, contig_lengths])	A collection of indexed VCF or BCF files.

Training¶

train(*, dinf_model, training_replicates, ...)

Train a discriminator network.

Inference¶

`predict`(*, discriminator, dinf_model, replicates)	Sample features and make predictions using the discriminator.
`mc`(*, dinf_model, iterations, ...[, top_n, ...])	Adversarial Monte Carlo.
`mcmc`(*, dinf_model, iterations, ...[, ...])	Run the MCMC GAN.
`pg_gan`(*, dinf_model, iterations, ...[, ...])	PG-GAN style simulated annealing.

Classification¶

Discriminator([network, state, trained, ...])

Wrapper for the discriminator neural network that classifies genotype matrices.

Discriminator networks¶

`ExchangeableCNN`([sizes1, symmetric, sizes2, ...])	An exchangeable CNN for the discriminator.
`ExchangeablePGGAN`([sizes1, symmetric, ...])	An exchangeable CNN for the discriminator using the PG-GAN architecture.
`Symmetric`(func[, k, parent, name])	Network layer that summarises over a given axis in a way that is invariant to permutations of the input along that axis.

Miscellaneous¶

`save_results`(filename, /, *, thetas, probs, ...)	Save discriminator predictions and parameter values to a .npz file.
`load_results`(filename, /, *[, parameters])	Load descriminator predictions and parameter values from a .npz file.
`ts_individuals`(ts, /[, population])	Get the individuals corresponding to the tree sequence's samples.
`get_samples_from_1kgp_metadata`(filename, /, ...)	Get sample IDs for 1000 Genomes Project populations.
`get_contig_lengths`(filename, /[, keep_contigs])	Load contig lengths from a whitespace-separated file like an fai (fasta index).
`Store`(base, *[, create])	A sequence of folders with names "0", "1", ...

Plotting¶

`feature`(mat, /, *, ax[, channel, cmap, cb, ...])	Plot a feature matrix as a heatmap.
`features`(mats, /, *[, subplots_kw])	Plot multiple feature matrices as heatmaps.
`metrics`(metrics_collection, /, *[, ...])	Plot training metrics from a discriminator neural network.
`hist`(x, /, *[, ax, ci, truth, hist_kw])	Plot a histogram.
`hist2d`(x, y, /, *[, x_label, y_label, ...])	Plot a 2d histogram.

Reference¶

Type aliases¶

dinf.FeatureCollection¶

A feature array or a labelled collection of feature arrays.

alias of Union[ndarray, Dict[str, ndarray]]

Defining a Dinf model¶

class dinf.DinfModel(parameters, generator_func, target_func, discriminator_network=None)[source]¶

A container that describes the components of a Dinf model.

Constructing a Dinf model requires specifying:

the inferrable parameters,

a generator_func function that accepts concrete parameter values, produces data under some simulation model, and returns a feature matrix (or matrices), and

a target_func function that samples from the target dataset, and returns a feature matrix (or matrices).

import dinf
import msprime

parameters = dinf.Parameters(...)
features = dinf.BinnedHaplotypeMatrix(...)
vcfs = dinf.BagOfVcf(...)

def generator(seed, ...):
    ...
    # E.g. simulate using msprime to get a succinct tree sequence (ts).
    ts = msprime.sim_ancestry(...)
    mts = msprime.sim_mutations(ts, ...)
    return features.from_ts(mts)

def target(seed):
    ...
    return features.from_vcf(vcfs, ...)

dinf_model = dinf.DinfModel(
    parameters=parameters,
    generator_func=generator,
    target_func=target,
)

Parameters:

parameters (Parameters) – A collection of inferrable parameters, one for each keyword argument provided to the generator_func.

generator_func (Callable[..., FeatureCollection]) –

A function that returns features for concrete parameter values.

The first (positional) argument to the function is a seed, that may be used to seed a random number generator. The subsequent (keyword) arguments correspond to concrete values for the parameters. The return type is either:

a feature matrix, i.e. an n-dimensional numpy array, or

multiple feature matrices, i.e. a dictionary of n-dimensional numpy arrays, where the dictionary keys are arbitrary labels such as population names.

The signature of the function will follow the patterns below, where p0, p1, etc. are the names of the parameters in parameters. Type annotations are provided here for clarity, but are not required in user-defined functions. In function signatures, positional-only parameters preceed a / and keyword-only parameters follow a *.

import dinf
import numpy as np

parameters = dinf.Parameters(
    p0=dinf.Param(...),
    p1=dinf.Param(...),
    ...
)

# Signature for returning a single feature matrix.
def generator_func1(
    seed: np.random.SeedSequence, /, *, p0: float, p1: float, ...
) -> np.ndarray:
    ...

# Signature for returning multiple feature matrices.
def generator_func2(
    seed: np.random.SeedSequence, /, *, p0: float, p1: float, ...
) -> dict[str, np.ndarray]:
    ...

# For generator functions accepting large numbers of parameters,
# the following pattern using ``**kwargs`` may be preferred.
def generator_func3(
    seed: np.random.SeedSequence, /, **kwargs: float
)-> dict[str, np.ndarray]:
    assert kwargs.keys() == parameters.keys()
    # do something with p0
    do_something(kwargs["p0"])
    # do something with p1
    do_something_else(kwargs["p1"])
    ...

target_func (Callable[[numpy.random.SeedSequence], FeatureCollection] | None) –
A function that returns features sampled from the target dataset.

If None (which must be specified explicitly), the generator_func will be used to simulate the target dataset using each parameter’s truth value.

The function takes a single (positional) argument, a seed, that may be used to seed a random number generator. The return type must match the return type of generator_func.
discriminator_network (nn.Module | None) – A flax neural network. If not specified, ExchangeableCNN will be used.

parameters: Parameters¶: The inferrable parameters.

generator_func: Callable[..., FeatureCollection]¶: Function that simulates features using concrete parameter values.

generator_func_v: Callable[[Tuple[numpy.random.SeedSequence, Iterable[float]]], FeatureCollection]¶: Wrapper for generator_func that accepts a single argument containing the seed and an iterable of parameter values (as opposed to keyword arguments).

target_func: Callable[[numpy.random.SeedSequence], FeatureCollection] | None¶: Function that samples features from the target distribution.

discriminator_network: nn.Module | None = None¶: A flax neural network. May be None.

filename: pathlib.Path | None = None¶: Path to the file from which the model was loaded (if any). May be None.

check(seed=None)[source]¶

Basic health checks: draw parameters and call the functions.

We don’t do this when initialising the DinfModel, because calling generator_func() and target_func() are potentially time consuming, which could lead to annoying delays for the command line interface.

static from_file(filename)[source]¶

Load the symbol “dinf_model” from a file.

https://docs.python.org/3/library/importlib.html#importing-a-source-file-directly

Parameters:: filename (str | pathlib.Path) –
Return type:: DinfModel

Specifying parameters¶

class dinf.Parameters(**kwargs)[source]¶

An ordered collection of inferrable parameters.

A collection of parameters is built by passing Param objects to the constructor by keyword. The keyword will be used as the name for the given parameter. This class implements Python’s mapping protocol.

import dinf

# Parameters named "a", "b", and "c".
parameters = dinf.Parameters(
    a=dinf.Param(low=0, high=1),
    b=dinf.Param(low=5, high=100),
    c=dinf.Param(low=1e-6, high=1e-3),
)

# Iterate over parameters.
assert len(parameters) == 3
assert list(parameters) == ["a", "b", "c"]
assert [p.high for p in parameters.values()] == [1, 100, 1e-3]

# Lookup a parameter by name.
a = parameters["a"]
assert a.low == 0
assert a.high == 1
assert a.name == "a"

Parameters:: kwargs (Param) – The Param objects, named by keyword.

bounds_contain(thetas, /)[source]¶

Test if values are within in the parameter bounds.

Parameters:: thetas (ndarray) – A 2d numpy array, where thetas[j][k] is the j’th sample for the k’th parameter.
Returns:: A 1d numpy array of boolean values, where ret[j] is True if the j’th sample is within the parameter space, and False otherwise.
Return type:: ndarray

sample_prior(*, size, rng)[source]¶

Get a random sample of parameter values from their prior distributions.

Parameters:

size (int) – The sample size.
rng (numpy.random.Generator) – The numpy random number generator.

Returns:

A 2d numpy array of parameter values, where ret[j][k] is the j’th draw for the k’th parameter.

Return type:

ndarray

sample_kde(thetas, /, *, probs, size, rng, shrinkage=True)[source]¶

Sample from a smoothed set of weighted observations.

Samples are drawn from thetas, weighted by their probability. New points are drawn within a neighbourhood of the sampled thetas using a mulivariate normal whose covariance is calculated from the thetas. This is equivalent to sampling from a Gaussian KDE, but avoids doing an explicit density estimation. Values are sampled until they are within the parameter bounds. Scott’s rule of thumb is used for bandwidth selection.

Parameters:

thetas (np.ndarray) – Parameter values to sample from.
probs (np.ndarray | None) – Discriminator predictions corresponding to the thetas.
size (int) – Number of samples to draw.
rng (numpy.random.Generator) – Numpy random generator.
shinkage – If True, shrink the thetas towards their mean. This corrects for variance inflation of the distribution due to the Gaussian sampling. See West 1993, https://doi.org/10.1111/j.2517-6161.1993.tb01911.x
shrinkage (bool) –

Returns:

The sampled values.

Return type:

np.ndarray

sample_ball(theta, /, *, size, cov=None, cov_factor=1.0, rng)[source]¶

Sample a ball around the given theta value.

The size of the ball is controlled by the covariance matrix, cov, which is multiplied by cov_factor.

Warning

Samples are not gauranteed to be within the parameter bounds.

Parameters:

theta (np.ndarray) – Parameter value around which the ball is centered.
size (int) – Number of samples to draw.
cov (np.ndarray | None) – The covariance matrix. If not specified, a diagonal matrix will be used that has entry cov[j][j] set to the squared range of parameter j: (high - low)**2.
cov_factor (float) – Left-multiply the covariance matrix by this value.
rng (numpy.random.Generator) – Numpy random generator.

Returns:

The sampled values.

transform(thetas, /)[source]¶

Transform values bounded by [param.low, param.high] to [-inf, inf].

See itransform() for the inverse transformation.

Parameters:: thetas (ndarray) – The values to be transformed.
Returns:: Transformed parameter values.
Return type:: ndarray

itransform(thetas, /)[source]¶

Transform values on [-inf, inf] to be bounded by [param.low, param.high].

Performs the inverse of transform().

Parameters:: thetas (ndarray) – The values to be transformed.
Returns:: Transformed parameter values.
Return type:: ndarray

truncate(thetas, /)[source]¶

Truncate values that are out of bounds.

Parameters:: thetas (ndarray) – The values to be truncated.
Returns:: Truncated parameter values.
Return type:: ndarray

reflect(thetas, /)[source]¶

Reflect values that are out of bounds by the amount they are out.

As reflecting does not gaurantee values will be within the bounds, values are first truncated to (2*low - high, 2*high - low), then reflected. For example, with bounds low=0, high=10, a value of -11 will be truncated to -10, then reflected to attain a final value of 10.

Parameters:: thetas (ndarray) – The values to be reflected.
Returns:: Reflected parameter values.
Return type:: ndarray

static geometric_median(thetas, /, *, probs=None)[source]¶

Get the multivariate median of a weighted sample.

Parameters:

thetas (np.ndarray) – Parameter values. thetas[j][k] is the value of the k’th parameter for the j’th multivariate sample.
probs (np.ndarray | None) – Discriminator predictions corresponding to the thetas.

Returns:

Median position in multivariate space.

Return type:

np.ndarray

static top_n(thetas, /, *, probs, n)[source]¶

Get the top n values of thetas, ranked by probs.

Parameters:

thetas (ndarray) – Parameter values. thetas[j][k] is the value of the k’th parameter for the j’th multivariate sample.
probs (ndarray) – Discriminator predictions corresponding to the thetas.
n (int) –

Returns:

A 2-tuple of (top_thetas, top_probs).

Return type:

Tuple[ndarray, ndarray]

class dinf.Param(low, high, truth=None, name=None)[source]¶

An inferrable parameter with Uniform(low, high) prior distribution.

Parameters:

low (float) – The lower bound for the parameter.
high (float) – The uppper bound for the parameter.
truth (float | None) – The true value of the parameter. Set to None if not known (e.g. for parameters corresponding to an empirical dataset). If the value is not None, this value may be used as the truth in a simulation study.
name (str | None) –

low: float¶: Parameter’s lower bound.

high: float¶: Parameter’s upper bound.

truth: float | None = None¶: The true value of the parameter. May be None.

name: str | None = None¶: Name of the parameter.

sample_prior(*, size, rng)[source]¶

Get a random sample from the Uniform(low, high) prior distribution.

Parameters:

size (int) – The sample size.
rng (numpy.random.Generator) – The numpy random number generator.

Returns:

A numpy array of parameter values.

Return type:

ndarray

bounds_contain(x, /)[source]¶

Test if values are within in the parameter bounds.

Parameters:: x (ndarray) – The values that will be bounds checked. May be a single number or an array.
Returns:: A boolean value, or array of booleans with the same length as the input. ret[j] is True if x[j] is in bounds, False otherwise.
Return type:: ndarray

transform(x, /)[source]¶

Transform bounded values on [low, high] to unbounded values on [-inf, inf].

Performs a logit transformation.

Parameters:: x (ndarray) – The values to be transformed.
Returns:: Transformed parameter values.
Return type:: ndarray

itransform(x, /)[source]¶

Transform unbounded values on [-inf, inf] to bounded values on [low, high].

Performs an inverse logit transformation (aka expit, aka sigmoid).

Parameters:: x (ndarray) – The values to be transformed.
Returns:: Transformed parameter values.
Return type:: ndarray

truncate(x, /)[source]¶

Truncate values that are out of bounds.

Parameters:: x (ndarray) – The values to be truncated.
Returns:: Truncated parameter values.
Return type:: ndarray

reflect(x, /)[source]¶

Reflect values that are out of bounds by the amount they are out.

As reflecting does not gaurantee values will be within the bounds, values are first truncated to (2*low - high, 2*high - low), then reflected. For example, with bounds low=0, high=10, a value of -11 will be truncated to -10, then reflected to attain a final value of 10.

Parameters:: x (ndarray) – The values to be reflected.
Returns:: Reflected parameter values.
Return type:: ndarray

Feature extraction¶

class dinf.HaplotypeMatrix(*, num_individuals, num_loci, ploidy, phased, maf_thresh=None)[source]¶

A factory for feature matrices consisting of haplotypes and relative positions.

The feature is an \(n \times m \times c\) array, where the channel dimension \(c\) is 2. The first channel is a haplotype matrix and the second channel is a matrix of relative SNP positions.

The haplotype matrix is an \(n \times m\) matrix, where \(n\) corresponds to the number of haplotypes (or number of individuals, for unphased data) and \(m\) corresponds to the number of SNPs along the sequence. For phased data, each entry is a 0 or 1 corresponding to the major or minor allele respectively. For unphased data, each entry is the count of minor alleles across all chromosomes in the individual. Only polymorphic SNPs are considered, and for multiallelic sites, only the first two alleles are used. Alleles are polarised by choosing the most frequent allele to be encoded as 0 (the major allele), and the second most frequent allele as 1 (the minor allele).

The position matrix is an \(n \times m\) matrix, where the vector of \(m\) inter-SNP distances are repeated \(n\) times—once for each haplotype (or each individual, for unphased data). Each entry is the distance from the previous SNP (as a proportion of the sequence length). The first inter-SNP distance in the vector is always zero.

Chan et al. 2018, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7687905/

Wang et al. 2021, https://doi.org/10.1111/1755-0998.13386

Parameters:

num_individuals (int) – The number of individuals to include in the feature matrix.
num_loci (int) – The number of SNP sites to extract. The central num_loci SNPs in the sequence will be used. If there are fewer than num_loci SNPs, the feature matrix will be padded on both sides with zeros.
ploidy (int) – Ploidy of the individuals.
maf_thresh (float | None) – Minor allele frequency (MAF) threshold. Sites with MAF lower than this value are ignored. If None, only invariant sites will be excluded.
phased (bool) – If True, the individuals’ haplotypes will each be included as independent rows in the feature matrix and the shape of the feature matrix will be (ploidy * num_individuals, num_loci, c). If False, the allele counts for each individual will be summed across their chromosome copies and the shape of the feature matrix will be (num_individuals, num_loci, c).

from_ts(ts)¶

Create a feature matrix from a tskit tree sequence.

Parameters:: ts (TreeSequence) – The tree sequence.
Returns:: Array with shape (num_pseudo_haplotypes, num_loci, c).
Return type:: ndarray

from_vcf(vb, *, sequence_length, max_missing_genotypes, min_seg_sites, rng)¶

Create a feature matrix from a region of a VCF/BCF.

The genomic window for a feature matrix is drawn uniformly at random from the contigs defined in the given BagOfVcf, vb.

Individuals in the VCFs are sampled (without replacement) for inclusion in the output matrix. The size of the feature space can therefore be vastly increased by having more individuals in the VCFs than are needed for the feature dimensions.

Parameters:

vb (BagOfVcf) – The BagOfVcf object that describes the VCF/BCF files.
sequence_length (int) – Length of the genomic window to be sampled.
max_missing_genotypes (int) – Consider only sites with at most this many missing genotype calls.
min_seg_sites (int) – Sampled genotype matrix must have at least this many variable sites (after filtering sites for missingness).
rng (numpy.random.Generator) – Numpy random number generator.

Returns:

Array with shape (num_pseudo_haplotypes, num_loci, c).

Return type:

ndarray

property shape: Tuple[int, ...]¶: Shape of the feature matrix.

class dinf.MultipleHaplotypeMatrices(*, num_individuals, num_loci, ploidy, global_maf_thresh=None, global_phased)[source]¶

A factory for labelled collections of HaplotypeMatrix objects.

One feature matrix is produced for each label. Labels correspond to collections of individuals that will be treated as exchangeable (e.g. populations).

Parameters:

num_individuals (Mapping[str, int]) – A dict that maps labels to the number of individuals in the feature matrix.
num_loci (Mapping[str, int]) – A dict that maps labels to the number of feature loci to be extracted from the sequence.
ploidy (Mapping[str, int]) – A dict that maps labels to the ploidy of the individuals.
global_maf_thresh (float | None) – Minor allele frequency (MAF) threshold. Sites with MAF lower than this value are ignored. MAF is calculated across all individuals. If None, only invariant sites will be excluded.
global_phased (bool) – If True, the individuals’ haplotypes will each be included as independent rows in each feature matrix and the shape of the feature matrix for label l will be (ploidy[l] * num_individuals[l], num_loci[l], c). If False, the allele counts for each individual will be summed across their chromosome copies and the shape of the feature matrix will be (num_individuals[l], num_loci[l], c).

from_ts(ts, *, individuals)¶

Create pseudo-genotype matrices from a tree sequence.

Parameters:

ts (TreeSequence) – The tree sequence.
rng (numpy.random.Generator) – Numpy random number generator.
individuals (Mapping[str, ndarray[Any, dtype[integer]]]) – A mapping from label to an array of individuals.

Returns:

A dictionary mapping a label l to a feature array. Each array has shape (num_pseudo_haplotypes[l], num_loci[l], c). For an array \(M\), the \(M[i][j][0]\)’th entry is the count of minor alleles in the \(j\)’th bin of psdeudo-haplotype \(i\).

Return type:

Dict[str, ndarray]

from_vcf(vb, *, sequence_length, max_missing_genotypes, min_seg_sites, rng)¶

Create pseudo-genotype matrices from a region of a VCF/BCF.

The genomic window is drawn uniformly at random from the sequences defined in the given BagOfVcf.

Individuals in the VCFs are sampled (without replacement) for inclusion in the output matrix. The size of the feature space can therefore be vastly increased by having more individuals in the VCFs than are needed for the feature dimensions.

Parameters:

vb (BagOfVcf) – A collection of indexed VCF/BCF files.
sequence_length (int) – Length of the genomic window to be sampled.
max_missing_genotypes (int) – Consider only sites with fewer missing genotype calls than this number.
min_seg_sites (int) – Sampled genotype matrix must have at least this many variable sites (after filtering sites for missingness).
rng (numpy.random.Generator) – Numpy random number generator.

Returns:

A dictionary mapping a label l to a feature array. Each array has shape (num_pseudo_haplotypes[l], num_loci[l], c). For an array \(M\), the \(M[i][j][0]\)’th entry is the count of minor alleles in the \(j\)’th bin of psdeudo-haplotype \(i\).

Return type:

Dict[str, ndarray]

property shape: Dict[str, Tuple[int, ...]]¶: Shape of the feature matrices.

class dinf.BinnedHaplotypeMatrix(*, num_individuals, num_loci, ploidy, phased, maf_thresh=None)[source]¶

A factory for feature matrices of pseudo-haplotypes.

The binned haplotype matrix is an \(n \times m\) matrix, where \(n\) corresponds to the number of haplotypes (or polyploid genotypes, for unphased data) and \(m\) corresponds to a set of equally sized bins along the sequence length. Each matrix entry contains the count of minor alleles in an individual’s haplotype (or polyploid genotype) in a given bin.

Only polymorphic SNPs are considered, and for multiallelic sites, only the first two alleles are used. Alleles are polarised by choosing the most frequent allele to be encoded as 0, and the second most frequent allele as 1.

As the features are intended to be passed to a covolutional neural network, the output dimensions are actually \(n \times m \times 1\), where the final dimension is the (unused) “channels” dimension for the convolution.

Gower et al. 2021, https://doi.org/10.7554/eLife.64669

Parameters:

num_individuals (int) – The number of individuals to include in the feature matrix.
num_loci (int) – The number of bins into which the sequence is partitioned. Each bin spans sequence_length / num_loci base pairs.
ploidy (int) – Ploidy of the individuals.
phased (bool) – If True, the individuals’ haplotypes will each be included as independent rows in the feature matrix and the shape of the feature matrix will be (ploidy * num_individuals, num_loci, 1). If False, the allele counts for each individual will be summed across their chromosome copies and the shape of the feature matrix will be (num_individuals, num_loci, 1).
maf_thresh (float | None) – Minor allele frequency (MAF) threshold. Sites with MAF lower than this value are ignored. If None, only invariant sites will be excluded.

from_ts(ts)¶

Create a feature matrix from a tskit tree sequence.

Parameters:: ts (TreeSequence) – The tree sequence.
Returns:: Array with shape (num_pseudo_haplotypes, num_loci, c).
Return type:: ndarray

from_vcf(vb, *, sequence_length, max_missing_genotypes, min_seg_sites, rng)¶

Create a feature matrix from a region of a VCF/BCF.

The genomic window for a feature matrix is drawn uniformly at random from the contigs defined in the given BagOfVcf, vb.

Individuals in the VCFs are sampled (without replacement) for inclusion in the output matrix. The size of the feature space can therefore be vastly increased by having more individuals in the VCFs than are needed for the feature dimensions.

Parameters:

vb (BagOfVcf) – The BagOfVcf object that describes the VCF/BCF files.
sequence_length (int) – Length of the genomic window to be sampled.
max_missing_genotypes (int) – Consider only sites with at most this many missing genotype calls.
min_seg_sites (int) – Sampled genotype matrix must have at least this many variable sites (after filtering sites for missingness).
rng (numpy.random.Generator) – Numpy random number generator.

Returns:

Array with shape (num_pseudo_haplotypes, num_loci, c).

Return type:

ndarray

property shape: Tuple[int, ...]¶: Shape of the feature matrix.

class dinf.MultipleBinnedHaplotypeMatrices(*, num_individuals, num_loci, ploidy, global_maf_thresh=None, global_phased)[source]¶

A factory for labelled collections of BinnedHaplotypeMatrix objects.

One feature matrix is produced for each label. Labels correspond to collections of individuals that will be treated as exchangeable (e.g. populations).

Parameters:

num_individuals (Mapping[str, int]) – A dict that maps labels to the number of individuals in the feature matrix.
num_loci (Mapping[str, int]) – A dict that maps labels to the number of feature loci to be extracted from the sequence.
ploidy (Mapping[str, int]) – A dict that maps labels to the ploidy of the individuals.
global_maf_thresh (float | None) – Minor allele frequency (MAF) threshold. Sites with MAF lower than this value are ignored. MAF is calculated across all individuals. If None, only invariant sites will be excluded.
global_phased (bool) – If True, the individuals’ haplotypes will each be included as independent rows in each feature matrix and the shape of the feature matrix for label l will be (ploidy[l] * num_individuals[l], num_loci[l], c). If False, the allele counts for each individual will be summed across their chromosome copies and the shape of the feature matrix will be (num_individuals[l], num_loci[l], c).

from_ts(ts, *, individuals)¶

Create pseudo-genotype matrices from a tree sequence.

Parameters:

ts (TreeSequence) – The tree sequence.
rng (numpy.random.Generator) – Numpy random number generator.
individuals (Mapping[str, ndarray[Any, dtype[integer]]]) – A mapping from label to an array of individuals.

Returns:

A dictionary mapping a label l to a feature array. Each array has shape (num_pseudo_haplotypes[l], num_loci[l], c). For an array \(M\), the \(M[i][j][0]\)’th entry is the count of minor alleles in the \(j\)’th bin of psdeudo-haplotype \(i\).

Return type:

Dict[str, ndarray]

from_vcf(vb, *, sequence_length, max_missing_genotypes, min_seg_sites, rng)¶

Create pseudo-genotype matrices from a region of a VCF/BCF.

The genomic window is drawn uniformly at random from the sequences defined in the given BagOfVcf.

Individuals in the VCFs are sampled (without replacement) for inclusion in the output matrix. The size of the feature space can therefore be vastly increased by having more individuals in the VCFs than are needed for the feature dimensions.

Parameters:

vb (BagOfVcf) – A collection of indexed VCF/BCF files.
sequence_length (int) – Length of the genomic window to be sampled.
max_missing_genotypes (int) – Consider only sites with fewer missing genotype calls than this number.
min_seg_sites (int) – Sampled genotype matrix must have at least this many variable sites (after filtering sites for missingness).
rng (numpy.random.Generator) – Numpy random number generator.

Returns:

A dictionary mapping a label l to a feature array. Each array has shape (num_pseudo_haplotypes[l], num_loci[l], c). For an array \(M\), the \(M[i][j][0]\)’th entry is the count of minor alleles in the \(j\)’th bin of psdeudo-haplotype \(i\).

Return type:

Dict[str, ndarray]

property shape: Dict[str, Tuple[int, ...]]¶: Shape of the feature matrices.

class dinf.BagOfVcf(files, /, *, samples=None, contig_lengths=None)[source]¶

A collection of indexed VCF or BCF files.

VCF data are sometimes contained in a single file, and sometimes split into multiple files by chromosome. To remove the burden of dealing with both of these common cases, this class maps a contig ID to a cyvcf2.VCF object for that contig. The class implements Python’s mapping protocol, plus methods for sampling regions of the genome at random.

import dinf

vcfs = dinf.BagOfVcf(["chr1.bcf", "chr2.bcf", "chr3.bcf"])

# Iterate over contig IDs.
assert len(vcfs) == 3
assert list(vcfs) == ["chr1", "chr2", "chr3"]

# Lookup a cyvcf2.VCF object by contig ID.
chr1 = vcfs["chr1"]
assert chr1.fname == "chr1.bcf"
first_variant = next(chr1)
assert first_variant.CHROM == "chr1"

Parameters:

files (Iterable[str | pathlib.Path]) – An iterable of filenames. Each file must be an indexed VCF or BCF, and must contain the FORMAT/GT field.
contig_lengths (collections.abc.Mapping[str, int] | None) – A dict mapping a contig name to contig length. Only the contigs in this dict will be used.
samples (collections.abc.Mapping[str, List[str]] | None) – A dictionary that maps a label to a list of individual names, where the individual names correspond to the VCF columns for which genotypes will be sampled.

lengths: np.ndarray¶

Lengths of the contigs in the bag.

The order matches the contig order obtained when iterating.

samples: collections.abc.Mapping[str, List[str]] | None¶

A dictionary that maps a label to a list of individual names.

The individual names correspond to the VCF columns for which genotypes will be sampled. This is a bookkeeping device that records which genotypes belong to which label (e.g. which population). If None, it is assumed that all individuals in the VCF will be treated as exchangeable.

sample_regions(size, sequence_length, rng)[source]¶

Sample a list of (chrom, start, end) triplets.

Parameters:

size (int) – Number of genomic windows to sample.
sequence_length (int) – Length of the sequence to sample.
rng (numpy.random.Generator) – The numpy random number generator.

Returns:

List of 3-tuples: (chrom, start, end). The start and end coordinates are 1-based and inclusive, to match the usual convention for ‘chrom:start-end’, e.g. in bcftools.

Return type:

List[Tuple[str, int, int]]

sample_genotype_matrix(*, sequence_length, min_seg_sites, max_missing_genotypes, require_phased, rng, retries=1000)[source]¶

Sample a genotype matrix uniformly at random from the genome.

Parameters:

sequence_length (int) – Length of the sequence to sample.
max_missing_genotypes (int) – Consider only sites with at most this many missing genotype calls.
min_seg_sites (int) – Sampled genotype matrix must have at least this many variable sites (after filtering sites for missingness).
require_phased (bool) – If True, raise an error if genotypes are not phased.
rng (numpy.random.Generator) – Numpy random number generator.
retries (int) – Maximum number of attempts allowed to find a genomic window with the required number of segregating sites.

Returns:

A 2-tuple of (G, positions) where

G is a (num_sites, num_individuals, ploidy) genotype matrix,
positions are the site coordinates, as an offset from the the start of the genomic window.

Return type:

Tuple[ndarray, ndarray]

Training¶

dinf.train(*, dinf_model, training_replicates, test_replicates, epochs, parallelism=None, seed=None, callbacks=None)[source]¶

Train a discriminator network.

Parameters:

dinf_model (DinfModel) – DinfModel object that describes the model components.
training_replicates (int) – Size of the dataset used to train the discriminator.
test_replicates (int) – Size of the test dataset used to evaluate the discriminator after each training epoch.
epochs (int) – Number of full passes over the training dataset when training the discriminator.
parallelism (None | int) – Number of processes to use for parallelising calls to the DinfModel.generator_func() and DinfModel.target_func().
seed (None | int) – Seed for the random number generator.
callbacks (dict | None) –

Returns:

The trained discriminator.

Return type:

Discriminator

Inference¶

dinf.predict(*, discriminator, dinf_model, replicates, sample_target=False, parallelism=None, seed=None, callbacks=None)[source]¶

Sample features and make predictions using the discriminator.

Features are sampled from the generator by default. To instead sample features from the target dataset, use the sample_target option.

Parameters:

dinf_model (DinfModel) – DinfModel object that describes the model components.
replicates (int) – Number of features to extract.
sample_target (bool) – If True, sample features from the target dataset. If False (the default), features are sampled from the generator.
parallelism (None | int) – Number of processes to use for parallelising calls to the DinfModel.generator_func() and DinfModel.target_func().
seed (int | None) – Seed for the random number generator.
discriminator (Discriminator) –
callbacks (dict | None) –

Returns:

A 2-tuple of (thetas, probs), where thetas are the drawn parameters (or None if sample_target is True) and probs are the discriminator predictions. thetas[j][k] is the j’th draw for the k’th parameter, and probs[j] is the discriminator prediction for the j’th draw.

Return type:

Tuple[np.ndarray | None, np.ndarray]

dinf.mc(*, dinf_model, iterations, training_replicates, test_replicates, proposal_replicates, epochs, top_n=None, output_folder=None, parallelism=None, seed=None, callbacks=None)[source]¶

Adversarial Monte Carlo.

In the first iteration, p[0] is the prior distribution. The following steps are taken for iteration j:

sample training and proposal datasets from distribution p[j],

train the discriminator,

make predictions with the discriminator on the proposal dataset,

construct distribution p[j+1] as a weighted KDE of the proposals, where the weights are given by the discriminator predictions.

Parameters:

dinf_model (DinfModel) – DinfModel object that describes the dinf model.
iterations (int) – Number of iterations.
training_replicates (int) – Size of the dataset used to train the discriminator. This dataset is constructed once each iteration.
test_replicates (int) – Size of the dataset used to evaluate the discriminator after each training epoch. This dataset is constructed once before the iteration begins, and is reused in each iteration.
proposal_replicates (int) – Number of Monte Carlo proposals in each iteration.
epochs (int) – Number of full passes over the training dataset when training the discriminator.
top_n (int | None) – If not None, do ABC rejection sampling in each iteraction by taking the top_n best samples, ranked by discriminator prediction. top_n must be smaller than proposal_replicates.
output_folder (None | str | pathlib.Path) – Folder to output results. If not specified, the current directory will be used.
parallelism (None | int) – Number of processes to use for parallelising calls to the DinfModel.generator_func() and DinfModel.target_func().
seed (None | int) – Seed for the random number generator.
callbacks (dict | None) –

dinf.mcmc(*, dinf_model, iterations, training_replicates, test_replicates, epochs, walkers, steps, Dx_replicates, output_folder=None, parallelism=None, seed=None, callbacks=None)[source]¶

Run the MCMC GAN.

In the first iteration, p[0] is the prior distribution. The following steps are taken for iteration j:

sample training dataset from the distribution p[j],

train the discriminator,

run the MCMC,

obtain distribution p[j+1] as a KDE of the MCMC sample.

Parameters:

dinf_model (DinfModel) – DinfModel object that describes the model components.
iterations (int) – Number of GAN iterations.
training_replicates (int) – Size of the dataset used to train the discriminator. This dataset is constructed once each GAN iteration.
test_replicates (int) – Size of the test dataset used to evaluate the discriminator after each training epoch.
epochs (int) – Number of full passes over the training dataset when training the discriminator.
walkers (int) – Number of independent MCMC chains.
steps (int) – The chain length for each MCMC walker.
Dx_replicates (int) – Number of generator replicates for approximating E[D(G(θ))].
output_folder (None | str | pathlib.Path) – Folder to output results. If not specified, the current directory will be used.
parallelism (None | int) – Number of processes to use for parallelising calls to the DinfModel.generator_func() and DinfModel.target_func().
seed (None | int) – Seed for the random number generator.
callbacks (dict | None) –

dinf.pg_gan(*, dinf_model, iterations, training_replicates, test_replicates, epochs, Dx_replicates, num_proposals=10, proposal_stddev=0.06666666666666667, pretraining_method='dinf', proposals_method='rr', max_pretraining_iterations=100, output_folder=None, parallelism=None, seed=None)[source]¶

PG-GAN style simulated annealing.

Wang et al. 2021, https://doi.org/10.1111/1755-0998.13386

Parameters:

dinf_model (DinfModel) – DinfModel object that describes the dinf model.
iterations (int) – Number of GAN iterations.
training_replicates (int) – Size of the dataset used to train the discriminator. This dataset is constructed once each GAN iteration.
test_replicates (int) – Size of the test dataset used to evaluate the discriminator after each training epoch.
epochs (int) – Number of full passes over the training dataset when training the discriminator.
Dx_replicates (int) – Number of generator replicates for approximating E[D(G(θ))].
num_proposals (int) – Number proposals per parameter in each iteration.
proposal_stddev (float) – The standard deviation for the proposal distribution is: temperature * proposal_stddev * (param.high - param.low) The default is 1 / 15, as used by PG-GAN.
pretraining_method (str) –
A string indicating which method should be used to pretrain the discriminator and choose a starting point in parameter space.
- ”dinf” (default): Train the discriminator on data sampled from the prior, until it can distinguish randomly chosen points from the target dataset with accuracy 0.9, or until max_pretraining_iterations is exhausted. In practice, 10,000–20,000 training instances are sufficient.
  
  After pretraining, we choose the starting point to be the point with the highest log loss from a fresh set of candidates drawn from the prior.
- ”pg-gan”: PG-GAN starts at a point in the parameter space that is favourable to the discriminator, and it only trains the discriminator enough to identify such a point.
  
  In each pretraining iteration we train the discriminator on a single point sampled from the prior. If the discriminator can distinguish between that point and the target dataset with accuracy>=0.9, this point is used as the starting point. Otherwise, a new point is chosen by sampling from the prior. If max_pretraining_iterations is exhausted without reaching an accuracy of 0.9, the point at which the discriminator had the highest accuracy is used as the starting point.
proposals_method (str) –
A string indicating which method should be used to produce the proposal distribution. In each case, num_proposals × P proposals are produced, where P is the number of parameters (to match original PG-GAN behaviour).
- ”rr” (default): Round-robin proposals. Proposals are generated for one parameter only, keeping the other parameters fixed. A different parameter is chosen in each iteration, cycling through all parameters in order. Proposals are drawn from a normal distribution with standard deviation: temperature × proposal_stddev × (param.high - param.low).
- ”mvn”: Multivariate-normal proposals. Proposals are drawn from a multivariate normal distribution with covariance matrix defined to have diagonals: (temperature × proposal_stddev × (param.high - param.low))**2 / P, where P is the number of parameters and the high/low bounds vary by the corresponding parameter. Off-diagonals of the covariance matrix are set to zero.
- ”pg-gan”: Proposals like PG-GAN. num_proposals proposals are generated for each parameter, keeping the other parameters fixed, thus producing num_proposals × P total proposals (for P parameters). Proposals are drawn from a normal distribution with standard deviation: temperature × proposal_stddev × (param.high - param.low).
max_pretraining_iterations (int) – The maximum number of pretraining iterations.
output_folder (None | str | pathlib.Path) – Folder to output results. If not specified, the current directory will be used.
parallelism (None | int) – Number of processes to use for parallelising calls to the DinfModel.generator_func() and DinfModel.target_func().
seed (None | int) – Seed for the random number generator.

Classification¶

class dinf.Discriminator(network=None, state=None, trained=False, metrics=<factory>, format_version=4, _inited=False)[source]¶

Wrapper for the discriminator neural network that classifies genotype matrices.

Parameters:

network (nn.Module | None) –
state (Any | None) –
trained (bool) –
metrics (collections.abc.Mapping) –
format_version (int) –
_inited (bool) –

classmethod from_file(filename, /, *, network=None, state=None)[source]¶

Load neural network weights from the given file.

Parameters:

filename (str | pathlib.Path) – The filename of the saved model.
network – The flax neural network.
state – The flax train state.

Returns:

A new Discriminator object.

to_file(filename, /)[source]¶

Save neural network to the given file.

Parameters:: filename (str | pathlib.Path) – The path where the model will be saved.
Return type:: None

fit(*, train_x, train_y, val_x=None, val_y=None, batch_size=64, epochs=1, ss, reset_metrics=False, entropy_regularisation=False, callbacks=None)[source]¶

Fit discriminator to training data.

Parameters:

train_x – Training data.
train_y – Labels for training data (zeros and ones).
val_x – Validation data.
val_y – Labels for validation data (zeros and ones).
batch_size (int) – Size of minibatch for gradient update step.
epochs (int) – The number of full passes over the training data.
ss (numpy.random.SeedSequence) – Numpy random number seed sequence.
reset_metrics (bool) – If true, remove loss/accuracy metrics from previous calls to fit() (if any). If false, loss/accuracy metrics will be appended to the existing metrics.
entropy_regularisation (bool) –
callbacks (dict | None) –

predict(x, *, batch_size=1024, callbacks=None)[source]¶

Make predictions about data using a pre-fitted neural network.

Parameters:

x – The data instances about which to make predictions.
batch_size (int) – Size of data batches for prediction.
callbacks (dict | None) –

Returns:

A vector of predictions, one for each input instance.

Return type:

np.ndarray

summary()[source]¶: Return a summary of the neural network.

Discriminator networks¶

class dinf.ExchangeableCNN(sizes1=(32, 64), symmetric=Symmetric( # attributes func = 'max' k = None ), sizes2=(64, ), parent=<flax.linen.module._Sentinel object>, name=None)[source]¶

An exchangeable CNN for the discriminator.

Each feature matrix in the input has shape (batch_size, num_haplotypes, num_loci, channels). Two 1-d convolution layers are applied along haplotypes and feature size is reduced in each convolution using a stride of 2. Then the max function is used to collapse features across the num_haplotypes dimension. A third 1-d convolution follows (with stride of 1), and the max function is used to collapse features across the num_loci dimension. Elu activation is used throughout, except for the final output which has no activation.

The number of trainable parameters in the network is dependent on the number of feature matrices in the input, but independent of the size of the feature matrices.

Parameters:

sizes1 (Tuple[int, ...]) – A tuple of feature sizes for the convolution layers before the symmetric function.
symmetric (Module) – The symmetric network layer.
sizes2 (Tuple[int, ...]) – A tuple of feature sizes for the convolution layers afer the symmetric function.
parent (Optional[Union[Type[Module], Type[Scope], Type[_Sentinel]]]) –
name (Optional[str]) –

class dinf.ExchangeablePGGAN(sizes1=(32, 64), symmetric=Symmetric( # attributes func = 'sum' k = None ), sizes2=(128, 128), parent=<flax.linen.module._Sentinel object>, name=None)[source]¶

An exchangeable CNN for the discriminator using the PG-GAN architecture.

This is a faithful translation of the network implemented in PG-GAN [1], with the addition of batch normalisation. Each haplotype is treated as exchangeable with any other haplotype within the same feature matrix (i.e. exchangeability applies within labelled groups).

Each feature matrix in the input has shape (batch_size, num_haplotypes, num_loci, channels). Two 1-d convolution layers are applied along haplotypes, and feature size is reduced following each convolution using 1-d max pooling with size and stride of 2. The sum function is then used to collapse features across the num_haplotypes dimension. Two fully connected layers with dropout follow. Relu activation is used throughout, except for the final output which has no activation.

The number of trainable parameters in the network is dependent on the number of feature matrices in the input, and dependent on the num_loci dimension of the feature matrices.

[1] Wang et al. 2021, https://doi.org/10.1111/1755-0998.13386

Parameters:

sizes1 (Tuple[int, ...]) – A tuple of feature sizes for the convolution layers before the symmetric function. Convolution weights are tied between convolution layers of the distinct labelled feature matrices.
symmetric (Module) – The symmetric network layer.
sizes2 (Tuple[int, ...]) – A tuple of feature sizes for the dense (fully connected) layers afer the symmetric function.
parent (Optional[Union[Type[Module], Type[Scope], Type[_Sentinel]]]) –
name (Optional[str]) –

class dinf.Symmetric(func, k=None, parent=<flax.linen.module._Sentinel object>, name=None)[source]¶

Network layer that summarises over a given axis in a way that is invariant to permutations of the input along that axis.

Chan et al. 2018, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7687905/

Parameters:

func (str) –
The permutation-invariant function to apply. Must be one of:
- ”max”: maximum value.
- ”mean”: mean value.
- ”sum”: sum of the values.
- ”var”: variance of the values.
- ”moments”: first k moments, \(E[x], E[x^2], ..., E[x^k]\).
- ”central-moments”: first k central moments, \(E[x], E[(x - E[x])^2], ..., E[(x - E[x])^k]\).
k (int | None) – The number of moments to take when func is “moments” or “central-moments”.
parent (Optional[Union[Type[Module], Type[Scope], Type[_Sentinel]]]) –
name (Optional[str]) –

Miscellaneous¶

dinf.save_results(filename, /, *, thetas, probs, parameters)[source]¶

Save discriminator predictions and parameter values to a .npz file.

Parameters:

filename (str | pathlib.Path) – The output file.
thetas (np.ndarray | None) – Parameter values.
probs (np.ndarray) – Discriminator predictions corresponding to the thetas.
parameters (Parameters) – Sequence of parameter names.

dinf.load_results(filename, /, *, parameters=None)[source]¶

Load descriminator predictions and parameter values from a .npz file.

Parameters:

filename (str | pathlib.Path) – The input file.
parameters (Parameters | None) – Sequence of parameter names against which the file’s arrays will be checked. A ValueError exception is raised if the names are not the same (and in the same order). If None, the parameter names are not checked.

Returns:

A numpy structured array, where the first column is the probabilities (named _Pr), and the subsequent columns are the parameter values (if any).

Return type:

np.ndarray

dinf.ts_individuals(ts, /, population=None)[source]¶

Get the individuals corresponding to the tree sequence’s samples.

Parameters:

ts (tskit.TreeSequence) – The tree sequence.
population (str | int | None) – Only return individuals from this population. The population may be a string identifier (which will be matched against the ‘name’ metadata field in the population table of the tree sequence), or an integer population id. If not specified, all sampled individuals will be returned.

Returns:

An array of individual IDs (indices into the individuals table).

Return type:

npt.NDArray[np.integer]

dinf.get_samples_from_1kgp_metadata(filename, /, *, populations)[source]¶

Get sample IDs for 1000 Genomes Project populations. Related individuals are removed based on the FatherID and MotherID.

Parameters:

filename (str) – Path to the file “20130606_g1k_3202_samples_ped_population.txt” This file can be downloaded from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/
populations (list) – List of populations to extract.

Returns:

A dict mapping population names to a list of sample IDs.

Return type:

dict

dinf.get_contig_lengths(filename, /, keep_contigs=None)[source]¶

Load contig lengths from a whitespace-separated file like an fai (fasta index).

The file must have (at least) two columns, the first specifies the contig ID, and the second is the contig length. Additional columns are ignored.

Parameters:

filename (pathlib.Path | str) – The path to the file.
keep_contigs (Iterable[str] | None) – Use only the specified contigs.

Return type:

dict

class dinf.Store(base, *, create=False)[source]¶

A sequence of folders with names “0”, “1”, …

This class implements the collections.abc.Sequence protocol.

Variables:

base (pathlib.Path) – Base directory containing the sequence.

Parameters:

base (str | pathlib.Path) – The base directory containing the sequence.
create (bool) – If True, create the base directory if it doesn’t exist. If False (default), raise an error if the base directory doesn’t exist.

__len__()[source]¶: Length of the sequence.

__getitem__(index)[source]¶

Get the path with the given index.

Parameters:: index (int) – The 0-based index into the sequence.

increment()[source]¶: Add a new folder to the end of the sequence.

assert_complete(expected_files)[source]¶: Raise a RuntimeError if any expected files are missing from a subfolder.

Plotting¶

dinf.plot.feature(mat, /, *, ax, channel=0, cmap=None, cb=True, cb_label=None, vmax=None)[source]¶

Plot a feature matrix as a heatmap.

Parameters:

mat (np.ndarray) – The feature matrix.
ax (matplotlib.axes.Axes) – Matplotlib axes onto which the plot will be drawn.
channel (int) – Which channel of the feature matrix to draw. I.e., the index into the last dimension of mat.
cmap (str | matplotlib.colors.Colourmap | None) – Matplotlib colour map.
cb (bool) – If True, add a colour bar that indicates how colours relate to the values in the feature matrix.
cb_label (str | None) – Label on the colour bar.
vmax (float | None) –

dinf.plot.features(mats, /, *, subplots_kw=None)[source]¶

Plot multiple feature matrices as heatmaps.

Parameters:

mats (Dict[str, np.ndarray]) – The multiple feature matrices.
subplots_kw (dict | None) – Additional keyword arguments that will be passed to matplotlib.pyplot.subplots().

Return type:

tuple[matplotlib.figure.Figure, matplotlib.axes.Axes]

Returns:

A (figure, axes) tuple.

dinf.plot.metrics(metrics_collection, /, *, legend_title=None, subplot_mosaic_kw=None)[source]¶

Plot training metrics from a discriminator neural network.

Parameters:

metrics_collection (Dict[str, Dict[str, Any]]) – A dictionary mapping labels to metrics dictionaries. The labels will be used in the legend to identify each discriminator.
legend_title (str | None) – A title for the legend.
subplot_mosaic_kw (dict | None) – Additional keyword arguments that will be passed to matplotlib.pyplot.subplot_mosaic().

Return type:

tuple[matplotlib.figure.Figure, dict[str, matplotlib.axes.Axes]]

Returns:

A (figure, axes_dict) tuple, where axes_dict is a dictionary mapping metrics to the matplotlib.axes.Axes objects. (As obtained from matplotlib.pyplot.subplot_mosaic()).

dinf.plot.hist2d(x, y, /, *, x_label=None, y_label=None, x_truth=None, y_truth=None, ax=None, hist2d_kw=None)[source]¶

Plot a 2d histogram.

Parameters:

x (np.ndarray) – Values for the horizontal axis.
y (np.ndarray) – Values for the vertical axis.
x_label (float | None) – Name of the parameter on the horizontal axis.
y_label (float | None) – Name of the parameter on the vertical axis.
x_truth (float | None) – True value of the parameter on the horizontal axis.
y_truth (float | None) – True value of the parameter on the vertical axis.
ax (matplotlib.axes.Axis | None) – Axes onto which the histogram will be drawn. If None, an axes will be created with matplotlib.pyplot.subplots().
hist2d_kw (dict | None) – Further parameters passed to matplotlib.axes.Axes.hist2d().

Return type:

tuple[matplotlib.figure.Figure, matplotlib.axes.Axes]

Returns:

A (figure, axes) tuple.

dinf.plot.hist(x, /, *, ax=None, ci=False, truth=None, hist_kw=None)[source]¶

Plot a histogram.

Parameters:

x (np.ndarray) – Array of values for the histogram.
ax (matplotlib.axes.Axes | None) – Axes onto which the histogram will be drawn. If None, an axes will be created with matplotlib.pyplot.subplots().
ci (bool) – If True, draw a 95% interval at the bottom.
truth (float | None) – If not None, draw a red vertical line at this location.
hist_kw (dict | None) – Further parameters passed to matplotlib.axes.Axes.hist().

Return type:

tuple[matplotlib.figure.Figure, matplotlib.axes.Axes]

Returns:

A (figure, axes) tuple.

dinf.plot.entropy(ws, /, *, ax=None)[source]¶

Line plot of relative entropy over successive iterations.

Parameters:

ws (list[np.ndarray]) – Sequence of proposal weights from each iteration.
ax (matplotlib.axes.Axes | None) – Axes onto which the plot will be drawn. If None, an axes will be created with matplotlib.pyplot.subplots().

Return type:

tuple[matplotlib.figure.Figure, matplotlib.axes.Axes]

Returns:

A (figure, axes) tuple.