Statistical Inference for Persistence Homology Data • inphr

The goal of {inphr} is to provide a set of functions for performing null hypothesis testing on samples of persistence diagrams using the theory of permutations. Currently, only two-sample testing is implemented. Inputs can be either samples of persistence diagrams themselves or vectorizations. In the former case, they are embedded in a metric space using either the Bottleneck or Wasserstein distance. In the latter case, persistence data becomes functional data and inference is performed using tools available in the {fdatest} package.

Installation

You can install the development version of inphr from GitHub with:

# install.packages("pak")
pak::pak("tdaverse/inphr")

Usage

Let us start by loading the package:

library(inphr)

Toy data

The package contains three toy data sets of persistence diagrams, which can be used for testing. They are available in the package as trefoils1, trefoils2, and archspirals. The first two sets contain persistence diagrams computed from noisy samples of trefoil knots, while the third set contains persistence diagrams computed from noisy samples of 2-armed Archimedean spirals. Each set contains 24 persistence diagrams, each computed from a sample of 120 points sampled from the respective shape, with Gaussian noise added (standard deviation = 0.05). The persistence diagrams were computed using the TDA::ripsDiag() function with a maximum scale of 6 and up to dimension 2.
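To make the description above concrete, here is a minimal base R sketch of how one noisy trefoil point cloud of this kind could be simulated. The function name sample_noisy_trefoil(), the particular trefoil parametrization, and the uniform sampling in the curve parameter are all assumptions made for illustration; the README does not state how the shipped data sets were actually sampled.

```r
# Sketch only: parametrization and sampling scheme are assumptions,
# not taken from the {inphr} sources.
sample_noisy_trefoil <- function(n = 120L, sd = 0.05) {
  t <- runif(n, min = 0, max = 2 * pi) # uniform in the curve parameter
  pts <- cbind(
    x = sin(t) + 2 * sin(2 * t),
    y = cos(t) - 2 * cos(2 * t),
    z = -sin(3 * t)
  )
  # Add independent Gaussian noise to every coordinate
  pts + matrix(rnorm(3L * n, sd = sd), nrow = n, ncol = 3L)
}

set.seed(1234)
cloud <- sample_noisy_trefoil()
dim(cloud)

# With the {TDA} package installed, a diagram with the settings quoted above
# could then be computed as:
# pd <- TDA::ripsDiag(cloud, maxdimension = 2, maxscale = 6)$diagram
```

The Rips computation is left as a comment because it can be slow at this scale and requires {TDA}.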

Test in the space of diagrams

You can use the two_sample_diagram_test() function to perform a two-sample test on these persistence diagrams in the space of diagrams themselves. For example, to test whether the persistence diagrams from trefoils1 are significantly different from the persistence diagrams from trefoils2, you can run:

two_sample_diagram_test(trefoils1, trefoils2, B = 100L)
#> [1] 1

To test whether the persistence diagrams from trefoils1 are significantly different from the persistence diagrams from archspirals, you can run:

two_sample_diagram_test(trefoils1, archspirals, B = 100L)
#> [1] 0.00990099
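The value 0.00990099 equals 1/101. With B = 100 permutations, this is what a common permutation p-value estimate returns when none of the permuted statistics reaches the observed one. The sketch below illustrates that arithmetic; the helper perm_pvalue() and the toy null values are hypothetical, and the assumption that {inphr} uses this exact convention is inferred from the output above rather than confirmed.

```r
# Common permutation p-value estimate: the observed statistic is counted
# as one of the B + 1 values (assumed convention, see text).
perm_pvalue <- function(observed, null_stats) {
  (1 + sum(null_stats >= observed)) / (length(null_stats) + 1)
}

B <- 100L
null_stats <- seq_len(B) / B # toy null values in (0, 1]

p_small <- perm_pvalue(2, null_stats) # no null value reaches 2
p_large <- perm_pvalue(0, null_stats) # every null value exceeds 0
c(p_small, p_large)
```

Under this convention, the p-value cannot go below 1 / (B + 1), so a larger B is needed to report smaller p-values.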

Optionally, the two_sample_diagram_test() function can also output the distribution of the test statistic under the null hypothesis as estimated by the permutation scheme. To do that, use the optional argument keep_null_distribution = TRUE. It is also possible to save the permutations themselves as part of the output by setting the optional argument keep_permutations = TRUE.

Testing in the space of diagrams relies on test statistics that depend only on the distances between sampled diagrams. By default, two such statistics are used, which mimic Student’s t-statistic and Fisher’s F-statistic, as proposed in Lovato, I., Pini, A., Stamm, A., & Vantini, S. (2020), Model-free two-sample test for network-valued data. Computational Statistics & Data Analysis, 144, 106896.
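The idea of a test that only needs pairwise distances can be sketched in base R as follows. This is a toy illustration under stated assumptions: the statistic (mean between-group distance minus mean within-group distance) is a simple stand-in and is not the Student-like or Fisher-like statistic of Lovato et al. (2020), the helper names are hypothetical, and real-valued observations with dist() replace persistence diagrams with the Bottleneck or Wasserstein distance.

```r
# Mean of the pairwise distances within one group
mean_within <- function(D, idx) {
  sub <- D[idx, idx, drop = FALSE]
  mean(sub[upper.tri(sub)])
}

# Illustrative distance-only statistic (NOT the one used by the package)
distance_stat <- function(D, idx_x, idx_y) {
  between <- mean(D[idx_x, idx_y])
  between - (mean_within(D, idx_x) + mean_within(D, idx_y)) / 2
}

# Permutation test: reshuffle group labels and recompute the statistic
distance_perm_test <- function(D, n_x, B = 100L) {
  n <- nrow(D)
  observed <- distance_stat(D, seq_len(n_x), (n_x + 1L):n)
  null_stats <- replicate(B, {
    perm <- sample.int(n)
    distance_stat(D, perm[seq_len(n_x)], perm[(n_x + 1L):n])
  })
  (1 + sum(null_stats >= observed)) / (B + 1)
}

set.seed(42)
x <- rnorm(12, mean = 0)
y <- rnorm(12, mean = 5) # clearly shifted second sample
D <- as.matrix(dist(c(x, y)))
p_value <- distance_perm_test(D, n_x = length(x), B = 100L)
p_value
```

Because the procedure only touches the matrix D, any metric on the objects of interest can be plugged in without changing the rest of the code.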

Test in functional spaces

You can use the two_sample_functional_test() function to perform a two-sample test on these persistence diagrams in functional spaces using one of five functional representations of persistence diagrams, namely: (i) Betti, (ii) Euler characteristic, (iii) normalized life, (iv) silhouette and (v) entropy curves. Computation of these functional representations is powered by the {TDAvec} package. For example, to test whether the persistence diagrams from trefoils1 are significantly different from the persistence diagrams from archspirals, you can use the Betti curve representation and run:
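As an illustration of what a functional representation looks like, the following base R sketch evaluates a Betti curve pointwise: at each scale value, it counts the features that are already born and not yet dead. The function betti_curve(), the two-column (birth, death) layout, and the toy diagram are assumptions for this sketch; {TDAvec} may discretize the curve differently.

```r
# A diagram is stored here as a two-column matrix (birth, death) for a single
# homology dimension; this layout is an assumption made for the sketch.
betti_curve <- function(diagram, scale_seq) {
  vapply(
    scale_seq,
    function(t) sum(diagram[, "birth"] <= t & t < diagram[, "death"]),
    integer(1)
  )
}

toy_diagram <- cbind(birth = c(0, 0, 0.5), death = c(1, 2, 1.5))
scale_seq <- seq(0, 2, by = 0.5)
betti_values <- betti_curve(toy_diagram, scale_seq)
betti_values
#> [1] 2 3 2 1 0
```

Applying such a function to every diagram of a sample on a common grid yields one curve per diagram, which is the kind of functional data the test operates on.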

out <- two_sample_functional_test(
  trefoils1,
  archspirals,
  representation = "betti",
  B = 100L
)

The output is a length-4 list. The first two elements, xfd and yfd, are numeric matrices storing evaluations of the functional representation of the diagrams on a grid, which is itself stored as the third element, scale_seq. The remaining element, iwt, holds the output of the testing procedure. You can therefore inspect the functional data that the function produced using something like:

matplot(
  out$scale_seq[-1],
  t(rbind(out$xfd, out$yfd)),
  type = "l",
  col = c(rep(1, length(trefoils1)), rep(2, length(archspirals)))
)

When testing in functional spaces, {inphr} uses the interval-wise testing (IWT) procedure, powered by the {fdatest} package and proposed in Pini, A., & Vantini, S. (2017), Interval-wise testing for functional data. Journal of Nonparametric Statistics, 29(2), 407-424.

The output indicates on which portions of the scale sequence the two samples differ, while providing strong control of the familywise error rate:

plot(out$iwt, xrange = range(out$scale_seq))
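Beyond the plot, one may want the significant portions of the scale sequence as explicit intervals. The base R sketch below shows one way to do this from a grid and a vector of adjusted p-values, one per grid point. Both inputs are hypothetical toy values, and the helper significant_intervals() is not part of {inphr}; where the adjusted p-values are stored inside out$iwt is not described here, so that step is left out.

```r
# Hypothetical inputs: a grid and one adjusted p-value per grid point.
significant_intervals <- function(grid, adjusted_pvalues, alpha = 0.05) {
  runs <- rle(adjusted_pvalues < alpha) # consecutive (non-)significant runs
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1L
  keep <- runs$values # keep only the significant runs
  data.frame(from = grid[starts[keep]], to = grid[ends[keep]])
}

grid <- seq(0, 1, by = 0.1)
adjusted_pvalues <- c(
  0.80, 0.40, 0.03, 0.01, 0.02, 0.30,
  0.60, 0.04, 0.04, 0.20, 0.90
)
intervals <- significant_intervals(grid, adjusted_pvalues)
intervals
```

With these toy values, two intervals are reported: from 0.2 to 0.4 and from 0.7 to 0.8.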

Contributions

Contributions are welcome! Please feel free to open an issue or a pull request if you have any suggestions or improvements. The package is still in its early stages, so any feedback is appreciated.

Code of Conduct

Please note that the {inphr} project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Acknowledgements

This project was funded by an ISC grant from the R Consortium and done in coordination with Jason Cory Brunson and with guidance from Bertrand Michel and Paul Rosen. It builds upon conversations with Mathieu Carrière and Vincent Rouvreau who are among the authors of the GUDHI library. Package development also benefited from the support of colleagues at the Department of Mathematics Jean Leray and the use of equipment at Nantes University.