UMAP: Uniform Manifold Approximation and Projection

Leland McInnes, John Healy, James Melville
Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data:
 The data is uniformly distributed on a Riemannian manifold;
 The Riemannian metric is locally constant (or can be approximated as such);
 The manifold is locally connected.
From these assumptions it is possible to model the manifold with a fuzzy topological structure. The embedding is found by searching for a low dimensional projection of the data that has the closest possible equivalent fuzzy topological structure.
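To make the "fuzzy topological structure" a little more concrete, here is a minimal pure-Python sketch of the local step as the paper describes it: each point assigns a membership strength to each of its k nearest neighbours, with a per-point scale chosen so that the strengths sum to log2(k), and the two directed strengths between a pair of points are merged with a fuzzy union. The function names `membership_strengths` and `fuzzy_union` are illustrative assumptions, not part of the library's API, and the sketch leaves out the nearest neighbour search and the optimisation of the low dimensional layout.

```python
import math


def membership_strengths(knn_dists, n_iter=64, tol=1e-5):
    """Fuzzy membership strengths from one point to its k nearest neighbours.

    knn_dists: distances to the k >= 2 nearest neighbours, sorted ascending
    and excluding the point itself. (Illustrative helper, not library API.)
    """
    k = len(knn_dists)
    target = math.log2(k)
    rho = knn_dists[0]  # distance to the nearest neighbour

    def strengths(sigma):
        # The nearest neighbour always gets strength 1; others decay with distance.
        return [math.exp(-max(0.0, d - rho) / sigma) for d in knn_dists]

    # Binary search for the scale sigma at which the strengths sum to log2(k).
    lo, hi, sigma = 0.0, math.inf, 1.0
    for _ in range(n_iter):
        total = sum(strengths(sigma))
        if abs(total - target) < tol:
            break
        if total > target:
            hi = sigma
            sigma = (lo + hi) / 2.0
        else:
            lo = sigma
            sigma = sigma * 2.0 if hi == math.inf else (lo + hi) / 2.0
    return strengths(sigma)


def fuzzy_union(a, b):
    """Merge the two directed strengths of an edge into one symmetric strength."""
    return a + b - a * b


weights = membership_strengths([1.0, 1.5, 2.0, 3.0])
print([round(w, 3) for w in weights])
print(fuzzy_union(0.5, 0.5))  # 0.75
```

Subtracting the nearest neighbour distance before scaling is what ties each point to at least one neighbour with full strength, which mirrors the local connectivity assumption listed above.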
The details of the underlying mathematics can be found in our paper on ArXiv:
 McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018
The important thing is that you don't need to worry about that: you can use UMAP right now for dimension reduction and visualisation as easily as a drop-in replacement for scikit-learn's t-SNE.
Documentation is available via ReadTheDocs.
Installation, licence, and usage information is available on GitHub: https://github.com/lmcinnes/umap
Benefits of UMAP
UMAP has a few significant wins in its current incarnation.
 First of all, UMAP is fast. It can handle large datasets and high dimensional data without too much difficulty, scaling beyond what most t-SNE packages can manage.
 Second, UMAP scales well in embedding dimension; it isn't just for visualisation! You can use UMAP as a general purpose dimension reduction technique as a preliminary step to other machine learning tasks. With a little care (documentation on how to be careful is coming) it partners well with the hdbscan clustering library.
 Third, UMAP often performs better at preserving aspects of global structure of the data than t-SNE. This means that it can often provide a better "big picture" view of your data as well as preserving local neighbor relations.
 Fourth, UMAP supports a wide variety of distance functions, including non-metric distance functions such as cosine distance and correlation distance. You can finally embed word vectors properly using cosine distance!
 Fifth, UMAP supports adding new points to an existing embedding via the standard sklearn transform method. This means that UMAP can be used as a preprocessing transformer in sklearn pipelines.
 Sixth, UMAP supports supervised and semi-supervised dimension reduction. This means that if you have label information that you wish to use as extra information for dimension reduction (even if it is just partial labelling) you can do that, as simply as providing it as the y parameter in the fit method.
 Finally, UMAP has solid theoretical foundations in manifold learning (see our paper on ArXiv). This both justifies the approach and allows for further extensions that will soon be added to the library (embedding dataframes etc.).
Performance and Examples
UMAP is very efficient at embedding large high dimensional datasets. In particular it scales well with both input dimension and embedding dimension. Thus, for a problem such as the 784-dimensional MNIST digits dataset with 70000 data samples, UMAP can complete the embedding in around 2.5 minutes (as compared with around 45 minutes for most t-SNE implementations). Despite this runtime efficiency UMAP still produces high quality embeddings.
The obligatory MNIST digits dataset, embedded in 2 minutes and 22 seconds using a 3.1 GHz Intel Core i7 processor (n_neighbors=10, min_dist=0.001):
UMAP embedding of MNIST digits
The MNIST digits dataset is fairly straightforward, however. A better test is the more recent "Fashion MNIST" dataset of images of fashion items (again 70000 data samples in 784 dimensions). UMAP produced this embedding in 2 minutes exactly (n_neighbors=5, min_dist=0.1):
UMAP embedding of "Fashion MNIST"
The UCI Shuttle dataset (43500 samples in 8 dimensions) embeds well under correlation distance in 2 minutes and 39 seconds (note the longer time required for correlation distance computations):
UMAP embedding the UCI Shuttle dataset