t-SNE vs. UMAP for Single-Cell Data: The Guide to Dimensionality Reduction

Apr 27

The short answer: For most single-cell RNA sequencing (scRNA-seq) workflows, UMAP (Uniform Manifold Approximation and Projection) has largely superseded t-SNE (t-Distributed Stochastic Neighbor Embedding). UMAP is significantly faster, scales to millions of cells, and in theory preserves more of the data's global structure (meaning relationships between distant clusters are more meaningful). However, neither embedding should be used as the input for quantitative analysis like clustering or trajectory inference; both are visualization tools only.


1. Introduction: The High-Dimensional Problem

Single-cell data is inherently high-dimensional. A dataset with 20,000 genes measured across 5,000 cells creates a massive matrix that the human brain cannot visualize. To make sense of this, we use Dimensionality Reduction techniques to project this complex "hyperspace" into a 2D or 3D plot we can see.

While PCA (Principal Component Analysis) is the mathematical foundation used to de-noise data, it is a linear method that often fails to resolve complex biological heterogeneity. This is where non-linear neighbor-graph methods like t-SNE and UMAP come in. They excel at grouping similar cells together, but they do so using different mathematical philosophies that impact your final visualization.

2. t-SNE vs. UMAP: The Core Differences

While both algorithms aim to place similar cells near each other, their performance differs in three critical areas: Global Structure, Speed, and Initialization.

Feature	t-SNE (t-Distributed Stochastic Neighbor Embedding)	UMAP (Uniform Manifold Approximation and Projection)
Primary Goal	Preserves local neighborhood structure.	Balances local and global structure.
Speed	Slow. Computationally intensive, especially for N > 100k cells.	Fast. Scalable to millions of cells.
Global Structure	Poor. Distant clusters are placed arbitrarily.	Better (debatable). Distant clusters theoretically reflect biological distance.
Initialization	Random in the original algorithm; many current implementations default to PCA.	Laplacian Eigenmaps (spectral) by default; PCA is also common.
Cluster Density	Meaningless. Expands dense clusters, compresses sparse ones.	Meaningless. Density does not reflect biological population size.
Deterministic?	No (Stochastic). Requires a random seed.	No (Stochastic). Requires a random seed.

The "Global Structure" Controversy

Traditionally, t-SNE was criticized for failing to preserve global structure. If Cluster A is far from Cluster B in a t-SNE plot, it doesn't necessarily mean they are biologically distinct; it just means they aren't local neighbors.

UMAP claims to solve this by optimizing a different cost function (cross-entropy) that penalizes both separating similar points and placing dissimilar points close together. However, recent critiques argue that UMAP's perceived superiority in global structure is largely due to its informative initialization rather than the algorithm itself (Kobak and Linderman), and that both embeddings distort distances heavily (Chari and Pachter). When t-SNE is given an informative initialization such as PCA, it captures global structure almost as well as UMAP.

Consensus: Treat both plots as "rubber sheets" that distort distances. Trust local groups (islands), but be skeptical of the empty space between them.
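
For reference, the two cost functions discussed above can be written side by side. This is a sketch in standard notation rather than anything taken from this article: p_ij and v_ij are assumed to denote the high-dimensional similarities used by t-SNE and UMAP respectively, and q_ij and w_ij their low-dimensional counterparts.

```latex
% t-SNE: Kullback-Leibler divergence between high-dim (p) and low-dim (q) similarities
C_{\text{t-SNE}} = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

% UMAP: cross-entropy between high-dim (v) and low-dim (w) edge weights
C_{\text{UMAP}} = \sum_{i \neq j} \left[ v_{ij} \log \frac{v_{ij}}{w_{ij}}
                + (1 - v_{ij}) \log \frac{1 - v_{ij}}{1 - w_{ij}} \right]
```

The first term of the UMAP objective pulls similar points together, much as the t-SNE objective does. The second term, which has no counterpart in the KL objective, adds a cost when points that are dissimilar in the original space end up close together in the plot.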

3. Critical Parameters: Tuning Your Plot

One of the biggest mistakes in bioinformatics is using default parameters without testing. Changing these values can drastically alter your biological interpretation.

t-SNE: The Perplexity

The most important parameter for t-SNE is Perplexity.

Definition: A guess about the number of close neighbors each point has.
Low Perplexity (5-30): Focuses on local detail. Can break single populations into artificial shards.
High Perplexity (50-100+): Merges small clusters. Necessary for preserving global shapes in large datasets.
Rule of Thumb: Perplexity ≈ N^(1/2), the square root of the number of cells (about 71 for 5,000 cells). Treat this as a starting point, not a fixed setting.
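
To make the definition and the rule of thumb above concrete, here is a minimal sketch using only the Python standard library. It assumes the usual definition of perplexity as 2 raised to the Shannon entropy of a point's neighbor distribution; the function names `perplexity` and `sqrt_rule` are illustrative, not part of any t-SNE library.

```python
import math

def perplexity(probs):
    """Perplexity of a discrete distribution: 2 ** Shannon entropy (in bits)."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy

def sqrt_rule(n_cells):
    """Rule-of-thumb starting perplexity: square root of the cell count."""
    return round(math.sqrt(n_cells))

# A point whose similarity is spread evenly over 30 neighbors has perplexity 30,
# which is why perplexity reads as an "effective number of neighbors".
uniform_30 = [1 / 30] * 30
print(round(perplexity(uniform_30), 6))

# Starting values suggested by the square-root rule for a few dataset sizes.
for n in (1_000, 5_000, 100_000):
    print(n, sqrt_rule(n))
```

Whatever starting value the rule gives, it is worth comparing against at least one lower and one higher setting before interpreting the plot.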

UMAP: Neighbors and Distance

UMAP separates the control of local and global structure into two parameters:

n_neighbors: Similar to perplexity. Low values focus on local structure; high values look at the "big picture" (global structure). The default is 15 in umap-learn and Scanpy; Seurat defaults to 30.
min_dist: Controls how tightly points are packed together.
- Low min_dist (0.1): Clumpy, dense clusters. Good for separating distinct cell types.
- High min_dist (0.5+): Spread out. Makes the plot look more like t-SNE.
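
One way to explore these two parameters is to embed the same input several times over a small grid. The sketch below assumes the umap-learn package and a (cells x PCs) NumPy array called `pcs`; the grid values and the helper names `umap_param_grid` and `run_umap_grid` are hypothetical choices for illustration, not recommendations from any library.

```python
from itertools import product

def umap_param_grid(n_neighbors_values=(5, 15, 50), min_dist_values=(0.1, 0.5)):
    """Return one keyword-argument dict per (n_neighbors, min_dist) combination."""
    return [
        {"n_neighbors": n, "min_dist": d}
        for n, d in product(n_neighbors_values, min_dist_values)
    ]

def run_umap_grid(pcs, grid, seed=0):
    """Embed the same PCA matrix once per parameter setting.

    `pcs` is assumed to be a (cells x PCs) NumPy array. umap-learn is
    imported lazily so the grid itself can be built without it installed.
    """
    import umap  # provided by the umap-learn package

    embeddings = {}
    for params in grid:
        reducer = umap.UMAP(random_state=seed, **params)
        key = (params["n_neighbors"], params["min_dist"])
        embeddings[key] = reducer.fit_transform(pcs)
    return embeddings

grid = umap_param_grid()
print(len(grid), grid[0])
```

Plotting the resulting embeddings side by side, colored by the same cell-type labels, shows quickly which features of the layout depend on the parameter choice.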

4. Step-by-Step Protocol: From Reads to Plot

Do not run t-SNE or UMAP directly on your raw gene counts. Follow this industry-standard pipeline (implemented in Seurat, Scanpy, and the Bioconductor workflow described in OSCA) to ensure your visualization represents biological signal, not technical noise.

Quality Control (QC):
- Filter out dead cells (high mitochondrial content) and empty droplets (low gene counts).
Normalization:
- Apply log-normalization to correct for sequencing depth differences (e.g., logNormCounts).
Feature Selection:
- Identify Highly Variable Genes (HVGs). Select the top 1,000–2,000 genes that drive biological variation.
PCA (Linear Reduction):
- Run PCA on the HVGs.
- Select the top Principal Components (e.g., PC 1–30). The later PCs (e.g., PC 45-50) often contain mostly technical noise.
- Tip: Use an "Elbow Plot" to decide how many PCs to keep.
Non-Linear Reduction (The Visualization):
- Run t-SNE or UMAP on the PCA embeddings, not the gene matrix.
- Input: The matrix of Cells x Top_30_PCs.
Visualization:
- Overlay metadata (Batch, Donor, Cell Type) to check for batch effects.
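
The steps above can be sketched end to end on synthetic data. This is a simplified illustration using NumPy and scikit-learn, not the Seurat, Scanpy, or Bioconductor implementation: the counts are simulated, the QC threshold and gene numbers are arbitrary, and highly variable genes are picked by raw variance rather than the mean-variance modeling those toolkits use.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Toy counts matrix (cells x genes); real data would come from your pipeline.
counts = rng.poisson(lam=1.0, size=(300, 500)).astype(float)
counts[:150, :20] += rng.poisson(lam=5.0, size=(150, 20))  # planted population

# 1. QC: drop cells with very few detected genes (threshold is illustrative).
genes_detected = (counts > 0).sum(axis=1)
counts = counts[genes_detected >= 50]

# 2. Normalization: scale each cell to 10,000 counts, then log1p.
depth = counts.sum(axis=1, keepdims=True)
lognorm = np.log1p(counts / depth * 1e4)

# 3. Feature selection: keep the 200 most variable genes (simplified HVG step).
hvg_idx = np.argsort(lognorm.var(axis=0))[-200:]
hvg = lognorm[:, hvg_idx]

# 4. PCA on the HVGs (scikit-learn centers the data internally).
pcs = PCA(n_components=30, random_state=0).fit_transform(hvg)

# 5. Non-linear reduction on the PCs, not on the gene matrix.
embedding = TSNE(n_components=2, perplexity=30, init="pca",
                 random_state=0).fit_transform(pcs)

print(pcs.shape, embedding.shape)
```

The final step would be a scatter plot of `embedding` colored by batch, donor, or cell-type labels, which this toy dataset does not have.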

5. Troubleshooting & Common Pitfalls

"The Blob Fallacy"

Problem: You see distinct clusters (blobs) and assume they are distinct cell types.

Reality: Both algorithms can produce visual structure where none exists, and viewers are prone to pareidolia (seeing patterns in noise). If you run t-SNE on random noise, it can still produce apparent clusters, especially at low perplexity.

Solution: Always validate clusters with Marker Genes (Differential Expression) or projected cell-type labels (e.g., SingleR). If a cluster doesn't have unique molecular markers, it might be an artifact.
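
As a rough illustration of the marker check described above, the sketch below ranks genes by how much higher their mean log-expression is inside a cluster than outside it. It is a crude effect-size ranking on simulated data, not a statistical test; real workflows would use the differential expression functions in their toolkit. The helper name `top_markers` and the toy labels are hypothetical.

```python
import numpy as np

def top_markers(expr, labels, cluster, n_top=5):
    """Rank genes by mean log-expression difference: cluster vs. all other cells."""
    in_cluster = labels == cluster
    diff = expr[in_cluster].mean(axis=0) - expr[~in_cluster].mean(axis=0)
    order = np.argsort(diff)[::-1][:n_top]
    return [(int(g), float(diff[g])) for g in order]

rng = np.random.default_rng(1)
expr = rng.normal(loc=1.0, scale=0.3, size=(200, 50))  # log-normalized toy data
labels = np.repeat([0, 1], 100)                         # hypothetical cluster labels
expr[labels == 1, 7] += 2.0                             # plant one real marker

print(top_markers(expr, labels, cluster=1, n_top=3))  # gene 7 should rank first
print(top_markers(expr, labels, cluster=0, n_top=3))  # no gene stands out
```

A cluster whose best candidates all have differences near zero, as for cluster 0 here, is the kind of blob that deserves suspicion.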

"The Trajectory Trap"

Problem: You see a long, snake-like shape and assume it represents a developmental lineage (trajectory).

Reality: UMAP and t-SNE can tear continuous trajectories into discrete clusters or merge distinct lineages into a loop.

Solution: Use dedicated trajectory inference tools (like Monocle, Slingshot, or RNA Velocity) for developmental questions. Use UMAP only to display the result, not to compute it.

"The Hyperparameter Hazard"

Problem: Accepting default settings as truth.

Solution: Run the algorithm multiple times with different seeds and parameters (perplexity or n_neighbors). If a biological conclusion (e.g., "Cluster A is closely related to Cluster B") disappears when you change the seed, it was likely an artifact.
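
One way to compare two runs numerically, rather than by eye, is to measure how many nearest neighbors each cell keeps between the two embeddings. The sketch below is one possible stability score, assuming NumPy and two (cells x 2) arrays with cells in the same order; the function names are illustrative. The toy data only demonstrates the score's behavior: a rotated copy of a layout keeps its neighbors, while an unrelated layout does not.

```python
import numpy as np

def knn_indices(embedding, k):
    """Indices of each point's k nearest neighbors (Euclidean), excluding itself."""
    sq = (embedding ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * embedding @ embedding.T
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]

def knn_overlap(emb_a, emb_b, k=15):
    """Mean fraction of k-nearest neighbors shared between two embeddings."""
    nn_a, nn_b = knn_indices(emb_a, k), knn_indices(emb_b, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_a, nn_b)]
    return float(np.mean(shared)) / k

rng = np.random.default_rng(2)
emb = rng.normal(size=(200, 2))

theta = np.pi / 3
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

print(knn_overlap(emb, emb @ rotation))              # rotation keeps neighbors
print(knn_overlap(emb, rng.normal(size=(200, 2))))   # unrelated layout: low overlap
```

Because the score ignores rotation and translation, two runs that differ only in orientation still score high, which matches the advice to trust local groups more than global placement.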

6. Conclusion: Which One Should You Use?

Use UMAP if:
- You have a large dataset (>50k cells).
- You want a faster runtime.
- You care about preserving the continuum of cell states (global structure).
- You are publishing in 2024/2025 (it is the current standard).

Use t-SNE if:
- You have a small, well-defined dataset.
- You specifically want to break apart very subtle local sub-clusters that UMAP might merge.
- You are comparing against legacy papers that used t-SNE.

Final Verdict: UMAP is the "Best Answer" for the modern bioinformatician, provided you respect its limitations. It is a map, not the territory.

Frequently Asked Questions (FAQ): t-SNE vs. UMAP for Single-Cell Data

Is t-SNE better than UMAP for single cell?

Generally, no. For modern single-cell genomics, UMAP is preferred because it is significantly faster and handles large datasets (100k+ cells) without prohibitive runtimes. However, t-SNE is arguably "better" at resolving very fine local structures in smaller datasets where global positioning is less important.

When to use UMAP vs t-SNE?

Use UMAP when: You have a large dataset (>10,000 cells), you want to display continuous cell states such as developmental trajectories (computed with a dedicated tool), or you need a fast initial visualization.
Use t-SNE when: You have a small dataset (<5,000 cells), you want to compare your results to older literature, or you suspect UMAP is "over-compacting" your clusters and hiding subtle heterogeneity.

What is UMAP in single cell sequencing?

UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction algorithm used to visualize single-cell data. It takes the high-dimensional gene expression profile of each cell and projects it into 2D space, arranging cells so that those with similar gene expression profiles are placed close together.

When to use t-SNE?

Use t-SNE when your primary goal is to dissect local neighborhoods. If you have a specific cluster of cells and want to see if it splits into sub-populations (e.g., T-cell subtypes), running t-SNE on just that subset can sometimes reveal clearer separation than UMAP.