
Stop using correlation coefficients to validate your new method.
If you are comparing a new clinical instrument, a cheaper lab protocol, or an AI algorithm against a "Gold Standard," your first instinct might be to calculate a correlation coefficient (r). This is a mistake. A high correlation (r > 0.9) does not prove two methods agree—it only proves they are linearly related. A thermometer reading 5°C higher than reality perfectly correlates with a correct thermometer, but it would be disastrous to use in a clinical setting.
Enter the Bland-Altman Plot (also known as the Tukey Mean-Difference Plot).
This guide is the definitive resource for researchers, data scientists, and clinicians. We will move beyond basic definitions to cover step-by-step construction, advanced interpretation of "Limits of Agreement," and troubleshooting common pitfalls like proportional bias and non-normal data.
The Bland-Altman plot is a graphical method used to analyze the agreement between two quantitative measurements. Unlike a simple scatter plot or correlation analysis, it explicitly quantifies:
Systematic Bias: Does Method A consistently measure higher or lower than Method B?
Precision (Random Error): How scattered are the differences?
Limits of Agreement (LoA): The range within which 95% of the differences between the two methods are expected to fall.
Imagine you have a fancy new digital scale and an old analog slide scale. You weigh 100 people on both.
Correlation: If heavier people register as "heavy" on both scales, r will be high. This tells you the scales trend together.
Agreement (Bland-Altman): If the digital scale always reads 2 lbs heavier than the analog scale, they do not agree. Correlation hides this "Bias." The Bland-Altman plot reveals it immediately by plotting the difference (2 lbs) against the mean weight.
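The scale example can be checked numerically. The sketch below simulates two scales with made-up data (the weights, noise level, and the 2 lb offset are illustrative assumptions, not real measurements) and shows that the correlation stays near 1 while the mean difference exposes the offset.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: 100 people weighed on both scales (lbs).
true_weight = rng.uniform(110, 250, size=100)
analog = true_weight + rng.normal(0, 0.5, size=100)          # small random error
digital = true_weight + 2.0 + rng.normal(0, 0.5, size=100)   # same noise, +2 lb offset

r = np.corrcoef(analog, digital)[0, 1]   # Pearson correlation
mean_diff = np.mean(digital - analog)    # systematic difference between scales

print(f"Pearson r       = {r:.4f}")
print(f"Mean difference = {mean_diff:.2f} lbs")
```

The correlation says nothing about the offset; only the difference between paired readings does.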
You do not need expensive software to understand the math. Here is the manual protocol you can replicate in Excel, Python, or R.
Ensure you have paired measurements (Method A and Method B) for the same samples (subjects).
n = Sample Size (Recommended n > 50 for stable Limits of Agreement).
For every sample, subtract the reading of Method B from Method A.
D = A - B
For every sample, calculate the average of the two methods.
M = (A + B) / 2
Critical Nuance: We use the mean of both methods as the X-axis because we rarely know the "True" value. If Method A is a certified Gold Standard Reference, some statisticians argue you should plot D against A (Krouwer, 2008), but the standard Bland-Altman approach uses the mean (A+B)/2 to avoid artificial correlation between the error and the magnitude.
Mean Bias (đ): The average of all Differences (D).
Standard Deviation (SD): The standard deviation of the Differences (D).
These lines represent the interval where 95% of future differences are expected to lie, assuming the differences are normally distributed.
Upper LoA: đ + 1.96 x SD
Lower LoA: đ - 1.96 x SD
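The calculations above can be sketched in a few lines of Python with NumPy. The function name `bland_altman_stats` and the toy readings are hypothetical, chosen only for illustration; the formulas are the ones given in the text (D = A - B, M = (A + B) / 2, bias ± 1.96 x SD).

```python
import numpy as np

def bland_altman_stats(a, b):
    """Return differences, means, bias, SD, and 95% limits of agreement."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("a and b must be paired (same length)")
    diff = a - b              # D = A - B
    mean = (a + b) / 2.0      # M = (A + B) / 2
    bias = diff.mean()        # mean bias
    sd = diff.std(ddof=1)     # sample SD of the differences
    return {
        "diff": diff,
        "mean": mean,
        "bias": bias,
        "sd": sd,
        "loa_lower": bias - 1.96 * sd,
        "loa_upper": bias + 1.96 * sd,
    }

# Toy paired readings, made up for illustration only
method_a = [10.2, 11.8, 9.9, 12.5, 10.9]
method_b = [10.0, 11.5, 10.1, 12.0, 10.5]
stats = bland_altman_stats(method_a, method_b)
print(stats["bias"], stats["sd"], stats["loa_lower"], stats["loa_upper"])
```

Note the `ddof=1`: NumPy defaults to the population SD, whereas the sample SD is the usual choice here. Five pairs are far too few for stable limits; the toy data only keeps the arithmetic easy to follow.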
X-Axis: Mean (M)
Y-Axis: Difference (D)
Add Horizontal Lines: Draw lines for the Mean Bias and the Upper/Lower LoA, plus a reference line at zero so the bias can be compared against "no difference."
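One possible way to draw the plot is with Matplotlib. This is a minimal sketch, not a reference implementation: the function name `bland_altman_plot`, the colors, and the simulated data are all assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import numpy as np

def bland_altman_plot(a, b, ax=None):
    """Scatter D against M and add the zero, bias, and LoA lines."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diff, mean = a - b, (a + b) / 2.0
    bias, sd = diff.mean(), diff.std(ddof=1)
    if ax is None:
        _, ax = plt.subplots()
    ax.scatter(mean, diff, alpha=0.6)
    ax.axhline(0.0, color="grey", linewidth=0.8)  # zero reference line
    ax.axhline(bias, color="black", label=f"bias = {bias:.2f}")
    ax.axhline(bias + 1.96 * sd, color="red", linestyle="--", label="upper LoA")
    ax.axhline(bias - 1.96 * sd, color="red", linestyle="--", label="lower LoA")
    ax.set_xlabel("Mean of the two methods, M")
    ax.set_ylabel("Difference, D = A - B")
    ax.legend()
    return ax

# Simulated paired readings (illustrative only): B reads about 2 units above A
rng = np.random.default_rng(1)
a = rng.normal(100, 15, size=60)
b = a + 2.0 + rng.normal(0, 3, size=60)
ax = bland_altman_plot(a, b)
```

To write the figure to disk, call `ax.figure.savefig("bland_altman.png")` afterwards.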
A Bland-Altman plot is not just a "pass/fail" test. It is a diagnostic tool.
Look at the central horizontal line representing the Mean Bias (đ).
Close to Zero: Excellent. No systematic difference.
Significant Shift: If the line is at +3.7 units, Method A consistently reads 3.7 units higher than Method B.
Is it Acceptable? Statistical significance (p-value) is irrelevant here. You must define a Clinical Acceptance Limit a priori. If a glucose meter is off by 3.7 mg/dL, is that clinically dangerous? If no, the bias is acceptable.
Look at the spread of dots between the Limits of Agreement.
Tight Scatter: High precision. If the Limits of Agreement also fall inside your clinical acceptance limit, the methods can be used interchangeably.
Wide Scatter: The methods may have zero bias (average out to zero) but are too erratic to be used interchangeably.
Points outside the LoA lines (the outer 5%) are expected, but extreme outliers require investigation. Was it a clerical error? A sample handling issue?
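To find which samples to investigate, you can list the pairs whose difference falls outside the limits. The helper below is a hypothetical sketch (the name `flag_outside_loa` and the toy data are assumptions); it flags points for review and does not decide whether they should be excluded.

```python
import numpy as np

def flag_outside_loa(a, b):
    """Return indices of paired samples whose difference lies outside the 95% LoA."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    bias, sd = diff.mean(), diff.std(ddof=1)
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    return np.flatnonzero((diff < lower) | (diff > upper))

# Toy data: small alternating differences plus one gross error at index 7
b = np.linspace(50, 150, 30)
a = b + np.tile([0.5, -0.5, 0.2], 10)
a[7] += 8.0  # e.g. a clerical error on one sample

flagged = flag_outside_loa(a, b)
print("Samples to investigate:", flagged.tolist())
```

Keep in mind that an extreme outlier inflates the SD and therefore widens the limits themselves, so it is worth re-examining the plot after any confirmed error is corrected.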
Does the scatter get wider as the X-axis value increases?
Uniform Scatter: Good. The error is constant (Homoscedasticity).
Funnel Shape (< or >): This indicates heteroscedasticity, usually discussed together with Proportional Bias: the size of the error scales with the measured value (e.g., the device is 5% off, rather than 5 units off).
Solution: If you see a funnel, log-transform the data or plot the Percentage Difference instead of absolute difference.
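The percentage-difference option can be sketched as follows. The function name `percent_difference_loa` and the simulated 5% error are assumptions for illustration; the idea is simply to express each difference as a percentage of the pair mean before computing the limits.

```python
import numpy as np

def percent_difference_loa(a, b):
    """95% limits of agreement on differences expressed as % of the pair mean."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    mean = (a + b) / 2.0
    pct = 100.0 * (a - b) / mean   # assumes no pair mean is zero
    bias, sd = pct.mean(), pct.std(ddof=1)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Simulated readings whose error is about 5% of the value (illustrative only)
rng = np.random.default_rng(2)
b = np.linspace(20, 500, 200)
a = b * (1 + rng.normal(0, 0.05, size=200))

abs_diff = a - b
print("SD of absolute differences, low half :", abs_diff[:100].std(ddof=1))
print("SD of absolute differences, high half:", abs_diff[100:].std(ddof=1))
print("Percent bias, lower LoA, upper LoA   :", percent_difference_loa(a, b))
```

In this simulation the absolute differences spread out as the values grow, while the percentage differences keep a roughly constant spread, which is what makes a single pair of limits meaningful again.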
The Limits of Agreement calculation (đ ± 1.96 SD) assumes the differences are normally distributed.
Check: Create a histogram of the differences (D).
Solution: If skewed, use Non-Parametric Limits of Agreement. Instead of SD, calculate the 2.5th and 97.5th percentiles of the differences to define your limits.
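A minimal sketch of the percentile-based limits, using `np.percentile`. The function name `nonparametric_loa` and the skewed toy data are assumptions for illustration.

```python
import numpy as np

def nonparametric_loa(a, b):
    """Limits of agreement from the 2.5th and 97.5th percentiles of the differences."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    lower, upper = np.percentile(diff, [2.5, 97.5])
    return lower, upper

# Simulated right-skewed differences (illustrative only)
rng = np.random.default_rng(3)
b = rng.uniform(50, 150, size=200)
a = b + rng.exponential(scale=2.0, size=200) - 2.0

lower, upper = nonparametric_loa(a, b)
inside = np.mean((a - b >= lower) & (a - b <= upper))
print(f"Non-parametric LoA: [{lower:.2f}, {upper:.2f}], fraction inside = {inside:.2f}")
```

Percentile estimates in the tails are noisy with small samples, so this approach benefits from a larger n than the SD-based limits.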
A common question on forums (e.g., ResearchGate) is whether BA plots make sense if Method A is calibrated to Method B.
Answer: Yes. Calibration ensures the slope is corrected, but it does not fix scatter or non-linear bias "between" calibration points. A Bland-Altman plot will reveal if the calibration holds true across the entire dynamic range or if local biases exist.
Bland-Altman: Descriptive. It visualizes the interval of agreement. You, the expert, decide if that interval is acceptable.
Equivalence Testing: Inferential. It formally tests if the means are "statistically equivalent" within bounds.
Best Practice: Use them as complements. Use Equivalence tests for p-values and Bland-Altman for visual confirmation of individual sample behavior.
When publishing your Bland-Altman analysis, ensure you include:
[ ] The Mean Bias (đ) with its 95% Confidence Interval (CI).
[ ] The Standard Deviation of the differences.
[ ] The Upper and Lower Limits of Agreement (LoA) with their respective 95% CIs.
[ ] A predefined statement on what constitutes "Clinical Acceptance."
[ ] A visual check for proportional bias (funneling).
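The confidence intervals in the checklist can be approximated as below. This sketch uses the large-sample approximations from Bland and Altman (1986): the standard error of the bias is SD/sqrt(n), and the standard error of each limit is about sqrt(3 x SD^2 / n). It also assumes a normal quantile of 1.96 instead of the t quantile with n - 1 degrees of freedom, which is reasonable only for larger samples. The function name `bland_altman_ci` is hypothetical.

```python
import math
import numpy as np

def bland_altman_ci(a, b, z=1.96):
    """Approximate 95% CIs for the bias and for each limit of agreement."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    n = diff.size
    bias, sd = diff.mean(), diff.std(ddof=1)
    se_bias = sd / math.sqrt(n)
    se_loa = sd * math.sqrt(3.0 / n)   # Bland & Altman (1986) approximation
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    return {
        # each entry: (estimate, CI lower bound, CI upper bound)
        "bias": (bias, bias - z * se_bias, bias + z * se_bias),
        "loa_lower": (lower, lower - z * se_loa, lower + z * se_loa),
        "loa_upper": (upper, upper - z * se_loa, upper + z * se_loa),
    }

# Simulated paired readings (illustrative only)
rng = np.random.default_rng(4)
b = rng.normal(120, 20, size=80)
a = b + 1.5 + rng.normal(0, 4, size=80)

ci = bland_altman_ci(a, b)
for name, (est, lo, hi) in ci.items():
    print(f"{name:10s} {est:7.2f}  95% CI [{lo:.2f}, {hi:.2f}]")
```

For small samples, replace `z` with the t quantile for n - 1 degrees of freedom (for example from `scipy.stats.t.ppf(0.975, n - 1)`).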
What is the method comparison Bland-Altman?
The Bland-Altman method comparison is a statistical technique used to assess the level of agreement between two quantitative methods of measurement (e.g., a new medical device vs. a gold standard).
Unlike correlation analysis—which only calculates the strength of a linear relationship—the Bland-Altman analysis constructs a "Difference Plot" (or Tukey Mean-Difference plot). It specifically calculates the mean difference (bias) and the standard deviation of the differences to determine if the two methods are clinically interchangeable. It solves the problem where two methods might be perfectly correlated (r=0.99) but still produce significantly different values.
What is the purpose of a Bland-Altman plot?
The primary purpose of a Bland-Altman plot is to reveal systematic bias and random error that correlation coefficients hide. It answers the critical question: "Can I replace the old method with the new method without putting patients or data quality at risk?"
Specifically, it allows researchers to:
Visualize Bias: Determine if the new method systematically measures too high or too low.
Quantify Precision: Measure the spread of agreement using Limits of Agreement (LoA).
Detect Anomalies: Identify outliers or proportional bias (where error increases as the measurement value increases).
How do you interpret Bland-Altman plot results?
Interpreting the plot involves analyzing three key visual elements against your clinical acceptance criteria:
The Mean Difference Line (Bias): If this line is far from zero, there is a fixed bias. You must decide if this shift (e.g., +2.5 units) is clinically acceptable.
The Limits of Agreement (LoA): These are the upper and lower lines (± 1.96 SD). They represent the range where 95% of the differences fall. If this range is wider than what is clinically safe (e.g., a blood pressure range of ± 20 mmHg might be too dangerous), the methods do not agree, regardless of their statistical significance.
The Scatter Pattern: If the dots form a "funnel" shape (getting wider to the right), the size of the disagreement depends on the magnitude of the measurement, meaning the two methods agree less well at higher values.
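The comparison against a clinical acceptance limit can be written as a simple check. This is a sketch under stated assumptions: the function name `within_acceptance` is hypothetical, and it assumes a symmetric limit of plus or minus `max_allowed_diff` that you defined before collecting data.

```python
import numpy as np

def within_acceptance(a, b, max_allowed_diff):
    """True if both 95% limits of agreement lie inside +/- max_allowed_diff."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    bias, sd = diff.mean(), diff.std(ddof=1)
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    return bool(lower >= -max_allowed_diff and upper <= max_allowed_diff)

# Toy data with differences of +/-1 and +/-0.5 units (illustrative only)
b = np.linspace(80, 160, 40)
a = b + np.tile([1.0, -1.0, 0.5, -0.5], 10)

ok_wide = within_acceptance(a, b, max_allowed_diff=5.0)
ok_tight = within_acceptance(a, b, max_allowed_diff=1.0)
print("Acceptable within +/-5 units:", ok_wide)
print("Acceptable within +/-1 unit :", ok_tight)
```

The same data pass or fail depending only on the acceptance limit, which is why that limit has to be set on clinical grounds rather than derived from the plot.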
How do you compare two methods of measurement?
To correctly compare two methods of measurement, you must follow a protocol that assesses agreement, not just correlation.
Data Collection: Measure the same set of samples (n > 50) using both Method A and Method B.
Calculate Differences: For every sample, subtract Method B from Method A (A - B).
Calculate Means: For every sample, find the average of the two methods ((A + B) / 2).
Plot the Data: Create a scatter plot with the Means on the X-axis and Differences on the Y-axis.
Analyze Limits: Calculate the Mean Bias and the Limits of Agreement (Mean ± 1.96 x Standard Deviation). Compare these limits to your pre-defined maximum allowable error.
References
Giavarina, D. (2015). Understanding Bland Altman analysis. Biochemia Medica, 25(2): 141–151.
Bland, J. M., & Altman, D. G. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet.
GraphPad Prism Guide. How to: Bland-Altman plot.
MedCalc Manual. Comparison of multiple methods.
StackExchange CrossValidated. Discussion on Equivalence vs. Agreement.
Numiqo. (2023). Bland-Altman Plot [Simply explained]. YouTube.
Reddit r/AskStatistics. ELI5: The Bland-Altman analysis.
ResearchGate. Discussion on calibrated methods.

