Most Inaccurate Metrics Measured by Smartwatches

From heart rate, blood pressure, and sleep to energy expenditure, fatigue status, and maximum oxygen uptake, smartwatches and fitness bands display more and more metrics. But do some of these values seem not quite accurate?

You're right: the accuracy of these metrics varies greatly. Some are accurate enough for doctors to use as a reference, while others are only good for casual observation.

The table below summarizes the accuracy of some common metrics, with the highest accuracy rated as 5 stars and the lowest as 1 star.

Accuracy	Metric
★★★★	Heart rate, running distance, blood pressure measured by the cuff method, blood oxygen saturation
★★★	Heart rate variability, sleep duration, energy expenditure, blood pressure measured by photoplethysmography, cadence/stride length, oxygen uptake/maximum oxygen uptake
★★	Sleep stages, sleep quality, training load, recovery status, lactate threshold
★	Training effect, physical condition, fatigue status

These ratings are overall assessments across usage conditions. Manufacturers use different algorithms, and the grading is somewhat subjective. Accuracy is judged against a gold standard: 5 stars means very good, 4 stars good, 3 stars average, 2 stars poor, and 1 star very poor.

No metric earns 5 stars, which means none of the data is perfectly accurate. Why are the measurements inaccurate? Possible causes include the measurement method, the sensors, the algorithms, how the device is worn, and how the data are interpreted.

Next, we will fill in the details behind the table: why each metric is or is not accurate, how accurate it is, and what you can do to measure more accurately. By the end of the article, you will have a table with twice as much information, along with guidance on how to judge and use each metric.

Accuracy depends on whether the metric is measured, estimated, or newly created

Today, a smartwatch weighing only a few dozen grams can integrate nearly ten types of sensors. Examples include a photoplethysmography sensor for measuring heart rate, a GPS receiver for determining latitude and longitude, and barometric, temperature, and acceleration sensors.

Wearable devices (smartwatches, bands, rings, etc.) measure a limited set of basic metrics directly through their sensors. The raw data still require algorithmic processing, but for simplicity we call this direct measurement. These metrics are then combined in calculations to produce a seemingly endless stream of new metrics. In other words, with a grounding in physiology and exercise physiology, a few basic measurements can yield a great many metrics.

The number of metrics is increasing, but are they all reliable? Measurement inherently carries the possibility of error, but most metrics have a recognized measurement method with the least error, commonly referred to as the “gold standard”. For example, the gold standard for measuring heart rate is an electrocardiogram, for measuring sleep duration and stages it is polysomnography, and for measuring energy expenditure it is the doubly labeled water method.

Gold-standard measurements are typically made under laboratory conditions. The equipment is usually expensive, the procedure is complex, and experienced operators are needed. None of the metrics provided by smartwatches, bands, or rings are currently measured with the gold standard, which is why no metric in the table at the beginning of the article earns 5 stars. Wearable devices sacrifice some accuracy in exchange for measurement that is more convenient and less costly.

The gold standard for measuring heart rate is the electrocardiogram; smartwatches measure heart rate continuously, which is convenient but slightly less accurate | medpick/Sina Testing

Among the metrics with a gold standard, some data is obtained through direct measurement or simple calculations, such as measuring heart rate through photoplethysmography and calculating pace based on distance and time.
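Pace is one of the simple calculations mentioned above: elapsed time divided by distance. The minimal Python sketch below uses a hypothetical function name. It also shows how an error in a measured input, here GPS distance, carries straight into the calculated value.

```python
def pace_min_per_km(distance_km: float, duration_s: float) -> str:
    """Return pace as 'M:SS' per kilometer from distance and elapsed time."""
    if distance_km <= 0:
        raise ValueError("distance must be positive")
    seconds_per_km = duration_s / distance_km  # simple division, no estimation involved
    minutes, seconds = divmod(round(seconds_per_km), 60)
    return f"{minutes}:{seconds:02d}"

run_time_s = 27 * 60 + 30                  # 27 min 30 s
print(pace_min_per_km(5.0, run_time_s))    # true distance       -> 5:30
print(pace_min_per_km(5.1, run_time_s))    # GPS reads 2% long   -> 5:24
```

A 2% overestimate of distance makes the displayed pace about six seconds per kilometer faster, even though the calculation itself is exact.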

Other data are estimated by algorithms from directly measured data; energy expenditure, for example, is estimated from heart rate and accelerometer data. Different manufacturers use different algorithms, and even a single manufacturer keeps refining its algorithms, so results can vary greatly. In most cases, estimated data are less accurate than directly measured data.

In addition, many metrics have no gold standard, and these are generally considered inaccurate. They often exist only as concepts in exercise science (e.g., load, fatigue, recovery) that cannot be measured precisely, and sometimes subjective feelings serve as the standard. Some metrics lack even a scientific definition and were created through an “arms race” between manufacturers.

Metric Category	Measured Metrics	Estimated Metrics	Created Metrics
Gold standard exists	Yes	Yes	No
How data are obtained	Measured by sensors	Estimated by algorithms, with a gold standard to check against	Estimated by algorithms from definitions
Accuracy	Relatively accurate	Average	Relatively inaccurate
Example metrics	Heart rate, blood oxygen saturation, distance, pace	Energy expenditure, sleep duration, sleep stages, performance prediction	Sleep quality, stress level, training load, recovery level, training effect

How big is the gap? Comparing with the gold standard reveals the answer

To find out how accurate a metric is, measure it with both the wearable device and the gold standard, then compare the results. Most manufacturers do this, but they generally do not disclose how large the gap is. You can still get a rough idea of the data's accuracy by analyzing how the data are obtained and reading the studies researchers have published.
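Such a comparison usually comes down to pairing readings taken at the same moments and summarizing the differences. Below is a minimal Python sketch, assuming paired heart rate readings. The numbers are invented for illustration and are not taken from any study. The three summaries work as follows:
- Mean bias shows systematic over- or underestimation.
- Mean absolute error (MAE) shows the typical size of the gap in bpm.
- Mean absolute percentage error (MAPE) expresses that gap relative to the reference.

```python
from statistics import mean

def agreement(device, reference):
    """Summarize how closely device readings track paired reference (gold-standard) readings."""
    if len(device) != len(reference) or not device:
        raise ValueError("need paired, non-empty readings")
    diffs = [d - r for d, r in zip(device, reference)]
    return {
        "bias": mean(diffs),                    # average over- (+) or under- (-) estimation
        "mae": mean(abs(x) for x in diffs),     # typical gap, same unit as the readings
        "mape_pct": 100 * mean(abs(x) / r for x, r in zip(diffs, reference)),
    }

ecg = [62, 75, 98, 131, 158]     # hypothetical ECG heart rates (bpm)
watch = [63, 74, 101, 125, 149]  # hypothetical watch readings at the same moments
print(agreement(watch, ecg))
```

In this made-up sample the watch reads 2.4 bpm low on average, the typical gap is 4 bpm, and the relative error is about 3.3%. Most of the error comes from the two highest heart rates. Published validation studies often report further statistics, such as Bland-Altman limits of agreement.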

Heart Rate, One of the Most Accurate Metrics

Heart rate underlies many health and exercise metrics. Smartwatches and bands display it directly and also use it to estimate many other metrics, so the accuracy of heart rate measurement determines the accuracy of many other metrics.

The gold standard for measuring heart rate is an electrocardiogram, which detects the electrical activity of the heart and measures heart rate using electrodes placed on the chest and limbs.

When wearable devices continuously display heart rate, the measurement method is typically photoplethysmography (PPG). This measurement method can be affected by various factors, such as exercise intensity, type of exercise, wrist activity, tightness of the band, skin pigmentation, surface dirt, arrhythmias, etc.
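To see why wrist movement matters for PPG, here is a deliberately naive Python sketch that estimates heart rate by counting pulse peaks. It assumes an idealized sinusoidal pulse wave plus a hypothetical 2.5 Hz "motion" component. Real devices filter the signal and use accelerometer data to suppress motion, with algorithms that manufacturers do not publish.

```python
import math

def heart_rate_from_ppg(signal, fs_hz):
    """Estimate heart rate (bpm) by counting pulse peaks in a PPG-like signal."""
    threshold = sum(signal) / len(signal)  # only peaks above the signal's mean count
    peaks = [
        i for i in range(1, len(signal) - 1)
        if signal[i] > signal[i - 1] and signal[i] >= signal[i + 1] and signal[i] > threshold
    ]
    if len(peaks) < 2:
        return None  # not enough beats to measure an interval
    # Average peak-to-peak interval in seconds, converted to beats per minute.
    intervals = [(b - a) / fs_hz for a, b in zip(peaks, peaks[1:])]
    return 60 / (sum(intervals) / len(intervals))

fs = 50                                   # samples per second
t = [n / fs for n in range(10 * fs)]      # 10 seconds of samples
clean = [math.sin(2 * math.pi * 1.2 * x) for x in t]  # idealized pulse: 1.2 Hz = 72 bpm
# Hypothetical wrist-motion artifact added on top of the same pulse.
noisy = [p + 0.8 * math.sin(2 * math.pi * 2.5 * x) for p, x in zip(clean, t)]

print(round(heart_rate_from_ppg(clean, fs)))  # 72
print(round(heart_rate_from_ppg(noisy, fs)))  # far above 72: motion peaks counted as beats
```

On the clean signal the estimate matches the true 72 bpm. Once the motion component is added, its peaks are counted as heartbeats and the estimate rises far above the true rate. This is an exaggerated picture of why readings become less reliable as movement and exercise intensity increase.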

According to a review of 18 studies, heart rate measurement is more accurate at rest or during low-intensity exercise. As exercise intensity increases, both the chance of obtaining a reading and the reliability of the data drop significantly. An analysis of 249 studies found an average heart rate measurement error of about ±3%.

Therefore, when a wearable shows a stable reading at rest, the heart rate data are fairly credible and can help you assess your health and exercise status. Accuracy drops during intense exercise; if you need more accurate data, wear a chest strap heart rate monitor.

Chest strap heart rate monitor | Polar

Sleep, Total Duration Slightly More Accurate Than Stages and Quality

Some people check last night’s sleep metrics as soon as they wake up. Even when they feel they slept well, they are often surprised by a low overall score, but there is no need to worry.

The gold standard for measuring sleep is polysomnography, which simultaneously measures multiple signals, including electroencephalogram, electrocardiogram, electrooculogram, and electromyogram. After obtaining the raw data, sleep experts will integrate the results to determine sleep duration and manually score to analyze sleep stages.

Polysomnography illustration | verywell

Wearable devices evaluate sleep by measuring heart rate and wrist movement (using accelerometers) and deriving metrics such as heart rate variability and respiratory rate. Neural network models combine these with personal information such as age, height, weight, and sex. From this they estimate:
- bedtime and wake time
- sleep onset and final awakening
- total sleep duration and sleep latency
- time awake
- the duration and proportion of each sleep stage

Finally, they compute an overall sleep score from this information.

How wearable devices evaluate sleep | Provided by the author

From a measurement perspective, if one remains still for a long time before falling asleep, it may be misclassified as entering sleep, leading to an overestimation of total sleep duration.
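This overestimation is easy to reproduce with a toy rule that scores any sufficiently quiet minute as sleep. The Python sketch below is an assumption for illustration only, since real devices also use heart rate and trained models. The activity counts and the threshold are invented.

```python
def classify_sleep(activity_counts, threshold=20):
    """Label each 1-minute epoch 'S' (sleep) or 'W' (wake) from wrist-activity counts."""
    # Toy rule (assumed threshold): any minute quieter than the threshold counts as sleep.
    return ["S" if count < threshold else "W" for count in activity_counts]

# Hypothetical night: 30 minutes lying still but awake, then 60 minutes actually asleep.
awake_but_still = [5] * 30
asleep = [2] * 60
labels = classify_sleep(awake_but_still + asleep)

print(labels.count("S"), "minutes scored as sleep; 60 minutes actually asleep")
# prints: 90 minutes scored as sleep; 60 minutes actually asleep
```

Because the rule sees only movement, the half hour of lying still is added to total sleep time.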

The specific algorithms differ from brand to brand, so the errors differ as well. A review of wearable technology in sleep research compared smartwatches with polysomnography:
- For total sleep time they perform relatively well, with overall accuracy of about 70% to 90%.
- For sleep stages they perform poorly, with accuracy of about 50% to 90% for light sleep and about 30% to 80% for deep sleep and REM sleep.
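Figures like these typically come from an epoch-by-epoch comparison. The night is split into 30-second epochs, and each one is scored both by polysomnography (PSG) and by the device. The minimal Python sketch below uses invented labels. For each PSG-scored stage it computes the share of epochs the device labeled the same way, often called sensitivity. Studies may report other measures as well, such as specificity or Cohen's kappa.

```python
from collections import Counter

def per_stage_agreement(device, psg):
    """For each PSG-scored stage, return the share of its epochs the device scored identically."""
    if len(device) != len(psg):
        raise ValueError("device and PSG must cover the same epochs")
    totals = Counter(psg)                                      # epochs per stage according to PSG
    hits = Counter(p for d, p in zip(device, psg) if d == p)   # epochs where both agree
    return {stage: hits[stage] / totals[stage] for stage in totals}

# Hypothetical 30-second epochs (five minutes in total).
psg = ["wake", "light", "light", "deep", "deep", "light", "rem", "rem", "light", "wake"]
device = ["light", "light", "light", "light", "deep", "light", "light", "rem", "light", "wake"]

print(per_stage_agreement(device, psg))
# {'wake': 0.5, 'light': 1.0, 'deep': 0.5, 'rem': 0.5}
```

The device in this example labels almost everything as light sleep. As a result, light sleep scores perfectly while deep and REM sleep are half wrong, a pattern consistent with the ranges quoted above.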

As for the overall sleep score, medicine has no equivalent. When doctors evaluate sleep quality and plan treatment, they analyze many factors. These include sleep onset time, sleep duration, sleep efficiency, abnormal events during sleep, use of sleeping pills, and daytime functioning at home and at work.

Among estimated metrics such as those for sleep, the relatively accurate ones, such as total sleep duration, can serve as a reference; the others should not cause anxiety. If you generally feel well, there is no need to worry about a low overall sleep score. If you consistently feel you are not sleeping well, a polysomnography study can help identify the problem.

Recovery Status, One of the Least Accurate Metrics

All of the metrics above have gold standards, but some metrics do not. Instead they are created on the basis of certain theories; recovery status is one example.

To achieve progress in training, one must continually increase training stress without crossing the line of overtraining, making the measurement and detection of recovery status very important. However, recovery status is a very comprehensive and complex metric, influenced by training (volume, type, intensity, etc.), non-training (work, relationships, illness, medication, etc.), and recovery (sleep, diet, recovery time, recovery methods, etc.) factors.

Recovery status is influenced by many factors, including training, sleep, and diet | oscarcaregroup

The activity of the autonomic nervous system is a key indicator of the stress the body is under and of its recovery status. Under stress, sympathetic nervous system activity rises and parasympathetic activity falls; during recovery, the opposite occurs. Studies have shown that heart rate variability is a powerful tool for analyzing the interplay between the sympathetic and parasympathetic nervous systems.
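The article does not say which heart rate variability measure each device uses. One widely used time-domain measure is RMSSD, the root mean square of successive differences between heartbeat (RR) intervals. Higher values are generally associated with greater parasympathetic activity. Here is a minimal Python sketch with invented RR intervals:

```python
import math

def rmssd(rr_ms):
    """Root mean square of successive differences between RR intervals (milliseconds)."""
    if len(rr_ms) < 2:
        raise ValueError("need at least two RR intervals")
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]  # beat-to-beat changes
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

rr = [800, 810, 790, 805, 795]  # hypothetical RR intervals (ms)
print(round(rmssd(rr), 1))      # 14.4
```

RMSSD depends on accurate beat-to-beat timing, so it inherits any error in detecting individual beats. This is one reason HRV is commonly measured at rest or during sleep.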

Because there is no gold standard, some wearable manufacturers use weighted models to estimate recovery status. The process has three steps:
1. Collect a series of indicators that may affect recovery, such as heart rate, sleep, and training status.
2. Compute derived values such as heart rate variability, respiratory rate, and oxygen consumption.
3. Guided by exercise science principles, sum these indicators with weights to obtain a single value representing recovery status.
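The weighted-model idea can be sketched in a few lines of Python. Everything specific here is an assumption: the three inputs, the weights, and the mapping to a 0-100 scale are invented for illustration. None of it reproduces any manufacturer's model.

```python
# Hypothetical weights; not taken from any manufacturer's published model.
WEIGHTS = {"hrv": 0.4, "resting_hr": 0.3, "sleep": 0.3}

def recovery_score(today, baseline):
    """Combine today's readings, relative to personal baselines, into a 0-100 score."""
    # Each ratio is > 1 when today looks "better than usual" for that indicator.
    ratios = {
        "hrv": today["hrv"] / baseline["hrv"],                       # higher HRV: better
        "resting_hr": baseline["resting_hr"] / today["resting_hr"],  # lower resting HR: better
        "sleep": today["sleep_h"] / baseline["sleep_h"],             # more sleep: better
    }
    weighted = sum(WEIGHTS[name] * ratio for name, ratio in ratios.items())
    # A typical day (all ratios 1.0) maps to 50; clamp to the 0-100 range.
    return max(0, min(100, round(50 * weighted)))

baseline = {"hrv": 60, "resting_hr": 55, "sleep_h": 7.5}
today = {"hrv": 48, "resting_hr": 61, "sleep_h": 6.0}
print(recovery_score(today, baseline))     # 42
print(recovery_score(baseline, baseline))  # 50
```

Anything that is not an input, such as illness or stress at work, cannot move this score at all, however much it affects real recovery.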

Recovery status score, estimated from heart rate variability, resting heart rate, sleep, and respiratory rate | WHOOP

The drawback of this approach is that it cannot account for every influencing factor. Physiological cycles and interpersonal relationships, for example, can affect recovery but may be left out of the model. The displayed value can then diverge from one's actual state, and training guided by that value may lead to undertraining or overtraining.

Still, it makes sense for manufacturers to offer such metrics, since not everyone has the knowledge to analyze and interpret raw recovery-related data. Such a score sacrifices some accuracy and rests on simple assumptions (e.g., less sleep plus more activity equals poorer recovery). Even so, it may alert users far more effectively than complex physiological data would.

How to Utilize These Metrics?

According to the classification method at the beginning of the article, all metrics can be divided into three categories: measured, estimated, and created.

Measured metrics, such as heart rate, distance, heart rate variability, and pace, usually have smaller errors. They are relatively credible and can be used as references for observing health status and adjusting lifestyle and exercise plans. For example, if your heart rate this morning is higher than usual, could it be due to poor sleep last night? Or have you been exercising too much recently? Should you reduce the load or take a rest day?
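The "higher than usual" check can be made concrete by comparing this morning's reading with your own recent history. Here is a minimal Python sketch. The readings and the 2-standard-deviation cut-off are assumptions for illustration, not a clinical rule.

```python
from statistics import mean, stdev

def deviation_from_baseline(today, history):
    """How many standard deviations today's value lies above (+) or below (-) the recent average."""
    if len(history) < 2:
        raise ValueError("need at least two past values")
    spread = stdev(history)
    if spread == 0:
        raise ValueError("history has no variation")
    return (today - mean(history)) / spread

past_week = [56, 55, 57, 54, 56, 55, 57]  # hypothetical morning resting heart rates (bpm)
z = deviation_from_baseline(62, past_week)
if z > 2:  # assumed cut-off for "clearly higher than usual"
    print(f"Resting HR is {z:.1f} SD above your recent baseline: check sleep and recent training load.")
```

Comparing against your own baseline sidesteps some of the device's fixed bias, because the same sensor and algorithm produced every reading in the history.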

Estimated metrics, such as sleep, energy expenditure, and oxygen uptake, are derived from measured metrics through algorithms. Here measurement errors compound with algorithm errors, which can reduce accuracy, so these metrics need to be interpreted with more caution. For example, the overall sleep score sometimes matches how tired you feel and sometimes does not. Likewise, a wearable's estimate of energy expenditure during walking may be fairly accurate, while its estimate for resistance (strength) training may be too low.
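How much do errors compound? Assume the sensor error and the algorithm error are independent, and treat each as a relative error in the final estimate. A common first-order approximation then combines them as the square root of the sum of their squares. In this Python sketch, 3% is the average heart rate error quoted earlier, while the 10% algorithm error is a made-up figure. Correlated or systematic errors can add up to more, as much as simply summing them (13% here).

```python
import math

def combined_relative_error(*relative_errors):
    """Root-sum-of-squares combination of independent relative errors."""
    return math.sqrt(sum(e * e for e in relative_errors))

sensor_error = 0.03     # ~3% average heart rate error
algorithm_error = 0.10  # hypothetical error of a heart-rate-to-calories model
print(f"{combined_relative_error(sensor_error, algorithm_error):.1%}")  # 10.4%
```

The combined error is always at least as large as the largest single source. This is why an estimated metric is rarely more accurate than the measurements it is built on.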

Both of these categories have gold standards. Even if current measurements are not very accurate, we can expect advances in measurement technology and algorithms to bring the data closer to the true values.

Expecting measurement data to get closer to accurate values | rootriverarchery

Created metrics have no measurement gold standard. Algorithms build them on top of the previous two categories according to certain definitions or ideas; recovery status and training effect are examples. Their accuracy is difficult to verify for three reasons:
- There is no gold standard to compare against.
- Sensor hardware and algorithms differ between manufacturers.
- The algorithms themselves are unpublished.

Therefore, we need not fixate on the absolute numbers of these created metrics. Instead, we can follow their trends and combine them with our own subjective feelings to learn how our bodies respond to daily life and exercise.
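Following a trend rather than a single day's value can be as simple as a rolling average. Here is a minimal Python sketch of a 7-day trailing mean. The window length and the daily scores are assumptions for illustration.

```python
def rolling_mean(values, window=7):
    """Trailing average over the last `window` values; None until enough data exist."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out

# Hypothetical daily "recovery" scores from a watch over two weeks.
scores = [70, 40, 75, 68, 80, 45, 72, 66, 60, 58, 62, 55, 50, 57]
weekly = rolling_mean(scores)
print(round(weekly[6], 1), round(weekly[-1], 1))  # 64.3 58.3
```

Single days jump around, from 40 to 80 in the first week, but the weekly average shows a gradual decline. That trend is more useful to compare against how you actually feel.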

In addition, device manufacturers regularly release software updates. Installing them promptly keeps the device on the latest algorithms, which can improve the accuracy of metrics to some extent.

The final table summarizes the key content of the entire article; referring to it may help you reduce confusion and increase your control over health and exercise.

References

[1]Altini M, Plews D. What is behind changes in resting heart rate and heart rate variability? A large-scale analysis of longitudinal measurements acquired in free-living[J]. Sensors, 2021, 21(23): 7932.

[2]Cudejko T, Button K, Al-Amri M. Validity and reliability of accelerations and orientations measured using wearable sensors during functional activities[J]. Scientific reports, 2022, 12(1): 14619.

[3]Shei R J, Holder I G, Oumsang A S, et al. Wearable activity trackers–advanced technology or advanced marketing?[J]. European Journal of Applied Physiology, 2022, 122(9): 1975-1990.

[4]Miller D J, Sargent C, Roach G D. A validation of six wearable devices for estimating sleep, heart rate and heart rate variability in healthy adults[J]. Sensors, 2022, 22(16): 6317.

[5]Germini F, Noronha N, Borg Debono V, et al. Accuracy and acceptability of wrist-wearable activity-tracking devices: systematic review of the literature[J]. Journal of medical Internet research, 2022, 24(1): e30791.

[6]Li Y I, Zhong-Hua L V, Shun-Ying H U, et al. Validating the accuracy of a multifunctional smartwatch sphygmomanometer to monitor blood pressure[J]. Journal of Geriatric Cardiology: JGC, 2022, 19(11): 843.

[7]de Zambotti M, Goldstein C, Cook J, et al. State of the science and recommendations for using wearable technology in sleep and circadian research[J]. Sleep, 2023: zsad325.

[8]https://www.firstbeat.com/en/athletes-recovery-analysis-firstbeat-white-paper-2/

[9]https://www.firstbeat.com/en/firstbeat-white-paper-sleep-analysis-method-based-on-heart-rate-variability/

[10]Doherty C, Baldwin M, Keogh A, et al. Keeping pace with wearables: a living umbrella review of systematic reviews evaluating the accuracy of consumer wearable technologies in health measurement[J]. Sports Medicine, 2024. doi: 10.1007/s40279-024-02077-2.

Author: ZIYI

Editor: Dai Tianyi

Image source: Tuchong Creative

This article is from Guokr and may not be reproduced without authorization.
