TY - JOUR
T1 - Next-generation data filtering in the genomics era
AU - Hemstrom, William
AU - Grummer, Jared A.
AU - Luikart, Gordon
AU - Christie, Mark R.
N1 - Publisher Copyright:
© Springer Nature Limited 2024.
PY - 2024/6/14
Y1 - 2024/6/14
N2 - Genomic data are ubiquitous across disciplines, from agriculture to biodiversity, ecology, evolution and human health. However, these datasets often contain noise or errors and are missing information that can affect the accuracy and reliability of subsequent computational analyses and conclusions. A key step in genomic data analysis is filtering — removing sequencing bases, reads, genetic variants and/or individuals from a dataset — to improve data quality for downstream analyses. Researchers are confronted with a multitude of choices when filtering genomic data; they must choose which filters to apply and select appropriate thresholds. To help usher in the next generation of genomic data filtering, we review and suggest best practices to improve the implementation, reproducibility and reporting standards for filter types and thresholds commonly applied to genomic datasets. We focus mainly on filters for minor allele frequency, missing data per individual or per locus, linkage disequilibrium and Hardy–Weinberg deviations. Using simulated and empirical datasets, we illustrate the large effects of different filtering thresholds on common population genetics statistics, such as Tajima’s D value, population differentiation (FST), nucleotide diversity (π) and effective population size (Ne).
AB - Genomic data are ubiquitous across disciplines, from agriculture to biodiversity, ecology, evolution and human health. However, these datasets often contain noise or errors and are missing information that can affect the accuracy and reliability of subsequent computational analyses and conclusions. A key step in genomic data analysis is filtering — removing sequencing bases, reads, genetic variants and/or individuals from a dataset — to improve data quality for downstream analyses. Researchers are confronted with a multitude of choices when filtering genomic data; they must choose which filters to apply and select appropriate thresholds. To help usher in the next generation of genomic data filtering, we review and suggest best practices to improve the implementation, reproducibility and reporting standards for filter types and thresholds commonly applied to genomic datasets. We focus mainly on filters for minor allele frequency, missing data per individual or per locus, linkage disequilibrium and Hardy–Weinberg deviations. Using simulated and empirical datasets, we illustrate the large effects of different filtering thresholds on common population genetics statistics, such as Tajima’s D value, population differentiation (FST), nucleotide diversity (π) and effective population size (Ne).
UR - http://www.scopus.com/inward/record.url?scp=85195957355&partnerID=8YFLogxK
U2 - 10.1038/s41576-024-00738-6
DO - 10.1038/s41576-024-00738-6
M3 - Review article
AN - SCOPUS:85195957355
SN - 1471-0056
JO - Nature Reviews Genetics
JF - Nature Reviews Genetics
ER -