Next-generation data filtering in the genomics era

William Hemstrom, Jared A. Grummer, Gordon Luikart, Mark R. Christie

Research output: Contribution to journal › Review article › peer-review


Abstract

Genomic data are ubiquitous across disciplines, from agriculture to biodiversity, ecology, evolution and human health. However, these datasets often contain noise, errors and missing information that can affect the accuracy and reliability of subsequent computational analyses and conclusions. A key step in genomic data analysis is filtering — removing sequencing bases, reads, genetic variants and/or individuals from a dataset — to improve data quality for downstream analyses. Researchers are confronted with a multitude of choices when filtering genomic data; they must choose which filters to apply and select appropriate thresholds. To help usher in the next generation of genomic data filtering, we review and suggest best practices to improve the implementation, reproducibility and reporting standards for filter types and thresholds commonly applied to genomic datasets. We focus mainly on filters for minor allele frequency, missing data per individual or per locus, linkage disequilibrium and Hardy–Weinberg deviations. Using simulated and empirical datasets, we illustrate the large effects of different filtering thresholds on common population genetics statistics, such as Tajima's D, population differentiation (FST), nucleotide diversity (π) and effective population size (Ne).
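
As a minimal sketch of two of the filters discussed above, minor allele frequency (MAF) and per-locus missingness, the Python snippet below assumes a diploid genotype matrix coded as 0/1/2 alternate-allele dosages with -1 marking missing calls; the simulated matrix, the thresholds and the diversity proxy are illustrative assumptions, not taken from the review:

```python
import numpy as np

# Illustrative genotype matrix: rows = individuals, columns = loci.
# Entries are alternate-allele dosages (0, 1, 2); -1 marks a missing call.
rng = np.random.default_rng(42)
geno = rng.choice([0, 1, 2, -1], size=(50, 200), p=[0.60, 0.25, 0.10, 0.05])

MAF_MIN = 0.05            # drop loci with minor allele frequency < 5%
MAX_LOCUS_MISSING = 0.20  # drop loci missing in > 20% of individuals

missing = geno == -1
locus_missing_rate = missing.mean(axis=0)

# Alternate-allele frequency computed from non-missing diploid genotypes.
called = np.where(missing, 0, geno)
n_called = (~missing).sum(axis=0)
alt_freq = called.sum(axis=0) / np.maximum(2 * n_called, 1)
maf = np.minimum(alt_freq, 1.0 - alt_freq)

keep = (maf >= MAF_MIN) & (locus_missing_rate <= MAX_LOCUS_MISSING)
print(f"kept {keep.sum()} of {geno.shape[1]} loci")

# Per-locus expected heterozygosity, 2p(1 - p), used here as a simple
# stand-in for nucleotide diversity at biallelic SNPs, before and after
# filtering:
pi = 2.0 * alt_freq * (1.0 - alt_freq)
print(f"mean diversity, all loci:      {pi.mean():.4f}")
print(f"mean diversity, filtered loci: {pi[keep].mean():.4f}")
```

Because rare variants contribute little heterozygosity, the MAF filter alone raises the mean diversity among the retained loci, a small-scale version of the threshold effects on π, Tajima's D, FST and Ne that the review quantifies.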

Original language: English
Journal: Nature Reviews Genetics
State: Published - Jun 14 2024
