Quality assessment and refinement of chromatin accessibility data with an open-source toolkit
Abstract: Chromatin accessibility assays are central to the genome-wide identification of gene regulatory elements associated with transcriptional regulation. However, the resulting data vary widely in quality owing to several biological and technical factors. To address this problem, we use how well DNA sequence-based machine-learning models predict open-chromatin peaks to evaluate and refine chromatin accessibility data. Our framework, gapped k-mer SVM quality check (gkmQC), provides quality metrics for a sample based on the prediction accuracy of the trained models.
We tested 886 DNase-seq samples from the ENCODE/Roadmap projects to demonstrate that gkmQC can effectively identify high-quality samples that score poorly on conventional quality metrics owing to marginal read depths. Peaks from samples that gkmQC identifies as high quality are more accurately aligned with functional regulatory elements, show greater enrichment of regulatory elements harboring functional variants from genome-wide association studies (GWAS), and explain greater heritability of phenotypes from their relevant tissues. Moreover, gkmQC can optimize the peak-calling threshold to identify additional peaks in bulk data and, especially, in single-cell chromatin accessibility data.
Here we provide a standalone open-source toolkit (https://github.com/Dongwon-Lee/gkmQC) for such analyses and share regulatory maps improved with gkmQC. These resources will contribute to the functional interpretation of disease-associated regulatory genetic variation.
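
The idea behind the quality metric described in the abstract, scoring a sample by how well a sequence-based model separates its peaks from background sequence, can be illustrated with a minimal sketch. This is not the gkmQC implementation: it substitutes ungapped k-mer frequencies and a simple centroid-difference linear model for the gapped k-mer SVM, and it assumes that prediction accuracy is summarized as a held-out AUC. All function names and parameters here are hypothetical.

```python
from collections import Counter


def kmer_profile(seq, k):
    """Normalized (ungapped) k-mer frequencies of a DNA sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:  # sequence shorter than k
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def train_weights(peaks, background, k):
    """Per-k-mer weight: mean frequency in peaks minus mean in background.

    Assumption: this centroid-difference model is a stand-in for the SVM.
    """
    weights = Counter()
    for seqs, sign in ((peaks, 1.0), (background, -1.0)):
        for seq in seqs:
            for kmer, freq in kmer_profile(seq, k).items():
                weights[kmer] += sign * freq / len(seqs)
    return weights


def score(seq, weights, k):
    """Linear score of a sequence under the trained k-mer weights."""
    profile = kmer_profile(seq, k)
    return sum(freq * weights.get(kmer, 0.0) for kmer, freq in profile.items())


def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation of AUC).
    """
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def quality_score(peaks, background, k=4, train_fraction=0.5):
    """Held-out AUC of a k-mer model separating peaks from background."""
    cut_p = int(len(peaks) * train_fraction)
    cut_n = int(len(background) * train_fraction)
    weights = train_weights(peaks[:cut_p], background[:cut_n], k)
    pos_scores = [score(s, weights, k) for s in peaks[cut_p:]]
    neg_scores = [score(s, weights, k) for s in background[cut_n:]]
    return auc(pos_scores, neg_scores)
```

Under these assumptions, calling `quality_score(peaks, background)` on two lists of DNA strings returns a value near 0.5 when peak sequences are indistinguishable from background and approaches 1.0 when they carry consistent sequence signal, which is the sense in which predictability can serve as a proxy for sample quality.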