
January

Deep learning predicts DNA methylation regulatory variants in specific brain cell types and enhances fine mapping for brain disorders

Jiyun Zhou

0 1 6 7

Daniel R. Weinberger

drweinberger@libd.org 0 1 2 3 4 5 6 7

Shizhong Han

0 1 2 5 6 7

Shizhong Han

0 1 7

Ph.D.

0 1 7 0 855 North Wolfe St , 21205 Baltimore, MD , USA 1 Daniel R. Weinberger , M.D 2 Department of Genetic Medicine, Johns Hopkins University School of Medicine , Baltimore, MD , USA 3 Department of Neurology, Johns Hopkins University School of Medicine , Baltimore 4 Department of Neuroscience, Johns Hopkins University School of Medicine , Baltimore 5 Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of 6 Lieber Institute for Brain Development, Johns Hopkins Medical Campus , Baltimore, MD , USA 7 Medicine , Baltimore, MD, 21287 , USA

2024

21 2024 350 386

855 North Wolfe St, 21205 Baltimore, MD

Shizhong.Han@libd.org

DNA methylation (DNAm) is essential for brain development and function and potentially mediates the effects of genetic risk variants underlying brain disorders. We present INTERACT, a transformer-based deep learning model to predict regulatory variants impacting DNAm levels in specific brain cell types, leveraging existing single-nucleus DNAm data from the human brain. We show that INTERACT accurately predicts cell type-specific DNAm profiles, achieving an average area under the Receiver Operating Characteristic curve of 0.98 across cell types. Furthermore, INTERACT predicts cell type-specific DNAm regulatory variants, which reflect cellular context and enrich the heritability of brain-related traits in relevant cell types. Importantly, we demonstrate that incorporating predicted variant effects and DNAm levels of CpG sites enhances the fine mapping for three brain disorders (schizophrenia, depression, and Alzheimer’s disease) and facilitates mapping causal genes to particular cell types. Our study highlights the power of deep learning in identifying cell type-specific regulatory variants, which will enhance our understanding of the genetics of complex traits.

Teaser:

Deep learning reveals genetic variations impacting brain cell type-specific DNA methylation and illuminates genetic bases of brain disorders

Introduction

DNA methylation (DNAm) is essential for brain development and function, and its aberrations are implicated in neurological and psychiatric disorders(1-4). Genetic association studies have identified genetic variations associated with DNAm levels in the human brain, known as DNAm quantitative trait loci (mQTLs)(5-9), which may illuminate causal genetic variations within risk loci identified by genome-wide association studies (GWAS)(10, 11). However, most of those studies were conducted with bulk tissues and may not capture cell type-specific mQTLs, thereby limiting their utility to uncover risk variants that specifically act in disease-relevant cell types(12-14). Advances in single-cell technologies have enabled profiling of DNAm at single-nucleus resolution in the human brain(15-17), providing an opportunity to identify cell type-specific mQTLs through genetic association studies. However, these technologies remain too costly to generate sample sizes sufficient for robust statistical power. Additionally, genetic association studies face challenges in identifying functional variants that drive DNAm levels due to extensive linkage disequilibrium (LD) across the genome. As a complementary approach to population-based QTL studies, deep learning techniques have emerged as a promising tool for predicting the effects of genetic variations on various molecular traits(18), such as gene expression(19, 20), chromatin marks(21, 22) and DNAm levels(23). These techniques first build prediction models for molecular traits based on local DNA sequences and then estimate the impact of a genetic variation on these traits by comparing the predicted levels of the molecular trait between the two DNA sequences carrying its different alleles. Deep learning-based approaches offer advantages over traditional QTL studies as they do not rely on population-based samples and are not confounded by LD(23).
Instead, these approaches predict regulatory variants by assessing their impact on DNA motifs associated with molecular traits. In a previous study, we developed a deep learning model, INTERACT, which integrates a convolutional neural network (CNN) with the attention mechanism of the transformer to predict variant effects on DNAm levels in bulk brain tissues(23). That study demonstrated the superiority of INTERACT over a standard CNN model, but the model was trained on bulk brain samples, limiting its ability to detect cell type-specific effects. Here, we extend INTERACT to predict DNAm regulatory variants in specific brain cell types utilizing existing single-nucleus DNAm data from the human brain. We show that INTERACT models, trained for each cell type, accurately predict cell type-specific DNAm profiles and uncover DNA motifs and transcription factors that may underlie these profiles. Furthermore, INTERACT predicts cell type-specific DNAm regulatory variants, which reflect cellular context and enrich heritability of brain-related traits in relevant cell types. Importantly, we demonstrate that incorporating predicted variant effects and DNAm levels of CpG sites enhances the fine mapping of risk loci for three brain disorders (schizophrenia, depression and Alzheimer’s disease) and facilitates mapping potential causal genes to particular cell types. Our study highlights the power of deep learning in identifying regulatory variants in specific cell types, which will enhance our understanding of the genetic underpinnings of complex traits.

Results

INTERACT model for predicting cell type-specific DNAm levels

We designed INTERACT, a deep learning model that combines a CNN with the attention mechanism of the transformer, to predict DNAm levels of CpG sites from local DNA sequences (Figure 1A). To train cell type-specific INTERACT models, we utilized an existing single-nucleus DNAm dataset containing 4,137 nuclei from the human prefrontal cortex(16), which were clustered into 13 cell types, including four excitatory neuron subtypes (L2/3, L4, L5, and L6), four inhibitory neuron subtypes (Ndnf, Vip, Pvalb, and Sst), and five non-neuronal subtypes (astrocyte, oligodendrocyte (ODC), oligodendrocyte progenitor cell (OPC), microglia, and endothelial cell). Given the model’s complexity (38,203,907 parameters), we first pre-trained INTERACT by predicting DNAm levels of ∼25 million CpG sites of high coverage (> 50x) derived from a pseudo-bulk tissue comprising all 4,137 nuclei from the same single-nucleus dataset. We then fine-tuned the pre-trained INTERACT model for each cell type by predicting the DNAm levels of 2.3 to 3.1 million CpG sites, chosen based on a coverage cutoff specific to each cell type (Table S1).

We evaluated the performance of each cell type-specific INTERACT model in predicting the DNAm levels of independent CpG sites that were not included in either the pre-training or the fine-tuning stage. We observed strong prediction performance across all cell type-specific models when measured by their ability to distinguish methylated from unmethylated CpG sites, with an average area under the Receiver Operating Characteristic (ROC) curve of 0.984 and an average area under the Precision-Recall Curve (PRC) of 0.985. We compared the performance of INTERACT with a standard CNN model, and INTERACT consistently outperformed the CNN model (Figure 1B). The INTERACT models showed higher average ROC (0.984 for INTERACT vs. 0.978 for CNN), higher average PRC (0.985 for INTERACT vs. 0.978 for CNN), and lower average mean squared error (MSE) values (0.0358 for INTERACT vs. 0.0491 for CNN). We further assessed whether cell type-specific INTERACT models capture cell type-specific information. To do this, we predicted DNAm levels of independent CpG sites on chromosome 22 for each cell type. We then selected the 1,000 most variable CpG sites across the 13 brain cell types and performed clustering analysis of the 13 cell types. We observed that the predicted levels of these top variable CpG sites clearly cluster cell types in a way that aligns with their biological relationships (Figure 1C), suggesting that our cell type-specific models effectively learned cell type-specific information for predicting DNAm levels.
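
The evaluation metrics named above can be illustrated with a minimal sketch. It is not the evaluation code used in this study: the function names `roc_auc` and `mse` are hypothetical, and the area under the ROC curve is computed through the rank-sum identity (the probability that a randomly chosen methylated site scores higher than a randomly chosen unmethylated site), which is one standard way to obtain it.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: 1 for methylated, 0 for unmethylated CpG sites.
    scores: predicted DNAm levels. Tied scores share their average rank.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the current group of tied scores.
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def mse(targets, predictions):
    """Mean squared error between observed and predicted DNAm levels."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)
```

For example, `roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` ranks one unmethylated site above one methylated site and therefore falls below 1.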

Cell type-specific INTERACT models learn DNA motifs and transcription factors underlying cell type-specific DNAm profiles

To explore how the trained cell type-specific INTERACT models learned cell type-specific information, we examined filters in the first convolutional layer of each model for their potential to activate the expression of cell type-specific genes. To this end, we defined a regulatory activity for each filter, quantifying the correlation between the activation strength of the filter within enhancer regions and the expression levels of the genes targeted by these enhancers. A higher regulatory activity reflects a greater potential of the filter to activate genes targeted by enhancers. Our findings revealed that filters learned by each cell type-specific model tend to have a higher regulatory activity for enhancers targeting cell type-specific genes, compared to enhancers targeting genes shared between the corresponding cell type and at least one other cell type (Figure 1D). This analysis suggests that our cell type-specific models effectively learned DNA motifs that play crucial roles in cell type-specific gene regulatory circuits, thus enabling them to capture cell type-specific information.
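
As a rough illustration of the regulatory activity defined above, the sketch below correlates a filter's activation strength across enhancers with the expression of the genes those enhancers target. The text does not state which correlation coefficient was used; Pearson correlation is assumed here, and the function names and input layout are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def regulatory_activity(filter_activations, target_expression):
    """Assumed definition: correlation, across enhancers, between one
    filter's activation strength in each enhancer and the expression
    level of the gene targeted by that enhancer."""
    return pearson(filter_activations, target_expression)
```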

To gain further biological insights from the trained cell type-specific INTERACT models, we also examined filters in the first convolutional layer to identify DNA motifs learned by these filters for each cell type. We then compared the learned DNA motifs with known transcription factor (TF) binding motifs using the Tomtom motif comparison tool(24). In total, we identified 143 unique TFs (FDR < 0.05) whose DNA binding motifs matched the filter-learned DNA motifs and that were expressed in the corresponding cell type (Data S1). The number of detected TFs varied across cell type-specific models, ranging from 45 for the excitatory neuron (L2/3) and microglia to 73 for the inhibitory neuron (Sst), with an average of 58 TFs per cell type. Among the 143 TFs, 48% (69) showed physical interaction evidence with enzymes involved in DNA methylation (DNMT1, DNMT3A, DNMT3B) or demethylation (TET1, TET2, TET3 and TDG). This represents a significant enrichment (OR = 2.5, p = 1.8 × 10^-6) compared to a list of background TFs whose DNA binding motifs did not match any filter-learned DNA motif (835 TFs, FDR > 0.05), with only 24% of them showing such evidence. Furthermore, we explored the enrichment pattern of physical interaction evidence for TFs detected at varying FDR thresholds, and found an increasing trend of enrichment when FDR cutoffs ranged from 0.01 to 0.05, followed by a decrease in enrichment when FDR cutoffs went beyond 0.05 (Figure 1E). This enrichment pattern indicates an intriguing relationship between filter-indicated TFs and enzymes responsible for the biochemical processes of DNAm, suggesting that our cell type-specific models may learn TFs that regulate cell type-specific DNAm profiles.

Predicted cell type-specific DNAm regulatory variants reflect cellular context

We performed in silico mutagenesis to predict variant effects on DNAm levels of CpGs in specific cell types based on the trained cell type-specific INTERACT models (Figure 2A). We then sought to evaluate whether predicted cell type-specific variant effects reflect the cellular context. Firstly, we hypothesized that if the predicted cell type-specific effects captured the cellular context, then variants would show similar effects within similar cell types. Indeed, we observed a stronger correlation pattern for the effects of variants predicted by models trained on similar cell types, and clustering of the pairwise correlation matrix clearly formed three distinct clusters: excitatory, inhibitory, and non-neuronal cell types (Figure 2B). Specifically, excitatory neuron subtypes (L2/3, L4, L5, and L6) were clustered together, while inhibitory neuron subtypes (Ndnf, Pvalb, Sst and Vip) formed a related but distinct cluster, and non-neuronal cell types were grouped together and separated from neuronal cell types, aligning with the biological relatedness of these cell types.
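
The in silico mutagenesis step can be sketched as follows: the same model scores the reference sequence and the sequence carrying the alternative allele, and the variant effect is the difference between the two predictions. The predictor `toy_predict` below is a purely hypothetical stand-in (GC content) so that the example runs on its own; in practice a trained cell type-specific model would take its place, and taking the signed difference (alternative minus reference) is an assumption.

```python
def mutate(sequence, position, alt_allele):
    """Return a copy of `sequence` with the base at `position` (0-based) replaced."""
    if len(alt_allele) != 1:
        raise ValueError("only single-nucleotide substitutions are handled")
    return sequence[:position] + alt_allele + sequence[position + 1:]


def variant_effect(predict, ref_sequence, position, alt_allele):
    """Predicted DNAm level with the alternative allele minus that with the reference."""
    if ref_sequence[position] == alt_allele:
        raise ValueError("alternative allele equals the reference base")
    alt_sequence = mutate(ref_sequence, position, alt_allele)
    return predict(alt_sequence) - predict(ref_sequence)


def toy_predict(sequence):
    """Hypothetical stand-in for a trained model: GC fraction of the sequence."""
    return sum(base in "GC" for base in sequence) / len(sequence)
```

With this toy predictor, `variant_effect(toy_predict, "ACGTACGT", 0, "G")` is positive because the substitution raises the GC fraction.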

Secondly, given the role of DNAm in gene regulation, we hypothesized that if the predicted variant effects on DNAm are cell type-specific, variants with higher effects would be enriched in active regulatory regions unique to the corresponding cell type. To test this, we overlapped variants with active enhancers (marked by the histone modification H3K27ac outside of H3K4me3) that are unique to each of the four broad cell types in the human brain (neuron, astrocyte, microglia, and ODC), as determined in a previous study(25) (Figure 2C). Our analysis confirmed that variants with higher effects in one cell type were more enriched in active enhancers in the corresponding or closely related cell types than variants with lower effects (ranked in the bottom 10%). These results provide further evidence that the predicted cell type-specific effects of variants capture the cellular context.

Predicted cell type-specific DNAm regulatory variants were enriched for the heritability of brain-related traits in relevant cell types

We performed stratified LD-score regression (S-LDSC) to evaluate the contribution of predicted cell type-specific DNAm regulatory variants to the genetic components of 18 brain-related traits. We found that variants with large effects (ranked in the top 10%) in neuronal cell types were enriched for the heritability of a number of brain-related traits and disorders (Figure 3), aligning with the known neurobiology of these traits. For example, variants in both excitatory and inhibitory neurons showed strong enrichment for schizophrenia, bipolar disorder and depression. Among the three neurological diseases we examined, we observed strong enrichment for epilepsy in both excitatory and inhibitory neurons, but not in non-neuronal cell types, consistent with the cell type-specificity analysis using gene expression signatures from a recent GWAS(26). For Alzheimer’s disease (AD), we observed the strongest enrichment in microglia (fold = 5.2), but with only a trend toward nominal significance (p = 0.071), probably due to the limited power of current AD GWAS. We also noted the enrichment of predicted DNAm regulatory variants in non-neuronal cell types for the heritability of multiple brain-related traits, though to a lesser extent compared to neuronal cell types. For example, variants in astrocytes were enriched for heritability of schizophrenia and bipolar disorder, while variants in astrocytes, microglia, ODC and endothelial cells showed enrichment for depression heritability, consistent with the emerging roles of non-neuronal cell types in psychiatric disorders(27). Additionally, we compared these findings with the top 10% ranked variants predicted by the INTERACT model trained on bulk brain samples in our previous study(23) (Figure 3).
While we also observed a certain level of enrichment for brain-related traits from the bulk model, the strength of enrichment was generally lower compared to the INTERACT models trained on neuronal cell types. This comparison emphasizes that cellular context matters for predicting functional variants underlying brain-related traits. We also compared top-ranked variants from the cell type-specific INTERACT models with variants ranked by three other scoring systems: CADD(28), GWAVA(29) and DeepSEA(30) (predicted effects on the enhancer mark H3K27ac in frontal cortex) (Figure 3). While enrichment of heritability was also observed for brain-related traits in the variants ranked by these scores, it tended to be weaker than for variants ranked by cell type-specific INTERACT models.
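
The selection of top-ranked variants used as an annotation in these analyses can be sketched as below. This is an illustration only: ranking by the absolute value of the predicted effect is an assumption, and `top_fraction` is a hypothetical helper rather than part of S-LDSC.

```python
def top_fraction(effects, fraction=0.10):
    """Return the variant IDs whose predicted effect magnitude ranks in the top `fraction`.

    effects: dict mapping variant ID -> predicted effect on DNAm level.
    Ranking by absolute effect is an assumption of this sketch.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_keep = max(1, int(len(effects) * fraction))
    ranked = sorted(effects, key=lambda variant: abs(effects[variant]), reverse=True)
    return set(ranked[:n_keep])
```

The resulting set would then be encoded as a binary annotation (1 for selected variants, 0 otherwise) for heritability partitioning.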

To further assess the impact of predicted DNAm regulatory variants on trait heritability, we employed another metric from S-LDSC: the z-score of per-SNP heritability. This metric allows us to discern the unique contributions of predicted variants to trait heritability while accounting for contributions from other functional annotations in the baseline model. We observed compelling statistical evidence (FDR < 0.05) supporting the involvement of top-ranked variants in neuronal cell types for 12 brain-related traits (Figure 3). However, when we examined the top-ranked variants in bulk brain samples, only three traits (depression, neuroticism and smoking initiation) showed significant z-scores (FDR < 0.05), and no traits displayed significant z-scores for variants ranked by CADD and GWAVA. While DeepSEA-ranked variants showed significant contributions to five brain-related traits (FDR < 0.05), their statistical evidence was weaker than that for variants ranked by INTERACT models trained on neuronal cell types. This analysis indicates that the variants ranked by our cell type-specific models offer unique and significant contributions to the genetic components of brain-related traits. To investigate the specificity of our findings to the brain, we examined whether predicted cell type-specific DNAm regulatory variants showed enrichment for the heritability of human height and type 2 diabetes, which served as two negative controls. We did not observe significant enrichment for top-ranked variants in any cell type for these two traits, implying that the effects of these variants were relatively specific to brain-related traits. In contrast, top-ranked variants by the other three scoring systems (CADD, GWAVA and DeepSEA) showed strong enrichment for human height, suggesting general roles in gene regulation rather than roles specific to brain-related traits.
Furthermore, we performed S-LDSC for variants with low effects (ranked in the bottom 10%) and found no evidence of heritability enrichment for brain-related traits in any cell type, suggesting that the observed enrichment for brain-related traits was specific to variants with large effects. Finally, we conducted S-LDSC for variants ranked at other levels and, in general, observed less evidence of contributions to trait heritability for variants with weaker effects (Data S2-3).

Cell type-specific INTERACT models enhance fine mapping for brain disorders

Given the roles of DNAm regulatory variants in brain-related traits, we reasoned that causal variants for brain disorders would have a stronger impact on DNAm levels in cell types relevant to the disorders. To test this, we conducted fine mapping for the GWAS risk loci of three major brain disorders (schizophrenia, depression and AD), and then compared variants with a higher chance of being causal (posterior inclusion probability (PIP) > 0.1 and p < 1 × 10^-6) to control variants less likely to be causal (PIP < 1 × 10^-4 and p > 0.99), in terms of their predicted effects on DNAm levels and the DNAm levels of their affected CpG sites in each brain cell type. We observed that putative causal variants for schizophrenia tend to have a larger effect on DNAm levels than control variants in all neuronal cell types (except Sst) and astrocytes, with no significant difference in other glial cells (p > 0.05) (Figure 4A). A group difference was also observed for INTERACT trained on bulk brain samples, though with weaker significance (p = 0.01) compared to INTERACT trained on neuronal cells, highlighting the importance of cellular context in understanding the functional impacts of risk variants. For depression, we observed group differences in all neuronal cells (except Vip) and three types of glial cells (astrocytes, OPC, and ODC). For AD, differences were observed in all excitatory neurons, one inhibitory neuron subtype (Sst), and all glial cells (except ODC). Notably, the difference in microglia was unique to AD among the three disorders, consistent with the distinct role of microglia in AD. Interestingly, a recent single-nucleus RNA-Seq study of AD detected the largest number of differentially expressed genes in excitatory neurons and found depletion of the inhibitory neuron subtype Sst in AD(31). This finding aligns with the neuronal cell types in which we observed stronger effects of AD putative causal variants.
Additionally, we compared the DNAm levels of CpG sites impacted by the two groups of variants. We observed a trend of lower DNAm levels for putative causal variants across all three brain disorders (Figure 4B), suggesting that risk variants tend to impact genomic regions with a regulatory potential, as indicated by their low DNAm levels.
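
The two variant groups compared above can be formed with a simple filter on fine-mapping output, sketched below using the thresholds stated in the text. The record layout (keys `id`, `pip`, `p`) is hypothetical, and the statistical test used to compare the groups is not specified here, so only the grouping is shown.

```python
def split_causal_and_control(variants):
    """Split fine-mapped variants into putative causal and control groups.

    variants: iterable of dicts with hypothetical keys
      'id'  - variant identifier
      'pip' - posterior inclusion probability from fine mapping
      'p'   - GWAS association p-value
    """
    causal, control = [], []
    for variant in variants:
        if variant["pip"] > 0.1 and variant["p"] < 1e-6:
            causal.append(variant["id"])
        elif variant["pip"] < 1e-4 and variant["p"] > 0.99:
            control.append(variant["id"])
    return causal, control
```

Variants meeting neither set of thresholds are left out of both groups, which keeps the comparison between clearly separated extremes.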

Given our observation that putative causal variants have a stronger impact on DNAm levels and that their affected CpG sites have lower DNAm levels in specific cell types, we aimed to determine whether incorporating these predicted functional annotations could improve the fine mapping of risk loci underlying the three aforementioned brain disorders. To incorporate predicted functional annotations into the fine mapping framework, we employed CARMA, a novel Bayesian approach that can jointly model GWAS summary statistics and functional annotations(32). We first performed fine mapping by including 26 functional annotations predicted from all 13 cell type-specific models. Compared to fine mapping without annotations, we observed a much smaller number of SNPs included in the 99% credible sets from fine mapping with annotations (Figure 5A), and a larger number of risk loci when counted by the maximum number of SNPs (up to 10) included in the 99% credible sets (Figure 5B, 5C, 5D). We provide details for the fine-mapped SNPs (PIP > 0.05 in 99% credible sets), their predicted functional annotations, and their assigned gene targets for each disorder in Data S4-6, offering potential candidate genes for further investigation.
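
To make the notion of a 99% credible set concrete, the sketch below builds one under the common simplified definition for a single causal signal: SNPs are ranked by PIP and added until their cumulative PIP reaches the target coverage. This is an assumption for illustration; CARMA's own procedure for constructing credible sets may differ, and the SNP identifiers in the usage example are made up.

```python
def credible_set(pips, coverage=0.99):
    """Smallest set of SNPs, taken in decreasing PIP order, whose PIPs sum to >= coverage.

    pips: dict mapping SNP ID -> posterior inclusion probability for one signal.
    If the PIPs never reach `coverage`, all SNPs are returned.
    """
    ranked = sorted(pips.items(), key=lambda item: item[1], reverse=True)
    selected, cumulative = [], 0.0
    for snp, pip in ranked:
        selected.append(snp)
        cumulative += pip
        if cumulative >= coverage:
            break
    return selected
```

A sharper PIP distribution yields a smaller set: `credible_set({"snp1": 0.986, "snp2": 0.010, "snp3": 0.004})` needs only the first two SNPs.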

Following the fine mapping strategy that incorporated annotations from all 13 brain cell types, we proceeded with additional fine mapping using annotations from refined cell types. These refined cell types encompassed four broad categories (excitatory neuron, inhibitory neuron, neuron, and glia), along with each individual cell type. Our objective was to pinpoint specific cell types that might yield a more robust fine mapping signal, offering insights into where the causal variants and their target genes are likely to exert their effects. Indeed, we observed many risk loci showing stronger PIP signals in certain cell types across the three disorders, whereas signals in other cell types were notably weaker or absent (Data S7-9). We illustrate one such risk locus for AD on chromosome 7, where fine mapping without annotations generated a credible set of six SNPs, with the highest PIP observed for rs74504435 (PIP = 0.37) (Figure 6). Notably, fine mapping with annotations from all cell types produced a credible set of only two SNPs, with the highest PIP (0.986) observed again for rs74504435. Interestingly, fine mapping using annotations from refined cell types clearly identified rs74504435 in glia (PIP = 0.981), specifically in astrocytes (PIP = 0.917). In contrast, among other refined cell types, the highest PIP observed for this SNP was only 0.429, found in one neuronal cell type (L6). Further investigation into the function of rs74504435 revealed that this SNP had the largest impact on a CpG site (chr7: 54949625), situated within an enhancer in astrocytes that exhibited Hi-C contact with the promoter of EGFR (epidermal growth factor receptor). Furthermore, the risk allele "A" of rs74504435 was predicted to decrease the DNAm level of the CpG site (DNAm level = 0.19) compared to the reference allele "C" (DNAm level = 0.46).
Given prior evidence that regulatory regions are associated with low DNAm levels, we hypothesize that the risk allele "A" could increase the enhancer activity and hence increase the expression level of EGFR. Intriguingly, previous studies have shown that over-expression of EGFR can lead to amyloid-β induced memory loss(33) and induce neuroinflammation and activate astrocytes(34). Additionally, inhibitors of EGFR have been explored as potential treatments for AD(35).

Discussion

In our previous study, we introduced the INTERACT model, designed to predict DNAm regulatory variants in bulk brain samples(23). The current study extends the INTERACT model to identify DNAm regulatory variants in specific brain cell types utilizing existing single-nucleus DNAm data from the human brain. Our study reveals the inherent power of DNA sequences to encode cell type-specific DNAm patterns and demonstrates the ability of our cell type-specific INTERACT models to reveal DNA motifs and TFs underlying cell type-specific DNAm profiles. We show that our cell type-specific INTERACT models predict DNAm regulatory variants that capture cellular context and are enriched for the heritability of many brain-related traits in relevant cell types. Importantly, we demonstrate that the inclusion of predicted SNP effects and DNAm levels enhances the fine mapping of risk loci for three major brain disorders and pinpoints particular cell types where the fine-mapped risk variants may exert their effects. Our findings highlight the power of deep learning in identifying functional regulatory variants in specific cell types, which will further our understanding of the genetic underpinnings of complex traits.

This work should be viewed in light of several limitations. Firstly, our model relies on input DNA sequences from the reference genome, which may not precisely match the DNA sequences of the samples we used for training the model. The performance of the model could be further improved in future research by utilizing DNA sequences and DNAm data from the same individuals. Secondly, our study was limited by the number of nuclei available in the single-nucleus DNAm data we utilized, allowing us to build cell type-specific models for only 13 brain cell types. This hinders our ability to identify DNAm regulatory variants in more refined cell types. This limitation could be addressed by incorporating larger-scale single-nucleus DNAm datasets that cover a wider range of cell types. Thirdly, given the high sparsity of single-nucleus DNAm profiles and the need for large training samples, we assumed that the DNAm status of a CpG site is consistent across nuclei of the same type, which allowed us to aggregate training samples of CpG sites that are fully methylated or unmethylated across nuclei of the same type. However, this assumption may not always hold true, particularly for nuclei within a broad cell type. This limitation could be addressed by increasing sequencing coverage in individual nuclei or by sequencing a larger number of nuclei, which may lead to more accurate measurement of the DNAm levels of CpG sites for each cell type, including those with intermediate DNAm levels.
Lastly, variants derived from our model can only assist in uncovering risk variants that act through regulation of DNAm; they cannot reveal risk variants that operate through mechanisms independent of DNAm regulation. Our model has the potential to be extended to identify regulatory variants for other types of molecular traits, such as gene expression and histone modification, in specific cell types.

Methods

Training datasets

We trained the cell type-specific INTERACT models using an existing single-nucleus dataset generated by the single-nucleus methyl-3C sequencing technique (sn-m3C-seq)(16), which allows simultaneous capture of DNAm and chromatin contact information from individual nuclei. This dataset includes 4,137 nuclei derived from the human prefrontal cortex, which were clustered into 13 cell types. These cell types include four excitatory neuron subtypes characteristic of different cortical laminae (L2/3, L4, L5, and L6), four inhibitory neuron subtypes (Ndnf, Vip, Pvalb, and Sst), and five non-neuronal subtypes (astrocyte, oligodendrocyte (ODC), oligodendrocyte progenitor cell (OPC), microglia, and endothelial cell). To build the training dataset for each cell type, we utilized pseudo-bulk tissue derived from the nuclei of the corresponding cell type. We selected CpG sites that were either fully methylated or fully unmethylated as the training samples for each cell type. Our rationale for this approach was two-fold. First, we assumed that the DNAm status at a specific CpG site tends to be consistent across nuclei of the same cell type. Second, due to the high sparsity of single-nucleus DNAm profiles, it was challenging to collect a large number of CpG sites with sufficient read coverage (> 50) for reliable estimation of intermediate DNAm levels, especially for neuronal cell types represented by a very limited number of nuclei in this dataset (Table S2). To maintain data quality while ensuring an adequate training sample size, we implemented a manually adjusted coverage cutoff for each cell type. As a result, the cell type-specific training datasets contained between 2.3 and 3.1 million CpG sites. Table S1 provides details on the coverage threshold and training sample size for each cell type.
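
The site selection described above can be sketched as a filter over pseudo-bulk read counts for one cell type. The input layout and the function name are hypothetical, and treating the cell type-specific cutoff as an inclusive minimum is an assumption; the actual cutoffs are those listed in Table S1.

```python
def select_training_sites(sites, min_coverage):
    """Keep CpG sites that are fully methylated or fully unmethylated.

    sites: iterable of (site_id, methylated_reads, total_reads) tuples taken
           from the pseudo-bulk profile of a single cell type.
    min_coverage: cell type-specific coverage cutoff (assumed inclusive).
    Returns a dict mapping site_id -> label (1 = methylated, 0 = unmethylated).
    """
    labels = {}
    for site_id, methylated_reads, total_reads in sites:
        if total_reads < min_coverage:
            continue  # too few reads to trust the methylation call
        if methylated_reads == 0:
            labels[site_id] = 0
        elif methylated_reads == total_reads:
            labels[site_id] = 1
        # Sites with intermediate methylation are excluded from training.
    return labels
```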

The pre-training dataset for INTERACT comprises approximately 25 million CpG sites obtained from a pseudo-bulk tissue derived from all 4,137 nuclei mentioned earlier. DNAm levels were determined by calculating the ratio of methylated reads to the total number of reads across all nuclei. To ensure high data quality for our pre-training process, we only included CpG sites covered by more than 50 reads across all nuclei.
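
A minimal sketch of this pseudo-bulk aggregation is given below, assuming per-nucleus read counts are available as dictionaries (a hypothetical layout). Reads are summed across nuclei, and the DNAm level is the ratio of methylated to total reads for sites covered by more than 50 reads.

```python
def pseudo_bulk_levels(per_nucleus_counts, min_reads=50):
    """Aggregate per-nucleus counts into pseudo-bulk DNAm levels.

    per_nucleus_counts: list with one dict per nucleus, each mapping
        site_id -> (methylated_reads, total_reads).
    Only sites covered by MORE than `min_reads` reads in total are kept.
    """
    totals = {}
    for nucleus in per_nucleus_counts:
        for site_id, (methylated_reads, total_reads) in nucleus.items():
            aggregate = totals.setdefault(site_id, [0, 0])
            aggregate[0] += methylated_reads
            aggregate[1] += total_reads
    return {
        site_id: methylated / total
        for site_id, (methylated, total) in totals.items()
        if total > min_reads
    }
```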

For both the pre-training and cell type-specific training datasets, we divided CpG sites into three subsets by chromosomes for model training, validation, and evaluation. The training set consisted of CpG sites on chromosomes 1 to 20, while CpG sites on chromosome 21 were used as the validation set for model tuning, and CpG sites on chromosome 22 were used as the independent testing set to evaluate the model prediction performance.
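
The chromosome-based split can be written as a small lookup, sketched below. The function name is hypothetical, and returning `None` for chromosomes outside 1 to 22 is an assumption, since the text does not say how other chromosomes were handled.

```python
def assign_split(chromosome):
    """Map a chromosome name (with or without a 'chr' prefix) to a data split."""
    name = chromosome[3:] if chromosome.startswith("chr") else chromosome
    if not name.isdigit():
        return None  # e.g. X, Y, M: not covered by the split described in the text
    number = int(name)
    if 1 <= number <= 20:
        return "train"
    if number == 21:
        return "validation"
    if number == 22:
        return "test"
    return None
```

Splitting by chromosome keeps every test CpG site, together with its flanking sequence, on a chromosome never seen during training or model tuning.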

INTERACT architecture

INTERACT contains three main modules: a CNN, the encoder module of the transformer, and a fully connected network. The input to the INTERACT model is a one-hot encoded DNA sequence of 2 kb, and the output is the DNAm level of the CpG site centered in that sequence. Our choice of 2 kb input DNA sequences was supported by our previous study, which investigated the performance of INTERACT for input DNA sequences of different lengths (1 kb, 2 kb, 3 kb, and 4 kb). The architecture of our model is detailed in Table S3, and the code is publicly available on our GitHub repository. Below are descriptions of each module.

The CNN module includes three convolutional layers, each containing 512 kernels of length 10. Each convolutional layer is activated by a rectified linear unit (ReLU) function and is followed by a normalization layer. After the three convolutional layers, a max-pooling layer is employed to capture the most significant features from the output of the convolutional layers. To enhance the speed and stability of training, a batch normalization layer is utilized after the max-pooling layer. Additionally, a dropout layer with a rate of 0.5 follows the batch normalization layer to prevent overfitting.
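
The model input can be illustrated with a short sketch of window extraction and one-hot encoding. This is not the released code: the function names are hypothetical, the channel order A, C, G, T and the all-zero encoding of ambiguous bases such as N are assumptions, and because 2,000 is even the CpG cytosine is placed at index 1,000 of the window by convention.

```python
BASES = "ACGT"  # assumed channel order


def extract_window(chromosome_sequence, cpg_position, length=2000):
    """Return the `length`-bp window around a CpG (0-based position of the C).

    The C is placed at index length // 2 of the returned window.
    """
    start = cpg_position - length // 2
    end = start + length
    if start < 0 or end > len(chromosome_sequence):
        raise ValueError("window extends beyond the sequence")
    return chromosome_sequence[start:end]


def one_hot(sequence):
    """Encode a DNA string as a list of 4-element rows; unknown bases become all zeros."""
    return [[1 if base == b else 0 for b in BASES] for base in sequence.upper()]
```

A 2 kb window therefore becomes a 2,000 × 4 matrix, which is the shape a first convolutional layer with kernels of length 10 would scan.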

The transformer module receives the features learned from the CNN module. This module contains a stack of eight identical layers, each of which includes two sublayers. The first sublayer is a multi-head self-attention layer, and the second sublayer is a simple, position-wise fully connected feed-forward network. Each sublayer is followed by a normalization layer to improve the speed and stability of training, and a dropout layer with a rate of 0.1 to prevent overfitting. In addition, there is a residual connection around each sublayer.

The fully connected network includes a single hidden layer with 512 units, an output layer, and a dropout layer with a drop rate of 0.1 to prevent overfitting. The sigmoid function is used after the output layer to scale the predicted values into a range between 0 and 1. The output layer includes one unit that corresponds to the DNAm levels.

Pre-training and fine-tuning

The INTERACT model was designed with approximately 38 million parameters. It is challenging to train a robust cell type-specific INTERACT model because of the limited number of training samples available for each cell type. To overcome this challenge, we employed a two-step training strategy (Figure 1A). First, we pre-trained the INTERACT model to predict DNAm levels for approximately 25 million CpGs, obtained by aggregating DNAm data across 4,137 nuclei into a single pseudo-bulk DNAm dataset. The DNAm level of a specific CpG site was calculated as the ratio of methylated reads to the total number of reads across all nuclei. To ensure the high quality of the pre-training samples, we only used CpG sites covered by more than 50 reads. This pre-training phase allowed the model to learn informative features associated with DNAm levels that are shared across cell types. In the second step, we fine-tuned the pre-trained model to predict the DNAm levels of the training samples collected for each cell type, allowing the model to learn cell type-specific features underlying DNAm levels.

Comparison with CNN model

We compared our fine-tuned cell type-specific INTERACT model with a standard CNN model for their performance in predicting DNAm levels in each cell type. The CNN model shared the same structure as the CNN module within the INTERACT architecture, followed by the same fully connected network module as described above.

DNA motif analysis

We defined a score of regulatory activity for each filter, which quantifies the correlation between the activation strength of filters within enhancers and the expression levels of the genes targeted by these enhancers. The activation strength of filters is a measure of the sequence similarity between the filter and the enhancer, with a higher activation value indicating a greater similarity of DNA sequences. A higher regulatory activity indicates a greater potential of the filter to activate genes targeted by enhancers. We calculated and compared two regulatory activity scores for each filter: one for enhancers targeting cell type-specific genes and the other for enhancers targeting shared genes between the corresponding cell type and at least one other cell type. To define enhancers and cell type-specific genes or shared genes for our study, we relied on enhancers and active promoters defined in a previous study using two histone marks (H3K4me3 and H3K27ac) for four broad cell types (neuron, microglia, ODC, and astrocyte)(25).
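As an illustration of the regulatory activity score, the sketch below correlates a filter's activation strength across enhancers with the expression of the genes those enhancers target. The text does not state which correlation coefficient was used; Pearson correlation is assumed here, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation undefined for constant input; report 0
    return cov / math.sqrt(var_x * var_y)


def regulatory_activity(filter_activations, target_expression):
    """Score one filter: activation in each enhancer vs. target gene expression.

    filter_activations[i] is the filter's activation strength in enhancer i;
    target_expression[i] is the expression of the gene targeted by enhancer i.
    """
    return pearson(filter_activations, target_expression)
```

Under this reading, each filter receives two scores, one computed over enhancers targeting cell type-specific genes and one over enhancers targeting shared genes, and the two are then compared.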

Specifically, cell type-specific genes were defined as those with an active promoter (defined by both H3K4me3 and H3K27ac) in the corresponding cell type but not in other broad cell types. Shared genes were defined as those with an active promoter in the corresponding cell type and at least one other broad cell type. We linked enhancers to their target genes based on Hi-C contact data collected for three broad cell types (neuron, ODC and astrocytes) from the same prior study(25), supplemented for these three broad cell types with cell type-specific chromatin loops called from the same nuclei used in our training data. To obtain gene expression levels in specific brain cell types, we generated pseudo-bulk data for each cell type using a single-nucleus transcriptome dataset spanning six human cortical areas, together with the associated cell type annotations, downloaded from the ALLEN BRAIN data portal. The expression level of each gene in each cell type was determined by summing the reads assigned to the gene across all cells in the pseudo-bulk sample, normalizing to one million reads, and applying a log2 transformation. We calculated regulatory activity for filters from models trained on individual neuronal cell types (L23, L4, L5, L6, Ndnf, Vip, Sst, Pvalb) based on Hi-C contacts in the broad neuron cell type. Similarly, we calculated regulatory activity for filters from models trained on ODC and microglia based on Hi-C contacts in the broad ODC and microglia cell types, respectively. We did not calculate regulatory activity for filters from models trained on astrocytes due to the limited number of Hi-C contacts for this cell type.
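The expression normalization described above (reads per million followed by a log2 transformation) can be sketched as below. The pseudocount of 1 added before taking the logarithm is an assumption, since the text does not say how zero counts were handled.

```python
import math


def log2_cpm(gene_counts, pseudocount=1.0):
    """Normalize pseudo-bulk read counts to one million reads, then log2.

    gene_counts: dict mapping gene name to the total reads assigned to that
    gene across all cells of one cell type.
    pseudocount: added before the log so that zero counts are defined
    (an assumption; not specified in the text).
    """
    library_size = sum(gene_counts.values())
    if library_size == 0:
        raise ValueError("pseudo-bulk sample contains no reads")
    return {
        gene: math.log2(count / library_size * 1_000_000 + pseudocount)
        for gene, count in gene_counts.items()
    }
```

Normalizing by the pseudo-bulk library size makes expression values comparable across cell types that contribute different numbers of cells and reads.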

We further examined filters in the first convolutional layer of each cell type-specific INTERACT model to identify DNA motifs associated with DNAm levels in each cell type. Following the method described previously(36), a DNA motif was derived for each filter from the subset of scanned sequences for which the filter produced an activation value higher than half of its maximum activation value across all sequences the filter scanned. The selected sequences were then aligned to generate a position weight matrix, which was matched to annotated TF binding motifs in the Homo sapiens CIS-BP database using the Tomtom v4.10.1 motif comparison tool(24). We considered matches at FDR < 0.05 significant for each filter. To assess the potential involvement of the identified TFs in regulating DNA methylation levels, we investigated their physical interactions with enzymes involved in DNA methylation (DNMT1, DNMT3A, DNMT3B) or demethylation (TET1, TET2, TET3, and TDG)(37). We collected physical interaction evidence with DNMT3A and DNMT3B from a previous study(38) and from the STRING database(39).
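The filter-to-motif step can be illustrated with the sketch below: subsequences whose activation exceeds half of the filter's maximum are kept, and their per-position base frequencies form the matrix. The input layout and function names are hypothetical, and the matrix is kept in frequency form; any log-odds conversion expected by a downstream motif comparison tool is not shown.

```python
def select_activating_subsequences(scored_subseqs):
    """Keep subsequences with activation above half of the maximum.

    scored_subseqs: list of (subsequence, activation) pairs for one filter,
    where each subsequence has the filter's length (10 bases in INTERACT).
    """
    if not scored_subseqs:
        return []
    max_activation = max(act for _, act in scored_subseqs)
    return [seq for seq, act in scored_subseqs if act > 0.5 * max_activation]


def position_frequency_matrix(subseqs):
    """Per-position base frequencies over equal-length subsequences."""
    if not subseqs:
        raise ValueError("no subsequences selected")
    length = len(subseqs[0])
    if any(len(s) != length for s in subseqs):
        raise ValueError("subsequences must have equal length")
    matrix = []
    for i in range(length):
        counts = {base: 0 for base in "ACGT"}
        for s in subseqs:
            if s[i] in counts:  # ignore ambiguous bases
                counts[s[i]] += 1
        total = sum(counts.values())
        matrix.append(
            {b: (counts[b] / total if total else 0.25) for b in "ACGT"}
        )
    return matrix
```

Because every selected subsequence has the same length as the filter, the alignment is implicit: position i of the matrix corresponds to position i of the filter.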

In silico mutagenesis

We performed in silico mutagenesis to identify functional genetic variations impacting DNAm levels in each cell type (Figure 2B). Briefly, we first introduced a variant allele in a given DNA sequence of 1 kb flanking a CpG site. Each sequence (with or without the introduced variant allele) was then passed to each trained cell type-specific INTERACT model to get the predicted DNAm level of the CpG site centered within each sequence. The impact of the introduced variant on the DNAm level of the CpG site was estimated by the difference between the predicted DNAm levels of the CpG site within the two sequences. We performed this in silico mutagenesis for 9,042,066 SNPs (minor allele frequency > 0.005) observed in the European ancestry samples of 1000 Genomes and located within a 1 kb window of 25,564,506 autosomal CpG sites. This resulted in estimated effects on DNAm levels for 182,211,784 SNP-CpG pairs from each cell type-specific INTERACT model. SNPs were ranked in descending order by the maximum absolute value of their predicted effects across all of their paired CpG sites.
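The mutagenesis and ranking steps can be sketched as follows. The `predict` argument stands in for a trained cell type-specific model, and `toy_predict` is a hypothetical placeholder used only so the example runs; the sign convention (alternative minus reference) is also an assumption.

```python
def variant_effect(predict, ref_seq, offset, alt_allele):
    """Predicted change in DNAm level caused by one SNP.

    predict: callable mapping a DNA sequence to a DNAm level in [0, 1].
    ref_seq: reference sequence centered on the CpG site.
    offset: 0-based index of the SNP within ref_seq.
    alt_allele: single base replacing the reference base.
    Returns prediction(alt) - prediction(ref) (assumed sign convention).
    """
    alt_seq = ref_seq[:offset] + alt_allele + ref_seq[offset + 1:]
    return predict(alt_seq) - predict(ref_seq)


def rank_snps(pair_effects):
    """Rank SNPs by their maximum absolute effect over paired CpG sites.

    pair_effects: dict mapping (snp_id, cpg_id) to a predicted effect.
    Returns SNP ids in descending order of max |effect|.
    """
    best = {}
    for (snp, _cpg), effect in pair_effects.items():
        best[snp] = max(best.get(snp, 0.0), abs(effect))
    return sorted(best, key=lambda snp: best[snp], reverse=True)


def toy_predict(seq):
    """Hypothetical stand-in for a trained model: GC fraction of the sequence."""
    return sum(base in "GC" for base in seq) / len(seq)
```

A SNP near several CpG sites appears in several SNP-CpG pairs, which is why the ranking collapses each SNP to its largest absolute effect.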

To evaluate the contextual relevance of predicted brain cell type-specific regulatory variants, we investigated their enrichment for active regulatory regions unique to each major brain cell type, including neurons, microglia, astrocytes, and ODC. Active regulatory regions unique to each cell type were determined based on a prior study(25) and defined by the presence of active enhancers (H3K27ac peaks that were outside of H3K4me3 peaks) in each respective cell type but absent in the other three cell types.

Stratified LD score regression

We performed stratified LD score regression (S-LDSC)(40) to evaluate the enrichment of heritability of brain-related traits for variants ranked at different intervals by their impacts on DNAm levels predicted by each cell type-specific INTERACT model. We also included two non-brain traits, human height and type 2 diabetes, as negative controls to examine whether our findings are specific to brain-related traits. We downloaded GWAS summary statistics of each trait from the sources listed in Data S10. Following recommendations from the LDSC resource website (https://alkesgroup.broadinstitute.org/LDSCORE), S-LDSC was run for each list of variants with the baseline LD model v2.2, which includes 97 annotations, to control for the LD between variants and other functional annotations in the genome. We used HapMap Project Phase 3 SNPs as regression SNPs and 1000 Genomes SNPs of European ancestry samples as reference SNPs, all downloaded from the LDSC resource website. To evaluate the unique contribution of predicted regulatory variants to trait heritability, we also used another metric from S-LDSC: the z-score of per-SNP heritability. This metric allows us to discern the unique contributions of candidate annotations while accounting for contributions from other functional annotations in the baseline model. The p-values were derived from the z-score assuming a normal distribution, and FDR was computed from the p-values using the Benjamini-Hochberg procedure in R.
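The conversion from z-scores to p-values and FDR can be sketched in plain Python as below. The authors performed this step in R; whether the p-value was one-sided or two-sided is not stated, so the one-sided upper tail used here is an assumption.

```python
import math


def one_sided_p(z):
    """Upper-tail p-value of a standard normal z-score (assumed one-sided)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # ascending p-values
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

The backward pass with a running minimum is what keeps the adjusted values monotone in the raw p-values, which is the same behavior as the standard BH adjustment.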

To compare the performance of our cell type-specific models with other scoring systems in their abilities to predict functional variants enriched for trait heritability, we also performed S-LDSC for variants ranked by three other scoring systems: CADD(28), GWAVA(29) and DeepSEA(30).

The CADD score measures the deleteriousness of variants and is derived from a support vector machine trained to distinguish variants that have survived natural selection from simulated mutations that are enriched for deleterious variants, utilizing features from diverse genomic annotations. GWAVA scores are derived from a random forest model trained to discriminate curated pathogenic variants from benign variants by integrating various genomic annotations. DeepSEA is another deep learning-based approach that employs a CNN model to estimate noncoding variant effects on chromatin. We considered only DeepSEA-predicted effects on the enhancer mark H3K27ac in the frontal cortex, as it represents the most relevant genomic element and tissue context for brain-related traits.

Fine-mapping GWAS risk loci

We used PLINK to clump significant SNPs from GWAS summary statistics for schizophrenia(41), depression(42), and AD(43) of European ancestry samples into independent risk loci on autosomes. To achieve this, we first obtained index SNPs (p < 5 × 10^-8) that were LD-independent, with r^2 < 0.1 within a 3-Mb window. Next, we defined the risk locus for each index SNP as the region from 50 kb upstream of the leftmost to 50 kb downstream of the rightmost SNP that was within a 3-Mb window of, and had r^2 > 0.2 with, the index SNP. We then merged risk loci that were within 50 kb of each other, resulting in 177, 100 and 54 loci for schizophrenia, depression and AD, respectively.
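The locus definition and merging rules can be sketched as below, taking PLINK's clumping output as given. The function names and the tuple layout are hypothetical, and treating a gap of exactly 50 kb as "within 50 kb" is an assumption.

```python
FLANK = 50_000      # 50 kb added on each side of the LD-defined span
MERGE_GAP = 50_000  # loci closer than this are merged


def locus_for_index_snp(ld_snp_positions, flank=FLANK):
    """Risk locus boundaries for one index SNP.

    ld_snp_positions: positions of SNPs within the 3-Mb window that have
    r^2 > 0.2 with the index SNP (including the index SNP itself).
    """
    start = max(0, min(ld_snp_positions) - flank)
    end = max(ld_snp_positions) + flank
    return (start, end)


def merge_loci(loci, max_gap=MERGE_GAP):
    """Merge loci on the same chromosome separated by at most max_gap.

    loci: list of (chrom, start, end) tuples.
    """
    merged = []
    for chrom, start, end in sorted(loci):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_gap:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged
```

Sorting first means each locus only has to be compared with the most recently merged one, so overlapping and nearby loci collapse in a single pass.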

We fine-mapped GWAS risk loci for each disorder using CARMA(32), a novel Bayesian fine-mapping approach that jointly models GWAS summary statistics and functional annotations while accounting for discrepancies between summary statistics and LD from reference panels. We excluded risk loci within the major histocompatibility complex region (chr6: 25-36 Mb) from this analysis due to the extensive LD structure of this region. For AD, we also excluded risk loci on chromosome 19 (44519920:46477842) from fine mapping due to long-range LD in this region and extreme values of the summary statistics, as was done in the previous study. LD was estimated from 50,000 randomly selected samples of European ancestry from the UKBB GWAS dataset. We employed two fine-mapping strategies: one without functional annotations and the other incorporating functional annotations from all cell types or various subsets of cell types. These functional annotations are the variant effects on DNAm levels of CpG sites and the DNAm levels of those CpG sites (from the reference DNA sequence), as predicted by each cell type-specific INTERACT model. When a SNP had predicted effects on multiple CpG sites, we chose the CpG site for which the SNP had the largest predicted effect.
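The selection of one CpG site per SNP for the functional annotations can be sketched as below. Interpreting "largest predicted effect" as the largest absolute effect is an assumption, as are the record layout and function name; how the resulting values are passed to CARMA is not shown.

```python
def annotations_for_snp(predictions):
    """Pick the annotation pair for one SNP in one cell type.

    predictions: list of (cpg_id, effect, ref_dnam_level) tuples, one per CpG
    site paired with the SNP, where ref_dnam_level is the DNAm level predicted
    from the reference sequence.
    Returns the effect and reference DNAm level of the CpG site with the
    largest absolute predicted effect (assumed interpretation).
    """
    if not predictions:
        raise ValueError("SNP has no paired CpG sites")
    cpg_id, effect, level = max(predictions, key=lambda p: abs(p[1]))
    return {"cpg": cpg_id, "effect": effect, "dnam_level": level}
```

Repeating this for each of the 13 cell type-specific models would give every SNP one effect value and one DNAm level per cell type.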

Assign fine-mapped SNPs to target genes

We first assigned SNPs within 99% credible sets to putative causal genes if the genes had transcripts whose promoters overlapped with fine-mapped risk variants. Promoters were defined as 1 kb upstream and 1 kb downstream of the transcript start site, based on the GENCODE version 42 basic gene annotations. Putative causal genes were also assigned if the promoters of their transcripts were distally contacted by fine-mapped SNPs in a specific cell type, using cell type-specific chromatin loops called from the same set of nuclei used for model training. Specifically, the cell type-specific chromatin loops were called at a 10-kb resolution by the SnapHiC method(44) from the Hi-C component of the same set of nuclei described in the training datasets section. Details of the SnapHiC method and the identification of chromatin loops in these nuclei were described in the prior report(44). To link candidate SNPs to distal promoters, we extended each SNP by 1 kb in both directions to define a risk variant region. If one of the chromatin loop bins intersected the risk variant region and the other bin overlapped a promoter of a transcript, the promoter was considered distally linked with the SNP. For fine-mapped SNPs that did not overlap or distally contact any promoters, we assigned them to the nearest

Acknowledgments

We acknowledge the Psychiatric Genomics Consortium, UK BioBank, GSCAN (the GWAS & Sequencing Consortium of Alcohol and Nicotine), the International Parkinson’s Disease Genomics Consortium, and the CTGlab (the Complex Trait Genetics lab at the VU University Amsterdam and the Amsterdam University Medical Centre) for making their GWAS results publicly available. We would like to thank the research participants and employees of 23andMe, Inc. for making this work possible.

Funding:

National Institutes of Health grant R01MH121394 (SH)

National Institutes of Health grant R01MH112751 (SH)

Author contributions:

Conceptualization: JZ, DRW, SH

Methodology: JZ, SH

Investigation: JZ, SH

Visualization: JZ, SH

Supervision: DRW, SH

Writing – original draft: JZ, DRW, SH

Writing – review & editing: JZ, DRW, SH

Competing interests:

DRW serves on the Scientific Advisory Boards of Sage Therapeutics and Pasithea Therapeutics. All other authors declare they have no competing interests.

Data and materials availability:

Data S7-9 for fine mapping GWAS risk loci for schizophrenia, depression and AD

J. Zhou, O. G. Troyanskaya, Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods 12, 931-934 (2015).

Z. P. Hansruedi Mathys, Carles A. Boix, Matheus B. Victor, Noelle Leary, Sudhagar Babu, X. J. Ghada Abdelhady, Ayesha P. Ng, Kimia Ghafari, Alexander K. Kunisky, Julio Mantero, V. N. L. Kyriaki Galani, Gabrielle E. Fortier, Yasmine Lotfi, Jason Ivey, Hannah P. Brown, N. C. Pratham R. Patel, Jacob I. Beaudway, Elizabeth J. Imhoff, Cameron F. Keeler, H. H. P. Maren M. McChesney, Sahil P. Patel, Megan T. Thai, David A. Bennett, Manolis Kellis, L.-H. Tsai, Single-cell atlas reveals correlates of high cognitive function, dementia, and resilience to Alzheimer’s disease pathology. Cell 186, 4365-4385 (2023).

Z. Yang, C. Wang, L. Liu, A. Khan, A. Lee, B. Vardarajan, R. Mayeux, K. Kiryluk, I. Ionita-Laza, CARMA is a new Bayesian model for fine-mapping in genome-wide association meta-analyses. Nat Genet 55, 1057-1065 (2023).

L. Wang, H. C. Chiang, W. Wu, B. Liang, Z. Xie, X. Yao, W. Ma, S. Du, Y. Zhong, Epidermal growth factor receptor is a preferred target for treating amyloid-beta-induced memory loss. Proc Natl Acad Sci U S A 109, 16743-16748 (2012).

Y. J. Chen, C. C. Hsu, Y. J. Shiao, H. T. Wang, Y. L. Lo, A. M. Y. Lin, Anti-inflammatory effect of afatinib (an EGFR-TKI) on OGD-induced neuroinflammation. Sci Rep 9, 2516 (2019).

O. Tavassoly, T. Sato, I. Tavassoly, Inhibition of Brain Epidermal Growth Factor Receptor Activation: A Novel Target in Neurodegenerative Diseases and Brain Injuries. Mol Pharmacol 98, 13-22 (2020).

C. Angermueller, H. J. Lee, W. Reik, O. Stegle, DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biol 18, 67 (2017).

L. D. Moore, T. Le, G. Fan, DNA methylation and its basic function. Neuropsychopharmacology 38, 23-38 (2013).

E. Hervouet, F. M. Vallette, P. F. Cartron, Dnmt3/transcription factor interactions as crucial players in targeted DNA methylation. Epigenetics 4, 487-499 (2009).

D. Szklarczyk, A. L. Gable, D. Lyon, A. Junge, S. Wyder, J. Huerta-Cepas, M. Simonovic, N. T. Doncheva, J. H. Morris, P. Bork, L. J. Jensen, C. V. Mering, STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res 47, D607-D613 (2019).

H. K. Finucane, B. Bulik-Sullivan, A. Gusev, G. Trynka, Y. Reshef, P. R. Loh, V. Anttila, H. Xu, C. Zang, K. Farh, S. Ripke, F. R. Day, C. ReproGen, C. Schizophrenia Working Group of the Psychiatric Genomics, R. Consortium, S. Purcell, E. Stahl, S. Lindstrom, J. R. Perry, Y. Okada, S. Raychaudhuri, M. J. Daly, N. Patterson, B. M. Neale, A. L. Price, Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat Genet 47, 1228-1235 (2015).

V. Trubetskoy, A. F. Pardinas, T. Qi, G. Panagiotaropoulou, S. Awasthi, T. B. Bigdeli, J. Bryois, C. Y. Chen, C. A. Dennison, L. S. Hall, M. Lam, K. Watanabe, O. Frei, T. Ge, J. C. Harwood, F. Koopmans, S. Magnusson, A. L. Richards, J. Sidorenko, Y. Wu, J. Zeng, J. Grove, M. Kim, Z. Li, G. Voloudakis, W. Zhang, M. Adams, I. Agartz, E. G. Atkinson, E. Agerbo, M. Al Eissa, M. Albus, M. Alexander, B. Z. Alizadeh, K. Alptekin, T. D. Als, F. Amin, V. Arolt, M. Arrojo, L. Athanasiu, M. H. Azevedo, S. A. Bacanu, N. J. Bass, M. Begemann, R. A. Belliveau, J. Bene, B. Benyamin, S. E. Bergen, G. Blasi, J. Bobes, S. Bonassi, A. Braun, R. A. Bressan, E. J. Bromet, R. Bruggeman, P. F. Buckley, R. L. Buckner, J. Bybjerg-Grauholm, W. Cahn, M. J. Cairns, M. E. Calkins, V. J. Carr, D. Castle, S. V. Catts, K. D. Chambert, R. C. K. Chan, B. Chaumette, W. Cheng, E. F. C. Cheung, S. A. Chong, D. Cohen, A. Consoli, Q. Cordeiro, J. Costas, C. Curtis, M. Davidson, K. L. Davis, L. de Haan, F. Degenhardt, L. E. DeLisi, D. Demontis, F. Dickerson, D. Dikeos, T. Dinan, S. Djurovic, J. Duan, G. Ducci, F. Dudbridge, J. G. Eriksson, L. Fananas, S. V. Faraone, A. Fiorentino, A. Forstner, J. Frank,

N. B. Freimer, M. Fromer, A. Frustaci, A. Gadelha, G. Genovese, E. S. Gershon, M. Giannitelli, I. Giegling, P. Giusti-Rodriguez, S. Godard, J. I. Goldstein, J. Gonzalez Penas, A. Gonzalez-Pinto, S. Gopal, J. Gratten, M. F. Green, T. A. Greenwood, O. Guillin, S. Guloksuz, R. E. Gur, R. C. Gur, B. Gutierrez, E. Hahn, H. Hakonarson, V. Haroutunian, A. M. Hartmann, C. Harvey, C. Hayward, F. A. Henskens, S. Herms, P. Hoffmann, D. P. Howrigan, M. Ikeda, C. Iyegbe, I. Joa, A. Julia, A. K. Kahler, T. Kam-Thong, Y. Kamatani, S. Karachanak-Yankova, O. Kebir, M. C. Keller, B. J. Kelly, A. Khrunin, S. W. Kim, J. Klovins, N. Kondratiev, B. Konte, J. Kraft, M. Kubo, V. Kucinskas, Z. A. Kucinskiene, A. Kusumawardhani, H. Kuzelova-Ptackova, S. Landi, L. C. Lazzeroni, P. H. Lee, S. E. Legge, D. S. Lehrer, R. Lencer, B. Lerer, M. Li, J. Lieberman, G. A. Light, S. Limborska, C. M. Liu, J. Lonnqvist, C. M. Loughland, J. Lubinski, J. J. Luykx, A. Lynham, M. Macek, Jr., A. Mackinnon, P. K. E. Magnusson, B. S. Maher, W. Maier, D. Malaspina, J. Mallet, S. R. Marder, S. Marsal, A. R. Martin, L. Martorell, M. Mattheisen, R. W. McCarley, C. McDonald, J. J. McGrath, H. Medeiros, S. Meier, B. Melegh, I. Melle, R. I. Mesholam-Gately, A. Metspalu, P. T. Michie, L. Milani, V. Milanova, M. Mitjans, E. Molden, E. Molina, M. D. Molto, V. Mondelli, C. Moreno, C. P. Morley, G. Muntane, K. C. Murphy, I. Myin-Germeys, I. Nenadic, G. Nestadt, L. Nikitina-Zake, C. Noto, K. H. Nuechterlein, N. L. O'Brien, F. A. O'Neill, S. Y. Oh, A. Olincy, V. K. Ota, C. Pantelis, G. N. Papadimitriou, M. Parellada, T. Paunio, R. Pellegrino, S. Periyasamy, D. O. Perkins, B. Pfuhlmann, O. Pietilainen, J. Pimm, D. Porteous, J. Powell, D. Quattrone, D. Quested, A. D. Radant, A. Rampino, M. H. Rapaport, A. Rautanen, A. Reichenberg, C. Roe, J. L. Roffman, J. Roth, M. Rothermundt, B. P. F. Rutten, S. Saker-Delye, V. Salomaa, J. Sanjuan, M. L. Santoro, A. Savitz, U. Schall, R. J. Scott, L. J. Seidman, S. I. Sharp, J. Shi, L. J. 
Siever, E. Sigurdsson, K. Sim, N. Skarabis, P. Slominsky, H. C. So, J. L. Sobell, E. Soderman, H. J. Stain, N. E. Steen, A. A. Steixner-Kumar, E. Stogmann, W. S. Stone, R. E. Straub, F. Streit, E. Strengman, T. S. Stroup, M. Subramaniam, C. A. Sugar, J. Suvisaari, D. M. Svrakic, N. R. Swerdlow, J. P. Szatkiewicz, T. M. T. Ta, A. Takahashi, C. Terao, F. Thibaut, D. Toncheva, P. A. Tooney, S. Torretta, S. Tosato, G. B. Tura, B. I. Turetsky, A. Ucok, A. Vaaler, T. van Amelsvoort, R. van Winkel, J. Veijola, J. Waddington, H. Walter, A. Waterreus, B. T. Webb, M. Weiser, N. M. Williams, S. H. Witt, B. K. Wormley, J. Q. Wu, Z. Xu, R. Yolken, C. C. Zai, W. Zhou, F. Zhu, F. Zimprich, E. C. Atbasoglu, M. Ayub, C. Benner, A. Bertolino, D. W. Black, N. J. Bray, G. Breen, N. G. Buccola, W. F. Byerley, W. J. Chen, C. R. Cloninger, B. Crespo-Facorro, G. Donohoe, R. Freedman, C. Galletly, M. J. Gandal, M. Gennarelli, D. M. Hougaard, H. G. Hwu, A. V. Jablensky, S. A. McCarroll, J. L. Moran, O. Mors, P. B. Mortensen, B. Muller-Myhsok, A. L. Neil, M. Nordentoft, M. T. Pato, T. L. Petryshen, M. Pirinen, A. E. Pulver, T. G. Schulze, J. M. Silverman, J. W. Smoller, E. A. Stahl, D. W. Tsuang, E. Vilella, S. H. Wang, S. Xu, C. Indonesia Schizophrenia, PsychEncode, C. Psychosis Endophenotypes International, G. O. C. Syn, R. Adolfsson, C. Arango, B. T. Baune, S. I. Belangero, A. D. Borglum, D. Braff, E. Bramon, J. D. Buxbaum, D. Campion, J. A. Cervilla, S. Cichon, D. A. Collier, A. Corvin, D. Curtis, M. D. Forti, E. Domenici, H. Ehrenreich, V. Escott-Price, T. Esko, A. H. Fanous, A. Gareeva, M. Gawlik, P. V. Gejman, M. Gill, S. J. Glatt, V. Golimbet, K. S. Hong, C. M. Hultman, S. E. Hyman, N. Iwata, E. G. Jonsson, R. S. Kahn, J. L. Kennedy, E. Khusnutdinova, G. Kirov, J. A. Knowles, M. O. Krebs, C. Laurent-Levinson, J. Lee, T. Lencz, D. F. Levinson, Q. S. Li, J. Liu, A. K. Malhotra, D. Malhotra, A. McIntosh, A. McQuillin, P. R. Menezes, V. A. Morgan, D. W. Morris, B. J. Mowry, R. M. 
Murray, V. Nimgaonkar, M. M. Nothen, R. A. Ophoff, S. A. Paciga, A. Palotie, C. N. Pato, S. Qin, M. Rietschel, B. P. Riley, M. Rivera, D. Rujescu, M. C. Saka, A. R. Sanders, S. G. Schwab, A. Serretti, P. C. Sham, Y. Shi, D. St Clair, H. Stefansson, K. Stefansson, M. T. Tsuang, J. van Os, M. P. Vawter, D. R. Weinberger, T. Werge, D. B. Wildenauer, X. Yu, W. Yue, P. A. Holmans, A. J. Pocklington, P. Roussos, E. Vassos, M. Verhage, P. M. Visscher, J. Yang, D. Posthuma, O. A. Andreassen, K. S. Kendler, M. J. Owen, N. R. Wray, M. J. Daly, H. Huang, B. M. Neale, P. F. Sullivan, S. Ripke, J. T. R. Walters, M. C. O'Donovan, C. Schizophrenia Working Group of the Psychiatric Genomics,

Mapping genomic loci implicates genes and synaptic biology in schizophrenia. Nature 604, 502-508 (2022).

D. M. Howard, M. J. Adams, T. K. Clarke, J. D. Hafferty, J. Gibson, M. Shirali, J. R. I. Coleman, S. P. Hagenaars, J. Ward, E. M. Wigmore, C. Alloza, X. Shen, M. C. Barbu, E. Y. Xu, H. C. Whalley, R. E. Marioni, D. J. Porteous, G. Davies, I. J. Deary, G. Hemani, K. Berger, H. Teismann, R. Rawal, V. Arolt, B. T. Baune, U. Dannlowski, K. Domschke, C. Tian, D. A. Hinds, T. andMe Research, C. Major Depressive Disorder Working Group of the Psychiatric Genomics, M. Trzaskowski, E. M. Byrne, S. Ripke, D. J. Smith, P. F. Sullivan, N. R. Wray, G. Breen, C. M. Lewis, A. M. McIntosh, Genome-wide meta-analysis of depression identifies 102 independent variants and highlights the importance of the prefrontal brain regions. Nat Neurosci 22, 343-352 (2019).

C. Bellenguez, F. Kucukali, I. E. Jansen, L. Kleineidam, S. Moreno-Grau, N. Amin, A. C. Naj, R. Campos-Martin, B. Grenier-Boley, V. Andrade, P. A. Holmans, A. Boland, V. Damotte, S. J. van der Lee, M. R. Costa, T. Kuulasmaa, Q. Yang, I. de Rojas, J. C. Bis, A. Yaqub, I. Prokic, J. Chapuis, S. Ahmad, V. Giedraitis, D. Aarsland, P. Garcia-Gonzalez, C. Abdelnour, E. Alarcon-Martin, D. Alcolea, M. Alegret, I. Alvarez, V. Alvarez, N. J. Armstrong, A. Tsolaki, C. Antunez, I. Appollonio, M. Arcaro, S. Archetti, A. A. Pastor, B. Arosio, L. Athanasiu, H. Bailly, N. Banaj, M. Baquero, S. Barral, A. Beiser, A. B. Pastor, J. E. Below, P. Benchek, L. Benussi, C. Berr, C. Besse, V. Bessi, G. Binetti, A. Bizarro, R. Blesa, M. Boada, E. Boerwinkle, B. Borroni, S. Boschi, P. Bossu, G. Brathen, J. Bressler, C. Bresner, H. Brodaty, K. J. Brookes, L. I. Brusco, D. Buiza-Rueda, K. Burger, V. Burholt, W. S. Bush, M. Calero, L. B. Cantwell, G. Chene, J. Chung, M. L. Cuccaro, A. Carracedo, R. Cecchetti, L. Cervera-Carles, C. Charbonnier, H. H. Chen, C. Chillotti, S. Ciccone, J. Claassen, C. Clark, E. Conti, A. Corma-Gomez, E. Costantini, C. Custodero, D. Daian, M. C. Dalmasso, A. Daniele, E. Dardiotis, J. F. Dartigues, P. P. de Deyn, K. de Paiva Lopes, L. D. de Witte, S. Debette, J. Deckert, T. Del Ser, N. Denning, A. DeStefano, M. Dichgans, J. Diehl-Schmid, M. Diez-Fairen, P. D. Rossi, S. Djurovic, E. Duron, E. Duzel, C. Dufouil, G. Eiriksdottir, S. Engelborghs, V. Escott-Price, A. Espinosa, M. Ewers, K. M. Faber, T. Fabrizio, S. F. Nielsen, D. W. Fardo, L. Farotti, C. Fenoglio, M. Fernandez-Fuertes, R. Ferrari, C. B. Ferreira, E. Ferri, B. Fin, P. Fischer, T. Fladby, K. Fliessbach, B. Fongang, M. Fornage, J. Fortea, T. M. Foroud, S. Fostinelli, N. C. Fox, E. Franco-Macias, M. J. Bullido, A. Frank-Garcia, L. Froelich, B. Fulton-Howard, D. Galimberti, J. M. Garcia-Alberca, P. Garcia-Gonzalez, S. Garcia-Madrona, G. Garcia-Ribas, R. Ghidoni, I. Giegling, G. Giorgio, A. M. Goate, O. 
Goldhardt, D. Gomez-Fonseca, A. Gonzalez-Perez, C. Graff, G. Grande, E. Green, T. Grimmer, E. Grunblatt, M. Grunin, V. Gudnason, T. Guetta-Baranes, A. Haapasalo, G. Hadjigeorgiou, J. L. Haines, K. L. Hamilton-Nelson, H. Hampel, O. Hanon, J. Hardy, A. M. Hartmann, L. Hausner, J. Harwood, S. Heilmann-Heimbach, S. Helisalmi, M. T. Heneka, I. Hernandez, M. J. Herrmann, P. Hoffmann, C. Holmes, H. Holstege, R. H. Vilas, M. Hulsman, J. Humphrey, G. J. Biessels, X. Jian, C. Johansson, G. R. Jun, Y. Kastumata, J. Kauwe, P. G. Kehoe, L. Kilander, A. K. Stahlbom, M. Kivipelto, A. Koivisto, J. Kornhuber, M. H. Kosmidis, W. A. Kukull, P. P. Kuksa, B. W. Kunkle, A. B. Kuzma, C. Lage, E. J. Laukka, L. Launer, A. Lauria, C. Y. Lee, J. Lehtisalo, O. Lerch, A. Lleo, W. Longstreth, Jr., O. Lopez, A. L. de Munain, S. Love, M. Lowemark, L. Luckcuck, K. L. Lunetta, Y. Ma, J. Macias, C. A. MacLeod, W. Maier, F. Mangialasche, M. Spallazzi, M. Marquie, R. Marshall, E. R. Martin, A. M. Montes, C. M. Rodriguez, C. Masullo, R. Mayeux, S. Mead, P. Mecocci, M. Medina, A. Meggy, S. Mehrabian, S. Mendoza, M. Menendez-Gonzalez, P. Mir, S. Moebus, M. Mol, L. Molina-Porcel, L. Montrreal, L. Morelli, F. Moreno, K. Morgan, T. Mosley, M. M. Nothen, C. Muchnik, S. Mukherjee, B. Nacmias, T. Ngandu, G. Nicolas, B. G. Nordestgaard, R. Olaso, A. Orellana, M. Orsini, G. Ortega, A. Padovani, C. Paolo, G. Papenberg, L. Parnetti, F. Pasquier, P. Pastor, G. Peloso, A. Perez-Cordon, J. Perez-Tur, P. Pericard, O. Peters, Y. A. L. Pijnenburg, J. A. Pineda, G. Pinol-Ripoll, C. Pisanu, T. Polak, J. Popp, D. Posthuma, J. Priller, R. Puerta, O. Quenez, I. Quintela, J. Q. Thomassen, A. Rabano, I. Rainero, F.

Rajabli, I. Ramakers, L. M. Real, M. J. T. Reinders, C. Reitz, D. Reyes-Dumeyer, P. Ridge, S. Riedel-Heller, P. Riederer, N. Roberto, E. Rodriguez-Rodriguez, A. Rongve, I. R. Allende, M. Rosende-Roca, J. L. Royo, E. Rubino, D. Rujescu, M. E. Saez, P. Sakka, I. Saltvedt, A. Sanabria, M. B. Sanchez-Arjona, F. Sanchez-Garcia, P. S. Juan, R. Sanchez-Valle, S. B. Sando, C. Sarnowski, C. L. Satizabal, M. Scamosci, N. Scarmeas, E. Scarpini, P. Scheltens, N. Scherbaum, M. Scherer, M. Schmid, A. Schneider, J. M. Schott, G. Selbaek, D. Seripa, M. Serrano, J. Sha, A. A. Shadrin, O. Skrobot, S. Slifer, G. J. L. Snijders, H. Soininen, V. Solfrizzi, A. Solomon, Y. Song, S. Sorbi, O. Sotolongo-Grau, G. Spalletta, A. Spottke, A. Squassina, E. Stordal, J. P. Tartan, L. Tarraga, N. Tesi, A. Thalamuthu, T. Thomas, G. Tosto, L. Traykov, L. Tremolizzo, A. Tybjaerg-Hansen, A. Uitterlinden, A. Ullgren, I. Ulstein, S. Valero, O. Valladares, C. V. Broeckhoven, J. Vance, B. N. Vardarajan, A. van der Lugt, J. V. Dongen, J. van Rooij, J. van Swieten, R. Vandenberghe, F. Verhey, J. S. Vidal, J. Vogelgsang, M. Vyhnalek, M. Wagner, D. Wallon, L. S. Wang, R. Wang, L. Weinhold, J. Wiltfang, G. Windle, B. Woods, M. Yannakoulia, H. Zare, Y. Zhao, X. Zhang, C. Zhu, M. Zulaica, Eadb, Gr@Ace, Degesco, Eadi, Gerad, Demgene, FinnGen, Adgc, Charge, L. A. Farrer, B. M. Psaty, M. Ghanbari, T. Raj, P. Sachdev, K. Mather, F. Jessen, M. A. Ikram, A. de Mendonca, J. Hort, M. Tsolaki, M. A. Pericak-Vance, P. Amouyel, J. Williams, R. Frikke-Schmidt, J. Clarimon, J. F. Deleuze, G. Rossi, S. Seshadri, O. A. Andreassen, M. Ingelsson, M. Hiltunen, K. Sleegers, G. D. Schellenberg, C. M. van Duijn, R. Sims, W. M. van der Flier, A. Ruiz, A. Ramirez, J. C. Lambert, New insights into the genetic etiology of Alzheimer's disease and related dementias. Nat Genet 54, 412-436 (2022).

M. Yu, A. Abnousi, Y. Zhang, G. Li, L. Lee, Z. Chen, R. Fang, T. M. Lagler, Y. Yang, J. Wen, Q. Sun, Y. Li, B. Ren, M. Hu, SnapHiC: a computational pipeline to identify chromatin loops from single-cell Hi-C data. Nat Methods 18, 1056-1059 (2021). Illustration of INTERACT architecture and two-step training strategy (pre-training and finetuning). B. Comparison of model performance between INTERACT and CNN in predicting DNAm levels of independent CpG sites in each cell type. The performance was evaluated using three metrics: area under the Receiver Operating Characteristic curve (ROC), area under the Precision-Recall Curve (PRC), and mean squared error (MSE). ODC, oligodendrocyte. Astro, astrocyte. MG, microglia. OPC, oligodendrocyte progenitor cell. Endo, endothelial cell. L2/3, L4, L5 and L6 denote excitatory neuron subtypes in different cortical layers. Pvalb and Sst, medial ganglionic eminence-derived inhibitory subtypes. Ndnf and Vip, CGE-derived inhibitory subtypes. C. Hierarchical clustering of 13 brain cell types from predicted DNAm levels of independent CpG sites on chromosome 22 that were not used for model training and validation. D. Filters show higher regulatory activity for enhancers targeting cell type-specific genes compared to enhancers targeting shared genes. E. Enrichment of physical interaction evidence with DNA methylation or demethylation enzymes for TFs identified at varying FDR cutoffs. The upper corner highlights the proportions of TFs with physical interaction evidence in two groups, separated by an FDR cutoff of 0.05. variants. A. Schematic view of in silico discovery of cell type-specific DNAm regulatory variants. B. Clustering of 13 brain cell types by their predicted SNP effects on DNAm levels from each cell type-specific INTERACT model. C. 
Enrichment of active enhancers unique to each broad brain cell type (neuron, astrocyte, microglia, and oligodendrocyte (Oligo)) among variants ranked at different intervals by their predicted effects on DNAm levels in each cell type. Rank interval "0-0.1" represents variants of large effect, ranked in the top 10%. The color gradient represents the fold enrichment of enhancers among variants in each rank interval, relative to variants ranked in the bottom 10%.

contributions to heritability of brain-related traits. The left and right panels represent variants of high effect (in the top 10%) and low effect (in the bottom 10%), respectively, as predicted by the INTERACT model of each brain cell type or bulk brain sample (DLPFC) and three additional scoring systems (CADD, GWAVA and DeepSEA). The color gradient represents significance levels (FDR) for enriched heritability (upper panel) or z-score of per-SNP heritability (lower panel). The numbers within the squares of the upper panel are fold enrichments that are significant after multiple testing correction (FDR < 0.05). The numbers within the squares of the lower panel are z-scores of per-SNP heritability that are significant after multiple testing correction (FDR < 0.05).

causal variants from control variants for three brain disorders. A. Fine-mapped putative causal variants show stronger impacts on DNAm levels in specific brain cell types than variants that are unlikely to be causal. B. Fine-mapped putative causal variants show lower DNAm levels for their impacted CpG sites in specific brain cell types than variants that are unlikely to be causal. Red numbers indicate significant group differences (p < 0.05, Wilcoxon rank-sum test).

enhance fine mapping for three brain disorders. Comparison of the average size of credible sets between fine mapping with and without annotations (top left).
Comparison of the number of risk loci counted by the maximum number of SNPs (up to 10) included in the 99% credible sets between fine mapping with and without annotations for schizophrenia (top right), depression (bottom left) and AD (bottom right).

The left panel shows a regional plot for GWAS association signals, followed by a regional plot for posterior inclusion probability (PIP) computed by CARMA without functional annotations, a regional plot for PIP computed by CARMA with functional annotations from 13 brain cell types, and genes within the locus. Red triangles indicate fine-mapped SNPs within the 99% credible set obtained from fine mapping with all annotations. The right panel shows a regional plot of GWAS association signals for the same locus, followed by regional plots for PIP computed by CARMA without functional annotations, a series of regional plots for PIP computed by CARMA with functional annotations from all cell types, four broad cell types (nexc, ninh, neuron, glia), and 13 individual cell types. nexc, excitatory neuron; ninh, inhibitory neuron. Red triangles indicate fine-mapped SNPs within the 99% credible set from fine mapping with each type of annotation; red triangles in the GWAS regional association plot indicate fine-mapped SNPs within the 99% credible set obtained from fine mapping with all annotations.
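
The 99% credible sets referred to in these legends can be illustrated with a minimal sketch. It assumes a single causal signal per locus, so that the per-SNP posterior probabilities sum to roughly 1, and it builds the smallest set of SNPs whose cumulative probability reaches 99%. This is a simplified illustration, not CARMA's procedure; the function name `credible_set` and the posterior probabilities are hypothetical.

```python
def credible_set(pips, level=0.99):
    """Return indices of the smallest SNP set whose posterior mass reaches `level`.

    Assumption: `pips` are posterior probabilities for one causal signal,
    so they sum to approximately 1 across the locus.
    """
    total = sum(pips)
    if total <= 0:
        raise ValueError("posterior probabilities must have positive mass")
    # Rank SNPs from highest to lowest posterior probability.
    order = sorted(range(len(pips)), key=lambda i: pips[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += pips[i] / total  # normalise in case inputs do not sum to exactly 1
        if mass >= level:
            break
    return chosen


# Hypothetical posterior probabilities for five SNPs at one locus.
without_annotations = [0.40, 0.30, 0.15, 0.10, 0.05]
with_annotations = [0.90, 0.095, 0.003, 0.001, 0.001]

print(len(credible_set(without_annotations)))
print(len(credible_set(with_annotations)))
```

In this toy case the more concentrated posterior yields a smaller credible set, which is the quantity compared in the credible-set size panels.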