NetREX-CF integrates incomplete transcription factor data with gene expression to reconstruct gene


NetREX-CF—method overview

The NetREX-CF model is a data integration framework for reconstructing GRNs by organically utilizing both gene expression E and a set of prior networks P = {P1, . . . Pd}. The main idea behind the NetREX-CF model is an integration of two complementary optimization strategies: (i) a machine learning component designed based on CF that is able to infer hidden features from the current observed prior networks P and utilize these features to recommend an improved GRN and (ii) a sparse NCA-based network remodeling component that can refine the topology of a GRN based on given gene expression E. These two computational components operate alternatively. The CF component recommends new edges to the current GRN and the sparse NCA-based network remodeling component screens the recommended edges and keeps the edges that are essential to explain the expression of a given gene. Once the sparse NCA-based network remodeling component confirms some of the recommended edges, the CF component further utilizes those retained recommended edges to make new edge recommendations for the sparse NCA-based network remodeling component to further examine (illustrated in Fig. 1).

Fig. 1: Method overview.
figure 1

Collaborative filtering (CF) and NCA-based gene expression modeling alternatively inform each other during a joint optimization process: CF recommends edges to be confirmed by the NCA model and used to improve CF.

Computationally, the system illustrated in Fig. 1 is achieved by simultaneous optimization of the following sets of variables: (i) the activities of TFs (matrix A), (ii) a weighted GRN (matrix S), and (iii) two feature matrices: the hidden features for target genes (X where the ith row xi represents the hidden feature vector for gene i) and the hidden features for TFs (Y where the jth row yj represents the hidden feature vector for TF j). The matrix A is optimized by the sparse NCA-based network remodeling component while the matrices X and Y are optimized by the CF component. Notably, the matrix S is the connection between the aforementioned two components and should be optimized by considering both components.

Formally, (Ein {{mathbb{R}}}^{ntimes l}) is the matrix of expression data of n genes in l experiments and prior network ({P}^{k}in {{mathbb{R}}}^{ntimes m},forall k) is a weighted adjacency matrix of the bipartite graph that records the prior knowledge of regulations between m TFs and n genes. Matrix (Ain {{mathbb{R}}}^{mtimes l}) is the TF activity for m TFs in l samples and (Sin {{mathbb{R}}}^{ntimes m}) is a weighted GRN. We further define penalty matrix C and observation matrix B based on the set of prior networks P. The matrix C is used in CF component. For edges in the prior, the corresponding elements in C will assign larger values to make sure those edges will be kept in the final prediction. For edges not in the prior, the corresponding elements in C will assign smaller values to encourage new edges as recommendations. The matrix B is used to indicate which edges have prior information. Each element in C can be computed by ({C}_{ij}=1+a{sum }_{k}{P}_{ij}^{k}) (a = 60 suggested by ref. 18). If more than one prior network suggests the regulation between the ith gene and the jth TF, then Cij tends to have a larger value. Large Cij would enforce giving the regulation between the ith gene and the jth TF a lower ranking. Each element in B is binary and can be computed by Bij = 1 if ({sum }_{k}{P}_{ij}^{k},ne, 0) and Bij = 0 otherwise. (Xin {{mathbb{R}}}^{ntimes h}) contains feature vector xi for gene i and (Yin {{mathbb{R}}}^{mtimes h}) contains feature vector yj for TF j. Then, our optimization problem is formalized as following:

$$begin{array}{rlr}mathop{min }limits_{S,A,X,Y}&{{{{{{{mathcal{H}}}}}}}}(S,A)+lambda {{{{{{{mathcal{F}}}}}}}}(S,X,Y)&\ s.t.&{leftVert {x}_{i}rightVert }^{2}le 1,,forall i\ &{leftVert {y}_{j}rightVert }^{2}le 1,,forall j.end{array}$$

(1)

where ({{{{{{{mathcal{H}}}}}}}}(S,A):= {leftVert E-SArightVert }_{F}^{2}+{lambda }_{A}{parallel} A{parallel }_{F}^{2}+{lambda }_{S}{parallel} S{parallel }_{F}^{2}+{sum }_{ij}{eta }_{ij}{parallel} {S}_{ij}{parallel }_{0}) is the sparse NCA-based network remodeling component; ({lambda }_{A}{parallel} A{parallel }_{F}^{2}+{lambda }_{S}{parallel} S{parallel }_{F}^{2}) are standard regularization terms and ∑ijηij Sij0 induces sparsity of a given prior GRN and therefore only essential edges that help to minimize ({{{{{{{mathcal{H}}}}}}}}(S,A)) are retained. Sij0 is the 0 norm that is 1 if Sij ≠ 0 and 0 otherwise.

In (1) ({{{{{{{mathcal{F}}}}}}}}(S,X,Y):= {sum }_{i,j}{{{{{varOmega}} }}}_{ij}{({{{{{varTheta}} }}}_{ij}-{x}_{i}^{T}{y}_{j})}^{2}) optimizes the hidden features X and Y of the CF component; Θij is a binary matrix of edges to be predicted by the hidden features in the given iteration and Ωij encodes penalties that guide the predictions. Both ΘijSij0Bij and ({{{{{varOmega}} }}}_{ij}:= {bar{C}}_{ij}{parallel} {S}_{ij}{parallel }_{0}+{C}_{ij}(1-{parallel} {S}_{ij}{parallel }_{0})) are defined based on Sij0 and the penalty matrix Cij is built from the prior information . Detailed explanation of Θij and Ωij are provided in Method Details section. For the initialization step, both Θij and Ωij are defined based on the prior networks only while in the subsequent steps they also take into account the results of the sparse NCA-based network remodeling component (illustrated in Fig. 1). To solve the problem (1), we first put all continuous terms together and define (H(S,A):= {leftVert E-SArightVert }_{F}^{2}+{lambda }_{A}{parallel} A{parallel }_{F}^{2}+{lambda }_{S}{parallel} S{parallel }_{F}^{2}) and then put all the non-continuous terms together and define (F(S,X,Y):= {sum }_{i,j}{{{{{varOmega}} }}}_{ij}{({parallel} {S}_{ij}{parallel }_{0}oplus {B}_{ij}-{x}_{i}^{T}{y}_{j})}^{2}+{sum }_{ij}{eta }_{ij}{parallel} {S}_{ij}{parallel }_{0}). Then the optimization problem has a general format of an objective function as Φ(S, A, X, Y) = H(S, A) + F(S, X, Y), where H(S, A) is continuous but non-convex and F(S, X, Y) is a composite function of 0 norm of elements of S and other variables so it is neither continuous nor convex. More importantly, Sij0 is coupled with xi and yj, so that Sij0 cannot be separated from F(S, X, Y) as a distinct term. To the best of our knowledge, there has been no known method that can optimize such a complex and non-convex function involving the inseparable 0 norm. To fill this gap, we propose here an algorithm, GPALM that generalizes the so-called PALM method22 and solves a class of problems of the format above, under the assumption that F(S, X, Y) is lower semi-continuous (see Supplementary Note 1). The GPALM method is fully described in Supplementary Note 2 where we also formally prove its convergence. The source code of NetREX-CF is available at: https://github.com/EJIUB/NetREX_CF.

Validation and benchmarking NetREX-CF on yeast data

To demonstrate the capability of our GRN reconstruction method, we tested using datasets from multiple species, which include yeast, fruit fly, and human. For yeast, we collect multiple datasets that measure different perspectives of gene regulation. These datasets include TF ChIP7,33,34, TF DNA binding motif7,35, genetic knockdown7,36,37, and yeast gene expression7,38,39,40. TF ChIP, motif, and genetic knockdown datasets, serving as prior knowledge for TF-gene interactions in the yeast GRN. The details of these priors are summarized in Table 1 and the overlap among priors is illustrated in Table 1. We further utilize TF-gene interactions extracted from YEASTRACT database41 as a gold standard to validate the performance of GRN reconstruction. These gold standard TF-gene interactions are supported by both DNA binding and expression evidence. The details of the gold standard TF-gene interactions and their overlap with the prior datasets are shown in Table 1. Results generated by NetREX-CF are benchmarked against the results obtained from the published sequential methods, all of which are GRN prediction methods that are able to use prior knowledge. In the sections that follow, we go into detail about the comparisons between two popular approaches that only consider gene expression, GENIE342 and GRNBoost243, as well as prior-based approaches like NetREX-CF, MerlinP7, NetREX2, LassoStARS6, the original CF18, the summation of all prior knowledge (PriorSum), and a technique that assigns a random confidence score (uniformly distributed between 0 and 1; hereafter, a random method). For a detailed description of parameter selection for competing methods, we refer the reader to Supplementary Note 3.

Table 1 Overlap between prior networks and the gold standard network.

To ensure an impartial comparison, we use Average Rank scores (ARS)18. For each method and for each gene i, we can obtain a list of TFs that are predicted to regulate gene i and sort this TFs by the confidence of the prediction (most confident have higher rank). We use ({r}_{ij}^{g}) to denote the percentile-ranking of TF j within the ordered list of all TFs for gene i. Thus, ({r}_{ij}^{g}=0 %) means that TF j is predicted with the highest confidence to regulate gene i, preceding all other TFs in the list. On the other hand, ({r}_{ij}^{g}=100 %) when TF j is predicted to be the least possible TF for gene i or there is no prediction between TF j and gene i yielded by the method. Based on the gold standard TF-gene interaction dataset I, we set Iij = 1 if TF j regulates gene i in the gold standard dataset and Iij = 0 otherwise. For any gene i, we use the average rank of the gold standard edges in the list of TF predicted…



Read More: NetREX-CF integrates incomplete transcription factor data with gene expression to reconstruct gene

2022-11-23 18:05:42

Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments

This website uses cookies to improve your experience. We'll assume you're ok with this, but you can opt-out if you wish. Accept Read More