### NetREX-CF—method overview

The NetREX-CF model is a data integration framework for reconstructing GRNs that jointly uses gene expression *E* and a set of prior networks *P* = {*P*^{1}, …, *P*^{d}}. The main idea behind the NetREX-CF model is the integration of two complementary optimization strategies: (i) a machine learning component based on CF that infers hidden features from the currently observed prior networks *P* and uses these features to recommend an improved GRN, and (ii) a sparse NCA-based network remodeling component that refines the topology of a GRN based on the given gene expression *E*. These two computational components operate alternately. The CF component recommends new edges for the current GRN, and the sparse NCA-based network remodeling component screens the recommended edges and keeps those that are essential to explain the expression of a given gene. Once the sparse NCA-based network remodeling component confirms some of the recommended edges, the CF component uses those retained edges to make new edge recommendations for the remodeling component to examine in turn (illustrated in Fig. 1).

Computationally, the system illustrated in Fig. 1 is achieved by simultaneous optimization of the following sets of variables: (i) the activities of TFs (matrix *A*), (ii) a weighted GRN (matrix *S*), and (iii) two feature matrices: the hidden features for target genes (*X* where the *i*th row *x*_{i} represents the hidden feature vector for gene *i*) and the hidden features for TFs (*Y* where the *j*th row *y*_{j} represents the hidden feature vector for TF *j*). The matrix *A* is optimized by the sparse NCA-based network remodeling component while the matrices *X* and *Y* are optimized by the CF component. Notably, the matrix *S* is the connection between the aforementioned two components and should be optimized by considering both components.

Formally, \(E\in \mathbb{R}^{n\times l}\) is the matrix of expression data of *n* genes in *l* experiments, and each prior network \(P^{k}\in \mathbb{R}^{n\times m},\ \forall k\), is a weighted adjacency matrix of the bipartite graph that records the prior knowledge of regulations between *m* TFs and *n* genes. Matrix \(A\in \mathbb{R}^{m\times l}\) holds the TF activities for *m* TFs in *l* samples and \(S\in \mathbb{R}^{n\times m}\) is a weighted GRN. We further define a penalty matrix *C* and an observation matrix *B* based on the set of prior networks *P*. The matrix *C* is used in the CF component. For edges in the prior, the corresponding elements of *C* take larger values to make sure those edges are kept in the final prediction. For edges not in the prior, the corresponding elements of *C* take smaller values so that new edges can be recommended. The matrix *B* indicates which edges have prior information. Each element of *C* is computed as \(C_{ij}=1+a\sum_{k}P_{ij}^{k}\) (*a* = 60, as suggested by ref. ^{18}). If more than one prior network suggests the regulation between the *i*th gene and the *j*th TF, then *C*_{ij} tends to have a larger value. A large *C*_{ij} pushes the regulation between the *i*th gene and the *j*th TF toward the top of the ranking, that is, toward a more confident prediction. Each element of *B* is binary: *B*_{ij} = 1 if \(\sum_{k}P_{ij}^{k}\ne 0\) and *B*_{ij} = 0 otherwise. \(X\in \mathbb{R}^{n\times h}\) contains the feature vector *x*_{i} for gene *i* and \(Y\in \mathbb{R}^{m\times h}\) contains the feature vector *y*_{j} for TF *j*. Our optimization problem is then formalized as follows:
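As an illustration of the two definitions above, the following minimal sketch builds *C* and *B* from a list of prior networks. It is not taken from the NetREX-CF source code: the function name `build_penalty_and_observation`, the nested-list representation, and the toy priors are assumptions made for this example only.

```python
def build_penalty_and_observation(priors, a=60.0):
    """Build the penalty matrix C and observation matrix B from prior networks.

    priors: list of d matrices, each n x m (rows: genes, columns: TFs),
            given as nested lists of weights. Hypothetical input format.
    a:      scaling constant; 60 is the value quoted in the text.
    """
    n, m = len(priors[0]), len(priors[0][0])
    C = [[0.0] * m for _ in range(n)]
    B = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            total = sum(P[i][j] for P in priors)  # sum_k P^k_ij
            C[i][j] = 1.0 + a * total             # C_ij = 1 + a * sum_k P^k_ij
            B[i][j] = 1 if total != 0 else 0      # B_ij = 1 iff some prior has the edge
    return C, B


# Toy example: 2 genes, 2 TFs, 2 binary priors (values are made up).
P1 = [[1, 0], [0, 0]]
P2 = [[1, 0], [0, 1]]
C, B = build_penalty_and_observation([P1, P2])
print(C)  # [[121.0, 1.0], [1.0, 61.0]]
print(B)  # [[1, 0], [0, 1]]
```

The edge supported by both toy priors receives the largest penalty weight (121), the edge supported by one prior receives 61, and unsupported edges keep the baseline weight of 1.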

$$\begin{array}{rlr}\min\limits_{S,A,X,Y} & \mathcal{H}(S,A)+\lambda\,\mathcal{F}(S,X,Y) & \\ \text{s.t.} & \Vert x_{i}\Vert^{2}\le 1,\ \forall i \\ & \Vert y_{j}\Vert^{2}\le 1,\ \forall j.\end{array} \tag{1}$$

where \(\mathcal{H}(S,A):=\Vert E-SA\Vert_{F}^{2}+\lambda_{A}\Vert A\Vert_{F}^{2}+\lambda_{S}\Vert S\Vert_{F}^{2}+\sum_{ij}\eta_{ij}\Vert S_{ij}\Vert_{0}\) is the sparse NCA-based network remodeling component; \(\lambda_{A}\Vert A\Vert_{F}^{2}+\lambda_{S}\Vert S\Vert_{F}^{2}\) are standard regularization terms and ∑_{ij}*η*_{ij} ∥*S*_{ij}∥_{0} induces sparsity of a given prior GRN and therefore only essential edges that help to minimize \(\mathcal{H}(S,A)\) are retained. ∥*S*_{ij}∥_{0} is the *ℓ*_{0} norm that is 1 if *S*_{ij} ≠ 0 and 0 otherwise.
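To make the four terms of the remodeling objective concrete, here is a small sketch that evaluates it for given matrices. It only evaluates the function and does not optimize it; the helper names (`matmul`, `sq_frobenius`, `nca_objective`), the plain nested-list matrices, and all numeric values are assumptions for illustration.

```python
def matmul(S, A):
    """Plain-Python product of an n x m matrix S and an m x l matrix A."""
    return [[sum(S[i][k] * A[k][c] for k in range(len(A)))
             for c in range(len(A[0]))] for i in range(len(S))]


def sq_frobenius(M):
    """Squared Frobenius norm: sum of squared entries."""
    return sum(v * v for row in M for v in row)


def nca_objective(E, S, A, lam_A, lam_S, eta):
    """||E - SA||_F^2 + lam_A ||A||_F^2 + lam_S ||S||_F^2 + sum_ij eta_ij ||S_ij||_0."""
    SA = matmul(S, A)
    residual = [[E[i][c] - SA[i][c] for c in range(len(E[0]))]
                for i in range(len(E))]
    fit = sq_frobenius(residual)
    ridge = lam_A * sq_frobenius(A) + lam_S * sq_frobenius(S)
    # The l0 term charges eta_ij once for every nonzero edge weight S_ij.
    l0 = sum(eta[i][j] for i in range(len(S))
             for j in range(len(S[0])) if S[i][j] != 0)
    return fit + ridge + l0


# Toy example: 2 genes, 2 TFs, 2 samples (values are made up).
E = [[1.0, 2.0], [0.0, 1.0]]
S = [[1.0, 0.0], [0.0, 0.5]]
A = [[1.0, 2.0], [0.0, 2.0]]
eta = [[0.5, 0.5], [0.5, 0.5]]
value = nca_objective(E, S, A, lam_A=0.5, lam_S=2.0, eta=eta)
print(value)  # 8.0
```

In this toy case *SA* reproduces *E* exactly, so the fit term is 0; the regularization terms contribute 4.5 + 2.5 and the two nonzero edges contribute 0.5 each.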

In (1), \(\mathcal{F}(S,X,Y):=\sum_{i,j}\varOmega_{ij}(\varTheta_{ij}-x_{i}^{T}y_{j})^{2}\) optimizes the hidden features *X* and *Y* of the CF component; *Θ*_{ij} is a binary matrix of edges to be predicted by the hidden features in the given iteration and *Ω*_{ij} encodes penalties that guide the predictions. Both *Θ*_{ij} ≔ ∥*S*_{ij}∥_{0} ⊕ *B*_{ij} and \(\varOmega_{ij}:=\bar{C}_{ij}\Vert S_{ij}\Vert_{0}+C_{ij}(1-\Vert S_{ij}\Vert_{0})\) are defined based on ∥*S*_{ij}∥_{0}, and the penalty matrix *C*_{ij} is built from the prior information. Detailed explanations of *Θ*_{ij} and *Ω*_{ij} are provided in the Method Details section. For the initialization step, both *Θ*_{ij} and *Ω*_{ij} are defined based on the prior networks only, while in the subsequent steps they also take into account the results of the sparse NCA-based network remodeling component (illustrated in Fig. 1). To solve problem (1), we first put all continuous terms together and define \(H(S,A):=\Vert E-SA\Vert_{F}^{2}+\lambda_{A}\Vert A\Vert_{F}^{2}+\lambda_{S}\Vert S\Vert_{F}^{2}\), and then put all the non-continuous terms together and define \(F(S,X,Y):=\sum_{i,j}\varOmega_{ij}(\Vert S_{ij}\Vert_{0}\oplus B_{ij}-x_{i}^{T}y_{j})^{2}+\sum_{ij}\eta_{ij}\Vert S_{ij}\Vert_{0}\). The optimization problem then has the general form of an objective function *Φ*(*S*, *A*, *X*, *Y*) = *H*(*S*, *A*) + *F*(*S*, *X*, *Y*), where *H*(*S*, *A*) is continuous but non-convex and *F*(*S*, *X*, *Y*) is a composite function of the *ℓ*_{0} norm of elements of *S* and other variables, so it is neither continuous nor convex. More importantly, ∥*S*_{ij}∥_{0} is coupled with *x*_{i} and *y*_{j}, so ∥*S*_{ij}∥_{0} cannot be separated from *F*(*S*, *X*, *Y*) as a distinct term. 
To the best of our knowledge, there is no known method that can optimize such a complex and non-convex function involving the inseparable *ℓ*_{0} norm. To fill this gap, we propose here an algorithm, GPALM, that generalizes the so-called PALM method^{22} and solves a class of problems of the format above, under the assumption that *F*(*S*, *X*, *Y*) is lower semi-continuous (see Supplementary Note 1). The GPALM method is fully described in Supplementary Note 2, where we also formally prove its convergence. The source code of NetREX-CF is available at: https://github.com/EJIUB/NetREX_CF.
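The weighted squared-error term of the CF component and the unit-norm constraints of problem (1) can be illustrated with the short sketch below. It is not GPALM and not the authors' implementation: *Θ* and *Ω* are taken as given inputs (their construction is described in the Method Details section), and projecting rows onto the unit ball is only one simple way to satisfy the constraints, assumed here for illustration. The function names and toy values are hypothetical.

```python
import math


def cf_objective(Theta, Omega, X, Y):
    """Weighted squared error: sum_ij Omega_ij * (Theta_ij - x_i . y_j)^2."""
    total = 0.0
    for i, x in enumerate(X):        # x: hidden feature vector of gene i
        for j, y in enumerate(Y):    # y: hidden feature vector of TF j
            pred = sum(xa * ya for xa, ya in zip(x, y))
            total += Omega[i][j] * (Theta[i][j] - pred) ** 2
    return total


def project_rows_to_unit_ball(M):
    """Rescale every row whose norm exceeds 1 so that ||row||^2 <= 1 holds."""
    out = []
    for row in M:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm for v in row] if norm > 1.0 else list(row))
    return out


# Toy example: 2 genes, 1 TF, 2 hidden features (values are made up).
X = project_rows_to_unit_ball([[1.0, 0.0], [0.0, 2.0]])
Y = project_rows_to_unit_ball([[1.0, 0.0]])
Theta = [[1], [1]]         # both edges are targets to be predicted
Omega = [[61.0], [1.0]]    # the first edge carries a much larger penalty weight
loss = cf_objective(Theta, Omega, X, Y)
print(X)     # [[1.0, 0.0], [0.0, 1.0]]
print(loss)  # 1.0
```

Here the heavily weighted edge is reproduced exactly by the features (error 0), while the lightly weighted edge is missed and contributes its full weight of 1 to the loss.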

### Validation and benchmarking NetREX-CF on yeast data

To demonstrate the capability of our GRN reconstruction method, we tested it on datasets from multiple species: yeast, fruit fly, and human. For yeast, we collected multiple datasets that measure different aspects of gene regulation. These datasets include TF ChIP^{7,33,34}, TF DNA binding motif^{7,35}, genetic knockdown^{7,36,37}, and yeast gene expression^{7,38,39,40}. The TF ChIP, motif, and genetic knockdown datasets serve as prior knowledge for TF-gene interactions in the yeast GRN. The details of these priors and the overlap among them are summarized in Table 1. We further use TF-gene interactions extracted from the YEASTRACT database^{41} as a gold standard to validate the performance of GRN reconstruction. These gold standard TF-gene interactions are supported by both DNA binding and expression evidence. The details of the gold standard TF-gene interactions and their overlap with the prior datasets are shown in Table 1. Results generated by NetREX-CF are benchmarked against results obtained from published methods, including GRN prediction methods that are able to use prior knowledge. In the sections that follow, we detail the comparisons among two popular approaches that consider only gene expression, GENIE3^{42} and GRNBoost2^{43}; the prior-based approaches NetREX-CF, MerlinP^{7}, NetREX^{2}, LassoStARS^{6}, the original CF^{18}, and the summation of all prior knowledge (PriorSum); and a baseline that assigns a random confidence score (uniformly distributed between 0 and 1; hereafter, the random method). For a detailed description of parameter selection for competing methods, we refer the reader to Supplementary Note 3.

To ensure an impartial comparison, we use Average Rank scores (ARS)^{18}. For each method and for each gene *i*, we obtain a list of TFs that are predicted to regulate gene *i* and sort these TFs by the confidence of the prediction (the most confident TFs rank highest). We use \(r_{ij}^{g}\) to denote the percentile ranking of TF *j* within the ordered list of all TFs for gene *i*. Thus, \(r_{ij}^{g}=0\%\) means that TF *j* is predicted with the highest confidence to regulate gene *i*, preceding all other TFs in the list. On the other hand, \(r_{ij}^{g}=100\%\) when TF *j* is predicted to be the least likely TF for gene *i* or when the method yields no prediction between TF *j* and gene *i*. Based on the gold standard TF-gene interaction dataset *I*, we set *I*_{ij} = 1 if TF *j* regulates gene *i* in the gold standard dataset and *I*_{ij} = 0 otherwise. For any gene *i*, we use the average rank of the gold standard edges in the list of TF predicted…
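The percentile ranking described above can be sketched for a single gene as follows. This is a hedged illustration, not the evaluation code used in the study: the exact ARS formula is not reproduced in the text above, so the plain mean over gold standard TFs, the linear mapping of list position to a percentage, and the TF names and scores are all assumptions.

```python
def percentile_ranks(all_tfs, scores):
    """Percentile rank of each TF for one gene: 0 = most confident, 100 = least.

    all_tfs: every TF under consideration (hypothetical names).
    scores:  dict TF -> confidence for this gene; TFs missing from the
             dict have no prediction and receive a rank of 100.
    """
    ranked = sorted((tf for tf in all_tfs if tf in scores),
                    key=lambda tf: scores[tf], reverse=True)
    denom = max(len(all_tfs) - 1, 1)       # avoid division by zero for one TF
    ranks = {tf: 100.0 for tf in all_tfs}  # default: no prediction -> 100%
    for pos, tf in enumerate(ranked):
        ranks[tf] = 100.0 * pos / denom
    return ranks


def mean_gold_rank(ranks, gold_tfs):
    """Average percentile rank over the gold standard TFs of this gene."""
    gold = [ranks[tf] for tf in gold_tfs if tf in ranks]
    return sum(gold) / len(gold) if gold else None


# Toy example: 5 TFs, one of them (TF5) without any prediction.
all_tfs = ["TF1", "TF2", "TF3", "TF4", "TF5"]
scores = {"TF1": 0.9, "TF2": 0.1, "TF3": 0.5, "TF4": 0.7}
ranks = percentile_ranks(all_tfs, scores)
print(ranks["TF1"], ranks["TF4"], ranks["TF5"])  # 0.0 25.0 100.0
print(mean_gold_rank(ranks, ["TF1", "TF3"]))     # 25.0
```

Under this convention a lower average means the gold standard regulators sit nearer the top of the predicted list.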

2022-11-23 18:05:42