Operator compression with deep neural networks

This paper studies the compression of partial differential operators using neural networks. We consider a family of operators, parameterized by a potentially high-dimensional space of coefficients that may vary on a large range of scales. Based on the existing methods that compress such a multiscale operator to a finite-dimensional sparse surrogate model on a given target scale, we propose to directly approximate the coefficient-to-surrogate map with a neural network. We emulate local assembly structures of the surrogates and thus only require a moderately sized network that can be trained efficiently in an offline phase. This enables large compression ratios and the online computation of a surrogate based on simple forward passes through the network is substantially accelerated compared to classical numerical upscaling approaches. We apply the abstract framework to a family of prototypical second-order elliptic heterogeneous diffusion operators as a demonstrating example.


Introduction
The remarkable success of machine learning technology, especially deep learning, in classical AI disciplines such as image recognition and natural language processing has led to an increased research interest in leveraging the power of these approaches in other science and engineering disciplines over the last years. In the field of numerical modeling and simulation, promising approaches are emerging that try to integrate machine learning algorithms and traditional physics-based approaches, combining the advantages of the data-driven regime with known physics and domain knowledge. In this spirit, many different approaches to approximating solutions of partial differential equations (PDEs) with neural networks have been proposed, for example so-called physics-informed neural networks (PINNs) [60], the deep Galerkin method [63], or the deep Ritz method [19]. It has become evident that the strategy of using neural networks as ansatz functions for the approximation of a PDE's solution is especially advantageous for high-dimensional problems that are outside the reach of classical mesh-based methods [17,18,33]. For some classes of PDEs, e.g., Kolmogorov PDEs and semilinear heat equations, it has even been proven that neural networks break the curse of dimensionality [9,39].
In this spirit, we strongly believe that the strength of neural networks lies in scenarios where one deals with a whole family of PDEs rather than one single equation, for example in the context of so-called parametric PDEs, i.e., settings where a family of partial differential operators parameterized by some coefficient is considered, see, e.g., [62]. This is particularly true for multiscale problems, where one is interested in computing coarsescale surrogates for problems involving a range of scales that cannot be resolved in a direct numerical simulation.
In this paper, we study the problem of approximating a coefficient-to-surrogate map with a neural network in a very general setting of parameterized PDEs with arbitrarily rough coefficients that may vary on a microscopic scale. In other words, we are not trying to directly approximate the parameter-to-solution map, but rather compress the fine-scale information contained in the continuous operator to a finite-dimensional sparse object that is able to replicate the effective behavior of the solution on a macroscopic scale of interest even in the presence of unresolved oscillations of the underlying coefficient.
The output surrogate models are based on the idea of modern numerical homogenization techniques such as localized orthogonal decomposition [46,49,56], gamblets [52], rough polyharmonic splines [53], the multiscale finite element method [21,38], or the generalized finite element method [7,20]; see [5] and the references therein for a comprehensive overview. These methods have demonstrated high performance in many relevant applications such as porous media flow or wave scattering in heterogeneous media to mention only a few. In particular, they typically do not require explicit assumptions on the existence of lower-dimensional structures in the underlying family of PDE coefficients and yield sparse system matrices that ensure uniform approximation properties of the resulting surrogate. Moreover, the computation of the system matrices mimics the standard assembly procedure from finite element theory, consisting of the generation of local system matrices and their combination by local-to-global mappings, which is exploited to reduce the size of the network architecture and its complexity considerably.
The possibility of fast computation of the surrogates has high potential for multi-query problems, such as in uncertainty quantification, and time-dependent or inverse multiscale problems, which require the computation of surrogates for many different a priori unknown coefficients. Though the aforementioned numerical homogenization methods lead to accurate surrogates for the whole class of coefficients, their computation requires the resolution of all scales locally which marks a severe limitation when it has to be performed many times for the solution of a multi-query problem. There have been attempts to tackle this problem, but the results so far are only applicable to small perturbation regimes [35,50] or settings where the parameterization fulfills additional smoothness requirements [3].
To overcome this problem, we propose to learn the whole nonlinear coefficient-tosurrogate map from a training set consisting of pairs of coefficients and their corresponding surrogates with a deep neural network. In other words, we are combining the domain knowledge from numerical homogenization with a data-driven deep learning approach by essentially learning a numerical homogenization method from data. To this end, we propose using an offline-online approach. In the offline phase, the neural network is trained based on data generated with existing numerical homogenization techniques. In the on-line phase, the compression of previously unseen operators can then be reduced to a simple forward pass of the neural network, which eliminates the computational bottleneck encountered in multi-query settings.
Our method is conceptually different from the existing approaches that try to integrate ideas from homogenization with neural networks. In [6] for example, the authors propose to learn a homogenized PDE from simulation data by linking deep learning with an equation-free multiscale approach. Other papers in the context of uncertainty quantification suggest training a neural network to identify suitable multiscale basis functions for the finite volume method given a porous random medium [13,54]. In [31], the authors consider the problem of elasticity with history-dependent material properties, where a recurrent deep neural network connects microscopic and macroscopic material parameters. In deep multiscale model learning [65], learning techniques are used to predict the evolution from one time step to another within a given coarse multiscale space. The goal of this approach is to obtain a reasonable coarse operator for the successive approximation of a time-dependent PDE. Furthermore, several experimental and theoretical works on the approximation of the coefficient-to-solution map [11,28,30,43] or other quantities of interest such as the ground state energy in Schrödinger equations [41] by deep neural networks have been published in the context of parametric PDEs.
This paper is structured as follows: in Sect. 2, we introduce and motivate the abstract framework for a very general class of linear differential operators. After that, we study the problem of elliptic homogenization as an example of how to apply the general methodology in practice. In Sect. 4, we conduct numerical experiments that show the feasibility of our ideas developed in the previous two sections. We conclude this work with an outlook on further research questions.

Abstract framework
In this section, we describe the general abstract problem of finding discrete compressed surrogates to a family of differential operators that allow us to satisfactorily approximate the original operators on a target scale of interest, given only the underlying coefficients but not a high resolution representation of the operators. We elaborate on how to speed up the online computation of those compressed representatives using deep neural networks after an initial offline training phase.

Setting
3} be a bounded Lipschitz domain and H 1 0 (D) be the Sobolev space of L 2 -functions with weak first derivatives in L 2 (D) that vanish on the boundary of D. We write H -1 (D) for the dual space of H 1 0 (D) and ·, · for the duality pairing between H -1 (D) and H 1 0 (D). Consider a family of linear differential operators that is parameterized by some class A ⊆ L ∞ (D) of admissible coefficients. We emphasize that we do not pose any assumptions on the structure of the coefficients A ∈ A such as periodicity or scale separation and explicitly allow for arbitrarily rough coefficients that may vary on a continuum of scales up to some microscale ε diam(D). We assume that for every A ∈ A the associated operator L A is symmetric ( if u and v have disjoint supports) and bijective.
Bijectivity implies that for any given A ∈ A and f ∈ H -1 (D) there exists a unique u ∈ H 1 0 (D) that solves the equation in a weak sense, i.e., the solution satisfies For given problem data A and f , we are interested in computing an approximation to u on some target scale in reasonable time.

Discretization
In order to be able to solve this problem computationally, we choose a finite-dimensional subspace V h ⊆ H 1 0 (D) of dimension m = dim(V h ). As a standard example, one could take V h to be a classical finite element space based on some mesh T h with characteristic mesh size h and approximate (2.2) with a Galerkin method. However, in the very general setting with A possibly having fine oscillations on a scale that is not resolved by the mesh size h, this approach leads to unreliable approximations of u. Then again, the resolution of these fine-scale features can be prohibitively expensive in terms of computational resources if ε is very small. Note that resolution here may mean that the actual mesh size is significantly smaller than ε, depending on the oscillations of the coefficient and its regularity [8,58]. This means that more advanced discretization techniques are required to still obtain reasonable approximations in the unresolved setting. In practice, the challenge is therefore to compress the fine-scale information that is contained in the operator L A to a suitable surrogate S A on the target scale h, i.e., the surrogate S A must be chosen in such a way that it is still able to capture the characteristic behavior of the operator L A on the scale of interest. Moreover, we require S A to be a bijection that maps the space V h to itself. This ensures that for any A ∈ A and f ∈ H -1 (D) we can find unique u h ∈ V h that weakly solves the discretized equation 3) with f h = Mf , where M is a quadrature-type operator that maps a function in H -1 to an appropriate approximation in V h . Problem (2.3) needs to be understood as finding u h that satisfies The choice of the surrogate is obviously highly dependent on the problem at hand, see for example Sect. 3.2 for possible choices in the case of second-order elliptic diffusion operators.

Characterization in terms of a system matrix
We restrict our discussion to choices of surrogates that can be represented by an m × m system matrix S A that is often called the effective system matrix in the following. We assume that S A ∈ R m×m is of the form (S A ) ij = S A λ j , λ i for a basis λ 1 , . . . , λ m of V h . Note that the basis should be chosen as localized as possible in order for the resulting system matrix to be sparse. The process of operator compression can then be formalized by a compression operator that maps a given coefficient A to the system matrix S A representing the compressed surrogate S A of the operator L A . Once C has been evaluated for given A ∈ A, the solution to (2.2) can then be approximated with a function u h ∈ V h for any right-hand side f ∈ H -1 (D) by solving the linear system S A U = F, where F ∈ R m is the vector with entries F i := Mf , λ i and U ∈ R m contains the coefficients of the basis representation u h = m i=1 U i λ i .

Multi-query scenarios
For many classes of coefficients A and based on the choice of the surrogate, evaluating C requires solving local auxiliary problems, during which the finest scale ε has to be resolved at some point. While this is acceptable if one wants to compress only a few operators in an offline computation, it becomes a major problem once C has to be evaluated for many different coefficients A in an online phase, as for example in certain inverse problems, uncertainty quantification, or the simulation of evolution equations with time-dependent coefficients. This motivates a data-driven offline-online approach, where the offline phase consists of training a neural network to approximate the compression operator C, such that in the subsequent online phase the evaluation of C can be replaced with a simple forward pass through the network, thus eliminating the computational bottleneck.

System matrix decomposition
In principle, one could try to directly approximate the global operator C with a neural network. If the coefficient involves oscillations on some fine scale ε, this would lead to a network architecture with an input layer of size O(ε -d ), an output layer of size O(m), and possible hidden layers. Particularly for small ε, this leads to very large networks and, thus, requires a huge amount of free parameters and therefore extraordinary amounts of training data and storage space in order to preserve good generalization capabilities.
To reduce the necessary size of the network, one can exploit available information on the compression operator C by means of a certain structure in the resulting effective matrices S A . To this end, we think of S A as a matrix composed of multiple inflated sub-matrices, i.e., where J denotes some given index set and S A,j ∈ R s×t , s, t m, are (typically dense) local matrices of equal size. The functions represent local-to-global mappings inspired by classical finite element assembly processes as further explained below. More precisely, there exist index transformations π j and ϕ j , such that where M[i r , i c ] denotes the entry of a matrix M in the i th r row and the i th c column. Note that π j and ϕ j can also map to zero, indicating that the corresponding entry should be disregarded. Let us also emphasize that the mappings j (and, in turn, the index transformations π j and ϕ j ) are completely independent of coefficients A and solely depend on the domain D as well as the geometry of an allotted discretization. The precise definitions of the maps π j and ϕ j as well as the index set J are usually determined in a canonical way by the choice of the computational mesh and the compression operator C. In Sect. 3.4, we present an example of how such mappings may look like.
Depending on the compression operator C and decomposition (2.4), we can expect that all the local matrices S A,j are created in a similar fashion and only depend on a local subsample of the coefficient. This can be understood as a generalization of the assembly process that underlies classical finite element system matrices: these matrices are composed of local system matrices that are computed on each element separately and only require knowledge about the coefficient on the respective element. In that context, the local submatrices all have a similar structure and the mapping by the functions j leads to overlapping contributions on the global level. Going back to the abstract setting, we generalize these properties and assume the existence of a lower-dimensional reduced compression operator such that the contributions S A,j are of the form where the operators extract r relevant features of a given global coefficient. In the context of, e.g., finite element matrices, the operators R j correspond to the restriction of a coefficient to an elementbased piecewise constant approximation and C red incorporates the computation of a local system matrix based on such a sub-sample of the coefficient. To achieve a uniform length r of the output for the operators R j , these operators may include artificially introduced zeros depending on the respective geometric configurations (e.g., at the boundary). An example for a quadrilateral mesh in two dimensions is shown in Fig. 1.
The problem of evaluating C can now be decomposed into multiple evaluations of the reduced operator C red that takes the local information R j (A) of A and outputs a corresponding local matrix as described in (2.7). In our setting of a coefficient A that is potentially unresolved by the target scale h, evaluating C red is nontrivial and might become a bottleneck in multi-query scenarios as already indicated in Sect. 2.4. In such cases, we propose to approximate the operator C red with a deep neural network where θ ∈ R p is a set of p trainable parameters of moderate size such that, for given A ∈ A, the effective system matrix C(A) = S A can be efficiently approximated by which requires just a single forward pass of the minibatch (R j (A)) j∈J through the network. Note that the approximation S A possesses the same sparsity structure as the matrix S A , since the neural network yields only approximations to the local sub-matrices S A,j , whereas the assembling process which determines the sparsity structure of the global matrix is determined by the mappings j , which are independent of the network .
We emphasize that a decomposition of S A as described in (2.4)-(2.8) does not necessarily require a uniform operator C red . If multiple reduced operators are required for such a decomposition, the idea of approximating them by one single neural network can still be applied. It is, however, necessary for the ability of the network to generalize well beyond data seen during training that the reduced operators at least involve certain similarities.

Network training
In practice, the neural network (·, θ ) has to be trained in an offline phase from a set of training examples before it can be used for approximating the mapping C red . We propose to draw N global coefficients (A (i) ) N i=1 from A, extracting the relevant information (A (i) j ) := (R j (A (i) )) from them and compressing it into the corresponding effective matrices (S (i) A,j ) with C red . This results in a total of |J| · N training samples available for the neural network to train on, namely ( In order to learn the parameters of the network, we then minimize the loss functional over the parameter space R p using iterative gradient-based optimization on minibatches of the training data. This can be very efficiently implemented within modern deep learning frameworks such as TensorFlow [1], PyTorch [55], or Flux [40], which allow for the automatic differentiation of the loss functional with respect to the network parameters.

Full algorithm
After having established all the conceptual pieces, we now put them together and return to the abstract variational problem (2.2) from the beginning of the section. Suppose that we want to solve (2.2) for a large number of given coefficients A (i) , i = 1, . . . , M, and a given right-hand side f ∈ H -1 (D). For ease of notation, we restrict ourselves to a single righthand side, which is, however, not necessarily required for our approach. The proposed procedure is summarized in Algorithm 1, divided into the offline and online stages of the method.

Algorithm 1 Operator compression with neural network (i) Offline phase
Input: number of samples N, index set J, operators R j , error functional J reduced compression operator C red , initial neural network (·, θ ),

Application to elliptic homogenization
In this section, we specifically consider a family of prototypical elliptic diffusion operators as a demonstrating example of how to apply the abstract framework laid down in Sect. 2 in practice.

Setting
From now on let the domain D be polyhedral. We consider the family of linear secondorder diffusion operators parameterized by the following set of admissible coefficients which possibly encode microstructures: For the sake of simplicity, we restrict ourselves to scalar coefficients here. Note, however, that also the consideration of matrix-valued coefficients is not an issue from a numerical homogenization viewpoint. We remark that the family of operators L fulfills the assumptions of locality and symmetry from the abstract framework. In this setting, the abstract problem (2.1) amounts to solving the following linear elliptic PDE with homogeneous Dirichlet boundary condition: which possesses a unique weak solution u ∈ H 1 0 (D) for every f ∈ H -1 (D) and A ∈ A. The corresponding counterpart to the weak formulation (2.2) can be written as: by using integration by parts on the divergence term.

Discretization and compression
Let now T h be a Cartesian mesh with characteristic mesh size h and denote with Q 1 (T h ) the corresponding space of piecewise bilinear functions. We consider the conforming finite element space Generally, also other types of meshes and finite element spaces could be employed, but we restrict ourselves to the above choice for the moment. As we have already mentioned in Sect. 2.2, if the mesh T h does not resolve the fine-scale oscillations of A, approximating u with a pure finite element ansatz of seeking u h ∈ V h such that will not yield satisfactory results. In a setting where resolving A with the mesh is computationally too demanding, we are therefore interested in suitable choices for a compression operator C. In particular, we want C to produce effective system matrices on the target scale h that can be used to obtain appropriate approximations on this scale. In the following, we briefly comment on possible choices for this operator that are based on the finite element space V h .

Compression by analytical homogenization
The idea of analytical homogenization is to replace an oscillating A with an appropriate homogenized coefficient A hom ∈ L ∞ (D, R d×d ). The mathematical theory of homogenization can treat very general nonperiodic coefficients in the framework of G-or H-convergence [14,51,64]. However, apart from being nonconstructive in many cases, homogenization in the classical analytical sense considers a sequence of operators -div(A ε ∇·) indexed by ε > 0 and aims to characterize the limit as ε tends to zero. In many realistic applications, such a sequence of models can hardly be identified or may not be available in the first place. Assuming that the necessary requirements on the coefficient A are met, a homogenized coefficient A hom exists and does not involve oscillations on a fine scale. The coefficient A hom can then be used in combination with a classical finite element ansatz, since A hom does no longer include troublesome fine-scale quantities. In practice, the homogenized coefficients cannot be computed easily and need to be approximated. This is, for instance, done with the heterogeneous multiscale method (HMM) [2,15,16], which in the end replaces A with a computable approximation With this piecewise constant approximation of A, we obtain a possible compression operator C. Given an enumeration 1, . . . , m of the inner nodes in T h and writing λ 1 , . . . , λ m for the associated nodal basis of V h , the compressed operator C(A) can be defined as That is, one takes the classical finite element stiffness matrix corresponding to the homogenized coefficient A hom as an effective system matrix. In this case, decomposition (2.4) corresponds to a partition into element-wise stiffness matrices (with constant coefficient, respectively) that are merged with a simple finite element assembly routine.
We emphasize that approaches based on analytical homogenization -such as (3.3)are able to provide reasonable approximations on the target scale h but are subject to structural assumptions, in particular scale separation and local periodicity. The goal to overcome these restrictions has led to a new class of numerical methods that are specifically tailored to treating general coefficients with minimal assumptions. These methods are known as numerical homogenization approaches and typically only require a boundedness condition as in (3.1).

Compression by numerical homogenization
The general idea of numerical homogenization methods is to replace the trial space V h with a suitable multiscale spaceṼ h , see for instance the references [5,7,21,38,46,52,53]. One possible construction uses a one-to-one correspondence ofṼ h to the space V h , which implies that the two spaces possess the same number of degrees of freedom. Typically, the multiscale space is chosen in a problem-adapted way. We indicate this dependence by defining the new spaceṼ h := P A V h , where P A : V h → H 1 0 (D) particularly depends on A. Therefore, another possible choice of the operator C leads to the effective matrix C(A) given by (3.4) A prominent example for such an approach -and, thus, the operator C -is the Petrov-Galerkin version of the localized orthogonal decomposition (LOD) method which explicitly constructs a suitable operator P A . The LOD was introduced in [46] and theoretically and practically works for very general coefficients. It has also been successfully applied to other problem classes, for instance, wave propagation problems in the context of Helmholtz and Maxwell equations [26,27,45,57,61] or the wave equation [4,29,44,59], eigenvalue problems [47,48], and in connection with time-dependent nonlinear Schrödinger equations [37]. However, it requires a slight deviation from locality. That is, while the classical finite element method and the HMM result in a system matrix that only includes neighbor-to-neighbor communication between the degrees of freedom, the multiscale approach (3.4) moderately increases this communication to effectively incorporate the fine-scale information in A for a broader range of coefficients, which is a common property of modern homogenization techniques. As indicated in [12], this slightly increased communication indeed seems to be necessary to handle very general coefficients.
Since we consider a class A of arbitrarily rough coefficients, the compression operator (3.4) corresponding to the operator P A constructed in the LOD method is a suitable choice for our discussion as well as for the numerical experiments of Sect. 4. In the following subsection, we therefore have a closer look into its construction and summarize some main results. Note that we restrict ourselves to an elliptic model problem with homogeneous Dirichlet boundary conditions, but the compression approach can generally be extended to more involved settings such as the ones mentioned above.

Localized orthogonal decomposition
The method is based on a projective quasi-interpolation operator I h : H 1 0 (D) → V h with the following approximation and stability properties: for an element T ∈ T h , we require that For a particular choice of I h , we refer to [24].
For given I h with the above properties, we can define the so-called fine-scale space W, which contains all functions that are not well captured by the finite element functions in V h . It is defined as the kernel of I h with respect to H 1 0 (D), i.e., W := ker I h | H 1 0 (D) , and its local version, for any S ⊆ D, is given by In order to incorporate fine-scale information contained in the coefficient A, the idea is now to compute coefficient-dependent local corrections of functions v h ∈ V h . To this end, we define the neighborhood of order ∈ N iteratively by N (S) := N(N -1 (S)), ≥ 2. For any function v h ∈ V h , its element corrector Q A,T v h ∈ W(N (T)), T ∈ T h , is defined by Note that in an implementation, the element corrections Q A,T have to be computed on a sufficiently fine mesh that resolves the oscillations of the coefficient A. Since the algebraic realization of the correctors and guidelines for an efficient implementation of the LOD method in general are not within the scope of the article, we refer to [23] for details. We emphasize that, by construction, the supports of the correctors Q A,T v h are limited to N (T). The global correction Q A : V h → W then consists of a summation of these local contributions and is given by Note that the choice = ∞ corresponds to a computation of the element corrections on the entire domain D and leads to the orthogonality property that defines an a A -orthogonal splitting This particularly explains the name orthogonal decomposition. The use of localized element corrections is motivated by the decay of the corrections Q ∞ A,T away from the element T. This is, for instance, shown in [36,56] (based on [46]) and reads with a constant c dec which is independent of h and . Motivated by decomposition (3.6) and the localized approximations in (3.5), we choose P A := 1 -Q A in (3.4). The spaceṼ h := P A V h = (1 -Q A )V h , which has the same number of degrees of freedom as V h , can then be used as an ansatz space for the discretization of (3.2). Note that the original LOD method introduced in [46] considers a discretization whereṼ h is also used as test space. We, however, consider the Petrov-Galerkin variant of the method as analyzed in [22] that uses the classical finite element space V h as test space instead, i.e., we seek u h ∈ V h such that where f h = Mf ∈ V h is again a suitable approximation of f ∈ H -1 (D). This defines a compressed operator S A as in (2.3) that maps u h ∈ V h to f h ∈ V h . As it turns out, the Petrov-Galerkin formulation has some computational advantages over the classical method, in particular in terms of memory requirement. For details, we again refer to [23]. The theory in [22] shows that the approximation u h defined in (3.7) is first-order accurate in L 2 (D) provided that | log h| and, additionally, f ∈ L 2 (D). More precisely, it holds where Mf denotes the L 2 -projection of f onto V h . Note that the methodology can actually be applied to more general settings beyond the elliptic case, see for instance [49] for an overview.

System matrix surrogate
We now return to the discussion of the compression operator C introduced in (3.4) that maps coefficients A ∈ A to the effective system matrices obtained from the Petrov-Galerkin LOD method. Once S A has been computed for a given A, an approximation u h = m j=1 U j λ j can be computed by solving the following linear system for the coefficients U = (U 1 , . . . , U m ) T : The remainder of this section is dedicated to showing how the abstract decomposition (2.4) translates to the present LOD setting and how it can be implemented in practice. Writing N (S) for the set of inner mesh nodes on some subdomain S ⊆ D and denoting N S = |N (S)|, the effective system matrix S A can be decomposed as (3.8) where the matrices S A,T are local system matrices of the form (3.9) i.e., they correspond to the interaction of the localized ansatz functions (1 -Q A,T )λ j associated with the nodes of the element T with the classical first order nodal basis functions whose supports overlap with the element neighborhood N (T). This means that S A,T is an N N (T) × N T matrix. In practice, the coefficient A in (3.9) is often replaced with an elementwise constant approximation A ε on a finer mesh T ε that resolves all the oscillations of A and that we assume to be a uniform refinement of T h . As already explained in the abstract framework, the mappings T in (3.8) are local-toglobal mappings that assemble the contributions S A,T on an element neighborhood to a global matrix. In particular, given an enumeration 1, . . . , N N (T) of the nodes in N (N (T)), one considers a mapping g T (·) that assigns to a given node index i in the element neighborhood N (T) its global node index g T (i) ∈ {1, . . . , m}. This mapping can be represented by an m × N N (T) sparse matrix with entries Analogously, given an enumeration 1, . . . , N T of the nodes in N (T), there exists a mapping g T (·) -represented by an m × N T -matrix -that assigns to a given node in N (T) with index i its global representative with indexg T (i). The corresponding matrix is given by Using these matrices, decomposition (3.8) reads (3.10) where ϕ T denotes the transpose of the matrix ϕ T . From the definition of the local contributions S A,T introduced in (3.9), it directly follows that S A,T does only depend on the restriction of A, respectively A ε , to the element neighborhood N (T). Let now T ε (N (T)) be the restriction of the mesh T ε to N (T), consisting of r = |T ε (N (T))| elements. Enumerating the elements then leads to the following operators that correspond to the abstract reduction operators in (2.8): that map a global coefficient A to a vector that contains the values of A ε in the respective cells of T ε (N (T)). As already mentioned in the abstract section above, we aim for a uniform output size of the operators R T , since the outputs of the operators R T will later on be fed into a neural network with a fixed number of input neurons. In order to achieve that, we artificially extend the domain D and the mesh T h by layers of outer elements around the boundary elements of T h , thus ensuring that the element neighborhood N (T) always consists of the same number of elements regardless of the respective location of the central element T ∈ T h relative to the boundary. Further, we extend the piecewise constant coefficient A ε by zero on those outer elements. Figure 1 illustrates this procedure for the case d = 2 and = 1 for an element T that lies in a corner of the computational domain. In this figure, T h is a uniform quadrilateral mesh on the domain D and T ε is obtained from T h by a single Figure 1 Illustration of the extended element neighborhood N 1 (T) around a corner element T ∈ T h . An asterisk indicates that A ε | K ∈ [α, β], a zero that A ε | K = 0 in the respective cell K of the refined mesh T ε uniform refinement step. The asterisks indicate the coefficient A ε taking a regular value in the interval [α, β], whereas in the cells outside of D, we set A ε to zero.
Note that this enlargement of the mesh T h to obtain equally sized element neighborhoods N (T) also introduces artificial mesh nodes that lie outside of D and that are all formally considered as inner nodes for the definition of N S = |N (S)| with a subset S in the extended domain. This implies that the local system matrices S A,T of dimension N N (T) × N T introduced in (3.9) are all of equal size as well and the rows of S A,T corresponding to test functions associated with nodes that are attached to outer elements contain only zeros. During the assembly process of the local contributions to a global matrix, these zero rows are disregarded (which is also consistent with our definition of the matrices π T , ϕ T ).
Finally, in order to unify the computation of local contributions, we use an abstract mapping C red with fixed input dimension r and fixed output dimension N N (T) × 2 d as proposed for the abstract framework in Sect. 2.5. The mapping takes the restriction of A ε to an element neighborhood N (T) as input data and outputs the corresponding approximation of a local effective matrix S A,T that will be determined by an underlying neural network (·, θ ).

Numerical experiments
In this section, we demonstrate the feasibility of our proposed approach by performing numerical experiments in the setting of Sect. 3. For all experiments, we consider the two-dimensional computational domain D = (0, 1) 2 , which we discretize with a uniform quadrilateral mesh T h of characteristic mesh size h = 2 -5 . The coefficients are allowed to vary on the finer unresolved scale ε = 2 -8 .

Coefficient family
In order to test our method's ability to deal with coefficients that show oscillating behavior across multiple scales, we introduce a hierarchy of meshes T k , k = 0, 1, . . . , 8, where the initial mesh T 0 consists only of a single element, and the subsequent meshes are obtained by uniform refinement, i.e., T k is obtained from T k-1 by subdiving each element of T k-1 into four equally sized elements. This implies that the characteristic mesh size of T k is given by 2 -k . In the following, we refer to the parameter k as the mesh level. In particular, the computational mesh T h = T 5 corresponds to the mesh level 5, whereas the coefficients may vary on the mesh level 8 and are therefore only resolved by the finest mesh T 8 . We thus have a scenario where an information gap of 3 mesh levels has to be bridged. Based on the mesh hierarchy, we now define the coefficient family A of interest. Let A k denote the set of element-wise constant coefficients on T k whose values in the respective cells are iid uniformly distributed on the interval [α, β] := [1,5], i.e., Furthermore, let A ms denote the set of coefficients of the form These multiscale coefficients are especially interesting, since they vary on all considered scales simultaneously and are therefore very hard to deal with using classical homogeniza-tion techniques due to their unstructured nature. The total set of interest A is then defined as In the following, we will frequently index coefficients sampled from A by their corresponding level, i.e., write A k , k ∈ {0, . . . , 8, ms} instead of a plain A.

Data generation and preprocessing
In order to train the network, we sample 500 coefficients A (i) k , i = 1, . . . , 500, from each A k , k ∈ {0, . . . , 8, ms}, where the individual samples A (i) k on the coarser mesh levels k = 0, . . . , 7 are prolongated to the finest mesh T 8 in order to achieve a uniform dimension across all scales. The set of all sample coefficients is subsequently divided into a training, validation, and test set according to a 80-10-10 split. In order to achieve an identical distribution in all three sets, the splitting is performed separately on every level (including ms), i.e., for every k ∈ {0, . . . , 8, ms}, the first 400 coefficients A (i) k , i = 1, . . . , 400, get assigned to the training set D train , the samples with indices 401, . . . , 450 to the validation set D val , and those with indices 451, . . . , 500 to the test set D test . Then, we individually split each sample A (i) k , using the reduction operators R T introduced in (3.11), into sub-samples A (i) k,T based on element neighborhoods N (T) for = 2 that are centered around the elements T ∈ T h , also taking into account the artificial extension of the element neighborhoods around the boundary of D. Since our target scale of interest is h = 2 -5 and T h is a uniform quadrilateral mesh, this yields 1024 sub-samples A (i) k,T ∈ R 1600 per sample A (i) k ∈ R 65,536 . Note that the size of the sub-samples is obtained from the construction of the local neighborhoods N (T). Here, each neighborhood consists of (2 +1) 2 = 25 elements in T h = T 5 , which corresponds to 64 · 25 = 1600 elements in the mesh T ε = T 8 .
The corresponding "labels", i.e., the local effective system matrices S (i) A,k,T ∈ R 36×4 , are then computed with the Petrov-Galerkin LOD according to (3.9) and flattened columnwise to vectors in R 144 . In total, we obtain 10 · 400 · 1024 = 4,096,000 pairs (A (i) k,T , S (i) A,k,T ) ∈ D train to train our network with, and 512,000 pairs in D val and D test each.

Network architecture and training
Given the above dataset, we now try to fit it with a suitable neural network . As network architecture, we consider a dense feedforward network with a total of eight layers including the input and output layer. As activation function, we choose the standard ReLU activation given by ρ(x) := max(0, x) in the first seven layers and the identity function in the last layer. By convention, the activation function acts component-wise on vectors. The network output is thus of the form (x) = W (8) ρ W (7) . . . ρ W (2) ρ W (1) x + b (1) + b (2) . . . + b (7) + b (8) , (4.1) where the weight matrices and bias vectors have the following dimensions: yielding a total of 5,063,504 trainable parameters. The idea behind this architecture is that in the first six layers, information about the coefficient in the input element neighborhood is gathered and combined by allowing communication between all inputs in layers with odd indices, whereas in the layers with even indices, this information is repeatedly compressed. That is, every other layer is built in such a way that the input and output dimension are equal. If the neurons in that layer are understood as some sort of degrees of freedom in a mesh, this refers to having communication among all of these degrees of freedoms, while the layers in between reduce the number of degrees of freedom, which can be interpreted as transferring information to a coarser mesh. In the last two layers, this compressed information is taken and assembled to the local effective system matrix. Note that this logarithmic dependence of the number of layers on the number of scales that need to be bridged by the network (two layers per mesh level to be bridged plus two layers to assemble the local effective matrix) yielded the most reliable results in our experiments. Shallower networks had difficulties fitting the complex training set consisting of coefficients varying on different scales, whereas deeper networks were more prone to overfitting the training set. More involved architectures, for example the ones that include skip connections between layers like in the classic ResNet [34], are also conceivable; however, this seems not to be necessary to obtain good results. The key message here is that the coefficient-to-surrogate map can be satisfyingly approximated by a simple feedforward architecture, whose size does depend only on the scales ε and h, but not on any finer discretization scales. The implementation of the network as well as the training is performed using the library Flux [40] for the open-source scientific computing language Julia [10]. After initializing all parameters in the network according to a Glorot uniform distribution [32], network (4.1) is trained on minibatches of 1000 samples for a total of 20 epochs on D train , using the ADAM optimizer [42] with a step size of 10 -4 for the first 5 epochs before reducing it to 10 -5 for the subsequent 15 epochs. It could be observed that further training led to a stagnation of the validation error, whereas the error on the training set continued to decrease (very slowly but gradually), indicating overfitting of the network. The development of the loss functional J defined in (2.9) over the epochs is shown in Fig. 2. Note that training and validation loss stay very close to each other during the whole training process since D train and D val have the same sample distribution due to our chosen splitting procedure. The development of the loss during the training and an average loss of 7.78 · 10 -5 on the test set D test indicates that the network has at least learned to approximate the local effective system matrices. In applications, however, we are mostly concerned about how well this translates to the global level, when computing solutions to problem (3.2) using a global system matrix assembled from network outputs. In order to investigate this question, the next three subsections are dedicated to evaluating the performance of the trained network at exactly this task for several coefficients unseen during training. For a given right-hand side f and coefficient A, we denote with u h the solution of (3.7), obtained with the Petrov-Galerkin LOD matrix S A defined in (3.10), and with u h the solution obtained by using the neural network approximation of this matrix, i.e., The spectral norm difference S A -S A 2 , the L 2 -error u hu h L 2 (D) as well as the visual discrepancy between the two solutions are then considered as a measure of the network's global performance. We emphasize once more that computing approximate surrogates via (4.2) is significantly faster compared to (3.8) and (3.9). This is due to the fact that no corrector problems of the form (3.5) have to be solved to obtain the surrogate model. As pointed out, the solution of these local problems requires inversion on a very fine discretization scale that is significantly smaller than the scale ε on which the coefficient varies. In order to compute the system matrix S A , one has to solve N T h fine-scale linear systems, where N T h denotes the number of elements in T h . In contrast, the main computational effort of evaluating our trained network consists of N L matrix-matrix multiplications, where N L is the number of layers in the network (not taking into account bias vectors and the activation function).

Experiment 1
For our first experiment, we consider an unstructured multiscale coefficient sampled from A ms that was not a part of the training set and a constant right-hand side f ≡ 1. The coefficient (top left), the error |u h (x)u h (x)| (top right) as well as representative cross-sections along x 1 = 0.5 (bottom left) and x 2 = 0.5 (bottom right) of the two solutions u h and u h are shown in Fig. 3. The spectral norm difference S A -S A 2 ≈ 6.58 · 10 -2 and the L 2 -error u hu h L 2 (D) ≈ 2.13 · 10 -4 confirm the visual impression -the network has successfully learned to produce a system matrix that is able to capture the behavior of the solution on the target scale well.

Experiment 2
Next, we test the network's performance for smoother and more regular coefficients than the ones it has been trained with. As a demonstrating example, we consider the coefficient A(x) = 2 + sin(2πx 1 ) sin(2πx 2 ). The network's input is obtained by evaluating A on the midpoints of the mesh T ε on the fine unresolved scale ε. In this example, we choose the function f (x) = x 1 χ {x 1 ≥0.5} as a right-hand side, where χ S denotes the characteristic function of the set S ⊆ D. The results are shown in Fig. 4. We obtain an even better L 2 -error of u hu h L 2 (D) ≈ 7.56 · 10 -5 and a spectral norm difference of S A -S A 2 ≈ 3.91 · 10 -2 . A comparison between the solutions at the cross-sections shows that there is almost no discernible visual difference between the LOD-solution and the approximation obtained using our trained network.

Experiment 3
As a third experiment, we choose another coefficient that possesses an unfamiliar structure not seen by the network during the training phase, this time a less regular one. The coefficient is shown in the top left of Fig. 5. It is composed of a background part (blue region), which is obtained by sampling uniformly and independently on each element of T ε from the interval [1,2], and several "cracks" (yellow regions), in which the coefficient varies uniformly in [4,5]. The right-hand side here is f (x) = cos(2πx 1 ). A computation of the L 2 -error u hu h L 2 (D) ≈ 2.72 · 10 -4 shows that the overall error is still moderate; a closer visual inspection of the solutions along the cross-sections however reveals more prominent deviations of the neural network approximation to the ground truth. The spectral norm difference S A -S A 2 ≈ 2.81 · 10 -1 is also one order of magnitude larger than in the previous examples. Nevertheless, it might be possible that performing a few corrective training steps including samples of this nature would be sufficient to fix this issue. A thorough investigation of this hypothesis, along with the extension to other coefficient classes is subject of future work.

Conclusion and outlook
We proposed an approach to the compression of linear differential operators -parameterized by PDE coefficients that may depend on microscopic quantities that are not resolved by a target discretization scale of interest -to lower-dimensional surrogates that are based on a combination of existing model reduction methods with a data-driven deep learning framework. Our method is motivated by the fact that the computation of the surrogates (represented by effective system matrices) via classical methods is nontrivial and requires significant computational resources in multi-query settings. To overcome this problem, we showed how the compression process can be approximated by a neural network based on given training data that can be generated using existing compression approaches. Importantly, we avoid a global approximation by a neural network and instead first decompose the compression map into local contributions, which can then be approximated by one single unified network. As an example, we studied a class of second-order elliptic diffusion operators. We showed how to approximate the compression map based on the Petrov-Galerkin formulation of the localized orthogonal decomposition method with a neural network. The proposed ansatz has been numerically validated for a large set of piecewise constant and highly oscillatory multiscale coefficients. Furthermore, it has been shown that the approach also generalizes well, in the sense that a well-trained network is able to produce reasonable results even for classes of coefficients that it has not been trained on.
For future research, many possible research questions building on the present work are conceivable. Straightforward extensions would be to consider stochastic settings with differential operators parameterized by random fields or settings with high contrast. Another question to investigate is to what degree the method can be made robust against changes in geometry, for example, by training the network not only on coefficients that are sampled on a fixed domain, but rather on reference patches with varying geometries. Mimicking a hierarchical discretization approach, one may also try to directly approximate the inverse operator which can be represented by a sparse matrix [25]. On a more theoretical level, the approximation properties of neural networks for various existing compression operators could be investigated, along with the question of the number of training samples required to faithfully approximate those for a given family of coefficients.