

Universal approximation property of a continuous neural network based on a nonlinear diffusion equation

Abstract

Recently, differential equation-based neural networks have been actively studied. This paper discusses the universal approximation property of a neural network that is based on a nonlinear partial differential equation (PDE) of the parabolic type.

Based on the assumption that the activation function is non-polynomial and Lipschitz continuous, and applying the theory of the difference method, we show that an arbitrary continuous function on any compact set can be approximated using the output of the network with arbitrary precision. Additionally, we present an estimate of the order of accuracy with respect to t and x.

1 Introduction

Recently, neural networks have been applied in numerous fields, both in social and natural sciences. However, their performance remains a topic of active research. Since Rosenblatt’s work [60], neural networks have been studied extensively. In fact, the set of functions realized by neural network models has been under discussion for some time.

Surprisingly, the transform mapping theorem, which is similar to the universal approximation property, was derived in early research by Kolmogorov [41], and its simplified proof was provided by Sprecher [70]. However, the neural networks they considered differed slightly from existing conventional implementations. Later, in the 1980s, several studies were conducted on the universal approximation property of neural networks. On the one hand, these results greatly encouraged and facilitated research in neural networks. On the other hand, they found the universal approximation property of neural network models to be closely related to (almost simultaneous) controllability in the theory of optimal control. However, there are some differences between the two. When discussing the universal approximation property of a neural network, these works typically include the effect of the output layer, whose activation function may differ from that of the hidden layer. Arguments concerning these areas are introduced and discussed in detail in the next section. Moreover, some recent studies have considered neural networks from the perspective of optimal transport [67, 68, 78]. These arguments have led to the application of dynamical system theory to neural networks.

For example, E [83] regarded a neural network as a method of estimating the parameters of a dynamical system. In particular, he formulated ResNet [29] as an Euler scheme for an ordinary differential equation (ODE) and discussed its stability in the forward direction. He deduced certain conditions under which forward propagation operates stably, in the sense that gradient explosion and vanishing problems do not occur when the eigenvalues of the system are considered. Additionally, he highlighted a close relationship (or even equivalence) between the adjoint equation and backpropagation and introduced a regularity method.

This dynamical systems-based approach toward neural networks became more popular after a study by Chen et al., which provided a framework for representing a neural network with an ODE solver. This framework was referred to as the neural ODE [10].

Thereafter, neural ODEs began to be widely used and implemented [10].

Meanwhile, some methods have been proposed based on ODEs and partial differential equations (PDEs) [28, 31]. Han and Li [28] formulated a neural network using an ODE, considering a cost function optimized using the Hamilton–Jacobi–Bellman (HJB) equation. In our previous study [31], we proposed a framework for a neural network in which we considered the initial-boundary value problem for PDEs.

A maximum principle-based approach was also provided in [44]. Notably, some recent works have actively discussed the application of differential equations to graph neural networks (GNNs) (see, for instance, [9]), along with the “expressive power” and “stability” of GNNs. Oono and Suzuki [56] showed that the expressive power of a GNN decreases when it has an excessively large number of layers. They also proposed a concept called “over-smoothing”, in which the feature vectors of all nodes tend to reach an equivalent state. This has driven ongoing research on the diffusion process of GNN models [9], which is related to the topic of the present work. From this perspective, the authors of [9] worked on the application of a range of differential equations that are popular in classical physics; see, for instance, [62]. Of note, they also considered PDEs with a diffusion term, as used in works on image processing [40, 58]. The results of these studies motivate us to consider a PDE with a diffusion term here. This study is also motivated by ongoing research on optimal control theory, especially work on the ensemble controllability of stochastic processes in terms of the Fokker–Planck equation [2]. Although the drift term differs slightly from that used here, this highlights the necessity of the control of diffusion PDEs. Insights obtained in the machine learning literature might be helpful in this regard. Along these lines, we consider neural networks based on PDEs with a diffusion term in our study. Our motivations are twofold.

  1. (i)

    Although neural ODEs perform well, their essential difference from classical neural networks is that the width of each layer does not change. This limitation can be overcome by PDE-based neural networks, which also consider the infinite limit of the width of the network. Because we aim to approximate a neural network with a continuous dynamical system, this advantage of PDE-based neural networks appears to be more natural.

  2. (ii)

    Similar to the case of ODE-based control, fruitful theories have also been developed for PDE-based control (or distributed control). A sophisticated theoretical framework has been developed in the considerable literature on diffusion equations. We can also increase the freedom of such models by considering a range of forms and values of the boundary conditions.

However, some uncertainties remain regarding the performance of these continuous neural networks. For example, the universal approximation property is an important aspect that all neural networks must exhibit.

Although various types of neural networks based on dynamical systems have been developed, some scope for further exploration remains in terms of their universal approximation property based on a PDE, particularly with a diffusion term.

In this paper, we first introduce the formulation of a PDE-based neural network and then show that it is well-defined under some natural setup conditions. Next, we prove the existence of a temporally global solution to the model. We also establish the existence of a vanishing-diffusion limit. Finally, we show that our model possesses a universal approximation property with respect to the maximum norm.

The remainder of this paper is organized as follows. In Sect. 2, we define some notations that we use throughout this paper. In Sect. 3, we formulate the research problem and introduce some existence theorems. In Sect. 4, we present our main result, which is proven in Sect. 7. In Sect. 5, we compare our results to those of related works, referring to the history of arguments on the universal approximation property of neural networks. We also confirm the main contributions of the present work, and clarify our key theoretical and practical insights. Section 6 provides some preliminary statements to support the main results presented in Sect. 7. In Sect. 8, we discuss the learnability of our model as well as its performance based on some numerical experiments. Finally, our conclusions and some possible avenues for future research are presented in the final section.

2 Notations

In this section, we introduce some notations used for general analysis. First, let us define \(I=(0,1)\) and \(\partial I = \{0\} \cup \{1\}\). Let \({\mathcal {G}}\) denote an arbitrary region in \({\mathbb{R}}\). We denote the closure of \({\mathcal {G}}\) as \(\overline{\mathcal {G}}\).

Hereafter, \(C({\mathcal {G}})\) denotes a set of continuous functions on \({\mathcal {G}}\). For \(r \in {\mathbb{N}}\), a set of functions that are r-times continuously differentiable on \({\mathbb{R}}\) is denoted as \(C^{r}({\mathbb{R}})\). A set of infinitely differentiable functions with a compact support in \({\mathcal {G}}\) is denoted as \(C_{0}^{\infty}({\mathcal {G}})\). A set of Lipschitz continuous functions on \({\mathbb{R}}\) is denoted as \(C^{L}({\mathbb{R}})\). For \(d \in {\mathbb{N}}\), we often denote a vector \(\vec{u}=(u_{1},u_{2},\dots ,u_{d}) \in {\mathbb{R}}^{d}\) as \([u_{j}]_{j}\). For two vectors u⃗ and \(\vec{v} \in {\mathbb{R}}^{p}\) in general, we denote their inner product as \(\vec{u}\cdot \vec{v}\). For a vector space X and an element \(\vec{v} \in X\), we denote a set spanned by v⃗ as \(\operatorname{Span}\langle \vec{v} \rangle \).

Let \(\|\cdot \|_{L_{p}({\mathcal {G}})}\) denote the usual \(L_{p}\) norm with \(1 \leq p\leq +\infty \) on \({\mathcal {G}}\); i.e., for a function f in general, we define

$$\begin{aligned} & \Vert f \Vert _{L_{p}({\mathcal {G}})} \equiv \textstyle\begin{cases} ( \int _{\mathcal {G}} \vert f(x) \vert ^{p} \,\mathrm{d}x )^{\frac{1}{p}} & (p \in [1,+\infty )), \\ {\mathrm{ess}}\sup_{x \in {\mathcal {G}}} \vert f(x) \vert & (p= \infty ). \end{cases}\displaystyle \end{aligned}$$

We use a notation \((\cdot ,\cdot )_{\mathcal {G}}\) to denote the inner product in \(L_{2}({\mathcal {G}})\) space:

$$\begin{aligned} &(f,g)_{\mathcal {G}} \equiv \int _{\mathcal {G}} f(x)g(x) \,\mathrm{d}x. \end{aligned}$$

In particular, when the region is clear, we simply denote it as \((\cdot ,\cdot )\). The norm in \(L_{2}({\mathcal {G}})\) is often denoted as \(|\cdot |\). We also use this notation to denote the norm in the Euclidean space, where a step function is regarded as a simple function in the \(L_{2}\) space.

For \(r \in {\mathbb{N}}\), we define Sobolev spaces \(H^{r}({\mathcal {G}})\), which are the spaces of functions \(f(x), x\in {\mathcal {G}}\), equipped with the norm \(\|f\|_{H^{r}({\mathcal {G}})}^{2} \equiv \sum_{| \alpha |\leq r} \|D^{\alpha}f \|_{L_{2}({\mathcal {G}})}^{2}\). \(H^{-r}({\mathcal {G}})\) (\(r>0\)) is defined as the dual space of \(H_{0}^{r}({\mathcal {G}})\), which is the closure of \(C_{0}^{\infty}({\mathcal {G}})\) with respect to the norm of \(H^{r}({\mathcal {G}})\) (see, [49], §11.1 and §12.1).

For a Banach space \({\mathcal {B}}\) with the norm \(\|\cdot \|_{\mathcal {B}}\), we denote the space of \({\mathcal {B}}\)-valued measurable functions f on the interval \((a,b)\) by \(L_{p}((a,b);{\mathcal {B}})\), the norm of which is defined by

$$\begin{aligned} & \vert f \vert _{L_{p}((a,b);{\mathcal {B}})} \equiv \textstyle\begin{cases} ( \int _{a}^{b} \Vert f(t) \Vert _{\mathcal {B}}^{p} \, { \mathrm{d}}t )^{\frac{1}{p}} & (p\in [1,+\infty )), \\ {\mathrm{ess}}\sup_{a \leq t \leq b} \Vert f(t) \Vert _{\mathcal {B}} & (p= \infty ). \end{cases}\displaystyle \end{aligned}$$

Similarly, we often use notations like \(C([a,b];{\mathcal {B}})\) to denote sets of \({\mathcal {B}}\)-valued functions that are continuous with respect to time on the interval specified in the brackets. We also denote the adjoint space of \({\mathcal {B}}\) by \({\mathcal {B}}^{\prime}\). For a Hilbert space H and its linear subspace \(M \subset H\) in general, we denote the orthogonal complement of M as \(M^{\perp}\). When the inner product of two elements \(v_{1}\) and \(v_{2}\) in H vanishes, we use the notation \(v_{1} \perp v_{2}\). Hereafter, we shall use the notations below:

$$\begin{aligned} & Z*_{x}f (t,x) \equiv \int _{I} Z(t,x,y)f(t,y) \,\mathrm{d}y, \\ &Z*f (t,x)\equiv \int _{0}^{T} \, \mathrm{d}\tau \int _{I} Z(t-\tau ,x,y)f( \tau ,y) \,\mathrm{d}y, \end{aligned}$$

where \(Z(t,x,y)\) denotes the fundamental solution to the initial boundary value problem of the heat equation with the vanishing Dirichlet condition. Given \(T>0\), we use the notation \({\mathcal {H}}_{T} \equiv (0,T)\times I\times I\). The other notations used in this paper are summarized in Table 4 in Appendix A.
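To make the kernel notation concrete, the following is a minimal numerical sketch (ours, not part of the paper) that evaluates Z via a truncated eigenfunction expansion with vanishing Dirichlet data and approximates \(Z*_{x}f\) by quadrature; the diffusion coefficient ν, the truncation level, and the grid are illustrative assumptions.

```python
import numpy as np

def heat_kernel(t, x, y, nu=1.0, n_terms=200):
    """Truncated expansion of the Dirichlet kernel on I = (0,1):
    Z(t,x,y) = sum_j exp(-lambda_j t) eta_j(x) eta_j(y),
    with lambda_j = nu*(j*pi)**2 and eta_j(x) = sqrt(2)*sin(j*pi*x)."""
    j = np.arange(1, n_terms + 1)
    lam = nu * (j * np.pi) ** 2
    ex = np.sqrt(2.0) * np.sin(np.pi * np.outer(x, j))
    ey = np.sqrt(2.0) * np.sin(np.pi * np.outer(y, j))
    return (ex * np.exp(-lam * t)) @ ey.T

def conv_x(t, f_vals, x_grid, y_grid, nu=1.0):
    """Z *_x f(t,x) = int_I Z(t,x,y) f(t,y) dy via trapezoidal quadrature on y_grid."""
    dy = y_grid[1] - y_grid[0]
    w = np.full(y_grid.shape, dy); w[0] = w[-1] = dy / 2.0
    return heat_kernel(t, x_grid, y_grid, nu=nu) @ (w * f_vals)

# check against the exact evolution of sin(pi*x): exp(-nu*pi^2*t)*sin(pi*x)
xg = np.linspace(0.0, 1.0, 201)
approx = conv_x(0.1, np.sin(np.pi * xg), xg, xg)
print(np.max(np.abs(approx - np.exp(-np.pi**2 * 0.1) * np.sin(np.pi * xg))))
```

The printed value is the discretization error of the truncation and quadrature, which shrinks as the grid and the number of retained modes grow.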

3 Formulation: differential equation-based neural networks

Here, we formulate the continuous limit of a multi-layer neural network [32]. Because in supervised learning the input takes the form of a vector in the Euclidean space \({\mathbb{R}}^{J}\), we represent it as a simple function on the unit interval by partitioning the interval into J subintervals. Given \(T>0\), we formulate the continuous version of a neural network as follows:

$$\begin{aligned} &u_{t} -\nu u_{xx}= \phi \biggl( \int _{I} w_{1}(t,x,y)u(t,y) \, { \mathrm{d}}y \biggr) \quad \text{in } I_{T} \equiv I \times (0,T), \end{aligned}$$
(3.1)

where \(\nu >0\) is the diffusion coefficient, \(\phi (\cdot )\) denotes the activation function, \(T>0\) corresponds to the depth of a classical neural network, and \(w_{1}\) and \(w_{0}\) are the weight parameters of the middle and output layers, respectively. Additionally, we impose the initial and boundary value conditions as follows:

$$\begin{aligned} &u(0,x)=u_{0}(x) \quad \text{on } I, \qquad u=1 \quad \text{on } \partial I \ \forall t \in (0,T). \end{aligned}$$
(3.2)

We employ a non-vanishing Dirichlet condition in (3.2), with which we can easily assure the existence of a solution of (6.12) in the proof of Lemma 4. We shall comment on this issue again later. In (3.2), given the input data \(\vec{\xi}=(\xi _{1},\xi _{2},\ldots ,\xi _{J})^{\top }\in {\mathbb{R}}^{J}\), the initial data is a simple function of the form \(u_{0}(x) = \sum_{j=1}^{J} \xi _{j} \chi _{I_{j}}\), with \(\chi _{I_{j}}\) denoting the indicator function of \(I_{j} \equiv ((j-1)/J, j/J] \) (\(j=1,2,\ldots ,J\)). Because we usually deal with a finite-dimensional input, we translate it into this finite-dimensional vector and the corresponding simple function on the unit interval I. This formulation differs from that of Liu and Markowich [52], who employed a region of the same dimension as the input feature. In that model, they computed a multiple integral of the input over the d-dimensional space in each layer. In the case of a two-dimensional CNN, their formulation coincides with the functionality of the convolution layer. In higher dimensions, however, it differs from how a multi-layer neural network operates in usual supervised learning.
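For concreteness, the following is a minimal sketch (ours, not part of the formulation) of the map from an input vector \(\vec{\xi}\) to the simple function \(u_{0}\); the function name and the NumPy-based implementation are illustrative assumptions.

```python
import numpy as np

def to_simple_function(xi):
    """Return u_0(x) = sum_j xi_j * chi_{I_j}(x) as a callable,
    where I_j = ((j-1)/J, j/J] and J = len(xi)."""
    xi = np.asarray(xi, dtype=float)
    J = len(xi)

    def u0(x):
        x = np.asarray(x, dtype=float)
        # index of the right-closed subinterval I_j containing x
        j = np.clip(np.ceil(x * J).astype(int) - 1, 0, J - 1)
        return xi[j]

    return u0

# Example: the input vector (0.2, 1.5, -0.3) becomes a step function on (0, 1]
u0 = to_simple_function([0.2, 1.5, -0.3])
print(u0([0.1, 0.5, 0.9]))   # -> [ 0.2  1.5 -0.3]
```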

By taking \(v\equiv u-1\), we can transform problem (3.1)–(3.2) as below.

$$\begin{aligned} \textstyle\begin{cases} v_{t}-\nu v_{xx} = \phi ( \int _{I} w_{1}(t,x,y)v(t,y) \,\mathrm{d}y + \int _{I} w_{1}(t,x,y) \,\mathrm{d}y ) \quad \text{in } I_{T}, \\ v(0,x)=u_{0}(x)-1 \equiv \tilde{u}_{0} \quad \text{on } I, \\ v=0 \quad \text{on } \partial I \ \forall t \in (0,T). \end{cases}\displaystyle \end{aligned}$$
(3.3)

The following result was obtained for problem (3.3).

Theorem 1

Let \(T>0\) be arbitrary, and the following be assumed:

  1. (i)

    \(u_{0} \in L_{2}(I)\),

  2. (ii)

    \(\phi \in C^{L}({\mathbb{R}})\),

  3. (iii)

    \(w_{1} \in L_{2}({\mathcal {H}}_{T})\).

Then, there exists a constant \(T_{u_{0}} \in (0,T]\) that depends on \(\|u_{0}\|_{L_{2}(I)}\) such that problem (3.3) has a unique solution \(v\in C([0,T_{u_{0}}];L_{2}(I))\) on the interval \([0,T_{u_{0}})\). In addition, this solution satisfies

$$\begin{aligned} & \Vert v \Vert _{C([0,T_{u_{0}}];L_{2}(I))} \leq c\bigl( \Vert u_{0} \Vert _{L_{2}(I)}\bigr), \end{aligned}$$

where \(c(\|u_{0}\|_{L_{2}(I)})\) is a positive constant that depends monotonically increasingly on \(\|u_{0}\|_{L_{2}(I)}\).

We prove this theorem in Appendix B, in which we use the notation \(A = -\nu \frac{\partial ^{2}}{\partial x^{2}}\) and define a sesquilinear form \(\sigma (\cdot ,\cdot ):H_{0}^{1}(I) \times H_{0}^{1}(I) \rightarrow { \mathbb{R}}\) [21] by

$$\begin{aligned} &(Au,v) = \sigma (u,v) \quad \bigl(u, v \in H_{0}^{1}(I) \bigr). \end{aligned}$$
(3.4)

Remark 1

The solution v mentioned in Theorem 1 also belongs to the space [64]

$$\begin{aligned} &L_{2}\bigl((0,T_{u_{0}});H^{1}(I)\bigr) \cap H^{1}\bigl((0,T_{u_{0}});H^{-1}(I) \bigr), \end{aligned}$$

and satisfies the same estimates as the one in the theorem with the norm of these spaces. The proof of this fact is contained in the proof of Theorem 1 in Appendix B.

Remark 2

In our proof, we do not require \(\phi (\cdot )\) to satisfy \(\phi (0)=0\) or to be of linear growth, as required in [52].

Next, we show the existence of a temporally global solution.

Theorem 2

Let \(T>0\) be an arbitrary positive number and assume that in addition to the assumptions (i), (ii) of Theorem 1, \(w_{1} \in L_{2}({\mathcal {H}}_{\infty})\) is satisfied. Then, there exists a temporally global solution \(v \in C([0,T];L_{2}(I))\) to problem (3.3), which satisfies

$$\begin{aligned} &\sup_{t \in [0,T]} \bigl\vert v(t) \bigr\vert \leq \chi \bigl( \Vert u_{0} \Vert _{L_{2}(I)}\bigr), \end{aligned}$$

where \(\chi (\cdot )\) is a monotonically increasing function.

Remark 3

As in Theorem 1, the solution v mentioned in Theorem 2 also belongs to the space

$$\begin{aligned} &L_{2}\bigl((0,T);H^{1}(I)\bigr)\cap H^{1}\bigl((0,T);H^{-1}(I)\bigr), \end{aligned}$$

and satisfies the same estimates as the one in the theorem with the norm of these spaces.

The proof of Theorem 2 is given in Appendix B as well. Note that the estimate above does not depend on the diffusion coefficient \(\nu >0\). Thus, under the assumptions of Theorem 2, we can let ν tend to zero, to assert the corollary below [47].

Corollary 1

Under the assumptions of Theorem 2, if we denote the solution to (3.3) by \(v^{(\nu )}\), then we can take a sequence \(\{v^{(\nu _{m})}\}_{m=1}^{\infty }\subset L_{2}(I_{T})\) satisfying the following:

$$\begin{aligned} & \bigl\Vert v^{(\nu _{m})} - v^{(0)} \bigr\Vert _{L_{2}(I_{T})} \rightarrow 0, \quad m \rightarrow +\infty , \end{aligned}$$

where \(v^{(0)} \in L_{2}(I_{T})\) is a solution to the hyperbolic equation

$$\begin{aligned} \textstyle\begin{cases} v_{t}^{(0)} - \phi ( \int _{I} w_{1}(t,x,y)v^{(0)}(t,y) \,\mathrm{d}y + \int _{I} w_{1}(t,x,y) \,\mathrm{d}y )=0 \quad \textit{in } I_{T}, \\ v^{(0)}(0,x)=\tilde{u}_{0} \quad \textit{on } I. \end{cases}\displaystyle \end{aligned}$$

In our previous studies [31, 32], we set several cost functions corresponding to specific tasks, demonstrated the existence of optimal controls, and used the gradient descent algorithm to find a sub-optimal control. For example, in [34], in which we discussed the multiclass classification problem, the cost function is given by

$$\begin{aligned} J[\vec{w}] &= - \int \mathrm{d}P(\vec{X},\vec{t}_{(\vec{X})}) \Biggl[ \sum _{k=1}^{K-1} t_{(\vec{X}),k} \ln \biggl\{ \phi _{0}^{(k)} \biggl( \int _{I} w_{0}^{(k)}(y) u(T,y;w_{1},u_{0(\vec{X})}) \,\mathrm{d}y \biggr) \biggr\} \\ & \quad{}+t_{(\vec{X}),K} \ln \Biggl\{ 1-\sum_{k^{\prime}=1}^{K-1} \phi _{0}^{(k^{ \prime})} \biggl( \int _{I} w_{0}^{(k^{\prime})}(y) u(T,y;w_{1},u_{0({ \vec{X}})}) \,\mathrm{d}y \biggr) \Biggr\} \Biggr] \\ & \quad{}+ \frac{\gamma _{1}}{2} \sum_{k=1}^{K-1} \bigl\Vert w_{0}^{(k)} \bigr\Vert _{H^{1}(I)}^{2} + \frac{\gamma _{2}}{2} \Vert w_{1} \Vert _{L_{2}({\mathcal {H}}_{T})}^{2}, \end{aligned}$$

where \(\phi _{0}(\cdot )\) is an activation function of the output layer, \(P(\vec{X},\vec{t}_{(\vec{X})})\) is the probability distribution of \((\vec{X},\vec{t}_{(\vec{X})})\), and \(t_{(\vec{X}),k} \in \{0,1\}\) satisfies \(\sum_{k=1}^{K} t_{(\vec{X}),k}=1\) \(\forall \vec{X} \in {\mathbb{R}}^{J}\). However, because we consider the feed-forward network hereafter, we do not consider \(\phi _{0}(\cdot )\) in the present paper. Instead, we discuss the universal approximation property of this neural network with a linear output layer.
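For readers who wish to connect the formula for \(J[\vec{w}]\) to an implementation, the following is a minimal discretized sketch for a single sample (ours, and not used in the remainder of the paper); the choice of a sigmoid for \(\phi _{0}\), the replacement of the expectation by one sample, the array shapes, and the regularization constants are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multiclass_cost(u_T, W0, w1, t_onehot, h, k_step, gamma1=1e-3, gamma2=1e-3):
    """Discretized multiclass cost J[w] evaluated on one sample.

    u_T      : (L,)      solution u(T, .) on the spatial grid of width h
    W0       : (K-1, L)  output-layer weights w_0^{(k)} on the same grid
    w1       : (N, L, L) hidden weights w_1 on the time/space grid (steps k_step, h)
    t_onehot : (K,)      one-hot target t_{(X), k}
    """
    # class scores phi_0^{(k)}( int_I w_0^{(k)}(y) u(T, y) dy ), k = 1..K-1
    p = sigmoid(h * (W0 @ u_T))
    p_full = np.concatenate([p, [1.0 - p.sum()]])   # K-th class as 1 - sum of the others
    data_term = -np.sum(t_onehot * np.log(p_full))

    # H^1(I) regularization of each w_0^{(k)} (values plus forward-difference derivative)
    dW0 = np.diff(W0, axis=1) / h
    reg0 = 0.5 * gamma1 * (h * np.sum(W0**2) + h * np.sum(dW0**2))
    # L_2(H_T) regularization of w_1
    reg1 = 0.5 * gamma2 * k_step * h * h * np.sum(w1**2)
    return data_term + reg0 + reg1

# tiny demo: L = 16 grid cells, K = 3 classes, one sample with label k = 1
L_grid, K = 16, 3
u_T = 0.5 * np.ones(L_grid)
W0 = -np.ones((K - 1, L_grid))
w1 = np.zeros((4, L_grid, L_grid))
t = np.array([1.0, 0.0, 0.0])
print(multiclass_cost(u_T, W0, w1, t, h=1.0 / L_grid, k_step=0.25))
```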

Hereafter, we frequently represent the solution to (3.3) (or equivalently, (3.1)–(3.2)) as \(u(t,x;w_{1},\vec{\xi})\), to clarify its dependency on \(w_{1}\) and ξ⃗. Thus, we regard the solution \(u(T,x;w_{1},\vec{\xi})\) as a function on K by identifying \(u_{0}\) with \(\vec{\xi} = (\xi _{1},\ldots ,\xi _{J})^{\top}\).

4 Main result

In this section, we show the universal approximation property of the neural network model based on a nonlinear partial differential equation prescribed in Sect. 3 [32]. As in the previous works, we restrict ourselves to an arbitrary compact set \(K \subset {\mathbb{R}}^{J}\). Our main result is the following.

Theorem 3

Let \(T>0\) be given and \(\phi \in C^{L}({\mathbb{R}})\) be a non-polynomial function. Then, for an arbitrary compact set \(K \subset {\mathbb{R}}^{J}\), \(F \in C(K)\), and \(\varepsilon >0\), there exist \(w_{0} \in L_{2}(I)\), \(w_{1} \in L_{2}({\mathcal {H}}_{T})\) such that

$$\begin{aligned} &\sup_{\vec{\xi} \in K} \biggl\vert F(\vec{\xi} ) - \int _{I} w_{0}(x) u(T,x;w_{1}, \vec{\xi}) \,\mathrm{d}x \biggr\vert < \varepsilon , \end{aligned}$$

where \(u(T,x;w_{1},\vec{\xi})\) is the value of a solution to (3.1)–(3.2) at time T that corresponds to the initial input value ξ⃗.

We will prove Theorem 3 in Sect. 7.

Remark 4

In this paper, we only consider the scalar-valued function \(F:K\rightarrow {\mathbb{R}}\), as Leshno [43] did. This does not lose generality, because if we can approximate this function, then we can approximate an arbitrary continuous map \(F:K\rightarrow {\mathbb{R}}^{n}\) by concatenating networks in parallel, as is done in [14], as long as \(J \geq n\) holds. We also point out that our PDE-based neural network is defined only on a one-dimensional Euclidean space. This is because in many supervised learning tasks, the input data is a vector of independent attributes, which can be associated with a simple function on I as we did above. This is similar to the approach of [72]. When we consider GNNs, however, this assumption does not hold; addressing this case is left for future work.

Remark 5

The controls \(w_{0}\) and \(w_{1}\) depend on T and ν. Therefore, we cannot currently assure that the same conclusion holds with \(\nu =0\). The discussion of this vanishing-diffusion limit is left for future work.

5 Comparison with existing works

Before going on to the proofs of our results, we discuss here the differences and novelty of our results in comparison with the existing related works. Numerous contributions in the literature have studied the universal approximation property of neural networks. As an early work, Lippmann [51] postulated that a range of surfaces for classifying the points in a topological space could be formed using a neural network with two hidden layers.

This conjecture was rigorously proven by Funahashi [22], who stated that an arbitrary continuous function on a compact subset K in \({\mathbb{R}}^{n}\) could be approximated with a neural network with a single hidden layer that contained a sigmoid activation function.

Funahashi [23] also hypothesized that any \(L_{2}\) function could be approximated by a three-layer neural network with a finite number of units in the hidden layer. Four-layer networks have also been conjectured to outperform three-layer networks. These considerations are related to the study of the generalization performance of neural networks [6].

Irie and Miyake [36] derived the integral representation of three-layer neural networks based on the Fourier integral theorem under the continuity of the hidden layer.

Around the same time, Cybenko [11] first discussed the universal approximation property of sigmoidal functions. He showed that the set of functions of the form \(\sum_{j} w_{j} \sigma (\vec{y}^{\top }\vec{x} + b)\) with some constants \(w_{j}\) and b and vectors x⃗ and y⃗ in a compact space K is dense in \(C(K)\). His discussion did not assume the activation function to be monotonic.

Hornik, Stinchcombe, and White [35] treated general measurable activation functions by making use of the Stone–Weierstrass theorem and the cosine squasher proposed by Gallant and White. Their results can be regarded as similar to those of Funahashi [22].

Leshno [43] obtained more general results by using the fact that the set of functions spanned by the so-called ridge functions, i.e., those of the form \(f(\vec{w}^{\top} \vec{x}+\theta )\), is dense both in \(C({\mathbb{R}}^{n})\) and \(L_{p}(\mu )\), where μ is an arbitrary finite measure on \({\mathbb{R}}^{n}\). Recently, Yun [85] proved the approximation property of a neural network constructed using a parametric sigmoidal function.

Some Bayesian perspectives on neural networks, even with an infinite number of nodes, have been discussed (see, for example, [54, 84]). Their key insight is that as the number of nodes tends to infinity, the output can be regarded as a set of Gaussian processes.

However, all of these studies considered only conventional neural networks. Many works have also considered the universal approximation property of neural networks based on continuous dynamical systems.

Haber and Ruthotto [27] proposed a formulation of a neural network in a supervised learning framework as a dynamical system. There, they clarified the necessary condition for the stability of an equilibrium point as well as the stability of the Euler method as a discrete approximation of the continuous solution. They also pointed out the close relationship between backpropagation and the adjoint method in optimal control theory. Q. Li et al. [45] discussed the approximation property of an ODE-based neural network and gave a sufficient condition under which the set of realizations of an ODE-based neural network can approximate an arbitrary continuous map \(f:{\mathbb{R}}^{n} \rightarrow {\mathbb{R}}^{m}\) (\(n \geq 2\)) on any compact set with respect to \(L_{p}\) (\(p \in [1,+\infty )\)) norms.

Along these lines, Aizawa and Kimura [1] recently presented the universal approximation property of neural ODEs [10] and ResNet using the result of Leshno [43]. However, their method is restricted to linear models.

Esteve [18] surveyed the recent works concerning the approximation property of neural ODEs and, moreover, showed that the optimal value of the loss function is bounded above by a quantity of the order of \(T^{-1}\) under Tikhonov regularization (in a setting they called empirical risk minimization). Roughly contemporaneously, Teshima [76] also investigated the universal approximation property of neural ODEs. In their proof, they made use of their previous work [75] with a relatively slight modification. They also discussed the relationship between their result and a preceding work by Zhang et al. [88], which showed a counterexample that cannot be approximated by a neural ODE. Zhang et al. [88] also presented the universal approximation property of an augmented neural ODE [17].

Recently, a survey by DeVore et al. [14] thoroughly presented the existing results on the approximation property of neural networks. The power of Rectified Linear Unit (ReLU) networks was among the most important results introduced here, as they can contain all piecewise-continuous functions on an arbitrary compact set.

From the perspective of a practical application, Laakmann and Petersen [42] applied a neural network to the numerical computation of a transport equation.

Studies in the field of optimal control have also considered the universal approximation property of continuous neural networks.

Balet and Zuazua [61] proved the simultaneous controllability [48] of a flow map of an ODE. This means that given an arbitrary finite input in a Euclidean space, the flow map can lead to an arbitrary set of classification labels.

By making use of this property, they also showed that an arbitrary simple function, and consequently an \(L_{2}\) map \(f:{\mathbb{R}}^{d} \rightarrow {\mathbb{R}}^{d}\), can be approximated with arbitrary precision with respect to the \(L_{2}\) norm. They also discussed the relationship between the universal approximation property and simultaneous control [48, 53]. However, their method is not applicable here for two reasons. First, their method of rotating the coordinates does not suffice because we consider equations with a diffusion term. Second, because we aim at an approximation with respect to the maximum norm, their method is not applicable, as it divides the region into two sections, in one of which the function is allowed to be discontinuous. From the perspective of optimal control theory, the universal approximation property corresponds to approximate ensemble controllability [53]. Thus, our arguments here can also be regarded as describing this property of a specific type of control via a nonlinear diffusion equation. We prove this property of our model using some results from studies on machine learning.

The relationship between the optimal control of neural networks and optimal transport models has been pointed out as well (see, for instance, [50]). For example, Sontag and Sussmann [69] discussed the controllability of temporally continuous recurrent neural networks. Balet and Zuazua [61] also argued this point and studied a nonlinear transport equation, which they called a neural transport equation (NTE), as given below.

$$\begin{aligned} &\partial _{t} \varrho + \nabla \cdot \bigl[ \bigl( W(t)\sigma \bigl(A(t)x+b(t)\bigr) \bigr)\varrho \bigr]=0, \\ &\varrho (0) = \varrho ^{0} . \end{aligned}$$

They provided a method to approximate a target measure, given as a finite combination of Dirac measures, by the solution of an NTE at \(t=T\) with arbitrary precision in the sense of the 1-Wasserstein distance.

In [45], the authors theoretically considered the formulation of an ODE-based neural network and proved its universal approximation property. They first observed that the earlier discussion concerning the universal approximation property of neural ODEs [88] relied on a stronger assumption under which the right-hand side of an ODE already possesses the universal approximation property. They showed that any arbitrary continuous function \(f:{\mathbb{R}}^{n} \rightarrow {\mathbb{R}}^{n}\) on any compact set in a Euclidean space can be approximated in an \(L_{p}\) norm with arbitrary precision. They also pointed out that the set of realizations of an ODE is uniformly approximated by that of ResNet. They derived their main results based on another work by one of the authors [66]. However, in their formulation, they distinguished the one-dimensional and multi-dimensional input cases. In contrast, our proof in the present work need not distinguish these cases, because we start from Leshno’s result [43].

Regarding ResNet, Tabuada [73] gave some conditions on activation functions under which the universal approximation property of a map \(f:{\mathbb{R}}^{n} \rightarrow {\mathbb{R}}^{n}\) with respect to the \(L_{\infty}\) norm is assured. They used the technique of ensemble controllability and deduced the quadratic differential equation that should be satisfied for each activation function. Research is also being actively conducted on the controllability of systems driven by linear or nonlinear partial differential equations [4].

For example, Fernández-Cara et al. [20] studied the null controllability of a heat equation with a spatially nonlocal term, which is roughly similar to the setup considered in the present work. Because their model was linear, they considered its adjoint equation and applied the Fourier-series representation of the equation and the compactness-uniqueness argument. They also stated that approximate controllability, which is equivalent to the universal approximation property in the terminology of our PDE-based neural network, holds under the analyticity assumption on the kernel of the nonlocal term. However, our setup employs a nonlinear activation function, which is essentially different from this work; they listed the nonlinear case as an open problem.

As another example, [26] discussed the controllability of a nonlinear heat equation with a distributed control in an unbounded domain in \({\mathbb{R}}^{n}\). In this formulation, however, the control term is not included in the nonlinear term, which differs from our framework.

In fact, the application of diffusion equations in image processing has been discussed in prior works [82, 86]. Along these lines, Ruthotto and Haber [63] proposed parabolic and hyperbolic CNN models that respectively included spatial and temporal second-order derivative terms. They also considered the application of neural networks in this field as an extension of prior applications of PDEs.

Some other works have also addressed the control of parabolic PDEs [4], including linear and nonlinear heat equations. In this regard, the present work provides a link between the insights in the literature on neural networks and research on controllability in optimal control theory.

Regarding PDE-based neural networks, Liu and Markowich [52] proposed a hyperbolic nonlinear integro-differential formulation without a diffusion term. However, they considered only the mathematical well-posedness of the formulation and did not mention the universal approximation property. An earlier work [46] also proposed PDE-based neural networks for the transport and HJB equations, one of which used a diffusion term as in the present work. Neither of these works, however, mentioned the universal approximation properties of the models. Li and Shi [46] also proposed adding an extra constraint in cases with a diffusion term. In contrast, the present work shows that the universal approximation property is satisfied even without a trick of this nature.

At the end of this section, we list some recent notable results. Ivan et al. [59] proposed a framework to train neural ODEs using a Lyapunov function, which avoids traditional backpropagation and achieves faster computation. Moreover, a link between turnpike theory and optimal control has been considered in relation to neural ODEs [24]. Geshkovski and Zuazua [24] computed some examples of the turnpike property of a neural ODE using the MNIST dataset. They also mentioned that related results have been reported for specific setups, as in [18] and [19].

Based on the aforementioned arguments, the main contributions of this study are summarized as follows.

  1. (i)

    Motivated by the application of diffusion equations in image processing and considering GNNs with a diffusion term, we formulate PDE-based neural networks with a diffusion term and rigorously clarify the conditions under which the existence of the solution is assured.

  2. (ii)

    We describe the universal approximation property of our model in the sense of the maximum norm.

Our key findings are summarized as follows.

  1. (i)

    We show that some insights from studies on machine learning can be applied to the theory of the optimal control of PDEs.

  2. (ii)

    Even though Leshno’s result (Lemma 1 below) is a useful tool to prove our result for continuous neural networks, some additional formulations are required to discuss the convergence as the temporal and spatial mesh sizes tend to 0.

  3. (iii)

    Because our model contains a diffusion term, our method differs from those presented in prior works, although it is based on that reported by Leshno [43]. More concretely, our proof uses estimates of the approximation error of the discretized diffusion equation that were not considered in previous studies.

In subsequent sections, we prove the results presented above.

6 Preliminary results

Before proving Theorem 3, we prepare some auxiliary results in this section. We first cite the following lemma ([43], Theorem 1).

Lemma 1

Let f be a measurable function on \({\mathbb{R}}\) and let \(J \in {\mathbb{N}}\). Then, \(\operatorname{Span} \langle f_{{\mathbf{w}},\theta} \rangle \) (\({\mathbf{w}} \in {\mathbb{R}}^{J}\), \(\theta \in {\mathbb{R}}\)) is fundamental in \(C({\mathbb{R}}^{J})\) if and only if f is not a polynomial, where \(f_{{\mathbf{w}},\theta}({\mathbf{x}})= f({\mathbf{w}}\cdot {\mathbf{x}} + \theta ) \).

Owing to this lemma, for an arbitrary compact set \(K \subset {\mathbb{R}}^{J}\), \(F \in C(K)\), and \(\varepsilon >0\), by taking a suitable \(M \in {\mathbb{N}}\), \(\{\sigma _{0(m)}\}_{m=1}^{M} \subset {\mathbb{R}}\), \(\{\vec{\sigma}_{1}^{(m)}\}_{m=1}^{M} \subset {\mathbb{R}}^{J}\), and \(\{\theta ^{(m)}\}_{m=1}^{M} \subset {\mathbb{R}}\), we can obtain

$$\begin{aligned} &\sup_{\vec{\xi} \in K} \Biggl\vert F(\vec{\xi}) - \sum _{m=1}^{M} \sigma _{0(m)} \phi \bigl( \vec{\sigma}_{1}^{(m)\top} \vec{\xi} + \theta ^{(m)} \bigr) \Biggr\vert < \varepsilon . \end{aligned}$$
(6.1)
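As an empirical illustration of (6.1) (ours, not part of the proof), one can fix the inner parameters \(\vec{\sigma}_{1}^{(m)}\) and \(\theta ^{(m)}\) at random and fit the outer coefficients \(\sigma _{0(m)}\) by least squares; the target F, the activation tanh, and the sample sizes are assumptions, and the residual shrinks as M grows.

```python
import numpy as np

rng = np.random.default_rng(1)
J, M = 2, 200                        # input dimension and number of terms in (6.1)
phi = np.tanh                        # a Lipschitz, non-polynomial activation

# target F on the compact set K = [-1, 1]^J, sampled at training points in K
F = lambda X: np.sin(np.pi * X[:, 0]) * X[:, 1]
X = rng.uniform(-1.0, 1.0, size=(2000, J))

# fix (sigma_1^{(m)}, theta^{(m)}) at random and fit sigma_{0(m)} by least squares,
# i.e. search within Span< phi(sigma_1 . xi + theta) > as in Lemma 1
S1 = rng.normal(size=(M, J))
theta = rng.normal(size=M)
H = phi(X @ S1.T + theta)            # (2000, M) matrix of hidden-unit outputs
sigma0, *_ = np.linalg.lstsq(H, F(X), rcond=None)

print("max error on the sample points:", np.max(np.abs(H @ sigma0 - F(X))))
```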

Based on Lemma 1, we construct an approximating solution of (3.3) that agrees with the approximation stated in (6.1). Next, by taking the temporal and spatial meshes fine enough, we show that the solution of (3.3) itself can approximate the target continuous function. In applying these steps, we make use of estimates of the accuracy of the Galerkin approximation.

Let us consider the approximate problem of (3.3). Regarding the spatial variable, we employ the Galerkin approximation [77]. For this purpose, let \({\mathfrak {S}}_{h} \equiv \{S_{h}\}_{h}\) be a family of finite-dimensional subspaces of \(H_{0}^{1}(I)\) with parameter \(h<1\) that tends to 0 [77]. In the sequel, we set an integer L and take \(h=\frac{1}{L}\) (i.e., we divide I into L equipartitions). It is also assumed that

$$\begin{aligned} &\inf_{g \in S_{h}} \bigl( \Vert v-g \Vert _{L_{2}(I)} + h \Vert v-g \Vert _{H^{1}(I)} \bigr) \leq Ch^{s} \Vert v \Vert _{H^{s}(I)}\quad (1\leq s \leq r) \end{aligned}$$

holds (r is a positive value, for example, \(r=2\) [3]). We also define an approximation operator \(A_{h}:{\mathfrak {S}}_{h} \rightarrow {\mathfrak {S}}_{h}\) by using the sesqui-linear form \(\sigma (\cdot ,\cdot )\) in (3.4) as follows:

$$\begin{aligned} & ( A_{h} \phi _{h}, \psi _{h} ) = \sigma (\phi _{h}, \psi _{h} ). \end{aligned}$$

Thus, \(A_{h}\) is the operator associated with the restriction of \(\sigma (\cdot ,\cdot )\) on \({\mathfrak {S}}_{h} \times {\mathfrak {S}}_{h}\) [21]. We further define an operator \(P_{h}\) indicating a projection of \(u \in L_{2}(I)\) onto \({\mathfrak {S}}_{h}\) with respect to the \(L_{2}\) inner product [21].
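As a concrete instance (ours; the paper only requires the abstract approximation property above), \(A_{h}\) can be assembled for piecewise-linear (P1) elements on a uniform grid as follows; the element choice and the grid are illustrative assumptions.

```python
import numpy as np

def galerkin_operator(L, nu=1.0):
    """Matrix of A_h on the space S_h of piecewise-linear (P1) "hat" functions
    vanishing at the boundary, with h = 1/L (so dim S_h = L - 1 here):
    (A_h phi, psi) = sigma(phi, psi) = nu * int_I phi' psi' dx  =>  A_h = M^{-1} K,
    with M the mass matrix and K the stiffness matrix of the hat-function basis."""
    h = 1.0 / L
    n = L - 1
    K = (nu / h) * (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1))
    M = (h / 6.0) * (4 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1))
    return np.linalg.solve(M, K)

# the eigenvalues of A_h approximate those of A = -nu d^2/dx^2, i.e. nu*(j*pi)^2
A_h = galerkin_operator(64)
print(np.sort(np.linalg.eigvals(A_h).real)[:3])   # close to [pi^2, 4*pi^2, 9*pi^2]
```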

Then, we divide the time interval \((0,T]\) into N intervals \(\{[ (n-1)k, nk) \}_{n=1}^{N}\), with \(Nk=T\). By using a notation \(U_{h}^{(n)} = [ U_{h(l)}^{(n)} ]_{l}\), we consider the discretized scheme of (3.3) on \(t \in (0,T]\).

$$\begin{aligned} &\textstyle\begin{cases} \widetilde{U}_{h}^{(n)} = P_{h} u(nk,\cdot ;0,\vec{\xi}) \quad (n=0,1,2,\ldots , N-2), \\ \widetilde{U}_{h}^{(N-1)} = \overline{P}_{h} u((N-1)k, \cdot ;0,\vec{\xi}) , \\ \widetilde{U}_{h}^{(N)} =r(kA_{h})\widetilde{U}_{h}^{(N-1)} \\ \hphantom{\widetilde{U}_{h}^{(N)} ={}}{}+k r(kA_{h}) P_{h} \phi ( h\sum_{l=2}^{L-1} w_{\cdot ,l}^{(N-1)} \widetilde{U}_{h(l)}^{(N-1)} +h\sum_{l=1}^{L} w_{\cdot ,l}^{(N-1)} ), \end{cases}\displaystyle \end{aligned}$$
(6.2)

where \(r(kA_{h})\) denotes the Padé approximation [3] of the semigroup

$$ e^{-kA_{h}} \approx r(kA_{h}) \equiv (I_{L} + kA_{h})^{-1}. $$

Here, \(I_{L}\) is the L-dimensional identity matrix. We also use the notation \(\|\cdot \|\) hereafter to denote the Euclidean norm; note that this is equivalent to the \(L_{2}\) norm as long as we consider piecewise constant functions in \(L_{2}(I)\).
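For illustration, the following is a minimal time-marching sketch in the spirit of (6.2) (ours, not the scheme analyzed in the paper): it uses a finite-difference stand-in for \(A_{h}\), full sums in place of the interior sums of (6.2), and random weights purely as an example; the activation \(\max (z,0)\) is Lipschitz and non-polynomial, matching assumption (ii).

```python
import numpy as np

def step_matrix(L, nu, k):
    """r(k A_h) = (I_L + k A_h)^{-1}, here with A_h taken as the standard
    finite-difference approximation of -nu d^2/dx^2 with zero Dirichlet data
    (a stand-in for the Galerkin operator used in the text)."""
    h = 1.0 / L
    A = (nu / h**2) * (2.0 * np.eye(L) - np.eye(L, k=1) - np.eye(L, k=-1))
    return np.linalg.inv(np.eye(L) + k * A)

def forward(v0, W1, phi, nu, T):
    """March (3.3) with V^{n+1} = r(kA_h) [ V^n + k phi( h W1^n V^n + h W1^n 1 ) ],
    i.e. implicit in the diffusion and explicit in the nonlocal nonlinearity.
    v0 : (L,) initial values on the cells; W1 : (N, L, L) piecewise-constant w_1."""
    N, L = W1.shape[0], len(v0)
    h, k = 1.0 / L, T / N
    R = step_matrix(L, nu, k)
    v = v0.copy()
    for n in range(N):
        z = h * (W1[n] @ v) + h * W1[n].sum(axis=1)   # int w1 v dy + int w1 dy
        v = R @ (v + k * phi(z))
    return v

# illustrative run with random weights and a ReLU-type activation
rng = np.random.default_rng(0)
L, N, T, nu = 64, 50, 1.0, 0.1
x_mid = (np.arange(L) + 0.5) / L
v_T = forward(np.sin(np.pi * x_mid), 0.1 * rng.standard_normal((N, L, L)),
              lambda z: np.maximum(z, 0.0), nu, T)
w0 = rng.standard_normal(L)
print((1.0 / L) * w0 @ v_T)   # readout int_I w0(x) v(T,x) dx (recall u = v + 1)
```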

Moreover, we introduce notations \(\overline{P}_{h} f = [(\overline{P}_{h} f)_{l}]_{l}\), with

$$ ( \overline{P}_{h} f )_{l} \equiv \frac{1}{h} \int _{I_{l}} f(x) \,\mathrm{d}x \quad (l=1,2,\ldots ,L), $$

for \(f \in L_{1}(I)\) in general, where \(I_{l} \equiv [\frac{l-1}{L},\frac{l}{L})\) (\(l=1,2,\ldots ,L\)), and \(\overline{P}_{h}:L_{2}(I) \rightarrow {\mathfrak {S}}_{h}\) is the projection onto the finite-dimensional space \({\mathfrak {S}}_{h}\) for each \(h=\frac{1}{L}\). Because these are projection operators, note that the inequalities \(\|\overline{P}_{h} f\| \leq \|f\|\) and \(\|P_{h} f\| \leq \|f\|\) always hold. The value \(\|\overline{P}_{h}f\|\) is computed by regarding \(\overline{P}_{h}f\) as a simple function on I and then taking the usual norm of \(L_{2}(I)\).

Remark 6

The operator \(\overline{P}_{h}\) has often been used in the literature on the discrete approximation of operators ([79, 81]). It is known that this is equivalent to the operation

$$\begin{aligned} &\widetilde{P}_{h} f \equiv \bigl[f(lh)\bigr]_{l} \in {\mathbb{R}}^{L}, \end{aligned}$$

in the sense that the following equality holds [81].

$$\begin{aligned} &\lim_{h \rightarrow 0} \Vert \overline{P}_{h} f - \widetilde{P}_{h} f \Vert =0. \end{aligned}$$
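A small numerical check of this equivalence (ours; the quadrature inside `P_bar` and the test function are assumptions) compares the cell averages \(\overline{P}_{h}f\) with the point samples \(\widetilde{P}_{h}f\); the scaled difference decreases as \(h \rightarrow 0\).

```python
import numpy as np

def P_bar(f, L, n_quad=32):
    """Cell averages (P_bar f)_l = (1/h) * int_{I_l} f(x) dx via simple quadrature."""
    h = 1.0 / L
    s = (np.arange(n_quad) + 0.5) / n_quad          # quadrature nodes inside each cell
    x = (np.arange(L)[:, None] + s[None, :]) * h    # (L, n_quad) sample points
    return f(x).mean(axis=1)

def P_tilde(f, L):
    """Pointwise sampling (P_tilde f)_l = f(l h)."""
    return f(np.arange(1, L + 1) / L)

f = lambda x: np.sin(2 * np.pi * x) + x**2
for L in (8, 32, 128):
    h = 1.0 / L
    # discrete norm scaled by sqrt(h) to mimic the L2(I) norm of the step function
    print(L, np.sqrt(h) * np.linalg.norm(P_bar(f, L) - P_tilde(f, L)))
```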

In (6.2), we utilized the vanishing Dirichlet condition of (3.3). By the standard theory, \(u((N-1)k,x)\) can be represented as follows [37].

$$\begin{aligned} u\bigl((N-1)k,x;0,\vec{\xi}\bigr) &= Z*_{x}\tilde{u}_{0} + Z*\phi (0) \\ & = \sum_{j=1}^{\infty }\lambda _{j} \Biggl(\eta _{j},\sum _{q=1}^{J} \xi _{q} \chi _{I_{q}} \Biggr) e^{-\lambda _{j}(N-1)k} \eta _{j}(x)+c \bigl((N-1)k,x\bigr), \end{aligned}$$

where \(\lambda _{j}\) and \(\eta _{j}\) are the eigenvalues and eigenvectors of an operator A, respectively, and

$$\begin{aligned} &c(t,x) = -Z*_{x} 1 + Z * \phi (0). \end{aligned}$$

Hereafter, we often use a notation \(u((N-1)k,\cdot )\) to denote \(u((N-1)k,\cdot ;0,\vec{\xi})\). Thus, we have

$$\begin{aligned} \widetilde{U}_{h(l)}^{(N-1)} &= \overline{P}_{h} u\bigl((N-1)k,\cdot \bigr) \\ & =\sum_{j=1}^{\infty }\lambda _{j} \Biggl(\eta _{j},\sum _{q=1}^{J} \xi _{q} \chi _{I_{q}} \Biggr) e^{-\lambda _{j}(N-1)k} (\overline{P}_{h} \eta _{j})_{l} + \bigl( \overline{P}_{h} c\bigl((N-1)k,\cdot \bigr) \bigr)_{l} \\ & \equiv \vec{c}^{(l)\top} \vec{\xi} + \bigl( \overline{P}_{h} c\bigl((N-1)k, \cdot \bigr) \bigr)_{l} \quad (l=1,2,\ldots ,L). \end{aligned}$$
(6.3)

Hereafter, we use the notation \(\vec{\sigma}_{0} = [\sigma _{0(m)} ]_{m=1}^{M} \in { \mathbb{R}}^{M}\). We prepare a lemma.

Lemma 2

With \(L=aM\), where a is a sufficiently large integer, there exist \(\vec{w}_{0(h,k)}^{\prime }\in {\mathbb{R}}^{L}\), \(\{ \vec{\theta}_{0(h)}^{(p)}\}_{p=1}^{L} \subset {\mathbb{R}}^{J}\), and \(\{ \theta _{1(h)}^{(p)} \}_{p=1}^{L} \subset {\mathbb{R}}\) such that

$$\begin{aligned} &\vec{w}_{0(h,k)}^{\prime \top} \bigl[r(kA_{h}) \widetilde{U}_{h}^{(N-1)}+ k r(kA_{h}) \bigl[ \phi \bigl( h\vec{\theta}_{0(h)}^{(p)\top} \vec{\xi} + \theta _{1(h)}^{(p)} \bigr) \bigr]_{p} \bigr] \\ & \quad =\vec{\sigma}_{0}^{\top } \bigl[ \phi \bigl( \vec{ \sigma}_{1}^{(m)\top} \vec{\xi} +\theta ^{(m)} \bigr) \bigr]_{m}. \end{aligned}$$

Remark 7

The left-hand side of the equality in Lemma 2 is the inner product of the vectors in \({\mathbb{R}}^{L}\), whereas the right-hand side is that of the vectors in \({\mathbb{R}}^{M}\).

Proof

First, we introduce disjoint subsets of \(\{1,2,\ldots ,L\}\):

$$\begin{aligned} &D_{(m)} \equiv \bigl\{ (m-1)a+1,(m-1)a+2,\ldots ,ma \bigr\} \quad (m=1,2, \ldots ,M). \end{aligned}$$

It is obvious that \(\{1,2,\ldots ,L\} = \bigcup_{m=1}^{M} D_{(m)}\). Then, we take \(\vec{\theta}_{0(h)}^{(p)}\) and \(\theta _{1(h)}^{(p)}\) so that \(h\vec{\theta}_{0(h)}^{(p)} = \vec{\sigma}_{1}^{(m)}\) (\(p \in D_{(m)}\)) and \({\theta}_{1(h)}^{(p)} = {\theta}^{(m)} \) (\(p \in D_{(m)}\)), respectively. Let us take \(\vec{w}_{0(h,k)}^{\prime}\) so that the following are satisfied.

$$\begin{aligned} &\vec{w}_{0(h,k)}^{\prime \top}r(kA_{h}) \widetilde{U}_{h}^{(N-1)}=0, \end{aligned}$$
(6.4)
$$\begin{aligned} &k\vec{w}_{0(h,k)}^{\prime \top}r(kA_{h}) {\boldsymbol {B}}_{h} = \vec{\sigma}_{0}^{ \top}, \end{aligned}$$
(6.5)

where \({\boldsymbol {B}}_{h} =h [\vec{e}_{1},\vec{e}_{2},\ldots ,\vec{e}_{M}]\) is an \(L\times M\) matrix with \(\vec{e}_{j} = [H(l \in D_{(j)})]_{l} \in {\mathbb{R}}^{L}\), \(H(\cdot )\) being a function that returns unity if the statement in the bracket is true, and returns 0 otherwise. (6.4) means that \(\vec{w}_{0(h,k)}^{\prime}\) should belong to a subspace in \({\mathbb{R}}^{L-2}\), which is denoted as \({\mathcal {G}}_{h}\) hereafter. Therefore, we rewrite (6.5) as follows:

$$\begin{aligned} &{\boldsymbol {B}}_{h}^{\top }r(kA_{h})^{\top}|_{{\mathcal {G}}_{h}} \vec{w}_{0(h,k)}^{ \prime }= \vec{\sigma}_{0}/k, \end{aligned}$$
(6.6)

where \({\boldsymbol {B}}_{h}^{\top }r(kA_{h})^{\top} |_{{\mathcal {G}}_{h}}\) denotes the restriction of \({\boldsymbol {B}}_{h}^{\top }r(kA_{h})^{\top}\) onto the space \({\mathcal {G}}_{h} \subset {\mathbb{R}}^{L-2}\).

Based on Proposition 8.14 in [87], (6.6) has a solution if and only if \(\vec{\sigma}_{0} \perp N(r(kA_{h}){\boldsymbol {B}}_{h} |_{{\mathcal {G}}_{h}})\), where \(N(\cdot )\) denotes the kernel of the operator in its argument.

We now show that \(r(kA_{h})\) is of full rank. In fact, because \(A_{h}\) is positive definite, all the eigenvalues of \(A_{h}\) are positive. Moreover, because \(A_{h}\) is self-adjoint, we observe that it is diagonalizable [12], and so is \(r(kA_{h})\). Thus, \(r(kA_{h})\vec{v}=0\) means \(\vec{v}=\vec{0}\). In addition, it is apparent that \({\boldsymbol {B}}_{h}\) is suborthogonal in the sense that its column vectors are orthogonal to each other. Thus, \(N(r(kA_{h}){\boldsymbol {B}}_{h} |_{{\mathcal {G}}_{h}})=\{0\}\), which yields the desired result. □
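Numerically, one concrete way (ours) to realize a solution of (6.4)–(6.5) is to take the minimum-norm solution of the stacked linear constraints, which is one member of the solution set described in Remark 8 below; the synthetic \(A_{h}\), the dimensions, and the test data are illustrative assumptions.

```python
import numpy as np

def output_weights(R, B_h, U_prev, sigma0, k):
    """Find w'_0 satisfying (6.4)-(6.5),
        w'_0 . (R U^{(N-1)}) = 0   and   k (R B_h)^T w'_0 = sigma_0,
    as the minimum-norm solution of the stacked linear constraints."""
    C = np.column_stack([R @ U_prev, k * (R @ B_h)])   # each column is one constraint vector
    rhs = np.concatenate([[0.0], sigma0])
    w, *_ = np.linalg.lstsq(C.T, rhs, rcond=None)      # underdetermined: minimum-norm solution
    return w

# consistency check with synthetic data (L = a*M, with "a sufficiently large")
rng = np.random.default_rng(2)
M_dim, a, k = 3, 4, 0.05
L_dim = a * M_dim
A_fd = L_dim**2 * (2 * np.eye(L_dim) - np.eye(L_dim, k=1) - np.eye(L_dim, k=-1))  # stand-in for A_h
R = np.linalg.inv(np.eye(L_dim) + k * A_fd)                                       # r(k A_h)
B_h = (1.0 / L_dim) * np.repeat(np.eye(M_dim), a, axis=0)                         # B_h = h [e_1, ..., e_M]
sigma0 = rng.standard_normal(M_dim)
U_prev = rng.standard_normal(L_dim)
w = output_weights(R, B_h, U_prev, sigma0, k)
# both residuals should be near machine precision
print(np.abs(k * (R @ B_h).T @ w - sigma0).max(), abs(w @ (R @ U_prev)))
```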

Remark 8

By construction, the solution \(\vec{w}_{0(h,k)}^{\prime}\) to (6.6) depends on h and k. As above, (6.6) has at least one solution. If we denote this solution by \(\breve{w}_{0(h,k)}^{\prime}\), then the set of solutions of (6.6) can be written as \(\breve{w}_{0(h,k)}^{\prime }+ N({\boldsymbol {B}}_{h}^{\top }r(kA_{h})^{\top})\).

Now, we define:

$$\begin{aligned} &w_{0}(x) = w_{0(h,k)(l)}^{\prime } \quad \text{on } I_{l}\ (l=1,2, \ldots ,L), \end{aligned}$$

where \(w_{0(h,k)(l)}^{\prime}\) is the l-th component of the vector \(\vec{w}_{0(h,k)}^{\prime}\). We assert that, for a certain \(R>0\), we have at least one solution as stated in Lemma 2 within a ball of radius R in \(L_{2}(I)\). In the sequel, we use the following notations:

$$\begin{aligned} & {\mathcal {G}}_{\infty}^{(k)} \equiv \bigl\{ f \in L_{2}(I) | f \perp \operatorname{Span} \bigl\langle r(kA)u \bigl((N-1)k,\cdot \bigr) \bigr\rangle \bigr\} , \\ &{\mathcal {G}}_{h}^{(k)} \equiv \bigl\{ f \in L_{2}(I) | f \perp \operatorname{Span} \bigl\langle r(kA_{h}) \overline{P}_{h}u\bigl((N-1)k,\cdot \bigr) \bigr\rangle \bigr\} . \end{aligned}$$

Now, we state

Lemma 3

For a certain \(R>0\), k and \(h_{1}>0\), we have a solution \(\vec{w}_{0(h,k)}^{\prime}\) to (6.4) and (6.5) that satisfies

$$\begin{aligned} & \bigl\Vert \vec{w}_{0(h,k)}^{\prime} \bigr\Vert \leq R \end{aligned}$$

for \(\forall h \in (0,h_{1}]\).

Proof

First, we take a small \(k>0\), \(h_{1}=\frac{1}{L_{1}}>0\), and \(\vec{w}_{0(h_{1},k)}^{\prime }\in {\mathbb{R}}^{L_{1}}\), which satisfy

$$\begin{aligned} \textstyle\begin{cases} k{\boldsymbol {B}}_{h_{1}}^{\top }r(kA_{h_{1}})^{\top }\vec{w}_{0(h_{1},k)}^{ \prime} =\vec{\sigma}_{0}, \\ \vec{w}_{0(h_{1},k)}^{\prime \top} r(kA_{h_{1}}) \widetilde{U}_{h_{1}}^{(N-1)} =0. \end{cases}\displaystyle \end{aligned}$$
(6.7)

Note that the existence of such \(\vec{w}_{0(h_{1},k)}^{\prime}\) is guaranteed by Lemma 2. Moreover, the solution \(\vec{w}_{0(h_{1},k)}^{\prime }\) to (6.7) belongs to the intersection of \({\mathcal {G}}_{h_{1}}^{(k)}\) and a set represented as \(\breve{w}_{0(h_{1},k)} + N({\boldsymbol {B}}_{h_{1}}^{\top})\), where \(\breve{w}_{0(h_{1},k)}^{\prime}\) is the solution to the problem in \({\mathfrak {S}}_{h_{1}}\):

$$\begin{aligned} &r(kA_{h_{1}})^{\top }\breve{w}_{0(h_{1},k)}^{\prime } =\overline{P}_{h_{1}} \widetilde{\sigma}_{0}/k, \end{aligned}$$
(6.8)

with \(\frac{1}{h_{1}}= L_{1}=a_{1} M\). Here, \(\widetilde{\sigma}_{0}\) is a notation used when we regard \(\vec{\sigma}_{0}\) as an element in \(L_{2}(I)\). Hereafter, we often regard \({\mathcal {G}}_{h_{1}}^{(k)}\) as a subset of \(L_{2}(I)\). Note that we can easily obtain the solution of (6.8) if we recall the definition of \(r(kA_{h_{1}})\). We denote one such solution as \(\breve{w}_{0(h_{1},k)}^{\prime }\in {\mathcal {G}}_{h_{1}}^{(k)}\) again:

$$\begin{aligned} \textstyle\begin{cases} r(kA_{h_{1}})^{\top }\breve{w}_{0(h_{1},k)}^{\prime} = \overline{P}_{h_{1}}\widetilde{\sigma}_{0}/k, \\ \breve{w}_{0(h_{1},k)}^{\prime \top} r(kA_{h_{1}}) \widetilde{U}_{h_{1}}^{(N-1)} =0. \end{cases}\displaystyle \end{aligned}$$

For \(h>0\), we define a map \(G_{h}^{(k)} : L_{2}(I) \rightarrow {\mathfrak {S}}_{h}\) as follows:

$$\begin{aligned} &G_{h}^{(k)} \bigl[\breve{w}_{0(h,k)}^{\prime} \bigr] = r(kA_{h}) \breve{P}_{{ \mathcal {G}}_{h}}\breve{w}_{0(h,k)}^{\prime \top} -\overline{P}_{h} \widetilde{\sigma}_{0}/k, \end{aligned}$$

where \(\breve{P}_{{\mathcal {G}}_{h}^{(k)}} : L_{2}(I) \rightarrow { \mathcal {G}}_{h}^{(k)}\) is a projection onto \({\mathcal {G}}_{h}^{(k)}\) with respect to the \(L_{2}\) inner product.

We will show below that if the norm of \(\breve{w}_{0(h,k)}^{\prime}\) is large enough, then, even if we take the projection above, the norm of \(\breve{w}_{0(h,k)}^{\prime}\) after the projection remains large enough as well.

Next, we define

$$\begin{aligned} &S_{R}^{(k)} \equiv \bigl\{ f \in L_{2}(I) | \Vert f \Vert =R, f \in { \mathcal {G}}_{\infty}^{(k)} \bigr\} \quad \bigl(R > \Vert \overline{P}_{h} \widetilde{ \sigma}_{0}/k \Vert \bigr). \end{aligned}$$

Because the dimension of \(({\mathcal {G}}_{\infty}^{(k)} )^{\perp}\) is one, it holds that \(S_{R}^{(k)} \ne \emptyset \). We also take a small \(\varepsilon _{1}>0\) and \(\tilde{h}>0\) sufficiently small so that

$$\begin{aligned} & \bigl\vert \bigl(f, r(kA_{\tilde{h}}) \overline{P}_{\tilde{h}} u \bigl((N-1)k, \cdot \bigr) \bigr) \bigr\vert < \varepsilon _{1}\quad \forall f \in S_{R}^{(k)}. \end{aligned}$$

This is possible if we note

$$\begin{aligned} & \bigl\vert \bigl\vert \bigl( f, r(kA_{h}) \overline{P}_{h} u \bigl((N-1)k, \cdot \bigr) \bigr) \bigr\vert - \bigl\vert \bigl( f, r(kA) u \bigl((N-1)k, \cdot \bigr) \bigr) \bigr\vert \bigr\vert \\ & \quad \leq \Vert f \Vert \bigl\Vert r(kA_{h}) \overline{P}_{h} u \bigl((N-1)k,\cdot \bigr) - r(kA)u \bigl((N-1)k, \cdot \bigr) \bigr\Vert , \end{aligned}$$

and the relationship that holds with \(v \in L_{2}(I)\) [3]:

$$\begin{aligned} & \bigl\Vert r(kA_{h}) v - r(kA)v \bigr\Vert \leq c \bigl( \gamma (h) + h^{r} + k \bigr) \Vert v \Vert , \end{aligned}$$
(6.9)

with r being the one stated right after (6.1), where \(\gamma (h)\) tends to zero as h does. Moreover, let \(R>0\) have a sufficiently large value so that the following holds for some \(\delta _{0}>0\) (R should be redefined, if necessary):

$$\begin{aligned} & \bigl\Vert r(kA) v_{0} - \overline{P}_{h} \widetilde{\sigma}_{0}/k \bigr\Vert > \delta _{0} \quad \forall v_{0} \in S_{R}^{(k)}. \end{aligned}$$
(6.10)

In order to show that this is possible, we can demonstrate the continuity of the resolvent \(r(kA)=(I_{d}+kA)^{-1}\) with respect to k, where \(I_{d}\) is an identity operator. We can prove this by using the resolvent equation [39] and the boundedness of \(r(kA)\), as presented by Fujita and Mizutani [21]. Thus, for \(\varepsilon _{1}>0\) above, if we take a sufficiently small k, we have

$$\begin{aligned} & \bigl\Vert r(kA)v_{0}-v_{0} \bigr\Vert \leq \frac{\varepsilon _{1}}{2}, \end{aligned}$$

for \(v_{0} \in S_{R}^{(k)}\). This yields

$$\begin{aligned} & \bigl\Vert r(kA)v_{0} \bigr\Vert \geq R-\varepsilon _{1}, \end{aligned}$$

with an arbitrary \(\varepsilon _{1}>0\). Therefore, if we take R sufficiently large, we arrive at (6.10) and consequently,

$$\begin{aligned} &G_{h}^{(k)}[v_{0}] \ne 0 \quad \forall h \in \bigl(0,\min \{h_{1}, \tilde{h}\}\bigr), \end{aligned}$$

for this \(v_{0}\). Now, for an arbitrary \(h_{2}=\frac{1}{a_{2}M}\) with \(a_{2}>a_{1}\), we define a homotopy mapping \(H:L_{2}(I) \times [0,1] \rightarrow {\mathfrak {S}}_{h_{2}}\):

$$\begin{aligned} &H(f,s) \equiv s D_{a_{2},a_{1}}G_{h_{1}}^{(k)} f + (1-s)G_{h_{2}}^{(k)}f, \end{aligned}$$

where \(s \in [0,1]\) and \(D_{a_{2},a_{1}}\) is an \(a_{2}M \times a_{1} M\) matrix whose components are either 0 or 1. That is, this matrix is used to regard the image of \(G_{h_{1}}^{(k)}\) as an element of \({\mathbb{R}}^{a_{2}M}\). By virtue of the arguments presented above, we have

$$\begin{aligned} &H(f,s) \ne \vec{\sigma}_{0} \quad \forall f \in S_{R}^{(k)}, s \in [0,1]. \end{aligned}$$

Then, we have

$$\begin{aligned} &H(f,0)= G_{h_{2}}^{(k)}f, \qquad H(f,1)= D_{a_{2},a_{1}}G_{h_{1}}^{(k)}f, \end{aligned}$$

and \(H(f,s)\) is a compact operator for each s because its range has a finite dimension. Owing to the result of degree theory [87], we can conclude that the equation

$$\begin{aligned} &G_{h_{2}}^{(k)}f=0 \end{aligned}$$

has a solution. Consequently,

$$\begin{aligned} &{\boldsymbol {B}}_{h_{2}}^{\top }r(kA_{h_{2}})^{\top } \breve{P}_{{\mathcal {G}}_{h_{2}}^{(k)}}f = {\boldsymbol {B}}_{h_{2}}^{\top } \overline{P}_{h_{2}} \widetilde{\sigma}_{0}/k \end{aligned}$$

has a solution \(\forall h_{2} \in (0,h_{1}]\) that satisfies \(\|f\| \leq R\). If we take \(\vec{w}_{0(h_{2},k)}^{\prime }= \breve{P}_{{\mathcal {G}}_{h_{2}}^{(k)}}f \), this is the desired solution. □

By using this, we assert the following lemma.

Lemma 4

Let h and k be sufficiently small positive numbers. Then, for an arbitrary \(\vec{\xi} \in K\subset {\mathbb{R}}^{J}\) and \(\varepsilon >0\), there exists an array \({\boldsymbol {W}} = [w_{p,l}^{(N-1)} ]_{p,l=1,2,\ldots ,L}\), with which \(\widetilde{U}_{h}^{(N)}\) defined in (6.2) satisfies

$$\begin{aligned} & \bigl\vert F(\vec{\xi}) - \vec{w}_{0(h,k)}^{\prime \top} \widetilde{U}_{h}^{(N)} \bigr\vert < \frac{\varepsilon}{2}. \end{aligned}$$
(6.11)

Proof

In fact, based on (6.4) and (6.5), we consider the following equations for \({\boldsymbol {W}} = [w_{p,l}^{(N-1)} ]_{p,l=1,2,\ldots ,L}\).

$$\begin{aligned} &\textstyle\begin{cases} h\sum_{l=2}^{L-1} w_{p,l}^{(N-1)} \vec{c}^{(l)} \cdot \vec{\xi} = h\vec{\theta}_{0(h)}^{(p)} \cdot \vec{\xi}, \\ h \sum_{l=2}^{L-1} w_{p,l}^{(N-1)} \overline{P}_{h} c((N-1)k,l) + h \sum_{l=1}^{L}w_{p,l}^{(N-1)} = \theta _{1(h)}^{(p)} \\ \quad (p=1,2,\ldots ,L). \end{cases}\displaystyle \end{aligned}$$
(6.12)

For each fixed p (\(1\leq p \leq L\)), this can be written as an equation for \(\vec{w}_{p} \equiv [w_{p,l}^{(N-1)}]_{l}\) as shown below:

$$\begin{aligned} &h{\boldsymbol {T}}_{p} \vec{w}_{p} = \breve{\boldsymbol {\theta}}_{p(h)} \equiv \bigl(h \vec{\theta}_{0(h)}^{(p)} \cdot \vec{\xi}, \theta _{1(h)}^{(p)}\bigr)^{ \top } \in {\mathbb{R}}^{2}, \end{aligned}$$
(6.13)

where

$$\begin{aligned} &{\boldsymbol {T}}_{p} \equiv \begin{bmatrix} 0 & \vec{c}^{(2)} \cdot \vec{\xi} & \cdots & \vec{c}^{(L-1)} \cdot \vec{\xi} & 0 \\ 1 & 1+ (\overline{P}_{h}c((N-1)k,\cdot ) )_{2} &\cdots & 1+ (\overline{P}_{h}c((N-1)k,\cdot ) )_{L-1} & 1 \end{bmatrix} \in {\mathbb{R}}^{2 \times L}. \end{aligned}$$

By the same argument as in the proof of Lemma 2, we shall show that \(N({\boldsymbol {T}}_{p}^{\top}) = \{0\}\). In fact, recalling that \({\boldsymbol {T}}_{p}^{\top}\) is a linear map from \({\mathbb{R}}^{2}\) to \({\mathbb{R}}^{L}\), if \({\boldsymbol {T}}_{p}^{\top }\vec{q}=\vec{0}\) holds with \(\vec{q}=(q_{1},q_{2})^{\top}\), then it can easily be observed that \(q_{2}=0\) (here, the non-vanishing Dirichlet boundary condition imposed in (3.2) is what makes this work). Regarding \(q_{1}\), if \(q_{1} \ne 0\), all the following equalities should hold:

$$\begin{aligned} &\vec{c}^{(l)} \cdot \vec{\xi}=0\quad (l=2,3,\ldots ,L-1). \end{aligned}$$

However, from (6.3), this means that

$$\begin{aligned} &\bigl[\overline{P}_{h}Z*u_{0}(T,\cdot ) \bigr]_{l}=0 \quad (l=2,3,\ldots ,L-1), \end{aligned}$$
(6.14)

which does not hold if we take L sufficiently large. In fact, for an arbitrary \(\varepsilon ^{\prime}>0\), if we take \(h>0\) small enough, we obtain

$$\begin{aligned} & \bigl\Vert Z*u_{0}(T,\cdot ) - \overline{P}_{h} \bigl(Z*u_{0}(T,\cdot ) \bigr) \bigr\Vert < \frac{\varepsilon ^{\prime}}{2}, \end{aligned}$$

and thus, we have

$$\begin{aligned} \bigl\Vert Z*u_{0}(T,\cdot ) \bigr\Vert &\leq \bigl\Vert Z*u_{0}(T,\cdot ) - \overline{P}_{h} \bigl(Z*u_{0}(T,\cdot ) \bigr) \bigr\Vert + \bigl\Vert \overline{P}_{h} \bigl(Z*u_{0}(T,\cdot ) \bigr) \bigr\Vert \\ & \leq \frac{\varepsilon ^{\prime}}{2} + \bigl\Vert \overline{P}_{h} \bigl(Z*u_{0}(T, \cdot ) \bigr) \bigr\Vert . \end{aligned}$$

But (6.14) implies that if we take h sufficiently small, then we can attain

$$\begin{aligned} & \bigl\Vert \overline{P}_{h} \bigl(Z*u_{0}(T,\cdot ) \bigr) \bigr\Vert < \frac{\varepsilon ^{\prime}}{2}. \end{aligned}$$

Thus, we have \(\|Z*u_{0}(T,\cdot ) \| <\varepsilon ^{\prime}\). Because \(\varepsilon ^{\prime}\) is arbitrary, we have \(Z*u_{0}(T,\cdot )=0\). If we recall (6.3), this implies

$$\begin{aligned} & \sum_{j=1}^{\infty }\lambda _{j} \Biggl(\eta _{j},\sum _{q=1}^{J} \xi _{q} \chi _{I_{q}} \Biggr) e^{-\lambda _{j}(N-1)k} \eta _{j}(x)=0, \end{aligned}$$

from which we obtain \(\lambda _{j} (\eta _{j},\sum_{q=1}^{J} \xi _{q} \chi _{I_{q}} ) e^{-\lambda _{j}(N-1)k} = 0\), and consequently, \((\eta _{j},\sum_{q=1}^{J} \xi _{q} \chi _{I_{q}} ) = 0\) for all \(j=1,2,\ldots \). This means \(u_{0} \equiv 0\), a contradiction. Thus, we can conclude that \(\vec{q} = \vec{0}\), that is, \(N({\boldsymbol {T}}_{p}^{\top}) =\{\vec{0}\}\). Hence, (6.13), and consequently (6.12), has a solution. This means that

$$\begin{aligned} &\vec{w}_{0(h,k)}^{\prime \top} \Biggl[ r(kA_{h}) \widetilde{U}_{h}^{(N-1)} + k r(kA_{h}) \Biggl[ \phi \Biggl( h\sum_{l=2}^{L-1} w_{p,l}^{(N-1)} \vec{c}^{(l)} \cdot \vec{\xi} \\ & \quad \quad{}+ h \sum_{l=2}^{L-1} w_{p,l}^{(N-1)} \overline{P}_{h} c \bigl((N-1)k,l\bigr) + h \sum_{l=1}^{L}w_{p,l}^{(N-1)} \Biggr) \Biggr]_{p} \Biggr] \\ & \quad =\vec{\sigma}_{0}^{\top} \bigl[ \phi \bigl( \vec{ \sigma}_{1}^{(m)\top} \vec{\xi} +\theta ^{(m)} \bigr) \bigr]_{m}, \end{aligned}$$

holds with \({\boldsymbol {W}} = [w_{p,l}^{(N-1)} ]_{p,l=1,2,\ldots ,L}\) prescribed. Moreover, recalling (6.3), this is rewritten as

$$\begin{aligned} &\vec{w}_{0(h,k)}^{\prime \top} \Biggl[ r(kA_{h}) \widetilde{U}_{h}^{(N-1)} + k r(kA_{h}) \Biggl[ \phi \Biggl( h\sum_{l=2}^{L-1} w_{p,l}^{(N-1)} \widetilde{U}_{h(l)}^{(N-1)} + h \sum_{l=1}^{L}w_{p,l}^{(N-1)} \Biggr) \Biggr]_{p} \Biggr] \\ & \quad =\vec{\sigma}_{0}^{\top} \bigl[ \phi \bigl( \vec{ \sigma}_{1}^{(m)\top} \vec{\xi} +\theta ^{(m)} \bigr) \bigr]_{m}. \end{aligned}$$

If we further recall (6.2), this implies that

$$\begin{aligned} &\vec{w}_{0(h,k)}^{\prime \top} \bigl[ \widetilde{U}_{h}^{(N)} + k r(kA_{h}) [ P_{h} \phi _{1} - \phi _{1} ]_{p} \bigr] =\vec{\sigma}_{0}^{ \top} \bigl[ \phi \bigl( \vec{\sigma}_{1}^{(m)\top} \vec{\xi} + \theta ^{(m)} \bigr) \bigr]_{m}, \end{aligned}$$

where \(\phi _{1} = \phi ( h\sum_{l=2}^{L-1} w_{p,l}^{(N-1)} \widetilde{U}_{h(l)}^{(N-1)} + h \sum_{l=1}^{L}w_{p,l}^{(N-1)} ) \).

Note that \(\|\vec{w}_{0(h,k)}^{\prime}\|\) is bounded with respect to \((h,k)\) thanks to the proof of Lemma 3. Thus, if we take k small enough, we can make the second term on the left-hand side arbitrarily small. This, together with Lemma 1, yields the desired statement. □

By using the solution of (6.13), we construct a function \(\bar{w}_{1}(t,x,y)\) as follows:

$$\begin{aligned} &\bar{w}_{1}(t,x,y) = \textstyle\begin{cases} w_{p,l}^{(N-1)} & \text{on } ((N-1)k,Nk] \times I_{p} \times I_{l} \ (p,l=1,2,\ldots ,L), \\ 0 & \text{on } (0,(N-1)k]. \end{cases}\displaystyle \end{aligned}$$
(6.15)

By noting that \(\bar{w}_{1} \in L_{2}({\mathcal {H}}_{T})\), we set \(\bar{u}(t,x) \equiv u(t,x;\bar{w}_{1},\vec{\xi})\), which solves (3.3) with \(w_{1}=\bar{w}_{1}\). This neural network with \(\bar{w}_{1}\) and ū can be regarded as a forward neural network with (6.2), which is a kind of RBF network with a Gaussian kernel [7, 84]; in our case, however, we use the fundamental solution with the Dirichlet condition. We have a similar result to Lemma 3 for \(\bar{w}_{1}\) as well.

Corollary 2

For a certain \(R^{\prime}>0\), k and \(h_{2}>0\), we have a solution \(\vec{w}_{p}\) to (6.13) that satisfies

$$\begin{aligned} & \Vert \vec{w}_{p} \Vert \leq R^{\prime }\quad (p=1,2,\ldots ,L), \end{aligned}$$

for all \(h \in (0,h_{2}]\). Thus, for \(\bar{w}_{1}\) defined in (6.15), we have

$$\begin{aligned} & \bigl\Vert \bar{w}_{1}(t,\cdot ,\cdot ) \bigr\Vert _{L_{2}(I\times I)} \leq R^{\prime }, \quad t \in \bigl((N-1)k,T\bigr]. \end{aligned}$$

Remark 9

Regarding the mapping degree of a map between two spaces of different dimensions, one can refer, for instance, to [55].

7 Proof of Theorem 3

Now, we present Lemma 5 below, which is crucial for the proof of the main theorem. It assures that \(\widetilde{U}_{h}^{(N)}\) and \(\bar{u}(T)\) can be made arbitrarily close if we take h and k small enough while maintaining a certain relationship between them. In its proof, we insert an auxiliary variable \(V_{h}^{(N)}\), with which we can prove the lemma by using the estimate of the fundamental solution of the heat equation.

After proving Lemma 5, we can easily prove Theorem 3 by using Lemmas 3 and 4 as well. First, we note that we have the estimate of \(\phi (\cdot )\) just as we did in the proof of Theorem 2, right above (B.18) in Appendix B. Combining it with the a priori estimate there, we can estimate the left-hand side of (7.1) from above as

$$\begin{aligned} & \biggl\Vert \phi \biggl( \int _{I} w_{1}(\cdot ,\cdot ,y) \breve{v}( \cdot ,y) \,\mathrm{d}y + \int _{I} w_{1}(\cdot ,\cdot ,y) \,\mathrm{d}y \biggr) \biggr\Vert _{L_{2}(I_{T})} \leq c\bigl( \vert u_{0} \vert \bigr), \end{aligned}$$
(7.1)

where \(c(|u_{0}|)\) is a positive constant that depends on \(|u_{0}|\).

Lemma 5

\(\{ \widetilde{U}_{h}^{(n)} \}_{n}\) defined in (6.2) satisfies:

$$\begin{aligned} & \bigl\Vert \widetilde{U}_{h}^{(N)}-\bar{u}(T) \bigr\Vert \leq d( k,h), \end{aligned}$$

where \(d( k,h)\) is a quantity independent of \(w_{1}\) that tends to 0 when k and h tend to 0 satisfying \(h^{2}=o (\log ( \frac{T}{k} )^{-1} )\).

Proof

The proof of this lemma proceeds as follows. Because ū is a continuous variable, while \(\widetilde{U}_{h}^{(N)}\) is a discretized one, we insert an auxiliary variable \(V_{h}^{(N)}\) and estimate both \(\|V_{h}^{(N)}-\bar{u}(T)\|\) and \(\|V_{h}^{(N)}-\widetilde{U}_{h}^{(N)}\|\) to obtain the desired result. For the former, we estimate how accurately the discretization approximates the continuous solution; for the latter, we make use of the properties of the Padé approximation.

On the one hand, based on Duhamel’s principle, ū satisfies the following:

$$\begin{aligned} \bar{u}(Nk,x) &= Z*\bar{u}\bigl((N-1)k,\cdot \bigr) + \int _{(N-1)k}^{Nk} { \mathrm{d}}s \int _{I} Z(Nk-s,x,z) \\ & \quad{}\times \phi \biggl( \int _{I} \bar{w}_{1}(s,z,y) \bar{u}(s,y) \, { \mathrm{d}}y + \int _{I} \bar{w}_{1}(s,z,y) \,\mathrm{d}y \biggr) \,\mathrm{d}z. \end{aligned}$$

On the other hand, \(\{ \widetilde{U}_{h}^{(n)} \}_{n}\) defined in (6.2) satisfies the following:

$$\begin{aligned} \widetilde{U}_{h}^{(N)}&=r(kA_{h}) \overline{P}_{h} \bar{u}\bigl((N-1)k, \cdot \bigr) \\ & \quad{}+ k r(kA_{h}) \Biggl[ P_{h}\phi \Biggl( h\sum _{l=2}^{L-1} w_{\cdot ,l}^{(N-1)} \widetilde{U}_{h(l)}^{(N-1)} +h \sum _{l=1}^{L} w_{\cdot ,l}^{(N-1)} \Biggr) \Biggr]. \end{aligned}$$

We also consider

$$\begin{aligned} {V}_{h}^{(n)} &\equiv r(kA_{h})^{n} P_{h} u_{0} \\ & \quad{}+ k \sum_{j=1}^{n} r(kA_{h})^{(n-j)} \\ & \quad{}\times P_{h} \biggl[ \phi \biggl( \int _{I} \bar{w}_{1}(jk,\cdot ,y) \bar{u}(jk,y) \,\mathrm{d}y + \int _{I} \bar{w}_{1}(jk,\cdot ,y) \, { \mathrm{d}}y \biggr) \biggr]. \end{aligned}$$

Note that \(\bar{w}_{1}\) is a piecewise constant function by its construction; thus the right-hand side above makes sense. Recalling \(T=Nk\), we first consider \(\|\bar{u}(T)-{V}_{h}^{(N)}\|\). We have

$$\begin{aligned} \bigl\Vert \bar{u}(T)-{V}_{h}^{(N)} \bigr\Vert &\leq \bigl\Vert Z(Nk)*u_{0} - r(kA_{h})^{N} P_{h} u_{0} \bigr\Vert \\ & \quad{}+ \Biggl\Vert \int _{0}^{Nk} \mathrm{d}s \int _{I} Z(Nk-s,\cdot ,z) \\ & \quad{}\times \phi \biggl( \int _{I} \bar{w}_{1}(s,z,y) \bar{u}(s,y) \, { \mathrm{d}}y + \int _{I} \bar{w}_{1}(s,z,y) \,\mathrm{d}y \biggr) \,\mathrm{d}z \\ & \quad{}- k \sum_{j=1}^{N} r(kA_{h})^{N-j} \\ & \quad{}\times P_{h} \biggl[ \phi \biggl( \int _{I} \bar{w}_{1}(jk,\cdot ,y) \bar{u}(jk,y) \,\mathrm{d}y + \int _{I} \bar{w}_{1}(jk,\cdot ,y) \, { \mathrm{d}}y \biggr) \biggr] \Biggr\Vert . \end{aligned}$$
(7.2)

Based on the estimate presented by Fujita and Mizutani [21], we have

$$\begin{aligned} & \bigl\Vert Z(Nk)*u_{0} - r(kA_{h})^{N} P_{h}u_{0} \bigr\Vert \leq \frac{c_{81}(h^{2}+k)}{T} \vert u_{0} \vert , \end{aligned}$$
(7.3)

where \(c_{81}\) is a positive constant. On the other hand, regarding the second term of the right-hand side of (7.2), we have

$$\begin{aligned} & \Biggl\Vert \int _{0}^{Nk} \mathrm{d}s \int _{I} Z(Nk-s,\cdot ,z) \\ &\quad \quad{}\times \phi \biggl( \int _{I} \bar{w}_{1}(s,z,y) \bar{u}(s,y) \, { \mathrm{d}}y + \int _{I} \bar{w}_{1}(s,z,y) \,\mathrm{d}y \biggr) \,\mathrm{d}z \\ &\quad \quad{}- k \sum_{j=1}^{N} r(kA_{h})^{N-j} \\ &\quad \quad{}\times P_{h} \biggl[ \phi \biggl( \int _{I} \bar{w}_{1} (jk,\cdot ,y) \bar{u}(jk,y) \,\mathrm{d}y + \int _{I} \bar{w}_{1}(jk,\cdot ,y) \, { \mathrm{d}}y \biggr) \biggr] \Biggr\Vert \\ &\quad \quad \leq \Biggl\Vert \int _{0}^{(N-1)k} \mathrm{d}s \int _{I} Z(Nk-s, \cdot ,z) \\ &\quad \quad{}\times \phi \biggl( \int _{I} \bar{w}_{1}(s,z,y) \bar{u}(s,y) \, { \mathrm{d}}y + \int _{I} \bar{w}_{1}(s,z,y) \,\mathrm{d}y \biggr) \,\mathrm{d}z \\ &\quad \quad{}- k\sum_{j=1}^{N-1} \int _{I} Z\bigl((N-j)k,\cdot ,z\bigr) \\ &\quad \quad{}\times \phi \biggl( \int _{I} \bar{w}_{1}(jk,z,y) \bar{u}(jk,y) \, { \mathrm{d}}y + \int _{I} \bar{w}_{1}(jk,z,y) \,\mathrm{d}y \biggr) \,\mathrm{d}z \Biggr\Vert \\ &\quad \quad{}+ \biggl\Vert \int _{(N-1)k}^{Nk} \mathrm{d}s \int _{I} Z(Nk-s,\cdot ,z) \\ &\quad \quad{}\times \phi \biggl( \int _{I} \bar{w}_{1}(s,z,y) \bar{u}(s,y)\, { \mathrm{d}}y + \int _{I} \bar{w}_{1}(s,z,y) \,\mathrm{d}y \biggr) \,\mathrm{d}z \\ &\quad \quad{}- kP_{h} \biggl[ \phi \biggl( \int _{I} \bar{w}_{1}(Nk,\cdot ,y) \bar{u}(Nk,y) \,\mathrm{d}y + \int _{I} \bar{w}_{1}(Nk,\cdot ,y) \, { \mathrm{d}}y \biggr) \biggr] \biggr\Vert \\ &\quad \quad{}+k \Biggl\Vert \sum_{j=1}^{N-1} \int _{I} Z\bigl((N-j)k,\cdot ,z\bigr) \\ &\quad \quad{}\times \phi \biggl( \int _{I} \bar{w}_{1}(jk,z,y) \bar{u}(jk,y) \, { \mathrm{d}}y + \int _{I} \bar{w}_{1}(jk,z,y) \,\mathrm{d}y \biggr) \,\mathrm{d}z \\ &\quad \quad{}- \sum_{j=1}^{N-1} r(kA_{h})^{N-j} \\ &\quad \quad{}\times P_{h} \biggl[ \phi \biggl( \int _{I} \bar{w}_{1}(jk,\cdot ,y) \bar{u}(jk,y) \,\mathrm{d}y + \int _{I} \bar{w}_{1}(jk,\cdot ,y) \, { \mathrm{d}}y \biggr) \biggr] \Biggr\Vert \\ & \quad \equiv \sum_{j=1}^{3} J_{j}. \end{aligned}$$

Regarding the estimate of \(J_{1}\), let us recall (6.15). Then, following the approach of Hoff and Smoller [30], we have

$$\begin{aligned} &J_{1} \leq k \phi (0) \int _{0}^{(N-1)k} \,\mathrm{d}s \int _{I} \frac{\partial Z}{\partial t} (Nk-s,x,z) \,\mathrm{d}z \\ & \quad \leq k \phi (0) \int _{0}^{(N-1)k} (Nk-s)^{-\frac{3}{2}} \, \mathrm{d}s \int _{I} e^{-\frac{(x-z)^{2}}{Nk-s}} \,\mathrm{d}z \\ & \quad \leq 2k \phi (0) \biggl( \frac{1}{\sqrt{k}}-\frac{1}{\sqrt{T}} \biggr). \end{aligned}$$

On the other hand,

$$\begin{aligned} J_{2} &\leq \biggl\Vert \int _{(N-1)k}^{Nk} \mathrm{d}s \int _{I} Z(Nk-s, \cdot ,z) \\ & \quad{}\times \phi \biggl( \int _{I} \bar{w}_{1}(s,z,y) \bar{u}(s,y) \, { \mathrm{d}}y + \int _{I} \bar{w}_{1}(s,z,y) \,\mathrm{d}y \biggr) \,\mathrm{d}z \\ & \quad{}- k\phi \biggl( \int _{I} \bar{w}_{1}(T,\cdot ,y) \bar{u}(T,y) \, { \mathrm{d}}y + \int _{I} \bar{w}_{1}(T,\cdot ,y) \, \mathrm{d}y \biggr) \biggr\Vert \\ & \quad{}+ \biggl\Vert k\phi \biggl( \int _{I} \bar{w}_{1}(T,\cdot ,y) \bar{u}(T,y) \,\mathrm{d}y + \int _{I} \bar{w}_{1}(T,\cdot ,y) \, \mathrm{d}y \biggr) \\ & \quad{}- kP_{h} \biggl[ \phi \biggl( \int _{I} \bar{w}_{1}(T,\cdot ,y) \bar{u}(T,y) \,\mathrm{d}y + \int _{I} \bar{w}_{1}(T,\cdot ,y) \, \mathrm{d}y \biggr) \biggr] \biggr\Vert \\ & =k \Biggl\Vert \frac{1}{k} \int _{(N-1)k}^{Nk} \mathrm{d}s \int _{I} Z(Nk-s,x,z) \\ & \quad{}\times \phi \Biggl( \sum_{l} w_{k,l}^{(N)} \int _{I_{l}} \bar{u}(s,y) \,\mathrm{d}y + \sum _{l=1}^{L} w_{k,l}^{(N)} \Biggr) \,\mathrm{d}z \\ & \quad{}- \phi \Biggl( \sum_{l} w_{\cdot ,l}^{(N)} \int _{I_{l}}\bar{u}(T,y) \,\mathrm{d}y + \sum _{l=1}^{L} w_{k,l}^{(N)} \Biggr) \Biggr\Vert + c\bigl( \vert u_{0} \vert \bigr) k. \end{aligned}$$

Note that for a sufficiently smooth function \(f(s)\) of s, we have

$$\begin{aligned} &\frac{1}{k} \int _{(N-1)k}^{Nk}f(s) \,\mathrm{d}s = f(Nk)+o(k). \end{aligned}$$

Thus, we have

$$\begin{aligned} &\frac{1}{k} \int _{(N-1)k}^{Nk} \mathrm{d}s \int _{I} Z(Nk-s,x,z) \phi \Biggl( \sum _{l} w_{\cdot ,l}^{(N-1)} \int _{I_{l}}\bar{u}(T,y) \,\mathrm{d}y + \sum _{l=1}^{L} w_{\cdot ,l}^{(N-1)} \Biggr) \,\mathrm{d}z \\ & \quad = \phi \Biggl( \sum_{l} w_{\cdot ,l}^{(N-1)} \int _{I_{l}}\bar{u}(T,y) \,\mathrm{d}y + \sum _{l=1}^{L} w_{\cdot ,l}^{(N-1)} \Biggr) +o(k), \end{aligned}$$

which yields \(|J_{2}| \leq c_{82} ( k+ko(k) )\) with some \(c_{82}>0\). As for the estimate of \(J_{3}\), applying (7.3) again together with (7.1) and Corollary 2, we have

$$\begin{aligned} \Vert J_{3} \Vert &= k \Biggl\Vert \sum _{j=1}^{N-1} \biggl\{ \int _{I} Z\bigl((N-j)k, \cdot ,z\bigr) \\ & \quad{}\times \phi \biggl( \int _{I} \bar{w}_{1}(jk,z,y) \bar{u}(jk,y) \, { \mathrm{d}}y + \int _{I} \bar{w}_{1}(jk,z,y) \,\mathrm{d}y \biggr) \,\mathrm{d}z \\ & \quad{}-r(kA_{h})^{N-j} \\ & \quad{}\times P_{h} \biggl( \phi \biggl( \int _{I} \bar{w}_{1}(jk,\cdot ,y) \bar{u}(jk,y) \,\mathrm{d}y + \int _{I} \bar{w}_{1}(jk,\cdot ,y)\, { \mathrm{d}}y \biggr) \biggr) \biggr\} \Biggr\Vert \\ & \leq c_{81} \bigl(h^{2}+k\bigr)\sum _{j=1}^{N-1} \frac{1}{(N-j)} \biggl\vert \phi \biggl( \int _{I} \bar{w}_{1}(jk,\cdot ,y) \bar{u}(jk,y) \,\mathrm{d}y + \int _{I} \bar{w}_{1}(jk,\cdot ,y) \, \mathrm{d}y \biggr) \biggr\vert . \end{aligned}$$

Recalling (6.15), the rightmost side above is estimated by

$$\begin{aligned} &c_{81} \bigl\vert \phi (0) \bigr\vert \bigl(h^{2}+k \bigr) \sum_{j=1}^{N-1} \frac{1}{(N-j)} \\ & \quad = c_{81} \bigl\vert \phi (0) \bigr\vert \bigl(h^{2}+k\bigr) \biggl\{ \log (N-1) + \frac{1}{2(N-1)} + \frac{1}{2} + \int _{1}^{N-1} \frac{P_{1}(t)}{t^{2}} \, \mathrm{d}t \biggr\} , \end{aligned}$$

where \(P_{1}(t)=\{t\}-\frac{1}{2}\), with \(\{x\}\) being the fractional part of its argument, and we have used the Euler–Maclaurin formula [38]. In particular, since \(N-1< T/k\), this harmonic sum is bounded by \(1+\log (T/k)\). Combining these estimates, under the assumption of the lemma, we arrive at the following:

$$\begin{aligned} & \bigl\Vert \bar{u}(T)-{V}_{h}^{(N)} \bigr\Vert \leq c(h,k), \end{aligned}$$
(7.4)

where \(c(h,k) \rightarrow 0\) as \(h,k \rightarrow 0\) satisfying \(h^{2} = o ( \log (T/k)^{-1} )\). Next, we estimate \(\|{V}_{h}^{(N)}-\widetilde{U}_{h}^{(N)}\|\). Recall the following equalities.

$$ \begin{aligned} &\begin{aligned} \widetilde{U}_{h}^{(N-1)}&=\overline{P}_{h} e^{-(N-2)kA}u_{0} \\ & \quad{}+ k\overline{P}_{h} \sum_{j=1}^{N-1} e^{-(N-j-1)kA} \\ & \quad{}\times \biggl[ \phi \biggl( \int _{I} \bar{w}_{1}(jk,\cdot ,y) \bar{u}(jk,y) \,\mathrm{d}y + \int _{I} \bar{w}_{1}(jk,\cdot ,y) \, \mathrm{d}y \biggr) \biggr], \end{aligned} \\ &\begin{aligned} {V}_{h}^{(N-1)} &\equiv r(kA_{h})^{N-1} P_{h} u_{0} \\ & \quad{}+ k \sum_{j=1}^{N-1} r(kA_{h})^{N-j-1} \\ & \quad{}\times P_{h} \biggl[ \phi \biggl( \int _{I} \bar{w}_{1}(jk,x,y)\bar{u}(jk,y) \, \mathrm{d}y + \int _{I} \bar{w}_{1}(jk,x,y) \,\mathrm{d}y \biggr) \biggr]. \end{aligned} \end{aligned} $$
(7.5)

Thus, we have

$$\begin{aligned} \bigl\Vert \widetilde{U}_{h}^{(N-1)}-{V}_{h}^{(N-1)} \bigr\Vert &\leq \bigl\Vert \overline{P}_{h} e^{-(N-2)kA}u_{0} - r(kA_{h})^{N-2} P_{h} u_{0} \bigr\Vert \\ & \quad{}+ k \sum_{j=1}^{N-2} \biggl\Vert \overline{P}_{h} e^{-(N-j-1)kA} \\ & \quad{}\times \biggl[ \phi \biggl( \int _{I} \bar{w}_{1}(jk,\cdot ,y) \bar{u}(jk,y) \,\mathrm{d}y + \int _{I} \bar{w}_{1}(jk,\cdot ,y) \, \mathrm{d}y \biggr) \biggr] \\ & \quad{}- r(kA_{h})^{N-j-1} \\ & \quad{}\times P_{h} \biggl[ \phi \biggl( \int _{I} \bar{w}_{1}(jk,\cdot ,y) \bar{u}(jk,y) \,\mathrm{d}y + \int _{I} \bar{w}_{1}(jk,\cdot ,y) \, { \mathrm{d}}y \biggr) \biggr] \biggr\Vert . \end{aligned}$$
(7.6)

Regarding the first term of the right-hand side of (7.6), we have the following inequality.

$$\begin{aligned} & \bigl\Vert \overline{P}_{h} e^{-(N-2)kA}u_{0} - r(kA_{h})^{N-2} P_{h} u_{0} \bigr\Vert \\ & \quad \leq \bigl\Vert \overline{P}_{h} e^{-(N-2)kA}u_{0} - e^{-(N-2)kA} u_{0} \bigr\Vert \\ & \quad \quad{}+ \bigl\Vert e^{-(N-2)kA}u_{0} - r(kA_{h})^{N-2} P_{h} u_{0} \bigr\Vert . \end{aligned}$$
(7.7)

Because N is sufficiently large, we have \(e^{-(N-2)kA}u_{0} \in H^{2}(I)\). For an arbitrary \(\varepsilon _{2}>0\), if we take h sufficiently small, we can obtain

$$\begin{aligned} & \Vert \overline{P}_{h} f - f \Vert < \varepsilon _{2}, \end{aligned}$$
(7.8)

for a uniformly continuous function f in general. Moreover, we have

$$\begin{aligned} & \bigl\Vert e^{-(N-2)kA}u_{0} - r(kA_{h})^{N-2} P_{h} u_{0} \bigr\Vert \leq c_{81} \bigl(h^{2}+k\bigr) \vert u_{0} \vert . \end{aligned}$$

By combining this with (7.8) and applying them to (7.7), we obtain the estimate

$$\begin{aligned} & \bigl\Vert \overline{P}_{h} e^{-(N-2)kA}u_{0} - r(kA_{h})^{N-2} P_{h} u_{0} \bigr\Vert \leq c\bigl( \vert u_{0} \vert \bigr) \bigl(h^{2}+k\bigr)+\varepsilon _{2}. \end{aligned}$$

Regarding the second term of the right-hand side of (7.6), we have

$$\begin{aligned} &k \sum_{j=1}^{N-2} \biggl\| \overline{P}_{h} e^{-(N-j-1)kA} \\ & \quad \quad{}\times \biggl[ \phi \biggl( \int _{I} \bar{w}_{1}(jk,\cdot ,y) \bar{u}(jk,y) \,\mathrm{d}y + \int _{I} \bar{w}_{1}(jk,\cdot ,y) \, \mathrm{d}y \biggr) \biggr] \\ & \quad \quad{}- r(kA_{h})^{N-j-1} \\ & \quad \quad{}\times P_{h} \biggl[ \phi \biggl( \int _{I} \bar{w}_{1}(jk,\cdot ,y) \bar{u}(jk,y) \,\mathrm{d}y + \int _{I} \bar{w}_{1}(jk,\cdot ,y) \, { \mathrm{d}}y \biggr) \biggr] \biggr\| \\ & \quad \leq k \phi (0) \Biggl\{ \sum_{j=1}^{N-2} \bigl\Vert \overline{P}_{h} e^{-(N-j-1)kA} 1 -e^{-(N-j-1)kA}1 \bigr\Vert \\ & \quad \quad{}+k \sum_{j=1}^{N-2} \bigl\Vert e^{-(N-j-1)kA}1 -r(kA_{h})^{N-j-1}P_{h} 1 \bigr\Vert \Biggr\} \\ & \quad \leq \varepsilon _{2}\phi (0) (T-2k) + c_{81} \phi (0) \sum_{j=1}^{N-2} \frac{h^{2}+k}{N-j-1} \\ & \quad \leq \varepsilon _{2} T \phi (0) +c_{81}\phi (0) \bigl(h^{2}+k\bigr) \biggl\{ 1+ \log \biggl(\frac{T}{k}-2 \biggr) \biggr\} . \end{aligned}$$
(7.9)

Thus, (7.6), (7.7), and (7.9) yield

$$\begin{aligned} \bigl\Vert \widetilde{U}_{h}^{(N-1)} - V_{h}^{(N-1)} \bigr\Vert &\leq c\bigl( \vert u_{0} \vert \bigr) \bigl(h^{2}+k\bigr) \biggl\{ 1+\log \biggl(\frac{T}{k}-2 \biggr) \biggr\} \\ & \quad{}+c_{84}\varepsilon _{2}+\varepsilon _{2} T \phi (0). \end{aligned}$$
(7.10)

Next, we proceed to the estimation of \(\| \widetilde{U}_{h}^{(N)} -V_{h}^{(N)} \|\). Note that \(V_{h}^{(N)}\) satisfies the following recurrence relation:

$$\begin{aligned} V_{h}^{(N)} &=r(kA_{h}) V_{h}^{(N-1)} \\ & \quad{}+kr(kA_{h}) P_{h} \biggl[ \phi \biggl( \int _{I} \bar{w}_{1}(Nk,\cdot ,y) \bar{u}(Nk,y) \,\mathrm{d}y \\ & \quad{}+ \int _{I} \bar{w}_{1}(Nk,\cdot ,y) \, \mathrm{d}y \biggr) \biggr]. \end{aligned}$$

Then, by using (6.2), we observe that

$$\begin{aligned} \widetilde{U}_{h}^{(N)} -V_{h}^{(N)} &= r(kA_{h}) \bigl( \widetilde{U}_{h}^{(N-1)} -V_{h}^{(N-1)} \bigr) \\ & \quad{}+ kr(kA_{h}) P_{h} \Biggl[ \phi \Biggl( h \sum_{l=2}^{L-1} {w}_{ \cdot ,l}^{(N-1)} \widetilde{U}_{h(l)}^{(N-1)} + h\sum _{l=1}^{L} {w}_{ \cdot ,l}^{(N-1)} \Biggr) \\ & \quad{}-\phi \biggl( \int _{I} \bar{w}_{1}(Nk,\cdot ,y) \bar{u}(Nk,y) \, { \mathrm{d}}y \\ & \quad{}+ \int _{I} \bar{w}_{1}(Nk,\cdot ,y) \, \mathrm{d}y \biggr) \Biggr]. \end{aligned}$$
(7.11)

Recalling (6.15), we have that \(h \sum_{l=1}^{L} w_{\cdot ,l}^{(N-1)} =\int _{I} \bar{w}_{1}(Nk, \cdot ,y) \,\mathrm{d}y\), and

$$\begin{aligned} \int _{I} \bar{w}_{1}(Nk,x,y) \, \mathrm{d}y &= \sum_{l=1}^{L} \int _{I_{l}}\bar{w}_{1}(Nk,x,y) \, \mathrm{d}y \\ & =h\sum_{l=1}^{L} w_{p,l}^{(N-1)} \quad (x \in I_{p}). \end{aligned}$$

Similarly, if we recall the definition of \(\overline{P}_{h}\), we have

$$ \widetilde{U}_{h(l)}^{(N-1)} = \bigl[\overline{P}_{h} \bar{u}\bigl((N-1)k, \cdot \bigr) \bigr]_{l} =\frac{1}{h} \int _{I_{l}}\bar{u}\bigl((N-1)k,y\bigr) \, { \mathrm{d}}y. $$

Thus, we have

$$\begin{aligned} & \phi \Biggl( h \sum_{l=2}^{L-1} {w}_{\cdot ,l}^{(N-1)} \widetilde{U}_{h(l)}^{(N-1)} + h\sum_{l=1}^{L} {w}_{\cdot ,l}^{(N-1)} \Biggr) \\ & \quad = \phi ( \int _{I} \bar{w}_{1}((Nk,\cdot ,y) \bar{u} \biggl((Nk,y) \, { \mathrm{d}}y + \int _{I} \bar{w}_{1}(Nk,\cdot ,y) \, \mathrm{d}y \biggr), \end{aligned}$$

which implies that the second term on the right-hand side of (7.11) vanishes. Thus, owing to this and (7.10), we can estimate (7.11) as shown below.

$$\begin{aligned} \bigl\Vert r(kA_{h}) \bigl( \widetilde{U}_{h}^{(N-1)} -V_{h}^{(N-1)} \bigr) \bigr\Vert &\leq c\bigl( \vert u_{0} \vert \bigr) \bigl(h^{2}+k\bigr) \biggl\{ 1+\log \biggl( \frac{T}{k}-2 \biggr) \biggr\} \\ & \quad{}+c_{84}\varepsilon _{2}+\varepsilon _{2} T \phi (0). \end{aligned}$$

Because the second term on the right-hand side of (7.11) has already been shown to vanish, this bound on the first term is all that is required.

Thus, we arrive at the estimate:

$$\begin{aligned} & \bigl\Vert \widetilde{U}_{h}^{(N)}-V_{h}^{(N)} \bigr\Vert \leq c\bigl( \vert u_{0} \vert \bigr) \bigl(h^{2}+k\bigr) \biggl\{ 1+\log \biggl(\frac{T}{k}-2 \biggr) \biggr\} +\varepsilon _{2}+ \varepsilon _{2} T \phi (0). \end{aligned}$$

Finally, by combining (7.4) with the estimate above, we arrive at the desired inequality of Lemma 5, because \(\varepsilon _{2}>0\) is arbitrary. □

Owing to Lemmas 3, 4, and 5, if the spatio-temporal mesh \(( (\triangle x)_{i},(\triangle t)_{i} )\) is sufficiently fine and the relationship \(h^{2}= o (\log ( \frac{T}{k} )^{-1} )\) is satisfied, we can approximate the solution u of (3.3) by its fully discretized counterpart, regardless of \(w_{1}\). This leads us to the proof of Theorem 3 in Sect. 4. Actually, owing to Lemma 3, we can assume that \(\|\vec{w}_{0(h,k)}^{\prime}\| \leq R\) with some \(R>0\). Then, we have

$$\begin{aligned} \biggl\vert F(\xi )- \int _{I} \bar{w}_{0}(x)\bar{u}(T,x;\xi ) \, \mathrm{d}x \biggr\vert &\leq \bigl\vert F(\xi )-\vec{w}_{0(h,k)}^{\prime \top} \widetilde{U}_{h}^{(N)} \bigr\vert \\ & \quad{}+ \biggl\vert \vec{w}_{0(h,k)}^{\prime \top} \widetilde{U}_{h}^{(N)}- \int _{I} \bar{w}_{0}(x)\bar{u}(T,x;\xi ) \, \mathrm{d}x \biggr\vert \\ & \leq \frac{\varepsilon}{2} + Rc(h,k). \end{aligned}$$
(7.12)

Thus, if h and k are set to sufficiently small values maintaining the relationship \(h^{2}= o (\log ( \frac{T}{k} )^{-1} )\), the right-hand side of (7.12) can be made less than ε. This proves Theorem 3.
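As a quick numerical sanity check (not part of the proof), the following minimal Python sketch evaluates the dominant bound \((h^{2}+k) \{1+\log (\frac{T}{k}-2 ) \}\) appearing in (7.10) along a mesh family coupled as \(h=(\log (T/k))^{-1}\), which satisfies \(h^{2}= o (\log ( \frac{T}{k} )^{-1} )\); the printed values decrease toward 0 as k is refined.

```python
import math

# Minimal numeric sketch (not the authors' code): behavior of the dominant
# bound (h^2 + k) * {1 + log(T/k - 2)} from (7.10) when the mesh widths are
# coupled as h = 1/log(T/k), so that h^2 = o( (log(T/k))^{-1} ).
T = 1.0
for k in [1e-2, 1e-4, 1e-6, 1e-8, 1e-10]:
    h = 1.0 / math.log(T / k)
    bound = (h ** 2 + k) * (1.0 + math.log(T / k - 2.0))
    print(f"k = {k:.0e},  h = {h:.3f},  bound = {bound:.4f}")
```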

Remark 10

The estimate (7.12) above holds for a fixed value of ν. Actually, the estimate (7.3) of Fujita and Mizutani [21] is obtained by assuming \(\nu =1\). In the general case, let us introduce the transform \(\bar{x} = x/\sqrt{\nu}\). Then, the problem

$$\begin{aligned} &u_{t}-\nu u_{xx}=0 \quad \text{in } I_{T}, \end{aligned}$$

with an initial value \(u_{0}(x)\) for a function \(u(t,x)\) is transformed into the form:

$$\begin{aligned} &\bar{u}_{t}- \bar{u}_{\bar{x}\bar{x}}=0 \quad \text{in } (0,1/\sqrt{ \nu}) \times (0,T), \end{aligned}$$

for a function \(\bar{u}(t,\bar{x})\) with an initial value \(\bar{u}_{0}(\bar{x}) = u_{0}(x)\). We can easily find that

$$\begin{aligned} & \Vert \bar{u}_{0} \Vert _{L_{2}(0,1/\sqrt{\nu})}^{2} = \frac{1}{\sqrt{\nu}} \Vert u_{0} \Vert _{L_{2}(I)}^{2}, \end{aligned}$$

which means that the right-hand side of (7.12) diverges as \(\nu \rightarrow 0\). Therefore, we leave the behavior of the universal approximation property proved here in the limit \(\nu \rightarrow 0\) as an open problem.

8 Capacity and learnability of the model

In the proof of the universal approximation property, we fixed the values of \(w_{1}\) up to the time step immediately before the terminal moment. However, this does not mean that the temporal direction is unnecessary in our model. The universal approximation property is not the only property that a learner should possess; learnability and generalization performance are also important. In this section, we show that our model is learnable in a certain sense. In doing so, we also observe that the estimates used to deduce learnability depend on time.

In this regard, we discuss other aspects of the proposed model in the sequel. First, we address its learnability, focusing on classification performance measures and classes of functions, such as the VC-dimension and the Glivenko–Cantelli class. Our discussion is limited to binary classification. Although we discuss the VC-dimension, we refer the reader to the monographs [8, 65] for its definition. Finally, we present the results of our numerical experiments.

8.1 Learnability

In this section, we discuss the learnability of the proposed model. Hereafter, we often denote the solution to (3.3) (or equivalently, (3.1)–(3.2)) as \(u(t,x;w_{1},\vec{\xi},\nu )\) to indicate clearly its dependence on \(w_{1}\), ξ⃗, and ν. Then, given the terminal moment T and the diffusion coefficient ν, we define a hypothesis set comprising the functions on \({\mathbb{R}}^{J}\) realized by our model:

$$\begin{aligned} &{\mathscr{F}}_{T}^{(\nu )} \equiv \biggl\{ \vec{\xi} \longmapsto \int _{I} w_{0}(x) u(T,x;w_{1}, \vec{\xi} ,\nu ) \,\mathrm{d}x \Big| w_{0} \in L_{2}(I), w_{1} \in L_{2}({\mathcal {H}}_{T}) \biggr\} . \end{aligned}$$
(8.1)

Let us discuss the learnability of \({\mathscr{F}}_{T}^{(\nu )}\). In the following, the VC-dimension of a hypothesis set \({\mathscr{F}}\) is denoted as \(VC({\mathscr{F}})\). Our first result is the following.

Theorem 4

Suppose that the assumptions of Theorem 3 are satisfied. Let T and ν be arbitrary positive numbers. Then, for our proposed PDE-based neural network,

$$\begin{aligned} &VC\bigl({\mathscr{F}}_{T}^{(\nu )}\bigr)=+\infty . \end{aligned}$$

Proof

Suppose that we are given an arbitrary \(N \in {\mathbb{N}}\) and a dataset \(\{\vec{\xi}_{i},y_{i}\}_{i=1}^{N} \subset {\mathbb{R}}^{J} \times \{\pm 1 \}\). Then, let us take \(\varepsilon >0\) so that \(B(\vec{\xi}_{i};\varepsilon ) \cap B(\vec{\xi}_{j};\varepsilon ) = \emptyset \) for \(i \ne j\). By virtue of Theorem 3, by suitably taking \(w_{0}\) and \(w_{1}\), we can construct a continuous function \(f \in {\mathscr{F}}_{T}^{(\nu )}\) that associates each element in \(B(\vec{\xi}_{i};\varepsilon )\) with \(y_{i}\) for all \(i=1,2,\ldots ,N\). This means that the set \({\mathscr{F}}_{T}^{(\nu )}\) shatters the given dataset for an arbitrary \(N \in {\mathbb{N}}\). □

Theorem 4 implies that uniform learning of \({\mathscr{F}}_{T}^{(\nu )}\) would require an infinite amount of training data, which is practically impossible, and hence that our model is not PAC-learnable in the classical sense [65]. However, using the concept of a structural risk minimization (SRM) scheme, we can still show that it is nonuniformly learnable [65]. A relaxation of the concept of learnability of this kind has also been applied to support vector machines [8].

To discuss this in more detail, we introduce certain notations. In general, the “risk” over a loss function \(l(\cdot )\) and a general hypothesis set \({\mathscr{F}}\) is defined by the following:

$$\begin{aligned} &L_{D}(h) = E_{z \sim {\mathcal {D}}} \bigl[ l(h;z) \bigr], \end{aligned}$$
(8.2)

where \({\mathcal {D}}\) is an unknown data-generating distribution over \({\mathcal {Z}} \equiv {\mathcal {X}} \times \{\pm 1\}\), with \({\mathcal {X}}\) being a set of inputs. The notation \(z \sim {\mathcal {D}}\) means that a random variable z is drawn from \({\mathcal {D}}\). Similarly, we use the notation \(S \sim {\mathcal {D}}^{m}\) to denote that a dataset S of sample size m is drawn i.i.d. from \({\mathcal {D}}\). If some \(h \in {\mathscr{F}}\) attains the minimum of (8.2), we call it a Bayesian hypothesis. However, we usually do not know the actual distribution \({\mathcal {D}}\). For this reason, we usually try to minimize a surrogate quantity, called the empirical risk:

$$\begin{aligned} &L_{S}(h) \equiv \frac{1}{m} \sum _{i=1}^{m} l(h;\vec{\xi}_{i},y_{i}), \end{aligned}$$

where \(S = \{(\vec{\xi}_{i},y_{i})\}_{i=1}^{m} \subset {\mathcal {Z}}\) represents the training data drawn from the original unknown distribution \({\mathcal {D}}\). This framework is called empirical risk minimization (ERM). By the law of large numbers, \(L_{S}(h)\) converges to the true risk as \(m \rightarrow +\infty \) for each fixed h. We also define

$$\begin{aligned} &\hat{h}_{S}={\mathrm{ERM}}_{{\mathscr{F}}}(S) \in \operatorname*{argmin}_{h \in { \mathscr{F}}} L_{S}(h), \end{aligned}$$

where \({\mathrm{ERM}}_{{\mathscr{F}}}(S)\) denotes a hypothesis returned as (one of) the minimizer(s) of the empirical risk under the training dataset S.
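For illustration, the ERM rule can be written as the following toy Python sketch for binary classification with the 0–1 loss; the finite candidate family of linear threshold hypotheses and the synthetic sample are purely illustrative and are unrelated to the concrete hypothesis set \({\mathscr{F}}_{T}^{(\nu )}\).

```python
import numpy as np

# Toy sketch of the ERM rule for binary classification with the 0-1 loss;
# hypotheses are arbitrary callables xi -> {+1, -1}.
def empirical_risk(h, S):
    """Empirical 0-1 risk L_S(h) over a sample S = [(xi, y), ...]."""
    return sum(float(h(xi) != y) for xi, y in S) / len(S)

def erm(hypotheses, S):
    """Return (one of) the empirical-risk minimizers in a finite family."""
    return min(hypotheses, key=lambda h: empirical_risk(h, S))

# Illustrative synthetic data and a small family of linear threshold rules.
rng = np.random.default_rng(0)
S = [(rng.normal(size=2), int(np.sign(rng.normal()) or 1)) for _ in range(50)]
candidates = [
    (lambda xi, w=w: 1 if float(w @ xi) >= 0.0 else -1)
    for w in rng.normal(size=(20, 2))
]
h_hat = erm(candidates, S)
print("empirical risk of the ERM hypothesis:", empirical_risk(h_hat, S))
```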

To evaluate the “goodness” of the training data, we define the following concept.

Definition 1

A training set S is called ε-representative with respect to the domain \({\mathcal {Z}} \equiv {\mathcal {X}} \times \{\pm 1\}\), hypothesis set \({\mathscr{F}}\), loss function \(l(\cdot )\), and distribution \({\mathcal {D}}\) if the following holds.

$$\begin{aligned} & \bigl\vert L_{S}(h)- L_{\mathcal {D}}(h) \bigr\vert \leq \varepsilon \quad \forall h \in {\mathscr{F}}. \end{aligned}$$

To determine the conditions under which the ERM scheme works well, we need the following definition (please refer to [65], Definition 4.3).

Definition 2

We say that a hypothesis set \({\mathscr{F}}\) possesses the uniform convergence property with respect to the domain \({\mathcal {Z}}\) and loss function \(l(\cdot )\) if there exists a function \(m_{\mathscr{F}}^{UC}:(0,1)^{2} \rightarrow {\mathbb{N}}\), which is called the sample complexity, such that for each \(\varepsilon ,\delta \in (0,1)\) and for every probability distribution \({\mathcal {D}}\) over \({\mathcal {Z}}\), if S is a sample of \(m \geq m_{\mathscr{F}}^{UC}(\varepsilon ,\delta )\) elements that are drawn i.i.d. according to \({\mathcal {D}}\), then, with a probability of at least \(1-\delta \), S is ε-representative.

A well-known theorem (see [65], Theorem 6.7) states that the uniform convergence property is equivalent to the finiteness of the VC-dimension of the hypothesis set. Thus, together with Theorem 4 above, our hypothesis set \({\mathscr{F}}_{T}^{(\nu )}\) does not satisfy the uniform convergence property itself (consequently, it is neither PAC nor agnostically PAC learnable, although we omit the definitions of these terms here). However, we can also consider a relaxed concept of learnability [65].

Definition 3

A hypothesis set \({\mathscr{F}}\) is said to be non-uniformly learnable if there exists a learning algorithm A that associates a dataset S with a hypothesis \(A(S) \in {\mathscr{F}}\) and a function \(m_{\mathscr{F}}:(0,1)^{2} \times {\mathscr{F}} \rightarrow {\mathbb{N}}\), such that for every \(\varepsilon , \delta \in (0,1)\), and for every \(h \in {\mathscr{F}}\), if \(m\geq m_{\mathscr{F}}(\varepsilon ,\delta ,h)\) then for every distribution \({\mathcal {D}}\) over \({\mathcal {X}} \times \{\pm 1\}\), with a probability of at least \(1-\delta \) over the choice of \(S \sim {\mathcal {D}}^{m}\), it is ensured that

$$\begin{aligned} &L_{\mathcal {D}}\bigl(A(S)\bigr) \leq L_{\mathcal {D}} (h) + \varepsilon . \end{aligned}$$

The following theorem [65] describes an important characterization of nonuniform learnability.

Theorem 5

Let \({\mathscr{F}}\) be a hypothesis set that can be written as a countable union of individual hypothesis sets:

$$\begin{aligned} &{\mathscr{F}}= \bigcup_{n \in {\mathbb{N}}} {\mathscr{F}}_{n}, \end{aligned}$$

where each \({\mathscr{F}}_{n}\) exhibits a uniform convergence property. Then, \({\mathscr{F}}\) is nonuniformly learnable.

Returning to our specific case, we can show that our hypothesis set \({\mathscr{F}}_{T}^{(\nu )}\) defined in (8.1) is nonuniformly learnable. To demonstrate this, we will introduce a sequence of hypothesis sets.

$$\begin{aligned} &{\mathscr{F}}_{T}^{(\nu )}(n) \equiv \biggl\{ \vec{\xi} \longmapsto \int _{I} w_{0}(x) u(T,x;w_{1}, \cdot ,\nu ) \,\mathrm{d}x \Big| \Vert w_{0} \Vert _{L_{2}(I)}, \Vert w_{1} \Vert _{L_{2}({\mathcal {H}}_{T})} \leq n \biggr\} \\ & \quad (n=1,2,\ldots ). \end{aligned}$$
(8.3)

Evidently, these sets satisfy the following relations:

$$\begin{aligned} & {\mathscr{F}}_{T}^{(\nu )}(1) \subset {\mathscr{F}}_{T}^{(\nu )}(2) \subset \ldots , \\ &{\mathscr{F}}_{T}^{(\nu )} = \bigcup _{n=1}^{\infty }{\mathscr{F}}_{T}^{( \nu )}(n) \quad \forall T,\nu >0. \end{aligned}$$
(8.4)

Next, we demonstrate that each set \({\mathscr{F}}_{T}^{(\nu )}(n)\) in (8.4) satisfies the uniform convergence property. We also use the notation

$$\begin{aligned} {\mathscr{L}}(n) &\equiv \tilde{l} \circ {\mathscr{F}}_{T}^{(\nu )}(n) \\ & = \biggl\{ (\vec{\xi},y) \longmapsto \tilde{l} \biggl( \int w_{0}(x) u(T,x;w_{1}, \vec{\xi},\nu ) \, \mathrm{d}x, y \biggr) \Big| \Vert w_{0} \Vert _{L_{2}(I)}, \Vert w_{1} \Vert _{L_{2}({\mathcal {H}}_{T})} \leq n \biggr\} \\ & \quad (n=1,2,\ldots ). \end{aligned}$$

To assess the uniform convergence property of \({\mathscr{F}}_{T}^{(\nu )}(n)\) with respect to the loss function \(\tilde{l}(\cdot )\), it is necessary and sufficient to check that the set \({\mathscr{L}}(n)\) is a Glivenko–Cantelli class [65].

Hereafter, we denote a probability space as \((\Omega ,{\mathscr{A}},P)\), where Ω is the sample space, \({\mathscr{A}}\) is a σ-algebra on Ω, and P is a probability measure on \({\mathscr{A}}\). We also denote the corresponding empirical measure as \(P_{m}(A) =\frac{1}{m} \sum_{j=1}^{m} \delta _{\vec{\xi}_{j}}(A)\) for a Borel set A, with \(\delta (\cdot )\) being the Dirac measure, and define

$$\begin{aligned} &Pf= \int _{\Omega }f \,\mathrm{d}P,\qquad \Vert P_{m}-P \Vert _{\mathscr{F}} \equiv \sup_{f \in {\mathscr{F}}} \vert P_{m}f-Pf \vert . \end{aligned}$$

Definition 4

Given a probability space \((\Omega ,{\mathscr{A}},P)\) and a set of integrable real-valued functions \({\mathscr{F}}\), we say that \({\mathscr{F}}\) is a Glivenko–Cantelli class for P if and only if

$$\begin{aligned} & \Vert P_{m}-P \Vert _{\mathscr{F}} \rightarrow 0\quad (m \rightarrow +\infty ) \end{aligned}$$

holds almost uniformly.
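As a concrete, textbook illustration of Definition 4 (not tied to the hypothesis set of this paper), take \({\mathscr{F}}=\{\chi _{[0,t]} \mid t \in [0,1]\}\) and P the uniform distribution on \([0,1]\); then \(P_{m}f\) is the empirical distribution function evaluated at t, and \(\Vert P_{m}-P \Vert _{\mathscr{F}}=\sup_{t} \vert F_{m}(t)-t \vert \rightarrow 0\) as \(m \rightarrow +\infty \). The short Python sketch below checks this numerically.

```python
import numpy as np

# Numerical check of the classical Glivenko-Cantelli example: the sup-distance
# between the empirical CDF of a uniform sample and the true CDF F(t) = t.
rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 1001)
for m in [10, 100, 1_000, 10_000, 100_000]:
    sample = np.sort(rng.uniform(size=m))
    emp_cdf = np.searchsorted(sample, grid, side="right") / m
    gap = np.max(np.abs(emp_cdf - grid))
    print(f"m = {m:6d},  sup_t |F_m(t) - t| ~ {gap:.4f}")
```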

In the case of binary classification, being a Glivenko–Cantelli class is equivalent to satisfying the uniform convergence property [65]. Moreover, the following theorem is known [16]. Here, \(I^{d} = [0,1]^{d}\) with \(d \in {\mathbb{N}}\).

Theorem 6

Let \(K>0\) and \({\mathscr{F}}_{1,K}(I^{d})\) be a set of the Lipschitz continuous functions on \(I^{d}\):

$$\begin{aligned} &{\mathscr{F}}_{1,K}\bigl(I^{d}\bigr) = \biggl\{ f \in C \bigl(I^{d}\bigr) \Big| \sup_{x} \bigl\vert f(x) \bigr\vert + \sup_{x \ne y} \frac{ \vert f(x)-f(y) \vert }{ \Vert x-y \Vert _{{\mathbb{R}}^{d}}} \leq K \biggr\} . \end{aligned}$$

Then, \({\mathscr{F}}_{1,K}(I^{d})\) is a Glivenko–Cantelli class for any probability measure P on \(I^{d}\).

Thus, if we impose Lipschitz continuity on the loss function, we can guarantee that the set \({\mathscr{L}}(n)\) becomes a Glivenko–Cantelli class for each n.

Theorem 7

Suppose that the assumptions of Theorem 3 hold. Let \(T>0\) be arbitrary, \({\mathcal {Z}} = {\mathcal {X}} \times \{\pm 1\}\) with \({\mathcal {X}} \subset {\mathbb{R}}^{J}\) being compact, and let a loss function \(l(\cdot ):{\mathcal {Z}} \times L_{2}(I)\times L_{2}({\mathcal {H}}_{T}) \rightarrow {\mathbb{R}}\) be of the form

$$\begin{aligned} &l\bigl((\vec{\xi},y),w_{0},w_{1}\bigr) = \tilde{l} \biggl( \int _{I} w_{0}(x)u(T,x;w_{1}, \vec{\xi},\nu ) \,\mathrm{d}x,y \biggr) \end{aligned}$$

with a function \(\tilde{l}(a,y):{\mathbb{R}}\times {\mathbb{R}}\rightarrow {\mathbb{R}}\) being Lipschitz continuous with respect to \((a,y)\) with Lipschitz coefficient L. Then, the set \({\mathscr{L}}(n)\) is a Glivenko–Cantelli class.

Proof

For brevity, we denote \(\|w_{0}\|_{L_{2}(I)}\) and \(\|w_{1}\|_{L_{2}({\mathcal {H}}_{T})}\) by \(|w_{0}|\) and \(|w_{1}|\), respectively. Without loss of generality, we can assume that \({\mathcal {X}} =I^{J}\). Under the assumptions of the theorem, we have

$$\begin{aligned} & \bigl\vert l\bigl((\vec{\xi}_{1},y_{1}),w_{0},w_{1} \bigr) - l\bigl((\vec{\xi}_{2},y_{2}),w_{0},w_{1} \bigr) \bigr\vert \\ & \quad \leq L \biggl\{ \vert y_{1}-y_{2} \vert + \biggl\vert \int _{I} w_{0}(x) u(t,x;w_{1}, \vec{\xi}_{1},\nu ) - \int _{I} w_{0}(x) u(t,x;w_{1}, \vec{\xi}_{2}, \nu ) \biggr\vert \biggr\} . \end{aligned}$$
(8.5)

In order to verify the continuity of \(u(T,x;w_{1},\vec{\xi},\nu )\) with respect to ξ⃗, we appeal to a standard energy estimate. Let us denote \(u(t,x;w_{1},\vec{\xi}_{i},\nu )\) (\(i=1,2\)) by \(u_{i}(t,x)\) and \(\tilde{u}(t,x) \equiv u_{1}(t,x)-u_{2}(t,x)\). Then, we have

$$\begin{aligned} &\frac{1}{2} \frac{\mathrm{d}}{\mathrm{d}t} \bigl\vert \tilde{u}(t,\cdot ) \bigr\vert ^{2} + \frac{\nu}{2} \bigl\vert \nabla \tilde{u} (t,\cdot ) \bigr\vert ^{2} \leq L \bigl\Vert w_{1}(t,\cdot ,\cdot ) \bigr\Vert _{L_{2}(I\times I)} \bigl\vert \tilde{u}(t, \cdot ) \bigr\vert ^{2}, \quad t \in (0,T]. \end{aligned}$$

By the Gronwall’s inequality, we obtain [74]

$$\begin{aligned} & \bigl\vert \tilde{u}(T,\cdot ) \bigr\vert ^{2} \leq \bigl\vert \tilde{u}(0,\cdot ) \bigr\vert ^{2} e^{n\sqrt{T}}. \end{aligned}$$
(8.6)

By noting \(|u(0;\vec{\xi}_{i})|^{2} =\frac{1}{J} \|\vec{\xi}_{i}\|_{{\mathbb{R}}^{J}}^{2}\), and consequently, \(| \tilde{u}(0) |^{2} = \frac{1}{J}\|\vec{\xi}_{1}-\vec{\xi}_{2} \|_{{\mathbb{R}}^{J}}^{2}\), and combining (8.5) and (8.6), we obtain

$$\begin{aligned} & \bigl\vert l(\vec{\xi}_{1},y_{1})-l(\vec{ \xi}_{2},y_{2}) \bigr\vert \leq L \biggl( 1+ \frac{nLe^{\frac{n\sqrt{T}}{2}}}{\sqrt{J}} \biggr) \bigl( \vert y_{1}-y_{2} \vert + \Vert \vec{\xi}_{1}-\vec{\xi}_{2} \Vert _{{\mathbb{R}}^{J}} \bigr). \end{aligned}$$

By Theorem 6, this implies that \({\mathscr{L}}(n)\) forms a Glivenko–Cantelli class. □

Theorem 7 implies that our model achieves the uniform convergence property of the hypothesis set under the boundedness of \(\|w_{0}\|_{L_{2}(I)}\) and \(\|w_{1}\|_{L_{2}({\mathcal {H}}_{T})}\) and the compactness of the input space \({\mathcal {X}}\) on which \({\mathcal {D}}\) is defined. Thus, for each \(n \in {\mathbb{N}}\), we establish that \({\mathscr{F}}_{T}^{(\nu )}(n)\) has the uniform convergence property with respect to this \(l(\cdot )\) and \({\mathcal {D}}\).

Before introducing another theorem, let us present a known lemma [80] concerning the covering number \(N(\cdot )\) and the bracketing number \(N_{[]}(\cdot )\). We refer the reader to other references for the definitions of these quantities (see, for instance, [16, 25, 80]).

Lemma 6

Let \({\mathcal {F}} = \{f_{t} |t \in {\mathcal {T}}\}\) be a class of functions defined on a set \({\mathcal {X}}\) satisfying Lipschitz continuity in the index parameter:

$$\begin{aligned} & \bigl\vert f_{s}(x)-f_{t}(x) \bigr\vert \leq d(s,t)F(x) \quad \forall x \in { \mathcal {X}}, \forall s,t \in {\mathcal {T}}, \end{aligned}$$
(8.7)

for some fixed function \(F(\cdot )\), where \(d(\cdot ,\cdot )\) is a metric in the index space \({\mathcal {T}}\). Then, for any norm \(\|\cdot \|\), \(N_{[]} ( 2\varepsilon \|F\|,{\mathcal {F}} ,\|\cdot \| ) \leq N( \varepsilon ,{\mathcal {T}},d)\).

We also introduce the following lemma concerning the metric entropy of a set of functions.

Lemma 7

Let \(B_{M} \equiv \{u \in H^{1}(I) | \|u\|_{H^{1}(I)} \leq M\}\). Then, \(B_{M}\) is relatively compact in \(L_{2}(I)\) and satisfies

$$\begin{aligned} &\log N\bigl(\varepsilon ,B_{M},L_{2}(I)\bigr) \leq \frac{KM}{\varepsilon} \quad \forall \varepsilon > 0, \end{aligned}$$

where K is a constant.

Proof

This lemma can be proved if we take \(p=q=2\) in Theorem 4.3.36 of [25] and note the inclusion of function spaces \(H^{1}(I) \subset B_{2 \infty}^{1,W}(I)\), where \(B_{2 \infty}^{1,W}(I)\) is the Besov space defined in [25]. □

In the optimization procedure, it is often the case that \(w_{0}\) is determined depending on \(w_{1}\), and consequently on \(u(T,x;w_{1})\). Based on Lemmas 6 and 7, we can assert the following theorem.

Theorem 8

Under the assumptions of Theorem 7, suppose that \(\tilde{l}(a,y)\) is Lipschitz continuous with respect to its first argument a, and that \(w_{0}\) is determined as a functional of \(u(T,x;w_{1})\), namely \(w_{0} = w_{0}(u(T,x;w_{1}))\), which satisfies

$$\begin{aligned} & \bigl\Vert w_{0}\bigl(u(T,\cdot ;w_{1}) \bigr)-w_{0}\bigl(u\bigl(T,\cdot ;w_{1}^{\prime} \bigr)\bigr) \bigr\Vert _{L_{2}(I)} \leq L_{w} \bigl\Vert u(T,\cdot ;w_{1})- u\bigl(T,\cdot ;w_{1}^{\prime} \bigr) \bigr\Vert _{L_{2}(I)}, \end{aligned}$$

with some \(L_{w}>0\). Then, the set \({\mathscr{L}}(n)\) is a Glivenko–Cantelli class.

Proof

Let us simply denote \(w_{0}(x)=w_{0}(u(T,x;w_{1}))\) and \(w_{0}^{\prime }(x)=w_{0}(u(T,x;w_{1}^{\prime}))\). We first show that

$$\begin{aligned} & \biggl\vert \tilde{l} \biggl( \int _{I} w_{0}(x)u(T,x;w_{1}, \vec{\xi}, \nu ) \,\mathrm{d}x ,y \biggr) - \tilde{l} \biggl( \int _{I} w_{0}^{ \prime}(x)u \bigl(T,x;w_{1}^{\prime},\vec{\xi},\nu \bigr) \,\mathrm{d}x, y \biggr) \biggr\vert \\ & \quad \leq c_{T}^{(\nu )} \bigl\Vert u(T,\cdot ;w_{1},\vec{\xi},\nu ) -u\bigl(T, \cdot ;w_{1}^{\prime}, \vec{\xi},\nu \bigr) \bigr\Vert _{L_{2}(I)}, \end{aligned}$$
(8.8)

where \(c_{T}^{(\nu )}>0\) is some constant depending on T and ν. Here, we have used the assumption on \(w_{0}\) as well as the assumption \(\|w_{0}\|_{L_{2}(I)} \leq n\), and the boundedness of \(u(T,\cdot )\), which can be derived as follows.

Applying the standard energy estimate to (3.3) yields the following inequality:

$$\begin{aligned} &\frac{1}{2} \frac{\mathrm{d}}{\mathrm{d}t} \bigl\vert u(t,\cdot ;w_{1}, \vec{\xi},\nu ) \bigr\vert ^{2} + \frac{\nu}{2} \bigl\vert \nabla u(t,\cdot ;w_{1}, \vec{ \xi},\nu ) \bigr\vert ^{2} \\ & \quad \leq 2L \bigl\Vert w_{1}(t,\cdot ,\cdot ) \bigr\Vert _{L_{2}(I\times I)} \bigl\vert u(t) \bigr\vert ^{2} + \frac{1}{\nu} \bigl( 2L \bigl\Vert w_{1}(t,\cdot ,\cdot ) \bigr\Vert _{L_{2}(I\times I)} + \sqrt{2} c_{1} \bigr)^{2}. \end{aligned}$$

Here, \(c_{1}=\vert \phi (0) \vert ^{2}\). Together with Gronwall's inequality, we obtain [74]

$$\begin{aligned} \bigl\vert u(T,\cdot ;w_{1},\vec{\xi},\nu ) \bigr\vert ^{2} &\leq \biggl\{ \bigl\vert u(0) \bigr\vert ^{2} + \frac{2}{\nu} \int _{0}^{T} \bigl( 2L \bigl\Vert w_{1}( \tau ,\cdot ,\cdot ) \bigr\Vert _{L_{2}(I\times I)} + \sqrt{2}c_{1} \bigr)^{2} \,\mathrm{d}\tau \biggr\} \\ & \quad{}\times \exp \biggl( 4L \int _{0}^{T} \bigl\Vert w_{1}( \tau ,\cdot ,\cdot ) \bigr\Vert _{L_{2}(I \times I)} \,\mathrm{d}\tau \biggr). \end{aligned}$$
(8.9)

By noting \(|u(0;\vec{\xi})|^{2} =\frac{1}{J} \|\vec{\xi}\|_{{\mathbb{R}}^{J}}^{2}\) and combining this with (8.9), we obtain the following:

$$\begin{aligned} & \bigl\vert u(T,\cdot ;w_{1},\vec{\xi},\nu ) \bigr\vert ^{2} \leq \biggl\{ \frac{1}{J} \Vert \vec{\xi} \Vert _{{\mathbb{R}}^{J}}^{2} + \frac{2}{\nu} ( 2nL + c_{1} \sqrt{2T} )^{2} \biggr\} \exp \bigl( 4nL T^{\frac{1}{2}} \bigr). \end{aligned}$$

Moreover, we can estimate the right-hand side of (8.8) (we omit the procedure of this estimate, for it is quite similar to the deduction of (8.6)). This, combined with (8.8), implies that the assumption of Lemma 6 is satisfied if we regard \({\mathscr{L}}(n)\) as a set of functions indexed by the set of functions of the form \(u(T,\cdot ;w_{1}) \in H^{1}(I)\). Indeed, in this case, (8.7) holds, where \(d(\cdot ,\cdot )\) is the \(L_{2}(I)\)-distance and \(F(x)\) is a constant. Thus, Lemma 6 implies \(N_{[]} ( 2\varepsilon c_{T}^{(\nu )},{\mathscr{L}}(n) ,\vert \cdot \vert ) \leq N(\varepsilon ,{\mathcal {B}}_{H^{1}(I)}^{M},\Vert \cdot \Vert _{L_{2}(I)})\), which is finite for every \(\varepsilon >0\) by Lemma 7, where \({\mathcal {B}}_{H^{1}(I)}^{M}\) denotes a ball in \(H^{1}(I)\) with radius M. Because a finite bracketing number implies that the function space is a Glivenko–Cantelli class, this completes the proof. □

Remark 11

Note that in the proof of Theorem 8, the estimate above depends on T and ν, which implies that the generalization performance may depend on them. As a special case, when \(\phi (\cdot )\) is bounded, we obtain the following:

$$\begin{aligned} & \bigl\vert u(T,\cdot ;w_{1},\vec{\xi},\nu ) \bigr\vert ^{2} \leq e^{-\nu T} \Vert \vec{\xi} \Vert _{{\mathbb{R}}^{J}}^{2} +\nu ^{-1} \bigl( 1-e^{-\nu T} \bigr), \end{aligned}$$

which suggests that increasing T may lead to a smaller covering number.

We have seen that, under some conditions, \({\mathscr{L}}(n)\) is a Glivenko–Cantelli class, and consequently, \({\mathscr{F}}_{T}^{(\nu )}(n)\) has a finite VC-dimension and a finite sample complexity, say \(d_{n}\) and \(m^{UC}_{{\mathscr{F}}_{T}^{(\nu )}(n)}(\varepsilon ,\delta )\), respectively. To examine the nonuniform learnability of \({\mathscr{F}}_{T}^{(\nu )}\), let us consider

$$\begin{aligned} &\varepsilon _{n}(m,\delta ) = \min \bigl\{ \varepsilon \in (0,1) : m^{UC}_{{\mathscr{F}}_{T}^{(\nu )}(n)}(\varepsilon ,\delta ) \leq m \bigr\} . \end{aligned}$$

Then, for each \(n \in {\mathbb{N}}\), with probability at least \(1-\delta \) over the choice of \(S \sim {\mathcal {D}}^{m}\), it holds that

$$\begin{aligned} & \bigl\vert L_{D}(h)-L_{S}(h) \bigr\vert \leq \varepsilon _{n}(m,\delta ) \quad \forall h \in {\mathscr{F}}_{T}^{(\nu )}(n). \end{aligned}$$

In addition, if we consider a weight function \(w:{\mathbb{N}} \rightarrow [0,1]\) that satisfies \(\sum_{n=1}^{\infty }w(n) \leq 1\), we obtain the following approach, called structural risk minimization (SRM) (Algorithm 1) [65]:

Algorithm 1 (SRM scheme)
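For concreteness, the following is a schematic Python sketch of the SRM rule of Algorithm 1, following [65]; the inner ERM routine and the uniform-convergence bound \(\varepsilon _{n}(m,\delta )\) are abstract placeholders that would have to be supplied for the concrete sets \({\mathscr{F}}_{T}^{(\nu )}(n)\) of (8.3).

```python
import math

# Schematic sketch of the SRM rule (cf. Algorithm 1): among the candidate
# classes F(1) c F(2) c ..., choose the hypothesis minimizing the empirical
# risk plus the uniform-convergence bound of its class, with weights
# w(n) = 6 / (pi^2 n^2) so that sum_n w(n) <= 1.
def srm(train_erm, eps_n, S, delta, n_max=50):
    """train_erm(n, S) -> (hypothesis, empirical risk) within F(n);
    eps_n(n, m, delta) -> uniform-convergence bound for F(n)."""
    m = len(S)
    best_h, best_bound = None, float("inf")
    for n in range(1, n_max + 1):
        w_n = 6.0 / (math.pi ** 2 * n ** 2)
        h, emp_risk = train_erm(n, S)
        bound = emp_risk + eps_n(n, m, w_n * delta)
        if bound < best_bound:
            best_h, best_bound = h, bound
    return best_h, best_bound
```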

Theorem 9

Let \({\mathscr{F}}\) be a hypothesis class such that \({\mathscr{F}} = \bigcup_{n} {\mathscr{F}}_{n}\), where each \({\mathscr{F}}_{n}\) has the uniform convergence property with sample complexity \(m_{{\mathscr{F}}_{n}}^{UC}\). Let \(w:{\mathbb{N}} \rightarrow [0,1]\) be defined as \(w(n)=6/(n^{2}\pi ^{2})\). Then, \({\mathscr{F}}\) is nonuniformly learnable using the SRM scheme at the rate given below, where \(n(h)\) denotes the minimal index n such that \(h \in {\mathscr{F}}_{n}\):

$$\begin{aligned} &m_{{\mathscr{F}}}^{NUC} (\varepsilon ,\delta ,h) \leq m_{{\mathscr{F}}_{n}}^{UC} \biggl( \frac{\varepsilon}{2}, \frac{6\delta}{(\pi n(h))^{2}} \biggr). \end{aligned}$$

Theorem 9 with \({\mathscr{F}}_{n}\) replaced by \(\tilde{l} \circ {\mathscr{F}}_{T}^{(\nu )}(n)\) guarantees that our PDE-based neural network has nonuniform learnability.

8.2 Numerical computation

Finally, we conducted numerical experiments to evaluate the performance of our model on practical datasets. Because the main focus of the present paper is the theoretical argument, these experiments serve only as a first check of the effectiveness of our model. In the following, we first describe the setting of our numerical experiments and then state the results.

8.2.1 Settings

In this experiment, we focused exclusively on binary classification. The proposed model was implemented in Python 3.7 on Windows Server 2019 (64-bit) with a 12th Gen Intel(R) Core(TM) i7-12700 processor (2.11 GHz) and 96.0 GB of RAM. We used the time step \(\triangle t = 5 \times 10^{-4}\) and a range of values for the numbers of temporal and spatial grid points, denoted by N and L, respectively. At the output layer, we employed a logistic regression scheme with \(L_{1}\) regularization using statsmodels [71]. The optimization of \(w_{1}\) in our model amounts to optimizing the values \(w_{1}({i_{1},i_{2},i_{3}})\) (\(i_{1}=1,2,\ldots ,N\), \(i_{2}, i_{3}=1,2, \ldots ,L\)), which constitute a discretized version of \(w_{1}(t,x,y)\). The optimization was conducted using a genetic algorithm with the deap library [13] in Python; a schematic sketch of this pipeline is given below.
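A heavily simplified sketch of this training pipeline follows. It is not the authors' implementation: the function `forward_terminal_state` is a hypothetical placeholder for the discretized forward problem (mapping the flattened array of values \(w_{1}(i_{1},i_{2},i_{3})\) and an input ξ⃗ to the terminal grid values of the solution), the grid sizes are illustrative, treating the terminal grid values as features of the \(L_{1}\)-regularized logistic output layer is one plausible reading of the setting above, and the genetic-algorithm objective (here, the training error) may differ from the one actually used.

```python
import random

import numpy as np
import statsmodels.api as sm
from deap import algorithms, base, creator, tools

# Schematic sketch only -- NOT the authors' implementation.
# `forward_terminal_state` is a hypothetical stand-in for the discretized
# forward problem: it should map a flattened array of the weight values
# w1(i1, i2, i3) (shape N_STEPS x L_GRID x L_GRID) and an input xi in R^J
# to the L_GRID terminal grid values of the solution at time T.
N_STEPS, L_GRID, J_DIM = 20, 16, 8      # illustrative sizes only


def forward_terminal_state(w1_flat, xi):
    raise NotImplementedError("stand-in for the discretized forward solver")


def fitness(individual, X_train, y_train):
    # Terminal grid values are used as features of an L1-regularized logistic
    # output layer (statsmodels); labels are assumed to be in {0, 1}.
    feats = np.array([forward_terminal_state(individual, xi) for xi in X_train])
    exog = sm.add_constant(feats)
    res = sm.Logit(y_train, exog).fit_regularized(method="l1", alpha=1.0, disp=False)
    preds = (res.predict(exog) > 0.5).astype(int)
    return (float(np.mean(preds != y_train)),)   # training error, to be minimized


def optimize_w1(X_train, y_train, ngen=5, pop_size=10):
    creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
    creator.create("Individual", list, fitness=creator.FitnessMin)
    toolbox = base.Toolbox()
    toolbox.register("attr", random.uniform, -1.0, 1.0)
    toolbox.register("individual", tools.initRepeat, creator.Individual,
                     toolbox.attr, n=N_STEPS * L_GRID * L_GRID)
    toolbox.register("population", tools.initRepeat, list, toolbox.individual)
    toolbox.register("evaluate", fitness, X_train=X_train, y_train=y_train)
    toolbox.register("mate", tools.cxTwoPoint)
    toolbox.register("mutate", tools.mutGaussian, mu=0.0, sigma=0.1, indpb=0.1)
    toolbox.register("select", tools.selTournament, tournsize=3)
    pop = toolbox.population(n=pop_size)
    algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.2, ngen=ngen, verbose=False)
    return tools.selBest(pop, k=1)[0]
```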

8.2.2 Datasets

Numerical simulations were conducted with the “adult income” [5] and “diabetes” [15] datasets, which are well-known benchmarks for binary classification.

The former dataset contains 121 attributes of adults together with their annual income; the task is to predict whether the income is larger than 50 thousand dollars (which corresponds to the label “1”) or not (“0”). The latter dataset contains eight attributes of human subjects and a binary label indicating whether each subject shows symptoms of diabetes.

Table 1 presents an overview of the datasets. For both datasets, we employed 70% of the data for training, and the remaining part was used to check the test accuracy.

Table 1 Overview of datasets

We applied a min–max scaler, which transforms the values of each attribute onto the interval \([0,1]\); a short preprocessing sketch is given below.
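For reference, a minimal preprocessing sketch under the setting above is as follows; the file name and label column are hypothetical placeholders, and scikit-learn is used here merely for convenience.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Sketch of the preprocessing described in the text: a 70/30 train/test split
# and min-max scaling of each attribute to [0,1].
# "diabetes.csv" and the column name "Outcome" are placeholders.
df = pd.read_csv("diabetes.csv")
X = df.drop(columns=["Outcome"]).to_numpy()
y = df["Outcome"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```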

8.2.3 Results of experiments

Tables 2 and 3 show the training and test accuracies and the area under the curve (AUC) (boldface indicates the largest value for each indicator) for a range of values of T and of the numbers of discretization points in the spatial and temporal directions. The performance of the proposed method was comparable to that of the existing methods (Random Forest Classifier (RFC), Support Vector Classifier (SVC) with RBF kernel, XGBoost, and LightGBM) in terms of test accuracy and AUC. Note that for the existing methods, we tuned the hyperparameters by using cross-validation and grid search.

Table 2 Results of “adult income” dataset
Table 3 Results of “diabetes” dataset

The numbers of generations and the population sizes in the genetic algorithm are 5 and 10, respectively, for the “adult income” dataset, and 10 and 200 for the “diabetes” dataset. This is due to the fact that the “adult income” dataset is larger and requires a much longer computation time. From Tables 2 and 3, we observe that the performance of our model varies depending on the value of T.

In summary, the considerations in this section lead to the following observations:

  1. (i)

    Although our model has an infinite VC-dimension, it is still nonuniformly learnable under some assumptions on the underlying distribution behind the dataset. This property is also observed in some well-known machine learning algorithms, such as support vector machines (SVMs) with kernels.

  2. (ii)

    By tuning its parameters, we can control the generalization performance of our model. On the one hand, optimal parameter values yield a model with lower generalization error. On the other hand, this enlarges the search space during optimization, raising the concern that we might not attain a (sub-)optimal solution within a realistic computation time. Therefore, in future work, we will continue to search for an effective approach to optimizing our model.

9 Conclusion

This study has established the universal approximation property of our PDE-based neural network: any continuous function on a compact set in \({\mathbb{R}}^{J}\) can be approximated by the output of the network with arbitrary precision.

We have also discussed the learnability of our model. Moreover, we implemented the model and performed numerical experiments, in which it showed performance comparable to that of existing models such as RFC, SVC, LightGBM, and XGBoost. It was also shown that the generalization performance can be adjusted through some parameters of the model. The exploration of more effective optimization procedures is left for future work.

Future work will consider the limit as ν tends to zero, in which case the proposed model could be regarded as the continuous limit of a usual neural network, or as one with an artificial diffusion term. Although we observed weak convergence of our solution, we should appeal to the theory of singular perturbations to take into account the thin boundary layer induced by the boundary condition.

There is room for improvement in the optimization procedure. We plan to explore Bayesian optimization approaches, which we have already attempted for ODE-based neural networks [33]. In this connection, it is also important to discuss the PAC-Bayes perspective of the proposed model.

Additionally, we intend to extend our PDE-based neural network to multidimensional Euclidean spaces. As stated in Remark 5 at the end of Sect. 4, this is necessary when considering a GNN in which the elements are treated in matrix form.

Availability of data and materials

The datasets generated and/or analyzed during the current study are available in: (i) UCI Machine Learning Repository, [https://doi.org/10.24432/C5XW20], (ii) Kaggle repository, [https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset].

References

  1. Aizawa, Y., Kimura, M.: Universal approximation properties for ODENet and ResNet. CoRR (2021). arXiv:2101.10229

  2. Annunziato, M., Borzì, A.: A Fokker–Planck control framework for multidimensional, stochastic processes. J. Comput. Appl. Math. 237, 487–507 (2013). https://doi.org/10.1016/j.cam.2012.06.019


  3. Baker, G.A., Bramble, J.H., Thomee, V.: Single step Galerkin approximations for parabolic problems. Math. Comput. 31, 818–847 (1977). https://doi.org/10.2307/2006116


  4. Barbu, V.: Analysis and Control of Nonlinear Infinite Dimensional Systems. Academic Press, London (2012)


  5. Becker, B., Kohavi, R.: Adult income dataset. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20

  6. Baum, E.B., Haussler, D.: What size net gives valid generalization? Neural Comput. 1, 151–160 (1989). https://doi.org/10.1162/neco.1989.1.1.151


  7. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Singapore (2006)


  8. Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. (1998). https://doi.org/10.1023/A:1009715923555


  9. Chamberlain, B.P., et al.: GRAND: graph neural diffusion. In: Proc. ICML 2021 (2021)


  10. Chen, R.T.Q., et al.: Neural ordinary differential equations. Adv. Neural Inf. Process. Syst. 31, 6572–6583 (2018)


  11. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2, 303–314 (1989). https://doi.org/10.1007/BF02551274


  12. Dautray, R., Lions, L.J.: Mathematical Analysis and Numerical Methods for Science and Technology, vol. 5. Springer, Berlin (1991)


  13. Deap (2023). https://deap.readthedocs.io/en/master/

  14. DeVore, R., Hanin, B., Petrova, G.: Neural network approximation. Acta Numer. 30, 327–444 (2021). https://doi.org/10.1017/S0962492921000052


  15. Diabetes dataset: Kaggle (2020). https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset

  16. Dudley, R.M.: Uniform Central Limit Theorems. Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge (1999). https://doi.org/10.1017/CBO9780511665622


  17. Dupont, E., Doucet, A., Teh, Y.W.: Augmented neural ODEs. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Red Hook (2019)


  18. Esteve-Yagüe, C., et al.: Large-time asymptotics in deep learning (2021). https://hal.archives-ouvertes.fr/hal-02912516

  19. Esteve-Yagüe, C., Geshkovski, B.: Sparse approximation in learning via neural ODEs. (2021). arXiv:2102.13566

  20. Fernández-Cara, E., et al.: Null controllability of linear heat and wave equations with nonlocal spatial terms. SIAM J. Control Optim. 54, 2009–2019 (2016). https://doi.org/10.1137/15M1044291


  21. Fujita, H., Mizutani, A.: On the finite element method for parabolic equations, I; approximation of holomorphic semi-groups. J. Math. Soc. Jpn. 28, 749–771 (1976). https://doi.org/10.2969/jmsj/02840749


  22. Funahashi, K.: On the approximate realization of continuous mappings by neural networks. Neural Netw. 2, 183–192 (1989). https://doi.org/10.1016/0893-6080(89)90003-8


  23. Funahashi, K., Nakamura, Y.: Neural Networks, Approximation Theory, and Dynamical Systems (Structure and Bifurcation of Dynamical Systems), Suuri-kaiseki kenkyuujo Kokyuroku, 18–37 (1992). http://hdl.handle.net/2433/82914

  24. Geshkovski, B., Zuazua, E.: Turnpike in optimal control of PDEs, ResNets, and beyond. Acta Numer. 31, 135–263 (2022). https://doi.org/10.1017/S0962492922000046


  25. Giné, E., Nickl, R.: Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge (2015). https://doi.org/10.1017/CBO9781107337862

  26. González-Burgos, M., de Teresa, L.: Some results on controllability for linear and nonlinear heat equations in unbounded domains. Adv. Differ. Equ. 12, 1201–1240 (2007). https://doi.org/10.57262/ade/1355867413

  27. Haber, E., Ruthotto, L.: Stable architectures for deep neural networks. Inverse Probl. 34, 014004 (2017). https://doi.org/10.1088/1361-6420/aa9a90

  28. E, W., Han, J., Li, Q.: A mean-field optimal control formulation of deep learning. Res. Math. Sci. 6, 10 (2019). https://doi.org/10.1007/s40687-018-0172-y

  29. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. IEEE Comput. Soc., Los Alamitos (2016). https://doi.org/10.1109/CVPR.2016.90

  30. Hoff, D., Smoller, J.: Error bounds for finite-difference approximations for a class of nonlinear parabolic systems. Math. Comput. 45, 35–49 (1985). https://doi.org/10.2307/2008048

  31. Honda, H.: On continuous limit of neural network. In: Proc. of NOLTA 2020 (2020)

  32. Honda, H.: On a partial differential equation based neural network. IEICE Commun. Express 10, 137–143 (2021). https://doi.org/10.1587/comex.2020XBL0174

  33. Honda, H., et al.: An ODE-based neural network with Bayesian optimization. JSIAM Lett. 15, 101–104 (2023). https://doi.org/10.1587/comex.2020XBL0174

  34. Honda, H.: Approximating a multilayer neural network by an optimal control of a partial differential equation. Preprint

  35. Hornik, K., et al.: Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366 (1989). https://doi.org/10.1016/0893-6080(89)90020-8

  36. Irie, B., Miyake, S.: Capabilities of three-layered perceptrons. In: Proc. IEEE Int. Conf. on Neural Networks, pp. 641–648 (1988). https://doi.org/10.1109/ICNN.1988.23901

  37. Ito, S.: Fundamental solutions of parabolic differential equations and boundary value problems. Jpn. J. Math., Trans. Abstr. 27, 55–102 (1957). https://doi.org/10.4099/jjm1924.27.055

  38. Kac, V.G., Cheung, P.: Quantum Calculus. Springer, New York (2001)

  39. Kato, T.: Perturbation Theory for Linear Operators, 2nd edn. Springer, New York (1976)

  40. Koenderink, J.J.: The structure of images. Biol. Cybern. 50, 363–370 (1984). https://doi.org/10.1007/BF00336961

  41. Kolmogorov, A.N.: On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk SSSR 144, 679–681 (1957)

  42. Laakmann, F., Petersen, P.C.: Efficient approximation of solutions of parametric linear transport equations by ReLU DNNs. Adv. Comput. Math. 47, 11 (2021)

  43. Leshno, M., et al.: Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Netw. 6, 303–314 (1993). https://doi.org/10.1016/S0893-6080(05)80131-5

  44. Li, Q., et al.: Maximum principle based algorithms for deep learning. J. Mach. Learn. Res. 18, 5998–6026 (2017)

  45. Li, Q., Lin, T., Shen, Z.: Deep learning via dynamical systems: an approximation perspective. J. Eur. Math. Soc. (2019). https://doi.org/10.4171/jems/1221

  46. Li, Z., Shi, Z.: Deep residual learning and PDEs on manifold (2017). arXiv:1708.05115

  47. Lions, J.L.: Perturbations Singulières dans les Problèmes aux Limites et en Contrôle Optimal. Springer, Berlin (1973)

  48. Lions, J.L.: Exact controllability, stabilization and perturbations for distributed systems. SIAM Rev. 30, 1–68 (1988). https://doi.org/10.1137/1030001

  49. Lions, J.L., Magenes, E.: Non-homogeneous Boundary Values Problems and Applications I. Springer, Berlin (1972)

  50. Lions, P.L.: Une vision mathématique du Deep Learning (2018). https://www.college-de-france.fr/fr/agenda/seminaire/mathematiques-appliquees/une-vision-mathematique-du-deep-learning

  51. Lippmann, R.: An introduction to computing with neural nets. IEEE ASSP Mag. 4, 4–22 (1987). https://doi.org/10.1109/MASSP.1987.1165576

  52. Liu, H., Markowich, P.: Selection dynamics for deep neural networks. J. Differ. Equ. 269, 11540–11574 (2020). https://doi.org/10.1016/j.jde.2020.08.041

  53. Lohéac, J., Zuazua, E.: From averaged to simultaneous controllability. Ann. Fac. Sci. Toulouse, Math. 25, 785–828 (2016)

  54. Neal, R.M.: Bayesian Learning for Neural Networks. Springer, Berlin (1996)

  55. Nirenberg, L.: Topics in Nonlinear Functional Analysis. Am. Math. Soc., Providence (2001)

  56. Oono, K., Suzuki, T.: Graph neural networks exponentially lose expressive power for node classification (2020). https://api.semanticscholar.org/CorpusID:209994765

  57. Pachpatte, B.G., Ames, W.F.: Inequalities for Differential and Integral Equations. Academic Press, London (1997)

  58. Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. IEEE Trans. Pattern Anal. Mach. Intell. 12, 629–639 (1990). https://doi.org/10.1109/34.56205

  59. Rodriguez, I.D.J., Ames, A.D., Yue, Y.: Lyanet: a Lyapunov framework for training neural ODEs. CoRR (2022). arXiv:2202.02526

  60. Rosenblatt, F.: The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev. 65, 386–408 (1958)

  61. Ruiz-Balet, D., Zuazua, E.: Neural ODE control for classification, approximation and transport. SIAM Rev. 65, 735–773 (2023). https://doi.org/10.1137/21M1411433

  62. Rusch, T.K., et al.: Graph-coupled oscillator networks. CoRR (2022). arXiv:2202.02296

  63. Ruthotto, L., Haber, E.: Deep neural networks motivated by partial differential equations. J. Math. Imaging Vis. 62, 352–364 (2020). https://doi.org/10.1007/s10851-019-00903-1

  64. Ryu, S.U., Yagi, A.: Optimal control of Keller–Segel equations. J. Math. Anal. Appl. 256, 45–66 (2001)

  65. Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning. Cambridge University Press, Padstow Cornwall (2014)

  66. Shen, Z., Yang, H., Zhang, S.: Nonlinear approximation via compositions. CoRR (2019). arXiv:1902.10170

  67. Sonoda, S., Murata, N.: Double continuum limit of deep neural networks. In: Proc. of ICML 2017, Workshop on Principled Approaches to Deep Learning (2017)

  68. Sonoda, S., Murata, N.: Transport analysis of infinitely deep neural network. J. Mach. Learn. Res. 20, 1–52 (2019)

  69. Sontag, E., Sussmann, H.: Complete controllability of continuous-time recurrent neural networks. Syst. Control Lett. 30, 177–183 (1997). https://doi.org/10.1016/S0167-6911(97)00002-9

  70. Sprecher, D.A.: On the structure of continuous functions of several variables. Trans. Am. Math. Soc. 115, 340–355 (1965). https://doi.org/10.2307/1994273

  71. Statsmodels (2023). https://www.statsmodels.org/

  72. Stelzer, F., et al.: Deep neural networks using a single neuron: folded-in-time architecture using feedback-modulated delay loops. Nat. Commun. 12, 1–10 (2021). https://doi.org/10.1038/s41467-021-25427-4

  73. Tabuada, P., et al.: Universal approximation power of deep residual neural networks through the lens of control. IEEE Trans. Autom. Control 68, 2715–2728 (2023). https://doi.org/10.1109/TAC.2022.3190051

  74. Temam, R.: Infinite-Dimensional Dynamical Systems in Mechanics and Physics. Springer, New York (1997)

  75. Teshima, T., et al.: Coupling-based invertible neural networks are universal diffeomorphism approximators. CoRR (2020). arXiv:2006.11469

  76. Teshima, T., et al.: Universal approximation property of neural ordinary differential equations (2020). arXiv:2012.02414

  77. Thomée, V.: Galerkin Finite Element Methods for Parabolic Problems. Springer, Berlin (2006)

  78. Thorpe, M., van Gennip, Y.: Deep limits of residual neural networks. Res. Math. Sci. 10, 6 (2023). https://doi.org/10.1007/s40687-022-00370-y

  79. Trotter, H.F.: Approximation of semi-groups of operators. Pac. J. Math. 8, 887–919 (1958)

  80. Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer, New York (1996). https://doi.org/10.1007/978-1-4757-2545-2

  81. Vainikko, G.: Funktionalanalysis der Diskretisierungsmethoden. Teubner, Leipzig (1976)

  82. Weickert, J.: Anisotropic Diffusion in Image Processing (1998). https://www.mia.uni-saarland.de/weickert/Papers/book.pdf

  83. Weinan, E.: A proposal on machine learning via dynamical systems. Commun. Math. Stat. 5, 1–11 (2017). https://doi.org/10.1007/s40304-017-0103-z

  84. Williams, C.: Computing with infinite networks. In: Mozer, M., Jordan, M., Petsche, T. (eds.) Advances in Neural Information Processing Systems, vol. 9. MIT Press, Cambridge (1996)

  85. Yun, B.I.: A neural network approximation based on a parametric sigmoidal function. Mathematics 7, 262 (2019). https://www.mdpi.com/2227-7390/7/3/262

  86. Chen, Y., Pock, T.: Trainable nonlinear reaction diffusion: a flexible framework for fast and effective image restoration. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1256–1272 (2017). https://doi.org/10.1109/TPAMI.2016.2596743

  87. Zeidler, E.: Nonlinear Functional Analysis and Its Applications. Springer, New York (1986)

  88. Zhang, H., et al.: Approximation capabilities of neural ODEs and invertible residual networks. In: Daumé, H., Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 119, pp. 11086–11095 (2020)

Acknowledgements

We thank the anonymous reviewers, whose comments and suggestions greatly helped improve and clarify this manuscript. We are also grateful to Mamoru Miyazawa, who contributed to the numerical experiments in this study.

Funding

This work was supported by Toyo University Top Priority Research Program.

Author information

Authors and Affiliations

Authors

Contributions

The author is the sole contributor to this work. The author read and approved the final manuscript.

Corresponding author

Correspondence to Hirotada Honda.

Ethics declarations

Ethics approval and consent to participate

Ethics approval was not required for this study.

Competing interests

The author declares no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Summary of notations

In Table 4 below, we summarize the notations used in this paper that are not presented in Sect. 2.

Table 4 Notations of function spaces and operators

Appendix B: Proofs of existence

B.1 Proof of Theorem 1

Before proceeding to the proof, we recall the definition of the Galerkin approximation [12].

Definition 5

Let V be a separable Hilbert space and \(\{V_{m}\}_{m=1}^{\infty}\) be a family of finite-dimensional vector spaces satisfying assumptions (i) and (ii) below.

  (i) \(V_{m} \subset V\), \(\operatorname{dim}V_{m} < +\infty \).

  (ii) \(V_{m} \rightarrow V \) (\(m \rightarrow \infty \)) in the following sense: there exists a dense subspace of V, every element v of which has a corresponding sequence \(\{v_{m}\}_{m=1}^{\infty }\), with \(v_{m} \in V_{m}\) for each m, satisfying \(\|v_{m}-v\|_{V} \rightarrow 0\) (\(m \rightarrow +\infty \)).

Then, each space \(V_{m}\) (\(m=1,2,\ldots \)) is called the Galerkin approximation of order m of V.
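
As a concrete illustration of Definition 5 (not needed for the proof), the spaces spanned by the first m Dirichlet sine modes on \(I=(0,1)\) form such a family for \(V=H^{1}_{0}(I)\). The following minimal Python sketch assumes this particular basis and a sample target function v, neither of which is taken from the paper, and only demonstrates numerically that the \(L_{2}\) projection error decreases as m grows; the analogous statement in the \(H^{1}\) norm also holds for this family on \(H^{1}_{0}(I)\).

```python
# Illustrative sketch of Definition 5: V_m = span{ sqrt(2) sin(j*pi*x) : j = 1..m } on I = (0, 1).
# The basis and the sample function v below are assumptions made for this example only.
import numpy as np

x = np.linspace(0.0, 1.0, 2001)
dx = x[1] - x[0]
wq = np.full(x.size, dx); wq[0] *= 0.5; wq[-1] *= 0.5   # trapezoidal quadrature weights on I

v = x * (1.0 - x) * np.exp(x)                           # a sample element of H^1_0(I)

for m in (2, 4, 8, 16):
    basis = np.array([np.sqrt(2.0) * np.sin(j * np.pi * x) for j in range(1, m + 1)])
    coeff = basis @ (v * wq)                            # L2-orthogonal projection coefficients
    v_m = coeff @ basis                                 # projection of v onto V_m
    err = np.sqrt(np.sum((v - v_m) ** 2 * wq))
    print(f"m = {m:2d},  ||v_m - v||_(L2) ~ {err:.2e}")
```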

Now, we prove Theorem 1. First, let us introduce a space

$$\begin{aligned} &\mathfrak{W}(T) \equiv \biggl\{ u \Big| u \in L_{2} \bigl(0,T;H^{1}(I)\bigr) , \frac{\mathrm{d}u}{\mathrm{d}t} \in L_{2}\bigl(0,T;H^{-1}(I)\bigr) \biggr\} . \end{aligned}$$

We note that (see [12], Chapter XVIII, Theorem 1):

$$\begin{aligned} &\mathfrak{W}(T) \subset C\bigl(0,T;L_{2}(I)\bigr), \end{aligned}$$
(B.1)

holds. We shall seek \(T_{u_{0}}>0\) and \(v \in \mathfrak{W}(T_{u_{0}})\) that solve (3.3) in the following sense:

$$\begin{aligned} \textstyle\begin{cases} \frac{\mathrm{d}}{\mathrm{d}t} (v(\cdot ) ,w ) + \sigma (v(\cdot ) ,w ) \\ \quad = ( \phi ( \int _{I} w_{1}(t,x,y)v(t,y) \,\mathrm{d}y + \int _{I} w_{1}(t,x,y) \,\mathrm{d}y ), w ) \\ \qquad \text{on } (0,T_{u_{0}}), \\ v(0)=\tilde{u}_{0} \quad \text{on } I, \end{cases}\displaystyle \end{aligned}$$
(B.2)

in the sense of \(( C_{0}^{\infty}(0,T) )^{\prime}\) for all \(w \in H^{1}(I)\). Note that, owing to (B.1), the initial condition in the second line of (B.2) is meaningful. We prove the theorem in the following steps [12, 64]. First, assuming the temporally local solvability of the problem, we prove the uniqueness of the local solution. Second, we prove the existence of a local solution up to a certain time \(T_{u_{0}}\). Let us assume that we have two temporally local solutions to (B.2) on a time interval \([0,T^{*}]\), say \(v^{(1)}\) and \(v^{(2)}\), both belonging to the space mentioned in Theorem 1 and the subsequent Remark 1.

We introduce the notation \(\tilde{v} \equiv v^{(1)}-v^{(2)}\), which satisfies:

$$\begin{aligned} \textstyle\begin{cases} \frac{\mathrm{d}}{\mathrm{d}t} (\tilde{v}(\cdot ) ,w ) + \sigma (\tilde{v}(\cdot ),w ) = ( \Phi ( \cdot ),w ), \\ \tilde{v}(0)=0, \end{cases}\displaystyle \end{aligned}$$
(B.3)

where

$$\begin{aligned} \Phi (t,x)& \equiv \phi \biggl( \int _{I} w_{1}(t,x,y)v^{(1)}(t,y) \, { \mathrm{d}}y + \int _{I} w_{1}(t,x,y) \,\mathrm{d}y \biggr) \\ & \quad{}- \phi \biggl( \int _{I} w_{1}(t,x,y)v^{(2)}(t,y) \,\mathrm{d}y + \int _{I} w_{1}(t,x,y) \,\mathrm{d}y \biggr). \end{aligned}$$

Replacing w with \(\tilde{v}(t,\cdot )\) in (B.3) and applying the Schwarz inequality, we observe:

$$\begin{aligned} &\frac{\mathrm{d}}{\mathrm{d} t} \bigl\vert \tilde{v}(t) \bigr\vert ^{2} + \nu \bigl\vert \tilde{v}_{x} (t) \bigr\vert ^{2} \leq L \bigl\Vert w_{1}(t,\cdot , \cdot ) \bigr\Vert _{L_{2}(I\times I)} \bigl\vert \tilde{v}(t) \bigr\vert ^{2}, \end{aligned}$$

where \(L>0\) is the Lipschitz constant of \(\phi (\cdot )\). This, together with Gronwall's inequality [57] and the fact that \(\tilde{v} |_{t=0}=0\), yields

$$\begin{aligned} &\tilde{v}(t) \equiv 0 \quad \forall t \in \bigl(0,T^{*}\bigr), \end{aligned}$$

which implies the uniqueness of the solution.
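
For the reader's convenience, we record the differential form of Gronwall's inequality used here and below (see [57]): for a nonnegative absolutely continuous function y and nonnegative integrable functions a, b on \((0,T)\),

$$\begin{aligned} &y^{\prime}(t) \leq a(t)y(t)+b(t) \quad \text{a.e. on } (0,T) \quad \Longrightarrow \quad y(t) \leq \biggl( y(0)+ \int _{0}^{t} b(s) \,\mathrm{d}s \biggr) \exp \biggl( \int _{0}^{t} a(s) \,\mathrm{d}s \biggr) \quad \forall t \in [0,T]. \end{aligned}$$

In the uniqueness argument above it is applied with \(y=|\tilde{v}|^{2}\), \(b \equiv 0\), and \(y(0)=0\), which forces \(y \equiv 0\).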

Next, we prove the existence of a local solution. Let \(\{V_{m}\}_{m=1}^{\infty}\) be an increasing family of \(d_{m}\)-dimensional subspaces of \(H^{1}(I)\) such that each \(v\in H^{1}(I)\) has an approximating sequence \(\{v^{(m)}\}_{m=1}^{\infty}\) with \(v^{(m)} \in V_{m}\) for each m and \(\|v^{(m)}-v\|_{H^{1}(I)}\rightarrow 0\) as \(m \rightarrow \infty \). Because \(\{V_{m}\}\) is a Galerkin approximation of \(L_{2}(I)\) as well, we have a sequence \(\{\tilde{u}_{0m}\}_{m=1}^{\infty}\) such that

$$\begin{aligned} &\tilde{u}_{0m} \in V_{m}, \\ &\tilde{u}_{0m} \rightarrow \tilde{u}_{0} \quad \text{in } L_{2}(I). \end{aligned}$$

Let \(\{W_{jm}\}_{j=1}^{d_{m}}\) be a basis of \(V_{m}\). We seek \(v^{(m)}\) and \(\tilde{u}_{0m}\) in the form of linear combinations of \(\{W_{jm}\}_{j=1}^{d_{m}}\) that solve

$$\begin{aligned} &\textstyle\begin{cases} (\frac{\mathrm{d}v^{(m)}}{\mathrm{d}t},W_{jm} ) + \sigma (v^{(m)},W_{jm}) \\ \quad = ( \phi ( \int _{I} w_{1}(t,x,y)v^{(m)}(t,y) \,\mathrm{d}y + \int _{I} w_{1}(t,x,y) \,\mathrm{d}y ) , W_{jm} ), \\ v^{(m)} |_{t=0} = \tilde{u}_{0m} \quad (j=1,2, \ldots ,d_{m}). \end{cases}\displaystyle \end{aligned}$$
(B.4)

Because the functions \(W_{jm}\) are linearly independent, (B.4) is guaranteed to have a local solution \(v^{(m)} \in C(0,T_{u_{0}};V_{m})\) for some \(T_{u_{0}}>0\). It also satisfies \(\frac{\mathrm{d}v^{(m)}}{\mathrm{d}t} \in L_{2}(0,T_{u_{0}};V_{m})\) under the assumptions of the theorem.
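
For readers who prefer a computational view, the finite-dimensional problem (B.4) is simply a system of ODEs for the coefficients of \(v^{(m)}\) with respect to the basis \(\{W_{jm}\}\). The sketch below is only illustrative: it assumes the orthonormal sine basis \(W_{jm}(x)=\sqrt{2}\sin (j\pi x)\) on \(I=(0,1)\), the bilinear form \(\sigma (u,w)=\nu (u_{x},w_{x})\) (consistent with (B.16) and the estimates below), and sample choices \(\phi =\tanh \), \(w_{1}(t,x,y)=\sin (\pi x)\sin (\pi y)\), \(\nu =0.1\), and \(\tilde{u}_{0}(x)=x(1-x)\), none of which are taken from the paper.

```python
# Illustrative-only integration of the Galerkin system (B.4) on I = (0, 1).
# Assumed for this sketch: phi = tanh, w1(t,x,y) = sin(pi x) sin(pi y), nu = 0.1,
# u0(x) = x(1-x), and W_j(x) = sqrt(2) sin(j pi x), for which the mass matrix is the
# identity and sigma(W_j, W_k) = nu (j pi)^2 delta_jk.
import numpy as np
from scipy.integrate import solve_ivp

nu, d_m, n = 0.1, 8, 401
x = np.linspace(0.0, 1.0, n)
dx = x[1] - x[0]
wq = np.full(n, dx); wq[0] *= 0.5; wq[-1] *= 0.5              # trapezoidal weights on I
W = np.array([np.sqrt(2.0) * np.sin(j * np.pi * x) for j in range(1, d_m + 1)])
lam = nu * np.array([(j * np.pi) ** 2 for j in range(1, d_m + 1)])  # sigma(W_j, W_j)
K = np.outer(np.sin(np.pi * x), np.sin(np.pi * x))            # w1(t, x, y) on the grid

def rhs(t, c):
    v = c @ W                                                 # v^{(m)}(t, .) on the grid
    inner = K @ (v * wq) + K @ wq                             # int w1 v dy + int w1 dy
    F = W @ (np.tanh(inner) * wq)                             # ( phi(...), W_j ), j = 1..d_m
    return -lam * c + F                                       # coefficient form of (B.4)

c0 = W @ ((x * (1.0 - x)) * wq)                               # L2-projection of u0 onto V_m
sol = solve_ivp(rhs, (0.0, 1.0), c0, rtol=1e-8)
vT = sol.y[:, -1] @ W
print("||v^(m)(1)||_(L2) ~", np.sqrt(np.sum(vT ** 2 * wq)))
```

This also illustrates why a local solution of (B.4) exists: the right-hand side of the coefficient system is Lipschitz continuous in c whenever ϕ is Lipschitz, so the Picard–Lindelöf theorem applies.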

Next, we derive an a priori estimate. Let us multiply both sides of (B.4) by the j-th coefficient of \(v^{(m)}\) for each j, and sum up with respect to \(j=1,2,\ldots ,d_{m}\). Then, we have

$$\begin{aligned} &\frac{\mathrm{d}}{\mathrm{d} t} \bigl\vert v^{(m)}(t) \bigr\vert ^{2} + \nu \bigl\vert v_{x}^{(m)}(t) \bigr\vert ^{2} \\ & \quad \leq \bigl\vert v^{(m)}(t) \bigr\vert \biggl\vert \phi \biggl( \int _{I} w_{1}(t, \cdot ,y)v^{(m)}(t,y) \,\mathrm{d}y + \int _{I} w_{1}(t,\cdot ,y)\, { \mathrm{d}}y \biggr) \biggr\vert . \end{aligned}$$

Regarding the right-hand side, with the notation \(c_{1} = | \phi (0) |^{2} \), we estimate it from above as follows.

$$\begin{aligned} & \int _{I} \biggl\vert \phi \biggl( \int _{I} w_{1}(t,x,y)v^{(m)}(t,y) \, { \mathrm{d}}y + \int _{I} w_{1}(t,x,y) \,\mathrm{d}y \biggr) \biggr\vert ^{2} \, { \mathrm{d}}x \\ & \quad \leq 2 \int _{I} \biggl\vert \phi \biggl( \int _{I} w_{1}(t,x,y)v^{(m)}(t,y) \,\mathrm{d}y + \int _{I} w_{1}(t,x,y) \,\mathrm{d}y \biggr) - \phi (0) \biggr\vert ^{2} \,\mathrm{d}x \\ & \quad \quad{}+ 2 \bigl\vert \phi (0) \bigr\vert ^{2} \\ & \quad \leq 4L^{2} \int _{I} \biggl\vert \int _{I} w_{1}(t,x,y)v^{(m)}(t,y) \, { \mathrm{d}}y \biggr\vert ^{2} \,\mathrm{d}x \\ & \quad \quad{}+ 4L^{2} \int _{I} \biggl\vert \int _{I} w_{1}(t,x,y) \,\mathrm{d}y \biggr\vert ^{2} \,\mathrm{d}x + 2c_{1} \\ & \quad \leq 4L^{2} \bigl\Vert w_{1}(t,\cdot ,\cdot ) \bigr\Vert _{L_{2}(I \times I)}^{2} \bigl\vert v^{(m)}(t) \bigr\vert ^{2} + 4L^{2} \bigl\Vert w_{1}(t,\cdot ,\cdot ) \bigr\Vert _{L_{2}(I \times I)}^{2} +2c_{1}. \end{aligned}$$

This yields

$$\begin{aligned} &\frac{\mathrm{d}}{\mathrm{d} t} \bigl\vert v^{(m)}(t) \bigr\vert ^{2} + \nu \bigl\vert v_{x}^{(m)} (t) \bigr\vert ^{2} \\ & \quad \leq 2L \bigl\vert v^{(m)}(t) \bigr\vert ^{2} \bigl\Vert w_{1}(t,\cdot ,\cdot ) \bigr\Vert _{L_{2}(I \times I)} + \bigl\{ 2L \bigl\Vert w_{1}(t,\cdot ,\cdot ) \bigr\Vert _{L_{2}(I \times I)}+\sqrt{2c_{1}} \bigr\} , \end{aligned}$$

from which, applying Gronwall's inequality again, we obtain

$$\begin{aligned} \bigl\vert v^{(m)}(t) \bigr\vert ^{2} &\leq \biggl\{ \vert \tilde{u}_{0m} \vert ^{2} + \frac{1}{2} \int _{0}^{t} \bigl( 2L \bigl\Vert w_{1}(\tau ,\cdot ,\cdot ) \bigr\Vert _{L_{2}(I \times I)}+ \sqrt{2c_{1}} \bigr) \,\mathrm{d}\tau \biggr\} \\ & \quad{}\times \exp \biggl( 2L \int _{0}^{t} \bigl\Vert w_{1}( \tau ,\cdot ,\cdot ) \bigr\Vert _{L_{2}(I \times I)} \,\mathrm{d}\tau \biggr). \end{aligned}$$

This enables us to extract a subsequence \(\{v^{(m^{\prime})}\} \subset \{v^{(m)}\}\) satisfying the following convergences for some \(v_{\infty }\in L_{2}(0,T_{u_{0}};H^{1}(I)) \cap L_{\infty}(0,T_{u_{0}};L_{2}(I))\):

$$\begin{aligned} & v^{(m^{\prime})} \rightarrow v_{\infty } \quad \text{weakly in } L_{2}\bigl(0,T_{u_{0}};H^{1}(I)\bigr), \\ & v^{(m^{\prime})} \rightarrow v_{\infty } \quad \text{weakly* in } L_{\infty}\bigl(0,T_{u_{0}};L_{2}(I)\bigr), \\ & Av^{(m^{\prime})} \rightarrow Av_{\infty} \quad \text{weakly in } L_{2}\bigl(0,T_{u_{0}};H^{-1}(I)\bigr). \end{aligned}$$
(B.5)

By virtue of Rellich's theorem, we have

$$\begin{aligned} &v^{(m)} \rightarrow v_{\infty }\quad \text{strongly in } L_{2}\bigl(0,T_{u_{0}};L_{2}(I)\bigr). \end{aligned}$$

Now, we are in a position to check that this \(v_{\infty}\) indeed solves (B.2). To this end, we take an arbitrary smooth function \(\zeta (t) \in C_{0}^{\infty}(0,T)\), an arbitrary \(\breve{w} \in H^{1}(I)\), and a sequence \(\{w_{m}\}_{m} \subset H^{1}(I)\) satisfying

$$\begin{aligned} &\lim_{m \rightarrow \infty} w_{m} = \breve{w} \quad \text{in } H^{1}(I), \end{aligned}$$

and define \(\psi _{m}\equiv \zeta (t)w_{m}\) and \(\psi \equiv \zeta (t)\breve{w}\) (note that, because we work in one space dimension, where \(H^{1}(I)\) is embedded into \(C(I)\), we can regard \(\mathfrak{V} = H^{1}(I)\) in Definition 1 [12]). It is clear that as \(m \rightarrow +\infty \),

$$\begin{aligned} \begin{aligned} &\psi _{m} \rightarrow \psi \quad \text{strongly in } L_{2}\bigl(0,T_{u_{0}};H^{1}(I)\bigr), \\ &\frac{\mathrm{d}\psi _{m} }{\mathrm{d}t}\rightarrow \frac{\mathrm{d}\psi}{\mathrm{d}t} \quad \text{strongly in } L_{2}\bigl(0,T_{u_{0}};L_{2}(I)\bigr). \end{aligned} \end{aligned}$$
(B.6)

In the above, we may replace m with the subsequence index \(m^{\prime}\) introduced earlier. Thus, from (B.4), after integration by parts in time (note that \(\zeta (t) \in C_{0}^{\infty}(0,T)\)), we have

$$\begin{aligned} &{-} \int _{0}^{T_{u_{0}}} \biggl(v^{(m^{\prime})}(t), \frac{\mathrm{d}\psi _{m^{\prime}}(t)}{\mathrm{d}t} \biggr) \,\mathrm{d}t + \int _{0}^{T_{u_{0}}} \sigma \bigl(v^{(m^{\prime})}(t), \psi _{m^{\prime}}(t)\bigr) \,\mathrm{d}t \\ &\quad = \int _{0}^{T_{u_{0}}} \biggl( \phi \biggl( \int _{I} w_{1}(t,x,y)v^{(m^{ \prime})}(t,y) \,\mathrm{d}y + \int _{I} w_{1}(t,x,y) \,\mathrm{d}y \biggr) , \psi _{m^{\prime}}(t) \biggr) \,\mathrm{d}t. \end{aligned}$$
(B.7)

By virtue of (B.5) and (B.6), letting \(m^{\prime} \rightarrow +\infty \), we have

$$\begin{aligned} &{-} \int _{0}^{T_{u_{0}}} \bigl(v_{\infty}(t), \breve{w} \bigr) \zeta ^{\prime}(t) \,\mathrm{d}t + \int _{0}^{T_{u_{0}}} \sigma \bigl( v_{\infty}(t),\breve{w} \bigr)\zeta (t) \,\mathrm{d}t \\ &\quad = \int _{0}^{T_{u_{0}}} \biggl( \phi \biggl( \int _{I} w_{1}(t,x,y)v_{ \infty}(t,y) \,\mathrm{d}y + \int _{I} w_{1}(t,x,y) \,\mathrm{d}y \biggr) , \breve{w} \biggr) \zeta (t) \,\mathrm{d}t. \end{aligned}$$
(B.8)

The equality above holds for any \(\breve{w} \in H^{1}(I)\) and any \(\zeta \in C_{0}^{\infty}(0,T)\); thus, this \(v_{\infty}\) solves (B.2).

Now, (B.8) can be rewritten as follows.

$$\begin{aligned} &{-} \int _{0}^{T_{u_{0}}} \bigl(v_{\infty}(t), \breve{w}\bigr) \zeta ^{\prime}(t) \,\mathrm{d}t \\ & \quad = \int _{0}^{T_{u_{0}}} \biggl( \phi \biggl( \int _{I} w_{1}(t,x,y)v_{ \infty}(t,y) \,\mathrm{d}y + \int _{I} w_{1}(t,x,y) \,\mathrm{d}y \biggr) -Av_{ \infty },\breve{w} \biggr) \zeta (t) \,\mathrm{d}t. \end{aligned}$$

We can easily see

$$\begin{aligned} &\frac{\mathrm{d}v_{\infty}}{\mathrm{d}t} \in L_{2}\bigl(0,T;H^{-1}(I) \bigr), \end{aligned}$$

which, together with (B.1), implies that \(v_{\infty}\) belongs to the space mentioned in Theorem 1 and the subsequent Remark 1.

Finally, we verify that \(v_{\infty}\) above satisfies the initial condition. Let \(\eta (t) \in C^{\infty}(0,T_{u_{0}})\) be a function that satisfies \(\eta (t)=0\) near \(T_{u_{0}}\) and \(\eta (0) \ne 0\). We again take \(\breve{w} \in H^{1}(I)\) and a sequence \(\{w_{m}\}_{m} \subset H^{1}(I)\) satisfying

$$\begin{aligned} &\lim_{m \rightarrow \infty} w_{m} = \breve{w} \quad \text{in } H^{1}(I). \end{aligned}$$

Then \(\psi = \eta (t) \breve{w} \in \mathfrak{W}(T_{u_{0}})\), and integration by parts yields

$$\begin{aligned} & \int _{0}^{T_{u_{0}}} \biggl( \frac{\mathrm{d}v_{\infty}}{\mathrm{d}t}(t), \eta (t)\breve{w} \biggr) \,\mathrm{d}t = - \int _{0}^{T_{u_{0}}} \bigl( v_{ \infty}(t), \breve{w} \bigr) \eta ^{\prime}(t) \,\mathrm{d}t - \bigl( v_{ \infty}(0),\breve{w} \bigr)\eta (0). \end{aligned}$$
(B.9)

From Equation (B.2), we can derive

$$\begin{aligned} & \int _{0}^{T_{u_{0}}} \biggl( \frac{\mathrm{d}v_{\infty}(t)}{\mathrm{d}t}, \eta (t) \breve{w} \biggr) \,\mathrm{d}t \\ & \quad = \int _{0}^{T_{u_{0}}} \biggl( \phi \biggl( \int _{I} w_{1}(t,x,y)v_{ \infty}(t,y) \,\mathrm{d}y + \int _{I} w_{1}(t,x,y) \,\mathrm{d}y \biggr), \breve{w} \biggr)\eta (t) \,\mathrm{d}t \\ & \quad \quad{}- \int _{0}^{T_{u_{0}}} \sigma \bigl( v_{\infty}(t),\breve{w} \bigr) \eta (t) \,\mathrm{d}t. \end{aligned}$$
(B.10)

Moreover, from (B.4) we have

$$\begin{aligned} & \int _{0}^{T_{u_{0}}} \biggl( \frac{\mathrm{d}v^{(m^{\prime})}(t)}{\mathrm{d}t},w_{m^{\prime}} \biggr) \eta (t) \,\mathrm{d}t \\ & \quad = \int _{0}^{T_{u_{0}}} \biggl( \phi \biggl( \int _{I} w_{1}(t,x,y)v^{(m^{ \prime})}(t,y) \,\mathrm{d}y + \int _{I} w_{1}(t,x,y) \,\mathrm{d}y \biggr) ,w_{m^{ \prime}} \biggr) \eta (t) \,\mathrm{d}t \\ &\quad \quad{}- \int _{0}^{T_{u_{0}}} \sigma \bigl(v^{(m^{\prime})}(t),w_{m^{\prime}} \bigr) \eta (t) \,\mathrm{d}t. \end{aligned}$$
(B.11)

The left-hand side of (B.11) has another representation:

$$\begin{aligned} & \int _{0}^{T_{u_{0}}} \biggl( \frac{\mathrm{d}v^{(m^{\prime})}(t)}{\mathrm{d}t},w_{m^{\prime}} \biggr) \eta (t) \,\mathrm{d}t \\ & \quad =- \int _{0}^{T_{u_{0}}} \bigl( v^{(m^{\prime})}(t), w_{m^{\prime}}\bigr) \eta ^{ \prime}(t) \,\mathrm{d}t - ( \tilde{u}_{0m^{\prime}},w_{m^{\prime}} )\eta (0). \end{aligned}$$
(B.12)

Letting \(m^{\prime}\) tend to +∞, on the one hand, (B.11) yields

$$\begin{aligned} &\lim_{m^{\prime }\rightarrow +\infty} \int _{0}^{T_{u_{0}}} \biggl( \frac{\mathrm{d}v^{(m^{\prime})}}{\mathrm{d}t}(t),w_{m^{\prime}} \biggr) \eta (t) \,\mathrm{d}t \\ & \quad = \int _{0}^{T_{u_{0}}} \biggl( \phi \biggl( \int _{I} w_{1}(t,x,y)v_{ \infty}(t,y) \,\mathrm{d}y + \int _{I} w_{1}(t,x,y) \,\mathrm{d}y \biggr), \breve{w} \biggr)\eta (t) \,\mathrm{d}t \\ & \quad \quad{}- \int _{0}^{T_{u_{0}}} \sigma \bigl( v_{\infty}(t),\breve{w} \bigr) \eta (t) \,\mathrm{d}t \\ & \quad = \int _{0}^{T_{u_{0}}} \biggl( \frac{\mathrm{d}v_{\infty}}{\mathrm{d}t}(t), \eta (t) \breve{w} \biggr) \,\mathrm{d}t, \end{aligned}$$
(B.13)

where we used (B.10) to deduce the rightmost side. On the other hand, (B.12) yields

$$\begin{aligned} &\lim_{m^{\prime }\rightarrow +\infty} \int _{0}^{T_{u_{0}}} \biggl( \frac{\mathrm{d}v^{(m^{\prime})}}{\mathrm{d}t}(t),w_{m^{\prime}} \biggr) \eta (t) \,\mathrm{d}t \\ & \quad = - \int _{0}^{T_{u_{0}}} \bigl( v_{\infty}(t), \breve{w}\bigr) \eta ^{\prime}(t) \,\mathrm{d}t - ( \tilde{u}_{0}, \breve{w} )\eta (0). \end{aligned}$$
(B.14)

By comparing (B.9), (B.13), and (B.14), and recalling that \(\eta (0) \ne 0\), we arrive at

$$\begin{aligned} &\bigl(v_{\infty}(0),\breve{w}\bigr) = (\tilde{u}_{0}, \breve{w}) \quad \forall \breve{w} \in H^{1}(I). \end{aligned}$$
(B.15)

Because \(H^{1}(I)\) is dense in \(L_{2}(I)\), (B.15) holds for all \(\breve{w} \in L_{2}(I)\), which implies

$$\begin{aligned} &v_{\infty}(0)=\tilde{u}_{0}. \end{aligned}$$

This is the desired result.

B.2 Proof of Theorem 2

Here, we prove Theorem 2. Because local solvability is assured by Theorem 1, we assume that, for some \(T^{*}>0\), we have a solution v of (3.3) on the interval \([0,T^{*}]\). Let us first introduce the variable:

$$\begin{aligned} &\breve{v}(t,x) \equiv e^{-\lambda t}v(t,x), \quad t \in \bigl[0,T^{*}\bigr], \end{aligned}$$

with \(\lambda \in {\mathbb{R}}\) to be specified later, which solves

$$\begin{aligned} \textstyle\begin{cases} \breve{v}_{t} -\nu \breve{v}_{xx}= -\lambda \breve{v}+ e^{-\lambda t} \phi ( \int _{I} w_{1}(t,x,y)e^{ \lambda t} \breve{v}(t,y) \,\mathrm{d}y + \int _{I} w_{1}(t,x,y) \, { \mathrm{d}}y ) \quad \text{in } I_{T^{*}}, \\ \breve{v}(0,x)=\tilde{u}_{0} \quad \text{on } I, \\ \breve{v}=0 \quad \text{on } \partial I \ \forall t \in (0,T^{*}). \end{cases}\displaystyle \end{aligned}$$
(B.16)
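
Indeed, the first equation in (B.16) follows from the product rule: writing (3.3) as \(v_{t}-\nu v_{xx}=\phi (\cdots )\), where the argument of ϕ is the same integral expression as in (B.16) with \(e^{\lambda t}\breve{v}\) replaced by v, we have

$$\begin{aligned} \breve{v}_{t} = -\lambda e^{-\lambda t}v + e^{-\lambda t}v_{t} = -\lambda \breve{v} + e^{-\lambda t} \bigl( \nu v_{xx} + \phi (\cdots ) \bigr) = -\lambda \breve{v} + \nu \breve{v}_{xx} + e^{-\lambda t}\phi (\cdots ), \end{aligned}$$

while the initial and boundary conditions are inherited directly from those of v.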

By multiplying both sides of the first equation in (B.16) by \(\breve{v}\) and integrating over I, we deduce the following estimate:

$$\begin{aligned} &\frac{1}{2} \frac{\mathrm{d}}{\mathrm{d}t} \bigl\vert \breve{v}(t) \bigr\vert ^{2} + \nu \bigl\vert \breve{v}_{x} (t) \bigr\vert ^{2} +\lambda \bigl\vert \breve{v}(t) \bigr\vert ^{2} \\ & \quad \leq e^{-\lambda t} \bigl\vert \breve{v}(t) \bigr\vert \biggl\Vert \phi \biggl( \int _{I} w_{1}(t, \cdot ,y)e^{\lambda t} \breve{v}(t,y) \,\mathrm{d}y + \int _{I} w_{1}(t, \cdot ,y) \,\mathrm{d}y \biggr) \biggr\Vert _{L_{2}(I)}. \end{aligned}$$
(B.17)

Using the notation \(c_{1} = | \phi (0) |^{2} \) again, and applying the Schwarz inequality and the Lipschitz continuity of ϕ, we have

$$\begin{aligned} & \int _{I} \biggl\vert \phi \biggl( \int _{I} w_{1}(t,x,y)e^{\lambda t} \breve{v}(t,y) \,\mathrm{d}y + \int _{I} w_{1}(t,x,y) \,\mathrm{d}y \biggr) \biggr\vert ^{2} \,\mathrm{d}x \\ & \quad \leq 4L^{2} e^{2\lambda t} \bigl\Vert w_{1}(t,\cdot ,\cdot ) \bigr\Vert _{L_{2}(I \times I)}^{2} \bigl\vert \breve{v}(t) \bigr\vert ^{2} +4L^{2} \bigl\Vert w_{1}(t, \cdot ,\cdot ) \bigr\Vert _{L_{2}(I\times I)}^{2} +2c_{1}, \end{aligned}$$

and substituting this into (B.17), we obtain

$$\begin{aligned} &\frac{1}{2} \frac{\mathrm{d}}{\mathrm{d}t} \bigl\vert \breve{v}(t) \bigr\vert ^{2} + \nu \bigl\vert \breve{v}_{x}(t) \bigr\vert ^{2} +\lambda \bigl\vert \breve{v}(t) \bigr\vert ^{2} \\ & \quad \leq 2L \bigl\Vert w_{1}(t,\cdot ,\cdot ) \bigr\Vert _{L_{2}(I\times I)} \bigl\vert \breve{v}(t) \bigr\vert ^{2} \\ &\quad \quad{}+ \frac{e^{-2\lambda t}}{2\lambda} \bigl( 2L \bigl\Vert w_{1}(t, \cdot , \cdot ) \bigr\Vert _{L_{2}(I\times I)} +c_{1} \bigr)^{2} + \frac{\lambda}{2} \bigl\vert \breve{v}(t) \bigr\vert ^{2}. \end{aligned}$$
(B.18)

If we denote

$$\begin{aligned} &G(t) \equiv \frac{e^{-2\lambda t}}{2\lambda} \bigl( 2L \bigl\Vert w_{1}(t, \cdot ,\cdot ) \bigr\Vert _{L_{2}(I\times I)} +c_{1} \bigr)^{2}, \end{aligned}$$

then, by Gronwall's inequality, we have

$$\begin{aligned} & \bigl\vert \breve{v}(t) \bigr\vert ^{2} \leq \biggl( \vert \tilde{u}_{0} \vert ^{2} + 2 \int _{0}^{t} G(\tau ) \,\mathrm{d}\tau \biggr) \exp \biggl( 4L \int _{0}^{t} \bigl\Vert w_{1}( \tau ,\cdot ,\cdot ) \bigr\Vert _{L_{2}(I\times I)} \,\mathrm{d} \tau -\lambda t \biggr). \end{aligned}$$
(B.19)

Applying the Schwarz inequality to the right-hand side of (B.19) and taking λ so that

$$\begin{aligned} &\lambda \geq \frac{4L \Vert w_{1} \Vert _{L_{2}({\mathcal {H}}_{\infty})}}{\sqrt{T^{*}}} \end{aligned}$$

holds, then (B.19) yields

$$\begin{aligned} & \bigl\vert \breve{v}\bigl(T^{*}\bigr) \bigr\vert ^{2} \leq \vert \tilde{u}_{0} \vert ^{2} + 2 \int _{0}^{ \infty }G(\tau ) \,\mathrm{d}\tau . \end{aligned}$$
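
Indeed, assuming that \(\Vert w_{1} \Vert _{L_{2}({\mathcal {H}}_{\infty})}\) denotes the \(L_{2}\)-in-time norm of \(t \mapsto \Vert w_{1}(t,\cdot ,\cdot ) \Vert _{L_{2}(I\times I)}\) (see Table 4), the Schwarz inequality in time gives

$$\begin{aligned} 4L \int _{0}^{T^{*}} \bigl\Vert w_{1}(\tau ,\cdot ,\cdot ) \bigr\Vert _{L_{2}(I\times I)} \,\mathrm{d}\tau - \lambda T^{*} \leq 4L\sqrt{T^{*}} \Vert w_{1} \Vert _{L_{2}({\mathcal {H}}_{\infty})} - \lambda T^{*} \leq 0 \end{aligned}$$

for the above choice of λ, so the exponential factor in (B.19) is at most one at \(t=T^{*}\).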

This implies that the norm \(| \breve{v}(T^{*}) |\) is bounded independently of \(T^{*}\). Following the same argument as in [64], we obtain the statement of the theorem.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Honda, H. Universal approximation property of a continuous neural network based on a nonlinear diffusion equation. Adv Cont Discr Mod 2023, 43 (2023). https://doi.org/10.1186/s13662-023-03787-z

Mathematics Subject Classification

Keywords