Path classification by stochastic linear recurrent neural networks

We investigate the functioning of a classifying biological neural network from the perspective of statistical learning theory, modelled, in a simplified setting, as a continuous-time stochastic recurrent neural network (RNN) with identity activation function. In the purely stochastic (robust) regime, we give a generalisation error bound that holds with high probability, thus showing that the empirical risk minimiser is the best-in-class hypothesis. We show that RNNs retain a partial signature of the paths they are fed as the sole information exploited for training and classification tasks. We argue that these RNNs are easy to train and robust, and we back these observations with numerical experiments on both synthetic and real data. We also exhibit a trade-off phenomenon between accuracy and robustness.


Introduction
Recurrent neural networks (RNNs) constitute the simplest machine learning paradigm able to handle variable-length data sequences while tracking long-term dependencies and taking into account the temporal order of the received information. Such data streams appear naturally in many fields, such as (audio or video) signal processing or financial data. The RNN architecture is inspired by biological neural networks, where both recurrent connectivity and stochasticity in the temporal dynamics are ubiquitous. Despite the empirical success of RNNs and their many variants (long short-term memory networks (LSTMs), gated recurrent units (GRUs), etc.), several fundamental mathematical questions related to the functioning of these networks remain open:
• What is the exact type of information that an RNN learns from the input sequences?
• Training artificial RNNs with classical methods like gradient descent suffers from fundamental problems such as instability, non-convergence, exploding gradients [Doy92] and plateauing [LHEL21]. On the other hand, biological networks seem to be robust and easy to train. What role does stochasticity play in this regard?
• What is the amount of data needed for such a network to achieve a small estimation error with high probability?
In the current paper, we set out to answer these questions by modelling a biological neural network as a continuous-time (stochastic) RNN with a randomly chosen connectivity matrix and an identity activation function, in view of classifying data streams (in this case, time-dependent paths). Let us say a few words about each of our three working assumptions:
• The continuous-time dynamics are a generalisation of the classical discrete-time dynamics frequently encountered in the literature [GGO20b]. The latter can be seen as an Euler discretisation of the former as the data stream gets sampled at shorter time intervals. Working with continuous-time dynamics provides us with a richer mathematical toolbox while still being applicable to the discrete-time case, and it keeps key features and issues of such systems, such as the dependence on the whole data sequence and its order.
• Randomly generating the connectivity matrix of an RNN is the cornerstone of reservoir computing [JH04, MJS07]. This paradigm is based on the idea that universal approximation properties can be achieved for several dynamical systems without the need to optimise all parameters, and it has shown exceptional performance in a variety of tasks. This working assumption also has the benefit of simplifying the training process (as will be clear from the formulas in this paper, optimising over this matrix is computationally heavy, even in the linear case). Training will consist, in our case, in finding a pre-processing projection vector and the parameters of a read-out map.
This simplicity can be exploited in practice, for instance, to deploy the same network for several tasks (i.e. multi-tasking) without the need for heavy retraining or for storing a large number of parameters, in a fashion that is reminiscent of biological networks. Compared to the existing literature (e.g. [DCR+18]), we include the pre-processing map (input projection vector) as a tunable parameter in order to increase the performance compared to a classical reservoir computer.
• We choose to work with the identity activation function in order to build intuition about the answers to the questions above. In this case, we obtain precise formulas. We aim to generalise the results of this study to the non-linear case in a later work.
Before setting out our roadmap, let us note for the sake of completeness that there exist a number of ways in which one may avoid recurrent architectures altogether when handling data streams and use instead a feed-forward network, which is a more studied and better understood paradigm. These are usually based on the transformation of paths into fixed-length vectors that can then be fed to the feed-forward structure. In particular, we cite Independent Component Analysis [HM17, HH20, OS21], the signature methods [Gra13, KO19, LLN13] and the PCA-type dimension reduction introduced in [BHKS20]. However, these methods work best when the whole signal is processed (which may be computationally heavy) before being fed to the network, while RNNs are able to work with these signals in a continuous manner as they come, rendering them more suitable to real-time situations. As for the approximation properties of these recurrent architectures, several works indicate that such properties may also hold for the path classification problem treated in this paper, although a precise statement in the stochastic case, which is of interest to us, is still missing. For example, rigorous results providing the approximation properties of (discrete-time) RNNs with a randomly generated connectivity matrix can be found in [GGO20a]. In [FN93], it is shown that every continuous path can be approximated (in the uniform convergence norm) as the outcome of an RNN with a suitable activation function, while, more recently in [LHEL21], the authors show that an RNN with the identity activation function can approximate any functional on a path space provided it is continuous, linear, regular and time-homogeneous.
We will approach the problem of the binary classification of continuous-time paths with RNNs in the presence of noise from the point of view of statistical learning theory. After introducing the necessary mathematical notation and the learning setup (model, loss function, etc.) in Section 2, we will give a generalisation error bound that holds with high probability in Section 3. The uniform bound that we derive controls the difference between the risk of a hypothesis and its empirical counterpart, and it answers practical questions concerning the size of the sample and the bounds on the pre-processing and read-out maps needed to achieve a certain accuracy. Consequently, minimising the empirical risk achieves agnostic PAC learnability and gives guarantees on the ability of the empirical risk minimiser to generalise to unseen data. Section 4 looks in more detail at the empirical risk minimisation (ERM) procedure:
• we compare its output to that of the popular Support Vector Machine (SVM) considered for example in [NKD+20],
• we argue heuristically that noise, which is a natural assumption in modelling biological neural networks, provides stability and robustness against different types of perturbations to the dataset,
• we show rigorously that, in the linear case, the RNN retains a "partial signature" of the time-lifted input signal as global information about said signal. The empirical risk is a function of the tunable parameters of the model and the partial signatures of the training data.
Finally, we look into the numerical minimisation of the empirical risk using gradient descent in Section 5. With the explicit formulas we obtain, the global effect through time of the tunable parameters on the loss function is taken into account, and we do not need some sort of unrolling of the network to apply back-propagation through time. The experiments are performed using the Japanese vowels dataset and classes of trigonometric polynomials.
2. The learning setup

2.1. The recurrent network input-output map. Let r, n and d be integers. The input and the hidden state of the RNN are modelled, respectively, as r- and n-dimensional time-dependent continuous paths x and y. Given a filtered probability space (Ω, A, (F_t)_{0≤t≤T}, P), a simple continuous-time model describing the time-evolution of input-output dynamics is given by the following stochastic differential equation (SDE) (which can be seen as a stochastic version of the one in [SCS88])

(1)    dy(t) = (−y(t) + W φ(y(t)) + u x(t)) dt + Σ dB(t),

with initial condition taken to be y(0) = 0. Here, u ∈ L(R^r, R^n) is a linear pre-processing map (identified with a matrix in R^{n×r}), W ∈ R^{n×n} is the network matrix that models the connection strength between neurons, φ is an activation function (applied element-wise) and Σ ∈ R^{n×d} is a matrix (which we will call the noise matrix) describing the random effect of a d-dimensional Brownian noise B. In this paper, we will consider the linear case where φ is the identity function. The more interesting case where φ is non-linear will be the subject of a future work. Given a path x, we will denote by y(T, x) (or y_u(T, x) to emphasise the dependence on the pre-processing map u) the terminal value (i.e. at time T) of the solution to equation (1). A read-out map h is then applied to the final hidden state of the neural network to produce a prediction v = h(y(T, x)). In our case, h will be a labelling function, and more specifically, a hyperplane classifier. Given a training set of labelled inputs, we aim to train this network by changing the values of the parameters in order to increase the accuracy of future predictions (in a sense to be made clear in the following subsections). In the classical framework of reservoir computing, the network's tunable parameter is the hyperplane h, while the connectivity matrix W and the pre-processing map u are generated randomly.
In our case, we aim to increase the performance by considering u a tunable parameter to be optimised according to the learning task at hand.
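The dynamics described above can be simulated with a straightforward Euler–Maruyama scheme. The sketch below (function and parameter names are ours, not from the paper) assumes the identity activation and a drift of Ornstein–Uhlenbeck type, −y + Wy + ux, consistent with the notation W_0 = W − I introduced in Section 3:

```python
import numpy as np

def simulate_hidden_state(x, u, W, Sigma, T=1.0, n_steps=1000, rng=None):
    """Euler-Maruyama simulation of the linear stochastic RNN
        dy = ((W - I) y + u x(t)) dt + Sigma dB,   y(0) = 0,
    where `x` is a callable t -> R^r giving the input path."""
    rng = np.random.default_rng(rng)
    n, d = Sigma.shape
    dt = T / n_steps
    W0 = W - np.eye(n)
    y = np.zeros(n)
    for k in range(n_steps):
        t = k * dt
        # Brownian increment over [t, t + dt]
        dB = rng.normal(scale=np.sqrt(dt), size=d)
        y = y + (W0 @ y + u @ x(t)) * dt + Sigma @ dB
    return y
```

With Σ = 0 and W = I, the hidden state simply integrates the projected input, which provides an easy sanity check of the scheme.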

2.2.
Hypothesis class and read-out maps. As we have alluded to above, our global hypothesis class H comprises maps that can be written as the composition of the reservoir-solution map y_u(T, ·) and a read-out map h chosen from a read-out hypothesis class H*. As the read-out map h will be applied to the random vector y_u(T, x), we can think of the hypothesis class H as a class of random learners from X to V-valued random variables, where X denotes the space of input paths and V the target set of outputs; for example the labels {−1, +1}, as will be the case here. If one identifies random variables and their probability distributions, then one may think of the result of the learners applied to an input x as probability distributions instead of a single label. In our very simple setting, this translates into thinking of the hypothesis H as a regression function x ↦ P(H(x) = 1) ∈ [0, 1]. This discussion fits into the framework of probabilistic binary classification, or in the wider one of probabilistic supervised learning as introduced in [GKMO18]. However, let us emphasise the fundamental difference that, in our case, we label inputs based on one realisation of the hypothesis rather than produce a probability distribution on the space of labels.
In the current work, and in line with most common practices, we will take the read-out hypothesis class H* to be the class of hyperplane classifiers. We will adopt the following notations: Notation 2.1.
(1) If ω ∈ R^n and b ∈ R, h_{ω,b} denotes the hyperplane classifier h_{ω,b}(y) = sign(⟨y, ω⟩ + b), for all y ∈ R^n.
(2) If u ∈ R n×r , H u,ω,b denotes the global classifier of the stochastic RNN with pre-processing map u H u,ω,b (x) = h ω,b (y u (T, x)), for all x ∈ X .
2.3. Loss functions and associated risk. The goal of supervised learning is to maximise the ability, with high probability and measured against a loss function, of the predicted outputs to generalise to unseen data. As our learner generates a random variable instead of a deterministic label, we need to consider loss functions of the type l : V^Ω × V → R_+. Given a classical loss function l̃ : V × V → R_+ (e.g. the square loss or the binary loss), we may construct loss functions suitable to our framework in two manners, amongst others. The first way is by defining the loss function l to be a statistic of the random variable ω ↦ l̃(ṽ(ω), v), for example its expectation. For a fixed input and label, the sole source of randomness in our example is the d-dimensional Brownian motion B, with respect to which we will take the expectation. An explicit example would be the binary loss function l̃(ṽ, v) = 1_{ṽ≠v}, to which we associate the loss l(ṽ, v) = P(ṽ ≠ v). This is the loss function that we will consider in our case. We choose this loss function as it is simpler to analyse than other popular types of losses (square loss, hinge loss, etc.) while involving similar key quantities (the Gaussian cumulative distribution function, as will be seen later) in the classification process and risk minimisation. A second type of loss function can be obtained by defining a statistical functional Ψ : V^Ω → V (here V can be understood in a broader sense, for example R, instead of the labels {−1, +1}) and then defining l(ṽ, v) = l̃(Ψ(ṽ), v). This type of loss function depends on the distribution produced by the hypothesis rather than on its single realisations, and it therefore defines a probabilistic loss function in the sense of [GKMO18]. An instance of such a type would be to take Ψ = E and the square loss l̃(ṽ, v) = |ṽ − v|², to obtain l(ṽ, v) = |Eṽ − v|².
The generalisation error (also called risk) of a hypothesis H ∈ H associated to a loss function l is then classically given by R(H) = E_{(x,v)} l(H(x), v), where the expectation is taken with respect to the (unknown) distribution according to which an input and its label (x, v) are generated. One usually interprets the generalisation error as the ability of the learned hypothesis to generalise well to unseen data, assuming it is sampled according to the same unknown distribution which generated the training sample. As no assumptions are made in regard to the distribution of the data, the generalisation error remains unknown, and thus its minimisation as such is impossible. Instead, one classically aims to minimise its empirical counterpart based on a training sample (X, V) = (x_i, v_i)_{1≤i≤m}, given by R_{(X,V)}(H) = (1/m) Σ_{1≤i≤m} l(H(x_i), v_i). Obviously, one also needs to provide theoretical arguments justifying the use of the empirical error instead of the generalisation error. These often come in the form of a concentration inequality. We will recall such an argument using the notion of Rademacher complexity in a subsequent subsection.
If we denote by H̄ the set of all measurable hypotheses, we can then decompose the difference between the true risk of a hypothesis H ∈ H and the smallest possible risk in the following manner:
R(H) − inf_{H′∈H̄} R(H′) = ( R(H) − inf_{H′∈H} R(H′) ) + ( inf_{H′∈H} R(H′) − inf_{H′∈H̄} R(H′) ) =: E_est + E_app.
On the one hand, the estimation error E_est depends on the ability to solve the problem of minimising the risk over the chosen hypothesis class H. Intuitively, this problem gets harder with larger classes of hypotheses. In many situations where the concentration inequalities mentioned above apply, one obtains quantitative guarantees on how small the estimation error E_est can be made in the case of an empirical risk minimiser. On the other hand, the approximation error E_app depends on how accurately the hypotheses in H can approximate any measurable hypothesis.

Generalisation error bounds
3.1. Solution to the SDE. In our simplified case of the identity function as an activation function, and similarly to an Ornstein–Uhlenbeck process, the solution to the SDE (1) is explicitly given by
y(t) = ∫_0^t e^{(W−I)(t−s)} u x(s) ds + ∫_0^t e^{(W−I)(t−s)} Σ dB(s).
Notation 3.1. For convenience, we will write W_0 = W − I.
The final hidden state y(T, x) is then a Gaussian random vector with mean and covariance matrix given by
E[y(T, x)] = ∫_0^T e^{W_0(T−s)} u x(s) ds,    Cov(y(T, x)) = ∫_0^T e^{W_0 s} Σ Σ^⊤ e^{W_0^⊤ s} ds.
Notation 3.2. For the rest of this paper, we will denote ν_{x,u} = E[y(T, x)] and A = Cov(y(T, x)).
Remark 3.3. Note that the covariance matrix A is independent of the tunable parameters u, ω and b and of the input signals. As will be seen later, this is a key property of this algorithm.
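Since the mean and covariance of the terminal hidden state are integrals of matrix exponentials, they can be approximated by simple quadrature. A minimal sketch (assuming the standard Ornstein–Uhlenbeck-type formulas ν_{x,u} = ∫_0^T e^{W_0(T−s)} u x(s) ds and A = ∫_0^T e^{W_0 s} ΣΣ^⊤ e^{W_0^⊤ s} ds, with W_0 = W − I; function names are ours):

```python
import numpy as np
from scipy.linalg import expm
from scipy.integrate import trapezoid

def mean_and_covariance(x, u, W, Sigma, T=1.0, n_quad=200):
    """Trapezoidal quadrature for the Gaussian law of y(T, x):
    mean  nu = int_0^T exp(W0 (T-s)) u x(s) ds,
    cov   A  = int_0^T exp(W0 s) Sigma Sigma^T exp(W0^T s) ds,
    with W0 = W - I. `x` is a callable s -> R^r."""
    n = W.shape[0]
    W0 = W - np.eye(n)
    s_grid = np.linspace(0.0, T, n_quad + 1)
    nu_vals = np.array([expm(W0 * (T - s)) @ (u @ x(s)) for s in s_grid])
    nu = trapezoid(nu_vals, s_grid, axis=0)
    SS = Sigma @ Sigma.T
    A_vals = np.array([expm(W0 * s) @ SS @ expm(W0 * s).T for s in s_grid])
    A = trapezoid(A_vals, s_grid, axis=0)
    return nu, A
```

For W = I (so W_0 = 0) and a constant input, both integrands are constant and the quadrature is exact, which gives a quick sanity check.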
It is also interesting to note that, in this linear case, the hidden states of different inputs have similar distributions (Gaussian), differing only by their means. Loosely speaking, the role of the parameter u will then be to separate as much as possible the data x_i according to their labels v_i through their processed weighted averages ν_{x_i,u}, while the covariance matrix A quantifies by how much the hidden states are likely to deviate from their means. As in [NKD+20], one may be tempted to apply traditional separating algorithms such as soft SVM as a classification technique. However, we follow here a strategy dictated by the risk minimisation and the learning guarantees given by the generalisation error bounds that we will obtain. We will compare these two techniques in a subsequent subsection.
3.2. Empirical risk for the binary loss. As stated in Subsection 2.3, we consider the case of the binary loss function. We will denote by l̃ the classical binary loss function (ṽ, v) ↦ 1_{ṽ≠v}, while l denotes its counterpart associated with random labels (assuming measurability), l(ṽ, v) = P(ṽ ≠ v). We will first give the exact formula of the loss in the pure stochastic regime, which shows the smoothing effect of the noise on the binary loss function. For a lighter notation, we will sometimes avoid making the dependence on other variables explicit. For example, we may write y instead of y_u(T, x).
Proposition 3.4. Assume that ω ∉ Ker(A). Then, for every input path x and label v ∈ {−1, +1},
(3)    l(H_{u,ω,b}(x), v) = Φ( −v (⟨ν_{x,u}, ω⟩ + b) / √(⟨ω, Aω⟩) ),
where Φ denotes the standard Gaussian cumulative distribution function (CDF).
Proof. First note that l̃(h_{ω,b}(y), v) ≤ 1_{v(⟨y,ω⟩+b)≤0}, and that this inequality is not an identity if and only if v = 1 and ⟨y, ω⟩ + b = 0 (because of the convention on the sign of 0). Next, recall that ⟨y, ω⟩ + b is a Gaussian random variable with mean ⟨ν_{x,u}, ω⟩ + b and variance ⟨ω, Aω⟩. We assume that ω ∉ Ker(A). Then ⟨y, ω⟩ + b is a non-trivial Gaussian random variable (and, in particular, has no atoms), so that
l(H_{u,ω,b}(x), v) = P( v(⟨y, ω⟩ + b) ≤ 0 ) = Φ( −v (⟨ν_{x,u}, ω⟩ + b) / √(⟨ω, Aω⟩) ),
which completes the proof.
Remark 3.5. Note that if ω ∉ Ker(A), then one cannot obtain a null empirical risk, because of the Gaussian CDF. The quantity (⟨ν_{x,u}, ω⟩ + b) / √(⟨ω, Aω⟩) resembles a distance from a hyperplane (more details will be provided in the next section). Similar in spirit to the soft-SVM problem, the loss function (3) penalises both misclassification and close proximity to said hyperplane.
Remark 3.6. If ω ∈ Ker(A), then ⟨y, ω⟩ = ⟨ν_{x,u}, ω⟩ almost surely. We are then in a non-stochastic regime, where full accuracy on the training set can be achieved if the averages ν_{x_i,u} are separable. In this regime, only misclassification is penalised.
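Because y(T, x) is Gaussian with mean ν_{x,u} and covariance A, the Gaussian-CDF form of the loss can be checked against a Monte Carlo estimate. A small sketch (function names are ours):

```python
import numpy as np
from scipy.stats import norm

def binary_risk(nu, A, omega, b, v):
    """Closed-form loss P(sign(<y, omega> + b) != v) for y ~ N(nu, A):
    Phi( -v (<nu, omega> + b) / sqrt(omega^T A omega) )."""
    s = float(omega @ A @ omega)
    return norm.cdf(-v * (nu @ omega + b) / np.sqrt(s))

def binary_risk_mc(nu, A, omega, b, v, n_samples=200_000, rng=0):
    """Monte Carlo check: draw y ~ N(nu, A), classify with sign(<y, omega> + b)
    (convention sign(0) = +1) and count misclassifications."""
    rng = np.random.default_rng(rng)
    y = rng.multivariate_normal(nu, A, size=n_samples)
    labels = np.where(y @ omega + b >= 0, 1, -1)
    return np.mean(labels != v)
```

The two estimates agree up to Monte Carlo error, illustrating the smoothing effect of the noise on the binary loss.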
3.3. Main result. Classically, quantitative guarantees for the minimisation of the estimation error within a chosen hypothesis class (agnostic PAC-learnability) are obtained by first showing that the hypothesis set satisfies the uniform convergence property (c.f. [SSBD14]), i.e., by obtaining probabilistic bounds for the worst-in-class difference between the generalisation error and the empirical error. In turn, a way to achieve this is through controlling the Rademacher complexity (or the growth function or the VC-dimension) of the hypothesis class. Given a state space Z, a sample Z = {z_i}_{i=1}^m of points in Z and a class G of real-valued maps defined on Z, the (empirical) Rademacher complexity of G with respect to the sample Z is defined by (c.f. [MRT12])
Rad_Z(G) = (1/m) E_ε [ sup_{g∈G} Σ_{i=1}^m ε_i g(z_i) ],
where the ε_i's are independent Rademacher (i.e. symmetric Bernoulli) random variables and ε = (ε_1, . . . , ε_m). The Rademacher complexity can be seen as a measure of the richness of the class of functions G and of its ability to provide a variety of labels {g(z_i)}_{i=1}^m for the sample Z. If we assume that the sample Z is drawn in an i.i.d. manner according to a distribution D, we obtain a concentration inequality (Theorem 3.7) bounding, with probability at least 1 − δ, the worst-in-class gap between the risk and the empirical risk in terms of the Rademacher complexity and a term of order √(log(1/δ)/m). We will estimate the Rademacher complexity in our setting, then apply the above theorem to quantify the error in estimating the true risk by its empirical counterpart. This will yield our main theoretical result below. For matrices, ‖·‖ denotes the spectral norm.
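The expectation over the Rademacher variables ε in the definition above can itself be estimated by Monte Carlo when the class G is finite (or finitely sampled). A minimal sketch (names are ours):

```python
import numpy as np

def empirical_rademacher(G_values, n_draws=4000, rng=0):
    """Monte Carlo estimate of (1/m) E_eps sup_{g in G} sum_i eps_i g(z_i).
    `G_values` is an array of shape (num_hypotheses, m) with
    G_values[j, i] = g_j(z_i)."""
    rng = np.random.default_rng(rng)
    m = G_values.shape[1]
    # each row of eps is one draw of the Rademacher vector (eps_1, ..., eps_m)
    eps = rng.choice([-1.0, 1.0], size=(n_draws, m))
    sups = np.max(eps @ G_values.T, axis=1)  # sup over the class, per draw
    return np.mean(sups) / m
```

For the two-function class {g, −g} with g ≡ 1 and m points, the estimate approaches E|Σ ε_i|/m ≈ √(2/(π m)), a classical closed-form benchmark.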
Theorem 3.8. Let Θ and Λ be two positive real numbers. We consider the family of hypotheses H_{Θ,Λ} given by the classifiers H_{u,ω,b} with ‖u‖ ≤ Λ and |b| ≤ Θ. Let δ > 0, R > 0 and m ∈ N*. We assume that the input signals lie almost surely in the L²-ball of radius R and that the covariance matrix A is positive definite. We denote by λ_min(A) its smallest eigenvalue. Let D be a distribution over X × {−1, 1}. Then, with probability at least 1 − δ, the generalisation bound (4) holds uniformly over H_{Θ,Λ}.
Proof. By applying Theorem 3.7 to the set of functions associated with H_{Θ,Λ}, together with the formula obtained in Proposition 3.4 (using the assumption that A is positive definite), we get that the claimed bound holds with probability at least 1 − δ, with the supremum taken over the set of admissible parameters. We now use the inequality ‖∫ f‖ ≤ ∫ ‖f‖ for matrix-valued maps f, together with the linearity of x ↦ ν_{x,u} and the bound ‖u‖ ≤ Λ. Using Jensen's inequality and Fubini's theorem, we obtain a bound on the Rademacher complexity. Using that the input paths live in the ball B_R, we conclude, which gives the desired inequality.
Remark 3.9. The use of the Rademacher complexity allows us to obtain a generalisation error bound that decays like 1/√m, which is the common rate of decay encountered in the classical supervised learning framework (with i.i.d. entries). This comes, however, at the cost of the technical assumption of the inputs being uniformly bounded in the L²-norm (which is arguably a realistic assumption). Similarly to the error bounds for the SVM algorithm, it could be possible to relax this assumption at the cost of a much slower rate of decay by using the notion of VC-dimension. For example, in the discrete case, such bounds on the VC-dimension of RNNs (as a function of the number of weights in the network and the length of the sequence, which would correspond here to the number of steps one uses to discretise an input path) have been obtained by Koiran and Sontag in [KS98].
The previous theorem allows us to quantitatively control the estimation error.
Corollary 3.10. Let Θ, Λ and R be positive real numbers. We assume that the input signals lie almost surely in the L²-ball of radius R and that the covariance matrix A is positive definite. Then the class H_{Θ,Λ} is agnostically PAC learnable through its empirical risk minimiser hypothesis H^ERM (defined in (2)), i.e., there exists a function m̃ : (0, 1)² → N such that for all ε, δ ∈ (0, 1), for every distribution D over X × {−1, 1} and every sample (X, V) ∼ D^m of size m ≥ m̃(ε, δ), we have, with probability at least 1 − δ,
R(H^ERM) ≤ inf_{H∈H_{Θ,Λ}} R(H) + ε.
More explicitly, one may take m̃(ε, δ) to be an integer m (ideally the smallest one) such that the right-hand side of the bound in Theorem 3.8 is at most ε/2.
Proof. Let ε, δ ∈ (0, 1), and let m̃(ε, δ) be the smallest such integer m. Let D be a distribution over X × {−1, 1} and (X, V) ∼ D^m a sample of size m ≥ m̃(ε, δ). Finally, note that by Theorem 3.8 and the definition of m̃(ε, δ), we have, with probability at least 1 − δ,
sup_{H∈H_{Θ,Λ}} |R(H) − R_{(X,V)}(H)| ≤ ε/2.
This completes the proof.
Remark 3.11. Corollary 3.10 shows an additional major advantage of the bound obtained in Theorem 3.8: it (quantitatively) guarantees that the empirical risk minimiser is the best-in-class hypothesis with high probability. The bounds discussed, for example, in Remark 3.9 aim to directly lower the risk by choosing parameters for the hypothesis that minimise the upper bound of the risk (the right-hand-side term in (4)). The efficiency of such a technique is thus very dependent on said bound being tight.

A study of the empirical risk
In this section, we aim to provide a better understanding of the empirical risk (in view of our future work on the non-linear case). We recall that in the pure robust stochastic case (i.e., where the covariance matrix A is positive definite) and given a sample (X, V) = (x_i, v_i)_{1≤i≤m}, the empirical risk of a hypothesis H_{u,ω,b} is given by the formula
R_{(X,V)}(H_{u,ω,b}) = (1/m) Σ_{1≤i≤m} Φ( −v_i (⟨ν_{x_i,u}, ω⟩ + b) / √(⟨ω, Aω⟩) ).
In the subsequent subsections, we will compare and draw parallels between the empirical risk minimisation (ERM) and the support vector machine (SVM) approaches, and we will show that stochastic (linear) RNNs keep a "partial signature" of the input path as a summary of the information about the path.
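For a fixed pre-processing map u, the empirical risk is an explicit smooth function of (ω, b) and can be handed directly to a generic optimiser; no unrolling of the network or back-propagation through time is needed. A toy sketch (the data, names and choice of a derivative-free method are ours):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def empirical_risk(params, nus, vs, A):
    """(1/m) sum_i Phi( -v_i (<nu_i, omega> + b) / sqrt(omega^T A omega) ),
    with params = (omega_1, ..., omega_n, b) and nus[i] = nu_{x_i, u}."""
    omega, b = params[:-1], params[-1]
    s = np.sqrt(omega @ A @ omega)
    return float(np.mean(norm.cdf(-vs * ((nus @ omega + b) / s))))

# toy data: two clouds of processed means with labels +1 / -1
nus = np.array([[2.0, 0.0], [2.5, 0.5], [-2.0, 0.0], [-2.5, -0.5]])
vs = np.array([1.0, 1.0, -1.0, -1.0])
A = np.eye(2)
res = minimize(empirical_risk, x0=np.array([1.0, 1.0, 0.5]),
               args=(nus, vs, A), method="Nelder-Mead")
```

In the paper's experiments, gradient descent over (u, ω, b) is used instead; the point here is only that the Φ-smoothed risk is a simple deterministic objective in the hyperplane parameters.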
4.1. An interpretation via margins. By making the change of variable α = A^{1/2}ω, one can rewrite
(⟨ν_{x,u}, ω⟩ + b) / √(⟨ω, Aω⟩) = (⟨A^{−1/2} ν_{x,u}, α⟩ + b) / ‖α‖.
Note that the absolute value of the quantity above is the distance of the "transformed mean" A^{−1/2} ν_{x,u} from the hyperplane H(α, b). Given such a hyperplane, let I_0 be the set of indices of input paths whose transformed means are correctly classified and m_0 := |I_0| its cardinality.
Define the corresponding margin ρ_0 as the distance between the hyperplane H(α, b) and the closest correctly classified transformed average: ρ_0 = min_{i∈I_0} d(A^{−1/2} ν_{x_i,u}, H(α, b)).
Then, for all i ∈ I_0, one has Φ( −v_i (⟨ν_{x_i,u}, ω⟩ + b) / √(⟨ω, Aω⟩) ) ≤ Φ(−ρ_0). Similarly, let I_1 be the complement of I_0 and ρ_1 the distance of the furthest misclassified average from the hyperplane: ρ_1 = max_{i∈I_1} d(A^{−1/2} ν_{x_i,u}, H(α, b)).
Then, for all i ∈ I_1, one has Φ( −v_i (⟨ν_{x_i,u}, ω⟩ + b) / √(⟨ω, Aω⟩) ) ≤ Φ(ρ_1). Using these inequalities, one can bound the empirical risk as follows:
R_{(X,V)}(H_{u,ω,b}) ≤ (m_0/m) Φ(−ρ_0) + ((m − m_0)/m) Φ(ρ_1).
Intuitively, decreasing the upper bound above requires a balance between, on the one hand, correctly classifying the averages A^{−1/2} ν_{x_i,u} with a large margin ρ_0 and, on the other hand, misclassifying as few averages as possible with the smallest worst "misclassification margin" ρ_1. This is reminiscent of the soft SVM algorithm (see for example [CV95, MRT12]), which is the approach adopted in [NKD+20] to classify the data according to the statistics of their classes. Let us recall that the soft SVM algorithm consists of solving a constrained convex optimisation problem (5), where a parameter λ ≥ 0 is to be freely chosen depending on the desired properties of the final classifier. Classically, this is equivalent to a dual (quadratic) problem over variables (θ_i)_{1≤i≤m}. Given its minimiser θ*, the solution of the primal SVM problem (5) can be recovered from θ* as a linear combination of the training data determining the direction of the separating hyperplane. First, note that this optimisation problem is no longer convex if we include the optimisation over the pre-processing map u. Second, while the solutions (for a fixed u) of the ERM and soft SVM may be similar in some generic situations, there are significant differences in behaviour and interpretation between the results of the two algorithms²:
• The solution of the ERM algorithm is sensitive to the number of inputs in each class (thus, in some way, "learning" the data-generating distribution), while that of the SVM only depends on the support vectors (Figure 4.1).
• Another key difference between these two algorithms lies in their respective objectives: the ERM attempts to find a hyperplane where most of the data is (correctly classified and) far away from said hyperplane, while the soft SVM attempts the same but only with respect to the support vectors. These procedures can sometimes lead to very different outputs (Figure 4.2).
• The two algorithms have different sensitivities to outliers and mislabelled training data (which we will discuss in the next subsection).
• Given the pre-processing map u, if one denotes the hyperplane parameters returned by the SVM algorithm by (ω_u, b_u), then the smoothness of the map u ↦ (ω_u, b_u) (and therefore that of u ↦ R_{(X,V)}(H_{u,ω_u,b_u})) is much less trivial to prove or even define, thus rendering the use of gradient-based optimisation over u problematic.
² In the following arguments, we show these differences for classification tasks for points in the plane. In our setting, this is equivalent to taking the dimensions r = n = 2, the pre-processing map u = Id and the input paths as constant paths.
Remark 4.1. In [NKD+20], the SVM algorithm is successfully used to separate classes that can be separated by their statistics (for example, the realisations of two Gaussian processes). A "good" pre-processing map u (based on a mathematical formula) is chosen beforehand, thus avoiding the use of gradient descent to optimise over this parameter.

4.2.
Robustness and further analysis of the ERM. Making again the change of variable α = A^{1/2}ω, we saw from the above that, informally, an ERM algorithm has the task of finding parameters (u, ω, b) such that all the transformed averages A^{−1/2} ν_{x_i,u} are correctly classified, i.e., v_i (⟨A^{−1/2} ν_{x_i,u}, α⟩ + b) > 0 for all i, and as distant from the hyperplane H(α, b) as possible (thus prompting Φ( −v (⟨ν_{x,u}, ω⟩ + b) / √(⟨ω, Aω⟩) ) to be as small as possible), while working under the constraint that the overall sum of the losses associated to each training input (i.e., the empirical risk) has to be as small as possible. The combination of the latter with the loss function involving the Gaussian CDF is the reason why this ERM algorithm differs from a simple classification of the transformed averages A^{−1/2} ν_{x_i,u}. For example, if we assume that the two clouds of transformed averages (one per class, denoted C_+ and C_−) are concentrated and well separated, and we then introduce an additional training input x* with label v* = 1 but such that A^{−1/2} ν_{x*,u} is much closer to C_− than to C_+, it is then very possible for the ERM to choose to misclassify x* rather than to correctly classify it (doing the latter might mean an increase in the empirical risk as the cloud C_− moves much closer to the new hyperplane H(α, b)). In practice, the label v* given to the path x* could be a wrong one (i.e., mislabelled training data). However, this does not necessarily alter the result of this ERM algorithm in a drastic way (which would be the case for the hard SVM algorithm or a soft SVM algorithm with a bad choice of the regularising parameter λ). We now analyse the bounds ‖u‖ ≤ Λ and |b| ≤ Θ in Theorem 3.8. While these were necessary to obtain the bounds in said theorem, they are also necessary for the ERM to converge in general. Let us first start with the bound ‖u‖ ≤ Λ.
As far as the separation of the means ν_{x_i,u} of the processed inputs is concerned, u/‖u‖ is responsible for geometrically separating these means as well as possible, while ‖u‖ is an amplifying factor that could be exploited for adjusting to noise and randomness in the evaluation maps once a perfect separation is achieved. For example, if (u, ω, b) are parameters of the algorithm such that all the averages ν_{x_i,u} are correctly classified, in the sense that v_i (⟨ν_{x_i,u}, ω⟩ + b) > 0 for all i, then lim_{α→+∞} R_{(X,V)}(H_{αu,ω,αb}) = 0. In other words, once a training algorithm finds a pre-processing map u that enables a correct hyperplane separation of (the weighted means ν_{x,u} of) the data, it will tend to linearly scale the norm of u to infinity, with two consequences: first, further distancing the means ν_{x_i,u} (or, more precisely, A^{−1/2} ν_{x_i,u}, as seen earlier) from the classifying hyperplane and, second, distancing these averages among themselves so that, with high probability, the random vectors y(T, x_i) take their values in similar non-intersecting elliptic domains (getting larger with α) centred at the ν_{x_i,u}'s (since the Gaussian random vectors y(T, x_i) have the same covariance matrix A). Hence, if there are no bounds on the norm of u and if perfect classification is possible, the (non-converging) algorithm will try to transition from a robust classification to an accurate one (more details below). The bound |b| ≤ Θ prevents, in some particular scenarios, the choice of hyperplanes whose distance from the origin tends to infinity. If we take for example the case where all the data belong to the same class (less extreme examples can also be exhibited), say "+1", and if (u, ω, b_0) are parameters of the algorithm such that all the averages ν_{x_i,u} are correctly classified, then lim_{b→+∞} R_{(X,V)}(H_{u,ω,b_0+b}) = 0.
Note also that the bounds on the norms of u and b are interchangeable with a rescaling of the covariance matrix A, as the classifier with parameters (αu, ω, αb) and covariance matrix A generates the same risk as the classifier with parameters (u, ω, b) and covariance matrix A/α². For this reason, we choose the bound Λ = 1 in the numerical experiments and control the covariance matrix through a noise scale to be chosen carefully (see Section 5).
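This interchangeability can be verified directly on the Gaussian-CDF form of the loss: since ν_{x,u} is linear in u, rescaling the mean and offset by α is equivalent to dividing the covariance by α². A quick numerical check (names are ours):

```python
import numpy as np
from scipy.stats import norm

def risk_term(nu, omega, b, A, v):
    """Single-input loss Phi( -v (<nu, omega> + b) / sqrt(omega^T A omega) )."""
    return norm.cdf(-v * (nu @ omega + b) / np.sqrt(omega @ A @ omega))

rng = np.random.default_rng(0)
nu, omega = rng.normal(size=3), rng.normal(size=3)
M = rng.normal(size=(3, 3))
A = M @ M.T + np.eye(3)  # a positive-definite covariance
b, v, alpha = 0.7, 1.0, 3.5

# parameters (alpha * u, omega, alpha * b) with covariance A:
# by linearity of u -> nu_{x,u}, the mean becomes alpha * nu
r1 = risk_term(alpha * nu, omega, alpha * b, A, v)
# parameters (u, omega, b) with covariance A / alpha^2
r2 = risk_term(nu, omega, b, A / alpha**2, v)
```

Both evaluations agree, confirming that fixing Λ = 1 and tuning a noise scale instead loses no generality.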
Finally, we discuss the condition of positive definiteness of the covariance matrix A and its role in the robustness of the classification algorithm. When A is positive definite, we have seen that the goal of the ERM is to solve the minimisation problem
min_{‖u‖≤Λ, |b|≤Θ, ω} (1/m) Σ_{1≤i≤m} Φ( −v_i (⟨ν_{x_i,u}, ω⟩ + b) / √(⟨ω, Aω⟩) ).
On the one hand, if the empirical risk is interpreted as a measure of the precision of the classification task on the training set, then we see that in this regime (A positive definite), full precision (null risk) is not achievable (due to the randomness of the evaluation map). The same applies to the classification of the data on a validation set, as the label is a random variable whose value will depend on the current simulation (of the Brownian motion and of the solution to the SDE). On the other hand, and as we have previously seen, note that even for a fixed pre-processing map u, the choice of the optimal parameters depends explicitly on all the training data, providing in this way a robustness against mislabelled data in the training set. For these reasons and the ones detailed above, we call this the robust or the stochastic regime.
If we now take A = 0, the objective of the ERM becomes the solution of the minimisation problem

min_{‖u‖≤Λ, |b|≤Θ, ω} (1/m) Σ_{i=1}^{m} 1_{{v_i(⟨ν_{x_i,u}, ω⟩ + b) ≤ 0}}.

The task at hand is a mere classification of the averages ν_{x_i,u} of the processed data. If there exists a pre-processing map u such that these averages are separable, then an ERM algorithm will achieve full precision on the training data set (in general, a unique solution does not exist unless we introduce an additional constraint, like maximising the margin of the classifying hyperplane as in SVMs). However, this classification will generally depend only on some support vectors ν_{x_i,u} (i.e. the averages closest to the classifying hyperplane) while ignoring the information provided by the rest of the training data, hence potentially becoming sensitive and vulnerable to mislabelling and to the data-generating distribution. We call such a regime the accurate regime. In the general case (A not positive definite but not necessarily null), the ERM may choose either the robust or the accurate regime depending on how well the data can be separated. If the averages ν_{x_i,u} are separable, the accurate regime is preferred, as it leads to a null empirical risk; otherwise, the ERM may choose a robust solution. We refer for example to [FFF18, RXY+20, TSE+19] for more information on the general topic of the trade-off between accuracy and robustness.
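In the accurate regime the network output is deterministic, so the empirical risk reduces to counting misclassified means. A minimal sketch, with placeholder means and labels of our own choosing:

```python
import numpy as np

def empirical_risk_A0(nus, labels, omega, b):
    """Empirical 0-1 risk in the accurate regime (A = 0): the output is
    deterministic, so the risk is the fraction of means nu_{x_i,u} lying
    on the wrong side of the hyperplane h_{omega,b}."""
    margins = labels * (nus @ omega + b)
    return np.mean(margins <= 0)

# Separable toy means: a perfect (null-risk) hyperplane exists.
nus = np.array([[1.0, 0.0], [2.0, 1.0], [-1.0, 0.5], [-2.0, -1.0]])
labels = np.array([1.0, 1.0, -1.0, -1.0])
omega, b = np.array([1.0, 0.0]), 0.0
risk = empirical_risk_A0(nus, labels, omega, b)   # null risk on this toy set
```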
4.3. Information retained by stochastic RNNs. In this subsection, we highlight that the ERM is equivalent to the minimisation of a functional of the signatures of the augmented input paths (t, X_t)_{t≤T}, where X_t = ∫_0^t x_s ds. We recall that the signature S(z) = (Z^n)_{n∈ℕ} of a path z defined over an interval [0, T] with values in a Banach space E is the sequence of its iterated integrals, i.e., for all s ≤ t,

Z^n_{s,t} = ∫_{s<t_1<···<t_n<t} dz_{t_1} ⊗ ··· ⊗ dz_{t_n}.

The role of the signature in RNNs should not come as a surprise, for two main reasons:
• The signature of a path is the only information needed from a path to solve a differential equation. This is the cornerstone of the theory of rough paths (cf. [Lyo98, LCL07]).
• Every path is uniquely characterised (up to what is called tree-like equivalence) by its signature [HL10]. This fact is the basis for a large body of research in machine learning for data streams: these streams are mapped via their (truncated) signatures to vectors in the tensor algebra, thus allowing one to use many of the classical machine learning techniques (like feed-forward neural networks) on data streams (e.g. [Gra13, KO19, LLN13]).
For more information, we refer the reader to the original work of K.T. Chen [Che57], the modern formulation in [LCL07] and the introduction to signature methods in machine learning in [CK16].
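For intuition, the first two signature levels of a sampled path can be computed segment by segment. The following sketch (our own illustration, not from the paper) builds them for a piecewise-linear path via Chen's identity and checks the shuffle identity relating the two levels:

```python
import numpy as np

def signature_levels_1_2(z):
    """First two levels of the signature of the piecewise-linear path through
    the sample points z (shape (num_points, dim)), built segment by segment
    with Chen's identity."""
    dim = z.shape[1]
    s1 = np.zeros(dim)
    s2 = np.zeros((dim, dim))
    for dz in np.diff(z, axis=0):
        # Chen's identity for appending a linear segment with increment dz:
        # level 2 picks up s1 (x) dz plus the segment's own term dz (x) dz / 2.
        s2 += np.outer(s1, dz) + 0.5 * np.outer(dz, dz)
        s1 += dz
    return s1, s2

# Level 1 is the total increment; the symmetric part of level 2 satisfies
# the shuffle identity s2 + s2^T = s1 (x) s1 for any path.
rng = np.random.default_rng(0)
z = rng.standard_normal((20, 3)).cumsum(axis=0)
s1, s2 = signature_levels_1_2(z)
```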
We now reformulate the ERM problem in a way that fully separates the tunable parameters of the network from the information provided by the paths in the training dataset.
Theorem 4.2. With a suitable sequence space and inner product, the ERM is equivalent to an optimisation problem in which the training paths enter only through a functional C_A of inner products between the tunable parameters and path-dependent sequences.

Proof. We write the problem at hand in a simpler equivalent form. Note now that we can expand the mean ν_{x,u} into a series, so that, in the inner product ⟨α, ν_{x,u}⟩ + β, one can separate the tunable parameters from a functional of the input signals. This concludes the proof.
where X_· := ∫_0^· x(s) ds denotes a primitive of x. Hence the sequence can be obtained from the signature of the path (t, X_t)_{t≤T}.
Theorem 4.2 highlights the information retained by a stochastic RNN (with the architecture and loss function chosen as described in this paper). More explicitly, we have the following theorem.
Theorem 4.4. Both during training (solving the ERM problem) and classification (generating labels for unseen data), the continuous-time stochastic RNN (with linear activation function) is uniquely determined by the partial signature S of the input path, defined by

S(x) := ( ∫_0^T ((T − s)^k / k!) x_s ds )_{k∈ℕ}.    (7)

Proof. We have already shown in Theorem 4.2 that the ERM problem can be written as a minimisation problem of a functional of the partial signatures of the training paths (and that no other information about said paths is required). If we denote by (u, ω, b) some possible parameters of an ERM classifier H_ERM and given an input path x, then the RNN generates the label v = sign(⟨y_u(T, x), ω⟩ + b) with y_u(T, x) ∼ N(ν_{x,u}, A). Following the proof of Theorem 4.2, ν_{x,u} can also be expressed as a functional of S(x) only:

ν_{x,u} = Σ_{k≥0} W_0^k u S_k(x).

This completes the proof of the claims.
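The expansion can be checked numerically. The sketch below (our own illustration, with an arbitrary toy path, connectivity matrix W and pre-processing map u) assumes the variation-of-constants form ν_{x,u} = ∫_0^T e^{W(T−s)} u x_s ds suggested by the matrix-exponential integrals mentioned in Section 5, and compares it with the truncated partial-signature expansion:

```python
import numpy as np
from math import factorial
from scipy.linalg import expm

def trapezoid(y, ts):
    """Trapezoidal rule along axis 0 on a uniform grid."""
    dt = ts[1] - ts[0]
    return dt * (y.sum(axis=0) - 0.5 * (y[0] + y[-1]))

rng = np.random.default_rng(1)
n, r, T, N = 4, 2, 1.0, 12
W = rng.standard_normal((n, n)) / (2 * np.sqrt(n))   # small norm for stability
u = rng.standard_normal((n, r))

ts = np.linspace(0.0, T, 2001)
x = np.stack([np.sin(2 * np.pi * ts), np.cos(np.pi * ts)], axis=1)  # toy path

# Partial signature S_k(x) = int_0^T ((T - s)^k / k!) x_s ds, k = 0..N.
S = [trapezoid(((T - ts)[:, None] ** k / factorial(k)) * x, ts) for k in range(N + 1)]

# Direct mean: nu_{x,u} = int_0^T exp(W (T - s)) u x_s ds.
integrand = np.stack([expm(W * (T - s)) @ u @ xs for s, xs in zip(ts, x)])
nu_direct = trapezoid(integrand, ts)

# Expansion: nu_{x,u} ~ sum_{k<=N} W^k u S_k(x), with factorially small error.
nu_sig = sum(np.linalg.matrix_power(W, k) @ u @ S[k] for k in range(N + 1))
```

At this truncation order the two computations agree to many digits, since the tail of the exponential series decays factorially.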
We believe that a result similar to Theorem 4.4 holds for generic RNNs, i.e., that continuous-time RNNs (even with non-linear activation functions) can be viewed as kernel machines involving the (full) signatures of the input paths. We refer, for example, to [Lim21] (and the references therein) for first results in this direction.
Even though the RNN uses only the partial signature (7) of the time-lifted input paths (instead of the full signature), it turns out that it is still a faithful representation of continuous paths.

Proposition 4.5. The partial signature map S is injective on the set of continuous paths.

Proof. Without loss of generality, we consider the case of real-valued paths. By a simple change of variable and a reparametrisation of the path, it is equivalent to show that the map x ↦ (∫_0^T s^k x(s) ds)_{k∈ℕ} is injective. As S is linear, it is also equivalent to show that S(x) = 0 if and only if x = 0. Let then x be a continuous real-valued path such that S(x) = 0. Then, for every polynomial function P, one has ∫_0^T P(s) x(s) ds = 0. Let ε > 0 be arbitrary and let P be a polynomial function such that ‖x − P‖_{∞,[0,T]} ≤ ε. Then one has

∫_0^T x²(s) ds = ∫_0^T x(s)(x(s) − P(s)) ds ≤ ε ∫_0^T |x(s)| ds ≤ ε T ‖x‖_{∞,[0,T]}.

Therefore, since ε > 0 is arbitrary, ∫_0^T x²(s) ds = 0, from which we conclude that indeed x = 0.

Remark 4.6. Despite the partial signature transform being injective, it is still possible for the linear RNN not to be able to distinguish two paths x and x̃ if the inner products of their partial signatures with the matrices (W_0^k u)_{k≥0} are the same, i.e., if

Σ_{k≥0} W_0^k u S_k(x) = Σ_{k≥0} W_0^k u S_k(x̃).

Replacing the sequence (W_0^k u)_{k≥0} by another whose entries can be chosen independently is the cornerstone of, and the reason behind, the power of signature techniques. However, these techniques are confronted with computational issues, and therefore the signature has to be restricted to low orders. RNNs are, however, able to surmount this obstacle by increasing the dimension n and thus allowing for more degrees of freedom (while signatures are computed implicitly, as shown above, through the recursive architecture). Using the factorial decay of the elements of the partial signature sequence (which is trivial and explicit in our case), we can then obtain a good approximation of the ERM by truncating the signature. To show such a result, we first prove the following technical lemma.
Lemma 4.7. Let N ∈ ℕ*, u ∈ ℝ^{n×r}, ω ∈ ℝ^n and x ∈ L¹([0, T], (ℝ^r, ‖·‖₂)). Then

|⟨ν_{x,u} − Σ_{k=0}^{N} W_0^k u S_k(x), ω⟩| ≤ ‖ω‖ ‖u‖ ‖x‖_{L¹} ((‖W_0‖T)^{N+1}/(N+1)!) e^{‖W_0‖T}.

Proof. Expanding the difference that we want to bound, we get

ν_{x,u} − Σ_{k=0}^{N} W_0^k u S_k(x) = Σ_{k>N} W_0^k u S_k(x).

For every k ∈ ℕ, one trivially has ‖S_k(x)‖ ≤ (T^k/k!) ‖x‖_{L¹}, while the tail of the exponential series satisfies Σ_{k>N} (‖W_0‖T)^k/k! ≤ ((‖W_0‖T)^{N+1}/(N+1)!) e^{‖W_0‖T}. The combination of the three arguments above gives the desired result.
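To get a feel for how fast the truncation error vanishes, the following sketch evaluates a crude tail bound of the exponential series numerically (the values of T, R and ‖W_0‖ are placeholders of our own, not taken from the paper):

```python
from math import factorial

# Crude tail bound: R * sum_{k > N} (||W_0|| T)^k / k!, truncated after 60
# extra terms (the remainder beyond that is negligible here).
T, R, W_norm = 1.0, 2.0, 1.5   # placeholder values, not from the paper

def tail_bound(N):
    return R * sum((W_norm * T) ** k / factorial(k) for k in range(N + 1, N + 60))

bounds = [tail_bound(N) for N in (0, 2, 4, 8, 12)]
# The remainder drops super-exponentially: each extra order divides the
# bound by roughly (N + 1) / (||W_0|| T).
```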
We can now show how one may exploit the signature-based expansion of the empirical risk in order to obtain an approximate solution to the ERM. Assume that all input paths take their values in the L¹-ball of radius R. Given a sample (X, V), let (u_0, ω_0, b_0) be a solution to the ERM problem and (ū, ω̄, b̄) be a solution to the "truncated" ERM problem, in which the means ν_{x_i,u} are replaced by their truncated partial-signature expansions.

Proof. For lighter expressions, we will introduce the following notation. The inequality R_{(X,V)}(H_{u_0,ω_0,b_0}) ≤ R_{(X,V)}(H_{ū,ω̄,b̄}) is a direct consequence of the definition of (u_0, ω_0, b_0). We decompose the difference of these two terms in the following way. By the definition of (ū, ω̄, b̄), the second difference is non-positive. For each i ∈ [[1, m]], the following bound holds by Lemma 4.7 and the assumptions on the parameters ω and u and on the path x_i. The result is then directly obtained by applying the above bounds to the vectors (ū, ω̄, b̄) and (u_0, ω_0, b_0).
Theorem 4.8 demonstrates the potential power of the signature-based decomposition of the empirical risk. In our setting, this technique avoids, for example, the expensive computation of ν_{x,u} (as integrals of time-dependent matrix exponentials) for each path in the training set and replaces it with the computation and storage of the powers (W_0^k)_{k≤N}. However, in this still simple setting, we will not base our optimisation techniques on this method and reserve its application for the non-linear case, where exact formulae are not available.

Numerical results
In this section, we present the results of some numerical experiments run on real and synthetic data. The focus is on the effect of noise on the accuracy and the robustness of the RNN, and on the verification of the theoretical bound obtained in Theorem 3.8.
The application to real-world data will be demonstrated on the Japanese Vowels dataset 4. This dataset contains speech recordings of nine male subjects pronouncing a combination of Japanese vowels. The recordings are in the form of 12-dimensional (r = 12) discrete time series of varying lengths. To create a binary classification problem for continuous-time paths, we restrict ourselves to the recordings of the first two subjects and reparametrise the time series to the unit interval. For the synthetic data, we consider 5-dimensional trigonometric polynomials (r = 5). We take the dimension of the network to be n = 50 (and the dimension d of the Brownian motion is taken to be equal to n). This is quite large for our simple synthetic dataset and appropriate for the Japanese Vowels dataset, but still small compared to typical reservoirs, where n can be larger than 500 to make the reservoir self-averaging. The entries of the connectivity matrix W are drawn from independent Gaussian distributions N(0, 0.9/√n) (the choice of the value 0.9 is to ensure the numerical stability of the system). To generate the noise matrix Σ, we first draw positive values λ_1, …, λ_n uniformly over (0, 1), then draw uniformly a random orthogonal matrix U to output the matrix

Σ = δ U diag(λ_1, …, λ_n) U^⊤.    (8)

The parameter δ is a noise scale that has to be chosen carefully. A large value of δ makes the noise dominant and the RNN unable to reach an acceptable accuracy, while a too small value makes the system less robust.
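The random setup just described can be sketched as follows (the exact formula for Σ reflects our reading of the construction in Equation (8); `scipy.stats.ortho_group` is one standard way to draw U uniformly):

```python
import numpy as np
from scipy.stats import ortho_group

rng = np.random.default_rng(42)
n, delta = 50, 1.5

# Connectivity matrix: i.i.d. Gaussian entries with standard deviation
# 0.9 / sqrt(n), keeping the spectral radius near 0.9 for numerical stability.
W = rng.normal(0.0, 0.9 / np.sqrt(n), size=(n, n))

# Noise matrix: eigenvalues drawn uniformly in (0, 1), conjugated by a
# Haar-distributed orthogonal matrix and scaled by the noise level delta.
lam = rng.uniform(0.0, 1.0, size=n)
U = ortho_group.rvs(dim=n, random_state=42)
Sigma = delta * U @ np.diag(lam) @ U.T
```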
The ERM problem will be solved numerically using a gradient descent (GD) algorithm. Recall that, given a training sample (X, V), the empirical risk is viewed as a function of the parameters u, ω and b. As discussed in Subsection 4.2, we impose the bound ‖u‖ ≤ 1. To achieve a superior classification performance, we do not impose an a priori bound on b. Let us first say a few words about some simplifications in the implementation of the gradient descent algorithm in this simple linear case. Consider the loss function associated with a single observation (x, v) as a function of u, ω and b, where e_{i,j} stands for the canonical n × r basis matrix (δ_{ik}δ_{jl})_{k≤n,l≤r} and z_i stands for the i-th coordinate of the vector z. Apart from its simple expression, one can see that the averages (ν_{x,e_{i,j}})_{i≤n,j≤r} need to be computed for each example x in the training set in order to run the algorithm. One can then exploit these to compute the mean ν_{x,u} at each iteration of the algorithm as a linear combination of the means (ν_{x,e_{i,j}})_{i≤n,j≤r}, instead of a full new computation, which can be heavy due to the presence of matrix exponentials. In our implementation, we used SciPy's minimize function with nonlinear constraints 5, which is a gradient descent algorithm with suitably adjusted step sizes (based on the Sequential Least Squares Programming algorithm detailed in [K + 88]). Due to the non-convexity of the problem, the algorithm is run a few times (typically 5 times) with random initialisations of the parameters in order to avoid bad local minima. The solution with the smallest loss function is then chosen.
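The constrained fit can be sketched as follows. This is a toy surrogate of our own: the basis means are random placeholders, a smooth logistic loss stands in for the empirical risk, and only the SLSQP constraint ‖u‖ ≤ 1 and the linearity of ν_{x,u} in u mirror the actual implementation:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, r, m = 6, 2, 40

# Hypothetical precomputed basis means nu_{x, e_ij}: one (n,) vector per
# canonical basis matrix e_ij and per training path; nu_{x,u} is linear in u.
basis_means = rng.standard_normal((m, n, r))
labels = rng.choice([-1.0, 1.0], size=m)

def unpack(p):
    u = p[: n * r].reshape(n, r)
    return u, p[n * r : n * r + n], p[-1]

def loss(p):
    u, omega, b = unpack(p)
    nus = np.einsum("mnr,nr->mn", basis_means, u)   # nu_{x_i,u} for every path
    margins = labels * (nus @ omega + b)
    return np.logaddexp(0.0, -margins).mean()       # smooth surrogate risk

# Nonlinear constraint ||u|| <= 1, in the dict form accepted by SLSQP.
cons = [{"type": "ineq", "fun": lambda p: 1.0 - np.linalg.norm(unpack(p)[0])}]
p0 = 0.1 * rng.standard_normal(n * r + n + 1)       # random feasible start
res = minimize(loss, p0, method="SLSQP", constraints=cons)
u_hat, omega_hat, b_hat = unpack(res.x)
```

In the paper's setting, one would run this from several random initialisations and keep the solution with the lowest loss.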
The experiments consist of three parts: (1) First, we check the accuracy achieved on a fixed-size testing set (30% of the total available data) while increasing the size of the training set. Recall that the computed accuracies are random realisations depending on the results of the simulations (of the solution to the SDE (1)).
(2) Second, we verify the theoretical bound of Theorem 3.8. Here, the true risk R(H) is computed as an empirical risk over the testing set. The value R is computed as the maximum of the L²-norms of the paths in the whole dataset, while Θ is computed as the largest value obtained for |b| in the accuracy experiment detailed above. We take the value δ = 0.01, so that the bound holds with probability 0.99. (3) Third, we verify the robustness of the learning in the following way: we train the network on a corrupted dataset, where a portion of the labels of the training inputs has been changed, then compute the resulting accuracy on the test set. This robustness test is quite simple and does not show the full potential of the architecture, but we suspect that such robustness also holds when the input paths themselves are perturbed. For other measures of robustness, we refer the reader to the previously cited references and to the additional references [BNL12, BNS + 06, DDSV04], which look into the robustness of training with corrupted datasets.
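The label-corruption step of the robustness test is straightforward to implement; a minimal sketch (function name and setup are our own):

```python
import numpy as np

def corrupt_labels(labels, fraction, rng):
    """Flip a given fraction of binary (+1/-1) labels, as in the robustness
    test: the network is trained on the corrupted labels and then evaluated
    on the clean test set."""
    labels = labels.copy()
    m = len(labels)
    idx = rng.choice(m, size=int(fraction * m), replace=False)
    labels[idx] *= -1
    return labels

rng = np.random.default_rng(0)
v = np.ones(100)
v_corrupt = corrupt_labels(v, 0.10, rng)   # exactly 10% of labels flipped
```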
Table 1. Numerical values for the Japanese Vowels dataset. The first column shows the noise scale (see Equation (8)); the next four columns give, respectively, the accuracy of classification by a noiseless RNN (trained as a stochastic RNN), then the lowest, the largest and the average accuracy given by the (stochastic) RNN over 10 simulations. The following four columns show the same numbers after training the RNN with 10% of mislabelled data. The last column shows the ratio of the average accuracy in the robustness test (after training with some corrupted labels) to the normal average accuracy (after training with the correct labels).

We start with the experiments on the Japanese Vowels dataset, with three choices of noise scales δ ∈ {1, 1.5, 2.5} in (8). The resulting plots are shown in Figure 5.1. In the accuracy plots: • the solid lines show the accuracy based on classifying the averages ν_{x,u} of the paths in the test dataset with the obtained hyperplane h_{ω,b}. In other words, the generated label for a path x is v = sign(⟨ν_{x,u}, ω⟩ + b). As ν_{x,u} is the hidden state of a noiseless RNN 6, we refer to this scenario as noiseless accuracy or NRNN.
• The dotted lines show the accuracy of 5 simulated labels produced by the stochastic RNN.
Here, the generated label for a path x is v = sign(⟨y_u(T, x), ω⟩ + b), where y_u(T, x) is simulated as y_u(T, x) ∼ N(ν_{x,u}, A). We refer to this scenario as simulated stochastic accuracy.
Note that, because of the discontinuity of the hyperplane classifying function, the former is not expected to be an average of the latter. For readability, we report some numerical values for 10 simulations in Table 1:
• the first column refers to the value of the noise scale δ in (8),
• the group of columns "Accuracy" reports the results of the experiment after training with the entire training set, while the group of columns "Robustness test accuracies" reports the accuracy of the network after training with the entire training set with 10% of mislabelled data,
• the column NRNN gives the accuracy of classification of the averages ν_{x,u} with the hyperplane h_{ω,b} (these labels are given as v = sign(⟨ν_{x,u}, ω⟩ + b)),
• the columns "min", "max" and "avg." give, respectively, the smallest, the largest and the average observed accuracy over the 10 simulations (the labels are simulated as v = sign(⟨y, ω⟩ + b) with y ∼ N(ν_{x,u}, A)).
As usually observed when working with PAC-bounds based on the Rademacher complexity, the theoretical bound obtained in Theorem 3.8 can be very loose. We note, however, in Figure 5.1 that the difference gets smaller with larger noise scales. In Table 1, we note an almost perfect noiseless accuracy, even in the presence of noise. A striking fact, however, is that, on average and in the presence of noise, there is no remarkable loss in the performance of the RNN when training with some mislabelled data.
Training with synthetic data showed a higher tolerance to noise. In Figure 5.2, we show the results of the same experiments as before with noise scales δ ∈ {2, 4, 6}. Numerical values are reported in Table 2 (with 15% of mislabelled data in the robustness test). We again observe the validity of the theoretical PAC-bound obtained in Theorem 3.8 (with the difference getting smaller at higher noise scales) and notice the same persistence in the quality of the labels generated by the RNN when training with mislabelled data.