Convergence analysis of gradient-based iterative algorithms for a class of rectangular Sylvester matrix equations based on Banach contraction principle

We derive an iterative procedure for solving a generalized Sylvester matrix equation AXB + CXD = E, where A, B, C, D, E are conforming rectangular matrices. Our algorithm is based on gradients and the hierarchical identification principle. We convert the matrix iteration process to a first-order linear difference vector equation with a matrix coefficient. The Banach contraction principle reveals that the sequence of approximate solutions converges to the exact solution for any initial matrix if and only if the convergence factor belongs to an open interval. The contraction principle also gives the convergence rate and the error analysis, governed by the spectral radius of the associated iteration matrix. We obtain the fastest convergence factor, for which the spectral radius of the iteration matrix is minimized. In particular, we obtain iterative algorithms for the matrix equation AXB = C, the Sylvester equation, and the Kalman–Yakubovich equation. We give numerical experiments of the proposed algorithm to illustrate its applicability, effectiveness, and efficiency.


Introduction
It is well known that linear matrix equations play crucial roles in control theory and related areas. Indeed, certain problems concerning the analysis and design of control systems (e.g., existence of solutions or controllability/observability of the system) are converted to properties of associated matrix equations; see, for example, [1] and [2]. Such matrix equations are particular cases of, or closely related to, the generalized Sylvester matrix equation

AXB + CXD = E, (1)

where A, B, C, D, E are given matrices, and X is an unknown matrix. This equation includes the equation AXB = E, the Lyapunov equation AX + XA^T = E, the (continuous-time) Sylvester equation AX + XB = E, and the Kalman–Yakubovich equation, or discrete-time Sylvester equation, AXB + X = E. The generalized Sylvester equation naturally arises in robust control, singular system control, neural networks, and statistics; see, for example, [3, 4]. An important particular case of (1), the Sylvester equation, also has applications to image restoration and numerical methods for implicit ordinary differential equations; see, for example, [5, 6]. Let us discuss how to solve (1) via direct Kronecker linearization. Applying the vectorization operator vec(·) reduces (1) to the vector-matrix equation Px = b, where P = B^T ⊗ A + D^T ⊗ C, x = vec X, and b = vec E. Here the symbol ⊗ denotes the Kronecker product. Thus Eq. (1) has a unique solution if and only if P is nonsingular. However, if the dimensions of the matrices are large, then this approach leads to computational difficulty, so it is only applicable to small-dimensional matrices. For more information about analytical methods for solving such linear matrix equations, see, for example, [1, Ch. 12], [7, Ch. 4], and [8, Sect. 7.1]. Another technique transforms the coefficient matrix into a Schur or Hessenberg form via an orthogonal transformation; see [9, 10]. For large matrix systems, iterative methods for solving matrix equations have received much attention.
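As a quick illustration of the direct Kronecker linearization just described, here is a minimal Python/NumPy sketch. The function name and the random test data are our own, not from the paper; NumPy is used in place of the paper's MATLAB environment.

```python
import numpy as np

def solve_sylvester_direct(A, B, C, D, E):
    """Solve AXB + CXD = E by direct Kronecker linearization."""
    n, p = A.shape[1], B.shape[0]
    # vec(AXB) = (B^T kron A) vec X and vec(CXD) = (D^T kron C) vec X,
    # so the matrix equation becomes P vec X = vec E.
    P = np.kron(B.T, A) + np.kron(D.T, C)
    x = np.linalg.solve(P, E.flatten(order="F"))  # vec stacks columns
    return x.reshape((n, p), order="F")

rng = np.random.default_rng(0)
A, C, B, D = (rng.standard_normal((3, 3)) for _ in range(4))
X_true = rng.standard_normal((3, 3))
E = A @ X_true @ B + C @ X_true @ D
X = solve_sylvester_direct(A, B, C, D, E)
```

As the text notes, forming P explicitly is only practical for small dimensions, since P is np × np.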
There are several ideas for formulating an iterative procedure to solve Eq. (1) and its particular cases, for example, block successive overrelaxation [11], matrix sign function [12], block recursion [13, 14], Krylov subspace [15, 16], and truncated low-rank methods [17]. A group of iterative methods, called Hermitian and skew-Hermitian splitting (HSS) methods, relies on the fact that every square complex matrix can be written as the sum of its Hermitian and skew-Hermitian parts. Several variants of HSS have recently appeared, namely, the generalized modified HSS (GMHSS) method [18], the accelerated double-step scale splitting (ADSS) method [19], the preconditioned HSS (PHSS) method [20], and the four-parameter positive skew-Hermitian splitting (FPPSS) method [21]. The idea of conjugate gradient (CG) also leads to several finite-step procedures for obtaining the exact solution of such linear matrix equations. The principle of CG is to construct, from the gradients of the associated quadratic function, an orthogonal basis of search directions along which the iterates approach the exact solution fastest. There are several variants of CG for solving such linear matrix equations, for example, the generalized conjugate direction (GCD) method [22], the conjugate gradient least-squares (CGLS) method [23], and generalized product-type methods based on the biconjugate gradient (GPBiCG) method [24]. See the survey [8] and references therein for more information.
A group of gradient-based iterative algorithms relies on the ideas of the hierarchical identification principle and minimization of quadratic norm-error functions; see, for example, [25][26][27][28][29][30][31]. Convergence analysis for such algorithms depends on the Frobenius norm ‖·‖_F, the spectral norm ‖·‖_2, the spectral radius ρ(·), and the condition number κ(·) of the associated iteration matrix. Let us focus on the following iterative algorithms to approximate the unique solution of Eq. (1) when all A, B, C, D are square matrices.

Algorithm 1.1 ([32])
The gradient iterative algorithm (GI) for (1): If we choose the convergence factor τ to satisfy condition (2), then X(k) converges to the exact solution for any initial values X_1(0) and X_2(0). Numerical simulations in [25] reveal that Algorithm 1.1 is more efficient than the B-Q algorithm [12].
In [33], Algorithm 1.1 was shown to be applicable under a weaker condition (3), so that the range of τ is wider than that of (2).
There are many iterative algorithms for the Sylvester equation. The first solver is the gradient iterative algorithm (GI), introduced by Ding and Chen [32]. Niu et al. [34] introduced the relaxed gradient iterative algorithm (RGI), that is, GI with a relaxation parameter ω ∈ (0, 1). Wang et al. [35] modified the GI algorithm so that the information X_1(k) is fully used to update X(k − 1); the result is called the MGI algorithm. Further variants have recently been proposed [37, 38].
In this paper, we introduce an iterative method for solving the generalized Sylvester equation (1), for which the matrices A, B, C, D, E are not necessarily square. Our algorithm is based on gradients and the hierarchical identification principle. The algorithm involves only one parameter, the convergence factor θ, and only one initial value. To perform a convergence analysis of the algorithm, we use analysis on a complete metric space, together with matrix analysis. We convert the matrix iteration process to a first-order linear difference vector equation with a matrix coefficient. Then we apply the Banach contraction principle to show that the proposed algorithm converges to the unique solution for any initial value if and only if 0 < θ < 2/‖P‖_2^2. The range of the parameter is wider than those of [25, 33]. The convergence rate of the proposed algorithm is governed by the spectral radius of the associated iteration matrix. We also discuss error estimates; in particular, the error at each iteration becomes smaller than the previous one. The fastest convergence factor is determined so that the spectral radius of the iteration matrix is minimal. Moreover, we carry out a convergence analysis of gradient-based iterative algorithms for the equation AXB = C, the Sylvester equation, and the Kalman–Yakubovich equation. We also provide numerical simulations to illustrate our results for the matrix equation (1) and the Sylvester equation. We compare the efficiency of our algorithm with the direct Kronecker linearization and recent algorithms, namely, the GI, LSI, RGI, MGI, JGI, AJGI1, and AJGI2 algorithms.
The rest of the paper is organized as follows. We derive a gradient-based iterative algorithm in Sect. 2. We then analyze the convergence of the algorithm in Sect. 3. Iterative algorithms for particular cases of (1) are investigated in Sect. 4. We illustrate and discuss numerical simulations of the algorithm in Sect. 5. Finally, we conclude the work in Sect. 6.

Deriving a gradient-based iterative algorithm for the generalized Sylvester matrix equation
We denote by M_{m,n} the set of m × n real matrices and set M_n := M_{n,n}. In this section, we derive an iterative algorithm based on gradients to find a matrix X ∈ M_{n,p} satisfying

AXB + CXD = E. (4)

Here we are given A, C ∈ M_{m,n}, B, D ∈ M_{p,q}, and E ∈ M_{m,q}, where m, n, p, q ∈ N are such that mq = np.
Recall that equation (4) has a unique solution if and only if the square matrix P = B^T ⊗ A + D^T ⊗ C is invertible. In this case, the (vector) solution is given by vec X = P^{-1} vec E.
To derive an iterative procedure for solving (4), we recall the hierarchical identification principle from [33]. Define two matrices M := E − CXD and N := E − AXB.
In view of (4), we would like to solve the two subsystems

AXB = M and CXD = N. (5)

We will minimize the following quadratic norm-error functions:

f_1(X) = (1/2)‖AXB − M‖_F^2 and f_2(X) = (1/2)‖CXD − N‖_F^2. (6)

Now we deduce their gradients as follows:

∂f_1/∂X = A^T (AXB − M) B^T. (7)

Similarly, we have

∂f_2/∂X = C^T (CXD − N) D^T. (8)

Let X_1(k) and X_2(k) be the estimates or iterative solutions of system (5) at iteration k. We introduce a step-size parameter τ ∈ R and a relaxation parameter ω ∈ (0, 1). From the gradient formulas (7) and (8) we derive recursive formulas for X_1(k) and X_2(k), where, by the hierarchical identification principle, the unknown variable X is replaced by its estimate at iteration k − 1. Instead of taking the arithmetic mean of X_1(k) and X_2(k) as in Algorithm 1.1, our algorithm computes the weighted arithmetic mean ωX_1(k) + (1 − ω)X_2(k). By introducing the parameter θ = τω(1 − ω) we get the following iterative algorithm.
Note that our algorithm avoids duplicate computations by introducing the residual F(k) at each iteration. To stop the process, we can impose a stopping rule such as ‖F(k)‖_F < ε, where ε is a chosen permissible error. The convergence of Algorithm 2.1 relies on the convergence factor θ, which will be determined in the next section. Note that the algorithm requires only one parameter and one initial value and uses less computing time than the other gradient-based algorithms mentioned in the Introduction.
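For concreteness, here is a minimal NumPy sketch of a gradient-based iteration of this type. The combined update X(k) = X(k − 1) + θ(A^T F(k) B^T + C^T F(k) D^T), with residual F(k) = E − A X(k−1) B − C X(k−1) D, is our reading of the algorithm described above, and the well-conditioned test data are our own.

```python
import numpy as np

def gradient_iterate(A, B, C, D, E, theta, X0, iters):
    # One residual F(k) per step, reused in both gradient terms
    # (this avoids the duplicate computations mentioned above).
    X = X0.copy()
    for _ in range(iters):
        F = E - A @ X @ B - C @ X @ D
        X = X + theta * (A.T @ F @ B.T + C.T @ F @ D.T)
    return X

rng = np.random.default_rng(1)
eye = np.eye(3)
A = eye + 0.05 * rng.standard_normal((3, 3))
C = eye + 0.05 * rng.standard_normal((3, 3))
B = eye + 0.05 * rng.standard_normal((3, 3))
D = eye + 0.05 * rng.standard_normal((3, 3))
X_true = rng.standard_normal((3, 3))
E = A @ X_true @ B + C @ X_true @ D

# Convergence factor inside (0, 2/||P||_2^2); here we take the value
# 2/(lambda_min + lambda_max) of P^T P, which Sect. 3 shows is optimal.
P = np.kron(B.T, A) + np.kron(D.T, C)
lam = np.linalg.eigvalsh(P.T @ P)
theta = 2.0 / (lam[0] + lam[-1])
X = gradient_iterate(A, B, C, D, E, theta, np.zeros((3, 3)), 1000)
```

Note that the iteration itself never forms the np × np matrix P; P appears here only to pick a convergence factor.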

Convergence analysis of the algorithm
In this section, we analyze the convergence of Algorithm 2.1. We convert the matrix iteration process to a first-order linear difference vector equation whose coefficient is a contraction matrix. The contraction property then yields the convergence criteria, convergence rate, and error estimates of the algorithm.
To analyze this algorithm, we recall useful facts in matrix analysis.
Lemma 3.1 (e.g. [7]) For any matrices A and B of conforming dimensions, we have

Convergence criteria
From Algorithm 2.1 we start by considering the error matrix X̃(k) := X(k) − X. We will show that X̃(k) → 0 or, equivalently, vec X̃(k) → 0 as k → ∞. Now we convert the matrix iteration process to a first-order linear difference vector equation with a matrix coefficient. Indeed, we have

vec X̃(k) = P_θ vec X̃(k − 1), (9)

where P_θ = I_{np} − θP^T P. Denoting u(k) = vec X̃(k) for k ∈ N, we obtain a first-order linear difference vector equation, as desired. Note that iteration (9) is also the Picard iteration u(k) = Tu(k − 1), where T is the self-mapping on R^{np} defined by Tx = P_θ x. We will find properties of T guaranteeing that the iteration converges to the fixed point u* = 0 of T for an arbitrary initial point u(0). In fact, this is guaranteed by the Banach contraction principle:

Theorem 3.2 (Banach contraction principle) Let (M, d) be a complete metric space, and let T : M → M be a contraction, that is, d(Tx, Ty) ≤ αd(x, y) for all x, y ∈ M with a constant 0 ≤ α < 1. Then T has a unique fixed point x*, and the Picard iteration x_{k+1} = Tx_k converges to x* for any initial point x_0. The following estimates describe the convergence rate:
(i) d(x_{k+1}, x*) ≤ αd(x_k, x*) for all k;
(ii) d(x_k, x*) ≤ α^k/(1 − α) d(x_1, x_0) (prior estimate);
(iii) d(x_{k+1}, x*) ≤ α/(1 − α) d(x_{k+1}, x_k) (posterior estimate).

Now we look for conditions on P_θ making the mapping T a contraction. For each x ∈ R^{np}, we have by Lemma 3.1 that

‖Tx‖_2 = ‖P_θ x‖_2 ≤ ‖P_θ‖_2 ‖x‖_2 = ρ(P_θ)‖x‖_2.

The last equality holds since P_θ is a symmetric matrix. It follows that

‖Tx − Ty‖_2 ≤ ρ(P_θ)‖x − y‖_2 for all x, y ∈ R^{np}.

Thus, if ρ(P_θ) < 1, then T is a contraction relative to the metric induced by ‖·‖_F. Note that further characterizations of matrix contractions, involving (induced) matrix norms, are given, for example, in [40]. Since P_θ is a symmetric matrix, all its eigenvalues are real, and thus

ρ(P_θ) = max{|1 − θλ_min(P^T P)|, |1 − θλ_max(P^T P)|}. (11)
It follows that ρ(P_θ) < 1 if and only if

0 < θλ_min(P^T P) < 2 and 0 < θλ_max(P^T P) < 2. (12)
Since P is invertible and P^T P is positive semidefinite, we have that P^T P is positive definite and λ_min(P^T P) > 0. The positive definiteness of P^T P and Lemma 3.1(i) imply λ_max(P^T P) = ‖P^T P‖_2 = ‖P‖_2^2.
Hence condition (12) holds if and only if

0 < θ < 2/‖P‖_2^2. (13)

Therefore, if (13) holds, then the sequence X(k) generated by Algorithm 2.1 converges to the solution of (4) for any initial value X(0). Conversely, suppose that θ does not satisfy (13). The above discussion implies that ρ(P_θ) ≥ 1, that is, there is an eigenvalue λ of P_θ such that |λ| ≥ 1. We can choose an eigenvector of P_θ corresponding to λ as the initial error u(0), so that u(k) = λ^k u(0) does not converge to 0. We summarize the necessary and sufficient condition for convergence as follows.

Theorem 3.3
Let θ ∈ R be given. Then the sequence X(k) generated by Algorithm 2.1 converges to the solution of (4) for any initial value X(0) if and only if θ satisfies (13).
In particular, if θ does not satisfy (13), then Algorithm 2.1 is not applicable for some initial values.
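The criterion of Theorem 3.3 is easy to check numerically. The following sketch, with an arbitrary test matrix of our own, verifies on one example that ρ(P_θ) < 1 when θ lies inside the interval (13) and ρ(P_θ) ≥ 1 when it lies outside:

```python
import numpy as np

rng = np.random.default_rng(2)
P = rng.standard_normal((6, 6))          # arbitrary nonsingular test matrix
PtP = P.T @ P
bound = 2.0 / np.linalg.norm(P, 2) ** 2  # right endpoint 2/||P||_2^2 of (13)

def rho(theta):
    """Spectral radius of P_theta = I - theta * P^T P."""
    return np.max(np.abs(np.linalg.eigvalsh(np.eye(6) - theta * PtP)))

inside, outside = 0.5 * bound, 1.1 * bound
```

For theta = 1.1 * bound, the eigenvalue 1 − θλ_max(P^T P) already drops below −1, matching the "only if" part of the theorem.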

Convergence rate and error estimates
We now apply the Banach contraction principle to analyze the convergence rate and error estimates of Algorithm 2.1. Note that the error at each step of the associated Picard iteration is equal to that of the original matrix iterative algorithm. Indeed, for any k ∈ N ∪ {0}, we have

‖u(k) − u*‖_2 = ‖vec X̃(k)‖_2 = ‖X(k) − X‖_F.

Thus by Theorem 3.2(i) we obtain

‖X(k + 1) − X‖_F ≤ ρ(P_θ)‖X(k) − X‖_F. (14)

It follows inductively that for each k ∈ N,

‖X(k) − X‖_F ≤ ρ(P_θ)^k ‖X(0) − X‖_F. (15)

Hence ρ(P_θ) describes how fast the approximate solutions X(k) converge to the exact solution X: the smaller the spectral radius, the faster X(k) goes to X. In that case, since ρ(P_θ) < 1, if ‖X(k) − X‖_F ≠ 0, then

‖X(k + 1) − X‖_F < ‖X(k) − X‖_F. (16)

Moreover, from the prior and posterior estimates in Theorem 3.2 we obtain the bounds

‖X(k) − X‖_F ≤ (ρ(P_θ)^k / (1 − ρ(P_θ))) ‖X(1) − X(0)‖_F, (17)

‖X(k + 1) − X‖_F ≤ (ρ(P_θ) / (1 − ρ(P_θ))) ‖X(k + 1) − X(k)‖_F. (18)

We summarize our discussion in the following theorem.

Theorem 3.4 Suppose that the parameter θ is chosen as in Theorem 3.3 so that the sequence X(k) generated by Algorithm 2.1 converges to X for any initial value. Then we have:
• The convergence rate of the algorithm is governed by the spectral radius (11).
• The error estimates for ‖X(k + 1) − X‖_F compared to the previous step and the first step are provided by (14) and (15), respectively. In particular, the error at each iteration becomes smaller than the (nonzero) previous one, as in (16).
• The prior estimate (17) and the posterior estimate (18) hold.
From (11), if the eigenvalues of θP^T P are close to 1, then the spectral radius of the iteration matrix is close to 0, and hence the errors X̃(k) converge faster to 0.
Remark 3.5 The convergence criteria and the convergence rate of Algorithm 2.1 depend on A, B, C, D but not on E. However, the matrix E can be used in a stopping criterion.
The following proposition determines an iteration number after which the approximate solution X(k) is within a prescribed distance ε of the exact solution X, that is, ‖X(k) − X‖_F < ε.

Proposition 3.6 According to Algorithm 2.1, for each given error ε > 0, we have ‖X(k) − X‖_F < ε after k* iterations for any

k* > log(ε/‖X(0) − X‖_F) / log ρ(P_θ). (19)

Proof From estimate (15) we have ‖X(k) − X‖_F ≤ ρ(P_θ)^k ‖X(0) − X‖_F. This precisely means that for each given ε > 0, ‖X(k) − X‖_F < ε whenever ρ(P_θ)^k ‖X(0) − X‖_F < ε. Taking logarithms, we see that this condition is equivalent to (19). Thus, if we run Algorithm 2.1 k* times, then we get ‖X(k) − X‖_F < ε, as desired.
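The bound (19) can be sketched in a few lines; the helper name and the sample numbers (ρ = 0.9, initial error 10, tolerance 10^{-6}) are our own illustration, not from the paper.

```python
import math

def min_iterations(rho, e0, eps):
    """Smallest integer k satisfying (19): rho^k * e0 < eps."""
    # log(rho) < 0, so dividing by it flips the inequality, giving
    # k > log(eps / e0) / log(rho).
    return math.ceil(math.log(eps / e0) / math.log(rho))

k_star = min_iterations(rho=0.9, e0=10.0, eps=1e-6)
```

One fewer iteration would leave the guaranteed error bound above the tolerance, so k* is sharp for the estimate (15).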

Optimal parameter
We now discuss the fastest convergence factor for Algorithm 2.1.

Theorem 3.7 Denote by κ the condition number of the matrix P. Then the optimal value of θ for which Algorithm 2.1 is applicable for any initial value is determined by

θ_opt = 2 / (λ_min(P^T P) + λ_max(P^T P)). (20)

In this case the spectral radius of the iteration matrix is given by

ρ(P_{θ_opt}) = (λ_max(P^T P) − λ_min(P^T P)) / (λ_max(P^T P) + λ_min(P^T P)) = (κ^2 − 1)/(κ^2 + 1).

Proof Theorem 3.3 tells us that the convergence of the algorithm implies (13). The convergence rate of the algorithm is the same as that of the linear iteration (9) and thus is governed by the spectral radius (11). Hence we would like to minimize the spectral radius ρ(P_θ) subject to condition (13). Putting a = λ_min(P^T P) and b = λ_max(P^T P), we consider the optimization problem

minimize max{|1 − θa|, |1 − θb|} subject to 0 < θ < 2/b.

The minimum is attained at θ_opt = 2/(a + b), where |1 − θ_opt a| = |1 − θ_opt b|, so the minimum value of ρ(P_θ) equals (b − a)/(b + a). Since κ^2 = b/a, this value equals (κ^2 − 1)/(κ^2 + 1).
From Theorem 3.7, Algorithm 2.1 converges fast if the condition number of P is close to 1 or, equivalently, λ_max(P^T P) is close to λ_min(P^T P).
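The optimality claim of Theorem 3.7 is easy to verify numerically. In this sketch (with a test matrix of our own) we compare ρ(P_θ) at θ_opt against a grid of admissible convergence factors:

```python
import numpy as np

rng = np.random.default_rng(3)
P = np.eye(5) + 0.3 * rng.standard_normal((5, 5))  # test matrix of our own
a, *_, b = np.linalg.eigvalsh(P.T @ P)             # lambda_min, lambda_max
theta_opt = 2.0 / (a + b)

def rho(theta):
    # For symmetric P_theta the spectral radius (11) is attained at the
    # extreme eigenvalues a, b of P^T P.
    return max(abs(1 - theta * a), abs(1 - theta * b))

grid = np.linspace(1e-6, 2.0 / b, 1000)            # admissible interval (13)
```

At θ_opt the two terms |1 − θa| and |1 − θb| balance, which is exactly why the maximum is minimized there.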

Iterative algorithms for particular cases of the generalized Sylvester equation
In this section, we discuss iterative algorithms for solving interesting particular cases of the generalized Sylvester equation.

The equation AXB = E
Assume that the equation AXB = E has a unique solution or, equivalently, the square matrix Q := B T ⊗ A is invertible. In the particular case where A and B are square matrices, this condition is reduced to the invertibility of both A and B. The following algorithm is proposed to find the solution X.
Note that Q^T Q = BB^T ⊗ A^T A by the mixed-product property of the Kronecker product. Since Q is invertible, so is Q^T Q, and hence A^T A and BB^T are positive definite. Thus λ_max(Q^T Q) = λ_max(A^T A)λ_max(BB^T) = ‖A‖_2^2‖B‖_2^2. Now we obtain the following:
1) The sequence X(k) generated by Algorithm 4.1 converges to the exact solution for any initial value X(0) if and only if 0 < θ < 2/(‖A‖_2^2‖B‖_2^2).
2) The convergence rate of the iteration is governed by the spectral radius ρ(I − θQ^T Q).
3) The optimal convergence factor for which Algorithm 4.1 is applicable for any initial value is given by θ_opt = 2/(λ_min(Q^T Q) + λ_max(Q^T Q)).
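The items above can be sketched directly in NumPy. The update rule X(k) = X(k − 1) + θ A^T (E − A X(k−1) B) B^T is our reading of the gradient iteration for AXB = E, and the well-conditioned test data are our own:

```python
import numpy as np

rng = np.random.default_rng(4)
A = np.eye(4) + 0.05 * rng.standard_normal((4, 4))
B = np.eye(4) + 0.05 * rng.standard_normal((4, 4))
X_true = rng.standard_normal((4, 4))
E = A @ X_true @ B

# Q^T Q = B B^T kron A^T A, so its eigenvalues are the pairwise products
# of those of A^T A and B B^T; theta follows item 3).
prods = np.outer(np.linalg.eigvalsh(A.T @ A), np.linalg.eigvalsh(B @ B.T))
theta = 2.0 / (prods.min() + prods.max())

X = np.zeros((4, 4))
for _ in range(500):
    X = X + theta * (A.T @ (E - A @ X @ B) @ B.T)
```

The eigenvalue products reproduce the identity λ_max(Q^T Q) = ‖A‖_2^2‖B‖_2^2 without ever forming Q.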

The Sylvester equation
Suppose that m = n and p = q. Assume that the Sylvester equation AX + XD = E has a unique solution. This condition is equivalent to the invertibility of the Kronecker sum D^T ⊕ A := D^T ⊗ I_n + I_p ⊗ A or, equivalently, to the condition that all possible sums of eigenvalues of A and D are nonzero.

Corollary 4.4
Assume that the equation AX + XD = E has a unique solution X. Then the iterative sequence X(k) generated by Algorithm 4.3 converges to X for any initial value X(0) if and only if 0 < θ < 2/‖D^T ⊕ A‖_2^2. Error estimates and the optimal convergence factor for Algorithm 4.3 can also be obtained from the results of Sect. 3 by taking B and C to be identity matrices.
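Specializing the general iteration to B = I and C = I gives, in our reading, the update X(k) = X(k − 1) + θ(A^T F(k) + F(k) D^T) with F(k) = E − A X(k−1) − X(k−1) D. A minimal sketch with test data of our own:

```python
import numpy as np

rng = np.random.default_rng(5)
A = np.eye(3) + 0.05 * rng.standard_normal((3, 3))
D = np.eye(3) + 0.05 * rng.standard_normal((3, 3))
X_true = rng.standard_normal((3, 3))
E = A @ X_true + X_true @ D

# Kronecker sum D^T (+) A governs the admissible range of theta.
S = np.kron(np.eye(3), A) + np.kron(D.T, np.eye(3))
theta = 1.9 / np.linalg.norm(S, 2) ** 2  # inside (0, 2/||D^T (+) A||_2^2)

X = np.zeros((3, 3))
for _ in range(1000):
    F = E - A @ X - X @ D
    X = X + theta * (A.T @ F + F @ D.T)
```

Here 1.9/‖·‖_2^2 is just a safe choice strictly inside the interval of Corollary 4.4, not the optimal factor.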

The Kalman-Yakubovich equation
Suppose that m = n and p = q. Assume that the Kalman–Yakubovich equation AXB + X = E has a unique solution. This condition is equivalent to the invertibility of B^T ⊗ A + I_{np} or, equivalently, to the condition that no product of an eigenvalue of A and an eigenvalue of B equals −1.

Corollary 4.7 Assume that the equation AXB + X = E has a unique solution X. The iterative sequence X(k) generated by Algorithm 4.6 converges to X for any initial value X(0) if and only if 0 < θ < 2/‖B^T ⊗ A + I_{np}‖_2^2.
Remark 4.8 If A and B are positive semidefinite, then ‖B^T ⊗ A + I_{np}‖_2 reduces to ‖A‖_2‖B‖_2 + 1.
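A sketch for the Kalman–Yakubovich case (C = D = I), using Remark 4.8 to pick the convergence factor; the update form and the positive definite test matrices are our own illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
S1, S2 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
A = np.eye(3) + 0.05 * (S1 + S1.T)   # symmetric positive definite
B = np.eye(3) + 0.05 * (S2 + S2.T)   # symmetric positive definite
X_true = rng.standard_normal((3, 3))
E = A @ X_true @ B + X_true

# Remark 4.8: for positive semidefinite A, B the spectral norm of
# B^T kron A + I equals ||A||_2 ||B||_2 + 1.
norm_P = np.linalg.norm(A, 2) * np.linalg.norm(B, 2) + 1.0
theta = 1.0 / norm_P ** 2            # inside (0, 2/||P||_2^2)

X = np.zeros((3, 3))
for _ in range(1000):
    F = E - A @ X @ B - X
    X = X + theta * (A.T @ F @ B.T + F)
```

The shortcut avoids forming the np × np matrix B^T ⊗ A + I at all when choosing θ.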

Numerical simulations
In this section, we report numerical results illustrating the effectiveness of Algorithm 2.1. We consider matrix systems from small dimensions (say, 2 × 2) to large dimensions (say, 120 × 120). We investigate the effect of changing convergence factors (see Example 5.1) and initial points (Example 5.2). We compare the performance of the algorithm to the direct Kronecker linearization (Example 5.1) and recent iterative algorithms (Example 5.3). We show that Algorithm 2.1 is still effective when dealing with a nonsquare problem (see Example 5.4). To measure errors at step k of the iteration, we consider the relative error of X(k) with respect to the exact solution. All iterations have been carried out in the same PC environment: MATLAB R2018a, Intel(R) Core(TM) i7-6700HQ CPU @ 2.60 GHz, RAM 8.00 GB. To measure the computational time (in seconds) taken by a program, abbreviated CT, we use the tic and toc functions in MATLAB. Readers are recommended to consider both the reported errors and the CTs when comparing the performance of the algorithms.

We run Algorithm 2.1 with five convergence factors; one of them is the optimal convergence factor θ_opt = 6.5398e-04, determined by Theorem 3.7. According to (13), the range of appropriate θ is given by 0 < θ < 2/‖P‖_2^2 ≈ 6.5398e-04 (in this case, λ_min(P^T P) ≈ 0), so Algorithm 2.1 is applicable for every chosen θ. The results after 100 iterations are presented in Fig. 1 and Table 1. From the plot of relative error versus iteration number in Fig. 1 we see that the optimal convergence factor gives the fastest convergence. Table 1 shows that the computational times with the five convergence factors are significantly less than that of the direct method vec X = P^{-1} vec E. The relative errors after 100 iterations are very small in comparison with the dimensions of the coefficient matrices.
For each n ∈ {2, 10, 100}, we consider the coefficient matrices A = A_0 ⊗ I_{n/2}, B = B_0 ⊗ I_{n/2}, and X* = Z ⊗ I_{n/2} together with the initial condition X(0) = 10^{-6} × ones(n, n). We run 50 iterations for dimension 2 × 2, 100 iterations for dimension 10 × 10, and 200 iterations for dimension 100 × 100. The relative errors and computational times in Table 3 reveal that our algorithm performs well compared to the other algorithms. Figure 2 displays the error plots of the first 100 iterations for each case: 2 × 2 (a), 10 × 10 (b), and 100 × 100 (c). We can see that our algorithm gives the fastest convergence.

The constant matrix E is determined by E = AX*B + CX*D. We compare Algorithm 2.1 (θ_opt = 6.4000e-05) and the algorithms compatible with nonsquare matrices, that is, GI [25], RGI [34], and MGI [35], with 100 iterations. The relative error at the terminal iteration and the computational time of each algorithm and of the direct method are shown in Table 4, whereas Fig. 3 displays the error plots. Both reveal the effectiveness of our algorithm. The computational times of GI-opt and GI are less than those of RGI and MGI. The reason that GI-opt takes slightly more time than GI is that our algorithm needs extra time to compute θ_opt via Theorem 3.7. On the other hand, Algorithm 2.1 attains a significantly better error than the GI algorithm.

Conclusion
We propose a gradient-based iterative algorithm (Algorithm 2.1) for solving the rectangular generalized Sylvester matrix equation (4) and its famous particular cases. Theorem 3.3 tells us that the parameter θ must be chosen properly for the proposed algorithm to be applicable for any initial matrix. Moreover, we determine the optimal convergence factor, which makes the algorithm reach the fastest convergence rate. The asymptotic convergence rate of the algorithm is governed by the spectral radius of I_{np} − θP^T P. If the eigenvalues of θP^T P are close to 1, then the algorithm converges faster in the long run. The numerical simulations reveal that our algorithm is suitable for both small and large matrix systems, both square and nonsquare problems, and any initial points. In addition, the algorithm consistently performs well in comparison with recent methods, namely the GI, RGI, MGI, JGI, AJGI1, AJGI2, and LSI algorithms. There are two reasons why our algorithm performs well. The first is that it requires only one parameter and one initial value and avoids duplicate computations. The second is that the convergence factor is chosen by an optimization method.