Approximated least-squares solutions of a generalized Sylvester-transpose matrix equation via gradient-descent iterative algorithm

This paper proposes an effective gradient-descent iterative algorithm for solving a generalized Sylvester-transpose equation with rectangular matrix coefficients. The algorithm is applicable for the equation and its interesting special cases when the associated matrix has full column-rank. The main idea of the algorithm is to have a minimum error at each iteration. The algorithm produces a sequence of approximated solutions converging to either the unique solution, or the unique least-squares solution when the problem has no solution. The convergence analysis points out that the algorithm converges fast for a small condition number of the associated matrix. Numerical examples demonstrate the efficiency and effectiveness of the algorithm compared to renowned and recent iterative methods.


Introduction
In differential equations and control engineering, there has been much attention for the following linear matrix equations: AX + XB = C : Sylvester equation, AXB + CXD = E : a generalized Sylvester equation, AXB + CX T D = E : a generalized Sylvester-transpose equation, X + AXB = C : Stein equation, X + AX T B = C : Stein-transpose equation.
These equations are special cases of a generalized Sylvester-transpose matrix equation: where, for each t = 1, . . . , p, A t ∈ R l×m , B t ∈ R n×r , for each s = 1, . . . , q, C s ∈ R l×n , D s ∈ R m×r , E ∈ R l×r are known matrices whereas X ∈ R m×n is the matrix to be determined. These equations play important roles in control and system theory, robust simulation, neural network, and statistics; see e.g. [1][2][3][4].
A traditional method of finding their exact solutions is to use the Kronecker product of a matrix and the vectorization to reduce the matrix equation to a linear system; see e.g. [5,Ch. 4]. However, the dimension of the linear system can be very large due to the Kronecker multiplication, so that the step of finding the inversion of the associated matrix will result in excessive computer storage memory. For that reason, iterative approaches have received much attention. The conjugate gradient (CG) is an interesting idea to formulate finite-step iterative procedures to obtain the exact solution at the final step. There are variants of CG method for solving linear matrix equations, namely, the generalized conjugate direction method (GCD) [6], the conjugated gradient least-squares method (CGLs) [7], generalized product-type methods based on a bi-conjugate gradient (GPBi) [8]. Another interesting idea to create an iterative method is to use Hermitian and skew-Hermitian splitting (HSS); see e.g. [9].
A group of methods, called gradient-based iterative methods, aim to construct a sequence of approximated solutions that converges to the exact solution for any given initial matrices. These methods are derived from the minimization of associated norm-error functions using gradients, and the hierarchical identification. Such techniques have stimulated and have played a role in many pieces of research in a few decades. In 2005, Ding and Chen [10] proposed a gradient-based iterative (GI) method for solving Eqs. (3), (4), and (6). Ding et al. [11] proposed the GI and the least-squares iterative (LSI) methods for solving p j=1 A j XB j = F which includes Eqs. (1) and (4). Niu et al. [12] developed a relaxed gradient-based iterative (RGI) method for solving Eq. (3) by introducing a weighted factor. The MGI method, developed by Wang et al. [13], is a half-step-update modification of the GI method. Zhaolu et al. [14] presented two methods for solving Eq. (3). The first method is based on the GI method and called the Jacobi gradient iterative (JGI) method.  (2). See more algorithms in [16][17][18][19][20][21][22][23][24]. The developed iterative methods can be applied to state-space models [25], controlled autoregressive systems [26], and parameter estimation in signal processing [27].
Let us focus on gradient-based iterative methods for solving Eqs. (5) and (8). A recent gradient iterative method for Eq. (5) is AGBI method, developed in [28]. The following two methods were proposed to produce the sequence X(k) of approximated solutions converging to the exact solution X * of Eq. (8).
A conservative choice of the convergence factor μ is In this work, we introduce a new iterative algorithm based on gradient-descent for solving Eq. (8). The techniques of gradient and steepest descent let us obtain the search direction and the step sizes. Indeed, our varied step sizes are the optimal convergence factors that guarantee the algorithm to have a minimum error at each iteration. Our convergence analysis proves that, when Eq. (8) has a unique solution, the algorithm constructs a sequence of approximated solutions converging to the exact solution. On the other hand, when Eq. (8) has no solution, the generated sequence converges to the unique least-squares solution. We provide the convergence rate to show that the speed of convergence depends on the condition number of the associated certain matrix. In addition, we have an error analysis that gives an error estimation comparing the current iteration with the preceding and the initial iterations. Finally, we provide numerical simulations to guarantee the efficiency and effectiveness of our algorithm. The illustrative examples show that our algorithm is applicable to both Eq. (8) and its certain interesting special cases.
The organization of this paper is as follows. In Sect. 2, we recall the criterion for the matrix equation (8) to have a unique solution or a unique least-squares solution, via the Kronecker linearization. We propose the gradient-descent algorithm to solve Eq. (8) in Sect. 3. The proof of convergence criteria, convergence rates, and error estimation for the proposed algorithm are provided in Sect. 4. In Sect. 5, we present the comparison of the efficiency of our proposed algorithm to well-known and recent iterative algorithms.
In the remainder of this paper, all vectors and matrices are real. Denote the set of n columns vectors by R n and the set of m × n matrices by R m×n . The (i, j)th entry of a matrix A is denoted by A(i, j) or a ij . To perform a convergence analysis, we use the Frobenius norm, the spectral norm, and the (spectral) condition number of A ∈ R m×n , which are, respectively, defined by

Exact and least-squares solutions of the matrix equation by the Kronecker linearization
In this section, we explain how to solve the generalized Sylvester-transpose matrix equation (8) directly using the Kronecker linearization.
Recall that the Kronecker product of A = [a ij ] ∈ R m×n and B ∈ R p×q is defined by A ⊗ B = [a ij B] ∈ R mp×nq . The vector operator Vec(·) turns each matrix A = [a ij ] ∈ R m×n to the vector Lemma 2.1 (e.g. [5]) For compatible matrices A, B, and C, we have the following properties of the Kronecker product and the vector operator.
Recall also that there is a permutation matrix P(m, n) ∈ R mn×mn such that This matrix depends only on the dimensions m and n and is given by where E ij has entry 1 in (i, j)th position and all other entries are 0. Now, we can transform Eq. (8) to an equivalent linear system by applying the vector operator and utilizing Lemma 2.1(ii) and the property (9). Indeed, we get the linear system where Thus Eq. (8) has a (unique) solution if and only if Eq. (10) does. We impose the assumption that Q is of full column-rank, or equivalently, Q T Q is invertible. If Eq. (8) has a solution, then we obtain the exact (vector) solution to be If Eq. (8) has no solution, then we can seek for a least-squares solution, i.e. a matrix X * that minimizes the squared Frobenius norm Q Vec(X) -Vec(E) 2 F . The assumption on Q implies that the least-squares solution for Eq. (8) is uniquely determined by the solution of the associated normal equation, and it is also given by Eq. (12). In this case, the leastsquares error is given by We denote both the exact and the least-squares solutions of Eq. (8) by X * .

Gradient-descent iterative solutions for the matrix equation
This section is intended to propose a new iterative algorithm for creating a sequence {X k } of well-approximated solutions of Eq. (8) that converges to the exact or least-squares solution X * . This algorithm will be applicable if the matrix Q is of full column-rank, no matter Eq. (8) has a solution or not. Our aim is to generate a sequence {x k }, starting from an initial vector x 0 , using the recurrence where x k is the kth approximation, τ k+1 > 0 is the step size, and d k is the search direction. To obtain the search direction, we consider the Frobenius-norm error p t=1 A t XB t + q s=1 C s X T D s -E F which is then transformed into Qx -Vec(E) F via Lemma 2.1(ii) and x = Vec(X). Let f : R mn → R be the norm-error function defined by It is easily seen that f is convex. Hence, the gradient-descent iterative method can be shown as the following recursive equation: To find the gradient of the function f , the following properties of the matrix trace will be used: By lettingẽ = Vec(E), we compute the derivative of f as follows: Thus, we have the new form of the iterative equation as follows: The above equation can be transformed into matrix form via Lemma 2.1(ii), i.e., We differentiate φ k+1 by using the properties of a matrix trace and obtain It is obvious that the second-order derivative of φ k+1 is QQ T (ẽ -Qx k ) 2 F which is a positive constant. So when d dτ φ k+1 (τ ) = 0, we get the minimizer of φ k+1 , i.e.
s . An implementation of the gradient-descent iterative algorithm for solving Eq. (8) is given by the following algorithm where the search direction and the step size are taken into account. To terminate the algorithm, one can alternatively set the stopping rule to be R k Fδ < where > 0 is a small error and δ is the least-squares error described in Eq. (13).

Convergence analysis of the proposed algorithm
In this section, Algorithm 1 will be proved to converge to the exact solution or the unique least-squares solution. Recall the next lemma.

Algorithm 1:
The gradient-descent iterative algorithm for Eq. (8) The following definition is an extension of the Frobenius norm and will be used in the convergence analysis. . Then X k Q → X * Q for any initial matrix X 0 . Here, · Q is the Q-weighted Frobenius norm defined by Eq. (17).
Proof Since x * = Vec(X * ) is the optimal solution of min x∈R mn f (x), we denote the minimum value, inf x∈R mn f (x) = f (x * ) as δ. Note that δ is equal to the least-squares error determined by Eq. (13) and is zero if X * is the unique exact solution. If there exists k ∈ N such that ∇f (x k ) = 0, then X k = X * and the result holds. To investigate the convergence of the algorithm, we assume that ∇f (x k ) = 0 for all k. Considering the strong convexity of f , we have from Eq. (14) ∇ 2 f (x k ) = Q T Q. Let λ min (λ max ) be the minimum (maximum) eigenvalue of Q T Q, respectively. Since Q T Q is symmetric, we have Thus, f is strongly convex. From (15), substituting y = x k+1 and x = x k yields We minimize the RHS by taking τ = 1/λ min , so that Since the above equation is true for all y ∈ R mn , we have Similarly, from (16), we have Minimizing the RHS by taking τ = 1/λ max yields Subtracting each side of (19) by δ and combining with ∇f ( Putting α := 1λ min /λ max , we have By induction, we obtain Since Q T Q is assumed to be invertible, Q T Q > 0, it follows that λ min > 0 and hence 0 < α < 1. Consider the case of X * is the unique exact solution, i.e., δ = 0. We have f (x k ) → 0, or equivalently Qx k -Vec(E) → 0 as k → ∞. Now, the assumption that Q is of full columnrank implies that Therefore, X k = Vec -1 (x k ) → X * as k → ∞.
The other case is that X * is the unique least-squares solution, i.e., δ > 0. We have f (x k ) → δ or 1 2 Qx k -Vec(E) 2 F → Vec(E) 2 F -Vec(E) T Qx * . Then We omit some algebraic operations and hence immediately write Therefore, X k Q → X * Q as k → ∞.
We denote the condition number of Q by κ = κ(Q). Observe that α = 1-κ -2 . The relation between the quadratic norm-error f (x k ) and the norm of residual error R k is given by Making use of Lemma 2.1(ii), the inequalities (20) and (21) become the following estimation: In the case of Eq. (8) having a unique exact solution (δ = 0), the error estimations (22) and (23) reduce to (24) and (25), respectively.
Since 0 < α < 1, it follows that, if R k-1 F are nonzero, then The above discussion is summarized in the following theorem.

Theorem 4.4 Assume that Q is of full column-rank.
(i) Suppose Eq. (8) has a unique solution. The error estimation R k F compared with R k-1 F (the preceding iteration) and R 0 F (the initial iteration) are given by (24) and (25), respectively. Particularly, the relative error R k F gets smaller than the preceding (nonzero) error, as in (26). (ii) When Eq. (8) has a unique least-squares solution, the error estimation (22) and (23) hold. In both cases, the convergence rate of Algorithm 1 (regarding the error R k F ) is governed by Remark 4.5 The relative errors (22) and (23) do not seem to decrease every step of iteration since the terms 2δκ -2 and 2δ(1α k ) are positive. However, the inequality (19) implies that { R k F } ∞ k=1 is a strictly decreasing sequence converging to δ.
We recall the following properties.

Theorem 4.7
Suppose that Q is of full column-rank and Eq. (8) has a unique exact solution. We have the error estimation X k -X * F compared with the preceding iteration and the initial iteration of Algorithm 1 are provided by Particularly, the convergence rate of the algorithm is governed by Proof Utilizing (25) and Lemma 4.6, we have As the limiting behavior of X k -X * F depends on (1κ -2 ) k 2 , the convergence rate for Algorithm 1 is governed by √ 1κ -2 . Similarly, using (24), it follows that and hence (28) is obtained.

Theorem 4.8 Suppose Q is of full column-rank and Eq. (8) has a unique least-squares solution.
The error estimation X k -X * 2 F compared to the preceding iteration and the initial iteration of Algorithm 1 are provided by Proof The proof is similar to that of Theorem 4.7 and carried out by (22) and (23). We, therefore, omit the proof.
Consequently, our convergence analysis indicates that the proposed algorithm always converges to the unique (exact or least-squares) solution for any initial matrices and small condition numbers. Moreover, the algorithm will converge fast when the condition number is close to 1.

Numerical experiments for the generalized Sylvester-transpose matrix equation and its special cases
In this section, we provide numerical results to show the efficiency and effectiveness of Algorithm 1. We perform the experiments in the following cases: • a large-scaled square generalized Sylvester-transpose equation, • a small-scaled rectangular generalized Sylvester-transpose equation, • a small-scaled square Sylvester-transpose equation, • a large-scaled square Sylvester equation, • a moderate-scaled square Lyapunov equation.
Each example contains some comparisons of the proposed algorithm (denoted by TauOpt) with the mentioned existing algorithms as well as the direct method Eq. (12). CT stands for the computational time (in seconds) and is measured by the tic toc function in MATLAB. The relative error R k F is used to measure error at the kth step of the iteration. All iterations have been evaluated by MATLAB R2020b, on a PC (2.60-GHz intel(R) Core(TM) i7 processor, 8 Gbyte RAM). We choose an initial matrix X 0 = zero(100), where zero(n) is the n × n zero matrix. In fact, this equation has the unique solution X * = tridiag(0.293, 0.152, 0.905). Table 1 shows that the direct method consumes a big amount of time to get the exact solution, while Algorithm 1 produces a small-error solution in a small time (0.1726 sec-     We find that 4 = rank Q = rank[Q Vec(E)] = 5, i.e., the matrix equation does not have an exact solution. However, the size of Q is 9 × 4, i.e., Q is of full-column rank. Hence, according to Theorem 4.3, Algorithm 1 will converge to the least-squares solution in which the least-squares error (13) is equal to 0.0231. We choose an initial matrix X 0 = zero(2). Algorithm 1 is compared with GI (Method 1.1), LSI (Method 1.2) and the direct method Eq. (12). In this case, we consider the error X * -X k F where X * is the least-squares solution. Figure 2 displays the error plot, and Table 2 shows the errors and CTs for TauOpt, GI, LSI and the direct method. We see that the errors converge monotonically to zero, i.e., the approximate solutions X k generated by Algorithm 1 converge to X * . Moreover, Algorithm 1 consumes less computational time than other methods.
Next, we will consider the Sylvester-transpose equation (5) which is a special case of the generalized Sylvester-transpose equation (8). From Algorithm 1, the optimal step size τ is described by  We report the comparison of Algorithm 1 with GI (Method 1.1), LSI (Method 1.2), AGBI ( [28]) and the direct method Eq. (12) by Fig. 3 and Table 3. Both of them imply that Algorithm 1 outperforms other algorithms.
Next, we will consider the Sylvester equation (3) which is also a special case of Eq. (8). For this equation, the optimal step size τ is described by Figure 3 Relative errors for Ex. 5.3 where W k = A T R k + R k B T and R k = C -AX k -X k B. where A, B, C ∈ R 100×100 . We choose an initial matrix X 0 = zero(100). Here, the symmetric exact solution is given by X * = tridiag(1, -5, 1), so that AGBI algorithm can be applicable. We compare Algorithm 1 with GI (Method 1.1), AGBI ( [28]), RGI [12], MGI [13], JGI [14], and AJGI [14]. Although Table 4 tells us that our algorithm takes a slightly more time than some other algorithms, Fig. 4 illustrates that Algorithm 1 reaches the fastest convergence.
The last example presents another special case of Eq. (8) that is the Lyapunov equation (2). The optimal step size τ is described by Figure 4 Relative errors for Ex. 5.4 where W k = A T R k + R k A and R k = B -AX k -X k A T .
In conclusion, Algorithm 1 takes a slightly more computational time than some other algorithms but still outperforms distinctly in performance of convergence.

Concluding remarks
We properly establish a gradient-descent iterative algorithm for solving the generalized Sylvester-transpose matrix equation (8). We show that the proposed algorithm is useful and applicable for wide range of problems, even though the problem has no solution, as long as the associated matrix Q, defined by Eq. (11), is of full column-rank. If the problem has the unique exact solution, then the approximate solutions converge to the exact solution. In the case of a no-solution problem, we have X Q → X * Q where X * is the unique least-squares solution. The convergence rate is described in terms of κ, the matrix condition number of Q, that is, √ 1κ -2 . Moreover, the analysis shows that the sequence of errors generated by our algorithm is monotone decreasing. Numerical examples are provided to verify our theoretical findings.