# Solving Sparse Linear Systems in O(nm) Time

Many practical applications involve solving a sparse linear system $$\mathbf{A}\mathbf{x} = \mathbf{b}$$, such as analyzing random walks on sparse graphs. Traditional techniques such as Gaussian elimination or LU decomposition apply to general linear systems and run in cubic time; the fastest known algorithms for general matrices run in about $$O(n^{2.373})$$ time. For sparse matrices, however, we can do much better. Specifically, we may solve the system in $$O(mn)$$ time, where $$m$$ is the number of nonzero entries in $$\mathbf{A}$$. Throughout this passage we assume that $$\mathbf{A}$$ is non-singular (i.e., invertible).

### 1. Inverting a matrix from its annihilating polynomial

Given a polynomial $$p(x)$$, if $$p(\mathbf{A}) = \mathbf{0}$$ for a matrix $$\mathbf{A}$$, we say $$p(x)$$ is an annihilating polynomial for $$\mathbf{A}$$. By the Cayley-Hamilton theorem, the characteristic polynomial $$\det(x\mathbf{I} - \mathbf{A})$$ is an annihilating polynomial for $$\mathbf{A}$$. Among all annihilating polynomials for $$\mathbf{A}$$, the one of minimum degree is called the minimal polynomial of $$\mathbf{A}$$. It can be proved that the minimal polynomial of a given matrix is unique up to a constant factor, and that every annihilating polynomial is a polynomial multiple of the minimal polynomial.

We can invert a matrix $$\mathbf{A}$$ once we have its minimal polynomial:

$a_0 I + a_1 \mathbf{A} + a_2 \mathbf{A}^2 + \cdots + a_k \mathbf{A}^k = 0 \tag{*}$

Since $$\mathbf{A}$$ is invertible, 0 is not a root of its characteristic polynomial, so we must have $$a_0 \neq 0$$. Multiplying both sides by $$\mathbf{A}^{-1}$$ yields

$\mathbf{A}^{-1} = -\frac{a_1 I + a_2 \mathbf{A} + a_3 \mathbf{A}^2 + \cdots + a_k \mathbf{A}^{k-1}}{a_0}$

This means we can represent $$\mathbf{A}^{-1}$$ as a linear combination of powers of $$\mathbf{A}$$.

### 2. The Berlekamp-Massey algorithm

The Berlekamp-Massey algorithm solves the following problem in $O(n^2)$ time:

Given a finite sequence $$\{x_i\}_{i=1}^n$$, find a minimum-order linear recurrence consistent with the given sequence. Formally, find a shortest sequence $$c_0 = 1, c_1, \cdots, c_{k-1}$$ such that $$\sum_{l=0}^{k-1} x_{j-l} c_l = 0$$ holds for all valid $$j$$.

This algorithm has many real-world applications. The most typical one is finding the shortest linear feedback shift register that generates a given binary sequence. It can also be viewed as interpolating a sequence with exponential terms. One important fact about the Berlekamp-Massey algorithm: for an order-$$r$$ linearly recurrent sequence, the first $$2r$$ elements suffice as input to recover the recurrence.

### 3. Finding the minimal polynomial

Note that an annihilating polynomial is exactly a linear recurrence for the powers of the matrix. However, computing the minimal polynomial directly from the powers of $$\mathbf{A}$$ is infeasible. Instead, we may pick random vectors $$\mathbf{u}$$ and $$\mathbf{v}$$ and compute the minimal polynomial from the scalar sequence $$\mathbf{u}^T \mathbf{A}^i \mathbf{v}$$. We claim without proof that, with high probability, the coefficients of the recurrence of this sequence are exactly those of the minimal polynomial. The sequence can be computed in $$O(mn)$$ time by iterated sparse matrix-vector multiplication, each product taking $O(m)$ time. Finally, we apply the Berlekamp-Massey algorithm to the sequence.

### 4. Solving the linear system

Since the inverse of a sparse matrix is generally not sparse, we won't actually compute the inverse of $$\mathbf{A}$$. Instead, we can compute $$\mathbf{A}^{-1}\mathbf{b}$$ directly via formula (*) in $$O(mn)$$ time. The procedure is exactly the same as in finding the minimal polynomial: just iteratively perform sparse matrix-vector multiplication.

# Selected OJ Problems on Linear Algebra

### Fast Matrix Calculation

2014 Multi-University Training Contest 9, available at HDUOJ 4965

Step 1. Compute the $$N \times N$$ matrix $$C = AB$$.
Step 2. Compute $$M = C^{N \times N} \bmod 6$$.
Step 3. Output the sum of all entries of $$M$$.

$M = (AB)^{N \times N} = A (BA)^{N \times N - 1} B$

### Matrix Multiplication

2014 Multi-University Training Contest 5, available at HDUOJ 4920

$AB = (A_1 - A_2)(B_1 - B_2) = A_1B_1 - A_1B_2 - A_2B_1 + A_2B_2$

(Of course, this problem can also be passed with some other good constant-factor optimizations.)

### Matrix Revolution

HDUOJ 1759

$A + A^2 + \cdots + A^k$

$$A$$ is a sparse matrix given as $$m$$ triples $$(a, b, c)$$, each meaning $$A_{ab} = c$$. In addition, all entries on the main diagonal are guaranteed to be 1; all remaining entries are 0.

$$(0 \leq n \leq 1000, 0 \leq m \leq 10n, n < k < 10^{100}, 0 < c < 10^9)$$

### Matrix Multiplication

POJ 3318

(Of course, a cubic-time algorithm with good constant-factor optimization might also pass.)


### Square

UVA 11542, also available at vjudge

$$1 \leq T \leq 30$$, $$1 \leq n \leq 100$$; all integers are between 1 and $$10^{15}$$, and their prime factors are at most 500.

# How to Write Cache-Friendly Code: Lessons from a Constant-Factor Problem

The second problem of WC2017 is a three-in-one constant-factor-optimization problem. Its first subtask asks you to sort up to 2e8 32-bit unsigned integers, with a 3 s time limit and 2 GB of memory. The input is produced by a pseudorandom generator and the output is a hash, so I/O costs are excluded.

```cpp
#include <cstdio>
#include <algorithm>
using namespace std;

typedef unsigned uint32_t;

inline uint32_t next_integer(uint32_t &x) {
    x = x ^ (x << 13);
    x = x ^ (x >> 17);
    x = x ^ (x << 5);
    return x;
}

const int n = 200000000;
uint32_t arr[n];

void genarray() {
    uint32_t x = 0x98765432;
    for (int i = 0; i < n; i++) arr[i] = next_integer(x);
}

uint32_t output(uint32_t* ptr) {
    uint32_t ret = n * 4;
    uint32_t x = 23333333;
    for (int i = 0; i < n; i++) {
        ret ^= ptr[i] + x;
        next_integer(x);
    }
    return ret;
}

int main() {
    genarray();
    sort(arr, arr + n);
    printf("%x\n", output(arr));
    return 0;
}
```

```cpp
// ... (next_integer, genarray, output as above; partial_sum needs <numeric>)
typedef unsigned uint32_t;
typedef unsigned short uint16_t;
typedef unsigned char uint8_t;

union {
    uint32_t u32;
    uint16_t word[2];
} arr[n], arr2[n];

uint32_t cnt[2][65536];

void counting() {
    for (int i = 0; i < n; i++) {
        cnt[0][arr[i].word[0]]++;
        cnt[1][arr[i].word[1]]++;
    }
    for (int j = 0; j < 2; j++) partial_sum(cnt[j], cnt[j] + 65536, cnt[j]);
}

void sort0() {  // stable counting sort on the low 16 bits
    for (int i = n - 1; i >= 0; i--)
        arr2[--cnt[0][arr[i].word[0]]].u32 = arr[i].u32;
}

void sort1() {  // then on the high 16 bits
    for (int i = n - 1; i >= 0; i--)
        arr[--cnt[1][arr2[i].word[1]]].u32 = arr2[i].u32;
}

int main() {
    genarray();
    counting();
    sort0(); sort1();
    printf("%x\n", output((uint32_t*)arr));
    return 0;
}
```

$\text{hit}\% = \min\left\{1, \frac{\#\text{cache lines}}{k}\right\} \times \left( 1 - \frac{\text{size of element}}{\text{size of cache line}}\right)$

- L1 data cache: 2 x 32 KBytes, 8-way set associative, 64-byte line size
- L2 cache: 2 x 256 KBytes, 4-way set associative, 64-byte line size
- L3 cache: 3 MBytes, 12-way set associative, 64-byte line size

The L1 cache has only 512 lines in total. What if we split each 32-bit integer into four 8-bit chunks and radix sort on those? Let us rewrite the program.

```cpp
// ... (as above, except the union now also has a member uint8_t byte[4])

uint32_t cnt[4][256];

void counting() {
    for (int i = 0; i < n; i++) {
        cnt[0][arr[i].byte[0]]++;
        cnt[1][arr[i].byte[1]]++;
        cnt[2][arr[i].byte[2]]++;
        cnt[3][arr[i].byte[3]]++;
    }
    for (int j = 0; j < 4; j++)
        partial_sum(cnt[j], cnt[j] + 256, cnt[j]);
}

void sort0() {
    for (int i = n - 1; i >= 0; i--)
        arr2[--cnt[0][arr[i].byte[0]]].u32 = arr[i].u32;
}

void sort1() {
    for (int i = n - 1; i >= 0; i--)
        arr[--cnt[1][arr2[i].byte[1]]].u32 = arr2[i].u32;
}

void sort2() {
    for (int i = n - 1; i >= 0; i--)
        arr2[--cnt[2][arr[i].byte[2]]].u32 = arr[i].u32;
}

void sort3() {
    for (int i = n - 1; i >= 0; i--)
        arr[--cnt[3][arr2[i].byte[3]]].u32 = arr2[i].u32;
}

int main() {
    genarray();
    counting();
    sort0(); sort1(); sort2(); sort3();
    printf("%x\n", output((uint32_t*)arr));
    return 0;
}
```

When accessing main memory, the CPU often has to wait hundreds of clock cycles. Thanks to the burst mechanism, the CPU can exchange a large amount of contiguous data with main memory at once, which is what makes caches so effective to implement. To increase cache utilization, the main goal is to improve the spatial and temporal locality of the program's data accesses. Replacing an $$O(n \log n)$$ sort with a linear-time sort is a typical space-time tradeoff: if the radix is too large, the count array becomes large and the partially sorted data ends up scattered in memory, lowering cache utilization. In that case, moderately reducing the radix improves cache utilization, striking a balance between theoretical running time and cache efficiency.

# A Linear-Time Algorithm for Prefix Sums of k-th Powers

We want to compute, for given $$x$$, $$k$$, and a prime modulus $$p$$,

$\sum_{i=1}^{x} i^k \bmod p$

### Precomputing modular inverses in linear time

$i^{-1} \equiv -\left\lfloor \frac{p}{i} \right\rfloor \cdot (p \bmod i)^{-1} \pmod p$

$-\frac{\left\lfloor \frac{p}{i} \right\rfloor}{p-\left\lfloor \frac{p}{i} \right\rfloor i} = \frac{1}{i}$

Indeed, writing $$p = \left\lfloor \frac{p}{i} \right\rfloor i + (p \bmod i)$$, the denominator above is exactly $$p \bmod i$$, and since $$p \equiv 0$$ it is congruent to $$-\left\lfloor \frac{p}{i} \right\rfloor i$$ modulo $$p$$, so the fraction cancels to $$\frac{1}{i}$$.

### Lagrange interpolation

The prefix sum $$f(x) = \sum_{i=1}^{x} i^k$$ is a polynomial in $$x$$ of degree $$k+1$$, so it is determined by its values $$y_i = f(x_i)$$ at $$k+2$$ points:

$f(x) = \sum_{i=0}^{k+1} y_i l_i(x)$

$l_i(x) = \prod_{0 \leq m \leq k+1,\, m \neq i} \frac{x - x_m}{x_i - x_m}$

Taking the nodes $$x_i = i$$ for $$i = 0, \ldots, d$$ with $$d = k+1$$, the denominators telescope into factorials:

$f(n) = \sum_{i = 0}^{d} (-1)^{d-i} f(i) \frac{n(n-1) \cdots (n-i+1)(n-i-1) \cdots (n-d)}{(d-i)!\, i!}$

```cpp
#include <bits/stdc++.h>
using namespace std;

#define rep(i, n) for (int i = 0; i < (n); i++)
#define Rep(i, n) for (int i = 1; i <= (n); i++)
#define range(x) (x).begin(), (x).end()
typedef long long LL;
typedef unsigned long long ULL;

// rename our pow[] table so it does not collide with std::pow
#define pow owahgrauhgso

ULL n, k, m;

ULL powmod(ULL b, ULL e) {
    ULL r = 1;
    while (e) {
        if (e & 1) r = r * b % m;
        b = b * b % m;
        e >>= 1;
    }
    return r;
}

const int MAXN = 1000006;
ULL inv[MAXN];

// inverses of 1..k+1 modulo the prime m, in linear time
void init_inv() {
    inv[1] = 1;
    for (int i = 2; i <= k + 1; i++) {
        inv[i] = (m - m / i) * inv[m % i] % m;
        assert(inv[i] * i % m == 1);
    }
}

ULL pow[MAXN];
ULL prime[MAXN], cnt;

// linear sieve: i^k is completely multiplicative, so powmod is only
// needed at the primes; every other value costs one multiplication
void sieve() {
    pow[1] = 1;
    for (int i = 2; i <= k + 1; i++) {
        if (!pow[i]) {
            pow[i] = powmod(i, k);
            prime[cnt++] = i;
        }
        for (int j = 0; j < cnt && i * prime[j] <= k + 1; j++) {
            pow[i * prime[j]] = pow[i] * pow[prime[j]] % m;
            if (i % prime[j] == 0) break;
        }
    }
}

ULL sum[MAXN];
ULL ans[MAXN];

auto addmod = [](ULL a, ULL b) -> ULL { return (a + b) % m; };

// evaluate the degree-(k+1) interpolating polynomial at n using
// prefix/suffix products of the factors (n - i); sum[i] = f(i)
ULL lagrange() {
    ULL p = 1;
    for (int i = 0; i <= k + 1; i++) {
        if (i) p = p * inv[i] % m;                       // p = prod_{j<i}(n-j) / i!
        ans[i] = (k + 1 - i) & 1 ? m - sum[i] : sum[i];  // (-1)^{k+1-i} f(i)
        ans[i] = ans[i] * p % m;
        p = p * ((m + n - i) % m) % m;
    }
    p = 1;
    for (int i = k + 1; i >= 0; i--) {
        if (k + 1 - i) p = p * inv[k + 1 - i] % m;  // p = prod_{j>i}(n-j) / (k+1-i)!
        ans[i] = ans[i] * p % m;
        p = p * ((m + n - i) % m) % m;
    }
    ULL ret = 0;  // f(n) is the sum of all terms
    for (int i = 0; i <= k + 1; i++) ret = (ret + ans[i]) % m;
    return ret;
}

int main() {
    cin >> n >> k >> m;
    init_inv();
    sieve();
    partial_sum(pow, pow + k + 2, sum, addmod);
    cout << lagrange() << endl;
    return 0;
}
```