Systems of Equations

linear-algebra
Published March 30, 2026

What is a Linear Equation?

Back in secondary school, you probably saw something like:

\[2x + 3 = 7\]

You solve it by finding the value of \(x\) that makes it true. Here \(x = 2\). That is a linear equation in one variable.

The word linear means every variable appears only to the power of 1. No \(x^2\), no \(\sqrt{x}\), no \(\frac{1}{x}\). When you plot a linear equation in two variables, you always get a straight line.

With two variables it looks like:

\[2x + 3y = 12\]

Now there is no single answer. Infinitely many pairs \((x, y)\) satisfy it. For example \((0, 4)\), \((6, 0)\), and \((3, 2)\) all work. All such pairs trace a straight line on a 2D graph.
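A quick sanity check of those three pairs (the values are taken straight from the text):

```python
# Each pair (x, y) should satisfy 2x + 3y = 12.
pairs = [(0, 4), (6, 0), (3, 2)]
results = [2 * x + 3 * y for x, y in pairs]
print(results)   # [12, 12, 12]
```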

The general form of a linear equation in \(n\) variables is:

\[w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = y\]

where \(w_1, \ldots, w_n\) are coefficients (weights), \(b\) is a constant (bias), and \(y\) is the output. It is still called linear because no variable is raised to a power other than 1.
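In code, the general form is just a dot product plus a constant. Here is a minimal sketch with made-up values for the weights, inputs, and bias (none of these numbers come from the text):

```python
import numpy as np

w = np.array([2.0, -1.0, 0.5])   # coefficients (weights) w_1..w_n
x = np.array([1.0, 3.0, 4.0])    # variables x_1..x_n
b = 2.0                          # constant (bias)

y = np.dot(w, x) + b             # w_1*x_1 + ... + w_n*x_n + b
print(y)                         # (2 - 3 + 2) + 2 = 3.0
```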

What is a System of Linear Equations?

Consider this familiar dataset table with \(m\) training examples and \(n\) features:

| Example | Feature \(x_1\) | Feature \(x_2\) | \(\cdots\) | Feature \(x_n\) | Output \(y\) |
|---|---|---|---|---|---|
| \((1)\) | \(x_1^{(1)}\) | \(x_2^{(1)}\) | \(\cdots\) | \(x_n^{(1)}\) | \(y^{(1)}\) |
| \((2)\) | \(x_1^{(2)}\) | \(x_2^{(2)}\) | \(\cdots\) | \(x_n^{(2)}\) | \(y^{(2)}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) | \(\vdots\) |
| \((m)\) | \(x_1^{(m)}\) | \(x_2^{(m)}\) | \(\cdots\) | \(x_n^{(m)}\) | \(y^{(m)}\) |

The goal is to find the weights \(w_1, \ldots, w_n\) and bias \(b\) that best describe the linear relationship between the input features and the target. Following the DeepLearning.AI convention, superscript \((i)\) marks quantities belonging to training example \(i\), so \(y^{(i)}\) is the output for example \(i\). We want a single set of weights and a bias such that, for each row, the weighted sum of the features plus the bias equals the output. Row \((i)\) gives:

\[w_1 x_1^{(i)} + w_2 x_2^{(i)} + \cdots + w_n x_n^{(i)} + b = y^{(i)}\]

Writing this for every row gives a system of \(m\) equations with \(n\) features:

\[\begin{cases} w_1 x_1^{(1)} + w_2 x_2^{(1)} + \cdots + w_n x_n^{(1)} + b = y^{(1)} \\ w_1 x_1^{(2)} + w_2 x_2^{(2)} + \cdots + w_n x_n^{(2)} + b = y^{(2)} \\ \vdots \\ w_1 x_1^{(m)} + w_2 x_2^{(m)} + \cdots + w_n x_n^{(m)} + b = y^{(m)} \end{cases}\]

where superscripts \((i)\) index the training example (from \(1\) to \(m\)) and subscripts index the feature (from \(1\) to \(n\)).

Notice that \(w_1, w_2, \ldots, w_n\) and \(b\) are the same across all \(m\) equations. They are the shared model parameters we want to learn. The inputs differ from one equation to the next; they are fixed observations from the dataset. The goal is to find the single set of weights and bias that simultaneously satisfies (or best approximates) all \(m\) equations.
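The "shared parameters, varying inputs" point is easy to see in code. Below is a toy dataset (the numbers are invented for illustration) with \(m = 3\) examples and \(n = 2\) features; one fixed pair of weights and one bias reproduce every row's output:

```python
import numpy as np

X_rows = np.array([[1.0, 2.0],
                   [3.0, 0.0],
                   [2.0, 2.0]])        # row i is training example (i)
w = np.array([2.0, 1.0])               # shared weights w_1, w_2
b = 0.5                                # shared bias

# The SAME (w, b) is applied to every row; only the inputs change.
y = np.array([row @ w + b for row in X_rows])
print(y)   # [4.5 6.5 6.5]
```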

Matrix Form

The input matrix that collects all features is denoted \(X\). In matrix form:

\[W \cdot X + b = \hat{y}\]

where \(W = \begin{bmatrix} w_1 & w_2 & \cdots & w_n \end{bmatrix}\) is the weight row vector, \(X \in \mathbb{R}^{n \times m}\) holds all inputs (each column is one training example), \(b \in \mathbb{R}\) is the bias (added to every entry), and \(\hat{y} = \begin{bmatrix} \hat{y}^{(1)} & \hat{y}^{(2)} & \cdots & \hat{y}^{(m)} \end{bmatrix}\) is the row vector of predicted outputs.

Note

Reading the matrix form in plain English:

  • \(x_j^{(i)}\): feature \(j\) of training example \(i\)
  • \(w_j\): the weight for feature \(j\)
  • \(b \in \mathbb{R}\): \(b\) is a single number (the bias)
  • \(A\), \(B\), \(C\): capital letters represent matrices
  • \(X \in \mathbb{R}^{n \times m}\): \(X\) is a matrix with \(n\) rows (features) and \(m\) columns (training examples)
  • \(\mathbb{R}\) simply means “a real number”, so \(\mathbb{R}^n\) means “a list of \(n\) real numbers” and \(\mathbb{R}^{m \times n}\) means “a matrix with \(m\) rows and \(n\) columns”
  • \(u\), \(v\), \(w\): lowercase letters represent vectors
  • \(W\): the weight row vector with \(n\) entries, one per feature
  • \(y^{(i)}\): the observed target for training example \(i\)
  • \(\hat{y}\): the predicted output from the model; \(\hat{y} = W \cdot X + b\)

Vector-Matrix Representation

The weight row vector \(W\) and the input matrix \(X\) are defined as: \[W = \begin{bmatrix} w_1 & w_2 & \cdots & w_n \end{bmatrix}, \quad X = \begin{bmatrix} x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(m)} \\ x_2^{(1)} & x_2^{(2)} & \cdots & x_2^{(m)} \\ \vdots & \vdots & \ddots & \vdots \\ x_n^{(1)} & x_n^{(2)} & \cdots & x_n^{(m)} \end{bmatrix}\]

When we compute the product \(W \cdot X\), the weight vector multiplies each column of \(X\) (each column is one training example). Adding the bias \(b\) to every entry gives the predictions, and the system is solved when these match the observed outputs:

\[W \cdot X + b = \begin{bmatrix} y^{(1)} & y^{(2)} & \cdots & y^{(m)} \end{bmatrix}\]

This compact notation represents the entire system of \(m\) equations. By organizing the data this way, we can use efficient linear algebra libraries to find the values of \(W\) and \(b\) that best fit our observations.
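Here is the same toy system from above (made-up numbers, \(n = 2\), \(m = 3\)) written in the vectorized \(W \cdot X + b\) form; note that \(X\) now stores one example per column:

```python
import numpy as np

W = np.array([[2.0, 1.0]])             # 1 x n weight row vector
X = np.array([[1.0, 3.0, 2.0],         # n x m: each COLUMN is one example
              [2.0, 0.0, 2.0]])
b = 0.5                                # scalar bias, broadcast to every entry

y_hat = W @ X + b                      # 1 x m row of predictions, all at once
print(y_hat)   # [[4.5 6.5 6.5]]
```

One matrix multiplication replaces the \(m\) separate row-by-row dot products, which is exactly why linear algebra libraries make this fast.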

Types of Solutions

A system can have:

  • Unique solution: exactly one assignment of the unknowns satisfies all equations (consistent, independent)
  • Infinitely many solutions: equations are dependent (consistent, dependent)
  • No solution: equations are contradictory (inconsistent)
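The three cases can be distinguished numerically by comparing the rank of the coefficient matrix with the rank of the augmented matrix (the Rouché–Capelli test). A sketch, with a hypothetical helper `solution_type` and small hand-picked systems:

```python
import numpy as np

def solution_type(A, y):
    """Classify the system A x = y by comparing matrix ranks."""
    r = np.linalg.matrix_rank(A)
    r_aug = np.linalg.matrix_rank(np.column_stack([A, y]))
    if r < r_aug:
        return "no solution"              # inconsistent
    if r == A.shape[1]:
        return "unique"                   # consistent, independent
    return "infinitely many"              # consistent, dependent

print(solution_type(np.array([[2.0, 1.0], [1.0, -1.0]]),
                    np.array([5.0, 1.0])))        # unique
print(solution_type(np.array([[1.0, 1.0], [2.0, 2.0]]),
                    np.array([2.0, 4.0])))        # infinitely many
print(solution_type(np.array([[1.0, 1.0], [1.0, 1.0]]),
                    np.array([2.0, 3.0])))        # no solution
```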

Geometric Interpretation

Each equation defines a hyperplane in \(\mathbb{R}^n\). The solution set is the intersection of all hyperplanes:

  • In \(\mathbb{R}^2\): intersection of lines
  • In \(\mathbb{R}^3\): intersection of planes

Example

\[\begin{cases} 2x + y = 5 \\ x - y = 1 \end{cases}\]

Adding both equations: \(3x = 6 \Rightarrow x = 2\); substituting into \(2x + y = 5\) gives \(y = 1\).

Unique solution: \((x, y) = (2, 1)\).
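The same system can be handed directly to a linear algebra library; writing it as \(A \begin{bmatrix} x \\ y \end{bmatrix} = \mathbf{rhs}\):

```python
import numpy as np

A = np.array([[2.0, 1.0],     # 2x +  y = 5
              [1.0, -1.0]])   #  x -  y = 1
rhs = np.array([5.0, 1.0])

solution = np.linalg.solve(A, rhs)
print(solution)   # [2. 1.]
```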


Next: Solving by Elimination