The Short-Time Fourier Transform

This notebook describes an implementation of the STFT. Specifically, I consider how the forward (analysis) and the inverse (synthesis) operations are related by a transpose operation, and use this to establish perfect reconstruction (PR) conditions. The PR conditions depend only on the window function and the hop size. Using this, I show how to find windows that make the STFT a self-inverting transform, or a tight frame.

In [1]:
# some imports
using PyPlot
using FFTW

For demonstration purposes, let us choose a window and a hop size. We specifically use a Hann window in this example. Note that the chosen hop size is not a special value for this window function.

In [2]:
win_len = 50
win = sin.(π * (1:win_len)/(win_len+1)).^2;
hop = 15

fig, ax = subplots(1,1,figsize = (8,3)) 
ax.plot(1:win_len, win)
ax.set_title("window function")
Out[2]:
PyObject Text(0.5, 1, 'window function')

We now produce a random signal to work with. In order to keep the code tidy, I choose the signal length such that it allows $K$ hops of the window, without any left-over samples.

In [3]:
K = 30
signal_len = win_len + (K-1) * hop
x = randn(signal_len);

Given this signal, window, and hop-size, the STFT is computed as follows. Aligning the starting point of the window with that of the signal, we

  1. window the signal to extract a certain portion in 'time'
  2. compute the fft of the windowed segment
  3. put the fft as a column in a matrix holding the STFT coefficients (so the horizontal and vertical axes correspond to time and frequency respectively)
  4. shift the window by 'hop' samples, and repeat the steps above, until we reach the end of the signal.

This operation is demonstrated in the code snippet below. There's one catch -- I do a little magnitude adjustment by dividing the fft by the square root of the fft length. This normalization makes the fft a unitary operation, which I will need for the demonstration in the sequel. Normally, we don't bother with this normalization in an actual implementation.

In [4]:
# declare a complex array to hold the STFT of x
X = Array{Complex{Float64}, 2}(undef, (win_len, K))

# compute windowed fft's
for (i,j) = zip( 1 : hop : (signal_len - win_len + 1), 1 : K )
    X[:,j] = fft(x[i:i + win_len - 1] .* win) / √(win_len) 
end

Finally, I note that the operation that takes $x$ to $X$ is a linear operation. Let me denote this operation as $U$. Note that $U$ takes the signal (the 1D sequence) $x$ of length $S$, and maps it onto a matrix of size $W\times K$, where $W$ is the length of the window and $K$ is the number of hops (hop count). Therefore, $U : \mathbb{C}^S \to \mathbb{C}^{W\times K}$. In the sequel, we will be interested in the transpose of $U$.

In order to find $U^T$, I interpret $U$ as a cascade of three linear mappings, namely $M_1$, $M_2$, $M_3$.

  • $M_1$ takes the signal (the 1D sequence) $x$ of length $S$, and fills the entries of a matrix of size $W\times K$, using the samples of $x$. This is essentially $U$ stripped of its windowing and fft operations. In the code snippet below, $M_1$ is defined with the aid of the function 'Map1'. Note that $M_1 : \mathbb{C}^S \to \mathbb{C}^{W\times K}$.
  • $M_2$ (point-) multiplies each column of a $W\times K$ matrix with the window. Once again, $M_2$ uses specifically the window we declared above, and the function Map2 is a generic one which is used for defining $M_2$. Therefore, $M_2 : \mathbb{C}^{W\times K} \to \mathbb{C}^{W\times K}$.
  • $M_3$ computes the 1-D fft of each column of a $W\times K$ matrix. Therefore, $M_3 : \mathbb{C}^{W\times K} \to \mathbb{C}^{W\times K}$.
In [5]:
# definition of Map1, M1
function Map1(x, signal_length, window_length, hop_size, hop_count)
    X = Array{Complex{Float64}, 2}(undef, (window_length, hop_count))
    for (i,j) = zip( 1 : hop_size : (signal_length - window_length + 1), 1 : hop_count )
        X[:,j] = x[i:i + window_length - 1]
    end
    return X
end
M1 = z -> Map1(z, signal_len, win_len, hop, K)

# definition of Map2, M2
function Map2(X, window)
    return X .* window
end

M2 = Z -> Map2(Z, win)

# definition of M3, as a unitary operation (normalized fft)
M3 = Z -> fft(Z,[1])/√(win_len)

# we declare a new map by cascading these operations.
M = z -> M3( M2( M1(z) ) )
Out[5]:
#9 (generic function with 1 method)

I now verify that the operator $M = M_3\,M_2\,M_1$ in fact reproduces the STFT we computed earlier. The error below should be numerically zero.

In [6]:
Y = M(x);
error = sum(abs.(Y[:] - X[:]).^2)
println("Error : ", error)
Error : 0.0

Let us now consider the transpose operations. For this, our guide will be the definition of transpose. Recall that if $A$ and $B$ are two vector spaces with inner product, and $Q$ is a linear mapping from $A$ to $B$, then $Q^T$ is defined as that operation for which $\langle y, Q(x)\rangle = \langle Q^T(y), x \rangle$ for any $x\in A$, and $y \in B$.

Given this definition, I first note that, since $M_3$ simply computes (normalized) fft's of the columns, its transpose is its inverse, which computes (normalized) ifft's of the columns. Let's declare this as a new mapping $M_3^T$.

In [7]:
M3T = Z -> ifft(Z,[1]) * √(win_len)
Out[7]:
#11 (generic function with 1 method)

The transpose of $M_2$ can also be obtained by considering its action. $M_2$ simply point-multiplies each column of its input matrix with the real-valued window. Abusing notation, we have $M_2(x)_n = x_n\,w_n$. Therefore, we can verify that $\langle x, M_2(y)\rangle = \langle M_2(x), y\rangle$. In words, $M_2$ is its own transpose (it is self-adjoint).

In [8]:
M2T = M2
Out[8]:
#5 (generic function with 1 method)

What remains is $M_1$. The transpose of $M_1$ amounts to an 'overlap-add' operation. This can be verified algebraically, using the definition of transpose. I will skip the algebraic derivation here (for now...), but I will provide a numerical verification below. Once again, we define $M_1^T$ with the aid of a generic function (Map1Xpose).

In [9]:
function Map1Xpose(X, signal_len, win_len, hop, K)
    x = zeros(Complex{Float64}, signal_len)
    for (i,j) = zip( 1 : hop : (signal_len - win_len + 1), 1 : K )
        x[i:i + win_len - 1] += X[:,j]
    end
    return x
end
M1T = Z -> Map1Xpose(Z, signal_len, win_len, hop, K)
Out[9]:
#13 (generic function with 1 method)
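As a complementary, self-contained check, we can materialize $M_1$ as an explicit matrix on a tiny example (the sizes below are made up for illustration), and observe that multiplication by its transpose indeed overlap-adds the columns.

```julia
# tiny, made-up sizes for illustration: S = W + (K-1) H
S, W, H, Kc = 7, 3, 2, 3

# matrix of M1, with the W×K output flattened column-major
A = zeros(W * Kc, S)
for (i, j) = zip(1 : H : S - W + 1, 1 : Kc)
    for n = 1 : W
        A[(j - 1) * W + n, i + n - 1] = 1.0
    end
end

# A' applied to a flattened matrix overlap-adds its columns
Z = randn(W, Kc)
x1 = A' * vec(Z)

# explicit overlap-add, for comparison
x2 = zeros(S)
for (i, j) = zip(1 : H : S - W + 1, 1 : Kc)
    x2[i : i + W - 1] += Z[:, j]
end
d1 = maximum(abs.(x1 - x2))   # numerically zero
```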

Let us now numerically verify that our transposes are correctly identified and defined. For this, we define an inner product function.

In [10]:
inner_product = (x,y) -> sum( (x .* conj(y))[:] )
Out[10]:
#15 (generic function with 1 method)

We start by checking the validity of $M_1^T$. For this, we randomly choose $z \in \mathbb{C}^S$, $Z \in \mathbb{C}^{W\times K}$ and we compute the difference $\langle M_1(z), Z \rangle - \langle z, M_1^T(Z) \rangle$. This value should be numerically zero.

In [11]:
z = randn(signal_len)+ randn(signal_len)im
Z = randn(win_len,K) + randn(win_len,K)im
in1 = inner_product( M1(z), Z )
in2 = inner_product( z, M1T(Z) )
println("*** Testing transpose for M1 ***")
println("Difference between the two inner products: ", in1 - in2)
*** Testing transpose for M1 ***
Difference between the two inner products: -3.552713678800501e-14 + 6.039613253960852e-14im

Let us perform a similar verification for $M_2$. Note now that we need to use two matrices.

In [12]:
Z2 = randn(win_len,K) + randn(win_len,K)im
in1 = inner_product( M2(Z), Z2 )
in2 = inner_product( Z, M2T(Z2) )
println("*** Testing transpose for M2 ***")
println("Difference between the two inner products: ", in1 - in2)
*** Testing transpose for M2 ***
Difference between the two inner products: 0.0 - 7.105427357601002e-15im

Finally, below I verify the validity of $M_3^T$.

In [13]:
in1 = inner_product( M3(Z), Z2 )
in2 = inner_product( Z, M3T(Z2) )
println("*** Testing transpose for M3 ***")
println("Difference between the two inner products: ", in1 - in2)
*** Testing transpose for M3 ***
Difference between the two inner products: 4.263256414560601e-14 + 1.9539925233402755e-14im

Using the operators $M_1^T$, $M_2^T$, $M_3^T$, we define the cascade operator $M^T = M_1^T\,M_2^T\,M_3^T$, and also the 'frame' operator $M^T\,M$.

In [14]:
MT = Z -> M1T( M2T( M3T(Z) ) )

MTM = z -> MT( M(z) )
Out[14]:
#19 (generic function with 1 method)

Note that $M^T\,M$ maps the signal space to itself. If we had $M^T\,M = I$, then the analysis and synthesis operations would actually correspond to a perfect reconstruction filter bank. Let's check if this is the case.

In [15]:
x2 = MTM(x)
fig, ax = subplots(1,1, figsize = (10,3))
ax.plot(x, label = "input")
ax.plot(x2, label = "reconstruction")
ax.legend()
Out[15]:
PyObject <matplotlib.legend.Legend object at 0x11e800c90>

Save for the edges, we almost have perfect reconstruction, but not quite. Let us now derive the adjustment we need to make to achieve PR.

For that, consider the mapping $M^T\,M$. Since $M_3^T\,M_3 = I$, (transpose of fft is its inverse), we find that, $$ M^T\,M = M_1^T\,M_2^T\,M_3^T\,M_3\,M_2\,M_1 = M_1^T\,(M_2^T\,M_2)\,M_1. $$

Recall now that $M_2^T = M_2$. Since the action of $M_2$ is to point-multiply each column of a matrix with a 1D window function, it follows that $M_2^T\,M_2 = M_2^2$ simply point-multiplies each column with the square of the window function.
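This claim admits a quick numerical sanity check (a self-contained sketch, re-declaring the window from above):

```julia
# re-declare the window (as above) and a random complex test matrix
win_len, K = 50, 30
win = sin.(π * (1:win_len) / (win_len + 1)).^2
Z = randn(win_len, K) + randn(win_len, K)im

# applying M2 twice vs. point-multiplying by the squared window
d2 = maximum(abs.((Z .* win) .* win - Z .* win.^2))   # numerically zero
```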

Given this, and the fact that $M_1^T$ amounts to overlap-adding, we find (after some thinking, or algebra) that the action of $M^T\,M$ is to point multiply the input with a signal defined as $$ P(n) = \sum_{k} w^2(n - k\,H), \text{ for } 1 \leq n \leq S $$ where $H$ is the hop-size, and $S$ is the signal length (Julia indices start from 1).
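The sequence $P$ can be evaluated directly from this formula by overlap-adding shifted copies of $w^2$ (a self-contained sketch, re-declaring the parameters from above):

```julia
# re-declare the parameters and window from above
win_len, hop, K = 50, 15, 30
signal_len = win_len + (K - 1) * hop
win = sin.(π * (1:win_len) / (win_len + 1)).^2

# overlap-add shifted copies of the squared window
P = zeros(signal_len)
for i = 1 : hop : signal_len - win_len + 1
    P[i : i + win_len - 1] += win.^2
end
```

Away from the edges, the resulting $P$ is periodic with period equal to the hop size.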

We can obtain $P$ with the aid of the mappings we defined so far, by using a constant signal as the input.

In [16]:
P = MTM(ones(signal_len))
fig, ax = subplots(1,1,figsize= (8,3))
ax.plot(1:signal_len, P)
Out[16]:
1-element Array{PyCall.PyObject,1}:
 PyObject <matplotlib.lines.Line2D object at 0x136600d10>

Let us now verify the claim that $M^T\,M(x)$ is equal to $x$ point-multiplied by the sequence $P$ above, by plotting the point-product $P[n]\,x[n]$ along with the difference between the two signals.

In [17]:
fig, ax = subplots(1,1, figsize = (10,3) )
ax.plot(P.*x, label = "P[n] x[n]")
ax.plot(P.*x - MTM(x), label = "P[n] x[n] - MTM(x)[n]")
ax.legend()
Out[17]:
PyObject <matplotlib.legend.Legend object at 0x136934fd0>

This observation immediately suggests an inverse operation : apply $M^T\,M$ and correct by dividing by $P$.
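Since $M_3^T\,M_3 = I$, this corrected inverse can even be sketched without the fft step; the following self-contained snippet (re-declaring the parameters from above) applies $M_1^T\,M_2^2\,M_1$ and divides by $P$:

```julia
# re-declare the parameters, window, and a test signal from above
win_len, hop, K = 50, 15, 30
signal_len = win_len + (K - 1) * hop
win = sin.(π * (1:win_len) / (win_len + 1)).^2
x = randn(signal_len)

# y = M1ᵀ(M2(M2(M1(x)))) and the correction sequence P, in one pass
y = zeros(signal_len)
P = zeros(signal_len)
for i = 1 : hop : signal_len - win_len + 1
    y[i : i + win_len - 1] += x[i : i + win_len - 1] .* win.^2
    P[i : i + win_len - 1] += win.^2
end

x_rec = y ./ P                       # divide by P to undo the point-multiplication
d3 = maximum(abs.(x_rec - x))        # numerically zero
```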

However, we can do better. By 'normalizing' the window (made precise below), we can make $P$ equal unity in the middle portion. This in turn means that if the signal were of infinite length, the analysis and synthesis operations would achieve PR. This is preferable from a numerical point of view because it implies that the singular values of $M$ and $M^T$ are all unity.

Suppose now that in fact we were operating on infinite length sequences. Then, the function $P$ would be periodic with period equal to the hop size, $H$. Let us now define a 'normalized' window $\hat{w}(n)$ as $$ \hat{w}(n) = \frac{w(n)}{\sqrt{P(n)}}. $$ This normalized window function has its own periodization $\hat{P}$, which evaluates to \begin{align} \hat{P}(n) &= \sum_k\,\hat{w}^2(n - k\,H) \\ &= \sum_k \frac{w^2(n-kH)}{P(n - kH)} \\ &= \sum_k \frac{w^2(n-kH)}{P(n)} \quad \quad \text{(because $P$ is periodic with period $H$)}\\ &= \frac{1}{{P(n)}}\sum_k w^2(n-kH)\\ &= \frac{1}{{P(n)}}\,P(n)\\ &= 1. \end{align} In words, the new window achieves perfect reconstruction without any further correction steps.

Let us now modify our window and define modified mappings. We only need to compute the portion of $P$ that is on the support of the window function.

In [18]:
# compute the periodic normalization sequence on the support of the window
norm_win = win.^2
for i = 1+hop:hop:win_len
    norm_win[1:end-i+1] += win[i:end].^2
    norm_win[i:end] += win[1:end-i+1].^2
end

win_hat = win ./ sqrt.(norm_win)

fig, ax = subplots(1,1,figsize = (10,3))
ax.plot(win, label = "original window function")
ax.plot(win_hat, label = "modified window function")
ax.legend()
ax.set_title("window functions")
Out[18]:
PyObject Text(0.5, 1, 'window functions')

We now modify the mappings. Notice that we only need to modify $M_2$, $M_2^T$, since $M_1$, $M_3$ do not use the window function.

In [19]:
# modify M2
M̂2 = Z -> Map2(Z, win_hat)
M̂2T = M̂2

# modify M
M̂ = z -> M3(M̂2(M1(z)))
M̂T = Z -> M1T(M̂2T(M3T(Z)))
Out[19]:
#25 (generic function with 1 method)
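As a side check of the tight-frame claim: with the normalized window, segmenting and windowing preserves the energy of signals supported away from the edges, and the normalized fft step preserves energy regardless. A self-contained sketch, recomputing the normalized window as above:

```julia
# re-declare the parameters and the normalized window from above
win_len, hop, K = 50, 15, 30
signal_len = win_len + (K - 1) * hop
win = sin.(π * (1:win_len) / (win_len + 1)).^2

nrm = win.^2
for i = 1 + hop : hop : win_len
    nrm[1:end-i+1] += win[i:end].^2
    nrm[i:end] += win[1:end-i+1].^2
end
win_hat = win ./ sqrt.(nrm)

# a signal supported away from the edges
x = randn(signal_len)
x[1:win_len] .= 0
x[end - win_len + 1:end] .= 0

# energy after windowed segmentation (the unitary fft step would preserve it further)
E = sum(sum(abs2, x[i : i + win_len - 1] .* win_hat)
        for i = 1 : hop : signal_len - win_len + 1)
d4 = abs(E - sum(abs2, x))   # numerically zero: the frame is tight here
```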

Let us now check whether we get PR with the modified window. Note that we expect PR only in the middle portion, i.e., away from the first and last $N = W - H$ samples.

In [20]:
N = win_len - hop
P̂ = M̂T(M̂(ones(signal_len)))
dif = M̂T(M̂(x)) - x
fig, ax = subplots(1,1,figsize = (10,4))
ax.plot(x, label = "input signal")
ax.plot(P̂, label = "P̂")
ax.plot(real(dif), label = "real part of the reconstruction error")
ax.plot(imag(dif), label = "imaginary part of the reconstruction error")
ax.set_xticks([(N), signal_len - (N)], minor = true)
ax.xaxis.grid(true, linestyle = "dashed", which = "minor")
ax.legend()
Out[20]:
PyObject <matplotlib.legend.Legend object at 0x1371a4f90>

Ilker Bayram, ibayram@ieee.org, 2020