Monthly Archives: February 2018

Probability Measure

In these lecture notes, we study basic measure theory in the language of probability. If you want to learn more about general measure theory, I recommend [2].

Let $\Omega$ be a set whose elements will be called samples.

Definition. A $\sigma$-algebra is a collection $\mathscr{U}$ of subsets of $\Omega$ satisfying

  1. $\varnothing,\Omega\in\mathscr{U}$
  2. If $A\in\mathscr{U}$, then $A^c\in\mathscr{U}$
  3. If $A_1,A_2,\cdots\in\mathscr{U}$, then $\bigcup_{k=1}^\infty A_k,\bigcap_{k=1}^\infty A_k\in\mathscr{U}$

Note: In condition 3, it suffices to require closure under countable unions alone, or under countable intersections alone; each implies the other. For example, let’s assume that if $A_1,A_2,\cdots\in\mathscr{U}$, then $\bigcup_{k=1}^\infty A_k\in\mathscr{U}$. Let $A_1,A_2,\cdots\in\mathscr{U}$. Then by condition 2, $(A_1)^c,(A_2)^c,\cdots\in\mathscr{U}$ so we have $\bigcup_{k=1}^\infty (A_k)^c\in\mathscr{U}$. By condition 2 again together with De Morgan’s laws, this means $\bigcap_{k=1}^\infty A_k=\left[\bigcup_{k=1}^\infty (A_k)^c\right]^c\in\mathscr{U}$.
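On a finite sample space the three conditions can be checked mechanically. The following is only a minimal sketch: the helper name `is_sigma_algebra` and the two example collections are my own illustrations, and on a finite collection countable unions reduce to pairwise unions, which is what the code relies on.

```python
from itertools import combinations

def is_sigma_algebra(omega, collection):
    """Check the sigma-algebra conditions for a finite collection of subsets of a finite set."""
    omega = frozenset(omega)
    sets = {frozenset(a) for a in collection}
    # Condition 1: the empty set and omega belong to the collection.
    if frozenset() not in sets or omega not in sets:
        return False
    # Condition 2: closed under complement.
    if any(omega - a not in sets for a in sets):
        return False
    # Condition 3 (finite case): closed under pairwise unions.
    # Closure under intersections then follows from De Morgan's laws.
    return all(a | b in sets for a, b in combinations(sets, 2))

omega = {1, 2, 3, 4}
U = [set(), {1, 2}, {3, 4}, omega]
not_U = [set(), {1}, omega]  # the complement of {1} is missing

print(is_sigma_algebra(omega, U))      # True
print(is_sigma_algebra(omega, not_U))  # False
```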

Definition. Let $\mathscr{U}$ be a $\sigma$-algebra of subsets of $\Omega$. A map $P:\mathscr{U}\longrightarrow[0,1]$ is called a probability measure if $P$ satisfies

  1. $P(\varnothing)=0$, $P(\Omega)=1$
  2. If $A_1,A_2,\cdots\in\mathscr{U}$, then $$P\left(\bigcup_{k=1}^\infty A_k\right)\leq\sum_{k=1}^\infty P(A_k)$$
  3. If $A_1,A_2,\cdots\in\mathscr{U}$ are mutually disjoint, then $$P\left(\bigcup_{k=1}^\infty A_k\right)=\sum_{k=1}^\infty P(A_k)$$

Proposition. Let $A,B\in\mathscr{U}$. If $A\subset B$ then $P(A)\leq P(B)$.

Proof. Let $A,B\in\mathscr{U}$ with $A\subset B$. Then $B=(B-A)\dot\cup A$ where $\dot\cup$ denotes disjoint union. So by condition 3, $P(B)=P(B-A)+P(A)\geq P(A)$ since $P(B-A)\geq 0$.

Definition. A triple $(\Omega,\mathscr{U},P)$ is called a probability space. We say $A\in\mathscr{U}$ is an event and $P(A)$ is the probability of the event $A$. A property which is true except for an event of probability zero is said to hold almost surely (abbreviated “a.s.”).

Example. The smallest $\sigma$-algebra containing all the open subsets of $\mathbb{R}^n$ is called the Borel $\sigma$-algebra and is denoted by $\mathscr{B}$. Here we mean “open subsets” in terms of the usual Euclidean topology on $\mathbb{R}^n$. Since $\mathbb{R}^n$ with the Euclidean topology is second countable, the “open subsets” can be replaced by “basic open subsets”. Assume that a function $f$ is nonnegative and integrable (whatever that means; we will talk about it later) with $\int_{\mathbb{R}^n}f(x)dx=1$. Define
$$P(B)=\int_Bf(x)dx$$ for each $B\in\mathscr{B}$. Then $(\mathbb{R}^n,\mathscr{B},P)$ is a probability space. The function $f$ is called the density of the probability measure $P$.
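To see a density at work, here is a small numerical sketch for $n=1$. The choice of the standard normal density for $f$ and the midpoint rule used for the integral are my own assumptions for illustration; nothing in the example above depends on them.

```python
import math

def normal_density(x):
    """Standard normal density on the real line (an illustrative choice of f)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def prob_interval(f, a, b, n=10_000):
    """Approximate P([a, b]) = integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# The total mass should be (numerically) 1; the tails beyond |x| = 10 are negligible.
p_total = prob_interval(normal_density, -10.0, 10.0)
# Probability of the Borel set B = [-1, 1].
p_one_sigma = prob_interval(normal_density, -1.0, 1.0)

print(f"P([-10, 10]) ~ {p_total:.6f}")
print(f"P([-1, 1])   ~ {p_one_sigma:.6f}")
```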

Definition. Let $(\Omega,\mathscr{U},P)$ be a probability space. A mapping $X:\Omega\longrightarrow\mathbb{R}^n$ is called an $n$-dimensional random variable if for each $B\in\mathscr{B}$, $X^{-1}(B)\in\mathscr{U}$. Equivalently, we say that $X$ is $\mathscr{U}$-measurable. The probability space $(\Omega,\mathscr{U},P)$ is a mathematical construct that we cannot observe directly. But the values $X(\omega)$, $\omega\in\Omega$, of a random variable $X$ are observables. Following the customary notation in probability theory, we write $X(\omega)$ simply as $X$. Also $P(X^{-1}(B))$ is denoted by $P(X\in B)$.

Definition. Let $A\in\mathscr{U}$. Then the indicator $I_A: \Omega\longrightarrow\{0,1\}$ of $A$ is defined by $$I_A(\omega)=\left\{\begin{array}{ccc}1 & \mbox{if} & \omega\in A\\0 & \mbox{if} & \omega\not\in A\end{array}\right.$$
In measure theory the indicator of $A$ is also called the characteristic function of $A$ and is usually denoted by $\chi_A$. Here we reserve the term “characteristic function” for something else. Clearly the indicator is a random variable: in the subspace topology on $\{0,1\}$ both $\{0\}$ and $\{1\}$ are open, so the Borel $\sigma$-algebra on $\{0,1\}$ consists of all four subsets of $\{0,1\}$ (it coincides with the discrete topology), and their preimages $\varnothing$, $A^c$, $A$, $\Omega$ all belong to $\mathscr{U}$. Or, without mentioning the subspace topology, let $B\in\mathscr{B}$, the Borel $\sigma$-algebra of $\mathbb{R}$. If $0\in B$ and $1\notin B$ then $I_A^{-1}(B)=A^c\in\mathscr{U}$. If $0\notin B$ and $1\in B$ then $I_A^{-1}(B)=A\in\mathscr{U}$. If $0,1\notin B$ then $I_A^{-1}(B)=\varnothing\in\mathscr{U}$. If $0,1\in B$ then $I_A^{-1}(B)=\Omega\in\mathscr{U}$.

If $A_1,A_2,\cdots,A_m\in\mathscr{U}$ with $\Omega=\bigcup_{i=1}^m A_i$ and $a_1,a_2,\cdots,a_m\in\mathbb{R}$, then
$$X=\sum_{i=1}^m a_iI_{A_i}$$ is a random variable called a simple function.

[Figure: a simple function]

Lemma. Let $X: \Omega\longrightarrow\mathbb{R}^n$ be a random variable. Then
$$\mathscr{U}(X)=\{X^{-1}(B): B\in\mathscr{B}\}$$ is the smallest $\sigma$-algebra with respect to which $X$ is measurable. $\mathscr{U}(X)$ is called the $\sigma$-algebra generated by $X$.
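For a finite sample space, $\mathscr{U}(X)$ can be listed explicitly, because only the preimages of subsets of the range of $X$ matter. The sketch below assumes a six-point sample space and a parity random variable; both, along with the helper name `generated_sigma_algebra`, are hypothetical choices made for illustration.

```python
from itertools import chain, combinations

def generated_sigma_algebra(omega, X):
    """Return {X^{-1}(B)} where B runs over all subsets of the (finite) range of X."""
    values = sorted(set(X(w) for w in omega))
    subsets = chain.from_iterable(
        combinations(values, r) for r in range(len(values) + 1)
    )
    # Each preimage is stored as a frozenset so that duplicates collapse.
    return {frozenset(w for w in omega if X(w) in B) for B in subsets}

omega = [1, 2, 3, 4, 5, 6]

def parity(w):
    """Random variable taking the value 1 on odd outcomes and 0 on even ones."""
    return w % 2

sigma = generated_sigma_algebra(omega, parity)
for event in sorted(sigma, key=lambda e: (len(e), sorted(e))):
    print(sorted(event))
```

Here the generated $\sigma$-algebra has only four events (the empty set, the odd outcomes, the even outcomes and $\Omega$), which reflects that observing the parity tells us nothing more than whether the outcome is odd or even.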

Definition. A collection $\{X(t)|t\geq 0\}$ of random variables parametrized by time $t$ is called a stochastic process. For each $\omega\in\Omega$, the map $t\longmapsto X(t,\omega)$ is the corresponding sample path.

Let $(\Omega,\mathscr{U},P)$ be a probability space and $X=\sum_{i=1}^k a_iI_{A_i}$ a simple random variable, where the $A_i$ are mutually disjoint and the $a_i$ are distinct. The probability that $X=a_i$ is $P(X=a_i)=P(X^{-1}(a_i))=P(A_i)$, so $\sum_{i=1}^k a_iP(A_i)$ is the expected value of $X$. We define the integral of $X$ by
\begin{equation}\label{eq:integral}\int_{\Omega}XdP=\sum_{i=1}^k a_iP(A_i)\end{equation}
if $X$ is a simple random variable. A random variable is not necessarily simple so we obviously want to extend the notion of integral to general random variables. First suppose that $X$ is a nonnegative random variable. Then we define
\begin{equation}\label{eq:integral2}\int_{\Omega}XdP=\sup_{Y\leq X,\ Y\ \mbox{simple}}\int_{\Omega}YdP\end{equation}
Let $X$ be a random variable. Let $X^+=\max\{X,0\}$ and $X^-=\max\{-X,0\}$. Then $X=X^+-X^-$. Define
\begin{equation}\label{eq:integral3}\int_{\Omega}XdP=\int_{\Omega}X^+dP-\int_{\Omega}X^-dP\end{equation}For a random variable $X$, we still call the integral \eqref{eq:integral3} the expected value of $X$ and denote it by $E(X)$. This integral is called the Lebesgue integral in real analysis (see [2]). When I first learned the Lebesgue integral in my senior year in college, it wasn’t very clear to me what motivated defining it the way it is. In terms of probability the motivation is much clearer. I personally think that it would be better to introduce the Lebesgue integral to undergraduate students in the context of probability theory rather than abstract real analysis. If $X:\Omega\longrightarrow\mathbb{R}^n$ is a vector-valued random variable and $X=(X_1,X_2,\cdots,X_n)$, we define $$\int_{\Omega}XdP=\left(\int_{\Omega}X_1dP,\int_{\Omega}X_2dP,\cdots,\int_{\Omega}X_ndP\right)$$As one would expect from an integral, the expected value $E(\cdot)$ is linear.
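To make the integral of a simple random variable concrete, here is a small sketch on a finite probability space. The fair die, the partition and the values $a_i$ are assumptions chosen for illustration. The sum $\sum_i a_iP(A_i)$ is computed directly and then compared with the sample-by-sample sum $\sum_\omega X(\omega)P(\{\omega\})$.

```python
from fractions import Fraction

# A fair die: omega = {1, ..., 6} with P({w}) = 1/6 (illustrative choice).
omega = range(1, 7)
prob = {w: Fraction(1, 6) for w in omega}

def P(event):
    """Probability of an event, i.e. a subset of omega."""
    return sum(prob[w] for w in event)

# X = sum_i a_i I_{A_i} with mutually disjoint A_i covering omega.
parts = [({1, 2}, 0), ({3, 4, 5}, 1), ({6}, 5)]

def X(w):
    """Evaluate the simple random variable at the sample w."""
    return sum(a for A, a in parts if w in A)

expected_simple = sum(a * P(A) for A, a in parts)          # sum_i a_i P(A_i)
expected_pointwise = sum(X(w) * prob[w] for w in omega)    # sum_w X(w) P({w})

print(expected_simple, expected_pointwise)  # 4/3 4/3
```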

Definition. We call $$V(X)=\int_{\Omega}|X-E(X)|^2dP$$the variance of $X$.

It follows from the linearity of $E(\cdot)$ that $$V(X)=E(|X-E(X)|^2)=E(|X|^2)-|E(X)|^2$$
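The identity is easy to check exactly on a finite probability space. The sketch below assumes a fair die with $X(\omega)=\omega$ (my own example) and uses exact rational arithmetic, so the two expressions for the variance can be compared without rounding.

```python
from fractions import Fraction

# A fair die with X(w) = w (illustrative choice).
omega = range(1, 7)
prob = {w: Fraction(1, 6) for w in omega}

def E(f):
    """Expected value of a function f on the finite sample space."""
    return sum(f(w) * prob[w] for w in omega)

def X(w):
    return w

mean = E(X)
var_definition = E(lambda w: (X(w) - mean) ** 2)    # E(|X - E(X)|^2)
var_identity = E(lambda w: X(w) ** 2) - mean ** 2   # E(|X|^2) - |E(X)|^2

print(mean, var_definition, var_identity)  # 7/2 35/12 35/12
```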

Lemma. If $X$ is a random variable and $1\leq p<\infty$, then \begin{equation}\label{eq:chebyshev}P(|X|\geq\lambda)\leq\frac{1}{\lambda^p}E(|X|^p)\end{equation}for all $\lambda>0$. The inequality \eqref{eq:chebyshev} is called Chebyshev’s inequality.

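Proof. Since $1\leq p<\infty$, $|X|\geq\lambda\Rightarrow |X|^p\geq\lambda^p$. So, \begin{align*}E(|X|^p)&=\int_{\Omega}|X|^pdP\\&\geq\int_{|X|\geq\lambda}|X|^pdP\\&\geq\lambda^p\int_{|X|\geq\lambda}dP\\&=\lambda^pP(|X|\geq\lambda).\end{align*}Dividing both sides by $\lambda^p$ gives \eqref{eq:chebyshev}.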

Example. Let a random variable $X$ have the probability density function $$f(x)=\left\{\begin{array}{ccc}\frac{1}{2\sqrt{3}} & \mbox{if} & -\sqrt{3}<x<\sqrt{3}\\ 0 & \mbox{elsewhere}
\end{array}\right.$$For $p=1$ and $\lambda=\frac{3}{2}$, $\frac{1}{\lambda}E(|X|)=\frac{2}{3}\cdot\frac{\sqrt{3}}{2}=\frac{1}{\sqrt{3}}\approx 0.58$. Note that $E(|X|)=\int_{-\infty}^\infty |x|f(x)dx$. (We will discuss this later.) On the other hand, $P(|X|\geq\frac{3}{2})=1-\int_{-\frac{3}{2}}^{\frac{3}{2}}f(x)dx=1-\frac{\sqrt{3}}{2}\approx 0.134$. Since $0.134\leq 0.58$, this confirms Chebyshev’s inequality.
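The same comparison can be made by simulation. This is only a rough Monte Carlo sketch: the function name `chebyshev_sides`, the sample size and the seed are arbitrary choices of mine, and the estimates fluctuate slightly around the exact values $1-\frac{\sqrt{3}}{2}$ and $\frac{1}{\sqrt{3}}$.

```python
import math
import random

def chebyshev_sides(lam, p=1, n=200_000, seed=0):
    """Estimate both sides of Chebyshev's inequality for X uniform on (-sqrt(3), sqrt(3))."""
    rng = random.Random(seed)
    c = math.sqrt(3.0)
    samples = [rng.uniform(-c, c) for _ in range(n)]
    # Left side: P(|X| >= lam), estimated by the fraction of samples in the event.
    lhs = sum(1 for x in samples if abs(x) >= lam) / n
    # Right side: E(|X|^p) / lam^p, estimated by the sample mean of |X|^p.
    rhs = sum(abs(x) ** p for x in samples) / n / lam ** p
    return lhs, rhs

lhs, rhs = chebyshev_sides(1.5)
print(f"P(|X| >= 1.5) ~ {lhs:.3f}, bound ~ {rhs:.3f}")
```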

References (in no particular order):

  1. Lawrence C. Evans, An Introduction to Stochastic Differential Equations, Lecture Notes
  2. H. L. Royden, Real Analysis, Second Edition, Macmillan
  3. Robert V. Hogg, Joseph W. McKean, Allen T. Craig, Introduction to Mathematical Statistics, Sixth Edition, Pearson

Itô’s Formula

Let us consider the 1-dimensional case ($n=1$) of the Stochastic Equation (4) from the last post
\begin{equation}\label{eq:sd3}dX=b(X)dt+dW\end{equation} with $X(0)=0$.
Let $u: \mathbb{R}\longrightarrow\mathbb{R}$ be a smooth function and $Y(t)=u(X(t))$ ($t\geq 0$). What we learned in calculus (the chain rule) would tell us that $dY$ is
$$dY=u'dX=u'bdt+u'dW$$
where $'=\frac{d}{dx}$. It may come as a surprise, but this is not correct. First, by Taylor series expansion we obtain
\begin{align*}dY&=u'dX+\frac{1}{2}u''(dX)^2+\frac{1}{3!}u'''(dX)^3+\cdots\\&=u'[bdt+dW]+\frac{1}{2}u''[bdt+dW]^2+\cdots\end{align*}
Now we introduce the following striking formula
\begin{equation}\label{eq:wiener2}(dW)^2=dt\end{equation}
The proof of \eqref{eq:wiener2} is beyond the scope of these notes and so it won’t be given now or ever. However it can be found, for example, in [2]. Using \eqref{eq:wiener2}, $dY$ can be written as
$$dY=\left(u'b+\frac{1}{2}u''\right)dt+u'dW+\cdots$$
The terms beyond $u'dW$ are of order $(dt)^{\frac{3}{2}}$ and higher. Neglecting these terms, we have
\begin{equation}\label{eq:sd4}dY=\left(u'b+\frac{1}{2}u''\right)dt+u'dW\end{equation}
\eqref{eq:sd4} is the stochastic differential equation satisfied by $Y(t)$, and it is called Itô’s formula, named after the Japanese mathematician Kiyosi Itô.
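The rule behind the derivation above, commonly written $(dW)^2=dt$, can at least be made plausible numerically: the sum of squared Wiener increments over $[0,T]$ should be close to $T$. The sketch below assumes that the increments of $W$ over a step $dt$ are independent $N(0,dt)$ random numbers; the function name and parameters are my own.

```python
import math
import random

def quadratic_variation(T=1.0, n=100_000, seed=0):
    """Sum of squared increments of a simulated Wiener path on [0, T]."""
    rng = random.Random(seed)
    dt = T / n
    # Each increment dW is drawn from N(0, dt), i.e. standard deviation sqrt(dt).
    return sum(rng.gauss(0.0, math.sqrt(dt)) ** 2 for _ in range(n))

qv = quadratic_variation()
print(f"sum of (dW)^2 over [0, 1] ~ {qv:.4f}")
```

This is not a proof of anything, but it shows why $(dW)^2$ behaves like $dt$ rather than like a negligible second-order term.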

Example. Let us consider the stochastic differential equation
\begin{equation}\label{eq:sd5}dY=YdW,\ Y(0)=1\end{equation}
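Comparing \eqref{eq:sd4} and \eqref{eq:sd5}, we obtain
\begin{equation}\label{eq:sd5a}u'b+\frac{1}{2}u''=0\end{equation}
\begin{equation}\label{eq:sd5b}u'=u\end{equation}
The equation \eqref{eq:sd5b} along with the initial condition $Y(0)=1$ gives $u(X(t))=e^{X(t)}$. Using this $u$ with equation \eqref{eq:sd5a} we get $b=-\frac{1}{2}$ and so the equation \eqref{eq:sd3} becomes
$$dX=-\frac{1}{2}dt+dW$$
in which case $X(t)=-\frac{1}{2}t+W(t)$. Hence, we find $Y(t)$ as
$$Y(t)=e^{-\frac{1}{2}t+W(t)}$$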
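As a sanity check, one can compare a direct discretization of $dY=YdW$, $Y(0)=1$ with the closed form $Y(t)=e^{-\frac{1}{2}t+W(t)}$ along the same simulated Wiener path. The discretization used here is the Euler–Maruyama scheme, which is my choice and is not discussed in these notes; the function name and parameters are also assumptions.

```python
import math
import random

def euler_vs_exact(T=1.0, n=10_000, seed=0):
    """Integrate dY = Y dW by Euler-Maruyama and compare with exp(W(T) - T/2)."""
    rng = random.Random(seed)
    dt = T / n
    y, w = 1.0, 0.0
    for _ in range(n):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Wiener increment over one step
        y += y * dw                         # Euler-Maruyama step for dY = Y dW
        w += dw                             # accumulate W along the same path
    return y, math.exp(w - 0.5 * T)

y_em, y_exact = euler_vs_exact()
print(f"Euler-Maruyama: {y_em:.4f}, closed form: {y_exact:.4f}")
```

Note that the naive guess $Y(t)=e^{W(t)}$, which the ordinary chain rule would suggest, would be off by the factor $e^{\frac{1}{2}t}$.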

Example. Let $P(t)$ denote the price of a stock at time $t\geq 0$. A standard model assumes that the relative change of price $\frac{dP}{P}$ evolves according to the stochastic differential equation
\begin{equation}\label{eq:relprice}\frac{dP}{P}=\mu dt+\sigma dW\end{equation}
where $\mu>0$ and $\sigma$ are constants called the drift and the volatility of the stock, respectively. Again using Itô’s formula as in the previous example, we find that the price function $P(t)$, the solution of
$$dP=\mu Pdt+\sigma PdW,\ P(0)=p_0,$$ is
$$P(t)=p_0\exp\left[\left(\mu-\frac{1}{2}\sigma^2\right)t+\sigma W(t)\right].$$
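The closed form $P(t)=p_0\exp\left[\left(\mu-\frac{1}{2}\sigma^2\right)t+\sigma W(t)\right]$ makes it straightforward to sample a price path: simulate $W$ on a time grid and plug it in. The following is a minimal sketch; the function name `gbm_path`, the grid of 250 steps and the parameter values are hypothetical choices, not data.

```python
import math
import random

def gbm_path(p0, mu, sigma, T=1.0, n=250, seed=0):
    """Sample P(t) = p0 * exp((mu - sigma^2 / 2) t + sigma W(t)) on a uniform grid."""
    rng = random.Random(seed)
    dt = T / n
    w = 0.0
    path = [p0]
    for k in range(1, n + 1):
        w += rng.gauss(0.0, math.sqrt(dt))  # W(t_k) from independent N(0, dt) increments
        t = k * dt
        path.append(p0 * math.exp((mu - 0.5 * sigma ** 2) * t + sigma * w))
    return path

path = gbm_path(p0=100.0, mu=0.05, sigma=0.2)
print(f"P(0) = {path[0]:.2f}, P(1) = {path[-1]:.2f}")
```

With $\sigma=0$ the path reduces to the deterministic growth $p_0e^{\mu t}$, which is a convenient check of the implementation.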


References:

  1. Lawrence C. Evans, An Introduction to Stochastic Differential Equations, Lecture Notes
  2. Bernt Øksendal, Stochastic Differential Equations, An Introduction with Applications, 5th Edition, Springer, 2000

What is a Stochastic Differential Equation?

Consider the population growth model
\begin{equation}\label{eq:popgrowth}\frac{dN}{dt}=a(t)N(t),\ N(0)=N_0\end{equation}
where $N(t)$ is the size of a population at time $t$ and $a(t)$ is the relative growth rate at time $t$. If $a(t)$ is completely known, one can easily solve \eqref{eq:popgrowth}. In fact, the solution is $N(t)=N_0\exp\left(\int_0^t a(s)ds\right)$. Now suppose that $a(t)$ is not completely known but can be written as $a(t)=r(t)+\mbox{noise}$. We do not know the exact behavior of the noise, only its probability distribution. In such a case, an equation like \eqref{eq:popgrowth} is called a stochastic differential equation. More generally, a stochastic differential equation can be written as
\begin{equation}\label{eq:sd}\frac{dX}{dt}=b(X(t))+B(X(t))\xi(t)\ (t>0),\ X(0)=x_0,\end{equation}
where $X: [0,\infty)\longrightarrow\mathbb{R}^n$, $b: \mathbb{R}^n\longrightarrow\mathbb{R}^n$ is a smooth vector field, $B: \mathbb{R}^n\longrightarrow\mathbb{M}^{n\times m}$ takes values in the $n\times m$ matrices, and $\xi(t)$ is an $m$-dimensional white noise. If $m=n$, $x_0=0$, $b=0$ and $B=I$, then \eqref{eq:sd} turns into
\begin{equation}\label{eq:wiener}\frac{dX}{dt}=\xi(t),\ X(0)=0\end{equation}
The solution of \eqref{eq:wiener} is denoted by $W(t)$ and is called the $n$-dimensional Wiener process or Brownian motion. In other words, white noise $\xi(t)$ is the time derivative of the Wiener process. Replace $\xi(t)$ in \eqref{eq:sd} by $\frac{dW(t)}{dt}$ and multiply the resulting equation by $dt$. Then we obtain
\begin{equation}\label{eq:sd2}dX(t)=b(X(t))dt+B(X(t))dW(t),\ X(0)=x_0\end{equation}
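Since the Wiener process $W(t)$ enters everything that follows, here is a minimal sketch of how one might simulate a sample path of it in one dimension. It assumes the standard properties $W(0)=0$ and independent increments $W(t+dt)-W(t)\sim N(0,dt)$, which have not been established in this post; the function name and step count are my own.

```python
import math
import random

def wiener_path(T=1.0, n=1000, seed=0):
    """Simulate W on a uniform grid of [0, T] by summing independent N(0, dt) increments."""
    rng = random.Random(seed)
    dt = T / n
    w = [0.0]  # W(0) = 0
    for _ in range(n):
        w.append(w[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return w

w = wiener_path()
print(f"W(0) = {w[0]}, W(1) = {w[-1]:.4f}")

# Across many independent paths, W(1) should have variance close to 1.
ends = [wiener_path(n=50, seed=s)[-1] for s in range(2000)]
mean_end = sum(ends) / len(ends)
var_end = sum((x - mean_end) ** 2 for x in ends) / (len(ends) - 1)
print(f"sample variance of W(1) ~ {var_end:.3f}")
```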
The stochastic differential equation \eqref{eq:sd2} is solved symbolically as
\begin{equation}\label{eq:sdsol}X(t)=x_0+\int_0^tb(X(s))ds+\int_0^tB(X(s))dW(s)\end{equation}
for all $t>0$. In order to make sense of $X(t)$ in \eqref{eq:sdsol} we will have to know what $W(t)$ is and what the integral $\int_0^tB(X(s))dW(s)$, which is called a stochastic integral, means.
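Even before the stochastic integral is defined rigorously, the equation $dX=b(X)dt+B(X)dW$ suggests an obvious numerical recipe: advance $X$ in small time steps, replacing $dW$ by an $N(0,dt)$ random increment. This is the Euler–Maruyama scheme, which is not part of these notes; the sketch below is one-dimensional, and the drift $b(x)=-x$ and constant $B=0.5$ in the usage lines are purely illustrative.

```python
import math
import random

def euler_maruyama(b, B, x0, T=1.0, n=1000, seed=0):
    """Approximate a path of dX = b(X) dt + B(X) dW, X(0) = x0, in one dimension."""
    rng = random.Random(seed)
    dt = T / n
    x = x0
    xs = [x0]
    for _ in range(n):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Wiener increment over one step
        x = x + b(x) * dt + B(x) * dw       # discretized version of the SDE
        xs.append(x)
    return xs

# Illustrative coefficients: linear drift toward 0 with constant noise intensity.
xs = euler_maruyama(b=lambda x: -x, B=lambda x: 0.5, x0=1.0)
print(f"X(0) = {xs[0]}, X(1) ~ {xs[-1]:.4f}")
```

Setting $B=0$ turns the scheme into the ordinary Euler method for $\frac{dX}{dt}=b(X)$, so for $b(x)=-x$ the result should approach $e^{-t}$.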


  1. Lawrence C. Evans, An Introduction to Stochastic Differential Equations, Lecture Notes
  2. Bernt Øksendal, Stochastic Differential Equations, An Introduction with Applications, 5th Edition, Springer, 2000