Preface#

The \(17^{th}\) century was an exciting time for the world. The Renaissance was in full swing across Europe, and enlightened thinking began to permeate every aspect of society: culturally, scientifically, and religiously. It was in 1650 that a French mathematician, Blaise Pascal, first entered a gambling hall. A friend, the Chevalier de Méré, introduced a rather interesting problem to Pascal: the problem of points. The problem concerns a simple game of chance, which proceeds as follows. First, the players decide in advance the number of wins that must be achieved. In each round, each player places a bet of equal size, and a fair coin is flipped, so that each player has an equal chance of winning the round. The winner of the round is the player in whose favor the coin lands. The first player to reach the required number of wins takes the entire pot that has accumulated since the start of the game.

The problem of points asks: how do you divide the winnings fairly if the game is ended early? Early solutions, such as that of Luca Pacioli, relied on this principle: simply divide the stakes in proportion to the number of rounds won by each player. However, this approach produces some strange results. Suppose, for instance, that the players agreed to play to one hundred wins, and the game is stopped after the first round. In this case, the player with one win obtains the entire pot, and the other player walks away empty-handed. Intuitively, however, the leading player still needs \(99\) more wins, so the player with no wins still has a decent shot at reaching \(100\) wins before the other player collects \(99\) more. Numerous other early approaches shared similar inconsistencies with intuition about what a fair division of the pot in this hypothetical game would be.

Blaise Pascal, a prodigy mathematician well educated in both geometry and the actuarial sciences, posed this problem in his correspondence with Pierre de Fermat, a respected French mathematician who had already pioneered work in analytic geometry and made seminal advances toward differential calculus. During their correspondence, Fermat proposed a tabular solution: if one player needs \(r\) more rounds to win and the other player needs \(s\), then the game will surely have been won by someone after \(r + s - 1\) additional rounds. Fermat simply played the game out in all possible ways for these \(r + s - 1\) additional rounds: as each round has \(2\) possible outcomes, the game could proceed in \(2^{r + s - 1}\) different ways. Then, Fermat counted how many of the \(2^{r + s - 1}\) continuations led to a win for each player, and divided the stakes in proportion to those counts.
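To make Fermat's counting argument concrete, here is a minimal Python sketch of it (the function `fermat_split` and its interface are my own invention for illustration, not anything from the original correspondence):

from itertools import product

def fermat_split(r, s):
    # Player A needs r more wins, player B needs s more wins. Enumerate all
    # 2**(r + s - 1) ways the remaining rounds could go, and split the pot in
    # proportion to how many continuations each player wins.
    n = r + s - 1
    a_wins = 0
    for outcome in product("AB", repeat=n):
        # Over n = r + s - 1 rounds, A wins the game exactly when A takes at
        # least r of them (B then cannot reach s wins).
        if outcome.count("A") >= r:
            a_wins += 1
    total = 2 ** n
    return a_wins / total, (total - a_wins) / total

print(fermat_split(4, 5))
(0.63671875, 0.36328125)

With \(r = 4\) and \(s = 5\), player A's fair share comes to \(163/256 \approx 0.637\).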

Perhaps problematically, this solution is extremely computationally intensive without a calculator. If the first player needs just \(4\) more wins and the second player needs \(5\) more, then \(r + s - 1 = 8\), which means you would need to tabulate \(2^8 = 256\) different possibilities. It is also a little odd: some of these \(256\) games continue after a player has already won. Pascal proceeded differently. He began by explicitly defining what it meant for a game to be fair, using something called expected values: the game's division is fair if each player's share of the pot matches that player's expected winnings. The approach he used, which we will discuss at length later in the book when we talk about martingales, served as the motivating example for his work on the now-famous Pascal's Triangle, and, perhaps equally significantly, inspired the first book on probability theory, by Christiaan Huygens.
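Pascal's fairness criterion also has a compact recursive restatement (a modern paraphrase, not his original presentation, and `pascal_share` is a name I've made up): a player's fair share of the pot is the average of their fair shares after the two equally likely outcomes of the next round.

from functools import lru_cache

@lru_cache(maxsize=None)
def pascal_share(r, s):
    # Fair share of the pot for player A when A needs r more wins and B needs
    # s more wins, each round being a fair coin flip.
    if r == 0:
        return 1.0  # A has already won the game
    if s == 0:
        return 0.0  # B has already won the game
    # With probability 1/2, A wins the next round (and then needs r - 1 more);
    # with probability 1/2, B wins it (and then needs s - 1 more).
    return 0.5 * pascal_share(r - 1, s) + 0.5 * pascal_share(r, s - 1)

print(pascal_share(4, 5))
0.63671875

Reassuringly, this agrees with Fermat's count of \(163/256\), while only ever evaluating game states that can actually occur.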

Probability Theory today#

At the time of this writing, computation has matured far beyond tossing dice and playing casino games: we have computers that can process and analyze enormous volumes of data, and that can run algorithms in fractions of a second to solve the most complicated problems faced by our \(17^{th}\) century gamblers. In some sense, it seems almost trivial to think about these early problems, because a computer can simply solve them in real time.

However, the question remains: how can you contribute to these new problems? While the problems of today aren't always dice and coin flips, they are still extremely complicated. You might be studying a genome with millions of individual base pairs, and want to deduce whether particular sequences of base pairs make someone more, or less, likely to develop cancer later in life. You might be studying financial data with thousands of past time points, and want to learn to forecast how the market might look in a few months. You might be studying social networking data with billions of users, and want to determine which users are, with high probability, bots.

The problem, and the beauty, is that as data have gotten bigger, our need to understand probability theory has grown even more important. No matter how much computational power grows, we have only a finite capacity to understand the universe. We cannot simply know the exact state of the entire world and use that knowledge to determine exactly what its next state will be at some point in the future. What we can do is conduct experiments, delineate sets of rules that we think these experiments loosely follow, and then learn how our experiments tell us something about the world. What we learn from these experiments, and how confident we are in what we learn, are among the most crucial applications of probability theory.

So: where does this book fit in? When determining how to interpret the results of an experiment, you are often going to capture the things you don't know about the experiment with randomness. For instance, if the experiment is a coin flip, you won't be able to say with certainty whether the coin is going to land on heads or tails. Rather, you're going to capture this uncertainty with probability: you might say that the coin has some probability of landing on heads, and some probability of landing on tails. What if you flipped the coin \(100\) times, and used the ratio of the number of heads to the number of flips to guess the probability of the coin? Is this guess reasonable? How often is it going to be close to the true probability of the coin? What happens if you're no longer dealing with a coin, but measuring an expected temperature? What happens if you're dealing with a genome?
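To get a feel for these questions, here is a toy simulation (the true heads probability `p_true` is made up for the example; in a real experiment, it is exactly the unknown you are trying to estimate) that guesses a coin's probability of heads using the relative frequency of heads:

import random

random.seed(0)  # fix the pseudorandom sequence so the example is reproducible

p_true = 0.6  # the coin's true heads probability, hidden from the experimenter

for n in (10, 100, 10_000):
    flips = [random.random() < p_true for _ in range(n)]  # n simulated flips
    p_hat = sum(flips) / n  # relative frequency of heads among the n flips
    print(n, p_hat)

The estimates will typically tighten around the true value as the number of flips grows; making statements like that precise is exactly what the laws of large numbers in Part III are about.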

Determining how things that are random behave during experiments is hard work. As a theorist, your job is going to involve a good deal of math, a bit of knowledge of statistics, and a bit of experimentation. In this book, we're going to attempt to give you the tools that you need to, first and foremost, understand what exactly randomness is. There are a number of ways to conceptualize randomness, some of which are easier to work with than others. The particular approach we're going to use is called measure theory. Under some assumptions, you can capture the behavior of random things using objects called probability measures, which we'll discuss at length when we introduce measure theory. Probability measures ascribe rules to the behavior of a particular set of functions, called random variables. As you'll hopefully see later, random variables are not, as some statistics courses may lead you to believe, scary letters with a bunch of rules attached to them. With the right intuition, they are quite natural, and most of what you have learned so far in courses like calculus has already prepared you to understand how random variables work. Hopefully, by the end of this book, we'll have equipped you with the knowledge to conceptualize and understand random variables, so that those scary letters and scary formulas become a bit less scary and easier to read.

Objectives and Approach#

The draft notes for this book were written while I undertook my doctoral studies at Johns Hopkins University. Going in, I was absolutely not the ideal candidate for a statistics PhD: I had taken the basic engineering undergraduate mathematics courses (univariate and multivariate calculus, differential equations), I had some basic exposure to probability and statistics (I had taken statistics 101), and I had taken nearly no mathematical theory courses (I had no clue what a proof was, beyond that it was something too complicated for me to understand). When I saw my first proofs in real analysis, things were pretty easy: everything was clearly and explicitly stated, the proofs were easy to follow, and most importantly, they cemented my insight into the topic. Probability theory did not quite work this way. The proofs were often extremely terse, emphasis was placed on brevity and on giving the reader a general sketch of what a proof might look like, and for a large portion of the material the dots were never really connected. I remain pretty confident that, had I been given a theorem on a test directly from the book and provided, verbatim, the proof from the book (plagiarism aside), I would have failed every single test. It was basically impossible for me to figure out what a complete and correct proof of a probability theory result looked like because, in some sense, I had never seen one.

To me, this leaves a substantial gap in probability theory: it is a crucially important subject to learn, as most of the algorithms and approaches you will use daily in your future as an industry professional or researcher are best understood through the lens of probability theory. Yet there is no medium through which you can appreciate these results without substantial omissions in the proofs, or logical leaps that are, for students like me, extremely arduous (or impossible within one's years of study) to connect back. Some perspectives might view this as an advantage: it leaves room for students to explore the topic on their own, and to appreciate the finer points in their own time. Truly understanding mathematics is a hands-on endeavor, and that requires self-study, self-reflection, and self-paced homework with attention to detail. However, I think that for a lot of students and practitioners alike, it is invaluable to know what rigor and completeness, two expectations that your job or your class will have for you, look like in their unadulterated forms.

The mantra I promise to do my best to keep to is that no result is too obvious, and no proof is too trivial, for some person in a probability theory class somewhere out there. I want to help you get from step \(a\) to \(z\), and I want to show you everything from \(b\) to \(y\) too.

Prerequisites#

Studying probability theory is going to require you to have some background in math.

I am going to expect that you are familiar with basic univariate and multivariate calculus and differential equations.

Further, you should be familiar with basic concepts from real analysis, such as concepts involving limits (\(\lim\), \(\liminf\), \(\limsup\)), basic results about derivatives, and some exposure to proofs. I am going to go through the gory details with you, but I'm not going to start completely from scratch, so I'll provide a Chapter \(0\) which outlines the concepts from real analysis I'm going to expect you know already.

Perhaps crucially, I’m going to write this book as though you know effectively nothing about probability or statistics, as we’re going to build those concepts from the ground up along the way.

Roadmap#

This book is organized into four parts:

  1. Part I, Foundations, gives you a brief overview of what you are expected to know already. We’ll discuss some key results from real analysis, which you should have seen before, and get introduced to proofs in a (hopefully) familiar environment.

  2. Part II, Introduction to Measure Theory, is where you will be introduced to the foundational building blocks of probability and statistics, random variables, along with interesting results about how these random variables behave.

  3. Part III, Asymptotic Probability Theory, will introduce you to many of the foundational results from probability theory that are used for statistical inference. We will talk about laws of large numbers, central limit theorems, and concentration results for random vectors and matrices.

  4. Part IV, Markov Chain Theory, will introduce you to Martingales and Markov Chains, two related tools which provide the basic techniques for analyzing state-space models.

Other resources#

Many resources exist which will help you greatly with some of the content of this book.

If you don't have any background in real analysis quite yet, I'd recommend that you start with an introductory real analysis text before focusing on this book: many of the concepts and ideas that will come up over and over again here, such as limits and the basics of calculus, will be crucial to the content. Further, the basics of how formal mathematical proof works are not something this book covers. We are going to prove things, explicitly and completely, in probability theory, but we will not be starting from scratch with reading and understanding proofs.

Conventions used#

The following conventions are used in this book:

  • Italic: indicates emphasis on a term or exclamation, drawing attention to the sentence in which it is used.

  • Bold: indicates a definition for a term or concept.

  • Unicode block: indicates the name of an algorithm, function names, package names, and other programmatic text elements.

Admonitions

Used to indicate ideas that are directly relevant or supplementary to the content of the book, but are not essential for understanding the main concepts of a section or paragraph.

Throughout this book, we’re going to heavily develop new results, and then leverage them. You’ll often see numbered results in four different classes:

Theorems#

Theorems are the backbone of mathematics. They are “big-deal” results, and they'll often have clever names attached to them which describe what they are, or what they do. For instance, if we wanted to make a theorem stating that the sky is blue, we might name it the “The Sky is Blue” Theorem, and it would look like this:

Theorem 1 (The-Sky-is-Blue)

On sunny days, the sky is blue.

If we wanted to reference this theorem for another result, we would do it like this: Theorem 1. For theorems, you should have a general idea of how to use them in appropriate contexts, and perhaps remember some of the bigger ones.

Lemmas#

Lemmas are to theorems what ingredients are to a recipe. Lemmas are used when the result that they outline is not, in and of itself, particularly interesting, but when coupled together with other facts, can create an interesting result (such as a theorem). For instance, a small supporting fact that we might want to lean on later could look like this:

Lemma 1 (Simple addition)

\(2 + 2 = 4\).

If we wanted to reference this lemma for another result, we would do it like this: Lemma 1. Lemmas are similar to theorems as far as where we expect them to land in your head: we expect you to have a general idea of how to use them in appropriate contexts, but you usually will have to refer back to your notes for the specific details.

Properties#

Properties are descriptive facts about things that effectively “fall out” straight from the definition, usually rather directly. They’ll look like this:

Property 1 (A property)

Cats have whiskers.

Usually, when we call something a property explicitly, what we’re trying to tell you is that you should, more than likely, try to associate this result rather immediately with the concept that it is a property of (here, you should associate “whiskers” with “cats”, for instance), and be able to use it without needing a prompt back to the exact statement every time. If we call it a property, it’s because we expect that you should, eventually, be able to have a working knowledge of the fact it describes.

Definitions#

Definitions are names ascribed to particular concepts or ideas that will come up again and again. Rather than having to spell out a complicated concept every time it appears, the definition captures all of its ideas in a single name. They'll look like this:

Definition 1 (Natural-Numbers-\(\mathbb N\))

The natural numbers are the set of numbers which are used for counting; e.g.:

\[\begin{align*} \mathbb N &= \left\{1, 2, 3, 4, \ldots\right\} \end{align*}\]

Proofs#

With every theorem and lemma, with few exceptions, we'll also provide things called proofs. Theorems and lemmas assert a particular fact, and the proof shows you why that fact happens to be true. For instance, a simple theorem with its proof might look like this:

Theorem 2 (The-Sky-is-Gray)

On rainy days, the sky is gray.

Proof. On a rainy day, go outside, and observe the grayness of the sky.

This proof is now over.

The proof is preceded by the word “Proof.”, and ends at the final indented line.

Code Blocks#

The entirety of this book has been compiled using Jupyter notebooks, integrated through the jupyter-book framework. The book is currently hosted on GitHub at github.com/ebridge2/prob_book, and the build log for the text can be found at github.com/ebridge2/prob_book/actions. You can find every section's notebook, and every command used to prepare the environment in which the corresponding textbook pages were executed, using those two links.

Often, we're not just going to state results: we're going to try to demonstrate how they work in practice. This will be done, more often than not, using code snippets and figures. In general, all code relevant to an algorithm or simulation will be shown explicitly. For example, a Python code block, and its corresponding output, looks like this:

def howdy_world():
    print("Howdy world!")

howdy_world()
Howdy world!

Sometimes, there will be code sections that aren’t really relevant to the message being conveyed. These code blocks, which may include calls to plotting methods the user has seen before, or redundant blocks that only add styling to figures or the like, will generally be hidden if we think that they detract from the overall flow of a section. These blocks will look like this:

def howdy_hidden():
    print("Howdy world, but I'm hidden!")

howdy_hidden()
Howdy world, but I'm hidden!

Using code examples, citing the book, and reaching out with feedback#

This book is intended to teach you the foundations of probability theory. In general, I am providing the proofs, code, and results here for you: it is intended that you will borrow and repurpose the theorems, lemmas, and code snippets I describe in your own programs and documentation. If you are using a few brief snippets from this book in your work, feel free, with proper attribution via the citation below. If you want to write something that borrows some results, feel free to do so without permission. If you intend to sell or financially profit from work found directly in this book, you need to request permission.

To cite this textbook, you can use the following citation:

Eric W. Bridgeford. Measure Theoretic Probability Theory: A Hands-On Introduction. Eric Bridgeford, https://ericwb.me/prob_book/coverpage.html.

To request permissions, provide me with feedback about this in-progress draft, or even just to say hi and let me know what probability concepts you want to see me discuss, feel free to reach out directly. You can reach out at ericwb95 - at - gmail - dot - com. Hopefully you can piece that together to form a valid email address :)

Acknowledgements#

This book would not exist if not for the help I received during my graduate studies. Many of the proofs that were settled upon in this book came from iterating with classmates and professors alike, searching for unambiguous clarity and exposition of every relevant detail of a given result. Further, my instructors Dr. Michael Rosenblum, Dr. Cristian Tomasetti, and Dr. Mauro Maggioni and my classmates Grant Schumock, Elizabeth Sarker, Claire Heffernan, Brian Gilbert, and Haley Grant always made themselves available to help with slogging through the details and refining the proofs you'll find within.

Finally, this book absolutely would not exist without some of the wonderful resources from which I composed it. For a lot of the traditional probability theory results, I followed the presentation in “Probability: Theory and Examples” [2] basically to a T. I think Dr. Durrett did a fantastic job of laying out the landscape and building up the subject area with a great organization. Hopefully he will agree that imitation is the sincerest form of flattery, because why bother changing something that works? I filled in a lot of the holes in my understanding with a combination of “A Course in Probability Theory” [1] and the immeasurable help of the many excellent probability theorists over on Stack Overflow. If you have questions about probability theory, I cannot imagine a more helpful place than that community of scientists. While preparing this book, I also used a number of other measure theory and probability theory books as references to supplement these three resources. Unfortunately, the names of these wonderful works escape me (sadly, they are not on my shelf), so I apologize for not being able to individually delineate every source that has helped me along the way.

I learned a lot of the concentration-type results from “High-Dimensional Probability” [3], and I think this ultra-relevant subject material is best learned in this context, so I wove it in. Dr. Vershynin's book is absolutely stellar and, while not as theoretically motivated as a measure-theoretic treatment, trades some rigor for readability and ease of access. It is absolutely a must-read and one for the shelves.

[1] Kai Lai Chung. A Course in Probability Theory, Third Edition. Academic Press, Cambridge, MA, USA, October 2000. ISBN 978-0-12174151-8. URL: https://www.amazon.com/Course-Probability-Theory-Third/dp/0121741516.

[2] Rick Durrett. Probability: Theory and Examples, Fifth Edition. Cambridge University Press, Cambridge, England, UK, May 2019. ISBN 978-1-10847368-2. URL: https://www.cambridge.org/us/academic/subjects/statistics-probability/probability-theory-and-stochastic-processes/probability-theory-and-examples-5th-edition?format=HB.

[3] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, Cambridge, England, UK, September 2018. ISBN 978-1-10823159-6. doi:10.1017/9781108231596.