26 views
A first step towards identifying spam is to create a list of words that are more likely to appear in spam than in normal messages. For instance, words like "buy" or the brand name of an enhancement drug are more likely to occur in spam messages than in normal messages. Suppose a specified list of words is available and that your database of 5000 messages contains 1700 that are spam. Among the spam messages, 1343 contain words in the list. Of the 3300 normal messages, only 297 contain words in the list.

Obtain the probability that a message is spam given that the message contains words in the list.

Let $A=[\text{message contains words in the list}]$ be the event that a message is identified as spam, and let $B_{1}=[\text{message is spam}]$ and $B_{2}=[\text{message is normal}]$. We use the observed relative frequencies from the database as approximations to the probabilities.
$\begin{gathered} P\left(B_{1}\right)=\frac{1700}{5000}=.34 \quad P\left(B_{2}\right)=\frac{3300}{5000}=.66 \\ P\left(A \mid B_{1}\right)=\frac{1343}{1700}=.79 \quad P\left(A \mid B_{2}\right)=\frac{297}{3300}=.09 \end{gathered}$
Bayes' Theorem expresses the probability of being spam, given that a message is identified as spam, as
$P\left(B_{1} \mid A\right)=\frac{P\left(A \mid B_{1}\right) P\left(B_{1}\right)}{P\left(A \mid B_{1}\right) P\left(B_{1}\right)+P\left(A \mid B_{2}\right) P\left(B_{2}\right)}$
The updated, or posterior, probability is
$P\left(B_{1} \mid A\right)=\frac{.79 \times .34}{.79 \times .34+.09 \times .66}=\frac{.2686}{.328}=.819$
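As a cross-check, the same posterior follows directly from the counts: of the $1343+297=1640$ messages that contain words in the list, 1343 are spam, so
$P\left(B_{1} \mid A\right)=\frac{1343}{1343+297}=\frac{1343}{1640}=.819$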
Because this posterior probability of being spam is quite large, we suspect that this message really is spam. The prior probability was only $P\left(B_{1}\right)=.34$ ($34 \%$ of incoming messages are spam), so finding words from the list raises the probability of spam substantially, and we would likely want the spam filter to remove this message. Existing spam filter programs learn and improve as you mark your incoming messages as spam.
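For readers who want to verify the arithmetic, here is a minimal Python sketch of the same Bayes' Theorem calculation. The function name `posterior_spam` and its parameter names are illustrative assumptions, not part of the original problem; the counts are the ones given in the question.

```python
from fractions import Fraction


def posterior_spam(n_spam, n_normal, spam_hits, normal_hits):
    """Return P(B1 | A): probability a message is spam given it contains listed words.

    Hypothetical helper for illustration; uses relative frequencies as probabilities.
    """
    total = n_spam + n_normal
    prior_spam = Fraction(n_spam, total)          # P(B1)
    prior_normal = Fraction(n_normal, total)      # P(B2)
    like_spam = Fraction(spam_hits, n_spam)       # P(A | B1)
    like_normal = Fraction(normal_hits, n_normal) # P(A | B2)

    numerator = like_spam * prior_spam
    evidence = numerator + like_normal * prior_normal  # P(A), by total probability
    return numerator / evidence


# Counts taken from the question: 1700 spam, 3300 normal, 1343 and 297 flagged.
p = posterior_spam(1700, 3300, 1343, 297)
print(f"{float(p):.3f}")  # prints 0.819
```

Exact fractions are used here so that no rounding happens until the final print, which keeps the result consistent with the hand calculation.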
by Diamond (40,719 points)
