Misplaced Pages

Apache SpamAssassin

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license.

A computer program is a sequence or set of instructions in a programming language for a computer to execute. It is one component of software, which also includes documentation and other intangible components.

86-555: Apache SpamAssassin is a computer program used for e-mail spam filtering. It uses a variety of spam-detection techniques, including DNS and fuzzy checksum techniques, Bayesian filtering, external programs, blacklists and online databases. It is released under the Apache License 2.0 and has been part of the Apache Foundation since 2004. The program can be integrated with the mail server to automatically filter all mail for

172-462: A list of integers could be called integer_list . In object-oriented jargon, abstract datatypes are called classes . However, a class is only a definition; no memory is allocated. When memory is allocated to a class and bound to an identifier , it is called an object . Object-oriented imperative languages developed by combining the need for classes and the need for safe functional programming . A function, in an object-oriented language,

258-422: A programming language . Programming language features exist to provide building blocks to be combined to express programming ideals. Ideally, a programming language should: The programming style of a programming language to provide these building blocks may be categorized into programming paradigms . For example, different paradigms may differentiate: Each of these programming styles has contributed to

344-403: A standalone application or as a subprogram of another application (such as a Milter , SA-Exim , Exiscan , MailScanner , MIMEDefang , Amavis ) or as a client ( spamc ) that communicates with a daemon ( spamd ). The client/server or embedded mode of operation has performance benefits, but under certain circumstances may introduce additional security risks. Typically either variant of

430-428: A store which consisted of memory to hold 1,000 numbers of 50 decimal digits each. Numbers from the store were transferred to the mill for processing. The engine was programmed using two sets of perforated cards. One set directed the operation and the other set inputted the variables. However, the thousands of cogged wheels and gears never fully worked together. Ada Lovelace worked for Charles Babbage to create

516-440: A Perl plug-in for Apache SpamAssassin. Apache SpamAssassin reinforces its rules through Bayesian filtering where a user or administrator "feeds" examples of good (ham) and bad (spam) into the filter in order to learn the difference between the two. For this purpose, Apache SpamAssassin provides the command-line tool sa-learn , which can be instructed to learn a single mail or an entire mailbox as either ham or spam. Typically,

602-515: A SpamAssassin ruleset into a deterministic finite automaton that allows Apache SpamAssassin to use processor power more efficiently. Apache SpamAssassin is designed to trigger on the GTUBE , a 68-byte string similar to the antivirus EICAR test file . If this string is inserted in an RFC 5322 formatted message and passed through the Apache SpamAssassin engine, Apache SpamAssassin will trigger with

688-630: A corrected probability: where: (Demonstration: ) This corrected probability is used instead of the spamicity in the combining formula. This formula can be extended to the case where n is equal to zero (and where the spamicity is not defined), and evaluates in this case to $\Pr(S)$. "Neutral" words like "the", "a", "some", or "is" (in English), or their equivalents in other languages, can be ignored. These are also known as stop words. More generally, some Bayesian filters simply ignore all
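The correction described above can be sketched in Python. The formula itself is missing from this snapshot, so the sketch assumes a commonly cited form, (s * Pr(S) + n * spamicity) / (s + n), where n is the number of times the word was seen during learning. The function name, the `strength` parameter and its default of 3 are illustrative assumptions, not taken from the text.

```python
def corrected_spamicity(spamicity, n, prior=0.5, strength=3.0):
    """Blend a word's raw spamicity with a prior, weighted by evidence count n.

    Assumed form: (strength * prior + n * spamicity) / (strength + n).
    """
    if n == 0:
        # The word was never seen, so the raw spamicity is undefined:
        # fall back to the prior Pr(S), as the text describes.
        return prior
    return (strength * prior + n * spamicity) / (strength + n)
```

With few observations the result stays close to the prior; as n grows it approaches the raw spamicity, which is the behaviour the passage asks for.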

774-604: A description of the Analytical Engine (1843). The description contained Note G which completely detailed a method for calculating Bernoulli numbers using the Analytical Engine. This note is recognized by some historians as the world's first computer program . In 1936, Alan Turing introduced the Universal Turing machine , a theoretical device that can model every computation. It is a finite-state machine that has an infinitely long read/write tape. The machine can move

860-580: A language's basic syntax . The syntax of the language BASIC (1964) was intentionally limited to make the language easy to learn. For example, variables are not declared before being used. Also, variables are automatically initialized to zero. Here is an example computer program, in Basic, to average a list of numbers: Once the mechanics of basic computer programming are learned, more sophisticated and powerful languages are available to build large computer systems. Improvements in software development are

946-406: A picture's size in bytes is bigger than the equivalent text's size, so the spammer needs more bandwidth to send messages directly including pictures. Some filters are more inclined to decide that a message is spam if it has mostly graphical contents. A solution used by Google in its Gmail email system is to perform an OCR (Optical Character Recognition) on every mid to large size image, analyzing

1032-521: A profound influence on programming language design. Emerging from a committee of European and American programming language experts, it used standard mathematical notation and had a readable, structured design. Algol was first to define its syntax using the Backus–Naur form . This led to syntax-directed compilers. It added features like: Algol's direct descendants include Pascal , Modula-2 , Ada , Delphi and Oberon on one branch. On another branch

1118-412: A proposal to sell counterfeit copies of well-known brands of watches. The spam detection software, however, does not "know" such facts; all it can do is compute probabilities. The formula used by the software to determine that, is derived from Bayes' theorem where: (For a full demonstration, see Bayes' theorem#Extended form .) Statistics show that the current probability of any message being spam

1204-557: A result, the computer could be programmed quickly and perform calculations at very fast speeds. Presper Eckert and John Mauchly built the ENIAC. The two engineers introduced the stored-program concept in a three-page memo dated February 1944. Later, in September 1944, John von Neumann began working on the ENIAC project. On June 30, 1945, von Neumann published the First Draft of a Report on

1290-426: A single test will not usually be enough to reach the threshold. If Apache SpamAssassin considers a message to be spam, it can be further rewritten. In the default configuration, the content of the mail is appended as a MIME attachment, with a brief excerpt in the message body, and a description of the tests which resulted in the mail being classified as spam. If the score is lower than the defined settings, by default

1376-464: A site. It can also be run by individual users on their own mailbox and integrates with several mail programs . Apache SpamAssassin is highly configurable; if used as a system-wide filter it can still be configured to support per-user preferences. Apache SpamAssassin was created by Justin Mason, who had maintained a number of patches against an earlier program named filter.plx by Mark Jeftovic, which in turn

1462-422: A time frame during which the user is allowed to review the software's decision. The initial training can usually be refined when wrong judgements from the software are identified (false positives or false negatives). That allows the software to dynamically adapt to the ever-evolving nature of spam. Some spam filters combine the results of both Bayesian spam filtering and other heuristics (pre-defined rules about

1548-448: A weight of 1000. Computer program A computer program in its human-readable form is called source code . Source code needs another computer program to execute because computers can only execute their native machine instructions . Therefore, source code may be translated to machine instructions using a compiler written for the language. ( Assembly language programs are translated using an assembler .) The resulting file

1634-581: Is free / open source software , licensed under the Apache License 2.0 . Versions prior to 3.0 are dual-licensed under the Artistic License and the GNU General Public License . Many commercially available anti-spam packages integrate SpamAssassin as part of their products, such as SpamKiller by McAfee , Kerio MailServer by Kerio, and SmarterMail by SmarterTools. sa-compile is a utility distributed with Apache SpamAssassin that compiles

1720-403: Is 80%, at the very least: The filters that use this hypothesis are said to be "not biased", meaning that they have no prejudice regarding the incoming email. This assumption permits simplifying the general formula to: This is functionally equivalent to asking, "what percentage of occurrences of the word 'replica' appear in spam messages?" This quantity is called "spamicity" (or "spaminess") of

1806-418: Is assigned to a class. An assigned function is then referred to as a method , member function , or operation . Object-oriented programming is executing operations on objects . Object-oriented languages support a syntax to model subset/superset relationships. In set theory , an element of a subset inherits all the attributes contained in the superset. For example, a student is a person. Therefore,

1892-399: Is called an executable . Alternatively, source code may execute within an interpreter written for the language. If the executable is requested for execution, then the operating system loads it into memory and starts a process . The central processing unit will soon switch to this process so it can fetch, decode, and then execute each machine instruction. If the source code

1978-420: Is computed using Bayes' theorem. Then, the email's spam probability is computed over all words in the email, and if the total exceeds a certain threshold (say 95%), the filter will mark the email as spam. As in any other spam filtering technique, email marked as spam can then be automatically moved to a "Junk" email folder, or even deleted outright. Some software implements quarantine mechanisms that define

2064-399: Is matched against all tests and Apache SpamAssassin combines the results into a global score which is assigned to the message. The higher the score, the higher the probability that the message is spam. Apache SpamAssassin has an internal (configurable) score threshold to classify a message as spam. Usually a message will only be considered as spam if it matches multiple criteria; matching just
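A minimal Python sketch of this scoring scheme follows: each test carries a score, the scores of all matching tests are summed, and the total is compared with a configurable threshold. The rule names, patterns, scores and the threshold value are hypothetical and are not real SpamAssassin rules.

```python
import re

# Hypothetical tests as (name, pattern, score); negative scores indicate ham.
RULES = [
    ("MENTIONS_REPLICA", re.compile(r"\breplica\b", re.I), 2.5),
    ("ALL_CAPS_SUBJECT", re.compile(r"^Subject: [A-Z !]+$", re.M), 1.5),
    ("WIRE_TRANSFER", re.compile(r"\bwire transfer\b", re.I), 2.0),
    ("HAS_LIST_ID", re.compile(r"^List-Id: ", re.M), -1.0),
]

THRESHOLD = 5.0  # assumed value for this sketch


def score_message(raw):
    """Return the global score and the list of (test name, score) that matched."""
    hits = [(name, score) for name, pattern, score in RULES if pattern.search(raw)]
    return sum(score for _, score in hits), hits


def is_spam(raw, threshold=THRESHOLD):
    """Classify a message as spam when its global score reaches the threshold."""
    total, _ = score_message(raw)
    return total >= threshold
```

Because no single hypothetical test here scores 5.0 or more, a message has to match several tests before it is classified as spam, which mirrors the behaviour described above.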

2150-421: Is not generally satisfied (for example, in natural languages like English the probability of finding an adjective is affected by the probability of having a noun), but it is a useful idealization, especially since the statistical correlations between individual words are usually not known. On this basis, one can derive the following formula from Bayes' theorem: where: Spam filtering software based on this formula

2236-410: Is one of the oldest ways of doing spam filtering, with roots in the 1990s. Bayesian algorithms were used for email filtering as early as 1996. Although naive Bayesian filters did not become popular until later, multiple programs were released in 1998 to address the growing problem of unwanted email. The first scholarly publication on Bayesian spam filtering was by Sahami et al. in 1998. Variants of

2322-402: Is requested for execution, then the operating system loads the corresponding interpreter into memory and starts a process. The interpreter then loads the source code into memory to translate and execute each statement . Running the source code is slower than running an executable . Moreover, the interpreter must be installed on the computer. The "Hello, World!" program is used to illustrate

2408-555: Is sometimes embedded within mail server software itself. CRM114, often cited as a Bayesian filter, is not intended to use a Bayes filter in production, but includes the "unigram" feature for reference. Particular words have particular probabilities of occurring in spam email and in legitimate email. For instance, most email users will frequently encounter the word "Viagra" in spam email, but will seldom see it in other email. The filter doesn't know these probabilities in advance, and must first be trained so it can build them up. To train

2494-405: Is sometimes referred to as a naive Bayes classifier , as "naive" refers to the strong independence assumptions between the features. The result p is typically compared to a given threshold to decide whether the message is spam or not. If p is lower than the threshold, the message is considered as likely ham, otherwise it is considered as likely spam. Usually p is not directly computed using

2580-557: Is spam or not. Most rules are based on regular expressions that are matched against the body or header fields of the message, but Apache SpamAssassin also employs a number of other spam-fighting techniques. The rules are called "tests" in the SpamAssassin documentation. Each test has a score value that will be assigned to a message if it matches the test's criteria. The scores can be positive or negative, with positive values indicating "spam" and negative "ham" (non-spam messages). A message

2666-548: Is to alter the electrical resistivity and conductivity of a semiconductor junction. First, naturally occurring silicate minerals are converted into polysilicon rods using the Siemens process. The Czochralski process then converts the rods into a monocrystalline silicon boule crystal. The crystal is then thinly sliced to form a wafer substrate. The planar process of photolithography then integrates unipolar transistors, capacitors, diodes, and resistors onto

2752-466: Is to replace text with pictures, either directly included or linked. The whole text of the message, or some part of it, is replaced with a picture where the same text is "drawn". The spam filter is usually unable to analyze this picture, which would contain the sensitive words like «Viagra». However, since many mail clients disable the display of linked pictures for security reasons, the spammer sending links to distant pictures might reach fewer targets. Also,

2838-623: The new statement. A module's other file is the source file . Here is a C++ source file for the GRADE class in a simple school application: Here is a C++ header file for the PERSON class in a simple school application: Bayesian spam filtering Naive Bayes classifiers are a popular statistical technique of e-mail filtering . They typically use bag-of-words features to identify email spam , an approach commonly used in text classification . Naive Bayes classifiers work by correlating

2924-604: The IBM System/360 (1964) had a CPU made from circuit boards containing discrete components on ceramic substrates . The Intel 4004 (1971) was a 4- bit microprocessor designed to run the Busicom calculator. Five months after its release, Intel released the Intel 8008 , an 8-bit microprocessor. Bill Pentz led a team at Sacramento State to build the first microcomputer using the Intel 8008:

3010-480: The Sac State 8008 (1972). Its purpose was to store patient medical records. The computer supported a disk operating system to run a Memorex , 3- megabyte , hard disk drive . It had a color display and keyboard that was packaged in a single console. The disk operating system was programmed using IBM's Basic Assembly Language (BAL) . The medical records application was programmed using a BASIC interpreter. However,

3096-550: The circuits. At its core, it was a series of Pascalines wired together. Its 40 units weighed 30 tons, occupied 1,800 square feet (167 m²), and consumed $650 per hour (in 1940s currency) in electricity when idle. It had 20 base-10 accumulators. Programming the ENIAC took up to two months. Three function tables were on wheels and needed to be rolled to fixed function panels. Function tables were connected to function panels by plugging heavy black cables into plugboards. Each function table had 728 rotating knobs. Programming

3182-404: The programming environment to advance from a computer terminal (until the 1990s) to a graphical user interface (GUI) computer. Computer terminals limited programmers to a single shell running in a command-line environment . During the 1970s, full-screen source code editing became possible through a text-based user interface . Regardless of the technology available, the goal is to program in

3268-469: The Bayesian noise better, at the expense of a bigger database. There are other ways of combining individual probabilities for different words than using the "naive" approach. These methods differ from it on the assumptions they make on the statistical properties of the input data. These different hypotheses result in radically different formulas for combining the individual probabilities. For example, assuming

3354-494: The EDVAC , which equated the structures of the computer with the structures of the human brain. The design became known as the von Neumann architecture . The architecture was simultaneously deployed in the constructions of the EDVAC and EDSAC computers in 1949. The IBM System/360 (1964) was a family of computers, each having the same instruction set architecture . The Model 20 was

3440-433: The ENIAC also involved setting some of the 3,000 switches. Debugging a program took a week. It ran from 1947 until 1955 at Aberdeen Proving Ground , calculating hydrogen bomb parameters, predicting weather patterns, and producing firing tables to aim artillery guns. Instead of plugging in cords and turning switches, a stored-program computer loads its instructions into memory just like it loads its data into memory. As

3526-493: The above formula due to floating-point underflow. Instead, p can be computed in the log domain by rewriting the original equation as follows: Taking logs on both sides: Let $\eta = \sum_{i=1}^{N} \left[\ln(1-p_i) - \ln p_i\right]$. Therefore, Hence
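The log-domain computation can be sketched in Python as follows. The sketch assumes the naive combining formula p = ∏p_i / (∏p_i + ∏(1 − p_i)), from which p = 1 / (1 + e^η) follows for the η defined above. The function names are illustrative, and every p_i is assumed to lie strictly between 0 and 1.

```python
import math


def combine_naive(probs):
    """Direct product form; both products can underflow to 0.0 for long messages."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def combine_log(probs):
    """Log-domain form: p = 1 / (1 + exp(eta)), eta = sum(ln(1 - p_i) - ln(p_i))."""
    eta = sum(math.log(1.0 - p) - math.log(p) for p in probs)
    if eta > 700.0:
        # exp() would overflow a double here; the result is effectively zero.
        return 0.0
    return 1.0 / (1.0 + math.exp(eta))
```

For short inputs the two functions agree. For a message with hundreds of strongly weighted words, both products in `combine_naive` underflow to zero and the division fails, while `combine_log` still returns a usable value.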

3612-409: The alternate formula for computing the combined probability: In the case a word has never been met during the learning phase, both the numerator and the denominator are equal to zero, both in the general formula and in the spamicity formula. The software can decide to discard such words for which there is no information available. More generally, the words that were encountered only a few times during

3698-420: The application is set up in a generic mail filter program, or it is called directly from a mail user agent that supports this, whenever new mail arrives. Mail filter programs such as procmail can be made to pipe all incoming mail through Apache SpamAssassin with an adjustment to a user's procmailrc file. Apache SpamAssassin comes with a large set of rules which are applied to determine whether an email

3784-403: The basic technique have been implemented in a number of research works and commercial software products. Many modern mail clients implement Bayesian spam filtering. Users can also install separate email filtering programs . Server-side email filters, such as DSPAM , SpamAssassin , SpamBayes , Bogofilter , and ASSP , make use of Bayesian spam filtering techniques, and the functionality

3870-640: The cheaper Intel 8088 . IBM embraced the Intel 8088 when they entered the personal computer market (1981). As consumer demand for personal computers increased, so did Intel's microprocessor development. The succession of development is known as the x86 series . The x86 assembly language is a family of backward-compatible machine instructions . Machine instructions created in earlier microprocessors were retained throughout microprocessor upgrades. This enabled consumers to purchase new computers without having to purchase new application software . The major categories of instructions are: VLSI circuits enabled

3956-429: The company name and the names of clients or customers will be mentioned often. The filter will assign a lower spam probability to emails containing those names. The word probabilities are unique to each user and can evolve over time with corrective training whenever the filter incorrectly classifies an email. As a result, Bayesian spam filtering accuracy after training is often superior to pre-defined rules. Depending on

4042-419: The computer was an evolutionary dead-end because it was extremely expensive. Also, it was built at a public university lab for a specific purpose. Nonetheless, the project contributed to the development of the Intel 8080 (1974) instruction set . In 1978, the modern software development environment began when Intel upgraded the Intel 8080 to the Intel 8086 . Intel simplified the Intel 8086 to manufacture

4128-537: The configuration, an execute button was pressed. This process was then repeated. Computer programs also were automatically inputted via paper tape , punched cards or magnetic-tape . After the medium was loaded, the starting address was set via switches, and the execute button was pressed. A major milestone in software development was the invention of the Very Large Scale Integration (VLSI) circuit (1964). Following World War II , tube-based technology

4214-427: The contents, looking at the message's envelope, etc.), resulting in even higher filtering accuracy, sometimes at the cost of adaptiveness. Bayesian email filters utilize Bayes' theorem . Bayes' theorem is used several times in the context of spam: Let's suppose the suspected message contains the word " replica ". Most people who are used to receiving e-mail know that this message is likely to be spam, more precisely

4300-434: The descendants include C , C++ and Java . BASIC (1964) stands for "Beginner's All-Purpose Symbolic Instruction Code". It was developed at Dartmouth College for all of their students to learn. If a student did not go on to a more powerful language, the student would still remember Basic. A Basic interpreter was installed in the microcomputers manufactured in the late 1970s. As the microcomputer industry grew, so did

4386-459: The email's spam score, making it more likely to slip past a Bayesian spam filter. However, with (for example) Paul Graham's scheme only the most significant probabilities are used, so that padding the text out with non-spam-related words does not affect the detection probability significantly. Words that normally appear in large quantities in spam may also be transformed by spammers. For example, «Viagra» would be replaced with «Viaagra» or «V!agra» in
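The idea of using only the most significant probabilities can be sketched in Python. Here "significant" is taken to mean farthest from the neutral value 0.5, and the cutoff `k=15` is an assumption made for illustration rather than a value stated in the text.

```python
def most_interesting(probs, k=15):
    """Keep the k word probabilities that deviate most from the neutral 0.5."""
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:k]
```

With this selection, padding a message with many near-neutral words leaves the selected set unchanged, so the combined probability computed from it does not move.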

4472-435: The fact that a given word appears several times in the examined message, others don't. Some software products use patterns (sequences of words) instead of isolated natural languages words. For example, with a "context window" of four words, they compute the spamicity of "Viagra is good for", instead of computing the spamicities of "Viagra", "is", "good", and "for". This method gives more sensitivity to context and eliminates
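A short Python sketch of the context-window idea: instead of scoring single words, the tokenizer emits every run of `size` consecutive words as one token. The function name and the whitespace-based splitting are simplifying assumptions.

```python
def window_tokens(text, size=4):
    """Return every sequence of `size` consecutive words as a single token."""
    words = text.split()
    if len(words) < size:
        # Too short for a full window: treat the whole text as one token.
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]
```

Each distinct window becomes its own database entry, which is why this approach needs a bigger database than single-word tokens.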

4558-444: The filter, the user must manually indicate whether a new email is spam or not. For all words in each training email, the filter will adjust the probabilities that each word will appear in spam or legitimate email in its database. For instance, Bayesian spam filters will typically have learned a very high spam probability for the words "Viagra" and "refinance", but a very low spam probability for words seen only in legitimate email, such as
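The training step can be sketched in Python as a small per-user store of word counts. The class and method names are hypothetical, tokenization is a plain lowercase split, and the probability estimate assumes that spam and ham are equally likely a priori.

```python
from collections import Counter


class BayesTrainer:
    """Toy training store; structure and names are illustrative only."""

    def __init__(self):
        self.messages = {"spam": 0, "ham": 0}
        self.word_counts = {"spam": Counter(), "ham": Counter()}

    def train(self, text, label):
        """Record one message the user has labelled 'spam' or 'ham'."""
        self.messages[label] += 1
        # Count each distinct word once per message.
        self.word_counts[label].update(set(text.lower().split()))

    def spam_probability(self, word):
        """Estimate the probability that a message containing `word` is spam."""
        word = word.lower()
        rates = {}
        for label in ("spam", "ham"):
            total = self.messages[label]
            rates[label] = self.word_counts[label][word] / total if total else 0.0
        denom = rates["spam"] + rates["ham"]
        # A word never seen in training carries no information.
        return rates["spam"] / denom if denom else None
```

Calling `train` again on a message the filter misclassified shifts the stored counts, which is how the word probabilities can evolve over time.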

4644-460: The first Fortran standard in 1966. In 1978, Fortran 77 became the standard until 1991. Fortran 90 supports: COBOL (1959) stands for "COmmon Business Oriented Language". Fortran manipulated symbols. It was soon realized that symbols did not need to be numbers, so strings were introduced. The US Department of Defense influenced COBOL's development, with Grace Hopper being a major contributor. The statements were English-like and verbose. The goal

4730-485: The implementation, Bayesian spam filtering may be susceptible to Bayesian poisoning , a technique used by spammers in an attempt to degrade the effectiveness of spam filters that rely on Bayesian filtering. A spammer practicing Bayesian poisoning will send out emails with large amounts of legitimate text (gathered from legitimate news or literary sources). Spammer tactics include insertion of random innocuous words that are not normally associated with spam, thereby decreasing

4816-489: The individual probabilities follow a chi-squared distribution with 2 N degrees of freedom, one could use the formula: where C is the inverse of the chi-squared function . Individual probabilities can be combined with the techniques of the Markovian discrimination too. The spam that a user receives is often related to the online user's activities. For example, a user may have been subscribed to an online newsletter that

4902-468: The information about the tests passed and total score is still added to the email headers and can be used in post-processing for less severe actions, such as tagging the mail as suspicious. Apache SpamAssassin allows for a per-user configuration of its behavior, even if installed as system-wide service; the configuration can be read from a file or a database. In their configuration users can specify individuals whose emails are never considered spam, or change

4988-475: The language BCPL was replaced with B , and AT&T Bell Labs called the next version "C". Its purpose was to write the UNIX operating system . C is a relatively small language, making it easy to write compilers. Its growth mirrored the hardware growth in the 1980s. Its growth also was because it has the facilities of assembly language , but uses a high-level syntax . It added advanced features like: C allows

5074-400: The language. Basic pioneered the interactive session . It offered operating system commands within its environment: However, the Basic syntax was too simple for large programs. Recent dialects added structure and object-oriented extensions. Microsoft's Visual Basic is still widely used and produces a graphical user interface . C programming language (1973) got its name because

5160-460: The learning phase cause a problem, because it would be an error to trust blindly the information they provide. A simple solution is to simply avoid taking such unreliable words into account as well. Applying again Bayes' theorem, and assuming the classification between spam and ham of the emails containing a given word ("replica") is a random variable with beta distribution , some programs decide to use

5246-485: The matrix was to burn out the unneeded connections. There were so many connections, firmware programmers wrote a computer program on another chip to oversee the burning. The technology became known as Programmable ROM . In 1971, Intel installed the computer program onto the chip and named it the Intel 4004 microprocessor . The terms microprocessor and central processing unit (CPU) are now used interchangeably. However, CPUs predate microprocessors. For example,

5332-423: The messages identified as ham during the learning phase. For these approximations to make sense, the set of learned messages needs to be big and representative enough. It is also advisable that the learned set of messages conforms to the 50% hypothesis about the split between spam and ham, i.e. that the datasets of spam and ham are of the same size. Of course, determining whether a message is spam or ham based only on

5418-399: The names of friends and family members. After training, the word probabilities (also known as likelihood functions) are used to compute the probability that an email with a particular set of words in it belongs to either category. Either every word in the email contributes to the email's spam probability, or only the most interesting words do. This contribution is called the posterior probability and

5504-413: The presence of the word "replica" is error-prone, which is why Bayesian spam software tries to consider several words and combine their spamicities to determine a message's overall probability of being spam. Most Bayesian spam filtering algorithms are based on formulas that are strictly valid (from a probabilistic standpoint) only if the words present in the message are independent events. This condition

5590-443: The programmer to control which region of memory data is to be stored. Global variables and static variables require the fewest clock cycles to store. The stack is automatically used for the standard variable declarations . Heap memory is returned to a pointer variable from the malloc() function. In the 1970s, software engineers needed language support to break large projects down into modules . One obvious feature

5676-486: The result of improvements in computer hardware . At each stage in hardware's history, the task of computer programming changed dramatically. In 1837, Jacquard's loom inspired Charles Babbage to attempt to build the Analytical Engine . The names of the components of the calculating device were borrowed from the textile industry. In the textile industry, yarn was brought from the store to be milled. The device had

5762-448: The scores for certain rules. The user can also define a list of languages which they want to receive mail in, and Apache SpamAssassin then assigns a higher score to all mails that appear to be written in another language. Apache SpamAssassin is based on heuristics (pattern recognition), and such software exhibits false positives and false negatives. Apache SpamAssassin also supports: More methods can be added reasonably easily by writing

5848-438: The set of students is a subset of the set of persons. As a result, students inherit all the attributes common to all persons. Additionally, students have unique attributes that other people do not have. Object-oriented languages model subset/superset relationships using inheritance . Object-oriented programming became the dominant language paradigm by the late 1990s. C++ (1985) was originally called "C with Classes". It
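The student/person relationship can be sketched with inheritance in Python. The class names, attributes and method are illustrative and do not come from the text.

```python
class Person:
    """Superset: attributes common to all persons."""

    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"{self.name} is a person"


class Student(Person):
    """Subset: inherits everything from Person and adds its own attribute."""

    def __init__(self, name, school):
        super().__init__(name)  # reuse the attributes common to all persons
        self.school = school    # attribute unique to students

    def describe(self):
        return f"{super().describe()} and studies at {self.school}"
```

A `Student` object can be used anywhere a `Person` is expected, which is the subset/superset relationship described above.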

5934-467: The smallest and least expensive. Customers could upgrade and retain the same application software . The Model 195 was the most premium. Each System/360 model featured multiprogramming —having multiple processes in memory at once. When one process was waiting for input/output , another could compute. IBM planned for each model to be programmed using PL/1 . A committee was formed that included COBOL , Fortran and ALGOL programmers. The purpose

6020-401: The spam message. The recipient of the message can still read the changed words, but each of these words is met more rarely by the Bayesian filter, which hinders its learning process. As a general rule, this spamming technique does not work very well, because the derived words end up recognized by the filter just like the normal ones. Another technique used to try to defeat Bayesian spam filters

6106-430: The synthesis of different programming languages . A programming language is a set of keywords , symbols , identifiers , and rules by which programmers can communicate instructions to the computer. They follow a set of rules called a syntax . Programming languages get their basis from formal languages . The purpose of defining a solution in terms of its formal language is to generate an algorithm to solve

6192-447: The tape back and forth, changing its contents as it performs an algorithm . The machine starts in the initial state, goes through a sequence of steps, and halts when it encounters the halt state. All present-day computers are Turing complete . The Electronic Numerical Integrator And Computer (ENIAC) was built between July 1943 and Fall 1945. It was a Turing complete , general-purpose computer that used 17,468 vacuum tubes to create

6278-553: The underlining problem. An algorithm is a sequence of simple instructions that solve a problem. The evolution of programming languages began when the EDSAC (1949) used the first stored computer program in its von Neumann architecture . Programming the EDSAC was in the first generation of programming language . Imperative languages specify a sequential algorithm using declarations , expressions , and statements : FORTRAN (1958)

6364-419: The use of tokens (typically words, or sometimes other things), with spam and non-spam e-mails and then using Bayes' theorem to calculate a probability that an email is or is not spam. Naive Bayes spam filtering is a baseline technique for dealing with spam that can tailor itself to the email needs of individual users and give low false positive spam detection rates that are generally acceptable to users. It

6450-401: The user considers to be spam. This online newsletter is likely to contain words that are common to all newsletters, such as the name of the newsletter and its originating email address. A Bayesian spam filter will eventually assign a higher probability based on the user's specific patterns. The legitimate e-mails a user receives will tend to be different. For example, in a corporate environment,

6536-434: The user will move unrecognized spam to a separate folder, and then run sa-learn on the folder of non-spam and on the folder of spam separately. Alternatively, if the mail user agent supports it, sa-learn can be called for individual emails. Regardless of the method used to perform the learning, SpamAssassin's Bayesian test will help score future e-mails based on this learning to improve the accuracy. Apache SpamAssassin

6622-448: The wafer to build a matrix of metal–oxide–semiconductor (MOS) transistors. The MOS transistor is the primary component in integrated circuit chips . Originally, integrated circuit chips had their function set during manufacturing. During the 1960s, controlling the electrical flow migrated to programming a matrix of read-only memory (ROM). The matrix resembled a two-dimensional array of fuses. The process to embed instructions onto

6708-446: The word "replica", and can be computed. The number Pr ( W | S ) {\displaystyle \Pr(W|S)} used in this formula is approximated to the frequency of messages containing "replica" in the messages identified as spam during the learning phase. Similarly, Pr ( W | H ) {\displaystyle \Pr(W|H)} is approximated to the frequency of messages containing "replica" in

6794-451: The words which have a spamicity next to 0.5, as they contribute little to a good decision. The words taken into consideration are those whose spamicity is next to 0.0 (distinctive signs of legitimate messages), or next to 1.0 (distinctive signs of spam). A method can be for example to keep only those ten words, in the examined message, which have the greatest absolute value  |0.5 −  pI |. Some software products take into account

6880-552: Was begun in August 1997. Mason rewrote all of Jeftovic's code from scratch and uploaded the resulting codebase to SourceForge on April 20, 2001. In summer 2004 the project became an Apache Software Foundation project and was later officially renamed Apache SpamAssassin. Apache SpamAssassin is a Perl-based application (Mail::SpamAssassin in CPAN) which is usually used to filter all incoming mail for one or several users. It can be run as

6966-427: Was designed to expand C's capabilities by adding the object-oriented facilities of the language Simula. An object-oriented module is composed of two files. The definitions file is called the header file. Here is a C++ header file for the GRADE class in a simple school application: A constructor operation is a function with the same name as the class name. It is executed when the calling operation executes

7052-436: Was replaced with point-contact transistors (1947) and bipolar junction transistors (late 1950s) mounted on a circuit board. During the 1960s, the aerospace industry replaced the circuit board with an integrated circuit chip. Robert Noyce, co-founder of Fairchild Semiconductor (1957) and Intel (1968), achieved a technological improvement to refine the production of field-effect transistors (1963). The goal

7138-405: Was to decompose large projects physically into separate files. A less obvious feature was to decompose large projects logically into abstract data types. At the time, languages supported concrete (scalar) datatypes like integer numbers, floating-point numbers, and strings of characters. Abstract datatypes are structures of concrete datatypes, with a new name assigned. For example,

7224-433: Was to design a language so managers could read the programs. However, the lack of structured statements hindered this goal. COBOL's development was tightly controlled, so dialects did not emerge to require ANSI standards. As a consequence, it was not changed for 15 years until 1974. The 1990s version did make consequential changes, like object-oriented programming. ALGOL (1960) stands for "ALGOrithmic Language". It had

7310-425: Was to develop a language that was comprehensive, easy to use, extendible, and would replace COBOL and Fortran. The result was a large and complex language that took a long time to compile. Computers manufactured until the 1970s had front-panel switches for manual programming. The computer program was written on paper for reference. An instruction was represented by a configuration of on/off settings. After setting

7396-423: Was unveiled as "The IBM Mathematical FORmula TRANslating system". It was designed for scientific calculations, without string handling facilities. Along with declarations, expressions, and statements, it supported: It succeeded because: However, non-IBM vendors also wrote Fortran compilers, with a syntax that would likely fail IBM's compiler. The American National Standards Institute (ANSI) developed
