Misplaced Pages

Breidbart Index

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license.

The Breidbart Index, developed by Seth Breidbart, is the most significant cancel index in Usenet.

50-482: A cancel index measures the dissemination intensity of substantively identical articles. If the index exceeds a threshold, the articles are called newsgroup spam. They can then be removed using third-party cancel controls. The principal idea of the Breidbart Index is to give these methods different weight. With a crossposted message less data needs to be transferred and stored. And excessive crossposts (ECP) are also

100-450: A blind eye to spam in its archive of Usenet News. News server A news server is a collection of software used to handle Usenet articles. It may also refer to a computer itself which is primarily or solely used for handling Usenet. Access to Usenet is only available through news server providers. End users often use the term "posting" to refer to a single message or file posted to Usenet. For articles containing plain text, this

150-440: A blind eye" to the problem since the websites being pointed to use Google ads, which potentially generate revenue for both the spammer AND Google. The spam is extremely unfair to the companies paying Google and the spammer for an ad-click, as the most prevalent current spam (2010) is trying to trick readers into clicking on web ads by referring to them as images and saying that a link is hidden in them "due to high sex content" or that

200-468: A commercial news service. Speed, in relation to Usenet, is how quickly a server can deliver an article to the user. The server that the user connects to is typically part of a server farm that has many servers dedicated to multiple tasks. How fast the data can move throughout this farm is the first thing that affects the speed of delivery. The speed of data traveling throughout the farm can be severely bottlenecked through hard drive operations. Retrieving

250-414: A disk storage area generically called a "spool". There are several common ways in which the spool may be organized: A reader server provides an interface to read and post articles, generally with the assistance of a news client. A transit server exchanges articles with other servers. Most servers can provide both functions. Modern transit servers usually use NNTP to exchange news continually over

300-501: A given server, or it may have been present but already expired. All Usenet servers peer with one or more other servers in order to exchange articles. Occasionally, new servers appear. Although there are several web resources which may aid in finding peers, a better resource is the newsgroup news.admin.peering (Google Groups portal). As of 2020, text feeds can usually be obtained for free, while full binary feeds can be free or paid (depending on how many articles each server sends to

350-457: A group may be retained for longer than others, articles from remote servers do not always arrive promptly, and at times the date headers are simply incorrect. A sampling of many or all articles, preferably in more than one newsgroup, is required to detect such anomalies. News servers do not have unlimited storage, and due to this fact they can only hold posts for a length of time before they must delete them in order to make room for new posts. This

400-562: A likely beginner's error, while excessive multiposts (EMP) suggest deliberate usage of special software. The crucial issue is categorizing multiple articles as substantively identical. This includes The Breidbart Index of a set of articles is defined as the sum of the square roots of n, where n is the number of newsgroups to which each article is crossposted. $\mathrm{BI} = \sum_{k=1}^{m} \sqrt{n_k}$ Two copies of

450-410: A link hidden in the image (Google ad) will take them to a "PayPal form" that will give them money. While most newsreaders filter the spam at either the server or user level, Google does not filter spam out of its Usenet News archive. Google does, however, offer spam filtering for groups that decide to abandon Usenet and form a moderated Google Group, which gives another reason why Google would turn

500-415: A newsgroup faster for both the client and server by eliminating the need to open each individual article to present them in list form. If non-overview headers are required, such as when using a kill file, it may still be necessary to use the slower method of reading all the complete article headers. Many clients are unable to do this, and limit filtering to what is available in the summaries. Among

550-434: A posting are made, one to 9 groups, and one to 16. $\sqrt{9} + \sqrt{16} = 3 + 4 = 7$ A more aggressive criterion, Breidbart Index Version 2, has been proposed. The BI2 is defined as the sum of the square root of n, plus the sum of n, divided by two. A single message would only need to be crossposted to 35 newsgroups to breach
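The basic Breidbart Index arithmetic from the worked example (one copy to 9 groups, one to 16) can be sketched in a few lines of Python. This is only an illustration: the function name `breidbart_index` and the list-of-counts input are assumptions made for the example, not part of any Usenet tool.

```python
import math

def breidbart_index(group_counts):
    """Sum of sqrt(n) over all copies, where n is the number of
    newsgroups each copy was crossposted to (hypothetical helper)."""
    return sum(math.sqrt(n) for n in group_counts)

# Two copies of a posting: one to 9 groups, one to 16.
print(breidbart_index([9, 16]))  # 7.0
```

Because the square root grows slowly, a single copy crossposted to many groups adds less to the index than the same number of groups reached through separate copies.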

600-401: A single post equals 3 plus the number of groups that the post was sent to. The index of multiple posts is the sum of the indices of the individual posts. In fact, a cancel message is just a non-binding request to remove a certain article. News server operators can freely decide on how to implement the conflicting policies. Newsgroup spam Newsgroup spam is a type of spam where

650-408: A special inews program. When an article is posted, the process is much the same as when a transit server receives news, but with additional checks. For posting, the server will normally fill in missing Path and Message-ID lines and check the syntax of headers intended for human readers, such as From and Subject. If the article is posted to a moderated group, the server will attempt to mail it to

700-460: Is a particular problem to binary newsgroups which transmit large volumes of articles. For news servers provided by Internet Service Providers as part of a user's subscription package, typical retention rates are usually only 2–4 days. To deal with the increase of Usenet traffic, many providers turn to a hybrid system, in which requests for old articles not found on the provider's server are passed to another server with longer retention. Given

750-556: Is one that makes the articles available in the hierarchical disk directory format originated by B News 2.10, or offers the NNTP or IMAP commands, for use by newsreaders. A reader server typically also works as a transit server, but it may operate independently or serve as an alternative interface to an Internet forum. When receiving news, this type of server must perform the additional steps of filing articles into newsgroups and assigning sequential numbers within each group. An Xref line

800-407: Is synonymous with an article. For binary content such as pictures and files, it is often necessary to split the content among multiple articles. Typically through the use of numbered Subject: headers, the multiple-article postings are automatically reassembled into a single unit by the newsreader. Most servers do not distinguish between single and multiple-part postings, dealing only at the level of

850-519: Is usually added, listing all the groups where the message appears and the sequence numbers. Unlike Message IDs, the numbers and ordering of articles will differ on each server; but related servers may force agreement by operating in a slave mode, re-using their siblings' Xref lines. Reader servers typically also maintain a News Overview (NOV) database that allows newsreaders to quickly obtain message summaries and present messages in threaded form. Most reader servers support posting, either through NNTP or

900-599: The Internet and similar always-on connections. In the past, servers normally employed the UUCP protocol, which was designed for intermittent dial-up connections. Other ad hoc protocols, including e-mail, are less commonly seen. News servers normally connect with multiple peers, with the redundancy helping to spread loads and ensure that articles are not lost. Smaller sites, called leaf nodes, are connected to one other major server. Articles are routed based on information found in

950-608: The Breidbart index is used with a time range of seven days instead of 45. This is denoted by the abbreviation BI7. In hierarchy hamster.de.* the Breidbart index is used with a time range of 30 days instead of 45. This is denoted by the abbreviation BI30. This is defined in the FAQ of the group at.usenet.cancel-reports. The term used in the Call for Votes and in the FAQ is "Cancel-Index". Unofficial abbreviations are CI and ACI. The ACI of

1000-472: The British intelligence agency. These rambling messages used to state the originator as MI5Victim@mi5.gov.uk. Lately (December 2007) the spammer has taken to altering the "from" address and subject line in an attempt to get past newsgroup "kill" filters. This UK-based spammer readily admits that he has mental illness in several of his postings. See also The Corley Conspiracy. The prevalence of Usenet spam led to

1050-1018: The abbreviation SBI are mentioned in the Spam Thresholds FAQ. However, in hierarchy nl.* this index is called BI3. The SBI is calculated similarly to the BI2 but adds up the number of groups in Followup-to: (if present) instead of the number of groups in Newsgroups:. This encourages the use of Followup-to:. Two posts contain the same text. One is crossposted to 9 groups. The other is crossposted to 16, with four groups in Followup-to:. $\frac{\sqrt{9} + \sqrt{16} + 9 + 4}{2} = \frac{3 + 4 + 9 + 4}{2} = \frac{20}{2} = 10$ In hierarchy de.*
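The SBI example above can be reproduced with a short Python sketch. It assumes, following the worked example, that the square-root term always uses the Newsgroups: count while the linear term uses the Followup-to: count when that header is present. The function name and the `(newsgroups, followup)` tuple layout are hypothetical.

```python
import math

def skirvin_breidbart_index(posts):
    """posts: iterable of (newsgroups_count, followup_count_or_None).

    The linear term uses the Followup-to: count if present,
    otherwise it falls back to the Newsgroups: count.
    """
    total = 0.0
    for n_groups, n_followup in posts:
        linear = n_followup if n_followup is not None else n_groups
        total += (math.sqrt(n_groups) + linear) / 2
    return total

# One post to 9 groups, one to 16 groups with four in Followup-to:.
print(skirvin_breidbart_index([(9, None), (16, 4)]))  # 10.0
```

Without the Followup-to: header on the second post, the same pair would score 16, which illustrates how the index rewards narrowing follow-ups.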

1100-408: The amount of storage available on the servers and continually increasing traffic. As of 2009, it is common for average news providers to have text retention of over 1000 days and binary retention of over 200 days. Large news providers offer text retention up to 2480 days and binary retention of 850 days or more. It's important to understand that retention time varies between different newsgroups within

1150-474: The article and overview information can cause massive stress on hard drives. To combat this, caching technology and cylindrical file storage systems have been developed. Once the farm is able to deliver the data to the network, then the provider has limited control over the speed to the user. Since the network path to each user is different, some users will have good routes and the data will flow quickly. Other users will have overloaded routers between them and

1200-427: The article is stored, the server attempts to retransmit it to any servers in its own newsfeed list. Articles with Control lines are given special handling. They are typically filed in special "control" newsgroups and may cause the server to automatically carry out exceptional actions. The newgroup and rmgroup commands can cause newsgroups to be created or removed; checkgroups can be used to reconcile

1250-423: The basis of advertisement or commercial solicitations. The word "spam" was usually taken to mean "excessive multiple posting (EMP)", and other neologisms were coined for other abuses – such as "velveeta" (from the processed cheese product of that name) for "excessive cross-posting". A subset of spam was deemed "cancellable spam", for which it is considered justified to issue third-party cancel messages. In

1300-473: The development of the Breidbart Index as an objective measure of a message's "spamminess", and attempts to purge newsgroups of spam. Spamming of Usenet newsgroups pre-dates e-mail spam. The first widely recognized Usenet spam (though not the most famous) was posted on 18 January 1994 by Clarence L. Thomas IV, a sysadmin at Andrews University. Entitled "Global Alert for All: Jesus is Coming Soon", it

1350-467: The development of the Breidbart Index as an objective measure of a message's "spamminess". The use of the BI and spam-detection software has led to Usenet being policed by anti-spam volunteers, who purge newsgroups of spam by sending cancels and filtering it out on the way into servers. This very active form of policing has meant that Usenet is a far less attractive target to spammers than it used to be, and most of

1400-491: The early 1990s there was substantial controversy among Usenet system administrators (news admins) over the use of cancel messages to control spam. A "cancel message" is a directive to news servers to delete a posting, causing it to be inaccessible. Some regarded this as a bad precedent, leaning towards censorship, while others considered it a proper use of the available tools to control the growing spam problem. A culture of neutrality towards content precluded defining spam on

1450-450: The first Usenet spam of any sort, was an advertisement for legal services entitled "Green Card Lottery – Final One?". It was posted on 12 April 1994, by Arizona lawyers Laurence Canter and Martha Siegel, and hawked legal representation for United States immigrants seeking green cards. Usenet convention defines spamming as "excessive multiple posting", that is, the repeated posting of a message (or substantially similar messages). During

1500-546: The header lines defined in RFC 1036. Of particular interest to a transit server are: In most cases, the sending server controls the article transfer process. It compares the Newsgroups and Distribution of each newly arrived article against a set of patterns called newsfeeds, listing each remote server and the newsgroups its operator wishes to receive. Some senders also examine the Path; if

1550-515: The individual component articles. Each news article contains a complete set of header lines, but in common use the term "headers" is also used when referring to the News Overview database. The overview is a list of the most frequently used headers, and additional information such as article sizes, typically retrieved by the client software using the NNTP XOVER command. Overviews make reading

1600-628: The industrial-scale spammers have now moved into e-mail spam instead. The advent of the large Usenet archive kept as part of the Google Groups website has made Usenet more attractive to spammers than ever. The goal in this case is not just to reach the members of a newsgroup, but also to take advantage of the fact that Google gives a higher pagerank to websites that are referred to by these messages, which are catalogued and mirrored in multiple languages at Google's top-level domain. Critics have suggested that Google has ulterior motives for "turning

1650-458: The large number of articles transferred between servers and the large size of individual articles, their complete propagation to any one server farm is not guaranteed. The term "completion" is used to describe how well a service is keeping up with the traffic. The primary obstacle to calculating the completion percentage is how many articles were posted. Looking at only one server, one cannot know how many articles were actually inserted throughout

1700-603: The late 1990s, spam came to be used as a means of vandalising newsgroups, with malicious users committing acts of sporgery to make targeted newsgroups all but unreadable without heavy filtering. A prominent example occurred in alt.religion.scientology. Prevalent in recent times is the MI-5 Persecution spam, which is well known across many newsgroups. These rambling postings often appear as clusters of twenty or more messages with varying subjects and content, but all related to Mike Corley's perceived surveillance of himself by MI5,

1750-466: The lists is mostly a straightforward task. Practical limitations to this type of measurement include the impossibility of obtaining lists from all servers worldwide, the fact that many servers filter out spam or employ Usenet Death Penalties, and that some servers mask incompletion by hiding multipart binary sets with missing articles. It is also necessary to take into account propagation times and retention; an article may simply have not yet arrived at

1800-442: The local active list with a commonly accepted set; and cancel commands are used to request the deletion of a specific article. ihave and sendme are sometimes used with UUCP to transmit lists of offered and wanted Message-IDs. Other commands (version, sendsys, and uuname) are requests for server configuration details. Once used to create network maps, they now are generally obsolete. A reader server

1850-511: The meantime), the Date or Expires lines indicate that the article is too old, the header syntax appears to be invalid, the Approved header is missing for a moderated newsgroup, or additional local rules disallow it. Most servers also maintain a list of active newsgroups. If the Newsgroups header of a new article does not match the active list, it may be discarded or placed in a special "junk" newsgroup. Once

1900-415: The network. Articles may never make their way outside the originating server, or may fail to find their way out to the transit cloud. Very large articles are frequently dropped, and tend to propagate less well than smaller ones. One way to measure completion is to access multiple servers and retrieve lists of articles. Because Message-ID: headers are nominally unique throughout the network, comparison of
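The idea of measuring completion by comparing Message-ID lists from several servers can be sketched as follows. This is a rough illustration under a stated assumption: the union of all Message-IDs seen on the sampled servers stands in for the unknown total number of articles posted. The function name and the dictionary input format are hypothetical.

```python
def completion_by_server(listings):
    """listings: dict mapping server name -> iterable of Message-ID strings.

    Returns each server's share of the Message-IDs seen anywhere in the
    sample. The union of sampled servers is only an estimate of what was
    actually posted network-wide.
    """
    id_sets = {name: set(ids) for name, ids in listings.items()}
    all_seen = set().union(*id_sets.values())
    if not all_seen:
        return {name: 0.0 for name in id_sets}
    return {name: len(ids) / len(all_seen) for name, ids in id_sets.items()}

sample = {
    "server-a": ["<1@x>", "<2@x>", "<3@x>"],
    "server-b": ["<1@x>", "<2@x>", "<3@x>", "<4@x>"],
}
print(completion_by_server(sample))  # {'server-a': 0.75, 'server-b': 1.0}
```

A result of 1.0 only means the server holds everything the sampled servers hold; articles that never left their originating server remain invisible to this comparison.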

1950-430: The newsgroup moderator if the Approved header is absent. Additional identity checks and filters are also typically applied at this point. Smaller sites with limited network bandwidth may operate "sucking" or cache servers. These perform the same reader server role as conventional news servers, but themselves act as newsreaders to exchange articles with other reader servers. Hybrid servers allow greater flexibility for

2000-653: The operators and users of commercial news servers, common concerns are the continually increasing storage and network capacity requirements and their effects. Key measures are completion (the ability of a server to successfully receive all traffic), retention (the amount of time articles are made available to readers), and overall system performance. With the increasing demands, it is common for the transit and reader server roles to be subdivided further into numbering, storage and front end systems. These server farms are continually monitored by both insiders and outsiders, and measurements of these characteristics are often used by consumers when choosing

2050-433: The other). Due to the large amount of data in a full binary+text Usenet feed (which can be as high as 30 terabytes a day) and the high costs of transmitting that data through an IP transit provider like Cogent, Telia, or Zayo, most Usenet providers will only engage in binary peering when they are interconnected at an Internet exchange like AMS-IX, SIX, or DeCIX. When the server stores the body of an article, it places it in

2100-457: The provider which will cause delays. About all a provider can do in that case is try moving the traffic through a different route. If the ISP has limited connectivity to the network, routing changes may have little effect. Frequently a user can reduce the impact of network problems by using multiple connections. Some servers allow as many as 60 simultaneous connections, but this varies widely based on

2150-492: The provider. Article sizes are limited to what each news server will accept. The larger the article size, the more space it occupies, and thus the fewer articles on each server. This generally means that a server can run with less overhead, which makes for a more efficient server, but gives users fewer articles to access. Retention is simply defined as how long the server keeps articles. Historically, most users want retention to be long enough so that they don't need to access

2200-516: The receiving server appears in this line, it is not offered. Other local rules may also be added. The sender transmits matching articles' Message-IDs to the receiving server. The receiver indicates which Message-IDs it has not yet stored locally, and those articles are sent. The receiving server examines the incoming articles. A message is normally discarded if the Message-ID is duplicated by an article already received (i.e., another server sent it in
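The offer-and-accept exchange described here (skip articles whose Path already names the receiver, offer Message-IDs, send only what the receiver lacks) can be sketched in Python. This is a simplified model rather than an NNTP implementation: the function names are hypothetical, and the Path header is assumed to be a "!"-separated list of server names.

```python
def should_offer(path_header, receiver_name):
    """Sender side: do not offer an article whose Path already
    lists the receiving server."""
    return receiver_name not in path_header.split("!")

def select_wanted(offered_ids, stored_ids):
    """Receiver side: keep only the offered Message-IDs that are
    not yet stored locally, preserving the offered order."""
    stored = set(stored_ids)
    return [mid for mid in offered_ids if mid not in stored]

path = "news.a.example!news.b.example!not-for-mail"
print(should_offer(path, "news.b.example"))  # False
print(should_offer(path, "news.c.example"))  # True
print(select_wanted(["<1@x>", "<2@x>", "<3@x>"], ["<2@x>"]))  # ['<1@x>', '<3@x>']
```

Checking Message-IDs before transferring article bodies is what lets a server with several peers avoid receiving the same article more than once.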

2250-464: The server every day but not overly long retention that can overwhelm users with slow computers or network connections. In the modern era, high speed connections, large storage capacity, and advanced search tools allow users to utilize extensive retention without any drawbacks. Retention is generally quoted separately for text and binary articles, though it may also vary between different groups within these categories. The times vary greatly according to

2300-455: The server operator in that received groups can be adjusted without manual intervention by operators. They may also be the only available means to obtain articles from remote servers that do not offer conventional feeding. Because hybrid servers usually use the posting function to send news, article headers are reformatted by the posting function and tracing information can be lost. Also, the delayed sucking process can result in excess activity on

2350-454: The targets are Usenet newsgroups . Usenet convention defines spamming as excessive multiple posting, i.e. repeated posting of a message or very similar messages to newsgroups. The spam may be commercial advertisements, opinionated messages, malicious files, or nonsensical posts designed to disrupt the newsgroups. A type of newsgroup spam is sporgery which is intended to make the targeted newsgroups unreadable. The prevalence of Usenet spam led to

2400-474: The text and binary categories. Omicron's HW Media is currently the Usenet server with the highest amount of binary retention, while Google is the Usenet server with the highest amount of text retention. It can be difficult for end users to accurately measure the retention of a server. One common method is to examine the oldest articles in a group and examine the date, but this is not always accurate. Some articles in

2450-652: The threshold of 20. $\mathrm{BI2} = \sum_{k=1}^{m} \frac{n_k + \sqrt{n_k}}{2}$ Two copies of a posting are made, one to 9 groups, and one to 16. $\frac{\sqrt{9} + \sqrt{16} + 9 + 16}{2} = \frac{3 + 4 + 9 + 16}{2} = \frac{32}{2} = 16$ The name Skirvin-Breidbart Index and
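The BI2 formula and the threshold of 20 can be checked with a small Python sketch. The function names are hypothetical, and "breach" is read here as strictly exceeding the threshold.

```python
import math

def breidbart_index_v2(group_counts):
    """BI2: sum over all copies of (n + sqrt(n)) / 2, where n is the
    number of newsgroups each copy was crossposted to."""
    return sum((n + math.sqrt(n)) / 2 for n in group_counts)

def min_crossposts_to_exceed(threshold=20):
    """Smallest number of groups at which a single message's BI2
    rises above the threshold."""
    n = 1
    while breidbart_index_v2([n]) <= threshold:
        n += 1
    return n

print(breidbart_index_v2([9, 16]))   # 16.0
print(min_crossposts_to_exceed(20))  # 35
```

The linear term dominates for large n, which is why BI2 penalizes heavy crossposting far more than the square-root-only index does.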

2500-465: Was a fundamentalist religious tract claiming that "this world's history is coming to a climax." The newsgroup posting bot Serdar Argic also appeared in early 1994, posting tens of thousands of messages to various newsgroups, consisting of identical copies of a political screed relating to the Armenian genocide. The first "commercial" Usenet spam, and the one which is often (mistakenly) claimed to be
