BookCorpus - Misplaced Pages

BookCorpus (also sometimes referred to as the Toronto Book Corpus) is a dataset consisting of the text of around 7,000 self-published books scraped from the indie ebook distribution website Smashwords. It was the main corpus used to train the initial GPT model by OpenAI, and has been used as training data for other early large language models including Google's BERT. The dataset consists of around 985 million words, and the books that comprise it span a range of genres, including romance, science fiction, and fantasy.

The corpus was introduced in a 2015 paper by researchers from the University of Toronto and MIT titled "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books". The authors described it as consisting of "free books written by yet unpublished authors," yet this is factually incorrect. These books were published by self-published ("indie") authors who priced them at free; the books were downloaded without the consent or permission of Smashwords or Smashwords authors and in violation of the Smashwords Terms of Service. The dataset was initially hosted on a University of Toronto webpage. An official version of the original dataset is no longer publicly available, though at least one substitute, BookCorpusOpen, has been created. Though not documented in the original 2015 paper, the site from which the corpus's books were scraped is now known to be Smashwords.

Data set

A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record of the data set in question. The data set lists values for each of the variables, such as for example height and weight of an object, for each member of the data set. Data sets can also consist of a collection of documents or files.

In the open data discipline, data set is the unit to measure the information released in a public open data repository. The European data.europa.eu portal aggregates more than a million data sets. Several characteristics define a data set's structure and properties. These include the number and types of the attributes or variables, and various statistical measures applicable to them, such as standard deviation and kurtosis. The values may be numbers, such as real numbers or integers, for example representing a person's height in centimeters, but may also be nominal data (i.e., not consisting of numerical values), for example representing a person's ethnicity. More generally, values may be of any of the kinds described as a level of measurement. For each variable, the values are normally all of the same kind. Missing values may exist, which must be indicated somehow.

In statistics, data sets usually come from actual observations obtained by sampling a statistical population, and each row corresponds to the observations on one element of that population. Data sets may further be generated by algorithms.
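As a rough illustration of the tabular view of a data set described above, the following Python sketch treats each row as a record and each key as a variable, marks a missing value with `None`, and computes the standard deviation and kurtosis of one variable. The records, the column names `height_cm` and `weight_kg`, and the helper functions are all hypothetical and chosen only for this example; kurtosis is computed here as the plain fourth standardized moment of the population, which is one of several conventions.

```python
import statistics

# Hypothetical records: each dict is one row (a member of the data set),
# each key is one column (a variable). None marks a missing value.
records = [
    {"height_cm": 150.0, "weight_kg": 50.0},
    {"height_cm": 160.0, "weight_kg": 60.0},
    {"height_cm": 170.0, "weight_kg": None},
    {"height_cm": 180.0, "weight_kg": 80.0},
]


def column(rows, name):
    """Return the non-missing values of one variable."""
    return [row[name] for row in rows if row[name] is not None]


def kurtosis(values):
    """Fourth standardized moment (population form, not excess kurtosis)."""
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values, mu=mean)
    fourth = sum((v - mean) ** 4 for v in values) / len(values)
    return fourth / variance ** 2


heights = column(records, "height_cm")
weights = column(records, "weight_kg")

# Structural properties and statistical measures of the data set.
summary = {
    "n_records": len(records),
    "variables": sorted(records[0]),
    "missing_weight": len(records) - len(weights),
    "height_std": statistics.pstdev(heights),
    "height_kurtosis": kurtosis(heights),
}
print(summary)
```

Missing values are skipped before any statistic is computed, which is only one possible way of handling them; a real analysis might impute them or report them separately.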