Controlled vocabularies provide a way to organize knowledge for subsequent retrieval. They are used in subject indexing schemes, subject headings , thesauri , taxonomies and other knowledge organization systems . Controlled vocabulary schemes mandate the use of predefined, preferred terms that have been preselected by the designers of the schemes, in contrast to natural language vocabularies, which have no such restriction.
38-505: The Terminologia Histologica ( TH ) is the controlled vocabulary for use in cytology and histology . In April 2011, Terminologia Histologica was published online by the Federative International Programme on Anatomical Terminologies (FIPAT), the successor of FCAT. It was intended to replace Nomina Histologica . The Nomina Histologica was introduced in 1977, with the fourth edition of Nomina Anatomica . It
76-507: A bijection between concepts and preferred terms. In short, controlled vocabularies reduce unwanted ambiguity inherent in normal human languages where the same concept can be given different names and ensure consistency. For example, in the Library of Congress Subject Headings (a subject heading system that uses a controlled vocabulary), preferred terms—subject headings in this case—have to be chosen to handle choices between variant spellings of
114-534: A document, the indexer also has to choose the level of indexing exhaustivity, the level of detail in which the document is described. For example, using low indexing exhaustivity, minor aspects of the work will not be described with index terms. In general the higher the indexing exhaustivity, the more terms indexed for each document. In recent years free text search as a means of access to documents has become popular. This involves using natural language indexing with an indexing exhaustively set to maximum (every word in
152-404: A given search. Red dots represent irrelevant results, and green dots represent relevant results. Relevancy is indicated by the proximity of search results to the center of the inner circle. Of all possible results shown, those that were actually returned by the search are shown on a light-blue background. In the example only 1 relevant result of 3 possible relevant results was returned, so the recall
190-647: A person including, but not limited to, name, honorific prefix, affiliation, email address, and homepage, or the Person vocabulary of Schema.org . Similarly, a book can be described using the Book vocabulary of Schema.org and general publication terms from the Dublin Core vocabulary, an event with the Event vocabulary of Schema.org , and so on. To use machine-readable terms from any controlled vocabulary, web designers can choose from
228-476: A public library. In large organizations, controlled vocabularies may be introduced to improve technical communication . The use of controlled vocabulary ensures that everyone is using the same word to mean the same thing. This consistency of terms is one of the most important concepts in technical writing and knowledge management , where effort is expended to use the same word throughout a document or organization instead of slightly different ones to refer to
266-432: A search, while precision is the measure of the quality of the results returned. Recall is the ratio of relevant results returned to all relevant results. Precision is the ratio of the number of relevant results returned to the total number of results returned. The diagram at right represents a low-precision, low-recall search. In the diagram the red and green dots represent the total population of potential search results for
304-439: A variety of annotation formats, including RDFa, HTML5 Microdata , or JSON-LD in the markup, or RDF serializations (RDF/XML, Turtle, N3, TriG, TriX) in external files. Free text search In text retrieval , full-text search refers to techniques for searching a single computer -stored document or a collection in a full-text database . Full-text search is distinguished from searches based on metadata or on parts of
342-412: A way that ambiguities are eliminated. The trade-off between precision and recall is simple: an increase in precision can lower overall recall, while an increase in recall lowers precision. Full-text searching is likely to retrieve many documents that are not relevant to the intended search question. Such documents are called false positives (see Type I error ). The retrieval of irrelevant documents
380-427: Is a very low ratio of 1/3, or 33%. The precision for the example is a very low 1/4, or 25%, since only 1 of the 4 results returned was relevant. Due to the ambiguities of natural language , full-text-search systems typically includes options like filtering to increase precision and stemming to increase recall. Controlled-vocabulary searching also helps alleviate low-precision issues by tagging documents in such
418-550: Is designed on faceted classification principles. Controlled vocabularies of the Semantic Web define the concepts and relationships (terms) used to describe a field of interest or area of concern. For instance, to declare a person in a machine-readable format, a vocabulary is needed that has the formal definition of "Person", such as the Friend of a Friend ( FOAF ) vocabulary, which has a Person class that defines typical properties of
SECTION 10
#1732858127225456-435: Is low. For example, an article might mention football as a secondary focus, and the indexer might decide not to tag it with "football" because it is not important enough compared to the main focus. But it turns out that for the searcher that article is relevant and hence recall fails. A free text search would automatically pick up that article regardless. On the other hand, free text searches have high exhaustivity (every word
494-497: Is not neutral, and the indexer must carefully consider the ethics of their word choices. For example, traditionally colonialist terms have often been the preferred terms in chosen vocabularies when discussing First Nations issues, which has caused controversy. Controlled vocabularies, such as the Library of Congress Subject Headings , are an essential component of bibliography , the study and classification of books. They were initially developed in library and information science . In
532-494: Is often caused by the inherent ambiguity of natural language . In the sample diagram to the right, false positives are represented by the irrelevant results (red dots) that were returned by the search (on a light-blue background). Clustering techniques based on Bayesian algorithms can help reduce false positives. For a search term of "bank", clustering can be used to categorize the document/data universe into "financial institution", "place to sit", "place to store" etc. Depending on
570-410: Is searched) so although it has much lower precision, it has potential for high recall as long as the searcher overcome the problem of synonyms by entering every combination. Controlled vocabularies may become outdated rapidly in fast developing fields of knowledge, unless the preferred terms are updated regularly. Even in an ideal scenario, a controlled vocabulary is often less specific than the words of
608-425: Is usable for indexing web pages is PSH . It is unlikely that a single metadata scheme will ever succeed in describing the content of the entire Web. To create a Semantic Web, it may be necessary to draw from two or more metadata systems to describe a Web page's contents. The eXchangeable Faceted Metadata Language (XFML) is designed to enable controlled vocabulary creators to publish and share metadata systems. XFML
646-586: The Library of Congress system , Medical Subject Headings (MeSH) created by the United States National Library of Medicine , and Sears . Well known thesauri include the Art and Architecture Thesaurus and the ERIC Thesaurus. When selecting terms for a controlled vocabulary, the designer has to consider the specificity of the term chosen, whether to use direct entry, inter consistency and stability of
684-599: The 1950s, government agencies began to develop controlled vocabularies for the burgeoning journal literature in specialized fields; an example is the Medical Subject Headings (MeSH) developed by the U.S. National Library of Medicine . Subsequently, for-profit firms (called Abstracting and indexing services) emerged to index the fast-growing literature in every field of knowledge. In the 1960s, an online bibliographic database industry developed based on dialup X.25 networking. These services were seldom made available to
722-420: The 1990s. Many websites and application programs (such as word processing software) provide full-text-search capabilities. Some web search engines, such as the former AltaVista , employ full-text-search techniques, while others index only a portion of the web pages examined by their indexing systems. When dealing with a small number of documents, it is possible for the full-text-search engine to directly scan
760-414: The contents of the documents with each query , a strategy called " serial scanning ". This is what some tools, such as grep , do when searching. However, when the number of documents to search is potentially large, or the quantity of search queries to perform is substantial, the problem of full-text search is often divided into two tasks: indexing and searching. The indexing stage will scan the text of all
798-407: The controlled vocabulary scheme to make best use of the system. But as already mentioned, the control of synonyms, homographs can help increase precision. Numerous methodologies have been developed to assist in the creation of controlled vocabularies, including faceted classification , which enables a given data record or document to be described in multiple ways. Word choice in chosen vocabularies
SECTION 20
#1732858127225836-399: The correct preferred term is searched, there is no need to search for other terms that might be synonyms of that term. A controlled vocabulary search may lead to unsatisfactory recall , in that it will fail to retrieve some documents that are actually relevant to the search question. This is particularly problematic when the search question involves terms that are sufficiently tangential to
874-452: The differences between the two are diminishing, there are still some minor differences. The terms are chosen and organized by trained professionals (including librarians and information scientists) who possess expertise in the subject area. Controlled vocabulary terms can accurately describe what a given document is actually about, even if the terms themselves do not occur within the document's text. Well known subject heading systems include
912-407: The documents and build a list of search terms (often called an index , but more correctly named a concordance ). In the search stage, when performing a specific query, only the index is referenced, rather than the text of the original documents. The indexer will make an entry in the index for each term or word found in a document, and possibly note its relative position within the document. Usually
950-466: The documents in such a way that the ambiguities are eliminated. Compared to free text searching, the use of a controlled vocabulary can dramatically increase the performance of an information retrieval system, if performance is measured by precision (the percentage of documents in the retrieval list that are actually relevant to the search topic). In some cases controlled vocabulary can enhance recall as well, because unlike natural language schemes, once
988-410: The indexer will ignore stop words (such as "the" and "and") that are both common and insufficiently meaningful to be useful in searching. Some indexers also employ language-specific stemming on the words being indexed. For example, the words "drives", "drove", and "driven" will be recorded in the index under the single concept word "drive". Recall measures the quantity of relevant results returned by
1026-702: The inherent ambiguity of natural language . Take the English word football for example. Football is the name given to a number of different team sports . Worldwide the most popular of these team sports is association football , which also happens to be called soccer in several countries. The word football is also applied to rugby football ( rugby union and rugby league ), American football , Australian rules football , Gaelic football , and Canadian football . A search for football therefore will retrieve documents that are about several completely different sports. Controlled vocabulary solves this problem by tagging
1064-475: The language. Lastly the amount of pre-coordination (in which case the degree of enumeration versus synthesis becomes an issue) and post-coordination in the system is another important issue. Controlled vocabulary elements (terms/phrases) employed as tags , to aid in the content identification process of documents, or other information system entities (e.g. DBMS, Web Services) qualifies as metadata . There are three main types of indexing languages. When indexing
1102-639: The occurrences of words relevant to the categories, search terms or a search result can be placed in one or more of the categories. This technique is being extensively deployed in the e-discovery domain. The deficiencies of full text searching have been addressed in two ways: By providing users with tools that enable them to express their search questions more precisely, and by developing new search algorithms that improve retrieval precision. The PageRank algorithm developed by Google gives more prominence to documents to which other Web pages have linked. See Search engine for additional examples. The following
1140-441: The original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references). In a full-text search, a search engine examines all of the words in every stored document as it tries to match search criteria (for example, text specified by a user). Full-text-searching techniques appeared in the 1960s, for example IBM STAIRS from 1969, and became common in online bibliographic databases in
1178-652: The public because they were difficult to use; specialist librarians called search intermediaries handled the searching job. In the 1980s, the first full text databases appeared; these databases contain the full text of the index articles as well as the bibliographic information. Online bibliographic databases have migrated to the Internet and are now publicly available; however, most are proprietary and can be expensive to use. Students enrolled in colleges and universities may be able to access some of these services without charge; some of these services may be accessible without charge at
Terminologia Histologica - Misplaced Pages Continue
1216-477: The same thing. Web searching could be dramatically improved by the development of a controlled vocabulary for describing Web pages; the use of such a vocabulary could culminate in a Semantic Web , in which the content of Web pages is described using a machine-readable metadata scheme. One of the first proposals for such a scheme is the Dublin Core Initiative. An example of a controlled vocabulary which
1254-467: The same word (American versus British), choice among scientific and popular terms ( cockroach versus Periplaneta americana ), and choices between synonyms ( automobile versus car ), among other difficult issues. Choices of preferred terms are based on the principles of user warrant (what terms users are likely to use), literary warrant (what terms are generally used in the literature and documents), and structural warrant (terms chosen by considering
1292-439: The structure, scope of the controlled vocabulary). Controlled vocabularies also typically handle the problem of homographs with qualifiers. For example, the term pool has to be qualified to refer to either swimming pool or the game pool to ensure that each preferred term or heading refers to only one concept. There are two main kinds of controlled vocabulary tools used in libraries: subject headings and thesauri . While
1330-400: The subject area such that the indexer might have decided to tag it using a different term (but the searcher might consider the same). Essentially, this can be avoided only by an experienced user of controlled vocabulary whose understanding of the vocabulary coincides with that of the indexer. Another possibility is that the article is just not tagged by the indexer because indexing exhaustivity
1368-401: The text is indexed ). These methods have been compared in some studies, such as the 2007 article, "A Comparative Evaluation of Full-text, Concept-based, and Context-sensitive Search". Controlled vocabularies are often claimed to improve the accuracy of free text searching, such as to reduce irrelevant items in the retrieval list. These irrelevant items ( false positives ) are often caused by
1406-413: The text itself. Indexers trying to choose the appropriate index terms might misinterpret the author, while this precise problem is not a factor in a free text, as it uses the author's own words. The use of controlled vocabularies can be costly compared to free text searches because human experts or expensive automated systems are necessary to index each entry. Furthermore, the user has to be familiar with
1444-520: Was developed by the Federative International Committee on Anatomical Terminology . Controlled vocabulary In library and information science , controlled vocabulary is a carefully selected list of words and phrases , which are used to tag units of information (document or work) so that they may be more easily retrieved by a search. Controlled vocabularies solve the problems of homographs , synonyms and polysemes by
#224775