CMU Sphinx, also called Sphinx for short, is the general term to describe a group of speech recognition systems developed at Carnegie Mellon University. These include a series of speech recognizers (Sphinx 2-4) and an acoustic model trainer (SphinxTrain).
In 2000, the Sphinx group at Carnegie Mellon committed to open-sourcing several speech recognizer components, including Sphinx 2 and, in 2001, Sphinx 3. The speech decoders come with acoustic models and sample applications. The available resources also include software for acoustic model training, language model compilation, and a public-domain pronunciation dictionary, cmudict. Sphinx encompasses
$a$ or some form of regularization. The log-bilinear model is another example of an exponential language model. The skip-gram language model is an attempt at overcoming the data sparsity problem that the preceding model (i.e. the word n-gram language model) faced. Words represented in an embedding vector were no longer necessarily consecutive, but could leave gaps that are skipped over. Formally,
A dropout color which can be easily removed by the OCR system. Palm OS used a special set of glyphs, known as Graffiti, which are similar to printed English characters but simplified or modified for easier recognition on the platform's computationally limited hardware. Users would need to learn how to write these special glyphs. Zone-based OCR restricts the image to a specific part of a document. This
A k-skip-n-gram is a length-n subsequence where the components occur at distance at most k from each other. For example, in the input text, the set of 1-skip-2-grams includes all the bigrams (2-grams) and, in addition, the subsequences. In the skip-gram model, semantic relations between words are represented by linear combinations, capturing a form of compositionality. For example, in some such models, if v
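As a rough illustration of this definition (not part of the original article), the following Python sketch enumerates k-skip-n-grams from a token list. The sample sentence is invented, and reading "distance at most k" as allowing at most k skipped tokens between successive components is an assumption made for the example.

```python
from itertools import combinations

def skip_grams(tokens, n=2, k=1):
    """Return all k-skip-n-grams: length-n subsequences whose successive
    components are at most k skipped tokens apart in the original text."""
    grams = set()
    for idx in combinations(range(len(tokens)), n):
        # successive chosen positions may skip at most k intervening tokens
        if all(idx[j + 1] - idx[j] <= k + 1 for j in range(n - 1)):
            grams.add(tuple(tokens[i] for i in idx))
    return grams

# 1-skip-2-grams: every ordinary bigram plus pairs that skip one word
print(skip_grams("the rain in Spain falls".split(), n=2, k=1))
```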
A self-supervised and semi-supervised training process. Although sometimes matching human performance, it is not clear whether they are plausible cognitive models. At least for recurrent neural networks, it has been shown that they sometimes learn patterns that humans do not, but fail to learn patterns that humans typically do. Evaluation of the quality of language models is mostly done by comparison to human-created benchmarks built from typical language-oriented tasks. Other, less established, quality tests examine
A semi-continuous representation for acoustic modeling (i.e., a single set of Gaussians is used for all models, with individual models represented as a weight vector over these Gaussians). Sphinx 3 adopted the prevalent continuous HMM representation and has been used primarily for high-accuracy, non-real-time recognition. Recent developments (in algorithms and in hardware) have made Sphinx 3 "near" real-time, although not yet suitable for critical interactive applications. Sphinx 3
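To make the semi-continuous versus continuous distinction concrete, here is a toy sketch, not Sphinx code: a semi-continuous state scores a frame with per-state weights over one shared Gaussian codebook, while a fully continuous state owns its own mixture. The dimensions, counts, and parameters below are invented.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
dim, n_shared = 13, 4          # e.g. 13 MFCC features, 4 shared Gaussians (toy sizes)
frame = rng.normal(size=dim)   # one acoustic feature vector

# Semi-continuous (Sphinx 2 style): one shared codebook of Gaussians;
# each HMM state stores only a weight vector over that codebook.
shared_means = rng.normal(size=(n_shared, dim))
codebook = [multivariate_normal(mean=m, cov=np.eye(dim)) for m in shared_means]
state_weights = np.array([0.1, 0.2, 0.3, 0.4])   # weights for one state
semi_likelihood = sum(w * g.pdf(frame) for w, g in zip(state_weights, codebook))

# Continuous (Sphinx 3 style): each state owns its own Gaussian mixture.
state_means = rng.normal(size=(2, dim))
state_mix = [multivariate_normal(mean=m, cov=np.eye(dim)) for m in state_means]
cont_likelihood = 0.5 * state_mix[0].pdf(frame) + 0.5 * state_mix[1].pdf(frame)
```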
343-452: A "Statistical Machine" for searching microfilm archives using an optical code recognition system. In 1931, he was granted US Patent number 1,838,389 for the invention. The patent was acquired by IBM . In 1974, Ray Kurzweil started the company Kurzweil Computer Products, Inc. and continued development of omni- font OCR, which could recognize text printed in virtually any font. (Kurzweil is often credited with inventing omni-font OCR, but it
A character error rate of 1% (99% accuracy) may result in an error rate of 5% or worse if the measurement is based on whether each whole word was recognized with no incorrect letters. Using a large enough dataset is important in neural-network-based handwriting recognition solutions. On the other hand, producing natural datasets is very complicated and time-consuming. An example of the difficulties inherent in digitizing old text
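The following toy calculation (made-up strings, not from the article) illustrates how a handful of single-character errors yields a much larger error rate when scored per whole word than per character. A real evaluation would align reference and output with edit distance; here the words have equal length, so a position-by-position comparison suffices.

```python
# Two single-character OCR errors in a nine-word sentence
reference  = "the quick brown fox jumps over the lazy dog".split()
hypothesis = "the quiok brown fox jumps ovor the lazy dog".split()

chars_ref = sum(len(w) for w in reference)
char_errors = sum(a != b for r, h in zip(reference, hypothesis) for a, b in zip(r, h))
word_errors = sum(r != h for r, h in zip(reference, hypothesis))

print(f"character error rate: {char_errors / chars_ref:.1%}")     # about 6%
print(f"whole-word error rate: {word_errors / len(reference):.1%}")  # about 22%
```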
A combination of larger datasets (frequently using words scraped from the public internet), feedforward neural networks, and transformers. They have superseded recurrent neural network-based models, which had previously superseded the pure statistical models, such as the word n-gram language model. A word n-gram language model is a purely statistical model of language. It has been superseded by recurrent neural network-based models, which have been superseded by large language models. It
A list of optical character recognition software, see Comparison of optical character recognition software. OCR accuracy can be increased if the output is constrained by a lexicon – a list of words that are allowed to occur in a document. This might be, for example, all the words in the English language, or a more technical lexicon for a specific field. This technique can be problematic if
A noun, for example, allowing greater accuracy. The Levenshtein distance algorithm has also been used in OCR post-processing to further optimize results from an OCR API. In recent years, the major OCR technology providers began to tweak OCR systems to deal more efficiently with specific types of input. Beyond an application-specific lexicon, better performance may be had by taking into account business rules, standard expressions, or rich information contained in color images. This strategy
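A minimal sketch of this kind of post-processing, assuming a small invented lexicon: compute the Levenshtein distance from an OCR token to each lexicon entry and substitute the closest one. This illustrates the general idea only; it is not tied to any particular OCR API.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correct(word: str, lexicon: list[str]) -> str:
    """Replace an OCR token with the closest lexicon entry."""
    return min(lexicon, key=lambda entry: levenshtein(word, entry))

print(correct("recognltion", ["recognition", "regulation", "reception"]))  # recognition
```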
A number of software systems, described below. Sphinx is a continuous-speech, speaker-independent recognition system making use of hidden Markov acoustic models (HMMs) and an n-gram statistical language model. It was developed by Kai-Fu Lee. Sphinx demonstrated the feasibility of continuous-speech, speaker-independent, large-vocabulary recognition, the possibility of which was in dispute at the time (1986). Sphinx
A ranked list of candidate characters. Software such as Cuneiform and Tesseract uses a two-pass approach to character recognition. The second pass is known as adaptive recognition and uses the letter shapes recognized with high confidence on the first pass to better recognize the remaining letters on the second pass. This is advantageous for unusual fonts or low-quality scans where the font
A single character) – are still the subject of active research. The MNIST database is commonly used for testing systems' ability to recognize handwritten digits. Accuracy rates can be measured in several ways, and how they are measured can greatly affect the reported accuracy rate. For example, if word context (a lexicon of words) is not used to correct software finding non-existent words,
A time. Advanced systems capable of producing a high degree of accuracy for most fonts are now common, with support for a variety of image file formats as input. Some systems are capable of reproducing formatted output that closely approximates the original page including images, columns, and other non-textual components. Early optical character recognition may be traced to technologies involving telegraphy and creating reading devices for
Is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as cognitive computing, machine translation, (extracted) text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial intelligence, and computer vision. Early versions needed to be trained with images of each character, and worked on one font at
Is accomplished relatively simply by aligning the image to a uniform grid based on where vertical grid lines will least often intersect black areas. For proportional fonts, more sophisticated techniques are needed because whitespace between letters can sometimes be greater than that between words, and vertical lines can intersect more than one character. There are two basic types of core OCR algorithm, which may produce
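A toy numpy sketch of the grid-alignment idea for fixed-pitch fonts described above: sum the ink in each pixel column and shift the character grid so that its cut lines pass through as little ink as possible. The binary image, the pitch, and the function name are invented for illustration.

```python
import numpy as np

def fixed_pitch_cuts(binary_img: np.ndarray, pitch: int) -> list[int]:
    """Place one cut per character cell, shifted so the cut columns
    intersect as little ink (value 1) as possible."""
    ink_per_column = binary_img.sum(axis=0)        # black pixels in each column
    width = binary_img.shape[1]
    best_offset = min(
        range(pitch),
        key=lambda off: ink_per_column[off::pitch].sum()  # total ink under the grid lines
    )
    return list(range(best_offset, width, pitch))

# toy 8x16 "page" with two 8-pixel-wide character cells
img = np.zeros((8, 16), dtype=int)
img[2:6, 2:6] = 1     # first glyph
img[2:6, 10:14] = 1   # second glyph
print(fixed_pitch_cuts(img, pitch=8))   # grid lines fall in the white gaps
```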
Is based on the assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words. If only one previous word is considered, it is called a bigram model; if two words, a trigram model; if n − 1 words, an n-gram model. Special tokens $\langle s\rangle$ and $\langle /s\rangle$ are introduced to denote the start and end of a sentence. Maximum entropy language models encode
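A minimal counting-based sketch of a bigram model using the sentence-boundary tokens just described. The two-sentence corpus is invented and no smoothing is applied; a practical model would smooth or back off to avoid zero probabilities.

```python
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]   # toy corpus

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]                # sentence boundary tokens
    unigrams.update(tokens[:-1])
    bigrams.update(zip(tokens, tokens[1:]))

def p(word: str, prev: str) -> float:
    """Maximum-likelihood bigram probability P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(p("cat", "the"))   # 0.5: "the" is followed by "cat" in one of two sentences
```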
Is called "Application-Oriented OCR" or "Customized OCR", and has been applied to OCR of license plates, invoices, screenshots, ID cards, driver's licenses, and automobile manufacturing. The New York Times has adapted the OCR technology into a proprietary tool it calls Document Helper, which enables its interactive news team to accelerate the processing of documents that need to be reviewed. They note that it enables them to process as many as 5,400 pages per hour in preparation for reporters to review
Is distorted (e.g. blurred or faded). As of December 2016, modern OCR software includes Google Docs OCR, ABBYY FineReader, and Transym. Others like OCRopus and Tesseract use neural networks which are trained to recognize whole lines of text instead of focusing on single characters. A technique known as iterative OCR automatically crops a document into sections based on the page layout. OCR
Is generally an offline process, which analyses a static document. There are cloud-based services which provide an online OCR API service. Handwriting movement analysis can be used as input to handwriting recognition. Instead of merely using the shapes of glyphs and words, this technique is able to capture motion, such as the order in which segments are drawn, the direction, and the pattern of putting
Is of historical interest only; it has been superseded in performance by subsequent versions. An archival article describes the system in detail. A fast performance-oriented recognizer, originally developed by Xuedong Huang at Carnegie Mellon and released as open source with a BSD-style license on SourceForge by Kevin Lenzo at LinuxWorld in 2000. Sphinx 2 focuses on real-time recognition suitable for spoken language applications. As such it incorporates functionality such as end-pointing, partial hypothesis generation, dynamic language model switching, and so on. It
Is often referred to as "Template OCR". Crowdsourcing humans to perform the character recognition can process images as quickly as computer-driven OCR, but with higher accuracy for recognizing images than is obtained via computers alone. Practical systems include Amazon Mechanical Turk and reCAPTCHA. The National Library of Finland has developed an online interface for users to correct OCRed texts in
Is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo (for example the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image (for example: from a television broadcast). Widely used as a form of data entry from printed paper data records – whether passport documents, invoices, bank statements, computerized receipts, business cards, mail, printed data, or any suitable documentation – it
Is the partition function, $a$ is the parameter vector, and $f(w_{1},\ldots ,w_{m})$ is the feature function. In the simplest case, the feature function is just an indicator of the presence of a certain n-gram. It is helpful to use a prior on
Is the function that maps a word $w$ to its $n$-dimensional vector representation, then $v(\mathrm{king}) - v(\mathrm{male}) + v(\mathrm{female}) \approx v(\mathrm{queen})$. Continuous representations or embeddings of words are produced in recurrent neural network-based language models (known also as continuous space language models). Such continuous space embeddings help to alleviate
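A toy numpy illustration of this analogy. The 3-dimensional vectors are hand-picked so that the relation holds; real embeddings are learned from data and typically have hundreds of dimensions.

```python
import numpy as np

# Hand-made vectors, chosen only so the analogy works (illustrative, not learned)
v = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "male":   np.array([0.0, 0.9, 0.0]),
    "female": np.array([0.0, 0.0, 0.9]),
    "queen":  np.array([0.9, 0.0, 1.0]),
    "apple":  np.array([0.1, 0.1, 0.1]),
}

target = v["king"] - v["male"] + v["female"]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The nearest remaining vocabulary item to king - male + female is "queen"
best = max((w for w in v if w not in {"king", "male", "female"}),
           key=lambda w: cosine(v[w], target))
print(best)
```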
Is the inability of OCR to differentiate between the "long s" and "f" characters. Web-based OCR systems for recognizing hand-printed text on the fly have become well known as commercial products in recent years (see Tablet PC history). Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved by pen computing software, but that accuracy rate still translates to dozens of errors per page, making
Is then performed on each section individually using variable character confidence level thresholds to maximize page-level OCR accuracy. A patent from the United States Patent Office has been issued for this method. The OCR result can be stored in the standardized ALTO format, a dedicated XML schema maintained by the United States Library of Congress. Other common formats include hOCR and PAGE XML. For
Is under active development and, in conjunction with SphinxTrain, provides access to a number of modern modeling techniques, such as LDA/MLLT, MLLR, and VTLN, that improve recognition accuracy (see the article on Speech Recognition for descriptions of these techniques). Sphinx 4 is a complete rewrite of the Sphinx engine with the goal of providing a more flexible framework for research in speech recognition, written entirely in
Is under active development and incorporates features such as fixed-point arithmetic and efficient algorithms for GMM computation. Language model A language model is a probabilistic model of a natural language. In 1980, the first significant statistical language model was proposed, and during the decade IBM performed ‘Shannon-style’ experiments, in which potential sources for language modeling improvement were identified by observing and analyzing
Is used in dialog systems and language-learning systems. It can be used in computer-based PBX systems such as Asterisk. Sphinx 2 code has also been incorporated into a number of commercial products. It is no longer under active development (other than for routine maintenance). Current real-time decoder development is taking place in the PocketSphinx project. An archival article describes the system. Sphinx 2 used
The Amount line of a check (which is always a written-out number) is an example where using a smaller dictionary can increase recognition rates greatly. The shapes of individual cursive characters themselves simply do not contain enough information to accurately (greater than 98%) recognize all handwritten cursive script. Most programs allow users to set "confidence rates". This means that if
The curse of dimensionality, which is the consequence of the number of possible sequences of words increasing exponentially with the size of the vocabulary, further causing a data sparsity problem. Neural networks avoid this problem by representing words as non-linear combinations of weights in a neural net. A large language model (LLM) is a type of computational model designed for natural language processing tasks such as language generation. As language models, LLMs acquire these abilities by learning statistical relationships from vast amounts of text during
The 2000s, OCR was made available online as a service (WebOCR), in a cloud computing environment, and in mobile applications like real-time translation of foreign-language signs on a smartphone. With the advent of smartphones and smartglasses, OCR can be used in internet-connected mobile device applications that extract text captured using the device's camera. Devices that do not have built-in OCR functionality will typically use an OCR API to extract
The Java programming language. Sun Microsystems supported the development of Sphinx 4 and contributed software engineering expertise to the project. Participants included individuals at MERL, MIT, and CMU. (Currently supported languages are C, C++, C#, Python, Ruby, Java, and JavaScript.) Current development goals include: A version of Sphinx that can be used in embedded systems (e.g., based on an ARM processor). PocketSphinx
The blind. In 1914, Emanuel Goldberg developed a machine that read characters and converted them into standard telegraph code. Concurrently, Edmund Fournier d'Albe developed the Optophone, a handheld scanner that, when moved across a printed page, produced tones that corresponded to specific letters or characters. In the late 1920s and into the 1930s, Emanuel Goldberg developed what he called
The contents. There are several techniques for solving the problem of character recognition by means other than improved OCR algorithms. Special fonts like OCR-A, OCR-B, or MICR fonts, with precisely specified sizing, spacing, and distinctive character shapes, allow a higher accuracy rate during transcription in bank check processing. Several prominent OCR engines were designed to capture text in popular fonts such as Arial or Times New Roman, and are incapable of capturing text in specialized fonts that are very different from popularly used ones. As Google Tesseract can be trained to recognize new fonts, it can recognize OCR-A, OCR-B, and MICR fonts. Comb fields are pre-printed boxes that encourage humans to write more legibly – one glyph per box. These are often printed in
The document contains words not in the lexicon, like proper nouns. Tesseract uses its dictionary to influence the character segmentation step, for improved accuracy. The output stream may be a plain text stream or file of characters, but more sophisticated OCR systems can preserve the original layout of the page and produce, for example, an annotated PDF that includes both the original image of
The intrinsic character of a language model or compare two such models. Since language models are typically intended to be dynamic and to learn from data they see, some proposed models investigate the rate of learning, e.g., through inspection of learning curves. Various data sets have been developed for use in evaluating language processing systems. These include: Optical character recognition Optical character recognition or optical character reader (OCR)
The leaders of the National Federation of the Blind. In 1978, Kurzweil Computer Products began selling a commercial version of the optical character recognition computer program. LexisNexis was one of the first customers, and bought the program to upload legal paper and news documents onto its nascent online databases. Two years later, Kurzweil sold his company to Xerox, which eventually spun it off as Scansoft, which merged with Nuance Communications.
The most authoritative of the Annual Tests of OCR Accuracy from 1992 to 1996. Recognition of typewritten, Latin-script text is still not 100% accurate even where clear imaging is available. One study based on recognition of 19th- and early 20th-century newspaper pages concluded that character-by-character OCR accuracy for commercial OCR software varied from 81% to 99%; total accuracy can be achieved by human review or Data Dictionary Authentication. Other areas – including recognition of hand printing, cursive handwriting, and printed text in other scripts (especially those East Asian language characters which have many strokes for
The page and a searchable textual representation. Near-neighbor analysis can make use of co-occurrence frequencies to correct errors, by noting that certain words are often seen together. For example, "Washington, D.C." is generally far more common in English than "Washington DOC". Knowledge of the grammar of the language being scanned can also help determine if a word is likely to be a verb or
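A minimal sketch of near-neighbor correction under assumed co-occurrence counts (the numbers are invented, standing in for statistics gathered from a large reference corpus): among candidate readings of a token, keep the one that co-occurs most often with its neighbor.

```python
# Invented co-occurrence counts; "Washington, D.C." vastly outnumbers "Washington DOC"
cooccurrence = {
    ("washington", "d.c."): 120_000,
    ("washington", "doc"): 40,
}

def pick_reading(prev_word: str, candidates: list[str]) -> str:
    """Prefer the candidate that co-occurs most often with the previous word."""
    return max(candidates, key=lambda c: cooccurrence.get((prev_word, c), 0))

print(pick_reading("washington", ["doc", "d.c."]))   # d.c.
```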
The pen down and lifting it. This additional information can make the process more accurate. This technology is also known as "online character recognition", "dynamic character recognition", "real-time character recognition", and "intelligent character recognition". OCR software often pre-processes images to improve the chances of successful recognition. Techniques include: Segmentation of fixed-pitch fonts
The performance of human subjects in predicting or correcting text. Language models are useful for a variety of tasks, including speech recognition (helping prevent predictions of low-probability (e.g. nonsense) sequences), machine translation, natural language generation (generating more human-like text), optical character recognition, route optimization, handwriting recognition, grammar induction, and information retrieval. Large language models, currently their most advanced form, are
The relationship between a word and the n-gram history using feature functions. The equation is

$$P(w_{m}\mid w_{1},\ldots ,w_{m-1})=\frac{1}{Z(w_{1},\ldots ,w_{m-1})}\exp\left(a^{T}f(w_{1},\ldots ,w_{m})\right)$$

where $Z(w_{1},\ldots ,w_{m-1})$
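A toy numpy evaluation of this equation, with indicator features over a few invented bigrams, an invented parameter vector a, and the partition function Z computed by summing over a three-word vocabulary; real models use far larger feature sets and vocabularies.

```python
import numpy as np

vocab = ["cat", "dog", "sat"]
feature_ngrams = [("the", "cat"), ("the", "dog"), ("the", "sat")]  # one indicator feature per bigram
a = np.array([1.5, 1.2, -0.3])                                     # invented parameter vector

def features(history: tuple[str, ...], word: str) -> np.ndarray:
    """Indicator features f(w_1..w_m): 1 where the candidate bigram matches."""
    bigram = (history[-1], word)
    return np.array([float(bigram == g) for g in feature_ngrams])

def p(word: str, history: tuple[str, ...]) -> float:
    """P(w_m | history) = exp(a . f) / Z, with Z summed over the vocabulary."""
    scores = {w: float(np.exp(a @ features(history, w))) for w in vocab}
    return scores[word] / sum(scores.values())

print(p("cat", ("the",)))   # highest probability, since its feature weight is largest
```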
The standardized ALTO format. Crowdsourcing has also been used not to perform character recognition directly but to invite software developers to develop image processing algorithms, for example, through the use of rank-order tournaments. Commissioned by the U.S. Department of Energy (DOE), the Information Science Research Institute (ISRI) had the mission to foster the improvement of automated technologies for understanding machine-printed documents, and it conducted
The technology useful only in very limited applications. Recognition of cursive text is an active area of research, with recognition rates even lower than that of hand-printed text. Higher rates of recognition of general cursive script will likely not be possible without the use of contextual or grammatical information. For example, recognizing entire words from a dictionary is easier than trying to parse individual characters from script. Reading
The text from the image file captured by the device. The OCR API returns the extracted text, along with information about the location of the detected text in the original image, back to the device app for further processing (such as text-to-speech) or display. Various commercial and open-source OCR systems are available for most common writing systems, including Latin, Cyrillic, Arabic, Hebrew, Indic, Bengali (Bangla), Devanagari, Tamil, Chinese, Japanese, and Korean characters. OCR engines have been developed into software applications specializing in various subjects such as receipts, invoices, checks, and legal billing documents. The software can be used for: OCR
Was in use by companies, including CompuScan, in the late 1960s and 1970s.) Kurzweil used the technology to create a reading machine that let blind people have a computer read text to them out loud. The device included a CCD-type flatbed scanner and a text-to-speech synthesizer. On January 13, 1976, the finished product was unveiled during a widely reported news conference headed by Kurzweil and