Misplaced Pages

Apache Solr

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license.

Solr (pronounced "solar") is an open-source enterprise-search platform, written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is designed for scalability and fault tolerance. Solr is widely used for enterprise search and analytics use cases and has an active development community and regular releases.

Solr runs as a standalone full-text search server. It uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it usable from most popular programming languages. Solr's external configuration allows it to be tailored to many types of applications without Java coding, and it has a plugin architecture to support more advanced customization.
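As a rough illustration of the REST-like HTTP/JSON API, the sketch below builds a query URL for Solr's `/select` handler and reads the documents out of a JSON response. The host `http://localhost:8983`, the collection name `techproducts`, and the sample response text are assumptions made for the example; nothing is sent over the network.

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url, collection, query, rows=10):
    """Build a URL for the /select handler, asking for a JSON response."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/solr/{collection}/select?{params}"


def extract_docs(response_text):
    """Return the list of matching documents from a JSON select response."""
    payload = json.loads(response_text)
    return payload["response"]["docs"]


# Hypothetical host and collection, used only for illustration.
url = build_select_url("http://localhost:8983", "techproducts", "name:solr", rows=5)

# Hand-written sample shaped like a Solr JSON response (not real output).
sample = (
    '{"responseHeader": {"status": 0}, '
    '"response": {"numFound": 1, "start": 0, "docs": [{"id": "SOLR1000"}]}}'
)
docs = extract_docs(sample)
```

Because the interface is plain HTTP, the same URL could be fetched with any HTTP client in any language; the client libraries mainly save the work of building requests and parsing responses by hand.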

86-400: A copy of a cheque might take weeks; a bank employee had to contact the warehouse where the right box, file and cheque were located. The cheque would be pulled and a copy made and mailed to the bank, which would then mail it to the customer. With an ECM system in place, a bank employee could query the system for the customer's account number and the number of the requested cheque. When an image of

129-460: A document-management system. As organizations established an Internet presence, they wanted to manage web content. Organizations which had automated individual departments began to envision a broader deployment. The movement toward integrated DMS systems reflected a common trend in the software industry: the integration of small systems into more comprehensive ones. Word processing, spreadsheet, and presentation software were standalone products until

Lucene formerly included a number of sub-projects, such as Lucene.NET, Mahout, Tika and Nutch. These three are now independent top-level projects. In March 2010, the Apache Solr search server joined as a Lucene sub-project, merging the developer communities. Version 4.0 was released on October 12, 2012. In March 2021, Lucene changed its logo, and Apache Solr became a top-level Apache project again, independent from Lucene. While suitable for any application that requires full-text indexing and searching capability, Lucene

215-525: A number of users on the same content item. Collaboration uses skill-based knowledge, resources and background data for joint information processing. Administration components (such as virtual whiteboards for brainstorming, appointment scheduling, and project management systems) and communications applications such as video conferencing may be included. Collaborative ECM may also integrate information from other applications. ECM integrates Content management systems (CMS), presenting existing information managed in

258-408: A secure repository for managed items, analog or digital. They also include one (or more) methods for importing content to manage new items, and several presentation methods to make items available for use. Although ECM content may be protected by digital rights management (DRM), it is not required. ECM is distinguished from general content management by its cognizance of the processes and procedures of

301-951: A small-scale imaging and workflow system (perhaps one department) to improve a paper-intensive process and work towards a paperless office . The first stand-alone DMS technologies intended to save time (or improve information access) by reducing paper handling and storage, reducing document loss and speeding access to information. DMS could provide online access to information formerly available only on paper, microfilm, or microfiche. By improving control over documents and their processes, DMS streamlined business practices. Their audit trail increased document security and measured productivity and efficiency. DMS product categories were seen as complementary, and organizations wished to use several DMS products. A customer-service department could combine imaging, document management and workflow; an accounting department could access supplier invoices from an ERM system, purchase orders from an imaging system, and contracts from

344-478: A standalone top-level project (TLP) and grew steadily with accumulated features, thereby attracting users, contributors, and committers. Although quite new as a public project, it powered several high-traffic websites. In September 2008, Solr 1.3 was released including distributed search capabilities and performance enhancements among many others. In January 2009, Yonik Seeley along with Grant Ingersoll and Erik Hatcher joined Lucidworks (formerly Lucid Imagination),

387-503: A storage system's racks but not inserted in a drive that can read it), and offline storage (data and documents on a medium which is not quickly available). If the document management system does not provide it, the library service must have version management to control the status of information and check-in/check-out for controlled information provision. It generates an audit trail, logs of information usage and editing. A variety of technologies can be used to store information, depending on

A web interface. Digital asset management is a form of ECM involving digitally stored content. Specialized Healthcare Content Management Systems meet the special regulatory requirements for medical devices and interoperability. The technologies which encompassed ECM in 2016 descend from the electronic Document Management Systems (DMS) of the late 1980s and early 1990s. The original DMS products were stand-alone, providing functionality in one of four areas: imaging, workflow, document management, and enterprise relationship management (ERM). A typical early DMS user had

Apache Lucene is a free and open-source search engine software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene is widely used as a standard foundation for production search applications. Lucene has been ported to other programming languages including Object Pascal, Perl, C#, C++, Python, Ruby and PHP. Doug Cutting originally wrote Lucene in 1999. Lucene

Is checked in and out, each use generates new metadata (automatically, to some extent). Information about how (and when) people use the content can allow the system to acquire new filtering, routing and search pathways, corporate taxonomies and semantic networks, and retention-rule decisions. Solutions can provide intranet services to employees (B2E), and can include enterprise portals for business-to-business (B2B), business-to-government (B2G), government-to-business (G2B), or other business relationships. This category includes most former document-management groupware and workflow solutions that had not, by 2016, fully converted their architecture to ECM but provided

Apache Solr is developed in an open, collaborative manner by the Apache Solr project at the Apache Software Foundation. In 2004, Solr was created by Yonik Seeley at CNET Networks as an in-house project to add search capability for the company website. In January 2006, CNET Networks decided to openly publish the source code by donating it to the Apache Software Foundation. Like any new Apache project, it entered an incubation period that helped solve organizational, legal, and financial issues. In January 2007, Solr graduated from incubation status into

Lucene itself is just an indexing and search library and does not contain crawling and HTML parsing functionality. However, several projects extend Lucene's capability:

Enterprise content management

Enterprise content management (ECM) extends the concept of content management by adding a timeline for each content item and, possibly, enforcing processes for its creation, approval, and distribution. Systems using ECM generally provide

Lucene is recognized for its utility in the implementation of Internet search engines and local, single-site searching. Lucene includes a feature to perform a fuzzy search based on edit distance. Lucene has also been used to implement recommendation systems. For example, Lucene's 'MoreLikeThis' class can generate recommendations for similar documents. In a comparison of the term vector-based similarity approach of 'MoreLikeThis' with citation-based document similarity measures, such as co-citation and co-citation proximity analysis, Lucene's approach excelled at recommending documents with very similar structural characteristics and narrower relatedness. In contrast, citation-based document similarity measures tended to be more suitable for recommending more broadly related documents, meaning citation-based approaches may be more suitable for generating serendipitous recommendations, as long as documents to be recommended contain in-text citations.
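To make the idea of fuzzy search based on edit distance concrete, here is a minimal sketch of the classic Levenshtein distance and a naive matcher built on it. This is the textbook dynamic-programming form, not Lucene's own implementation, which is considerably more optimized; the function names and the sample vocabulary are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


def fuzzy_match(term, vocabulary, max_edits=2):
    """Return the vocabulary terms within max_edits of the query term."""
    return [w for w in vocabulary if edit_distance(term, w) <= max_edits]


# Hypothetical index vocabulary and a misspelled query term.
matches = fuzzy_match("lucine", ["lucene", "lucent", "solr"], max_edits=1)
```

Scanning every term this way is linear in the vocabulary size, which is why a production search library would not match terms one by one as this sketch does.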

688-522: Is working properly when it is invisible to users. It supports specialized applications as subordinate services. ECM is a multi-layer model which includes technology for handling, delivering, and managing structured data and unstructured information. It manages the information in a web content management system and archives as a universal repository. ECM combines components which can be used as stand-alone systems without being incorporated into an enterprise-wide system. The five ECM components were defined by

731-505: The Lucene and Solr projects merged. Separate downloads continued, but the products were now jointly developed by a single set of committers. In 2011, the Solr version number scheme was changed in order to match that of Lucene. After Solr 1.4, the next release of Solr was labeled 3.1, in order to keep Solr and Lucene on the same version number. In October 2012, Solr version 4.0 was released, including

The Payment Card Industry Data Security Standard (PCI DSS), and the Federal Rules of Civil Procedure. Security at the user, function, and record levels protects sensitive data. Some information in a document can be redacted, so the remainder can be shared without compromising identity or key data. Every action in the system is tracked, and can be reported to demonstrate compliance with a wide variety of regulations. In his Computerwoche article, Ulrich Kampffmeyer characterized ECM as: ECM

The overall cost of information management. ECM streamlines access to records with keyword and full-text searching, allowing employees to quickly obtain needed information from their desktops. ECM facilitates organizational efficiency through the following capabilities: The management systems can help businesses comply with government and industry regulations such as HIPAA, the Sarbanes–Oxley Act,

The Association for Information and Image Management (AIIM) as: Capture involves converting information from paper documents into an electronic format by scanning, and collects electronic files and information into a consistent structure for management. Capture technologies also encompass the creation of metadata, describing characteristics of a document for easy location through search technology. A medical chart might include

The ECM repository. Unlike traditional electronic archival systems, file and archive management is the administration of records, important information, and data which companies are required to archive. Independent of storage media, managed information does not need to be stored electronically. File and archive management includes: The terms "workflow" and "business process management" (BPM) are often used interchangeably. Production workflow uses predefined sequences to control processes; in an ad-hoc workflow,

946-537: The application and system environment: Preserve is the long-term, safe storage and backup of unchanging information. Typically accomplished by ECM records management, it may be designed to help companies comply with government and industry regulations. Content eventually stops changing and becomes static. ECM's digital preservation components also temporarily store information which does not need to be archived. Preserve components have special viewers, conversion and migration tools, and long-term storage media: To ensure

The cheque appeared on-screen, the bank could mail a copy immediately to the customer; usually while the customer was still on the phone. Enterprise content management, a form of content management, combines the capture, search and networking of documents with digital archiving, document management and workflow. It includes the challenges involved in using and preserving a company's internal (often unstructured) information in all of its forms. Most ECM solutions focus on business-to-employee (B2E) systems. New ECM components have emerged. As content

1032-950: The dynamic part of the information's life cycle. Records management manages finalized documents in accordance with the organization's retention period , which must comply with government mandates and industry practices. Manage components incorporate databases and access-authorization systems. Document management systems control documents from creation to archiving. They include: Document management overlaps with other manage components, office applications (like Microsoft Outlook and Exchange, or Lotus Notes and Domino), and library services which administer information storage. Collaboration components in an ECM system help users work together to develop and process content. Many of these components were developed from collaborative-software packages; ECM collaborative systems include elements of knowledge management . They use information databases and processing methods which are designed to be used simultaneously by

The early 1990s, when the market shifted toward integration. Early developers offered multiple stand-alone DMS technologies as a single, packaged "suite", with little (or no) functional integration. Around 2001, the industry began to use the term "enterprise content management" for integrated systems. In 2006, Microsoft (with its SharePoint product family) and Oracle Corporation (with Oracle Content Management) entered

The enterprise for which it is created. The latest definition encompasses areas which have traditionally been addressed by records- and document-management systems. It implies the conversion of data to digital and traditional forms, including paper and microfilm. ECM, as an umbrella term, covers document and web content management, search, collaboration, records management, digital asset management (DAM), workflow management, and capture and scanning. It manages

The first company providing commercial support and training for Apache Solr search technologies. Since then, support offerings around Solr have been abundant. November 2009 saw the release of Solr 1.4. This version introduced enhancements in indexing, searching and faceting along with many other improvements such as rich document processing (PDF, Word, HTML), search results clustering based on Carrot2 and improved database integration. The release also features many additional plug-ins. In March 2010,

Solr 9.0 was the first release independent from Lucene, requiring Java 11, with highlights such as KNN "Neural" search, better modularization and more security plugins.

In order to search a document, Apache Solr performs the following operations in sequence:

Solr has both individuals and companies who contribute new features and bug fixes. Solr is bundled as the built-in search in many applications such as content management systems and enterprise content management systems. Hadoop distributions from Cloudera, Hortonworks and MapR all bundle Solr as

1247-536: The information content and character of the documents may be identical. It is the capture of printed forms via scanning; recognition technologies are often used, since well-designed forms enable automatic processing. Automatic processing can capture electronic forms (such as those submitted via webpages) if the layout, structure, logic, and contents are known to the capturing system. Enterprise report management (ERM) records reports and other documents on optical disks or other digital storage for ECM systems. The technology

1290-412: The life cycle of information, from initial publication (or creation) through archival and eventual disposal. It is delivered in four ways: Benefits to an organization include improved efficiency, better control, and reduced costs. Banks have converted to storing copies of old cheques in ECM systems from the older method of keeping physical cheques in warehouses. Under the old system, a customer request for

1333-438: The long term availability of information, several strategies are used for electronic archiving. Applications, index data, metadata and objects may be continuously migrated from older systems to newer ones. Emulation of older software allows users to access original data and objects; software can identify the format of preserved objects and display them in a new environment. Enterprise output management presents information from

The low-cost ECM market. Open source ECM products are also available. Government standards, including the Health Insurance Portability and Accountability Act (HIPAA), BS 7799 and ISO/IEC 27001, influence the development and use of ECM. In 2016, organizations could deploy a single ECM system to manage information in all departments. Businesses adopt ECM to increase efficiency, improve information control, and reduce

The new SolrCloud feature. 2013 and 2014 saw a number of Solr releases in the 4.x line, steadily growing the feature set and improving reliability. In February 2015, Solr 5.0 was released, the first release where Solr is packaged as a standalone application, ending official support for deploying Solr as a WAR. Solr 5.3 featured a built-in pluggable authentication and authorization framework. In April 2016, Solr 6.0

1462-573: The number of possible index values or automatically assign certain criteria. Automatic classification programs can extract index, category, and transfer data autonomously. Based on the information contained in electronic information objects, it can evaluate information based on predefined criteria or in a self-learning process. The manage category has five application areas: It connects the other components, which can be used in combination or separately. Document management, web content management, collaboration, workflow and business process management address

1505-848: The patient ID, name, date of visit and procedure for medical personnel to locate the chart. Earlier document automation systems photographed documents for storage on microfilm or microfiche . Image scanners make digital copies of paper documents. Documents already in digital form can be copied (or linked to) if they are available online. Automatic or semi-automatic capture can use electronic data interchange (EDI) or XML documents, business and ERP applications, or specialized-application systems as sources. Recognition technologies to extract information from scanned documents and digital faxes include: Image-cleanup features include rotation, straightening, color adjustment, transposition, zoom, aligning, page separation, annotations and noise reduction . Forms processing has two groups of technology, although

The search engine for their products marketed for big data. DataStax DSE integrates Solr as a search engine with Cassandra. Solr is supported as an end point in various data processing frameworks and enterprise integration frameworks. Solr exposes industry-standard HTTP REST-like APIs with both XML and JSON support, and will integrate with any system or programming language supporting these standards. For ease of use there are also client libraries available for Java, C#, PHP, Python, Ruby and most other popular programming languages.

Lucene

Apache Lucene

1591-434: The store component uses media suitable for long-term archiving, it is still separate from "preserve." Store components may be divided into three categories: ECM repositories may be combined. Types include: Library services are ECM administrative components which handle access to information, taking in and storing information from the capture and manage components. They also manage the storage locations in dynamic storage,

The store, and the long-term preserve archive. The storage location is determined by information characteristics and classification. The library service works with the manage components' database to provide search and retrieval. It manages online storage (direct access to data and documents), nearline storage (data and documents on a medium which can be accessed quickly, such as data on an optical disc in

1677-474: The user determines the process sequence. Users interact in workflow solutions, and workflow engines are a background service controlling information and data flow. Workflow management includes: According to the AIIM, BPM is a way of looking at (and controlling) organizational processes. Store components temporarily store information which is not required, desired, or ready for long-term storage or preservation. Even if

An admin UI login was added with support for BasicAuth and Kerberos. Plotting math expressions in Apache Zeppelin also became possible. In November 2020, Bloomberg donated the Solr Operator to the Lucene/Solr project. The Solr Operator helps deploy and run Solr in Kubernetes. In February 2021, Solr was established as a separate Apache project (TLP), independent from Lucene. In May 2022, Solr 9.0 was released, as

Lucene was his fifth search engine. He had previously written two while at Xerox PARC, one at Apple, and a fourth at Excite. It was initially available for download from its home at the SourceForge web site. It joined the Apache Software Foundation's Jakarta family of open-source Java products in September 2001 and became its own top-level Apache project in February 2005. The name Lucene is Doug Cutting's wife's middle name and her maternal grandmother's first name. Lucene formerly included

Was originally used with LaserDiscs. Data aggregation unifies documents from different applications and sources, forwarding them to storage and processing systems in a uniform structure and format. Subject indexing improves searches, providing alternative ways of organizing information. Manual indexing assigns index database attributes to content by hand, and is typically used by a "manage" database for administration and access. Automatic and manual attribute indexing can be facilitated with preset input-design profiles, which can describe document classes that limit

Solr 6.0 was released, adding support for executing Parallel SQL queries across SolrCloud collections. It includes StreamExpression support and a new JDBC driver for the SQL interface. In September 2017, Solr 7.0 was released. This release, among other things, added support for multiple replica types, auto-scaling, and a Math engine. In March 2019, Solr 8.0 was released, including many bug fixes and component updates. Solr nodes can now listen for and serve HTTP/2 requests. Be aware that by default, internal requests are also sent by using HTTP/2. Furthermore, an admin UI login
