
Data warehouse

Article snapshot taken from Wikipedia with creative commons attribution-sharealike license. Give it a read and then ask your questions in the chat. We can research this topic together.

In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis and is a core component of business intelligence. Data warehouses are central repositories of data integrated from disparate sources. They store current and historical data organized so as to make it easy to create reports, query the data and get insights from it. Unlike operational databases, they are intended to be used by analysts and managers to help make organizational decisions.


63-492: The data stored in the warehouse is uploaded from operational systems (such as marketing or sales). The data may pass through an operational data store and may require data cleansing and other operations to ensure data quality before it is used in the data warehouse for reporting. The two main approaches for building a data warehouse system are extract, transform, load (ETL) and extract, load, transform (ELT). The environment for data warehouses and marts includes
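The contrast between the two approaches can be sketched in a few lines of Python. The in-memory SQLite database below is only a stand-in for a real warehouse, and the table names and cleansing step are invented for the example.

```python
import sqlite3

def extract(source_rows):
    """Stand-in for reading raw rows from an operational system."""
    return list(source_rows)

def transform(rows):
    """Stand-in for cleansing/standardizing rows."""
    return [(name.strip().upper(), amount) for name, amount in rows]

raw = [(" widget ", 9.99), ("gadget", 4.50)]
dw = sqlite3.connect(":memory:")                 # stand-in for the warehouse
dw.execute("CREATE TABLE sales (product TEXT, amount REAL)")

# ETL: transform outside the warehouse, then load the cleaned rows.
dw.executemany("INSERT INTO sales VALUES (?, ?)", transform(extract(raw)))

# ELT: load the raw rows first, then transform inside the warehouse with SQL.
dw.execute("CREATE TABLE sales_raw (product TEXT, amount REAL)")
dw.executemany("INSERT INTO sales_raw VALUES (?, ?)", extract(raw))
dw.execute("INSERT INTO sales SELECT upper(trim(product)), amount FROM sales_raw")

print(dw.execute("SELECT * FROM sales").fetchall())
```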

126-521: A BitTorrent client . While the BitTorrent protocol itself is legal and agnostic of the type of content shared, many of the services that did not enforce a strict policy to take down copyrighted material would eventually also run into legal difficulties. Staging (data) A staging area , or landing zone , is an intermediate storage area used for data processing during the extract, transform and load (ETL) process. The data staging area sits between

189-499: A dimensional approach , transaction data is partitioned into "facts", which are usually numeric transaction data, and " dimensions ", which are the reference information that gives context to the facts. For example, a sales transaction can be broken up into facts such as the number of products ordered and the total price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and salesperson responsible for receiving
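A minimal sketch of how such a transaction might be laid out as a star schema, using an in-memory SQLite database as a stand-in warehouse; every table and column name here is an illustrative assumption rather than a prescribed design.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Dimension tables hold the descriptive context for each fact.
db.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE dim_date     (date_id     INTEGER PRIMARY KEY, order_date TEXT)")
# The fact table holds the numeric measures plus foreign keys to the dimensions.
db.execute("""CREATE TABLE fact_sales (
    customer_id INTEGER, product_id INTEGER, date_id INTEGER,
    units_ordered INTEGER, total_price REAL)""")

db.execute("INSERT INTO dim_customer VALUES (1, 'Acme Corp')")
db.execute("INSERT INTO dim_product  VALUES (1, 'Widget')")
db.execute("INSERT INTO dim_date     VALUES (1, '2024-01-15')")
db.execute("INSERT INTO fact_sales   VALUES (1, 1, 1, 10, 99.90)")

# A typical analytic query joins the fact table to its dimensions (a "star" join).
print(db.execute("""
    SELECT c.name, p.name, d.order_date, f.units_ordered, f.total_price
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_id = f.customer_id
    JOIN dim_product  p ON p.product_id  = f.product_id
    JOIN dim_date     d ON d.date_id     = f.date_id
""").fetchall())
```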

252-632: A motion for a preliminary injunction in order to stop the exchange of copyrighted songs on the service. After a failed appeal by Napster, the injunction was granted on March 5, 2001. On September 24, 2001, Napster, which had already shut down its entire network two months earlier, agreed to pay a $26 million settlement. After Napster had ceased operations, many other P2P file-sharing services also shut down, such as Limewire, Kazaa and Popcorn Time. Besides software programs, there were many BitTorrent websites that allowed files to be indexed and searched. These files could then be downloaded via

315-417: A network . Common methods of uploading include: uploading via web browsers , FTP clients , and terminals ( SCP / SFTP ). Uploading can be used in the context of (potentially many) clients that send files to a central server . While uploading can also be defined in the context of sending files between distributed clients, such as with a peer-to-peer (P2P) file-sharing protocol like BitTorrent ,

378-493: A business transaction being stored in dozens to hundreds of tables. Relational databases are efficient at managing the relationships between these tables. The databases have very fast insert/update performance because only a small amount of data in those tables is affected by each transaction. To improve performance, older data are periodically purged. Data warehouses are optimized for analytic access patterns, which usually involve selecting specific fields rather than all fields as

441-408: A central data warehouse, or external data. As with warehouses, stored data is usually not normalized. Types of data marts include dependent , independent, and hybrid data marts. The typical extract, transform, load (ETL)-based data warehouse uses staging , data integration , and access layers to house its key functions. The staging layer or staging database stores raw data extracted from each of

504-415: A city, then the facts above can be aggregated to the city level in the network dimension. For example: The two most important approaches to store data in a warehouse are dimensional and normalized. The dimensional approach uses a star schema, as proposed by Ralph Kimball. The normalized approach, also called the third normal form (3NF) approach, is a normalized entity–relationship model proposed by Bill Inmon. In

567-432: A comprehensive data warehouse. The data warehouse bus architecture is primarily an implementation of "the bus", a collection of conformed dimensions and conformed facts , which are dimensions that are shared (in a specific way) between facts in two or more data marts. The top-down approach is designed using a normalized enterprise data model . "Atomic" data , that is, data at the greatest level of detail, are stored in

630-670: A computer or other digital device to the memory of another device (such as a larger or remote computer), especially via the internet. Remote file sharing first came to fruition in January 1978, when Ward Christensen and Randy Suess, who were members of the Chicago Area Computer Hobbyists' Exchange (CACHE), created the Computerized Bulletin Board System (CBBS). This used an early file transfer protocol (MODEM, later XMODEM) to send binary files via

693-474: A copy of information from the source transaction systems. This architectural complexity provides the opportunity to: The concept of data warehousing dates back to the late 1980s when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse". In essence, the data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to decision support environments . The concept attempted to address


756-595: A data latency of a few hours, while data mart latency is closer to one day. The OLAP approach is used to analyze multidimensional data from multiple sources and perspectives. The three basic operations in OLAP are roll-up (consolidation), drill-down, and slicing & dicing. Online transaction processing (OLTP) is characterized by a large number of short online transactions (INSERT, UPDATE, DELETE). OLTP systems emphasize fast query processing and maintaining data integrity in multi-access environments. For OLTP systems, performance
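The three OLAP operations can be illustrated with a small pandas sketch over invented sales data; the column names and figures are assumptions made purely for the example.

```python
import pandas as pd

# Toy multidimensional data: one row per (region, state, store, month).
sales = pd.DataFrame({
    "region": ["West", "West", "West", "East"],
    "state":  ["CA",   "CA",   "WA",   "NY"],
    "store":  ["S1",   "S2",   "S3",   "S4"],
    "month":  ["Jan",  "Jan",  "Feb",  "Jan"],
    "units":  [120,    80,     50,     200],
})

# Roll-up (consolidation): aggregate stores up to the region level.
print(sales.groupby("region")["units"].sum())

# Drill-down: move from region totals to state-level detail.
print(sales.groupby(["region", "state"])["units"].sum())

# Slice: fix one dimension (month == "Jan"); dice: fix several at once.
print(sales[sales["month"] == "Jan"])
print(sales[(sales["month"] == "Jan") & (sales["region"] == "West")])
```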

819-417: A data warehouse to be replaced with a master data management repository where operational (not static) information could reside. The data vault modeling components follow a hub-and-spokes architecture. This modeling style is a hybrid design, consisting of the best practices from both third normal form and star schema. The data vault model is not a true third normal form, and breaks some of its rules, but it

882-522: A goal of minimizing contention within source systems. Copying required data from source systems to the staging area in one shot is often more efficient than retrieving individual records (or small sets of records) on a one-off basis. The former method takes advantage of technical efficiencies, such as data streaming technologies, reduced overhead through minimizing the need to break and re-establish connections to source systems and optimization of concurrency lock management on multi-user source systems. By copying

945-513: A hardware modem, accessible by another modem via a telephone number. In the following years, new protocols such as Kermit were released, until the File Transfer Protocol (FTP) was standardized in 1985 (RFC 959). FTP is based on TCP/IP and gave rise to many FTP clients, which, in turn, gave users all around the world access to the same standard network protocol to transfer data between devices. The transfer of data saw

1008-446: A long time horizon (up to 10 years), which means it stores mostly historical data. It is mainly meant for data mining and forecasting. (For example, if a user is searching for the buying pattern of a specific customer, they need to look at data on current and past purchases.) The data in the data warehouse is read-only, which means it cannot be updated, created, or deleted (unless there is a regulatory or statutory obligation to do so). In

1071-436: A mobile telephone system, if a base transceiver station (BTS) receives 1,000 requests for traffic channel allocation, allocates 820, and rejects the rest, it could report three facts to a management system: Raw facts are aggregated to higher levels in various dimensions to extract information more relevant to the service or business. These are called aggregated facts or summaries. For example, if there are three BTSs in
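A small sketch of that kind of roll-up, assuming three BTSs whose raw counters (requests, allocations, rejections) are summed into a city-level summary; the figures are made up.

```python
# Raw facts reported per base transceiver station (BTS): hypothetical numbers.
raw_facts = {
    "bts_1": {"tch_requests": 1000, "tch_allocated": 820, "tch_rejected": 180},
    "bts_2": {"tch_requests": 600,  "tch_allocated": 590, "tch_rejected": 10},
    "bts_3": {"tch_requests": 450,  "tch_allocated": 400, "tch_rejected": 50},
}

# Aggregated facts (summaries): roll the per-BTS counters up to the city level.
city_summary = {
    metric: sum(bts[metric] for bts in raw_facts.values())
    for metric in ("tch_requests", "tch_allocated", "tch_rejected")
}
print(city_summary)  # {'tch_requests': 2050, 'tch_allocated': 1810, 'tch_rejected': 240}
```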

1134-529: A music-sharing platform specialized in MP3 files that used peer-to-peer (P2P) file-sharing technology to allow users to exchange files freely. The P2P nature meant there was no central gatekeeper for the content, which eventually led to the widespread availability of copyrighted material through Napster. The Recording Industry Association of America (RIAA) took notice of Napster's ability to distribute copyrighted music among its user base, and, on December 6, 1999, filed

1197-626: A significant increase in popularity after the release of the World Wide Web in 1991, which, for the first time, allowed users who were not computer hobbyists to easily share files, directly from their web browser over HTTP . Transfers became more reliable with the launch of HTTP/1.1 in 1997 ( RFC   2068 ), which gave users the option to resume downloads that were interrupted, for instance due to unreliable connections. Before web browsers widely rolled out support, software programs like GetRight could be used to resume downloads. Resuming uploads

1260-407: A staging area inside the data warehouse itself. In this approach, data is extracted from heterogeneous source systems and then loaded directly into the data warehouse, before any transformation occurs. All necessary transformations are then handled inside the data warehouse itself. Finally, the manipulated data gets loaded into target tables in the same data warehouse. A data warehouse maintains

1323-483: A staging area to support highly responsive service level agreements (SLAs) for summary reporting in target systems. Data archiving can be performed in, or supported by, a staging area. In this scenario the staging area can be used to maintain historical records during the load process, or it can be used to push data into a target archive structure. Additionally data may be maintained within the staging area for extended periods of time to support technical troubleshooting of


1386-432: A storage area where summary data could be further leveraged to inform executive decision-making. This concept served to promote further thinking of how a data warehouse could be developed and managed in a practical way within any enterprise. Key developments in the early years of data warehousing: A fact is a value or measurement in the system being managed. Raw facts are ones reported by the reporting entity. For example, in

1449-499: A target database to self-contained database instances or file systems. Though the source systems and target systems supported by ETL processes are often relational databases, the staging areas that sit between data sources and targets need not also be relational databases. Staging areas can be designed to provide many benefits, but the primary motivations for their use are to increase efficiency of ETL processes, ensure data integrity and support data quality operations. The functions of

1512-458: Is a top-down architecture with a bottom-up design. The data vault model is geared to be strictly a data warehouse. It is not geared to be end-user accessible and, when built, still requires the use of a data mart or star schema-based release area for business purposes. The basic features that define the data in a data warehouse include subject orientation, data integration, time variance, nonvolatility, and data granularity. Unlike

1575-406: Is a type of staging area in a data warehouse which tracks the whole change history of a source table or query. Staging areas can be implemented in the form of tables in relational databases, text-based flat files (or XML files) stored in file systems or proprietary formatted binary files stored in file systems. Staging area architectures range in complexity from a set of simple relational tables in
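A persistent staging area can be sketched as a relational table to which every extract appends a new, timestamped version of each source row, so nothing is ever overwritten. The table and column names below are assumptions for illustration.

```python
import sqlite3
from datetime import datetime, timezone

db = sqlite3.connect(":memory:")
# A persistent staging table keeps every version of every source row.
db.execute("""CREATE TABLE psa_customer (
    customer_id INTEGER, name TEXT, city TEXT, load_ts TEXT)""")

def stage(rows):
    """Append the latest extract without overwriting earlier versions."""
    ts = datetime.now(timezone.utc).isoformat()
    db.executemany("INSERT INTO psa_customer VALUES (?, ?, ?, ?)",
                   [(*row, ts) for row in rows])

stage([(1, "Acme", "Berlin")])   # first extract
stage([(1, "Acme", "Munich")])   # later extract, after the customer moved
print(db.execute("SELECT * FROM psa_customer ORDER BY load_ts").fetchall())
```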

1638-477: Is an example of this, as is the InterPlanetary File System (IPFS). Peer-to-peer allows users to both receive (download) and host (upload) content. Files are transferred directly between the users' computers. The same file transfer constitutes an upload for one party, and a download for the other party. The rising popularity of file sharing during the 1990s culminated in the emergence of Napster ,

1701-436: Is common in operational databases. Because of these differences in access, operational databases (loosely, OLTP) benefit from the use of a row-oriented database management system (DBMS), whereas analytics databases (loosely, OLAP) benefit from the use of a column-oriented DBMS . Operational systems maintain a snapshot of the business, while warehouses maintain historic data through ETL processes that periodically migrate data from
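A toy sketch of the two layouts: the row layout keeps whole records together (cheap to insert or update one record), while the column layout keeps each field contiguous, so an aggregate over a single field touches only that field. This is purely illustrative, not how any particular DBMS lays out its storage.

```python
# Row-oriented layout: one tuple per record (good for inserting/updating a record).
rows = [
    ("2024-01-01", "Widget", 10, 99.90),
    ("2024-01-02", "Gadget", 3,  12.00),
]

# Column-oriented layout: one list per field (good for scanning a single field).
columns = {
    "order_date":  ["2024-01-01", "2024-01-02"],
    "product":     ["Widget", "Gadget"],
    "units":       [10, 3],
    "total_price": [99.90, 12.00],
}

# Analytic query "sum of total_price": the column store reads just one list...
print(sum(columns["total_price"]))
# ...whereas the row store walks every record to pick the field out.
print(sum(r[3] for r in rows))
```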

1764-485: Is not currently supported by HTTP, but can be added with the Tus open protocol for resumable file uploads , which layers resumability of uploads on top of existing HTTP connections. Transmitting a local file to a remote system following the client–server model , e.g., a web browser transferring a video to a website, is called client-to-server uploading . Transferring data from one remote system to another remote system under

1827-414: Is not efficient for business intelligence reports where dimensional modelling is prevalent. Small data marts can shop for data from the consolidated warehouse and use the filtered, specific data for the fact tables and dimensions required. The data warehouse provides a single source of information from which the data marts can read, providing a wide range of business information. The hybrid architecture allows

1890-408: Is reactive. Predictive systems are also used for customer relationship management (CRM). A data mart is a simple data warehouse focused on a single subject or functional area. Hence it draws data from a limited number of sources such as sales, finance or marketing. Data marts are often built and controlled by a single department in an organization. The sources could be internal operational systems,

1953-416: Is sometimes called a star schema . The access layer helps users retrieve data. The main source of the data is cleansed , transformed, catalogued, and made available for use by managers and other business professionals for data mining , online analytical processing , market research and decision support . However, the means to retrieve and analyze data, to extract, transform, and load data, and to manage


2016-522: Is that it is straightforward to add information into the database. Disadvantages include that, because of the large number of tables, it can be difficult for users to join data from different sources into meaningful information and to access the information without a precise understanding of the data sources and the data structure of the data warehouse. Both normalized and dimensional models can be represented in entity–relationship diagrams because both contain joined relational tables. The difference between them

2079-410: Is that the dimensional model does not require a relational database every time. Thus, this type of modeling technique is very useful for end-user queries in the data warehouse. The model of facts and dimensions can also be understood as a data cube, where dimensions are the categorical coordinates in a multi-dimensional cube and the fact is a value corresponding to the coordinates. The main disadvantages of

2142-440: Is the degree of normalization. These approaches are not mutually exclusive, and there are other approaches. Dimensional approaches can involve normalizing data to a degree (Kimball, Ralph 2008). In Information-Driven Business , Robert Hillard compares the two approaches based on the information needs of the business problem. He concludes that normalized models hold far more information than their dimensional equivalents (even when

2205-462: Is the number of transactions per second. OLTP databases contain detailed and current data. The schema used to store transactional databases is the entity model (usually 3NF). Normalization is the norm for data modeling techniques in this system. Predictive analytics is about finding and quantifying hidden patterns in the data using complex mathematical models in order to predict future outcomes. By contrast, OLAP focuses on historical data analysis and

2268-537: Is used by some online file hosting services . Another example can be found in FTP clients, which often support the File eXchange Protocol (FXP) in order to instruct two FTP servers with high-speed connections to exchange files. A web-based example is the Uppy file uploader that can transfer files from a user's cloud storage such as Dropbox , directly to a website without first going to

2331-464: The data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition of data warehousing includes business intelligence tools , tools to extract, transform, and load data into the repository, and tools to manage and retrieve metadata . ELT -based data warehousing gets rid of a separate ETL tool for data transformation. Instead, it maintains

2394-493: The extract, transform, load process, data warehouses often make use of an operational data store, the information from which is parsed into the actual data warehouse. To reduce data redundancy, larger systems often store the data in a normalized way. Data marts for specific reports can then be built on top of the data warehouse. A hybrid (also called ensemble) data warehouse database is kept in third normal form to eliminate data redundancy. A normal relational database, however,

2457-420: The control of a local system is called remote uploading or site-to-site transferring. This is used when a local computer has a slow connection to the remote systems, but these systems have a fast connection between them. Without remote uploading functionality, the data would have to first be downloaded to the local system and then uploaded to the remote server, both times over a slower connection. Remote uploading

2520-411: The correct functionality of a data warehouse are the main components of the data warehouse architecture. All data warehouses have multiple phases in which the requirements of the organization are modified and fine-tuned. These terms refer to the level of sophistication of a data warehouse: Upload Uploading refers to transmitting data from one computer system to another through means of

2583-504: The creation of a new database containing personal information can make it easier to comply with privacy regulations. However, with data virtualization, the connection to all necessary data sources must be operational as there is no local copy of the data, which is one of the main drawbacks of the approach. The different methods used to construct/organize a data warehouse specified by an organization are numerous. The hardware utilized, software created and data resources specifically required for


2646-565: The data source(s) and the data target(s), which are often data warehouses , data marts , or other data repositories. Data staging areas are often transient in nature, with their contents being erased prior to running an ETL process or immediately following successful completion of an ETL process. Such a staging area is sometimes called a transient staging area (TSA). There are staging area architectures, however, which are designed to hold data for extended periods of time for archival or troubleshooting purposes. A persistent staging area (PSA)

2709-409: The data used remains in its original locations and real-time access is established to allow analytics across multiple sources creating a virtual data warehouse. This can aid in resolving some technical difficulties such as compatibility problems when combining data from various platforms, lowering the risk of error caused by faulty data, and guaranteeing that the newest data is used. Furthermore, avoiding

2772-437: The data warehouse process, data can be aggregated in data marts at different levels of abstraction. The user may start looking at the total sale units of a product in an entire region. Then the user looks at the states in that region. Finally, they may examine the individual stores in a certain state. Therefore, typically, the analysis starts at a higher level and drills down to lower levels of details. With data virtualization ,

2835-441: The data warehouse. Dimensional data marts containing data needed for specific business processes or specific departments are created from the data warehouse. Data warehouses often resemble the hub and spokes architecture . Legacy systems feeding the warehouse often include customer relationship management and enterprise resource planning , generating large amounts of data. To consolidate these various data models, and facilitate

2898-431: The data was placed in the staging area. Aligning data includes standardization of reference data across multiple source systems and validation of relationships between records and data elements from different sources. Data alignment in the staging area is a function closely related to, and acting in support of, master data management capabilities. The staging area and ETL processes it supports are often designed with

2961-420: The dimensional approach are: In the normalized approach, the data in the warehouse are stored following, to a degree, database normalization rules. Normalized relational database tables are grouped into subject areas (for example, customers, products and finance). When used in large enterprises, the result is dozens of tables linked by a web of joins (Kimball, Ralph 2008). The main advantage of this approach

3024-477: The disparate source data systems. The integration layer integrates disparate data sets by transforming the data from the staging layer, often storing this transformed data in an operational data store (ODS) database. The integrated data are then moved to yet another database, often called the data warehouse database, where the data is arranged into hierarchical groups, often called dimensions, and into facts and aggregate facts. The combination of facts and dimensions

3087-445: The following: Operational databases are optimized for the preservation of data integrity and speed of recording of business transactions through use of database normalization and an entity–relationship model . Operational system designers generally follow Codd's 12 rules of database normalization to ensure data integrity. Fully normalized database designs (that is, those satisfying all Codd rules) often result in information from
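A small sketch of how one business transaction ends up spread across several normalized tables in an operational schema; the table names are illustrative and SQLite stands in for the operational database.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer  (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product   (product_id  INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE sale      (sale_id     INTEGER PRIMARY KEY, customer_id INTEGER,
                            sale_date TEXT);
    CREATE TABLE sale_line (sale_id INTEGER, product_id INTEGER, quantity INTEGER);
""")

# One business transaction touches several small tables, each updated cheaply.
with db:  # commits the whole transaction atomically
    db.execute("INSERT INTO customer  VALUES (1, 'Acme Corp')")
    db.execute("INSERT INTO product   VALUES (7, 'Widget', 9.99)")
    db.execute("INSERT INTO sale      VALUES (100, 1, '2024-01-15')")
    db.execute("INSERT INTO sale_line VALUES (100, 7, 10)")

# Reassembling the full picture for reporting requires joining all the tables back.
print(db.execute("""
    SELECT c.name, p.name, s.sale_date, l.quantity
    FROM sale s
    JOIN customer  c ON c.customer_id = s.customer_id
    JOIN sale_line l ON l.sale_id     = s.sale_id
    JOIN product   p ON p.product_id  = l.product_id
""").fetchall())
```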

3150-427: The operational systems to the warehouse. Online analytical processing (OLAP) is characterized by a low rate of transactions and complex queries that involve aggregations. Response time is an effective performance measure of OLAP systems. OLAP applications are widely used for data mining . OLAP databases store aggregated, historical data in multi-dimensional schemas (usually star schemas ). OLAP systems typically have

3213-674: The operational systems, the data in the data warehouse revolves around the subjects of the enterprise. Subject orientation is not database normalization. Subject orientation can be particularly useful for decision-making. Gathering the required data by subject is what makes the warehouse subject-oriented. The data found within the data warehouse is integrated. Since it comes from several operational systems, all inconsistencies must be removed. These inconsistencies span naming conventions, measurement of variables, encoding structures, physical attributes of data, and so forth. While operational systems reflect current values as they support day-to-day operations, data warehouse data represents


3276-412: The order. This dimensional approach makes data easier to understand and speeds up data retrieval. Dimensional structures are easy for business users to understand because the structure is divided into measurements/facts and context/dimensions. Facts are related to the organization's business processes and operational system, and dimensions are the context about them (Kimball, Ralph 2008). Another advantage

3339-456: The same data may be sent in a monthly aggregated form to a data warehouse. The staging area supports efficient change detection operations against target systems. This functionality is particularly useful when the source systems do not support reliable forms of change detection, such as system-enforced timestamping, change tracking or change data capture (CDC) . Data cleansing includes identification and removal (or update) of invalid data from
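When the source offers no reliable change detection, a common staging-side technique is to hash each staged row and compare it with the hash recorded for the corresponding target row. The sketch below uses MD5 over concatenated fields and made-up customer data; the key/hash layout is an assumption, not a standard.

```python
import hashlib

def row_hash(row):
    """Hash a row's fields so changed rows can be detected cheaply."""
    return hashlib.md5("|".join(map(str, row)).encode()).hexdigest()

# Hashes recorded when the target was last loaded (keyed by business key).
target_hashes = {1: row_hash((1, "Acme", "Berlin")),
                 2: row_hash((2, "Globex", "Paris"))}

staged_rows = [(1, "Acme", "Munich"),   # changed
               (2, "Globex", "Paris"),  # unchanged
               (3, "Initech", "Oslo")]  # new

for row in staged_rows:
    key = row[0]
    if key not in target_hashes:
        print("insert", row)
    elif target_hashes[key] != row_hash(row):
        print("update", row)
    # identical hash: no action needed
```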

3402-471: The same fields are used in both models) but at the cost of usability. The technique measures information quantity in terms of information entropy and usability in terms of the Small Worlds data transformation measure. In the bottom-up approach, data marts are first created to provide reporting and analytical capabilities for specific business processes . These data marts can then be integrated to create

3465-462: The same stored data. The process of gathering, cleaning and integrating data from various sources, usually from long-term existing operational systems (usually referred to as legacy systems ), was typically in part replicated for each environment. Moreover, the operational systems were frequently reexamined as new decision support requirements emerged. Often new requirements necessitated gathering, cleaning and integrating new data from " data marts " that

3528-459: The source data from the source systems and waiting to perform intensive processing and transformation in the staging area, the ETL process exercises a great degree of control over concurrency issues during processing. The staging area can support hosting of data to be processed on independent schedules, and data that is meant to be directed to multiple targets. In some instances data might be pulled into

3591-510: The source systems. The ETL process utilizing the staging area can be used to implement business logic to identify and handle "invalid" data. Invalid data is often defined through a combination of business rules and technical limitations. Technical constraints may additionally be placed on staging area structures (such as table constraints in a relational database) to enforce data validity rules. Precalculation of aggregates, complex calculations and application of complex business logic may be done in
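Both layers can be sketched briefly: a table constraint rejects technically invalid rows, while a business rule (here, the assumption that payment amounts must be positive) routes suspect rows to an error bucket for later handling. The table and rule are illustrative only.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Technical constraints on the staging table: a unique key and a non-null amount.
db.execute("""CREATE TABLE stg_payment (
    payment_id INTEGER PRIMARY KEY,
    amount REAL NOT NULL)""")

def business_rule_ok(payment_id, amount):
    """Example business rule (an assumption): payments must be positive."""
    return amount is not None and amount > 0

valid, rejected = [], []
for row in [(1, 25.0), (2, -5.0), (3, None), (4, 10.0)]:
    (valid if business_rule_ok(*row) else rejected).append(row)

db.executemany("INSERT INTO stg_payment VALUES (?, ?)", valid)
print("loaded:", valid)       # rows passing both layers
print("rejected:", rejected)  # routed to an error bucket for later handling
```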

3654-453: The staging area at different times to be held and processed all at once. This situation might occur when enterprise processing is done across multiple time zones each night, for instance. In other cases data might be brought into the staging area to be processed at different times; or the staging area may be used to push data to multiple target systems. As an example, daily operational data might be pushed to an operational data store (ODS) while

3717-450: The staging area include the following: One of the primary functions performed by a staging area is consolidation of data from multiple source systems. In performing this function the staging area acts as a large "bucket" in which data from multiple source systems can be temporarily placed for further processing. It is common to tag data in the staging area with additional metadata indicating the source of origin and timestamps indicating when
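A minimal sketch of that tagging while consolidating rows from two hypothetical source systems ("crm" and "sales"); the extra source_system and staged_at fields are the added metadata.

```python
from datetime import datetime, timezone

def stage_rows(source_system, rows):
    """Tag each incoming row with its origin and the time it was staged."""
    staged_at = datetime.now(timezone.utc).isoformat()
    return [(*row, source_system, staged_at) for row in rows]

staging_bucket = []
staging_bucket += stage_rows("crm",   [("C-1", "Acme Corp")])
staging_bucket += stage_rows("sales", [("C-1", "ACME CORPORATION")])

for row in staging_bucket:
    print(row)  # (business key, name, source_system, staged_at)
```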

3780-544: The term file sharing is more often used in this case. Moving files within a computer system, as opposed to over a network, is called file copying. Uploading directly contrasts with downloading, where data is received over a network. In the case of users uploading files over the internet, uploading is often slower than downloading, as many internet service providers (ISPs) offer asymmetric connections, which offer more network bandwidth for downloading than uploading. To transfer something (such as data or files) from

3843-479: The user's device. Peer-to-peer (P2P) is a decentralized communications model in which each party has the same capabilities, and either party can initiate a communication session. Unlike the client–server model, in which the client makes a service request and the server fulfils the request (by sending or accepting a file transfer), the P2P network model allows each node to function as both client and server. BitTorrent


3906-421: The various problems associated with this flow, mainly the high costs associated with it. In the absence of a data warehousing architecture, an enormous amount of redundancy was required to support multiple decision support environments. In larger corporations, it was typical for multiple decision support environments to operate independently. Though each environment served different users, they often required much of

3969-431: Was tailored for ready access by users. Additionally, with the publication of The IRM Imperative (Wiley & Sons, 1991) by James M. Kerr, the idea of managing and putting a dollar value on an organization's data resources and then reporting that value as an asset on a balance sheet became popular. In the book, Kerr described a way to populate subject-area databases from data derived from transaction-driven systems to create
