DATA WAREHOUSE
Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions. A large number of organizations have found that data warehouse systems are valuable tools in today's competitive, fast-evolving world. In the last several years, many firms have spent millions of dollars building enterprise-wide data warehouses. Many people feel that, with competition mounting in every industry, data warehousing is the latest must-have marketing weapon: a way to keep customers by learning more about their needs.
“So,” you may ask, “what exactly is a data warehouse?” Data warehouses have been defined in many ways, making it difficult to formulate a rigorous definition. Loosely speaking, a data warehouse refers to a database that is maintained separately from an organization's operational databases. Data warehouse systems allow for the integration of a variety of application systems. They support information processing by providing a solid platform of consolidated, historical data for analysis.
According to W. H. Inmon, a leading architect in the construction of data warehouse systems, “a data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision making process.” This short but comprehensive definition presents the major features of a data warehouse. The four keywords, subject-oriented, integrated, time-variant, and nonvolatile, distinguish data warehouses from other data repository systems, such as relational database systems, transaction processing systems, and file systems. Let's take a closer look at each of these key features.
(1) Subject-oriented: A data warehouse is organized around major subjects, such as customer, vendor, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers. Hence, data warehouses typically provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
(2) Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and on-line transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.
(3) Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5-10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time.
(4) Nonvolatile: A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Due to this separation, a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.
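The integrated, time-variant, and nonvolatile features can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two source layouts, the field names, the gender codes, and the helper functions are hypothetical and do not come from any particular warehouse product.

```python
from datetime import date
from types import MappingProxyType

# Two hypothetical operational sources describing the same subject (customers)
# with different attribute names, encodings, and units.
crm_rows = [{"cust_name": "Ann Lee", "sex": "F", "balance_usd": 120.0}]
billing_rows = [{"CustomerName": "Bob Ray", "gender": "male", "balance_cents": 4550}]

# Integrated: one agreed encoding for an attribute that the sources encode differently.
GENDER_CODES = {"F": "female", "M": "male", "female": "female", "male": "male"}


def from_crm(row, snapshot):
    """Map a CRM record onto the warehouse naming convention."""
    return {
        "customer": row["cust_name"],
        "gender": GENDER_CODES[row["sex"]],
        "balance_usd": row["balance_usd"],
        "snapshot_date": snapshot,  # time-variant: every record carries a time element
    }


def from_billing(row, snapshot):
    """Map a billing record onto the same convention, converting cents to dollars."""
    return {
        "customer": row["CustomerName"],
        "gender": GENDER_CODES[row["gender"]],
        "balance_usd": row["balance_cents"] / 100,
        "snapshot_date": snapshot,
    }


def load_warehouse(snapshot):
    """Initial load: the first of the two operations a warehouse needs."""
    rows = [from_crm(r, snapshot) for r in crm_rows]
    rows += [from_billing(r, snapshot) for r in billing_rows]
    # Nonvolatile: after loading, the records are exposed as read-only views.
    return tuple(MappingProxyType(r) for r in rows)


warehouse = load_warehouse(date(2024, 1, 31))
```

After loading, the only remaining operation is access: reading `warehouse[1]["balance_usd"]` works, whereas an attempt to assign to a field raises a `TypeError`, which mirrors the load-then-read pattern described in (4).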
In sum, a data warehouse is a semantically consistent data store that serves as a physical implementation of a decision support data model and stores the information that an enterprise needs to make strategic decisions. A data warehouse is also often viewed as an architecture, constructed by integrating data from multiple heterogeneous sources to support structured and/or ad hoc queries, analytical reporting, and decision making.
“OK,” you now ask, “what, then, is data warehousing?”
Based on the above, we view data warehousing as the process of constructing and using data warehouses. The construction of a data warehouse requires data integration, data cleaning, and data consolidation. The utilization of a data warehouse often necessitates a collection of decision support technologies. This allows “knowledge workers” (e.g., managers, analysts, and executives) to use the warehouse to quickly and conveniently obtain an overview of the data, and to make sound decisions based on information in the warehouse. Some authors use the term “data warehousing” to refer only to the process of data warehouse construction, while the term warehouse DBMS is used to refer to the management and utilization of data warehouses. We will not make this distinction here.
“How are organizations using the information from data warehouses?” Many organizations are using this information to support business decision making activities, including:
(1) increasing customer focus, which includes the analysis of customer buying patterns (such as buying preference, buying time, budget cycles, and appetites for spending),
(2) repositioning products and managing product portfolios by comparing the performance of sales by quarter, by year, and by geographic regions, in order to fine-tune production strategies,
(3) analyzing operations and looking for sources of profit, and
(4) managing customer relationships, making environmental corrections, and managing the cost of corporate assets.
Data warehousing is also very useful from the point of view of heterogeneous database integration. Many organizations typically collect diverse kinds of data and maintain large databases from multiple, heterogeneous, autonomous, and distributed information sources. Integrating such data and providing easy and efficient access to it is highly desirable, yet challenging. Much effort has been spent in the database industry and research community towards achieving this goal.
The traditional database approach to heterogeneous database integration is to build wrappers and integrators (or mediators) on top of multiple, heterogeneous databases. A variety of data joiner and data blade products belong to this category. When a query is posed to a client site, a metadata dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved. These queries are then mapped and sent to local query processors. The results returned from the different sites are integrated into a global answer set. This query-driven approach requires complex information filtering and integration processes, and competes for resources with processing at local sources. It is inefficient and potentially expensive for frequent queries, especially for queries requiring aggregations.
Data warehousing provides an interesting alternative to the traditional approach of heterogeneous database integration described above. Rather than using a query-driven approach, data warehousing employs an update-driven approach in which information from multiple, heterogeneous sources is integrated in advance and stored in a warehouse for direct querying and analysis. Unlike on-line transaction processing databases, data warehouses do not contain the most current information. However, a data warehouse brings high performance to the integrated heterogeneous database system since data are copied, preprocessed, integrated, annotated, summarized, and restructured into one semantic data store. Furthermore, query processing in data warehouses does not interfere with the processing at local sources. Moreover, data warehouses can store and integrate historical information and support complex multidimensional queries. As a result, data warehousing has become very popular in industry.
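The contrast between the two approaches can be sketched in a few lines. This is a toy illustration under stated assumptions: the two "sites", their field names, and the function names are hypothetical, and a real mediator would translate queries through a metadata dictionary rather than through hard-coded field names.

```python
# Two hypothetical local sources, each with its own schema.
site_a = [{"region": "east", "amount": 100}, {"region": "west", "amount": 50}]
site_b = [{"area": "east", "total": 70}, {"area": "north", "total": 30}]


def query_driven_total(region):
    """Query-driven: translate the query for each site and scan the sources on every call."""
    from_a = sum(r["amount"] for r in site_a if r["region"] == region)
    from_b = sum(r["total"] for r in site_b if r["area"] == region)
    return from_a + from_b  # integrate the partial results into a global answer


def build_warehouse():
    """Update-driven: integrate and summarize the sources once, in advance."""
    summary = {}
    for r in site_a:
        summary[r["region"]] = summary.get(r["region"], 0) + r["amount"]
    for r in site_b:
        summary[r["area"]] = summary.get(r["area"], 0) + r["total"]
    return summary


warehouse = build_warehouse()


def update_driven_total(region):
    """Answer from the precomputed store without touching the local sources."""
    return warehouse.get(region, 0)
```

Both functions return the same total for a region immediately after the warehouse is built. If a new record is later appended to `site_a`, only `query_driven_total` sees it until the warehouse is refreshed, which is the sense in which a warehouse does not contain the most current information.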
1. Differences between operational database systems and data warehouses
Since most people are familiar with commercial relational database systems, it is easy to understand what a data warehouse is by comparing these two kinds of systems.
The major task of on-line operational database systems is to perform on-line transaction and query processing. These systems are called on-line transaction processing (OLTP) systems. They cover most of the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting. Data warehouse systems, on the other hand, serve users or “knowledge workers” in the role of data analysis and decision making. Such systems can organize and present data in various formats in order to accommodate the diverse needs of the different users. These systems are known as on-line analytical processing (OLAP) systems.
The major distinguishing features between OLTP and OLAP are summarized as follows.
(1) Users and system orientation: An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients, and information technology professionals. An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives, and analysts.
(2) Data contents: An OLTP system manages current data that, typically, are too detailed to be easily used for decision making. An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. These features make the data easier to use in informed decision making.
(3) Database design: An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design. An OLAP system typically adopts either a star or snowflake model, and a subject-oriented database design.
(4) View: An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historical data or data in different organizations. In contrast, an OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. OLAP systems also deal with information that originates from different organizations, integrating information from many data stores. Because of their huge volume, OLAP data are stored on multiple storage media.
(5) Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms. However, accesses to OLAP systems are mostly read-only operations (since most data warehouses store historical rather than up-to-date information), although many could be complex queries.
Other features that distinguish OLTP from OLAP systems include database size, frequency of operations, and performance metrics.
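The difference in database design and access patterns can be made concrete with a small sketch using Python's standard `sqlite3` module. The table names, columns, and sample rows are hypothetical, and a single in-memory database stands in for both kinds of system purely for brevity; in practice the operational store and the warehouse would be separate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (account_id INTEGER PRIMARY KEY, balance REAL);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    time_id INTEGER REFERENCES dim_time(time_id),
    amount REAL
);
""")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 500.0), (2, 100.0)])
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "laptop", "computers"), (2, "phone", "phones")])
conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                 [(1, 2023, 4), (2, 2024, 1)])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, 1, 1000.0), (1, 2, 1500.0), (2, 2, 700.0), (2, 1, 300.0)])
conn.commit()

# OLTP-style access: a short, atomic transaction touching a few records by key.
# The "with" block commits on success and rolls back if an exception is raised.
with conn:
    conn.execute("UPDATE account SET balance = balance - 50 WHERE account_id = 1")
    conn.execute("UPDATE account SET balance = balance + 50 WHERE account_id = 2")

# OLAP-style access: a read-only query over a star schema (one fact table joined
# to its dimension tables) that summarizes many rows at a coarser granularity.
sales_by_category_year = conn.execute("""
    SELECT p.category, t.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_time AS t ON t.time_id = f.time_id
    GROUP BY p.category, t.year
    ORDER BY p.category, t.year
""").fetchall()
```

The transaction updates two rows located by primary key, while the analytical query reads every fact row and groups it by dimension attributes. The two workloads stress a system in different ways, which is one motivation for tuning them separately.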
2. But, why have a separate data warehouse?
“Since operational databases store huge amounts of data,” you may observe, “why not perform on-line analytical processing directly on such databases instead of spending additional time and resources to construct a separate data warehouse?”
A major reason for such a separation is to help promote the high performance of both systems. An operational database is designed and tuned from known tasks and workloads, such as indexing and hashing using primary keys, searching for particular records, and optimizing “canned” queries. On the other hand, data warehouse queries are often complex. They involve the computation of large groups of data at summarized levels, and may require the use of special data organization, access, and implementation methods based on multidimensional views. Processing OLAP queries in operational databases would substantially degrade the performance of operational tasks.
Moreover, an operational database supports the concurrent processing of several transactions. Concurrency control and recovery mechanisms, such as locking and logging, are required to ensure the consistency and robustness of transactions. An OLAP query often needs read-only access of data records for summarization and aggregation. Concurrency control and recovery mechanisms, if applied for such OLAP operations, may jeopardize the execution of concurrent transactions and thus substantially reduce the throughput of an OLTP system.
Finally, the separation of operational databases from data warehouses is based on the different structures, contents, and uses of the data in these two systems. Decision support requires historical data, whereas operational databases do not typically maintain historical data. In this context, the data in operational databases, though abundant, are usually far from complete for decision making. Decision support requires consolidation (such as aggregation and summarization) of data from heterogeneous sources, resulting in high quality, cleansed and integrated data. In contrast, operational databases contain only detailed raw data, such as transactions, which need to be consolidated before analysis. Since the two systems provide quite different functionalities and require different kinds of data, it is necessary to maintain separate databases.
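As a rough illustration of what consolidation means here, the following sketch summarizes raw transaction records at several levels of granularity. The records, the tuple layout, and the `summarize` helper are hypothetical and chosen only to keep the example small.

```python
from collections import defaultdict

# Hypothetical raw transactions (date, product, amount), as an operational
# system might record them: detailed, but not yet useful for decision support.
transactions = [
    ("2024-01-03", "laptop", 1000.0),
    ("2024-01-03", "phone", 300.0),
    ("2024-01-17", "laptop", 1500.0),
    ("2024-02-02", "phone", 700.0),
]


def summarize(rows, key):
    """Aggregate amounts under the grouping chosen by the key function."""
    totals = defaultdict(float)
    for row in rows:
        totals[key(row)] += row[2]
    return dict(totals)


# The same detailed data consolidated at different levels of granularity.
by_day = summarize(transactions, key=lambda r: r[0])
by_month = summarize(transactions, key=lambda r: r[0][:7])
by_product_month = summarize(transactions, key=lambda r: (r[1], r[0][:7]))
```

A warehouse would typically compute and store such summaries ahead of time, so that analysts do not have to rescan the detailed transactions for every question.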