In computing, data deduplication is a technique for eliminating duplicate copies of repeating data. Successful implementation of the technique can improve storage utilization, which may in turn lower capital expenditure by reducing the overall amount of storage media required to meet storage capacity needs. It can also be applied to network data transfers to reduce the number of bytes that must be sent.
a hash collision occurs and additional means of verification are not used to check whether the data actually differs. Both in-line and post-process architectures may offer bit-for-bit validation of original data for guaranteed data integrity. The hash functions used include standards such as SHA-1, SHA-256, and others. The computational resource intensity of the process can be
a source-code editor that can alert the programmer to common errors. Modification often includes code refactoring (improving the structure without changing functionality) and restructuring (improving structure and functionality at the same time). Nearly every change to code will introduce new bugs or unexpected ripple effects, which require another round of fixes. Code reviews by other developers are often used to scrutinize new code added to
a trade secret. Proprietary, secret source code and algorithms are widely used for sensitive government applications such as criminal justice, which results in black-box behavior with a lack of transparency into the algorithm's methodology. The result is avoidance of public scrutiny of issues such as bias. Access to the source code (not just the object code) is essential to modifying it. Understanding existing code
a drawback of data deduplication. To improve performance, some systems utilize both weak and strong hashes. Weak hashes are much faster to calculate, but there is a greater risk of a hash collision. Systems that utilize weak hashes will subsequently calculate a strong hash and use it as the determining factor of whether the data is actually the same. Note that the system overhead associated with calculating and looking up hash values
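As a rough illustration of this two-tier scheme, here is a minimal Python sketch (not from the original article) that uses CRC32 as the weak hash and SHA-256 as the strong hash; the index structure and function names are illustrative assumptions:

```python
import hashlib
import zlib

# index maps weak hash -> {strong hash: chunk}; both tiers are illustrative
index: dict[int, dict[str, bytes]] = {}

def weak_hash(chunk: bytes) -> int:
    return zlib.crc32(chunk)                   # fast but collision-prone

def strong_hash(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()   # slow but trustworthy

def is_duplicate(chunk: bytes) -> bool:
    bucket = index.get(weak_hash(chunk))
    if bucket is None:                  # no weak match: certainly new data,
        return False                    # and the strong hash is never computed
    return strong_hash(chunk) in bucket # weak match: confirm before trusting

def record(chunk: bytes) -> None:
    index.setdefault(weak_hash(chunk), {})[strong_hash(chunk)] = chunk
```

Only chunks that pass the cheap weak-hash filter pay the cost of the strong hash, which is the point of the hybrid design.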
a minor increase in storage space requirements. One method for deduplicating data relies on the use of cryptographic hash functions to identify duplicate segments of data. If two different pieces of information generate the same hash value, this is known as a collision. The probability of a collision depends mainly on the hash length (see birthday attack). Thus, the concern arises that data corruption can occur if
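To make the dependence on hash length concrete, the standard birthday-bound approximation p ≈ n²/2^(b+1) can be evaluated directly. This is a generic back-of-the-envelope sketch, not a figure from the article:

```python
# Approximate birthday-bound probability that at least two of n random
# chunks collide under a b-bit hash: p ≈ n^2 / 2^(b+1).
def collision_probability(n_chunks: int, hash_bits: int) -> float:
    return (n_chunks ** 2) / 2 ** (hash_bits + 1)

# e.g. one billion chunks under a 256-bit hash such as SHA-256:
p = collision_probability(10 ** 9, 256)
print(f"{p:.3e}")  # about 4.3e-60, far below hardware error rates
```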
364-666: A process for the collection of data in electronic format European Democratic Alliance ( Rassemblement des Démocrates Européens ), a political group in the European Parliament 1984–1995. European Democratic and Social Rally group , formerly the Democratic and European Rally group ( groupe du Rassemblement démocratique et européen ), a parliamentary group in the French Senate Real Driving Emissions, see European emission standards . Topics referred to by
a project. The purpose of this phase is often to verify that the code meets style and maintainability standards and that it is a correct implementation of the software design. According to some estimates, code review dramatically reduces the number of bugs persisting after software testing is complete. Along with software testing that works by executing the code, static program analysis uses automated tools to detect problems with
468-487: A server connected to a SAN/NAS, The SAN/NAS would be a target for the server (target deduplication). The server is not aware of any deduplication, the server is also the point of data generation. A second example would be backup. Generally this will be a backup store such as a data repository or a virtual tape library . One of the most common forms of data deduplication implementations works by comparing chunks of data to detect duplicates. For that to happen, each chunk of data
520-536: A single disk. In the case of data backups, which routinely are performed to protect against data loss, most data in a given backup remain unchanged from the previous backup. Common backup systems try to exploit this by omitting (or hard linking ) files that haven't changed or storing differences between files. Neither approach captures all redundancies, however. Hard-linking does not help with large files that have only changed in small ways, such as an email database; differences only find redundancies in adjacent versions of
572-498: A single file (consider a section that was deleted and later added in again, or a logo image included in many documents). In-line network data deduplication is used to reduce the number of bytes that must be transferred between endpoints, which can reduce the amount of bandwidth required. See WAN optimization for more information. Virtual servers and virtual desktops benefit from deduplication because it allows nominally separate system files for each virtual machine to be coalesced into
a single storage space. At the same time, if a given virtual machine customizes a file, deduplication will not change the files on the other virtual machines—something that alternatives like hard links or shared disks do not offer. Backing up or making duplicate copies of virtual environments is similarly improved. Deduplication may occur "in-line", as data is flowing, or "post-process" after it has been written. With post-process deduplication, new data
676-406: A specific platform, source code can be ported to a different machine and recompiled there. For the same source code, object code can vary significantly—not only based on the machine for which it is compiled, but also based on performance optimization from the compiler. Most programs do not contain all the resources needed to run them and rely on external libraries . Part of the compiler's function
728-447: Is a simple variant of data deduplication. While data deduplication may work at a segment or sub-block level, single instance storage works at the object level, eliminating redundant copies of objects such as entire files or email messages. Single-instance storage can be used alongside (or layered upon) other data duplication or data compression methods to improve performance in exchange for an increase in complexity and for (in some cases)
780-504: Is according to where they occur. Deduplication occurring close to where data is created, is referred to as "source deduplication". When it occurs near where the data is stored, it is called "target deduplication". Source deduplication ensures that data on the data source is deduplicated. This generally takes place directly within a file system. The file system will periodically scan new files creating hashes and compare them to hashes of existing files. When files with same hashes are found then
832-442: Is an overarching term that can refer to a code's correct and efficient behavior, its reusability and portability , or the ease of modification. It is usually more cost-effective to build quality into the product from the beginning rather than try to add it later in the development process. Higher quality code will reduce lifetime cost to both suppliers and customers as it is more reliable and easier to maintain . Maintainability
884-431: Is assigned an identification, calculated by the software, typically using cryptographic hash functions. In many implementations, the assumption is made that if the identification is identical, the data is identical, even though this cannot be true in all cases due to the pigeonhole principle ; other implementations do not assume that two blocks of data with the same identifier are identical, but actually verify that data with
936-414: Is different from Wikidata All article disambiguation pages All disambiguation pages Data deduplication The deduplication process requires comparison of data 'chunks' (also known as 'byte patterns') which are unique, contiguous blocks of data. These chunks are identified and stored during a process of analysis, and compared to other chunks within existing data. Whenever a match occurs,
988-504: Is done by, for example, storing information in variables so that they don't have to be written out individually but can be changed all at once at a central referenced location. Examples are CSS classes and named references in MediaWiki . Storage-based data deduplication reduces the amount of storage needed for a given set of files. It is most effective in applications where many copies of very similar or even identical data are stored on
1040-488: Is first stored on the storage device and then a process at a later time will analyze the data looking for duplication. The benefit is that there is no need to wait for the hash calculations and lookup to be completed before storing the data, thereby ensuring that store performance is not degraded. Implementations offering policy-based operation can give users the ability to defer optimization on "active" files, or to process files based on type and location. One potential drawback
1092-452: Is frequently cited as a contributing factor to the maturation of their programming skills. Some people consider source code an expressive artistic medium . Source code often contains comments —blocks of text marked for the compiler to ignore. This content is not part of the program logic, but is instead intended to help readers understand the program. Companies often keep the source code confidential in order to hide algorithms considered
#17328770357621144-464: Is implemented in some filesystems such as in ZFS or Write Anywhere File Layout and in different disk arrays models. It is a service available on both NTFS and ReFS on Windows servers. Source code In computing , source code , or simply code or source , is a plain text computer program written in a programming language . A programmer writes the human readable source code to control
1196-449: Is named analogously to hard links , which work at the inode level, and symbolic links that work at the filename level. The individual entries have a copy-on-write behavior that is non-aliasing, i.e. changing one copy afterwards will not affect other copies. Microsoft's ReFS also supports this operation. Target deduplication is the process of removing duplicates when the data was not generated at that location. Example of this would be
1248-496: Is necessary to understand how it works and before modifying it. The rate of understanding depends both on the code base as well as the skill of the programmer. Experienced programmers have an easier time understanding what the code does at a high level. Software visualization is sometimes used to speed up this process. Many software programmers use an integrated development environment (IDE) to improve their productivity. IDEs typically have several features built in, including
1300-446: Is primarily a function of the deduplication workflow. The reconstitution of files does not require this processing and any incremental performance penalty associated with re-assembly of data chunks is unlikely to impact application performance. Another concern is the interaction of compression and encryption. The goal of encryption is to eliminate any discernible patterns in the data. Thus encrypted data cannot be deduplicated, even though
1352-476: Is that duplicate data may be unnecessarily stored for a short time, which can be problematic if the system is nearing full capacity. Alternatively, deduplication hash calculations can be done in-line: synchronized as data enters the target device. If the storage system identifies a block which it has already stored, only a reference to the existing block is stored, rather than the whole new block. The advantage of in-line deduplication over post-process deduplication
1404-533: Is that it requires less storage and network traffic, since duplicate data is never stored or transferred. On the negative side, hash calculations may be computationally expensive, thereby reducing the storage throughput. However, certain vendors with in-line deduplication have demonstrated equipment which performs in-line deduplication at high rates. Post-process and in-line deduplication methods are often heavily debated. The SNIA Dictionary identifies two methods: Another way to classify data deduplication methods
1456-617: Is that many software engineering courses do not emphasize it. Development engineers who know that they will not be responsible for maintaining the software do not have an incentive to build in maintainability. The situation varies worldwide, but in the United States before 1974, software and its source code was not copyrightable and therefore always public domain software . In 1974, the US Commission on New Technological Uses of Copyrighted Works (CONTU) decided that "computer programs, to
1508-477: Is the quality of software enabling it to be easily modified without breaking existing functionality. Following coding conventions such as using clear function and variable names that correspond to their purpose makes maintenance easier. Use of conditional loop statements only if the code could execute more than once, and eliminating code that will never execute can also increase understandability. Many software development organizations neglect maintainability during
1560-418: Is to link these files in such a way that the program can be executed by the hardware. Software developers often use configuration management to track changes to source code files ( version control ). The configuration management system also keeps track of which object code file corresponds to which version of the source code file. The number of lines of source code is often used as a metric when evaluating
1612-458: The compilers needed to translate the source code automatically into machine code that can be directly executed on the computer hardware . Source code is the form of code that is modified directly by humans, typically in a high-level programming language. Object code can be directly executed by the machine and is generated automatically from the source code, often via an intermediate step, assembly language . While object code will only work on
1664-414: The attachment is actually stored; the subsequent instances are referenced back to the saved copy for deduplication ratio of roughly 100 to 1. Deduplication is often paired with data compression for additional storage saving: Deduplication is first used to eliminate large chunks of repetitive data, and compression is then used to efficiently encode each of the stored chunks. In computer code , deduplication
1716-418: The backups being bigger than the source data. Source deduplication can be declared explicitly for copying operations, as no calculation is needed to know that the copied data is in need of deduplication. This leads to a new form of "linking" on file systems called the reflink (Linux) or clonefile (MacOS), where one or more inodes (file information entries) are made to share some or all of their data. It
1768-469: The behavior of a computer . Since a computer, at base, only understands machine code , source code must be translated before a computer can execute it. The translation process can be implemented three ways. Source code can be converted into machine code by a compiler or an assembler . The resulting executable is machine code ready for the computer. Alternatively, source code can be executed without conversion via an interpreter . An interpreter loads
1820-400: The details of the hardware, instead being designed to express algorithms that could be understood more easily by humans. As instructions distinct from the underlying computer hardware , software is therefore relatively recent, dating to these early high-level programming languages such as Fortran , Lisp , and Cobol . The invention of high-level programming languages was simultaneous with
1872-426: The development phase, even though it will increase long-term costs. Technical debt is incurred when programmers, often out of laziness or urgency to meet a deadline, choose quick and dirty solutions rather than build maintainability into their code. A common cause is underestimates in software development effort estimation , leading to insufficient resources allocated to development. A challenge with maintainability
1924-403: The duplicate data. In primary storage systems, this overhead may impact performance. The second reason why deduplication is applied to secondary data, is that secondary data tends to have more duplicate data. Backup application in particular commonly generate significant portions of duplicate data over time. Data deduplication has been deployed successfully with primary storage in some cases where
1976-406: The extent that they embody an author's original creation, are proper subject matter of copyright". Proprietary software is rarely distributed as source code. Although the term open-source software literally refers to public access to the source code , open-source software has additional requirements: free redistribution, permission to modify the source code and release derivative works under
2028-464: The file copy is removed and the new file points to the old file. Unlike hard links however, duplicated files are considered to be separate entities and if one of the duplicated files is later modified, then using a system called copy-on-write a copy of that changed file or block is created. The deduplication process is transparent to the users and backup applications. Backing up a deduplicated file system will often cause duplication to occur resulting in
2080-475: The instructions can be carried out. After being compiled, the program can be saved as an object file and the loader (part of the operating system) can take this saved file and execute it as a process on the computer hardware. Some programming languages use an interpreter instead of a compiler. An interpreter converts the program into machine code at run time , which makes them 10 to 100 times slower than compiled programming languages. Software quality
2132-483: The intent of deduplication is to inspect large volumes of data and identify large sections – such as entire files or large sections of files – that are identical, and replace them with a shared copy. For example, a typical email system might contain 100 instances of the same 1 MB ( megabyte ) file attachment. Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB storage space. With data deduplication, only one instance of
2184-426: The processor). Machine language was difficult to debug and was not portable between different computer systems. Initially, hardware resources were scarce and expensive, while human resources were cheaper. As programs grew more complex, programmer productivity became a bottleneck. This led to the introduction of high-level programming languages such as Fortran in the mid-1950s. These languages abstracted away
2236-465: The productivity of computer programmers, the economic value of a code base, effort estimation for projects in development, and the ongoing cost of software maintenance after release. Source code is also used to communicate algorithms between people – e.g., code snippets online or in books. Computer programmers may find it helpful to review existing source code to learn about programming techniques. The sharing of source code between developers
2288-413: The redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency is dependent on the chunk size), the amount of data that must be stored or transferred can be greatly reduced. A related technique is single-instance (data) storage , which replaces multiple copies of content at
2340-400: The referenced data chunk. The deduplication process is intended to be transparent to end users and applications. Commercial deduplication implementations differ by their chunking methods and architectures. To date, data deduplication has predominantly been used with secondary storage systems. The reasons for this are two-fold: First, data deduplication requires overhead to discover and remove
2392-427: The same identification is identical. If the software either assumes that a given identification already exists in the deduplication namespace or actually verifies the identity of the two blocks of data, depending on the implementation, then it will replace that duplicate chunk with a link. Once the data has been deduplicated, upon read back of the file, wherever a link is found, the system simply replaces that link with
2444-403: The same term [REDACTED] This disambiguation page lists articles associated with the title RDE . If an internal link led you here, you may wish to change the link to point directly to the intended article. Retrieved from " https://en.wikipedia.org/w/index.php?title=RDE&oldid=1190064145 " Category : Disambiguation pages Hidden categories: Short description
2496-432: The source code into memory. It simultaneously translates and executes each statement . A method that combines compilation and interpretation is to first produce bytecode . Bytecode is an intermediate representation of source code that is quickly interpreted. The first programmable computers, which appeared at the end of the 1940s, were programmed in machine language (simple instructions that could be directly executed by
2548-414: The source code. Many IDEs support code analysis tools, which might provide metrics on the clarity and maintainability of the code. Debuggers are tools that often enable programmers to step through execution while keeping track of which source code corresponds to each change of state. Source code files in a high-level programming language must go through a stage of preprocessing into machine code before
2600-452: The system design does not require significant overhead, or impact performance. Single-instance storage (SIS) is a system's ability to take multiple copies of content objects and replace them by a single shared copy. It is a means to eliminate data duplication and to increase efficiency. SIS is frequently implemented in file systems , email server software, data backup , and other storage-related computer software. Single-instance storage
2652-405: The underlying data may be redundant. Although not a shortcoming of data deduplication, there have been data breaches when insufficient security and access validation procedures are used with large repositories of deduplicated data. In some systems, as typical with cloud storage, an attacker can retrieve data owned by others by knowing or guessing the hash value of the desired data. Deduplication
#17328770357622704-465: The whole-file level with a single shared copy. While possible to combine this with other forms of data compression and deduplication, it is distinct from newer approaches to data deduplication (which can operate at the segment or sub-block level). Deduplication is different from data compression algorithms, such as LZ77 and LZ78 . Whereas compression algorithms identify redundant data inside individual files and encodes this redundant data more efficiently,