FamilySearch GEDCOM, or simply GEDCOM (/ˈdʒɛdkɒm/ JED-kom, an acronym of Genealogical Data Communication), is an open file format and the de facto standard specification for storing genealogical data. It was developed by The Church of Jesus Christ of Latter-day Saints (LDS, also known as the Mormon Church), the operators of FamilySearch, to aid in the research and sharing of genealogical information. A common usage is as a standard format for the backup and transfer of family tree data between different genealogy software and websites, most of which support importing from and exporting to GEDCOM format.
GEDCOM is defined as a plain text file, using UTF-8 encoding as of version 7.0. This file contains genealogical information about individuals such as names, events, and relationships; metadata links these records together. GEDCOM 7.0, released in 2021, is the most recent version of the GEDCOM specification as of July 2024. However, its predecessor, GEDCOM 5.5.1, remains the industry's format standard for
a Private Use Area. In either approach, the byte value is encoded in the low eight bits of the output code point. These encodings are needed if invalid UTF-8 is to survive translation to and then back from the UTF-16 used internally by Python, and since Unix filenames can contain invalid UTF-8, it is necessary for this to work. The official name for the encoding is UTF-8, the spelling used in all Unicode Consortium documents. The hyphen-minus
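As a concrete illustration of the surrogate-escape mechanism described above, the following minimal Python sketch (the byte values are hypothetical) round-trips invalid UTF-8 through a string and back without loss:

```python
# PEP 383 ("surrogateescape"): error bytes become lone surrogates U+DC80-U+DCFF.
raw = b"abc\xff\xfe"  # not valid UTF-8: 0xFF and 0xFE can never appear

text = raw.decode("utf-8", errors="surrogateescape")
print(repr(text))  # 'abc\udcff\udcfe'

# Encoding with the same handler restores the original bytes exactly.
assert text.encode("utf-8", errors="surrogateescape") == raw
```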
a byte order mark or escape sequences; compressing schemes try to minimize the number of bytes used per code unit (such as SCSU and BOCU). Although UTF-32BE and UTF-32LE are simpler CESes, most systems working with Unicode use either UTF-8, which is backward compatible with fixed-length ASCII and maps Unicode code points to variable-length sequences of octets, or UTF-16BE, which
a string of the letters "ab̲c𐐀"—that is, a string containing a Unicode combining character (U+0332 ̲ COMBINING LOW LINE) as well as a supplementary character (U+10400 𐐀 DESERET CAPITAL LETTER LONG I). This string has several Unicode representations which are logically equivalent, yet each is suited to a different set of circumstances or range of requirements. Note in particular that 𐐀
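One way to see those logically equivalent representations is to encode the same string under several Unicode encoding forms; a short Python sketch:

```python
s = "ab\u0332c\U00010400"  # "ab̲c𐐀": combining low line plus Deseret long I

for codec in ("utf-8", "utf-16-be", "utf-32-be"):
    encoded = s.encode(codec)
    print(f"{codec}: {len(encoded):2d} bytes -> {encoded.hex(' ')}")
# U+10400 is four bytes in UTF-8, a surrogate pair in UTF-16,
# and a single 32-bit unit in UTF-32.
```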
a BOM (a change from Windows 7 Notepad), bringing it into line with most other text editors. Some system files on Windows 11 require UTF-8 with no requirement for a BOM, and almost all files on macOS and Linux are required to be UTF-8 without a BOM. Programming languages that default to UTF-8 for I/O include Ruby 3.0, R 4.2.2, Raku and Java 18. Although
a GEDCOM file begins with a level number; all top-level records (HEAD, TRLR, SUBN, and each INDI, FAM, OBJE, NOTE, REPO, SOUR, and SUBM) begin with a line at level 0, while other level numbers are positive integers. Although it is possible to write a GEDCOM file by hand, the format was designed to be used with software and thus is not especially human-friendly (see the parsing sketch below). A GEDCOM validator that can be used to validate
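To make the level-number structure concrete, here is a minimal Python parsing sketch (the function name and the simplified grammar are assumptions; real GEDCOM lines may carry an optional cross-reference ID between the level and the tag, as handled below):

```python
def parse_line(line: str):
    """Split a GEDCOM line into (level, xref, tag, value)."""
    parts = line.rstrip("\r\n").split(" ", 2)
    level = int(parts[0])
    if len(parts) > 1 and parts[1].startswith("@"):  # optional @XREF@ ID
        xref = parts[1]
        tag, _, value = (parts[2] if len(parts) > 2 else "").partition(" ")
    else:
        xref = None
        tag = parts[1] if len(parts) > 1 else ""
        value = parts[2] if len(parts) > 2 else ""
    return level, xref, tag, value

assert parse_line("0 @I1@ INDI") == (0, "@I1@", "INDI", "")
assert parse_line("1 NAME John /Smith/") == (1, None, "NAME", "John /Smith/")
```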
a byte stream encoding of its 32-bit code points. This encoding was not satisfactory on performance grounds, among other problems, and the biggest problem was probably that it did not have a clear separation between ASCII and non-ASCII: new UTF-1 tools would be backward compatible with ASCII-encoded text, but UTF-1-encoded text could confuse existing code expecting ASCII (or extended ASCII), because it could contain continuation bytes in
a byte with the high bit set cannot be alone; and in a truly random string a byte with a high bit set has only a 1⁄15 chance of starting a valid UTF-8 character. This has the (possibly unintended) consequence of making it easy to detect if a legacy text encoding is accidentally used instead of UTF-8, making conversion of a system to UTF-8 easier and avoiding the need to require a byte order mark or any other metadata. Since RFC 3629 (November 2003),
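That detectability is easy to exploit in practice; the following Python sketch is a heuristic check (it confirms valid UTF-8, not the actual source encoding):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("café".encode("utf-8")))    # True
print(looks_like_utf8("café".encode("latin-1")))  # False: lone 0xE9 byte
```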
a character encoding are known as code points and collectively comprise a code space, a code page, or character map. Early character codes associated with the optical or electrical telegraph could only represent a subset of the characters used in written languages, sometimes restricted to upper-case letters, numerals and some punctuation only. The advent of digital computer systems allows more elaborate encodings (such as Unicode) to support hundreds of written languages. The most popular character encoding on
a future version of Python is planned to store strings as UTF-8 by default. Modern versions of Microsoft Visual Studio use UTF-8 internally. Microsoft's SQL Server 2019 added support for UTF-8, and using it results in a 35% speed increase and a "nearly 50% reduction in storage requirements." Java internally uses Modified UTF-8 (MUTF-8), in which the null character U+0000 uses the two-byte overlong encoding 0xC0, 0x80, instead of just 0x00. Modified UTF-8 strings never contain any actual null bytes but can contain all Unicode code points including U+0000, which allows such strings (with
a null byte appended) to be processed by traditional null-terminated string functions. Java reads and writes normal UTF-8 to files and streams, but it uses Modified UTF-8 for object serialization, for the Java Native Interface, and for embedding constant strings in class files. The dex format defined by Dalvik also uses the same modified UTF-8 to represent string values. Tcl also uses
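A minimal Python sketch of just the null-byte aspect of Modified UTF-8 (it deliberately ignores MUTF-8's CESU-8-style treatment of supplementary characters, so it is illustrative only):

```python
def to_mutf8_nulls(s: str) -> bytes:
    """Encode so that U+0000 becomes the overlong pair 0xC0 0x80."""
    return s.encode("utf-8").replace(b"\x00", b"\xc0\x80")

def from_mutf8_nulls(b: bytes) -> str:
    # Safe because a conformant UTF-8 encoder never emits the byte 0xC0.
    return b.replace(b"\xc0\x80", b"\x00").decode("utf-8")

encoded = to_mutf8_nulls("a\x00b")
assert b"\x00" not in encoded          # no real null bytes in the output
assert from_mutf8_nulls(encoded) == "a\x00b"
```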
a number of problems existed and that "The most commonly found fault leading to data loss was the failure to read the NOTE tag at all the possible levels at which it may appear." In 2005, the Genealogical Software Report Card was evaluated (by Bill Mumford, who participated in the original GEDCOM TestBook Project) and included testing the GEDCOM 5.5 standard using the Gedcheck program. To assist with adoption of GEDCOM 7.0, validation tools now exist for that standard as well. The following
a particular individual never married. GEDCOM 7.0 was the first version to use semantic versioning, and is the most recent minor version of the specification. As of July 2024, the next planned minor release is v7.1, which is under development. A GEDCOM file can contain information on events such as births, deaths, census records, ship's records, marriages, etc.; a rule of thumb is that an event
a particular sequence of bits. Instead, characters would first be mapped to a universal intermediate representation in the form of abstract numbers called code points. Code points would then be represented in a variety of ways and with various default numbers of bits per character (code units) depending on context. To encode code points higher than the length of the code unit, such as above 256 for eight-bit units,
a person and the order of the children within a relationship (FAM) can be lost. In many cases the sequence of events can be derived from the associated dates. But dates are not always known, in particular when dealing with data from centuries ago. For example, consider the case where a person has had two relationships, both with unknown dates, but from descriptions it is known that the second one is indeed
a person, so names can be stored in multiple languages, although there is no standardized way to indicate which instance is in which language. Finally, in version 5.5.1, the NAME field also supports a phonetic variation (FONE) and a romanized variation (ROMN) of the name. In February 2012, at the RootsTech 2012 conference, FamilySearch outlined a major new project around genealogical standards called GEDCOM X, and invited collaboration. It includes software developed under
a process known as transcoding. Some of these are cited below. The most used character encoding on the web is UTF-8, used in 98.2% of surveyed websites as of May 2024. In application programs and operating system tasks, both UTF-8 and UTF-16 are popular options. UTF-8 is a character encoding standard used for electronic communication. Defined by
a reader start anywhere and immediately detect character boundaries, at the cost of being somewhat less bit-efficient than the previous proposal. It also abandoned the use of biases that prevented overlong encodings. Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. In the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout, and then communicated their success back to X/Open, which accepted it as
a row in the above table to encode a code point less than "First code point" (thus using more bytes than necessary) is termed an overlong encoding. These are a security problem because they allow the same code point to be encoded in multiple ways. Overlong encodings (of ../ for example) have been used to bypass security validations in high-profile products including Microsoft's IIS web server and Apache's Tomcat servlet container. Overlong encodings should therefore be considered an error and never decoded. Modified UTF-8 allows an overlong encoding of U+0000. The chart below gives
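For instance, the two-byte sequence 0xC0 0xAF is an overlong encoding of "/" (U+002F), whose only valid encoding is the single byte 0x2F; a conformant decoder such as Python's rejects it:

```python
overlong = b"\xc0\xaf"  # overlong two-byte encoding of "/" (U+002F)
try:
    overlong.decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc)  # overlong forms must be treated as errors

assert b"\x2f".decode("utf-8") == "/"  # the one valid, single-byte encoding
```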
a single glyph. The former simplifies the text-handling system, but the latter allows any letter/diacritic combination to be used in text. Ligatures pose similar problems. Exactly how to handle glyph variants is a choice that must be made when constructing a particular character encoding. Some writing systems, such as Arabic and Hebrew, need to accommodate things like graphemes that are joined in different ways in different contexts, but represent
a single character per code unit. However, due to the emergence of more sophisticated character encodings, the distinction between these terms has become important. "Code page" is a historical name for a coded character set. Originally, a code page referred to a specific page number in the IBM standard character set manual, which would define a particular character encoding. Other vendors, including Microsoft, SAP, and Oracle Corporation, also published their own sets of code pages;
a stream of octets (bytes). The purpose of this decomposition is to establish a universal set of characters that can be encoded in a variety of ways. To describe this model precisely, Unicode uses its own terminology: An abstract character repertoire (ACR) is the full set of abstract characters that a system supports. Unicode has an open repertoire, meaning that new characters will be added to
a well-defined and extensible encoding system, has replaced most earlier character encodings, but the path of code development to the present is fairly well known. The Baudot code, a five-bit encoding, was created by Émile Baudot in 1870, patented in 1874, modified by Donald Murray in 1901, and standardized by CCITT as International Telegraph Alphabet No. 2 (ITA2) in 1930. The name "baudot" has been erroneously applied to ITA2 and its many variants. ITA2 suffered from many shortcomings and
is backward compatible with fixed-length UCS-2BE and maps Unicode code points to variable-length sequences of 16-bit words. See comparison of Unicode encodings for a detailed discussion. Finally, there may be a higher-level protocol which supplies additional information to select the particular variant of a Unicode character, particularly where there are regional variants that have been 'unified' in Unicode as
is a sample GEDCOM file. The header (HEAD) includes the source program and version (Personal Ancestral File, 5.0), the GEDCOM version (5.5), the character encoding (ANSEL), and a link to information about the submitter of the file. The individual records (INDI) define John Smith (ID I1), Elizabeth Stansfield (ID I2), and James Smith (ID I3). The family record (FAM) links the husband (HUSB), wife (WIFE), and child (CHIL) by their ID numbers. The current version of
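The sample file itself does not survive in this text; a minimal reconstruction consistent with the description above (the submitter record and the exact sub-tags are assumed) might look like this:

```
0 HEAD
1 SOUR Personal Ancestral File
2 VERS 5.0
1 GEDC
2 VERS 5.5
1 CHAR ANSEL
1 SUBM @U1@
0 @I1@ INDI
1 NAME John /Smith/
1 FAMS @F1@
0 @I2@ INDI
1 NAME Elizabeth /Stansfield/
1 FAMS @F1@
0 @I3@ INDI
1 NAME James /Smith/
1 FAMC @F1@
0 @F1@ FAM
1 HUSB @I1@
1 WIFE @I2@
1 CHIL @I3@
0 @U1@ SUBM
1 NAME Submitter
0 TRLR
```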
is an XML-based open format created by the open source genealogy project Gramps and used also by PhpGedView. The Family History Information Standards Organisation was established in 2012 with the aim of developing international standards for family history and genealogical information. One of the standards the organization proposed was Extended Legacy Format (ELF), compatible with GEDCOM 5.5(.1) but including an extensibility mechanism. The organization requested public comment on
is assumed to be UTF-8 to be losslessly transformed to UTF-16 or UTF-32, by translating the 128 possible error bytes to reserved code points, and transforming those code points back to error bytes to output UTF-8. The most common approach is to translate the codes to U+DC80...U+DCFF, which are low (trailing) surrogate values and thus "invalid" UTF-16, as used by Python's PEP 383 (or "surrogateescape") approach. Another encoding called MirBSD OPTU-8/16 converts them to U+EF80...U+EFFF in
is called CESU-8. If the Unicode byte order mark U+FEFF is at the start of a UTF-8 file, the first three bytes will be 0xEF, 0xBB, 0xBF. The Unicode Standard neither requires nor recommends the use of the BOM for UTF-8, but warns that it may be encountered at the start of a file transcoded from another encoding. While ASCII text encoded using UTF-8 is backward compatible with ASCII, this
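Those three BOM bytes are easy to verify in Python, whose "utf-8-sig" codec strips a leading BOM when present (the sample bytes are hypothetical):

```python
assert "\ufeff".encode("utf-8") == b"\xef\xbb\xbf"  # UTF-8 form of the BOM

data = b"\xef\xbb\xbfhello"  # hypothetical file contents starting with a BOM
print(repr(data.decode("utf-8")))      # '\ufeffhello' – BOM kept as a character
print(repr(data.decode("utf-8-sig")))  # 'hello' – leading BOM stripped
```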
is defined by a CEF. A character encoding scheme (CES) is the mapping of code units to a sequence of octets to facilitate storage on an octet-based file system or transmission over an octet-based network. Simple character encoding schemes include UTF-8, UTF-16BE, UTF-32BE, UTF-16LE, and UTF-32LE; compound character encoding schemes, such as UTF-16, UTF-32 and ISO/IEC 2022, switch between several simple schemes by using
is defined by the encoding. Thus, the number of code units required to represent a code point depends on the encoding. Exactly what constitutes a character varies between character encodings. For example, for letters with diacritics there are two distinct approaches that can be taken to encode them: they can be encoded either as a single unified character (known as a precomposed character), or as separate characters that combine into
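The difference between the two approaches is directly visible in Python, where Unicode normalization converts one form into the other:

```python
import unicodedata

precomposed = "\u00e9"   # 'é' as a single precomposed character
combining = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

print(precomposed == combining)  # False: different code point sequences
assert unicodedata.normalize("NFC", combining) == precomposed
assert unicodedata.normalize("NFD", precomposed) == combining
```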
is dominant for all countries/languages on the internet, is used in most standards, often as the only allowed encoding, and is supported by all modern operating systems and programming languages. The International Organization for Standardization (ISO) set out to compose a universal multi-byte character set in 1989. The draft ISO 10646 standard contained a non-required annex called UTF-1 that provided
is done; for this you can use utf8-c8". That UTF-8 Clean-8 variant, implemented by Raku, is an encoder/decoder that preserves bytes as is (even illegal UTF-8 sequences) and allows for Normal Form Grapheme synthetics. Version 3 of the Python programming language treats each byte of an invalid UTF-8 bytestream as an error (see also the changes with the new UTF-8 mode in Python 3.7); this gives 128 different possible errors. Extensions have been created to allow any byte sequence that
is either one continuation byte, or ends at the first byte that is disallowed, so E1,A0,20 is a two-byte error followed by a space. This means an error is no more than three bytes long and never contains the start of a valid character, and there are 21,952 different possible errors. Technically this makes UTF-8 no longer a prefix code (you have to read one byte past some errors to figure out they are an error), but searching still works if
is event based, it is still a model built on assumed reality rather than evidence. Event GEDCOM was more flexible, as it allowed some separation between believed events and the participants. However, Event GEDCOM was not widely adopted by other developers due to its semantic differences. With Roots and Ultimate Family Tree no longer available, very few people today are using Event GEDCOM. Gramps XML
is not true when Unicode Standard recommendations are ignored and a BOM is added. A BOM can confuse software that isn't prepared for it but can otherwise accept UTF-8, e.g. programming languages that permit non-ASCII bytes in string literals but not at the start of the file. Nevertheless, there was and still is software that always inserts a BOM when writing UTF-8, and refuses to correctly interpret UTF-8 unless
is preferred, usually in the larger context of locales. IBM's Character Data Representation Architecture (CDRA) designates entities with coded character set identifiers (CCSIDs), each of which is variously called a "charset", "character set", "code page", or "CHARMAP". The code unit size is equivalent to the bit measurement for the particular encoding. A code point is represented by a sequence of code units. The mapping
is represented with either one 32-bit value (UTF-32), two 16-bit values (UTF-16), or four 8-bit values (UTF-8). Although each of those forms uses the same total number of bits (32) to represent the glyph, it is not obvious how the actual numeric byte values are related. As a result of having many character encoding methods in use (and the need for backward compatibility with archived data), many computer programs have been developed to translate data between character encoding schemes,
is something that took place at a specific time, at a specific place (even if time and place are not known). GEDCOM files can also contain attributes such as physical description, occupation, and total number of children; unlike events, attributes generally cannot be associated with a specific time or place. The GEDCOM specification requires that each event or attribute is associated with exactly one individual or family. This causes redundancy for events such as census records, where
is the deliberate de facto common denominator. Despite version 5.5 of the GEDCOM standard first being published in 1996, many genealogical software suppliers have never fully supported the multilingual Unicode text (in place of the ANSEL character set) introduced with that version of the specification. Uniform use of Unicode would allow the use of international character sets. An example
is the storage of East Asian names in their original Chinese, Japanese and Korean (CJK) characters, without which they could be ambiguous and of little use for genealogical or historical research. PAF 5.2 is an example of software that uses UTF-8 as its internal character set and can output a UTF-8 GEDCOM. GEDCOM 7.0 requires UTF-8 encoding throughout, and resolves other long-standing issues with GEDCOM 5.5.1. Multimedia support in
the Apache open source license. It includes data formats that facilitate basing family trees on sources and records (both physical and digital artifacts), support for sharing and linking data online, and an API. In August 2012, FamilySearch employee and GEDCOM X project leader Ryan Heaton dropped the claim that GEDCOM X is the new industry standard, and repositioned GEDCOM X as another FamilySearch open source project. After
the Unicode Standard, the name is derived from Unicode Transformation Format – 8-bit. Almost every webpage is stored in UTF-8. UTF-8 is capable of encoding all 1,112,064 valid Unicode scalar values using a variable-width encoding of one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. It
the World Wide Web is UTF-8, which is used in 98.2% of surveyed websites as of May 2024. In application programs and operating system tasks, both UTF-8 and UTF-16 are popular options. The history of character codes illustrates the evolving need for machine-mediated character-based symbolic information over a distance, using once-novel electrical means. The earliest codes were based upon manual and hand-written encoding and cyphering systems, such as Bacon's cipher, Braille, international maritime signal flags, and
the replacement character "�" (U+FFFD) and continue decoding. Some decoders consider the sequence E1,A0,20 (a truncated 3-byte code followed by a space) to be a single error. This is not a good idea, as a search for a space character would find the one hidden in the error. Since Unicode 6 (October 2010) the standard (chapter 3) has recommended a "best practice" where the error
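Python's errors="replace" handler follows that recommendation, replacing the maximal ill-formed subsequence with one U+FFFD and then continuing:

```python
bad = b"\xe1\xa0\x20"  # truncated three-byte sequence E1 A0, then a space
print(repr(bad.decode("utf-8", errors="replace")))  # '\ufffd ' – space survives
```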
the 10 January 1800 date and giving the birth certificate as the source, and the second with the 11 January 1800 date and giving the death certificate as the source. The preferred record is usually listed first. This example encoded in GEDCOM might look like the sketch below. Conflicting data may also be the result of user error. The standard does not specify in any way that the contents must be consistent. A birth date like "10 APR 1819" might mistakenly have been recorded as "10 APR 1918" long after
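The original encoding of this example is not preserved in this text; a minimal reconstruction (the name and the exact source sub-structure are assumed) could be:

```
0 @I1@ INDI
1 NAME John /Smith/
1 BIRT
2 DATE 10 JAN 1800
2 SOUR birth certificate
1 BIRT
2 DATE 11 JAN 1800
2 SOUR death certificate
```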
the 1980s faced the dilemma that, on the one hand, it seemed necessary to add more bits to accommodate additional characters, but on the other hand, for the users of the relatively small character set of the Latin alphabet (who still constituted the majority of computer users), those additional bits were a colossal waste of then-scarce and expensive computing resources (as they would always be zeroed out for such users). In 1985,
the 4-digit encoding of Chinese characters for a Chinese telegraph code (Hans Schjellerup, 1869). With the adoption of electrical and electro-mechanical techniques, these earliest codes were adapted to the new capabilities and limitations of the early machines. The earliest well-known electrically transmitted character code, Morse code, introduced in the 1840s, used a system of four "symbols" (short signal, long signal, short space, long space) to generate codes of variable length. Though some commercial use of Morse code
the 7.0 specification document: "The FAM record was originally structured to represent families where a male HUSB (husband or father) and female WIFE (wife or mother) produce CHIL (children)." Although the links in a GEDCOM family record still use the original naming indicating a husband and a wife, the specification now states that "sex, gender, titles, and roles of partners should not be inferred based on
the GEDCOM file makes transmission of data easier, in that all of the information (including the multimedia data) is in one file, but the resulting file can be enormous. Linking multimedia keeps the size of the GEDCOM file under control, but then when transmitting the file, the multimedia objects must either be transmitted separately or archived together with the GEDCOM into one larger file. Support for embedding media directly
the Unicode standard is U+0000 to U+10FFFF, inclusive, divided into 17 planes, identified by the numbers 0 to 16. Characters in the range U+0000 to U+FFFF are in plane 0, called the Basic Multilingual Plane (BMP). This plane contains the most commonly used characters. Characters in the range U+10000 to U+10FFFF in the other planes are called supplementary characters. Consider
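The plane of a code point is simply its value shifted right by 16 bits, as this small Python check shows:

```python
for ch in ("A", "€", "𐐀"):
    cp = ord(ch)
    print(f"U+{cp:04X} is in plane {cp >> 16}")
# U+0041 and U+20AC are in plane 0 (the BMP); U+10400 is in plane 1.
```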
the actual census entry often contains information on multiple individuals. In the GEDCOM file, a separate census (CENS) event must be added for each individual referenced. Some genealogy programs, such as Gramps and The Master Genealogist, have elaborate database structures for sources that are used, among other things, to represent multi-person events. When databases are exported from one of these programs to GEDCOM, these database structures cannot be represented due to this limitation, with
the average personal computer user's hard disk drive could store only about 10 megabytes, and it cost approximately US$250 on the wholesale market (and much more if purchased separately at retail), so it was very important at the time to make every bit count. The compromise solution that was eventually found and developed into Unicode was to break the assumption (dating back to telegraph codes) that each character should always directly correspond to
the capability for an application to set UTF-8 as the "code page" for the Windows API, removing the need to use UTF-16; and more recently has recommended that programmers use UTF-8, even stating "UTF-16 [...] is a unique burden that Windows places on code that targets multiple platforms". The default string primitive in Go, Julia, Rust, Swift (since version 5), and PyPy uses UTF-8 internally in all cases. Python (since version 3.3) uses UTF-8 internally for Python C API extensions and sometimes for strings, and
the current version of Python requires an option to open() to read/write UTF-8, and plans exist to make UTF-8 I/O the default in Python 3.15. C++23 adopts UTF-8 as the only portable source code file format (surprisingly, there was none before). Backwards compatibility is a serious impediment to changing code and APIs using UTF-16 to use UTF-8, but this is happening. As of May 2019, Microsoft added
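Until that default changes, portable Python code passes the encoding explicitly (the file name here is hypothetical):

```python
# Request UTF-8 explicitly rather than relying on the platform default.
with open("tree.ged", "w", encoding="utf-8") as f:
    f.write("0 HEAD\n")

with open("tree.ged", "r", encoding="utf-8") as f:
    print(f.read())
```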
the default encoding in XML and HTML (and not just using UTF-8, but also declaring it in metadata), "even when all characters are in the ASCII range ... Using non-UTF-8 encodings can have unexpected results". Lots of software has the ability to read/write UTF-8. It may, though, require the user to change options from the normal settings, or may require a BOM (byte order mark) as the first character to read
the detailed meaning of each byte in a stream encoded in UTF-8. Not all sequences of bytes are valid UTF-8, and a UTF-8 decoder should be prepared for invalid sequences. Many of the first UTF-8 decoders would decode these, ignoring incorrect bits. Carefully crafted invalid UTF-8 could make them either skip or create ASCII characters such as NUL, slash, or quotes, leading to security vulnerabilities. It is also common to throw an exception or truncate
the early days of Unicode there were no characters greater than U+FFFF and combining characters were rarely used, so the 16-bit encoding was fixed-size. This made processing of text more efficient, though the gains are nowhere near as great as novice programmers may imagine. All such advantages were lost as soon as UTF-16 became variable-width as well. The code points U+0800–U+FFFF take 3 bytes in UTF-8 but only 2 in UTF-16. This led to
the era had their own character codes, often six-bit, but usually had the ability to read tapes produced on IBM equipment. These BCD encodings were the precursors of IBM's Extended Binary-Coded Decimal Interchange Code (usually abbreviated as EBCDIC), an eight-bit encoding scheme developed in 1963 for the IBM System/360 that featured a larger character set, including lower-case letters. In trying to develop universally interchangeable character encodings, researchers in
the exchange of genealogical data. First released as a draft standard in 1999, GEDCOM 5.5.1 received only minor updates in the subsequent 20 years leading up to the release of 5.5.1 final in 2019. To address its shortcomings, some genealogy programs introduced proprietary extensions to GEDCOM, such as GEDCOM 5.5 EL (Extended Locations), which are not always recognized by other programs. Efforts have been made to have 7.0 more widely adopted since its release. FamilySearch intends to be GEDCOM 7.0 compatible in
the features that the GEDCOM standard allows. The GEDCOM standard supports the inclusion of multimedia objects (for example, photos of individuals). Such multimedia objects can be either included in the GEDCOM file itself (called the "embedded form") or kept in an external file whose name is specified in the GEDCOM file (called the "linked form"). Embedding multimedia directly in
the file. Examples of software supporting UTF-8 include Microsoft Word, Microsoft Excel (2016 and later), Google Drive, LibreOffice and most databases. Software that "defaults" to UTF-8 (meaning it writes it without the user changing settings, and reads it without a BOM) has become more common since 2010. Windows Notepad, in all currently supported versions of Windows, defaults to writing UTF-8 without
the first character is a BOM (or the file only contains ASCII). For a long time there was considerable argument as to whether it was better to process text in UTF-16 or in UTF-8. The primary advantage of UTF-16 is that the Windows API required it to be used to get access to all Unicode characters (only recently has this been fixed). This caused several libraries, such as Qt, to also use UTF-16 strings, which propagates this requirement to non-Windows platforms. In
the form of an associated .zip file, called a GEDZip, is another inclusion. Efforts are underway to see 7.0 embraced as the new exchange standard. GEDCOM 7.0 allows explicitly identifying what standards other than GEDCOM may apply to a particular file. GEDCOM has always been extensible, but prior to 7.0 there was no standard way to identify such extensions. GEDCOM 7.0 also allows explicitly marking an event as nonexistent. This allows, for example, documenting that
the high and low surrogates used by UTF-16 (U+D800 through U+DFFF) are not legal Unicode values, and their UTF-8 encodings must be treated as an invalid byte sequence. These encodings all start with 0xED followed by 0xA0 or higher. This rule is often ignored, as surrogates are allowed in Windows filenames and this means there must be a way to store them in a string. UTF-8 that allows these surrogate halves has been (informally) called WTF-8, while another variation that also encodes all non-BMP characters as two surrogates (6 bytes instead of 4)
the high bit was set. The name File System Safe UCS Transformation Format (FSS-UTF) and most of the text of this proposal were later preserved in the final specification. In August 1992, this proposal was circulated by an IBM X/Open representative to interested parties. A modification by Ken Thompson of the Plan 9 operating system group at Bell Labs made it self-synchronizing, letting
the idea that text in Chinese and other languages would take more space in UTF-8. However, text is only larger if there are more of these code points than 1-byte ASCII code points, and this rarely happens in real-world documents due to spaces, newlines, digits, punctuation, English words, and HTML markup. UTF-8 has the advantages of being trivial to retrofit to any system that can handle an extended ASCII, not having byte-order problems, and taking about 1/2
the last byte of a code point to decode it. Unlike many earlier multi-byte text encodings such as Shift-JIS, it is self-synchronizing, so searches for short strings or characters are possible and the start of a code point can be found from a random position by backing up at most 3 bytes. The values chosen for the lead bytes mean that sorting a list of UTF-8 strings puts them in the same order as sorting UTF-32 strings. Using
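That ordering property can be checked directly in Python, where default string comparison is by code point (i.e., UTF-32 order):

```python
words = ["Z", "abc", "é", "€", "𐐀"]

by_code_point = sorted(words)                                   # UTF-32 order
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))  # byte order

assert by_code_point == by_utf8_bytes  # UTF-8 bytes preserve code point order
```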
the most well-known code page suites are "Windows" (based on Windows-1252) and "IBM"/"DOS" (based on code page 437). Despite no longer referring to specific page numbers in a standard, many character encodings are still referred to by their code page number; likewise, the term "code page" is often still used to refer to character encodings in general. The term "code page" is not used in Unix or Linux, where "charmap"
the partner that the HUSB or WIFE structure points to" and that these individuals within a family structure are collectively referred to as 'partners', 'parents' or 'spouses'. A FAM record can also be used for "cohabitation, fostering, adoption, and so on, regardless of the gender of the partners." A GEDCOM file consists of a header section, records, and a trailer section. Within these sections, records represent people (INDI records), families (FAM records), sources of information (SOUR records), and other miscellaneous records, including notes. Every line of
the person's death. The only way to reveal such inconsistencies is by rigorous validation of the content data. The GEDCOM standard supports internationalization in several ways. First, newer versions of the standard allow data to be stored in Unicode (or, more recently, UTF-8), so text in any language can be stored. Secondly, in the same way that one can have multiple events for a person, GEDCOM allows one to have multiple names for
the proposed standard in 2017. It withdrew the proposal because release 7.0 of GEDCOM addressed many of the organization's concerns. Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using computers. The numerical values that make up
the punched card code then in use only allowed digits, upper-case English letters and a few special characters, six bits were sufficient. These BCD encodings extended existing simple four-bit numeric encoding to include alphabetic and special characters, mapping them easily to punch-card encoding which was already in widespread use. IBM's codes were used primarily with IBM equipment; other computer vendors of
the range 0x21–0x7E that meant something else in ASCII, e.g., 0x2F for /, the Unix path directory separator. In July 1992, the X/Open committee XoJIG was looking for a better encoding. Dave Prosser of Unix System Laboratories submitted a proposal for one that had faster implementation characteristics and introduced the improvement that 7-bit ASCII characters would only represent themselves; multi-byte sequences would only include bytes with
the release of GEDCOM 7, FamilySearch positioned GEDCOM X as useful for interoperation with its FamilySearch Family Tree software. Commsoft, the authors of the Roots series of genealogy software and Ultimate Family Tree, defined a version called Event-Oriented GEDCOM (also known as "Event GEDCOM" and originally called InterGED), which included events as first-class (zero-level) items. Although it
the remaining 61,440 code points of the Basic Multilingual Plane (BMP), including most Chinese, Japanese and Korean characters. Four bytes are needed for the 1,048,576 code points in the other planes of Unicode, which include emoji (pictographic symbols), less common CJK characters, various historic scripts, and mathematical symbols. This is a prefix code, and it is unnecessary to read past
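The one-to-four-byte pattern is easy to observe in Python:

```python
for ch in ("A", "é", "€", "𐐀"):  # ASCII, Latin, BMP, supplementary
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
# U+0041 -> 1 byte, U+00E9 -> 2, U+20AC -> 3, U+10400 -> 4
```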
the repertoire over time. A coded character set (CCS) is a function that maps characters to code points (each code point represents one character). For example, in a given repertoire, the capital letter "A" in the Latin alphabet might be represented by the code point 65, the character "B" by 66, and so on. Multiple coded character sets may share the same character repertoire; for example, ISO/IEC 8859-1 and IBM code pages 037 and 500 all cover
the result that the event or source information, including all of the relevant citation reference information, must be duplicated each place it is used. This duplication makes it difficult for the user to maintain the information related to sources. In the GEDCOM specification, events that are associated with a family, such as marriage information, are stored only once, as part of the family (FAM) record, and then both spouses are linked to that single family record. The GEDCOM specification
the same character. An example is the XML attribute xml:lang. The Unicode model uses the term "character map" for other systems which directly assign a sequence of characters to a sequence of bytes, covering all of the CCS, CEF and CES layers. In Unicode, a character can be referred to as 'U+' followed by its code point value in hexadecimal. The range of valid code points (the codespace) for
the same modified UTF-8 as Java for internal representation of Unicode data, but uses strict CESU-8 for external data. All known Modified UTF-8 implementations also treat the surrogate pairs as in CESU-8. The Raku programming language (formerly Perl 6) uses utf-8 encoding by default for I/O (Perl 5 also supports it), though that choice in Raku also implies "normalization into Unicode NFC (normalization form canonical). In some cases you may want to ensure no normalization
the same repertoire but map them to different code points. A character encoding form (CEF) is the mapping of code points to code units to facilitate storage in a system that represents numbers as bit sequences of fixed length (i.e. practically any computer system). For example, a system that stores numeric information in 16-bit units can only directly represent code points 0 to 65,535 in each unit, but larger code points (say, 65,536 to 1.4 million) could be represented by using multiple 16-bit units. This correspondence
the same semantic character. Unicode and its parallel standard, the ISO/IEC 10646 Universal Character Set, together constitute a unified standard for character encoding. Rather than mapping characters directly to bytes, Unicode separately defines a coded character set that maps characters to unique natural numbers (code points), how those code points are mapped to a series of fixed-size natural numbers (code units), and finally how those units are encoded as
the searched-for string does not contain any errors. Making each byte be an error, in which case E1,A0,20 is two errors followed by a space, also still allows searching for a valid string. This means there are only 128 different errors, which makes it practical to store the errors in the output string, or replace them with characters from a legacy encoding. Only a small subset of possible byte strings are error-free UTF-8: several bytes cannot appear;
the second one. The order in which these FAMS are recorded in GEDCOM's INDI record will depend on the exporting program. In Aldfaer, for instance, the sequence depends on the ordering of the data by the user (alphabetical, chronological, reference, etc.). The proposed XML GEDCOM standard does not address this issue either. GEDCOM has many features that are not commonly used. Some software packages do not support all
the solution was to implement variable-length encodings where an escape sequence would signal that subsequent bits should be parsed as a higher code point. Informally, the terms "character encoding", "character map", "character set" and "code page" are often used interchangeably. Historically, the same standard would specify a repertoire of characters and how they were to be encoded into a stream of code units, usually with
the space for any language using mostly Latin letters. UTF-8 has been the most common encoding for the World Wide Web since 2008. As of November 2024, UTF-8 is used by 98.4% of surveyed websites. Although many pages only use ASCII characters to display content, very few websites now declare their encoding to be only ASCII instead of UTF-8. Virtually all countries and languages have 95% or more use of UTF-8 encodings on
the specification for FSS-UTF. UTF-8 was first officially presented at the USENIX conference in San Diego, from January 25 to 29, 1993. The Internet Engineering Task Force adopted UTF-8 in its Policy on Character Sets and Languages in RFC 2277 (BCP 18) for future internet standards work in January 1998, replacing Single Byte Character Sets such as Latin-1 in older RFCs. In November 2003, UTF-8
the specification in wide use is GEDCOM 5.5.1 final, which was released on 15 November 2019. Its predecessor, GEDCOM 5.5.1 draft, was issued in 1999, introducing nine new attribute tags and adding UTF-8 as an approved character encoding. The draft was not formally approved, but its provisions were adopted in some part by a number of genealogy programs, including FamilySearch.org. Lineage-linked GEDCOM
the string at an error, but this turns what would otherwise be harmless errors (i.e. "file not found") into a denial of service. For instance, early versions of Python 3.0 would exit immediately if the command line or environment variables contained invalid UTF-8. RFC 3629 states "Implementations of the decoding algorithm MUST protect against decoding invalid sequences." The Unicode Standard requires decoders to "... treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence." The standard now recommends replacing each error with
the structure of a GEDCOM file is included as part of the PhpGedView project, though it is not meant to be a standalone validator. For standalone validation, "The Windows GEDCOM Validator" can be used, or the older unmaintained Gedcheck from the LDS Church. During 2001, the GEDCOM TestBook Project evaluated how well four popular genealogy programs conformed to the GEDCOM 5.5 standard using the Gedcheck program. Findings showed that
the third quarter of 2022, and Ancestry.com is planning for 7.0 compatibility but has not yet specified an implementation date. GEDCOM uses a lineage-linked data model based on the conceptual model of the nuclear family. The family (FAM) record type is therefore the only source of links between the individuals (INDI) in the file, assigning parents (as HUSB and WIFE) and children (as CHIL) by referring to individuals' unique ID numbers. These historical origins are described in
the value of the code point. In the following table, the characters u to z are replaced by the bits of the code point, from the positions U+uvwxyz: The first 128 code points (ASCII) need 1 byte. The next 1,920 code points need two bytes to encode, which covers the remainder of almost all Latin-script alphabets, and also IPA extensions, Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets, as well as Combining Diacritical Marks. Three bytes are needed for
the web. Many standards only support UTF-8, e.g. JSON exchange requires it (without a byte order mark (BOM)). UTF-8 is also the recommendation from the WHATWG for HTML and DOM specifications, stating "UTF-8 encoding is the most appropriate encoding for interchange of Unicode", and the Internet Mail Consortium recommends that all e-mail programs be able to display and create mail using UTF-8. The World Wide Web Consortium recommends UTF-8 as
was adopted fairly widely. ASCII67's American-centric nature was somewhat addressed in the European ECMA-6 standard. Herman Hollerith invented punch card data encoding in the late 19th century to analyze census data. Initially, each hole position represented a different data element, but later, numeric information was encoded by numbering the lower rows 0 to 9, with a punch in a column representing its row number. Later alphabetic data
was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that a UTF-8-encoded file using only those characters is identical to an ASCII file. Most software designed for any extended ASCII can read and write UTF-8 (including on Microsoft Windows), and this results in fewer internationalization issues than any alternative text encoding. UTF-8
was dropped in the draft 5.5.1 standard. The GEDCOM standard allows for the specification of multiple opinions or conflicting data, simply by specifying multiple records of the same type. For example, if an individual's birth date was recorded as 10 January 1800 on the birth certificate, but 11 January 1800 on the death certificate, two BIRT records for that individual would be included, the first with
was encoded by allowing more than one punch per column. Electromechanical tabulating machines represented data internally by the timing of pulses relative to the motion of the cards through the machine. When IBM went to electronic processing, starting with the IBM 603 Electronic Multiplier, it used a variety of binary encoding schemes that were tied to the punch card code. IBM used several Binary Coded Decimal (BCD) six-bit character encoding schemes, starting as early as 1953 in its 702 and 704 computers, and in its later 7000 Series and 1400 series, as well as in associated peripherals. Since
was made purposefully flexible to support many ways of encoding data, particularly in the area of sources. This flexibility has led to a great deal of ambiguity, and has produced the side effect that some genealogy programs which import GEDCOM do not import all of the data from a file. The GEDCOM specification does not offer explicit support for keeping a known order of events. In particular, the order of relationships (FAMS) for
was often improved by many equipment manufacturers, sometimes creating compatibility issues. In 1959 the U.S. military defined its Fieldata code, a six- or seven-bit code introduced by the U.S. Army Signal Corps. While Fieldata addressed many of the then-modern issues (e.g. letter and digit codes arranged for machine collation), it fell short of its goals and was short-lived. In 1963 the first ASCII code
was released (X3.4-1963) by the ASCII committee (which contained at least one member of the Fieldata committee, W. F. Leubbert), which addressed most of the shortcomings of Fieldata, using a simpler code. Many of the changes were subtle, such as collatable character sets within certain numeric ranges. ASCII63 was a success, widely adopted by industry, and with the follow-up issue of the 1967 ASCII code (which added lower-case letters and fixed some "control code" issues) ASCII67
was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences. UTF-8 encodes code points in one to four bytes, depending on
was via machinery, it was often used as a manual code, generated by hand on a telegraph key and decipherable by ear, and persists in amateur radio and aeronautical use. Most codes are of fixed per-character length or variable-length sequences of fixed-length codes (e.g. Unicode). Common examples of character encoding systems include Morse code, the Baudot code, the American Standard Code for Information Interchange (ASCII) and Unicode. Unicode,