Sitemaps - Misplaced Pages


70-400: Sitemaps is a protocol in XML format meant for a webmaster to inform search engines about URLs on a website that are available for web crawling. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs of

140-539: A numeric character reference. Consider the Chinese character "中", whose numeric code in Unicode is hexadecimal 4E2D, or decimal 20,013. A user whose keyboard offers no method for entering this character could still insert it in an XML document encoded either as &#20013; or &#x4e2d;. Similarly, the string "I <3 Jörg" could be encoded for inclusion in an XML document as I &lt;3 J&#xF6;rg.
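As a minimal sketch of how numeric character references can be produced, the Python function below escapes markup characters and then rewrites every non-ASCII character as a decimal or hexadecimal reference. The function name to_numeric_refs and its hexadecimal parameter are illustrative assumptions, not part of any XML API.

```python
from xml.sax.saxutils import escape


def to_numeric_refs(text: str, hexadecimal: bool = False) -> str:
    """Escape &, < and >, then replace non-ASCII characters with numeric references."""
    out = []
    for ch in escape(text):
        if ord(ch) < 128:
            out.append(ch)                   # plain ASCII passes through unchanged
        elif hexadecimal:
            out.append(f"&#x{ord(ch):X};")   # hexadecimal form, e.g. &#x4E2D;
        else:
            out.append(f"&#{ord(ch)};")      # decimal form, e.g. &#20013;
    return "".join(out)


print(to_numeric_refs("中"))
print(to_numeric_refs("I <3 Jörg", hexadecimal=True))
```

Both the decimal and the hexadecimal form denote the same code point, so a conforming parser reads them back as the same character.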

210-498: A URI must be percent-encoded. When a character from the reserved set (a "reserved character") has a special meaning (a "reserved purpose") in a certain context, and a URI scheme says that it is necessary to use that character for some other purpose, then the character must be percent-encoded . Percent-encoding a reserved character involves converting the character to its corresponding byte value in ASCII and then representing that value as

280-421: A URL (or, more generally, a URI). Unreserved characters have no such meanings. Using percent-encoding, reserved characters are represented using special character sequences. The sets of reserved and unreserved characters and the circumstances under which certain reserved characters have special meaning have changed slightly with each revision of specifications that govern URIs and URI schemes. Other characters in

350-448: A file format. XML standardizes this process. It is therefore analogous to a lingua franca for representing information. As a markup language , XML labels, categorizes, and structurally organizes information. XML tags represent the data structure and contain metadata . What is within the tags is data, encoded in the way the XML standard specifies. An additional XML schema (XSD) defines

420-469: A format that is both human-readable and machine-readable . The World Wide Web Consortium 's XML 1.0 Specification of 1998 and several other related specifications —all of them free open standards —define XML. The design goals of XML emphasize simplicity, generality, and usability across the Internet . It is a textual data format with strong support via Unicode for different human languages . Although

490-477: A hint as to what encoding was used, or if the encoding conflicts with the use of ASCII to percent-encode reserved and unreserved characters, then the URI cannot be reliably interpreted. Some schemes fail to account for encoding at all and instead just suggest that data characters map directly to URI characters, which leaves it up to implementations to decide whether and how to percent-encode data characters that are in neither

560-448: A list of syntax rules provided in the specification. Some key points in the fairly lengthy list include: The definition of an XML document excludes texts that contain violations of well-formedness rules; they are simply not XML. An XML processor that encounters such a violation is required to report such errors and to cease normal processing. This policy, occasionally referred to as " draconian error handling", stands in notable contrast to

630-522: A mechanism whereby an XML processor can reliably, without any prior knowledge, determine which encoding is being used. Encodings other than UTF-8 and UTF-16 are not necessarily recognized by every XML parser (and in some cases not even UTF-16, even though the standard mandates that it also be recognized). XML provides escape facilities for including characters that are problematic to include directly. There are five predefined entities: &lt; (<), &gt; (>), &amp; (&), &apos; ('), and &quot; ("). All permitted Unicode characters may be represented with

700-546: A more compact non-XML syntax; the two syntaxes are isomorphic and James Clark 's conversion tool— Trang —can convert between them without loss of information. RELAX NG has a simpler definition and validation framework than XML Schema, making it easier to use and implement. It also has the ability to use datatype framework plug-ins ; a RELAX NG schema author, for example, can require values in an XML document to conform to definitions in XML Schema Datatypes. Schematron

770-504: A networked context appear in RFC 3470 , also known as IETF BCP 70, a document covering many aspects of designing and deploying an XML-based language. XML has come into common use for the interchange of data over the Internet. Hundreds of document formats using XML syntax have been developed, including RSS , Atom , Office Open XML , OpenDocument , SVG , COLLADA , and XHTML . XML also provides


840-460: A pair of hexadecimal digits (if there is a single hex digit, a leading zero is added). The digits, preceded by a percent sign ( % ) as an escape character , are then used in the URI in place of the reserved character. (A non-ASCII character is typically converted to its byte sequence in UTF-8 , and then each byte value is represented as above.) The reserved character / , for example, if used in
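The byte-to-hex procedure described here can be sketched in a few lines of Python. The helper percent_encode and the constant UNRESERVED are hypothetical names chosen for illustration; the sketch assumes UTF-8 as the character encoding and treats everything outside the unreserved set as needing escapes, which is stricter than most real URI contexts require.

```python
import string
from urllib.parse import quote

# Unreserved characters: letters, digits, hyphen, period, underscore, tilde.
UNRESERVED = string.ascii_letters + string.digits + "-._~"


def percent_encode(text: str, safe: str = UNRESERVED) -> str:
    """Percent-encode every UTF-8 byte of text that is not in the safe set."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch in safe:
            out.append(ch)
        else:
            out.append(f"%{byte:02X}")  # two hex digits, leading zero if needed
    return "".join(out)


print(percent_encode("/"))
print(percent_encode("中"))
# The standard library gives the same result when no extra characters are marked safe.
print(quote("中", safe=""))
```

In practice urllib.parse.quote is the more convenient choice; the hand-written loop is only meant to make the individual steps visible.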

910-506: A rich datatyping system and allow for more detailed constraints on an XML document's logical structure. XSDs also use an XML-based format, which makes it possible to use ordinary XML tools to help process them. xs:schema element that defines a schema: RELAX NG (Regular Language for XML Next Generation) was initially specified by OASIS and is now a standard (Part 2: Regular-grammar-based validation of ISO/IEC 19757 – DSDL ). RELAX NG schemas may be written in either an XML based syntax or

980-400: A sitemap index file (a file that points to multiple sitemaps). A syndication feed is a permitted method of submitting URLs to crawlers; this is advised mainly for sites that already have syndication feeds. One stated drawback is that this method might only provide crawlers with more recently created URLs, but other URLs can still be discovered during normal crawling. It can be beneficial to have

1050-533: A string, then percent-escapes the resulting bytes. When data that has been entered into HTML forms is submitted, the form field names and values are encoded and sent to the server in an HTTP request message using method GET or POST , or, historically, via email . The encoding used by default is based on an early version of the general URI percent-encoding rules, with a number of modifications such as newline normalization and replacing spaces with + instead of %20 . The media type of data encoded this way

1120-406: A syndication feed as a delta update (containing only the newest content) to supplement a complete sitemap. If Sitemaps are submitted directly to a search engine ( pinged ), it will return status information and any processing errors. The details involved with submission will vary with the different search engines. The location of the sitemap can also be included in the robots.txt file by adding

1190-421: A validity error must be able to report it, but may continue normal processing. A DTD is an example of a schema or grammar . Since the initial publication of XML 1.0, there has been substantial work in the area of schema languages for XML. Such schema languages typically constrain the set of elements that may be used in a document, which attributes may be applied to them, the order in which they may appear, and

1260-527: A vocabulary to refer to the constructs within an XML document, but does not provide any guidance on how to access this information. A variety of APIs for accessing XML have been developed and used, and some have been standardized. Existing APIs for XML processing tend to fall into these categories: Stream-oriented facilities require less memory and, for certain tasks based on a linear traversal of an XML document, are faster and simpler than other alternatives. Tree-traversal and data-binding APIs typically require

1330-555: A website, but that are hosted externally, such as on Vimeo or YouTube . Image sitemaps are used to indicate image metadata, such as licensing information, geographic location, and an image's caption. Google supports a Google News sitemap type for facilitating quick indexing of time-sensitive news subjects. In December 2011, Google announced the annotations for sites that want to target users in many languages and, optionally, countries. A few months later Google announced, on their official blog, that they are adding support for specifying

1400-540: Is application/x-www-form-urlencoded , and it is currently defined in the HTML and XForms specifications. In addition, the CGI specification contains rules for how web servers decode data of this type and make it available to applications. When HTML form data is sent in an HTTP GET request, it is included in the query component of the request URI using the same syntax described above. When sent in an HTTP POST request or via email,
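A short Python sketch of the application/x-www-form-urlencoded rules, using only the standard library. The field names and values are made up for illustration.

```python
from urllib.parse import parse_qs, quote, quote_plus, urlencode

# Hypothetical form fields, as they might be submitted from an HTML form.
fields = {"q": "sitemap index", "path": "/en"}

# urlencode applies the form rules: spaces become "+", other reserved characters use %XX.
query = urlencode(fields)
print(query)

# Generic percent-encoding and form encoding differ in how they treat spaces.
print(quote("a b"))       # space as %20
print(quote_plus("a b"))  # space as +

# Decoding reverses both substitutions.
print(parse_qs(query))
```

For a GET request the resulting string would be appended to the URI after a "?"; for a POST request it would be sent as the message body.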

1470-412: Is a lexical , event-driven API in which a document is read serially and its contents are reported as callbacks to various methods on a handler object of the user's design. SAX is fast and efficient to implement, but difficult to use for extracting information at random from the XML, since it tends to burden the application author with keeping track of what part of the document is being processed. It
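The bookkeeping burden that SAX places on the application author can be seen in a small Python sketch built on the standard xml.sax module. The handler class LocHandler and the helper extract_locs are hypothetical names; the example assumes a sitemap-like document and collects the text of every loc element.

```python
import xml.sax


class LocHandler(xml.sax.ContentHandler):
    """Collects the text of <loc> elements; the handler must track parser state itself."""

    def __init__(self):
        super().__init__()
        self.locs = []
        self._in_loc = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "loc":
            self._in_loc = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so it is buffered until the end tag.
        if self._in_loc:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "loc":
            self.locs.append("".join(self._buffer).strip())
            self._in_loc = False


def extract_locs(xml_bytes: bytes) -> list:
    handler = LocHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.locs
```

The _in_loc flag and the text buffer are exactly the kind of state tracking that tree-based APIs hide from the caller.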


1540-726: Is a language for making assertions about the presence or absence of patterns in an XML document. It typically uses XPath expressions. Schematron is now a standard (Part 3: Rule-based validation of ISO/IEC 19757 – DSDL ). DSDL (Document Schema Definition Languages) is a multi-part ISO/IEC standard (ISO/IEC 19757) that brings together a comprehensive set of small schema languages, each targeted at specific problems. DSDL includes RELAX NG full and compact syntax, Schematron assertion language, and languages for defining datatypes, character repertoire constraints, renaming and entity expansion, and namespace-based routing of document fragments to different validators. DSDL schema languages do not have

1610-509: Is also used in the preparation of data of the application/x-www-form-urlencoded media type , as is often used in the submission of HTML form data in HTTP requests. The characters allowed in a URI are either reserved or unreserved (or a percent character as part of a percent-encoding). Reserved characters are those characters that sometimes have special meaning. For example, forward slash characters are used to separate different parts of

1680-570: Is an XML industry data standard. XML is used extensively to underpin various publishing formats. One of the applications of XML is in the transfer of Operational meteorology (OPMET) information based on IWXXM standards. The material in this section is based on the XML Specification . This is not an exhaustive list of all the constructs that appear in XML; it provides an introduction to the key constructs most often encountered in day-to-day use. XML documents consist entirely of characters from

1750-404: Is an alias), application/xml-external-parsed-entity ( text/xml-external-parsed-entity is an alias) and application/xml-dtd . They are used for transmitting raw XML files without exposing their internal semantics . RFC 7303 further recommends that XML-based languages be given media types ending in +xml , for example, image/svg+xml for SVG . Further guidelines for the use of XML in

1820-511: Is based on ideas from "Crawler-friendly Web Servers," with improvements including auto-discovery through robots.txt and the ability to specify the priority and change frequency of pages. Sitemaps are particularly beneficial on websites where: The Sitemap Protocol format consists of XML tags. The file itself must be UTF-8 encoded. Sitemaps can also be just a plain text list of URLs. They can also be compressed in .gz format. A sample Sitemap that contains just one URL and uses all optional tags
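As a hedged illustration of the XML format, the following Python sketch builds a Sitemap with the standard ElementTree module. The element names (urlset, url, loc, lastmod, changefreq, priority) and the namespace come from the Sitemaps protocol; the function name build_sitemap, the dictionary layout, and the example URL are assumptions made for this sketch. It requires Python 3.8 or later for the xml_declaration argument.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries) -> bytes:
    """Build a UTF-8 encoded Sitemap from dictionaries keyed by element name."""
    ET.register_namespace("", SITEMAP_NS)  # serialize without a namespace prefix
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for entry in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        # loc is required; the other three tags are optional.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, f"{{{SITEMAP_NS}}}{tag}").text = str(entry[tag])
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


sitemap_xml = build_sitemap([
    {
        "loc": "https://www.example.com/?id=who&page=1",  # illustrative URL
        "lastmod": "2006-11-18",
        "changefreq": "daily",
        "priority": 0.8,
    }
])
print(sitemap_xml.decode("utf-8"))
```

ElementTree escapes the ampersand in the URL as &amp; when serializing, so the entity escaping that XML requires is handled without extra code.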

1890-498: Is better suited to situations in which certain types of information are always handled the same way, no matter where they occur in the document. Pull parsing treats the document as a series of items read in sequence using the iterator design pattern . This allows for writing of recursive descent parsers in which the structure of the code performing the parsing mirrors the structure of the XML being parsed, and intermediate parsed results can be used and accessed as local variables within

1960-442: Is not permitted because the null character is one of the control characters excluded from XML, even when using a numeric character reference. An alternative encoding mechanism such as Base64 is needed to represent such characters. Comments may appear anywhere in a document outside other markup. Comments cannot appear before the XML declaration. Comments begin with <!-- and end with -->. For compatibility with SGML,

2030-636: Is only used to suggest to the crawlers how important pages of the site are to one another. Does not apply to <sitemap> elements. Support for the elements that are not required can vary from one search engine to another. The Sitemaps protocol allows the Sitemap to be a simple list of URLs in a text file. The file specifications of XML Sitemaps apply to text Sitemaps as well; the file must be UTF-8 encoded, and cannot be more than 50MiB (uncompressed) or contain more than 50,000 URLs. Sitemaps that exceed these limits should be broken up into multiple sitemaps with
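The 50,000-URL limit suggests a simple splitting step. The Python sketch below chunks a URL list and builds a Sitemap index that points at one file per chunk. The function names, the file naming pattern, and the example URLs are hypothetical, and the sketch checks only the URL count, not the 50 MiB size limit.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into consecutive chunks of at most size entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_locations) -> bytes:
    """Build a Sitemap index document listing the given sitemap URLs."""
    ET.register_namespace("", SITEMAP_NS)
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for location in sitemap_locations:
        sitemap = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(sitemap, f"{{{SITEMAP_NS}}}loc").text = location
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


urls = [f"https://www.example.com/page/{i}" for i in range(120_001)]
chunks = chunk_urls(urls)
# Hypothetical naming scheme: one numbered sitemap file per chunk.
index_xml = build_sitemap_index(
    [f"https://www.example.com/sitemap-{n}.xml" for n in range(1, len(chunks) + 1)]
)
print(len(chunks))
```

A production generator would also measure the serialized size of each chunk and split further if a file approached 50 MiB.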

2100-447: Is shown below. The Sitemap XML protocol is also extended to provide a way of listing multiple Sitemaps in a 'Sitemap index' file. The maximum Sitemap size of 50 MiB or 50,000 URLs means this is necessary for large sites. An example of Sitemap index referencing one separate sitemap follows. The definitions for the elements are shown below: "Always" is used to denote documents that change each time that they are accessed. "Never"

2170-546: Is typically preferred, as it results in shorter URLs. The procedure for percent-encoding binary data has often been extrapolated, sometimes inappropriately or without being fully specified, to apply to character-based data. In the World Wide Web 's formative years, when dealing with data characters in the ASCII repertoire and using their corresponding bytes in ASCII as the basis for determining percent-encoded sequences, this practice


2240-547: Is up to the URI scheme specifications to account for this possibility and require one or the other, but in practice, few, if any, actually do. There exists a non-standard encoding for Unicode characters: %u xxxx , where xxxx is a UTF-16 code unit represented as four hexadecimal digits. This behavior is not specified by any RFC and has been rejected by the W3C. The 13th edition of ECMA-262 still includes an escape function that uses this syntax, which applies UTF-8 encoding to

2310-424: Is used to denote archived URLs (i.e. files that will not be changed again). This is used only as a guide for crawlers , and is not used to determine how frequently pages are indexed. Does not apply to <sitemap> elements. The valid range is from 0.0 to 1.0, with 1.0 being the most important. The default value is 0.5. Rating all pages on a site with a high priority does not affect search listings, as it

2380-693: The .NET Framework , and the DOM traversal API (NodeIterator and TreeWalker). URL encoding URL encoding , officially known as percent-encoding , is a method to encode arbitrary data in a uniform resource identifier (URI) using only the US-ASCII characters legal within a URI. Although it is known as URL encoding , it is also used more generally within the main Uniform Resource Identifier (URI) set, which includes both Uniform Resource Locator (URL) and Uniform Resource Name (URN). Consequently, it

2450-453: The Unicode repertoire. Except for a small number of specifically excluded control characters , any character defined by Unicode may appear within the content of an XML document. XML includes facilities for identifying the encoding of the Unicode characters that make up the document, and for expressing characters that, for one reason or another, cannot be used directly. Unicode code points in

2520-410: The infoset augmentation facility and attribute defaults. RELAX NG and Schematron intentionally do not provide these. A cluster of specifications closely related to XML have been developed, starting soon after the initial publication of XML 1.0. It is frequently the case that the term "XML" is used to refer to XML together with one or more of these other technologies that have come to be seen as part of

2590-444: The " query " component of a URI (the part after a ? character), for example, / is still considered a reserved character but it normally has no reserved purpose, unless a particular URI scheme says otherwise. The character does not need to be percent-encoded when it has no reserved purpose. URIs that differ only by whether a reserved character is percent-encoded or appears literally are normally considered not equivalent (denoting

2660-446: The "path" component of a URI , has the special meaning of being a delimiter between path segments. If, according to a given URI scheme, / needs to be in a path segment, then the three characters %2F or %2f must be used in the segment instead of a raw / . Reserved characters that have no reserved purpose in a particular context may also be percent-encoded but are not semantically different from those that are not. In

2730-530: The Sitemaps protocol in November 2006. The schema version was changed to "Sitemap 0.90", but no other changes were made. In April 2007, Ask.com and IBM announced support for Sitemaps. Google, Yahoo and MSN also announced auto-discovery for sitemaps through robots.txt. In May 2007, the state governments of Arizona, California, Utah and Virginia announced they would use Sitemaps on their web sites. The Sitemaps protocol

2800-429: The XML core. Some other specifications conceived as part of the "XML Core" have failed to find wide adoption, including XInclude , XLink , and XPointer . The design goals of XML include, "It shall be easy to write programs which process XML documents." Despite this, the XML specification contains almost no information about how programmers might go about doing such processing. The XML Infoset specification provides

2870-503: The XML processor inserts in the DTD itself and in the XML document wherever they are referenced, like character escapes. DTD technology is still used in many applications because of its ubiquity. A newer schema language, described by the W3C as the successor of DTDs, is XML Schema , often referred to by the initialism for XML Schema instances, XSD (XML Schema Definition). XSDs are far more powerful than DTDs in describing XML languages. They use


2940-434: The allowable parent/child relationships. The oldest schema language for XML is the document type definition (DTD), inherited from SGML. DTDs have the following benefits: DTDs have the following limitations: Two peculiar features that distinguish DTDs from other schema types are the syntactic support for embedding a DTD within XML documents and for defining entities , which are arbitrary fragments of text or markup that

3010-598: The base language for communication protocols such as SOAP and XMPP . It is one of the message exchange formats used in the Asynchronous JavaScript and XML (AJAX) programming technique. Many industry data standards, such as Health Level 7 , OpenTravel Alliance , FpML , MISMO , and National Information Exchange Model are based on XML and the rich features of the XML schema specification. In publishing, Darwin Information Typing Architecture

3080-415: The basis for percent-encoding, leading to ambiguities and difficulty interpreting URIs reliably. For example, many URI schemes and protocols based on RFCs 1738 and 2396 presume that the data characters will be converted to bytes according to some unspecified character encoding before being represented in a URI by unreserved characters or percent-encoded bytes. If the scheme does not allow the URI to provide

3150-401: The behavior of programs that process HTML , which are designed to produce a reasonable result even in the presence of severe markup errors. XML's policy in this area has been criticized as a violation of Postel's law ("Be conservative in what you send; be liberal in what you accept"). The XML specification defines a valid XML document as a well-formed XML document which also conforms to

3220-423: The case of C1 characters, this restriction is a backwards incompatibility; it was introduced to allow common encoding errors to be detected. The code point U+0000 (Null) is the only character that is not permitted in any XML 1.1 document. The Unicode character set can be encoded into bytes for storage or transmission in a variety of different ways, called "encodings". Unicode itself defines encodings that cover

3290-407: The characters ampersand (&), single quote ('), double quote ("), less than (<), and greater than (>). Best practice for optimising a sitemap index for search engine crawlability is to ensure the index refers only to sitemaps as opposed to other sitemap indexes. Nesting a sitemap index within a sitemap index is invalid according to Google. A number of additional XML sitemap types outside of
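Escaping the five characters listed here can be sketched with the standard xml.sax.saxutils.escape function, which handles the ampersand, less-than, and greater-than signs by default and accepts a mapping for the two quote characters. The helper name escape_for_sitemap and the example URL are assumptions for illustration.

```python
from xml.sax.saxutils import escape

# escape() covers &, < and > itself; the quote characters are passed as extra entities.
QUOTE_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def escape_for_sitemap(value: str) -> str:
    """Apply entity escapes to a data value before placing it in Sitemap XML."""
    return escape(value, QUOTE_ENTITIES)


print(escape_for_sitemap("https://www.example.com/a?x=1&y='2'"))
```

Code that builds the document through an XML library, rather than by string concatenation, usually gets this escaping for free.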

3360-546: The design of XML focuses on documents, the language is widely used for the representation of arbitrary data structures , such as those used in web services . Several schema systems exist to aid in the definition of XML-based languages, while programmers have developed many application programming interfaces (APIs) to aid the processing of XML data. The main purpose of XML is serialization , i.e. storing, transmitting, and reconstructing arbitrary data. For two disparate systems to exchange information, they need to agree upon

3430-442: The direct use of almost any Unicode character in element names, attributes, comments, character data, and processing instructions (other than the ones that have special symbolic meaning in XML itself, such as the less-than sign, "<"). The following is a well-formed XML document including Chinese , Armenian and Cyrillic characters: The XML specification defines an XML document as a well-formed text, meaning that it satisfies

3500-516: The entire repertoire; well-known ones include UTF-8 (which the XML standard recommends using, without a BOM ) and UTF-16 . There are many other text encodings that predate Unicode, such as ASCII and various ISO/IEC 8859 ; their character repertoires are in every case subsets of the Unicode character set. XML allows the use of any of the Unicode-defined encodings and any other encodings whose characters also appear in Unicode. XML also provides

3570-449: The following line: The <sitemap_location> should be the complete URL to the sitemap, such as: This directive is independent of the user-agent line, so it doesn't matter where it is placed in the file. If the website has several sitemaps, multiple "Sitemap:" records may be included in robots.txt , or the URL can simply point to the main sitemap index file. The following table lists
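A minimal sketch of how a crawler might read these records: each one takes the form "Sitemap:" followed by the complete sitemap URL, and several may appear anywhere in the file. The function name sitemap_urls and the sample robots.txt content are hypothetical; the sketch assumes that "#" starts a comment and that the field name is matched case-insensitively.

```python
def sitemap_urls(robots_txt: str) -> list:
    """Return the URLs of all Sitemap records in a robots.txt document."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments (assumes sitemap URLs contain no "#").
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


robots = """User-agent: *
Disallow: /private/
Sitemap: https://www.example.com/sitemap-index.xml
sitemap: https://www.example.com/news-sitemap.xml  # second record
"""
print(sitemap_urls(robots))
```

Because the records are independent of the user-agent groups, the parser does not need to track which group a line belongs to.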


3640-498: The following ranges are valid in XML 1.0 documents: XML 1.1 extends the set of allowed characters to include all the above, plus the remaining characters in the range U+0001–U+001F. At the same time, however, it restricts the use of C0 and C1 control characters other than U+0009 (Horizontal Tab), U+000A (Line Feed), U+000D (Carriage Return), and U+0085 (Next Line) by requiring them to be written in escaped form (for example U+0001 must be written as &#x01; or its equivalent). In

3710-685: The functions performing the parsing, or passed down (as function parameters) into lower-level functions, or returned (as function return values) to higher-level functions. Examples of pull parsers include Data::Edit::Xml in Perl , StAX in the Java programming language, XMLPullParser in Smalltalk , XMLReader in PHP , ElementTree.iterparse in Python , SmartXML in Red , System.Xml.XmlReader in
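Of the pull-style interfaces listed, ElementTree.iterparse is available in the Python standard library. The sketch below uses it to stream through a sitemap-like document and yield each loc value, clearing finished url elements to keep memory use low. The function name iter_locs is an assumption for illustration.

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def iter_locs(source):
    """Yield the text of each <loc> element while reading the document incrementally."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "loc":
            yield (elem.text or "").strip()
        elif elem.tag == NS + "url":
            elem.clear()  # discard the completed <url> subtree


document = io.BytesIO(
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://www.example.com/</loc></url>"
    b"<url><loc>https://www.example.com/about</loc></url>"
    b"</urlset>"
)
print(list(iter_locs(document)))
```

The caller drives the loop and keeps intermediate results in ordinary local variables, which is the main ergonomic difference from callback-based SAX handlers.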

3780-426: The necessary metadata for interpreting and validating XML. (This is also referred to as the canonical schema.) An XML document that adheres to basic XML rules is "well-formed"; one that adheres to its schema is "valid." IETF RFC 7303 (which supersedes the older RFC 3023) provides rules for the construction of media types for use in XML messages. It defines three media types: application/xml ( text/xml

3850-542: The only option was to add the hreflang annotation either in the HTTP header or as HTML elements on both URLs like this But now, one can alternatively use the following equivalent markup in Sitemaps: XML Extensible Markup Language ( XML ) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in

3920-467: The percent character ( % ) serves to indicate percent-encoded octets, it must itself be percent-encoded as %25 to be used as data within a URI. Most URI schemes involve the representation of arbitrary data, such as an IP address or file system path, as components of a URI. URI scheme specifications should, but often do not, provide an explicit mapping between URI characters and all possible data values being represented by those characters. Since

3990-435: The publication of RFC 1738 in 1994 it has been specified that schemes that provide for the representation of binary data in a URI must divide the data into 8-bit bytes and percent-encode each byte in the same manner as above. Byte value 0x0F, for example, should be represented by %0F , but byte value 0x41 can be represented by A , or %41 . The use of unencoded characters for alphanumeric and other unreserved characters

4060-487: The rel="alternate" and hreflang annotations in Sitemaps. Compared with HTML link elements, which until then were the only option, the Sitemaps option offered advantages that included a smaller page size and easier deployment for some websites. One example of a multilingual sitemap would be as follows: if a site targets English-language users through https://www.example.com/en and Greek-language users through https://www.example.com/gr, up until then
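A hedged sketch of how such a multilingual sitemap could be generated in Python. It assumes the commonly documented form in which each url entry carries xhtml:link elements with rel="alternate", hreflang, and href attributes, and it assumes "el" as the language code for Greek (the ISO 639-1 code), even though the example path ends in /gr. The function name and the dictionary layout are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_multilingual_sitemap(alternates: dict) -> str:
    """alternates maps an hreflang code to a URL; every URL lists all alternates, itself included."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("xhtml", XHTML_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url in alternates.values():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        for lang, href in alternates.items():
            ET.SubElement(
                url,
                f"{{{XHTML_NS}}}link",
                {"rel": "alternate", "hreflang": lang, "href": href},
            )
    return ET.tostring(urlset, encoding="unicode")


multilingual_xml = build_multilingual_sitemap({
    "en": "https://www.example.com/en",
    "el": "https://www.example.com/gr",  # assumed language code for Greek
})
print(multilingual_xml)
```

Listing the full set of alternates under every URL keeps the annotations symmetric, so each language version points to all the others.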

4130-540: The reserved nor unreserved sets. Arbitrary character data is sometimes percent-encoded and used in non-URI situations, such as for password-obfuscation programs or other system-specific translation protocols. The generic URI syntax recommends that new URI schemes that provide for the representation of character data in a URI should, in effect, represent characters from the unreserved set without translation and should convert all other characters to bytes according to UTF-8 , and then percent-encode those values. This suggestion

4200-487: The rules of a Document Type Definition (DTD). In addition to being well formed, an XML document may be valid . This means that it contains a reference to a Document Type Definition (DTD), and that its elements and attributes are declared in that DTD and follow the grammatical rules for them that the DTD specifies. XML processors are classified as validating or non-validating depending on whether or not they check XML documents for validity. A processor that discovers

4270-739: The same resource) unless it can be determined that the reserved characters in question have no reserved purpose. This determination is dependent upon the rules established for reserved characters by individual URI schemes. Characters from the unreserved set never need to be percent-encoded. URIs that differ only by whether an unreserved character is percent-encoded or appears literally are equivalent by definition, but URI processors, in practice, may not always recognize this equivalence. For example, URI consumers should not treat %41 differently from A or %7E differently from ~ , but some do. For maximal interoperability, URI producers are discouraged from percent-encoding unreserved characters. Because


4340-513: The scope of the Sitemaps protocol are supported by Google to allow webmasters to provide additional data on the content of their websites. Video and image sitemaps are intended to improve the capability of websites to rank in image and video searches. Video sitemaps indicate data related to embedding and autoplaying, preferred thumbnails to show in search results, publication date, video duration, and other metadata. Video sitemaps are also used to allow search engines to index videos that are embedded on

4410-487: The site. This allows search engines to crawl the site more efficiently and to find URLs that may be isolated from the rest of the site's content. The Sitemaps protocol is a URL inclusion protocol and complements robots.txt , a URL exclusion protocol. Google first introduced Sitemaps 0.84 in June 2005 so web developers could publish lists of links from across their sites. Google, Yahoo! and Microsoft announced joint support for

4480-470: The sitemap submission URLs for a few major search engines: Sitemap URLs submitted using the sitemap submission URLs need to be URL-encoded , for example: replace : (colon) with %3A , replace / (slash) with %2F . Sitemaps supplement and do not replace the existing crawl-based mechanisms that search engines already use to discover URLs. Using this protocol does not guarantee that web pages will be included in search indexes, nor does it influence
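The URL-encoding step for submission can be sketched with urllib.parse.quote. The endpoint below is purely hypothetical, since each search engine defines its own submission URL; only the encoding of the sitemap URL is the point of the example.

```python
from urllib.parse import quote

# Hypothetical submission endpoint; real search engines publish their own.
PING_ENDPOINT = "https://searchengine.example/ping?sitemap="


def submission_url(sitemap_url: str, endpoint: str = PING_ENDPOINT) -> str:
    """Append the fully percent-encoded sitemap URL to a submission endpoint."""
    # safe="" makes quote() encode ":" and "/" as well (%3A and %2F).
    return endpoint + quote(sitemap_url, safe="")


print(submission_url("https://www.example.com/sitemap.xml"))
```

Passing safe="" matters here because quote() leaves the slash unencoded by default.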

4550-469: The string "--" (double-hyphen) is not allowed inside comments; this means comments cannot be nested. The ampersand has no special significance within comments, so entity and character references are not recognized as such, and there is no way to represent characters outside the character set of the document encoding. An example of a valid comment: <!--no need to escape <code> & such in comments--> XML 1.0 (Fifth Edition) and XML 1.1 support

4620-472: The use of much more memory, but are often found more convenient for use by programmers; some include declarative retrieval of document components via the use of XPath expressions. XSLT is designed for declarative description of XML document transformations, and has been widely implemented both in server-side packages and Web browsers. XQuery overlaps XSLT in its functionality, but is designed more for searching of large XML databases . Simple API for XML (SAX)

4690-426: The vendor support of XML Schemas yet, and are to some extent a grassroots reaction of industrial publishers to the lack of utility of XML Schemas for publishing . Some schema languages not only describe the structure of a particular XML format but also offer limited facilities to influence processing of individual XML files that conform to this format. DTDs and XSDs both have this ability; they can for instance provide

4760-591: The way that pages are ranked in search results. Specific examples are provided below. Sitemap files have a limit of 50,000 URLs and 50 MiB (52,428,800 bytes) per sitemap. Sitemaps can be compressed using gzip , reducing bandwidth consumption. Multiple sitemap files are supported, with a Sitemap index file serving as an entry point. Sitemap index files may not list more than 50,000 Sitemaps and must be no larger than 50MiB and can be compressed. You can have more than one Sitemap index file. As with all XML files, any data values (including URLs) must use entity escape codes for

4830-457: Was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected. Not addressed by the current specification is what to do with encoded character data. For example, in computers, character data manifests in encoded form, at some level, and thus could be treated as either binary or character data when being mapped to URI characters. Presumably, it

4900-437: Was relatively harmless; it was just assumed that characters and bytes mapped one-to-one and were interchangeable. The need to represent characters outside the ASCII range, however, grew quickly, and URI schemes and protocols often failed to provide standard rules for preparing character data for inclusion in a URI. Web applications consequently began using different multi-byte, stateful , and other non-ASCII-compatible encodings as
