US20080281815A1 - Optimal storage and retrieval of xml data - Google Patents

Publication number
US20080281815A1
Authority
US
United States
Prior art keywords
query
node
dictionary
documents
terms
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/134,176
Inventor
Chetan Narsude
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yahoo Inc
Original Assignee
Yahoo Inc until 2017
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yahoo Inc until 2017
Priority to US12/134,176
Assigned to YAHOO! INC. Assignment of assignors interest (see document for details). Assignors: NARSUDE, CHETAN
Publication of US20080281815A1
Assigned to YAHOO HOLDINGS, INC. Assignment of assignors interest (see document for details). Assignors: YAHOO! INC.
Assigned to OATH INC. Assignment of assignors interest (see document for details). Assignors: YAHOO HOLDINGS, INC.

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99933 Query processing, i.e. searching

Definitions

  • the present invention relates generally to document management, and more particularly, to a method and system for optimally storing and retrieving documents having a hierarchical structure such as XML documents.
  • Extensible Markup Language is a universally accepted format for representing structured data in textual form.
  • the XML format embeds content within tags that express its structure.
  • XML makes it possible for different tools, applications and repositories on a variety of platforms and middleware to meaningfully share data and to easily search for data that is embedded in the XML documents.
  • XML documents are typically managed using a database.
  • a query is issued.
  • the XML document identified in the query is retrieved from the database and parsed, and the desired information is extracted from the parsed XML document.
  • the parsed XML document is commonly known as an XML DOM (Document Object Model).
  • the invention provides a document management system that manages a large number of XML documents or any other documents having a hierarchical structure on an efficient and cost-effective basis. Storage requirements are reduced, because compressed versions of such documents, which are much smaller in size than the documents themselves, are stored in a database that is accessed when processing queries. Processing requirements are reduced, because parsing is not a required step when processing queries. Instead of parsing, the query is processed by unpacking the compressed version of the document identified in the query node-by-node until enough information has been unpacked to satisfy the query. Processing speed is improved in two ways. First, unpacking as carried out according to the invention is a much faster process than parsing. Second, the entire document does not need to be unpacked.
  • FIG. 1 illustrates a block diagram of a document management system that implements an embodiment of the invention
  • FIG. 2 is a sample XML document
  • FIG. 3 is the XML DOM of the sample XML document of FIG. 2 ;
  • FIG. 4 is a binary object converted from the XML DOM of FIG. 3 ;
  • FIG. 5 is a flow diagram illustrating the steps of creating a binary object from an XML document
  • FIG. 6 is a flow diagram illustrating the steps for processing a query in which the child nodes of the XML document are processed in series.
  • FIG. 7 is a flow diagram illustrating the steps for processing a query in which the child nodes of the XML document are processed in parallel.
  • FIG. 1 illustrates a block diagram of a document management system for XML documents.
  • the document management system includes a web server 20 that receives HTTP requests made over the Internet 30 and transmits HTML documents in response to the HTTP requests. If the HTTP request includes a request for information in an XML document, the web server 20 passes on the HTTP request to an XML server 40. The XML server 40 receives the request, processes it and then transmits an XML response in return. The web server 20 creates the HTML response transmitted over the Internet 30 from the XML response received from the XML server 40 using XSLT (Extensible Stylesheet Language Transformations).
  • the XML server 40 includes an application program interface (API) 42 , a binary large object (BLOB) process 44 , an auxiliary database 46 , and a query process 48 .
  • the API 42 represents a set of routines, protocols, and tools used in converting the HTTP request into an XPATH query and in creating the XML response transmitted to the web server 20 based on the results of the query process 48 .
  • the BLOB process 44 converts XML documents into BLOBs and stores them in the auxiliary database 46 . Each BLOB is stored in the auxiliary database 46 against a unique key, which is typically the title of the XML document that has been converted.
  • the auxiliary database 46 can be any database that is capable of storing files against keys that are used as file identifiers.
  • the query process 48 executes a query (e.g., an XPATH query) from the web server 20 . It first retrieves a BLOB corresponding to the document identified in the query from the auxiliary database 46 and unpacks the BLOB to the extent necessary to process the query. Details of the BLOB process 44 and the query process 48 are set forth below.
  • the XML documents that are created or received from another document management system are stored in their original text form in an SQL database 50 and replicated in SQL slave databases 60 .
  • Any external entity or process (not shown), which wants to put one or more XML documents in the auxiliary database 46 may employ the BLOB process 44 to do so.
  • the BLOB process 44 is initialized with an “hdinit” call.
  • the external entity or process calls the “hdprocess” for each document that is to be placed in the auxiliary database 46 .
  • the “hdprocess” is defined as:
  • int hdprocess(const char *path, const char *data, unsigned int size, unsigned int deletion).
  • the path argument refers to the key used to identify the document uniquely (e.g., the title of this document).
  • the data contains the XML text representing the content of the document.
  • the size argument is the length of this document in bytes and the deletion flag is set to a non-zero value when the document corresponding to path needs to be deleted from, instead of added to, the auxiliary database 46 .
  • the deletion flag is redundant since the size argument set to zero automatically means that the document needs to be deleted.
  • hdinit This is the initialization method for the BLOB process 44 . It first initializes a memory-mapped dictionary of words that is used by “hdprocess.” This dictionary maps words appearing in XML documents to IDs that require much less memory. Because XML documents are very verbose and a lot of words in the document are repetitive, a lot of memory can be saved if, instead of storing the words, the associated IDs of the words are stored in the BLOBs. The “hdinit” method also initializes the underlying database (the auxiliary database 46 in FIG. 1 ), which is capable of storing any sequence of bytes as a key and any sequence of bytes as data associated with the key. In the embodiment illustrated in FIG. 1 , Berkeley DB-4 may be used.
  • the “hdinit” method creates an instance of the object, hdprocess, that parses the XML document, removes unwanted white spaces, maps all the words appearing in the XML documents to the IDs in the dictionary, and creates the packed (compressed) BLOBs which are ready to be put in the database.
  • hdprocess This is the method that generates the BLOB corresponding to the XML data and stores the BLOB in the database against the key represented by the path argument.
  • In generating the BLOB, it parses the XML data in the data argument, identifies all unwanted white spaces usually appearing between the end of one element and the beginning of the next element, and maps all text appearing in the XML data to associated IDs in the dictionary. Any text for which an ID has not already been assigned is assigned a new ID during this process.
  • IDs are created in such a way that they are consistent across multiple processes.
  • One simple way to achieve this is by getting the positional offset of the word from the beginning of the dictionary file.
  • any conventional parser may be used.
  • in the embodiment illustrated herein, expat, a stream-oriented XML parser with a SAX-style (Simple API for XML) event interface, is used.
  • hdfini This method does the exact opposite of the “hdinit” method. It closes the dictionary, flushes the database content from the memory to the disk and closes the database. Also, it releases the resources reserved by the parser that were used for parsing the XML document.
  • FIG. 2 is a sample XML document. After parsing, the XML DOM of the XML document in FIG. 2 may be graphically represented as shown in FIG. 3 .
  • the dictionary for the XML element nodes when built completely for the XML document in FIG. 2 is shown in the following table.
  • the dictionary for the XML non-element nodes is shown in the following table.
  • Each node in the BLOB, after “hdprocess” is performed on an XML document, is represented by the following tuple:
  • Root_Identifier is the byte offset (ID) of the tag associated with the node
  • Children_Count is the number of child nodes
  • Attributes_Count is the number of attributes of the node
  • NodeType is the node type, which may be:
  • FIG. 4 illustrates the BLOB corresponding to the XML document in FIG. 2 .
  • the tuples shown in FIG. 4 are stored contiguously in memory for the auxiliary database 46 , and are associated with the key for the XML document in FIG. 2 .
  • FIG. 5 is a flow diagram illustrating the steps of creating a BLOB from an XML document.
  • the XML document is parsed to generate the XML DOM of the XML document. Any conventional XML parser may be used. During parsing, white spaces (e.g., new line, tab and space characters) that appear before an opening element tag or after a closing element tag, but not between the tags, are removed.
  • the root node of the XML document is retrieved as the current node for processing.
  • the node type of the current node is determined.
  • In Step 504, the dictionary used with the “hdprocess” method is retrieved to see if the current node is stored as a term in the dictionary. If the node type is an element, then an element node dictionary is retrieved. If the node type is not an element, then a non-element node dictionary is retrieved. If the current node is not stored as a term in the dictionary, it is added to the dictionary and an ID is assigned (Step 505). The ID assigned corresponds to the positional offset (in bytes) in memory of the stored term with respect to the beginning of the dictionary. If the current node already appears in the dictionary, flow proceeds to Step 506, where the ID associated with the current node is retrieved.
  • In Step 507, the number of attributes and the number of children nodes corresponding to the current node are determined, and in Step 508, the ID, the children count, the attributes count, the node type, and all IDs associated with each attribute-value pair (if any) in the dictionary are stored.
  • the dictionaries used for the attributes and their associated values are the same as the dictionaries used for the nodes, and the terms for attributes and/or values not found in the dictionaries are created and assigned IDs in the same manner as for the nodes.
  • the element node dictionary is used for the attributes and the non-element node dictionary is used for the values.
  • After the current node is processed, its children nodes are processed one-by-one in the same manner (Steps 509-510 and Steps 503-508). If there are no children nodes or all children nodes have been processed, the current node's sister nodes are processed one-by-one in the same manner (Steps 511-512 and Steps 503-508). If there are no sister nodes or all sister nodes have been processed, the parent node becomes the current node (Step 513).
  • If this node is not the root node (Step 514), any sister nodes of this node are processed one-by-one in the same manner as before (Steps 511-512 and Steps 503-508). The processing ends when the current node becomes the root node (Step 514).
  • FIG. 6 is a flow diagram illustrating the steps for processing a query, e.g., an XPATH query.
  • the query is parsed and the BLOB corresponding to the document identified in the query is retrieved from the auxiliary database 46 .
  • the root query node is set as the query node
  • the root node of the retrieved BLOB is set as the current node to be compared to the query node.
  • the ID, the children count, the attributes count, the node type, and the IDs associated with any attribute-value pair of the current node are retrieved.
  • In Step 605, the words associated with the current node's ID and the IDs associated with each attribute-value pair are retrieved from the dictionary. If the node type of the current node is element (or the ID is an attribute ID), the element node dictionary is used. If the node type of the current node is not an element (or the ID is a value ID), the non-element node dictionary is used.
  • In Step 606, the retrieved word and the query node are compared, and also any attributes defined in the query node are compared with the corresponding attributes defined in the current node. If there is a match in Step 606 and there are no more query nodes (Step 607), the query response is compiled (Step 608) and the process ends.
  • If there are additional query nodes, flow proceeds to the decision block in Step 609. If children count > 0, the next query node becomes the (current) query node and the first child node of the current node becomes the current node to be compared (Step 610), and flow returns to Step 604. If children count is 0, the query cannot be processed and an error is returned (Step 611).
  • Step 612 determines if any of the current node's sister nodes matches the query node and any attributes of the query node. If the current node has sister nodes then the next sister node becomes the current node to be compared (Step 613 ) and flow proceeds to Step 604 . If there are no sister nodes to the current node or all sister nodes have been processed for comparison, an error is returned in Step 614 .
  • the child nodes may be processed in parallel instead of in series as described in connection with FIG. 6 .
  • the parallel processing of the child nodes is illustrated in FIG. 7 .
  • In Step 701, the query is parsed and the BLOB corresponding to the document identified in the query is retrieved from the auxiliary database 46.
  • In Step 702, the root query node is set as the query node, and in Step 703, the root node of the retrieved BLOB is set as the current node to be compared to the query node.
  • In Step 704, the ID, the children count, the attributes count, the node type, and the IDs associated with any attribute-value pair of the current node are retrieved.
  • In Step 705, the words associated with the current node's ID and the IDs associated with each attribute-value pair are retrieved from the dictionary. If the node type of the current node is element (or the ID is an attribute ID), the element node dictionary is used. If the node type of the current node is not an element (or the ID is a value ID), the non-element node dictionary is used.
  • In Step 706, the retrieved word and the query node are compared, and also any attributes defined in the query node are compared with the corresponding attributes defined in the current node. If there is a match in Step 706 and there are no more query nodes (Step 707), the query response is compiled (Step 708) and the process ends.
  • If there are additional query nodes, flow proceeds to the decision block in Step 709.
  • In Step 710, if children count > 0, the next query node becomes the (current) query node and Steps 704-709 are executed as a separate process for each child node. If children count is 0, the query cannot be processed and an error is returned (Step 711).
  • If, in the decision block of Step 706, there is no match in the comparisons made, flow proceeds to Step 712 where the process is exited. If none of the other child node processes that are running in parallel with the child node process that exited in Step 712 found a match in Step 706 or if there is no other child node process, an error is returned (Step 713).
  • HDDomObject interprets the bytes associated with the root node of the BLOB. If there is a query made against it, it first validates that the root node matches the first node of the query. If the root node matches the first node of the query, then it creates an HDDomObject for each of its children nodes and delegates the query to each child with corresponding part of the BLOB.
  • the HDDomObject class objects are constructed on the stack so they are very fast compared to creating the objects on the heap. In many cases, the search query narrows down as the XML DOM tree is traversed downwardly and so the unpacking is done only for a fraction of the BLOB, thereby speeding up the application.
  • When the BLOB is taken from the auxiliary database 46, it is reference counted. The reference count on the BLOB is incremented for each HDDomObject that is created. As a result, HDDomObject does not have to manage memory itself, which becomes error-prone as the tree grows. When the last XML DOM node goes out of scope, the object that reference counts the BLOB automatically frees it.
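  • The query walk and the reference-counted BLOB described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the HDDomObject implementation, whose internals the text does not give. NodeView and findPath are hypothetical names. The sketch assumes every tuple field is a 32-bit integer, that nodes are stored in document order, and that the query has already been translated into element-dictionary IDs, so it compares IDs rather than the words retrieved from the dictionary. The reference count comes from std::shared_ptr.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

typedef std::vector<std::int32_t> Blob;

// Lightweight view of one node record inside a shared BLOB (hypothetical
// stand-in for HDDomObject). Views live on the stack; each copy bumps the
// shared_ptr reference count, and the BLOB is freed with the last view.
struct NodeView {
    std::shared_ptr<const Blob> blob;
    std::size_t pos;  // index of the record's first field

    std::int32_t id() const { return (*blob)[pos]; }
    std::int32_t children() const { return (*blob)[pos + 1]; }
    std::int32_t attributes() const { return (*blob)[pos + 2]; }
    std::int32_t type() const { return (*blob)[pos + 3]; }

    // Index just past this record: 4 header fields plus 2 IDs per attribute.
    std::size_t recordEnd() const {
        return pos + 4 + 2 * static_cast<std::size_t>(attributes());
    }
    // Index just past this node's whole subtree. Only counts are read;
    // no dictionary lookups are needed to skip a subtree.
    std::size_t subtreeEnd() const {
        std::size_t p = recordEnd();
        for (std::int32_t i = 0; i < children(); ++i)
            p = NodeView{blob, p}.subtreeEnd();
        return p;
    }
};

// Matches query[depth..] against node. Returns the index of the record that
// matches the last query step, or -1 if no path through node matches.
long findPath(const NodeView &node, const std::vector<std::int32_t> &query,
              std::size_t depth) {
    if (depth >= query.size()) return -1;
    if (node.type() != 0 || node.id() != query[depth]) return -1;
    if (depth + 1 == query.size()) return static_cast<long>(node.pos);
    std::size_t p = node.recordEnd();
    for (std::int32_t i = 0; i < node.children(); ++i) {
        NodeView child{node.blob, p};
        long hit = findPath(child, query, depth + 1);
        if (hit >= 0) return hit;  // stop here; the rest stays packed
        p = child.subtreeEnd();    // move on to the next sister node
    }
    return -1;
}
```

  • The children are visited in series here, as in FIG. 6. A parallel variant in the spirit of FIG. 7 would hand each child view to its own worker instead of looping.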

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

A document management system manages a large number of XML documents on an efficient and cost-effective basis. Storage requirements are reduced, because compressed versions of the XML documents, which are much smaller in size than the XML documents themselves, are used when processing queries. Processing requirements are reduced, because parsing is not a required step when processing queries. Instead of parsing, the query is processed by unpacking the compressed version of the document identified in the query, node by node until enough information has been decoded to satisfy the query. Processing speed is improved in two ways. First, unpacking as carried out according to the invention is a much faster process than parsing. Second, the entire document need not be unpacked.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application is a continuation of U.S. application Ser. No. 10/990,426, filed Nov. 16, 2004, which claims the benefit of Provisional Patent Application No. 60/605,927, filed Aug. 31, 2004, both of which are incorporated herein by reference in their entirety.
  • BACKGROUND OF THE INVENTION
  • 1. Field of the Invention
  • The present invention relates generally to document management, and more particularly, to a method and system for optimally storing and retrieving documents having a hierarchical structure such as XML documents.
  • 2. Description of the Related Art
  • Extensible Markup Language (XML) is a universally accepted format for representing structured data in textual form. The XML format embeds content within tags that express its structure. XML makes it possible for different tools, applications and repositories on a variety of platforms and middleware to meaningfully share data and to easily search for data that is embedded in the XML documents.
  • XML documents are typically managed using a database. When specific information from an XML document is desired, a query is issued. In response to the query, the XML document identified in the query is retrieved from the database and parsed, and the desired information is extracted from the parsed XML document. The parsed XML document is commonly known as an XML DOM (Document Object Model). When the number and size of the XML documents stored in the database is very large, the processing of the queries carried out as described above requires expensive storage and becomes computationally expensive.
  • SUMMARY OF THE INVENTION
  • The invention provides a document management system that manages a large number of XML documents or any other documents having a hierarchical structure on an efficient and cost-effective basis. Storage requirements are reduced, because compressed versions of such documents, which are much smaller in size than the documents themselves, are stored in a database that is accessed when processing queries. Processing requirements are reduced, because parsing is not a required step when processing queries. Instead of parsing, the query is processed by unpacking the compressed version of the document identified in the query node-by-node until enough information has been unpacked to satisfy the query. Processing speed is improved in two ways. First, unpacking as carried out according to the invention is a much faster process than parsing. Second, the entire document does not need to be unpacked.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
  • FIG. 1 illustrates a block diagram of a document management system that implements an embodiment of the invention;
  • FIG. 2 is a sample XML document;
  • FIG. 3 is the XML DOM of the sample XML document of FIG. 2;
  • FIG. 4 is a binary object converted from the XML DOM of FIG. 3;
  • FIG. 5 is a flow diagram illustrating the steps of creating a binary object from an XML document;
  • FIG. 6 is a flow diagram illustrating the steps for processing a query in which the child nodes of the XML document are processed in series; and
  • FIG. 7 is a flow diagram illustrating the steps for processing a query in which the child nodes of the XML document are processed in parallel.
  • DETAILED DESCRIPTION
  • FIG. 1 illustrates a block diagram of a document management system for XML documents. The document management system includes a web server 20 that receives HTTP requests made over the Internet 30 and transmits HTML documents in response to the HTTP requests. If the HTTP request includes a request for information in an XML document, the web server 20 passes on the HTTP request to an XML server 40. The XML server 40 receives the request, processes it and then transmits an XML response in return. The web server 20 creates the HTML response transmitted over the Internet 30 from the XML response received from the XML server 40 using XSLT (Extensible Stylesheet Language Transformations).
  • The XML server 40 includes an application program interface (API) 42, a binary large object (BLOB) process 44, an auxiliary database 46, and a query process 48. The API 42 represents a set of routines, protocols, and tools used in converting the HTTP request into an XPATH query and in creating the XML response transmitted to the web server 20 based on the results of the query process 48. The BLOB process 44 converts XML documents into BLOBs and stores them in the auxiliary database 46. Each BLOB is stored in the auxiliary database 46 against a unique key, which is typically the title of the XML document that has been converted. The auxiliary database 46 can be any database that is capable of storing files against keys that are used as file identifiers. The query process 48 executes a query (e.g., an XPATH query) from the web server 20. It first retrieves a BLOB corresponding to the document identified in the query from the auxiliary database 46 and unpacks the BLOB to the extent necessary to process the query. Details of the BLOB process 44 and the query process 48 are set forth below.
  • The XML documents that are created or received from another document management system are stored in their original text form in an SQL database 50 and replicated in SQL slave databases 60. Any external entity or process (not shown), which wants to put one or more XML documents in the auxiliary database 46 may employ the BLOB process 44 to do so. First, the BLOB process 44 is initialized with an “hdinit” call. On successful initialization, the external entity or process calls the “hdprocess” for each document that is to be placed in the auxiliary database 46. The “hdprocess” is defined as:
  • int hdprocess(const char *path, const char *data, unsigned int size, unsigned int deletion).
  • The path argument refers to the key used to identify the document uniquely (e.g., the title of this document). The data contains the XML text representing the content of the document. The size argument is the length of this document in bytes and the deletion flag is set to a non-zero value when the document corresponding to path needs to be deleted from, instead of added to, the auxiliary database 46. The deletion flag is redundant since the size argument set to zero automatically means that the document needs to be deleted. After “hdprocess” is called for each document that is to be placed in the auxiliary database 46, “hdfini” is called to indicate completion of the operation.
  • Each of the calls, “hdinit,” “hdprocess,” and “hdfini,” is described below.
  • hdinit: This is the initialization method for the BLOB process 44. It first initializes a memory-mapped dictionary of words that is used by “hdprocess.” This dictionary maps words appearing in XML documents to IDs that require much less memory. Because XML documents are very verbose and a lot of words in the document are repetitive, a lot of memory can be saved if, instead of storing the words, the associated IDs of the words are stored in the BLOBs. The “hdinit” method also initializes the underlying database (the auxiliary database 46 in FIG. 1), which is capable of storing any sequence of bytes as a key and any sequence of bytes as data associated with the key. In the embodiment illustrated in FIG. 1, Berkeley DB-4 may be used. Besides the above two subsystems, the “hdinit” method creates an instance of the object, hdprocess, that parses the XML document, removes unwanted white spaces, maps all the words appearing in the XML documents to the IDs in the dictionary, and creates the packed (compressed) BLOBs which are ready to be put in the database.
  • hdprocess: This is the method that generates the BLOB corresponding to the XML data and stores the BLOB in the database against the key represented by the path argument. In generating the BLOB, it parses the XML data in the data argument, identifies all unwanted white spaces usually appearing between the end of one element and the beginning of the next element, and maps all text appearing in the XML data to associated IDs in the dictionary. Any text for which an ID has not already been assigned is assigned a new ID during this process. These IDs are created in such a way that they are consistent across multiple processes. One simple way to achieve this is to use the positional offset of the word from the beginning of the dictionary file. For parsing the XML document, any conventional parser may be used. In the embodiment of the invention illustrated herein, expat, a stream-oriented XML parser with a SAX-style (Simple API for XML) event interface, is used.
  • hdfini: This method does the exact opposite of the “hdinit” method. It closes the dictionary, flushes the database content from the memory to the disk and closes the database. Also, it releases the resources reserved by the parser that were used for parsing the XML document.
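  • The call sequence described above can be sketched as follows. Only the “hdprocess” signature comes from the text. The “hdinit” and “hdfini” signatures, the 0 and -1 return codes, and the in-memory std::map standing in for the auxiliary database 46 are assumptions made for illustration. The sketch stores raw bytes instead of a packed BLOB so that the add and delete semantics of the size and deletion arguments can be exercised on their own.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for the auxiliary database: any store that maps a byte-string
// key to byte-string data would do (the text mentions Berkeley DB).
static std::map<std::string, std::string> g_store;
static bool g_open = false;

// Hypothetical signature: the text names hdinit but does not define it.
int hdinit() {
    g_open = true;
    return 0;
}

// Signature as given in the text. A non-zero deletion flag or a zero size
// removes the document stored under path; otherwise the document is stored.
int hdprocess(const char *path, const char *data,
              unsigned int size, unsigned int deletion) {
    if (!g_open || path == nullptr) return -1;
    if (deletion != 0 || size == 0) {
        g_store.erase(std::string(path));
        return 0;
    }
    if (data == nullptr) return -1;
    // A real implementation would parse and pack the XML here; this sketch
    // keeps the raw bytes so the call sequence can be tested in isolation.
    g_store[std::string(path)] = std::string(data, size);
    return 0;
}

// Hypothetical signature: the text names hdfini but does not define it.
int hdfini() {
    g_open = false;
    return 0;
}
```

  • A caller would invoke hdinit once, hdprocess once per document, and hdfini at the end. Calling hdprocess with a size of zero removes the document even when the deletion flag is zero, which is why the text calls the flag redundant.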
  • FIG. 2 is a sample XML document. After parsing, the XML DOM of the XML document in FIG. 2 may be graphically represented as shown in FIG. 3. The dictionary for the XML element nodes when built completely for the XML document in FIG. 2 is shown in the following table.
  • Byte Offset (ID)    Word
    0                   company
    8                   employees
    18                  employee
    27                  id
    30                  name
    35                  type
    40                  dept
    45                  title
  • The dictionary for the XML non-element nodes is shown in the following table.
  • Byte Offset (ID)    Word
    0                   chetan
    7                   Chetan Narsude
    22                  Permanent Fulltime
    41                  Yahoo! Finance
    56                  Engineering Manager I
    78                  kekre
    84                  Amol Kekre
    95                  Engineering Manager II
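  • The byte offsets in the two tables above are consistent with each word being stored followed by a single terminating byte, so that the next word starts at the previous offset plus the word length plus one. A minimal sketch of such a dictionary follows. The class name OffsetDictionary, the NUL terminator, and the in-memory buffer standing in for the memory-mapped dictionary file are assumptions, not details given in the text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Append-only word dictionary. Each word is stored NUL-terminated in one
// contiguous buffer, and its ID is the byte offset at which it starts.
class OffsetDictionary {
public:
    // Returns the existing ID for word, or appends it and returns a new ID.
    std::size_t intern(const std::string &word) {
        auto it = ids_.find(word);
        if (it != ids_.end()) return it->second;
        std::size_t id = buffer_.size();
        buffer_.insert(buffer_.end(), word.begin(), word.end());
        buffer_.push_back('\0');
        ids_[word] = id;
        return id;
    }

    // Recovers the word stored at an ID previously returned by intern.
    std::string lookup(std::size_t id) const {
        return std::string(&buffer_[id]);
    }

private:
    std::vector<char> buffer_;  // stands in for the memory-mapped file
    std::unordered_map<std::string, std::size_t> ids_;
};
```

  • Interning the element names in the order company, employees, employee, id, name, type, dept, title yields the offsets 0, 8, 18, 27, 30, 35, 40 and 45 listed in the first table. Because an ID depends only on the buffer contents, any process that maps the same file sees the same IDs.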
  • Each node in the BLOB, after “hdprocess” is performed on an XML document, is represented by the following tuple:
  • int Root_Identifier
    int Children_Count
    int Attributes_Count
    int NodeType

    where Root_Identifier is the byte offset (ID) of the tag associated with the node; Children_Count is the number of child nodes; Attributes_Count is the number of attributes of the node; and NodeType is the node type, which may be:
      • const NodeType NodeElement = 0 (for an element node);
      • const NodeType NodeText = 1 (for a text node);
      • const NodeType NodeCData = 2 (for a CDATA node);
      • const NodeType NodeComment = 3 (for a comment node);
      • const NodeType NodeRaw = 4 (for a raw data node).
    If Attributes_Count > 0, the tuple further comprises two additional byte offsets (IDs) for each attribute-value pair. The attribute is defined in the element node dictionary and the value is defined in the non-element node dictionary.
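  • The tuple layout above can be sketched as a variable-length record of integers. The sketch assumes each field occupies one 32-bit integer and that Attributes_Count is written as the number of attribute-value pairs that follow. The text only says “int”, so the field width is an assumption. NodeRecord, appendRecord and readRecord are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

enum NodeType : std::int32_t {
    NodeElement = 0, NodeText = 1, NodeCData = 2, NodeComment = 3, NodeRaw = 4
};

struct NodeRecord {
    std::int32_t root_identifier;  // dictionary ID of the tag or text
    std::int32_t children_count;   // number of child nodes
    std::int32_t node_type;        // one of the NodeType values
    // (attribute ID, value ID) pairs; Attributes_Count is attributes.size()
    std::vector<std::pair<std::int32_t, std::int32_t>> attributes;
};

// Appends one record to the BLOB in the field order given in the text:
// Root_Identifier, Children_Count, Attributes_Count, NodeType, then pairs.
void appendRecord(std::vector<std::int32_t> &blob, const NodeRecord &r) {
    blob.push_back(r.root_identifier);
    blob.push_back(r.children_count);
    blob.push_back(static_cast<std::int32_t>(r.attributes.size()));
    blob.push_back(r.node_type);
    for (const auto &av : r.attributes) {
        blob.push_back(av.first);
        blob.push_back(av.second);
    }
}

// Reads the record that starts at pos and advances pos past it.
NodeRecord readRecord(const std::vector<std::int32_t> &blob,
                      std::size_t &pos) {
    NodeRecord r;
    r.root_identifier = blob[pos++];
    r.children_count = blob[pos++];
    std::int32_t attributes_count = blob[pos++];
    r.node_type = blob[pos++];
    for (std::int32_t i = 0; i < attributes_count; ++i) {
        std::int32_t attr_id = blob[pos++];
        std::int32_t value_id = blob[pos++];
        r.attributes.push_back(std::make_pair(attr_id, value_id));
    }
    return r;
}
```

  • A node without attributes takes four integers and each attribute-value pair adds two, so a reader can step from one record to the next using the counts alone.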
  • FIG. 4 illustrates the BLOB corresponding to the XML document in FIG. 2. The tuples shown in FIG. 4 are stored contiguously in memory for the auxiliary database 46, and are associated with the key for the XML document in FIG. 2.
  • FIG. 5 is a flow diagram illustrating the steps of creating a BLOB from an XML document. In Step 501, the XML document is parsed to generate the XML DOM of the XML document. Any conventional XML parser may be used. During parsing, white spaces (e.g., new line, tab and space characters) that appear before an opening element tag or after a closing element tag, but not between the tags, are removed. In Step 502, the root node of the XML document is retrieved as the current node for processing. In Step 503, the node type of the current node is determined.
  • In Step 504, the dictionary used with the “hdprocess” method is retrieved to see if the current node is stored as a term in the dictionary. If the node type is an element, then an element node dictionary is retrieved. If the node type is not an element, then a non-element node dictionary is retrieved. If the current node is not stored as a term in the dictionary, it is added to the dictionary and an ID is assigned (Step 505). The ID assigned corresponds to the positional offset (in bytes) in memory of the stored term with respect to the beginning of the dictionary. If the current node already appears in the dictionary, flow proceeds to Step 506, where the ID associated with the current node is retrieved.
  • In Step 507, the number of attributes and the number of children nodes corresponding to the current node are determined, and in Step 508, the ID, the children count, the attributes count, the node type, and all IDs associated with each attribute-value pair (if any) in the dictionary are stored. The dictionary used for the attributes and their associated values is the same as the dictionary used for the nodes, and the terms for attributes and/or values not found in the dictionary are created and assigned IDs in the same manner as for the nodes. The element node dictionary is used for the attributes and the non-element node dictionary is used for the values.
  • After the current node is processed, its children nodes are processed one-by-one in the same manner (Steps 509-510 and Steps 503-508). If there are no children nodes or all children nodes have been processed, the current node's sister nodes are processed one-by-one in the same manner (Steps 511-512 and Steps 503-508). If there are no sister nodes or all sister nodes have been processed, the parent node becomes the current node (Step 513). If this node is not the root node (Step 514), any sister nodes of this node are processed one-by-one in the same manner as before (Steps 511-512 and Steps 503-508). The processing ends when the current node becomes the root node (Step 514).
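  • The packing procedure of FIG. 5 can be sketched as follows, using Python's standard `xml.dom.minidom` parser. Several things here are assumptions rather than the described implementation: the traversal is written recursively (it visits nodes in the same document order as the child/sister/parent walk described above), integers are packed as 32-bit little-endian values, whitespace removal is approximated by dropping whitespace-only text nodes that sit next to element siblings, and the sample XML is a reconstruction limited to the first employee. All function and class names are hypothetical.

```python
import struct
from xml.dom import minidom, Node

TYPE_OF = {Node.ELEMENT_NODE: 0, Node.TEXT_NODE: 1,
           Node.CDATA_SECTION_NODE: 2, Node.COMMENT_NODE: 3}


class OffsetDictionary:
    """Terms stored back to back (NUL-terminated, an assumption); ID = byte offset."""
    def __init__(self):
        self.buf, self.ids = bytearray(), {}

    def intern(self, term):
        if term not in self.ids:
            self.ids[term] = len(self.buf)
            self.buf += term.encode("utf-8") + b"\x00"
        return self.ids[term]


def kept_children(node):
    """Children to pack: whitespace-only text between element tags is dropped."""
    kids = [c for c in node.childNodes if c.nodeType in TYPE_OF]
    has_elements = any(c.nodeType == Node.ELEMENT_NODE for c in kids)
    return [c for c in kids
            if not (has_elements and c.nodeType == Node.TEXT_NODE and not c.data.strip())]


def pack(node, elements, values, out):
    """Append node, then its subtree, to out in document order."""
    kids = kept_children(node)
    attrs = []
    if node.nodeType == Node.ELEMENT_NODE:
        term_id = elements.intern(node.tagName)          # element node dictionary
        for i in range(node.attributes.length):
            a = node.attributes.item(i)
            attrs.append((elements.intern(a.name), values.intern(a.value)))
    else:
        term_id = values.intern(node.data)               # non-element node dictionary
    out += struct.pack("<4i", term_id, len(kids), len(attrs), TYPE_OF[node.nodeType])
    for pair in attrs:
        out += struct.pack("<2i", *pair)
    for kid in kids:
        pack(kid, elements, values, out)
    return out


xml = """<company>
  <employees>
    <employee id="chetan"><name>Chetan Narsude</name><type>Permanent Fulltime</type><dept>Yahoo! Finance</dept><title>Engineering Manager I</title></employee>
  </employees>
</company>"""

elements, values = OffsetDictionary(), OffsetDictionary()
blob = pack(minidom.parseString(xml).documentElement, elements, values, bytearray())
```

  • For this input the sketch emits eleven node tuples plus one attribute-value pair, and the dictionary IDs it assigns agree with the first entries of the two dictionary tables.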
  • FIG. 6 is a flow diagram illustrating the steps for processing a query, e.g., an XPATH query. In Step 601, the query is parsed and the BLOB corresponding to the document identified in the query is retrieved from the auxiliary database 46. In Step 602, the root query node is set as the query node, and in Step 603, the root node of the retrieved BLOB is set as the current node to be compared to the query node. In Step 604, the ID, the children count, the attributes count, the node type, and the IDs associated with any attribute-value pair of the current node are retrieved. In Step 605, the words associated with the current node's ID and the IDs associated with each attribute-value pair are retrieved from the dictionary. If the node type of the current node is element (or the ID is an attribute ID), the element node dictionary is used. If the node type of the current node is not an element (or the ID is a value ID), the non-element node dictionary is used.
  • In Step 606, the retrieved word is compared with the query node, and any attributes defined in the query node are compared with the corresponding attributes defined in the current node. If there is a match in Step 606 and there are no more query nodes (Step 607), the query response is compiled (Step 608) and the process ends. Compiling the query response typically involves unpacking all nodes that originate from the last query node. For example, for the query /company/employees/employee[@id='chetan'], the following portion of the XML DOM is compiled as the query response:
      • <employee id="chetan"><name>Chetan Narsude</name><type>Permanent Fulltime</type><dept>Yahoo! Finance</dept><title>Engineering Manager I</title></employee>
  • If there are additional query nodes, flow proceeds to the decision block in Step 609. If the children count is greater than 0, the next query node becomes the (current) query node and the first child node of the current node becomes the current node to be compared (Step 610), and flow returns to Step 604. If the children count is 0, the query cannot be processed and an error is returned (Step 611).
  • If, in the decision block of Step 606, there is no match in the comparisons made, flow proceeds to Step 612, to determine if any of the current node's sister nodes matches the query node and any attributes of the query node. If the current node has sister nodes then the next sister node becomes the current node to be compared (Step 613) and flow proceeds to Step 604. If there are no sister nodes to the current node or all sister nodes have been processed for comparison, an error is returned in Step 614.
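  • The serial matching loop of FIG. 6 can be sketched as below. This is an illustration under stated assumptions, not the described implementation: the query is taken as an already-parsed list of (tag, attributes) steps instead of an XPATH string, the dictionaries are plain Python mappings from ID to word, integers are 32-bit little-endian, and the BLOB is a hand-built, trimmed-down example in which each employee keeps only a name child. Because tuples are stored contiguously, moving to a sister node requires skipping the current node's whole subtree, which `skip_subtree` does by following the children counts. All names are hypothetical.

```python
import struct


def read_node(blob, pos):
    """Decode one tuple; return (term_id, children, node_type, attr_pairs, next_pos)."""
    term_id, children, n_attrs, node_type = struct.unpack_from("<4i", blob, pos)
    pos += 16
    attrs = [struct.unpack_from("<2i", blob, pos + 8 * i) for i in range(n_attrs)]
    return term_id, children, node_type, attrs, pos + 8 * n_attrs


def skip_subtree(blob, pos):
    """Return the position just past the node at pos and all of its descendants."""
    _, children, _, _, pos = read_node(blob, pos)
    for _ in range(children):
        pos = skip_subtree(blob, pos)
    return pos


def find(blob, elements, values, steps):
    """Resolve steps [(tag, {attr: value}), ...]; return the byte position of the match."""
    pos, siblings = 0, 1                      # the root node has no sister nodes
    for depth, (tag, wanted) in enumerate(steps):
        while True:
            term_id, children, node_type, attrs, after = read_node(blob, pos)
            have = {elements[a]: values[v] for a, v in attrs}
            if (node_type == 0 and elements[term_id] == tag
                    and all(have.get(k) == v for k, v in wanted.items())):
                break                         # match (Step 606)
            siblings -= 1
            if siblings == 0:                 # no sister nodes left (Step 614)
                raise LookupError("no node matches query step %d" % depth)
            pos = skip_subtree(blob, pos)     # next sister node (Step 613)
        if depth == len(steps) - 1:
            return pos                        # last query node matched (Step 608)
        if children == 0:                     # nothing below to compare (Step 611)
            raise LookupError("query is deeper than the document")
        pos, siblings = after, children       # descend to the first child (Step 610)


def node(term_id, children, node_type=0, attrs=()):
    data = struct.pack("<4i", term_id, children, len(attrs), node_type)
    return data + b"".join(struct.pack("<2i", *p) for p in attrs)


elements = {0: "company", 8: "employees", 18: "employee", 27: "id", 30: "name"}
values = {0: "chetan", 7: "Chetan Narsude", 78: "kekre", 84: "Amol Kekre"}
blob = (node(0, 1) + node(8, 2)
        + node(18, 1, attrs=[(27, 78)]) + node(30, 1) + node(84, 0, 1)
        + node(18, 1, attrs=[(27, 0)]) + node(30, 1) + node(7, 0, 1))

hit = find(blob, elements, values,
           [("company", {}), ("employees", {}), ("employee", {"id": "chetan"})])
```

  • In this hand-built BLOB the first employee does not match the attribute test, so the loop skips its subtree and matches the second employee. As in the flow of FIG. 6, the sketch does not backtrack: once a node matches a query step, only that node's children are examined for the next step.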
  • Alternatively, the child nodes may be processed in parallel instead of in series as described in connection with FIG. 6. The parallel processing of the child nodes is illustrated in FIG. 7.
  • In Step 701, the query is parsed and the BLOB corresponding to the document identified in the query is retrieved from the auxiliary database 46. In Step 702, the root query node is set as the query node, and in Step 703, the root node of the retrieved BLOB is set as the current node to be compared to the query node. In Step 704, the ID, the children count, the attributes count, the node type, and the IDs associated with any attribute-value pair of the current node are retrieved. In Step 705, the words associated with the current node's ID and the IDs associated with each attribute-value pair are retrieved from the dictionary. If the node type of the current node is element (or the ID is an attribute ID), the element node dictionary is used. If the node type of the current node is not an element (or the ID is a value ID), the non-element node dictionary is used.
  • In Step 706, the retrieved word is compared with the query node, and any attributes defined in the query node are compared with the corresponding attributes defined in the current node. If there is a match in Step 706 and there are no more query nodes (Step 707), the query response is compiled (Step 708) and the process ends. Compiling the query response typically involves unpacking all nodes that originate from the last query node. For example, for the query /company/employees/employee[@id='chetan'], the following portion of the XML DOM is compiled as the query response:
      • <employee id="chetan"><name>Chetan Narsude</name><type>Permanent Fulltime</type><dept>Yahoo! Finance</dept><title>Engineering Manager I</title></employee>
  • If there are additional query nodes, flow proceeds to the decision block in Step 709. In Step 710, if the children count is greater than 0, the next query node becomes the (current) query node and Steps 704-709 are executed as a separate process for each child node. If the children count is 0, the query cannot be processed and an error is returned (Step 711).
  • If, in the decision block of Step 706, there is no match in the comparisons made, flow proceeds to Step 712 where the process is exited. If none of the other child node processes that are running in parallel with the child node process that exited in Step 712 found a match in Step 706 or if there is no other child node process, an error is returned (Step 713).
  • Applications that need to use the XML document make a call against the auxiliary database 46 with the key corresponding to that document. The auxiliary database 46 returns the BLOB corresponding to the XML document, which was originally packed and stored against the key with the “hdprocess” method. This BLOB is wrapped with a class called HDDomObject. HDDomObject interprets the bytes associated with the root node of the BLOB. If a query is made against it, it first validates that the root node matches the first node of the query. If the root node matches the first node of the query, it creates an HDDomObject for each of its children nodes and delegates the query to each child with the corresponding part of the BLOB. Each child then behaves as if it were the root node of the BLOB passed to it and recursively tries to resolve the query. The HDDomObject class objects are constructed on the stack, so they are very fast compared to creating the objects on the heap. In many cases, the search query narrows down as the XML DOM tree is traversed downwardly, so the unpacking is done only for a fraction of the BLOB, thereby speeding up the application.
  • Furthermore, once the BLOB is taken from the auxiliary database 46, it is reference counted. Reference count on the BLOB is incremented for each HDDomObject that is created. As a result, HDDomObject does not have to worry about memory management, which becomes messy as the tree grows. When the last XML DOM node goes out of scope, the object which reference counts the BLOB automatically frees it.
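  • The wrapper-and-delegate behavior described above can be sketched as follows. `DomView` is a hypothetical stand-in for HDDomObject, not its actual interface. It assumes 32-bit little-endian tuples, a plain mapping from ID to word in place of the element node dictionary, and a hand-built, trimmed-down BLOB; attribute predicates are omitted for brevity. In Python, the interpreter's own reference counting plays the role of the explicit count on the BLOB: every view holds a reference to the same buffer, which is released once the last view goes away.

```python
import struct


class DomView:
    """A lightweight window onto one node of a shared BLOB; no data is copied."""
    __slots__ = ("blob", "pos", "term_id", "children", "node_type", "attrs", "body")

    def __init__(self, blob, pos=0):
        self.blob, self.pos = blob, pos       # all views reference the same buffer
        self.term_id, self.children, n_attrs, self.node_type = \
            struct.unpack_from("<4i", blob, pos)
        self.attrs = [struct.unpack_from("<2i", blob, pos + 16 + 8 * i)
                      for i in range(n_attrs)]
        self.body = pos + 16 + 8 * n_attrs    # where the first child, if any, starts

    def end(self):
        """Position just past this node's subtree."""
        pos = self.body
        for _ in range(self.children):
            pos = DomView(self.blob, pos).end()
        return pos

    def child_views(self):
        """Yield one view per child node, created only when asked for."""
        pos = self.body
        for _ in range(self.children):
            child = DomView(self.blob, pos)
            yield child
            pos = child.end()

    def query(self, tags, elements):
        """Return views matching the path in tags, treating this node as the root."""
        if self.node_type != 0 or elements[self.term_id] != tags[0]:
            return []
        if len(tags) == 1:
            return [self]
        return [hit for child in self.child_views()
                for hit in child.query(tags[1:], elements)]


def node(term_id, children, node_type=0, attrs=()):
    data = struct.pack("<4i", term_id, children, len(attrs), node_type)
    return data + b"".join(struct.pack("<2i", *p) for p in attrs)


elements = {0: "company", 8: "employees", 18: "employee", 27: "id", 30: "name"}
blob = (node(0, 1) + node(8, 2)
        + node(18, 1, attrs=[(27, 78)]) + node(30, 1) + node(84, 0, 1)
        + node(18, 1, attrs=[(27, 0)]) + node(30, 1) + node(7, 0, 1))

hits = DomView(blob).query(["company", "employees", "employee"], elements)
```

  • Each child view resolves the remainder of the path as if it were the root of its own part of the BLOB, and a subtree whose root does not match the current path step is never descended into.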
  • In summary, the features of the invention as applied to an XML document management system are as follows:
      • The invention works with different types of databases, so it can take advantage of the best databases available. The auxiliary database 46 simply stores the BLOBs representing the XML documents against a key, which usually is the title of the document.
      • White spaces appearing inside text tags are preserved but the others are removed during the BLOB process 44, thereby saving on the byte processing and bandwidth.
      • The entire XML or the valid XML fragments may be retrieved quickly using XPATH.
      • The invention provides for optimal unpacking of the data (i.e., the entire XML DOM need not be unpacked from the BLOB), thus boosting the performance of the application.
      • The invention provides reference-counted memory management for the BLOB, so applications do not need to manage the memory themselves.
      • Most frequently accessed elements are cached in the memory as a result of using the dictionary, and this speeds up the access.
      • A different dictionary can be plugged in thus changing the elements consistently across all of the XML documents on the fly. For example, the language of the XML documents can be easily changed by translating the words in the dictionary to the desired language.
  • While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims (15)

1: A document management system managing documents having a hierarchical structure for responding to a query, the document management system comprising:
a first storage section containing binary objects corresponding to documents having a hierarchical structure;
a second storage section including at least two dictionaries of terms, each dictionary associated with an identified type of node including corresponding terms, each term having a unique value associated therewith in the dictionary; and
a processor for generating a query result using the binary objects and at least one of the at least two dictionaries in response to queries.
2: The document management system according to claim 1, further comprising a master storage section comprising the documents.
3: The document management system according to claim 2, wherein the processor is operable for accessing the first storage section during processing of a query but not the master storage section.
4: The document management system according to claim 1, wherein the documents are XML documents and the binary objects are derived from the XML documents that have been parsed.
5: The document management system according to claim 4, wherein the at least two dictionaries comprise a first dictionary, the first dictionary of terms comprises terms corresponding to a node of a first type and a second dictionary of terms comprises terms corresponding to a node of a second type.
6: The document management system according to claim 1, wherein the unique value associated with each term in the dictionary corresponds to a positional offset in memory of said each term with respect to the beginning of the dictionary.
7: A method of managing documents having a hierarchical structure for responding to a query, the method comprising:
receiving a query; and
generating a query result using binary objects and at least one of at least two dictionaries of terms, wherein each dictionary of the at least two dictionaries is associated with an identified type of node including corresponding terms in the documents, each term having a unique value associated therewith in the dictionary, and the documents are stored as binary objects using the unique values in place of the terms.
8: The method of claim 7, wherein the binary objects are stored in a database with keys that identify the documents.
9: The method of claim 8, further comprising:
accessing a database to retrieve the binary object associated with a document identified in the query; and
comparing the query against the binary object for generating the query result.
10: The method according to claim 9, wherein comparing includes conducting first search through nodes of the document that are represented in the binary object.
11: The method according to claim 9, wherein comparing comprises:
retrieving a term associated with a root node of the document represented in the binary object; and
comparing a root node identified in the query with the retrieved term.
12: The method according to claim 11, wherein, if the root node identified in the query matches the retrieved term and there are additional node levels in the query, comparing further comprises:
retrieving terms associated with child nodes of the document that are represented in the binary object; and
comparing the next level node identified in the query with the retrieved terms.
13: The method according to claim 12, wherein, if there is a match between the next level node and one of the retrieved terms, the query is processed with respect to those nodes that originate from the child node having a term that matches the next level node, and not with respect to those nodes that originate from the other child nodes.
14: The method according to claim 11, wherein the term is retrieved based on a value associated with the term from the dictionary that associates each of a plurality of different values, including said value, with a unique term.
15: The method according to claim 9, wherein the documents are XML documents and the query comprises an XPATH query.
US12/134,176 2004-08-31 2008-06-05 Optimal storage and retrieval of xml data Abandoned US20080281815A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/134,176 US20080281815A1 (en) 2004-08-31 2008-06-05 Optimal storage and retrieval of xml data

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US60592704P 2004-08-31 2004-08-31
US10/990,426 US7403940B2 (en) 2004-08-31 2004-11-16 Optimal storage and retrieval of XML data
US12/134,176 US20080281815A1 (en) 2004-08-31 2008-06-05 Optimal storage and retrieval of xml data

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US10/990,426 Continuation US7403940B2 (en) 2004-08-31 2004-11-16 Optimal storage and retrieval of XML data

Publications (1)

Publication Number Publication Date
US20080281815A1 (en) 2008-11-13

Family

ID=36000655

Family Applications (2)

Application Number Title Priority Date Filing Date
US10/990,426 Active 2025-07-25 US7403940B2 (en) 2004-08-31 2004-11-16 Optimal storage and retrieval of XML data
US12/134,176 Abandoned US20080281815A1 (en) 2004-08-31 2008-06-05 Optimal storage and retrieval of xml data

Country Status (2)

Country Link
US (2) US7403940B2 (en)
WO (1) WO2006026534A2 (en)

Also Published As

Publication number Publication date
US20060059184A1 (en) 2006-03-16
US7403940B2 (en) 2008-07-22
WO2006026534A3 (en) 2009-04-02
WO2006026534A2 (en) 2006-03-09

Legal Events

Date Code Title Description
AS Assignment

Owner name: YAHOO! INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:NARSUDE, CHETAN;REEL/FRAME:021055/0878

Effective date: 20041109

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: YAHOO HOLDINGS, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:YAHOO! INC.;REEL/FRAME:042963/0211

Effective date: 20170613

AS Assignment

Owner name: OATH INC., NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:YAHOO HOLDINGS, INC.;REEL/FRAME:045240/0310

Effective date: 20171231