A METHOD AND SYSTEM FOR MODELING COMPONENTS IN DOCUMENTS
FIELD OF THE INVENTION
[0001] This invention relates to the processing of electronic documents. In particular, it relates to the integration of electronic documents from diverse sources.
BACKGROUND
[0002] Modern business enterprises face the challenge of integrating business documents from diverse document sources or systems. For example, an enterprise may wish to access data in a legacy system, or in a system of a former competitor that has been acquired, for example, through a merger. Integrating the data from these various systems can be beneficial in achieving operational competitiveness by integrating internal systems within the enterprise to obtain a single view of the enterprise's data. Alternatively, such integration can be beneficial in achieving collaborative competitiveness by integrating the enterprise's systems with those of strategic trading partners.
[0003] One aspect of integration involves the use of maps or transformations which map or transform data from one source or system representation to another. Such maps can be automatically generated if the source and target documents are modeled in terms of a common vocabulary of semantic concepts. This involves deriving a semantic model for the source and target documents wherein equivalent components, each comprising a set of semantically related fields in the source and target documents, are related using the common vocabulary. In the known prior art, to derive such a semantic model, each component in the source and target documents must be manually related to an equivalent concept in the semantic model. This can be a tedious process, especially in cases where a document has a large number of components.
SUMMARY OF THE INVENTION
[0004] According to one aspect of the invention there is provided a method for transforming a source document in a source format into a target document in a target format. The method comprises, for each of the source and target formats, determining all species of a concept that can occur in a document in that format, a concept being a set of semantically related fields, and deriving a semantic model for transforming documents between the source and target formats. The semantic model contains generic information common to all species to allow automatic recognition of the concept, and specific information unique to each species to allow differentiation of the concept into a particular species thereof.
BRIEF DESCRIPTION OF THE DRAWINGS
[0005] Figure 1 shows a source document which has to be transformed into a target document and a semantic model which will be used in generating appropriate transformations;
[0006] Figure 2 shows a flowchart of operations that are performed in transforming the source document of Figure 1 into the target document of Figure 1;
[0007] Figure 3 shows a table of all possible fields for an "address" concept;
[0008] Figure 4 shows how the components within a semantic model are structured in accordance with one embodiment of the invention; and
[0009] Figure 5 shows a block diagram of a system in accordance with one embodiment of the present invention.
DETAILED DESCRIPTION
[0010] In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the invention. It will be apparent, however, to one skilled in the art that the invention can be practiced without these specific details. In other instances, structures and devices are shown in block diagram form in order to avoid obscuring the invention.
[0011] Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Moreover, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not other embodiments.
[0012] Embodiments of the present invention relate to a method for transforming data from a source document in a source format to a target document in a target format. More particularly, embodiments of the present invention allow components comprising a set of semantically related fields within a document to be modeled in a fashion that reduces the amount of manual mapping required when transforming data from a source document in a source format to a target document in a target format.
[0013] According to embodiments of the present invention a semantic concept that can occur within documents of a source or target standard (format) is modeled by assigning to it a generic meaning as well as a specific meaning. Thereafter, whenever the semantic concept appears within an actual document in the source or target standard, it is automatically assigned its generic meaning and a user is prompted to select its specific meaning from a list of specific meanings. Thus, each concept is modeled once per document standard and thereafter each occurrence of the concept within an actual document of the standard is assigned a generic and specific meaning based on the model. According to embodiments of the present invention, the specific meaning is assigned based on a location of the component within a document. Thus, embodiments of the present invention serve to reduce the degree of manual mapping required when modeling source and target documents in terms of a semantic model.
[0014] Figure 1 of the drawings shows a source document 100 (for example, an invoice or a sales order) which has to be transformed into a target document 102, such as for purposes of document integration. The source document 100 includes four components designated 100A-100D, respectively, and the target document 102 includes four components designated 102A-102D, respectively. The semantic meaning of each of the components in source document 100 and target document 102 is shown in semantic model 104, wherein each of the components of the source and target documents is related to one of the semantic components designated 104A-104D, respectively. Thus, for example, component 100A and component 102C are related and will contain information relating to a semantic concept which provides a corporate head office address. It will be seen that semantic model 104 specifies the relationship between components in source document 100 and target document 102 in terms of a common semantic concept, which allows a mapping program to use this information to automatically generate mappings for transforming data from source document 100 into target document 102.
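As a rough illustration of how a shared semantic model could let a mapping program derive mappings automatically, the following Python sketch relates each component to a semantic concept and joins the source and target sides on that concept. Only the pairing of component 100A with component 102C comes from the description above; the remaining pairings, the dictionary layout, and the function name `generate_mappings` are hypothetical and chosen purely for illustration.

```python
# Hypothetical semantic model: component id -> semantic concept label.
# Only the 100A/102C "corporate head office address" pairing comes from the
# description; the other assignments are assumed for illustration.
SOURCE_MODEL = {
    "100A": "corporate head office address",
    "100B": "shipping address",
    "100C": "billing address",
    "100D": "branch address",
}
TARGET_MODEL = {
    "102A": "billing address",
    "102B": "branch address",
    "102C": "corporate head office address",
    "102D": "shipping address",
}


def generate_mappings(source_model, target_model):
    """Pair source and target components that share a semantic concept."""
    # Invert the target model so each concept points at its target component.
    concept_to_target = {concept: comp for comp, concept in target_model.items()}
    return {
        src: concept_to_target[concept]
        for src, concept in source_model.items()
        if concept in concept_to_target
    }


mappings = generate_mappings(SOURCE_MODEL, TARGET_MODEL)
print(mappings["100A"])
```

Because both documents are described against the same vocabulary, no source component needs to be related directly to a target component; the join on the concept label does that work.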
[0015] In generating the semantic model 104, each of the components in the source and target documents will have to be manually modeled. However, it will be appreciated that because each of the components in the source and target documents is related to a generic semantic concept which provides address information, a certain portion of the information contained in each of the components will overlap. For example, each of the components may have address line, city and zip code fields which overlap. One of the advantages of the present invention is that it reduces the degree of manual mapping required to produce a semantic model such as semantic model 104, by taking advantage of the information which overlaps or is common to various components within the source and target documents, as will be described below.
[0016] Figure 2 of the drawings shows a flowchart of operations performed in accordance with one embodiment of the present invention when transforming source document 100 into target document 102. Referring to Figure 2, at block 200 a semantic model is derived for documents of the source and target data standards or formats. Block 200 involves examining allowable concepts in each of the source and target document formats and identifying a generic form for each of the concepts as well as specific forms or species of the generic form. For example, Figure 3 shows a table 300 which includes all possible fields for the "address" concept, which fields can occur or are allowable in a valid address in the source or target document of a source or target format, respectively. By examining all allowable types of addresses, it may be determined that fields containing address, city, state, and zip code information are generic in the sense that they are common to all valid types of addresses, whereas the remaining fields occur additionally in only some types of addresses. In deriving the semantic model at block 200, each semantic concept is structured into a generic concept and specific or species concepts. Each generic concept comprises fields which are common to all particular species of the concept, whereas each specific or species concept includes particular fields in addition to the common fields which are shared with the generic concept. This is shown in Figure 4 of the drawings, where a component 400 comprising fields 1 to n is structured so as to have a generic meaning 402 and a specific meaning 404. By way of example, the generic meaning 402 relates to an address and the specific meaning 404 could be a particular kind or type of address such as a "shipping address". At block 202 the semantic model is stored within a system, such as is shown in Figure 5 of the drawings, which is used to perform the transformation. At block 204 actual source and target documents that are of the source and target standards, respectively, are loaded into the system. At block 206, using the previously stored semantic model, components in the source and target documents are automatically recognized based on the generic meanings of concepts from the semantic model. At block 208, a prompt is displayed to a user, prompting the user to select a specific meaning for a particular concept. This is done based on a location of a component within a document. Thus, using document 100 as an example, each of components 100A-100D will automatically be recognized as an address concept based on the generic meaning of that concept, and thereafter a user will be prompted to select a particular meaning for each address, thus specifying whether the recognized address concept is a corporate head office address, a shipping address, a billing address, or a branch address based on a location of the component within the document. Once the user has made the selection, the selected specific meaning is assigned to the component. Thereafter, at block 212 a mapping based on the assigned meanings of concepts in source document 100 and target document 102 is created.
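The derivation and recognition steps of Figure 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the species field sets, the location table, and all function names are hypothetical, and the actual table of Figure 3 is not reproduced. The generic fields are computed as the intersection of the fields of every species, a component is recognized when it carries all generic fields, and a callback stands in for the user prompt of block 208. Treating the component's location as narrowing the list of species offered to the user is one possible reading of "based on a location" and is likewise an assumption.

```python
# Hypothetical species definitions for the "address" concept (block 200).
# Field names are assumed; the actual table of Figure 3 is not reproduced.
ADDRESS_SPECIES = {
    "shipping address": {"address", "city", "state", "zip", "dock"},
    "billing address": {"address", "city", "state", "zip", "tax_id"},
    "corporate head office address": {"address", "city", "state", "zip", "country"},
}

# Assumed table of which species may appear at which document location.
SPECIES_BY_LOCATION = {
    "header": ["corporate head office address", "billing address"],
    "line_item": ["shipping address"],
}


def derive_generic_fields(species):
    """Generic fields are those common to every species of the concept."""
    return set.intersection(*species.values())


def recognize(component_fields, generic_fields):
    """Block 206: a component matches the concept if it has all generic fields."""
    return generic_fields <= set(component_fields)


def assign_specific_meaning(location, choose):
    """Block 208 and the following assignment: offer species, record the choice."""
    candidates = SPECIES_BY_LOCATION.get(location, sorted(ADDRESS_SPECIES))
    choice = choose(candidates)  # stands in for the user prompt
    if choice not in candidates:
        raise ValueError(f"{choice!r} is not valid at location {location!r}")
    return choice


generic = derive_generic_fields(ADDRESS_SPECIES)
component = {"address": "1 Main St", "city": "Springfield", "state": "IL",
             "zip": "62701", "dock": "7"}
if recognize(component, generic):
    # Simulate a user picking the first offered species.
    meaning = assign_specific_meaning("line_item", lambda options: options[0])
    print(sorted(generic), meaning)
```

The concept is modeled once per format in `ADDRESS_SPECIES`; each later occurrence in an actual document needs only the recognition check and a single user selection.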
[0017] Referring now to Figure 5 of the drawings, reference numeral 500 generally indicates an example of a system which may be used to implement embodiments of the invention described above. The system 500 includes a memory 504, which may represent one or more physical memory devices, which may include any type of random access memory (RAM), read only memory (ROM) which may be programmable, flash memory, non-volatile mass storage device, or a combination of such memory devices. The memory 504 is connected via a system bus 512 to a processor 502. The memory 504 includes instructions 506 which, when executed by the processor 502, cause the processor to perform the methodology of the invention as discussed above. Additionally, the system 500 includes a disk drive 508 and a CD ROM drive 510, each of which is coupled to a peripheral-device and user-interface 514 via system bus 512. Processor 502, memory 504, disk drive 508 and CD ROM drive 510 are generally known in the art. Peripheral-device and user-interface 514 provides an interface between system bus 512 and various optional components connected to a peripheral bus 516, as well as to user interface components, such as a display, mouse and other user interface devices. A network interface 518 is coupled to peripheral bus 516 and provides network connectivity to system 500.
[0018] Although the present invention has been described with reference to specific exemplary embodiments, it will be evident that various modifications and changes can be made to these embodiments without departing from the broader spirit of the invention as set forth in the claims. Accordingly, the specification and drawings are to be regarded in an illustrative sense rather than in a restrictive sense.