CA2188050A1

CA2188050A1 - Data embedding

Info

Publication number: CA2188050A1
Application number: CA002188050A
Authority: CA
Inventors: Maxwell T. Ii Sandford; Theodore G. Handel
Original assignee: University of California
Current assignee: Individual
Priority date: 1995-02-23
Filing date: 1996-02-22
Publication date: 1996-08-29
Also published as: US5659726A; EP0760981A4; EP0760981A1

Abstract

A method of embedding auxiliary information into a set of host data, such as a photograph, television signal, facsimile transmission, or identification card. All such host data contain intrinsic noise, allowing pixels in the host data which are nearly identical and which have values differing by less than the noise value to be manipulated and replaced with auxiliary data. As the embedding method does not change the elemental values of the host data, the auxiliary data do not noticeably affect the appearance or interpretation of the host data. By a substantially reverse process, the embedded auxiliary data can be retrieved easily by an authorized user.

Description

~ WO 96/26494 2 i 8 8 05 0 PCTIUS96102357 DATA EMBEDDING
FIE~D OF THE INVENTION
The present invention generally relates to digital manipulation of numerical data and, more specifically, to the embedding of external data into existing data fields.
This invention was made with Government support under=
Contract No. W-7405-ENG-36 awarded by the U.S. Department of Energy. The Government has certain rights in the invention.
The use of data in digital form is revolutionizing communication throughout the world. Much of this digital communication is over wire, microwaves, and fIber optic media. Currently, data can be transmitted flawlessly over land, water, and between satellites. Satellites in orbit allow communication virtually between any two points on earth, or in space.
In many situations, it may be of benefit to send particular secondary data along with the primary data.
Secondary data could involve the clQsed captioning of television programs, identification information associated with photographs, or the sending of covert information with facsimile transmissions. Such a technique is suited also for~ use as a digital signature verifying the origin and authenticity of the primary data.
Data in digital form are transmitted routinely using wide band communications channels. Communicating in digital fashion is facilitated greatly by error-correcting software and hardware protocols that provide absolute data fidelity.
These communication systems ensure that the digital bit stream transmitted by one station is received by the other station unchanged.
SUBSTITUTE SHEET (RULE 26) W096/26494 ~ ~880so PCr/US96/02357 However, most digital data sources contain redundant information and intrinsic noise.~ ~ An example is a digital image generated by scanning a photograph, an original work of electronic ~art, or a digitlzed video signal. In the 5 scanning or digltal production process of such images, noise is introduced in the digital rendition. Additionally, image sources, such as photographic images and identification cards, contain noise resulting from the grain struc~ture of the film, optical aberrations, and subject motion. Works of 10 art contain noise which is introduced by brush str4kes, paint texture, and artistic license.
Redundancy is intrinsic to digital imaqe ~ data, because any particular numerical value o:~ the digital intensity exists in many different parts of the image. For example, a 15 given grey-level may exist in the image of~ trees,~sky, people or 4ther objects. In any digital image, the same or similar numerical picture element, or pixel value, may represent a variety of image content. This means that pixels having similar numerical values and frequency of -20 occurrence in different parts of an image can beinterchanged freely, without noticeably altering the appearance of the image or the statistical frequency of occurrence of the pixel values. -=
Redundancy also occllrs in most types of digital 25 information, whenever the same values are present more than once in the stream of numerical values representing the information. For a two-color, black and white FAX image, noise consists of the presence or absence of a black or white pixel value. Documents scanned into black and white SUBSTITUTE SHEET (RULE 26) Wo 96/26494 2 ~ 8 8 0 5 0 PCTrUS96/02357 BIT~P~ format contain runs of successlve black ~1) and white ( 0 ) values . Noise in these images introduces a variation in the length of a pixel run. Runs of the same value are present in many parts of the black and white 5 image, in different rows. This allows the present invention also to be applied to facsimile transmissions.
The existence of noise and redundant pixel information in digital data permits a process for implanting additional information in the noise component of digital data. Because 10 of the fidelity of current digital communication systems, the implanted information is preserved in transmission to the receiver, where it can be extracted. The embedding of information in this manner does not increase the bandwidth re~uired for the transmission because the data implanted 15 reside in the noise component of the host data. One may convey thereby meaningful, new information in the redundant noise component of the original data without it ever being detected by unauthorized persons.
It is therefore an object of the present invention to 20 provide apparatus and method for embedding data into a digital information stream so that the digital information is not changed significantly.
It is another object of the present invention to provide apparatus and method for thwarting unauthorized 25 access to information embedded in normal digital data.
Additional objects, advantages and novel features of the invention will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following or may "
SUBSTITUTE SHEET (RULE 26) ~ ~ 8~os~ ~ -Wo 96/2649~ PCT/US96J02357 be learned by practice of ~ the invention. The objects and advantages of the invention may be reali7ed and attained by means of the instrumentalities and combinations particularly pointed out in the appended claims.
BACKGRO~I~ OF T~IF Tl~VENTION
Additional objects, advantages and novel features of =the invention will be set forth in part in the description which follows, and in part will become ~apparent to those skilled in the art upon examination of the following or may be learned by practice of the i~vention. The objects and advantages of the invention may be realized and attained by means of the instrumentalities and co~binations particularly pointed out in the appended claims.
SUMMARY OF THE INVENTION
In accordance with the purposes of the present invention there is provided a method of embedding A~ ?ry data into host data comprising the steps of creating a digital representation of the host data consisting of elements having numerical values and cnnt;3;n;n~ a-noise component; creating a digital ;representation of the ry data i~ the form of a sequence of bits; evaluating the rLoise component of the digital representation of the host data; comparing elements of =the host data with the noise component to determine pairs of the host elements 25 having numerical values which differ by less than said value of said noise component; and replacing individual values of the elements with substAnt;~lly equivalent values :from said pairs of elements in order to embed individual bit values of the auxiliary data corresponding~to the sequence of bits of SUBSTITUTE SHEET(RULE 26) Wo 96l26494 2 1 8 8 0 5 0 PcT/US96102357 S
the ~llx;l;;~ry data; and outputting the ho~t data with the auxiliary data embedded into the host data as a f ile .
In aeeordanee with the purposes of the present invention there is f urther provided a method of extraeting embedded auxiliary data from ho~t data cnn~;l;n;ng a noise nn~pnnf~nt eomprising the steps of extraeting from the host data a bit sequenee indieative of the embedded AllX; l; i: ry data, and whieh allows for verifieation of the host data;
interpreting the host data-elernent pairs whieh differ by less than the value of the noise component and which correspond to bit values of the ~llX; 1; ~3ry data; identifying the auxiliary data bit sequenee eorresponding to the pair values; and extraeting the ~llx;l;~ry data as a file.
BRIEF DES~RIPTIO~ QF THE DRAWI~GS
~he accompanying drawings, which are incorporated in and form a part of the speeification, illustrate the embodiments of the present invention and, together with the description, serve to explain the principles of the invention. In the drawings:
FIGURE 1 is a block diagram illustrating the proeesses used in the embedding and extractior of data from a host.
FIGURE 2 is a partial listing of computer code used for determining host data pairs having similar values and for con~erting RGB cDmponents to HSI components.
FIGURE 3 is a partial listing of computer code used for eliminating duplicate host data pairs.
FIGURE 4 is a partial listing of computer code which, for Trueeolor images, introduees a eonstraint on the frequeney of oecurrence of host data palrs that minimizes the effect of embedding on the host data histogram.
SUBSTITUTE SHEET (RULE 26) alssO~5c, Wo 96~26494 PCrNS96/02357 FIGUl~E 5 is a partial listing of compute=r code that performs the actual embedding of ~auxiliary data into the host data, including the considerable information which is necessary to manipulate the data in the header 5 information, auxiliary bit-stream, and the host data files.
FIGURE ~ is a partial listing of computer code that analyzes the lengths of runs in a row of pixels in two-color facsimile host data.
FIGURE 7 is a partial listing of computer code whose 10 purpose is to ensure that the first pixel in a PACKET-START
data row starts in aF~ even column number. The location of the first pixel in the row fIags the start of the data packets.
FIGURE 8 is a partial listing of computer code for 15 embedding data into two-colo~ host images, such as facsimile transmissions .
DETAILED DESCRIPTION _ The present invention allows~ data to be embedded into a digital transmission or image without naturally discernible 20 alteration of the content and meaning ~of the transmission or image. This is made possible because of the techni~ue of the present invention, in which similar pixel values in a set of digital host data are re-ordered according to the desired embedded or implanted information. The host data 25 image examples are represented in the MICROSOFT~19 BITMAP~
(.BMP) format, so that the resulting image contains the embedded auxiliary information without that information being readily discernible. ~ ~
SUBSTITUTE SHEET(RULE 26) ~ W096l26494 2 ! 8 8 0 5 0 I_llu~ 6 A7~7 The MICROSOFT~) BITMAP~ image format is a public-domain format supporting images in the Truecolor, color palette, grey-scale, or black and white representations. Truecolor images have 24-bits per pixel element, with each byte of the 5 pixel element representing the intensity of the red, green, and blue (RGB) color component. Color palette images contain a table of the permitted RGB values. The pixel value in a color palette image is an index to this table.
Grey-scale images qive the numerical intensity of the pixel lO values. Black and white representation assigns either zero or one as one of the two possible pixel values. The invention will be made understandable in the context of the BITMAP~ image types by reference to the following description .
At the point when most sensory obtained information is represented in digital form, whether it be from video, photographs, laboratory measurements, or facsimile transmissions, the digital data contain intrinsic noise and redundant information which can be manipulated to carry 20 extra information. Through use of this invention, the extra information also can be extracted easily by an authorized and enabled receiver of the data.
Redundancy in digital image data occurs when a particular numerical value of the digital intensity exists 2~ in many different parts of the image. Redundancy is found commonly in images because a given grey-level exists in the rendition of trees, sky, clouds, people, and other objects.
The presence of noise in digital images permits the picture elements, pixeIs, to vary slightly in numerical value. For SUBSTITUTE SHEET (RULE 26) a1 saoso Wo 96/26494 r~l~rJ~, ' 8-bit digital data, the pixel numerical values range from 0-255. As the pixels having the same or similar numerical values represent a variety of image content, many values in r~i ff,~rr?nt locations of an image can be interchanged frPely.
5 The image appearance and the statistical frequency of occurrence ~f a particular pixel value are affected little by the interchanging of the spatial position of pixels close in numerical value.
Initially, from the original dlgital data (hereinafter 10 often referred to as the "host" data~, the present invention first converts the host data to digital form, if necess= ry, and then creates an image hlstogram to show the probability density of numerical pixel values occurring in the image.
The number ~of. times a particular pixel value occurs in ~he 15 image is plotted Yersus the value. For 8-bit di4ital data, the pixel Yalues range from 0-255. Of course, the lever of noise in an image will depend on the source of ~he data with different noise levels expected between photos, original artwork, digital audio, video, and facsimile 2~ transmissions. -~
The actual embedding of the auxiliary data into thehost data is a three-part process, the basic steps of which are illustrated in Figure ~. First, an estimate of the noise component of the host data is determined and used in 25 combination with an analysis of a histogram of the host data numerical values to identify pairs of values in the host data that occur with approximately the same statistical~
frequency, and that differ in value by less than the value of the noise component. Second, the position of occurrence 5UBSTITUTE SHEET (RULE 26) wo 96/2649~ 2 ~ 8 8 3 5 0 PCTNS96/0~357 of the pair values found is adjusted to embed the bitstream of the auxiliary information set. Third, the identified pairs of values in the host data are used to create a key f or the extraction of the embedded data .
Extracting embedded data inverts this process. The key placed in the image in the embedding phase specifies the pair-values which contain the embedded auxiliary information. With the pair-values known, extraction consists of recreating the auxiliary data according to the 10 positions of pixels having the pair-values given in the key.
The key data are used first to extract header information.
The header information specifies the length and the file name of the auxiliary data, and serves to validate the key.
If the image containing embedded information has been 15 modified, the header information will not extract data correctly. However, successful extraction recreates the auxiliary data exactly in an output file.
The principle of data embedding according to the present invention involves the rearrangement of certain host 20 data values in order to encode the values of the extra data which is to be added. For the purposes of this description of the invention, consider a host data set represented by 8 bits of binary information, with values ranging between 0 and 255 bits for each host data sample. Further, assume 25 that the noise value, N, for a signal, S, is given by N=~S/10, or approximately 10% of the signal value. For many data, the noise component can be approximated by a constant value or percentage, such as the 10% value used for this SUBSTITUTE SHEE~ (RULE 26) ,~11 8~3o5 0 description. Two values in the host data, di and dj, are within the noise value if ~
¦d, -d,¦ = < N 10 The frequency of occurrence or histogram value of a cer~ain value, di, is f ~di) . Data values meeting the criteria of Equation 10, and occurring in the host data wlth frequency of occurrence f (di) -f (dj) < ~, where ~ is the tolerance lO imposed for statistical equality, are candidates for =
embedding use. The values, di and dj, constitute~ a pair of data values, Pk. There are k=0, 1, 2, . . .Np such pairs in the host data set, giving a total number of embedding bits, Mk, for each pair:
~ ~ ~ ~ =
M~=~f(dj)+~f(dj) 20 where the summations fol i and j run to the limits of the frequency of occurrence in the data set, f (di) and f ~dj), 20 for the respective data values It is now helpful to refer to Figure 2, wherein a partial listing of computer code= in the C-1anguage is printed. The determination of the host data pixel pair values, di and dj, in Equation 1~, is acc mplished through 25 the code listed in Figure 2. In Figure 2, these 8 bit values are interpreted as indices in a color palette table The comparison indicated in Equation lO is therefore ~
required to be a comparison between the corresponding colors SUBSTITUTE SHEET (RULE 26) ~ Wo 96l26494 2 1 8 8 ~ 5 ~ PCT/US9610235'7 in the palette Entries in the color palette are Red, Green, and Blue (RGB) color-component values, each within the range of 0-255.
~f additlonal information is desired on the format used 5 for BITMAP(!~ images, reference should be made to two sources.
One is the book, F_, 'n~ for ~: ~.rhi~.: Files, by J.
Levine, 1994 (J. Wiley & Sons, New York) . The other is a technical article, "The BMP Format, " by M. Luse, Dr Dobb's JoYrnall Vol. 19, Page 18, 1994.
The code fragment in Figure 2 begins at line 1 with a loop running over the number of colors in the palette. The loop index, i, is used to test each palette color against all other entries, in sequence, to identify pairs of color entries meeting the criteria established by Equation 10.
15 Each color identified in the i-loop then is tested against all other colors in the palette by a second loop using another index, j, starting at line 16. Line 7 provides a modification for~images which have a palette for greyscale instead of colors. For greyscale images, the RGB components 20 are identical for each palette entry, although some grey scale formats include a 16-color table as well.
The comparison indicated in Equation 10 is made by converting the Red, Green, and Blue (RGB) color component values to corresponding Hue, Saturation, and Intensity (HSI) 25 color components. Line 12 uses a separate routine, rgbhsi (), to effect this conversion. Line 20 converts RGB
color component values in the j-loop to HSI data structure components, and line 21 calculates the color difference in the HSI system. Line 24 then implements the test required SUIISTITUTE SHEET(RULE 26) a ~88 Wo 96l26494 S ~ ~ 7 ~
1~ _ by Equation 10. If. the color difference ls less:than a~
fixed noise value (COLOR_NOISE=10 in the listing of Figure 2), the intensity difference is tested to determine if the two palette entries are acceptable as differlng by less than 5 the noise value specified. Two additional constraints are imposed before ~ccepting the entries as candidate pair ~
values. First, the difference in color is r~equlred to= be the smallest color difference between the test ~i-loop) value, and all the other (j-loop) values. Second, the 10 number of pairs selecte~ (k) must be less than half the number of columns in a row of pixels in the image, in order .
for the pair-value key to be stored in a single row of pixels. This is an algorithmic constraint, and is not required by the invention.
A data-structure array, pair[~, is used~ to hold the values of candidate pairs (i, j ) and thelr total frequency of occurrence, Mk. If the image is a greyscale palette, the test at line 35 is used to forc~ comparison of only the intensity of the two palette entries. Greyscale~images do 20 not require the RGB to HSI conversion made for color palettes.
The embedding process of the present invention ignores dif ferences in the saturation component of color=palette entries because saturation is ordinarily not noticeable in a 25 color image. Only the Hue and Intensity components are constrained to fall within fixed noise limits to determine the palette pair values.
Pixel pair values found by the code listed in Figure 2 include generally redundant values. The same pixel value, S~ IITUTESHEET(R~

Wo 96/26494 2 1 8 8 0 ~ O PCT/U5961023s7 i, is found in several diierent pair coTnbinations. secause multiple pairs cannot contain the same palette entry, due to each pair combination of pixel values having to be unique, it is necessary- to eliminate some pairs. The number of 5 pairs located by applying the criterion of Equation 10 is stored in the variable, no_pairs, in line 51.
Referring now to Figure 3, the code fragment listed therein illustrates the manner in which duplicate pairs are eliminated by a separate routine. Eirst, the histogram of 10 the image is used to calculate the :total number of occurrences in each pair, as requlred by Equation 20, above.
Line 1 shows the i-loop used to calculate the value, Mk, for each pair. Next, the pairs are sorted according to decreasing order of the pair [ ] . count data-structure member 15 in line 5. The elimination of duplicates in the following line retains the pairs, Pk, having the largest total number of frequency values, Mk. Line 10 and the lines following calculate the total number of bytes that can be embedded into the host data using the unique pixel pairs found by 20 this code fragment.
Sorting the pair values in decreasing order of value, Mk, minimizes the number of pairs required to embed a particular auxiliary data stream. However, the security of the embedded data is lncreased significantly if the pair 25 values are arranged in random order. Randomizing the pair-value order is part of this invention. This is accomplished by rearranging the pair-values to random order by calculating a data structure having entries for an integer~
index pts[k].i, k=0,1,2,..., no_pairs; and pts[k].gamma=~o, SUBSTITUTE SHEET (RULE 26) a~ssO~O
Wo 96/26494 r~ A7~7 r~ no p~3irs~ where the ~ir values are random. Sorting the data structure, pts [], to put the random values in ascending order r~ntir1mi 7~.5_ the index values . The random index values are used with the pair-values calculated as 5 indicated above, to re-order the table to give random pair-value ordering.
The algorithm described for=palette-format images -permits manipulating pixel values without~ regard~ to the individual frequency of occurrence. Reference should now be 0 made to Figure 9 where another code fraqment is listed in which, for TruecQlQr images, a constraint is introduced Dn the frequency of occurrence that minimizes the effect of embedding on the host data histogram.
Truecolor images c~nsist of three individual 8-bit 15 greyscale images,~ one each for the red, green, and blue image components. Truecolor= images have no color palette.
The possible combinations of the three 8-bit components give approximately 16 million colors.~ The present invention embeds data into Truecolor images by treating each RGB color=
20 component image separately. The effect of emoedaing on the composite image color is the,ref Qre within the noise Yalue of the individual color intensity components.
In Figure 4, the ip-loop starting in line 2 refers -to the color plane ~ip=0,1,2 for R,G,B). The frequency of 25 occurrence of each numerical value (0 through 255) is given in the array, hist values [ ], with the color plane histograms ., offset by the quantity, ip~25~, in line 7. The variable, fvalue[], holds the floating point histogram values for=
color-component, ip. Line ll begins a loop to constrain the SUBSTITUTE SHEET (RULE 26) W096,264g4 21 8895D PCr/Uss6/02357 pairs selected for nearly equal frequency of occurrence.
Pixel intensities within the noise limit, RANGE, are selected for comparison of: statistical frequency. The tolerance, ~, fQr statistical agreement is fixed at 5g in 5 line 17. This tolerance could be adjusted for particular applications.
After all possible values are tested for the constraints of noise and statisticaI frequency, the pairs found are sorted in line 27, the duplicates are removed, the 10 starting index is incremented in line 31, and the search continued. A maximum number of pairs aga~-n is set by the algorithmic constraint that the i and j pair values must be less than one-half the number o~ pixels in an image row. As with palette-format images, the security of the invention 15 includes randomizing the pair-value entries.
Applying the statistical constraint minimizes the host image effects of embedding the auxiliary data. If the tolerance, ~, is set at 0, each pai~ chosen will contain data values less than the noise value in intensity 20 separation, and occurring with exactly the same statistical frequency. Setting the tolerance at ~-5g, as in the code fragment of Figure ~, permits the acceptance of pixel pairs that are close in frequency, while still preserving most of the statistical properties of the host data. Few, if any, 25 pairs might be found by requiring exactly the same frequency of occurrence.
The actual embedding of auxiliary data lnto a set of host data consists of rearranging the order of occurrence of redundant numer~cal values. The pairs of host data values SUBSTITUTE SHEET (RULE 26) ~ ~ 8~ )5 WO 96/26494 PCTIUS96/023~7 found by analysis are the pixel values used to e~Lcode the bit-stream of the auxili~ry data into the host data. It is important to reali~P that the numerical values used for embedding are the values already occurring in the host data.
5 The embedding process of the current invention does not alter the number or quantity of the numerical values in the host data.
In the embedding process of the present invention, the host data are processed sequentially. A first pass through 10 the host data examines each value and tests for a match with the pixel-pair values. Matching values in the host data are initiali~ed to the data-structure value, pair[k] .i, for k=0, 1, 2. . .Np. This step initializes the host BITMAP~ image (Figure 1) to the pair values corresponding to zeroes in the 1~ auxiliary data. A second pass through the auxiliary data examines the sequential bits of the data to be embedded, and sets the pair-value of the host data element to the value i or ~, according to the auxiliary~bit value to be embedded If the bit-stream being embedded is random, the host data 20 pair-values, i and j, occur with equal frequency in the host image after the embedding process is completed.
Figure ~ illustrates the code fragment that performs the actual embedding, including the considerable information which is necessary to manipulate=the data in the header 2~ information, auxiIiary bit-stream, and the host data files.
.ines 1-12 allocate memory and initiaIize variables. The header and bit-stream data to be ~embedded are denoted the "data-image," and are stored in the array, data_row[] . The host data are denoted the "image-data "
SUBSTITUTE SHEET (RULE 26) ~ W096l26494 2 1 880~;~ PcTlu596/02357 The index, li, is used in a loop beginning at line 12 to count the byte position in the data-image. The loop begins with li=-51~ because header information is embedded before the data-image bytes. Line 14 contains the test for 5 loading data_row[~ with the header infQrmation. Line 20 contains the test for loading data_row[~ with bytes from the data-image file, tape5.
Line 30 starts a loop for the bits within a data-image byte. The variable, bitindex=(0,1,2...7), counts the bit 10 position within the data-image byte, data_row[d_inrow], indexed by the variable, d_inrow. The variable, lj, indexes the byte (pixel) in the host image. The variable, inrow, indexes the image-data buffer, image_row[inrow]. Line 32 tests for output of embedded data (a completed row of 15 pixels) to the image-data file, and line 40 tests for completion of a pass through the image-data. One pass through the image-data is made for each of the pixel pairs, pair[k~, k=0,1,2...Np.
In line 47, the pair index is incremented. A temporary 20 palr data-structure variable named "pvalue" is used to hold the working pair values of the host data pixels being used for embedding. Line 60 provides for refreshing the image-data buffer, image_row.
The embedding test is made at line 72. If the 25 image_row[inrowl content equals the pair value representing a data-image bit of zero, no change is made, and the image-data value remains pvalue. i . However, if the bit-stream value is one, the image-data value is changed to equal pvalue. j . Line 84 treats the case for image-data values not SUBSTITUTE SHEET (RULE 26) ,31880sf~ - `
Wo 96l26494 PCT/US96102357 equal to the embedding pair value, pvalue.i. In this case, the bitindex variable is decremented, because the data-image bit is not yet embedded, and the image-data indices are incremented to examine the next host data value.
The extraction of embedded data is accomplished by reversing the process used to embed the auxiliary data-image bit-stream. A histogram analysis of the embedded image-data set will reveal the candidate pairs for~ extraction for only the case where the individual statistical frequencies are unchanged by the embedding process. In the listings of Figures 2-5, the statistical frQquencies are changed slightly by the embedding proce~s. The pair table used for embedding can be recreated by analysis of the original image-data, but it cannot generally be recovered .e~cactly f rom the embedded image-data .
Additionally, as described above, the invention includes randomizing the order of the pair-values, thereby increasing greatly the amount~ of ~analysis needed to extract the embedded data wlthout prior knowledge of the pair-value order.~
As previously described, the ordered pairs selected~ for embedding constitute the "key" for extraction of the data-image from the image-data. The listings illustrated in ~
~igures 2-5 demonstrate how embedding analysis reduces the statistical proper~ies of the noise component in host data to a table of pairs of numerical ~ values_: ~he key-pairs are .=
required for extraction of the embedded data, but they cannot be generated by analyzing the host data after the embedding process is completed. = However, the key can be SUBSTITUTE SHEET (RULE 26) WO 96/26494 2 1 8 8 0 5 0 r~l,u recreated ~rom the original, unmodified host data. Thus, data embedding is similar to one-time-pad encryption, providing extremely high security to the embedded bit-stream.
With the pair table known, extraction consists of sequentially testing the pixel values to recreate an output bit-stream for the header information and the data-image.
In the present invention, the pair table is inserted into the host image-data, where it ls available for the ex'~:raction process. Optiona~ly, the present invention permits removing the pair table, and storing it in a separate fiIe. Typically, the pair table ranges from a few to perhaps hundreds of bytes in size. The maximum table size permitted is one-half the row length in pixels. With the pair table missing, the embedded data are secure, as long as the original host image-data are unavailable. Thus, the embedding method gives security potential approaching a one-time-pad encryption method.
Another way of Frotecting the pair table is to remove the key and encrypt it using public-key or another encryption process. The present invention permits an encrypted key to be placed into the host image-data, preventing extraction by unauthorized persons.
Embedding auxiliary data into a host slightly changes the statistical frequency of occurrence of the pixel values used for encoding the bit-streanL. Compressed or encrypted embedding data are excellent pseudo-random auxiliary bit-streams. Consequently, embedding auxiliary data having pseudo-random properties minimizes changes in the average SUBSTITUTE SHEET(RULE 26~

~2 1 8 8C~`50 - -WO 96/26494 P.~ 71';7 frequency of occurrence of the values in: the_embedding pairs. Embedding character data without compression or encryption reduces significantly the security offexed by the present invention.
The existence of embeddçd data is no~ detect~d easily by analyzing the embedded image-data. When viewed as a cryptographic method, data embedding convolves the data-image with the image-data. The original data-image bit-stream embedded represents a plaintext. The combination of the host and embedded data implants ciphertext in the noise component of the host. The existence of ciphertext is not evident however, because the content and meaning of the host carrier information is preserved by the present invention.
Data embedding according to the present invention is distinct from encryption because~no obvious ciphertext is produced .
Those who are unfamiliar with the terms "plaintext, "
and "ciphertext" can refer, f~r~ example, to B. Schneier, A~pplied Cryptogr~phy Protocols, Algorithms, ~nd Source Code 20 in C, J. Wiley & Sons, New York, New York, 1994. This reference is incorporated herein by reference.
As mentioned previously, the present invention is useful in the embedding of auxiliary information into facsimile (FAX) data. In the previous discussion concerning 25 embedding auxiliary information Into image host data, the noise component originates from uncertainty in the numerical values of the pixel data, or in the values of the colors in a color pallet. ~
SU65TITUTE SHEET ~I'IULE ~6) W096/26494 2~ 88a5~ P~ '^7~7 Facsimile transmissiDns are~ actually images consisting of black and white BITMAP(~ data, that is, the data from image pixels are binary (0,1) values representlng black or white, respectively, and the effect Df noise is to either add or remove pixels from the data. The present invention, therefore, processes a facsimile black-and-white BITMAP~
image as a 2-cPlor BITMAP~.
The standard office FAX machine combines the scanner and the digital hardhlare and software required to transmit 10 the image through a telephone connection. The images are transmitted using a special modem protocol, the characteristics of which are available through numerous sources One such source, the User's Manu~l for the EXP
Mod~n (UM, 1993), describes a FAX/data modem designed for 15 use in laptop computers. FAX transmissions made between computers are digital communications, and the data are -therefore suited to data embedding.
As has been previously discussed with relation to embedding lnto images, the FAX embedding process ls 20 conducted ln two stages: analysis and embedding. In the case of a FAX 2-color BITMAP~, image noise can either add or subtract black pixels from the image. Because of this, the length of runs o~f consecutive like pixels will vary.
The scanning process represents a black line in the 25 source copy by a run of consecutive black pixels in the two color BITMAP~ image. The number of plxels in the run is uncertain by at least il, because of the scanner resolution and the uncertain converslon of original material to black-and-white BITMAP~ format.
SUBSTITUTE SHEET (RULE 26) c~l 88~=50 - ~ ;
Wo 96l26494 PCTrUss6/023s7 Applying data embedding to the two color BITMAE'~ data example given here theref ore cons~ists of analyzing the BITMAP~ to determine the statistical frequency of occurrence, or histogram, of runs of consecutive pixels.
5 The embedding process of the present invention varies the length of runs by ~0,+1) pixel according to the content of the bit-stream in the auxiliary data- image. Host data:
suitable for embedding are any two color BITMAP~ image which is scaled in size for FAX transmission. A hardcopy of a-- FAX
lO transmission can be scanned to g~nerate the two color ~
BITMAP~3, or the image can be created by using FAX-printer driver sof tware in a computer .
The FAX embedding process begins by analyzing the lengths of runs in each row of pixels. The implementation 15 of this step is illustrated ~ by the code fragment in Figure 6. The arguments to ~he ~outine, rowstats~ are ~a pointer to=the pixel data in the row, which consists of one byte per pixel, either a zero: or=a one in value; a pointer to an array of statistical frequencies; the number of 20 columns (pixels) in the data row; and a flag for ~internal program options. The options flag is the size o blocks, or packets, of the auxiliary bitstream to be embedded_ The options flag is tested in line 9, and the routine, packet_col ( ) is used for~a~ posit~ive option flag . The 25 packet_col () routine is given in the listing of Figure ~=, and its purpose is to ensure that the first pixel in the data row starts in an even column number. The location of the first pixel in the row flags the start of the data packets, which will be further described below.
SUBSTITUTE SHEET (RULE 26) Wo96l26494 2 ~ 8805~ P~ 5~ 7 Line 12 begins a Loop to examine the runs of pixels in the data row. Runs between the defined values MINRUN and MAXR~N are exam~ned by the loop. The j-loop, and the test at line 15, locate a run of pixels, and sets the variable, 5 k, to the index of the start of the run. The test at line 21 selects only blocks of pixels having length, i, less than the length of the row. The loop in-~~line 22 moves the pixel run to temporary storage in the array block [ ] .
The two tests at lines 24 and 25 reject blocks having 10 run lengths other than the one required by the current value of the i-loop. The embedding scheme selects blocks of length, i, for embedding by adding a pixel to make the length i+l. This assures that the run can contain either i or i+l non-zero pixel values, according to the bit-stream of 15 the auxiliary embedded data. If the run stored in the variable block[] array does not end in at least two zeroes, it is not acceptable as a run of length, i+1, and the code branches to NEXT, to examine the next run found.
Line 28 begins a loop to count the number of pixels in 20 the run. The number found is incremented by one in line 31 to accoun~ for the pixel added to make the run length equal to i~1. Line 33 contains a test ensuring that the run selected has the correct length. The histogram[] array fo the run-length index, i, is incremented to tally the 2~ occurrénce frequency of the run. The data row bytes for the run are flagged by the loop in line 36, with a letter code used to distinguish the runs located. This flagging technique permits the embedding code to identify easily the runs to be used for embedding the bit-stream. On exit from SUBSTITUTE SHEET (RULE 26) Wo 96/26494 ~ l 8 8 - PCrNS96/023S7 this rQutine, the data ro~ bytes contain runs flagged wi=th letter codes l=r i nfli r~te the usable pixel positions for embedding the bit-stream. The return value is the number of runs located in the data rQw. A return of zero indicates no 5 runs within the defined limits Pf MINRUN and MAXRUN were~
located .
FAX modem protocols emphasize speed, and therefore do not include erroL-correction. For= tllis reason, FAX
transmissions are subject to drQp-outs, to impulsive noise, 10 and to lost d~ta, depending on the quality of the telephone line and the speed ~f the transmi~ssion. For successful embedding, the present invention must account for=the possible loss of some portion -of ~the image data. To accomplish this, a variation of modem block-protocols is 15 used to embed the header~ and the auxiliary data. The two color image is treated as a transmission medium, with the data embedded in blocks, or packets, providing for packet-start flags, and parity checks. The start of a packet is signaled by an image row having its first pixel in an even 20 column. The packet ends when the number of bits contained in the block are extracted, or, in the case of a corrupted packet, when a packet-start flag is located in a line. A
checksum for parity, and a packet sequence number, are ==~
embedded with the data in a packet. Using this method, 2~ errors in the FAX transmission result in possible loss Qf some, but not all, of the embedded data.
The amount of data lost because Qf transmission errQrs ~ =
depends on the density of pixels~in the source image and the length of a dropout. Using 20 bytes per packet, a large SUBSTITUTE SHEET (RULE 26) WO 96/26494 2 1 8 8 0 ~ p~/usg6/0 dropout in transmission of standard text results in one or two packets of lost data. Generally, the success of the invention follows the legibility of the faxed host image inf ormation .
Turning now to Figure 7, there-can be seen a listing of the steps necessary to initialize the two color BITMAPa~
lines to flag the start of each packet. Each row in the two color image con~ains a non-zero value beginning in an even column (packet start), or in an odd column (packet continuation).
In Figure 7, it can be seen that line 4 starts a loop over the number of pixels in a data row. In FAX images, a zero ( 0 ) pixel value indicates a black space, and a one ( 1 ) value indicates a white space. Line 5 locates the first black space in the data for the row. If the variable, packet_slze, is positive, the column index is tested to be even and the pixel is forced to be a white space. If the packet_size variable is negative, the routine returns an indicator of the data row flag without making changes. If packet_size is greater than zero, the first data row element is flagged as a white space. Line ll deals with the case in which packet_size=0, indicating a continuation row. In the event of a continuation row, the first ~data row element is forced to a black space.~ The values returned by subroutines in lines 17-20 show the nature of the pixel row examined.
The code fragment listed in Figure a provides auxiliary data embedding into two color BIT~P~ FAX images. The pixels in a row are processed as described above by ~x~m;n;ng the contents of the data row after it has been SUBSTITUTE SHEET (RUI~E 26) _ _ _ ~188~s analyzed and flagged with letter codes to indicate the run lengths. Lines 1 through 49 are~part of a large loop (not shown) over the pixel index, lj, in the two color BITMAP~
image. Lines 1-26 handle the reading of one line of pixels 5 from the two color BITMAP~, and store the row number of the image in the variable, nrow, in line 1. The pixel value bits are decoded and expanded into the image_row [ ] array in lines 12-36. The image_row[] a~ray_contalns the pixel values stored as one value (0 or =1) per byte.
Line 28 uses the packet_col(¦ routine to~return the packet-index for the row. If j is 0 in line 28, the row is a packet-start row, and if j is I, the row is a continuation row. Line 29 uses the rowstats ( ) routine to assign run-length letter flags to the pixeIs in the row buffer. The 15 return value, i, gives the number of runs located in the image row. Consistency tests are made at lines 31, 37, and 41. The index, kp, gives the pixel row number within a data packet. If kp is 0, the line must be a packet-start index, and if kp>0, the line must be a- continuation line. Line 49 20 completes the process Qf reading and preprQcessing a row of two color image data.
The data-structure array, pair [ ], contains the run length for (i), the augmented run length, (i+l), and the total number of runs in the two color BITMAP~ image. The 25 index, k, in the loop starting at line 51, is the index for the run lengths being embedded. The index, inrow,- coùnts - pixels within the image row buffer, and the variable, bitindex is the bit-position index in the bit-stream byte.
SUBSTITUTE SHEET(RULE 26) WO 96126494 2 1 8 8 13 ~ ~ PCI/US96/02357 Line 57 sets the value of the run-length letter-code in the variable, testltr. The value of ~n image pixel is tested against the letter-code in line 58. If the test letter-code flag is located, line 60 advances the index in 5 the row to the end of the pixel run being used for embedding. The test function in line 62 checks the value for the current bit index in the bit-stream packet byte. If the value is one, the last pixel in the run is set to one.
Otherwise, the last pixel in the run is set to 0.
Setting the value of the pixel trailing a run implements the embedding in the two color BITMAP~ images by introducing noise generated according to the pseudo-random bit-stream in the packet data. The letter flag values written into the row buffer by the call to rowstats ( ) in Figure 8 are reset to binary unit value before the image_row array data are packed and written back to the .BMP format file. The process for doing this is not illustrated in Figure 8, but is straightforward for thQse skilled in the art .
Extraction of data embedded into a two color BITMAP~
FAX image, according to the present invention, can be accomplished only if the transmission of the FAX is received by a computer. The image data are stored by the receiving computer irL a file format (preferably a FAX compressed format), permitting the processing necessary to convert the image to BITMAP~ f ormat and to extract the embedded data .
FAX data sent to a standard office machine are not amenable to data extraction because the printed image is generally SUBSTITUTE SHEET (RULE 26) ~ 880~ o Wo 96/264g4 PCr/USs6/02357 not of sufficien~ quaIity to allow for recovery of the embedded data through scanning.
However, the invention does apply to scanning/printing FAX machines that process data internally with computer ~
5 hardware Auxiliary embedded data are inserted after the scanning of ~the host data, but prior tQ transmission. The auxiliary embedded= data are extracted after they have been received, but before ~hey are printed.
The key for two color image embedding can be recovered lO by analyzing the embedded image, because the run lengths are not changed from the original (i, i+1~ values . The order in which the values are used depends on the frequency of occurrence in the image. As in the example for palette-~color images, a key to the value and order of the pairs used 15 for embedding is inserted into the FAX. However, the key is not strictly required, because, ïn principle, knowledge of the defined values ~INRUN and MAXRUN permits re-c21culating the run-length statistics from the received image. In practice, the key is required because transmission errors in 20 the FAX-modem communication link can introduce new run-lengths that alter the statiSti=cal properties of the image, and because the pair ordering is not known. Even though FAX
embedding is somewhat less secure than embedding auxiliary data into palette-color images, the two color BITMAP~ FAX
25 embedding of data still can be regarded as similar to one-time-pad cryptography.
The foregoin~ description of the preferred embodiments of the inVonr;-n }l~ve been prese~ted for purposes of illustration and description. It is not intended to be SUBSTITUTE SHEET(RULE 26) Wo 96/26494 2 1 8 8 0 5 ~ PCTtUS96/02357 exhaustive or to li~it the invention to the precil3e ~orm disclosed, and obviously many modifications and variations are possible in light of the above teaching. The ernbodiments were chosen and described in order to best 5 explain the principles of the invention and its practical application to thereby enable others skilled in the art to best utilize the invention in various embodiments ana ~ith various modifications as are suited to the particular use contemplated. It is intended that the scope of the 10 invention be defined by the claims appended hereto SUBSTITUTE SHEET (RULE 26) _

Claims

WHAT IS CLAIMED IS:

1. A method of embedding auxiliary data into host data comprising the steps of:
creating a digital representation of said host data in the form of elements having numerical values and containing a noise component;
creating a digital representation of said auxiliary data in the form of a sequence of INDIVIDUAL bit VALUES;
evaluating said noise component of said digital representation of said host data;
comparing pairs of said elements with said noise component to determine pairs of said elements having numerical values which differ by less than said value of said noise component;
replacing individual values of said elements with substantially equivalent values from said pairs of elements in order to embed individual bit values of said auxiliary data corresponding to said sequence of bit values of said auxiliary data; and outputting said host data with said auxiliary data embedded into said host data as a file.

2. The method as described in Claim 1 further comprising the step of combining said auxiliary data with predetermined information indicative of said auxiliary data, its file name, and file size, said step to be performed after the step of digitizing said auxiliary data

3. The method as described in Claim 1 further comprising the step of determining a protocol for embedding said auxiliary data into said host data which allows for verification of said auxiliary data upon extraction from said host data.

4. A method of extracting embedded auxiliary data from host data containing a noise component comprising the steps of:
extracting from said host data a bit sequence indicative of said embedded auxiliary data, and which allows for verification of said host data;
interpreting said host data to determine host element pairs which differ by less than said noise component and which correspond to bit values of said auxiliary data;
identifying said auxiliary data using said bit sequence; and extracting said auxiliary data as a file.

5. The method as described in Claim 1, wherein said host data comprises a color photograph.

6. The method as described in Claim 1, wherein said host data comprises a black and white photograph

7. The method as described in claim 1, wherein said host data comprises a television signal.

8. The method as described in Claim 1, wherein said host data comprises a painting.

9 . The method as described in Claim 1, wherein said host data comprises a facsimile transmission.

10. The method as described in Claim 1, wherein said host data comprises an identification card.

11. The method as described in Claim 1, wherein said host data comprises digital audio information.