METHOD TO ENCODE A DNA SEQUENCE AND TO COMPRESS A DNA SEQUENCE
Technical Field
The present invention relates to a method for encoding a DNA sequence and a method for compressing a DNA sequence, and particularly to a method for encoding a DNA sequence by expressing each of the 4 types of DNA bases, adenine (A), guanine (G), cytosine (C) and thymine (T), in 2 bits, and to a method for compressing the encoded DNA sequence using a common data compression method to increase compression efficiency.
Background Art
As biotechnology advances, the DNA sequences of various organisms are being analyzed, and research on methods to effectively express and compress DNA sequences is in progress. In general, at least 2 bits per base are required when a DNA base sequence is compressed using common text compression software such as WinZip and Arj.
In order to overcome the shortcomings of common text compression software, various compression algorithms reflecting properties of DNA sequences have been developed over the last decade. Representative examples include Biocompress-2 (Grumbach and Tahi, 1994), GenCompress (Chen et al., 2001), CTW+LZ (Matsumoto et al., 2000), DNACompress (Chen et al., 2002) and the like. These algorithms exploit two properties of the DNA sequence, that is, approximate repeats and palindromes. They can compress a sequence to 2 bits per base or less, but require a great deal of time to search for the optimal approximate repeats and palindromes for compression. In the compression of a DNA sequence, however, not only the compression rate but also the compression speed is important.
Disclosure of Invention
Accordingly, the present invention has been made in view of the foregoing problems, and considering that a DNA sequence comprises 4 types of bases, adenine (A), guanine (G), cytosine (C) and thymine (T), it is an object of the present invention to provide a method for encoding a DNA sequence by expressing respective bases of the DNA sequence in a 2-bit unit, and a method for compressing an encoded DNA sequence to improve compression efficiency and compression rate.
In order to accomplish the above object, the present invention provides a method for encoding a DNA sequence comprising the steps of: encoding bases of the DNA sequence, comprising adenine (A), guanine (G), cytosine (C) and thymine (T), into 2 bits; forming one byte with a predetermined number of the encoded bases; and forming a DNA sequence in the byte unit.
Also, according to the present invention, there is provided a method for compressing a DNA sequence comprising the steps of: encoding DNA bases comprising adenine (A), guanine (G), cytosine (C) and thymine (T) into 2 bits, respectively; forming one byte with a predetermined number of the encoded bases; forming a DNA sequence in the byte unit; and compressing the DNA sequence using a data compression method.
Brief Description of the Drawings
Further objects and advantages of the invention can be more fully understood from the following detailed description taken in conjunction with the accompanying drawings in which:
Fig. 1 is a view schematically showing the encoding of DNA bases;
Fig. 2 shows an embodiment of the method for encoding a DNA sequence according to the present invention; and
Fig. 3 shows an embodiment of the method for compressing a DNA sequence according to the present invention.
Example
Now, the present invention will be explained in detail with reference to the drawings.
Referring to Fig. 1, each base of a DNA sequence can be encoded into 2 bits. That is, each base is one of 4 characters, adenine (A), guanine (G), cytosine (C) and thymine (T), which are expressed as the 2-bit values 00, 01, 10 and 11. Expressing adenine (A), guanine (G), cytosine (C) and thymine (T) as 00, 01, 10 and 11, respectively, is only an example. The bases may be assigned any 2-bit values, provided that the values differ from each other (e.g., 01, 11, 00, 10).
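As an illustration of the 2-bit assignment, the following is a minimal sketch in Python. The table names and helper functions are hypothetical and are not part of the disclosure; the only requirement taken from the description is that the four 2-bit values differ from each other.

```python
# Example assignment from the description: A=00, G=01, C=10, T=11.
DEFAULT_CODES = {"A": "00", "G": "01", "C": "10", "T": "11"}
# Alternative assignment mentioned in the description: 01, 11, 00, 10.
ALTERNATIVE_CODES = {"A": "01", "G": "11", "C": "00", "T": "10"}


def is_valid_code_table(table):
    """Return True if the table gives each of A, G, C, T a distinct 2-bit value."""
    codes = list(table.values())
    return (
        set(table) == set("AGCT")
        and all(len(code) == 2 and set(code) <= {"0", "1"} for code in codes)
        and len(set(codes)) == 4
    )


def to_bit_string(sequence, table=DEFAULT_CODES):
    """Express each base of the sequence as its 2-bit value."""
    return "".join(table[base] for base in sequence)


print(to_bit_string("AGCT"))                     # 00011011
print(to_bit_string("AGCT", ALTERNATIVE_CODES))  # 01110010
```

Bit strings are used here only for readability; an actual encoder would store the values as packed bits.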
Referring to Fig. 2, the method for encoding a DNA sequence according to the present invention is concretely explained. Each base is encoded into 2 bits. Bases of the DNA sequence to be encoded (hereinafter referred to as the "target DNA sequence") are gathered in a predetermined number to form one byte. The final encoded DNA sequence is expressed in byte units. Here, the number of bases included in one byte may be 1, 2, 3 or 4. When a byte holds fewer than 4 bases, the remaining bits are filled with a predetermined value.
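The byte-forming step with 1 to 4 bases per byte might be sketched as follows. This is a sketch under stated assumptions, not a definitive implementation: the description does not say where the fill bits are placed, so the code assumes that the bases occupy the high-order bits and the predetermined fill value occupies the low-order bits. The names `pack_group`, `encode`, `bases_per_byte` and `fill_bit` are hypothetical.

```python
# Example assignment from the description: A=00, G=01, C=10, T=11.
BASE_TO_BITS = {"A": 0b00, "G": 0b01, "C": 0b10, "T": 0b11}


def pack_group(group, fill_bit=0):
    """Pack 1 to 4 bases into one byte value.

    Assumption: bases fill the byte from the high-order end, and every
    unused low-order bit is set to fill_bit (0 or 1).
    """
    if not 1 <= len(group) <= 4:
        raise ValueError("one byte holds 1 to 4 bases")
    byte = 0
    for base in group:
        byte = (byte << 2) | BASE_TO_BITS[base]
    unused = 8 - 2 * len(group)
    byte <<= unused
    if fill_bit:
        byte |= (1 << unused) - 1  # set all unused low-order bits to 1
    return byte


def encode(target, bases_per_byte=4, fill_bit=0):
    """Encode the target DNA sequence in byte units."""
    return bytes(
        pack_group(target[i:i + bases_per_byte], fill_bit)
        for i in range(0, len(target), bases_per_byte)
    )


print(encode("CACGACGTTGTA", bases_per_byte=4).hex())  # 8927dc
print(encode("CACGACGTTGTA", bases_per_byte=3).hex())  # 88487c70
```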
For example, the respective procedures are explained for the case where the target sequence 10 is "CACGACGTTGTA" and 4 bases form one byte.
First, the first four bases (CACG) of the target sequence 10 are each encoded into 2 bits to form one byte (S21, S22). The encoded byte is added to a temporary DNA sequence (S23). The temporary DNA sequence then becomes "10001001". Bases which have not yet been encoded remain in the target DNA sequence (S24), so the process returns to step S21.
The next four bases (ACGT) are each encoded into 2 bits to form another byte (S22). The encoded byte is added to the temporary DNA sequence (S23). The temporary DNA sequence is then "1000100100100111". The target DNA sequence still contains bases to be encoded (S24), so the process again returns to step S21.
The last four bases (TGTA) are each encoded into 2 bits to form one byte (S22). The encoded byte is added to the temporary DNA sequence (S23). The temporary DNA sequence then becomes "100010010010011111011100". All the bases of the target DNA sequence have now been encoded and the process ends (S24). Here, the content of the temporary DNA sequence is the final encoded DNA sequence 20.
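The worked example above can be traced with a short sketch. The function name `encode_with_trace` is hypothetical, and the temporary DNA sequence is held as a string of "0" and "1" characters purely so that it can be compared with the values quoted in the text; the assignment A=00, G=01, C=10, T=11 is the example assignment of Fig. 1.

```python
# Example assignment from the description: A=00, G=01, C=10, T=11.
BASE_TO_BITS = {"A": "00", "G": "01", "C": "10", "T": "11"}


def encode_with_trace(target):
    """Encode four bases per byte and record the temporary sequence after each byte."""
    temporary = ""  # the "temporary DNA sequence"
    trace = []
    remaining = target
    while remaining:  # S24: continue while bases remain to be encoded
        group, remaining = remaining[:4], remaining[4:]
        byte_bits = "".join(BASE_TO_BITS[base] for base in group)  # S21, S22
        temporary += byte_bits  # S23: add the byte to the temporary sequence
        trace.append(temporary)
    return temporary, trace


final, trace = encode_with_trace("CACGACGTTGTA")
for step in trace:
    print(step)
# 10001001
# 1000100100100111
# 100010010010011111011100
```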
When four bases, each expressed in 2 bits, are gathered to form one byte, the DNA sequence becomes binary data which cannot readily be identified or manipulated with a common document editor. As an alternative, it is possible to bind only 3 bases in one byte. This scheme matches well with the concept of the "codon", which is the basic unit for protein synthesis. Each codon can be mapped to a particular character code and thus is readily identified and manipulated with a common document editor.
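One possible way to realize the 3-bases-per-byte alternative is sketched below. The description does not specify which character codes are used, so this sketch assumes a hypothetical 64-symbol printable alphabet (letters, digits, "+" and "/") indexed by the 6-bit codon value; the function name `encode_codons` and the requirement that the input contain a whole number of codons are likewise assumptions.

```python
import string

# Example assignment from the description: A=00, G=01, C=10, T=11.
BASE_TO_BITS = {"A": 0b00, "G": 0b01, "C": 0b10, "T": 0b11}

# Hypothetical alphabet: one printable character for each of the 64 codons.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_codons(target):
    """Map each group of 3 bases (one codon) to a single printable character."""
    if len(target) % 3 != 0:
        raise ValueError("this sketch assumes a whole number of codons")
    symbols = []
    for start in range(0, len(target), 3):
        value = 0
        for base in target[start:start + 3]:
            value = (value << 2) | BASE_TO_BITS[base]  # 6-bit value, 0 to 63
        symbols.append(ALPHABET[value])
    return "".join(symbols)


print(encode_codons("CACGACGTTGTA"))
```

Because every output character is printable, the encoded sequence can be opened and edited as ordinary text, at the cost of storing 3 bases per byte instead of 4.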
Referring to Fig. 3, the method for compressing a DNA sequence according to the present invention is concretely explained. Steps S21 to S24 are the same as the procedures described above and shown in Fig. 2. In this method, however, the encoded DNA sequence (the temporary DNA sequence in the example) is finally compressed by a compression method (S25). The compression method used in the present invention may be any of the text compression methods which have already been developed and are in use.
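The compression flow of Fig. 3 might be sketched as follows. The description permits any existing compression method for step S25; zlib from the Python standard library is used here only as one stand-in. The function names are hypothetical, and padding a short final group with 0 bits is an assumption that the description does not address.

```python
import zlib

# Example assignment from the description: A=00, G=01, C=10, T=11.
BASE_TO_BITS = {"A": 0b00, "G": 0b01, "C": 0b10, "T": 0b11}


def pack_bases(target):
    """S21 to S24: encode each base into 2 bits, four bases per byte."""
    packed = bytearray()
    for start in range(0, len(target), 4):
        group = target[start:start + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | BASE_TO_BITS[base]
        byte <<= 2 * (4 - len(group))  # assumption: pad a short final group with 0 bits
        packed.append(byte)
    return bytes(packed)


def compress_dna(target):
    """S25: compress the byte-encoded sequence with a general-purpose compressor."""
    return zlib.compress(pack_bases(target), 9)


sequence = "CACGACGTTGTA" * 1000
compressed = compress_dna(sequence)
print(len(sequence), len(pack_bases(sequence)), len(compressed))
```

For very short inputs the fixed overhead of the compressor can exceed any savings, so the benefit of step S25 is expected only on sequences of realistic length.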
Also, the method for encoding a DNA sequence and the method for compressing a DNA sequence are preferably performed in the form of a computer program. Therefore, the present invention includes a computer-readable recording medium on which computer programs are recorded, the programs carrying out the respective steps of the method for encoding a DNA sequence and the method for compressing a DNA sequence.
While the present invention has been described with reference to the particular illustrative embodiments, it is not to be restricted by the embodiments but only by the appended claims. It is to be appreciated that those skilled in the art can change or modify the embodiments without departing from the scope and spirit of the present invention.
Industrial Applicability
According to the present invention, it is possible to effectively encode a DNA sequence and improve the compression rate and the compression speed of DNA sequence information.