WO2004070029A1 - Method to encode a dna sequence and to compress a dna sequence - Google Patents

Method to encode a DNA sequence and to compress a DNA sequence

Info

Publication number
WO2004070029A1
Authority
WO
WIPO (PCT)
Prior art keywords
dna sequence
bases
encoding
encoded
byte
Prior art date
Application number
PCT/KR2003/001093
Other languages
French (fr)
Inventor
Hyoung Do Kim
Seung Wha Yoo
Kyoung Hee Choi
Original Assignee
Ajou University Industry Cooperation Foundation
Priority date
Filing date
Publication date
Application filed by Ajou University Industry Cooperation Foundation
Priority to AU2003232661A
Publication of WO2004070029A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • E FIXED CONSTRUCTIONS
    • E04 BUILDING
    • E04G SCAFFOLDING; FORMS; SHUTTERING; BUILDING IMPLEMENTS OR AIDS, OR THEIR USE; HANDLING BUILDING MATERIALS ON THE SITE; REPAIRING, BREAKING-UP OR OTHER WORK ON EXISTING BUILDINGS
    • E04G1/00 Scaffolds primarily resting on the ground
    • E04G1/28 Scaffolds primarily resting on the ground designed to provide support only at a low height
    • E04G1/32 Other free-standing supports, e.g. using trestles
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • E FIXED CONSTRUCTIONS
    • E04 BUILDING
    • E04G SCAFFOLDING; FORMS; SHUTTERING; BUILDING IMPLEMENTS OR AIDS, OR THEIR USE; HANDLING BUILDING MATERIALS ON THE SITE; REPAIRING, BREAKING-UP OR OTHER WORK ON EXISTING BUILDINGS
    • E04G1/00 Scaffolds primarily resting on the ground
    • E04G1/15 Scaffolds primarily resting on the ground essentially comprising special means for supporting or forming platforms; Platforms
    • E FIXED CONSTRUCTIONS
    • E04 BUILDING
    • E04G SCAFFOLDING; FORMS; SHUTTERING; BUILDING IMPLEMENTS OR AIDS, OR THEIR USE; HANDLING BUILDING MATERIALS ON THE SITE; REPAIRING, BREAKING-UP OR OTHER WORK ON EXISTING BUILDINGS
    • E04G1/00 Scaffolds primarily resting on the ground
    • E04G1/15 Scaffolds primarily resting on the ground essentially comprising special means for supporting or forming platforms; Platforms
    • E04G2001/155 Platforms with an access hatch for getting through from one level to another
    • E FIXED CONSTRUCTIONS
    • E04 BUILDING
    • E04G SCAFFOLDING; FORMS; SHUTTERING; BUILDING IMPLEMENTS OR AIDS, OR THEIR USE; HANDLING BUILDING MATERIALS ON THE SITE; REPAIRING, BREAKING-UP OR OTHER WORK ON EXISTING BUILDINGS
    • E04G1/00 Scaffolds primarily resting on the ground
    • E04G1/28 Scaffolds primarily resting on the ground designed to provide support only at a low height
    • E04G1/30 Ladder scaffolds
    • E04G2001/302 Ladder scaffolds with ladders supporting the platform
    • E04G2001/305 The ladders being vertical and perpendicular to the platform

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Architecture (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Medical Informatics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Structural Engineering (AREA)
  • Mechanical Engineering (AREA)
  • Civil Engineering (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention relates to a method for encoding a DNA sequence and a method for compressing a DNA sequence. The method for encoding a DNA sequence according to the present invention comprises the steps of: encoding each base of the DNA sequence, which comprises adenine (A), guanine (G), cytosine (C) and thymine (T), into 2 bits; forming one byte from a predetermined number of the encoded bases; and forming the DNA sequence in byte units. The method for compressing a DNA sequence according to the present invention further comprises a step of compressing the encoded DNA sequence using a data compression method. According to the present invention, it is possible to encode a DNA sequence effectively and to improve the compression rate and compression speed of DNA sequence information.

Description

METHOD TO ENCODE A DNA SEQUENCE AND TO COMPRESS A DNA SEQUENCE
Technical Field
The present invention relates to a method for encoding a DNA sequence and a method for compressing a DNA sequence, and particularly to a method for encoding a DNA sequence by expressing each of the 4 types of DNA bases, adenine (A), guanine (G), cytosine (C) and thymine (T), in 2 bits, and to a method for compressing the encoded DNA sequence with a common data compression method to increase compression efficiency.
Background Art
As biotechnology advances, the DNA sequences of various organisms are being analyzed, and research on methods to express and compress DNA sequences effectively is in progress. In general, common text compression software such as WinZip and Arj takes at least 2 bits per base when compressing a DNA base sequence.
To overcome the shortcomings of common text compression software, various compression algorithms that reflect the properties of DNA sequences have been developed over the last decade. Representative examples include Biocompress-2 (Grumbach and Tahi, 1994), GenCompress (Chen et al., 2001), CTW+LZ (Matsumoto et al., 2000), DNACompress (Chen et al., 2002) and the like. These algorithms use two properties of DNA sequences, namely approximate repeats and palindromes. They can compress a sequence to at most 2 bits per base, but they require a great deal of time to search for the optimal approximate repeats and palindromes. In the compression of a DNA sequence, however, not only the compression rate but also the compression speed is important.
Disclosure of Invention
Accordingly, the present invention has been made in view of the foregoing problems. Considering that a DNA sequence comprises 4 types of bases, adenine (A), guanine (G), cytosine (C) and thymine (T), it is an object of the present invention to provide a method for encoding a DNA sequence by expressing each base of the DNA sequence as a 2-bit unit, and a method for compressing the encoded DNA sequence to improve compression efficiency and compression rate.
In order to accomplish the above object, the present invention provides a method for encoding a DNA sequence comprising the steps of: encoding bases of the DNA sequence comprising adenine (A), guanine (G), cytosine (C) and thymine (T) into 2 bits; forming one byte with a predetermined number of the encoded bases; and forming a DNA sequence in byte units. Also, according to the present invention, there is provided a method for compressing a DNA sequence comprising the steps of: encoding DNA bases comprising adenine (A), guanine (G), cytosine (C) and thymine (T) into 2 bits, respectively; forming one byte with a predetermined number of the encoded bases; forming a DNA sequence in byte units; and compressing the DNA sequence using a data compression method.
Brief Description of the Drawings
Further objects and advantages of the invention can be more fully understood from the following detailed description taken in conjunction with the accompanying drawings in which:
Fig. 1 is a view schematically showing the encoding of DNA bases;
Fig. 2 shows an embodiment of the method for encoding a DNA sequence according to the present invention; and
Fig. 3 shows an embodiment of the method for compressing a DNA sequence according to the present invention.
Example
Now, the present invention will be explained in detail with reference to the drawings.
Referring to Fig. 1, each base of a DNA sequence can be encoded into 2 bits. That is, each base is one of 4 characters, adenine (A), guanine (G), cytosine (C) and thymine (T), and these are expressed as the 2-bit values 00, 01, 10 and 11, respectively. Assigning 00, 01, 10 and 11 to adenine (A), guanine (G), cytosine (C) and thymine (T) is just an example. The bases may be given any 2-bit values, provided that the values differ from each other (e.g., 01, 11, 00, 10).
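The code assignment described above can be sketched as a lookup table. This is a minimal illustration rather than the patented implementation; the names `DEFAULT_CODES`, `ALTERNATIVE_CODES`, `is_valid_code_table` and `encode_bases` are hypothetical and do not appear in the source.

```python
# Example assignment from the description: A=00, G=01, C=10, T=11.
DEFAULT_CODES = {"A": "00", "G": "01", "C": "10", "T": "11"}

# Alternative assignment given in the description: A=01, G=11, C=00, T=10.
ALTERNATIVE_CODES = {"A": "01", "G": "11", "C": "00", "T": "10"}


def is_valid_code_table(codes):
    """Return True if all four bases have distinct 2-bit codes."""
    return (
        set(codes) == {"A", "G", "C", "T"}
        and all(len(c) == 2 and set(c) <= {"0", "1"} for c in codes.values())
        and len(set(codes.values())) == 4
    )


def encode_bases(sequence, codes=DEFAULT_CODES):
    """Return the 2-bit code of each base, in order."""
    return [codes[base] for base in sequence]


print(encode_bases("AGCT"))                     # codes under the default table
print(encode_bases("AGCT", ALTERNATIVE_CODES))  # codes under the alternative table
```

Any table that passes `is_valid_code_table` would serve equally well, which is the point the description makes about the values only needing to differ from each other.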
Referring to Fig. 2, the method for encoding a DNA sequence according to the present invention is explained concretely. Each base is encoded into 2 bits. The bases of the DNA sequence to be encoded (hereinafter referred to as the "target DNA sequence") are gathered in groups of a predetermined number, and each group forms one byte. The final encoded DNA sequence is expressed in byte units. Here, the number of bases included in one byte may be 1, 2, 3 or 4. When a byte holds fewer than 4 bases, the remaining bits are filled with a predetermined value.
For example, when the target sequence 10 is "CACGACGTTGTA" and 4 bases form one byte, the procedure is as follows.
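A minimal sketch of forming a single byte from 1 to 4 bases follows. The source does not say where the fill bits go or what value they take, so placing the bases in the highest bits and filling the unused low bit pairs with a caller-chosen 2-bit value is an assumption, as are the names `CODES`, `pack_group` and `fill_bits`.

```python
# Example assignment from the description: A=00, G=01, C=10, T=11.
CODES = {"A": 0b00, "G": 0b01, "C": 0b10, "T": 0b11}


def pack_group(group, fill_bits=0b00):
    """Pack 1 to 4 bases into one byte value (0-255).

    Assumption: the first base occupies the two highest bits, and every
    unused 2-bit slot at the low end is set to fill_bits.
    """
    if not 1 <= len(group) <= 4:
        raise ValueError("a byte holds between 1 and 4 bases")
    value = 0
    for base in group:
        value = (value << 2) | CODES[base]
    for _ in range(4 - len(group)):
        value = (value << 2) | (fill_bits & 0b11)
    return value


for group in ("C", "CA", "CAC", "CACG"):
    print(group, format(pack_group(group), "08b"))
```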
Firstly, four bases (CACG) of the target sequence 10 are encoded into 2 bits to form one byte (S21,S22). The encoded byte is added to a temporary DNA sequence (S23). Then, the temporary DNA sequence becomes "10001001". The bases which have not yet been encoded remain in the target DNA sequence (S24) and again undergo the step S21. The next four bases (ACGT) are encoded into 2 bits to form another one byte (S22). The encoded byte is added to the temporary DNA sequence (S23). The temporary DNA sequence is then "1000100100100111". The target DNA sequence still contains bases to be encoded (S24) and again undergoes the step S21. The four bases (TGTA) are encoded into 2 bits to form one byte (S22). The encoded byte is added to the temporary DNA sequence (S23). Then, the temporary DNA sequence becomes "100010010010011111011100". All the bases of the target DNA sequence are encoded and the process is ended (S24). Here, the information of the temporary DNA sequence is an encoded final DNA sequence 20.
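The walk-through above can be reproduced with a short loop. This is a sketch under the example assignment A=00, G=01, C=10, T=11; the function names are hypothetical, and the comments map the loop onto the step labels S21 to S24 only loosely, since the source does not define each step separately.

```python
# Example assignment from the description: A=00, G=01, C=10, T=11.
CODES = {"A": "00", "G": "01", "C": "10", "T": "11"}


def encode_sequence(target, bases_per_byte=4):
    """Return the encoded sequence as a string of '0'/'1' characters."""
    temporary = []        # the "temporary DNA sequence", one entry per byte
    remaining = target
    while remaining:      # S24 -> S21: bases are still waiting to be encoded
        group = remaining[:bases_per_byte]
        remaining = remaining[bases_per_byte:]
        bits = "".join(CODES[base] for base in group)  # S22: 2 bits per base
        bits = bits.ljust(8, "0")  # assumed fill value for a short last group
        temporary.append(bits)     # S23: append the byte
    return "".join(temporary)


def to_bytes(bit_string):
    """Convert a bit string whose length is a multiple of 8 into bytes."""
    return bytes(int(bit_string[i:i + 8], 2) for i in range(0, len(bit_string), 8))


encoded = encode_sequence("CACGACGTTGTA")
print(encoded)                  # 100010010010011111011100
print(to_bytes(encoded).hex())  # 8927dc
```

The 12 bases occupy 3 bytes, compared with 12 bytes when each base is stored as one text character.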
When four 2-bit bases are gathered to form one byte, the DNA sequence becomes binary data that cannot readily be identified or manipulated in a common document editor. As an alternative, it is possible to bind only 3 bases in one byte. This scheme fits well with the concept of the "codon", which is the basic unit of protein synthesis. Each codon can be mapped to a particular character code and is thus readily identified and manipulated in a common document editor.
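One way the 3-bases-per-byte variant might look is sketched below. The source does not specify the fill value or its position; here the two unused bits are assumed to be the highest bits and set to 01, so that the 64 codons map to byte values 0x40 to 0x7F, which are printable ASCII characters except for 0x7F. The names `FILL` and `encode_codons` are hypothetical.

```python
# Example assignment from the description: A=00, G=01, C=10, T=11.
CODES = {"A": 0b00, "G": 0b01, "C": 0b10, "T": 0b11}

# Assumed fill value, placed in the two highest bits of every byte.
FILL = 0b01


def encode_codons(sequence):
    """Encode a sequence whose length is a multiple of 3, one codon per byte."""
    if len(sequence) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    out = bytearray()
    for i in range(0, len(sequence), 3):
        value = FILL
        for base in sequence[i:i + 3]:
            value = (value << 2) | CODES[base]
        out.append(value)
    return bytes(out)


print(encode_codons("CACGACGTTGTA"))  # four codons, one byte each
```

The trade-off is density: this layout spends 8 bits on 3 bases instead of 4, in exchange for output that a text editor can display.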
Referring to Fig. 3, the method for compressing a DNA sequence according to the present invention is explained concretely. Steps S21 to S24 are the same as the procedures described above and shown in Fig. 2. In addition, the encoded DNA sequence (the temporary DNA sequence in the example) is finally compressed by a compression method (S25). The compression method used in the present invention may be any of the text compression methods that have already been developed and are in use.
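The additional step S25 can be sketched by passing the packed bytes to a general-purpose compressor. The source names no particular compressor; Python's standard `zlib` module (DEFLATE) is used here purely as a stand-in, and the helper names and the zero fill for a short final group are assumptions.

```python
import zlib

# Example assignment from the description: A=00, G=01, C=10, T=11.
CODES = {"A": 0b00, "G": 0b01, "C": 0b10, "T": 0b11}


def pack_4_per_byte(sequence):
    """Pack 4 bases per byte; a short final group is zero-filled (assumed)."""
    out = bytearray()
    for i in range(0, len(sequence), 4):
        group = sequence[i:i + 4]
        value = 0
        for base in group:
            value = (value << 2) | CODES[base]
        value <<= 2 * (4 - len(group))
        out.append(value)
    return bytes(out)


def compress_dna(sequence, level=6):
    """Encode the sequence (S21 to S24), then compress the bytes (S25)."""
    return zlib.compress(pack_4_per_byte(sequence), level)


packed = pack_4_per_byte("CACGACGTTGTA")
compressed = compress_dna("CACGACGTTGTA")
# Decompression recovers the packed bytes exactly.
print(zlib.decompress(compressed) == packed)
```

For an input as short as this example the compressor's own header outweighs any saving, so a meaningful comparison of compression rate would need realistic sequence lengths.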
Also, the method for encoding a DNA sequence and the method for compressing a DNA sequence are preferably carried out in the form of a computer program. Therefore, the present invention includes a computer-readable recording medium on which computer programs are recorded, the programs carrying out the respective steps of the method for encoding a DNA sequence and the method for compressing a DNA sequence.
While the present invention has been described with reference to particular illustrative embodiments, it is not to be restricted by the embodiments but only by the appended claims. It is to be appreciated that those skilled in the art can change or modify the embodiments without departing from the scope and spirit of the present invention.
Industrial Applicability
According to the present invention, it is possible to effectively encode a DNA sequence and improve the compression rate and the compression speed of DNA sequence information.

Claims

What Is Claimed Is:
1. A method for encoding a DNA sequence, comprising the steps of: encoding bases of the DNA sequence comprising adenine (A), guanine (G), cytosine (C) and thymine (T), into 2 bits; forming one byte with a predetermined number of the encoded bases; and forming a DNA sequence in the byte unit.
2. A recording medium which is readable by a computer having computer programs recorded therein, wherein the programs can carry out respective steps described in claim 1.
3. A method for compressing a DNA sequence, comprising the steps of: encoding DNA bases comprising adenine (A), guanine (G), cytosine (C) and thymine (T) into 2 bits, respectively; forming one byte with a predetermined number of the encoded bases; forming a DNA sequence in the byte unit; and compressing the DNA sequence using a data compression method.
4. A recording medium which is readable by a computer having computer programs recorded therein, wherein the programs can carry out respective steps described in claim 3.
PCT/KR2003/001093 2003-02-07 2003-06-04 Method to encode a dna sequence and to compress a dna sequence WO2004070029A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003232661A AU2003232661A1 (en) 2003-02-07 2003-06-04 Method to encode a dna sequence and to compress a dna sequence

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020030007920A KR20040071993A (en) 2003-02-07 2003-02-07 Method to encode a DNA sequence and to compress a DNA sequence
KR10-2003-0007920 2003-02-07

Publications (1)

Publication Number Publication Date
WO2004070029A1 (en) 2004-08-19

Family

ID=32844797

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2003/001093 WO2004070029A1 (en) 2003-02-07 2003-06-04 Method to encode a dna sequence and to compress a dna sequence

Country Status (3)

Country Link
KR (1) KR20040071993A (en)
AU (1) AU2003232661A1 (en)
WO (1) WO2004070029A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010108929A3 (en) * 2009-03-23 2010-11-25 Intresco B.V. Methods for providing a set of symbols uniquely distinguishing an organism such as a human individual
CN105550535A (en) * 2015-12-03 2016-05-04 人和未来生物科技(长沙)有限公司 Encoding method for rapidly encoding gene character sequence into binary sequence
US10902937B2 (en) 2014-02-12 2021-01-26 International Business Machines Corporation Lossless compression of DNA sequences

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101253700B1 (en) * 2010-11-26 2013-04-12 가천대학교 산학협력단 High Speed Encoding Apparatus for the Next Generation Sequencing Data and Method therefor
KR101922129B1 (en) 2011-12-05 2018-11-26 삼성전자주식회사 Method and apparatus for compressing and decompressing genetic information using next generation sequencing(NGS)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08123498A (en) * 1994-10-21 1996-05-17 Nippon Telegr & Teleph Corp <Ntt> Waveform data compressing method
US5651099A (en) * 1995-01-26 1997-07-22 Hewlett-Packard Company Use of a genetic algorithm to optimize memory space
US5706498A (en) * 1993-09-27 1998-01-06 Hitachi Device Engineering Co., Ltd. Gene database retrieval system where a key sequence is compared to database sequences by a dynamic programming device
US5727130A (en) * 1995-08-31 1998-03-10 Motorola, Inc. Genetic algorithm for constructing and tuning fuzzy logic system
US5838964A (en) * 1995-06-26 1998-11-17 Gubser; David R. Dynamic numeric compression methods
KR20020040406A (en) * 2000-11-24 2002-05-30 김응수 A method of compressing and storing data based on genetic code

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010108929A3 (en) * 2009-03-23 2010-11-25 Intresco B.V. Methods for providing a set of symbols uniquely distinguishing an organism such as a human individual
US9607127B2 (en) 2009-03-23 2017-03-28 Jan Jaap Nietfeld Methods for providing a set of symbols uniquely distinguishing an organism such as a human individual
NL2003311C2 (en) * 2009-07-30 2011-02-02 Intresco B V Method for producing a biological pin code.
US10902937B2 (en) 2014-02-12 2021-01-26 International Business Machines Corporation Lossless compression of DNA sequences
CN105550535A (en) * 2015-12-03 2016-05-04 人和未来生物科技(长沙)有限公司 Encoding method for rapidly encoding gene character sequence into binary sequence

Also Published As

Publication number Publication date
KR20040071993A (en) 2004-08-16
AU2003232661A1 (en) 2004-08-30

Similar Documents

Publication Publication Date Title
CN110945595B (en) DNA-based data storage and retrieval
JP7179008B2 (en) Nucleic acid-based data storage
US11379729B2 (en) Nucleic acid-based data storage
CN109830263B (en) DNA storage method based on oligonucleotide sequence coding storage
AU2019270159A1 (en) Compositions and methods for nucleic acid-based data storage
US9774351B2 (en) Method and apparatus for encoding information units in code word sequences avoiding reverse complementarity
KR100537523B1 (en) Apparatus for encoding DNA sequence and method of the same
WO2004070029A1 (en) Method to encode a dna sequence and to compress a dna sequence
CN107633158B (en) Method and apparatus for compressing and decompressing gene sequences
Goel A compression algorithm for DNA that uses ASCII values
CN111279422A (en) Encoding/decoding method, encoding/decoding device, and storage method and device
KR101953663B1 (en) Method for generating pool containing oligonucleotides from a oligonucleotide
Mridula et al. Lossless segment based DNA compression
TWI770247B (en) Nucleic acid method for data storage, and non-transitory computer-readable storage medium, system, and electronic device
Venugopal et al. Probabilistic Approach for DNA Compression
CN114341988A (en) Methods for compressing genomic sequence data
Wang et al. DNA Digital Data Storage based on Distributed Method
최영재 High Information Capacity and Low Cost DNA-based Data Storage through Additional Encoding Characters
Rani M.: A new referential method for compressing genomes
AU2022245140A1 (en) Fixed point number representation and computation circuits
KR20210056822A (en) Method of compression and transmision for fastq genome data
Bandyopadhyay Data hiding using DNA sequence compression

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP