CN103544142A - State machine - Google Patents

Publication number
CN103544142A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201210248224.5A
Other languages
Chinese (zh)
Other versions
CN103544142B (en)
Inventor
李小明
胡胜发
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Ankai Microelectronics Co.,Ltd.
Original Assignee
Anyka Guangzhou Microelectronics Technology Co Ltd

Abstract

An embodiment of the invention provides a state machine. The state machine comprises a rule module, an editor module, a pre-classification module and a state recognition module. The rule module pre-formulates a division rule for the character set to be recognized, the division rule comprising the features on which the division of the character set is based. The editor module edits a regular expression for the state machine according to those features. The pre-classification module pre-classifies the characters in the character set to be recognized using the division rule. The state recognition module recognizes the divided character set using the regular expression.

Description

A state machine
Technical field
The present invention relates to the field of computer technology, and in particular to a state machine.
Background
Lexical analysis is one of the basic functions of computational linguistics and defines how words are composed. A program or function that performs lexical analysis is called a lexical analyzer (lexer for short), also known as a scanner. A lexical analyzer generally takes the form of a function that is called by the syntax analyzer.
The first stage of lexical analysis identifies the character set contained in the word being processed, and this process is usually based on a state machine. A state machine is a graph that describes how words are composed; it consists of state points and transition arrows and represents how the state changes under a given input. Each state machine corresponds to one regular expression.
Existing state machines fall mainly into two kinds: nondeterministic finite state machines and deterministic finite state machines. In a nondeterministic finite state machine, the state transition under a given input is not unique; in a deterministic finite state machine, the state transition under a given input is unique.
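The difference between the two kinds can be sketched in a few lines of Python. The state names and transition tables below are illustrative assumptions and do not come from the patent: a deterministic machine maps each (state, input) pair to exactly one next state, while a nondeterministic machine may map it to several.

```python
# Illustrative transition tables; the state names "q0", "q1", "q2" are hypothetical.

# Deterministic: exactly one next state per (state, input) pair.
dfa = {
    ("q0", "a"): "q1",
    ("q1", "a"): "q1",
}

# Nondeterministic: a set of possible next states per (state, input) pair.
nfa = {
    ("q0", "a"): {"q0", "q1"},
    ("q1", "a"): {"q2"},
}


def dfa_step(state, symbol):
    """Return the single next state, or None if no transition exists."""
    return dfa.get((state, symbol))


def nfa_step(states, symbol):
    """Return every state reachable from any current state on `symbol`."""
    reachable = set()
    for state in states:
        reachable |= nfa.get((state, symbol), set())
    return reachable
```

Simulating the nondeterministic machine means tracking a set of current states rather than a single one, which is one reason it tends to be more expensive to run directly.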
Existing state machines generally contain hundreds of states and state transitions. This large number makes a state machine highly complex, and that complexity makes it difficult to implement a programming language on a micro system and slows character processing.
Summary of the invention
In view of this, the object of the present invention is to provide a state machine that pre-classifies characters so that multiple characters sharing the same feature are unified into a single feature value, thereby reducing the number of states and improving the running speed of the state machine.
To achieve the above object, the present invention provides the following technical solution:
A state machine, characterized in that the state machine comprises:
A rule module, configured to pre-formulate a division rule for the character set to be recognized, the division rule comprising the features on which the division of the character set is based;
An editor module, configured to edit a regular expression for the state machine according to the features in the division rule on which the division of the character set is based;
A pre-classification module, configured to pre-classify the characters in the character set to be recognized using the division rule;
A state recognition module, configured to recognize the divided character set to be recognized using the regular expression.
The rule module comprises:
A first rule unit, configured to formulate the division rule for the character set to be recognized, using uppercase letters, lowercase letters, digits and the underscore as the features on which the division of the character set is based.
The rule module comprises:
A second rule unit, configured to formulate the division rule for the character set to be recognized, using nouns, verbs, pronouns, numeral-classifier compounds and punctuation marks as the features on which the division of the character set is based.
The rule module further comprises:
A feature value unit, configured to define one feature value for each feature.
The pre-classification unit comprises:
A feature value output unit, configured to judge whether a character of the character set to be recognized matches any feature in the division rule and, if so, to output the feature value of that feature.
As can be seen from the above technical solution, the beneficial effect of the present invention is that the characters in the character set to be recognized are pre-classified by the division rule, so that tens or even hundreds of kinds of characters are reduced, by feature, to a limited few kinds, and a simplified regular expression is edited according to those features. The state machine corresponding to this regular expression recognizes only a small number of features, which simplifies the state machine and improves its running speed.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of the state machine according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the state machine according to another embodiment of the present invention;
Fig. 3 is a flowchart of character pre-classification according to the present invention.
Detailed description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The present invention discloses a state machine. As shown in Fig. 1, the state machine comprises:
A rule module, configured to pre-formulate a division rule for the character set to be recognized, the division rule comprising the features on which the division of the character set is based;
An editor module, configured to edit a regular expression for the state machine according to the features in the division rule on which the division of the character set is based;
A pre-classification module, configured to pre-classify the characters in the character set to be recognized using the division rule;
A state recognition module, configured to recognize the divided character set to be recognized using the regular expression.
This embodiment is a basic embodiment of the state machine of the present invention. Its beneficial effect is that the characters in the character set to be recognized are pre-classified by the division rule, so that tens or even hundreds of kinds of characters are reduced, by feature, to a limited few kinds, and a simplified regular expression is edited according to those features. The state machine corresponding to this regular expression recognizes only a small number of features, which simplifies the state machine.
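One way to picture how the four modules fit together is the following minimal Python sketch. It is an assumption-laden illustration, not the patented implementation: the function names, the one-letter feature codes "L", "D" and "O", and the choice to merge letters and the underscore into a single feature are all hypothetical simplifications.

```python
import re
import string


def make_rule():
    """Rule module: an ordered division rule of (feature code, member characters).

    The one-letter codes are hypothetical stand-ins for feature values.
    """
    return [
        ("L", set(string.ascii_letters + "_")),  # letters and underscore
        ("D", set(string.digits)),               # digits
    ]


def make_expression(rule):
    """Editor module: a regular expression written over feature codes.

    Fixed here for brevity: one "L" followed by any number of "L" or "D".
    """
    return re.compile(r"L[LD]*")


def preclassify(rule, text):
    """Pre-classification module: map each character to its feature code."""
    codes = []
    for ch in text:
        for code, members in rule:
            if ch in members:
                codes.append(code)
                break
        else:
            codes.append("O")  # "O": the character matches no feature in the rule
    return "".join(codes)


def recognize(expression, codes):
    """State recognition module: match the pre-classified sequence."""
    return expression.fullmatch(codes) is not None


rule = make_rule()
expression = make_expression(rule)
```

With these definitions, `recognize(expression, preclassify(rule, "count_1"))` is true, while an input that starts with a digit, such as "1abc", is rejected. The recognizer only ever sees three distinct symbols, however many raw characters the input alphabet contains.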
Fig. 2 shows another specific embodiment of the state machine of the present invention. The state machine of this embodiment comprises:
A rule module, configured to pre-formulate a division rule for the character set to be recognized, the division rule comprising the features on which the division of the character set is based;
In this embodiment the rule module comprises:
A first rule unit, configured to formulate the division rule for the character set to be recognized, using uppercase letters, lowercase letters, digits and the underscore as the features on which the division of the character set is based;
A second rule unit, configured to formulate the division rule for the character set to be recognized, using nouns, verbs, pronouns, numeral-classifier compounds and punctuation marks as the features on which the division of the character set is based;
A feature value unit, configured to define one feature value for each feature;
An editor module, configured to edit a regular expression for the state machine according to the features in the division rule on which the division of the character set is based;
A pre-classification module, configured to pre-classify the characters in the character set to be recognized using the division rule;
In this embodiment the pre-classification module comprises:
A feature value output unit, configured to judge whether a character of the character set to be recognized matches any feature in the division rule and, if so, to output the feature value of that feature;
A state recognition module, configured to recognize the divided character set to be recognized using the regular expression.
In this embodiment, the feature values are defined by means of a macro definition. Assuming the characters are divided according to the division rule formulated by the first rule unit, the macro definition is:
const int a2z=512, A2Z=513, z2n=514, underscore=515, others=516;
This means that, in the program, uppercase letters A to Z are denoted by A2Z, lowercase letters a to z by a2z, digits 0 to 9 by z2n, the underscore by underscore, and anything that matches none of these features by others. This embodiment deliberately adds the others feature: if a character matches none of the uppercase-letter, lowercase-letter, digit or underscore features, it is considered to belong to the others feature, and the feature value 516 is output.
In this embodiment the feature value of A2Z is defined as 513, that of a2z as 512, that of z2n as 514, that of underscore as 515, and that of others as 516. In practice a feature value may be defined as any number. The above classification and feature value definitions are the preferred scheme of this embodiment; in practice other schemes may be adopted provided they do not affect the overall solution.
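For readers working in Python, the same five feature values could be mirrored with an integer enumeration. This is only a sketch: the class name `Feature` is an assumption, while the member names and numbers are taken from the definition in the text.

```python
from enum import IntEnum


class Feature(IntEnum):
    """Feature values as defined in this embodiment."""

    a2z = 512         # lowercase letters a to z
    A2Z = 513         # uppercase letters A to Z
    z2n = 514         # digits 0 to 9
    underscore = 515  # the underscore character
    others = 516      # any character that matches none of the above
```

Because `IntEnum` members compare equal to plain integers, `Feature.A2Z == 513` holds, so the names can be used wherever the numeric feature values are expected.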
The editor module edits a regular expression for the state machine according to the above division rule. The regular expression in this embodiment is:
{a2z,A2Z,underscore}{a2z,A2Z,underscore,z2n}*
This regular expression means that the first character of the character set is a2z, A2Z or underscore, and that each of the zero or more characters that follow is a2z, A2Z, underscore or z2n. In other words, the regular expression does not allow the first character of the character set to be a digit.
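A state machine for this expression can be sketched as a small transition table keyed by feature value instead of by raw character. The two state names and the function name below are hypothetical; the feature values are those defined in the embodiment.

```python
a2z, A2Z, z2n, underscore, others = 512, 513, 514, 515, 516

START, IN_IDENT = 0, 1  # hypothetical state names

# One entry per (state, feature value) pair that the expression allows.
TRANSITIONS = {
    (START, a2z): IN_IDENT,
    (START, A2Z): IN_IDENT,
    (START, underscore): IN_IDENT,
    (IN_IDENT, a2z): IN_IDENT,
    (IN_IDENT, A2Z): IN_IDENT,
    (IN_IDENT, underscore): IN_IDENT,
    (IN_IDENT, z2n): IN_IDENT,
}


def accepts(feature_values):
    """Return True if the sequence of feature values matches the expression."""
    state = START
    for value in feature_values:
        state = TRANSITIONS.get((state, value))
        if state is None:  # no transition defined: reject
            return False
    return state == IN_IDENT
```

For instance, `accepts([513, 512, 512])` is true, while `accepts([514, 512])` is false because the sequence begins with the digit feature. The whole table has seven entries because it is indexed by feature rather than by individual character.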
The character set to be recognized is divided according to the above division rule; that is, before the state recognition module recognizes the character set, the characters in it are pre-classified.
This embodiment specifies that a2z is the first feature, A2Z the second feature, z2n the third feature, underscore the fourth feature and others the fifth feature. When judging the feature of a character and outputting a feature value, the feature value output unit checks the character against the first through fifth features in turn. If the character matches a feature, the feature value of that feature is output; if it does not match, the check continues with the next feature. See Fig. 3 for details.
Take the English word "English" as the character set to be recognized; it contains 7 characters in total.
The first character is "E". Character "E" is checked against the first feature, a2z, and does not match; it is then checked against the second feature, A2Z, and matches, so the feature value 513 of the second feature is output.
The second character is "n". Character "n" is checked against the first feature, a2z, and matches, so the feature value 512 of the first feature is output.
The remaining characters in the character set are judged in the same way to obtain the feature value of each character, which completes the pre-classification of the character set to be recognized.
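The sequential check can be sketched in Python as follows. The function names and the use of ASCII range comparisons are assumptions made for illustration; the order of the tests and the feature values follow the embodiment.

```python
a2z, A2Z, z2n, underscore, others = 512, 513, 514, 515, 516

# Features are tested in order, first to fourth; "others" is the fallback.
FEATURE_TESTS = [
    (a2z, lambda ch: "a" <= ch <= "z"),
    (A2Z, lambda ch: "A" <= ch <= "Z"),
    (z2n, lambda ch: "0" <= ch <= "9"),
    (underscore, lambda ch: ch == "_"),
]


def feature_value(ch):
    """Return the value of the first feature that `ch` matches."""
    for value, matches in FEATURE_TESTS:
        if matches(ch):
            return value
    return others  # fifth feature: matched none of the tests above


def preclassify(text):
    """Pre-classify every character of `text` into its feature value."""
    return [feature_value(ch) for ch in text]


print(preclassify("English"))  # [513, 512, 512, 512, 512, 512, 512]
```

The output for "English" is 513 for the uppercase "E" followed by 512 for each of the six lowercase letters, matching the walk-through in the text.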
Conventionally, without the above pre-classification, uppercase letters A to Z are generally treated as 26 features, lowercase letters a to z as 26 features, digits 0 to 9 as 10 features, the underscore as 1 feature and all other cases as 1 feature, for a total of 64 features. After a regular expression is edited using these 64 features, the corresponding state machine must recognize all of them together with their transition states and feedback states, so the number of items the state machine has to recognize reaches the hundreds.
With the method of the present invention, the character set to be recognized is pre-classified, reducing the above 64 features to the 5 features given in this embodiment, and the regular expression is edited using those 5 features. The number of features that the corresponding state machine has to recognize is therefore greatly reduced. The present invention thus simplifies the state machine and improves its running speed.
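The count of 64 and the effect of the reduction can be checked with a short calculation. The sketch below assumes a dense transition table with one entry per (state, symbol) pair and a hypothetical recognizer with two states; neither assumption comes from the patent, which does not specify how its tables are stored.

```python
import string

# Raw input symbols without pre-classification.
raw_symbols = (
    len(string.ascii_uppercase)    # 26 uppercase letters
    + len(string.ascii_lowercase)  # 26 lowercase letters
    + len(string.digits)           # 10 digits
    + 1                            # the underscore
    + 1                            # all other characters as one case
)

# Input symbols after pre-classification: a2z, A2Z, z2n, underscore, others.
feature_classes = 5


def table_entries(num_states, num_symbols):
    """Entries in a dense transition table: one per (state, symbol) pair."""
    return num_states * num_symbols


NUM_STATES = 2  # assumption: a start state and an "inside identifier" state

entries_before = table_entries(NUM_STATES, raw_symbols)     # 2 * 64
entries_after = table_entries(NUM_STATES, feature_classes)  # 2 * 5
```

Under these assumptions the table shrinks from 128 entries to 10. The saving grows with the number of states, since every state needs one column per input symbol.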
Likewise, the purpose of simplifying the state machine can also be achieved when the second rule unit of this embodiment formulates the division rule for the character set to be recognized using nouns, verbs, pronouns, numeral-classifier compounds and punctuation marks as the features on which the division is based.
Take the Chinese sentence meaning "I am a soldier" as an example. According to the division rule of the second rule unit, "I" is classified as a pronoun, "am" as a verb, the numeral "one" and the classifier that follows it as numeral-classifier compounds, and "soldier" as a noun, and the corresponding feature values are output according to the settings. In this embodiment, apart from the change of division rule, the remaining modules and units of the state machine work on the same principle as described above, so the description is not repeated.
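A word-level version of the pre-classification might look like the sketch below. Everything in it beyond the five part-of-speech features is an assumption: the patent gives no feature values for the second rule unit, so the numbers are invented, the `OTHERS` fallback is added by analogy with the first rule unit, English glosses stand in for the Chinese words, and "CL" is a placeholder token for the classifier.

```python
# Hypothetical feature values for the second rule unit; the patent gives none.
NOUN, VERB, PRONOUN, NUM_CLASSIFIER, PUNCTUATION, OTHERS = 600, 601, 602, 603, 604, 605

# Toy lexicon built from English glosses of the example sentence. A real
# system would need a dictionary or tagger, which is outside this sketch.
LEXICON = {
    "I": PRONOUN,
    "am": VERB,
    "one": NUM_CLASSIFIER,
    "CL": NUM_CLASSIFIER,  # placeholder for the classifier after the numeral
    "soldier": NOUN,
    ".": PUNCTUATION,
}


def preclassify_words(tokens):
    """Map each token to the feature value of its part of speech."""
    return [LEXICON.get(token, OTHERS) for token in tokens]
```

Calling `preclassify_words(["I", "am", "one", "CL", "soldier"])` gives one feature value per word, after which the state recognition step would operate on parts of speech rather than on individual words.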
The beneficial effect of this embodiment is that the state machine is disclosed and described more specifically, and illustrated with an actual character set and division rule, so that, building on the embodiment shown in Fig. 1, the content of the state machine of this embodiment is more complete and clear.
The above is only the preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also fall within the protection scope of the present invention.

Claims (5)

1. A state machine, characterized in that the state machine comprises:
A rule module, configured to pre-formulate a division rule for the character set to be recognized, the division rule comprising the features on which the division of the character set is based;
An editor module, configured to edit a regular expression for the state machine according to the features in the division rule on which the division of the character set is based;
A pre-classification module, configured to pre-classify the characters in the character set to be recognized using the division rule;
A state recognition module, configured to recognize the divided character set to be recognized using the regular expression.
2. The state machine according to claim 1, characterized in that the rule module comprises:
A first rule unit, configured to formulate the division rule for the character set to be recognized, using uppercase letters, lowercase letters, digits and the underscore as the features on which the division of the character set is based.
3. The state machine according to claim 1, characterized in that the rule module comprises:
A second rule unit, configured to formulate the division rule for the character set to be recognized, using nouns, verbs, pronouns, numeral-classifier compounds and punctuation marks as the features on which the division of the character set is based.
4. The state machine according to any one of claims 1 to 3, characterized in that the rule module further comprises:
A feature value unit, configured to define one feature value for each feature.
5. The state machine according to claim 4, characterized in that the pre-classification unit comprises:
A feature value output unit, configured to judge whether a character of the character set to be recognized matches any feature in the division rule and, if so, to output the feature value of that feature.
CN201210248224.5A 2012-07-17 2012-07-17 State machine Active CN103544142B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210248224.5A CN103544142B (en) 2012-07-17 2012-07-17 State machine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210248224.5A CN103544142B (en) 2012-07-17 2012-07-17 State machine

Publications (2)

Publication Number Publication Date
CN103544142A true CN103544142A (en) 2014-01-29
CN103544142B CN103544142B (en) 2016-12-21

Family

ID=49967610

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210248224.5A Active CN103544142B (en) 2012-07-17 2012-07-17 State machine

Country Status (1)

Country Link
CN (1) CN103544142B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7308446B1 (en) * 2003-01-10 2007-12-11 Cisco Technology, Inc. Methods and apparatus for regular expression matching
CN101246472A (en) * 2008-03-28 2008-08-20 腾讯科技(深圳)有限公司 Method and apparatus for cutting large and small granularity of Chinese language text
CN101360088A (en) * 2007-07-30 2009-02-04 华为技术有限公司 Regular expression compiling, matching system and compiling, matching method
CN101650718A (en) * 2008-08-15 2010-02-17 华为技术有限公司 Method and device for matching character strings
CN101841546A (en) * 2010-05-17 2010-09-22 华为技术有限公司 Rule matching method, device and system
CN102142009A (en) * 2010-12-09 2011-08-03 华为技术有限公司 Method and device for matching regular expressions
CN102413014A (en) * 2011-11-28 2012-04-11 华为技术有限公司 Message detecting method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095756A (en) * 2016-06-13 2016-11-09 尼玛扎西 Tibetan language spell checking methods and device based on automatic machine
CN106095756B (en) * 2016-06-13 2019-03-26 尼玛扎西 Tibetan language spell checking methods and device based on automatic machine

Also Published As

Publication number Publication date
CN103544142B (en) 2016-12-21

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: State machine, scheduling method and device and universal serial bus (USB) media play control device

Effective date of registration: 20171102

Granted publication date: 20161221

Pledgee: China Construction Bank Corp Guangzhou Economic and Technological Development Zone Sub-branch

Pledgor: Anyka (Guangzhou) Microelectronics Technology Co., Ltd.

Registration number: 2017990001008

PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20181227

Granted publication date: 20161221

Pledgee: China Construction Bank Corp Guangzhou Economic and Technological Development Zone Sub-branch

Pledgor: Anyka (Guangzhou) Microelectronics Technology Co., Ltd.

Registration number: 2017990001008

PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: State machine, scheduling method and device and universal serial bus (USB) media play control device

Effective date of registration: 20190130

Granted publication date: 20161221

Pledgee: China Construction Bank Corp Guangzhou Economic and Technological Development Zone Sub-branch

Pledgor: Anyka (Guangzhou) Microelectronics Technology Co., Ltd.

Registration number: 2019440000051

PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20200320

Granted publication date: 20161221

Pledgee: China Construction Bank Corp Guangzhou Economic and Technological Development Zone Sub-branch

Pledgor: ANYKA (GUANGZHOU) MICROELECTRONICS TECHNOLOGY Co.,Ltd.

Registration number: 2019440000051

CP01 Change in the name or title of a patent holder

Address after: 510663 3rd floor, area C1, innovation building, 182 science Avenue, Guangzhou Science City, Luogang District, Guangzhou City, Guangdong Province

Patentee after: Guangzhou Ankai Microelectronics Co.,Ltd.

Address before: 510663 3rd floor, area C1, innovation building, 182 science Avenue, Guangzhou Science City, Luogang District, Guangzhou City, Guangdong Province

Patentee before: ANYKA (GUANGZHOU) MICROELECTRONICS TECHNOLOGY Co.,Ltd.

CP02 Change in the address of a patent holder

Address after: 510555 No. 107 Bowen Road, Huangpu District, Guangzhou, Guangdong

Patentee after: Guangzhou Ankai Microelectronics Co.,Ltd.

Address before: 510663 3rd floor, area C1, innovation building, 182 science Avenue, Guangzhou Science City, Luogang District, Guangzhou City, Guangdong Province

Patentee before: Guangzhou Ankai Microelectronics Co.,Ltd.
