CN103544142A

CN103544142A - State machine

Info

Publication number: CN103544142A
Application number: CN201210248224.5A
Authority: CN
Inventors: 李小明; 胡胜发
Original assignee: Anyka Guangzhou Microelectronics Technology Co Ltd
Current assignee: Guangzhou Ankai Microelectronics Co.,Ltd.
Priority date: 2012-07-17
Filing date: 2012-07-17
Publication date: 2014-01-29
Anticipated expiration: 2032-07-17
Also published as: CN103544142B

Abstract

An embodiment of the invention provides a state machine. The state machine comprises a rule module, an editing module, a pre-classification module and a state recognition module. The rule module formulates in advance a division rule for the character set to be recognized, the division rule comprising the features on which the division of the character set is based. The editing module edits a regular expression for the state machine according to those features. The pre-classification module pre-classifies the characters in the character set to be recognized using the division rule. The state recognition module recognizes the divided character set using the regular expression.

Description

A state machine

Technical field

The present invention relates to the field of computer technology, and in particular to a state machine.

Background technology

Lexical analysis is one of the basic functions of computational linguistics and defines how words are formed. The program or function that performs lexical analysis is called a lexical analyzer (lexer for short), also known as a scanner. A lexical analyzer generally takes the form of a function that is called by the syntax analyzer.

The first stage of lexical analysis identifies the character set contained in the word being processed, and this process is usually based on a state machine. A state machine is a diagram describing how words are formed; it consists of state nodes and transition arrows and represents how the state changes under given inputs. Each state machine corresponds to a regular expression.

Existing state machines are mainly of two kinds: nondeterministic finite state machines and deterministic finite state machines. In a nondeterministic finite state machine, the state transition for a given input is not unique; in a deterministic finite state machine, the state transition for a given input is unique.

Existing state machines generally contain hundreds of states and state transitions. This large number makes the state machine highly complex; the high complexity makes it difficult to implement a programming language on a micro-system and makes character processing slow.

Summary of the invention

In view of this, the object of the present invention is to provide a state machine that, by pre-classifying characters, unifies multiple characters sharing the same feature into a single feature value, thereby reducing the number of states and improving the running speed of the state machine.

To achieve the above object, the present invention adopts the following technical solution:

A state machine, characterized in that the state machine comprises:

A rule module, for formulating in advance a division rule for the character set to be recognized, the division rule comprising the features on which the division of the character set is based;

An editing module, for editing a regular expression for the state machine according to the features on which the division of the character set is based in the division rule;

A pre-classification module, for pre-classifying the characters in the character set to be recognized using the division rule;

A state recognition module, for recognizing the divided character set to be recognized using the regular expression.

The rule module comprises:

A first rule unit, for formulating the division rule of the character set to be recognized using uppercase letters, lowercase letters, digits and the underscore as the features on which the division of the character set is based.

The rule module comprises:

A second rule unit, for formulating the division rule of the character set to be recognized using nouns, verbs, pronouns, numeral-classifier words and punctuation marks as the features on which the division of the character set is based.

The rule module further comprises:

A feature value unit, for defining one feature value for each feature.

The pre-classification module comprises:

A feature value output unit, for judging whether a character of the character set to be recognized matches any feature in the division rule and, if so, outputting the feature value of that feature.

As the above technical solution shows, the beneficial effect of the present invention is that the characters in the character set to be recognized are pre-classified by the division rule, so that tens or even hundreds of kinds of characters are reduced, according to their features, to a small, limited number of kinds. A simplified regular expression is edited from those features, and the state machine corresponding to this regular expression has to recognize only a small number of features, which simplifies the state machine and improves its running speed.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic structural diagram of the state machine according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of the state machine according to another embodiment of the present invention;

Fig. 3 is a flowchart of character pre-classification according to the present invention.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

The invention discloses a state machine which, as shown in Fig. 1, specifically comprises:

This embodiment is a basic embodiment of the state machine of the invention. Its beneficial effect is that the characters in the character set to be recognized are pre-classified by the division rule, so that tens or even hundreds of kinds of characters are reduced, according to their features, to a small, limited number of kinds. A simplified regular expression is edited from those features, and the state machine corresponding to this regular expression has to recognize only a small number of features, which simplifies the state machine.

Fig. 2 shows another specific embodiment of the state machine of the invention. The state machine of this embodiment specifically comprises:

In this embodiment, the rule module comprises:

A first rule unit, for formulating the division rule of the character set to be recognized using uppercase letters, lowercase letters, digits and the underscore as the features on which the division of the character set is based;

A second rule unit, for formulating the division rule of the character set to be recognized using nouns, verbs, pronouns, numeral-classifier words and punctuation marks as the features on which the division of the character set is based;

A feature value unit, for defining one feature value for each feature.

In this embodiment, the pre-classification module comprises:

A feature value output unit, for judging whether a character of the character set to be recognized matches any feature in the division rule and, if so, outputting the feature value of that feature.

In this embodiment, the feature values are defined as named constants, in the manner of macro definitions. Assuming the characters are divided according to the division rule formulated by the first rule unit, the definition is:

const int a2z=512, A2Z=513, z2n=514, underscore=515, others=516;

That is, in the program, uppercase letters A to Z are represented by A2Z, lowercase letters a to z by a2z, digits 0 to 9 by z2n, the underscore by underscore, and any character matching none of these features by others. The others feature is added deliberately in this embodiment: a character that is not an uppercase letter, a lowercase letter, a digit or an underscore is deemed to belong to others, and the feature value 516 is output.

In this embodiment, the feature value of A2Z is defined as 513, that of a2z as 512, that of z2n as 514, that of underscore as 515, and that of others as 516. In practice, a feature value may be defined as any number. The classification scheme and feature values above are the preferred choices of this embodiment; other choices may be adopted in practice provided they do not affect the overall scheme.

The editing module edits a regular expression for the state machine according to the above division rule. The regular expression in this embodiment is:

{a2z,A2Z,underscore}{a2z,A2Z,underscore,z2n}*

This regular expression means that the first character of the character set is a2z, A2Z or underscore, and that each of the zero or more characters that follow is a2z, A2Z, underscore or z2n. In other words, the regular expression does not allow the first character of the character set to be a digit.
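As an illustration, the state machine corresponding to this regular expression can be sketched in C as a small recognizer that operates on feature values rather than on raw characters. The state names S_START, S_BODY and S_REJECT and the functions step and matches are hypothetical names introduced for this sketch and do not appear in the original text; the feature values are written as an enum here only so that the block stands alone.

```c
#include <stdbool.h>
#include <stddef.h>

/* Feature values as defined in this embodiment. */
enum { a2z = 512, A2Z = 513, z2n = 514, underscore = 515, others = 516 };

/* Hypothetical state names; the original text does not name the states. */
enum state { S_START, S_BODY, S_REJECT };

/* One transition of a recognizer for
 * {a2z,A2Z,underscore}{a2z,A2Z,underscore,z2n}* */
static enum state step(enum state s, int feature)
{
    bool letter_like = (feature == a2z || feature == A2Z ||
                        feature == underscore);

    switch (s) {
    case S_START:
        /* The first symbol must not be a digit. */
        return letter_like ? S_BODY : S_REJECT;
    case S_BODY:
        /* Later symbols may also be digits. */
        return (letter_like || feature == z2n) ? S_BODY : S_REJECT;
    default:
        /* Once rejected, always rejected. */
        return S_REJECT;
    }
}

/* Returns true if the whole sequence of feature values matches. */
bool matches(const int *features, size_t n)
{
    enum state s = S_START;
    for (size_t i = 0; i < n; i++)
        s = step(s, features[i]);
    return s == S_BODY;
}
```

Under this sketch, the sequence 513, 512, 514 (an uppercase letter, a lowercase letter, a digit) is accepted, while any sequence that starts with 514 (z2n) is rejected, as is an empty sequence.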

The character set to be recognized is divided according to the above division rule; that is, before the state recognition module recognizes the character set, the characters in it are pre-classified.

This embodiment specifies that a2z is the first feature, A2Z the second, z2n the third, underscore the fourth and others the fifth. To determine the feature of a character and output its feature value, the feature value output unit tests the character against the features in order, from the first to the fifth. If the character matches a feature, the feature value of that feature is output; if it does not, the next feature is tested. See Fig. 3 for details.

Take the English word "English" as an example of a character set to be recognized; it comprises 7 characters in total.

The first character of the character set is "E". The character "E" is tested against the first feature, a2z, and does not match. It is then tested against the second feature, A2Z, and matches, so the feature value 513 of the second feature is output.

The second character of the character set is "n". The character "n" is tested against the first feature, a2z, and matches, so the feature value 512 of the first feature is output.

All remaining characters in the character set are tested in turn in the same way, yielding the feature value of each character; this completes the pre-classification of the character set to be recognized.
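The pre-classification procedure described above can be sketched in C as follows. The function names feature_value and preclassify are hypothetical, and the character range tests assume an ASCII-compatible encoding in which the letters of each case are contiguous; neither is specified in the original text.

```c
#include <stddef.h>

/* Feature values, as in the definition given in this embodiment. */
static const int a2z = 512, A2Z = 513, z2n = 514,
                 underscore = 515, others = 516;

/* Tests the features in the stated order (first to fifth) and returns
 * the feature value of the first one that matches.
 * Assumption: ASCII-compatible character encoding. */
int feature_value(char c)
{
    if (c >= 'a' && c <= 'z') return a2z;        /* first feature  */
    if (c >= 'A' && c <= 'Z') return A2Z;        /* second feature */
    if (c >= '0' && c <= '9') return z2n;        /* third feature  */
    if (c == '_')             return underscore; /* fourth feature */
    return others;                               /* fifth feature  */
}

/* Pre-classifies the first n characters of text into out,
 * which must have room for n values. */
void preclassify(const char *text, size_t n, int *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = feature_value(text[i]);
}
```

Under these assumptions, calling preclassify("English", 7, out) fills out with 513 followed by six 512s.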

Conventionally, without the pre-classification described above, uppercase letters A to Z are treated as 26 features, lowercase letters a to z as 26 features, digits 0 to 9 as 10 features, the underscore as 1 feature and all other cases as 1 feature, giving 64 features in total. When a regular expression is edited from these 64 features, the corresponding state machine must recognize all of them together with their transition states and feedback states, so the number of items the state machine has to recognize reaches the hundreds.

With the method of the invention, the character set to be recognized is pre-classified, which reduces the 64 features above to the 5 features given in this embodiment, and the regular expression is edited from these 5 features. The number of features that the corresponding state machine must recognize is thereby greatly reduced. The invention thus simplifies the state machine and improves its running speed.
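The reduction can be made concrete by counting the entries of a dense transition table, which has one entry per pair of state and input symbol. The sketch below is only an illustration: the three-state layout (start, inside the word, reject) is an assumption made here rather than a figure from the original text, and table_entries is a hypothetical helper.

```c
#include <stddef.h>

/* Symbol counts taken from the text. */
enum {
    RAW_SYMBOLS   = 26 + 26 + 10 + 1 + 1, /* A-Z, a-z, 0-9, underscore, others */
    CLASS_SYMBOLS = 5                     /* a2z, A2Z, z2n, underscore, others */
};

/* Assumed state count: start, inside the word, reject. */
enum { NUM_STATES = 3 };

/* Number of entries in a dense states-by-symbols transition table. */
size_t table_entries(size_t states, size_t symbols)
{
    return states * symbols;
}
```

With three states, the dense table shrinks from 3 x 64 = 192 entries to 3 x 5 = 15 entries.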

Likewise, when the second rule unit of this embodiment formulates the division rule of the character set to be recognized using nouns, verbs, pronouns, numeral-classifier words and punctuation marks as the features on which the division is based, the same simplification of the state machine is achieved.

Take the Chinese sentence meaning "I am a soldier" as an example. According to the division rule of the second rule unit, the word for "I" is classified as a pronoun, the word for "am" as a verb, the numeral "one" and the measure word that follows it both as numeral-classifier words, and the word for "soldier" as a noun, and the corresponding feature values are output as configured. In this embodiment, apart from the change of division rule, the remaining modules and units of the state machine work on the same principle as above, so the description is not repeated.

The beneficial effect of this embodiment is that it discloses and describes the state machine more concretely and illustrates it with an actual character set and division rule, so that the state machine of this embodiment is presented more completely and clearly than in the embodiment shown in Fig. 1.

The above is only the preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make improvements and modifications without departing from the principles of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention.

Claims

1. A state machine, characterized in that the state machine comprises:

2. The state machine according to claim 1, characterized in that the rule module comprises:

3. The state machine according to claim 1, characterized in that the rule module comprises:

4. The state machine according to any one of claims 1 to 3, characterized in that the rule module further comprises:

A feature value unit, for defining one feature value for each feature.

5. The state machine according to claim 4, characterized in that the pre-classification module comprises: