Phishing website identification system and method
This patent application is a divisional application of the Chinese invention patent application filed on June 28, 2012, with application number 201210224485.3, entitled "Phishing website identification system and method".
Technical field
The present invention relates to the technical field of network security, and in particular to a phishing website identification system and method.
Background technology
With the development of the Internet, the number of Internet users increases year by year. When surfing the Internet, users face not only the traditional threats of Trojan horses and viruses; over the past two years the number of phishing websites has also increased significantly.
The main current phishing website identification technique collects common phishing websites into a knowledge base, then calculates the similarity between a newly discovered web page and the phishing websites in the knowledge base, and thereby judges whether the page is a phishing website.
This knowledge-base method can usually identify only phishing websites of known types and cannot identify phishing websites of new types. For example, if the knowledge base contains only phishing websites related to Bank of China, a phishing website that counterfeits the Industrial and Commercial Bank of China cannot be identified.
Summary of the invention
The technical problem to be solved by the present invention is how to provide a phishing website identification system and method that effectively identify phishing websites of new types.
To solve the above technical problem, the present invention provides a phishing website identification system comprising: a domain name acquisition unit, a domain name statistics unit and a website identification unit;
The domain name acquisition unit is adapted to collect all links appearing in a website to be identified and to obtain the domain name corresponding to each link;
The domain name statistics unit is adapted to count the number of times each domain name appears in the website to be identified and to find the domain name with the most occurrences, denoted the target domain name;
The website identification unit is adapted to judge, according to the target domain name and the own domain name of the website to be identified, whether the website to be identified is a phishing website.
Wherein the website identification unit comprises: a comparison subunit and an identification subunit;
The comparison subunit is adapted to compare the target domain name with the own domain name and, when the comparison result shows that the target domain name and the own domain name are identical, to judge that the website to be identified is not a phishing website;
The identification subunit is adapted, when the target domain name differs from the own domain name, to calculate the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name, to calculate the similarity between the target domain name and the own domain name, and then to judge, according to the ratio and the similarity, whether the website to be identified is a phishing website.
Wherein the identification subunit comprises: a ratio calculation module, a similarity calculation module and a judgment module;
The ratio calculation module is adapted to calculate the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name;
The similarity calculation module is adapted to calculate the similarity between the target domain name and the own domain name;
The judgment module is adapted to judge whether the ratio and the similarity satisfy the condition that the ratio is greater than a predetermined ratio and the similarity is greater than a predetermined threshold; if so, the website to be identified is judged to be a phishing website; otherwise, the website to be identified is judged not to be a phishing website.
Wherein the similarity calculation module comprises: a string comparison submodule, an initial value calculation submodule and a final value calculation submodule;
The string comparison submodule is adapted to build a comparison array from the character string of the target domain name and the character string of the own domain name, in which the string of the target domain name is placed in the first row of the comparison array and kept in a fixed position, and the string of the own domain name is placed in the second row of the comparison array and moved from left to right, and to compare the overlapping characters of the two rows;
The initial value calculation submodule is adapted to: when the first character of the target domain name is aligned with the last character of the own domain name, calculate the first similarity calculation value $Q_1$ between the target domain name and the own domain name; when the second character of the target domain name is aligned with the last character of the own domain name, calculate the second similarity calculation value $Q_2$ between the target domain name and the own domain name; and so on, until the last character of the target domain name is aligned with the first character of the own domain name, at which point the m-th similarity calculation value $Q_m$ between the target domain name and the own domain name is calculated; where $m = n_1 + n_2 - 1$, $n_1$ is the string length of the target domain name, and $n_2$ is the string length of the own domain name;
The final value calculation submodule is adapted to obtain the similarity $Q_{\max}$ between the target domain name and the own domain name according to the following formula:
$Q_{\max} = \max\{Q_1, Q_2, Q_3, \ldots, Q_m\}$.
Wherein, in the initial value calculation submodule, the i-th similarity calculation value $Q_i$ is calculated by the following formula:
$Q_i = M_i^2 \times L_i$;
where i is a natural number with $1 \le i \le m$, and
$M_i = s_i / n_{\max}$;
$L_i = r_i / n_{\max}$;
where $r_i$ is the number of overlapping characters between the string of the own domain name and the string of the target domain name at the i-th comparison; $n_{\max}$ is the number of characters in the longer of the two strings; $L_i$ is the overlap rate of the two strings at the i-th comparison; $s_i$ is the number of characters that overlap and are identical in the two strings at the i-th comparison; and $M_i$ is the matching rate of the two strings at the i-th comparison.
Wherein, in the initial value calculation submodule, the i-th similarity calculation value $Q_i$ is calculated in the following way:
At the i-th comparison, the number of characters that overlap and are identical in the string of the target domain name and the string of the own domain name is counted, and this number is taken as the i-th similarity calculation value $Q_i$.
Wherein the system further comprises: a supplementary identification unit;
The supplementary identification unit is adapted to mark a website to be identified that has been judged to be a phishing website as a suspected site, to carry out supplementary identification on the suspected site, and, when the result of the supplementary identification shows that the suspected site is still a phishing website, to add the suspected site to a phishing website database.
Wherein the domain name corresponding to a link is the absolute address of that link.
Wherein the system further comprises: a website acquisition unit;
The website acquisition unit is adapted to search for newly created websites to serve as websites to be identified.
The present invention also provides a phishing website identification method, comprising the steps of:
Collecting all links appearing in a website to be identified, and obtaining the domain name corresponding to each link;
Counting the number of times each domain name appears in the website to be identified, and finding the domain name with the most occurrences, denoted the target domain name;
Judging, according to the target domain name and the own domain name of the website to be identified, whether the website to be identified is a phishing website.
Wherein the step of judging, according to the target domain name and the own domain name of the website to be identified, whether the website to be identified is a phishing website further comprises the steps of:
Judging whether the target domain name is identical to the own domain name; if so, judging that the website to be identified is not a phishing website and ending the process; otherwise, performing the next step;
Calculating the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name, and the similarity between the target domain name and the own domain name, and judging, according to the ratio and the similarity, whether the website to be identified is a phishing website.
Wherein the step of calculating the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name, and the similarity between the target domain name and the own domain name, and judging, according to the ratio and the similarity, whether the website to be identified is a phishing website further comprises the steps of:
Calculating the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name;
Calculating the similarity between the target domain name and the own domain name;
Judging whether the following condition is met: the ratio is greater than a predetermined ratio and the similarity is greater than a predetermined threshold; if so, judging that the website to be identified is a phishing website; otherwise, judging that the website to be identified is not a phishing website.
Wherein the step of calculating the similarity between the target domain name and the own domain name further comprises the steps of:
Building a comparison array from the character string of the target domain name and the character string of the own domain name, in which the string of the target domain name is placed in the first row of the comparison array and kept in a fixed position, and the string of the own domain name is placed in the second row of the comparison array and moved from left to right, and comparing the overlapping characters of the two rows;
When the first character of the target domain name is aligned with the last character of the own domain name, calculating the first similarity calculation value $Q_1$ between the target domain name and the own domain name; when the second character of the target domain name is aligned with the last character of the own domain name, calculating the second similarity calculation value $Q_2$ between the target domain name and the own domain name; and so on, until the last character of the target domain name is aligned with the first character of the own domain name, at which point the m-th similarity calculation value $Q_m$ between the target domain name and the own domain name is calculated; where $m = n_1 + n_2 - 1$, $n_1$ is the string length of the target domain name, and $n_2$ is the string length of the own domain name;
Obtaining the similarity $Q_{\max}$ between the target domain name and the own domain name according to the following formula:
$Q_{\max} = \max\{Q_1, Q_2, Q_3, \ldots, Q_m\}$.
Wherein, in the step of calculating the first similarity calculation value $Q_1$ when the first character of the target domain name is aligned with the last character of the own domain name, the second similarity calculation value $Q_2$ when the second character of the target domain name is aligned with the last character of the own domain name, and so on, up to the m-th similarity calculation value $Q_m$ when the last character of the target domain name is aligned with the first character of the own domain name, the i-th similarity calculation value $Q_i$ is calculated by the following formula:
$Q_i = M_i^2 \times L_i$;
where i is a natural number with $1 \le i \le m$, and
$M_i = s_i / n_{\max}$;
$L_i = r_i / n_{\max}$;
where $r_i$ is the number of overlapping characters between the string of the own domain name and the string of the target domain name at the i-th comparison; $n_{\max}$ is the number of characters in the longer of the two strings; $L_i$ is the overlap rate of the two strings at the i-th comparison; $s_i$ is the number of characters that overlap and are identical in the two strings at the i-th comparison; and $M_i$ is the matching rate of the two strings at the i-th comparison.
Wherein, in the step of calculating the first similarity calculation value $Q_1$ when the first character of the target domain name is aligned with the last character of the own domain name, the second similarity calculation value $Q_2$ when the second character of the target domain name is aligned with the last character of the own domain name, and so on, up to the m-th similarity calculation value $Q_m$ when the last character of the target domain name is aligned with the first character of the own domain name, the i-th similarity calculation value $Q_i$ is calculated in the following way:
At the i-th comparison, the number of characters that overlap and are identical in the string of the target domain name and the string of the own domain name is counted, and this number is taken as the i-th similarity calculation value $Q_i$.
Wherein, after the step of judging, according to the target domain name and the own domain name of the website to be identified, whether the website to be identified is a phishing website, the method further comprises the step of: marking a website to be identified that has been judged to be a phishing website as a suspected site, carrying out supplementary identification on the suspected site, and, when the result of the supplementary identification shows that the suspected site is still a phishing website, adding the suspected site to a phishing website database.
Wherein the domain name corresponding to a link is the absolute address of that link.
Wherein, before the step of collecting all links appearing in the website to be identified and obtaining the domain name corresponding to each link, the method further comprises the step of: searching for newly created websites to serve as websites to be identified.
The phishing website identification system and method of the present invention identify phishing websites based on the link relationships within a website and can effectively identify phishing websites of new types. They also help enrich the number and types of phishing websites in the phishing website database, which facilitates further phishing website identification and searching, and they have broad application prospects in the field of network security.
Brief description of the drawings
Fig. 1 is a schematic diagram of the module structure of the phishing website identification system according to embodiment one of the present invention;
Fig. 2 is a schematic diagram of the module structure of the website identification unit;
Fig. 3 is a schematic diagram of the module structure of the identification subunit;
Fig. 4 is a schematic diagram of the module structure of the similarity calculation module;
Fig. 5 is a schematic diagram of the module structure of the phishing website identification system according to embodiment two of the present invention;
Fig. 6 is a flow chart of the phishing website identification method according to embodiment three of the present invention;
Fig. 7 is a flow chart of the phishing website identification method according to embodiment four of the present invention.
Detailed description of the embodiments
Specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following embodiments illustrate the present invention and are not intended to limit its scope.
Fig. 1 is a schematic diagram of the module structure of the phishing website identification system according to embodiment one of the present invention. As shown in Fig. 1, the system comprises: a domain name acquisition unit 100, a domain name statistics unit 200 and a website identification unit 300.
The domain name acquisition unit 100 is adapted to collect all links appearing in a website to be identified and to obtain the domain name corresponding to each link. Here the domain name corresponding to a link is the absolute address of that link; if a link appearing in the website to be identified uses a relative address, it must be converted to an absolute address.
The domain name statistics unit 200 is adapted to count the number of times each domain name appears in the website to be identified and to find the domain name with the most occurrences, denoted the target domain name. The domain name statistics unit 200 can generate a key-value table with the domain name as the key and the number of occurrences as the value, and then sort the domain names by the values in the table to obtain the domain name with the most occurrences.
The website identification unit 300 is adapted to judge, according to the target domain name and the own domain name of the website to be identified, whether the website to be identified is a phishing website.
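The link collection and relative-to-absolute conversion described for unit 100 could be sketched as follows. This is only an illustrative sketch, not the implementation of the invention: the names `LinkCollector` and `link_domains`, the choice of the `href` and `src` attributes, and the use of the host name as the "domain name" are all assumptions made here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects raw href/src attribute values in document order (assumed link sources)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def link_domains(page_url, html):
    """Return the host name of every link in html, resolving relative links against page_url."""
    collector = LinkCollector()
    collector.feed(html)
    domains = []
    for link in collector.links:
        absolute = urljoin(page_url, link)      # relative address -> absolute address
        host = urlsplit(absolute).hostname      # None for links without a host (e.g. mailto:)
        if host:
            domains.append(host)
    return domains
```

A relative link such as `/login` on a page served from `http://cocc.cn/index.html` is counted under `cocc.cn`, which is how the own domain name accumulates occurrences.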
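The key-value table and sort described for unit 200 might look like the sketch below. The function name `most_frequent_domain` is hypothetical, and the behaviour for an empty input and for ties is an assumption, since the text does not specify either.

```python
from collections import Counter


def most_frequent_domain(domains):
    """Return (target_domain, counts) where counts maps domain name -> occurrence count.

    Assumptions: returns (None, counts) when there are no links; ties are
    broken by first appearance because sorted() is stable.
    """
    counts = Counter(domains)                   # key: domain name, value: occurrences
    if not counts:
        return None, counts
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    target, _ = ranked[0]
    return target, counts
```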
Fig. 2 is a schematic diagram of the module structure of the website identification unit. As shown in Fig. 2, the website identification unit 300 further comprises: a comparison subunit 310 and an identification subunit 320.
The comparison subunit 310 is adapted to compare the target domain name with the own domain name and, when the comparison result shows that the target domain name and the own domain name are identical, to judge that the website to be identified is not a phishing website.
The identification subunit 320 is adapted, when the target domain name differs from the own domain name, to calculate the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name, to calculate the similarity between the target domain name and the own domain name, and then to judge, according to the ratio and the similarity, whether the website to be identified is a phishing website.
Fig. 3 is a schematic diagram of the module structure of the identification subunit. As shown in Fig. 3, the identification subunit 320 further comprises: a ratio calculation module 321, a similarity calculation module 322 and a judgment module 323.
The ratio calculation module 321 is adapted to calculate the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name.
The similarity calculation module 322 is adapted to calculate the similarity between the target domain name and the own domain name.
Fig. 4 is a schematic diagram of the module structure of the similarity calculation module. As shown in Fig. 4, the similarity calculation module 322 further comprises: a string comparison submodule 322a, an initial value calculation submodule 322b and a final value calculation submodule 322c.
The string comparison submodule 322a is adapted to build a comparison array from the character string of the target domain name and the character string of the own domain name, in which the string of the target domain name is placed in the first row of the comparison array and kept in a fixed position, and the string of the own domain name is placed in the second row of the comparison array and moved from left to right, and to compare the overlapping characters of the two rows.
The initial value calculation submodule 322b is adapted to: when the first character of the target domain name is aligned with the last character of the own domain name, calculate the first similarity calculation value $Q_1$ between the target domain name and the own domain name; when the second character of the target domain name is aligned with the last character of the own domain name, calculate the second similarity calculation value $Q_2$ between the target domain name and the own domain name; and so on, until the last character of the target domain name is aligned with the first character of the own domain name, at which point the m-th similarity calculation value $Q_m$ between the target domain name and the own domain name is calculated; where $m = n_1 + n_2 - 1$, $n_1$ is the string length of the target domain name, and $n_2$ is the string length of the own domain name.
In the initial value calculation submodule 322b, the i-th similarity calculation value $Q_i$ is calculated by the following formula:
$Q_i = M_i^2 \times L_i$;
where i is a natural number with $1 \le i \le m$, and
$M_i = s_i / n_{\max}$;
$L_i = r_i / n_{\max}$;
where $r_i$ is the number of overlapping characters between the string of the own domain name and the string of the target domain name at the i-th comparison; $n_{\max}$ is the number of characters in the longer of the two strings; $L_i$ is the overlap rate of the two strings at the i-th comparison; $s_i$ is the number of characters that overlap and are identical in the two strings at the i-th comparison; and $M_i$ is the matching rate of the two strings at the i-th comparison.
For example, suppose the own domain name is boc.cn and moves from left to right, while the target domain name is cocc.cn and stays in a fixed position. At the 1st comparison, only the character n (of boc.cn) overlaps the character c (of cocc.cn), so $r_1 = 1$ and $s_1 = 0$. At the 2nd comparison, the character n overlaps the character o and the character c overlaps the character c, so $r_2 = 2$ and $s_2 = 1$.
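The sliding comparison in this example can be reproduced with a short sketch. The function name `sliding_overlaps` and the `offset` bookkeeping are assumptions introduced here for illustration; the function returns the pair ($r_i$, $s_i$) for each of the $m = n_1 + n_2 - 1$ comparisons.

```python
def sliding_overlaps(target, own):
    """Return [(r_1, s_1), ..., (r_m, s_m)] for own sliding left to right under a fixed target."""
    n1, n2 = len(target), len(own)
    results = []
    for i in range(1, n1 + n2):                 # i = 1 .. m, with m = n1 + n2 - 1
        offset = i - n2                         # index in target that own[0] is aligned with
        start = max(0, offset)                  # first overlapping index in target
        end = min(n1, offset + n2)              # one past the last overlapping index
        r = end - start                         # number of overlapping characters
        s = sum(1 for p in range(start, end) if target[p] == own[p - offset])
        results.append((r, s))
    return results
```

For `sliding_overlaps("cocc.cn", "boc.cn")` the list has 12 entries, and the first two are `(1, 0)` and `(2, 1)`, which agrees with the values of $r_1$, $s_1$, $r_2$ and $s_2$ given above.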
In addition, in the initial value calculation submodule, the i-th similarity calculation value $Q_i$ can also be calculated in the following way:
At the i-th comparison, the number of characters that overlap and are identical in the string of the target domain name and the string of the own domain name is counted, and this number is taken as the i-th similarity calculation value $Q_i$.
Known existing methods can also be used to calculate the i-th similarity calculation value $Q_i$; since they are not the focus of the present invention, they are not described here.
The final value calculation submodule 322c is adapted to obtain the similarity $Q_{\max}$ between the target domain name and the own domain name according to the following formula:
$Q_{\max} = \max\{Q_1, Q_2, Q_3, \ldots, Q_m\}$.
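Putting the formulas for $Q_i$ and $Q_{\max}$ together, a minimal sketch might read as follows. It uses $M_i = s_i / n_{\max}$ and $L_i = r_i / n_{\max}$ exactly as stated above; the function name `similarity` and the return value of 0.0 for an empty string are assumptions, not part of the described system.

```python
def similarity(target, own):
    """Return Q_max = max(Q_1..Q_m), with Q_i = M_i**2 * L_i as defined in the text."""
    if not target or not own:
        return 0.0                              # assumption: empty input has no similarity
    n1, n2 = len(target), len(own)
    n_max = max(n1, n2)                         # length of the longer string
    q_values = []
    for i in range(1, n1 + n2):                 # the m = n1 + n2 - 1 alignments
        offset = i - n2
        start, end = max(0, offset), min(n1, offset + n2)
        r_i = end - start                       # overlapping characters
        s_i = sum(1 for p in range(start, end) if target[p] == own[p - offset])
        m_i = s_i / n_max                       # matching rate M_i
        l_i = r_i / n_max                       # overlap rate L_i
        q_values.append(m_i ** 2 * l_i)
    return max(q_values)
```

Under these formulas two identical strings score 1.0, while `similarity("cocc.cn", "boc.cn")` evaluates to 96/343 (about 0.28), reached at the alignment where `c.cn` matches.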
The judgment module 323 is adapted to judge whether the ratio and the similarity satisfy the condition that the ratio is greater than a predetermined ratio and the similarity is greater than a predetermined threshold; if so, the website to be identified is judged to be a phishing website; otherwise, the website to be identified is judged not to be a phishing website. The predetermined ratio and the predetermined threshold can be set and adjusted according to actual use; in this embodiment, the predetermined ratio is preferably 1.0 and the predetermined threshold is preferably 80%.
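The condition checked by the judgment module can be written as a small predicate. This is a sketch only: the name `is_phishing` and its parameters are hypothetical, the defaults mirror the preferred values of this embodiment, and treating an own-domain count of zero as an infinite ratio is an assumption, because the text does not say how that case is handled.

```python
def is_phishing(target_count, own_count, similarity_value,
                min_ratio=1.0, min_similarity=0.8):
    """Return True when ratio > min_ratio and similarity > min_similarity."""
    if own_count == 0:
        ratio = float("inf")                    # assumption: no own-domain links at all
    else:
        ratio = target_count / own_count
    return ratio > min_ratio and similarity_value > min_similarity
```

Both conditions must hold, so a page that links heavily to an unrelated domain (high ratio, low similarity) is not flagged by this rule.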
Fig. 5 is a schematic diagram of the module structure of the phishing website identification system according to embodiment two of the present invention. As shown in Fig. 5, the system of this embodiment is substantially the same as the system of embodiment one; the only difference is that the system of this embodiment further comprises: a website acquisition unit 000 and a supplementary identification unit 400.
The website acquisition unit 000 is adapted to search for newly created websites to serve as websites to be identified. In general, most phishing websites are newly created websites; therefore, by providing the website acquisition unit 000 and taking only newly created websites as websites to be identified, the scope of phishing website identification can be narrowed and the accuracy and speed of identification improved. Newly created websites can be found in the following ways: monitoring search engine results pages for particular keywords; or using a client to discover websites with few user visits.
The supplementary identification unit 400 is adapted to mark a website to be identified that has been judged to be a phishing website as a suspected site, to carry out supplementary identification on the suspected site, and, when the result of the supplementary identification shows that the suspected site is still a phishing website, to add the suspected site to a phishing website database. The supplementary identification can take the form of manual review. Providing the supplementary identification unit 400 can further improve the accuracy of phishing website identification.
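The supplementary identification step could be sketched as a second-stage filter. Everything named here is hypothetical: `supplementary_identification`, the `confirm` callback (standing in for manual review or any other second check) and the use of a Python set as the phishing website database.

```python
def supplementary_identification(suspected_sites, confirm, phishing_db):
    """Add to phishing_db every suspected site that the confirm callback still judges as phishing."""
    for site in suspected_sites:
        if confirm(site):                       # e.g. the outcome of a manual review
            phishing_db.add(site)
    return phishing_db
```

Only sites confirmed by the second check reach the database, so a false positive from the link-based judgment does not pollute later lookups.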
Fig. 6 is a flow chart of the phishing website identification method according to embodiment three of the present invention. As shown in Fig. 6, the method comprises the steps of:
A: Collect all links appearing in a website to be identified, and obtain the domain name corresponding to each link. The domain name corresponding to a link is the absolute address of that link.
B: Count the number of times each domain name appears in the website to be identified, and find the domain name with the most occurrences, denoted the target domain name.
C: Judge, according to the target domain name and the own domain name of the website to be identified, whether the website to be identified is a phishing website.
Step C further comprises the steps of:
C1: Judge whether the target domain name is identical to the own domain name; if so, judge that the website to be identified is not a phishing website and end the process; otherwise, perform step C2;
C2: Calculate the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name, and the similarity between the target domain name and the own domain name, and judge, according to the ratio and the similarity, whether the website to be identified is a phishing website.
Step C2 further comprises the steps of:
C21: Calculate the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name.
C22: Calculate the similarity between the target domain name and the own domain name.
Step C22 further comprises the steps of:
C221: Build a comparison array from the character string of the target domain name and the character string of the own domain name, in which the string of the target domain name is placed in the first row of the comparison array and kept in a fixed position, and the string of the own domain name is placed in the second row of the comparison array and moved from left to right, and compare the overlapping characters of the two rows.
C222: When the first character of the target domain name is aligned with the last character of the own domain name, calculate the first similarity calculation value $Q_1$ between the target domain name and the own domain name; when the second character of the target domain name is aligned with the last character of the own domain name, calculate the second similarity calculation value $Q_2$ between the target domain name and the own domain name; and so on, until the last character of the target domain name is aligned with the first character of the own domain name, at which point the m-th similarity calculation value $Q_m$ between the target domain name and the own domain name is calculated; where $m = n_1 + n_2 - 1$, $n_1$ is the string length of the target domain name, and $n_2$ is the string length of the own domain name.
In step C222, the i-th similarity calculation value $Q_i$ is calculated by the following formula:
$Q_i = M_i^2 \times L_i$;
where i is a natural number with $1 \le i \le m$, and
$M_i = s_i / n_{\max}$;
$L_i = r_i / n_{\max}$;
where $r_i$ is the number of overlapping characters between the string of the own domain name and the string of the target domain name at the i-th comparison; $n_{\max}$ is the number of characters in the longer of the two strings; $L_i$ is the overlap rate of the two strings at the i-th comparison; $s_i$ is the number of characters that overlap and are identical in the two strings at the i-th comparison; and $M_i$ is the matching rate of the two strings at the i-th comparison.
In addition, in step C222, the i-th similarity calculation value $Q_i$ can also be calculated in the following way:
At the i-th comparison, the number of characters that overlap and are identical in the string of the target domain name and the string of the own domain name is counted, and this number is taken as the i-th similarity calculation value $Q_i$.
C223: Obtain the similarity $Q_{\max}$ between the target domain name and the own domain name according to the following formula:
$Q_{\max} = \max\{Q_1, Q_2, Q_3, \ldots, Q_m\}$.
C23: Judge whether the following condition is met: the ratio is greater than a predetermined ratio and the similarity is greater than a predetermined threshold; if so, judge that the website to be identified is a phishing website; otherwise, judge that the website to be identified is not a phishing website.
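Steps B through C23 can be strung together as in the sketch below. It is a hedged outline rather than the method itself: `classify` is a hypothetical name, the similarity measure of step C22 is passed in as a callable so that either of the calculation modes described above can be plugged in, and the handling of an empty link list and of a zero own-domain count are assumptions.

```python
from collections import Counter


def classify(own_domain, link_domains, similarity,
             min_ratio=1.0, min_similarity=0.8):
    """Return True if the site is judged a phishing website by steps B to C23."""
    counts = Counter(link_domains)              # step B: occurrences per domain name
    if not counts:
        return False                            # assumption: no links, nothing to judge
    target, target_count = counts.most_common(1)[0]
    if target == own_domain:                    # step C1: same domain, not phishing
        return False
    own_count = counts[own_domain]              # 0 if the own domain never appears
    ratio = target_count / own_count if own_count else float("inf")   # step C21
    q_max = similarity(target, own_domain)      # step C22, supplied by the caller
    return ratio > min_ratio and q_max > min_similarity               # step C23
```

Passing the similarity function as a parameter keeps the flow independent of how $Q_i$ is computed, at the cost of leaving the threshold's scale to the caller.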
Fig. 7 is a flow chart of the phishing website identification method according to embodiment four of the present invention. As shown in Fig. 7, the method of this embodiment is substantially the same as the method of embodiment three; the only differences are as follows:
Before step A, the method further comprises step A': search for newly created websites to serve as websites to be identified. Newly created websites can be found in the following ways: monitoring search engine results pages for particular keywords; or using a client to discover websites with few user visits.
After step C, the method further comprises step D: mark a website to be identified that has been judged to be a phishing website as a suspected site, carry out supplementary identification on the suspected site, and, when the result of the supplementary identification shows that the suspected site is still a phishing website, add the suspected site to a phishing website database. The supplementary identification can take the form of manual review.
The phishing website identification system and method of the embodiments of the present invention identify phishing websites based on the link relationships within a website and can effectively identify phishing websites of new types. They also help enrich the number and types of phishing websites in the phishing website database, which facilitates further phishing website identification and searching, and they have broad application prospects in the field of network security.
The above embodiments only illustrate the present invention and do not limit it. A person of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the present invention; all equivalent technical solutions therefore also fall within the scope of the present invention, and the scope of patent protection of the present invention shall be defined by the claims.