CN104580254A - Phishing website identification system and method - Google Patents

Info

Publication number
CN104580254A
Authority
CN
China
Prior art keywords
domain name
self
website
target domain
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510051628.9A
Other languages
Chinese (zh)
Other versions
CN104580254B (en
Inventor
陈营营
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Hongxiang Technical Service Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201510051628.9A (patent CN104580254B)
Priority claimed from CN201210224485.3A (patent CN102801709B)
Publication of CN104580254A
Application granted
Publication of CN104580254B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/12 Applying verification of the received information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/51 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems at application loading time, e.g. accepting, rejecting, starting or inhibiting executable software based on integrity or source reliability
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic

Abstract

The invention discloses a phishing website identification system and method and relates to the field of network security. The system comprises a domain name acquiring unit, a domain name counting unit and a website identification unit. The domain name acquiring unit is adapted to collect all links appearing in a website to be identified and obtain the domain names corresponding to the links. The domain name counting unit is adapted to count the number of times each domain name appears in the website to be identified, find the domain name that appears most often, and mark it as the target domain name. The website identification unit is adapted to judge whether the website to be identified is a phishing website according to the target domain name and the website's own domain name. Because identification is based on the link relations within the website, the system and method can effectively identify new types of phishing websites, help enrich the number and types of phishing websites in a phishing website library, facilitate further identification and lookup of phishing websites, and have broad application prospects in the field of network security.

Description

Phishing website identification system and method
This patent application is a divisional application of the Chinese invention patent application filed on June 28, 2012, with application number 201210224485.3, entitled "Phishing website identification system and method".
Technical field
The present invention relates to the technical field of network security, and in particular to a phishing website identification system and method.
Background art
With the development of the Internet, the number of Internet users increases year by year. When surfing the Internet, users face not only the traditional threats of Trojans and viruses; the number of phishing websites has also increased significantly over the past two years.
The main current technique for identifying phishing websites is to collect common phishing websites into a knowledge base and then calculate the similarity between a newly discovered web page and the phishing websites in the knowledge base, thereby judging whether the page is a phishing website.
This knowledge-base method can usually identify only known categories of phishing websites and cannot identify new types. For example, if the knowledge base contains only phishing websites related to Bank of China, a phishing website that counterfeits the Industrial and Commercial Bank of China cannot be identified.
Summary of the invention
The technical problem to be solved by the present invention is how to provide a phishing website identification system and method that effectively identify new types of phishing websites.
To solve the above technical problem, the present invention provides a phishing website identification system comprising: a domain name acquiring unit, a domain name counting unit and a website identification unit;
The domain name acquiring unit is adapted to collect all links appearing in a website to be identified and obtain the domain names corresponding to the links;
The domain name counting unit is adapted to count the number of times each domain name appears in the website to be identified, find the domain name with the most occurrences, and denote it as the target domain name;
The website identification unit is adapted to judge whether the website to be identified is a phishing website according to the target domain name and the own domain name of the website to be identified.
Wherein, the website identification unit comprises: a comparison subunit and an identification subunit;
The comparison subunit is adapted to compare the target domain name with the own domain name and, when the comparison result shows that the target domain name is identical to the own domain name, judge that the website to be identified is not a phishing website;
The identification subunit is adapted to, when the target domain name differs from the own domain name, calculate the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name, calculate the similarity between the target domain name and the own domain name, and then judge whether the website to be identified is a phishing website according to the ratio and the similarity.
Wherein, the identification subunit comprises: a ratio calculation module, a similarity calculation module and a judgment module;
The ratio calculation module is adapted to calculate the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name;
The similarity calculation module is adapted to calculate the similarity between the target domain name and the own domain name;
The judgment module is adapted to judge whether the ratio and the similarity satisfy the condition that the ratio is greater than a predetermined ratio and the similarity is greater than a predetermined threshold; if so, the website to be identified is judged to be a phishing website; otherwise, it is judged not to be a phishing website.
Wherein, the similarity calculation module comprises: a string comparison submodule, an initial value calculation submodule and a final value calculation submodule;
The string comparison submodule is adapted to build a comparison array from the character string of the target domain name and the character string of the own domain name: the character string of the target domain name is placed in the first row of the comparison array and kept in a fixed position, the character string of the own domain name is placed in the second row and moved from left to right, and the overlapping characters of the two rows are compared;
The initial value calculation submodule is adapted to calculate the first similarity calculation value Q_1 between the target domain name and the own domain name when the first character of the target domain name is aligned with the last character of the own domain name; to calculate the second similarity calculation value Q_2 when the second character of the target domain name is aligned with the last character of the own domain name; and so on, until the last character of the target domain name is aligned with the first character of the own domain name, at which point it calculates the m-th similarity calculation value Q_m; where m = n_1 + n_2 - 1, n_1 is the string length of the target domain name, and n_2 is the string length of the own domain name;
The final value calculation submodule is adapted to obtain the similarity Q_max between the target domain name and the own domain name according to the following formula:
Q_max = max{Q_1, Q_2, Q_3, ..., Q_m}.
Wherein, in the initial value calculation submodule, the i-th similarity calculation value Q_i is calculated by the following formula:
Q_i = M_i^2 × L_i
where i is a natural number and 1 ≤ i ≤ m; further,
M_i = s_i / n_max
L_i = r_i / n_max
where r_i is the number of overlapping characters between the character string of the own domain name and the character string of the target domain name at the i-th comparison; n_max is the number of characters in the longer of the two character strings; L_i is the overlap rate of the two character strings at the i-th comparison; s_i is the number of characters that overlap and are identical at the i-th comparison; and M_i is the matching rate of the two character strings at the i-th comparison.
Wherein, in the initial value calculation submodule, the i-th similarity calculation value Q_i may instead be calculated as follows:
At the i-th comparison, count the characters that overlap and are identical between the character string of the target domain name and the character string of the own domain name, and take that count as the i-th similarity calculation value Q_i.
Wherein, the system further comprises: a supplementary identification unit;
The supplementary identification unit is adapted to denote a website to be identified that has been judged to be a phishing website as a suspected site and to perform supplementary identification on the suspected site; when the identification result shows that the suspected site is still a phishing website, the suspected site is sent to the phishing website library.
Wherein, the domain name corresponding to a link is the absolute address of the link.
Wherein, the system further comprises: a website acquiring unit;
The website acquiring unit is adapted to search for newly built websites to serve as websites to be identified.
The present invention also provides a phishing website identification method comprising the steps of:
Collecting all links appearing in a website to be identified and obtaining the domain names corresponding to the links;
Counting the number of times each domain name appears in the website to be identified, finding the domain name with the most occurrences, and denoting it as the target domain name;
Judging whether the website to be identified is a phishing website according to the target domain name and the own domain name of the website to be identified.
Wherein, judging whether the website to be identified is a phishing website according to the target domain name and the own domain name of the website to be identified further comprises the steps of:
Judging whether the target domain name is identical to the own domain name; if so, judging that the website to be identified is not a phishing website and ending the flow; otherwise, performing the next step;
Calculating the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name, as well as the similarity between the target domain name and the own domain name, and judging whether the website to be identified is a phishing website according to the ratio and the similarity.
Wherein, calculating the ratio and the similarity and judging whether the website to be identified is a phishing website according to the ratio and the similarity further comprises the steps of:
Calculating the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name;
Calculating the similarity between the target domain name and the own domain name;
Judging whether the following condition is met: the ratio is greater than a predetermined ratio and the similarity is greater than a predetermined threshold; if so, judging that the website to be identified is a phishing website; otherwise, judging that it is not a phishing website.
Wherein, calculating the similarity between the target domain name and the own domain name further comprises the steps of:
Building a comparison array from the character string of the target domain name and the character string of the own domain name: the character string of the target domain name is placed in the first row of the comparison array and kept in a fixed position, the character string of the own domain name is placed in the second row and moved from left to right, and the overlapping characters of the two rows are compared;
When the first character of the target domain name is aligned with the last character of the own domain name, calculating the first similarity calculation value Q_1 between the target domain name and the own domain name; when the second character of the target domain name is aligned with the last character of the own domain name, calculating the second similarity calculation value Q_2; and so on, until the last character of the target domain name is aligned with the first character of the own domain name, at which point the m-th similarity calculation value Q_m is calculated; where m = n_1 + n_2 - 1, n_1 is the string length of the target domain name, and n_2 is the string length of the own domain name;
Obtaining the similarity Q_max between the target domain name and the own domain name according to the following formula:
Q_max = max{Q_1, Q_2, Q_3, ..., Q_m}.
Wherein, in the step of calculating the first to m-th similarity calculation values Q_1, Q_2, ..., Q_m for the successive alignments of the target domain name and the own domain name, the i-th similarity calculation value Q_i is calculated by the following formula:
Q_i = M_i^2 × L_i
where i is a natural number and 1 ≤ i ≤ m; further,
M_i = s_i / n_max
L_i = r_i / n_max
where r_i is the number of overlapping characters between the character string of the own domain name and the character string of the target domain name at the i-th comparison; n_max is the number of characters in the longer of the two character strings; L_i is the overlap rate of the two character strings at the i-th comparison; s_i is the number of characters that overlap and are identical at the i-th comparison; and M_i is the matching rate of the two character strings at the i-th comparison.
Wherein, in the step of calculating the first to m-th similarity calculation values Q_1, Q_2, ..., Q_m for the successive alignments of the target domain name and the own domain name, the i-th similarity calculation value Q_i may instead be calculated as follows:
At the i-th comparison, count the characters that overlap and are identical between the character string of the target domain name and the character string of the own domain name, and take that count as the i-th similarity calculation value Q_i.
Wherein, after judging whether the website to be identified is a phishing website according to the target domain name and the own domain name of the website to be identified, the method further comprises the step of: denoting a website to be identified that has been judged to be a phishing website as a suspected site and performing supplementary identification on the suspected site; when the identification result shows that the suspected site is still a phishing website, sending the suspected site to the phishing website library.
Wherein, the domain name corresponding to a link is the absolute address of the link.
Wherein, before collecting all links appearing in the website to be identified and obtaining the domain names corresponding to the links, the method further comprises the step of: searching for newly built websites to serve as websites to be identified.
The phishing website identification system and method of the present invention identify phishing websites on the basis of the link relations within a website and can therefore effectively identify new types of phishing websites. They also help enrich the number and types of phishing websites in the phishing website library, facilitate further identification and lookup of phishing websites, and have broad application prospects in the field of network security.
Brief description of the drawings
Fig. 1 is a schematic diagram of the module structure of the phishing website identification system according to Embodiment 1 of the present invention;
Fig. 2 is a schematic diagram of the module structure of the website identification unit;
Fig. 3 is a schematic diagram of the module structure of the identification subunit;
Fig. 4 is a schematic diagram of the module structure of the similarity calculation module;
Fig. 5 is a schematic diagram of the module structure of the phishing website identification system according to Embodiment 2 of the present invention;
Fig. 6 is a flow chart of the phishing website identification method according to Embodiment 3 of the present invention;
Fig. 7 is a flow chart of the phishing website identification method according to Embodiment 4 of the present invention.
Detailed description of the embodiments
Specific embodiments of the present invention are described in further detail below with reference to the drawings and embodiments. The following embodiments illustrate the present invention but do not limit its scope.
Fig. 1 is a schematic diagram of the module structure of the phishing website identification system according to Embodiment 1 of the present invention. As shown in Fig. 1, the system comprises: a domain name acquiring unit 100, a domain name counting unit 200 and a website identification unit 300.
The domain name acquiring unit 100 is adapted to collect all links appearing in the website to be identified and obtain the domain names corresponding to the links. Here, the domain name corresponding to a link is the absolute address of the link; if a link appearing in the website to be identified uses a relative address, it needs to be converted into an absolute address.
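The link collection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `LinkCollector` and `link_domains` are hypothetical, only `<a href>` links are collected, and the "domain name" of a link is assumed to be the host part of its absolute URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin keeps absolute URLs unchanged and converts
                # relative addresses into absolute ones.
                self.links.append(urljoin(self.base_url, value))


def link_domains(html, base_url):
    """Return the host of every link on the page (hypothetical helper)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    # Assumption: the domain name of a link is the host of its absolute URL.
    # Links without a host (for example mailto:) are skipped.
    hosts = (urlparse(link).hostname for link in collector.links)
    return [host for host in hosts if host]
```

For a page at `http://www.bocc.cn/index.html`, a relative link such as `/login` is counted under `www.bocc.cn`, while `http://www.boc.cn/a` is counted under `www.boc.cn`.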
The domain name counting unit 200 is adapted to count the number of times each domain name appears in the website to be identified, find the domain name with the most occurrences, and denote it as the target domain name. The domain name counting unit 200 may generate a key-value table with the domain name as the key and the number of occurrences as the value, and then sort the domain names by value to obtain the domain name with the most occurrences.
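The key-value table can be sketched with Python's `collections.Counter`. The function name `find_target_domain` is hypothetical, and the handling of ties and of pages without links is an assumption, since the text does not cover either case.

```python
from collections import Counter


def find_target_domain(domains):
    """Return (target_domain, counts) for a list of link domain names."""
    # Key: domain name; value: number of occurrences.
    counts = Counter(domains)
    if not counts:
        # Assumption: a page without links has no target domain name.
        return None, counts
    # most_common sorts by count in descending order; among equal counts
    # the domain seen first comes first (tie handling is an assumption).
    target, _ = counts.most_common(1)[0]
    return target, counts
```

The returned `counts` table can be reused later to look up how often the site's own domain name appears.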
The website identification unit 300 is adapted to judge whether the website to be identified is a phishing website according to the target domain name and the own domain name of the website to be identified.
Fig. 2 is a schematic diagram of the module structure of the website identification unit. As shown in Fig. 2, the website identification unit 300 further comprises: a comparison subunit 310 and an identification subunit 320.
The comparison subunit 310 is adapted to compare the target domain name with the own domain name and, when the comparison result shows that the target domain name is identical to the own domain name, judge that the website to be identified is not a phishing website.
The identification subunit 320 is adapted to, when the target domain name differs from the own domain name, calculate the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name, calculate the similarity between the target domain name and the own domain name, and then judge whether the website to be identified is a phishing website according to the ratio and the similarity.
Fig. 3 is a schematic diagram of the module structure of the identification subunit. As shown in Fig. 3, the identification subunit 320 further comprises: a ratio calculation module 321, a similarity calculation module 322 and a judgment module 323.
The ratio calculation module 321 is adapted to calculate the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name.
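A minimal sketch of the ratio calculation follows. The function name `occurrence_ratio` is hypothetical. The text does not say what happens when the site never links to its own domain name; treating that case as an infinite ratio is an assumption made here only to avoid a division by zero.

```python
def occurrence_ratio(counts, target_domain, own_domain):
    """Ratio of target-domain occurrences to own-domain occurrences."""
    target_count = counts.get(target_domain, 0)
    own_count = counts.get(own_domain, 0)
    if own_count == 0:
        # Assumption (not in the source): with no links to the own domain,
        # any occurrence of the target domain counts as an unbounded ratio.
        return float("inf") if target_count > 0 else 0.0
    return target_count / own_count
```

With 6 links to `boc.cn` and 2 links to the site's own `bocc.cn`, the ratio is 3.0.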
The similarity calculation module 322 is adapted to calculate the similarity between the target domain name and the own domain name.
Fig. 4 is a schematic diagram of the module structure of the similarity calculation module. As shown in Fig. 4, the similarity calculation module 322 further comprises: a string comparison submodule 322a, an initial value calculation submodule 322b and a final value calculation submodule 322c.
The string comparison submodule 322a is adapted to build a comparison array from the character string of the target domain name and the character string of the own domain name: the character string of the target domain name is placed in the first row of the comparison array and kept in a fixed position, the character string of the own domain name is placed in the second row and moved from left to right, and the overlapping characters of the two rows are compared.
The initial value calculation submodule 322b is adapted to calculate the first similarity calculation value Q_1 between the target domain name and the own domain name when the first character of the target domain name is aligned with the last character of the own domain name; to calculate the second similarity calculation value Q_2 when the second character of the target domain name is aligned with the last character of the own domain name; and so on, until the last character of the target domain name is aligned with the first character of the own domain name, at which point it calculates the m-th similarity calculation value Q_m; where m = n_1 + n_2 - 1, n_1 is the string length of the target domain name, and n_2 is the string length of the own domain name.
Wherein, in the initial value calculation submodule 322b, the i-th similarity calculation value Q_i is calculated by the following formula:
Q_i = M_i^2 × L_i
where i is a natural number and 1 ≤ i ≤ m; further,
M_i = s_i / n_max
L_i = r_i / n_max
where r_i is the number of overlapping characters between the character string of the own domain name and the character string of the target domain name at the i-th comparison; n_max is the number of characters in the longer of the two character strings; L_i is the overlap rate of the two character strings at the i-th comparison; s_i is the number of characters that overlap and are identical at the i-th comparison; and M_i is the matching rate of the two character strings at the i-th comparison.
For example, suppose the own domain name is boc.cn and moves from left to right, while the target domain name is cocc.cn and stays fixed. At the 1st comparison, only the character n of the own domain name overlaps the character c of the target domain name, so r_1 = 1 and s_1 = 0. At the 2nd comparison, the character n overlaps the character o and the character c overlaps the character c, so r_2 = 2 and s_2 = 1.
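The sliding comparison in this example can be reproduced with a short sketch. The function name `alignment_counts` and the index arithmetic are illustrative assumptions; the code only returns the pairs (r_i, s_i) for i = 1 to m and does not yet apply any formula.

```python
def alignment_counts(target, own):
    """Return [(r_1, s_1), ..., (r_m, s_m)] with m = len(target) + len(own) - 1."""
    n1, n2 = len(target), len(own)
    results = []
    for i in range(1, n1 + n2):
        # At the i-th comparison, target[i - 1] sits above the last character
        # of own, so own[j] sits under target[j + shift].
        shift = i - n2
        overlap = 0  # r_i: characters that overlap
        same = 0     # s_i: characters that overlap and are identical
        for j in range(n2):
            k = j + shift
            if 0 <= k < n1:
                overlap += 1
                if own[j] == target[k]:
                    same += 1
        results.append((overlap, same))
    return results
```

For `alignment_counts("cocc.cn", "boc.cn")` the list has 12 entries; the first two are `(1, 0)` and `(2, 1)`, matching r_1, s_1, r_2 and s_2 in the example, and the 7th comparison gives `(6, 4)`.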
In addition, in the initial value calculation submodule, the i-th similarity calculation value Q_i may also be calculated as follows:
At the i-th comparison, count the characters that overlap and are identical between the character string of the target domain name and the character string of the own domain name, and take that count as the i-th similarity calculation value Q_i.
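A sketch of this alternative follows, in which each Q_i is simply the number of overlapping identical characters and the result is the largest such count. The function name `max_matching_chars` is hypothetical. Note that this variant yields a character count rather than a percentage, so any threshold applied to it would presumably have to be chosen on that scale; the text does not say how.

```python
def max_matching_chars(target, own):
    """Largest number of overlapping identical characters over all alignments."""
    n1, n2 = len(target), len(own)
    best = 0
    # shift = i - n2 for i = 1..m, so own[j] is compared with target[j + shift].
    for shift in range(1 - n2, n1):
        same = sum(
            1
            for j in range(n2)
            if 0 <= j + shift < n1 and own[j] == target[j + shift]
        )
        best = max(best, same)
    return best
```

For the strings `cocc.cn` and `boc.cn` the best alignment matches the four characters `c.cn`.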
The i-th similarity calculation value Q_i may also be calculated by other known methods; since these are not the focus of the present invention, they are not described here.
The final value calculation submodule 322c is adapted to obtain the similarity Q_max between the target domain name and the own domain name according to the following formula:
Q_max = max{Q_1, Q_2, Q_3, ..., Q_m}.
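Putting the formulas together, Q_max might be computed as in the sketch below. It assumes the first formula, read as Q_i = M_i^2 × L_i with M_i = s_i / n_max and L_i = r_i / n_max; the function name `similarity` and the handling of empty strings are assumptions.

```python
def similarity(target, own):
    """Q_max = max over all alignments of (s_i / n_max)^2 * (r_i / n_max)."""
    n1, n2 = len(target), len(own)
    n_max = max(n1, n2)
    if n_max == 0:
        # Assumption: two empty strings have no measurable similarity.
        return 0.0
    q_max = 0.0
    for shift in range(1 - n2, n1):
        # Character pairs that overlap at this alignment.
        pairs = [
            (own[j], target[j + shift])
            for j in range(n2)
            if 0 <= j + shift < n1
        ]
        r = len(pairs)                          # r_i
        s = sum(1 for a, b in pairs if a == b)  # s_i
        q = (s / n_max) ** 2 * (r / n_max)      # Q_i
        q_max = max(q_max, q)
    return q_max
```

For `cocc.cn` and `boc.cn` the best alignment has r = 6 and s = 4 with n_max = 7, giving Q_max = (4/7)^2 × (6/7) = 96/343, roughly 0.28. Two identical strings give exactly 1.0.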
The judgment module 323 is adapted to judge whether the ratio and the similarity satisfy the condition that the ratio is greater than a predetermined ratio and the similarity is greater than a predetermined threshold; if so, the website to be identified is judged to be a phishing website; otherwise, it is judged not to be a phishing website. The predetermined ratio and the predetermined threshold can be set and adjusted according to actual use. In this embodiment, the predetermined ratio is preferably 1.0 and the predetermined threshold is preferably 80%.
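The decision rule of Embodiment 1 can be summarized in a few lines. This is a sketch under stated assumptions: the function name `is_phishing` and its parameter names are hypothetical, the similarity is assumed to be expressed on a 0 to 1 scale (so 80% becomes 0.8), and the defaults are the preferred values given above.

```python
def is_phishing(target_domain, own_domain, ratio, similarity,
                min_ratio=1.0, min_similarity=0.8):
    """Apply the comparison subunit check, then the judgment module condition."""
    if target_domain == own_domain:
        # Identical domain names: the site mostly links to itself.
        return False
    # Both conditions must hold strictly ("greater than").
    return ratio > min_ratio and similarity > min_similarity
```

Both thresholds are keyword arguments so that they can be tuned to actual use, as the text suggests.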
Fig. 5 is a schematic diagram of the module structure of the phishing website identification system according to Embodiment 2 of the present invention. As shown in Fig. 5, the system of this embodiment is substantially the same as the system of Embodiment 1; the only difference is that the system of this embodiment further comprises a website acquiring unit 000 and a supplementary identification unit 400.
The website acquiring unit 000 is adapted to search for newly built websites to serve as websites to be identified. In general, most phishing websites are newly built. By providing the website acquiring unit 000 and taking only newly built websites as websites to be identified, the scope of phishing website identification is narrowed and the accuracy and speed of identification are improved. Newly built websites can be found by monitoring search engine result pages for particular keywords, or by using clients to discover websites that receive few visits from Internet users.
The supplementary identification unit 400 is adapted to denote a website to be identified that has been judged to be a phishing website as a suspected site and to perform supplementary identification on the suspected site; when the identification result shows that the suspected site is still a phishing website, the suspected site is sent to the phishing website library. The supplementary identification may be performed by manual review. Providing the supplementary identification unit 400 further improves the accuracy of phishing website identification.
Fig. 6 is a flow chart of the phishing website recognition method according to Embodiment Three of the present invention. As shown in Fig. 6, the method comprises the steps of:
A: collecting all links appearing in the website to be identified, and obtaining the domain name corresponding to each link. The domain name corresponding to a link is the absolute address of the link.
B: counting the number of times each domain name appears in the website to be identified, and finding the domain name with the most occurrences, denoted the target domain name.
C: judging, according to the target domain name and the own domain name of the website to be identified, whether the website to be identified is a phishing website.
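Steps A and B can be sketched roughly as follows. This is a minimal illustration and not the patent's implementation: the names `LinkCollector`, `count_link_domains` and `find_target_domain` are hypothetical, reading links from `href` and `src` attributes is an assumption, and so is resolving relative links against the page URL so that they count toward the website's own domain name.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects raw link values from href/src attributes (an assumed choice)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def count_link_domains(html, page_url):
    """Step A + first half of step B: count how often each host appears in links."""
    parser = LinkCollector()
    parser.feed(html)
    counts = Counter()
    for link in parser.links:
        # Assumption: relative links are resolved against the page URL.
        host = urlparse(urljoin(page_url, link)).hostname
        if host:  # skips javascript:, mailto: and similar non-host links
            counts[host] += 1
    return counts


def find_target_domain(counts):
    """Second half of step B: the most frequent domain, or None if no links."""
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

The own domain name used in step C would then be the host of `page_url`, and `counts` also supplies the two occurrence counts needed later for the ratio.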
Step C further comprises the steps of:
C1: judging whether the target domain name is identical to the own domain name; if so, judging that the website to be identified is not a phishing website and ending the process; otherwise, performing step C2;
C2: calculating the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name, and the similarity between the target domain name and the own domain name, and judging according to the ratio and the similarity whether the website to be identified is a phishing website.
Step C2 further comprises the steps of:
C21: calculating the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name.
C22: calculating the similarity between the target domain name and the own domain name.
Step C22 further comprises the steps of:
C221: building a comparison array from the character string of the target domain name and the character string of the own domain name, in which the target domain name string is placed in the first row of the comparison array and held in a fixed position, and the own domain name string is placed in the second row and moved from left to right, the overlapping characters of the two rows being compared at each position.
C222: when the first character of the target domain name is aligned with the last character of the own domain name, calculating a first similarity calculation value $Q_1$ between the target domain name and the own domain name; when the second character of the target domain name is aligned with the last character of the own domain name, calculating a second similarity calculation value $Q_2$ between the target domain name and the own domain name; and so on, until, when the last character of the target domain name is aligned with the first character of the own domain name, calculating an m-th similarity calculation value $Q_m$ between the target domain name and the own domain name; where $m = n_1 + n_2 - 1$, $n_1$ denotes the string length of the target domain name, and $n_2$ denotes the string length of the own domain name.
In step C222, the i-th similarity calculation value $Q_i$ is calculated as follows:
$Q_i = M_i^2 \times L_i$
where i is a natural number and $1 \le i \le m$; and
$M_i = s_i / n_{max}$
$L_i = r_i / n_{max}$
where $r_i$ denotes the number of overlapping characters between the own domain name string and the target domain name string at the i-th comparison; $n_{max}$ denotes the number of characters in the longer of the own domain name string and the target domain name string; $L_i$ denotes the overlap rate of the two strings at the i-th comparison; $s_i$ denotes the number of characters that overlap and are identical in the two strings at the i-th comparison; and $M_i$ denotes the matching rate of the two strings at the i-th comparison.
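As a worked illustration of this formula, consider two hypothetical ten-character domain names, mybank.com (target) and mybamk.com (own), at the alignment where the two strings overlap completely. The names and the numbers below are illustrative assumptions and do not come from the patent.

```latex
\begin{aligned}
n_{max} &= 10, \quad r_i = 10, \quad s_i = 9 \\
M_i &= s_i / n_{max} = 9 / 10 = 0.9 \\
L_i &= r_i / n_{max} = 10 / 10 = 1.0 \\
Q_i &= M_i^2 \times L_i = 0.9^2 \times 1.0 = 0.81
\end{aligned}
```

At this alignment the value is 0.81, just above the 80% threshold given as preferred for the judging step. Squaring the matching rate makes the value fall quickly as mismatched characters accumulate.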
Alternatively, in step C222, the i-th similarity calculation value $Q_i$ can be calculated as follows:
At the i-th comparison, the number of characters that overlap and are identical in the target domain name string and the own domain name string is counted, and this number is taken as the i-th similarity calculation value $Q_i$.
C223: obtaining the similarity $Q_{max}$ between the target domain name and the own domain name according to the following formula:
$Q_{max} = \max\{Q_1, Q_2, Q_3, \ldots, Q_m\}$.
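Steps C221 to C223 can be sketched as a sliding comparison of the two strings. The following is a minimal illustration under stated assumptions: the function names `alignment_scores` and `similarity` and the flag `use_match_count` (which selects the alternative count-based value) are hypothetical, and returning no scores for an empty string is an assumption, since the patent does not cover that case.

```python
def alignment_scores(target, own, use_match_count=False):
    """Return [Q_1, ..., Q_m] for all m = n1 + n2 - 1 alignments (step C222)."""
    n1, n2 = len(target), len(own)
    if n1 == 0 or n2 == 0:
        return []  # assumption: undefined for empty strings
    n_max = max(n1, n2)
    scores = []
    # offset is the index in target that own[0] lines up with.
    # First alignment: own's last char under target's first char.
    # Last alignment: own's first char under target's last char.
    for offset in range(-(n2 - 1), n1):
        start = max(0, offset)
        end = min(n1, offset + n2)
        r = end - start  # overlapping characters
        s = sum(1 for k in range(start, end) if target[k] == own[k - offset])
        if use_match_count:
            scores.append(float(s))       # alternative: Q_i = s_i
        else:
            m_rate = s / n_max            # M_i
            l_rate = r / n_max            # L_i
            scores.append(m_rate ** 2 * l_rate)
    return scores


def similarity(target, own, use_match_count=False):
    """Step C223: Q_max is the largest alignment score."""
    scores = alignment_scores(target, own, use_match_count)
    return max(scores) if scores else 0.0
```

For example, `similarity("www.bank.com", "www.bamk.com")` peaks at the fully overlapped alignment, where 11 of 12 characters match. Note that the count-based variant is not normalized, so a threshold used with it would have to be a character count rather than a percentage.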
C23: judging whether the following condition is satisfied: the ratio is greater than the predetermined ratio, and the similarity is greater than the predetermined threshold; if so, judging that the website to be identified is a phishing website; otherwise, judging that the website to be identified is not a phishing website.
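The decision logic of steps C1, C21 and C23 can be summarized in a few lines. This is a sketch only: the function name `is_phishing` and its parameters are hypothetical, the defaults 1.0 and 0.8 follow the preferred values given for the judging module, and treating a zero occurrence count for the own domain name as an unbounded ratio is an assumption the patent does not address.

```python
def is_phishing(target, own, target_count, own_count, similarity_value,
                min_ratio=1.0, min_similarity=0.8):
    """Combine the identity check (C1), the ratio (C21) and the condition (C23)."""
    # C1: a site whose most-linked domain is its own is not flagged.
    if target == own:
        return False
    # C21: ratio of occurrences; assumption: no own-domain links means infinity.
    ratio = target_count / own_count if own_count else float("inf")
    # C23: both conditions must hold, each as a strict inequality.
    return ratio > min_ratio and similarity_value > min_similarity
```

Both tests are needed: a high ratio alone would flag ordinary pages that link heavily to an unrelated site, while a high similarity alone would flag legitimate look-alike names that do not point elsewhere.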
Fig. 7 is a flow chart of the phishing website recognition method according to Embodiment Four of the present invention. As shown in Fig. 7, the method of this embodiment is substantially the same as the method of Embodiment Three; the only differences are as follows:
Before step A, the method further comprises step A': searching for newly built websites to serve as websites to be identified. Newly built websites can be found by either of the following methods: monitoring search engine results pages for specific keywords; or using the client to discover websites that receive few user visits.
After step C, the method further comprises step D: marking a website to be identified that has been judged to be a phishing website as a suspected site, and performing supplementary recognition on the suspected site; when the recognition result shows that the suspected site is still a phishing website, adding the suspected site to the phishing website library. The supplementary recognition can be performed by manual review.
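The overall flow of Embodiment Four (candidate sites, automatic judgment, supplementary recognition, library) might be wired together as below. This is a sketch under assumptions: `triage`, `looks_like_phishing` and `confirm` are hypothetical names, and the two callables stand in for steps A to C and for the manual review respectively.

```python
from typing import Callable, Iterable, Set


def triage(candidates: Iterable[str],
           looks_like_phishing: Callable[[str], bool],
           confirm: Callable[[str], bool]) -> Set[str]:
    """Return the sites that pass both the automatic and the supplementary check."""
    library: Set[str] = set()
    for site in candidates:
        # Steps A to C: automatic judgment; negatives are dropped here.
        if not looks_like_phishing(site):
            continue
        # Step D: the site is now a suspected site and needs confirmation,
        # for example by manual review.
        if confirm(site):
            library.add(site)
    return library
```

Keeping the confirmation step as a separate callable means the automatic judgment can be tuned for recall while the library only receives confirmed entries.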
The phishing website recognition system and method of the embodiments of the present invention recognize phishing websites based on the link relationships within a website and can effectively identify new types of phishing websites. They also help enrich the number and types of phishing websites in the phishing website library, facilitate further phishing website recognition and lookup, and have broad application prospects in the field of network security.
The above embodiments are intended only to illustrate the present invention and not to limit it. Those of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the present invention; all equivalent technical solutions therefore also fall within the scope of the present invention, and the scope of patent protection of the present invention shall be defined by the claims.

Claims (18)

1. A phishing website recognition system, comprising: a domain name acquiring unit, a domain name statistics unit and a website recognition unit;
the domain name acquiring unit being adapted to collect all links appearing in a website to be identified and to obtain the domain name corresponding to each link;
the domain name statistics unit being adapted to count the number of times each domain name appears in the website to be identified and to find the domain name with the most occurrences, denoted the target domain name;
the website recognition unit being adapted to judge, according to the target domain name and the own domain name of the website to be identified, whether the website to be identified is a phishing website.
2. The system of claim 1, characterized in that the website recognition unit comprises: a comparing subunit and a recognition subunit;
the comparing subunit being adapted to compare the target domain name with the own domain name and, when the comparison result shows that the target domain name is identical to the own domain name, to judge that the website to be identified is not a phishing website;
the recognition subunit being adapted, when the target domain name differs from the own domain name, to calculate the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name, to calculate the similarity between the target domain name and the own domain name, and then to judge according to the ratio and the similarity whether the website to be identified is a phishing website.
3. The system of claim 2, characterized in that the recognition subunit comprises: a ratio calculating module, a similarity calculating module and a judging module;
the ratio calculating module being adapted to calculate the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name;
the similarity calculating module being adapted to calculate the similarity between the target domain name and the own domain name;
the judging module being adapted to determine whether the ratio and the similarity satisfy the following condition: the ratio is greater than a predetermined ratio, and the similarity is greater than a predetermined threshold; if so, to judge that the website to be identified is a phishing website; otherwise, to judge that the website to be identified is not a phishing website.
4. The system of claim 3, characterized in that the similarity calculating module comprises: a string comparison submodule, an initial value calculating submodule and a final value calculating submodule;
the string comparison submodule being adapted to build a comparison array from the character string of the target domain name and the character string of the own domain name, in which the target domain name string is placed in the first row of the comparison array and held in a fixed position, and the own domain name string is placed in the second row and moved from left to right, the overlapping characters of the two rows being compared at each position;
the initial value calculating submodule being adapted, when the first character of the target domain name is aligned with the last character of the own domain name, to calculate a first similarity calculation value $Q_1$ between the target domain name and the own domain name; when the second character of the target domain name is aligned with the last character of the own domain name, to calculate a second similarity calculation value $Q_2$ between the target domain name and the own domain name; and so on, until, when the last character of the target domain name is aligned with the first character of the own domain name, to calculate an m-th similarity calculation value $Q_m$ between the target domain name and the own domain name; where $m = n_1 + n_2 - 1$, $n_1$ denotes the string length of the target domain name, and $n_2$ denotes the string length of the own domain name;
the final value calculating submodule being adapted to obtain the similarity $Q_{max}$ between the target domain name and the own domain name according to the following formula:
$Q_{max} = \max\{Q_1, Q_2, Q_3, \ldots, Q_m\}$.
5. The system of claim 4, characterized in that, in the initial value calculating submodule, the i-th similarity calculation value $Q_i$ is calculated by the following formula:
$Q_i = M_i^2 \times L_i$
where i is a natural number and $1 \le i \le m$; and
$M_i = s_i / n_{max}$
$L_i = r_i / n_{max}$
where $r_i$ denotes the number of overlapping characters between the own domain name string and the target domain name string at the i-th comparison; $n_{max}$ denotes the number of characters in the longer of the own domain name string and the target domain name string; $L_i$ denotes the overlap rate of the two strings at the i-th comparison; $s_i$ denotes the number of characters that overlap and are identical in the two strings at the i-th comparison; and $M_i$ denotes the matching rate of the two strings at the i-th comparison.
6. The system of claim 4, characterized in that, in the initial value calculating submodule, the i-th similarity calculation value $Q_i$ is calculated as follows:
at the i-th comparison, the number of characters that overlap and are identical in the target domain name string and the own domain name string is counted, and this number is taken as the i-th similarity calculation value $Q_i$.
7. The system of claim 1, characterized in that the system further comprises: a supplementary recognition unit;
the supplementary recognition unit being adapted to mark a website to be identified that has been judged to be a phishing website as a suspected site, and to perform supplementary recognition on the suspected site; when the recognition result shows that the suspected site is still a phishing website, the suspected site is added to the phishing website library.
8. The system of claim 1, characterized in that the domain name corresponding to a link is the absolute address of the link.
9. The system of claim 1, characterized in that the system further comprises: a website acquiring unit;
the website acquiring unit being adapted to search for newly built websites to serve as websites to be identified.
10. A phishing website recognition method, comprising the steps of:
collecting all links appearing in a website to be identified, and obtaining the domain name corresponding to each link;
counting the number of times each domain name appears in the website to be identified, and finding the domain name with the most occurrences, denoted the target domain name;
judging, according to the target domain name and the own domain name of the website to be identified, whether the website to be identified is a phishing website.
11. The method of claim 10, characterized in that the step of judging, according to the target domain name and the own domain name of the website to be identified, whether the website to be identified is a phishing website further comprises the steps of:
judging whether the target domain name is identical to the own domain name; if so, judging that the website to be identified is not a phishing website and ending the process; otherwise, performing the next step;
calculating the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name, and the similarity between the target domain name and the own domain name, and judging according to the ratio and the similarity whether the website to be identified is a phishing website.
12. The method of claim 11, characterized in that the step of calculating the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name, and the similarity between the target domain name and the own domain name, and judging according to the ratio and the similarity whether the website to be identified is a phishing website further comprises the steps of:
calculating the ratio between the number of occurrences of the target domain name and the number of occurrences of the own domain name;
calculating the similarity between the target domain name and the own domain name;
judging whether the following condition is satisfied: the ratio is greater than a predetermined ratio, and the similarity is greater than a predetermined threshold; if so, judging that the website to be identified is a phishing website; otherwise, judging that the website to be identified is not a phishing website.
13. The method of claim 12, characterized in that the step of calculating the similarity between the target domain name and the own domain name further comprises the steps of:
building a comparison array from the character string of the target domain name and the character string of the own domain name, in which the target domain name string is placed in the first row of the comparison array and held in a fixed position, and the own domain name string is placed in the second row and moved from left to right, the overlapping characters of the two rows being compared at each position;
when the first character of the target domain name is aligned with the last character of the own domain name, calculating a first similarity calculation value $Q_1$ between the target domain name and the own domain name; when the second character of the target domain name is aligned with the last character of the own domain name, calculating a second similarity calculation value $Q_2$ between the target domain name and the own domain name; and so on, until, when the last character of the target domain name is aligned with the first character of the own domain name, calculating an m-th similarity calculation value $Q_m$ between the target domain name and the own domain name; where $m = n_1 + n_2 - 1$, $n_1$ denotes the string length of the target domain name, and $n_2$ denotes the string length of the own domain name;
obtaining the similarity $Q_{max}$ between the target domain name and the own domain name according to the following formula:
$Q_{max} = \max\{Q_1, Q_2, Q_3, \ldots, Q_m\}$.
14. The method of claim 13, characterized in that, in the step of calculating the first similarity calculation value $Q_1$ between the target domain name and the own domain name when the first character of the target domain name is aligned with the last character of the own domain name, calculating the second similarity calculation value $Q_2$ when the second character of the target domain name is aligned with the last character of the own domain name, and so on, until calculating the m-th similarity calculation value $Q_m$ when the last character of the target domain name is aligned with the first character of the own domain name, the i-th similarity calculation value $Q_i$ is calculated as follows:
$Q_i = M_i^2 \times L_i$
where i is a natural number and $1 \le i \le m$; and
$M_i = s_i / n_{max}$
$L_i = r_i / n_{max}$
where $r_i$ denotes the number of overlapping characters between the own domain name string and the target domain name string at the i-th comparison; $n_{max}$ denotes the number of characters in the longer of the own domain name string and the target domain name string; $L_i$ denotes the overlap rate of the two strings at the i-th comparison; $s_i$ denotes the number of characters that overlap and are identical in the two strings at the i-th comparison; and $M_i$ denotes the matching rate of the two strings at the i-th comparison.
15. The method of claim 13, characterized in that, in the step of calculating the first similarity calculation value $Q_1$ between the target domain name and the own domain name when the first character of the target domain name is aligned with the last character of the own domain name, calculating the second similarity calculation value $Q_2$ when the second character of the target domain name is aligned with the last character of the own domain name, and so on, until calculating the m-th similarity calculation value $Q_m$ when the last character of the target domain name is aligned with the first character of the own domain name, the i-th similarity calculation value $Q_i$ is calculated as follows:
at the i-th comparison, the number of characters that overlap and are identical in the target domain name string and the own domain name string is counted, and this number is taken as the i-th similarity calculation value $Q_i$.
16. The method of claim 10, characterized in that, after the step of judging, according to the target domain name and the own domain name of the website to be identified, whether the website to be identified is a phishing website, the method further comprises the step of: marking a website to be identified that has been judged to be a phishing website as a suspected site, and performing supplementary recognition on the suspected site; when the recognition result shows that the suspected site is still a phishing website, adding the suspected site to the phishing website library.
17. The method of claim 10, characterized in that the domain name corresponding to a link is the absolute address of the link.
18. The method of claim 10, characterized in that, before the step of collecting all links appearing in the website to be identified and obtaining the domain name corresponding to each link, the method further comprises the step of: searching for newly built websites to serve as websites to be identified.
CN201510051628.9A 2012-06-28 2012-06-28 Phishing website identification system and method Active CN104580254B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510051628.9A CN104580254B (en) 2012-06-28 2012-06-28 Phishing website identification system and method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510051628.9A CN104580254B (en) 2012-06-28 2012-06-28 Phishing website identification system and method
CN201210224485.3A CN102801709B (en) 2012-06-28 2012-06-28 Phishing website identification system and method

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201210224485.3A Division CN102801709B (en) 2012-06-28 2012-06-28 Phishing website identification system and method

Publications (2)

Publication Number Publication Date
CN104580254A (en) 2015-04-29
CN104580254B (en) 2017-10-31

Family

ID=53095434

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510051628.9A Active CN104580254B (en) 2012-06-28 2012-06-28 Phishing website identification system and method

Country Status (1)

Country Link
CN (1) CN104580254B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7630987B1 (en) * 2004-11-24 2009-12-08 Bank Of America Corporation System and method for detecting phishers by analyzing website referrals
US20080092242A1 (en) * 2006-10-16 2008-04-17 Red Hat, Inc. Method and system for determining a probability of entry of a counterfeit domain in a browser
CN101145902A (en) * 2007-08-17 2008-03-19 东南大学 Phishing webpage detection method based on image processing
US7958555B1 (en) * 2007-09-28 2011-06-07 Trend Micro Incorporated Protecting computer users from online frauds
CN101667979A (en) * 2009-10-12 2010-03-10 哈尔滨工程大学 System and method for anti-phishing emails based on link domain name and user feedback
CN102098235A (en) * 2011-01-18 2011-06-15 南京邮电大学 Phishing email detection method based on text feature analysis
CN102223316A (en) * 2011-06-15 2011-10-19 成都市华为赛门铁克科技有限公司 Method and device for processing electronic mail
CN102801709B (en) * 2012-06-28 2015-03-04 北京奇虎科技有限公司 Phishing website identification system and method

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106330861A (en) * 2016-08-09 2017-01-11 中国信息安全测评中心 Website detection method and apparatus
CN106330861B (en) * 2016-08-09 2020-03-03 中国信息安全测评中心 Website detection method and device
CN106302440A (en) * 2016-08-11 2017-01-04 国家计算机网络与信息安全管理中心 Method for obtaining suspicious phishing websites through multiple channels
CN109391584A (en) * 2017-08-03 2019-02-26 武汉安天信息技术有限责任公司 Method and device for identifying suspected malicious websites
CN108173814A (en) * 2017-12-08 2018-06-15 深信服科技股份有限公司 Detection method for phishing site, terminal device and storage medium
CN108173814B (en) * 2017-12-08 2021-02-05 深信服科技股份有限公司 Phishing website detection method, terminal device and storage medium
CN108337259A (en) * 2018-02-01 2018-07-27 南京邮电大学 Suspicious web page identification method based on HTTP request Host information
CN108846672A (en) * 2018-06-25 2018-11-20 北京奇虎科技有限公司 Personalized address generating method, device, electronic equipment and storage medium
CN108846672B (en) * 2018-06-25 2021-11-23 北京奇虎科技有限公司 Personalized address generation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN104580254B (en) 2017-10-31


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220727

Address after: 300450 No. 9-3-401, No. 39, Gaoxin 6th Road, Binhai Science Park, Binhai New Area, Tianjin

Patentee after: 3600 Technology Group Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20230713

Address after: 1765, floor 17, floor 15, building 3, No. 10 Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: Beijing Hongxiang Technical Service Co.,Ltd.

Address before: 300450 No. 9-3-401, No. 39, Gaoxin 6th Road, Binhai Science Park, Binhai New Area, Tianjin

Patentee before: 3600 Technology Group Co.,Ltd.