WO2006066444A1 - Connection-oriented junk mail filtering system and method

Info

Publication number: WO2006066444A1
Application number: PCT/CN2004/001480
Authority: WO
Inventors: Shengyu Cheng; Dongxin Lu; Qiang Li; Yingjie Bai; Zhiyun Luo; Zuoliang Zhu
Original assignee: Zte Corporation
Priority date: 2004-12-21
Filing date: 2004-12-21
Publication date: 2006-06-29
Also published as: CN101040279B; CN101040279A

Abstract

A connection-oriented junk mail filtering system and method. The system includes at least a data acquisition module, a filtering strategy management module, a filtering analysis module and a data processing module, wherein the data acquisition module captures packets from the monitored network and submits them to the filtering analysis module as the data input of the whole filtering system; the filtering strategy management module configures and manages the filtering strategy; the filtering analysis module analyses the input packets according to the configured filtering strategy and checks whether they contain information in which the filtering strategy is interested; and the data processing module performs various processing on the analysis result data of the filtering analysis module. The present invention solves the problems of missed alarms and false alarms in packet filtering; its dominant characteristic is that it is independent of specific mail servers and completely transparent to mail clients and servers. In contrast to the prior art, the present invention greatly improves the reliability of the junk mail filtering system and widens its applicability.

Description

Connection-oriented spam filtering system and method

Technical field

The invention relates to a method for monitoring network content security, and in particular to a spam filtering system and method in the field of network information security.

Background art

E-mail is one of the most important applications on the Internet, and it has gradually become an indispensable part of people's work and daily life. Spam usually refers to e-mail containing harmful information such as reactionary speech, pornography or violence, as well as unsolicited commercial advertising sent as bulk e-mail. Such messages are often sent in large quantities, which not only consumes a lot of network resources but also reduces productivity, and may disturb social stability and endanger the physical and mental health of young people. According to statistics, spam causes tens of billions of dollars in damage to the global economy every year. How to prevent spam effectively is a very urgent issue.

Existing spam filtering systems are mainly of two types. The first filters at the mail client, usually as a plug-in of the mail client program; such a system monitors only a single machine, so its scope of application is limited. The second filters at the mail server; it usually requires a two-way connection with the mail server and works together with it, so its monitoring scope is limited to the directly connected mail servers. Both types need certain modifications to the original mail client or mail server program and work together with the original system, so they are not transparent. There are also spam filtering systems that do not rely on mail clients or servers and can be placed at the gateway of the monitored network. Most of these systems work like firewalls, typically filtering on the IP address and header fields of a mail packet (for example, mail senders, mail recipients and mail headers). Because they use simple packet filtering, they cannot avoid the missed alarms of packet filtering and are vulnerable to fragmentation attacks.

In summary, existing spam filtering technology has two main shortcomings: first, it relies too heavily on the mail server or mail client and requires certain modifications to the original mail server or mail client; second, it uses simple packet filtering or does not address the fragmentation problem.

Summary of the invention

The technical problem to be solved by the present invention is to provide a connection-oriented spam filtering system that implements full-text filtering of e-mail content without the fragmentation vulnerability and that is independent of any specific mail server. The system can be deployed inside a shared LAN or at the entrance or exit of an enterprise network or an interprovincial or international backbone network, and therefore has a wide range of applications and high reliability.

Another technical problem solved by the present invention is to provide a connection-oriented spam filtering method, which can implement full-text filtering of email content without the vulnerability of fragmentation attacks, and improve the reliability of the spam filtering system.

Another technical problem to be solved by the present invention is to provide a connection-oriented spam filtering method capable of avoiding missed alarms and false alarms.

In order to achieve the above object, the present invention provides a connection-oriented spam filtering system comprising at least a data acquisition module, a filtering policy management module, a filtering analysis module and a data processing module, wherein the data acquisition module is used to capture data packets from the monitored network and submit them to the filtering analysis module as the data input of the entire filtering system; the filtering policy management module is used for the configuration and management of filtering policies; the filtering analysis module is used to analyze the input data packets according to the configured filtering policy and check whether they contain the information the filtering policy is concerned with; and the data processing module is used to perform various processing on the analysis result data of the filtering analysis module.

The connection-oriented spam filtering system further includes an operation and maintenance module and a storage backup module, wherein the operation and maintenance module is used for system maintenance, and the storage backup module is used for storage backup of system data and data packets.

The filtering policy includes a filtering condition and a corresponding processing manner, and the filtering condition may be a logical combination of a plurality of conditions.

The filtering analysis module includes a TCP connection maintenance sub-module, a mail protocol parsing sub-module, and a MIME decoding and content scanning sub-module, wherein the TCP connection maintenance sub-module is used to maintain a TCP connection hash table; the mail protocol parsing sub-module is used to parse the mail protocol; and the MIME decoding and content scanning sub-module is used to determine the encoding of the input mail data, call the corresponding conversion function to convert it, and then perform a full-text scan of the mail content.

The hash table uses the four-tuple (quad) of source IP address, destination IP address, source port and destination port of the data packet as the input for calculating the hash key value. The hash value can be calculated with any of a variety of fast hash algorithms, and hash collisions can be resolved by separate chaining (the chain address method).

Each TCP connection node in the hash table maintained by the TCP connection maintenance sub-module includes at least:

(1) IP addresses and transport layer port numbers of the client and server: these four parameters are the unique identifier used to determine the connection to which a data packet belongs;

(2) Protocol type: SMTP, POP3 or IMAP;

(3) Lifetime of the connection: used to prevent connections that have been inactive for a long time from occupying system resources;

(4) Packet buffer queue: caches the mail packets on this connection so that, if unsafe data is found on the connection, the mail data can be recovered and saved;

(5) State of the session on the connection: whether it is in the command interaction state or the data transmission state;

(6) Automaton temporary state: used to solve the missed-alarm problem when keywords are filtered packet by packet;

(7) Security flag of this connection: when unsafe information is found on the connection, it is marked in this field, and subsequent data on the connection is no longer scanned.

In order to better achieve the above object, the present invention also provides a connection-oriented spam filtering method, wherein the method includes at least the following steps:

(1) a data collection step for capturing a data packet from the monitored network and submitting it to the filtering analysis module as a data input of the entire filtering system;

(2) Filtering policy management steps for configuring and managing filtering policies;

(3) a filtering analysis step, configured to analyze the input data packet according to the configured filtering policy, and check whether the information concerned by the filtering policy is included;

(4) A data processing step for performing various processing on the analysis result data of the filter analysis module.

Step (3) further includes: when e-mail is transmitted using SMTP, POP3 or IMAP, in the command interaction state, extracting and analyzing the interactive command and its parameters in the input data packet; in the data transmission state, extracting the mail data from the data packet, performing MIME decoding and content scanning, and submitting the scan result to the data processing module.

The step (3) further includes the following steps:

(111) a TCP connection maintenance step for maintaining a TCP connection hash table;

(112) a mail protocol parsing step for parsing the mail protocol;

(113) a MIME decoding and content scanning step for determining the encoding of the input mail data, calling the corresponding conversion function to convert it, and then performing a full-text scan of the mail content.

Step (113) further includes: after each packet is scanned, saving the current state in the automaton temporary state field of the connection node to which the packet belongs, and, when scanning the next packet, starting matching from the state indicated by the automaton temporary state of that connection node, so as to avoid missed alarms.

The step (113) further includes: sorting the out-of-order packets on the same TCP connection, and performing content scanning in the correct order to avoid false alarms.

By adopting the "connection-oriented" technical measure and suitable algorithms, the spam filtering system and method of the present invention solve the missed-alarm and false-alarm problems of packet filtering, are independent of any specific mail server, and are completely transparent to both the mail client and the server. Compared with the prior art, the invention greatly improves the reliability of the spam filtering system and broadens its scope of application.

Description of the drawings

FIG. 1 is a schematic diagram of the arrangement of the spam filtering system in a shared local area network;

FIG. 2 is a schematic diagram of the arrangement of the spam filtering system at a network entrance or exit;

FIG. 3 is a schematic structural diagram of the spam filtering system according to the present invention;

FIG. 4 is a schematic structural diagram of the filtering analysis module according to the present invention;

FIG. 5 is a schematic diagram of the structure of the TCP connection hash table;

FIG. 6 is a schematic diagram of the hash algorithm for TCP connection lookup;

FIGS. 7A and 7B are schematic diagrams of the missed-alarm problem of packet filtering;

FIGS. 8A and 8B are schematic diagrams of the false-alarm problem caused by out-of-order packets.

Detailed description

The basic technical solution of the embodiment is described in further detail below with reference to the accompanying drawings, following the order of the figures:

This spam filtering system monitors e-mail transmitted using SMTP (Simple Mail Transfer Protocol), POP3 (Post Office Protocol Version 3) and IMAP (Internet Message Access Protocol).

The spam filtering system described in the present invention can be arranged inside a shared local area network (see Fig. 1) or at the entrance and exit of an enterprise network, an interprovincial or international backbone network (see Fig. 2).

Figure 1 illustrates the arrangement of the spam filtering system of the present invention in a shared local area network. In this arrangement, network packets can be captured by setting the network interface card (NIC) to promiscuous mode, but the network can only be monitored passively.

FIG. 2 illustrates the arrangement of the spam filtering system of the present invention at the network entrance and exit. In this way, network packets can be collected using proprietary devices, and network packets can be fully monitored and controlled.

Figure 3 illustrates the basic structure of the spam filtering system of the present invention. It includes at least the following modules: data acquisition module 31, filtering policy management module 32, filtering analysis module 33, and data processing module 34.

The data acquisition module 31 captures the data packet from the monitored network and submits it to the filtering analysis module as the data input for the entire filtering system. Data collection can be done using common capture tools or proprietary equipment.

The filtering policy management module 32 is responsible for configuring and managing the filtering policy. The filtering policy is the foundation on which the system works. It should contain at least the filtering conditions and the corresponding processing method, and the filtering conditions can be a logical combination of multiple conditions. Two example filtering policies are given below:

Filtering policy example 1: filtering condition = "the destination IP address is 168.168.192.*, and the sender is seqing@nopermit.com", processing method = "save the mail and raise an alarm";

Filtering policy example 2: filtering condition = "the sender is xxx@fishy.net and the recipient is fishy@xxx.com", processing method = "cut off the user connection and raise an alarm".
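The two example policies can be sketched as data plus a small matcher. This is only an illustration of "a logical combination of conditions with a processing method"; the field names (`dst_ip`, `sender`, `recipient`), the action names and the use of shell-style wildcards are assumptions made for the sketch, not part of the described system.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Callable, Dict, List

Mail = Dict[str, str]                 # hypothetical record of fields extracted from a connection
Condition = Callable[[Mail], bool]


def field_matches(name: str, pattern: str) -> Condition:
    """Condition that holds when a field matches a shell-style wildcard pattern."""
    return lambda mail: fnmatch(mail.get(name, ""), pattern)


def all_of(*conditions: Condition) -> Condition:
    """Logical AND of several conditions."""
    return lambda mail: all(c(mail) for c in conditions)


@dataclass
class FilterPolicy:
    condition: Condition
    actions: List[str]                # hypothetical action names


POLICIES = [
    # Example 1: destination 168.168.192.* and a given sender -> save mail, alarm
    FilterPolicy(all_of(field_matches("dst_ip", "168.168.192.*"),
                        field_matches("sender", "seqing@nopermit.com")),
                 ["save_mail", "alarm"]),
    # Example 2: given sender and recipient -> cut off the connection, alarm
    FilterPolicy(all_of(field_matches("sender", "xxx@fishy.net"),
                        field_matches("recipient", "fishy@xxx.com")),
                 ["close_connection", "alarm"]),
]


def actions_for(mail: Mail) -> List[str]:
    """Return the actions of the first policy whose condition holds, else an empty list."""
    for policy in POLICIES:
        if policy.condition(mail):
            return policy.actions
    return []
```

An OR or NOT combinator could be added in the same style as `all_of` if a policy needs it.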

The filtering analysis module 33 analyzes the input data packet according to the configured filtering policy and checks whether it contains the information the filtering policy is concerned with. See Figure 4 for the structure of this module.

This module includes three sub-modules: TCP (Transmission Control Protocol) connection maintenance 41, mail protocol parsing 42, and MIME (Multipurpose Internet Mail Extensions) decoding and content scanning 43. The TCP connection mentioned here is the TCP connection established between the monitored mail client and the mail server for transmitting e-mail. The filtering system is not a party to this connection; it only monitors the data transmitted on it.

The TCP connection maintenance sub-module 41 maintains a TCP connection hash table (see Figure 5), which uses the (source IP address, destination IP address, source port, destination port) quad of the data packet as the input for calculating the hash key value (see Figure 6). The hash value can be calculated with any of a variety of fast hash algorithms, and hash collisions can be resolved by separate chaining (the chain address method). Each TCP connection node in the hash table contains at least the IP addresses and transport layer port numbers of both parties to the connection, and some current status information of the connection. Depending on the situation, a separate TCP connection hash table may also be maintained for each of the SMTP, POP3 and IMAP protocols.

For each packet entered, first check if it belongs to a TCP connection that has already been established. If yes, it is processed according to the current state of the connection to which it belongs; otherwise, a new TCP connection node is created for it.

The mail protocol parsing sub-module 42 parses the mail protocol: if the current connection is in the command interaction state, the protocol command and its parameters are extracted from the input data packet and processed; if the current connection is in the data transmission state, the mail data is extracted from the input data packet and submitted to the MIME decoding and content scanning sub-module.

Figure 4 shows the basic structure of the filtering analysis module. For each input packet, the module first calculates its hash key value from the (source IP address, destination IP address, source port, destination port) quad and determines whether the packet belongs to a TCP connection that has already been established. If it does, the packet is processed according to the current state of that connection. For example, if the connection is already known to violate the security policy, the contents of the input packet need not be scanned; the packet is cached directly, and after transmission of the entire mail is complete the mail data is reassembled and saved. If it is not yet known whether the data on the connection violates the security policy, the current input packet is scanned and the scan result is temporarily stored in the connection node. If the input packet does not belong to any established connection, a TCP connection node is created for it, the packet contents are scanned, and the scan result is likewise temporarily stored in the connection node.

When SMTP, POP3 or IMAP is used to transfer e-mail, a session has two basic states: the command interaction state and the data transmission state. In the command interaction state, the mail client and the server exchange a series of commands and do not transmit the mail data itself; in the data transmission state, the mail client and the server are transmitting the mail data. The transition between these two states can be determined from the captured commands. For example, in SMTP, the data transmission state is entered after the "DATA" command is captured, and the command interaction state is resumed when the end-of-message marker "." is captured; in POP3, the data transmission state is entered when the "RETR" command is captured, and the command interaction state is resumed when the end-of-message marker "." is captured. Because packets may be missed, so that the transition between the command interaction state and the data transmission state cannot be judged correctly, the system must also take certain protective measures. For example, if the "DATA" packet sent by the client to the server is missed, the start of the mail data transmission state can be determined from the corresponding reply packet with code "354" returned by the server to the client.
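The state transitions described above can be sketched as a small function. It is a simplified illustration that assumes the packets have already been split into protocol lines with the line ending removed; the names `SessionState` and `next_state` are hypothetical. IMAP is left out because the text does not name its trigger commands, and the POP3 "+OK" status line that precedes the message is not treated specially.

```python
from enum import Enum


class SessionState(Enum):
    COMMAND = "command interaction"
    DATA = "data transmission"


def next_state(protocol, state, line, from_client):
    """Return the session state after one protocol line (line ending already stripped)."""
    if state is SessionState.DATA:
        # Mail data flows client -> server in SMTP and server -> client in POP3;
        # lines travelling in the other direction do not change the state.
        if from_client != (protocol == "SMTP"):
            return state
        # A line holding only "." ends the message.
        return SessionState.COMMAND if line == "." else SessionState.DATA

    text = line.strip().upper()
    if protocol == "SMTP":
        if from_client and text == "DATA":
            return SessionState.DATA
        if not from_client and text.startswith("354"):
            # Protective measure: the client's DATA command was missed,
            # but the server's 354 reply shows that mail data follows.
            return SessionState.DATA
    elif protocol == "POP3":
        if from_client and text.startswith("RETR"):
            return SessionState.DATA
    return state
```

A monitor would call `next_state` once per line and store the result in the session-state field of the connection node.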

Figure 5 shows the structure of the TCP connection hash table, which uses separate chaining (the chain address method) to resolve hash collisions. Each node in the hash table is a TCP connection node structure representing a mail protocol session currently in progress.

Figure 6 shows the implementation of the hash function for TCP connection lookups. The hash function takes the four-tuple (quad) of the packet (source IP address, destination IP address, source port, destination port) as input and calculates the hash value. This hash value is used in the hash table shown in Figure 5 to find out whether the input quad belongs to a connection that has already been established. Because the session packets on a TCP connection are bidirectional, the hash algorithm must be designed so that data in both directions on the same connection maps to the same hash value. For example, the hash values for the following two quads should be the same:

Quad 1: (168.168.192.1, 10.198.60.2, 1386, 25);

Quad 2: (10.198.60.2, 168.168.192.1, 25, 1386).

In addition, since the operation of finding a TCP connection is very frequent (it is invoked once for each mail packet), the hash algorithm used should be fast and produce few collisions.
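One way to obtain a direction-independent hash is to order the two endpoints before hashing. The sketch below does that and pairs it with a chained hash table that supports the lookup-or-create step. The table size, the use of CRC-32 as the fast hash, the IPv4-only packing and all names are assumptions for illustration; the text does not prescribe a particular hash function.

```python
import ipaddress
import struct
import zlib

NUM_BUCKETS = 1024  # assumed table size


def canonical_quad(src_ip, dst_ip, src_port, dst_port):
    """Order the two endpoints so that both directions yield the same tuple."""
    a = (int(ipaddress.IPv4Address(src_ip)), src_port)
    b = (int(ipaddress.IPv4Address(dst_ip)), dst_port)
    return (a, b) if a <= b else (b, a)


def quad_hash(src_ip, dst_ip, src_port, dst_port):
    """Bucket index for a quad; identical for both directions of a connection."""
    (ip1, port1), (ip2, port2) = canonical_quad(src_ip, dst_ip, src_port, dst_port)
    packed = struct.pack("!IHIH", ip1, port1, ip2, port2)
    return zlib.crc32(packed) % NUM_BUCKETS


class ConnectionTable:
    """Hash table with separate chaining: each bucket is a list of (key, node)."""

    def __init__(self):
        self.buckets = [[] for _ in range(NUM_BUCKETS)]

    def find_or_create(self, src_ip, dst_ip, src_port, dst_port):
        """Return (node, created) for the connection the packet belongs to."""
        key = canonical_quad(src_ip, dst_ip, src_port, dst_port)
        chain = self.buckets[quad_hash(src_ip, dst_ip, src_port, dst_port)]
        for existing_key, node in chain:
            if existing_key == key:      # walk the chain on a collision
                return node, False
        node = {"key": key}              # placeholder for a connection node
        chain.append((key, node))
        return node, True
```

With this ordering, Quad 1 and Quad 2 above land in the same bucket and resolve to the same node.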

The MIME decoding and content scanning sub-module 43 first determines the encoding of the input mail data, calls the corresponding conversion function to convert it, and then performs a full-text scan of the mail content. Since packet-by-packet filtering is prone to missed alarms (see Figures 7A, 7B), the content must be scanned with a suitable algorithm. If the packets are out of order, false alarms may also occur (see Figures 8A, 8B). Therefore, the packets on the same TCP connection must be sorted and the content scanned in the correct order.

The content scanning referred to in the present invention is mainly for the text part of the mail body and the attachment, but is also applicable to the filtering of other types of media information (such as pictures, sounds, etc.) as long as the performance of the algorithm allows.

Figures 7A and 7B illustrate the missed-alarm problem of packet filtering. Suppose the keyword the mail filtering system wants to check is "babb". A user data stream containing this pattern string is shown in Figure 7A, where "*" represents any character string that does not contain the substrings "babb" or "bab". When the user data is transmitted on the network, it is split into two data packets, as shown in Figure 7B. A mail filtering system that filters packet by packet then cannot find the "babb" string contained in the user data stream, whether it filters packet 1 or packet 2, so a missed alarm clearly occurs. Content scanning therefore has to use a suitable algorithm. If only one keyword is checked per scan, a modified finite-automaton single-keyword matching algorithm can be used (without being limited to it): after each packet is scanned, the current state is saved in the "automaton temporary state" field of the connection node to which the packet belongs, and when the next packet is scanned, matching starts from the state indicated by that field instead of from the initial state of the automaton. If multiple keywords are to be checked in each scan, a modified Aho-Corasick multi-keyword matching algorithm can be used (again without being limited to it); likewise, after each packet is scanned the current state is saved in the "automaton temporary state" field of the connection node, and scanning of the next packet starts from the state indicated by that field rather than from the initial state of the automaton.
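The idea of carrying the automaton state across packets can be sketched for the single-keyword case. The sketch uses the Knuth-Morris-Pratt failure table as the automaton, which is one possible choice and not necessarily the modified algorithm the text has in mind; the function names and the packet split used in the usage note are hypothetical.

```python
def build_failure(pattern):
    """KMP failure table: fail[i] is the length of the longest proper
    prefix of pattern[:i + 1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def scan_packet(pattern, fail, payload, state=0):
    """Scan one packet starting from a saved automaton state.

    Returns (found, new_state); new_state is what would be stored in the
    connection node's automaton temporary state field.
    """
    found = False
    for ch in payload:
        while state and ch != pattern[state]:
            state = fail[state - 1]
        if ch == pattern[state]:
            state += 1
        if state == len(pattern):
            found = True
            state = fail[state - 1]   # keep matching after a hit
    return found, state
```

For example, with the keyword "babb" split across two packets as "xxba" and "bbyy", scanning the second packet from the initial state finds nothing, while scanning it from the state saved after the first packet reports the match. A multi-keyword scanner could carry its current Aho-Corasick node in the same way.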

Figures 8A and 8B illustrate the false-alarm problem caused by out-of-order packets. Assume the keyword to be filtered is the same as before and the user data stream is as shown in Figure 8A. When transmitted on the network it is split into two data packets, as shown in Figure 8B. In the figures, "*" means any string that does not contain the substrings "babb", "bab" or "abb". Scanned in the correct order, keyword matching will not find the "babb" string. However, with the algorithm above, if packet 2 arrives first and packet 1 arrives afterwards, the "b" at the end of packet 2 and the "abb" at the beginning of packet 1 together form the filtered keyword "babb", and a false alarm clearly occurs. The mail body therefore has to be scanned in the correct order. If the received data packets are out of order, the TCP connection maintenance sub-module of the filtering analysis module first sorts them and then submits them to the subsequent sub-modules.
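The sorting step can be sketched as a small buffer keyed by TCP sequence number that releases payloads only when they are contiguous. This is a minimal sketch under stated assumptions: sequence numbers do not wrap around, segments do not overlap, and the class name `StreamReorderer` is hypothetical.

```python
class StreamReorderer:
    """Release TCP payloads in sequence order, buffering early arrivals."""

    def __init__(self, initial_seq):
        self.next_seq = initial_seq   # sequence number of the next byte to scan
        self.pending = {}             # seq -> payload for segments that arrived early

    def push(self, seq, payload):
        """Accept one segment; return the payloads now ready to scan, in order."""
        if seq < self.next_seq:
            return []                 # already delivered (retransmission); simplified
        self.pending[seq] = payload
        ready = []
        while self.next_seq in self.pending:
            data = self.pending.pop(self.next_seq)
            ready.append(data)
            self.next_seq += len(data)
        return ready
```

In the scenario of Figure 8B, pushing packet 2 first returns nothing, and pushing packet 1 afterwards returns both payloads in the correct order, so the scanner never sees the "b" of packet 2 followed by the "abb" of packet 1.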

In order to implement the parsing and content filtering of the mail protocol, the current state of the connection is recorded in the TCP connection node. The node structure contains at least the following information:

1. IP addresses and transport layer port numbers of the client and server: these four parameters are the unique identifier that determines the connection to which a packet belongs;

2. Protocol type: SMTP, POP3 or IMAP;

3. Lifetime of the connection: used to prevent connections that have been inactive for a long time from occupying system resources;

4. Packet cache queue: caches the mail packets on this connection so that, if unsafe data is found on the connection, the mail data can be recovered and saved;

5. State of the session on this connection: whether it is in the command interaction state or the data transmission state;

6. Automaton temporary state: used to solve the missed-alarm problem when keywords are filtered packet by packet. At the end of a message, this field needs to be reset, that is, set to point to the initial state of the automaton;

7. Security flag of this connection: when unsafe information is found on the connection, it is marked in this field and the subsequent data on the connection is no longer scanned.
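The seven fields listed above could be grouped as follows. This is a sketch only: the field names, the use of a monotonic clock for the lifetime check, and the `scan` callback signature (payload and saved state in, hit flag and new state out) are assumptions made for illustration.

```python
import time
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ConnectionNode:
    client_ip: str                    # 1. the four identifying parameters
    client_port: int
    server_ip: str
    server_port: int
    protocol: str                     # 2. "SMTP", "POP3" or "IMAP"
    last_active: float = field(default_factory=time.monotonic)  # 3. lifetime
    packet_queue: deque = field(default_factory=deque)          # 4. packet cache
    session_state: str = "command"    # 5. "command" or "data"
    automaton_state: int = 0          # 6. automaton temporary state
    flagged_unsafe: bool = False      # 7. security flag

    def is_expired(self, idle_timeout, now=None):
        """True when the connection has been idle longer than idle_timeout seconds."""
        now = time.monotonic() if now is None else now
        return now - self.last_active > idle_timeout

    def on_payload(self, payload, scan):
        """Cache the payload and scan it unless the connection is already flagged."""
        self.last_active = time.monotonic()
        self.packet_queue.append(payload)
        if self.flagged_unsafe:
            return                    # no further scanning once flagged
        hit, self.automaton_state = scan(payload, self.automaton_state)
        if hit:
            self.flagged_unsafe = True

    def on_message_end(self):
        """Reset the automaton to its initial state at the end of a message."""
        self.automaton_state = 0
```

Keeping the cache queue filled even after the flag is set matches the requirement that the whole mail can be recovered and saved later.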

In the command interaction state, the interactive command and its parameters are extracted from the input data packet and analyzed; in the data transmission state, the mail data is extracted from the data packet, MIME decoding and content scanning are performed, and the scan result is submitted to the data processing module.

The data processing module 34 processes the analysis result data of the filtering analysis module according to the processing method specified by the security filtering policy, for example forwarding packets, dropping packets, cutting off the user connection, raising an alarm, or recovering the mail packets, reassembling them into an application-layer data stream and saving it to a database.

The operation and maintenance module 36, the storage backup module 35, and the like may also be added according to actual needs. The operation and maintenance module is used for system maintenance, and the storage backup module is used for storage and backup of system data and data packets.

Industrial applicability

By adopting the "connection-oriented" technical measure and suitable algorithms, the spam filtering system of the present invention solves the missed-alarm and false-alarm problems of packet filtering. Its most notable feature is that it does not depend on a specific mail server and is completely transparent to both the mail client and the server. Compared with the prior art, the invention greatly improves the reliability of the spam filtering system and broadens its range of application.

Claims

1. A connection-oriented spam filtering system, comprising at least: a data acquisition module, a filtering policy management module, a filtering analysis module, and a data processing module, wherein the data acquisition module is configured to capture data packets from the monitored network and submit them to the filtering analysis module as the data input of the entire filtering system; the filtering policy management module is used for the configuration and management of the filtering policy; the filtering analysis module is configured to analyze the input data packets according to the configured filtering policy and check whether they contain the information the filtering policy is concerned with; and the data processing module is configured to perform various processing on the analysis result data of the filtering analysis module.

2. The connection-oriented spam filtering system according to claim 1, wherein the system further comprises an operation and maintenance module and a storage backup module, wherein the operation and maintenance module is used for system maintenance, and the storage backup module is used for storage backup of system data and data packets.

3. The connection-oriented spam filtering system of claim 1, wherein the filtering policy comprises a filtering condition and a corresponding processing mode, the filtering condition being a logical combination of a plurality of conditions.

4. The connection-oriented spam filtering system according to claim 1, wherein the filtering analysis module comprises a TCP connection maintenance sub-module, a mail protocol parsing sub-module, and a MIME decoding and content scanning sub-module, wherein the TCP connection maintenance sub-module is used to maintain a TCP connection hash table; the mail protocol parsing sub-module is used to parse the mail protocol; and the MIME decoding and content scanning sub-module is used to determine the encoding of the input mail data, call the corresponding conversion function to convert it, and then perform a full-text scan of the mail content.

5. The connection-oriented spam filtering system according to claim 4, wherein the hash table uses the source IP address, destination IP address, source port and destination port quad of the data packet as the input for calculating the hash key value, the hash value is calculated using any of a variety of fast hash algorithms, and hash collisions are resolved by the chain address method.

6. The connection-oriented spam filtering system according to claim 4, wherein each TCP connection node in the hash table includes at least the IP addresses and transport layer port numbers of both parties to the connection and some current status information of the connection.

7. The connection-oriented spam filtering system according to claim 4, wherein the TCP connection node of the TCP connection maintenance sub-module records the current state of the connection.

8. The connection-oriented spam filtering system according to claim 7, wherein the structure of the connection node comprises at least:

(1) IP addresses and transport layer port numbers of the client and server: these four parameters are the unique identifier used to determine the connection to which a packet belongs;

(2) Protocol type: SMTP, POP3 or IMAP;

(3) Lifetime of the connection: used to prevent connections that have been inactive for a long time from occupying system resources;

(4) Packet cache queue: caches the mail packets on this connection so that, if unsafe data is found on the connection, the mail data can be recovered and saved;

(5) State of the session on the connection: whether it is in the command interaction state or the data transmission state;

(6) Automaton temporary state: used to solve the missed-alarm problem when keywords are filtered packet by packet;

(7) Security flag of this connection: when unsafe information is found on the connection, it is marked in this field, and subsequent data on the connection is no longer scanned.

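9. A connection-oriented spam filtering method, characterized in that the method comprises at least the following steps:

(1) a data collection step for capturing data packets from the monitored network and submitting them to the filtering analysis module as the data input of the entire filtering system;

(2) a filtering policy management step for configuring and managing the filtering policy;

(3) a filtering analysis step for analyzing the input data packets according to the configured filtering policy and checking whether they contain the information the filtering policy is concerned with;

(4) a data processing step for performing various processing on the analysis result data of the filtering analysis module.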

10. The connection-oriented spam filtering method according to claim 9, wherein step (3) further comprises: when e-mail is transmitted using SMTP, POP3 or IMAP, in the command interaction state, extracting and analyzing the interactive command and its parameters in the input data packet; in the data transmission state, extracting the mail data from the data packet, performing MIME decoding and content scanning, and submitting the scan result to the data processing module.

11. The connection-oriented spam filtering method according to claim 9, wherein the step (3) further comprises the following steps:

(111) a TCP connection maintenance step for maintaining a TCP connection hash table;

(112) a mail protocol parsing step for completing the parsing of the mail protocol;

(113) The MIME decoding and content scanning steps are used to judge the encoding mode of the input mail data, and call the corresponding encoding conversion function for encoding conversion, and then perform full-text scanning on the mail content.

12. The connection-oriented spam filtering method according to claim 10, wherein step (113) further comprises: after each packet is scanned, saving the current state in the automaton temporary state field of the connection node to which the packet belongs, and, when scanning the next packet, starting matching from the state indicated by the automaton temporary state of that connection node, so as to avoid a missed alarm.

13. The connection-oriented spam filtering method according to claim 10, wherein step (113) further comprises: sorting the out-of-order packets on the same TCP connection and scanning the content in the correct order, so as to avoid false alarms.