OPEN ACCESS
In recent years, with the demand for pattern information in the field of bioinformatics research, frequent subtree mining algorithms have become a research hotspot. In this paper, a fast and efficient mining algorithm based on frequent embedded subtree is proposed to solve the problem of large-scale of biological data and high sequence pattern similarity in the process of biological data mining. The algorithm uses a unique string coding method to represent the tree, and uses a scope-list for substring amplification and frequency testing. The pruning technique greatly compresses the search space and reduces the computational time. Compared with the classical pattern mining algorithm, the proposed algorithm improves the efficiency of mining greatly.
ribonucleic acid, frequent subtree, topological pattern, frequent pattern mining
The mining of frequent patterns in databases of transactions, sequences, trees, and graphs is a research hotspot in the field of data mining in recent years [1-2]. In essence, frequent pattern mining refers to the discovery of useful associations and correlations between patterns in a large number of databases. In recent years, with the increasing complexity of the discovered structures and the demand for pattern information in the field of bioinformatics research, frequent subtree mining algorithms have become a topic of increasing concern for researchers.
In the research of bioinformatics, the prediction of protein structure and the structural analysis and excavation of compounds are inseparable from the mining of graphs and trees. In fact, a large number of Ribonucleic Acid (RNA) structures have been collected, they are essentially tree structures. In order to obtain information on a newly generated RNA molecule, it can only be compared with known RNA structures to find whether there is a common topology pattern between them. These models have important reference value for determining the molecular function of new RNA.
The mining of frequent subtrees has been an active research area in recent years, and many corresponding algorithms have been proposed. Because of the influence of Apriori algorithm [3], current most mining subtree algorithms use a breadth-first search strategy. When performing frequent subtree mining, the common subtrees of all the trees in the dataset need to be listed. The following two methods are commonly used to enumerate all common subtrees [4]. One is the method of pattern growth, it is named as the depth-first strategy; the other is the method of layer-by-layer mining, it is named as the breadth-first strategy.
There are many frequent subtree mining algorithms. The TreeFinder algorithm [5] can be used to mine unordered embedded subtrees, but if different trees have the same tagging node or the support threshold is small, then the TreeFinder algorithm can’t fully excavate all the frequent subtrees. In addition, there are many other algorithms for mining unordered direct subtrees, such as Hybrid Tree Miner, uNot, uFreqt, Free Tree Miner, and Path Join [6-10]. However, the direct application of these algorithms to the mining of biological data has the following two limitations: Firstly, the efficiency of algorithm still needs to be improved. Currently, the most frequent subtrees pattern mining algorithms mainly adopt the Apriori algorithm and its improved algorithms. This type of algorithm uses a candidate set generation-screening method, it is an efficient algorithm for uncomplicated patterns. But the algorithm consumes a large amount of time to process a large-scale candidate set. At the same time, it requires multiple scans of the database, it will greatly affect the mining efficiency. Secondly, the data mining technology can't effectively use the biological characteristics of DNA or protein sequence data. At present, most of the frequent pattern mining algorithms can only mine simpler frequent sequence patterns and frequent items, and they can't be applied to more complex biological information data.
In order to deal with these problems, a frequent subtree mining algorithm based on biological data was proposed. The algorithm introduces a unique string encoding method representation tree, and uses the scope-list for substring amplification and frequent testing. The proposed algorithm greatly reduces the search space and speeds up the running time.
2.1 Frequent subtree mining
Definition 1 (rooted ordered mark tree) A rooted ordered mark tree is an acyclic connected graph, denoted as T(v_{0}, N, L, B), where N is a set of marked nodes in tree T, v_{0} is the root of the tree, L is a set of all node labels, B is a set of directed edges in tree T.
Figure 1 shows examples of several marker trees. The number of nodes in the tree T is called the size of the tree, denoted as $|\mathrm{T}|$; furthermore, for each node in the tree, a sequence number $\mathrm{i}(\mathrm{i}=0 \ldots|\mathrm{T}-1|)$ will be allocated according to their position in the depth-first traversal. The node with the serial number i in the tree is denoted as n_{i}; and the set of tokens (also called items) of the node is denoted as L = {0,1,2,3,……,m-1}, where m is the number of different markers in the tree. For any node $\mathrm{v}_{\mathrm{i}} \in \mathrm{N}, \mathrm{L}\left(\mathrm{v}_{\mathrm{i}}\right)$, L(v_{i}) represents the label of node v_{i}, and n(v_{i}) represents the sequence number of node v_{i}. Each branch b consists of an ordered pair of nodes, where node v_{x} is the father of v_{y}. It should be noted that two different nodes in the tree may have the same tag. The labels of nodes C, E, and G in Figure 4.1 are all 2, and their serial numbers are 6, 5, and 4 respectively. Unless otherwise specified, the trees referred to herein refer to trees with rooted ordered marks.
Definition 2 (subtree) Given tree T(v_{0}, N, L, B), if $\mathrm{T}^{\prime}\left(\mathrm{V}_{0}^{\prime}, \mathrm{N}^{\prime}, \mathrm{L}^{\prime}, \mathrm{B}^{\prime}\right)$ is an embedded subtree of T, only if (1) $\mathrm{N}^{\prime} \in \mathrm{N}$; (2) For any node $\mathrm{v}_{\mathrm{i}} \in \mathrm{N}^{\prime}$, $\mathrm{L}\left(\mathrm{v}_{\mathrm{i}}\right)=\mathrm{L}^{\prime}\left(\mathrm{v}_{\mathrm{i}}\right)$; (3) For any branch $\mathbf{b}=\left(\mathbf{v}_{\mathbf{x}}, \mathbf{v}_{\mathbf{y}}\right) \in \mathrm{B}^{\prime}$, only if in tree T, v_{x} is the father of v_{y}, marked as $\mathrm{T}^{\prime} \subseteq \mathrm{T}$.
If it is not connected, it is called a subtree of T. For example, in Figure 1, S_{2} is a subtree of T, and S_{3} is a subtree of T.
Definition 3 (Frequent Subtree) Let D = {T_{1}, T_{2}, …, T_{n}} denote a database consisting of n trees. minsup is a given threshold, which satisfies 0<minsup≤1. If tree s is a subtree belonging to T_{j} in D, where 0 ≤ p ≤ n, then p is the support degree of tree s in D, denoted as sup(s)=p. If sup(s)/n≥minsup, then s is a frequent subtree in D.
Figure 1. Trees and their subtrees
Definition 4 (Frequent Subtree Mining) Given a database D composed of n trees and a user-defined minimal support minsup, the frequent subtree mining problem is to find all frequent subtrees s satisfies $\sup (\mathrm{s}) / \mathrm{n} \geq \mathrm{minsup}$ in D.
Definition 5 (node domain) If there are g nodes in the tree T, starting from any node v_{i}(0≤i≤g-1) in T, a sequence of vertices can be obtained by the root traversal of the subtree T_{i} with the root. The last vertex in the sequence is the rightmost leaf vertex in the subtree. Let this rightmost leaf vertex be v_{r }(0 ≤ r ≤ g-1), then $s\left(v_{i}\right)=\left[n\left(v_{i}\right), n\left(v_{r}\right)\right]$ is the domain of the node v_{i}.
Table 1 lists the markers and node fields for the vertices of tree T in Figure 1.
In Figure 1, the set of nodes of tree T is L = {0,1,2,3}, and the set of marked nodes is N = {A,B,C,D,E,F,G}. According to the depth-first traversal order of the tree, the sequence number of each node is shown in Table 2.
Table 1. The markings and node fields of each vertex of tree T
marking |
Node fields |
L(A)=0 |
S(A)=[0,6] |
L(B)=1 |
S(B)=[1,5] |
L(D)=3 |
S(D)=[2,4] |
L(F)=1 |
S(F)=[3,3] |
L(G)=2 |
S(G)=[4,4] |
L(E)=2 |
S(E)=[5,5] |
L(C)=2 |
S(C)=[6,6] |
Table 2. The sequence number of each node by depth-first traversal
Vertex |
A |
B |
C |
D |
E |
F |
G |
Number |
0 |
1 |
6 |
2 |
5 |
3 |
4 |
Definition 6 (String Encoding Method) C(T) represents the string encoding of the tree. The generation process is as follows: Firstly, C(T) is initialized to be empty, and then its vertex sequence (T) is obtained according to the depth-first traversal of T. During the traversal process, whenever the node v_{i} is added to (T), the open form $\overline{\mathrm{L}\left(\mathrm{V}_{1}\right)}$ of the node tag is added to C(T); And when the node is rolled back to its parent, the marked closed form L(v_{i}) is added to C(T).
Since each vertex will be traversed twice, the code string that stores a tree occupies $2|\mathrm{T}|+1$. In addition, through this method, each tree T will have a unique code string C(T), and each code string C also corresponds to a unique tree structure, forming a one-to-one correspondence between strings and trees, thus completely retaining the information of tree structure.
Definition 7 (substring) $\mathrm{X}=^{\prime} \mathrm{x}_{1} \mathrm{x}_{2} \cdot \cdot \mathrm{x}_{\mathrm{n}}^{\prime}$, $\mathrm{Y}=^{\prime} \mathrm{y}_{1} \mathrm{y}_{2} \cdot \mathrm{y}_{\mathrm{m}}^{\prime}$ represent two strings respectively, n≥m. If for each y_{i} in Y, X has x_{ji} corresponding to it, such that x_{ji}=y_{i}, and $\mathrm{j}_{1}<\mathrm{j}_{2}<\cdots<\mathrm{j}_{\mathrm{m}}$, then Y is a substring of X.
From Definition 7, it is not difficult to find that only when S is a subtree of T, C(S) is a substring of C(T). For example, in Fig. 4.1, the string code of the tree T is $\mathrm{C}(\mathrm{T})=\{\overline{0} \overline{1} \overline{3} \overline{1} 1 \overline{2} 23 \overline{2} 21 \overline{2} 20\}$, and the code strings C(S_{1}) and C(S_{2}) of its two subtrees S_{1} and S_{2} are also corresponding C(T) substrings. So to get all subtrees of tree T, all substrings of C(T) can be generated and filtered.
2.2 Validity check of substrings
In the algorithm of this paper, all candidate subtree sets are firstly generated and their corresponding support is calculated. The candidate subtree set must be non-redundant, and each subtree is generated at most once. According to the anti-monotonicity of frequent subtrees (the frequency of a tree is always less than or equal to the frequency of its subtrees), a candidate set can be efficiently generated. Firstly, a candidate set is generated from the smallest sub-tree, and then the sub-tree is expanded by gradually adding neighboring nodes on the sub-tree. All infrequent subtrees are removed during the expansion process and only frequent subtrees are expanded. All the generated candidate subtrees will be represented by the string encoding method described above. In order to avoid repetitive generation of candidate subtrees, it is specified that each subtree is only extended on the leaf node whose sequence number is the largest. Thus, the expansion of a candidate subtree T_{i} is essentially the extension of the rightmost end of the encoded string C(T_{i}) corresponding to the subtree. In the far right expansion process, a new string C´ is formed by adding a character after the string C(T_{i}). However, the resulting new string C´ does not necessarily represent a subtree, so it needs to be checked for legality.
Definition 8 A code string is legal if and only if it complies with the parentheses pairing rule, that is:
(1) For each $\overline{i}$ in the string, a unique i must be paired with it.
(2) If there is an $\overline{i}$ in the string before $\overline{j}$, the characters paired with them must satisfy: i is located after j.
As shown in Figure 1, the string of the tree T is coded as $C(T)=\{\overline{0} \overline{1} \overline{3} \overline{1} 1 \overline{2} 23 \overline{2} 21 \overline{2} 20\}$, and its two substrings are subtrees in which both $\{\overline{1} \overline{1} 1 \overline{2} 21\}$ and $\{\overline{0} \overline{1} 1 \overline{2} 2 \overline{2} 2 \overline{\overline{2}} 20\}$ represent T, so both substrings are legal. Another example is the substring c of C(T), which is illegal.
Definition 9 A substring C is extensible if and only if C is a prefix of a valid substring C´.
The tree T in Figure 2(a) is used as an example to illustrate the correctness of the above conditions. The corresponding code string of the tree is $C(T)=\{\overline{1} \overline{2} \overline{4} 4 \overline{5} 52 \overline{3} 31\}$. In the character expansion process, we will encounter such a substring $\mathrm{S}_{1}=\{\overline{1} \overline{2} \overline{4} 4\}$, which is a prefix of C(T) and therefore an extensible substring. In fact, S_{1} represents the subtree of Figure 2(b), and its extension should be added to the node marked X in the tree.
Figure 2. Tree T and its subtree extension
If there is a substring $\mathrm{S}_{2}=\{\overline{1} \overline{2} \overline{4} 42\}$, it is obviously that the subtree is the same as the substring S_{1}, but the position of the extension is not the same. The extension of S_{2} should be added to the node labeled Y in Figure 2(c). In general, for an extensible substring, let its right-most unopened open-form character be $\overline{i}$, then its corresponding subtree should extend the child node at the node corresponding to $\overline{i}$.
Based on the above analysis, a new algorithm is proposed for mining frequent subtrees for biological data. In the proposed algorithm, an initial scope-list table is created based on the input tree database, and a string encoding method corresponding to these trees is used to obtain a string set. In the process, a candidate set is generated from the smallest sub-tree (each vertex), and then the sub-tree is expanded on each sub-tree by gradually adding the right-most leaf node, so as to ensure that there is no redundant candidate sub-tree.. During the entire subtree expansion process, each candidate tree can be generated to calculate the corresponding support degree to judge the frequency of the subtree. With the anti-monotonicity of frequent subtrees, only frequent subtree extensions are considered and all infrequent subtrees will be removed.
The extension of the algorithm to the candidate subtree Ti is achieved by performing the far right character expansion with the code string C(T_{i}) corresponding to the subtree. In the far right expansion of C(T_{i}), a new string C´ can be formed simply by adding a character after the string C(T_{i}). Since the algorithm only extends the scalable sub-string corresponding to the sub-tree, the resulting new string C´ does not necessarily represent a sub-tree, so its scalability must be verified after the new string C´ is obtained. For all extensible substrings, their scope-list tables were created and then the scope-list tables were used to extend them. The process is repeated until all eligible substrings are generated, and finally the frequent subtrees that meet the requirements are obtained through frequent testing.
In the entire algorithm process, the search space is greatly compressed by the substring scalability check and the scope-list application, which improves the overall performance of the algorithm.
The overall framework of the algorithm is described as follows:
Algorithm 1 Frequent Subtree Mining Algorithm
Input: Tree database TDB, Minimum support min_sup;
Output: All embedded frequent subtrees;
Begin
Scan the database TDB once to generate the corresponding string encoding database SDB;
Build the initial Scope-list;
For each code string Q={x_{1}´, x_{2}´,..., x_{n}´} in SDB do
For i=1 To n Do
P_{i} = {∅}, i=1,2,...,n ;
For j=i To n Do
Y and x_{j}´ are spelled into a string $\mathrm{y}^{\prime}=\overline{\mathrm{Yx}_{j}^{\prime}}$;
extensibility check for y´;
If y´ is an extensibility substring Then
Construct a scope-list of y´ using the scope-list of y and x_{j}´;
If x_{j}´ is $\overline{S}$ Then
Forming the corresponding subtree t´ of y´;
extensibility check for t´;
If t´ is frequent Then add y´ to P_{i}
EndIf
EndIf
EndIf
EndFor
EndFor
EndFor
End
A Scope-list table is constructed for each newly generated string y' mentioned in the proposed algorithm. For the Scope-list table, it is defined as follows:
Definitions 10 Let y be a subtree T(y) with k nodes. The element in the Scope-list corresponding to the string is [t, s]. Where t is the tree number of the parent tree where the subtree T(y) is located, and s is the range of the form [l_{1}, u_{1}], [l_{2}, u_{2}], [l_{k}, u_{k}], which indicates the domain of each node in the tree.
The data set D = {T_{0}, T_{1}, T_{2}} composed of three trees shown in Figure 3 is taken as an example to illustrate the generation process of the Scope-list table (hereinafter referred to as S-L table).
Figure 3. Datasets made up of three trees
Firstly, the algorithm builds frequent subtrees starting from the smallest subtree. The initial S-L table shown as Table 3 is established for each differently labeled vertex.
Table 3. The initial S-L table
$\overline{1}$ |
$\overline{2}$ |
$\overline{3}$ |
$\overline{4}$ |
0[0,1] 1[0,3] 2[0,2] |
1[1,2] 2[1,1] |
0[1,1] 1[3,3] 2[2,2] |
1[2,2] |
The vertex labeled 3 is used as an example to illustrate the establishment of the initial S-L table. The vertices marked 3 appear at the node position 1 of the tree T_{0}, and the domain is [1,1]. Then, (0[1,1]) is added to the $\overline{3}$'s S-L table. The vertices marked with 3 also appear at the node position 3 of the tree T_{1}, and the domain is [3, 3]. Then, (1[3,3]) is added to the $\overline{3}$'s S-L table. It also appears at node 2 of tree T_{2}, with the domain is [2, 2]. Then, (2[2,2]) is added to the $\overline{3}$'s S-L table. After searching all subtrees, all $\overline{3}$ items are available.
In order to detect the inclusion relationship between the domains of different nodes of the same subtree, the following definitions are given:
Definition 11 Let the domains of the vertices x and y of the same subtree be s_{x} = [l_{x},u_{x}] and s_{y} = [l_{y},u_{y}], respectively, only when l_{x}≤l_{y}, u_{x}≥u_{y, $\mathrm{s}_{\mathrm{y}} \subset \mathrm{s}_{\mathrm{x}}$}.
When the new character string $\overline{\mathrm{Cg}}$ is generated by adding the character g to a certain string C through the proposed algorithm, the corresponding S-L table is generated as follows:
(1) Find the same pair of items in the S-L table for string C and character g. Let a pair of items with the same t value be (t, [l_{11},u_{11}], [l_{12}, u_{12}],..., [l_{1k}, u_{1k}]) and (t, [l_{2}, u_{2}]).
(2) $\left[l_{1 k}, u_{1 k}\right] \supset\left[l_{2}, u_{2}\right]$,then it generates item (t, [l_{11},u_{11}], [l_{12}, u_{12}],..., [l_{1k}, u_{1k}], [l_{2}, u_{2}]) to join S-L table of $\overline{\mathrm{Cg}}$.
The resulting rules are visible for each item (t, [l_{11},u_{11}], [l_{12}, u_{12}],..., [l_{1k}, u_{1k}]) in the S-L table for a string C, $\left[\mathrm{l}_{11}, \mathrm{u}_{11}\right] \subset\left[\mathrm{l}_{12}, \mathrm{u}_{12}\right] \subset \cdots \subset\left[\mathrm{l}_{1 \mathrm{k}}, \mathrm{u}_{1 \mathrm{k}}\right]$.
For example, in the database of Figure 3, the string C={$\overline{1}$} and character g=$\overline{2}$ are set, and when generating the string {$\overline{1}$ $\overline{2}$}, we firstly find out all pairs of items with the same t value in the S-L table of $\overline{1}$ and $\overline{2}$. They are (1[0,3]) and (1[1,2]), (2[0,2]) and (2[1,1]). Since [1, 2]$\subset$[0, 3], the generated item (1[0, 3] [1, 2]) is added to the S-L table of the string {$\overline{1}$ $\overline{2}$}. By the same token, since [1, 1]$\subset$[0, 2], the generated term (2 [0, 2] [1, 1]) is added to the S-L table of the string {$\overline{1}$ $\overline{2}$}.
For applying the proposed algorithm to the secondary structure pattern mining of RNA, the tree structure of the RNA molecule should first be obtained. For this purpose, a method of establishing a tree model for RNA secondary structure is given. In the tree, the sides are RNA strands of complementary base pairs, and the apexes are the 3' position and the 5' position of the double helix, the joint, the inner ring, the hairpin ring, the convex ring, and the like. From the 5' position to the 3' position of the RNA, the root is marked as 1, and the other vertices are sequentially labeled with an integer from 2 to n. The child nodes for each node will be sorted according to their tags. The tree model derived from an RNA secondary structure using this modeling approach should be unique because the 5' and 3' positions of the RNA molecule are fixed. An ordered, labeled tree model obtained using the above modeling method for an RNA secondary structure is shown in Figure 4.
Figure 4. RNA structure and its tree model
4.1 Algorithm performance comparison
In experiment, using the random program to generate the required test data, a parent tree with q markers and p nodes is generated. Then, the data set of the tree for field test is generated based on the required number of subtrees t. The program's default parameters are set to p=1000, q=100, that is, the number of nodes in the tree is between [0,1000], and the corresponding number of tags is between [0,100].
The adaptability to data scale is often considered as one of the important indicators to measure data mining algorithms. Figure 5 shows the comparison of two typical sub-tree mining algorithms (PatternMatcher algorithm [11] and TreeMiner algorithm [12]), and the proposed algorithm's running time for different amounts of input data under certain support conditions. Take the test data scale from t = 100 to t = 10^{6}, and the support is kept at 0.5. The running time of the PM algorithm, TM algorithm, and the proposed algorithm (IRTM) changes with the data volume.
It can be seen from Figure 5 that the running time of the PM algorithm, the TM algorithm, and the proposed algorithm increases as the amount of data increases. However, when the amount of data is greater than 100,000, the running time of the PM algorithm rises sharply. This is mainly due to the fact that the PM algorithm needs to repeatedly scan the database, and a large candidate set is checked through pattern matching, which affects the efficiency of the algorithm. Therefore, the running time of the TM and the proposed algorithms mining in vertical mode is significantly less than the PM algorithm. At the same time, because of the superior pruning strategy in the proposed algorithm, it makes the time increase smaller than the TM algorithm. In summary, the proposed algorithm is significantly better than the PM and TM algorithms in a large-scale data environment.
Figure 5. The relationship between the number of trees and time
In addition to the data volume, the adaptability to support is also one of the important indicators for measuring frequent pattern mining algorithms. In general, the support threshold is inversely related to the mining time. Therefore, when the input data size is fixed, the running time of the PM algorithm, the TM algorithm, and the proposed algorithm can be tested and compared by changing the support.
Figure 6 shows the runtime variation of the three algorithms with a data size of t=10^{4} and a support increase from 0.1 to 1. Figure 7 shows the comparison of the runtimes of the three algorithms with a change in support from 0.1 to 1 with the data size t increasing to 10^{6}.
Figure 6. The relationship between the support and time (T=10000)
It can be seen from Figure 6 and Figure 7, with the increase of support, the running time of the three algorithms under different data scale conditions is reduced, and the running time of the PatternMatcher algorithm is much larger than the TreeMiner algorithm and the proposed algorithm. However, when the support reaches 0.9, the running time of the PatternMatcher algorithm drops more sharply than the TM and the proposed algorithms. It is because in the horizontal mode mining, the pruning operation can be performed according to the Apriori nature. The specific strategy is to eliminate some infrequent subtrees during the hierarchical search process. When the support degree is approximately 1, most of the candidate modes will be removed. However, in practice, the user-defined support is generally less than 0.5, and it is impossible to have a support level close to 1. However, when the degree of support is less than 0.9, the running time of the vertical mining algorithm TM and the proposed algorithm is obviously less than the PM algorithm. It can also be seen from the Figures that the running time of this algorithm under the conditions of different data sizes is affected little by the change of the support degree, and the overall change trend is also relatively stable, indicating that the adaptability of the proposed algorithm in this paper is stronger than the PM algorithm. In addition, the proposed algorithm in this paper always runs less time than the TM algorithm, it means that the proposed algorithm is more efficient than the TreeMiner algorithm.
Figure 7. The relationship between the support and time (T=100000)
4.2 Results of common topology pattern mining of rna molecules
From the RNaseP [13] database, some RNA molecular structures were selected and a corresponding tree model was established on this basis. Using the algorithm IRTM, the common tree topology of 30 trees is mined.
Figure 8 shows the running time of the proposed algorithm as a function of support. It can be seen from the Figure 8 that as the degree of support decreases, the number of frequent patterns that are mined to meet the conditions is gradually increasing and the corresponding running time is also increased.
Figure 8. The change of running time with support
Figure 9 shows the changes in the number of frequent patterns discovered by the proposed algorithm under different support levels. It can be clearly seen from the Figure 9 that the number of frequent patterns excavated with the decrease of the support degree is gradually increasing. However, when the support is close to 0, the significance of the frequent patterns being tapped is not significant. The growth rate of the number of models is relatively stable when the degree of support varies within the range of [0.3, 0.5]. It shows that the frequent patterns obtained by the proposed algorithm at this time are representative of the selected 30 RNA prokaryotic structures.
Figure 9. Number of patterns under different support degrees
Figure 10 shows a sub-tree mined at least 9 times by the proposed algorithm in 30 trees, which was not found by other algorithms. Therefore, compared with other direct subtree mining algorithms, the proposed algorithm can mine more “hidden” information for the mining of RNA subtrees, and at the same time, it makes the mined results more biologically meaningful.
Figure 10. A subtree that is mined
In this paper, a fast and efficient frequent embedded subtree mining algorithm for biological data mining is proposed. The algorithm uses vertical mining, and uses a string encoding method to represent the tree. After the tree structure is represented by the string encoding method, the extension of the subtree is equivalent to the extension of the corresponding substring. The sub-string amplification and frequent test are performed by using the Scope-List. At the same time, the pruning technology greatly compresses the search space and speeds up the operation, which greatly improves the efficiency of the algorithm. Theoretical analysis and experiments show that compared with the classical embedded subtree mining algorithm, the proposed algorithm has higher performance and it is more efficient, faster and more stable. The efficiency of the proposed algorithm is improved by 17%.
This work is supported by the National Natural Science Foundation of China (No.61841603), Guangxi Natural Science Foundation (No.2018JJA170050, No.2014GXNSFBA118268), Science and Technology Research Foundation of Guangxi Universities (201204LX339).
[1] Arnold, R., Goldenberg, F., Mewes, H.W. (2014). SIMAP-the database of all-against-all protein sequence similarities and annotations with new interfaces and increased coverage. Nucleic Acids Research, 42(1): 279-284. https://doi.org/10.1093/nar/gkt970
[2] Slater, G.S.C., Ewan, B. (2005). Automated generation of heuristics for biological sequence comparison. Bmc Bioinformatics, 6(1): 1-11. https://doi.org/10.1186/1471-2105-6-31
[3] Toivonen, H. (2011). Apriori Algorithm. Encyclopedia of Machine Learning, 39-40. https://doi.org/10.1007/978-0-387-30164-8_27
[4] Tan, Z., Sharma, G., Mathews, D.H. (2017). Modeling RNA secondary structure with sequence comparison and experimental mapping data. Biophysical Journal, 113(2): 330-341. https://doi.org/10.1016/j.bpj.2017.06.039
[5] Termier, A., Rousset, M.C. (2002). TreeFinder: A first step towards XML data mining. IEEE International Conference on Data Mining, IEEE Computer Society, 450-458. https://doi.org/10.1109/ICDM.2002.1183987
[6] Zaki, M.J. (2004). TreeMiner: An efficient algorithm for mining embedded ordered frequent trees. Advanced Information & Knowledge Processing, 123-151. https://doi.org/10.1007/1-84628-284-5_5
[7] Pasquier, C., Sanhes, J., Flouvat, F. (2016). Frequent pattern mining in attributed trees: algorithms and applications. Knowledge and Information Systems, 46(3): 491-514. https://doi.org/10.1007/s10115-015-0831-x
[8] Jiang, C., Coenen, F., Zito, M. (2013). A survey of frequent subgraph mining algorithms. The Knowledge Engineering Review, 28(1): 75-105. https://doi.org/10.1017/s0269888912000331
[9] Asai, T., Arimura, H., Uno, T. (2003). Discovering frequent substructures in large unordered trees. Lnai, 2843: 47-61. https://doi.org/10.1007/978-3-540-39644-4_6
[10] Babbar, A., Singh, A., Singh, D. (2014). A survey on problems and solutions of frequent pattern mining with the use of pre-processing techniques. International Journal of Computer Applications, 95(1): 23-28. https://doi.org/10.5120/16559-4125
[11] Okosun, J., Csaba, B., Wang, J. (2014). Integrated genomic analysis identifies recurrent mutations and evolution patterns driving the initiation and progression of follicular lymphoma. Nature Genetics, 46(2): 176-181. https://doi.org/10.1038/ng.2856
[12] Zhang, S., Du, Z., Wang, J.T. (2015). New techniques for mining frequent patterns in unordered trees. IEEE Transactions on Cybernetics, 45(6): 1113-1125. https://doi.org/10.1109/tcyb.2014.2345579
[13] Liu, M.H., Yuan, Y., Reddy, R. (1994). Human RNaseP RNA and nucleolar 7-2 RNA share conserved 'To' antigen-binding domains. Molecular & Cellular Biochemistry, 130(1): 75-82. https://doi.org/10.1007/BF01084270