A New Algorithm for Arabic Document Clustering Utilizing Maximal Wordsets

ABSTRACT


INTRODUCTION
Arabic is a Semitic language. It is most often spoken in nations where the majority of the population is Muslim, and it is the language of the "AL-Quran AL-Kareem," the Muslims' holy book. Over 400 million people speak Arabic as their first language, over 250 million speak it as their second language, and it is recognized as the official language of states in North Africa and the Middle East [1]. Arabic has three main varieties: traditional (Classical) Arabic, Modern Standard Arabic (MSA), and regional dialects.
The Arabic language presents unique challenges for document clustering compared to languages with simpler structures. Notably, the Arabic script lacks inherent vowel markings, leading to ambiguity. Additionally, Arabic morphology features complex derivational prefixes and suffixes that significantly alter word meaning. These characteristics necessitate specialized preprocessing techniques, such as disambiguation and stemming, during document preparation [2,3]. Document clustering plays a crucial role in data mining and information retrieval, especially for the vast and ever-growing volume of Arabic documents online. By grouping documents based on thematic similarity, clustering facilitates efficient navigation and analysis of these information resources. Effective clustering of Arabic documents is essential for various applications, including information retrieval, text mining, automatic document categorization, and user review analysis [4].
Document clustering is an unsupervised machine learning process that groups documents based on their similarities by maximizing the intra-similarity among documents within one group and minimizing the inter-similarity among different groups. This process does not require class labels for the documents [5]. A challenging task in data and text mining is identifying hidden, important, and potential patterns in documents [6].
Coping with massive data volumes, high dimensionality, and low retrieval precision are the most challenging issues in document clustering [7]. Arabic documents are now readily available online in a variety of formats, making them difficult to organize without the aid of a computer. The clustering of Arabic-language documents has therefore recently attracted the attention of researchers.
To depict the relationship between data points and the clusters they belong to, hierarchical clustering builds a tree-like structure. In the agglomerative form, each data point initially forms its own cluster, and smaller clusters are iteratively merged into larger ones until a stopping criterion is satisfied. Hierarchical clustering comes in two forms: agglomerative (bottom-up) and divisive (top-down).
Partitional clustering divides the data into a predetermined number of distinct, non-overlapping clusters. It aims to keep the distances between the data points and the cluster centroids as small as possible. K-means and K-medoids are two common partitional clustering algorithms. Density-based clustering, by contrast, forms clusters by locating regions with a high density of data points: a cluster is a dense zone surrounded by a sparse region. This type of clustering is helpful for detecting clusters of arbitrary shapes; DBSCAN and HDBSCAN are examples of such algorithms [8].
Association rule mining (ARM) [9] is a data mining technique used to discover potentially hidden patterns in data. Apriori and FP-Growth are the most widely used algorithms for extracting frequent itemsets (FI) and frequent patterns from datasets, respectively [10]. The Apriori approach is widely used to extract FI from large transaction sets. It rests on the property that every subset of a frequent itemset must also be frequent. The algorithm repeatedly counts the support of items and item combinations within the transactions, eliminates any set that falls below a user-defined threshold, and generates candidate frequent itemsets by joining the sets from the previous iteration and validating them against the transactions. Although easy to understand and use, Apriori can be computationally expensive for massive transaction sets and requires several database scans. The FP-Growth approach, on the other hand, is scalable and is used to mine frequent itemsets in large datasets. It stores the frequent items in a tree-like data structure known as the FP-tree. The itemsets are encoded in the tree after a single pass over the dataset, which makes mining the frequent itemsets fast: a depth-first search is used to traverse the tree, and a divide-and-conquer strategy together with pruning techniques reduces the size of the tree and improves the algorithm's scalability and performance. FP-Growth has been shown to be faster than traditional frequent itemset mining algorithms such as Apriori and is widely used in data mining and machine learning applications.
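The level-wise candidate-generation idea behind Apriori described above can be sketched as follows. This is a minimal illustration using absolute support counts, not the optimized algorithm from the literature:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori sketch: returns {frozenset: support count}.

    `transactions` is a list of item sets; `min_support` is an absolute count.
    """
    # Level 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-sets, then prune any
        # candidate that has an infrequent (k-1)-subset (Apriori property).
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k and all(
                    frozenset(sub) in frequent
                    for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Support counting pass over the transactions.
        frequent = {}
        for cand in candidates:
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                frequent[cand] = support
        result.update(frequent)
        k += 1
    return result
```

For example, with transactions `[{"a","b","c"}, {"a","b"}, {"a","c"}, {"b","c"}]` and a minimum support of 2, all three items and all three pairs are frequent, while `{"a","b","c"}` (support 1) is pruned.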
The conventional frequent itemset mining algorithms are computationally expensive and generate voluminous sets of items. Therefore, the maximal frequent itemsets (MFI) approach is adopted to overcome the aforementioned problems by significantly reducing the search space. Various algorithms have been proposed for mining MFI, such as MAFIA [11], the FPmax algorithm [12], MaxMining [13], GenMax [14], and MIMA [15,16], which is dedicated to mining textual MFI from Arabic documents.
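The defining property of an MFI, a frequent itemset with no frequent proper superset, can be illustrated with a small filter. This is a naive sketch over an already-mined collection; dedicated miners such as FPmax avoid enumerating all frequent itemsets first:

```python
def maximal_frequent(frequent_itemsets):
    """Keep only itemsets that have no proper frequent superset.

    `frequent_itemsets` is an iterable of frozensets already known to be
    frequent; the result is the maximal subset of them.
    """
    itemsets = list(frequent_itemsets)
    return [
        s for s in itemsets
        if not any(s < other for other in itemsets)  # `<` is proper subset
    ]
```

For instance, given the frequent itemsets {a}, {a,b}, {a,b,c}, and {d}, only {a,b,c} and {d} are maximal, which is why the MFI representation is so much more compact than the full FI collection.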
Document clustering plays a vital role in organizing and analyzing vast collections of Arabic text data. However, the unique characteristics of the Arabic language, such as the lack of inherent vowel markings and its complex morphology, pose significant challenges for traditional clustering techniques. This paper proposes a novel algorithm for Arabic document clustering that addresses these challenges and offers promising accuracy.
Our key contribution lies in leveraging maximal frequent wordsets (MFWs) for Arabic document clustering. By employing the FPMax algorithm, we extract the most prominent recurring sets of words within the documents. These MFWs capture the thematic content of documents more effectively than individual words, as they account for the inherent structure and context of the Arabic language.
The remainder of this paper is organized as follows: Section 2 examines previous studies that have used maximal frequent itemsets for document clustering. The proposed approach is thoroughly explained in Section 3 with an example, and the experiments and findings are covered in Section 4. The conclusion is presented in the final section.

RELATED WORKS
The interest in Arabic document clustering has increased recently due to the growing amount of Arabic content on the Internet, which makes manual clustering impractical. Our previous survey [17] of research on Arabic document clustering approaches and techniques revealed a limited number of studies in this area. To the best of our knowledge, no one has utilized maximal frequent itemsets (MFI) for clustering Arabic documents. So far, only one study has used frequent itemsets (FI) to cluster Arabic documents, using a hierarchical clustering approach based on N-grams [18]. The clustering accuracy achieved by the Frequent Itemset-based Hierarchical Clustering (FIHC) was 70%, higher than the 63% accuracy obtained for clustering European languages. The results of that research are not conclusive, however, due to the lack of information about the dataset used for the experiments; moreover, the use of only one dataset is insufficient to judge the efficiency and predict the behavior of an algorithm. In literature [19], a study investigated an approach that used K-means and particle swarm optimization (PSO) to group Arabic documents. K-means is sensitive to the selection of the initial clusters, producing different results depending on the initial points selected. This issue was overcome by using PSO to analyze the entire dataset, identify the best starting points for K-means, and achieve good clustering results. This approach, however, inherited the drawbacks of PSO and K-means, such as sensitivity to initial parameters, computational complexity, and convergence to local optima.
Alhawarat and Hegazi [20] utilized Latent Dirichlet Allocation (LDA) and K-means for document clustering, finding that normalization of the text data led to substantial improvements in the clustering outcomes. When the combined method was applied with normalization, it achieved higher scores (29% and 40% F-score for the BBC and CNN datasets, respectively) than the traditional approach (24% and 29% for the same datasets). However, LDA may face scalability challenges when applied to large-scale datasets, as the model's complexity increases with the number of documents and topics. LDA also requires tuning of hyperparameters such as the number of topics (K), the Dirichlet priors, and the sampling technique; improper selection of these hyperparameters can impact the quality of the clustering results.
Sangaiah [21] proposed unsupervised clustering for Arabic documents and compared three approaches: supervised, semi-supervised, and unsupervised. These methods utilized K-means, incremental K-means, threshold + K-means, and K-means with dimensionality reduction (DR) for clustering. Unsupervised clustering achieved 70% and 43% for the F-measure and entropy, respectively, and is regarded as effective for Arabic document clustering.
Although K-means is effective, it is sensitive to the selection of the initial points, which may hinder its performance. PSO-K-means solves this issue but incurs additional computational cost. When semantics are crucial, LDA is suitable, but it adds complexity. K-means with dimensionality reduction is effective, but information loss may occur during the reduction.
Our clustering strategy effectively reduces the dimensionality of Arabic documents, ensuring accurate text analysis and high-speed dimension reduction for effective clustering results, as we describe in the upcoming sections.

THE PROPOSED SYSTEM
The Maximal Frequent Wordset-Based Arabic Document Clustering System (MFW-ADC) is presented in this paper. MFW-ADC leverages the FPMax algorithm to extract informative MFWs from Arabic documents. These MFWs capture the thematic content and inherent structure of the language, enabling effective document clustering. The model comprises three modules: preprocessing, dimensionality reduction and MFW mining, and clustering (as shown in Figure 1). These modules are described in detail in the subsequent sections.

Preprocessing module
Document preprocessing is a crucial step in natural language analysis, ensuring trustworthy and reproducible textual data. It comprises four stages: tokenization, normalization, punctuation and stopword removal, and stemming. Tokenization divides a document into separate words. Normalization converts letter variants into one form, while punctuation and stopword removal reduce data dimensionality and improve analysis accuracy. Stemming reduces morphological variants to the base word, ensuring that words sharing the same root are treated as the same entity. Several stemming algorithms have been developed for Arabic, such as Tashaphyne, Khoja [22], and the Light Stemmer [23], to achieve optimal results. The procedure is depicted in Algorithm 1. Table 1 shows an explanation of the variables, symbols, and functions used in the proposed algorithms.
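As a rough illustration of the four stages, the following sketch applies toy normalization, stopword, and light-stemming rules. The tiny stopword list and affix tables here are illustrative assumptions only; a production system would use a full Arabic stopword list and a proper stemmer such as Khoja or a light stemmer:

```python
import re

# Illustrative mini stopword list; a real system would use a full Arabic list.
STOPWORDS = {"في", "من", "على", "إلى", "عن", "هذا"}

def normalize(token):
    """Fold common Arabic letter variants into one form."""
    token = re.sub("[إأآا]", "ا", token)          # alef variants -> bare alef
    token = re.sub("ة", "ه", token)               # taa marbuta -> haa
    token = re.sub("ى", "ي", token)               # alef maqsura -> yaa
    token = re.sub("[\u064B-\u0652]", "", token)  # strip diacritics
    return token

def light_stem(token):
    """Toy light stemmer: strips one common prefix and one suffix."""
    for prefix in ("ال", "و", "ب", "ل"):
        if token.startswith(prefix) and len(token) > len(prefix) + 2:
            token = token[len(prefix):]
            break
    for suffix in ("ها", "ات", "ون", "ين", "ه"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            token = token[: -len(suffix)]
            break
    return token

def preprocess(document):
    """Tokenize, normalize, remove punctuation/stopwords, then stem."""
    tokens = re.findall(r"\w+", document)               # tokenization
    tokens = [normalize(t) for t in tokens]             # normalization
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [light_stem(t) for t in tokens]              # stemming
```

For example, `preprocess("الكتاب في المدرسة")` drops the stopword "في", normalizes the taa marbuta, strips the "ال" prefix from both remaining words, and returns the stems `["كتاب", "مدرس"]`.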

Dimension reduction and wordsets mining module
This module comprises two steps: dimension reduction and wordset mining. Dimension reduction is accomplished in two stages: the first represents the preprocessed documents using the TF/IDF approach, and the second utilizes the FPmax algorithm to extract the MFWs according to the provided minimum support.
TF/IDF is one of the best metrics for showing how significant a word is to a document in a dataset. Two factors are considered when calculating TF/IDF: the term frequency (how many times a word appears in a specific document) and the inverse document frequency (how rarely the word appears across all documents in the dataset). This weighting emphasizes words that are frequent in a document but rare in the corpus, highlighting discriminative features while down-weighting irrelevant or common words. Eqs. (1) and (2) show how TF-IDF is calculated.

TF-IDF(w, d) = TF(w, d) × IDF(w)    (1)

IDF(w) = log(N / n_w)    (2)

where TF(w, d) is the frequency of word w in document d (i.e., the number of times word w appears in document d), IDF(w) is the inverse document frequency of word w, N is the total number of documents in the corpus, and n_w is the number of documents in the corpus that contain word w.
Our dataset consists of Arabic documents. Each document is identified by a unique document identifier (DID) and a list of preprocessed words (wordlist). To reduce the data complexity, the FPmax algorithm [12] is used to discover the maximal frequent wordsets (MFWs) within the documents. This technique relies on two key parameters: the minimum support and the maximum wordset length. Adjusting these parameters reduces the data dimensionality while still retaining the crucial information in the documents, which enhances the clustering efficiency and accuracy. The procedure is shown in Algorithm 2.
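The TF and IDF quantities defined above can be computed directly from the preprocessed word lists; a minimal sketch:

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute TF-IDF weights from raw term and document frequencies.

    `documents` maps a document id (DID) to its preprocessed word list;
    the result maps each DID to a {word: tf-idf weight} dictionary.
    """
    n_docs = len(documents)  # N: total number of documents
    # n_w: number of documents containing word w.
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))
    weights = {}
    for did, words in documents.items():
        tf = Counter(words)  # TF(w, d): raw frequency of w in d
        weights[did] = {
            w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf
        }
    return weights
```

Note that a word appearing in every document gets IDF = log(1) = 0 and therefore carries no weight, which is exactly the down-weighting of common words described above.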

Clustering module
This module outlines four key steps: primary clustering, merging equal clusters, hard clustering, and final clustering, which are detailed in the following subsections.

Primary clustering
Initial clusters are created from the FPmax algorithm's MFWs, with their words as labels. The best-fit clusters are determined using similarity functions such as Euclidean distance, cosine similarity, Manhattan distance, overlap, and the Jaccard index, as represented by Eqs. (3) to (7). The length of the cluster's label, i.e., the MFW's length, is added to the similarity function. The purpose of adding this factor is to assign each document to the most similar and largest cluster. The process of determining the initial clusters is illustrated in Algorithm 3.
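The five similarity measures can be sketched on word sets as follows. The exact formulations of Eqs. (3) to (7) and the way the MFW-length term is weighted are assumptions here; the sketch only illustrates how adding the label length biases documents toward larger clusters:

```python
import math

def cosine(a, b):
    """Cosine similarity between two word sets (binary vectors)."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0

def jaccard(a, b):
    """Intersection over union of the two word sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def overlap(a, b):
    """Intersection over the size of the smaller set."""
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0

def euclidean_sim(a, b):
    """Similarity form of Euclidean distance on binary vectors: 1/(1+d).
    For binary vectors the squared distance equals |a XOR b|."""
    return 1.0 / (1.0 + math.sqrt(len(a ^ b)))

def manhattan_sim(a, b):
    """Similarity form of Manhattan distance on binary vectors."""
    return 1.0 / (1.0 + len(a ^ b))

def score(document_words, mfw, sim=jaccard):
    """Best-fit score: similarity plus the MFW length, so that documents
    prefer the largest of equally similar clusters (illustrative weighting)."""
    return sim(set(document_words), set(mfw)) + len(mfw)
```

With this additive form, two clusters that are equally similar to a document are separated by their label lengths, so the document lands in the larger one.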

Merging equal clusters
It is worth mentioning that, after identifying the initial clusters, it is possible to find numerous clusters with identical DID sets. These clusters are then combined using the approach illustrated in Algorithm 4.
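This merge step can be sketched as grouping cluster labels by their DID sets; taking the union of the labels of merged clusters is an illustrative assumption:

```python
def merge_equal_clusters(clusters):
    """Merge clusters whose document-id (DID) sets are identical.

    `clusters` maps an MFW label (a tuple of words) to a set of DIDs;
    labels sharing the same DID set are combined into one cluster whose
    label is the union of the original labels.
    """
    by_dids = {}
    for label, dids in clusters.items():
        key = frozenset(dids)  # hashable DID-set key
        by_dids.setdefault(key, set()).update(label)
    return {tuple(sorted(words)): set(dids) for dids, words in by_dids.items()}
```

For example, two clusters labeled ("a","b") and ("c",) that both cover documents {1, 2} collapse into a single cluster labeled ("a","b","c").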

Hard clustering
The proposed algorithm uses hard clustering to assign documents to specific clusters based on their similarity. This process calculates the percentage of belonging between documents and MFWs, keeping each document in the cluster to which it is most related and removing it from the clusters to which it is least related. When several clusters compete for a document, the cluster with the most significant support incorporates it. The algorithm used for the hard clustering implementation is depicted in Algorithm 5.
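A minimal sketch of the hard-assignment step, assuming the belonging percentages have already been computed (how the percentage itself is derived follows the paper's Algorithm 5 and is not reproduced here):

```python
def hard_clustering(candidate_clusters):
    """Resolve overlapping assignments: keep each document only in the
    cluster where its belonging percentage is highest.

    `candidate_clusters` maps a cluster label to {DID: belonging percentage};
    the result maps each label to the set of DIDs it finally keeps.
    """
    # For every document, find the cluster with the highest percentage.
    best = {}
    for label, members in candidate_clusters.items():
        for did, pct in members.items():
            if did not in best or pct > best[did][1]:
                best[did] = (label, pct)
    # Rebuild the clusters with each document in exactly one of them.
    final = {label: set() for label in candidate_clusters}
    for did, (label, _) in best.items():
        final[label].add(did)
    return final
```

After this step every document belongs to exactly one cluster, which is the defining property of hard clustering.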

Final clustering
The final step involves merging the clusters (MFWs) produced by the previous stages to create the final clusters. The desired number of clusters is specified, and each cluster is merged with its most similar counterpart based on their similarity value. One of the measures described in Section 3.2 is used to calculate the similarities between the MFWs. The length of the MFW is a significant factor that ensures a cluster will be merged with the longest and most similar one. The employed procedure is depicted in Algorithm 6.
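The merging loop can be sketched as a greedy agglomeration that repeats until the desired number of clusters remains. The additive length term below mirrors the text's length factor but its exact weighting is an assumption, not the paper's Algorithm 6:

```python
def final_clustering(clusters, k, sim):
    """Greedily merge the most similar pair of clusters until `k` remain.

    `clusters` maps a label (frozenset of MFW words) to a set of DIDs;
    `sim` compares two labels. Longer merged labels are favored via a
    length term, mirroring the text's length factor (an assumption).
    """
    clusters = dict(clusters)
    while len(clusters) > k:
        labels = list(clusters)
        # Pick the pair with the highest similarity-plus-length score.
        best_pair, best_score = None, float("-inf")
        for i in range(len(labels)):
            for j in range(i + 1, len(labels)):
                s = sim(labels[i], labels[j]) + len(labels[i] | labels[j])
                if s > best_score:
                    best_pair, best_score = (labels[i], labels[j]), s
        a, b = best_pair
        # Merge: union of labels and union of document sets.
        clusters[a | b] = clusters.pop(a) | clusters.pop(b)
    return clusters
```

Each iteration removes exactly one cluster, so the loop always terminates after len(clusters) − k merges.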

RESULTS AND DISCUSSION
The proposed technique was implemented in Python and evaluated on several datasets using a Core i7 computer with 16 GB of RAM. The datasets and assessment metrics used to evaluate the effectiveness of the clustering method are described in the following sections.

Datasets
The proposed algorithm is applied to two datasets, CNN and OSAC [24]. Details of these datasets are briefly described in Table 6.

Experiments and results evaluation
This paper uses precision, recall, and F-score as assessment metrics to evaluate the performance of the proposed clustering technique. Precision measures the accuracy of document clustering by calculating the percentage of correctly assigned documents, while recall quantifies the completeness of the clustering. The F-score balances precision and recall, indicating both the accuracy and the comprehensiveness of the document clustering; a high F-score indicates high precision and recall, meaning that the majority of documents are correctly assigned. These metrics are calculated using the following equations:

Precision(CLi, Cj) = nij / |Cj|

Recall(CLi, Cj) = nij / |CLi|

F-score(CLi, Cj) = 2 × Precision(CLi, Cj) × Recall(CLi, Cj) / (Precision(CLi, Cj) + Recall(CLi, Cj))

where CL is the original class of the dataset, nij is the number of documents of class CLi that are present in cluster Cj, |CLi| is the number of documents in class i, and |Cj| is the number of documents in cluster j.
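The three metrics follow directly from the counts defined above; a minimal sketch for one class/cluster pair:

```python
def cluster_fscore(n_ij, class_size, cluster_size):
    """Precision, recall, and F-score of cluster j with respect to class i.

    `n_ij` is the number of documents of class i in cluster j,
    `class_size` is |CLi|, and `cluster_size` is |Cj|.
    """
    precision = n_ij / cluster_size  # fraction of cluster j drawn from class i
    recall = n_ij / class_size       # fraction of class i captured by cluster j
    f = 2 * precision * recall / (precision + recall) if n_ij else 0.0
    return precision, recall, f
```

For example, a cluster of 16 documents containing 8 of a class's 10 documents has precision 0.5, recall 0.8, and an F-score of about 0.615.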
The experiments were repeated with different predetermined minimum support values (minsupp) and different numbers of clusters for each of the selected datasets, i.e., CNN_3, CNN_4, CNN_5, CNN_6, OSAC_3, OSAC_4, OSAC_5, and OSAC_10. Table 7 and Table 8 show the details of the experiments.
The selection of a similarity metric can significantly impact the clustering results. For the CNN dataset, Euclidean distance, overlap similarity, and the Jaccard index yielded similar outcomes across the various cluster sizes (3, 4, 5, and 6). However, the best clustering results were obtained with Euclidean distance at different minimum support values for each cluster size (0.42 for CNN_3 and values between 0.38 and 0.40 for CNN_4 to CNN_6), as shown in Table 7.
In contrast, the best performance for the OSAC dataset was obtained using cosine similarity and occasionally Manhattan distance, with minimum support values ranging from 0.40 to 0.42; the results are depicted in Table 8. This leads to the conclusion that the optimal similarity metric depends on the specific dataset. The results presented in Table 7 and Table 8 also show that the minimum support threshold can positively affect the results because it governs the selection of discriminative features; it is clear that low support values provide such features. In addition, the Euclidean similarity metric mostly provides the best F-score results due to its appropriateness for representing maximal wordsets. The high F-score values indicate that the proposed algorithm balances precision and recall well: it effectively identifies true positives while minimizing false positives and false negatives, implying a good trade-off between precision (the accuracy of positive predictions) and recall (the sensitivity to true positives) and resulting in reliable, balanced clustering outcomes.
The proposed algorithm is compared with the studies mentioned in Section 2, and the results show that our algorithm achieves better clustering results than these works. When comparing it with the model presented in [18], it is found that using frequent itemsets for the clustering process leads to a voluminous number of itemsets, which in turn increases the search space and the computation time.
These problems are overcome in our proposed algorithm by using maximal frequent wordsets, which shrink the search space and reduce the computation time.
Daoud et al. [19] enhanced the selection of the initial clusters by combining K-means with PSO to scan the entire search space. The method assumes that each particle of the swarm represents the centroids of the clusters. The fitness function is minimized at each iteration using the local best and global best positions. In our proposed algorithm, as mentioned before, the search space is reduced by using a user-defined threshold to mine the MFWs, which then constitute the space to be scanned. Furthermore, our results are better than those reported by that study on the same dataset.
K-means and LDA were used in literature [20] for clustering and topic modeling. The documents are represented as a bag of words, TF-IDF is applied to the document vector space to eliminate redundant data, and the data is then normalized using the Euclidean norm. As a last step, K-means is used for document clustering; the topics, in turn, are modeled by feeding the same normalized dataset to LDA. The result achieved by that study is lower than that achieved by our model on the CNN dataset. Clustering Arabic documents was also accomplished using the unsupervised and semi-supervised approaches suggested by A. K. Sangaiah [21]. K-means or incremental K-means were used in these approaches, and the clustering results were evaluated using the F-measure and entropy. The comparison is summarized in Table 9.
Obtaining an F-score above 80% in Arabic document clustering is a noteworthy accomplishment, as it shows that the suggested method performs well in terms of both recall and precision. Such a score indicates the following:
High accuracy: since the F-score balances precision and recall, a value above 80% is a strong indication of the system's accuracy in correctly clustering Arabic documents.
Efficient clustering: a high F-score indicates that the suggested method efficiently clusters related Arabic documents while reducing false positives and misclassifications.
Robustness: the robustness and generalizability of the algorithm are demonstrated by its consistent achievement of an F-score above 80% across a variety of datasets and settings.
Comparative advantage: the proposed algorithm performs better than comparable algorithms, which usually yield lower F-score values.

CONCLUSIONS
In this paper, we introduced a novel approach for clustering Arabic documents. The method utilizes the maximal frequent wordsets discovered by the FPmax algorithm to achieve effective clustering results. This technique addresses the challenge of high-dimensional datasets by employing MFWs for efficient dimension reduction. The proposed method was evaluated on two benchmark datasets, CNN and OSAC. The experiments explored different cluster configurations by tuning the specified threshold and the number of clusters, producing a diverse set of clustering outcomes. These promising results contribute to the field of Arabic natural language processing (ANLP) and encourage further exploration of diverse techniques for enhanced Arabic document clustering. In future work, we aim to apply other data mining techniques and evaluate their impact on the clustering of Arabic documents.

Table 9. Comparison of the suggested approach and related work