A Hybrid CNN-LSTM and XGBoost Approach for Crime Detection in Tweets Using an Intelligent Dictionary

ABSTRACT


INTRODUCTION
Crime rates have increased in Iraq and in the Arab world in general, and recognizing this increase is a key step toward resolving the problem addressed in this paper. The rise covers electronic extortion, theft, bullying, narcotics, and many other offenses. Understanding the growth and regionalization of crime requires an examination of crime statistics. Law enforcement faces challenges arising from the amount of information available, technological change, and population density. The current study was motivated in large part by the growth in crime rates and the spread of these crimes through social networking sites; its goal is to collect trustworthy information that helps characterize the nature of crime and enhance crime prevention. The complexities of the Arabic language make crimes more difficult to detect, which makes reliable detection all the more important for human security. Crime reaches society through social media, with bullies targeting individuals based on race, body image, or religious beliefs [1]. Crime affects individuals, society, and finances, and requires significant effort in prevention, victim care, and legal proceedings to improve lives and security [2]. Social networks facilitate cultural, political, and historical insights and knowledge sharing [3]. Social media also facilitates crime in its different forms, particularly through social engineering attacks, in which users are tricked into providing personal information or clicking on phishing links. Criminals can use social media platforms to spread malware through links and attachments, create fake accounts, publish harmful websites, and distribute malicious files.
Cybercriminals can use social media data to commit identity theft or to impersonate individuals in order to carry out fraudulent activities [4]. Rapid ICT development has influenced modern society, altering relationships of space, time, and identity [5]. The development of information and communication technologies (ICTs) has also significantly affected crime. It has increased both the probability and the sophistication of criminal activity, including fraud and identity theft. ICTs have made it simpler for criminals to execute their plans quickly and effectively, and they have made it more challenging for law enforcement to identify and stop crime, because criminals may use encryption and other technologies to hide their activities. Due to the rise in cybercrime, law enforcement must stay up to date on new developments in technology and adopt innovative tactics [6]. Crime affects individuals and communities, and studies are required to understand criminal behavior, identify individuals, and effectively detect, anticipate, and prosecute crime [7]. Governments are implementing various strategies to prevent crime on social media. These include promoting crime prevention through social media, conducting social media surveillance, monitoring criminal activity, and regulating social media companies. These measures aim to address privacy and civil liberties concerns, gather valuable information about crimes in progress, and hold social media companies accountable for illegal content [8]. Social media technology has changed how people interact through platforms such as Facebook, blogs, wikis, and Twitter [9]. This paper proposes a technique that uses machine learning and deep learning models to classify crime tweet datasets built from keywords. Performance metrics are evaluated on the datasets and the results are compared. A convolutional neural network (CNN) is improved with an LSTM, and the resulting hybrid CNN-LSTM is employed to classify new tweets as crime-related or non-crime-related. Machine learning and deep learning methods are well suited to crime detection on social media platforms because they can evaluate vast quantities of data and discern intricate patterns that are difficult for human observers to detect. These systems can also learn from new data and improve their precision over time. Moreover, deep learning algorithms can handle unstructured data, including photos and text, and therefore play a crucial role in the analysis of social media material.
The following are the work's main contributions:
1. Creating a new dataset of Arabic crime tweets, enriched with more features and with enough data to cover these types of crimes. A Python library was used to scrape tweets, and the Aho-Corasick algorithm was used to extract tweets containing approximately 38 crime-related keywords and approximately 25 keywords unrelated to crime.
2. This is the first thorough study that compares DL and ML models on Arabic crime tweet detection tasks across various keywords and evaluates their performance.
This paper is structured as follows. Section 2 presents the related work. Section 3 describes how an intelligent dictionary is built to gather the dataset. Section 4 describes the intelligent tools used to create the proposed model. Section 5 describes the proposed model and its execution. Section 6 presents and discusses the results. Finally, Section 7 draws conclusions and discusses potential directions for further study.

RELATED WORK
Social media has evolved into a powerful communication tool for sharing news and expressing feelings on a variety of themes. This holds particularly for Arabic, a difficult language with fewer resources than English and Chinese. In addition, Arabic is widely taught, local information is easily accessible, and cultural idiosyncrasies make the language a useful signal for spotting illegal activities on Twitter. Dialects, colloquialisms, false positives, the amount of data, and ethics are some of the challenges. Responsible and accurate Arabic Twitter crime detection requires linguistic competency, natural language processing, and cultural awareness. Instead of focusing on a single crime, this study looked at multiple crimes [10]. The study examines Twitter crimes in Arabic using DL and ML methodologies and compares the results with English outcomes. Machine learning (ML) and deep learning (DL) models are well suited to the analysis of social media material because they can process substantial volumes of data and discern intricate patterns. Deep learning algorithms can gradually acquire high-level features, improve accuracy through iterative training, and learn advanced characteristics that would otherwise call for manual feature engineering. Convolutional neural networks (CNNs) have demonstrated superior performance compared to conventional machine learning methods across several domains, including cybersecurity, natural language processing, bioinformatics, robotics, and medical information processing. Deep learning algorithms can learn from extensive datasets and offer short testing times, which makes them well suited to real-time applications [11].
Kaddoura et al. [12] offer a strategy for classifying Arabic tweets using machine learning and deep learning techniques, with feature extraction, N-gram models, and computed performance metrics. The classical machine learning techniques used are support vector machines, neural networks, logistic regression, and naive Bayes; the deep learning approaches use global vectors (GloVe) and fastText embeddings. With an F1-Score of 99.73%, the neural network method outscored the other algorithms, and GloVe outperformed fastText by 0.5%. The study [13] used DL models to identify offensive remarks on Arab YouTube channels, including Bi-LSTM with an attention mechanism, CNN, Bi-LSTM, and CNN-LSTM. The results showed accuracies of 85.7%, 86.4%, 87.2%, and 87.8%, with CNN outperforming its competitors. The authors of [14] employed CNN and CNN-LSTM models in their "fast and simple" methods and tested each model on a dataset of 8K Arabic tweets. According to their findings, classical ML models cannot compete with neural models; the best F1-Score, 73%, was obtained with the hybrid CNN-LSTM model. The authors of [15] propose a text analysis algorithm that categorizes tweets as cyberbullying using TF-IDF and Word2Vec features with seven machine learning classifiers; the model achieves a 96.4% F1-Score and outperforms other entity recognition methods. The authors of [16] used ML algorithms such as NB, DT, RF, and SVM to classify religious hate speech in Arabic tweets. They also trained DL models using fastText and word2vec embeddings, achieving a high F-measure and an accuracy of 71%. Finally, Aljarah et al. [17] used machine learning and natural language processing to identify cyber hate speech on Twitter in Arab environments, with the RF method achieving 91.3% accuracy.

BUILD AN INTELLIGENT DICTIONARY TO COLLECT THE DATASET
Datasets can be collected from Twitter in several ways, including the official Twitter API offered by the Twitter development team, through which developers can access Twitter's data programmatically. The API gives access to tweets, user data, trends, and Twitter activities. Developers require a Twitter Developer Account and a registered application to obtain API keys and access tokens. Twitter's API endpoints include GET /tweets/search/recent, GET /users/show, GET /followers/list, and POST /update. Developers must keep API request rates within limits to prevent abuse. The returned JSON data can be saved in databases or data lakes for analysis. Developers may also monitor events and trends with real-time streaming APIs for tweets and data [18]. This section discusses how the dictionary is built using library tools available for the Python programming language, such as Tweepy and snscrape. The snscrape library was used together with the Aho-Corasick algorithm to scrape the dataset.

Twitter data acquisition
This study collects Twitter data to assess the performance of trained classifiers in real-world scenarios, using up-to-date tweets from accounts. The experiments are designed to ensure practical applicability and to collect data suited to effective detection.

Twitter data extraction
This study collects Arabic tweets using a Python library, with keywords for crime and non-crime content. The Aho-Corasick algorithm is used to build an intelligent method that can be used for graph and metadata analysis and for classifying tweets related to crime. Recent tweets are extracted for analysis before the models are built, with a focus on detecting crimes in Arabic tweets. Some of the keywords used in building the intelligent dictionary are shown in Table 1. The data collection process targeted various fields, including politics and journalism. The classification process used criminal tweets, with a total of 119,597 tweets: the number of positive tweets was 60,524, the number of similar tweets 50,427, the number of bad tweets about criminal activity 67,011, and the number of identical tweets 55,254. The dataset was divided into six parts, and 18,493 records were taken for training and testing in this paper. Collection relied on the Aho-Corasick algorithm and on Python scraping tools linked with Twitter to gather both tweets that contained crimes and tweets that were free of them. Table 2 shows the most important features extracted from the tweets.

INTELLIGENT TOOLS USED IN BUILDING THE PROPOSED MODEL
The main smart tools used to develop the proposed approach are described in this section.

Aho-Corasick algorithm
Developed in 1975 by Aho and Corasick, the Aho-Corasick algorithm matches a set of patterns against an input text using a tree-like structure (a trie) that shares common prefixes and acts as a finite state machine [19]. It is an efficient string-matching method built on this tree structure and on failure transitions, which tell the automaton where to continue when the next character does not match. Because its running time is linear, it is widely used in applications such as intrusion detection, virus scanning, text mining, and lexical analysis. The preprocessing step builds the automaton once, after which all patterns can be matched against any number of input texts. Using the Aho-Corasick algorithm to construct a dictionary is advantageous when a substantial collection of terms must be searched efficiently within texts. After the keywords are identified in a text, the text can be categorized or a specific action can be executed according to the identified term. The dictionary is constructed to align with the proposed approach, and its outcomes are used to identify criminal activities associated with Twitter posts.
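The two phases described in this subsection, building the automaton once and then scanning each text in a single pass, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names `build_automaton` and `find_keywords` are hypothetical, and any keyword list passed to them stands in for the dictionary of Table 1.

```python
from collections import deque


def build_automaton(keywords):
    """Build the trie, failure links, and output sets for a keyword list."""
    goto = [{}]        # goto[state][char] -> next state
    fail = [0]         # failure link of each state
    output = [set()]   # keywords recognized on reaching each state

    # Phase 1: insert every keyword into the trie.
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(word)

    # Phase 2: breadth-first traversal that sets the failure links.
    queue = deque(goto[0].values())   # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the failure target.
            output[nxt] |= output[fail[nxt]]
    return goto, fail, output


def find_keywords(text, automaton):
    """Return (start_index, keyword) for every keyword occurrence in text."""
    goto, fail, output = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        # Follow failure links until a transition on ch exists (or the root).
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            hits.append((i - len(word) + 1, word))
    return hits
```

A tweet could then be flagged for the crime class whenever `find_keywords` returns a non-empty list for the crime keyword automaton; that labeling rule is an assumption made for illustration.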

Natural language processing (NLP) technique
This section describes how text documents are represented as numerical vectors so that machine learning algorithms can operate on them efficiently, using the TF-IDF technique to extract features in NLP tasks. TF-IDF is useful for feature extraction because it recognizes significant words in a document or corpus and gives them weights according to their significance. It is also scalable, which makes it appropriate for processing massive volumes of text data.

Basic machine learning
Computer scientists have focused on machine learning since the 1950s [20]. ML applications include translation and sentiment analysis [21]. ML algorithms are categorized into supervised, semi-supervised, and unsupervised types based on application goals [22,23]. Supervised algorithms train on labeled data for tasks such as crime detection and sentiment classification, using techniques such as SVM, Naive Bayes, Decision Trees, Random Forests, KNN, and Logistic Regression [24,25]. Semi-supervised algorithms use both labeled and unlabeled data during training, which is useful when labeled data is scarce or expensive to obtain. Unsupervised algorithms, such as K-Means and DBSCAN, focus on discovering patterns, clusters, or relationships in unlabeled data [26]. Against this background, XGBoost is a popular algorithm from the ensemble learning family [27]. It is beneficial for crime detection because it reflects the transition from rule-based AI to data-driven methods. XGBoost excels at processing and learning from big datasets, which makes it a valuable tool for crime detection based on the intelligent dictionary in the proposed method.

Deep learning (DL)
Deep learning extracts intricate features from high-dimensional data, creating models that connect inputs to outputs through multilayered neural networks that compute increasingly abstract representations [28]. DL techniques improve accuracy and reduce training time in complex problems, enabling breakthroughs in science and engineering tasks such as data analysis, pattern recognition, and prediction [29]. DL algorithms excel in autonomous vehicles, robotics, intelligent systems, community-related applications, and social network analysis, where they uncover patterns and trends that give valuable insights into user behavior and preferences [30]. DL models also offer potential in health, imaging, disease diagnosis, drug discovery, and personalized medicine. DL algorithms can be supervised, requiring labeled training data for predictions, or unsupervised [31][32][33]. Deep learning has thereby changed data analysis, pattern recognition, and decision-making. The deep learning model used in this work consists of two parts, A and B:

A. CNN (Convolutional neural network)
CNNs are used for image analysis and for text processing in cybercrime detection. They use convolutional filters to capture local patterns and features, identifying keywords, linguistic patterns, and structural characteristics. For additional processing or categorization, the output is passed into fully connected layers [34]. A CNN is a type of multilayer perceptron (MLP) network with a specific architecture that consists of numerous hidden layers [35,36].

B. LSTM (Long short-term memory)
Long short-term memory (LSTM) networks, a subtype of recurrent neural networks, are capable of retaining information over long sequences [37]. LSTMs recognize long-term dependencies, sequential patterns, and temporal correlations in data, making them well suited to handling text, modeling context, and detecting cybercrime [38]. CNN and LSTM were selected for this model because both are potent deep learning models that work well on sequence prediction problems with spatial inputs, such as text data. The CNN is used for feature extraction, which is crucial for text classification tasks, while the LSTM is used for sequence modeling and prediction. Because the LSTM can selectively ignore irrelevant information and remember previous inputs, it is especially helpful for sequence prediction problems. Together, the two models extract significant features from the text data and predict crime-related material in social media posts. Furthermore, CNN and LSTM have been shown to perform better than conventional machine learning algorithms in a variety of fields, such as sequence prediction and text classification. A hybrid model combining these two algorithms was therefore employed to identify crime tweets on Twitter.

PROPOSED MODEL
The proposed methodology employs Twitter as a news-oriented platform for user engagement, on which users communicate through responses, likes, and comments on diverse material. The parts of the proposed model are described in the following subsections:
1. Twitter data extraction involves topic-based queries using Python libraries and saving the tweet datasets in a CSV file, as described in Section (2.3).
2. Preprocessing steps in dataset preparation include tokenization, removing stop words, handling special characters, and normalization for Arabic tweets, including spelling variations and morphological structures.
3. Techniques for extracting features include conventional ones such as bag of words, TF-IDF, and n-grams as well as more sophisticated ones such as word embeddings and contextual embeddings. For feature extraction, this study uses TF-IDF.
4. Classification uses machine learning based on XGBoost and an improved deep learning model, the hybrid CNN-LSTM; the choice of model depends on task complexity, data size, and computational resources.
6. The proposed model was evaluated using confusion matrices, MAP, accuracy, and comparison on an unbiased test dataset.

Dataset labeling process
The data collection and classification process is based on criminal tweets: crime and non-crime tweets are identified using the dictionary and the algorithm and are assigned the labels 0 and 1.

Data preprocessing involves operations to clean the collected data before classification. It involves the following steps:
General Cleaning: The dataset is cleaned by removing links, extra spaces, and unnecessary elements.
Normalization: The normalization process unifies Arabic letter forms, ensuring a consistent representation for embedding.
Removal of Repeated Letters: Removed repeated letters within words.For example, words like ‫"اااااع"‬ and ‫"وووو_ح"‬ were transformed back to their original form, ‫"اع"‬ and ‫",و_ح"‬ respectively.However, if a word contained two similar characters (e.g., ‫"تسلل"‬ and " ‫نوع‬ -"), deleting the repeated characters would change the meaning of the word (" ‫تسل‬ ‫منوع‬ " and " ‫نوع‬ -‫م‬ ").To address this, the decision was made to remove repeated characters that occurred more than twice.
Number and Non-Arabic Character Removal: Numbers and non-Arabic characters are removed to prepare the data for Arabic text analysis. Figure 1 shows an example of a tweet before and after preprocessing.
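The four cleaning steps listed above can be sketched as a single Python function built on regular expressions. This is an illustrative sketch, not the authors' code: the function name `preprocess`, the exact regular expressions, and in particular the letter-unification map are assumptions (the paper does not state which letter forms are unified; the map below follows one common practice).

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# A character repeated three or more times in a row (doubled letters are kept).
REPEAT_RE = re.compile(r"(.)\1{2,}")
# Anything that is not an Arabic letter (U+0621..U+063A, U+0641..U+064A) or whitespace.
NON_ARABIC_RE = re.compile(r"[^\u0621-\u063A\u0641-\u064A\s]")

# Assumed normalization map: alef variants -> bare alef, alef maqsura -> yeh,
# teh marbuta -> heh. The paper does not specify its own mapping.
NORMALIZE = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above
    "\u0625": "\u0627",  # alef with hamza below
    "\u0622": "\u0627",  # alef with madda
    "\u0649": "\u064A",  # alef maqsura
    "\u0629": "\u0647",  # teh marbuta
})


def preprocess(tweet):
    """Clean one tweet following the steps described in the text."""
    text = URL_RE.sub(" ", tweet)         # general cleaning: drop links
    text = text.translate(NORMALIZE)      # normalization: unify letter forms
    text = REPEAT_RE.sub(r"\1", text)     # collapse runs longer than two
    text = NON_ARABIC_RE.sub(" ", text)   # drop digits, Latin letters, symbols
    return " ".join(text.split())         # squeeze leftover whitespace
```

Collapsing only runs of three or more characters keeps words with a legitimately doubled letter intact, which is the design decision the text motivates.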

TF-IDF feature extraction
At this stage, feature extraction converts the input data into a set of features [39]. The TF-IDF technique extracts features in natural language processing by weighting each word according to its importance: words that are frequent in a document but rare across the corpus receive high weights [40]. The proposed technique applies TF-IDF to the preprocessed dataset and feeds the result to the machine learning classifiers. TF-IDF is an easy-to-understand, effective, and efficient feature extraction method. It uses a statistical technique to assess the relevance of terms in the text, assigns word weights according to their importance, and produces results that are easy to interpret. It works well with small datasets and keyword-based classification tasks. Compared to word embeddings and contextual embeddings, TF-IDF requires less computing power, which makes it a practical option for feature extraction here.
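To make the weighting concrete, the sketch below computes the textbook form of TF-IDF, tf(t, d) × log(N / df(t)), in plain Python. It is an assumption that this variant matches the one used in the study; library implementations typically apply smoothing and normalization on top of it. The function name `tfidf` and the toy documents in the usage note are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # term frequency times inverse document frequency
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights
```

For two toy documents `["theft", "report", "city"]` and `["weather", "report", "city"]`, the shared terms get weight zero while `theft` and `weather` get positive weights, which is the behavior that makes TF-IDF suitable for keyword-based classification.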

Model construction
The suggested model for detecting tweet crimes compares ML models and an improved DL model trained on the dataset. Figure 2, the block diagram of the suggested technique framework, illustrates the proposed models for Twitter crime detection. The scheme uses a machine learning model, XGBoost, with a TF-IDF vectorizer. In deep learning, the model is trained on a crime tweet dataset. The hybrid CNN-LSTM model is improved by increasing the complexity of the CNN layers as the first development method. The development mechanism for the model is explained in section A.

A. Hybrid CNN-LSTM model
A total of 13 layers make up the model, including Embedding, Conv1D, Max Pooling, LSTM, Dense, and Output layers. The first convolutional layer has 16 filters and the second has 32, each with a kernel size of 3. The max pooling layer has a size of 2. The third and fourth convolutional layers have 64 and 128 filters, respectively, and the fifth has 256 filters and a kernel size of 3. The LSTM layer, with 64 units, captures long-range dependencies in the sequential data, and the dense layer has 64 units and a ReLU activation function. The output layer generates a probability score for the binary classification of whether or not the tweet involves a crime. The structure of the hybrid CNN-LSTM model is given in Table 3 and its architecture is shown in Figure 3.
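As a rough aid for following the layer stack, the sketch below traces how the sequence length, the channel count, and the number of trainable parameters change through the convolution and pooling stages, and gives the parameter count of a standard LSTM layer. Several points are assumptions not stated in the text: stride 1 and no padding in the convolutions, a single pooling layer placed after the second convolution, and the input length and embedding size used in the usage note. The names `trace_shapes` and `lstm_params` are hypothetical.

```python
KERNEL_SIZE = 3  # kernel size stated for the convolutional layers

# (kind, value): for "conv" the value is the number of filters,
# for "pool" it is the pool size. The pooling position is an assumption.
STACK = [("conv", 16), ("conv", 32), ("pool", 2),
         ("conv", 64), ("conv", 128), ("conv", 256)]


def trace_shapes(seq_len, embed_dim, stack=STACK):
    """Return (kind, output length, channels, trainable parameters) per stage."""
    length, channels = seq_len, embed_dim
    rows = []
    for kind, value in stack:
        if kind == "conv":
            # kernel * in_channels weights per filter, plus one bias per filter
            params = (KERNEL_SIZE * channels + 1) * value
            length = length - KERNEL_SIZE + 1   # stride 1, no padding
            channels = value
        else:
            params = 0                          # pooling has no weights
            length = length // value            # non-overlapping windows
        rows.append((kind, length, channels, params))
    return rows


def lstm_params(input_dim, units):
    """Trainable parameters of a standard LSTM layer (four gates, with biases)."""
    return 4 * (units * (input_dim + units) + units)
```

With a hypothetical input of 100 tokens and 128-dimensional embeddings, `trace_shapes(100, 128)` ends with a sequence of length 42 and 256 channels, which is what the 64-unit LSTM would consume under these assumptions.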

EXPERIMENTS AND FINDINGS
This section summarizes the experiments and findings for each of the models put forward in the preceding section. Two datasets were used. The first is a dataset of Arabic spam and ham tweets [41], which is widely used as a benchmark. The second was built and compiled from Twitter using the Aho-Corasick algorithm with the snscrape Python library for data scraping; it contains Arabic tweets on several kinds of crime and was used to detect crimes in tweets and to study the effect of features on model learning and the resulting accuracy.

Environmental settings
The research was conducted on an Intel Core i7 computer with Python 3.11.4 and Google Colab, using the Keras library for the deep learning models and scikit-learn for the machine learning techniques used in the comparisons.

Experiment 1 compares results with other ML and DL models
The proposed techniques are compared with various ML and deep learning algorithms on a 20% portion of the dataset and on the created dataset. Classifier performance is assessed using accuracy, precision, recall, and F1-measure. The best classifier is picked based on higher values of these measures, which are defined as Precision (Eq. (3)), Recall (Eq. (4)), Accuracy (Eq. (5)), and F1-Score (Eq. (6)) [42].
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (5)$$

The study focuses on the efficiency and performance of classification methods using evaluation measurements. The dataset includes keywords for crimes, user names, tweets, and other features. The proposed method categorizes the dataset using ML classifiers and DL models and selects the best classifier according to the assessed metrics. The results were obtained on a dataset of 13,240 tweets [41] and compared with the model of paper [12], which used the same dataset. The outcomes of the DL and ML models were as follows. The hybrid CNN-LSTM model achieved an accuracy of 99.84, and RF obtained an F1-measure of 98.76 and an accuracy of 100. The results also show that the machine learning models outperform most classification techniques in terms of accuracy and precision. Table 4 presents the performance measurements of the ML and DL models. The table shows that the proposed approach produced better outcomes than the method proposed in paper [12]. Paper [12] used ML and DL techniques, (FastText+LSTM) and (GloVe+LSTM), and reported no overall accuracy. For FastText+LSTM, the total classifier precision, recall, and F1-measure are 99.1, 92.42, and 95.1, respectively. For GloVe+LSTM, they are 98.88, 95.92, and 97, respectively. Machine learning nevertheless outperformed most classification algorithms in terms of precision. The results show high-performance classifiers that process text data effectively after preprocessing steps such as tokenization, stop word removal, stemming, and text normalization. These models provide accurate and valid classification based on the preprocessed data. For the second group (the constructed dataset, which totaled 18,493 records), the findings are shown in Table 5, which lists the performance of the proposed method for Twitter crime detection. In terms of F1, the findings showed that the hybrid CNN-LSTM model performed the best. Figure 4 depicts the confusion matrix for the XGBoost machine learning classifier.
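The four evaluation measures named in this experiment follow directly from the entries of a binary confusion matrix (true positives TP, true negatives TN, false positives FP, false negatives FN). The sketch below uses their standard definitions; the function name `classification_metrics` and the counts in the usage note are hypothetical and are not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute precision, recall, accuracy, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}
```

For example, `classification_metrics(tp=90, tn=95, fp=5, fn=10)` gives an accuracy of 0.925 and a recall of 0.9.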
The test results on the constructed dataset show that both the ML and the DL algorithms perform efficiently and well on crime classification. The XGBoost classifier achieves an accuracy of 100, while its overall precision, recall, and F1-measure are 99, 99.63, and 99.36, respectively. Table 6 summarizes the Twitter crime detection classifiers with the highest accuracy, recall, and F1 measurements, all of which are higher than those of paper [12] on dataset [41]. The XGBoost scores are greater in terms of accuracy, precision, recall, and F1. The hybrid CNN-LSTM classifier has an overall accuracy of 99.75, and its total precision, recall, and F1-measure are 98.76, 98.75, and 99.86, respectively. Figure 5 shows the confusion matrix for the XGBoost model on the constructed dataset. The proposed model's performance is also compared using additional measures, namely Mean Average Precision (MAP) and Average Precision for each class, as given in Table 6.
The hybrid CNN-LSTM model achieved an accuracy of 99.84% in detecting crime tweets when trained for 100 epochs with a batch size of 64, as shown in Figure 6. The loss and accuracy curves of the model and its confusion matrix are illustrated in Figure 7.
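The per-class Average Precision and the Mean Average Precision used in this experiment can be illustrated with the common ranking-based definition: tweets are sorted by predicted score, and precision is averaged over the ranks at which a true member of the class appears. The paper does not state how its MAP was computed, so this definition, the function names, and the treatment of the negative class (scoring it with one minus the positive score) are assumptions.

```python
def average_precision(labels, scores):
    """Average precision for the class marked 1 in labels, ranked by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank   # precision at this rank
    return total / hits if hits else 0.0


def mean_average_precision(labels, scores):
    """Mean of the average precision of the positive and the negative class."""
    ap_pos = average_precision(labels, scores)
    ap_neg = average_precision([1 - y for y in labels],
                               [1.0 - s for s in scores])
    return (ap_pos + ap_neg) / 2
```

Here `labels` holds the true classes (0 or 1) and `scores` holds the predicted probability of class 1 for each tweet, for instance the output of the final sigmoid layer.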

Experiment 2 compares proposed algorithms with the literature
Table 7 compares the best results of the proposed algorithms with those reported in the literature.

CONCLUSIONS
In conclusion, this study used the hybrid CNN-LSTM model and the XGBoost machine learning model to create a reliable intelligent system for detecting crime on Twitter. The suggested intelligent dictionary model employs the Aho-Corasick algorithm to identify criminal tweets in Arabic. Because of its high accuracy, this technology is useful for improving public safety in the digital age. Further investigation and cooperation with law enforcement are needed to refine and deploy it. The model's performance was evaluated using several metrics, namely accuracy, precision, recall, F1 score, and MAP, to assess its effectiveness in detecting crime-related tweets in Arabic. The outcomes demonstrated the efficacy of the deep learning model. The hybrid CNN-LSTM model achieved an accuracy of 99.75% and an F1 macro score of 99.84%. The conventional ML model trained with the XGBoost approach reached an F1 macro score of 99.36% and an accuracy of 100%. These results suggest that DL approaches are effective at finding crime-related tweets in the given dataset. The hybrid CNN-LSTM also achieved a mean average precision (MAP) of 99.82, further demonstrating the efficacy of the deep learning model in the suggested approach. Future research can concentrate on enhancing the model's capacity to react to new trends and changing patterns of criminal conduct on Twitter in Arabic, adding domain-specific vocabularies, and broadening the dictionary's coverage. To further improve the model's accuracy and robustness, preprocessing techniques should be enhanced and novel deep learning architectures explored. The suggested approach shows that an intelligent dictionary can find Arabic tweets about crimes. Combining machine learning, deep learning, and the Aho-Corasick algorithm provides an effective approach to identifying and addressing criminal content, which contributes to a safer online environment.

5. Model training includes feature extraction and training of the hybrid CNN-LSTM model, using 80% of the dataset for training.

Figure 2. Block diagram of the suggested technique framework

Figure 3. Architecture of the hybrid CNN-LSTM model

Figure 4. Confusion matrix for the XGBoost model based on the dataset [41]

Figure 5. Confusion matrix for the XGBoost model based on the constructed dataset

Figure 7. Confusion matrix of the hybrid CNN-LSTM model for crime tweets

Table 1. The keywords used in building the intelligent dictionary

Table 2. Feature extraction from tweets
9. Interactions: It represents the overall number of interactions or engagements with the tweet, which can include likes, retweets, replies, and other forms of engagement.

Table 4. Measurements of the ML and DL models' performance

Table 5. The proposed method for classifying Twitter Crime Detection performances

Table 6. Deep learning performance model comparison structure

Table 7. Comparison of proposed paper with related publications