A Novel Framework for Email’s Data Leak Prevention Through Semantic Analysis

Muhammad Nouman Ahmed, Hassan Mahmood, Zafar Iqbal

2023 · DOI: 10.1109/ICIT59216.2023.10335896
International Conference on Industrial Technology · 1 Citation

TLDR

A novel method is proposed for developing a Data Leak Prevention (DLP) system that classifies emails within organizations. It performs semantic analysis on the email text and the sender’s email to decide whether the email should be allowed or blocked.

Abstract

Email has become the primary means of information exchange between organizations. Data Leak Prevention (DLP) systems face a significant challenge when insiders leak sensitive data through email. A novel method is proposed for developing a DLP system capable of classifying emails within organizations. Three datasets were used to develop a machine learning model: a Custom dataset, the Enron email dataset, and a mixture of the two. The Custom dataset gave the best accuracy, 98.4%. Text was preprocessed with regular expressions and stemming, and the processed data was converted to feature vectors using count vectorization. Ten ML algorithms were used to train models that classified email text with high accuracy into four departmental categories: HR, Finance, Engineering, and Sales. The final result was calculated by taking the mode of the ML model predictions and matching it against the sender’s email to perform the necessary action. The resulting model performs semantic analysis on the email text and the sender’s email to classify whether the email should be allowed or blocked.
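
The pipeline described in the abstract (regex cleaning, stemming, count vectorization, a mode vote over model predictions, then a check against the sender) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation. The suffix list is a crude stand-in for a real stemmer, the function names are hypothetical, and the rule "allow only when the predicted department equals the sender's department" is an assumption, since the abstract does not state how the match maps to an action. The ten trained classifiers are represented only by their predicted labels.

```python
import re
from collections import Counter
from statistics import mode

# Hypothetical suffix list: a crude stand-in for a real stemmer (e.g. Porter).
SUFFIXES = ("ing", "ed", "es", "s")


def preprocess(text):
    """Lowercase, replace non-letters via a regex, and crudely stem each token."""
    tokens = re.sub(r"[^a-z\s]", " ", text.lower()).split()
    stemmed = []
    for tok in tokens:
        for suf in SUFFIXES:
            # Strip at most one suffix, keeping a stem of at least 3 letters.
            if tok.endswith(suf) and len(tok) - len(suf) >= 3:
                tok = tok[: -len(suf)]
                break
        stemmed.append(tok)
    return stemmed


def count_vector(tokens, vocabulary):
    """Count vectorization: one term count per vocabulary word, in order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def decide(predictions, sender, sender_departments):
    """Take the mode of the model predictions and compare it with the sender.

    Assumption: an email is allowed only if the predicted department
    matches the department registered for the sender's address.
    """
    predicted = mode(predictions)
    allowed = sender_departments.get(sender) == predicted
    return predicted, ("allow" if allowed else "block")
```

In use, each trained model would predict a department from the count vector of the preprocessed email, and `decide` would receive the list of those labels together with a sender-to-department lookup. Note that on Python 3.8 and later `statistics.mode` returns the first of several equally common labels, so a tie-breaking policy would need to be chosen explicitly in a real system.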