SDRS: a new lossless dimensionality reduction for text corpora

Iñaki Velez de Mendizabal; Vitor Manuel Basto Fernandes; Enaitz Ezpeleta; Jose R. Mendez; Urko Zurutuza


Scientific journal article (Q1)

Iñaki Velez de Mendizabal (De Mendizabal, I. V.); Vitor Manuel Basto Fernandes (Basto-Fernandes, V.); Enaitz Ezpeleta (Ezpeleta, E.); Jose R. Mendez (Méndez, J. R.); Urko Zurutuza (Zurutuza, U.)

Journal Title

Information Processing and Management

Year (final publication)

2020

Language

English

Country

United Kingdom

Web of Science®

Citations: 8

(Last checked: 2026-07-15 01:01)

Article Impact Index: 0.2

Scopus

Citations: 8

(Last checked: 2026-07-03 22:55)

Article Impact Index: 0.1

Google Scholar

Citations: 9

(Last checked: 2026-07-16 23:15)

Overton

This publication is not indexed in Overton

Abstract

In recent years, most content-based spam filters have been implemented using Machine Learning (ML) approaches by means of token-based representations of textual contents. Multiple performance enhancements have since been introduced, but their impact has been virtually negligible. Recent studies have introduced synset-based content representations as a reliable way to improve classification, as well as different ways of taking advantage of semantic information to address problems such as dimensionality reduction. These preliminary solutions present some limitations and enforce simplifications that must be gradually redefined in order to obtain significant improvements in spam content filtering. This study addresses the problem of feature reduction by introducing a new semantic-based proposal (SDRS) that avoids losing knowledge (i.e., it is lossless). Synset-features can be semantically grouped by taking advantage of taxonomic relations (mainly hypernyms) provided by the BabelNet ontological dictionary (e.g. “Viagra” and “Cialis” can be summarized into the single features “anti-impotence drug”, “drug” or “chemical substance”, depending on whether the generalization spans 1, 2 or 3 levels). In order to decide how many levels should be used to generalize each synset of a dataset, our proposal takes advantage of Multi-Objective Evolutionary Algorithms (MOEA) and, in particular, of the Non-dominated Sorting Genetic Algorithm (NSGA-II). We have compared the performance achieved by a Naïve Bayes classifier, using both token-based and synset-based dataset representations, with and without dimensionality reduction. As a result, our lossless semantic reduction strategy was able to find optimal semantic-based feature grouping strategies for the input texts, leading to better performance of Naïve Bayes classifiers.
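
The hypernym-based grouping described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `HYPERNYM` table is a hand-written toy taxonomy standing in for BabelNet, the function names `generalize` and `reduce_features` are hypothetical, and merging features by summing their counts is an assumption made for illustration. The per-synset `levels` mapping plays the role of the decision that, according to the abstract, is searched for with NSGA-II; here it is simply fixed by hand.

```python
from collections import Counter

# Toy hypernym chain (hypothetical; the paper obtains these relations from BabelNet).
HYPERNYM = {
    "Viagra": "anti-impotence drug",
    "Cialis": "anti-impotence drug",
    "anti-impotence drug": "drug",
    "drug": "chemical substance",
}


def generalize(synset, levels):
    """Climb `levels` hypernym steps, stopping early at the top of the chain."""
    for _ in range(levels):
        parent = HYPERNYM.get(synset)
        if parent is None:
            break
        synset = parent
    return synset


def reduce_features(counts, levels_per_synset):
    """Merge the counts of synsets that map to the same generalized synset.

    Summing counts is an assumption of this sketch, chosen so that no
    occurrence is discarded when features are grouped.
    """
    reduced = Counter()
    for synset, count in counts.items():
        target = generalize(synset, levels_per_synset.get(synset, 0))
        reduced[target] += count
    return reduced


# One document represented as synset counts (made-up values).
document = Counter({"Viagra": 2, "Cialis": 1, "meeting": 1})

# Generalization depth per synset; synsets not listed stay as they are.
levels = {"Viagra": 1, "Cialis": 1}

reduced = reduce_features(document, levels)
print(dict(reduced))  # {'anti-impotence drug': 3, 'meeting': 1}
```

With one level of generalization, the two drug names collapse into a single feature while the total count is preserved, so three features become two. Raising both depths to 2 or 3 would instead produce “drug” or “chemical substance”, matching the example given in the abstract.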

Keywords

Spam filtering, Token-based representation, Synset-based representation, Semantic-based feature reduction, Multi-objective evolutionary algorithms

Fields of Science and Technology Classification

Computer and Information Sciences - Natural Sciences

Funding Records

Funding Reference	Funding Entity
UIDP/04466/2020	Fundação para a Ciência e a Tecnologia
UIDB/04466/2020	Fundação para a Ciência e a Tecnologia

Publication Identifiers

WoS (source: Ciência_Iscte)	WOS:000531082800020
Scopus (source: author)	2-s2.0-85081988881
DOI (source: author)	10.1016/j.ipm.2020.102249
Scopus (source: Ciência_Iscte)	2-s2.0-85081988881
Handle (source: Ciência-IUL)	http://hdl.handle.net/10071/20406
Ciência_Iscte ID	ci-pub-70824

Other Publication Details

Online Publication Year	2020
Publisher	Elsevier
Indexing	Web of Science®; Scopus
ISSN	0306-4573 (print) 1873-5371 (online)
ISBN	--
Impact Factor	--
Volume	57	Issue	4
Series
Article Number	102249
Pages	--
Peer Reviewed	Yes
Publication Date (online)	2020-03-21
Publication Date (print)
