
[NLP] Korean Hate Speech Detection: Detecting Toxic Korean News Comments

by daewooki 2021. 7. 17.

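While researching toxic comment detection, I read the related paper and built a toy project on the data it provides.

The paper's authors released Korean entertainment news comments on Kaggle as training data and are running a competition to classify comments as toxic or non-toxic.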

 

 

BEEP! Korean Corpus of Online News Comments for Toxic Speech Detection

Paper link: https://arxiv.org/abs/2005.12503

 


According to the paper, the authors opened three competitions on Kaggle.

1. www.kaggle.com/c/korean-gender-bias-detection

2. www.kaggle.com/c/korean-bias-detection

3. www.kaggle.com/c/korean-hate-speech-detection

 

This project uses the data from the third competition.

Kaggle link: https://www.kaggle.com/c/korean-hate-speech-detection/data

 


 

 


Data Description

Overview

The task is to identify hate speech in Korean entertainment news comments. Specifically, use train.hate.csv and dev.hate.csv, which provide a hate label for each comment.

Your job is to train a model and predict the labels of the comments in test.hate.no_label.csv.

We also include {train, dev, test}.news_title.txt, to inform where the comments are from.

Files

  • train.hate.csv: the training set
  • dev.hate.csv: the validation set
  • test.hate.no_label.csv: the test set (w/o label)
  • train.news_title.txt: article titles of comments in the training set
  • dev.news_title.txt: article titles of comments in the validation set
  • test.news_title.txt: article titles of comments in the test set
  • unlabeled_comments.txt: comments without the label
  • unlabeled_comments.news_title.txt: article titles of comments without the label

Fields

  • comments (str) : news comments
  • label (str) : hate label
    • none, offensive, hate
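
As a rough illustration of the fields above, here is a minimal sketch that reads a file with `comments` and `label` columns and maps the three label values to integer ids. It assumes the CSV is comma-separated with a header row; the function name `read_hate_csv`, the id ordering, and the sample rows are made up for illustration and are not taken from the corpus.

```python
import csv
import io
from collections import Counter

# The three label values listed under "Fields"; the id order is an arbitrary choice.
LABELS = ["none", "offensive", "hate"]
LABEL2ID = {name: i for i, name in enumerate(LABELS)}


def read_hate_csv(handle):
    """Return (comments, label_ids) from a file with `comments` and `label` columns."""
    comments, label_ids = [], []
    for row in csv.DictReader(handle):
        comments.append(row["comments"])
        label_ids.append(LABEL2ID[row["label"]])
    return comments, label_ids


# Placeholder rows standing in for train.hate.csv; these are NOT real corpus entries.
SAMPLE = io.StringIO(
    "comments,label\n"
    "placeholder comment A,none\n"
    "placeholder comment B,offensive\n"
    "\"placeholder comment C, with a comma\",hate\n"
)

comments, label_ids = read_hate_csv(SAMPLE)

# Quick look at the class balance before training.
print(Counter(LABELS[i] for i in label_ids))
```

With the real data, the same function could be called as `read_hate_csv(open("train.hate.csv", encoding="utf-8", newline=""))`, assuming the file has been downloaded from the Kaggle page into the working directory.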

 

Models used

- KcBERT

- KoBERT
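
The post does not show how the models were set up, so the following is only a sketch of one possible way to prepare KcBERT as a three-class classifier with the Hugging Face `transformers` library. The Hub id `beomi/kcbert-base`, the helper names `load_classifier` and `encode`, and `max_length=128` are assumptions rather than details from this project. KoBERT is left out because how it is loaded depends on which distribution is used.

```python
# Label order follows the "Fields" list; the ids themselves are an arbitrary choice.
LABELS = ["none", "offensive", "hate"]
ID2LABEL = dict(enumerate(LABELS))
LABEL2ID = {name: i for i, name in ID2LABEL.items()}

# Assumed Hugging Face Hub id for KcBERT; verify it before use.
KCBERT_ID = "beomi/kcbert-base"


def load_classifier(model_id=KCBERT_ID):
    """Load a tokenizer and a 3-way sequence classifier (downloads weights)."""
    # Imported lazily so the constants above work without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=len(LABELS),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    return tokenizer, model


def encode(tokenizer, comments, max_length=128):
    """Tokenize a batch of comments into padded PyTorch tensors."""
    return tokenizer(
        comments,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
```

Calling `load_classifier()` needs network access plus `transformers` and PyTorch installed. The classification head it returns is freshly initialized, so it still has to be fine-tuned on train.hate.csv before its predictions mean anything.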
