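While researching toxic comment detection, I read a related paper and ran a toy project on the data it provides.
The paper's authors share Korean entertainment news comments as training data on Kaggle, where they are running a competition to classify comments as toxic or non-toxic.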
BEEP! Korean Corpus of Online News Comments for Toxic Speech Detection
Paper link : https://arxiv.org/abs/2005.12503
According to the paper, the authors have opened three competitions on Kaggle.
1. www.kaggle.com/c/korean-gender-bias-detection
2. www.kaggle.com/c/korean-bias-detection
3. www.kaggle.com/c/korean-hate-speech-detection
This project uses the data from the third competition.
Kaggle link : https://www.kaggle.com/c/korean-hate-speech-detection/data
Data Description
Overview
The task is to identify hate speech in Korean entertainment news comments. Specifically, use train.hate.csv and dev.hate.csv which are provided a hate label for each comment.
Your job is to train a model and predict label of comments in test.hate.no_label.csv.
We also include {train, dev, test}.news_title.txt, to inform where the comments are from.
Files
- train.hate.csv: the training set
- dev.hate.csv: the validation set
- test.hate.no_label.csv: the test set (w/o label)
- train.news_title.txt: article titles of comments in the training set
- dev.news_title.txt: article titles of comments in the validation set
- test.news_title.txt: article titles of comments in the test set
- unlabeled_comments.txt: comments without the label
- unlabeled_comments.news_title.txt: article titles of comments without the label
Fields
- comments (str) : news comments
- label (str) : hate label, one of none, offensive, hate
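The file layout described above can be explored with a short loader. The following is a minimal sketch rather than the competition's official loading code: it assumes the labeled files have a header row with `comments` and `label` columns and use a comma delimiter (change `delimiter` if the actual files turn out to be tab-separated). The helper names `load_labeled` and `label_distribution` are hypothetical, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter

# Label set taken from the Fields description above.
LABELS = ("none", "offensive", "hate")


def load_labeled(handle):
    """Read (comment, label) pairs from an open CSV file handle.

    Assumes a header row with `comments` and `label` columns and a comma
    delimiter; both are assumptions about the file format.
    """
    rows = []
    for record in csv.DictReader(handle, delimiter=","):
        label = record["label"].strip()
        if label not in LABELS:
            raise ValueError(f"unexpected label: {label!r}")
        rows.append((record["comments"], label))
    return rows


def label_distribution(rows):
    """Count how many comments carry each label."""
    return Counter(label for _, label in rows)


if __name__ == "__main__":
    # Tiny in-memory stand-in for train.hate.csv (hypothetical rows).
    sample = io.StringIO(
        "comments,label\n"
        "first comment,none\n"
        "second comment,offensive\n"
        "third comment,hate\n"
        "fourth comment,none\n"
    )
    print(label_distribution(load_labeled(sample)))
```

For the real data, the same function could be called as `load_labeled(open("train.hate.csv", encoding="utf-8", newline=""))`. Checking the label distribution first is useful because a three-class task like this may be imbalanced, which affects the choice of evaluation metric.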
Models used
- KcBERT
- KoBERT
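As a rough illustration of how one of these models could be set up for the three-class task, here is a sketch using the Hugging Face `transformers` library. It is not the code used in this project. The checkpoint name `beomi/kcbert-base` is an assumption about where KcBERT is hosted, the helper names `encode_labels`, `build_classifier`, and `predict` are hypothetical, and the label-to-id order is an arbitrary choice.

```python
# Label mapping for the three classes; the id order is an arbitrary choice.
LABEL2ID = {"none": 0, "offensive": 1, "hate": 2}
ID2LABEL = {idx: label for label, idx in LABEL2ID.items()}

# Assumed Hugging Face hub id for KcBERT; verify before use.
MODEL_NAME = "beomi/kcbert-base"


def encode_labels(labels):
    """Convert string labels into the integer ids a classifier head expects."""
    return [LABEL2ID[label] for label in labels]


def build_classifier(model_name=MODEL_NAME):
    """Load a tokenizer and a BERT model with a fresh 3-way classification head."""
    # Imported lazily so the label helpers work without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(LABEL2ID),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    return tokenizer, model


def predict(tokenizer, model, comments, max_length=128):
    """Return a predicted label string for each comment in the list."""
    import torch

    batch = tokenizer(
        comments,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    model.eval()
    with torch.no_grad():
        logits = model(**batch).logits
    return [ID2LABEL[i] for i in logits.argmax(dim=-1).tolist()]
```

The classification head created by `build_classifier` is randomly initialized, so `predict` only gives meaningful output after fine-tuning on train.hate.csv. Trying KoBERT instead would mean changing the checkpoint name and possibly the tokenizer loading, which is not shown here.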