{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [
 "# Sarcasm Detection\n",
 "Online activity has grown into a major force, and the social media platform Twitter is one example. Twitter has become a primary channel for gathering responses from around the world. Most Indonesian users are active and expressive in their tweets, so their opinions can contain bitter irony, reproach, and even positive words used to express a negative opinion. A sarcasm detector is therefore built to help recognize whether a tweet is sarcastic or not. \n",
 "\n",
 "Sarcasm detection belongs to sentiment analysis, a part of Natural Language Processing (NLP) concerned with finding the intent behind an opinion in a text. Building a sarcasm detection system involves several stages.\n"
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "## Dataset\n",
 "The dataset was collected independently from Twitter through the Twitter API, using the Tweepy library. The collected data totals 600 tweets. Each tweet was labeled with the help of an Indonesian language teacher who is an expert in the field."
 ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [
 " Tweet Label\n",
 "0 @mopuci @collegemenfess Mjb, indogsat 100k 50g... Sarkasme\n",
 "1 ini jaringan pas wiken knp kenceng pakee bgttt... Sarkasme\n",
 "2 @MNCPlayID sudah lebih dari 16 jam layanan int... BukanSarkas\n",
 "3 ⁦@IndiHome⁩ lapor di daerah sukatani tapos dep... BukanSarkas\n",
 "4 indosat ngntd kmpa si, ada jaringan tpi browse... BukanSarkas\n",
 ".. ... ...\n",
 "595 Kirain jam segini jaringan baguss BukanSarkas\n",
 "596 @gengysi w kirain jaringan lemot, ternyata....🤣🤣😝 BukanSarkas\n",
 "597 gedek bgt gue, jaringan tiba² ilang pas lagi u... BukanSarkas\n",
 "598 @FirstMediaCares Tidak ada jaringan min dari j... BukanSarkas\n",
 "599 Akhirnya bisa merasakan jaringan internet lagi 😭 BukanSarkas\n",
 "\n",
 "[600 rows x 2 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [
 "import pandas as pd\n",
 "import numpy as np\n",
 "\n",
 "namafile=\"#tweet_label.xlsx\"\n",
 "DataTweet=pd.read_excel(namafile)\n",
 "DataTweet"
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "**The number of tweets labeled Sarkasme and BukanSarkas is shown below.**"
 ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [
 "import matplotlib.pyplot as plt\n",
 "\n",
 "jumlahData=DataTweet[\"Tweet\"].shape[0] # total number of tweets\n",
 "jmlSarkas=DataTweet[DataTweet.Label==\"Sarkasme\"].shape[0]\n",
 "jmlNonSarkas=DataTweet[DataTweet.Label==\"BukanSarkas\"].shape[0]\n",
 "d={'Jumlah': [jumlahData], 'Sarkasme': [jmlSarkas],'BukanSarkas':[jmlNonSarkas]}\n",
 "\n",
 "cekJumlah = pd.DataFrame(data=d, columns=['Sarkasme','BukanSarkas'])\n",
 "# creating the dataset\n",
 "data = {'Sarkasme':jmlSarkas, 'BukanSarkas':jmlNonSarkas}\n",
 "courses = list(data.keys())\n",
 "values = list(data.values())\n",
 "fig = plt.figure(figsize = (8, 5))\n",
 "\n",
 "# creating the bar plot\n",
 "# colors are passed as a list: a set has no defined order, so the bar colors could swap\n",
 "plt.bar(courses, values, color =['orange','blue'] ,width = 0.7)\n",
 "plt.xlabel(\"Label\")\n",
 "plt.ylabel(\"Jumlah\")\n",
 "plt.title(\"Dataset Sarkasme dan Bukan. Total Data = %s\" %jumlahData)\n",
 "for x, y in enumerate(values):\n",
 "    plt.text(x , y + 2, str(y),color = 'black', fontweight = 'bold')\n",
 "plt.show()"
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "## Import Python Libraries\n",
 "Import all the libraries that are needed."
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [
 "import re,string\n",
 "from sklearn.feature_extraction.text import TfidfVectorizer\n",
 "from sklearn.svm import SVC\n",
 "from sklearn.ensemble import AdaBoostClassifier\n",
 "from sklearn.model_selection import KFold\n",
 "from sklearn import model_selection\n",
 "from sklearn.metrics import accuracy_score\n",
 "from imblearn.over_sampling import SMOTE\n",
 "from collections import Counter\n",
 "from nltk.tokenize import word_tokenize\n",
 "from nltk.corpus import stopwords\n",
 "from Sastrawi.Stemmer.StemmerFactory import StemmerFactory\n",
 "import matplotlib.pyplot as plt\n",
 "from timeit import default_timer as timer\n",
 "import pickle"
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "## Text Preprocessing\n",
 "\n",
 "Text is unstructured data, so it has to be converted into a structured form. That conversion process is called text preprocessing.\n",
 "\n",
 "**Functions I wrote for text preprocessing**"
 ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [
 "def lower(text):\n",
 "    # lowercase\n",
 "    lower = text.lower()\n",
 "    return lower\n",
 "\n",
 "def removeURLemoji(text):\n",
 "    # remove hashtags, mentions, and the RT marker\n",
 "    HastagRT = re.sub(r\"#(\\w+)|@(\\w+)|(\\brt\\b)\", \" \", text)\n",
 "    # remove URLs\n",
 "    pola_url = r'http\\S+'\n",
 "    CleanURL = re.sub(pola_url, \" \", HastagRT)\n",
 "    # remove emoticons\n",
 "    hps_emoji = hapus_emoticon(CleanURL)\n",
 "    # collapse repeated whitespace\n",
 "    text = re.sub(r'\\s+', ' ', hps_emoji)\n",
 "    # final cleaning result\n",
 "    hasil = text\n",
 "    return hasil\n",
 "\n",
 "def angkadua(teksAwal2):\n",
 "    # expand reduplication written with the digit 2, e.g. tiba2 -> tiba tiba\n",
 "    final2 = []\n",
 "    huruf2 = \"\"\n",
 "    for x in range(len(teksAwal2)):\n",
 "        cek2 = [i for i in teksAwal2[x]]\n",
 "        for x in range(len(cek2)):\n",
 "            if x == 0:\n",
 "                final2.append(cek2[0])\n",
 "                huruf2 = cek2[0]\n",
 "            else:\n",
 "                if cek2[x] != huruf2:\n",
 "                    if cek2[x] == \"2\":\n",
 "                        if(len(final2)) == 2:\n",
 "                            final2.append(cek2[x-2])\n",
 "                            final2.append(cek2[x-1])\n",
 "                            huruf2 = cek2[x]\n",
 "                        elif(len(final2) > 2):\n",
 "                            jo = \"\".join(cek2[:2])\n",
 "                            if(jo == \"se\" or jo == \"di\"):\n",
 "                                final2.append(\" \")\n",
 "                                final2 = final2+cek2[2:x]\n",
 "                                huruf2 = cek2[x]\n",
 "                            else:\n",
 "                                final2.append(\" \")\n",
 "                                final2 = final2+cek2[:x]\n",
 "                                huruf2 = cek2[x]\n",
 "                        else:\n",
 "                            final2.append(cek2[x])\n",
 "                            huruf2 = cek2[x]\n",
 "                    else:\n",
 "                        final2.append(cek2[x])\n",
 "                        huruf2 = cek2[x]\n",
 "                else:\n",
 "                    final2.append(cek2[x])\n",
 "                    huruf2 = cek2[x]\n",
 "        final2.append(\" \")\n",
 "    hasil = \"\".join(final2).split()\n",
 "    return hasil\n",
 "\n",
 "\n",
 "def hapus_hurufganda(teksAwal):\n",
 "    # keep at most two consecutive copies of a character, e.g. bgttt -> bgtt\n",
 "    jml = 0\n",
 "\n",
 "    final = []\n",
 "    huruf = \"\"\n",
 "    for x in range(len(teksAwal)):\n",
 "        cek = [i for i in teksAwal[x]]\n",
 "        for x in range(len(cek)):\n",
 "            if x == 0:\n",
 "                final.append(cek[0])\n",
 "                huruf = cek[0]\n",
 "                jml = 1\n",
 "            else:\n",
 "                if cek[x] != huruf:\n",
 "                    final.append(cek[x])\n",
 "                    huruf = cek[x]\n",
 "                    jml = 1\n",
 "                else:\n",
 "                    if jml < 2:\n",
 "                        final.append(cek[x])\n",
 "                        huruf = cek[x]\n",
 "                        jml += 1\n",
 "        final.append(\" \")\n",
 "    hasil = \"\".join(final).split()\n",
 "    return hasil\n",
 "\n",
 "\n",
 "def hapus_simbolAngka(text):\n",
 "    del_angkadua = angkadua(text)\n",
 "    del_hrfganda = hapus_hurufganda(del_angkadua)\n",
 "\n",
 "    token = del_hrfganda\n",
 "    # network terms that keep their digit\n",
 "    lte = [\"2g\", \"3g\", \"4g\", \"5g\"]\n",
 "    for i in range(len(token)):\n",
 "        if(token[i] not in lte):\n",
 "            token[i] = re.sub(r\"\\d+\", \" \", token[i])\n",
 "\n",
 "    for ele in range(len(token)):\n",
 "        token[ele] = token[ele].translate(\n",
 "            str.maketrans('', '', string.punctuation))\n",
 "        token[ele] = re.sub(r'\\W', \"\", token[ele])\n",
 "        token[ele] = re.sub(r'\\s+', \"\", token[ele])\n",
 "\n",
 "    return token\n",
 "\n",
 "\n",
 "def hapus_simbolAngka2(text):\n",
 "    # alternative to hapus_simbolAngka, rewritten to work on a list of tokens\n",
 "    token = list(text)\n",
 "    # punctuation characters to strip\n",
 "    punc = r'''!()-[]{};:'\"\\,<>./?@#$%^&*_~'''\n",
 "    for i in range(len(token)):\n",
 "        cekG = re.match(r\"([\\b234]+g)\", token[i])\n",
 "        if cekG is None:\n",
 "            token[i] = re.sub(r\"\\d+\", \"\", token[i])\n",
 "        # replace each punctuation character with a space\n",
 "        for ch in punc:\n",
 "            token[i] = token[i].replace(ch, \" \")\n",
 "        token[i] = re.sub(r'\\s+', ' ', token[i])\n",
 "    return token\n",
 "\n",
 "\n",
 "def hapus_emoticon(text):\n",
 "    emoji_pattern = re.compile(\"[\"\n",
 "                               u\"\\U0001F600-\\U0001F64F\"  # emoticons\n",
 "                               u\"\\U0001F300-\\U0001F5FF\"  # symbols & pictographs\n",
 "                               u\"\\U0001F680-\\U0001F6FF\"  # transport & map symbols\n",
 "                               u\"\\U0001F1E0-\\U0001F1FF\"  # flags (iOS)\n",
 "                               u\"\\U00002500-\\U00002BEF\"  # box drawing, shapes, misc symbols, arrows\n",
 "                               u\"\\U00002702-\\U000027B0\"\n",
 "                               u\"\\U000024C2-\\U0001F251\"\n",
 "                               u\"\\U0001f926-\\U0001f937\"\n",
 "                               u\"\\U00010000-\\U0010ffff\"\n",
 "                               u\"\\u2640-\\u2642\"\n",
 "                               u\"\\u2600-\\u2B55\"\n",
 "                               u\"\\u200d\"\n",
 "                               u\"\\u23cf\"\n",
 "                               u\"\\u23e9\"\n",
 "                               u\"\\u231a\"\n",
 "                               u\"\\ufe0f\"  # variation selector-16\n",
 "                               u\"\\u3030\"\n",
 "                               \"]+\", flags=re.UNICODE)\n",
 "    # remove emoji\n",
 "    CleanEmoji = re.sub(emoji_pattern, \"\", text)\n",
 "    return CleanEmoji\n",
 "\n",
 "\n",
 "def tokenize(kalimat):\n",
 "    return word_tokenize(kalimat)\n",
 "\n",
 "\n",
 "def listokalimat(kalimat):\n",
 "    listToStr = ' '.join(kalimat)\n",
 "    return listToStr\n",
 "\n",
 "\n",
 "def delstopwordID(teks):\n",
 "    notsinglechar=[]\n",
 "    for kata in teks:\n",
 "        # blank out single-letter tokens\n",
 "        a = re.sub(r\"\\b[a-zA-Z]\\b\", \" \", kata)\n",
 "        if(a!=\" \"):\n",
 "            notsinglechar.append(a)\n",
 "    return [kata for kata in notsinglechar if kata not in list_stopwords]\n",
 "\n",
 "\n",
 "def daftarStopword():\n",
 "    list_stopwords = stopwords.words('indonesian')\n",
 "    # read the additional stopwords; the with block closes the file\n",
 "    with open(\"_stopwordTambahan.txt\", \"r\") as my_file:\n",
 "        tambahan = my_file.read()\n",
 "    daftar = tambahan.replace('\\n', ' ').split()\n",
 "    list_stopwords.extend(daftar)\n",
 "    list_stopwords = set(list_stopwords)\n",
 "    return list_stopwords\n",
 "\n",
 "\n",
 "def normal_term():\n",
 "    normalisasi_word = pd.read_excel(\"_normalisasi.xlsx\")\n",
 "    normalisasi_dict = {}\n",
 "    for index, row in normalisasi_word.iterrows():\n",
 "        # map column 0 (term as written) to column 1 (its replacement)\n",
 "        if row.iloc[0] not in normalisasi_dict:\n",
 "            normalisasi_dict[row.iloc[0]] = row.iloc[1]\n",
 "    return normalisasi_dict\n",
 "\n",
 "\n",
 "def normalisasi(document):\n",
 "    # note: terms are replaced in place, so the input list is modified too\n",
 "    kalimat = document\n",
 "    for term in range(len(kalimat)):\n",
 "        if kalimat[term] in normalisasi_dict:\n",
 "            kalimat[term] = normalisasi_dict[kalimat[term]]\n",
 "    hasil = \" \".join(kalimat).split()\n",
 "    return hasil\n",
 "\n",
 "\n",
 "def stemming(kalimat):\n",
 "    # collect the unique terms so each one is stemmed only once\n",
 "    term_dict = {}\n",
 "    for term in kalimat:\n",
 "        if term not in term_dict:\n",
 "            term_dict[term] = \" \"\n",
 "    temp = list(term_dict)\n",
 "    for x in range(len(temp)):\n",
 "        if temp[x] == \"jaringan\":\n",
 "            term_dict[temp[x]] = temp[x]\n",
 "        # bounds check avoids an IndexError when teh is the last term\n",
 "        elif temp[x] == \"teh\" and x + 1 < len(temp) and temp[x+1] == \"anget\":\n",
 "            term_dict[temp[x]] = temp[x]\n",
 "        else:\n",
 "            term_dict[temp[x]] = stemmer.stem(temp[x])\n",
 "    kalimat = [term_dict[term] for term in kalimat]\n",
 "    return kalimat\n",
 "\n",
 "\n",
 "list_stopwords = daftarStopword()\n",
 "term_dict = {}\n",
 "factory = StemmerFactory()\n",
 "stemmer = factory.create_stemmer()\n",
 "normalisasi_dict = normal_term()\n"
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "***Text preprocessing consists of several stages:***\n",
 "1. Casefolding\n",
 "Converts every letter in the text document to lowercase.\n",
 "2. Cleaning\n",
 "Removes characters that are not needed, such as numbers, URLs, and unused symbols, to reduce noise.\n",
 "3. Tokenization\n",
 "Tokenization essentially means splitting a sentence into pieces, words or phrases, called tokens.\n",
 "4. 
Normalization\n",
 "Converts non-standard or ambiguous words, and words with repeated letters, into their standard form. For example, “manisss” becomes “manis”.\n",
 "5. Slang word conversion\n",
 "Converts slang words into standard words according to the KBBI dictionary. Indonesians on Twitter often write slang in their messages, so converting it requires a list of the slang words that commonly appear in such text. For example, “gw” becomes “saya”, “napa” becomes “kenapa”, and so on.\n",
 "\n",
 "### Casefolding and Cleaning\n",
 "In Indonesian-language text messages on social media, people often use non-standard words rather than standard ones, for example numbers in place of letters, abbreviations, repeated characters, and slang. Preprocessing is therefore applied first to make the text more structured. Unneeded punctuation and any URLs in the text are also removed."
 ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "##-------- Mulai Proses Preprocessing --------##\n", "\n", "\n", "...... Proses Casefolding lowercase, hapus URL...... \n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " Tweet Label \\\n", "0 @mopuci @collegemenfess Mjb, indogsat 100k 50g... Sarkasme \n", "1 ini jaringan pas wiken knp kenceng pakee bgttt... Sarkasme \n", "2 @MNCPlayID sudah lebih dari 16 jam layanan int... BukanSarkas \n", "3 ⁦@IndiHome⁩ lapor di daerah sukatani tapos dep... BukanSarkas \n", "4 indosat ngntd kmpa si, ada jaringan tpi browse... BukanSarkas \n", ".. ... ... \n", "595 Kirain jam segini jaringan baguss BukanSarkas \n", "596 @gengysi w kirain jaringan lemot, ternyata....🤣🤣😝 BukanSarkas \n", "597 gedek bgt gue, jaringan tiba² ilang pas lagi u... BukanSarkas \n", "598 @FirstMediaCares Tidak ada jaringan min dari j... BukanSarkas \n", "599 Akhirnya bisa merasakan jaringan internet lagi 😭 BukanSarkas \n", "\n", " casefolding1 \n", "0 mjb, indogsat 100k 50gb. jaringan lebih ok ju... \n", "1 ini jaringan pas wiken knp kenceng pakee bgttt... \n", "2 sudah lebih dari 16 jam layanan internet dan ... \n", "3 ⁦ ⁩ lapor di daerah sukatani tapos depok serin... \n", "4 indosat ngntd kmpa si, ada jaringan tpi browse... \n", ".. ... \n", "595 kirain jam segini jaringan baguss \n", "596 w kirain jaringan lemot, ternyata.... \n", "597 gedek bgt gue, jaringan tiba² ilang pas lagi u... 
\n", "598 tidak ada jaringan min dari jam 12 offline \n", "599 akhirnya bisa merasakan jaringan internet lagi \n", "\n", "[600 rows x 3 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [
 "def lowRemoveURL(text):\n",
 "    # lowercase\n",
 "    lower=text.lower()\n",
 "    # remove hashtags, mentions, and the RT marker\n",
 "    HastagRT=re.sub(r\"#(\\w+)|@(\\w+)|(\\brt\\b)\",\" \", lower)\n",
 "    # remove URLs\n",
 "    pola_url = r'http\\S+'\n",
 "    CleanURL=re.sub(pola_url,\" \", HastagRT)\n",
 "    # remove emoticons\n",
 "    hps_emoji=hapus_emoticon(CleanURL)\n",
 "    # collapse repeated whitespace\n",
 "    text = re.sub(r'\\s+',' ',hps_emoji)\n",
 "    # final casefolding result\n",
 "    hasil=text\n",
 "    return hasil\n",
 "\n",
 "#============== Start Processing Text\n",
 "print(\"\\n##-------- Mulai Proses Preprocessing --------##\\n\")\n",
 "print('\\n...... Proses Casefolding lowercase, hapus URL...... ')\n",
 "DataTweet['casefolding1'] = DataTweet['Tweet'].apply(lowRemoveURL)\n",
 "DataTweet"
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "### Tokenization\n",
 "Tokenization is the preprocessing stage that splits each text in the dataset into pieces called tokens."
 ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "...... Tokenisasi ...... \n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " casefolding1 \\\n", "0 mjb, indogsat 100k 50gb. jaringan lebih ok ju... \n", "1 ini jaringan pas wiken knp kenceng pakee bgttt... \n", "2 sudah lebih dari 16 jam layanan internet dan ... \n", "3 ⁦ ⁩ lapor di daerah sukatani tapos depok serin... \n", "4 indosat ngntd kmpa si, ada jaringan tpi browse... \n", "5 males mau bobo lagi jaringan ngajak kelahi \n", "6 hawa dingin habis hujan ditambah jaringan lem... \n", "7 jaringan aku dsini lemot banget, padahal masi... \n", "8 jaringan ama web gw kenapa si ngeselin banget,... \n", "9 halo byu, sudah hampir 3 minggu ini sinyal da... \n", "10 hahah bru kekirim pdhl ngetweet td malem emang... \n", "11 giliran ujian laptop sama jaringan tiba2 lemot... \n", "12 jaringan kyk anyinggggg robohin aj towernya ck \n", "\n", " Tokenisasi \n", "0 [mjb, ,, indogsat, 100k, 50gb, ., jaringan, le... \n", "1 [ini, jaringan, pas, wiken, knp, kenceng, pake... \n", "2 [sudah, lebih, dari, 16, jam, layanan, interne... \n", "3 [⁦, ⁩, lapor, di, daerah, sukatani, tapos, dep... \n", "4 [indosat, ngntd, kmpa, si, ,, ada, jaringan, t... \n", "5 [males, mau, bobo, lagi, jaringan, ngajak, kel... \n", "6 [hawa, dingin, habis, hujan, ditambah, jaringa... \n", "7 [jaringan, aku, dsini, lemot, banget, ,, padah... \n", "8 [jaringan, ama, web, gw, kenapa, si, ngeselin,... \n", "9 [halo, byu, ,, sudah, hampir, 3, minggu, ini, ... \n", "10 [hahah, bru, kekirim, pdhl, ngetweet, td, male... \n", "11 [giliran, ujian, laptop, sama, jaringan, tiba2... \n", "12 [jaringan, kyk, anyinggggg, robohin, aj, tower... " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#==== Tokenisasi : memisahkan kata dalam kalimat\n", "print('\\n...... Tokenisasi ...... 
')\n", "DataTweet['Tokenisasi'] = DataTweet['casefolding1'].apply(tokenize)\n", "DataTweet[['casefolding1','Tokenisasi']].head(13)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "...... Proses Casefolding2 hapus angka dan simbol...... \n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [
 " casefolding2\n",
 "0 [mjb, , indogsat, k, gb, , jaringan, lebih, ok...\n",
 "1 [ini, jaringan, pas, wiken, knp, kenceng, pake...\n",
 "2 [sudah, lebih, dari, , jam, layanan, internet,...\n",
 "3 [, , lapor, di, daerah, sukatani, tapos, depok...\n",
 "4 [indosat, ngntd, kmpa, si, , ada, jaringan, tp...\n",
 "5 [males, mau, bobo, lagi, jaringan, ngajak, kel...\n",
 "6 [hawa, dingin, habis, hujan, ditambah, jaringa...\n",
 "7 [jaringan, aku, dsini, lemot, banget, , padaha...\n",
 "8 [jaringan, ama, web, gw, kenapa, si, ngeselin,...\n",
 "9 [halo, byu, , sudah, hampir, , minggu, ini, si...\n",
 "10 [hahah, bru, kekirim, pdhl, ngetweet, td, male...\n",
 "11 [giliran, ujian, laptop, sama, jaringan, tiba,..." ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [
 "print('\\n...... Proses Casefolding2 hapus angka dan simbol...... ')\n",
 "DataTweet['casefolding2'] = DataTweet['Tokenisasi'].apply(hapus_simbolAngka)\n",
 "DataTweet[['casefolding2']].head(12)"
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "### Normalization\n",
 "On Twitter, Indonesians often write slang and non-standard words, and misspellings are common. During text preprocessing these words are converted into their standard KBBI form; this step is called normalization."
 ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "...... Proses Normalisasi ...... \n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " casefolding2 \\\n", "0 [maaf gabung bareng, , indosat, ribu, gigabyte... \n", "1 [ini, jaringan, waktu, minggu, kenapa, cepat, ... \n", "2 [sudah, lebih, dari, , jam, layanan, internet,... \n", "3 [, , lapor, di, daerah, sukatani, tapos, depok... \n", "4 [indosat, ngentot, kenapa, sih, , ada, jaringa... \n", "5 [malas, mau, tidur, lagi, jaringan, ngajak, ke... \n", "6 [hawa, dingin, habis, hujan, ditambah, jaringa... \n", "7 [jaringan, aku, disini, lambat, banget, , pada... \n", "8 [jaringan, sama, web, aku, kenapa, sih, ngesel... \n", "9 [halo, telkomsel, , sudah, hampir, , minggu, i... \n", "10 [haha, baru, kirim, padahal, tweet, tadi, mala... \n", "11 [giliran, ujian, laptop, sama, jaringan, tiba,... \n", "\n", " Normalisasi \n", "0 [maaf, gabung, bareng, indosat, ribu, gigabyte... \n", "1 [ini, jaringan, waktu, minggu, kenapa, cepat, ... \n", "2 [sudah, lebih, dari, jam, layanan, internet, d... \n", "3 [lapor, di, daerah, sukatani, tapos, depok, se... \n", "4 [indosat, ngentot, kenapa, sih, ada, jaringan,... \n", "5 [malas, mau, tidur, lagi, jaringan, ngajak, ke... \n", "6 [hawa, dingin, habis, hujan, ditambah, jaringa... \n", "7 [jaringan, aku, disini, lambat, banget, padaha... \n", "8 [jaringan, sama, web, aku, kenapa, sih, ngesel... \n", "9 [halo, telkomsel, sudah, hampir, minggu, ini, ... \n", "10 [haha, baru, kirim, padahal, tweet, tadi, mala... \n", "11 [giliran, ujian, laptop, sama, jaringan, tiba,... " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#============== Normalisasi: kata gaul, singkatan jadi kata baku\n", "print('\\n...... Proses Normalisasi ...... 
')\n",
 "DataTweet['Normalisasi'] = DataTweet['casefolding2'].apply(normalisasi)\n",
 "DataTweet[['casefolding2','Normalisasi']].head(12)"
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "### Stopword Removal\n",
 "This stage removes words that are used so often that they appear in large numbers, as well as words considered to carry no meaning, such as pronouns, conjunctions, and meaningless single-letter tokens."
 ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "...... Proses Stopword Removal ...... \n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [
 " Stopword\n",
 "0 [maaf, gabung, bareng, indosat, ribu, gigabyte...\n",
 "1 [jaringan, minggu, cepat, pakai, banget, oi, g...\n",
 "2 [jam, layanan, internet, televisi, jaringan]\n",
 "3 [lapor, daerah, sukatani, tapos, depok, indiho...\n",
 "4 [indosat, ngentot, sih, jaringan, browser, jalan]\n",
 "5 [malas, tidur, jaringan, ngajak, kelahi]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [
 "#==== Stopword Removal : remove words that carry little information\n",
 "print('\\n...... Proses Stopword Removal ...... ')\n",
 "DataTweet['Stopword'] = DataTweet['Normalisasi'].apply(delstopwordID)\n",
 "DataTweet[['Stopword']].head(6)"
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "## Stemming\n",
 "Stemming is the preprocessing stage that reduces a word to its root or base form. It does so by stripping all affixes: prefixes, infixes, and suffixes."
 ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "................ Proses Stemming ................ 
\n", "0 [maaf, gabung, bareng, indosat, ribu, gigabyte...\n", "1 [jaringan, minggu, cepat, pakai, banget, oi, g...\n", "2 [jam, layan, internet, televisi, jaringan]\n", "Name: Stemmed, dtype: object\n", "\n", "==========\n" ] }, { "data": { "text/plain": [
 "0 maaf gabung bareng indosat ribu gigabyte jarin...\n",
 "1 jaringan minggu cepat pakai banget oi gilir ku...\n",
 "2 jam layan internet televisi jaringan\n",
 "3 lapor daerah sukatani tapos depok indihome min...\n",
 "4 indosat ngentot sih jaringan browser jalan\n",
 "5 malas tidur jaringan ngajak kelahi\n",
 "6 hawa dingin habis hujan tambah jaringan lambat...\n",
 "7 jaringan lambat banget pagi\n",
 "8 jaringan web sih ngeselin banget coba coba\n",
 "9 halo telkomsel minggu sinyal jaringan internet...\n",
 "10 haha kirim tweet malam emang anjing jaringan\n",
 "11 gilir uji laptop jaringan lambat anjing\n",
 "Name: newTweet, dtype: object" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [
 "#==== Stemming : reduce the dimensionality of the term features\n",
 "print('\\n................ Proses Stemming ................ ')\n",
 "DataTweet['Stemmed'] = DataTweet['Stopword'].apply(stemming)\n",
 "print(DataTweet['Stemmed'].head(3))\n",
 "DataTweet['newTweet'] = DataTweet['Stemmed'].apply(listokalimat)\n",
 "print('\\n==========')\n",
 "DataTweet['newTweet'].head(12)"
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "## Vectorization (TF-IDF)\n",
 "\n",
 "Once text preprocessing is finished, the data still consists of word tokens, which a machine cannot process directly in the classification stage. A vectorization step is therefore needed to turn words into numbers arranged as a matrix, based on how often each word occurs in each tweet; this is commonly called term weighting. The most commonly used term weighting scheme is TF-IDF, which scales a term's frequency within a tweet by how rare the term is across all tweets. 
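
A minimal pure-Python sketch of this weighting, for illustration only. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` and L2 row normalization, which are the defaults of scikit-learn's `TfidfVectorizer`; the function name `tfidf` and the three sample documents are hypothetical and are not taken from the dataset.

```python
import math
from collections import Counter


def tfidf(docs):
    # docs: list of token lists, e.g. tweets after stemming
    n_docs = len(docs)
    # document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    # smoothed inverse document frequency (assumed formula, see the lead-in)
    idf = {t: math.log((1 + n_docs) / (1 + f)) + 1 for t, f in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights = {t: c * idf[t] for t, c in tf.items()}
        # scale the row to unit length (L2 normalization)
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


# hypothetical sample documents
docs = [
    ['jaringan', 'lambat', 'banget'],
    ['jaringan', 'internet', 'jalan'],
    ['jaringan', 'lambat', 'pagi'],
]
rows = tfidf(docs)
for row in rows:
    print({t: round(w, 3) for t, w in sorted(row.items())})
```

Each row is a unit-length vector, and a term that occurs in every sample document (`jaringan`) gets a lower weight than the rarer terms in the same document.
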
Hasil penerapan pembobotan kata dengan TF-IDF ditampilkan sebagai berikut:" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "................ Hitung TF-IDF ................ \n", " term rata2bobot\n", "331 jaringan 0.113028\n", "340 jelek 0.051901\n", "73 banget 0.048090\n", "426 lambat 0.035820\n", "861 ya 0.030943\n", "313 internet 0.028287\n", "306 indihome 0.027122\n", "\n", "================\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " aba abad adik admin administrator aduh ah ahli aja ajak \\\n", "0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 \n", ".. ... ... ... ... ... ... ... ... ... ... \n", "595 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 \n", "596 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 \n", "597 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 \n", "598 0.0 0.0 0.0 0.473133 0.0 0.0 0.0 0.0 0.0 0.0 \n", "599 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", " ... wkwkwk wow xd xl ya yakxd yaudah youtube yuk zoom \n", "0 ... 0.235527 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "1 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "2 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "3 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "4 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", ".. ... ... ... ... ... ... ... ... ... ... ... \n", "595 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "596 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "597 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "598 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "599 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", "[600 rows x 867 columns]" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#====================== lakukan TF-IDF\n", "print('\\n................ Hitung TF-IDF ................ 
')\n", "tfidf_vect = TfidfVectorizer()\n", "vect_docs = tfidf_vect.fit_transform(DataTweet['newTweet'])\n", "#print(vect_docs)\n", "features_names = tfidf_vect.get_feature_names_out()\n", "\n", "# mean TF-IDF weight of each term across all tweets\n", "datane = []\n", "means=vect_docs.mean(axis=0)\n", "for col, term in enumerate(features_names):\n", "    datane.append( (term, means[0,col] ))\n", "\n", "ranking = pd.DataFrame(datane, columns=['term','rata2bobot'])\n", "ranking = ranking.sort_values('rata2bobot', ascending=False)\n", "print(ranking.head(7))\n", "\n", "# dense tweet-by-term matrix used as the classifier input\n", "dense = vect_docs.todense()\n", "alist = dense.tolist()\n", "print('\\n================')\n", "newData = pd.DataFrame(alist,columns=features_names)\n", "newData" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SMOTE\n", "Class imbalance can degrade a classifier's predictive ability, because the algorithm optimizes overall classification accuracy and can reach a high score while largely ignoring the smaller class. A dataset is considered balanced when the ratio between the two classes is 1:1. Imbalanced classes mainly hurt classification accuracy on the minority class.\n", "\n", "A common remedy is to balance the number of samples per class, either by adding samples to the minority class (oversampling) or by removing samples from the majority class (undersampling). SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling method that creates synthetic minority samples by interpolating between a minority sample and its nearest minority-class neighbors. It is applied below to the training portion of each fold only." 
] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fold-berjumlah: 10\n", "Jumlah Data latih sebelum SMOTE = 540\n", "Sarkasme = 93 BukanSarkas = 447\n", "Jumlah Data latih setelah SMOTE = 894\n", "Sarkasme = 447 BukanSarkas = 447\n", "Jumlah Data Uji = 60\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "x = newData.iloc[:]\n", "y= DataTweet[\"Label\"]\n", "#print('\\nK - Fold Cross Validation')\n", "k=10\n", "kf = KFold(n_splits=k)\n", "print(\"fold-berjumlah:\",k)\n", "kfold=[]\n", "temp_akurasi = []\n", "temp_pres = []\n", "temp_recall = []\n", "temp_f1 = []\n", "temp_model=[]\n", "it=0\n", "for train_index , test_index in kf.split(x):\n", " X_train , X_test = x.iloc[train_index,:],x.iloc[test_index,:]\n", " y_train , y_test = y[train_index] , y[test_index]\n", " \n", " sm = SMOTE(sampling_strategy=\"minority\",k_neighbors=5)\n", " x_oversample, y_oversample = sm.fit_resample(X_train, y_train)\n", " \n", " if(it==1):\n", " #setelah resampling dengan SMOTE\n", " jumlah_awal = y_train.shape[0]\n", " cekLabel_awal = Counter(y_train)\n", " jumlah_sm = y_oversample.shape[0]\n", " cekLabel_sm = Counter(y_oversample)\n", " jumlah_tes = y_test.shape[0]\n", " print('Jumlah Data latih sebelum SMOTE =', jumlah_awal)\n", " print('Sarkasme =', cekLabel_awal['Sarkasme'],'BukanSarkas =',cekLabel_awal['BukanSarkas'])\n", " print('Jumlah Data latih setelah SMOTE =', jumlah_sm)\n", " print('Sarkasme =', cekLabel_sm['Sarkasme'],'BukanSarkas =',cekLabel_sm['BukanSarkas'])\n", " print('Jumlah Data Uji =', jumlah_tes)\n", " # creating the dataset\n", " data = {'Sarkasme':cekLabel_sm['Sarkasme'], 'BukanSarkas':cekLabel_sm['BukanSarkas']}\n", " courses = list(data.keys())\n", " values = list(data.values())\n", " fig = plt.figure(figsize = (8, 5))\n", " \n", " # creating the bar plot\n", " plt.bar(courses, values, color ={'orange','blue'} ,width = 0.7)\n", " plt.xlabel(\"Label\")\n", " plt.ylabel(\"Jumlah\")\n", " plt.title(\"Data Setelah SMOTE\")\n", " plt.show()\n", " it=it+1\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Sebelum dilakukan SMOTE***" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", 
"text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "jumlahData=DataTweet[\"Tweet\"].shape[0] #Jumlah Data\n", "jmlSarkas=DataTweet[DataTweet.Label==\"Sarkasme\"].shape[0]\n", "jmlNonSarkas=DataTweet[DataTweet.Label==\"BukanSarkas\"].shape[0]\n", "d={'Jumlah': [jumlahData], 'Sarkasme': [jmlSarkas],'BukanSarkas':[jmlNonSarkas]}\n", "\n", "cekJumlah = pd.DataFrame(data=d, columns=['Sarkasme','BukanSarkas'])\n", "# creating the dataset\n", "data = {'Sarkasme':jmlSarkas, 'BukanSarkas':jmlNonSarkas}\n", "courses = list(data.keys())\n", "values = list(data.values())\n", "fig = plt.figure(figsize = (8, 5))\n", "\n", "# creating the bar plot\n", "plt.bar(courses, values, color ={'orange','blue'} ,width = 0.7)\n", "plt.xlabel(\"Label\")\n", "plt.ylabel(\"Jumlah\")\n", "plt.title(\"Dataset Sarkasme dan Bukan. Total Data = %s\" %jumlahData)\n", "for x, y in enumerate(values):\n", " plt.text(x , y + 2, str(y),color = 'black', fontweight = 'bold')\n", "plt.show()" ] }, { "attachments": { "kfold10.png": { "image/png": "" } }, "cell_type": "markdown", "metadata": {}, "source": [ "## Classification\n", "### Cross Validation\n", "Proses pembentukan model klasifikasi dilakukan dengan bantuan metode evaluasi Validasi silang atau Cross Validation, yang merupakan pendekatan yang populer digunakan untuk evaluasi kinerja untuk pengklasifikasi. Pendekatan umumnya adalah bahwa model dilatih beberapa kali dan untuk setiap kali satu bagian dari keseluruhan diperlakukan sebagai data evaluasi sedangkan sisanya digunakan untuk pelatihan.\n", "\n", "Metode ini membagi dataset menjadi K bagian yang disebut folds, dengan ukuran yang sama. Setiap fold hanya satu dari N sampel data digunakan untuk tujuan validasi, dan sisanya digunakan sebagai data latih. 
Proses ini akan diulang sebanyak N kali dan hasil akhirnya adalah rata-rata.\n", "\n", "![kfold10.png](attachment:kfold10.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training Model\n", "Tahapan berikutnya adalah proses klasifikasi menggunakan gabungan metode Support Vector Machine dan metode Ensemble (AdaBoost) untuk deteksi sarkasme, apakah teks tersebut termasuk sarkasme atau bukan sarkasme." ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "================ Model ================ \n", "\n", "================ Pembagian data Training dan Testing ================ \n", "\n", "K - Fold Cross Validation\n", "fold-berjumlah: 5\n", "867\n", "\n", "================ Fold ke - 1\n", "Waktu Eksekusi SMOTE= 0.5160977069990622 detik atau 0.008601628449984371 menit\n", "Jumlah Data latih sebelum SMOTE = 480\n", "Sarkasme = 73 BukanSarkas = 407\n", "Jumlah Data latih setelah SMOTE = 814\n", "Sarkasme = 407 BukanSarkas = 407\n", "Jumlah Data Uji = 120\n", "\n", "Metode Adaboost SVM\n", "Waktu Eksekusi Training Test= 69.11410175999845 detik atau 1.1519016959999742 menit\n", "(TP) TruePositif : 5 \n", "(FP) FalsePositif : 8 \n", "(TN) TrueNegatif : 78 \n", "(FN) FalseNegatif : 29\n", "Akurasi Fold ke - 1 = 69.167\n", "Presisi Fold ke - 1 = 38.462\n", "Recall Fold ke - 1 = 14.706\n", "F1-Measure Fold ke - 1 = 21.277\n", "867\n", "\n", "================ Fold ke - 2\n", "Waktu Eksekusi SMOTE= 0.15926613400006318 detik atau 0.0026544355666677194 menit\n", "Jumlah Data latih sebelum SMOTE = 480\n", "Sarkasme = 83 BukanSarkas = 397\n", "Jumlah Data latih setelah SMOTE = 794\n", "Sarkasme = 397 BukanSarkas = 397\n", "Jumlah Data Uji = 120\n", "\n", "Metode Adaboost SVM\n", "Waktu Eksekusi Training Test= 77.2981243739996 detik atau 1.2883020728999932 menit\n", "(TP) TruePositif : 5 \n", "(FP) FalsePositif : 9 \n", "(TN) TrueNegatif : 87 \n", "(FN) FalseNegatif : 19\n", 
"Akurasi Fold ke - 2 = 76.667\n", "Presisi Fold ke - 2 = 35.714\n", "Recall Fold ke - 2 = 20.833\n", "F1-Measure Fold ke - 2 = 26.316\n", "867\n", "\n", "================ Fold ke - 3\n", "Waktu Eksekusi SMOTE= 1.0123771729995497 detik atau 0.01687295288332583 menit\n", "Jumlah Data latih sebelum SMOTE = 480\n", "Sarkasme = 97 BukanSarkas = 383\n", "Jumlah Data latih setelah SMOTE = 766\n", "Sarkasme = 383 BukanSarkas = 383\n", "Jumlah Data Uji = 120\n", "\n", "Metode Adaboost SVM\n", "Waktu Eksekusi Training Test= 59.621921706999274 detik atau 0.9936986951166545 menit\n", "(TP) TruePositif : 2 \n", "(FP) FalsePositif : 4 \n", "(TN) TrueNegatif : 106 \n", "(FN) FalseNegatif : 8\n", "Akurasi Fold ke - 3 = 90.0\n", "Presisi Fold ke - 3 = 33.333\n", "Recall Fold ke - 3 = 20.0\n", "F1-Measure Fold ke - 3 = 25.0\n", "867\n", "\n", "================ Fold ke - 4\n", "Waktu Eksekusi SMOTE= 0.13833599999998114 detik atau 0.0023055999999996857 menit\n", "Jumlah Data latih sebelum SMOTE = 480\n", "Sarkasme = 88 BukanSarkas = 392\n", "Jumlah Data latih setelah SMOTE = 784\n", "Sarkasme = 392 BukanSarkas = 392\n", "Jumlah Data Uji = 120\n", "\n", "Metode Adaboost SVM\n", "Waktu Eksekusi Training Test= 57.88678527999946 detik atau 0.9647797546666577 menit\n", "(TP) TruePositif : 0 \n", "(FP) FalsePositif : 7 \n", "(TN) TrueNegatif : 94 \n", "(FN) FalseNegatif : 19\n", "Akurasi Fold ke - 4 = 78.333\n", "Presisi Fold ke - 4 = 0.0\n", "Recall Fold ke - 4 = 0.0\n", "F1-Measure Fold ke - 4 = 0\n", "867\n", "\n", "================ Fold ke - 5\n", "Waktu Eksekusi SMOTE= 0.142426025999157 detik atau 0.00237376709998595 menit\n", "Jumlah Data latih sebelum SMOTE = 480\n", "Sarkasme = 87 BukanSarkas = 393\n", "Jumlah Data latih setelah SMOTE = 786\n", "Sarkasme = 393 BukanSarkas = 393\n", "Jumlah Data Uji = 120\n", "\n", "Metode Adaboost SVM\n", "Waktu Eksekusi Training Test= 58.44826538600137 detik atau 0.9741377564333561 menit\n", "(TP) TruePositif : 8 \n", "(FP) FalsePositif : 5 \n", 
"(TN) TrueNegatif : 95 \n", "(FN) FalseNegatif : 12\n", "Akurasi Fold ke - 5 = 85.833\n", "Presisi Fold ke - 5 = 61.538\n", "Recall Fold ke - 5 = 40.0\n", "F1-Measure Fold ke - 5 = 48.485\n", "rata-rata akurasi= 80.0\n", "Waktu Eksekusi Training Test= 322.50406527999985 detik atau 5.375067754666664 menit\n" ] } ], "source": [ "x = newData.iloc[:]\n", "y= DataTweet[\"Label\"]\n", "\n", "print('\\n================ Model ================ ')\n", "print('\\n================ Pembagian data Training dan Testing ================ ')\n", "#============================ K-fold Start\n", "print('\\nK - Fold Cross Validation')\n", "k=5\n", "kf = KFold(n_splits=k)\n", "print(\"fold-berjumlah:\",k)\n", "kfold=[]\n", "temp_akurasi = []\n", "temp_pres = []\n", "temp_recall = []\n", "temp_f1 = []\n", "temp_model=[]\n", "\n", "it=1\n", "TP = 0\n", "FP = 0\n", "TN = 0\n", "FN = 0\n", "\n", "start = timer()\n", "for train_index , test_index in kf.split(x):\n", " startT = timer()\n", " X_train , X_test = x.iloc[train_index,:],x.iloc[test_index,:]\n", " y_train , y_test = y[train_index] , y[test_index]\n", " print(len(X_train.columns))\n", " print('\\n================ Fold ke -',it)\n", " #print('Data Train\\n',X_train)\n", " #print('Data Test\\n',X_test)\n", " startS = timer()\n", " \n", " sm = SMOTE(sampling_strategy=\"minority\",k_neighbors=5)\n", " x_oversample, y_oversample = sm.fit_resample(X_train, y_train)\n", " \n", " endS = timer()\n", " waktu=endS - startS\n", " print(\"Waktu Eksekusi SMOTE= \",waktu,\"detik atau\",waktu/60,\"menit\")\n", " \n", " #setelah resampling dengan SMOTE\n", " jumlah_awal = y_train.shape[0]\n", " cekLabel_awal = Counter(y_train)\n", " jumlah_sm = y_oversample.shape[0]\n", " cekLabel_sm = Counter(y_oversample)\n", " jumlah_tes = y_test.shape[0]\n", " print('Jumlah Data latih sebelum SMOTE =', jumlah_awal)\n", " print('Sarkasme =', cekLabel_awal['Sarkasme'],'BukanSarkas =',cekLabel_awal['BukanSarkas'])\n", " print('Jumlah Data latih setelah SMOTE =', 
jumlah_sm)\n", " print('Sarkasme =', cekLabel_sm['Sarkasme'],'BukanSarkas =',cekLabel_sm['BukanSarkas'])\n", " print('Jumlah Data Uji =', jumlah_tes)\n", " #================== Metode klasifikasi\n", " baseLearn_svm=SVC(probability=True,kernel='linear')\n", " print('\\nMetode Adaboost SVM')\n", " model_adaboost =AdaBoostClassifier(n_estimators=30, base_estimator=baseLearn_svm,learning_rate=0.5)\n", " model_adaboost.fit(x_oversample,y_oversample)\n", " Prediksi = model_adaboost.predict(X_test)\n", " #add to list model\n", " temp_model.append(model_adaboost)\n", " #=================\n", " ceklah = pd.DataFrame(columns=['Tweet_Split','Label_Split','Label_Prediksi'])\n", " ceklah['newTweet'] = DataTweet['Tweet'].iloc[test_index]\n", " ceklah['Label'] = DataTweet['Label'].iloc[test_index]\n", " ceklah['LabelPrediksi'] = Prediksi\n", " \n", " endT = timer()\n", " waktu=endT - startT\n", " print(\"Waktu Eksekusi Training Test= \",waktu,\"detik atau\",waktu/60,\"menit\")\n", " \n", " jumlahtes = ceklah.shape[0]\n", " positif=\"Sarkasme\"\n", " negatif=\"BukanSarkas\"\n", " for i in range(jumlahtes):\n", " cek=ceklah.iloc[i]\n", " if (cek.Label==positif and cek.LabelPrediksi==positif):\n", " TP+=1\n", " elif(cek.Label==positif and cek.LabelPrediksi==negatif):\n", " FN+=1\n", " elif(cek.Label==negatif and cek.LabelPrediksi==negatif):\n", " TN+=1\n", " elif(cek.Label==negatif and cek.LabelPrediksi==positif):\n", " FP+=1\n", " \n", " print(\"(TP) TruePositif :\",TP,\"\\n(FP) FalsePositif :\",FP,\"\\n(TN) TrueNegatif :\",TN,\"\\n(FN) FalseNegatif :\",FN)\n", " #akurasi\n", " akurasi=(TP+TN)/(TP+FP+TN+FN)\n", " #presisi\n", " cek0 = TP+FP\n", " if(cek0==0):\n", " presisi=0\n", " else:\n", " presisi=TP/(TP+FP)\n", " #recall\n", " cek1 = TP+FN\n", " if(cek1==0):\n", " recal=0\n", " else:\n", " recal=TP/(TP+FN)\n", " \n", " if(presisi==0 and recal==0):\n", " f1=0\n", " else:\n", " f1=(2*presisi*recal)/(presisi+recal)\n", " hasil_akurasi = round(akurasi*100,3)\n", " 
hasil_pres=round(presisi*100,3)\n", " hasil_recal=round(recal*100,3)\n", " hasil_f1=round(f1*100,3)\n", " \n", " \n", " temp_akurasi.append(hasil_akurasi)\n", " temp_pres.append(hasil_pres)\n", " temp_recall.append(hasil_recal)\n", " temp_f1.append(hasil_f1)\n", " print('Akurasi Fold ke -',it,'=',hasil_akurasi)\n", " print('Presisi Fold ke -',it,'=',hasil_pres)\n", " print('Recall Fold ke -',it,'=',hasil_recal)\n", " print('F1-Measure Fold ke -',it,'=',hasil_f1)\n", " kfold.append([jumlahtes,TP,FP,TN,FN,hasil_akurasi,hasil_pres,hasil_recal,hasil_f1])\n", " it=it+1\n", " TP = 0\n", " FP = 0\n", " TN = 0\n", " FN = 0\n", "\n", "rata2=0\n", "rata2rec=0\n", "rata2pre=0\n", "rata2f1=0\n", "for x in range(len(temp_akurasi)):\n", " rata2=rata2+temp_akurasi[x]\n", "print(\"rata-rata akurasi=\",round(rata2/k,3))\n", "\n", "end = timer()\n", "waktue = end - start\n", "print(\"Waktu Eksekusi Training Test= \",waktue,\"detik atau\",waktue/60,\"menit\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Saving Model with Pickle\n", "Model hasil training sebelumnya dapat disimpan dengan format pickle, model pelatihan dipilih dari data akurasi yang paling tinggi." 
] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MAX akurasi 90.0\n" ] } ], "source": [ "#print(temp_model)\n", "maxacc=max(temp_akurasi)\n", "print(\"MAX akurasi\",maxacc)\n", "\n", "maxi = temp_akurasi.index(maxacc)\n", "modele=temp_model[maxi]\n", "\n", "# save the model with the highest cross-validation accuracy\n", "# (this block is currently disabled by the surrounding triple quotes)\n", "\"\"\"\n", "import os\n", "import os.path\n", "pkl_filename = \"pickle_modelF.pkl\"\n", "if (os.path.exists(pkl_filename)):\n", "    os.remove(pkl_filename)\n", "\n", "with open(pkl_filename,'wb') as file:\n", "    pickle.dump(modele,file)\n", "\"\"\"\n", "# only the TF-IDF vocabulary is saved here, not the fitted IDF weights\n", "with open(\"pickle_featureF.pkl\", \"wb\") as file:\n", "    pickle.dump(tfidf_vect.vocabulary_, file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Implementation\n", "The saved model can be used to classify new text. The new data goes through the same steps as the training data, from text preprocessing through stemming." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "...... Jumlah Data ...... \n", "Total: 114\n", " Sarkasme BukanSarkas\n", "0 24 90\n", "\n", "##-------- Mulai Proses Preprocessing --------##\n", "\n", "\n", "...... Proses Casefolding lowercase, hapus URL...... \n", " Tweet Label \\\n", "0 sedih bgt anjing gue gabisa scroll tl gabisa l... Sarkasme \n", "1 By.U ini pakai jaringan @smartfrencare ya min ... BukanSarkas \n", "2 Kalo jaringan bagus dan aku gaada kerjaan ya b... BukanSarkas \n", "3 RT @NUgarislucu: Termasuk bangkitnya lagi ITJ ... BukanSarkas \n", "4 Gue belum dengerin ulang yours💔 tadi kepotong²... BukanSarkas \n", ".. ... ... \n", "109 PLISS LAGI SERU SERUNYAAA malah hilanh jaringa... BukanSarkas \n", "110 @Jaehytun iyaa, ini indosut, tsel, di awa dari... BukanSarkas \n", "111 Jaringan telkomsel astagfirulloh,bisa2 ny H+😭 BukanSarkas \n", "112 @FirstMediaCares Kasian sekali hanya sibuk men... 
BukanSarkas \n", "113 Kirain jam segini jaringan baguss BukanSarkas \n", "\n", " casefolding \\\n", "0 sedih bgt anjing gue gabisa scroll tl gabisa l... \n", "1 by.u ini pakai jaringan @smartfrencare ya min ... \n", "2 kalo jaringan bagus dan aku gaada kerjaan ya b... \n", "3 rt @nugarislucu: termasuk bangkitnya lagi itj ... \n", "4 gue belum dengerin ulang yours💔 tadi kepotong²... \n", ".. ... \n", "109 pliss lagi seru serunyaaa malah hilanh jaringa... \n", "110 @jaehytun iyaa, ini indosut, tsel, di awa dari... \n", "111 jaringan telkomsel astagfirulloh,bisa2 ny h+😭 \n", "112 @firstmediacares kasian sekali hanya sibuk men... \n", "113 kirain jam segini jaringan baguss \n", "\n", " removeURL \n", "0 sedih bgt anjing gue gabisa scroll tl gabisa l... \n", "1 by.u ini pakai jaringan ya min ? \n", "2 kalo jaringan bagus dan aku gaada kerjaan ya b... \n", "3 : termasuk bangkitnya lagi itj (indonesia tan... \n", "4 gue belum dengerin ulang yours tadi kepotong² ... \n", ".. ... \n", "109 pliss lagi seru serunyaaa malah hilanh jaringa... \n", "110 iyaa, ini indosut, tsel, di awa dari kemaren ... \n", "111 jaringan telkomsel astagfirulloh,bisa2 ny h+ \n", "112 kasian sekali hanya sibuk menanyakan no id pe... \n", "113 kirain jam segini jaringan baguss \n", "\n", "[114 rows x 4 columns]\n", "\n", "...... Tokenisasi ...... \n", " Tokenisasi\n", "0 [sedih, bgt, anjing, gue, gabisa, scroll, tl, ...\n", "1 [by.u, ini, pakai, jaringan, ya, min, ?]\n", "\n", "...... Proses Casefolding2 hapus angka dan simbol...... \n", " Cleaning\n", "0 [sedih, bgt, anjing, gue, gabisa, scroll, tl, ...\n", "1 [byu, ini, pakai, jaringan, ya, min, ]\n", "\n", "...... Proses Normalisasi ...... \n", " Normalisasi\n", "0 [sedih, banget, anjing, saya, tidak, bisa, gul...\n", "1 [telkomsel, ini, pakai, jaringan, ya, admin]\n", "\n", "...... Proses Stopword Removal ...... 
\n", " Stopword\n", "0 [sedih, banget, anjing, gulir, timeline, liat,...\n", "1 [telkomsel, pakai, jaringan, ya, admin]\n", "2 [jaringan, bagus, kerjaan, ya, curi, curi, kad...\n", "3 [bangkitnya, indonesia, jaringan]\n", "4 [dengarkan, ulang, milikmu, kepotong, potong, ...\n", "5 [iya, ulang, kabar, beranda, jaringan, lumayan...\n", "\n", "................ Proses Stemming ................ \n", "0 [sedih, banget, anjing, gulir, timeline, liat,...\n", "1 [telkomsel, pakai, jaringan, ya, admin]\n", "2 [jaringan, bagus, kerja, ya, curi, curi, kadan...\n", "Name: Stemmed, dtype: object\n" ] } ], "source": [ "#============== Read Data Input\n", "filetest=\"#tweet_label_dikit.xlsx\"\n", "Prediksites2 = pd.read_excel(filetest)\n", "tes=Prediksites2\n", "#\n", "jumlahData=tes[\"Tweet\"].shape[0] #Jumlah Data\n", "jmlSarkas=tes[tes.Label==\"Sarkasme\"].shape[0]\n", "jmlNonSarkas=tes[tes.Label==\"BukanSarkas\"].shape[0]\n", "d={'Jumlah': [jumlahData], 'Sarkasme': [jmlSarkas],'BukanSarkas':[jmlNonSarkas]}\n", "\n", "cekJumlah = pd.DataFrame(data=d, columns=['Sarkasme','BukanSarkas'])\n", "print('\\n...... Jumlah Data ...... ')\n", "print(\"Total:\",jumlahData)\n", "print(cekJumlah)\n", "#ax = cekJumlah.plot.bar(rot=0)\n", "#============== Start Processing Text\n", "print(\"\\n##-------- Mulai Proses Preprocessing --------##\\n\")\n", "print('\\n...... Proses Casefolding lowercase, hapus URL...... ')\n", "tes['casefolding'] = tes['Tweet'].apply(lower)\n", "tes['removeURL'] = tes['casefolding'].apply(removeURLemoji)\n", "print(tes)\n", "\n", "#==== Tokenisasi : memisahkan kata dalam kalimat\n", "print('\\n...... Tokenisasi ...... ')\n", "tes['Tokenisasi'] = tes['removeURL'].apply(tokenize)\n", "print(tes[['Tokenisasi']].head(2))\n", "\n", "print('\\n...... Proses Casefolding2 hapus angka dan simbol...... 
')\n", "tes['Cleaning'] = tes['Tokenisasi'].apply(hapus_simbolAngka)\n", "Prediksites2=pd.DataFrame()\n", "Prediksites2['Cleaning']=tes['Cleaning']\n", "print(Prediksites2[['Cleaning']].head(2))\n", "#============== Normalization: map slang and abbreviations to standard words\n", "print('\\n...... Proses Normalisasi ...... ')\n", "#normalisasi_dict = normal_term() #import excel\n", "tes['Normalisasi'] = tes['Cleaning'].apply(normalisasi)\n", "print(tes[['Normalisasi']].head(2))\n", "\n", "#==== Stopword removal: drop words that carry little meaning\n", "print('\\n...... Proses Stopword Removal ...... ')\n", "#list_stopwords = daftarStopword()\n", "tes['Stopword'] = tes['Normalisasi'].apply(delstopwordID)\n", "print(tes[['Stopword']].head(6))\n", "\n", "#==== Stemming: reduce the dimensionality of the word/term features\n", "print('\\n................ Proses Stemming ................ ')\n", "tes['Stemmed'] = tes['Stopword'].apply(stemming)\n", "print(tes['Stemmed'].head(3))\n", "tes['newTweet'] = tes['Stemmed'].apply(listokalimat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the TF-IDF and Classification Models\n", "Two files are used: the saved TF-IDF feature model, which transforms the new text into the same feature space the classifier was trained on, and the classification model itself, which predicts the label of the new data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Load TF-IDF Model***" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "................ Load Model Fitur TF-IDF ................ 
\n", "\n", "================\n", " aba abad adik admin administrator aduh ah ahli aja ajak \\\n", "0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 \n", "1 0.0 0.0 0.0 0.542116 0.0 0.0 0.0 0.000000 0.0 0.0 \n", "2 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 \n", "3 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 \n", "4 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 \n", ".. ... ... ... ... ... ... ... ... ... ... \n", "109 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 \n", "110 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 \n", "111 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 \n", "112 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.293638 0.0 0.0 \n", "113 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 \n", "\n", " ... wkwkwk wow xd xl ya yakxd yaudah youtube yuk zoom \n", "0 ... 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 \n", "1 ... 0.000000 0.0 0.0 0.0 0.433360 0.0 0.0 0.0 0.0 0.0 \n", "2 ... 0.275516 0.0 0.0 0.0 0.191966 0.0 0.0 0.0 0.0 0.0 \n", "3 ... 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 \n", "4 ... 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 \n", ".. ... ... ... ... ... ... ... ... ... ... ... \n", "109 ... 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 \n", "110 ... 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 \n", "111 ... 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 \n", "112 ... 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 \n", "113 ... 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 \n", "\n", "[114 rows x 867 columns]\n" ] } ], "source": [ "print('\\n................ Load Model Fitur TF-IDF ................ 
')\n", "# load the vocabulary saved at training time so the columns match the classifier input\n", "with open(\"pickle_featureF.pkl\", 'rb') as file:\n", "    savedtfidf = pickle.load(file)\n", "vectorizer2 = TfidfVectorizer(vocabulary=savedtfidf)\n", "# note: only the vocabulary was saved, so fit_transform re-estimates the IDF\n", "# weights on the new data; they differ from the training-time weights\n", "vect_docs2 = vectorizer2.fit_transform(tes['newTweet'])\n", "features_names2 = vectorizer2.get_feature_names_out()\n", "\n", "dense2 = vect_docs2.todense()\n", "alist2 = dense2.tolist()\n", "print('\\n================')\n", "newData2 = pd.DataFrame(alist2,columns=features_names2)\n", "print(newData2)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Load Classification Model***" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Tweet Label \\\n", "0 sedih bgt anjing gue gabisa scroll tl gabisa l... Sarkasme \n", "1 By.U ini pakai jaringan @smartfrencare ya min ... BukanSarkas \n", "2 Kalo jaringan bagus dan aku gaada kerjaan ya b... BukanSarkas \n", "3 RT @NUgarislucu: Termasuk bangkitnya lagi ITJ ... BukanSarkas \n", "4 Gue belum dengerin ulang yours💔 tadi kepotong²... BukanSarkas \n", ".. ... ... \n", "109 PLISS LAGI SERU SERUNYAAA malah hilanh jaringa... BukanSarkas \n", "110 @Jaehytun iyaa, ini indosut, tsel, di awa dari... BukanSarkas \n", "111 Jaringan telkomsel astagfirulloh,bisa2 ny H+😭 BukanSarkas \n", "112 @FirstMediaCares Kasian sekali hanya sibuk men... BukanSarkas \n", "113 Kirain jam segini jaringan baguss BukanSarkas \n", "\n", " Prediksi \n", "0 Sarkasme \n", "1 BukanSarkas \n", "2 BukanSarkas \n", "3 BukanSarkas \n", "4 BukanSarkas \n", ".. ... \n", "109 BukanSarkas \n", "110 BukanSarkas \n", "111 BukanSarkas \n", "112 BukanSarkas \n", "113 BukanSarkas \n", "\n", "[114 rows x 3 columns]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load from file\n", "\n", "import pickle\n", "\n", "with open(\"pickle_modelF.pkl\", 'rb') as file:\n", " pickle_model = pickle.load(file)\n", "hasil=pickle_model.predict(newData2)\n", "DFpredict = pd.DataFrame(hasil,columns=[\"Prediksi\"])\n", "#print(DFpredict)\n", "gabungkan = pd.concat([tes[['Tweet','Label']], DFpredict], axis=1)\n", "gabungkan\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }