NLP and text visualization using tweets about the US presidential election
In this tutorial, written together with Elbrus Coding Bootcamp, we walk step by step through preparing data for visualization and topic modeling, using tweets about the US presidential election as an example.
Loading the required libraries
# Data processing / manipulation
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None
import re
from unidecode import unidecode
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Data visualization
import matplotlib.pyplot as plt
%matplotlib inline
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import seaborn as sns

# Visualization of the topic modeling results
import pyLDAvis.sklearn

# Stop words, tokenizer, lemmatizer
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Sklearn
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Gensim
import gensim
from gensim.parsing.preprocessing import remove_stopwords

# Spacy
import spacy
Data
The Kaggle dataset US Election 2020 Tweets is a collection of tweets gathered via the Twitter API using the keywords #donaldtrump and #joebiden between October 15, 2020 and November 8, 2020.
# Load the data
trump_df = pd.read_csv('hashtag_donaldtrump.csv', lineterminator='\n')
biden_df = pd.read_csv('hashtag_joebiden.csv', lineterminator='\n')

print('Total number of records in the Trump dataset: ', trump_df.shape)
print('Total number of records in the Biden dataset: ', biden_df.shape)

Total number of records in the Trump dataset:  (970919, 21)
Total number of records in the Biden dataset:  (776886, 21)
Data preprocessing
Goal: build a final dataset for the two candidates that contains only tweets published in the USA and no duplicates.
# Drop irrelevant columns
irrelevant_columns = ['user_name', 'user_screen_name', 'user_description', 'user_join_date', 'collected_at']
trump_df = trump_df.drop(columns=irrelevant_columns)
biden_df = biden_df.drop(columns=irrelevant_columns)

# Rename some columns for easier analysis
trump_df = trump_df.rename(columns={"likes": "Likes", "retweet_count": "Retweets",
                                    "state": "State", "user_followers_count": "Followers"})
biden_df = biden_df.rename(columns={"likes": "Likes", "retweet_count": "Retweets",
                                    "state": "State", "user_followers_count": "Followers"})

# Rename United States of America to United States
d = {"United States of America": "United States"}
trump_df['country'].replace(d, inplace=True)
biden_df['country'].replace(d, inplace=True)

# Create new datasets with tweets published in the USA only
trump_usa_df = trump_df.loc[trump_df['country'] == "United States"]
biden_usa_df = biden_df.loc[biden_df['country'] == "United States"]

# Drop rows with missing values
trump_usa_df = trump_usa_df.dropna()
biden_usa_df = biden_usa_df.dropna()

# Check the size of the data
print('Total number of records in the Trump USA dataset: ', trump_usa_df.shape)
print('Total number of records in the Biden USA dataset: ', biden_usa_df.shape)

Total number of records in the Trump USA dataset:  (101953, 16)
Total number of records in the Biden USA dataset:  (90639, 16)
trump_usa_df['initial_dataset'] = 'trump'
biden_usa_df['initial_dataset'] = 'biden'

# Combine the data into a single final dataset
twitter_usa_df = pd.concat([trump_usa_df, biden_usa_df], ignore_index=True)

# Find duplicates
twitter_usa_df_duplicates = twitter_usa_df[twitter_usa_df.duplicated(['tweet_id'], keep=False)]

# Remove duplicates
trump_usa_df = trump_usa_df[~trump_usa_df.tweet_id.isin(twitter_usa_df_duplicates.tweet_id)]
biden_usa_df = biden_usa_df[~biden_usa_df.tweet_id.isin(twitter_usa_df_duplicates.tweet_id)]
twitter_usa_df.drop_duplicates(subset="tweet_id", keep=False, inplace=True)

# Check the size of the final dataset
print('Total number of records in the combined Trump and Biden USA dataset: ', twitter_usa_df.shape)

Total number of records in the combined Trump and Biden USA dataset:  (152376, 17)
Text preparation
Goal: prepare the text for visualization and analysis: remove all special characters, overly frequent words, links and mentions of other users, and reduce the words to lemmas and tokens.
Text cleaning
# Create a text-cleaning function
def clean_text(text):
    text = unidecode(text)
    text = text.lower()                               # Convert to lowercase
    text = re.sub(r'&', ' ', text)                    # Remove the ampersand (used instead of "and")
    text = re.sub(r'[^\sa-zA-Z0-9@\[\]]', ' ', text)  # Remove special characters
    text = re.sub(r'@[A-Za-z0-9]+', '', text)         # Remove mentions of other Twitter users
    text = re.sub(r'#', '', text)
    text = re.sub(r'RT[\s]+', '', text)
    text = remove_stopwords(text)                     # Remove stop words
    text = re.sub(r'http\S+', '', text)               # Remove all links
    text = re.sub(r'https?:\/\/\S+', '', text)
    text = re.sub(r'donal\S+', ' ', text)             # Remove mentions of the candidates' names
    text = re.sub(r'trum\S+', ' ', text)
    text = re.sub(r'donaldtrum\S+', ' ', text)
    text = re.sub(r'joe\S+', ' ', text)
    text = re.sub(r'biden\S+', ' ', text)
    text = re.sub(r'joebide\S+', ' ', text)
    text = re.sub(r'vot\S+', ' ', text)               # Remove the overly frequent words vote and election
    text = re.sub(r'electio\S+', ' ', text)
    text = re.sub(r'\bs\b', '', text)                 # Remove stray single letters s and t
    text = re.sub(r'\bt\b', '', text)
    return text
# Apply the text-cleaning function
trump_usa_df['cleaned_tweet'] = trump_usa_df.tweet.apply(lambda x: clean_text(x))
biden_usa_df['cleaned_tweet'] = biden_usa_df.tweet.apply(lambda x: clean_text(x))
twitter_usa_df['cleaned_tweet'] = twitter_usa_df['tweet'].apply(lambda x: clean_text(x))
Lemmatization
Lemmatization is the process of reducing a word form to its lemma, that is, its normal (dictionary) form.
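As a minimal illustration (a toy example, not part of the pipeline), WordNetLemmatizer returns different lemmas depending on the part-of-speech tag it receives, which is why the code below first maps NLTK POS tags to WordNet tags:

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('leaves', wordnet.NOUN))   # leaf
print(lemmatizer.lemmatize('running', wordnet.VERB))  # run
print(lemmatizer.lemmatize('running'))                # running (the default POS is noun, so tagging matters)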
# Create a function that converts an nltk POS tag into a wordnet tag
lemmatizer = WordNetLemmatizer()

def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

# Create a function that lemmatizes the text
def lemmatize_sentence(sentence):
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            lemmatized_sentence.append(word)
        else:
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)
# Apply the lemmatization function
trump_usa_df['lemmat_tweet'] = trump_usa_df.cleaned_tweet.apply(lambda x: lemmatize_sentence(x))
biden_usa_df['lemmat_tweet'] = biden_usa_df.cleaned_tweet.apply(lambda x: lemmatize_sentence(x))
twitter_usa_df['lemmat_tweet'] = twitter_usa_df.cleaned_tweet.apply(lambda x: lemmatize_sentence(x))
# Compare the cleaned and prepared tweets with the original text
pd.options.display.max_colwidth = 50
print(twitter_usa_df[['tweet', 'cleaned_tweet', 'lemmat_tweet']].head(3))

                                               tweet  \
0  #Trump: As a student I used to hear for years,...
1  You get a tie! And you get a tie! #Trump 's ra...
3  #Trump #PresidentTrump #Trump2020LandslideVict...

                                       cleaned_tweet  \
0      student hear years years heard china 2019 1 ...
1                  tie tie s rally iowa t jjaluumh5d
3          president maga kag 4moreyears america a...

                                        lemmat_tweet
0  student hear year year heard china 2019 1 5 t ...
1                  tie tie s rally iowa t jjaluumh5d
3  president maga kag 4moreyears america americaf...
Tokenization
Tokenization is the process of splitting a piece of text into smaller units called tokens; a token can be a word or an individual character.
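A minimal illustration with NLTK's word_tokenize (assuming the punkt model has already been downloaded via nltk.download('punkt')):

from nltk.tokenize import word_tokenize

print(word_tokenize("America goes to the polls"))
# ['America', 'goes', 'to', 'the', 'polls']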
# Prepare the text of all tweets for the word-frequency analysis
list_text = []
for string in twitter_usa_df['lemmat_tweet']:
    list_text.append(string)

str_text = str(list_text)
tokens = word_tokenize(str_text)
txt_tokens = [word.lower() for word in tokens if word.isalpha()]
Exploratory data analysis
Goal: identify the key words and phrases used in the tweets, as well as how tweet frequency is distributed before and during the election period.
Number of tweets by date
# Convert the dates to pandas datetime format
twitter_usa_df['created_at'] = pd.to_datetime(twitter_usa_df['created_at'])

# Check the period during which the data was collected
print(f"Data collected from {twitter_usa_df.created_at.min()}")
print(f"Data collected until {twitter_usa_df.created_at.max()}")

Data collected from 2020-10-15 00:00:02
Data collected until 2020-11-08 23:58:44
# Build the number of tweets per date
cnt_srs = twitter_usa_df['created_at'].dt.date.value_counts()
cnt_srs = cnt_srs.sort_index()
# Visualize the number of tweets by date
plt.figure(figsize=(14,6))
sns.barplot(cnt_srs.index, cnt_srs.values, alpha=0.8, color='lightblue')
plt.xticks(rotation='vertical')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Number of tweets', fontsize=12)
plt.title("Number of tweets according to dates")
plt.show()
Word frequency analysis
Word Clouds
# Build the text for the word cloud
text_wordcloud = ' '.join(map(str, txt_tokens))

wordcloud = WordCloud(width=1600, height=800, max_font_size=200, max_words=100,
                      colormap='vlag', background_color="white",
                      collocations=True).generate(text_wordcloud)

# Visualization
plt.figure(figsize=(20,10))
plt.imshow(wordcloud)
plt.title('Wordcloud of TOP 100 frequent words for all tweets')
plt.axis("off")
plt.show()
To build the word cloud for each of the candidates, we use a mask with his portrait.
For Trump:
For Biden:
list_text = []
for string in biden_usa_df['lemmat_tweet']:
    list_text.append(string)

text_biden = str(list_text)
tkns_biden = word_tokenize(text_biden)
tokens_biden = [word.lower() for word in tkns_biden if word.isalpha()]
tokens_biden = ' '.join(map(str, tokens_biden))

# Load Biden's portrait and use np.array to convert the image file into an array
biden_mask = np.array(Image.open('biden.png'))
biden_mask = np.where(biden_mask > 3, 255, biden_mask)

# Build the word cloud
wordcloud = WordCloud(background_color='white',
                      contour_color='blue',
                      mask=biden_mask, colormap='Blues',
                      contour_width=4).generate(tokens_biden)

# Visualization
plt.figure(figsize=(20,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Wordcloud of TOP frequent words for Biden tweets', fontsize=20)
plt.axis('off')
plt.show()
# Repeat for Trump
list_text = []
for string in trump_usa_df['lemmat_tweet']:
    list_text.append(string)

text_trump = str(list_text)
tokens_trump = word_tokenize(text_trump)
tokens_trump = [word.lower() for word in tokens_trump if word.isalpha()]
tokens_trump = ' '.join(map(str, tokens_trump))

trump_mask = np.array(Image.open('trump.png'))
trump_mask = np.where(trump_mask > 3, 255, trump_mask)

wordcloud = WordCloud(background_color='white',
                      contour_color='red',
                      mask=trump_mask, colormap='Reds',
                      contour_width=1).generate(tokens_trump)

# Visualization
plt.figure(figsize=(20,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Wordcloud of TOP frequent words for Trump tweets')
plt.axis('off')
plt.show()
Bigram frequency analysis
Goal: build a list of the most frequent bigrams (pairs of consecutive words) in the tweets in order to explore the text in more depth, as illustrated below.
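As a minimal illustration of what a bigram is (a toy example, not part of the pipeline), nltk.ngrams slides a window of two words over a list of tokens:

print(list(nltk.ngrams(['count', 'every', 'legal', 'ballot'], 2)))
# [('count', 'every'), ('every', 'legal'), ('legal', 'ballot')]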
# Build a dataset of the most frequent bigrams
bigrams_series = (pd.Series(nltk.ngrams(txt_tokens, 2)).value_counts())[:10]
bigrams_top = pd.DataFrame(bigrams_series.sort_values(ascending=False))
bigrams_top = bigrams_top.reset_index().rename(columns={'index': 'bigrams', 0: 'counts'})

# Visualize the bigrams (catplot creates its own figure)
sns.catplot(x='counts', y='bigrams', kind="bar", palette="vlag",
            data=bigrams_top, height=8.27, aspect=11.7/8.27)
plt.title('TOP 10 pairs of words occurring in the texts')
Tweet popularity analysis
Goal: find the most popular tweet by locating the maximum number of user retweets over the whole period in which the tweets were published.
# The most popular tweet in the dataset by number of retweets
tweet_retweet_max = twitter_usa_df.loc[twitter_usa_df['Retweets'].idxmax()]
tweet_retweet_max

created_at                                       2020-11-06 16:31:06
tweet              America Assembled!🇺🇸\n\n@JoeBiden @KamalaHarr...
Likes                                                          74528
Retweets                                                       20615
source                                            Twitter for iPhone
Followers                                                       8080
user_location                                           Brooklyn, NY
lat                                                          40.6501
long                                                        -73.9496
city                                                        New York
country                                                United States
continent                                              North America
State                                                       New York
state_code                                                        NY
initial_dataset                                                biden
clean_tweets                 america assembled brothers election...
tokenize_tweets    [america, assembled, brothers, election2020, e...
Name: 170447, dtype: object
# The full text of the tweet
print(f"The tweet '{tweet_retweet_max.tweet}' was retweeted the most with {tweet_retweet_max.Retweets} retweets.")
Topic modeling
Goal: automatically discover the topics present in the collection of texts and predict the topic of new tweets.
Building the model
vectorizer = CountVectorizer(analyzer='word',
                             min_df=3,                         # minimum number of occurrences of a word
                             stop_words='english',             # remove stop words
                             lowercase=True,                   # convert all words to lowercase
                             token_pattern='[a-zA-Z0-9]{3,}',  # drop special characters
                             max_features=5000)                # maximum number of unique words
data_matrix = vectorizer.fit_transform(twitter_usa_df.lemmat_tweet)

lda_model = LatentDirichletAllocation(n_components=12,           # number of topics
                                      learning_method='online',
                                      random_state=62,
                                      n_jobs=-1)
lda_output = lda_model.fit_transform(data_matrix)
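A quick sanity check (a sketch, not in the original notebook): fit_transform returns a document-topic matrix, so each row holds the topic probabilities of one tweet.

print(lda_output.shape)        # (number of tweets, 12)
print(lda_output[0].round(3))  # topic distribution of the first tweet, sums to ~1
print(lda_output[0].argmax())  # index of its dominant topic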
Visualizing the model with pyLDAvis
pyLDAvis.enable_notebook()
p = pyLDAvis.sklearn.prepare(lda_model, data_matrix, vectorizer, mds='tsne')
pyLDAvis.save_html(p, 'lda.html')
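Because pyLDAvis.enable_notebook() has already been called, the prepared object can also be rendered directly inside a Jupyter cell (a usage note, not part of the original code):

p  # evaluating the prepared object in a notebook cell shows the interactive panel inline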
Visualizing the model through the key frequent words of each topic
for i, topic in enumerate(lda_model.components_):
    print(f'Top 10 words for topic #{i + 1}:')
    print([vectorizer.get_feature_names()[index] for index in topic.argsort()[-10:]])
    print('\n')

Top 10 words for topic #1:
['leader', 'corruption', 'nevada', 'gop', 'politics', 'video', 'real', 'georgia', 'ballot', 'hunter']

Top 10 words for topic #2:
['united', 'medium', 'michigan', 'believe', 'look', 'people', 'country', 'news', 'need', 'state']

Top 10 words for topic #3:
['people', 'mail', 'today', 'man', 'work', 'obama', 'american', 'count', 'pennsylvania', 'let']

Top 10 words for topic #4:
['time', 'gon', 'matter', 'turn', 'thing', 'hear', 'kamala', 'win', 'life', 'good']

Top 10 words for topic #5:
['like', 'lose', 'poll', 'time', 'talk', 'debate', 'try', 'maga', 'know', 'win']

Top 10 words for topic #6:
['tax', 'cnn', 'que', 'china', 'hope', 'usa', 'day', 'elect', '2020', 'president']

Top 10 words for topic #7:
['arizona', 'party', 'florida', 'stop', 'covid', '000', 'covid19', 'win', 'republican', 'democrat']

Top 10 words for topic #8:
['racist', 'fraud', 'guy', 'say', 'speech', 'presidential', 'house', 'white', 'lie', 'presidentelect']

Top 10 words for topic #9:
['case', 'claim', 'anti', 'feel', 'fuck', 'lead', 'blm', 'dump', 'new', 'come']

Top 10 words for topic #10:
['red', 'lol', 'political', 'truth', 'child', 'person', 'demvoice1', 'bluewave', 'care', 'end']

Top 10 words for topic #11:
['msnbc', 'black', 'thank', 'support', 'wtpsenate', 'year', 'love', 'harris', 'america', 'kamalaharris']

Top 10 words for topic #12:
['campaign', 'democracy', 'watch', 'debates2020', 'family', 'supporter', 'great', 'wtpblue', 'like', 'think']
Attaching the detected topics to the tweets
topic_values = lda_model.transform(data_matrix)
twitter_usa_df['Topic'] = topic_values.argmax(axis=1)
twitter_usa_df[['tweet', 'Topic']].head(4)

                                               tweet  Topic
0  #Trump: As a student I used to hear for years,...      0
1  You get a tie! And you get a tie! #Trump 's ra...     11
3  #Trump #PresidentTrump #Trump2020LandslideVict...      8
4  #Trump: Nobody likes to tell you this, but som...      7
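To get a rough sense of how the tweets are spread across the 12 topics, a simple count can be added (a suggested check, not in the original notebook):

# Number of tweets assigned to each dominant topic
print(twitter_usa_df['Topic'].value_counts().sort_index())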
Predicting topics for new tweets
# Identify the top keywords of each topic
def show_topics(vectorizer=vectorizer, lda_model=lda_model, n_words=100):
    keywords = np.array(vectorizer.get_feature_names())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords

topic_keywords = show_topics(vectorizer=vectorizer, lda_model=lda_model, n_words=100)

# Build a topic-keyword dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word ' + str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic ' + str(i) for i in range(df_topic_keywords.shape[0])]
# Helper functions for text preparation
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] else ''
                                   for token in doc if token.pos_ in allowed_postags]))
    return texts_out

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Create a function that predicts the topic of new tweets
def predict_topic(text, nlp=nlp):
    global sent_to_words
    global lemmatization
    # Prepare the text
    mytext_2 = list(sent_to_words(text))
    # Lemmatization
    mytext_3 = lemmatization(mytext_2, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
    # Vectorization
    mytext_4 = vectorizer.transform(mytext_3)
    # Transform with the LDA model
    topic_probability_scores = lda_model.transform(mytext_4)
    topic = df_topic_keywords.iloc[np.argmax(topic_probability_scores), :].values.tolist()
    return topic, topic_probability_scores
# Prediction for a new tweet: "Biden is the best president for this country vote for him"
mytext = ["Biden is the best president for this country vote for him"]
topic, prob_scores = predict_topic(text=mytext)

# Visualize the keywords of the predicted topic for the new tweet
fig, ax = plt.subplots(figsize=(20,10))
plt.bar(topic[:12], prob_scores[0])
axes = plt.gca()
axes.set_ylim(0, 0.5)
axes.set_xlabel('Related keywords', fontsize=20)
axes.set_ylabel('Probability score', fontsize=20)
fig.tight_layout()
plt.show()
The new tweet was assigned to Topic 6, whose key words include DAY and HOPE.
# Prediction for a new tweet: "CNN lies to people"
mytext = ["CNN lies to people"]
topic, prob_scores = predict_topic(text=mytext)

# Visualize the keywords of the predicted topic for the new tweet
fig, ax = plt.subplots(figsize=(20,10))
plt.bar(topic[:12], prob_scores[0])
axes = plt.gca()
axes.set_ylim(0, 0.4)
axes.set_xlabel('Related keywords', fontsize=20)
axes.set_ylabel('Probability score', fontsize=20)
fig.tight_layout()
plt.show()
The new tweet was assigned to Topic 8, whose key words include LIE and GUY.
Conclusion
Preparing text for analysis and visualizing the discussion around the election turned out to be fairly easy, and we even managed to build a simple model that determines the topic of a new tweet. Behind this apparent simplicity, however, lies some rather serious mathematics, and besides theoretical knowledge a working Data Scientist also needs practical skills. Acquiring them by sitting over books is difficult; it is better to jump straight into the fray. The bootcamp methodology developed in the US is built around intensive, full-time study with complete immersion in the process. In Russia this format is practiced by the educational project Elbrus: students practically live on the Moscow campus, devoting every weekday from 9 am to 6 pm to data science.
Studying online alongside a full-time job is not an option here, but the bootcamp format ensures that every participant is deeply involved in the learning process. Over 12 weeks, guided by a practicing Data Scientist, students master data collection and analysis, neural networks, machine learning, and other hard and soft skills, with an amount of hands-on practice comparable to an internship at a large IT company. Having successfully completed the course, you will gain not only valuable knowledge but also the opportunity to apply it, and you will add several projects to your portfolio. Good luck!