Sentiment analysis on Tamil and Malayalam code-mixed data.
In the training data, each sentence is labelled as Positive, Negative, Mixed Feelings, Not-Malayalam/Not-Tamil, or Unknown State.
The result is saved in result.tsv.
python3 -i=tamil_train.tsv -l=tam -d=tamil_uniq_freq.tsv -d2=tamil_bigram_freq.tsv -t=tamil_test.tsv
python3 -i=malayalam_train.tsv -l=mal -d=malayalam_uniq_freq.tsv -d2=malayalam_bigram_freq.tsv -t=malayalam_test.tsv
python3 -i=train.tsv -t=test.tsv
Python 3.6 and the sklearn, pandas and numpy modules
pip3 install scikit-learn
pip3 install pandas
pip3 install numpy
- Read data from tsv file.
- Input is training data, bigram data.
- Map the labels Negative, Positive, Unknown_state, Mixed_feelings, not-malayalam/not-tamil to 0, 1, 2, 3, 4 respectively.
- Clean/preprocess the data: remove punctuation and numbers, convert to lower case, and remove extra white space.
- Apply bigram and unigram analysis to the data, using the bigram database.
- For example, this is how a comment is processed.
Before: trailer late ah parthavanga like podunga
Bigrams: ['trailer late:002:Positive', 'late ah:007:Positive', 'ah parthavanga:002:Positive', 'parthavanga like:003:Positive', 'like podunga:155:Positive']
After: trailer late {Positive} late ah {Positive} ah parthavanga {Positive} parthavanga like {Positive}
- Convert the data into features using TF-IDF.
- Train a Multinomial Naive Bayes model from the sklearn module on these features.
- Use the trained model to predict the sentiment of the test data.
- The predictions are values 0, 1, 2, 3, 4, which are mapped back to the original labels.
- The results can be found in the result.tsv file.
- Detailed explanation of algorithm.
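
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script. The function names (`clean`, `tag_bigrams`, `train_and_predict`), the exact label spellings, and the use of a plain dict as the bigram lookup are all assumptions made for the example. How untagged bigrams and the unigram database are handled is not described above, so untagged bigrams are simply kept as-is here and the unigram step is left out.

```python
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Label order follows the 0..4 mapping in the list above.
# The exact spellings used in the data files are an assumption.
LABELS = ["Negative", "Positive", "Unknown_state",
          "Mixed_feelings", "not-malayalam/not-tamil"]
LABEL_TO_ID = {name: idx for idx, name in enumerate(LABELS)}

# Translation table that turns punctuation and digits into spaces.
_DROP = str.maketrans({ch: " " for ch in string.punctuation + string.digits})


def clean(text):
    """Lowercase, remove punctuation and numbers, collapse extra whitespace."""
    return " ".join(text.lower().translate(_DROP).split())


def tag_bigrams(text, bigram_labels):
    """Rewrite a comment as its bigrams, each followed by its stored label.

    bigram_labels is a hypothetical dict such as {"trailer late": "Positive"};
    bigrams missing from it are kept without a tag (an assumption).
    """
    words = text.split()
    parts = []
    for first, second in zip(words, words[1:]):
        bigram = first + " " + second
        label = bigram_labels.get(bigram)
        parts.append("%s {%s}" % (bigram, label) if label else bigram)
    # A one-word comment has no bigrams, so fall back to the text itself.
    return " ".join(parts) if parts else text


def train_and_predict(train_texts, train_labels, test_texts, bigram_labels):
    """TF-IDF features + Multinomial Naive Bayes, returning label names."""
    def prepare(text):
        return tag_bigrams(clean(text), bigram_labels)

    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit([prepare(t) for t in train_texts],
              [LABEL_TO_ID[name] for name in train_labels])
    predicted_ids = model.predict([prepare(t) for t in test_texts])
    # Map the numeric predictions back to the original labels.
    return [LABELS[i] for i in predicted_ids]
```

In this sketch, `train_and_predict` takes lists of raw comments and label names. Reading the tsv files with pandas and writing result.tsv would wrap around it.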