E-Commerce user search satisfaction prediction model built on the Kaggle Crowdflower Search Results Relevance data
Project walkthrough (links point to Medium posts)
- Description.ipynb
- E-Commerce user satisfaction prediction model using TF-IDF, LSA, SVM, and Word2vec
- Search service satisfaction classification model (1)
- Search service satisfaction classification model (2)
This project is organized as follows.
.
├── utility/
│   ├── README.md
│   ├── __init__.py
│   ├── augment.py        # Data augmentation functions
│   ├── eda.py            # EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
│   ├── processing.py     # Preprocessing functions
│   ├── predict.py        # Prediction functions
│   └── utility.py        # Metrics, distance stack, plotting, and other helper functions
├── example/
│   ├── Description.ipynb     # Description of this project
│   ├── EDA.ipynb             # Exploratory Data Analysis
│   ├── LSA.ipynb             # The whole process of the project
│   ├── preprocessing.ipynb   # Data preprocessing flow
│   └── word2vec.ipynb        # Implementing and applying word2vec (implemented with TensorFlow 1)
├── .gitignore
├── README.md
├── gridsearch.py   # Parallelized gridsearchCV to find hyperparameters
└── main.py         # Generates the submission file
Data exploration
- EDA.ipynb
Data preprocessing
- preprocessing.ipynb
- utility/processing.py
Data augmentation
- utility/augment.py
- utility/eda.py
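As a rough illustration of what EDA-style augmentation does, the sketch below implements two of the four operations from the EDA paper (random swap and random deletion) in plain Python. It is not the code in utility/eda.py or utility/augment.py: the function names are hypothetical, and the `num_aug` and `alpha` parameters are only assumed to carry the same meaning as the command-line flags of the same name (number of augmented sentences per input, and the fraction of words changed). Synonym replacement and random insertion are left out because they need a synonym dictionary.

```python
import random


def random_swap(words, n, rng):
    """Swap two randomly chosen positions, n times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]


def augment(sentence, num_aug=4, alpha=0.05, seed=0):
    """Return num_aug augmented copies of sentence (hypothetical helper)."""
    rng = random.Random(seed)
    words = sentence.split()
    # Number of swaps scales with sentence length, with a floor of one.
    n = max(1, int(alpha * len(words)))
    augmented = []
    for k in range(num_aug):
        if k % 2 == 0:
            new_words = random_swap(words, n, rng)
        else:
            new_words = random_deletion(words, alpha, rng)
        augmented.append(" ".join(new_words))
    return augmented


if __name__ == "__main__":
    for line in augment("wireless bluetooth mouse for laptop", num_aug=4):
        print(line)
```

With `num_aug=0` the function returns an empty list, so the input sentence is left without augmented copies.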
User satisfaction classification modeling
- LSA.ipynb
- word2vec.ipynb
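The notebooks are not reproduced here. As a small illustration of the TF-IDF step named in the project description, the sketch below scores how close a search query is to a product title using TF-IDF weights and cosine similarity. It uses only the standard library, the helper names are hypothetical, and the example strings are made up; the project itself goes further (LSA and an SVM on top), which this sketch does not attempt.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF so that terms present in every document keep a nonzero weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: count * idf[t] for t, count in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Made-up query and product titles, for illustration only.
    docs = [
        "led christmas lights",
        "led christmas lights outdoor string",
        "leather office chair",
    ]
    query_vec, title_a, title_b = tfidf_vectors(docs)
    print(cosine(query_vec, title_a))  # shares terms with the query
    print(cosine(query_vec, title_b))  # shares no terms with the query
```

A similarity like this could serve as one input feature to a relevance classifier; whether the repository computes it this way is an assumption.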
Functions used in the model
See utility/README.md
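One of the metrics relevant to this task is quadratic weighted kappa, which scores agreement between predicted and true ordinal ratings. The sketch below is a plain-Python version for illustration, not the implementation in utility/utility.py; the function name is hypothetical, and the default rating range of 1 to 4 is an assumption about the relevance labels.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_rating=1, max_rating=4):
    """Quadratic weighted kappa for integer ratings in [min_rating, max_rating]."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    if n == 0:
        raise ValueError("at least one rating pair is required")
    k = max_rating - min_rating + 1

    # Observed confusion matrix: rows are true ratings, columns are predictions.
    observed = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_rating][p - min_rating] += 1

    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    numerator = 0.0
    denominator = 0.0
    for i in range(k):
        for j in range(k):
            # Disagreement penalty grows with the squared rating distance.
            weight = (i - j) ** 2 / (k - 1) ** 2
            # Expected count if true and predicted ratings were independent.
            expected = hist_true[i] * hist_pred[j] / n
            numerator += weight * observed[i][j]
            denominator += weight * expected

    # A zero denominator means both raters used one identical rating throughout.
    return 1.0 - numerator / denominator if denominator else 1.0


if __name__ == "__main__":
    print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4]))
    print(quadratic_weighted_kappa([1, 2, 3, 4], [4, 3, 2, 1]))
```

Perfect agreement gives 1, chance-level agreement gives about 0, and systematic disagreement gives negative values.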
Preprocess the data with processing.py, then run data augmentation (each augmentation hyperparameter follows the EDA paper). After augmentation, generate the submission with main.py.
python utility/processing.py --input=./data/train.csv --eda=True
python utility/augment.py --input=./data/eda/train_1.txt --num_aug=8 --alpha=0.05
python utility/augment.py --input=./data/eda/train_2.txt --num_aug=4 --alpha=0.05
python utility/augment.py --input=./data/eda/train_3.txt --num_aug=4 --alpha=0.05
python utility/augment.py --input=./data/eda/train_4.txt --num_aug=0
python utility/processing.py --input=./data/eda_train.csv
python main.py --mode=eda
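The commands above can also be chained from a single script. The sketch below is a hypothetical wrapper, not part of the repository: it rebuilds the same seven commands with the same paths and flags, and by default only prints them (`dry_run=True`) so nothing is executed until that is changed.

```python
import subprocess
import sys

# (input file, num_aug, alpha) for each augment.py call listed above.
# alpha=None means the flag is omitted, as in the train_4.txt command.
AUGMENT_SETTINGS = [
    ("./data/eda/train_1.txt", 8, 0.05),
    ("./data/eda/train_2.txt", 4, 0.05),
    ("./data/eda/train_3.txt", 4, 0.05),
    ("./data/eda/train_4.txt", 0, None),
]


def build_commands(python=sys.executable):
    """Return the pipeline as a list of argument lists, in execution order."""
    commands = [
        [python, "utility/processing.py", "--input=./data/train.csv", "--eda=True"]
    ]
    for path, num_aug, alpha in AUGMENT_SETTINGS:
        cmd = [python, "utility/augment.py", f"--input={path}", f"--num_aug={num_aug}"]
        if alpha is not None:
            cmd.append(f"--alpha={alpha}")
        commands.append(cmd)
    commands.append([python, "utility/processing.py", "--input=./data/eda_train.csv"])
    commands.append([python, "main.py", "--mode=eda"])
    return commands


def run_pipeline(dry_run=True):
    """Print each command; execute it too when dry_run is False."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Run it from the repository root so the relative paths resolve, and switch to `run_pipeline(dry_run=False)` once the printed commands look right.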
- Predicting the Relevance of Search Results for E-Commerce Systems
- Using TF-IDF to determine word relevance in document queries
- Classifying Positive or Negative Text Using Features Based on Opinion Words and Term Frequency - Inverse Document Frequency
- An introduction to latent semantic analysis
- Using Linear Algebra for Intelligent Information Retrieval
- EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
- Weighted kappa loss function for multi-class classification of ordinal data in deep learning