Group Members:
- Arjun Gupta
- Himanshu
- Aeshna Singh
- Palash Baranwal
ASSIGNMENT
- Train model we wrote in the class on the following two datasets taken from this link https://kili-technology.com/blog/chatbot-training-datasets/
- Once done, please upload the file to GitHub and proceed to share the things asked below:
- Share the link to your GitHub repo (100 pts for code quality/file structure/model accuracy) (100 pts)
- Share the link to your readme file (100 points for proper readme file)
- Copy-paste the code related to your dataset preparation for both datasets. (100 pts)
- If your model trains and gets to respectable accuracy (200 pts).
Please remember that the objective of this assignment is to learn how to write code step by steps, so I should be seeing your exploration steps.
Manually-generated factoid question/answer pairs with difficulty ratings from Wikipedia articles. Dataset includes articles, questions, and answers.
- There are three directories, one for each year of students: S08, S09, and S10.
- The file "question_answer_pairs.txt" contains the questions and answers.
- The first line of the file contains column names for the tab-separated data fields in the file.
- ArticleTitle | Question | Answer | DifficultyFromQuestioner | DifficultyFromAnswerer | ArticleFile
Data Fields:
- Field 1 is the name of the Wikipedia article from which questions and answers initially came.
- Field 2 is the question.
- Field 3 is the answer.
- Field 4 is the prescribed difficulty rating for the question as given to the question-writer.
- Field 6 is the relative path to the prefix of the article files. html files (.htm) and cleaned text (.txt) files are provided.
Data Samples:
- Abraham_Lincoln Who suggested Lincoln grow a beard? Grace Bedell. hard medium data/set3/a4
- Abraham_Lincoln When did the Gettysburg address argue that America was born? 1776 hard hard data/set3/a4
- Abraham_Lincoln When did the Gettysburg address argue that America was born? 1776. hard hard data/set3/a4
- Dataset consists of over 400,000 lines of potential question duplicate pairs.
- Each line contains IDs for each question in the pair, the full text for each question, and a binary value that indicates whether the line truly contains a duplicate pair.
Data Fields:
- id - the id of a training set question pair
- qid1, qid2 - unique ids of each question (only available in train.csv)
- question1, question2 - the full text of each question
- is_duplicate - the target variable, set to 1 if question1 and question2 have essentially the same meaning, and 0 otherwise.
Data Samples:
Encoder:
Decoder:
Attention Decoder:
- NLP FROM SCRATCH: TRANSLATION WITH A SEQUENCE TO SEQUENCE NETWORK AND ATTENTION: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
- https://colab.research.google.com/drive/1yyON_VSYneppLPlPVhUMtGcDYMqEKVba?usp=sharing
- Sequence to Sequence Learning with Neural Networks: https://arxiv.org/abs/1409.3215
- Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation: https://arxiv.org/abs/1406.1078
- Sequence to Sequence Learning with Neural Networks: https://arxiv.org/abs/1409.3215
- Neural Machine Translation by Jointly Learning to Align and Translate: https://arxiv.org/abs/1409.0473
- A Neural Conversational Model: https://arxiv.org/abs/1506.05869