Code samples in Python (demonstrating the domain class, part of a larger domain classifier system) and R (demonstrating a book recommender system using matrix factorization).
The purpose of this repository is to present two code samples that I completed within the past four months, in two separate folders: Python_Domain_classifier and R_recommender_system. The R code could be cloned and run, though its primary purpose is to serve as a sample. In the Python code, some elements have been left out in the spirit of demonstrating a sample.
I developed the domain classifier in response to a need to classify semantic domains (e.g., history, medical, technological) in a low-resource language context. The intent was to build a system that could work as a hybrid unsupervised/semi-supervised model.
The underlying principle is applying the K-NN algorithm (traditionally a supervised learning algorithm) to word vectors generated from co-occurrence within a corpus, where co-occurrence is defined as appearing in the same sentence.
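To make the idea concrete, here is a minimal sketch of sentence-level co-occurrence vectors queried with K-NN by cosine similarity. The function names (`build_cooccurrence`, `nearest_neighbors`) are illustrative assumptions, not the repository's own code:

```python
# Hypothetical sketch: sentence-level co-occurrence vectors and k-NN lookup
# by cosine similarity. Names are illustrative, not from the repository.
from collections import defaultdict
import math

def build_cooccurrence(sentences):
    """Map each word to counts of the other words it shares a sentence with."""
    vectors = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = set(sentence.split())
        for w in words:
            for other in words:
                if other != w:
                    vectors[w][other] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    shared = set(u) & set(v)
    dot = sum(u[k] * v[k] for k in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def nearest_neighbors(word, vectors, k=3):
    """Return the k most similar words to `word` by cosine similarity."""
    scores = [(other, cosine(vectors[word], vectors[other]))
              for other in vectors if other != word]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]
```

On a toy corpus, words that share sentence contexts (e.g. "king" and "queen" in parallel sentences) surface as each other's nearest neighbors, which is the property the domain classifier exploits.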
- `services.py` is a service layer that builds the various components of the model. Once the model is built and the domain component of the database is populated (via the `build_domains` function), a user can call `use_domain`, which takes each word in a given query sentence and returns a similarity score for the entire sentence relative to all identified domains.
- `domain_distributor.py` is the primary code for building the domains that will be populated. At a high level, it takes batches of words and sorts them into groupings of approximately 20 by word similarity, as defined by the previously built word vectors. These groupings are then folded into larger groupings while the identities of the smaller domains are maintained; the `domain.py` script builds the necessary data structure. This parent/child structure allows for dynamic sizing of domains. Additionally, `self.dtargets` allows for semi-supervised analysis, wherein a user pre-identifies a domain title and associated words.
- `taggers.py` is a part-of-speech tagger built with sklearn. While much of the system is designed for unsupervised contexts, this component works with traditional gold-standard labeled data, here from Universal Dependencies.
- I preserved the database folder and docker-compose file to indicate the relationship between the code and the desired output. `init.sql` contains the script necessary to initialize a MySQL database for this project.
- `sample_DM_domain_classifier.pdf` is an excerpt from the report I generated for this code. I highly recommend viewing it for an explained sample of the kind of output this code produces.
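The parent/child domain structure and the sentence-level scoring described above could be sketched roughly as follows. This is a hypothetical illustration under my own assumptions; only the names `Domain` and `dtargets` come from the description, and the `similarity` callable stands in for whatever word-vector similarity the system uses:

```python
# Hypothetical sketch of a parent/child domain tree with dynamic sizing,
# plus a use_domain-style sentence scorer. Details are assumptions, not
# the repository's actual implementation.
class Domain:
    def __init__(self, title=None, words=None):
        self.title = title              # optional pre-identified domain title
        self.words = list(words or [])  # words assigned directly to this domain
        self.children = []              # smaller domains folded into this one
        self.dtargets = {}              # semi-supervised seeds: title -> words

    def add_child(self, child):
        """Fold a smaller domain into this one while preserving its identity."""
        self.children.append(child)

    def all_words(self):
        """Collect words from this domain and every descendant, so the
        hierarchy can be read at any level of granularity."""
        collected = list(self.words)
        for child in self.children:
            collected.extend(child.all_words())
        return collected

def score_sentence(sentence, domains, similarity):
    """Score a whole query sentence against each domain: for every word,
    take its best similarity to any domain member, then average."""
    words = sentence.split()
    scores = {}
    for domain in domains:
        members = domain.all_words()
        total = sum(max((similarity(w, m) for m in members), default=0.0)
                    for w in words)
        scores[domain.title] = total / len(words) if words else 0.0
    return scores
```

Because `all_words` recurses through children, a query can be scored against a broad parent domain or any of its narrower subdomains without rebuilding anything.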
This recommender system uses real-world book review data from Kaggle.com; the goal I wanted to accomplish was to create a book recommendation system.
The primary goals of this project were the following: to create a final project for a data mining course, to demonstrate and practice skills in R, to explore matrix factorization, and to explore feature selection techniques.
I want to acknowledge that Denise Chen's article on matrix factorization helped me significantly in translating the math to code.
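For readers unfamiliar with the technique, here is a minimal matrix-factorization sketch in the style this project explores: stochastic gradient descent on the observed entries of a small user-by-book rating matrix (with 0 marking "unrated"). This is a generic illustration of the method, written in Python for brevity, and not the actual code in `main_dlm.Rmd`:

```python
# Generic matrix-factorization sketch (SGD over observed ratings), assuming
# a small ratings matrix with 0 meaning "unrated". Illustrative only; the
# project's R implementation may differ.
import numpy as np

def factorize(R, k=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Learn user factors P and item factors Q so that P @ Q.T approximates
    the observed (nonzero) entries of R."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, k))
    Q = rng.normal(scale=0.1, size=(n_items, k))
    observed = [(u, i) for u in range(n_users)
                for i in range(n_items) if R[u, i] > 0]
    for _ in range(steps):
        for u, i in observed:
            err = R[u, i] - P[u] @ Q[i]
            p_u = P[u].copy()
            P[u] += lr * (err * Q[i] - reg * P[u])   # update user factors
            Q[i] += lr * (err * p_u - reg * Q[i])    # update item factors
    return P, Q
```

After training, `P @ Q.T` fills in the unrated cells, and the highest predicted scores among a user's unrated books become that user's recommendations.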
- Data is a folder containing the data derived from Kaggle.com, which is only necessary if a user wants to run the code.
- `main_dlm.html` is the rendered version of the code; at the bottom it demonstrates a recommendation set for a single user. Though it was made for a different context, the demo video might be helpful in understanding each block of code. Please bear in mind that the video was designed within the context of the relevant course.
- `main_dlm.Rmd` is the code as written in R.
Both of these projects represent significant investments of time and learning on my part. They are a recent snapshot in time that demonstrates my passion for these topics and my desire to grow rapidly in skill and understanding in these areas.
Thank you!