Code samples in Python (demonstrating the domain class, part of a larger domain classifier system) and R (demonstrating a book recommender system using matrix factorization).
The purpose of this repository is to present two code samples that I completed within the past four months, in two separate folders: Python_Domain_classifier and R_recommender_system. The R code could be cloned and run, though its primary purpose is to serve as a sample. In the Python code, some elements have been left out in the spirit of demonstrating a sample.
I developed the domain classifier in response to a need to classify semantic domains (e.g., history, medical, technological) in a low-resource language context. The intent was to build a system that could work as a hybrid unsupervised/semi-supervised model.
The underlying principle is applying the K-NN algorithm (traditionally a supervised learning algorithm) to word vectors generated from co-occurrence within a corpus, where co-occurrence is defined as appearing in the same sentence.
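To make the idea concrete, here is a minimal sketch of sentence-level co-occurrence vectors queried with K-NN by cosine similarity. The function names (`build_cooccurrence`, `nearest_neighbors`) are illustrative assumptions, not the repository's own code:

```python
# Hypothetical sketch: sentence-level co-occurrence vectors and k-NN lookup
# by cosine similarity. Names are illustrative, not from the repository.
from collections import defaultdict
import math

def build_cooccurrence(sentences):
    """Map each word to counts of the other words it shares a sentence with."""
    vectors = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = set(sentence.split())
        for w in words:
            for other in words:
                if other != w:
                    vectors[w][other] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    shared = set(u) & set(v)
    dot = sum(u[k] * v[k] for k in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def nearest_neighbors(word, vectors, k=3):
    """Return the k most similar words to `word` by cosine similarity."""
    scores = [(other, cosine(vectors[word], vectors[other]))
              for other in vectors if other != word]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]
```

On a toy corpus, words that share sentence contexts (e.g. "king" and "queen" in parallel sentences) surface as each other's nearest neighbors, which is the property the domain classifier exploits.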
- `services.py` is a service layer that builds the various components of the model. Once the model is built and the domain component of the database is populated (via the `build_domains` function), a user can call `use_domain`, which takes each word in a given query sentence and returns a similarity score for the entire sentence relative to all identified domains.
- `domain_distributor.py` is the primary code for building the domains that will be populated. At a high level, it takes batches of words and sorts them into groupings of approximately 20 by word similarity, as defined by the previously built word vectors. These groupings are then folded into larger groupings while the identities of the smaller domains are maintained; the `domain.py` script builds the necessary data structure. This parent/child structure allows for dynamic sizing of domains. Additionally, `self.dtargets` allows for semi-supervised analysis, wherein a user pre-identifies a domain title and associated words.
- `taggers.py` is a part-of-speech tagger built with sklearn. While much of the system is designed for unsupervised contexts, this component works with traditional gold-standard labeled data, here from Universal Dependencies.
- I preserved the database folder and docker-compose file to indicate the relationship between the code and the desired output. `init.sql` contains the script necessary to initialize a MySQL database for this project.
- `sample_DM_domain_classifier.pdf` is an excerpt from the report I generated for this code. I highly recommend viewing it for an explained sample of the kind of output this code produces.
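The parent/child domain structure and the sentence-level scoring described above could be sketched roughly as follows. This is a hypothetical illustration under my own assumptions; only the names `Domain` and `dtargets` come from the description, and the `similarity` callable stands in for whatever word-vector similarity the system uses:

```python
# Hypothetical sketch of a parent/child domain tree with dynamic sizing,
# plus a use_domain-style sentence scorer. Details are assumptions, not
# the repository's actual implementation.
class Domain:
    def __init__(self, title=None, words=None):
        self.title = title              # optional pre-identified domain title
        self.words = list(words or [])  # words assigned directly to this domain
        self.children = []              # smaller domains folded into this one
        self.dtargets = {}              # semi-supervised seeds: title -> words

    def add_child(self, child):
        """Fold a smaller domain into this one while preserving its identity."""
        self.children.append(child)

    def all_words(self):
        """Collect words from this domain and every descendant, so the
        hierarchy can be read at any level of granularity."""
        collected = list(self.words)
        for child in self.children:
            collected.extend(child.all_words())
        return collected

def score_sentence(sentence, domains, similarity):
    """Score a whole query sentence against each domain: for every word,
    take its best similarity to any domain member, then average."""
    words = sentence.split()
    scores = {}
    for domain in domains:
        members = domain.all_words()
        total = sum(max((similarity(w, m) for m in members), default=0.0)
                    for w in words)
        scores[domain.title] = total / len(words) if words else 0.0
    return scores
```

Because `all_words` recurses through children, a query can be scored against a broad parent domain or any of its narrower subdomains without rebuilding anything.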
This recommender system uses real-world book review data from Kaggle.com; the goal I wanted to accomplish was to create a book recommendation system.
The primary goals of this project were the following: to create a final project for a data mining course, to demonstrate and practice skills in R, to explore matrix factorization, and to explore feature selection techniques.
I want to acknowledge that Denise Chen's article on matrix factorization helped me significantly in translating the math to code.
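For readers unfamiliar with the technique, here is a minimal matrix-factorization sketch in the style this project explores: stochastic gradient descent on the observed entries of a small user-by-book rating matrix (with 0 marking "unrated"). This is a generic illustration of the method, written in Python for brevity, and not the actual code in `main_dlm.Rmd`:

```python
# Generic matrix-factorization sketch (SGD over observed ratings), assuming
# a small ratings matrix with 0 meaning "unrated". Illustrative only; the
# project's R implementation may differ.
import numpy as np

def factorize(R, k=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Learn user factors P and item factors Q so that P @ Q.T approximates
    the observed (nonzero) entries of R."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, k))
    Q = rng.normal(scale=0.1, size=(n_items, k))
    observed = [(u, i) for u in range(n_users)
                for i in range(n_items) if R[u, i] > 0]
    for _ in range(steps):
        for u, i in observed:
            err = R[u, i] - P[u] @ Q[i]
            p_u = P[u].copy()
            P[u] += lr * (err * Q[i] - reg * P[u])   # update user factors
            Q[i] += lr * (err * p_u - reg * Q[i])    # update item factors
    return P, Q
```

After training, `P @ Q.T` fills in the unrated cells, and the highest predicted scores among a user's unrated books become that user's recommendations.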
- Data is a folder containing the data derived from Kaggle.com, which is only necessary if a user wants to run the code.
- `main_dlm.html` is the rendered version of the code; at the bottom it demonstrates a recommendation set for a single user. Though it was made for a different context, the demo video might be helpful in understanding each block of code. Please bear in mind that the video was designed within the context of the relevant course.
- `main_dlm.Rmd` is the code as written in R.
Both of these projects represent significant investments of time and learning on my part. They are a recent snapshot in time that demonstrates my passion for these topics and my desire to grow rapidly in skill and understanding in these areas.
Thank you!