Searches for class mentions on the /r/UCSC subreddit, then looks up and posts class information.
This bot lives on the reddit account /u/ucsc-class-info-bot.
In online discussions about UCSC classes, people often post a department code (two to four letters) followed by a class number of up to three digits.
Experienced students in a department will know what a class is just from the number. But new students would need to look up a class number to participate in the discussion. Additionally, students outside the department probably have no idea which class is being discussed.
I always liked reading the descriptions of classes people refer to, so I wrote this bot to automatically post them.
From here on I use 'course' instead of 'class' because `class` is a reserved Python keyword.
A course mention occurs when a redditor names one or more courses in a post or comment. See section mention types.
A course object is an instance of the `Course` class from `db_core.py`. A course object contains a course's department, number, name, and description.
A department code is a string of two to four letters that abbreviates a department's name. For example, `CMPS` is the department code for Computer Science.
A course number is a string, not an integer, because a course number might have a letter at the end. For example, `112` and `12A` are both course numbers.
The course database uses a course's department and number to look up that course's name and description. In other words, we input a course mention and get a `Course` object. The files `db_core.py` and `db_extra.py` create the database.
The database stores a Pickled instance of `CourseDatabase`, which has a dict mapping a department code string to a `Department` instance. A `Department` instance has a dict mapping a course number to a `Course` instance. A `Course` instance has department, number, name, description. The relationship between these structures is illustrated below.
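The nesting described above can be sketched in a few lines. This is only an illustration: the attribute names and the `add()` and `lookup()` helpers are assumptions, and the real classes in `db_core.py` may look different.

```python
from dataclasses import dataclass, field


@dataclass
class Course:
    department: str   # department code, e.g. "BME"
    number: str       # a string, because numbers like "12A" exist
    name: str
    description: str


@dataclass
class Department:
    # Maps a course number string to a Course (attribute name is an assumption).
    courses: dict = field(default_factory=dict)


@dataclass
class CourseDatabase:
    # Maps a department code string to a Department (attribute name is an assumption).
    departments: dict = field(default_factory=dict)

    def add(self, course):
        """Hypothetical helper: file a course under its department code."""
        dept = self.departments.setdefault(course.department.upper(), Department())
        dept.courses[course.number.upper()] = course

    def lookup(self, dept_code, number):
        """Hypothetical helper: return the Course, or None if it is unknown."""
        dept = self.departments.get(dept_code.upper())
        return dept.courses.get(number.upper()) if dept else None


db = CourseDatabase()
db.add(Course("BME", "110", "Computational Biology Tools", "(description omitted)"))
```

With this layout, `db.lookup("bme", "110")` returns the stored `Course`, and an unknown department or number returns `None`.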
You can see the log from building the database at [misc/db build log.txt](misc/db%20build%20log.txt). You can see the database's contents at [misc/db print.txt](misc/db%20print.txt).
I had to try a few ways to make the database work. HTML parsing in each attempt is done by Beautiful Soup.
My original idea for scraping course info was through the class search page. This works but is a pain in the ass because I need to send a POST request and parse the returned HTML page. Also, it was not suitable for building the database because the class search page only lists courses offered in the current quarter.
The implementation is preserved in misc/get_course_info.py for your viewing pleasure.
My second idea was to scrape course info from the website of each academic department. There were multiple problems.
First, different departments put their course catalogs on different URLs. Each of these departments uses a slightly different (but tantalizingly similar) URL pattern: Chemistry, History, Mathematics, Linguistics, Anthropology.
Second, some courses appear in a department that doesn't match their department code. For example, classes in Chinese (CHIN), French (FREN), and German (GERM) are all listed on the Language department's page.
Third, some departments use a custom layout to list course info. For example, compare the standard layout used by the History department to the custom layouts used by the Art department and the School of Engineering.
All of these aspects would've made scraping extremely difficult.
The third version works. The UCSC Registrar lists every course in every department with a beautifully consistent URL: `http://registrar.ucsc.edu/catalog/programs-courses/course-descriptions/<DEPARTMENT_CODE>.html`. Users can go to `index.html` and choose a department on the left (scroll down).
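Because the URL pattern is so regular, building a page address is a one-liner. The sketch below assumes that page names are lowercase (as in `lit.html`); the function name `registrar_url()` is made up for illustration.

```python
REGISTRAR_URL = (
    "http://registrar.ucsc.edu/catalog/programs-courses/"
    "course-descriptions/{page}.html"
)


def registrar_url(page_name):
    """Return the Registrar course-description URL for one department page.

    Assumption: page names are the lowercase form of the department code.
    """
    return REGISTRAR_URL.format(page=page_name.lower())


print(registrar_url("CMPS"))
```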
This option is clearly the best. I didn't use it from the beginning because it was hard to find: from the Registrar homepage, you need to click on Quick Start Guide > Catalog > Programs and Courses > Course Descriptions.
Even though the Registrar's website is mostly well-organized, some things are broken. Read more in the next section.
The file `db_core.py` handles almost every department when scraping the Registrar's site, but `db_extra.py` is needed to handle the following four special cases.
Some of these special cases have since been fixed on the Registrar's website.
First, some courses are indented in their own paragraph. For example, Psychology 118A-D are all indented under the header for 118.
The functions `is_next_p_indented()` and `in_indented_paragraph()` check for this case, and additional logic compensates.
→ The Registrar website has fixed this. It seems all sub-departments have been combined into the Literature department. You can see what the Literature page used to look like here.
Second, the Literature department contains courses from multiple department codes. For example, Creative Writing (LTCR) and Latin Literature (LTIN) classes are both under `lit.html`.
The page uses subdepartment names but we care about subdepartment codes, so the dict `lit_department_codes` maps names to codes. For example, "Modern Literary Studies" maps to "LTMO" and "Greek Literature" maps to "LTGR".
Consequently the lit page is scraped by its own function, `get_lit_depts()`, with help from the function `get_real_lit_dept()`.
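A name-to-code mapping of this kind might look like the following. Only the four entries mentioned above are shown, the exact heading text for Creative Writing and Latin Literature is an assumption, and `lit_code_for()` is a hypothetical stand-in rather than the real `get_real_lit_dept()`.

```python
# Partial mapping from subdepartment names to department codes.
lit_department_codes = {
    "Modern Literary Studies": "LTMO",
    "Greek Literature": "LTGR",
    "Creative Writing": "LTCR",      # heading text is an assumption
    "Latin Literature": "LTIN",      # heading text is an assumption
}


def lit_code_for(subdepartment_name):
    """Hypothetical helper: return the code for a heading, or None if unknown."""
    return lit_department_codes.get(subdepartment_name.strip())
```

Returning `None` for an unknown heading lets the caller decide whether to skip the section or log it.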
→ The Registrar website has fixed this.
Third, some departments deviate from the standard HTML layout.
For almost every department, key information about a course is contained in three `<strong>` tags. Here's an example from Biomolecular Engineering (BME):
<strong>110.</strong>
<strong>Computational Biology Tools.</strong>
<strong>F,W</strong>
To build the database, I begin by looking for `<strong>` tags containing a course number followed by a period. (The "F,W" lists the quarters in which the course is offered, here fall and winter.)
However, one single department does this differently. College Eight (CLEI) puts the entire header in one `<strong>` tag:
<strong>81C. Designing a Sustainable Future. S</strong>
→ You can see what the College Eight page used to look like here.
So, there's one stupid special case.
Furthermore, two departments miss the first `<strong>` tag. The first courses on the German and Economics pages look like this:
1. <strong>First-Year German.</strong>
<strong>F</strong>
I only look for course numbers inside of `<strong>` tags, so course 1 gets left out. There's another stupid special case.
→ You can see what the German page used to look like here. You can see what the Economics page used to look like here.
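To make the three header layouts concrete, here is a rough sketch that pulls the course number and title out of each one. It uses plain regular expressions from the standard library instead of Beautiful Soup so that it runs on its own; the pattern names, the `parse_header()` function, and the assumption that the last field holds only the letters F, W, S and commas are all mine, not taken from `db_core.py` or `db_extra.py`.

```python
import re

# Standard layout: number and title each sit in their own <strong> tag.
_STANDARD = re.compile(
    r"<strong>(\d+[A-Z]?)\.</strong>\s*<strong>(.+?)\.</strong>", re.S)
# College Eight layout: number, title, and quarters share one <strong> tag.
_SINGLE_TAG = re.compile(
    r"<strong>(\d+[A-Z]?)\.\s+(.+?)\.\s+[FWS,]+</strong>")
# German/Economics layout: the number sits outside the first <strong> tag.
_BARE_NUMBER = re.compile(
    r"^\s*(\d+[A-Z]?)\.\s*<strong>(.+?)\.</strong>", re.M)


def parse_header(html):
    """Return (course_number, title) for a header snippet, or None."""
    for pattern in (_STANDARD, _SINGLE_TAG, _BARE_NUMBER):
        match = pattern.search(html)
        if match:
            return match.group(1), match.group(2)
    return None


standard = ("<strong>110.</strong>\n"
            "<strong>Computational Biology Tools.</strong>\n"
            "<strong>F,W</strong>")
single = "<strong>81C. Designing a Sustainable Future. S</strong>"
bare = "1. <strong>First-Year German.</strong>\n<strong>F</strong>"

for snippet in (standard, single, bare):
    print(parse_header(snippet))
```

Trying the standard pattern first keeps the common case cheap; the two fallbacks only run for the odd pages.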
Fourth, the latest special cases arise from inconsistent department naming.
The Registrar's page for the Ecology and Evolutionary Biology department is on `eeb.html`, but the class search reveals that those courses use the department code `BIOE`.
Similarly, the Registrar listing for the Molecular, Cell, and Developmental Biology department is on `mcdb.html`, but the courses use the department code `BIOL`.
Two more conditionals address this issue.
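The same fix could also be written as a small lookup table instead of conditionals. This is a sketch of that alternative, not the code in `db_extra.py`; the names `PAGE_TO_DEPT_CODE` and `dept_code_for_page()` are hypothetical.

```python
# Pages whose file name differs from the department code their courses use.
PAGE_TO_DEPT_CODE = {
    "eeb": "BIOE",
    "mcdb": "BIOL",
}


def dept_code_for_page(page_name):
    """Return the department code for a Registrar page name.

    Falls back to the uppercased page name when there is no special case.
    """
    return PAGE_TO_DEPT_CODE.get(page_name.lower(), page_name.upper())
```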
A course mention occurs when a redditor names one or more courses in a Reddit post or comment.
I pulled the list of department codes from the source of the class search page, in the element `<select id="subject">`. Unfortunately this list includes defunct and renamed departments. For example, the Arabic department (ARAB) is gone and Environmental Toxicology (ETOX) is now Microbiology and Environmental Toxicology (METX). All the presently available departments appear in the regular expression `_pattern_depts`.
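One way to build such a department pattern from a list of codes is sketched below. The list is a small hypothetical subset, and `build_dept_pattern()` is an illustrative name; the real `_pattern_depts` may be written out by hand.

```python
import re

# Hypothetical subset of current department codes; ARAB and ETOX are left out.
DEPARTMENT_CODES = ["CMPS", "ECON", "MATH", "CHEM", "METX", "CE"]


def build_dept_pattern(codes):
    """Return a regex alternation that matches any one of the given codes."""
    # Longest codes first, so a short code cannot shadow a longer one.
    ordered = sorted(codes, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(code) for code in ordered) + ")"


dept_pattern = re.compile(
    r"\b" + build_dept_pattern(DEPARTMENT_CODES) + r"\b", re.IGNORECASE)
```

Compiling with `re.IGNORECASE` means "metx" and "METX" both match.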
This bot can see these three types of mention, all case-insensitive. Recall that a course number is actually a string because it may contain one optional letter at the end.
- Normal mention: department code, optional space, and course number.
    - For example, "CMPS 12B" and "econ105" are normal mentions.
    - Specified by regex `_pattern_mention_normal`.
- Multi-mention: shorthand for multiple courses in the same department with different course numbers.
    - For example, "Math 21, 23b, and 100" is a multi-mention containing Math 21, Math 23B, and Math 100.
    - Not specified by a single regex. The function `_parse_multi_mention()` splits a multi-mention into normal mentions.
- Letter-list mention: shorthand for multiple courses in the same department, where the course number has the same numeric part but different letters.
    - For example, "CE 129A/B/C" is a letter-list mention containing CE 129A, CE 129B, and CE 129C.
    - Specified by regex `_pattern_mention_letter_list`.
    - Function `_parse_letter_list()` splits a letter-list mention into normal mentions.
    - You can have a letter-list mention inside a multi-mention! For example, the string "CS 8a, 15, and 163x/y/z" contains CS 8A, CS 15, CS 163X, CS 163Y, and CS 163Z.
Five regular expressions are combined to form the gigantic regular expression `_pattern_final`, which is used to search strings.
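To show how the three mention types fit together, here is a compact sketch that handles all of them with one combined pattern. It is not the bot's actual `_pattern_final`: the department list is a tiny hypothetical subset, the separators (", ", " and ", " or ") are an assumption, and the function names are invented.

```python
import re

# Assumption: a tiny subset of department codes, just for this example.
_DEPTS = r"cmps|econ|math|ce|cs"
# A course number: digits, then an optional letter with an optional "/b/c" list.
_NUM = r"\d{1,3}(?:[a-z](?:/[a-z])*)?"
# Assumption: numbers in a multi-mention are joined by ", ", " and ", or " or ".
_SEP = r"(?:, (?:and |or )?| and | or )"
_MENTION = re.compile(
    rf"\b({_DEPTS}) ?({_NUM}(?:{_SEP}{_NUM})*)\b", re.IGNORECASE)


def expand_letter_list(token):
    """Turn "129a/b/c" into ["129A", "129B", "129C"]; pass other numbers through."""
    parts = token.upper().split("/")
    if len(parts) == 1:
        return parts
    digits = parts[0][:-1]  # the first part ends with the first letter
    return [parts[0]] + [digits + letter for letter in parts[1:]]


def parse_mentions(text):
    """Return normalized mentions such as "MATH 23B", in order of appearance."""
    mentions = []
    for dept, numbers in _MENTION.findall(text):
        for token in re.findall(_NUM, numbers, re.IGNORECASE):
            for number in expand_letter_list(token):
                mention = f"{dept.upper()} {number}"
                if mention not in mentions:
                    mentions.append(mention)
    return mentions


print(parse_mentions("CS 8a, 15, and 163x/y/z"))
```

A pattern this loose will produce false positives (for example "math 21 and 3 other classes"), which is one reason the real bot splits the work across several regexes and parsing functions.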
In the file `mention_search_posts.py`, the function `find_mentions()` gets new posts from /r/UCSC then parses everything using `mention_parse.py`.
If `find_mentions()` is called from `reddit_bot.py`, it returns a `PostWithMentions` instance to be immediately processed; if `mention_search_posts.py` is run on its own from the Python console, the result is Pickled (serialized) and saved to disk. The `PostWithMentions` class is a container which holds the ID of a submission and a list of course mentions found in that submission.
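A container like this, plus the save and load steps, could be sketched as follows. The field names and the two helper functions are assumptions for illustration; the real class and file handling may differ.

```python
import pickle
from dataclasses import dataclass, field


@dataclass
class PostWithMentions:
    post_id: str                                   # ID of the submission
    mentions: list = field(default_factory=list)   # e.g. ["CMPS 12B", "MATH 21"]


def save_found_mentions(posts, path):
    """Hypothetical helper: pickle a list of PostWithMentions to disk."""
    with open(path, "wb") as f:
        pickle.dump(posts, f)


def load_found_mentions(path):
    """Hypothetical helper: load the list saved by save_found_mentions()."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Note that pickle can only load the file back if the `PostWithMentions` class is importable under the same module name it had when the file was written.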
If `post_comments.py` is run on its own from the Python console, it loads mentions found from the last run of `mention_search_posts.py`. If the function `post_comments()` is called from `reddit_bot.py`, data about found mentions is passed directly as a parameter to the function. [Those function names are outdated]
If a post doesn't already have a comment by /u/ucsc-class-info-bot, add one. If it does already have a comment, compare the mentions most recently found with the mentions that are already in the comment. If there are new ones, update the comment.
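The decision described above boils down to a small comparison. The sketch below leaves out all Reddit API calls and only shows the logic; `plan_comment()` and its return values are hypothetical names.

```python
def plan_comment(existing_mentions, found_mentions):
    """Decide what to do with a post.

    existing_mentions: mentions already in the bot's comment, or None if the
        bot has not commented on the post yet.
    found_mentions: mentions found in the most recent search.
    Returns (action, mentions), where action is "post", "update", or "skip".
    """
    if existing_mentions is None:
        if found_mentions:
            return "post", list(found_mentions)
        return "skip", []
    new = [m for m in found_mentions if m not in existing_mentions]
    if new:
        return "update", list(existing_mentions) + new
    return "skip", list(existing_mentions)
```

For example, `plan_comment(["CMPS 12B"], ["CMPS 12B", "MATH 21"])` asks for an update that covers both courses.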
- Since the Registrar fixed some of the HTML special cases, my scraping script has been broken.
- In the comments posted by this bot, classes are sorted by department name (I think) instead of by order mentioned.
- I might make the bot see mentions of some department names instead of department codes, e.g. "chemistry 103" instead of "chem 103".