Ask About C3 W2 Content based Filtering Data #51

Minhnhat0408 · 2024-11-12T07:03:46Z

Hey guys, I'm new to AI. I'm trying to use the content-based filtering code (the one that uses a neural network) in my app. I pulled the repo and ran the code successfully, but I find the data hard to understand, specifically content_item_train.csv, content_item_vecs.csv and content_user_train.csv.
When I look at content_item_train.csv and content_user_train.csv, they have the same 58187 rows (for model.fit), but why are there so many duplicates? For example, this chunk:

6874,2003,3.9618320610687023,1,0,0,0,0,0,0,0,0,0,0,0,0,0
6874,2003,3.9618320610687023,0,0,0,0,0,1,0,0,0,0,0,0,0,0
6874,2003,3.9618320610687023,0,0,0,0,0,0,0,0,0,0,0,0,0,1

This chunk is the set of one-hot vectors for movie ID 6874, but it is duplicated about 131 times in content_item_train.csv.
content_item_vecs.csv is just the train file without the duplicates. content_user_train.csv does not have duplicated chunks, but within each chunk every row is the same.

Hope you guys can answer it soon.

Answered by TheFalcon1990

TheFalcon1990 · 2024-11-13T17:24:21Z

After pulling the repo and running the content-based filtering code, I noticed the same oddity in the data files, especially content_item_train.csv and content_user_train.csv. Both files have 58,187 rows, which fits with model.fit needing the user input and the item input to have the same number of rows, but there is a lot of duplication and the reasoning behind it is not obvious.

Let me break down the files as I’ve come to understand them.

Content Item and User Files Overview

content_item_train.csv: This file stores the item (here, movie) feature vectors, or item profiles. The vectors encode attributes such as genres or other one-hot encoded movie features. For a specific movie ID like 6874, the rows are almost the same but differ in which one-hot column is set. The same movie ID shows up about 131 times, which looks like heavy duplication, although the rows within a chunk differ slightly.
content_user_train.csv: This file holds the user side of each user-item interaction. Each row lines up with one row of the item file, so together they describe one user-item combination. Within a user's chunk the rows are identical, because the same user vector is repeated once for every item row it is paired with. It seems designed to let the model learn how a user's preferences align with the features of different items.
content_item_vecs.csv: This file is a deduplicated version of content_item_train.csv, with one copy of each distinct item feature vector. It makes looking up item features easier because the repetitions are gone.
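
A quick way to check how the train file relates to the vecs file is to count how often each distinct row occurs. The sketch below uses a few inline rows copied from the question instead of the real CSV, and it assumes the first column is the movie ID and that there is no header row. The helper name `count_rows` is made up for illustration.

```python
import csv
import io
from collections import Counter

# Inline sample standing in for content_item_train.csv: the three rows from the
# question, with the first one repeated to mimic the duplication in the real file.
SAMPLE = """\
6874,2003,3.9618320610687023,1,0,0,0,0,0,0,0,0,0,0,0,0,0
6874,2003,3.9618320610687023,0,0,0,0,0,1,0,0,0,0,0,0,0,0
6874,2003,3.9618320610687023,0,0,0,0,0,0,0,0,0,0,0,0,0,1
6874,2003,3.9618320610687023,1,0,0,0,0,0,0,0,0,0,0,0,0,0
"""

def count_rows(text):
    """Return a Counter mapping each distinct CSV row (as a tuple) to its count."""
    reader = csv.reader(io.StringIO(text))
    return Counter(tuple(row) for row in reader if row)

row_counts = count_rows(SAMPLE)
unique_rows = list(row_counts)      # what a deduplicated "vecs" file would hold

rows_per_movie = Counter()
for row, n in row_counts.items():
    rows_per_movie[row[0]] += n     # assumption: column 0 is the movie ID

print(len(unique_rows), "unique rows out of", sum(row_counts.values()))
print(dict(rows_per_movie))
```

To run this against the real file, replace `io.StringIO(text)` with an opened file handle; the counting logic stays the same.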

Why Duplications Exist in `content_item_train.csv`

The reason we see these duplications might be tied to how the model is designed to train on a variety of representations:

One-Hot Encoding for Features: Each row represents the movie with a different one-hot vector. A movie like 6874 could belong to several genres, and each row in its chunk might stand for one of them, which would explain why the three sample rows differ only in the position of the 1.
Data Augmentation for Better Training: Repeating rows may act as a form of weighting or augmentation. An item that appears more often contributes more to the loss, and seeing it in several "views" could help the model generalize across item characteristics.
Contextual Representation: The repeated rows may help the model see each item in multiple contexts. For example, by seeing a movie represented under each of its genres, the model can connect it to several categories rather than to a single profile.
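
One reading of the three sample rows in the question is that a movie with several genres is written as one row per genre, each with a single 1 in the genre columns. The sketch below illustrates that reading only. The constant `NUM_GENRES`, the function `expand_by_genre`, and the idea that the one-hot columns are genres are assumptions, not something taken from the repo.

```python
NUM_GENRES = 14  # the sample rows in the question have 14 columns after the third field

def expand_by_genre(movie_id, year, avg_rating, genre_indices):
    """Build one row per genre, each with a single 1 in the genre columns."""
    rows = []
    for g in genre_indices:
        one_hot = [0] * NUM_GENRES
        one_hot[g] = 1
        rows.append([movie_id, year, avg_rating] + one_hot)
    return rows

# Positions 0, 5 and 13 match where the 1s fall in the question's sample rows.
rows = expand_by_genre(6874, 2003, 3.9618320610687023, [0, 5, 13])
for r in rows:
    print(",".join(str(v) for v in r))
```

If each rating of the movie contributed one such group of rows, a movie with many ratings would show up many times, which would fit with different movies repeating a different number of times. That is a guess about the generator, not something the files themselves confirm.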

Consistency in `content_user_train.csv`

In content_user_train.csv, every row within a user's chunk is the same, which differs from the item file's structure. The repeated user rows seem to exist so that each one can be paired with a different item row: row i of the user file and row i of the item file together form one user-item pair, and the user side stays constant across the chunk.
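
The equal row counts suggest that row i of the user file, row i of the item file, and label i together form one training example. A minimal sketch of that pairing follows; every value in it is a placeholder, not real data from the CSV files.

```python
# Three parallel lists standing in for content_user_train.csv,
# content_item_train.csv and y_train.csv. All values are placeholders.
user_rows = [
    [2, 3.9],   # hypothetical user features, repeated for each paired item row
    [2, 3.9],
    [2, 3.9],
]
item_rows = [
    [6874, 1, 0],   # hypothetical item features, one row per example
    [6874, 0, 1],
    [8798, 1, 0],
]
y = [4.0, 4.0, 3.5]

# The three inputs must have the same length so they can be paired row by row.
assert len(user_rows) == len(item_rows) == len(y)

examples = list(zip(user_rows, item_rows, y))
for user_vec, item_vec, rating in examples:
    print(user_vec, item_vec, rating)
```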

Why This Setup Helps Model Training

The duplication in these files appears to serve a purpose. By showing the model several representations of the same item (the slight variations in content_item_train.csv) alongside a consistent profile for each user, training exposes it to more ways in which item features can line up with user preferences. The intended result is a model that generalizes better across items and users.

As I keep working through this, I'll experiment with reducing the duplication to see whether it affects performance. For now, my reading is that these data patterns help stabilize and diversify training.
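
If you try removing duplicates, one thing to watch is that the user rows, item rows, and labels stay aligned. Dropping duplicates from each file separately would break the row-by-row pairing. The sketch below deduplicates whole examples instead, on placeholder data; `dedupe_examples` is a hypothetical helper, not part of the repo.

```python
def dedupe_examples(user_rows, item_rows, y):
    """Drop repeated (user, item, label) triples while keeping first-seen order."""
    seen = set()
    kept_users, kept_items, kept_y = [], [], []
    for u, i, label in zip(user_rows, item_rows, y):
        key = (tuple(u), tuple(i), label)
        if key in seen:
            continue
        seen.add(key)
        kept_users.append(u)
        kept_items.append(i)
        kept_y.append(label)
    return kept_users, kept_items, kept_y

# Placeholder data: the first example appears twice.
users = [[2, 3.9], [2, 3.9], [2, 3.9]]
items = [[6874, 1, 0], [6874, 1, 0], [6874, 0, 1]]
labels = [4.0, 4.0, 4.0]

kept_u, kept_i, kept_l = dedupe_examples(users, items, labels)
print(len(kept_l), "examples left out of", len(labels))
```

Removing repeats also changes how often each example contributes to the loss, so a different loss value after deduplication would not be surprising on its own.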


Minhnhat0408 · 2024-11-14T06:21:48Z


That actually seems to be true, because when I trained on only the unique chunks, the loss was a lot higher. But do you have the code that generates those files, or do you understand the pattern behind how they were generated? Some chunks are duplicated 131 times, some only 16 times, and y_train.csv doesn't make any sense to me either.
I tried to build the same kind of data files from my own database, but the results were quite inaccurate.
