
Using recommender systems to fill in missing data

Recommender systems are often used to suggest candidate items such as books, movies, and music to a given user. If you have bought something from an online shop, you have probably received an email, perhaps treated as spam, listing items the shop recommends, and behind that email there must be a recommender system. You might wonder about such systems: how do they know what you would buy?

If you know the basic concepts of machine learning, you might say: surely a supervised learning algorithm can handle this problem. Yes, when you have huge amounts of data, you can train an algorithm with the ratings as output and the items and users as input. But such data are almost always sparse. There may be thousands of books and thousands of users, yet the ratings are not spread evenly; they form groups that are connected with each other. In other words, both the books and the users have their own structure. If you like novels, most of your rated books will be novels, and your rating for a dictionary will simply be missing. However, suppose we find another person whose ratings on novels are similar to yours and who has rated many dictionaries; then a reasonable guess for your rating on a dictionary would be similar to that person's. On the other hand, novels and dictionaries have inner features of their own, and those features can be used to group the books.

OK, so the original problem has been converted into a clustering problem on both the input and the output. After clustering, we train a model that connects the item groups with the user groups. An algorithm that updates the clustering of the input and the output at the same time would therefore be reasonable.

First, the original data form a matrix M, with rows standing for the books, columns standing for the users, and a rating in each cell. To make the algorithm run without error, we need a mask matrix R to identify the missing data (0 for missing, 1 for rated), so that M.*R gives a matrix with the missing cells zeroed out. Our task is to find two matrices, T and U: T holds the features of the items and U holds the corresponding features of the users. Their product gives a matrix N with the same shape as M, and we can train by minimizing the squared difference between N and M on the observed cells only. Once we have the final N, we can fill the gaps with reasonable values. That is, we get your predicted ratings for the books you have never read, sort them, and send you an email with the highest-rated ones.
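To make this concrete, here is a minimal sketch in Python/NumPy of the setup described above. The toy ratings, the number of latent features k, and all variable names are illustrative choices of mine, not part of any particular library:

```python
import numpy as np

# Toy ratings matrix M: rows are books, columns are users.
# np.nan marks a missing rating (the user never rated that book).
M = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, np.nan, 1.0],
    [np.nan, 1.0, 5.0, 4.0],
    [1.0, 2.0, 4.0, np.nan],
])

# Mask matrix R: 1 where a rating exists, 0 where it is missing.
R = (~np.isnan(M)).astype(float)
M = np.nan_to_num(M)  # zero out missing cells so M * R is well defined

def cost(T, U, M, R):
    """Squared error between N = T @ U and M, on the observed cells only."""
    N = T @ U
    return np.sum((R * (N - M)) ** 2)
```

With rows as books, T has one row of features per book and U one column per user, so their product N = T @ U has the same shape as M.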

You might ask why we must estimate two unknown matrices U and T instead of computing N directly. There is no reasonable algorithm to get N directly, while U and T can be found with an optimization algorithm: give U random initial values and optimize T, then use that T to optimize U, and repeat the cycle until both converge. Meanwhile, U and T can each be used on their own, to explore the inner groups of the books or to describe our potential users. In fact, collaborative filtering lets us update U and T at the same time.
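Here is a minimal sketch of that joint update, assuming plain gradient descent on the masked squared error with a small regularization term; the learning rate, regularization weight, and epoch count are arbitrary choices that may need tuning:

```python
def factorize(M, R, k=2, lr=0.01, lam=0.1, epochs=2000, seed=0):
    """Fit item features T (items x k) and user features U (k x users) by
    gradient descent on the masked squared error, updating both at once."""
    rng = np.random.default_rng(seed)
    n_items, n_users = M.shape
    T = rng.normal(scale=0.1, size=(n_items, k))
    U = rng.normal(scale=0.1, size=(k, n_users))
    for _ in range(epochs):
        E = R * (T @ U - M)                  # error on observed cells only
        grad_T = 2 * E @ U.T + 2 * lam * T   # gradient w.r.t. item features
        grad_U = 2 * T.T @ E + 2 * lam * U   # gradient w.r.t. user features
        T -= lr * grad_T
        U -= lr * grad_U
    return T, U

T, U = factorize(M, R)
N = T @ U  # predicted ratings, including the previously missing cells
```

After fitting, the rows of T can be clustered to find groups of similar books, and the columns of U to find groups of similar users, exactly the "inner groups" mentioned above.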

This idea comes from machine learning, but I think the same method could be used to fill in missing data for environmental research. For example, suppose we have data on the concentrations of 100 compounds across 100 cities, while not every city can analyze all 100 compounds: some analyze 60 and some analyze 80. Such a recommender system could then be used to predict the missing concentrations and to reveal group structure for both the compounds and the cities. But I doubt whether traditional environmental scientists would be comfortable with such machine learning ideas.
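As a purely hypothetical illustration with synthetic data (not real measurements), the same factorize sketch from above could be pointed at a compounds-by-cities concentration matrix:

```python
# Synthetic example: rows are compounds, columns are cities, and each
# city has only measured a subset of the compounds.
rng = np.random.default_rng(1)
A = rng.random((100, 5))    # hidden compound features (made up)
B = rng.random((5, 100))    # hidden city features (made up)
true_conc = A @ B           # "true" low-rank concentration matrix

R_env = (rng.random((100, 100)) < 0.7).astype(float)  # ~30% unmeasured
M_env = true_conc * R_env

T_env, U_env = factorize(M_env, R_env, k=5, lr=1e-3, epochs=5000)
filled = T_env @ U_env  # estimates for the unmeasured (compound, city) cells
```

The rows of T_env would then group compounds that co-occur, and the columns of U_env would group cities with similar pollution profiles.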