1.Introduction
Surprise is a Python library for building and analyzing recommender systems.
Five ideas behind Surprise's design:
- Give users full control over their experiments; to this end, the documentation tries to describe each algorithm clearly and precisely.
- Provide datasets: users can use the datasets that ship with the library, or their own.
- Provide various ready-to-use prediction algorithms, such as baseline algorithms, neighborhood methods, and matrix-factorization-based methods (SVD, PMF, SVD++, NMF), among others. Various similarity measures (cosine, MSD, Pearson, etc.) are also built in.
- Make it easy to implement new algorithm ideas.
- Provide tools to evaluate, analyze, and compare algorithm performance. Cross-validation procedures can be run easily using powerful CV iterators (inspired by scikit-learn's excellent tools), as can exhaustive searches over a set of parameters.
2.Basic usage
# Basic usage
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate

# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')

# We'll use the famous SVD algorithm.
algo = SVD()

# Run 5-fold cross-validation and print results.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
3.Using prediction algorithms
Prediction algorithms live in surprise/prediction_algorithms; the algo_base module holds the base class AlgoBase, which implements the key methods predict, fit, and test.
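Since every algorithm derives from AlgoBase, writing your own is mostly a matter of overriding fit() and estimate(). A minimal sketch (the GlobalMean class name and its always-predict-the-mean behavior are illustrative, not part of the library):

from surprise import AlgoBase, Dataset
from surprise.model_selection import cross_validate

class GlobalMean(AlgoBase):
    """Toy algorithm that always predicts the global mean rating."""

    def fit(self, trainset):
        # Always call the base class fit() first.
        AlgoBase.fit(self, trainset)
        self.the_mean = trainset.global_mean
        return self

    def estimate(self, u, i):
        # u and i are inner ids; this toy estimator ignores them.
        return self.the_mean

data = Dataset.load_builtin('ml-100k')
cross_validate(GlobalMean(), data, verbose=True)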
4.Train-test split and the fit() method(examples/train_test_split.py)
# Splitting the dataset into a trainset and a testset
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split

# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')

# Sample a random trainset and testset.
# The test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)

# We'll use the famous SVD algorithm.
algo = SVD()

# Train the algorithm on the trainset, and predict ratings for the testset.
algo.fit(trainset)
predictions = algo.test(testset)

# The two lines above can be chained into one:
# predictions = algo.fit(trainset).test(testset)

# Then compute RMSE.
accuracy.rmse(predictions)
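algo.test() returns a list of Prediction objects (named tuples with fields uid, iid, r_ui, est, and details), so individual predictions can be inspected directly. A small sketch, reusing the predictions variable from the block above:

# Look at the first prediction: true rating (r_ui) vs. estimate (est).
p = predictions[0]
print(p.uid, p.iid, p.r_ui, p.est)

# Find the prediction with the largest absolute error.
worst = max(predictions, key=lambda q: abs(q.r_ui - q.est))
print(worst)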
5.Train on the whole trainset and the predict() method(examples/predict_ratings.py)
# Predicting a single rating
from surprise import KNNBasic
from surprise import Dataset

# Load the movielens-100k dataset.
data = Dataset.load_builtin('ml-100k')

# Retrieve the trainset.
trainset = data.build_full_trainset()

# Build an algorithm, and train it.
algo = KNNBasic()
algo.fit(trainset)

# Predict the rating that user 196 would give to item 302.
# Raw ids are strings; r_ui=4 is the known true rating, passed for display.
uid = str(196)
iid = str(302)
pred = algo.predict(uid, iid, r_ui=4, verbose=True)
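The returned Prediction object carries the estimate in its est field; a brief sketch, continuing from the pred above:

# The estimated rating itself.
print(pred.est)
# pred.details records extra information, e.g. whether the prediction
# was impossible (unknown user or item).
print(pred.details)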
6.Use a custom dataset(examples/load_custom_dataset.py)
# Loading a custom dataset from a file
import os

from surprise import BaselineOnly
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate

# Path to the dataset file.
file_path = os.path.expanduser('~/.surprise_data/ml-100k/u.data')

# As we're loading a custom dataset, we need to define a reader. In the
# movielens-100k dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format='user item rating timestamp', sep='\t')

data = Dataset.load_from_file(file_path, reader=reader)

# We can now use this dataset as we please.
cross_validate(BaselineOnly(), data, verbose=True)
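Besides files, a dataset can also be built from a pandas DataFrame via Dataset.load_from_df(). A minimal sketch, assuming a toy DataFrame (the column names and the 1-5 rating_scale are illustrative choices):

import pandas as pd

from surprise import BaselineOnly, Dataset, Reader
from surprise.model_selection import cross_validate

# A toy ratings frame; columns must be passed in user, item, rating order.
ratings_dict = {'userID': [9, 32, 2, 45, 9],
                'itemID': [1, 1, 1, 2, 2],
                'rating': [3, 2, 4, 3, 1]}
df = pd.DataFrame(ratings_dict)

# For load_from_df, the Reader only needs the rating scale.
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)

cross_validate(BaselineOnly(), data, cv=2, verbose=True)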
7.Use cross-validation iterators
# Using cross-validation iterators
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import KFold

# Load the movielens-100k dataset.
data = Dataset.load_builtin('ml-100k')

# Define a cross-validation iterator.
kf = KFold(n_splits=3)

algo = SVD()

for trainset, testset in kf.split(data):
    # Train and test the algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print RMSE.
    accuracy.rmse(predictions, verbose=True)
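A CV iterator can also be handed directly to cross_validate() through its cv parameter, instead of writing the loop by hand; a short sketch reusing kf, algo, and data from above:

from surprise.model_selection import cross_validate

# Equivalent to the manual loop above, with results collected for us.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=kf, verbose=True)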
8.Tune algorithm parameters with GridSearchCV
# Searching for the best parameters
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV

# Use movielens-100k.
data = Dataset.load_builtin('ml-100k')

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

# Best RMSE score, and the parameters that gave it.
print(gs.best_score['rmse'])
print(gs.best_params['rmse'])

# We can now use the algorithm that yields the best RMSE,
# retrained on the whole dataset.
algo = gs.best_estimator['rmse']
algo.fit(data.build_full_trainset())
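Beyond best_score and best_params, GridSearchCV keeps per-combination metrics in its cv_results attribute, which is convenient to dump into a pandas DataFrame; a small sketch, assuming the gs object above (the column names follow the scikit-learn convention, and pandas is an extra dependency here):

import pandas as pd

# One row per parameter combination; columns include mean/std test
# scores, timings, and the parameter values.
results_df = pd.DataFrame.from_dict(gs.cv_results)
print(results_df[['params', 'mean_test_rmse', 'rank_test_rmse']])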
A brief description of the dataset files:
- u.data The complete data set: all ratings from 943 users on 1682 movies, with each user having rated at least 20 movies. Users and items are numbered starting from 1. The data is unordered; the columns are:
  user id | item id | rating | timestamp
  Timestamps are in seconds since 1/1/1970.
- u.info The number of users, items, and ratings.
- u.item Information about the movies; the columns are:
  movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western
  The last 19 fields are genres; a movie can belong to several genres at once. The movie id is the one used in the u.data set.
- u.genre A list of the genres.
- u.user Demographic information about the users; the columns are:
  user id | age | gender | occupation | zip code
  The user id is the one used in the u.data set.
- u.occupation A list of the occupations.
- u1.base, u1.test through u5.base, u5.test 80%/20% splits of u.data into training and test data. Each of u1 through u5 has a test set disjoint from the others; these are the splits to use for 5-fold cross-validation (see the sketch after this list). They can be generated from u.data by the script mku.sh.
- ua.base, ua.test, ub.base, ub.test Splits of u.data in which each test set contains exactly 10 ratings per user; ua.test and ub.test are disjoint. They too are generated from u.data by mku.sh.
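Since the u1-u5 splits above are fixed files, Surprise can consume them directly with Dataset.load_from_folds() and the PredefinedKFold iterator; a sketch, assuming the files sit under the default download path (adjust files_dir to your setup):

import os

from surprise import SVD, Dataset, Reader, accuracy
from surprise.model_selection import PredefinedKFold

# Adjust this to wherever the u*.base / u*.test files live.
files_dir = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/')

# The built-in 'ml-100k' reader knows the 'user item rating timestamp' format.
reader = Reader('ml-100k')

# (train file, test file) pairs for the five predefined folds.
folds_files = [(files_dir + 'u%d.base' % i, files_dir + 'u%d.test' % i)
               for i in (1, 2, 3, 4, 5)]

data = Dataset.load_from_folds(folds_files, reader=reader)
pkf = PredefinedKFold()

algo = SVD()
for trainset, testset in pkf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    accuracy.rmse(predictions, verbose=True)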