Wednesday, January 17, 2018

Surprise

1. Introduction

Surprise is a Python library for building and analyzing recommender systems.
Surprise was designed with the following five goals in mind:
  • Give users better control over their experiments; to that end, the documentation tries to describe each algorithm as clearly as possible.
  • Ease dataset handling: users can work with the built-in datasets or load their own.
  • Provide a variety of ready-to-use prediction algorithms, such as baseline algorithms, neighborhood methods, and matrix-factorization-based methods (SVD, PMF, SVD++, NMF). In addition, various similarity measures (cosine, MSD, Pearson, etc.) are built in.
  • Make it easy to implement new algorithm ideas.
  • Provide tools to evaluate, analyze and compare algorithm performance. Cross-validation procedures can be run effortlessly with powerful CV iterators (inspired by scikit-learn's excellent tools), along with exhaustive search over a set of parameters.

2. Basic usage


# Basic usage
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate
# Load the movielens-100k dataset (download it if needed),
data = Dataset.load_builtin('ml-100k')
# We'll use the famous SVD algorithm.
algo = SVD()
# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

 

3. Using prediction algorithms

The prediction algorithms live in surprise/prediction_algorithms, where AlgoBase is the base class; it implements the key methods such as predict(), fit() and test().

4. Train-test split and the fit() method (examples/train_test_split.py)


# Splitting the dataset
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split

# Load the movielens-100k dataset (download it if needed)
data = Dataset.load_builtin('ml-100k')

# sample random trainset and testset
# test set is made of 25% of the ratings
trainset, testset = train_test_split(data, test_size=.25)

# we'll use the famous SVD algorithm.
algo = SVD()

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# predictions = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions)
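Under the hood, accuracy.rmse just aggregates the (r_ui, est) pairs carried by the Prediction tuples that test() returns, so the same number can be recomputed by hand — handy when rolling a custom metric. A minimal sketch (rmse_by_hand is a hypothetical helper, not part of Surprise):

```python
import math


def rmse_by_hand(predictions):
    # Each Prediction carries the true rating r_ui and the estimate est.
    squared_errors = [(p.r_ui - p.est) ** 2 for p in predictions]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```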

5. Train on the whole trainset and the predict() method (examples/predict_ratings.py)


# Prediction
from surprise import KNNBasic
from surprise import Dataset

# load the movielens-100k dataset
data = Dataset.load_builtin('ml-100k')

# Retrieve the trainset
trainset = data.build_full_trainset()

# Build an algorithm, and train it.
algo = KNNBasic()
algo.fit(trainset)

uid = str(196)  # raw user id (as in the ratings file); note these are strings
iid = str(302)  # raw item id

pred = algo.predict(uid, iid, r_ui=4, verbose=True)

6. Use a custom dataset (examples/load_custom_dataset.py)


# Loading a custom dataset
import os

from surprise import BaselineOnly
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate

# path to dataset file
file_path = os.path.expanduser('~/.surprise_data/ml-100k/u.data')

# As we're loading a custom dataset, we need to define a reader. In the
# movielens-100k dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.

reader = Reader(line_format='user item rating timestamp', sep='\t')
data = Dataset.load_from_file(file_path, reader=reader)

# We can now use this dataset as we please, e.g. calling cross_validate
cross_validate(BaselineOnly(), data, verbose=True)


7. Use cross-validation iterators


# Using cross-validation iterators
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import KFold

# Load the movielens-100k dataset
data = Dataset.load_builtin('ml-100k')

# define a cross-validation iterator
kf = KFold(n_splits=3)

algo = SVD()
for trainset, testset in kf.split(data):

    # train and test the algorithm
    algo.fit(trainset)
    predictions = algo.test(testset)

    accuracy.rmse(predictions, verbose=True)

8. Tune algorithm parameters with GridSearchCV

# Searching for the best parameters

from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV

# Use movielens-100k
data = Dataset.load_builtin('ml-100k')
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005], 'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

print(gs.best_score['rmse'])
print(gs.best_params['rmse'])

algo = gs.best_estimator['rmse']

algo.fit(data.build_full_trainset())

A short description of the dataset files:

  • u.data   The full data set: the complete set of ratings from 943 users on 1682 movies, with each user having rated at least 20 movies. Users and items are numbered consecutively from 1. The data is randomly ordered; the columns are:
    user id | item id | rating | timestamp
    Timestamps are Unix seconds counted from 1/1/1970.
  • u.info   The number of users, items and ratings.
  • u.item   Information about the items (movies); the columns are:
    movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western
    The last 19 fields are the genres; a movie can belong to several genres at once. The movie ids are the ones used in u.data.
  • u.genre   A list of the genres.
  • u.user   Demographic information about the users; the columns are:
    user id | age | gender | occupation | zip code
    The user ids are the ones used in u.data.
  • u.occupation   A list of the occupations.
  • u1.base / u1.test through u5.base / u5.test   80%/20% splits of the u.data ratings into training and test data. The test sets of u1 through u5 are pairwise disjoint, so together they can be used for 5-fold cross-validation. These splits can be regenerated from u.data with the mku.sh script.
  • ua.base / ua.test / ub.base / ub.test   Splits of u.data into a training set and a test set with exactly 10 ratings per user in the test set; ua.test and ub.test are disjoint. These can also be regenerated from u.data with mku.sh.
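The pipe-separated u.item layout above is easy to parse directly. The sketch below builds a movie-id-to-title map; load_movie_titles is a hypothetical helper, and the ISO-8859-1 encoding matches the original MovieLens files:

```python
import csv


def load_movie_titles(path):
    """Map raw movie id -> title from a u.item-style file."""
    titles = {}
    with open(path, encoding='ISO-8859-1') as f:
        for row in csv.reader(f, delimiter='|'):
            titles[row[0]] = row[1]  # first two columns: movie id, title
    return titles

# Example (path assumes the built-in download location):
# titles = load_movie_titles(
#     os.path.expanduser('~/.surprise_data/ml-100k/u.item'))
```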
