1.Introduction
Surprise is a Python library for building and analyzing recommender systems.
Five ideas behind Surprise's design:
- Give users full control over their experiments; to this end, the documentation tries to describe each algorithm clearly and precisely.
- Provide datasets: users can use the datasets that ship with the library, or their own.
- Provide various ready-to-use prediction algorithms, such as baseline algorithms, neighborhood methods, and matrix-factorization-based methods (SVD, PMF, SVD++, NMF), among others. Various similarity measures (cosine, MSD, Pearson, etc.) are also built in.
- Make it easy to implement new algorithm ideas.
- Provide tools to evaluate, analyze, and compare algorithm performance. Cross-validation procedures can be run easily using powerful CV iterators (inspired by scikit-learn's excellent tools), as can exhaustive searches over a set of parameters.
2.Basic usage
# Basic usage
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate

# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')

# We'll use the famous SVD algorithm.
algo = SVD()

# Run 5-fold cross-validation and print results.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
3.Using prediction algorithms
Prediction algorithms live in surprise/prediction_algorithms; the algo_base module holds the base class AlgoBase, which implements the key methods predict, fit, and test.
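Since every algorithm derives from AlgoBase, writing your own is mostly a matter of overriding fit() and estimate(). A minimal sketch (the GlobalMean class name and its always-predict-the-mean behavior are illustrative, not part of the library):

from surprise import AlgoBase, Dataset
from surprise.model_selection import cross_validate

class GlobalMean(AlgoBase):
    """Toy algorithm that always predicts the global mean rating."""

    def fit(self, trainset):
        # Always call the base class fit() first.
        AlgoBase.fit(self, trainset)
        self.the_mean = trainset.global_mean
        return self

    def estimate(self, u, i):
        # u and i are inner ids; this toy estimator ignores them.
        return self.the_mean

data = Dataset.load_builtin('ml-100k')
cross_validate(GlobalMean(), data, verbose=True)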
4.Train-test split and the fit() method(examples/train_test_split.py)
# Splitting the dataset into a trainset and a testset
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split

# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')

# Sample a random trainset and testset.
# The test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)

# We'll use the famous SVD algorithm.
algo = SVD()

# Train the algorithm on the trainset, and predict ratings for the testset.
algo.fit(trainset)
predictions = algo.test(testset)

# The two lines above can be chained into one:
# predictions = algo.fit(trainset).test(testset)

# Then compute RMSE.
accuracy.rmse(predictions)
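algo.test() returns a list of Prediction objects (named tuples with fields uid, iid, r_ui, est, and details), so individual predictions can be inspected directly. A small sketch, reusing the predictions variable from the block above:

# Look at the first prediction: true rating (r_ui) vs. estimate (est).
p = predictions[0]
print(p.uid, p.iid, p.r_ui, p.est)

# Find the prediction with the largest absolute error.
worst = max(predictions, key=lambda q: abs(q.r_ui - q.est))
print(worst)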
5.Train on the whole trainset and the predict() method(examples/predict_ratings.py)
# Predicting a single rating
from surprise import KNNBasic
from surprise import Dataset

# Load the movielens-100k dataset.
data = Dataset.load_builtin('ml-100k')

# Retrieve the trainset.
trainset = data.build_full_trainset()

# Build an algorithm, and train it.
algo = KNNBasic()
algo.fit(trainset)

# Predict the rating that user 196 would give to item 302.
# Raw ids are strings; r_ui=4 is the known true rating, passed for display.
uid = str(196)
iid = str(302)
pred = algo.predict(uid, iid, r_ui=4, verbose=True)
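The returned Prediction object carries the estimate in its est field; a brief sketch, continuing from the pred above:

# The estimated rating itself.
print(pred.est)
# pred.details records extra information, e.g. whether the prediction
# was impossible (unknown user or item).
print(pred.details)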
6.Use a custom dataset(examples/load_custom_dataset.py)
# Loading a custom dataset from a file
import os

from surprise import BaselineOnly
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate

# Path to the dataset file.
file_path = os.path.expanduser('~/.surprise_data/ml-100k/u.data')

# As we're loading a custom dataset, we need to define a reader. In the
# movielens-100k dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format='user item rating timestamp', sep='\t')

data = Dataset.load_from_file(file_path, reader=reader)

# We can now use this dataset as we please.
cross_validate(BaselineOnly(), data, verbose=True)
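Besides files, a dataset can also be built from a pandas DataFrame via Dataset.load_from_df(). A minimal sketch, assuming a toy DataFrame (the column names and the 1-5 rating_scale are illustrative choices):

import pandas as pd

from surprise import BaselineOnly, Dataset, Reader
from surprise.model_selection import cross_validate

# A toy ratings frame; columns must be passed in user, item, rating order.
ratings_dict = {'userID': [9, 32, 2, 45, 9],
                'itemID': [1, 1, 1, 2, 2],
                'rating': [3, 2, 4, 3, 1]}
df = pd.DataFrame(ratings_dict)

# For load_from_df, the Reader only needs the rating scale.
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)

cross_validate(BaselineOnly(), data, cv=2, verbose=True)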
7.Use cross-validation iterators
# Using cross-validation iterators
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import KFold

# Load the movielens-100k dataset.
data = Dataset.load_builtin('ml-100k')

# Define a cross-validation iterator.
kf = KFold(n_splits=3)

algo = SVD()

for trainset, testset in kf.split(data):
    # Train and test the algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print RMSE.
    accuracy.rmse(predictions, verbose=True)
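A CV iterator can also be handed directly to cross_validate() through its cv parameter, instead of writing the loop by hand; a short sketch reusing kf, algo, and data from above:

from surprise.model_selection import cross_validate

# Equivalent to the manual loop above, with results collected for us.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=kf, verbose=True)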
8.Tune algorithm parameters with GridSearchCV
# Searching for the best parameters
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV

# Use movielens-100k.
data = Dataset.load_builtin('ml-100k')

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

# Best RMSE score, and the parameters that gave it.
print(gs.best_score['rmse'])
print(gs.best_params['rmse'])

# We can now use the algorithm that yields the best RMSE,
# retrained on the whole dataset.
algo = gs.best_estimator['rmse']
algo.fit(data.build_full_trainset())
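Beyond best_score and best_params, GridSearchCV keeps per-combination metrics in its cv_results attribute, which is convenient to dump into a pandas DataFrame; a small sketch, assuming the gs object above (the column names follow the scikit-learn convention, and pandas is an extra dependency here):

import pandas as pd

# One row per parameter combination; columns include mean/std test
# scores, timings, and the parameter values.
results_df = pd.DataFrame.from_dict(gs.cv_results)
print(results_df[['params', 'mean_test_rmse', 'rank_test_rmse']])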
A brief description of the dataset files:
- u.data The complete data set: all ratings from 943 users on 1682 movies, with each user having rated at least 20 movies. Users and items are numbered starting from 1. The data is unordered; the columns are:
  user id | item id | rating | timestamp
  Timestamps are in seconds since 1/1/1970.
- u.info The number of users, items, and ratings.
- u.item Information about the movies; the columns are:
  movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western
  The last 19 fields are genres; a movie can belong to several genres at once. The movie id is the one used in the u.data set.
- u.genre A list of the genres.
- u.user Demographic information about the users; the columns are:
  user id | age | gender | occupation | zip code
  The user id is the one used in the u.data set.
- u.occupation A list of the occupations.
- u1.base, u1.test through u5.base, u5.test 80%/20% splits of u.data into training and test data. Each of u1 through u5 has a test set disjoint from the others; these are the splits to use for 5-fold cross-validation (see the sketch after this list). They can be generated from u.data by the script mku.sh.
- ua.base, ua.test, ub.base, ub.test Splits of u.data in which each test set contains exactly 10 ratings per user; ua.test and ub.test are disjoint. They too are generated from u.data by mku.sh.
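Since the u1-u5 splits above are fixed files, Surprise can consume them directly with Dataset.load_from_folds() and the PredefinedKFold iterator; a sketch, assuming the files sit under the default download path (adjust files_dir to your setup):

import os

from surprise import SVD, Dataset, Reader, accuracy
from surprise.model_selection import PredefinedKFold

# Adjust this to wherever the u*.base / u*.test files live.
files_dir = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/')

# The built-in 'ml-100k' reader knows the 'user item rating timestamp' format.
reader = Reader('ml-100k')

# (train file, test file) pairs for the five predefined folds.
folds_files = [(files_dir + 'u%d.base' % i, files_dir + 'u%d.test' % i)
               for i in (1, 2, 3, 4, 5)]

data = Dataset.load_from_folds(folds_files, reader=reader)
pkf = PredefinedKFold()

algo = SVD()
for trainset, testset in pkf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    accuracy.rmse(predictions, verbose=True)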