Jul 8, 2022

文本分类

1. 实验目的

掌握数据预处理的方法，对训练集数据进行预处理；
掌握文本建模的方法，对语料库的文档进行建模；
掌握分类算法的原理，基于有监督的机器学习方法，训练文本分类器；
利用学习的文本分类器，对未知文本进行分类判别；
掌握评价分类器性能的评估方法。

2. 实验类型

数据挖掘算法的设计与编程实现。

3. 实验要求

文本类别数：>=10类；
训练集文档数：>=50000篇；每类平均5000篇。
测试集文档数：>=50000篇；每类平均5000篇。

4. 实验内容

利用分类算法实现对文本的数据挖掘，主要包括：

语料库的构建，主要包括利用爬虫收集Web文档等；
语料库的数据预处理，包括文档建模，如去噪，分词，建立数据字典，使用词袋模型或主题模型表达文档等；
选择分类算法（朴素贝叶斯（必做）、SVM/其他实际验收有几个不做的？等），训练文本分类器，理解所选的分类算法的建模原理、实现过程和相关参数的含义；
对测试集的文本进行分类
对测试集的分类结果利用正确率和召回率进行分析评价：计算每类正确率、召回率，计算总体正确率和召回率。

5. 实验准备

5.1. 实验环境

fedora 33

个人比较喜欢linux环境，windows环境应该也可以完成下面的步骤。一些文件路径可能会修改。另外，硬盘读取速度等都会影响实验结果（速度等方面）。

数据集准备

THUCNews

数据集大小：1.45GB

样本数量：80多万

数据集详情链接：thuctc

在解压出的目录下新建两个文件夹1trainData和1testData。注意：Linux下解压会多一层目录并多一个1__MACOSX，这个并不影响实验。

解压结果如下

图 1 文本分类实验解压结果

cppjieba

cppjieba是”结巴(Jieba)“中文分词的C++版本。这个要比python快10倍，解决本实验足够好用!

用法

依赖软件

> 1g++ (version >= 4.1 is recommended) or clang++;

> 1cmake (version >= 2.6 is recommended);

下载和编译

> 1git clone --depth=10 --branch=master git://github.com/yanyiwu/cppjieba.git > > 1cd cppjieba > > 1mkdir build > > 1cd build > > 1cmake .. > > 1make

结果如下

图 2 文本分类实验

我们只需要修改1cppjieba\test\demo.cpp}，但我们要在1cppjieba\build下执行1make进行编译，并执行1./demo运行分词程序。

朴素贝叶斯分类器

贝叶斯方法的分类目标是在给定描述实例的属性值下，得到最可能的目标值。

采纳m-估计方法，即有统一的先验概率并且m等于词汇表的大小，因此

定义如下函数：

1Learn_Naive_Bayes_Text( Examples, V )

Examples为一组文本文档以及它们的目标值。V为所有可能目标值的集合。此函数作用是学习概率项和。

1收集Examples中所有的单词、标点符号以及其他记号
2
3Vocabulary <= 在Examples中任意文本文档中出现的所有单词及记号的集合
4    计算所需要的概率项$P(vj)$和$P(wk|vj)$
5        对V中每个目标值vj
6            $docs_j\leftarrow$Examples中目标值为$v_j$的文档子集
7            $P(v_j)\leftarrow|docs_j| / |Examples|$
8            $Text_j\leftarrow$将$docs_j$中所有成员连接起来建立的单个文档
9            $n\leftarrow$在$Text_j$中不同单词位置的总数
10            对Vocabulary中每个单词$w_k$
11                $n_k\leftarrow$单词$w_k$出现在$Text_j$中的次数
12                $P(w_k|v_j)\leftarrow(n_k+1) / (n+|Vocabulary|)$

检验

svm

下载地址：libsvm

之后1make一下，就会用三个可执行文件，1svm-scale svm-train svm-train。分别是数据整理，模型训练和数据预测。

具体参数可以直接输入1./svm-scale获得。

更多内容可以自行百度。

代码实现

总共分为五个部分

1trainingData();
2makeDict();
3makeSvm();
4testingData();
5printAns();

变量声明

1#define MYMODE_RW (S_IRUSR | S_IWUSR)
2using namespace std;
3
4const char *const DICT_PATH = "../dict/jieba.dict.utf8";
5const char *const HMM_PATH = "../dict/hmm_model.utf8";
6const char *const USER_DICT_PATH = "../dict/user.dict.utf8";
7const char *const IDF_PATH = "../dict/idf.utf8";
8const char *const STOP_WORD_PATH = "../dict/stop_words.utf8";
9
10const string dataSet = "../../THUCNews/THUCNews/";
11// const string dataSet = "../../THUCNews/";
12const string svmTrainData = "../../THUCNews/trainData/";
13const string svmTestData = "../../THUCNews/testData/";
14
15const int MAXS = 8e6 + 10;  //文件大小限制
16
17//Vtot 总的类别数量
18//maxCommon取相关性的前200的词组成词袋
19//testCnt每个类别如果有2*testCnt篇文章，
20//就取testCnt篇作为测试数据，
21//再取testCnt篇作为训练数据
22
23const int Vtot = 11, maxCommon = 200, testCnt = 5000;
24
25
26//所有种类
27string V[Vtot] = {"体育", "娱乐", "家居", "彩票", "房产", "教育",
28                  "时政", "游戏", "社会", "科技", "财经"};
29
30//dic[i]第i类文章中出现的所有词
31map<string, int> dic[Vtot + 1];
32
33//p[i][word] p(word|vj=i)
34//第i类文章中，word出现的概率，做平滑处理
35map<string, int> p[Vtot];
36
37//kappa[i][word] word和第i类文章的kappa^2值
38map<string, int> kappa[Vtot + 1];
39
40//在处理svm数据集时，每个单词的编号
41map<string, int> svmIndex;
42
43//idf值
44map<string, double> idf;
45
46//fileList[i]第i类文章列表
47vector<string> fileList[Vtot];
48
49//混淆矩阵
50int confusionMatrix[Vtot][Vtot];
51int docs[Vtot + 1];  // docs[i]:第i类文章训练或测试的数目
52
53int n[Vtot + 1];     // n[i]:第i类有多少个单词（有重复）
54
55int precision[Vtot], recall[Vtot];  //准确率和回归率
56
57//多线程id
58long unsigned int thrd[Vtot];
59
60
61#define dictAll dic[Vtot]
62
63//结巴分词器
64cppjieba::Jieba jieba(DICT_PATH, HMM_PATH, USER_DICT_PATH, IDF_PATH,
65                      STOP_WORD_PATH);

{训练数据部分}

下面是处理数据的多线程函数

1/**
2 * 多线程函数
3 * 处理数据
4 * 读入文件->分词->计算dic[type]->计算kappa[type]
5 */
6void *trainingData(void *t) {
7    int type = (long long)t;
8    auto &dict = dic[type];
9    auto &kap = kappa[type];
10    char sentence[MAXS];
11    for (int d = 0; d < docs[type]; ++d) {
12
13        //读入文件
14        string fileName = dataSet + V[type] + "/" + fileList[type][d];
15        int fd = open(fileName.c_str(), O_RDONLY);
16        int len = pread(fd, sentence, MAXS, 0);
17        close(fd);
18        //如果文件大小超出MAXS会出错
19        if (len < 0) continue;
20        sentence[len] = '\0';
21        vector<pair<string, string>> tagres;
22        //分词
23        jieba.Tag(sentence, tagres);
24
25        //计算dic[type]->计算kappa[type]
26        set<string> thisArticle;//本篇文章单词集合
27        for (auto it : tagres) {
28            const auto &word = it.first;  //单词
29            if (strstr(it.second.c_str(), "n") != NULL &&
30                word.length() > 2) {  //名词 且 长度大于一
31                dict.find(word) != dict.end() ? dict[word]++ : dict[word] = 1;
32                thisArticle.insert(word);
33            }
34        }
35        for (auto it : thisArticle) {
36            kap.find(it) != kap.end() ? kap[it]++ : kap[it] = 1;
37        }
38    }
39    cout << V[type] << "\tDone\n";
40    return NULL;
41}

想要调用上面的线程函数，需要先读取文件列表，如下：

1//读取文件列表
2void readFileLists(string dir, vector<string> &fileList) {
3    // cout<<dir<<'\n';
4    DIR *d = opendir(dir.c_str());
5    struct dirent *entry;
6    while ((entry = readdir(d)) != NULL) {
7        if (strstr(entry->d_name, ".txt") != NULL)
8            fileList.push_back(entry->d_name);
9    }
10    closedir(d);
11}
12//读取文件列表并处理训练数据
13void trainingData() {
14    for (int i = 0; i < Vtot; ++i) {
15        readFileLists(dataSet + V[i], fileList[i]);
16    }
17
18    for (int i = 0; i < Vtot; ++i) {
19        docs[i] = min(int(fileList[i].size()) / 2, testCnt);
20        // testCnt方便测试
21
22        cout << V[i] << " 类的文件个数为 " << docs[i] << '\n';
23
24        docs[Vtot] += docs[i];
25        int res = pthread_create(&thrd[i], NULL, trainingData, (void *)i);
26        if (res) {
27            printf("Create thress NO.%d failed\n", i);
28            exit(res);
29        }
30    }
31    for (int i = 0; i < Vtot; ++i) {
32        pthread_join(thrd[i], NULL);
33    }
34}

生成字典部分

1//抽取词典，使用kappa^2检验
2void makeDict() {
3    for (int i = 0; i < Vtot; ++i) {
4        vector<pair<double, string>> mostCommon;
5        for (auto it : kappa[i]) {
6            double A = 0, B = 0, C = 0, D = 0;
7            A = it.second;
8            C = docs[i] - A;
9            const auto &word = it.first;
10            for (int j = 0; j < Vtot; ++j)
11                if (i != j) {
12                    if (kappa[j].find(word) != kappa[j].end()) {
13                        B += kappa[j][word];
14                    }
15                }
16            D = docs[Vtot] - docs[i] - B;
17            double k = docs[Vtot] * (A * D - B * C) * (A * D - B * C) /
18                       ((A + C) * (A + B) * (B + D) * (C + D));
19            mostCommon.push_back({k, word});
20        }
21        sort(mostCommon.begin(), mostCommon.end());
22
23        reverse(mostCommon.begin(), mostCommon.end());
24        int item = 0;
25
26        cout << V[i] << ':';
27
28        for (auto it : mostCommon) {
29            ++item;
30            if (item > maxCommon) break;
31            const auto &word = it.second;
32            const int cnt = dic[i][word];
33            n[i] += cnt;
34            dictAll.find(word) != dictAll.end() ? dictAll[word] += cnt
35                                                : dictAll[word] = cnt;
36        }
37        cout << V[i] << "单词个数" << n[i] << '\n';
38        n[Vtot] += n[i];
39
40        /*for (int j = max(0, (int)mostCommon.size() - 20);
41             j < (int)mostCommon.size(); ++j) {
42            cout << "(" << mostCommon[j].first << ' ' << mostCommon[j].second
43                 << ") ";
44        }
45        cout << '\n';*/
46    }
47}

生成svm的训练数据文件和测试数据文件

1
2/**
3 * 多线程函数
4 * 整理svm用到的数据文件
5 * 包括测试文件和输出文件
6 * 要按照libsvm要求的数据格式来
7 */
8
9void *svmData(void *t) {
10    int type = (long long)t;
11    //先打开训练数据文件
12    int fdout = open((svmTrainData + V[type]).c_str(), O_RDWR | O_CREAT, MYMODE_RW);
13    if (fdout == -1) {
14        cout << errno << '\n';
15        perror("open");
16    }
17    auto &dict = dic[type];
18    auto &kap = kappa[type];
19    string outBuf;
20    char sentence[MAXS];
21    int fdoffset = 0;
22
23    for (int d = 0; d < (int)fileList[type].size() && d < 2 * docs[type]; ++d) {
24        if (d == docs[type]) {  //这之后是测试数据集
25            close(fdout);
26            fdout = open((svmTestData + V[type]).c_str(), O_RDWR | O_CREAT, MYMODE_RW);
27            fdoffset = 0;
28            if (fdout == -1) {
29                cout << errno << '\n';
30                perror("open");
31            }
32        }
33
34        //读取文件并进行分词
35        string fileName = dataSet + V[type] + "/" + fileList[type][d];
36        int fd = open(fileName.c_str(), O_RDONLY);
37        long long int len = pread(fd, sentence, MAXS, 0);
38        close(fd);
39        if (len < 0) continue;  //如果文件大小超出MAXS会出错
40        sentence[len] = '\0';
41        // cout << sentence << '\n';
42        vector<pair<string, string>> tagres;
43        jieba.Tag(sentence, tagres);
44
45        //统计这篇文章中的词频
46        map<string, int> art;
47        int totArt = 0;
48        for (auto it : tagres) {
49            const auto &word = it.first;  //单词
50            if (dictAll.find(word) != dictAll.end()) {
51                art.find(word) != art.end() ? art[word]++ : art[word] = 1;
52                totArt++;
53            }
54        }
55
56        outBuf = to_string(type) + " ";
57
58        for (auto it : art) {
59            auto &word = it.first;
60            double tf = it.second * 1.0 / totArt;  // tf-idf=tf*idf
61            outBuf += to_string(svmIndex[word]) + ":" +
62                      to_string(tf * idf[word]) + " ";
63        }
64        outBuf += '\n';
65        pwrite(fdout, outBuf.c_str(), outBuf.length(), fdoffset);
66        fdoffset += outBuf.length();
67    }
68    close(fdout);
69    return NULL;
70}
71/**
72 * 处理svm用到的数据
73 */
74void makeSvm() {
75    //对单词进行编号，并计算idf值
76    int item = 0;
77    for (auto it : dictAll) {
78        auto &word = it.first;
79        svmIndex[word] = ++item;
80        for (int i = 0; i < Vtot; ++i) {
81            if (kappa[i].find(word) != kappa[i].end()) {
82                idf.find(word) != idf.end() ? idf[word] += kappa[i][word]
83                                            : idf[word] = kappa[i][word];
84            }
85        }
86    }
87    for (auto &it : idf) {
88        it.second = log10(docs[Vtot] * 2 / (it.second + 1));
89    }
90
91    //多线程处理数据，每个线程处理一个类别的数据
92    for (int i = 0; i < Vtot; ++i) {
93        int res = pthread_create(&thrd[i], NULL, svmData, (void *)i);
94        if (res) {
95            printf("Create thress NO.%d failed\n", i);
96            exit(res);
97        }
98    }
99    for (int i = 0; i < Vtot; ++i) {
100        pthread_join(thrd[i], NULL);
101    }
102
103    //处理完成之后，将所有类别的数据合到一起
104    char buf[MAXS];
105    int fdout =
106        open((svmTrainData + "trainData.txt").c_str(), O_RDWR | O_CREAT, MYMODE_RW);
107    if (fdout == -1) {
108        cout << errno << '\n';
109        perror("open");
110    }
111
112    off_t off = 0;
113    for (int i = 0; i < Vtot; ++i) {
114        int fd = open((svmTrainData + V[i]).c_str(), O_RDONLY);
115        int len = pread(fd, buf, MAXS, 0);
116        close(fd);
117        if (len < 0) continue;
118        buf[len] = '\0';
119        pwrite(fdout, buf, strlen(buf), off);
120        cout << strlen(buf) << '\n';
121        off += strlen(buf);
122        cout << V[i] << ' ' << strlen(buf) << ' ' << off << '\n';
123    }
124    close(fdout);
125
126    fdout = open((svmTestData + "testData.txt").c_str(), O_RDWR | O_CREAT, MYMODE_RW);
127    off = 0;
128    for (int i = 0; i < Vtot; ++i) {
129        int fd = open((svmTestData + V[i]).c_str(), O_RDONLY);
130        int len = pread(fd, buf, MAXS, 0);
131        close(fd);
132        if (len < 0) continue;
133        buf[len] = '\0';
134        pwrite(fdout, buf, strlen(buf), off);
135        off += strlen(buf);
136        cout << V[i] << ' ' << strlen(buf) << ' ' << off << '\n';
137    }
138    close(fdout);
139}

进行文本分类

先预处理出p的值，

1/**
2 * 朴素贝叶斯分类器
3 */
4void *bayes(void *t) {
5    int type = (long long)t;
6    cout << "classfiying " << V[type] << "\n";
7    char sentence[MAXS];
8
9    for (int f = docs[type]; f < (int)fileList[type].size(); ++f) {
10        if (f - docs[type] > docs[type]) break;
11
12        //读取文件
13        string fileName = dataSet + V[type] + "/" + fileList[type][f];
14        int fd = open(fileName.c_str(), O_RDONLY);
15        int len = pread(fd, sentence, MAXS, 0);
16        close(fd);
17        //分词
18        if (len < 0) continue;  //读入出错，buf不够之类的
19        sentence[len] = '\0';
20        vector<pair<string, string>> tagres;
21        jieba.Tag(string(sentence), tagres);
22
23        //计算argmax(p)
24        pair<double, int> ans[Vtot];
25        for (int i = 0; i < Vtot; ++i) {
26            ans[i] = {docs[i] * 1.0 / docs[Vtot], i};
27        }
28        for (auto it : tagres) {
29            if (dictAll.find(it.first) != dictAll.end()) {
30                for (int i = 0; i < Vtot; ++i) {
31                    ans[i].first += p[i][it.first];
32                }
33            }
34        }
35        sort(ans, ans + Vtot);
36        confusionMatrix[type][ans[Vtot - 1].second] += 1;
37    }
38    cout << "classfiy " << V[type] << "\tdone\n";
39    return NULL;
40}
41
42//训练数据
43void testingData() {
44
45    //预处理p[i][word]值
46    cout << "字典长度=" << dictAll.size() << '\n';
47    for (int vj = 0; vj < Vtot; ++vj) {
48        for (auto it : dictAll) {
49            const auto &word = it.first;
50            if (dic[vj].find(word) != dic[vj].end()) {
51                p[vj][word] =
52                    log10(dic[vj][word] + 1) - log10(n[vj] + dictAll.size());
53            } else {
54                p[vj][word] = log10(1) - log10(n[vj] + dictAll.size());
55            }
56        }
57    }
58
59    //多线程进行文本分类
60    for (int i = 0; i < Vtot; ++i) {
61        int res = pthread_create(&thrd[i], NULL, bayes, (void *)i);
62        if (res) {
63            printf("Create thress NO.%d failed\n", i);
64            exit(res);
65        }
66    }
67    for (int i = 0; i < Vtot; ++i) {
68        pthread_join(thrd[i], NULL);
69    }
70}

输出结果

1//输出想要的答案
2void printAns() {
3    double avrPre = 0, avrRecall = 0;
4    for (int v1 = 0; v1 < Vtot; ++v1) {
5        cout << V[v1] << ":\t";
6        for (int v2 = 0; v2 < Vtot; ++v2) {
7            precision[v2] += confusionMatrix[v1][v2];
8            recall[v1] += confusionMatrix[v1][v2];
9            cout << confusionMatrix[v1][v2] << '\t';
10        }
11        cout << '\n';
12    }
13    for (int v = 0; v < Vtot; ++v) {
14        cout << V[v] << '\n';
15        cout << "准确率=" << confusionMatrix[v][v] * 100.0 / precision[v]
16             << "% ";
17        avrPre += confusionMatrix[v][v] * 100.0 / precision[v];
18        cout << "召回率=" << confusionMatrix[v][v] * 100.0 / recall[v] << "%\n";
19        avrRecall += confusionMatrix[v][v] * 100.0 / recall[v];
20    }
21    cout << "平均准确率=" << avrRecall / Vtot << "% "
22         << "平均召回率=" << avrRecall / Vtot << "\n";
23}

实验结果

先输出训练的文件个数，训练完成之后，统计每个类别有多少单词。之后，得出字典长度为2037。

图 3 结果

预测完成之后是混淆矩阵。

图 4 文本分类实验混淆矩阵

输出召回率和准确率，最后输出运行时间。 可以看到最后运行时间仅为12秒，这就是c++的长处之一。 总共分析了55000篇文章，从读入到分词，到其他计算。总共执行时间非常短。相比于python的10min左右要好得多。

图 5 结果

最后的实验结果也比较理想。准确率和召回率都比较高。

svm

这里我并没有用到梯度预测、n折交叉验证（训练时间太长，实验时间不够）。但我们完整实现了用libsvm的工具进行文本分类的必要步骤。以下均在1libsvm目录下。

数据整理。 1../THUCNews/trainData/trainData.txt是待训练数据的原始形式，1../THUCNews/testData/testData.txt是待预测数据。用1./svm-scale -l 0 -r train.scale ../THUCNews/trainData/trainData.txt 和1./svm-scale -l 0 -r test.scale ../THUCNews/testData/testData.txt命令, 可以将训练数据、测试数据整理为标准格式，且下界为0.
模型训练 1./svm-train -h 0 -g 0.001 -c 10 train.scale my.model得到模型文件。
数据预测 1./svm-predict test.scale my.model ans.out

结果如下

图 6 svman

实验结果较为理想。当然，该方法还有很多可以改进的空间，这里就不深究。

{代码}

超链接

之前一版