Whoosh: An Open-Source Search Engine for Python

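### 1. Introduction to Whoosh

Whoosh is a fast, featureful full-text indexing and search library implemented in pure Python. It makes it easy to add search to applications and websites, and every part of how it works can be extended or replaced to fit your needs.
The search engine on this site is built with Whoosh.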

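### 2. Installing Whoosh

Whoosh's default analyzer is designed for English-style text. To tokenize Chinese, install the jieba Chinese word segmentation library as well.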
```
pip install whoosh
pip install jieba
```

### 3. Code example

Below is the data in this site's article table.
<img src="/uploads/image/article/tmp7smv3eph.jpg">

Create a Schema that matches the data format:
```python
# -*- coding: utf-8 -*-
import os

from jieba.analyse import ChineseAnalyzer
from whoosh.index import create_in, open_dir
from whoosh.fields import SchemaClass, TEXT
from whoosh.qparser import MultifieldParser, OrGroup

analyzer = ChineseAnalyzer()  # jieba's Chinese analyzer for Whoosh
index_dir = '/indexdir'  # directory that holds the full-text index files

# Define the schema
class ArticleSchema(SchemaClass):
    # stored=True keeps the original field value in the index,
    # so it is returned with each search hit
    title = TEXT(stored=True, analyzer=analyzer)
    content = TEXT(stored=True, analyzer=analyzer)
    tags = TEXT(stored=True, analyzer=analyzer)

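# Initialize the index directory and build the index from the schema
def initIndexDir(data_list: list):
    schema = ArticleSchema()
    if not os.path.exists(index_dir):
        os.mkdir(index_dir)
    ix = create_in(index_dir, schema, indexname='article_index')

    # Add the documents to be indexed, using the fields defined in the schema.
    # Note: field values must be unicode strings (plain str in Python 3).
    with ix.writer() as w:
        for item in data_list:
            w.add_document(**item)

    ix.close()  # ix is local to this function, so it has to be closed here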

if __name__ == "__main__":
    # 1. Build the full-text index
    data = [
        {'title': 'AI绘图 | 睡衣女神', 'content': '在一个美丽的城市里，有一位睡衣女神，她叫做小雅。小雅是一个非常漂亮的女孩，她总是穿着美丽的睡衣出现在人们的面前。', 'tags': '["AI绘画", "AI艺术"]'},
        {'title': 'AI绘图 | 狐狸女孩', 'content': '狐狸女孩在铁轨旁，她静静地坐在那里，眼神迷离，似乎在思考着什么。', 'tags': '["AI绘画", "AI艺术"]'},
        {'title': 'Stable Diffusion安装、网址、咒语', 'content': 'stable-diffusion-webui的安装方法介绍', 'tags': '["Stable Diffusion", "SD"]'},
        {'title': 'PHP Composer安装和使用', 'content': '记录windows下安装和使用Composer的方法', 'tags': '["PHP", "Composer"]'},
    ]
    initIndexDir(data)

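    # 2. Build the query and search
    keyword = '帮我找AI绘图的数据'  # "help me find data about AI drawing"
    ix = open_dir(index_dir, indexname='article_index')
    fieldnames = ['title', 'content', 'tags']  # fields to search
    # OrGroup.factory(0.9): terms are ORed, and documents that match more of them get a score bonus
    q = MultifieldParser(fieldnames, schema=ix.schema, group=OrGroup.factory(0.9)).parse(keyword)
    with ix.searcher() as s:
        results = s.search(q, limit=None)  # limit caps the number of hits: default is 10, None means no cap
        print('一共发现%d条数据。' % len(results))  # prints "Found N documents in total."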
        for r in results:
            doc = r.fields()
            print(doc)
```

### 4. Console output
```plaintext
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\ADMINI~1\AppData\Local\Temp\jieba.cache
Loading model cost 0.583 seconds.
Prefix dict has been built successfully.
一共发现2条数据。
{'content': '在一个美丽的城市里，有一位睡衣女神，她叫做小雅。小雅是一个非常漂亮的女孩，她总是穿着美丽的睡衣出现在人们的面前。', 'tags': '["AI绘画", "AI艺术"]', 'title': 'AI绘图 | 睡衣女神'}
{'content': '狐狸女孩在铁轨旁，她静静地坐在那里，眼神迷离，似乎在思考着什么。', 'tags': '["AI绘画", "AI艺术"]', 'title': 'AI绘图 | 狐狸女孩'}

Process finished with exit code 0
```

Last updated: 2023-11-16 14:11:06
