倒排索引是一种检索方式,比如存入数据库的数据是存一篇文章进去,然而检索时我们经常需要通过关键词检索,所以提前做好倒排索引即可方便检索,而省略掉全表扫描的问题了,这是一种用空间换时间的方法。
使用字典构造倒排索引
sentences = ['This is the first word',
'This is the second text Hello How are you',
'This is the third, this is it now']
sentence_dict = {}
index_dict = {}
for index, line in enumerate(sentences):
line = line.lower() # 小写
sentence_dict[index] = line
for char in line.split(' '):
if not char.strip():
continue
if char in index_dict:
index_dict[char].add(index)
else:
index_dict[char] = {index}
检索
检索时依次遍历每一个待检索的单词对应的文章列表,然后取它们的交集即可,交集意味着这个文章包含待检索的全部单词。
def list_intersection(list1: list, list2: list) -> list: # 取交集
return list(set(list1).intersection(set(list2)))
index_value = list(sentence_dict.keys())
search_key = ['this', 'first']
for key in search_key:
index_value = list_intersection(index_dict[key], index_value)
print(index_value)
完整代码
sentences = ['This is the first word',
'This is the second text Hello How are you',
'This is the third, this is it now']
sentence_dict = {}
index_dict = {}
for index, line in enumerate(sentences):
line = line.lower() # 小写
sentence_dict[index] = line
for char in line.split(' '):
if not char.strip():
continue
if char in index_dict:
index_dict[char].add(index)
else:
index_dict[char] = {index}
# 检索时
def list_intersection(list1: list, list2: list) -> list: # 取交集
return list(set(list1).intersection(set(list2)))
index_value = list(sentence_dict.keys())
search_key = ['this', 'first']
for key in search_key:
index_value = list_intersection(index_dict[key], index_value)
print(index_value)