A Big-Data Search Engine in Python

Published: 2019-07-12 09:37:38  Category: Optimization  Source: 简单艾

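Summary: Search is a common requirement in big data. Splunk and ELK are the leaders in the closed-source and open-source camps, respectively. This article implements basic data search in a small amount of Python code, to help readers understand the basic principles of big-data search. Bloom Filter: as a first step, we implement a Bloom filter. ...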

Here is the code:

class Splunk(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}  # Dictionary of term to set of events
        self.events = []

    def add_event(self, event):
        """Adds an event to this object"""
        # Generate a unique ID for the event, and save it
        event_id = len(self.events)
        self.events.append(event)
        # Add each term to the bloomfilter, and track the event by each term
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)

    def search(self, term):
        """Search for a single term, and yield all the events that contain it"""
        # In Splunk this runs in O(1), and is likely to be in filesystem cache (memory)
        if not self.bf.might_contain(term):
            return
        # In Splunk this probably runs in O(log N) where N is the number of terms in the tsidx
        if term not in self.terms:
            return
        for event_id in sorted(self.terms[term]):
            yield self.events[event_id]

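The Splunk class represents an indexed collection with search capability.
Each collection contains a Bloom filter, an inverted term index (a dictionary), and an array that stores all events.
When an event is added to the index, the following happens:
Each event gets a unique ID, which here is simply its position in the events array.
The event is split into terms, and each term is added to the Bloom filter and to the inverted index, which maps each term to the IDs of the events that contain it. A term may appear in many events, so each value in the inverted index is a set. The inverted index is the core data structure of most search engines.
When a term is searched, the following happens:
Check the Bloom filter; if it reports that the term is absent, return immediately.
Check the term dictionary; if the term is not in it, return immediately.
Look up all matching event IDs in the inverted index and return the contents of those events.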
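The class above relies on a `Bloomfilter` with `add_value` and `might_contain` methods, which is defined elsewhere in the article and not shown in this section. As a rough sketch only, a stand-in with the same method names might look like the following. The salted MD5 hashing, the `num_hashes` parameter, and the `_positions` helper are assumptions made for illustration, not the author's implementation.

```python
import hashlib


class Bloomfilter(object):
    """Stand-in Bloom filter (an assumption, not the article's own code)."""

    def __init__(self, size, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes  # hypothetical parameter
        self.bits = [False] * size

    def _positions(self, value):
        # Derive num_hashes bit positions by salting an MD5 digest.
        for salt in range(self.num_hashes):
            data = ('%d:%s' % (salt, value)).encode('utf-8')
            digest = hashlib.md5(data).hexdigest()
            yield int(digest, 16) % self.size

    def add_value(self, value):
        # Set every bit position derived from the value.
        for pos in self._positions(value):
            self.bits[pos] = True

    def might_contain(self, value):
        # False means "definitely absent"; True only means "possibly present".
        return all(self.bits[pos] for pos in self._positions(value))


bf = Bloomfilter(64)
for term in ['src_ip', '1.2.3.4']:
    bf.add_value(term)

print(bf.might_contain('src_ip'))  # True: an added value is never reported absent
```

Because a Bloom filter can return false positives but never false negatives, the search still has to consult the term dictionary after the filter says "possibly present", which is why the search logic has both checks.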

Let's run it and see:

s = Splunk()
s.add_event('src_ip = 1.2.3.4')
s.add_event('src_ip = 5.6.7.8')
s.add_event('dst_ip = 1.2.3.4')
for event in s.search('1.2.3.4'):
    print(event)
print('-')
for event in s.search('src_ip'):
    print(event)
print('-')
for event in s.search('ip'):
    print(event)

Output:

src_ip = 1.2.3.4
dst_ip = 1.2.3.4
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
dst_ip = 1.2.3.4

Pretty neat, isn't it?
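The search for 'ip' matches all three events only because the tokenizer emits sub-terms such as 'ip' from 'src_ip' and 'dst_ip'. The `segments` function is defined elsewhere in the article and is not shown in this section. The following is a hypothetical stand-in, assuming spaces are major breaks and '_' and '.' are minor breaks, that produces terms consistent with the output above.

```python
def segments(event):
    """Hypothetical tokenizer: split on spaces, then on '_' and '.'."""
    minor_breaks = '_.'  # assumed minor break characters
    results = set()
    for major in event.split(' '):
        results.add(major)  # the whole space-delimited token
        last = -1
        for idx, ch in enumerate(major):
            if ch in minor_breaks:
                results.add(major[last + 1:idx])  # piece between two breaks
                results.add(major[:idx])          # prefix up to this break
                last = idx
        results.add(major[last + 1:])             # trailing piece
    return results


print('ip' in segments('src_ip = 1.2.3.4'))  # True
```

With a tokenizer of this shape, 'src_ip' contributes the terms 'src_ip', 'src', and 'ip' to the inverted index, so a search for any of them finds the event.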

More Complex Searches

Going a step further, we want to support And and Or in searches so that more complex queries can be expressed.

Here is the code:

class SplunkM(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}  # Dictionary of term to set of events
        self.events = []

    def add_event(self, event):
        """Adds an event to this object"""
        # Generate a unique ID for the event, and save it
        event_id = len(self.events)
        self.events.append(event)
        # Add each term to the bloomfilter, and track the event by each term
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)

    def search_all(self, terms):
        """Search for an AND of all terms"""
        # Start with the universe of all events...
        results = set(range(len(self.events)))
        for term in terms:
            # If a term isn't present at all then we can stop looking
            if not self.bf.might_contain(term):
                return
            if term not in self.terms:
                return
            # Drop events that don't match from our results
            results = results.intersection(self.terms[term])
        for event_id in sorted(results):
            yield self.events[event_id]

    def search_any(self, terms):
        """Search for an OR of all terms"""
        results = set()
        for term in terms:
            # If a term isn't present, we skip it, but don't stop
            if not self.bf.might_contain(term):
                continue
            if term not in self.terms:
                continue
            # Add these events to our results
            results = results.union(self.terms[term])
        for event_id in sorted(results):
            yield self.events[event_id]

With the intersection and union operations on Python sets, And (intersection) and Or (union) searches are easy to support.
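To make the set arithmetic concrete, here is a small self-contained sketch that works directly on posting sets, without the Bloom filter. The `terms` dictionary below is hand-built to mirror the three example events (IDs 0, 1, 2), and the helper names `match_all` and `match_any` are hypothetical, introduced only for this illustration.

```python
# Hand-built inverted index: term -> set of event IDs (illustrative data).
terms = {
    'src_ip': {0, 1},
    'dst_ip': {2},
    '1.2.3.4': {0, 2},
}


def match_all(index, query, num_events):
    """AND: start from every event ID and intersect with each posting set."""
    results = set(range(num_events))
    for term in query:
        if term not in index:
            return set()  # one missing term means nothing can match
        results &= index[term]
    return results


def match_any(index, query):
    """OR: union the posting sets of the terms that exist."""
    results = set()
    for term in query:
        results |= index.get(term, set())
    return results


print(sorted(match_all(terms, ['src_ip', '1.2.3.4'], 3)))  # [0]
print(sorted(match_any(terms, ['src_ip', '1.2.3.4'])))     # [0, 1, 2]
```

One consequence of starting the And search from the full set of event IDs is that an empty list of terms matches every event, while an empty Or search matches none.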

(Editor: 91站长网)
