Installing and Using Sphinx on Windows (with Chinese Full-Text Search Support)

A while ago a colleague mentioned that one of our company's internal search features is built on Sphinx, but I never had time to try it myself. Today I finally have a moment, so let's dig in. Sphinx is usually deployed on Linux; this time I'll install and test it on Windows.

Preface:

I. About Sphinx

Sphinx is a full-text search engine released under GPLv2; commercial licensing (for example, embedding it in another product) requires contacting the author (sphinxsearch.com). Generally speaking, Sphinx is a standalone search engine intended to provide fast, space-efficient, highly relevant full-text search to other applications. Sphinx integrates easily with SQL databases and scripting languages. It has built-in support for MySQL and PostgreSQL data sources, and can also read XML data in a specific format from standard input. By modifying the source code, users can add new data sources (for example, native support for other kinds of DBMS). Search APIs are available for PHP, Python, Perl, Ruby, and Java, and Sphinx can also be used as a MySQL storage engine. The search API is simple enough to be ported to a new language within a few hours.

This article aims to provide a convenient way to install and configure Sphinx on Windows with Chinese full-text search support; the configuration part applies equally to Linux.

Sphinx features:

Fast indexing (peak performance up to 10 MB/sec on modern CPUs);
High-performance search (on 2–4 GB of text data, average query response time under 0.1 seconds);
Handles massive data sets (known to handle over 100 GB of text, or 100 million documents on a single-CPU system);
Excellent relevance, using a composite ranking method based on phrase proximity and statistical (BM25) scoring;
Distributed search support;
Document snippet (excerpt) generation;
Can serve as a MySQL storage engine to provide search;
Multiple query modes, including boolean, phrase, and word-proximity search;
Multiple full-text fields per document (up to 32);
Multiple additional attributes per document (e.g. group IDs, timestamps);
Stopword support;
Support for single-byte encodings and UTF-8;
Native MySQL support (both MyISAM and InnoDB);
Native PostgreSQL support.

Chinese manual download: sphinx_doc_zhcn_0.9

II. Installing Sphinx on Windows

1. Get the latest Windows build from http://sphinxsearch.com/downloads/release/ ; I used sphinx-2.0.6-release-win64-id64-full. Extract it to E:\webserver\sphinx.

2. Under E:\webserver\sphinx, create a data directory to hold index files and a log directory for log files, then copy E:\webserver\sphinx\sphinx.conf.in to E:\webserver\sphinx\sphinx.conf (note the file rename).
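A minimal sketch of this step from a Windows command prompt, using the paths above:

    rem paths assume the layout described above
    cd /d E:\webserver\sphinx
    mkdir data
    mkdir log
    copy sphinx.conf.in sphinx.conf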

3. Edit E:\webserver\sphinx\sphinx.conf; the settings that need changing are listed here:

    type           = mysql # data source type; mine is MySQL
    sql_host       = localhost # database server
    sql_user       = root # database user
    sql_pass       = '' # database password
    sql_db         = test # database name
    sql_port       = 3306 # database port
    sql_query_pre      = SET NAMES utf8 # uncomment this line if your database uses utf8 encoding
    index test1
    {
    # directory where the index files are stored
      path      = E:/webserver/sphinx/data/test1
    # encoding
      charset_type     = utf-8
      # charset table for utf-8
      charset_table     = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F
      # simple n-gram segmentation; only 0 and 1 are supported; set it to 1 for Chinese search
      ngram_len       = 1
    # characters to segment as CJK; uncomment this line for Chinese search
      ngram_chars      = U+3000..U+2FA1F
    }
    # searchd section: settings that need changing
    searchd
    {
      # log file
      log        = E:/webserver/sphinx/log/searchd.log
      # PID file, searchd process ID file name
      pid_file      = E:/webserver/sphinx/log/searchd.pid
      # this MUST be commented out when running searchd on Windows
      # seamless_rotate     = 1
    }

If you are not using a distributed index, comment out the following:

    # index dist1
    # {
     # 'distributed' index type MUST be specified
     # type    = distributed

     # local index to be searched
     # there can be many local indexes configured
     # local    = test1
     # local    = test1stemmed

     # remote agent
     # multiple remote agents may be specified
     # syntax is 'hostname:port:index1,[index2[,...]]'
     # agent    = localhost:3313:remote1
     # agent    = localhost:3314:remote2,remote3

     # remote agent connection timeout, milliseconds
     # optional, default is 1000 ms, ie. 1 sec
     # agent_connect_timeout = 1000

     # remote agent query timeout, milliseconds
     # optional, default is 3000 ms, ie. 3 sec
     # agent_query_timeout  = 3000
    # }

4. Import the test data: E:\webserver\MySQL Server 5.5\bin>mysql -uroot test < E:/webserver/sphinx/example.sql
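The bundled example.sql populates a documents table in the test database; its columns (id, group_id, date_added, title, content) are the ones referenced by sql_query in the full configuration below. A quick sanity check in the mysql client, just to confirm the import worked:

    -- verify the sample data landed in test.documents
    USE test;
    SELECT id, group_id, title FROM documents LIMIT 5;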

5. Build the index:

E:\webserver\sphinx\bin>indexer.exe test1 (note: test1 is the index test1 {...} block defined in sphinx.conf)
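If indexer.exe cannot find sphinx.conf on its own, point it at the file explicitly with --config; --all rebuilds every index in the file, and --rotate is what you use later to rebuild while searchd is already running:

    E:\webserver\sphinx\bin>indexer.exe --config E:\webserver\sphinx\sphinx.conf test1
    E:\webserver\sphinx\bin>indexer.exe --config E:\webserver\sphinx\sphinx.conf --all --rotate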

  1. Test a search for 'this'.

  2. Test a Chinese search for '我啊': apparently nothing is found. That is because the Windows command line uses GBK encoding, so of course a UTF-8 index finds nothing. Let's try it from a program instead. Create a file named foo.php under E:\webserver\sphinx\api, saved in UTF-8 encoding:
    <?php
    // load the Sphinx PHP client
    require 'sphinxapi.php';
    $s = new SphinxClient();
    $s->SetServer('localhost', 9312);
    $result = $s->Query('中文');
    var_dump($result);
    ?>
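
Note that Sphinx returns matching document IDs and attributes, not the documents themselves; a real page then fetches the rows from MySQL. A minimal sketch of that pattern, assuming the documents table from example.sql and an empty root password (error handling omitted):

    <?php
    require 'sphinxapi.php';

    $s = new SphinxClient();
    $s->SetServer('localhost', 9312);

    // search only the test1 index
    $result = $s->Query('中文', 'test1');
    if ($result !== false && !empty($result['matches'])) {
        // matches are keyed by document ID
        $ids = implode(',', array_keys($result['matches']));
        // fetch the actual rows back from MySQL
        $db = new mysqli('localhost', 'root', '', 'test');
        $rs = $db->query("SELECT id, title FROM documents WHERE id IN ($ids)");
        while ($row = $rs->fetch_assoc()) {
            echo $row['id'] . ': ' . $row['title'] . "\n";
        }
    } else {
        echo 'no matches: ' . $s->GetLastError();
    }
    ?>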
    

Start the Sphinx searchd service: E:\webserver\sphinx\bin>searchd.exe

Run the PHP query: visit http://www.test.com/sphinx/api/foo.php (a virtual host I configured myself).

At this point, the configuration of the Sphinx server side on Windows is complete.

Here is the full sphinx.conf used for the test above:

#
# Sphinx configuration file sample
#
# WARNING! While this sample file mentions all available options,
# it contains (very) short helper descriptions only. Please refer to
# doc/sphinx.html for details.
#

#############################################################################
## data source definition
#############################################################################

source src1
{
    # data source type. mandatory, no default value
    # known types are mysql, pgsql, mssql, xmlpipe, xmlpipe2, odbc
    type            = mysql

    #####################################################################
    ## SQL settings (for 'mysql' and 'pgsql' types)
    #####################################################################

    # some straightforward parameters for SQL source types
    sql_host        = 127.0.0.1
    sql_user        = root
    sql_pass        = shadow
    sql_db          = test
    sql_port        = 3306  # optional, default is 3306

    # UNIX socket name
    # optional, default is empty (reuse client library defaults)
    # usually '/var/lib/mysql/mysql.sock' on Linux
    # usually '/tmp/mysql.sock' on FreeBSD
    #
    # sql_sock      = /tmp/mysql.sock

    # MySQL specific client connection flags
    # optional, default is 0
    #
    # mysql_connect_flags   = 32 # enable compression

    # MySQL specific SSL certificate settings
    # optional, defaults are empty
    #
    # mysql_ssl_cert        = /etc/ssl/client-cert.pem
    # mysql_ssl_key     = /etc/ssl/client-key.pem
    # mysql_ssl_ca      = /etc/ssl/cacert.pem

    # MS SQL specific Windows authentication mode flag
    # MUST be in sync with charset_type index-level setting
    # optional, default is 0
    #
    # mssql_winauth     = 1 # use currently logged on user credentials

    # MS SQL specific Unicode indexing flag
    # optional, default is 0 (request SBCS data)
    #
    # mssql_unicode     = 1 # request Unicode data from server

    # ODBC specific DSN (data source name)
    # mandatory for odbc source type, no default value
    #
    # odbc_dsn      = DBQ=C:\data;DefaultDir=C:\data;Driver={Microsoft Text Driver (*.txt; *.csv)};
    # sql_query     = SELECT id, data FROM documents.csv

    # ODBC and MS SQL specific, per-column buffer sizes
    # optional, default is auto-detect
    #
    # sql_column_buffers    = content=12M, comments=1M

    # pre-query, executed before the main fetch query
    # multi-value, optional, default is empty list of queries
    #
    # if your data is utf-8, uncomment this line
    # sql_query_pre     = SET NAMES utf8
    sql_query_pre       = SET NAMES utf8
    # sql_query_pre     = SET SESSION query_cache_type=OFF

    # main document fetch query
    # mandatory, integer document ID field MUST be the first selected column
    sql_query       = \
        SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content \
        FROM documents

    # joined/payload field fetch query
    # joined fields let you avoid (slow) JOIN and GROUP_CONCAT
    # payload fields let you attach custom per-keyword values (eg. for ranking)
    #
    # syntax is FIELD-NAME 'from'  ( 'query' | 'payload-query' ); QUERY
    # joined field QUERY should return 2 columns (docid, text)
    # payload field QUERY should return 3 columns (docid, keyword, weight)
    #
    # REQUIRES that query results are in ascending document ID order!
    # multi-value, optional, default is empty list of queries
    #
    # sql_joined_field  = tags from query; SELECT docid, CONCAT('tag',tagid) FROM tags ORDER BY docid ASC
    # sql_joined_field  = wtags from payload-query; SELECT docid, tag, tagweight FROM tags ORDER BY docid ASC

    # file based field declaration
    #
    # content of this field is treated as a file name
    # and the file gets loaded and indexed in place of a field
    #
    # max file size is limited by max_file_field_buffer indexer setting
    # file IO errors are non-fatal and get reported as warnings
    #
    # sql_file_field        = content_file_path

    # range query setup, query that must return min and max ID values
    # optional, default is empty
    #
    # sql_query will need to reference $start and $end boundaries
    # if using ranged query:
    #
    # sql_query     = \
    #   SELECT doc.id, doc.id AS group, doc.title, doc.data \
    #   FROM documents doc \
    #   WHERE id>=$start AND id<=$end
    #
    # sql_query_range       = SELECT MIN(id),MAX(id) FROM documents

    # range query step
    # optional, default is 1024
    #
    # sql_range_step        = 1000

    # unsigned integer attribute declaration
    # multi-value (an arbitrary number of attributes is allowed), optional
    # optional bit size can be specified, default is 32
    #
    # sql_attr_uint     = author_id
    # sql_attr_uint     = forum_id:9 # 9 bits for forum_id
    sql_attr_uint       = group_id

    # boolean attribute declaration
    # multi-value (an arbitrary number of attributes is allowed), optional
    # equivalent to sql_attr_uint with 1-bit size
    #
    # sql_attr_bool     = is_deleted

    # bigint attribute declaration
    # multi-value (an arbitrary number of attributes is allowed), optional
    # declares a signed (unlike uint!) 64-bit attribute
    #
    # sql_attr_bigint       = my_bigint_id

    # UNIX timestamp attribute declaration
    # multi-value (an arbitrary number of attributes is allowed), optional
    # similar to integer, but can also be used in date functions
    #
    # sql_attr_timestamp    = posted_ts
    # sql_attr_timestamp    = last_edited_ts
    sql_attr_timestamp  = date_added

    # string ordinal attribute declaration
    # multi-value (an arbitrary number of attributes is allowed), optional
    # sorts strings (bytewise), and stores their indexes in the sorted list
    # sorting by this attr is equivalent to sorting by the original strings
    #
    # sql_attr_str2ordinal  = author_name

    # floating point attribute declaration
    # multi-value (an arbitrary number of attributes is allowed), optional
    # values are stored in single precision, 32-bit IEEE 754 format
    #
    # sql_attr_float        = lat_radians
    # sql_attr_float        = long_radians

    # multi-valued attribute (MVA) attribute declaration
    # multi-value (an arbitrary number of attributes is allowed), optional
    # MVA values are variable length lists of unsigned 32-bit integers
    #
    # syntax is ATTR-TYPE ATTR-NAME 'from' SOURCE-TYPE [;QUERY] [;RANGE-QUERY]
    # ATTR-TYPE is 'uint' or 'timestamp'
    # SOURCE-TYPE is 'field', 'query', or 'ranged-query'
    # QUERY is SQL query used to fetch all ( docid, attrvalue ) pairs
    # RANGE-QUERY is SQL query used to fetch min and max ID values, similar to 'sql_query_range'
    #
    # sql_attr_multi        = uint tag from query; SELECT docid, tagid FROM tags
    # sql_attr_multi        = uint tag from ranged-query; \
    #   SELECT docid, tagid FROM tags WHERE id>=$start AND id<=$end; \
    #   SELECT MIN(docid), MAX(docid) FROM tags

    # string attribute declaration
    # multi-value (an arbitrary number of these is allowed), optional
    # lets you store and retrieve strings
    #
    # sql_attr_string       = stitle

    # wordcount attribute declaration
    # multi-value (an arbitrary number of these is allowed), optional
    # lets you count the words at indexing time
    #
    # sql_attr_str2wordcount    = stitle

    # combined field plus attribute declaration (from a single column)
    # stores column as an attribute, but also indexes it as a full-text field
    #
    # sql_field_string  = author
    # sql_field_str2wordcount   = title

    # post-query, executed on sql_query completion
    # optional, default is empty
    #
    # sql_query_post        =

    # post-index-query, executed on successful indexing completion
    # optional, default is empty
    # $maxid expands to max document ID actually fetched from DB
    #
    # sql_query_post_index  = REPLACE INTO counters ( id, val ) \
    #   VALUES ( 'max_indexed_id', $maxid )

    # ranged query throttling, in milliseconds
    # optional, default is 0 which means no delay
    # enforces given delay before each query step
    sql_ranged_throttle = 0

    # document info query, ONLY for CLI search (ie. testing and debugging)
    # optional, default is empty
    # must contain $id macro and must fetch the document by that id
    sql_query_info      = SELECT * FROM documents WHERE id=$id

    # kill-list query, fetches the document IDs for kill-list
    # k-list will suppress matches from preceding indexes in the same query
    # optional, default is empty
    #
    # sql_query_killlist    = SELECT id FROM documents WHERE edited>=@last_reindex

    # columns to unpack on indexer side when indexing
    # multi-value, optional, default is empty list
    #
    # unpack_zlib       = zlib_column
    # unpack_mysqlcompress  = compressed_column
    # unpack_mysqlcompress  = compressed_column_2

    # maximum unpacked length allowed in MySQL COMPRESS() unpacker
    # optional, default is 16M
    #
    # unpack_mysqlcompress_maxsize  = 16M

    #####################################################################
    ## xmlpipe2 settings
    #####################################################################

    # type          = xmlpipe

    # shell command to invoke xmlpipe stream producer
    # mandatory
    #
    # xmlpipe_command       = cat @CONFDIR@/test.xml

    # xmlpipe2 field declaration
    # multi-value, optional, default is empty
    #
    # xmlpipe_field     = subject
    # xmlpipe_field     = content

    # xmlpipe2 attribute declaration
    # multi-value, optional, default is empty
    # all xmlpipe_attr_XXX options are fully similar to sql_attr_XXX
    #
    # xmlpipe_attr_timestamp    = published
    # xmlpipe_attr_uint = author_id

    # perform UTF-8 validation, and filter out incorrect codes
    # avoids XML parser choking on non-UTF-8 documents
    # optional, default is 0
    #
    # xmlpipe_fixup_utf8    = 1
}

# inherited source example
#
# all the parameters are copied from the parent source,
# and may then be overridden in this source definition
source src1throttled : src1
{
    sql_ranged_throttle = 100
}

#############################################################################
## index definition
#############################################################################

# local index example
#
# this is an index which is stored locally in the filesystem
#
# all indexing-time options (such as morphology and charsets)
# are configured per local index
index test1
{
    # index type
    # optional, default is 'plain'
    # known values are 'plain', 'distributed', and 'rt' (see samples below)
    # type          = plain

    # document source(s) to index
    # multi-value, mandatory
    # document IDs must be globally unique across all sources
    source          = src1

    # index files path and file name, without extension
    # mandatory, path must be writable, extensions will be auto-appended
    #path           = @CONFDIR@/data/test1
    # directory where the index files are stored
    path = E:/webserver/sphinx/data/test1
    # document attribute values (docinfo) storage mode
    # optional, default is 'extern'
    # known values are 'none', 'extern' and 'inline'
    docinfo         = extern

    # memory locking for cached data (.spa and .spi), to prevent swapping
    # optional, default is 0 (do not mlock)
    # requires searchd to be run from root
    mlock           = 0

    # a list of morphology preprocessors to apply
    # optional, default is empty
    #
    # builtin preprocessors are 'none', 'stem_en', 'stem_ru', 'stem_enru',
    # 'soundex', and 'metaphone'; additional preprocessors available from
    # libstemmer are 'libstemmer_XXX', where XXX is algorithm code
    # (see libstemmer_c/libstemmer/modules.txt)
    #
    # morphology        = stem_en, stem_ru, soundex
    # morphology        = libstemmer_german
    # morphology        = libstemmer_sv
    morphology      = none

    # minimum word length at which to enable stemming
    # optional, default is 1 (stem everything)
    #
    # min_stemming_len  = 1

    # stopword files list (space separated)
    # optional, default is empty
    # contents are plain text, charset_table and stemming are both applied
    #
    # stopwords     = @CONFDIR@/data/stopwords.txt

    # wordforms file, in "mapfrom > mapto" plain text format
    # optional, default is empty
    #
    # wordforms     = @CONFDIR@/data/wordforms.txt

    # tokenizing exceptions file
    # optional, default is empty
    #
    # plain text, case sensitive, space insensitive in map-from part
    # one "Map Several Words => ToASingleOne" entry per line
    #
    # exceptions        = @CONFDIR@/data/exceptions.txt

    # minimum indexed word length
    # default is 1 (index everything)
    min_word_len        = 1

    # charset encoding type
    # optional, default is 'sbcs'
    # known types are 'sbcs' (Single Byte CharSet) and 'utf-8'
    # encoding
    #charset_type       = sbcs
    charset_type     = utf-8

    # charset definition and case folding rules "table"
    # optional, default value depends on charset_type
    #
    # defaults are configured to include English and Russian characters only
    # you need to change the table to include additional ones
    # this behavior MAY change in future versions
    #
    # 'sbcs' default value is
    # charset_table     = 0..9, A..Z->a..z, _, a..z, U+A8->U+B8, U+B8, U+C0..U+DF->U+E0..U+FF, U+E0..U+FF
    # charset table for utf-8
    charset_table       = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F
    #
    # 'utf-8' default value is
    # charset_table     = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F

    # ignored characters list
    # optional, default value is empty
    #
    # ignore_chars      = U+00AD

    # minimum word prefix length to index
    # optional, default is 0 (do not index prefixes)
    #
    # min_prefix_len        = 0

    # minimum word infix length to index
    # optional, default is 0 (do not index infixes)
    #
    # min_infix_len     = 0

    # list of fields to limit prefix/infix indexing to
    # optional, default value is empty (index all fields in prefix/infix mode)
    #
    # prefix_fields     = filename
    # infix_fields      = url, domain

    # enable star-syntax (wildcards) when searching prefix/infix indexes
    # search-time only, does not affect indexing, can be 0 or 1
    # optional, default is 0 (do not use wildcard syntax)
    #
    # enable_star       = 1

    # expand keywords with exact forms and/or stars when searching fit indexes
    # search-time only, does not affect indexing, can be 0 or 1
    # optional, default is 0 (do not expand keywords)
    #
    # expand_keywords       = 1

    # n-gram length to index, for CJK indexing
    # only supports 0 and 1 for now, other lengths to be implemented
    # optional, default is 0 (disable n-grams)
    # simple n-gram segmentation; only 0 and 1 are supported; set it to 1 for Chinese search
    # ngram_len     = 1
    ngram_len       = 1

    # n-gram characters list, for CJK indexing
    # optional, default is empty
    #
    # characters to segment as CJK; enable this for Chinese search
    # ngram_chars       = U+3000..U+2FA1F
    ngram_chars     = U+3000..U+2FA1F

    # phrase boundary characters list
    # optional, default is empty
    #
    # phrase_boundary       = ., ?, !, U+2026 # horizontal ellipsis

    # phrase boundary word position increment
    # optional, default is 0
    #
    # phrase_boundary_step  = 100

    # blended characters list
    # blended chars are indexed both as separators and valid characters
    # for instance, AT&T will result in 3 tokens ("at", "t", and "at&t")
    # optional, default is empty
    #
    # blend_chars       = +, &, U+23

    # blended token indexing mode
    # a comma separated list of blended token indexing variants
    # known variants are trim_none, trim_head, trim_tail, trim_both, skip_pure
    # optional, default is trim_none
    #
    # blend_mode        = trim_tail, skip_pure

    # whether to strip HTML tags from incoming documents
    # known values are 0 (do not strip) and 1 (do strip)
    # optional, default is 0
    html_strip      = 0

    # what HTML attributes to index if stripping HTML
    # optional, default is empty (do not index anything)
    #
    # html_index_attrs  = img=alt,title; a=title;

    # what HTML elements contents to strip
    # optional, default is empty (do not strip element contents)
    #
    # html_remove_elements  = style, script

    # whether to preopen index data files on startup
    # optional, default is 0 (do not preopen), searchd-only
    #
    # preopen           = 1

    # whether to keep dictionary (.spi) on disk, or cache it in RAM
    # optional, default is 0 (cache in RAM), searchd-only
    #
    # ondisk_dict       = 1

    # whether to enable in-place inversion (2x less disk, 90-95% speed)
    # optional, default is 0 (use separate temporary files), indexer-only
    #
    # inplace_enable        = 1

    # in-place fine-tuning options
    # optional, defaults are listed below
    #
    # inplace_hit_gap       = 0 # preallocated hitlist gap size
    # inplace_docinfo_gap   = 0 # preallocated docinfo gap size
    # inplace_reloc_factor  = 0.1 # relocation buffer size within arena
    # inplace_write_factor  = 0.1 # write buffer size within arena

    # whether to index original keywords along with stemmed versions
    # enables "=exactform" operator to work
    # optional, default is 0
    #
    # index_exact_words = 1

    # position increment on overshort (less than min_word_len) words
    # optional, allowed values are 0 and 1, default is 1
    #
    # overshort_step        = 1

    # position increment on stopword
    # optional, allowed values are 0 and 1, default is 1
    #
    # stopword_step     = 1

    # hitless words list
    # positions for these keywords will not be stored in the index
    # optional, allowed values are 'all', or a list file name
    #
    # hitless_words     = all
    # hitless_words     = hitless.txt

    # detect and index sentence and paragraph boundaries
    # required for the SENTENCE and PARAGRAPH operators to work
    # optional, allowed values are 0 and 1, default is 0
    #
    # index_sp          = 1

    # index zones, delimited by HTML/XML tags
    # a comma separated list of tags and wildcards
    # required for the ZONE operator to work
    # optional, default is empty string (do not index zones)
    #
    # index_zones       = title, h*, th
}

# inherited index example
#
# all the parameters are copied from the parent index,
# and may then be overridden in this index definition
#index test1stemmed : test1
#{
#   path            = @CONFDIR@/data/test1stemmed
#   morphology      = stem_en
#}

# distributed index example
#
# this is a virtual index which can NOT be directly indexed,
# and only contains references to other local and/or remote indexes
#index dist1
#{
#   # 'distributed' index type MUST be specified
#   type            = distributed
#
#   # local index to be searched
#   # there can be many local indexes configured
#   local           = test1
#   local           = test1stemmed
#
#   # remote agent
#   # multiple remote agents may be specified
#   # syntax for TCP connections is 'hostname:port:index1,[index2[,...]]'
#   # syntax for local UNIX connections is '/path/to/socket:index1,[index2[,...]]'
#   agent           = localhost:9313:remote1
#   agent           = localhost:9314:remote2,remote3
#   # agent         = /var/run/searchd.sock:remote4
#
#   # blackhole remote agent, for debugging/testing
#   # network errors and search results will be ignored
#   #
#   # agent_blackhole       = testbox:9312:testindex1,testindex2
#
#
#   # remote agent connection timeout, milliseconds
#   # optional, default is 1000 ms, ie. 1 sec
#   agent_connect_timeout   = 1000
#
#   # remote agent query timeout, milliseconds
#   # optional, default is 3000 ms, ie. 3 sec
#   agent_query_timeout = 3000
#}

# realtime index example
#
# you can run INSERT, REPLACE, and DELETE on this index on the fly
# using MySQL protocol (see 'listen' directive below)
index rt
{
    # 'rt' index type must be specified to use RT index
    type            = rt

    # index files path and file name, without extension
    # mandatory, path must be writable, extensions will be auto-appended
    #path           = @CONFDIR@/data/rt
    path            = E:/webserver/sphinx/data/rt

    # RAM chunk size limit
    # RT index will keep at most this much data in RAM, then flush to disk
    # optional, default is 32M
    #
    # rt_mem_limit      = 512M

    # full-text field declaration
    # multi-value, mandatory
    rt_field        = title
    rt_field        = content

    # unsigned integer attribute declaration
    # multi-value (an arbitrary number of attributes is allowed), optional
    # declares an unsigned 32-bit attribute
    rt_attr_uint        = gid

    # RT indexes currently support the following attribute types:
    # uint, bigint, float, timestamp, string
    #
    # rt_attr_bigint        = guid
    # rt_attr_float     = gpa
    # rt_attr_timestamp = ts_added
    # rt_attr_string        = author
}

#############################################################################
## indexer settings
#############################################################################

indexer
{
    # memory limit, in bytes, kilobytes (16384K) or megabytes (256M)
    # optional, default is 32M, max is 2047M, recommended is 256M to 1024M
    mem_limit       = 32M

    # maximum IO calls per second (for I/O throttling)
    # optional, default is 0 (unlimited)
    #
    # max_iops      = 40

    # maximum IO call size, bytes (for I/O throttling)
    # optional, default is 0 (unlimited)
    #
    # max_iosize        = 1048576

    # maximum xmlpipe2 field length, bytes
    # optional, default is 2M
    #
    # max_xmlpipe2_field    = 4M

    # write buffer size, bytes
    # several (currently up to 4) buffers will be allocated
    # write buffers are allocated in addition to mem_limit
    # optional, default is 1M
    #
    # write_buffer      = 1M

    # maximum file field adaptive buffer size
    # optional, default is 8M, minimum is 1M
    #
    # max_file_field_buffer = 32M
}

# searchd section: the settings changed for this setup
#############################################################################
## searchd settings
#############################################################################

searchd
{
    # [hostname:]port[:protocol], or /unix/socket/path to listen on
    # known protocols are 'sphinx' (SphinxAPI) and 'mysql41' (SphinxQL)
    #
    # multi-value, multiple listen points are allowed
    # optional, defaults are 9312:sphinx and 9306:mysql41, as below
    #
    # listen            = 127.0.0.1
    # listen            = 192.168.0.1:9312
    # listen            = 9312
    # listen            = /var/run/searchd.sock
    listen          = 9312
    listen          = 9306:mysql41

    # log file, searchd run info is logged here
    # optional, default is 'searchd.log'
    #log            = @CONFDIR@/log/searchd.log

    log        = E:/webserver/sphinx/log/searchd.log
    # query log file, all search queries are logged here
    # optional, default is empty (do not log queries)
    #query_log      = @CONFDIR@/log/query.log
    query_log       = E:/webserver/sphinx/log/query.log
    # client read timeout, seconds
    # optional, default is 5
    read_timeout        = 5

    # request timeout, seconds
    # optional, default is 5 minutes
    client_timeout      = 300

    # maximum amount of children to fork (concurrent searches to run)
    # optional, default is 0 (unlimited)
    max_children        = 30

    # PID file, searchd process ID file name
    # mandatory
    #pid_file       = @CONFDIR@/log/searchd.pid
    pid_file    =  E:/webserver/sphinx/log/searchd.pid

    # max amount of matches the daemon ever keeps in RAM, per-index
    # WARNING, THERE'S ALSO PER-QUERY LIMIT, SEE SetLimits() API CALL
    # default is 1000 (just like Google)
    max_matches     = 1000

    # seamless rotate, prevents rotate stalls if precaching huge datasets
    # optional, default is 1
    # this MUST be commented out when running searchd on Windows
    #seamless_rotate        = 1

    # whether to forcibly preopen all indexes on startup
    # optional, default is 1 (preopen everything)
    preopen_indexes     = 1

    # whether to unlink .old index copies on successful rotation.
    # optional, default is 1 (do unlink)
    unlink_old      = 1

    # attribute updates periodic flush timeout, seconds
    # updates will be automatically dumped to disk this frequently
    # optional, default is 0 (disable periodic flush)
    #
    # attr_flush_period = 900

    # instance-wide ondisk_dict defaults (per-index value take precedence)
    # optional, default is 0 (precache all dictionaries in RAM)
    #
    # ondisk_dict_default   = 1

    # MVA updates pool size
    # shared between all instances of searchd, disables attr flushes!
    # optional, default size is 1M
    mva_updates_pool    = 1M

    # max allowed network packet size
    # limits both query packets from clients, and responses from agents
    # optional, default size is 8M
    max_packet_size     = 8M

    # crash log path
    # searchd will (try to) log crashed query to 'crash_log_path.PID' file
    # optional, default is empty (do not create crash logs)
    #
    # crash_log_path        = @CONFDIR@/log/crash

    # max allowed per-query filter count
    # optional, default is 256
    max_filters     = 256

    # max allowed per-filter values count
    # optional, default is 4096
    max_filter_values   = 4096

    # socket listen queue length
    # optional, default is 5
    #
    # listen_backlog        = 5

    # per-keyword read buffer size
    # optional, default is 256K
    #
    # read_buffer       = 256K

    # unhinted read size (currently used when reading hits)
    # optional, default is 32K
    #
    # read_unhinted     = 32K

    # max allowed per-batch query count (aka multi-query count)
    # optional, default is 32
    max_batch_queries   = 32

    # max common subtree document cache size, per-query
    # optional, default is 0 (disable subtree optimization)
    #
    # subtree_docs_cache    = 4M

    # max common subtree hit cache size, per-query
    # optional, default is 0 (disable subtree optimization)
    #
    # subtree_hits_cache    = 8M

    # multi-processing mode (MPM)
    # known values are none, fork, prefork, and threads
    # optional, default is fork
    #
    workers         = threads # for RT to work

    # max threads to create for searching local parts of a distributed index
    # optional, default is 0, which means disable multi-threaded searching
    # should work with all MPMs (ie. does NOT require workers=threads)
    #
    # dist_threads      = 4

    # binlog files path; use empty string to disable binlog
    # optional, default is build-time configured data directory
    #
    # binlog_path       = # disable logging
    # binlog_path       = @CONFDIR@/data # binlog.001 etc will be created there

    # binlog flush/sync mode
    # 0 means flush and sync every second
    # 1 means flush and sync every transaction
    # 2 means flush every transaction, sync every second
    # optional, default is 2
    #
    # binlog_flush      = 2

    # binlog per-file size limit
    # optional, default is 128M, 0 means no limit
    #
    # binlog_max_log_size   = 256M

    # per-thread stack size, only affects workers=threads mode
    # optional, default is 64K
    #
    # thread_stack          = 128K

    # per-keyword expansion limit (for dict=keywords prefix searches)
    # optional, default is 0 (no limit)
    #
    # expansion_limit       = 1000

    # RT RAM chunks flush period
    # optional, default is 0 (no periodic flush)
    #
    # rt_flush_period       = 900

    # query log file format
    # optional, known values are plain and sphinxql, default is plain
    #
    # query_log_format      = sphinxql

    # version string returned to MySQL network protocol clients
    # optional, default is empty (use Sphinx version)
    #
    # mysql_version_string  = 5.0.37

    # trusted plugin directory
    # optional, default is empty (disable UDFs)
    #
    # plugin_dir            = /usr/local/sphinx/lib

    # default server-wide collation
    # optional, default is libc_ci
    #
    # collation_server      = utf8_general_ci

    # server-wide locale for libc based collations
    # optional, default is C
    #
    # collation_libc_locale = ru_RU.UTF-8

    # threaded server watchdog (only used in workers=threads mode)
    # optional, values are 0 and 1, default is 1 (watchdog on)
    #
    # watchdog              = 1

    # SphinxQL compatibility mode (legacy columns and their names)
    # optional, default is 1 (old-style)
    #
    # compat_sphinxql_magics    = 1
}

# --eof--
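
Incidentally, since the searchd section above also listens on port 9306 with the mysql41 protocol, you can query the index over SphinxQL with a plain MySQL client instead of the PHP API. A quick sketch:

    E:\webserver\MySQL Server 5.5\bin>mysql -h127.0.0.1 -P9306
    mysql> SELECT * FROM test1 WHERE MATCH('test');
    mysql> SHOW META;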

If you want real Chinese word segmentation for search (rather than n-grams), read on.

Using Coreseek for Chinese word segmentation:

1. Download it from http://www.coreseek.cn/products/ft_down/ (Coreseek 3.2.13 win32 is enough; I am on 64-bit Windows 8, and it is backward compatible).

2. Install the software packages the system depends on.

The base components of the system require the following packages:

Once the first two components are installed the system can run, but the configuration file has to be edited by hand.

The configuration UI requires these packages:

If you downloaded the full edition, all of the files mentioned above can be found in the preq subdirectory. Install every package listed (note: Python and GTK must be installed first). Note: it must be ActivePython; the official Python distribution lacks the Win32 extensions the system needs and will not work. Note: you must restart your computer after completing this step.

3. Extract csft to a directory of your choice.

4. The contents of the csft configuration file are largely the same as Sphinx's (for details see: sphinx+mysql (1), (2)).

5. Create the dictionary file:

\bin\mmseg -u \data\unigram.txt # the word list is dynamic; just point it at the file

· Rename the generated file to uni.lib.
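For the record, mmseg -u writes its output next to the input file as unigram.txt.uni, which is what gets renamed; a sketch with full paths, assuming csft was extracted to E:\csft:

    rem paths assume csft was extracted to E:\csft
    E:\csft\bin>mmseg -u E:\csft\data\unigram.txt
    E:\csft\data>ren unigram.txt.uni uni.lib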

6. Import the sample.sql database.

7. Build the index: indexer.exe --all (for details see sphinx + mysql (1)).


The remaining steps are described in two branches:

A:

8. Install SphinxSE for MySQL:

http://www.sphinxsearch.com/downloads/mysql-5.0.45-sphinxse-0.9.8-win32.zip

After downloading, extract the archive over your MySQL directory and you are done. (Note: the MySQL versions must match.)

Enter mysql and run show engines; to check that the SPHINX engine is listed among the table types.

9. Create a table using the Sphinx storage engine:

    CREATE TABLE sphinx (
        id int(11) NOT NULL,
        weight int(11) NOT NULL,
        query varchar(255) NOT NULL,
        group_id int(11) NOT NULL,
        KEY Query (Query)
    ) ENGINE=SPHINX CONNECTION='sphinx://localhost:3312/test1';

Unlike an ordinary MySQL table, this one declares ENGINE=SPHINX CONNECTION='sphinx://localhost:3312/test1';, which means the table uses the SphinxSE engine; the connection string to Sphinx is 'sphinx://localhost:3312/test1', where test1 is the index name.

According to the official Sphinx documentation, this table must have at least three columns. The column names do not matter, but the types must appear in the order integer, integer, varchar, standing for the document ID, the match weight, and the query respectively, and both the document ID and query columns must be indexed. The table may declare further columns, which can only be of integer or TIMESTAMP type; these are bound to the Sphinx result set, so their names must match attribute names defined in sphinx.conf, or they will come back as NULL.

10. Querying the MySQL SphinxSE full-text storage engine with SQL

After installing the SphinxSE storage engine, first create a special search table with ENGINE=SPHINX, as follows:

    CREATE TABLE ArticleFulltext (
        ID          INTEGER NOT NULL,
        Weight      INTEGER NOT NULL,
        Query       VARCHAR(3072) NOT NULL,
        ...
        INDEX (Query)
    ) ENGINE=SPHINX CONNECTION="sphinx://localhost:3312/test";

· The table and column names can be anything, but the first three columns must be of type INT, INT, and VARCHAR. More columns may be declared, of type INT or TIMESTAMP only, with names matching the Sphinx configuration file; they carry back extra information about each hit.

· Once the table exists, you can run full-text searches from within MySQL with SQL like:

· SELECT * FROM ArticleFulltext WHERE Query='full-text search terms'; (search options can also be embedded in this query string; see the sketch after this list)

· The rows returned are the full-text search results, including document ID and weight; if ArticleFulltext declares additional attributes, they carry further details about each hit.

· Fused retrieval is then easy to implement via a SQL join, for example:

    SELECT ID, Title
    FROM Article, ArticleFulltext
    WHERE ArticleFulltext.ID = Article.ID AND Query = '博客'
        AND PublishTime > '2007-03-01' AND ReferCount > 0
    ORDER BY Weight * 0.5 + ReferCount * 0.5;

· The SQL above retrieves articles published since March 1, 2007 that contain the keyword '博客' and have been cited at least once, sorted by a combined weight computed from full-text relevance and citation count.

· As you can see, embedding a full-text engine into MySQL as a storage engine is a convenient way to provide fused retrieval; despite the many functional limitations, it is a clever and handy approach.
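As mentioned above, SphinxSE also accepts search options inside the query string itself, separated by semicolons. A hedged sketch (mode, sort, and limit are documented SphinxSE options; ReferCount here stands for an attribute declared in the Sphinx config):

    -- ReferCount is a hypothetical attribute; substitute one from your sphinx.conf
    SELECT * FROM ArticleFulltext
    WHERE Query = '博客;mode=any;sort=attr_desc:ReferCount;limit=20';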

11. Installing Sphinx as a Windows service:

searchd --install --config "csft.conf"

  1. Start the service: net start searchd (or whatever name the service was registered under)
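If you need a custom service name (for example, to run several instances), searchd also takes --servicename at install time; a sketch, assuming csft lives in E:\csft:

    rem paths assume csft was extracted to E:\csft
    E:\csft\bin>searchd --install --config E:\csft\csft.conf --servicename CoreseekSearch
    E:\csft\bin>net start CoreseekSearch

    rem remove the service again later:
    E:\csft\bin>searchd --delete --config E:\csft\csft.conf --servicename CoreseekSearch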

B:

Configure the sphinx.conf (csft.conf) file to support Chinese word segmentation:

charset_type = zh_cn.utf-8

charset_dictpath = D:\csft3.1\bin # directory containing the segmentation dictionary (uni.lib)

min_infix_len = 0
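
Putting branch B together, the index section of csft.conf differs from the plain-Sphinx setup roughly like this (a sketch, assuming Coreseek is installed under D:\csft3.1 with uni.lib in its bin directory; with dictionary-based segmentation the ngram_* settings are no longer needed):

    index test1
    {
        source           = src1
        # paths assume Coreseek under D:/csft3.1
        path             = D:/csft3.1/data/test1
        # mmseg dictionary segmentation instead of per-character n-grams
        charset_type     = zh_cn.utf-8
        charset_dictpath = D:/csft3.1/bin
        min_infix_len    = 0
    }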

Everything above was actually installed and tested; not all screenshots are included, so treat it as a reference.
