Wednesday, 23 October 2019

How to search for Arabic words using Thesaurus XML file in SQL Server.


Accent insensitive search in full-text catalog!


Turning off Accent insensitive option for existing search catalog will allow to search Arabic words using the Thesaurus feature in SQL Server.

In SQL server 2008 noise words have been replaced by stop words. http://msdn.microsoft.com/en-us/library/ms142551.aspx
And it is as you suspected; ”been” is classified as a stop word in the standard list for the English language. When the fulltext catalog was created with accent sensitivity the word “béen” was hit in searches. But when I created the index with accent insensitivity “béen” became equal to “been” and was therefore treated as a stop word. This explains the weird behavior and why “béen” would not be hit when accent insensitivity was set.
I used the following query to find the stoplist for my index. It was set to the value 0 meaning the default list was being used.
select fulltext_catalog_id,stoplist_id, * from sys.fulltext_indexes;
And this to get the list of words from the default list for the language.
select ssw.stopword, slg.name as language
from sys.fulltext_system_stopwords ssw
join sys.fulltext_languages slg
on slg.lcid = ssw.language_id and ssw.language_id = 1033;
The question is what I should do about this. Changing language from English doesn’t seem like such a good option since we have users from Sweden, Denmark, Estonia, Great Britain and soon Finland. “Béen” is actually the last name of one users.
UPDATE
I ended up with this solution.
-- Create stoplist copy from system stoplist
CREATE
 FULLTEXT STOPLIST SL FROM SYSTEM STOPLIST;
-- Set the index to use the new stoplist
ALTER
 FULLTEXT INDEX ON searchtable set stoplist SL
-- Removing "stop" word from stoplist
ALTER FULLTEXT STOPLIST SL
DROP 'been' LANGUAGE 'English';

Thanks for all the help finding this Erland.





No comments:

Post a Comment