In text retrieval, full text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. The full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases.-- Wikipedia
Full text search in PostgreSQL:
PostgreSQL full text search (FTS) is an advanced search option where results cannot be retrieved through "=", "LIKE" or "~" operators. PostgreSQL FTS provides following advanced features:
- Dictionary support
- Stemming
- Stop words
- Fuzzy search for misspellings
- Ordering of search results
- Index support
Document:
A document is the unit of searching in a full-text search system. In PostgreSQL context, a document is generally a text field within a row of a database table or combination of such fields from several tables obtained dynamically. A document can also be constructed from various parts for indexing and may not be stored anywhere else as a whole. It is not related to table schema but to data and may be a logical entity. A document needs to be formatted into lexemes which are understood by PostgreSQL for full text search support.
“A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token”.PostgreSQL provides two data types to support full-text search, one is tsvector and another is tsquery type.
- tsvector:
PostgreSQL provides tsvector data type for storing preprocessed documents. For text search purpose, each document needs to be reduced to preprocessed tsvector format. PostgreSQL performs searching and ranking of results entirely on tsvector representation of a document. A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word look alike. For example converting uppercase letters to lowercase, removing suffixes which allows searches to find variant forms of a word.
For example:
SELECT 'A fat dog jumped over the lazy cat'::tsvector;
tsvector
----------------------------------------------------
'A' 'cat' 'dog' 'fat' 'jumped' 'lazy' 'over' 'the'
- tsquery:
Type tsquery represents a text query. A tsquery value stores search terms, which must be already-normalized lexemes and may combine multiple Boolean operators.
For example:tsquery
SELECT 'dog & cat'::tsquery;
tsquery
---------------------------
'dog' & 'cat'
SELECT 'dog & (cat | rat)'::tsquery;
tsquery
---------------------------
'dog' & ( 'cat' | 'rat' )To implement full text search there must be a function to create a tsvector from a document and a tsquery from a user query.
- to_tsvector:
PostgreSQL provides the function to_tsvector for converting a document to the tsvector data type. The to_tsvector parses and normalizes a document string into tokens, reduces the tokens to lexemes, and returns a tsvector which lists the lexemes together with their positions in the document.
For example:SELECT to_tsvector('A fat dog jumped over the lazy cat');
would result in
to_tsvector
----------------------
'cat':8 'dog':3 'fat':2 'jump':4 'lazi':7
Here stop-words “A” and “over” were removed, and the words “jumped” and “lazy” were converted to their stems.
- to_tsquery:
Function to_tsquery is used to convert user-written text into a proper tsquery format. The function to_tsquery takes one argument, a string containing the tokens to search for. We can provide an optional argument to this function.
The to_tsquery function does not accept a string of text, it requires a string of tokens separated by operators. It then goes and creates a true tsquery to retrieve the results.
For example:SELECT to_tsvector('A fat dog jumped over the lazy cat') @@ to_tsquery( 'jumping') would result in t, meaning that a match is found.
- plainto_tsquery:
Function plainto_tsquery is used to convert a string of text into a proper tsquery format. plainto_tsquery transforms unformatted text querytext to tsquery. The text is parsed and normalized much as for to_tsvector, then the & (AND) Boolean operator is inserted between surviving words.
For example:SELECT to_tsvector('A fat dog jumped over the lazy cat') @@ plainto_tsquery('jumping over'); would result in t.
Observe we have not used operator between search words, now it is a simple search string.
Here, the plainto_tsquery function converts this string to a list of lexemes separated by & (AND) operator. The only limitation with this function is that it can separate words(lexemes) with &(AND) operator only.
PostgreSQL full text search is based on the match operator @@, which returns true if a tsvector matches a tsquery. By default, PostgreSQL uses ‘english’ as text search configuration for the function to_tsvector and it will also ignore English stop words.
PostgreSQL full text search (FTS) with Hibernate:
There are many approaches available to capitalize on PostgreSQL FTS capabilities with Java web application, however, we shall restrict our discussion to use PostgreSQL FTS with Hibernate HQL.
As mentioned earlier, PostgreSQL makes use of @@ operator for full text search, If we want to support this in hibernate or HQL queries then we need to modify our current PostgreSQLDialect to support special character '@@' for full text search.
Following are the steps to use PostgreSQL FTS with Hibernate:
- Create a class for Hibernate SQL function for PostgreSQL FTS.
public class PostgreSQLFTSFunction implements SQLFunction {
@Override
public boolean hasArguments() {
return true;
}
@Override
public boolean hasParenthesesIfNoArguments() {
return false;
}
@Override
public Type getReturnType(final Type firstArgumentType, final Mapping mapping) throws QueryException {
return new BooleanType();
}
@Override
public String render(Type firstArgumentType, List args, SessionFactoryImplementor factory) throws QueryException {
if (args== null || args.size() < 2) {
throw new IllegalArgumentException("The function must be passed at least 2 arguments");
}
String fragment = null;
String ftsConfiguration = null;
String field = null;
String value = null;
if (args.size() == 3) {
ftsConfiguration = (String) args.get(0);
field = (String) args.get(1);
value = (String) args.get(2);
fragment = "to_tsvector(" + ftsConfiguration + ", " + field + ") @@ " + "plainto_tsquery(" + ftsConfiguration + ", " + value + ")";
} else {
field = (String) args.get(0);
value = (String) args.get(1);
fragment = "to_tsvector(" + field + ") @@ " + "plainto_tsquery('" + value + "')";
}
return fragment;
}
}Here we are implementing SQLFunction interface. The crucial method here is overridden render method. This method renders the function call as SQL fragment. It generates a valid SQL statement based on the provided input. In this method, we make '@@' operator available to Hibernate.
- Extend PostgreSQL dialect:
We need to create a new class and extend PostgreSQL9Dialect class to register our newly created function.public class CustomPostgreSQLDialect extends PostgreSQL9Dialect {
public CustomPostgreSQLDialect () {
registerFunction("fts", new PostgreSQLFTSFunction());
}
}This is the place where we can register some additional full text search functions required on top of those defined in PostgreSQLDialect.
- Configure Hibernate to use our custom dialect:
Update following hibernate property:<property name="hibernate.dialect" value="pkg_name.CustomPostgreSQLDialect " />
- Use custom registered function ‘fts’ in queries:
Assuming Employee class in an entity:List<Employee> list = em.createQuery("select e from Employee e where fts(pg_catalog.english, e.address, :address) = true")
.setParameter("address", "String to be searched").getResultList();Here, "pg_catalog.english" tells PostgreSQL that we will use English dictionary for search. "Address" is the property of Employee class mapped to the corresponding column in the database. One thing to note here is SQL Functions have to return a value and when used in HQL it must be in an expression form. (fts(address, ‘search string’ ) = true).