guix search doesn't weigh word matches higher than subword matches

Richard Sent wrote on 1 May 04:18 +0200

Recipients:(address . bug-guix@gnu.org)

Message-ID:87bk5qcm1w.fsf@freakingpenguin.com

Hi Guix!

When running guix search, relevance in the synopsis and description fields
is computed strictly by the number of matches, whether as a word or as a
subword. Ideally, if a search string matches an isolated word in one of
those fields, that result should be considered more relevant than one that
simply matches a subword, even multiple times.

To illustrate, imagine trying to find what package provides the `rsh`
binary and running `$ guix search rsh`. This binary is part of
`inetutils` and the description field contains:

> Inetutils is a collection of common network programs, such as an ftp
> client and server, a telnet client and server, an rsh client and
> server, and hostname.

Most likely, this is what the user is interested in. However, inetutils
does not show up until roughly the 75th result, with a relevance of 2
(the lowest possible relevance).

Almost every result ranked above it contains the string "rsh" as a
component of another word, such as "marshaling", "powershell", and
"hershey". However, these match multiple times and so are weighted
significantly higher.
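
The ranking problem described above can be reproduced with a toy model. This is only a sketch under assumptions: it is not Guix's actual scoring code, `count_score` is a hypothetical helper, and the "example-marshal" package and its description are made up for illustration. Only the inetutils description comes from this report.

```python
import re

def count_score(query, text):
    """Toy relevance: count every case-insensitive occurrence of the query,
    whether it is a whole word or part of a longer word."""
    return len(re.findall(re.escape(query), text, flags=re.IGNORECASE))

descriptions = {
    # Description quoted in this report.
    "inetutils": (
        "Inetutils is a collection of common network programs, such as an ftp "
        "client and server, a telnet client and server, an rsh client and "
        "server, and hostname."
    ),
    # Hypothetical package and description, invented for this example.
    "example-marshal": "Marshaling library: marshal and unmarshal objects.",
}

# Rank packages by descending toy relevance for the query "rsh".
ranked = sorted(
    descriptions,
    key=lambda name: count_score("rsh", descriptions[name]),
    reverse=True,
)
print(ranked)
```

With pure match counting, the made-up package is listed before inetutils, because "rsh" occurs inside three of its words while inetutils mentions rsh only once.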

Ideally, guix search should rate inetutils higher because the string
"rsh" occurs as its own word, not as a component of another, unrelated
word. (Very, very few people would search "rsh" looking for matches with
"hershey", even if "hershey" occurs multiple times.)
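
One possible shape for such a weighting, sketched under assumptions: the names `weighted_score`, `WORD_WEIGHT`, and `SUBWORD_WEIGHT` are hypothetical, the weight values are arbitrary, and the second sample description is made up. This is not how Guix currently scores results, and it assumes the query is a single alphanumeric word so that `\b` behaves as expected.

```python
import re

# Assumed weights, chosen only for illustration.
WORD_WEIGHT = 10    # the query appears as a word of its own
SUBWORD_WEIGHT = 1  # the query appears inside a longer word

def weighted_score(query, text):
    """Toy relevance that still counts subword matches,
    but weighs whole-word matches more heavily."""
    escaped = re.escape(query)
    total = len(re.findall(escaped, text, flags=re.IGNORECASE))
    whole = len(re.findall(r"\b" + escaped + r"\b", text, flags=re.IGNORECASE))
    return WORD_WEIGHT * whole + SUBWORD_WEIGHT * (total - whole)

inetutils = (
    "Inetutils is a collection of common network programs, such as an ftp "
    "client and server, a telnet client and server, an rsh client and "
    "server, and hostname."
)
# Hypothetical description, invented for this example.
marshal_example = "Marshaling library: marshal and unmarshal objects."

print(weighted_score("rsh", inetutils))
print(weighted_score("rsh", marshal_example))
```

Subword matches keep a nonzero weight here, so queries that rely on them still return results. They simply rank below a whole-word match.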

Another example of where this can happen is with "dig", part of the bind
package. Searching for "dig" returns garbage because "dig" is a common
subword. Bind is scored with a relevance of 2, even though bind's
description emphasises that dig is part of it.

This would improve the experience when searching with strings that
commonly occur as subwords.

Since this change can't occur in a vacuum, care should be taken not to
reduce the effectiveness of other reasonably foreseeable search queries.

-- 
Take it easy,
Richard Sent
Making my computer weirder one commit at a time.

bokr wrote on 1 May 15:45 +0200

Recipients:(name . Richard Sent)(address . richard@freakingpenguin.com)(address . 70689@debbugs.gnu.org)

Message-ID:20240501134505.GA10144@LionPure

On +2024-04-30 22:18:03 -0400, Richard Sent wrote:

> Hi Guix!
> 
> When running guix search, relevance in synopsis and description fields
> are computed strictly by the number of matches, both as a word and as a
> subword. Ideally, if a search string matches an isolated word in a
> search, that result should be considered more relevant than simply
> matching a subword, even multiple times.
> 
> To illustrate, imagine trying to find what package provides the `rsh`
> binary and running `$ guix search rsh`. This binary is part of
> `inetutils` and the description field contains:
> 
> > Inetutils is a collection of common network programs, such as an ftp
> > client and server, a telnet client and server, an rsh client and
> > server, and hostname.
> 
> Most likely, this is what the user is interested in. However, inetutils
> does not show up until roughly the ~75th result with a relevance of 2
> (the lowest possible relevance).
> 
> Almost every search result beforehand contains the string "rsh" as a
> component of another word, such as "marshaling", "powershell", and
> "hershey". However, these match multiple times and are weighted
> significantly higher.
> 
> Ideally, guix search should rate inetutils higher because the string
> "rsh" occurs as its own word, not as a component of another, unrelated
> word. (Very, very few people would search "rsh" looking for matches with
> "hershey", even if "hershey" occurs multiple times.)
> 
> Another example of where this can happen is with "dig", part of the bind
> package. Searching for "dig" returns garbage because "dig" is a common
> subword. Bind is scored with a relevance of 2, even though bind's
> description emphasises that dig is part of it.
> 
> This would improve the experience when searching with strings that
> commonly occur as subwords.
> 
> Since this change can't occur in a vacuum, care should be taken not to
> reduce the effectiveness of other reasonably foreseeable search queries.
> 
> -- 
> Take it easy,
> Richard Sent
> Making my computer weirder one commit at a time.

I like your proposal :)

I'm wondering how [1] compares in what it does for your use(ful) case.

(I am not familiar with Hyper Estraier beyond being prompted for gnu.org searching)

[1] https://directory.fsf.org/wiki/Hyper_Estraier

Regards,

Bengt Richter
