Lexiconfree beam search #101
base: master
Conversation
Two general points:
With this PR alone, the search algorithm is not usable yet. I will make PRs for an Flf node and Python bindings separately. But I can include the
Yeah, probably. Maybe even into
src/Search/LexiconfreeTimesyncBeamSearch/LexiconfreeTimesyncBeamSearch.hh (outdated review thread, resolved)
src/Search/LexiconfreeTimesyncBeamSearch/LexiconfreeTimesyncBeamSearch.cc (outdated review thread, resolved)
src/Search/LexiconfreeTimesyncBeamSearch/LexiconfreeTimesyncBeamSearch.cc (outdated review thread, resolved)
    log() << extensions.size() << " candidates survived beam pruning";
}

std::sort(extensions.begin(), extensions.end());
Do they actually have to be sorted if no score pruning is done? If not, I would put that in the if-statement below.
Yes, the recombination also requires the hypotheses to be sorted, and we also want beam.front() to be the single-best in the end.
But you can find the first-best hypothesis faster than in O(N log N), so I would recommend just computing the minimum for that. Pruning can also be done in linear time. In recombination you don't need the order either, you can just check the scores, right?
The order is needed in the recombination because it is assumed that the best hypothesis comes first; if a hypothesis with the same scoring context appears later, we know that its score is worse.
I can rewrite the pruning and recombination functions to not depend on sorted hypotheses.
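A self-contained sketch of what that rewrite could look like (the types below are simplified stand-ins, not the PR's actual ExtensionCandidate or scoring-context classes): the best score is found with a single linear scan, score and beam pruning run in (expected) linear time, and recombination keeps the best candidate per scoring context via a hash map instead of relying on sorted input.

#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <vector>

using Score             = float;
using ScoringContextRef = std::size_t;  // stand-in for a hashable scoring-context handle

struct ExtensionCandidate {
    Score             score;
    ScoringContextRef scoringContext;
    bool operator<(ExtensionCandidate const& other) const {
        return score < other.score;
    }
};

void pruneAndRecombine(std::vector<ExtensionCandidate>& extensions, std::size_t maxBeamSize, Score scoreThreshold) {
    if (extensions.empty()) {
        return;
    }

    // Best score in O(N); no full sort is needed just to know the front element.
    Score bestScore = std::min_element(extensions.begin(), extensions.end())->score;

    // Score pruning in O(N): drop everything worse than best + threshold.
    extensions.erase(
            std::remove_if(extensions.begin(), extensions.end(),
                           [&](ExtensionCandidate const& e) { return e.score > bestScore + scoreThreshold; }),
            extensions.end());

    // Beam pruning in expected O(N): keep the maxBeamSize best candidates, in arbitrary order.
    if (extensions.size() > maxBeamSize) {
        std::nth_element(extensions.begin(), extensions.begin() + maxBeamSize, extensions.end());
        extensions.resize(maxBeamSize);
    }

    // Recombination without sorted input: keep only the best candidate per scoring context.
    std::unordered_map<ScoringContextRef, std::size_t> bestPerContext;
    std::vector<ExtensionCandidate>                    recombined;
    recombined.reserve(extensions.size());
    for (ExtensionCandidate const& ext : extensions) {
        auto [it, inserted] = bestPerContext.try_emplace(ext.scoringContext, recombined.size());
        if (inserted) {
            recombined.push_back(ext);
        }
        else if (ext.score < recombined[it->second].score) {
            recombined[it->second] = ext;
        }
    }
    extensions.swap(recombined);
}

Note that std::nth_element leaves the survivors unordered, which is fine as long as nothing downstream assumes beam.front() is the best; otherwise a final std::min_element (or swapping the minimum to the front) is enough.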
src/Search/LexiconfreeTimesyncBeamSearch/LexiconfreeTimesyncBeamSearch.hh (outdated review thread, resolved)
…lattice from beam
static const Core::ParameterBool paramUseSentenceEnd;
static const Core::ParameterBool paramSentenceEndIndex;
These two are currently not used. I don't know whether you want to keep them in case you introduce sentence-end handling at some point, or whether you want to remove them for now.
Global comment: it would be nice if the order of the method definitions matched the order of the declarations in the class.
SearchAlgorithmV2* Module_::createSearchAlgorithm(const Core::Configuration& config) const {
    SearchAlgorithmV2* searchAlgorithm = 0;
Suggested change:

-SearchAlgorithmV2* Module_::createSearchAlgorithm(const Core::Configuration& config) const {
-    SearchAlgorithmV2* searchAlgorithm = 0;
+SearchAlgorithmV2* Module_::createSearchAlgorithmV2(const Core::Configuration& config) const {
+    SearchAlgorithmV2* searchAlgorithm = nullptr;
src/Search/Module.hh (outdated)

    SearchAlgorithm*   createRecognizer(SearchType type, const Core::Configuration& config) const;
    LatticeHandler*    createLatticeHandler(const Core::Configuration& c) const;
    SearchAlgorithmV2* createSearchAlgorithm(const Core::Configuration& config) const;
Suggested change:

-    SearchAlgorithmV2* createSearchAlgorithm(const Core::Configuration& config) const;
+    SearchAlgorithmV2* createSearchAlgorithmV2(const Core::Configuration& config) const;
src/Search/Traceback.cc (outdated)

#include "Traceback.hh"
#include <Lattice/LatticeAdaptor.hh>
#include <Speech/Types.hh>
#include <stack>

#include <stack>
#include "Traceback.hh" | |
#include <Lattice/LatticeAdaptor.hh> | |
#include <Speech/Types.hh> | |
#include <stack> | |
#include <stack> | |
#include "Traceback.hh" | |
#include <stack> | |
#include <Lattice/LatticeAdaptor.hh> | |
#include <Speech/Types.hh> |
@@ -46,6 +46,7 @@ CHECK_O = $(OBJDIR)/check.o \
    ../Mm/libSprintMm.$(a) \
    ../Mc/libSprintMc.$(a) \
    ../Search/libSprintSearch.$(a) \
Better: use LIBS_SEARCH here and also for all other Makefiles below.
@@ -12,6 +12,7 @@ TARGETS = archiver$(exe)
ARCHIVER_O = $(OBJDIR)/Archiver.o \
    ../../Speech/libSprintSpeech.$(a) \
    ../../Search/libSprintSearch.$(a) \
    ../../Search/LexiconfreeTimesyncBeamSearch/libSprintLexiconfreeTimesyncBeamSearch.$(a) \
Correct indentation, please.
/*
 * Collect all possible extensions for all hypotheses in the beam.
 */
std::vector<ExtensionCandidate> extensions;
In AdvancedTreeSearch, the vector for hypothesis expansion is also part of the class and not reallocated in every step. Maybe you could consider doing the same.
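A rough sketch of that idea, with placeholder names rather than the PR's actual class layout: keep the extension buffer as a member that is cleared at the start of every step, so its capacity is reused instead of reallocating a new vector each time.

#include <vector>

// Illustrative only; class, member and type names are placeholders.
class LexiconfreeTimesyncBeamSearchSketch {
public:
    void processStep() {
        extensions_.clear();  // keeps the capacity allocated in earlier steps
        // ... fill extensions_ with this step's candidates, then prune ...
    }

private:
    struct ExtensionCandidate { /* token, score, ... */ };
    std::vector<ExtensionCandidate> extensions_;  // reused across time steps
};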
/*
 * Create new beam from surviving extensions.
 */
std::vector<LabelHypothesis> newBeam;
See comment above on preallocating vectors.
scorePruning(extensions);

if (debugLogging_) {
    log() << extensions.size() << " candidates survived score pruning";
Maybe you could compute the min/avg/max statistics and log them at the end of the segment like in advanced-tree-search.
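One possible shape for such per-segment statistics, as a self-contained sketch rather than an existing RASR class: accumulate min/avg/max of the surviving-candidate counts over all steps and emit a single summary when the segment is finished.

#include <algorithm>
#include <cstddef>
#include <limits>
#include <ostream>

// Illustrative accumulator; the search may of course reuse an existing statistics helper instead.
struct PruningStatistics {
    std::size_t min   = std::numeric_limits<std::size_t>::max();
    std::size_t max   = 0ul;
    std::size_t sum   = 0ul;
    std::size_t steps = 0ul;

    void operator+=(std::size_t numSurvivors) {
        min = std::min(min, numSurvivors);
        max = std::max(max, numSurvivors);
        sum += numSurvivors;
        ++steps;
    }

    void write(std::ostream& os, const char* name) const {
        if (steps == 0ul) {
            return;
        }
        os << name << ": min " << min
           << ", avg " << static_cast<double>(sum) / static_cast<double>(steps)
           << ", max " << max << "\n";
    }
};

// Usage idea: add `stats += extensions.size();` in every step and write one
// summary line per statistic once decoding of the segment has finished.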
 * Create scoring requests for the label scorer.
 * Each extension candidate makes up a request.
 */
std::vector<Nn::LabelScorer::Request> requests;
Could we avoid filling that extensions vector above and directly fill out the requests?
We could maybe combine filling the requests and the extensions vector in the same loop? But I don't think we can completely avoid the extensions vector.
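A self-contained sketch of that combined loop; the struct definitions are simplified stand-ins for the PR's LabelHypothesis, ExtensionCandidate and Nn::LabelScorer::Request types and exist only so the sketch compiles on its own.

#include <cstddef>
#include <vector>

// Simplified stand-ins, not the PR's actual types.
struct ScoringContext {};
struct LabelHypothesis {
    ScoringContext const* scoringContext;
};
struct ExtensionCandidate {
    LabelHypothesis const* baseHypothesis;
    std::size_t            token;
};
struct Request {
    ScoringContext const* context;
    std::size_t           token;
};

void collectExtensionsAndRequests(std::vector<LabelHypothesis> const& beam,
                                  std::size_t                         numTokens,
                                  std::vector<ExtensionCandidate>&    extensions,
                                  std::vector<Request>&               requests) {
    extensions.clear();
    requests.clear();
    extensions.reserve(beam.size() * numTokens);
    requests.reserve(beam.size() * numTokens);
    for (LabelHypothesis const& hyp : beam) {
        for (std::size_t token = 0ul; token < numTokens; ++token) {
            // One entry per (hypothesis, token) pair in both vectors; the indices stay
            // aligned, so the score returned for requests[i] belongs to extensions[i].
            extensions.push_back({&hyp, token});
            requests.push_back({hyp.scoringContext, token});
        }
    }
}

Keeping the two vectors index-aligned is the main point: the score returned for requests[i] can be attached directly to extensions[i] without any lookup.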
Simple time-synchronous beam search algorithm based on the new SearchAlgorithmV2 interface. Does not use a (proper) pronunciation lexicon, word-level LM or transition model. Performs special handling of blank if a blank index is set. The main purpose is open-vocabulary search with CTC/Neural Transducer (or similar) models. Supports global pruning by maximum beam size and by score difference to the best hypothesis. Uses a LabelScorer for context initialization/extension and scoring.
The search requires a lexicon that represents the vocabulary. Each lemma is viewed as a token whose index in the lexicon corresponds to the associated output index of the LabelScorer.
Depends on #103 and #104.
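For readers new to the interface, here is a compact, self-contained sketch of the decoding loop described above. Everything in it is a simplified stand-in: it scores from a precomputed matrix instead of a LabelScorer and omits blank handling, scoring contexts and recombination; it only illustrates the expand / score-prune / beam-prune cycle per time step.

#include <algorithm>
#include <cstddef>
#include <vector>

using Score = float;

struct Hypothesis {
    std::vector<std::size_t> labels;        // token sequence so far
    Score                    score = 0.0f;  // accumulated (negative log) score
};

// scores[t][v] is the score of token v at time step t. Assumes non-empty frames,
// maxBeamSize >= 1 and scoreThreshold >= 0.
std::vector<std::size_t> timesyncBeamSearch(std::vector<std::vector<Score>> const& scores,
                                            std::size_t maxBeamSize, Score scoreThreshold) {
    auto byScore = [](Hypothesis const& a, Hypothesis const& b) { return a.score < b.score; };

    std::vector<Hypothesis> beam{Hypothesis{}};  // start with a single empty hypothesis
    for (std::vector<Score> const& frame : scores) {
        // Expansion: every hypothesis in the beam is extended by every token.
        std::vector<Hypothesis> extensions;
        extensions.reserve(beam.size() * frame.size());
        for (Hypothesis const& hyp : beam) {
            for (std::size_t v = 0ul; v < frame.size(); ++v) {
                Hypothesis ext = hyp;
                ext.labels.push_back(v);
                ext.score += frame[v];
                extensions.push_back(std::move(ext));
            }
        }

        // Score pruning relative to the best extension.
        Score best = std::min_element(extensions.begin(), extensions.end(), byScore)->score;
        extensions.erase(std::remove_if(extensions.begin(), extensions.end(),
                                        [&](Hypothesis const& h) { return h.score > best + scoreThreshold; }),
                         extensions.end());

        // Beam pruning to at most maxBeamSize hypotheses.
        if (extensions.size() > maxBeamSize) {
            std::nth_element(extensions.begin(), extensions.begin() + maxBeamSize, extensions.end(), byScore);
            extensions.resize(maxBeamSize);
        }

        beam = std::move(extensions);
    }

    // Return the single best label sequence.
    return std::min_element(beam.begin(), beam.end(), byScore)->labels;
}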