-
-
Notifications
You must be signed in to change notification settings - Fork 144
[Store] Rework state handling of indexer #1149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| $sources = TextFileLoader::createSource([ | ||
| dirname(__DIR__, 2).'/fixtures/movies/gladiator.md', | ||
| dirname(__DIR__, 2).'/fixtures/movies/inception.md', | ||
| dirname(__DIR__, 2).'/fixtures/movies/jurassic-park.md', | ||
| ]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could also be like this:
| $sources = TextFileLoader::createSource([ | |
| dirname(__DIR__, 2).'/fixtures/movies/gladiator.md', | |
| dirname(__DIR__, 2).'/fixtures/movies/inception.md', | |
| dirname(__DIR__, 2).'/fixtures/movies/jurassic-park.md', | |
| ]); | |
| $sources = [ | |
| new TextFile(dirname(__DIR__, 2).'/fixtures/movies/gladiator.md'), | |
| new TextFile(dirname(__DIR__, 2).'/fixtures/movies/inception.md'), | |
| new TextFile(dirname(__DIR__, 2).'/fixtures/movies/jurassic-park.md'), | |
| ]; |
- assumption is that factory will help with bundle config
- validation could move from factory to constructor
lyrixx
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool Cool! It's a good start.
I asked a few questions
| use Psr\Log\NullLogger; | ||
| use Symfony\AI\Store\Document\EmbeddableDocumentInterface; | ||
| use Symfony\AI\Store\Document\FilterInterface; | ||
| use Symfony\AI\Store\Document\LoaderInterface; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This interface is not used anymore, it should be removed, shouldn't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure i get the point - it was renamed to SourceTypeLoaderInterface, but is not needed here anymore
| } | ||
|
|
||
| public function index(array $options = []): void | ||
| public function index(iterable|SourceInterface $sources, array $options = []): void |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so now, we must pass something when indexing? We loose the ability to run :
bin/console ai.store.index indexer_nameto index everything.
Or I missed something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, that's open and needs to be tackled in the bundle 👍 will follow
src/store/src/Document/Loader.php
Outdated
| foreach ($sources as $source) { | ||
| $loaderFound = false; | ||
| foreach ($this->sourceTypeLoaders as $sourceTypeLoader) { | ||
| if ($sourceTypeLoader->supports($source)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the part I don't like.
We already have a strong coupling between a model, and its implementation.
We could use the DIC (service locator) to solve this.
$this->sourceTypeLoaders->get($source::class)->load($source);To solve this we have a map like
[
source class => loader,
source class 2 => loader 2
]
I blogged about this pattern a while ago
It's quite simple and very performant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, especially without any additional argument or dynamic the supports(SourceInterface $source) is over the top - so is the dependency to the entire DIC component for that pattern.
will adopt to something slim like:
/**
* @return class-string<SourceInterface>
*/
public static function supportedSource(): string;so we could have a map in the loader that makes accessing the corresponding loader also quite efficient
7acc22f to
58041dc
Compare
| $vectorizer = new Vectorizer($platform, 'text-embedding-3-small'); | ||
| $indexer = new Indexer( | ||
| loader: new TextFileLoader(), | ||
| loader: new Loader([TextFileLoader::supportedSource() => new TextFileLoader()]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could also be
new Loader([TextFile::class => new TextFileLoader()]),or we shift that part into the Loader implementation
This is not done, but rather to sketch the idea
This solves:
DocumentCollectionAddressing Concerns from #997:
"We don't want to support everything. In my humble opinion, it lacks genericness."
Users could actually quite easy implement their own source objects and loaders to throw them into the indexer - including custom entities or specific URLs, etc.
"Then, to me, Rss, Url, or Source are data objects, not stateless services. The naming is confusing."
That's even the idea, yes - they are state objects to be handed over to the indexing process instead of being part of the constructor which lacks flexibility - (why we had the
withSources)"I'm not a fan of the Strategy Design Pattern" + "We are fighting this pattern so much in symfony/serializer to avoid performance issues!"
I would argue that the performance challenge in serializer comes with object nesting and therefore the runtime complexity grows in a different way since we're mostly dealing with collections here - might be different with folders, but that's a different story - the slowest part of the indexing will be the vectorizing/http calls anyhow
"This class does too much. It should be split"
Not sure i follow this reasoning since the amount of operations in total is the same, see sketch in [Store] Extract some code from Indexer to a dedicated class + Introduce Ingester #1138. "indexing" is our high level operation and the indexer itself acts as integrator of multiple operation services. this refactoring makes it a bit more visible now with shifting some loading logic.
cc @lyrixx @OskarStark
there are still quite some hypothesis and assumption in those line - but i think it helps to follow the idea