Feat/paragraph-extraction #7623

Merged on Feb 26, 2025 (94 commits)
Commits
2795296
Add zod library to dependencies and update yarn.lock
Joao-vi Jan 28, 2025
fbc0911
Implement AbstractController and UseCase interface for API structure
Joao-vi Jan 28, 2025
9fbd629
Add method to filter properties by type in Template class
Joao-vi Jan 28, 2025
94cf5a0
Add ExtractorDataSource interface for extractor creation
Joao-vi Jan 28, 2025
ae3b4b8
Add Extractor class with validation and status management
Joao-vi Jan 28, 2025
b4e68ac
Add CreateExtractorUseCase for creating extractors with template vali…
Joao-vi Jan 28, 2025
5ef3536
Add extractor ID generation and update Extractor class to include ID …
Joao-vi Jan 28, 2025
a3d5753
Add MongoDB data source and error handling for extractors
Joao-vi Jan 29, 2025
45db0cf
Fix import path for CreateExtractorUseCase in specs
Joao-vi Jan 29, 2025
2e3449d
Add getAll method to TemplatesDataSource and MongoTemplatesDataSource…
Joao-vi Jan 29, 2025
a492fd8
fix eslint rule and change naming conventions
daneryl Jan 30, 2025
1e2c7ab
IdGenerator as dependency instead of DS.nextId
daneryl Jan 30, 2025
d06c277
getAllFrom on testingEnvironment to get data from db
daneryl Jan 30, 2025
cb08752
unified PX validation errors in a single class
daneryl Jan 30, 2025
1f4feb1
wip extract paragraphs
Joao-vi Feb 6, 2025
4537ac5
Merge branch 'production' of github.com:huridocs/uwazi into feat/para…
Joao-vi Feb 6, 2025
526a037
fix: update Document instantiation in S3FileStorage.spec.ts to includ…
Joao-vi Feb 6, 2025
ed99d0d
fix: update LanguageISO6391 type usage in FilesMappers and commonSchemas
Joao-vi Feb 6, 2025
317a121
fix: enforce required ISO639_1 field in language schemas and update u…
Joao-vi Feb 6, 2025
3b9c331
Revert "fix: enforce required ISO639_1 field in language schemas and …
Joao-vi Feb 6, 2025
6919c30
feat: introduce FileType and HttpClientFactory, enhance Segmentation …
Joao-vi Feb 10, 2025
7b5fe0f
wip
Joao-vi Feb 12, 2025
d5d416f
feat: implement PXExtractionId class and refactor extraction ID handl…
Joao-vi Feb 12, 2025
f5dd506
fix: add TODO comment to clarify tenant name prepending in S3FileStorage
Joao-vi Feb 12, 2025
b31e419
feat: enhance SegmentationMapper and FileSystemStorage, implement mis…
Joao-vi Feb 12, 2025
e0d32c1
fix: remove TODO comment regarding tenant name prepending in S3FileSt…
Joao-vi Feb 12, 2025
5012d50
feat: expose PXErrorCode in PXValidationError for improved error hand…
Joao-vi Feb 12, 2025
723cf97
feat: add HttpField class for improved handling of form data fields
Joao-vi Feb 12, 2025
777d1aa
feat: refactor postFormData method to improve file and field handling…
Joao-vi Feb 12, 2025
df6ba51
feat: integrate HttpField for improved form data handling in PXExtern…
Joao-vi Feb 12, 2025
11dc4e6
feat: update PXExtractionId to use a custom separator for improved ID…
Joao-vi Feb 12, 2025
f8d0d95
add validator
Joao-vi Feb 13, 2025
536f6ba
feat: add GET method to HttpClient interface and implement in SuperAg…
Joao-vi Feb 13, 2025
47f7013
feat: implement getParagraphsResult method in PXExternalExtractionSer…
Joao-vi Feb 13, 2025
a410af9
feat: add EXTRACTION_ID_INVALID error code and refactor PXValidationE…
Joao-vi Feb 13, 2025
b3de446
feat: integrate Validator for PXExtractionId validation and refactor …
Joao-vi Feb 13, 2025
85c666f
feat: implement PXCreateParagraph use case and associated tests
Joao-vi Feb 13, 2025
5da5f60
refactor: clean up imports in PXCreateExtractor.ts
Joao-vi Feb 13, 2025
328b46d
feat: implement PXCreateParagraphs use case for paragraph extraction
Joao-vi Feb 13, 2025
d66e5c4
feat: add PXParagraphsResultListener for handling paragraph extractio…
Joao-vi Feb 13, 2025
ca45185
feat: implement PXExtractParagraphsFromEntities use case for paragrap…
Joao-vi Feb 13, 2025
307d6a7
wip
Joao-vi Feb 13, 2025
74a90dc
feat: update MongoPXExtractorsQueryService to use paragraphsQuantity …
Joao-vi Feb 13, 2025
3745f5d
feat: enhance MongoPXExtractorsQueryService with additional lookups a…
Joao-vi Feb 13, 2025
da294de
refactor: move InputSchema and PXCreateExtractor to named exports for…
Joao-vi Feb 14, 2025
1da7fea
refactor: rename GetExtractorsOutput to ExtractorDTO for clarity and …
Joao-vi Feb 14, 2025
cbd8ebf
feat: add Application dependency to AbstractController for enhanced f…
Joao-vi Feb 14, 2025
fda5158
feat: add PXCreateExtractorController for handling extractor creation…
Joao-vi Feb 14, 2025
0fac36f
refactor: simplify PXExtractionId constructor and improve ID handling…
Joao-vi Feb 14, 2025
69bda4b
wip
Joao-vi Feb 15, 2025
204d615
refactor: remove unused create method from EntitiesDataSource interface
Joao-vi Feb 17, 2025
ca0e32e
feat: implement PXExtractParagraphsFromEntity use case for paragraph …
Joao-vi Feb 17, 2025
3504401
test: add unit tests for PXExtractParagraphsFromEntity use case
Joao-vi Feb 17, 2025
990a0ab
refactor: rename entityDS to entitiesDS for consistency in PXCreatePa…
Joao-vi Feb 17, 2025
9ac7360
feat: extend PXExtractionId to include tenantName and userId for enha…
Joao-vi Feb 17, 2025
42e05fb
feat: add tenantName and userId to PXExtractionId for improved contex…
Joao-vi Feb 17, 2025
4b3f06e
refactor: simplify validation logic in Validator class by removing un…
Joao-vi Feb 17, 2025
c0ca155
feat: enhance language handling in PXExternalExtractionService to sel…
Joao-vi Feb 17, 2025
186f1de
refactor: remove unused methods and tests from PXCreateParagraph to s…
Joao-vi Feb 18, 2025
0fd04b9
wip
Joao-vi Feb 18, 2025
52030bf
feat: add isMainLanguage flag to translations and enhance error handl…
Joao-vi Feb 19, 2025
d443fa7
refactor: improve method naming and enhance result handling in paragr…
Joao-vi Feb 19, 2025
7f737e8
refactor: rename paragraphsQuantity to sourceEntitiesCount and update…
Joao-vi Feb 20, 2025
47dfda2
refactor: add notes for paragraph property selection and inheritance …
Joao-vi Feb 20, 2025
18ca1f3
test: add additional test cases for PXCreateParagraphs functionality
Joao-vi Feb 20, 2025
fe8972e
WIP, manual integration test with dummy service
daneryl Feb 20, 2025
dabf49c
testing files
Joao-vi Feb 20, 2025
13342b8
feat: add paragraph property IDs to extractor and validation error ha…
Joao-vi Feb 20, 2025
cd64ebb
wip
Joao-vi Feb 21, 2025
463b8b8
finish controllers
Joao-vi Feb 22, 2025
150ba47
AbstractController refactor
Joao-vi Feb 22, 2025
17edcdd
wip
Joao-vi Feb 23, 2025
77c563d
Merge branch 'production' into feat/paragraph-extraction
Joao-vi Feb 24, 2025
5d9d636
last changes
Joao-vi Feb 24, 2025
25254a7
replace hardcoded url
Joao-vi Feb 24, 2025
57a7209
add authorization middleware
Joao-vi Feb 24, 2025
01c0a2d
typo
Joao-vi Feb 24, 2025
13d542d
Merge branch 'feat/paragraph-extraction' of github.com:huridocs/uwazi…
Joao-vi Feb 24, 2025
6a77a5c
fix function errors
Joao-vi Feb 24, 2025
8b3749a
last changes
Joao-vi Feb 24, 2025
c295ea4
Merge branch 'production' into feat/paragraph-extraction
Joao-vi Feb 24, 2025
137b9d2
Merge branch 'production' into feat/paragraph-extraction
Joao-vi Feb 25, 2025
6a2d1ff
add paragraph extraction routes
Joao-vi Feb 25, 2025
069a9db
refactor on controllers
Joao-vi Feb 25, 2025
e39ecae
Merge branch 'feat/paragraph-extraction' of github.com:huridocs/uwazi…
Joao-vi Feb 25, 2025
d26fb1e
remove unused code
Joao-vi Feb 25, 2025
dd71d18
add segmentation type
Joao-vi Feb 25, 2025
0945512
fix test
Joao-vi Feb 25, 2025
8f35196
fix File mapper
Joao-vi Feb 25, 2025
822afff
Merge branch 'production' into feat/paragraph-extraction
Joao-vi Feb 25, 2025
9ee511e
Merge branch 'production' into feat/paragraph-extraction
Joao-vi Feb 26, 2025
68680f4
refactor: replace segment_type to type
Joao-vi Feb 26, 2025
ea66248
turn Extract paragraph use case sequential
Joao-vi Feb 26, 2025
d6aad93
Merge branch 'feat/paragraph-extraction' of github.com:huridocs/uwazi…
Joao-vi Feb 26, 2025
6 changes: 5 additions & 1 deletion .eslintrc.js
@@ -234,7 +234,11 @@ module.exports = {
  excludedFiles: './**/*.cy.tsx',
  parser: '@typescript-eslint/parser',
  parserOptions: { project: './tsconfig.json' },
- rules: { ...rules },
+ rules: {
+ ...rules,
+ 'no-empty-function': ['error', { allow: ['constructors'] }],
+ 'no-useless-constructor': 'off',
+ },
  },
  {
  files: ['./cypress/**/*.ts', './cypress/**/*.d.ts', './**/*.cy.tsx'],
2 changes: 2 additions & 0 deletions app/api/api.js
@@ -44,4 +44,6 @@ export default (app, server) => {
require('./relationships.v2/routes/routes').default(app);
require('./stats/routes').default(app);
require('./testing_errors/routes').default(app);

require('./paragraphExtraction/adapters/PXRoutes').paragraphExtractionRoutes(app);
};
87 changes: 87 additions & 0 deletions app/api/common.v2/AbstractController.ts
@@ -0,0 +1,87 @@
import util from 'util';
import { ValidationError } from 'ajv';
import { ZodError } from 'zod';
import { Request, Response } from 'express';

import { config } from 'api/config';
import { LanguageISO6391 } from 'shared/types/commonTypes';

export type Dependencies<RequestBody = any> = {
response: Response;
request: Request<unknown, any, RequestBody>;
};

export abstract class AbstractController<RequestBody = any> {
constructor(private dependencies: Dependencies<RequestBody>) {}

protected abstract handle(): Promise<void>;

async handleAsync() {
try {
await this.handle();
} catch (e) {
if (e instanceof ZodError) {
const error = new ValidationError(
e.errors.map(issue => ({
instancePath: issue.path.join('.'),
message: issue.message,
}))
);

error.message = util.inspect(error, false, null);

throw error;
}

throw e;
}
}

/**
* Adapts a controller class to an Express route handler.
*
* This method takes a controller class (not an instance), instantiates it with
* the request and response objects, and calls its `handleAsync` method.
*
*/
static adapt<Controller extends AbstractController>(
ControllerClass: new (dependencies: Dependencies) => Controller
) {
return async (request: Request, response: Response) =>
new ControllerClass({ request, response }).handleAsync();
}

protected get request() {
return this.dependencies.request;
}

protected get response() {
return this.dependencies.response;
}

protected get language() {
return this.request.language as LanguageISO6391;
}

protected get tenantName() {
return this.request.get('tenant') ?? config.defaultTenant.name;
}

protected serverError(error: Error) {
this.response.status(500).json({
message: error.message,
});
}

protected clientError(message: string) {
this.response.status(400).json({ message });
}

protected jsonResponse(body: any) {
this.response.status(200).json(body);
}

protected ok() {
this.response.status(200).send();
}
}
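The `adapt` pattern above can be sketched in isolation. This is a minimal illustration, not the PR's code: `SketchRequest`, `SketchResponse`, `SketchController` and `EchoController` are hypothetical stand-ins so the example runs without Express, and the ZodError handling is left out.

```typescript
// Hypothetical stand-ins for the Express Request/Response types (assumption, not the PR's types).
type SketchRequest = { body: unknown };
type SketchResponse = { statusCode: number; payload: unknown };
type SketchDependencies = { request: SketchRequest; response: SketchResponse };

abstract class SketchController {
  protected dependencies: SketchDependencies;

  constructor(dependencies: SketchDependencies) {
    this.dependencies = dependencies;
  }

  protected abstract handle(): Promise<void>;

  async handleAsync(): Promise<void> {
    await this.handle();
  }

  // Receives the class itself (not an instance) and returns a handler
  // that builds a fresh controller for every request.
  static adapt(ControllerClass: new (dependencies: SketchDependencies) => SketchController) {
    return async (request: SketchRequest, response: SketchResponse) =>
      new ControllerClass({ request, response }).handleAsync();
  }
}

// Hypothetical controller that echoes the request body back.
class EchoController extends SketchController {
  protected async handle(): Promise<void> {
    this.dependencies.response.statusCode = 200;
    this.dependencies.response.payload = this.dependencies.request.body;
  }
}

const echoHandler = SketchController.adapt(EchoController);
```

Because a new controller is created per call, no request state is shared between requests; the returned function is what would be registered as the route handler.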
19 changes: 19 additions & 0 deletions app/api/common.v2/contracts/HttpClient.ts
@@ -0,0 +1,19 @@
import { File } from 'api/files.v2/model/File';
import { HttpField } from './HttpField';

type PostFormDataInput = {
url: string;
fields: Record<string, HttpField>;
files: Record<string, File[]>;
};

type GetInput = {
url: string;
};

interface HttpClient {
postFormData<T>(input: PostFormDataInput): Promise<T>;
get<Response>(input: GetInput): Promise<Response>;
}

export type { PostFormDataInput, HttpClient, GetInput };
15 changes: 15 additions & 0 deletions app/api/common.v2/contracts/HttpField.ts
@@ -0,0 +1,15 @@
export class HttpField {
value: string;

constructor(value: any) {
this.value = HttpField.parseValue(value);
}

private static parseValue(value: any): string {
if (Array.isArray(value) || typeof value === 'object') {
return JSON.stringify(value);
}

return `${value}`;
}
}
3 changes: 3 additions & 0 deletions app/api/common.v2/contracts/UseCase.ts
@@ -0,0 +1,3 @@
export interface UseCase<Input, Output> {
execute(input: Input): Promise<Output>;
}
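For illustration, a class implementing this contract might look like the sketch below. `CountParagraphsSketch` and its input/output types are hypothetical and do not appear in the PR; the interface is repeated only so the example is self-contained.

```typescript
// Same shape as the UseCase interface in this diff, repeated for self-containment.
interface UseCase<Input, Output> {
  execute(input: Input): Promise<Output>;
}

// Hypothetical input/output types, not part of the PR.
type CountParagraphsInput = { text: string };
type CountParagraphsOutput = { paragraphs: number };

class CountParagraphsSketch implements UseCase<CountParagraphsInput, CountParagraphsOutput> {
  async execute(input: CountParagraphsInput): Promise<CountParagraphsOutput> {
    // Paragraphs are taken to be blocks separated by one or more blank lines.
    const blocks = input.text.split(/\n\s*\n/).filter(block => block.trim().length > 0);
    return { paragraphs: blocks.length };
  }
}

const countParagraphs = new CountParagraphsSketch();
```

Typing both the input and the output lets callers, such as a controller, depend on the `UseCase` contract rather than on a concrete class.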
38 changes: 38 additions & 0 deletions app/api/common.v2/contracts/specs/HttpField.spec.ts
@@ -0,0 +1,38 @@
import { HttpField } from '../HttpField';

describe('HttpField', () => {
it('should parse a string value correctly', () => {
const field = new HttpField('test');
expect(field.value).toBe('test');
});

it('should parse a number value correctly', () => {
const field = new HttpField(123);
expect(field.value).toBe('123');
});

it('should parse a boolean value correctly', () => {
const field = new HttpField(true);
expect(field.value).toBe('true');
});

it('should parse an array value correctly', () => {
const field = new HttpField([1, 2, 3]);
expect(field.value).toBe('[1,2,3]');
});

it('should parse an object value correctly', () => {
const field = new HttpField({ key: 'value' });
expect(field.value).toBe('{"key":"value"}');
});

it('should parse a null value correctly', () => {
const field = new HttpField(null);
expect(field.value).toBe('null');
});

it('should parse an undefined value correctly', () => {
const field = new HttpField(undefined);
expect(field.value).toBe('undefined');
});
});
10 changes: 6 additions & 4 deletions app/api/common.v2/database/MongoDataSource.ts
@@ -16,10 +16,12 @@ export abstract class MongoDataSource<TSchema extends Document = Document> {
  this.transactionManager = transactionManager;
  }

- protected getCollection(collectionName = this.collectionName) {
- return new SyncedCollection<TSchema>(
- new SessionScopedCollection<TSchema>(
- this.db.collection<TSchema>(collectionName),
+ protected getCollection<Collection extends Document = TSchema>(
+ collectionName = this.collectionName
+ ) {
+ return new SyncedCollection<Collection>(
+ new SessionScopedCollection<Collection>(
+ this.db.collection<Collection>(collectionName),
  this.transactionManager
  ),
  this.transactionManager,
7 changes: 7 additions & 0 deletions app/api/common.v2/infrastructure/HttpClientFactory.ts
@@ -0,0 +1,7 @@
import { SuperAgentHttpClient } from './SuperAgentHttpClient';

export class HttpClientFactory {
static createDefault() {
return new SuperAgentHttpClient();
}
}
44 changes: 44 additions & 0 deletions app/api/common.v2/infrastructure/SuperAgentHttpClient.ts
@@ -0,0 +1,44 @@
import superagent from 'superagent';
import { File } from 'api/files.v2/model/File';
import { GetInput, HttpClient, PostFormDataInput } from '../contracts/HttpClient';
import { HttpField } from '../contracts/HttpField';

export class SuperAgentHttpClient implements HttpClient {
private client = superagent;

async get<Response>(input: GetInput): Promise<Response> {
const response = await this.client.get(input.url);

return response.body as Response;
}

async postFormData<T>(input: PostFormDataInput): Promise<T> {
const request = this.client.post(input.url);

await SuperAgentHttpClient.attachFiles(request, input.files);
SuperAgentHttpClient.appendFields(request, input.fields);

const response = await request;

return response.body as T;
}

private static async attachFiles(request: superagent.Request, files: Record<string, File[]>) {
const promises = Object.entries(files).flatMap(([key, _files]) =>
_files.map(async file => {
const buffer = await file.toBuffer();

// Deliberately not awaited: a superagent request is thenable, so awaiting 'request.attach' or
// 'request.field' would send the request before all files and fields have been added.
// eslint-disable-next-line @typescript-eslint/no-floating-promises
request.attach(key, buffer, file.filename);
})
);

return Promise.all(promises);
}

private static appendFields(request: superagent.Request, fields: Record<string, HttpField>) {
Object.entries(fields).forEach(([key, value]) => request.field(key, value.value));
}
}
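The reason `attachFiles` avoids awaiting `request.attach` can be shown with a small stand-in. `LazyRequest` below is a hypothetical class, not superagent; it only mimics the "thenable builder" shape, where calling `then` (which `await` does implicitly) is what sends the request.

```typescript
// Hypothetical lazy request builder (assumption): mimics a thenable request, not superagent's API.
class LazyRequest implements PromiseLike<string> {
  sent = false;

  private parts: string[] = [];

  // Chainable, synchronous configuration step.
  field(name: string, value: string): this {
    this.parts.push(`${name}=${value}`);
    return this;
  }

  // Awaiting the builder calls then(), and that is the moment the request is "sent".
  then<A = string, B = never>(
    onFulfilled?: ((value: string) => A | PromiseLike<A>) | null,
    onRejected?: ((reason: any) => B | PromiseLike<B>) | null
  ): PromiseLike<A | B> {
    this.sent = true;
    return Promise.resolve(this.parts.join('&')).then(onFulfilled, onRejected);
  }
}

const lazyRequest = new LazyRequest();
lazyRequest.field('language', 'en'); // not awaited, so nothing is sent yet
const sentBeforeAwait = lazyRequest.sent;
lazyRequest.field('type', 'document'); // still safe to keep configuring
```

Had the first `field` call been awaited, the builder would have been marked as sent with only one field attached, which is the situation the eslint-disable comment in `attachFiles` guards against.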
3 changes: 3 additions & 0 deletions app/api/config.ts
@@ -72,6 +72,9 @@ export const config = {
},
},
externalServices: Boolean(process.env.EXTERNAL_SERVICES) || false,
externalServicesUrls: {
paragraphExtraction: process.env.PARAGRAPH_EXTRACTION_URL || 'http://localhost:5056',
},

redis: {
activated: CLUSTER_MODE,
9 changes: 9 additions & 0 deletions app/api/files.v2/contracts/FileStorage.ts
@@ -1,7 +1,16 @@
import { StoredFile } from '../model/StoredFile';
import { UwaziFile } from '../model/UwaziFile';
import { FileType } from '../model/FileType';
import { File } from '../model/File';

export type GetFileInput = {
type: FileType;
filename: string;
};

export interface FileStorage {
list(): Promise<StoredFile[]>;
getPath(file: UwaziFile): string;
getFiles(inputs: GetFileInput[]): Promise<File[]>;
getFile(input: GetFileInput): Promise<File>;
}
41 changes: 41 additions & 0 deletions app/api/files.v2/contracts/FileStorageStrategy.ts
@@ -0,0 +1,41 @@
import { Tenant } from 'api/tenants/tenantContext';
import { FileStorage, GetFileInput } from './FileStorage';
import { File } from '../model/File';
import { StoredFile } from '../model/StoredFile';
import { UwaziFile } from '../model/UwaziFile';

type Strategy = {
s3Storage: FileStorage;
fileSystemStorage: FileStorage;
};

type FileStorageStrategyProps = {
tenant: Tenant;
strategy: Strategy;
};

export class FileStorageStrategy implements FileStorage {
constructor(private props: FileStorageStrategyProps) {}

private get currentStrategy() {
if (this.props.tenant.featureFlags?.s3Storage) return this.props.strategy.s3Storage;

return this.props.strategy.fileSystemStorage;
}

async list(): Promise<StoredFile[]> {
return this.currentStrategy.list();
}

getPath(file: UwaziFile): string {
return this.currentStrategy.getPath(file);
}

async getFiles(inputs: GetFileInput[]): Promise<File[]> {
return this.currentStrategy.getFiles(inputs);
}

async getFile(input: GetFileInput): Promise<File> {
return this.currentStrategy.getFile(input);
}
}
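The flag-based delegation in `FileStorageStrategy` can be sketched with trimmed-down types. `SketchTenant`, `SketchStorage`, the bucket name and the paths below are all hypothetical; only the selection logic mirrors the class above.

```typescript
// Hypothetical, trimmed-down versions of the PR's Tenant and FileStorage types (assumption).
type SketchTenant = { name: string; featureFlags?: { s3Storage?: boolean } };

interface SketchStorage {
  getPath(filename: string): string;
}

type SketchStorages = { s3Storage: SketchStorage; fileSystemStorage: SketchStorage };

class SketchStorageStrategy implements SketchStorage {
  private tenant: SketchTenant;

  private storages: SketchStorages;

  constructor(tenant: SketchTenant, storages: SketchStorages) {
    this.tenant = tenant;
    this.storages = storages;
  }

  // Resolved on every call, so the tenant's flag decides which backend serves each request.
  private get current(): SketchStorage {
    return this.tenant.featureFlags?.s3Storage
      ? this.storages.s3Storage
      : this.storages.fileSystemStorage;
  }

  getPath(filename: string): string {
    return this.current.getPath(filename);
  }
}

// Hypothetical backends with made-up path formats.
const storages: SketchStorages = {
  s3Storage: { getPath: (filename: string) => `s3://bucket/${filename}` },
  fileSystemStorage: { getPath: (filename: string) => `/uploads/${filename}` },
};

const localPath = new SketchStorageStrategy({ name: 'tenant-a' }, storages).getPath('a.pdf');
const s3Path = new SketchStorageStrategy(
  { name: 'tenant-b', featureFlags: { s3Storage: true } },
  storages
).getPath('a.pdf');
```

Since the strategy implements the same interface as the backends it wraps, callers do not need to know which storage a tenant uses.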
7 changes: 6 additions & 1 deletion app/api/files.v2/contracts/FilesDataSource.ts
@@ -1,7 +1,12 @@
  import { ResultSet } from 'api/common.v2/contracts/ResultSet';
  import { UwaziFile } from '../model/UwaziFile';
+ import { Segmentation } from '../model/Segmentation';
+ import { Document } from '../model/Document';

- export interface FilesDataSource {
+ interface FilesDataSource {
  filesExistForEntities(files: { entity: string; _id: string }[]): Promise<boolean>;
  getAll(): ResultSet<UwaziFile>;
+ getSegmentations(fileId: string[]): ResultSet<Segmentation>;
+ getDocumentsForEntity(entitySharedId: string): ResultSet<Document>;
  }
+ export type { FilesDataSource };
30 changes: 13 additions & 17 deletions app/api/files.v2/database/FilesMappers.ts
@@ -1,22 +1,21 @@
- // import { OptionalId } from 'mongodb';
-
+ import { LanguageUtils } from 'shared/language';
  import { FileDBOType } from './schemas/filesTypes';
  import { UwaziFile } from '../model/UwaziFile';
  import { Document } from '../model/Document';
  import { URLAttachment } from '../model/URLAttachment';
  import { Attachment } from '../model/Attachment';
  import { CustomUpload } from '../model/CustomUpload';

- export const FileMappers = {
- // toDBO(file: UwaziFile): OptionalId<FileDBOType> {
- // return {
- // filename: file.filename,
- // entity: file.entity,
- // type: 'document',
- // totalPages: file.totalPages,
- // };
- // },
+ const toDocumentModel = (fileDBO: FileDBOType) =>
+ new Document(
+ fileDBO._id.toString(),
+ fileDBO.entity,
+ fileDBO.totalPages,
+ fileDBO.filename,
+ LanguageUtils.fromISO639_3(fileDBO.language).ISO639_1!
+ ).withCreationDate(new Date(fileDBO.creationDate));

+ export const FileMappers = {
  toModel(fileDBO: FileDBOType): UwaziFile {
  if (fileDBO.type === 'attachment' && fileDBO.url) {
  return new URLAttachment(
@@ -43,11 +42,8 @@ export const FileMappers = {
  fileDBO.filename
  ).withCreationDate(new Date(fileDBO.creationDate));
  }
- return new Document(
- fileDBO._id.toString(),
- fileDBO.entity,
- fileDBO.totalPages,
- fileDBO.filename
- ).withCreationDate(new Date(fileDBO.creationDate));
+ return toDocumentModel(fileDBO);
  },
+
+ toDocumentModel,
  };