DDEx (Document Data Extractor) is a framework that lets applications transparently open documents and extract their content, regardless of file format.
We are working to provide support for:
- OLE2 file formats [.doc, .xls, .ppt]
- OOXML file formats [.docx, .xlsx, .pptx]
- ODF file formats [.odt, .ods, .odp]
- CSV
- Google Docs (minimal support)
DDEx is based on the Builder design pattern and can be extended to support other formats. It decouples content extraction from content processing: DDEx handles the diversity of file formats, and the application accesses the document's content independently of the format.
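To illustrate how a Builder can separate extraction from processing, here is a minimal sketch in Java. It is not DDEx's actual API: `ContentBuilder`, `CsvExtractor`, and `CellListBuilder` are hypothetical names chosen for this example, and the CSV parsing is deliberately naive (no quoted fields). The extractor walks the document and reports what it finds; the builder decides what to do with it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical builder interface (not DDEx's real API): the extractor
// calls these methods as it walks through a document.
interface ContentBuilder {
    void beginSheet(String name);
    void cell(int row, int column, String value);
    void endSheet();
}

// Hypothetical extractor for CSV text. It knows the format but nothing
// about how the content will be used. Quoted fields are not handled.
final class CsvExtractor {
    void extract(String sheetName, String csv, ContentBuilder builder) {
        builder.beginSheet(sheetName);
        String[] lines = csv.split("\n");
        for (int r = 0; r < lines.length; r++) {
            // limit -1 keeps trailing empty fields
            String[] fields = lines[r].split(",", -1);
            for (int c = 0; c < fields.length; c++) {
                builder.cell(r, c, fields[c].trim());
            }
        }
        builder.endSheet();
    }
}

// One possible builder: records every event as a string in a flat list.
// Another builder could fill a database or build a graph instead.
final class CellListBuilder implements ContentBuilder {
    private final List<String> cells = new ArrayList<>();

    @Override
    public void beginSheet(String name) {
        cells.add("sheet:" + name);
    }

    @Override
    public void cell(int row, int column, String value) {
        cells.add(row + "," + column + "=" + value);
    }

    @Override
    public void endSheet() {
        // Nothing to finalize in this simple builder.
    }

    List<String> result() {
        return cells;
    }
}
```

Under these assumptions, supporting a new format means writing another extractor that emits the same events, while existing builders stay unchanged.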
DDEx bridges multiple format-specific APIs (such as Apache POI and ODFDOM) behind a common interface. It encapsulates the extraction and performs it independently of format, so applications can reuse a document's content in other contexts.
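One way to picture a common interface over several format-specific libraries is a small registry that picks an extractor by file extension. The sketch below is an assumption for illustration only: `DocumentExtractor` and `ExtractorRegistry` are hypothetical names, not part of DDEx, and a real adapter would wrap a library such as Apache POI or ODFDOM rather than the plain lambda shown in the usage note.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical common interface (not DDEx's real API): each adapter
// hides one format-specific library behind the same method.
interface DocumentExtractor {
    String extractText(byte[] data);
}

// Hypothetical registry that maps a file extension to an adapter.
final class ExtractorRegistry {
    private final Map<String, DocumentExtractor> byExtension = new HashMap<>();

    // Extensions are stored in lower case so lookups are case-insensitive.
    void register(String extension, DocumentExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    // Returns the adapter for the file's extension, or empty if the file
    // has no extension or no adapter is registered for it.
    Optional<DocumentExtractor> forFile(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty();
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(byExtension.get(ext));
    }
}
```

A caller might register `registry.register("csv", data -> new String(data, java.nio.charset.StandardCharsets.UTF_8))` and then call `registry.forFile("report.csv")`, never touching the library behind the adapter.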
DDEx originated in academia and came to be used by other Ph.D. and MSc students in their research. It is also used by other projects and is associated with academic publications, such as:
- Project BioSpread - Integrating data from Web spreadsheets
- 2graph - An API for abstracting graph databases
- Paper: "Automatic Interpretation of Biodiversity Spreadsheets Based on the Recognition of Construction Patterns"
- Paper: "Extracting and Semantically Integrating Implicit Schemas from Multiple Spreadsheets"
- Paper: "Introducing shadows: Flexible document representation and annotation on the Web"