- Requirement

  The LLM module of DeepKE calls the EasyInstruct toolkit (an easy-to-use framework to instruct large language models).

  >> pip install git+https://github.com/zjunlp/EasyInstruct
  >> pip install hydra-core

- Data

  The data here refers to the example data used for in-context learning, which is stored in the `data` folder. The `.json` files in it are the default examples for the various tasks. Users can customize these examples, but they must follow the given data format.

- Configuration

  The `conf` folder stores the parameter settings. The parameters required to call the GPT3 interface are passed in through the files in this folder.

  - In the Named Entity Recognition (`ner`) task, the `text_input` parameter is the prediction text; `domain` is the domain of the prediction text, which can be empty; `label` is the entity label set, which can also be empty.
  - In the Relation Extraction (`re`) task, the `text_input` parameter is the text; `domain` indicates the domain to which the text belongs, and it can be empty; `labels` is the set of relation type labels, which can be empty if there is no custom label set; `head_entity` and `tail_entity` are the head entity and tail entity of the relation to be predicted, respectively; `head_type` and `tail_type` are the types of the head and tail entities in that relation.
  - In the Event Extraction (`ee`) task, the `text_input` parameter is the prediction text; `domain` is the domain of the prediction text, which can also be empty.
  - In the Relational Triple Extraction (`rte`) task, the `text_input` parameter is the prediction text; `domain` is the domain of the prediction text, which can also be empty.
  - The specific meanings of the other parameters are as follows:
    - `task` specifies the task type: `ner` for named entity recognition, `re` for relation extraction, `ee` for event extraction, and `rte` for triple extraction;
    - `language` indicates the language of the task: `en` for English extraction tasks and `ch` for Chinese extraction tasks;
    - `engine` is the name of the large model used, which should be consistent with the model name specified by the OpenAI API;
    - `api_key` is the user's API key;
    - `zero_shot` indicates whether the zero-shot setting is used. When it is set to `True`, only the instruction is used to prompt the model for information extraction; when it is set to `False`, in-context examples are used as well;
    - `instruction` specifies a user-defined prompt instruction; the default instruction is used when it is empty;
    - `data_path` indicates the directory where the in-context examples are stored; the default is the `data` folder.
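To make the parameter list concrete, here is a minimal Python sketch that mirrors the parameter names described above and checks the task and language codes. The `LLMConfig` class, its defaults, and the `validate` method are hypothetical and only for illustration; the module itself reads these values from the files in the `conf` folder.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Codes named in this README ("da" is the data augmentation task described later).
VALID_TASKS = {"ner", "re", "ee", "rte", "da"}
VALID_LANGUAGES = {"en", "ch"}


@dataclass
class LLMConfig:
    """Hypothetical container for the parameters described above."""
    task: str
    language: str
    engine: str
    api_key: str
    text_input: str
    zero_shot: bool = True              # default chosen for this sketch only
    instruction: Optional[str] = None   # None means: use the default instruction
    data_path: str = "data"             # folder holding the in-context examples
    domain: Optional[str] = None        # may be empty
    labels: List[str] = field(default_factory=list)  # may be empty

    def validate(self) -> None:
        """Raise ValueError for values this README does not list."""
        if self.task not in VALID_TASKS:
            raise ValueError(f"unknown task: {self.task!r}")
        if self.language not in VALID_LANGUAGES:
            raise ValueError(f"unknown language: {self.language!r}")
        if not self.zero_shot and not self.data_path:
            raise ValueError("data_path is required when zero_shot is False")


config = LLMConfig(
    task="ner",
    language="en",
    engine="gpt-3.5-turbo",
    api_key="YOUR_API_KEY",  # placeholder, not a real key
    text_input="Japan began the defence of their Asian Cup title.",
)
config.validate()  # passes silently for supported values
```

Checking the values before calling the API avoids spending a request on a misspelled task or language code.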
We use the EasyInstruct tool, a user-friendly framework for instructing large language models, to complete this task. Please refer to Chapter 1 for the environment and data.
Once the parameters are set, you can run `run.py` directly:
>> python run.py
Below are input and output examples for different tasks:
| Task | Input | Output |
|---|---|---|
| NER | Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria in a Group C championship match on Friday. | [{'E': 'Country', 'W': 'Japan'}, {'E': 'Country', 'W': 'Syria'}, {'E': 'Competition', 'W': 'Asian Cup'}, {'E': 'Competition', 'W': 'Group C championship'}] |
| RE | The Dutch newspaper Brabants Dagblad said the boy was probably from Tilburg in the southern Netherlands and that he had been on safari in South Africa with his mother Trudy , 41, father Patrick, 40, and brother Enzo, 11. | parents |
| EE | In Baghdad, a cameraman died when an American tank fired on the Palestine Hotel. | event_list: [ event_type: [arguments: [role: "cameraman", argument: "Baghdad"], [role: "American tank", argument: "Palestine Hotel"]] ] |
| RTE | The most common audits were about waste and recycling. | [['audit', 'type', 'waste'], ['audit', 'type', 'recycling']] |
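The NER and RTE outputs in the table are written as Python literals, so they can be post-processed with the standard library. The sketch below assumes the output arrives as a string in exactly that form; whether `run.py` returns a string or an already-parsed object is not stated here, and a model response that is not well-formed would make `ast.literal_eval` raise an error. The helper name `group_entities` is hypothetical.

```python
import ast
from collections import defaultdict
from typing import Dict, List


def group_entities(raw_output: str) -> Dict[str, List[str]]:
    """Parse a NER output string and group the words ('W') by entity type ('E')."""
    items = ast.literal_eval(raw_output)  # safe parsing of Python literals only
    grouped = defaultdict(list)
    for item in items:
        grouped[item["E"]].append(item["W"])
    return dict(grouped)


ner_output = (
    "[{'E': 'Country', 'W': 'Japan'}, {'E': 'Country', 'W': 'Syria'}, "
    "{'E': 'Competition', 'W': 'Asian Cup'}, "
    "{'E': 'Competition', 'W': 'Group C championship'}]"
)
print(group_entities(ner_output))
# {'Country': ['Japan', 'Syria'], 'Competition': ['Asian Cup', 'Group C championship']}

rte_output = "[['audit', 'type', 'waste'], ['audit', 'type', 'recycling']]"
triples = [tuple(t) for t in ast.literal_eval(rte_output)]
print(triples)
# [('audit', 'type', 'waste'), ('audit', 'type', 'recycling')]
```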
To compensate for the lack of labeled data in few-shot scenarios for relation extraction, we have designed prompts with data style descriptions to guide large language models to automatically generate more labeled data based on existing few-shot data.
- Set `task` to `da`;
- Set `text_input` to the relation label to be augmented, such as `org:founded_by`;
- Set `zero_shot` to `False` and put the few-shot examples in the corresponding file under the `data` folder for the `da` task;
- The range of entity labels can be specified in `labels`.
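As a rough illustration of how these settings feed into a prompt, the sketch below assembles a data augmentation prompt from a relation label, an entity type list, and few-shot samples, following the wording of the example prompt shown later in this section. The function name `build_da_prompt` and the sample dictionary keys are hypothetical, not the module's internals.

```python
from typing import Dict, List


def build_da_prompt(
    relation: str,
    entity_types: List[str],
    samples: List[Dict[str, str]],
) -> str:
    """Assemble a data augmentation prompt from few-shot relation extraction samples."""
    lines = [
        "One sample in relation extraction datasets consists of a relation, "
        "a context, a pair of head and tail entities in the context and "
        "their entity types.",
        "The head entity has the relation with the tail entity and entities "
        "are pre-categorized as the following types: "
        + ", ".join(entity_types) + ".",
        f"Here are some samples for relation '{relation}':",
    ]
    for sample in samples:
        lines.append(
            f"Relation: {relation}. Context: {sample['context']} "
            f"Head Entity: {sample['head_entity']}. "
            f"Head Type: {sample['head_type']}. "
            f"Tail Entity: {sample['tail_entity']}. "
            f"Tail Type: {sample['tail_type']}."
        )
    lines.append(f"Generate more samples for the relation '{relation}'.")
    return "\n".join(lines)


prompt = build_da_prompt(
    relation="org:founded_by",
    entity_types=["ORGANIZATION", "PERSON"],
    samples=[
        {
            "context": "Talansky is also the US contact for the New Jerusalem "
                       "Foundation , an organization founded by Olmert while "
                       "he was Jerusalem 's mayor .",
            "head_entity": "New Jerusalem Foundation",
            "head_type": "ORGANIZATION",
            "tail_entity": "Olmert",
            "tail_type": "PERSON",
        }
    ],
)
print(prompt)
```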
We use the EasyInstruct tool, a user-friendly framework for instructing large language models, to complete this task. Please refer to Chapter 1 for the environment and data.
Once the parameters are set, you can run `run.py` directly:
>> python run.py
Here is an example of a data augmentation prompt:
'''
One sample in relation extraction datasets consists of a relation, a context, a pair of head and tail entities in the context and their entity types.
The head entity has the relation with the tail entity and entities are pre-categorized as the following types: URL, LOCATION, IDEOLOGY, CRIMINAL CHARGE, TITLE, STATE OR PROVINCE, DATE, PERSON, NUMBER, CITY, DURATION, CAUSE OF DEATH, COUNTRY, NATIONALITY, RELIGION, ORGANIZATION, MISCELLANEOUS.
Here are some samples for relation 'org:founded_by':
Relation: org:founded_by. Context: Talansky is also the US contact for the New Jerusalem Foundation , an organization founded by Olmert while he was Jerusalem 's mayor . Head Entity: New Jerusalem Foundation. Head Type: ORGANIZATION. Tail Entity: Olmert. Tail Type: PERSON.
Relation: org:founded_by. Context: Sharpton has said he will not endorse any candidate until hearing more about their views on civil rights and other issues at his National Action Network convention next week in New York City . Head Entity: National Action Network. Head Type: ORGANIZATION. Tail Entity: his. Tail Type: PERSON.
Relation: org:founded_by. Context: `` We believe that we can best serve our clients by offering a single multistrategy hedge fund platform , '' wrote John Havens , who was a founder of Old Lane with Pandit and is president of the alternative investment group . Head Entity: Old Lane. Head Type: ORGANIZATION. Tail Entity: John Havens. Tail Type: PERSON.
Generate more samples for the relation 'org:founded_by'.
'''

The following is a baseline description of ChatGPT/GPT-4 for the Instruction-based Knowledge Graph Construction task in the CCKS2023 Open Environment Knowledge Graph Construction and Completion Evaluation competition.
Extract relevant entities and relations according to user input instructions to construct a knowledge graph. This task may include knowledge graph completion, where the model is required to complete missing triples while extracting entity-relation triples.
Below is an example of a Knowledge Graph Construction task. Given an input text `input` and an instruction `instruction` (including the desired entity types and relation types), output all relation triples found within `input` as `output`, in the form of `(ent1, rel, ent2)`:
instruction="使用自然语言抽取三元组,已知下列句子,请从句子中抽取出可能的实体、关系,抽取实体类型为{'专业','时间','人类','组织','地理地区','事件'},关系类型为{'体育运动','包含行政领土','参加','国家','邦交国','夺得','举办地点','属于','获奖'},你可以先识别出实体再判断实体之间的关系,以(头实体,关系,尾实体)的形式回答"
input="2006年,弗雷泽出战中国天津举行的女子水球世界杯,协助国家队夺得冠军。2008年,弗雷泽代表澳大利亚参加北京奥运会女子水球比赛,赢得铜牌。"
output="(弗雷泽,获奖,铜牌)(女子水球世界杯,举办地点,天津)(弗雷泽,属于,国家队)(弗雷泽,国家,澳大利亚)(弗雷泽,参加,北京奥运会女子水球比赛)(中国,包含行政领土,天津)(中国,邦交国,澳大利亚)(北京奥运会女子水球比赛,举办地点,北京)(女子水球世界杯,体育运动,水球)(国家队,夺得,冠军)"

Here are some readily processed datasets:
| Name | Download | Quantity | Description |
|---|---|---|---|
| KnowLM-IE.json | Google drive, HuggingFace | 281,860 | Dataset mentioned in InstructIE |
| train.json, valid.json | Google drive | 5,000 | Preliminary training set and test set for the task "Instruction-Driven Adaptive Knowledge Graph Construction" in CCKS2023 Open Knowledge Graph Challenge, randomly selected from instruct_train.json |
KnowLM-IE.json: Contains the fields 'id' (unique identifier), 'cate' (text category), 'instruction' (extraction instruction), 'input' (input text), 'output' (output text) and 'relation' (triples). Extraction instructions and outputs can be flexibly constructed from 'relation'. 'instruction' has 16 formats (4 prompts * 4 output formats), and 'output' is generated according to the output format specified in 'instruction'.
train.json: Same fields as KnowLM-IE.json, 'instruction' and 'output' have only one format, and extraction instructions and outputs can also be freely constructed through 'relation'.
valid.json: Same fields as train.json, but with more accurate annotations achieved through crowdsourcing.
Here is an explanation of each field:
| Field | Description |
|---|---|
| id | Unique identifier |
| cate | Text topic of the input (12 topics in total) |
| input | Model input text (need to extract all triples involved within) |
| instruction | Instruction for the model to perform the extraction task |
| output | Expected model output |
| relation | Relation triples (head, relation, tail) involved in the input |
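As an illustration of working with the `output` field, the sketch below parses a string in the `(head,relation,tail)` form used in the knowledge graph construction example above. The function name `parse_triples` is hypothetical, and the pattern assumes that entities and relations contain no commas or parentheses; it covers only this one output format, not all 16 instruction formats.

```python
import re
from typing import List, Tuple

# One triple: three comma-separated parts inside a pair of parentheses.
TRIPLE_PATTERN = re.compile(r"\(([^,()]+),([^,()]+),([^,()]+)\)")


def parse_triples(output: str) -> List[Tuple[str, ...]]:
    """Extract (head, relation, tail) tuples from a concatenated triple string."""
    return [
        tuple(part.strip() for part in match.groups())
        for match in TRIPLE_PATTERN.finditer(output)
    ]


example = "(弗雷泽,获奖,铜牌)(女子水球世界杯,举办地点,天津)"
print(parse_triples(example))
# [('弗雷泽', '获奖', '铜牌'), ('女子水球世界杯', '举办地点', '天津')]
```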
For more information on data processing and data formats, please refer to ../InstructKGC/kg2instruction
This evaluation task is essentially a triple extraction (rte) task. Detailed parameters and configuration for using this module can be found in the Environment and Data section above. The main parameter settings are as follows:
- Set `task` to `rte`, indicating a triple extraction task;
- Set `language` to `ch`, indicating that the task is based on Chinese data;
- Set `engine` to the desired OpenAI large model name (since the OpenAI GPT-4 API is not fully open, this module currently does not support the use of the GPT-4 API);
- Set `text_input` to the `text` field in the dataset;
- Set `zero_shot` as needed; if set to `False`, examples for in-context learning need to be set in the `/data/rte_ch.json` file in a specific format;
- Set `instruction` to the `instruction` field in the dataset; if set to `None`, the default instruction for the module will be used;
- Set `labels` to the entity types, or leave it empty.
Other parameters can be left at their default values.
We have provided a conversion script for the CCKS2023 competition data format: `LLMICL/ccks2023_convert.py`.
We use the EasyInstruct tool, a user-friendly framework for instructing large language models, to complete this task. Please refer to Chapter 1 for the environment and data.
After setting the parameters, simply run the run.py file:
>> python run.py

Input and output examples for making predictions using ChatGPT:
| Input | Output |
|---|---|
| task="rte" language="ch" engine="gpt-3.5-turbo" text_input="2006年,弗雷泽出战中国天津举行的女子水球世界杯,协助国家队夺得冠军。2008年,弗雷泽代表澳大利亚参加北京奥运会女子水球比赛,赢得铜牌。" instruction="使用自然语言抽取三元组,已知下列句子,请从句子中抽取出可能的实体、关系,抽取实体类型为{'专业','时间','人类','组织','地理地区','事件'},关系类型为{'体育运动','包含行政领土','参加','国家','邦交国','夺得','举办地点','属于','获奖'},你可以先识别出实体再判断实体之间的关系,以(头实体,关系,尾实体)的形式回答" | [[弗雷泽,获奖,铜牌],[女子水球世界杯,举办地点,天津],[弗雷泽,属于,国家队],[弗雷泽,国家,澳大利亚],[弗雷泽,参加,北京奥运会女子水球比赛],[中国,包含行政领土,天津],[中国,邦交国,澳大利亚],[北京奥运会女子水球比赛,举办地点,北京],[女子水球世界杯,体育运动,水球],[国家队,夺得,冠军]] |
We conducted a simple 5-shot in-context learning evaluation on the CCKS dataset using ChatGPT, and the results are shown in the table below:
| Metric | Result |
|---|---|
| F1 | 0.3995 |
| Rougen_2 | 0.7730 |
| score (0.5*F1+0.5*Rougen_2) | 0.5863 |
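The combined score in the last row is the equal-weight average of the two metrics above it. The sketch below reproduces that arithmetic and adds a set-based triple F1 for illustration. The F1 definition (exact match over sets of triples) is an assumption made here and may differ from the competition's official scorer; the second metric is assumed to be ROUGE-2 and is not recomputed.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]


def triple_f1(predicted: Iterable[Triple], gold: Iterable[Triple]) -> float:
    """Exact-match F1 over sets of triples (assumed definition, see lead-in)."""
    pred_set, gold_set = set(predicted), set(gold)
    correct = len(pred_set & gold_set)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_set)
    recall = correct / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def combined_score(f1: float, rouge_2: float) -> float:
    """Equal-weight combination used in the table: 0.5 * F1 + 0.5 * ROUGE-2."""
    return 0.5 * f1 + 0.5 * rouge_2


# Values from the table: 0.5 * 0.3995 + 0.5 * 0.7730 = 0.58625,
# which the table reports rounded as 0.5863.
print(combined_score(0.3995, 0.7730))
```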