API
- class stwfsapy.predictor.StwfsapyPredictor(graph, concept_type_uri, sub_thesaurus_type_uri='', thesaurus_relation_type_uri='', thesaurus_relation_is_specialisation=False, remove_deprecated=True, langs=frozenset({}), input='content', use_txt_vec=False, handle_title_case=True, extract_upper_case_from_braces=True, extract_any_case_from_braces=False, expand_ampersand_with_spaces=True, expand_abbreviation_with_punctuation=True, simple_english_plural_rules=False)
Finds labels of thesaurus concepts in texts and assigns them a score.
Creates the predictor.
- Parameters:
graph (Graph) – The SKOS ontology used to extract the labels.
concept_type_uri (Union[str, URIRef]) – The URI of the concept type. It is assumed that for every concept c, there is a triple (c, RDF.type, concept_type_uri) in the graph.
sub_thesaurus_type_uri (Union[str, URIRef]) – The URI of the sub thesaurus type. It is assumed that for every sub thesaurus t, there is a triple (t, RDF.type, sub_thesaurus_type_uri) in the graph.
thesaurus_relation_type_uri (Union[str, URIRef]) – URI of the relation that links concepts to thesauri.
thesaurus_relation_is_specialisation (bool) – Indicates whether the relation given by thesaurus_relation_type_uri links thesauri to concepts or the other way round. E.g., for the relation skos:broader it should be False. Conversely, it should be True for skos:narrower.
remove_deprecated (bool) – When True, deprecated subjects are discarded. Deprecation of a subject has to be indicated by a triple (s, OWL.deprecated, Literal(True)) in the graph.
langs (FrozenSet[str]) – Labels are extracted from the graph for each language present in the set. An empty set or None extracts labels regardless of language.
input (str) – The type of input presented to the fit method:
’content’: Input is expected to be an array-like of strings.
’filename’: Input is expected to be a list of filenames.
’file’: Input is expected to be a list of file objects.
use_txt_vec (bool) – Whether to use vectorized representations of inputs. This can lead to high memory consumption.
handle_title_case (bool) – When True, labels are also matched in title case, i.e., the first letter of every word in the text can be upper or lower case and the label will still match. When False, only the case of the first word’s first letter is adapted. Example:
Given a label “garbage can” and the title “Oscar Lives in a Garbage Can”:
When handle_title_case == True, the label matches the text.
When handle_title_case == False, the label does not match the text. It would, however, still match “Garbage can is home to grouchy neighbor.”
extract_upper_case_from_braces (bool) – Removes the explanation in parentheses from labels, e.g., GDP (Gross Domestic Product) is transformed to GDP.
extract_any_case_from_braces (bool) – Extracts the content of parentheses in labels, e.g., R&D (research and discovery) is transformed to research and discovery. In contrast to extract_upper_case_from_braces, it extracts the part inside the parentheses and not the part before them.
expand_ampersand_with_spaces (bool) – For labels that contain an ampersand, text with spaces around that symbol is also matched, e.g., R & D is matched for the label R&D.
expand_abbreviation_with_punctuation (bool) – For labels containing only uppercase letters, text with added punctuation is also matched, e.g., G.D.P. for the label GDP.
simple_english_plural_rules (bool) – Enables detection of simple English plural forms of labels.
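The matching options above can be easier to follow with a small, standalone illustration. The sketch below is not the library's implementation; it is a hypothetical helper, label_pattern, that builds a plain regular expression approximating what handle_title_case, expand_ampersand_with_spaces and expand_abbreviation_with_punctuation are described as doing.

```python
import re


def _escape(text, loose_ampersand):
    """Escape text for a regex, optionally allowing spaces around '&'."""
    amp = r" ?& ?" if loose_ampersand else "&"
    return amp.join(re.escape(piece) for piece in text.split("&"))


def label_pattern(label, handle_title_case=True,
                  expand_ampersand_with_spaces=True,
                  expand_abbreviation_with_punctuation=True):
    """Hypothetical helper: approximate the documented options as a regex."""
    if (expand_abbreviation_with_punctuation
            and label.isalpha() and label.isupper()):
        # All-uppercase label: allow an optional dot after every letter,
        # so "GDP" also matches "G.D.P.".
        body = "".join(re.escape(ch) + r"\.?" for ch in label)
    else:
        parts = []
        for i, word in enumerate(label.split(" ")):
            first, rest = word[:1], word[1:]
            # The first word's first letter is always case-flexible;
            # later words only when title case handling is enabled.
            flexible = i == 0 or handle_title_case
            if flexible and first.isalpha() and first.islower():
                head = "[" + first.upper() + first + "]"
            else:
                head = _escape(first, expand_ampersand_with_spaces)
            parts.append(head + _escape(rest, expand_ampersand_with_spaces))
        body = " ".join(parts)
    # Only match whole tokens, not fragments inside longer words.
    return re.compile(r"(?<!\w)" + body + r"(?!\w)")


title = "Oscar Lives in a Garbage Can"
print(bool(label_pattern("garbage can").search(title)))
print(bool(label_pattern("garbage can", handle_title_case=False).search(title)))
```

With the default options the label is found in the title; with handle_title_case=False it is not, which mirrors the “garbage can” example in the parameter description.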
- fit(X, y=None, **kwargs)
Fits the classifier to the given training data.
- Params X:
Iterable of text inputs.
- Params y:
Iterable of correct concepts given by their URI for supervised training.
- Returns:
self: The fitted StwfsapyPredictor instance.
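A minimal sketch of how construction and fitting might be combined. Several things here are assumptions rather than documented facts: the thesaurus file path is supplied by the caller, the thesaurus is assumed to type its concepts as skos:Concept, the example texts and concept URIs are invented, and y is assumed to hold one list of concept URIs per input text.

```python
# Constructor keyword arguments; the names are taken from the signature above.
# Assumption: concepts are typed as skos:Concept. Replace the URI with the
# concept type of your own thesaurus.
PREDICTOR_OPTIONS = {
    "concept_type_uri": "http://www.w3.org/2004/02/skos/core#Concept",
    "langs": frozenset({"en"}),   # only extract English labels
    "handle_title_case": True,
}

# Invented training data: one list of concept URIs per text (assumed layout).
TRAIN_TEXTS = [
    "Oscar Lives in a Garbage Can",
    "The G.D.P. grew faster than spending on R & D.",
]
TRAIN_CONCEPTS = [
    ["http://example.org/concept/garbage-can"],
    ["http://example.org/concept/gdp", "http://example.org/concept/r-and-d"],
]


def build_and_fit(thesaurus_path, texts=TRAIN_TEXTS, concepts=TRAIN_CONCEPTS):
    """Parse a thesaurus file, create a predictor and fit it."""
    # Imported here so the data above can be inspected without
    # rdflib and stwfsapy being installed.
    from rdflib import Graph
    from stwfsapy.predictor import StwfsapyPredictor

    graph = Graph()
    # Pass format=... explicitly if rdflib cannot infer it from the file name.
    graph.parse(thesaurus_path)
    predictor = StwfsapyPredictor(graph, **PREDICTOR_OPTIONS)
    # fit returns the fitted instance, so the result can be returned directly.
    return predictor.fit(texts, concepts)
```

Calling build_and_fit with the path to a thesaurus file returns the fitted predictor, since fit returns self.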
- static load(path)
Loads a predictor instance from a previously stored zip file.
- Params path:
Path to the zip file.
- Returns:
A reconstructed StwfsapyPredictor instance.
- match_and_extend(inputs, truth_refss=None)
Retrieves concepts by their labels from text. If ground truth values are present, it will also return a list of labels for scoring matches. If no ground truth values are present, a list with the number of matched concepts for each document is returned.
- predict(X)
Predicts binary concept match labels for each input text.
- Params X:
Iterable of input strings.
- Return type:
csr_matrix
- Returns:
A sparse matrix of shape (n_samples, n_concepts) indicating predicted concept matches.
- predict_proba(X)
Predicts probability scores for each concept per document.
- Params X:
Iterable of input texts.
- Return type:
csr_matrix
- Returns:
A sparse matrix of shape (n_samples, n_concepts) with concept match probabilities.
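Because the result is a sparse matrix of shape (n_samples, n_concepts), a common follow-up step is to pick the highest-scoring columns per document. The helper below, top_concepts, is a hypothetical sketch that relies only on the tocoo() method and shape attribute that SciPy sparse matrices provide. It returns column indices; how columns map to concept URIs is not described in this section, so that mapping is left out.

```python
def top_concepts(proba, k=3):
    """Return, per document, the column indices of the k highest scores.

    `proba` is expected to behave like the csr_matrix returned by
    predict_proba: rows are documents, columns are concepts.
    """
    coo = proba.tocoo()  # COO format exposes parallel row/col/data arrays
    per_doc = [[] for _ in range(proba.shape[0])]
    for row, col, score in zip(coo.row, coo.col, coo.data):
        per_doc[int(row)].append((float(score), int(col)))
    # Sort by descending score; ties are broken by ascending column index.
    return [
        [col for _, col in sorted(doc, key=lambda t: (-t[0], t[1]))[:k]]
        for doc in per_doc
    ]
```

Documents without any stored score yield an empty list, so the output always has one entry per input text.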
- store(path)
Stores a predictor instance into a zip file.
- Params path:
Path to the zip file storing the trained predictor.
- Returns:
None
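store and load are meant to be used as a pair. The sketch below shows one way to round-trip a predictor; store_and_reload is a hypothetical helper name, and it relies only on the two methods as documented above.

```python
def store_and_reload(predictor, path):
    """Write a predictor to a zip file and read it back.

    Works with any object that offers `store(path)` and a static
    `load(path)`, as documented for StwfsapyPredictor.
    """
    predictor.store(path)  # writes the zip file, returns None
    # load is static, so it is called on the class rather than the instance.
    return type(predictor).load(path)
```

A typical use would be to persist a fitted predictor after training and call load alone in the process that serves predictions.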