Want to be able to filter through millions of files and objects without any machine learning or AI?
Not all attributes are the same
The key is to understand the data model your use case requires and to make sure your technology platform can handle it. So, what is a data model? It is a description of the tags and labels (often called attributes) that you associate with a document or any other object. Those tags and labels can then be used to power search and filtering. Back in the day in the Armed Forces, we often had to define hundreds of these attributes for our software applications without really understanding that they all belonged to a handful of basic concepts. When we were frustrated that we could not deliver the necessary views of all our data, we saw that not all applications on the market could handle these different basic perspectives. What we learned was that all these attributes seemed to belong to one of these categories:
Tagging & Classifications
Relation
Location
Time
Source Type
User Interaction
Tagging
This first one is fairly straightforward. A lot of software supports tagging, which means attaching free-form labels to an item. For instance, we could tag a reference as a book, a newspaper article, or a podcast episode. Tags are freeform and flexible because we create them ad hoc when we need them. Tags are often listed separated by commas, like this:
books, apple, tesla
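As a minimal sketch of how such comma-separated tags could power a filter, the snippet below splits a tag string into a normalized set and selects the items carrying a given tag. The item names and the helper functions `parse_tags` and `filter_by_tag` are hypothetical, chosen only for illustration.

```python
def parse_tags(raw: str) -> set:
    """Split a comma-separated tag string into a set of lowercase tags."""
    return {t.strip().lower() for t in raw.split(",") if t.strip()}


# Hypothetical items, each carrying a free-form tag string.
items = {
    "note-1": "books, apple, tesla",
    "note-2": "podcast, tesla",
    "note-3": "Books, cooking",
}


def filter_by_tag(items: dict, tag: str) -> list:
    """Return the names of all items that carry the given tag."""
    wanted = tag.strip().lower()
    return sorted(name for name, raw in items.items() if wanted in parse_tags(raw))


print(filter_by_tag(items, "books"))
```

Normalizing case and whitespace is a design choice here: because tags are created ad hoc, "Books" and "books" would otherwise end up as two different tags.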
Classifications
The next step is using a classification, which basically means that we populate the attribute from a fixed list of values that do not overlap. In this example, it would be a drop-down menu with three allowed values:
- book
- newspaper article
- podcast episode
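One way to model a classification in code is an enumeration, which rejects any value outside the allowed list, unlike a free-form tag. This is only a sketch; the names `ReferenceType`, `classify`, and `is_allowed` are assumptions made for the example.

```python
from enum import Enum


class ReferenceType(Enum):
    """The three allowed, non-overlapping values from the list above."""
    BOOK = "book"
    NEWSPAPER_ARTICLE = "newspaper article"
    PODCAST_EPISODE = "podcast episode"


def classify(value: str) -> ReferenceType:
    """Map user input to an allowed value; raises ValueError otherwise."""
    return ReferenceType(value.strip().lower())


def is_allowed(value: str) -> bool:
    """Check whether a value belongs to the classification list."""
    try:
        classify(value)
        return True
    except ValueError:
        return False


print(classify("Book"), is_allowed("magazine"))
```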
Taxonomies
The next step is to link collections together in a hierarchy, which is called a taxonomy. An example could be a list with the values:
- Autobiography
- Cookbooks
- Fiction
- Travel
All these are different types of books and belong to the books taxonomy. That means that if we see the value Autobiography, we know that it is a type of book. A taxonomy makes search and filtering somewhat smart: a filter on books can also return everything classified as Autobiography, Cookbooks, Fiction, or Travel.
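The hierarchy can be sketched as a simple child-to-parent mapping. In the example below, a filter on "Books" also matches an item classified as "Autobiography". The mapping `PARENT`, the sample library, and the function names are hypothetical.

```python
# Child -> parent links for the books taxonomy described above.
PARENT = {
    "Autobiography": "Books",
    "Cookbooks": "Books",
    "Fiction": "Books",
    "Travel": "Books",
}


def ancestors(term: str) -> list:
    """Walk up the hierarchy and collect every broader term."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def matches(term: str, query: str) -> bool:
    """An item matches if its term is the query or a descendant of it."""
    return term == query or query in ancestors(term)


# Hypothetical items with their classification value.
library = {"My Life": "Autobiography", "Morning headline": "Newspaper article"}

print([title for title, term in library.items() if matches(term, "Books")])
```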
Semantic tagging
Finally, we move into a fairly unknown and underestimated part of artificial intelligence and that is semantic tagging using ontologies.
A taxonomy created with an ontology standard is even more expressive: each term or concept belongs to a domain, and we can link terms not only in a hierarchy but also across branches, with support for synonyms and for the same term in different languages. Based on the knowledge codified in the taxonomy, we can start doing what is called reasoning, meaning that we can extract information that is not in the source text but is derived from the knowledge we embedded in the ontology-based taxonomy. That is when all these interlinked concepts create a knowledge graph of linked data and concepts.
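To make the idea of reasoning concrete, here is a deliberately small sketch that does not use any real ontology standard or library. Each concept carries labels in several languages plus synonyms and a link to a broader concept, so a Swedish label found in a text can be resolved to a concept and its broader concept inferred. The structure `CONCEPTS` and the functions `resolve` and `infer` are assumptions for illustration only.

```python
from typing import Optional

# Hypothetical mini concept scheme: labels per language (including
# synonyms) and a link to the broader concept.
CONCEPTS = {
    "autobiography": {
        "labels": {"en": ["autobiography", "memoir"], "sv": ["självbiografi"]},
        "broader": "book",
    },
    "book": {
        "labels": {"en": ["book"], "sv": ["bok"]},
        "broader": None,
    },
}


def resolve(label: str) -> Optional[str]:
    """Find the concept that a label refers to, in any language."""
    needle = label.strip().lower()
    for concept_id, concept in CONCEPTS.items():
        for labels in concept["labels"].values():
            if needle in labels:
                return concept_id
    return None


def infer(label: str) -> list:
    """Return the matched concept followed by every broader concept."""
    concept_id = resolve(label)
    result = []
    while concept_id is not None:
        result.append(concept_id)
        concept_id = CONCEPTS[concept_id]["broader"]
    return result


print(infer("Självbiografi"))
```

The word "book" never appears in the input, yet the result includes it. That is the kind of derived information the text refers to.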
Security Classifications
A special kind of classification is the security classification, which often contains a list of values like:
- unclassified
- internal
- secret
That classification list is not only used for search and filters but more importantly is used to drive rules for permissions and workflows.
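A permission rule driven by this list might look like the sketch below. It assumes the three levels are strictly ordered and that a higher clearance also grants access to everything below it. Real rule sets are usually richer, and the names `LEVELS` and `can_read` are hypothetical.

```python
# Assumption: levels are ordered from least to most restricted.
LEVELS = ["unclassified", "internal", "secret"]


def can_read(clearance: str, classification: str) -> bool:
    """Allow access when the user's clearance is at or above the document level."""
    return LEVELS.index(clearance) >= LEVELS.index(classification)


print(can_read("internal", "secret"))
```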
Relations
All documents and objects can have relations between them. An article can link to other articles as references, and an object can be linked to the person who owns it. Relations can be of different types, taken from a classification list or a taxonomy. In addition, each relation can have a number of attributes that belong to the relation or link itself, independent of the two files or objects it links together. Together, these relations form a graph that can be followed or traversed. This information works best when visualized as a graph with nodes and links.
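A minimal sketch of such a graph: each relation has a type and its own attributes, and a breadth-first traversal follows links of one type from a starting object. The object ids, relation types, and attribute names are made up for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Relation:
    source: str
    target: str
    kind: str  # relation type, e.g. from a classification list
    attributes: dict = field(default_factory=dict)  # belongs to the link itself


# Hypothetical relations between articles, an object, and a person.
relations = [
    Relation("article-1", "article-2", "references", {"page": 4}),
    Relation("article-2", "article-3", "references"),
    Relation("car-1", "person-anna", "owned_by", {"since": 2021}),
]


def reachable(start: str, kind: str, relations: list) -> list:
    """Breadth-first traversal along relations of a single type."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        for rel in relations:
            if rel.source == node and rel.kind == kind and rel.target not in seen:
                seen.add(rel.target)
                order.append(rel.target)
                queue.append(rel.target)
    return order


print(reachable("article-1", "references", relations))
```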
Location
The next type of attribute that can be used to categorize a file or an object is its location or spatial reference. That location can be expressed as text using street names, cities, or countries, but often we choose to represent the location with a coordinate. Stockholm City Hall has both a street address, Hantverkargatan 1, 112 21 Stockholm, and a coordinate, N 59°19′39″ E 18°03′17″. Nowadays a 2D coordinate is not always enough, and the Z-position (height) of the city hall is around 23 m above sea level. Using this information, we can visualize the location of files and objects on digital maps and also perform geospatial analysis, where we ask for restaurants within a 10-minute walk of Stockholm City Hall. Coordinates can be expressed differently depending on the reference system used.
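The sketch below converts the city hall coordinate from degrees, minutes, and seconds to decimal degrees and filters places by straight-line distance using the haversine formula. Two things are assumptions: the 800 m radius is only a crude stand-in for a 10-minute walk (real walking time depends on the street network), and the two places are invented points, not real restaurants.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def dms_to_decimal(degrees: int, minutes: int, seconds: float) -> float:
    """Convert degrees/minutes/seconds (north or east) to decimal degrees."""
    return degrees + minutes / 60 + seconds / 3600


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# N 59°19′39″ E 18°03′17″ from the text above.
CITY_HALL = (dms_to_decimal(59, 19, 39), dms_to_decimal(18, 3, 17))

# Hypothetical places with invented coordinates.
places = {"place-a": (59.3285, 18.0547), "place-b": (59.3475, 18.0547)}


def within(places: dict, center: tuple, radius_m: float) -> list:
    """Return the names of places inside the radius around the center."""
    return sorted(
        name
        for name, (lat, lon) in places.items()
        if haversine_m(center[0], center[1], lat, lon) <= radius_m
    )


print(within(places, CITY_HALL, 800))
```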
Time
Another perspective that often is overlooked is the temporal dimension – when something happened. All file systems today capture when a file was created and modified but what we are talking about here is also tracking the actual event or series of events. The latter part is important because time can mean both a specific point in time and a duration.
Usually, we like to refer both to the overall activity, like the Jan 6th insurrection at the US Capitol, which lasted for a number of hours, and to the individual events within it. Within that activity, there were lots of different events, like when the first fence was breached and when the first window was smashed by the domestic terrorists. All these events can then be visualized in timelines, calendar views, or circular time wheels to show recurring events.
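The distinction between a duration and a point in time can be sketched with two small data classes: an activity with a start and an end, and events with a single timestamp. The example uses an invented "site visit" with placeholder times, not real data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    name: str
    at: datetime  # a specific point in time


@dataclass
class Activity:
    name: str
    start: datetime
    end: datetime  # start and end together give a duration

    def duration(self) -> timedelta:
        return self.end - self.start

    def contains(self, event: Event) -> bool:
        return self.start <= event.at <= self.end


def timeline(activity: Activity, events: list) -> list:
    """Names of the events inside the activity, in chronological order."""
    inside = [e for e in events if activity.contains(e)]
    return [e.name for e in sorted(inside, key=lambda e: e.at)]


# Hypothetical activity and events with placeholder timestamps.
visit = Activity("site visit", datetime(2024, 5, 2, 9, 0), datetime(2024, 5, 2, 15, 0))
events = [
    Event("inspection started", datetime(2024, 5, 2, 10, 30)),
    Event("arrival", datetime(2024, 5, 2, 9, 0)),
    Event("report filed", datetime(2024, 5, 3, 8, 0)),
]

print(visit.duration(), timeline(visit, events))
```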
Source Type
This perspective aims to provide a categorization of the source of the data like what type of device was used to record an image or movie or who was saying or writing something.
User interaction
This final perspective that can be used for filtering is basically a way to highlight the importance of the logs from each software application. Those logs provide information about which user created, updated, or accessed a file or an object.
If we know that our colleague named Anna updated a document we are interested in, we can choose to find that document by looking up Anna and view her latest documents. User interaction data becomes a critical aspect of finding the data we are interested in. It does not necessarily have to be encoded into the actual file or object because we can use a unique id to look this data up.
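The lookup described above can be sketched as a query against a log that is kept separately from the files and refers to them by unique id. The log entries, user names, and the function `latest_objects` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogEntry:
    user: str
    action: str      # "created", "updated", or "accessed"
    object_id: str   # unique id of the file or object
    at: datetime


# Hypothetical interaction log.
log = [
    LogEntry("anna", "created", "doc-17", datetime(2024, 3, 1, 9, 0)),
    LogEntry("bo", "accessed", "doc-17", datetime(2024, 3, 1, 10, 0)),
    LogEntry("anna", "updated", "doc-42", datetime(2024, 3, 2, 14, 30)),
    LogEntry("anna", "updated", "doc-17", datetime(2024, 3, 3, 8, 15)),
]


def latest_objects(log: list, user: str, action: str, limit: int = 5) -> list:
    """Object ids the user acted on, most recent first, without repeats."""
    hits = sorted(
        (e for e in log if e.user == user and e.action == action),
        key=lambda e: e.at,
        reverse=True,
    )
    result = []
    for entry in hits:
        if entry.object_id not in result:
            result.append(entry.object_id)
    return result[:limit]


print(latest_objects(log, "anna", "updated"))
```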
Benefits of applying metadata to unstructured information
Manage lots of unstructured information
Your research project can easily contain thousands of files from different types of sources. Learn what you can do to make sure you can find what you need later on.
Offload Complexity with the Right Tools
Digital tools are needed to manage the complexity of your research. Using a variety of tools can make content traceable, transparent, and easy to find and structure.
Teamwork
Collaborate and build knowledge on top of shared structures. Diversity of perspectives in content sources is key to meaningful insights. Become smarter together.
Spark Creativity
Spend time in a fun way. Explore, find, analyze, reinvent and master the craft of storytelling. Present your insights in a creative way and make an impact on a larger audience.