Data Mining on LLM Vector Space

Finding all axis-like structures in the context space
Data Mining Problem

Are there any machine (AI) driven axes, ones useful for the machine but not understandable to humans?

Find groups of word pairs (u_1, v_1), …, (u_k, v_k) such that all differences u_i - v_i are …
Pairwise comparison can be done in polynomial time.
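The pairwise search can be sketched as follows. This is a toy illustration with synthetic vectors (a planted "royal" direction stands in for real LLM embeddings): among all word pairs (u, v), it collects pairs whose difference vectors point in nearly the same direction, which is the axis-like structure described above.

```python
import numpy as np

# Toy sketch of the pairwise search: find groups of word pairs whose
# difference vectors u - v are nearly parallel (an "axis"). The embeddings
# are synthetic; with a real LLM you would use its context vectors.
rng = np.random.default_rng(0)
words = {w: rng.normal(size=8) for w in ["man", "woman", "king", "queen"]}
royal = rng.normal(size=8)
words["king"] = words["man"] + royal      # plant an axis: king - man = royal
words["queen"] = words["woman"] + royal   # and queen - woman = royal

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

names = list(words)
pairs = [(u, v) for u in names for v in names if u != v]
axis_groups = []
for i, (u1, v1) in enumerate(pairs):
    for u2, v2 in pairs[i + 1:]:
        d1 = words[u1] - words[v1]
        d2 = words[u2] - words[v2]
        if cos(d1, d2) > 0.95:            # nearly parallel differences
            axis_groups.append(((u1, v1), (u2, v2)))

print(axis_groups)  # includes (('king', 'man'), ('queen', 'woman'))
```

With n words this loop compares O(n^2) pairs against each other, i.e. O(n^4) cosine tests, which is polynomial but motivates the question about more efficient algorithms below.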

Are there more efficient algorithms?

Once we have axes, can we formulate structure-mining problems for triangles, rectangles, etc.?

Hallucination mechanism

An LLM generates sentences by sampling from a word distribution (randomly), so reproducibility is lost.

On the other hand, converting words and sentences to context vectors is done uniquely for a given LLM. Also, since the LLM learns this function from data, the function might be rigid with respect to the training data (the LLM might not be diverse for the same training data).

We explain a context by words, which is, in some sense, discretization and approximation. For a given context vector, representing it by words (verbalizing it) is a discretization that loses the original meaning. Whenever we use sentence generation by an LLM to obtain data or results, we lose exactness.
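The discretization loss above can be made concrete: snapping an arbitrary context vector to its nearest word vector (the only thing words can express) generally leaves a nonzero residual. The embeddings here are synthetic stand-ins for a real model's vectors.

```python
import numpy as np

# Verbalization as quantization: map a context vector to its nearest
# word vector and measure what is lost. Vectors are synthetic.
rng = np.random.default_rng(1)
vocab = rng.normal(size=(100, 16))   # hypothetical word vectors
context = rng.normal(size=16)        # an arbitrary context vector

nearest = vocab[np.argmin(np.linalg.norm(vocab - context, axis=1))]
error = np.linalg.norm(context - nearest)
print(f"quantization error: {error:.3f}")  # > 0 unless context is itself a word
```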

Decomposition in concept

king - man = queen - woman might correspond to a concept of "royalty". So we can consider that there are three elemental concepts (man, woman, and royalty) and that king and queen are generated as sums of concepts.
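The concept-sum picture can be checked directly with synthetic "elemental concept" vectors: if king = man + royalty and queen = woman + royalty, then the two differences king - man and queen - woman coincide exactly.

```python
import numpy as np

# Concept decomposition: king and queen generated as sums of elemental
# concepts, so both differences equal the royalty direction. Synthetic data.
rng = np.random.default_rng(2)
man, woman, royalty = (rng.normal(size=8) for _ in range(3))
king = man + royalty
queen = woman + royalty

assert np.allclose(king - man, queen - woman)  # both equal royalty
print(np.allclose(king - man, royalty))        # True
```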

For a given set of context vectors S, can we find a relatively small set of context vectors (which might correspond to no word) whose partial sums produce many of the context vectors in S?
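One way to attack this is as a sparse-coding-style greedy heuristic; the sketch below is an illustration under that assumption, not an algorithm from the text. S is synthetic, built as partial sums of three hidden elemental concepts, and the greedy loop subtracts whichever dictionary vector best shrinks the residual.

```python
import numpy as np

# Greedy decomposition: approximate each vector in S by a partial sum of a
# small dictionary. Here the dictionary is the (hidden) set of concepts the
# data was built from; in the mining problem the dictionary itself would be
# unknown and searched for.
rng = np.random.default_rng(3)
concepts = rng.normal(size=(3, 12))          # hidden elemental concepts
# every vector in S is a sum of a subset of concepts
subsets = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]
S = np.array([concepts[list(sub)].sum(axis=0) for sub in subsets])

def greedy_decompose(x, dictionary, max_terms=3):
    """Greedily subtract the dictionary vector that best shrinks the residual."""
    residual, chosen = x.copy(), []
    for _ in range(max_terms):
        dists = np.linalg.norm(residual - dictionary, axis=1)
        k = int(np.argmin(dists))
        if dists[k] >= np.linalg.norm(residual):  # no improvement: stop
            break
        chosen.append(k)
        residual = residual - dictionary[k]
    return chosen, residual

# S[3] was built as concepts[0] + concepts[1]; the greedy loop should
# typically recover that subset with a near-zero residual.
chosen, residual = greedy_decompose(S[3], concepts)
print(sorted(chosen), float(np.linalg.norm(residual)))
```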

King - man vs man - king

Is there any way to identify which of the two is more essential?

Comparison of LLMs

Whether king - man = queen - woman holds depends on the dataset; it does not hold for all LLMs.

How can we compare two different LLMs? LLMs for two different languages are very different: the numbers of elemental words differ, and the directions differ.
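One standard technique for this kind of comparison (borrowed from cross-lingual embedding alignment, not from the text above) is orthogonal Procrustes: learn a rotation between the two spaces from a small seed dictionary of anchor pairs, then measure how well it transports the rest. Both spaces below are synthetic, with the second a rotated, noisy copy of the first standing in for a second model.

```python
import numpy as np

# Orthogonal Procrustes alignment of two embedding spaces via SVD.
rng = np.random.default_rng(5)
X = rng.normal(size=(50, 16))                   # anchor words in model A's space
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # hidden true rotation
Y = X @ Q + 0.01 * rng.normal(size=X.shape)     # same anchors in model B's space

U, _, Vt = np.linalg.svd(X.T @ Y)               # Procrustes solution: W = U @ Vt
W = U @ Vt
alignment_error = np.linalg.norm(X @ W - Y) / np.linalg.norm(Y)
print(f"relative alignment error: {alignment_error:.4f}")
```

A low alignment error suggests the two spaces are (up to rotation) comparable on the anchors; a high error quantifies the "directions are different" problem above. When dimensions differ, one would first project to a common dimension.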