.. _libervia-cli_pubsub_cache:

=====================================
pubsub/cache: PubSub Cache Management
=====================================

Libervia transparently runs a cache for pubsub: according to internal criteria, some
pubsub items are stored locally.

The ``cache`` subcommands let the user inspect and manipulate this internal cache.

get
===

Retrieve items from the internal cache only. Most end-users won't need this command, as
the usual ``pubsub get`` command uses the cache transparently. However, it may be useful
to inspect the local cache, notably for debugging.

The parameters are basically the same as for :ref:`li_pubsub_get`.

example
-------

Retrieve the last 2 cached items for personal blog::

    $ li pubsub cache get -n urn:xmpp:microblog:0 -M 2

.. _li_pubsub_cache_sync:

sync
====

Synchronise or resynchronise a pubsub node. If the node is already in cache, it will be
deleted then re-cached. The node will be put in cache even if internal policy doesn't
request a synchronisation for this kind of node, and it will be (re-)subscribed to keep
the cache synchronised.

All items of the node (up to the internal limit, which is high) will be retrieved and put
in cache, even if previous versions of those items have been deleted by the
:ref:`li_pubsub_cache_purge` command.


example
-------

Resynchronise personal blog::

    $ li pubsub cache sync -n urn:xmpp:microblog:0

.. _li_pubsub_cache_purge:

purge
=====

Remove items from cache. This may be desirable to save resources, notably disk space.

Note that once a pubsub node is cached, the cache is the source of truth. That means that
if the cache is not explicitly bypassed when retrieving items of a pubsub node (notably
with the ``-C, --no-cache`` option of :ref:`li_pubsub_get`), only items found in cache
will be returned, thus purged items won't be used or returned anymore even if they still
exist on the original pubsub service.
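
For instance, something like the following should retrieve the items directly from the
pubsub service, bypassing the cache (the node name is the one used in the examples of
this page)::

    $ li pubsub get -C -n urn:xmpp:microblog:0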

If you have purged items by mistake, it is possible to retrieve them either node by node
using :ref:`li_pubsub_cache_sync`, or by resetting the whole pubsub cache with
:ref:`li_pubsub_cache_reset`.

If a node or a profile (e.g. a component) caches a lot of items frequently, you may want
to run this command periodically with a scheduler like cron_ (see the sketch below).

.. _cron: https://en.wikipedia.org/wiki/Cron
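
For instance, a possible crontab entry purging old items of the ``ap_gateway`` profile
once a week (the schedule and options here are only an illustration, adapt them to your
setup and environment)::

    0 4 * * 0 li pubsub cache purge -p ap_gateway -b "6 months ago"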

examples
--------

Remove all blog and event items from cache if they haven't been updated in the last 6 months::

    $ li pubsub cache purge -t blog -t event -b "6 months ago"

Remove items from the ``ap_gateway`` profile if they have been created more than 2 months
ago::

    $ li pubsub cache purge -p ap_gateway --created-before "2 months ago"

.. _li_pubsub_cache_reset:

reset
=====

Reset the whole pubsub cache. This means that all nodes and all their items will be
removed from cache. After this command, the cache will be re-filled progressively as if it
were a new one.

.. note::

    Use this command with caution: even if the cache will be reconstructed over time,
    items will have to be retrieved again, which may be resource intensive both for your
    machine and for the pubsub services involved. It also means that searching items will
    return fewer results until all desired items are cached again.

    Also note that all items of cached nodes are retrieved: even if you have previously
    purged items, they will be retrieved again.

example
-------

Reset the whole pubsub cache::

    $ li pubsub cache reset

search
======

Search items in the pubsub cache. The search is done on the whole cache; it is not
restricted to a single node/profile (though it can be, if suitable filters are specified).
Full-Text Search can be done with the ``-f FTS, --fts FTS`` argument, as well as filtering
on parsed data (with ``-F PATH OPERATOR VALUE, --field PATH OPERATOR VALUE``, see below).

By default, parsed data are returned, with the 3 additional keys ``pubsub_service``,
``pubsub_node`` (the search being done on the whole cache, those keys give the full
location of each item) and ``node_profile``.

"Parsed data" are the result of the parsing of the items XML payload by feature aware
plugins. Those data are usually more readable and easier to work with. Parsed data are
only stored when a parser is registered for a specific feature, that means that a Pubsub
item in cache may not have parsed data at all, in which case an empty dict will be used
instead (and ``-P, --payload`` argument should be used to get content of the item).

Dates are normally stored as `Unix time`_ in the database, but the default output
converts the ``updated``, ``created`` and ``published`` fields to human-readable local
time. Use ``--output simple`` if you want to keep the float (or int) value.
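
For instance, to keep the raw timestamps (the profile and keys are taken from the examples
at the end of this page)::

  $ li pubsub cache search -t blog -p louise -k id -k published --output simple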

The XML item payload is not returned by default, but it can be added to the
``item_payload`` field if the ``-P, --payload`` argument is set. You can also use
``--output xml`` (or ``xml_raw`` if you don't want prettifying) to directly output the
highlighted XML, without the parsed data, giving an output similar to that of ``li pubsub
get``.
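
For instance, to show a single cached blog item as highlighted XML (the profile is a
placeholder)::

  $ li pubsub cache search -t blog -p louise -l 1 --output xml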

If you are only interested in specific data (e.g. item id and title), the ``-k KEY,
--key KEY`` argument can be used.

You'll probably want to limit the result size using ``-l LIMIT, --limit LIMIT``, and do
pagination using ``-i INDEX, --index INDEX``.

.. _Unix time: https://en.wikipedia.org/wiki/Unix_time
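
For instance, assuming ``INDEX`` is the offset of the first returned item, the following
should return a second page of 10 blog items (the profile is a placeholder)::

  $ li pubsub cache search -t blog -p louise -l 10 -i 10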

Filters
-------

By default, search returns all items in cache; you have to use filters to specify what you
are looking for. Filters can be split into 3 categories: node/item metadata, Full-Text
Search query and parsed metadata.

Node/item metadata are the generic information you have on a node: which profile it
belongs to, which pubsub service it comes from, what the name or type of the node is,
etc.

The arguments there should be self-explanatory. Type (set with ``-t TYPE, --type TYPE``)
and subtype (set with ``-S SUBTYPE, --subtype SUBTYPE``) are values dependent on the
plugin/feature associated with the node, so they can't be listed exhaustively here. The
most common type is probably ``blog``, for which a subtype can be ``comment``. An empty
string can be used to find items whose (sub)type is not set.

It's usually a good idea to specify a profile with ``-p PROFILE, --profile PROFILE``,
otherwise you may get duplicate results.
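
For instance, to restrict the search to top-level blog items (no comments) of a single
node and profile (values taken from the examples at the end of this page)::

  $ li pubsub cache search -t blog -S "" -p louise -s louise@example.net -n urn:xmpp:microblog:0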

Full-Text Search
----------------

You can specify a Full-Text Search query with the ``-f FTS_QUERY, --fts FTS_QUERY``
argument. The engine is currently SQLite FTS5, and you can check its `query syntax`_.
FTS is done on the whole raw XML payload, which means that all data there can be matched
(including XML tags and attributes).

FTS queries are indexed, which means that they are fast and efficient.
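
For instance, using FTS5's ``OR`` operator to match blog items containing either term (the
query itself is only an illustration)::

  $ li pubsub cache search -t blog -f "XMPP OR ActivityPub" -l 20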

.. note::

  Future versions of Libervia will probably include other FTS engines (support for
  PostgreSQL and MySQL/MariaDB is planned). The syntax may thus vary depending on the
  engine, or a common syntax may be implemented for all engines in the future. Keep that
  in mind if you plan to use FTS capabilities in long-term queries, e.g. in scripts.

.. _query syntax: https://sqlite.org/fts5.html#full_text_query_syntax

Parsed Metadata Filters
-----------------------

It is possible to filter on any field of the parsed data. This is done with the ``-F PATH
OPERATOR VALUE, --field PATH OPERATOR VALUE`` argument (be careful: the short option is an
uppercase ``F``, the lowercase one being used for Full-Text Search).

.. note::

  Parsed Metadata Filters are not indexed, which means that using them is less efficient
  than using e.g. Full-Text Search. If you want to filter on a text field, it's often a
  good idea to pre-filter using Full-Text Search to get more efficient queries.

``PATH`` and ``VALUE`` can be specified either as plain strings or using JSON syntax (if
the value can't be decoded as JSON, it is used as plain text).

``PATH`` is the name of the field to use. If you must go beyond root-level fields, you can
use a JSON array to specify each element of the path: a string is an object key, a number
is an array index. Thus you can use ``title`` to access the root title key, or
equivalently ``'"title"'`` (JSON string, escaped for the shell) or ``'["title"]'`` (JSON
array containing the "title" string, escaped for the shell).

.. note::

  The extra fields ``pubsub_service``, ``pubsub_node`` and ``node_profile`` are added to
  the result after the query, thus they can't be used as fields for filtering (use the
  direct arguments for that).

``OPERATOR`` indicates how the value is used to build the filter. The currently supported
operators are:

``==`` or ``eq``
  Equality operator, true if the field value is the same as the given value.

``!=`` or ``ne``
  Inequality operator, true if the field value is different from the given value.

``>`` or ``gt``
  Greater than, true if the field value is higher than the given value. For strings, this
  is according to alphabetical order.

  Time Pattern can be used here, see below.

``<`` or ``lt``
  Less than, true if the field value is lower than the given value. For strings, this is
  according to alphabetical order.

  Time Pattern can be used here, see below.

``between``
  The given value must be an array with 2 elements. The condition is true if the field
  value is between the 2 elements (for strings, this is according to alphabetical order).

  Time Pattern can be used here, see below.

``in``
  The given value must be an array of elements. The field value must be one of them for
  the condition to be true.

``not_in``
  The given value must be an array of elements. The field value must not be any of them
  for the condition to be true.

``overlap``
  This can be used only on array fields.

  If the given value is not already an array, it is put in one. The condition is true if
  any element of the field value matches any element of the given value. Notably useful to
  filter on tags.

``ioverlap``
  Same as ``overlap`` but done in a case insensitive way.

``disjoint``
  This can be used only on array fields.

  If the given value is not already an array, it is put in one. The condition is true if
  no element of the field value matches any element of the given value. Notably useful to
  filter out tags.

``idisjoint``
  Same as ``disjoint`` but done in a case insensitive way.

``like``
  Does pattern matching on a string. ``%`` can be used to match zero or more characters
  and ``_`` can be used to match any single character.

  If you're not targeting a specific field, it's better to use Full-Text Search when
  possible.

``ilike``
  Like ``like`` but done in a case insensitive way.


``not_like``
  Same as ``like`` except that the condition is true when the pattern does **not** match.

``not_ilike``
  Same as ``not_like`` but done in a case insensitive way.


For ``gt``/``>``, ``lt``/``<`` and ``between``, you can use :ref:`time_pattern` by using
the syntax ``TP(<time pattern>)`` (see examples below).
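
As an illustration of those operators, the following invocations filter blog items on the
``tags`` and ``title`` parsed fields (field names and values follow the examples at the
end of this page)::

  $ li pubsub cache search -t blog -F tags overlap xmpp
  $ li pubsub cache search -t blog -F title ilike "%xmpp%"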

Ordering
--------

Result ordering can be done by a well-known order, or using a parsed data field. Ordering
defaults to ``creation`` (see below), but this may be changed with ``-o ORDER [FIELD]
[DIRECTION], --order-by ORDER [FIELD] [DIRECTION]``.

``ORDER`` can be one of the following:

``creation``
  Order by item creation date. Note that this is the date of creation of the item in cache
  (which most of the time should correspond to the order of creation of the item in the
  source pubsub service), and it may differ from the date of publication as advertised by
  some features (like blog). This is important when old items are imported, e.g. when they
  come from another blog engine.

``modification``
  Order by the date when the item was last modified. The modification date is the same as
  the creation date if the item has never been modified since it has been in cache. The
  same warning as for ``creation`` applies: this is the date of last modification in
  cache, not the one advertised in parsed data.

``item_id``
  Order by the XMPP id of the item. Notably useful when user-friendly IDs are used (as is
  often the case with blogs).

``rank``
  Order items by Full-Text Search rank. This one can only be used when Full-Text Search is
  used (via ``-f FTS_QUERY, --fts FTS_QUERY``). Rank is a value indicating how well an
  item matches the query. This usually needs to be used with the ``desc`` direction, so
  you get the most relevant items first.

``field``
  This special order indicates that the ordering must be done on a parsed data field. The
  following argument is then the path of the field to use (which can be the plain text
  name of a root field, or a JSON-encoded array). An optional direction can be specified
  as a third argument. See the examples below.
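
For instance, to get the 10 cached items best matching a Full-Text Search query, most
relevant first (the query is a placeholder, and ``desc`` is given right after the order
name as suggested by the ``rank`` description above)::

  $ li pubsub cache search -f Slovakia -o rank desc -l 10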

examples
--------

Search for blog items cached for the profile ``louise`` which contain the word
``Slovakia``::

  $ li pubsub cache search -t blog -p louise -f Slovakia

Show the title, publication date and id of blog articles (excluding comments) which have
been published on Louise's blog during the last 6 months, and order them by item id. Here
we use an empty string as a subtype to exclude comments (for which the subtype is
``comment``)::

  $ li pubsub cache search -t blog -S "" -p louise -s louise@example.net -n urn:xmpp:microblog:0 -F published gt 'TP(6 months ago)' -k id -k published -k title -o item_id

Show all blog items from anywhere which are tagged as XMPP or ActivityPub (case
insensitive) and which have been published in the last month (according to the advertised
publishing date, not the cache creation date).

We want to order them by descending publication date (again the advertised publication
date, not cache creation), and we don't want more than 50 results.

We do an FTS query here even though it's not mandatory, because it does an efficient
pre-filtering::

  $ li pubsub cache search -f "xmpp OR activitypub" -F tags ioverlap '["xmpp", "activitypub"]' -F published gt 'TP(1 month ago)' -o field published desc -l 50