# Raw Data Pipeline

This scheme describes how raw data from external sources is collected, processed, and stored.

The Pipeline can collect raw data in any text format, convert it to JSON, and automatically determine the DB schemas used to store the data.
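
To illustrate what "automatically determine the DB schemas" could look like, here is a minimal sketch that derives column names and ClickHouse column types from a single JSON record. The function names (`infer_column_type`, `infer_schema`), the type mapping, and the dotted naming of nested keys are assumptions made for illustration, not the Pipeline's actual rules.

```python
import json


def infer_column_type(value):
    """Map a decoded JSON value to a ClickHouse column type (assumed mapping)."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "UInt8"
    if isinstance(value, int):
        return "Int64"
    if isinstance(value, float):
        return "Float64"
    # Strings, nulls, and arrays fall back to String in this sketch.
    return "String"


def infer_schema(record, prefix=""):
    """Flatten a JSON object into {column_name: column_type}.

    Nested objects become dotted column names, e.g. {"geo": {"country": ...}}
    yields the column "geo.country" (a hypothetical convention).
    """
    schema = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            schema.update(infer_schema(value, prefix=f"{name}."))
        else:
            schema[name] = infer_column_type(value)
    return schema


event = json.loads(
    '{"user": "alice", "bytes": 512, "latency": 0.25, '
    '"ok": true, "geo": {"country": "DE"}}'
)
schema = infer_schema(event)
```

A real implementation would likely merge the schemas of many records and handle nullable or conflicting types; this sketch looks at one record only.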


The Pipeline consists of four elements:

  1. Log Collector is an HTTP receiver endpoint. It wraps the received data in the following model: (_id, _aggregatedAt, _connector, _sourceType, _source) and sends it to the Preprocessor. It also validates the API key and the received data model.

  2. Preprocessor handles the received raw data as follows:

    • Parses the text and converts data into JSON.
    • Processes bulk events and extracts single data elements.
    • Modifies the input model, e.g. adds computed attributes or changes an attribute's type.
    • Adds system labels to the model.
  3. DB Scheme Validator creates a DB schema according to the data model located in the connector. It builds the schema from a JSON model and adds extra attributes if needed. It also performs stream processing using ML algorithms based on previously trained models and adds the following labels:

    • _labels.ml.cluster.id – The ID of the cluster the event belongs to.
    • _labels.ml.cluster.modelVersion – ML model version (iteration).
    • _labels.ml.cluster.position – Event position relative to the cluster's center. A fractional number; the closer it is to 1, the closer the event is to the center of the cluster.
  4. Data Buffer stores all the raw data in the ClickHouse DB according to the data stream's schema.
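
The first three stages above can be sketched end to end as follows. This is a rough illustration under stated assumptions, not the actual implementation: the function names (`wrap`, `preprocess`, `cluster_labels`), the `key=value` input format, the use of `_source` to carry the raw payload, the `_labels.system.connector` label, and the `1 / (1 + distance)` position formula are all hypothetical. Only the envelope field names and the `_labels.ml.cluster.*` label names come from the description above.

```python
import math
import uuid
from datetime import datetime, timezone


def wrap(raw_text, connector, source_type):
    """Log Collector: wrap raw input in the envelope model."""
    return {
        "_id": str(uuid.uuid4()),
        "_aggregatedAt": datetime.now(timezone.utc).isoformat(),
        "_connector": connector,
        "_sourceType": source_type,
        "_source": raw_text,  # assumption: the raw payload is carried here
    }


def preprocess(envelope):
    """Preprocessor: split a bulk payload into single events as dicts."""
    events = []
    for line in envelope["_source"].splitlines():
        if not line.strip():
            continue
        # Assumed input format: space-separated key=value pairs.
        fields = dict(p.split("=", 1) for p in line.split() if "=" in p)
        # Example of changing an attribute's type.
        if "bytes" in fields:
            fields["bytes"] = int(fields["bytes"])
        # Hypothetical system label.
        fields["_labels"] = {"system": {"connector": envelope["_connector"]}}
        events.append(fields)
    return events


def cluster_labels(point, centroids, model_version):
    """Assign the nearest centroid and a position score in (0, 1]."""
    distances = [math.dist(point, c) for c in centroids]
    cluster_id = distances.index(min(distances))
    return {
        "id": cluster_id,
        "modelVersion": model_version,
        # Assumed formula: 1.0 at the center, approaching 0 far away.
        "position": 1.0 / (1.0 + distances[cluster_id]),
    }


envelope = wrap(
    "user=alice bytes=512\nuser=bob bytes=42",
    connector="demo",
    source_type="text",
)
events = preprocess(envelope)
centroids = [(50.0,), (500.0,)]  # stand-in for a previously trained model
for event in events:
    point = (float(event["bytes"]),)
    event["_labels"]["ml"] = {
        "cluster": cluster_labels(point, centroids, model_version=1)
    }
```

Each resulting event is a plain dictionary that could be serialized to JSON and handed to the Data Buffer for storage. The nearest-centroid lookup stands in for whatever clustering model is actually trained.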