When building datasets for machine learning, data collection methods fall into two main categories.
ETL and Data Warehouses
The first option is to put data into a warehouse. These repositories are usually designed for structured (SQL) records that fit into standard table formats. It’s safe to say that your sales data, payroll information, and CRM records all fall into this category. Another customary aspect of working with warehouses is transforming data before loading it. In this article, we’ll go into greater detail about data transformation techniques. In general, though, it means that you know which data you need and how it must look, so you do all the processing before storing it. This approach is known as Extract, Transform, and Load (ETL).
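The ETL order of operations can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: the `RAW_SALES` records, the `sales` table, and the `extract`, `transform`, and `load` helpers are all hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

# Hypothetical raw sales records, as they might arrive from a source system.
RAW_SALES = [
    {"order_id": "1", "amount": " 19.99 ", "country": "us"},
    {"order_id": "2", "amount": "5.00", "country": "DE"},
    {"order_id": "3", "amount": "n/a", "country": "fr"},  # unusable amount
]


def extract():
    """Extract: read records from the source (here, an in-memory list)."""
    return list(RAW_SALES)


def transform(records):
    """Transform: clean and type the records before they are stored."""
    rows = []
    for rec in records:
        try:
            amount = float(rec["amount"].strip())
        except ValueError:
            continue  # rows that do not fit the schema are dropped before loading
        rows.append((int(rec["order_id"]), amount, rec["country"].upper()))
    return rows


def load(conn, rows):
    """Load: write the already-clean rows into a fixed warehouse table."""
    conn.execute(
        "CREATE TABLE sales (order_id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


# SQLite in memory is only a stand-in for a warehouse (an assumption).
conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(round(total, 2))
```

Note that the third record never reaches storage: because the cleaning happens before loading, anything the transform step rejects is gone as far as the warehouse is concerned.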
The problem with this approach is that you can never be certain in advance which data will be useful and which won’t. So warehouses are typically accessed through business intelligence interfaces to view the metrics you already know you need to track. There is also an alternative.
ELT and Data Lakes
Data lakes are storage systems that can hold both structured data and unstructured data, such as PDF files, photos, videos, voice recordings, and so on. Even when data is structured, however, it is not transformed before being stored. You load the data as-is and decide how to use and process it later, on demand. This approach is known as Extract, Load, and Transform (ELT): the transformation happens afterward, only when it is needed.
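The ELT order can be sketched the same way. Again, this is only an illustration under assumptions: the `RAW_EVENTS` payloads, the `raw_events` table, and the `load_raw` and `transform_on_read` helpers are hypothetical, and an in-memory SQLite table of untyped text stands in for a real data lake.

```python
import json
import sqlite3

# Hypothetical raw payloads; the third one has a different shape.
RAW_EVENTS = [
    '{"order_id": "1", "amount": " 19.99 ", "country": "us"}',
    '{"order_id": "2", "amount": "5.00", "country": "DE"}',
    '{"order_id": "3", "amount": "n/a", "note": "scanned receipt"}',
]


def load_raw(conn, payloads):
    """Load: store every payload unchanged, with no schema enforced."""
    conn.execute("CREATE TABLE raw_events (payload TEXT)")
    conn.executemany("INSERT INTO raw_events VALUES (?)", [(p,) for p in payloads])
    conn.commit()


def transform_on_read(conn):
    """Transform: parse stored payloads only when a use case needs them."""
    amounts = []
    for (payload,) in conn.execute("SELECT payload FROM raw_events"):
        rec = json.loads(payload)
        try:
            amounts.append(float(rec["amount"].strip()))
        except ValueError:
            pass  # skipped for this use case, but the raw record stays stored
    return amounts


# SQLite in memory is only a stand-in for a data lake (an assumption).
lake = sqlite3.connect(":memory:")
load_raw(lake, RAW_EVENTS)
stored = lake.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
amounts = transform_on_read(lake)
print(stored, len(amounts))
```

Here all three records are kept, including the one this particular transformation cannot use. A different use case could later read the same raw payloads and process them in another way.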
Data lakes are generally considered a better fit for machine learning. However, if you are confident about at least some of your data, it’s worth keeping that data prepared and available, because you can use it for analytics before you even begin any data science initiatives.
Keep in mind that modern cloud data warehouse providers support both approaches as well.