🏗 Big Data in Construction. Part 1–1: Choosing a Python IDE. Anaconda. Installing Python.

artem boiko
5 min read · Oct 10, 2020

Almost all documents that are used today by people working in construction are saved in PDF or JPEG format.

In this series of articles, we will turn PDF files into text. The PDF file is our unstructured data: we will convert it to text using Python and a few additional libraries, arrange that text in tabular form, and then visualize the resulting data on the Kaggle platform.

This process can be divided into stages.

In the first stage, we will convert PDF files to text using the Apache Tika library. Then we will split the resulting text into lines. With the help of regular expressions, we will select only the data we need and collect it into an array.
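The first stage could look roughly like this. It is only a minimal sketch, assuming the tika-python package (`pip install tika`, which also needs a Java runtime). The function names, the field labels (`Engineer:`, `Title:`, `Date:`) and the sample text are hypothetical and only illustrate the split-and-match idea, not the real layout of the drawings.

```python
import re

# Hypothetical patterns: the real drawings may label these fields differently.
FIELD_PATTERNS = {
    "engineer": re.compile(r"^Engineer:\s*(.+)$"),
    "title": re.compile(r"^Title:\s*(.+)$"),
    "date": re.compile(r"^Date:\s*(\d{2}\.\d{2}\.\d{4})$"),
}


def pdf_to_text(pdf_path):
    """Convert one PDF to plain text with Apache Tika (assumes tika-python and Java)."""
    from tika import parser  # imported lazily so the rest of the sketch runs without Tika

    parsed = parser.from_file(str(pdf_path))
    return parsed.get("content") or ""


def extract_fields(text):
    """Split the text into lines and keep the values that match one of the patterns."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        for name, pattern in FIELD_PATTERNS.items():
            match = pattern.match(line)
            if match:
                record[name] = match.group(1).strip()
    return record


# Made-up text standing in for the result of pdf_to_text("drawing.pdf")
sample_text = """
Title: Ground floor plan
Engineer: J. Smith
Date: 10.10.2020
Scale 1:100
"""
record = extract_fields(sample_text)
print(record)
```

Lines that match none of the patterns, such as the scale line in the sample, are simply ignored.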

To apply these operations to all the files in a folder, we will create a function. After that, we will save the data in CSV format and upload the resulting file to the Kaggle platform.
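Here is a rough sketch of such a function, using only the standard library. The name `process_folder` and its parameters are my own placeholders; `extract` stands for whatever function turns one PDF into a dictionary of fields.

```python
import csv
from pathlib import Path


def process_folder(folder, extract, csv_path):
    """Run `extract` on every PDF in `folder` and write the results to a CSV file."""
    rows = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        row = {"file": pdf_path.name}
        row.update(extract(pdf_path))
        rows.append(row)

    # Use the union of all keys so a field missing in one file does not break the export
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

A call such as `process_folder("drawings", my_extract, "drawings.csv")`, where the folder and the function `my_extract` are placeholders, would produce a single CSV file that can then be uploaded to Kaggle.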

Since I work in the construction industry, we will work with PDF files of drawings, but in your case they do not have to be drawings. They can be accounts, contracts, or any other PDF documents that you use in your work.

Our first dataset consists of six PDF files, each of which is a drawing. From these drawings we will extract the engineer's name, the title of the drawing, the creation date, the number of changes, and the comments on those changes.

Python IDE

Before looking at Python IDEs, we should understand what an IDE is. An IDE (integrated development environment) automates much of a developer's routine work and combines all the necessary tools in a single application. Without an IDE, the developer has to select the tools, integrate them, and handle deployment manually.

I have compiled a short list of popular Python IDEs.

With PyCharm, developers can write neat, maintainable code. Its smart code assistance helps them be more productive.

Atom. One of the reasons for Atom's success is its fully customizable interface. Everything can be changed, from the look of the editor to its basic functions. On the other hand, this flexibility is also one of the root causes of the program's sluggishness.

VS Code from Microsoft is very well designed overall. Its main advantage is its extension-based architecture: the editor itself is lightweight, and you extend it by adding components as you need them.

To install Visual Studio Code, type “Visual Studio Code” into the browser's search bar. On the main page of the site, download the build for the platform you need.

We install VS Code into the default folder; the location does not matter here. Once the installation is finished, we start the program, and it opens on its start page. Let's create a new file. We will save it in a folder named “VS Code” on the desktop. Inside it, we'll create another folder, “Python”, where our first file will be stored.

To run our Python code, we need a stable release of Python 3.8.0.
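Once an interpreter is installed, a few lines are enough to check which version is actually running. This is just a small sketch; the minimum of 3.8 is taken from the version mentioned above, and the function name is my own.

```python
import sys

# Assumed minimum, taken from the version mentioned in the text
REQUIRED = (3, 8)


def python_is_recent_enough(required=REQUIRED):
    """Return True if the running interpreter is at least the required version."""
    return sys.version_info[:2] >= required


print("Running Python", sys.version.split()[0])
print("Recent enough:", python_is_recent_enough())
```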

There are several ways to install Python on a computer.

Anaconda or Python?

Pros: Anaconda's Python can be faster than vanilla Python for numerical work: the distribution bundles Intel MKL, which speeds up many NumPy computations. Under Windows, you don't have many alternatives, and you need a Python package installer anyway. Anaconda, Inc. is a company, which is a plus in a corporate setting: you can get a support contract, for instance.

The Anaconda distribution is also very complete: many commonly used packages come preinstalled.
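If you are curious whether your own NumPy build is linked against MKL, you can look at its build configuration. The snippet below is a rough heuristic rather than an official check: it captures what `numpy.show_config()` prints and searches for the string "mkl". The function name is hypothetical, and the function returns `None` when NumPy is not installed.

```python
import contextlib
import io


def numpy_uses_mkl():
    """Return True/False if NumPy's build info mentions MKL, or None if NumPy is missing."""
    try:
        import numpy as np
    except ImportError:
        return None
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        np.show_config()  # prints the build configuration, including BLAS/LAPACK details
    return "mkl" in buffer.getvalue().lower()


print("NumPy linked against MKL:", numpy_uses_mkl())
```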

Cons: In my experience, the conda package manager is fragile and slow. From time to time it downgrades some packages. It does not update itself before running, so you have to do that manually first. I have yet to see pip downgrade anything, and I have had no bad experiences with pip so far. Eventually, the upgrades and downgrades that conda performs on its own lead to an unworkable environment, and you need to reinstall.

We will install the latest version of Python. To do this, type “install Python” into the browser's search bar, download the latest version from the home page of the official site (python.org), and install it in the default folder.

Start VS Code and create a new file. We save the file in our “Python” folder under the name first.py. From the .py extension, VS Code recognizes that the file contains Python code.

VS Code reports that it cannot run the code yet, because no interpreter has been selected in the settings. To fix this, click the language indicator (at the bottom left of the window) and, in the Command Palette, type “Python: Select Interpreter”. Here we choose Python (or Anaconda, if you installed Anaconda instead of plain Python).
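To confirm which interpreter VS Code actually picked, you can ask Python itself. This is a small sketch: the function name is my own, and the conda detection is only a heuristic that assumes conda environments contain a `conda-meta` folder.

```python
import os
import sys


def describe_interpreter():
    """Return the interpreter path and a rough guess whether it is a conda environment."""
    # Heuristic (an assumption, not an official API): conda environments contain "conda-meta"
    is_conda = os.path.isdir(os.path.join(sys.prefix, "conda-meta"))
    return {"executable": sys.executable, "is_conda": is_conda}


info = describe_interpreter()
print(info["executable"])
print("Conda environment:", info["is_conda"])
```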

Now we can write our first program. Let it be Hello World. We write print, followed by the text 'Hello World!' in parentheses, and run the file. The terminal prints our text: “Hello World!”
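For reference, the whole of first.py can be as short as this (the variable name `message` is just my choice):

```python
# first.py
message = "Hello World!"
print(message)  # prints: Hello World!
```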

📈 If you don’t want to wait for new articles to be released, you can find a complete course on data extraction here.

https://bigdataconstruction.com/courses/part-1-collect-data-extract-data-from-pdf-pdf-to-excel/

☕️ If you like my content, please consider buying me a coffee. Thank you for your support, I really appreciate it! buymeacoffee.com/boikoartem

artem boiko

For the last ten years I have been working in the construction industry, implementing Python scripts and process automation.