Genuine Exam Dumps For Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0:
Prepare Yourself Expertly for Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Exam:
Our team of skilled and experienced professionals is dedicated to delivering up-to-date and precise study materials in PDF format to our customers. We value both your time and your financial investment, and we have spared no effort to provide you with the highest quality work. We ensure that our students consistently achieve a score of more than 95% in the Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam. We provide only authentic and reliable study material. Our team of professionals works continuously to keep the material updated and notifies students quickly if there is any change in the Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 dumps file. The Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam question answers and Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 dumps we offer are as genuine as studying the actual exam content.
24/7 Friendly Approach:
You can reach out to our agents at any time for guidance; we are available 24/7. Our agents will provide the information you need and answer any questions you have. We are here to provide you with the complete study material you need to pass your Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam with extraordinary marks.
Quality Exam Dumps for Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0:
Pass4surexams provides trusted study material. If you want sweeping success in your exam, sign up for the complete preparation at Pass4surexams, and we will provide genuine material that will help you succeed with distinction. Our experts work tirelessly for our customers, ensuring a seamless journey to passing the Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam on the first attempt. We have already helped many students ace IT certification exams with our genuine Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Exam Question Answers. Don't wait: join us today to collect your certification exam study material and get your dream job quickly.
90 Days Free Updates for Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Exam Question Answers and Dumps:
Enroll with confidence at Pass4surexams, and not only will you access our comprehensive Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam question answers and dumps, but you will also benefit from 90 days of free updates. If there are any changes or updates to the Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam content during the 90-day period, our team will promptly notify you and provide the latest study materials, so that you are thoroughly prepared for your exam.
Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Real Exam Questions:
Quality is the heart of our service, which is why we offer our students real exam questions with 100% passing assurance in the first attempt. Our Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 dumps PDF has been prepared by experienced experts on the model of the real exam questions you will face to earn your certification.
Question # 1
Which of the following code blocks immediately removes the previously cached DataFrame
transactionsDf from memory and disk?
A. array_remove(transactionsDf, "*")
B. transactionsDf.unpersist() (Correct)
C. del transactionsDf
D. transactionsDf.clearCache()
E. transactionsDf.persist()
Answer: B
Explanation:
transactionsDf.unpersist()
Correct. The DataFrame.unpersist() command does exactly what the question asks for: it removes all cached parts of the DataFrame from memory and disk.
del transactionsDf
False. While this option can help remove the DataFrame from memory and disk, it does not do so immediately. The reason is that this command only removes the Python name transactionsDf, which tells the Python garbage collector that the object may now be deleted from memory. However, the garbage collector does not necessarily do so immediately and, if you wanted it to run immediately, it would need to be specifically triggered. Find more information linked below.
array_remove(transactionsDf, "*")
Incorrect. The array_remove method from pyspark.sql.functions is used for removing elements that match a given value from arrays in columns. Also, the first argument would be a column, and not a DataFrame as shown in the code block.
transactionsDf.persist()
No. This code block does exactly the opposite of what is asked for: it caches (writes) DataFrame transactionsDf to memory and disk. Note that even though you do not pass in a specific storage level here, Spark will use the default storage level (MEMORY_AND_DISK).
transactionsDf.clearCache()
Wrong. Spark's DataFrame does not have a clearCache() method.
More info: pyspark.sql.DataFrame.unpersist (PySpark 3.1.2 documentation); python - How to delete an RDD in PySpark for the purpose of releasing resources? (Stack Overflow)
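The point about del can be illustrated in plain Python, without Spark. The sketch below is only an analogy: CachedData is a hypothetical stand-in class, not a Spark object, and the immediate cleanup after the last reference disappears is a CPython detail. It shows that del removes a name, while the object itself survives as long as anything else still refers to it.

```python
import gc
import weakref


class CachedData:
    """Hypothetical stand-in for an object that holds cached rows."""

    def __init__(self, rows):
        self.rows = rows


data = CachedData([1, 2, 3])
alias = data                # a second reference to the same object
probe = weakref.ref(data)   # lets us observe whether the object still exists

del data                    # removes only the name `data`
alive_after_del = probe() is not None      # the object survives via `alias`

del alias                   # drop the last strong reference
gc.collect()                # explicitly trigger a garbage collection run
alive_after_collect = probe() is not None  # the object is gone now
```

In Spark, the cached blocks live on the executors, so releasing them is the job of unpersist() rather than of Python's memory management.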
Question # 2
The code block shown below should return a new 2-column DataFrame that shows one attribute
from column attributes per row next to the associated itemName, for all suppliers in column supplier
whose name includes Sports. Choose the answer that correctly fills the blanks in the code block to
accomplish this.
Sample of DataFrame itemsDf:
+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+
Code block:
itemsDf.__1__(__2__).select(__3__, __4__)
A. 1. filter 2. col("supplier").isin("Sports") 3. "itemName" 4. explode(col("attributes"))
B. 1. where 2. col("supplier").contains("Sports") 3. "itemName" 4. "attributes"
C. 1. where 2. col(supplier).contains("Sports") 3. explode(attributes) 4. itemName
D. 1. where 2. "Sports".isin(col("Supplier")) 3. "itemName" 4. array_explode("attributes")
E. 1. filter 2. col("supplier").contains("Sports") 3. "itemName" 4. explode("attributes")
Answer: E
Explanation:
Output of correct code block:
+----------------------------------+------+
|itemName                          |col   |
+----------------------------------+------+
|Thick Coat for Walking in the Snow|blue  |
|Thick Coat for Walking in the Snow|winter|
|Thick Coat for Walking in the Snow|cozy  |
|Outdoors Backpack                 |green |
|Outdoors Backpack                 |summer|
|Outdoors Backpack                 |travel|
+----------------------------------+------+
The key to solving this question is knowing about Spark's explode operator. Using this operator, you can extract values from arrays into single rows. The following guidance steps through the answers systematically from the first to the last gap. Note that there are many ways of solving gap questions and filtering out wrong answers. You do not always have to start from the first gap, but can also exclude some answers based on obvious problems you see with them.
The answers to the first gap present you with two options: filter and where. These two are synonyms in PySpark, so using either of them is fine. The answer options to this gap therefore do not help us in selecting the right answer.
The second gap is more interesting. One answer option includes "Sports".isin(col("Supplier")). This construct does not work, since Python's string does not have an isin method. Another option contains col(supplier). Here, Python will try to interpret supplier as a variable. We have not set this variable, so this is not a viable answer. Then, you are left with answer options that include col("supplier").contains("Sports") and col("supplier").isin("Sports"). The question states that we are looking for suppliers whose name includes Sports, so we have to go for the contains operator here. We would use the isin operator if we wanted to filter for supplier names that match any entry in a list of supplier names.
Finally, we are left with two answers that fill the third gap both with "itemName" and the fourth gap either with explode("attributes") or "attributes". While both are correct Spark syntax, only explode("attributes") will help us achieve our goal. Specifically, the question asks for one attribute from column attributes per row, and this is what the explode() operator does. One answer option also includes array_explode(), which is not a valid operator in PySpark.
More info: pyspark.sql.functions.explode (PySpark 3.1.2 documentation)
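To make the contains/isin distinction and the effect of explode concrete, here is a minimal plain-Python sketch of the same logic. It is not Spark code: the list of dicts is a stand-in for the sample rows of itemsDf, and filter_and_explode is a hypothetical helper written only for this illustration.

```python
# Plain-Python stand-in for the sample rows of itemsDf (not a Spark DataFrame).
items = [
    {"itemName": "Thick Coat for Walking in the Snow",
     "attributes": ["blue", "winter", "cozy"],
     "supplier": "Sports Company Inc."},
    {"itemName": "Elegant Outdoors Summer Dress",
     "attributes": ["red", "summer", "fresh", "cooling"],
     "supplier": "YetiX"},
    {"itemName": "Outdoors Backpack",
     "attributes": ["green", "summer", "travel"],
     "supplier": "Sports Company Inc."},
]


def filter_and_explode(rows, needle):
    """Keep rows whose supplier contains `needle`, then emit one
    (itemName, attribute) pair per array element."""
    result = []
    for row in rows:
        if needle in row["supplier"]:            # substring match, like contains()
            for attribute in row["attributes"]:  # one output row per element, like explode()
                result.append((row["itemName"], attribute))
    return result


exploded = filter_and_explode(items, "Sports")

# An exact-membership test, the analogue of isin(), matches nothing here
# because no supplier is named exactly "Sports".
exact_matches = [row for row in items if row["supplier"] in ["Sports"]]
```

The six pairs in exploded correspond to the six rows of the output table above, while exact_matches stays empty.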
Question # 3
The code block shown below should add a column itemNameBetweenSeparators to DataFrame
itemsDf. The column should contain arrays of maximum 4 strings. The arrays should be composed of
the values in column itemName, which are separated at - or whitespace characters. Choose the answer
that correctly fills the blanks in the code block to accomplish this.
Sample of DataFrame itemsDf:
+------+----------------------------------+-------------------+
|itemId|itemName                          |supplier           |
+------+----------------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |YetiX              |
|3     |Outdoors Backpack                 |Sports Company Inc.|
+------+----------------------------------+-------------------+
Code block:
itemsDf.__1__(__2__, __3__(__4__, "[\s\-]", __5__))
Answer: A
Explanation:
This question deals with the parameters of Spark's split operator for strings.
To solve this question, you first need to understand the difference between DataFrame.withColumn() and DataFrame.withColumnRenamed(). The correct option here is DataFrame.withColumn() since, according to the question, we want to add a column and not rename an existing column. This leaves you with only 3 answers to consider.
The second gap should be filled with the name of the new column to be added to the DataFrame. One of the remaining answers states the column name as itemNameBetweenSeparators, while the other two state it as "itemNameBetweenSeparators". The correct option here is "itemNameBetweenSeparators", since the other option would let Python try to interpret itemNameBetweenSeparators as the name of a variable, which we have not defined. This leaves you with 2 answers to consider.
The decision boils down to how to fill gap 5: either with 4 or with 5. The question asks for arrays of maximum four strings. The code in gap 5 relates to the limit parameter of Spark's split operator (see documentation linked below). The documentation states that "the resulting array's length will not be more than limit", meaning that we should pick the answer option with 4 as the code in the fifth gap here.
On a side note: one answer option includes a function str_split. This function does not exist in PySpark.
More info: pyspark.sql.functions.split (PySpark 3.1.2 documentation)
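The effect of a positive limit can be imitated with Python's re module. This is only a sketch of the documented behaviour, not Spark's implementation: split_with_limit is a hypothetical helper, and it relies on the fact that re.split counts splits (maxsplit) while Spark's limit counts resulting elements.

```python
import re


def split_with_limit(text, pattern, limit):
    """Split `text` on `pattern` into at most `limit` pieces (limit > 0).

    The last piece keeps everything after the final split point.
    """
    if limit == 1:
        # maxsplit=0 means "no limit" in re.split, so handle this case directly.
        return [text]
    # `limit` pieces require `limit - 1` splits.
    return re.split(pattern, text, maxsplit=limit - 1)


parts = split_with_limit("Thick Coat for Walking in the Snow", r"[\s\-]", 4)
# parts == ['Thick', 'Coat', 'for', 'Walking in the Snow']
```

With a limit of 4, the first three separators are used and the rest of the string stays together in the fourth element, which is why 4 is the right value for arrays of at most four strings.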
Question # 4
Which of the following code blocks reads in the two-partition parquet file stored at filePath, making
sure all columns are included exactly once even though each partition has a different schema?
Schema of first partition:
root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- productId: integer (nullable = true)
 |-- f: integer (nullable = true)
Schema of second partition:
root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- rollId: integer (nullable = true)
 |-- f: integer (nullable = true)
 |-- tax_id: integer (nullable = false)
A. spark.read.parquet(filePath, mergeSchema='y')
B. spark.read.option("mergeSchema", "true").parquet(filePath)
C. spark.read.parquet(filePath)
D.
    nx = 0
    for file in dbutils.fs.ls(filePath):
        if not file.name.endswith(".parquet"):
            continue
        df_temp = spark.read.parquet(file.path)
        if nx == 0:
            df = df_temp
        else:
            df = df.union(df_temp)
        nx = nx+1
    df
E.
    nx = 0
    for file in dbutils.fs.ls(filePath):
        if not file.name.endswith(".parquet"):
            continue
        df_temp = spark.read.parquet(file.path)
        if nx == 0:
            df = df_temp
        else:
            df = df.join(df_temp, how="outer")
        nx = nx+1
    df
Answer: B
Explanation:
This is a very tricky question and involves knowledge about both merging and schemas when reading parquet files.
spark.read.option("mergeSchema", "true").parquet(filePath)
Correct. Spark's DataFrameReader's mergeSchema option will work well here, since columns that appear in both partitions have matching data types. Note that mergeSchema would fail if one or more columns with the same name that appear in both partitions had different data types.
spark.read.parquet(filePath)
Incorrect. While this would read in data from both partitions, only the schema in the parquet file that is read in first would be considered, so some columns that appear only in the second partition (e.g. tax_id) would be lost.
The loop that combines the files with df = df.union(df_temp) (option D)
Wrong. The key idea of this solution is the DataFrame.union() command. While this command merges all data, it requires that both partitions have the exact same number of columns with identical data types.
spark.read.parquet(filePath, mergeSchema='y')
False. While using the mergeSchema option is the correct way to solve this problem and it can even be called with DataFrameReader.parquet() as in the code block, it accepts the value True as a boolean or string variable. But 'y' is not a valid option.
The loop that combines the files with df = df.join(df_temp, how="outer") (option E)
No. This provokes a full outer join. While the resulting DataFrame will have all columns of both partitions, columns that appear in both partitions will be duplicated, whereas the question says all columns that are included in the partitions should appear exactly once.
More info: Merging different schemas in Apache Spark | by Thiago Cordon | Data Arena | Medium
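The idea behind schema merging can be sketched in plain Python. This is a conceptual illustration only, not how Spark implements mergeSchema: merge_schemas is a hypothetical helper, schemas are modelled as dicts from column name to type name, and the column order of the result is an assumption of this sketch.

```python
def merge_schemas(*schemas):
    """Combine column-to-type mappings so every column appears exactly once.

    Raises ValueError when the same column has different types in two schemas.
    """
    merged = {}
    for schema in schemas:
        for name, dtype in schema.items():
            if name in merged and merged[name] != dtype:
                raise ValueError(
                    f"conflicting types for column {name!r}: {merged[name]} vs {dtype}"
                )
            merged.setdefault(name, dtype)
    return merged


first_partition = {
    "transactionId": "integer", "predError": "integer", "value": "integer",
    "storeId": "integer", "productId": "integer", "f": "integer",
}
second_partition = {
    "transactionId": "integer", "predError": "integer", "value": "integer",
    "storeId": "integer", "rollId": "integer", "f": "integer", "tax_id": "integer",
}

merged = merge_schemas(first_partition, second_partition)
```

The merged mapping contains the six shared or first-partition columns plus rollId and tax_id, each exactly once, which is the outcome the question asks for. A column with mismatched types would raise an error instead, mirroring the failure case mentioned in the explanation.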
Question # 5
Which of the following code blocks shows the structure of a DataFrame in a tree-like way, containing
both column names and types?
A.
    print(itemsDf.columns)
    print(itemsDf.dtypes)
B. itemsDf.printSchema()
C. spark.schema(itemsDf)
D. itemsDf.rdd.printSchema()
E. itemsDf.print.schema()
Answer: B
Explanation:
itemsDf.printSchema()
Correct! Here is an example of what itemsDf.printSchema() shows. You can see the tree-like structure containing both column names and types:
root
 |-- itemId: integer (nullable = true)
 |-- attributes: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- supplier: string (nullable = true)
itemsDf.rdd.printSchema()
No, the DataFrame's underlying RDD does not have a printSchema() method.
spark.schema(itemsDf)
Incorrect, there is no spark.schema command.
print(itemsDf.columns) print(itemsDf.dtypes)
Wrong. While the output of this code block contains both column names and column types, the information is not arranged in a tree-like way.
itemsDf.print.schema()
No, DataFrame does not have a print method.
Question # 6
The code block shown below should add column transactionDateForm to DataFrame transactionsDf.
The column should express the unix-format timestamps in column transactionDate as string
type like Apr 26 (Sunday). Choose the answer that correctly fills the blanks in the code block to
accomplish this.
transactionsDf.__1__(__2__, from_unixtime(__3__, __4__))
A. 1. withColumn 2. "transactionDateForm" 3. "MMM d (EEEE)" 4. "transactionDate"
B. 1. select 2. "transactionDate" 3. "transactionDateForm" 4. "MMM d (EEEE)"
C. 1. withColumn 2. "transactionDateForm" 3. "transactionDate" 4. "MMM d (EEEE)"
D. 1. withColumn 2. "transactionDateForm" 3. "transactionDate" 4. "MM d (EEE)"
E. 1. withColumnRenamed 2. "transactionDate" 3. "transactionDateForm" 4. "MM d (EEE)"
Answer: C
Explanation:
Correct code block:
transactionsDf.withColumn("transactionDateForm", from_unixtime("transactionDate", "MMM d (EEEE)"))
The question specifically asks about "adding" a column. In the context of all presented answers, DataFrame.withColumn() is the correct command for this. In theory, DataFrame.select() could also be used for this purpose, if all existing columns are selected and a new one is added. DataFrame.withColumnRenamed() is not the appropriate command, since it can only rename existing columns, but cannot add a new column or change the value of a column.
Once DataFrame.withColumn() is chosen, you can read in the documentation (see below) that the first input argument to the method should be the column name of the new column.
The final difficulty is the date format. The question indicates that the date format Apr 26 (Sunday) is desired. The answers give "MMM d (EEEE)" and "MM d (EEE)" as options. It can be hard to know the details of the date format that is used in Spark. Specifically, knowing the differences between MMM and MM is probably not something you deal with every day. But there is an easy way to remember the difference: M (one letter) is usually the shortest form: 4 for April. MM includes padding: 04 for April. MMM (three letters) is the three-letter month abbreviation: Apr for April. And MMMM is the longest possible form: April. Knowing this four-letter sequence helps you select the correct option here.
More info: pyspark.sql.DataFrame.withColumn (PySpark 3.1.2 documentation)
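The difference between the two candidate patterns can be reproduced with Python's datetime module. The sketch below is an analogy rather than Spark code: it assumes the timestamp is interpreted in UTC and an English (default C) locale, and it maps Spark's MMM, MM, EEEE and EEE to the strftime codes %b, %m, %A and %a.

```python
from datetime import datetime, timezone

# 1587859200 seconds after the Unix epoch is 2020-04-26 00:00:00 UTC, a Sunday.
moment = datetime.fromtimestamp(1587859200, tz=timezone.utc)

# Analogue of "MMM d (EEEE)": abbreviated month, unpadded day, full weekday name.
long_form = f"{moment:%b} {moment.day} ({moment:%A})"

# Analogue of "MM d (EEE)": zero-padded month number, unpadded day, short weekday name.
short_form = f"{moment:%m} {moment.day} ({moment:%a})"
```

Here long_form is "Apr 26 (Sunday)", the format the question asks for, while short_form is "04 26 (Sun)".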
Question # 7
Which of the following code blocks reads in the JSON file stored at filePath as a DataFrame?
A. spark.read.json(filePath)
B. spark.read.path(filePath, source="json")
C. spark.read().path(filePath)
D. spark.read().json(filePath)
E. spark.read.path(filePath)
Answer: A
Explanation:
spark.read.json(filePath)
Correct. spark.read accesses Spark's DataFrameReader. Then, the file type to be read is identified as JSON by passing filePath into the DataFrameReader.json() method.
spark.read.path(filePath)
Incorrect. Spark's DataFrameReader does not have a path method. A universal way to read in files is provided by the DataFrameReader.load() method (link below).
spark.read.path(filePath, source="json")
Wrong. A DataFrameReader.path() method does not exist (see above).
spark.read().json(filePath)
Incorrect. spark.read is a way to access Spark's DataFrameReader. However, the DataFrameReader is not callable, so calling it via spark.read() will fail.
spark.read().path(filePath)
No, Spark's DataFrameReader is not callable (see above).
More info: pyspark.sql.DataFrameReader.json (PySpark 3.1.2 documentation); pyspark.sql.DataFrameReader.load (PySpark 3.1.2 documentation)
Question # 8
The code block displayed below contains an error. The code block should write DataFrame
transactionsDf as a parquet file to location filePath after partitioning it on column storeId. Find the
error.
Code block:
transactionsDf.write.partitionOn("storeId").parquet(filePath)
A. The partitioning column as well as the file path should be passed to the write() method of DataFrame transactionsDf directly and not as appended commands as in the code block.
B. The partitionOn method should be called before the write method.
C. The operator should use the mode() option to configure the DataFrameWriter so that it replaces any existing files at location filePath.
D. Column storeId should be wrapped in a col() operator.
E. No method partitionOn() exists for the DataFrame class, partitionBy() should be used instead.
Answer: E
Explanation:
No method partitionOn() exists for the DataFrame class, partitionBy() should be used instead.
Correct! Find out more about partitionBy() in the documentation (linked below).
The operator should use the mode() option to configure the DataFrameWriter so that it replaces any existing files at location filePath.
No. There is no information about whether files should be overwritten in the question.
The partitioning column as well as the file path should be passed to the write() method of DataFrame transactionsDf directly and not as appended commands as in the code block.
Incorrect. To write a DataFrame to disk, you need to work with a DataFrameWriter object, which you get access to through the DataFrame.write property, with no parentheses involved.
Column storeId should be wrapped in a col() operator.
No, this is not necessary. The problem is in the partitionOn command (see above).
The partitionOn method should be called before the write method.
Wrong. First of all, partitionOn is not a valid method of DataFrame. However, even assuming partitionOn were replaced by partitionBy (which is a valid method), this method is a method of DataFrameWriter and not of DataFrame. So, you would always have to first call DataFrame.write to get access to the DataFrameWriter object and afterwards call partitionBy.
More info: pyspark.sql.DataFrameWriter.partitionBy (PySpark 3.1.2 documentation)
Question # 9
Which of the following code blocks creates a new DataFrame with 3 columns, productId, highest, and
lowest, that shows the biggest and smallest values of column value per value in column
productId from DataFrame transactionsDf?
Sample of DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+
A. transactionsDf.max('value').min('value')
B. transactionsDf.agg(max('value').alias('highest'), min('value').alias('lowest'))
C. transactionsDf.groupby(col(productId)).agg(max(col(value)).alias("highest"), min(col(value)).alias("lowest"))
D. transactionsDf.groupby('productId').agg(max('value').alias('highest'), min('value').alias('lowest'))
E. transactionsDf.groupby("productId").agg({"highest": max("value"), "lowest": min("value")})
Answer: D
Explanation:
transactionsDf.groupby('productId').agg(max('value').alias('highest'), min('value').alias('lowest'))
Correct. Grouping and then aggregating is a common pattern to investigate aggregated values of groups.
transactionsDf.groupby("productId").agg({"highest": max("value"), "lowest": min("value")})
Wrong. While DataFrame.agg() accepts dictionaries, the syntax of the dictionary in this code block is wrong. If you use a dictionary, the syntax should be like {"value": "max"}, using the column name as the key and the aggregating function as the value.
transactionsDf.agg(max('value').alias('highest'), min('value').alias('lowest'))
Incorrect. While this is valid Spark syntax, it does not achieve what the question asks for. The question specifically asks for values to be aggregated per value in column productId, and this column is not considered here. Instead, the max() and min() values are calculated as if the entire DataFrame was a group.
transactionsDf.max('value').min('value')
Wrong. There is no DataFrame.max() method in Spark, so this command will fail.
transactionsDf.groupby(col(productId)).agg(max(col(value)).alias("highest"), min(col(value)).alias("lowest"))
No. While this may work if the column names are expressed as strings, this will not work as is. Python will interpret productId and value as variable names, which are not defined, so the code fails before Spark can work out which columns you want to aggregate.
More info: pyspark.sql.DataFrame.agg (PySpark 3.1.2 documentation)
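What the group-then-aggregate pattern computes for the sample data can be traced in plain Python. This is a sketch of the semantics, not Spark code: the list of dicts stands in for the productId and value columns of transactionsDf, highest_and_lowest is a hypothetical helper, and None plays the role of null, which max and min skip during aggregation.

```python
from collections import defaultdict

# Plain-Python stand-in for columns productId and value of the sample DataFrame.
transactions = [
    {"productId": 1, "value": 4},
    {"productId": 2, "value": 7},
    {"productId": 3, "value": None},
    {"productId": 2, "value": None},
    {"productId": 2, "value": None},
    {"productId": 2, "value": 2},
]


def highest_and_lowest(rows):
    """Group rows by productId and return the max and min value per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["productId"]].append(row["value"])

    result = {}
    for product_id, values in groups.items():
        present = [v for v in values if v is not None]  # nulls are skipped
        result[product_id] = {
            "highest": max(present) if present else None,
            "lowest": min(present) if present else None,
        }
    return result


summary = highest_and_lowest(transactions)
```

For the sample, product 1 yields 4 and 4, product 2 yields 7 and 2, and product 3 yields null for both because its only value is null. Without the grouping step there would be a single result row for the whole DataFrame, which is why option B falls short.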
Question # 10
Which of the following code blocks returns a DataFrame with approximately 1,000 rows from the
10,000-row DataFrame itemsDf, without any duplicates, returning the same rows even if the code
block is run twice?
A. itemsDf.sampleBy("row", fractions={0: 0.1}, seed=82371)
B. itemsDf.sample(fraction=0.1, seed=87238)
C. itemsDf.sample(fraction=1000, seed=98263)
D. itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)
E. itemsDf.sample(fraction=0.1)
Answer: B
Explanation:
itemsDf.sample(fraction=0.1, seed=87238)
Correct. If itemsDf has 10,000 rows, this code block returns about 1,000, since DataFrame.sample() is never guaranteed to return an exact number of rows. To ensure you are not returning duplicates, you should leave the withReplacement parameter at False, which is the default. Since the question specifies that the same rows should be returned even if the code block is run twice, you need to specify a seed. The number passed in the seed does not matter as long as it is an integer.
itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)
Incorrect. While this code block fulfills almost all requirements, it may return duplicates. This is because withReplacement is set to True. Here is how to understand what replacement means: imagine you have a bucket of 10,000 numbered balls and you need to take 1,000 balls at random from the bucket (similar to the problem in the question). Now, if you took those balls with replacement, you would take a ball, note its number, and put it back into the bucket, meaning the next time you take a ball from the bucket there would be a chance you could take the exact same ball again. If you took the balls without replacement, you would leave the ball outside the bucket and not put it back in as you take the next 999 balls.
itemsDf.sample(fraction=1000, seed=98263)
Wrong. The fraction parameter needs to have a value between 0 and 1. In this case, it should be 0.1, since 1,000 / 10,000 = 0.1.
itemsDf.sampleBy("row", fractions={0: 0.1}, seed=82371)
No, DataFrame.sampleBy() is meant for stratified sampling. This means that based on the values in a column in a DataFrame, you can draw a certain fraction of rows containing those values from the DataFrame (more details linked below). In the scenario at hand, sampleBy is not the right operator to use because you do not have any information about any column that the sampling should depend on.
itemsDf.sample(fraction=0.1)
Incorrect. This code block checks all the boxes except that it does not ensure that when you run it a second time, the exact same rows will be returned. In order to achieve this, you would have to specify a seed.
More info:
- pyspark.sql.DataFrame.sample (PySpark 3.1.2 documentation)
- pyspark.sql.DataFrame.sampleBy (PySpark 3.1.2 documentation)
- Types of Samplings in PySpark 3 | by Pinar Ersoy | Towards Data Science
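The three properties discussed here (approximate size, no duplicates, repeatability) can be observed with a small plain-Python sketch. It is not Spark's sampler: bernoulli_sample is a hypothetical helper that keeps each row independently with probability fraction, which is one simple way to sample without replacement.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`.

    Every row is considered once, so no row can appear twice, and the
    seeded generator makes repeated runs return identical results.
    """
    rng = random.Random(seed)  # private generator, fully determined by the seed
    return [row for row in rows if rng.random() < fraction]


rows = list(range(10_000))
first_run = bernoulli_sample(rows, 0.1, seed=87238)
second_run = bernoulli_sample(rows, 0.1, seed=87238)
```

Both runs return the same rows, the result contains no duplicates, and its length is close to, but not necessarily exactly, 1,000. Dropping the fixed seed would keep the first two properties of the sample but lose the repeatability.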
Question # 11
Which of the following code blocks returns a DataFrame where columns predError and productId are removed from DataFrame transactionsDf?
Sample of DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|f   |
+-------------+---------+-----+-------+---------+----+
|1            |3        |4    |25     |1        |null|
|2            |6        |7    |2      |2        |null|
|3            |3        |null |25     |3        |null|
+-------------+---------+-----+-------+---------+----+
A. transactionsDf.withColumnRemoved("predError", "productId")
B. transactionsDf.drop(["predError", "productId", "associateId"])
C. transactionsDf.drop("predError", "productId", "associateId")
D. transactionsDf.dropColumns("predError", "productId", "associateId")
E. transactionsDf.drop(col("predError", "productId"))
Answer: C
Explanation:
The key here is to understand that columns that are passed to DataFrame.drop() are ignored if they do not exist in the DataFrame. So, passing column name associateId to transactionsDf.drop() does not have any effect.
Passing a list to transactionsDf.drop() is not valid. The documentation (link below) shows the call structure as DataFrame.drop(*cols). The * means that all arguments that are passed to DataFrame.drop() are read as columns. However, since a list of columns, for example ["predError", "productId", "associateId"], is not a column, Spark will run into an error.
The options using withColumnRemoved() and dropColumns() fail because DataFrame has no methods with these names.
More info: pyspark.sql.DataFrame.drop (PySpark 3.1.1 documentation)
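The "unknown names are ignored" behaviour and the role of the * in the call structure can be mimicked in plain Python. This is an analogy only: drop_columns is a hypothetical helper working on a dict that stands in for one row, not a reimplementation of DataFrame.drop().

```python
def drop_columns(row, *names):
    """Return a copy of `row` without the given column names.

    Names that are not present in `row` are silently ignored.
    """
    return {key: value for key, value in row.items() if key not in names}


row = {"transactionId": 1, "predError": 3, "value": 4,
       "storeId": 25, "productId": 1, "f": None}

# "associateId" does not exist in the row, so it has no effect.
trimmed = drop_columns(row, "predError", "productId", "associateId")
```

Because of *names, each column is passed as its own argument, which matches the call in option C.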
Question # 12
Which of the following code blocks returns about 150 randomly selected rows from the 1000-row DataFrame transactionsDf, assuming that any row can appear more than once in the returned DataFrame?
A. transactionsDf.resample(0.15, False, 3142)
B. transactionsDf.sample(0.15, False, 3142)
C. transactionsDf.sample(0.15)
D. transactionsDf.sample(0.85, 8429)
E. transactionsDf.sample(True, 0.15, 8261)
Answer: E
Explanation: Answering this
Question # 13
The code block displayed below contains an error. The code block should use Python method find_most_freq_letter to find the letter present most in column itemName of DataFrame itemsDf and return it in a new column most_frequent_letter. Find the error.
Code block:
find_most_freq_letter_udf = udf(find_most_freq_letter)
itemsDf.withColumn("most_frequent_letter", find_most_freq_letter("itemName"))
A. Spark is not using the UDF method correctly.
B. The UDF method is not registered correctly, since the return type is missing.
C. The "itemName" expression should be wrapped in col().
D. UDFs do not exist in PySpark.
E. Spark is not adding a column.
Answer: A
Explanation:
Correct code block:
find_most_freq_letter_udf = udf(find_most_freq_letter)
itemsDf.withColumn("most_frequent_letter", find_most_freq_letter_udf("itemName"))
Spark should use the previously registered find_most_freq_letter_udf method here, but it is not doing that in the original code block. There, it just uses the non-UDF version of the Python method.
Note that typically, we would have to specify a return type for udf(). This case is an exception, since the default return type for udf() is a string, which is what we are expecting here. If we wanted to return an integer variable instead, we would have to register the Python function as a UDF using find_most_freq_letter_udf = udf(find_most_freq_letter, IntegerType()).
More info: pyspark.sql.functions.udf (PySpark 3.1.1 documentation)
Question # 14
in column itemNameElements. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Example of DataFrame itemsDf:
+------+----------------------------------+-------------------+------------------------------------------+
|itemId|itemName                          |supplier           |itemNameElements                          |
+------+----------------------------------+-------------------+------------------------------------------+
|1     |Thick Coat for Walking in the Snow|Sports Company Inc.|[Thick, Coat, for, Walking, in, the, Snow]|
|2     |Elegant Outdoors Summer Dress     |YetiX              |[Elegant, Outdoors, Summer, Dress]        |
|3     |Outdoors Backpack                 |Sports Company Inc.|[Outdoors, Backpack]                      |
+------+----------------------------------+-------------------+------------------------------------------+
Code block:
itemsDf.__1__(__2__(__3__)__4__)
A. 1. select 2. count 3. col("itemNameElements") 4. >3
B. 1. filter 2. count 3. itemNameElements 4. >=3
C. 1. select 2. count 3. "itemNameElements" 4. >3
D. 1. filter 2. size 3. "itemNameElements" 4. >=3 (Correct)
E. 1. select 2. size 3. "itemNameElements" 4. >3
Answer: D
Explanation:
Correct code block:
itemsDf.filter(size("itemNameElements")>=3)
Output of code block:
+------+----------------------------------+-------------------+------------------------------------------+
|itemId|itemName                          |supplier           |itemNameElements                          |
+------+----------------------------------+-------------------+------------------------------------------+
|1     |Thick Coat for Walking in the Snow|Sports Company Inc.|[Thick, Coat, for, Walking, in, the, Snow]|
|2     |Elegant Outdoors Summer Dress     |YetiX              |[Elegant, Outdoors, Summer, Dress]        |
+------+----------------------------------+-------------------+------------------------------------------+
The big difficulty with this is in knowing the difference between count and size (refer to documentation below). size is the correct function to choose here since it returns the number of elements in an array on a per-row basis.
The other consideration for solving this
Question # 15
The code block displayed below contains an error. The code block below is intended to add a column itemNameElements to DataFrame itemsDf that includes an array of all words in column itemName. Find the error.
Sample of DataFrame itemsDf:
+------+----------------------------------+-------------------+
|itemId|itemName                          |supplier           |
+------+----------------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |YetiX              |
|3     |Outdoors Backpack                 |Sports Company Inc.|
+------+----------------------------------+-------------------+
Code block:
itemsDf.withColumnRenamed("itemNameElements", split("itemName"))
A. All column names need to be wrapped in the col() operator.
B. Operator withColumnRenamed needs to be replaced with operator withColumn and a second argument "," needs to be passed to the split method.
C. Operator withColumnRenamed needs to be replaced with operator withColumn and the split method needs to be replaced by the splitString method.
D. Operator withColumnRenamed needs to be replaced with operator withColumn and a second argument " " needs to be passed to the split method.
E. The expressions "itemNameElements" and split("itemName") need to be swapped.
Answer: D
Explanation:
Correct code block:
itemsDf.withColumn("itemNameElements", split("itemName"," "))
Output of code block:
+------+----------------------------------+-------------------+------------------------------------------+
|itemId|itemName                          |supplier           |itemNameElements                          |
+------+----------------------------------+-------------------+------------------------------------------+
|1     |Thick Coat for Walking in the Snow|Sports Company Inc.|[Thick, Coat, for, Walking, in, the, Snow]|
|2     |Elegant Outdoors Summer Dress     |YetiX              |[Elegant, Outdoors, Summer, Dress]        |
|3     |Outdoors Backpack                 |Sports Company Inc.|[Outdoors, Backpack]                      |
+------+----------------------------------+-------------------+------------------------------------------+
The key to solving this is that the split method definitely needs a second argument here (also look at the link to the documentation below). Given the values in column itemName in DataFrame itemsDf, this should be a space character " ". This is the character on which we need to split the words in the column.
More info: pyspark.sql.functions.split - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1 (Databricks import instructions)
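The effect of the separator can be tried on a single value before running it in Spark. The sketch below is only an illustration: split_words and add_item_name_elements are hypothetical helper names, and the plain-Python function mimics Spark's split() by treating the separator as a regular expression.

```python
import re


def split_words(item_name):
    # Plain-Python analogue of Spark's split(): the separator is treated
    # as a regular expression, here a single space.
    return re.split(" ", item_name)


def add_item_name_elements(items_df):
    # pyspark is imported lazily so the helper above runs without Spark.
    from pyspark.sql.functions import split

    # withColumn adds the new array column; withColumnRenamed would only
    # rename an existing column.
    return items_df.withColumn("itemNameElements", split("itemName", " "))
```

For the sample rows, split_words("Outdoors Backpack") yields two elements, which matches the itemNameElements value shown for itemId 3.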
Question # 16
Which of the following code blocks returns only rows from DataFrame transactionsDf in which values in column productId are unique?
A. transactionsDf.distinct("productId")
B. transactionsDf.dropDuplicates(subset=["productId"])
C. transactionsDf.drop_duplicates(subset="productId")
D. transactionsDf.unique("productId")
E. transactionsDf.dropDuplicates(subset="productId")
Answer: B
Explanation: Although the
Question # 17
Which of the following code blocks uses a schema fileSchema to read a parquet file at location filePath into a DataFrame?
A. spark.read.schema(fileSchema).format("parquet").load(filePath)
B. spark.read.schema("fileSchema").format("parquet").load(filePath)
C. spark.read().schema(fileSchema).parquet(filePath)
D. spark.read().schema(fileSchema).format(parquet).load(filePath)
E. spark.read.schema(fileSchema).open(filePath)
Answer: A
Explanation:
Pay attention here to which variables are quoted. fileSchema is a variable and thus should not be in quotes. parquet is not a variable and therefore should be in quotes.
SparkSession.read (here referenced as spark.read) returns a DataFrameReader which all subsequent calls reference. The DataFrameReader is not callable, so you should not use parentheses here.
Finally, there is no open method in PySpark. The method name is load.
Static notebook | Dynamic notebook: See test 1 (Databricks import instructions)
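The three points above (unquoted variable, quoted format name, no parentheses after read) can be seen together in a small sketch. The helper names and the two-field schema are assumptions made for illustration and do not come from the question.

```python
def read_parquet_with_schema(spark, file_schema, file_path):
    # spark.read is accessed without parentheses; it returns a
    # DataFrameReader on which schema(), format() and load() are chained.
    return spark.read.schema(file_schema).format("parquet").load(file_path)


def read_items(spark, file_path):
    # pyspark is imported lazily; the schema below is a made-up example.
    from pyspark.sql.types import (
        IntegerType, StringType, StructField, StructType,
    )

    file_schema = StructType([
        StructField("itemId", IntegerType(), True),
        StructField("itemName", StringType(), True),
    ])
    # file_schema is passed as a variable, "parquet" as a string literal.
    return read_parquet_with_schema(spark, file_schema, file_path)
```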
Question # 18
The code block displayed below contains multiple errors. The code block should return a DataFrame
that contains only columns transactionId, predError, value and storeId of DataFrame
transactionsDf. Find the errors.
Code block:
transactionsDf.select([col(productId), col(f)])
Sample of transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+
A. The column names should be listed directly as arguments to the operator and not as a list.
B. The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all column names should be expressed as strings without being wrapped in a col() operator.
C. The select operator should be replaced by a drop operator.
D. The column names should be listed directly as arguments to the operator and not as a list and following the pattern of how column names are expressed in the code block, columns productId and f should be replaced by transactionId, predError, value and storeId.
E. The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all col() operators should be removed.
Answer: B
Explanation:
Correct code block:
transactionsDf.drop("productId", "f")
This question requires a lot of thinking to get right. To solve it, you may take advantage of the digital notepad that is provided to you during the test. You have probably seen that the code block includes multiple errors. In the test, you are usually confronted with a code block that contains only a single error. However, since you are practicing here, this challenging multi-error question will make it easier for you to deal with single-error questions in the real exam.
The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all column names should be expressed as strings without being wrapped in a col() operator.
Correct! Here, you need to figure out the many things that are wrong with the initial code block. While the question can be solved by using a select statement, a drop statement, given the answer options, is the correct one. Then, you can read in the documentation that drop does not take a list as an argument, but just the column names that should be dropped. Finally, the column names should be expressed as strings and not as Python variable names as in the original code block.
The column names should be listed directly as arguments to the operator and not as a list.
Incorrect. While this is a good first step and part of the correct solution (see above), this modification is insufficient to solve the question.
The column names should be listed directly as arguments to the operator and not as a list and following the pattern of how column names are expressed in the code block, columns productId and f should be replaced by transactionId, predError, value and storeId.
Wrong. If you use the same pattern as in the original code block (col(productId), col(f)), you are still making a mistake. col(productId) will trigger Python to search for the content of a variable named productId instead of telling Spark to use the column productId. For that, you need to express it as a string.
The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all col() operators should be removed.
No. This still leaves you with Python trying to interpret the column names as Python variables (see above).
The select operator should be replaced by a drop operator.
Wrong, this is not enough to solve the question. If you do this, you will still face problems since you are passing a Python list to drop and the column names are still interpreted as Python variables (see above).
More info: pyspark.sql.DataFrame.drop - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3 (Databricks import instructions)
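As a rough sketch of the two routes mentioned in the explanation, the helpers below (hypothetical names) express the same result once with drop and once with select. The equivalence assumes that transactionsDf has exactly the six columns shown in the sample.

```python
def drop_unwanted_columns(transactions_df):
    # Column names are passed as separate string arguments, not as a list
    # and not as bare Python names.
    return transactions_df.drop("productId", "f")


def select_wanted_columns(transactions_df):
    # Same result for the sample DataFrame, expressed with select().
    return transactions_df.select(
        "transactionId", "predError", "value", "storeId"
    )
```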
Question # 19
Which of the following code blocks returns a new DataFrame in which column attributes of
DataFrame itemsDf is renamed to feature0 and column supplier to feature1?
A. itemsDf.withColumnRenamed(attributes, feature0).withColumnRenamed(supplier, feature1)
B.
itemsDf.withColumnRenamed("attributes", "feature0")
itemsDf.withColumnRenamed("supplier", "feature1")
C. itemsDf.withColumnRenamed(col("attributes"), col("feature0"), col("supplier"), col("feature1"))
D. itemsDf.withColumnRenamed("attributes", "feature0").withColumnRenamed("supplier", "feature1")
E. itemsDf.withColumn("attributes", "feature0").withColumn("supplier", "feature1")
Answer: D
Explanation:
itemsDf.withColumnRenamed("attributes", "feature0").withColumnRenamed("supplier", "feature1")
Correct! Spark's DataFrame.withColumnRenamed syntax makes it relatively easy to change the name of a column.
itemsDf.withColumnRenamed(attributes, feature0).withColumnRenamed(supplier, feature1)
Incorrect. In this code block, the Python interpreter will try to use attributes and the other column names as variables. Needless to say, they are undefined, and as a result the block will not run.
itemsDf.withColumnRenamed(col("attributes"), col("feature0"), col("supplier"), col("feature1"))
Wrong. The DataFrame.withColumnRenamed() operator takes exactly two string arguments. So, in this answer both using col() and using four arguments is wrong.
itemsDf.withColumnRenamed("attributes", "feature0")
itemsDf.withColumnRenamed("supplier", "feature1")
No. In this answer, the returned DataFrame will only have column supplier renamed, since the result of the first line is not written back to itemsDf.
itemsDf.withColumn("attributes", "feature0").withColumn("supplier", "feature1")
Incorrect. While withColumn works for adding and naming new columns, you cannot use it to rename existing columns.
More info: pyspark.sql.DataFrame.withColumnRenamed - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3 (Databricks import instructions)
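The reason chaining matters is that each withColumnRenamed call returns a new DataFrame and leaves the original untouched. A minimal sketch, with rename_feature_columns as a hypothetical helper name:

```python
def rename_feature_columns(items_df):
    # Each call returns a new DataFrame, so the second rename must be
    # applied to the result of the first one.
    return (
        items_df
        .withColumnRenamed("attributes", "feature0")
        .withColumnRenamed("supplier", "feature1")
    )
```

Writing the two renames as separate statements would have the same effect only if the first result were assigned to a variable and the second rename were called on that variable.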
Question # 20
The code block displayed below contains multiple errors. The code block should remove column
transactionDate from DataFrame transactionsDf and add a column transactionTimestamp in which
dates that are expressed as strings in column transactionDate of DataFrame transactionsDf are
converted into unix timestamps. Find the errors.
Sample of DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+----------------+
|transactionId|predError|value|storeId|productId|   f| transactionDate|
+-------------+---------+-----+-------+---------+----+----------------+
|            1|        3|    4|     25|        1|null|2020-04-26 15:35|
|            2|        6|    7|      2|        2|null|2020-04-13 22:01|
|            3|        3| null|     25|        3|null|2020-04-02 10:53|
+-------------+---------+-----+-------+---------+----+----------------+
Code block:
transactionsDf = transactionsDf.drop("transactionDate")
transactionsDf["transactionTimestamp"] = unix_timestamp("transactionDate", "yyyy-MMdd")
A. Column transactionDate should be dropped after transactionTimestamp has been written. The string indicating the date format should be adjusted. The withColumn operator should be used instead of the existing column assignment. Operator to_unixtime() should be used instead of unix_timestamp().
B. Column transactionDate should be dropped after transactionTimestamp has been written. The withColumn operator should be used instead of the existing column assignment. Column transactionDate should be wrapped in a col() operator.
C. Column transactionDate should be wrapped in a col() operator.
D. The string indicating the date format should be adjusted. The withColumnReplaced operator should be used instead of the drop and assign pattern in the code block to replace column transactionDate with the new column transactionTimestamp.
E. Column transactionDate should be dropped after transactionTimestamp has been written. The string indicating the date format should be adjusted. The withColumn operator should be used instead of the existing column assignment.
Answer: E
Explanation:
This question requires a lot of thinking to get right. To solve it, you may take advantage of the digital notepad that is provided to you during the test. You have probably seen that the code block includes multiple errors. In the test, you are usually confronted with a code block that contains only a single error. However, since you are practicing here, this challenging multi-error question will make it easier for you to deal with single-error questions in the real exam.
Column transactionDate should be dropped only after transactionTimestamp has been written. This is because, to generate column transactionTimestamp, Spark needs to read the values from column transactionDate.
Values in column transactionDate in the original transactionsDf DataFrame look like 2020-04-26 15:35. So, to convert those correctly, you would have to pass yyyy-MM-dd HH:mm. In other words: the string indicating the date format should be adjusted.
While you might be tempted to change unix_timestamp() to to_unixtime() (in line with the from_unixtime() operator), this function does not exist in Spark. unix_timestamp() is the correct operator to use here.
Also, there is no DataFrame.withColumnReplaced() operator. A similar operator that exists is DataFrame.withColumnRenamed().
Whether you use col() or not is irrelevant with unix_timestamp(). The command is fine with both.
Finally, you cannot assign a column like transactionsDf["columnName"] = ... in Spark. This is Pandas syntax (Pandas is a popular Python package for data analysis), but it is not supported in Spark. So, you need to use Spark's DataFrame.withColumn() syntax instead.
More info: pyspark.sql.functions.unix_timestamp - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3 (Databricks import instructions)
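Putting the three fixes together, the corrected logic could look like the sketch below. The helper names are hypothetical. parse_sample_date is only a plain-Python analogue (Python's strptime patterns differ from Spark's yyyy-MM-dd HH:mm notation) that shows why the pattern has to cover hours and minutes as well.

```python
from datetime import datetime


def parse_sample_date(text):
    # Plain-Python analogue: the pattern must describe the whole string,
    # including hours and minutes.
    return datetime.strptime(text, "%Y-%m-%d %H:%M")


def add_timestamp_and_drop_date(transactions_df):
    # pyspark is imported lazily so the helper above runs without Spark.
    from pyspark.sql.functions import unix_timestamp

    return (
        transactions_df
        # 1. Write the new column with withColumn and the full date format.
        .withColumn(
            "transactionTimestamp",
            unix_timestamp("transactionDate", "yyyy-MM-dd HH:mm"),
        )
        # 2. Drop the source column only after it has been read.
        .drop("transactionDate")
    )
```

In the Python analogue, a date-only pattern such as "%Y-%m-%d" raises a ValueError for the sample value 2020-04-26 15:35 because the time part is left unparsed.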
Question # 21
The code block displayed below contains an error. The code block should arrange the rows of
DataFrame transactionsDf using information from two columns in an ordered fashion, arranging first
by column value, showing smaller numbers at the top and greater numbers at the bottom, and then by
column predError, for which all values should be arranged in the inverse way of the order of items
in column value. Find the error.
Code block:
transactionsDf.orderBy('value', asc_nulls_first(col('predError')))
A. Two orderBy statements with calls to the individual columns should be chained, instead of having both columns in one orderBy statement.
B. Column value should be wrapped by the col() operator.
C. Column predError should be sorted in a descending way, putting nulls last.
D. Column predError should be sorted by desc_nulls_first() instead.
E. Instead of orderBy, sort should be used.
Answer: C
Explanation:
Correct code block:
transactionsDf.orderBy('value', desc_nulls_last('predError'))
Column predError should be sorted in a descending way, putting nulls last.
Correct! By default, Spark sorts ascending, putting nulls first. So, the inverse of the default sort is indeed desc_nulls_last.
Instead of orderBy, sort should be used.
No. In the DataFrame API, orderBy() is an alias of sort(), so both produce the same ordering. Swapping one for the other does not change the sort direction of column predError.
Column value should be wrapped by the col() operator.
Incorrect. DataFrame.orderBy() accepts both strings and Column objects.
Column predError should be sorted by desc_nulls_first() instead.
Wrong. Since Spark's default sort order matches asc_nulls_first(), nulls would have to come last when inverted.
Two orderBy statements with calls to the individual columns should be chained, instead of having both columns in one orderBy statement.
No, this would just sort the DataFrame by the very last column, but would not take information from both columns into account, as noted in the question.
More info: pyspark.sql.DataFrame.orderBy - PySpark 3.1.2 documentation, pyspark.sql.functions.desc_nulls_last - PySpark 3.1.2 documentation, sort() vs orderBy() in Spark | Towards Data Science
Static notebook | Dynamic notebook: See test 3 (Databricks import instructions)
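To see what "ascending with nulls first, then descending with nulls last" means for concrete rows, the ordering can be modelled in plain Python. This is only an illustrative model with hypothetical function names, where None stands in for null. The second function shows the Spark call from the correct code block.

```python
def sort_like_question(rows):
    # rows: list of (value, predError) tuples; None stands in for null.
    def key(row):
        value, pred_error = row
        return (
            value is not None,                 # value: nulls first
            value if value is not None else 0,  # value: ascending
            pred_error is None,                # predError: nulls last
            -(pred_error or 0),                # predError: descending
        )
    return sorted(rows, key=key)


def order_transactions(transactions_df):
    # pyspark is imported lazily so the model above runs without Spark.
    from pyspark.sql.functions import desc_nulls_last

    return transactions_df.orderBy("value", desc_nulls_last("predError"))
```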
Question # 22
The code block displayed below contains an error. The code block should configure Spark to split data
in 20 parts when exchanging data between executors for joins or aggregations. Find the error.
Code block:
spark.conf.set(spark.sql.shuffle.partitions, 20)
A. The code block uses the wrong command for setting an option.
B. The code block sets the wrong option.
C. The code block expresses the option incorrectly.
D. The code block sets the incorrect number of parts.
E. The code block is missing a parameter.
Answer: C
Explanation:
Correct code block:
spark.conf.set("spark.sql.shuffle.partitions", 20)
The code block expresses the option incorrectly.
Correct! The option should be expressed as a string.
The code block sets the wrong option.
No, spark.sql.shuffle.partitions is the correct option for the use case in the question.
The code block sets the incorrect number of parts.
Wrong, the code block correctly states 20 parts.
The code block uses the wrong command for setting an option.
No, in PySpark spark.conf.set() is the correct command for setting an option.
The code block is missing a parameter.
Incorrect, spark.conf.set() takes two parameters.
More info: Configuration - Spark 3.1.2 Documentation
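A short sketch of setting the option and reading it back, assuming spark is an existing SparkSession. The helper name set_shuffle_partitions is made up for this illustration.

```python
def set_shuffle_partitions(spark, parts=20):
    # The option name is a string literal; without the quotes Python
    # would look for a variable named spark and attributes on it.
    spark.conf.set("spark.sql.shuffle.partitions", parts)
    # Read the value back to confirm what the session now uses.
    return spark.conf.get("spark.sql.shuffle.partitions")
```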
Question # 23
Which of the following code blocks performs an inner join of DataFrames transactionsDf and itemsDf
on columns productId and itemId, respectively, excluding columns value and storeId from
DataFrame transactionsDf and column attributes from DataFrame itemsDf?
A.
transactionsDf.drop('value', 'storeId').join(itemsDf.select('attributes'), transactionsDf.productId==itemsDf.itemId)
B.
transactionsDf.createOrReplaceTempView('transactionsDf')
itemsDf.createOrReplaceTempView('itemsDf')
spark.sql("SELECT -value, -storeId FROM transactionsDf INNER JOIN itemsDf ON productId==itemId").drop("attributes")
C.
transactionsDf.drop("value", "storeId").join(itemsDf.drop("attributes"), "transactionsDf.productId==itemsDf.itemId")
D.
transactionsDf \
  .drop(col('value'), col('storeId')) \
  .join(itemsDf.drop(col('attributes')), col('productId')==col('itemId'))
E.
transactionsDf.createOrReplaceTempView('transactionsDf')
itemsDf.createOrReplaceTempView('itemsDf')
statement = """
SELECT * FROM transactionsDf
INNER JOIN itemsDf
ON transactionsDf.productId==itemsDf.itemId
"""
spark.sql(statement).drop("value", "storeId", "attributes")
Answer: E
Explanation:
This question offers you a wide variety of answers for a seemingly simple task. However, this variety reflects the variety of ways in which one can express a join in PySpark. You need to understand some SQL syntax to get to the correct answer here.
transactionsDf.createOrReplaceTempView('transactionsDf')
itemsDf.createOrReplaceTempView('itemsDf')
statement = """
SELECT * FROM transactionsDf
INNER JOIN itemsDf
ON transactionsDf.productId==itemsDf.itemId
"""
spark.sql(statement).drop("value", "storeId", "attributes")
Correct. This answer uses SQL correctly to perform the inner join and afterwards drops the unwanted columns. This is totally fine. If you are unfamiliar with the triple-quote """ in Python: it allows you to express strings that span multiple lines.
transactionsDf \
  .drop(col('value'), col('storeId')) \
  .join(itemsDf.drop(col('attributes')), col('productId')==col('itemId'))
No, this answer option is a trap, since DataFrame.drop() does not accept multiple Column objects. You could use transactionsDf.drop('value', 'storeId') instead.
transactionsDf.drop("value", "storeId").join(itemsDf.drop("attributes"), "transactionsDf.productId==itemsDf.itemId")
Incorrect. Spark does not evaluate "transactionsDf.productId==itemsDf.itemId" as a valid join expression. This would work if the condition were passed as a Column expression instead of a string.
transactionsDf.drop('value', 'storeId').join(itemsDf.select('attributes'), transactionsDf.productId==itemsDf.itemId)
Wrong, this statement incorrectly uses itemsDf.select instead of itemsDf.drop.
transactionsDf.createOrReplaceTempView('transactionsDf')
itemsDf.createOrReplaceTempView('itemsDf')
spark.sql("SELECT -value, -storeId FROM transactionsDf INNER JOIN itemsDf ON productId==itemId").drop("attributes")
No, here the SQL expression syntax is incorrect. Simply specifying -columnName does not drop a column.
More info: pyspark.sql.DataFrame.join - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3 (Databricks import instructions)
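For comparison with the SQL-based answer, the same join can also be sketched with the DataFrame API, following the hint in the explanation that the join condition works when it is a Column expression rather than a string. The helper name join_without_unwanted_columns is hypothetical.

```python
def join_without_unwanted_columns(transactions_df, items_df):
    # The join condition is built from the two DataFrames as a Column
    # expression, not wrapped in quotes.
    condition = transactions_df.productId == items_df.itemId
    return (
        transactions_df.drop("value", "storeId")
        .join(items_df.drop("attributes"), condition, "inner")
    )
```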
Question # 24
Which of the following code blocks returns a one-column DataFrame for which every row contains an
array of all integer numbers from 0 up to, but not including, the number given in column predError of
DataFrame transactionsDf, and null if predError is null?
Sample of DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+
A.
def count_to_target(target):
    if target is None:
        return
    result = [range(target)]
    return result
count_to_target_udf = udf(count_to_target, ArrayType[IntegerType])
transactionsDf.select(count_to_target_udf(col('predError')))
B.
def count_to_target(target):
    if target is None:
        return
    result = list(range(target))
    return result
transactionsDf.select(count_to_target(col('predError')))
C. (Correct)
def count_to_target(target):
    if target is None:
        return
    result = list(range(target))
    return result
count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))
transactionsDf.select(count_to_target_udf('predError'))
D.
def count_to_target(target):
    result = list(range(target))
    return result
count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))
df = transactionsDf.select(count_to_target_udf('predError'))
E.
def count_to_target(target):
    if target is None:
        return
    result = list(range(target))
    return result
count_to_target_udf = udf(count_to_target)
transactionsDf.select(count_to_target_udf('predError'))
Answer: C
Explanation:
Correct code block:
def count_to_target(target):
    if target is None:
        return
    result = list(range(target))
    return result
count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))
transactionsDf.select(count_to_target_udf('predError'))
Output of correct code block:
+--------------------------+
|count_to_target(predError)|
+--------------------------+
|                 [0, 1, 2]|
|        [0, 1, 2, 3, 4, 5]|
|                 [0, 1, 2]|
|                      null|
|                      null|
|                 [0, 1, 2]|
+--------------------------+
This is not exactly easy. You need to be familiar with the syntax around UDFs (user-defined functions). Specifically, in this question it is important to pass the correct types to the udf method. Returning an array of a specific type rather than just a single type means you need to think harder about type implications than usual.
Remember that in Spark, you always pass types in an instantiated way like ArrayType(IntegerType()), not like ArrayType(IntegerType). The parentheses () are the key here, so make sure you do not forget them.
You should also pay attention that you actually pass the UDF count_to_target_udf, and not the Python method count_to_target, to the select() operator.
Finally, null values are always a tricky case with UDFs. So, take care that the code can handle them correctly.
More info: How to Turn Python Functions into PySpark Functions (UDF) - Chang Hsin Lee - Committing my thoughts to words.
Static notebook | Dynamic notebook: See test 3 (Databricks import instructions)
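Because the UDF wraps an ordinary Python function, the null handling and the array contents can be checked in plain Python first. The sketch below restates the function from the correct answer with an explicit return None. select_count_to_target is a hypothetical wrapper name, and pyspark is imported inside it so that the function itself can be tried without Spark.

```python
def count_to_target(target):
    # Null values arrive in a Python UDF as None, so they are handled
    # before range() is called.
    if target is None:
        return None
    return list(range(target))


def select_count_to_target(transactions_df):
    # pyspark is imported lazily so the function above runs on its own.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, IntegerType

    # Types are passed as instances: ArrayType(IntegerType()).
    count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))
    # The UDF, not the plain Python function, is passed to select().
    return transactions_df.select(count_to_target_udf("predError"))
```

Without the None check, range(None) raises a TypeError in Python, which is the problem with the answer option that omits the check.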
Question # 25
Which of the following code blocks shuffles DataFrame transactionsDf, which has 8 partitions, so that
it has 10 partitions?
A. transactionsDf.repartition(transactionsDf.getNumPartitions()+2)
B. transactionsDf.repartition(transactionsDf.rdd.getNumPartitions()+2)
C. transactionsDf.coalesce(10)
D. transactionsDf.coalesce(transactionsDf.getNumPartitions()+2)
E. transactionsDf.repartition(transactionsDf._partitions+2)
Answer: B
Explanation:
transactionsDf.repartition(transactionsDf.rdd.getNumPartitions()+2)
Correct. The repartition operator is the correct one for increasing the number of partitions. Calling getNumPartitions() on DataFrame.rdd returns the current number of partitions.
transactionsDf.coalesce(10)
No, after this command transactionsDf will continue to have only 8 partitions. This is because coalesce() can only decrease the number of partitions, but not increase it.
transactionsDf.repartition(transactionsDf.getNumPartitions()+2)
Incorrect, there is no getNumPartitions() method for the DataFrame class.
transactionsDf.coalesce(transactionsDf.getNumPartitions()+2)
Wrong, coalesce() can only be used for reducing the number of partitions and there is no getNumPartitions() method for the DataFrame class.
transactionsDf.repartition(transactionsDf._partitions+2)
No, DataFrame has no _partitions attribute. You can find out the current number of partitions of a DataFrame with the DataFrame.rdd.getNumPartitions() method.
More info: pyspark.sql.DataFrame.repartition - PySpark 3.1.2 documentation, pyspark.RDD.getNumPartitions - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3 (Databricks import instructions)
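The correct answer can be wrapped in a small helper to make the two steps explicit: read the current partition count from the underlying RDD, then repartition. The name add_two_partitions is made up for this sketch.

```python
def add_two_partitions(transactions_df):
    # The partition count is read from the underlying RDD, because the
    # DataFrame class itself has no getNumPartitions() method.
    current = transactions_df.rdd.getNumPartitions()
    # repartition() shuffles the data and can increase the count.
    return transactions_df.repartition(current + 2)
```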